
Artificial intelligence - computers that think on their own / Yaser S. Abu-Mostafa

New techniques that teach computers how to learn are beating the experts

The inside of a home computer, courtesy of Wikipedia

A few years ago, managers of a women's clothing company approached me to help them develop better fashion recommendations for their customers. No sane person would ask me for personal advice on a subject about which I know next to nothing; after all, I am a male computer scientist. But the advice they wanted was not personal: they wanted help in the field of computational learning (better known as machine learning), and to that I agreed. Based on sales data and customer surveys alone, I was able to recommend fashion items I had never seen to women I had never met. My recommendations were better than those of professional fashion designers - and I still know hardly anything about women's fashion.

Computational learning is a branch of computer science that allows computers to learn from experience, and it is used in many fields. Machine learning makes online search results more relevant, blood tests more accurate, and dating services more likely to find you a match. The simplest algorithms in the field take an existing pool of data, scan it for patterns, and use those patterns to make predictions about the future. Over the last decade, however, progress has transformed the field from end to end: machine learning techniques have made computers "smarter" than humans at many tasks. Take, for example, Watson, the IBM computer system that used machine learning to defeat the world's best Jeopardy players.

The most important competition in the field, however, involved no talking, Jeopardy-playing computers. A few years ago, the online movie rental company Netflix wanted to help its customers find movies they would like - especially not the new, in-demand titles, but the forgotten movies in its catalog. The company already had a recommendation system of its own, but its managers knew it was far from perfect, so they announced a competition to improve it. The rules were simple: the first system to beat the existing system's performance by 10 percent would win a prize of one million dollars. Tens of thousands of people from around the world registered to participate.

For a researcher in the field of computational learning, such a competition is a dream opportunity (and not only because of the monetary prize, as attractive as it may be). The most essential component of any machine learning system is the data, and Netflix's competition offered 100 million pieces of real-world data, ready for download.

Training days

Netflix's competition lasted almost three years. Many groups tackled the problem by breaking the films down into long arrays of different characteristics. For example, you can rate any movie on many scales: how funny it is, how complex its plot is, or how attractive the actors are. Then you look at the movies a given viewer has rated and see how much each scale matters to him: how much he enjoys comedies, whether he prefers simple or complex plots, and how much he likes looking at beautiful movie stars.

Now prediction becomes a simple matter of matching the viewer's taste to the characteristics of the films. If the viewer likes comedies and complicated plots, films like "Some Like It Hot" or "A Fish Called Wanda" are likely to appeal to him. The algorithm matches dozens of such characteristics, so the final recommendation should predict quite well how much the viewer will like a given movie.

We tend to think of features in easy-to-recognize terms such as "comedy" or "convoluted plot", but algorithms make no such distinctions. In fact, the entire process is automated, and the researchers never bother to analyze the content of any film themselves. The machine learning algorithm starts with random, nameless features. As it accumulates rating data from past viewers, it fine-tunes those features until they match the way viewers actually rate movies.

For example, if people who like movie A also tend to like movies B, C, and D, the algorithm will identify a new characteristic that all four films have in common. This happens during what is known as the "training phase", in which the computer reviews millions of viewer ratings. The goal at this stage is to arrive at a collection of objective characteristics based on actual ratings rather than on subjective analysis.
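
To make the idea concrete, here is a minimal sketch in Python of this kind of training. The tiny ratings table, the choice of two hidden features, and the learning rate are all invented for illustration; the real competition entries worked with 100 million ratings and far more sophisticated models.

```python
import numpy as np

# Toy ratings matrix: rows are viewers, columns are movies, 0 means "not rated".
# All numbers here are invented for illustration.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

n_viewers, n_movies = ratings.shape
n_features = 2                      # a couple of hidden, nameless features
rng = np.random.default_rng(0)

# Start with random, nameless features for every viewer and every movie.
viewer_features = rng.normal(scale=0.1, size=(n_viewers, n_features))
movie_features = rng.normal(scale=0.1, size=(n_movies, n_features))

learning_rate = 0.01
for _ in range(5000):               # the "training phase": sweep over known ratings
    for i in range(n_viewers):
        for j in range(n_movies):
            if ratings[i, j] == 0:
                continue            # skip movies this viewer never rated
            predicted = viewer_features[i] @ movie_features[j]
            error = ratings[i, j] - predicted
            # Nudge both feature vectors to shrink the prediction error.
            old_viewer = viewer_features[i].copy()
            viewer_features[i] += learning_rate * error * movie_features[j]
            movie_features[j] += learning_rate * error * old_viewer

# Predict how viewer 0 would rate movie 2, which they never saw.
print(round(float(viewer_features[0] @ movie_features[2]), 2))
```

Notice that the two features are never named: they end up being whatever numbers best explain the known ratings.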

We often have difficulty deciphering the characteristics that a machine learning algorithm produces. They are not always as obvious as "comic content". In fact, they may be very subtle, even completely incomprehensible, since the algorithm is only trying to find the best way to predict how a viewer will rate a movie, not to explain to us how it does so. If a system works well, we do not insist on understanding how it works.

This is different from the way the world usually works. Early in my career, I developed a credit approval system for a bank. When I finished, the bank asked me to explain the meaning of each characteristic. The request had nothing to do with the performance of the system, which was satisfactory; the reason was legal. A bank cannot deny a person credit without providing a logical reason, and you cannot just send someone a letter saying that their application was rejected because X is less than 0.5.

Different machine learning systems develop unique sets of characteristics. In the final weeks of the Netflix competition, groups that had been working independently began combining their algorithms using methods known as "blending" or aggregation techniques. In the last hours of the three-year competition, two teams were still fighting for the prize. The scoreboard showed a small advantage for The Ensemble, a team that included a Ph.D. from my research group at Caltech, over BellKor's Pragmatic Chaos, but the final calculation put the teams in a statistical tie: each achieved a 10.06 percent improvement over the original algorithm. According to the competition rules, in case of a tie the prize goes to the team that submitted its solution earlier. At the end of three years of struggle, in the final hour of the battle, BellKor's Pragmatic Chaos submitted its solution twenty minutes before The Ensemble. A delay of only twenty minutes, in a competition that lasted three years, cost a prize of one million dollars.
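
Blending itself can be very simple. The sketch below assumes three hypothetical models whose predictions for the same held-out (viewer, movie) pairs are already available, and fits the blending weights by least squares; all the numbers are invented.

```python
import numpy as np

# Hypothetical predicted ratings for the same five (viewer, movie) pairs
# from three independently trained models.
model_a = np.array([3.8, 2.1, 4.5, 3.0, 1.9])
model_b = np.array([4.1, 2.4, 4.2, 3.3, 2.2])
model_c = np.array([3.6, 2.0, 4.8, 2.8, 2.0])
true_ratings = np.array([4.0, 2.0, 5.0, 3.0, 2.0])   # held-out ground truth

predictions = np.column_stack([model_a, model_b, model_c])

# Fit blending weights by least squares against the held-out ratings.
weights, *_ = np.linalg.lstsq(predictions, true_ratings, rcond=None)
blended = predictions @ weights

def rmse(pred):
    return np.sqrt(np.mean((pred - true_ratings) ** 2))

print("model A alone:", round(rmse(model_a), 3), "blend:", round(rmse(blended), 3))
```

The blend usually beats any single model because the models make different mistakes, which is exactly why the competing teams merged their systems.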

The perfect match

The type of computational learning used in the movie rating competition is called "supervised learning". This type is also used in tasks such as medical diagnosis. For example, we can feed the computer thousands of images of white blood cells from patients' medical files, along with information indicating, for each image, whether the cells in it are cancerous or not. From this information the algorithm learns to identify malignant cells by certain characteristics of the cell - such as shape, size, and shade. In this case the researchers "supervise" the learning process by giving the computer the correct answer for each and every image in the training data.
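
A minimal sketch of supervised learning on such data, assuming scikit-learn is installed; the three features (shape irregularity, size, shade), their values, and the labels are invented stand-ins for real measurements from medical files.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: each row is one cell image reduced to three
# features (shape irregularity, size in microns, shade), and the label says
# whether a pathologist marked the cell as malignant (1) or not (0).
X_train = np.array([
    [0.9, 14.0, 0.8],
    [0.8, 13.5, 0.7],
    [0.7, 15.0, 0.9],
    [0.2,  7.0, 0.3],
    [0.3,  8.0, 0.2],
    [0.1,  6.5, 0.4],
])
y_train = np.array([1, 1, 1, 0, 0, 0])   # the "supervision": correct answers

model = LogisticRegression().fit(X_train, y_train)

# Predict for a new, unlabeled cell.
new_cell = np.array([[0.75, 13.0, 0.85]])
print(model.predict(new_cell))            # 1 means flagged as likely malignant
```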

Supervised learning is the most common type of machine learning, but not the only one. Roboticists, for example, don't necessarily know the best way to make a bipedal robot walk. In that case, they can design an algorithm that experiments with a variety of different gaits. If a particular gait causes the robot to fall, the algorithm learns not to use it again.

This is the "reinforcement learning" approach. It is based on trial and error, a learning strategy we are all familiar with. In a typical reinforcement learning scenario, human or machine, we are in a situation that requires some action. Instead of someone telling us what to do, we try something and see what happens. Based on the result, we strengthen the good actions and avoid harmful actions in the future. In the end, both we and the machines learn the appropriate actions for the different situations.

Take, for example, Internet search engines. The founders of Google did not go over the entire web of 1997 to teach their computers to recognize pages that deal with, say, "Dolly the Sheep". Instead, their algorithms crawled the web and generated an initial draft of results, and then relied on user click data to rank the various pages by relevance. When a user clicks on a link in the search results, the machine learning algorithm learns that the link is relevant; if users ignore a link that appears at the top of the results, the algorithm concludes that the page is not relevant. The algorithm incorporates feedback from millions of users to adjust its evaluation of pages in future searches.
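
The sketch below is a toy illustration of that feedback loop, not Google's actual ranking method: each click nudges a page's relevance score up, and being shown without a click nudges it slightly down. The page names, scores, and step sizes are all invented.

```python
# Toy relevance scores for pages matching one query.
scores = {"page_a": 1.0, "page_b": 1.0, "page_c": 1.0}

def record_search(shown_pages, clicked_page, step=0.05):
    for page in shown_pages:
        if page == clicked_page:
            scores[page] += step          # users found it relevant
        else:
            scores[page] -= step * 0.2    # shown near the top but ignored

# Simulated feedback from many users who mostly click page_b.
for _ in range(100):
    record_search(["page_a", "page_b", "page_c"], clicked_page="page_b")

print(sorted(scores, key=scores.get, reverse=True))   # page_b now ranks first
```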

Problems of redundant information

Researchers often use reinforcement learning in tasks that require a sequence of actions, such as games. Tic-tac-toe is a simple example. The computer can start playing by randomly marking an X in a corner. This is a strong opening, and if the computer opens with it, it will win more often than if it marks the X on a side square. The action that led to victory - marking an X in the corner - is reinforced. Researchers continue this process to deduce the appropriate action at each later stage of the game, and for every game, from checkers to chess. Reinforcement learning is also used in advanced economic applications, for example to find a Nash equilibrium.
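
A small sketch of that idea for tic-tac-toe: the program tries every opening square, plays out the rest of each game at random, and reinforces openings that tend to end in a win. A real reinforcement learner would also learn values for every later position, not just the opening.

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),      # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),      # columns
         (0, 4, 8), (2, 4, 6)]                 # diagonals

def winner(board):
    for a, b, c in LINES:
        if board[a] and board[a] == board[b] == board[c]:
            return board[a]
    return None

def play_random_game(opening):
    """X opens at `opening`; both sides then play random legal moves."""
    board = [None] * 9
    board[opening] = "X"
    player = "O"
    while winner(board) is None and None in board:
        move = random.choice([i for i, cell in enumerate(board) if cell is None])
        board[move] = player
        player = "X" if player == "O" else "O"
    return winner(board)

random.seed(0)
value = {square: 0.0 for square in range(9)}   # estimated worth of each opening
plays = {square: 0 for square in range(9)}

for game in range(20000):
    opening = random.randrange(9)              # try every opening
    result = play_random_game(opening)
    reward = 1.0 if result == "X" else (0.0 if result is None else -1.0)
    plays[opening] += 1
    value[opening] += (reward - value[opening]) / plays[opening]   # reinforce

best = max(value, key=value.get)
print(best, round(value[best], 2))   # center and corner openings beat the sides
```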

Sometimes even reinforcement learning isn't practical, because in some cases we cannot get feedback on our actions. In such situations we must switch to "unsupervised learning". Here the researcher has a collection of data but no information that dictates what action to take, neither explicitly, as in supervised learning, nor indirectly, as in reinforcement learning. How, then, can we learn anything from the data? The first step in extracting meaning from it is to group the data by similarity, a process called clustering. Clustering takes unlabeled data, about which there is no additional information, and deduces information about its internal structure. It gives us a better understanding of the data before we consider what action to take. Sometimes the grouping itself is enough: if we are organizing a library, sorting the books into similar categories is all that is needed. In other cases we can go on and apply supervised learning to the clustered data.
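
Clustering can be sketched in a few lines. The example below runs a basic k-means loop on two invented groups of points, with no labels anywhere; the algorithm discovers the two groups on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
# Unlabeled data: two loose groups of points, with no labels attached.
data = np.vstack([rng.normal([0, 0], 0.5, size=(50, 2)),
                  rng.normal([4, 4], 0.5, size=(50, 2))])

k = 2
centers = data[rng.choice(len(data), size=k, replace=False)]   # random start

for _ in range(20):
    # Assign each point to its nearest center, then move the centers.
    distances = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    centers = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])

print(np.round(centers, 1))   # the two cluster centers the algorithm found
```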

Ironically, the biggest trap machine learning practitioners fall into is using too much computing power to solve a problem. The difference between amateurs and professionals is the ability to recognize this mistake and deal with it properly.

How can adding power be harmful? Machine learning algorithms try to find patterns in data. If the algorithm is too powerful - for example, if it uses a model that is too sophisticated for a limited sample - it may deceive itself and "discover" random patterns that arise from coincidences in the sample and do not reflect real relationships. A significant part of the mathematical theory of computational learning focuses on this problem of "overfitting" the data. We want to discover real relationships supported by the data, not to overreach and find patterns that cannot be trusted.
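
Here is a small demonstration of the trap. The true relationship in the invented data below is a straight line plus noise; a too-powerful polynomial model fits the eight training points almost perfectly, but typically does worse on fresh data from the same source than the simple model does.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_line(x):
    # The "real relationship" is a straight line plus a little noise.
    return 2 * x + rng.normal(scale=0.3, size=x.shape)

x_train = rng.uniform(-1, 1, size=8)       # a limited sample
y_train = noisy_line(x_train)
x_test = rng.uniform(-1, 1, size=200)      # fresh data from the same source
y_test = noisy_line(x_test)

for degree in (1, 6):                      # simple model vs. overly powerful model
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: training error {train_err:.3f}, test error {test_err:.3f}")
```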

Credit: http://work.caltech.edu/telecourse.html

To understand how this can happen, imagine a bettor at a roulette table (for simplicity, say it has only red and black numbers, no 0 or 00). She watches ten spins in a row alternate between red and black. "The roulette wheel must be rigged," she thinks. "The results always come red-black-red-black." She has built a model in her mind that the limited collection of data seems to confirm. But on the eleventh spin, after she bets one hundred dollars on red, the random nature of roulette reasserts itself: the ball stops on black for the second time in a row, and she loses the bet.

She was looking for a pattern where there was none. Statistically, the chance that ten consecutive spins will alternate between red and black is about one in 500, and past results have no effect on the wheel: the chance of red on the next spin is always 50 percent. There is a well-known saying in machine learning: if you torture the data long enough, it will confess to anything.
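
The "one in 500" figure is easy to check:

```python
# Chance that ten spins in a row alternate between red and black:
# the first spin can be either color, and each of the next nine spins
# must differ from the one before it, with probability 1/2 each time.
p = (1 / 2) ** 9
print(p, f"about 1 in {round(1 / p)}")   # 0.001953125, about 1 in 512
```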

To avoid this problem, machine learning algorithms are biased toward keeping models as simple as possible, using a technique known as regularization. The more complex the model, the more prone it is to overfitting; regularization helps keep its complexity at a reasonable level.
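
As a sketch of the idea, the snippet below fits the same too-powerful polynomial model as before, once without regularization and once with a ridge penalty that charges the model for large coefficients; the penalty strength (lambda) is chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.uniform(-1, 1, size=8)
y_train = 2 * x_train + rng.normal(scale=0.3, size=8)

degree = 6
X = np.vander(x_train, degree + 1)          # polynomial features: a flexible model

for lam in (0.0, 1.0):                      # lam = 0 means no regularization
    # Ridge regression: the penalty lam * I discourages large, wiggly coefficients.
    coeffs = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y_train)
    print(f"lambda = {lam}: largest coefficient {np.abs(coeffs).max():.2f}")
```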

Typically, researchers validate the algorithm on data that was not included in the training data. That way they can make sure the measured performance is genuine and not a false result of the training data. In the Netflix competition, for example, the judging was not based on the data provided to the participants but on a new data set that only the Netflix people knew.
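
In code, validation can be as simple as setting part of the data aside before training; the data below is invented, and the 80/20 split is an arbitrary but common choice.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=100)
y = 2 * x + rng.normal(scale=0.3, size=100)

# Hold out 20% of the data; the model never sees it during training.
indices = rng.permutation(len(x))
train_idx, val_idx = indices[:80], indices[80:]

coeffs = np.polyfit(x[train_idx], y[train_idx], deg=1)     # train
val_error = np.mean((np.polyval(coeffs, x[val_idx]) - y[val_idx]) ** 2)
print(f"validation error: {val_error:.3f}")   # an honest estimate of performance
```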

To predict the future

Those who work in computational learning never have a dull moment. You never know what application you will work on next. Machine learning allows people without expertise in a particular field (a computer scientist advising on women's fashion, for example) to find correlations and make predictions based on the data alone. As a result, there is enormous interest in the field. In the spring of 2012, the machine learning course I teach at the California Institute of Technology attracted students from 15 different fields. For the first time, I also published the course materials online and broadcast the lectures live. Thousands of people from all over the world watched them and completed the assignments (you can too: see the link below).

Nevertheless, machine learning is suitable only for problems for which there is enough data. Whenever I am presented with a potential machine learning project, my first question is a simple one: what data do you have? Machine learning does not create information; it extracts it from data. If the training data does not contain enough of the necessary information, machine learning will not succeed.

However, more and more data is accumulated in countless fields, and the value of machine learning will continue to grow. Trust me - predictions are my specialty.

__________________________________________________________________________________________________________________________________________________________________

About the author

Yaser S. Abu-Mostafa is a professor of electrical engineering and computer science at the California Institute of Technology.

More on the subject

Machines That Learn from Hints. Yaser S. Abu-Mostafa in Scientific American magazine, Vol. 272, no. 4, pages 64-69; April 1995.

Recommend a Movie, Win a Million Bucks. Joseph Sill in Engineering Science, Vol. 73, no. 2, pages 32–39; Spring 2010.

Learning from Data. Yaser S. Abu-Mostafa, Malik Magdon-Ismail and Hsuan-Tien Lin. AMLbook, 2012. http://amlbook.com

Learning from Data (online course): http://work.caltech.edu/telecourse.html


Comments

  1. The first learning systems were fire control systems: the computer "guessed" the firing data that would result in a hit and received feedback on how far the shot landed from the target. It learned from every successful hit.
