
Help the artificial intelligence: watch "Lost"

The next challenge for software professionals is decoding video and images, which will lead to a real revolution

Image analysis. Photo: Juan Huo, University of Edinburgh

Israel Benjamin

On the video sharing site YouTube, more than a billion videos are viewed every day, which works out to more than ten thousand every second. Among other things, the site hosts hundreds of videos in which cats play the piano, interfere with others playing the piano, "comment" on videos in which other cats and pianos appear, or just move to the sounds of music... An entire industry seems to have grown up around cat videos in general and videos of piano-playing cats in particular, and these videos have a large audience: the video of Nora, the piano-playing cat, for example, has been viewed close to twenty million times. Tens of millions of users, then, upload and watch videos on this site alone, and of course there are many other sites dedicated to video content.

How do surfers get to such videos? One way is to search for a video by typing a query into a search engine, and indeed YouTube's search engine is second only to Google's in the number of search queries it receives (close to four billion queries in October 2009). But the search engine works very differently from the way a person remembers videos they have seen.

A person will remember what happened in the video; what people, animals, and objects appeared; what the atmosphere was (for example, humorous or dramatic); and many other details. The search engine does not refer to the video content at all, but only to the text attached to the video, such as the title and the description provided by whoever uploaded it to the site. Therefore, the cat searcher (and as mentioned, it turns out there are many of them) will sometimes encounter irrelevant results, such as those related to the musician Cat Stevens, while relevant videos may not appear in the results at all.
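To make the limitation concrete, here is a toy sketch of metadata-only search (the catalog is invented, and this is of course not YouTube's actual engine): the query "cat" matches a Cat Stevens concert but misses a kitten video, because the frames themselves are never examined.

```python
# Toy text-only search over titles and descriptions (invented catalog).
videos = [
    {"title": "Nora the piano-playing cat", "description": "a cat plays the piano"},
    {"title": "Cat Stevens - Father and Son", "description": "live concert recording"},
    {"title": "Funny kitten on keyboard", "description": "my kitten walks on a piano"},
]

def search(query, catalog):
    """Return titles whose attached text contains the query (case-insensitive)."""
    q = query.lower()
    return [v["title"] for v in catalog
            if q in (v["title"] + " " + v["description"]).lower()]

print(search("cat", videos))
# ['Nora the piano-playing cat', 'Cat Stevens - Father and Son']
# The kitten video is relevant but missed; the concert is irrelevant but matched.
```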

This is the result of a well-known limitation: there is no substitute for the ability of global computer networks to store and share visual content, but that content is opaque from the computer's point of view; the computer can store and transmit it, but not "understand" it. "Understanding", in this context, is a relatively modest requirement: creating a verbal description that identifies the objects appearing in the video, or finding visual content that matches a given verbal description.

This situation is about to change. Much effort is being invested in understanding video and images, and it is already yielding promising results. These advances promise a big change, and not only in improving the experience of the user looking for funny cat videos: as we will see below, video decoding will lead to a revolution that may be bigger and more significant than the one created by the ability to search text on the Internet.

Computer vision competitions

The large community of researchers dealing with visual information analysis has many channels for meeting and collaboration, including several competitions. Let us mention two of them. One of the challenges in the Pascal VOC competition (link at the end of the article) requires the competing programs to answer questions like "Is there a horse in the picture?" and, if the answer is positive, to locate the identified object; in this example, the program must draw a rectangle containing the horse.

Among the many categories of objects the software must identify are "bottle", "motorcycle", "cow", "dining table", "sofa", "potted plant" and "television". In the TRECVID competition (link at the end of the article), which is managed by the American National Institute of Standards and Technology (NIST), there is a similar object recognition challenge whose categories include, among others, "traffic intersection", "person playing a musical instrument", "person playing soccer", and "classroom".

Another challenge is to search, within hundreds of hours of video captured by security cameras, for segments that match definitions such as "a person running", "people hugging", and "a person standing next to an elevator whose doors open, but who does not enter". The competition also includes a challenge of searching video clips for topics such as "something burning", "a hand drawing or writing", and "a road seen through the windshield of a moving vehicle", as well as a copy-detection challenge: finding video clips that were probably copied from another source.

These and similar competitions have been held annually for several years now; each year the results in the recurring challenges improve, and each year more difficult challenges are added. The competitions, and especially the publication of academic articles describing the methods and tools used by the competitors, contribute greatly to progress in the field.

Independent learning

To decode video clips, developers use a large pool of tools, most of which come from the field of computer vision: finding the contours of parts of the image, connecting contours into objects, separating objects from the background, analyzing the textures (colors and patterns) of different parts of the image, identifying three-dimensional cues (for example, when one object hides part of another), and many other techniques. To these are added tools for analyzing the sounds in the video segment, partly in order to identify words spoken in it.
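As an illustration, here is a minimal sketch of two of these building blocks using the OpenCV library's Python bindings; the file names are placeholders, and real pipelines combine many more steps than this.

```python
import cv2

# Contours: edge detection followed by contour extraction on a single frame
# (the two-value return of findContours is the OpenCV 4 API).
frame = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(frame, threshold1=100, threshold2=200)
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
print(f"found {len(contours)} candidate object outlines")

# Background separation: a running statistical model of the scene marks
# pixels that change (i.e., the moving objects) as foreground.
subtractor = cv2.createBackgroundSubtractorMOG2()
video = cv2.VideoCapture("clip.mp4")
while True:
    ok, frame = video.read()
    if not ok:
        break
    foreground_mask = subtractor.apply(frame)  # white pixels = moving objects
video.release()
```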

It became clear long ago that such programs must have learning as a central element. It is difficult, for example, for software developers to define mathematically what a horse looks like, but it is possible to program the computer to find such a definition by itself: if we feed the computer enough images of horses (and, of course, enough images that do not include a horse) and build suitable learning software, we can expect the computer to learn by itself what a horse is. During the process, the software performs a large number of numerical calculations on each image, creating a collection of numbers that characterize different parts of the image.

These numerical characteristics may include the position and relative size of sub-parts of the object identified in the image, the texture of those sub-parts, the movement of the object as a whole, and the movement of its parts relative to one another. Here the learning software comes into play, looking for characteristics that are common to most horse images and absent from most images in which no horse appears.
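Schematically, the learning step can be pictured like this. This is a sketch only: the feature vectors below are random stand-ins for the real texture and shape measurements, and competitors use a variety of classifiers, not necessarily the one shown.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-ins for the numerical characterizations of 100 horse images and
# 100 non-horse images, 50 numbers per image.
horse_features = rng.normal(loc=1.0, size=(100, 50))
other_features = rng.normal(loc=0.0, size=(100, 50))

X = np.vstack([horse_features, other_features])
y = np.array([1] * 100 + [0] * 100)        # 1 = horse, 0 = not horse

classifier = SVC(kernel="rbf").fit(X, y)   # search for the separating pattern
print(classifier.predict(other_features[:3]))   # expected: [0 0 0]
```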

This description is, of course, too simplistic, and it ignores many problems: the correct choice of numerical characteristics; dealing with different viewing angles of the same object (a horse looks different from the front than from the side, for example); different sizes (a horse that fills the entire image versus one that occupies only a small part of it); objects that are hidden or cut off by the borders of the image (for example, an image in which only the horse's head is visible); differences within a category (how will the computer understand that a Rottweiler and a Chihuahua both belong to the category "dog"?); and more.

The need for learning is not surprising: even humans learn to recognize most objects from many examples rather than from innate instinct (we may have an innate ability to recognize "snake-like objects", but we certainly have no innate ability to recognize cars). Competition organizers recognize this and provide a large collection of sample photos and videos. Usually, some of the images are fed into the learning process, while the rest are used to test the software's performance after learning.
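The evaluation protocol can be sketched in the same spirit (again with synthetic stand-in data; the point is only the split between the part used for learning and the held-out part used for testing):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))             # stand-in feature vectors
y = (X[:, 0] > 0).astype(int)              # stand-in labels

# 70% of the labeled collection is fed to learning; the remaining 30% is
# kept aside to measure performance on images the software has never seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)
model = SVC().fit(X_train, y_train)        # learn only from the training part
print("accuracy on held-out images:", model.score(X_test, y_test))
```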

Sources of "annotated" videos

The learning process creates demand for large collections of images and videos with "annotations" attached: just as a toddler learns to identify a car from many occasions on which someone shows him cars and says "here's a car", so the learning software needs labels for the objects and actions in the images and videos. Such repositories, used for learning and for competitions like those mentioned above, already exist, but the large and active research community is "hungry" for more and more tagged images and videos.

To satisfy the hunger of the learning programs, Prof. Ben Taskar of the University of Pennsylvania decided to teach computers to watch popular TV series (see the links at the end of the article for a report and a technical article). One of the sources he used included about a hundred episodes of the TV series "Lost" and "CSI".

These series have many fans, some of whom spend a lot of time uploading scripts and subtitles to the Internet. The work of these fans makes it possible to identify who is speaking at each moment of an episode: for example, combining text from a "Lost" script with the subtitles can provide the information that at a certain second Kate asks "So what's stopping you?" and Jack replies: "We're not savages, Kate. Not yet." From this it is reasonable to assume that the faces seen in the seconds around the appearance of these captions include the faces of Jack and Kate.

A long series of algorithmic analyses, including focusing on faces and identifying moments when the lips move, leads to high accuracy in identifying the speakers: when the computer is required to identify only the eight most common characters, it errs only 6% of the time; when the requirement is to identify 32 characters, the error rate is 13%.
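The first alignment step can be pictured with a toy sketch. Subtitles give *when* each line is spoken but not *who* speaks it; the fan-made script gives *who* but not *when*. Matching the dialogue text transfers speaker names onto timestamps, which then label the faces visible around those moments. The timestamps here are invented, and real systems match the texts approximately rather than exactly.

```python
script = [
    ("KATE", "So what's stopping you?"),
    ("JACK", "We're not savages, Kate. Not yet."),
]
subtitles = [
    (125.3, 126.8, "So what's stopping you?"),
    (127.0, 129.5, "We're not savages, Kate. Not yet."),
]

def align(script, subtitles):
    """Attach a speaker name to each subtitle by matching the spoken text."""
    labeled = []
    for start, end, text in subtitles:
        for speaker, line in script:
            if line == text:          # real systems use fuzzy matching
                labeled.append((start, end, speaker, text))
    return labeled

for start, end, speaker, text in align(script, subtitles):
    print(f"{start:6.1f}-{end:6.1f}s  {speaker}: {text}")
# Faces detected in these intervals become training examples for "Kate"/"Jack".
```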

And what's next?

Imagine a future in which every video segment on the Internet is viewed and analyzed by a new generation of search engines: video search engines, combined with the huge video databases of sites like YouTube and with the video cameras that already cover large parts of the public space. Such searches could serve scientists who place cameras in forests to look for rare species, but in the same way a community of fans could receive a message every time a celebrity is caught on camera.

Concerned parents will be able to check whether their missing child was caught on a security camera, and the police will be able to receive an automatic alert whenever a camera "sees" violent activity. Anthropologists and sociologists will receive innovative quantitative and qualitative research tools (for example, cross-cultural comparisons of the distance between speakers, or of responses to an unusual event), and medical teams will be called immediately to the place where a person has fallen or an accident has occurred.

On the other hand, totalitarian regimes will be able to identify spontaneous demonstrations or other activities suspected of subverting the regime, and instruct the software to track the participants and expand the circle of suspects by adding anyone who meets the participants in the suspicious activities. This list is just a small taste of what the future may hold.

Judging by the invasion of privacy that has already become an inevitable result of existing search engines, and by the rate of progress in video analysis, it seems that this future, with its promising and threatening sides, is already close. Let us at least take comfort in the fact that finding a funny cat video will be even easier than it is now...

Israel Binyamini works at ClickSoftware developing advanced optimization methods.

Links

11 comments

  1. Google has technology that lets you photograph a book, identifies which book it is, and suggests where to buy it online.
    There is also a company that lets you photograph people and find out who they are on Facebook.

    The future sounds scary

  2. To Eyal:

    The level of accuracy of computer vision algorithms depends on many factors.
    These factors include the number of objects to be detected (an algorithm will be more successful at distinguishing between 2 objects than between 20), the quality of the image, the amount of information (other objects) in the image, and its noise level.
    Another factor is the "amount of learning" the algorithm requires before it can identify anything: the more examples the algorithm is shown in the learning phase, the more successful it will be in the detection phase.
    The commercial success of computer vision algorithms so far has been in targeted tasks (such as face recognition, lane or vehicle detection, and text recognition), and the path to the ability to see like a two-year-old child is still a distant dream.

    The technologies in the field of machine learning are diverse. Neural networks are one approach, but as far as I know there was a big "hype" around them in past years and the situation has since changed somewhat. The technologies I know are Support Vector Machines, Bayesian networks, and other statistical methods such as AdaBoost.

    If you are a software person: there is a computer vision library written in C++ called OpenCV, and many algorithms are implemented in it, among them ones that enable the identification of objects in an image.
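    For example, a minimal sketch using OpenCV's Python bindings, with one of the pre-trained Haar cascade face detectors that ship with recent OpenCV packages (the image file name is a placeholder):

    ```python
    import cv2

    # Pre-trained frontal-face Haar cascade bundled with the opencv-python package.
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    detector = cv2.CascadeClassifier(cascade_path)

    image = cv2.imread("photo.jpg")                 # placeholder file name
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # cascades work on grayscale
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:                      # one rectangle per detected face
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imwrite("photo_with_faces.jpg", image)
    ```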

  3. Thanks Michal, I went through the links given below the article, and it seems there are already software programs that can identify dogs, cats, and other elements in an image very well. Listen, these things are simply amazing: abilities that until recently were considered exclusive to humans, and suddenly software can do it. Day by day I appreciate Ray Kurzweil's predictions more and more; they seem more and more realistic as time goes by.

  4. Eyal:
    The issue is still an open research question, and the current article attests to that. The competitions mentioned in it are intended to encourage people to develop and share knowledge in the field.
    The link I provided also mentions that an online demo for use by the general public is still under development.

  5. This is not true.
    It's kind of like arguing that a baby (who has natural intelligence) is made up of many experts.
    In any case, this is not relevant to Eyal's question: he did not ask about artificial intelligence in general, but about a very specific area of image recognition (which, on the one hand, is not done with expert systems at all, and which, on the other hand, already has impressive achievements).

  6. To Tzvi,

    The problem of artificial intelligence lies in the inability to build computers with the kind of parallel processing the human brain is capable of.

    Everything developed to date is "expert software", not real artificial intelligence.

    To write real artificial intelligence software, it is necessary to unite a large number of expert programs into one system that can use the data of all the expert programs.

  7. Thanks, it looks amazing! Are there programs that can distinguish between a dog and a cat that I can download to my home computer and test on random images from Google? I would really like to try it.

  8. A question for Israel Benjamin: are there already software programs that can distinguish between a photo of a dog and a photo of a cat?

    I know that those who like to criticize the field of artificial intelligence, and to show that it is floundering in place and not progressing in any direction, always like to sting and say that these programs cannot even distinguish between a picture of a dog and a picture of a cat, a task that any 2-3 year old child can easily perform.

    So do such programs already exist? What is their level of accuracy in identification in percentages?

    And are they based on neural networks, or on other methods?

    Thank you, I would very much appreciate an answer.
