
A computer that reads lips

Scientists around the world are developing computer programs that read lips - as in "2001: A Space Odyssey" - with a decoding ability even better than that of a human lip reader

Israel Benjamin, "Galileo" magazine

In the movie "2001: A Space Odyssey" there is a scene in which the astronauts suspect that HAL, the computer controlling the ship, is endangering their lives. The computer can see and hear them almost everywhere on the ship, but they find a spot where only the cameras are working, and there they discuss the situation. What they do not know is that the cameras allow the computer to read their lips. Soon afterward, one of the astronauts is killed in a seemingly freak accident, and the other must dismantle HAL in order to survive.

Could a lip-reading computer really be a threat? It turns out that, at least for some people, the answer is yes. In England, the Home Office's Scientific Development Branch has begun a three-year research program with the University of East Anglia and the University of Surrey, which aims to develop machines that automatically convert video footage of people speaking into a text transcript of the words spoken. The Home Office's goal in this collaboration is to examine whether the technology can be used to fight crime.

The police sometimes have footage of suspects shot from a distance, in which it is impossible to hear what was said. They occasionally call on expert lip readers, but defense attorneys have managed in several cases to cast doubt on those experts' abilities and to diminish the weight of their testimony. The Home Office hopes that lip-reading technology will be more reliable and will be perceived as more objective, thus increasing the number of convictions and keeping criminals away from law-abiding citizens. Of course, it is just as easy to describe threatening scenarios in which governments listen in on conversations anywhere, using the cameras that already cover large public areas.

Everyone is lip-reading

Lip reading may serve not only for uncovering secrets and fighting crime. In fact, we are all lip readers to a large extent, as the McGurk effect demonstrates. The effect appears when you watch a video in which a person utters certain syllables, but the sound of other syllables is dubbed onto the soundtrack. Viewers seem to hear a third sound, somewhere in between: for example, when the filmed syllable is "ga" and the dubbed syllable is "ba", we tend to hear "da".

Some illusions disappear once they are explained, but the McGurk effect persists even after watching the video several times, listening only and watching only: as soon as we look and listen together again, the syllable sounds like "da" once more. Of course, most of us are aware of the importance of lip reading even without video editing tricks: the tendency to look at the face of the person we are talking to is not only a matter of politeness, and it grows stronger when hearing is harder - for example, in a noisy room, or for people whose hearing is weak.

One can therefore imagine uses for computerized lip reading in situations where computerized speech recognition alone is not sufficient, such as noisy places. The technology for controlling a computer by speech already exists, and is even included in personal computer operating systems.

Nevertheless, it has not come into widespread use - perhaps because it is sensitive to noise, and perhaps because other people are sensitive to the noise such use creates. The "voice dialing" available today on many mobile phones is not popular either, perhaps for the same reasons. If so, perhaps "silent speech" (moving the lips without making a sound) could be understood by a camera connected to a computer or phone. Lip reading can also serve, as it does in humans, as a tool supporting the main speech-recognition mechanism, as Intel proposed in 2003.

Speech understanding

The company Synface offers software designed for hearing-impaired people who talk on the phone: it shows them a virtual character whose lips move in time with the sounds received. In an experiment conducted with this technology in 2004, the combination of hearing the phone call while watching the virtual character's lip movements helped 84% of the hearing-impaired participants conduct telephone conversations in a normal manner.

Unlike Synface, the emphasis in the English project is on understanding speech. Synface "moves" the lips of the on-screen face to match the syllables being heard, without needing to identify those syllables precisely and without any attempt to recognize the spoken words. Synface thus leaves the main effort to the listener's natural understanding. The English project, by contrast, aims to produce a text transcript of the conversation captured on camera, without the help of human decoding - a much bigger challenge.

Why is this such a big challenge? Ostensibly, once we know the sequence of syllables uttered by the speaker, all that remains is to join them into words - a task that may seem easy at first glance. In fact it is not. To mention just a few of the reasons: first, great variation is expected in how words are encoded into lip movements from speaker to speaker (variation resulting from different accents, speaking habits, or simply fast speech - it is known that both the sound and the lip movements used to express a syllable depend on the syllables pronounced before and after it). Errors resulting from limited image quality and image decoding are also expected.

As a result, the decoded syllables will often differ from those the speaker intended to utter. Second, in normal speech there are no breaks between words, and without identifying where each word begins and ends there are far too many possible readings of each sequence of syllables (as if this article were printed without spaces between the words). To decipher the words, several levels of reasoning and decoding must be combined simultaneously, including an understanding of the context of the conversation and the words likely to be said in it. To be convinced of this, imagine reading out a sequence of random dictionary words and examining how well expert lip readers decipher the text: even under the best conditions, such a dictation will contain many mistakes. Even more mistakes will appear if random syllables are spoken that do not join into any words.
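
To see how quickly the possibilities multiply without word boundaries, here is a minimal sketch in Python that counts the ways an unspaced stream of letters can be split into dictionary words. The toy dictionary and the input stream are illustrative assumptions, not material from the project.

```python
def count_segmentations(stream: str, dictionary: set[str]) -> int:
    """Count the ways `stream` can be split into dictionary words."""
    n = len(stream)
    ways = [0] * (n + 1)
    ways[0] = 1  # the empty prefix has exactly one segmentation
    for end in range(1, n + 1):
        for start in range(end):
            # Extend every valid segmentation of the prefix by one word.
            if ways[start] and stream[start:end] in dictionary:
                ways[end] += ways[start]
    return ways[n]

# Even a tiny dictionary yields more than one reading of the same stream:
words = {"god", "is", "now", "here", "nowhere"}
print(count_segmentations("godisnowhere", words))
# -> 2 ("god is now here" vs. "god is nowhere")
```

With a realistic dictionary of tens of thousands of words, the number of candidate readings of a long syllable stream grows explosively, which is why context is indispensable.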

Lip reading is even harder than decoding recorded speech: in most English accents, for example, the lip movements in the sentence "where there's life there's hope" are identical to those in the sentence "where's the lavender soap". Choosing between the options requires a deep understanding of the situations in which each sentence might be said.

The capabilities of today's artificial intelligence systems are still far from such understanding, though it can sometimes be replaced with advanced statistical and probabilistic tools (see "The Priest and Probabilistic Intelligence", "Galileo" 69), which help choose the correct decoding according to the frequency of each word on its own, the frequency of word combinations, and the chance that words or combinations will appear in certain situations. For example, the question about lavender soap suits a conversation in a perfume shop better than a conversation at a bus stop, especially if soap was not mentioned in any other sentence of that conversation.
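
As an illustration of this kind of statistical choice, the following sketch scores the two candidate sentences from above using toy word frequencies and a toy context prior. All the numbers are invented for the example; they are not real corpus statistics.

```python
# Toy per-word log-probabilities (higher = more common in everyday speech).
word_logp = {
    "where": -2.0, "there's": -3.0, "life": -4.0, "hope": -4.5,
    "where's": -3.5, "the": -1.0, "lavender": -6.0, "soap": -5.0,
}

# Toy context prior: how plausible each sentence is in a given setting.
context_logp = {
    ("bus stop", "where there's life there's hope"): -2.0,
    ("bus stop", "where's the lavender soap"): -8.0,
    ("perfume shop", "where there's life there's hope"): -3.0,
    ("perfume shop", "where's the lavender soap"): -1.5,
}

def score(sentence: str, context: str) -> float:
    """Combine word frequencies with the context prior (log-probabilities add)."""
    return sum(word_logp[w] for w in sentence.split()) + context_logp[(context, sentence)]

candidates = ["where there's life there's hope", "where's the lavender soap"]
for ctx in ("bus stop", "perfume shop"):
    best = max(candidates, key=lambda s: score(s, ctx))
    print(f"{ctx}: {best}")
# bus stop: where there's life there's hope
# perfume shop: where's the lavender soap
```

The same lip movements thus receive different "best" readings depending on where the conversation takes place - exactly the role the context plays for a human lip reader.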

That decoding therefore remains possible for the conversation at the bus stop, but it will receive a lower probability estimate. Ultimately, some of the software's "deliberations" will be presented to the person reading the transcript, so here too part of the effort is left to natural intelligence. In this the software is no different from human transcribers of recordings, who, though endowed with natural intelligence, often do not know enough about the context of the conversation or the knowledge and assumptions the interlocutors share. They are therefore sometimes forced to offer several options for a word that was said, even when the interlocutors themselves, or someone better versed in the background of the conversation, would have no difficulty identifying it.

Challenges and opportunities

Lip-reading software has been introduced in the past, but most of it required ideal environmental and filming conditions, such as suitable lighting and a face pointed directly at the camera. The English project is far more ambitious, since it must do without these conditions to achieve its goals. Reflecting the scale of the challenge, Dr. Richard Harvey, the project's manager, describes the research as "very experimental".

However, Harvey will be able to draw on a large body of earlier work in this field, such as the article "3D head tracking for a computerized lip reading system" by Gareth Loy and his colleagues, from which the figure accompanying this column was taken; it shows the tracking of the position of the head and mouth as long as the face is tilted at an angle of at most XNUMX degrees relative to the camera.

Under non-ideal filming conditions, several processing stages are needed even before any attempt to decipher the lip movements: identifying where the people are in the picture, finding the position of the head, focusing on the lip area, following the head movements during speech, identifying as far as possible the effects of shadows and occluding objects, and processing the image so as to neutralize, as far as possible, every movement other than that of the lips themselves, as well as changes in lighting. This information is then passed to the stage that identifies the syllables the speaker pronounced. Such processes, as already described, are probabilistic: each movement will be associated with several possible "guesses" about the syllable that was uttered, and each guess with a probability.
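
To make the per-movement "guesses" concrete, here is a minimal sketch in Python. The mouth-shape categories (often called visemes), the candidate phonemes and all the probabilities are invented for illustration; they are not taken from the English project.

```python
# Several different sounds can share one visible mouth shape, so a single
# observed shape maps to several candidate phonemes, each with a probability.
viseme_to_phonemes = {
    # Closed lips look the same for /p/, /b/ and /m/.
    "lips_closed":  {"p": 0.4, "b": 0.35, "m": 0.25},
    # Rounded lips are shared by /w/ and the vowel /u/.
    "lips_rounded": {"w": 0.55, "u": 0.45},
    # Teeth on the lower lip is shared by /f/ and /v/.
    "teeth_on_lip": {"f": 0.6, "v": 0.4},
}

def phoneme_guesses(viseme: str) -> list[tuple[str, float]]:
    """Return the candidate phonemes for one viseme, most likely first."""
    candidates = viseme_to_phonemes.get(viseme, {})
    return sorted(candidates.items(), key=lambda kv: -kv[1])

print(phoneme_guesses("lips_closed"))
# [('p', 0.4), ('b', 0.35), ('m', 0.25)]
```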

The next step is to assemble these guesses into a coherent decoding of whole spoken words - which also requires forming coherent hypotheses about the points in time at which one word ends and the next begins. These hypotheses, in turn, can guide the decoding of the following syllables, or change the probability estimates of syllables already decoded. This, of course, is the classic combination of "bottom-up" and "top-down" processing familiar from cognitive psychology, brain research and other areas of artificial intelligence.
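
The following toy sketch illustrates that combination under stated assumptions: bottom-up guesses propose phoneme sequences, and a top-down dictionary keeps only the sequences that form real words. The lexicon, the per-position guesses and the probabilities are all invented for the example.

```python
from itertools import product

# Top-down knowledge: a tiny lexicon of words that could have been said.
lexicon = {"pat", "bat", "mat", "pad", "bad", "mad"}

# Bottom-up evidence: per-position candidate (phoneme, probability) guesses.
guesses = [
    [("p", 0.4), ("b", 0.35), ("m", 0.25)],  # lips-closed viseme
    [("a", 1.0)],                            # open-mouth viseme
    [("t", 0.6), ("d", 0.4)],                # tongue-at-ridge viseme
]

# Combine the guesses, score each combination by the product of its
# probabilities, and keep only combinations the lexicon allows.
hypotheses = []
for combo in product(*guesses):
    word = "".join(phoneme for phoneme, _ in combo)
    prob = 1.0
    for _, p in combo:
        prob *= p
    if word in lexicon:  # the top-down constraint
        hypotheses.append((word, prob))

for word, prob in sorted(hypotheses, key=lambda wp: -wp[1]):
    print(f"{word}: {prob:.3f}")  # ranked readings of the same lip movements
```

Note that "pat", "bat" and "mat" all survive the dictionary filter here: the lips alone cannot separate them, which is exactly where context and word frequency must take over.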

One of the resources that will help the developers is the extensive knowledge already accumulated about decoding recorded speech: there, too, syllables are identified probabilistically (more precisely, phonemes are: a phoneme is a basic unit of pronunciation that can distinguish between words, i.e. replacing one phoneme with another turns one word into a different word - replacing the /p/ of "pat" with /b/ yields "bat", for example), and coherent decodings are built using a dictionary of existing words and their frequencies.
Beyond the police uses, which can be seen both as a contribution to the life of the honest citizen and as a threat to his privacy, what else will become possible - for better or for worse - once software that can read lips is at our disposal?

Historians will be able to use it to search archives of silent films, and in particular "home movies" shot between 1920 and 1970, most of which were made without sound recording. (It was recently reported that computerized lip reading was applied to films of Hitler at his estate in the Alps during World War II; among other things, the software found a passage in which Hitler expresses disgust toward Hermann Goering.)

Marketing people will try to identify trends and opinions from video footage captured by cameras placed in crowded places. Journalists will try to find out what politicians really say when the microphone is off (recall remarks picked up by microphones their speakers did not know were open, among Israeli and other politicians), and paparazzi will find a new way to invade the privacy of celebrities. Alongside all of these, as noted at the beginning of the column, new channels will open for mediation between people, and between people and computing and communication equipment.

Originally published in "Galileo" magazine
