
Computerized identification

Artificial intelligence researchers are working on the possibility of teaching computers to identify a person's country of origin by analyzing their accent

Speaking different languages (or not): German Chancellor Angela Merkel and California Governor Arnold Schwarzenegger at the CeBIT 2009 exhibition. Photo: public relations

Israel Benjamin, Galileo Magazine

We can often guess a person's country of origin after hearing them say just a few words. Like many human abilities, this one is now attracting the attention of artificial intelligence researchers, who aim to teach computers to match the human skill of recognizing a speaker's origin.

Recently, the Lincoln Laboratory of the Massachusetts Institute of Technology (MIT) introduced software that takes a significant step in this direction. The software, developed by Pedro Torres-Carrasquillo and his colleagues in the Information Systems Technology Group at Lincoln Laboratory, can distinguish between pairs of accents in the same language: for example, English with an "all-American" accent versus English with an Indian accent, or the Spanish of a speaker born in Cuba versus that of a speaker born in Puerto Rico.

According to Torres-Carrasquillo, this is the first software that succeeds in automatically distinguishing between different accents of the same language (as opposed to software that presents an analysis of speech samples to a human expert to help him distinguish between accents). This achievement is another step in the great progress made in recent years in automatic recognition of spoken language. The corresponding problem - identifying the language in which a text was written - is easier, and there are many solutions for it. An example of such software is TextCat, which aims to identify 69 different languages from written text, including Yiddish, Welsh and Tamil. One of the uses of these solutions is in search engines, to help the user search for texts in a certain language or to offer a translation into the user's language.

Language models

TextCat uses a technique called "N-gram". With this technique it is possible to characterize the statistical properties of text in a certain language by calculating the probability of the appearance of letter sequences of length N, for several values of N, by scanning typical texts in that language. When N=1, the probabilities express the frequency of individual letters in the given language. For example, the letter q appears in English with a frequency of about 0.1%, but its frequency in Spanish is 0.9% and in French 1.4%. Therefore, we can use the frequency of the letter q in the text we want to identify to choose between the hypotheses that the text is written in English, French or Spanish (in some other languages written in the Latin alphabet, such as Turkish, q does not exist at all). Of course, the frequencies of other letters can be used in a similar way.
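To make this concrete, here is a minimal sketch in Python - an illustration only, not TextCat's actual implementation - that computes letter frequencies (the N=1 case) and uses the article's figures for the letter q as a toy reference for choosing between the three candidate languages. The sample sentence and reference numbers are assumptions chosen for the example.

```python
from collections import Counter

def letter_frequencies(text):
    """Relative frequency of each letter in a text (the N=1 'model')."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    return {letter: counts[letter] / total for letter in counts}

# Toy reference frequencies of the letter 'q', taken from the figures above.
Q_FREQUENCY = {"English": 0.001, "Spanish": 0.009, "French": 0.014}

def guess_by_q(text):
    """Pick the candidate language whose typical 'q' frequency is closest."""
    freq_q = letter_frequencies(text).get("q", 0.0)
    return min(Q_FREQUENCY, key=lambda lang: abs(Q_FREQUENCY[lang] - freq_q))

# A q-heavy French phrase; expected output: French.
print(guess_by_q("Qui cherche trouve, et chaque question a quelque chose."))
```

A real identifier would of course weigh all letters (and longer sequences), not a single one.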

Analysis of letter frequencies has many uses (for example in cryptography, as described by Edgar Allan Poe in his 1843 story "The Gold-Bug"), but it is not enough for identifying languages, because the frequency differences of individual letters may not be sufficient to distinguish between closely related languages.

For this purpose, higher values of N are also needed: an analysis for N=2 gives the frequency of letter pairs (for example, in English the pair "TH" is about 15 times more frequent than the pair "HT"), N=3 deals with letter triplets, and so on. The larger N is, the larger the frequency table, and some of the sequences become too rare to be useful in a statistical analysis of the short text we want to identify, so N must be chosen carefully and the tables "compressed" to include only the statistically useful cases. (Note: the English alphabet has 26 letters, but 26² = 676 letter pairs and more than 17,000 triplets; it is true that not all combinations occur in English - for example, q is followed almost exclusively by u - but punctuation marks must also be considered, first of all the space that separates words.)
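The note about counting the space between words can also be shown in a short sketch: the following illustrative Python function (an assumption-laden example, not TextCat's code) builds an N=2 frequency table in which the word-separating space is treated as a character in its own right.

```python
from collections import Counter

def ngram_frequencies(text, n=2):
    """Relative frequencies of character sequences of length n, keeping spaces."""
    # Keep letters and collapse everything else to a single space, since the
    # word-separating space must be counted as part of the sequences too.
    cleaned = "".join(c if c.isalpha() else " " for c in text.lower())
    cleaned = " ".join(cleaned.split())
    grams = [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {gram: counts[gram] / total for gram in counts}

freqs = ngram_frequencies("the cat is on the mat", n=2)
print(freqs["th"], freqs.get("ht", 0.0))   # "th" occurs, "ht" does not
```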

When a collection of texts known to belong to a given language is analyzed using a method such as N-gram, a "model of the language" is obtained. This model of course misses almost everything that matters about that language to a linguist or a language user, and certainly contains no hint of syntax (not to mention grammar). Nevertheless, it is a "full model" of the language, in the sense that it contains all the features of the language that can be identified from the point of view of the frequency of letter sequences (and therefore the model can be used, at least for fun, to generate texts from the frequency tables by choosing the next letter according to the probability of each letter appearing after the letters we have already chosen; for N=4 we usually get a text many of whose words are meaningless, but which a human reader recognizes as belonging to the language from which the model was created).
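For readers who want to try the "for fun" text generation described above, here is a hedged sketch of such a generator: it learns which letters tend to follow each three-letter context (the N=4 case) and then samples letters according to those frequencies. The tiny corpus and the sampling details are assumptions made for brevity.

```python
import random
from collections import Counter, defaultdict

def build_model(text, n=4):
    """Map each (n-1)-character context to a Counter of possible next characters."""
    model = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, nxt = text[i:i + n - 1], text[i + n - 1]
        model[context][nxt] += 1
    return model

def generate(model, length=60, seed=None):
    """Sample characters one at a time according to the learned frequencies."""
    context = seed or random.choice(list(model))
    out = context
    for _ in range(length):
        counter = model.get(out[-len(context):])
        if not counter:          # unseen context: stop early
            break
        letters, weights = zip(*counter.items())
        out += random.choices(letters, weights=weights)[0]
    return out

corpus = "the quick brown fox jumps over the lazy dog and the dog barks back "
print(generate(build_model(corpus.lower(), n=4)))
```

With such a small corpus the output mostly parrots the input; trained on a large body of text, it produces nonsense words that nevertheless "look" like the source language.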

Once such a model has been created for several languages, it is possible to identify a text in an unknown language by analyzing the frequency of N-grams in that text and comparing these frequencies to the frequencies expressed by the known models of the "candidate" languages (those languages for which we have a model). Since it is unlikely that we will find a model whose frequencies exactly match those of the text in front of us, we use statistical tools to calculate the probability that the text corresponds to each of the candidate languages.
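A minimal sketch of this comparison step, assuming the ngram_frequencies helper from the earlier sketch: each candidate model scores the unknown text, and the best-scoring language is chosen. The simple log-likelihood-style score and the tiny training texts are illustrative assumptions; a real system would build its models from large corpora and may rank N-grams differently.

```python
import math

def score(text_freqs, model_freqs, floor=1e-6):
    """How well a language model explains the n-grams of the unknown text
    (a simple log-likelihood-style score; one possible choice among several)."""
    return sum(freq * math.log(model_freqs.get(gram, floor))
               for gram, freq in text_freqs.items())

def identify(text, models):
    """Return the candidate language whose model gives the text the best score."""
    text_freqs = ngram_frequencies(text, n=2)   # helper from the earlier sketch
    return max(models, key=lambda lang: score(text_freqs, models[lang]))

# Tiny illustrative "training texts"; placeholders for real corpora.
models = {
    "English": ngram_frequencies("the quick brown fox jumps over the lazy dog", n=2),
    "Spanish": ngram_frequencies("el rapido zorro marron salta sobre el perro perezoso", n=2),
}
print(identify("the dog jumps over the fox", models))   # expected: English
```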

Identification by partial model

According to Torres-Carrasquillo, approaches that create a complete model of the language are not suitable for accent recognition. He points out that the model is not required to "look like the data" (in the sense that a model created by frequency counts of letter sequences "looks like" the texts taken from the given language, that is, it represents the statistical properties of frequencies in that language). Instead, it is enough that the model can distinguish between languages, even if it uses only a small part of the features of each language for this purpose.

For example, vowel sounds in Cuban Spanish are somewhat longer than the corresponding sounds in Puerto Rican Spanish. Unfortunately, it is very difficult to find a single difference of this kind that is both significant enough to differentiate between two accents with high probability and common enough that it can reasonably be expected to appear even in short segments of conversation. Therefore, a collection of such differences must be found, and at present the method is not general: it requires building a separate distinguishing mechanism for each pair of accents. The researchers' goal is to arrive at a general process that can reliably differentiate between many accents.

The work of the researchers from MIT differs from previous studies at the same laboratory in the field of language recognition in that it uses smaller units of sound. The previous studies analyzed the sound samples at the level of the phoneme (the basic unit of pronunciation) and the form in which the phoneme was used in different accents (different ways of pronouncing the same phoneme, which do not change the meaning of the spoken word, are called allophones). In analogy to the written-text recognition method, the previous studies treated phonemes as if they were the letters of the spoken language, and aimed to identify languages and accents by features of phonemes and phoneme sequences.

The new studies "split the atom" and choose smaller "letters": short segments, a few thousandths of a second long, sampled from the speech. This method improves the ability to differentiate between slightly different pronunciations of the same phoneme (allophones) and increases the probability of recognizing an accent from short segments of conversation. As we will see later, there are practical reasons for wanting to detect an accent as early as possible during a conversation.

Combining GMM with SVM

To discover a way to distinguish between accents, the short samples are analyzed using a standard technique in signal processing: identifying the frequencies of which each sample is composed, so that the spectrum of frequencies present in the sample becomes a pattern representing the sounds produced in the conversation during those milliseconds (the technique also tries to compensate for differences in pitch between different speakers who share the same accent).
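The general idea of turning short frames of speech into spectral patterns can be sketched with a windowed Fourier transform, as below. This is only an illustration under assumed parameters (20-millisecond frames, a Hamming window, synthetic audio); it is not the laboratory's actual feature-extraction pipeline.

```python
import numpy as np

def frame_spectra(signal, sample_rate, frame_ms=20, hop_ms=10):
    """Slice a speech signal into short frames and return each frame's magnitude spectrum."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    window = np.hamming(frame_len)          # taper frame edges before the Fourier transform
    spectra = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len] * window
        spectra.append(np.abs(np.fft.rfft(frame)))
    return np.array(spectra)   # one row per frame: the "pattern" for those milliseconds

# Toy usage: one second of synthetic 440 Hz audio at 16 kHz instead of a real recording.
t = np.linspace(0, 1, 16000, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t)
print(frame_spectra(audio, 16000).shape)
```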

Since each sample has a slightly different pattern of frequency intensities, the analogy to letters is far from perfect: most languages have only a few tens of letters, and there are clear differences between any pair of letters even when they are written in different fonts, whereas each pattern of a spoken sound differs from every other pattern, and it is difficult to locate the exact place where one type of pattern borders on another: the transitions are continuous. Therefore, more advanced statistical and mathematical techniques are required than those, such as N-gram, used to recognize written language.

The researchers of the Lincoln Laboratory at MIT use a combination of two such techniques, which have gained great popularity in recent years: GMM (Gaussian Mixture Models) and SVM (Support Vector Machines). Both methods represent each pattern as a collection of numbers, so that if 20 numbers represent each sample, the sample can be thought of as a point in a 20-dimensional space. The goal is to find a way to differentiate between the points representing samples from one accent and the points associated with the other accent. For this purpose, the space must be divided into regions that contain exclusively (or almost exclusively) points representing samples of one accent and regions containing the points associated with the other accent.
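Here is a hedged sketch, using scikit-learn on synthetic 20-dimensional points, of how the two methods might be combined: one Gaussian mixture per accent compared by log-likelihood, an SVM trained on the same points, and a naive average of the two scores. The data, model sizes and fusion rule are assumptions made for illustration; the article does not describe the laboratory's actual combination scheme.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(0)
DIM = 20   # each sample is a point in a 20-dimensional space, as in the article

# Synthetic stand-ins for spectral patterns from two accents.
accent_a = rng.normal(loc=0.0, scale=1.0, size=(500, DIM))
accent_b = rng.normal(loc=0.7, scale=1.0, size=(500, DIM))
X = np.vstack([accent_a, accent_b])
y = np.array([0] * 500 + [1] * 500)

# GMM approach: one mixture model per accent, classify by the higher log-likelihood.
gmm_a = GaussianMixture(n_components=4, random_state=0).fit(accent_a)
gmm_b = GaussianMixture(n_components=4, random_state=0).fit(accent_b)

# SVM approach: find a boundary separating the two clouds of points.
svm = SVC(probability=True, random_state=0).fit(X, y)

def classify(sample):
    """Naive fusion: average the two methods' scores for accent B
    (an assumption for illustration, not the laboratory's fusion scheme)."""
    sample = sample.reshape(1, -1)
    ll_a, ll_b = gmm_a.score(sample), gmm_b.score(sample)
    gmm_score = 1.0 / (1.0 + np.exp(ll_a - ll_b))      # squash to a 0..1 score
    svm_score = svm.predict_proba(sample)[0, 1]
    return int((gmm_score + svm_score) / 2 > 0.5)

print(classify(rng.normal(loc=0.7, scale=1.0, size=DIM)))   # expected: 1 (accent B)
```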

GMM and SVM differ in the mathematical representation of this division of the space and in the way the optimal division is calculated. For this software, the GMM method is slower but more accurate than SVM, and the combination of the two methods was found to be the most accurate - the error rate was only 7%. I wonder what the error rate of a human expert is...

Practical uses - less privacy, more security?

As mentioned, the new software joins existing solutions for language recognition, whether spoken or written. Another possibility is to recognize a language from video recordings of the speakers, even without recording the sound itself: lip-reading software developed at the University of East Anglia (UEA) makes it possible to identify the language being spoken. One of the heads of the group that developed the software, Prof. Stephen Cox, noted that the findings matched the intuition that even when the same person speaks different languages, his facial movements differ from language to language. For example, the software found that "lip curling" is more common when speaking French, while speaking Arabic involves more pronounced tongue movements.

Beyond the academic achievement and the progress in imitating yet another ability that used to be the exclusive property of humans, such programs also have practical uses. At least some of these uses are related to surveillance and eavesdropping. The first example, which appears in a press report about the accent recognition software, involves an American police officer intercepting a conversation in Spanish in which a drug dealer receives notice of a new shipment. The officer recognizes that the caller speaks Spanish with a South American accent, but if he could associate the accent with a specific country, he could use that information to direct the rest of the investigation.

The same press report also mentions the potential contribution of accent recognition to automatic language-to-language translation systems, so that such systems would make fewer mistakes in understanding words and could take advantage of the nuances that humans convey to one another through accent. It is clear that as long as such automatic translation systems remain in the realm of science fiction, most of the funding and motivation for developing accent recognition will come from law enforcement and counterterrorism.

In a similar way, identifying a speaker's language in video footage, as with the lip-reading software developed by the same group, may allow security cameras (which gradually cover ever larger parts of the public space) to locate people with ethnic characteristics that law enforcement and security authorities associate with criminal and terrorist groups, even when the vast majority of people belonging to these ethnic groups are innocent of any crime. On the one hand, this is another step forward in the ability to protect the public; on the other hand, it is a worrying intrusion into the privacy of at least part of that public. In this dilemma, each country chooses for itself the balance that seems most correct and moral to its leadership, but it is not at all clear whether the leadership provides the country's citizens with information about the decision process and its results.

Even if we lived in a utopian world free of the threats of crime and terrorism, it would be worth remembering that in most cases recognizing other people's origins may lead us to stereotypical conclusions - positive or negative - about those people.

Even when identifying someone's origin does not lead to racism, a conscious decision is sometimes required to ignore the accepted stereotypes and focus on the individual in front of us. Today, most of our interactions with computers and software do not expose us to this danger, as expressed by the cartoon in which a dog sitting in front of a computer screen says to his friend, "On the Internet, nobody knows you're a dog." However, differing treatment based on origin or social status can already be found at times, such as websites that refuse to sell or provide information to surfers living in certain countries, or insurance companies whose risk-assessment software relies on socio-economic data, including place of residence. Would it be right to teach our computers to recognize origin and accent as well, in a way that may lead to behavior based on that recognition?

Israel Binyamini works at ClickSoftware developing advanced optimization methods

One response

  1. Hahaha, this reminds me of the movie The Matrix, where Agent Smith doesn't have an accent -
    befitting a machine, his English sounds as written. 😉
