Who wrote this?

In the future, anonymous writers who express themselves online or in print will not be able to hide. Researchers have developed a computer program that analyzes a text and produces a detailed and reliable description of its author.

Olga Kleinerman, Galileo

More than a hundred years have passed since the famous literary detective Sherlock Holmes won the hearts of Londoners and captured quite a few criminals thanks to his analytical ability, logic and quick mind. Despite the passage of time, we continue to marvel at the legendary detective's unusual ability to draw detailed conclusions about people's characters.

Analysis of online comments and literary texts

Imagine that you have at your disposal a computer program whose capabilities are no less impressive than those of the famous detective. It is enough to feed this software the work of an unknown author in order to receive, after a series of appropriate steps, a detailed and reliable description of the author (country of origin, gender, age, character traits and culture).

Such software can meet the needs of a diverse audience: the police, who can use it to uncover the traces of criminals on the Internet; investigators seeking to characterize the members of extremist groups that operate websites; and literary researchers debating the identity of a text's author. The method can also help in examining ancient writings when we do not have enough information. If, for example, several scrolls were found at an archaeological site, such an analysis can show whether they were all written by the same author.

The Federalist Papers

The Federalist Papers is the name of a collection of 85 essays on the United States Constitution, written in 1787 and 1788. They were persuasive pamphlets intended to explain to the residents of New York the advantages of the proposed constitution over the Articles of Confederation.

The essays were written by James Madison, Alexander Hamilton and John Jay and were published under the pseudonym "Publius". For some of the essays it was unclear whether the author was Hamilton or Madison. The statistical analysis by Mosteller and Wallace concluded that the disputed essays were written by Madison. Their study of the Federalist Papers is probably the best-known work in the field of author identification and remains a cornerstone of modern research methods.

Now Prof. Moshe Koppel and Dr. Jonathan Schler of Bar-Ilan University, together with their colleagues Prof. Shlomo Argamon of the Illinois Institute of Technology in Chicago and Prof. James Pennebaker of the University of Texas, have found a formula that may provide an answer to the problem of identifying an author's characteristics from a text he has written. The software developed by Koppel, Schler and their colleagues identifies stylistic characteristics of the writer by analyzing the text, whether it is a literary work, a comment on the Internet or any other written document.

Koppel and Schler are not concerned only with gender differences between writers: they have refined research methods that identify a literary "fingerprint" precisely enough to distinguish one writer from another. Using the software they developed, it is possible to identify with a high degree of accuracy the writer's gender, age and native language, all by analyzing a text he wrote in English. In addition, the analysis makes it possible to determine whether different texts were written by the same author.

One of the sources of texts used by Koppel and Schler for research and software development is the world of blogs, personal diaries published on the Internet. Blogs are very popular and constitute a corpus of hundreds of thousands of texts that can be analyzed to extract information about their authors. This quantity of texts makes it possible to expand and improve the database behind the computer program.

Author profile identification history

It all started in the early 1960s, when Frederick Mosteller of Harvard University and David Wallace of the University of Chicago published a work that became a breakthrough in the field of automatic, computer-based author identification.

The two researchers worked on 12 essays from the Federalist Papers in order to discover the identity of their authors (see box). They used an algorithm that performed a statistical analysis of the language of the essays and calculated how frequently words occurred in the text. A central premise of this method, and of those that followed it, is that a database of texts by known authors is available, which makes it possible to compare a new text against them and determine how well it matches a specific, identified author.

Predictive model

In an interview with Galileo, Prof. Koppel says: "In many cases we come across an anonymous text without having a defined group of possible authors that includes the anonymous author. What can be done in such a case? In such cases we would like to do what television detectives do: extract from the anonymous text as much information as possible about the writer. For example: gender, age, mother tongue, personality, etc."

The approach is basically similar to that of Mosteller and Wallace, except that instead of characterizing the writing of a given author, the researchers try to characterize the writing of groups of authors (male writing versus female writing, young writing versus older writing, etc.).

How does the mechanism work? In order to analyze an anonymous text whose author's characteristics are unknown, it is necessary to build a predictive model: a sort of formula that classifies a given text into one of the membership groups that define the author's gender, age, native language, etc.

Building a predictive model relies mainly on sociolinguistic analysis of hundreds of thousands of blogs whose author characteristics are known (called training documents). The analysis of training documents is performed automatically and begins by converting the written text into a mathematical vector. Each component of the vector corresponds to a feature of the text, such as a word that appears in it. The recognition software measures how frequently specific words, phrases and word combinations occur in the training documents, and uses machine learning methods to build a formula that classifies each of the training documents into a membership group. The same formula is then used as a predictive model for analyzing and classifying new texts.
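The conversion of a text into a vector of word frequencies can be illustrated with a minimal sketch. It is not the researchers' software: the function name frequency_vector, the five-word vocabulary and the sample sentence are all assumptions chosen for illustration, and a real system would use thousands of features.

```python
from collections import Counter
import re

def frequency_vector(text, vocabulary):
    """Return the relative frequency of each vocabulary word in the text."""
    # Lowercase the text and split it into simple word tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty text
    return [counts[word] / total for word in vocabulary]

# Hypothetical vocabulary of "small words"; order defines the vector layout.
vocabulary = ["the", "of", "i", "you", "with"]
vec = frequency_vector("I think you and I agree with the plan", vocabulary)
print(vec)
```

Relative frequencies are used rather than raw counts so that long and short texts produce comparable vectors.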

Who wrote the hidden book?

One of the mysteries solved with the help of the computer program concerned the writings of Rabbi Yosef Chaim, known as the "Ben Ish-Hai", who served as the chief rabbi of Baghdad about 100 years ago. The rabbi claimed that he had found a hidden book whose author was unknown. In order to confirm or refute this claim, the hidden book and one of the texts of the "Ben Ish-Hai" were compared with the help of the software. The results showed that both books are the work of the same author...

In order to test the accuracy of the prediction model, Koppel and Schler ran the recognition software on texts with known author characteristics that had, of course, not been used to build the model (hereafter: test documents). The main requirement in this test is that the author profile determined by the software match as closely as possible the true characteristics, which in this case are known in advance. The analysis of these test documents yielded an accuracy of over 80%, a result sufficient for the recognition software to be applied to anonymous texts.
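The test on held-out documents amounts to counting how often the predicted label matches the known one. The sketch below shows only that bookkeeping; toy_predict is a hypothetical stand-in for a trained model, and the five labeled texts are invented for illustration, not data from the study.

```python
def accuracy(predict, test_documents):
    """Fraction of test documents whose predicted label matches the known one."""
    correct = sum(1 for text, label in test_documents if predict(text) == label)
    return correct / len(test_documents)

def toy_predict(text):
    # Hypothetical stand-in for a trained model: a single hand-written rule.
    return "under_20" if "lol" in text.lower() else "over_30"

# Invented test documents with labels known in advance (illustration only).
test_documents = [
    ("lol that was fun", "under_20"),
    ("My wife and children are well", "over_30"),
    ("See the attached link", "over_30"),
    ("lol the office party", "over_30"),
    ("haha ur right", "under_20"),
]

acc = accuracy(toy_predict, test_documents)
print(f"accuracy: {acc:.0%}")
```

The essential point is that the test documents play no part in building the model, so the measured accuracy reflects how the model behaves on texts it has never seen.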

Each criterion and its database

Working with a large number of training documents allows the construction of a database that contains many characteristics for each membership group, which increases the reliability of the test for new texts. The outstanding advantage of this method over its predecessors is that new characteristics can be added to the database automatically, improving the accuracy of the prediction. Recall that the database in the older identification software was limited to the available texts written by a specific author, against which the tested text was compared.

A separate database of training documents was built for each profile criterion (gender, age, mother tongue, personality, etc.), except for gender and age, which share a database. Each profile criterion has membership groups into which the examined texts are classified. The membership groups for the age criterion are: young people up to 20 years old, adults in their 20s, and adults over 30.

The membership groups for the mother tongue criterion are: Spanish, Russian, French, Czech and Bulgarian. The membership groups for the gender criterion are women and men, and the membership groups for the character traits criterion are neurotic and non-neurotic.

The identification software therefore works on several levels at once: it finds a membership group for each profile criterion separately, and then cross-references the results in order to determine an overall profile of the author of the examined text.

Male or female blogger?

In order to distinguish sharply between the identification marks of each membership group, the training documents for each criterion are selected so that the other criteria are held as constant as possible. For example, to build a model that identifies the author's gender, the texts examined deal with the same topic and were written by equal-sized groups of men and women who share the same linguistic origin, belong to the same age group and have backgrounds that are as similar as possible (profession, social status, etc.).

The software does not rely on any theory or model from the social sciences, but on mathematical models that perform a statistical analysis of written text. The linguistic analysis of the texts within the membership groups is based on identifying essential differences in writing style and content from one group to another. Writing is indeed influenced by the author's gender, age and other personal characteristics.

Thus, for example, in blogs discussing natural disasters, men tend to focus on surveying the damage, the statistical data and the activities of government institutions, while women tend to emphasize the fate of the people and their personal stories. The gender differences are reflected in both the writing style (use of words) and the content (choice of topic), and provide a starting point for building the database.

Analysis of the writing style and not the content

It is important to emphasize that the software does not rely on any theory or model from the social sciences, but on mathematical models that perform a statistical analysis of written text.

Koppel and Schler emphasize that analyzing the writing style provides more accurate and reliable results than analyzing the content. This is because the search for identification marks that indicate differences in writing style between the membership groups is based on vocabulary, syntax and grammar, and even on spelling errors, distinctive expressions and specific keywords. All of these appear in every written text with relatively high frequency, whereas content characteristics are often limited to single words, sometimes rare words unique to a specific topic of discussion.

Unlike sociolinguistic studies, in which the models were built manually by the researcher, here the model is built automatically by an algorithm applied to texts with known author characteristics.

A variety of algorithms can be used to classify texts into multiple membership groups. Koppel, Schler and their colleagues used an algorithm called Bayesian regression. This algorithm was found to be effective, achieving a high level of accuracy with a short running time. Before presenting the results of the study, we will discuss how the algorithm works.

Quantitative difference, not qualitative

As mentioned, the examined texts are described by hundreds to thousands of parameters. All authors use, to one degree or another, the various expressions, personal pronouns and other words that serve the software as identification marks.

There are no exclusive identifiers used only by men, only by women or, say, only by 18-year-olds. Therefore, merely searching for identification marks within the examined text cannot serve as a tool for determining the author's profile. The decision to classify a text into one membership group or another is based on a mathematical algorithm that calculates an average weight for the occurrence of each identification mark in the body of the examined text. Cross-referencing the values calculated for each membership group ultimately helps to predict the author's full profile.

The structure of the database makes it possible to classify a new text into a membership group by searching the new text for identification marks that match those found in the database. That is, we analyze the parameters of the document being classified and compute the total weight of these text characteristics for each of the membership groups. The group with the highest score is the group to which the document and its author are assigned.

Some of the features represented in the vector are single words whose frequency is simple to determine. But there are more complex features. For example, one type of feature covers finely subdivided parts of speech. To measure the prevalence of such features, a tree was built whose roots are the building blocks of language: nouns, verbs, conjunctions, prepositions, adjectives and the like. Each branch in the tree is a linguistic subgroup, each node on the branch is a specific group of words within that subgroup, defined by the meaning of the words, and each leaf is a specific keyword.
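The "total weight per group, highest score wins" rule can be sketched as a plain linear scoring step. This is a simplification, not the Bayesian regression the researchers used to learn their weights; the group names, feature values and weights below are invented for illustration.

```python
def classify(features, weights):
    """Score each group as the weighted sum of feature values; pick the maximum."""
    scores = {
        group: sum(group_weights.get(name, 0.0) * value
                   for name, value in features.items())
        for group, group_weights in weights.items()
    }
    best_group = max(scores, key=scores.get)
    return best_group, scores

# Hypothetical feature frequencies measured in one document.
features = {"the": 0.05, "i": 0.02}

# Hypothetical per-group weights; in practice these are learned from training documents.
weights = {
    "group_a": {"the": 1.0, "i": -0.5},
    "group_b": {"the": -1.0, "i": 2.0},
}

best_group, scores = classify(features, weights)
print(best_group, scores)
```

No single feature decides the outcome: every feature contributes a little to every group's score, which matches the point that the differences between groups are quantitative rather than qualitative.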
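One way to picture such a tree is as nested groups of keywords, where the frequency of a node is the sum of the frequencies of the leaves beneath it. The sketch below assumes a tiny two-level taxonomy; the category names, word lists and sample sentence are hypothetical and much smaller than anything a real system would use.

```python
from collections import Counter
import re

# Hypothetical taxonomy: root -> subgroup -> leaf keywords.
taxonomy = {
    "pronouns": {
        "first_person": ["i", "me", "my"],
        "third_person": ["he", "she", "him"],
    },
    "determiners": {
        "articles": ["the", "a", "an"],
        "demonstratives": ["these", "those"],
    },
}

def node_frequencies(text, taxonomy):
    """Relative frequency of every subgroup and root, summed up from the leaves."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty text
    freqs = {}
    for root, subgroups in taxonomy.items():
        root_count = 0
        for subgroup, words in subgroups.items():
            sub_count = sum(counts[word] for word in words)
            freqs[f"{root}/{subgroup}"] = sub_count / total
            root_count += sub_count
        freqs[root] = root_count / total
    return freqs

freqs = node_frequencies(
    "She told me the story and I told him those stories", taxonomy
)
print(freqs)
```

Each node's frequency can then serve as one component of the document's feature vector, alongside the frequencies of individual words.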

Male language and female language

The stylistic differences between male and female writing that Koppel and Schler found, in a study of tens of thousands of blog texts containing over 7000 words per author, show that women make heavy use of personal pronouns and negation words. Words such as I, you, she, me, him, my, he, not, non and nor characterize female writing. In contrast, men make heavy use of determiners.

The words that appear more often in men's writing are: the, those, these. The use of prepositions also differs between women and men. Women often use words like for or with, while men make more use of words like of and as, and of numbers.

It also became clear that women use more "blogging language" (abbreviations and letter combinations such as lol, haha, ur and other linguistic innovations), while men attach hyperlinks more frequently.

The same linguistic differences observed between men and women are also observed between older bloggers (over 30 years old) and younger ones: the older bloggers attach more links, i.e. use "masculine" language, while the younger ones make more use of blogging language, i.e. "feminine" language.

Additional hallmarks of the young bloggers are frequent use of linking words and the omission of apostrophes: Im, so, thats, dont, cant. It should be noted that few stylistic differences were observed between bloggers in their twenties and those over thirty; the differences lay mainly in content, that is, in the use of words that distinguish a particular age group. Thus, the bloggers in their twenties used words such as apartment, office, eating, tv, job, work, bar, and the group ten or more years older mostly used words such as years, wife, husband, family, children, daughter.

Identification of the writer's native language

To identify the writer's native language, it is necessary to build a "dictionary of common mistakes" found in English texts written by people with different native languages. To build the database, texts were taken from the ICLE (International Corpus of Learner English).

The authors of the reference documents were over 200 students from five countries (Spain, Russia, France, the Czech Republic and Bulgaria), none of them native speakers of English, who all wrote essays on the same topic. Once the database for the linguistic-origin criterion has been built, new texts are classified in the same way as for the other criteria.

The differences in English writing between people of different origins stem from grammar rules, patterns of speech, common expressions and more. A different linguistic background leaves prominent identification marks in the writing, which make it possible to determine the author's origin with a high degree of accuracy. The words and idioms that appear in the written text reflect this difference and become hallmarks of each linguistic group when the text is examined.

The identification lies in the small words

The research shows that speakers of Russian, Czech and Bulgarian tend to omit the definite article (the) as well as a and an, because articles of this kind do not exist as separate words in the Slavic languages. Russian speakers also often use words like over, every, can, can't. French speakers tend to coin new words ending in -ly and are fond of the word indeed.

The striking characteristic of Spanish speakers is the frequent use of words such as because and although, and the omission of the infinitive marker to: instead of writing to go, they settle for go. Romanian speakers make phonetic mistakes. For example, in many cases the letter 'O' is used in the wrong places, such as author instead of author.


It seems, then, that the recognition model focuses on the small words, and they make the big difference: personal pronouns, prepositions, linking words (such as 'but' and 'also') and the frequency of morphological forms such as prefixes and suffixes. All of these are used automatically, without the author being aware of them as he writes, even when he deliberately chooses grandiose words.

It is possible that in the future the method will be used for various psychological studies. In any case, Koppel and Schler do not look for the reasons behind the differences in writing. They use the differences that actually exist to identify the author's characteristics without relying on a database of names, and confine themselves to research that focuses only on computers and mathematics. The research results are in demand in many industries that are eager for information relevant to their field of work.

Olga Kleinerman is a materials and chemical engineer and a graduate of the Technion. She currently works in the research and development department of a leading Israeli manufacturer of engineering plastic compounds.

5 comments

  1. lol And if I start writing like Tilligant, then Pitom won't know that it was me who wrote it and they will think that some Shekhnazi wrote it? Come on... people write different texts in different styles. It can't prove anything.

  2. Fascinating, but I was a little disappointed that the identification of the author's mother tongue is based on mistakes. It was more interesting to discover the mother tongue through the style of expression, although this is of course a much more difficult matter. Regarding the fact that young people omit the apostrophe... here is a short explanation of how to punctuate in English:
    http://angryflower.com/bobsqu.gif

  3. Where can I download the software?

    I wonder if anyone has used it to check who composed the books of the Bible. It could be an interesting project.

  4. And surely it will be perfected later and it will predict in a better way.
