Researchers have developed an algorithm that makes it possible to find a man's last name based on genetic information from the Y chromosome. How did they manage to identify a family based on the Y chromosomes of its sons? Is publishing genetic information on the Internet dangerous, and what is the benefit of such databases?
Amos Bev, Galileo
The researchers were able to trace the name and location of a particular person from a combination of his Y chromosome, his age and the fact that he lives in California.
Israeli researchers from the Whitehead Institute in Boston and from Tel Aviv University have developed an algorithm that can infer a man's last name from genetic information on his Y chromosome. The research may have significant implications for information privacy. The study was published in the journal Science.
"Since the ability to sequence the human genome developed, many people want to trace their genetic lineage," explains Prof. Eran Halperin, from the School of Computer Science and the Department of Microbiology and Biotechnology at Tel Aviv University. "To meet the need, companies have sprung up in the United States that take saliva samples from those who are interested, and upload their personal genomes to databases open to the public. We used these databases to examine family affiliation according to the Y chromosome - the male sex chromosome. The Y chromosome was particularly suitable for our research, because it is passed from father to son throughout the generations (with minor changes resulting from mutations), and therefore - just like the family name - it is shared by virtually all men in the extended family."
The research - initiated by Dr. Yaniv Ehrlich of the Whitehead Institute for Biomedical Research in Boston, with the participation of Prof. Halperin and doctoral student David Golan from the Department of Statistics at Tel Aviv University - focused on building a computer algorithm that can determine a person's last name based solely on data from his Y chromosome. The algorithm relies on mapping special segments of the genome called STRs (Short Tandem Repeats). The genome is a long sequence built from four nucleotides, denoted by the letters A, C, G and T. An STR is a sequence made up of several consecutive repetitions of a shorter basic sequence, for example ACTACTACTACT - four repetitions of the basic sequence ACT. Because of this repetitive structure, the number of repeats in each STR tends to change from generation to generation.
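To make the idea of a repeat count concrete, here is a minimal Python sketch that counts how many times a basic sequence repeats back to back. The function name `count_tandem_repeats` is made up for this illustration, and the sketch is not the researchers' software; real STR genotyping from sequencing data is considerably more involved.

```python
import re

def count_tandem_repeats(sequence: str, motif: str) -> int:
    """Return the number of repeats in the longest uninterrupted run of `motif`."""
    if not motif:
        raise ValueError("motif must be a non-empty string")
    # One or more back-to-back copies of the motif.
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    longest = 0
    for match in pattern.finditer(sequence):
        repeats = len(match.group(0)) // len(motif)
        longest = max(longest, repeats)
    return longest

# The example from the text: ACT repeated four times.
print(count_tandem_repeats("ACTACTACTACT", "ACT"))  # 4
```

A mutation of the kind described here would show up as a different count, for instance three or five repeats of ACT instead of four.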
Such a change is a mutation, and the mutation rate of STRs is particularly high compared with other types of mutations in the genome. Several dozen STRs are found on the Y chromosome, which is present only in males and is passed in its entirety from father to son. In fact, paternity tests for boys, which compare a child with a presumed biological father, are based on comparing the STRs on the Y chromosome. The mutation rate of STRs is high enough that unrelated men rarely share the same profile, which makes it possible to determine whether the man is indeed the biological father.
Given the genome of an anonymous person, the algorithm maps the STRs on the Y chromosome and then checks the results against the online databases, with the aim of finding relatives. If a match of sufficient quality is found, it can be concluded that the two individuals are related on the father's side, and the algorithm determines that the surname of the anonymous person is the same as the surname recorded in the database.
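The matching step can be sketched roughly as follows. This is an illustration under stated assumptions, not the published algorithm: the article does not describe how match quality is scored, so the sketch simply counts markers whose repeat counts differ. The marker names (`M1` to `M4`), the toy database, the surnames and both thresholds are all hypothetical.

```python
from typing import Optional

# Hypothetical toy database: (surname, STR profile), where a profile maps
# a marker name to its repeat count. All values are invented.
DATABASE = [
    ("Smith", {"M1": 14, "M2": 23, "M3": 11, "M4": 30}),
    ("Jones", {"M1": 13, "M2": 25, "M3": 10, "M4": 28}),
    ("Smith", {"M1": 14, "M2": 23, "M3": 12, "M4": 30}),
]

def compare(a: dict, b: dict) -> tuple:
    """Return (number of differing markers, number of shared markers)."""
    shared = a.keys() & b.keys()
    mismatches = sum(1 for marker in shared if a[marker] != b[marker])
    return mismatches, len(shared)

def infer_surname(query: dict, database: list,
                  max_mismatches: int = 1, min_shared: int = 4) -> Optional[str]:
    """Return the surname of the closest profile, or None for "doesn't know"."""
    best_name, best_mismatches = None, None
    for surname, profile in database:
        mismatches, shared = compare(query, profile)
        if shared < min_shared:
            continue  # too few markers in common to judge
        if best_mismatches is None or mismatches < best_mismatches:
            best_name, best_mismatches = surname, mismatches
    if best_mismatches is None or best_mismatches > max_mismatches:
        return None  # no match of sufficient quality
    return best_name

print(infer_surname({"M1": 14, "M2": 23, "M3": 11, "M4": 30}, DATABASE))
print(infer_surname({"M1": 10, "M2": 20, "M3": 15, "M4": 35}, DATABASE))
```

The first query matches a database entry exactly and yields a surname. The second differs on every marker, so the sketch declines to guess, mirroring the "doesn't know" answer described in the article.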
The algorithm was tested on a sample of 900 men in the United States. The participants' Y chromosome data were checked against an online database containing the sequenced genomes of 135 people, which faithfully represents the distribution of surnames in the United States, mainly among people of European origin. In principle, the algorithm is not limited to a particular origin, but most of the genomes available today belong to people of European origin. In addition, for a variety of historical, social and economic reasons, the databases of the companies that offer such genetic tests are skewed toward European populations, which is why the algorithm was demonstrated on people of this origin.
"The algorithm accurately identified the last name of one out of eight subjects," says Prof. Halperin. That is, for one out of every eight subjects a high-quality match was found between the subject's Y chromosome and the Y chromosome found in the database, and the last names of the subject and the person in the database were identical. For most of the other subjects, the algorithm declared that it "doesn't know" the last name.
In another case, the researchers were able to trace the name and then the location of a specific person based on a combination of his Y chromosome, his age and the fact that he lives in California. They presented the algorithm with the Y chromosome data of the well-known geneticist Craig Venter, who published his entire genome online. The algorithm identified his last name, and after the name was cross-referenced with additional data - Venter's age and the fact that he lives in California - the search was narrowed down to only two people. The researchers were also able to identify, almost with certainty, a large Mormon family from Utah, based on the Y chromosomes of its sons.
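The narrowing-down step amounts to intersecting several pieces of information that are harmless on their own. The Python sketch below illustrates this with invented records. The `Record` type, the names, the birth years and the function `narrow_candidates` are all hypothetical, and the article does not say which public records the researchers actually searched.

```python
from dataclasses import dataclass

@dataclass
class Record:
    full_name: str
    birth_year: int
    state: str

# Invented records standing in for a public people-search listing.
RECORDS = [
    Record("Alex Example", 1950, "California"),
    Record("Sam Example", 1950, "California"),
    Record("Chris Example", 1950, "Utah"),
    Record("Pat Example", 1972, "California"),
    Record("Lee Sample", 1950, "California"),
]

def narrow_candidates(records, surname, birth_year, state):
    """Keep only the records that agree on surname, birth year and state."""
    return [
        r for r in records
        if r.full_name.split()[-1] == surname
        and r.birth_year == birth_year
        and r.state == state
    ]

candidates = narrow_candidates(RECORDS, "Example", 1950, "California")
print([r.full_name for r in candidates])
```

With these toy records the three filters together leave two candidates, even though each filter alone leaves several.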
Information in the service of science
"The identification technique we developed can have quite a few useful uses, such as locating relatives, identifying bodies in natural disasters, and more," says Prof. Halperin. "However, our research revealed a fundamental problem that requires attention: if a person publishes his genome on the Internet, even when this is done anonymously, his identity is quite exposed. And it should be remembered that we tested only one chromosome out of all the genetic information, which includes another 22 pairs of chromosomes and an X chromosome. The focus on the Y chromosome stems from its special connection to a family name (both the Y chromosome and the family name are passed down - in most societies - from father to son)."
"Despite this, it is important to note that we positively consider the sharing of genetic information in public databases, with consent of course. The sharing of information is essential for the advancement of science, and there are many benefits for users of these services. However, it is important that all entities related to the sharing of information, including the people whose data are in the databases, the scientists, and the entities that publish the information, be aware of the nature of the disclosure and exercise their considerations accordingly."
Dr. Yaniv Ehrlich points out that "an obvious conclusion from our research is that biometric databases can create unexpected situations. For example, who would have thought that surnames could be discovered from genetic information? That is why we believe that legislators should exercise extra caution when they plan to establish such databases."