The genetic code contains many synonymous 'words' - information theory may help explain the redundancies

New research suggests that there may be two other significant factors that natural systems weigh: the information-theoretic nature of the genetic code and the principle of maximum entropy

By: Subhash Kak, Professor of Electrical and Computer Engineering, University of Oklahoma

The genetic code in DNA. Illustration:

Almost all life, from bacteria to humans, uses the same genetic code. This code acts as a dictionary, translating genes into amino acids that are used to build proteins. The universality of the genetic code points to a common ancestor among all living things and the essential role this code plays in the structure, function and regulation of biological cells.

Understanding how the genetic code works is the foundation of genetic engineering and synthetic biology. But many mysteries remain, such as why the code matters for biological processes like protein folding.

As a researcher working at the interface between biology and physics, I apply information theory – the mathematics of how information is stored and communicated – to explore some of these interesting questions. Just as computers need strings of binary code to function, biological processes also rely on bits of information.

In my recent research, I suggest that optimization theory may explain a long-standing mystery about the redundancy in the way amino acids are encoded.


The genetic codebook consists of "words" written in an alphabet of four letters: A, C, G and U. Each of these letters represents a different chemical building block called a nucleotide: adenine, cytosine, guanine and uracil. A molecular machine called a ribosome reads the codebook to translate genes into proteins.

The ribosome reads three-letter words called codons, and there are 64 different possible combinations of the four letters that make different codons. In this list of 64 words, 61 encode amino acids, and 3 signal to the ribosome to stop protein synthesis in the cell. For example, "AUG" codes for the amino acid methionine and also indicates the beginning of a protein.
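The count of 64 follows from simple combinatorics: three positions, each with four possible letters, gives 4 × 4 × 4 words. A minimal Python sketch of that arithmetic is below. The names `BASES`, `STOP_CODONS`, `codons` and `sense_codons` are illustrative choices, and the three stop codons are those of the standard genetic code.

```python
from itertools import product

BASES = "ACGU"  # RNA alphabet: adenine, cytosine, guanine, uracil

# Every three-letter word over a four-letter alphabet: 4**3 = 64 codons.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# The three stop codons of the standard genetic code.
STOP_CODONS = {"UAA", "UAG", "UGA"}

# Codons that encode an amino acid rather than a stop signal.
sense_codons = [codon for codon in codons if codon not in STOP_CODONS]

print(len(codons))             # 64
print(len(sense_codons))       # 61
print("AUG" in sense_codons)   # True: AUG encodes methionine and marks the start
```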

But just like in any other language, there are synonymous phrases - different codons can code for the same amino acid. In fact, since there are only 20 amino acids but 61 different words to code for them, there is a lot of overlap. An amino acid can be coded for by between one and six different words. There are only two amino acids that have a single codon, methionine and tryptophan. This redundancy helps ribosomes perform their tasks correctly even when there is a typo in the genetic code.
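The uneven spread of synonyms can be tallied directly from the standard codon table. The sketch below packs the table into a 64-character string of one-letter amino acid codes, listed in the conventional U, C, A, G ordering with `*` marking a stop codon. The variable names are illustrative only, and the tally is a plain count, not the analysis from the research described here.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in the order
# UUU, UUC, UUA, UUG, UCU, ... GGG; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codon_table = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of synonymous codons per amino acid (stop codons excluded).
degeneracy = Counter(aa for aa in codon_table.values() if aa != "*")

# How many amino acids share each synonym count.
group_sizes = Counter(degeneracy.values())

print(codon_table["AUG"])                                     # M
print(sorted(aa for aa, n in degeneracy.items() if n == 1))   # ['M', 'W']
print(sorted(group_sizes.items()))  # [(1, 2), (2, 9), (3, 1), (4, 5), (6, 3)]
```

In the output, `M` and `W` are methionine and tryptophan, the two single-codon amino acids. No amino acid has exactly five codons, so the synonym counts are 1, 2, 3, 4 and 6.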

Engineering guidelines of nature

Why certain amino acids have more synonyms than others is a mystery that has troubled scientists for decades. Is there a pattern to this language, or is it random? To answer this question, scientists study the rules that guide nature's decision-making.

If a human engineer were designing the genetic code, they would want each amino acid to have a similar degree of redundancy, to protect against errors and promote uniformity. The 61 codons would then be shared out approximately equally among the 20 amino acids, with about three codons assigned to each.

But nature has other priorities. Evolutionary models of natural systems such as bacteria demonstrate that nature always strives for optimization. Not only should the final form of a protein be optimal, but so should its intermediate forms. Optimization ensures that natural systems can adapt to different environments.

Scientists understand some of the guidelines nature follows when designing the genetic code. For example, the spatial arrangement of atoms and molecules in and around the genetic code can affect its function, as can the co-evolution of other cellular structures involved in making proteins.

Information theory and genetics

My research suggests that there may be two other significant factors that natural systems weigh: the information-theoretic nature of the genetic code and the principle of maximum entropy.

Similar to the way a computer processes data made up of 0s and 1s, life processes the genetic code as data made up of the four letters A, C, G and U. Mathematically, however, the most energy-efficient way to represent data is not binary (base 2, using 0 and 1 as computers do) but base e. Short for Euler's number, e is an irrational number: its exact value cannot be written as a fraction or as a finite decimal, although it is approximately 2.718.
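One standard way to make the base e claim concrete is radix economy. Representing a number N in base b takes about log_b(N) digits, and if each digit position must distinguish b symbols, the total cost is proportional to b × log_b(N) = (b / ln b) × ln N. The factor b / ln b is smallest at b = e. The sketch below evaluates that factor for a few bases. Treating radix economy as the measure behind the efficiency claim is an assumption made for illustration, and `radix_cost` is a hypothetical helper name.

```python
import math

def radix_cost(base: float) -> float:
    """Return b / ln(b), the base-dependent factor of the radix-economy cost."""
    return base / math.log(base)

# Compare binary, base e, ternary and the four-letter genetic alphabet.
for base in (2, math.e, 3, 4):
    print(f"base {base:.3f}: relative cost {radix_cost(base):.4f}")
```

Under this measure, base 3 is the cheapest integer base, and base 4 (the size of the genetic alphabet) costs exactly as much as base 2, because 4 / ln 4 = 2 / ln 2.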

Nature's drive to optimize using this irrational number is responsible for the infinitely repeating fractals seen in jagged coastlines, fern leaves, snowflakes and trees. Beyond biology, information optimization using e also has applications in mathematics and cosmology.

Another principle operating in the natural world is that of maximum entropy. Entropy is a measure of disorder in a system, and the principle of maximum entropy states that systems evolve into states of increasing disorder. This principle allows researchers to draw conclusions from limited data and was used to explain how amino acids interact in proteins.
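Entropy can be given a number. The sketch below computes the Shannon entropy, in bits, of two distributions over the 20 amino acids. It assumes, purely for illustration, that every one of the 61 amino-acid codons is equally likely, so each amino acid's probability is proportional to its number of codons. It then compares the result with a perfectly even distribution, which has the largest possible entropy for 20 outcomes. This is not the calculation from the research described here, and the names `shannon_entropy` and `CODONS_PER_AMINO_ACID` are illustrative.

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Codons per amino acid in the standard genetic code (one-letter codes).
CODONS_PER_AMINO_ACID = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3,
    "C": 2, "D": 2, "E": 2, "F": 2, "H": 2, "K": 2, "N": 2, "Q": 2, "Y": 2,
    "M": 1, "W": 1,
}
total = sum(CODONS_PER_AMINO_ACID.values())  # 61 amino-acid codons

# Assumption: all 61 codons are equally likely.
codon_weighted = [n / total for n in CODONS_PER_AMINO_ACID.values()]
uniform = [1 / 20] * 20

print(f"codon-weighted: {shannon_entropy(codon_weighted):.3f} bits")
print(f"uniform:        {shannon_entropy(uniform):.3f} bits")  # log2(20)
```

The codon-weighted entropy comes out below log2(20), which puts a number on how far the real code departs from the equal-redundancy design.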

In the context of codon groups, the principle of maximum entropy implies that nature mixes up data as much as possible. This means that the function describing the distribution of codon groups should be mathematically hard to undo. Working out how to maximize the mathematical complexity of this function reveals potential patterns underlying the codon groups.

I believe that these two principles may help describe the distribution of codon groups in the genetic code and indicate the usefulness of mathematics in the analysis of natural systems. Although there are many biological mysteries that scientists have yet to solve, information theory can be a powerful tool to help decipher the genetic code.

Read the article in The Conversation

5 comments

  1. a question:
    Is it possible to prove that the codons must be in threes, and not fours for example?
