
The IBM research laboratory in Haifa leads a European project for the digitization of ancient texts

The IMPACT project is designed to develop tools and methods that will preserve valuable historical texts by digital means, make them searchable online and keep them accessible for future generations.

Scientists from the IBM research laboratory in Haifa are participating in a joint venture with more than twenty academic institutions and research bodies in Europe, to preserve documents from the 15th century onwards.

The research effort, funded by the European Community, led to the development of an optical character recognition (OCR) system that uses crowdsourcing to introduce a new way of processing historical texts. Today the challenge of digitizing texts concerns not only libraries and historical archives but any institution interested in preserving and converting old documents of historical value or business significance.

The digitization software developed in the IBM laboratory significantly reduces the need for expensive manual correction of scanned texts. Such correction is needed because historical documents use complex typefaces that today's software does not recognize, and because their vocabulary and language structure differ from modern usage. The crowdsourcing approach that IBM implements in the project allows large groups of volunteers scattered throughout Europe to contribute their time to verify the recognized text and correct recognition errors through an online web system. As these corrections are made, the system learns from them in order to achieve better recognition in the future.

Following the success of the first stages of the project, IBM and the European Community are expanding the collaboration to include national libraries, research institutions, universities and other business companies. Past digitization projects produced only static results in the form of online text libraries. The new, wide-ranging project will also offer tools and methods that will serve institutions all over Europe. These will allow them to keep producing accurate digital copies of historically important texts efficiently and to make them available to the general public, who will be able to search the contents and use them in studies and presentations.

The concept of crowdsourcing, on which the project is based, is gaining momentum in various content areas. Combining IBM's OCR technology with a crowdsourcing effort will make it possible for the first time to scan and digitize old and unusual typefaces, reducing the error rate by 35%.

Tal Drori, manager of the document processing group at the IBM research laboratory in Haifa, stated that "the IMPACT project not only provides central research bodies with a way to bring people closer to historical texts that were previously inaccessible to and unseen by the public; it also allows people to become part of the preservation effort themselves. This is the first digitization system that combines the power of the crowd and the community with adaptive optical recognition technology that can learn, correct its errors and handle texts created from the 15th century to the end of the 19th century."

Common OCR engines handle modern texts well. However, faded ink, the condition of the paper or scroll, and the special typefaces characteristic of old documents can lower recognition accuracy substantially, so the digitization results require extensive manual correction. "The only way to enable the digitization of historical material on a large scale is to improve the quality of the optical recognition of the text," says Drori.

The system developed in the IBM research laboratory allows volunteers from all over Europe to check the reliability of the processed text and correct recognition errors through a web-based system. To make the review process efficient, the system presents the reviewer not only with the scanned source page but with the exact word that requires close examination. For example, the English letters "r" and "n" appearing next to each other often cause a misreading, in which the computer assumes that "rn" is actually the letter "m". When the system reaches a point where the recognition is in doubt, it collects many such cases recognized as "m" from throughout the text and displays them together, next to the doubtful word. The reviewer can then judge the correct reading more easily and correct a large number of cases in a single operation.
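The grouping step described above can be illustrated with a minimal sketch. It is not IBM's implementation: the `Glyph` record, the confidence threshold and the function names are assumptions made for the example, which only shows the idea of gathering low-confidence characters that were recognized as the same letter so that one decision can correct them all.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Glyph:
    """One recognized character (hypothetical record, not from the source)."""
    word: str          # OCR output for the word that contains the character
    position: int      # index of the character inside the word
    char: str          # letter the OCR engine chose
    confidence: float  # engine confidence between 0 and 1 (assumed to exist)


def group_doubtful(glyphs, threshold=0.8):
    """Collect low-confidence glyphs, grouped by the letter they were read as."""
    groups = defaultdict(list)
    for glyph in glyphs:
        if glyph.confidence < threshold:
            groups[glyph.char].append(glyph)
    return dict(groups)


def apply_correction(group, replacement):
    """Apply one reviewer decision to every glyph in a group."""
    return [
        g.word[:g.position] + replacement + g.word[g.position + 1:]
        for g in group
    ]


glyphs = [
    Glyph("bum", 2, "m", 0.55),  # doubtful: probably "burn"
    Glyph("com", 2, "m", 0.60),  # doubtful: probably "corn"
    Glyph("man", 0, "m", 0.95),  # confident, not shown to the reviewer
]
doubtful = group_doubtful(glyphs)
corrected = apply_correction(doubtful["m"], "rn")
```

In this sketch the reviewer sees the two doubtful "m" readings side by side and replaces both with "rn" in one step, while the confidently recognized "m" in "man" is left alone.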

When there is doubt about the recognition of a whole word, the system adds it to a pool of unclear words, which is displayed in alphabetical order. The volunteers helping the project accept or reject the system's suggestions for these words, each with a single keystroke. In addition, the system can expand its vocabulary: new words are added to its internal dictionary based on the confirmations and corrections received from different users.
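The word pool and the growing dictionary can be sketched in the same hedged way. The `ReviewPool` class and its methods are hypothetical names chosen for illustration, and the rule that a rejected suggestion may be replaced by a reviewer's correction is an assumption; the article states only that words are accepted or rejected and that the dictionary expands from user input.

```python
class ReviewPool:
    """Pool of doubtful words awaiting an accept/reject decision (illustrative)."""

    def __init__(self, dictionary):
        self.dictionary = set(dictionary)  # words the system already trusts
        self.pending = {}                  # suggested word -> occurrence count

    def add(self, suggestion):
        """Queue a suggestion unless the dictionary already contains it."""
        if suggestion in self.dictionary:
            return False
        self.pending[suggestion] = self.pending.get(suggestion, 0) + 1
        return True

    def queue(self):
        """Return the pending words in alphabetical order, ignoring case."""
        return sorted(self.pending, key=str.casefold)

    def review(self, suggestion, accept, correction=None):
        """Record one decision and teach the dictionary the resulting word."""
        self.pending.pop(suggestion)
        final = suggestion if accept else correction
        if final is not None:
            self.dictionary.add(final)
        return final


pool = ReviewPool({"the"})
pool.add("vniuersall")
pool.add("Maiestie")
order = pool.queue()
accepted = pool.review("Maiestie", accept=True)
```

Once "Maiestie" has been accepted, a later call to `pool.add("Maiestie")` returns `False`, so the same historical spelling is not sent back to the volunteers.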

The bodies participating in the IMPACT project include, among others, the national libraries of the Netherlands, Great Britain, France, Austria, Germany and Spain, the Bavarian State Library, the University Library of Göttingen, the Netherlands Institute of Linguistics, the University of Munich, the University of Bath and the supercomputing center in Poznań, Poland.