Clean the reservoirs

Researchers have developed algorithms that detect errors in databases

Big data handling. Courtesy of Prof. Benny Kimelfeld, Technion

How can organizations such as hospitals, banks, universities, government offices and media outlets store huge amounts of information and retrieve data from it when needed? The challenge of managing very large data sets ("big data") in databases has become a central concern in information technology. A database is an organized collection of data that is stored electronically in a computer system and is usually controlled by a management system built on algorithms for processing and retrieving data. The management system lets users enter data, edit, update and maintain it, and retrieve it when necessary.

Today every organization that holds information uses a database. Users define the data held in the management system (for example, personal details of patients and staff members, doctors' areas of specialization, items of medical equipment and the number of rooms in each department) and run queries (for example: what is a given doctor's field of specialization and ID number, which patients received a certain treatment, and which treatment each doctor is engaged in at any given moment). In addition, constraints are defined that help organize and manage the information (for example, each patient has a single, plausible year of birth and at least one responsible doctor, and each doctor is registered in the system and takes part in only one treatment at any given time). The constraints prevent inconsistent data entry and reduce the chance of errors. When an error does occur (for example, an ID number that appears twice), the management system warns about it.
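The idea of constraints that make the management system warn about bad entries can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module; the table name, column names, sample rows and the range chosen for a "plausible" birth year are all hypothetical and are not taken from any real hospital system.

```python
import sqlite3

def insert_patients(rows):
    """Insert rows into a hypothetical patients table and return the rejected ones."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE patients ("
        " id_number TEXT PRIMARY KEY,"  # constraint: each ID number appears once
        " name TEXT NOT NULL,"
        # constraint: an assumed range for a plausible year of birth
        " birth_year INTEGER NOT NULL CHECK (birth_year BETWEEN 1900 AND 2100))"
    )
    rejected = []
    for row in rows:
        try:
            conn.execute("INSERT INTO patients VALUES (?, ?, ?)", row)
        except sqlite3.IntegrityError as err:
            # The database refuses the row and reports which constraint failed.
            rejected.append((row, str(err)))
    conn.close()
    return rejected

rejected = insert_patients([
    ("111", "Dana", 1980),
    ("111", "Avi", 1975),   # same ID number as the first row
    ("222", "Noa", 1800),   # birth year outside the assumed range
])
for row, reason in rejected:
    print(row, "->", reason)
```

In this sketch the first row is stored, while the second and third are rejected, one for a duplicate ID number and one for an implausible year of birth.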

There are several database models that express different types of relationships between the data. The most common is the tabular (relational) model, in which the data is arranged in tables. Each table holds information about a certain entity (for example, patients) and consists of records, or rows. Each record contains a series of values (e.g., the patient's ID number). Thus any user of the system can access the data with relative ease, manage and organize it, and ask questions about it in a simple language, without having to be a computer expert.
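As a small illustration of tables, records and queries, the sketch below builds two hypothetical tables with Python's built-in sqlite3 module and asks which patients received a certain treatment. The table names, column names and data are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (id_number TEXT PRIMARY KEY, name TEXT);
CREATE TABLE treatments (patient_id TEXT, treatment TEXT);
INSERT INTO patients VALUES ('111', 'Dana'), ('222', 'Noa'), ('333', 'Avi');
INSERT INTO treatments VALUES
    ('111', 'physiotherapy'), ('222', 'dialysis'), ('333', 'physiotherapy');
""")

# Query: which patients received a given treatment?
cursor = conn.execute(
    "SELECT p.name FROM patients p "
    "JOIN treatments t ON t.patient_id = p.id_number "
    "WHERE t.treatment = ? ORDER BY p.name",
    ("physiotherapy",),
)
names = [row[0] for row in cursor]
print(names)
conn.close()
```

The query states what is wanted (names of patients with a matching treatment record) and leaves it to the database system to work out how to retrieve it.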

"The main purpose of databases is to provide a computational language that faithfully represents the information and makes it accessible to the user. In essence, this is an application in computing of mathematical logic, with its long-established mathematical foundations," says Prof. Benny Kimelfeld of the Faculty of Computer Science at the Technion, who studies information management in organizations, and databases in particular. In his research he examines how information can be stored in databases and used efficiently - for example, posing queries, computing their results and making sure the results are consistent.

Today, organizations work with huge amounts of information, and the chance of inconsistency grows because the information is collected directly from many end users and from sources available on the Internet, such as social networks, scientific publications and encyclopedias.

Therefore, researchers in the field are trying to develop advanced methods for collecting and storing information in database systems, and especially for dealing with the errors that may arise. This is what Prof. Kimelfeld and his team set out to do in their latest study, which won a research grant from the National Science Foundation. According to him, "Sometimes the constraints are violated, so the information is not always properly organized, and errors and contradictions arise in it. For example, a researcher's doctorate completion year may appear twice - once as 2008 and once as 2005 - or the name of a clinic may appear where the name of a department should be. The information may be entered with errors by the users in the first place, and errors may also be introduced by the data collection methods in use. Therefore, advanced methods are required to correct the errors and constraint violations and to clean the information. The ability to measure the inconsistency in the information and to clean it allows data scientists to assess its quality, and the measure should be easy to compute in a reasonable time."

To examine to what extent the information in a database is incorrect, contradictory or inconsistent, and to correct it, the researchers developed algorithms that measure the inconsistency and assign responsibility for errors to individual records. These algorithms are based on tools and definitions from economics, game theory (a branch of mathematics that analyzes situations of conflict or cooperation between decision makers with different desires), and data science. The researchers implemented the algorithms in database systems, where they treated the records as agents cooperating in the creation of a certain value, and were thus able to identify the main suspects in causing the errors observed in the system.
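One way to picture records as cooperating agents is the sketch below. The article does not name the exact measure or attribution rule the researchers used, so this example makes two assumptions: inconsistency is measured as the number of record pairs that give the same researcher different doctorate years, and responsibility is shared out with the Shapley value, a standard tool from cooperative game theory. The records and function names are hypothetical, and the brute-force enumeration is only practical for a handful of records.

```python
from fractions import Fraction
from itertools import combinations, permutations

def violations(records):
    """Count pairs of records giving the same researcher different doctorate years."""
    return sum(
        1 for a, b in combinations(records, 2)
        if a[0] == b[0] and a[1] != b[1]
    )

def shapley(records):
    """Exact Shapley value of each record with respect to violations().

    Averages, over every insertion order, how much each record raises the
    violation count at the moment it is added.
    """
    n = len(records)
    totals = [Fraction(0)] * n
    orders = list(permutations(range(n)))
    for order in orders:
        present, before = [], 0
        for i in order:
            present.append(records[i])
            after = violations(present)
            totals[i] += after - before  # marginal contribution of record i
            before = after
    return [total / len(orders) for total in totals]

# Hypothetical data: (researcher, doctorate completion year)
records = [("Cohen", 2008), ("Cohen", 2005), ("Levi", 2010), ("Cohen", 2008)]
scores = shapley(records)
for record, score in zip(records, scores):
    print(record, score)
```

In this toy data the record with the year 2005 conflicts with both 2008 records, so it receives the largest share of the responsibility, while the unrelated record receives none. The shares add up to the total inconsistency, which is a general property of the Shapley value.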

Life itself:

Prof. Benny Kimelfeld, 46, is married with two children (15 and 12). He grew up in Ashdod and currently lives in Haifa. He likes to run and plays the bass (he is a member of the Technion's faculty band).
