
Walls of noise to protect privacy

Researchers have developed methods that mask data in databases to preserve privacy while keeping the results statistically accurate

Illustration: big data

The era of big data raises many privacy questions, because data about individuals has become so easy to collect and distribute. Differential privacy is a method that makes it possible to publish information about internet users without harming their privacy, and it is used by companies such as Apple, Google and Microsoft. The idea is to add a random component to the algorithms, obscuring the identity and data of individuals. In this way, user data collected from devices such as smartphones, iPads and laptops can be added to an aggregated database covering many users, and characteristics of the entire user population can be learned from it while minimizing the harm to the privacy of each individual.
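The "random component" mentioned above can be illustrated with the classic Laplace mechanism from the differential-privacy literature. The sketch below is an illustration of the general technique, not the specific systems used by the companies named in the article; the data and function names are hypothetical.

```python
import random

def dp_count(data, predicate, epsilon):
    """Return a differentially private count of records matching `predicate`.

    A counting query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so Laplace noise with scale 1/epsilon yields an
    epsilon-differentially-private answer.
    """
    true_count = sum(1 for row in data if predicate(row))
    scale = 1.0 / epsilon
    # Laplace(0, b) noise, sampled as the difference of two exponentials
    # with mean b (a standard identity, keeps this stdlib-only).
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return true_count + noise

# Hypothetical data: ages collected from users' devices.
ages = [23, 35, 41, 29, 52, 38, 47, 31, 26, 44]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=0.5)
```

Each published answer is slightly wrong, but over a large population the aggregate statistics remain accurate, while no single individual's presence in the data can be confidently inferred.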

What is the question? How can data privacy be maintained without sacrificing accuracy?

Prof. Katrina Ligett of the School of Engineering and Computer Science at the Hebrew University and her team work on data privacy and develop mathematical models (algorithms, theorems and formulas) through which information about people can be used safely. According to her, "In our research, we develop tools with which sensitive data can be extracted from databases and published (for example, data on health, medicine, and Google searches), in order to make it accessible to researchers, and at the same time we try to understand what the limitations are and how people can be protected. Ultimately, it is impossible to release information from a database without affecting the privacy of the individuals in it, but the language of differential privacy makes it possible to control the level of the privacy violation, reduce it, and balance the competing needs."

In addition, Prof. Ligett and her team work on algorithmic fairness, a field of research that aims to identify and correct biases and errors made by algorithms that contribute to discrimination. For example, banks that screen applications with algorithms may reject more mortgage and loan applications from women, and medical systems may miss skin cancer in dark-skinned patients. This can happen even if the developers did not intend it, because of biases in the data used to train the algorithm.

"Similar to the definition of privacy, it is possible to define mathematically, for the algorithms that analyze data, what fairness (the desired goal) is, and thus make decisions more precisely and prevent injustices. For example, the Bank of Israel holds a database it has accumulated of the population's credit data, and private companies can scan it with algorithms to determine credit ratings. If we are not careful, they may identify participants in the database by marital status, origin, residence, hometown, or income level, even if ID numbers have been deleted, and leak sensitive information about them such as their financial situation. Beyond the privacy problem, the resulting decisions, such as giving women a low credit rating, may be discriminatory. Therefore, in our research, we try to understand how algorithms can be changed so that they reduce the harm to privacy and identify fairness problems, and we develop tools and mathematical measures to filter and mask the data," explains Prof. Ligett.
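One common way to make fairness mathematically precise, as the quote describes, is a statistical criterion such as demographic parity: the rate of favorable decisions should be similar across groups. This is only one of several competing formal definitions, and the example below, with its made-up loan decisions, is an illustration rather than the team's own measure.

```python
def demographic_parity_gap(decisions, groups):
    """Largest difference in positive-decision rates between groups.

    decisions: list of 0/1 outcomes (e.g. 1 = loan approved)
    groups:    parallel list of group labels (e.g. "A"/"B")
    A gap near 0 means the algorithm approves all groups at similar rates.
    """
    rates = {}
    for g in set(groups):
        outcomes = [d for d, gg in zip(decisions, groups) if gg == g]
        rates[g] = sum(outcomes) / len(outcomes)
    return max(rates.values()) - min(rates.values())

# Hypothetical loan decisions for two demographic groups.
decisions = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
groups    = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]
gap = demographic_parity_gap(decisions, groups)  # 0.8 - 0.4 = 0.4
```

A large gap like this would flag the decision rule for closer inspection, even when sensitive attributes were never used explicitly.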


Another field Prof. Ligett and her team work in is adaptive data analysis - using information in a way that represents the real world. According to Moshe Shenfeld, a doctoral student on Prof. Ligett's team, "In many cases, researchers use data on small population groups (samples) to answer a series of research questions, each chosen based on the answers to the previous questions from the same sample, and this may lead to conclusions that do not represent the rest of the population. To prevent this, we add noise (a disturbance) to the algorithms that 'hides' the sample, which reduces the chance of choosing questions on which the sample is not representative, thus preserving statistical accuracy. For example, when researchers scan data to find out what percentage of people contract a certain disease, the results they get on the effect of variables such as age, height and weight may lead them to a hypothesis that combines the variables in a way that happens to fit the sample but not the rest of the population. When proportional noise is added to the algorithm, it returns a slightly different percentage, which slightly affects the accuracy of each answer but significantly reduces the chance of an unrepresentative hypothesis."
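The mechanism Shenfeld describes can be sketched as answering each statistical query on the sample with a small amount of noise, so the analyst's next question cannot latch onto the sample's quirks. The sample records and noise scale below are hypothetical, chosen only to illustrate the technique.

```python
import random

def noisy_query_answer(sample, query, scale):
    """Answer an average-style query on the sample with Laplace noise.

    The noise 'hides' the exact sample, so an analyst who adaptively
    chooses the next question based on this answer is less likely to
    overfit to this particular sample.
    """
    exact = sum(query(x) for x in sample) / len(sample)
    # Laplace(0, scale) noise as a difference of two exponentials.
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return exact + noise

# Hypothetical sample: (age, has_disease) records.
sample = [(25, 0), (40, 1), (33, 0), (58, 1), (47, 1), (30, 0)]
# Fraction of people over 35 who have the disease, slightly perturbed.
answer = noisy_query_answer(
    sample, lambda r: 1.0 if r[0] > 35 and r[1] else 0.0, scale=0.05
)
```

Each individual answer is off by a few percentage points, but a long adaptive sequence of such answers remains statistically valid far longer than exact answers would.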

The researchers' latest study, which won a grant from the National Science Foundation, aims to give stronger guarantees on the accuracy of the results. "The amount of noise required to mask the credit rating of billionaires, for example, is much higher than for others, but the chance that they appear in a sample of the general population is low. To guarantee privacy, every participant must be masked; to preserve accuracy, we added a smaller amount of noise to the algorithms. This way it is possible to obtain results that represent the entire population," explains Prof. Ligett.

Life itself:

Prof. Katrina Ligett was born and raised in New Hampshire, USA ("I grew up in the woods, and that's where my interest in privacy comes from. There are no people, there are trees"), and currently lives in Jerusalem.

Moshe Shenfeld was born and raised ultra-Orthodox and studied at the Hebron Yeshiva - Knesset Yisrael, and later left religious life ("I have always pursued the truth, and it seems I found it in mathematics"). He currently lives in Jerusalem.
