Detect biases already at the data stage, before they reach the algorithm

Dr. Yuval Moskowitz, a researcher at the Faculty of Computer and Information Sciences at Ben-Gurion University, develops algorithmic tools for identifying and correcting biases in data-based systems, using Data Provenance, the "curriculum vitae" of the data throughout its processing.

Responsible Data Management is a field that deals with the fairness, transparency, accountability, and reliability of data processing. Illustration: depositphotos.com

Yuval Moskowitz, a researcher at the Faculty of Computer and Information Sciences at Ben-Gurion University of the Negev, conducts research, supported by a grant she received from the Israel Science Foundation (ISF), on one of the burning issues in the computing world today: how to identify biases in data-based systems, and how to reduce them before they are “locked in” within models and algorithms. According to her, the key point is not only the model itself, but the entire chain of data processing: from the moment the data is collected, through cleaning and filtering stages, to the point where it is fed into model training or computation.

Moskowitz works in the field of Data Management, and in particular in the area known as Responsible Data Management. This is a field that deals with the fairness, transparency, accountability, and reliability of data processing. Instead of asking only “Is the model fair?”, she also asks “What happened to the data along the way?”, “At what point did the bias arise?”, and “Could it have been stopped earlier?”

This need arose from the accumulation of well-known cases of algorithmic discrimination. Moskowitz cites the widely reported example of a resume-screening system developed at Amazon, which learned to favor male candidates and discriminate against women. According to her, the bias does not always appear directly and explicitly. Sometimes the algorithm learns indirect connections: an educational institution, a residential address, or other characteristics that are historically linked to certain groups. In other words, even without “seeing” gender or ethnicity explicitly, the system may recover discriminatory patterns from the data.
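To make the proxy effect concrete, here is a minimal sketch with purely hypothetical data (and a standard scikit-learn classifier, not the Amazon system): gender is never an input, yet a correlated feature, the educational institution, is enough for the model to reproduce a historical pattern.

```python
# Minimal sketch of proxy bias: hypothetical data, not the Amazon system.
from sklearn.linear_model import LogisticRegression

# Historical hiring data. Suppose institution 0 was mostly attended by one
# group and institution 1 by another, and past decisions favored institution 0.
X = [[0], [0], [0], [0], [1], [1], [1], [1]]  # single feature: institution
y = [1, 1, 1, 0, 0, 0, 0, 1]                  # past hiring decisions

model = LogisticRegression().fit(X, y)

# The model scores institution-0 candidates higher, and with them the group
# historically concentrated there, even though gender was never an input.
print(model.predict_proba([[0], [1]])[:, 1])
```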

The question is: How do you identify biases in data-driven systems, and how do you reduce them before they become “locked in” within models and algorithms?

The problem starts long before the model

One of the key insights in Moskowitz’s research is that bias can be created long before the training phase. Data goes through countless operations along the way: filtering, selecting, cleaning, joining tables, various queries, and structural changes. Each such operation may, for example, change the representation of groups in the population. Thus, even if the final algorithm is ostensibly “neutral,” it may receive input that has already been distorted.
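A toy illustration of this point, with hypothetical data rather than anything from Moskowitz's work: a single, seemingly neutral filtering step shifts the representation of groups before any model ever sees the data.

```python
# Sketch: a "neutral" cleaning step changes group representation.
import pandas as pd

# Hypothetical applicants: group membership plus a feature used for filtering.
# Group B's experience values are systematically lower for historical reasons.
df = pd.DataFrame({
    "group": ["A"] * 50 + ["B"] * 50,
    "years_experience": list(range(50)) + [x // 2 for x in range(50)],
})

def share_of_group_b(frame: pd.DataFrame) -> float:
    """Fraction of rows belonging to group B."""
    return (frame["group"] == "B").mean()

print(f"Before filtering: group B share = {share_of_group_b(df):.2f}")  # 0.50

# A seemingly neutral step: keep applicants with at least 10 years' experience.
filtered = df[df["years_experience"] >= 10]

print(f"After filtering:  group B share = {share_of_group_b(filtered):.2f}")  # 0.43
```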

Her goal, therefore, is to identify biases along the way and, where possible, to correct them in a controlled manner with minimal change. That is, not to “break” the original calculation, but to make small adjustments that restore a fairer representation without changing the essence of the computation.

Provenance: A concept from art that has moved into the world of data

The central technique that Moskowitz relies on is called Data Provenance. The concept of Provenance originates in the world of art and archaeology, where it involves documenting the history of a work of art: who created it, through which hands it passed, what its origin is, and what its context is. This documentation affects both trust in the work's authenticity and its value.

In the world of data, Data Provenance is a kind of “resume” of the data: its processing path, the operations performed on it, and its dependencies on other data. This information serves as computational metadata, allowing us to understand not only what came out in the end, but also why.
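As a rough sketch of the idea (a toy data structure, not an actual provenance system), imagine each row carrying a record of the operations it has passed through, so that any output row can explain why it is there:

```python
# Toy provenance tracking: each surviving row records the operations
# that produced it, so the output can be traced back to its origins.
from dataclasses import dataclass, field

@dataclass
class Row:
    data: dict
    provenance: list = field(default_factory=list)  # processing history

def provenance_filter(rows, predicate, label):
    """Filter rows, stamping each survivor with the operation that kept it."""
    out = []
    for row in rows:
        if predicate(row.data):
            row.provenance.append(label)
            out.append(row)
    return out

rows = [Row({"name": "a", "gpa": 92}), Row({"name": "b", "gpa": 78})]
result = provenance_filter(rows, lambda d: d["gpa"] >= 85, "filter: gpa >= 85")

for row in result:
    # The provenance explains which steps produced this output row.
    print(row.data, "<-", row.provenance)
```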

According to Moskowitz, the great advantage of Provenance is that it allows you to effectively identify the points where bias is created, and in some cases also calculate the smallest change required to improve the representation in the output.

The scholarship example: How to change a query “as little as possible,” but improve representation

To illustrate the idea, Moskowitz gives a simple example that does not rely on machine learning: a system that awards scholarships to students, selecting those eligible based on criteria that seem entirely relevant, such as a high grade point average and studies in STEM fields. On the surface, this sounds fair. But if such fields were less accessible to certain groups in the past, those groups are less represented in them, and will therefore appear less often on the list of scholarship recipients. In other words, even when the selection method seems neutral, it may reflect gaps that already exist in the data.

This is where her algorithm comes in: if you define a representation constraint (for example, a goal of more balanced representation of certain groups), you can look for a minimal change to the query so that the output meets the goal, or at least comes close to it, without fundamentally changing the intent of the original query. Instead of repeatedly running many variations of the query on the data (a computationally expensive process), Provenance makes it possible to quickly identify the smallest and most effective changes.
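A deliberately simplified sketch of such a search (the data, thresholds, and constraint are all illustrative, and the brute-force loop is exactly the expensive re-running that provenance is meant to avoid):

```python
# Sketch: find the smallest change to a query threshold that satisfies
# a representation constraint on the output. Illustrative data only.
import pandas as pd

students = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "B", "A"],
    "gpa":   [95,  91,  88,  84,  83,  90,  79,  93],
})

ORIGINAL_THRESHOLD = 90   # the original query: gpa >= 90
MIN_SHARE_B = 0.4         # representation constraint on the output

def satisfies_constraint(threshold: float) -> bool:
    selected = students[students["gpa"] >= threshold]
    return len(selected) > 0 and (selected["group"] == "B").mean() >= MIN_SHARE_B

# Candidate thresholds ordered by distance from the original query, so the
# first satisfying candidate is the minimal change. A provenance-based
# approach would find this without re-running the query for every candidate.
candidates = sorted(set(students["gpa"]) | {ORIGINAL_THRESHOLD},
                    key=lambda t: abs(t - ORIGINAL_THRESHOLD))

for t in candidates:
    if satisfies_constraint(t):
        print(f"Minimal change: gpa >= {t} (original: gpa >= {ORIGINAL_THRESHOLD})")
        break
```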

For her, this is exactly the important balance: on the one hand, maintaining the original goal of the calculation, and on the other, preventing a situation in which the result retains an unwanted bias simply because of the way the data was processed.

Mathematics, algorithms, and what can be solved efficiently

Moskowitz describes a research process that begins with a real-world problem, proceeds to a formal mathematical formulation, and only then to the development of an algorithmic solution. She examines when a problem can be solved accurately and efficiently, and when approximations or heuristics must be sufficient.

In cases where there is an exact solution, it is also possible to prove mathematically that the algorithm is indeed correct. In more complex cases—for example, ranking problems—it is not always possible to construct an efficient Provenance in the same way, and then approximate solutions are required. The advantage of such solutions is that they are sometimes easier to run on large data sets, but the price is the difficulty of guaranteeing how close they are to the ideal solution.

""Fairness" is not one thing

Moskowitz emphasizes that algorithmic fairness is not a single concept. There are many definitions of fairness, and sometimes they even contradict each other. Therefore, she does not claim that there is “one right definition,” but suggests a practical approach: choose a definition that fits the context, and then develop an algorithm that ensures compliance with the established conditions.
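A small illustration of why the choice matters, using made-up numbers: the same decisions can satisfy one common definition (demographic parity, i.e., equal selection rates) while violating another (equal opportunity, i.e., equal true-positive rates).

```python
# Sketch: two fairness definitions evaluated on the same hypothetical
# decisions, agreeing on one criterion and disagreeing on the other.
def selection_rate(decisions):
    """Fraction of individuals receiving a positive decision."""
    return sum(d for d, _ in decisions) / len(decisions)

def true_positive_rate(decisions):
    """Among truly qualified individuals, the fraction selected."""
    positives = [d for d, label in decisions if label == 1]
    return sum(positives) / len(positives)

# (decision, true_label) pairs per group; the numbers are hypothetical.
group_a = [(1, 1), (1, 1), (0, 0), (0, 0)]
group_b = [(1, 1), (1, 0), (0, 1), (0, 0)]

# Demographic parity holds: both groups have a 0.5 selection rate.
print(selection_rate(group_a), selection_rate(group_b))          # 0.5 0.5
# Equal opportunity fails: qualified members of group B are selected less.
print(true_positive_rate(group_a), true_positive_rate(group_b))  # 1.0 0.5
```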

In other words, her research does not decide the ethical question of what “true” fairness is, but rather provides computational tools that can serve a pre-selected policy or definition.
