Big data and data protection: statistics-based solutions

11 Feb 2025

Statistician Jörg Drechsler studies methods for the secure use of digital data.

Professor of Statistics with a Focus on Labor Market Research at LMU | © LMU/Stephan Höck

Life is a mountain of data that is constantly growing. From the blood pressure measured by our family doctor to the social insurance contributions deducted from our salaries: everything is recorded digitally. Then there are the digital traces left by the users of smartphones and other digital devices. LMU researcher Jörg Drechsler studies how such data can be used securely without violating the privacy of individuals.

This big data contains information that can help answer many research questions. “Particularly in medical contexts, restricting the access of researchers to data can cost lives,” says Drechsler. The basic challenge is to balance the trade-offs: “On the one hand, it’s about allowing data to be used to obtain socially valuable insights, while on the other we must ensure that people’s privacy is protected.”

“Look elsewhere” – the path to becoming a researcher

Since April 2024, Jörg Drechsler has been Professor of Statistics with a Focus on Labor Market Research at LMU. “It’s wonderful to be back,” he says. Previously, he had held teaching assignments at LMU and was therefore already familiar with the Department of Statistics, where he values the “pleasant and open collegial atmosphere.”

Becoming a professor of statistics one day could not have been further from Drechsler’s mind as a young student. He began studying piano in 1999 and after a few years took up a course in economics alongside his piano studies. “That was a big step for me,” he recalls. “Ultimately, it was because I’d developed a movement disorder in my left hand – a condition that typically affects people who exercise fine motor skills in their work, like musicians and goldsmiths.” The doctor who eventually diagnosed the condition advised him: “Look elsewhere.”

“Initially, I had no real idea what to do,” Drechsler remembers. After seeking out career coaching, which involved a battery of different tests, he chose economics. “The adviser actually recommended that I study statistics, but I didn’t take up that advice at first.” How right the adviser was would only become apparent after his studies. For now, Drechsler studied economics at the University of Erlangen from 2001 to 2006.

In 2006, he started a research fellowship at the Institute for Employment Research (IAB). “Through my work at IAB, I drifted ever further in the direction of pure statistics. Now, I don’t call myself an economist anymore, but a statistician. After all, I don’t carry out applied research related to the labor market, but focus on statistical methods that help to improve the quality of the collected data.” In 2009, he completed a doctorate at the University of Bamberg while continuing his research at IAB. This was followed by a three-month visit to Duke University in North Carolina as a postdoc. “From that moment, it was clear to me that I wanted to stay in research.” Alongside his role at IAB, he obtained his habilitation degree at LMU in 2015.

Drechsler has been Head of the Department for Statistical Methods (KEM) at IAB since 2022. The endowment of a chair has allowed him to combine his activities at IAB and LMU, giving him one course to teach per semester. Drechsler appreciates that statistics is not a subdiscipline in the department, but that the students “are here exactly because they want to learn about statistics, which makes teaching particularly enjoyable.”

Clearing personal traces to make data usable

Together with his team at KEM, Drechsler deals with all questions relating to data quality, so that researchers can evaluate maximally reliable data when conducting their studies. His own main research focus, however, is data privacy and data confidentiality. Drechsler is an expert in methods that allow information from digital sources to be used without revealing too much about the people behind the data.

It is not enough, for example, to just strip the names and addresses from the rest of the data, a process known as pseudonymization. “The problem is that you can easily identify people from other characteristics.” Researchers with access to individual data must therefore go a step further and aggregate certain information, such as by providing ages in five-year intervals instead of reporting exact ages, or releasing income information only up to a threshold. “That being said, we’ve come to the realization in recent years that oftentimes even that’s no longer sufficient in an era of ubiquitously available data, unless you aggregate very coarsely, which means losing a lot of content.”
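The two aggregation steps mentioned above can be sketched in a few lines of code. This is an illustrative sketch only: the function names, the five-year bin width, and the income cap are example choices, not taken from Drechsler’s work.

```python
def coarsen_age(age: int, width: int = 5) -> str:
    """Replace an exact age with an interval such as '20-24'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def top_code(income: float, cap: float = 150_000) -> float:
    """Report income only up to a threshold (top-coding)."""
    return min(income, cap)

print(coarsen_age(23))    # exact age 23 becomes the interval '20-24'
print(top_code(250_000))  # an income above the cap is reported as the cap
```

Even after such coarsening, rare combinations of characteristics (say, an unusual occupation in a small municipality) can still single a person out, which is exactly the limitation Drechsler points to.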

An alternative approach is to create synthetic data. This is done by developing models based on the original data that only reproduce the structure of these data. These models are used to generate new, synthetic data, which correspond to the original data in their relationships, but which contain no direct information about the individuals included in the original data. Questions such as the influence of school education on income can be answered using synthetic data.
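A minimal sketch of this idea: fit a simple model to the original records, here a toy linear relationship between years of education and income with Gaussian residuals, and then draw entirely new records from that model. Real synthetic-data generators use far richer models; the variable names and distributional assumptions below are illustrative only.

```python
import random
import statistics

def synthesize(records, n, seed=0):
    """Fit a toy model (income ~ a + b * education, Gaussian residuals)
    to the original records, then sample n new, synthetic records."""
    rng = random.Random(seed)
    edu = [e for e, _ in records]
    inc = [y for _, y in records]
    mean_e, mean_y = statistics.mean(edu), statistics.mean(inc)
    # ordinary least-squares slope and intercept
    b = sum((e - mean_e) * (y - mean_y) for e, y in records) / \
        sum((e - mean_e) ** 2 for e in edu)
    a = mean_y - b * mean_e
    resid_sd = statistics.stdev([y - (a + b * e) for e, y in records])
    sd_e = statistics.stdev(edu)
    # draw new education values, then incomes from the fitted model
    synthetic = []
    for _ in range(n):
        e = rng.gauss(mean_e, sd_e)
        y = a + b * e + rng.gauss(0, resid_sd)
        synthetic.append((round(e, 1), round(y)))
    return synthetic
```

The synthetic records preserve the fitted relationship between education and income, so an analysis run on them gives similar answers, but no row corresponds to a real person in the original data.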

Protecting privacy in the context of AI

In the sphere of artificial intelligence, which of course is trained on large data volumes, the treatment of sensitive information is also an issue. “There’s the notion that data can be protected if we only use the model created by the AI. But the problem is, the AI has stored much of the original information. And repeatedly we find cases where clever prompts yield individual information in relation to the data on which an AI was trained,” explains Drechsler.

This is where the concept of differential privacy – another area of research in which Drechsler is involved – comes in. Differential privacy protects outputs, such as those generated by an AI model, by infusing random noise such that it becomes virtually impossible to make inferences about the underlying individual records in the data. “Differential privacy is a very interesting method, because it offers a mathematical guarantee for data protection,” says Drechsler. However, it also requires a lot of research to ensure that the results are not changed so much that nothing can be learned from them anymore.
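The classic building block behind this guarantee is the Laplace mechanism: a statistic whose value can change by at most a known sensitivity when one person’s record is added or removed receives noise drawn from a Laplace distribution with scale sensitivity/ε. The sketch below illustrates it for a simple count (sensitivity 1); the function names are my own, and the Laplace draw is generated as the difference of two exponential variates.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """A Laplace(0, scale) draw: the difference of two independent
    exponential variates with mean `scale` has exactly this distribution."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(true_count: int, epsilon: float, seed=None) -> float:
    """Release a count with epsilon-differential privacy.
    A count has sensitivity 1: one person changes it by at most 1."""
    rng = random.Random(seed)
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

A smaller ε gives a stronger privacy guarantee but a noisier answer, which is precisely the utility trade-off Drechsler describes.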

As such, the question Drechsler comes up against every day in his research goes well beyond the statistical: What priority should data protection have? Is it necessary to completely protect all personal data? Or is it more important to make data accessible? “I can’t answer that as a statistician,” he says. “It’s a matter for society to decide.”

In his research, Drechsler works with data that arises in people’s dealings with government authorities. In Germany, the protection requirements are very strict for such data, and access is highly regulated.

The data expert observes, however, that people are a lot more liberal with their own data. At least, most users of digital media accept without complaint that their data, which is generated, say, when they use messenger applications, ends up with the private providers of these services. “Sometimes I get the impression that many people don’t give much thought to it. Ultimately, it’s a question of sovereignty over one’s own data. So much money is made with these data. At the moment, they’re used for advertising purposes, but who knows who will get access to these data in the long run.”
