News

Data scarcity and AI solutions: Social research in new ways

20 Jan 2025

The social sciences rely on data from surveys or data about users. To obtain it, LMU researchers are increasingly relying on artificial intelligence and analyzing digital footprints.

Passers-by walking through a pedestrian zone. | © IMAGO_Zoonar

Data is indispensable as the basis for serious social science and statistical research. It can provide information on how people feel about democracy in a society, how the labor market is changing, or how wide the gap is between rich and poor. It is the basis for studies whose results can be highly relevant to political decisions. If the data source is not reliable, that can have undesirable consequences. “In fact, in many surveys the sample is not a random sample from the population of interest, but some kind of a selective sample. This can distort the results if the aim is to describe this population,” explains sociologist Katrin Auspurg.

“We saw this in surveys conducted during the Covid-19 pandemic, for example, which often missed out older and less educated people. This meant not enough consideration was given — from a societal and political perspective — to a number of problems that these people had. At least some of the results found with the selective sample of younger and more educated people may not be generalizable to the whole population.”

Problems with data collection

One problem that poses challenges for researchers in the social sciences is therefore of major societal relevance: the fact that data is becoming increasingly difficult to collect by traditional means. “Many of the people we want to recruit for a survey are no longer taking part in surveys at all,” says political scientist Professor Alexander Wuttke from LMU, whose research includes investigating the trends surrounding democracy in Germany. One of the problems he sees in the field of democracy research is a general increase in skepticism towards academic research.

Katrin Auspurg, whose academic work focuses on quantitative social research, can confirm this trend: “In the 1980s, 70% of people who were contacted for population surveys such as the ALLBUS took part. Now it’s only around 30%. Generally speaking, you can only expect a participation rate of one third in either face-to-face or postal surveys using random samples from municipal population registers. The rates for telephone surveys are frequently even lower.”


Sociologist Katrin Auspurg is skeptical about the data quality of many survey results reported in the media. | © LMU

Auspurg also identifies changes in the culture around doing surveys as a challenge. “The number of surveys being done has increased hugely because it’s now so much easier and cheaper to conduct them online.” Added to that, in many cases they are not even academic surveys but market research instead. Marketing calls, too, are often disguised as surveys.

Demographic aspects also influence access to people, says Auspurg. “It’s easier to get people on middle incomes, for example, to complete a survey. Less-educated individuals, older people, or those with a migration background, on the other hand, are harder to recruit for surveys.” If the data analysis does not correct for this, surveys suffer from what’s known as middle-class bias.
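The correction Auspurg alludes to is typically done with survey weights. The following sketch shows simple post-stratification with entirely made-up numbers: the group shares and group means are hypothetical illustrations, not figures from any actual survey.

```python
# Post-stratification sketch: reweight a skewed sample so its
# education mix matches known population shares (invented numbers).

# Hypothetical population shares (e.g. from a census)
population = {"low": 0.30, "medium": 0.50, "high": 0.20}

# Shares actually achieved in the survey: middle-class bias,
# highly educated respondents are overrepresented
sample = {"low": 0.15, "medium": 0.45, "high": 0.40}

# Weight for each group = population share / sample share
weights = {g: population[g] / sample[g] for g in population}

# Hypothetical mean agreement with some survey item, by group
group_means = {"low": 0.62, "medium": 0.55, "high": 0.41}

unweighted = sum(sample[g] * group_means[g] for g in sample)
weighted = sum(sample[g] * weights[g] * group_means[g] for g in sample)

print(f"unweighted estimate: {unweighted:.3f}")
print(f"weighted estimate:   {weighted:.3f}")
```

Because each group's weight is its population share divided by its sample share, the weighted estimate restores the mix the population actually has; this only works if reliable population shares are available, for instance from a census.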

Alexander Wuttke points out that telephone surveys are also becoming increasingly difficult “because many households no longer use a landline and people don’t answer cell phone calls as often.” What this means, he goes on to say, is that many sections of the population are excluded and the claim to be a representative survey is pretty much an illusion.


“Representative is often just a smokescreen”

“The word ‘representative’ is often a smokescreen, its meaning unclear. To be able to make a broad statement, you need a random sample that is as close as possible to the overall population,” says Katrin Auspurg. “But depending on the research question, it can make sense to step away from random sampling in order to study certain aspects in a more targeted manner.” For experimental research, for example, or if you’re trying to investigate things about specific hard-to-reach groups. But in that case, she explains, you need to clearly explain and justify the type of sample you used.

So, the challenges are numerous. But the researchers are full of ideas, actively accessing alternative data sources or drawing on digital data footprints. Alexander Wuttke explains, “When people use platforms, like X for example, and express their opinions there, they leave behind a digital footprint that can be seen by anyone and analyzed. This is an important trend in the research field, one that has emerged as an addition to survey-based research. For instance, we are able to see when users express anti-democratic opinions.” However, what this does not do is allow you to reflect a cross-section of society; at best, you would be able to analyze what users of X are saying.
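At its simplest, analyzing such footprints means scanning public posts for patterns. The toy sketch below flags posts containing words from a hypothetical watchlist; the posts and keywords are invented, and real research would use validated text classifiers rather than keyword matching, so this only illustrates the basic idea.

```python
import re

# Toy sketch of digital-footprint analysis: flag public posts that
# contain a word from a (hypothetical) watchlist. Real projects use
# validated classifiers, not simple keyword matches.
posts = [
    "Elections are a sham, we should abolish them",
    "Looking forward to voting this weekend!",
    "Democracy only works if we all take part",
]
watchlist = {"abolish", "sham"}

def flag(post: str) -> bool:
    """Return True if the post contains any watchlist word."""
    tokens = set(re.findall(r"\w+", post.lower()))
    return bool(watchlist & tokens)

flagged = [p for p in posts if flag(p)]
print(f"{len(flagged)} of {len(posts)} posts flagged")
```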

Request for data donations

Nevertheless, Frauke Kreuter, a Professor at LMU’s Institute of Statistics, emphasizes the enormous potential of digital data. For one thing, she says, it is significantly cheaper than collecting data the traditional way. And it is easier, too, because you collect the data passively: You are less reliant on the respondents’ memory and you can simply measure how many steps a person has been taking if you’re doing a health study — instead of asking them how much movement they’ve done in the past year.

What’s more, the EU-wide General Data Protection Regulation (GDPR), in force since 2018, opens up the possibility for people to make data donations to researchers: Online service providers must provide users with their data if they request it. “We can then ask these people whether they would like to give their data to us for research purposes — for citizen science, so to speak,” says Kreuter.

Using digital data: Lessons to learn

“Data may seem absolute, but if you don’t know the context, it can be misleading,” says Frauke Kreuter. | © Fotostudio klassisch-modern

However, digital data also has its secrets that have not (yet) been revealed. “Unfortunately, when using digital data, people often don’t pay attention to the processes used to generate that data and where it comes from.” Frauke Kreuter thinks one reason for this is that platform operators are reluctant to reveal how their algorithms work.

In addition to that, she says not enough is known about the social behaviors around using digital media. “There’s no account taken of when several people are using the same device. Or when women carry their smartphone not on their person but in their purse — which they sometimes put down. If you were trying to measure the number of steps they’d taken each day, the data would be incorrect.”

Digital data is also more difficult to analyze. “We just don’t have the number of data specialists we would need to get everything right with passive reading,” says Kreuter bluntly.

Hard to crack — administrative data


Another important source for research purposes, says the LMU researcher, could be administrative data, such as the kind collected by public institutions. Here, however, data protection tends to present an obstacle that makes it more difficult to gain access. As Frauke Kreuter explains, “In Denmark, for example, administrative registry data is accessible for research purposes. Population insights are also more precise. While some progress has been made with administrative research data centers, we still have a lot of catching up to do here in Germany.”

She thinks the reason for Germany’s reluctance lies in a general misunderstanding: “We tend to focus on protecting data instead of protecting people,” says Kreuter. “Safe use of data is possible.”

Give and take in cooperation with authorities

One way of still enabling access to data from public institutions could be through collaborations with benefits for both sides: Researchers could give administrative staff the tools to use their data to better manage their own processes, and to gain insights from it themselves. Conversely, the data itself could be made available to researchers for their work. Frauke Kreuter and her fellow researchers have already launched a promising initiative in the United States, where she continues to conduct research.

“In Germany, we recently worked with the Bavarian Ministry of Digitalization, the Directorate General of the State Archives, and the Bavarian State Ministry of Justice,” explains the professor. “We moved 60,000 folders into a secure cloud environment and showed the employees how to take samples from them, and other things, too.” At the same time, students and doctoral candidates were able to use this data to work on research questions, says Kreuter.

It’s all about the mix


Alexander Wuttke, who investigates the use of AI in data collection together with Frauke Kreuter. | © LC Productions

All in all, the LMU researchers are certain that employing a mix of different approaches to data collection could help to close the gap resulting from the difficulties with traditional qualitative and quantitative surveys. Alexander Wuttke points out that in democracy research, discrepancies can be identified and classified through triangulation — the use of different data sources: “In surveys, people will say they value democracy, but at the same time we observe anti-democratic voting behavior. By combining different data sources, we are better able to understand why these discrepancies occur.”
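Wuttke’s point about triangulation can be made concrete with a tiny, entirely invented dataset: each record pairs a survey answer with an observed behavioral indicator, and the discrepant cases are flagged for closer qualitative study.

```python
# Triangulation sketch (hypothetical data): compare what respondents
# say in a survey with an observed behavioral indicator, and flag
# discrepant cases for closer qualitative study.
respondents = [
    {"id": 1, "says_values_democracy": True,  "antidemocratic_vote": False},
    {"id": 2, "says_values_democracy": True,  "antidemocratic_vote": True},
    {"id": 3, "says_values_democracy": False, "antidemocratic_vote": True},
    {"id": 4, "says_values_democracy": True,  "antidemocratic_vote": True},
]

# Discrepancy: professed support for democracy combined with
# anti-democratic voting behavior
discrepant = [r["id"] for r in respondents
              if r["says_values_democracy"] and r["antidemocratic_vote"]]

rate = len(discrepant) / len(respondents)
print(f"discrepant cases: {discrepant}, rate: {rate:.2f}")
```

The flagged cases are exactly the ones a single data source would miss: neither the survey alone nor the voting data alone reveals the contradiction.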

Qualitative versus quantitative — AI can help

But traditional research practice, too, could soon be made much easier. “I think that artificial intelligence can help us to resolve or at least limit the conflict between in-depth qualitative research and quantitative research that relies on large volumes of data,” says Alexander Wuttke confidently.


According to the researcher, the qualitative approach is time consuming: researchers need to take the time to understand what people think and what motivates their actions. “You always need interviewers who have to take their time. You can only do that up to a point and not hundreds or thousands of times. Large comparative studies across different states are therefore inconceivable.”

AI technology like large language models, or LLMs for short, could help here. Using them would make it possible to hold direct conversations, with questions and clarifications, and to react as the situation demands.

Alexander Wuttke and Frauke Kreuter have already initiated a pilot study on this with students, which is showing some very promising signs. But there are also disadvantages here — the lack of the empathy you get in face-to-face conversations. “You could probably do thousands of interviews a day, but AI can’t do it with the same empathy as a human,” admits Wuttke. It remains to be seen whether people will be as happy to answer an AI as they would be to answer a human.

Traditional methods remain relevant

From her interdisciplinary research, Frauke Kreuter knows that there is no either/or between the proven methods of social science research and the new possibilities that digitalization and AI open up for research: “Algorithms are trained on data. To do that, the AI needs high-quality data,” says the LMU researcher. “I foresee an increasing interest in traditional data collection in the social sciences, if only because AI requires good comparative data.”

Recognizing the quality of a study:

Random samples are the most meaningful type of sample. Here, every person in the group the researchers want to learn about has the same chance of being selected, so the sample’s composition mirrors that of the group as a whole. If you’re estimating the average income in Germany, for example, a random sample will include working people of different ages, genders, and occupations.

Less meaningful are samples that survey people who volunteer themselves, as is often the case with online surveys. The findings cannot then be applied as a general statement of fact, but only to say something about the group that participated.

Importantly, any media that quote or refer to surveys should provide background information on how the surveys were conducted. For example, the sheer number of respondents is irrelevant if the survey is biased.
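The point that sample size cannot compensate for bias can be shown with a small simulation (all numbers invented): a large volunteer sample in which highly educated people opt in far more often stays off target, while a much smaller random sample lands close to the true value.

```python
import random

random.seed(0)

# Toy population of 100,000: income depends on education level 0/1/2
# (all numbers invented for illustration)
levels = random.choices([0, 1, 2], weights=[3, 5, 2], k=100_000)
population = [(edu, 20_000 + 15_000 * edu + random.gauss(0, 3_000))
              for edu in levels]

true_mean = sum(inc for _, inc in population) / len(population)

# Small random sample: every person is equally likely to be drawn
random_sample = random.sample(population, 500)
rand_mean = sum(inc for _, inc in random_sample) / len(random_sample)

# Large volunteer sample: opt-in probability rises with education
opt_in = (0.05, 0.20, 0.60)
volunteers = [(edu, inc) for edu, inc in population
              if random.random() < opt_in[edu]]
vol_mean = sum(inc for _, inc in volunteers) / len(volunteers)

print(f"true mean: {true_mean:.0f}")
print(f"random sample (n={len(random_sample)}): {rand_mean:.0f}")
print(f"volunteer sample (n={len(volunteers)}): {vol_mean:.0f}")
```

In this toy setup the volunteer sample is many times larger than the random one, yet its estimate overshoots the true mean by thousands, while the 500-person random sample stays close to it.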

Data collection projects:

Connecting Data for Evidence-Based Policymaking: Coleridge-Initiative

Research project KODAQS: Data Quality Academy Certificate Program

New data spaces: project on data donation

Study: Environmental Attitudes and Environmental Behavior

