Projects - First Funding Phase

This is an overview of the fifteen projects funded in the first phase (2022 to 2024) of META-REP.

PIs: Prof. Dr. Niko Busch and Dr. Yuri Pavlov
ECR: Elena Cesnaite

EEG is widely used to investigate human cognition and other psychological phenomena. Yet, despite its popularity, the credibility of several EEG findings has recently been debated. This new skepticism is based on the observation that novel hypotheses are often tested only in small samples, while replication studies are usually deemed unattractive. Moreover, there is a great deal of flexibility in any EEG analysis, such that analysis pipelines are highly variable across studies. We contend that without assessing the replicability and robustness of EEG research by testing its results against new data and alternative analyses, we are potentially building a house of cards. Inspired by the lessons emerging from the replication crisis in the psychological sciences, we have a unique opportunity to create a stronger foundation for EEG research. In this proposal, we present two large-scale, international collaborative projects addressing the replicability and robustness of EEG research, respectively.

The #EEGManyLabs project is mobilizing an international network of researchers to replicate the most influential published EEG studies, representing the largest neuroscience replication project undertaken to date. This project will provide an enormous body of data that will be made publicly available and will establish a library of effect sizes for the most commonly reported EEG phenomena. Thus, this project will help to assess the replicability of previous studies and to design future studies.

The #EEGManyPipelines project is an international many-analyst project in which all participating researchers are provided with the same dataset and are instructed to analyze the data with an analysis pipeline they deem sensible and representative of their own research. Analysts will then report their results and a detailed description of the analysis pipeline, allowing us to analyze the diversity of analysis pipelines and their effects on results. Thus, this project will help to assess the robustness of EEG findings across alternative analyses, to identify (sub)optimal analysis pipelines, and to inform guidelines for reporting EEG analyses in publications.

We expect that this project will help to improve the credibility of EEG findings and the quality of analyses, and will inspire new standards for conducting and reporting EEG studies.

PIs: Prof. Dr. Marie-Ann Sengewald & Prof. Dr. Anne Gast & Prof. Dr. Steffi Pohl & Dr. Mathias Twardawski
ECRs: Jerome Hoffmann & Johanna Höhs & Dennis Kondzic

Varying study characteristics, instead of replicating them exactly, is a core aspect of understanding effect heterogeneity. We focus on the systematic variation of study characteristics in conceptual replications and rely on the recently developed causal replication framework (CRF; Steiner & Wong, 2018; Steiner, Wong, & Anglin, 2019). The CRF uses causal theory for defining and designing replication studies. Specifically, the CRF allows factors that impact replicability to be causally identified, and it describes how research designs and analysis methods can be used to systematically study effect heterogeneity.
So far, the framework has been successfully applied to conceptual replications in educational research. However, the characteristics of studies, and how they can be kept constant across replications, differ across disciplines, calling for discipline-specific designs and analysis methods. We aim to make the CRF applicable to a broader range of disciplines. To this end, we systematically design conceptual replications with the CRF in different psychological disciplines. We focus on social and cognitive psychology, which pose different challenges for replication, and compare these to existing studies in educational psychology. Furthermore, collaborations within the SPP allow for an even broader interdisciplinary application of the CRF. In our project, we will thus develop approaches for applying the CRF in various disciplines, identify factors that cause effect heterogeneity within and across disciplines, and derive guidelines for conceptual replication studies.

Additional information can be found on the project website Conceptual Replications and on the Collaboratory Replication Lab website of our cooperation partners on systematic replication studies.

PIs: Prof. Dr. Michael Zehetleitner & Dr. Manuel Rausch
ECR: Cem Tabakci

Psychological science is currently facing a crisis of credibility because researchers have realized that numerous influential psychological studies cannot be replicated. A potential explanation for replication failures is that psychological theories are often underspecified. As a countermeasure against weak theories, it has been recommended that psychological studies make more widespread use of formal cognitive modelling to generate more precise predictions. However, it has never been empirically investigated whether cognitive modelling analysis is in fact beneficial for replicability.

Given the large number of arbitrary analysis decisions required for cognitive modelling analyses, it is possible that cognitive modelling is in fact counterproductive for the replicability of psychological findings. In our project, we aim to investigate the replicability of cognitive models based on Bayesian Brain Theory in three exemplary studies. First, we will investigate the reproducibility of the analyses conducted by the authors of the original studies using the original data sets. Second, we will examine the robustness of cognitive modelling analyses by systematically assessing the impact of a variety of theoretically equivalent analysis decisions on the results. Finally, we will test whether we obtain results equivalent to those reported in the original studies when we perform exact replications of the original experiments.

PIs: Prof. Dr. Katrin Auspurg & Dr. Andreas Schneck
ECR: Daniel Krähmer & Laura Schächtele

Replication science has so far focused heavily on experimental research, where the uncertainty of results is primarily caused by sampling error. However, analysing the robustness of research with non-experimental data, which is dominant in many social science disciplines, requires methods that also capture the uncertainty caused by model choice: in research with non-experimental data, there are often numerous possibilities for specifying the analysed samples, the functional form of the studied associations, the selection of covariates, and the regression models. In addition, unobserved heterogeneity can endanger the validity of non-experimental research (an issue of so-called sensitivity). However, large-scale evaluation studies are missing, and there is also a lack of suitable methods for this purpose.

This project therefore asks: How can the robustness of non-experimental social science research be assessed and improved with the help of computational “multi-model” programs? Three closely related research objectives serve this purpose:

  • The further development of promising programs for robustness and sensitivity analyses (so-called "multi-model", "multiverse", or "specification curve" analyses) for their use in large-scale evaluations. For example, we aim to develop standardised robustness measures and to define the model variants to be tested (in the form of sample/variable/regression models); a minimal sketch of such an analysis follows this list.
  • The first large-scale robustness analysis of effect estimates from regression analyses of non-experimental data. For this purpose, the tools developed under the first objective will be applied to 100 studies published in leading journals of the relevant disciplines (sociology, political science, and economics). We will investigate the reproducibility rate (to what extent are results reproducible with the models and data of the primary studies, and what role do possible (coding) errors play in this?), the robustness rate (to what extent are results robust against which alternative models?), and the sensitivity rate (to what extent does unobserved heterogeneity threaten the robustness and validity of the estimated effects?). These comprehensive analyses also allow, for the first time, a systematic identification of key (statistical) sources of robustness.
  • The exploration of routines to improve robustness in primary research: To what extent can multi-model and sensitivity analyses already help researchers to arrive at more robust estimates? As a further novelty, we will implement a new publication format, “robustness notes”.
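To make the first objective concrete, the following is a minimal sketch of a specification-curve ("multi-model") analysis of the kind such programs automate: one focal coefficient is estimated under every combination of covariate sets and sample restrictions, and simple robustness summaries are computed over the resulting distribution. The variable names, the covariate pool, and the two summary measures are illustrative assumptions, not the project's actual tooling.

```python
# Minimal specification-curve sketch: estimate one focal coefficient under
# many defensible model specifications and summarize how stable it is.
# Variable names (outcome, treatment, covariates, subset rules) are illustrative.
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def specification_curve(df, outcome="y", treatment="x",
                        covariates=("age", "female", "educ"),
                        subsets=(("all", lambda d: d),)):
    results = []
    for k in range(len(covariates) + 1):
        for covs in combinations(covariates, k):
            for label, rule in subsets:
                formula = f"{outcome} ~ {treatment}"
                if covs:
                    formula += " + " + " + ".join(covs)
                fit = smf.ols(formula, data=rule(df)).fit()
                results.append({"sample": label,
                                "covariates": covs,
                                "estimate": fit.params[treatment],
                                "p": fit.pvalues[treatment]})
    curve = pd.DataFrame(results).sort_values("estimate")
    # Two simple robustness summaries: the median estimate and the share of
    # specifications in which the focal effect is significant and same-signed.
    share_robust = np.mean((curve["p"] < .05) &
                           (np.sign(curve["estimate"]) ==
                            np.sign(curve["estimate"].median())))
    return curve, curve["estimate"].median(), share_robust
```

In published specification-curve analyses, the sorted estimates are typically plotted with the underlying analytical choices marked beneath the curve; standardised summaries such as the share of significant, same-signed estimates are one candidate for the robustness measures the project aims to develop.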


With these three closely intertwined research objectives, the project makes important contributions to the "what" and "how" questions of the META-REP Priority Programme: What is the replication rate (robustness), how can it be determined, and how can robustness already be improved in primary research?

PIs: Dr. Frank Renkewitz & Prof. Dr. Moritz Heene
ECR: Lukas Beinhauer & Jens Fünderich & Maximilian Frank

From a meta-scientific perspective, heterogeneity in effect sizes is a highly relevant topic for several reasons: such between-study variation is one of the candidate causes of replication problems. It affects the statistical power of primary studies and impedes the detection of publication bias and questionable research practices (the other potential main cause of replication problems). Finally, unexplained variation in effect sizes may be considered an indication of the state of theory development in a field. Thus, it is little surprise that the replication crisis has sparked huge interest in the heterogeneity of psychological effects. Multi-center replication studies have made data available that make it possible to investigate empirically the heterogeneity of psychological effects in direct replications on a larger scale. First analyses of these data suggest that heterogeneity is smaller in direct than in conceptual replications, but still occurs with considerable frequency.

This project is based on two observations regarding previous assessments of heterogeneity in psychological effects:

  • These assessments may have missed important information, as they focused almost entirely on heterogeneity in standardized effect sizes.
  • They may be subject to several methodological artefacts widely discussed in the meta-analytical literature and, therefore, be biased.

A focus on standardized effect sizes overlooks the fact that heterogeneity in these statistics may be due not only to variation in mean differences (that is, actual treatment effects) but also to variation in error variances. Heterogeneity in error variances may be caused by the use of convenience samples in replication studies and hence be theoretically entirely uninformative. Exclusively analyzing standardized effect sizes may also mask that control group means (base levels) are already heterogeneous. If so, expecting homogeneous effects requires the additional, rather strict assumption of independence between base levels and mean differences. Methodological artefacts that may affect previous heterogeneity assessments include variation in measurement reliabilities and range restrictions in independent and dependent variables. Finally, earlier analyses treated Likert-scaled data as continuous; this model misspecification may have biased meta-analytic effect size and heterogeneity estimates.

We aim to reanalyze all available data from multi-center replication projects in psychology. In these re-analyses, we will not only assess the heterogeneity in standardized and unstandardized effect sizes, but also in all of their components (group means and error variances). We will conduct multilevel mixed-effects meta-analysis to assess the relationship between true control group means and true mean differences. Wherever possible, we will determine measurement and treatment reliabilities in direct replications and estimate their heterogeneity. In analyzing standardized effect sizes, we will apply corrections for unreliability and range restrictions. Likert-scaled data will be analyzed with more appropriate ordinal regression techniques and corresponding meta-analytic methods. Taken together, these efforts should not only provide a much more valid assessment of the heterogeneity in psychological effects. They should also reveal whether and to what degree various methodological and statistical problems actually biased previous heterogeneity estimates. Additionally, they may uncover relationships in the heterogeneities of components of effect sizes that are of theoretical importance.
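As an illustration of one step in such a re-analysis, the sketch below attenuation-corrects standardized mean differences for unreliability of the dependent variable and then estimates between-study heterogeneity with the DerSimonian-Laird estimator. The column names and the simple Hunter-Schmidt-style correction are illustrative assumptions; the project's actual models (multilevel, mixed-effects, ordinal) are more involved.

```python
# Sketch: correct standardized mean differences for unreliability and estimate
# between-study heterogeneity (tau^2, I^2) with DerSimonian-Laird.
# Column names and the simple attenuation correction are illustrative.
import numpy as np
import pandas as pd

def corrected_heterogeneity(studies: pd.DataFrame):
    d = studies["d"].to_numpy()              # observed standardized effects
    v = studies["var_d"].to_numpy()          # their sampling variances
    rel = studies["reliability"].to_numpy()  # reliability of the DV per study

    # Classical attenuation correction: d_true = d_obs / sqrt(reliability);
    # the sampling variance is inflated by the same factor squared.
    d_c = d / np.sqrt(rel)
    v_c = v / rel

    # DerSimonian-Laird estimate of tau^2 on the corrected effects.
    w = 1.0 / v_c
    d_bar = np.sum(w * d_c) / np.sum(w)
    q = np.sum(w * (d_c - d_bar) ** 2)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    k = len(d_c)
    tau2 = max(0.0, (q - (k - 1)) / c)
    i2 = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0
    return {"pooled_d": d_bar, "tau2": tau2, "I2": i2}
```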

In a second part of the project, we will use the results of the re-analyses as a starting point and a restriction for simulation studies. The aim here is to investigate how the aforementioned methodological and statistical artefacts can affect heterogeneity estimates of effect sizes under various conditions. The results should provide guidelines for the analysis of heterogeneity.

PIs: Prof. Dr. Felix Schönbrodt & Prof. Dr. Richard McElreath & Dr. Filip Melinscak
ECR: Florian Kohrt

In several scientific fields, evidence has accumulated that the scientific literature is much less robust and trustworthy than desired. The prevalence of unreliable findings poses a problem for scientific progress and for the application of research output, constituting a “replication crisis”. From a top-down perspective, this leads to an academic governance challenge: How can an academic system be structured to perform better? Furthermore, as many aspects of research activity are not top-down regulated, a complementary challenge concerns bottom-up or self-organizing processes: Which practices are worth adopting when thousands of autonomous researchers interact and follow their own goals?

Many reform suggestions are hard to test empirically, as (quasi-)experimental interventions are difficult, impossible, or can only be performed on too small a scale for definitive conclusions. Theoretical tests are therefore necessary in addition to empirical ones. Verbal descriptions are usually insufficient to describe the behavior of complex systems such as academia. A complementary approach is to simulate the consequences of structural reforms in agent-based models (ABMs).

ABMs can be useful even when they omit important features, because they focus research on clear, algorithmic proposals instead of vague, wishful ones. This project aims to extend existing models of science by evaluating recent proposals of the reform movement and situating the results in three different disciplines targeted in the priority program. Specifically, the selected proposals (a) have momentum in the community, (b) have a dynamic social aspect (in contrast to static statistical considerations), (c) can fruitfully be tackled with ABMs, and (d) have not been sufficiently addressed with social models yet.
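As a flavour of the approach, here is a deliberately minimal agent-based sketch in which researchers choose a sample-size strategy, journals publish mainly "positive" results, and more productive strategies are imitated. Every parameter, the payoff rule, and the imitation dynamic are illustrative assumptions, not the models the project will build.

```python
# Minimal agent-based sketch: researchers pick a sample size, journals publish
# "positive" results, and strategies yielding more publications per unit of
# effort are imitated. All parameters and the payoff rule are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def approx_power(n_per_group, effect=0.2, alpha=0.05):
    # Crude normal-approximation power for a two-group comparison.
    se = np.sqrt(2.0 / n_per_group)
    return 1.0 - norm.cdf(norm.ppf(1 - alpha / 2) - effect / se)

def simulate(n_agents=100, generations=200, base_rate=0.1):
    strategies = rng.choice([20, 50, 200], size=n_agents)   # sample sizes
    for _ in range(generations):
        true_effect = rng.random(n_agents) < base_rate
        p_positive = np.where(true_effect,
                              approx_power(strategies),     # true positives
                              0.05)                         # false positives
        published = rng.random(n_agents) < p_positive
        payoff = published / strategies   # publications per unit of effort
        for i in range(n_agents):         # imitate a random better-off peer
            j = rng.integers(n_agents)
            if payoff[j] > payoff[i]:
                strategies[i] = strategies[j]
    return {n: int((strategies == n).sum()) for n in (20, 50, 200)}
```

Under a payoff rule that rewards publication counts, low-powered strategies tend to spread; this is the kind of self-organizing dynamic the project's models are meant to probe when evaluating reform proposals.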

PIs: Prof. Dr. Andrea Hildebrandt & Prof. Dr. Stefan Debener & Prof. Dr. Christiane Thiel & Dr. Carsten Gießing
ECR: Nadine Jacobsen & Daniel Kristanto

Psychology and cognitive neuroscience face both a replication crisis and a “real world or the lab” dilemma. Solving the dilemma and overcoming the crisis at the same time is arguably a serious challenge. One of the main aims of cognitive neuroscience is to discover brain-cognition associations that are replicable across laboratories. A precondition for the replicability of individual differences findings on brain-cognition associations, obtained inside or outside the laboratory, is rank-order stability of neural parameters derived from noisy and complex signal recordings. However, to date we do not know well enough to what extent hitherto unsuccessful replications are due to the overwhelming number of methodological decisions researchers have to make prior to testing a brain-cognition association. Moreover, we do not yet have standards with respect to the unit of analysis at which replications should be considered successful. We also lack a knowledge app containing a systematic and exhaustive overview of the potential methodological choices that are defensible in a typical individual differences analysis workflow for mobile EEG or fMRI, as well as for multivariate behavioral data. Thus, laboratories still stick to their customized choices, which are sometimes passed on through many generations of young scientists. But, as described in the very recent literature, the variability of workflows and of the associated substantive findings is huge across laboratories. Finally, the statistical approaches proposed so far for analyzing the multiverse of potentially constructed datasets for noisy and highly complex multidimensional neural data need to be extended with tools available for big data analysis. Such approaches would allow learning about influential decisions and would predict potential heterogeneity of future findings.

To take a large step toward filling these gaps, METEOR aims to bring together a larger group of scientists with different and complementary expertise (cognitive neuroscientists using mobile EEG methodology, network neuroscientists working with fMRI data, and statisticians experienced with big data analysis tools). By joining forces within the fruitful environment of a collaborative research programme, METEOR will provide standards for defining replication success in cognitive neuroscience that are applicable across neuroimaging modalities. Furthermore, it will deliver systematized knowledge and analytic solutions for the multiverse in two neuroimaging modalities – mobile EEG and resting-state fMRI – applied to the assessment of individual differences and brain-cognition associations. The proposed solutions will be discussed with respect to their applicability to further research questions in the future.

PIs: Prof. Dr. Anna Dreber Almenberg & Prof. Dr. Nathan Fiala & Prof. Dr. Magnus Johannesson & Prof. Dr. Jörg Peters & Dr. Mandy Malan
ECR: Florian Neubauer & Julian Rose

In economics, most empirical work is based on non-laboratory experimental and quasi-experimental designs. While in other, more standardized disciplines replicability is mostly a matter of generalizability across contexts, in economics it also hinges on the robustness of findings across different specifications. This project will provide replicability rates in economics by conducting computational and robustness replications of 30 non-laboratory studies in economics using the original data sets. These studies cover different empirical methods, such as primary data-based Randomized Controlled Trials (RCTs) and secondary data-based quasi-experimental methods. Complementing these robustness replications, we will also run surveys with experts in the field to elicit how they assess replication success or failure and to obtain assessments of generalizability across contexts – a frequently discussed limitation of RCTs in particular.

The main objectives are, first, to define replication success for robustness replications. We will also develop standardized protocols on how to conduct robustness replications and standardized forms for reporting the results. Second, we will compare replicability rates across methods and evaluate differences. Experts' prior is usually that secondary data-based studies are more prone to p-hacking, HARKing, and publication bias than RCTs. Third, we will develop an expert interview-based toolkit to assess robustness replicability and generalizability across contexts.

PI: Dr. Marc Jekel
ECR: Patrick Smela

The inconsistency between a statistically significant result in an original study and a non-significant result in its replication can be due to problems with statistical conclusion validity, construct validity, internal validity, and external validity in the original and the replication study. While problems with statistical conclusion validity in original studies have been the main focus of the replication debate (e.g., inflated alpha error rates due to questionable research practices, inflated effect sizes due to underpowered studies in combination with publication bias), problems concerning the other types of validity have been largely neglected until recently.

In the project, we will develop a machine learning model to predict replication success based on potentially non-linear relations among properties of original and replication studies, which serve as input variables. The first class of input variables consists of indicators of all types of validity, coded for original and replication studies in an extensive literature review. The second class consists of assessments by the experts on those studies: we will ask the authors of original and replication studies to rate to what extent their studies achieved the different types of validity. In addition, we will ask authors of “failed” replications to indicate reasons for non-replication in an open question format. We will use the model to identify replication studies that “failed” although the predicted replication success is high. As a proof of concept for the machine learning model and to demonstrate our model-driven approach, we will select and replicate a small sample of those failed replications using a validity-optimized study design. The coding of original and replication studies as well as the assessments from the authors of these studies will be made available in a standardized database for researchers interested in meta-science on the (non-)replicability of research results in psychology.
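The sketch below illustrates the general shape of such a model, assuming a table of coded original-replication pairs: a random forest is trained on validity indicators, and out-of-fold predicted probabilities are used to flag "failed" replications with high predicted success. The feature names, the learner, and the threshold are illustrative assumptions rather than the project's final specification.

```python
# Sketch: predict replication success from validity indicators of original and
# replication studies, then flag "failed" replications with high predicted
# success. Feature names and the specific learner are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def flag_surprising_failures(pairs: pd.DataFrame, threshold=0.7):
    features = ["stat_validity_orig", "construct_validity_orig",
                "internal_validity_orig", "external_validity_orig",
                "stat_validity_rep", "construct_validity_rep",
                "sample_size_ratio", "author_validity_rating"]
    X = pairs[features]
    y = pairs["replicated"]  # 1 = successful replication, 0 = failed

    model = RandomForestClassifier(n_estimators=500, random_state=0)
    # Out-of-fold predicted probabilities, so each pair is scored by a model
    # that never saw it during training.
    proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]

    mask = (y.to_numpy() == 0) & (proba >= threshold)
    surprising = pairs.loc[mask].copy()
    surprising["predicted_success"] = proba[mask]
    return surprising  # candidates for validity-optimized re-replication
```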

PI: Prof. Dr. Malte Elson & Dr. Ruben Arslan
ECR: Ian Hussey & Taym Alsalti

Global, open standards, like the fixed width of the soccer goal, reduce conflict when players from around the world come together. In psychology, such standards are few, and unsurprisingly “moving the goal posts” is a typical metaphor for the controversy that often erupts after a failed replication attempt. At the same time, difficulties with replicating empirical work (reproducing analyses based on the same data, repeating previously made observations in new data) have been documented across the social and behavioral sciences. As psychology grapples with this crisis, many ask what constitutes a direct replication: when are materials and methods sufficiently similar to be considered the same? If we find an effect when using one measure but not when using an altered version, we may gain insight into boundary conditions and generalizability. However, we may also chase false leads: differences in results may not be replicable if researchers exploited degrees of freedom in measurement to obtain the desired results. Global, open standards remove these degrees of freedom and transparently crystallize agreement on basic aspects of research: units, norms, and measurement procedures. Without standards, efforts to build a cumulative evidence base through replication and evidence synthesis will often end in screaming matches about the goal posts (but without the benefit of a referee). With them, planning new research, assessing the replicability of previous research, and synthesizing evidence all become easier.

We propose a comprehensive work programme to study the role that standards (and their absence) play in the reproducibility, robustness, replicability, and generalizability of psychological research. We examine how open standards can help psychology's transition to a mature, cumulative science. We develop SOBER, a rubric to describe and quantify measurement standardization in a machine-readable metadata standard. We demonstrate SOBER's utility for predicting replicability in existing meta-analyses and large-scale replication projects and show how global standards ease evidence synthesis by reducing hypothesis-irrelevant variation. With SOBER, we lay the foundation to reduce redundancy, error, and flexibility in measurement. We catalogue flexible measurement practices, simulate their cost for psychometric quality and robustness of evidence, and test these costs empirically in a series of methodological experiments in which common psychological measures are modified. We integrate our findings into a framework to account for ad hoc measurement modification effects in psychological research synthesis. Taken together, our plans to engineer a change towards a more standardized culture in psychology take the shape of tools, debates, and educational resources.
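Purely as a hypothetical illustration of what a machine-readable measurement-metadata record could look like (the field names below are invented for illustration and are not the SOBER specification), consider:

```python
# Hypothetical illustration only: a machine-readable record describing how a
# study's measure relates to a canonical, standardized instrument.
measurement_record = {
    "instrument": "rosenberg_self_esteem_1965",  # canonical reference measure
    "version_used": "modified",                  # "canonical" | "modified" | "ad_hoc"
    "items_administered": 8,                     # canonical scale has 10 items
    "items_dropped": ["item_3", "item_9"],
    "response_format": {"type": "likert", "points": 5, "canonical_points": 4},
    "wording_changes": True,
    "scoring_rule": "mean_of_items",
    "validation_evidence_cited": False,
}
```

Records of this kind would let replication teams and meta-analysts check whether two studies used the same instrument in the same form, and model hypothesis-irrelevant variation introduced by measurement modifications.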

PIs: Dr. Malte Jansen & Dr. Aleksander Kocaj
ECR: Aishvarya Aravindan Rajagopal

The project at the IQB's Research Data Centre (FDZ, Forschungsdatenzentrum), led by Dr. Malte Jansen and Dr. Aleksander Kocaj, addresses questions concerning the reproducibility and robustness of research results based on secondary analyses in empirical educational research (META-REP-IQB; funding period: 02/2022-01/2025, total funding: €231,578). The project is based on applications for secondary data analyses at the FDZ of the IQB. In these applications, researchers describe their central questions, hypotheses, and planned analytic approach. We will systematically compare these applications with the publications resulting from them. In addition, we aim to reproduce the results of selected publications.

The project at the FDZ of the IQB has four goals. First, we examine whether data applications that result in significant or hypothesis-confirming results are more likely to be submitted for publication and published (publication bias). Second, we plan to develop an index that quantifies the similarity between data use applications and the resulting publications, to provide evidence of selective reporting. Third, we will test whether the results of selected publications that resulted from data usage applications can be reproduced using the information provided in the manuscripts. This takes advantage of the fact that the underlying data are available at the FDZ of the IQB. Building on this, in a fourth step we will investigate how robust selected research results are to alternative plausible analytic approaches (multiverse analysis). Overall, the project aims at identifying sources of heterogeneity that may influence the replicability and reproducibility of research findings in empirical educational research.
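One possible operationalization of the planned similarity index, shown only as an illustration and not as the project's definition, is a TF-IDF cosine similarity between the text of a data-use application and the resulting publication:

```python
# Illustrative operationalization (not the project's index): TF-IDF cosine
# similarity between a data-use application and the resulting publication.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def application_publication_similarity(application_text: str,
                                       publication_text: str) -> float:
    tfidf = TfidfVectorizer().fit_transform(
        [application_text, publication_text])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])
```

Low scores would then flag application-publication pairs as candidates for closer manual inspection for selective reporting.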

PI: Prof. Dr. Eunike Wetzel
ECR: Caroline Böhm & Steven Bißantz

The goal of this project is to investigate the role of measurement in the replicability of empirical findings. We will focus on two widespread, potentially problematic measurement practices: 1) the use of ad hoc scales and 2) the use of modified scales. Ad hoc scales are scales that authors constructed for a particular study and that have not been validated, or only very superficially. The term “modified scales” refers to deviations from preexisting, validated scales, for example in the number of items, the items' wording, or the response format. We will investigate how these practices influence replication rates and the heterogeneity of effect sizes using three complementary methodological approaches: reanalyzing existing empirical data, running an experiment, and conducting a simulation study.

In the first study, we will reanalyze data from replication projects that applied item-based measures, such as the Many Labs projects. In the second study, we will conduct a multisample replication project on multiple original effects from different fields, combined with an experiment on the influence of measurement on replicability and effect size heterogeneity. The third study will be a simulation in which the “original” scale is compared to ad hoc scales and to different types of modifications of the original scale in terms of the recovery of the true effect sizes. Our application context is multidisciplinary and includes several areas within psychology (social psychology, health psychology) as well as political science and economics, allowing us to make comparisons across fields. In sum, this project will contribute to the META-REP priority program and the metascientific literature by investigating measurement as a factor that can explain the replicability of empirical findings.
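A minimal sketch of the simulation logic in the third study, under an assumed single-factor data-generating model: item responses are generated from a latent trait that correlates with an outcome at a known level, and recovery of that correlation is compared between the full "original" scale and a shortened ad hoc version. The item counts, loadings, and the specific modification are illustrative assumptions.

```python
# Minimal simulation sketch: generate item responses from a latent trait that
# correlates with an outcome, then compare how well the full "original" scale
# vs. a shortened scale recovers the true latent correlation.
# Loadings, item counts, and the modification rule are illustrative.
import numpy as np

rng = np.random.default_rng(42)

def simulate_once(n=300, n_items=10, loading=0.7, true_r=0.3, keep=4):
    theta = rng.normal(size=n)                       # latent trait
    outcome = true_r * theta + np.sqrt(1 - true_r**2) * rng.normal(size=n)
    errors = rng.normal(size=(n, n_items))
    items = loading * theta[:, None] + np.sqrt(1 - loading**2) * errors

    full_score = items.mean(axis=1)                  # original scale
    short_score = items[:, :keep].mean(axis=1)       # shortened version

    return (np.corrcoef(full_score, outcome)[0, 1],
            np.corrcoef(short_score, outcome)[0, 1])

estimates = np.array([simulate_once() for _ in range(2000)])
print("true r = 0.30")
print("mean r, full scale:      %.3f" % estimates[:, 0].mean())
print("mean r, shortened scale: %.3f" % estimates[:, 1].mean())
```

In this toy setup, the shortened scale is less reliable and therefore attenuates the recovered correlation more strongly; the planned simulation study varies such modifications systematically.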

PI: Dr. Nate Breznau
ECR: Hung Nguyen

The reliability of a given empirical test is a product of the quality of the theory used to design that test. Too many unknowns or competing causal pathways render test results unreliable or even uninterpretable. If hypothesis tests are repeated under theoretical ambiguity, an entire area of study may be unreliable. Such a scenario could explain a lack of consensus and failures to reproduce findings in that area. In the behavioral and social sciences especially, theoretical ambiguity is acute due to the complexities of human interaction and societal organization. Theoretical inquiry to improve a single hypothesis test is time consuming and, on its own, will not improve reliability across an entire area of study. Therefore, I propose an approach to maximize theory while minimizing the time investment of the scholars working in an area.

This project will test the extent of theoretical ambiguity in one area of study as a cause of empirical unreliability in that area. Then it will check whether computer-assisted comparison of causal models can efficiently identify theoretical ambiguities in that area. Next, it will test whether crowdsourcing theoretical claims, combined with computer-assisted causal model comparison, can improve the reliability and replicability of findings in that area. Finally, if the initial results are positive, it will develop computer systems to improve and economize the process for deployment across all areas of the cognitive, behavioral, and social sciences. Ideally, the project will contribute to knowledge on why replication and reliability vary across studies and hypotheses, and on how to improve them via technology and meta-theoretical work.
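At its simplest, computer-assisted comparison of causal models can be pictured as encoding each researcher's theoretical claims as a directed graph and surfacing the causal links on which the submitted models disagree. The sketch below, with invented example variables and a naive disagreement score, illustrates that idea rather than the system the project will develop.

```python
# Sketch: represent each researcher's theoretical claims as a directed graph
# and surface the causal links on which the submitted models disagree.
# The example variables and the disagreement summary are illustrative.
import networkx as nx

def compare_causal_models(models: dict):
    # models maps a model name to a list of (cause, effect) edges
    graphs = {name: nx.DiGraph(edges) for name, edges in models.items()}
    all_edges = set().union(*(g.edges for g in graphs.values()))
    report = []
    for edge in sorted(all_edges):
        endorsed_by = [name for name, g in graphs.items() if g.has_edge(*edge)]
        report.append((edge, len(endorsed_by) / len(graphs), endorsed_by))
    # Edges endorsed by only a minority of models mark theoretical ambiguity.
    return sorted(report, key=lambda r: r[1])

ambiguous = compare_causal_models({
    "model_A": [("immigration", "policy_support"),
                ("unemployment", "policy_support")],
    "model_B": [("immigration", "unemployment"),
                ("unemployment", "policy_support")],
})
```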

PI: Dr. Xenia Schmalz
ECR: Tatiana Logvinenko & Yi Leung

Building falsifiable theories is particularly important for unravelling the scientific underpinnings of neurodevelopmental disorders such as developmental dyslexia, as it helps establish a testable, logical skeleton for hypothesis testing (Popper, 2005; Wacker, 1998). Yet, in psychological science, several factors have been impeding the construction of falsifiable theories. These include, for example, ill-defined terminology and indeterminate operational indicators of psychological constructs (e.g., Meehl, 1978), dubious psychological tests (e.g., Eronen & Bringmann, 2021), flexibility in data analysis (Simmons et al., 2011), and publication bias (e.g., Francis, 2013). These factors might be associated with the replication crisis in the psychological sciences, manifested in conflicting evidence and low effect sizes in empirical studies (Open Science Collaboration, 2015). Yet, to date, the relative influence of these theoretical, methodological, and systematic factors on replicability is unknown. Before sound theories can be built in psychological science, it is fundamental to understand and assess the relationship between these factors and replicability.

In light of the above, this meta-science project aims to evaluate the effects of theory underspecification, questionable measurements, and biased research practices on replicability in psychological sciences. Taking developmental dyslexia research as a case study, this project aims to quantify to what extent (1) a theory might be underspecified, (2) psychological measurements might be invalid or unreliable, (3) research designs and statistical models might be maladapted and reported with biases, and (4) how the above factors predict effect size variability and replicability. There are three work packages (WP1, WP2, and WP3) in this project: WP1 develops methods to quantify theory specificity, in particular, specificity in defining key terminology and hypothesis generation; WP2 examines the ways to quantify the methodological strength of empirical studies of each dyslexia theory, indicated by the psychometric properties of the cognitive tasks, the reproducibility and robustness of the studies, and publication bias; WP3 explores the relative importance of the above-examined factors, including theory underspecification, poor measurement, and publication bias, as correlates of low replicability. The developed framework for quantifying and assessing the weights of potential factors associated with low replicability, especially the importance of theory specification, is expected to be generalizable to other subfields of psychology.

For more details, the proposal of our project is available at https://osf.io/b9nmr/.

PIs: Dr. Johannes Breuer & Prof. Dr. Mario Haim
ECR: Philipp Knöpfle

Computational communication science (CCS) is a young but quickly growing field characterized by the use of digital traces and other media data (e.g., online news, social media) and of methods suitable for collecting/generating (e.g., scraping, simulating) and analyzing (e.g., machine learning, natural language processing) such data. As publication practices in the journal Computational Communication Research show, many CCS researchers have been advocating for and implementing open-science principles such as sharing data and materials. However, several challenges complicate making CCS research reproducible and replicable. These challenges are presumably a consequence of (a) the methods commonly used in the field, (b) the volatility of its topics of study and of the data it makes use of, and (c) the increasing dependency on third-party data providers such as search engines or social networking sites.

The aim of this project is to investigate the determinants of and conditions for replicability in CCS. To achieve this, the project will (1) assess the potential replicability of CCS research by means of a large-scale content analysis of publications in the field, and (2) test and evaluate actual replicability by reproducing and replicating purposively selected studies from the field.