Inspiring Collaborative Workshop on Designing Synthetic Benchmarks for Real-World Cohort Data

We just concluded an energizing two-day workshop dedicated to designing synthetic benchmarks and diagnostic tools for complex, real-world cohort data. Co-organized by CRC 1597 Small Data, the STRATOS initiative, EVA4MII, and PrivateAIM and held on June 3-4, 2025, the workshop brought together experts from clinical research, epidemiology, and statistics. Organizers, speakers, and participants together explored how synthetic datasets and real-world clinical data can drive innovation in clinical research and methodological development.

Day 1: Open Window
The day started with insightful presentations from our invited speakers André Scherag, Anne-Laure Boulesteix, Cécile Proust-Lima, Cristina Has, Daiana Stolz, Els Goetghebeur, Christian Keller, James Carpenter, Martin Wolkewitz, Moritz Hess, and Thomas Nührenberg. Part I focused on the heterogeneity, missingness, and semantic challenges inherent in real-world clinical data, illustrated through concrete applications in respiratory medicine, neurology, cardiology, and rare diseases. Part II addressed methodological innovations for creating realistic synthetic datasets, with contributions on simulation design, modeling causal effects, handling time-dependent processes, and incorporating domain knowledge. Discussions emphasized the need for rigorous methodology, reproducibility, and context-aware modeling to ensure synthetic data supports robust and informative research in clinical settings.

These talks set the stage for a lively and focused afternoon discussion session centered on the challenges of creating realistic benchmarks for specific clinical research questions related to:
• Chronic Obstructive Pulmonary Disease (COPD)
• Multiple Sclerosis (MS)
• Epidermolysis Bullosa (EB)

Participants divided into topic-specific groups, enabling in-depth, targeted discussions tailored to the unique challenges and data needs of each medical condition. We were pleased to be joined by clinical experts, whose insights ensured that methodological planning remained firmly grounded in clinical realities. This group-based format fostered interdisciplinary exchange and effectively prepared the ground for the hands-on development work on Day 2.

Day 2: Hackathon
The next day began with a productive and engaging hackathon. Participants divided into groups based on the three clinical use cases, each starting from a predefined simulation script. The task was to extend and adapt this script to address a set of data challenges identified during the Day 1 discussions and tailored to each group’s specific clinical context. At the end of the session, each group presented its adapted simulation script.
This hands-on exercise offered a valuable opportunity to bridge clinical expertise with methodological development, ensuring that the resulting simulation models were both technically sound and clinically meaningful. The close interdisciplinary collaboration fostered alignment between modeling decisions, real-world data characteristics, and the medical context.

Sincere thanks to all contributors and the organizers Harald Binder, Nadine Binder, Willi Sauerbrei, Michelle Pfaffenlehner, and Patric Tippmann for making this an intellectually enriching collaborative experience. We’re looking forward to sharing the results of the workshop soon and seeing how this work supports the development of robust, clinically relevant AI methods.

Further information about the workshop program can be found here: Workshop 2025 – SmallData

Administrative Manager

Marc Schumacher

Institute of Medical Biometry and Statistics,
Faculty of Medicine and Medical Center –
University of Freiburg