Hackathon

Multi-Analyst-Study and Hackathon on Interpretable Machine Learning for Longitudinal and Clustered Data

Data with stochastic dependencies is common in many fields, including education, psychology, the social sciences, and the health sciences. Two common sources of dependency are repeated measurements of the same individuals in a longitudinal design (e.g., panel data) and noticeable cluster-level effects, as in cross-sectional data with students nested in schools. Both designs are important for understanding associations and causal mechanisms. Data analysis is changing because current data collection practices yield many variables. This creates opportunities, such as studying dynamic behavior or comparing many competing theories, but it also poses challenges in variable selection, feature estimation, and causal effect estimation.

Supervised machine learning (ML) includes powerful prediction models. While their predictive performance is sometimes astounding, many ML approaches are opaque and hard to interpret. In addition, common ML models like the random forest are not designed for dependent observations, as all observations are weighted equally by default. The strong interest in ML in the social and behavioral sciences calls for adaptations of existing approaches and for innovative techniques that overcome these inherent limitations with respect to dependent data.

At the same time, research and applications need to go beyond merely pragmatic predictive performance. Interpretability is a key requirement for deeper understanding. It can either be built into a method (as in regularized regression approaches or CART) or obtained post hoc with techniques like feature importance measures or partial dependence plots. Interestingly, many examples show that interpretability and performance are not necessarily at odds, and especially in high-stakes applications it will often be beneficial to employ methods with built-in interpretability.

Aims

Discuss recent developments in prediction modelling of dependent outcomes and trajectories.
Compare methods on two datasets in a multi-analyst study.
Derive and publish authoritative guidelines for research that applies supervised machine learning to data with stochastic dependencies.

Examples of methods within scope

classification and regression trees and related methods
regularized regression models, regularized structural equation models, and other statistical learning models
well-explained black box models, including (tree-)ensembles and (deep) neural networks
functional/additive models
surrogate modelling and explainability techniques

Organizers

Philipp Doebler, Department of Statistics & Center for Agile PAIR, TU Dortmund University
Jörg-Tobias Kuhn, Faculty of Rehabilitation Sciences, Methods of Empirical Educational Research & Center for Agile PAIR, TU Dortmund University
Annette Lohbeck, Faculty of Rehabilitation Sciences & Center for Agile PAIR, TU Dortmund University
Katja Ickstadt, Department of Statistics & TU Dortmund Center for Data Science and Simulation
Claus Weihs, Department of Statistics
Jakob Schwerter, Hector Research Institute of Education Sciences and Psychology, University of Tübingen
Ulrich Ludewig, Department of Educational Sciences and Psychology, TU Dortmund University
