Multi-Analyst-Study and Hackathon on Interpretable Machine Learning for Longitudinal and Clustered Data
Data with stochastic dependencies is common in many fields, including education, psychology, social sciences, and health sciences. Two common sources of dependencies are repeated measurements of the same individuals in a longitudinal design (e.g., panel data) and clustering with noticeable cluster-level effects, as in cross-sectional data with students nested in schools. Both designs are important for understanding associations and causal mechanisms. Data analysis is changing because current data collection practices yield many variables. This creates opportunities, such as studying dynamic behavior or comparing many competing theories in one analysis, but it also poses challenges in variable selection, feature estimation, and causal effect estimation.
Supervised machine learning (ML) includes powerful prediction models. While their predictive performance is sometimes astounding, many ML approaches are opaque and hard to interpret. In addition, common ML models like the random forest are not designed for dependent observations, as all observations are weighted equally by default. The strong interest in ML in the social and behavioral sciences calls for adaptations of existing approaches and for innovative techniques that overcome inherent limitations with respect to dependent data.
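One practical consequence of dependent observations is that model evaluation should keep whole clusters together: if students from the same school appear in both the training and the test set, estimated performance can be too optimistic. The following is a minimal sketch of a group-wise fold assignment using only the Python standard library. The function name `grouped_folds` and the toy `school` labels are hypothetical and chosen purely for illustration; they do not come from the workshop materials.

```python
from collections import defaultdict


def grouped_folds(groups, n_folds):
    """Split row indices into folds so that each cluster stays in one fold.

    Returns a list of (train_indices, test_indices) pairs, one per fold.
    """
    # Collect the row indices that belong to each cluster.
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)

    # Greedy balancing: place the largest clusters first,
    # each into the fold that currently holds the fewest rows.
    folds = [[] for _ in range(n_folds)]
    for g in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(n_folds), key=lambda k: len(folds[k]))
        folds[smallest].extend(members[g])

    splits = []
    for k in range(n_folds):
        test = sorted(folds[k])
        held_out = set(test)
        train = [i for i in range(len(groups)) if i not in held_out]
        splits.append((train, test))
    return splits


# Hypothetical toy data: 8 students nested in 4 schools.
school = ["A", "A", "B", "B", "B", "C", "D", "D"]
splits = grouped_folds(school, n_folds=2)

for train, test in splits:
    train_schools = sorted({school[i] for i in train})
    test_schools = sorted({school[i] for i in test})
    print("train schools:", train_schools, "| test schools:", test_schools)
```

The same idea applies to longitudinal data by using the person identifier as the group, so that all measurement occasions of one individual end up in the same fold.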
At the same time, research and applications need to go beyond purely pragmatic predictive performance. Interpretability is a key requirement for deeper understanding. Interpretability can either be built into a method (as in regularized regression approaches or CART) or obtained post hoc with techniques like feature importance measures or partial dependence plots. Interestingly, many examples show that interpretability and performance are not necessarily at odds, and especially in high-stakes applications, it will often be beneficial to employ methods with built-in interpretability.
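As an illustration of a post-hoc feature importance measure, the sketch below computes a simple permutation importance: each feature column is shuffled in turn, and the resulting increase in mean squared error is recorded. It is a minimal, standard-library-only example under stated assumptions. The simulated data, the stand-in `model` function, and all function names are hypothetical. Note also that this plain version shuffles across all rows and therefore ignores any cluster or longitudinal structure.

```python
import random


def mse(model, X, y):
    """Mean squared error of a prediction function on rows X and targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Average increase in MSE after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only; all other columns stay untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical simulated data: the outcome depends on feature 0 only.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
y = [3.0 * x0 + data_rng.gauss(0, 0.1) for x0, _ in X]


def model(row):
    """Stand-in for an already fitted black-box model (assumed, not estimated)."""
    return 3.0 * row[0]


importances = permutation_importance(model, X, y)
print("importance per feature:", importances)
```

Because the stand-in model never uses the second feature, shuffling that column leaves the error unchanged, while shuffling the first feature increases it substantially.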
Aims
- Discuss recent developments in prediction modelling of dependent outcomes and trajectories.
- Compare approaches on two datasets with a multi-analyst approach.
- Derive and publish authoritative guidelines for research applying supervised machine learning to data with stochastic dependencies.
Examples of methods within scope
- classification and regression trees and related methods
- regularized regression models, regularized structural equation models, and other statistical learning models
- well-explained black box models, including (tree-)ensembles and (deep) neural networks
- functional/additive models
- surrogate modelling and explainability techniques
Organizers
- Philipp Doebler, Department of Statistics & Center for Agile PAIR, TU Dortmund University
- Jörg-Tobias Kuhn, Faculty of Rehabilitation Sciences, Methods of Empirical Educational Research & Center for Agile PAIR, TU Dortmund University
- Annette Lohbeck, Faculty of Rehabilitation Sciences & Center for Agile PAIR, TU Dortmund University
- Katja Ickstadt, Department of Statistics & TU Dortmund Center for Data Science and Simulation
- Claus Weihs, Department of Statistics
- Jakob Schwerter, Hector Research Institute of Education Sciences and Psychology, University of Tübingen
- Ulrich Ludewig, Department of Educational Sciences and Psychology, TU Dortmund University
