The "Active Mind" – Predicting Academic Resilience through Physical Activity
Description
Context and methodology
Research Domain/Project: This dataset is part of the "Active Mind" project within the FAIR Data Science course (DaSt 2026). It explores the intersection of physical health and academic performance in higher education.
Purpose: The dataset serves as the basis for training a Random Forest Classifier to predict student academic satisfaction. It is used to demonstrate robust Data Management Plans and FAIR principles in an ML workflow.
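The kind of workflow this dataset supports can be sketched roughly as follows, assuming scikit-learn's `RandomForestClassifier`. The column names (`sports_frequency`, `activity_type`, `satisfied`), the synthetic rows, and the hyperparameters are hypothetical placeholders, not variables or settings taken from the actual project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: column names and values are hypothetical,
# not taken from the survey.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "sports_frequency": rng.integers(0, 8, size=n),  # e.g. sessions per week
    "activity_type": rng.integers(0, 4, size=n),     # e.g. encoded category
})
# Hypothetical binary target: 1 = satisfied, 0 = not satisfied
noise = rng.integers(0, 4, size=n)
df["satisfied"] = (df["sports_frequency"] + noise > 5).astype(int)

X = df[["sports_frequency", "activity_type"]]
y = df["satisfied"]

# Hold out a quarter of the rows to estimate performance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy on synthetic data: {accuracy:.2f}")
```

Fixing `random_state` in both the split and the classifier keeps the run reproducible, which matters more for a data management demonstration than the accuracy figure itself.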
Creation: This is a secondary dataset derived from the Austrian Student Social Survey 2024. It was created by filtering the original survey for variables related to sports frequency, activity types, and satisfaction scores.
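The filtering step could look something like the sketch below, assuming pandas. Only `v158` appears in this description; the other variable codes, the helper name `derive_subset`, and the choice to drop incomplete rows are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical variable codes. Only "v158" is mentioned in this description;
# its meaning is documented in README.md and is not assumed here.
KEEP_COLUMNS = ["v158", "v_sport_freq", "v_activity_type"]


def derive_subset(survey: pd.DataFrame, keep=KEEP_COLUMNS) -> pd.DataFrame:
    """Keep only the selected variables and drop rows with missing values."""
    missing = [c for c in keep if c not in survey.columns]
    if missing:
        raise KeyError(f"Columns not found in survey data: {missing}")
    # Dropping incomplete rows is an assumption, not a documented step
    return survey.loc[:, keep].dropna().reset_index(drop=True)


# Tiny made-up frame standing in for the original survey export
survey = pd.DataFrame({
    "v158": [4, 2, None, 5],
    "v_sport_freq": [3, 0, 1, 7],
    "v_activity_type": [1, 2, 2, 3],
    "v_unrelated": ["a", "b", "c", "d"],
})

subset = derive_subset(survey)
print(subset.shape)
```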
Technical details
Structure & Naming: The dataset follows a modular structure: /data contains the raw and cleaned CSV files, while /models contains the trained .pkl files. Files follow a strict naming convention: ActiveMind_Dataset_v1.csv and ActiveMind_Model_v1.pkl.
Software Requirements: To work with this dataset, Python 3.x is required, specifically the pandas library for data manipulation and scikit-learn for executing the machine learning model.
Additional Resources: Full implementation details, including the Data Flow Diagram and the Python source code, are provided in the project repository. A README.md file is included to map the survey variable codes (e.g., v158) to their human-readable descriptions.
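Given the layout and file names above, loading the data and the model might look like the following sketch. It assumes the model was serialized with Python's standard `pickle` module (the description only states the `.pkl` extension); the helper names `load_dataset` and `load_model` and the optional `code_labels` mapping are hypothetical.

```python
import pickle
from pathlib import Path

import pandas as pd

# Paths follow the naming convention stated in this description
DATA_FILE = Path("data") / "ActiveMind_Dataset_v1.csv"
MODEL_FILE = Path("models") / "ActiveMind_Model_v1.pkl"


def load_dataset(root=".", code_labels=None):
    """Read the CSV and optionally rename variable codes to readable labels."""
    df = pd.read_csv(Path(root) / DATA_FILE)
    if code_labels:
        # code_labels is a dict such as {"v158": "<label from README.md>"}
        df = df.rename(columns=code_labels)
    return df


def load_model(root="."):
    """Load the trained model; assumes it was saved with pickle."""
    # Only unpickle files from a source you trust
    with open(Path(root) / MODEL_FILE, "rb") as fh:
        return pickle.load(fh)
```

Called from the repository root, `load_dataset(code_labels={...})` returns the data with readable column names once the mapping is copied from README.md, and `load_model()` returns the fitted estimator. Loading it requires scikit-learn to be installed, ideally in the version used for training.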
Further details
Reuse Instructions: Anyone reusing this dataset must acknowledge the original data source (AUSSDA) via the DOI 10.11587/3YHJJ3. Users should be aware that the model is provided as a framework for analysis: its predictive performance was secondary to the data management simulation the project was designed to demonstrate.