DMP: Predicting Productivity Levels Based on Coffee Consumption
Creators
Description
Context and Methodology
This dataset is used within a data science project focused on exploring the relationship between coffee consumption and productivity levels. The project is part of a course-based exercise in data stewardship and machine learning.
The primary purpose of the dataset is to:
- Train a machine learning model that predicts a binary productivity variable
- Demonstrate how correlations in data can be misleading (spurious correlation scenario)
- Provide a reproducible ML pipeline including preprocessing, feature engineering, and evaluation
The dataset was not created from scratch. It reuses an existing open data source, the European Health Interview Survey (EHIS), accessed via the European Data Portal.
Dataset creation process:
- Original data collected by EHIS through standardized health surveys across European countries
- Dataset downloaded in tabular format (e.g., CSV)
- Preprocessing steps applied:
  - Removal of irrelevant entries and entries with missing values
  - Selection of relevant features (coffee consumption, sleep, working hours)
  - Encoding and normalization of variables
- Feature engineering:
  - Creation of a derived variable “Productivity” based on working hours and sleep duration
- Final dataset prepared for machine learning tasks
All steps are documented to ensure reproducibility.
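The preprocessing and feature-engineering steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the column names (`coffee_cups_per_day`, `sleep_hours`, `working_hours`), the thresholds used to derive the label, and the choice of min-max scaling are all assumptions, since the source does not specify them. Encoding is omitted because the assumed columns are already numeric.

```python
import pandas as pd

# Hypothetical column names; the real EHIS export uses its own variable codes.
FEATURES = ["coffee_cups_per_day", "sleep_hours", "working_hours"]


def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Select features, drop incomplete rows, derive the label, normalize."""
    # Keep only the relevant features and drop rows with missing values.
    df = raw[FEATURES].dropna().copy()

    # Assumed rule for the derived label: 1 if the respondent works at least
    # 8 hours and sleeps at least 7 hours, else 0. Thresholds are placeholders.
    df["productivity"] = (
        (df["working_hours"] >= 8) & (df["sleep_hours"] >= 7)
    ).astype(int)

    # Min-max normalization of each feature to [0, 1]. Done after the label
    # is derived so the thresholds apply to the original units.
    for col in FEATURES:
        lo, hi = df[col].min(), df[col].max()
        df[col] = (df[col] - lo) / (hi - lo) if hi > lo else 0.0
    return df
```

Because the label is computed directly from two of the input features, a model trained on this table can recover it almost perfectly, which is part of what makes the dataset useful as a cautionary example.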
Technical Details
Dataset structure and organization
The dataset is organized into a simple and reproducible project structure:
│
├── data/
│   ├── raw_data.csv
│   └── processed_data.csv
│
├── notebooks/
│   └── analysis.ipynb
│
├── outputs/
│   ├── histograms.png
│   ├── confusion_matrix.png
│   └── predictions.csv
│
└── docs/
    ├── use_case.pdf
    └── DMP.pdf
Naming conventions:
- Files use descriptive, lowercase names with underscores
- Processed files clearly distinguished from raw data
- Outputs named according to content (e.g., confusion_matrix.png)
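The naming convention above could be checked automatically with a small helper like the one below. The function name `follows_convention` and the exact pattern are assumptions made for illustration; the source states the convention only in prose.

```python
import re

# Lowercase letters and digits separated by single underscores,
# followed by a lowercase file extension.
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)*\.[a-z0-9]+")


def follows_convention(filename: str) -> bool:
    """Return True if the file name is lowercase with underscores only."""
    return NAME_PATTERN.fullmatch(filename) is not None
```

Under this strict reading, `raw_data.csv` and `confusion_matrix.png` pass, while `DMP.pdf` would be flagged because of its uppercase acronym. Whether acronyms are meant to be exempt is not stated in the source.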
Software requirements
To access and use the dataset, the following tools are required:
- Python (version 3.x recommended)
- Jupyter Notebook or JupyterLab
- Libraries:
  - pandas
  - numpy
  - scikit-learn
  - matplotlib
For basic access:
- CSV files can be opened with Excel or any spreadsheet software
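Before opening the notebook, the library requirements above can be verified with a short check such as the following sketch. The helper name `missing_packages` is hypothetical. Note that scikit-learn is installed under the package name `scikit-learn` but imported as `sklearn`.

```python
from importlib.util import find_spec

# Maps the package name used for installation to the module name used in
# `import` statements.
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "matplotlib": "matplotlib",
}


def missing_packages(required=REQUIRED):
    """Return the package names whose modules cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]
```

Calling `missing_packages()` returns an empty list when everything is installed; any names it does return can be installed with pip.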
Additional resources
The dataset is supported by the following materials:
- Jupyter Notebook containing:
  - data preprocessing
  - feature engineering
  - model training and evaluation
- Code comments explaining each step
- Use case description document
- Data Management Plan (DMP)
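The model training and evaluation step in the notebook might look roughly like the sketch below. The source does not name the model or the split ratio, so logistic regression, the 75/25 stratified split, and the function name `train_and_evaluate` are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split


def train_and_evaluate(df: pd.DataFrame, feature_cols, target_col="productivity", seed=42):
    """Fit a binary classifier and report accuracy and the confusion matrix."""
    # Stratify so both classes appear in the training and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        df[feature_cols],
        df[target_col],
        test_size=0.25,
        random_state=seed,
        stratify=df[target_col],
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "model": model,
        "accuracy": accuracy_score(y_test, y_pred),
        # Rows are true classes, columns are predicted classes.
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=[0, 1]),
    }
```

The returned confusion matrix is the kind of array that would be plotted to produce a file such as confusion_matrix.png.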
The original dataset source is publicly available via the European Data Portal and should be cited whenever the data is reused.
Further Details (Important for Reuse)
- The “Productivity” variable is artificially constructed and does not represent a scientifically validated measure
- The dataset is intended for demonstration and educational purposes, not for drawing causal conclusions
- The relationship between coffee consumption and productivity is likely spurious, and users should avoid misinterpretation
- No personal or sensitive data is included, so there are no privacy concerns
- Reusers should:
  - Cite the original data source
  - Document any modifications they make
  - Be aware of limitations in data interpretation
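To make the spurious-correlation caveat concrete, the sketch below uses purely synthetic data (not the EHIS data) in which working hours drive both coffee consumption and a continuous stand-in for productivity. All coefficients, the sample size, and the helper `partial_corr` are assumptions made for illustration. Coffee and the score correlate strongly, yet the association largely disappears once working hours are accounted for.

```python
import numpy as np


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z from both."""
    def residual(v):
        slope, intercept = np.polyfit(z, v, 1)
        return v - (slope * z + intercept)
    return np.corrcoef(residual(x), residual(y))[0, 1]


rng = np.random.default_rng(0)
n = 5000

# Synthetic confounder: working hours influence both other variables.
working_hours = rng.normal(8, 1.5, n)
coffee = 0.5 * working_hours + rng.normal(0, 0.5, n)
score = 1.0 * working_hours + rng.normal(0, 1.0, n)

# Coffee has no direct effect on the score in this simulation.
raw_r = np.corrcoef(coffee, score)[0, 1]
adjusted_r = partial_corr(coffee, score, working_hours)

print(f"raw correlation:      {raw_r:.2f}")
print(f"adjusted for hours:   {adjusted_r:.2f}")
```

The same pattern can arise in the real dataset because the “Productivity” label is itself derived from working hours and sleep duration.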