DMP: Predicting Productivity Levels Based on Coffee Consumption

Nisar, Ahsan

doi:10.70124/x2ksk-zsy49

Published April 12, 2026 | Version v1

Dataset Open

Nisar, Ahsan (Contact person)¹

1. TU Wien

Context and Methodology

This dataset is used within a data science project focused on exploring the relationship between coffee consumption and productivity levels. The project is part of a course-based exercise in data stewardship and machine learning.

The primary purposes of the dataset are to:

- Train a machine learning model that predicts a binary productivity variable
- Demonstrate how correlations in data can be misleading (spurious correlation scenario)
- Provide a reproducible ML pipeline including preprocessing, feature engineering, and evaluation

The dataset was not created from scratch. It reuses an existing open data source, the European Health Interview Survey (EHIS), accessed via the European Data Portal.

Dataset creation process:

- Original data collected by EHIS through standardized health surveys across European countries
- Dataset downloaded in tabular format (e.g., CSV)
- Preprocessing steps applied:
  - Removal of irrelevant or missing values
  - Selection of relevant features (coffee consumption, sleep, working hours)
  - Encoding and normalization of variables
- Feature engineering:
  - Creation of a derived variable “Productivity” based on working hours and sleep duration
- Final dataset prepared for machine learning tasks

All steps are documented to ensure reproducibility.
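
The preprocessing and feature engineering steps listed above could look roughly like the following pandas sketch. The column names, the thresholds in the productivity rule, and the choice of min-max normalization are all assumptions made for illustration; the record does not state the actual rule or scaling method, and encoding of categorical variables is left out.

```python
import pandas as pd

# Hypothetical column names; the real EHIS extract will use different ones.
FEATURES = ["coffee_cups_per_day", "sleep_hours", "working_hours"]


def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Select features, drop incomplete rows, derive the label, normalize."""
    df = raw[FEATURES].dropna().copy()

    # Assumed rule for the derived label: at least 8 working hours
    # and at least 7 hours of sleep. The thresholds are illustrative only.
    df["productivity"] = (
        (df["working_hours"] >= 8) & (df["sleep_hours"] >= 7)
    ).astype(int)

    # Min-max normalization of the numeric features to [0, 1].
    for col in FEATURES:
        lo, hi = df[col].min(), df[col].max()
        df[col] = (df[col] - lo) / (hi - lo) if hi > lo else 0.0

    return df.reset_index(drop=True)


# Tiny made-up sample, not EHIS data.
raw = pd.DataFrame(
    {
        "coffee_cups_per_day": [0, 2, 4, None, 3],
        "sleep_hours": [8, 6, 7, 7, 9],
        "working_hours": [9, 10, 8, 8, 4],
        "respondent_id": [1, 2, 3, 4, 5],  # not selected, so it is dropped
    }
)
processed = preprocess(raw)
```

The label is derived before normalization so that the thresholds apply to the original units (hours) rather than to the scaled values.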

Technical Details

Dataset structure and organization

The dataset is organized into a simple and reproducible project structure:

project-folder/
│
├── data/
│ ├── raw_data.csv
│ └── processed_data.csv
│
├── notebooks/
│ └── analysis.ipynb
│
├── outputs/
│ ├── histograms.png
│ ├── confusion_matrix.png
│ └── predictions.csv
│
├── docs/
│ ├── use_case.pdf
│ └── DMP.pdf

Naming conventions:

- Files use descriptive, lowercase names with underscores
- Processed files clearly distinguished from raw data
- Outputs named according to content (e.g., confusion_matrix.png)
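
The naming convention can be checked mechanically. The sketch below assumes a strict reading of it (lowercase letters, digits, and underscores, followed by a lowercase extension); the pattern and the helper names are illustrative and not part of the project.

```python
import re

# Assumed reading of the convention: lowercase words joined by
# underscores, followed by a lowercase extension.
NAME_PATTERN = re.compile(r"[a-z0-9]+(_[a-z0-9]+)*\.[a-z0-9]+")


def follows_convention(filename: str) -> bool:
    """Return True if the file name matches the assumed naming pattern."""
    return NAME_PATTERN.fullmatch(filename) is not None


def non_conforming(filenames: list[str]) -> list[str]:
    """Return the file names that do not match the assumed pattern."""
    return [name for name in filenames if not follows_convention(name)]


flagged = non_conforming(["raw_data.csv", "confusion_matrix.png", "DMP.pdf"])
```

Under this strict reading, `DMP.pdf` from the listing above is flagged because of its uppercase name, while the data and output files pass.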

Software requirements

To access and use the dataset, the following tools are required:

- Python (version 3.x recommended)
- Jupyter Notebook or JupyterLab
- Libraries:
  - pandas
  - numpy
  - scikit-learn
  - matplotlib

For basic access:

CSV files can be opened with Excel or any spreadsheet software
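
Before opening the notebook, the library requirements above can be verified with a short standard-library check. This is a minimal sketch; the function name `missing_packages` is hypothetical, and no minimum versions are checked because the record does not specify any.

```python
from importlib.util import find_spec

# Maps each required package to its import name.
# scikit-learn is imported as "sklearn"; the others match.
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "matplotlib": "matplotlib",
}


def missing_packages(required: dict[str, str] = REQUIRED) -> list[str]:
    """Return the package names whose import module cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```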

Additional resources

The dataset is supported by the following materials:

- Jupyter Notebook containing:
  - data preprocessing
  - feature engineering
  - model training and evaluation
- Code comments explaining each step
- Use case description document
- Data Management Plan (DMP)
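
For orientation, the model training and evaluation step in the notebook might resemble the scikit-learn sketch below. The record does not name the model used, so logistic regression is an assumption here, and the data are synthetic stand-ins rather than EHIS values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the processed data: one coffee feature and a
# binary label. Not drawn from EHIS.
coffee = rng.uniform(0, 6, size=400)
labels = (coffee + rng.normal(0, 1.5, size=400) > 3).astype(int)
X = coffee.reshape(-1, 1)

# Hold out 25% of the rows for evaluation, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels
)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted
```

A plot of `cm` would correspond to an output such as confusion_matrix.png, and the values in `y_pred` to predictions.csv.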

The original data source is publicly available via the European Data Portal and should be cited when the data are reused.

Further Details (Important for Reuse)

- The “Productivity” variable is artificially constructed and does not represent a scientifically validated measure
- The dataset is intended for demonstration and educational purposes, not for drawing causal conclusions
- The relationship between coffee consumption and productivity is likely spurious, and users should avoid misinterpretation
- No personal or sensitive data is included, so there are no privacy concerns
- Reusers should:
  - Cite the original data source
  - Document any modifications they make
  - Be aware of limitations in data interpretation
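
One way a spurious correlation of this kind can arise is sketched below with synthetic data. It assumes, purely for illustration, that working hours influence both coffee intake and a productivity score; none of the coefficients come from EHIS or from the project.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Assumed data-generating process (synthetic, not EHIS):
# working hours drive both coffee intake and the productivity score,
# while coffee has no direct effect on productivity.
working_hours = rng.normal(8, 2, size=n)
coffee = 0.5 * working_hours + rng.normal(0, 1, size=n)
productivity_score = working_hours + rng.normal(0, 1, size=n)


def residuals(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Remove the linear effect of x from y."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Raw correlation between coffee and productivity.
raw_corr = np.corrcoef(coffee, productivity_score)[0, 1]

# Partial correlation after controlling for working hours.
partial_corr = np.corrcoef(
    residuals(coffee, working_hours),
    residuals(productivity_score, working_hours),
)[0, 1]

print(f"raw correlation:     {raw_corr:.2f}")
print(f"partial correlation: {partial_corr:.2f}")
```

In this setup the raw correlation is clearly positive, while the partial correlation is close to zero once working hours are accounted for, which is why a model can predict the label from coffee consumption without coffee having any causal role.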
