Published April 12, 2026 | Version v1
Dataset Open

DMP: Predicting Productivity Levels Based on Coffee Consumption

  • 1. TU Wien

Description

Context and Methodology

This dataset is used within a data science project focused on exploring the relationship between coffee consumption and productivity levels. The project is part of a course-based exercise in data stewardship and machine learning.

The primary purposes of the dataset are to:

  • Train a machine learning model that predicts a binary productivity variable
  • Demonstrate how correlations in data can be misleading (spurious correlation scenario)
  • Provide a reproducible ML pipeline including preprocessing, feature engineering, and evaluation

The dataset was not created from scratch. It reuses an existing open data source, the European Health Interview Survey (EHIS), accessed via the European Data Portal.

Dataset creation process:

  • Original data collected by EHIS through standardized health surveys across European countries
  • Dataset downloaded in tabular format (e.g., CSV)
  • Preprocessing steps applied:
    • Removal of irrelevant variables and of records with missing values
    • Selection of relevant features (coffee consumption, sleep, working hours)
    • Encoding and normalization of variables
  • Feature engineering:
    • Creation of a derived variable “Productivity” based on working hours and sleep duration
  • Final dataset prepared for machine learning tasks

All steps are documented to ensure reproducibility.
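
The steps listed above can be sketched as follows. This is a minimal illustration on a small made-up table, not the project's actual notebook code: the column names, the choice of min-max scaling, and the rule used to derive the binary productivity label (at least 8 working hours and at least 7 hours of sleep) are assumptions chosen for the example.

```python
import pandas as pd

# Made-up stand-in for raw_data.csv; all column names are hypothetical.
raw = pd.DataFrame({
    "coffee_cups_per_day": [0, 2, 4, None, 3, 1],
    "sleep_hours": [8.0, 7.0, 5.5, 6.0, 7.5, None],
    "working_hours": [6.0, 8.0, 10.0, 9.0, 8.5, 7.0],
    "respondent_region": ["AT", "DE", "AT", "FR", "DE", "AT"],  # treated as irrelevant
})

FEATURES = ["coffee_cups_per_day", "sleep_hours", "working_hours"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Select features, drop incomplete rows, derive the label, normalize."""
    # Feature selection and removal of rows with missing values.
    selected = df[FEATURES].dropna().reset_index(drop=True)

    # Assumed rule for the derived binary "productivity" variable.
    selected["productivity"] = (
        (selected["working_hours"] >= 8.0) & (selected["sleep_hours"] >= 7.0)
    ).astype(int)

    # Min-max normalization of each feature to the range [0, 1].
    for col in FEATURES:
        lo, hi = selected[col].min(), selected[col].max()
        selected[col] = (selected[col] - lo) / (hi - lo) if hi > lo else 0.0
    return selected


processed = preprocess(raw)
print(processed)
```

The label is derived before normalization so that the thresholds stay in the original units (hours).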

Technical Details

Dataset structure and organization

The dataset is organized into a simple and reproducible project structure:

 
project-folder/
├── data/
│   ├── raw_data.csv
│   └── processed_data.csv
├── notebooks/
│   └── analysis.ipynb
├── outputs/
│   ├── histograms.png
│   ├── confusion_matrix.png
│   └── predictions.csv
└── docs/
    ├── use_case.pdf
    └── DMP.pdf
 

Naming conventions:

  • Files use descriptive, lowercase names with underscores
  • Processed files are clearly distinguished from raw data
  • Outputs are named according to content (e.g., confusion_matrix.png)
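
A naming rule of this kind can be checked automatically. The sketch below is one possible interpretation, assuming that "lowercase names with underscores" means lowercase letters and digits separated by single underscores, followed by a lowercase extension. The pattern and the function name are illustrative and not part of the project.

```python
import re

# Assumed interpretation of the convention: lowercase words joined by
# underscores, then a dot and a lowercase extension.
NAME_PATTERN = re.compile(r"[a-z0-9]+(_[a-z0-9]+)*\.[a-z0-9]+")


def follows_convention(filename: str) -> bool:
    """Return True if the file name matches the assumed naming rule."""
    return NAME_PATTERN.fullmatch(filename) is not None


for name in ["raw_data.csv", "confusion_matrix.png", "Raw Data.csv"]:
    print(name, follows_convention(name))
```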

Software requirements

To access and use the dataset, the following tools are required:

  • Python (version 3.x recommended)
  • Jupyter Notebook or JupyterLab
  • Libraries:
    • pandas
    • numpy
    • scikit-learn
    • matplotlib

For basic access:

  • CSV files can be opened with Excel or any spreadsheet software
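
For programmatic access, the CSV files can be read with pandas. The sketch below uses an in-memory stand-in for data/processed_data.csv so that it runs on its own. The column names and values are hypothetical, and `load_table` is an illustrative helper, not a function shipped with the dataset.

```python
import io

import pandas as pd


def load_table(source) -> pd.DataFrame:
    """Read a CSV from a file path or a file-like object."""
    return pd.read_csv(source)


# In-memory stand-in for data/processed_data.csv (hypothetical columns).
sample_csv = io.StringIO(
    "coffee_cups_per_day,sleep_hours,working_hours,productivity\n"
    "2,7.0,8.0,1\n"
    "4,5.5,10.0,0\n"
)

df = load_table(sample_csv)
print(df.dtypes)      # column types inferred by pandas
print(df.describe())  # basic summary statistics
```

With the real files in place, the same helper could be called as `load_table("data/processed_data.csv")`.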

Additional resources

The dataset is supported by the following materials:

  • Jupyter Notebook containing:
    • data preprocessing
    • feature engineering
    • model training and evaluation
  • Code comments explaining each step
  • Use case description document
  • Data Management Plan (DMP)
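
The model training and evaluation step covered by the notebook could look roughly like the sketch below. It is not the notebook's code: the data are synthetic, the feature names and the labelling rule are assumptions, and logistic regression is used only as a stand-in, since the description does not name the model type.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; distributions and the labelling rule are assumptions.
rng = np.random.default_rng(42)
n = 400
sleep_hours = rng.normal(7.0, 1.0, n)
working_hours = rng.normal(8.0, 1.5, n)
coffee_cups = np.clip(0.4 * working_hours + rng.normal(0.0, 1.0, n), 0.0, None)
productivity = ((working_hours >= 8.0) & (sleep_hours >= 7.0)).astype(int)

X = np.column_stack([coffee_cups, sleep_hours, working_hours])
y = productivity

# Hold out a stratified test set so both classes appear in the evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaling and classifier in one pipeline, fitted on the training split only.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print(cm)
print(f"accuracy: {acc:.3f}")
```

Because the label in this sketch is computed from two of the input features, the reported accuracy says little about real-world predictive value.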

The original data source is publicly available via the European Data Portal and should be cited when the data are reused.

Further Details (Important for Reuse)

  • The “Productivity” variable is artificially constructed and does not represent a scientifically validated measure
  • The dataset is intended for demonstration and educational purposes, not for drawing causal conclusions
  • The relationship between coffee consumption and productivity is likely spurious, and users should avoid misinterpretation
  • No personal or sensitive data is included, so there are no privacy concerns
  • Reusers should:
    • Cite the original data source
    • Document any modifications they make
    • Be aware of limitations in data interpretation
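
The spurious-correlation caveat above can be illustrated with synthetic numbers. In the sketch below, working hours act as a common driver of both coffee consumption and a productivity score, and coffee has no direct effect on the score at all. All variable names, coefficients, and distributions are assumptions made for the illustration and are not derived from the EHIS data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical common driver and two variables that both depend on it.
working_hours = rng.normal(8.0, 1.5, n)
coffee_cups = 0.5 * working_hours + rng.normal(0.0, 1.0, n)
productivity_score = 1.0 * working_hours + rng.normal(0.0, 1.0, n)


def residuals(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Remove the linear effect of x from y via a least-squares fit."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Plain correlation between coffee and productivity.
raw_corr = np.corrcoef(coffee_cups, productivity_score)[0, 1]

# Partial correlation after controlling for working hours.
partial_corr = np.corrcoef(
    residuals(coffee_cups, working_hours),
    residuals(productivity_score, working_hours),
)[0, 1]

print(f"raw correlation:     {raw_corr:.3f}")
print(f"partial correlation: {partial_corr:.3f}")
```

The raw correlation is clearly positive, while the partial correlation is close to zero once working hours are accounted for, which is the pattern expected when an association is produced by a shared driver instead of a direct effect.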

Files

DMP Predicting Productivity Levels Based on Coffee Consumption.pdf

Files (359.9 KiB)