Home Advantage in European Football Leagues: A Comparison of Poland and England
Contributors
Supervisor:
- 1. SBA Research
- 2. TU Wien
Description
Context and methodology
This dataset was created for a small research project in the field of sports analytics and research data management, carried out in the course Introduction to Research Data Management at TU Wien. The project investigates whether playing at home is associated with higher win rates in professional football (soccer).
The deposited data serve two main purposes: (i) to document the empirical results of the home-advantage study, and (ii) to provide a reproducible, well-documented example of a small research data package that can be reused in teaching or extended in further analyses.
The processed datasets are derived from openly available historical match results for the Polish Ekstraklasa and the English Premier League. The original CSV files were downloaded from football-data.co.uk and the “engsoccerdata” GitHub repository. Using Python (pandas) and Jupyter Notebooks, the raw files were cleaned (type checks, duplicate removal, consistency checks on goals and match counts) and aggregated into league- and overall-level statistics of home wins, draws and away wins. No personal or sensitive data are involved at any stage.
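The cleaning and aggregation steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the column names (league, home_goals, away_goals), the function name summarise_results and the toy rows are all hypothetical and invented for the example. Real source files use their own column headers and would need to be mapped first.

```python
import pandas as pd

# Column order for the summary tables (names are illustrative assumptions).
RESULT_ORDER = ["home_win", "draw", "away_win"]


def summarise_results(matches: pd.DataFrame):
    """Return (counts, percentages) of match outcomes per league."""
    # Duplicate removal: drop exact duplicate rows and work on a copy.
    clean = matches.drop_duplicates().copy()

    # Type check: coerce goals to integers; non-numeric or missing values raise.
    for col in ("home_goals", "away_goals"):
        clean[col] = pd.to_numeric(clean[col], errors="raise").astype(int)

    # Consistency check: goal counts cannot be negative.
    if (clean[["home_goals", "away_goals"]] < 0).any().any():
        raise ValueError("negative goal count found")

    # Classify each match from the home side's point of view.
    diff = clean["home_goals"] - clean["away_goals"]
    clean["result"] = "draw"
    clean.loc[diff > 0, "result"] = "home_win"
    clean.loc[diff < 0, "result"] = "away_win"

    # Aggregate to league level: absolute counts and row-wise percentages.
    counts = pd.crosstab(clean["league"], clean["result"]).reindex(
        columns=RESULT_ORDER, fill_value=0
    )
    percentages = counts.div(counts.sum(axis=1), axis=0).mul(100).round(1)
    return counts, percentages


# Toy rows invented purely for illustration (not taken from the dataset).
toy = pd.DataFrame(
    {
        "league": ["PL", "PL", "PL", "PL", "EN", "EN"],
        "date": ["2020-01-01", "2020-01-01", "2020-01-02",
                 "2020-01-03", "2020-01-01", "2020-01-02"],
        "home_team": ["A", "A", "B", "C", "X", "Y"],
        "away_team": ["B", "B", "A", "A", "Y", "X"],
        "home_goals": [2, 2, 0, 0, 3, 1],
        "away_goals": [1, 1, 0, 1, 0, 0],
    }
)

counts, percentages = summarise_results(toy)
print(counts)
print(percentages)
```

Fixing the column order with reindex keeps the output tables comparable across leagues even when one outcome never occurs in a given subset.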
Technical details
The deposit is organised as a compressed project directory with a simple, self-describing folder structure:
- data/processed/: six CSV files with match-level and aggregated statistics (overall distribution of results, counts and percentages by league, and summary tables used in the analysis).
- notebooks/: a Jupyter notebook implementing the full data-cleaning, aggregation and visualisation workflow.
- src/: small Python helper scripts for loading, transforming and checking the data.
- reports/: the main PDF report and the key figures (overall result distribution, home vs. away win rates).
- fair_assessment/: material documenting the FAIR assessment of the dataset.
- README and license files: human-readable documentation of the project structure, provenance of the data and licensing information.
All tabular data are stored in plain text CSV format and can be opened with common tools such as spreadsheet software (e.g. Excel, LibreOffice) or analysis environments (Python, R). The computational workflow requires Python 3 with standard scientific libraries (pandas, matplotlib) and Jupyter; the README file provides basic instructions for rerunning the analysis.
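As a starting point for reuse in Python, the processed tables could be loaded along these lines. The sketch assumes only that the unpacked deposit contains a data/processed/ folder with CSV files, as described above. The function name load_processed_tables is hypothetical, and the files are read with pandas defaults, which may need adjusting (for example separator or encoding) for individual files.

```python
from pathlib import Path

import pandas as pd


def load_processed_tables(project_root):
    """Read every CSV under <project_root>/data/processed into a dict.

    Keys are file names without the .csv extension; values are DataFrames.
    """
    folder = Path(project_root) / "data" / "processed"
    if not folder.is_dir():
        raise FileNotFoundError(f"expected folder not found: {folder}")

    tables = {}
    # Sort the paths so the load order is reproducible across platforms.
    for csv_path in sorted(folder.glob("*.csv")):
        tables[csv_path.stem] = pd.read_csv(csv_path)
    return tables
```

Calling load_processed_tables(".") from the unpacked project directory would return one DataFrame per processed file, which can then be inspected with the usual pandas methods such as head() or describe().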
Further details
Only processed, derived datasets are redistributed in this deposit. The original raw data from football-data.co.uk and the engsoccerdata project are not included; instead, stable source URLs and usage conditions are documented in the README and the associated Data Management Plan.
The processed datasets in data/processed/ are released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, allowing reuse and redistribution with appropriate attribution. The accompanying analysis code and notebooks are licensed under the MIT license, enabling adaptation and integration into teaching materials, student projects or further research on home advantage and sports analytics.
The dataset is primarily intended for students and researchers interested in sports statistics, data science and research data management. It can be reused for teaching examples on data cleaning and aggregation, replication of the presented results, or as a starting point for extended analyses of home advantage in other leagues or time periods.
The full project is also available on GitHub:
https://github.com/bartciesielski/2025_Football_Home_Advantage