DMP: Predicting Road Traffic Accident Severity Using Machine Learning

Edeh, Ekene

doi:10.70124/r07ry-eb445

Published March 29, 2026 | Version 1

Data Management Plan Open

Edeh, Ekene (Contact person)¹

1. TU Wien

Road Traffic Accident Severity Prediction

Context and Methodology

This dataset is used in a machine learning project focused on predicting road traffic accident severity. The research domain is data science and transportation safety analysis.

The purpose of this dataset is to train and evaluate a classification model that predicts whether an accident is Slight, Serious, or Fatal based on accident conditions.

The dataset is not newly collected. It is reused from an open data source (EU Open Data Portal). The project involves:

downloading the dataset
cleaning and preprocessing the data
encoding categorical variables
splitting the data into training, validation, and test sets
training a machine learning model (Random Forest)
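
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy DataFrame stands in for the downloaded CSV, the column names (`weather`, `road_type`, `speed_limit`, `severity`) and the value `n_estimators=100` are assumptions, and only the 70/15/15 split, the one-hot encoding, the Random Forest model, and the seed 42 are taken from this plan.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the downloaded CSV;
# the real column names are documented in the data dictionary.
df = pd.DataFrame({
    "weather": ["Clear", "Rain", "Fog", "Clear", "Rain"] * 20,
    "road_type": ["Urban", "Rural", "Motorway", "Rural", "Urban"] * 20,
    "speed_limit": [50, 80, 130, 100, 30] * 20,
    "severity": ["Slight", "Serious", "Fatal", "Serious", "Slight"] * 20,
})

# Cleaning (simplified assumption): drop rows with missing values.
df = df.dropna()

# One-hot encode the categorical feature columns; numeric columns pass through.
X = pd.get_dummies(df.drop(columns="severity"), dtype=int)
y = df["severity"]

# Two-stage split: 70% training, then the remaining 30% halved into 15% / 15%.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42
)

# Train the Random Forest classifier with a fixed seed for reproducibility.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

val_accuracy = model.score(X_val, y_val)
print(len(X_train), len(X_val), len(X_test))
```

Splitting in two stages is one common way to obtain three sets with `train_test_split`, which only produces two partitions per call.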

Additional datasets are created during the project, including cleaned data, model outputs, and evaluation results.

Technical Details

The project files are organized in the following folder structure:

data/raw
data/processed
src
outputs
models
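
For anyone recreating this layout locally, a small helper along these lines could set it up. The function name `create_layout` is hypothetical; only the folder names come from the list above.

```python
import tempfile
from pathlib import Path

# Folder names taken from the layout listed above.
FOLDERS = ["data/raw", "data/processed", "src", "outputs", "models"]


def create_layout(root: Path) -> list[Path]:
    """Create the project folders under `root` and return their paths."""
    created = []
    for name in FOLDERS:
        path = root / name
        # parents=True also creates the intermediate "data" folder.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demonstrated in a temporary directory so no real project folder is touched.
with tempfile.TemporaryDirectory() as tmp:
    paths = create_layout(Path(tmp))
    print([p.relative_to(tmp).as_posix() for p in paths])
```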

Files include:

raw dataset (CSV): original downloaded file
processed dataset (CSV): cleaned and encoded data
train/validation/test splits (CSV): dataset split 70/15/15 into training, validation, and test sets
predictions (CSV): model predictions
plots (PNG): histograms, confusion matrix, feature importance
model file (PKL): trained machine learning model
source code (PY, IPYNB): scripts and notebook
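
As an illustration of how the model file (PKL) and the predictions file (CSV) listed above might be written and read back, the following sketch uses Python's standard `pickle` module. The file names `model.pkl` and `predictions.csv`, the column name `predicted_severity`, and the toy features are assumptions, not the project's actual names.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy features and labels standing in for the processed data.
X = pd.DataFrame({
    "speed_limit": [30, 50, 80, 100, 130, 50],
    "is_rural": [0, 0, 1, 1, 1, 0],
})
y = ["Slight", "Slight", "Serious", "Serious", "Fatal", "Slight"]

model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"        # hypothetical file name
    pred_path = Path(tmp) / "predictions.csv"   # hypothetical file name

    # Save the trained model, then load it back.
    with model_path.open("wb") as f:
        pickle.dump(model, f)
    with model_path.open("rb") as f:
        restored = pickle.load(f)

    # Write predictions from the restored model to CSV and read them back.
    predictions = pd.DataFrame({"predicted_severity": restored.predict(X)})
    predictions.to_csv(pred_path, index=False)
    reloaded = pd.read_csv(pred_path)

    # The restored model should reproduce the original model's predictions.
    same = list(restored.predict(X)) == list(model.predict(X))
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is best loaded with the same scikit-learn version that produced it.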

Naming convention:

Software requirements:

To use this dataset, the following tools are needed:

Python 3
pandas
scikit-learn
matplotlib

All tools are open source.

Additional resources:

source code for preprocessing and training
Jupyter notebook for exploration
README file explaining the workflow
data dictionary (column meanings)

Further Details

The dataset contains no personal or sensitive data. All records describe accident events only.

Important points for reuse:

The severity classes are imbalanced (many Fatal accidents, fewer Slight ones), which can bias the model toward the majority class and should be taken into account when evaluating performance
Categorical variables are encoded using one-hot encoding
Random processes (e.g., dataset split) use a fixed seed (random_state=42) to ensure reproducibility
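
One possible way to account for the class imbalance when reusing the data is sketched below. This plan does not state that stratified splitting or class weighting were used in the project, so `stratify=` and `class_weight="balanced"` are assumptions shown for illustration; the toy data, its class proportions, and the column names are hypothetical as well.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data; the proportions are illustrative only.
labels = ["Fatal"] * 60 + ["Serious"] * 30 + ["Slight"] * 10
df = pd.DataFrame({
    "speed_limit": [130] * 60 + [80] * 30 + [30] * 10,
    "night": [i % 2 for i in range(100)],
    "severity": labels,
})

# Inspect the class distribution before modelling.
class_share = df["severity"].value_counts(normalize=True)
print(class_share)

X = df[["speed_limit", "night"]]
y = df["severity"]

# Stratifying keeps the class proportions the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# class_weight="balanced" reweights classes inversely to their frequency.
clf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42
)
clf.fit(X_train, y_train)

# Per-class precision and recall are more informative than overall accuracy
# when the classes are imbalanced.
report = classification_report(y_test, clf.predict(X_test), zero_division=0)
print(report)
```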

The original dataset is publicly available and can be reused under its open data license.
