DMP: Predicting Road Traffic Accident Severity Using Machine Learning
Creators
Description
Road Traffic Accident Severity Prediction
Context and Methodology
This dataset is used in a machine learning project focused on predicting road traffic accident severity. The research domain is data science and transportation safety analysis.
The purpose of this dataset is to train and evaluate a classification model that predicts whether an accident is Slight, Serious, or Fatal based on accident conditions.
The dataset is not newly collected. It is reused from an open data source (EU Open Data Portal). The project involves:
- downloading the dataset
- cleaning and preprocessing the data
- encoding categorical variables
- splitting the data into training, validation, and test sets
- training a machine learning model (Random Forest)
Additional datasets are created during the project, including cleaned data, model outputs, and evaluation results.
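The workflow steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`weather`, `road_type`, `speed_limit`, `severity`), the synthetic data, and the hyperparameters are assumptions made for the example, since the real schema is described in the data dictionary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42  # fixed seed, matching the random_state mentioned in this plan

# Synthetic stand-in for the downloaded raw table (hypothetical columns).
rng = np.random.default_rng(SEED)
n_rows = 200
raw = pd.DataFrame({
    "weather": rng.choice(["dry", "rain", "snow"], size=n_rows),
    "road_type": rng.choice(["urban", "rural", "motorway"], size=n_rows),
    "speed_limit": rng.choice([30, 50, 70], size=n_rows),
    "severity": rng.choice(["Slight", "Serious", "Fatal"], size=n_rows),
})

# Cleaning: here only rows with missing values are dropped.
clean = raw.dropna().reset_index(drop=True)

# Encoding: one-hot encode the categorical feature columns.
X = pd.get_dummies(clean.drop(columns="severity"), columns=["weather", "road_type"])
y = clean["severity"]

# Splitting: keep 70% for training, then halve the remaining 30%
# into validation and test sets (15% each).
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=SEED
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=SEED
)

# Training: fit a Random Forest classifier on the training split.
model = RandomForestClassifier(n_estimators=100, random_state=SEED)
model.fit(X_train, y_train)

val_accuracy = model.score(X_val, y_val)
print(f"validation accuracy: {val_accuracy:.2f}")
```

Because the labels here are random, the printed accuracy carries no meaning; the sketch only shows how the steps connect.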
Technical Details
The project files are organized in the following folder structure:
data/raw
data/processed
src
outputs
models
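For anyone recreating this layout, the folders could be set up with a few lines of Python. The helper name `create_layout` is hypothetical, and the demo writes into a temporary directory rather than a real project root.

```python
import tempfile
from pathlib import Path

# Folder names taken from the structure listed above.
FOLDERS = ["data/raw", "data/processed", "src", "outputs", "models"]


def create_layout(root):
    """Create the project folders under root and return their paths."""
    root = Path(root)
    paths = [root / folder for folder in FOLDERS]
    for path in paths:
        # parents=True also creates the intermediate "data" folder.
        path.mkdir(parents=True, exist_ok=True)
    return paths


# Demo in a throwaway directory so nothing is written to a real project.
with tempfile.TemporaryDirectory() as tmp:
    created = create_layout(tmp)
    created_names = sorted(p.relative_to(tmp).as_posix() for p in created)
    print(created_names)
```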
Files include:
- raw dataset (CSV): original downloaded file
- processed dataset (CSV): cleaned and encoded data
- train/validation/test splits (CSV): processed dataset split 70/15/15 into training, validation, and test sets
- predictions (CSV): model predictions
- plots (PNG): histograms, confusion matrix, feature importance
- model file (PKL): trained machine learning model
- source code (PY, IPYNB): scripts and notebook
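As an illustration of how the model file and the predictions file might be written and read back, here is a small sketch. The file names `model.pkl` and `predictions.csv`, the column names, and the toy data are placeholders chosen for the example and are not the project's naming convention.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny toy data standing in for the processed, already encoded dataset.
X = pd.DataFrame({
    "speed_limit": [30, 50, 70, 30, 50, 70],
    "is_wet": [0, 1, 1, 0, 0, 1],
})
y = pd.Series(
    ["Slight", "Serious", "Fatal", "Slight", "Slight", "Fatal"], name="severity"
)

model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "models").mkdir()
    (root / "outputs").mkdir()

    # Model file (PKL): serialize the trained estimator with pickle.
    model_path = root / "models" / "model.pkl"
    with open(model_path, "wb") as fh:
        pickle.dump(model, fh)

    # Predictions (CSV): true label next to the predicted label.
    predictions = pd.DataFrame({"actual": y, "predicted": model.predict(X)})
    pred_path = root / "outputs" / "predictions.csv"
    predictions.to_csv(pred_path, index=False)

    # Round trip: reload both files and check that they agree.
    with open(model_path, "rb") as fh:
        restored = pickle.load(fh)
    reloaded = pd.read_csv(pred_path)
    same_predictions = list(restored.predict(X)) == list(reloaded["predicted"])

print("restored model reproduces saved predictions:", same_predictions)
```

Pickle files should only be loaded from trusted sources, and they are most reliable when opened with the same scikit-learn version that created them.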
Naming convention:
Software requirements:
To use this dataset, the following tools are needed:
- Python 3
- pandas
- scikit-learn
- matplotlib
All tools are open source.
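A quick way to confirm that the listed tools are available is to query the installed package versions. The sketch below uses only the Python standard library; the helper name `installed_versions` is hypothetical, and no minimum versions are assumed because this plan does not state any.

```python
import sys
from importlib import metadata

# Distribution names of the required third-party packages.
REQUIRED = ["pandas", "scikit-learn", "matplotlib"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(REQUIRED)
print("Python", sys.version_info.major)
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```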
Additional resources:
- source code for preprocessing and training
- Jupyter notebook for exploration
- README file explaining the workflow
- data dictionary (column meanings)
Further Details
The dataset contains no personal or sensitive data. All records describe accident events only.
Important points for reuse:
- The dataset is imbalanced (many Slight accidents, far fewer Fatal ones), which affects model performance, particularly on the rarer classes
- Categorical variables are encoded using one-hot encoding
- Random processes (e.g., dataset split) use a fixed seed (random_state=42) to ensure reproducibility
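The reuse points above can be illustrated with a short sketch. The stratified split and `class_weight="balanced"` are assumptions added here as common ways to cope with class imbalance; this plan does not say whether the project uses them. The column names and class counts are invented for the demo and do not reflect the real distribution.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42  # fixed seed, as stated in this plan

# Toy table with deliberately unequal class counts (hypothetical columns).
df = pd.DataFrame({
    "weather": ["dry", "rain", "snow", "dry"] * 25,
    "speed_limit": [30, 50, 70, 50] * 25,
    "severity": ["Slight"] * 80 + ["Serious"] * 15 + ["Fatal"] * 5,
})
print(df["severity"].value_counts(normalize=True))

# One-hot encoding: each weather category becomes its own indicator column.
X = pd.get_dummies(df[["weather", "speed_limit"]], columns=["weather"])
y = df["severity"]


def split(seed):
    """Stratified split (an assumption) that keeps class proportions similar."""
    return train_test_split(X, y, test_size=0.30, stratify=y, random_state=seed)


# The same seed yields the same rows in the training set on every run.
X_train, X_rest, y_train, y_rest = split(SEED)
X_train_again, _, _, _ = split(SEED)
reproducible = X_train.index.equals(X_train_again.index)
print("split is reproducible:", reproducible)

# class_weight="balanced" (an assumption) weights classes inversely
# to their frequency during training.
model = RandomForestClassifier(class_weight="balanced", random_state=SEED)
model.fit(X_train, y_train)
```

With an imbalanced target, per-class metrics such as recall or a confusion matrix are usually more informative than overall accuracy.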
The original dataset is publicly available and can be reused under its open data license.
