Credit Card Fraud Detection

doi:10.82556/yvxj-9t22

Published April 28, 2025 | Version v1

Dataset Open

Grizhja, Ajdina¹

1. TU Wien

The following DMP-style description documents the credit-card fraud detection experiment.

1. Dataset Description

Research Domain
This work resides in the domain of financial fraud detection and applied machine learning. We focus on detecting anomalous credit‐card transactions in real time to reduce financial losses and improve trust in digital payment systems.

Purpose
The goal is to train and evaluate a binary classification model that flags potentially fraudulent transactions. By publishing both the code and data splits via FAIR repositories, we enable reproducible benchmarking of fraud‐detection algorithms and support future research on anomaly detection in transaction data.

Data Sources
We used the publicly available credit‐card transaction dataset from Kaggle (original source: https://www.kaggle.com/mlg-ulb/creditcardfraud), which contains anonymized transactions made by European cardholders over two days in September 2013. The dataset includes 284 807 transactions, of which 492 are fraudulent.

Method of Dataset Preparation

Schema validation: Renamed columns to snake_case (e.g. transaction_amount, is_declined) so they conform to DBRepo’s requirements.
Data import: Uploaded the full CSV into DBRepo, assigned persistent identifiers (PIDs).
Splitting: Programmatically derived three subsets—training (70%), validation (15%), test (15%)—using range‐based filters on the primary key actionnr. Each subset was materialized in DBRepo and assigned its own PID for precise citation.
Cleaning: Converted the categorical flags (is_declined, isforeigntransaction, ishighriskcountry, isfradulent) from “Y”/“N” to 1/0 and dropped non‐feature identifiers (actionnr, merchant_id).
Modeling: Trained a RandomForest classifier on the training split, tuned on validation, and evaluated on the held‐out test set.
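
The splitting and cleaning steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the code used to produce the published subsets: the helper names (split_by_id_range, clean_row) are hypothetical, and cutting the sorted actionnr range at the 70% and 85% positions is only one possible reading of "range-based filters on the primary key".

```python
FLAG_COLUMNS = ("is_declined", "isforeigntransaction",
                "ishighriskcountry", "isfradulent")
ID_COLUMNS = ("actionnr", "merchant_id")
FLAG_VALUES = {"Y": 1, "N": 0}


def split_by_id_range(rows, key="actionnr", train_pct=70, validation_pct=15):
    """Split rows into train/validation/test by contiguous ranges of the key.

    Assumption: the boundaries are taken at fixed positions of the sorted
    key; the remaining rows form the test subset.
    """
    ordered = sorted(rows, key=lambda r: r[key])
    n = len(ordered)
    train_end = n * train_pct // 100
    validation_end = train_end + n * validation_pct // 100
    return (ordered[:train_end],
            ordered[train_end:validation_end],
            ordered[validation_end:])


def clean_row(row):
    """Map "Y"/"N" flags to 1/0 and drop the non-feature identifiers.

    An unexpected flag value raises KeyError instead of being coerced to 0.
    """
    cleaned = {k: v for k, v in row.items() if k not in ID_COLUMNS}
    for column in FLAG_COLUMNS:
        cleaned[column] = FLAG_VALUES[cleaned[column]]
    return cleaned


# Toy data (invented for illustration, not taken from the dataset).
rows = [
    {"actionnr": i, "merchant_id": f"m{i}", "transaction_amount": 10.0 * i,
     "is_declined": "N", "isforeigntransaction": "Y",
     "ishighriskcountry": "N", "isfradulent": "Y" if i % 10 == 0 else "N"}
    for i in range(1, 21)
]

# Split first (the split needs actionnr), then clean each subset.
train, validation, test = split_by_id_range(rows)
train_clean = [clean_row(r) for r in train]
print(len(train), len(validation), len(test))  # 14 3 3
```

Splitting before cleaning matters here because the cleaning step drops actionnr, the column the split depends on.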

2. Technical Details

Dataset Structure

The raw data is a single CSV with columns:
- actionnr (integer transaction ID)
- merchant_id (string)
- average_amount_transaction_day (float)
- transaction_amount (float)
- is_declined, isforeigntransaction, ishighriskcountry, isfradulent (binary flags)
- total_number_of_declines_day, daily_chargeback_avg_amt, sixmonth_avg_chbk_amt, sixmonth_chbk_freq (numeric features)
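
Before working with a local copy of the CSV, it can help to confirm that its header matches the columns listed above. The following is a minimal sketch: the function name check_header is hypothetical, a comma delimiter is assumed, and no particular column order is assumed because the record does not state one.

```python
import csv
import io

# Column names as listed in this record (order not assumed).
EXPECTED_COLUMNS = {
    "actionnr", "merchant_id", "average_amount_transaction_day",
    "transaction_amount", "is_declined", "isforeigntransaction",
    "ishighriskcountry", "isfradulent", "total_number_of_declines_day",
    "daily_chargeback_avg_amt", "sixmonth_avg_chbk_amt", "sixmonth_chbk_freq",
}


def check_header(handle, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names for the CSV header row."""
    header = set(next(csv.reader(handle)))
    return sorted(expected - header), sorted(header - expected)


# Demonstration on an in-memory header; a real check would pass an open file.
sample = io.StringIO(",".join(sorted(EXPECTED_COLUMNS)) + "\n")
print(check_header(sample))  # ([], [])
```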

Naming Conventions

All columns use lowercase snake_case.
Subsets are named creditcard_training, creditcard_validation, creditcard_test in DBRepo.
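
The naming convention can be checked mechanically. The sketch below is an assumption about how such a check might look, not the procedure actually used: the function names and the example raw headers ("Transaction Amount", "Is Declined") are hypothetical. The converter only lowercases and replaces separators, so it does not split camelCase words, which is consistent with names such as isforeigntransaction in this dataset.

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def to_snake_case(name):
    """Lowercase a header and collapse runs of non-alphanumerics into "_"."""
    return re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")


def non_conforming(columns):
    """Return the column names that are not lowercase snake_case."""
    return [c for c in columns if not SNAKE_CASE.match(c)]


print(to_snake_case("Transaction Amount"))   # transaction_amount
print(non_conforming(["transaction_amount", "Is Declined",
                      "sixmonth_chbk_freq"]))  # ['Is Declined']
```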

Files in the code repo follow a clear structure:

├── data/                  # local copies only; raw data lives in DBRepo  
├── notebooks/Task.ipynb  
├── models/rf_model_v1.joblib  
├── outputs/               # confusion_matrix.png, roc_curve.png, predictions.csv  
├── README.md  
├── requirements.txt  
└── codemeta.json

Required Software

Python 3.9+
pandas, numpy (data handling)
scikit-learn (modeling, metrics)
matplotlib (visualizations)
dbrepo_client.py (DBRepo API)
requests (TU WRD API)

Additional Resources

Original dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud
Scikit-learn docs: https://scikit-learn.org/stable
DBRepo API guide: via the starter notebook’s dbrepo_client.py template
TU WRD REST API spec: https://test.researchdata.tuwien.ac.at/api/docs

3. Further Details

Data Limitations

Highly imbalanced: only ~0.17% of transactions are fraudulent.
Anonymized: the original variables behind the PCA features (V1–V28) are hidden; we extended the data with domain features but cannot reverse-engineer the raw variables.
Time-bounded: covers only two days of transactions and may not capture seasonal patterns.
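
The imbalance figure follows directly from the counts given under Data Sources, and it also shows why plain accuracy is a poor metric here. The short calculation below is a sketch: the precision_recall helper is a hypothetical name, and the "predict legitimate for every transaction" baseline is an illustrative strawman, not a model from this experiment.

```python
total = 284_807   # transactions reported for the source dataset
fraud = 492       # of which fraudulent

fraud_rate = fraud / total
# A classifier that labels every transaction "legitimate" is right on
# every non-fraud row and wrong on every fraud row.
baseline_accuracy = (total - fraud) / total


def precision_recall(tp, fp, fn):
    """Precision and recall from confusion counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


print(f"fraud rate: {fraud_rate:.2%}")
print(f"all-legitimate baseline accuracy: {baseline_accuracy:.2%}")
print(precision_recall(tp=0, fp=0, fn=fraud))
```

The baseline reaches well over 99% accuracy while detecting no fraud at all, so precision, recall, or related measures on the fraud class are more informative for benchmarking.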

Licensing and Attribution

Raw data: CC-0 (per Kaggle terms)
Code & notebooks: MIT License
Model artifacts & outputs: CC-BY 4.0
TU WRD records include ORCID identifiers for the author.

Recommended Uses

Benchmarking new fraud‐detection algorithms on a standard imbalanced dataset.
Educational purposes: demonstrating model‐training pipelines, FAIR data practices.
Extension: adding time‐series or deep‐learning models.

Known Issues

Possible temporal leakage if date/time features are not handled correctly.
Model performance may degrade on live data due to concept drift.
Binary flags may oversimplify nuanced transaction outcomes.

Files

Files (588.6 KiB)

Name	MD5	Size
Add MIT license.txt	9d3b370f6db878c5b07240f37fe29eda	1.0 KiB
codemeta.json	2d089022f16ae997053990998ad77eaf	567 Bytes
credit_card.csv	d3f5cd54ff3767d4e174a6a7fd4f3166	168.6 KiB
Model_Card.pdf	cc160f08976823f93098ba2e373a09f4	166.2 KiB
README.md	cbd833770b938cca69f225a389892d0d	341 Bytes
requirements.txt	922e12515f2f722603b8f0f1c09eeb0f	42 Bytes
test_subset.csv	845af2bd6397922b71d3123babd78cc4	37.8 KiB
training_subset.csv	af5a1f95f38919f8674b628a595803dc	176.4 KiB
validation_subset.csv	eeb1d8ac04b02e42454b693a5f7aaecb	37.7 KiB
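
The MD5 checksums listed for each file can be used to confirm that a download is intact. The following is a minimal sketch using only the Python standard library: the function names and the idea of checking a local directory are assumptions, while the expected values are copied from the file listing in this record.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record (data files only).
EXPECTED = {
    "credit_card.csv": "d3f5cd54ff3767d4e174a6a7fd4f3166",
    "training_subset.csv": "af5a1f95f38919f8674b628a595803dc",
    "validation_subset.csv": "eeb1d8ac04b02e42454b693a5f7aaecb",
    "test_subset.csv": "845af2bd6397922b71d3123babd78cc4",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected_md5.lower()


def check_downloads(directory):
    """Return {filename: True/False/None}; None means the file is missing."""
    results = {}
    for name, expected in EXPECTED.items():
        path = Path(directory) / name
        results[name] = verify(path, expected) if path.exists() else None
    return results


# Example: check files in the current directory (hypothetical location).
print(check_downloads("."))
```

MD5 is adequate for detecting accidental corruption during transfer; it is not a safeguard against deliberate tampering.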

Additional details

Accepted: 2025-04-28
