Polytechnic School of the University of São Paulo

Graduation Project 2022

Machine Learning Applied to Card Payment Fraud Detection

Student: Pedro Henrique Carvalho dos Reis

Advisor Professor: Prof. Dr. Reginaldo Arakaki

Introduction

Although bank card payments have become a global trend, driven by their convenience and by e-commerce, they are not completely secure. A 2019 study by the Nilson Report estimates that losses from bank card payment fraud will reach $408.5 billion over the next 10 years. Historically, anti-fraud systems were based on a pre-programmed set of rules that flags a payment as fraudulent, but with online shopping fraudsters have much more flexibility, which makes these single-ruleset systems weak at detecting fraud. In contrast, the advances in computational processing of recent decades have allowed technologies such as Machine Learning (ML) to enter the domain of bank card fraud detection. In this type of system, an ML model analyzes historical data and learns the main fraud patterns from it.

Objective

Based on papers 1, 2 and 3, assess whether recent Deep Learning (DL) methods for tabular data can outperform Gradient Boosted Decision Trees (GBDT), the state of the art in this domain, on a highly imbalanced tabular dataset. In addition, propose an optimized model that detects fraudulent card payments effectively.

Methodology

In a GPU-enabled environment, a complete data pipeline was developed from scratch to train, validate, optimize, test and compare two GBDT models and four DL models on a highly imbalanced tabular dataset containing millions of card payment transactions, each labeled as fraudulent or not. In addition, several techniques, such as oversampling and an adapted loss function, were discussed and applied to compensate for the class imbalance.
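As an illustration of the two imbalance techniques named above, the following is a minimal sketch and not the project's actual pipeline. It assumes random oversampling of the minority class and a class-weighted binary cross-entropy as the "adapted" loss; the function names, the toy data and the choice of weight (negatives divided by positives) are hypothetical.

```python
import math
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until both classes are equally frequent.

    Assumes binary labels (0 = legitimate, 1 = fraudulent) and at least one row per class.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Sample (with replacement) as many extra minority indices as needed.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(labels))) + extra
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]


def weighted_log_loss(labels, probs, pos_weight, eps=1e-12):
    """Binary cross-entropy where errors on the positive class count pos_weight times."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# Toy data (hypothetical): 2 fraudulent rows out of 10.
labels = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
rows = [[float(i)] for i in range(10)]

bal_rows, bal_labels = random_oversample(rows, labels)
print(sum(bal_labels), len(bal_labels))  # 8 16

# Assumed heuristic for the weight: number of negatives / number of positives.
pos_weight = labels.count(0) / labels.count(1)
print(pos_weight)  # 4.0
```

Oversampling should be applied to the training split only, so that validation and test metrics still reflect the original class distribution. The weighted loss reaches a similar goal without duplicating rows; the negatives-to-positives ratio used here is the same heuristic commonly suggested for XGBoost's `scale_pos_weight` parameter.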

Results

The results showed that the GBDT models outperformed the DL models on all three criteria: predictive performance measured by F1 score, training time and ease of implementation. After several optimization steps, the XGBoost model achieved the highest F1 score on the dataset considered, keeping the number of false positives low while significantly increasing the number of true positives.

Model     TN      FP    FN    TP   F1 Score (%)  Train Time
XGBoost   297987  643   846   524  41.31         10min 3s
LightGBM  297905  725   1074  296  24.76         6.6s
MLP       297250  1380  994   376  24.06         25min
ResNet    298535  95    1197  173  21.12         57min 13s
FTT       298625  5     1277  93   12.67         3h 13s
XBNet     298421  209   1287  83   9.90          8h
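To make the F1 Score column easier to interpret, the sketch below recomputes it from the confusion-matrix counts of the XGBoost row and compares it with plain accuracy. It assumes that the column is reported as a percentage and that the positive class is "fraudulent"; the helper name `f1_from_counts` is hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the XGBoost row of the table.
tn, fp, fn, tp = 297987, 643, 846, 524

precision, recall, f1 = f1_from_counts(tp, fp, fn)
print(round(100 * f1, 2))  # 41.31

# Accuracy for comparison: very high for any model, because frauds are rare.
total = tn + fp + fn + tp
accuracy = (tn + tp) / total
# A trivial model that labels every payment as legitimate.
baseline_accuracy = (tn + fp) / total
print(accuracy < baseline_accuracy)  # True
```

With only 1370 fraudulent payments among 300000 evaluated transactions, a model that never flags fraud would reach a higher accuracy than XGBoost while detecting nothing. This is why F1, which depends only on TP, FP and FN, is the more informative metric on this dataset.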