Polytechnic School of the University of São Paulo

Graduation Project 2022

Machine Learning Applied to Card Payment Fraud Detection

Student: Pedro Henrique Carvalho dos Reis

Advisor: Prof. Dr. Reginaldo Arakaki

Introduction

Although card payments have become a global trend because of their convenience and the growth of e-commerce, they are not completely secure. A 2019 study by the Nilson Report estimates that losses from bank card payment fraud will total $408.5 billion over the next 10 years. Historically, anti-fraud systems relied on a pre-programmed set of rules that flag a payment as fraudulent, but online shopping gives fraudsters much more flexibility, which makes these fixed rule-based systems weak at detecting fraud. In contrast, advances in computational processing in recent decades have allowed technologies such as Machine Learning (ML) to enter the domain of bank card fraud detection. In this type of system, an ML model analyzes historical data and learns the main fraud patterns from it.

Objective

Based on papers 1, 2 and 3, assess whether recent advances in Deep Learning (DL) for tabular data can outperform Gradient Boosted Decision Trees (GBDT), the state of the art in this domain, on a highly imbalanced tabular dataset. In addition, propose an optimized model that is effective at detecting fraudulent card payments.

Methodology

In a GPU-enabled environment, using a highly imbalanced tabular dataset with millions of card payment transactions labeled as fraudulent or not, a complete data pipeline was developed from scratch to train, validate, optimize, test and compare two GBDT and four DL models. In addition, several techniques, such as oversampling and an adapted loss function, were discussed and used to compensate for the class imbalance.
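The two imbalance techniques named above can be illustrated with a minimal sketch. The project summary does not state which oversampling variant or which loss adaptation was used, so the code below assumes plain random oversampling of the minority class and a binary cross-entropy with a weight on the positive (fraud) class. The function names `random_oversample` and `weighted_bce` and the toy data are hypothetical and are not taken from the project code.

```python
import math
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes are the same size.

    Assumption: plain random oversampling; the project may have used another variant.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(rows), list(labels)  # nothing to duplicate
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(labels))) + extra
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]


def weighted_bce(y_true, p_pred, pos_weight):
    """Mean binary cross-entropy in which errors on positives count pos_weight times.

    Assumption: one possible "adapted loss"; not necessarily the one used in the project.
    """
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# Toy data (hypothetical): 98 legitimate payments and 2 fraudulent ones.
labels = [0] * 98 + [1] * 2
rows = [[float(i)] for i in range(100)]

bal_rows, bal_labels = random_oversample(rows, labels)
pos_weight = labels.count(0) / labels.count(1)  # negatives per positive

print(bal_labels.count(0), bal_labels.count(1))
print(pos_weight)
```

In practice, oversampling is applied to the training split only, so that validation and test metrics still reflect the original class distribution. The two techniques are alternatives that address the same problem, so they are usually compared rather than stacked blindly.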

Results

The results showed that the GBDT models outperformed the DL models on all three criteria: performance measured by F1 score, training time and ease of code implementation. After several optimization steps, the XGBoost model performed best on the dataset considered, keeping the number of false positives low while significantly increasing the number of true positives.

Model	TN	FP	FN	TP	F1 Score (%)	Training Time
XGBoost	297987	643	846	524	41.31	10min 3s
LightGBM	297905	725	1074	296	24.76	6.6s
MLP	297250	1380	994	376	24.06	25min
ResNet	298535	95	1197	173	21.12	57min 13s
FTT	298625	5	1277	93	12.67	3h 13s
XBNet	298421	209	1287	83	9.90	8h
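As a worked check of how the F1 column relates to the confusion-matrix counts, the sketch below recomputes precision, recall and F1 for the XGBoost row using the standard definitions. The helper name `prf_from_counts` is hypothetical; only the counts come from the table.

```python
def prf_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# XGBoost row of the table: FP = 643, FN = 846, TP = 524.
precision, recall, f1 = prf_from_counts(tp=524, fp=643, fn=846)

print(f"precision = {100 * precision:.2f}%")
print(f"recall    = {100 * recall:.2f}%")
print(f"F1        = {100 * f1:.2f}%")  # 41.31%, matching the table
```

True negatives do not enter the formula, which is why F1 is preferred over accuracy here: with roughly 298,630 legitimate and 1,370 fraudulent payments in the test set, a model that never flags fraud would still reach an accuracy above 99%.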