This is my first time using Apache Spark (and machine learning in general),
and I'm currently trying to build a small application to detect credit card
fraud.

Currently I have about 10,000 transaction objects in my data set, with 70%
going towards training the model and 30% reserved for testing.
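To illustrate the split, here is a minimal pure-Python sketch (the integer IDs stand in for my transaction objects; in Spark itself something like `DataFrame.randomSplit` would do this, but the idea is the same):

```python
import random

# Stand-in for the ~10,000 transaction objects; each "transaction"
# is just an integer ID here.
transactions = list(range(10_000))

# Shuffle with a fixed seed so the split is reproducible.
random.Random(42).shuffle(transactions)

# 70% training / 30% testing, using integer math to avoid float rounding.
split = len(transactions) * 7 // 10
train, test = transactions[:split], transactions[split:]

print(len(train), len(test))  # 7000 3000
```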

I'm using a Logistic Regression model with the following features: the
amount spent, the merchant type, the card number, the total amount spent in
the last 24 hours, and the time since the last transaction. The label is
binary: 0 means a valid transaction and 1 means a fraudulent one.
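For the class weighting mentioned below, this is the "balanced" weighting scheme I was aiming for with the classWeight column, sketched in plain Python. The counts assume my data set as described (10,000 rows, 10% fraud); the formula gives each class the same total weight:

```python
# Assumed counts from the data set described above.
n_total = 10_000
n_fraud = 1_000            # 10% fraudulent
n_valid = n_total - n_fraud

# "Balanced" weighting: weight(class) = n_total / (n_classes * n_class),
# so each class contributes equally to the loss overall.
w_fraud = n_total / (2 * n_fraud)   # 5.0
w_valid = n_total / (2 * n_valid)   # ~0.556

def row_weight(label: int) -> float:
    """Weight to place in the classWeight column for one transaction."""
    return w_fraud if label == 1 else w_valid

print(row_weight(1), round(row_weight(0), 3))  # 5.0 0.556
```

With these values, the total weight of the fraud rows (1,000 × 5.0) equals the total weight of the valid rows (9,000 × 0.556), which is the balance point; pushing the minority weight much past that is what tips the model into predicting fraud everywhere.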

Currently the model never predicts fraud in any situation. I suspect my
very skewed data set is affecting it, as only 10% of my data represents
fraudulent transactions. I tried using a classWeight column to give more
weight to the minority class and work around this, but I haven't been
successful so far. If I increase the classWeight too much, the model
eventually predicts fraud for everything, which isn't correct either.
Ideally I would also use a higher decision threshold, and I have tried both
lower and higher regularization to see if either makes any difference.
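On the threshold point, a small self-contained sketch of what I mean by tuning it (the probabilities and labels here are made-up toy values, not from my data set; in Spark the equivalent knob would be the model's threshold parameter):

```python
# Toy illustration of how moving the decision threshold trades
# precision against recall for the fraud (positive) class.
def precision_recall(probs, labels, threshold):
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up predicted fraud probabilities and true labels.
probs  = [0.05, 0.20, 0.35, 0.55, 0.80, 0.10, 0.45, 0.90]
labels = [0,    0,    1,    1,    1,    0,    0,    1   ]

for t in (0.5, 0.3):
    p, r = precision_recall(probs, labels, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold catches more of the rare fraud cases (higher recall) at the cost of more false alarms (lower precision), which is the trade-off I'm trying to balance.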
