This is my first time using Apache Spark and machine learning in general, and I'm currently trying to create a small application to detect credit card fraud.
I have about 10,000 transaction objects in my data set, with 70% going towards training the model and 30% towards testing. I'm using a Logistic Regression model whose features are the amount spent, the type of merchant, the card number, the total amount spent in the last 24 hours, and the time since the last transaction. I have one binary label, where 0 means a valid transaction and 1 means a fraudulent one.

The problem is that the model never predicts fraud in any situation. I think my skewed data set might be the cause, since only 10% of my data represents fraudulent transactions. I tried giving more weight to the minority class through a class-weight column (Spark's `weightCol`) to work around this, but I haven't been successful so far: if I make the weight too large, the model swings the other way and predicts fraud for everything, which isn't correct either.

I would also like to adjust the prediction threshold, and I have tried both lower and higher regularization, but neither made any difference.
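In case it helps to make the class-weighting idea concrete, here is a minimal sketch of one common scheme: give each row a weight so that both classes contribute the same total weight (with 10% fraud, fraud rows get 0.9 and valid rows get 0.1, instead of hand-tuning a single large multiplier). The PySpark part is shown in comments and is an assumption about your setup: the column names `label`, `features`, and `classWeight` are hypothetical placeholders.

```python
def class_balanced_weight(label, positive_fraction):
    """Per-row weight so each class contributes equal total weight.

    label: 1.0 for fraud, 0.0 for valid.
    positive_fraction: fraction of rows labelled fraud (0.1 in your data).
    """
    # Minority (fraud) rows get the large weight, majority rows the small one.
    return (1.0 - positive_fraction) if label == 1.0 else positive_fraction


# With 10% fraud: fraud rows weigh 0.9, valid rows 0.1.
weights = [class_balanced_weight(lbl, 0.1) for lbl in (1.0, 0.0)]
print(weights)  # -> [0.9, 0.1]

# In Spark (PySpark; column names are hypothetical), attach the weight as a
# column and point LogisticRegression at it via weightCol:
#
# from pyspark.sql import functions as F
# from pyspark.ml.classification import LogisticRegression
#
# fraud_fraction = df.filter(F.col("label") == 1).count() / df.count()
# weighted = df.withColumn(
#     "classWeight",
#     F.when(F.col("label") == 1, 1.0 - fraud_fraction)
#      .otherwise(fraud_fraction))
# lr = LogisticRegression(featuresCol="features", labelCol="label",
#                         weightCol="classWeight")
# model = lr.fit(weighted)
#
# # Instead of (or in addition to) weighting, you can lower the decision
# # threshold so that smaller predicted probabilities count as fraud:
# model.setThreshold(0.3)  # default is 0.5; tune this on the test split
```

With this scheme there is only one knob left to tune (the threshold), rather than an unbounded class-weight multiplier, which may make the over/under-predicting swings you describe easier to control.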