Hi all,

I am exploring Mahout's SGD classifier and would like some feedback, because I think I didn't configure things properly.
I created an example app that trains an SGD classifier on the 'bank marketing' dataset from UCI: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing

My app is at: https://github.com/frankscholten/mahout-sgd-bank-marketing

The app reads a CSV file of telephone calls, encodes the features into a vector and tries to predict whether a customer answers yes to a business proposal.

I do a few runs and measure accuracy, but I don't trust the results. When I use only an intercept term as a feature I get around 88% accuracy, and when I add all features it drops to around 85%.

Is this perhaps because the dataset is highly unbalanced? Most customers answer no. Or is the classifier biased to predict 0 as the target code when it doesn't have any data to go on?

Any other comments about my code or improvements I can make in the app are welcome! :)

Cheers,

Frank
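
P.S. To make the imbalance question concrete, here is a minimal sketch in plain Java (no Mahout calls) of what accuracy looks like for a model that always predicts "no". The label counts (880 "no", 120 "yes") are made up for illustration and are not taken from the dataset, and the class and method names are hypothetical. It suggests why an intercept-only model could score around 88% on accuracy while never finding a single "yes", which is why a measure such as recall on the "yes" class may say more than accuracy here.

```java
public class MajorityBaseline {

    /** Fraction of positions where the predicted label equals the actual label. */
    static double accuracy(int[] actual, int[] predicted) {
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == predicted[i]) {
                correct++;
            }
        }
        return (double) correct / actual.length;
    }

    /** Fraction of actual positives (label 1) that were also predicted as 1. */
    static double recall(int[] actual, int[] predicted) {
        int positives = 0;
        int truePositives = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == 1) {
                positives++;
                if (predicted[i] == 1) {
                    truePositives++;
                }
            }
        }
        return positives == 0 ? 0.0 : (double) truePositives / positives;
    }

    public static void main(String[] args) {
        // Hypothetical label distribution: 880 "no" (0) and 120 "yes" (1).
        int[] actual = new int[1000];
        for (int i = 880; i < actual.length; i++) {
            actual[i] = 1;
        }

        // A "classifier" that always answers "no"; Java initializes int arrays to 0.
        int[] alwaysNo = new int[actual.length];

        System.out.println("accuracy = " + accuracy(actual, alwaysNo));
        System.out.println("recall   = " + recall(actual, alwaysNo));
    }
}
```

If the real data behaves like this, comparing the full-feature model against this kind of baseline on recall or a confusion matrix, rather than on accuracy alone, might show whether the extra features are actually helping.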
