This is a newbie question from someone who is just getting familiar with Mahout and machine learning.
I bought and have read Mahout in Action, and I'm trying to apply the concepts to some "real-world" data (i.e., not the data in the examples). The problem I am trying to solve is a classification problem, so I started with OnlineLogisticRegression. I'm struggling to get good results out of it, however, so I wonder if I am using the wrong algorithm.

Other notes about my data:

- My target variable has multiple (5) categories, although 1 of the 5 dominates and appears in 90%+ of the classifications in the training set.
- My (6) predictor variables are all numeric; some of the variables range from 0...5, others range from 0...1,000,000.
- The training set has millions of records.

I have modified the TrainLogistic / RunLogistic examples to use classifyFull() instead of classifyScalar(), and to output the resulting Vector as the probabilities of each category being selected.

So why do I think the results aren't very good? When I run the model against the validation set, I am not much better than random. Also, if I change the problem so that the target variable has just 2 categories instead of 5 (either in the 90% category or out of it), and then use Auc to validate against the training set, my best score is 0.52. I have also tried many values for --rate and --features, but none seems to make a difference.

Does anyone have any advice on whether I am using a hammer on a screw? Is it more likely that I have not found predictors that are very relevant? Or am I using an algorithm that is a poor fit?

I really appreciate your help,

Mike
