First, this is a tiny training set. You are well outside the intended application range for SGD, so there is less community experience to draw on at this scale. That said, the algorithm should still produce reasonably stable results.
Here are a few questions:

a) Which class are you using to train your model? I would start with
OnlineLogisticRegression and experiment with learning rate schedules and the
amount of regularization to find out how to build a good model.

b) How many times are you passing through your data? Do you randomize the
order each time? Both are critical to proper training. Instead of randomizing
the order, you could just sample a data point at random and not worry about
using a complete permutation of the data. With such a tiny data set, you will
need to pass through the data many times, possibly hundreds of times or more.
(There is a sketch of a) and b) at the bottom of this message.)

c) What kind of data do you have? Sparse? Dense? How many variables? What kind?

d) Can you post your data?

On Fri, Aug 31, 2012 at 5:03 AM, Salman Mahmood <[email protected]> wrote:

> Thanks a lot Lance. Let me elaborate the problem in case it was a bit
> confusing.
>
> Assume I am building a binary classifier using SGD, with 50 positive and
> 50 negative examples to train it. After training and testing the model,
> the confusion matrix tells you the number of correctly and incorrectly
> classified instances. Let's assume I got 85% correct and 15% incorrect
> instances.
>
> Now if I run my program again using the same 50 negative and 50 positive
> examples, then to my knowledge the classifier should yield the same
> results as before (because not a single training or testing example was
> changed), but this is not the case. I get different results for different
> runs. The confusion matrix figures change each time I generate a model,
> keeping the data constant. What I do is generate a model several times
> and watch the accuracy; if it is above 90%, I stop running the code, and
> hence an accurate model is created.
>
> So what you are saying is to shuffle my data before I use it for training
> and testing?
> Thanks!
>
> On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:
>
> > Now I remember: SGD wants its data input in random order. You need to
> > permute the order of your data.
> >
> > If that does not help, another trick: for each data point, randomly
> > generate 5 or 10 or 20 points which are close. And again, randomly
> > permute the entire input set.
> >
> > On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <[email protected]> wrote:
> >> The more data you have, the closer each run will be. How much data do
> >> you have?
> >>
> >> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <[email protected]> wrote:
> >>> I have noticed that every time I train and test a model using the
> >>> same data (in the SGD algo), I get a different confusion matrix.
> >>> Meaning, if I generate a model and look at the confusion matrix, it
> >>> might say 90% correctly classified instances, but if I generate the
> >>> model again (with the SAME data for training and testing as before)
> >>> and test it, the confusion matrix changes and it might say 75%
> >>> correctly classified instances.
> >>>
> >>> Is this a desired behavior?
> >>
> >> --
> >> Lance Norskog
> >> [email protected]
> >
> > --
> > Lance Norskog
> > [email protected]
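Here is roughly what I mean by a) and b), as a minimal sketch against the
Mahout SGD API. The synthetic two-Gaussian data, the feature count, and the
hyperparameter values are placeholders, not recommendations; you would tune
lambda and the rate schedule on your real data. Note the fixed shuffle seed:
SGD results depend on the order examples are presented, so fixing the seed
is what makes repeated runs on identical data reproducible.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class TinySgdSketch {
      public static void main(String[] args) {
        int numFeatures = 10;  // placeholder: use your real feature count

        // Placeholder data: 50 positive and 50 negative examples drawn
        // from two Gaussians. Substitute your real vectors and 0/1 labels.
        List<Vector> examples = new ArrayList<Vector>();
        List<Integer> labels = new ArrayList<Integer>();
        Random dataRand = new Random(1);
        for (int i = 0; i < 100; i++) {
          int label = i < 50 ? 1 : 0;
          Vector v = new DenseVector(numFeatures);
          for (int j = 0; j < numFeatures; j++) {
            v.set(j, dataRand.nextGaussian() + (label == 1 ? 1.0 : -1.0));
          }
          examples.add(v);
          labels.add(label);
        }

        // Two categories, L1 prior. lambda controls regularization; the
        // learning rate schedule comes from learningRate/alpha/decayExponent.
        OnlineLogisticRegression learner =
            new OnlineLogisticRegression(2, numFeatures, new L1())
                .lambda(1e-4)
                .learningRate(1)
                .alpha(1)
                .decayExponent(0.9);

        // Many passes over the tiny set, re-shuffled each pass with a
        // fixed seed so every run sees the same sequence of examples.
        Random shuffleRand = new Random(42);
        List<Integer> order = new ArrayList<Integer>();
        for (int i = 0; i < examples.size(); i++) {
          order.add(i);
        }
        for (int pass = 0; pass < 200; pass++) {
          Collections.shuffle(order, shuffleRand);
          for (int i : order) {
            learner.train(labels.get(i), examples.get(i));
          }
        }

        // For a binary model, classifyScalar returns p(category == 1).
        double p = learner.classifyScalar(examples.get(0));
        System.out.println("p(positive | first example) = " + p);
      }
    }

The alternative mentioned in b), sampling a random point at each step instead
of shuffling, would replace the inner loop body with
examples.get(shuffleRand.nextInt(examples.size())) and the matching label.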
