OK. Try passing through the data 100 times for a start; I think that is
likely to fix your problem.

Be warned that AdaptiveLogisticRegression has been misbehaving lately and
may converge faster than it should, so OnlineLogisticRegression (as I
suggested earlier) is the safer choice. A rough sketch of the kind of
training loop I mean follows, and there is a P.S. at the bottom of this
mail about the run-to-run variation you are seeing.
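This is untested and written against the Mahout SGD API as of 0.7. The
Example holder is hypothetical, the learning rate, lambda, and L1 prior
are placeholder choices you should experiment with, and the feature
encoding (turning titles and descriptions into Vectors) is left out,
since you presumably already have that from the 20newsgroups example.

    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    public class ManyPassTrainer {
      // Hypothetical holder: the target category index plus the encoded
      // feature vector for one news article.
      public static class Example {
        final int target;
        final Vector features;
        public Example(int target, Vector features) {
          this.target = target;
          this.features = features;
        }
      }

      public static OnlineLogisticRegression train(List<Example> examples,
                                                   int numCategories,
                                                   int numFeatures) {
        OnlineLogisticRegression learner =
            new OnlineLogisticRegression(numCategories, numFeatures, new L1())
                .lambda(1e-4)        // placeholder regularization strength
                .learningRate(0.5);  // placeholder; tune this
        Random rand = new Random();
        // The key point: many passes over a tiny data set, with a fresh
        // random order on every pass.
        for (int pass = 0; pass < 100; pass++) {
          Collections.shuffle(examples, rand);
          for (Example ex : examples) {
            learner.train(ex.target, ex.features);
          }
        }
        return learner;
      }
    }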
On Fri, Aug 31, 2012 at 9:33 AM, Salman Mahmood <[email protected]> wrote:

> Thanks a lot, Ted. Here are the answers:
>
> d) Data (news articles from different feeds)
>
> News Article 1:
> Title: BP Profits Plunge On Massive Asset Write-down
> Description: BP PLC (BP) Tuesday posted a dramatic fall of 96% in
> adjusted profit for the second quarter as it wrote down the value of its
> assets by $5 billion, including some U.S. refineries, a suspended Alaskan
> oil project, and U.S. shale gas resources.
>
> News Article 2:
> Title: Morgan Stanley Missed Big: Why It's Still A Fantastic Short
> Description: By Mike Williams: Though the market responded very
> positively to Citigroup (C) and Bank of America's (BAC) reserve
> release-driven earnings "beats", last week's Morgan Stanley (MS) earnings
> report illustrated what happens when a bank doesn't have billions of
> reserves to release back into earnings. Estimates called for the
> following: $.43 per share in earnings, $.29 per share in earnings ex-DVA
> (debt value adjustment), and $7.7 billion in revenue. GAAP results
> (including the DVA) came in at $.28 per share, while ex-DVA earnings were
> $.16. Revenue was a particular disappointment, coming in at $6.95 billion.
>
> c) As you can see, the data is textual. I am using the title and
> description as predictor variables, and the target variable is the name
> of the company a news article belongs to.
>
> b) I am passing through the data once (at least, this is what I think). I
> followed the 20newsgroups example code (in Java) and didn't find that the
> data was passed more than once. Yes, I randomize the order every time.
>
> a) I am using AdaptiveLogisticRegression (just like 20newsgroups).
>
> Thanks!
>
> On Aug 31, 2012, at 2:27 PM, Ted Dunning wrote:
>
> > First, this is a tiny training set. You are well outside the intended
> > application range, so you are likely to find less experience in the
> > community in that range. That said, the algorithm should still produce
> > reasonably stable results.
> >
> > Here are a few questions:
> >
> > a) Which class are you using to train your model? I would start with
> > OnlineLogisticRegression and experiment with learning-rate schedules
> > and the amount of regularization to find out how to build a good model.
> >
> > b) How many times are you passing through your data? Do you randomize
> > the order each time? These are critical to proper training. Instead of
> > randomizing the order, you could just sample a data point at random and
> > not worry about using a complete permutation of the data. With such a
> > tiny data set, you will need to pass through the data many times ...
> > possibly hundreds of times or more.
> >
> > c) What kind of data do you have? Sparse? Dense? How many variables?
> > What kind?
> >
> > d) Can you post your data?
> >
> > On Fri, Aug 31, 2012 at 5:03 AM, Salman Mahmood <[email protected]> wrote:
> >
> >> Thanks a lot, Lance. Let me elaborate on the problem in case it was a
> >> bit confusing.
> >>
> >> Assume I am making a binary classifier using SGD, and I have 50
> >> positive and 50 negative examples to train it. After training and
> >> testing the model, the confusion matrix tells you the number of
> >> correctly and incorrectly classified instances. Let's assume I got 85%
> >> correct and 15% incorrect instances.
> >>
> >> Now if I run my program again using the same 50 negative and 50
> >> positive examples, then to my knowledge the classifier should yield
> >> the same results as before (since not a single piece of training or
> >> testing data was changed), but this is not the case: I get different
> >> results on different runs. The confusion matrix figures change each
> >> time I generate a model, even though the data is kept constant. What I
> >> do is generate a model several times while watching the accuracy, and
> >> if it is above 90%, I stop running the code; hence an accurate model
> >> is created.
> >>
> >> So what you are saying is to shuffle my data before I use it for
> >> training and testing?
> >> Thanks!
> >>
> >> On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:
> >>
> >>> Now I remember: SGD wants its data input in random order. You need to
> >>> permute the order of your data.
> >>>
> >>> If that does not help, another trick: for each data point, randomly
> >>> generate 5 or 10 or 20 points which are close. And again, randomly
> >>> permute the entire input set.
> >>>
> >>> On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <[email protected]> wrote:
> >>>
> >>>> The more data you have, the closer each run will be. How much data
> >>>> do you have?
> >>>>
> >>>> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <[email protected]> wrote:
> >>>>
> >>>>> I have noticed that every time I train and test a model using the
> >>>>> same data (with the SGD algorithm), I get a different confusion
> >>>>> matrix. Meaning, if I generate a model and look at the confusion
> >>>>> matrix, it might say 90% correctly classified instances, but if I
> >>>>> generate the model again (with the SAME data for training and
> >>>>> testing as before) and test it, the confusion matrix changes and it
> >>>>> might say 75% correctly classified instances.
> >>>>>
> >>>>> Is this desired behavior?
> >>>>
> >>>> --
> >>>> Lance Norskog
> >>>> [email protected]
> >>>
> >>> --
> >>> Lance Norskog
> >>> [email protected]
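P.S. Rather than regenerating the model until the confusion matrix happens
to come out above 90% (which just cherry-picks a lucky run), it is more
informative to measure the run-to-run spread directly. Here is a rough,
untested sketch that averages held-out accuracy over several shuffled
80/20 train/test splits; it reuses the hypothetical Example holder and
train() method from the sketch above, and the split fraction and run
count are arbitrary choices:

    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;

    public class SplitEval {
      // Mean accuracy over several random train/test splits. One
      // confusion matrix from one run tells you very little with only
      // 100 examples.
      public static double meanAccuracy(List<ManyPassTrainer.Example> data,
                                        int numCategories, int numFeatures,
                                        int runs) {
        Random rand = new Random();
        double sum = 0;
        for (int run = 0; run < runs; run++) {
          Collections.shuffle(data, rand);
          int cut = (int) (0.8 * data.size());  // 80/20 split
          List<ManyPassTrainer.Example> trainSet = data.subList(0, cut);
          List<ManyPassTrainer.Example> testSet =
              data.subList(cut, data.size());
          OnlineLogisticRegression model =
              ManyPassTrainer.train(trainSet, numCategories, numFeatures);
          int correct = 0;
          for (ManyPassTrainer.Example ex : testSet) {
            // classifyFull returns per-category scores; take the argmax.
            if (model.classifyFull(ex.features).maxValueIndex() == ex.target) {
              correct++;
            }
          }
          sum += (double) correct / testSet.size();
        }
        return sum / runs;
      }
    }

If the mean sits around 85% with a spread of several points across runs,
much of the variation you are seeing is probably just the noise you should
expect at this data size rather than a bug.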
