Cheers Ted. Appreciate the input!
On 31 Aug 2012, at 17:53, Ted Dunning <[email protected]> wrote:

> OK.
>
> Try passing through the data 100 times for a start. I think that this is
> likely to fix your problems.
>
> Be warned that AdaptiveLogisticRegression has been misbehaving lately and
> may converge faster than it should.
>
> On Fri, Aug 31, 2012 at 9:33 AM, Salman Mahmood <[email protected]> wrote:
>
>> Thanks a lot Ted. Here are the answers:
>>
>> d) Data (news articles from different feeds)
>>
>> News Article 1:
>> Title: BP Profits Plunge On Massive Asset Write-down
>> Description: BP PLC (BP) Tuesday posted a dramatic fall of 96% in
>> adjusted profit for the second quarter as it wrote down the value of its
>> assets by $5 billion including some U.S. refineries a suspended Alaskan
>> oil project and U.S. shale gas resources
>>
>> News Article 2:
>> Title: Morgan Stanley Missed Big
>> Description: Why It's Still A Fantastic Short,"By Mike Williams: Though
>> the market responded very positively to Citigroup (C) and Bank of
>> America's (BAC) reserve release-driven earnings ""beats"" last week's
>> Morgan Stanley (MS) earnings report illustrated what happens when a bank
>> doesn't have billions of reserves to release back into earnings.
>> Estimates called for the following: $.43 per share in earnings $.29 per
>> share in earnings ex-DVA (debt value adjustment) $7.7 billion in revenue
>> GAAP results (including the DVA) came in at $.28 per share while ex-DVA
>> earnings were $.16. Revenue was a particular disappointment coming in at
>> $6.95 billion.
>>
>> c) As you can see, the data is textual. I am using the title and
>> description as predictor variables, and the target variable is the
>> company a news article belongs to.
>>
>> b) I am passing through the data once (at least this is what I think). I
>> followed the 20newsgroup example code (in Java) and didn't find that the
>> data was passed more than once.
>> Yes, I randomize the order every time.
>>
>> a) I am using AdaptiveLogisticRegression (just like 20newsgroup).
>>
>> Thanks!
>>
>> On Aug 31, 2012, at 2:27 PM, Ted Dunning wrote:
>>
>>> First, this is a tiny training set. You are well outside the intended
>>> application range so you are likely to find less experience in the
>>> community in that range. That said, the algorithm should still produce
>>> reasonably stable results.
>>>
>>> Here are a few questions:
>>>
>>> a) which class are you using to train your model? I would start with
>>> OnlineLogisticRegression and experiment with training rate schedules and
>>> amount of regularization to find out how to build a good model.
>>>
>>> b) how many times are you passing through your data? Do you randomize the
>>> order each time? These are critical to proper training. Instead of
>>> randomizing order, you could just sample a data point at random and not
>>> worry about using a complete permutation of the data. With such a tiny
>>> data set, you will need to pass through the data many times ... possibly
>>> hundreds of times or more.
>>>
>>> c) what kind of data do you have? Sparse? Dense? How many variables?
>>> What kind?
>>>
>>> d) can you post your data?
>>>
>>> On Fri, Aug 31, 2012 at 5:03 AM, Salman Mahmood <[email protected]> wrote:
>>>
>>>> Thanks a lot Lance. Let me elaborate on the problem in case it was a
>>>> bit confusing.
>>>>
>>>> Assume I am making a binary classifier using SGD. I have 50 positive
>>>> and 50 negative examples to train the classifier. After training and
>>>> testing the model, the confusion matrix tells you the number of
>>>> correctly and incorrectly classified instances. Let's assume I got 85%
>>>> correct and 15% incorrect instances.
>>>>
>>>> Now if I run my program again using the same 50 negative and 50 positive
>>>> examples, then according to my knowledge the classifier should yield the
>>>> same results as before (because not a single training or testing example
>>>> was changed), but this is not the case. I get different results for
>>>> different runs. The confusion matrix figures change each time I generate
>>>> a model while keeping the data constant. What I do is generate a model
>>>> several times and keep an eye on the accuracy, and if it is above 90%,
>>>> I stop running the code and hence an accurate model is created.
>>>>
>>>> So what you are saying is to shuffle my data before I use it for
>>>> training and testing?
>>>> Thanks!
>>>>
>>>> On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:
>>>>
>>>>> Now I remember: SGD wants its data input in random order. You need to
>>>>> permute the order of your data.
>>>>>
>>>>> If that does not help, another trick: for each data point, randomly
>>>>> generate 5 or 10 or 20 points which are close. And again, randomly
>>>>> permute the entire input set.
>>>>>
>>>>> On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <[email protected]> wrote:
>>>>>> The more data you have, the closer each run will be. How much data do
>>>>>> you have?
>>>>>>
>>>>>> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <[email protected]> wrote:
>>>>>>> I have noticed that every time I train and test a model using the
>>>>>>> same data (in the SGD algorithm), I get a different confusion matrix.
>>>>>>> Meaning, if I generate a model and look at the confusion matrix, it
>>>>>>> might say 90% correctly classified instances, but if I generate the
>>>>>>> model again (with the SAME data for training and testing as before)
>>>>>>> and test it, the confusion matrix changes and it might say 75%
>>>>>>> correctly classified instances.
>>>>>>>
>>>>>>> Is this a desired behavior?
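The advice in this thread (feed SGD its examples in random order, and make many passes over a tiny data set) can be illustrated with a minimal plain-Python sketch. It is not Mahout code and does not use OnlineLogisticRegression or AdaptiveLogisticRegression. The names train_sgd, accuracy and make_toy_data, the learning rate, and the synthetic two-feature data are all assumptions made for illustration. Fixing the random seed is also an addition that nobody in the thread proposed; it is shown only because it makes repeated runs comparable.

```python
import math
import random


def sigmoid(z):
    # Logistic function written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_sgd(examples, passes=100, seed=42, learning_rate=0.1):
    """Binary logistic regression trained with plain SGD.

    examples: list of (features, label) pairs, label in {0, 1}.
    Returns (weights, bias).
    """
    rng = random.Random(seed)            # fixed seed -> repeatable shuffles
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    order = list(range(len(examples)))
    for _ in range(passes):              # many passes over a small data set
        rng.shuffle(order)               # new random order on every pass
        for i in order:
            x, y = examples[i]
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            err = y - p                  # gradient of the log-likelihood
            bias += learning_rate * err
            weights = [w + learning_rate * err * v
                       for w, v in zip(weights, x)]
    return weights, bias


def accuracy(model, examples):
    # Fraction of examples whose predicted class matches the label.
    weights, bias = model
    hits = 0
    for x, y in examples:
        p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
        hits += int((p >= 0.5) == (y == 1))
    return hits / len(examples)


def make_toy_data(n_per_class=50, seed=0):
    # Hypothetical stand-in for 50 positive and 50 negative examples.
    rng = random.Random(seed)
    data = []
    for _ in range(n_per_class):
        data.append(([rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)], 1))
        data.append(([rng.gauss(-2.0, 1.0), rng.gauss(-2.0, 1.0)], 0))
    return data


if __name__ == "__main__":
    data = make_toy_data()
    for seed in (1, 2, 3):
        model = train_sgd(data, passes=100, seed=seed)
        print(seed, model, accuracy(model, data))
```

Calling train_sgd twice with the same seed gives identical weights, while changing the seed changes the shuffle order and, typically, the resulting weights. That order dependence is one plausible source of the run-to-run variation described above. It would likely be more visible with passes=1 than with passes=100.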
