"Try passing through the data 100 times for a start. " And randomize the order each time?

On Fri, Aug 31, 2012 at 9:04 AM, Salman Mahmood <[email protected]> wrote:

> Cheers Ted. Appreciate the input!
>
> Sent from my iPhone
>
> On 31 Aug 2012, at 17:53, Ted Dunning <[email protected]> wrote:
>
>> OK.
>>
>> Try passing through the data 100 times for a start. I think that this is
>> likely to fix your problems.
>>
>> Be warned that AdaptiveLogisticRegression has been misbehaving lately and
>> may converge faster than it should.
>>
>> On Fri, Aug 31, 2012 at 9:33 AM, Salman Mahmood <[email protected]> wrote:
>>
>>> Thanks a lot Ted. Here are the answers:
>>>
>>> d) Data (news articles from different feeds)
>>>
>>> News Article 1:
>>> Title: BP Profits Plunge On Massive Asset Write-down
>>> Description: BP PLC (BP) Tuesday posted a dramatic fall of 96% in
>>> adjusted profit for the second quarter as it wrote down the value of
>>> its assets by $5 billion including some U.S. refineries a suspended
>>> Alaskan oil project and U.S. shale gas resources
>>>
>>> News Article 2:
>>> Title: Morgan Stanley Missed Big
>>> Description: Why It's Still A Fantastic Short,"By Mike Williams: Though
>>> the market responded very positively to Citigroup (C) and Bank of
>>> America's (BAC) reserve release-driven earnings ""beats"" last week's
>>> Morgan Stanley (MS) earnings report illustrated what happens when a
>>> bank doesn't have billions of reserves to release back into earnings.
>>> Estimates called for the following: $.43 per share in earnings $.29 per
>>> share in earnings ex-DVA (debt value adjustment) $7.7 billion in
>>> revenue GAAP results (including the DVA) came in at $.28 per share
>>> while ex-DVA earnings were $.16. Revenue was a particular
>>> disappointment coming in at $6.95 billion.
>>>
>>> c) As you can see, the data is textual. I am using the title and
>>> description as predictor variables, and the target variable is the
>>> company a news article belongs to.
>>>
>>> b) I am passing through the data once (at least this is what I think).
>>> I followed the 20newsgroup example code (in Java) and didn't find that
>>> the data was passed more than once.
>>> Yes, I randomize the order every time.
>>>
>>> a) I am using AdaptiveLogisticRegression (just like 20newsgroup).
>>>
>>> Thanks!
>>>
>>> On Aug 31, 2012, at 2:27 PM, Ted Dunning wrote:
>>>
>>>> First, this is a tiny training set. You are well outside the intended
>>>> application range so you are likely to find less experience in the
>>>> community in that range. That said, the algorithm should still produce
>>>> reasonably stable results.
>>>>
>>>> Here are a few questions:
>>>>
>>>> a) which class are you using to train your model? I would start with
>>>> OnlineLogisticRegression and experiment with training rate schedules and
>>>> amount of regularization to find out how to build a good model.
>>>>
>>>> b) how many times are you passing through your data? Do you randomize
>>>> the order each time? These are critical to proper training. Instead of
>>>> randomizing order, you could just sample a data point at random and not
>>>> worry about using a complete permutation of the data. With such a tiny
>>>> data set, you will need to pass through the data many times ... possibly
>>>> hundreds of times or more.
>>>>
>>>> c) what kind of data do you have? Sparse? Dense? How many variables?
>>>> What kind?
>>>>
>>>> d) can you post your data?
>>>>
>>>> On Fri, Aug 31, 2012 at 5:03 AM, Salman Mahmood <[email protected]> wrote:
>>>>
>>>>> Thanks a lot Lance. Let me elaborate on the problem in case it was a
>>>>> bit confusing.
>>>>>
>>>>> Assume I am making a binary classifier using SGD. I have got 50
>>>>> positive and 50 negative examples to train the classifier. After
>>>>> training and testing the model, the confusion matrix tells you the
>>>>> number of correctly and incorrectly classified instances. Let's assume
>>>>> I got 85% correct and 15% incorrect instances.
>>>>>
>>>>> Now if I run my program again using the same 50 negative and 50
>>>>> positive examples, then to my knowledge the classifier should yield
>>>>> the same results as before (because not even a single training or
>>>>> testing example was changed), but this is not the case. I get
>>>>> different results for different runs. The confusion matrix figures
>>>>> change each time I generate a model while keeping the data constant.
>>>>> What I do is generate a model several times and keep an eye on the
>>>>> accuracy; once it is above 90%, I stop running the code and keep that
>>>>> model as the accurate one.
>>>>>
>>>>> So what you are saying is to shuffle my data before I use it for
>>>>> training and testing?
>>>>> Thanks!
>>>>>
>>>>> On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:
>>>>>
>>>>>> Now I remember: SGD wants its data input in random order. You need to
>>>>>> permute the order of your data.
>>>>>>
>>>>>> If that does not help, another trick: for each data point, randomly
>>>>>> generate 5 or 10 or 20 points which are close. And again, randomly
>>>>>> permute the entire input set.
>>>>>>
>>>>>> On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <[email protected]> wrote:
>>>>>>> The more data you have, the closer each run will be. How much data
>>>>>>> do you have?
>>>>>>>
>>>>>>> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <[email protected]> wrote:
>>>>>>>> I have noticed that every time I train and test a model using the
>>>>>>>> same data (with the SGD algorithm), I get a different confusion
>>>>>>>> matrix. Meaning, if I generate a model and look at the confusion
>>>>>>>> matrix, it might say 90% correctly classified instances, but if I
>>>>>>>> generate the model again (with the SAME data for training and
>>>>>>>> testing as before) and test it, the confusion matrix changes and it
>>>>>>>> might say 75% correctly classified instances.
>>>>>>>>
>>>>>>>> Is this the desired behavior?
>>>>>>>
>>>>>>> --
>>>>>>> Lance Norskog
>>>>>>> [email protected]
>>>>>>
>>>>>> --
>>>>>> Lance Norskog
>>>>>> [email protected]

--
Lance Norskog
[email protected]
