Thanks a lot, Ted. Here are the answers:
d) Data (news articles from different feeds)
News Article 1: Title: BP Profits Plunge On Massive Asset Write-down
Description: BP PLC (BP) Tuesday posted a
dramatic fall of 96% in adjusted profit for the second quarter
as it wrote down the value of its assets by $5 billion, including some U.S.
refineries, a suspended Alaskan oil project, and U.S. shale gas resources.
News Article 2: Title: Morgan Stanley Missed Big: Why It's Still A Fantastic
Short
Description: By Mike Williams: Though the market responded very positively to
Citigroup (C) and Bank of America's (BAC) reserve release-driven earnings
"beats" last week, Morgan Stanley's (MS) earnings report illustrated what
happens when a bank doesn't have billions of reserves to release back into
earnings. Estimates called for the following: $.43 per share in earnings, $.29
per share in earnings ex-DVA (debt value adjustment), and $7.7 billion in
revenue. GAAP results (including the DVA) came in at $.28 per share, while
ex-DVA earnings were $.16. Revenue was a particular disappointment, coming in
at $6.95 billion.
c) As you can see, the data is textual. I am using the title and description
as the predictor variables, and the target variable is the company a news
article belongs to.
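
For reference, the encoding follows the 20 newsgroups example, roughly like
this sketch (the FEATURES size and the crude tokenization here are simplified
stand-ins, not my exact code):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class NewsEncoder {
      // Hashed feature space; 10,000 is a stand-in, not a tuned value.
      private static final int FEATURES = 10000;

      private final StaticWordValueEncoder words =
          new StaticWordValueEncoder("words");
      private final ConstantValueEncoder bias =
          new ConstantValueEncoder("intercept");

      public Vector encode(String title, String description) {
        Vector v = new RandomAccessSparseVector(FEATURES);
        bias.addToVector("", 1, v);  // constant intercept term
        // Crude tokenization over both predictor fields; the encoder
        // hashes each word into the FEATURES-sized vector.
        for (String word
            : (title + " " + description).toLowerCase().split("\\W+")) {
          if (!word.isEmpty()) {
            words.addToVector(word, v);
          }
        }
        return v;
      }
    }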
b) I am passing through the data once (at least I think so). I followed the
20 newsgroups example code (in Java) and didn't find that the data was
passed more than once.
Yes, I randomize the order every time.
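
From your point b) below, I understand I should switch to many shuffled
passes; I assume the loop would look roughly like this (Example here is a
placeholder pair holding a target id and its encoded vector, and the pass
count is a guess):

    import java.util.Collections;
    import java.util.List;
    import java.util.Random;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;

    void train(OnlineLogisticRegression learner, List<Example> examples) {
      Random rand = new Random();
      int numPasses = 200;  // "hundreds of times or more" for ~100 examples
      for (int pass = 0; pass < numPasses; pass++) {
        Collections.shuffle(examples, rand);  // fresh random order each pass
        for (Example ex : examples) {
          learner.train(ex.target, ex.vector);
        }
      }
    }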
a) I am using AdaptiveLogisticRegression (just like the 20 newsgroups example).
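
I will also experiment with OnlineLogisticRegression directly as you suggest
in a) below; I assume a starting configuration like this sketch, with the
constants copied from the 20 newsgroups example rather than tuned for my data:

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;

    int numCompanies = 20;  // placeholder: one category per company
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(numCompanies, 10000, new L1())
            .alpha(1)            // per-example learning-rate decay
            .stepOffset(1000)    // delays the annealing early on
            .decayExponent(0.9)  // how fast the rate anneals
            .lambda(3.0e-5)      // amount of regularization
            .learningRate(20);   // initial learning rate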
Thanks!
On Aug 31, 2012, at 2:27 PM, Ted Dunning wrote:
> First, this is a tiny training set. You are well outside the intended
> application range so you are likely to find less experience in the
> community in that range. That said, the algorithm should still produce
> reasonably stable results.
>
> Here are a few questions:
>
> a) which class are you using to train your model? I would start with
> OnlineLogisticRegression and experiment with training rate schedules and
> amount of regularization to find out how to build a good model.
>
> b) how many times are you passing through your data? Do you randomize the
> order each time? These are critical to proper training. Instead of
> randomizing order, you could just sample a data point at random and not
> worry about using a complete permutation of the data. With such a tiny
> data set, you will need to pass through the data many times ... possibly
> hundreds of times or more.
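>
> A sketch of the sampling variant, assuming 'examples' holds your encoded
> data and 'learner' is your model:
>
>     Random rand = new Random();
>     // Sampling with replacement; size * 500 draws gives roughly the same
>     // exposure as hundreds of full passes.
>     for (int i = 0; i < examples.size() * 500; i++) {
>       Example ex = examples.get(rand.nextInt(examples.size()));
>       learner.train(ex.target, ex.vector);
>     }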
>
> c) what kind of data do you have? Sparse? Dense? How many variables?
> What kind?
>
> d) can you post your data?
>
>
> On Fri, Aug 31, 2012 at 5:03 AM, Salman Mahmood <[email protected]> wrote:
>
>> Thanks a lot, Lance. Let me elaborate on the problem in case it was a bit confusing.
>>
>> Assume I am making a binary classifier using SGD. I have 50 positive
>> and 50 negative examples to train the classifier. After training and
>> testing the model, the confusion matrix tells you the number of correctly
>> and incorrectly classified instances. Let's assume I got 85% correct and
>> 15% incorrect instances.
>>
>> Now if I run my program again using the same 50 negative and 50 positive
>> examples, then to my knowledge the classifier should yield the same
>> results as before (since not a single piece of training or testing data
>> was changed), but this is not the case. I get different results on
>> different runs. The confusion matrix figures change each time I generate a
>> model, even though the data is constant. What I do is generate a model
>> several times and keep an eye on the accuracy; if it is above 90%, I stop
>> running the code, and hence an accurate model is created.
>>
>> So you are saying I should shuffle my data before using it for training
>> and testing?
>> Thanks!
>> On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:
>>
>>> Now I remember: SGD wants its data input in random order. You need to
>>> permute the order of your data.
>>>
>>> If that does not help, another trick: for each data point, randomly
>>> generate 5 or 10 or 20 points which are close. And again, randomly
>>> permute the entire input set.
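>>>
>>> A sketch of the jitter, assuming Mahout vectors; the 0.01 noise scale is
>>> a guess you would have to tune:
>>>
>>>     import java.util.Iterator;
>>>     import java.util.Random;
>>>     import org.apache.mahout.math.Vector;
>>>
>>>     // Make a synthetic neighbor by nudging every nonzero feature a bit.
>>>     Vector jitter(Vector v, Random rand) {
>>>       Vector copy = v.clone();
>>>       Iterator<Vector.Element> it = copy.iterateNonZero();
>>>       while (it.hasNext()) {
>>>         Vector.Element e = it.next();
>>>         e.set(e.get() * (1 + 0.01 * rand.nextGaussian()));
>>>       }
>>>       return copy;
>>>     }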
>>>
>>> On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <[email protected]> wrote:
>>>> The more data you have, the closer each run will be. How much data do
>>>> you have?
>>>>
>>>> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <[email protected]> wrote:
>>>>> I have noticed that every time I train and test a model using the same
>>>>> data (with the SGD algorithm), I get a different confusion matrix.
>>>>> Meaning, if I generate a model and look at the confusion matrix, it might
>>>>> say 90% correctly classified instances, but if I generate the model again
>>>>> (with the SAME data for training and testing as before) and test it, the
>>>>> confusion matrix changes and might say 75% correctly classified instances.
>>>>>
>>>>> Is this expected behavior?
>>>>
>>>>
>>>>
>>>> --
>>>> Lance Norskog
>>>> [email protected]
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> [email protected]
>>
>>