Thanks a lot, Ted. Here are the answers:
d) Data (news articles from different feeds)
News Article 1: Title: BP Profits Plunge On Massive Asset Write-down
Description: BP PLC (BP) Tuesday posted a
dramatic fall of 96% in adjusted profit for the second quarter
as it wrote down the value of its assets by $5 billion, including some U.S.
refineries, a suspended Alaskan oil project, and U.S. shale gas resources.
News Article 2: Title: Morgan Stanley Missed Big: Why It's Still A Fantastic
Short
Description: By Mike Williams: Though the market responded very positively to
Citigroup (C) and Bank of America's (BAC) reserve release-driven earnings
"beats" last week, Morgan Stanley's (MS) earnings report illustrated what
happens when a bank doesn't have billions of reserves to release back into
earnings. Estimates called for the following: $.43 per share in earnings, $.29
per share in earnings ex-DVA (debt value adjustment), and $7.7 billion in
revenue. GAAP results (including the DVA) came in at $.28 per share, while
ex-DVA earnings were $.16. Revenue was a particular disappointment, coming in
at $6.95 billion.
c) As you can see, the data is textual. I am using the title and description
as the predictor variables, and the target variable is the company a news
article belongs to.
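
For reference, the encoding follows the 20 newsgroups example, roughly like
this sketch (the FEATURES size and the crude tokenization here are simplified
stand-ins, not my exact code):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class NewsEncoder {
      // Hashed feature space; 10,000 is a stand-in, not a tuned value.
      private static final int FEATURES = 10000;

      private final StaticWordValueEncoder words =
          new StaticWordValueEncoder("words");
      private final ConstantValueEncoder bias =
          new ConstantValueEncoder("intercept");

      public Vector encode(String title, String description) {
        Vector v = new RandomAccessSparseVector(FEATURES);
        bias.addToVector("", 1, v);  // constant intercept term
        // Crude tokenization over both predictor fields; the encoder
        // hashes each word into the FEATURES-sized vector.
        for (String word
            : (title + " " + description).toLowerCase().split("\\W+")) {
          if (!word.isEmpty()) {
            words.addToVector(word, v);
          }
        }
        return v;
      }
    }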
b) I am passing through the data once (at least I think so). I followed the
20 newsgroups example code (in Java) and didn't find that the data was
passed more than once.
Yes, I randomize the order every time.
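
From your point b) below, I understand I should switch to many shuffled
passes; I assume the loop would look roughly like this (Example here is a
placeholder pair holding a target id and its encoded vector, and the pass
count is a guess):

    import java.util.Collections;
    import java.util.List;
    import java.util.Random;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;

    void train(OnlineLogisticRegression learner, List<Example> examples) {
      Random rand = new Random();
      int numPasses = 200;  // "hundreds of times or more" for ~100 examples
      for (int pass = 0; pass < numPasses; pass++) {
        Collections.shuffle(examples, rand);  // fresh random order each pass
        for (Example ex : examples) {
          learner.train(ex.target, ex.vector);
        }
      }
    }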
a) I am using AdaptiveLogisticRegression (just like the 20 newsgroups example).
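
I will also experiment with OnlineLogisticRegression directly as you suggest
in a) below; I assume a starting configuration like this sketch, with the
constants copied from the 20 newsgroups example rather than tuned for my data:

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;

    int numCompanies = 20;  // placeholder: one category per company
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(numCompanies, 10000, new L1())
            .alpha(1)            // per-example learning-rate decay
            .stepOffset(1000)    // delays the annealing early on
            .decayExponent(0.9)  // how fast the rate anneals
            .lambda(3.0e-5)      // amount of regularization
            .learningRate(20);   // initial learning rate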
Thanks!
On Aug 31, 2012, at 2:27 PM, Ted Dunning wrote:
> First, this is a tiny training set. You are well outside the intended
> application range so you are likely to find less experience in the
> community in that range. That said, the algorithm should still produce
> reasonably stable results.
>
> Here are a few questions:
>
> a) which class are you using to train your model? I would start with
> OnlineLogisticRegression and experiment with training rate schedules and
> amount of regularization to find out how to build a good model.
>
> b) how many times are you passing through your data? Do you randomize the
> order each time? These are critical to proper training. Instead of
> randomizing order, you could just sample a data point at random and not
> worry about using a complete permutation of the data. With such a tiny
> data set, you will need to pass through the data many times ... possibly
> hundreds of times or more.
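>
> A sketch of the sampling variant, assuming 'examples' holds your encoded
> data and 'learner' is your model:
>
>     Random rand = new Random();
>     // Sampling with replacement; size * 500 draws gives roughly the same
>     // exposure as hundreds of full passes.
>     for (int i = 0; i < examples.size() * 500; i++) {
>       Example ex = examples.get(rand.nextInt(examples.size()));
>       learner.train(ex.target, ex.vector);
>     }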
>
> c) what kind of data do you have? Sparse? Dense? How many variables?
> What kind?
>
> d) can you post your data?
>
>
> On Fri, Aug 31, 2012 at 5:03 AM, Salman Mahmood <[email protected]> wrote:
>
>> Thanks a lot, Lance. Let me elaborate on the problem in case it was a bit confusing.
>>
>> Assume I am making a binary classifier using SGD. I have 50 positive
>> and 50 negative examples to train the classifier. After training and
>> testing the model, the confusion matrix tells you the number of correctly
>> and incorrectly classified instances. Let's assume I got 85% correct and
>> 15% incorrect instances.
>>
>> Now if I run my program again using the same 50 negative and 50 positive
>> examples, then to my knowledge the classifier should yield the same
>> results as before (since not a single piece of training or testing data
>> was changed), but this is not the case. I get different results on
>> different runs. The confusion matrix figures change each time I generate a
>> model, even though the data is constant. What I do is generate a model
>> several times and keep an eye on the accuracy; if it is above 90%, I stop
>> running the code, and hence an accurate model is created.
>>
>> So you are saying I should shuffle my data before using it for training
>> and testing?
>> Thanks!
>> On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:
>>
>>> Now I remember: SGD wants its data input in random order. You need to
>>> permute the order of your data.
>>>
>>> If that does not help, another trick: for each data point, randomly
>>> generate 5 or 10 or 20 points which are close. And again, randomly
>>> permute the entire input set.
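>>>
>>> A sketch of the jitter, assuming Mahout vectors; the 0.01 noise scale is
>>> a guess you would have to tune:
>>>
>>>     import java.util.Iterator;
>>>     import java.util.Random;
>>>     import org.apache.mahout.math.Vector;
>>>
>>>     // Make a synthetic neighbor by nudging every nonzero feature a bit.
>>>     Vector jitter(Vector v, Random rand) {
>>>       Vector copy = v.clone();
>>>       Iterator<Vector.Element> it = copy.iterateNonZero();
>>>       while (it.hasNext()) {
>>>         Vector.Element e = it.next();
>>>         e.set(e.get() * (1 + 0.01 * rand.nextGaussian()));
>>>       }
>>>       return copy;
>>>     }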
>>>
>>> On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <[email protected]> wrote:
>>>> The more data you have, the closer each run will be. How much data do
>>>> you have?
>>>>
>>>> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <[email protected]> wrote:
>>>>> I have noticed that every time I train and test a model using the same
>>>>> data (with the SGD algorithm), I get a different confusion matrix.
>>>>> Meaning, if I generate a model and look at the confusion matrix, it might
>>>>> say 90% correctly classified instances, but if I generate the model again
>>>>> (with the SAME data for training and testing as before) and test it, the
>>>>> confusion matrix changes and might say 75% correctly classified instances.
>>>>>
>>>>> Is this expected behavior?
>>>>
>>>>
>>>>
>>>> --
>>>> Lance Norskog
>>>> [email protected]
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> [email protected]
>>
>>