Sparseness is a possible cause, but more likely this is the known pathology
of adaptive logistic regression, in which it gets over-confident and locks
down the learning rate too early.

I have a few suggestions:

1) Try OnlineLogisticRegression.  I think you can find decent training
parameters pretty easily, and that would avoid the issue I mentioned (at
the cost of some manual tuning).

2) Post some anonymized data and I will try a few different techniques and
post back comparisons.  Notably, for small data like this, glmnet is
probably the gold standard to compare against.  You shouldn't need to do
the step-wise stuff, because you will get an entire plot of which variables
are significant at different amounts of regularization.
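
To illustrate what such a regularization path looks like, here is a minimal
sketch in plain Python.  It is not glmnet and does not use glmnet's
algorithm: it swaps in simple proximal gradient descent for L1-penalized
logistic regression.  The data, the function names and the lambda values
are all made up for illustration.  Feature 0 is informative by construction
and feature 1 is pure noise.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(value, threshold):
    # Proximal operator of the L1 penalty: shrink toward zero, clip at zero.
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_l1_logistic(rows, labels, lam, steps=500, lr=0.5):
    """Proximal gradient descent on mean log-loss + lam * sum(|coef|).
    The intercept is not penalized."""
    n, d = len(rows), len(rows[0])
    intercept, coefs = 0.0, [0.0] * d
    for _ in range(steps):
        grad_b, grad_w = 0.0, [0.0] * d
        for x, y in zip(rows, labels):
            err = sigmoid(intercept + sum(c * v for c, v in zip(coefs, x))) - y
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        intercept -= lr * grad_b / n
        coefs = [soft_threshold(c - lr * g / n, lr * lam)
                 for c, g in zip(coefs, grad_w)]
    return intercept, coefs

# Hypothetical data: (features, row count, positives) per cell.
# Feature 0 moves the positive rate from 0.2 to 0.6; feature 1 has no effect.
cells = [((1, 1), 25, 15), ((1, 0), 25, 15), ((0, 1), 25, 5), ((0, 0), 25, 5)]
rows, labels = [], []
for features, count, positives in cells:
    rows += [features] * count
    labels += [1] * positives + [0] * (count - positives)

# Fit at decreasing penalties and record the coefficients at each one.
path = {}
for lam in (0.2, 0.05, 0.01, 0.001):
    _, coefs = fit_l1_logistic(rows, labels, lam)
    path[lam] = coefs
    print("lambda", lam, "coefficients", [round(c, 3) for c in coefs])
```

Reading the output from the largest penalty to the smallest shows the order
in which variables enter the model, which is the information the glmnet
plot gives you without any step-wise selection.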

If you can do (2), it would be fabulous if you could actually allow use of
the data as a test case.  That would have the highest benefit to you since
it would mean that Mahout won't ever forget your needs.  :-)
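
For suggestion (1), the idea of hand-set training parameters can be sketched
as follows.  This is plain Python, not Mahout code: the schedule, its
parameter names (mu0, offset, decay) and the data are assumptions made up
for illustration, using a single three-category feature like the one
described in the question.  With one-hot coding and no intercept, each
weight is the log-odds of a click in its category, so the fitted rates can
be compared directly with the actual ones.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def learning_rate(step, mu0=0.5, offset=100.0, decay=1.0):
    # Hand-set annealing schedule (illustrative): starts at mu0 and shrinks
    # polynomially with the step count, independent of the data.
    return mu0 * (offset / (offset + step)) ** decay

def train(examples, passes=10, seed=42):
    """Plain SGD for logistic regression on one one-hot coded feature."""
    rng = random.Random(seed)   # fixed seed so repeated runs are identical
    weights = [0.0, 0.0, 0.0]
    data = list(examples)
    step = 0
    for _ in range(passes):
        rng.shuffle(data)
        for category, label in data:
            p = sigmoid(weights[category])
            weights[category] += learning_rate(step) * (label - p)
            step += 1
    return weights

# Hypothetical data: category -> (examples, clicks).
counts = {0: (1000, 300), 1: (1000, 100), 2: (1000, 50)}
examples = []
for category, (n, clicks) in counts.items():
    examples += [(category, 1)] * clicks + [(category, 0)] * (n - clicks)

weights = train(examples)
for category, (n, clicks) in counts.items():
    print("category", category,
          "actual rate", clicks / n,
          "fitted rate", round(sigmoid(weights[category]), 4))
```

Because the schedule and the shuffle seed are fixed, two runs on the same
input give the same weights, which makes it easier to tell a data problem
from a training problem.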

On Sun, Jul 15, 2012 at 9:42 AM, Seda Sinangil <[email protected]> wrote:

> I am running adaptive logistic regression on a data set consisting of
> 250k training examples for click-through rate prediction (this sample
> contains 350 clicks). To start out, I am trying each feature alone, by
> itself, to see how much it correlates with the data set. I have 2 problems:
>
> First my results are not consistent. I run my program with same input and
> configuration back to back, but the results it produces vary a lot.
> Sometimes my weights are around -3.3xxxx (which makes most sense),
> sometimes around -1.xxxx mark, but mostly around 0.000xx.
>
> Second, when I use one of my simple features with three categories and
> compare the regression results with the actual rates, sometimes the results
> do not correlate. Results usually give coefficients in favor of the wrong
> features.  And sometimes, when the order is okay, the suggested results seem
> to be overestimated compared with the actual ones.
>
> I have tried
> 1)changing number of passes between 1 and 20 (as far as I learned so far,
> with my data set size for adaptive logistic regression, theoretically 1
> pass should be enough)
>
> 2) played with window size and interval (I'm not exactly sure how these
> are supposed to impact the results - larger window and interval size seemed
> to produce better results up to a certain point - window size:5000,
> interval:8000)
>
> 3) shuffling the data set before each pass, which didn't really change the
> results
>
> 4) downsampling of non-click samples which made things even worse
>
> my questions are :
>
> Is it normal that I get inconsistent results even though I don't have any
> random part on my side of the code?
> Can this be happening because my data is too sparse?
> What else can I try tweaking?
> Can you think of anything I might be missing out?
>
> Thank you,
> Seda
