This paper is probably of interest for this problem:

http://research.microsoft.com/apps/pubs/default.aspx?id=122779



On Thu, Dec 27, 2012 at 6:14 AM, Johannes Schulte <
[email protected]> wrote:

> Oops, hit enter too early...
>
> Just wanted to say that those are the two ways I'm thinking of right now,
> since I'm facing a similar challenge. I'm thankful for any suggestions or
> comments.
>
> Cheers,
>
> Johannes
>
>
> On Thu, Dec 27, 2012 at 3:13 PM, Johannes Schulte <
> [email protected]> wrote:
>
> > Hi Pavel,
> >
> > First of all, I would include an intercept term in the model. This learns
> > the proportion of positive examples (the base rate) in the training set.
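> >
> > A minimal sketch of what I mean (the index layout is just a convention
> > I'm picking here, not something Mahout enforces): reserve slot 0 of every
> > feature vector for a constant 1.0, so the coefficient learned at index 0
> > acts as the intercept.
> >
> > import org.apache.mahout.math.DenseVector;
> > import org.apache.mahout.math.Vector;
> >
> > // Slot 0 is always 1.0 (intercept); the other slots hold real predictors.
> > Vector features = new DenseVector(2);
> > features.set(0, 1.0);          // intercept / bias column
> > features.set(1, genderIsMale); // hypothetical predictor, 0.0 or 1.0
> >
> > With that layout, the correction code further down adjusts exactly this
> > coefficient when it calls getBeta().get(0, 0).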
> >
> > Second, for getting "calibrated" probabilities out of the downsampled
> > model, I can think of two ways:
> >
> > 1. Use another set of input data to measure the observed maximum
> > likelihood CTR per score range. You can partition all the scores into
> > equally sized bins (ordered by score), then take another set of training
> > examples and measure a per-bin CTR: the number of clicks in the bin
> > divided by the bin size (a rough sketch follows after the code below).
> > If this is too static, you can further look into something like the
> > "Pool Adjacent Violators" algorithm, which produces a piecewise constant
> > curve.
> >
> > 2. If you have included an intercept term, you can use one of the methods
> > described in this paper to adjust the model intercept directly:
> >
> > http://dash.harvard.edu/bitstream/handle/1/4125045/relogit%20rare%20events.pdf?sequence=2
> >
> > This paper has been recommended here before.
> >
> > For the logistic regression in Mahout and the "Prior Correction" in the
> > paper, this would become something like this:
> >
> > // Prior correction of the intercept: shift beta_0 by
> > // log(((1 - tau) / tau) * (ybar / (1 - ybar))), where tau is the true
> > // CTR and ybar is the CTR observed in the (downsampled) training data.
> > double trueCtr = 0.01;               // true, un-downsampled CTR
> > double sampleCtr = (double) ac / ai; // clicks (ac) over examples (ai) in the sample
> > double correction = Math.log(((1 - trueCtr) / trueCtr)
> >     * (sampleCtr / (1 - sampleCtr)));
> > double betaBefore = learner.getBeta().get(0, 0); // intercept coefficient
> > learner.setBeta(0, 0, betaBefore - correction);
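> >
> > Going back to the first way (binning), here is a rough, self-contained
> > sketch of the idea (plain Java, nothing Mahout-specific; the class name
> > BinnedCalibrator is made up). It builds equally sized score bins from a
> > held-out set and maps each bin to its observed CTR:
> >
> > import java.util.Arrays;
> >
> > public class BinnedCalibrator {
> >
> >   private final double[] upperBounds; // highest score seen in each bin
> >   private final double[] binCtr;      // observed CTR of each bin
> >
> >   public BinnedCalibrator(double[] scores, boolean[] clicked, int numBins) {
> >     // Order example indices by model score.
> >     Integer[] order = new Integer[scores.length];
> >     for (int i = 0; i < order.length; i++) {
> >       order[i] = i;
> >     }
> >     Arrays.sort(order, (a, b) -> Double.compare(scores[a], scores[b]));
> >
> >     upperBounds = new double[numBins];
> >     binCtr = new double[numBins];
> >     int binSize = scores.length / numBins;
> >     for (int b = 0; b < numBins; b++) {
> >       int from = b * binSize;
> >       int to = (b == numBins - 1) ? scores.length : from + binSize;
> >       int clicks = 0;
> >       for (int i = from; i < to; i++) {
> >         if (clicked[order[i]]) {
> >           clicks++;
> >         }
> >       }
> >       upperBounds[b] = scores[order[to - 1]];    // bin boundary
> >       binCtr[b] = (double) clicks / (to - from); // clicks / bin size
> >     }
> >   }
> >
> >   // Map a raw model score to the observed CTR of its bin.
> >   public double calibrate(double score) {
> >     for (int b = 0; b < upperBounds.length; b++) {
> >       if (score <= upperBounds[b]) {
> >         return binCtr[b];
> >       }
> >     }
> >     return binCtr[binCtr.length - 1];
> >   }
> > }
> >
> > The lookup gives a piecewise constant calibration; PAV would give you a
> > similar shape, but with monotonicity enforced.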
> >
> >
> >
> >
> > On Thu, Dec 27, 2012 at 2:09 PM, Abramov Pavel
> > <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> I'm trying to predict web-page click probability using Mahout. In my
> >> case every click has its "cost" and the goal is to maximize the sum
> >> (just like Google does in sponsored search, but my ads are
> >> non-commercial, they contain media, and I don't need to handle billions
> >> of ads).
> >>
> >> That's why I want to predict the click-through rate (CTR) for every
> >> user impression. After CTR prediction I can sort my ads by CTR*Cost and
> >> "recommend" the top items.
> >>
> >> The idea is to train the model (one per ad, since I can't extract ad
> >> features) using historical data and user features (age, gender,
> >> interests, latent factors, etc.).
> >>
> >> For example, there is an ad shown 250 000 times and clicked 2 000
> >> times. Its average CTR is 0.008.
> >>
> >> It's time to create the training set and train the model. I will start
> >> with logistic regression (Mahout SGD) with 1 target variable (clicked or
> >> not) and 1 predictor (the user's gender, 1/0).
> >>
> >> I will provide the model with the full negative (target variable = 0,
> >> total examples = 250 000) and positive (target variable = 1, total
> >> examples = 2 000) sets.
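> >>
> >> Roughly, I imagine something like this (just a sketch using Mahout's
> >> OnlineLogisticRegression; Impression and trainingSet are placeholders
> >> for my own data structures):
> >>
> >> import org.apache.mahout.classifier.sgd.L1;
> >> import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
> >> import org.apache.mahout.math.DenseVector;
> >> import org.apache.mahout.math.Vector;
> >>
> >> // 2 categories (clicked / not clicked), 2 features (intercept + gender).
> >> OnlineLogisticRegression learner =
> >>     new OnlineLogisticRegression(2, 2, new L1());
> >>
> >> for (Impression imp : trainingSet) {        // placeholder data types
> >>   Vector v = new DenseVector(2);
> >>   v.set(0, 1.0);                            // intercept term
> >>   v.set(1, imp.isMale ? 1.0 : 0.0);         // user's gender as 1/0
> >>   learner.train(imp.wasClicked ? 1 : 0, v); // target: clicked or not
> >> }
> >>
> >> // Predicted click probability for, e.g., a male user:
> >> Vector query = new DenseVector(2);
> >> query.set(0, 1.0);
> >> query.set(1, 1.0);
> >> double ctrEstimate = learner.classifyScalar(query);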
> >>
> >> Q1: Is this correct? Or should I downsample the negative set from
> >> 250 000 to 2 000 (as suggested by the Mahout in Action examples)?
> >>
> >> Q2: Is logistic regression a good idea? In the real world I have tens
> >> of ads, hundreds of user features and millions of impressions. Should I
> >> try more complex algorithms such as random forests or neural networks
> >> (with a sigmoid activation function)?
> >>
> >>
> >> OK, I trained my model with "trainlogistic". AUC = 0.60 (not bad for 1
> >> predictor).
> >> Q3: How do I convert the model output (I used "runlogistic") to a click
> >> probability? My model always outputs ~0.5 on both sets (downsampled and
> >> original), so the model average is 0.5 while the "real" average CTR is
> >> 0.008.
> >>
> >> Thanks in advance!
> >>
> >> Pavel
> >>
