Hi Pavel, first of all I would include an intercept term in the model. It learns the base rate, i.e. the proportion of positive examples in the training set.
Second, to get "calibrated" probabilities out of the downsampled model, I can think of two ways:

1. Use a held-out set of input data to measure the observed CTR per score range. Partition all the scores into equally sized bins (ordered by score), then take another set of examples and measure a "per-bin CTR": the number of clicks in the bin divided by the bin size. If this is too static, you can further look into something like the "Pool Adjacent Violators" algorithm, which produces a piecewise-constant calibration curve.

2. If you have included an intercept term, you can use one of the methods described in this paper to adjust the model intercept directly:
http://dash.harvard.edu/bitstream/handle/1/4125045/relogit%20rare%20events.pdf?sequence=2
This paper has been recommended here before. For the logistic regression in Mahout and the "Prior Correction" in the paper, it would become something like this:

    // prior correction of the intercept
    double trueCtr = 0.01;                // true CTR in the population
    double sampleCtr = (double) ac / ai;  // observed CTR in the downsampled training set
    double correction = Math.log(((1 - trueCtr) / trueCtr) * (sampleCtr / (1 - sampleCtr)));
    double betaBefore = learner.getBeta().get(0, 0);
    learner.setBeta(0, 0, betaBefore - correction);

On Thu, Dec 27, 2012 at 2:09 PM, Abramov Pavel <[email protected]> wrote:

> Hello,
>
> I'm trying to predict web-page click probability using Mahout. In my case
> every click has its "cost" and the goal is to maximize the sum (just like
> Google does in sponsored search, but my ads are non-commercial, they
> contain media, and I don't need to handle billions of ads).
>
> That's why I want to predict the click-through rate (CTR) for every user
> impression. After CTR prediction I can sort my ads by CTR*Cost and
> "recommend" the top items.
>
> The idea is to train the model (one per ad, I can't extract ad features)
> using historical data and user features (age, gender, interests, latent
> factors etc.).
>
> For example, there is an ad shown 250 000 times and clicked 2 000 times.
> Its average CTR is 0.008.
>
> It's time to create the learning set and train the model. I will start
> with logistic regression (Mahout SGD) with 1 target variable (was clicked
> or not) and 1 predictor (user's gender 1/0).
>
> I will provide the model with the full negative set (target variable = 0,
> total examples = 250 000) and the full positive set (target variable = 1,
> total examples = 2 000).
>
> Q1: Is this correct? Or should I downsample the negative set from 250 000
> to 2 000? (thanks to the Mahout in Action examples)
>
> Q2: Is logistic regression a good idea? In the real world I have tens of
> ads, hundreds of user features and millions of impressions. Should I try
> more complex algorithms such as random forests or neural nets (with a
> sigmoid activation function)?
>
> OK, I trained my model with "trainlogistic". AUC = 0.60 (not bad for 1
> predictor).
>
> Q3: How do I convert the model output (I used "runlogistic") to a click
> probability? My model always outputs ~0.5 on both sets (downsampled and
> original), so the model average is 0.5 while the "real" CTR average is
> 0.008.
>
> Thanks in advance!
>
> Pavel
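To make the binning idea in point 1 above concrete, here is a minimal sketch in plain Java (no Mahout dependencies; the class and method names are just for illustration). It assumes a held-out set already sorted ascending by model score:

```java
// Equal-frequency binning calibration (point 1): split a held-out set
// into bins of equal size by score, record the observed CTR per bin,
// and use that CTR as the calibrated probability for new scores.
public class BinCalibration {
    final double[] upperBounds; // upper score bound of each bin
    final double[] binCtr;      // observed CTR per bin

    // scores/clicks come from a held-out set, sorted ascending by score
    BinCalibration(double[] scores, boolean[] clicks, int numBins) {
        upperBounds = new double[numBins];
        binCtr = new double[numBins];
        int n = scores.length;
        for (int b = 0; b < numBins; b++) {
            int start = b * n / numBins;
            int end = (b + 1) * n / numBins;
            int clickCount = 0;
            for (int i = start; i < end; i++) {
                if (clicks[i]) clickCount++;
            }
            upperBounds[b] = scores[end - 1];
            // "per-bin CTR" = clicks in the bin / bin size
            binCtr[b] = (double) clickCount / (end - start);
        }
    }

    // look up the calibrated CTR for a new raw model score
    double calibrate(double score) {
        for (int b = 0; b < upperBounds.length; b++) {
            if (score <= upperBounds[b]) return binCtr[b];
        }
        return binCtr[binCtr.length - 1];
    }

    public static void main(String[] args) {
        double[] scores = {0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9};
        boolean[] clicks = {false, false, false, true, false, true, true, true};
        BinCalibration cal = new BinCalibration(scores, clicks, 2);
        System.out.println(cal.calibrate(0.35)); // low-score bin: observed CTR 0.25
    }
}
```

With many bins this gets noisy; that is where the Pool Adjacent Violators step comes in, merging neighboring bins until the calibration curve is monotone.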

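And to sanity-check the prior correction in point 2: with a balanced downsampled training set (sampleCtr = 0.5) and a true CTR of 0.008, a model that scores an example at 0.5 should calibrate back to exactly 0.008. A small standalone check (a hypothetical helper, not part of the Mahout API; it applies the correction to a raw score instead of to the model's intercept, which is equivalent):

```java
// Verifies the "Prior Correction" arithmetic from point 2 on a raw score.
public class PriorCorrectionCheck {
    // maps a raw score from the downsampled model to a calibrated probability
    static double calibrate(double rawScore, double trueCtr, double sampleCtr) {
        double correction = Math.log(((1 - trueCtr) / trueCtr)
                * (sampleCtr / (1 - sampleCtr)));
        double rawLogit = Math.log(rawScore / (1 - rawScore));
        // subtracting from the logit = subtracting from the intercept
        return 1.0 / (1.0 + Math.exp(-(rawLogit - correction)));
    }

    public static void main(String[] args) {
        // balanced sample, true CTR 0.008: raw 0.5 should map to ~0.008
        System.out.println(calibrate(0.5, 0.008, 0.5)); // ~0.008
    }
}
```

This is exactly Pavel's Q3 symptom: the downsampled model averages ~0.5, and the correction shifts it back to the population base rate.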