Oops, hit enter too early...

Just wanted to say that those are the two ways I'm thinking of right now,
since I'm facing a similar challenge. I'm thankful for any suggestions or
comments.

Cheers,

Johannes


On Thu, Dec 27, 2012 at 3:13 PM, Johannes Schulte <
[email protected]> wrote:

> Hi Pavel,
>
> First of all, I would include an intercept term in the model. This learns
> the proportion of positive examples in the training set.
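>
> For example, a minimal sketch (assuming you build the feature vectors
> yourself and reserve element 0 for the intercept - the correction code
> further down relies on that; "learner", "gender" and "clicked" are
> placeholders):
>
>         // learner: an org.apache.mahout.classifier.sgd.OnlineLogisticRegression
>         // set up with 2 categories and 2 features (intercept + gender)
>         Vector v = new DenseVector(2);     // org.apache.mahout.math
>         v.set(0, 1.0);                     // intercept term, always 1
>         v.set(1, gender);                  // user's gender encoded as 0/1
>         learner.train(clicked ? 1 : 0, v); // target: clicked or not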
>
> Second, for getting "calibrated" probabilities out of the downsampled
> model, I can think of two ways:
>
> 1. Use another set of input data to measure the observed maximum
> likelihood CTR per score range. You can partition all the scores into
> equally sized bins (ordered by score) and then take another set of
> examples and measure a "per-bin CTR", i.e. the number of clicks in the
> bin divided by the bin size (see the sketch at the end of this mail).
> If this is too static, you can further look into something like the
> "Pool Adjacent Violators" algorithm, which produces a piecewise
> constant curve.
>
> 2. If you have included an intercept term, you can use one of the methods
> described in this paper to adjust the model intercept directly:
>
>
> http://dash.harvard.edu/bitstream/handle/1/4125045/relogit%20rare%20events.pdf?sequence=2
>
> This paper has been recommended here before.
>
> For the logistic regression in Mahout, and the "Prior Correction" in the
> paper, this would become something like this:
>
> // Prior correction: shift the intercept so the model reproduces the
> // true base rate instead of the rate in the downsampled training set.
> // Here ac = number of clicks and ai = number of examples actually fed
> // to the learner.
>
>         double trueCtr = 0.01;                // CTR in the real traffic
>         double sampleCtr = (double) ac / ai;  // CTR in the training sample
>
>         double correction = Math.log(((1 - trueCtr) / trueCtr)
>             * (sampleCtr / (1 - sampleCtr)));
>
>         // assumes element 0 of the feature vector is the intercept;
>         // apply once, after training
>         double betaBefore = learner.getBeta().get(0, 0);
>         learner.setBeta(0, 0, betaBefore - correction);
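>
> And a rough sketch of the per-bin CTR idea from 1. (plain Java; it
> assumes you have already scored a holdout set, sorted it ascending by
> score and kept the clicked flags aligned - class, method and variable
> names are just placeholders):
>
> import java.util.Arrays;
>
> class BinCalibration {
>
>   // Observed CTR per equally sized score bin. Also fills upperBounds
>   // (length numBins) with each bin's highest score so that new scores
>   // can be mapped back to a bin later.
>   static double[] binCtrs(double[] scores, boolean[] clicked,
>                           double[] upperBounds, int numBins) {
>     int n = scores.length;
>     double[] ctr = new double[numBins];
>     for (int b = 0; b < numBins; b++) {
>       int from = b * n / numBins;
>       int to = (b + 1) * n / numBins;
>       int clicks = 0;
>       for (int i = from; i < to; i++) {
>         if (clicked[i]) {
>           clicks++;
>         }
>       }
>       ctr[b] = (double) clicks / (to - from);
>       upperBounds[b] = scores[to - 1];
>     }
>     return ctr;
>   }
>
>   // Calibrated probability for a new score = observed CTR of the bin
>   // whose score range contains it.
>   static double calibrate(double score, double[] upperBounds, double[] ctr) {
>     int b = Arrays.binarySearch(upperBounds, score);
>     if (b < 0) {
>       b = -b - 1;   // not an exact match: use the insertion point
>     }
>     return ctr[Math.min(b, ctr.length - 1)];
>   }
> }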
>
>
>
>
> On Thu, Dec 27, 2012 at 2:09 PM, Abramov Pavel <[email protected]> wrote:
>
>> Hello,
>>
>> I'm trying to predict web-page click probability using Mahout. In my case
>> every click has its "cost", and the goal is to maximize the sum (just
>> like Google does in sponsored search, but my Ads are non-commercial,
>> they contain media, and I don't need to handle billions of Ads).
>>
>> That's why I want to predict the Click Through Rate for every user
>> impression. After CTR prediction I can sort my Ads by CTR*Cost and
>> "recommend" the top items.
>>
>> The idea is to train the model (one per Ad; I can't extract Ad features)
>> using historical data and user features (age, gender, interests, latent
>> factors, etc.).
>>
>> For example, there is an Ad shown 250 000 times and clicked 2 000 times.
>> Its average CTR is 0.008.
>>
>> It's time to create the learning set and train the model. I will start
>> with Logistic Regression (Mahout SGD) with 1 target variable (was clicked
>> or not) and 1 predictor (user's gender, 1/0).
>>
>> I will provide the model with the full negative (target variable = 0,
>> total examples = 250 000) and positive (target variable = 1, total
>> examples = 2 000) sets.
>>
>> Q1: Is this correct? Or should I downsample the negative set from 250 000
>> to 2 000? (Thanks to the Mahout in Action examples.)
>>
>> Q2: Is Logistic Regression a good idea? In the real world I have tens of
>> Ads, hundreds of user features and millions of impressions. Should I try
>> more complex algorithms such as Random Forests or Neural Nets (with a
>> sigmoid activation function)?
>>
>>
>> Ok, I trained my model with "trainlogistic". AUC = 0.60 (not bad for 1
>> predictor).
>> Q3: How do I convert the model output (I used "runlogistic") to a click
>> probability? My model always outputs ~0.5 on both sets (downsampled and
>> original), so the model average is 0.5 while the "real" CTR average is
>> 0.008.
>>
>> Thanks in advance!
>>
>> Pavel
>>
>
