Oops, hit enter too early... Just wanted to say that those are the two ways I can think of right now, since I ran into a similar challenge myself. I'm thankful for any suggestions or comments. I've also appended rough sketches of both calibration approaches (plus a quick sanity check of the prior correction) below the quoted thread.
Cheers,
Johannes

On Thu, Dec 27, 2012 at 3:13 PM, Johannes Schulte <[email protected]> wrote:

> Hi Pavel,
>
> First of all, I would include an intercept term in the model. This learns
> the proportion of positive examples in the training set.
>
> Second, for getting "calibrated" probabilities out of the downsampled
> model, I can think of two ways:
>
> 1. Use another set of input data to measure the observed maximum-likelihood
> CTR per score range. You can partition all the scores into equally sized
> bins (ordered by score) and then take another set of examples and measure a
> per-bin CTR, i.e. the number of clicks in the bin divided by the bin size.
> If this is too static, you can further look into "Pool Adjacent Violators"
> algorithms, which produce a piecewise-constant calibration curve.
>
> 2. If you have included an intercept term, you can use one of the methods
> described in this paper to adjust the model intercept directly:
>
> http://dash.harvard.edu/bitstream/handle/1/4125045/relogit%20rare%20events.pdf?sequence=2
>
> This paper has been recommended here before.
>
> For the logistic regression in Mahout and the "prior correction" from the
> paper, it would become something like this:
>
> // prior correction: shift the intercept by the log odds ratio between
> // the true CTR and the CTR observed in the downsampled training set
> double trueCtr = 0.01;                // CTR on the full data
> double sampleCtr = (double) ac / ai;  // ac = clicks, ai = total examples in the sample
> double correction = Math.log(((1 - trueCtr) / trueCtr)
>     * (sampleCtr / (1 - sampleCtr)));
> double betaBefore = learner.getBeta().get(0, 0);
> learner.setBeta(0, 0, betaBefore - correction);
>
> On Thu, Dec 27, 2012 at 2:09 PM, Abramov Pavel <[email protected]> wrote:
>
>> Hello,
>>
>> I'm trying to predict web-page click probability using Mahout. In my case
>> every click has its "cost" and the goal is to maximize the sum of click
>> costs (just like Google does in sponsored search, but my ads are
>> non-commercial, they contain media, and I don't need to handle billions
>> of ads).
>>
>> That's why I want to predict the click-through rate (CTR) for every user
>> impression. After CTR prediction I can sort my ads by CTR * cost and
>> "recommend" the top items.
>>
>> The idea is to train one model per ad (I can't extract ad features) using
>> historical data and user features (age, gender, interests, latent
>> factors etc.).
>>
>> For example, there is an ad shown 250,000 times and clicked 2,000 times.
>> Its average CTR is 0.008.
>>
>> It's time to create the training set and train the model. I will start
>> with logistic regression (Mahout SGD) with one target variable (clicked
>> or not) and one predictor (user's gender, 1/0).
>>
>> I will provide the model with the full negative set (target variable = 0,
>> 250,000 examples) and the full positive set (target variable = 1,
>> 2,000 examples).
>>
>> Q1: Is this correct? Or should I downsample the negative set from 250,000
>> to 2,000 (as in the Mahout in Action examples)?
>>
>> Q2: Is logistic regression a good idea? In the real world I have tens of
>> ads, hundreds of user features and millions of impressions. Should I try
>> more complex algorithms such as random forests or neural nets (with a
>> sigmoid activation function)?
>>
>> OK, I trained my model with "trainlogistic". AUC = 0.60 (not bad for one
>> predictor).
>>
>> Q3: How do I convert the model output (I used "runlogistic") to a click
>> probability? My model always outputs ~0.5 on both sets (downsampled and
>> original), so the model average is 0.5 while the "real" CTR average
>> is 0.008.
>>
>> Thanks in advance!
>>
>> Pavel
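P.S. Here is a rough sketch of the per-bin calibration from point 1. All the names below are hypothetical glue code, not Mahout API: you fit it on a holdout set using the scores from the downsampled model plus the true 0/1 click labels, and it then maps a raw score to the observed CTR of its bin.

import java.util.Arrays;
import java.util.Comparator;

public class BinnedCalibrator {

  private final double[] binUpperBounds; // highest score seen in each bin
  private final double[] binCtr;         // observed CTR inside each bin

  private BinnedCalibrator(double[] binUpperBounds, double[] binCtr) {
    this.binUpperBounds = binUpperBounds;
    this.binCtr = binCtr;
  }

  // scores: model outputs on a holdout set; clicked: matching 0/1 labels.
  // Assumes numBins <= scores.length.
  public static BinnedCalibrator fit(final double[] scores, int[] clicked,
                                     int numBins) {
    int n = scores.length;
    // sort example indices by ascending score
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) {
      order[i] = i;
    }
    Arrays.sort(order, new Comparator<Integer>() {
      @Override
      public int compare(Integer a, Integer b) {
        return Double.compare(scores[a], scores[b]);
      }
    });

    double[] bounds = new double[numBins];
    double[] ctr = new double[numBins];
    int binSize = n / numBins;
    for (int b = 0; b < numBins; b++) {
      int from = b * binSize;
      int to = (b == numBins - 1) ? n : from + binSize; // last bin takes the rest
      int clicks = 0;
      for (int i = from; i < to; i++) {
        clicks += clicked[order[i]];
      }
      ctr[b] = (double) clicks / (to - from); // clicks in the bin / bin size
      bounds[b] = scores[order[to - 1]];
    }
    return new BinnedCalibrator(bounds, ctr);
  }

  // Map a raw model score to the observed CTR of its bin.
  public double calibrate(double score) {
    for (int b = 0; b < binUpperBounds.length; b++) {
      if (score <= binUpperBounds[b]) {
        return binCtr[b];
      }
    }
    return binCtr[binCtr.length - 1];
  }
}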
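And a minimal Pool Adjacent Violators sketch, in case the static bins are too coarse. It assumes the 0/1 labels are already ordered by ascending model score and returns the non-decreasing, piecewise-constant calibrated probabilities, one per example (again just a sketch, not tested production code):

public class PavCalibration {

  // labelsByScore: 0/1 click labels, sorted by ascending model score.
  public static double[] fit(int[] labelsByScore) {
    int n = labelsByScore.length;
    double[] value = new double[n]; // pooled mean of each block
    int[] weight = new int[n];      // number of examples pooled into each block
    int blocks = 0;
    for (int i = 0; i < n; i++) {
      // start a new single-example block
      value[blocks] = labelsByScore[i];
      weight[blocks] = 1;
      blocks++;
      // pool adjacent blocks while they violate monotonicity
      while (blocks > 1 && value[blocks - 2] > value[blocks - 1]) {
        int w = weight[blocks - 2] + weight[blocks - 1];
        value[blocks - 2] = (weight[blocks - 2] * value[blocks - 2]
            + weight[blocks - 1] * value[blocks - 1]) / w;
        weight[blocks - 2] = w;
        blocks--;
      }
    }
    // expand the piecewise-constant blocks back to one value per example
    double[] result = new double[n];
    int idx = 0;
    for (int b = 0; b < blocks; b++) {
      for (int j = 0; j < weight[b]; j++) {
        result[idx++] = value[b];
      }
    }
    return result;
  }
}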
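Finally, a quick sanity check of the prior correction with the numbers from Pavel's mail, assuming (my assumption, it isn't stated there) that the 250,000 negatives get downsampled 1:1 against the 2,000 clicks:

public class PriorCorrectionCheck {
  public static void main(String[] args) {
    double trueCtr = 0.008;             // CTR observed on the full data
    double sampleCtr = 2000.0 / 4000.0; // CTR in the 1:1 downsampled set
    double correction = Math.log(((1 - trueCtr) / trueCtr)
        * (sampleCtr / (1 - sampleCtr)));
    System.out.println(correction);     // prints ~4.82
  }
}

Subtracting those ~4.82 from the intercept is exactly what pulls the average prediction from the ~0.5 you are seeing back down towards 0.008.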
