Hi Pavel, first of all I would include an intercept term in the model. It learns the base rate, i.e. the proportion of positive examples in the training set.
Second, to get "calibrated" probabilities out of the downsampled model, I can think of two ways:

1. Use a held-out set of input data to measure the observed CTR per score range. Partition all the scores into equally sized bins (ordered by score), then take another set of examples and measure a "per-bin CTR": the number of clicks in the bin divided by the bin size. If this is too static, you can further look into something like the "Pool Adjacent Violators" algorithm, which produces a piecewise-constant calibration curve.

2. If you have included an intercept term, you can use one of the methods described in this paper to adjust the model intercept directly:
http://dash.harvard.edu/bitstream/handle/1/4125045/relogit%20rare%20events.pdf?sequence=2
This paper has been recommended here before. For the logistic regression in Mahout and the "Prior Correction" in the paper, it would become something like this:

    // prior correction of the intercept
    double trueCtr = 0.01;                // true CTR in the population
    double sampleCtr = (double) ac / ai;  // observed CTR in the downsampled training set
    double correction = Math.log(((1 - trueCtr) / trueCtr) * (sampleCtr / (1 - sampleCtr)));
    double betaBefore = learner.getBeta().get(0, 0);
    learner.setBeta(0, 0, betaBefore - correction);

On Thu, Dec 27, 2012 at 2:09 PM, Abramov Pavel <[email protected]> wrote:

> Hello,
>
> I'm trying to predict web-page click probability using Mahout. In my case
> every click has its "cost" and the goal is to maximize the sum (just like
> Google does in sponsored search, but my ads are non-commercial, they
> contain media, and I don't need to handle billions of ads).
>
> That's why I want to predict the click-through rate (CTR) for every user
> impression. After CTR prediction I can sort my ads by CTR*Cost and
> "recommend" the top items.
>
> The idea is to train the model (one per ad, I can't extract ad features)
> using historical data and user features (age, gender, interests, latent
> factors etc.).
>
> For example, there is an ad shown 250 000 times and clicked 2 000 times.
> Its average CTR is 0.008.
>
> It's time to create the learning set and train the model. I will start
> with logistic regression (Mahout SGD) with 1 target variable (was clicked
> or not) and 1 predictor (user's gender 1/0).
>
> I will provide the model with the full negative set (target variable = 0,
> total examples = 250 000) and the full positive set (target variable = 1,
> total examples = 2 000).
>
> Q1: Is this correct? Or should I downsample the negative set from 250 000
> to 2 000? (thanks to the Mahout in Action examples)
>
> Q2: Is logistic regression a good idea? In the real world I have tens of
> ads, hundreds of user features and millions of impressions. Should I try
> more complex algorithms such as random forests or neural nets (with a
> sigmoid activation function)?
>
> OK, I trained my model with "trainlogistic". AUC = 0.60 (not bad for 1
> predictor).
>
> Q3: How do I convert the model output (I used "runlogistic") to a click
> probability? My model always outputs ~0.5 on both sets (downsampled and
> original), so the model average is 0.5 while the "real" CTR average is
> 0.008.
>
> Thanks in advance!
>
> Pavel
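To make the binning idea in point 1 above concrete, here is a minimal sketch in plain Java (no Mahout dependencies; the class and method names are just for illustration). It assumes a held-out set already sorted ascending by model score:

```java
// Equal-frequency binning calibration (point 1): split a held-out set
// into bins of equal size by score, record the observed CTR per bin,
// and use that CTR as the calibrated probability for new scores.
public class BinCalibration {
    final double[] upperBounds; // upper score bound of each bin
    final double[] binCtr;      // observed CTR per bin

    // scores/clicks come from a held-out set, sorted ascending by score
    BinCalibration(double[] scores, boolean[] clicks, int numBins) {
        upperBounds = new double[numBins];
        binCtr = new double[numBins];
        int n = scores.length;
        for (int b = 0; b < numBins; b++) {
            int start = b * n / numBins;
            int end = (b + 1) * n / numBins;
            int clickCount = 0;
            for (int i = start; i < end; i++) {
                if (clicks[i]) clickCount++;
            }
            upperBounds[b] = scores[end - 1];
            // "per-bin CTR" = clicks in the bin / bin size
            binCtr[b] = (double) clickCount / (end - start);
        }
    }

    // look up the calibrated CTR for a new raw model score
    double calibrate(double score) {
        for (int b = 0; b < upperBounds.length; b++) {
            if (score <= upperBounds[b]) return binCtr[b];
        }
        return binCtr[binCtr.length - 1];
    }

    public static void main(String[] args) {
        double[] scores = {0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9};
        boolean[] clicks = {false, false, false, true, false, true, true, true};
        BinCalibration cal = new BinCalibration(scores, clicks, 2);
        System.out.println(cal.calibrate(0.35)); // low-score bin: observed CTR 0.25
    }
}
```

With many bins this gets noisy; that is where the Pool Adjacent Violators step comes in, merging neighboring bins until the calibration curve is monotone.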

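And to sanity-check the prior correction in point 2: with a balanced downsampled training set (sampleCtr = 0.5) and a true CTR of 0.008, a model that scores an example at 0.5 should calibrate back to exactly 0.008. A small standalone check (a hypothetical helper, not part of the Mahout API; it applies the correction to a raw score instead of to the model's intercept, which is equivalent):

```java
// Verifies the "Prior Correction" arithmetic from point 2 on a raw score.
public class PriorCorrectionCheck {
    // maps a raw score from the downsampled model to a calibrated probability
    static double calibrate(double rawScore, double trueCtr, double sampleCtr) {
        double correction = Math.log(((1 - trueCtr) / trueCtr)
                * (sampleCtr / (1 - sampleCtr)));
        double rawLogit = Math.log(rawScore / (1 - rawScore));
        // subtracting from the logit = subtracting from the intercept
        return 1.0 / (1.0 + Math.exp(-(rawLogit - correction)));
    }

    public static void main(String[] args) {
        // balanced sample, true CTR 0.008: raw 0.5 should map to ~0.008
        System.out.println(calibrate(0.5, 0.008, 0.5)); // ~0.008
    }
}
```

This is exactly Pavel's Q3 symptom: the downsampled model averages ~0.5, and the correction shifts it back to the population base rate.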