This paper is probably of interest for this problem: http://research.microsoft.com/apps/pubs/default.aspx?id=122779
On Thu, Dec 27, 2012 at 6:14 AM, Johannes Schulte <[email protected]> wrote:

> Oops, hit enter too early...
>
> Just wanted to say that those are the two ways I'm thinking of right now,
> since I've got a similar challenge. I'm thankful for any suggestions or
> comments.
>
> Cheers,
>
> Johannes
>
>
> On Thu, Dec 27, 2012 at 3:13 PM, Johannes Schulte
> <[email protected]> wrote:
>
> > Hi Pavel,
> >
> > First of all, I would include an intercept term in the model. This learns
> > the proportion of examples in the training set.
> >
> > Second, for getting "calibrated" probabilities out of the downsampled
> > model, I can think of two ways:
> >
> > 1. Use another set of input data to measure the observed maximum-likelihood
> > CTR per score range. You can partition all the scores into equally sized
> > bins (ordered by score), then take another set of training examples and
> > measure a "per-bin CTR", i.e. the number of clicks in the bin divided by
> > the bin size. If this is too static, you can look further into something
> > like "Pool Adjacent Violators" algorithms, which produce a piecewise
> > constant curve. [A sketch of this binning step follows after the quoted
> > thread.]
> >
> > 2. If you have included an intercept term, you can use one of the methods
> > described in this paper to adjust the model intercept directly:
> >
> > http://dash.harvard.edu/bitstream/handle/1/4125045/relogit%20rare%20events.pdf?sequence=2
> >
> > This paper has been recommended here before.
> >
> > For the logistic regression in Mahout and the "Prior Correction" in the
> > paper, this would become something like the following [a self-contained
> > version is sketched after the quoted thread]:
> >
> > // correction
> > double trueCtr = 0.01;
> > double sampleCtr = (double) ac / ai;
> > double correction = Math.log(((1 - trueCtr) / trueCtr)
> >     * (sampleCtr / (1 - sampleCtr)));
> > double betaBefore = learner.getBeta().get(0, 0);
> > learner.setBeta(0, 0, betaBefore - correction);
> >
> >
> > On Thu, Dec 27, 2012 at 2:09 PM, Abramov Pavel <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> I'm trying to predict web-page click probability using Mahout. In my case
> >> every click has its "cost" and the goal is to maximize the sum (just like
> >> Google does in sponsored search, but my ads are non-commercial, they
> >> contain media, and I don't need to handle billions of ads).
> >>
> >> That's why I want to predict the click-through rate (CTR) for every user
> >> impression. After CTR prediction I can sort my ads by CTR * cost and
> >> "recommend" the top items.
> >>
> >> The idea is to train the model (one per ad; I can't extract ad features)
> >> using historical data and user features (age, gender, interests, latent
> >> factors, etc.).
> >>
> >> For example, there is an ad shown 250 000 times and clicked 2 000 times.
> >> Its average CTR is 0.008.
> >>
> >> It's time to create the learning set and train the model. I will start
> >> with logistic regression (Mahout SGD) with one target variable (clicked
> >> or not) and one predictor (user's gender, 1/0).
> >>
> >> I will provide the model with the full negative set (target variable = 0,
> >> 250 000 examples) and positive set (target variable = 1, 2 000 examples).
> >>
> >> Q1: Is this correct? Or should I downsample the negative set from 250 000
> >> to 2 000 (as in the Mahout in Action examples)?
> >>
> >> Q2: Is logistic regression a good idea? In the real world I have tens of
> >> ads, hundreds of user features and millions of impressions. Should I try
> >> more complex algorithms such as random forests or neural nets (with a
> >> sigmoid activation function)?
> >>
> >> OK, I trained my model with "trainlogistic". AUC = 0.60 (not bad for one
> >> predictor).
> >>
> >> Q3: How do I convert the model output (I used "runlogistic") to a click
> >> probability? My model always outputs ~0.5 on both sets (downsampled and
> >> original), so the model average is 0.5 while the "real" average CTR is
> >> 0.008.
> >>
> >> Thanks in advance!
> >>
> >> Pavel
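
To make the prior-correction snippet from the quoted reply runnable in isolation, here is a minimal sketch against Mahout's SGD OnlineLogisticRegression. The names trueCtr, ac (clicks in the sampled training data) and ai (impressions in the sampled training data), plus the assumption that feature 0 is a constant intercept term, are illustrative choices, not part of the original message:

import org.apache.mahout.classifier.sgd.L2;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;

public class PriorCorrection {

  /**
   * Shift the intercept of a trained model so that its average prediction
   * reflects the true base rate instead of the (downsampled) sample rate.
   * Assumes feature 0 of the model is a constant intercept term.
   */
  public static void correctIntercept(OnlineLogisticRegression learner,
                                      double trueCtr, long ac, long ai) {
    // CTR observed in the sampled training data
    double sampleCtr = (double) ac / ai;
    // log-odds offset between sample rate and true rate ("prior correction")
    double correction = Math.log(((1 - trueCtr) / trueCtr)
        * (sampleCtr / (1 - sampleCtr)));
    double betaBefore = learner.getBeta().get(0, 0);
    learner.setBeta(0, 0, betaBefore - correction);
  }

  public static void main(String[] args) {
    // two features: index 0 = intercept, index 1 = gender (example setup)
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(2, 2, new L2(1.0));
    // ... train on the downsampled data, then correct the intercept:
    correctIntercept(learner, 0.008, 2000, 4000);
  }
}

With Pavel's numbers and a balanced downsample (2 000 clicks, 2 000 non-clicks), sampleCtr is 0.5 and the call shifts the intercept down by log(0.992 / 0.008) ≈ 4.8 on the log-odds scale.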

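A similarly hedged sketch of option 1 from the quoted reply (equal-frequency binning of held-out scores). The parallel arrays of held-out scores and click labels are assumed inputs; nothing here is Mahout API, just illustrative glue code:

import java.util.Arrays;
import java.util.Comparator;

/**
 * Equal-frequency binning calibration: map a raw model score to the CTR
 * observed for similar scores on a held-out set. Assumes scores and clicked
 * are parallel arrays and scores.length >= numBins.
 */
public class BinCalibration {

  private final double[] binUpperScore; // upper score edge of each bin
  private final double[] binCtr;        // observed CTR within each bin

  public BinCalibration(double[] scores, boolean[] clicked, int numBins) {
    // sort indices by ascending score
    Integer[] order = new Integer[scores.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    Arrays.sort(order, Comparator.comparingDouble(i -> scores[i]));

    binUpperScore = new double[numBins];
    binCtr = new double[numBins];
    int binSize = scores.length / numBins;
    for (int b = 0; b < numBins; b++) {
      int from = b * binSize;
      int to = (b == numBins - 1) ? scores.length : from + binSize;
      int clicks = 0;
      for (int i = from; i < to; i++) {
        if (clicked[order[i]]) {
          clicks++;
        }
      }
      binUpperScore[b] = scores[order[to - 1]];
      binCtr[b] = (double) clicks / (to - from); // clicks in bin / bin size
    }
  }

  /** Return the calibrated CTR for a new raw score. */
  public double calibrate(double score) {
    for (int b = 0; b < binUpperScore.length; b++) {
      if (score <= binUpperScore[b]) {
        return binCtr[b];
      }
    }
    return binCtr[binCtr.length - 1];
  }
}

Usage would be along the lines of new BinCalibration(heldOutScores, heldOutClicks, 20).calibrate(rawScore). A Pool-Adjacent-Violators (isotonic) fit would replace the fixed bins with merged, monotone ones, but the lookup idea is the same.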