There are lots of problems with the problem as posed. I am not surprised by the poor results.
You should not downsample negative examples so severely. I would keep as many as 10-30x the number of positive examples you have. Even then, I suspect you don't have enough data, especially if you have already included data for all of your models.

Your Feature A is not useful unless you are putting all ad results together. Even then, you need to include more advertiser-, campaign- and ad-specific features. A feature vector size of 10,000 is actually relatively small if you have any reasonable degree of sparsity in your user and ad features. Unused features do not hurt learning.

Finally, you should be combining a group ranking objective with the regression objective. Otherwise, your model will simply be learning which users are likely to click on anything and which users will never click on anything. There are provisions for segmented AUC in the code, but that will only work for binary targets.

In general, it is common to build cascaded models to deal with this. The first model learns to predict click, and the cascaded model learns conversion conditional on click.

Most importantly, I would recommend that you experiment with model design using a system like R so that you can get fast turn-around on modeling efforts.

On Mon, Jul 11, 2011 at 3:04 PM, Weihua Zhu <[email protected]> wrote:

> Hi, thanks Ted.
> I understand that the training dataset size is small. The reason is that we
> have a very limited number of "action" class events/instances. We also want
> to make each target class have an equal number of events/instances.
> Feature A is the advertisement campaign ID, and Feature B is the set of
> behaviors the internet user has, for example gender:male, country:us, etc.
> I set the size of the encoder to 10000, which is very large.
> I used this setup for OnlineLogisticRegression:
>
> olr = new OnlineLogisticRegression(3, FEATURES, new L1());
> olr.alpha(1).stepOffset(1000).lambda(3e-5).learningRate(3);
>
> Thanks.
>
> -wz
>
>
> On Jul 11, 2011, at 2:49 PM, Ted Dunning wrote:
>
> > This is a tiny amount of data. The regularization in Mahout's SGD
> > implementation is probably not as effective as second-order techniques
> > for such tiny data.
> >
> > Btw... you didn't answer my questions about what kind of data features A
> > and B are. I understand that you might be shy about this, but without
> > that kind of information, I can't help you.
> >
> > (and add this additional question)
> >
> > What is the size of the encoded vector?
> >
> > On Mon, Jul 11, 2011 at 2:26 PM, Weihua Zhu <[email protected]> wrote:
> >
> >> The target class is whether a user clicks an ad (advertisement), buys
> >> through an ad, or neither; so 3 classes.
> >> Feature A is about the advertisement itself;
> >> Feature B is about the user's behaviors;
> >> Currently I'm only using features A and B.
> >> Total training data is 250 instances for each class.
> >>
> >> Thanks..
> >>
> >> ________________________________________
> >> From: Ted Dunning [[email protected]]
> >> Sent: Monday, July 11, 2011 2:15 PM
> >> To: [email protected]
> >> Subject: Re: combination of features worsen the performance
> >>
> >> Can you say a little bit about the data?
> >>
> >> What are features A and B? What kind of data do they represent?
> >>
> >> How many other features are there?
> >>
> >> What is the target variable? How many possible values does it have?
> >>
> >> How much training data do you have?
> >>
> >> What sort of training are you doing?
> >>
> >> On Mon, Jul 11, 2011 at 2:08 PM, Weihua Zhu <[email protected]> wrote:
> >>
> >>> Hi, dear all,
> >>>
> >>> I am using Mahout logistic regression for classification;
> >>> interestingly, features A and B individually each have satisfactory
> >>> performance, say 65% and 80%, but when I combine them together (using
> >>> an encoder), the performance is about 72%. Shouldn't the performance
> >>> be better? Any thoughts? Thanks a lot,
> >>>
> >>> -wz.
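To make the cascaded-model suggestion above concrete: the first model predicts P(click), the second predicts P(conversion | click), and their product gives P(conversion). This is a minimal sketch in plain Java with no Mahout dependency; the weights and feature values are invented purely for illustration, and a real system would train each stage separately (for example with OnlineLogisticRegression).

```java
// Sketch of a two-stage cascade: P(conversion) = P(click) * P(conversion | click).
// All numbers are made up for illustration; each stage would normally be a
// trained logistic model.
public class CascadeSketch {
    // Plain logistic function: 1 / (1 + e^-x).
    public static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Logistic score: sigmoid of the dot product of weights and features.
    public static double score(double[] w, double[] features) {
        double dot = 0.0;
        for (int i = 0; i < w.length; i++) {
            dot += w[i] * features[i];
        }
        return sigmoid(dot);
    }

    public static void main(String[] args) {
        double[] features     = {1.0, 0.5, -0.3};  // hypothetical user + ad features
        double[] clickWeights = {0.2, 1.1, 0.4};   // stage 1: models P(click)
        double[] convWeights  = {-0.5, 0.8, 1.3};  // stage 2: models P(conversion | click)

        double pClick = score(clickWeights, features);
        double pConvGivenClick = score(convWeights, features);
        // Chain rule: the cascade's overall conversion estimate.
        double pConversion = pClick * pConvGivenClick;

        System.out.printf("P(click)=%.4f  P(conv|click)=%.4f  P(conv)=%.4f%n",
                pClick, pConvGivenClick, pConversion);
    }
}
```

The point of the cascade is that each stage sees a better-balanced problem: clicks are rare among impressions, but conversions are much less rare among clicks.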
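On the downsampling point: if you do keep only a fraction of the negatives, the trained model's probabilities are inflated and should be calibrated back at serving time. A standard correction multiplies the predicted odds by the negative sampling rate; this sketch (stdlib Java, illustrative numbers only) shows that correction.

```java
// Calibration after negative downsampling: if only a fraction r of negatives
// was kept at training time, the model's odds are inflated by 1/r, so we
// multiply the predicted odds by r to recover a probability on the original
// distribution.
public class DownsampleCorrection {
    public static double calibrate(double pSampled, double r) {
        double odds = pSampled / (1.0 - pSampled); // odds in the sampled space
        double trueOdds = odds * r;                // undo the 1/r inflation
        return trueOdds / (1.0 + trueOdds);
    }

    public static void main(String[] args) {
        double r = 0.1;        // kept 1 in 10 negative examples
        double pSampled = 0.5; // model's prediction on the downsampled distribution
        // With r = 0.1 and pSampled = 0.5 the calibrated value is 1/11 ~ 0.0909.
        System.out.printf("calibrated p = %.4f%n", calibrate(pSampled, r));
    }
}
```

Note this correction assumes negatives were dropped uniformly at random; any targeted filtering of negatives would bias the model in ways a single odds scaling cannot undo.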
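On the encoder size: a hashed encoder maps sparse categorical features such as "gender:male", "country:us" or a campaign ID into slots of a fixed-size vector, so 10,000 slots is modest when only a handful of features are active per example. This stand-in sketch uses String.hashCode() for simplicity; Mahout's feature vector encoders use their own hashing, so treat this only as an illustration of the idea.

```java
import java.util.HashMap;
import java.util.Map;

// Toy hashed feature encoder: each string feature is hashed into one of
// `cardinality` slots. Only active slots are stored, so the vector is
// sparse regardless of its nominal size.
public class HashedEncoderSketch {
    public static int slot(String feature, int cardinality) {
        // Mask off the sign bit before taking the remainder so the slot
        // index is always non-negative.
        return (feature.hashCode() & Integer.MAX_VALUE) % cardinality;
    }

    public static Map<Integer, Double> encode(String[] features, int cardinality) {
        Map<Integer, Double> vector = new HashMap<>();
        for (String f : features) {
            vector.merge(slot(f, cardinality), 1.0, Double::sum); // sum on collision
        }
        return vector;
    }

    public static void main(String[] args) {
        String[] example = {"gender:male", "country:us", "campaign:12345"};
        // Three active features out of 10,000 slots: the unused slots
        // cost nothing, which is why unused features do not hurt learning.
        System.out.println(encode(example, 10000));
    }
}
```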
