There are several problems with the problem as posed, so I am not surprised
by the poor results.

You should not downsample negative examples so severely.  I would keep as
many as 10-30x as many negative examples as you have positive ones.  Even
then, I suspect you don't have enough data, especially if you have already
included data for all of your models.
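As a rough sketch of that kind of downsampling (the record format, the
class name, and the 10:1 ratio below are illustrative assumptions, not
anything from your pipeline):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class Downsample {
    /**
     * Return the indices of examples to keep: all positives, plus a random
     * sample of negatives capped at ratio * (number of positives).
     * The ratio is the knob; 10-30 negatives per positive is the range
     * suggested above.
     */
    static List<Integer> sampleIndices(boolean[] isPositive, int ratio, long seed) {
        List<Integer> positives = new ArrayList<>();
        List<Integer> negatives = new ArrayList<>();
        for (int i = 0; i < isPositive.length; i++) {
            (isPositive[i] ? positives : negatives).add(i);
        }
        // Shuffle so the kept negatives are a random sample, not a prefix.
        Collections.shuffle(negatives, new Random(seed));
        int keep = Math.min(negatives.size(), ratio * positives.size());
        List<Integer> result = new ArrayList<>(positives);
        result.addAll(negatives.subList(0, keep));
        return result;
    }

    public static void main(String[] args) {
        boolean[] labels = new boolean[1000];
        for (int i = 0; i < 5; i++) labels[i] = true;  // 5 positives, 995 negatives
        List<Integer> kept = sampleIndices(labels, 10, 42L);
        System.out.println(kept.size());  // 5 positives + 50 negatives = 55
    }
}
```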

Your Feature A is not useful unless you are pooling results across all ads.
  Even then, you need to include more advertiser-, campaign- and ad-specific
features.

The feature vector size of 10,000 is actually relatively small if you have
any reasonable degree of sparsity in your user and ad features.  Unused
features do not hurt learning.
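To make the sparsity point concrete, a hashed encoding in the spirit of
Mahout's feature encoders can be sketched in plain Java (the hash scheme and
feature strings here are illustrative assumptions; only the dimension of
10,000 comes from your setup):

```java
import java.util.HashMap;
import java.util.Map;

public class HashedEncoder {
    // Hash each "name:value" feature string into one of `dim` slots and
    // accumulate a weight there.  Occasional collisions are tolerated; with
    // sparse data almost all of the 10,000 slots stay at zero, and those
    // unused slots cost nothing during learning.
    static Map<Integer, Double> encode(String[] features, int dim) {
        Map<Integer, Double> vector = new HashMap<>();
        for (String f : features) {
            int slot = Math.floorMod(f.hashCode(), dim);
            vector.merge(slot, 1.0, Double::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        String[] features = {"gender:male", "country:us", "campaign:1234"};
        Map<Integer, Double> v = encode(features, 10_000);
        System.out.println(v.size());  // at most 3 non-zero slots out of 10,000
    }
}
```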

Finally, you should be combining a group ranking objective with the
regression objectives.  Otherwise, your model will simply learn which users
are likely to click on anything and which users will never click on
anything.  There are provisions for segmented AUC in the code, but that
will only work for binary targets.  In general, it is common to build
cascaded models to deal with this.  The first model learns to predict click
and the cascaded model learns conversion conditional on click.
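The shape of the cascade is simple.  In the sketch below the two models are
stand-in stubs returning made-up constants (in Mahout each would be its own
binary OnlineLogisticRegression, with the conversion model trained only on
examples that clicked); the point is just how the two probabilities compose:

```java
public class Cascade {
    // Stand-ins for two independently trained binary models.  The constants
    // are invented purely for illustration; real models would score the
    // feature vector.
    static double pClick(double[] features)         { return 0.10; }
    static double pBuyGivenClick(double[] features) { return 0.25; }

    // Because the second model is conditioned on click (trained only on
    // clicked examples), the unconditional conversion probability is the
    // product of the two scores.
    static double pBuy(double[] features) {
        return pClick(features) * pBuyGivenClick(features);
    }

    public static void main(String[] args) {
        double[] x = new double[0];   // features are irrelevant to the stubs
        System.out.println(pBuy(x));  // 0.10 * 0.25 = 0.025
    }
}
```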

Most importantly, I would recommend that you experiment with model design
using a system like R so that you can get fast turn-around on modeling
efforts.

On Mon, Jul 11, 2011 at 3:04 PM, Weihua Zhu <[email protected]> wrote:

> Hi, thanks Ted.
> I understand that the training dataset size is small. The reason is that we
> have a very limited number of "action" class events/instances.  We also want
> each target class to have an equal number of events/instances.
> Feature A is the advertisement campaign ID, and Feature B is the behaviors
> that the internet user has, for example, gender:male, country:us, etc.
> I set the size of the encoder to 10000, which is very large.
> I used this setup for OnlineLogisticRegression:
>        olr = new OnlineLogisticRegression(3, FEATURES, new L1());
>        olr.alpha(1).stepOffset(1000).lambda(3e-5).learningRate(3);
>
> Thanks.
>
> -wz
>
>
> On Jul 11, 2011, at 2:49 PM, Ted Dunning wrote:
>
> > This is a tiny amount of data.  The regularization in Mahout's SGD
> > implementation is probably not as effective as second order techniques
> > for such tiny data.
> >
> > Btw... you didn't answer my questions about what kind of data features A
> > and B are.  I understand that you might be shy about this, but without
> > that kind of information, I can't help you.
> >
> > (and add this additional question)
> >
> > What is the size of the encoded vector?
> >
> > On Mon, Jul 11, 2011 at 2:26 PM, Weihua Zhu <[email protected]> wrote:
> >
> >> The target class is whether a user clicks an ad (advertisement), buys
> >> through an ad, or does neither; so 3 classes.
> >> Feature A is about the advertisement itself;
> >> Feature B is about the user's behaviors;
> >> Currently I'm only using features A and B.
> >> The total training data is 250 examples per class;
> >>
> >> thanks..
> >>
> >>
> >> ________________________________________
> >> From: Ted Dunning [[email protected]]
> >> Sent: Monday, July 11, 2011 2:15 PM
> >> To: [email protected]
> >> Subject: Re: combination of features worsen the performance
> >>
> >> Can you say a little bit about the data?
> >>
> >> What are features A and B?  What kind of data do they represent?
> >>
> >> How many other features are there?
> >>
> >> What is the target variable?  How many possible values does it have?
> >>
> >> How much training data do you have?
> >>
> >> What sort of training are you doing?
> >>
> >>
> >>
> >> On Mon, Jul 11, 2011 at 2:08 PM, Weihua Zhu <[email protected]> wrote:
> >>
> >>> Hi, Dear all,
> >>>
> >>> I am using Mahout logistic regression for classification;
> >>> interestingly, features A and B individually each have satisfactory
> >>> performance, say 65% and 80%, but when I combine them together (using
> >>> an encoder), the performance is about 72%.  Shouldn't the performance
> >>> be better?  Any thoughts?  Thanks a lot,
> >>>
> >>>
> >>> -wz.
> >>>
> >>
>
>