Yes, I suspected this kind of problem. My concern is that I will not always have hundreds of training examples for new categories.
This is why we wanted to improve our model with manually selected keywords.

Maybe it would be better to split this into two phases and use the keyword extraction phase only for categories with little historical data.
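
A minimal sketch of that two-phase idea, assuming "poor history" simply means few training examples for a category. The class name, the threshold of 100 and the category names are all hypothetical, chosen only for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class CategoryRouter {
    // Hypothetical cut-off: below this many training examples,
    // a category is routed to the keyword phase.
    static final int MIN_EXAMPLES = 100;

    // True when the category has too little history to rely on the trained model alone.
    static boolean needsKeywords(Map<String, Integer> trainingCounts, String category) {
        return trainingCounts.getOrDefault(category, 0) < MIN_EXAMPLES;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("sports", 2500);    // plenty of history
        counts.put("gardening", 12);   // recently added category

        for (String category : new String[] {"sports", "gardening", "brand-new"}) {
            String phase = needsKeywords(counts, category) ? "keyword phase" : "trained model only";
            System.out.println(category + " -> " + phase);
        }
    }
}
```

A category that is missing from the counts is treated as having zero examples, so brand-new categories fall into the keyword phase by default.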

Thanks for your help, it's becoming clearer to me now :)

On 22.09.2011 21:04, Ted Dunning wrote:
If these keywords already appear in these other fields, I think that you
should just let the algorithm find them.  I think that your problem is
insufficient training data or insufficient number of passes on the data.

In general, you should not do anything to your training data that you will
not later do to your production data.
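
One way to read that rule, as a rough sketch: build the feature vector through a single code path that sees only the fields available in production (title and body), and call it for both training and live documents. This is plain Java with a toy hashing scheme, not Mahout's encoder API; the class, the method names and the vector size are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class SharedEncoding {
    static final int SIZE = 1000; // hypothetical vector size

    // Toy stand-in for a hashed encoder: adds `weight` at a slot derived from name and word.
    static void add(String name, String word, double weight, double[] vector) {
        int slot = Math.floorMod((name + ":" + word).hashCode(), vector.length);
        vector[slot] += weight;
    }

    // Single encoding path, used unchanged for training and for production documents.
    // A keyword feature fires only when the keyword really occurs in the title or body.
    static double[] encode(String title, String body, List<String> keywords) {
        double[] vector = new double[SIZE];
        String text = (title + " " + body).toLowerCase();
        for (String word : text.split("\\W+")) {
            if (word.isEmpty()) {
                continue;
            }
            add("text", word, 1.0, vector);
            if (keywords.contains(word)) {
                add("keyword", word, 1.0, vector);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("mahout", "classifier");
        double[] training = encode("Mahout question", "training a classifier", keywords);
        double[] production = encode("Mahout question", "training a classifier", keywords);
        // Same input and same code path give the same vector.
        System.out.println(Arrays.equals(training, production));
    }
}
```

Because the keyword feature is computed from the text itself, nothing is added at training time that could not also be computed for a production document.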

On Thu, Sep 22, 2011 at 10:04 AM, Loic Descotte <[email protected]> wrote:

  Have you seen the Mahout book?

Yes I've bought your (very good) book in early access preview. It helps me
a lot in my investigations.

?! If a feature is not found in the production data, then you should not
give it to the model as a predictor during training. Otherwise, you have a
form of target leak.

I think I didn't explain myself very well, sorry.

I mean that there is no attribute named 'keyword' in my test or train data.
But all the keywords I put in when I create my new vector appear in the other
attributes of my data (body and title).

I've selected them because I know they will occur very often in body and
title.

I was just worried about creating a "fake" attribute name (keyword), like
this:

for (String keyword : keywords) {
  predictorEncoders.get(99).addToVector(keyword, 700, featureVector);
}

(Predictor 99 is a new predictor I've created just for these keywords.)

But it seems to work (with big weights). The keywords seem to be found in
the other attributes, because when I do this my results get better in the
confusion matrix.

So is it OK to do it like this, or is it still a dirty hack?

Loic




