Re: Classification + glossary usage

Loic Descotte Thu, 22 Sep 2011 10:10:05 -0700

>  Have you seen the Mahout book?


Yes, I bought your (very good) book in the early access preview. It is helping 
me a lot in my investigations.

> ?! If a feature is not found in the production data, then you should not give 
> it to the model as a predictor during training. Otherwise, you have a form of 
> target leak.

I think I didn't explain myself very well, sorry.

I mean that there is no attribute named 'keyword' in my test or training data.
But all the keywords I add when I create my new vector appear in the other 
attributes of my data (body and title).

I've selected them because I know they will occur very often in body and title.

I was just worried about creating a "fake" attribute name (keyword), like this:

// Encode each hand-picked keyword with the dedicated encoder and a large weight
for (String keyword : keywords) {
    predictorEncoders.get(99).addToVector(keyword, 700, featureVector);
}

(Predictor 99 is a new predictor I created just for these keywords.)

But it seems to work (with big weights): the keywords seem to be found in the 
other attributes, because when I do this my results in the confusion matrix 
get better.
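
To make the setup concrete, here is a minimal, self-contained sketch of a 
name-scoped hashed encoder. Everything in it (the class, the slot and 
addToVector helpers, the hashing scheme, the vector size of 1000) is an 
illustrative assumption and not Mahout's implementation. It assumes the 
encoder mixes its own name into the hash, in which case the same word encoded 
under "body" and under "keyword" is hashed from two different strings and 
will usually land in different slots. Whether the real encoders behave this 
way is worth checking before relying on the keywords "matching" body and 
title terms.

```java
public class KeywordEncoderSketch {
    // Hypothetical stand-in for a hashed encoder: the slot depends on the
    // encoder name as well as on the term itself.
    static int slot(String encoderName, String term, int size) {
        return Math.floorMod((encoderName + ":" + term).hashCode(), size);
    }

    // Adds the given weight to the slot chosen for (encoderName, term).
    static void addToVector(String encoderName, String term,
                            double weight, double[] vector) {
        vector[slot(encoderName, term, vector.length)] += weight;
    }

    public static void main(String[] args) {
        double[] featureVector = new double[1000]; // assumed vector size

        // A word as it would be encoded from the body attribute.
        addToVector("body", "mahout", 1.0, featureVector);
        // The same word added through the extra "keyword" encoder.
        addToVector("keyword", "mahout", 700.0, featureVector);

        System.out.println("body slot:    " + slot("body", "mahout", 1000));
        System.out.println("keyword slot: " + slot("keyword", "mahout", 1000));
    }
}
```

Printing the two slots shows directly whether the keyword weight ends up on 
the same feature as the body term or on a separate one.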

So is it OK to do it like this, or is it still a dirty hack?

Loic