Re: Classification + glossary usage

Ted Dunning Thu, 22 Sep 2011 12:04:51 -0700

If these keywords already appear in these other fields, I think that you
should just let the algorithm find them.  I think that your problem is
insufficient training data or insufficient number of passes on the data.
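
As a rough illustration of what "more passes" means, here is a minimal sketch of a toy one-feature logistic regression trained by SGD. The class `MultiPassSgd`, its methods, and the learning rate are all hypothetical stand-ins and not Mahout's API; the idea is only that the same examples are fed to the online learner several times.

```java
// Toy one-feature logistic regression trained by SGD.
// Hypothetical stand-in for illustration; this is NOT Mahout's implementation.
final class MultiPassSgd {
    double weight = 0.0;
    double bias = 0.0;

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // One SGD step on a single example (label is 0 or 1).
    void train(double x, int label, double learningRate) {
        double error = sigmoid(weight * x + bias) - label;
        weight -= learningRate * error * x;
        bias -= learningRate * error;
    }

    // Feed the whole data set to the learner `passes` times.
    void fit(double[] xs, int[] labels, int passes, double learningRate) {
        for (int pass = 0; pass < passes; pass++) {
            for (int i = 0; i < xs.length; i++) {
                train(xs[i], labels[i], learningRate);
            }
        }
    }

    // Mean log loss of the current model on the given examples.
    double logLoss(double[] xs, int[] labels) {
        double total = 0.0;
        for (int i = 0; i < xs.length; i++) {
            double p = sigmoid(weight * xs[i] + bias);
            total -= (labels[i] == 1) ? Math.log(p) : Math.log(1.0 - p);
        }
        return total / xs.length;
    }
}
```

With a real data set you would probably also shuffle the examples between passes and watch the error on held-out data rather than on the training set.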


In general, you should not do anything to your training data that you will
not later do to your production data.
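
One way to hold yourself to that rule is to route both training and production documents through a single encoding method. The sketch below assumes documents with just a title and a body; `SharedEncoding`, `NUM_FEATURES`, and the simple hashing scheme are hypothetical stand-ins for illustration, not Mahout's encoders.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a hashed text encoder; this is NOT Mahout's API.
final class SharedEncoding {
    static final int NUM_FEATURES = 1000; // assumed vector size

    // Hash one token of one field into a slot and add the given weight there.
    static void addToVector(String field, String token, double weight, double[] vector) {
        int slot = Math.floorMod((field + ":" + token).hashCode(), vector.length);
        vector[slot] += weight;
    }

    // Lower-case the text and split on non-word characters, dropping empty tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // The single encoding path: call this for training AND production documents.
    static double[] encode(String title, String body) {
        double[] vector = new double[NUM_FEATURES];
        for (String t : tokenize(title)) {
            addToVector("title", t, 1.0, vector);
        }
        for (String t : tokenize(body)) {
            addToVector("body", t, 1.0, vector);
        }
        return vector;
    }
}
```

Because there is only one `encode` method, any extra feature or weighting you add has to be computable from what a production document actually contains.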

On Thu, Sep 22, 2011 at 10:04 AM, Loic Descotte <[email protected]> wrote:

> >  Have you seen the Mahout book?
>
>
> Yes I've bought your (very good) book in early access preview. It helps me
> a lot in my investigations.
>
> > ?! If a feature is not found in the production data, then you should not
> > give it to the model as a predictor during training. Otherwise, you have a
> > form of target leak.
>
> I think I didn't explain myself very well, sorry.
>
> I mean that there is no attribute named 'keyword' in my test or train data.
> But all the keywords I put in when I create my new vector appear in the other
> attributes of my data (body and title).
>
> I've selected them because I know they will occur very often in body and
> title.
>
> I was just worried about creating a "fake" attribute name (keyword), like
> this :
>
> for (String keyword : keywords) {
>   predictorEncoders.get(99).addToVector(keyword, 700, featureVector);
> }
>
> (The 99 predictor is a new predictor I've created just for these keywords)
>
> But it seems to work (with big weights). The keywords seem to be found in
> the other attributes, because when I do this my results get better in the
> confusion matrix.
>
> So is it OK to do it this way, or is it still a dirty hack?
>
> Loic
