On Thu, Sep 22, 2011 at 12:36 AM, Loic Descotte <[email protected]> wrote:

> ...
>
>> One thing that you said that worried me is your comment about putting a
>> really high weight on the feature.  I am surprised that this was required.
>>
>
> Maybe it is because my training data are not good enough...
> What would be a "reasonable" order of magnitude for such a weight?
>

Normally features do not need special weighting.  If one feature is good and
others are noisy, then it may take a little time before the model figures
out which is which, but it will figure it out even with reasonably equal
weighting.
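
Concretely, this is all the "weighting" you normally need: encode the title
and body words with the same weight and let the learner sort out which ones
matter.  This is only a sketch from memory of the Mahout encoder classes, and
the vector size here is an arbitrary choice:

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class EncodeExample {
  static final int FEATURES = 10000;   // arbitrary hashed-vector size

  // Encode one article; title and body words all get weight 1.0.
  static Vector encode(Iterable<String> titleWords, Iterable<String> bodyWords) {
    FeatureVectorEncoder title = new StaticWordValueEncoder("title");
    FeatureVectorEncoder body = new StaticWordValueEncoder("body");
    ConstantValueEncoder bias = new ConstantValueEncoder("intercept");

    Vector v = new RandomAccessSparseVector(FEATURES);
    bias.addToVector("", 1.0, v);
    for (String w : titleWords) {
      title.addToVector(w, 1.0, v);   // same weight as the body words
    }
    for (String w : bodyWords) {
      body.addToVector(w, 1.0, v);
    }
    return v;
  }
}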


> Another thing that made me feel it could be a dirty hack is that the
> 'keyword' feature added manually during learning cannot be found in the test
> or production data, because it does not exist:
>
> My training data are text files, for example news articles.
> They have one class, the category: politics, sport, finance...
> They have two attributes: title and body.
>
> So the keywords I add with the code from my previous mail are completely
> "virtual". They will never be found in the title or body. Despite that, I
> guess the SGD algorithm looks for the keywords in the other attributes
> (title and body) to match the right category.
>

?!

If a feature is not found in the production data, then you should not give
it to the model as a predictor during training. Otherwise, you have a form
of target leak.
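
To make that concrete, the training loop and the production call should go
through exactly the same encoding so a training-only keyword has nowhere to
hide.  Again, only a sketch: Article is a made-up stand-in for your parsed
input, encode() is the helper from the snippet above, and the SGD class names
are from memory:

import java.util.List;

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Vector;

public class TrainWithoutLeak {
  static final int NUM_CATEGORIES = 3;   // e.g. politics, sport, finance

  // Made-up stand-in for one parsed news article.
  static class Article {
    int categoryId;
    List<String> titleWords;
    List<String> bodyWords;
  }

  static void trainAndClassify(List<Article> trainingSet, Article unseen) {
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(NUM_CATEGORIES, EncodeExample.FEATURES, new L1());

    for (Article a : trainingSet) {
      // Encode only what production will also have: title and body.
      // Adding a "keyword" derived from a.categoryId here would be the leak.
      Vector v = EncodeExample.encode(a.titleWords, a.bodyWords);
      learner.train(a.categoryId, v);
    }

    // Production uses exactly the same encoding; nothing extra is available.
    Vector v = EncodeExample.encode(unseen.titleWords, unseen.bodyWords);
    int predicted = learner.classifyFull(v).maxValueIndex();
    System.out.println("predicted category id: " + predicted);
  }
}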

Have you seen the Mahout book?  I went into extensive detail about this and
provided examples of how high-quality features available only at training time
can lead to poor results.
