On Thu, Sep 22, 2011 at 12:36 AM, Loic Descotte <[email protected]> wrote:
> ...
>
>> One thing that you said that worried me is your comment about putting a
>> really high weight on the feature. I am surprised that this was required.
>>
>
> Maybe it is because my training data are not good enough...
> What would be a "reasonable" order of magnitude for such a weight?
>

Normally features do not need special weighting. If one feature is good and
the others are noisy, it may take a little time before the model figures out
which is which, but it will figure it out even with reasonably equal
weighting.

> Another thing that made me feel it could be a dirty hack is that the
> 'keyword' feature added manually during learning cannot be found in test or
> production data, because it does not exist:
>
> My training data are text files, for example news articles.
> They have one class, the category: politics, sport, finance...
> They have two attributes: title and body.
>
> So the keywords I add with the code I put in the previous mail are
> completely "virtual". They will never be found in title or body. Despite
> that, I guess that the SGD algorithm looks for the keywords in the other
> attributes (title and body) to match the right category.
>

?! If a feature cannot be found in the production data, then you should not
give it to the model as a predictor during training. Otherwise, you have a
form of target leak. Have you seen the Mahout book? I went into extensive
detail about this and gave examples of how high-quality features that are
available only at training time can lead to poor results.
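
To make that concrete, here is a minimal sketch of the shape I would expect
your pipeline to have: one encoding routine, used identically at training time
and at classification time, with every token added at the same weight of 1.0
and no extra "keyword" feature that exists only during training. This is not
the code from your previous mail; the class and method names
(OnlineLogisticRegression, StaticWordValueEncoder, addToVector) are the Mahout
SGD API as I remember it, and NewsClassifierSketch / encode are made-up names,
so adapt it to the version you are running.

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class NewsClassifierSketch {
  private static final int FEATURES = 10000;
  private static final int CATEGORIES = 3;   // politics, sport, finance

  private final FeatureVectorEncoder titleEncoder = new StaticWordValueEncoder("title");
  private final FeatureVectorEncoder bodyEncoder = new StaticWordValueEncoder("body");
  private final OnlineLogisticRegression model =
      new OnlineLogisticRegression(CATEGORIES, FEATURES, new L1());

  // The same encoding is used for training documents and production documents.
  // Every token gets weight 1.0 -- no hand-tuned boost for any feature, and
  // nothing is added here that a production document could not supply.
  Vector encode(String title, String body) {
    Vector v = new RandomAccessSparseVector(FEATURES);
    for (String token : title.toLowerCase().split("\\W+")) {
      titleEncoder.addToVector(token, 1.0, v);
    }
    for (String token : body.toLowerCase().split("\\W+")) {
      bodyEncoder.addToVector(token, 1.0, v);
    }
    return v;
  }

  void train(int category, String title, String body) {
    model.train(category, encode(title, body));
  }

  int classify(String title, String body) {
    // classifyFull returns a score per category; take the highest one
    return model.classifyFull(encode(title, body)).maxValueIndex();
  }
}

If the keyword really does carry signal that will be available in production,
encode it in this same routine so that both training and classification see it;
if it is only available at training time, leave it out entirely.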
