You could strip out the terms you don't want to see with a stop word filter when training on your article titles. As an example, Lucene ships a StopFilter with a default English stop word set; you could create something similar with your own word list and run your text through it (for both training and test data).
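To make the idea concrete, here is a minimal sketch of such a filter in plain Java, with no Lucene or Mahout dependency. The class name, the `filter` method and the stop list are all made up for illustration; the list only contains a few of the publication-name words from the output quoted below, so you would substitute your own 20 or so terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Mahout or Lucene.
class TitleStopWords {

    // Illustrative stop list (an assumption): words that appear in every title
    // only because they are part of the newspaper's name.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "stratford", "upon", "avon", "herald"));

    // Lower-cases the title, splits it on anything that is not a letter or a
    // digit, and keeps only the tokens that are not in the stop list.
    static List<String> filter(String title) {
        List<String> kept = new ArrayList<>();
        for (String token : title.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter("Stratford-upon-Avon Herald: man arrested after police chase"));
    }
}
```

The same filtering has to be applied to the titles you classify later, otherwise the model sees tokens at test time that it never saw in training.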
On Tuesday, October 15, 2013 10:20 AM, Andrew Butkus <[email protected]> wrote:

Hi, I was wondering if you could help. I've set up Mahout to classify news articles, so that I can extract only the articles which are of interest. I've gone through and manually trained on the titles of these news articles, approximately 80,000 of them (both articles I want and articles I don't want).

I have written an app which outputs the top words and their scores, and it seems certain keywords are creeping high up the list. Some of the so-called top words are false positives: they are only at the top because every title page has them, such as 'stratford herald' (which is the name of the newspaper). Is there any way to remove them once a model has already been created?

There are about 20 top words which I would like to simply get rid of (or get Mahout to ignore when providing best labels), but I don't want this to be an exercise on input (i.e. filtering the names I'd like to exclude from the training input). I'd prefer to remove them afterwards, as I've already spent a lot of time manually training.
Top words
- home: 1067
- dorset: 1493
- details: 908
- back: 867
- poole: 1651
- set: 819
- help: 743
- get: 812
- bournemouth: 14728
- new: 2661
- avon: 2684
- local: 3092
- cherries: 1244
- police: 1011
- over: 1813
- echo: 6526
- null: 79983
- after: 2292
- stratford: 2657
- school: 1395
- jobs: 881
- job: 6982
- car: 772
- herald: 2817
- nurse: 1174
- man: 1335
- manager: 1071
- day: 759
- time: 764
- council: 824
- upon: 2676

Number of labels: 2
Number of documents in training set: 79983

Top 75 words for label negative_article
- stratford: 10748.598348617554
- herald: 7579.555884361267
- avon: 7484.692479610443
- upon: 7476.3635239601135
- local: 7426.4039397239685
- after: 3837.6605548858643
- man: 3512.4373264312744
- police: 2586.899124145508
- over: 1537.557123184204
- woman: 1434.1630334854126

Top 75 words for label other
- bournemouth: 39076.86379265785
- job: 24028.39960718155
- echo: 22974.801107406616
- new: 10888.526140213013
- stratford: 8045.635549545288
- poole: 7493.278381347656
- over: 7077.8266887664795
- school: 7011.863867282867
- local: 7004.647378444672
- dorset: 6961.040742397308
