Hi, I was wondering if you could help.

I've set up Mahout to classify news articles, so that I can extract only
those articles which are of interest.

I've gone through and manually labelled the titles of these news articles for
training, approximately 80,000 in total (both articles I want and articles I
don't want).

I have written an app which outputs the top words and their scores, and it
seems certain keywords are creeping high up the list of top words.

Some of the so-called top words are false positives: they rank highly only
because every title page has them, such as 'stratford herald' (which is the
name of the newspaper). Is there any way to remove them once a model has
already been created?

There are about 20 top words which I would like to simply get rid of (or get
Mahout to ignore when providing best labels), but I don't want this to be an
exercise on input (i.e. filtering the names I'd like to exclude from the
training input). I'd prefer to remove them after the fact, as I've already
spent a lot of time labelling the training data manually.
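
To make it concrete, the kind of post-filter I have in mind could be sketched
roughly as below. This is only an illustration in Python: the function name
and the ignore list are made up for the example, none of it is Mahout API,
and the scores are copied (rounded) from the negative_article output further
down.

```python
# Hypothetical post-filter over (word, score) pairs; not part of Mahout.
# The ignore list here only holds the newspaper name mentioned above.
IGNORED_WORDS = {"stratford", "herald"}


def filter_top_words(scored_words, ignored=IGNORED_WORDS, limit=75):
    """Drop ignored words, then return the `limit` highest-scoring pairs."""
    kept = [(word, score) for word, score in scored_words
            if word.lower() not in ignored]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


# Scores rounded from the "negative_article" listing in this message.
negative_article = [
    ("stratford", 10748.60),
    ("herald", 7579.56),
    ("avon", 7484.69),
    ("upon", 7476.36),
    ("local", 7426.40),
    ("after", 3837.66),
    ("man", 3512.44),
    ("police", 2586.90),
]

for word, score in filter_top_words(negative_article, limit=3):
    print(f"- {word}: {score}")
```

As far as I can tell, something like this would only tidy up the listing. It
would presumably not change how the already-built model scores an article,
which is the part I'm unsure how to do.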


Top words
- home: 1067
- dorset: 1493
- details: 908
- back: 867
- poole: 1651
- set: 819
- help: 743
- get: 812
- bournemouth: 14728
- new: 2661
- avon: 2684
- local: 3092
- cherries: 1244
- police: 1011
- over: 1813
- echo: 6526
- null: 79983
- after: 2292
- stratford: 2657
- school: 1395
- jobs: 881
- job: 6982
- car: 772
- herald: 2817
- nurse: 1174
- man: 1335
- manager: 1071
- day: 759
- time: 764
- council: 824
- upon: 2676
Number of labels: 2
Number of documents in training set: 79983
Top 75 words for label negative_article
- stratford: 10748.598348617554
- herald: 7579.555884361267
- avon: 7484.692479610443
- upon: 7476.3635239601135
- local: 7426.4039397239685
- after: 3837.6605548858643
- man: 3512.4373264312744
- police: 2586.899124145508
- over: 1537.557123184204
- woman: 1434.1630334854126
Top 75 words for label other
- bournemouth: 39076.86379265785
- job: 24028.39960718155
- echo: 22974.801107406616
- new: 10888.526140213013
- stratford: 8045.635549545288
- poole: 7493.278381347656
- over: 7077.8266887664795
- school: 7011.863867282867
- local: 7004.647378444672
- dorset: 6961.040742397308
