Re: stop word-lists VS maxDFPercent

Ted Dunning Tue, 13 May 2014 05:51:25 -0700

I usually recommend using a custom stop list based on your own corpus.  That 
tends to work much better than general ones.


I like using the doc frequency limit (maxDFPercent) as well, as a safety net 
in case something unexpected gets past the stop list.
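
One way to combine the two ideas, a stop list derived from your own corpus and a document-frequency cap, is to compute document frequencies once and treat every term above the cap as a stop word. The following is only a minimal sketch: the function names, the `max_df_percent` parameter, and the regex tokenizer are illustrative assumptions, not Mahout's API, and the "strictly greater than the cap" rule is an assumed reading of what a maxDFPercent-style limit does.

```python
import re
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents it appears in."""
    df = Counter()
    for text in docs:
        # Assumed tokenizer: lowercase runs of letters and apostrophes.
        # set() makes a term count once per document, not once per occurrence.
        df.update(set(re.findall(r"[a-z']+", text.lower())))
    return df


def corpus_stop_words(docs, max_df_percent=30):
    """Return terms that occur in more than max_df_percent of the documents."""
    df = document_frequencies(docs)
    limit = len(docs) * max_df_percent / 100.0
    return {term for term, count in df.items() if count > limit}


# Toy corpus (made up for illustration only).
docs = [
    "The markets fell on Monday as the bank cut rates",
    "The senate passed the budget on Tuesday",
    "Heavy rain hit the coast on Monday",
    "The team won the final on Sunday",
]

stop_words = corpus_stop_words(docs, max_df_percent=50)
print(sorted(stop_words))  # ['on', 'the']
```

A list produced this way can be reviewed by hand and saved as the custom stop list, while the frequency cap stays in the pipeline to catch terms that only become too common in later data.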


> On May 12, 2014, at 6:24, David Noel wrote:
> 
> What's everyone's opinion on using large stop word lists vs a very
> small value for maxDFPercent (like 30)? I'm playing around with both
> and am having trouble deciding whether one is better than the other,
> or if I should use a combination of both. My data set is one day's
> worth of news articles gathered from 1000 online news outlets. It's
> probably similar to the reuters data set, but with a little more
> noise. I used Boilerpipe for article extraction.
> 
> I spent a good while Googling around to build the largest (English)
> stop word-list I could. I'll paste it below for anyone who's
> interested and would like to save themselves an hour of Googling and
> collating.
