I usually recommend building a custom stop list from your own corpus. That tends to work much better than a general-purpose list.
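
As a rough sketch of what I mean by a corpus-based stop list: count how many documents each term appears in, and treat any term above a document-frequency cutoff as a stop word. The function names, the simple regex tokenizer, and the toy documents here are all assumptions for illustration. This only mimics the idea behind a maxDFPercent-style threshold and is not how any particular tool implements it.

```python
import re
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents it appears in."""
    df = Counter()
    for doc in docs:
        # Use a set so a term is counted once per document, not once per occurrence.
        df.update(set(re.findall(r"[a-z']+", doc.lower())))
    return df


def corpus_stop_list(docs, max_df_percent=30):
    """Return the terms that occur in more than max_df_percent of the documents."""
    df = document_frequencies(docs)
    cutoff = len(docs) * max_df_percent / 100.0
    return {term for term, count in df.items() if count > cutoff}


# Toy corpus (hypothetical); a real run would use the extracted article text.
docs = [
    "The markets fell on Monday as the bank raised rates",
    "The team won the match on Sunday",
    "A storm hit the coast and the power failed",
    "Voters went to the polls on Tuesday",
]
stop_words = corpus_stop_list(docs, max_df_percent=50)
```

On a real corpus you would want to eyeball the resulting list before using it, since a frequent term can still be meaningful for your domain.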
I like using the doc frequency limit as well in case something goes strange on me.

Sent from my iPhone

> On May 12, 2014, at 6:24, David Noel <[email protected]> wrote:
>
> What's everyone's opinion on using large stop word lists vs a very
> small value for maxDFPercent (like 30)? I'm playing around with both
> and am having trouble deciding whether one is better than the other,
> or if I should use a combination of both. My data set is one day's
> worth of news articles gathered from 1000 online news outlets. It's
> probably similar to the reuters data set, but with a little more
> noise. I used Boilerpipe for article extraction.
>
> I spent a good while Googling around to build the largest (English)
> stop word-list I could. I'll paste it below for anyone who's
> interested and would like to save themselves an hour of Googling and
> collating.
