On Sep 16, 2011, at 3:51 PM, Jeff Eastman wrote:

> [jeff] seq2sparse has --maxDFPercent which can be used to remove really high
> frequency terms. No explicit stop word lists though.
To use an explicit stop word list, you need to pass in your own Lucene analyzer. Note that MailArchivesClusteringAnalyzer is an example in the Mahout code base and has a decent-sized stop list for mail archives.

-Grant
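To make the difference between the two approaches concrete, here is a rough, self-contained sketch in plain Java. It is not Mahout or Lucene code: the class `TermFilterSketch`, its methods, and the sample documents are all made up for illustration, and treating the cutoff as inclusive is an assumption rather than something taken from seq2sparse. It only shows the idea that a document-frequency cutoff drops terms appearing in too many documents, while an explicit stop list drops terms by name regardless of how often they occur.

```java
import java.util.*;

// Illustrative only: hypothetical names, not part of Mahout or Lucene.
final class TermFilterSketch {

    // Counts, for each term, how many documents contain it at least once.
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<String, Integer>();
        for (List<String> doc : docs) {
            // De-duplicate within a document so each document counts once per term.
            for (String term : new HashSet<String>(doc)) {
                Integer seen = df.get(term);
                df.put(term, seen == null ? 1 : seen + 1);
            }
        }
        return df;
    }

    // Keeps terms whose document frequency is at most maxDfPercent of all
    // documents (inclusive cutoff is an assumption) and that are not in stopWords.
    static Set<String> keptTerms(List<List<String>> docs, int maxDfPercent,
                                 Set<String> stopWords) {
        Set<String> kept = new TreeSet<String>();
        for (Map.Entry<String, Integer> e : documentFrequencies(docs).entrySet()) {
            double percent = 100.0 * e.getValue() / docs.size();
            if (percent <= maxDfPercent && !stopWords.contains(e.getKey())) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    // Four tiny, already-tokenized "mail messages" used as sample input.
    static List<List<String>> sampleDocs() {
        List<List<String>> docs = new ArrayList<List<String>>();
        docs.add(Arrays.asList("the", "mahout", "clustering"));
        docs.add(Arrays.asList("the", "lucene", "analyzer"));
        docs.add(Arrays.asList("the", "mahout", "wrote"));
        docs.add(Arrays.asList("the", "regards", "wrote"));
        return docs;
    }

    public static void main(String[] args) {
        List<List<String>> docs = sampleDocs();
        // Frequency cutoff only: "the" appears in every document and is dropped.
        System.out.println(keptTerms(docs, 70, new HashSet<String>()));
        // Cutoff plus a stop list: mail boilerplate is dropped by name as well.
        Set<String> stops = new HashSet<String>(Arrays.asList("wrote", "regards"));
        System.out.println(keptTerms(docs, 70, stops));
    }
}
```

In this toy input, "wrote" and "regards" are too infrequent for the cutoff to catch, which is the case where a stop list supplied through a custom analyzer helps.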
