On Sep 16, 2011, at 3:51 PM, Jeff Eastman wrote:
> 
> [jeff] seq2sparse has --maxDFPercent which can be used to remove really high 
> frequency terms. No explicit stop word lists though.

To filter with an explicit stop word list, you need to pass in your own Lucene 
analyzer.  Note that MailArchivesClusteringAnalyzer is an example in the Mahout 
code base and has a decent-sized stop list for mail archives.
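
To make the difference between the two mechanisms concrete, here is a rough sketch in plain Java. It is not Mahout or Lucene code: the class name DfPruningSketch, the method names, and the sample documents are all made up for illustration, and the percentage comparison is only an assumption about how a document-frequency cutoff might be applied. It shows a cutoff dropping a term that appears in every document, while a stop list catches a term that falls just under the cutoff.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not taken from the Mahout code base.
public class DfPruningSketch {

    // Counts, for each term, how many documents contain it at least once.
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // De-duplicate within a document so each document counts once per term.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Keeps terms that are not stop words and whose document frequency,
    // as a percentage of all documents, does not exceed maxDfPercent.
    static Set<String> keptTerms(List<List<String>> docs,
                                 Set<String> stopWords,
                                 int maxDfPercent) {
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : documentFrequencies(docs).entrySet()) {
            boolean tooFrequent = e.getValue() * 100 > maxDfPercent * docs.size();
            if (!tooFrequent && !stopWords.contains(e.getKey())) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("the", "mahout", "list", "wrote"),
            Arrays.asList("the", "lucene", "analyzer", "wrote"),
            Arrays.asList("the", "stop", "list"),
            Arrays.asList("the", "clustering", "wrote"));

        // "the" is in 100% of documents, so an 80% cutoff removes it.
        // "wrote" is in 75%, so only the stop list removes it.
        Set<String> stopWords = new HashSet<>(Arrays.asList("wrote"));
        System.out.println(keptTerms(docs, stopWords, 80));
    }
}
```

In a real seq2sparse run the stop list would live inside the analyzer you pass in, so stop words are dropped at tokenization time, before any document-frequency pruning happens.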

-Grant
