I don't know your use case, but if you just want a list of the 400 most common words, you can use the Lucene contrib class HighFreqTerms.java with the -t flag. You have to point it at your Lucene index. You probably also want Solr to not be running while you do this, and you'll want to give the JVM running HighFreqTerms a lot of memory.
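A rough invocation sketch, from memory — the jar names, heap size, index path, and field name below are all assumptions for your setup, and the argument order should be checked against the class's own usage message:

```
# Usage (as I recall it): HighFreqTerms <index dir> [-t] [number_terms] [field]
# -t ranks terms by total term frequency instead of document frequency.
java -Xmx8g \
  -cp lucene-core-3.1.jar:lucene-misc-3.1.jar \
  org.apache.lucene.misc.HighFreqTerms \
  /path/to/solr/data/index -t 400 ocr_text
```

Since this opens the index directly rather than going through Solr's request handlers, it sidesteps the TermsComponent heap problem entirely.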
http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/misc/src/java/org/apache/lucene/misc/HighFreqTerms.java?view=log

Tom
http://www.hathitrust.org/blogs/large-scale-search

-----Original Message-----
From: mdz-munich [mailto:sebastian.lu...@bsb-muenchen.de]
Sent: Tuesday, April 26, 2011 9:29 AM
To: solr-user@lucene.apache.org
Subject: TermsCompoment + Dist. Search + Large Index + HEAP SPACE

Hi!

We've got one index split into 4 shards of about 70,000 records each, holding large full-text data from (very dirty) OCR. As a result we have a lot of "unique" terms. Now we are trying to obtain the 400 most common words for CommonGramsFilter via the TermsComponent, but the request always runs out of memory. The VM is equipped with 32 GB of RAM, with 16-26 GB allocated to the Java VM.

Any ideas how to get the most common terms without increasing the VM's memory?

Thanks & best regards,

Sebastian

--
View this message in context: http://lucene.472066.n3.nabble.com/TermsCompoment-Dist-Search-Large-Index-HEAP-SPACE-tp2865609p2865609.html
Sent from the Solr - User mailing list archive at Nabble.com.
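P.S. Once you have the top-400 list, wiring it into CommonGramsFilter is just an analyzer config in schema.xml. A minimal sketch — the fieldType name, tokenizer choice, and words-file name here are my assumptions, not anything from your setup:

```xml
<!-- Sketch only: commongrams.txt (in the conf/ directory) would hold the
     ~400 most common terms, one per line. CommonGramsFilterFactory then
     emits bigrams of a common word with its neighbors at index time. -->
<fieldType name="text_commongrams" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="commongrams.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsQueryFilterFactory" words="commongrams.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Note the query side uses CommonGramsQueryFilterFactory, which keeps only the grams for phrase queries rather than both the grams and the original common words.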