CommonGrams itself seems to have some other dependencies on Nutch because
of other utilities in the same class, but based on a quick skim, what you
really want is the nested "private static class Filter extends
TokenFilter" which doesn't really have any external dependencies.  If you
extract that class into some more specifically named "CommonGramsFilter",
all you need after that to use it in Solr is a simple little
"FilterFactory" so you can reference it in your schema.xml ... you can use
the StopFilterFactory as a template since you'll need exactly the same
initialization (get the name of a word list file from the init params,
parse it, and build a word set out of it)...

Chris, thanks for the tips (or should I say, detailed explanation!). I actually got it working! It was a pain at first (I had never done any Java, and all this ant, junit, war, jar, java, .class business is confusing!). I had some compile errors that I cleaned up. Playing around with the filter in the admin panel analyser yields the expected results; I can't thank you enough for your help. I now use:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
<filter class="solr.CommonGramsFilterFactory" words="stopwords-complete.txt" ignoreCase="true"/>
<filter class="solr.StopFilterFactory" words="stopwords-complete.txt" ignoreCase="true"/>

And it works perfectly.

If Solr is interested in the filter, just tell me (and tell me how I should go about contributing it).

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



http://svn.apache.org/viewvc/incubator/solr/trunk/src/java/org/apache/solr/analysis/StopFilterFactory.java?view=markup

...all you really need to change is that the "create" method should return
a new "CommonGramsFilter" instead of a StopFilter.
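
The initialization step described above (take a word list, parse it, build a word set) can be sketched in plain Java. This is only an illustration, not the StopFilterFactory code: the class name WordSetSketch and the method parse are hypothetical, and treating lines that start with "#" as comments is an assumption about the file format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Solr or Nutch.
class WordSetSketch {

    // Reads one word per line and returns the words as a set.
    // Blank lines are skipped; "#" comment lines are skipped too (assumed format).
    static Set<String> parse(Reader source, boolean ignoreCase) {
        Set<String> words = new HashSet<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim();
                if (word.isEmpty() || word.startsWith("#")) {
                    continue;
                }
                // With ignoreCase, store the lowercased form so lookups can
                // lowercase the token and match regardless of original case.
                words.add(ignoreCase ? word.toLowerCase(Locale.ROOT) : word);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }
}
```

A real factory would presumably hand the resulting set to the filter it creates; here a java.io.FileReader on the stopword file, or a StringReader in a test, is enough to exercise the parsing.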

Incidentally: most of the code in CommonGrams.Filter seems to be dealing
with the buffering of tokens ... it may be easier to reimplement the logic
with Solr's BufferedTokenStream as a base class.
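
To show why buffering matters here, the following is a rough, library-free sketch of the common-grams idea using a one-token buffer. It is not the Nutch CommonGrams.Filter code and does not use BufferedTokenStream; the class name CommonGramsSketch, the "_" separator, and the rule "emit a bigram when either neighbour is a common word" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not part of Solr or Nutch.
class CommonGramsSketch {

    // Emits every input token, plus a joined bigram for each adjacent pair
    // in which at least one of the two tokens is in the common-word set.
    static List<String> commonGrams(Iterator<String> tokens, Set<String> common) {
        List<String> out = new ArrayList<>();
        String prev = null; // the one-token buffer: the last token seen
        while (tokens.hasNext()) {
            String cur = tokens.next();
            // The bigram can only be decided once the next token is known,
            // which is why the previous token has to be remembered.
            if (prev != null && (common.contains(prev) || common.contains(cur))) {
                out.add(prev + "_" + cur); // "_" separator is an assumption
            }
            out.add(cur);
            prev = cur;
        }
        return out;
    }
}
```

With "the" as the only common word, the tokens "the", "quick", "brown" come out as "the", "the_quick", "quick", "brown". A real TokenFilter would also have to carry offsets and position increments, which this sketch leaves out.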
