: > : Nutch has phrase pre-filtering which helps with this. It indexes the
: > : phrase fragments as separate terms and uses that set of matches to
: > : filter the set of matching documents.

: > That reminds me ... i seem to remember someone saying once that Nutch also
: > builds word based n-grams out of its stop words, so searches on "the"
: > or "on" won't match anything because those words are never indexed as
: > single tokens, but if a document contains "the dog in the house" it would
: > match a search on "in the" because the Analyzer would treat that as a
: > single token "in_the".

: This looks like exactly what I'm looking for. Is it related to the above
: 'nutch pre-filtering'? This way if I stopword single letters and
: numbers, it would still index 'hepatitis_a' as a single token, and match
: a search on 'hepatitis a' (non-phrase search) without hitting 'a patient
: has hepatitis'? I guess i'd have to apply the filter to the query too,
: so it turns the query into hepatitis_a?

right ... i think we were both talking about the same feature, which Otis
says is in Nutch's "CommonGrams" class...

http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/CommonGrams.java?view=markup
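
To make the idea concrete, here is a minimal plain-Java sketch of the behavior as described in this thread. It is not Nutch's code and has not been checked against what CommonGrams actually emits. The class name CommonGramsSketch, the method indexTerms, and the "_" separator are all assumptions for illustration; it works on a list of already-tokenized, lower-cased words rather than a Lucene TokenStream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone illustration; NOT Nutch's CommonGrams implementation.
class CommonGramsSketch {

    // Returns the terms to index for a value already split into lower-cased words.
    // Assumption: common words are dropped as single terms and only survive
    // inside two-word grams joined with "_".
    static List<String> indexTerms(List<String> words, Set<String> common) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            // A common word is never emitted on its own.
            if (!common.contains(word)) {
                out.add(word);
            }
            // Emit a gram for this word and the next one if either is common.
            if (i + 1 < words.size()) {
                String next = words.get(i + 1);
                if (common.contains(word) || common.contains(next)) {
                    out.add(word + "_" + next);
                }
            }
        }
        return out;
    }
}
```

With "a" in the common word list, "hepatitis a" would produce the terms hepatitis and hepatitis_a, while "a patient has hepatitis" would produce a_patient, patient, has and hepatitis, so a query rewritten to hepatitis_a would only hit the first document. As the question above guesses, the query side needs the same treatment for that to work.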

: Any chance at all this kind of filter gets implemented into solr? If
: not, indications on how to do it myself would be appreciated - I can't

CommonGrams itself seems to have some other dependencies on nutch because
of other utilities in the same class, but based on a quick skim, what you
really want is the nested "private static class Filter extends
TokenFilter" which doesn't really have any external dependencies.  If you
extract that class into some more specifically named "CommonGramsFilter",
all you need after that to use it in Solr is a simple little
"FilterFactory" so you can reference it in your schema.xml ... you can use
the StopFilterFactory as a template since you'll need exactly the same
initialization (get the name of a word list file from the init params,
parse it, and build a word set out of it)...

http://svn.apache.org/viewvc/incubator/solr/trunk/src/java/org/apache/solr/analysis/StopFilterFactory.java?view=markup

...all you really need to change is that the "create" method should return
a new "CommonGramsFilter" instead of a StopFilter.
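
For reference, the initialization step mentioned above (word list file in, word set out) boils down to something like the following plain-Java sketch. It is a stand-in, not the code StopFilterFactory actually uses: the class WordListSketch and method loadWords are hypothetical names, and skipping "#" comment lines and lower-casing each entry are assumptions about the file format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; Solr has its own utilities for loading word lists.
class WordListSketch {

    // Reads one word per line and returns the words as a set.
    static Set<String> loadWords(Reader in) {
        Set<String> words = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                // Assumption: blank lines and lines starting with '#' are ignored.
                if (word.isEmpty() || word.startsWith("#")) {
                    continue;
                }
                // Assumption: entries are lower-cased to match lower-cased tokens.
                words.add(word.toLowerCase(Locale.ROOT));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }
}
```

The resulting set is what a factory would hand to the filter each time "create" is called, so the file only needs to be parsed once at init time.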

Incidentally: most of the code in CommonGrams.Filter seems to be dealing
with the buffering of tokens ... it may be easier to reimplement the logic
with Solr's BufferedTokenStream as a base class.

-Hoss
