RE: Leaving certain tokens intact during indexing and search

Steven A Rowe Wed, 30 Nov 2011 07:50:06 -0800

Note that my example does not actually use PatternReplaceCharFilterFactory 
twice. The second one is a PatternReplaceFilterFactory: "Char" isn't present 
in its name.


CharFilters operate before tokenizers, and regular filters operate after 
tokenizers.
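
As a rough sketch of that ordering (untested, like my earlier example; it 
reuses the same patterns, and the comments are only annotations):

```xml
<analyzer>
  <!-- 1. charFilter: rewrites the raw character stream before tokenization -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="([A-Z]+)/([0-9]+)/([0-9]+)"
              replacement="$1-blah-$2-blah-$3"/>
  <!-- 2. tokenizer: splits the already rewritten text into tokens -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- 3. filter: rewrites each token the tokenizer produced -->
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="-blah-" replacement="/" replace="all"/>
</analyzer>
```

The two class names are not interchangeable: a <charFilter> element takes a 
CharFilterFactory, and a <filter> element takes a token filter factory.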

Steve

> -----Original Message-----
> From: Marian Steinbach [mailto:marian.steinb...@gmail.com]
> Sent: Wednesday, November 30, 2011 10:44 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Leaving certain tokens intact during indexing and search
> 
> That's pretty helpful, thanks! Especially since I didn't understand so far
> that I could use a filter like PatternReplaceCharFilterFactory both as a
> charFilter and as a filter.
> 
> In the meantime I had figured out another alternative,
> involving WordDelimiterFilterFactory. But I had to
> use WhitespaceTokenizerFactory instead of StandardTokenizerFactory, which
> means that I had to use extra PatternReplaceCharFilterFactory filters to
> get rid of leading/trailing punctuation.
> 
> Again, thanks!
> 
> Marian
> 
> 2011/11/30 Steven A Rowe <sar...@syr.edu>
> 
> > Hi Marian,
> >
> > Extending the StandardTokenizer(Factory) java class is not the way to go
> > if you want to change its behavior.
> >
> > StandardTokenizer is generated from a JFlex <http://jflex.de/>
> > specification, so you would need to modify the specification to include
> > your special slash-containing-word rule, then regenerate the java code,
> > and then compile it.
> >
> > It would be much simpler to use a PatternReplaceCharFilter
> > <http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceCharFilter.html>
> > to convert the slashes into unusual (sequences of) characters that won't
> > be broken up by the analyzer you're using, then add a PatternReplaceFilter
> > to convert the unusual sequences back to slashes.  E.g. if you used
> > "-blah-" as the unusual sequence (note: people have also reported using a
> > single character drawn from a script that would otherwise not be used in
> > the text, e.g. a Chinese ideograph in English text), "AB/1234/5678" could
> > become "AB-blah-1234-blah-5678".
> >
> > Here's an (untested!) analyzer specification that would do this:
> >
> > <analyzer>
> >  <charFilter class="solr.PatternReplaceCharFilterFactory"
> >              pattern="([A-Z]+)/([0-9]+)/([0-9]+)"
> >              replacement="$1-blah-$2-blah-$3"/>
> >  <tokenizer class="solr.StandardTokenizerFactory"/>
> >  <filter class="solr.PatternReplaceFilterFactory" pattern="-blah-"
> >          replacement="/" replace="all"/>
> >  <filter class="solr.LowerCaseFilterFactory"/>
> > </analyzer>
> >
> > Steve
> >
> >
