That's very helpful, thanks! Until now I hadn't realized that pattern
replacement is available both as a charFilter
(PatternReplaceCharFilterFactory) and as a token filter
(PatternReplaceFilterFactory).

In the meantime I had figured out an alternative based on
WordDelimiterFilterFactory. It required WhitespaceTokenizerFactory
instead of StandardTokenizerFactory, though, so I had to add extra
PatternReplaceCharFilterFactory charFilters to strip leading and
trailing punctuation.

Again, thanks!

Marian

2011/11/30 Steven A Rowe <sar...@syr.edu>

> Hi Marian,
>
> Extending the StandardTokenizer(Factory) java class is not the way to go
> if you want to change its behavior.
>
> StandardTokenizer is generated from a JFlex <http://jflex.de/>
> specification, so you would need to modify the specification to include
> your special slash-containing-word rule, then regenerate the java code, and
> then compile it.
>
> It would be much simpler to use a PatternReplaceCharFilter
> <http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceCharFilter.html>
> to convert the slashes into unusual (sequences of) characters that won't be
> broken up by the analyzer you're using, then add a PatternReplaceFilter to
> convert the unusual sequences back to slashes.  E.g. if you used "-blah-"
> as the unusual sequence (note: people have also reported using a single
> character drawn from a script that would otherwise not be used in the text,
> e.g. a Chinese ideograph in English text), "AB/1234/5678" could become
> "AB-blah-1234-blah-5678".
>
> Here's an (untested!) analyzer specification that would do this:
>
> <analyzer>
>   <charFilter class="solr.PatternReplaceCharFilterFactory"
>               pattern="([A-Z]+)/([0-9]+)/([0-9]+)"
>               replacement="$1-blah-$2-blah-$3"/>
>   <tokenizer class="solr.StandardTokenizerFactory"/>
>   <filter class="solr.PatternReplaceFilterFactory"
>           pattern="-blah-" replacement="/" replace="all"/>
>   <filter class="solr.LowerCaseFilterFactory"/>
> </analyzer>
>
> Steve
>
>
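
The placeholder round trip in the quoted analyzer can be tried outside
Solr. Below is a minimal Python sketch, not Solr code: the two patterns
and the "-blah-" placeholder come from the quoted message, while the
function names and the toy tokenizer (which breaks on whitespace and
slashes only) are assumptions made for illustration. It says nothing
about how the real StandardTokenizer treats a given placeholder; that
still has to be checked in Solr itself.

```python
import re

# Pattern from the quoted analyzer. Python writes group references as \1
# where Solr (Java regex) writes $1.
PROTECT = re.compile(r"([A-Z]+)/([0-9]+)/([0-9]+)")
PLACEHOLDER = "-blah-"


def char_filter(text):
    """Stand-in for the charFilter: hide the slashes before tokenizing."""
    return PROTECT.sub(r"\1-blah-\2-blah-\3", text)


def toy_tokenize(text):
    """Toy tokenizer (an assumption): breaks on whitespace and slashes only."""
    return [t for t in re.split(r"[\s/]+", text) if t]


def token_filters(tokens):
    """Stand-in for the two token filters: restore slashes, then lowercase."""
    return [t.replace(PLACEHOLDER, "/").lower() for t in tokens]


def analyze(text):
    """Run the three stages in the same order as the analyzer chain."""
    return token_filters(toy_tokenize(char_filter(text)))


if __name__ == "__main__":
    sample = "Order AB/1234/5678 shipped"
    # With the char filter, the slash-containing word survives as one token.
    print(analyze(sample))
    # Without it, the toy tokenizer splits the word at each slash.
    print(token_filters(toy_tokenize(sample)))
```

Comparing the two printed lists shows what the char filter buys: the
first keeps "ab/1234/5678" together, the second breaks it into three
tokens.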
