On 5/16/2014 9:24 AM, aiguofer wrote:
> Jack Krupansky-2 wrote
>> Typically the white space tokenizer is the best choice when the word 
>> delimiter filter will be used.
>>
>> -- Jack Krupansky
> 
> If we wanted to keep the StandardTokenizer (because we make use of the token
> types) but wanted to use the WDFF to get combinations of words that are
> split with certain characters (mainly - and /, but possibly others as well),
> what is the suggested way of accomplishing this? Would we just have to
> extend the JFlex file for the tokenizer and re-compile it?

You can use the ICUTokenizer instead and pass it a special rulefile
that makes it break Latin text only on whitespace instead of at all
the usual places.  This is exactly what I do in my index.

In the Solr source code, you can find this special rulefile at the
following path:

lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/Latin-break-only-on-whitespace.rbbi

You would place the rule file in the same location as schema.xml, and
then use this in your fieldType:

<tokenizer class="solr.ICUTokenizerFactory"
rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
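
If you also need the word delimiter behavior from your original
question, a field type along these lines should work -- the type name
and the WordDelimiterFilterFactory parameters below are only an
example, so tune them for your data:

<fieldType name="text_icu_wdf" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer>
    <!-- break Latin text only on whitespace -->
    <tokenizer class="solr.ICUTokenizerFactory"
        rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
    <!-- split on - and /, keep the parts, the joined form,
         and the original term -->
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With catenateWords and preserveOriginal enabled, input like "wi-fi"
produces the terms "wi", "fi", "wifi", and "wi-fi".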

Note that the ICUTokenizer requires adding contrib jars to your Solr
install -- the required jars, along with a README outlining which files
you need, are included in the Solr download under
solr/contrib/analysis-extras.
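
One way to load those jars (the relative paths here are just an
example -- they depend on where your core directory sits relative to
the extracted Solr download) is with <lib> directives in
solrconfig.xml:

<lib dir="../../../contrib/analysis-extras/lib" regex=".*\.jar" />
<lib dir="../../../contrib/analysis-extras/lucene-libs" regex=".*\.jar" />

Copying the jars into the core's lib directory (next to conf) also
works.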

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTokenizerFactory

Thanks,
Shawn
