On 5/16/2014 9:24 AM, aiguofer wrote:
> Jack Krupansky-2 wrote
>> Typically the white space tokenizer is the best choice when the word
>> delimiter filter will be used.
>>
>> -- Jack Krupansky
>
> If we wanted to keep the StandardTokenizer (because we make use of the
> token types) but wanted to use the WDFF to get combinations of words that
> are split with certain characters (mainly - and /, but possibly others as
> well), what is the suggested way of accomplishing this? Would we just have
> to extend the JFlex file for the tokenizer and re-compile it?
You can use the ICUTokenizer instead and pass it a special rule file that
makes it break Latin text only on whitespace instead of in all the usual
places. This is exactly what I do in my index.

In the Solr source code, you can find this rule file at the following path:

lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/Latin-break-only-on-whitespace.rbbi

Place the rule file in the same location as schema.xml, and then use this in
your fieldType:

<tokenizer class="solr.ICUTokenizerFactory"
           rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>

Note that the ICUTokenizer requires that you add contrib jars to your Solr
install -- the required jars and a README outlining which files you need are
included in the Solr download in solr/contrib/analysis-extras.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTokenizerFactory

Thanks,
Shawn
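P.S. To make that concrete for the hyphen/slash case in the original
question, a complete fieldType might look something like the sketch below.
This is illustrative only, not a drop-in config -- the fieldType name and
the WordDelimiterFilter parameter values are just examples you would tune
for your own data:

<fieldType name="text_icu_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Break Latin text only on whitespace, so tokens like "wi-fi"
         and "TCP/IP" reach the word delimiter filter intact. -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
    <!-- Split on intra-word delimiters such as - and /, and also keep
         the catenated form, e.g. wi-fi -> wi, fi, wifi. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="0"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Here catenateWords="1" is what produces the combined token (wifi) that you
asked about. If you want different behavior at index and query time, you can
split this into separate <analyzer type="index"> and <analyzer type="query">
sections.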