On Mon, Nov 1, 2010 at 1:34 PM, Burton-West, Tom <tburt...@umich.edu> wrote:

> Thanks Robert,
>
> I'll use the workaround for now (using StandardTokenizerFactory and
> specifying version 3.1), but I suspect that I don't want the added URL/IP
> address recognition due to my use case. I've also talked to a couple people
> who recommended using the ICUTokenFilter with some rule modifications, but
> haven't had a chance to investigate that yet.
>
Yes, as far as rule modifications go, we can think about how to hook this in. At the end of the day, if we allow someone to specify the classname of their ICUTokenizerConfig (default: DefaultICUTokenizerConfig), that would at least allow this customization. Separately, I'd be interested in hearing about whatever rule modifications might be useful for different purposes.

> I opened two JIRA issues (https://issues.apache.org/jira/browse/SOLR-2210
> and https://issues.apache.org/jira/browse/SOLR-2211). Sometime later this
> week I'll try writing the FilterFactories and upload patches. (Unless someone
> beats me to it :)
>

Thanks Tom, there are actually a lot of analysis factories (even in just ICU itself) not exposed to Solr, so it's a good deal of work. I know I have a few of them, but they aren't the best. I suggested on SOLR-2210 that we could make a contrib like 'extraAnalyzers' and put all the analyzers that have large dependencies/dictionaries (e.g. SmartChinese too) in there. So there's a lot to be done, including tests. Any help is appreciated!
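
P.S. To make the "specify the classname of the config" idea a bit more concrete, here is a rough sketch of the lookup a factory could do. This is only an illustration under assumptions: TokenizerConfig, DefaultConfig, CustomRulesConfig and the "config" argument name are hypothetical stand-ins so the snippet runs without Lucene on the classpath. They are not the real ICUTokenizerConfig API, and nothing like this exists in Solr yet.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigLoaderSketch {

    /** Hypothetical stand-in for ICUTokenizerConfig; NOT the real Lucene class. */
    public interface TokenizerConfig {
        String describe();
    }

    /** Hypothetical stand-in for DefaultICUTokenizerConfig. */
    public static class DefaultConfig implements TokenizerConfig {
        public String describe() { return "default rules"; }
    }

    /** Hypothetical user-supplied config carrying modified rules. */
    public static class CustomRulesConfig implements TokenizerConfig {
        public String describe() { return "custom rules"; }
    }

    /**
     * Instantiates the config class named by the (assumed) "config" argument,
     * falling back to the default config when the argument is absent.
     */
    public static TokenizerConfig load(Map<String, String> args)
            throws ReflectiveOperationException {
        String name = args.getOrDefault("config", DefaultConfig.class.getName());
        return Class.forName(name)
                .asSubclass(TokenizerConfig.class) // rejects classes of the wrong type
                .getDeclaredConstructor()          // requires a no-arg constructor
                .newInstance();
    }

    public static void main(String[] argv) throws Exception {
        Map<String, String> args = new HashMap<>();
        System.out.println(load(args).describe());

        args.put("config", CustomRulesConfig.class.getName());
        System.out.println(load(args).describe());
    }
}
```

The point of the sketch is just that the factory needs no knowledge of the custom rules: anyone could drop in their own config class and name it in the field type definition.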