Re: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr

Robert Muir Mon, 01 Nov 2010 10:48:39 -0700

On Mon, Nov 1, 2010 at 1:34 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> Thanks Robert,
>
> I'll use the workaround for now (using StandardTokenizerFactory and 
> specifying version 3.1), but I suspect that I don't want the added URL/IP 
> address recognition due to my use case.  I've also talked to a couple people 
> who recommended using the ICUTokenFilter with some rule modifications, but 
> haven't had a chance to investigate that yet.
>


Yes, as for rule modifications, we can think about how to hook this
in. At the very least, if we let someone specify the class name of
their ICUTokenizerConfig (default: DefaultICUTokenizerConfig), that
would allow this customization.
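
To make the idea concrete, here is a rough sketch of what "specify the
class name, fall back to a default" could look like. This is not the
actual Solr factory code: TokenizerConfig, DefaultTokenizerConfig,
CustomTokenizerConfig and TokenizerConfigLoader are hypothetical
stand-ins for ICUTokenizerConfig, DefaultICUTokenizerConfig, a
user-supplied config and whatever factory would do the loading. It also
assumes the config class has a public or package-visible no-arg
constructor.

```java
// Hypothetical stand-in for ICUTokenizerConfig; not the real Lucene type.
interface TokenizerConfig {
    String describe();
}

// Hypothetical stand-in for DefaultICUTokenizerConfig.
class DefaultTokenizerConfig implements TokenizerConfig {
    public String describe() {
        return "default";
    }
}

// Hypothetical user-supplied config, e.g. one carrying modified rules.
class CustomTokenizerConfig implements TokenizerConfig {
    public String describe() {
        return "custom";
    }
}

// Hypothetical loader: resolves a config class by name, with a default.
final class TokenizerConfigLoader {
    static final String DEFAULT_CLASS = DefaultTokenizerConfig.class.getName();

    private TokenizerConfigLoader() {}

    static TokenizerConfig load(String className) {
        // Fall back to the default config when no class name is given.
        String name = (className == null || className.trim().isEmpty())
                ? DEFAULT_CLASS
                : className.trim();
        try {
            // asSubclass rejects classes that do not implement the interface.
            Class<? extends TokenizerConfig> clazz =
                    Class.forName(name).asSubclass(TokenizerConfig.class);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException(
                    "Cannot load tokenizer config: " + name, e);
        }
    }
}
```

With something like this, a factory attribute that is left out would
yield the default config, and a fully qualified class name would yield
the user's own config. A name that cannot be loaded, or that does not
implement the interface, fails fast with an IllegalArgumentException.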

Separately, I'd be interested in hearing about whatever rule
modifications might be useful for different purposes.

>  I opened two JIRA issues (https://issues.apache.org/jira/browse/SOLR-2210 
> and https://issues.apache.org/jira/browse/SOLR-2211).  Sometime later this 
> week I'll try writing the FilterFactories and upload patches. (Unless someone 
> beats me to it :)
>

Thanks Tom, there are actually a lot of analysis factories (even in
just icu itself) not exposed to Solr, so it's a good deal of work. I
know I have a few of them, but they aren't the best. I suggested on
SOLR-2210 we could make a contrib like 'extraAnalyzers' and put all
the analyzers-that-have-large-dependencies/dictionaries (e.g.
SmartChinese too) in there.

So there's a lot to be done, including tests. Any help is appreciated!
