Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr
We are trying to solve some multilingual issues with our Solr analysis filter chain and would like to use the new Lucene 3.x filters that are Unicode compliant. Is it possible to use the Lucene ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr? Is it just a matter of writing the appropriate Solr filter factories? Are there any tricky gotchas in writing such a filter? If so, should I open a JIRA issue or two JIRA issues so the filter factories can be contributed to the Solr code base?

Tom
Re: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr
On Mon, Nov 1, 2010 at 12:24 PM, Burton-West, Tom tburt...@umich.edu wrote:
> We are trying to solve some multilingual issues with our Solr analysis filter chain and would like to use the new Lucene 3.x filters that are Unicode compliant. Is it possible to use the Lucene ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr?

Right now, you can use the StandardTokenizerFactory (which is UAX#29 + URL and IP address recognition) from Solr. Just make sure you set the version to 3.1 in your solrconfig.xml with branch_3x; otherwise it will use the old StandardTokenizer for backwards compatibility.

<!-- Controls what version of Lucene various components of Solr adhere to. Generally, you want to use the latest version to get all bug fixes and improvements. It is highly recommended that you fully re-index after changing this setting as it can affect both how text is indexed and queried. -->
<luceneMatchVersion>LUCENE_31</luceneMatchVersion>

But if you want the pure UAX#29 Tokenizer without the URL and IP address recognition, there isn't a factory. Also, if you want customization/supplementary character support, there is no factory for ICUTokenizer at the moment.

> If so, should I open a JIRA issue or two JIRA issues so the filter factories can be contributed to the Solr code base?

Please open issues for a factory for the pure UAX#29 Tokenizer, and for the ICU factories (maybe we can just put this into a contrib for now?)
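For readers unfamiliar with what UAX#29 word segmentation does, here is a minimal sketch using only the JDK. It is an assumption-laden illustration, not Lucene code: `java.text.BreakIterator` is the JDK's own word-boundary implementation, and its results may differ from Lucene's StandardTokenizer or ICUTokenizer, especially for URLs, IP addresses, and supplementary characters. The class name `WordBreakSketch` and the rule "keep segments that contain a letter or digit" are made up for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: JDK word-boundary segmentation, NOT Lucene's tokenizer.
class WordBreakSketch {

    // Splits text at word boundaries and keeps only segments that contain
    // at least one letter or digit (dropping whitespace and punctuation).
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world"));
    }
}
```

Running the same kind of check against the real analysis chain is better done from the Solr admin analysis page, since that shows what the configured tokenizer actually emits.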
RE: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr
Thanks Robert,

I'll use the workaround for now (using StandardTokenizerFactory and specifying version 3.1), but I suspect that I don't want the added URL/IP address recognition due to my use case. I've also talked to a couple of people who recommended using the ICUTokenFilter with some rule modifications, but haven't had a chance to investigate that yet.

I opened two JIRA issues (https://issues.apache.org/jira/browse/SOLR-2210 and https://issues.apache.org/jira/browse/SOLR-2211). Sometime later this week I'll try writing the FilterFactories and upload patches. (Unless someone beats me to it :)

Tom

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Monday, November 01, 2010 12:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr
Re: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr
On Mon, Nov 1, 2010 at 1:34 PM, Burton-West, Tom tburt...@umich.edu wrote:
> Thanks Robert, I'll use the workaround for now (using StandardTokenizerFactory and specifying version 3.1), but I suspect that I don't want the added URL/IP address recognition due to my use case. I've also talked to a couple of people who recommended using the ICUTokenFilter with some rule modifications, but haven't had a chance to investigate that yet.

Yes, as far as doing rule modifications, we can think about how to hook this in. At the end of the day, if we allow someone to specify the classname of their ICUTokenizerConfig (default: DefaultICUTokenizerConfig), that would at least allow this customization. Separately, I'd be interested in hearing about whatever rule modifications might be useful for different purposes.

> I opened two JIRA issues (https://issues.apache.org/jira/browse/SOLR-2210 and https://issues.apache.org/jira/browse/SOLR-2211). Sometime later this week I'll try writing the FilterFactories and upload patches. (Unless someone beats me to it :)

Thanks Tom, there are actually a lot of analysis factories (even in just ICU itself) not exposed to Solr, so it's a good deal of work. I know I have a few of them, but they aren't the best. I suggested on SOLR-2210 we could make a contrib like 'extraAnalyzers' and put all the analyzers-that-have-large-dependencies/dictionaries (e.g. SmartChinese too) in there. So there's a lot to be done... including tests, any help is appreciated!
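The "specify the classname of the config, with a default" idea above could look roughly like the sketch below. It is only a sketch under stated assumptions: `TokenizerConfig`, `DefaultConfig`, `CustomRulesConfig`, and `load` are hypothetical stand-ins invented for this example (they are not Lucene's ICUTokenizerConfig or DefaultICUTokenizerConfig, and not an existing Solr factory API). It only shows the reflection pattern a factory might use to turn a configured class name into a config instance.

```java
import java.lang.reflect.Constructor;

// Hypothetical sketch of "load a config class by name, else use the default".
class ConfigLoaderSketch {

    // Stand-in for a tokenizer configuration type (assumption, not Lucene's class).
    interface TokenizerConfig {
        String describe();
    }

    // Stand-in for the default configuration.
    static class DefaultConfig implements TokenizerConfig {
        public String describe() { return "default rules"; }
    }

    // Stand-in for a user-supplied configuration with modified rules.
    static class CustomRulesConfig implements TokenizerConfig {
        public String describe() { return "custom rules"; }
    }

    // Returns the default config when no class name is given; otherwise
    // instantiates the named class through its no-argument constructor.
    static TokenizerConfig load(String className) {
        if (className == null || className.isEmpty()) {
            return new DefaultConfig();
        }
        try {
            Class<? extends TokenizerConfig> clazz =
                Class.forName(className).asSubclass(TokenizerConfig.class);
            Constructor<? extends TokenizerConfig> ctor = clazz.getDeclaredConstructor();
            return ctor.newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot load config class: " + className, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(load(null).describe());
        System.out.println(load(CustomRulesConfig.class.getName()).describe());
    }
}
```

Failing fast with a clear error on a bad class name is a deliberate choice here: a misconfigured analysis chain is easier to diagnose at startup than after documents have been indexed with the wrong rules.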