Thanks Robert,

I'll use the workaround for now (using StandardTokenizerFactory and specifying 
version 3.1), but given my use case I suspect I don't want the added URL/IP 
address recognition.  I've also talked to a couple of people who recommended 
using the ICUTokenFilter with some rule modifications, but I haven't had a 
chance to investigate that yet.

I opened two JIRA issues (https://issues.apache.org/jira/browse/SOLR-2210 
and https://issues.apache.org/jira/browse/SOLR-2211).  Sometime later this week 
I'll try writing the FilterFactories and uploading patches (unless someone beats 
me to it :).

Tom

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Monday, November 01, 2010 12:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support 
from Solr

On Mon, Nov 1, 2010 at 12:24 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> We are trying to solve some multilingual issues with our Solr analysis filter 
> chain and would like to use the new Lucene 3.x filters that are Unicode 
> compliant.
>
> Is it possible to use the Lucene ICUTokenizerFilter or StandardAnalyzer with 
> UAX#29 support from Solr?

right now, you can use the StandardTokenizerFactory (which is UAX#29 +
URL and IP address recognition) from Solr.
just make sure you set the Version to 3.1 in your solrconfig.xml with
branch_3x, otherwise it will use the "old" standardtokenizer for
backwards compatibility.

  <!--
    Controls what version of Lucene various components of Solr adhere to.
    Generally, you want to use the latest version to get all bug fixes and
    improvements. It is highly recommended that you fully re-index after
    changing this setting as it can affect both how text is indexed and
    queried.
  -->
  <luceneMatchVersion>LUCENE_31</luceneMatchVersion>
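
For illustration, a field type in schema.xml that picks up the tokenizer
described above might look roughly like the sketch below. The field type name
"text_uax29" and the lowercase filter are assumptions made for the example,
not something taken from this thread; the tokenizer behavior still depends on
the luceneMatchVersion setting in solrconfig.xml.

```xml
<!-- Hypothetical field type; the name "text_uax29" is illustrative only. -->
<fieldType name="text_uax29" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- With luceneMatchVersion set to LUCENE_31 this should give UAX#29
         word breaking plus URL and IP address recognition. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Optional: lowercase tokens at index and query time. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

As the comment above notes, a full re-index is advisable after switching the
version, since tokens produced by the old and new tokenizers may not match.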

But if you want the pure UAX#29 Tokenizer without this, there isn't a
factory. Also if you want customization/supplementary character
support, there is no factory for ICUTokenizer at the moment.

> If so, should I open a JIRA issue or two JIRA issues so the filter factories 
> can be contributed to the Solr code base?

Please open issues for a factory for the pure UAX#29 Tokenizer, and
for the ICU factories (maybe we can just put this into a contrib for
now?) !
