Hi Alex,

Thanks for the suggestions. These steps will definitely help with our use
case. The LengthFilter idea to protect our system is especially useful.

Thanks,
Rishi.

-----Original Message-----
From: Alexandre Rafalovitch <arafa...@gmail.com>
To: solr-user <solr-user@lucene.apache.org>
Sent: Tue, Feb 24, 2015 8:50 am
Subject: Re: Basic Multilingual search capability


Given the limited needs, I would probably do something like this:

1) Put a language identifier in the UpdateRequestProcessor chain
during indexing and route at least the known problematic languages,
such as Chinese, Japanese, and Arabic, into individual fields
2) Put everything else together into one field with ICUTokenizer,
maybe also ICUFoldingFilter
3) At the very end of that joint field's analyzer chain, stick in a
LengthFilter with some high number, e.g. 25 characters max. This will
ensure that super-long words from non-space languages and edge
conditions do not break the rest of your system. (A rough
configuration sketch follows below.)
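
To make this concrete, here is a minimal, untested sketch of the two
pieces in Solr's config files. The field name "content", the chain
name "langid", the language whitelist, and the fieldType name are all
placeholders for your own schema, and the ICU factories need the
analysis-extras contrib jars on the classpath:

  <!-- solrconfig.xml (step 1): detect the language at index time.
       With langid.map=true, documents in whitelisted languages get
       their content routed into per-language fields such as
       content_zh; everything else falls back to the joint field. -->
  <updateRequestProcessorChain name="langid">
    <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
      <str name="langid.fl">content</str>
      <str name="langid.langField">language_s</str>
      <str name="langid.whitelist">zh,ja,ar</str>
      <str name="langid.map">true</str>
      <str name="langid.fallback">general</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

  <!-- schema.xml (steps 2 and 3): the joint field type for all
       languages that were not routed out; LengthFilter drops any
       token longer than 25 characters. -->
  <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.LengthFilterFactory" min="1" max="25"/>
    </analyzer>
  </fieldType>

With langid.map on, you would then search the per-language fields for
the routed languages and the joint ICU field for everything else.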


Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 23 February 2015 at 23:14, Walter Underwood <wun...@wunderwood.org> wrote:
>> I understand relevancy, stemming, etc. become extremely complicated
>> with multilingual support, but our first goal is to be able to
>> tokenize and provide basic search capability for any language. Ex:
>> when the document contains hello or здравствуйте, the analyzer
>> creates tokens and provides exact-match search results.
