Hi Wunder,

Yes, we do expect incoming documents to contain Chinese/Japanese/Arabic
text.

From what you have mentioned, it sounds like we need to auto-detect the
language of the incoming content and then tokenize/filter accordingly.
But I thought the ICU tokenizer was capable of that
(https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-ICUTokenizer):
"This tokenizer processes multilingual text and tokenizes it appropriately
based on its script attribute."
Or am I missing something?
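For context, this is roughly the field type I had in mind; a rough sketch only, assuming the analysis-extras contrib (the ICU analysis jars) is on the classpath, and the field type name and the folding filter are my own illustrative choices:

```xml
<!-- Sketch of an ICU-based field type for mixed-script text.
     Requires the ICU analysis jars (analysis-extras contrib). -->
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Segments text per script (UAX #29 word breaks; dictionary-based
         segmentation for CJK scripts) -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Unicode-aware case folding and diacritic normalization -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```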

Thanks,
Rishi.

-----Original Message-----
From: Walter Underwood <wun...@wunderwood.org>
To: solr-user <solr-user@lucene.apache.org>
Sent: Mon, Feb 23, 2015 11:17 pm
Subject: Re: Basic Multilingual search capability


It isn’t just complicated; it can be impossible.

Do you have content in Chinese or Japanese? Those languages (and some others)
do not separate words with spaces. You cannot even do word search without a
language-specific, dictionary-based parser.

German is space separated, except many noun compounds are not space-separated.

Do you have Finnish content? Entire prepositional phrases turn into word
endings.

Do you have Arabic content? That is even harder.

If all your content is in space-separated languages that are not heavily
inflected, you can kind of do OK with a language-insensitive approach. But it
hits the wall pretty fast.

One thing that does work pretty well is trademarked names (LaserJet, Coke,
etc.). Those are spelled the same in all languages and usually not inflected.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On Feb 23, 2015, at 8:00 PM, Rishi Easwaran <rishi.easwa...@aol.com> wrote:

> Hi Alex,
> 
> There is no specific language list.
> For example: the documents that need to be indexed are emails or other
> messages for a global customer base. The messages back and forth could be in
> any language or a mix of languages.
> 
> I understand relevancy, stemming, etc. become extremely complicated with
> multilingual support, but our first goal is to be able to tokenize and
> provide basic search capability for any language. For example: when a
> document contains hello or здравствуйте, the analyzer creates tokens and
> provides exact-match search results.
> 
> Now it would be great if it had the capability to tokenize email addresses
> (ex: he...@aol.com - I think StandardTokenizer already does this) and
> filenames (здравствуйте.pdf), but maybe we can use filters to accomplish
> that.
> 
> Thanks,
> Rishi.
> 
> -----Original Message-----
> From: Alexandre Rafalovitch <arafa...@gmail.com>
> To: solr-user <solr-user@lucene.apache.org>
> Sent: Mon, Feb 23, 2015 5:49 pm
> Subject: Re: Basic Multilingual search capability
> 
> 
> Which languages are you expecting to deal with? Multilingual support
> is a complex issue. Even if you think you don't need much, it is
> usually a lot more complex than expected, especially around relevancy.
> 
> Regards,
>   Alex.
> ----
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
> 
> 
> On 23 February 2015 at 16:19, Rishi Easwaran <rishi.easwa...@aol.com> wrote:
>> Hi All,
>> 
>> For our use case we don't really need to do a lot of manipulation of
>> incoming text during index time. At most removal of common stop words, and
>> tokenizing emails/filenames etc. if possible. We get text documents from
>> our end users, which can be in any language (sometimes a combination), and
>> we cannot determine the language of the incoming text. Language detection
>> at index time is not necessary.
>>
>> Which analyzer is recommended to achieve basic multilingual search
>> capability for a use case like this?
>> I have read a bunch of posts about using a combination of StandardTokenizer
>> or ICUTokenizer, LowerCaseFilter, and ReversedWildcardFilterFactory, but I
>> am looking for ideas, suggestions, and best practices.
>> 
>> http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
>> http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
>> https://issues.apache.org/jira/browse/SOLR-6492
>> 
>> 
>> Thanks,
>> Rishi.
>> 
> 
> 

