Actually, this is one of the biggest disadvantages of Solr for multilingual
content.
Solr is field based, which means you have to know the language _before_ you
feed the content to a specific field, because each field processes its content
with a fixed analyzer.
This results in separate fields for each language.
E.g. for Europe that means 24 to 26 languages for each of title, keywords,
description, ...
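
Just to illustrate, a minimal SolrJ sketch of that first approach (the core
URL, the "language" field and the title_en/title_de/... field names are only
placeholders, not taken from a real setup). The point is simply that the
client has to pick the language specific field before indexing:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class PerLanguageFieldIndexer {
        public static void main(String[] args) throws Exception {
            SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/docs").build();

            String lang = "de";                 // already known, e.g. from metadata
            String title = "Ein Beispieltitel";

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "42");
            doc.addField("language", lang);
            // The language decides which field (and therefore which analyzer)
            // gets the content: title_en -> English chain, title_de -> German chain.
            doc.addField("title_" + lang, title);

            solr.add(doc);
            solr.commit();
            solr.close();
        }
    }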

I guess when they started with Lucene/Solr they never had multilingual content
in mind.

The alternative is to have a separate index (core) for each language.
In that case you also have to know the language of the content _before_
feeding it to the right core.
E.g. again for Europe you end up with 24 to 26 cores.
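
With a core per language the routing decision just moves from the field name
to the core name; again only a sketch with made-up core names (docs_en,
docs_de, docs_fr, ...):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class PerLanguageCoreIndexer {
        public static void main(String[] args) throws Exception {
            String lang = "fr";   // must be known before feeding the core
            // One core per language: docs_en, docs_de, docs_fr, ...
            SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/docs_" + lang).build();

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "42");
            // Plain "title" field; the analyzer is configured per core.
            doc.addField("title", "Un titre d'exemple");

            solr.add(doc);
            solr.commit();
            solr.close();
        }
    }

At query time you have the same problem in reverse: you either have to know
the language of the query or search across all cores.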

Another option is to treat the multilingual fields (title, keywords,
description, ...) as a "subdocument": write a filter class as a subpipeline,
run language and encoding detection as the first step in that pipeline,
do all further linguistic processing within that pipeline, and return the
processed content to the field for further filtering and storing.
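
One way to sketch that idea in Solr terms is an UpdateRequestProcessor that
runs language detection on the incoming content (here via Tika's
LanguageIdentifier, just as an example; encoding detection is left out) and
then hands the content to a language specific field, whose analyzer does the
rest of the linguistic processing. The field and class names are made up; it
is only a rough sketch of the idea, not a finished implementation:

    import java.io.IOException;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;
    import org.apache.tika.language.LanguageIdentifier;

    public class LanguageRoutingProcessorFactory extends UpdateRequestProcessorFactory {

        @Override
        public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                SolrQueryResponse rsp, UpdateRequestProcessor next) {
            return new LanguageRoutingProcessor(next);
        }

        static class LanguageRoutingProcessor extends UpdateRequestProcessor {

            LanguageRoutingProcessor(UpdateRequestProcessor next) {
                super(next);
            }

            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                // "title" stands for any of the multilingual fields; keywords,
                // description, ... would be handled the same way.
                Object value = doc.getFieldValue("title");
                if (value != null) {
                    String text = value.toString();
                    // First step of the subpipeline: detect the language.
                    String lang = new LanguageIdentifier(text).getLanguage();
                    // Hand the content to the language specific field so that
                    // its analyzer does the further processing and storing.
                    doc.addField("title_" + lang, text);
                    doc.setField("language", lang);
                    doc.removeField("title");
                }
                super.processAdd(cmd);
            }
        }
    }

Such a processor would be hooked into an update processor chain in
solrconfig.xml so it runs before the documents reach the analysis chain.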

Many solutions, but nothing out of the box :-)

Bernd

On 22.09.2010 12:01, Andy wrote:
> I have documents that are in different languages. There's a field in the 
> documents specifying what language it's in.
> 
> Is it possible to index the documents such that based on what language a 
> document is in, a different analyzer will be used on that document?
> 
> What is the "normal" way to handle documents in different languages?
> 
> Thanks
> Andy
