On 2010-09-22 15:30, Bernd Fehling wrote:
Actually, this is one of the biggest disadvantages of Solr for multilingual
content.
Solr is field-based, which means you have to know the language _before_
you feed the content into a specific field and process it for that field.
This results in separate fields for each language.
E.g. for Europe that means 24 to 26 languages for each title, keyword,
description, ...
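Just to illustrate with SolrJ what the field-per-language approach means on
the indexing side (the title_en / title_de style field names are only an
assumed schema convention, nothing Solr gives you by itself):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PerLanguageFieldIndexer {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");

        // The language must be known _before_ indexing, because it decides
        // which field (and therefore which analyzer chain) gets the text.
        String lang = "de";  // detected or supplied upstream
        doc.addField("title_" + lang, "Einführung in die Informationssuche");

        server.add(doc);
        server.commit();
    }
}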
I guess when they started with Lucene/Solr they never had multilingual
content in mind.
The alternative is to have a separate index (core) for each language.
Here, too, you have to know the language of the content _before_ feeding
it to the core.
E.g. again for Europe you end up with 24 to 26 cores.
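The core-per-language variant looks almost the same from the client side,
only the target URL changes (core names like core_en, core_de are again
just an assumed convention):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class CorePerLanguage {
    // One Solr core per language, e.g. /solr/core_en, /solr/core_de, ...
    public static SolrServer coreFor(String lang) throws Exception {
        return new CommonsHttpSolrServer("http://localhost:8983/solr/core_" + lang);
    }
}

The rest of the indexing code is the same as in the sketch above; you just
pick the core with coreFor(lang) instead of suffixing the field names.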
Another option is to treat the multilingual fields (title, keywords,
description, ...) as a "subdocument": write a filter class as a
subpipeline, run language and encoding detection as the first step in
that pipeline, then do all the other linguistic processing within the
pipeline and return the processed content to the field for further
filtering and storing.
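Roughly, such a subpipeline could be hooked in as a custom
UpdateRequestProcessor. This is only a sketch: the detectLanguage() helper
is a placeholder for whatever detector you plug in, and the title_<lang>
naming is the same assumption as above.

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class LanguageRoutingProcessor extends UpdateRequestProcessor {

    public LanguageRoutingProcessor(UpdateRequestProcessor next) {
        super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object title = doc.getFieldValue("title");
        if (title != null) {
            // First step of the subpipeline: language (and encoding) detection.
            String lang = detectLanguage(title.toString());
            // Move the content into a language-specific field so that the
            // matching analyzer chain (stemming, stopwords, ...) is applied.
            doc.addField("title_" + lang, title);
            doc.removeField("title");
        }
        super.processAdd(cmd);
    }

    // Placeholder - plug in a real language detector here.
    private String detectLanguage(String text) {
        return "en";
    }
}

In a real setup this would be created by an UpdateRequestProcessorFactory
and wired into the update chain in solrconfig.xml.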
Many solutions, but nothing out of the box :-)
Take a look at SOLR-1536; it contains an example of a tokenizing chain
that could use a language detector to create different fields (or
tokenize differently) based on this decision.
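Not the SOLR-1536 code itself, but to sketch the "tokenize differently"
part: once content ends up in language-suffixed fields, a per-field
analyzer wrapper can apply a different chain per language (field names and
analyzer choices are assumptions, and a recent Lucene API is shown; the
3.x class names differ slightly):

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class LanguageAwareAnalyzer {
    public static Analyzer build() {
        Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
        // Each language-specific field gets its own stemming/stopword chain.
        perField.put("title_en", new EnglishAnalyzer());
        perField.put("title_de", new GermanAnalyzer());
        // Fields without a dedicated chain fall back to StandardAnalyzer.
        return new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
    }
}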
--
Best regards,
Andrzej Bialecki <><
  ___. ___ ___ ___ _ _   __________________________________
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com