On 2010-09-22 15:30, Bernd Fehling wrote:
Actually, this is one of the biggest disadvantages of Solr for multilingual
content.
Solr is field-based, which means you have to know the language _before_
you feed the content into a specific field and process it for that field.
This results in separate fields for each language.
E.g. for Europe that means 24 to 26 languages for each title, keyword,
description, ...
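Just to illustrate with SolrJ what the field-per-language approach means on
the indexing side (the title_en / title_de style field names are only an
assumed schema convention, nothing Solr gives you by itself):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PerLanguageFieldIndexer {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");

        // The language must be known _before_ indexing, because it decides
        // which field (and therefore which analyzer chain) gets the text.
        String lang = "de";  // detected or supplied upstream
        doc.addField("title_" + lang, "Einführung in die Informationssuche");

        server.add(doc);
        server.commit();
    }
}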
I guess when they started with Lucene/Solr they never had multilingual
content in mind.
The alternative is to have a separate index (core) for each language.
Here, too, you have to know the language of the content _before_ feeding
it to the core.
E.g. again for Europe you end up with 24 to 26 cores.
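The core-per-language variant looks almost the same from the client side,
only the target URL changes (core names like core_en, core_de are again
just an assumed convention):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class CorePerLanguage {
    // One Solr core per language, e.g. /solr/core_en, /solr/core_de, ...
    public static SolrServer coreFor(String lang) throws Exception {
        return new CommonsHttpSolrServer("http://localhost:8983/solr/core_" + lang);
    }
}

The rest of the indexing code is the same as in the sketch above; you just
pick the core with coreFor(lang) instead of suffixing the field names.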
Another option is to treat the multilingual fields (title, keywords,
description, ...) as a "subdocument": write a filter class as a
subpipeline, run language and encoding detection as the first step in
that pipeline, then do all the other linguistic processing within the
pipeline and return the processed content to the field for further
filtering and storing.
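Roughly, such a subpipeline could be hooked in as a custom
UpdateRequestProcessor. This is only a sketch: the detectLanguage() helper
is a placeholder for whatever detector you plug in, and the title_<lang>
naming is the same assumption as above.

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class LanguageRoutingProcessor extends UpdateRequestProcessor {

    public LanguageRoutingProcessor(UpdateRequestProcessor next) {
        super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object title = doc.getFieldValue("title");
        if (title != null) {
            // First step of the subpipeline: language (and encoding) detection.
            String lang = detectLanguage(title.toString());
            // Move the content into a language-specific field so that the
            // matching analyzer chain (stemming, stopwords, ...) is applied.
            doc.addField("title_" + lang, title);
            doc.removeField("title");
        }
        super.processAdd(cmd);
    }

    // Placeholder - plug in a real language detector here.
    private String detectLanguage(String text) {
        return "en";
    }
}

In a real setup this would be created by an UpdateRequestProcessorFactory
and wired into the update chain in solrconfig.xml.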
Many solutions, but nothing out of the box :-)
Take a look at SOLR-1536; it contains an example of a tokenizing chain
that could use a language detector to create different fields (or
tokenize differently) based on this decision.
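Not the SOLR-1536 code itself, but to sketch the "tokenize differently"
part: once content ends up in language-suffixed fields, a per-field
analyzer wrapper can apply a different chain per language (field names and
analyzer choices are assumptions, and a recent Lucene API is shown; the
3.x class names differ slightly):

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class LanguageAwareAnalyzer {
    public static Analyzer build() {
        Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
        // Each language-specific field gets its own stemming/stopword chain.
        perField.put("title_en", new EnglishAnalyzer());
        perField.put("title_de", new GermanAnalyzer());
        // Fields without a dedicated chain fall back to StandardAnalyzer.
        return new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
    }
}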
--
Best regards,
Andrzej Bialecki <><
  ___. ___ ___ ___ _ _   __________________________________
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com