RE: Multi-language indexing and searching

Teruhiko Kurosaka Tue, 12 Jun 2007 10:53:53 -0700

Hi Yonik,
> On 6/12/07, Teruhiko Kurosaka <[EMAIL PROTECTED]> wrote:
> > For bi-lingual
> > or tri-lingual search, we can have parallel fields (title_en, 
> > title_fr, title_de, for example) but this wouldn't scale well.
> 
> Due to search across multiple fields, or due to increased index size?


Due to the proliferation of fields.  Say we want
the field "title" to hold the title of the book in
its original language.  But because Solr has this implicit
assumption of one language per field, we would have to have
the artificial fields title_fr, title_de, title_en, title_es,
and so on, one for each supported language, only one of
which has a real value per document.  This sounds silly, doesn't it?
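
To make the problem concrete, here is a rough sketch in Python (not
Solr code).  The language list and the helper name to_parallel_fields
are made up for illustration; the title_<lang> naming follows the
example above.

```python
# Hypothetical list of supported languages; a real deployment might have 50.
SUPPORTED_LANGUAGES = ["en", "fr", "de", "es"]

def to_parallel_fields(title, lang):
    """Spread one title over per-language fields (title_en, title_fr, ...).

    Only the field matching the document's language gets a value;
    every other field stays empty (None).
    """
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {lang}")
    return {
        f"title_{code}": (title if code == lang else None)
        for code in SUPPORTED_LANGUAGES
    }

fields = to_parallel_fields("Le Petit Prince", "fr")
for name, value in fields.items():
    print(name, "=", value)
```

Every document carries one field per supported language, yet only
one of them is ever filled in.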



> Something could be done for the indexing side of things, but 
> then how do you query?
> Would you be able to do language detection on single word 
> queries, or do you apply multiple analyzers and query the 
> same field multiple ways (which seems very close to the 
> multiple field approach)?

You are right that language auto-detection does not
work on queries. The search user would have to specify the
language somehow.  One commercial search engine vendor
does this by prefixing a query term with "$lang=en ".
I would do this with a drop-down list.  Each user or session
would have a default language that is configurable.
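
A minimal sketch of how such a query could be split into a language
and the actual search terms, falling back to the user's default
language.  This is only my guess at the idea: I do not know the
vendor's exact syntax beyond the "$lang=en " prefix, and the function
name parse_query and the regular expression are assumptions.

```python
import re

# Assumed prefix syntax: "$lang=<code> " at the very start of the query.
LANG_PREFIX = re.compile(r"^\$lang=([A-Za-z-]+)\s+(.*)$")

def parse_query(raw_query, default_lang):
    """Return (language, query text) for a raw user query.

    If the query starts with a "$lang=xx " prefix, that language wins;
    otherwise the user's or session's default language is used.
    """
    match = LANG_PREFIX.match(raw_query)
    if match:
        return match.group(1).lower(), match.group(2)
    return default_lang, raw_query

print(parse_query("$lang=en search engine", "ja"))
print(parse_query("search engine", "ja"))
```

A drop-down list would simply supply the language value directly
instead of parsing it out of the query string.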



> Also, would multiple languages in a single field perhaps 
> cause idf skew?

Sorry, I don't know enough about the internals of search engines
to discuss this.


> 50 languages is a lot... perhaps a simple analyzer that could 
> just try to break into words and lowercase?

This won't work because:
(1) The concept of lowercase doesn't apply to all languages.
(2) Even among languages that use Latin script,
    there can be different normalization rules.  For many
    European languages, accent marks can be dropped ("ü" becomes
    "u"), but for German, "ü" may be better mapped to "ue",
    which is the alternative spelling of "ü" in German
    writing.
(3) Some languages such as Chinese and Japanese do not
    even use spaces or other delimiters to mark word
    boundaries.  Language-specific rules have to be applied
    just to extract words from the run of text.
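
Points (2) and (3) can be illustrated with a small Python sketch.
This is not how any particular analyzer is implemented; the function
names and the German mapping table are my own assumptions, and only
the standard unicodedata module is used.

```python
import unicodedata

# Assumed German-specific spellings; applied only when lang == "de".
GERMAN_MAP = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def fold_accents(text):
    """Drop combining accent marks, so that "ü" becomes "u"."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(text, lang):
    """Lowercase, then apply language-dependent accent handling."""
    text = unicodedata.normalize("NFC", text.lower())
    if lang == "de":
        # German: map umlauts to their alternative spellings first.
        text = "".join(GERMAN_MAP.get(ch, ch) for ch in text)
    return fold_accents(text)

def whitespace_tokens(text):
    """Naive tokenizer: split on whitespace only."""
    return text.split()

print(normalize("Müller", "de"))   # German rule
print(normalize("Müller", "fr"))   # generic accent dropping
print(whitespace_tokens("東京都に住んでいます"))  # whole sentence stays one token
```

The same input string needs a different normalized form depending on
the language, and the whitespace tokenizer returns the whole Japanese
sentence as a single "word".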

-kuro
