Re: Multi-language indexing and searching

Yonik Seeley Tue, 12 Jun 2007 08:30:51 -0700

On 6/12/07, Teruhiko Kurosaka <[EMAIL PROTECTED]> wrote:

> For bi-lingual or tri-lingual search, we can have parallel fields
> (title_en, title_fr, title_de, for example) but this wouldn't scale well.


Due to search across multiple fields, or due to increased index size?
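
To make the parallel-field approach concrete, here is a minimal sketch in Python. The helper names (to_parallel_fields, expand_query) are hypothetical and are not part of Lucene or Solr; the field names follow the title_en / title_fr / title_de example above. It shows that both the number of fields and the number of query clauses grow linearly with the number of languages.

```python
# Hypothetical helpers for illustration only; not a Lucene or Solr API.
LANGS = ["en", "fr", "de"]

def to_parallel_fields(doc_id, title, lang):
    """Store the title in a language-specific field such as title_en."""
    return {"id": doc_id, f"title_{lang}": title}

def expand_query(term, langs=LANGS):
    """Build a Lucene-style query string that ORs the term across every title_* field."""
    return " OR ".join(f"title_{lang}:{term}" for lang in langs)

doc = to_parallel_fields("1", "Bonjour Paris", "fr")
query = expand_query("paris")
# One clause per language: with 50 languages the query has 50 clauses.
```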

> Lucene and Solr require that the language be known before an Analyzer
> can be instantiated, and it's the Analyzer that detects the language in
> my design....  A second obstacle is that the kinds of Filters the
> Analyzer uses depend on the language, so they must be changed
> dynamically. This could be done programmatically, but it's not easy.
> My big hope is that we can work together to come up with some way for
> the language detected within the Analyzer to be retrieved and put into
> a field.


Something could be done for the indexing side of things, but then how
do you query?
Would you be able to do language detection on single-word queries, or
do you apply multiple analyzers and query the same field multiple ways
(which seems very close to the multiple field approach)?
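
As a rough sketch of the "apply multiple analyzers and query the same field multiple ways" option, the Python below runs one query string through several analyzers and ORs the distinct tokens against a single field. The two analyzers are toy stand-ins (assumptions for illustration, not real Lucene analyzers), and the function names are hypothetical.

```python
# Toy analyzers for illustration; real language analyzers do far more.
def analyze_plain(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()

def analyze_fold_umlauts(text):
    """Lowercase, expand German umlauts and sharp s, then split on whitespace."""
    table = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})
    return text.lower().translate(table).split()

def multi_analyzer_query(field, text, analyzers):
    """OR together the distinct tokens produced by every analyzer, all on one field."""
    tokens = []
    for analyze in analyzers:
        for tok in analyze(text):
            if tok not in tokens:  # keep first-seen order, drop duplicates
                tokens.append(tok)
    return " OR ".join(f"{field}:{tok}" for tok in tokens)

q = multi_analyzer_query("title", "Müller", [analyze_plain, analyze_fold_umlauts])
```

The clause count here grows with the number of analyzers rather than the number of fields, which is why the two approaches end up looking similar at query time.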

Also, would multiple languages in a single field perhaps cause idf skew?
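
A small worked example of the idf concern, assuming the classic Lucene formula idf = 1 + ln(numDocs / (docFreq + 1)). The document counts are invented purely for illustration: a word that is very common in a 1000-document German collection looks much rarer once 9000 documents in another language share the same field.

```python
import math

def classic_idf(doc_freq, num_docs):
    """Classic Lucene-style idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Invented counts: the term occurs in 900 of 1000 German documents...
german_only = classic_idf(doc_freq=900, num_docs=1000)

# ...and in only 50 of the 9000 other-language documents sharing the field.
mixed = classic_idf(doc_freq=900 + 50, num_docs=1000 + 9000)

# The same term gets a noticeably higher weight in the mixed index.
skew = mixed - german_only
```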

50 languages is a lot... perhaps a simple analyzer that could just try
to break into words and lowercase?
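
A minimal sketch of such a language-neutral analyzer, written in Python rather than as a Lucene Analyzer; the function name simple_analyze is hypothetical. It only splits on non-word characters and lowercases, with no stemming or stop words. Note the assumption that words are delimited: scripts written without spaces between words would come out as long unsegmented runs.

```python
import re

def simple_analyze(text):
    """Lowercase the text and split it into runs of Unicode word characters."""
    return re.findall(r"\w+", text.lower())

tokens = simple_analyze("Hello, Wörld! Bonjour le monde.")
```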


-Yonik
