RE: Multi-language indexing and searching

Teruhiko Kurosaka Tue, 12 Jun 2007 08:06:04 -0700

Daniel,
I was reading your email and responses to it with great 
interest.

I was aware that Solr implicitly assumes that each field 
is mono-lingual across the whole system. But your mail and
the replies to it made me wonder whether this limitation 
is practical for multi-lingual search applications.  For bi-lingual 
or tri-lingual search, we can have parallel fields (title_en, 
title_fr, title_de, for example), but this wouldn't scale well
to many languages.
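
To make the parallel-field approach concrete, here is a minimal 
sketch in plain Python. It is not Solr or Lucene code; the helper 
names and the SUPPORTED list are my own assumptions for illustration. 
It shows that every value must be routed to a field named after its 
language, and that every query must cover one field per language.

```python
# Hypothetical helpers, not part of Solr or Lucene.
# Languages that have a parallel field defined in the schema (assumed).
SUPPORTED = ("en", "fr", "de")


def to_parallel_fields(base, text, lang, supported=SUPPORTED):
    """Route a value to the per-language field, e.g. title_fr."""
    if lang not in supported:
        # A new language means a schema change, which is the scaling problem.
        raise ValueError("no parallel field defined for language: " + lang)
    return {base + "_" + lang: text}


def query_fields(base, supported=SUPPORTED):
    """List every field a query has to search to cover all languages."""
    return [base + "_" + lang for lang in supported]


doc = to_parallel_fields("title", "Le Petit Prince", "fr")
fields = query_fields("title")
```

With three languages a query touches three fields; with 50 or more 
languages the schema and every query grow in the same proportion.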


Suppose we are building a search application for a multi-lingual 
library at a university in Japan.  The application would have 
a book title field in Japanese, perhaps another title field 
in English for visiting scholars, and a title field in the 
original language.  The language of that last field would 
vary among more than 50 modern languages (and not-so-modern 
languages like Latin).  Solr may need some rearchitecting 
in this area.

I work for a company called Basis Technology 
(www.basistech.com), which develops a suite of language 
processing software, and I've written a module to integrate 
this with Solr (and Lucene in general).  The module 
consists of a universal Tokenizer and Analyzers for English and 
Japanese, but they can be modified easily to handle any of
the 16 languages we can handle. (Source code is provided.)

When I was developing this module, I thought of writing 
a super Analyzer that automatically detects the language 
and does the right thing.  But I've found this won't fit 
well with the design of Lucene and Solr.  For one thing, 
there is no way to save the detected language in the field 
if the language is detected within the Analyzer.  Lucene and Solr 
require that the language be known before an Analyzer can be 
instantiated, yet in my design it's the Analyzer that detects 
the language.  A second obstacle is that the kinds of Filters
the Analyzer uses depend on the language, so the filter chain 
must be changed dynamically. This could be done programmatically, but
it's not easy.  My big hope is that we can work together to 
come up with some way for the language detected within 
the Analyzer to be retrieved and stored in a field.
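
One possible direction is to run detection before analysis, so that 
the language can be stored in its own field and used to pick the 
filter chain. The following is only a sketch in plain Python under 
that assumption. None of the names are Lucene or Solr APIs; the 
detector is a toy that only checks for Japanese script, and the 
bigram function is a stand-in for a real Japanese tokenizer.

```python
import re


def detect_language(text):
    """Toy detector (assumption): 'ja' if any kana or CJK ideograph, else 'en'."""
    for ch in text:
        if "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff":
            return "ja"
    return "en"


def analyze_en(text):
    """Stand-in English chain: split on non-alphanumerics, then lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


def analyze_ja(text):
    """Stand-in Japanese chain: overlapping character bigrams."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return chars
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


# The chain is chosen per value, after detection, not fixed per field.
ANALYZERS = {"en": analyze_en, "ja": analyze_ja}


def index_title(text):
    """Detect first, keep the language as its own field, then analyze."""
    lang = detect_language(text)
    return {"title_lang": lang, "title_tokens": ANALYZERS[lang](text)}


english = index_title("The Tale of Genji")
japanese = index_title("源氏物語")
```

Because detection happens outside the analysis step, the detected 
language is available to store alongside the tokens, which is exactly 
what an Analyzer that detects internally cannot offer.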

Anyway, if you are interested in trying my multi-lingual 
Analyzers, please contact me in private email.

Regards,
-kuro
