On 6/12/07, Teruhiko Kurosaka <[EMAIL PROTECTED]> wrote:
For bi-lingual or tri-lingual search, we can have parallel fields (title_en, title_fr, title_de, for example) but this wouldn't scale well.
Due to search across multiple fields, or due to increased index size?
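To make the first possibility concrete: with parallel fields, every user term expands into one clause per language field, so the query grows linearly with the number of languages. A minimal sketch, assuming Lucene's standard `field:term` query-parser syntax; the class `ParallelFieldQuery` and the helper `expand` are hypothetical names used only for illustration, and no escaping of special characters is attempted.

```java
import java.util.List;
import java.util.stream.Collectors;

class ParallelFieldQuery {
    // Expand one user term into an OR over every per-language field.
    // Field names follow the title_en / title_fr / title_de pattern from the thread.
    // Assumption: the term contains no query-syntax characters that need escaping.
    static String expand(String baseField, List<String> languages, String term) {
        return languages.stream()
                .map(lang -> baseField + "_" + lang + ":" + term)
                .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        // prints: title_en:paris OR title_fr:paris OR title_de:paris
        System.out.println(expand("title", List.of("en", "fr", "de"), "paris"));
    }
}
```

With 3 languages this is 3 clauses per term; with 50 languages it would be 50 clauses per term, which is one way the approach could stop scaling.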
Lucene and Solr require that the language be known before an Analyzer can be instantiated, and it's the Analyzer that detects the language in my design.... A second obstacle is that the kinds of Filters the Analyzer uses depend on the language, so they must be changed dynamically. This could be done programmatically but it's not easy. My big hope is that we can work together to come up with some way so that the language detected within the Analyzer can somehow be retrieved and put into a field.
Something could be done for the indexing side of things, but then how do you query? Would you be able to do language detection on single word queries, or do you apply multiple analyzers and query the same field multiple ways (which seems very close to the multiple field approach)?

Also, would multiple languages in a single field perhaps cause idf skew?

50 languages is a lot... perhaps a simple analyzer that could just try to break into words and lowercase?

-Yonik
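
As an illustration of the "break into words and lowercase" idea, here is a minimal sketch that uses only the JDK's `java.text.BreakIterator` rather than Lucene's Analyzer API. The class name `SimpleTokenizer` and the method `tokenize` are hypothetical, and the choice of `Locale.ROOT` is an assumption made to keep the behaviour language-neutral. It is not expected to segment languages written without spaces between words particularly well.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SimpleTokenizer {
    // Split text at word boundaries and lowercase each word.
    // No stemming, stop words, or language detection is applied.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String piece = text.substring(start, end);
            // Keep only segments containing a letter or digit;
            // this drops the whitespace and punctuation segments.
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, World! Guten Tag"));
    }
}
```

The same tokenizer would be applied at both index time and query time, which sidesteps the question of detecting the language of a single-word query, at the cost of losing language-specific stemming.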