Hi Yonik,

> On 6/12/07, Teruhiko Kurosaka <[EMAIL PROTECTED]> wrote:
> > For bi-lingual
> > or tri-lingual search, we can have parallel fields (title_en,
> > title_fr, title_de, for example) but this wouldn't scale well.
>
> Due to search across multiple fields, or due to increased index size?

Due to the proliferation of fields. Say we want the field "title"
to hold the title of the book in its original language. But because
Solr implicitly assumes one language per field, we would have to
have the artificial fields title_fr, title_de, title_en, title_es,
and so on, one for each supported language, only one of which has a
real value per document. This sounds silly, doesn't it?

> Something could be done for the indexing side of things, but
> then how do you query?
> Would you be able to do language detection on single word
> queries, or do you apply multiple analyzers and query the
> same field multiple ways (which seems very close to the
> multiple field approach)?

You are right that language auto-detection does not work on
queries. The search user would have to specify the language
somehow. One commercial search engine vendor does this by
prefixing a query term with "$lang=en ". I would do this with a
drop-down list. Each user or session would have a configurable
default language.

> Also, would multiple languages in a single field perhaps
> cause idf skew?

Sorry, I don't know enough about the internals of search engines
to discuss this.

> 50 languages is a lot... perhaps a simple analyzer that could
> just try to break into words and lowercase?

This won't work because:
(1) The concept of lowercase doesn't apply to all languages.
(2) Even among languages that use Latin script, there can be
different normalization rules. For many European languages,
accent marks can be dropped ("ü" becomes "u"), but for German,
"ü" may better be mapped to "ue", which is the alternative
spelling of "ü" in German writing.
(3) Some languages, such as Chinese and Japanese, do not even use
spaces or other delimiters to indicate word boundaries.
Language-specific rules have to be applied just to extract
words from a run of text.

-kuro
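
To illustrate point (2) above, here is a minimal sketch in Python of
why one normalization rule cannot serve every Latin-script language.
The function names and the German mapping table are hypothetical and
only for illustration; this is not how Solr or any particular
analyzer implements it.

```python
import unicodedata

# Hypothetical German-specific spellings; other languages would need
# their own tables.
GERMAN_MAP = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def fold_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition ("ü" -> "u")."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text: str, lang: str) -> str:
    """Normalize a term using rules chosen by an explicit language code."""
    # Compose first so that precomposed letters such as "ü" match the table.
    text = unicodedata.normalize("NFC", text).lower()
    if lang == "de":
        text = text.translate(GERMAN_MAP)
    # Generic fallback: strip any remaining accent marks.
    return fold_accents(text)


print(normalize("Müller", "de"))
print(normalize("Müller", "fr"))
```

The same input yields different index terms depending on the
language code, which is why the language has to be known (or
supplied by the user) at both index time and query time.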