The most common way to handle this is to index into language-specific fields, e.g. text_es, text_en, text_de. Since you know what language the user is searching in, you can route the queries to the correct set of fields.
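To make the routing idea concrete, here is a rough sketch in plain Java. The class name LanguageFieldRouter and the exact field names are just placeholders I made up for illustration; your schema will have its own names, and a real client would also escape the user's term before building the query.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: maps a known search language to its per-language field. */
final class LanguageFieldRouter {
    // Field names follow a text_<lang> pattern; the exact names are assumptions.
    private static final Map<String, String> FIELD_BY_LANGUAGE = Map.of(
            "en", "text_en",
            "de", "text_de",
            "fr", "text_fr",
            "es", "text_es");

    private LanguageFieldRouter() {}

    /** Returns the field to query for the given language code (case-insensitive). */
    static String fieldFor(String language) {
        String field = FIELD_BY_LANGUAGE.get(language.toLowerCase(Locale.ROOT));
        if (field == null) {
            throw new IllegalArgumentException("No field configured for language: " + language);
        }
        return field;
    }

    /** Builds a fielded query such as text_de:haus. No query-syntax escaping is attempted. */
    static String fieldedQuery(String language, String term) {
        return fieldFor(language) + ":" + term;
    }
}
```

With something like this, the language the user selected picks the field, and each field can carry its own analyzer and stemmer in the schema.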
That said, this is an interesting approach. You don't necessarily need two different fields, since you could simply remove the language markers from the stored data before displaying it. Still, I think your approach is better, since there are fewer places where someone could forget to strip the marker. And there's no penalty for having two fields, one stored and not indexed and one indexed and not stored. So I'd use the approach you outlined.

However, I believe that the call to create in your code only happens once, so I don't see how you get different versions of the stemmer for different documents. I might be wrong here, but have you checked?

You'd probably want to incorporate the ASCIIFoldingFilterFactory in your analysis chain too.

This seems like it has possibilities for reasonably closely related languages. I suspect it'd fall over if you put Arabic or CJK languages in here, but that doesn't seem to be the problem you're addressing.

Best
Erick

On Mon, Aug 8, 2011 at 7:57 AM, cnyee <yeec...@gmail.com> wrote:
> Sorry if this has already been discussed, but I have already spent a couple
> of days googling in vain....
>
> The problem:
> - documents in multiple languages (us, de, fr, es).
> - language is known (a team of editors determines the language manually, and
> users are asked to specify language option for searching).
>
> My intended approach:
> - one index.
> - a multiplexing token filter, a MultilingualSnowballFilterFactory that
> instantiates a Snowball Stemmer for the appropriate language.
> - language is a facet, to get rid of cross-language ambiguities with
> multiple languages mixed in the same field.
>
> The problem is how to communicate the language to the
> MultilingualSnowballFilterFactory. Once the language is known, instantiating
> the Snowball Stemmer for the right language is easy. I got a working version
> attached below.
>
> My solution:
> - append the language as the first token for the FilterFactory to pick up.
> E.g. "es This is a spanish document....".
> - this would mean I need to duplicate the fields - an original version for
> storing, and a version with the language marker appended for indexing. E.g.
> description (indexed=false, stored=true), description_i (indexed=true,
> stored=false).
>
> Is there a better way?
>
> Many thanks in advance.
>
> Yee
>
> http://lucene.472066.n3.nabble.com/file/n3235341/MultilingualSnowballFilterFactory.java
> MultilingualSnowballFilterFactory.java
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Multiplexing-TokenFilter-for-multi-language-tp3235341p3235341.html
> Sent from the Solr - User mailing list archive at Nabble.com.
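
P.S. On the marker-token convention quoted above: whatever reads the marker has to do so once per field value, not once per factory. Here is a small plain-Java sketch of that per-value step. The class name LanguageMarker is hypothetical and this is not taken from the attached file. The set of codes is copied from the original post. The same parse could also be used to strip the marker before display if you ever went with a single field.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for the "language as first token" convention. */
final class LanguageMarker {
    // Language codes as listed in the original post; treat this set as an assumption.
    private static final Set<String> KNOWN = Set.of("us", "de", "fr", "es");

    final String language; // the leading marker, e.g. "es"
    final String text;     // the content with the marker removed

    private LanguageMarker(String language, String text) {
        this.language = language;
        this.text = text;
    }

    /** Splits one marked value into its language code and the remaining text. */
    static LanguageMarker parse(String marked) {
        // Split on the first run of whitespace only, so the rest of the text is untouched.
        String[] parts = marked.trim().split("\\s+", 2);
        String code = parts[0].toLowerCase(Locale.ROOT);
        if (!KNOWN.contains(code)) {
            throw new IllegalArgumentException("Missing or unknown language marker: " + parts[0]);
        }
        String rest = parts.length > 1 ? parts[1] : "";
        return new LanguageMarker(code, rest);
    }
}
```

In an actual filter, the equivalent of parse would need to run each time a new token stream starts, so the stemmer is chosen per document and not fixed at the time the factory is set up.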