Re: Multiplexing TokenFilter for multi-language?

Erick Erickson Tue, 09 Aug 2011 05:46:54 -0700

The most common way to handle this is to just index to
language-specific fields, e.g. text_es, text_en, text_de. Since
you know what language the user is searching in, you can
route the queries to the correct set of fields.
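
To make the routing idea concrete, here is a minimal sketch in plain Java.
The class name, the method names and the language-to-field mapping are all
made up for illustration (nothing here is Solr API), and it does no query
escaping, so treat it as a rough outline rather than something to paste in.

```java
import java.util.Map;

public class LanguageFieldRouter {
    // Hypothetical mapping from the user's language choice to an indexed field.
    private static final Map<String, String> FIELDS =
            Map.of("en", "text_en", "de", "text_de");

    /** Returns the field to search for the given language code. */
    static String fieldFor(String lang) {
        String field = FIELDS.get(lang);
        if (field == null) {
            throw new IllegalArgumentException("Unsupported language: " + lang);
        }
        return field;
    }

    /** Builds a simple field:term query string (no escaping, illustration only). */
    static String buildQuery(String lang, String term) {
        return fieldFor(lang) + ":" + term;
    }
}
```

Each language field would then carry its own analysis chain in the schema,
so the stemmer choice is made by the field rather than by the filter.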

That said, this is an interesting approach. You don't
necessarily need to have two different fields, since you could
simply remove the language markers from the stored data
before displaying it. I think your approach
is better since there are fewer places to forget. And there's no
penalty for having two fields, one stored and not indexed and
one indexed and not stored. So I'd use the approach you
outlined.
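
For comparison, the single-field alternative (stripping the marker before
display) might look roughly like the sketch below. It assumes the marker is
the first whitespace-delimited token, as in your "es This is a spanish
document" example; the class and method names are hypothetical.

```java
public class LanguageMarker {
    /** Returns the leading language code, e.g. "es" for "es Hola mundo". */
    static String languageOf(String marked) {
        int space = marked.indexOf(' ');
        return space < 0 ? marked : marked.substring(0, space);
    }

    /** Returns the text with the leading language code removed. */
    static String stripMarker(String marked) {
        int space = marked.indexOf(' ');
        return space < 0 ? "" : marked.substring(space + 1);
    }
}
```

Every piece of display code would have to remember to call something like
stripMarker, which is the "places to forget" problem mentioned above.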

However, I believe that the call to create in your code only happens
once, so I don't see how you get different versions of the stemmer
for different documents. I might be wrong here, but have you checked?
You'd probably want to incorporate the ASCIIFoldingFilterFactory in
your analysis chain too.
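
If create really is called only once, one way around it would be to defer
the stemmer choice to the filter itself, so it is made each time a new
document's token stream starts. The sketch below only mimics that idea in
plain Java: it is not Lucene code, the "stemmers" are trivial suffix
strippers standing in for Snowball, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class PerDocumentStemmerSketch {
    /** Stand-in for a real stemmer: strips one suffix per language. */
    static String stem(String lang, String word) {
        switch (lang) {
            case "en":
                return word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
            case "de":
                return word.endsWith("en") ? word.substring(0, word.length() - 2) : word;
            default:
                return word; // unknown language: leave the token untouched
        }
    }

    /** Consumes the first token as the language marker and stems the rest. */
    static List<String> analyze(List<String> tokens) {
        List<String> out = new ArrayList<>();
        if (tokens.isEmpty()) {
            return out;
        }
        String lang = tokens.get(0); // chosen per document, not per factory
        for (String token : tokens.subList(1, tokens.size())) {
            out.add(stem(lang, token));
        }
        return out;
    }
}
```

The point is only that the language is read from the stream on every
document, so a single filter instance can serve documents in different
languages.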

This seems like it has possibilities for reasonably closely-related
languages. I suspect it'd fall over if you put Arabic or CJK languages
in here, but that doesn't seem to be the problem you're addressing.

Best
Erick

On Mon, Aug 8, 2011 at 7:57 AM, cnyee <yeec...@gmail.com> wrote:
> Sorry if this has already been discussed, but I have already spent a couple
> of days googling in vain....
>
> The problem:
> - documents in multiple languages (us, de, fr, es).
> - language is known (a team of editors determines the language manually, and
> users are asked to specify language option for searching).
>
> My intended approach:
> - one index.
> - a multiplexing token filter, a MultilingualSnowballFilterFactory that
> instantiates a Snowball Stemmer for the appropriate language.
> - language is a facet, to get rid of cross-language ambiguities with
> multiple languages mixed in the same field.
>
> The problem is how to communicate the language to the
> MultilingualSnowballFilterFactory. Once the language is known, instantiating
> the Snowball Stemmer for the right language is easy. I got a working version
> attached below.
>
> My solution:
> - prepend the language as the first token for the FilterFactory to pick up.
> E.g. "es This is a Spanish document....".
> - this would mean I need to duplicate the fields - an original version for
> storing, and a version with the language marker prepended for indexing. E.g.
> description (indexed=false, stored=true), description_i (indexed=true,
> stored=false).
>
> Is there a better way?
>
> Many thanks in advance.
>
> Yee
>
> http://lucene.472066.n3.nabble.com/file/n3235341/MultilingualSnowballFilterFactory.java
> MultilingualSnowballFilterFactory.java
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Multiplexing-TokenFilter-for-multi-language-tp3235341p3235341.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
