Re: Skewed IDF in multi lingual index

Tom Burton-West Thu, 08 Nov 2012 10:53:09 -0800

Hi Markus,

No answers, but I am very interested in what you find out.  We currently
index all languages in one index, which presents different IDF issues, but
we are interested in exploring alternatives such as the one you describe.


Tom Burton-West

http://www.hathitrust.org/blogs/large-scale-search

On Thu, Nov 8, 2012 at 11:13 AM, Markus Jelsma
<markus.jel...@openindex.io> wrote:

> Hi,
>
> We're testing a large multilingual index with _LANG fields for each
> language and using dismax to query them all. Users provide language
> preferences, explicitly or implicitly, which we use for either additive or
> multiplicative boosting on the language of the document. However, additive
> boosting is not adequate because it cannot overcome the extremely high IDF
> values for the same word in another language, so regardless of the
> preference, foreign documents are returned. Multiplicative boosting solves
> this problem but has another downside: the multiplier is so strong that a
> standard qf=field^boost can no longer be used to prefer documents in
> another language above the preferred language. We do use the
> def function (boost=def(query($qq),.3)) to prevent a single boost query
> from returning 0 and thus producing a product of 0 across all boost
> queries. But it doesn't help that much.
>
> This all comes down to IDF differences between the languages. Even common
> words such as country names like `india` show large differences in IDF. Is
> there anyone with some hints or experiences to share about skewed IDF in
> such an index?
>
> Thanks,
> Markus
>
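
To make the effect in the quoted message concrete, here is a minimal sketch, not a reproduction of Markus's setup. It assumes Lucene's classic TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)), and models a match score simply as idf squared, ignoring tf, length norms, and queryNorm. The index size, the per-language document frequencies for `india`, the `en`/`nl` language codes, and the boost values are all made up for illustration. The 0.3 fallback stands in for the role of def(query($qq),.3) in the message.

```python
import math

NUM_DOCS = 10_000_000  # hypothetical total number of documents in the index

# Hypothetical document frequencies of the term "india" in two
# per-language fields: common in English, rare in Dutch.
DOC_FREQ = {"en": 200_000, "nl": 500}


def idf(num_docs, doc_freq):
    """Classic Lucene TF-IDF inverse document frequency."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def base_score(lang):
    """Simplified match score: idf squared (tf, norms, queryNorm ignored)."""
    return idf(NUM_DOCS, DOC_FREQ[lang]) ** 2


def additive(lang, preferred, bonus=5.0):
    """Add a fixed bonus to documents in the preferred language."""
    return base_score(lang) + (bonus if lang == preferred else 0.0)


def multiplicative(lang, preferred, factor=10.0, fallback=0.3):
    """Multiply the score by a language factor.

    Non-preferred documents get `fallback` instead of 0, so they are
    scaled down rather than removed from the ranking.
    """
    return base_score(lang) * (factor if lang == preferred else fallback)


def ranking(score_fn, preferred):
    """Return language codes ordered from highest to lowest score."""
    return sorted(DOC_FREQ, key=lambda lang: score_fn(lang, preferred),
                  reverse=True)


if __name__ == "__main__":
    for lang in DOC_FREQ:
        print(lang, "idf =", round(idf(NUM_DOCS, DOC_FREQ[lang]), 3))
    print("additive:      ", ranking(additive, "en"))
    print("multiplicative:", ranking(multiplicative, "en"))
```

With these made-up numbers the rare-language field has a far higher IDF, so the additive bonus leaves the Dutch document on top even though English is preferred, while the multiplicative factor reverses the order. The same factor is also large enough that a moderate qf field boost on the other language would not change the outcome, which is the downside described above.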
