Re: Skewed IDF in multi lingual index

Robert Muir Thu, 08 Nov 2012 08:39:09 -0800

Hi Markus: how are the languages distributed across documents?

Imagine I have a text_en field and a text_fr field. Let's say I have
100 documents: 95 are English and only 5 are French.
So the text_en field is populated 95% of the time, and the text_fr 5%
of the time.

But the default IDF computation doesn't look at things this way: it
always uses '100' as maxDoc. So in such a situation, any term in
text_fr looks "rare" :)
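
To put rough numbers on that, here is a minimal sketch with no Lucene
dependency. It assumes the classic TF-IDF formula
idf = 1 + ln(numDocs / (docFreq + 1)); the class and method names are
made up for illustration. A word that occurs in every French document
comes out at about 3.81, while a word that occurs in every English
document comes out at about 1.04, although both are equally common
within their own language.

```java
// Minimal sketch, no Lucene dependency. The formula is an assumption
// modelled on the classic TF-IDF idf: 1 + ln(numDocs / (docFreq + 1)).
final class IdfSkewDemo {

    // Hypothetical helper, not a Lucene API.
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        long maxDoc = 100; // whole index: 95 English + 5 French documents

        // A term present in every English document (docFreq = 95).
        double enIdf = idf(95, maxDoc);
        // A term present in every French document (docFreq = 5).
        double frIdf = idf(5, maxDoc);

        System.out.printf("text_en idf with maxDoc: %.3f%n", enIdf);
        System.out.printf("text_fr idf with maxDoc: %.3f%n", frIdf);
    }
}
```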

The first thing I would look at is treating this situation as merging
results from an English index with 95 docs and a French index with 5
docs.
So I would consider overriding the two idfExplain methods (term and
phrase) to use CollectionStatistics.docCount() instead of
CollectionStatistics.maxDoc().
The former would be 95 for the English field (instead of 100), and 5
for the French field (instead of 100).

I don't think this will solve all your problems, but it might help.

Note: you must ensure your index is fully upgraded to 4.0 to try this
statistic, otherwise it will return -1 if you have any 3.x segments in
your index.
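
As a rough illustration of the docCount idea, the sketch below uses the
per-field document count as the collection size and falls back to
maxDoc when the statistic is not available. It is a stand-in with no
Lucene dependency: the idf formula (1 + ln(numDocs / (docFreq + 1))),
the class name, and the effectiveDocs helper with its fallback rule are
all assumptions for illustration. In a real Similarity subclass the
same substitution would go inside the two idfExplain overrides.

```java
// Minimal sketch, no Lucene dependency. Names and the fallback rule are
// assumptions for illustration, not part of any Lucene API.
final class DocCountIdfDemo {

    // Assumed classic TF-IDF idf: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Use the per-field document count when it is positive; otherwise
    // (for example -1 when the statistic is unavailable) use maxDoc.
    static long effectiveDocs(long docCount, long maxDoc) {
        return docCount > 0 ? docCount : maxDoc;
    }

    public static void main(String[] args) {
        long maxDoc = 100;

        // text_en: 95 documents have the field; term occurs in all 95.
        double enIdf = idf(95, effectiveDocs(95, maxDoc));
        // text_fr: 5 documents have the field; term occurs in all 5.
        double frIdf = idf(5, effectiveDocs(5, maxDoc));
        // Statistic unavailable: behaves like the maxDoc-based value.
        double frOld = idf(5, effectiveDocs(-1, maxDoc));

        System.out.printf("text_en idf with docCount: %.3f%n", enIdf);
        System.out.printf("text_fr idf with docCount: %.3f%n", frIdf);
        System.out.printf("text_fr idf with fallback: %.3f%n", frOld);
    }
}
```

With per-field counts the two values land close together (roughly 0.99
and 0.82), which is the "two separate indexes" view described above.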

On Thu, Nov 8, 2012 at 11:13 AM, Markus Jelsma
<markus.jel...@openindex.io> wrote:
> Hi,
>
> We're testing a large multi lingual index with _LANG fields for each language 
> and using dismax to query them all. Users provide, explicit or implicit, 
> language preferences that we use for either additive or multiplicative 
> boosting on the language of the document. However, additive boosting is not 
> adequate because it cannot overcome the extremely high IDF values for the 
> same word in another language so regardless of the preference, foreign 
> documents are returned. Multiplicative boosting solves this problem but has 
> the other downside as it doesn't allow us with standard qf=field^boost to 
> prefer documents in another language above the preferred language because the 
> multiplicative is so strong. We do use the def function 
> (boost=def(query($qq),.3)) to prevent one boost query to return 0 and thus a 
> product of 0 for all boost queries. But it doesn't help that much.
>
> This all comes down to IDF differences between the languages, even common 
> words such as country names like `india` show large differences in IDF. Is 
> there anyone with some hints or experiences to share about skewed IDF in such 
> an index?
>
> Thanks,
> Markus
