Hi Markus: how are the languages distributed across documents? Imagine I have a text_en field and a text_fr field. Let's say I have 100 documents: 95 are English and only 5 are French. So the text_en field is populated 95% of the time, and text_fr 5% of the time.
But the default IDF computation doesn't look at things this way: it always uses '100' as maxDoc. So in such a situation, any terms against text_fr are "rare" :)

The first thing I would look at is treating this situation as merging results from an English index with 95 docs and a French index with 5 docs. So I would consider overriding the two idfExplain methods (term and phrase) to use CollectionStatistics.docCount() instead of CollectionStatistics.maxDoc(). The former would be 95 for the English field (instead of 100), and 5 for the French field (instead of 100).

I don't think this will solve all your problems, but it might help.

Note: you must ensure your index is fully upgraded to 4.0 to try this statistic; it will return -1 if you have any 3.x segments in your index.

On Thu, Nov 8, 2012 at 11:13 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> Hi,
>
> We're testing a large multi lingual index with _LANG fields for each language
> and using dismax to query them all. Users provide, explicit or implicit,
> language preferences that we use for either additive or multiplicative
> boosting on the language of the document. However, additive boosting is not
> adequate because it cannot overcome the extremely high IDF values for the
> same word in another language so regardless of the preference, foreign
> documents are returned. Multiplicative boosting solves this problem but has
> the other downside as it doesn't allow us with standard qf=field^boost to
> prefer documents in another language above the preferred language because the
> multiplicative is so strong. We do use the def function
> (boost=def(query($qq),.3)) to prevent one boost query to return 0 and thus a
> product of 0 for all boost queries. But it doesn't help that much.
>
> This all comes down to IDF differences between the languages, even common
> words such as country names like `india` show large differences in IDF.
> Is there anyone with some hints or experiences to share about skewed IDF in such
> an index?
>
> Thanks,
> Markus
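
To put rough numbers on the docCount() versus maxDoc() suggestion in the reply above, here is a small sketch in plain Java arithmetic; it does not call Lucene. It assumes the classic TF-IDF formula 1 + ln(numDocs / (docFreq + 1)). The class name IdfSketch and the document frequencies (a term found in 19 of the 95 English docs and in 1 of the 5 French docs) are made up for illustration; only the 100/95/5 split comes from the example above.

```java
public class IdfSketch {

    // Assumed classic IDF formula: 1 + ln(numDocs / (docFreq + 1)).
    // numDocs is whatever the similarity uses as the collection size.
    public static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        long maxDoc = 100;     // every document in the index
        long docCountEn = 95;  // documents that have a text_en value
        long docCountFr = 5;   // documents that have a text_fr value

        // Hypothetical document frequencies for the same word in each field.
        long dfEn = 19;
        long dfFr = 1;

        // Default behaviour: both fields are scored against maxDoc.
        double enDefault = idf(dfEn, maxDoc);
        double frDefault = idf(dfFr, maxDoc);

        // Suggested behaviour: each field is scored against its own docCount.
        double enPerField = idf(dfEn, docCountEn);
        double frPerField = idf(dfFr, docCountFr);

        System.out.printf("maxDoc:   text_en=%.3f text_fr=%.3f gap=%.3f%n",
                enDefault, frDefault, Math.abs(frDefault - enDefault));
        System.out.printf("docCount: text_en=%.3f text_fr=%.3f gap=%.3f%n",
                enPerField, frPerField, Math.abs(frPerField - enPerField));
    }
}
```

With these made-up frequencies the French term looks much rarer than the English one when both are measured against maxDoc, and the gap narrows once each field is measured against its own docCount. The two values still do not come out equal, which fits the caveat that this may help without solving everything.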