This is unfortunately not what we want. Some customers use filters to restrict
results by language, but others don't. They want to be able to find documents
in all languages, so we use the user's language preference to rank documents
in their local language on top. Very relevant documents in foreign languages
should still surface, which is why the deboost factor is not set too low.
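
To make the trade-off concrete, here is a rough sketch of how a fixed 0.7
deboost interacts with a skewed idf. It assumes a BM25-style idf of the form
Lucene's BM25Similarity uses, computed with docCount, and it assumes each
language has its own field so that docCount differs per language. The document
counts, language codes, and function names are made up for illustration and
are not taken from our index.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """BM25-style idf using docCount (docs that have the field), not maxDoc."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


DEBOOST = 0.7  # multiplier for documents outside the user's preferred language


def language_adjusted(score: float, doc_lang: str, preferred_lang: str,
                      deboost: float = DEBOOST) -> float:
    """Leave preferred-language scores alone, scale the others by the deboost."""
    return score if doc_lang == preferred_lang else score * deboost


# Hypothetical counts: a proper noun that is common in the large local-language
# field but rare in a much smaller foreign-language field.
local_idf = bm25_idf(doc_freq=5_000, doc_count=1_000_000)
foreign_idf = bm25_idf(doc_freq=10, doc_count=50_000)

local_score = language_adjusted(local_idf, "en", "en")
foreign_score = language_adjusted(foreign_idf, "de", "en")

# Largest deboost factor at which the local document still ties or wins.
break_even = local_idf / foreign_idf

print(f"local={local_score:.3f} foreign={foreign_score:.3f} "
      f"break_even={break_even:.3f}")
```

With these invented numbers the foreign document still outranks the local one
after the 0.7 deboost, because the idf ratio is larger than the deboost can
absorb. Lowering the factor below the break-even value would fix this one term,
but it would also push down the very relevant foreign documents we want to
keep.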

Thanks,
Markus

 
-----Original message-----
> From: Walter Underwood <wun...@wunderwood.org>
> Sent: Thursday 30th November 2017 17:29
> To: solr-user@lucene.apache.org
> Subject: Re: Skewed IDF in multi lingual index, again
> 
> I’ve occasionally considered using Unicode language tags (U+E0001 and friends) 
> on each term. That would make a term specific to a language, so we would get 
> [en]LaserJet, [fr]LaserJet, [de]LaserJet, and so on. But that is a pretty big 
> hammer, because it restricts matches to the same language. If the entire 
> document is in one language, might as well use a filter query for that 
> language. The tags would work for multiple languages in one document.
> 
> Maybe make the untagged term a synonym. For cross-language terms like 
> “LaserJet”, the untagged one would have worse idf.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
> > On Nov 30, 2017, at 8:14 AM, Markus Jelsma <markus.jel...@openindex.io> 
> > wrote:
> > 
> > Hello,
> > 
> > We already discussed this problem five years ago [1]. In short: documents 
> > in foreign languages are scored higher for some terms.
> > 
> > It was solved back then by using docCount instead of maxDoc when 
> > calculating idf, and it worked really well! But, probably due to index 
> > changes, the problem is back for some terms, mostly proper nouns, just 
> > like five years ago.
> > 
> > We already deboost documents that are not in the user's preferred 
> > language by 0.7, but in some cases it is not enough. I could keep lowering 
> > that boost, but that's not what I prefer.
> > 
> > I'd like to know if there are additional tricks to solve the problem.
> > 
> > Many thanks!
> > Markus
> > 
> > [1] 
> > http://lucene.472066.n3.nabble.com/Skewed-IDF-in-multi-lingual-index-td4019095.html
> 
> 
