Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread Doug Turnbull
It is challenging, as performance will be very dependent on the use case and domain (there's no one globally perfect relevance solution). But a good set of metrics to see *generally* how stock Solr performs across a reasonable set of verticals would be nice. My philosoph

Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread alessandro.benedetti
Thanks Yonik and thanks Doug. I agree with Doug on adding a few generic test corpora that Jenkins automatically runs some metrics on, to check that Apache Lucene/Solr changes don't affect a golden truth too much. This of course can be very complex, but I think it is a direction the Apache Lucene/Solr comm

Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread Doug Turnbull
Just a piece of feedback from clients on the original docCount change. I have seen several cases with clients where the switch to docCount surprised and harmed relevance. More broadly, I’m concerned when we make these changes there’s not a testing process against test corpuses with judgments and

Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread Yonik Seeley
On Tue, Dec 5, 2017 at 5:15 AM, alessandro.benedetti wrote: > "Lucene/Solr doesn't actually delete documents when you delete them, it > just marks them as deleted. I'm pretty sure that the difference between > docCount and maxDoc is deleted documents. Maybe I don't understand what > I'm talking

Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread alessandro.benedetti
"Lucene/Solr doesn't actually delete documents when you delete them, it just marks them as deleted. I'm pretty sure that the difference between docCount and maxDoc is deleted documents. Maybe I don't understand what I'm talking about, but that is the best I can come up with. " Thanks Shawn, y

Re: Skewed IDF in multi lingual index, again

2017-12-04 Thread Yonik Seeley
On Mon, Dec 4, 2017 at 1:35 PM, Shawn Heisey wrote: > I'm pretty sure that the difference between docCount and maxDoc is deleted > documents. docCount (not the best name) here is the number of documents with the field being searched. docFreq (df) is the number of documents actually containing t
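
To make the three statistics in this reply concrete, here is a minimal Python sketch over a toy in-memory collection. It is not Lucene code: the document layout and the `field_stats` helper are hypothetical, and maxDoc is simplified to the total number of documents (in a real index it also counts deleted documents that have not yet been merged away).

```python
def field_stats(docs, field, term):
    """Return (max_doc, doc_count, doc_freq) for one field and one term.

    max_doc   -- all documents in the (toy) index
    doc_count -- documents that have any value in `field`
    doc_freq  -- documents whose `field` contains `term`
    """
    max_doc = len(docs)
    doc_count = sum(1 for d in docs if d.get(field))
    doc_freq = sum(1 for d in docs if term in d.get(field, "").split())
    return max_doc, doc_count, doc_freq


# Hypothetical collection: three English documents, one French document.
docs = [
    {"text_en": "the quick brown fox"},
    {"text_en": "the lazy dog"},
    {"text_en": "a quick test"},
    {"text_fr": "le renard brun"},
]

print(field_stats(docs, "text_fr", "renard"))
print(field_stats(docs, "text_en", "quick"))
```

In this toy data maxDoc is the same for both fields, while docCount differs per field, which is the distinction the reply draws.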

Re: Skewed IDF in multi lingual index, again

2017-12-04 Thread Shawn Heisey
On 12/4/2017 7:21 AM, alessandro.benedetti wrote: the reason docCount was improving things is because it was using a docCount relative to a specific field while maxDoc is global all over the index ? Lucene/Solr doesn't actually delete documents when you delete them, it just marks them as delet

Re: Skewed IDF in multi lingual index, again

2017-12-04 Thread alessandro.benedetti
Furthermore, taking a look at the code for BM25 similarity, it seems to me it is currently working right: - docCount is used per field if != -1 /** * Computes a score factor for a simple term and returns an explanation * for that score factor. * * * The default implementation us
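
The behaviour described here (use the per-field docCount unless it is reported as -1, in which case fall back to maxDoc) can be sketched in Python. This is an illustration rather than the Lucene source: `bm25_idf` is a hypothetical name, and the formula ln(1 + (N - docFreq + 0.5) / (docFreq + 0.5)) is assumed to match the BM25 idf in the Lucene versions current at the time of this thread.

```python
import math


def bm25_idf(doc_freq, doc_count, max_doc):
    """BM25-style idf with the docCount == -1 fallback to maxDoc."""
    # -1 means the per-field document count is unavailable.
    n = max_doc if doc_count == -1 else doc_count
    return math.log(1 + (n - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical numbers: a term found in 2 documents of a field that is
# populated in only 5 of 100 documents.
print(bm25_idf(doc_freq=2, doc_count=5, max_doc=100))   # per-field docCount
print(bm25_idf(doc_freq=2, doc_count=-1, max_doc=100))  # falls back to maxDoc
```

With the per-field count the term looks common within its own field; with the maxDoc fallback the same term looks rare, so its idf is higher.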

Re: Skewed IDF in multi lingual index, again

2017-12-04 Thread alessandro.benedetti
Hi Markus, just out of interest, why did "It was solved back then by using docCount instead of maxDoc when calculating idf, it worked really well!" solve the problem? I assume you are using different fields, one per language. Each field appears in a different number of docs, I guess. e.g. t

Re: Skewed IDF in multi lingual index, again

2017-11-30 Thread Walter Underwood
y relevant documents in foreign languages, > hence the deboost is not too low. > > Thanks, > Markus > > > -Original message- >> From:Walter Underwood >> Sent: Thursday 30th November 2017 17:29 >> To: solr-user@lucene.apache.org >> Subject: R

RE: Skewed IDF in multi lingual index, again

2017-11-30 Thread Markus Jelsma
uages, hence the deboost is not too low. Thanks, Markus -Original message- > From:Walter Underwood > Sent: Thursday 30th November 2017 17:29 > To: solr-user@lucene.apache.org > Subject: Re: Skewed IDF in multi lingual index, again > > I’ve occasionally considered using U

Re: Skewed IDF in multi lingual index, again

2017-11-30 Thread Walter Underwood
hat are not in the user's preference > language but in some cases it is not enough. I can go on by reducing that > boost but that's not what i prefer. > > I'd like to know if there are additional tricks to solve the problem. > > Many thanks! > Markus > > [1] &g

Skewed IDF in multi lingual index, again

2017-11-30 Thread Markus Jelsma
ow if there are additional tricks to solve the problem. Many thanks! Markus [1] http://lucene.472066.n3.nabble.com/Skewed-IDF-in-multi-lingual-index-td4019095.html

Re: Skewed IDF in multi lingual index

2012-11-26 Thread Robert Muir
ive boosts > will be lower than the product of boosts similar boosts, lowering the > document in rank instead of boosting it. > > -Original message- > > From:Markus Jelsma > > Sent: Fri 09-Nov-2012 10:23 > > To: solr-user@lucene.apache.org > > Subject: RE: S

RE: Skewed IDF in multi lingual index

2012-11-12 Thread Markus Jelsma
r@lucene.apache.org > Subject: RE: Skewed IDF in multi lingual index > > Robert, Tom, > > That's it indeed! Using maxDoc as numerator opposed to docCount yields very > skewed results for an unevenly distributed multi-lingual index. We have one > language dominatin

RE: Skewed IDF in multi lingual index

2012-11-09 Thread Markus Jelsma
-Original message- > From:Robert Muir > Sent: Thu 08-Nov-2012 17:44 > To: solr-user@lucene.apache.org > Subject: Re: Skewed IDF in multi lingual index > > Hi Markus: how are the languages distributed across documents? > > Imagine I have a text_en field and a text_fr

Re: Skewed IDF in multi lingual index

2012-11-08 Thread Tom Burton-West
Hi Markus, No answers, but I am very interested in what you find out. We currently index all languages in one index, which presents different IDF issues, but are interested in exploring alternatives such as the one you describe. Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search

Re: Skewed IDF in multi lingual index

2012-11-08 Thread Robert Muir
Hi Markus: how are the languages distributed across documents? Imagine I have a text_en field and a text_fr field. Let's say I have 100 documents, 95 are English and only 5 are French. So the text_en field is populated 95% of the time, and the text_fr 5% of the time. But the default IDF computatio
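
The 95/5 split in this example can be worked through numerically. The following is a minimal sketch, assuming the classic TF-IDF formula idf = 1 + ln(N / (docFreq + 1)) that was Lucene's default at the time; `classic_idf` is a hypothetical helper, and the docFreq values (38 and 2) are made up so that each term appears in 40% of the documents of its own language.

```python
import math


def classic_idf(doc_freq, n):
    """Classic TF-IDF style idf: 1 + ln(n / (doc_freq + 1))."""
    return 1.0 + math.log(n / (doc_freq + 1))


MAX_DOC = 100              # all documents in the index
EN_DOCS, FR_DOCS = 95, 5   # documents with text_en / text_fr populated
EN_DF, FR_DF = 38, 2       # hypothetical: each term is in 40% of its language

for name, df, field_docs in [("text_en", EN_DF, EN_DOCS),
                             ("text_fr", FR_DF, FR_DOCS)]:
    idf_global = classic_idf(df, MAX_DOC)       # N = maxDoc
    idf_field = classic_idf(df, field_docs)     # N = per-field docCount
    print(name, round(idf_global, 3), round(idf_field, 3))
```

With maxDoc as N, the French term receives a far higher idf than an English term that is equally common within its own language. With the per-field docCount as N, the two values end up much closer together.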

Skewed IDF in multi lingual index

2012-11-08 Thread Markus Jelsma
Hi, We're testing a large multi lingual index with _LANG fields for each language and using dismax to query them all. Users provide, explicitly or implicitly, language preferences that we use for either additive or multiplicative boosting on the language of the document. However, additive boosting