Re: Skewed IDF in multi lingual index, again
It is challenging, as the effect of any relevance change will be very dependent on the use case and domain (there's no one globally perfect relevance solution). But a good set of metrics to see *generally* how stock Solr performs across a reasonable set of verticals would be nice.

My philosophy about Lucene-based search is that it's not a solution, but rather a framework that should have sane defaults but large amounts of configurability.

For example, I'm not sure there's a globally "right" answer on maxDoc vs docCount.

Problems with docCount come into play when a corpus usually has an empty field, but it's occasionally filled out. This creates a strong bias against matches in that usually empty field, when previously a match in that field was weighted very highly.

For example, take a product catalog that has a rarely used, user-editable tag field and a product description, such as:

Product Name: Nice Pants!
Product Description: Come wear these pants!
Tags: [blue] [acid-wash]

Product Name: Acid Wash Pants
Product Description: Come wear these pants!
Tags: (empty)

In this case, the IDF for the acid wash match in tags is very low using docCount, whereas with maxDocs it was very high. Not sure what the right answer is, but there is often a desire for more complete docs to be boosted much higher, which the "maxDocs" method does.

Another case where docCount can be a problem is copy fields: with copy fields, you care that the original field had terms, even if for some reason they were removed in the analysis chain. This can happen with some methods we use for simple entity extraction.

Further, the definitions of BM25, etc. rely on corpus-level document frequency for a term and don't have a concept of fields. BM25F can mostly be implemented with BlendedTermQuery, which blends doc frequencies across fields:
http://opensourceconnections.com/blog/2016/10/19/bm25f-in-lucene/

On Tue, Dec 5, 2017 at 10:28 AM alessandro.benedetti wrote:
> Thanks Yonik and thanks Doug.
>
> I agree with Doug in adding few generics test corpora Jenkins automatically
> runs some metrics on, to evaluate Apache Lucene/Solr changes don't affect a
> golden truth too much.
> This of course can be very complex, but I think it is a direction the
> Apache Lucene/Solr community should work on.
>
> Given that, I do believe that in this case, moving from maxDocs(field
> independent) to docCount(field dependent) was a good move ( and this
> specific multi language use case is an example).
>
> Actually I also believe that theoretically docCount(field dependent) is
> still better than maxDocs(field dependent).
> This is because docCount(field dependent) represents a state in time
> associated to the current index while maxDocs represents an historical
> consideration.
> A corpus of documents can change in time, and how much a term is rare can
> drastically change ( let's pick an highly dynamic domain such news).
>
> Doug, were you able to generalise and abstract any consideration from what
> happened to your customers and why they got regressions moving from maxDocs
> to docCount(field dependent) ?
>
> -----
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

--
Consultant, OpenSource Connections. Contact info at http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)
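To make the sparse tag field case concrete, here is a worked example using the idf formula from the BM25 explanation string quoted elsewhere in this thread. The counts are hypothetical and chosen only for round numbers: 10,000 products, 10 of which have any tags, and 5 of those tagged acid-wash.

```latex
\mathrm{idf}(N, \mathit{df}) = \ln\!\left(1 + \frac{N - \mathit{df} + 0.5}{\mathit{df} + 0.5}\right)

\text{maxDoc: } N = 10000,\ \mathit{df} = 5:\quad
  \ln\!\left(1 + \frac{9995.5}{5.5}\right) \approx 7.51

\text{docCount: } N = 10,\ \mathit{df} = 5:\quad
  \ln\!\left(1 + \frac{5.5}{5.5}\right) = \ln 2 \approx 0.69
```

Under these assumed counts the same tag match carries roughly a tenth of the idf weight once N is the per-field docCount, which is the kind of drop described above.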
Re: Skewed IDF in multi lingual index, again
Thanks Yonik and thanks Doug.

I agree with Doug on adding a few generic test corpora that Jenkins automatically runs some metrics on, to verify that Apache Lucene/Solr changes don't affect a golden truth too much. This of course can be very complex, but I think it is a direction the Apache Lucene/Solr community should work on.

Given that, I do believe that in this case, moving from maxDocs (field independent) to docCount (field dependent) was a good move (and this specific multi-language use case is an example).

Actually, I also believe that theoretically docCount (field dependent) is still better than maxDocs (field dependent). This is because docCount (field dependent) represents a state in time associated with the current index, while maxDocs represents a historical consideration. A corpus of documents can change over time, and how rare a term is can change drastically (let's pick a highly dynamic domain such as news).

Doug, were you able to generalise and abstract any consideration from what happened to your customers and why they got regressions moving from maxDocs to docCount (field dependent)?

-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
Re: Skewed IDF in multi lingual index, again
Just a piece of feedback from clients on the original docCount change.

I have seen several cases with clients where the switch to docCount surprised and harmed relevance.

More broadly, I’m concerned that when we make these changes there’s no testing process against test corpuses with judgments and relevance metrics to understand their impact. I see it mentioned in a JIRA from time to time that someone saw an improvement in NDCG on a private collection. And we have to take their word for it.

Public testing of relevance against every build using stock settings could be extremely valuable and would more easily justify these changes. Something similar to the performance tests that are made.

Sadly I can only complain now :) I wish I had time to work on something like this.

Doug

On Tue, Dec 5, 2017 at 7:38 AM Yonik Seeley wrote:
> On Tue, Dec 5, 2017 at 5:15 AM, alessandro.benedetti wrote:
> > "Lucene/Solr doesn't actually delete documents when you delete them, it
> > just marks them as deleted. I'm pretty sure that the difference between
> > docCount and maxDoc is deleted documents. Maybe I don't understand what
> > I'm talking about, but that is the best I can come up with. "
> >
> > Thanks Shawn, yes, that is correct and I was aware of it.
> > I was curious of another difference :
> > I think we confirmed that docCount is local to the field ( thanks Yonik
> > for that) so :
> >
> > docCount(index,field1)= # of documents in the index that currently have
> > value(s) for field1
> >
> > My question is :
> >
> > maxDocs(index,field1)= max # of documents in the index that had value(s)
> > for field1
> >
> > OR
> >
> > maxDocs(index)= max # of documents that appeared in the index ( field
> > independent)
>
> The latter.
> I imagine that's why docCount was introduced (to avoid changing the
> meaning of an existing term).
> FWIW, the scoring change was made in
> https://issues.apache.org/jira/browse/LUCENE-6711 for Lucene/Solr 6.0
>
> -Yonik
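For readers unfamiliar with the NDCG metric mentioned in this message, a minimal sketch follows. It assumes graded judgments (higher means more relevant) and the common (2^grade - 1) / log2(rank + 1) form; the class and method names are hypothetical, and the ideal ordering is computed only from the returned list, which is a simplification compared with evaluating against a full judgment set.

```java
import java.util.Arrays;

/** Sketch of NDCG@k over graded judgments; not part of Lucene or Solr. */
final class NdcgSketch {

    /** DCG@k with gain (2^grade - 1) and discount log2(rank + 1). */
    static double dcg(int[] grades, int k) {
        double sum = 0.0;
        for (int i = 0; i < Math.min(k, grades.length); i++) {
            double gain = Math.pow(2, grades[i]) - 1;
            double discount = Math.log(i + 2) / Math.log(2); // rank is i + 1
            sum += gain / discount;
        }
        return sum;
    }

    /** NDCG@k: DCG of the ranking as returned, divided by DCG of the ideal ordering. */
    static double ndcg(int[] gradesInRankedOrder, int k) {
        int[] ideal = gradesInRankedOrder.clone();
        Arrays.sort(ideal);
        // Arrays.sort is ascending, so reverse in place to get best-first order.
        for (int i = 0, j = ideal.length - 1; i < j; i++, j--) {
            int tmp = ideal[i];
            ideal[i] = ideal[j];
            ideal[j] = tmp;
        }
        double idealDcg = dcg(ideal, k);
        return idealDcg == 0.0 ? 0.0 : dcg(gradesInRankedOrder, k) / idealDcg;
    }

    public static void main(String[] args) {
        // Hypothetical judgments for the top four results of one query.
        System.out.println(ndcg(new int[] {3, 2, 1, 0}, 4));
        System.out.println(ndcg(new int[] {0, 1, 2, 3}, 4));
    }
}
```

Averaging such a score over a fixed query set, before and after a scoring change, is one way the per-build comparison described above could be expressed.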
Re: Skewed IDF in multi lingual index, again
On Tue, Dec 5, 2017 at 5:15 AM, alessandro.benedetti wrote:
> "Lucene/Solr doesn't actually delete documents when you delete them, it
> just marks them as deleted. I'm pretty sure that the difference between
> docCount and maxDoc is deleted documents. Maybe I don't understand what
> I'm talking about, but that is the best I can come up with. "
>
> Thanks Shawn, yes, that is correct and I was aware of it.
> I was curious of another difference :
> I think we confirmed that docCount is local to the field ( thanks Yonik for
> that) so :
>
> docCount(index,field1)= # of documents in the index that currently have
> value(s) for field1
>
> My question is :
>
> maxDocs(index,field1)= max # of documents in the index that had value(s) for
> field1
>
> OR
>
> maxDocs(index)= max # of documents that appeared in the index ( field
> independent)

The latter.
I imagine that's why docCount was introduced (to avoid changing the meaning of an existing term).
FWIW, the scoring change was made in https://issues.apache.org/jira/browse/LUCENE-6711 for Lucene/Solr 6.0

-Yonik
Re: Skewed IDF in multi lingual index, again
"Lucene/Solr doesn't actually delete documents when you delete them, it just marks them as deleted. I'm pretty sure that the difference between docCount and maxDoc is deleted documents. Maybe I don't understand what I'm talking about, but that is the best I can come up with."

Thanks Shawn, yes, that is correct and I was aware of it.
I was curious about another difference:
I think we confirmed that docCount is local to the field (thanks Yonik for that), so:

docCount(index, field1) = # of documents in the index that currently have value(s) for field1

My question is:

maxDocs(index, field1) = max # of documents in the index that had value(s) for field1

OR

maxDocs(index) = max # of documents that appeared in the index (field independent)

Regards

-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
Re: Skewed IDF in multi lingual index, again
On Mon, Dec 4, 2017 at 1:35 PM, Shawn Heisey wrote:
> I'm pretty sure that the difference between docCount and maxDoc is deleted
> documents.

docCount (not the best name) here is the number of documents with the field being searched.
docFreq (df) is the number of documents actually containing the term in that field.
In the past, maxDoc was used instead of docCount.

-Yonik
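The three statistics defined in this message can be illustrated on a toy in-memory corpus. This is a sketch only: it does not use Lucene, the class, method, and field names are hypothetical, and it ignores deleted documents (which Lucene's maxDoc would still count).

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy "index": each document maps a field name to the set of terms in that field. */
final class FieldStatsSketch {

    /** maxDoc: every document in the index, regardless of which fields it has. */
    static int maxDoc(List<Map<String, Set<String>>> docs) {
        return docs.size();
    }

    /** docCount: documents that have at least one term in the given field. */
    static int docCount(List<Map<String, Set<String>>> docs, String field) {
        int n = 0;
        for (Map<String, Set<String>> doc : docs) {
            Set<String> terms = doc.get(field);
            if (terms != null && !terms.isEmpty()) {
                n++;
            }
        }
        return n;
    }

    /** docFreq: documents whose given field contains the given term. */
    static int docFreq(List<Map<String, Set<String>>> docs, String field, String term) {
        int n = 0;
        for (Map<String, Set<String>> doc : docs) {
            Set<String> terms = doc.get(field);
            if (terms != null && terms.contains(term)) {
                n++;
            }
        }
        return n;
    }

    /** Hypothetical three-document corpus used only for illustration. */
    static List<Map<String, Set<String>>> sampleDocs() {
        return List.of(
            Map.of("text_en", Set.of("laserjet", "printer")),
            Map.of("text_fr", Set.of("laserjet", "imprimante")),
            Map.of("text_fr", Set.of("imprimante")));
    }

    public static void main(String[] args) {
        List<Map<String, Set<String>>> docs = sampleDocs();
        System.out.println("maxDoc = " + maxDoc(docs));
        System.out.println("docCount(text_fr) = " + docCount(docs, "text_fr"));
        System.out.println("docFreq(text_fr, laserjet) = " + docFreq(docs, "text_fr", "laserjet"));
    }
}
```

In this toy corpus maxDoc is the same for every field, while docCount differs between text_en and text_fr, which is the distinction the thread turns on.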
Re: Skewed IDF in multi lingual index, again
On 12/4/2017 7:21 AM, alessandro.benedetti wrote:
> the reason docCount was improving things is because it was using a docCount
> relative to a specific field while maxDoc is global all over the index ?

Lucene/Solr doesn't actually delete documents when you delete them, it just marks them as deleted. I'm pretty sure that the difference between docCount and maxDoc is deleted documents. Maybe I don't understand what I'm talking about, but that is the best I can come up with.

Not all aspects of the impact on scores from deleted documents can be eliminated, but there has been some effort to make it as minimal as possible. For what has been described here, the actual count is available, so it gets used.

Thanks,
Shawn
Re: Skewed IDF in multi lingual index, again
Furthermore, taking a look at the code for BM25 similarity, it seems to me it is currently working right:
- docCount is used per field if != -1

/**
 * Computes a score factor for a simple term and returns an explanation
 * for that score factor.
 *
 * The default implementation uses:
 *
 *   idf(docFreq, docCount);
 *
 * Note that {@link CollectionStatistics#docCount()} is used instead of
 * {@link org.apache.lucene.index.IndexReader#numDocs() IndexReader#numDocs()} because also
 * {@link TermStatistics#docFreq()} is used, and when the latter
 * is inaccurate, so is {@link CollectionStatistics#docCount()}, and in the same direction.
 * In addition, {@link CollectionStatistics#docCount()} does not skew when fields are sparse.
 *
 * @param collectionStats collection-level statistics
 * @param termStats term-level statistics for the term
 * @return an Explain object that includes both an idf score factor
 *         and an explanation for the term.
 */
public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
  final long df = termStats.docFreq();
  // Fall back to maxDoc only when the per-field docCount is unavailable (-1).
  final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
  final float idf = idf(df, docCount);
  return Explanation.match(idf, "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
      Explanation.match(df, "docFreq"),
      Explanation.match(docCount, "docCount"));
}

-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
Re: Skewed IDF in multi lingual index, again
Hi Markus,
just out of interest, why did
"It was solved back then by using docCount instead of maxDoc when calculating idf, it worked really well!"
solve the problem?

I assume you are using different fields, one per language.
Each field appears in a different number of docs, I guess, e.g.:
text_en -> 1 docs
text_fr -> 1000 docs
text_it -> 500 docs

Is the reason docCount was improving things that it uses a docCount relative to a specific field, while maxDoc is global across the whole index?

-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
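A small sketch of why the choice of N matters for per-language fields, using the field sizes from this message and the idf formula from the BM25 explanation string quoted in this thread. The document frequencies are hypothetical, and maxDoc is assumed to be the sum of the three field sizes (one language field per document); the class name is made up for the example.

```java
/** Sketch: idf of one term in per-language fields, with N = maxDoc versus N = docCount. */
final class MultiLingualIdfSketch {

    /** idf = ln(1 + (N - docFreq + 0.5) / (docFreq + 0.5)). */
    static double idf(long docFreq, long n) {
        return Math.log(1 + (n - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        String[] fields = {"text_en", "text_fr", "text_it"};
        // Field sizes taken from the message; maxDoc assumes one language field per document.
        long[] docCounts = {1, 1000, 500};
        long maxDoc = 1 + 1000 + 500;
        // Hypothetical document frequencies of one proper noun in each field.
        long[] docFreqs = {1, 40, 20};

        for (int i = 0; i < fields.length; i++) {
            System.out.printf("%s  idf(maxDoc)=%.3f  idf(docCount)=%.3f%n",
                fields[i], idf(docFreqs[i], maxDoc), idf(docFreqs[i], docCounts[i]));
        }
    }
}
```

With N = maxDoc, the term in the tiny text_en field looks extremely rare and gets the highest idf of the three. With N = docCount, it appears in every document that has that field, so its idf becomes the lowest.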
Re: Skewed IDF in multi lingual index, again
Expanding the query to use both the tagged and untagged term might work. I’m not sure the effect would be a lot different than boosting the preferred language.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Nov 30, 2017, at 8:35 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
> This is unfortunately not what we want. Some customers use filters to
> restrict language, but some customers don't. They want to be able to find
> documents in all languages, so we use user preference to get their local
> language on top. Except for very relevant documents in foreign languages,
> hence the deboost is not too low.
>
> Thanks,
> Markus
>
> -Original message-
>> From: Walter Underwood <wun...@wunderwood.org>
>> Sent: Thursday 30th November 2017 17:29
>> To: solr-user@lucene.apache.org
>> Subject: Re: Skewed IDF in multi lingual index, again
>>
>> I’ve occasionally considered using Unicode language tags (U+E001 and
>> friends) on each term. That would make a term specific to a language, so we
>> would get [en]LaserJet, [fr]LaserJet, [de]LaserJet, and so on. But that is a
>> pretty big hammer, because it restricts matches to the same language. If the
>> entire document is in one language, might as well use a filter query for
>> that language. The tags would work for multiple languages in one document.
>>
>> Maybe make the untagged term a synonym. For cross-language terms like
>> “LaserJet”, the untagged one would have worse idf.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>>> On Nov 30, 2017, at 8:14 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>
>>> Hello,
>>>
>>> We already discussed this problem five years ago [1]. In short: documents
>>> in foreign languages are scored higher for some terms.
>>>
>>> It was solved back then by using docCount instead of maxDoc when
>>> calculating idf, it worked really well! But, probably due to index changes,
>>> the problem is back for some terms, mostly proper nouns, well, just like
>>> five years ago.
>>>
>>> We already deboost documents by 0.7 that are not in the user's preference
>>> language but in some cases it is not enough. I can go on by reducing that
>>> boost but that's not what i prefer.
>>>
>>> I'd like to know if there are additional tricks to solve the problem.
>>>
>>> Many thanks!
>>> Markus
>>>
>>> [1] http://lucene.472066.n3.nabble.com/Skewed-IDF-in-multi-lingual-index-td4019095.html
RE: Skewed IDF in multi lingual index, again
This is unfortunately not what we want. Some customers use filters to restrict language, but some customers don't. They want to be able to find documents in all languages, so we use user preference to get their local language on top. Except for very relevant documents in foreign languages, hence the deboost is not too low.

Thanks,
Markus

-Original message-
> From: Walter Underwood <wun...@wunderwood.org>
> Sent: Thursday 30th November 2017 17:29
> To: solr-user@lucene.apache.org
> Subject: Re: Skewed IDF in multi lingual index, again
>
> I’ve occasionally considered using Unicode language tags (U+E001 and friends)
> on each term. That would make a term specific to a language, so we would get
> [en]LaserJet, [fr]LaserJet, [de]LaserJet, and so on. But that is a pretty big
> hammer, because it restricts matches to the same language. If the entire
> document is in one language, might as well use a filter query for that
> language. The tags would work for multiple languages in one document.
>
> Maybe make the untagged term a synonym. For cross-language terms like
> “LaserJet”, the untagged one would have worse idf.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> > On Nov 30, 2017, at 8:14 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >
> > Hello,
> >
> > We already discussed this problem five years ago [1]. In short: documents
> > in foreign languages are scored higher for some terms.
> >
> > It was solved back then by using docCount instead of maxDoc when
> > calculating idf, it worked really well! But, probably due to index changes,
> > the problem is back for some terms, mostly proper nouns, well, just like
> > five years ago.
> >
> > We already deboost documents by 0.7 that are not in the user's preference
> > language but in some cases it is not enough. I can go on by reducing that
> > boost but that's not what i prefer.
> >
> > I'd like to know if there are additional tricks to solve the problem.
> >
> > Many thanks!
> > Markus
> >
> > [1] http://lucene.472066.n3.nabble.com/Skewed-IDF-in-multi-lingual-index-td4019095.html
Re: Skewed IDF in multi lingual index, again
I’ve occasionally considered using Unicode language tags (U+E001 and friends) on each term. That would make a term specific to a language, so we would get [en]LaserJet, [fr]LaserJet, [de]LaserJet, and so on. But that is a pretty big hammer, because it restricts matches to the same language. If the entire document is in one language, might as well use a filter query for that language. The tags would work for multiple languages in one document.

Maybe make the untagged term a synonym. For cross-language terms like “LaserJet”, the untagged one would have worse idf.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Nov 30, 2017, at 8:14 AM, Markus Jelsma wrote:
>
> Hello,
>
> We already discussed this problem five years ago [1]. In short: documents in
> foreign languages are scored higher for some terms.
>
> It was solved back then by using docCount instead of maxDoc when calculating
> idf, it worked really well! But, probably due to index changes, the problem
> is back for some terms, mostly proper nouns, well, just like five years ago.
>
> We already deboost documents by 0.7 that are not in the user's preference
> language but in some cases it is not enough. I can go on by reducing that
> boost but that's not what i prefer.
>
> I'd like to know if there are additional tricks to solve the problem.
>
> Many thanks!
> Markus
>
> [1] http://lucene.472066.n3.nabble.com/Skewed-IDF-in-multi-lingual-index-td4019095.html
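The tagging idea in this message can be sketched as a simple index-time expansion. This is an illustration under assumptions: it uses the bracket notation from the message in place of actual Unicode tag characters, the class and method names are hypothetical, and a real implementation would live in an analysis chain rather than in plain string handling.

```java
import java.util.List;

/** Sketch of language-tagged terms with the untagged form kept as a synonym. */
final class LanguageTaggedTerms {

    /** Builds the language-specific form of a term, e.g. "[en]LaserJet". */
    static String tag(String language, String term) {
        return "[" + language + "]" + term;
    }

    /** Index-time expansion: the tagged term first, then the untagged term as a synonym. */
    static List<String> expand(String language, String term) {
        return List.of(tag(language, term), term);
    }

    public static void main(String[] args) {
        System.out.println(expand("en", "LaserJet"));
        System.out.println(expand("fr", "LaserJet"));
    }
}
```

Because every language contributes to the document frequency of the untagged form, while each tagged form is counted only within its own language, the untagged form ends up with the lower idf, as the message notes.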
Skewed IDF in multi lingual index, again
Hello,

We already discussed this problem five years ago [1]. In short: documents in foreign languages are scored higher for some terms.

It was solved back then by using docCount instead of maxDoc when calculating idf, it worked really well! But, probably due to index changes, the problem is back for some terms, mostly proper nouns, well, just like five years ago.

We already deboost documents by 0.7 that are not in the user's preference language but in some cases it is not enough. I can go on by reducing that boost but that's not what i prefer.

I'd like to know if there are additional tricks to solve the problem.

Many thanks!
Markus

[1] http://lucene.472066.n3.nabble.com/Skewed-IDF-in-multi-lingual-index-td4019095.html
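A small sketch of why a fixed 0.7 deboost is sometimes not enough: a foreign-language hit still ranks first whenever its raw score exceeds the preferred-language score divided by 0.7 (about 1.43 times higher). The raw scores below are hypothetical, and the class and method names are made up for the example.

```java
/** Sketch: when does a fixed deboost fail to keep the preferred language on top? */
final class DeboostSketch {

    /** Multiplier applied to documents outside the user's preferred language. */
    static final double FOREIGN_DEBOOST = 0.7;

    /** Final score after the language preference is applied. */
    static double adjusted(double rawScore, boolean inPreferredLanguage) {
        return inPreferredLanguage ? rawScore : rawScore * FOREIGN_DEBOOST;
    }

    /** True if the foreign-language document still outranks the preferred-language one. */
    static boolean foreignStillWins(double preferredRaw, double foreignRaw) {
        return adjusted(foreignRaw, false) > adjusted(preferredRaw, true);
    }

    public static void main(String[] args) {
        // Hypothetical raw scores; the foreign hit is inflated by a high idf for a proper noun.
        System.out.println(foreignStillWins(5.0, 6.0)); // 6.0 * 0.7 is below 5.0
        System.out.println(foreignStillWins(5.0, 8.0)); // 8.0 * 0.7 is above 5.0
    }
}
```

Lowering the multiplier moves that threshold, but it also pushes down the very relevant foreign-language documents that are meant to stay visible.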