It is challenging, as relevance performance is highly dependent on the use case and domain (there's no single globally perfect relevance solution). But a good set of metrics showing *generally* how stock Solr performs across a reasonable set of verticals would be nice.
My philosophy about Lucene-based search is that it's not a solution, but rather a framework that should have sane defaults plus a large amount of configurability. For example, I'm not sure there's a globally "right" answer to maxDoc vs docCount.

Problems with docCount come into play when a field is usually empty in the corpus but occasionally filled out. This creates a strong bias against matches in that usually-empty field, where previously a match in that field was weighted very highly. For example, suppose a product catalog has a rarely used, user-editable tags field alongside a product description:

Product Name: Nice Pants!
Product Description: Come wear these pants!
Tags: [blue] [acid-wash]

Product Name: Acid Wash Pants
Product Description: Come wear these pants!
Tags: (empty)

In this case, the IDF for the acid-wash match in tags is very low using docCount, whereas with maxDocs it was very high. I'm not sure what the right answer is, but there is often a desire for more complete docs to be boosted much higher, which the maxDocs method does.

Another case where docCount can be a problem is copy fields: with copy fields, you care that the original field had terms, even if for some reason they were removed in the analysis chain. This can happen with some methods we use for simple entity extraction.

Further, the definitions of BM25 etc. rely on corpus-level document frequency for a term and don't have a concept of fields. BM25F can mostly be implemented with BlendedTermQuery, which blends doc frequencies across fields: http://opensourceconnections.com/blog/2016/10/19/bm25f-in-lucene/

On Tue, Dec 5, 2017 at 10:28 AM alessandro.benedetti <a.benede...@sease.io> wrote:
> Thanks Yonik and thanks Doug.
>
> I agree with Doug about adding a few generic test corpora that Jenkins
> automatically runs some metrics on, to verify that Apache Lucene/Solr
> changes don't affect a golden truth too much.
> This can of course be very complex, but I think it is a direction the
> Apache Lucene/Solr community should work on.
>
> Given that, I do believe that in this case, moving from maxDocs (field
> independent) to docCount (field dependent) was a good move (and this
> specific multi-language use case is an example).
>
> Actually, I also believe that theoretically docCount (field dependent) is
> still better than maxDocs (field independent).
> This is because docCount (field dependent) represents a state in time
> associated with the current index, while maxDocs represents a historical
> consideration.
> A corpus of documents can change over time, and how rare a term is can
> change drastically (let's pick a highly dynamic domain such as news).
>
> Doug, were you able to generalise and abstract any considerations from
> what happened to your customers, and why they got regressions moving from
> maxDocs to docCount (field dependent)?
>
>
> -----
> ---------------
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

--
Consultant, OpenSource Connections. Contact info at http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)
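The maxDocs vs docCount bias described in the thread can be sketched numerically. This is a minimal illustration using Lucene's BM25 IDF formula, log(1 + (N - df + 0.5) / (df + 0.5)); the corpus counts (a million products, a hundred with tags) are hypothetical:

```python
import math

def bm25_idf(n_docs, doc_freq):
    """Lucene's BM25 IDF: log(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical corpus: 1,000,000 products, only 100 of which have any
# tags, and "acid-wash" appears in the tags of 10 of them.
max_docs = 1_000_000   # every document in the index
doc_count = 100        # documents with at least one value in "tags"
df = 10                # documents whose "tags" contains "acid-wash"

idf_max_docs = bm25_idf(max_docs, df)    # ~11.46: a tags match dominates
idf_doc_count = bm25_idf(doc_count, df)  # ~2.26: a tags match is ordinary

print(f"maxDocs IDF:  {idf_max_docs:.2f}")
print(f"docCount IDF: {idf_doc_count:.2f}")
```

With maxDocs as N, a match in the sparse tags field looks extremely rare corpus-wide and is boosted heavily; with docCount, the same match is only "rare among documents that have tags at all", which is the regression customers noticed.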
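The BlendedTermQuery idea can be modeled with a short sketch as well. This simplifies what Lucene actually does, and the field names and doc frequencies are hypothetical; the point is that scoring every field with a shared (here, maximum) doc frequency keeps a term that happens to be rare in one field from getting an outsized per-field IDF there:

```python
import math

def bm25_idf(n_docs, doc_freq):
    """Lucene's BM25 IDF: log(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

n_docs = 10_000  # hypothetical index size
# Hypothetical per-field doc frequencies for the term "pants":
field_dfs = {"name": 50, "description": 500, "tags": 2}

# Un-blended: each field scores with its own df, so the rare "tags"
# occurrence gets a huge IDF even though "pants" is common overall.
unblended = {f: bm25_idf(n_docs, df) for f, df in field_dfs.items()}

# Blended (roughly the BlendedTermQuery approach): every field scores
# with the same doc frequency, here the max across the blended fields.
blended_df = max(field_dfs.values())
blended = {f: bm25_idf(n_docs, blended_df) for f in field_dfs}

for f in field_dfs:
    print(f"{f}: unblended={unblended[f]:.2f}  blended={blended[f]:.2f}")
```

Under blending, a "pants" match in tags no longer out-scores a match in the description purely because tags is sparsely populated, which is the per-field IDF distortion BM25F is designed to avoid.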