It is challenging because relevance performance is highly dependent on the
use case and domain (there's no single globally perfect relevance
solution). But a good set of metrics showing *generally* how stock Solr
performs across a reasonable set of verticals would be nice.

My philosophy about Lucene-based search is that it's not a solution, but
rather a framework that should have sane defaults but large amounts of
configurability.

For example, I'm not sure there's a globally "right" answer in the maxDocs
vs. docCount debate.

Problems with docCount come into play when a field is usually empty in a
corpus but occasionally filled out. This creates a strong bias against
matches in that usually-empty field, where previously a match in that
field was weighted very highly.

For example, if a product catalog has a user-editable tag field that is
rarely used, and a product description, such as

Product Name: Nice Pants!
Product Description: Come wear these pants!
Tags: [blue] [acid-wash]

Product Name: Acid Wash Pants
Product Description: Come wear these pants!
Tags: (empty)

In this case, the IDF for the acid-wash match in tags is very low using
docCount, whereas with maxDocs it was very high. I'm not sure what the
right answer is, but there is often a desire for more complete docs to be
boosted much higher, which the maxDocs method does.

Another case where docCount can be a problem is copy fields: with copy
fields, you care that the original field had terms, even if for some
reason they were removed in the analysis chain. This can happen with some
methods we use for simple entity extraction.
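As a sketch of that copy-field situation (field and type names below are
illustrative, not from any real schema):

```xml
<!-- schema.xml sketch -->
<field name="tags" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="catch_all" type="text_filtered" indexed="true" stored="false" multiValued="true"/>
<copyField source="tags" dest="catch_all"/>
<!-- If text_filtered's analysis chain strips all of a document's tokens
     (e.g. during entity extraction), that document no longer counts toward
     catch_all's docCount, even though the original field had terms. -->
```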

Further, the definitions of BM25, etc. rely on corpus-level document
frequency for a term and don't have a concept of fields. BM25F can mostly
be implemented with BlendedTermQuery, which blends doc frequencies across
fields:
http://opensourceconnections.com/blog/2016/10/19/bm25f-in-lucene/
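A rough sketch of the BlendedTermQuery approach (field names are
hypothetical; requires lucene-core and lucene-queries on the classpath):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.BlendedTermQuery;
import org.apache.lucene.search.Query;

public class BlendedSketch {
    // Build a query for the same term across several fields so they share
    // a blended document frequency rather than each field's own df.
    public static Query blended(String text) {
        BlendedTermQuery.Builder builder = new BlendedTermQuery.Builder();
        builder.add(new Term("name", text));
        builder.add(new Term("description", text));
        builder.add(new Term("tags", text));
        return builder.build();
    }
}
```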


On Tue, Dec 5, 2017 at 10:28 AM alessandro.benedetti <a.benede...@sease.io>
wrote:

> Thanks Yonik and thanks Doug.
>
> I agree with Doug on adding a few generic test corpora that Jenkins
> automatically runs metrics on, to verify that Apache Lucene/Solr changes
> don't affect a golden truth too much.
> This can of course be very complex, but I think it is a direction the
> Apache Lucene/Solr community should work on.
>
> Given that, I do believe that in this case, moving from maxDocs (field
> independent) to docCount (field dependent) was a good move (and this
> specific multi-language use case is an example).
>
> Actually, I also believe that theoretically docCount (field dependent) is
> still better than maxDocs (field independent).
> This is because docCount (field dependent) represents the current state
> of the index, while maxDocs represents a historical accumulation.
> A corpus of documents can change over time, and how rare a term is can
> change drastically (consider a highly dynamic domain such as news).
>
> Doug, were you able to generalise and abstract any conclusions from what
> happened to your customers, and why they saw regressions moving from
> maxDocs to docCount (field dependent)?
>
>
>
>
> -----
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
>
-- 
Consultant, OpenSource Connections. Contact info at
http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)
