The thing to keep in mind is that without a fully deterministic sort, 
the underlying problem statement "doc may appear on multiple pages" can 
exist even in a single node solr index, even if no documents are 
added/deleted between page requests: because background merges / 
searcher re-opening may happen in between those page requests.

The best practice, if you really care about ensuring no (non-updated) doc 
is ever returned twice in subsequent pages, is to use a fully 
deterministic sort, with a "tie breaker" clause that is unique to every 
document (i.e. the uniqueKey field).
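
As a rough illustration of the tie-breaker idea, here is a minimal Python 
sketch that appends a uniqueKey clause to whatever sort the client asked 
for before building the paging params. The helper names (with_tie_breaker, 
page_params) are made up for this example, "id" is only assumed to be the 
uniqueKey field of your schema, and the comma split assumes plain field 
sorts (no function queries containing commas). Only the standard q / sort 
/ start / rows request params are used.

```python
def with_tie_breaker(sort, unique_key="id"):
    """Return `sort` with a trailing '<unique_key> asc' clause if missing.

    Hypothetical helper: assumes simple 'field direction' clauses
    separated by commas, and that `unique_key` is the schema's uniqueKey.
    """
    clauses = [c.strip() for c in sort.split(",") if c.strip()]
    # The first token of each clause is the field being sorted on.
    fields = [c.split()[0] for c in clauses]
    if unique_key not in fields:
        clauses.append(f"{unique_key} asc")
    return ", ".join(clauses)


def page_params(q, sort, page, rows=10, unique_key="id"):
    """Build request params for a zero-based `page` of `rows` results."""
    return {
        "q": q,
        "sort": with_tie_breaker(sort, unique_key),
        "start": page * rows,
        "rows": rows,
    }


if __name__ == "__main__":
    # Third page (zero-based page 2) of a score-ranked query.
    print(page_params("*:*", "score desc", page=2))
```

The resulting dict can be handed to whatever HTTP client you use to call 
the select handler; the point is only that every page request carries the 
same fully deterministic sort.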



: Date: Wed, 29 Mar 2017 23:14:22 +0300
: From: Mikhail Khludnev <m...@apache.org>
: Reply-To: solr-user@lucene.apache.org
: To: solr-user <solr-user@lucene.apache.org>
: Subject: Re: Pagination bug? when sorting by a field (not unique field)
: 
: Great explanation, Alessandro!
: 
: Let me briefly explain my experience. I have a tiny test with 2 shards and
: 2 replicas, indexing about a hundred docs. When I fully paginate
: search results with score ranking, I get duplicates across pages. The
: reason is deletes, which probably occur due to update/failover. Every
: paging request lands on a different replica. There are a few workarounds:
: send consecutive requests to the same replicas; also <optimize> fixes
: duplicates; but tie-breaking is the best way for sure.
: 
: On Wed, Mar 29, 2017 at 7:10 PM, alessandro.benedetti <a.benede...@sease.io>
: wrote:
: 
: > The reason Mikhail mentioned that, is probably related to :
: >
: > *The way how number of document calculated is changed (LUCENE-6711)*
: > /The number of documents (docCount) is used to calculate term specificity
: > (idf) and average document length (avdl). Prior to LUCENE-6711,
: > collectionStats.maxDoc() was used for the statistics. Now,
: > collectionStats.docCount() is used whenever possible, if not maxDocs() is
: > used.
: > Assume that a collection contains 100 documents, and 50 of them have
: > "keywords" field. In this example, maxDocs is 100 while docCount is 50 for
: > the "keywords" field. The total number of tokens for "keywords" field is
: > divided by docCount to obtain avdl. Therefore, docCount which is the total
: > number of documents that have at least one term for the field, is a more
: > precise metric for optional fields.
: > DefaultSimilarity does not leverage avdl, so this change would have
: > relatively minor change in the result list. Because relative idf values of
: > terms will remain same. However, when combined with other factors such as
: > term frequency, relative ranking of documents could change. Some Similarity
: > implementations (such as the ones instantiated with NormalizationH2 and
: > BM25) take account into avdl and would have notable change in ranked list.
: > Especially if you have a collection of documents with varying lengths.
: > Because NormalizationH2 tends to punish documents longer than avdl./
: >
: > This means that if you are load balancing, the page 2 query could go to
: > another replica, where the doc is scored differently, ending up on a
: > different position ( and maybe appearing again as a final effect).
: > This scenario is referred to scored ranking, so it will not affect sorting
: > (and I believe in your initial mail you were referring not to sorting)
: >
: > Cheers
: >
: >
: > Pablo wrote
: > > Mikhall,
: > >
: > > effectively maxDocs are different and also deletedDocs, but numDocs are
: > > ok.
: > >
: > > I don't really get it, but can that be the problem?
: >
: >
: >
: >
: >
: > -----
: > ---------------
: > Alessandro Benedetti
: > Search Consultant, R&D Software Engineer, Director
: > Sease Ltd. - www.sease.io
: > --
: > View this message in context:
: > http://lucene.472066.n3.nabble.com/Pagination-bug-when-sorting-by-a-field-not-unique-field-tp4327408p4327461.html
: > Sent from the Solr - User mailing list archive at Nabble.com.
: >
: 
: 
: 
: -- 
: Sincerely yours
: Mikhail Khludnev
: 

-Hoss
http://www.lucidworks.com/
