Re: Cursor mark page duplicates

Erick Erickson Thu, 07 Nov 2019 04:17:00 -0800

Dwane:

Nice writeup. This is puzzling. First, theoretically the two replicas shouldn’t 
have any effect. Shawn’s comment was more that somehow two _different_ shards 
had a duplicate ID.


Do both replicas have exactly the same document count? You can find this out 
with a query like 
“..solr/collection1_shard1_replica_n1/select?q=*:*&distrib=false”. The 
“distrib=false” parameter makes the query hit _only_ the replica it’s pointed 
to. I’m wondering if somehow the replicas are out of sync, and this is a crude 
test for that.
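
A minimal sketch of that count check in Python, in case it helps. The 
“rows=0” and “wt=json” parameters, the function names, and the injectable 
“fetch” argument are my assumptions for illustration; the core URLs you pass 
in would be the ones your “[shard]” field reports.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def replica_query_url(core_url, q="*:*", rows=0):
    """Build a /select URL that queries only the core it points at."""
    # distrib=false stops Solr from fanning the request out to other shards.
    params = {"q": q, "rows": rows, "distrib": "false", "wt": "json"}
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def num_found(response_body):
    """Extract numFound from a Solr JSON response body."""
    return json.loads(response_body)["response"]["numFound"]


def replica_counts(core_urls, fetch=None):
    """Return {core_url: numFound}. `fetch` maps a URL to response text."""
    if fetch is None:
        # Default: a plain HTTP GET (add auth/TLS handling as your setup needs).
        fetch = lambda url: urlopen(url).read().decode("utf-8")
    return {u: num_found(fetch(replica_query_url(u))) for u in core_urls}
```

If the counts returned for the two replicas of a shard differ, the replicas 
are out of sync. Matching counts do not prove they hold the same documents, 
which is why this is only a crude test.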

If you can, record the IDs when this happens and use the above trick to check 
each replica for anything unexpected in the results around the repeated doc, 
say, the 5 docs before it and the 5 docs after. They should, of course, be 
exactly the same on both replicas.
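
Once you have the ordered ID lists from each replica (same query, same sort), 
the neighbour comparison could be sketched like this. The function names and 
the “radius” parameter are hypothetical, just for illustration.

```python
def window_around(ids, repeated_id, radius=5):
    """Return the repeated ID plus up to `radius` IDs on either side of it."""
    pos = ids.index(repeated_id)  # raises ValueError if the ID is absent
    return ids[max(0, pos - radius): pos + radius + 1]


def windows_match(ids_a, ids_b, repeated_id, radius=5):
    """True when both replicas return identical neighbours around the ID."""
    return (window_around(ids_a, repeated_id, radius)
            == window_around(ids_b, repeated_id, radius))
```

A False result means the two replicas order or contain different docs near 
the page boundary, which would explain a doc showing up on both pages when 
consecutive requests land on different replicas.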

You could also use the “&distrib=false” trick to pull all the IDs from the two 
replicas and use a streaming expression to see if they all match.
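
If a streaming expression is more than you want to set up, a plain 
client-side comparison does the same job on a static index. This is a sketch 
of that alternative, not of the streaming approach; it assumes you have 
already pulled the full ID list from each replica, and the function and key 
names are made up for illustration.

```python
from collections import Counter


def diff_replica_ids(ids_a, ids_b):
    """Report IDs missing from one replica or repeated within a replica."""
    set_a, set_b = set(ids_a), set(ids_b)

    def repeats(ids):
        # IDs that occur more than once inside a single replica's list.
        return sorted(i for i, n in Counter(ids).items() if n > 1)

    return {
        "only_in_a": sorted(set_a - set_b),
        "only_in_b": sorted(set_b - set_a),
        "repeated_in_a": repeats(ids_a),
        "repeated_in_b": repeats(ids_b),
    }
```

All four lists coming back empty means the replicas hold exactly the same set 
of IDs.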

If the IDs are all the same on both replicas, I haven’t a clue…

Best,
Erick

> On Nov 7, 2019, at 5:34 AM, Dwane Hall <dwaneh...@hotmail.com> wrote:
> 
> Hey Solr community,
> 
> I'm using Solr's cursor mark feature and noticing duplicates when paging 
> through results. The duplicate records happen intermittently and appear at 
> the end of one page and the beginning of the next (but not on all pages 
> through the results). So if rows=20, the duplicate records would be document 
> 20 on page 1 and document 21 on page 2. The document ids come from a 
> database and that field is a unique primary key, so I'm confident that there 
> are no duplicate document ids in my corpus. Additionally, no index updates 
> are occurring (the index is completely static). My result sort order is 
> id (a string representation of a timestamp (YYYY-MM-DD HH:MM.SSSSSS)), score. 
> In this Solr community post 
> (https://lucene.472066.n3.nabble.com/Solr-document-duplicated-during-pagination-td4269176.html)
> Shawn Heisey suggests:
> 
> 
> "There are two ways this can happen.  One is that the index has changed
> between different queries, pushing or pulling results between the end of
> one page and the beginning of the next page.  The other is having the
> same uniqueKey value in more than one shard."
> 
> In the Solr query below, for one of the example duplicates in question, I can 
> see that a search by the id returns only a single document. The replication 
> factor for the collection is 2, so the id will also appear in this shard's 
> replica. Taking into consideration Shawn's advice above, my question is: will 
> having a shard replica still count as the document having a duplicate id in 
> another shard and potentially introduce duplicates into my paged results? If 
> not, could anyone suggest another possible scenario where duplicates could 
> be introduced?
> 
> As always any advice would be greatly appreciated,
> 
> Thanks,
> 
> Dwane
> 
> Environment
> Solr cloud (7.7.2)
> 8 shard collection, replication factor 2
> 
> {
>  "responseHeader":{
>    "zkConnected":true,
>    "status":0,
>    "QTime":2072,
>    "params":{
>      "q":"id:myUUID(YYYY-MM-DD HH:MM.SSSSSS)",
>      "fl":"id,[shard]"}},
>  "response":{"numFound":1,"start":0,"maxScore":17.601822,"docs":[
>      {
>        "id":"myUUID(YYYY-MM-DD HH:MM.SSSSSS)",
>        "[shard]":"https://solr1:9014/solr/MyCollection_shard4_replica_n12/|https://solr2:9011/solr/MyCollection_shard4_replica_n35/"}]
>  }}
>
