Re: Cursor mark page duplicates
Thanks Shawn, you are indeed correct: these are NRT replicas! Thanks very much for the advice and possible resolutions. I went down the NRT path because in the past I've read advice from some of the Solr gurus recommending these replica types unless you have a very good reason not to use them.

I do have basic auth enabled on my SolrCloud configuration, and I believe I can't use PULL replicas until the following JIRA is resolved (https://issues.apache.org/jira/plugins/servlet/mobile#issue/SOLR-11904), as Solr uses the index replicator for this process. With this being the case I'll attempt your second suggestion and see how I go.

Thanks again for taking the time to look at this; it really was a confusing one to debug.

Have a great weekend fellow Solr users and happy Solr-ing.

Dwane

From: Shawn Heisey
Sent: Friday, 29 November 2019 4:51 AM
To: solr-user@lucene.apache.org
Subject: Re: Cursor mark page duplicates

On 11/28/2019 1:30 AM, Dwane Hall wrote:
> I asked a question on the forum a couple of weeks ago regarding cursorMark
> duplicates. I initially thought it may be due to HDFS caching because I was
> unable to replicate the issue on local indexes but unfortunately the dreaded
> duplicates have returned!! For a refresher I was seeing what I thought were
> duplicate documents appearing randomly on the last page of one cursor, and
> the first page of the next. So if rows=50 the duplicates are document 50 on
> page 1 and document 1 on page 2.
>
> After further investigation I don't actually believe these documents are
> duplicates but the same document being returned from a different replica on
> each page. After running a diff on the two documents the only difference is
> the field "Solr_Update_Date" which I insert on each document as it is
> inserted into the corpus.
>
> This is how the managed-schema mapping for this field looks
>
> default="NOW" />

This can happen with SolrCloud using NRT replicas. The default replica type is NRT. Based on the core names returned by the [shard] field in your responses, it looks like you do have NRT replicas.

There are two solutions. The better solution is to use TimestampUpdateProcessorFactory for setting your timestamp field instead of a default of NOW in the schema. An alternate solution is to use TLOG/PULL replica types instead of NRT -- that way replicas are populated by copying exact index contents instead of independently indexing.

Thanks,
Shawn
Re: Cursor mark page duplicates
On 11/28/2019 1:30 AM, Dwane Hall wrote:
> I asked a question on the forum a couple of weeks ago regarding cursorMark
> duplicates. I initially thought it may be due to HDFS caching because I was
> unable to replicate the issue on local indexes but unfortunately the dreaded
> duplicates have returned!! For a refresher I was seeing what I thought were
> duplicate documents appearing randomly on the last page of one cursor, and
> the first page of the next. So if rows=50 the duplicates are document 50 on
> page 1 and document 1 on page 2.
>
> After further investigation I don't actually believe these documents are
> duplicates but the same document being returned from a different replica on
> each page. After running a diff on the two documents the only difference is
> the field "Solr_Update_Date" which I insert on each document as it is
> inserted into the corpus.
>
> This is how the managed-schema mapping for this field looks

This can happen with SolrCloud using NRT replicas. The default replica type is NRT. Based on the core names returned by the [shard] field in your responses, it looks like you do have NRT replicas.

There are two solutions. The better solution is to use TimestampUpdateProcessorFactory for setting your timestamp field instead of a default of NOW in the schema. An alternate solution is to use TLOG/PULL replica types instead of NRT -- that way replicas are populated by copying exact index contents instead of independently indexing.

Thanks,
Shawn
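
To make the mechanism concrete, here is a minimal Python simulation of the behaviour Shawn describes. It is not Solr code. It assumes each NRT replica indexes the document independently and fills in a default of NOW at its own indexing time, whereas a timestamp assigned once before distribution is copied unchanged to every replica. The names `Replica`, `index_with_schema_default` and `index_with_preassigned_timestamp`, and the 17 ms delay, are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


class Replica:
    """Toy stand-in for one NRT replica; not a real Solr API."""

    def __init__(self, name, delay_ms):
        self.name = name
        # Hypothetical delay before this replica indexes the document.
        self.delay_ms = delay_ms
        self.docs = {}

    def now(self, base_time):
        # The moment at which this replica evaluates "NOW".
        return base_time + timedelta(milliseconds=self.delay_ms)


def index_with_schema_default(replicas, doc, base_time):
    """Each replica fills in the timestamp itself, like default="NOW"."""
    for replica in replicas:
        stored = dict(doc)
        stored["Solr_Update_Date"] = replica.now(base_time).isoformat()
        replica.docs[doc["id"]] = stored


def index_with_preassigned_timestamp(replicas, doc, base_time):
    """The timestamp is set once, before the doc reaches any replica."""
    stamped = dict(doc, Solr_Update_Date=base_time.isoformat())
    for replica in replicas:
        replica.docs[doc["id"]] = dict(stamped)


def timestamps(replicas, doc_id):
    """Distinct timestamp values stored for doc_id across all replicas."""
    return {r.docs[doc_id]["Solr_Update_Date"] for r in replicas}


base = datetime(2019, 11, 1, 0, 15, 7, 794000, tzinfo=timezone.utc)
replicas = [Replica("replica_n12", 0), Replica("replica_n14", 17)]
doc = {"id": "2019-10-29 15:15:36.748052"}

index_with_schema_default(replicas, doc, base)
print(len(timestamps(replicas, doc["id"])))  # 2: the replicas disagree

index_with_preassigned_timestamp(replicas, doc, base)
print(len(timestamps(replicas, doc["id"])))  # 1: the replicas agree
```

In this toy model the second function plays the role Shawn assigns to TimestampUpdateProcessorFactory, on the assumption that the processor runs before the document is forwarded to the replicas.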
Re: Cursor mark page duplicates
Hey guys,

I asked a question on the forum a couple of weeks ago regarding cursorMark duplicates. I initially thought it may be due to HDFS caching because I was unable to replicate the issue on local indexes but unfortunately the dreaded duplicates have returned!! For a refresher I was seeing what I thought were duplicate documents appearing randomly on the last page of one cursor, and the first page of the next. So if rows=50 the duplicates are document 50 on page 1 and document 1 on page 2.

After further investigation I don't actually believe these documents are duplicates but the same document being returned from a different replica on each page. After running a diff on the two documents the only difference is the field "Solr_Update_Date" which I insert on each document as it is inserted into the corpus.

This is how the managed-schema mapping for this field looks

The only sort parameter is the id field

    "sort":"id desc"
    rows=50

Here are the results

Document 50 on page 1 is

    {
      "responseHeader":{
        "zkConnected":true,
        "status":0,
        "QTime":8,
        "params":{
          "q":"id:\"2019-10-29 15:15:36.748052\"",
          "fl":"id,_version_,[shard],Solr_Update_Date",
          "_":"1574900506126"}},
      "response":{"numFound":1,"start":0,"maxScore":7.312953,"docs":[
          {
            "id":"2019-10-29 15:15:36.748052",
            "Solr_Update_Date":"2019-11-01T00:15:07.811Z",
            "_version_":1648956337338449920,
            "[shard]":"https://solrHost:9021/solr/my_collection_shard4_replica_n14/|https://solrHost:9022/solr/my_collection_shard4_replica_n12/"}]
      }}

Document 1 on page 2 is

    {
      "responseHeader":{
        "zkConnected":true,
        "status":0,
        "QTime":7,
        "params":{
          "q":"id:\"2019-10-29 15:15:36.748052\"",
          "fl":"id,_version_,[shard],Solr_Update_Date",
          "_":"1574900506126"}},
      "response":{"numFound":1,"start":0,"maxScore":7.822712,"docs":[
          {
            "id":"2019-10-29 15:15:36.748052",
            "Solr_Update_Date":"2019-11-01T00:15:07.794Z",
            "_version_":1648956337338449920,
            "[shard]":"https://solrHost:9022/solr/my_collection_shard4_replica_n12/|https://solrHost:9021/solr/my_collection_shard4_replica_n14/"}]
      }}

As you can see both documents have the same version number but different maxScore and Solr_Update_Date values. My understanding is the cursorMark should only be generated off the id field so I can't see why I would get a different document from a different shard at the end of one page, and the beginning of the next. Would anyone have any insight into this behaviour, as this happens randomly on page boundaries when paging through results?

Thanks for your time

Dwane

From: Dwane Hall
Sent: Monday, 11 November 2019 10:10 PM
To: solr-user@lucene.apache.org
Subject: Re: Cursor mark page duplicates

Thanks Erick/Hossman,

I appreciate your input; it's always an interesting read seeing Solr legends like yourselves work through a problem! I certainly learn a lot from following your responses in this user group.

As you recommended I ran the distrib=false query on each shard and the results were identical in both instances. Below is a snapshot from the admin UI showing the details of each shard, which all looks in order to me (other than our large number of deletes in the corpus ...we have quite a dynamic environment when the index is live)

    Last Modified: 23 days ago
    Num Docs: 47247895
    Max Doc: 68108804
    Heap Memory Usage: -1
    Deleted Docs: 20860909
    Version: 528038
    Segment Count: 41
    Master (Searching) Version: 1571148411550 Gen: 25528 Size: 42.56 GB
    Master (Replicable) Version: 1571153302013 Gen: 25529

    Last Modified: 23 days ago
    Num Docs: 47247895
    Max Doc: 68223647
    Heap Memory Usage: -1
    Deleted Docs: 20975752
    Version: 526613
    Segment Count: 43
    Master (Searching) Version: 1571148411615 Gen: 25527 Size: 42.63 GB
    Master (Replicable) Version: 1571153302076 Gen: 25528

I was however able to replicate the issue but under unusual circumstances with some crude in-browser testing. If I use a cursorMark other than "*" and constantly re-run the query (just resubmitting the url in a browser with the same cursor and query) the first result on the page toggles between the expected value, and the last item from the previous page. So if rows=50, page 2 toggles between result 51 (expected) and result 50 (the last item from the previous p
Re: Cursor mark page duplicates
Thanks Erick/Hossman,

I appreciate your input; it's always an interesting read seeing Solr legends like yourselves work through a problem! I certainly learn a lot from following your responses in this user group.

As you recommended I ran the distrib=false query on each shard and the results were identical in both instances. Below is a snapshot from the admin UI showing the details of each shard, which all looks in order to me (other than our large number of deletes in the corpus ...we have quite a dynamic environment when the index is live)

    Last Modified: 23 days ago
    Num Docs: 47247895
    Max Doc: 68108804
    Heap Memory Usage: -1
    Deleted Docs: 20860909
    Version: 528038
    Segment Count: 41
    Master (Searching) Version: 1571148411550 Gen: 25528 Size: 42.56 GB
    Master (Replicable) Version: 1571153302013 Gen: 25529

    Last Modified: 23 days ago
    Num Docs: 47247895
    Max Doc: 68223647
    Heap Memory Usage: -1
    Deleted Docs: 20975752
    Version: 526613
    Segment Count: 43
    Master (Searching) Version: 1571148411615 Gen: 25527 Size: 42.63 GB
    Master (Replicable) Version: 1571153302076 Gen: 25528

I was however able to replicate the issue but under unusual circumstances with some crude in-browser testing. If I use a cursorMark other than "*" and constantly re-run the query (just resubmitting the url in a browser with the same cursor and query) the first result on the page toggles between the expected value, and the last item from the previous page. So if rows=50, page 2 toggles between result 51 (expected) and result 50 (the last item from the previous page). It doesn't happen on every refresh, but roughly one refresh in five reproduces it, and it does so consistently (and on every subsequent cursor).

I failed to mention in my original email that we use the HdfsDirectoryFactory to store our indexes in HDFS. This configuration uses an off-heap block cache to cache HDFS blocks in memory as it is unable to take advantage of the OS disk cache. I mention this as we're currently in the process of switching to local disk and I've been unable to replicate the issue when using the local storage configuration of the same index. This may be completely unrelated, and additionally the local storage index is freshly loaded so it has not experienced the same number of deletes or updates that our HDFS indexes have.

I think my best bet is to monitor our new index configuration and if I notice any similar behaviour I'll make the community aware of my findings.

Once again,

Thanks for your input

Dwane

From: Chris Hostetter
Sent: Friday, 8 November 2019 9:58 AM
To: solr-user@lucene.apache.org
Subject: Re: Cursor mark page duplicates

: I'm using Solr's cursor mark feature and noticing duplicates when paging
: through results. The duplicate records happen intermittently and appear
: at the end of one page, and the beginning of the next (but not on all
: pages through the results). So if rows=20 the duplicate records would be
: document 20 on page1, and document 21 on page 2. The document's id come

Can you try to reproduce and show us the specifics of this including:

1) The sort param you're using
2) An 'fl' list that includes every field in the sort param
3) The returned values of every 'fl' field for the "duplicate" document you are seeing as it appears in *BOTH* pages of results -- along with the cursorMark value in use on both of those pages.

: (-MM-DD HH:MM.SS)), score. In this Solr community post
: (https://lucene.472066.n3.nabble.com/Solr-document-duplicated-during-pagination-td4269176.html)
: Shawn Heisey suggests:

...that post was *NOT* about using cursorMark -- it was plain old regular pagination, where even on a single core/replica you can see a document X get "pushed" from page#1 to page#2 by updates/additions of some other document Z that causes Z to sort "before" X.

With cursors this kind of "pushing other docs back" or "pushing other docs forward" doesn't exist because of the cursorMark. The only way a doc *should* move is if its OWN sort values are updated, causing it to reposition itself.

But, if you have a static index, then it's *possible* that the last time your document X was updated, there was a "glitch" somewhere in the distributed update process, and the update didn't succeed in some replicas -- so the same document may have different sort values on diff replicas.

: In the Solr query below for one of the example duplicates in question I
: can see a search by the id returns only a single document. The
: replication factor for the collection is 2 so the id will also appear in
: this shards replica. Taking into consideration Shawn's advice above, my

If you've already identified a particular document where this has happened, then you can also verify/disprove my hypoth
Re: Cursor mark page duplicates
: I'm using Solr's cursor mark feature and noticing duplicates when paging
: through results. The duplicate records happen intermittently and appear
: at the end of one page, and the beginning of the next (but not on all
: pages through the results). So if rows=20 the duplicate records would be
: document 20 on page1, and document 21 on page 2. The document's id come

Can you try to reproduce and show us the specifics of this including:

1) The sort param you're using
2) An 'fl' list that includes every field in the sort param
3) The returned values of every 'fl' field for the "duplicate" document you are seeing as it appears in *BOTH* pages of results -- along with the cursorMark value in use on both of those pages.

: (-MM-DD HH:MM.SS)), score. In this Solr community post
: (https://lucene.472066.n3.nabble.com/Solr-document-duplicated-during-pagination-td4269176.html)
: Shawn Heisey suggests:

...that post was *NOT* about using cursorMark -- it was plain old regular pagination, where even on a single core/replica you can see a document X get "pushed" from page#1 to page#2 by updates/additions of some other document Z that causes Z to sort "before" X.

With cursors this kind of "pushing other docs back" or "pushing other docs forward" doesn't exist because of the cursorMark. The only way a doc *should* move is if its OWN sort values are updated, causing it to reposition itself.

But, if you have a static index, then it's *possible* that the last time your document X was updated, there was a "glitch" somewhere in the distributed update process, and the update didn't succeed in some replicas -- so the same document may have different sort values on diff replicas.

: In the Solr query below for one of the example duplicates in question I
: can see a search by the id returns only a single document. The
: replication factor for the collection is 2 so the id will also appear in
: this shards replica. Taking into consideration Shawn's advice above, my

If you've already identified a particular document where this has happened, then you can also verify/disprove my hypothesis by hitting each of the replicas that hosts this document with a request that looks like...

/solr/MyCollection_shard4_replica_n12/select?q=id:FOO&distrib=false
/solr/MyCollection_shard4_replica_n35/select?q=id:FOO&distrib=false

...and compare the results to see if all field values match

-Hoss
http://www.lucidworks.com/
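
The per-replica check Hoss describes could be scripted roughly as follows. This is a sketch using only the Python standard library; `replica_select_url`, `fetch_doc` and `diff_docs` are hypothetical helper names, not part of any Solr client. It assumes the id contains no double quotes or backslashes, and `fetch_doc` does no authentication, so a cluster with basic auth enabled would need that added.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def replica_select_url(core_url, doc_id):
    """Build a non-distributed id query against a single replica core."""
    params = {"q": 'id:"%s"' % doc_id, "distrib": "false", "wt": "json"}
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def fetch_doc(core_url, doc_id):
    """Fetch the document from one replica (no authentication handled)."""
    with urlopen(replica_select_url(core_url, doc_id)) as resp:
        docs = json.load(resp)["response"]["docs"]
    return docs[0] if docs else None


def diff_docs(doc_a, doc_b):
    """Return {field: (value_a, value_b)} for every field that differs."""
    fields = set(doc_a) | set(doc_b)
    return {f: (doc_a.get(f), doc_b.get(f))
            for f in sorted(fields) if doc_a.get(f) != doc_b.get(f)}


# Field values as reported for the two replicas in the 28 November message.
from_n14 = {"id": "2019-10-29 15:15:36.748052",
            "Solr_Update_Date": "2019-11-01T00:15:07.811Z",
            "_version_": 1648956337338449920}
from_n12 = {"id": "2019-10-29 15:15:36.748052",
            "Solr_Update_Date": "2019-11-01T00:15:07.794Z",
            "_version_": 1648956337338449920}

print(diff_docs(from_n14, from_n12))
# {'Solr_Update_Date': ('2019-11-01T00:15:07.811Z', '2019-11-01T00:15:07.794Z')}
```

Against a live cluster, the two dictionaries would come from `fetch_doc` called once per replica core URL. An empty result from `diff_docs` would mean every returned field matches on both replicas.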
Re: Cursor mark page duplicates
Dwane:

Nice writeup. This is puzzling. First, theoretically the two replicas shouldn’t have any effect. Shawn’s comment was more that somehow two _different_ shards had a duplicate ID.

Do both replicas have exactly the same document count? You can find this out by “../solr/collection1_shard1_replica_n1/select?q=*:*&distrib=false”. The “distrib=false” will query _only_ the replica it’s pointed to. I’m wondering if somehow the replicas are out of sync and this is a crude test.

If you can, record the IDs when this happens and use the above trick to see whether there is anything unexpected about the returns when you look at, say, the 5 docs before the repeated one and the 5 docs after. They should, of course, be exactly the same.

You could also use the “&distrib=false” trick to pull all the IDs from the two replicas and see if they all match with a streaming expression.

If all the IDs are the same on both replicas, I haven’t a clue....

Best,
Erick

> On Nov 7, 2019, at 5:34 AM, Dwane Hall wrote:
>
> Hey Solr community,
>
> I'm using Solr's cursor mark feature and noticing duplicates when paging
> through results. The duplicate records happen intermittently and appear at
> the end of one page, and the beginning of the next (but not on all pages
> through the results). So if rows=20 the duplicate records would be document
> 20 on page 1, and document 21 on page 2. The documents' ids come from a
> database and that field is a unique primary key so I'm confident that there
> are no duplicate document ids in my corpus. Additionally no index updates
> are occurring in the index (it's completely static). My result sort order is
> id (a string representation of a timestamp (-MM-DD HH:MM.SS)), score.
> In this Solr community post
> (https://lucene.472066.n3.nabble.com/Solr-document-duplicated-during-pagination-td4269176.html)
> Shawn Heisey suggests:
>
> "There are two ways this can happen. One is that the index has changed
> between different queries, pushing or pulling results between the end of
> one page and the beginning of the next page. The other is having the
> same uniqueKey value in more than one shard."
>
> In the Solr query below for one of the example duplicates in question I can
> see a search by the id returns only a single document. The replication factor
> for the collection is 2 so the id will also appear in this shard's replica.
> Taking into consideration Shawn's advice above, my question is will having a
> shard replica still count as the document having a duplicate id in another
> shard and potentially introduce duplicates into my paged results? If not
> could anyone suggest another possible scenario where duplicates could
> potentially be introduced?
>
> As always any advice would be greatly appreciated,
>
> Thanks,
>
> Dwane
>
> Environment
> Solr cloud (7.7.2)
> 8 shard collection, replication factor 2
>
> {
>   "responseHeader":{
>     "zkConnected":true,
>     "status":0,
>     "QTime":2072,
>     "params":{
>       "q":"id:myUUID(-MM-DD HH:MM.SS)",
>       "fl":"id,[shard]"}},
>   "response":{"numFound":1,"start":0,"maxScore":17.601822,"docs":[
>       {
>         "id":"myUUID(-MM-DD HH:MM.SS)",
>         "[shard]":"https://solr1:9014/solr/MyCollection_shard4_replica_n12/|https://solr2:9011/solr/MyCollection_shard4_replica_n35/"}]
>   }}
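
Erick's two checks (do both replicas hold the same IDs, and do the documents around the repeated one line up) could be sketched in plain Python once the ID lists have been pulled from each replica with distrib=false. Erick mentions a streaming expression for the comparison; this sketch swaps in ordinary set operations instead. The function names and the sample IDs are hypothetical.

```python
def compare_id_sets(ids_a, ids_b):
    """Return (ids only in replica A, ids only in replica B), both sorted."""
    a, b = set(ids_a), set(ids_b)
    return sorted(a - b), sorted(b - a)


def neighbours(sorted_ids, doc_id, n=5):
    """Return up to n ids before and n ids after doc_id in a sorted list."""
    i = sorted_ids.index(doc_id)
    return sorted_ids[max(0, i - n):i], sorted_ids[i + 1:i + 1 + n]


# Made-up IDs standing in for the lists pulled from each replica.
replica_a = ["a", "b", "c"]
replica_b = ["b", "c", "d"]

print(compare_id_sets(replica_a, replica_b))      # (['a'], ['d'])
print(neighbours(["a", "b", "c", "d"], "c", n=1))  # (['b'], ['d'])
```

Two empty lists from `compare_id_sets` would correspond to Erick's "all the IDs are the same on both replicas" case. Running `neighbours` on each replica's sorted list and comparing the output covers his "5 docs before and 5 docs after" check.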
Cursor mark page duplicates
Hey Solr community,

I'm using Solr's cursor mark feature and noticing duplicates when paging through results. The duplicate records happen intermittently and appear at the end of one page, and the beginning of the next (but not on all pages through the results). So if rows=20 the duplicate records would be document 20 on page 1, and document 21 on page 2. The documents' ids come from a database and that field is a unique primary key so I'm confident that there are no duplicate document ids in my corpus. Additionally no index updates are occurring in the index (it's completely static). My result sort order is id (a string representation of a timestamp (-MM-DD HH:MM.SS)), score.

In this Solr community post (https://lucene.472066.n3.nabble.com/Solr-document-duplicated-during-pagination-td4269176.html) Shawn Heisey suggests:

"There are two ways this can happen. One is that the index has changed between different queries, pushing or pulling results between the end of one page and the beginning of the next page. The other is having the same uniqueKey value in more than one shard."

In the Solr query below for one of the example duplicates in question I can see a search by the id returns only a single document. The replication factor for the collection is 2 so the id will also appear in this shard's replica. Taking into consideration Shawn's advice above, my question is will having a shard replica still count as the document having a duplicate id in another shard and potentially introduce duplicates into my paged results? If not could anyone suggest another possible scenario where duplicates could potentially be introduced?

As always any advice would be greatly appreciated,

Thanks,

Dwane

Environment
Solr cloud (7.7.2)
8 shard collection, replication factor 2

    {
      "responseHeader":{
        "zkConnected":true,
        "status":0,
        "QTime":2072,
        "params":{
          "q":"id:myUUID(-MM-DD HH:MM.SS)",
          "fl":"id,[shard]"}},
      "response":{"numFound":1,"start":0,"maxScore":17.601822,"docs":[
          {
            "id":"myUUID(-MM-DD HH:MM.SS)",
            "[shard]":"https://solr1:9014/solr/MyCollection_shard4_replica_n12/|https://solr2:9011/solr/MyCollection_shard4_replica_n35/"}]
      }}
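
For readers who want to reproduce the symptom, a cursor paging loop with a page-boundary check might look like the sketch below. It relies on the documented cursorMark contract: start with cursorMark=*, pass back the nextCursorMark from each response, and stop when nextCursorMark equals the cursorMark that was sent. `fetch_page` is a hypothetical callable that would issue the real request (with a sort that includes the uniqueKey field). Here it is replaced by an in-memory fake so the example runs without a Solr instance.

```python
def iterate_cursor(fetch_page, rows=50):
    """Yield (cursorMark, docs) pages until the cursor stops advancing."""
    cursor = "*"
    while True:
        result = fetch_page(cursor, rows)
        yield cursor, result["response"]["docs"]
        next_cursor = result["nextCursorMark"]
        if next_cursor == cursor:
            break  # same mark returned: no more results
        cursor = next_cursor


def boundary_duplicates(pages):
    """Report (cursorMark, id) where a page starts with the previous page's last id."""
    dupes = []
    prev_last = None
    for cursor, docs in pages:
        if docs and prev_last is not None and docs[0]["id"] == prev_last:
            dupes.append((cursor, prev_last))
        if docs:
            prev_last = docs[-1]["id"]
    return dupes


# In-memory fake: document "b" ends page 1 and starts page 2.
FAKE_PAGES = {
    "*": {"response": {"docs": [{"id": "c"}, {"id": "b"}]}, "nextCursorMark": "m1"},
    "m1": {"response": {"docs": [{"id": "b"}, {"id": "a"}]}, "nextCursorMark": "m2"},
    "m2": {"response": {"docs": []}, "nextCursorMark": "m2"},
}


def fake_fetch(cursor, rows):
    return FAKE_PAGES[cursor]


print(boundary_duplicates(iterate_cursor(fake_fetch, rows=2)))  # [('m1', 'b')]
```

Recording the cursorMark alongside each duplicate makes it straightforward to re-run the offending page and capture the field values requested later in the thread.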