On 11/28/2019 1:30 AM, Dwane Hall wrote:
I asked a question on the forum a couple of weeks ago regarding cursorMark
duplicates. I initially thought it may be due to HDFSCaching because I was
unable to replicate the issue on local indexes but unfortunately the dreaded
duplicates have returned!! For a refresher I was seeing what I thought were
duplicate documents appearing randomly on the last page of one cursor, and the
first page of the next. So if rows=50 the duplicates are document 50 on page 1
and document 1 on page 2.
After further investigation I don't actually believe these documents are duplicates but
the same document being returned from a different replica on each page. After running a
diff on the two documents the only difference is the field "Solr_Update_Date"
which I insert on each document as it is inserted into the corpus.
This is how the managed-schema mapping for this field looks:
<field name="Solr_Update_Date" type="rdate" indexed="true" stored="true"
default="NOW" />
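This can happen with SolrCloud using NRT replicas. Each NRT replica
indexes the document independently, so a schema default of NOW is
evaluated separately on each replica and the stored values can differ.
The default replica type is NRT. Based on the core names returned by the
[shard] field in your responses, it looks like you do have NRT replicas.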
There are two solutions. The better solution is to use
TimestampUpdateProcessorFactory for setting your timestamp field instead
of a default of NOW in the schema. An alternate solution is to use
TLOG/PULL replica types instead of NRT -- that way replicas are
populated by copying exact index contents instead of independently indexing.
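
For the first option, a minimal sketch of what the solrconfig.xml change
might look like is below. The chain name "add-timestamp" is only a
placeholder, and making the chain the default is an assumption about your
setup; adjust both to fit your config. The idea is that the timestamp is
set once, before the update is distributed to the replicas, so every
replica stores the same value.

```xml
<!-- Sketch only: the chain name "add-timestamp" is hypothetical. -->
<updateRequestProcessorChain name="add-timestamp" default="true">
  <!-- Fills in Solr_Update_Date with the current time if the document
       does not already have a value for it. -->
  <processor class="solr.TimestampUpdateProcessorFactory">
    <str name="fieldName">Solr_Update_Date</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <!-- The distributed update processor is added automatically just
       before RunUpdateProcessorFactory when it is not listed. -->
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With something like this in place, the default="NOW" attribute would be
removed from the field definition in the schema so that replicas no
longer fill in their own value.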
Thanks,
Shawn