Hi

Every document has an internal Lucene doc id attached to it at the Lucene
level. When the scores of two documents are the same, the results are sorted
by this internal id and returned in that order.

This was the implementation when I checked some time ago, if I remember
correctly.
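
As a rough illustration of that tie-breaking rule, here is a minimal Python
sketch. The Hit type, its field names and the rank function are hypothetical
stand-ins, not Lucene or Solr APIs. It only assumes that hits are ordered by
descending score and that equal scores fall back to the lower internal id.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    doc_id: int   # hypothetical stand-in for the internal Lucene doc id
    score: float  # relevance score of the hit


def rank(hits: List[Hit]) -> List[Hit]:
    # Higher score first; on equal scores, the lower internal id wins.
    return sorted(hits, key=lambda h: (-h.score, h.doc_id))


hits = [Hit(doc_id=7, score=1.5), Hit(doc_id=3, score=1.5), Hit(doc_id=9, score=2.0)]
print([h.doc_id for h in rank(hits)])  # prints [9, 3, 7]
```

Note that an internal id like this is only meaningful inside one index, so
the sketch says nothing about how copies from two different shards compare.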

Regards
Kshitij
On Wed, 20 Nov 2024 at 6:21 PM Saksham Gupta
<saksham.gu...@indiamart.com.invalid> wrote:

> Hi All,
>
> I verified the hypothesis shared by Deepak in this mail thread for a few
> cases where the indexed date was different and where it was the same, and
> the hypothesis turned out accurate for all of the cases.
>
> *Summarising the Hypothesis:* Solr handles duplicate documents [documents
> present in multiple shards] by preferring the document that is oldest
> according to indexed date. If the indexed date is the same, it compares
> _version_ and displays the document with the higher version.
>
> Recently, however, I found a document that does not follow the above
> hypothesis. The indexed date for the document [present on 2 shards] is the
> same on both shards, yet the copy with the lower _version_ is the one being
> ranked [contrary to the above hypothesis]. To check whether the visible
> version is correct, I filtered for each copy based on its version:
>
> 1. [query: fq=id:{document-copy1-id} AND _version_:{document-copy1-version}],
> 2. [query: fq=id:{document-copy2-id} AND _version_:{document-copy2-version}];
> and found that one of the copies is not displayed when the fq on version
> is added.
>
> *How does Solr set the _version_ field? Is there a possibility that the
> version displayed is incorrect? Does Solr maintain a different version
> internally which can differ from the one visible?*
> *Is this the reason why the above hypothesis is failing?*
>
> I would appreciate any help regarding Solr's handling of duplicates and my
> questions above!
>
> On Thu, Aug 1, 2024 at 4:38 PM Saksham Gupta <saksham.gu...@indiamart.com>
> wrote:
>
> > Hi Deepak,
> >
> > Thanks for digging out such a detailed answer for my query. I did observe
> > that the documents indexed earlier were the ones being displayed, but
> > could not find any relevant documentation supporting this.
> >
> > However, I could not understand the nuances pointed out in point 4. What
> > do we mean by `If a commit happens between the first and second phase of
> > the distributed search`? What are the first and second phases here, and
> > what issue will it cause?
> >
> > On Wed, Jul 31, 2024 at 12:24 PM Deepak Goel <deic...@gmail.com> wrote:
> >
> >> *Answer from Copilot:*
> >>
> >>
> >> Ah, the intricate dance of Solr shards and their cosmic collisions!
> >> Let’s unravel this like a digital detective, shall we? 🕵️‍♂️
> >>
> >> When it comes to Solr and its distributed architecture, handling
> >> duplicate documents across shards can be as tricky as juggling flaming
> >> torches while riding a unicycle. But fear not, I’ve got some insights
> >> for you:
> >>
> >>    1. *Duplicate Documents and Shards:*
> >>       - Imagine our document, a digital doppelgänger, migrating from
> >>         one shard to another. It’s like a restless soul seeking a new
> >>         home.
> >>       - During this transition, both shards might harbor copies of the
> >>         same document. They’re like twins separated at birth, each
> >>         vying for the spotlight.
> >>
> >>    2. *The Solr Query Showdown:*
> >>       - Now, let’s stage a Solr query duel. Our query gallops across
> >>         the shards, demanding answers.
> >>       - If our document is the top-ranked contender in both shards, who
> >>         emerges victorious? 🏆
> >>
> >>    3. *The Winner Takes It All (Sort of):*
> >>       - Solr, being the wise oracle it is, follows a simple rule:
> >>         *“First come, first served.”*
> >>       - When Solr discovers duplicate document IDs during distributed
> >>         searching, it selects the *first document* it encounters and
> >>         discards subsequent ones. It’s like a cosmic game of “finders
> >>         keepers.”
> >>       - So, whichever shard’s copy of the document was indexed first,
> >>         the early bird with the freshest ink, takes the spotlight. The
> >>         other copy bows out gracefully.
> >>
> >>    4. *The Momentary Sync Shimmy:*
> >>       - But wait! There’s a twist. If a commit happens between the
> >>         first and second phase of the distributed search, the index
> >>         might shimmy out of sync for a moment.
> >>       - Picture this: Shard A says, “I’ve got the document!” Shard B
> >>         says, “No, I’ve got it!” And Solr, in its infinite wisdom,
> >>         says, “Hold my query, folks, I need to sync up.”
> >>       - Eventually, harmony is restored, and the universe aligns
> >>         itself. But for that brief moment, Solr juggles realities like
> >>         a cosmic circus performer.
> >>
> >>    5. *The Shard Key Sorcery:*
> >>       - Remember the shard key? It’s like Solr’s secret handshake. You
> >>         can use it to influence how documents are distributed across
> >>         shards.
> >>       - For example, if you want to spread documents related to a
> >>         specific customer (let’s say “IBM”) across multiple shards, you
> >>         can use a syntax like this: "shard_key/num!document_id". The
> >>         /num part determines how many bits from the shard key
> >>         contribute to the composite hash
> >>         <https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html>.
> >>
> >>    6. *Balance and Scalability:*
> >>       - To prevent hotspots, distribute documents evenly across shards.
> >>         Balance is key!
> >>       - Choose shard keys that reflect your data’s access patterns.
> >>         Think of them as Solr’s cosmic compass.
> >>       - And maintain flexibility: consider using composite IDs for
> >>         easier scalability. It’s like Solr’s way of saying, “Why settle
> >>         for one shard when you can have a whole constellation?”
> >>
> >> So, in the grand Solr arena, the early bird document wins the query
> >> race. But remember, even in the digital cosmos, duplicates play by the
> >> rules, mostly.
> >>
> >>
> >> Deepak
> >> "The greatness of a nation can be judged by the way its animals are
> >> treated
> >> - Mahatma Gandhi"
> >>
> >> +91 73500 12833
> >> deic...@gmail.com
> >>
> >> LinkedIn: www.linkedin.com/in/deicool
> >>
> >> "Plant a Tree, Go Green"
> >>
> >> Make In India : http://www.makeinindia.com/home
> >>
> >>
> >> On Mon, Jul 29, 2024 at 10:11 PM Saksham Gupta
> >> <saksham.gu...@indiamart.com.invalid> wrote:
> >>
> >> > Hi Solr Developers,
> >> >
> >> > Which solr document will be displayed if a duplicate instance of the
> >> > same document is present?
> >> >
> >> > In our current solr architecture, there is a possibility that a
> >> > document can move from one solr shard to another shard. While the
> >> > document will eventually be deleted from its old shard, there will be
> >> > some duration where multiple instances of this document will be
> >> > present.
> >> >
> >> > Now, if a solr query executes on both these shards and this document
> >> > is the top ranked document from both the shards, which document will
> >> > be returned in solr result?
> >> >
> >>
> >
>
