Hi,

Every document has an internal Lucene doc ID attached to it at the Lucene level. When the scores of two documents are the same, the results are sorted by this internal ID and the response is returned.
This was the implementation when I checked some time ago, if I remember correctly.

Regards,
Kshitij

On Wed, 20 Nov 2024 at 6:21 PM Saksham Gupta <saksham.gu...@indiamart.com.invalid> wrote:

> Hi All,
>
> I verified the hypothesis shared by Deepak in this mail thread for a few
> cases where the indexed date was different and where it was the same, and
> the hypothesis turned out to be accurate for all of them.
>
> *Summarising the Hypothesis:* Solr handles duplicate documents [documents
> present in multiple shards] by preferring the document that is oldest
> according to indexed date; if the indexed date is the same, it compares
> _version_, and the document with the higher version is displayed.
>
> Recently, however, I found a document that does not follow the above
> hypothesis: the indexed date for the document [present on 2 shards] is the
> same on both shards, yet the copy with the lower _version_ is being
> displayed [contrary to the above hypothesis]. To check whether the visible
> version is correct, I filtered each copy by its version:
>
> 1. [query: fq=id:{document-copy1-id} AND _version_:{document-copy1-version}],
> 2. [query: fq=id:{document-copy2-id} AND _version_:{document-copy2-version}];
>
> and found that one document is not displayed if we add an fq on version.
>
> *How does Solr set the _version_ field? Is there a possibility that the
> version displayed is incorrect? Does Solr maintain a different version
> internally which can differ from the one visible?*
> *Is this the reason why the above hypothesis is failing?*
>
> I would appreciate any help regarding Solr's duplicate handling and my
> aforementioned doubts!
>
> On Thu, Aug 1, 2024 at 4:38 PM Saksham Gupta <saksham.gu...@indiamart.com>
> wrote:
>
> > Hi Deepak,
> >
> > Thanks for digging out such a detailed answer for my query. I did observe
> > that the documents indexed earlier were the ones being displayed, but
> > could not find any relevant documentation supporting this.
> >
> > However, I could not understand the nuances pointed out in point 4. What
> > do we mean by `If a commit happens between the first and second phase of
> > the distributed search`? What are the first and second phases here, and
> > what issue will it cause?
> >
> > On Wed, Jul 31, 2024 at 12:24 PM Deepak Goel <deic...@gmail.com> wrote:
> >
> >> *Answer from Copilot:*
> >>
> >> Ah, the intricate dance of Solr shards and their cosmic collisions! Let’s
> >> unravel this like a digital detective, shall we? 🕵️
> >>
> >> When it comes to Solr and its distributed architecture, handling
> >> duplicate documents across shards can be as tricky as juggling flaming
> >> torches while riding a unicycle. But fear not: I’ve got some insights for
> >> you:
> >>
> >> 1. *Duplicate Documents and Shards:*
> >>    - Imagine our document, a digital doppelgänger, migrating from one
> >>      shard to another. It’s like a restless soul seeking a new home.
> >>    - During this transition, both shards might harbor copies of the same
> >>      document. They’re like twins separated at birth, each vying for the
> >>      spotlight.
> >> 2. *The Solr Query Showdown:*
> >>    - Now, let’s stage a Solr query duel. Our query gallops across the
> >>      shards, demanding answers.
> >>    - If our document is the top-ranked contender in both shards, who
> >>      emerges victorious? 🏆
> >> 3. *The Winner Takes It All (Sort of):*
> >>    - Solr, being the wise oracle it is, follows a simple rule: *“First
> >>      come, first served.”*
> >>    - When Solr discovers duplicate document IDs during distributed
> >>      searching, it selects the *first document* it encounters and
> >>      discards subsequent ones. It’s like a cosmic game of “finders
> >>      keepers.”
> >>    - So, whichever shard’s copy of the document was indexed first (the
> >>      early bird with the freshest ink) takes the spotlight. The other
> >>      copy bows out gracefully.
> >> 4. *The Momentary Sync Shimmy:*
> >>    - But wait!
> >>      There’s a twist. If a commit happens between the first and second
> >>      phase of the distributed search, the index might shimmy out of sync
> >>      for a moment.
> >>    - Picture this: Shard A says, “I’ve got the document!” Shard B says,
> >>      “No, I’ve got it!” And Solr, in its infinite wisdom, says, “Hold my
> >>      query, folks, I need to sync up.”
> >>    - Eventually, harmony is restored, and the universe aligns itself. But
> >>      for that brief moment, Solr juggles realities like a cosmic circus
> >>      performer.
> >> 5. *The Shard Key Sorcery:*
> >>    - Remember the shard key? It’s like Solr’s secret handshake. You can
> >>      use it to influence how documents are distributed across shards.
> >>    - For example, if you want to spread documents related to a specific
> >>      customer (let’s say “IBM”) across multiple shards, you can use a
> >>      syntax like this: "shard_key/num!document_id". The /num part
> >>      determines how many bits from the shard key contribute to the
> >>      composite hash
> >>      <https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html>.
> >> 6. *Balance and Scalability:*
> >>    - To prevent hotspots, distribute documents evenly across shards.
> >>      Balance is key!
> >>    - Choose shard keys that reflect your data’s access patterns. Think of
> >>      them as Solr’s cosmic compass.
> >>    - And maintain flexibility: consider using composite IDs for easier
> >>      scalability. It’s like Solr’s way of saying, “Why settle for one
> >>      shard when you can have a whole constellation?”
> >>
> >> So, in the grand Solr arena, the early bird document wins the query race.
> >> But remember, even in the digital cosmos, duplicates play by the rules,
> >> mostly.
> >>
> >> Deepak
> >>
> >> On Mon, Jul 29, 2024 at 10:11 PM Saksham Gupta
> >> <saksham.gu...@indiamart.com.invalid> wrote:
> >>
> >> > Hi Solr Developers,
> >> >
> >> > Which Solr document will be displayed if duplicate instances of the
> >> > same document are present?
> >> >
> >> > In our current Solr architecture, a document can move from one Solr
> >> > shard to another. While the document will eventually be deleted from
> >> > its old shard, there will be some period during which multiple
> >> > instances of this document are present.
> >> >
> >> > Now, if a Solr query executes on both these shards and this document
> >> > is the top-ranked document on both shards, which copy will be
> >> > returned in the Solr result?
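
The "first copy encountered wins" behaviour discussed in this thread can be sketched in a few lines of Python. This is only an illustration of the hypothesis, not Solr's actual merge code: the function name `merge_shard_hits`, the (doc_id, score) tuples, and the shard names are hypothetical, and the order in which shard responses are processed is assumed to be simply the order of the input list.

```python
from typing import Dict, List, Tuple

# A merged hit: (doc_id, score, shard_name). All names here are hypothetical.
Hit = Tuple[str, float, str]


def merge_shard_hits(
    shard_responses: List[Tuple[str, List[Tuple[str, float]]]],
    rows: int,
) -> List[Hit]:
    """Merge per-shard top hits, keeping only the first copy of each doc_id.

    shard_responses is a list of (shard_name, [(doc_id, score), ...]) in the
    order the coordinator is assumed to process them.
    """
    seen: Dict[str, str] = {}  # doc_id -> shard whose copy was kept
    merged: List[Hit] = []
    for shard, hits in shard_responses:
        for doc_id, score in hits:
            if doc_id in seen:
                # Duplicate unique key: the earlier copy wins, this one is dropped.
                continue
            seen[doc_id] = shard
            merged.append((doc_id, score, shard))
    # Python's sort is stable, so hits with equal scores keep encounter order.
    merged.sort(key=lambda hit: -hit[1])
    return merged[:rows]


if __name__ == "__main__":
    responses = [
        ("shardB", [("doc1", 3.0), ("doc7", 1.0)]),
        ("shardA", [("doc1", 3.0), ("doc2", 2.0)]),
    ]
    for hit in merge_shard_hits(responses, rows=3):
        print(hit)
```

In this sketch the copy of `doc1` from `shardB` is kept only because `shardB` appears first in the input. If the processing order of shard responses varies between requests, the surviving copy could vary too, which would be one possible explanation for results that do not match the indexed-date and _version_ hypothesis.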