Before I open a JIRA, I wanted to put this out to solicit feedback on what I'm seeing and what Solr should be doing. I've indexed the following 8 docs into a 2-shard collection (Solr 4.8-ish: an internal custom branch roughly based on 4.8). Notice that the 3 grandchildren of 2-1 all have the same id (2-1-1):

[
  {
    "id":"1",
    "name":"parent",
    "_childDocuments_":[
      { "id":"1-1", "name":"child" },
      { "id":"1-2", "name":"child" }
    ]
  },
  {
    "id":"2",
    "name":"parent",
    "_childDocuments_":[
      {
        "id":"2-1",
        "name":"child",
        "_childDocuments_":[
          { "id":"2-1-1", "name":"grandchild" },
          { "id":"2-1-1", "name":"grandchild2" },
          { "id":"2-1-1", "name":"grandchild3" }
        ]
      }
    ]
  }
]

When I query this collection, using:

http://localhost:8984/solr/blockjoin2_shard2_replica1/select?q=*%3A*&wt=json&indent=true&shards.info=true&rows=10

I get:

{
  "responseHeader":{
    "status":0,
    "QTime":9,
    "params":{
      "indent":"true",
      "q":"*:*",
      "shards.info":"true",
      "wt":"json",
      "rows":"10"}},
  "shards.info":{
    "http://localhost:8984/solr/blockjoin2_shard1_replica1/|http://localhost:8985/solr/blockjoin2_shard1_replica2/":{
      "numFound":3,
      "maxScore":1.0,
      "shardAddress":"http://localhost:8984/solr/blockjoin2_shard1_replica1",
      "time":4},
    "http://localhost:8984/solr/blockjoin2_shard2_replica1/|http://localhost:8985/solr/blockjoin2_shard2_replica2/":{
      "numFound":5,
      "maxScore":1.0,
      "shardAddress":"http://localhost:8985/solr/blockjoin2_shard2_replica2",
      "time":4}},
  "response":{"numFound":6,"start":0,"maxScore":1.0,"docs":[
      {
        "id":"1-1",
        "name":"child"},
      {
        "id":"1-2",
        "name":"child"},
      {
        "id":"1",
        "name":"parent",
        "_version_":1495272401329455104},
      {
        "id":"2-1-1",
        "name":"grandchild"},
      {
        "id":"2-1",
        "name":"child"},
      {
        "id":"2",
        "name":"parent",
        "_version_":1495272401361960960}]
  }}

So Solr has de-duped the results: the two shards report numFound of 3 and 5 (8 total), but the merged response has numFound 6 and only one of the three 2-1-1 grandchildren.
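
The counts above (3 + 5 = 8 reported by the shards, 6 in the merged response) are what you would get if the distributed merge keeps one document per uniqueKey. Here is a minimal Python sketch of that idea. To be clear, this is an assumption about the behavior, not Solr's actual merge code, and merge_shard_docs is a hypothetical helper made up for illustration:

```python
from collections import OrderedDict


def merge_shard_docs(shard_doc_lists, unique_key="id"):
    """Hypothetical merge: keep only the first doc seen for each unique key.

    This is NOT Solr code; it only illustrates how keying merged results
    by uniqueKey would collapse docs that share an id.
    """
    merged = OrderedDict()
    for docs in shard_doc_lists:
        for doc in docs:
            # setdefault leaves an existing entry alone, so later dupes are dropped
            merged.setdefault(doc[unique_key], doc)
    return list(merged.values())


# Docs as each shard reports them (shard1: 3 docs, shard2: 5 docs)
shard1 = [
    {"id": "1-1", "name": "child"},
    {"id": "1-2", "name": "child"},
    {"id": "1", "name": "parent"},
]
shard2 = [
    {"id": "2-1-1", "name": "grandchild"},
    {"id": "2-1-1", "name": "grandchild2"},
    {"id": "2-1-1", "name": "grandchild3"},
    {"id": "2-1", "name": "child"},
    {"id": "2", "name": "parent"},
]

merged = merge_shard_docs([shard1, shard2])
print(len(shard1) + len(shard2))  # 8 docs across the shards
print(len(merged))                # 6 docs after keying by id
```

Under that assumption, a request that never goes through a merge step would have nothing collapsing the three 2-1-1 docs.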
If I execute this query against the shard that has the dupes (distrib=false):

http://localhost:8984/solr/blockjoin2_shard2_replica1/select?q=*%3A*&wt=json&indent=true&shards.info=true&rows=10&distrib=false

Then the dupes are returned:

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "indent":"true",
      "q":"*:*",
      "shards.info":"true",
      "distrib":"false",
      "wt":"json",
      "rows":"10"}},
  "response":{"numFound":5,"start":0,"docs":[
      {
        "id":"2-1-1",
        "name":"grandchild"},
      {
        "id":"2-1-1",
        "name":"grandchild2"},
      {
        "id":"2-1-1",
        "name":"grandchild3"},
      {
        "id":"2-1",
        "name":"child"},
      {
        "id":"2",
        "name":"parent",
        "_version_":1495272401361960960}]
  }}

So I guess my question is: why doesn't the non-distrib query de-dupe as well? I'm mainly looking to confirm that this is how it's supposed to work, and that this behavior doesn't strike anyone else as odd ;-)

Cheers,
Tim