Before I open a JIRA, I wanted to solicit feedback on what I'm seeing
and on what Solr should be doing. I've indexed the following 8 docs into
a 2-shard collection (Solr 4.8'ish - internal custom branch roughly
based on 4.8). Notice that the 3 grandchildren of 2-1 all have the same
id (2-1-1):

[
  {
    "id":"1",
    "name":"parent",
    "_childDocuments_":[
      {
        "id":"1-1",
        "name":"child"
      },
      {
        "id":"1-2",
        "name":"child"
      }
    ]
  },
  {
    "id":"2",
    "name":"parent",
    "_childDocuments_":[
      {
        "id":"2-1",
        "name":"child",
        "_childDocuments_":[
          {
            "id":"2-1-1",
            "name":"grandchild"
          },
          {
            "id":"2-1-1",
            "name":"grandchild2"
          },
          {
            "id":"2-1-1",
            "name":"grandchild3"
          }
        ]
      }
    ]
  }
]
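
To double-check a batch like this for repeated keys before indexing, a small script along these lines could be used. This is only a sketch, not anything Solr provides: `iter_ids` and `duplicate_ids` are hypothetical helper names, and it assumes the unique key field is `id` and that children are nested under `_childDocuments_` as in the JSON above.

```python
from collections import Counter

def iter_ids(doc):
    """Yield the id of a document and of every nested child, depth first."""
    yield doc["id"]
    for child in doc.get("_childDocuments_", []):
        yield from iter_ids(child)

def duplicate_ids(docs):
    """Return {id: count} for every id that occurs more than once."""
    counts = Counter(i for d in docs for i in iter_ids(d))
    return {i: n for i, n in counts.items() if n > 1}

# Same structure as the documents indexed above.
docs = [
    {"id": "1", "name": "parent", "_childDocuments_": [
        {"id": "1-1", "name": "child"},
        {"id": "1-2", "name": "child"},
    ]},
    {"id": "2", "name": "parent", "_childDocuments_": [
        {"id": "2-1", "name": "child", "_childDocuments_": [
            {"id": "2-1-1", "name": "grandchild"},
            {"id": "2-1-1", "name": "grandchild2"},
            {"id": "2-1-1", "name": "grandchild3"},
        ]},
    ]},
]

print(duplicate_ids(docs))
```

For the batch above this reports that 2-1-1 occurs three times among the 8 docs.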

When I query this collection, using:

http://localhost:8984/solr/blockjoin2_shard2_replica1/select?q=*%3A*&wt=json&indent=true&shards.info=true&rows=10

I get:

{
  "responseHeader":{
    "status":0,
    "QTime":9,
    "params":{
      "indent":"true",
      "q":"*:*",
      "shards.info":"true",
      "wt":"json",
      "rows":"10"}},
  "shards.info":{
    "http://localhost:8984/solr/blockjoin2_shard1_replica1/|http://localhost:8985/solr/blockjoin2_shard1_replica2/":{
      "numFound":3,
      "maxScore":1.0,
      "shardAddress":"http://localhost:8984/solr/blockjoin2_shard1_replica1",
      "time":4},
    "http://localhost:8984/solr/blockjoin2_shard2_replica1/|http://localhost:8985/solr/blockjoin2_shard2_replica2/":{
      "numFound":5,
      "maxScore":1.0,
      "shardAddress":"http://localhost:8985/solr/blockjoin2_shard2_replica2",
      "time":4}},
  "response":{"numFound":6,"start":0,"maxScore":1.0,"docs":[
      {
        "id":"1-1",
        "name":"child"},
      {
        "id":"1-2",
        "name":"child"},
      {
        "id":"1",
        "name":"parent",
        "_version_":1495272401329455104},
      {
        "id":"2-1-1",
        "name":"grandchild"},
      {
        "id":"2-1",
        "name":"child"},
      {
        "id":"2",
        "name":"parent",
        "_version_":1495272401361960960}]
  }}


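So Solr has de-duped the results: the two shards report numFound of 3
and 5 (8 in total), but the merged response has numFound 6 and contains
only one of the three 2-1-1 grandchildren.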
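
The numbers are consistent with a merge step that keeps the first document seen for each unique key and subtracts one from the total for every document it drops. Here is a rough Python sketch of that idea. It is an assumption about the behavior, not Solr's actual code: `merge_shard_responses` is a hypothetical name, the unique key is assumed to be `id`, and a real merge would also deal with sorting, scores, and paging.

```python
def merge_shard_responses(shard_docs):
    """Merge per-shard doc lists, keeping the first doc seen for each id.

    shard_docs: list of lists of dicts, one inner list per shard.
    Returns (num_found, merged_docs).
    """
    seen = set()
    merged = []
    # Start from the sum of what each shard reported.
    num_found = sum(len(docs) for docs in shard_docs)
    for docs in shard_docs:
        for doc in docs:
            if doc["id"] in seen:
                # Duplicate key: drop the doc and adjust the total.
                num_found -= 1
                continue
            seen.add(doc["id"])
            merged.append(doc)
    return num_found, merged

# Shard contents as seen in the responses in this thread.
shard1 = [
    {"id": "1-1", "name": "child"},
    {"id": "1-2", "name": "child"},
    {"id": "1", "name": "parent"},
]
shard2 = [
    {"id": "2-1-1", "name": "grandchild"},
    {"id": "2-1-1", "name": "grandchild2"},
    {"id": "2-1-1", "name": "grandchild3"},
    {"id": "2-1", "name": "child"},
    {"id": "2", "name": "parent"},
]

num_found, merged = merge_shard_responses([shard1, shard2])
print(num_found, [d["id"] for d in merged])
```

With these inputs the sketch starts at 3 + 5 = 8, drops the second and third 2-1-1 docs, and ends at 6, which matches the distributed response above.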

If I execute this query against the shard that has the dupes (distrib=false):

http://localhost:8984/solr/blockjoin2_shard2_replica1/select?q=*%3A*&wt=json&indent=true&shards.info=true&rows=10&distrib=false

Then the dupes are returned:

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "indent":"true",
      "q":"*:*",
      "shards.info":"true",
      "distrib":"false",
      "wt":"json",
      "rows":"10"}},
  "response":{"numFound":5,"start":0,"docs":[
      {
        "id":"2-1-1",
        "name":"grandchild"},
      {
        "id":"2-1-1",
        "name":"grandchild2"},
      {
        "id":"2-1-1",
        "name":"grandchild3"},
      {
        "id":"2-1",
        "name":"child"},
      {
        "id":"2",
        "name":"parent",
        "_version_":1495272401361960960}]
  }}

So I guess my question is: why doesn't the non-distrib query de-dupe
as well? Mainly I want to confirm whether this is how it's supposed to
work, or whether this behavior strikes anyone else as odd ;-)

Cheers,

Tim
