On 5/23/2013 1:51 AM, Luis Cappa Banda wrote:
> I've queried each Solr shard server one by one and the total number of
> documents is correct. However, when I change the rows parameter from 10 to
> 100, the total numFound of documents changes:

I've seen this problem on the list before, and each time the cause has
turned out to be documents with the same uniqueKey value appearing in
more than one shard.
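
If you want to check for this, here is a minimal sketch in Python.  It
assumes you have already exported the uniqueKey values from each shard
into plain lists (for example by querying each shard on its own and
asking only for the uniqueKey field).  The function name and the shard
labels are made up for illustration; nothing here is part of Solr.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_cross_shard_duplicates(ids_by_shard: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Return {uniqueKey: [shards...]} for every key present in more than one shard.

    ids_by_shard maps a shard label (hypothetical, pick your own) to the
    uniqueKey values exported from that shard.
    """
    shards_for_id = defaultdict(set)
    for shard, ids in ids_by_shard.items():
        for doc_id in ids:
            shards_for_id[doc_id].add(shard)
    # Keep only the keys that live on two or more shards.
    return {
        doc_id: sorted(shards)
        for doc_id, shards in shards_for_id.items()
        if len(shards) > 1
    }


# Made-up sample data: "doc-7" was indexed into two shards by mistake.
sample = {
    "shard1": ["doc-1", "doc-2", "doc-7"],
    "shard2": ["doc-3", "doc-7"],
    "shard3": ["doc-4", "doc-5"],
}
duplicates = find_cross_shard_duplicates(sample)
```

An empty result means no uniqueKey value is shared between shards, at
least in the data you exported.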

What I think happens here:

With rows=10, you get the top ten docs from each of the three shards,
and each shard sends its numFound for that query to the core that's
coordinating the search.  The coordinator adds up numFound, looks
through those thirty docs, and arranges them according to the requested
sort order, returning only the top 10.  In this case, there happen to be
no duplicates.

With rows=100, you get a total of 300 docs.  This time, duplicates are
found and removed by the coordinator.  I think that the coordinator
adjusts the total numFound by the number of duplicate documents it
removed, in an attempt to be more accurate.
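
To make that guess concrete, here is a toy simulation in Python.  It is
not Solr code, and the numFound adjustment in it is my assumption about
what the coordinator does, not something I have confirmed in the
source.  All names (shard_response, coordinate, the sample documents)
are hypothetical.

```python
from typing import Dict, List, Tuple

Doc = Tuple[str, float]  # (uniqueKey, score)


def shard_response(docs: List[Doc], rows: int) -> Dict:
    """What one shard sends back: its full match count and its top `rows` docs."""
    top = sorted(docs, key=lambda d: d[1], reverse=True)[:rows]
    return {"numFound": len(docs), "docs": top}


def coordinate(shards: List[List[Doc]], rows: int) -> Dict:
    """Merge per-shard responses the way I suspect the coordinator does."""
    responses = [shard_response(s, rows) for s in shards]
    num_found = sum(r["numFound"] for r in responses)
    best_score: Dict[str, float] = {}
    for r in responses:
        for key, score in r["docs"]:
            if key in best_score:
                # ASSUMPTION: each duplicate the coordinator actually sees
                # lowers numFound by one.
                num_found -= 1
                best_score[key] = max(best_score[key], score)
            else:
                best_score[key] = score
    merged = sorted(best_score.items(), key=lambda d: d[1], reverse=True)[:rows]
    return {"numFound": num_found, "docs": merged}


def make_shard(prefix: str, with_dup: bool) -> List[Doc]:
    """Twenty high-scoring docs, plus an optional low-scoring shared doc."""
    docs = [(f"{prefix}{i}", 1.0 - i * 0.01) for i in range(20)]
    if with_dup:
        docs.append(("dup", 0.05))  # same uniqueKey on two shards
    return docs


shards = [make_shard("a", True), make_shard("b", True), make_shard("c", False)]
small = coordinate(shards, rows=10)   # "dup" is outside each shard's top 10
large = coordinate(shards, rows=100)  # both copies of "dup" reach the coordinator
```

With this data the three shards report 21, 21 and 20 matches.  At
rows=10 the duplicate never reaches the coordinator, so numFound is the
plain sum, 62.  At rows=100 both copies arrive, one is dropped, and
numFound comes out as 61.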

I don't know whether adjusting numFound when duplicates are found in a
sharded query is the right thing to do; I'll leave that for smarter
people.  Perhaps Solr should return a message with the results saying
that duplicates were found, and if a config option is not enabled, the
server should throw an exception and return a 4xx HTTP error code.  One
idea for a config parameter name would be allowShardDuplicates, but
something better can probably be found.

Thanks,
Shawn
