On 10/14/2018 6:25 PM, dami...@gmail.com wrote:
> I had an issue with async backup on Solr 6.5.1 reporting that the backup was complete when clearly it was not. I was using 12 shards across 6 nodes. I only noticed this issue when one shard was much larger than the others. There were no answers here: http://lucene.472066.n3.nabble.com/async-backup-td4342776.html
One detail I thought I had written but isn't there: The backup did fully complete -- all 30 shards were in the backup location. Not a lot in each shard backup -- the collection was empty. It would be easy enough to add a few thousand documents to the collection before doing the backup.
If the backup process reports that it's done before it's ACTUALLY done, that's a bad thing. It's hard to say whether that problem is related to the problem I described. Since I haven't dived into the code, I cannot say for sure, but it honestly would not surprise me to find they are connected. Every time I try to understand Collections API code, I find it extremely difficult to follow.
I'm sorry that you never got resolution on your problem. Do you know whether that is still a problem in 7.x? Setting up a reproduction where one shard is significantly larger than the others will take a little bit of work.
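For anyone who wants to try a reproduction, here is a minimal sketch of building the two Collections API requests involved: an async BACKUP and the REQUESTSTATUS poll for it. The base URL, collection name, backup name, location, and request ID are all placeholders I made up, and the helper function names are hypothetical. It only builds URLs and does not contact Solr.

```python
from urllib.parse import urlencode


def collections_url(base_url, action, params):
    """Build a Collections API URL for one action (hypothetical helper)."""
    query = urlencode({"action": action, "wt": "json", **params})
    return f"{base_url}/admin/collections?{query}"


def backup_url(base_url, collection, backup_name, location, request_id):
    """URL that starts an asynchronous backup of one collection."""
    return collections_url(base_url, "BACKUP", {
        "name": backup_name,
        "collection": collection,
        "location": location,
        "async": request_id,  # caller-chosen ID used later to poll status
    })


def status_url(base_url, request_id):
    """URL that polls the state of a previously submitted async request."""
    return collections_url(base_url, "REQUESTSTATUS", {"requestid": request_id})


if __name__ == "__main__":
    base = "http://localhost:8983/solr"  # assumed local node
    print(backup_url(base, "test", "test-backup", "/backups", "backup-1"))
    print(status_url(base, "backup-1"))
```

Fetching the second URL repeatedly until the state stops changing, then comparing it against what is actually on disk in the backup location, is the check that matters here.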
> I was focusing on the STATUS returned from the REQUESTSTATUS command, but looking again now I can see a response from only 6 shards, and each shard is from a different node. So this fits with what you're seeing. I assume your shards 1, 7, 9 are all on different nodes.
I did not actually check, and the cloud example I was using isn't around any more, but each of the shards in the status response was PROBABLY on a separate node. The cloud example had 3 nodes. It's an easy enough scenario to replicate, and I provided enough details for anyone to do it.
The person on IRC who reported this problem had a cluster of 15 nodes, and the status response mentioned ten shards (out of 30). It was shards 1-9 and shard 20. The suspicion is that there's something hard-coded that limits it to 10 responses ... because without that, I would expect the number of shards in the response to match the number of nodes.
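To make that kind of gap easy to spot, here is a small sketch that compares the shard names seen in a status response against the shards the collection should have. It assumes the default shard1..shardN naming, and it assumes you have already pulled the shard names out of the response yourself; I'm not showing the parsing because I haven't checked the exact response structure. The function name is hypothetical.

```python
def missing_shards(reported, expected_count):
    """Return expected shard names that are absent from the reported list.

    Assumes default names shard1..shardN; `reported` is whatever shard
    names were extracted (by hand or otherwise) from the status response.
    """
    expected = [f"shard{i}" for i in range(1, expected_count + 1)]
    seen = set(reported)
    return [name for name in expected if name not in seen]


if __name__ == "__main__":
    # The IRC report: 30 shards, but only shards 1-9 and 20 were mentioned.
    reported = [f"shard{i}" for i in range(1, 10)] + ["shard20"]
    missing = missing_shards(reported, 30)
    print(f"{len(missing)} shards not mentioned in the status response")
    print(missing)
```

A non-empty result doesn't prove those shards weren't backed up, only that the status response says nothing about them.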
Thanks, Shawn