Hi all,
My team at work maintains a SolrCloud 5.3.2 cluster with multiple
collections configured with sharding and replication.
We recently backed up our Solr indexes using the built-in backup
functionality. After the cluster was restored from the backup, we
noticed that atomic updates of documents occasionally fail with the
error message 'missing required field [...]'. The exceptions are
thrown on a host that does not store the document being updated. From
this we deduce that routing by the hash of the uniqueKey no longer finds
the right host. Indeed, our investigation so far shows that for at least
one collection in the new cluster, the shards now have different hash
ranges assigned. We checked the hash ranges by querying
/admin/collections?action=CLUSTERSTATUS. Below are the shard hash ranges
of one collection that we debugged.
Old cluster:
shard1_0 80000000 - aaa9ffff
shard1_1 aaaa0000 - d554ffff
shard2_0 d5550000 - fffeffff
shard2_1 ffff0000 - 2aa9ffff
shard3_0 2aaa0000 - 5554ffff
shard3_1 55550000 - 7fffffff
New cluster:
shard1 80000000 - aaa9ffff
shard2 aaaa0000 - d554ffff
shard3 d5550000 - ffffffff
shard4 0 - 2aa9ffff
shard5 2aaa0000 - 5554ffff
shard6 55550000 - 7fffffff
Note that the shard names differ because the old cluster's shards were
split.
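As you can see, the ranges of shard3 and shard4 differ from those of the
corresponding shards in the old cluster (shard2_0 and shard2_1): the
boundary between them moved from ffff0000 to 0. This change of hash
ranges matches the symptoms we are currently experiencing.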
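
To double-check the comparison, here is a minimal Python sketch that
pairs the old and new shards by position and reports the pairs whose
ranges differ. It assumes the hex values are read as signed 32-bit
integers (so 80000000 is the lowest hash and 7fffffff the highest). The
helper names (to_signed, owner, differing_pairs) are ours and not part
of Solr.

```python
# Hash ranges copied from the CLUSTERSTATUS listings above.
OLD = {
    "shard1_0": ("80000000", "aaa9ffff"),
    "shard1_1": ("aaaa0000", "d554ffff"),
    "shard2_0": ("d5550000", "fffeffff"),
    "shard2_1": ("ffff0000", "2aa9ffff"),
    "shard3_0": ("2aaa0000", "5554ffff"),
    "shard3_1": ("55550000", "7fffffff"),
}
NEW = {
    "shard1": ("80000000", "aaa9ffff"),
    "shard2": ("aaaa0000", "d554ffff"),
    "shard3": ("d5550000", "ffffffff"),
    "shard4": ("0", "2aa9ffff"),
    "shard5": ("2aaa0000", "5554ffff"),
    "shard6": ("55550000", "7fffffff"),
}


def to_signed(hex_str):
    """Interpret a hex string as a signed 32-bit integer (assumption)."""
    value = int(hex_str, 16)
    return value - (1 << 32) if value >= (1 << 31) else value


def sorted_ranges(ranges):
    """Return (low, high, name) tuples ordered by signed lower bound."""
    return sorted(
        (to_signed(lo), to_signed(hi), name) for name, (lo, hi) in ranges.items()
    )


def differing_pairs(old, new):
    """Pair shards by position and return the name pairs whose ranges differ."""
    diffs = []
    for (olo, ohi, oname), (nlo, nhi, nname) in zip(
        sorted_ranges(old), sorted_ranges(new)
    ):
        if (olo, ohi) != (nlo, nhi):
            diffs.append((oname, nname))
    return diffs


def owner(ranges, hash_value):
    """Return the shard whose range contains the signed hash value, if any."""
    for lo, hi, name in sorted_ranges(ranges):
        if lo <= hash_value <= hi:
            return name
    return None


print(differing_pairs(OLD, NEW))

# A sample hash inside ffff0000 - ffffffff, the interval that changed owner.
sample = to_signed("ffff8000")
print(owner(OLD, sample), "->", owner(NEW, sample))
```

Under that assumption, only hashes in ffff0000 - ffffffff change owner:
they belonged to shard2_1 in the old cluster and fall into shard3 in
the new one.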
We found this JIRA ticket, https://issues.apache.org/jira/browse/SOLR-5750,
in which David Smiley comments:
"shard hash ranges aren't restored; this error could be disasterous"
It seems that this is what happened to us. We would like to hear some
suggestions on how we could recover from this problem.
Best,
Gary