I recall having some luck fixing a leader-less shard (after a ZK quorum failure) by forcibly removing the records for the down-state replicas from the leader election list, and then forcing an election. The ZK path looks like collections/<collection>/leader_elect/shardX/election. Usually you’ll find that the down-state replica that keeps getting elected is the first entry in that list. Delete that entry, then try the force-election Collections API command again.
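As a rough sketch of the steps above (not a tested procedure), the Python below builds the election path, picks out the entry that is first in line, and builds the FORCELEADER request URL. It assumes the election children are named with a trailing "-n_<sequence>" suffix and that the lowest sequence is first in line; the collection name, shard name, child names, and Solr base URL are made up for illustration. The listing and deletion themselves are left to whatever ZK client you use (zkCli, for example).

```python
from urllib.parse import urlencode


def election_path(collection, shard):
    # ZK path holding the leader election entries for one shard.
    # A leading slash is assumed; adjust if your cluster uses a chroot.
    return f"/collections/{collection}/leader_elect/{shard}/election"


def sequence_of(node_name):
    # Assumption: child names end in "-n_<digits>", e.g. "...-n_0000000003".
    return int(node_name.rsplit("_", 1)[1])


def first_in_line(children):
    # The entry with the lowest sequence number is the one elected next.
    return min(children, key=sequence_of)


def force_leader_url(solr_base, collection, shard):
    # Collections API call that forces a leader election for the shard.
    query = urlencode(
        {"action": "FORCELEADER", "collection": collection, "shard": shard}
    )
    return f"{solr_base}/admin/collections?{query}"


if __name__ == "__main__":
    # Hypothetical child names, as if listed from the election path.
    children = [
        "95000000000000001-core_node2-n_0000000007",
        "95000000000000002-core_node1-n_0000000003",
    ]
    path = election_path("mycollection", "shard1")
    print("inspect:", path)
    print("candidate to delete:", path + "/" + first_in_line(children))
    print("then call:", force_leader_url(
        "http://localhost:8983/solr", "mycollection", "shard1"))
```

Before deleting anything, check in the cluster state that the entry really belongs to a down-state replica; the sketch only tells you which entry is first in line, not what state its replica is in.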

On 4/5/16, 3:15 AM, "Tom Evans" <tevans...@googlemail.com> wrote:

>Hi all, I have an 8 node SolrCloud 5.5 cluster with 11 collections,
>most of them in a 1 shard x 8 replicas configuration. We have 5 ZK
>nodes.
>
>During the night, we attempted to reindex one of the larger
>collections. We reindex by pushing json docs to the update handler
>from a number of processes. It seemed this overwhelmed the servers,
>and caused all of the collections to fail and end up in either a down
>or a recovering state, often with no leader.
>
>Restarting and rebooting the servers brought a lot of the collections
>back online, but we are left with a few collections for which all the
>nodes hosting those replicas are up, but the replica reports as either
>"active" or "down", and with no leader.
>
>Trying to force a leader election has no effect, it keeps choosing a
>leader that is in "down" state. Removing all the nodes that are in
>"down" state and forcing a leader election also has no effect.
>
>
>Any ideas? The only viable option I see is to create a new collection,
>index it and then remove the old collection and alias it in.
>
>Cheers
>
>Tom