Hi, I think you're seeing:
https://issues.apache.org/jira/browse/SOLR-3993
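
The workaround you describe at the end of your message (start one instance, wait until a leader is available, then start the rest) could be scripted roughly as below. This is only a sketch: `staggered_start`, `start_instance` and `leader_elected` are hypothetical names, not part of Solr. You would have to supply the two callables yourself, for example one that launches Solr on a host and one that checks whether the leader can be read from ZooKeeper. The 200-second default timeout is an assumption chosen to sit slightly above the 3-minute leaderVoteWait.

```python
import time
from typing import Callable, List, Sequence


def staggered_start(
    hosts: Sequence[str],
    start_instance: Callable[[str], None],   # hypothetical: launches Solr on one host
    leader_elected: Callable[[], bool],      # hypothetical: True once a leader is visible
    timeout_s: float = 200.0,                # assumption: a bit more than leaderVoteWait
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Start the first host, wait for a leader, then start the remaining hosts.

    Returns the hosts in the order they were started.
    """
    if not hosts:
        return []

    started: List[str] = []

    # Bring up a single instance first so it can go through leader election alone.
    start_instance(hosts[0])
    started.append(hosts[0])

    # Poll until the caller-supplied check reports a leader, or give up.
    deadline = clock() + timeout_s
    while not leader_elected():
        if clock() >= deadline:
            raise TimeoutError("no leader seen within %.0f seconds" % timeout_s)
        sleep(poll_s)

    # Only now start the other instances, so they can find the leader right away.
    for host in hosts[1:]:
        start_instance(host)
        started.append(host)

    return started
```

With four instances you would call something like `staggered_start(["host1", "host2", "host3", "host4"], my_start, my_leader_check)`, where `my_start` and `my_leader_check` are your own functions.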
-----Original message-----
> From: Bill Au <bill.w...@gmail.com>
> Sent: Thu 08-Nov-2012 21:16
> To: solr-user@lucene.apache.org
> Subject: best practice for restarting the entire SolrCloud cluster
>
> I have a simple SolrCloud cluster with 4 Solr instances and 1 shard. I can
> start and stop individual Solr instances without any problem, but not when
> I have to shut down all the Solr instances at the same time.
>
> After shutting down all the Solr instances, the first instance that starts
> up waits for all the replicas:
>
> INFO: Waiting until we see more replicas up: total=4 found=3
> timeoutin=169243
>
> In the meantime, any additional Solr instances that start up while the
> first one is waiting can't get the leader from zookeeper:
>
> SEVERE: Error getting leader from zk
> org.apache.solr.common.SolrException: Could not get leader props
>
> When the first Solr instance sees all the replicas, it becomes the leader:
>
> INFO: Enough replicas found to continue.
> INFO: I may be the new leader - try and sync
>
> But it fails to sync with the instances that had failed to get the leader
> before:
>
> WARNING: PeerSync: core=collection1 url=http://host2:8983/solr exception
> talking to http://host2:8983/solr/collection1/, failed
> org.apache.solr.client.solrj.SolrServerException: Timeout occured while
> waiting response from server at: http://host2:8983/solr/collection1
>
> So I ended up with one or more replicas down after the restart. I had to
> figure out which replicas were down and restart them.
>
> What I also discovered is that if I start the first Solr instance and wait
> until it returns after the leaderVoteWait of 3 minutes, the rest of the
> Solr instances can be started without any problem since by then they can get
> the leader from zookeeper.
>
> Is there a better way to restart an entire SolrCloud cluster?
>
> Bill
>