Thanks, Mark. I will dig further into the logs. There is another, related problem.
We have a collection with 3 shards (2 nodes per shard), and the collection holds about 1000 records. Unfortunately, after the leader went down, the replica node failed to become the leader. In detail: after the leader node went down, the replica node tried to become the new leader, but it logged:

=======================
ShardLeaderElectionContext.runLeaderProcess(131) - Running the leader process.
ShardLeaderElectionContext.shouldIBeLeader(331) - Checking if I should try and be the leader.
ShardLeaderElectionContext.shouldIBeLeader(339) - My last published State was Active, it's okay to be the leader.
ShardLeaderElectionContext.runLeaderProcess(164) - I may be the new leader - try and sync
SyncStrategy.sync(89) - Sync replicas to http://localhost:8486/solr/exception/
PeerSync.sync(182) - PeerSync: core=exception url=http://localhost:8486/solr START replicas=[http://localhost:8483/solr/exception/] nUpdates=100
PeerSync.sync(250) - PeerSync: core=exception url=http://localhost:8486/solr DONE. We have no versions. sync failed.
SyncStrategy.log(114) - Sync Failed
ShardLeaderElectionContext.rejoinLeaderElection(311) - There is a better leader candidate than us - going back into recovery
DefaultSolrCoreState.doRecovery(214) - Running recovery - first canceling any ongoing recovery
========================

After that, it tries to recover from the leader node, which is already down. It then loops: recovery, failure, recovery, and so on.

Is this related to SOLR-3939 and SOLR-3940? The index data isn't empty, though.

On Thu, Jan 10, 2013 at 10:09 AM, Mark Miller <markrmil...@gmail.com> wrote:

> It may be able to do that because it's forwarding requests to other nodes
> that are up?
>
> Would be good to dig into the logs to see if you can narrow in on the
> reason for the recovery_failed.
>
> - Mark
>
> On Jan 9, 2013, at 8:52 PM, Zeng Lames <lezhi.z...@gmail.com> wrote:
>
> > Hi ,
> >
> > we meet below strange case in production environment. from the Solr Admin
> > Console -> Cloud -> Graph, we can find that one node is in recovery_failed
> > status. but at the same time, we found that the recovery_failed node can
> > server query/update request normally.
> >
> > any idea about it? thanks!
> >
> > --
> > Best Wishes!
> > Lames

--
Best Wishes!
Lames