Hi All - I have a 40-node cluster that had been running well for a long time, but the whole thing went down due to OOM.  I adjusted the parameters and restarted, but one shard with 3 replicas (all NRT) will not elect a leader.  I see messages like:

2019-05-30 12:35:30.597 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.c.SyncStrategy Sync replicas to http://elara:9100/solr/UNCLASS_30DAYS_shard31_replica_n182/
2019-05-30 12:35:30.597 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.u.PeerSync PeerSync: core=UNCLASS_30DAYS_shard31_replica_n182 url=http://elara:9100/solr START replicas=[http://enceladus:9100/solr/UNCLASS_30DAYS_shard31_replica_n180/, http://rosalind:9100/solr/UNCLASS_30DAYS_shard31_replica_n184/] nUpdates=100
2019-05-30 12:35:30.651 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.u.PeerSync PeerSync: core=UNCLASS_30DAYS_shard31_replica_n182 url=http://elara:9100/solr  Received 100 versions from http://enceladus:9100/solr/UNCLASS_30DAYS_shard31_replica_n180/ fingerprint:null
2019-05-30 12:35:30.652 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.u.PeerSync PeerSync: core=UNCLASS_30DAYS_shard31_replica_n182 url=http://elara:9100/solr  Our versions are too old. ourHighThreshold=1634891841359839232 otherLowThreshold=1634892098551414784 ourHighest=1634892003501146112 otherHighest=1634892708023631872
2019-05-30 12:35:30.652 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.u.PeerSync PeerSync: core=UNCLASS_30DAYS_shard31_replica_n182 url=http://elara:9100/solr DONE. sync failed
2019-05-30 12:35:30.652 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.c.SyncStrategy Leader's attempt to sync with shard failed, moving to the next candidate
2019-05-30 12:35:30.683 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.c.ShardLeaderElectionContext There may be a better leader candidate than us - going back into recovery
2019-05-30 12:35:30.693 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.c.ShardLeaderElectionContextBase No version found for ephemeral leader parent node, won't remove previous leader registration.
2019-05-30 12:35:30.694 WARN (updateExecutor-3-thread-4-processing-n:elara:9100_solr x:UNCLASS_30DAYS_shard31_replica_n182 c:UNCLASS_30DAYS s:shard31 r:core_node185) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.c.RecoveryStrategy Stopping recovery for core=[UNCLASS_30DAYS_shard31_replica_n182] coreNodeName=[core_node185]

and

2019-05-30 12:25:39.522 INFO  (zkCallback-7-thread-1) [c:UNCLASS_30DAYS s:shard31 r:core_node187 x:UNCLASS_30DAYS_shard31_replica_n184] o.a.s.c.ActionThrottle Throttling leader attempts - waiting for 136ms
2019-05-30 12:25:39.672 INFO  (zkCallback-7-thread-1) [c:UNCLASS_30DAYS s:shard31 r:core_node187 x:UNCLASS_30DAYS_shard31_replica_n184] o.a.s.c.ShardLeaderElectionContext Can't become leader, other replicas with higher term participated in leader election
2019-05-30 12:25:39.672 INFO  (zkCallback-7-thread-1) [c:UNCLASS_30DAYS s:shard31 r:core_node187 x:UNCLASS_30DAYS_shard31_replica_n184] o.a.s.c.ShardLeaderElectionContext There may be a better leader candidate than us - going back into recovery
2019-05-30 12:25:39.677 INFO  (zkCallback-7-thread-1) [c:UNCLASS_30DAYS s:shard31 r:core_node187 x:UNCLASS_30DAYS_shard31_replica_n184] o.a.s.c.ShardLeaderElectionContextBase No version found for ephemeral leader parent node, won't remove previous leader registration.

and

2019-05-30 12:26:39.820 INFO  (zkCallback-7-thread-5) [c:UNCLASS_30DAYS s:shard31 r:core_node183 x:UNCLASS_30DAYS_shard31_replica_n180] o.a.s.c.ShardLeaderElectionContext Can't become leader, other replicas with higher term participated in leader election
2019-05-30 12:26:39.820 INFO  (zkCallback-7-thread-5) [c:UNCLASS_30DAYS s:shard31 r:core_node183 x:UNCLASS_30DAYS_shard31_replica_n180] o.a.s.c.ShardLeaderElectionContext There may be a better leader candidate than us - going back into recovery
2019-05-30 12:26:39.826 INFO  (zkCallback-7-thread-5) [c:UNCLASS_30DAYS s:shard31 r:core_node183 x:UNCLASS_30DAYS_shard31_replica_n180] o.a.s.c.ShardLeaderElectionContextBase No version found for ephemeral leader parent node, won't remove previous leader registration.

I've tried FORCELEADER, but it had no effect.  I also tried adding a shard, but that one didn't come up either.  The index is on HDFS.
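
For reference, a FORCELEADER request through the Collections API has roughly the shape below. This is only a sketch: the host, collection, and shard are taken from the log lines above, the helper name forceleader_url is made up for illustration, and the code only builds the URL without sending anything.

```python
from urllib.parse import urlencode


def forceleader_url(base_url, collection, shard):
    """Build a Collections API FORCELEADER URL (hypothetical helper; sends nothing)."""
    params = urlencode({
        "action": "FORCELEADER",
        "collection": collection,
        "shard": shard,
    })
    return f"{base_url}/admin/collections?{params}"


# Values assumed from the log output: node elara:9100, shard31 of UNCLASS_30DAYS.
url = forceleader_url("http://elara:9100/solr", "UNCLASS_30DAYS", "shard31")
print(url)
```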

Help!

-Joe
