I have a 6-node, lightly used SolrCloud setup that seemingly at
random loses a shard leader and then fails to recover on its own. The
logs on all machines scroll with errors like:


2025-05-16 19:06:05.975 ERROR (qtp1378497201-42484) [c:dovecot
s:shard8 r:core_node118 x:dovecot_shard8_replica_n117]
o.a.s.u.p.DistributedZkUpdateProcessor ClusterState says we are the
leader, but locally we don't think so
2025-05-16 19:06:05.975 ERROR (qtp1378497201-42484) [c:dovecot
s:shard8 r:core_node118 x:dovecot_shard8_replica_n117]
o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException:
ClusterState says we are the leader
(http://192.168.1.4:8983/solr/dovecot_shard8_replica_n117), but
locally we don't think so. Request came from
http://192.168.1.11:8983/solr/dovecot_shard3_replica_n139/ =>
org.apache.solr.common.SolrException: ClusterState says we are the
leader (http://192.168.1.4:8983/solr/dovecot_shard8_replica_n117), but
locally we don't think so. Request came from
http://192.168.1.11:8983/solr/dovecot_shard3_replica_n139/
        at
org.apache.solr.update.processor.DistributedZkUpdateProcessor.doDefensiveChecks(DistributedZkUpdateProcessor.java:1025)
org.apache.solr.common.SolrException: ClusterState says we are the
leader (http://192.168.1.4:8983/solr/dovecot_shard8_replica_n117), but
locally we don't think so. Request came from
http://192.168.1.11:8983/solr/dovecot_shard3_replica_n139/

The thing is, these errors happen on every server, and each one
thinks it is the leader! Another example:


request: http://192.168.1.13:8983/solr/dovecot_shard7_replica_n155/
Remote error message: ClusterState says we are the leader
(http://192.168.1.13:8983/solr/dovecot_shard7_replica_n155), but
locally we don't think so. Request came from
http://192.168.1.197:8983/solr/dovecot_shard6_replica_n159/
        at
org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:275)
~[?:?]
        at
org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.run(ConcurrentUpdateHttp2SolrClient.java:181)
~[?:?]
        at
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180)
~[metrics-core-4.1.5.jar:4.1.5]
        at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218)
~[?:?]
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
~[?:?]
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
~[?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
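For what it's worth, this is roughly how I've been checking which
replica the published cluster state actually records as leader for
each shard, so I can line it up against the "locally we don't think
so" errors above. It's just a quick Python sketch against the
Collections API CLUSTERSTATUS action; the requests library and the
node address are my own choices, and any live node should do:

import requests

SOLR = "http://192.168.1.4:8983/solr"   # any live node should do
COLLECTION = "dovecot"

# Ask the Collections API for the published cluster state of the collection.
resp = requests.get(
    f"{SOLR}/admin/collections",
    params={"action": "CLUSTERSTATUS", "collection": COLLECTION, "wt": "json"},
    timeout=10,
)
resp.raise_for_status()

# Print the replica that the cluster state marks as leader for every shard.
shards = resp.json()["cluster"]["collections"][COLLECTION]["shards"]
for shard_name in sorted(shards):
    for core_node, replica in shards[shard_name]["replicas"].items():
        if replica.get("leader") == "true":
            print(f"{shard_name}: {core_node} "
                  f"({replica['core']} on {replica['node_name']}, state={replica['state']})")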

RPM for each node is around 0.10 and never goes above 0.20.

In total I have 10 shards split across 6 Solr servers and 5
ZooKeeper nodes. The ZooKeeper page still shows a leader, and the
Graph page shows all shards/nodes as green, but indexing fails and
those errors keep scrolling. Getting it out of this mess requires
stopping all requests to the cluster and restarting each and every
node; lately this has been a daily occurrence. Each shard has around
44.3 million documents, with nodes storing between 160 and 200 GB of
data (some have only 3 replicas, others 4).
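
The recovery itself is currently nothing smarter than a rolling
restart, roughly along these lines (the host list, the "solr" systemd
unit name, and passwordless ssh are specifics of my setup, and only
four of the six IPs appear in the logs above):

import subprocess
import time

import requests

# Host list, systemd unit name, and ssh access are assumptions about my
# own setup; this only illustrates the manual steps described above.
NODES = ["192.168.1.4", "192.168.1.11", "192.168.1.13", "192.168.1.197"]  # plus the remaining two

def wait_until_up(host, timeout=300):
    # Poll the node's system-info endpoint until it answers or we give up.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"http://{host}:8983/solr/admin/info/system", timeout=5).ok:
                return True
        except requests.RequestException:
            pass
        time.sleep(5)
    return False

# With client traffic already stopped, restart the nodes one at a time.
for host in NODES:
    subprocess.run(["ssh", host, "sudo", "systemctl", "restart", "solr"], check=True)
    if not wait_until_up(host):
        raise SystemExit(f"{host} did not come back up in time")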

How can I figure out what's going on and why this keeps happening?
