Hello,
We had a stuck leader election for a shard.

We have collections with 2 shards, and each shard has 5 replicas. We have many
collections, but the issue happened for a single shard. Once all host
restarts completed, this shard was stuck with one replica in "recovering"
state and all the others in "down" state.

Here is the state of the shard as returned by the CLUSTERSTATUS command.
      "replicas":{
        "core_node3":{
          "core":"...._shard1_replica_n1",
          "base_url":"https://host1:8983/solr";,
          "node_name":"host1:8983_solr",
          "state":"recovering",
          "type":"NRT",
          "force_set_state":"false"},
        "core_node9":{
          "core":"...._shard1_replica_n6",
          "base_url":"https://host2:8983/solr";,
          "node_name":"host2:8983_solr",
          "state":"down",
          "type":"NRT",
          "force_set_state":"false"},
        "core_node26":{
          "core":"...._shard1_replica_n25",
          "base_url":"https://host3:8983/solr";,
          "node_name":"host3:8983_solr",
          "state":"down",
          "type":"NRT",
          "force_set_state":"false"},
        "core_node28":{
          "core":"...._shard1_replica_n27",
          "base_url":"https://host4:8983/solr";,
          "node_name":"host4:8983_solr",
          "state":"down",
          "type":"NRT",
          "force_set_state":"false"},
        "core_node34":{
          "core":"...._shard1_replica_n33",
          "base_url":"https://host5:8983/solr";,
          "node_name":"host5:8983_solr",
          "state":"down",
          "type":"NRT",
          "force_set_state":"false"}}}

The workaround was to shut down the server host1 hosting the replica stuck in
"recovering" state. This unblocked the leader election, and the 4 other
replicas went active.

Here is the first error I found in the logs related to this shard. It happened
while shutting down the server host3 that hosted the leader at that time:
 (updateExecutor-5-thread-33908-processing-x:..._shard1_replica_n25 r:core_node26 null n:... s:shard1) [c:... s:shard1 r:core_node26 x:..._shard1_replica_n25] o.a.s.c.s.i.ConcurrentUpdateHttp2SolrClient Error consuming and closing http response stream. => java.nio.channels.AsynchronousCloseException
        at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
java.nio.channels.AsynchronousCloseException: null
        at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
        at java.io.InputStream.read(InputStream.java:205) ~[?:?]
        at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:287)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:283)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.run(ConcurrentUpdateHttp2SolrClient.java:176)
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]

My understanding is that, following this error, each server restart ended with
the replica on that server being in "down" state, but I'm not sure how to
confirm that.
We then entered a loop where the shard term kept increasing because of failed
replication.
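
For context on the terms: as far as I can tell, the per-replica shard terms are
stored in ZooKeeper under /collections/<collection>/terms/<shard>. Here is a
minimal SolrJ sketch to dump them (assuming Solr 8.x on the classpath; the
ZooKeeper connect string and collection name below are placeholders):

import org.apache.solr.common.cloud.SolrZkClient;
import java.nio.charset.StandardCharsets;

public class DumpShardTerms {
  public static void main(String[] args) throws Exception {
    String zkHost = "zk1:2181,zk2:2181,zk3:2181/solr"; // placeholder ZK connect string
    String collection = "mycollection";                // placeholder collection name
    // Per-replica terms for shard1 live under this znode.
    String path = "/collections/" + collection + "/terms/shard1";
    try (SolrZkClient zk = new SolrZkClient(zkHost, 30000)) {
      byte[] data = zk.getData(path, null, null, true);
      // The znode content is a JSON map of core_nodeX -> term.
      System.out.println(new String(data, StandardCharsets.UTF_8));
    }
  }
}

Comparing the values across replicas should show which replica is behind and
whether the term keeps climbing.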

Is this a known issue? I found no similar ticket in Jira.
Could you please help me get a better understanding of the issue?
Thanks
