Situation

I'm currently trying to set up SolrCloud in an AWS Auto Scaling group, so
that it can scale dynamically.

I've also added the following cluster policy and triggers to Solr, so that
each node holds one (and only one) replica of each collection:

{
  "set-cluster-policy": [
    {"replica": "<2", "shard": "#EACH", "node": "#EACH"}
  ],
  "set-trigger": [{
    "name": "node_added_trigger",
    "event": "nodeAdded",
    "waitFor": "5s",
    "preferredOperation": "ADDREPLICA"
  },{
    "name": "node_lost_trigger",
    "event": "nodeLost",
    "waitFor": "120s",
    "preferredOperation": "DELETENODE"
  }]
}
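For completeness, this is how the config above can be applied. A minimal sketch in Python rather than curl; the base URL is an assumption (any live node), and I'm using the v2 autoscaling endpoint from the Solr 7.x reference guide:

```python
import json
import urllib.request

SOLR_URL = "http://localhost:8983"  # assumption: address of a live SolrCloud node

# Same payload as above: at most one replica of each shard per node,
# plus triggers for nodes joining/leaving the Auto Scaling group.
payload = {
    "set-cluster-policy": [
        {"replica": "<2", "shard": "#EACH", "node": "#EACH"}
    ],
    "set-trigger": [
        {"name": "node_added_trigger", "event": "nodeAdded",
         "waitFor": "5s", "preferredOperation": "ADDREPLICA"},
        {"name": "node_lost_trigger", "event": "nodeLost",
         "waitFor": "120s", "preferredOperation": "DELETENODE"},
    ],
}

def apply_autoscaling_config(base_url: str = SOLR_URL) -> bytes:
    """POST the autoscaling config to Solr (Solr 7.x v2 API path)."""
    req = urllib.request.Request(
        f"{base_url}/api/cluster/autoscaling",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```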

This works pretty well. But my problem is that when a node gets removed,
Solr doesn't delete all 19 replicas from that node, and I then have
problems accessing the "Nodes" page:

(Screenshot of the broken "Nodes" page: https://i.stack.imgur.com/QyJrY.png)

In the logs, this exception occurs:

Operation deletenode failed: java.util.concurrent.RejectedExecutionException: Task org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$45/1104948431@467049e2 rejected from org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@773563df[Running, pool size = 10, active threads = 10, queued tasks = 0, completed tasks = 1]
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:194)
    at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
    at org.apache.solr.cloud.api.collections.DeleteReplicaCmd.deleteCore(DeleteReplicaCmd.java:276)
    at org.apache.solr.cloud.api.collections.DeleteReplicaCmd.deleteReplica(DeleteReplicaCmd.java:95)
    at org.apache.solr.cloud.api.collections.DeleteNodeCmd.cleanupReplicas(DeleteNodeCmd.java:109)
    at org.apache.solr.cloud.api.collections.DeleteNodeCmd.call(DeleteNodeCmd.java:62)
    at org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:292)
    at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:496)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Problem description

So, the problem is that the executor only has a pool size of 10, all 10
threads are busy, and nothing gets queued (tasks are executed synchronously
and rejected once the pool is full). And indeed, Solr only removed 10
replicas; the other 9 stayed behind. Manually sending the DELETENODE API
command afterwards works fine, since Solr then only needs to remove the
remaining 9 replicas, and everything is good again.

Question

How can I either increase this (small) thread pool size and/or enable
queueing of the remaining deletion tasks? Another solution might be to
retry the failed tasks until they succeed.
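In case retrying turns out to be the only option, this is the rough workaround I have in mind: re-issue DELETENODE until the node is fully drained, since each attempt removes as many replicas as the overseer's pool accepts. A sketch only; the helper names are mine, and the base URL is an assumption (the Collections API DELETENODE action itself is standard):

```python
import json
import time
import urllib.request

SOLR_URL = "http://localhost:8983"  # assumption: any live node in the cluster

def build_delete_node_url(node_name: str, base_url: str = SOLR_URL) -> str:
    """Collections API DELETENODE call for a node name like 'host:8983_solr'."""
    return (f"{base_url}/solr/admin/collections"
            f"?action=DELETENODE&node={node_name}&wt=json")

def delete_node_with_retry(node_name: str, retries: int = 5, delay: float = 15.0) -> None:
    """Retry DELETENODE; each attempt deletes up to ~10 replicas (the pool
    size seen above), so repeating eventually removes all of them."""
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(build_delete_node_url(node_name)) as resp:
                body = json.loads(resp.read())
            if body.get("responseHeader", {}).get("status") == 0:
                return  # node fully cleaned up
        except Exception as exc:  # the RejectedExecutionException surfaces as an error response
            print(f"DELETENODE attempt {attempt} failed: {exc}")
        time.sleep(delay)
    raise RuntimeError(f"could not remove all replicas from {node_name}")
```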

I'm using Solr 7.7.1 on Ubuntu Server, installed with the installation
script that ships with Solr (so I guess it's using Jetty?).

Thanks for your help!
