On 9/12/23 18:28, rajani m wrote:
> Solr 9.1.1 version, upon restarting solr on any node in the cluster, a
> unique event is triggered across all the *other* nodes in the cluster that
> has an impact similar to restarting solr on all the other nodes in the
> cluster. There is a dip in the cpu usage, all the caches are emptied and
> warmed up, there are disk reads/writes on all the other nodes.

How much RAM is in each node? How much is given to the Java heap? Are
you running more than one Solr instance on each node? How much disk
space do the indexes on each node consume?

What are the counts of:

* Nodes
* Collections
* Shards per collection
* Replica count per shard
* Documents per shard

Replica count is sometimes a source of confusion. I've seen people
say they have "one shard and one replica" when the right way to state it
is that the replica count is two, because the leader is itself a replica.

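If it helps to gather those numbers, here is a rough sketch that tallies them from the JSON returned by the Collections API CLUSTERSTATUS action. The function name, the sample data, and the host in the comment are made up for illustration; only the general response layout (cluster, live_nodes, collections, shards, replicas, node_name) comes from Solr. Document counts are not part of that response, so they would have to come from somewhere else, such as a query against each shard.

```python
from collections import Counter

def summarize_cluster(status):
    """Tally nodes, shards, replicas and cores from a parsed CLUSTERSTATUS response.

    `status` is the decoded JSON from, for example (host is an assumption):
    http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS
    """
    cluster = status["cluster"]
    cores_per_node = Counter()
    collections = {}
    for name, coll in cluster.get("collections", {}).items():
        shards = coll.get("shards", {})
        replicas_per_shard = []
        for shard in shards.values():
            replicas = shard.get("replicas", {})
            # The leader is one of the replicas, so it is counted here too.
            replicas_per_shard.append(len(replicas))
            for replica in replicas.values():
                cores_per_node[replica["node_name"]] += 1
        collections[name] = {
            "shards": len(shards),
            "replicas_per_shard": replicas_per_shard,
        }
    return {
        "nodes": len(cluster.get("live_nodes", [])),
        "collections": collections,
        "cores_per_node": dict(cores_per_node),
        "total_cores": sum(cores_per_node.values()),
    }

# Hypothetical sample: what people often call "one shard and one replica".
SAMPLE = {
    "cluster": {
        "live_nodes": ["host1:8983_solr", "host2:8983_solr"],
        "collections": {
            "demo": {
                "shards": {
                    "shard1": {
                        "replicas": {
                            "core_node1": {"node_name": "host1:8983_solr", "leader": "true"},
                            "core_node2": {"node_name": "host2:8983_solr"},
                        }
                    }
                }
            }
        },
    }
}
```

Calling summarize_cluster(SAMPLE) reports a replica count of two for shard1 and two cores in total, one per node. The total core count is the number that matters most for the question of how disruptive a restart is.
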
If the counts above are large (meaning that you have a LOT of cores)
then restarting a node can be very disruptive to the cloud as a whole.

See this issue from several years ago where I explored this:

https://issues.apache.org/jira/browse/SOLR-7191

The issue has been marked as resolved in version 6.3.0, but no code was
modified, and as far as I know, the problem still exists.

It's worth noting that in my tests for that issue, the collections were
empty. For collections that actually have data, the problem will be worse.

If there are a lot of adds/updates/deletes happening, then the delta
between the replicas might exceed the threshold for transaction log
recovery. Solr may be doing a full replication to the cores on the
restarted node. But I would expect that to only affect the shard
leaders, which are the source for the replicated data.

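To make the threshold idea concrete, here is a deliberately simplified sketch. It assumes the threshold is the update log's numRecordsToKeep setting (default 100, as far as I know) and that the only input is how many updates the replica missed while it was down. The function name and the exact boundary are my own illustration; the real decision in Solr involves more than this one comparison.

```python
def expected_recovery(missed_updates, num_records_to_keep=100):
    """Guess how a restarted replica catches up (simplified model, not Solr code)."""
    if missed_updates < 0:
        raise ValueError("missed_updates cannot be negative")
    if missed_updates == 0:
        return "in sync"
    if missed_updates <= num_records_to_keep:
        # Small delta: the missing updates can be fetched from a peer's update log.
        return "peer sync"
    # Large delta: the index is copied from the shard leader.
    return "full replication"
```

Under this model, a replica that missed 50 updates would sync from the transaction log, while one that missed 5000 would need the whole index copied from its leader.
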
Thanks,
Shawn