Hi All,

We have recently moved from Solr 6.5 to SolrCloud 8.10.


*Earlier Architecture:* We were using a master-slave architecture with
4 slaves (14 CPUs, 96 GB RAM, 20 GB heap, 110 GB index size). We used to
optimize and replicate nightly.

*Current Architecture:*
We didn't have a clear direction on the number of shards, so we ran some
POCs with varying numbers of shards. We found that with 8 shards we were
close to the response time we were getting earlier without using too much
infrastructure.
Based on our queries we couldn't find a routing parameter, so all queries
are now broadcast to every shard.

We now have a cluster of 8+1 Solr nodes. One indexing node contains all 8
NRT primary shards; this is where all indexing happens. The other 8 nodes
(10 CPUs, 42 GB RAM, 8 GB heap, ~23 GB index each) each hold one PULL
replica of every primary shard. For querying, we set
*shards.preference=replica.type:PULL* so that all queries are served from
the PULL replicas.
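
For reference, here is a minimal sketch of how such a query could be built.
The host, collection name and query string are placeholders for
illustration, not our real values; the only part taken from our setup is
the shards.preference parameter with the replica.type:PULL value.

```python
from urllib.parse import urlencode

# Hypothetical values, for illustration only.
SOLR_BASE = "http://solr-query-node:8983/solr"
COLLECTION = "products"


def build_select_url(user_query: str, rows: int = 10) -> str:
    """Build a /select URL that asks Solr to prefer PULL replicas."""
    params = {
        "q": user_query,
        "rows": rows,
        # This is a preference, not a restriction: Solr can still fall
        # back to other replica types when no PULL replica is available.
        "shards.preference": "replica.type:PULL",
    }
    return f"{SOLR_BASE}/{COLLECTION}/select?{urlencode(params)}"


print(build_select_url("title:laptop"))
```

Because this is only a preference, queries can still reach the NRT
replicas on the indexing node when the PULL replicas are down.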

Our thought process was that we should have the indexing layer and query
layer separate so one does not affect the other.

We went live this week. It did not reduce the response time; in fact, the
average response time increased. However, response times above the 85th
percentile improved substantially, so timeouts dropped significantly.

*Now I have a few questions for everyone who is running SolrCloud, to help
me understand and improve the stability of my cluster.*

1. Were we right to separate the indexing and query layers? Is it a good
idea, or could something else have been done better? Right now it can
affect our cluster stability: if a replica node is unavailable, queries
will start going to the indexing node, which is very weak, and that could
choke the whole cluster.

2. Are there any guidelines for the number of shards and shard size?

3. How do we decide the ideal number of CPUs per node? Is there a metric
we can follow, such as load or CPU usage? What should the ideal CPU usage
and load average be for a given number of CPUs? I ask because our response
time rises steeply with traffic, from 250 ms to 400 ms in peak hours. Peak
traffic is about 2000 requests per minute, with CPU usage at 55% and load
average at ~6 (10 CPUs).

4. How do we decide the number of nodes, based on shards or any other
metric? Should one add nodes or add CPUs to existing nodes?

5. How should we handle dev and stage environments? Should we have
separate smaller clusters, or is there another approach?

6. Did your infrastructure requirements also increase compared to
standalone when moving to SolrCloud? If yes, by how much?

7. How do you maintain versioning of configs in ZooKeeper?

8. Any performance issues you faced, or any other recommendations?
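
To put the numbers from question 3 in one place, here is a rough
back-of-the-envelope calculation. It assumes that every request fans out to
all 8 shards (we have no routing parameter) and that the sub-requests
spread evenly across the 8 query nodes. Both are simplifications, and the
function name is just for illustration.

```python
def cluster_load_summary(requests_per_minute: float, shards: int,
                         query_nodes: int, load_average: float,
                         cpus_per_node: int) -> dict:
    """Rough per-node load figures under an even-spread assumption."""
    qps = requests_per_minute / 60.0
    # Without routing, each top-level request becomes one sub-request
    # per shard.
    shard_requests_per_sec = qps * shards
    return {
        "top_level_qps": qps,
        "shard_requests_per_sec": shard_requests_per_sec,
        "shard_requests_per_node_per_sec":
            shard_requests_per_sec / query_nodes,
        "load_per_cpu": load_average / cpus_per_node,
    }


# Figures from this mail: 2000 req/min, 8 shards, 8 query nodes,
# load average ~6 on 10 CPUs.
summary = cluster_load_summary(2000, 8, 8, 6.0, 10)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With our figures this comes to roughly 33 top-level queries per second,
about 267 shard sub-requests per second across the cluster, and a load of
about 0.6 per CPU.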
