On 8/3/2020 12:04 PM, Mathew Mathew wrote:
Have been looking for architectural guidance on correctly configuring SolrCloud
on Public Cloud (eg Azure/AWS)
In particular the zookeeper based autoscaling seems to overlap with the auto
scaling capabilities of cloud platforms.
I have the following questions.
1. Should the ZooKeeper ensemble be put in an autoscaling group? This
seems to be a no, since the SolrNodes need to register against a static
list of ZooKeeper IPs.
Correct. There are features in ZK 3.5 for dynamic server membership,
but in general it is better to have a static list. The client must be
upgraded as well for that feature to work. The ZK client was upgraded
to a 3.5 version in Solr 8.2.0. I don't think we have done any testing
of the dynamic membership feature.
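To make the "static list" concrete, here is a sketch of how the ensemble
is usually wired into Solr via solr.in.sh (the hostnames and the /solr
chroot are hypothetical placeholders, not anything from this thread):

```shell
# solr.in.sh (sketch) -- hostnames below are placeholders.
# Every Solr node points at the same static ensemble string; the optional
# /solr suffix is a ZooKeeper chroot that keeps Solr's data in its own subtree.
ZK_HOST="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/solr"
```

Because every node carries the same string, replacing a failed ZK server
is easiest when the new server reuses the old address, which is another
reason an autoscaling group is an awkward fit.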
ZK is generally best set up with either 3 or 5 servers, depending on the
level of redundancy desired, and left alone unless there's a problem.
With 3 servers, the ensemble can survive the failure of 1 server. With
5, it can survive the failure of 2. As far as I know, getting back to
full redundancy is best handled as a manual process, even if running
version 3.5.
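The redundancy numbers above are just majority quorum: an ensemble of N
servers stays available while floor(N/2)+1 members are up, so it
tolerates floor((N-1)/2) failures. A quick sketch:

```shell
# Majority quorum: an ensemble of N ZooKeeper servers needs floor(N/2)+1
# members up, so it survives floor((N-1)/2) failures.
for n in 3 5; do
  echo "servers=$n quorum=$(( n / 2 + 1 )) tolerated_failures=$(( (n - 1) / 2 ))"
done
# prints:
# servers=3 quorum=2 tolerated_failures=1
# servers=5 quorum=3 tolerated_failures=2
```

This is also why even-sized ensembles buy nothing: 4 servers tolerate the
same single failure as 3, while adding one more machine that can fail.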
2. Should the SolrNodes be put in an AutoScaling group? Or should we just
launch/register SolrNodes using a lambda function/Azure function?
That really depends on what you're doing. There is no "one size fits
most" configuration.
I personally would avoid setting things up in a way that results in Solr
nodes automatically being added or removed. Adding a node will
generally result in a LOT of data being copied, and that can impact
performance in a major way, so adding nodes should be scheduled to
minimize impact. If it's automatic in response to high load, adding a
node can make performance a lot worse before it gets better. When a
node disappears, manual action is required for SolrCloud to forget the node.
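That manual cleanup is normally done through the Collections API. A
hedged sketch using the DELETENODE action, which drops every replica
that lived on the vanished node (host, port, and node name below are
placeholders you would adjust):

```shell
# Placeholders: point at any live Solr node, and set the dead node's
# name in Solr's host:port_solr format.
DEAD_NODE="10.0.1.23:8983_solr"
URL="http://localhost:8983/solr/admin/collections?action=DELETENODE&node=${DEAD_NODE}"
echo "$URL"        # inspect the request before sending it
# curl -s "$URL"   # uncomment to actually tell SolrCloud to forget the node
```

Until something like this is run, SolrCloud keeps expecting the node to
return, which is part of why fully automatic node removal is risky.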
3. Should the SolrNodes be associated with local storage or should they be
attached to shared storage volumes.
Lucene (which provides most of Solr's functionality) generally does not
like to work with shared storage. In addition to potential latency
issues for storage connected via a network, Lucene works extremely hard
to ensure that only one process can open an index. Using shared storage
will encourage attempts to share the index directory between multiple
processes, which almost always fails to work.
Things work best with locally attached storage using a fast connection
method (like SATA or SCSI) and a local filesystem. Lucene uses some
pretty involved file-locking mechanisms, which often do not work well on
remote or shared filesystems.
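For reference, the lock implementation is configurable per core in
solrconfig.xml; a sketch of the relevant fragment ("native" OS-level
locking is the usual default, and it is exactly the kind of lock that
tends to misbehave on NFS and other shared filesystems):

```xml
<indexConfig>
  <!-- "native" = OS-level file locking. Often unreliable on shared or
       remote filesystems, which is one reason local storage is preferred. -->
  <lockType>${solr.lock.type:native}</lockType>
</indexConfig>
```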
---
We (the developers that build this software) generally have a very
near-sighted view of things, not really caring about details like the
hardware deployment. That probably needs to change a little bit,
particularly when it comes to documentation.
Thanks,
Shawn