On 8/3/2020 12:04 PM, Mathew Mathew wrote:
Have been looking for architectural guidance on correctly configuring SolrCloud
on a public cloud (e.g. Azure/AWS).
In particular, the ZooKeeper-based autoscaling seems to overlap with the
autoscaling capabilities of the cloud platforms.

I have the following questions.

   1.  Should the ZooKeeper ensemble be put in an autoscaling group? This seems
to be a no, since the Solr nodes need to register against a static list of
ZooKeeper IPs.

Correct. There are features in ZK 3.5 for dynamic server membership, but in general it is better to have a static list. The client must be upgraded as well for that feature to work. The ZK client was upgraded to a 3.5 version in Solr 8.2.0. I don't think we have done any testing of the dynamic membership feature.
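As an illustration, the static list is what you hand to every Solr node through the ZK_HOST setting in solr.in.sh (or solr.in.cmd on Windows). The hostnames and the /solr chroot below are just placeholders for whatever your environment uses:

    ZK_HOST="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/solr"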

ZK is generally best set up with either 3 or 5 servers, depending on the level of redundancy desired, and left alone unless there's a problem. With 3 servers, the ensemble can survive the failure of 1 server. With 5, it can survive the failure of 2. As far as I know, getting back to full redundancy is best handled as a manual process, even if running version 3.5.
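For reference, a static three-server ensemble in zoo.cfg looks roughly like the sketch below. Hostnames and paths are placeholders, and each server also needs a matching myid file in its dataDir:

    tickTime=2000
    dataDir=/var/lib/zookeeper
    clientPort=2181
    initLimit=10
    syncLimit=5
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888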

   2.  Should the Solr nodes be put in an autoscaling group? Or should we just
launch/register Solr nodes using a Lambda function/Azure Function?

That really depends on what you're doing. There is no "one size fits most" configuration.

I personally would avoid setting things up in a way that results in Solr nodes automatically being added or removed. Adding a node will generally result in a LOT of data being copied, and that can impact performance in a major way, so adding nodes should be scheduled to minimize impact. If it's automatic in response to high load, adding a node can make performance a lot worse before it gets better. When a node disappears, manual action is required for SolrCloud to forget the node.
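If a node does go away permanently, the cleanup is normally done through the Collections API. Something like the DELETENODE action (the hostnames here are placeholders) removes the replicas that the dead node was hosting from the cluster state:

    curl "http://solr1.example.com:8983/solr/admin/collections?action=DELETENODE&node=dead-node.example.com:8983_solr"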

   3.  Should the Solr nodes be associated with local storage, or should they be
attached to shared storage volumes?

Lucene (which provides most of Solr's functionality) generally does not like to work with shared storage. In addition to potential latency issues for storage connected via a network, Lucene works extremely hard to ensure that only one process can open an index. Using shared storage will encourage attempts to share the index directory between multiple processes, which almost always fails to work.

Things work best with locally attached storage utilizing an extremely fast connection method (like SATA or SCSI), and a locally handled filesystem. Lucene uses some pretty involved file locking mechanisms, which often do not work well on remote or shared filesystems.
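The locking behavior is governed by the lockType setting in solrconfig.xml. The default shipped in recent example configs, shown below, uses Lucene's native OS-level locks, which is exactly the part that tends to misbehave on NFS and similar shared filesystems:

    <indexConfig>
      <lockType>${solr.lock.type:native}</lockType>
    </indexConfig>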

---

We (the developers who build this software) generally have a very near-sighted view of things, not really caring about details like the hardware deployment. That probably needs to change a little bit, particularly when it comes to documentation.

Thanks,
Shawn
