Hi Jan,

My Artemis knowledge is extremely limited, but I'll see if I can provide some AWS insight. I lead engineering for Amazon MQ, and we host ActiveMQ "Classic" brokers as a managed service on AWS. The image you shared as your goal is analogous to our mesh network of active/standby brokers (https://docs.aws.amazon.com/amazon-mq/latest/developer-guide/network-of-brokers.html#nob-topologies-mesh).

Amazon MQ's active/standby offering is two EC2 instances in separate AZs backed by a shared EFS volume. EFS latency is "single-digit millisecond" for reads and "single-digit to double-digit milliseconds" for writes, which seems to be acceptable for many of our customers. Availability Zones are within 100 kilometers of each other and generally produce single-digit-millisecond round-trip latency between AZs in the same region. I'm not sure whether that meets Justin's definition of "local networks with very low latency."

Amazon MQ runs a custom locking mechanism, since we did encounter split-brain scenarios early in the life of the service with ActiveMQ Classic's Shared File Locker on EFS data volumes. I don't know whether the Artemis file locker is the same as or different from Classic's, so your mileage may vary!
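To make the failure mode concrete: Classic's shared-storage locker boils down to holding an exclusive file lock on the shared volume, roughly the following shape (a simplified sketch with placeholder paths, not Amazon MQ's or ActiveMQ's actual code):

    import java.io.File;
    import java.io.RandomAccessFile;
    import java.nio.channels.FileLock;

    public class SharedStoreLockSketch {
        public static void main(String[] args) throws Exception {
            // The lock file lives on the volume both brokers share (EFS for us).
            File lockFile = new File("/mnt/efs/broker-data/lock");
            RandomAccessFile file = new RandomAccessFile(lockFile, "rw");
            // tryLock() returns null if another process already holds the lock,
            // i.e. a broker somewhere else is already active.
            FileLock lock = file.getChannel().tryLock();
            if (lock == null) {
                System.out.println("Lock held elsewhere; staying in standby.");
            } else {
                System.out.println("Lock acquired; safe to start as active.");
                // Hold the lock (and keep the file open) for the broker's lifetime.
            }
            // A real locker retries in a loop and releases the lock on shutdown.
        }
    }

On a local disk that lock has exactly one owner by construction. On a network file system, the hard part is what happens to lock state across partitions and lease recovery, and in our experience that is exactly where split brain crept in.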
For single-instance brokers, either in a network of brokers or standalone, we use Auto Scaling groups, EC2 health checks, and our own custom health checks to replace unhealthy EC2 instances, as Justin suggested. Provisioning and configuring a new instance can take up to 15 minutes in the worst case, which does result in downtime, as you mentioned. We do re-attach the EBS or EFS data volume, though, so no messages are lost when the broker eventually starts back up.
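If it helps, a custom health check can be as simple as a small probe that something outside the broker runs periodically. A stripped-down sketch (placeholder host/port; our real checks look at more than TCP connectivity):

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class BrokerHealthProbe {
        public static void main(String[] args) {
            try (Socket socket = new Socket()) {
                // Can we reach the broker's acceptor within five seconds?
                socket.connect(new InetSocketAddress("localhost", 61616), 5000);
                System.exit(0); // healthy
            } catch (Exception e) {
                System.exit(1); // unhealthy: let the ASG replace the instance
            }
        }
    }

The exit code feeds whatever mechanism marks the instance unhealthy so the ASG terminates and replaces it.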
Hopefully that helps!

- Lucas

On 2022-07-20, 9:54 PM, "Jan Šmucr" <jan.sm...@aimtecglobal.com> wrote:

Thank you for your response. Unfortunately, restarting a broker without failover means a service outage, which is not an option. The same goes for broker and host restarts and upgrades. Sometimes there is also an issue with the EC2 instance's network, meaning that neither we nor AWS can connect to it, and rebooting it then takes up to 45 minutes. And that's unacceptable.

We use Artemis and its core protocol to distribute metadata-enriched payloads up to hundreds of MiB in size from and to our customers, and also between the internal services responsible for processing those payloads. Low latency is therefore not as crucial as reliability. Our current setup consists of a single master/slave replicating pair, which is far from ideal, but a single broker handles the throughput just fine. My goal here is to prevent outages, data loss, split-brain situations, and duplicate delivery.

What I aim for: https://developers.redhat.com/sites/default/files/blog/2019/12/img_5df171079a67a.png
Source: https://developers.redhat.com/blog/2020/01/10/architecting-messaging-solutions-with-apache-activemq-artemis#reference_architectures

What do you suggest? Thank you.
Jan

On 21. 7. 2022 at 0:22, Justin Bertram <jbert...@apache.org> wrote:

Typically in a cloud environment like AWS you'd let the environment itself restart any failed broker instances, so that HA isn't even necessary. Also, if you want redundancy between disparate locations (e.g. different availability zones), I'd recommend mirroring [1]. Normal HA solutions (e.g. shared storage or replication) are really designed to be used on local networks with very low latency.

Justin

[1] https://activemq.apache.org/components/artemis/documentation/latest/amqp-broker-connections.html#mirroring

On Wed, Jul 20, 2022 at 6:12 AM Jan Šmucr <jan.sm...@aimtecglobal.com> wrote:

> Hello.
>
> We too are trying to switch from replication to a simpler model, especially
> the single master/slave pair cluster model, which suffers from split-brain
> issues. AWS EFS and the shared storage model make sense.
> The idea is that before we expand our cluster, there would be only one
> master and one slave node, each in a different AZ, with EFS-based storage
> shared between those AZs.
> Are there any drawbacks? Are the storage locks reliable enough that no
> split-brain situation occurs even if the two AZs stop communicating with
> each other?
>
> Looking forward to hearing some input. 🙂
> Jan
>
> On 2020/11/25 13:10:36 Luis De Bello wrote:
> > Hi guys,
> >
> > I would like to know about your experience operating an Artemis cluster
> > in AWS. Is anyone doing that? How do you handle the broker state? EFS,
> > JDBC?
> >
> > Currently we have 4 instances in production (EC2 instances) using disk
> > state, and we avoid destroying instances by using termination policies
> > (to avoid losing messages). During releases we mount extra instances,
> > let's say 4 more, and we wait until message distribution moves messages
> > from the old instances to the new ones.
> >
> > It is working fine, but it has the drawback of preventing the instances
> > from going down when something fails (OOM, topology issues), leading us
> > to execute manual restarts. So we are looking to move to a different
> > model; the options are externalizing the state or making the broker
> > stateless and repopulating messages.
> >
> > I would like to hear about your deployment models and similar issues.
> >
> > Regards,
> > Luis
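P.S. One more thought on Jan's reliability goals above: whichever broker topology you land on, the client side has to participate too. A minimal sketch of a failover-aware Artemis JMS client (host names, queue name, and tuning values are illustrative, assuming the standard Artemis JMS client library):

    import javax.jms.Connection;
    import javax.jms.MessageProducer;
    import javax.jms.Session;
    import javax.jms.TextMessage;

    import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

    public class FailoverClientSketch {
        public static void main(String[] args) throws Exception {
            // List both brokers; ha=true lets the client track the broker
            // topology, and reconnectAttempts=-1 means retry forever rather
            // than give up while a standby takes over.
            ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(
                    "(tcp://broker-a:61616,tcp://broker-b:61616)?ha=true&reconnectAttempts=-1");
            try (Connection connection = factory.createConnection()) {
                connection.start();
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                MessageProducer producer =
                        session.createProducer(session.createQueue("payloads"));
                TextMessage message = session.createTextMessage("metadata-enriched payload");
                producer.send(message);
            }
        }
    }

With reconnectAttempts=-1 the client keeps retrying through a failover window, which pairs well with a standby promotion or with an ASG that eventually brings a replacement broker back up on the same data volume.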