Hi Jan, 

My Artemis knowledge is extremely limited but I'll see if I can provide some 
AWS insight... I lead engineering for Amazon MQ and we host ActiveMQ "Classic" 
brokers as a managed service on AWS. The image you shared as your goal is 
analogous to our mesh network of active/standby brokers 
(https://docs.aws.amazon.com/amazon-mq/latest/developer-guide/network-of-brokers.html#nob-topologies-mesh).
 Amazon MQ's active/standby offering is 2 x EC2 instances in separate AZs 
backed by a shared EFS volume. EFS latency is "single-digit millisecond" for 
reads and "single digit to double-digit milliseconds" for writes which seems to 
be acceptable for many of our customers. Availability Zones are within 100 
kilometers of each other and generally produce single digit millisecond 
roundtrip latency between AZs in the same region. I'm not sure if this meets 
Justin's definition of "local networks with very low latency". Amazon MQ runs a 
custom locking mechanism since we did encounter split-brain scenarios early in 
the life of the service with ActiveMQ Classic's Shared File Locker and EFS data 
volumes. I don't know if the Artemis file locker is the same as or different 
from Classic's, so your mileage may vary! 
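
For reference, Classic's Shared File Locker is just the default KahaDB locker, 
configured along these lines in activemq.xml (a sketch; the EFS mount path is 
a placeholder):

    <persistenceAdapter>
       <kahaDB directory="/mnt/efs/activemq/kahadb">
          <locker>
             <!-- default locker: holds a file lock on the shared volume and
                  retries every lockAcquireSleepInterval ms until it wins -->
             <shared-file-locker lockAcquireSleepInterval="10000"/>
          </locker>
       </kahaDB>
    </persistenceAdapter>

The split-brain scenarios we hit came from those file-lock semantics over an 
NFS-style file system like EFS, which is why we swapped in our own locker.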

For single-instance brokers, either in a network of brokers or standalone, we 
use Auto Scaling groups, EC2 health checks, and our own custom health checks 
to replace unhealthy EC2 instances, as Justin suggested. Provisioning and 
configuring a new instance can take up to 15 minutes in the worst case, which 
does result in 
downtime as you mentioned. We do re-attach the EBS or EFS data volume so no 
messages are lost when the broker eventually starts back up. 

Hopefully that helps! 

- Lucas


On 2022-07-20, 9:54 PM, "Jan Šmucr" <jan.sm...@aimtecglobal.com> wrote:

    Thank you for your response.

    Unfortunately, restarting a broker without failover means a service 
outage, which is not an option. The same goes for broker and host restarts and 
upgrades. Sometimes there's also an issue with the EC2 instance's network, 
meaning that neither we nor AWS can connect to it, and rebooting it can then 
take up to 45 minutes. That's unacceptable.

    We use Artemis and its core protocol to distribute metadata-enriched 
payloads up to hundreds of MiBs in size to and from our customers, and also 
between the internal services responsible for processing those payloads. Low 
latency is therefore not as crucial as reliability.

    Our current setup consists of a single master/slave replicating pair, 
which is far from ideal, but a single broker handles the throughput just fine. 
My goal here is to prevent outages, data loss, split-brain situations, and 
duplicate delivery.
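
    For reference, such a replicating pair is configured roughly like this in 
broker.xml (a simplified sketch based on the Artemis HA documentation):

        <ha-policy>
           <replication>
              <master>
                 <!-- before activating after a restart, check whether another
                      broker is already live with our node ID, to avoid ending
                      up with two live servers -->
                 <check-for-live-server>true</check-for-live-server>
              </master>
           </replication>
        </ha-policy>

    with <slave/> in place of <master/> on the backup node.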

    What I aim for:
    
https://developers.redhat.com/sites/default/files/blog/2019/12/img_5df171079a67a.png
    Source:
    
https://developers.redhat.com/blog/2020/01/10/architecting-messaging-solutions-with-apache-activemq-artemis#reference_architectures

    What do you suggest?

    Thank you.
    Jan


    On 21. 7. 2022 at 0:22, Justin Bertram <jbert...@apache.org> wrote:
    Typically in a cloud environment like AWS you'd let the environment itself
    restart any failed broker instances so that HA isn't even necessary. Also,
    if you want redundancy between disparate locations (e.g. different
    availability zones) I'd recommend mirroring [1]. Normal HA solutions (e.g.
    shared storage or replication) are really designed to be used on local
    networks with very low latency.


    Justin

    [1]
    
https://activemq.apache.org/components/artemis/documentation/latest/amqp-broker-connections.html#mirroring
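
    For illustration, a minimal mirroring sketch per [1] (the hostname is a
    placeholder; see the linked docs for the full set of mirror options):

        <broker-connections>
           <amqp-connection uri="tcp://other-broker:61616" name="dr-mirror">
              <!-- forward messages, acknowledgements and deletions to the
                   remote broker over an outgoing AMQP connection -->
              <mirror/>
           </amqp-connection>
        </broker-connections>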

    On Wed, Jul 20, 2022 at 6:12 AM Jan Šmucr <jan.sm...@aimtecglobal.com>
    wrote:

    > Hello.
    >
    > We too are trying to switch from replication to a simpler model,
    > especially when it comes to the single master/slave pair cluster model,
    > which suffers from split-brain issues. AWS EFS and the shared storage
    > model make sense.
    > The idea is that before we expand our cluster, there would be only one
    > master and one slave node, each in a different AZ, with EFS-based storage
    > shared amongst these AZs.
    > Are there any drawbacks? Are the storage locks reliable enough that no
    > split-brain situation occurs even if the two AZs stop communicating with
    > each other?
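    >
    > For illustration, the shared-store variant looks roughly like this in
    > broker.xml (a sketch from the Artemis HA docs; the EFS path is made up):
    >
    >     <ha-policy>
    >        <shared-store>
    >           <master>
    >              <failover-on-shutdown>true</failover-on-shutdown>
    >           </master>
    >        </shared-store>
    >     </ha-policy>
    >
    >     <!-- both nodes point their journal at the same EFS mount -->
    >     <journal-directory>/mnt/efs/artemis/journal</journal-directory>
    >
    > with <slave/> on the standby node, which then blocks on the file lock.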
    >
    > Looking forward to hearing some input. 🙂
    > Jan
    >
    > On 2020/11/25 13:10:36 Luis De Bello wrote:
    > > Hi guys,
    > >
    > > I would like to know your experience operating an Artemis cluster in
    > > AWS. Is anyone doing that? How do you handle the broker state? EFS?
    > > JDBC?
    > >
    > > Currently we have 4 instances in production (EC2 instances) using disk
    > > state, and we avoid destroying instances using termination policies (to
    > > avoid losing messages). During releases we launch extra instances,
    > > let's say 4 more, and we wait until message distribution moves messages
    > > from the old instances to the new ones.
    > >
    > > It is working fine, but it has the drawback of preventing instances
    > > from going down when something fails (e.g., OOM or topology issues),
    > > forcing us to execute manual restarts. So we are looking to move to a
    > > different model; the options are externalizing the state or making the
    > > broker stateless and repopulating messages.
    > >
    > > I would like to hear about your deployment models and similar issues.
    > >
    > > Regards,
    > > Luis
    > >
    >

