Hi again Vikram,

I am convinced that there are suitable off-the-shelf solutions in the HA service niche, e.g. something similar to HAProxy or ELB. I am not an expert in that area, however, so I cannot recommend anything in particular.
Among the batch ecosystem components, I see two reasonable options. You will need a watchdog that respawns the scheduler on failure. It should be masterless, or you will have issues regarding who watches the watchdog.

Your first option would be to introduce Mesos, and use either Marathon or Aurora, which can guarantee that a service runs in exactly N instances (a minimal Marathon sketch is appended after the quoted mail below).

If you don't want to bring in Mesos as a dependency, you can roll your own watchdog. I believe it is simpler to base it on regular health checks driven by cron, as opposed to long-running processes that both monitor each other and your scheduler. If you put the same script in crontab on redundant machines, and use a ZooKeeper-based lease (http://kazoo.readthedocs.org/en/latest/api/recipe/lease.html), you effectively have an HA cron service. The script can then perform a health check and respawn your scheduler if necessary (a sketch of such a script is also appended below).

With both of these solutions, you risk a split-brain scenario if the current scheduler does not respond to health checks but still thinks it is alive, e.g. if you run on bare metal and there are network issues. A straightforward solution is to run your scheduler in a VM or other container, and bring the whole container down on failover.

From your mail, I get the impression that you are considering running the scheduler on one of the Cassandra nodes, or on the ZooKeeper leader node. I suggest avoiding that for stability reasons, and instead running the scheduler on a dedicated node. A centralised job scheduler is a bottleneck, and its resource consumption will rise over time. Cassandra works best in balanced and symmetric deployments, and ZooKeeper does not scale horizontally and is sensitive to overload, so both are best left alone.

I hope the information is useful.

Regards,
Lars Albertsson

On Tue, Aug 18, 2015 at 8:53 PM, Vikram Kone <[email protected]> wrote:
> Hi,
> I'm a newbie to Zookeeper, so pardon any naive question I ask here.
> I have a cassandra cluster running on linux VMs and have a spark job
> scheduler service running on one of the nodes. Since cassandra has a
> peer-peer architecture there is no concept of leader.
> I want to provide high availability for this job scheduler service using
> Zookeeper. I can't make any code changes to the job scheduler service since
> it's a 3rd party app.
> I'm thinking of copying the application folder on all the servers in the
> cluster and use zookeeper to start an instance of the service on the
> leader/master node by executing /opt/job-scheduler/bin/start.sh on leader
> election.
> Is this something easy to do with zookeeper?
>
> Please point to any documentation or tutorial on how to run a bash script
> on the leader node in zoo keeper's ensemble after a node is elected as
> leader by the quorum.
>
> Thanks
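To make the Marathon option more concrete, here is a minimal sketch of registering the scheduler as a Marathon app with exactly one instance, by posting an app definition to Marathon's /v2/apps REST endpoint. The Marathon address, the resource sizes, and the assumption that the scheduler package is installed under /opt/job-scheduler on every Mesos agent are placeholders rather than a definitive setup; note also that Marathon can only supervise the command if start.sh stays in the foreground.

    # Sketch: run the scheduler as a Marathon app with exactly one instance.
    # Assumptions: Marathon reachable at marathon.example.com:8080, and the
    # scheduler installed under /opt/job-scheduler on every Mesos agent.
    import requests

    app = {
        "id": "/spark-job-scheduler",
        # Must stay in the foreground so Marathon can supervise it.
        "cmd": "/opt/job-scheduler/bin/start.sh",
        "instances": 1,   # Marathon relaunches the task elsewhere if it dies
        "cpus": 1.0,
        "mem": 2048,
    }

    resp = requests.post("http://marathon.example.com:8080/v2/apps", json=app)
    resp.raise_for_status()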

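And here is a rough sketch of the cron-driven watchdog, using the kazoo lease recipe linked above. The ZooKeeper connection string, the TCP port used for the health probe, and the two-minute lease duration are assumptions to adapt; the start.sh path is taken from your mail. One simplification: the lease holder restarts the scheduler locally, so after a failover the scheduler effectively moves to whichever machine holds the lease.

    #!/usr/bin/env python
    # Watchdog sketch, meant to run from crontab on each redundant machine.
    # Only the machine currently holding the ZooKeeper lease acts.
    import datetime
    import socket
    import subprocess

    from kazoo.client import KazooClient
    from kazoo.recipe.lease import NonBlockingLease

    ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"       # assumption: your ZK ensemble
    SCHEDULER_PORT = 8080                         # assumption: scheduler listens here
    START_CMD = ["/opt/job-scheduler/bin/start.sh"]  # assumed to daemonize and return

    def scheduler_is_healthy():
        # Crude health check: can we open a TCP connection to the scheduler?
        # Replace with whatever probe your scheduler actually supports.
        try:
            socket.create_connection(("localhost", SCHEDULER_PORT), timeout=5).close()
            return True
        except Exception:
            return False

    def main():
        zk = KazooClient(hosts=ZK_HOSTS)
        zk.start()
        try:
            # The lease outlives the cron interval, so at most one machine
            # performs the check-and-respawn in any given period; the same
            # machine (same identifier) keeps renewing it on later runs.
            lease = NonBlockingLease(zk, "/leases/job-scheduler-watchdog",
                                     datetime.timedelta(minutes=2),
                                     identifier=socket.getfqdn())
            if lease and not scheduler_is_healthy():
                subprocess.check_call(START_CMD)
        finally:
            zk.stop()

    if __name__ == "__main__":
        main()

Install the script in crontab on each redundant machine, e.g. once a minute, and keep the cron interval shorter than the lease duration so the current holder renews its lease before another machine can take over.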