HBase on EMR is fairly reliable, but it is still subject to hardware
failures (which have happened to me before). Is there a best practice for
adding backup masters to an EMR cluster?
I know this isn't technically a supported feature from AWS, but we're
already heavily invested in HBase on EMR and would like to investigate
options for mitigating the risk of a master failure. In EMR, if the master
dies the entire cluster is terminated, so we need failover for HBase,
Hadoop/HDFS and ZooKeeper. The one idea I've had is to create a
second (or third) EMR cluster with its HBase, ZooKeeper and Hadoop/HDFS
configuration pointed at the primary cluster. This would in effect add
its RegionServers and DataNodes to the primary cluster. I know that
losing 1/3 to 1/2 of our DataNodes would most likely mean losing
some WALs, but re-ingesting the last day's worth of data is an
acceptable trade-off for us in exchange for not having downtime.
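To make the idea concrete, here is a rough sketch of the overrides the secondary cluster would get, written as a small Python script that builds EMR configuration objects (the Classification/Properties shape EMR accepts when a cluster is created). The master hostname is a hypothetical placeholder, and the NameNode port 8020 and the /user/hbase root directory are assumptions based on EMR defaults; they should be checked against the primary cluster's actual core-site.xml and hbase-site.xml. This only shows the secondary daemons pointing at the primary; on its own it does not give ZooKeeper or the NameNode a failover path.

```python
import json

# Hypothetical private DNS name of the primary cluster's master node.
PRIMARY_MASTER = "ip-10-0-0-10.ec2.internal"


def secondary_cluster_configurations(primary_master: str) -> list:
    """Build EMR configuration objects that point a secondary cluster's
    DataNodes and RegionServers at the primary cluster's master node.

    Assumes the EMR default NameNode RPC port (8020) and HBase root
    directory (/user/hbase); verify both on the primary cluster.
    """
    namenode = f"hdfs://{primary_master}:8020"
    return [
        {
            # DataNodes find the NameNode through fs.defaultFS.
            "Classification": "core-site",
            "Properties": {"fs.defaultFS": namenode},
        },
        {
            # RegionServers join the primary's ZooKeeper quorum and
            # share its HBase root directory in HDFS.
            "Classification": "hbase-site",
            "Properties": {
                "hbase.zookeeper.quorum": primary_master,
                "hbase.rootdir": f"{namenode}/user/hbase",
            },
        },
    ]


if __name__ == "__main__":
    print(json.dumps(secondary_cluster_configurations(PRIMARY_MASTER), indent=2))
```

The printed JSON could then be supplied as the configuration list when launching the secondary cluster.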
I realize this is a slightly crazy idea and that using something like
Kubernetes is the 'correct' solution, but I have to work with what we
have and mitigate possible issues. My question is: are there any big
issues that anyone would foresee with this idea?
Thanks for the feedback,
Austin