> > What we are still wondering about is whether it is, in principle, a
> > good idea to limit CloudStack's ability to freely and automatically
> > migrate VMs between all cluster nodes. Is setting
> > "enable.ha.storage.migration"=false the intended way to handle a
> > setup with multiple clusters, or is it more of a workaround for the
> > disadvantages of our setup? In the latter case we would like to know,
> > so that we can keep a focus on alternatives and be ready to improve
> > our setup in the mid-term.
> The logic around having multiple primary storage options tied to
> clusters is really designed to limit failure domains. Ideally you
> want to spread your workloads across different failure domains so
> that if you do lose a primary storage system, you still have services
> up and running.
> We build redundancy into the cluster and the storage attached to the
> cluster. We also run multiple clusters within a pod. If you spread
> your redundant VMs across multiple clusters (with their own primary
> storage), it's easier to absorb a catastrophic storage failure, as
> your eggs aren't in one basket.
Thank you for this explanation.
> We turn off HA storage migration, as it doesn't make much sense to
> us. It assumes the storage is still up, as you obviously can't
> migrate a VM to a different primary storage if it's down. If you have
> enough hosts in a cluster, you should never run into a situation
> where you can't bring all your VMs back up due to host failure. So in
> that sense, HA storage migration is a pointless feature if you build
> and scale your clusters properly.
Indeed, even with enough hosts we ran into exactly that situation, due
to a bug with multiple data disks and HVM introduced via https://github.
As a result, ACS tries to start the VM on node 1, 2, 3, and so on, and
fails on every host due to the underlying qemu-dm parsing error.
Finally it tries to start the VM on the nodes of another cluster, which
subsequently triggers a storage migration.
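For anyone following along, a minimal sketch of how the global setting
discussed above could be toggled, assuming the CloudMonkey CLI (cmk) is
installed and already configured with an admin API key; the management
server needs a restart for global settings to take effect:

```shell
# Disable HA-triggered storage migration (global setting, admin only).
# Assumes a configured CloudMonkey profile with admin credentials.
cmk update configuration name=enable.ha.storage.migration value=false

# Verify the new value.
cmk list configurations name=enable.ha.storage.migration
```

This is just the CLI equivalent of changing the setting in the UI under
Global Settings, not an endorsement of either approach.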