Really an interesting challenge...

I am not sure you will actually get to a point where a VM survives
this "undamaged". My experience with other technologies is usually the
same as yours: at least something like an "fsck" is needed to get things
running again - still with the problem that you often end up with some
kind of "undefined" data.
So the workflow I was used to was something like:
- get the affected volume
- check the filesystem
- try to run the old VM / create a new one
- try to see if the data can be saved
- revert to the last backup / snapshot (last known working state)
- restore the backed-up data (and even here, consider whether it is
really needed, depending on the kind of application the data is used for)

I can see why you might try a different setup for the storage system.
The questions would be how fast a fail-over has to happen, so that:
- the NFS mount is still considered available
- the NFS client is able to "cache" data that needs to be transmitted to
the storage server and write it after the NFS mount is available again
IMHO there are quite a few obstacles to get to this point, though.
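The second point - the client buffering writes until the mount comes back - can be sketched roughly as follows. This is a toy model, not a real NFS client: the class name, the dict-like "store", and the replay logic are all my own illustration of the queue-and-replay idea, nothing more.

```python
from collections import deque

class WriteBehindCache:
    """Toy sketch: queue writes while the backing store is unreachable,
    replay them in order once it is reachable again (NOT a real NFS client)."""

    def __init__(self, store):
        self.store = store          # dict-like stand-in for the storage server
        self.pending = deque()      # writes waiting for the mount to return
        self.available = True

    def write(self, key, data):
        if self.available:
            self.store[key] = data
        else:
            self.pending.append((key, data))  # cache until remount

    def remount(self):
        """Called when the NFS mount is available again: flush in order."""
        self.available = True
        while self.pending:
            key, data = self.pending.popleft()
            self.store[key] = data

store = {}
cache = WriteBehindCache(store)
cache.write("disk1", b"block-a")
cache.available = False            # simulate the NFS mount going away
cache.write("disk1", b"block-b")   # queued, not lost
cache.remount()                    # fail-over done, replay the queue
print(store["disk1"])              # b'block-b'
```

The hard parts a real implementation faces (bounded buffer memory, ordering across files, detecting the remount, and what to do when the buffer overflows before the storage returns) are exactly the obstacles mentioned above.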

Would be nice to be kept updated on your findings!


On Wed, 20 Oct 2021 at 15:09, Mauro Ferraro - G2K Hosting <
mferr...@g2khosting.com> wrote:

> Thanks to everyone for your feedback.
>
> We think the problem is hard to solve without damaging a VM. We were
> trying Gluster + NFS-Ganesha + Pacemaker + Corosync; when the NFS goes
> down the IP floats to another node, but that takes a few seconds, all
> the VMs go down, and in this scenario the VMs get damaged. The
> performance with Gluster is not really good either.
>
> Now we want to test with ACS 4.16 and Linstor. Does anyone have any
> references about this?
>
> Regards
>
> Mauro
>
>
> On 20/10/2021 at 05:44, Piotr Pisz wrote:
> > Hi,
> > I experienced this problem myself: in a KVM / Ceph / NFS-Ganesha
> > environment under full Ceph load, the Ganesha NFS server could hang.
> > Servers would then randomly restart due to the lack of NFS access,
> > which magnified the problem and caused a cascading restart of the
> > entire environment. We currently have the reboot line removed from
> > kvmheartbeat; instead we report the restart attempt via Prometheus.
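Reporting instead of rebooting could look roughly like the sketch below, which pushes a counter to a Prometheus Pushgateway in the text exposition format. The metric name, label, and Pushgateway address are made up for illustration; Piotr's actual setup may work differently (e.g. a node-exporter textfile collector instead of a Pushgateway).

```python
import urllib.request

# Hypothetical Pushgateway address and job name -- adjust to your setup.
PUSHGATEWAY = "http://pushgateway.example:9091/metrics/job/kvmheartbeat"

def heartbeat_failure_payload(host, failures):
    """Build a Prometheus text-exposition payload recording that this host
    would have rebooted (instead of actually rebooting it).
    Metric and label names are illustrative."""
    return f'kvm_heartbeat_reboot_attempts{{host="{host}"}} {failures}\n'

def report(host, failures):
    """POST the payload; an alerting rule on the Prometheus side then
    notifies the operator, who decides whether a reboot is warranted."""
    body = heartbeat_failure_payload(host, failures).encode()
    req = urllib.request.Request(PUSHGATEWAY, data=body, method="POST")
    with urllib.request.urlopen(req, timeout=5):
        pass

print(heartbeat_failure_payload("kvm-node-01", 3))
```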
> >
> > Regards,
> > Piotr
> >
> >
> > -----Original Message-----
> > From: Sina Kashipazha <s.kashipa...@protonmail.com.INVALID>
> > Sent: Wednesday, October 20, 2021 10:35 AM
> > To: users@cloudstack.apache.org
> > Subject: Re: All cluster reboot when a Primary storage fails
> >
> > Hey Daniel,
> >
> > PR #4586 (https://github.com/apache/cloudstack/pull/4586) addressed
> > your issue as well. I'm currently working on it. Could you share with
> > me how I can reproduce your reboot problem?
> >
> > Kind regards,
> > Sina
> >
> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> >
> > On Saturday, October 16th, 2021 at 05:40, Daniel Augusto Veronezi
> Salvador <dvsalvador...@gmail.com> wrote:
> >
> >> Hi Mauro,
> >>
> >> On KVM's monitor, when there is an inconsistency in the heartbeat
> >> file or the heartbeat timeout is exceeded several times, the host
> >> is, by default, restarted.
> >>
> >> PR 4586 (https://github.com/apache/cloudstack/pull/4586) already
> >> addressed this issue by externalizing a property, which allows the
> >> operator to decide whether the host must be restarted (the default
> >> is 'true', meaning the host will be restarted). However, this
> >> feature will only be available after release 4.16.
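For reference, the knob added by that PR ends up in the agent's configuration. The exact key below is from memory and should be double-checked against the PR before relying on it:

```properties
# /etc/cloudstack/agent/agent.properties (4.16+); key name unverified --
# see PR #4586 for the authoritative name and default.
# 'true' (the default) keeps the old behaviour of rebooting the host on
# repeated heartbeat failures; 'false' only raises an alert instead.
reboot.host.and.alert.management.on.heartbeat.timeout=false
```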
> >>
> >> Best regards,
> >>
> >> Daniel Salvador
> >>
> >> On 15/10/2021 20:43, Mauro Ferraro - G2K Hosting wrote:
> >>
> >>> Hi guys, how are you?
> >>>
> >>> We are having problems with ACS when a primary storage fails.
> >>>
> >>> We have several primary storages running Linux with an NFS server
> >>> serving KVM images, and every host has all the NFS servers mounted,
> >>> because a single host can be running VMs from different storages.
> >>> The main problem with this is that when some storage fails for any
> >>> reason, the whole cluster goes crazy and starts rebooting the hosts
> >>> to reconnect with that storage, and all the VMs on the cluster
> >>> (including the VMs that were working fine) go down because the
> >>> connection to one storage failed. If the problem with the storage
> >>> is permanent, the cluster never comes back and the hosts reboot
> >>> indefinitely.
> >>>
> >>> When this problem appears, the logs say:
> >>>
> >>> host heartbeat: kvmheartbeat.sh will reboot system because it was
> >>> unable to write the heartbeat to the storage.
> >>>
> >>> Many users edit the script kvmheartbeat.sh to avoid the host
> >>> reboots, or restart the agent on the host, but I am really not sure
> >>> that this is the real solution.
> >>>
> >>> Can someone help propose a better solution to this high-risk
> >>> problem?
> >>>
> >>> Regards,
> >>>
> >>> Mauro
>
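The behaviour Mauro describes - kvmheartbeat.sh rebooting after failed heartbeat writes - boils down to a loop like the sketch below. The real script is shell, and the paths, thresholds, and callback here are purely illustrative; the reboot call is replaced with a pluggable policy so the decision point is visible.

```python
import time

def heartbeat_loop(hb_path, on_give_up, max_failures=5, interval=0.0, rounds=6):
    """Sketch of the heartbeat policy: try to write a timestamp to the
    heartbeat file on primary storage; after max_failures consecutive
    failures, invoke the give-up policy (a reboot in the stock script,
    alert-only if you have patched it out)."""
    failures = 0
    for _ in range(rounds):
        try:
            with open(hb_path, "w") as f:
                f.write(str(time.time()))
            failures = 0               # any success resets the counter
        except OSError:                # dead NFS mount surfaces as an I/O error
            failures += 1
            if failures >= max_failures:
                on_give_up(failures)   # reboot / alert / restart agent
                return
        time.sleep(interval)

events = []
# A nonexistent directory stands in for the unreachable NFS mount.
heartbeat_loop("/nonexistent-mount/hb", lambda n: events.append(n), max_failures=5)
print(events)  # [5] -- the give-up policy fired instead of a hard reboot
```

Note one caveat this sketch glosses over: on a hard-mounted, truly hung NFS export the `open()`/`write()` may block indefinitely rather than raise, which is why the real script pairs the write with a timeout.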
