On Sun, Jun 6, 2021 at 12:31 PM Nir Soffer <nsof...@redhat.com> wrote:
>
> On Sat, Jun 5, 2021 at 3:25 AM David White via Users <users@ovirt.org> wrote:
> >
> > When I stopped the NFS service, I was connected to a VM over ssh.
> > I was also connected to one of the physical hosts over ssh, and was running 
> > top.
> >
> > I observed that server load continued to increase over time on the physical 
> > host.
> > Several of the VMs (perhaps all?), including the one I was connected to, 
> > went down due to an underlying storage issue.
> > It appears to me that HA VMs were restarted automatically. For example, I 
> > see the following in the oVirt Manager Event Log (domain names changed / 
> > redacted):
> >
> >
> > Jun 4, 2021, 4:25:42 AM
> > Highly Available VM server2.example.com failed. It will be restarted 
> > automatically.
>
> Do you have a cdrom on an ISO storage domain, maybe on the same NFS server
> that you stopped?

If you share the vm xml for the HA vms and the regular vms, it will be easier
to understand your system.

The best way is to use:

    sudo virsh -r dumpxml {vm-name}
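
To check whether a vm has a cdrom on the ISO domain, you can grep the
dumped xml (a minimal sketch; "vm-name" is a placeholder):

    # show the cdrom disk element with a few lines of context,
    # including the source file on the ISO domain, if any
    sudo virsh -r dumpxml vm-name | grep -B 2 -A 4 cdrom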

> > Jun 4, 2021, 4:25:42 AM
> > Highly Available VM mail.example.com failed. It will be restarted 
> > automatically.
> >
> > Jun 4, 2021, 4:25:42 AM
> > Highly Available VM core1.mgt.example.com failed. It will be restarted 
> > automatically.
> >
> > Jun 4, 2021, 4:25:42 AM
> > VM cha1-shared.example.com has been paused due to unknown storage error.
> >
> > Jun 4, 2021, 4:25:42 AM
> > VM server.example.org has been paused due to storage I/O problem.
> >
> > Jun 4, 2021, 4:25:42 AM
> > VM server.example.com has been paused.
>
> I guess this vm was using the NFS server?
>
> > Jun 4, 2021, 4:25:42 AM
> > VM server.example.org has been paused.
> >
> > Jun 4, 2021, 4:25:41 AM
> > VM server.example.org has been paused due to unknown storage error.
> >
> > Jun 4, 2021, 4:25:41 AM
> > VM HostedEngine has been paused due to storage I/O problem.
> >
> >
> > During this outage, I also noticed that customer websites were not working.
> > So I clearly took an outage.
> >
> > > If you have a good way to reproduce the issue, please file a bug with
> > > all the logs, so we can try to improve this situation.
> >
> > I don't have a separate lab environment, but if I'm able to reproduce the 
> > issue off hours, I may try to do so.
> > What logs would be helpful?
>
> /var/log/vdsm/vdsm.log
> /var/log/sanlock.log
> /var/log/messages or output of journalctl
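>
> To collect the journal around the time of the event, something like
> this works (the time range here is just an example):
>
>     # dump the journal for the window of the outage to a file
>     journalctl --since "2021-06-04 04:00" --until "2021-06-04 05:00" > journal.txt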
>
> > > An NFS storage domain will always be able to affect other storage domains,
> > > but if you mount your NFS storage outside of ovirt, the mount will not
> > > affect the system.
> > >
> >
> > > Then you can backup to this mount, for example using backup_vm.py:
> > > https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/backup_vm.py
> >
> > If I'm understanding you correctly, it sounds like you're suggesting that I
> > just connect one (or more) hosts to the NFS mount manually,
>
> Yes
>
> > and don't use the oVirt manager to build the backup domain. Then just run 
> > this script on a cron or something - is that correct?
>
> Yes.
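>
> If you mount the NFS export manually, consider a soft mount, so that
> processes get I/O errors instead of hanging forever when the NFS server
> goes down. A minimal sketch, using made-up server and export names:
>
>     # mount the backup export outside of ovirt
>     mount -t nfs -o soft,timeo=100,retrans=3 backup.example.com:/export/backups /mnt/backups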
>
> You can run the backup in many ways; for example, you can run it via ssh
> from another host, finding where the vms are running and connecting to
> the host to perform a backup. This is outside of ovirt, since ovirt does not
> have a built-in backup feature. We have a backup API and example code using
> it, which can be used to build a backup solution.
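>
> As an illustration, a nightly cron entry on a host could look like this
> (the exact backup_vm.py options may differ between sdk versions, so check
> the script's --help; the config name "myengine" and "vm-uuid" are
> placeholders):
>
>     # run a full backup every night at 02:00 and log the output
>     0 2 * * * python3 /root/backup_vm.py -c myengine full vm-uuid >> /var/log/vm-backup.log 2>&1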
>
> > Sent with ProtonMail Secure Email.
> >
> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > On Friday, June 4, 2021 12:29 PM, Nir Soffer <nsof...@redhat.com> wrote:
> >
> > > On Fri, Jun 4, 2021 at 12:11 PM David White via Users users@ovirt.org 
> > > wrote:
> > >
> >
> > > > I'm trying to figure out how to keep a "broken" NFS mount point from 
> > > > causing the entire HCI cluster to crash.
> > > > HCI is working beautifully.
> > > > Last night, I finished adding some NFS storage to the cluster - this is
> > > > storage that I don't necessarily need to be HA, and that I was hoping to
> > > > use for backups and less-important VMs, since my Gluster (SSD) storage
> > > > availability is pretty limited.
> > > > But as a test, after I got everything setup, I stopped the nfs-server.
> > > > This caused the entire cluster to go down, and several VMs - that are 
> > > > not stored on the NFS storage - went belly up.
> > >
> >
> > > Please explain in more detail "went belly up".
> > >
> >
> > > In general, vms not using the nfs storage domain should not be affected,
> > > but due to an unfortunate design of vdsm, all storage domains share the
> > > same global lock, and when one storage domain has trouble, it can cause
> > > delays in operations on other domains. This may lead to timeouts and vms
> > > being reported as non-responsive, but the actual vms should not be
> > > affected.
> > >
> >
> > > If you have a good way to reproduce the issue, please file a bug with
> > > all the logs, so we can try to improve this situation.
> > >
> >
> > > > Once I started the NFS server process again, HCI did what it was 
> > > > supposed to do, and was able to automatically recover.
> > > > My concern is that NFS is a single point of failure, and if VMs that
> > > > don't even rely on that storage are affected when the NFS storage goes
> > > > away, then I don't want anything to do with it.
> > >
> >
> > > You need to understand the actual effect on the vms before you reject NFS.
> > >
> >
> > > > On the other hand, I'm still struggling to come up with a good way to
> > > > run on-site backups and snapshots without using up more gluster space
> > > > on my (more expensive) SSD storage.
> > >
> >
> > > NFS is useful for this purpose. You don't need synchronous replication,
> > > and you want the backups outside of your cluster, so that in case of a
> > > disaster you can restore them on another system.
> > >
> >
> > > Snapshots are always on the same storage, so they will not help.
> > >
> >
> > > > Is there any way to setup NFS storage for a Backup Domain - as well as 
> > > > a Data domain (for lesser important VMs) - such that, if the NFS server 
> > > > crashed, all of my non-NFS stuff would be unaffected?
> > >
> >
> > > An NFS storage domain will always be able to affect other storage domains,
> > > but if you mount your NFS storage outside of ovirt, the mount will not
> > > affect the system.
> > >
> >
> > > Then you can backup to this mount, for example using backup_vm.py:
> > > https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/backup_vm.py
> > >
> >
> > > Or use one of the backup solutions; none of them use a storage domain
> > > for keeping the backups, so the mount should not affect the system.
> > >
> >
> > > Nir
> > >
> >
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/ZGM6LJPG7SGVCTW7D6RWOFRS6UYQCV4E/
