On Sun, Jun 6, 2021 at 12:31 PM Nir Soffer <nsof...@redhat.com> wrote:
> On Sat, Jun 5, 2021 at 3:25 AM David White via Users <users@ovirt.org> wrote:
> >
> > When I stopped the NFS service, I was connected to a VM over ssh.
> > I was also connected to one of the physical hosts over ssh, and was
> > running top.
> >
> > I observed that server load continued to increase over time on the
> > physical host.
> > Several of the VMs (perhaps all?), including the one I was connected
> > to, went down due to an underlying storage issue.
> > It appears to me that HA VMs were restarted automatically. For example,
> > I see the following in the oVirt Manager Event Log (domain names
> > changed / redacted):
> >
> > Jun 4, 2021, 4:25:42 AM
> > Highly Available VM server2.example.com failed. It will be restarted
> > automatically.
>
> Do you have a cdrom on an ISO storage domain, maybe on the same NFS server
> that you stopped?
If you share the vm xml for the ha vms and the regular vms, it will be
easier to understand your system. The best way is to use:

    sudo virsh -r dumpxml {vm-name}

> > Jun 4, 2021, 4:25:42 AM
> > Highly Available VM mail.example.com failed. It will be restarted
> > automatically.
> >
> > Jun 4, 2021, 4:25:42 AM
> > Highly Available VM core1.mgt.example.com failed. It will be restarted
> > automatically.
> >
> > Jun 4, 2021, 4:25:42 AM
> > VM cha1-shared.example.com has been paused due to unknown storage error.
> >
> > Jun 4, 2021, 4:25:42 AM
> > VM server.example.org has been paused due to storage I/O problem.
> >
> > Jun 4, 2021, 4:25:42 AM
> > VM server.example.com has been paused.
>
> I guess this vm was using the NFS server?
>
> > Jun 4, 2021, 4:25:42 AM
> > VM server.example.org has been paused.
> >
> > Jun 4, 2021, 4:25:41 AM
> > VM server.example.org has been paused due to unknown storage error.
> >
> > Jun 4, 2021, 4:25:41 AM
> > VM HostedEngine has been paused due to storage I/O problem.
> >
> > During this outage, I also noticed that customer websites were not
> > working. So I clearly took an outage.
> >
> > > If you have a good way to reproduce the issue please file a bug with
> > > all the logs, we try to improve this situation.
> >
> > I don't have a separate lab environment, but if I'm able to reproduce
> > the issue off hours, I may try to do so.
> > What logs would be helpful?
>
> /var/log/vdsm.log
> /var/log/sanlock.log
> /var/log/messages or output of journalctl
>
> > > NFS storage domain will always affect other storage domains, but if
> > > you mount your NFS storage outside of ovirt, the mount will not
> > > affect the system.
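One way to check for such a cdrom from the host is to dump a vm's xml and
look at the disk source paths: anything under the NFS mount point ties the
vm to the NFS server, even if its main disk is on Gluster. A minimal sketch
follows; the xml and all names in it are fabricated for illustration, and on
a real host you would feed in the output of "sudo virsh -r dumpxml {vm-name}":

```shell
# Sketch only: the XML below is a fabricated stand-in for dumpxml output;
# on a real host use:  sudo virsh -r dumpxml {vm-name} > /tmp/vm.xml
cat > /tmp/vm.xml <<'EOF'
<domain type='kvm'>
  <devices>
    <disk type='file' device='disk'>
      <source file='/rhev/data-center/mnt/glusterSD/gluster1:_vmstore/images/disk1/img'/>
    </disk>
    <disk type='file' device='cdrom'>
      <source file='/rhev/data-center/mnt/nfs.example.com:_export_iso/some.iso'/>
    </disk>
  </devices>
</domain>
EOF

# List every disk source path; a path containing the NFS server name means
# the vm depends on that NFS mount (here, via an ISO cdrom).
grep -o "source file='[^']*'" /tmp/vm.xml
```

Since ovirt mounts storage domains on the host under
/rhev/data-center/mnt/<server>:<_export_path>, the NFS server name shows up
directly in the disk path.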
> > > Then you can backup to this mount, for example using backup_vm.py:
> > > https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/backup_vm.py
> >
> > If I'm understanding you correctly, it sounds like you're suggesting
> > that I just connect 1 (or multiple) hosts to the NFS mount manually,
>
> Yes
>
> > and don't use the oVirt manager to build the backup domain. Then just
> > run this script on a cron or something - is that correct?
>
> Yes.
>
> You can run the backup in many ways, for example you can run it via ssh
> from another host, finding where vms are running, and connecting to the
> host to perform a backup. This is outside of ovirt, since ovirt does not
> have a built-in backup feature. We have a backup API and example code
> using it which can be used to build a backup solution.
>
> > Sent with ProtonMail Secure Email.
> >
> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > On Friday, June 4, 2021 12:29 PM, Nir Soffer <nsof...@redhat.com> wrote:
> >
> > > On Fri, Jun 4, 2021 at 12:11 PM David White via Users
> > > <users@ovirt.org> wrote:
> > > >
> > > > I'm trying to figure out how to keep a "broken" NFS mount point
> > > > from causing the entire HCI cluster to crash.
> > > > HCI is working beautifully.
> > > > Last night, I finished adding some NFS storage to the cluster -
> > > > this is storage that I don't necessarily need to be HA, and I was
> > > > hoping to store some backups and less-important VMs on it, since my
> > > > Gluster (SSD) storage availability is pretty limited.
> > > > But as a test, after I got everything set up, I stopped the
> > > > nfs-server. This caused the entire cluster to go down, and several
> > > > VMs - that are not stored on the NFS storage - went belly up.
> > >
> > > Please explain in more detail "went belly up".
> > > In general vms not using the nfs storage domain should not be
> > > affected, but due to an unfortunate design of vdsm, all storage
> > > domains share the same global lock, and when one storage domain has
> > > trouble, it can cause delays in operations on other domains. This
> > > may lead to timeouts and vms reported as non-responsive, but the
> > > actual vms should not be affected.
> > >
> > > If you have a good way to reproduce the issue please file a bug with
> > > all the logs, we try to improve this situation.
> > >
> > > > Once I started the NFS server process again, HCI did what it was
> > > > supposed to do, and was able to automatically recover.
> > > > My concern is that NFS is a single point of failure, and if VMs
> > > > that don't even rely on that storage are affected when the NFS
> > > > storage goes away, then I don't want anything to do with it.
> > >
> > > You need to understand the actual effect on the vms before you
> > > reject NFS.
> > >
> > > > On the other hand, I'm still struggling to come up with a good way
> > > > to run on-site backups and snapshots without using up more gluster
> > > > space on my (more expensive) SSD storage.
> > >
> > > NFS is useful for this purpose. You don't need synchronous
> > > replication, and you want the backups outside of your cluster so
> > > that in case of disaster you can restore the backups on another
> > > system.
> > >
> > > Snapshots are always on the same storage, so they will not help.
> > >
> > > > Is there any way to set up NFS storage for a Backup Domain - as
> > > > well as a Data domain (for lesser important VMs) - such that, if
> > > > the NFS server crashed, all of my non-NFS stuff would be
> > > > unaffected?
> > >
> > > NFS storage domain will always affect other storage domains, but if
> > > you mount your NFS storage outside of ovirt, the mount will not
> > > affect the system.
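A rough sketch of what that out-of-band setup could look like on one host.
Everything here is an assumption to adapt (server name, export path, mount
point, schedule), and the backup_vm.py flags are illustrative only - verify
them against the script's --help for your SDK version. A soft mount means a
dead NFS server returns I/O errors instead of hanging the client:

```
# /etc/fstab on one host -- mounted by hand, not through an ovirt storage
# domain, so this mount is invisible to vdsm's storage monitoring
# (hypothetical server and export names)
backup1.example.com:/export/backups  /mnt/backups  nfs  soft,timeo=100,retrans=3  0 0

# crontab entry: nightly full backup of one vm into the mount
# (illustrative flags -- check backup_vm.py --help; VM_UUID is a placeholder)
30 2 * * * python3 /root/backup_vm.py -c engine full --backup-dir /mnt/backups VM_UUID
```

With this layout the backups land outside any storage domain, so stopping
the backup NFS server can fail a backup run but should not touch the vms.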
> > > Then you can backup to this mount, for example using backup_vm.py:
> > > https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/backup_vm.py
> > >
> > > Or use one of the backup solutions; none of them use a storage
> > > domain for keeping the backups, so the mount should not affect the
> > > system.
> > >
> > > Nir
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/ZGM6LJPG7SGVCTW7D6RWOFRS6UYQCV4E/