Hi Strahil,

The majority of the VMs are UEFI, but I do have some Legacy BIOS VMs and they 
are getting corrupted too. I have a mix of RHEL/CentOS 7 and 8.

All of them are affected: XFS on everything, with the default values from 
installation.

There’s one VM with Ubuntu 18.04 LTS and ext4, and no corruption is found 
there. The three NTFS VMs that I have are fine too.

So the common denominator is XFS on Enterprise Linux (7 or 8).
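
For reference, this is roughly the kind of check I’ve been running inside each 
guest to spot the silent cases early. The exact xfs_db invocation is to the 
best of my understanding, and the device path is just an example from one of 
the VMs:

# read-only look at XFS superblock 0, from inside the guest
xfs_db -r -c 'sb 0' -c 'print magicnum' /dev/mapper/rhel-root
# a healthy filesystem reports magicnum = 0x58465342 ("XFSB");
# when xfs_db refuses to open the device at all, as in the output quoted
# further down, superblock 0 is already damaged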

Any other ideas?

Thanks.

PS: The VM that will die after the reboot is almost new. It was installed on 
November 19th, and oVirt still shows the Run Once flag because it has never 
been rebooted since installation.


Sent from my iPhone

> On 29 Nov 2020, at 17:03, Strahil Nikolov <hunter86...@yahoo.com> wrote:
> 
> Damn...
> 
> You are using EFI boot. Does this happen only to EFI machines ?
> Did you notice if only EL 8 is affected ?
> 
> Best Regards,
> Strahil Nikolov
> 
> 
> 
> 
> 
> 
> On Sunday, 29 November 2020 at 19:36:09 GMT+2, Vinícius Ferrão 
> <fer...@versatushpc.com.br> wrote: 
> 
> 
> 
> 
> 
> Yes!
> 
> I have a live VM right now that will be dead on a reboot:
> 
> [root@kontainerscomk ~]# cat /etc/*release
> NAME="Red Hat Enterprise Linux"
> VERSION="8.3 (Ootpa)"
> ID="rhel"
> ID_LIKE="fedora"
> VERSION_ID="8.3"
> PLATFORM_ID="platform:el8"
> PRETTY_NAME="Red Hat Enterprise Linux 8.3 (Ootpa)"
> ANSI_COLOR="0;31"
> CPE_NAME="cpe:/o:redhat:enterprise_linux:8.3:GA"
> HOME_URL="https://www.redhat.com/";
> BUG_REPORT_URL="https://bugzilla.redhat.com/";
> 
> REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
> REDHAT_BUGZILLA_PRODUCT_VERSION=8.3
> REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
> REDHAT_SUPPORT_PRODUCT_VERSION="8.3"
> Red Hat Enterprise Linux release 8.3 (Ootpa)
> Red Hat Enterprise Linux release 8.3 (Ootpa)
> 
> [root@kontainerscomk ~]# sysctl -a | grep dirty
> vm.dirty_background_bytes = 0
> vm.dirty_background_ratio = 10
> vm.dirty_bytes = 0
> vm.dirty_expire_centisecs = 3000
> vm.dirty_ratio = 30
> vm.dirty_writeback_centisecs = 500
> vm.dirtytime_expire_seconds = 43200
> 
> [root@kontainerscomk ~]# xfs_db -r /dev/dm-0
> xfs_db: /dev/dm-0 is not a valid XFS filesystem (unexpected SB magic number 0xa82a0000)
> Use -F to force a read attempt.
> [root@kontainerscomk ~]# xfs_db -r /dev/dm-0 -F
> xfs_db: /dev/dm-0 is not a valid XFS filesystem (unexpected SB magic number 0xa82a0000)
> xfs_db: size check failed
> xfs_db: V1 inodes unsupported. Please try an older xfsprogs.
> 
> [root@kontainerscomk ~]# cat /etc/fstab
> #
> # /etc/fstab
> # Created by anaconda on Thu Nov 19 22:40:39 2020
> #
> # Accessible filesystems, by reference, are maintained under '/dev/disk/'.
> # See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info.
> #
> # After editing this file, run 'systemctl daemon-reload' to update systemd
> # units generated from this file.
> #
> /dev/mapper/rhel-root  /                      xfs    defaults        0 0
> UUID=ad84d1ea-c9cc-4b22-8338-d1a6b2c7d27e /boot   xfs    defaults        0 0
> UUID=4642-2FF6          /boot/efi              vfat    umask=0077,shortname=winnt 0 2
> /dev/mapper/rhel-swap  none                    swap    defaults        0 0
> 
> Thanks,
> 
> 
> -----Original Message-----
> From: Strahil Nikolov <hunter86...@yahoo.com> 
> Sent: Sunday, November 29, 2020 2:33 PM
> To: Vinícius Ferrão <fer...@versatushpc.com.br>
> Cc: users <users@ovirt.org>
> Subject: Re: [ovirt-users] Re: Constantly XFS in memory corruption inside VMs
> 
> Can you check the output on the VM that was affected:
> # cat /etc/*release
> # sysctl -a | grep dirty
> 
> 
> Best Regards,
> Strahil Nikolov
> 
> 
> 
> 
> 
> On Sunday, 29 November 2020 at 19:07:48 GMT+2, Vinícius Ferrão via Users 
> <users@ovirt.org> wrote: 
> 
> 
> 
> 
> 
> Hi Strahil.
> 
> I’m not using any barrier options on mount; these are the default settings 
> from the CentOS install.
> 
> I have an additional finding: there’s a large number of discarded packets on 
> the switch, on the hypervisor interfaces.
> 
> Discards should be OK as far as I know, since TCP handles them with proper 
> retransmissions, but I wonder whether this may be related. Our storage is 
> over NFS. My general expertise is with iSCSI, and I’ve never seen this kind 
> of issue with iSCSI, not that I’m aware of.
> 
> In other clusters, I’ve seen a high number of discards with iSCSI on 
> XenServer 7.2 but there’s no corruption on the VMs there...
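> 
> In case it helps correlate things, this is roughly what I plan to check on 
> the hypervisors to see whether the switch discards actually turn into 
> NFS-level retransmissions (generic commands, to the best of my knowledge, 
> nothing oVirt-specific):
> 
> # RPC call and retransmission counters for the NFS client
> nfsstat -rc
> # TCP-level retransmission totals
> netstat -s | grep -i retrans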
> 
> Thanks,
> 
> Sent from my iPhone
> 
>> On 29 Nov 2020, at 04:00, Strahil Nikolov <hunter86...@yahoo.com> wrote:
>> 
>> Are you using "nobarrier" mount options in the VM ?
>> 
>> If yes, can you try to remove the "nobarrier" option.
>> 
>> 
>> Best Regards,
>> Strahil Nikolov
>> 
>> 
>> 
>> 
>> 
>> 
>> On Saturday, 28 November 2020 at 19:25:48 GMT+2, Vinícius Ferrão 
>> <fer...@versatushpc.com.br> wrote: 
>> 
>> 
>> 
>> 
>> 
>> Hi Strahil,
>> 
>> I moved a running VM to another host, rebooted it, and no corruption was 
>> found; if there is any, it may be silent corruption... I’ve had cases where 
>> the VM was brand new: just installed, ran dnf -y update to get the updated 
>> packages, rebooted, and boom, XFS corruption. So perhaps the migration 
>> process isn’t the one to blame.
>> 
>> That said, I do remember a VM going down during a migration and coming back 
>> corrupted after the reboot. But that may not be related; it was perhaps 
>> already in an inconsistent state.
>> 
>> Anyway, here's the mount options:
>> 
>> Host1:
>> 192.168.10.14:/mnt/pool0/ovirt/vm on /rhev/data-center/mnt/192.168.10.14:_mnt_pool0_ovirt_vm type nfs4 (rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,soft,nosharecache,proto=tcp,timeo=100,retrans=3,sec=sys,clientaddr=192.168.10.1,local_lock=none,addr=192.168.10.14)
>> 
>> Host2:
>> 192.168.10.14:/mnt/pool0/ovirt/vm on /rhev/data-center/mnt/192.168.10.14:_mnt_pool0_ovirt_vm type nfs4 (rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,soft,nosharecache,proto=tcp,timeo=100,retrans=3,sec=sys,clientaddr=192.168.10.1,local_lock=none,addr=192.168.10.14)
>> 
>> The options are the defaults; I didn’t change anything when configuring this 
>> cluster.
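>> 
>> (Those are from the plain mount output on each host; as far as I know 
>> nfsstat can show the same per-mount detail:
>> 
>> # show the mount options for every NFS mount on the host
>> nfsstat -m
>> )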
>> 
>> Thanks.
>> 
>> 
>> 
>> -----Original Message-----
>> From: Strahil Nikolov <hunter86...@yahoo.com>
>> Sent: Saturday, November 28, 2020 1:54 PM
>> To: users <users@ovirt.org>; Vinícius Ferrão 
>> <fer...@versatushpc.com.br>
>> Subject: Re: [ovirt-users] Constantly XFS in memory corruption inside 
>> VMs
>> 
>> Can you check with a test VM whether this happens after a Virtual Machine 
>> migration ?
>> 
>> What are your mount options for the storage domain ?
>> 
>> Best Regards,
>> Strahil Nikolov
>> 
>> 
>> 
>> 
>> 
>> 
>> On Saturday, 28 November 2020 at 18:25:15 GMT+2, Vinícius Ferrão via Users 
>> <users@ovirt.org> wrote: 
>> 
>> 
>> 
>> 
>> 
>> Hello,
>> 
>> I’m trying to discover why an oVirt 4.4.3 Cluster with two hosts and NFS 
>> shared storage on TrueNAS 12.0 is constantly getting XFS corruption inside 
>> the VMs.
>> 
>> For no apparent reason VMs get corrupted: sometimes they halt, sometimes the 
>> corruption is silent, and after a reboot the system is unable to boot due to 
>> “corruption of in-memory data detected”. Sometimes the corrupted data is 
>> “all zeroes”, sometimes there’s data there. In extreme cases XFS superblock 
>> 0 gets corrupted and the system cannot even detect an XFS partition anymore, 
>> since the XFS magic number on the first blocks of the virtual disk is 
>> corrupted.
>> 
>> This has been happening for a month now. We had to roll back some backups, 
>> and I no longer trust the state of the VMs.
>> 
>> Using xfs_db I can see that some VMs have corrupted superblocks while the VM 
>> is still up. One in particular had sb 0 corrupted, so I knew the machine 
>> would be gone as soon as a reboot kicked in, and that’s exactly what 
>> happened.
>> 
>> Another day I was installing a new CentOS 8 VM, and after running dnf -y 
>> update and rebooting, the VM was corrupted and needed an XFS repair. That 
>> was an extreme case.
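>> 
>> For what it’s worth, this is the kind of non-destructive check that can be 
>> run before deciding whether to repair (the device path is just an example, 
>> and the filesystem has to be unmounted, e.g. from a rescue environment):
>> 
>> # dry run: report XFS problems without modifying anything
>> xfs_repair -n /dev/mapper/cl-root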
>> 
>> So, I’ve looked at the TrueNAS logs and there’s apparently nothing wrong 
>> with the system: no errors logged in dmesg, nothing in /var/log/messages, 
>> and no errors on the zpools, not even after scrub operations. On the switch, 
>> a Catalyst 2960X, we’ve been monitoring all of its interfaces. There are no 
>> “up and down” events and zero errors on all interfaces (we have a 4x port 
>> LACP on the TrueNAS side and a 2x port LACP on each host); everything seems 
>> to be fine. The only metric that I was unable to get is “dropped packets”, 
>> but I don’t know whether that can be an issue or not.
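>> 
>> On the hypervisor side, the only related counters I can read directly are 
>> the per-interface error/drop counters, e.g. (the interface name below is 
>> just a placeholder):
>> 
>> # RX/TX errors and dropped packets on the bond carrying the NFS traffic
>> ip -s link show bond0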
>> 
>> Finally, on oVirt, I can’t find anything either. I looked at 
>> /var/log/messages and /var/log/sanlock.log but found nothing suspicious.
>> 
>> Is anyone out there experiencing this? Our VMs are mainly CentOS 7/8 with 
>> XFS; there are 3 Windows VMs that do not seem to be affected, but everything 
>> else is.
>> 
>> Thanks all.
>> 
>> 
>> 
> 
> 
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/KYD3C5VDQ4WDFLEIZOG4Z77Z5TMVN5QV/
