[ovirt-users] Re: VMs hang periodically: gluster problem?
I did it following the {read,write}-perf examples reported in paragraphs 12.6 and 12.7 of https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.5/html/administration_guide/sect-running_the_volume_top_command; here are the results: https://cloud.ssis.sm/index.php/s/9bncnNSopnFReRS
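(For reference, the commands described in those paragraphs are presumably something like the following; this is only a sketch, reusing the gv1 volume and the ovirt-node2 brick named later in this thread, with block size, count and list-cnt chosen arbitrarily:)

gluster volume top gv1 read-perf bs 4096 count 1024 brick ovirt-node2.ovirt:/brickgv1/gv1 list-cnt 10
gluster volume top gv1 write-perf bs 4096 count 1024 brick ovirt-node2.ovirt:/brickgv1/gv1 list-cnt 10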
[ovirt-users] Re: VMs hang periodically: gluster problem?
It's just a guess until there is proof in gluster's top read-perf/write-perf. Can you share at least the read-perf? I'm pretty confident that the issue is not network-related, as cluster.choose-local requires all reads to be local (reducing the network usage).

Best Regards,
Strahil Nikolov

On Wed, Sep 21, 2022 at 10:40, Diego Ercolani wrote:
I tried to measure IOs using gluster volume top, but its results seem very cryptic to me (I need a deep analysis and don't have the time now). Thank you very much for your analysis. If I understood correctly, the problem is that the consumer SSD cache is too weak to help even under a small number (~15) of not particularly IO-intensive VMs, so the IO hangs as the performance is poor, and this hangs the VMs. The VM kernel thinks the CPU has hung and so it crashes. This seems to be the case.
If possible, a sort of profiler in the gluster environment that surfaces issues related to the speed of the underlying storage infrastructure would be very useful; the problem can be related to the disks or to the network, but in any case the errors reported to the user are rather misleading, as they suggest a data integrity issue ("cannot read..." or something like this).
Only for reference, these are the first lines of the "open" top command (currently I don't experience problems):

[root@ovirt-node2 ~]# gluster volume top gv1 open
Brick: ovirt-node2.ovirt:/brickgv1/gv1
Current open fds: 15, Max open fds: 38, Max openfd time: 2022-09-19 07:27:20.033304 +0000
Count     filename
=======================
331763    /45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/inbox
66284     /45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/leases
53939     /45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/metadata.new
169       /45b4f14c-8323-482f-90ab-99d8fd610018/images/910fa026-d30b-4be2-9111-3c9f4f646fde/b7d6f39a-1481-4f5c-84fd-fc43f9e14d71
[...]
[ovirt-users] Re: VMs hang periodically: gluster problem?
I hope you do realize that modern consumer SSDs have a small cache (at least according to https://www.storagereview.com/review/samsung-860-evo-ssd-review ) and we can't rule out the disks.
Use gluster's top command to view the read (17.2.6) and write (17.2.7) performance of the bricks before (regular usage), during (high load) and after the backup (regular usage): https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.5/html/administration_guide/sect-running_the_volume_top_command

Best Regards,
Strahil Nikolov

On Sat, Sep 17, 2022 at 17:49, Diego Ercolani wrote:
Parameter cluster.choose-local is set to off. I confirm the filesystems of the bricks are all XFS as required. I started the farm only as a test bench for an oVirt implementation, so I used 3 hosts based on Ryzen 5 desktop hardware, each equipped with 4 DDR modules (4 x 32GB), 1 disk for the OS and the others used as data bricks or NFS targets, all SATA based, while the OS is installed on an internal M.2 disk. Node4, as it doesn't require much space (it is the arbiter), uses only the internal M.2 disk. Every host is equipped with a dual-port Intel X520 chipset with 2 SFP+ interfaces configured with a 9000-byte packet size. The access LAN is the management LAN (and also the LAN used by gluster); the VLANs are the "production" VLANs.

node2:
/dev/mapper/glustervg-glhe on /brickhe type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/mapper/glustervg-gv0 on /brickgv0 type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/mapper/glustervg-gv1 on /brickgv1 type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
Devices file mpath_uuid part4-mpath-Samsung_SSD_860_EVO_500GB_S4XBNF0N334942Y PVID 9iyl5761LWcy3AYy36fNcPk0fADjNYtC last seen on /dev/mapper/Samsung_SSD_860_EVO_500GB_S4XBNF0N334942Y4 not found.
PV /dev/mapper/Samsung_SSD_870_EVO_4TB_S6BCNG0R300064E   VG glustervg        lvm2 [<3.64 TiB / 1.54 TiB free]
PV /dev/nvme0n1p4                                        VG glustervg        lvm2 [<287.02 GiB / <287.02 GiB free]
PV /dev/nvme0n1p3                                        VG onn_ovirt-node2  lvm2 [177.15 GiB / <33.71 GiB free]
PV /dev/mapper/ST4000NM000A-2HZ100_WJG1ZC85              VG daticold         lvm2 [<3.64 TiB / 2.44 TiB free]

node3:
/dev/mapper/glustervg-glhe on /brickhe type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/mapper/glustervg-gv1 on /brickgv1 type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/mapper/glustervg-gv0 on /brickgv0 type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
[root@ovirt-node3 ~]# pvscan -v
PV /dev/sda         VG glustervg        lvm2 [<3.64 TiB / <1.64 TiB free]
PV /dev/nvme0n1p4   VG glustervg        lvm2 [<287.02 GiB / <187.02 GiB free]
PV /dev/nvme0n1p3   VG onn_ovirt-node3  lvm2 [177.15 GiB / <33.71 GiB free]
Total: 3 [4.09 TiB] / in use: 3 [4.09 TiB] / in no VG: 0 [0 ]

node4:
/dev/mapper/onn_ovirt--node4-gluster on /dati type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=64k,sunit=128,swidth=128,noquota)
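(To capture the before/during/after comparison asked for above, a simple sampling loop could be left running across the backup window; just a sketch, assuming the gv1 volume and a 5-minute interval, with sizes chosen arbitrarily. Note that each top read-perf/write-perf call itself performs a small dd on every brick:)

while true; do
    date
    gluster volume top gv1 read-perf bs 4096 count 1024 list-cnt 5
    gluster volume top gv1 write-perf bs 4096 count 1024 list-cnt 5
    sleep 300
done | tee /var/tmp/gv1-top-perf.log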
[ovirt-users] Re: VMs hang periodically: gluster problem?
Set it back to the original value. The option picks the local brick for reading instead of picking the fastest one (which could be either a remote or a local one), which could help with bandwidth issues.
Can you provide details about the bricks, like HW raid/JBOD, raid type (0,5,6,10), stripe size, stripe width, filesystem (I expect XFS but it's nice to know), etc.?
Also share the gluster client log from the node where the backup proxy is. Should be something like: /var/log/glusterfs/rhev-data-center-mnt-glusterSD-<server>:_gv1.log

Best Regards,
Strahil Nikolov

On Thu, Sep 15, 2022 at 17:01, Diego Ercolani wrote:
During this time (Hosted-Engine hung), this appears on the host where the Hosted-Engine is supposed to be running:

2022-09-15 13:59:27,762+0000 WARN (Thread-10) [virt.vm] (vmId='8486ed73-df34-4c58-bfdc-7025dec63b7f') Shutdown by QEMU Guest Agent failed (agent probably inactive) (vm:5490)
2022-09-15 13:59:27,762+0000 WARN (Thread-10) [virt.vm] (vmId='8486ed73-df34-4c58-bfdc-7025dec63b7f') Shutting down with guest agent FAILED (vmpowerdown:115)
2022-09-15 13:59:28,780+0000 ERROR (qgapoller/1) [virt.periodic.Operation] > operation failed (periodic:204)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 202, in __call__
    self._func()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 493, in _poller
    vm_id, self._qga_call_get_vcpus(vm_obj))
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 814, in _qga_call_get_vcpus
    if 'online' in vcpus:
TypeError: argument of type 'NoneType' is not iterable
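(A minimal way to gather the brick details requested above; only a sketch, assuming the brick mount points /brickgv0, /brickgv1 and /brickhe shown elsewhere in this thread. It only covers what is visible from the OS; hardware RAID details would come from the controller's own tool:)

lsblk -o NAME,TYPE,SIZE,ROTA,MODEL,MOUNTPOINT
lvs -a -o lv_name,vg_name,devices,stripes,stripesize
for m in /brickgv0 /brickgv1 /brickhe; do xfs_info $m; done
cat /sys/block/sda/queue/minimum_io_size /sys/block/sda/queue/optimal_io_size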
[ovirt-users] Re: VMs hang periodically: gluster problem?
The current setting is:

[root@ovirt-node2 ~]# gluster volume get glen cluster.choose-local | awk '/choose-local/ {print $2}'
off
[root@ovirt-node2 ~]# gluster volume get gv0 cluster.choose-local | awk '/choose-local/ {print $2}'
off
[root@ovirt-node2 ~]# gluster volume get gv1 cluster.choose-local | awk '/choose-local/ {print $2}'
off

It is also what is stated in the "virt" group:
/var/lib/glusterd/groups/virt:cluster.choose-local=off

I set cluster.choose-local to true on every gluster volume and started migrating the Hosted Engine around... a bunch of VMs froze and after a while the Hosted-Engine hung as well.
To complete the environment description, here is the complete option set for glen (the Hosted-Engine volume), gv0 and gv1 (the volumes used by VMs):

[root@ovirt-node3 ~]# gluster volume info gv1
Volume Name: gv1
Type: Replicate
Volume ID: 863221f4-e11c-4589-95e9-aa3948e177f5
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: ovirt-node2.ovirt:/brickgv1/gv1
Brick2: ovirt-node3.ovirt:/brickgv1/gv1
Brick3: ovirt-node4.ovirt:/dati/gv1 (arbiter)
Options Reconfigured:
storage.build-pgfid: off
cluster.granular-entry-heal: enable
storage.owner-gid: 36
storage.owner-uid: 36
cluster.lookup-optimize: off
server.keepalive-count: 5
server.keepalive-interval: 2
server.keepalive-time: 10
server.tcp-user-timeout: 20
network.ping-timeout: 30
performance.client-io-threads: on
server.event-threads: 4
client.event-threads: 4
cluster.choose-local: true
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
performance.strict-o-direct: on
network.remote-dio: off
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
transport.address-family: inet
nfs.disable: on
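(Since the freezes started right after flipping the option, a quick way to put all three volumes back to the virt-group default shown above; a minimal sketch, assuming the volume names glen, gv0 and gv1:)

for v in glen gv0 gv1; do
    gluster volume set $v cluster.choose-local off
    gluster volume get $v cluster.choose-local | grep choose-local
done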
[ovirt-users] Re: VMs hang periodically: gluster problem?
Can you test the backup after setting:
status=$(gluster volume get <VOLNAME> cluster.choose-local | awk '/choose-local/ {print $2}')
gluster volume set <VOLNAME> cluster.choose-local true
And after the test:
gluster volume set <VOLNAME> cluster.choose-local $status

Best Regards,
Strahil Nikolov

On Thu, Sep 15, 2022 at 12:26, Diego Ercolani wrote:
Sorry, I see that the editor strips all the leading spaces that indent the timestamps. I retried the test, hoping to find the same error, and I found it, on node3. I changed the code of the read routine:

cd /rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1; while sleep 0.1 ; do date +'Timestamp:%s.%N'; cat testfile ; done

Also I have to point out that in my gluster configuration node2 and node3 are replicating while node4 is the arbiter. I find this:

node2:
1663233449.088250919 1663233449.192357508 1663233449.296979848 1663233449.401279036 1663233449.504945285 1663233449.609107728 1663233449.713468581 1663233449.817435890 1663233449.922132348 1663233450.030449768 1663233450.134975317 1663233450.239171022 1663233450.342905278 1663233450.447466303 1663233450.551867180 1663233450.658387123 1663233450.762761972 1663233450.868063254 1663233450.973718716 1663233451.077074998 1663233451.181540916 1663233451.286831549 1663233451.393060700 1663233451.500488204 1663233451.606233103 1663233451.711308978 1663233451.816455012 1663233451.922142384 1663233452.028786138 1663233452.134080858 1663233452.239052098 1663233452.343540758 1663233452.449015706 1663233452.553832377 1663233452.658255495 1663233452.762774092 1663233452.866525770 1663233452.970784862 1663233453.075297458 1663233453.178379039 1663233453.281728609 1663233453.385722608 1663233453.489965321 1663233453.593885612 1663233453.698436388 1663233453.802415640 1663233453.906987275 1663233454.010658544 1663233454.114877122 1663233454.218459344 1663233454.322761948 1663233454.428025821 1663233454.533464752 1663233454.637652754 1663233454.741783087 1663233454.845600527 1663233454.950286885 1663233455.055143240 1663233455.161169524 1663233455.265582394 1663233455.369963173 1663233455.475453048 1663233455.580044209 1663233455.684503325 1663233455.788750947 1663233455.894135415 1663233455.998738750

node3:
Timestamp:1663233450.000172185 1663233449.296979848 Timestamp:1663233450.101871259 1663233449.296979848 Timestamp:1663233450.204006554 1663233449.296979848 Timestamp:1663233450.306014420 1663233449.296979848 Timestamp:1663233450.407890669 1663233450.342905278 Timestamp:1663233450.511435794 1663233450.342905278 Timestamp:1663233450.613144044 1663233450.342905278 Timestamp:1663233450.714936282 1663233450.342905278 Timestamp:1663233450.816689957 1663233450.342905278 Timestamp:1663233450.919563686 1663233450.342905278 Timestamp:1663233451.021558628 1663233450.342905278 Timestamp:1663233451.123617850 1663233450.342905278 Timestamp:1663233451.225769366 1663233450.342905278 Timestamp:1663233451.327726226 1663233450.342905278 Timestamp:1663233451.429934369 1663233451.393060700 Timestamp:1663233451.532945857 1663233451.393060700 Timestamp:1663233451.634935468 1663233451.393060700 Timestamp:1663233451.737058041 1663233451.393060700 Timestamp:1663233451.839167797 1663233451.393060700 Timestamp:1663233451.941486148 1663233451.393060700 Timestamp:1663233452.043288336 1663233451.393060700 Timestamp:1663233452.145090644 1663233451.393060700 Timestamp:1663233452.246825425 1663233451.393060700 Timestamp:1663233452.348501234 1663233451.393060700 Timestamp:1663233452.450351853 Timestamp:1663233452.553106458 Timestamp:1663233452.655222156 Timestamp:1663233452.757315704 Timestamp:1663233452.859298562 Timestamp:1663233452.961655817 Timestamp:1663233453.063383043 Timestamp:1663233453.165180993 Timestamp:1663233453.266883792 Timestamp:1663233453.368890215 Timestamp:1663233453.470586924 1663233453.385722608 Timestamp:1663233453.573171648 1663233453.385722608 Timestamp:1663233453.675160288 1663233453.385722608 Timestamp:1663233453.777281257 1663233453.385722608 Timestamp:1663233453.879306084 1663233453.385722608 Timestamp:1663233453.981588858 1663233453.385722608 Timestamp:1663233454.083371309 1663233453.385722608 Timestamp:1663233454.185268095 1663233453.385722608 Timestamp:1663233454.287256013 1663233453.385722608 Timestamp:1663233454.389068540 1663233453.385722608 Timestamp:1663233454.490809573 1663233454.428025821 Timestamp:1663233454.593597380 1663233454.428025821 Timestamp:1663233454.695329646 1663233454.428025821 Timestamp:1663233454.797029330 1663233454.428025821 Timestamp:1663233454.899000216 1663233454.428025821

node4:
Timestam:1663233450.043398632 1663233449.817435890 Timestam:1663233450.144889219 1663233449.817435890 Timestam:1663233450.246423969 1663233449.817435890 Timestam:1663233450.347730771 1663233449.817435890 Timestam:1663233450.449109919 1663233449.817435890 Timestam:1663233450.550659616 1663233449.817435890 Timestam:1663233450.652173237 1663233449.817435890 Timestam:1663233450.753610724 1663233449.817435890 Timestam:1663233450.855978621
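(For readers trying to quantify the staleness visible above: each reader entry is a "Timestamp:<reader clock>" line followed by the value last read from testfile, so the lag can be computed with a one-liner such as the following; just a sketch, assuming the raw reader output, one line per date/cat call, was saved to a file named node3.log:)

awk -F: '/^Timestamp/ {reader=$2; next} /^[0-9]/ {printf "lag %.3f s (reader %s saw writer %s)\n", reader-$1, reader, $1}' node3.log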
[ovirt-users] Re: VMs hang periodically: gluster problem?
Thank you for the analysis. The version is the latest distributed in the ovirt@centos8 distribution:

[root@ovirt-node2 ~]# rpm -qa | grep '\(glusterfs-server\|ovirt-node\)'
ovirt-node-ng-image-update-placeholder-4.5.2-1.el8.noarch
glusterfs-server-10.2-1.el8s.x86_64
ovirt-node-ng-nodectl-4.4.2-1.el8.noarch
python3-ovirt-node-ng-nodectl-4.4.2-1.el8.noarch
ovirt-node-ng-image-update-4.5.2-1.el8.noarch
[root@ovirt-node3 ~]# rpm -qa | grep '\(glusterfs-server\|ovirt-node\)'
ovirt-node-ng-image-update-placeholder-4.5.2-1.el8.noarch
glusterfs-server-10.2-1.el8s.x86_64
ovirt-node-ng-nodectl-4.4.2-1.el8.noarch
python3-ovirt-node-ng-nodectl-4.4.2-1.el8.noarch
ovirt-node-ng-image-update-4.5.2-1.el8.noarch
[root@ovirt-node4 ~]# rpm -qa | grep '\(glusterfs-server\|ovirt-node\)'
ovirt-node-ng-image-update-placeholder-4.5.2-1.el8.noarch
glusterfs-server-10.2-1.el8s.x86_64
ovirt-node-ng-nodectl-4.4.2-1.el8.noarch
python3-ovirt-node-ng-nodectl-4.4.2-1.el8.noarch
ovirt-node-ng-image-update-4.5.2-1.el8.noarch

During backup (or whenever there is I/O, even not too intensive judging by the SSD LED), the only thing I noticed is that sometimes there is a sort of lag: I issue "gluster volume heal glen|gv0|gv1 info" and the answer takes 4-5 seconds to come back... even though the answer reports 0 missing objects... The nodes are always connected, e.g.:

Brick ovirt-node2.ovirt:/brickhe/_glen
Status: Connected
Number of entries: 0

Brick ovirt-node3.ovirt:/brickhe/glen
Status: Connected
Number of entries: 0

Brick ovirt-node4.ovirt:/dati/_glen
Status: Connected
Number of entries: 0

About the "rate limit": I didn't work on QoS, but the destination is an NFS SATA RAID5 NAS published via a 1Gb link, so I think I have a 20MB/s "cap" by architecture; the gluster bricks are all built from SATA SSD drives, where I recorded a throughput of 200MB/s. I also tried to monitor performance via the iotop command but I didn't record a bandwidth problem, and I monitored the network via iftop, recording no bandwidth saturation and no errors.
Searching the gluster mailing list (https://lists.gluster.org/pipermail/gluster-users/2022-September/040063.html) I tried the same test, but writing and reading every 1/10 of a second:

[root@ovirt-node2 ~]# su - vdsm -s /bin/bash
Last login: Wed Sep 14 15:33:45 UTC 2022 on pts/1
nodectl must be run as root!
nodectl must be run as root!
cd /rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1; while sleep 0.1; do date +'%s.%N' | tee testfile ; done

[root@ovirt-node3 ~]# su - vdsm -s /bin/bash
nodectl must be run as root!
nodectl must be run as root!
[vdsm@ovirt-node3 ~]$ cd /rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1; while sleep 0.1 ; do date +' %s.%N'; cat testfile ; done

[root@ovirt-node4 ~]# su - vdsm -s /bin/bash
Last login: Wed Aug 24 16:52:55 UTC 2022
nodectl must be run as root!
nodectl must be run as root!
[vdsm@ovirt-node4 ~]$ cd /rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1; while sleep 0.1 ; do date +' %s.%N'; cat testfile ; done

The result is that the nodes reading from glusterfs record an update only about every 1 second, more or less. To report the test I selected timestamps for node2 (the write node) between 1663228352 and 1663228356, and for node3 and node4 between 1663228353 and 1663228356:

node2:
1663228352.589998302 1663228352.695887198 1663228352.801681699 1663228352.907548634 1663228353.011931276 1663228353.115904115 1663228353.222383590 1663228353.329941123 1663228353.436480791 1663228353.540536995 1663228353.644858473 1663228353.749470221 1663228353.853969491 1663228353.958703186 1663228354.062732971 1663228354.166616934 1663228354.270398507 1663228354.373989214 1663228354.477149100 1663228354.581862187 1663228354.686177524 1663228354.790362507 1663228354.894673446 1663228354.999136257 1663228355.102889616 1663228355.207043913 1663228355.312522545 1663228355.416667384 1663228355.520897473 1663228355.624582255 1663228355.728590069 1663228355.832979634 1663228355.937309737 1663228356.042289521 1663228356.146565174 1663228356.250773672 1663228356.356361818 1663228356.460048755 1663228356.565054968 1663228356.669126850 1663228356.773807899 1663228356.878011739 1663228356.983842597

node3:
1663228353.027991911 1663228352.064562785 1663228353.129696675 1663228353.115904115 1663228353.232351572 1663228353.115904115 1663228353.334188748 1663228353.115904115 1663228353.436208688 1663228353.115904115 1663228353.538268493 1663228353.115904115 1663228353.641266519 1663228353.115904115 1663228353.743094997 1663228353.115904115 1663228353.845244131 1663228353.115904115 1663228353.947049766 1663228353.115904115 1663228354.048876741 1663228353.115904115 1663228354.150979017 1663228354.062732971 1663228354.254198339 1663228354.062732971 1663228354.356197640 1663228354.270398507 1663228354.459541685 1663228354.270398507 1663228354.561548541 1663228354.270398507 1663228354.664280563 1663228354.270398507 1663228354.766557007 1663228354.270398507 1663228354.8
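(The 4-5 second lag of the heal-info query itself can be made visible with a trivial timing loop; a minimal sketch, assuming the three volume names used above:)

for v in glen gv0 gv1; do
    echo "== $v"
    time gluster volume heal $v info > /dev/null
done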
[ovirt-users] Re: VMs hang periodically: gluster problem?
I see some entries that are not good:

[2022-09-11 03:50:26.131393 +0000] W [MSGID: 108001] [afr-transaction.c:1016:afr_handle_quorum] 0-gv1-replicate-0: 228740f8-1d14-4253-b95b-47e5feb6a3cc: Failing WRITE as quorum is not met [Invalid argument]

When the backup runs, what is the output of 'gluster pool list'?
Which gluster version are you using: rpm -qa | grep gluster
Is it possible to rate limit the backup?

Best Regards,
Strahil Nikolov

On Mon, Sep 12, 2022 at 19:22, Diego Ercolani wrote:
Hello. I did a full backup using Veeam but I recorded many errors in the gluster log. This is the log: https://cloud.ssis.sm/index.php/s/KRimf5MLXK3Ds3d. The log is from the same node where the veeam proxy and the backed-up VMs reside. Both are running in the gv1 storage domain. Note that hours are reported in GMT; the backup started at 15:55:48 GMT and ended at 16:04:35. The log is full of errors that I cannot understand:

...
[2022-09-12 15:56:24.983887 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2620:client4_0_lookup_cbk] 0-gv1-client-2: remote operation failed. [{path=}, {gfid=2ca980cf-3cda-4f56-84bd-c089cdeb5878}, {errno=2}, {error=No such file or directory}]
...
[2022-09-12 15:56:42.570987 +0000] W [fuse-bridge.c:2979:fuse_readv_cbk] 0-glusterfs-fuse: 891795: READ => -1 gfid=2906fff6-2292-451c-879c-547e189d0bb3 fd=0x55b916d199b8 (Invalid argument)
...

I tried to resolve the gfids, e.g.:

[root@ovirt-node3 ~]# mount -t glusterfs -o aux-gfid-mount,ro ovirt-node2.ovirt:/gv1 /mnt/
[root@ovirt-node3 ~]# gluster volume set gv1 build-pgfid on
volume set: success
[root@ovirt-node3 ~]# getfattr -n glusterfs.ancestry.path -e text /mnt/.gfid/2ca980cf-3cda-4f56-84bd-c089cdeb5878
getfattr: /mnt/.gfid/2ca980cf-3cda-4f56-84bd-c089cdeb5878: No such file or directory
[root@ovirt-node3 ~]# getfattr -n glusterfs.ancestry.path -e text /mnt/.gfid/2906fff6-2292-451c-879c-547e189d0bb3
getfattr: /mnt/.gfid/2906fff6-2292-451c-879c-547e189d0bb3: No such file or directory
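(When the aux-gfid-mount lookup fails as above, another way to check whether a gfid still exists is to look it up directly on a brick, where gluster keeps a hard link under .glusterfs/<first two hex chars>/<next two chars>/<gfid>; a sketch, assuming the ovirt-node2 brick path used in this thread. On a sharded volume the gfid may also belong to a shard under the brick's .shard directory:)

BRICK=/brickgv1/gv1
GFID=2ca980cf-3cda-4f56-84bd-c089cdeb5878
ls -l $BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID
# if it is a regular file, find the human-readable path sharing the same inode
find $BRICK -samefile $BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID -not -path "*/.glusterfs/*" 2>/dev/null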
[ovirt-users] Re: VMs hang periodically: gluster problem?
Yes, it seems so, but I cannot record any "error" on the interfaces: I have 0 TX errors and 0 RX errors. All three nodes are connected through a single switch. I set the MTU to 9000 to help gluster transfers, but I cannot record any error. In /var/log/vdsm/vdsm.log, on all the nodes, I periodically log entries like the one below, and I cannot understand what is happening. I have unsaturated 10Gb/s links between the hosts:

2022-09-11 12:30:31,708+0000 ERROR (periodic/134) [ovirt_hosted_engine_ha.client.client.HAClient] Malformed metadata for host 2: received 0 of 512 expected bytes (client:137)
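(Jumbo frames are a classic source of silent stalls when one port or the switch does not really pass 9000-byte frames; a cheap end-to-end check, sketched here with the node names used in this thread. 8972 is 9000 minus the 28 bytes of IP+ICMP headers:)

ip -o link | awk '{print $2, $4, $5}'        # confirm every relevant interface really has mtu 9000
ping -M do -s 8972 -c 3 ovirt-node3.ovirt    # -M do forbids fragmentation; failures mean the jumbo path is broken
ping -M do -s 8972 -c 3 ovirt-node4.ovirt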
[ovirt-users] Re: VMs hang periodically: gluster problem?
I suspect you have network issues. Check the gluster log for the client side: /var/log/glusterfs/rhev-data-center-mnt-glusterSD-<server>:_<volume>.log

Best Regards,
Strahil Nikolov

On Tue, Sep 6, 2022 at 17:19, Diego Ercolani wrote:
I really don't understand. I was monitoring the vdsm.log of one node (node2) and I saw these complaints:

2022-09-06 14:08:27,105+0000 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1/45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/metadata (monitor:511)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 509, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 398, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1/45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/metadata', 1, 'Read timeout')
2022-09-06 14:08:27,105+0000 INFO (check/loop) [storage.monitor] Domain 45b4f14c-8323-482f-90ab-99d8fd610018 became INVALID (monitor:482)
2022-09-06 14:08:27,149+0000 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/dom_md/metadata (monitor:511)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 509, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 398, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/dom_md/metadata', 1, 'Read timeout')
2022-09-06 14:08:27,814+0000 INFO (jsonrpc/5) [api.virt] START getStats() from=::1,54242, vmId=8486ed73-df34-4c58-bfdc-7025dec63b7f (api:48)
2022-09-06 14:08:27,814+0000 INFO (jsonrpc/5) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': '8486ed73-df34-4c58-bfdc-7025dec63b7f'} (api:129)
2022-09-06 14:08:27,814+0000 INFO (jsonrpc/5) [api.virt] FINISH getStats return={'status': {'code': 1, 'message': "Virtual machine does not exist: {'vmId': '8486ed73-df34-4c58-bfdc-7025dec63b7f'}"}} from=::1,54242, vmId=8486ed73-df34-4c58-bfdc-7025dec63b7f (api:54)
2022-09-06 14:08:27,814+0000 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call VM.getStats failed (error 1) in 0.00 seconds (__init__:312)
2022-09-06 14:08:31,357+0000 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/localhost:_glen/3577c21e-f757-4405-97d1-0f827c9b4e22/dom_md/metadata (monitor:511)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 509, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 398, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/glusterSD/localhost:_glen/3577c21e-f757-4405-97d1-0f827c9b4e22/dom_md/metadata', 1, 'Read timeout')
2022-09-06 14:08:32,918+0000 INFO (periodic/5) [Executor] Worker was discarded (executor:305)

But, on the same node, from the command line, I can issue a simple cat without any problem:

[root@ovirt-node2 ~]# cat "/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/dom_md/metadata"
ALIGNMENT=1048576
BLOCK_SIZE=512
CLASS=Data
DESCRIPTION=gv0
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
POOL_UUID=da146814-f823-40e0-bd7b-8478dcfa38cd
REMOTE_PATH=localhost:/gv0
ROLE=Regular
SDUUID=60b7f172-08ed-4a22-8414-31fd5b100d72
TYPE=GLUSTERFS
VERSION=5
_SHA_CKSUM=a63324fa9b3030c3ffa35891c2d6c4e129c76af9

and

[root@ovirt-node2 ~]# cat '/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/dom_md/metadata'
ALIGNMENT=1048576
BLOCK_SIZE=512
CLASS=Data
DESCRIPTION=gv0
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
POOL_UUID=da146814-f823-40e0-bd7b-8478dcfa38cd
REMOTE_PATH=localhost:/gv0
ROLE=Regular
SDUUID=60b7f172-08ed-4a22-8414-31fd5b100d72
TYPE=GLUSTERFS
VERSION=5
_SHA_CKSUM=a63324fa9b3030c3ffa35891c2d6c4e129c76af9

After a while I retried the same cat and the host console hung. So, sometimes, gluster revokes access to the file?! Why? I think this "hang" is the source of all my problems.
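(One likely reason the two disagree: a plain cat can be satisfied from the FUSE client's page cache, while vdsm's storage monitor reads the metadata file with direct I/O and a short timeout. The check can be approximated from the shell roughly like this; only a sketch, reusing the gv0 metadata path from the log above:)

time timeout 10 dd if=/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct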
[ovirt-users] Re: VMs hang periodically: gluster problem?
I don't have disk problems, as I enabled smartd and I perform a periodic test (smartctl -t long), but in sanlock I have some problems, and also the gluster heal (glheal) logs are not clean. The last event I recorded is today at 00:28 (2022-09-04 22:28 GMT); this is the time when node3 sent the mail "ovirt-hosted-engine state transition EngineMaybeAway-EngineDown":

Received: from ovirt-node3.ovirt Mon, 5 Sep 2022 00:28:45 +0200 (CEST)

Here is a bunch of logs from the three nodes. The gluster volume of the hosted-engine is "glen": https://cloud.ssis.sm/index.php/s/dSRyT3MM6ESnrb9
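(Sanlock keeps its own record of how long lease renewals on the storage domains take, which is a storage-latency indicator independent of gluster; a quick look, sketched here assuming the standard sanlock log location:)

grep -E 'delta_renew|renewal error|check_our_lease' /var/log/sanlock.log | tail -n 50
sanlock client status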
[ovirt-users] Re: VMs hang periodically: gluster problem?
Any sanlock errors to indicate storage problems? Have you checked the Gluster logs for errors or indications of network disruption?

Best Regards,
Strahil Nikolov

On Thu, Sep 1, 2022 at 12:18, Diego Ercolani wrote:
Hello, I have a cluster made of 3 nodes in a "self-hosted-engine" topology. I implemented the storage with gluster in a 2-replica + arbiter topology. I have two gluster volumes:
glen - the volume used by the hosted-engine VM
gv0 - the volume used by VMs
The physical disks are 4TB SSDs used only to accommodate VMs (including the hosted-engine). I have continuous VM hangs, even of the hosted-engine; this causes a lot of trouble, and it happens asynchronously, even while there are management operations on VMs (migration, cloning...). After a while the VM is freed, but on the VM console the kernel complains about CPU hangs or timer hangs, and the only solution is to shutdown/poweroff the VM... even the hosted engine: in fact it happens that 'hosted-engine --vm-status' gives "state=EngineUpBadHealth".
This is the log on the host while the event happens:

Sep 01 07:48:58 ovirt-node3.ovirt NetworkManager[1923]: [1662018538.0166] device (vnet73): state change: activated -> unmanaged (reason 'unmanaged', sys-iface-state: 'removed')
Sep 01 07:48:58 ovirt-node3.ovirt NetworkManager[1923]: [1662018538.0168] device (vnet73): released from master device ovirtmgmt
Sep 01 07:48:58 ovirt-node3.ovirt libvirtd[2496]: Unable to read from monitor: Connection reset by peer
Sep 01 07:48:58 ovirt-node3.ovirt libvirtd[2496]: internal error: qemu unexpectedly closed the monitor: 2022-09-01T07:48:57.930955Z qemu-kvm: -device virtio-blk-pci,iothread=iothread1,bus=pci.6,addr=0x0,drive=libvirt-1-format,id=ua-0a1a501c-fc45-430f-bfd3-076172cec406,bootindex=1,write-cache=on,serial=0a1a501c-fc45-430f-bfd3-076172cec406,werror=stop,rerror=stop: Failed to get "write" lock Is another process using the image [/run/vdsm/storage/3577c21e-f757-4405-97d1-0f827c9b4e22/0a1a501c-fc45-430f-bfd3-076172cec406/f65dab86-67f1-46fa-87c0-f9076f479741]?
Sep 01 07:48:58 ovirt-node3.ovirt kvm[268578]: 5 guests now active
Sep 01 07:48:58 ovirt-node3.ovirt systemd[1]: machine-qemu\x2d67\x2dHostedEngine.scope: Succeeded.
Sep 01 07:48:58 ovirt-node3.ovirt systemd-machined[1613]: Machine qemu-67-HostedEngine terminated.
Sep 01 07:49:08 ovirt-node3.ovirt systemd[1]: NetworkManager-dispatcher.service: Succeeded.
Sep 01 07:49:08 ovirt-node3.ovirt ovirt-ha-agent[3338]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Engine VM stopped on localhost
Sep 01 07:49:14 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted Engine HA info: timed out
Sep 01 07:49:18 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted Engine HA info: timed out
Sep 01 07:49:18 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted Engine HA info: timed out
Sep 01 07:49:28 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted Engine HA info: timed out
Sep 01 07:49:28 ovirt-node3.ovirt sanlock[1633]: 2022-09-01 07:49:28 1706161 [5083]: s4 delta_renew long write time 11 sec
Sep 01 07:49:28 ovirt-node3.ovirt sanlock[1633]: 2022-09-01 07:49:28 1706161 [5033]: s3 delta_renew long write time 11 sec
Sep 01 07:49:34 ovirt-node3.ovirt libvirtd[2496]: Domain id=65 name='ocr-Brain-28-ovirt.dmz.ssis' uuid=00425fb1-c24b-4eaa-9683-534d66b2cb04 is tainted: custom-ga-command
Sep 01 07:49:47 ovirt-node3.ovirt sudo[268984]: root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/privsep-helper --privsep_context os_brick.privileged.default --privsep_sock_path /tmp/tmp1iolt06i/privsep.sock

This is the indication I have on gluster:

[root@ovirt-node3 ~]# gluster volume heal gv0 info
Brick ovirt-node2.ovirt:/brickgv0/_gv0
Status: Connected
Number of entries: 0

Brick ovirt-node3.ovirt:/brickgv0/gv0_1
Status: Connected
Number of entries: 0

Brick ovirt-node4.ovirt:/dati/_gv0
Status: Connected
Number of entries: 0

[root@ovirt-node3 ~]# gluster volume heal glen info
Brick ovirt-node2.ovirt:/brickhe/_glen
Status: Connected
Number of entries: 0

Brick ovirt-node3.ovirt:/brickhe/glen
Status: Connected
Number of entries: 0

Brick ovirt-node4.ovirt:/dati/_glen
Status: Connected
Number of entries: 0

So it seems healthy. I don't know how to address the issue, but this is a big problem.
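(The "Failed to get 'write' lock" message above means some qemu process still had the image open when the engine VM was started here; a quick way to see on which host that qemu lives, sketched assuming the node names from this thread and passwordless root ssh between them. The UUID is the disk id taken from the log line above:)

DISK=0a1a501c-fc45-430f-bfd3-076172cec406
for h in ovirt-node2.ovirt ovirt-node3.ovirt ovirt-node4.ovirt; do
    echo "== $h"
    ssh root@$h "ps -eo pid,etime,cmd | grep -F $DISK | grep -v grep"
done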
[ovirt-users] Re: VMs hang periodically: gluster problem?
Versions are the latest:
ovirt-host-4.5.0-3.el8.x86_64
ovirt-engine-4.5.2.4-1.el8.noarch
glusterfs-server-10.2-1.el8s.x86_64