[ovirt-users] Re: Gluster volume engine stuck in healing with 1 unsynched entry & HostedEngine paused
The output of the getfattr command on the nodes was the following:

Node1:
[root@ov-no1 ~]# getfattr -d -m . -e hex /gluster_bricks/engine/engine/80f6e393-9718-4738-a14a-64cf43c3d8c2/images/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7
getfattr: Removing leading '/' from absolute path names
# file: gluster_bricks/engine/engine/80f6e393-9718-4738-a14a-64cf43c3d8c2/images/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x0394
trusted.afr.engine-client-2=0x
trusted.gfid=0x3fafabf3d0cd4b9a8dd743145451f7cf
trusted.gfid2path.06f4f1065c7ed193=0x36313936323032302d386431342d343261372d613565332d3233346365656635343035632f61343835353566342d626532332d343436372d386135342d343030616537626166396437
trusted.glusterfs.mdata=0x015fec62872f5849585fec62872f5849585d791c1a00ba286e
trusted.glusterfs.shard.block-size=0x0400
trusted.glusterfs.shard.file-size=0x00190092040b

Node2:
[root@ov-no2 ~]# getfattr -d -m . -e hex /gluster_bricks/engine/engine/80f6e393-9718-4738-a14a-64cf43c3d8c2/images/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7
getfattr: Removing leading '/' from absolute path names
# file: gluster_bricks/engine/engine/80f6e393-9718-4738-a14a-64cf43c3d8c2/images/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x
trusted.afr.engine-client-0=0x043a
trusted.afr.engine-client-2=0x
trusted.gfid=0x3fafabf3d0cd4b9a8dd743145451f7cf
trusted.gfid2path.06f4f1065c7ed193=0x36313936323032302d386431342d343261372d613565332d3233346365656635343035632f61343835353566342d626532332d343436372d386135342d343030616537626166396437
trusted.glusterfs.mdata=0x015fec62872f5849585fec62872f5849585d791c1a00ba286e
trusted.glusterfs.shard.block-size=0x0400
trusted.glusterfs.shard.file-size=0x00190092040b

Node3:
[root@ov-no3 ~]# getfattr -d -m . -e hex /gluster_bricks/engine/engine/80f6e393-9718-4738-a14a-64cf43c3d8c2/images/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7
getfattr: Removing leading '/' from absolute path names
# file: gluster_bricks/engine/engine/80f6e393-9718-4738-a14a-64cf43c3d8c2/images/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x
trusted.afr.engine-client-0=0x0444
trusted.gfid=0x3fafabf3d0cd4b9a8dd743145451f7cf
trusted.gfid2path.06f4f1065c7ed193=0x36313936323032302d386431342d343261372d613565332d3233346365656635343035632f61343835353566342d626532332d343436372d386135342d343030616537626166396437
trusted.glusterfs.mdata=0x015fec62872f5849585fec62872f5849585d791c1a00ba286e
trusted.glusterfs.shard.block-size=0x0400
trusted.glusterfs.shard.file-size=0x00190092040b

___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/PUVBESAIZEJ7URDMDQ7LDUPNS6YDBVAS/
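For reference when reading outputs like the above: a full trusted.afr.* value is 12 bytes, i.e. three big-endian 32-bit counters for pending data, metadata and entry operations (several of the values quoted above appear truncated by the archive, e.g. `0x` and `0x0394`). A minimal bash sketch of the decoding, using a hypothetical full-length value:

```shell
# Decode a GlusterFS trusted.afr.* xattr into its three big-endian 32-bit
# counters: pending data / metadata / entry operations.
# The example value below is hypothetical; the values in the mail above
# are truncated by the archive.
decode_afr() {
  local hex="${1#0x}"   # strip the 0x prefix
  printf 'data=%d metadata=%d entry=%d\n' \
    "$((16#${hex:0:8}))" "$((16#${hex:8:8}))" "$((16#${hex:16:8}))"
}

decode_afr 0x000000940000000000000000   # 0x94 = 148 pending data ops
```

A non-zero data counter in trusted.afr.engine-client-0 on bricks 2 and 3, as hinted at above, would mean both blame brick 1 (client-0) for unsynced data.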
[ovirt-users] Re: Gluster volume engine stuck in healing with 1 unsynched entry & HostedEngine paused
Thank you for your reply. I'm trying that right now and I see it triggered the self-healing process. I will come back with an update. Best regards.
[ovirt-users] Re: Gluster volume engine stuck in healing with 1 unsynched entry & HostedEngine paused
Thank you. I have tried that, but it didn't work, as the system sees that the file is not in split-brain. I have also tried a force heal and a full heal, and still nothing. I always end up with the entry stuck in the unsynched state.
[ovirt-users] Re: Gluster volume engine stuck in healing with 1 unsynched entry & HostedEngine paused
Hello again, I've tried to heal the brick with latest-mtime, but I get the following:

gluster volume heal engine split-brain latest-mtime /80f6e393-9718-4738-a14a-64cf43c3d8c2/images/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7
Healing /80f6e393-9718-4738-a14a-64cf43c3d8c2/images/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7 failed: File not in split-brain.
Volume heal failed.

Should I try the solution described in this thread, where the conflicting entry is removed manually, triggering the heal operations? https://lists.ovirt.org/archives/list/users@ovirt.org/thread/RPYIMSQCBYVQ654HYGBN5NCPRVCGRRYB/#H6EBSPL5XRLBUVZBE7DGSY25YFPIR2KY
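The manual-removal approach in the linked thread boils down to deleting the stale copy from the bad brick together with its hard link under the brick's .glusterfs directory, then letting self-heal recreate it. A sketch of deriving that hard-link path only (verify which copy is stale before deleting anything; the gfid here is the one from the getfattr output earlier in this thread):

```shell
# A brick file's .glusterfs hard link lives at
# <brick>/.glusterfs/<first 2 gfid chars>/<next 2 chars>/<full gfid>.
# Sketch only: this just prints the path, it does not delete anything.
gfid_link() {
  local brick="$1" gfid="$2"
  printf '%s/.glusterfs/%s/%s/%s\n' \
    "$brick" "${gfid:0:2}" "${gfid:2:2}" "$gfid"
}

gfid_link /gluster_bricks/engine/engine 3fafabf3-d0cd-4b9a-8dd7-43145451f7cf
```

On the brick holding the stale copy one would then remove both the data file and this hard link, and re-run `gluster volume heal engine`.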
[ovirt-users] Re: Gluster volume engine stuck in healing with 1 unsynched entry & HostedEngine paused
I tried only the simple healing because I wasn't sure if I'd mess up the gluster more than it already is. I will try latest-mtime in a couple of hours, because the system is a production system and I have to do it after office hours. I will come back with an update. Thank you very much for your help!
[ovirt-users] Re: Gluster volume engine stuck in healing with 1 unsynched entry & HostedEngine paused
Hello, Thank you very much for your reply. I get the following from the below gluster commands:

[root@ov-no1 ~]# gluster volume heal engine info split-brain
Brick ov-no1.ariadne-t.local:/gluster_bricks/engine/engine
Status: Connected
Number of entries in split-brain: 0

Brick ov-no2.ariadne-t.local:/gluster_bricks/engine/engine
Status: Connected
Number of entries in split-brain: 0

Brick ov-no3.ariadne-t.local:/gluster_bricks/engine/engine
Status: Connected
Number of entries in split-brain: 0

[root@ov-no1 ~]# gluster volume heal engine info summary
Brick ov-no1.ariadne-t.local:/gluster_bricks/engine/engine
Status: Connected
Total Number of entries: 1
Number of entries in heal pending: 1
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick ov-no2.ariadne-t.local:/gluster_bricks/engine/engine
Status: Connected
Total Number of entries: 1
Number of entries in heal pending: 1
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick ov-no3.ariadne-t.local:/gluster_bricks/engine/engine
Status: Connected
Total Number of entries: 1
Number of entries in heal pending: 1
Number of entries in split-brain: 0
Number of entries possibly healing: 0

[root@ov-no1 ~]# gluster volume info

Volume Name: data
Type: Replicate
Volume ID: 6c7bb2e4-ed35-4826-81f6-34fcd2d0a984
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: ov-no1.ariadne-t.local:/gluster_bricks/data/data
Brick2: ov-no2.ariadne-t.local:/gluster_bricks/data/data
Brick3: ov-no3.ariadne-t.local:/gluster_bricks/data/data (arbiter)
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
performance.strict-o-direct: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 1
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
network.ping-timeout: 30
storage.owner-uid: 36
storage.owner-gid: 36
cluster.granular-entry-heal: enable

Volume Name: engine
Type: Replicate
Volume ID: 7173c827-309f-4e84-a0da-6b2b8eb50264
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: ov-no1.ariadne-t.local:/gluster_bricks/engine/engine
Brick2: ov-no2.ariadne-t.local:/gluster_bricks/engine/engine
Brick3: ov-no3.ariadne-t.local:/gluster_bricks/engine/engine
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
performance.strict-o-direct: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm:
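To watch whether the heal-pending count ever drops, output like the summary above can be reduced to one line per brick with awk. A small sketch, with the command output simulated by a here-document (field labels as printed in the summary above):

```shell
# Reduce `gluster volume heal <vol> info summary` output to
# "<brick> <pending-count>" lines; input simulated here for illustration.
awk -F': ' '
  /^Brick /                           { brick = $0; sub(/^Brick /, "", brick) }
  /Number of entries in heal pending/ { print brick, $2 }
' <<'EOF'
Brick ov-no1.ariadne-t.local:/gluster_bricks/engine/engine
Status: Connected
Total Number of entries: 1
Number of entries in heal pending: 1
Number of entries in split-brain: 0
EOF
```

Running the real command through the same awk filter in a watch loop gives a quick view of whether healing is progressing or stuck.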
[ovirt-users] Re: Gluster volume engine stuck in healing with 1 unsynched entry & HostedEngine paused
Hello again, I am back with a brief description of the situation I am in, and questions about the recovery.

oVirt environment: 4.3.5.2 Hyperconverged
GlusterFS: Replica 2 + Arbiter 1
GlusterFS volumes: data, engine, vmstore

The current situation is the following:
- The Cluster is in Global Maintenance.
- The volume engine is up with the comment (in the Web GUI): Up, unsynched entries, needs healing.
- The VM HostedEngine is paused due to a storage I/O error (Web GUI), while the output of the virsh list --all command shows that the HostedEngine is running.

I tried to issue the gluster heal command (gluster volume heal engine) but nothing changed. I have the following questions:
1. Should I restart the glusterd service? Where from? Is it enough if glusterd is restarted on one host, or should it be restarted on the other two as well?
2. Should the node that was NonResponsive and came back be rebooted or not? It seems alright now and in good health.
3. Should the HostedEngine be restored with engine-backup, or is that not necessary?
4. Could the loss of the DNS server for the oVirt hosts lead to an unresponsive host? The nsswitch file on the oVirt hosts and engine has DNS defined as: hosts: files dns myhostname
5. How can we recover/rectify the situation above?

Thanks for your help, Maria Souvalioti
[ovirt-users] Gluster volume engine stuck in healing with 1 unsynched entry & HostedEngine paused
Hello everyone, Any help would be greatly appreciated with the following problem.

In my lab, the day before yesterday, we had power issues: a UPS went off-line, followed by a power outage of the NFS/DNS server I have set up to serve oVirt with ISOs and act as a DNS server (our other DNS servers are located as VMs within the oVirt environment). We found a broadcast storm on the switch the oVirt nodes are connected to (due to a faulty NIC on the aforementioned UPS), and later on had to re-establish several of the virtual connections as well. The above led to one of the hosts becoming NonResponsive, two machines becoming unresponsive, and three VMs shutting down.

The oVirt environment, version 4.3.5.2, is a replica 2 + arbiter 1 environment and runs GlusterFS with the recommended volumes of data, engine and vmstore. So far, whenever there was some kind of problem, oVirt was usually able to solve it on its own. This time, however, after we recovered from the above state, with the data and vmstore volumes healing successfully, the engine volume became stuck in the healing process (Up, unsynched entries, needs healing), and from the web GUI I see that the VM HostedEngine is paused due to a storage I/O error while the output of the virsh list --all command shows that the HostedEngine is running. How is that happening?
I tried to manually trigger the healing process for the volume with gluster volume heal engine, but nothing happened. The command gluster volume heal engine info shows the following:

[root@ov-no3 ~]# gluster volume heal engine info
Brick ov-no1.ariadne-t.local:/gluster_bricks/engine/engine
Status: Connected
Number of entries: 0

Brick ov-no2.ariadne-t.local:/gluster_bricks/engine/engine
/80f6e393-9718-4738-a14a-64cf43c3d8c2/images/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7
Status: Connected
Number of entries: 1

Brick ov-no3.ariadne-t.local:/gluster_bricks/engine/engine
/80f6e393-9718-4738-a14a-64cf43c3d8c2/images/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7
Status: Connected
Number of entries: 1

This morning I came upon this Reddit post https://www.reddit.com/r/gluster/comments/fl3yb7/entries_stuck_in_heal_pending/ where it seems that after a graceful reboot of one of the oVirt hosts, the gluster came back online once it completed the appropriate healing processes. The thing is, from what I have read, when there are unsynched entries in the gluster, a host cannot be put into maintenance mode so that it can be rebooted, correct? Should I try to restart the glusterd service? Could someone tell me what I should do?

Thank you all for your time and help, Maria Souvalioti
[ovirt-users] Re: QEMU error qemuDomainAgentAvailable in /var/log/messages
Thank you very much for your answer. The service is up and running on the engine, but it is only loaded on the nodes. Should I start it (and enable it as well?) on the nodes too? There is one VM that I have not installed the guest agent on, and it is running on the arbiter host. Also, about the snapshots and backup: do you mean the built-in oVirt capabilities, or an external backup/snapshot program as well? Thank you again, Maria
[ovirt-users] QEMU error qemuDomainAgentAvailable in /var/log/messages
Hello everyone and a happy new year! I have a question which might be silly but I am stumped. I keep getting the following error in my /var/log/messages:

Jan 5 12:20:30 ovno3 libvirtd: 2021-01-05 10:20:30.481+: 5283: error : qemuDomainAgentAvailable:9133 : Guest agent is not responding: QEMU guest agent is not connected

This entry appears on the arbiter node only, and it recurs every 5 minutes. I have GlusterFS on the oVirt environment (production environment) and it's serving several vital services. The VMs are running OK and I haven't noticed any discrepancy. Almost a week ago there was a disconnection on the gluster storage, but since then everything works as expected. Does anyone know what this error is, and whether there is a guide or something to fix it? I have no idea what to search for and where. Thank you all very much for your time!
[ovirt-users] Re: Best Practice? Affinity Rules Enforcement Manager or High Availability?
Thank you very much for your reply. I will check this out immediately. Best regards and merry holidays, Maria Souvalioti
[ovirt-users] Best Practice? Affinity Rules Enforcement Manager or High Availability?
Hello everyone, Not sure if I should ask this here, as it seems to be a pretty obvious question, but here it is. What is the best solution for making your VMs able to automatically boot up on another working host when something goes wrong (gluster problem, non-responsive host, etc.)? Would you enable the Affinity Rules Enforcement Manager and enforce some policies, or would you set the VMs you want as Highly Available? Thank you very much for your time! Best regards, Maria Souvalioti
[ovirt-users] Re: VM HostedEngine is down with error
Hello, This is what I could gather from the gluster logs around the time frame of the HE shutdown.

NODE1:
[root@ov-no1 glusterfs]# more bricks/gluster_bricks-vmstore-vmstore.log-20200830 |egrep "( W | E )"|more
[2020-08-27 15:35:03.090477] W [glusterfsd.c:1570:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7fa6e04a3dd5] -->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xe5) [0x55a40138d1b5] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x55a40138d01b] ) 0-: received signum (15), shutting down
[2020-08-27 15:35:14.926794] E [MSGID: 100018] [glusterfsd.c:2333:glusterfs_pidfile_update] 0-glusterfsd: pidfile /var/run/gluster/vols/vmstore/ov-no1.ariadne-t.local-gluster_bricks-vmstore-vmstore.pid lock failed [Resource temporarily unavailable]

[root@ov-no1 glusterfs]# more bricks/gluster_bricks-data-data.log-20200830 |egrep "( W | E )"|more
[2020-08-27 15:35:01.087875] W [glusterfsd.c:1570:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7fc3cbf69dd5] -->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xe5) [0x555e313711b5] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x555e3137101b] ) 0-: received signum (15), shutting down
[2020-08-27 15:35:14.890471] E [MSGID: 100018] [glusterfsd.c:2333:glusterfs_pidfile_update] 0-glusterfsd: pidfile /var/run/gluster/vols/data/ov-no1.ariadne-t.local-gluster_bricks-data-data.pid lock failed [Resource temporarily unavailable]

[root@ov-no1 glusterfs]# more bricks/gluster_bricks-engine-engine.log-20200830 |egrep "( W | E )"|more
[2020-08-27 15:35:02.088732] W [glusterfsd.c:1570:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7f70b99cbdd5] -->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xe5) [0x55ebd132b1b5] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x55ebd132b01b] ) 0-: received signum (15), shutting down
[2020-08-27 15:35:14.907603] E [MSGID: 100018] [glusterfsd.c:2333:glusterfs_pidfile_update] 0-glusterfsd: pidfile /var/run/gluster/vols/engine/ov-no1.ariadne-t.local-gluster_bricks-engine-engine.pid lock failed [Resource temporarily unavailable]

[root@ov-no1 glusterfs]# more bricks/gluster_bricks-vmstore-vmstore.log |egrep "( W | E )"|more
[nothing in the output]
[root@ov-no1 glusterfs]# more bricks/gluster_bricks-data-data.log |egrep "( W | E )"|more
[nothing in the output]
[root@ov-no1 glusterfs]# more bricks/gluster_bricks-engine-engine.log |egrep "( W | E )"|more
[nothing in the output]

[root@ov-no1 glusterfs]# more cmd_history.log | egrep "(WARN|error|fail)" |more
[2020-09-01 02:00:38.685251] : volume geo-replication status : FAILED : Commit failed on ov-no2.ariadne-t.local. Please check log file for details. Commit failed on ov-no3.ariadne-t.local. Please check log file for details.
[2020-09-01 03:02:39.094984] : volume geo-replication status : FAILED : Commit failed on ov-no2.ariadne-t.local. Please check log file for details. Commit failed on ov-no3.ariadne-t.local. Please check log file for details.
[2020-09-01 11:18:32.510224] : volume geo-replication status : FAILED : Commit failed on ov-no2.ariadne-t.local. Please check log file for details. Commit failed on ov-no3.ariadne-t.local. Please check log file for details.
[2020-09-01 14:24:33.778942] : volume geo-replication status : FAILED : Commit failed on ov-no2.ariadne-t.local. Please check log file for details. Commit failed on ov-no3.ariadne-t.local. Please check log file for details.

[root@ov-no1 glusterfs]# cat glusterd.log | egrep "( W | E )" |more
[2020-09-01 07:00:31.326169] E [glusterd-op-sm.c:8132:glusterd_op_sm] (-->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x23a1e) [0x7f23d8ac8a1e] -->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x1c1be) [0x7f23d8ac11be] -->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x4306f) [0x7f23d8ae806f] ) 0-management: Unable to get transaction opinfo for transaction ID :435d3780-aa0c-4a64-bc28-56ae394159d0
[2020-09-01 08:02:31.551563] E [glusterd-op-sm.c:8132:glusterd_op_sm] (-->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x23a1e) [0x7f23d8ac8a1e] -->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x1c1be) [0x7f23d8ac11be] -->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x4306f) [0x7f23d8ae806f] ) 0-management: Unable to get transaction opinfo for transaction ID :930a8a08-1044-41cf-b921-913b982e0c72
[2020-09-01 09:04:31.786157] E [glusterd-op-sm.c:8132:glusterd_op_sm] (-->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x23a1e) [0x7f23d8ac8a1e] -->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x1c1be) [0x7f23d8ac11be] -->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x4306f) [0x7f23d8ae806f] ) 0-management: Unable to get transaction opinfo for transaction ID :9942b579-5240-4fee-bb4c-78b9a1c98da8
[2020-09-01 10:06:32.014362] E [glusterd-op-sm.c:8132:glusterd_op_sm] (-->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x23a1e) [0x7f23d8ac8a1e] -->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x1c1be) [0x7f23d8ac11be]
[ovirt-users] Re: VM HostedEngine is down with error
Thank you very much for your reply. I checked the NTP and realized the service wasn't working properly on two of the three nodes, but despite that the clocks seemed to have the correct time (date and hwdate). I switched to chronyd and stopped the ntpd service, and now the servers' clocks seem to be synchronized. The time in the BIOS differs from the system time; does this affect the overall behaviour or performance?

This is what I could gather from the logs:

Node1:
[root@ov-no1 ~]# more /var/log/messages |egrep "(WARN|error)"|more
Aug 27 17:53:08 ov-no1 libvirtd: 2020-08-27 14:53:08.947+: 5613: error : qemuDomainAgentAvailable:9133 : Guest agent is not responding: QEMU guest agent is not connected
Aug 27 17:58:08 ov-no1 libvirtd: 2020-08-27 14:58:08.943+: 5613: error : qemuDomainAgentAvailable:9133 : Guest agent is not responding: QEMU guest agent is not connected
Aug 27 18:03:08 ov-no1 libvirtd: 2020-08-27 15:03:08.937+: 5614: error : qemuDomainAgentAvailable:9133 : Guest agent is not responding: QEMU guest agent is not connected
Aug 27 18:08:08 ov-no1 libvirtd: 2020-08-27 15:08:08.951+: 5617: error : qemuDomainAgentAvailable:9133 : Guest agent is not responding: QEMU guest agent is not connected
Aug 27 18:13:08 ov-no1 libvirtd: 2020-08-27 15:13:08.951+: 5616: error : qemuDomainAgentAvailable:9133 : Guest agent is not responding: QEMU guest agent is not connected
Aug 27 18:18:08 ov-no1 libvirtd: 2020-08-27 15:18:08.942+: 5618: error : qemuDomainAgentAvailable:9133 : Guest agent is not responding: QEMU guest agent is not connected
...

[Around that time Node 3, which is the arbiter node, was placed in local Maintenance mode. It was shut down for maintenance, and when we booted it up again all seemed right. We removed it from maintenance mode, and when the healing processes finished, Node 1 became NonResponsive. Long story short, the VDSM agent sent Node1 a restart command. Node1 rebooted, HostedEngine was up on Node1, and the rest of the VMs that were hosted by Node1 had to be manually brought up. Since then everything seemed to be working as it should. The HostedEngine VM shut down with no apparent reason 5 days later, making us believe there was no connection between the two incidents.]

...

Sep 1 05:53:30 ov-no1 vdsm[6706]: WARN Worker blocked: timeout=60, duration=60.00 at 0x7f1ed7381dd0> task#=76268 at 0x7f1ebc0797d0>, traceback:#012File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap#012 self.__bootstrap_inner()#012File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner#012 self.run()#012File: "/usr/lib64/python2.7/threading.py", line 765, in run#012 self.__target(*self.__args, **self.__kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run#012 ret = func(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run#012 self._execute_task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task#012 task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__#012 self._callable()#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__#012 self._handler(self._ctx, self._req)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest#012 response = self._handle_request(req, ctx)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request#012 res = method(**params)#012File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 194, in _dynamicMethod#012 result = fn(*methodArgs)#012File: "/usr/lib/python2.7/site-packages/vdsm/gluster/apiwrapper.py", line 237, in geoRepSessionList#012 remoteUserName)#012File: "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 93, in wrapper#012 rv = func(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 551, in volumeGeoRepSessionList#012 remoteUserName,#012File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 56, in __call__#012 return callMethod()#012File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 54, in #012 **kwargs)#012File: "", line 2, in glusterVolumeGeoRepStatus#012File: "/usr/lib64/python2.7/multiprocessing/managers.py", line 759, in _callmethod#012 kind, result = conn.recv()
Sep 1 05:54:30 ov-no1 vdsm[6706]: WARN Worker blocked: timeout=60, duration=120.00 at 0x7f1ed7381dd0> task#=76268 at 0x7f1ebc0797d0>, traceback:#012File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap#012 self.__bootstrap_inner()#012File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner#012 self.run()#012File: "/usr/lib64/python2.7/threading.py", line 765, in run#012 self.__target(*self.__args, **self.__kwargs)#012File:
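A side note on reading the vdsm entries above: vdsm logs multi-line Python tracebacks as a single line with newlines escaped as "#012", which is what makes them so hard to follow. They can be unfolded with GNU sed before inspection; a small sketch with an abbreviated sample line:

```shell
# Unfold vdsm's "#012" newline escapes so the traceback reads as a stack.
# Sample input is a shortened stand-in for the real log line.
printf '%s' 'WARN Worker blocked, traceback:#012File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap#012 self.__bootstrap_inner()' |
  sed 's/#012/\n/g'
```

The same filter works on the live file, e.g. `grep 'Worker blocked' /var/log/messages | sed 's/#012/\n/g'` (GNU sed; BSD sed handles \n in the replacement differently).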
[ovirt-users] VM HostedEngine is down with error
Hello everyone, I have a replica 2 + arbiter installation, and this morning the Hosted Engine gave the following error on the UI and resumed on a different node (node3) than the one it was originally running on (node1). (The original node has more memory than the one it ended up on, but it had a better memory usage percentage at the time.) Also, the only way I discovered that the migration had happened and that there was an Error in Events was because I logged in to the oVirt web interface for a routine inspection. Besides that, everything was working properly and still is.

The error that popped up is the following:

VM HostedEngine is down with error. Exit message: internal error: qemu unexpectedly closed the monitor: 2020-09-01T06:49:20.749126Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
2020-09-01T06:49:20.927274Z qemu-kvm: -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-d5de54b6-9f8e-4fba-819b-ebf6780757d2,id=ua-d5de54b6-9f8e-4fba-819b-ebf6780757d2,bootindex=1,write-cache=on: Failed to get "write" lock Is another process using the image?.

From what I could gather, this concerns the following snippet from the HostedEngine.xml, and it's the virtio disk of the Hosted Engine: d5de54b6-9f8e-4fba-819b-ebf6780757d2

I've tried looking into the logs and the sar command, but I couldn't find anything to relate to the above errors or to determine the reason they happened. Is this a Gluster or a QEMU problem? The Hosted Engine had been manually migrated to node1 five days before. Is there a standard practice I could follow to determine what happened and secure my system?
Thank you very much for your time, Maria Souvalioti
[ovirt-users] Re: Upgrade Memory of oVirt Nodes
Hello again, Hope everyone's OK. I'm really sorry for being so late, but I came back with the update. We wanted to upgrade the memory because we plan on deploying several VMs on the platform, mostly for services, and thought that now was a better time to do the upgrade than later on.

I manually migrated some of the VMs onto other nodes, trying to keep the percentage of physical memory used on each node somewhat equal. The upgrade of the memory of the nodes happened with no problem (from 32GB to 72GB). The VMs that were active at the time (8 VMs including the HE) didn't show any kind of downtime, slow-down or other issue. The memory usage of the two online nodes at the time was around 75-80%, with the HE consuming the most. The upgrade happened right after I had to replace a failed SAS HDD (hot-plug) which contained the mirror of the HDD with the oVirt node OS. Everything went as my team and I hoped, with no problems on either side.

As for the storage: the deployment we have is GlusterFS with one node as arbiter. When each node rejoined the others, it took around 8-10 minutes for the healing operations to complete, and since then everything's been going perfectly well.

Thank you all for the time you took to respond and for the valuable information you shared. It means a lot. Best regards, Maria
[ovirt-users] Upgrade Memory of oVirt Nodes
Hello everyone,

I have an oVirt 4.3.2.5 hyperconverged 3-node production environment and we want to add some RAM to it. Can I upgrade the RAM while keeping the VMs running, without my users noticing any disruption? The way I thought I should do it was: migrate any running VMs to the other nodes, set one node into maintenance mode, shut it down, install the new memory, bring it back up, take it out of maintenance mode, see how the installation reacts, and then repeat for the other two nodes. Is this correct, or should I go about it another way? Will there be a problem during the time when the nodes are not identical in their resources?

Thank you for your time,
Souvalioti Maria
___ Users mailing list -- users@ovirt.org To unsubscribe send an email to users-le...@ovirt.org Privacy Statement: https://www.ovirt.org/privacy-policy.html oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/F4E6DMLL23QU6KGMUVUNRGNR3IUYCT5W/
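A quick sanity check before each step of a rolling upgrade like the one described above is to confirm that the remaining nodes have enough free RAM to absorb the VMs being migrated off the node that is going down. A minimal illustrative sketch in Python; the node names and figures below are hypothetical, not taken from any real cluster:

```python
# Illustrative capacity check for a rolling node upgrade.
# All node names and numbers below are hypothetical examples.

def can_evacuate(node_to_drain, free_ram_gb, vm_ram_gb):
    """Check whether the other nodes can absorb the drained node's VMs.

    free_ram_gb: free RAM per node, in GB
    vm_ram_gb:   total RAM of the VMs currently on each node, in GB
    """
    needed = vm_ram_gb[node_to_drain]
    spare = sum(free for node, free in free_ram_gb.items()
                if node != node_to_drain)
    return spare >= needed

free = {"node1": 8, "node2": 10, "node3": 12}
vms = {"node1": 14, "node2": 12, "node3": 10}

# node1 needs 14 GB evacuated; node2 + node3 have 22 GB spare.
print(can_evacuate("node1", free, vms))  # → True
```

In practice the engine's scheduler performs an equivalent check when you put a host into maintenance, but doing the arithmetic beforehand tells you whether some VMs would need to be powered down first.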
[ovirt-users] Messed up 4.2.3.1 installation - SSL handshake ERROR
Hello, everyone!

So, I have an experimental installation of oVirt 4.2.3.1, 3 nodes and glustered. Recently I deployed a new installation with oVirt 4.3.5.2, 3 nodes and glustered storage here as well. The thing is, in my enthusiasm I thought: "Hey! What if I import the experimental nodes as hosts into the new installation, in a new cluster, and see what happens? Will the 4.3.5.2 engine see them? Probably yeah. But will it see the VMs I have there?" And so I imported the experimental nodes, without detaching them from their hosted engine. I could see the only VM that was active at the moment, but none of the suspended ones, and of course I could not see the 4.2.3.1 HE VM.

I have since removed the hosts from the new installation and tried reconnecting the old engine and its nodes. Passwordless ssh works just fine, but the problem persists: hosted-engine --vm-status reports stale-data on node 2 and node 3.

The thing is, I know I messed up the experimental installation (and I blame only my curiosity): the SSL handshake is no longer feasible, and I can't remove the hosts from the initial cluster in order to import them again. Basically, everything is either stuck activating without ever managing to, or down, or non-responsive. I would like to find a way around this, as I have seen in other posts on the oVirt forum that the SSL handshake error appears in some other cases too, and I would like to have the know-how in case a situation like this occurs in production in the future.

Is it possible to re-deploy the engine on the nodes without losing the glustered space or the existing VMs? Can the HE be destroyed and then deployed from scratch? What about the glustered space and the VMs' space? Will the VMs just take up space, with no way to either bring them up or destroy them?

I know I'm asking a lot, and it was my fault to begin with, but I am really curious whether we can see this through.
Thanks in advance ___ Users mailing list -- users@ovirt.org To unsubscribe send an email to users-le...@ovirt.org Privacy Statement: https://www.ovirt.org/site/privacy-policy/ oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/FTJGN7TEE2Y4OEAN5IXMZT6QSBWZML2V/
[ovirt-users] VMs inaccessible from Ubuntu/Debian-based OS
Hello all,

I've been having an issue for the past couple of days and have tried everything I could find to solve it, with no success. I have a three-node oVirt installation, glustered and hyperconverged, and there I have a few VMs. The installation is for experimental purposes before we merge it into our DC. Anyway, though I can connect to the VMs' console from Fedora and Windows, I can't from Ubuntu.

I tried installing the browser-spice-plugin, which is the package corresponding to spice-xpi, and got no results. I purged the browser-spice-plugin and then installed the spice-client package, downloaded spice-xpi from Fedora (FC19) (as instructed in https://www.ovirt.org/develop/infra/testing/spice.html), copied libnsISpicec.so to /usr/lib/mozilla/plugins, and made sure that xserver-xorg-video-qxl and spice-vdagent were installed and at their latest versions, and still nothing. No matter what I try, I can't gain access to the console. Whatever browser I use, the message I get is "Unable to connect to the graphic server /tmp/mozilla_msouval0/console-1.vv".

I have checked the logs and couldn't find anything useful; maybe I'm not checking the right logs? I ran tcpdump on both the node hosting the VM and the Ubuntu machine I'm using: I captured a few packets (6) on each side, but on the Ubuntu side 15 packets were received by the filter. Could you please guide me towards the right way to solve this? Could this be an NTP problem?

oVirt node version: 4.2.3.1
Workstation: Jessie (I thought maybe Jessie was too old, so I ran the same steps from Ubuntu 14.04 and Ubuntu 16.04, but the problem remains)

Thank you in advance for any help.
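One thing that may help with the "Unable to connect to the graphic server ... console-1.vv" message: the .vv file the browser downloads is a plain INI file intended to be opened with remote-viewer (from the virt-viewer package), which has largely replaced the old spice-xpi browser plugin. Inspecting it shows which host and port the client is actually trying to reach. A small sketch; the sample contents below are made up for illustration:

```python
# Parse a virt-viewer .vv console file to see where the client connects.
# The sample contents below are made up; a real file comes from the
# oVirt web UI download (e.g. console-1.vv).
from configparser import ConfigParser

sample_vv = """\
[virt-viewer]
type=spice
host=192.0.2.10
port=5900
tls-port=5901
"""

cfg = ConfigParser()
cfg.read_string(sample_vv)

section = cfg["virt-viewer"]
print(section["type"], section["host"], section["port"])  # → spice 192.0.2.10 5900
```

If the host/port in the file are reachable from the Ubuntu machine, trying `remote-viewer console-1.vv` directly, instead of going through the browser plugin, is worth a shot.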
Maria ___ Users mailing list -- users@ovirt.org To unsubscribe send an email to users-le...@ovirt.org Privacy Statement: https://www.ovirt.org/site/privacy-policy/ oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/4VMMYEJNSHGUHNRDA6AM6TNLBL4JLL6O/
[ovirt-users] Re: Hosted Engine Abruptly Stopped Responding - Unexpected Shutdown
Hello, and thank you very much for your reply. I'm terribly sorry for being so late to respond.

I thought the same, that dropping the cache was more of a workaround than a real solution, but truthfully I was stuck and couldn't think of anything beyond how much I need to upgrade the memory on the nodes. I've tried to find info about other oVirt virtualization set-ups and the amount of memory they allocate, so I can get an idea of what my set-up needs. The only thing I found was one admin who had set oVirt up with 128GB, still needed more because of the growing needs of the system and its users, and was about to upgrade the memory too. I'm just worried that oVirt is very memory-hungry and that no matter how much I "feed" it, it will still ask for more. I'm also worried that there are one, two, or even more configuration tweaks that I'm still missing which would solve the memory problem.

Anyway, KSM is enabled. sar shows that the committed memory, when a Windows 10 VM is active (alongside the Hosted Engine of course, and two Linux VMs - 1 CentOS, 1 Debian), is around 89% on the specific host it runs on (together with the Debian VM), and has reached up to 98%.

You are correct about the monitoring system too. I have set up a PRTG environment and there's Nagios running, but they can't see oVirt yet. I will set them up correctly in the next few days.

I haven't made any changes to my tuned profile; it's the default from oVirt. Specifically, the active profile is virtual-host.

Again, I'm very sorry it took me so long to reply, and thank you very much for your response.

Best Regards,
Maria Souvalioti
___ Users mailing list -- users@ovirt.org To unsubscribe send an email to users-le...@ovirt.org Privacy Statement: https://www.ovirt.org/site/privacy-policy/ oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/G4YELWF5L4AKUT3OH4C4QJHHEEJPCI3G/
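For anyone who wants to watch the commit figures that sar reports without a full monitoring stack: the kernel exposes them in /proc/meminfo, roughly as Committed_AS against CommitLimit. A small illustrative parser, run here against a made-up snapshot rather than a live node:

```python
# Compute the memory commit ratio from /proc/meminfo-style data.
# The snapshot below is made up for illustration; on a real node you
# would read the text from /proc/meminfo instead.

def commit_ratio(meminfo_text):
    """Return Committed_AS as a percentage of CommitLimit."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields[key.strip()] = int(rest.split()[0])  # values are in kB
    return 100.0 * fields["Committed_AS"] / fields["CommitLimit"]

snapshot = """\
CommitLimit:    11534336 kB
Committed_AS:   10266959 kB
"""

print(f"{commit_ratio(snapshot):.0f}% committed")  # → 89% committed
```

Polling this every few minutes and alerting above, say, 90% would give an early warning well before the OOM killer gets involved.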
[ovirt-users] Hosted Engine Abruptly Stopped Responding - Unexpected Shutdown
Hello,

I ran into a problem last month that I figured would be good to discuss here. I'm sorry I didn't post earlier, but time slipped away from me.

I have set up a glustered, hyperconverged oVirt environment for experimental use, as a means to observe its behaviour and get used to its management and performance before setting it up as a production environment for our organization. The environment has been up and running since October 2018. The three nodes are HP ProLiant DL380 G7 servers with the following characteristics:

Mem: 22GB
CPU: 2x Intel Xeon hexa-core E56xx
HDD: 5x 300GB
Network: BCM5709C with dual-port Gigabit
OS: Red Hat 7.5.1804 (Core, 3.10.0-862.3.2.el7.x86_64) - oVirt Node 4.2.3.1

As I was working on the environment, the engine stopped working. Not long before the HE stopped, I was in the web interface managing my VMs when the browser froze, and the HE also stopped responding to ICMP requests. The first thing I did was connect via ssh to all nodes and run #hosted-engine --vm-status, which showed that the HE was down on nodes 1 and 2 and up on the 3rd node. After executing #virsh -r list, the VM list that was shown contained two of the VMs I had previously created and which were up; the HE was nowhere to be seen. I tried to restart the HE with #hosted-engine --vm-start but it didn't work. I then put all nodes in maintenance mode with #hosted-engine --set-maintenance --mode=global (I guess I should have done that earlier) and re-ran #hosted-engine --vm-start, which gave the same result as before. After checking the mails the system had sent to the root user, I saw there were several mails on the 3rd node (where the HE had been), reporting the HE's state. The messages cycled through EngineDown-EngineStart, EngineStart-EngineStarting, EngineStarting-EngineMaybeAway, EngineMaybeAway-EngineUnexpectedlyDown, EngineUnexpectedlyDown-EngineDown, EngineDown-EngineStart, and so forth.
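As an aside, when scripting checks like this across several hosts, it can help to pull the machine-readable part out of hosted-engine --vm-status, since each host's "Engine status" line carries a small JSON blob. A hedged sketch; the sample text is a simplified, made-up approximation of the real output, whose exact layout varies between oVirt versions:

```python
# Extract the per-host engine-status JSON from `hosted-engine --vm-status`
# output. The sample below is a simplified, made-up approximation; the
# real output's layout varies between oVirt versions.
import json
import re

sample = """\
--== Host 1 status ==--
Engine status                      : {"health": "bad", "vm": "down", "detail": "unknown"}
--== Host 3 status ==--
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
"""

def engine_states(text):
    states = []
    for line in text.splitlines():
        m = re.match(r"Engine status\s*:\s*(\{.*\})", line)
        if m:
            states.append(json.loads(m.group(1)))
    return states

for st in engine_states(sample):
    print(st["vm"], st["health"])
```

With something like this you can tell at a glance which host (if any) believes the engine VM is up, instead of eyeballing the full status dump on each node.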
I continued by searching the following logs on all nodes:

/var/log/libvirt/qemu/HostedEngine.log
/var/log/libvirt/qemu/win10.log
/var/log/libvirt/qemu/DNStest.log
/var/log/vdsm/vdsm.log
/var/log/ovirt-hosted-engine-ha/agent.log

There I spotted an error that had started appearing almost a month earlier on node #2:

ERROR Internal server error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
    res = method(**params)
  File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
    result = fn(*methodArgs)
  File "/usr/lib/python2.7/site-packages/vdsm/gluster/apiwrapper.py", line 85, in logicalVolumeList
    return self._gluster.logicalVolumeList()
  File "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 90, in wrapper
    rv = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 808, in logicalVolumeList
    status = self.svdsmProxy.glusterLogicalVolumeList()
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
    return callMethod()
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 52, in
    getattr(self._supervdsmProxy._svdsm, self._funcName)(*args,
AttributeError: 'AutoProxy[instance]' object has no attribute 'glusterLogicalVolumeList'

I also checked the output of the following commands, to see whether a mandatory process was missing/killed, or whether a memory problem or a disk space shortage had led to the sudden death of a process:

#ps -A
#top
#free -h
#df -hT

Finally, after some time delving into the logs, the output of #journalctl --dmesg showed the following message:

"Out of memory: Kill process 5422 (qemu-kvm) score 514 or sacrifice child. Killed process 5422 (qemu-kvm) total-vm:17526548kB, anon-rss:9310396kB, file-rss:2336kB, shmem-rss:12kB"

after which the ovirtmgmt network stopped responding.
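Converting the OOM-killer figures quoted above from kB makes it clearer why the kernel picked that qemu-kvm process:

```python
# The OOM-killer figures from the journalctl message above, in kB.
total_vm_kb = 17526548   # virtual size of the killed qemu-kvm process
anon_rss_kb = 9310396    # anonymous resident memory it actually held

KB_PER_GIB = 1024 * 1024
total_vm_gib = total_vm_kb / KB_PER_GIB
anon_rss_gib = anon_rss_kb / KB_PER_GIB

print(f"total-vm ~ {total_vm_gib:.1f} GiB, anon-rss ~ {anon_rss_gib:.1f} GiB")
# → total-vm ~ 16.7 GiB, anon-rss ~ 8.9 GiB
```

So the guest was holding roughly 8.9 GiB resident on a 22 GB node that also runs gluster and vdsm, which is consistent with the host simply running out of memory rather than any oVirt-specific fault.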
I tried to restart vhostmd by executing #/etc/rc.d/init.d/vhostmd start, but it didn't work. Finally, I decided to run the HE restart command on the other nodes as well (I had figured that since the HE was last running on node #3, that's where I should try to restart it). So, I ran #hosted-engine --vm-start and the output was:

"Command VM.getStats with args {'vmID':'...<the HE's ID>'} failed: (code=1, message=Virtual machine does not exist: {'vmID':'...<the HE's ID>'})"

I then ran the command again, and the output was "VM exists and its status is Powering Up." After that I executed #virsh -r list and the output was the following:

Id Name State
2 HostedEngine running

After the HE's restart, two mails arrived stating: ReinitializeFSM-EngineStarting and EngineStarting-EngineUp. After that, and after checking that we had access to the web interface again, we executed hosted-engine --set-maintenance --mode=none to get out of the