Hi all,

I am still trying to understand my problem between (I suppose) oVirt and Gluster. My recent posts titled 'VMs unexpectidly restarted' produced neither a solution nor a lead to investigate, so I am submitting another (related?) problem.

Alongside the problem of VMs going down (which has not recurred since Oct 16), I randomly get events in the GUI saying "VM xxxxx is not responding." For example, VM "patjoub1" on 2018-11-11 14:34. It never happens at the same hour and not every day. It is often this VM patjoub1, but not always: I have seen it on two others. All VM disks are on a volume DATA02 (with leases on the same volume).

Searching in engine.log, I found:

2018-11-11 14:34:32,953+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-28) [] VM '6116fb07-096b-4c7e-97fe-01ecc9a6bd9b'(patjoub1) moved from 'Up' --> 'NotResponding'
2018-11-11 14:34:33,116+01 WARN [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerObjectsBuilder] (EE-ManagedThreadFactory-engineScheduled-Thread-1) [] Invalid or unknown guest architecture type '' received from guest agent
2018-11-11 14:34:33,176+01 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-28) [] EVENT_ID: VM_NOT_RESPONDING(126), VM patjoub1 is not responding.
...
...
2018-11-11 14:34:48,278+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-48) [] VM '6116fb07-096b-4c7e-97fe-01ecc9a6bd9b'(patjoub1) moved from 'NotResponding' --> 'Up'

So it comes back up 15s later, and neither the VM nor the monitoring sees any downtime.
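
To list every such episode and its duration rather than reading engine.log by eye, something like the following could help. It is only a minimal sketch: the regular expression is modelled on the two VmAnalyzer lines quoted above and may need adjusting for other oVirt versions, and the function name `not_responding_windows` is made up for this example.

```python
import re
from datetime import datetime

# Pattern modelled on the VmAnalyzer lines quoted above (an assumption about
# the log format; other engine versions may word this differently).
TRANSITION = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\S*"
    r".*VmAnalyzer.*VM '(?P<uuid>[0-9a-f-]+)'\((?P<name>[^)]+)\) "
    r"moved from '(?P<old>\w+)' --> '(?P<new>\w+)'"
)

def not_responding_windows(lines):
    """Yield (vm_name, start, end, seconds) for each NotResponding episode."""
    started = {}  # VM uuid -> time it entered NotResponding
    for line in lines:
        m = TRANSITION.search(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
        if m["new"] == "NotResponding":
            started[m["uuid"]] = ts
        elif m["old"] == "NotResponding" and m["uuid"] in started:
            start = started.pop(m["uuid"])
            yield m["name"], start, ts, (ts - start).total_seconds()

# The two transition lines from the excerpt above, used as sample input.
SAMPLE = [
    "2018-11-11 14:34:32,953+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] "
    "(EE-ManagedThreadFactory-engineScheduled-Thread-28) [] "
    "VM '6116fb07-096b-4c7e-97fe-01ecc9a6bd9b'(patjoub1) moved from 'Up' --> 'NotResponding'",
    "2018-11-11 14:34:48,278+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] "
    "(EE-ManagedThreadFactory-engineScheduled-Thread-48) [] "
    "VM '6116fb07-096b-4c7e-97fe-01ecc9a6bd9b'(patjoub1) moved from 'NotResponding' --> 'Up'",
]

if __name__ == "__main__":
    for name, start, end, secs in not_responding_windows(SAMPLE):
        print(f"{name}: {start} -> {end} ({secs:.3f}s)")
```

In practice the sample list would be replaced by the open engine.log file, since the function accepts any iterable of lines.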

Around that time, I see in vdsm.log of the nodes:

2018-11-11 14:33:49,450+0100 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/metadata (monitor:498)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 496, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: (u'/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/metadata', 1, 'Read timeout')
2018-11-11 14:33:49,450+0100 INFO (check/loop) [storage.Monitor] Domain ffc53fd8-c5d1-4070-ae51-2e91835cd937 became INVALID (monitor:469)
2018-11-11 14:33:59,451+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/metadata' is blocked for 20.00 seconds (check:282)
2018-11-11 14:34:09,480+0100 INFO (event/37) [storage.StoragePool] Linking /rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937 to /rhev/data-center/6efda7f8-b62f-11e8-9d16-00163e263d21/ffc53fd8-c5d1-4070-ae51-2e91835cd937 (sp:1230)

OK: so DATA02 is marked as blocked for 20s? I definitely have a problem with gluster? I'll inevitably find the reason in the gluster logs? Uh: not at all. Please see the gluster logs here: https://seafile.systea.fr/d/65df86cca9d34061a1e4/

Unfortunately, I discovered this morning that I do not have the sanlock.log for this date. I don't understand why: the log rotation seems OK with "rotate 3", but I have no backup files :(. But, a bit of luck in the bad luck, the same event occurred again this morning! Same VM patjoub1, 2018-11-13 08:01:37. So I have added the sanlock.log for today; maybe it can help.
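
For anyone wanting to line these vdsm events up with the gluster logs, here is a minimal sketch. It assumes the gluster logs are stamped in UTC (one hour behind the +0100 vdsm timestamps on these nodes), and it recognises only the three message fragments seen in the excerpt above. The function name `storage_events` is made up for this example.

```python
import re
from datetime import datetime, timezone

# vdsm.log timestamps look like "2018-11-11 14:33:49,450+0100".
STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})([+-]\d{4})")
# Message fragments taken from the excerpt above (not an exhaustive list).
MARKERS = ("Error checking path", "became INVALID", "is blocked for")

def storage_events(lines):
    """Yield (local_time, utc_time, marker) for storage-monitor problem lines."""
    for line in lines:
        m = STAMP.match(line)
        if not m:
            continue  # traceback continuation lines carry no timestamp
        marker = next((k for k in MARKERS if k in line), None)
        if marker is None:
            continue
        local = datetime.strptime(m.group(1) + m.group(3), "%Y-%m-%d %H:%M:%S%z")
        local = local.replace(microsecond=int(m.group(2)) * 1000)
        # Assumption: gluster logs use UTC, so this is the time to search for.
        yield local, local.astimezone(timezone.utc), marker

MD = ("/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/"
      "ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/metadata")

# Lines from the excerpt above, used as sample input.
SAMPLE = [
    "2018-11-11 14:33:49,450+0100 ERROR (check/loop) [storage.Monitor] "
    "Error checking path " + MD + " (monitor:498)",
    "Traceback (most recent call last):",
    "2018-11-11 14:33:49,450+0100 INFO (check/loop) [storage.Monitor] "
    "Domain ffc53fd8-c5d1-4070-ae51-2e91835cd937 became INVALID (monitor:469)",
    "2018-11-11 14:33:59,451+0100 WARN (check/loop) [storage.check] "
    "Checker u'" + MD + "' is blocked for 20.00 seconds (check:282)",
]

if __name__ == "__main__":
    for local, utc, marker in storage_events(SAMPLE):
        print(f"{marker}: vdsm {local:%H:%M:%S} -> gluster logs around {utc:%H:%M:%S}")
```

If the UTC assumption does not hold on a given node, only the `astimezone` conversion needs to change.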

IMPORTANT NOTE: don't forget that the Gluster logs have a one-hour shift. For this event at 14:34, search at 13:34 in the gluster logs.

As a reminder, my configuration:
Gluster 3.12.13
oVirt 4.2.3
3 nodes where the third is arbiter (volumes in replica 2)

The nodes are never overloaded (CPU average 5%, no peak detected at the time of the event, mem 128G used at 15% (only 10 VMs on this cluster)). The network is underused: gluster is on a separate network on a bond (2 NICs), 1+1Gb mode 4 = 2Gb, used at 10% at peak.

Here is the configuration for the given volume:

# gluster volume status DATA02
Status of volume: DATA02
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick victorstorage.local.systea.fr:/home/d
ata02/data02/brick                          49158     0          Y       4990
Brick gingerstorage.local.systea.fr:/home/d
ata02/data02/brick                          49153     0          Y       8460
Brick eskarinastorage.local.systea.fr:/home
/data01/data02/brick                        49158     0          Y       2470
Self-heal Daemon on localhost               N/A       N/A        Y       8771
Self-heal Daemon on eskarinastorage.local.s
ystea.fr                                    N/A       N/A        Y       11745
Self-heal Daemon on victorstorage.local.sys
tea.fr                                      N/A       N/A        Y       17055

Task Status of Volume DATA02
------------------------------------------------------------------------------
There are no active volume tasks

# gluster volume info DATA02
Volume Name: DATA02
Type: Replicate
Volume ID: 48bf5871-339b-4f39-bea5-9b5848809c83
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: victorstorage.local.systea.fr:/home/data02/data02/brick
Brick2: gingerstorage.local.systea.fr:/home/data02/data02/brick
Brick3: eskarinastorage.local.systea.fr:/home/data01/data02/brick (arbiter)
Options Reconfigured:
network.ping-timeout: 30
server.allow-insecure: on
cluster.granular-entry-heal: enable
features.shard-block-size: 64MB
performance.stat-prefetch: on
server.event-threads: 3
client.event-threads: 8
performance.io-thread-count: 32
storage.owner-gid: 36
storage.owner-uid: 36
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
cluster.server-quorum-ratio: 51%

So: is there someone around who could help me understand what happened? Pleeease :/

--
Regards,
Frank
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/D6HKP6TBZKE3KE3TGJAWFKLHBPCKBMFH/