[ovirt-users] VM ramdomly unresponsive

fsoyer Tue, 13 Nov 2018 03:16:59 -0800

Hi all,
I continue to try to understand my problem between (I suppose) oVirt anf 
Gluster.
After my recents posts titled 'VMs unexpectidly restarted' that did not provide 
solution nor search idea, I submit to you another (related ?) problem.
Parallely with the problem of VMs down (that did not reproduce since Oct 16), I 
have ramdomly some events in the GUI saying "VM xxxxx is not responding." For 
example, VM "patjoub1" on 2018-11-11 14:34. Never the same hour, not all the 
days, often this VM patjoub1 but not always : I had it on two others. All VMs 
disks are on a volume DATA02 (with leases on the same volume).


Searching in engine.log, I found :
2018-11-11 14:34:32,953+01 INFO  
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] 
(EE-ManagedThreadFactory-engineScheduled-Thread-28) [] VM 
'6116fb07-096b-4c7e-97fe-01ecc9a6bd9b'(patjoub1) moved from 'Up' --> 
'NotResponding'
2018-11-11 14:34:33,116+01 WARN  
[org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerObjectsBuilder] 
(EE-ManagedThreadFactory-engineScheduled-Thread-1) [] Invalid or unknown guest 
architecture type '' received from guest agent
2018-11-11 14:34:33,176+01 WARN  
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
(EE-ManagedThreadFactory-engineScheduled-Thread-28) [] EVENT_ID: 
VM_NOT_RESPONDING(126), VM patjoub1 is not responding.
...
...
2018-11-11 14:34:48,278+01 INFO  
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] 
(EE-ManagedThreadFactory-engineScheduled-Thread-48) [] VM 
'6116fb07-096b-4c7e-97fe-01ecc9a6bd9b'(patjoub1) moved from 'NotResponding' --> 
'Up'So it becomes up 15s after, and the VM (and the monitoring) see no downtime.
At this time, I see in vdsm.log of the nodes :
2018-11-11 14:33:49,450+0100 ERROR (check/loop) [storage.Monitor] Error 
checking path 
/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/metadata
 (monitor:498)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 496, in 
_pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 391, in 
delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: 
(u'/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/metadata',
 1, 'Read timeout')
2018-11-11 14:33:49,450+0100 INFO  (check/loop) [storage.Monitor] Domain 
ffc53fd8-c5d1-4070-ae51-2e91835cd937 became INVALID (monitor:469)

2018-11-11 14:33:59,451+0100 WARN  (check/loop) [storage.check] Checker 
u'/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/metadata'
 is blocked for 20.00 seconds (check:282)

2018-11-11 14:34:09,480+0100 INFO  (event/37) [storage.StoragePool] Linking 
/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937
 to 
/rhev/data-center/6efda7f8-b62f-11e8-9d16-00163e263d21/ffc53fd8-c5d1-4070-ae51-2e91835cd937
 (sp:1230)OK : so, DATA02 marked as blocked for 20s ? I definitly have a 
problem with gluster ? I'll inevitably find the reason in the gluster logs ? Uh 
: not at all.
Please see gluster logs here :
https://seafile.systea.fr/d/65df86cca9d34061a1e4/

Unfortunatly I discovered this morning that I have not the sanlock.log for this 
date. I don't understand why, the log rotate seems OK with "rotate 3", but I 
have no backups files :(.
But, luck in bad luck, the same event occurs this morning ! Same VM patjoub1, 
2018-11-13 08:01:37. So I have added the sanlock.log for today, maybe it can 
help.

IMPORTANT NOTE : don't forget that Gluster log with on hour shift. For this 
event at 14:34, search at 13h34 in gluster logs.
I recall my configuration :
Gluster 3.12.13
oVirt 4.2.3
3 nodes where the third is arbiter (volumes in replica 2)

The nodes are never overloaded (CPU average 5%, no peak detected at the time of 
the event, mem 128G used at 15% (only 10 VMs on this cluster)). Network 
underused, gluster is on a separate network on a bond (2 NICs) 1+1Gb mode 4 = 
2Gb, used in peak at 10%.

Here is the configuration for the given volume :
# gluster volume status DATA02
Status of volume: DATA02
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick victorstorage.local.systea.fr:/home/d
ata02/data02/brick                          49158     0          Y       4990 
Brick gingerstorage.local.systea.fr:/home/d
ata02/data02/brick                          49153     0          Y       8460 
Brick eskarinastorage.local.systea.fr:/home
/data01/data02/brick                        49158     0          Y       2470 
Self-heal Daemon on localhost               N/A       N/A        Y       8771 
Self-heal Daemon on eskarinastorage.local.s
ystea.fr                                    N/A       N/A        Y       11745
Self-heal Daemon on victorstorage.local.sys
tea.fr                                      N/A       N/A        Y       17055
 
Task Status of Volume DATA02
------------------------------------------------------------------------------
There are no active volume tasks
  
# gluster volume info DATA02
 
Volume Name: DATA02
Type: Replicate
Volume ID: 48bf5871-339b-4f39-bea5-9b5848809c83
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: victorstorage.local.systea.fr:/home/data02/data02/brick
Brick2: gingerstorage.local.systea.fr:/home/data02/data02/brick
Brick3: eskarinastorage.local.systea.fr:/home/data01/data02/brick (arbiter)
Options Reconfigured:
network.ping-timeout: 30
server.allow-insecure: on
cluster.granular-entry-heal: enable
features.shard-block-size: 64MB
performance.stat-prefetch: on
server.event-threads: 3
client.event-threads: 8
performance.io-thread-count: 32
storage.owner-gid: 36
storage.owner-uid: 36
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
cluster.server-quorum-ratio: 51%
So : is there someone around trying to make me understand what append ? 
Pleeease :/

--

Regards,

Frank

_______________________________________________
Users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/[email protected]/message/D6HKP6TBZKE3KE3TGJAWFKLHBPCKBMFH/

[ovirt-users] VM ramdomly unresponsive

Reply via email to