Hi,
still questioning the whole chain oVirt -> Gluster -> hardware, I kept checking 
all the components and finally tested the hardware itself.
I found some latency on the storage when it was busy, and some web searches 
convinced me that the RAID cards could be the problem: these Dell servers ship 
with H310 cards, which have no cache... Last week we ordered H710 cards (which 
do provide a cache), and they were installed on Saturday. Since then, storage 
performance is better, and I have seen no more timeouts or errors. But since the 
problem occurred randomly, I will wait a few more days before declaring it solved!
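
For anyone wanting to reproduce the check, write latency on the DATA02 mount can 
be measured roughly as below (a sketch, assuming fio is installed; adjust the 
path, and remove the test file afterwards):

cd /rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/
# 4k random writes for 60 seconds; drop --direct=1 if the mount refuses O_DIRECT
fio --name=lat-check --filename=fio-lat-test --rw=randwrite --bs=4k \
    --size=256M --runtime=60 --time_based --ioengine=libaio --direct=1
rm fio-lat-test
# Watch the completion latency (clat) percentiles: values far above a few
# milliseconds under load point at the RAID/disk layer rather than at Gluster.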

Thank you, and sorry for the time wasted on this,
--

Regards,

Frank
On Tuesday, November 27, 2018 08:30 CET, Sahina Bose <sab...@redhat.com> wrote:
 On Tue, Nov 13, 2018 at 4:46 PM fsoyer <fso...@systea.fr> wrote:
>
> Hi all,
> I am still trying to understand my problem between (I suppose) oVirt and 
> Gluster.
> After my recent posts titled 'VMs unexpectidly restarted', which provided 
> neither a solution nor a lead to investigate, I submit another (related?) problem to you.
> In parallel with the problem of VMs going down (which has not reproduced since Oct 16), 
> I randomly get events in the GUI saying "VM xxxxx is not responding." 
> For example, VM "patjoub1" on 2018-11-11 14:34. Never at the same hour, not every 
> day, often this VM patjoub1 but not always: I have seen it on two others. All 
> VM disks are on the volume DATA02 (with leases on the same volume).
>
> Searching in engine.log, I found :
>
> 2018-11-11 14:34:32,953+01 INFO 
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-28) [] VM 
> '6116fb07-096b-4c7e-97fe-01ecc9a6bd9b'(patjoub1) moved from 'Up' --> 
> 'NotResponding'
> 2018-11-11 14:34:33,116+01 WARN 
> [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerObjectsBuilder] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-1) [] Invalid or unknown 
> guest architecture type '' received from guest agent
> 2018-11-11 14:34:33,176+01 WARN 
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-28) [] EVENT_ID: 
> VM_NOT_RESPONDING(126), VM patjoub1 is not responding.
> ...
> ...
> 2018-11-11 14:34:48,278+01 INFO 
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-48) [] VM 
> '6116fb07-096b-4c7e-97fe-01ecc9a6bd9b'(patjoub1) moved from 'NotResponding' 
> --> 'Up'
>
> So it comes back up 15 seconds later, and the VM (and the monitoring) see no downtime.
> At that time, I see in the vdsm.log of the nodes:
>
> 2018-11-11 14:33:49,450+0100 ERROR (check/loop) [storage.Monitor] Error 
> checking path 
> /rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/metadata
>  (monitor:498)
> Traceback (most recent call last):
> File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 496, in 
> _pathChecked
> delay = result.delay()
> File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 391, in 
> delay
> raise exception.MiscFileReadException(self.path, self.rc, self.err)
> MiscFileReadException: Internal file read failure: 
> (u'/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/metadata',
>  1, 'Read timeout')
> 2018-11-11 14:33:49,450+0100 INFO (check/loop) [storage.Monitor] Domain 
> ffc53fd8-c5d1-4070-ae51-2e91835cd937 became INVALID (monitor:469)
>
> 2018-11-11 14:33:59,451+0100 WARN (check/loop) [storage.check] Checker 
> u'/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/metadata'
>  is blocked for 20.00 seconds (check:282)
>
> 2018-11-11 14:34:09,480+0100 INFO (event/37) [storage.StoragePool] Linking 
> /rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937
>  to 
> /rhev/data-center/6efda7f8-b62f-11e8-9d16-00163e263d21/ffc53fd8-c5d1-4070-ae51-2e91835cd937
>  (sp:1230)
>
> OK: so DATA02 was marked as blocked for 20s? Then I definitely have a problem with 
> Gluster? And I will inevitably find the reason in the Gluster logs? Well: not at 
> all.
> Please see gluster logs here :
> https://seafile.systea.fr/d/65df86cca9d34061a1e4/
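> 
> (For reference, warnings and errors around the event can be filtered roughly as 
> below; this is a sketch, and the exact log file names may vary.)
> 
> # FUSE client log for the DATA02 mount, on each node:
> grep -E '\] [EW] \[' /var/log/glusterfs/rhev-data-center-mnt-glusterSD-victor.local.systea.fr:_DATA02.log | grep '2018-11-11 13:3'
> # brick log on the server side:
> grep -E '\] [EW] \[' /var/log/glusterfs/bricks/home-data02-data02-brick.log | grep '2018-11-11 13:3'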
>
> Unfortunately, I discovered this morning that I do not have the sanlock.log for 
> this date. I don't understand why: logrotate seems OK with "rotate 3", 
> but I have no backup files :(.
> But, as luck would have it, the same event occurred again this morning! Same VM 
> patjoub1, 2018-11-13 08:01:37. So I have added today's sanlock.log, maybe it can 
> help.
>
> IMPORTANT NOTE: don't forget that the Gluster logs are shifted by one hour (they 
> are in UTC). For this event at 14:34, search around 13:34 in the Gluster logs.
> As a reminder, my configuration:
> Gluster 3.12.13
> oVirt 4.2.3
> 3 nodes, the third being an arbiter (volumes in replica 2 + arbiter)
>
> The nodes are never overloaded (CPU average 5%, no peak detected at the time 
> of the event; 128 GB of RAM with 15% used; only 10 VMs on this cluster). The 
> network is underused: Gluster runs on a separate network over a bond of 2 NICs 
> (1+1 Gb, mode 4 = 2 Gb), peaking at about 10% usage.
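> 
> (For reference, the bond mode and the link state of both NICs can be checked 
> with the sketch below; "bond0" is an assumption, the actual interface name may differ.)
> 
> # mode 4 shows up as "Bonding Mode: IEEE 802.3ad Dynamic link aggregation";
> # the MII status and speed of each slave NIC are listed further down in the same file
> cat /proc/net/bonding/bond0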
>
> Here is the configuration for the given volume :
>
> # gluster volume status DATA02
> Status of volume: DATA02
> Gluster process TCP Port RDMA Port Online Pid
> ------------------------------------------------------------------------------
> Brick victorstorage.local.systea.fr:/home/data02/data02/brick    49158  0    Y  4990
> Brick gingerstorage.local.systea.fr:/home/data02/data02/brick    49153  0    Y  8460
> Brick eskarinastorage.local.systea.fr:/home/data01/data02/brick  49158  0    Y  2470
> Self-heal Daemon on localhost                                    N/A    N/A  Y  8771
> Self-heal Daemon on eskarinastorage.local.systea.fr              N/A    N/A  Y  11745
> Self-heal Daemon on victorstorage.local.systea.fr                N/A    N/A  Y  17055
>
> Task Status of Volume DATA02
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
>
>
>
> # gluster volume info DATA02
>
> Volume Name: DATA02
> Type: Replicate
> Volume ID: 48bf5871-339b-4f39-bea5-9b5848809c83
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: victorstorage.local.systea.fr:/home/data02/data02/brick
> Brick2: gingerstorage.local.systea.fr:/home/data02/data02/brick
> Brick3: eskarinastorage.local.systea.fr:/home/data01/data02/brick (arbiter)
> Options Reconfigured:
> network.ping-timeout: 30
> server.allow-insecure: on
> cluster.granular-entry-heal: enable
> features.shard-block-size: 64MB
> performance.stat-prefetch: on
> server.event-threads: 3
> client.event-threads: 8
> performance.io-thread-count: 32
> storage.owner-gid: 36
> storage.owner-uid: 36
> user.cifs: off
> features.shard: on
> cluster.shd-wait-qlength: 10000
> cluster.shd-max-threads: 8
> cluster.locking-scheme: granular
> cluster.data-self-heal-algorithm: full
> cluster.server-quorum-type: server
> cluster.quorum-type: auto
> cluster.eager-lock: enable
> network.remote-dio: enable
> performance.low-prio-threads: 32
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> transport.address-family: inet
> nfs.disable: on
> performance.client-io-threads: off
> cluster.server-quorum-ratio: 51%
>
>
> So: is there someone around who can help me understand what is happening? 
> Pleeease :/

MiscFileReadException: Internal file read failure:
(u'/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/metadata',
1, 'Read timeout') - indicates that there was no response from storage
within 32s (I think this is the sanlock read timeout? Denis? Nir?)

To figure out whether there are issues on the storage side, can you check the
gluster volume profile output for this volume and see if anything stands
out there?
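
(For reference, a rough way to collect that, run from one of the gluster nodes:)

gluster volume profile DATA02 start
# wait for (or reproduce) a "not responding" event, then dump the statistics:
gluster volume profile DATA02 info
gluster volume profile DATA02 stop
# In the per-brick output, look at the Avg-latency / Max-latency columns for
# fops such as WRITE, FSYNC and FINODELK around the time of the event.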

>
> --
>
> Regards,
>
> Frank
>
> _______________________________________________
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct: 
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives: 
> https://lists.ovirt.org/archives/list/users@ovirt.org/message/D6HKP6TBZKE3KE3TGJAWFKLHBPCKBMFH/
 
 
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/6ZPQFMGCHC7EUQHKDYPPCZ6MDGENONBT/
