On 2016-05-01 14:01, Nir Soffer wrote:
On Sun, May 1, 2016 at 3:31 PM, <nico...@devels.es> wrote:
On 2016-04-30 23:22, Nir Soffer wrote:
On Sun, May 1, 2016 at 12:48 AM, <nico...@devels.es> wrote:
On 2016-04-30 22:37, Nir Soffer wrote:
On Sat, Apr 30, 2016 at 10:28 PM, Nir Soffer <nsof...@redhat.com> wrote:
On Sat, Apr 30, 2016 at 7:16 PM, <nico...@devels.es> wrote:
On 2016-04-30 16:55, Nir Soffer wrote:
On Sat, Apr 30, 2016 at 11:33 AM, Nicolás <nico...@devels.es> wrote:

Hi Nir,

On 29/04/16 at 22:34, Nir Soffer wrote:
On Fri, Apr 29, 2016 at 9:17 PM, <nico...@devels.es> wrote:

Hi,

We're running oVirt 3.6.5.3-1 and lately we're experiencing some issues with some VMs being paused because they're marked as non-responsive. Mostly, after a few seconds they recover, but we want to debug precisely this problem so we can fix it consistently.

Our scenario is the following:

~495 VMs, of which ~120 are constantly up.
3 datastores, all of them iSCSI-based:
  * ds1: 2T, currently has 276 disks
  * ds2: 2T, currently has 179 disks
  * ds3: 500G, currently has 65 disks
7 hosts: all have mostly the same hardware. CPU and memory are currently very lightly used (< 10%).

ds1 and ds2 are physically the same backend, which exports two 2TB volumes. ds3 is a different storage backend to which we're currently migrating some disks from ds1 and ds2.

What is the storage backend behind ds1 and ds2?

The storage backend for ds1 and ds2 is the iSCSI-based HP LeftHand P4000 G2.

Usually, when VMs become unresponsive, the whole host where they run gets unresponsive too, so that gives a hint about the problem; my bet is that the culprit is somewhere on the host side and not on the VM side.

Probably the vm became unresponsive because the connection to the host was lost.

I forgot to mention that less commonly we have situations where the host doesn't become unresponsive but the VMs on it do, and they never become responsive again, so we have to forcibly power them off and start them on a different host. But in this case the connection with the host never gets lost (so basically the host is Up, but any VM running on it is unresponsive).

When that happens, the host itself gets non-responsive and is only recoverable after a reboot, since it's unable to reconnect.

Piotr, can you check the engine log and explain why the host is not reconnected?

I must say this is not specific to this oVirt version; when we were using v3.6.4 the same happened. It's also worth mentioning that we've not done any configuration changes and everything had been working quite well for a long time.

We were monitoring our ds1 and ds2 physical backend to check performance, and we suspect we've run out of IOPS since we're reaching the maximum specified by the manufacturer; probably at certain times the host cannot perform a storage operation within some time limit and it marks VMs as unresponsive. That's why we've set up ds3 and we're migrating ds1 and ds2 to ds3. When we run out of space on ds3 we'll create more smaller volumes to keep migrating.

On the host side, when this happens, we've run repoplot on the vdsm log and I'm attaching the result. Clearly there's a *huge* LVM response time (~30 secs.).

Indeed the log shows very slow vgck and vgs commands - these are called every 5 minutes for checking the vg health and refreshing the vdsm lvm cache.

1. starting vgck

Thread-96::DEBUG::2016-04-29 13:17:48,682::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-23 /usr/bin/sudo -n /usr/sbin/lvm vgck --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' 5de4a000-a9c4-489c-8eee-10368647c413 (cwd None)

2. vgck ends after 55 seconds

Thread-96::DEBUG::2016-04-29 13:18:43,173::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = 'WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!\n'; <rc> = 0

3. starting vgs

Thread-96::DEBUG::2016-04-29 13:17:11,963::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-23 /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|/dev/mapper/36000eb3a4f1acbc200000000000000b9|/dev/mapper/360014056f0dc8930d744f83af8ddc709|/dev/mapper/WDC_WD5003ABYZ-011FA0_WD-WMAYP0J73DU6|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 5de4a000-a9c4-489c-8eee-10368647c413 (cwd None)

4. vgs finished after 37 seconds

Thread-96::DEBUG::2016-04-29 13:17:48,680::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = 'WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!\n'; <rc> = 0

Zdenek, how do you suggest debugging these slow lvm commands?

Can you run the following commands on a host in trouble, and on some other hosts in the same timeframe?

time vgck -vvvv --config ' devices { filter = [ '\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' 5de4a000-a9c4-489c-8eee-10368647c413

time vgs -vvvv --config ' global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' 5de4a000-a9c4-489c-8eee-10368647c413

Note that I added -vvvv to both commands; this will generate a huge amount of debugging info. Please share the output of these commands. You may need to fix the commands - you can always copy and paste directly from the vdsm log into the shell and add the -vvvv flag.

Indeed, there seems to be a big difference on hosts 5 and 6. I'm attaching the results of running both commands on all hosts. Both commands produce a much bigger output on hosts 5 and 6, and also take much longer to run. Times are also attached in a file called TIMES.

Our host storage network is correctly configured and on a 1G interface; no errors on the host itself, switches, etc.

1G serving about 20-70 vms per host? (495 vms, 120 always up, 7 hosts)? Do you have a separate network for management and storage, or do both use this 1G interface?

Yes, we have separate networks for management, storage and motion. Storage and motion have 1G each (plus, for storage we use a bond of 2 interfaces in ALB mode (6)).
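For reference, a quick way to sanity-check the storage bond from the host side is the bonding driver's status file (a rough sketch - it assumes the storage bond is named bond1 and the slaves are em1/em2; substitute the real device names):

  # Bonding mode should read "adaptive load balancing (balance-alb)" for mode 6,
  # and every slave should show "MII Status: up" at 1000 Mbps.
  grep -E 'Bonding Mode|Slave Interface|MII Status|Speed|Link Failure Count' /proc/net/bonding/bond1

  # Per-slave error counters (interface names are examples):
  ip -s link show dev em1
  ip -s link show dev em2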
Currently no host has more than 30 VMs at a time.

We've also limited storage in QoS to use 10MB/s and 40 IOPS,

How did you configure this limit?

In the Data Center tab, I chose our DC and a QoS sub-tab appears, just like described here: http://www.ovirt.org/documentation/sla/network-qos/

but this issue still happens, which leads me to wonder whether this is really just an IOPS issue; each host handles about 600 LVs. Could this be an issue too? I should note that the LVM response times are low under normal conditions (~1-2 seconds).

We recommend limiting the number of lvs per vg to 350. If you have 276 disks on ds1, and the disks are using snapshots, you may have too many lvs, which can cause slowdowns in lvm operations. Can you share the output of:

vdsm-tool dump-volume-chains 5de4a000-a9c4-489c-8eee-10368647c413

Ok, right now no VG has more than 350 LVs, so I guess this is not currently the problem. Unfortunately, I ran the vdsm-tool command but it didn't finish or produce any output in about 1 hour, so I guess it hung and I stopped it. If you confirm this is a normal execution time I can leave it running for whatever time it takes.

host5:~# vgs
  VG                                   #PV #LV #SN Attr   VSize   VFree
  0927c050-6fb6-463c-bb37-8b8da641dcd3   1  63   0 wz--n- 499,62g 206,50g
  5de4a000-a9c4-489c-8eee-10368647c413   1 335   0 wz--n-   2,00t 518,62g
  b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8   1 215   0 wz--n-   2,00t 495,25g
  sys_vg                                 1   6   0 wz--n- 136,44g 122,94g

I'm attaching the vdsm.log, engine.log and repoplot PDF;

This is very useful, thanks. Can you also send the vdsm logs and repoplots from the other hosts for the same timeframe?

if someone could give a hint on some additional problems in them and shed some light on the above thoughts I'd be very grateful.

Do you have sar configured on the host? Having sar logs can reveal more info about this timeframe. Do you have information about the amount of io from vms during this timeframe? The amount of io on the storage backend during this timeframe?

Not currently, but I'll be alert for the next time this happens, check some sar commands related to I/O, and provide feedback. Nevertheless, by the time I ran the commands above no machine was unresponsive and I'm still getting such huge execution times. I tried running iotop now and I see there are 2 vgck and vgs processes with a rate of ~500Kb/s each for reading.

root 24446 1.3 0.0 57008 6824 ? D< 09:15 0:00 /usr/sbin/lvm vgs --config devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ 'a|/dev/mapper/36000eb3a4f1acbc20000000000000043|/dev/mapper/36000eb3a4f1acbc200000000000000b9|/dev/mapper/360014056f0dc8930d744f83af8ddc709|', 'r|.*|' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } --noheadings --units b --nosuffix --separator | --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d

I'm also attaching the iostat output for host 5 in a file called iostat-host5.

Interesting: all hosts are checking the same storage, but on host5 and host6 the output of vgs and vgck is 10 times bigger - this explains why they take about 10 times longer to run.

$ ls -lh vgs-*
-rw-r--r--. 1 nsoffer nsoffer 221K Apr 30 08:55 vgs-host1
-rw-r--r--. 1 nsoffer nsoffer 239K Apr 30 08:55 vgs-host2
-rw-r--r--. 1 nsoffer nsoffer 228K Apr 30 08:55 vgs-host3
-rw-r--r--. 1 nsoffer nsoffer 288K Apr 30 08:55 vgs-host4
-rw-r--r--. 1 nsoffer nsoffer 2.2M Apr 30 08:55 vgs-host5
-rw-r--r--. 1 nsoffer nsoffer 2.2M Apr 30 08:55 vgs-host6
-rw-r--r--. 1 nsoffer nsoffer 232K Apr 30 08:55 vgs-host7

$ ls -lh vgck-*
-rw-r--r--. 1 nsoffer nsoffer 238K Apr 30 08:55 vgck-host1
-rw-r--r--. 1 nsoffer nsoffer 254K Apr 30 08:55 vgck-host2
-rw-r--r--. 1 nsoffer nsoffer 244K Apr 30 08:55 vgck-host3
-rw-r--r--. 1 nsoffer nsoffer 298K Apr 30 08:55 vgck-host4
-rw-r--r--. 1 nsoffer nsoffer 2.0M Apr 30 08:55 vgck-host5
-rw-r--r--. 1 nsoffer nsoffer 2.0M Apr 30 08:55 vgck-host6
-rw-r--r--. 1 nsoffer nsoffer 248K Apr 30 08:55 vgck-host7

Can you collect the output of "dmsetup table -v" on hosts 1, 5, and 6?

Nir

Sure, please find attached the results of the command. Once again, for hosts 5 and 6 the size is 10 times bigger than for host 1.

I think the issue causing the slow vgs and vgck commands is stale lvs on hosts 5 and 6. This may be the root cause of the paused vms, but I cannot explain how it is related yet.

Comparing vgck on host1 and host5 - on host1 vgck opens 213 /dev/mapper/* devices, actually 53 unique devices, most of them opened 4 times. On host5, vgck opens 2489 devices (622 unique). This explains why the operation takes about 10 times longer.

Checking the dmsetup table output, we can see that host1 has 53 devices, and host5 622 devices. Checking the device open count, host1 has 15 stale devices (Open count: 0), but host5 has 597 stale devices.

Leaving stale devices is a known issue, but we never had evidence that it causes trouble except warnings in lvm commands.

Please open an ovirt bug about this issue, and include all the files you sent so far in this thread, and all the vdsm logs from host5.

To remove the stale devices, you can do:

for name in `dmsetup info -c -o open,name | awk '/ 0 / {print $2}'`; do
    echo "removing stale device: $name"
    dmsetup remove $name
done

Note that this cleanup is safe only in maintenance mode, since it will also remove the mappings for the vdsm special lvs (ids, leases, inbox, outbox, master, metadata) that are accessed by vdsm regularly and are not stale. This is also not safe if you use hosted engine; the hosted engine agent should maintain its vg.

The safest way to remove stale devices is to restart vdsm - it deactivates all unused lvs (except the special lvs) during startup.

Hm, I ran it without even putting the host into maintenance, but I will next time. Fortunately it seems it didn't do any harm and everything is working. I'll probably write a script that puts the host into maintenance, cleans up or restarts vdsm, and activates it again (this is not a hosted engine) if the number of stale devices is bigger than a threshold.

I would not do anything automatic at this point except monitoring this issue.

For cleanup you have 2 options:
- host is up - restart vdsm
- host in maintenance - remove stale devices

Based on local testing, I think the issue is this:
- When connecting to storage, all lvs become active - this may be a change in lvm during el7 development that we did not notice.
- Each time vdsm is restarted, all active lvs that should not be active are deactivated during vdsm startup.
- When deactivating a storage domain, vdsm does not deactivate the lvs, leaving stale devices.

Can you confirm that it works like this on your system? To test this:

I did these tests on host5. Before switching the host to maintenance, I ran the vgck command and the execution time was: 0m1.347s. Current stale devices at that time were:

5de4a000--a9c4--489c--8eee--10368647c413-master
5de4a000--a9c4--489c--8eee--10368647c413-outbox
0927c050--6fb6--463c--bb37--8b8da641dcd3-metadata
5de4a000--a9c4--489c--8eee--10368647c413-ids
0927c050--6fb6--463c--bb37--8b8da641dcd3-leases
5de4a000--a9c4--489c--8eee--10368647c413-inbox
b13b9eac--1f2e--4a7e--bcd9--49f5f855c3d8-leases
b13b9eac--1f2e--4a7e--bcd9--49f5f855c3d8-metadata
b13b9eac--1f2e--4a7e--bcd9--49f5f855c3d8-ids
b13b9eac--1f2e--4a7e--bcd9--49f5f855c3d8-inbox
5de4a000--a9c4--489c--8eee--10368647c413-leases
0927c050--6fb6--463c--bb37--8b8da641dcd3-master
5de4a000--a9c4--489c--8eee--10368647c413-metadata
b13b9eac--1f2e--4a7e--bcd9--49f5f855c3d8-master
0927c050--6fb6--463c--bb37--8b8da641dcd3-outbox
0927c050--6fb6--463c--bb37--8b8da641dcd3-inbox
b13b9eac--1f2e--4a7e--bcd9--49f5f855c3d8-outbox
0927c050--6fb6--463c--bb37--8b8da641dcd3-ids

(It seems there have been no new stale devices since yesterday after running the loop.)

Expected:
- vdsm accesses master, metadata, inbox and outbox from time to time; they must be active when the host is up
- sanlock accesses ids when the host is up - this lv should have an open count of 1

1. Put one host into maintenance

You should see some stale devices - the special lvs (ids, leases, master, metadata, inbox, outbox), and probably 2 OVF_STORE lvs per vg (128m lvs used to store vm ovf).

Still had the same devices as above.

Expected bug - we don't clean up lvs properly when entering maintenance.

2. Clean up the stale devices

After cleaning the stale devices, these remained:

36000eb3a4f1acbc20000000000000043
360014056f0dc8930d744f83af8ddc709
36000eb3a4f1acbc200000000000000b9

Expected bug. These are the multipath mappings; running multipath -F should remove them. If we clean the lvs properly when deactivating the domain during maintenance, these mappings will be removed automatically.

3. Activate the host

You should see again about 600 stale devices, and lvm commands will probably run much slower, as you reported.

Current stale devices were these:

5de4a000--a9c4--489c--8eee--10368647c413-master
5de4a000--a9c4--489c--8eee--10368647c413-outbox
0927c050--6fb6--463c--bb37--8b8da641dcd3-metadata
0927c050--6fb6--463c--bb37--8b8da641dcd3-leases
5de4a000--a9c4--489c--8eee--10368647c413-inbox
b13b9eac--1f2e--4a7e--bcd9--49f5f855c3d8-leases
b13b9eac--1f2e--4a7e--bcd9--49f5f855c3d8-metadata
b13b9eac--1f2e--4a7e--bcd9--49f5f855c3d8-inbox
5de4a000--a9c4--489c--8eee--10368647c413-leases
0927c050--6fb6--463c--bb37--8b8da641dcd3-master
5de4a000--a9c4--489c--8eee--10368647c413-metadata
b13b9eac--1f2e--4a7e--bcd9--49f5f855c3d8-master
0927c050--6fb6--463c--bb37--8b8da641dcd3-outbox
b13b9eac--1f2e--4a7e--bcd9--49f5f855c3d8-outbox
0927c050--6fb6--463c--bb37--8b8da641dcd3-inbox

Strange, on my test setup (RHEL 7.2) *all* the lvs in the vg get activated. Did you check the output of lvs?

pvscan --cache
lvs

I guess you will see about 600 lvs active (-wi-a-----). Maybe you did not wait until the host connected to all storage domains?
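A quick way to check that is to count the active lvs per vg after refreshing lvmetad (a rough sketch - the awk test looks at the activation flag, the 5th character of lv_attr):

  pvscan --cache
  # Print "<vg name> <number of active lvs>" for each vg.
  lvs --noheadings -o vg_name,lv_attr | awk 'substr($2,5,1) == "a" {n[$1]++} END {for (vg in n) print vg, n[vg]}'

Running it right after activating the host and again a few minutes later should show whether the ~600 lvs only become active once all storage domains are connected.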
Ok, that was the problem: I didn't run the 'pvscan --cache' command before checking the output. After activating the host again, I see 620 LVs. However, in my case all LVs except the special ones are inactive:
host5:~# lvscan | grep -v inactive
  ACTIVE   '/dev/0927c050-6fb6-463c-bb37-8b8da641dcd3/metadata' [512,00 MiB] inherit
  ACTIVE   '/dev/0927c050-6fb6-463c-bb37-8b8da641dcd3/outbox' [128,00 MiB] inherit
  ACTIVE   '/dev/0927c050-6fb6-463c-bb37-8b8da641dcd3/leases' [2,00 GiB] inherit
  ACTIVE   '/dev/0927c050-6fb6-463c-bb37-8b8da641dcd3/ids' [128,00 MiB] inherit
  ACTIVE   '/dev/0927c050-6fb6-463c-bb37-8b8da641dcd3/inbox' [128,00 MiB] inherit
  ACTIVE   '/dev/0927c050-6fb6-463c-bb37-8b8da641dcd3/master' [1,00 GiB] inherit
  ACTIVE   '/dev/sys_vg/audit' [512,00 MiB] inherit
  ACTIVE   '/dev/sys_vg/tmp' [1,00 GiB] inherit
  ACTIVE   '/dev/sys_vg/var' [4,00 GiB] inherit
  ACTIVE   '/dev/sys_vg/home' [1,00 GiB] inherit
  ACTIVE   '/dev/sys_vg/root' [6,00 GiB] inherit
  ACTIVE   '/dev/sys_vg/swap' [1,00 GiB] inherit
  ACTIVE   '/dev/5de4a000-a9c4-489c-8eee-10368647c413/metadata' [512,00 MiB] inherit
  ACTIVE   '/dev/5de4a000-a9c4-489c-8eee-10368647c413/outbox' [128,00 MiB] inherit
  ACTIVE   '/dev/5de4a000-a9c4-489c-8eee-10368647c413/leases' [2,00 GiB] inherit
  ACTIVE   '/dev/5de4a000-a9c4-489c-8eee-10368647c413/ids' [128,00 MiB] inherit
  ACTIVE   '/dev/5de4a000-a9c4-489c-8eee-10368647c413/inbox' [128,00 MiB] inherit
  ACTIVE   '/dev/5de4a000-a9c4-489c-8eee-10368647c413/master' [1,00 GiB] inherit
  ACTIVE   '/dev/b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8/metadata' [512,00 MiB] inherit
  ACTIVE   '/dev/b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8/outbox' [128,00 MiB] inherit
  ACTIVE   '/dev/b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8/leases' [2,00 GiB] inherit
  ACTIVE   '/dev/b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8/ids' [128,00 MiB] inherit
  ACTIVE   '/dev/b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8/inbox' [128,00 MiB] inherit
  ACTIVE   '/dev/b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8/master' [1,00 GiB] inherit

The others are all 'inactive':

  inactive '/dev/b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8/d256c965-25dd-40f2-aeae-4a3ae10befcf' [3,12 GiB] inherit
  inactive '/dev/b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8/6cfafa3a-c335-47bf-805e-bddb850dc914' [3,12 GiB] inherit
  inactive '/dev/b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8/0d60a745-6952-42e7-8bcf-a12d36b2c39e' [5,25 GiB] inherit
  inactive '/dev/b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8/38d94545-b4b0-4384-ae3a-fa26c40e40e3' [5,25 GiB] inherit
I tested this on both hosts (5 and 6) with the same result.
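A cross-check at the device-mapper level (a rough sketch based on the removal loop earlier in the thread; note that the special lvs and the multipath maps can also show an open count of 0 between accesses, so this is only an upper bound on truly stale mappings):

  # List every mapping nothing currently holds open, then count them.
  dmsetup info -c --noheadings -o open,name | awk '$1 == 0 {print $2}'
  dmsetup info -c --noheadings -o open,name | awk '$1 == 0' | wc -l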
4. Restart vdsm

All the regular lvs should be inactive now; no stale devices should exist, except the special lvs (expected).

Same result as above. The strange thing is that now, running the 'vgs' command, I don't see any VGs except the system's:

host5:~# vgs
  VG     #PV #LV #SN Attr   VSize   VFree
  sys_vg   1   6   0 wz--n- 136,44g 122,94g

Vdsm bypasses lvmetad, so all lvm commands run from the shell show incorrect results. You must update lvmetad about changes made by vdsm using:

pvscan --cache
vgs

Or, the vdsm way:

vgs --config 'global {use_lvmetad=0}'
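A tiny convenience wrapper for that (my own sketch, not something vdsm ships): it forces any lvm command to bypass lvmetad the way vdsm does, so the output reflects the on-disk metadata instead of the stale daemon cache:

  # Usage: lvm_direct vgs
  #        lvm_direct lvs 5de4a000-a9c4-489c-8eee-10368647c413
  lvm_direct() {
      sudo lvm "$@" --config 'global { use_lvmetad=0 }'
  }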
(Same here, after using 'pvscan --cache' I can see the VGs).
However, machines running on that host have no issues and storage is working well. vgscan doesn't do the trick either. I tested this on host6 too and the same happens: after reactivating the host I cannot see the VGs with the vgs command - however, everything seems to work quite well. This is CentOS Linux release 7.2.1511, FWIW.

Nir
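For reference, the two cleanup options from earlier in the thread gathered in one place (a sketch to run by hand - per Nir's advice above, only monitor for now rather than automating this):

  # Option 1 - host is Up: restarting vdsm deactivates unused lvs during startup.
  systemctl restart vdsmd

  # Option 2 - host is in maintenance (and not a hosted-engine host):
  # remove the stale mappings, then flush the now-unused multipath maps.
  for name in $(dmsetup info -c --noheadings -o open,name | awk '$1 == 0 {print $2}'); do
      echo "removing stale device: $name"
      dmsetup remove "$name"
  done
  multipath -F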
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users