Re: [ovirt-users] VMs becoming non-responsive sporadically

2016-04-30 Thread Nir Soffer
On Sat, Apr 30, 2016 at 11:33 AM, Nicolás  wrote:
> Hi Nir,
>
> On 29/04/16 at 22:34, Nir Soffer wrote:
>>
>> On Fri, Apr 29, 2016 at 9:17 PM,   wrote:
>>>
>>> Hi,
>>>
>>> We're running oVirt 3.6.5.3-1 and lately we're experiencing some issues
>>> with some VMs being paused because they're marked as non-responsive.
>>> Mostly, after a few seconds they recover, but we want to debug this
>>> problem precisely so we can fix it for good.
>>>
>>> Our scenario is the following:
>>>
>>> ~495 VMs, of which ~120 are constantly up
>>> 3 datastores, all of them iSCSI-based:
>>>    * ds1: 2T, currently has 276 disks
>>>    * ds2: 2T, currently has 179 disks
>>>    * ds3: 500G, currently has 65 disks
>>> 7 hosts: all have mostly the same hardware. CPU and memory are currently
>>> very lightly used (< 10%).
>>>
>>> ds1 and ds2 are physically the same backend, which exports two 2TB
>>> volumes.
>>> ds3 is a different storage backend to which we're currently migrating
>>> some disks from ds1 and ds2.
>>
>> What is the storage backend behind ds1 and ds2?
>
>
> The storage backend for ds1 and ds2 is the iSCSI-based HP LeftHand P4000 G2.
>
>>> Usually, when VMs become unresponsive, the whole host where they run gets
>>> unresponsive too, so that gives a hint about the problem; my bet is that
>>> the culprit is somewhere on the host side and not on the VM side.
>>
>> Probably the VM became unresponsive because the connection to the host was
>> lost.
>
>
> I forgot to mention that, less commonly, we have situations where the host
> doesn't become unresponsive but the VMs on it do, and they never become
> responsive again, so we have to forcibly power them off and start them
> on a different host. In this case the connection with the host is never
> lost (so basically the host is Up, but any VM running on it is
> unresponsive).
>
>
>>> When that
>>> happens, the host itself gets non-responsive and only recoverable after
>>> reboot, since it's unable to reconnect.
>>
>> Piotr, can you check the engine log and explain why the host is not
>> reconnected?
>>> I must say this is not specific to this oVirt version; the same happened
>>> when we were using v3.6.4, and it's also worth mentioning that we've not
>>> made any configuration changes and everything had been working quite well
>>> for a long time.
>>>
>>> We were monitoring the performance of the physical backend behind ds1 and
>>> ds2, and we suspect we've run out of IOPS, since we're reaching the
>>> maximum specified by the manufacturer; probably at certain times the host
>>> cannot complete a storage operation within some time limit and marks the
>>> VMs as unresponsive. That's why we've set up ds3 and are migrating ds1
>>> and ds2 to ds3. When we run out of space on ds3 we'll create more small
>>> volumes to keep migrating.
>>>
>>> On the host side, when this happens, we've run repoplot on the vdsm log
>>> and I'm attaching the result. Clearly there's a *huge* LVM response time
>>> (~30 secs.).
>>
>> Indeed the log shows very slow vgck and vgs commands - these are called
>> every 5 minutes to check the VG health and refresh the vdsm lvm cache.
>>
>> 1. starting vgck
>>
>> Thread-96::DEBUG::2016-04-29
>> 13:17:48,682::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset
>> --cpu-list 0-23 /usr/bin/sudo -n /usr/sbin/lvm vgck --config ' devices
>> { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1
>> write_cache_state=0 disable_after_error_count=3 filter = [
>> '\''a|/dev/mapper/36000eb3a4f1acbc20043|'\
>> '', '\''r|.*|'\'' ] }  global {  locking_type=1
>> prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {
>> retain_min = 50  retain_days = 0 } ' 5de4a000-a9c4-489c-8eee-10368647c413 (cwd None)
>>
>> 2. vgck ends after 55 seconds
>>
>> Thread-96::DEBUG::2016-04-29
>> 13:18:43,173::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '
>> WARNING: lvmetad is running but disabled. Restart lvmetad before
>> enabling it!\n'; <rc> = 0
>>
>> 3. starting vgs
>>
>> Thread-96::DEBUG::2016-04-29
>> 13:17:11,963::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset
>> --cpu-list 0-23 /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices
>> { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1
>> write_cache_state=0 disable_after_error_count=3 filter = [
>> '\''a|/dev/mapper/36000eb3a4f1acbc20043|/dev/mapper/36000eb3a4f1acbc200b9|/dev/mapper/360014056f0dc8930d744f83af8ddc709|/dev/mapper/WDC_WD5003ABYZ-011FA0_WD-WMAYP0J73DU6|'\'',
>> '\''r|.*|'\'' ] }  global {
>>   locking_type=1  prioritise_write_locks=1  wait_for_locks=1
>> use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } '
>> --noheadings --units b --nosuffix --separator '|
>> ' --ignoreskippedcluster -o
>>
>> uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name
>> 5de4a000-a9c4-489c-8eee-10368647c413

Re: [ovirt-users] VMs becoming non-responsive sporadically

2016-04-29 Thread Nir Soffer
On Fri, Apr 29, 2016 at 9:17 PM,   wrote:
> Hi,
>
> We're running oVirt 3.6.5.3-1 and lately we're experiencing some issues with
> some VMs being paused because they're marked as non-responsive. Mostly,
> after a few seconds they recover, but we want to debug this problem
> precisely so we can fix it for good.
>
> Our scenario is the following:
>
> ~495 VMs, of which ~120 are constantly up
> 3 datastores, all of them iSCSI-based:
>   * ds1: 2T, currently has 276 disks
>   * ds2: 2T, currently has 179 disks
>   * ds3: 500G, currently has 65 disks
> 7 hosts: all have mostly the same hardware. CPU and memory are currently
> very lightly used (< 10%).
>
>   ds1 and ds2 are physically the same backend, which exports two 2TB volumes.
> ds3 is a different storage backend to which we're currently migrating some
> disks from ds1 and ds2.

What is the storage backend behind ds1 and ds2?

>
> Usually, when VMs become unresponsive, the whole host where they run gets
> unresponsive too, so that gives a hint about the problem; my bet is that
> the culprit is somewhere on the host side and not on the VM side.

Probably the VM became unresponsive because the connection to the host was lost.

> When that
> happens, the host itself gets non-responsive and only recoverable after
> reboot, since it's unable to reconnect.

Piotr, can you check the engine log and explain why the host is not reconnected?
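
For reference, one quick way to see what the engine recorded around such an
episode is to grep its log for host monitoring events. This is only a sketch:
the path is the default oVirt engine log location, the search strings are
approximations of the event text, and the host name is a placeholder, so
adjust them to what your engine actually writes.

# Sketch: pull host connectivity/monitoring events from the engine log
# around the time a host was marked non-responsive.
HOST=myhost1    # placeholder host name
grep -iE 'not responding|non ?responsive|vds.*connect' \
    /var/log/ovirt-engine/engine.log | grep -i "$HOST" | tail -n 50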

> I must say this is not specific to this oVirt version; the same happened
> when we were using v3.6.4, and it's also worth mentioning that we've not
> made any configuration changes and everything had been working quite well
> for a long time.
>
> We were monitoring the performance of the physical backend behind ds1 and
> ds2, and we suspect we've run out of IOPS, since we're reaching the maximum
> specified by the manufacturer; probably at certain times the host cannot
> complete a storage operation within some time limit and marks the VMs as
> unresponsive. That's why we've set up ds3 and are migrating ds1 and ds2 to
> ds3. When we run out of space on ds3 we'll create more small volumes to
> keep migrating.
>
> On the host side, when this happens, we've run repoplot on the vdsm log and
> I'm attaching the result. Clearly there's a *huge* LVM response time (~30
> secs.).
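
Since the paragraphs above suspect the backend has simply run out of IOPS, it
is also worth watching the latency of the iSCSI multipath devices directly on
a host while VMs are pausing; high await/%util there would support that
theory. A minimal sketch (the WWIDs are the ones appearing in the lvm filters
below, substitute your own):

# Map the iSCSI LUN WWIDs to their dm-N kernel names
multipath -ll

# Extended, per-device I/O statistics every 10 seconds; watch the dm-N
# rows backing ds1/ds2 for await (or r_await/w_await) and %util
iostat -xmt 10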

Indeed the log shows very slow vgck and vgs commands - these are called every
5 minutes to check the VG health and refresh the vdsm lvm cache.

1. starting vgck

Thread-96::DEBUG::2016-04-29
13:17:48,682::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset
--cpu-list 0-23 /usr/bin/sudo -n /usr/sbin/lvm vgck --config ' devices
{ preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1
write_cache_state=0 disable_after_error_count=3 filter = [
'\''a|/dev/mapper/36000eb3a4f1acbc20043|'\
'', '\''r|.*|'\'' ] }  global {  locking_type=1
prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {
retain_min = 50  retain_days = 0 } ' 5de4a000-a9c4-489c-8eee-10368647c413 (cwd None)

2. vgck ends after 55 seconds

Thread-96::DEBUG::2016-04-29
13:18:43,173::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '
WARNING: lvmetad is running but disabled. Restart lvmetad before
enabling it!\n'; <rc> = 0

3. starting vgs

Thread-96::DEBUG::2016-04-29
13:17:11,963::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset
--cpu-list 0-23 /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices
{ preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1
write_cache_state=0 disable_after_error_count=3 filter = [
'\''a|/dev/mapper/36000eb3a4f1acbc20043|/dev/mapper/36000eb3a4f1acbc200b9|/dev/mapper/360014056f0dc8930d744f83af8ddc709|/dev/mapper/WDC_WD5003ABYZ-011FA0_WD-WMAYP0J73DU6|'\'',
'\''r|.*|'\'' ] }  global {
 locking_type=1  prioritise_write_locks=1  wait_for_locks=1
use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } '
--noheadings --units b --nosuffix --separator '|
' --ignoreskippedcluster -o
uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name
5de4a000-a9c4-489c-8eee-10368647c413 (cwd None)

4. vgs finished after 37 seconds

Thread-96::DEBUG::2016-04-29
13:17:48,680::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '
WARNING: lvmetad is running but disabled. Restart lvmetad before
enabling it!\n'; <rc> = 0
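
The same pairing of a "(cmd)" start line with the following SUCCESS line can
be done mechanically over the whole log. A rough sketch, assuming the vdsm.log
line format shown above and the default log path; it matches start and end
lines per thread, so treat the output as indicative only:

# Sketch: report how long each lvm vgck/vgs run took, pairing the "(cmd)"
# start line with the next SUCCESS/FAILED line from the same thread.
awk -F'::' '
function secs(ts,  d, t) {                 # "2016-04-29 13:17:48,682" -> seconds
    split(ts, d, " "); split(d[2], t, "[:,]")
    return t[1]*3600 + t[2]*60 + t[3] + t[4]/1000
}
/Storage\.Misc\.excCmd/ && /\(cmd\)/ && /lvm (vgck|vgs)/ {
    start[$1] = secs($3); cmd[$1] = ($0 ~ /vgck/) ? "vgck" : "vgs"; next
}
/Storage\.Misc\.excCmd/ && (/SUCCESS/ || /FAILED/) && ($1 in start) {
    printf "%s  %s  %-4s  %5.1f s\n", $3, $1, cmd[$1], secs($3) - start[$1]
    delete start[$1]
}' /var/log/vdsm/vdsm.log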

Zdenek, how do you suggest debugging these slow lvm commands?

Can you run the following commands on a host in trouble, and on some other
hosts in the same timeframe?

time vgck - --config ' devices { filter =
['\''a|/dev/mapper/36000eb3a4f1acbc20043|'\'',
'\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1
wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50
retain_days = 0 } ' 5de4a000-a9c4-489c-8eee-10368647c413

time vgs - --config ' global { locking_type=1
prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {
retain_min = 50  retain_days = 0 } '
5de4a000-a9c4-489c-8eee-10368647c413
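
To compare hosts, those two commands can be wrapped in a small loop over ssh.
A sketch only: the host names are placeholders, the config string is trimmed
to the essential parts of the one above, and it assumes root ssh access to
the hosts:

# Sketch: time the same vgck/vgs checks on several hosts for comparison.
VG=5de4a000-a9c4-489c-8eee-10368647c413
CONF='devices { filter = [ "a|/dev/mapper/36000eb3a4f1acbc20043|", "r|.*|" ] }
      global { locking_type=1 use_lvmetad=0 }'
for h in host1 host2 host3; do          # placeholder host names
    echo "== $h =="
    ssh "root@$h" "time lvm vgck --config '$CONF' $VG;
                   time lvm vgs  --config '$CONF' $VG"
done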

Note that I