Re: [openstack-dev] [gate][neutron][infra] tempest jobs timing out due to general sluggishness of the node?

2017-02-15 Thread Ihar Hrachyshka
On Fri, Feb 10, 2017 at 2:48 PM, Clark Boylan  wrote:
> On Fri, Feb 10, 2017, at 10:54 AM, Ihar Hrachyshka wrote:
>> Oh nice, I hadn't seen that. It does give the (virtualized) CPU model
>> types. I don't see a clear correlation between models and
>> failures/test times though. We are of course missing some details,
>> like which flags are emulated, but I doubt they would give us a clue.
>
> Yes, this will still be the virtualized CPU. Also, the lack of CPU flag
> info is a regression compared to the old method of collecting this data.
> If we think that info could be useful, we should find a way to add it
> back in (maybe just add back the cat /proc/cpuinfo step in
> devstack-gate).

To update, I posted a patch that logs /proc/cpuinfo using the new Ansible
data-gathering playbook: https://review.openstack.org/#/c/433949/
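
For reference, the actual change is an Ansible playbook, but conceptually
it boils down to a host-setup step like the following (a sketch only; the
paths are illustrative, not the ones the patch uses):

  # Hypothetical logging step: dump raw CPU info into the job's log
  # directory so it ends up in the uploaded logs.
  LOG_DIR=${LOG_DIR:-/opt/stack/logs}   # assumed location, adjust per job
  mkdir -p "$LOG_DIR"
  cat /proc/cpuinfo > "$LOG_DIR/cpuinfo.txt"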

>
>> It would be interesting to know the overcommit/system load for each
>> hypervisor affected. But I assume we don't have access to that info,
>> right?
>
> Correct; with the exception of infracloud and OSIC (if we ask nicely), I
> don't expect it will be very easy to get this sort of information from
> our clouds.
>
> For infracloud, a random sample of a hypervisor shows that it has 24 real
> cores. In the vanilla region we are limited to 126 VM instances with
> 8 vCPUs each. We have ~41 hypervisors, which is just over 3 VM instances
> per hypervisor. 24 real CPUs / 8 vCPUs = 3 VM instances without
> oversubscribing. So we are barely oversubscribing, if at all.

Ack, thanks for checking; we will need to find some other hypothesis then.

For the record, Clark and I discussed the idea of adding a synthetic
benchmark at the start of every job (before our software is actually
installed on the node), to get some easily comparable performance
numbers between runs that are guaranteed to be unaffected by the
OpenStack installation. Clark had reservations, though, because the test
would be synthetic and hence not representative of real life, and
because we already have the ./stack.sh run time, which can be used as a
crude benchmark. Of course, ./stack.sh depends on lots of externalities,
so it's not as precise as a targeted benchmark would be, but Clark feels
the latter would be of limited use.
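
To make the idea concrete, here is the kind of probe I had in mind; a
rough sketch only, nothing like this is wired into devstack-gate today,
and the sizes and paths are arbitrary:

  #!/bin/bash
  # Crude CPU probe: time a fixed amount of pure-Python arithmetic.
  cpu_secs=$( { time -p python -c 'sum(i*i for i in range(10**7))'; } 2>&1 \
              | awk '/^real/ {print $2}' )
  # Crude disk probe: write 512MB with fdatasync so flush cost is included.
  dd if=/dev/zero of=/tmp/ddtest bs=1M count=512 conv=fdatasync 2>/tmp/ddout
  rm -f /tmp/ddtest
  echo "cpu_probe_seconds=${cpu_secs}"
  grep copied /tmp/ddout   # dd prints bytes copied and throughput here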

Apart from that, it's not clear where to go next. I doubt cpuinfo dump
will reveal anything insane in failing jobs, so other ideas are
welcome.

Ihar



Re: [openstack-dev] [gate][neutron][infra] tempest jobs timing out due to general sluggishness of the node?

2017-02-10 Thread Clark Boylan
On Fri, Feb 10, 2017, at 10:54 AM, Ihar Hrachyshka wrote:
> Oh nice, I hadn't seen that. It does give the (virtualized) CPU model
> types. I don't see a clear correlation between models and
> failures/test times though. We are of course missing some details,
> like which flags are emulated, but I doubt they would give us a clue.

Yes, this will still be the virtualized CPU. Also, the lack of CPU flag
info is a regression compared to the old method of collecting this data.
If we think that info could be useful, we should find a way to add it
back in (maybe just add back the cat /proc/cpuinfo step in
devstack-gate).
 
> It would be interesting to know the overcommit/system load for each
> hypervisor affected. But I assume we don't have access to that info,
> right?

Correct; with the exception of infracloud and OSIC (if we ask nicely), I
don't expect it will be very easy to get this sort of information from
our clouds.

For infracloud, a random sample of a hypervisor shows that it has 24 real
cores. In the vanilla region we are limited to 126 VM instances with
8 vCPUs each. We have ~41 hypervisors, which is just over 3 VM instances
per hypervisor. 24 real CPUs / 8 vCPUs = 3 VM instances without
oversubscribing. So we are barely oversubscribing, if at all.
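
Spelling the arithmetic out (just a back-of-the-envelope check against the
numbers above):

  # 126 instance cap spread over ~41 hypervisors, vs. 24 real cores split
  # into 8-vCPU guests:
  python -c 'print(126 / 41.0, 24 / 8)'   # ~3.07 scheduled vs. 3 that fit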

Clark



Re: [openstack-dev] [gate][neutron][infra] tempest jobs timing out due to general sluggishness of the node?

2017-02-10 Thread Ihar Hrachyshka
Oh nice, I hadn't seen that. It does give the (virtualized) CPU model
types. I don't see a clear correlation between models and
failures/test times though. We are of course missing some details,
like which flags are emulated, but I doubt they would give us a clue.

It would be interesting to know the overcommit/system load for each
hypervisor affected. But I assume we don't have access to that info,
right?

Ihar

On Fri, Feb 10, 2017 at 8:39 AM, Clark Boylan  wrote:
> On Fri, Feb 10, 2017, at 08:21 AM, Morales, Victor wrote:
>>
>> On 2/9/17, 10:59 PM, "Ihar Hrachyshka"  wrote:
>>
>> >Hi all,
>> >
>> >I noticed lately a number of job failures in neutron gate that all
>> >result in job timeouts. I describe
>> >gate-tempest-dsvm-neutron-dvr-ubuntu-xenial job below, though I see
>> >timeouts happening in other jobs too.
>> >
>> >The failure mode is that all operations (./stack.sh and each tempest
>> >test) take significantly more time (roughly 50% to 150% more), which
>> >results in the job timeout being triggered. An example of what I mean
>> >can be found in [1].
>> >
>> >A good run usually takes ~20 minutes to stack up devstack and then ~40
>> >minutes to pass the full suite; a bad run usually takes ~30 minutes for
>> >./stack.sh and then 1:20h+ until it is killed due to timeout.
>> >
>> >It affects different clouds (we see rax, internap, infracloud-vanilla
>> >and ovh jobs affected; we haven't seen osic though). It can't be e.g.
>> >slow pypi or apt mirrors, because then we would see a slowdown in the
>> >./stack.sh phase only.
>> >
>> >We can't be sure that the CPUs are the same, and devstack does not seem
>> >to dump /proc/cpuinfo anywhere (in the end, it's all virtual, so not sure
>>
>> I don't think that logging this information would be useful, mainly
>> because it depends on enabling *host-passthrough* [3] in the
>> nova-compute configuration of public cloud providers.
>
> While this is true, we do log it anyway (it was useful for sorting out
> live-migration CPU flag inconsistencies). For example:
> http://logs.openstack.org/95/429095/2/check/gate-tempest-dsvm-neutron-dvr-ubuntu-xenial/35aa22f/logs/devstack-gate-setup-host.txt.gz
> and grep for 'cpu'.
>
> Note that we used to grab the full /proc/cpuinfo contents, but now it's
> just whatever Ansible reports back in its fact list there.
>
> Clark
>



Re: [openstack-dev] [gate][neutron][infra] tempest jobs timing out due to general sluggishness of the node?

2017-02-10 Thread Clark Boylan
On Fri, Feb 10, 2017, at 08:21 AM, Morales, Victor wrote:
> 
> On 2/9/17, 10:59 PM, "Ihar Hrachyshka"  wrote:
> 
> >Hi all,
> >
> >I noticed lately a number of job failures in neutron gate that all
> >result in job timeouts. I describe
> >gate-tempest-dsvm-neutron-dvr-ubuntu-xenial job below, though I see
> >timeouts happening in other jobs too.
> >
> >The failure mode is that all operations (./stack.sh and each tempest
> >test) take significantly more time (roughly 50% to 150% more), which
> >results in the job timeout being triggered. An example of what I mean
> >can be found in [1].
> >
> >A good run usually takes ~20 minutes to stack up devstack and then ~40
> >minutes to pass the full suite; a bad run usually takes ~30 minutes for
> >./stack.sh and then 1:20h+ until it is killed due to timeout.
> >
> >It affects different clouds (we see rax, internap, infracloud-vanilla
> >and ovh jobs affected; we haven't seen osic though). It can't be e.g.
> >slow pypi or apt mirrors, because then we would see a slowdown in the
> >./stack.sh phase only.
> >
> >We can't be sure that the CPUs are the same, and devstack does not seem
> >to dump /proc/cpuinfo anywhere (in the end, it's all virtual, so not sure
> 
> I don't think that logging this information would be useful, mainly
> because it depends on enabling *host-passthrough* [3] in the
> nova-compute configuration of public cloud providers.

While this is true, we do log it anyway (it was useful for sorting out
live-migration CPU flag inconsistencies). For example:
http://logs.openstack.org/95/429095/2/check/gate-tempest-dsvm-neutron-dvr-ubuntu-xenial/35aa22f/logs/devstack-gate-setup-host.txt.gz
and grep for 'cpu'.

Note that we used to grab the full /proc/cpuinfo contents, but now it's
just whatever Ansible reports back in its fact list there.
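
If anyone wants to see exactly what ends up in that fact list, the setup
module can be filtered down to the processor facts (assuming Ansible is
available on the node; the filter pattern is just a convenience):

  # Show only the processor-related facts Ansible gathers for localhost.
  ansible localhost -m setup -a 'filter=ansible_processor*'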

Clark



Re: [openstack-dev] [gate][neutron][infra] tempest jobs timing out due to general sluggishness of the node?

2017-02-10 Thread Morales, Victor

On 2/9/17, 10:59 PM, "Ihar Hrachyshka"  wrote:

>Hi all,
>
>I noticed lately a number of job failures in neutron gate that all
>result in job timeouts. I describe
>gate-tempest-dsvm-neutron-dvr-ubuntu-xenial job below, though I see
>timeouts happening in other jobs too.
>
>The failure mode is that all operations (./stack.sh and each tempest
>test) take significantly more time (roughly 50% to 150% more), which
>results in the job timeout being triggered. An example of what I mean
>can be found in [1].
>
>A good run usually takes ~20 minutes to stack up devstack and then ~40
>minutes to pass the full suite; a bad run usually takes ~30 minutes for
>./stack.sh and then 1:20h+ until it is killed due to timeout.
>
>It affects different clouds (we see rax, internap, infracloud-vanilla
>and ovh jobs affected; we haven't seen osic though). It can't be e.g.
>slow pypi or apt mirrors, because then we would see a slowdown in the
>./stack.sh phase only.
>
>We can't be sure that the CPUs are the same, and devstack does not seem
>to dump /proc/cpuinfo anywhere (in the end, it's all virtual, so not sure

I don't think that logging this information would be useful, mainly
because it depends on enabling *host-passthrough* [3] in the nova-compute
configuration of public cloud providers.
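
(For context, that is the libvirt cpu_mode knob in nova.conf; a sketch of
the relevant setting, just to illustrate what host-passthrough means
here:)

  [libvirt]
  # Only host-passthrough exposes the hypervisor's real CPU model and
  # flags to the guest; host-model or a named custom model hides them.
  cpu_mode = host-passthrough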

>if it would help anyway). Nor do we have a way to tell whether the
>slowness could be a result of adherence to RFC1149. ;)
>
>We discussed the matter in the neutron channel [2], though we couldn't
>figure out the culprit or where to go next. At this point we assume it's
>not neutron's fault, and we hope others (infra?) may have suggestions on
>where to look.
>
>[1] 
>http://logs.openstack.org/95/429095/2/check/gate-tempest-dsvm-neutron-dvr-ubuntu-xenial/35aa22f/console.html#_2017-02-09_04_47_12_874550
>[2] 
>http://eavesdrop.openstack.org/irclogs/%23openstack-neutron/%23openstack-neutron.2017-02-10.log.html#t2017-02-10T04:06:01
[3] 
http://docs.openstack.org/newton/config-reference/compute/hypervisor-kvm.html 

>
>Thanks,
>Ihar
>


Re: [openstack-dev] [gate][neutron][infra] tempest jobs timing out due to general sluggishness of the node?

2017-02-10 Thread Attila Fazekas
I wonder, can we switch to CINDER_ISCSI_HELPER="lioadm"?
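
(If we wanted to try that, I believe it is just a devstack knob; a sketch
of the local.conf change, assuming lib/cinder still honors it:)

  [[local|localrc]]
  # Use LIO via targetcli instead of tgtd for the cinder reference driver,
  # which would take tgtd and its debug logging out of the picture.
  CINDER_ISCSI_HELPER=lioadm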

On Fri, Feb 10, 2017 at 9:17 AM, Miguel Angel Ajo Pelayo <majop...@redhat.com> wrote:

> I believe those are traces left by the cinder reference implementation
> setting a very high debug level on tgtd. I'm not sure whether that's
> related or the culprit at all (probably the culprit is a mix of things).
>
> I wonder if we could disable such verbosity on tgtd; it is certainly
> going to slow things down.
>
> On Fri, Feb 10, 2017 at 9:07 AM, Antonio Ojea  wrote:
>
>> I guess it's an infra issue, specifically related to the storage or the
>> network that provides the storage.
>>
>> If you look at the syslog file [1], there are a lot of entries like this:
>>
>> Feb 09 04:20:42 ubuntu-xenial-rax-ord-7193667 tgtd[8542]: tgtd: iscsi_task_tx_start(2024) no more data
>> Feb 09 04:20:42 ubuntu-xenial-rax-ord-7193667 tgtd[8542]: tgtd: iscsi_task_tx_start(1996) found a task 71 131072 0 0
>> Feb 09 04:20:42 ubuntu-xenial-rax-ord-7193667 tgtd[8542]: tgtd: iscsi_data_rsp_build(1136) 131072 131072 0 26214471
>> Feb 09 04:20:42 ubuntu-xenial-rax-ord-7193667 tgtd[8542]: tgtd: __cmd_done(1281) (nil) 0x2563000 0 131072
>>
>> grep tgtd syslog.txt.gz| wc
>>   139602 1710808 15699432
>>
>> [1] http://logs.openstack.org/95/429095/2/check/gate-tempest-dsvm-neutron-dvr-ubuntu-xenial/35aa22f/logs/syslog.txt.gz
>>
>>
>>
>> On Fri, Feb 10, 2017 at 5:59 AM, Ihar Hrachyshka 
>> wrote:
>>
>>> Hi all,
>>>
>>> I noticed lately a number of job failures in neutron gate that all
>>> result in job timeouts. I describe
>>> gate-tempest-dsvm-neutron-dvr-ubuntu-xenial job below, though I see
>>> timeouts happening in other jobs too.
>>>
>>> The failure mode is that all operations (./stack.sh and each tempest
>>> test) take significantly more time (roughly 50% to 150% more), which
>>> results in the job timeout being triggered. An example of what I mean
>>> can be found in [1].
>>>
>>> A good run usually takes ~20 minutes to stack up devstack and then ~40
>>> minutes to pass the full suite; a bad run usually takes ~30 minutes for
>>> ./stack.sh and then 1:20h+ until it is killed due to timeout.
>>>
>>> It affects different clouds (we see rax, internap, infracloud-vanilla
>>> and ovh jobs affected; we haven't seen osic though). It can't be e.g.
>>> slow pypi or apt mirrors, because then we would see a slowdown in the
>>> ./stack.sh phase only.
>>>
>>> We can't be sure that the CPUs are the same, and devstack does not seem
>>> to dump /proc/cpuinfo anywhere (in the end, it's all virtual, so not
>>> sure if it would help anyway). Nor do we have a way to tell whether the
>>> slowness could be a result of adherence to RFC1149. ;)
>>>
>>> We discussed the matter in the neutron channel [2], though we couldn't
>>> figure out the culprit or where to go next. At this point we assume
>>> it's not neutron's fault, and we hope others (infra?) may have
>>> suggestions on where to look.
>>>
>>> [1] http://logs.openstack.org/95/429095/2/check/gate-tempest-dsvm-neutron-dvr-ubuntu-xenial/35aa22f/console.html#_2017-02-09_04_47_12_874550
>>> [2] http://eavesdrop.openstack.org/irclogs/%23openstack-neutron/%23openstack-neutron.2017-02-10.log.html#t2017-02-10T04:06:01
>>>
>>> Thanks,
>>> Ihar
>>>
>>
>>
>>
>
>


Re: [openstack-dev] [gate][neutron][infra] tempest jobs timing out due to general sluggishness of the node?

2017-02-10 Thread Miguel Angel Ajo Pelayo
I believe those are traces left by the cinder reference implementation
setting a very high debug level on tgtd. I'm not sure whether that's
related or the culprit at all (probably the culprit is a mix of things).

I wonder if we could disable such verbosity on tgtd; it is certainly
going to slow things down.
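
(As a first step, it might be worth checking on an affected or held node
how tgtd was actually started and what configuration it picked up; just
an idea, and the paths are the usual Ubuntu ones, so they may need
adjusting:)

  # Show tgtd's full command line, to see whether a debug flag was passed.
  ps -o args= -C tgtd
  # And list the tgt configuration that cinder/devstack dropped in.
  ls -l /etc/tgt/conf.d/ && cat /etc/tgt/targets.conf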

On Fri, Feb 10, 2017 at 9:07 AM, Antonio Ojea  wrote:

> I guess it's an infra issue, specifically related to the storage or the
> network that provides the storage.
>
> If you look at the syslog file [1], there are a lot of entries like this:
>
> Feb 09 04:20:42 ubuntu-xenial-rax-ord-7193667 tgtd[8542]: tgtd: iscsi_task_tx_start(2024) no more data
> Feb 09 04:20:42 ubuntu-xenial-rax-ord-7193667 tgtd[8542]: tgtd: iscsi_task_tx_start(1996) found a task 71 131072 0 0
> Feb 09 04:20:42 ubuntu-xenial-rax-ord-7193667 tgtd[8542]: tgtd: iscsi_data_rsp_build(1136) 131072 131072 0 26214471
> Feb 09 04:20:42 ubuntu-xenial-rax-ord-7193667 tgtd[8542]: tgtd: __cmd_done(1281) (nil) 0x2563000 0 131072
>
> grep tgtd syslog.txt.gz| wc
>   139602 1710808 15699432
>
> [1] http://logs.openstack.org/95/429095/2/check/gate-tempest-dsvm-neutron-dvr-ubuntu-xenial/35aa22f/logs/syslog.txt.gz
>
>
>
> On Fri, Feb 10, 2017 at 5:59 AM, Ihar Hrachyshka 
> wrote:
>
>> Hi all,
>>
>> I noticed lately a number of job failures in neutron gate that all
>> result in job timeouts. I describe
>> gate-tempest-dsvm-neutron-dvr-ubuntu-xenial job below, though I see
>> timeouts happening in other jobs too.
>>
>> The failure mode is that all operations (./stack.sh and each tempest
>> test) take significantly more time (roughly 50% to 150% more), which
>> results in the job timeout being triggered. An example of what I mean
>> can be found in [1].
>>
>> A good run usually takes ~20 minutes to stack up devstack and then ~40
>> minutes to pass the full suite; a bad run usually takes ~30 minutes for
>> ./stack.sh and then 1:20h+ until it is killed due to timeout.
>>
>> It affects different clouds (we see rax, internap, infracloud-vanilla
>> and ovh jobs affected; we haven't seen osic though). It can't be e.g.
>> slow pypi or apt mirrors, because then we would see a slowdown in the
>> ./stack.sh phase only.
>>
>> We can't be sure that the CPUs are the same, and devstack does not seem
>> to dump /proc/cpuinfo anywhere (in the end, it's all virtual, so not
>> sure if it would help anyway). Nor do we have a way to tell whether the
>> slowness could be a result of adherence to RFC1149. ;)
>>
>> We discussed the matter in the neutron channel [2], though we couldn't
>> figure out the culprit or where to go next. At this point we assume
>> it's not neutron's fault, and we hope others (infra?) may have
>> suggestions on where to look.
>>
>> [1] http://logs.openstack.org/95/429095/2/check/gate-tempest-dsvm-neutron-dvr-ubuntu-xenial/35aa22f/console.html#_2017-02-09_04_47_12_874550
>> [2] http://eavesdrop.openstack.org/irclogs/%23openstack-neutron/%23openstack-neutron.2017-02-10.log.html#t2017-02-10T04:06:01
>>
>> Thanks,
>> Ihar
>>
>>
>
>
>


Re: [openstack-dev] [gate][neutron][infra] tempest jobs timing out due to general sluggishness of the node?

2017-02-10 Thread Antonio Ojea
I guess it's an infra issue, specifically related to the storage or the
network that provides the storage.

If you look at the syslog file [1], there are a lot of entries like this:

Feb 09 04:20:42 ubuntu-xenial-rax-ord-7193667 tgtd[8542]: tgtd: iscsi_task_tx_start(2024) no more data
Feb 09 04:20:42 ubuntu-xenial-rax-ord-7193667 tgtd[8542]: tgtd: iscsi_task_tx_start(1996) found a task 71 131072 0 0
Feb 09 04:20:42 ubuntu-xenial-rax-ord-7193667 tgtd[8542]: tgtd: iscsi_data_rsp_build(1136) 131072 131072 0 26214471
Feb 09 04:20:42 ubuntu-xenial-rax-ord-7193667 tgtd[8542]: tgtd: __cmd_done(1281) (nil) 0x2563000 0 131072

grep tgtd syslog.txt.gz| wc
  139602 1710808 15699432
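
To see whether the logging volume lines up with the slow phases of the
job, the messages can also be bucketed per minute; a quick sketch against
the same file (the timestamp is the third space-separated field in these
logs):

  # Count tgtd syslog messages per minute.
  zgrep tgtd syslog.txt.gz | cut -d' ' -f3 | cut -d: -f1,2 | sort | uniq -c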

[1]
http://logs.openstack.org/95/429095/2/check/gate-tempest-dsvm-neutron-dvr-ubuntu-xenial/35aa22f/logs/syslog.txt.gz



On Fri, Feb 10, 2017 at 5:59 AM, Ihar Hrachyshka 
wrote:

> Hi all,
>
> I noticed lately a number of job failures in neutron gate that all
> result in job timeouts. I describe
> gate-tempest-dsvm-neutron-dvr-ubuntu-xenial job below, though I see
> timeouts happening in other jobs too.
>
> The failure mode is that all operations (./stack.sh and each tempest
> test) take significantly more time (roughly 50% to 150% more), which
> results in the job timeout being triggered. An example of what I mean
> can be found in [1].
>
> A good run usually takes ~20 minutes to stack up devstack and then ~40
> minutes to pass the full suite; a bad run usually takes ~30 minutes for
> ./stack.sh and then 1:20h+ until it is killed due to timeout.
>
> It affects different clouds (we see rax, internap, infracloud-vanilla
> and ovh jobs affected; we haven't seen osic though). It can't be e.g.
> slow pypi or apt mirrors, because then we would see a slowdown in the
> ./stack.sh phase only.
>
> We can't be sure that the CPUs are the same, and devstack does not seem
> to dump /proc/cpuinfo anywhere (in the end, it's all virtual, so not
> sure if it would help anyway). Nor do we have a way to tell whether the
> slowness could be a result of adherence to RFC1149. ;)
>
> We discussed the matter in the neutron channel [2], though we couldn't
> figure out the culprit or where to go next. At this point we assume
> it's not neutron's fault, and we hope others (infra?) may have
> suggestions on where to look.
>
> [1] http://logs.openstack.org/95/429095/2/check/gate-tempest-dsvm-neutron-dvr-ubuntu-xenial/35aa22f/console.html#_2017-02-09_04_47_12_874550
> [2] http://eavesdrop.openstack.org/irclogs/%23openstack-neutron/%23openstack-neutron.2017-02-10.log.html#t2017-02-10T04:06:01
>
> Thanks,
> Ihar
>
>


[openstack-dev] [gate][neutron][infra] tempest jobs timing out due to general sluggishness of the node?

2017-02-09 Thread Ihar Hrachyshka
Hi all,

I noticed lately a number of job failures in neutron gate that all
result in job timeouts. I describe
gate-tempest-dsvm-neutron-dvr-ubuntu-xenial job below, though I see
timeouts happening in other jobs too.

The failure mode is that all operations (./stack.sh and each tempest
test) take significantly more time (roughly 50% to 150% more), which
results in the job timeout being triggered. An example of what I mean
can be found in [1].

A good run usually takes ~20 minutes to stack up devstack and then ~40
minutes to pass the full suite; a bad run usually takes ~30 minutes for
./stack.sh and then 1:20h+ until it is killed due to timeout.
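
(A cheap way to compare runs, assuming devstack still prints its usual
summary line at the end of a successful stack: pull the timing straight
out of a job's console log, e.g.:)

  # Extract the devstack run time from a console log (URL is a placeholder).
  curl -s http://logs.openstack.org/.../console.html | \
      grep -o 'stack.sh completed in [0-9]* seconds'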

It affects different clouds (we see rax, internap, infracloud-vanilla
and ovh jobs affected; we haven't seen osic though). It can't be e.g.
slow pypi or apt mirrors, because then we would see a slowdown in the
./stack.sh phase only.

We can't be sure that the CPUs are the same, and devstack does not seem
to dump /proc/cpuinfo anywhere (in the end, it's all virtual, so not
sure if it would help anyway). Nor do we have a way to tell whether the
slowness could be a result of adherence to RFC1149. ;)

We discussed the matter in the neutron channel [2], though we couldn't
figure out the culprit or where to go next. At this point we assume it's
not neutron's fault, and we hope others (infra?) may have suggestions on
where to look.

[1] 
http://logs.openstack.org/95/429095/2/check/gate-tempest-dsvm-neutron-dvr-ubuntu-xenial/35aa22f/console.html#_2017-02-09_04_47_12_874550
[2] 
http://eavesdrop.openstack.org/irclogs/%23openstack-neutron/%23openstack-neutron.2017-02-10.log.html#t2017-02-10T04:06:01

Thanks,
Ihar
