Re: [openstack-dev] [nova] The same SRIOV / NFV CI failures missed a regression, why?

2016-03-29 Thread Moshe Levi


> -----Original Message-----
> From: Jay Pipes [mailto:jaypi...@gmail.com]
> Sent: Friday, March 25, 2016 10:20 PM
> To: openstack-dev@lists.openstack.org
> Subject: Re: [openstack-dev] [nova] The same SRIOV / NFV CI failures missed a
> regression, why?
> 
> On 03/24/2016 09:35 AM, Matt Riedemann wrote:
> > We have another mitaka-rc-potential bug [1] due to a regression when
> > detaching SR-IOV interfaces in the libvirt driver.
> >
> > There were two NFV CIs that ran on the original change [2].
> >
> > Both failed with the same devstack setup error [3][4].
> >
> > So it sucks that we have a regression, it sucks that no one watched
> > for those CI results before approving the change, and it really sucks
> > in this case since it was specifically reported from mellanox for
> > sriov which failed in [4]. But it happens.
> >
> > What I'd like to know is, have the CI problems been fixed? There is a
> > change up to fix the regression [5] and this time the Mellanox CI
> > check is passing [6]. The Intel NFV CI hasn't reported, but with the
> > mellanox one also testing the suspend scenario, it's probably good enough.
> 
> From the commit message of the original patch that introduced the
> regression:
> 
> "This fix was tested on a real environment containing the above type of VMs.
> test_driver.test_detach_sriov_ports was slightly modified so that the VIF from
> which data is sent to _detach_pci_devices will contain the correct SRIOV
> values (pci_slot, vlan and hw_veb VIF type)"
> 
> I'm not sure if the above statement could ever have been true considering the
> AttributeError that occurred in the bug...
> 
> In any case, I think that it's pretty clear that the CI systems for NFV and
> PCI have been less than reliable at functionally testing the PCI and
> NFV-specific functionality in Nova.
> 
> This isn't trying to put down the people that work on those systems -- I know
> first hand that it can be difficult to build and maintain CI systems that
> report in to upstream, and I appreciate the effort that goes into this.
> 
> But, going forward, I think we need to do something as a concerned
> community.
> 
> How about this for a proposal?
> 
> 1) We establish a joint lab environment that contains heterogeneous hardware
> to which all interested hardware vendors must provide hardware.
> 
> 2) The OpenStack Foundation and the hardware vendors each foot some
> portion of the bill to hire 2 or more systems administrators to maintain this
> lab environment.
> 
> 3) The upstream Infrastructure team works with the hired system
> administrators to create a single CI system that can spawn functional test
> jobs on the lab hardware and report results back to upstream Gerrit
> 
> Given the will to do this, I think the benefits of more trusted testing
> results for the PCI and SR-IOV/NFV areas would more than make up for the cost.
+1, I like this proposal. We can help by providing Mellanox hardware and
sharing our CI knowledge.

> 
> Best,
> -jay
> 
> 
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



Re: [openstack-dev] [nova] The same SRIOV / NFV CI failures missed a regression, why?

2016-03-28 Thread Moshe Levi


> -----Original Message-----
> From: Matt Riedemann [mailto:mrie...@linux.vnet.ibm.com]
> Sent: Thursday, March 24, 2016 3:35 PM
> To: OpenStack Development Mailing List (not for usage questions)  d...@lists.openstack.org>
> Subject: [openstack-dev] [nova] The same SRIOV / NFV CI failures missed a
> regression, why?
> 
> We have another mitaka-rc-potential bug [1] due to a regression when
> detaching SR-IOV interfaces in the libvirt driver.
> 
> There were two NFV CIs that ran on the original change [2].
> 
> Both failed with the same devstack setup error [3][4].
> 
> So it sucks that we have a regression, it sucks that no one watched for those
> CI results before approving the change, and it really sucks in this case since
> it was specifically reported from mellanox for sriov which failed in [4]. But
> it happens.
> 
> What I'd like to know is, have the CI problems been fixed? There is a change
> up to fix the regression [5] and this time the Mellanox CI check is passing
> [6]. The Intel NFV CI hasn't reported, but with the mellanox one also testing
> the suspend scenario, it's probably good enough.
Patch set 6 of patch [2] passed in the Mellanox CI; see
http://144.76.193.39/ci-artifacts/262341/6/Nova-ML2-Sriov/testr_results.html.gz
I am not sure why patch set 7 failed. At first I thought it was because PS 6
installed oslo.utils==3.4.0 while PS 7 installed oslo.utils==3.5.0, but I could
not find a difference between the two versions that would explain this error:

2016-02-16 05:05:42.164 7182 ERROR nova   File "/usr/local/lib/python2.7/dist-packages/oslo_utils/importutils.py", line 30, in import_class
2016-02-16 05:05:42.164 7182 ERROR nova __import__(mod_str)
2016-02-16 05:05:42.164 7182 ERROR nova ValueError: Empty module name
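
For readers unfamiliar with this failure mode, here is a minimal sketch of one
generic way "ValueError: Empty module name" can come out of a dotted-path class
loader. It is an assumption-laden illustration, not the oslo.utils source and
not a diagnosis of the patch set 7 failure; import_class_sketch and the string
"SomeDriver" are hypothetical names chosen for the example.

```python
import collections
import sys


def import_class_sketch(import_str):
    """Hypothetical stand-in for a dotted-path class loader."""
    # Split "package.module.ClassName" at the last dot.
    mod_str, _sep, class_str = import_str.rpartition(".")
    # If import_str contains no dot, mod_str is the empty string and
    # __import__ raises ValueError("Empty module name").
    __import__(mod_str)
    return getattr(sys.modules[mod_str], class_str)


# A full dotted path resolves to the class.
assert import_class_sketch("collections.OrderedDict") is collections.OrderedDict

# A bare name (hypothetical) has no module part, so the import step fails.
try:
    import_class_sketch("SomeDriver")
except ValueError as exc:
    print(exc)
```

If this pattern applies, the value to inspect would be whichever setting
supplies the class path, since a bare or empty name fails regardless of the
library version installed.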

Putting that aside, the Mellanox CI in nova is currently running and passing,
so SR-IOV for Ethernet is not broken.
The fix in [5] is for our SR-IOV InfiniBand solution. At the moment we test
that solution only in neutron, because we don't have enough physical servers
to run the CI for both nova and neutron.

> 
> [1] https://bugs.launchpad.net/nova/+bug/1560860
> [2] https://review.openstack.org/#/c/262341/
> [3]
> http://intel-openstack-ci-logs.ovh/compute-ci/refs/changes/41/262341/7/compute-nfv-flavors/20160215_232057/screen/n-sch.log.gz
> [4]
> http://144.76.193.39/ci-artifacts/262341/7/Nova-ML2-Sriov/logs/n-sch.log.gz
> [5] https://review.openstack.org/#/c/296305/
> [6]
> http://144.76.193.39/ci-artifacts/296305/1/Nova-ML2-Sriov/testr_results.html.gz
> 
> --
> 
> Thanks,
> 
> Matt Riedemann
> 
> 
> 


Re: [openstack-dev] [nova] The same SRIOV / NFV CI failures missed a regression, why?

2016-03-25 Thread Jay Pipes

On 03/24/2016 09:35 AM, Matt Riedemann wrote:

> We have another mitaka-rc-potential bug [1] due to a regression when
> detaching SR-IOV interfaces in the libvirt driver.
>
> There were two NFV CIs that ran on the original change [2].
>
> Both failed with the same devstack setup error [3][4].
>
> So it sucks that we have a regression, it sucks that no one watched for
> those CI results before approving the change, and it really sucks in
> this case since it was specifically reported from mellanox for sriov
> which failed in [4]. But it happens.
>
> What I'd like to know is, have the CI problems been fixed? There is a
> change up to fix the regression [5] and this time the Mellanox CI check
> is passing [6]. The Intel NFV CI hasn't reported, but with the mellanox
> one also testing the suspend scenario, it's probably good enough.


From the commit message of the original patch that introduced the 
regression:


"This fix was tested on a real environment containing the above type of 
VMs. test_driver.test_detach_sriov_ports was slightly modified so that 
the VIF from which data is sent to _detach_pci_devices will contain the 
correct SRIOV values (pci_slot, vlan and hw_veb VIF type)"


I'm not sure if the above statement could ever have been true 
considering the AttributeError that occurred in the bug...


In any case, I think that it's pretty clear that the CI systems for NFV 
and PCI have been less than reliable at functionally testing the PCI and 
NFV-specific functionality in Nova.


This isn't trying to put down the people that work on those systems -- I 
know first hand that it can be difficult to build and maintain CI 
systems that report in to upstream, and I appreciate the effort that 
goes into this.


But, going forward, I think we need to do something as a concerned 
community.


How about this for a proposal?

1) We establish a joint lab environment that contains heterogeneous 
hardware to which all interested hardware vendors must provide hardware.


2) The OpenStack Foundation and the hardware vendors each foot some 
portion of the bill to hire 2 or more systems administrators to maintain 
this lab environment.


3) The upstream Infrastructure team works with the hired system 
administrators to create a single CI system that can spawn functional 
test jobs on the lab hardware and report results back to upstream Gerrit


Given the will to do this, I think the benefits of more trusted testing 
results for the PCI and SR-IOV/NFV areas would more than make up for the 
cost.


Best,
-jay



[openstack-dev] [nova] The same SRIOV / NFV CI failures missed a regression, why?

2016-03-24 Thread Matt Riedemann
We have another mitaka-rc-potential bug [1] due to a regression when 
detaching SR-IOV interfaces in the libvirt driver.


There were two NFV CIs that ran on the original change [2].

Both failed with the same devstack setup error [3][4].

So it sucks that we have a regression, it sucks that no one watched for 
those CI results before approving the change, and it really sucks in 
this case since it was specifically reported from mellanox for sriov 
which failed in [4]. But it happens.


What I'd like to know is, have the CI problems been fixed? There is a 
change up to fix the regression [5] and this time the Mellanox CI check 
is passing [6]. The Intel NFV CI hasn't reported, but with the mellanox 
one also testing the suspend scenario, it's probably good enough.


[1] https://bugs.launchpad.net/nova/+bug/1560860
[2] https://review.openstack.org/#/c/262341/
[3] 
http://intel-openstack-ci-logs.ovh/compute-ci/refs/changes/41/262341/7/compute-nfv-flavors/20160215_232057/screen/n-sch.log.gz
[4] 
http://144.76.193.39/ci-artifacts/262341/7/Nova-ML2-Sriov/logs/n-sch.log.gz

[5] https://review.openstack.org/#/c/296305/
[6] 
http://144.76.193.39/ci-artifacts/296305/1/Nova-ML2-Sriov/testr_results.html.gz


--

Thanks,

Matt Riedemann

