On Wednesday morning we discussed the state of CI and technical debt for performance VMs. "Performance VMs" is shorthand for instances that take advantage of network function virtualization (NFV) features like SR-IOV, PCI passthrough, NUMA topologies, CPU pinning and huge pages. The full etherpad is here [1].

The session started out with a recap of the existing CI testing we have in Nova today for NFV:

1. Intel PCI CI - pretty basic custom test(s) of booting an instance with a PCI device flavor and then SSHing into the guest to ensure the device shows up.

2. Mellanox SR-IOV CI for macvtap - networking scenario tests in Tempest using an SR-IOV port of type 'macvtap'.

3. Mellanox SR-IOV CI for direct - networking scenario tests in Tempest using an SR-IOV port of type 'direct'.

4. Intel NFV CI - custom API tests in a Tempest plugin using flavors that have NUMA, CPU pinning and huge pages extra specs (examples of these extra specs and the SR-IOV port types are sketched after this list).
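
For reference, the kinds of flavor extra specs and port types these jobs exercise look roughly like the following; the flavor, network and alias names here are made up for illustration, not taken from the actual CI configs:

    # PCI passthrough flavor; assumes a device alias named "my-pci-dev"
    # is already defined in nova.conf via the pci_alias option:
    nova flavor-key pci.small set "pci_passthrough:alias"="my-pci-dev:1"

    # SR-IOV ports of the two vnic types the Mellanox jobs cover:
    neutron port-create sriov-net --binding:vnic_type macvtap
    neutron port-create sriov-net --binding:vnic_type direct

    # NFV flavor extra specs for NUMA, CPU pinning and huge pages:
    nova flavor-key nfv.large set hw:numa_nodes=2
    nova flavor-key nfv.large set hw:cpu_policy=dedicated
    nova flavor-key nfv.large set hw:mem_page_size=large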

We then talked about gaps in testing of NFV features, the major ones being:

1. The Intel NFV CI is single-node, so we don't expose bugs with scheduling to multiple computes (we had a major bug in Nova where we'd only ever schedule NUMA instances to a single compute). We could potentially cover some of this with an in-tree functional test.

2. We don't have any testing for SR-IOV ports of type 'direct-physical', which was recently added but is buggy (this and the next gap are sketched as CLI scenarios after this list).

3. We don't have any testing for resize/migrate with a different PCI device flavor, and according to Moshe Levi from Mellanox it has never worked, or at least he doesn't see how it could have. Testing this properly would require a multinode devstack job, which none of the NFV third-party CIs have today. Moshe has a patch up to fix the bug [2], but long-term we really need CI coverage for this so we don't regress it.

4. ovs-dpdk has limited testing in Nova today. The Intel Networking CI job runs it on any changes to nova/virt/libvirt/vif.py and on Neutron changes. I've asked that the module whitelist be expanded for Nova changes to run these tests. It also sounds like it's going to be run on os-vif changes, so once we integrate os-vif for ovs-dpdk we'll have some coverage there.

5. In general we have issues with the NFV CI systems:

a) There are different teams running the different Intel CI jobs, so communication and status reporting can be difficult. Sean Mooney said that his team might be consolidating and owning some of the various jobs, so that should help.

b) The Mellanox CI jobs are running on dedicated hardware and doing cleanups of the host between runs, but this can miss things. The Intel CI guys said that they use privileged containers to get around this type of issue. It would be great if the various teams running these CIs could share their approaches: best practices, tooling, etc.

c) We might be able to run some of the Intel NFV CI testing in the community infra since some of the public cloud providers being used allow nested virt. However, Clark Boylan reported that they have noticed very strange and abrupt crashes when running with nested virt enabled, so right now its stability is in question. Sean Mooney from Intel said that his team could look into upstreaming some of their CI to community infra. We could also set up an experimental job to see how stable it is and tease out the issues.
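
To make gaps 2 and 3 above concrete, here is roughly what a scenario test would drive via the CLI; the flavor, network, image and server names are made up for illustration:

    # Gap 2: boot with an SR-IOV port of type 'direct-physical', which
    # passes through the entire physical function:
    neutron port-create sriov-net --binding:vnic_type direct-physical
    nova boot --flavor nfv.large --image cirros --nic port-id=<port-uuid> pf-test

    # Gap 3: resize to a flavor with a different PCI alias, which should
    # reallocate PCI devices on the target host:
    nova flavor-key pci.other set "pci_passthrough:alias"="other-pci-dev:1"
    nova resize pf-test pci.other
    nova resize-confirm pf-test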

--

Beyond CI testing we also talked about the gap in upstream documentation. The good news is there is more documentation upstream than I was aware of. The neutron networking guide has information on configuring nova/neutron for SR-IOV. The admin guide has some good information on CPU pinning and large pages, and documents some of the more widely used flavor extra specs, but it is by no means exhaustive, nor is it always clear when a flavor extra spec applies versus the equivalent image metadata.
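
As an example of that ambiguity, most of these properties can be set in either place, with the colon-delimited flavor extra spec having an underscore-delimited image property equivalent (the flavor name and image are illustrative):

    # As a flavor extra spec (applies to all instances of the flavor):
    nova flavor-key nfv.large set hw:mem_page_size=large

    # As an image property (applies to all instances of the image):
    glance image-update --property hw_mem_page_size=large <image-uuid>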

Stephen Finucane and Ludovic Beliveau volunteered to help work on the documentation.

--

One of the takeaways from this session was the clear lack of NFV users and people from the OPNFV community in the room. At one point someone asked anyone from those groups to raise their hand, and maybe one person did. There are certainly developers involved, like Moshe, Sean, Stephen and Ludovic, but we still have a gap between the companies pushing for these features and the developers doing the work. That's one of the reasons the core team consistently makes NFV support a lower priority. Part of the issue might simply be that those stakeholders are in conference track sessions that run at the same time as the design summit. However, I and some others from the core team attended an NFV luncheon on Monday to talk about what the NFV community can do to be more involved; we went over some of the above and pointed out this very session to attend. It didn't seem to change anything, since the NFV stakeholders at that luncheon didn't attend the design session.

--

On Friday during the meetup session we briefly discussed FPGAs and similar acceleration-type resources. There were a lot of questions around not only what to do about modeling these resources, but what to do with an instance if/when the function it needs is re-programmed. As an initial step, Jay Pipes, Ed Leafe and some others agreed to talk about how generic resource pools can model these types of resource classes, but this is all very early-stage conversation.

--

Looking ahead:

1. Moshe is taking over the SR-IOV/PCI bi-weekly IRC meeting [3]. We can continue some of the discussions in that meeting.

2. Sean Mooney and the Intel CI teams sound like they have some work to do on consolidation and potentially upstreaming some of their CI to community infra.

3. There are some volunteers to help dig into documentation gaps. I expect we can start to get an idea of concrete action items for this in the SR-IOV meeting.

4. Jay Pipes is working on refactoring the PCI resource tracker code as part of the overall scheduler effort, and Moshe is working on the resize/migrate bugs with respect to PCI devices. It would also be great if we could get away from hard-coding a PCI whitelist in nova.conf (an example of the current syntax follows this list), but there isn't a clear picture, at least in my mind, of what this entails and who would drive the work. This is probably another item for the SR-IOV/PCI meeting.

5. We're going to document the current list of gaps (code issues, testing, documentation) in the Nova devref so we have something to point to when new features are requested. Basically, this is our list of debt, and we want to see that paid off before taking on new features and debt for NFV.
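
For context on item 4, this is the sort of whitelist entry currently hard-coded in nova.conf; the vendor/product IDs and physnet name here are illustrative:

    [DEFAULT]
    # Expose virtual functions with this vendor/product ID to guests and
    # tag them with the physical network they are attached to:
    pci_passthrough_whitelist = {"vendor_id": "8086", "product_id": "10ed", "physical_network": "physnet1"}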

[1] https://etherpad.openstack.org/p/newton-nova-performance-vms
[2] https://review.openstack.org/#/c/307124/
[3] http://lists.openstack.org/pipermail/openstack-dev/2016-April/093541.html

--

Thanks,

Matt Riedemann

