Re: RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-05-05 Thread Pankaj Thakkar
The purpose of this email is to introduce the architecture and the design
principles. The overall project involves more than just changes to the vmxnet3
driver, and hence we thought an overview email would be better. Once people
agree to the design in general, we intend to provide the code changes to the
vmxnet3 driver.

The architecture supports more than Intel NICs. We started the project with
Intel but plan to support all major IHVs, including Broadcom, QLogic, Emulex
and others, through a certification program. The architecture works only on
VMware ESX Server, as it requires significant support from the hypervisor.
Also, the vmxnet3 driver works only on the VMware platform. AFAICT Xen has a
different model for supporting SR-IOV devices and allowing live migration, and
the document briefly talks about it (paragraph 6).

Thanks,

-pankaj


On Tue, May 04, 2010 at 05:05:31PM -0700, Stephen Hemminger wrote:
 Date: Tue, 4 May 2010 17:05:31 -0700
 From: Stephen Hemminger shemmin...@vyatta.com
 To: Pankaj Thakkar pthak...@vmware.com
 CC: linux-ker...@vger.kernel.org, net...@vger.kernel.org,
   virtualization@lists.linux-foundation.org, pv-driv...@vmware.com,
   Shreyas Bhatewara sbhatew...@vmware.com
 Subject: Re: RFC: Network Plugin Architecture (NPA) for vmxnet3
 
 On Tue, 4 May 2010 16:02:25 -0700
 Pankaj Thakkar pthak...@vmware.com wrote:
 
  Device passthrough technology allows a guest to bypass the hypervisor and
  drive the underlying physical device. VMware has been exploring various ways
  to deliver this technology to users in a manner which is easy to adopt. In
  this process we have prepared an architecture along with Intel - NPA (Network
  Plugin Architecture). NPA allows the guest to use the virtualized NIC vmxnet3
  to pass through to a number of physical NICs which support it. The document
  below provides an overview of NPA.
  
  We intend to upgrade the upstreamed vmxnet3 driver to implement NPA so that
  Linux users can exploit the benefits provided by passthrough devices in a
  seamless manner while retaining the benefits of virtualization. The document
  below tries to answer most of the questions which we anticipated. Please let
  us know your comments and queries.
  
  Thank you.
  
  Signed-off-by: Pankaj Thakkar pthak...@vmware.com
 
 
 Code please. Also, it has to work for all architectures not just VMware and
 Intel.


Re: RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-05-05 Thread Pankaj Thakkar
Sure. We have been working on NPA for a while and have the code internally up
and running. Let me sync up internally on how and when we can provide the
vmxnet3 driver code so that people can look at it.


On Tue, May 04, 2010 at 05:32:36PM -0700, David Miller wrote:
 Date: Tue, 4 May 2010 17:32:36 -0700
 From: David Miller da...@davemloft.net
 To: Pankaj Thakkar pthak...@vmware.com
 CC: shemmin...@vyatta.com, linux-ker...@vger.kernel.org, net...@vger.kernel.org,
   virtualization@lists.linux-foundation.org, pv-driv...@vmware.com,
   Shreyas Bhatewara sbhatew...@vmware.com
 Subject: Re: RFC: Network Plugin Architecture (NPA) for vmxnet3
 
 From: Pankaj Thakkar pthak...@vmware.com
 Date: Tue, 4 May 2010 17:18:57 -0700
 
  The purpose of this email is to introduce the architecture and the
  design principles. The overall project involves more than just
  changes to the vmxnet3 driver, and hence we thought an overview email
  would be better. Once people agree to the design in general, we
  intend to provide the code changes to the vmxnet3 driver.
 
 Stephen's point is that code talks and bullshit walks.
 
 Talk about high level designs rarely gets any traction, and often goes
 nowhere.  Give us an example implementation so there is something
 concrete for us to sink our teeth into.


Re: RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-05-05 Thread Christoph Hellwig
On Tue, May 04, 2010 at 04:02:25PM -0700, Pankaj Thakkar wrote:
 The plugin image is provided by the IHVs along with the PF driver and is
 packaged in the hypervisor. The plugin image is OS agnostic and can be loaded
 either into a Linux VM or a Windows VM. The plugin is written against the
 Shell API interface which the shell is responsible for implementing. The API

We're not going to add any kind of loader for binary blobs into kernel
space, sorry.  Don't even bother wasting your time on this.



Re: RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-05-05 Thread Pankaj Thakkar
On Tue, May 04, 2010 at 05:58:52PM -0700, Chris Wright wrote:
 Date: Tue, 4 May 2010 17:58:52 -0700
 From: Chris Wright chr...@sous-sol.org
 To: Pankaj Thakkar pthak...@vmware.com
 CC: linux-ker...@vger.kernel.org, net...@vger.kernel.org,
   virtualization@lists.linux-foundation.org, pv-driv...@vmware.com,
   Shreyas Bhatewara sbhatew...@vmware.com, k...@vger.kernel.org
 Subject: Re: RFC: Network Plugin Architecture (NPA) for vmxnet3
 
 * Pankaj Thakkar (pthak...@vmware.com) wrote:
  We intend to upgrade the upstreamed vmxnet3 driver to implement NPA so that
  Linux users can exploit the benefits provided by passthrough devices in a
  seamless manner while retaining the benefits of virtualization. The document
  below tries to answer most of the questions which we anticipated. Please let
  us know your comments and queries.
 
 How does the throughput, latency, and host CPU utilization for normal
 data path compare with say NetQueue?

NetQueue is really for scaling across multiple VMs. NPA allows similar scaling
and also helps in improving the CPU efficiency for a single VM since the
hypervisor is bypassed. Throughput-wise, both emulation and passthrough (NPA)
can obtain line rate on 10 GbE, but passthrough saves up to 40% CPU depending
on the workload. We did a demo at IDF 2009 where we compared 8 VMs running on
NetQueue vs. 8 VMs running on NPA (using Niantic) and we obtained similar CPU
efficiency gains.

 
 And does this obsolete your UPT implementation?

NPA and UPT share a lot of code in the hypervisor. UPT was adopted by only a
very limited set of IHVs, and hence NPA is our way forward to get all IHVs on
board.

 How many cards actually support this NPA interface?  What does it look
 like, i.e. where is the NPA specification?  (AFAIK, we never got the UPT
 one).

We have it working internally with the Intel Niantic (10G) and Kawela (1G)
SR-IOV NICs. We are also working with an upcoming Broadcom 10G card and plan to
support other IHVs. Unlike UPT, we do not dictate the register sets or rings.
Rather, we have guidelines, such as that the card should have an embedded
switch for inter-VF switching and should support programming (RX filters,
VLAN, etc.) through the PF driver rather than the VF driver.

 How do you handle hardware which has a more symmetric view of the
 SR-IOV world (SR-IOV is only a PCI specification, not a network driver
 specification)?  Or hardware which has multiple functions per physical
 port (multiqueue, hw filtering, embedded switch, etc.)?

I am not sure what you mean by a symmetric view of the SR-IOV world.

NPA allows multi-queue VFs and currently requires an embedded switch. As far as
the PF driver is concerned, we require IHVs to support all existing and
upcoming features like NetQueue, FCoE, etc. The PF driver is considered
special: it is used to drive the traffic for the emulated/paravirtualized VMs
and is also used to program things on behalf of the VFs through the hypervisor.
If the hardware has multiple physical functions they are treated as separate
adapters (with their own set of VFs), and we require the embedded switch to
maintain that distinction as well.
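
To make the shape of this concrete, here is a rough sketch of the kind of
per-VF hooks a PF driver could expose so that the hypervisor can program state
on a VF's behalf. The names are purely illustrative and not the actual NPA
interface:

#include <linux/pci.h>
#include <linux/types.h>

/*
 * Hypothetical per-VF control hooks exported by a PF driver.  In NPA these
 * calls would be issued by the hypervisor, never by the guest or the VF
 * plugin itself.
 */
struct npa_pf_vf_ops {
	/* add/remove a unicast MAC filter on the embedded switch port of a VF */
	int (*vf_add_mac_filter)(struct pci_dev *pf, int vf, const u8 *mac);
	int (*vf_del_mac_filter)(struct pci_dev *pf, int vf, const u8 *mac);

	/* place the VF's switch port on a VLAN (0 = untagged) */
	int (*vf_set_vlan)(struct pci_dev *pf, int vf, u16 vlan_id);

	/* bring the VF's link up or down on the embedded switch */
	int (*vf_set_link)(struct pci_dev *pf, int vf, bool link_up);
};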


  NPA offers several benefits:
  1. Performance: Critical performance sensitive paths are not trapped and the
  guest can directly drive the hardware without incurring virtualization
  overheads.
 
 Can you demonstrate with data?

The setup is a 2.667 GHz Nehalem server running a SLES11 VM, talking to a
2.33 GHz Barcelona client box running RHEL 5.1. We ran netperf streams with a
16 KB message size over a 64 KB socket buffer between the server VM and the
client, using Intel Niantic 10G cards. In both cases (NPA and regular) the VM
was CPU saturated (used one full core).

TX: regular vmxnet3 = 3085.5 Mbps/GHz; NPA vmxnet3 = 4397.2 Mbps/GHz
RX: regular vmxnet3 = 1379.6 Mbps/GHz; NPA vmxnet3 = 2349.7 Mbps/GHz

We have similar results for other configurations, and in general we have seen
that NPA is better in terms of CPU cost and can save up to 40% of the CPU cost.

 
  2. Hypervisor control: All control operations from the guest, such as
  programming the MAC address, go through the hypervisor layer and hence can be
  subjected to hypervisor policies. The PF driver can further be used to
  enforce policy decisions like which VLAN the guest should be on.
 
 This can happen without NPA as well.  VF simply needs to request
 the change via the PF (in fact, hw does that right now).  Also, we
 already have a host side management interface via PF (see, for example,
 RTM_SETLINK IFLA_VF_MAC interface).
 
 What is control plane interface?  Just something like a fixed register set?

All operations other than TX/RX go through the vmxnet3 shell to the vmxnet3
device emulation. So the control plane is really the vmxnet3 device emulation
as far as the guest is concerned.

 
  3. Guest Management

Re: RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-05-05 Thread Pankaj Thakkar
On Wed, May 05, 2010 at 10:59:51AM -0700, Avi Kivity wrote:
 Date: Wed, 5 May 2010 10:59:51 -0700
 From: Avi Kivity a...@redhat.com
 To: Pankaj Thakkar pthak...@vmware.com
 CC: linux-ker...@vger.kernel.org, net...@vger.kernel.org,
   virtualization@lists.linux-foundation.org, pv-driv...@vmware.com,
   Shreyas Bhatewara sbhatew...@vmware.com
 Subject: Re: RFC: Network Plugin Architecture (NPA) for vmxnet3
 
 On 05/05/2010 02:02 AM, Pankaj Thakkar wrote:
  2. Hypervisor control: All control operations from the guest, such as
  programming the MAC address, go through the hypervisor layer and hence can be
  subjected to hypervisor policies. The PF driver can further be used to
  enforce policy decisions like which VLAN the guest should be on.
 
 
 Is this enforced?  Since you pass the hardware through, you can't rely 
 on the guest actually doing this, yes?

We don't pass the whole VF to the guest. Only the BAR which is responsible for
TX/RX/intr is mapped into guest space. The interface between the shell and the
plugin only allows operations related to TX and RX, such as sending a packet to
the VF, allocating RX buffers, and indicating a packet up to the shell. All
control operations are handled by the shell, and the shell does what the
existing vmxnet3 driver does (touch a specific register and let the device
emulation do the work). When a VF is mapped to the guest the hypervisor knows
this and programs the h/w accordingly on behalf of the shell. So, for example,
if the VM does a MAC address change inside the guest, the shell would write to
the VMXNET3_REG_MAC{L|H} registers, which would trigger the device emulation to
read the new MAC address and update its internal virtual port information for
the virtual switch; if the VF is mapped, it would also program the embedded
switch RX filters to reflect the new MAC address.
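
For reference, the guest-side write is roughly the following sketch, modeled on
what the existing vmxnet3 driver already does (the helper name here is
illustrative; the register names and the BAR1 write accessor come from the
existing driver):

/*
 * Shell-side MAC update.  The register writes are trapped by the vmxnet3
 * device emulation, which updates the virtual switch port and, if a VF is
 * mapped, the embedded switch RX filters.
 */
static void npa_shell_write_mac(struct vmxnet3_adapter *adapter, const u8 *mac)
{
	u32 lo = mac[0] | (mac[1] << 8) | (mac[2] << 16) | (mac[3] << 24);
	u32 hi = mac[4] | (mac[5] << 8);

	VMXNET3_WRITE_BAR1_REG(adapter, VMXNET3_REG_MACL, lo);
	VMXNET3_WRITE_BAR1_REG(adapter, VMXNET3_REG_MACH, hi);
}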

 
  The plugin image is provided by the IHVs along with the PF driver and is
  packaged in the hypervisor. The plugin image is OS agnostic and can be loaded
  either into a Linux VM or a Windows VM. The plugin is written against the
  Shell API interface which the shell is responsible for implementing. The API
  interface allows the plugin to do TX and RX only by programming the hardware
  rings (along with things like buffer allocation and basic initialization).
  The virtual machine comes up in paravirtualized/emulated mode when it is
  booted. The hypervisor allocates the VF and other resources and notifies the
  shell of the availability of the VF. The hypervisor injects the plugin into
  memory location specified by the shell. The shell initializes the plugin by
  calling into a known entry point and the plugin initializes the data path.
  The control path is already initialized by the PF driver when the VF is
  allocated. At this point the shell switches to using the loaded plugin to do
  all further TX and RX operations. The guest networking stack does not
  participate in these operations and continues to function normally. All the
  control operations continue being trapped by the hypervisor and are directed
  to the PF driver as needed. For example, if the MAC address changes the
  hypervisor updates its internal state and changes the state of the embedded
  switch as well through the PF control API.
 
 
 This is essentially a miniature network stack with its own mini
 bonding layer, mini hotplug, and mini API, except s/API/ABI/.  Is this a
 correct view?

To some extent yes, but there is no complicated bonding, nor is there anything
like PCI hotplug. The shell interface is small and the OS always interacts with
the shell as the main driver. The plugin changes based on the underlying VF,
and the plugin itself is really small: our vmxnet3 s/w plugin is about 1300
lines including whitespace and comments, and the Intel Kawela plugin is about
1100 lines including whitespace and comments. The design principle is to put
more of the complexity related to initialization/control into the PF driver
rather than into the plugin.
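
To give a feel for how small the interface is, here is a purely illustrative
sketch (not the actual Shell API) of the kind of tables exchanged at the known
entry point:

#include <linux/types.h>

/* Services the shell provides to the plugin (illustrative names). */
struct npa_shell_api {
	void *(*alloc_dma_mem)(void *shell_ctx, size_t len, u64 *dma_addr);
	void  (*free_dma_mem)(void *shell_ctx, void *va, size_t len, u64 dma_addr);
	/* hand a received frame up to the shell / guest networking stack */
	void  (*indicate_rx)(void *shell_ctx, void *frame, u32 len);
};

/* Data-path operations the plugin implements against its VF (illustrative). */
struct npa_plugin_ops {
	int  (*init_rings)(void *plugin_ctx);
	int  (*tx_frame)(void *plugin_ctx, const void *frame, u32 len);
	int  (*poll_rx)(void *plugin_ctx, int budget);
	void (*shutdown)(void *plugin_ctx);
};

/* Single entry point the shell calls after the plugin image is injected. */
int npa_plugin_init(void *plugin_ctx,
		    const struct npa_shell_api *shell,
		    struct npa_plugin_ops *ops);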

 
 If so, the Linuxy approach would be to use the ordinary drivers and the 
 Linux networking API, and hide the bond setup using namespaces.  The 
 bond driver, or perhaps a new, similar, driver can be enhanced to 
 propagate ethtool commands to its (hidden) components, and to have a 
 control channel with the hypervisor.
 
 This would make the approach hypervisor agnostic, you're just pairing 
 two devices and presenting them to the rest of the stack as a single device.
 
  We have reworked our existing Linux vmxnet3 driver to accommodate NPA by
  splitting the driver into two parts: Shell and Plugin. The new split driver
  is
 
 
 So the Shell would be the reworked or new bond driver, and Plugins would 
 be ordinary Linux network drivers.

In NPA we do not rely on the guest OS to provide any of these services like

Re: RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-05-04 Thread Chris Wright
* Pankaj Thakkar (pthak...@vmware.com) wrote:
 We intend to upgrade the upstreamed vmxnet3 driver to implement NPA so that
 Linux users can exploit the benefits provided by passthrough devices in a
 seamless manner while retaining the benefits of virtualization. The document
 below tries to answer most of the questions which we anticipated. Please let
 us know your comments and queries.

How does the throughput, latency, and host CPU utilization for normal
data path compare with say NetQueue?

And does this obsolete your UPT implementation?

 Network Plugin Architecture
 ---------------------------
 
 VMware has been working on various device passthrough technologies for the
 past few years. Passthrough technology is interesting as it can result in
 better performance/cpu utilization for certain demanding applications. In our
 vSphere product we support direct assignment of PCI devices like networking
 adapters to a guest virtual machine. This allows the guest to drive the device
 using the device drivers installed inside the guest. This is similar to the
 way KVM allows for passthrough of PCI devices to the guests. The hypervisor is
 bypassed for all I/O and control operations and hence it can not provide any
 value add features such as live migration, suspend/resume, etc.
 
 
 Network Plugin Architecture (NPA) is an approach which VMware has developed in
 joint partnership with Intel which allows us to retain the best of passthrough
 technology and virtualization. NPA allows for passthrough of the fast data
 (I/O) path and lets the hypervisor deal with the slow control path using
 traditional emulation/paravirtualization techniques. Through this splitting of
 data and control path the hypervisor can still provide the above mentioned
 value add features and exploit the performance benefits of passthrough.

How many cards actually support this NPA interface?  What does it look
like, i.e. where is the NPA specification?  (AFAIK, we never got the UPT
one).

 NPA requires SR-IOV hardware which allows for sharing of one single NIC
 adapter by multiple guests. SR-IOV hardware has many logically separate
 functions called virtual functions (VF) which can be independently assigned
 to the guest OS. They also have one or more physical functions (PF) (managed
 by a PF driver) which are used by the hypervisor to control certain aspects
 of the VFs and the rest of the hardware.
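 
 For reference, a minimal sketch of how a Linux PF driver could bring up its
 VFs through the standard SR-IOV core; the helper name and the VF count
 handling are illustrative:
 
 #include <linux/pci.h>
 
 /* Enable num_vfs virtual functions under this PF -- illustrative sketch. */
 static int example_pf_enable_vfs(struct pci_dev *pf_dev, int num_vfs)
 {
 	int err;
 
 	err = pci_enable_sriov(pf_dev, num_vfs);
 	if (err)
 		dev_err(&pf_dev->dev, "failed to enable %d VFs: %d\n",
 			num_vfs, err);
 	return err;
 }
 
 The hypervisor then hands individual VFs to guests while the PF driver
 retains control of the embedded switch and the per-VF filters.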

How do you handle hardware which has a more symmetric view of the
SR-IOV world (SR-IOV is only a PCI specification, not a network driver
specification)?  Or hardware which has multiple functions per physical
port (multiqueue, hw filtering, embedded switch, etc.)?

 NPA splits the guest driver into two components called the Shell and the
 Plugin. The shell is responsible for interacting with the guest networking
 stack and funneling the control operations to the hypervisor. The plugin is
 responsible for driving the data path of the virtual function exposed to the
 guest and is specific to the NIC hardware. NPA also requires an embedded
 switch in the NIC to allow for switching traffic among the virtual functions.
 The PF is also used as an uplink to provide connectivity to other VMs which
 are in emulation mode. The figure below shows the major components in a block
 diagram.
 
 +--+
 | Guest VM |
 |  |
 |  ++  |
 |  | vmxnet3 driver |  |
 |  | Shell  |  |
 |  | ++ |  |
 |  | |   Plugin   | |  |
 +--+-++-+--+
 |   .
+-+  .
| vmxnet3 |  .
|___+-+  .
  |  .
  |  .
 ++
 ||
 |   virtual switch   |
 ++
   | .   \
   | .\
+=+  . \
| PF control  |  .  \
| |  .   \
|  L2 driver  |  .\
+-+  . \
   | .  \
   | .   \
 ++ ++
 | PF   VF1 VF2 ...   VFn | ||
 || |  regular   |
 |   SR-IOV NIC   | |nic |
 |+--+| |   ++
 ||   embedded   || +---+
 ||switch||
 |+--+|