On 24 April 2015 at 11:47, Stefan Hajnoczi <[email protected]> wrote:

> > Incidentally, we also did a pile of work last year on zero-copy NIC->VM
> > transfers and discovered a lot of interesting problems and edge cases
> > where Virtio-net spec and/or drivers are hard to match up with common
> > NICs. Happy to explain a bit about our experience if that would be
> > valuable.
>
> That sounds interesting, can you describe the setup?
>

Sure.

We implemented a zero-copy receive path that maps guest buffers received
from the avail ring directly onto hardware receive buffers on a dedicated
hardware receive queue for that VM (VMDq).

This means that when the NIC receives a packet it stores it directly into
the guest's memory but the vswitch has the opportunity to do as much or as
little processing as it wants before making the packet available with a
used ring descriptor.
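
To make that concrete, here is a minimal sketch in C of the two halves of
the path. The hw_rxdesc layout and the guest_pa_to_dma() helper are
hypothetical stand-ins for per-NIC details, not our actual code:

    #include <stdint.h>

    /* Descriptor and used-element layouts per the virtio spec. */
    struct vring_desc      { uint64_t addr; uint32_t len; uint16_t flags, next; };
    struct vring_used_elem { uint32_t id;   uint32_t len; };

    /* Hypothetical hardware receive descriptor; real NICs differ. */
    struct hw_rxdesc { uint64_t dma_addr; uint32_t buf_len; };

    extern uint64_t guest_pa_to_dma(uint64_t guest_pa); /* address translation */

    /* Receive setup: hand a buffer taken from the guest's avail ring to the
     * NIC, so the packet is DMAed straight into guest memory. */
    static void post_guest_buffer(struct hw_rxdesc *hwd,
                                  const struct vring_desc *d)
    {
        hwd->dma_addr = guest_pa_to_dma(d->addr);
        hwd->buf_len  = d->len;
    }

    /* Receive completion: the payload already sits in the guest's buffer;
     * after any vswitch processing, publish it on the used ring. */
    static void complete_receive(struct vring_used_elem *used,
                                 uint16_t desc_id, uint32_t bytes_written)
    {
        used->id  = desc_id;        /* head of the guest's descriptor chain */
        used->len = bytes_written;  /* includes the vnet header */
    }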

This scheme seems quite elegant to me. (I am sure it is not original - this
is what the VMDq hardware feature is for, after all.) The devil is in the
details though.

I suspect it would work well given two extensions to Virtio-net:

1. The 'used' ring to allow an offset where the payload starts.

2. The guest to always supply buffers with space for >= 2048 bytes of
payload.

but without these things it is tricky to satisfy the constraints of real
NICs such as the Intel 10G ones, because several requirements conflict. For
example:

- The NIC requires buffer sizes to be uniform and a multiple of 1024 bytes.
The guest supplies variable-size buffers, often of ~1500 bytes. These need
to be either rounded down to 1024 bytes (causing excessive segmentation) or
rounded up to 2048 bytes (requiring jumbo frames to be globally disabled on
the port to avoid potential overruns). (See the first sketch after this
list.)

- Virtio-net with MRG_RXBUF expects the packet payload at a different
offset in the first descriptor of a chain (offset 14, after the vnet
header) vs following descriptors in the chain (offset 0). The NIC always
stores packets at the same offset, so the vswitch needs to pick one and
then correct with memmove() when needed. (See the second sketch after this
list.)

- If the vswitch wants to shorten the packet payload, e.g. to remove
encapsulation, then this will require a memmove() because there is no way
to communicate an offset on the used ring (also covered in the second
sketch below).

- The NIC has a limit to how many receive descriptors it can chain
together. If the guest is supplying small buffers then this limit may be
too low for jumbo frames to be received.
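
For the buffer-size conflict in the first point, the arithmetic looks
roughly like this (a sketch; the 1024-byte granularity is the Intel
constraint mentioned above):

    #include <stdint.h>

    /* The guest posts ~1500-byte buffers; the NIC wants a uniform size
     * that is a multiple of 1024.  Both choices have a cost. */
    static uint32_t round_down_1k(uint32_t len) { return len & ~1023u; }
    static uint32_t round_up_1k(uint32_t len)   { return (len + 1023u) & ~1023u; }

    /* round_down_1k(1500) == 1024: each frame spans more buffers.
     * round_up_1k(1500)   == 2048: the NIC may write past the real buffer
     * end unless jumbo frames are disabled on the port. */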
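
And for the offset problems in the second and third points, the
corrections end up as memmove() calls along these lines (again a sketch;
the VNET_HDR_LEN constant and the headroom assumption are ours):

    #include <string.h>
    #include <stdint.h>

    #define VNET_HDR_LEN 12  /* struct virtio_net_hdr_mrg_rxbuf */

    /* The NIC wrote the frame at offset 0, but in the first descriptor of
     * a MRG_RXBUF chain the guest expects the vnet header first.  Shift
     * the frame to make room (assumes VNET_HDR_LEN bytes of slack). */
    static void fix_first_buffer(uint8_t *buf, uint32_t frame_len,
                                 const uint8_t *vnet_hdr)
    {
        memmove(buf + VNET_HDR_LEN, buf, frame_len);
        memcpy(buf, vnet_hdr, VNET_HDR_LEN);
    }

    /* Stripping encapsulation: with no offset field on the used ring, the
     * shortened payload must be moved back to where the guest expects it. */
    static void strip_encap(uint8_t *payload, uint32_t len, uint32_t encap_len)
    {
        memmove(payload, payload + encap_len, len - encap_len);
    }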

... and at a certain point we decided we were better off switching our
focus away from clever-but-fragile NIC hacks and towards clever-and-robust
SIMD hacks, and that is the path we have been on for the past few months.