Re: [virtio-dev] packed ring layout proposal v3

2017-10-20 Thread Lars Ganrot
Hi Michael,

I'm trying to understand your Sep 28 13:32 example (text copied below). I've
inlined my questions, marked [#lga#].

>
> Let's assume device promised to consume packets in order
>
> ring size = 2
>
> Ring is 0 initialized.
>
> Device initially polls DESC[0].flags for WRAP bit to change.
>
> driver adds:
>
> DESC[0].addr = 1234
> DESC[0].id = 0
> DESC[0].flags = DESC_DRIVER | DESC_MORE | DESC_WRAP
>
> and
>
> DESC[0].addr = 5678

[#lga#] Is index 0 above a typo (makes more sense if it is 1)?

> DESC[1].id = 1
> DESC[1].flags = DESC_DRIVER | DESC_WRAP
>
> it now starts polling DESC[0] flags.
>
> Device reads 1234, executes it, does not use it.
>
> Device reads 5678, executes it, and uses it:
>
> DESC[0].id = 1
> DESC[0].flags = 0

[#lga#] Should above line be: "DESC[0].flags = DESC[1].flags & DESC_WRAP"?

>
> Device now polls DESC[0].flags for WRAP bit to change.
>
> Now driver sees that DRIVER bit has been cleared, so it knows that id
> is valid. It sees id 1, therefore ids 0 and 1 have been read and are safe to
> overwrite.

[#lga#] At this point, has the device returned both buffers *(1234) and *(5678) 
to the driver?
[#lga#] If so, how would out-of-order completion avoid head of line blocking?
[#lga#] E.g. the device is done with id=1 and *(5678), but not id=0 and *(1234).

>
> So it writes it out. It wrapped around to beginning of ring, so it
> flips the WRAP bit to 0 on all descriptors now:
>
> DESC[0].addr = 9ABC
> DESC[0].id = 0
> DESC[0].flags = DESC_DRIVER | DESC_MORE
>
> DESC[0].addr = DEF0
> DESC[0].id = 1
> DESC[0].flags = DESC_DRIVER

[#lga#] Index typo in all 3 lines above? DESC[1] makes more sense.

>
> Next round wrap will be 1 again.
>
> To summarise:
>
> DRIVER bit is used by driver to detect device has used one or more
> descriptors.  WRAP is used by device to detect driver has made a
> new descriptor available.

Best Regards,

-Lars



RE: [virtio-dev] packed ring layout proposal v3

2017-10-10 Thread Liang, Cunming


> -Original Message-
> From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org]
> On Behalf Of Michael S. Tsirkin
> Sent: Wednesday, October 4, 2017 8:58 PM
> To: Jens Freimann 
> Cc: Liang, Cunming ; Steven Luong (sluong)
> ; virtio-dev@lists.oasis-open.org;
> virtualizat...@lists.linux-foundation.org
> Subject: Re: [virtio-dev] packed ring layout proposal v3
> 
> On Wed, Oct 04, 2017 at 02:39:01PM +0200, Jens Freimann wrote:
> > On Sun, Oct 01, 2017 at 04:08:29AM +, Michael S. Tsirkin wrote:
> > > On Thu, Sep 28, 2017 at 09:44:35AM +, Liang, Cunming wrote:
> > > >
> > > > Got it now. Please correct me if I'm missing something.
> > > >
> > > >
> > > > Flag status hints:
> > > >
> > > > - DESC_DRIVER only: driver owns the descriptor, no available info
> > > > ready for device to use
> > > >
> > > > - DESC_DRIVER | DESC_WRAP: driver has prepared an available
> > > > descriptor, device hasn't used it yet
> > > >
> > > > - None: device has used the descriptor and written it out
> > > >
> > > > - DESC_WRAP only: shall not happen; device makes sure to clear it
> > > >
> > > >
> > > > Polling behavior:
> > > >
> > > > - Device monitors whether DESC_WRAP is set; if set, it uses the
> > > > descriptor and clears the DESC_DRIVER bit at the end (note: it
> > > > always needs to clear DESC_WRAP)
> > > >
> > > > - Driver monitors whether DESC_DRIVER is cleared; if cleared, it
> > > > reclaims the descriptor (sets DESC_DRIVER) and sets DESC_WRAP once a
> > > > new available descriptor is ready to go
> > > >
> > > >
> > > > --
> > > > Steve
> > >
> > >
> > > Hmm no, not what I had in mind.
> > >
> > > DESC_DRIVER: used by driver to poll. Driver sets it when writing a
> > > descriptor.  Device clears it when overwriting a descriptor.
> > > Thus driver uses DESC_DRIVER to detect that device data in
> > > descriptor is valid.
> >
> > Basically DESC_HW from v2 split in two?
> 
> Yes in order to avoid overwriting all descriptors.
> 
> > >
> > > DESC_WRAP: used by device to poll. Driver sets it to a *different*
> > > value every time it overwrites a descriptor. How to achieve it?
> > > since descriptors are written out in ring order, simply maintain the
> > > current value internally (start value 1) and flip it every time you
> > > overwrite the first descriptor.
> > > Device leaves it intact when overwriting a descriptor.
Ok, get it now.

> >
> > This is confusing me a bit.
> >
> > My understanding is: 1. the internally kept wrap value only flipped
> > when the first descriptor is overwritten
> >
> > 2. the moment the first descriptor is written the internal wrap value
> > is flipped 0->1 or 1->0 and this value is written to every descriptor
> > DESC_WRAP until we reach the first descriptor again
That's right, that's also my take.
DESC_WRAP is only written by the driver; the device does nothing with that flag.

> 
> Yes this is what I tried to say. Can you suggest a better wording then?
The term DESC_WRAP is fine by me.

> 
> > >
> > > After writing down this explanation, I think the names aren't great.
> > >
> > > Let me try an alternative explanation.
> > >
> > > ---
> > > A two-bit field, DRIVER_OWNER, signals the buffer ownership.
> > > It has 4 possible values:
> > > values 0x1, 0x11 are written by driver
> > > values 0x0, 0x10 are written by device
> >
> > The 0x prefix might add to the confusion here. It is really just two
> > bits, no?
> 
> Ouch. Yes I meant 0b. Thanks!
0b00, 0b10 are written by device?
I suppose the device can only clear the high bit and keep the low bit unchanged.
Then the value written by the device can be either 0b01 or 0b00, but 0b10 means
setting the high bit, no?

> 
> > > each time driver writes out a descriptor, it must make sure that the
> > > high bit in OWNER changes.
> > >
> > > each time device writes out a descriptor, it must make sure that the
> > > high bit in OWNER does not change.
Typo here? It should be "..., it must make sure that the low bit in OWNER does
not change."?
As for the high bit in OWNER, each time the device writes out a descriptor, it
must make sure to clear the high bit in OWNER.
 
> > >
> > > this is exactly the same functionally, DRIVER is high bit and
> > > WRAP is the low bit.  Does this make things clearer?

Re: [virtio-dev] packed ring layout proposal v3

2017-10-04 Thread Michael S. Tsirkin
On Wed, Oct 04, 2017 at 02:39:01PM +0200, Jens Freimann wrote:
> On Sun, Oct 01, 2017 at 04:08:29AM +, Michael S. Tsirkin wrote:
> > On Thu, Sep 28, 2017 at 09:44:35AM +, Liang, Cunming wrote:
> > > 
> > > Got it now. Please correct me if I'm missing something.
> > > 
> > > 
> > > Flag status hints:
> > > 
> > > - DESC_DRIVER only: driver owns the descriptor, no available info ready
> > > for device to use
> > > 
> > > - DESC_DRIVER | DESC_WRAP: driver has prepared an available descriptor,
> > > device hasn't used it yet
> > > 
> > > - None: device has used the descriptor and written it out
> > > 
> > > - DESC_WRAP only: shall not happen; device makes sure to clear it
> > > 
> > > 
> > > Polling behavior:
> > > 
> > > - Device monitors whether DESC_WRAP is set; if set, it uses the descriptor
> > > and clears the DESC_DRIVER bit at the end (note: it always needs to clear
> > > DESC_WRAP)
> > > 
> > > - Driver monitors whether DESC_DRIVER is cleared; if cleared, it reclaims
> > > the descriptor (sets DESC_DRIVER) and sets DESC_WRAP once a new available
> > > descriptor is ready to go
> > > 
> > > 
> > > --
> > > Steve
> > 
> > 
> > Hmm no, not what I had in mind.
> > 
> > DESC_DRIVER: used by driver to poll. Driver sets it when writing a
> > descriptor.  Device clears it when overwriting a descriptor.
> > Thus driver uses DESC_DRIVER to detect that device data in
> > descriptor is valid.
> 
> Basically DESC_HW from v2 split in two?

Yes in order to avoid overwriting all descriptors.

> > 
> > DESC_WRAP: used by device to poll. Driver sets it to a *different*
> > value every time it overwrites a descriptor. How to achieve it?
> > since descriptors are written out in ring order,
> > simply maintain the current value internally (start value 1) and flip it
> > every time you overwrite the first descriptor.
> > Device leaves it intact when overwriting a descriptor.
> 
> This is confusing me a bit.
> 
> My understanding is: 1. the internally kept wrap value only flipped when the
> first
> descriptor is overwritten
> 
> 2. the moment the first descriptor is written the internal wrap value
> is flipped 0->1 or 1->0 and this value is written to every descriptor
> DESC_WRAP until we reach the first descriptor again

Yes this is what I tried to say. Can you suggest a better wording then?

> > 
> > After writing down this explanation, I think the names aren't
> > great.
> > 
> > Let me try an alternative explanation.
> > 
> > ---
> > A two-bit field, DRIVER_OWNER, signals the buffer ownership.
> > It has 4 possible values:
> > values 0x1, 0x11 are written by driver
> > values 0x0, 0x10 are written by device
> 
> The 0x prefix might add to the confusion here. It is really just two
> bits, no?

Ouch. Yes I meant 0b. Thanks!

> > each time driver writes out a descriptor, it must make sure
> > that the high bit in OWNER changes.
> > 
> > each time device writes out a descriptor, it must make sure
> > that the high bit in OWNER does not change.
> > 
> > this is exactly the same functionally, DRIVER is high bit and
> > WRAP is the low bit.  Does this make things clearer?
> 
> So far it makes sense to me.
> > ---
> > 
> > 
> > 
> > Maybe the difference between device and driver
> > is confusing. We can fix that by changing the values.
> > Here is an alternative. Let me know if you like it better -
> > I need to think a bit more to make sure it works,
> > but to give you an idea:
> > 
> > 
> > ---
> > A two-bit field, DRIVER_OWNER, signals the buffer ownership.
> > It has 4 possible values:
> > values 0x1, 0x10 are written by driver
> > values 0x0, 0x11 are written by device
> > 
> > each time driver writes out a descriptor, it must make sure
> > that the high bit in OWNER changes. Thus first time
> > it writes 0x10, next time 0x1, then 0x10 again.
> > 
> > each time device writes out a descriptor, it must make sure
> > that the low bit in OWNER changes.  Thus first time
> > it writes 0x11, next time 0x0, then 0x11 again.
> 
> DESC_WRAP is changed by the device now, so this would work differently
> than in the scenario from above. This would mean we don't need the
> internally kept wrap value, right?

I think not, but I did not finish simulating this yet, so I don't
yet know if it even works.

> 
> regards,
> Jens




Re: [virtio-dev] packed ring layout proposal v3

2017-09-30 Thread Michael S. Tsirkin
On Thu, Sep 28, 2017 at 09:44:35AM +, Liang, Cunming wrote:
> 
> Got it now. Please correct me if I'm missing something.
> 
> 
> Flag status hints:
> 
> - DESC_DRIVER only: driver owns the descriptor, no available info ready for
> device to use
> 
> - DESC_DRIVER | DESC_WRAP: driver has prepared an available descriptor,
> device hasn't used it yet
> 
> - None: device has used the descriptor and written it out
> 
> - DESC_WRAP only: shall not happen; device makes sure to clear it
> 
> 
> Polling behavior:
> 
> - Device monitors whether DESC_WRAP is set; if set, it uses the descriptor
> and clears the DESC_DRIVER bit at the end (note: it always needs to clear
> DESC_WRAP)
> 
> - Driver monitors whether DESC_DRIVER is cleared; if cleared, it reclaims the
> descriptor (sets DESC_DRIVER) and sets DESC_WRAP once a new available
> descriptor is ready to go
> 
> 
> --
> Steve


Hmm no, not what I had in mind.

DESC_DRIVER: used by driver to poll. Driver sets it when writing a
descriptor.  Device clears it when overwriting a descriptor.
Thus driver uses DESC_DRIVER to detect that device data in
descriptor is valid.



DESC_WRAP: used by device to poll. Driver sets it to a *different*
value every time it overwrites a descriptor. How to achieve it?
Since descriptors are written out in ring order,
simply maintain the current value internally (start value 1) and flip it
every time you overwrite the first descriptor.
Device leaves it intact when overwriting a descriptor.
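
To make this concrete, here is a minimal C sketch of the driver-side publish
path; the struct layout follows the proposal, but the helper and bookkeeping
names are illustrative assumptions, not spec text:

#include <stdint.h>

#define DESC_WRAP   0x0040
#define DESC_DRIVER 0x0080

struct desc {
    uint64_t addr;   /* __le64 etc. in the proposal; host types for brevity */
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

struct drv_ring {
    struct desc *desc;
    uint16_t size;
    uint16_t next;   /* next slot to write */
    uint16_t wrap;   /* internal wrap value, start value 1 */
};

/* Caller must ensure the ring is not full. */
static void drv_publish(struct drv_ring *r, uint64_t addr,
                        uint32_t len, uint16_t id)
{
    struct desc *d = &r->desc[r->next];

    d->addr = addr;
    d->len  = len;
    d->id   = id;
    /* Flags are written last: DRIVER marks the descriptor valid for the
     * device, and the current wrap value lets the device tell a freshly
     * written slot from a stale one.  A real driver needs a write barrier
     * before this store. */
    d->flags = DESC_DRIVER | (r->wrap ? DESC_WRAP : 0);

    if (++r->next == r->size) {
        r->next = 0;
        r->wrap ^= 1;   /* flip each time we wrap to the first slot */
    }
}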


After writing down this explanation, I think the names aren't
great.



Let me try an alternative explanation.

---
A two-bit field, DRIVER_OWNER, signals the buffer ownership.
It has 4 possible values:
values 0x1, 0x11 are written by driver
values 0x0, 0x10 are written by device

each time driver writes out a descriptor, it must make sure
that the high bit in OWNER changes.

each time device writes out a descriptor, it must make sure
that the high bit in OWNER does not change.



this is exactly the same functionally, DRIVER is high bit and
WRAP is the low bit.  Does this make things clearer?
---



Maybe the difference between device and driver
is confusing. We can fix that by changing the values.
Here is an alternative. Let me know if you like it better -
I need to think a bit more to make sure it works,
but to give you an idea:


---
A two-bit field, DRIVER_OWNER, signals the buffer ownership.
It has 4 possible values:
values 0x1, 0x10 are written by driver
values 0x0, 0x11 are written by device

each time driver writes out a descriptor, it must make sure
that the high bit in OWNER changes. Thus first time
it writes 0x10, next time 0x1, then 0x10 again.

each time device writes out a descriptor, it must make sure
that the low bit in OWNER changes.  Thus first time
it writes 0x11, next time 0x0, then 0x11 again.

---
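
To make the toggling concrete, a tiny self-checking sketch (using the
corrected 0b notation; this variant is only a proposal, not finalized):

#include <assert.h>
#include <stdint.h>

#define OWNER_HI 0x2   /* toggled by driver on every write */
#define OWNER_LO 0x1   /* toggled by device on every write */

int main(void)
{
    uint8_t owner = 0;      /* ring starts zero-initialized */

    owner ^= OWNER_HI;      /* driver publishes: 0b10 */
    assert(owner == 0x2);
    owner ^= OWNER_LO;      /* device uses:      0b11 */
    assert(owner == 0x3);
    owner ^= OWNER_HI;      /* driver publishes: 0b01 */
    assert(owner == 0x1);
    owner ^= OWNER_LO;      /* device uses:      0b00 */
    assert(owner == 0x0);
    return 0;
}

Each side flips only its own bit, so the other side can detect a new write by
comparing against the value it last observed.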



















> > -Original Message-
> > From: Michael S. Tsirkin [mailto:m...@redhat.com]
> > Sent: Thursday, September 28, 2017 7:49 AM
> > To: Steven Luong (sluong)
> > Cc: Liang, Cunming; virtio-dev@lists.oasis-open.org;
> > virtualizat...@lists.linux-foundation.org
> > Subject: Re: [virtio-dev] packed ring layout proposal v3
> > 
> > On Tue, Sep 26, 2017 at 11:38:18PM +, Steven Luong (sluong) wrote:
> > > Michael,
> > >
> > > Would you please give an example or two how these two flags
> > DESC_DRIVER and DESC_WRAP are used together? Like others, I am
> > confused by the description and still don’t quite grok it.
> > >
> > > Steven
> > 
> > My bad, I will need to work on it. Here is an example:
> > 
> > Let's assume device promised to consume packets in order
> > 
> > ring size = 2
> > 
> > Ring is 0 initialized.
> > 
> > Device initially polls DESC[0].flags for WRAP bit to change.
> > 
> > driver adds:
> > 
> > DESC[0].addr = 1234
> > DESC[0].id = 0
> > DESC[0].flags = DESC_DRIVER | DESC_NEXT | DESC_WRAP
> > 
> > and
> > 
> > DESC[0].addr = 5678
> > DESC[1].id = 1
> > DESC[1].flags = DESC_DRIVER | DESC_WRAP
> > 
> > 
> > it now starts polling DESC[0] flags.
> > 
> > 
> > Device reads 1234, executes it, does not use it.
> > 
> > Device reads 5678, executes it, and uses it:
> > 
> > DESC[0].id = 1
> > DESC[0].flags = 0
> > 
> > Device now polls DESC[0].flags for WRAP bit to change.
> > 
> > Now driver sees that DRIVER bit has been cleared, so it knows that id is
> > valid. It sees id 1, therefore ids 0 and 1 have been read and are safe to
> > overwrite.
> > 
> > So it writes it out. It wrapped around to beginning of ring, so

Re: [virtio-dev] packed ring layout proposal v3

2017-09-28 Thread Michael S. Tsirkin
On Thu, Sep 21, 2017 at 01:36:37PM +, Liang, Cunming wrote:
> Hi,
> 
> > -Original Message-
> > From: virtio-dev@lists.oasis-open.org 
> > [mailto:virtio-dev@lists.oasis-open.org] On
> > Behalf Of Michael S. Tsirkin
> > Sent: Sunday, September 10, 2017 1:06 PM
> > To: virtio-dev@lists.oasis-open.org
> > Cc: virtualizat...@lists.linux-foundation.org
> > Subject: [virtio-dev] packed ring layout proposal v3
> > 
> [...]
> > * Batching descriptors:
> > 
> > virtio 1.0 allows passing a batch of descriptors in both directions, by 
> > incrementing
> > the used/avail index by values > 1.
> > At the moment only batching of used descriptors is used.
> > 
> > We can support this by chaining a list of device descriptors through
> > VRING_DESC_F_MORE flag. Device sets this bit to signal driver that this is 
> > part of
> > a batch of used descriptors which are all part of a single transaction.
> 
> It supposes each s/g chain represents a packet, while each descriptor in a
> batching chain represents a packet. There are a few thoughts on the batching
> chain (via VRING_DESC_F_MORE) and the s/g chain (via VRING_DESC_F_NEXT).
> 
> - batching chain: It's up to the device to coalesce the write-out of batched
> used descriptors. As the batching size is variable, the driver has to detect
> the validity of each descriptor unless the number of incoming valid
> descriptors is predictable, so I'm curious what the driver gains from
> VRING_DESC_F_MORE treating batched descriptors as a single transaction. From
> the device perspective, it's great to write out one descriptor for the whole
> chain. However, that assumes no other useful fields in each descriptor of the
> chain need to be written. TX buffer reclaiming can be such a candidate, while
> the RX side has to update 'len' at least. Even for this purpose, instead of
> writing VRING_DESC_F_MORE on a few descriptors to suppress the device
> write-back, it's cheaper to set a flag (e.g. VRING_DESC_F_WB) on a single
> descriptor of the chain to hint the expected position for the device to
> write back.

But the driver does not really benefit from batching and does not know how
many to batch; this depends on the device. E.g. a software device does not
need batching at all, a PCI Express device would want batches to be
multiples of what fits in a PCI Express transaction, etc.  We would have
to provide that info from device to driver.

> - s/g chain: It makes sense to indicate the packet boundary. Considering an
> in-order descriptor ring without VRING_DESC_F_INDIRECT, the next descriptor
> always belongs to the same s/g chain until an end-of-packet indicator occurs.
> So one alternative approach is to set a flag (e.g. VRING_DESC_F_EOP) only on
> the last descriptor of the chain.

EOP would be the reverse of NEXT then? I think it does not matter much,
but NEXT matches what is there in virtio 1.0 right now. It also means that
simple implementations with short buffers can have flags set to 0 which
seems cleaner.


> > 
> > Driver might detect a partial descriptor chain (VRING_DESC_F_MORE set but 
> > next
> > descriptor not valid). In that case it must not use any parts of the chain 
> > - it will
> > later be completed by device, but driver is allowed to store the valid 
> > parts of the
> > chain as device is not allowed to change them anymore.
> As each descriptor represents a whole packet (otherwise it's an s/g chain),

For RX mergeable buffers, a packet is composed of multiple s/g chains.

> I'm wondering why it must not use any parts of the chain.

This is to match what is there in virtio 1.0 right now: driver
does not touch any used descriptors until the used index is updated.




> > 
> > Descriptor should not have both VRING_DESC_F_MORE and
> > VRING_DESC_F_NEXT set.
> > 
> [...]
> > 
> > * Selective use of descriptors
> > 
> > As described above, descriptors with NEXT bit set are part of a 
> > scatter/gather
> > chain and so do not have to cause device to write a used descriptor out.
> > 
> > Similarly, driver can set a flag VRING_DESC_F_MORE in the descriptor to 
> > signal to
> > device that it does not have to write out the used descriptor as it is part 
> > of a batch
> > of descriptors. Device has two options (similar to VRING_DESC_F_NEXT):
> > 
> > Device can write out the same number of descriptors for the batch, setting
> > VRING_DESC_F_MORE for all but the last descriptor.
> > Driver will ignore all used descriptors with VRING_DESC_F_MORE bit set.
> It will write out the last descriptor without VRING_DESC_F_MORE anyway, so
> the following statement does not seem like another option.

I don't understand this statement. All I said is that it's up to the device
whether to write out the descriptors with VRING_DESC_F_MORE, or to skip
the write out.

> > 
> > Device only writes out a single descriptor for the whole batch.
> > However, to keep device and driver in sync, it then skips a number of 
> > descriptors
> > corresponding to the length of the batch before writing out the next used
> > descriptor.
> After detecting a used descriptor driver must find out the length of the
> batch that it built in order to know where to look for the next device
> descriptor.

Re: [virtio-dev] packed ring layout proposal v3

2017-09-28 Thread Michael S. Tsirkin
On Thu, Sep 28, 2017 at 02:49:15AM +0300, Michael S. Tsirkin wrote:
> On Tue, Sep 26, 2017 at 11:38:18PM +, Steven Luong (sluong) wrote:
> > Michael,
> > 
> > Would you please give an example or two how these two flags DESC_DRIVER and 
> > DESC_WRAP are used together? Like others, I am confused by the description 
> > and still don’t quite grok it.
> > 
> > Steven

Note: I made a mistake in the email. Instead of DESC_NEXT it should read
DESC_MORE everywhere. I corrected the quoted text below for simplicity.



> My bad, I will need to work on it. Here is an example:
> 
> Let's assume device promised to consume packets in order
> 
> ring size = 2
> 
> Ring is 0 initialized.
> 
> Device initially polls DESC[0].flags for WRAP bit to change.
> 
> driver adds:
> 
> DESC[0].addr = 1234
> DESC[0].id = 0
> DESC[0].flags = DESC_DRIVER | DESC_MORE | DESC_WRAP
> 
> and
> 
> DESC[0].addr = 5678
> DESC[1].id = 1
> DESC[1].flags = DESC_DRIVER | DESC_WRAP
> 
> 
> it now starts polling DESC[0] flags.
> 
> 
> Device reads 1234, executes it, does not use it.
> 
> Device reads 5678, executes it, and uses it:
> 
> DESC[0].id = 1
> DESC[0].flags = 0
> 
> Device now polls DESC[0].flags for WRAP bit to change.
> 
> Now driver sees that DRIVER bit has been cleared, so it knows that id is
> valid. It sees id 1, therefore ids 0 and 1 have been read and are safe to
> overwrite.
> 
> So it writes it out. It wrapped around to beginning of ring,
> so it flips the WRAP bit to 0 on all descriptors now:
> 
> DESC[0].addr = 9ABC
> DESC[0].id = 0
> DESC[0].flags = DESC_DRIVER | DESC_MORE
> 
> 
> DESC[0].addr = DEF0
> DESC[0].id = 1
> DESC[0].flags = DESC_DRIVER
> 
> 
> Next round wrap will be 1 again.
> 
> 
> To summarise:
> 
> DRIVER bit is used by driver to detect device has used one or more
> descriptors.  WRAP is used by device to detect driver has made a
> new descriptor available.
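
As a sketch of the reclaim step in this walkthrough (assuming in-order use and
ids treated as free-running sequence numbers mod 2^16; the helper below is an
illustration, not spec text): one used descriptor with DRIVER cleared and
id = N completes every id up to and including N.

#include <stdint.h>

#define DESC_DRIVER 0x0080

struct desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* Returns how many buffers became reclaimable at 'used_pos'. */
static int driver_reclaim(const struct desc *used_pos, uint16_t *next_free_id)
{
    if (used_pos->flags & DESC_DRIVER)
        return 0;                        /* device has not used anything new */

    /* ids up to and including used_pos->id are now safe to overwrite */
    uint16_t n = (uint16_t)(used_pos->id - *next_free_id + 1);
    *next_free_id = (uint16_t)(used_pos->id + 1);
    return n;
}

In the example above, reading id 1 with the DRIVER bit clear reclaims both
buffers (ids 0 and 1) in one step.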
> 
> 
> -- 
> MST




RE: [virtio-dev] packed ring layout proposal v3

2017-09-28 Thread Liang, Cunming

Got it now. Please correct me if I'm missing something.


Flag status hints:

- DESC_DRIVER only: driver owns the descriptor, no available info ready for
device to use

- DESC_DRIVER | DESC_WRAP: driver has prepared an available descriptor, device
hasn't used it yet

- None: device has used the descriptor and written it out

- DESC_WRAP only: shall not happen; device makes sure to clear it


Polling behavior:

- Device monitors whether DESC_WRAP is set; if set, it uses the descriptor and
clears the DESC_DRIVER bit at the end (note: it always needs to clear DESC_WRAP)

- Driver monitors whether DESC_DRIVER is cleared; if cleared, it reclaims the
descriptor (sets DESC_DRIVER) and sets DESC_WRAP once a new available
descriptor is ready to go


--
Steve

> -Original Message-
> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Thursday, September 28, 2017 7:49 AM
> To: Steven Luong (sluong)
> Cc: Liang, Cunming; virtio-dev@lists.oasis-open.org;
> virtualizat...@lists.linux-foundation.org
> Subject: Re: [virtio-dev] packed ring layout proposal v3
> 
> On Tue, Sep 26, 2017 at 11:38:18PM +, Steven Luong (sluong) wrote:
> > Michael,
> >
> > Would you please give an example or two how these two flags
> DESC_DRIVER and DESC_WRAP are used together? Like others, I am
> confused by the description and still don’t quite grok it.
> >
> > Steven
> 
> My bad, I will need to work on it. Here is an example:
> 
> Let's assume device promised to consume packets in order
> 
> ring size = 2
> 
> Ring is 0 initialized.
> 
> Device initially polls DESC[0].flags for WRAP bit to change.
> 
> driver adds:
> 
> DESC[0].addr = 1234
> DESC[0].id = 0
> DESC[0].flags = DESC_DRIVER | DESC_NEXT | DESC_WRAP
> 
> and
> 
> DESC[0].addr = 5678
> DESC[1].id = 1
> DESC[1].flags = DESC_DRIVER | DESC_WRAP
> 
> 
> it now starts polling DESC[0] flags.
> 
> 
> Device reads 1234, executes it, does not use it.
> 
> Device reads 5678, executes it, and uses it:
> 
> DESC[0].id = 1
> DESC[0].flags = 0
> 
> Device now polls DESC[0].flags for WRAP bit to change.
> 
> Now driver sees that DRIVER bit has been cleared, so it knows that id is
> valid. It sees id 1, therefore ids 0 and 1 have been read and are safe to
> overwrite.
> 
> So it writes it out. It wrapped around to beginning of ring, so it flips the
> WRAP bit to 0 on all descriptors now:
> 
> DESC[0].addr = 9ABC
> DESC[0].id = 0
> DESC[0].flags = DESC_DRIVER | DESC_NEXT
> 
> 
> DESC[0].addr = DEF0
> DESC[0].id = 1
> DESC[0].flags = DESC_DRIVER
> 
> 
> Next round wrap will be 1 again.
> 
> 
> To summarise:
> 
> DRIVER bit is used by driver to detect device has used one or more
> descriptors.  WRAP is used by device to detect driver has made a new
> descriptor available.
> 
> 
> --
> MST


Re: [virtio-dev] packed ring layout proposal v3

2017-09-27 Thread Michael S. Tsirkin
On Tue, Sep 26, 2017 at 11:38:18PM +, Steven Luong (sluong) wrote:
> Michael,
> 
> Would you please give an example or two how these two flags DESC_DRIVER and 
> DESC_WRAP are used together? Like others, I am confused by the description 
> and still don’t quite grok it.
> 
> Steven

My bad, I will need to work on it. Here is an example:

Let's assume device promised to consume packets in order

ring size = 2

Ring is 0 initialized.

Device initially polls DESC[0].flags for WRAP bit to change.

driver adds:

DESC[0].addr = 1234
DESC[0].id = 0
DESC[0].flags = DESC_DRIVER | DESC_NEXT | DESC_WRAP

and

DESC[0].addr = 5678
DESC[1].id = 1
DESC[1].flags = DESC_DRIVER | DESC_WRAP


it now starts polling DESC[0] flags.


Device reads 1234, executes it, does not use it.

Device reads 5678, executes it, and uses it:

DESC[0].id = 1
DESC[0].flags = 0

Device now polls DESC[0].flags for WRAP bit to change.

Now driver sees that DRIVER bit has been cleared, so it knows that id is
valid. It sees id 1, therefore ids 0 and 1 have been read and are safe to
overwrite.

So it writes it out. It wrapped around to beginning of ring,
so it flips the WRAP bit to 0 on all descriptors now:

DESC[0].addr = 9ABC
DESC[0].id = 0
DESC[0].flags = DESC_DRIVER | DESC_NEXT


DESC[0].addr = DEF0
DESC[0].id = 1
DESC[0].flags = DESC_DRIVER


Next round wrap will be 1 again.


To summarise:

DRIVER bit is used by driver to detect device has used one or more
descriptors.  WRAP is used by device to detect driver has made a
new descriptor available.


-- 
MST




Re: [virtio-dev] packed ring layout proposal v3

2017-09-26 Thread Steven Luong (sluong)
Michael,

Would you please give an example or two how these two flags DESC_DRIVER and 
DESC_WRAP are used together? Like others, I am confused by the description and 
still don’t quite grok it.

Steven

On 9/25/17, 3:24 PM, "virtio-dev@lists.oasis-open.org on behalf of Michael S. 
Tsirkin"  wrote:

On Wed, Sep 20, 2017 at 09:11:57AM +, Liang, Cunming wrote:
> Hi Michael,
> 
> > -Original Message-
> > From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org]
> > On Behalf Of Michael S. Tsirkin
> > Sent: Sunday, September 10, 2017 1:06 PM
> > To: virtio-dev@lists.oasis-open.org
> > Cc: virtualizat...@lists.linux-foundation.org
> > Subject: [virtio-dev] packed ring layout proposal v3
> > 
> [...]
> > * Descriptor ring:
> > 
> > Driver writes descriptors with unique index values and DESC_DRIVER set in
> > flags.
> > Descriptors are written in a ring order: from start to end of ring,
> > wrapping around to the beginning.
> > Device writes used descriptors with correct len, index, and DESC_HW clear.
> > Again descriptors are written in ring order. This might not be the same
> > order of driver descriptors, and not all descriptors have to be written out.
> > 
> > Driver and device are expected to maintain (internally) a wrap-around bit,
> > starting at 0 and changing value each time they start writing out
> > descriptors at the beginning of the ring. This bit is passed as DESC_WRAP
> > bit in the flags field.
> 
> One simple question there, trying to understand the usage of the DESC_WRAP flag.
> 
> DESC_WRAP bit is a new flag since v2. Is it used to address the 'non
> power-of-2 ring sizes' mentioned in v2?
> 
> I'm confused by the statement of the wrap-around bit here - is it an internal
> wrap-around counter represented by a single bit (0/1)?
> DESC_WRAP can appear on any descriptor entry in the ring, so why does it
> highlight changing the value at the beginning of the ring?


No, this is necessary if not all descriptors are overwritten by device
after they are used.

Each time driver overwrites a descriptor, the value in DESC_WRAP changes
which makes it possible for device to detect that there's a new
descriptor.


> > 
> > Flags are always set/cleared last.
> > 
> > Note that driver can write descriptors out in any order, but device will
> > not execute descriptor X+1 until descriptor X has been read as valid.
> > 
> > Driver operation:
> > 
> [...]
> > 
> > DESC_WRAP - device uses this field to detect descriptor change by driver.
> 
> Device uses this field to detect change of wrap-around boundary by driver?
> 
> [...]
> > 
> > Device operation (using descriptors):
> > 
> [...]
> > 
> > DESC_WRAP - driver uses this field to detect descriptor change by device.
> 
> Driver uses this field to detect change of wrap-around boundary by device?
>
> By using this, driver doesn't need to maintain any internal wrap-around
> count, but becomes aware of wrap-around via the DESC_WRAP flag.
> 
> 
> Thanks,
> Steve

So v2 simply said descriptor has a single bit: driver writes 1 there,
device writes 0.

This requires device to overwrite each descriptor and people asked 
for a way to communicate where some descriptors are not overwritten.

This new bit helps device distinguish new and old descriptors written by 
driver.



> > 
> [...]
> > 
> > ---
> > 
> > Note: should this proposal be accepted and approved, one or more
> >   claims disclosed to the TC admin and listed on the Virtio TC
> >   IPR page https://www.oasis-open.org/committees/virtio/ipr.php
> >   might become Essential Claims.
> > Note: the page above is unfortunately out of date and out of
> >   my hands. I'm in the process of updating ipr disclosures
> >   in github instead.  Will make sure all is in place before
> >   this proposal is put to vote. As usual this TC operates under the
> >   Non-Assertion Mode of the OASIS IPR Policy, which protects
> >   anyone implementing the virtio spec.
> > 
> > --
> > MST
> > 





Re: [virtio-dev] packed ring layout proposal v3

2017-09-25 Thread Michael S. Tsirkin
On Wed, Sep 20, 2017 at 09:11:57AM +, Liang, Cunming wrote:
> Hi Michael,
> 
> > -Original Message-
> > From: virtio-dev@lists.oasis-open.org 
> > [mailto:virtio-dev@lists.oasis-open.org]
> > On Behalf Of Michael S. Tsirkin
> > Sent: Sunday, September 10, 2017 1:06 PM
> > To: virtio-dev@lists.oasis-open.org
> > Cc: virtualizat...@lists.linux-foundation.org
> > Subject: [virtio-dev] packed ring layout proposal v3
> > 
> [...]
> > * Descriptor ring:
> > 
> > Driver writes descriptors with unique index values and DESC_DRIVER set in
> > flags.
> > Descriptors are written in a ring order: from start to end of ring, wrapping
> > around to the beginning.
> > Device writes used descriptors with correct len, index, and DESC_HW clear.
> > Again descriptors are written in ring order. This might not be the same 
> > order
> > of driver descriptors, and not all descriptors have to be written out.
> > 
> > Driver and device are expected to maintain (internally) a wrap-around bit,
> > starting at 0 and changing value each time they start writing out 
> > descriptors
> > at the beginning of the ring. This bit is passed as DESC_WRAP bit in the 
> > flags
> > field.
> 
> One simple question there, trying to understand the usage of the DESC_WRAP flag.
> 
> DESC_WRAP bit is a new flag since v2. Is it used to address the 'non power-of-2
> ring sizes' mentioned in v2?
> 
> I'm confused by the statement of the wrap-around bit here - is it an internal
> wrap-around counter represented by a single bit (0/1)?
> DESC_WRAP can appear on any descriptor entry in the ring, so why does it
> highlight changing the value at the beginning of the ring?


No, this is necessary if not all descriptors are overwritten by device
after they are used.

Each time driver overwrites a descriptor, the value in DESC_WRAP changes
which makes it possible for device to detect that there's a new
descriptor.
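
A sketch of the device-side check this enables (one plausible reading; the
exact flag handling was still under discussion in this thread):

#include <stdbool.h>
#include <stdint.h>

#define DESC_WRAP   0x0040
#define DESC_DRIVER 0x0080

/* The device keeps its own expected wrap value and flips it each time it
 * passes slot 0.  A descriptor is newly available when the driver wrote it
 * (DRIVER set) with the wrap value of the current pass. */
static bool desc_avail(uint16_t flags, bool expected_wrap)
{
    return (flags & DESC_DRIVER) &&
           (((flags & DESC_WRAP) != 0) == expected_wrap);
}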


> > 
> > Flags are always set/cleared last.
> > 
> > Note that driver can write descriptors out in any order, but device will not
> > execute descriptor X+1 until descriptor X has been read as valid.
> > 
> > Driver operation:
> > 
> [...]
> > 
> > DESC_WRAP - device uses this field to detect descriptor change by driver.
> 
> Device uses this field to detect change of wrap-around boundary by driver? 
> 
> [...]
> > 
> > Device operation (using descriptors):
> > 
> [...]
> > 
> > DESC_WRAP - driver uses this field to detect descriptor change by device.
> 
> Driver uses this field to detect change of wrap-around boundary by device?
>
> By using this, driver doesn't need to maintain any internal wrap-around
> count, but becomes aware of wrap-around via the DESC_WRAP flag.
> 
> 
> Thanks,
> Steve

So v2 simply said descriptor has a single bit: driver writes 1 there,
device writes 0.

This requires device to overwrite each descriptor and people asked 
for a way to communicate where some descriptors are not overwritten.

This new bit helps device distinguish new and old descriptors written by driver.



> > 
> [...]
> > 
> > ---
> > 
> > Note: should this proposal be accepted and approved, one or more
> >   claims disclosed to the TC admin and listed on the Virtio TC
> >   IPR page https://www.oasis-open.org/committees/virtio/ipr.php
> >   might become Essential Claims.
> > Note: the page above is unfortunately out of date and out of
> >   my hands. I'm in the process of updating ipr disclosures
> >   in github instead.  Will make sure all is in place before
> >   this proposal is put to vote. As usual this TC operates under the
> >   Non-Assertion Mode of the OASIS IPR Policy, which protects
> >   anyone implementing the virtio spec.
> > 
> > --
> > MST
> > 



RE: [virtio-dev] packed ring layout proposal v3

2017-09-21 Thread Liang, Cunming
Hi,

> -Original Message-
> From: virtio-dev@lists.oasis-open.org 
> [mailto:virtio-dev@lists.oasis-open.org] On
> Behalf Of Michael S. Tsirkin
> Sent: Sunday, September 10, 2017 1:06 PM
> To: virtio-dev@lists.oasis-open.org
> Cc: virtualizat...@lists.linux-foundation.org
> Subject: [virtio-dev] packed ring layout proposal v3
> 
[...]
> * Batching descriptors:
> 
> virtio 1.0 allows passing a batch of descriptors in both directions, by 
> incrementing
> the used/avail index by values > 1.
> At the moment only batching of used descriptors is used.
> 
> We can support this by chaining a list of device descriptors through
> VRING_DESC_F_MORE flag. Device sets this bit to signal driver that this is 
> part of
> a batch of used descriptors which are all part of a single transaction.

It supposes each s/g chain represents a packet, while each descriptor in a
batching chain represents a packet. There are a few thoughts on the batching
chain (via VRING_DESC_F_MORE) and the s/g chain (via VRING_DESC_F_NEXT).

- batching chain: It's up to the device to coalesce the write-out of batched
used descriptors. As the batching size is variable, the driver has to detect
the validity of each descriptor unless the number of incoming valid descriptors
is predictable, so I'm curious what the driver gains from VRING_DESC_F_MORE
treating batched descriptors as a single transaction. From the device
perspective, it's great to write out one descriptor for the whole chain.
However, that assumes no other useful fields in each descriptor of the chain
need to be written. TX buffer reclaiming can be such a candidate, while the RX
side has to update 'len' at least. Even for this purpose, instead of writing
VRING_DESC_F_MORE on a few descriptors to suppress the device write-back, it's
cheaper to set a flag (e.g. VRING_DESC_F_WB) on a single descriptor of the
chain to hint the expected position for the device to write back.

- s/g chain: It makes sense to indicate the packet boundary. Considering an
in-order descriptor ring without VRING_DESC_F_INDIRECT, the next descriptor
always belongs to the same s/g chain until an end-of-packet indicator occurs.
So one alternative approach is to set a flag (e.g. VRING_DESC_F_EOP) only on
the last descriptor of the chain.

> 
> Driver might detect a partial descriptor chain (VRING_DESC_F_MORE set but next
> descriptor not valid). In that case it must not use any parts of the chain - 
> it will
> later be completed by device, but driver is allowed to store the valid parts 
> of the
> chain as device is not allowed to change them anymore.
As each descriptor represents a whole packet (otherwise it's an s/g chain),
I'm wondering why it must not use any parts of the chain.

> 
> Descriptor should not have both VRING_DESC_F_MORE and
> VRING_DESC_F_NEXT set.
> 
[...]
> 
> * Selective use of descriptors
> 
> As described above, descriptors with NEXT bit set are part of a scatter/gather
> chain and so do not have to cause device to write a used descriptor out.
> 
> Similarly, driver can set a flag VRING_DESC_F_MORE in the descriptor to 
> signal to
> device that it does not have to write out the used descriptor as it is part 
> of a batch
> of descriptors. Device has two options (similar to VRING_DESC_F_NEXT):
> 
> Device can write out the same number of descriptors for the batch, setting
> VRING_DESC_F_MORE for all but the last descriptor.
> Driver will ignore all used descriptors with VRING_DESC_F_MORE bit set.
It will write out the last descriptor without VRING_DESC_F_MORE anyway, so the
following statement does not seem like another option.

> 
> Device only writes out a single descriptor for the whole batch.
> However, to keep device and driver in sync, it then skips a number of 
> descriptors
> corresponding to the length of the batch before writing out the next used
> descriptor.
> After detecting a used descriptor driver must find out the length of the 
> batch that
> it built in order to know where to look for the next device descriptor.
It would be good to keep it simple on the device side, and to have the driver
control the expectation.
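
A sketch of that second option, with the device writing one used descriptor
per batch and then skipping ahead (names and signature are assumptions for
illustration):

#include <stdint.h>

struct desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* Complete a batch of batch_len descriptors starting at 'first' with a
 * single used write, then skip the rest of the batch so device and
 * driver stay in sync.  Returns the slot for the next used write. */
static uint16_t device_complete_batch(struct desc *ring, uint16_t ring_size,
                                      uint16_t first, uint16_t batch_len,
                                      uint16_t last_id, uint32_t total_len)
{
    ring[first].id  = last_id;     /* id of the last descriptor in the batch */
    ring[first].len = total_len;
    ring[first].flags = 0;         /* clear DESC_DRIVER last: batch is used */

    return (uint16_t)((first + batch_len) % ring_size);
}

The driver, knowing the batch length it built, performs the same skip when
scanning for the next used descriptor.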

> 
> 
> TODO (blocker): skipping descriptors for selective and scatter/gather seem to 
> be
> only requested with in-order right now. Let's require in-order for this 
> skipping?
> This will simplify the accounting by driver.
> 
> 

Thanks,
Steve




RE: [virtio-dev] packed ring layout proposal v3

2017-09-20 Thread Liang, Cunming
Hi Michael,

> -Original Message-
> From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org]
> On Behalf Of Michael S. Tsirkin
> Sent: Sunday, September 10, 2017 1:06 PM
> To: virtio-dev@lists.oasis-open.org
> Cc: virtualizat...@lists.linux-foundation.org
> Subject: [virtio-dev] packed ring layout proposal v3
> 
[...]
> * Descriptor ring:
> 
> Driver writes descriptors with unique index values and DESC_DRIVER set in
> flags.
> Descriptors are written in a ring order: from start to end of ring, wrapping
> around to the beginning.
> Device writes used descriptors with correct len, index, and DESC_HW clear.
> Again descriptors are written in ring order. This might not be the same order
> of driver descriptors, and not all descriptors have to be written out.
> 
> Driver and device are expected to maintain (internally) a wrap-around bit,
> starting at 0 and changing value each time they start writing out descriptors
> at the beginning of the ring. This bit is passed as DESC_WRAP bit in the flags
> field.

One simple question there, trying to understand the usage of the DESC_WRAP flag.

DESC_WRAP bit is a new flag since v2. Is it used to address the 'non power-of-2
ring sizes' mentioned in v2?

I'm confused by the statement of the wrap-around bit here - is it an internal
wrap-around counter represented by a single bit (0/1)?
DESC_WRAP can appear on any descriptor entry in the ring, so why does it
highlight changing the value at the beginning of the ring?

> 
> Flags are always set/cleared last.
> 
> Note that driver can write descriptors out in any order, but device will not
> execute descriptor X+1 until descriptor X has been read as valid.
> 
> Driver operation:
> 
[...]
> 
> DESC_WRAP - device uses this field to detect descriptor change by driver.

Device uses this field to detect change of wrap-around boundary by driver? 

[...]
> 
> Device operation (using descriptors):
> 
[...]
> 
> DESC_WRAP - driver uses this field to detect descriptor change by device.

Driver uses this field to detect change of wrap-around boundary by device?

By using this, driver doesn't need to maintain any internal wrap-around count,
but becomes aware of wrap-around via the DESC_WRAP flag.


Thanks,
Steve

> 
[...]
> 
> ---
> 
> Note: should this proposal be accepted and approved, one or more
>   claims disclosed to the TC admin and listed on the Virtio TC
>   IPR page https://www.oasis-open.org/committees/virtio/ipr.php
>   might become Essential Claims.
> Note: the page above is unfortunately out of date and out of
>   my hands. I'm in the process of updating ipr disclosures
>   in github instead.  Will make sure all is in place before
>   this proposal is put to vote. As usual this TC operates under the
>   Non-Assertion Mode of the OASIS IPR Policy, which protects
>   anyone implementing the virtio spec.
> 
> --
> MST
> 



Re: [virtio-dev] packed ring layout proposal v3

2017-09-12 Thread Willem de Bruijn
On Sun, Sep 10, 2017 at 1:06 AM, Michael S. Tsirkin  wrote:
> This is an update from v2 version.
> Changes:
> - update event suppression mechanism
> - add wrap counter: DESC_WRAP flag in addition to
>   DESC_DRIVER flag used for validity so device does not have to
>   write out all used descriptors.
> - more options especially helpful for hardware implementations
> - in-order processing option due to popular demand
> - list of TODO items to consider as a follow-up, only two are
>   open questions we need to decide now, marked as blocker,
>   others are future enhancement ideas.

Perhaps this would make a good topic for a BoF session at the upcoming
netdev. A new ring structure can be useful elsewhere, too, such as for
af_packet v4.

>
> ---
>
> Performance analysis of this is in my kvm forum 2016 presentation.
> The idea is to have a r/w descriptor in a ring structure,
> replacing the used and available ring, index and descriptor
> buffer.
>
> Note: the following mode of operation will actually work
> without changes when descriptor rings do not overlap, with driver
> writing out available entries in a read-only driver descriptor ring,
> device writing used entries in a write-only device descriptor ring.

The ring is always read-write, as the consumer has to toggle the
DESC_DRIVER flag, right? Which mode are you referring to?

> TODO: does this have any value for some? E.g. as a security feature?
>
>
> * Descriptor ring:
>
> Driver writes descriptors with unique index values and DESC_DRIVER set in 
> flags.
> Descriptors are written in a ring order: from start to end of ring,
> wrapping around to the beginning.
> Device writes used descriptors with correct len, index, and DESC_HW clear.
> Again descriptors are written in ring order. This might not be the same
> order of driver descriptors, and not all descriptors have to be written out.

Virtio rings are used in both directions between guest device driver and
host device, as well as in increasingly many situations outside vm i/o.

I suggest using producer and consumer instead of driver and device
when describing ring operations.

When working on the virtio-net tx path, having to invert all the documentation
in my head, because it is written from an rx point of view, was a bit tricky ;-)

I would then also convert DESC_DRIVER to DESC_VALID or so.

> Driver and device are expected to maintain (internally) a wrap-around
> bit, starting at 0 and changing value each time they start writing out
> descriptors at the beginning of the ring. This bit is passed as
> DESC_WRAP bit in the flags field.

So, the flag effectively doubles the namespace of the id from 16 bit to
17 bit? Instead, how about using a larger identifier, such as 32 bit?

This also future-proofs the design for cases where the ring may grow
to exceed 65536 entries. Doing so is not a short term change, but it
would avoid the need for indirect descriptors and give greater room
for out of order acknowledgement.

> Flags are always set/cleared last.
>
> Note that driver can write descriptors out in any order, but device
> will not execute descriptor X+1 until descriptor X has been
> read as valid.

Why this constraint on the ring?

> Driver operation:
>
> Driver makes descriptors available to device by writing out descriptors
> in the descriptor ring. Once ring is full, driver waits for device to
> use some descriptors before making more available.
>
> Descriptors can be used by device in any order, but must be read from
> ring in-order, and must be read completely before starting use.  Thus,
> once a descriptor is used, driver can over-write both this descriptor
> and any descriptors which preceded it in the ring.

Does this mean that completing a descriptor by the consumer implicitly
completes all descriptors that precede it in the ring?

> Driver can detect use of descriptor either by device specific means
> (e.g. detect a buffer data change by device) or in a generic way
> by detecting that a used buffer has been written out by device.

I don't quite follow this.

> Driver writes out available scatter/gather descriptor entries in guest
> descriptor format:
>
>
> #define DESC_WRAP 0x0040
> #define DESC_DRIVER 0x0080
>
> struct desc {
> __le64 addr;
> __le32 len;
> __le16 index;
> __le16 flags;
> };
>
> Fields:
>
> addr - physical address of a s/g entry
> len - length of an entry

Is this ever larger than 16-bit? If not, then reducing to 16-bit allows
growing index to 32-bit.

> index - unique index.  The low $\lceil log(N) \rceil - 1$
>   bits of the index is a driver-supplied value which can have any value
>   (under driver control).  The high bits are reserved and should be
>   set to 0.
>
> flags - descriptor flags.
>
> Descriptors written by driver have DESC_DRIVER set.
>
> Writing out this field makes the descriptor available for the device to use,
> so all other fields must be valid when it is written.
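
The "flags last" rule is effectively a release store; a sketch of what that
looks like on the driver side (barrier choice and names are assumptions,
host-endian types for brevity):

#include <stdint.h>

#define DESC_DRIVER 0x0080

struct desc {
    uint64_t addr;
    uint32_t len;
    uint16_t index;
    uint16_t flags;
};

static void publish(struct desc *d, uint64_t addr, uint32_t len,
                    uint16_t index, uint16_t flags)
{
    d->addr  = addr;
    d->len   = len;
    d->index = index;
    /* Make all fields visible before the flags store hands the
     * descriptor to the device. */
    __atomic_thread_fence(__ATOMIC_RELEASE);
    d->flags = (uint16_t)(flags | DESC_DRIVER);
}
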
>
> DESC_WRAP - device uses this field to detect descriptor change by driver.

Re: [virtio-dev] packed ring layout proposal v2

2017-07-20 Thread Michael S. Tsirkin
On Wed, Jul 19, 2017 at 07:41:55AM +, Lior Narkis wrote:
> 
> 
> > -Original Message-
> > From: Michael S. Tsirkin [mailto:m...@redhat.com]
> > Sent: Tuesday, July 18, 2017 7:23 PM
> > To: Lior Narkis 
> > Cc: virtio-dev@lists.oasis-open.org; 
> > virtualizat...@lists.linux-foundation.org
> > Subject: Re: [virtio-dev] packed ring layout proposal v2
> > 
> > On Sun, Jul 16, 2017 at 06:00:45AM +, Lior Narkis wrote:
> > > > -Original Message-
> > > > From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-
> > open.org]
> > > > On Behalf Of Michael S. Tsirkin
> > > > Sent: Wednesday, February 08, 2017 5:20 AM
> > > > To: virtio-dev@lists.oasis-open.org
> > > > Cc: virtualizat...@lists.linux-foundation.org
> > > > Subject: [virtio-dev] packed ring layout proposal v2
> > > >
> > > > This is an update from v1 version.
> > > > Changes:
> > > > - update event suppression mechanism
> > > > - separate options for indirect and direct s/g
> > > > - lots of new features
> > > >
> > > > ---
> > > >
> > > > Performance analysis of this is in my kvm forum 2016 presentation.
> > > > The idea is to have a r/w descriptor in a ring structure,
> > > > replacing the used and available ring, index and descriptor
> > > > buffer.
> > > >
> > > > * Descriptor ring:
> > > >
> > > > Guest adds descriptors with unique index values and DESC_HW set in 
> > > > flags.
> > > > Host overwrites used descriptors with correct len, index, and DESC_HW
> > > > clear.  Flags are always set/cleared last.
> > > >
> > > > #define DESC_HW 0x0080
> > > >
> > > > struct desc {
> > > > __le64 addr;
> > > > __le32 len;
> > > > __le16 index;
> > > > __le16 flags;
> > > > };
> > > >
> > > > When DESC_HW is set, descriptor belongs to device. When it is clear,
> > > > it belongs to the driver.
> > > >
> > > > We can use 1 bit to set direction
> > > > /* This marks a buffer as write-only (otherwise read-only). */
> > > > #define VRING_DESC_F_WRITE  2
> > > >
> > >
> > > A valid bit per descriptor does not let the device do an efficient
> > > prefetch.
> > > An alternative is to use a producer index (PI).
> > > Using the PI posted by the driver, and the Consumer Index (CI) maintained
> > > by the device, the device knows how much work it has outstanding, so it
> > > can do the prefetch accordingly.
> > > There are a few options for the device to acquire the PI.
> > > Most efficient will be to write the PI in the doorbell together with the
> > > queue number.
> > 
> > Right. This was suggested in "Fwd: Virtio-1.1 Ring Layout".
> > Or just the PI if we don't need the queue number.
> > 
> > > I would like to raise the need for a Completion Queue (CQ).
> > > Multiple Work Queues (which hold the work descriptors, WQ in short) can
> > > be connected to a single CQ.
> > > So when the device completes the work on the descriptor, it writes a
> > > Completion Queue Entry (CQE) to the CQ.
> > 
> > This seems similar to the design we currently have with a separate used
> > ring. It seems to underperform writing into the available ring
> > at least on a microbenchmark. The reason seems to be that
> > more cache lines need to get invalidated and re-fetched.
> 
> 
> Few points on that:
> Each PCIe write will cause invalidation of a cache line.
> Writing less than a cache line is inefficient, so it is better to put all 
> metadata together and allocate a cache line for it.
> Putting the metadata in the data buffer means two writes of less than a cache 
> line each, both will be accessed by the driver, so potential two misses.
> 

I'm not sure how this is related to your suggestion. Sorry about being
dense.  You suggested a separate used ring that can also be shared with
multiple available rings. Current design in effect makes used and
available rings overlap instead. One side effect is each entry is bigger
now (16 bytes as opposed to 8 bytes previously) so it should be easier
to move metadata from e.g. virtio net header to the descriptor entry.


> > 
> > > CQEs are contiguous in memory so prefetching by the driver is efficient,
> > > although the device might complete work descriptors out of order.

RE: [virtio-dev] packed ring layout proposal v2

2017-07-19 Thread Lior Narkis


> -Original Message-
> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Tuesday, July 18, 2017 7:23 PM
> To: Lior Narkis 
> Cc: virtio-dev@lists.oasis-open.org; virtualizat...@lists.linux-foundation.org
> Subject: Re: [virtio-dev] packed ring layout proposal v2
> 
> On Sun, Jul 16, 2017 at 06:00:45AM +, Lior Narkis wrote:
> > > -Original Message-
> > > From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-
> open.org]
> > > On Behalf Of Michael S. Tsirkin
> > > Sent: Wednesday, February 08, 2017 5:20 AM
> > > To: virtio-dev@lists.oasis-open.org
> > > Cc: virtualizat...@lists.linux-foundation.org
> > > Subject: [virtio-dev] packed ring layout proposal v2
> > >
> > > This is an update from v1 version.
> > > Changes:
> > > - update event suppression mechanism
> > > - separate options for indirect and direct s/g
> > > - lots of new features
> > >
> > > ---
> > >
> > > Performance analysis of this is in my kvm forum 2016 presentation.
> > > The idea is to have a r/w descriptor in a ring structure,
> > > replacing the used and available ring, index and descriptor
> > > buffer.
> > >
> > > * Descriptor ring:
> > >
> > > Guest adds descriptors with unique index values and DESC_HW set in flags.
> > > Host overwrites used descriptors with correct len, index, and DESC_HW
> > > clear.  Flags are always set/cleared last.
> > >
> > > #define DESC_HW 0x0080
> > >
> > > struct desc {
> > > __le64 addr;
> > > __le32 len;
> > > __le16 index;
> > > __le16 flags;
> > > };
> > >
> > > When DESC_HW is set, descriptor belongs to device. When it is clear,
> > > it belongs to the driver.
> > >
> > > We can use 1 bit to set direction
> > > /* This marks a buffer as write-only (otherwise read-only). */
> > > #define VRING_DESC_F_WRITE  2
> > >
> >
> > A valid bit per descriptor does not let the device do an efficient prefetch.
> > An alternative is to use a producer index (PI).
> > Using the PI posted by the driver, and the Consumer Index (CI) maintained
> > by the device, the device knows how much work it has outstanding, so it
> > can do the prefetch accordingly.
> > There are a few options for the device to acquire the PI.
> > Most efficient will be to write the PI in the doorbell together with the
> > queue number.
> 
> Right. This was suggested in "Fwd: Virtio-1.1 Ring Layout".
> Or just the PI if we don't need the queue number.
> 
> > I would like to raise the need for a Completion Queue (CQ).
> > Multiple Work Queues (which hold the work descriptors, WQ in short) can be
> > connected to a single CQ.
> > So when the device completes the work on the descriptor, it writes a
> > Completion Queue Entry (CQE) to the CQ.
> 
> This seems similar to the design we currently have with a separate used
> ring. It seems to underperform writing into the available ring
> at least on a microbenchmark. The reason seems to be that
> more cache lines need to get invalidated and re-fetched.


Few points on that:
Each PCIe write will cause invalidation of a cache line.
Writing less than a cache line is inefficient, so it is better to put all 
metadata together and allocate a cache line for it.
Putting the metadata in the data buffer means two writes of less than a cache 
line each, both will be accessed by the driver, so potential two misses.


> 
> > CQEs are contiguous in memory so prefetching by the driver is efficient,
> > although the device might complete work descriptors out of order.
> 
> That's not different from proposal you are replying to - descriptors
> can be used and written out in any order, they are contiguous
> so driver can prefetch. 

Point is that if descriptors 1, 2, 4 are being completed in that order, in the 
proposed layout the completion indication will be placed at 1, 2, 4 in the 
virtq desc buffer.
With a CQ, completions on 1, 2, 4 will be placed at 1, 2, 3 CQ indexes.
This is why it is better for prefetching.
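
A sketch of that CQ write path (types and layout here are hypothetical, just
to show why the driver's poll target stays sequential):

#include <stdint.h>

struct cqe {
    uint16_t wq_index;   /* which work descriptor completed, e.g. 1, 2, 4 */
    uint32_t len;
    uint8_t  valid;      /* written last */
};

static void cq_post(struct cqe *cq, uint32_t *cq_pi, uint32_t cq_mask,
                    uint16_t wq_index, uint32_t len)
{
    struct cqe *e = &cq[*cq_pi & cq_mask];

    e->wq_index = wq_index;   /* completions may be out of order ... */
    e->len      = len;
    e->valid    = 1;          /* ... but CQ slots are used in order */
    (*cq_pi)++;
}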

> I'd like to add that attempts to
> add prefetch e.g. for the virtio used ring never showed any
> conclusive gains - some workloads would get a minor speedup, others
> lose a bit of performance. I do not think anyone looked
> deeply into why that was the case. It would be easy for you to add this
> code to current virtio drivers and/or devices and try it out.

Noted.
I will say though that mlx5 uses prefetch and gets good performance because of 
it...

> 
> > The in

Re: [virtio-dev] packed ring layout proposal v2

2017-07-18 Thread Michael S. Tsirkin
On Sun, Jul 16, 2017 at 06:00:45AM +, Lior Narkis wrote:
> > -Original Message-
> > From: virtio-dev@lists.oasis-open.org 
> > [mailto:virtio-dev@lists.oasis-open.org]
> > On Behalf Of Michael S. Tsirkin
> > Sent: Wednesday, February 08, 2017 5:20 AM
> > To: virtio-dev@lists.oasis-open.org
> > Cc: virtualizat...@lists.linux-foundation.org
> > Subject: [virtio-dev] packed ring layout proposal v2
> > 
> > This is an update from v1 version.
> > Changes:
> > - update event suppression mechanism
> > - separate options for indirect and direct s/g
> > - lots of new features
> > 
> > ---
> > 
> > Performance analysis of this is in my kvm forum 2016 presentation.
> > The idea is to have a r/w descriptor in a ring structure,
> > replacing the used and available ring, index and descriptor
> > buffer.
> > 
> > * Descriptor ring:
> > 
> > Guest adds descriptors with unique index values and DESC_HW set in flags.
> > Host overwrites used descriptors with correct len, index, and DESC_HW
> > clear.  Flags are always set/cleared last.
> > 
> > #define DESC_HW 0x0080
> > 
> > struct desc {
> > __le64 addr;
> > __le32 len;
> > __le16 index;
> > __le16 flags;
> > };
> > 
> > When DESC_HW is set, descriptor belongs to device. When it is clear,
> > it belongs to the driver.
> > 
> > We can use 1 bit to set direction
> > /* This marks a buffer as write-only (otherwise read-only). */
> > #define VRING_DESC_F_WRITE  2
> > 
> 
> A valid bit per descriptor does not let the device do an efficient prefetch.
> An alternative is to use a producer index (PI).
> Using the PI posted by the driver, and the Consumer Index (CI) maintained by 
> the device, the device knows how much work it has outstanding, so it can do 
> the prefetch accordingly. 
> There are a few options for the device to acquire the PI.
> Most efficient would be to write the PI in the doorbell together with the 
> queue number.

Right. This was suggested in "Fwd: Virtio-1.1 Ring Layout".
Or just the PI if we don't need the queue number.

> I would like to raise the need for a Completion Queue (CQ).
> Multiple Work Queues (which hold the work descriptors; WQ for short) can be 
> connected to a single CQ.
> So when the device completes the work on the descriptor, it writes a 
> Completion Queue Entry (CQE) to the CQ.

This seems similar to the design we currently have with a separate used
ring. It seems to underperform writing into the available ring
at least on a microbenchmark. The reason seems to be that
more cache lines need to get invalidated and re-fetched.

> CQEs are contiguous in memory so prefetching by the driver is efficient, 
> although the device might complete work descriptors out of order.

That's no different from the proposal you are replying to - descriptors
can be used and written out in any order, and they are contiguous
so the driver can prefetch. I'd like to add that attempts to
add prefetch e.g. for the virtio used ring never showed any
conclusive gains - some workloads would get a minor speedup, others
lose a bit of performance. I do not think anyone looked
deeply into why that was the case. It would be easy for you to add this
code to current virtio drivers and/or devices and try it out.

> The interrupt handler is connected to the CQ, so a single CQ can be allocated 
> per core, with a single interrupt handler, although this core 
> might be using multiple WQs.

Sending a single interrupt from multiple rings might save some
> cycles. Event index/interrupt disable are currently in
RAM so access is very cheap for the guest.
If you are going to share, just disable all interrupts
when you start processing.

I was actually wondering how people want to do this in hardware.
Are you going to read the event index after each descriptor?

It might make sense to move event index/flags into device memory. And
once you do this, writing these out might become slower, and then some
kind of sharing of the event index might help.

No one suggested anything like this so far - maybe it's less of an issue
than I think, e.g. because of interrupt coalescing (which virtio also
does not have, but could be added if there is interest) or maybe people
just mostly care about polling performance.

> One application for multiple WQs with a single CQ is Quality of Service (QoS).
> A user can open a WQ per QoS value (PCP value, for example), and the device 
> will schedule the work accordingly.

A ring per QoS level might make sense, e.g. in a network device. I don't
see why it's helpful to write out completed entries into a separate
ring for that though.

As we don't have QOS support in virtio net at all right now,
would that be best discussed once we have a working prototype
and everyone can see what the problem is?


> > * Scatter/gather support
> > 
> > We can use 1 bit to chain s/g entries in a request, same as virtio 1.0:
> > 
> > /* This marks a buffer as continuing via the next field. */
> > #define VRING_DESC_F_NEXT   1
> > 
> > Unlike virtio

RE: [virtio-dev] packed ring layout proposal v2

2017-07-15 Thread Lior Narkis
> -Original Message-
> From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org]
> On Behalf Of Michael S. Tsirkin
> Sent: Wednesday, February 08, 2017 5:20 AM
> To: virtio-dev@lists.oasis-open.org
> Cc: virtualizat...@lists.linux-foundation.org
> Subject: [virtio-dev] packed ring layout proposal v2
> 
> This is an update from v1 version.
> Changes:
> - update event suppression mechanism
> - separate options for indirect and direct s/g
> - lots of new features
> 
> ---
> 
> Performance analysis of this is in my kvm forum 2016 presentation.
> The idea is to have a r/w descriptor in a ring structure,
> replacing the used and available ring, index and descriptor
> buffer.
> 
> * Descriptor ring:
> 
> Guest adds descriptors with unique index values and DESC_HW set in flags.
> Host overwrites used descriptors with correct len, index, and DESC_HW
> clear.  Flags are always set/cleared last.
> 
> #define DESC_HW 0x0080
> 
> struct desc {
> __le64 addr;
> __le32 len;
> __le16 index;
> __le16 flags;
> };
> 
> When DESC_HW is set, descriptor belongs to device. When it is clear,
> it belongs to the driver.
> 
> We can use 1 bit to set direction
> /* This marks a buffer as write-only (otherwise read-only). */
> #define VRING_DESC_F_WRITE  2
> 

A valid bit per descriptor does not let the device do an efficient prefetch.
An alternative is to use a producer index (PI).
Using the PI posted by the driver, and the Consumer Index (CI) maintained by the 
device, the device knows how much work it has outstanding, so it can do the 
prefetch accordingly. 
There are a few options for the device to acquire the PI.
Most efficient would be to write the PI in the doorbell together with the queue 
number.
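
As a sketch of that option (the doorbell register layout here is an assumption
of mine, not something the thread defines), the driver could post the queue
number and PI in a single 32-bit write, and the device derives its outstanding
work from PI and CI:

    #include <stdint.h>

    /* Hypothetical doorbell format: queue number in the high 16 bits,
     * PI in the low 16 bits, posted as one MMIO write. */
    static void ring_doorbell(volatile uint32_t *db, uint16_t qn, uint16_t pi)
    {
        *db = ((uint32_t)qn << 16) | pi;
    }

    /* Device side: with free-running 16-bit indexes, the wrapped
     * difference is the number of descriptors not yet consumed. */
    static uint16_t outstanding(uint16_t pi, uint16_t ci)
    {
        return (uint16_t)(pi - ci);
    }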

I would like to raise the need for a Completion Queue (CQ).
Multiple Work Queues (which hold the work descriptors; WQ for short) can be 
connected to a single CQ.
So when the device completes the work on the descriptor, it writes a Completion 
Queue Entry (CQE) to the CQ.
CQEs are contiguous in memory so prefetching by the driver is efficient, 
although the device might complete work descriptors out of order.
The interrupt handler is connected to the CQ, so a single CQ can be allocated 
per core, with a single interrupt handler, although this core might 
be using multiple WQs.

One application for multiple WQs with a single CQ is Quality of Service (QoS).
A user can open a WQ per QoS value (PCP value, for example), and the device will 
schedule the work accordingly.

> * Scatter/gather support
> 
> We can use 1 bit to chain s/g entries in a request, same as virtio 1.0:
> 
> /* This marks a buffer as continuing via the next field. */
> #define VRING_DESC_F_NEXT   1
> 
> Unlike virtio 1.0, all descriptors must have distinct ID values.
> 
> Also unlike virtio 1.0, use of this flag will be an optional feature
> (e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it.
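
A minimal sketch of publishing a two-entry chain under these rules (buffer
addresses, lengths and ids are hypothetical caller-supplied values; ring
wrap-around is elided; struct desc and the flag values are as quoted above):

    /* Distinct index values on every descriptor, NEXT on all but the
     * last. DESC_HW on the head goes last, after a release barrier,
     * since the device only watches the first descriptor submitted. */
    static void post_chain(struct desc *ring, unsigned head,
                           uint64_t buf0_pa, uint32_t buf0_len, uint16_t id0,
                           uint64_t buf1_pa, uint32_t buf1_len, uint16_t id1)
    {
        ring[head + 1].addr  = buf1_pa;
        ring[head + 1].len   = buf1_len;
        ring[head + 1].index = id1;
        ring[head + 1].flags = DESC_HW;      /* tail handed off first */

        ring[head].addr  = buf0_pa;
        ring[head].len   = buf0_len;
        ring[head].index = id0;
        __atomic_thread_fence(__ATOMIC_RELEASE);
        ring[head].flags = DESC_HW | VRING_DESC_F_NEXT;  /* set last */
    }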
> 
> * Indirect buffers
> 
> Can be marked like in virtio 1.0:
> 
> /* This means the buffer contains a table of buffer descriptors. */
> #define VRING_DESC_F_INDIRECT   4
> 
> Unlike virtio 1.0, this is a table, not a list:
> struct indirect_descriptor_table {
> /* The actual descriptors (16 bytes each) */
> struct virtq_desc desc[len / 16];
> };
> 
> The first descriptor is located at start of the indirect descriptor
> table, additional indirect descriptors come immediately afterwards.
> DESC_F_WRITE is the only valid flag for descriptors in the indirect
> table. Others should be set to 0 and are ignored.  id is also set to 0
> and should be ignored.
> 
> virtio 1.0 seems to allow a s/g entry followed by
> an indirect descriptor. This does not seem useful,
> so we do not allow that anymore.
> 
> This support would be an optional feature, same as in virtio 1.0
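
A short sketch of publishing an indirect table under these rules (alloc_dma
and the *_pa variables are hypothetical helpers, not from the proposal):

    /* Physically contiguous table; only DESC_F_WRITE is meaningful
     * inside it, other flags and the id stay 0. */
    struct desc *tbl = alloc_dma(2 * sizeof(struct desc));
    tbl[0] = (struct desc){ .addr = out_pa, .len = out_len, .index = 0,
                            .flags = 0 };
    tbl[1] = (struct desc){ .addr = in_pa,  .len = in_len,  .index = 0,
                            .flags = VRING_DESC_F_WRITE };

    /* The ring descriptor points at the table; its len is the table
     * size in bytes, so the device sees len / 16 entries. */
    ring[head].addr  = tbl_pa;
    ring[head].len   = 2 * sizeof(struct desc);
    ring[head].index = id;
    __atomic_thread_fence(__ATOMIC_RELEASE);
    ring[head].flags = VRING_DESC_F_INDIRECT | DESC_HW;  /* flags set last */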
> 
> * Batching descriptors:
> 
> virtio 1.0 allows passing a batch of descriptors in both directions, by
> incrementing the used/avail index by values > 1.  We can support this by
> chaining a list of descriptors through a bit in the flags field.
> To allow use together with s/g, a different bit will be used.
> 
> #define VRING_DESC_F_BATCH_NEXT 0x0010
> 
> Batching works for both driver and device descriptors.
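
For example (a sketch only; the ordering in which DESC_HW is published for
the batch is elided here), a driver submitting three buffers as one batch
would set BATCH_NEXT on all but the last:

    ring[i + 0].flags = DESC_HW | VRING_DESC_F_BATCH_NEXT;
    ring[i + 1].flags = DESC_HW | VRING_DESC_F_BATCH_NEXT;
    ring[i + 2].flags = DESC_HW;  /* last descriptor of the batch */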
> 
> 
> 
> * Processing descriptors in and out of order
> 
> Device processing all descriptors in order can simply flip
> the DESC_HW bit as it is done with each descriptor.
> 
> Device can write descriptors out in order as they are used, overwriting
> descriptors that are there.
> 
> Device must not use a descriptor until DESC_HW is set.
> It is only required to look at the first descriptor
> submitted.
> 
> Driver must not overwrite a descriptor until DESC_HW is clear.
> It is only required to look at the first descriptor
> submitted.
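
A sketch of the in-order device loop these rules imply (do_io is a
hypothetical helper returning the number of bytes written; barriers and
error handling are simplified):

    /* Use descriptors while DESC_HW is set; write back len, then clear
     * flags last to return each descriptor to the driver. */
    static unsigned device_poll(struct desc *ring, unsigned ring_size,
                                unsigned ci)
    {
        while (ring[ci].flags & DESC_HW) {
            ring[ci].len = do_io(ring[ci].addr, ring[ci].len);
            __atomic_thread_fence(__ATOMIC_RELEASE);
            ring[ci].flags = 0;          /* DESC_HW cleared last */
            ci = (ci + 1) % ring_size;
        }
        return ci;
    }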
> 
> * Device specific descriptor flags
> We have a lot of unused space in the descriptor.  This can be put to
> good use by reserving some flag bits for device use.
> For example, network

Re: [virtio-dev] packed ring layout proposal v2

2017-07-10 Thread Amnon Ilan

+Lior

- Original Message -
> From: "Michael S. Tsirkin" 
> To: "Cornelia Huck" 
> Cc: "Paolo Bonzini" , 
> virtualizat...@lists.linux-foundation.org, virtio-dev@lists.oasis-open.org
> Sent: Tuesday, March 7, 2017 10:33:57 PM
> Subject: Re: [virtio-dev] packed ring layout proposal v2
> 
> On Tue, Mar 07, 2017 at 04:53:53PM +0100, Cornelia Huck wrote:
> > On Wed, 22 Feb 2017 18:43:05 +0200
> > "Michael S. Tsirkin"  wrote:
> > 
> > > On Thu, Feb 09, 2017 at 05:11:05PM +0100, Cornelia Huck wrote:
> > > > > >>> * Non power-of-2 ring sizes
> > > > > >>>
> > > > > >>> As the ring simply wraps around, there's no reason to
> > > > > >>> require ring size to be power of two.
> > > > > >>> It can be made a separate feature though.
> > > > > >>
> > > > > >> Power of 2 ring sizes are required in order to ignore the high
> > > > > >> bits of
> > > > > >> the indices.  With non-power-of-2 sizes you are forced to keep the
> > > > > >> indices less than the ring size.
> > > > > > 
> > > > > > Right. So
> > > > > > 
> > > > > > if (unlikely(++idx >= size))
> > > > > > idx = 0;
> > > > > > 
> > > > > > OTOH a ring size that's twice as large as necessary
> > > > > > because of power-of-two requirements wastes cache.
> > > > > 
> > > > > I don't know.  Power of 2 ring size is pretty standard, I'd rather
> > > > > avoid
> > > > > the complication and the gratuitous difference with 1.0.
> > > > 
> > > > I agree. I don't think dropping the power of 2 requirement buys us so
> > > > much that it makes up for the added complexity.
> > > 
> > > I recalled why I came up with this. The issue is cache associativity.
> > > Recall that besides the ring we have event suppression
> > > structures - if we are lucky and things run at the same speed
> > > everything can work by polling keeping events disabled, then
> > > event suppression structures are never written to, they are read-only.
> > > 
> > > However if ring and event suppression share a cache line ring accesses
> > > have a chance to push the event suppression out of cache, causing
> > > misses on read.
> > > 
> > > This can happen if they are at the same offset in the set.
> > > E.g. with L1 caches, 4-Kbyte sets are common, so the same offset
> > > within a 4K page.
> > > 
> > > We can fix this by making event suppression adjacent in memory, e.g.:
> > > 
> > > 
> > > [interrupt suppress]
> > > [descriptor ring]
> > > [kick suppress]
> > > 
> > > If this whole structure fits in a single set, ring accesses will
> > > not push kick or interrupt suppress out of cache.
> > > Specific layout can be left for drivers, but as set size is
> > > a power of two this might require a non-power of two ring size.
> > > 
> > > I conclude that this is an optimization that needs to be
> > > benchmarked.
> > 
> > This makes sense. But wouldn't the optimum layout depend on the
> > platform?
> 
> There's generally a tradeoff between performance and portability.
> Whether it's worth it would need to be tested.
> Further, it might be better to have platform-specific optimization
> tied to a given platform rather than a feature bit.
> 
> > > 
> > > I also note that the generic description does not have to force
> > > powers of two *even if devices actually require it*.
> > > I would be inclined to word the text in a way that makes
> > > relaxing the restriction easier.
> > > 
> > > For example, we can say "free running 16 bit index" and this forces a
> > > power of two, but we can also say "free running index wrapping to 0
> > > after (N*queue-size - 1) with N chosen such that the value fits in 16
> > > bit" and this is exactly the same if queue size is a power of 2.
> > > 
> > > So we can add text saying "ring size MUST be a power of two"
> > > and later it will be easy to relax just by adding a feature bit.
> > 
> > A later feature bit sounds good.
> 
> No need to delay benchmarking if someone has the time though :)
> 
> --
> MST
> ___
> Virtualization mailing list
> virtualizat...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization
> 

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal - todo list

2017-04-01 Thread Yuanhan Liu
On Wed, Mar 29, 2017 at 03:39:20PM +0300, Michael S. Tsirkin wrote:
> On Wed, Mar 08, 2017 at 03:56:24PM +0800, Yuanhan Liu wrote:
> > On Wed, Mar 08, 2017 at 03:09:48PM +0800, Yuanhan Liu wrote:
> > > On Wed, Mar 01, 2017 at 03:07:29AM +0200, Michael S. Tsirkin wrote:
> > > > On Tue, Feb 28, 2017 at 12:29:43PM +0800, Yuanhan Liu wrote:
> > > > > Hi Michael,
> > > > > 
> > > > > Again, as usual, sorry for being late :/
> > > > > 
> > > > > On Wed, Feb 22, 2017 at 06:27:11AM +0200, Michael S. Tsirkin wrote:
> > > > > > Stage 2: prototype guest/host drivers
> > > > > > 
> > > > > > At this stage we need real guest and host drivers
> > > > > > to be able to test real life performance.
> > > > > I suggest dpdk drivers + minimal hack in qemu to
> > > > > > pass features around.
> > > > > > 
> > > > > 
> > > > > I have already done that last November. I made a very rough (yet hacky)
> > > > > version (only with the Tx path) in one day while accompanying my wife in
> > > > > the hospital.
> > > > 
> > > > Any performance data?
> > > 
> > > A straightforward implementation only brings a 10% performance boost in a
> > > txonly micro benchmark. But I'm sure there is still plenty of room
> > > for improvement.
> > > 
> > > > > If someone is interested, I could share the code soon. I could
> > > > > even clean up the code a bit if necessary.
> > > > 
> > > > Especially if you don't have time to benchmark, I think sharing it
> > > > might help.
> > > 
> > > Here it is (check the README-virtio-1.1 for howto):
> > > 
> > > git://fridaybit.com/git/dpdk.git  virtio-1.1-v0.1
> > 
> > Well, I was told it may not be proper to share code this way. So
> > this channel is closed. I will check how to find a proper way. Sorry
> > for the inconvenience!
> > 
> > --yliu
> 
> Were you going to re-post it?

It's at

git://dpdk.org/next/dpdk-next-virtio  for-testing

> I don't see what the issue could be frankly.
> Care to elaborate?

Honestly, I don't know, neither.

--yliu

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal - todo list

2017-03-29 Thread Michael S. Tsirkin
On Wed, Mar 08, 2017 at 03:56:24PM +0800, Yuanhan Liu wrote:
> On Wed, Mar 08, 2017 at 03:09:48PM +0800, Yuanhan Liu wrote:
> > On Wed, Mar 01, 2017 at 03:07:29AM +0200, Michael S. Tsirkin wrote:
> > > On Tue, Feb 28, 2017 at 12:29:43PM +0800, Yuanhan Liu wrote:
> > > > Hi Michael,
> > > > 
> > > > Again, as usual, sorry for being late :/
> > > > 
> > > > On Wed, Feb 22, 2017 at 06:27:11AM +0200, Michael S. Tsirkin wrote:
> > > > > Stage 2: prototype guest/host drivers
> > > > > 
> > > > > At this stage we need real guest and host drivers
> > > > > to be able to test real life performance.
> > > > I suggest dpdk drivers + minimal hack in qemu to
> > > > > pass features around.
> > > > > 
> > > > 
> > > > I have already done that last November. I made a very rough (yet hacky)
> > > > version (only with the Tx path) in one day while accompanying my wife in
> > > > the hospital.
> > > 
> > > Any performance data?
> > 
> > A straightforward implementation only brings a 10% performance boost in a
> > txonly micro benchmark. But I'm sure there is still plenty of room
> > for improvement.
> > 
> > > > If someone is interested, I could share the code soon. I could
> > > > even clean up the code a bit if necessary.
> > > 
> > > Especially if you don't have time to benchmark, I think sharing it
> > > might help.
> > 
> > Here it is (check the README-virtio-1.1 for howto):
> > 
> > git://fridaybit.com/git/dpdk.git  virtio-1.1-v0.1
> 
> Well, I was told it may not be proper to share code this way. So
> this channel is closed. I will check how to find a proper way. Sorry
> for the inconvenience!
> 
>   --yliu

Were you going to re-post it?
I don't see what the issue could be frankly.
Care to elaborate?

-- 
MST

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal - todo list

2017-03-07 Thread Yuanhan Liu
On Wed, Mar 08, 2017 at 03:09:48PM +0800, Yuanhan Liu wrote:
> On Wed, Mar 01, 2017 at 03:07:29AM +0200, Michael S. Tsirkin wrote:
> > On Tue, Feb 28, 2017 at 12:29:43PM +0800, Yuanhan Liu wrote:
> > > Hi Michael,
> > > 
> > > Again, as usual, sorry for being late :/
> > > 
> > > On Wed, Feb 22, 2017 at 06:27:11AM +0200, Michael S. Tsirkin wrote:
> > > > Stage 2: prototype guest/host drivers
> > > > 
> > > > At this stage we need real guest and host drivers
> > > > to be able to test real life performance.
> > > I suggest dpdk drivers + minimal hack in qemu to
> > > > pass features around.
> > > > 
> > > 
> > > I have already done that last November. I made a very rough (yet hacky)
> > > version (only with the Tx path) in one day while accompanying my wife in
> > > the hospital.
> > 
> > Any performance data?
> 
> A straightforward implementation only brings a 10% performance boost in a
> txonly micro benchmark. But I'm sure there is still plenty of room
> for improvement.
> 
> > > If someone is interested, I could share the code soon. I could
> > > even clean up the code a bit if necessary.
> > 
> > Especially if you don't have time to benchmark, I think sharing it
> > might help.
> 
> Here it is (check the README-virtio-1.1 for howto):
> 
> git://fridaybit.com/git/dpdk.git  virtio-1.1-v0.1

Well, I was told it may not be proper to share code this way. So
this channel is closed. I will check how to find a proper way. Sorry
for the inconvenience!

--yliu

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal - todo list

2017-03-07 Thread Yuanhan Liu
On Wed, Mar 01, 2017 at 03:07:29AM +0200, Michael S. Tsirkin wrote:
> On Tue, Feb 28, 2017 at 12:29:43PM +0800, Yuanhan Liu wrote:
> > Hi Michael,
> > 
> > Again, as usual, sorry for being late :/
> > 
> > On Wed, Feb 22, 2017 at 06:27:11AM +0200, Michael S. Tsirkin wrote:
> > > Stage 2: prototype guest/host drivers
> > > 
> > > At this stage we need real guest and host drivers
> > > to be able to test real life performance.
> > I suggest dpdk drivers + minimal hack in qemu to
> > > pass features around.
> > > 
> > 
> > I have already done that last November. I made a very rough (yet hacky)
> > version (only with the Tx path) in one day while accompanying my wife in
> > the hospital.
> 
> Any performance data?

A straightforward implementation only brings a 10% performance boost in a
txonly micro benchmark. But I'm sure there is still plenty of room
for improvement.

> > If someone is interested, I could share the code soon. I could
> > even clean up the code a bit if necessary.
> 
> Especially if you don't have time to benchmark, I think sharing it
> might help.

Here it is (check the README-virtio-1.1 for howto):

git://fridaybit.com/git/dpdk.git  virtio-1.1-v0.1

> > - Try to accelerate vhost/virtio with vector instructions.
> 
> That's interesting. What kind of optimizations would you say
> do vector instructions enable, and why?

If we have made the cache impact minimal, what is left to
be optimized is the instruction cycles. SIMD instructions (like AVX)
should then help with this.

--yliu

> >   Something I will look into when above item is done. Currently,
> >   I thought of two items may help the vector implementation:
> > 
> >   * what kind of vring and desc layout could make the vector
> > implementation easier.
> > 
> >   * what kind of hint we need from virtio spec for (dynamically)
> > enabling the vector path.
> > 
> >   Besides that, I don't have too much clue yet.
> > 
> > --yliu

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal v2

2017-03-07 Thread Michael S. Tsirkin
On Tue, Mar 07, 2017 at 04:53:53PM +0100, Cornelia Huck wrote:
> On Wed, 22 Feb 2017 18:43:05 +0200
> "Michael S. Tsirkin"  wrote:
> 
> > On Thu, Feb 09, 2017 at 05:11:05PM +0100, Cornelia Huck wrote:
> > > > >>> * Non power-of-2 ring sizes
> > > > >>>
> > > > >>> As the ring simply wraps around, there's no reason to
> > > > >>> require ring size to be power of two.
> > > > >>> It can be made a separate feature though.
> > > > >>
> > > > >> Power of 2 ring sizes are required in order to ignore the high bits 
> > > > >> of
> > > > >> the indices.  With non-power-of-2 sizes you are forced to keep the
> > > > >> indices less than the ring size.
> > > > > 
> > > > > Right. So
> > > > > 
> > > > >   if (unlikely(++idx >= size))
> > > > >   idx = 0;
> > > > > 
> > > > > OTOH a ring size that's twice as large as necessary
> > > > > because of power-of-two requirements wastes cache.
> > > > 
> > > > I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
> > > > the complication and the gratuitous difference with 1.0.
> > > 
> > > I agree. I don't think dropping the power of 2 requirement buys us so
> > > much that it makes up for the added complexity.
> > 
> > I recalled why I came up with this. The issue is cache associativity.
> > Recall that besides the ring we have event suppression
> > structures - if we are lucky and things run at the same speed
> > everything can work by polling keeping events disabled, then
> > event suppression structures are never written to, they are read-only.
> > 
> > However if ring and event suppression share a cache line ring accesses
> > have a chance to push the event suppression out of cache, causing
> > misses on read.
> > 
> > This can happen if they are at the same offset in the set.
> > E.g. with L1 caches, 4-Kbyte sets are common, so the same offset
> > within a 4K page.
> > 
> > We can fix this by making event suppression adjacent in memory, e.g.:
> > 
> > 
> > [interrupt suppress]
> > [descriptor ring]
> > [kick suppress]
> > 
> > If this whole structure fits in a single set, ring accesses will
> > not push kick or interrupt suppress out of cache.
> > Specific layout can be left for drivers, but as set size is
> > a power of two this might require a non-power of two ring size.
> > 
> > I conclude that this is an optimization that needs to be
> > benchmarked.
> 
> This makes sense. But wouldn't the optimum layout depend on the
> platform?

There's generally a tradeoff between performance and portability.
Whether it's worth it would need to be tested.
Further, it might be better to have platform-specific optimization
tied to a given platform rather than a feature bit.

> > 
> > I also note that the generic description does not have to force
> > powers of two *even if devices actually require it*.
> > I would be inclined to word the text in a way that makes
> > relaxing the restriction easier.
> > 
> > For example, we can say "free running 16 bit index" and this forces a
> > power of two, but we can also say "free running index wrapping to 0
> > after (N*queue-size - 1) with N chosen such that the value fits in 16
> > bit" and this is exactly the same if queue size is a power of 2.
> > 
> > So we can add text saying "ring size MUST be a power of two"
> > and later it will be easy to relax just by adding a feature bit.
> 
> A later feature bit sounds good.

No need to delay benchmarking if someone has the time though :)

-- 
MST

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal v2

2017-03-07 Thread Cornelia Huck
On Wed, 22 Feb 2017 18:43:05 +0200
"Michael S. Tsirkin"  wrote:

> On Thu, Feb 09, 2017 at 05:11:05PM +0100, Cornelia Huck wrote:
> > > >>> * Non power-of-2 ring sizes
> > > >>>
> > > >>> As the ring simply wraps around, there's no reason to
> > > >>> require ring size to be power of two.
> > > >>> It can be made a separate feature though.
> > > >>
> > > >> Power of 2 ring sizes are required in order to ignore the high bits of
> > > >> the indices.  With non-power-of-2 sizes you are forced to keep the
> > > >> indices less than the ring size.
> > > > 
> > > > Right. So
> > > > 
> > > > if (unlikely(++idx >= size))
> > > > idx = 0;
> > > > 
> > > > OTOH a ring size that's twice as large as necessary
> > > > because of power-of-two requirements wastes cache.
> > > 
> > > I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
> > > the complication and the gratuitous difference with 1.0.
> > 
> > I agree. I don't think dropping the power of 2 requirement buys us so
> > much that it makes up for the added complexity.
> 
> I recalled why I came up with this. The issue is cache associativity.
> Recall that besides the ring we have event suppression
> structures - if we are lucky and things run at the same speed
> everything can work by polling keeping events disabled, then
> event suppression structures are never written to, they are read-only.
> 
> However if ring and event suppression share a cache line ring accesses
> have a chance to push the event suppression out of cache, causing
> misses on read.
> 
> This can happen if they are at the same offset in the set.
> E.g. with L1 caches, 4-Kbyte sets are common, so the same offset
> within a 4K page.
> 
> We can fix this by making event suppression adjacent in memory, e.g.:
> 
> 
> [interrupt suppress]
> [descriptor ring]
> [kick suppress]
> 
> If this whole structure fits in a single set, ring accesses will
> not push kick or interrupt suppress out of cache.
> Specific layout can be left for drivers, but as set size is
> a power of two this might require a non-power of two ring size.
> 
> I conclude that this is an optimization that needs to be
> benchmarked.

This makes sense. But wouldn't the optimum layout depend on the
platform?

> 
> I also note that the generic description does not have to force
> powers of two *even if devices actually require it*.
> I would be inclined to word the text in a way that makes
> relaxing the restriction easier.
> 
> For example, we can say "free running 16 bit index" and this forces a
> power of two, but we can also say "free running index wrapping to 0
> after (N*queue-size - 1) with N chosen such that the value fits in 16
> bit" and this is exactly the same if queue size is a power of 2.
> 
> So we can add text saying "ring size MUST be a power of two"
> and later it will be easy to relax just by adding a feature bit.

A later feature bit sounds good.


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal v2

2017-02-28 Thread Yuanhan Liu
On Wed, Mar 01, 2017 at 06:14:21AM +0200, Michael S. Tsirkin wrote:
> > > > That said, if it's "16:16" and if we use only 8 bits for batch, we
> > > > could still have another 8 bits for anything else, say the number of
> > > > descs for a single pkt. With that, the num_buffers of the mergeable Rx
> > > > header could be replaced.  More importantly, we could save a cache
> > > > line write if no offloads are used in the mergeable Rx path. 
> > > 
> > > Why do you bother with mergeable Rx without offloads?
> > 
> > Oh, my bad. I actually meant "without offloads __being used__". Just
> > assume the pkt size is 64B and no offloads are used. When mergeable
> > Rx is negotiated (which is the default case), num_buffers has to be
> > set to 1. That means an extra cache line write. For the non-mergeable
> > case, the cache line write could be avoided by a trick like what
> > the following patch did:
> > 
> >
> > http://dpdk.org/browse/dpdk/commit/?id=c9ea670c1dc7e3f111d8139f915082b60c9c1ffe
> > 
> > It basically tries to avoid writing 0 if the value is already 0:
> > the case when no offloads are used.
> > So while writing this email, I was thinking maybe we could not set
> > num_buffers to 1 when there is only one desc (let it be 0 and let
> > num_buffers == 0 imply num_buffers = 1). I'm not quite sure we can
> > do that now, thus I checked the DPDK code and found it's Okay.
> > 
> > seg_num = header->num_buffers;
> > 
> > if (seg_num == 0)
> >         seg_num = 1;
> > 
> > 
> > I then also checked linux kernel code, and found it's not okay as
> > the code depends on the value being set correctly:
> > 
> > ==> u16 num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers);
> >     struct page *page = virt_to_head_page(buf);
> >     int offset = buf - page_address(page);
> >     unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
> > 
> >     struct sk_buff *head_skb = page_to_skb(vi, rq, page, offset, len,
> >                                            truesize);
> >     struct sk_buff *curr_skb = head_skb;
> > 
> >     if (unlikely(!curr_skb))
> >             goto err_skb;
> > ==> while (--num_buf) {
> > 
> > That means, if we want to do that, it needs an extra feature flag
> > (either a global feature flag or a desc flag), something like
> > ALLOW_ZERO_NUM_BUFFERS. Or even, make it allowable in virtio 1.1
> > (virtio 0.95/1.0 won't benefit from it though).
> > 
> > Does it make sense to you?
> 
> Right and then we could use a descriptor flag "header is all 0s".
> For virtio 1.0 we could put these in the used ring instead.

Great.

--yliu

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal v2

2017-02-28 Thread Michael S. Tsirkin
On Wed, Mar 01, 2017 at 11:57:15AM +0800, Yuanhan Liu wrote:
> On Wed, Mar 01, 2017 at 03:02:29AM +0200, Michael S. Tsirkin wrote:
> > > > * Descriptor ring:
> > > > 
> > > > Guest adds descriptors with unique index values and DESC_HW set in 
> > > > flags.
> > > > Host overwrites used descriptors with correct len, index, and DESC_HW
> > > > clear.  Flags are always set/cleared last.
> > > 
> > > May I know what's the index intended for? Back referencing a pkt buffer?
> > 
> > Yes and generally identify which descriptor completed. Recall that
> > even though vhost net completes in order at the moment,
> > virtio rings serve devices (e.g. storage) that complete out of order.
> 
> I see, and thanks.
> 
> > > That said, if it's "16:16" and if we use only 8 bits for batch, we
> > > could still have another 8 bits for anything else, say the number of
> > > descs for a single pkt. With that, the num_buffers of the mergeable Rx
> > > header could be replaced.  More importantly, we could save a cache
> > > line write if no offloads are used in the mergeable Rx path. 
> > 
> > Why do you bother with mergeable Rx without offloads?
> 
> Oh, my bad. I actually meant "without offloads __being used__". Just
> assume the pkt size is 64B and no offloads are used. When mergeable
> Rx is negotiated (which is the default case), num_buffers has to be
> set to 1. That means an extra cache line write. For the non-mergeable
> case, the cache line write could be avoided by a trick like what
> the following patch did:
> 
>
> http://dpdk.org/browse/dpdk/commit/?id=c9ea670c1dc7e3f111d8139f915082b60c9c1ffe
> 
> It basically tries to avoid writing 0 if the value is already 0:
> the case when no offloads are used.
> So while writing this email, I was thinking maybe we could not set
> num_buffers to 1 when there is only one desc (let it be 0 and let
> num_buffers == 0 imply num_buffers = 1). I'm not quite sure we can
> do that now, thus I checked the DPDK code and found it's Okay.
> 
> seg_num = header->num_buffers;
> 
> if (seg_num == 0)
>         seg_num = 1;
> 
> 
> I then also checked linux kernel code, and found it's not okay as
> the code depends on the value being set correctly:
> 
> ==> u16 num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers);
>     struct page *page = virt_to_head_page(buf);
>     int offset = buf - page_address(page);
>     unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
> 
>     struct sk_buff *head_skb = page_to_skb(vi, rq, page, offset, len,
>                                            truesize);
>     struct sk_buff *curr_skb = head_skb;
> 
>     if (unlikely(!curr_skb))
>             goto err_skb;
> ==> while (--num_buf) {
> 
> That means, if we want to do that, it needs an extra feature flag
> (either a global feature flag or a desc flag), something like
> ALLOW_ZERO_NUM_BUFFERS. Or even, make it allowable in virtio 1.1
> (virtio 0.95/1.0 won't benefit from it though).
> 
> Does it make sense to you?

Right and then we could use a descriptor flag "header is all 0s".
For virtio 1.0 we could put these in the used ring instead.

> 
> > Make each buffer
> > MTU sized and it'll fit without merging.  Linux used not to, it only
> > started doing this to save memory aggressively. I don't think
> > DPDK cares about this.
> > 
> > > 
> > >   struct desc {
> > >   __le64 addr;
> > >   __le16 len;
> > >   __le8  batch;
> > >   __le8  num_buffers;
> > >   __le16 index;
> > >   __le16 flags;
> > >   };
> > 
> > Interesting. How about a benchmark for these ideas?
> 
> Sure, I would like to benchmark it. It won't take me long. But
> currently, I am still focusing on analyzing the performance behaviour
> of virtio 0.95/1.0 (when I get some time), to see what's not good for
> performance and what can be improved.
> 
> Besides that, as said, I often get interrupted. Moreover, the DPDK v17.05
> code freeze deadline is coming. So this is just a reminder that I may not
> have time for it soon. Sorry for that.
> 
> > > > * Device specific descriptor flags
> > > > We have a lot of unused space in the descriptor.  This can be put to
> > > > good use by reserving some flag bits for device use.
> > > > For example, network device can set a bit to request
> > > > that header in the descriptor is suppressed
> > > > (in case it's all 0s anyway). This reduces cache utilization.
> > > 
> > Good proposal. But I think it could only work with Tx, where the driver
> > knows whether the headers are all 0s while filling the desc. But for
> > Rx, where the driver has to pre-fill the desc, it doesn't know. Thus
> > it still requires filling a header desc for storing it.
> > 
> I don't see why - I don't think drivers pre-fill buffers in header for RX
> right now. Why would they start?

Re: [virtio-dev] packed ring layout proposal v2

2017-02-28 Thread Yuanhan Liu
On Wed, Mar 01, 2017 at 03:02:29AM +0200, Michael S. Tsirkin wrote:
> > > * Descriptor ring:
> > > 
> > > Guest adds descriptors with unique index values and DESC_HW set in flags.
> > > Host overwrites used descriptors with correct len, index, and DESC_HW
> > > clear.  Flags are always set/cleared last.
> > 
> > May I know what's the index intended for? Back referencing a pkt buffer?
> 
> Yes and generally identify which descriptor completed. Recall that
> even though vhost net completes in order at the moment,
> virtio rings serve devices (e.g. storage) that complete out of order.

I see, and thanks.

> > That said, if it's "16:16" and if we use only 8 bits for batch, we
> > could still have another 8 bits for anything else, say the number of
> > descs for a single pkt. With that, the num_buffers of the mergeable Rx
> > header could be replaced.  More importantly, we could save a cache
> > line write if no offloads are used in the mergeable Rx path. 
> 
> Why do you bother with mergeable Rx without offloads?

Oh, my bad. I actually meant "without offloads __being used__". Just
assume the pkt size is 64B and no offloads are used. When mergeable
Rx is negotiated (which is the default case), num_buffers has to be
set to 1. That means an extra cache line write. For the non-mergeable
case, the cache line write could be avoided by a trick like what
the following patch did:

   
http://dpdk.org/browse/dpdk/commit/?id=c9ea670c1dc7e3f111d8139f915082b60c9c1ffe

It basically tries to avoid writing 0 if the value is already 0:
the case when no offloads are used.

So while writing this email, I was thinking maybe we could not set
num_buffers to 1 when there is only one desc (let it be 0 and let
num_buffers == 0 imply num_buffers = 1). I'm not quite sure we can
do that now, thus I checked the DPDK code and found it's Okay.

seg_num = header->num_buffers;

if (seg_num == 0)
        seg_num = 1;


I then also checked linux kernel code, and found it's not okay as
the code depends on the value being set correctly:

==> u16 num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers);
    struct page *page = virt_to_head_page(buf);
    int offset = buf - page_address(page);
    unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));

    struct sk_buff *head_skb = page_to_skb(vi, rq, page, offset, len,
                                           truesize);
    struct sk_buff *curr_skb = head_skb;

    if (unlikely(!curr_skb))
            goto err_skb;
==> while (--num_buf) {

That means, if we want to do that, it needs an extra feature flag
(either a global feature flag or a desc flag), something like
ALLOW_ZERO_NUM_BUFFERS. Or even, make it allowable in virtio 1.1
(virtio 0.95/1.0 won't benefit from it though).

Does it make sense to you?

> Make each buffer
> MTU sized and it'll fit without merging.  Linux used not to, it only
> started doing this to save memory aggressively. I don't think
> DPDK cares about this.
> 
> > 
> > struct desc {
> > __le64 addr;
> > __le16 len;
> > __le8  batch;
> > __le8  num_buffers;
> > __le16 index;
> > __le16 flags;
> > };
> 
> Interesting. How about a benchmark for these ideas?

Sure, I would like to benchmark it. It won't take me long. But
currently, I am still focusing on analyzing the performance behaviour
of virtio 0.95/1.0 (when I get some time), to see what's not good for
performance and what can be improved.

Besides that, as said, I often get interrupted. Moreover, the DPDK v17.05
code freeze deadline is coming. So this is just a reminder that I may not
have time for it soon. Sorry for that.

> > > * Device specific descriptor flags
> > > We have a lot of unused space in the descriptor.  This can be put to
> > > good use by reserving some flag bits for device use.
> > > For example, network device can set a bit to request
> > > that header in the descriptor is suppressed
> > > (in case it's all 0s anyway). This reduces cache utilization.
> > 
> > Good proposal. But I think it could only work with Tx, where the driver
> > knows whether the headers are all 0s while filling the desc. But for
> > Rx, where the driver has to pre-fill the desc, it doesn't know. Thus
> > it still requires filling a header desc for storing it.
> 
> I don't see why - I don't think drivers pre-fill buffers in header for RX
> right now. Why would they start?

Again, my bad, I meant "prepare", not "pre-fill". Let me try to explain
it again. I'm thinking:

- For Tx, when the header is all 0s, the header could be discarded. We
  could use one desc for transferring a packet (with a flag NO_HEADER
  or HEADER_ALL_ZERO bit set)

- For Rx, the header is filled in the device (or vhost) side. And the
  driver has t

Re: [virtio-dev] packed ring layout proposal - todo list

2017-02-28 Thread Michael S. Tsirkin
On Tue, Feb 28, 2017 at 12:29:43PM +0800, Yuanhan Liu wrote:
> Hi Michael,
> 
> Again, as usual, sorry for being late :/
> 
> On Wed, Feb 22, 2017 at 06:27:11AM +0200, Michael S. Tsirkin wrote:
> > Stage 2: prototype guest/host drivers
> > 
> > At this stage we need real guest and host drivers
> > to be able to test real life performance.
> > I suggest dpdk drivers + minimal hack in qemu to
> > pass features around.
> > 
> 
> I have already done that last November. I made a very rough (yet hacky)
> version (only with the Tx path) in one day while accompanying my wife in
> the hospital.

Any performance data?

> If someone is interested, I could share the code soon. I could
> even clean up the code a bit if necessary.

Especially if you don't have time to benchmark, I think sharing it
might help.

> > Tasks:
> > 
> > - implement vhost-user support in dpdk
> > - implement virtio support in dpdk
> > - implement minimal stub in qemu
> 
> I didn't hack QEMU; instead, I hacked the DPDK virtio-user, yet
> another virtio-net emulation. It's simpler and quicker for me.

Sure, I merely meant a stub for negotiating new feature bits
between host and guest. But I guess an environment set
the same way in host and guest would serve too.

> And here is my plan on virtio 1.1:
> 
> - Look deeper inside the virtio net performance issues (WIP)
> 
> It's basically a job of digging into the DPDK vhost/virtio code
> deeper: something like how exactly the cache behaves while Tx/Rx'ing
> pkts, what can be optimized by the implementation, and what could
> be improved with the help of spec extensions.
> 
>   Please note that I often get interrupted on this task: it didn't
>   go as smoothly as I would have expected.
> 
> 
> - Try to accelerate vhost/virtio with vector instructions.

That's interesting. What kind of optimizations would you say
do vector instructions enable, and why?


>   Something I will look into when above item is done. Currently,
>   I thought of two items may help the vector implementation:
> 
>   * what kind of vring and desc layout could make the vector
> implementation easier.
> 
>   * what kind of hint we need from virtio spec for (dynamically)
> enabling the vector path.
> 
>   Besides that, I don't have too much clue yet.
> 
>   --yliu

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal v2

2017-02-28 Thread Michael S. Tsirkin
On Tue, Feb 28, 2017 at 01:02:18PM +0800, Yuanhan Liu wrote:
> On Wed, Feb 08, 2017 at 05:20:14AM +0200, Michael S. Tsirkin wrote:
> > This is an update from v1 version.
> > Changes:
> > - update event suppression mechanism
> > - separate options for indirect and direct s/g
> > - lots of new features
> > 
> > ---
> > 
> > Performance analysis of this is in my kvm forum 2016 presentation.
> > The idea is to have a r/w descriptor in a ring structure,
> > replacing the used and available ring, index and descriptor
> > buffer.
> > 
> > * Descriptor ring:
> > 
> > Guest adds descriptors with unique index values and DESC_HW set in flags.
> > Host overwrites used descriptors with correct len, index, and DESC_HW
> > clear.  Flags are always set/cleared last.
> 
> May I know what's the index intended for? Back referencing a pkt buffer?

Yes and generally identify which descriptor completed. Recall that
even though vhost net completes in order at the moment,
virtio rings serve devices (e.g. storage) that complete out of order.

> > #define DESC_HW 0x0080
> > 
> > struct desc {
> > __le64 addr;
> > __le32 len;
> > __le16 index;
> > __le16 flags;
> > };
> 
> ...
> > * Batching descriptors:
> > 
> > virtio 1.0 allows passing a batch of descriptors in both directions, by
> > incrementing the used/avail index by values > 1.  We can support this by
> > chaining a list of descriptors through a bit in the flags field.
> > To allow use together with s/g, a different bit will be used.
> > 
> > #define VRING_DESC_F_BATCH_NEXT 0x0010
> > 
> > Batching works for both driver and device descriptors.
> 
> Honestly, I don't think it's an efficient way of batching. Neither
> DESC_F_NEXT nor BATCH_NEXT tells us how many new descs there are
> for processing: it's just a hint that there is one more. And you have
> to follow the links one by one.
> 
> I was thinking, maybe we could sub-divide some fields of the desc, thus
> we could introduce more. For example, a 32-bit "len" may be too much,
> at least for virtio-net: the biggest pkt size is 64K, which fits in 16 bits.
> If we use 16 bits for len, we could use the extra 16 bits for telling
> the batch number.
> 
>   struct desc {
>   __le64 addr;
>   __le16 len;
>   __le16 batch;
>   __le16 index;
>   __le16 flags;
>   };
> 
> Only the heading desc needs to set the batch count and DESC_HW flag. With
> the two used together, we don't have to set/clear the DESC_HW flag on
> every desc in the driver/device.
> If 64K is small enough for other devices (say, BLK), we may re-allocate
> the bits to something like "24:8", where 24 bits are for len (16M at most)
> and 8 for batch.  OTOH, 8 bits of batch should be plenty, given
> that the default vring size is 256 and a typical burst size normally is
> way less than that.
> 
> That said, if it's "16:16" and if we use only 8 bits for batch, we
> could still have another 8 bits for anything else, say the number of
> descs for a single pkt. With that, the num_buffers of the mergeable Rx
> header could be replaced.  More importantly, we could save a cache
> line write if no offloads are used in the mergeable Rx path. 

Why do you bother with mergeable Rx without offloads?  Make each buffer
MTU sized and it'll fit without merging.  Linux used not to, it only
started doing this to save memory aggressively. I don't think
DPDK cares about this.

> 
>   struct desc {
>   __le64 addr;
>   __le16 len;
>   __le8  batch;
>   __le8  num_buffers;
>   __le16 index;
>   __le16 flags;
>   };

Interesting. How about a benchmark for these ideas?

> 
> > * Device specific descriptor flags
> > We have a lot of unused space in the descriptor.  This can be put to
> > good use by reserving some flag bits for device use.
> > For example, network device can set a bit to request
> > that header in the descriptor is suppressed
> > (in case it's all 0s anyway). This reduces cache utilization.
> 
> Good proposal. But I think it could only work with Tx, where the driver
> knows whether the headers are all 0s while filling the desc. But for
> Rx, where the driver has to pre-fill the desc, it doesn't know. Thus
> it still requires filling a header desc for storing it.

I don't see why - I don't think drivers pre-fill buffers in header for RX
right now. Why would they start?

> Maybe we could introduce a global feature? When that's negotiated, no
> header desc need be filled and processed? I'm thinking this could also
> help the vector implementation I mentioned in another email.

It's possible of course - it's a subset of what I said.
Though it makes it less useful in the general case.

> > Note: this feature can be supported in virtio 1.0 as well,
> > as we have unused bits in both descriptor and used ring there.
> 
> Agreed.
> 
>   --yliu

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org

Re: [virtio-dev] packed ring layout proposal v2

2017-02-27 Thread Yuanhan Liu
On Wed, Feb 08, 2017 at 05:20:14AM +0200, Michael S. Tsirkin wrote:
> This is an update from v1 version.
> Changes:
> - update event suppression mechanism
> - separate options for indirect and direct s/g
> - lots of new features
> 
> ---
> 
> Performance analysis of this is in my kvm forum 2016 presentation.
> The idea is to have a r/w descriptor in a ring structure,
> replacing the used and available ring, index and descriptor
> buffer.
> 
> * Descriptor ring:
> 
> Guest adds descriptors with unique index values and DESC_HW set in flags.
> Host overwrites used descriptors with correct len, index, and DESC_HW
> clear.  Flags are always set/cleared last.

May I know what's the index intended for? Back referencing a pkt buffer?

> #define DESC_HW 0x0080
> 
> struct desc {
> __le64 addr;
> __le32 len;
> __le16 index;
> __le16 flags;
> };

...
> * Batching descriptors:
> 
> virtio 1.0 allows passing a batch of descriptors in both directions, by
> incrementing the used/avail index by values > 1.  We can support this by
> chaining a list of descriptors through a bit in the flags field.
> To allow use together with s/g, a different bit will be used.
> 
> #define VRING_DESC_F_BATCH_NEXT 0x0010
> 
> Batching works for both driver and device descriptors.

Honestly, I don't think it's an efficient way of batching. Neither
DESC_F_NEXT nor BATCH_NEXT tells us how many new descs there are
for processing: it's just a hint that there is one more. And you have
to follow the links one by one.

I was thinking, maybe we could sub-divide some fields of the desc, thus
we could introduce more. For example, a 32-bit "len" may be too much,
at least for virtio-net: the biggest pkt size is 64K, which fits in 16 bits.
If we use 16 bits for len, we could use the extra 16 bits for telling
the batch number.

struct desc {
__le64 addr;
__le16 len;
__le16 batch;
__le16 index;
__le16 flags;
};

Only the heading desc needs to set the batch count and DESC_HW flag. With
the two used together, we don't have to set/clear the DESC_HW flag on
every desc in the driver/device.

If 64K is small enough for other devices (say, BLK), we may re-allocate
the bits to something like "24:8", where 24 bits are for len (16M at most)
and 8 for batch.  OTOH, 8 bits of batch should be plenty, given
that the default vring size is 256 and a typical burst size normally is
way less than that.
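
A sketch of the device side this would enable, assuming the 16:16 len/batch
split above (do_io is a hypothetical helper; barriers elided):

    /* Only the heading desc carries DESC_HW and the batch count, so
     * one flag check tells the device how many descs follow. */
    static unsigned device_poll_batch(struct desc *ring, unsigned ring_size,
                                      unsigned ci)
    {
        if (ring[ci].flags & DESC_HW) {
            uint16_t n = ring[ci].batch;  /* descs in batch, head included */
            for (uint16_t k = 0; k < n; k++)
                do_io(ring[(ci + k) % ring_size].addr,
                      ring[(ci + k) % ring_size].len);
            ring[ci].flags = 0;           /* return the whole batch at once */
            ci = (ci + n) % ring_size;
        }
        return ci;
    }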

That said, if it's "16:16" and if we use only 8 bits for batch, we
could still have another 8 bits for anything else, say the number of
descs for a single pkt. With that, the num_buffers of the mergeable Rx
header could be replaced.  More importantly, we could save a cache
line write if no offloads are used in the mergeable Rx path. 

struct desc {
__le64 addr;
__le16 len;
__le8  batch;
__le8  num_buffers;
__le16 index;
__le16 flags;
};

> * Device specific descriptor flags
> We have a lot of unused space in the descriptor.  This can be put to
> good use by reserving some flag bits for device use.
> For example, network device can set a bit to request
> that header in the descriptor is suppressed
> (in case it's all 0s anyway). This reduces cache utilization.

Good proposal. But I think it could only work with Tx, where the driver
knows whether the headers are all 0s while filling the desc. But for
Rx, where the driver has to pre-fill the desc, it doesn't know. Thus
it still requires filling a header desc for storing it.

Maybe we could introduce a global feature? When that's negotiated, no
header desc need be filled and processed? I'm thinking this could also
help the vector implementation I mentioned in another email.
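
A sketch of the Tx half of that idea; the flag value and helpers below are
made up for illustration and are not part of the proposal:

    #define DESC_HDR_ALL_ZERO 0x0100  /* hypothetical device-specific flag */

    /* When the virtio-net header is all zeros, post the packet data
     * alone and flag the suppressed header, saving a cache line. */
    if (hdr_is_all_zero(hdr)) {       /* hypothetical helper */
        ring[head].addr  = pkt_pa;    /* packet data only, no header */
        ring[head].len   = pkt_len;
        ring[head].index = id;
        ring[head].flags = DESC_HW | DESC_HDR_ALL_ZERO;
    }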

> Note: this feature can be supported in virtio 1.0 as well,
> as we have unused bits in both descriptor and used ring there.

Agreed.

--yliu

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal - todo list

2017-02-27 Thread Yuanhan Liu
Hi Michael,

Again, as usual, sorry for being late :/

On Wed, Feb 22, 2017 at 06:27:11AM +0200, Michael S. Tsirkin wrote:
> Stage 2: prototype guest/host drivers
> 
> At this stage we need real guest and host drivers
> to be able to test real life performance.
> I suggest dpdk drivers + minimal hack in qemu to
> pass features around.
> 

I have already done that last November. I made a very rough (yet hacky)
version (only with the Tx path) in one day while accompanying my wife in
the hospital.

If someone is interested, I could share the code soon. I could
even clean up the code a bit if necessary.


> Tasks:
> 
> - implement vhost-user support in dpdk
> - implement virtio support in dpdk
> - implement minimal stub in qemu

I didn't hack QEMU; instead, I hacked the DPDK virtio-user, yet
another virtio-net emulation. It's simpler and quicker for me.

And here is my plan on virtio 1.1:

- Look deeper inside the virtio net performance issues (WIP)

  It's basically a job of digging into the DPDK vhost/virtio code
  deeper: something like how exactly the cache behaves while Tx/Rx'ing
  pkts, what can be optimized by the implementation, and what could
  be improved with the help of spec extensions.

  Please note that I often get interrupted on this task: it didn't
  go as smoothly as I would have expected.


- Try to accelerate vhost/virtio with vector instructions.

  Something I will look into when above item is done. Currently,
  I thought of two items may help the vector implementation:

  * what kind of vring and desc layout could make the vector
implementation easier.

  * what kind of hint we need from virtio spec for (dynamically)
enabling the vector path.

  Besides that, I don't have too much clue yet.

--yliu

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal v2

2017-02-22 Thread Michael S. Tsirkin
On Thu, Feb 09, 2017 at 05:11:05PM +0100, Cornelia Huck wrote:
> > >>> * Non power-of-2 ring sizes
> > >>>
> > >>> As the ring simply wraps around, there's no reason to
> > >>> require ring size to be power of two.
> > >>> It can be made a separate feature though.
> > >>
> > >> Power of 2 ring sizes are required in order to ignore the high bits of
> > >> the indices.  With non-power-of-2 sizes you are forced to keep the
> > >> indices less than the ring size.
> > > 
> > > Right. So
> > > 
> > >   if (unlikely(++idx >= size))
> > >   idx = 0;
> > > 
> > > OTOH a ring size that's twice as large as necessary
> > > because of power-of-two requirements wastes cache.
> > 
> > I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
> > the complication and the gratuitous difference with 1.0.
> 
> I agree. I don't think dropping the power of 2 requirement buys us so
> much that it makes up for the added complexity.

I recalled why I came up with this. The issue is cache associativity.
Recall that besides the ring we have event suppression
structures - if we are lucky and things run at the same speed,
everything can work by polling with events kept disabled; then the
event suppression structures are never written to, they are read-only.

However if ring and event suppression share a cache line ring accesses
have a chance to push the event suppression out of cache, causing
misses on read.

This can happen if they are at the same offset in the set.
E.g. with L1 caches, 4-Kbyte sets are common, so the same offset
within a 4K page.

We can fix this by making event suppression adjacent in memory, e.g.:


[interrupt suppress]
[descriptor ring]
[kick suppress]

If this whole structure fits in a single set, ring accesses will
not push kick or interrupt suppress out of cache.
Specific layout can be left for drivers, but as set size is
a power of two this might require a non-power of two ring size.

I conclude that this is an optimization that needs to be
benchmarked.
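
To make this concrete, here is a minimal C sketch of such an adjacent
layout (the field names and the 254-entry ring size are illustrative
assumptions, chosen so the whole structure stays under 4 Kbytes):

#include <stdint.h>

/* 16-byte descriptor, as in the packed ring proposal. */
struct desc {
        uint64_t addr;
        uint32_t len;
        uint16_t index;
        uint16_t flags;
};

/* Suppression fields adjacent to the ring: 2 + 254*16 + 2 = 4068
 * bytes, so a 4K-aligned allocation keeps everything within one
 * set-sized window and ring traffic cannot evict the suppression
 * fields (which stay read-only while polling). */
struct ring_area {
        uint16_t interrupt_suppress;
        struct desc ring[254];          /* note: not a power of 2 */
        uint16_t kick_suppress;
};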

I also note that the generic description does not have to force
powers of two *even if devices actually require it*.
I would be inclined to word the text in a way that makes
relaxing the restriction easier.

For example, we can say "free running 16 bit index" and this forces a
power of two, but we can also say "free running index wrapping to 0
after (N*queue-size - 1) with N chosen such that the value fits in 16
bit" and this is exactly the same if queue size is a power of 2.

So we can add text saying "ring size MUST be a power of two"
and later it will be easy to relax just by adding a feature bit.
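
A minimal sketch of the relaxed wording (assuming N is chosen so that
N*queue_size fits in 16 bits; for a power-of-2 queue size the modulo
reduces to natural 16-bit wrap-around and disappears):

#include <stdint.h>

/* "Free running index wrapping to 0 after N*queue_size - 1". */
static inline uint16_t idx_inc(uint16_t idx, uint32_t range /* N*queue_size */)
{
        return (uint16_t)((idx + 1u) % range);
}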



-- 
MST




Re: [virtio-dev] packed ring layout proposal - todo list

2017-02-22 Thread Michael S. Tsirkin
On Wed, Feb 22, 2017 at 09:19:41AM +, Gray, Mark D wrote:
> > From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-
> > open.org] On Behalf Of Michael S. Tsirkin
> > 
> > Here is an attempt to summarise the list of things we need to do with 
> > respect
> > to this proposal.
> 
> Thanks for outlining this, it makes it a lot clearer.
> 
> > 
> > 
> > Stage 1: update prototype and finalize proposal
> > 
> > At this stage we need to update the prototype under
> > tools/virtio/ringtest/ring.c to make it match latest proposal, adding in 
> > various
> 
> Would a DPDK based prototype be good as well, or would you rather 
> everything be prototyped here?

DPDK is good.

You would then
- code up current proposal
- code up an alternative you are suggesting
- compare

tools/virtio/ringtest/ is only there to make prototyping
easier. I like keeping it alive for this purpose but it
is not a must.



> > options under discussion.
> > Then we will measure performance. While more ideas are welcome, they
> > won't be useful without the ability to test!
> > 
> > Tasks:
> > 
> > - update tools/virtio/ringtest/ring.c to proposal, adding
> >   options as listed below.
> > - compare performance with and without indirect buffers
> >   issue: how to estimate cost of memory allocations?
> > - batching descriptors - is it worth it?
> > - three ways to suppress interrupts/exits
> >   which works better?
> >   issue: how to estimate cost of interrupts/exits?
> > 
> > 
> 
> We need to ensure we have a proposal that is suitable for
> implementation in hardware.

Right. Obviously the system has to work well as a whole though; e.g. if
you save PCI Express bandwidth, speeding up hardware, but pay in cache
misses, slowing down software, it's not clear you have won anything at all.


> > More ideas:
> > 
> > - current tests all indicate cache synchronizations due to r/w descriptors
> >   cost as much as read-only/write-only descriptors,
> >   and there are fewer of these synchronizations.
> >   Disagree? Write a patch and benchmark.
> > - some devices could use smaller descriptors. For example,
> >   if you don't need length, id (e.g. using in-order, fixed length) or 
> > flags, you
> > can
> >   use a single 64 bit pointer as a descriptor. This can sometimes work
> >   e.g. for networking rx rings.
> >   Not clear whether this gains/costs us anything. Disagree? Write a patch 
> > and
> > benchmark.
> > 
> > Stage 2: prototype guest/host drivers
> > 
> > At this stage we need real guest and host drivers to be able to test real 
> > life
> > performance.
> > I suggest dpdk drivers + a minimal hack in qemu to pass features around.
> > 
> > Tasks:
> > 
> > - implement vhost-user support in dpdk
> > - implement virtio support in dpdk
> > - implement minimal stub in qemu
> > - test performance. Possibly revisit questions from stage 2
> >   if any are left open
> > 
> > 
> > Stage 3: complete guest/host drivers
> > 
> > At this stage we need to add linux support in virtio and vhost, and complete
> > qemu support.
> > 
> > Tasks:
> > 
> > - implement vhost support in kernel
> > - implement virtio support in kernel
> > - complete virtio support in qemu
> > - test performance. Possibly revisit questions from stage 2
> >   if any are left open
> > 
> > 
> > 
> > Stage X: Finalizing the spec
> > 
> > When are we ready to put out a spec draft? Surely not before Stage 1 is
> > done. Surely no later than stage 3.  A driver could be a wish for some
> > party looking to productize an implementation.
> > 
> > Interested? Join TC and start discussion on timing and which features should
> > be included.
> > 
> > 
> > --
> > MST
> > 




RE: [virtio-dev] packed ring layout proposal v2

2017-02-22 Thread Chien, Roger S
Here is some feedback w.r.t. the virtio v1.1 proposal from an FPGA
implementation viewpoint.

1) As Gary presented, depending on flags to determine ownership of
descriptors is not efficient. On a QPI FPGA platform it may also lead to
race conditions, because the current design only supports whole-cache-line
reads/writes with no partial byte enables. Even with partial byte enables,
the CPU/FPGA has to invalidate the whole cache line even when you just
toggle one bit (for RX, returning the descriptor to SW).

2) S/G is a nightmare for HW, because the next descriptor is only available
once you have the current one, so the HW is unable to parallelize descriptor
fetching and buffer fetching. An indirect table is better (compared to an
indirect linked list). It would be better still if we could put chained
descriptors inside the original descriptor table and use a flag to indicate
that their buffers belong to one large piece (though that may conflict with
the flexibility to increase/decrease descriptors at run time).

3) Better to keep power-of-2 ring sizes from the HW viewpoint, but it is not
so painful if we need to support non-power-of-2 sizes (a single AND mask
versus a comparator).


-Original Message-
From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org] 
On Behalf Of Michael S. Tsirkin
Sent: Wednesday, February 8, 2017 11:20 AM
To: virtio-dev@lists.oasis-open.org
Cc: virtualizat...@lists.linux-foundation.org
Subject: [virtio-dev] packed ring layout proposal v2

This is an update from v1 version.
Changes:
- update event suppression mechanism
- separate options for indirect and direct s/g
- lots of new features

---

Performance analysis of this is in my kvm forum 2016 presentation.
The idea is to have a r/w descriptor in a ring structure, replacing the used 
and available ring, index and descriptor buffer.

* Descriptor ring:

Guest adds descriptors with unique index values and DESC_HW set in flags.
Host overwrites used descriptors with correct len, index, and DESC_HW clear.  
Flags are always set/cleared last.

#define DESC_HW 0x0080

struct desc {
        __le64 addr;
        __le32 len;
        __le16 index;
        __le16 flags;
};

When DESC_HW is set, the descriptor belongs to the device. When it is clear,
it belongs to the driver.

We can use 1 bit to set direction
/* This marks a buffer as write-only (otherwise read-only). */
#define VRING_DESC_F_WRITE  2

* Scatter/gather support

We can use 1 bit to chain s/g entries in a request, same as virtio 1.0:

/* This marks a buffer as continuing via the next field. */
#define VRING_DESC_F_NEXT   1

Unlike virtio 1.0, all descriptors must have distinct ID values.

Also unlike virtio 1.0, use of this flag will be an optional feature (e.g. 
VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it.

* Indirect buffers

Can be marked like in virtio 1.0:

/* This means the buffer contains a table of buffer descriptors. */
#define VRING_DESC_F_INDIRECT   4

Unlike virtio 1.0, this is a table, not a list:

struct indirect_descriptor_table {
        /* The actual descriptors (16 bytes each) */
        struct virtq_desc desc[len / 16];
};

The first descriptor is located at start of the indirect descriptor table, 
additional indirect descriptors come immediately afterwards.
DESC_F_WRITE is the only valid flag for descriptors in the indirect table. 
Others should be set to 0 and are ignored.  id is also set to 0 and should be 
ignored.

virtio 1.0 seems to allow a s/g entry followed by an indirect descriptor. This 
does not seem useful, so we do not allow that anymore.

This support would be an optional feature, same as in virtio 1.0
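
Since the entries form a dense table rather than a linked list, the
consumer can walk them with a plain counted loop; a sketch
(process_buffer() is a hypothetical callback, len is the byte length
taken from the referring descriptor):

/* The entry count is known up front (len / 16), so the whole table
 * can be fetched or prefetched in one go. */
static void walk_indirect_table(const struct virtq_desc *table, uint32_t len)
{
        uint32_t i, n = len / sizeof(*table);   /* len / 16 */

        for (i = 0; i < n; i++)
                process_buffer(le64_to_cpu(table[i].addr),
                               le32_to_cpu(table[i].len));
}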

* Batching descriptors:

virtio 1.0 allows passing a batch of descriptors in both directions, by 
incrementing the used/avail index by values > 1.  We can support this by 
chaining a list of descriptors through a bit in the flags field.
To allow use together with s/g, a different bit will be used.

#define VRING_DESC_F_BATCH_NEXT 0x0010

Batching works for both driver and device descriptors.



* Processing descriptors in and out of order

A device processing all descriptors in order can simply flip the DESC_HW bit
as it is done with each descriptor.

Device can write descriptors out in order as they are used, overwriting 
descriptors that are there.

Device must not use a descriptor until DESC_HW is set.
It is only required to look at the first descriptor submitted.

Driver must not overwrite a descriptor until DESC_HW is clear.
It is only required to look at the first descriptor submitted.

* Device specific descriptor flags
We have a lot of unused space in the descriptor.  This can be put to good use 
by reserving some flag bits for device use.
For example, network device can set a bit to request that header in the 
descriptor is suppressed (in case it's all 0s anyway). This reduces cache 
utilization.

Note: this feature can be supported in virtio 1.0 as well, as we have unused 
bits in both descriptor and used ring there.

* Descript

RE: [virtio-dev] packed ring layout proposal - todo list

2017-02-22 Thread Gray, Mark D
> From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-
> open.org] On Behalf Of Michael S. Tsirkin
> 
> Here is an attempt to summarise the list of things we need to do with respect
> to this proposal.

Thanks for outlining this, it makes it a lot clearer.

> 
> 
> Stage 1: update prototype and finalize proposal
> 
> At this stage we need to update the prototype under
> tools/virtio/ringtest/ring.c to make it match latest proposal, adding in 
> various

Would a DPDK based prototype be good as well, or would you rather 
everything be prototyped here?

> options under discussion.
> Then we will measure performance. While more ideas are welcome, they
> won't be useful without the ability to test!
> 
> Tasks:
> 
> - update tools/virtio/ringtest/ring.c to proposal, adding
>   options as listed below.
> - compare performance with and without indirect buffers
>   issue: how to estimate cost of memory allocations?
> - batching descriptors - is it worth it?
> - three ways to suppress interrupts/exits
>   which works better?
>   issue: how to estimate cost of interrupts/exits?
> 
> 

We need to ensure we have a proposal that is suitable for
implementation in hardware.

> More ideas:
> 
> - current tests all indicate cache synchronizations due to r/w descriptors
>   cost as much as read-only/write-only descriptors,
>   and there are fewer of these synchronizations.
>   Disagree? Write a patch and benchmark.
> - some devices could use smaller descriptors. For example,
>   if you don't need length, id (e.g. using in-order, fixed length) or flags, 
> you
> can
>   use a single 64 bit pointer as a descriptor. This can sometimes work
>   e.g. for networking rx rings.
>   Not clear whether this gains/costs us anything. Disagree? Write a patch and
> benchmark.
> 
> Stage 2: prototype guest/host drivers
> 
> At this stage we need real guest and host drivers to be able to test real life
> performance.
> I suggest dpdk drivers + a minimal hack in qemu to pass features around.
> 
> Tasks:
> 
> - implement vhost-user support in dpdk
> - implement virtio support in dpdk
> - implement minimal stub in qemu
> - test performance. Possibly revisit questions from stage 2
>   if any are left open
> 
> 
> Stage 3: complete guest/host drivers
> 
> At this stage we need to add linux support in virtio and vhost, and complete
> qemu support.
> 
> Tasks:
> 
> - implement vhost support in kernel
> - implement virtio support in kernel
> - complete virtio support in qemu
> - test performance. Possibly revisit questions from stage 2
>   if any are left open
> 
> 
> 
> Stage X: Finalizing the spec
> 
> When are we ready to put out a spec draft? Surely not before Stage 1 is
> done. Surely no later than stage 3.  A driver could be a wish for some
> party looking to productize an implementation.
> 
> Interested? Join TC and start discussion on timing and which features should
> be included.
> 
> 
> --
> MST
> 





Re: [virtio-dev] packed ring layout proposal v2

2017-02-10 Thread Paolo Bonzini


- Original Message -
> From: "Michael S. Tsirkin" 
> To: "Paolo Bonzini" 
> Cc: virtio-dev@lists.oasis-open.org, virtualizat...@lists.linux-foundation.org
> Sent: Friday, February 10, 2017 4:20:17 PM
> Subject: Re: [virtio-dev] packed ring layout proposal v2
> 
> On Fri, Feb 10, 2017 at 12:32:49PM +0100, Paolo Bonzini wrote:
> > 
> > 
> > On 09/02/2017 19:24, Michael S. Tsirkin wrote:
> > >> I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
> > >> the complication and the gratuitous difference with 1.0.
> > >
> > > I thought originally there's a reason 1.0 rings had to be powers of two
> > > but now I don't see why. OK, we can make it a feature flag later if we
> > > want to.
> > 
> > The reason is that it allows indices to be free running.
> 
> Well what I meant is that with qsize not a power of 2 you can still do
> this but have to do everything mod N*qsize as opposed to mod 2^16. So
> you need a branch there - easiest to do if you do signed math.
> 
> int nheads = avail - last_avail;
> /* Check and handle index wrap-around */
> if (unlikely(nheads < 0)) {
>   nheads += N_qsize;
> }
> 
> if (nheads < 0 || nheads > vdev->vq[i].vring.num) {
>   error_report(...);
>   return -1;
> }
> 
> This can only catch bugs if N > 1

Agreed.

> > This is an example of QEMU code that requires that:
> 
> Same thing here, this never triggers if vring.num == 2^16

Free running indices require the counter range to be bigger than the
size of the vring (otherwise you confuse an empty vring with a full
one), so the maximum size of the vring in virtio <=1.0 is 2^15.

Paolo




Re: [virtio-dev] packed ring layout proposal v2

2017-02-10 Thread Michael S. Tsirkin
On Fri, Feb 10, 2017 at 12:32:49PM +0100, Paolo Bonzini wrote:
> 
> 
> On 09/02/2017 19:24, Michael S. Tsirkin wrote:
> >> I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
> >> the complication and the gratuitous difference with 1.0.
> >
> > I thought originally there's a reason 1.0 rings had to be powers of two
> > but now I don't see why. OK, we can make it a feature flag later if we
> > want to.
> 
> The reason is that it allows indices to be free running.

Well what I meant is that with qsize not a power of 2 you can still do
this but have to do everything mod N*qsize as opposed to mod 2^16. So
you need a branch there - easiest to do if you do signed math.

int nheads = avail - last_avail;
/* Check and handle index wrap-around */
if (unlikely(nheads < 0)) {
nheads += N_qsize;
}

if (nheads < 0 || nheads > vdev->vq[i].vring.num) {
error_report(...);
return -1;
}

This can only catch bugs if N > 1

>  This is an 
> example of QEMU code that requires that:
> 
> nheads = vring_avail_idx(&vdev->vq[i]) - 
> vdev->vq[i].last_avail_idx;
> /* Check it isn't doing strange things with descriptor numbers. */
> if (nheads > vdev->vq[i].vring.num) {
> error_report("VQ %d size 0x%x Guest index 0x%x "
>  "inconsistent with Host index 0x%x: delta 0x%x",
>  i, vdev->vq[i].vring.num,
>  vring_avail_idx(&vdev->vq[i]),
>  vdev->vq[i].last_avail_idx, nheads);
> return -1;
> }
> 
> Paolo

Same thing here, this never triggers if vring.num == 2^16

-- 
MST




Re: [virtio-dev] packed ring layout proposal v2

2017-02-10 Thread Paolo Bonzini


On 09/02/2017 19:24, Michael S. Tsirkin wrote:
>> I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
>> the complication and the gratuitous difference with 1.0.
>
> I thought originally there's a reason 1.0 rings had to be powers of two
> but now I don't see why. OK, we can make it a feature flag later if we
> want to.

The reason is that it allows indices to be free running.  This is an 
example of QEMU code that requires that:

nheads = vring_avail_idx(&vdev->vq[i]) - vdev->vq[i].last_avail_idx;
/* Check it isn't doing strange things with descriptor numbers. */
if (nheads > vdev->vq[i].vring.num) {
error_report("VQ %d size 0x%x Guest index 0x%x "
 "inconsistent with Host index 0x%x: delta 0x%x",
 i, vdev->vq[i].vring.num,
 vring_avail_idx(&vdev->vq[i]),
 vdev->vq[i].last_avail_idx, nheads);
return -1;
}

Paolo




Re: [virtio-dev] packed ring layout proposal v2

2017-02-09 Thread Michael S. Tsirkin
On Thu, Feb 09, 2017 at 04:48:53PM +0100, Paolo Bonzini wrote:
> 
> 
> On 08/02/2017 20:59, Michael S. Tsirkin wrote:
> > We couldn't decide what's better for everyone in 1.0 days and I doubt
> > we'll be able to now, but yes, benchmarking is needed to make
> > sure it's required. Very easy to remove or not to use/support in
> > drivers/devices though.
> 
> Fair enough, but of course then we must specify that devices MUST
> support either VRING_DESC_F_NEXT or VRING_DESC_F_INDIRECT, and drivers
> SHOULD support both (or use neither).
> 
> The drivers' part adds to the implementation burden, which is why I
> wanted to remove it.  Alternatively we can say that indirect is
> mandatory for both devices and drivers (and save a feature bit), while
> VRING_DESC_F_NEXT is optional.
> 
> >>> * Non power-of-2 ring sizes
> >>>
> >>> As the ring simply wraps around, there's no reason to
> >>> require ring size to be power of two.
> >>> It can be made a separate feature though.
> >>
> >> Power of 2 ring sizes are required in order to ignore the high bits of
> >> the indices.  With non-power-of-2 sizes you are forced to keep the
> >> indices less than the ring size.
> > 
> > Right. So
> > 
> > if (unlikely(++idx >= size))
> > idx = 0;
> > 
> > OTOH ring size that's twice larger than necessary
> > because of power of two requirements wastes cache.
> 
> I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
> the complication and the gratuitous difference with 1.0.

I thought originally there's a reason 1.0 rings had to be powers of two
but now I don't see why. OK, we can make it a feature flag later if we
want to.

> If batching is mostly advisory (with exceptions such as mrg-rxbuf) I
> don't have any problem with it.
> 
> Paolo




Re: [virtio-dev] packed ring layout proposal v2

2017-02-09 Thread Cornelia Huck
On Thu, 9 Feb 2017 16:48:53 +0100
Paolo Bonzini  wrote:

> On 08/02/2017 20:59, Michael S. Tsirkin wrote:
> > We couldn't decide what's better for everyone in 1.0 days and I doubt
> > we'll be able to now, but yes, benchmarking is needed to make
> > sure it's required. Very easy to remove or not to use/support in
> > drivers/devices though.
> 
> Fair enough, but of course then we must specify that devices MUST
> support either VRING_DESC_F_NEXT or VRING_DESC_F_INDIRECT, and drivers
> SHOULD support both (or use neither).
> 
> The drivers' part adds to the implementation burden, which is why I
> wanted to remove it.  Alternatively we can say that indirect is
> mandatory for both devices and drivers (and save a feature bit), while
> VRING_DESC_F_NEXT is optional.

I think this (INDIRECT mandatory, NEXT optional) makes sense. But we
really need some benchmarking.

> 
> >>> * Non power-of-2 ring sizes
> >>>
> >>> As the ring simply wraps around, there's no reason to
> >>> require ring size to be power of two.
> >>> It can be made a separate feature though.
> >>
> >> Power of 2 ring sizes are required in order to ignore the high bits of
> >> the indices.  With non-power-of-2 sizes you are forced to keep the
> >> indices less than the ring size.
> > 
> > Right. So
> > 
> > if (unlikely(++idx >= size))
> > idx = 0;
> > 
> > OTOH ring size that's twice larger than necessary
> > because of power of two requirements wastes cache.
> 
> I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
> the complication and the gratuitous difference with 1.0.

I agree. I don't think dropping the power of 2 requirement buys us so
much that it makes up for the added complexity.





Re: [virtio-dev] packed ring layout proposal v2

2017-02-09 Thread Paolo Bonzini


On 08/02/2017 20:59, Michael S. Tsirkin wrote:
> We couldn't decide what's better for everyone in 1.0 days and I doubt
> we'll be able to now, but yes, benchmarking is needed to make
> sure it's required. Very easy to remove or not to use/support in
> drivers/devices though.

Fair enough, but of course then we must specify that devices MUST
support either VRING_DESC_F_NEXT or VRING_DESC_F_INDIRECT, and drivers
SHOULD support both (or use neither).

The drivers' part adds to the implementation burden, which is why I
wanted to remove it.  Alternatively we can say that indirect is
mandatory for both devices and drivers (and save a feature bit), while
VRING_DESC_F_NEXT is optional.

>>> * Non power-of-2 ring sizes
>>>
>>> As the ring simply wraps around, there's no reason to
>>> require ring size to be power of two.
>>> It can be made a separate feature though.
>>
>> Power of 2 ring sizes are required in order to ignore the high bits of
>> the indices.  With non-power-of-2 sizes you are forced to keep the
>> indices less than the ring size.
> 
> Right. So
> 
>   if (unlikely(++idx >= size))
>   idx = 0;
> 
> OTOH ring size that's twice larger than necessary
> because of power of two requirements wastes cache.

I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
the complication and the gratuitous difference with 1.0.

If batching is mostly advisory (with exceptions such as mrg-rxbuf) I
don't have any problem with it.

Paolo




Re: [virtio-dev] packed ring layout proposal v2

2017-02-08 Thread Michael S. Tsirkin
On Wed, Feb 08, 2017 at 06:41:40PM +0100, Paolo Bonzini wrote:
> 
> 
> On 08/02/2017 04:20, Michael S. Tsirkin wrote:
> > * Scatter/gather support
> > 
> > We can use 1 bit to chain s/g entries in a request, same as virtio 1.0:
> > 
> > /* This marks a buffer as continuing via the next field. */
> > #define VRING_DESC_F_NEXT   1
> > 
> > Unlike virtio 1.0, all descriptors must have distinct ID values.
> > 
> > Also unlike virtio 1.0, use of this flag will be an optional feature
> > (e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it.
> 
> I would still prefer that we had  _either_ single-direct or
> multiple-indirect descriptors, i.e. no VRING_DESC_F_NEXT.  I can propose
> my idea for this in a separate message.

All it costs us spec-wise is a single bit :) 

The cost of indirect is an extra cache miss.

We couldn't decide what's better for everyone in 1.0 days and I doubt
we'll be able to now, but yes, benchmarking is needed to make
sure it's required. Very easy to remove or not to use/support in
drivers/devices though.

> > * Batching descriptors:
> > 
> > virtio 1.0 allows passing a batch of descriptors in both directions, by
> > incrementing the used/avail index by values > 1.  We can support this by
> > chaining a list of descriptors through a bit in the flags field.
> > To allow use together with s/g, a different bit will be used.
> > 
> > #define VRING_DESC_F_BATCH_NEXT 0x0010
> > 
> > Batching works for both driver and device descriptors.
> 
> I'm still not sure how this would be useful.


So this is used at least by virtio-net mergeable buffers to combine
many buffers into a single packet.

Similarly, on transmit Linux sometimes supplies packets in batches
(XMIT_MORE flag). If the other side processes them, it seems nice to tell
it: there's more to come soon; if you see this it is wise to poll now.

That's why I kind of felt it's better as a standard bit.

>  It cannot be mandatory to
> set the bit, I think, because you don't know when the host/guest is
> going to read descriptors.  So both host and guest always have to look
> ahead one element in any case.

Right, but the point is: what do you do if you find nothing there?
If you saw VRING_DESC_F_BATCH_NEXT, it's a hint that
you should poll; there's more to come soon.

> > * Non power-of-2 ring sizes
> > 
> > As the ring simply wraps around, there's no reason to
> > require ring size to be power of two.
> > It can be made a separate feature though.
> 
> Power of 2 ring sizes are required in order to ignore the high bits of
> the indices.  With non-power-of-2 sizes you are forced to keep the
> indices less than the ring size.

Right. So

if (unlikely(++idx >= size))
idx = 0;

OTOH ring size that's twice larger than necessary
because of power of two requirements wastes cache.

>  Alternatively you can do this:
> 
> > * Event index would be in the range 0 to 2 * Queue Size
> > (to detect wrap arounds) and wrap to 0 after that.
> > 
> > The assumption is that each side maintains an internal
> > descriptor counter 0 to 2 * Queue Size that wraps to 0.
> > In that case, interrupt triggers when counter reaches
> > the given value.
> 
> but it seems more complicated than just forcing power-of-2 and ignoring
> the high bits.
> 
> Thanks,
> 
> Paolo

Absolutely, power of 2 lets you save a branch.
At this stage I'm just recording all the ideas
and then as a next step we can micro-benchmark prototypes
and compare.

-- 
MST




Re: [virtio-dev] packed ring layout proposal v2

2017-02-08 Thread Paolo Bonzini


On 08/02/2017 04:20, Michael S. Tsirkin wrote:
> * Scatter/gather support
> 
> We can use 1 bit to chain s/g entries in a request, same as virtio 1.0:
> 
> /* This marks a buffer as continuing via the next field. */
> #define VRING_DESC_F_NEXT   1
> 
> Unlike virtio 1.0, all descriptors must have distinct ID values.
> 
> Also unlike virtio 1.0, use of this flag will be an optional feature
> (e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it.

I would still prefer that we had  _either_ single-direct or
multiple-indirect descriptors, i.e. no VRING_DESC_F_NEXT.  I can propose
my idea for this in a separate message.

> * Batching descriptors:
> 
> virtio 1.0 allows passing a batch of descriptors in both directions, by
> incrementing the used/avail index by values > 1.  We can support this by
> chaining a list of descriptors through a bit in the flags field.
> To allow use together with s/g, a different bit will be used.
> 
> #define VRING_DESC_F_BATCH_NEXT 0x0010
> 
> Batching works for both driver and device descriptors.

I'm still not sure how this would be useful.  It cannot be mandatory to
set the bit, I think, because you don't know when the host/guest is
going to read descriptors.  So both host and guest always have to look
ahead one element in any case.

> * Non power-of-2 ring sizes
> 
> As the ring simply wraps around, there's no reason to
> require ring size to be power of two.
> It can be made a separate feature though.

Power of 2 ring sizes are required in order to ignore the high bits of
the indices.  With non-power-of-2 sizes you are forced to keep the
indices less than the ring size.  Alternatively you can do this:

> * Event index would be in the range 0 to 2 * Queue Size
> (to detect wrap arounds) and wrap to 0 after that.
> 
> The assumption is that each side maintains an internal
> descriptor counter 0 to 2 * Queue Size that wraps to 0.
> In that case, interrupt triggers when counter reaches
> the given value.

but it seems more complicated than just forcing power-of-2 and ignoring
the high bits.

Thanks,

Paolo




Re: [virtio-dev] packed ring layout proposal

2016-10-26 Thread Maxime Coquelin



On 10/26/2016 01:24 PM, Paolo Bonzini wrote:



On 26/10/2016 12:14, Yuanhan Liu wrote:

On Wed, Oct 26, 2016 at 11:49:17AM +0200, Maxime Coquelin wrote:



On 09/16/2016 08:51 PM, Michael S. Tsirkin wrote:

On Fri, Sep 16, 2016 at 10:28:19AM +0100, Stefan Hajnoczi wrote:

On Fri, Sep 16, 2016 at 01:39:15AM +0300, Michael S. Tsirkin wrote:

Let's start a discussion around an alternative ring layout.
This has been in my kvm forum 2016 presentation.
The idea is to have a r/w descriptor in a ring structure,
replacing the used and available ring, index and descriptor
buffer.


Do you have pseudo-code?  I find it difficult to infer how it's supposed
to work from the description.


Find a prototype implementation under tools/virtio/ringbench/ring.c in
latest Linux.


Are you sure of the path?
I also had a look at your kernel.org repository, but no luck there
either.


It's tools/virtio/ringtest/ring.c, not ringbench. :)


Thanks!
Maxime




Re: [virtio-dev] packed ring layout proposal

2016-10-26 Thread Michael S. Tsirkin
On Wed, Oct 26, 2016 at 11:49:17AM +0200, Maxime Coquelin wrote:
> 
> 
> On 09/16/2016 08:51 PM, Michael S. Tsirkin wrote:
> > On Fri, Sep 16, 2016 at 10:28:19AM +0100, Stefan Hajnoczi wrote:
> > > On Fri, Sep 16, 2016 at 01:39:15AM +0300, Michael S. Tsirkin wrote:
> > > > Let's start a discussion around an alternative ring layout.
> > > > This has been in my kvm forum 2016 presentation.
> > > > The idea is to have a r/w descriptor in a ring structure,
> > > > replacing the used and available ring, index and descriptor
> > > > buffer.
> > > 
> > > Do you have pseudo-code?  I find it difficult to infer how it's supposed
> > > to work from the description.
> > 
> > Find a prototype implementation under tools/virtio/ringbench/ring.c in
> > latest Linux.
> 
> Are you sure of the path?

Sorry - tools/virtio/ringtest/ring.c

> I also had a look at your kernel.org repository, but no luck there
> either.
> 
> Yuanhan is willing to prototype it in DPDK, so I think he will be
> interested in your implementation.
> 
> Thanks,
> Maxime




Re: [virtio-dev] packed ring layout proposal

2016-10-26 Thread Paolo Bonzini


On 26/10/2016 12:14, Yuanhan Liu wrote:
> On Wed, Oct 26, 2016 at 11:49:17AM +0200, Maxime Coquelin wrote:
>>
>>
>> On 09/16/2016 08:51 PM, Michael S. Tsirkin wrote:
>>> On Fri, Sep 16, 2016 at 10:28:19AM +0100, Stefan Hajnoczi wrote:
 On Fri, Sep 16, 2016 at 01:39:15AM +0300, Michael S. Tsirkin wrote:
> Let's start a discussion around an alternative ring layout.
> This has been in my kvm forum 2016 presentation.
> The idea is to have a r/w descriptor in a ring structure,
> replacing the used and available ring, index and descriptor
> buffer.

 Do you have pseudo-code?  I find it difficult to infer how it's supposed
 to work from the description.
>>>
>>> Find a prototype implementation under tools/virtio/ringbench/ring.c in
>>> latest Linux.
>>
>> Are you sure of the path?
>> I also had a look at your kernel.org repository, but no luck there
>> either.

It's tools/virtio/ringtest/ring.c, not ringbench. :)

Thanks,

Paolo

> Me, neither.
> 
>> Yuanhan is willing to prototype it in DPDK, so I think he will be
>> interested in your implementation.
> 
> Yep :)
> 
> Thanks.
> 
>   --yliu
> 
> 




Re: [virtio-dev] packed ring layout proposal

2016-10-26 Thread Yuanhan Liu
On Wed, Oct 26, 2016 at 11:49:17AM +0200, Maxime Coquelin wrote:
> 
> 
> On 09/16/2016 08:51 PM, Michael S. Tsirkin wrote:
> >On Fri, Sep 16, 2016 at 10:28:19AM +0100, Stefan Hajnoczi wrote:
> >>On Fri, Sep 16, 2016 at 01:39:15AM +0300, Michael S. Tsirkin wrote:
> >>>Let's start a discussion around an alternative ring layout.
> >>>This has been in my kvm forum 2016 presentation.
> >>>The idea is to have a r/w descriptor in a ring structure,
> >>>replacing the used and available ring, index and descriptor
> >>>buffer.
> >>
> >>Do you have pseudo-code?  I find it difficult to infer how it's supposed
> >>to work from the description.
> >
> >Find a prototype implementation under tools/virtio/ringbench/ring.c in
> >latest Linux.
> 
> Are you sure of the path?
> I also had a look at your kernel.org repository, but no luck there
> either.

Me, neither.

> Yuanhan is willing to prototype it in DPDK, so I think he will be
> interested in your implementation.

Yep :)

Thanks.

--yliu




Re: [virtio-dev] packed ring layout proposal

2016-10-26 Thread Maxime Coquelin



On 09/16/2016 08:51 PM, Michael S. Tsirkin wrote:

On Fri, Sep 16, 2016 at 10:28:19AM +0100, Stefan Hajnoczi wrote:

On Fri, Sep 16, 2016 at 01:39:15AM +0300, Michael S. Tsirkin wrote:

Let's start a discussion around an alternative ring layout.
This has been in my kvm forum 2016 presentation.
The idea is to have a r/w descriptor in a ring structure,
replacing the used and available ring, index and descriptor
buffer.


Do you have pseudo-code?  I find it difficult to infer how it's supposed
to work from the description.


Find a prototype implementation under tools/virtio/ringbench/ring.c in
latest Linux.


Are you sure of the path?
I also had a look at your kernel.org repository, but no luck there
either.

Yuanhan is willing to prototype it in DPDK, so I think he will be
interested in your implementation.

Thanks,
Maxime




Re: [virtio-dev] packed ring layout proposal

2016-09-20 Thread Paolo Bonzini


On 20/09/2016 10:38, Pierre Pfister (ppfister) wrote:
> More inline, but first I want to propose something.
> 
> One of the expensive operations in a device workload is mapping a physical
> address found in the descriptor to a virtual address, the reason being
> that there are multiple memory regions.
> 
> When there are only a few regions, optimal implementation uses a loop.
> Otherwise, a tree lookup can be used.  In any case, it is costly.
> 
> So here is the proposal.
> The driver numbers memory regions and provides this numbering to the device, 
> such that
> both device and driver have the same memory region indexing.
> 
> Then, instead of transmitting '__le64 addr' in the descriptor, we put 
> '__le16 region_idx' and '__le32 offset'.

This makes it much more complicated to change the memory map (which
happens rarely---but happens, e.g. on memory hotplug).  There's also
additional logic to split the regions to 4GB when they're bigger, which
makes the tables larger and less cache-friendly.

The simple solution is to save the last two or four regions used, and
only then fall back to a binary search for everything else.  For the
common case not involving memory hotplug this is the fastest method,
because there are effectively only one or two regions in use.

Again: let's not second-guess the software.  Software is easy to
improve, hardware is hard.
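
A sketch of that strategy (the struct fields and region_bsearch() are
hypothetical, not from any existing API):

struct region {
        uint64_t gpa;           /* guest-physical base */
        uint64_t len;
        void    *hva;           /* host-virtual base */
};

/* Check a tiny most-recently-used cache first and fall back to a
 * binary search only on a miss; with one or two hot regions the
 * search path is almost never taken. */
static void *gpa_to_hva(struct region *mru[2], struct region *all, int n,
                        uint64_t gpa)
{
        for (int i = 0; i < 2; i++)
                if (mru[i] && gpa - mru[i]->gpa < mru[i]->len)
                        return (char *)mru[i]->hva + (gpa - mru[i]->gpa);

        struct region *r = region_bsearch(all, n, gpa); /* hypothetical */
        if (!r)
                return NULL;
        mru[1] = mru[0];
        mru[0] = r;             /* promote for the next lookup */
        return (char *)r->hva + (gpa - r->gpa);
}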

> We also reduce the size of the descriptor structure by 16 bits,
> which is not very helpful for now.
> At least for networking it wouldn't be a problem to change the 'len' field
> to '__le16' too, reducing it even more.
> But maybe for storage devices this could be an issue?

Yeah, storage requires 32 bits for the length.

Also, 16 bytes is kind of a sweet spot because it is naturally aligned
and it divides the cache line.  14 bytes per descriptor doesn't really
buy you anything.  12 bytes would let you fit 5 descriptors in a cache
line in some cases, but in other cases some descriptors would straddle
two cache lines.  16 seems better.

 But, VRING_DESC_F_INDIRECT adds a layer of indirection and makes 
 prefetching a bit
 more complicated.
 The reason why VRING_DESC_F_INDIRECT was useful in the first place is that 
 virtio 1.0 
 didn't have enough descriptors in the main queue (it is equal to the queue 
 size).
>>>
>>> rings are physically contigious, and allocating large physically
>>> contigious buffers is problematic for many guests.
>>>
>>
>> If having physically contiguous rings is very important, could we allocate 
>> multiple arrays instead of one ?
>>
>> For example, allocating 8 rings of 1024 descriptors;
>> struct desc *current_desc = &vring[(idx >> 10) & 0x7][idx & 0x3ff];
>>
>> Just like indirect descriptors, this has two levels of indirection, but 
>> unlike indirect, this the lookup
>> is from non-shared memory and is guaranteed to not change.

Like Michael, I would like a benchmark for this.  Above 2 or 3
descriptors the cost of indirection should be a wash, because if
anything the loop on the indirect descriptors (loop for desc.len / 16
iterations) is more predictable than the loop on direct descriptors
(loop while !VRING_DESC_F_NEXT).

And even for 2 or 3 direct descriptors there's some risk of repeated
cache line bouncing between the consumer and the producer who's
preparing the next item.  Instead, when you process an indirect
descriptor you do fewer reads on the ring's cache line, and all of them
are very very close to each other.

Considering the simplified logic in the consumer _and_ the possibility
of preallocating the second-level descriptors in the producer, indirect
descriptors seem like a very straightforward win.

Paolo




Re: [virtio-dev] packed ring layout proposal

2016-09-20 Thread Pierre Pfister (ppfister)
Oops. Pressed 'send' by mistake on the previous mail.

---

More inline, but first I want to propose something.

One of the expensive operations in a device workload is mapping a physical
address found in the descriptor to a virtual address.

The reason is that there are multiple memory regions.

When there are only a few regions, an optimal implementation uses a loop.
Otherwise, a tree lookup can be used.
In any case, it is costly.

So here is the proposal.

The driver numbers memory regions and provides this numbering to the device, 
such that
both device and driver have the same memory region indexing.

Then, instead of transmitting '__le64 addr' in the descriptor, we put 
'__le16 region_idx' and '__le32 offset'.

This way, the translation could be done very efficiently.
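
A sketch of what this would look like (the field names and lookup helper
are illustrative): translation becomes a bounds check plus an add, with
no search at all.

#include <stdint.h>

/* Descriptor variant carrying (region index, offset) instead of a
 * 64-bit guest-physical address, per the proposal above. */
struct desc_region {
        uint16_t region_idx;
        uint32_t offset;
        uint32_t len;
        uint16_t index;
        uint16_t flags;
};

/* region_base[] is the region table shared between driver and device. */
static inline void *translate(void *const region_base[], uint16_t nregions,
                              const struct desc_region *d)
{
        if (d->region_idx >= nregions)
                return NULL;    /* malformed descriptor */
        return (char *)region_base[d->region_idx] + d->offset;
}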

We also reduce the size of the descriptor structure by 16 bits,
which is not very helpful for now.
At least for networking it wouldn't be a problem to change the 'len' field
to '__le16' too, reducing it even more.
But maybe for storage devices this could be an issue?


> 
> 
> 
> 
>>> 
>>> I think the simplified ring layout is a good idea. 
>>> It reduces the number of indirections, and will probably help pre-fetching 
>>> data in order to avoid cache misses.
>>> 
>>> I am not completely convinced by the flag-based approach instead of the 
>>> index-based approach though.
>>> With the index-based approach, you have only one read to do in order
>>> to know up to what point in the queue you can go.
>>> With this flag, every time you look at a new descriptor, before doing 
>>> anything you will have to look at the flags.
>>> Meaning that you will have to wait for the result of a test on a very newly 
>>> fetched piece of memory.
>>> 
>>> So maybe it would be interesting to still have a used_index and 
>>> available_index (put in different cache lines).
>> 
>> Problem is, index access is guaranteed to be a cache miss.
>> You need to stall waiting for descriptor length anyway -
>> go ahead and implement an alternative, and prove me wrong,
>> but my testing shows removing the index speeds things up.
>> 
>> Find the prototype here:
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/tools/virtio/ringtest/ring.c
>> 
> 
> Actually what you say makes a lot of sense.
> And it's really easy to just prefetch the next descriptors (which would
> include the flags).
> 
> I had a few reasons why having the index was handy, but nothing so important.
> 
>>> 
>>> First of all, if I understand correctly, VRING_DESC_F_NEXT doesn't use a 
>>> 'next' field, and rather just
>>> say that the next descriptor in the queue is part of the same message. I 
>>> think this is really, really, good.
>>> 
>>> Paolo proposed in a later mail to drop VRING_DESC_F_NEXT altogether, and I 
>>> think
>>> he is right at least in the sense that having two solutions for doing 
>>> almost the same thing 
>>> is not necessary (IMHO, even harmful).
>> 
>> I'd like to see some code and numbers to back this up.  In particular,
>> testing dpdk performance with/without indirect descriptors would be
>> interesting.
>> If the optimal things depends on workload or the type of device in use,
>> then it might be necessary.
> 
> There is a gain when indirect descriptors are enabled, but I think this is 
> just due to the fact that
> the number of descriptors is virtually increased.
> I will run some tests this week and try to compare 512 w/o indirect Vs 256 
> with indirect.
> 
>> 
>>> But, VRING_DESC_F_INDIRECT adds a layer of indirection and makes 
>>> prefetching a bit
>>> more complicated.
>>> The reason why VRING_DESC_F_INDIRECT was useful in the first place is that 
>>> virtio 1.0 
>>> didn't have enough descriptors in the main queue (it is equal to the queue 
>>> size).
>> 
>> rings are physically contigious, and allocating large physically
>> contigious buffers is problematic for many guests.
>> 
> 
> If having physically contiguous rings is very important, could we allocate 
> multiple arrays instead of one ?
> 
> For example, allocating 8 rings of 1024 descriptors;
> struct desc *current_desc = &vring[(idx >> 10) & 0x7][idx & 0x3ff];
> 
> Just like indirect descriptors, this has two levels of indirection, but 
> unlike indirect, this the lookup
> is from non-shared memory and is guaranteed to not change.
> 
> 
>> 
>>> And if delay spent in the queue is an issue, I think it's the driver's job
>>> to avoid bufferbloat if necessary
>>> by adapting the number of descriptors it allows the device to use.
>>> 
 
 
 * Batching descriptors:
 
 virtio 1.0 allows passing a batch of descriptors in both directions, by
 incrementing the used/avail index by values > 1.  We can support this by
 tagging first/middle/last descriptors in the flags field. For
 compatibility, polarity is reversed so the bit is set for non-first and
 non-last descriptors in a batch:
 
 #define BATCH_NOT_FIRST 0x0010
 #define BATCH_NOT_LAST  0x0020
 
 In other

Re: [virtio-dev] packed ring layout proposal

2016-09-20 Thread Pierre Pfister (ppfister)
More inline, but first I want to propose something.

One of the expensive operations in a device workload is mapping a physical
address found in the descriptor to a virtual address.

The reason is that there are multiple memory regions.

When there are only a few regions, an optimal implementation uses a loop.
Otherwise, a tree lookup can be used.
In any case, it is costly.

So here is the proposal.

The driver numbers memory regions and provides this numbering to the device, 
such that
both device and driver have the same memory region indexing.

Then, instead of transmitting '__le64 addr' in the descriptor, we put 





>> 
>> I think the simplified ring layout is a good idea. 
>> It reduces the number of indirections, and will probably help pre-fetching 
>> data in order to avoid cache misses.
>> 
>> I am not completely convinced by the flag-based approach instead of the 
>> index-based approach though.
>> With the index-based approach, you have only one read to do in order
>> to know up to what point in the queue you can go.
>> With this flag, every time you look at a new descriptor, before doing 
>> anything you will have to look at the flags.
>> Meaning that you will have to wait for the result of a test on a very newly 
>> fetched piece of memory.
>> 
>> So maybe it would be interesting to still have a used_index and 
>> available_index (put in different cache lines).
> 
> Problem is, index access is guaranteed to be a cache miss.
> You need to stall waiting for descriptor length anyway -
> go ahead and implement an alternative, and prove me wrong,
> but my testing shows removing the index speeds things up.
> 
> Find the prototype here:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/tools/virtio/ringtest/ring.c
> 

Actually what you say makes a lot of sense.
And it's really easy to just prefetch the next descriptors (which would
include the flags).

I had a few reasons why having the index was handy, but nothing so important.

>> 
>> First of all, if I understand correctly, VRING_DESC_F_NEXT doesn't use a 
>> 'next' field, and rather just
>> say that the next descriptor in the queue is part of the same message. I 
>> think this is really, really, good.
>> 
>> Paolo proposed in a later mail to drop VRING_DESC_F_NEXT altogether, and I 
>> think
>> he is right at least in the sense that having two solutions for doing almost 
>> the same thing 
>> is not necessary (IMHO, even harmful).
> 
> I'd like to see some code and numbers to back this up.  In particular,
> testing dpdk performance with/without indirect descriptors would be
> interesting.
> If the optimal things depends on workload or the type of device in use,
> then it might be necessary.

There is a gain when indirect descriptors are enabled, but I think this is just 
due to the fact that
the number of descriptors is virtually increased.
I will run some tests this week and try to compare 512 w/o indirect Vs 256 with 
indirect.

> 
>> But, VRING_DESC_F_INDIRECT adds a layer of indirection and makes prefetching 
>> a bit
>> more complicated.
>> The reason why VRING_DESC_F_INDIRECT was useful in the first place is that 
>> virtio 1.0 
>> didn't have enough descriptors in the main queue (it is equal to the queue 
>> size).
> 
> rings are physically contigious, and allocating large physically
> contigious buffers is problematic for many guests.
> 

If having physically contiguous rings is very important, could we allocate 
multiple arrays instead of one ?

For example, allocating 8 rings of 1024 descriptors;
struct desc *current_desc = &vring[(idx >> 10) & 0x7][idx & 0x3ff];

Just like indirect descriptors, this has two levels of indirection, but unlike 
indirect, this the lookup
is from non-shared memory and is guaranteed to not change.


> 
>> And if delay spent in the queue is an issue, I think it's the driver's job
>> to avoid bufferbloat if necessary
>> by adapting the number of descriptors it allows the device to use.
>> 
>>> 
>>> 
>>> * Batching descriptors:
>>> 
>>> virtio 1.0 allows passing a batch of descriptors in both directions, by
>>> incrementing the used/avail index by values > 1.  We can support this by
>>> tagging first/middle/last descriptors in the flags field. For
>>> compatibility, polarity is reversed so the bit is set for non-first and
>>> non-last descriptors in a batch:
>>> 
>>> #define BATCH_NOT_FIRST 0x0010
>>> #define BATCH_NOT_LAST  0x0020
>>> 
>>> In other words, the only descriptor in a single-descriptor batch has
>>> neither bit set.
>>> 
>>> A batch does not have to be processed together, so the
>>> !first/!last flags can be changed when a descriptor is used.
>>> 
>> 
>> I don't exactly understand why this is necessary.
>> Is there any added value in knowing that a bunch of messages have
>> been written at the same time?
> 
> virtio net mergeable buffers pretty much rely on this.
> 

One other way to do a batch commit is to set the DESC_HW flag of the
first committed descriptor after the following ones.


e.g. to batch 3 descriptors for hardware.

==> [...] [HW
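
A minimal sketch of such a batch commit (fill() is a hypothetical helper
writing addr/len/index; the barrier is an assumption about what the
driver needs before the head flag store):

/* Commit a batch of 3: flag the tail descriptors first and publish
 * the head last, so a device polling the head's DESC_HW bit observes
 * either the whole batch or none of it. */
fill(&ring[i + 2]);
ring[i + 2].flags = cpu_to_le16(DESC_HW);
fill(&ring[i + 1]);
ring[i + 1].flags = cpu_to_le16(DESC_HW);
fill(&ring[i + 0]);
smp_wmb();                      /* all stores before the head flag */
ring[i + 0].flags = cpu_to_le16(DESC_HW);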

Re: [virtio-dev] packed ring layout proposal

2016-09-19 Thread Michael S. Tsirkin
On Mon, Sep 19, 2016 at 09:59:50PM +0200, Paolo Bonzini wrote:
> 
> 
> On 19/09/2016 21:14, Michael S. Tsirkin wrote:
> > > But I prefer all these tricks to be hidden within the driver.  It seems
> > > a good idea in the beginning to rig the device to second-guess what a
> > > driver could do, but then it makes things awkward.  (This is also why
> > > I'd rather get rid of VRING_DESC_F_NEXT).
> > 
> > Right, I'll CC dpdk list on this proposal. As dpdk uses _NEXT almost
> > exclusively I'd like to make sure we are not messing things up.
> > 
> > Maybe the right thing to do is to disallow _NEXT if _INDIRECT is
> > negotiated. Each device can then decide whether it wants to
> > use _INDIRECT or _NEXT (or both). How's that?
> 
> Still a bit of feature creep if we can avoid it, but at least it lets
> you write two fast loops to parse the descriptors.  So that's already a
> huge improvement.
> 
> Negotiating _INDIRECT would still allow a single direct buffer.
> 
> Just one thing: in the  _NEXT case, does the driver write only one
> available descriptor for the head (effectively ignoring desc.index on
> all descriptors but the first)?  Or does it have to write all the
> descriptors?


It would be cleaner to write them all out. But maybe it works without,
somehow. This will need some thought.


>  If the latter, _INDIRECT would almost surely end up faster.
> 
> Paolo

It does add overhead for sure, but OTOH it makes reuse by driver simpler.
I'll look into testing this.

-- 
MST




Re: [virtio-dev] packed ring layout proposal

2016-09-19 Thread Paolo Bonzini


On 19/09/2016 21:14, Michael S. Tsirkin wrote:
> > But I prefer all these tricks to be hidden within the driver.  It seems
> > a good idea in the beginning to rig the device to second-guess what a
> > driver could do, but then it makes things awkward.  (This is also why
> > I'd rather get rid of VRING_DESC_F_NEXT).
> 
> Right, I'll CC dpdk list on this proposal. As dpdk uses _NEXT almost
> exclusively I'd like to make sure we are not messing things up.
> 
> Maybe the right thing to do is to disallow _NEXT if _INDIRECT is
> negotiated. Each device can then decide whether it wants to
> use _INDIRECT or _NEXT (or both). How's that?

Still a bit of feature creep if we can avoid it, but at least it lets
you write two fast loops to parse the descriptors.  So that's already a
huge improvement.

Negotiating _INDIRECT would still allow a single direct buffer.

Just one thing: in the  _NEXT case, does the driver write only one
available descriptor for the head (effectively ignoring desc.index on
all descriptors but the first)?  Or does it have to write all the
descriptors?  If the latter, _INDIRECT would almost surely end up faster.

Paolo




Re: [virtio-dev] packed ring layout proposal

2016-09-19 Thread Michael S. Tsirkin
On Mon, Sep 19, 2016 at 09:07:08PM +0200, Paolo Bonzini wrote:
> 
> 
> On 19/09/2016 19:03, Stephen Hemminger wrote:
> > Ideally, the DPDK driver should use the best algorithm it can negotiate.
> > Be sure and test with old RHEL6 type systems, modern enterprise (RHEL/SLES)
> > and also Google Compute which has there own virtio.  Not sure what virtual
> > box is doing?? 
> 
> RHEL6 has all pre-1.0 features, but vbox doesn't have indirect
> descriptors.  No idea about GCE.
> 
> Paolo

As long as the compatibility code is maintained,
it might be cheaper to do

        if (packed_ring)
                completely separate code path

than to try to mix in compatibility.

-- 
MST




Re: [virtio-dev] packed ring layout proposal

2016-09-19 Thread Michael S. Tsirkin
On Mon, Sep 19, 2016 at 08:47:47PM +0200, Paolo Bonzini wrote:
> 
> 
> On 19/09/2016 19:22, Michael S. Tsirkin wrote:
> > No but you do need to make sure that when get_buf returns
> > the first buffer, all buffers are available.
> 
> Is that not the case already?  But anyway I now agree it's better to put
> it in the descriptor, thanks for clarifying.
> 
> > And from the achitecture POV, I feel this is a transport
> > thing, not a device specific thing.
> 
> It depends, many devices (e.g. storage) are packet-based and bursts do
> not have any meaning.  So I guess it's okay, but then I would prefer to
> make it a separate feature bit.

It just becomes an optimization feature then, and can be ignored
without a feature flag.
Maybe it's ok - will allow reusing some bits for something else - but
some people did complain about all the branches forced by the
feature bits.

> I'm mostly afraid that it would be hard for the spec to define it in
> general terms (while it's easy to define something specific to rxbuf
> merging).  Having device-specific descriptor flags seems like a natural
> extension anyway...
> 
> Paolo

Well - the batching of the index is kind of an undocumented property
in 1.0.  I'll try to write it up, let's see what I can come up with.

-- 
MST




Re: [virtio-dev] packed ring layout proposal

2016-09-19 Thread Michael S. Tsirkin
On Mon, Sep 19, 2016 at 09:06:19PM +0200, Paolo Bonzini wrote:
> 
> 
> On 19/09/2016 19:15, Michael S. Tsirkin wrote:
> > On Mon, Sep 19, 2016 at 02:01:57PM +0200, Paolo Bonzini wrote:
> >> The main issue with VRING_DESC_F_INDIRECT, at least as it is implemented
> >> in Linux, is the cost of allocating and freeing the indirect descriptor.
> >>  But if you know that the number of descriptors is fixed (e.g. 2 for
> >> request header + data) you can pre-allocate the descriptors with a 1:1
> >> relationship with the entries of the ring buffer.  This might be useful
> >> for the virtio-net tx ring.
> > 
> > Yes, I've been thinking the same thing. Is there any advantage
> > to have the indirect addresses pre-programmed as
> > opposed to stored in the descriptors?
> > 
> > In theory
> > if (desc.flags & indirect)
> > use(*address[i])
> > 
> > might be speculated as compared to
> > if (desc.flags & indirect)
> > use(*desc.addr)
> > which can't.
> > But OTOH this might add to cache pressure as desc is already in cache
> > anyway.
> 
> Yes, I think this would be the worst of both worlds. :)
> 
> If the driver knows it has 2 descriptors per request, it could allocate
> a big block in advance and do for example
> 
>   struct {
>   // pre-allocated and pre-initialized
>   // to point to ->hdr
>   struct virtio_desc req_desc;
>   // only flags prefilled
>   struct virtio_desc data_desc;
>   struct req_header  hdr;
>   } *indirect_data;
> 
>   i = index++ & (size - 1);
>   indirect_addr = &indirect_data[i];
>   indirect_data[i].data_desc.addr = this_data_addr;
>   indirect_data[i].data_desc.len = this_data_len;
>   // ... fill in indirect_data[i].hdr ...
> 
>   desc[i].addr = indirect_addr;
>   smp_wmb();
>   desc[i].index = i | DESC_HW;
> 
> where there is no need for an array of descriptor addresses,
> &indirect_data[i].  As long as the indirect_data size is a power of 2,
> indirect_data needn't even be physically contiguous.
> 
> But I prefer all these tricks to be hidden within the driver.  It seems
> a good idea in the beginning to rig the device to second-guess what a
> driver could do, but then it makes things awkward.  (This is also why
> I'd rather get rid of VRING_DESC_F_NEXT).

Right, I'll CC dpdk list on this proposal. As dpdk uses _NEXT almost
exclusively I'd like to make sure we are not messing things up.

Maybe the right thing to do is to disallow _NEXT if _INDIRECT is
negotiated. Each device can then decide whether it wants to
use _INDIRECT or _NEXT (or both). How's that?
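A sketch of how that rule might look in a device's descriptor parser;
the feature bit is virtio 1.0's VIRTIO_RING_F_INDIRECT_DESC (bit 28),
while the function around it is assumed, not from the thread:

#include <stdint.h>
#include <errno.h>

#define VIRTIO_RING_F_INDIRECT_DESC 28
#define VRING_DESC_F_NEXT 1

/* once _INDIRECT has been negotiated, a _NEXT chain in the direct
 * ring would be a driver error under the proposed rule */
static int check_desc_flags(uint64_t features, uint16_t desc_flags)
{
    if ((features & (1ULL << VIRTIO_RING_F_INDIRECT_DESC)) &&
        (desc_flags & VRING_DESC_F_NEXT))
        return -EINVAL;
    return 0;
}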


> > If all descriptors are known ahead of the time to be indirect we could
> > pack more of them per cache line by dropping addr (and maybe len?).
> > This might work well for virtio blk.
> 
> Nope, virtio-blk doesn't know at all the size of the sg list (for
> small-I/O workloads it is always 2, but potentially unlimited).  For the
> above ideas I was thinking of the virtio tx header (but only for
> MTU=1500, as Stephen pointed out).
> 
> Paolo

Length in the indirect descriptor itself might not be needed to figure
out the s/g list if we use VRING_DESC_F_NEXT in the direct descriptors that it
points to.
Using a flag in the descriptor might cause a stall but I think that's
typically cheaper than a cache miss (although we are not talking
a full miss here, more like 1/8 of a miss).
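For illustration, a minimal sketch of the two ways a device could size
the s/g list behind an indirect descriptor: from the indirect
descriptor's len field, or by walking VRING_DESC_F_NEXT in the entries
as suggested above. The struct and helper names are assumptions for the
sketch, not from the proposal.

#include <stdint.h>
#include <stddef.h>

#define VRING_DESC_F_NEXT 1

struct sg_entry {        /* one 16-byte entry in the indirect table */
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* (a) count taken from the indirect descriptor's len field */
static size_t sg_count_by_len(uint32_t indirect_len)
{
    return indirect_len / sizeof(struct sg_entry);
}

/* (b) count found by chasing F_NEXT; the indirect len is then not
 * needed, at the cost of a data-dependent branch per entry */
static size_t sg_count_by_next(const struct sg_entry *tbl, size_t max)
{
    size_t n = 0;
    while (n + 1 < max && (tbl[n].flags & VRING_DESC_F_NEXT))
        n++;
    return n + 1;        /* include the terminating entry */
}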

-- 
MST

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal

2016-09-19 Thread Paolo Bonzini


On 19/09/2016 19:15, Michael S. Tsirkin wrote:
> On Mon, Sep 19, 2016 at 02:01:57PM +0200, Paolo Bonzini wrote:
>> The main issue with VRING_DESC_F_INDIRECT, at least as it is implemented
>> in Linux, is the cost of allocating and freeing the indirect descriptor.
> >>  But if you know that the number of descriptors is fixed (e.g. 2 for
>> request header + data) you can pre-allocate the descriptors with a 1:1
>> relationship with the entries of the ring buffer.  This might be useful
>> for the virtio-net tx ring.
> 
> Yes, I've been thinking the same thing. Is there any advantage
> to have the indirect addresses pre-programmed as
> opposed to stored in the descriptors?
> 
> In theory
>   if (desc.flags & indirect)
>   use(*address[i])
> 
> might be speculated as compared to
>   if (desc.flags & indirect)
>   use(*desc.addr)
> which can't.
> But OTOH this might add to cache pressure as desc is already in cache
> anyway.

Yes, I think this would be the worst of both worlds. :)

If the driver knows it has 2 descriptors per request, it could allocate
a big block in advance and do for example

struct {
// pre-allocated and pre-initialized
// to point to ->hdr
struct virtio_desc req_desc;
// only flags prefilled
struct virtio_desc data_desc;
struct req_header  hdr;
} *indirect_data;

i = index++ & (size - 1);
indirect_addr = &indirect_data[i];
	indirect_data[i].data_desc.addr = this_data_addr;
	indirect_data[i].data_desc.len = this_data_len;
	// ... fill in indirect_data[i].hdr ...

desc[i].addr = indirect_addr;
smp_wmb();
desc[i].index = i | DESC_HW;

where there is no need for an array of descriptor addresses,
&indirect_data[i].  As long as the indirect_data size is a power of 2,
indirect_data needn't even be physically contiguous.

But I prefer all these tricks to be hidden within the driver.  It seems
a good idea in the beginning to rig the device to second-guess what a
driver could do, but then it makes things awkward.  (This is also why
I'd rather get rid of VRING_DESC_F_NEXT).

> If all descriptors are known ahead of the time to be indirect we could
> pack more of them per cache line by dropping addr (and maybe len?).
> This might work well for virtio blk.

Nope, virtio-blk doesn't know at all the size of the sg list (for
small-I/O workloads it is always 2, but potentially unlimited).  For the
above ideas I was thinking of the virtio tx header (but only for
MTU=1500, as Stephen pointed out).

Paolo

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal

2016-09-19 Thread Paolo Bonzini


On 19/09/2016 19:03, Stephen Hemminger wrote:
> Ideally, the DPDK driver should use the best algorithm it can negotiate.
> Be sure and test with old RHEL6 type systems, modern enterprise (RHEL/SLES)
> and also Google Compute which has their own virtio.  Not sure what virtual
> box is doing?? 

RHEL6 has all pre-1.0 features, but vbox doesn't have indirect
descriptors.  No idea about GCE.

Paolo

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal

2016-09-19 Thread Paolo Bonzini


On 19/09/2016 19:22, Michael S. Tsirkin wrote:
> No but you do need to make sure that when get_buf returns
> the first buffer, all buffers are available.

Is that not the case already?  But anyway I now agree it's better to put
it in the descriptor, thanks for clarifying.

> And from the architecture POV, I feel this is a transport
> thing, not a device specific thing.

It depends, many devices (e.g. storage) are packet-based and bursts do
not have any meaning.  So I guess it's okay, but then I would prefer to
make it a separate feature bit.

I'm mostly afraid that it would be hard for the spec to define it in
general terms (while it's easy to define something specific to rxbuf
merging).  Having device-specific descriptor flags seems like a natural
extension anyway...

Paolo

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal

2016-09-19 Thread Maxime Coquelin

Hi Michael,

On 09/19/2016 07:24 PM, Michael S. Tsirkin wrote:
> On Mon, Sep 19, 2016 at 10:03:36AM -0700, Stephen Hemminger wrote:
> > On Sun, 18 Sep 2016 08:36:05 -0400 (EDT)
> > Paolo Bonzini  wrote:
> > 
> > > > Without indirect the usable ring size is cut in half on older kernels that
> > > > don't support any layout. That limit on qemu ends up being 128 packets
> > > 
> > > Hi Stephen,
> > > 
> > > note that here we were talking about limiting direct descriptors to
> > > requests with a single buffer.
> > > 
> > > Paolo
> > 
> > The test I was looking at was small packet transmit (like pktgen or RFC2544).
> > The idea was to make the DPDK driver use all the same algorithms as the Linux
> > kernel which is the baseline for virtio. Also handling jumbo MTU packets
> > efficiently with scatter/gather.
> > 
> > With DPDK the issue was that the typical packet required several transmit
> > slot entries. Particularly bad for jumbo packets which usually are made up
> > of several mbuf's. A 9K packet typically has 5 mbuf's plus one more for the
> > virtio header.
> > 
> > Ideally, the DPDK driver should use the best algorithm it can negotiate.
> > Be sure and test with old RHEL6 type systems, modern enterprise (RHEL/SLES)
> > and also Google Compute which has their own virtio.  Not sure what virtual
> > box is doing??  Why can't code be written like ixgbe driver which has a
> > fast path if config allows it but makes decision at run time.
> 
> Interesting.  Maxime here posted integrating the indirect buffer
> support to address this but this didn't get accepted
> since he could not prove the performance gain.
> 
> Maxime, could you share the link to your patch so we
> can discuss this on the dpdk mailing list?

Here it is:
http://dpdk.org/dev/patchwork/patch/14797/

Thanks,
Maxime

> Thanks!



-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal

2016-09-19 Thread Michael S. Tsirkin
On Mon, Sep 19, 2016 at 10:03:36AM -0700, Stephen Hemminger wrote:
> On Sun, 18 Sep 2016 08:36:05 -0400 (EDT)
> Paolo Bonzini  wrote:
> 
> > > Without indirect the usable ring size is cut in half on older kernels that
> > > don't support any layout. That limit on qemu ends up being 128 packets  
> > 
> > Hi Stephen,
> > 
> > note that here we were talking about limiting direct descriptors to
> > requests with a single buffer.
> > 
> > Paolo
> 
> The test I was looking at was small packet transmit (like pktgen or RFC2544).
> The idea was to make the DPDK driver use all the same algorithms as the Linux
> kernel which is the baseline for virtio. Also handling jumbo MTU packets
> efficiently with scatter/gather.
> 
> With DPDK the issue was that the typical packet required several transmit
> slot entries. Particularly bad for jumbo packets which usually are made up
> of several mbuf's. A 9K packet typically has 5 mbuf's plus one more for the
> virtio header.
> 
> Ideally, the DPDK driver should use the best algorithm it can negotiate.
> Be sure and test with old RHEL6 type systems, modern enterprise (RHEL/SLES)
> and also Google Compute which has their own virtio.  Not sure what virtual
> box is doing??  Why can't code be written like ixgbe driver which has a
> fast path if config allows it but makes decision at run time.

Interesting.  Maxime here posted integrating the indirect buffer
support to address this but this didn't get accepted
since he could not prove the performance gain.

Maxime, could you share the link to your patch so we
can discuss this on the dpdk mailing list?

Thanks!
-- 
MST

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal

2016-09-19 Thread Michael S. Tsirkin
On Mon, Sep 19, 2016 at 07:00:10PM +0200, Paolo Bonzini wrote:
> 
> 
> On 19/09/2016 18:48, Michael S. Tsirkin wrote:
> > > > 
> > > > #define BATCH_NOT_FIRST 0x0010
> > > > #define BATCH_NOT_LAST  0x0020
> > > > 
> > > > In other words, the only descriptor in a single-descriptor batch has neither bit set.
> > > > 
> > > > Batch does not have to be processed together, so
> > > > !first/!last flags can be changed when descriptor is used.
> > > > 
> > > 
> > > I don't exactly understand why this is necessary.
> > > Is there any added value by knowing that a bunch of messages have been
> > > written at the same time?
> > 
> > virtio net mergeable buffers pretty much rely on this.
> 
> How so?  num_buffers is specified in the virtio-net header, which is
> going to be in the first received descriptor.

BTW this access causes an extra cache miss for some short packets when
said packets don't need to be examined immediately.  Andrew's testing
shows an immediate performance gain just by removing this lookup.  I
also feel this is a kind of transport thing that really belongs in the
descriptor, e.g. if you know there are more descriptors coming
prefetch might be more justified. I didn't test this idea yet.
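To make the cache-miss point concrete, a sketch of the two lookups
being compared; the header layout follows the virtio 1.0
mergeable-rxbuf header, BATCH_NOT_LAST is from the quoted proposal, and
the helper names are assumptions.

#include <stdint.h>
#include <stdbool.h>
#include <endian.h>

struct virtio_net_hdr_mrg_rxbuf {
    uint8_t  flags;
    uint8_t  gso_type;
    uint16_t hdr_len;
    uint16_t gso_size;
    uint16_t csum_start;
    uint16_t csum_offset;
    uint16_t num_buffers;    /* only present with MRG_RXBUF */
};

#define BATCH_NOT_LAST 0x0020

/* header-based: touches buffer memory, likely an extra cache miss */
static uint16_t burst_len_from_hdr(const struct virtio_net_hdr_mrg_rxbuf *h)
{
    return le16toh(h->num_buffers);
}

/* descriptor-based: the flags word is in ring memory that is being
 * read anyway, so no buffer access is needed to size the burst */
static bool burst_continues(uint16_t desc_flags)
{
    return desc_flags & BATCH_NOT_LAST;
}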

>  You don't need explicit
> marking of which descriptor is the first or the last.

No but you do need to make sure that when get_buf returns
the first buffer, all buffers are available.


> If you want to do parallel polling of the ring, perhaps we can define
> extra device-specific flags, and define one (e.g.,
> VIRTIO_NET_F_RXBUF_HAS_VIRTIO_NET_HDR) or two (adding also
> VIRTIO_NET_F_RXBUF_LAST) for MRG_RXBUF.

From a performance POV, there's a bunch of unused bits in
the descriptor that can be put to this use; this is better
than a buffer access which might not need to be in cache
for a while.
And from the architecture POV, I feel this is a transport
thing, not a device specific thing.

My kvm forum talk also mentions possible optimization opportunities
if the transport can pass a batch of packets around,
such as implementing xmit_more.



> > Which is optimal depends on the workload:
> > - index requires an update each time we enable interrupts
> > - flag requires an update each time we switch to enabled/disabled
> >   interrupts
> > 
> > Accordingly, for workloads that enable interrupts often,
> > flags are better. For workloads that keep interrupts
> > disabled most of the time, index is better.
> 
> Yeah, this makes sense.  So I guess both indexes and flags should stay.
> 
> Paolo

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal

2016-09-19 Thread Michael S. Tsirkin
On Mon, Sep 19, 2016 at 02:01:57PM +0200, Paolo Bonzini wrote:
> 
> 
> On 19/09/2016 09:37, Pierre Pfister (ppfister) wrote:
> > Paolo proposed in a later mail to drop VRING_DESC_F_NEXT altogether,
> > and I think he is right at least in the sense that having two solutions
> > for doing almost the same thing is not necessary (IMHO, even harmful).
> > 
> > But, VRING_DESC_F_INDIRECT adds a layer of indirection and makes
> > prefetching a bit more complicated. The reason why VRING_DESC_F_INDIRECT
> > was useful in the first place is that virtio 1.0 didn't have enough
> > descriptors in the main queue (it is equal to the queue size).
> > 
> > Therefore, ideally, I'd rather have a longer queue and use
> > VRING_DESC_F_NEXT only and drop VRING_DESC_F_INDIRECT altogether.
> 
> VRING_DESC_F_INDIRECT is necessary for other devices.  For storage it is
> not rare to have s/g lists composed of 10-15 descriptors.
> 
> The main issue with VRING_DESC_F_INDIRECT, at least as it is implemented
> in Linux, is the cost of allocating and freeing the indirect descriptor.
>  But if you know that the number of descriptors is fixed (e.g. 2 for
> request header + data) you can pre-allocate the descriptors with a 1:1
> relationship with the entries of the ring buffer.  This might be useful
> for the virtio-net tx ring.

Yes, I've been thinking the same thing. Is there any advantage
to have the indirect addresses pre-programmed as
opposed to stored in the descriptors?

In theory
if (desc.flags & indirect)
use(*address[i])

might be speculated as compared to
if (desc.flags & indirect)
use(*desc.addr)
which can't.

But OTOH this might add to cache pressure as desc is already in cache
anyway.

If all descriptors are known ahead of the time to be indirect we could
pack more of them per cache line by dropping addr (and maybe len?).
This might work well for virtio blk.

Another consideration is numa locality though. It might be beneficial
to move the indirect buffers to the correct numa node,
if these are always preallocated you can't.


> > With this flag, every-time you look at a new descriptor, before doing
> > anything you will have to look at the flags. Meaning that you will
> > have to wait for the result of a test on a very newly fetched piece
> > of memory.
> 
> Yeah, it trades less cache traffic for higher latency on the remaining
> traffic.  But I like the idea.  For L2 the processor's speculative
> execution should still hide most of the latency (the branch should be
> well predicted in the direction of "another descriptor"---and if there's
> no other descriptor you don't care too much about the speculation
> penalty because you have no work to do).

Yes, it seems to work well.

> You could also try a blind prefetch of desc.addr before looking at the
> flags, in order to help with indirect descriptors.

Tried and this does not seem to help at least in the micro-benchmark.

> >> struct event {
> >>__le16 idx;
> >>__le16 flags;
> >> }
> >>
> >> Flags can be used like in virtio 1.0, by storing a special
> >> value there:
> >>
> >> #define VRING_F_EVENT_ENABLE  0x0
> >>
> >> #define VRING_F_EVENT_DISABLE 0x1
> >>
> >> or it can be used to switch to event idx:
> >>
> >> #define VRING_F_EVENT_IDX 0x2
> >>
> >> in that case, interrupt triggers when descriptor with
> >> a given index value is used.
> > 
> > Is there a reason why we would keep both ways and not just use the IDX one?
> 
> Probably a good idea, too, to keep it simple.  Pre-1.0 can go away.
> 
> Paolo

It's not there for compatibility.  I did put some thought into this, and
I feel different workloads need different mechanisms.  I'll add some
motivation in the next revision I post.
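To make the trade-off above concrete, a minimal sketch of the two
suppression modes, using the event struct and flag values quoted
earlier in the thread; the helper names and the exact write ordering
are assumptions.

#include <stdint.h>

struct event {
    uint16_t idx;      /* __le16 in the proposal */
    uint16_t flags;    /* __le16 in the proposal */
};

#define VRING_F_EVENT_ENABLE  0x0
#define VRING_F_EVENT_DISABLE 0x1
#define VRING_F_EVENT_IDX     0x2

/* flag mode: one write per enable/disable transition */
static void irq_disable(struct event *ev)
{
    ev->flags = VRING_F_EVENT_DISABLE;
}

static void irq_enable(struct event *ev)
{
    ev->flags = VRING_F_EVENT_ENABLE;
}

/* idx mode: a write per re-arm, naming the descriptor after which an
 * interrupt is wanted; cheap when interrupts stay enabled */
static void irq_enable_at(struct event *ev, uint16_t next_used)
{
    ev->idx = next_used;
    ev->flags = VRING_F_EVENT_IDX;
}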

-- 
MST

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal

2016-09-19 Thread Paolo Bonzini


On 19/09/2016 18:48, Michael S. Tsirkin wrote:
> > > 
> > > #define BATCH_NOT_FIRST 0x0010
> > > #define BATCH_NOT_LAST  0x0020
> > > 
> > > In other words, the only descriptor in a single-descriptor batch has neither bit set.
> > > 
> > > Batch does not have to be processed together, so
> > > !first/!last flags can be changed when descriptor is used.
> > > 
> > 
> > I don't exactly understand why this is necessary.
> > Is there any added value by knowing that a bunch of messages have been
> > written at the same time?
> 
> virtio net mergeable buffers pretty much rely on this.

How so?  num_buffers is specified in the virtio-net header, which is
going to be in the first received descriptor.  You don't need explicit
marking of which descriptor is the first or the last.

If you want to do parallel polling of the ring, perhaps we can define
extra device-specific flags, and define one (e.g.,
VIRTIO_NET_F_RXBUF_HAS_VIRTIO_NET_HDR) or two (adding also
VIRTIO_NET_F_RXBUF_LAST) for MRG_RXBUF.

> Which is optimal depends on the workload:
> - index requires an update each time we enable interrupts
> - flag requires an update each time we switch to enabled/disabled
>   interrupts
> 
> Accordingly, for workloads that enable interrupts often,
> flags are better. For workloads that keep interrupts
> disabled most of the time, index is better.

Yeah, this makes sense.  So I guess both indexes and flags should stay.

Paolo

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal

2016-09-19 Thread Michael S. Tsirkin
On Mon, Sep 19, 2016 at 07:37:53AM +, Pierre Pfister (ppfister) wrote:
> Hello Michael, 
> 
> Thanks for doing this. ;-)
> 
> 
> > Le 16 sept. 2016 à 00:39, Michael S. Tsirkin  a écrit :
> > 
> > 
> > Let's start a discussion around an alternative ring layout.
> > This has been in my kvm forum 2016 presentation.
> > The idea is to have a r/w descriptor in a ring structure,
> > replacing the used and available ring, index and descriptor
> > buffer.
> > 
> > * Descriptor ring:
> > 
> > Guest adds descriptors with unique index values and DESC_HW set in flags.
> > Host overwrites used descriptors with correct len, index, and DESC_HW
> > clear.  Flags are always set/cleared last.
> > 
> > #define DESC_HW 0x0080
> > 
> > struct desc {
> >__le64 addr;
> >__le32 len;
> >__le16 index;
> >__le16 flags;
> > };
> > 
> > When DESC_HW is set, descriptor belongs to device. When it is clear,
> > it belongs to the driver.
> 
> I think the simplified ring layout is a good idea. 
> It reduces the number of indirections, and will probably help pre-fetching 
> data in order to avoid cache misses.
> 
> I am not completely convinced by the flag-based approach instead of the 
> index-based approach though.
> With the index-based approach, you have only one read to do in order
> to know up to what point in the queue you can go.
> With this flag, every time you look at a new descriptor, before doing 
> anything you will have to look at the flags.
> Meaning that you will have to wait for the result of a test on a very newly 
> fetched piece of memory.
> 
> So maybe it would be interesting to still have a used_index and 
> available_index (put in different cache lines).

Problem is, index access is guaranteed to be a cache miss.
You need to stall waiting for descriptor length anyway -
go ahead and implement an alternative, and prove me wrong,
but my testing shows removing the index speeds things up.

Find the prototype here:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/tools/virtio/ringtest/ring.c



> 
> > 
> > 
> > * Scatter/gather support
> > 
> > We can use 3 bits to set direction
> > and to chain s/g entries in a request, same as virtio 1.0:
> > 
> > /* This marks a buffer as continuing via the next field. */
> > #define VRING_DESC_F_NEXT   1
> > /* This marks a buffer as write-only (otherwise read-only). */
> > #define VRING_DESC_F_WRITE  2
> > 
> > Unlike virtio 1.0, all descriptors must have distinct ID
> > values.
> > 
> > * Indirect buffers
> > 
> > Can be done like in virtio 1.0:
> > 
> > /* This means the buffer contains a list of buffer descriptors. */
> > #define VRING_DESC_F_INDIRECT   4
> > 
> > In the descriptors in the indirect buffers, I think we should drop index
> > field altogether, just put s/g entries one after the other.
> > 
> > Also, length in the indirect descriptor should mark
> > the length of the chain.
> > 
> > virtio 1.0 seems to allow a s/g entry followed by
> > an indirect descriptor. This does not seem useful,
> > so let's not allow that anymore.
> 
> First of all, if I understand correctly, VRING_DESC_F_NEXT doesn't use a 
> 'next' field, and rather just
> says that the next descriptor in the queue is part of the same message. I 
> think this is really, really, good.
> 
> Paolo proposed in a later mail to drop VRING_DESC_F_NEXT altogether, and I 
> think
> he is right at least in the sense that having two solutions for doing almost 
> the same thing 
> is not necessary (IMHO, even harmful).

I'd like to see some code and numbers to back this up.  In particular,
testing dpdk performance with/without indirect descriptors would be
interesting.
If the optimal thing depends on the workload or the type of device in use,
then it might be necessary.

> But, VRING_DESC_F_INDIRECT adds a layer of indirection and makes prefetching 
> a bit
> more complicated.
> The reason why VRING_DESC_F_INDIRECT was useful in the first place is that 
> virtio 1.0 
> didn't have enough descriptors in the main queue (it is equal to the queue 
> size).

rings are physically contiguous, and allocating large physically
contiguous buffers is problematic for many guests.



> Therefore, ideally, I'd rather have a longer queue and use VRING_DESC_F_NEXT 
> only and drop
> VRING_DESC_F_INDIRECT altogether.

One can always just avoid implementing VRING_DESC_F_INDIRECT if desired.

> And if delay spent in the queue is an issue, I think it's the driver's job to 
> avoid bufferbloat if necessary 
> by adapting the number of descriptors it allows the device to use.
>
> > 
> > 
> > * Batching descriptors:
> > 
> > virtio 1.0 allows passing a batch of descriptors in both directions, by
> > incrementing the used/avail index by values > 1.  We can support this by
> > tagging first/middle/last descriptors in the flags field. For
> > compatibility, polarity is reversed so bit is set for non-first and
> > non-last descriptors in a batch:
> > 
> > #define BATCH_NOT_FIRST 0x0010
> > #define BATCH_NOT_LAST  0x0020

Re: [virtio-dev] packed ring layout proposal

2016-09-19 Thread Paolo Bonzini


On 19/09/2016 09:37, Pierre Pfister (ppfister) wrote:
> Paolo proposed in a later mail to drop VRING_DESC_F_NEXT altogether,
> and I think he is right at least in the sense that having two solutions
> for doing almost the same thing is not necessary (IMHO, even harmful).
> 
> But, VRING_DESC_F_INDIRECT adds a layer of indirection and makes
> prefetching a bit more complicated. The reason why VRING_DESC_F_INDIRECT
> was useful in the first place is that virtio 1.0 didn't have enough
> descriptors in the main queue (it is equal to the queue size).
> 
> Therefore, ideally, I'd rather have a longer queue and use
> VRING_DESC_F_NEXT only and drop VRING_DESC_F_INDIRECT altogether.

VRING_DESC_F_INDIRECT is necessary for other devices.  For storage it is
not rare to have s/g lists composed of 10-15 descriptors.

The main issue with VRING_DESC_F_INDIRECT, at least as it is implemented
in Linux, is the cost of allocating and freeing the indirect descriptor.
 But if you know that the number of descriptors is fixed (e.g. 2 for
request header + data) you can pre-allocate the descriptors with a 1:1
relationship with the entries of the ring buffer.  This might be useful
for the virtio-net tx ring.

> With this flag, every time you look at a new descriptor, before doing
> anything you will have to look at the flags. Meaning that you will
> have to wait for the result of a test on a very newly fetched piece
> of memory.

Yeah, it trades less cache traffic for higher latency on the remaining
traffic.  But I like the idea.  For L2 the processor's speculative
execution should still hide most of the latency (the branch should be
well predicted in the direction of "another descriptor"---and if there's
no other descriptor you don't care too much about the speculation
penalty because you have no work to do).

You could also try a blind prefetch of desc.addr before looking at the
flags, in order to help with indirect descriptors.

>> struct event {
>>  __le16 idx;
>>  __le16 flags;
>> }
>>
>> Flags can be used like in virtio 1.0, by storing a special
>> value there:
>>
>> #define VRING_F_EVENT_ENABLE  0x0
>>
>> #define VRING_F_EVENT_DISABLE 0x1
>>
>> or it can be used to switch to event idx:
>>
>> #define VRING_F_EVENT_IDX 0x2
>>
>> in that case, interrupt triggers when descriptor with
>> a given index value is used.
> 
> Is there a reason why we would keep both ways and not just use the IDX one?

Probably a good idea, too, to keep it simple.  Pre-1.0 can go away.

Paolo

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal

2016-09-19 Thread Pierre Pfister (ppfister)
Hello Michael, 

Thanks for doing this. ;-)


> Le 16 sept. 2016 à 00:39, Michael S. Tsirkin  a écrit :
> 
> 
> Let's start a discussion around an alternative ring layout.
> This has been in my kvm forum 2016 presentation.
> The idea is to have a r/w descriptor in a ring structure,
> replacing the used and available ring, index and descriptor
> buffer.
> 
> * Descriptor ring:
> 
> Guest adds descriptors with unique index values and DESC_HW set in flags.
> Host overwrites used descriptors with correct len, index, and DESC_HW
> clear.  Flags are always set/cleared last.
> 
> #define DESC_HW 0x0080
> 
> struct desc {
>__le64 addr;
>__le32 len;
>__le16 index;
>__le16 flags;
> };
> 
> When DESC_HW is set, descriptor belongs to device. When it is clear,
> it belongs to the driver.

I think the simplified ring layout is a good idea. 
It reduces the number of indirections, and will probably help pre-fetching data 
in order to avoid cache misses.

I am not completely convinced by the flag-based approach instead of the 
index-based approach though.
With the index-based approach, you have only one read to do in order
to know up to what point in the queue you can go.

With this flag, every time you look at a new descriptor, before doing anything 
you will have to look at the flags.
Meaning that you will have to wait for the result of a test on a very newly 
fetched piece of memory.

So maybe it would be interesting to still have a used_index and available_index 
(put in different cache lines).
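A minimal sketch of that layout, assuming 64-byte cache lines; the
struct is illustrative only, not part of the proposal.

#include <stdint.h>

/* producer and consumer each update only their own index; padding
 * them to separate cache lines avoids bouncing one line between the
 * two sides */
struct ring_indices {
    uint16_t available_index __attribute__((aligned(64)));
    uint16_t used_index      __attribute__((aligned(64)));
};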

> 
> 
> * Scatter/gather support
> 
> We can use 3 bits to set direction
> and to chain s/g entries in a request, same as virtio 1.0:
> 
> /* This marks a buffer as continuing via the next field. */
> #define VRING_DESC_F_NEXT   1
> /* This marks a buffer as write-only (otherwise read-only). */
> #define VRING_DESC_F_WRITE  2
> 
> Unlike virtio 1.0, all descriptors must have distinct ID
> values.
> 
> * Indirect buffers
> 
> Can be done like in virtio 1.0:
> 
> /* This means the buffer contains a list of buffer descriptors. */
> #define VRING_DESC_F_INDIRECT   4
> 
> In the descriptors in the indirect buffers, I think we should drop index
> field altogether, just put s/g entries one after the other.
> 
> Also, length in the indirect descriptor should mark
> the length of the chain.
> 
> virtio 1.0 seems to allow a s/g entry followed by
> an indirect descriptor. This does not seem useful,
> so let's not allow that anymore.

First of all, if I understand correctly, VRING_DESC_F_NEXT doesn't use a 'next' 
field, and rather just
says that the next descriptor in the queue is part of the same message. I think 
this is really, really, good.

Paolo proposed in a later mail to drop VRING_DESC_F_NEXT altogether, and I think
he is right at least in the sense that having two solutions for doing almost 
the same thing 
is not necessary (IMHO, even harmful).

But, VRING_DESC_F_INDIRECT adds a layer of indirection and makes prefetching a 
bit
more complicated.
The reason why VRING_DESC_F_INDIRECT was useful in the first place is that 
virtio 1.0 
didn't have enough descriptors in the main queue (it is equal to the queue 
size).

Therefore, ideally, I'd rather have a longer queue and use VRING_DESC_F_NEXT 
only and drop
VRING_DESC_F_INDIRECT altogether.

And if delay spent in the queue is an issue, I think it's the driver's job to avoid 
bufferbloat if necessary 
by adapting the number of descriptors it allows the device to use.

> 
> 
> * Batching descriptors:
> 
> virtio 1.0 allows passing a batch of descriptors in both directions, by
> incrementing the used/avail index by values > 1.  We can support this by
> tagging first/middle/last descriptors in the flags field. For
> compatibility, polarity is reversed so bit is set for non-first and
> non-last descriptors in a batch:
> 
> #define BATCH_NOT_FIRST 0x0010
> #define BATCH_NOT_LAST  0x0020
> 
> In other words, the only descriptor in a single-descriptor batch has neither bit set.
> 
> Batch does not have to be processed together, so
> !first/!last flags can be changed when descriptor is used.
> 

I don't exactly understand why this is necessary.
Is there any added value by knowing that a bunch of messages have been written
at the same time?

As mentioned above, I'd rather keep the index approach instead of those flags.
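For reference, the tagging described in the quoted text would look
roughly as below for a batch of n descriptors; the publication order
of DESC_HW is elided, and the function itself is an assumption, not
from the proposal.

#include <stdint.h>

#define DESC_HW         0x0080
#define BATCH_NOT_FIRST 0x0010
#define BATCH_NOT_LAST  0x0020

/* a lone descriptor (first and last at once) gets neither batch bit,
 * which keeps the reversed polarity compatible with non-batched use */
static void tag_batch(uint16_t *flags, uint16_t n)
{
    for (uint16_t j = 0; j < n; j++) {
        uint16_t f = DESC_HW;
        if (j != 0)
            f |= BATCH_NOT_FIRST;
        if (j != n - 1)
            f |= BATCH_NOT_LAST;
        flags[j] = f;
    }
}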

> 
> 
> * Processing descriptors in and out of order
> 
> Device processing all descriptors in order can simply flip
> the DESC_HW bit as it is done with each descriptor.
> 
> Device can process descriptors out of order, and write them out in order
> as they are used, overwriting descriptors that are there.
> 
> Device must not use a descriptor until DESC_HW is set.
> It is only required to look at the first descriptor
> submitted, but it is allowed to look anywhere
> in the ring. This might allow parallel processing.
> 
> Driver must not overwrite a descriptor until DESC_HW is clear.
> It is only required to look at the first descriptor
> submitted, but it is allowed to look anywhere in the ring.
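A minimal sketch of the in-order device side described above, using
the struct desc from the proposal; process_buffer and the barrier
primitive are stand-ins, not part of the proposal.

#include <stdint.h>

#define DESC_HW 0x0080

struct desc {
    uint64_t addr;
    uint32_t len;
    uint16_t index;
    uint16_t flags;
};

extern uint32_t process_buffer(uint64_t addr, uint32_t len); /* stand-in */

/* consume descriptors in ring order; flags are written last so the
 * driver never observes a half-updated descriptor */
static void device_poll(struct desc *ring, uint16_t size, uint16_t *head)
{
    while (ring[*head & (size - 1)].flags & DESC_HW) {
        struct desc *d = &ring[*head & (size - 1)];
        d->len = process_buffer(d->addr, d->len); /* bytes written */
        __sync_synchronize();   /* order len update before flags */
        d->flags &= ~DESC_HW;   /* hand the descriptor back */
        (*head)++;
    }
}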

Re: [virtio-dev] packed ring layout proposal

2016-09-18 Thread Paolo Bonzini

> Without indirect the usable ring size is cut in half on older kernels that
> don't support any layout. That limit on qemu ends up being 128 packets

Hi Stephen,

note that here we were talking about limiting direct descriptors to
requests with a single buffer.

Paolo

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal

2016-09-17 Thread Michael S. Tsirkin
On Sat, Sep 17, 2016 at 08:54:06AM +0200, Maxime Coquelin wrote:
> 
> 
> On 09/16/2016 10:03 PM, Michael S. Tsirkin wrote:
> > On Fri, Sep 16, 2016 at 12:11:59PM +0200, Paolo Bonzini wrote:
> > > 
> > > 
> > > On 16/09/2016 11:46, Stefan Hajnoczi wrote:
> > > > > > The idea is to have a r/w descriptor in a ring structure,
> > > > > > replacing the used and available ring, index and descriptor
> > > > > > buffer.
> > > > This is probably obvious but:
> > > > 
> > > > Why is the desc.index field needed?
> > > > 
> > > > The index field would be needed if the driver had to jump over a DESC_HW
> > > > descriptor when adding descriptors, but that doesn't seem necessary
> > > > since the device is supposed to overwrite descriptors in ascending ring
> > > > order when they complete out-of-order.
> > > 
> > > AIUI, when reading completed descriptors, the driver uses the index in
> > > order to map the descriptor to the original request.
> > > 
> > > It's more of an id; it's certainly not a linked list as in virtio 1.0 
> > > rings.
> > > 
> > > I think we can refine the description of the index, saying that only the
> > > first descriptor in a request needs an index; descriptors reached via
> > > VRING_DESC_F_NEXT do not need an index.  The advantage is that the
> > > device only needs to remember the index of the first descriptor.
> > > 
> > > However...
> > > 
> > > - is there any reason for the device or driver to even look at DESC_HW
> > > if VRING_DESC_F_NEXT is set?  DESC_HW only matters on the first buffer.
> > > 
> > > - why not drop VRING_DESC_F_NEXT altogether?
> > > 
> > > Because only two cases are strictly necessary:
> > > 
> > > - single-buffer requests.  These have some very important cases:
> > > virtio-rng, virtio-serial, virtio-net with merged rxbufs, virtio-scsi
> > > and virtio-vsock events, virtio-balloon, virtio-input all use
> > > single-buffer requests exclusively.
> > > 
> > > - indirect requests, where the length of the indirect descriptor can be
> > > used to figure out the number of buffers
> > > 
> > > Let's just drop the added complication of multi-buffer direct requests.
> > > Yes, this can cause one extra cache miss but 1) as shown above a lot of
> > > performance-critical cases use single-buffer requests 2) no one has ever
> > > bothered to use multi-buffer direct requests in Linux drivers, at least
> > > when indirect descriptors are available.
> > 
> > True. I note that dpdk used direct descriptors exclusively until
> > very recently.
> > commit 6dc5de3a6aefba3946fe04368d93994db3f7a5fd
> > Author: Stephen Hemminger 
> > Date:   Fri Mar 4 10:19:19 2016 -0800
> > 
> > virtio: use indirect ring elements
> 
> Indeed, and I'd like to add that this patch doesn't enable
> VIRTIO_RING_F_INDIRECT_DESC, so code was here but not used in mainline.
> 
> It has been fixed a few weeks ago, but the patch didn't land yet into
> master, only in virtio-next[0]:
> 
> commit b19d3763de3f6c96380b6decf51752acebd2acee
> Author: Pierre Pfister 
> Date:   Wed Sep 7 10:46:18 2016 +0800
> 
> net/virtio: enable indirect descriptors feature
> 
> 
> > I'd like to make sure removing these doesn't cause regressions.
> > Stephen, could you please confirm that
> > 1. with your patch, each packet takes at most 1 slot
> > 2. your patch was a win for most workloads
> >(as opposed to e.g. enabling/disabling this depending on workload)
> 
> > 
> > Also, it does not look like vhost in dpdk supports
> > indirect descriptors.
> 
> I have done the patch for vhost in dpdk[1] two months ago.
> It's not merged because I didn't show noticeable gain using it. Note
> that I didn't show noticeable loss either.
> I'll try to re-run some benchmarks next week.
> 
> Maxime

You will likely see a gain if you run linux with a ton of tcp sockets
within the guest.


> [0]: git://dpdk.org/next/dpdk-next-virtio
> [1]: http://dpdk.org/dev/patchwork/patch/14797/

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal

2016-09-17 Thread Michael S. Tsirkin
On Sat, Sep 17, 2016 at 06:13:31AM -0700, Stephen Hemminger wrote:
> Without indirect the usable ring size is cut in half on older kernels that
> don't support any layout. That limit on qemu ends up being 128 packets
> 

Which test would you use to measure a performance gain?

-- 
MST

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal

2016-09-16 Thread Maxime Coquelin



On 09/16/2016 10:03 PM, Michael S. Tsirkin wrote:
> On Fri, Sep 16, 2016 at 12:11:59PM +0200, Paolo Bonzini wrote:
> > On 16/09/2016 11:46, Stefan Hajnoczi wrote:
> > > > The idea is to have a r/w descriptor in a ring structure,
> > > > replacing the used and available ring, index and descriptor
> > > > buffer.
> > > 
> > > This is probably obvious but:
> > > 
> > > Why is the desc.index field needed?
> > > 
> > > The index field would be needed if the driver had to jump over a DESC_HW
> > > descriptor when adding descriptors, but that doesn't seem necessary
> > > since the device is supposed to overwrite descriptors in ascending ring
> > > order when they complete out-of-order.
> > 
> > AIUI, when reading completed descriptors, the driver uses the index in
> > order to map the descriptor to the original request.
> > 
> > It's more of an id; it's certainly not a linked list as in virtio 1.0 rings.
> > 
> > I think we can refine the description of the index, saying that only the
> > first descriptor in a request needs an index; descriptors reached via
> > VRING_DESC_F_NEXT do not need an index.  The advantage is that the
> > device only needs to remember the index of the first descriptor.
> > 
> > However...
> > 
> > - is there any reason for the device or driver to even look at DESC_HW
> > if VRING_DESC_F_NEXT is set?  DESC_HW only matters on the first buffer.
> > 
> > - why not drop VRING_DESC_F_NEXT altogether?
> > 
> > Because only two cases are strictly necessary:
> > 
> > - single-buffer requests.  These have some very important cases:
> > virtio-rng, virtio-serial, virtio-net with merged rxbufs, virtio-scsi
> > and virtio-vsock events, virtio-balloon, virtio-input all use
> > single-buffer requests exclusively.
> > 
> > - indirect requests, where the length of the indirect descriptor can be
> > used to figure out the number of buffers
> > 
> > Let's just drop the added complication of multi-buffer direct requests.
> > Yes, this can cause one extra cache miss but 1) as shown above a lot of
> > performance-critical cases use single-buffer requests 2) no one has ever
> > bothered to use multi-buffer direct requests in Linux drivers, at least
> > when indirect descriptors are available.
> 
> True. I note that dpdk used direct descriptors exclusively until
> very recently.
> commit 6dc5de3a6aefba3946fe04368d93994db3f7a5fd
> Author: Stephen Hemminger 
> Date:   Fri Mar 4 10:19:19 2016 -0800
> 
> virtio: use indirect ring elements

Indeed, and I'd like to add that this patch doesn't enable
VIRTIO_RING_F_INDIRECT_DESC, so code was here but not used in mainline.

It has been fixed a few weeks ago, but the patch didn't land yet into
master, only in virtio-next[0]:

commit b19d3763de3f6c96380b6decf51752acebd2acee
Author: Pierre Pfister 
Date:   Wed Sep 7 10:46:18 2016 +0800

net/virtio: enable indirect descriptors feature

> I'd like to make sure removing these doesn't cause regressions.
> Stephen, could you please confirm that
> 1. with your patch, each packet takes at most 1 slot
> 2. your patch was a win for most workloads
>    (as opposed to e.g. enabling/disabling this depending on workload)
> 
> Also, it does not look like vhost in dpdk supports
> indirect descriptors.

I have done the patch for vhost in dpdk[1] two months ago.
It's not merged because I didn't show noticeable gain using it. Note
that I didn't show noticeable loss either.
I'll try to re-run some benchmarks next week.

Maxime

[0]: git://dpdk.org/next/dpdk-next-virtio
[1]: http://dpdk.org/dev/patchwork/patch/14797/


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal

2016-09-16 Thread Michael S. Tsirkin
On Fri, Sep 16, 2016 at 12:11:59PM +0200, Paolo Bonzini wrote:
> 
> 
> On 16/09/2016 11:46, Stefan Hajnoczi wrote:
> >> > The idea is to have a r/w descriptor in a ring structure,
> >> > replacing the used and available ring, index and descriptor
> >> > buffer.
> > This is probably obvious but:
> > 
> > Why is the desc.index field needed?
> > 
> > The index field would be needed if the driver had to jump over a DESC_HW
> > descriptor when adding descriptors, but that doesn't seem necessary
> > since the device is supposed to overwrite descriptors in ascending ring
> > order when they complete out-of-order.
> 
> AIUI, when reading completed descriptors, the driver uses the index in
> order to map the descriptor to the original request.
> 
> It's more of an id; it's certainly not a linked list as in virtio 1.0 rings.

Yes.

> I think we can refine the description of the index, saying that only the
> first descriptor in a request needs an index; descriptors reached via
> VRING_DESC_F_NEXT do not need an index.  The advantage is that the
> device only needs to remember the index of the first descriptor.

OTOH the driver will have to remember the id of the 1st
descriptor in the chain. Maybe require that all
descriptors in a chain have the same id. This seems better to me.
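For instance, a two-descriptor chain under that "same id" rule could
be written as below; struct desc is as in the proposal, while the
helper, the id value, and the elided index wrapping are illustrative
assumptions (a real driver would also publish the first entry's flags
last).

#include <stdint.h>

#define DESC_HW           0x0080
#define VRING_DESC_F_NEXT 1

struct desc {
    uint64_t addr;
    uint32_t len;
    uint16_t index;
    uint16_t flags;
};

/* both entries of the request carry the same id, so the device only
 * has to report that one id on completion */
static void add_chain(struct desc *ring, uint16_t head, uint16_t id,
                      uint64_t hdr_pa, uint32_t hdr_len,
                      uint64_t data_pa, uint32_t data_len)
{
    ring[head + 1] = (struct desc){ data_pa, data_len, id, DESC_HW };
    ring[head] = (struct desc){ hdr_pa, hdr_len, id,
                                DESC_HW | VRING_DESC_F_NEXT };
}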

> However...
> 
> - is there any reason for the device or driver to even look at DESC_HW
> if VRING_DESC_F_NEXT is set?  DESC_HW only matters on the first buffer.
> 
> - why not drop VRING_DESC_F_NEXT altogether?
> 
> Because only two cases are strictly necessary:
> 
> - single-buffer requests.  These have some very important cases:
> virtio-rng, virtio-serial, virtio-net with merged rxbufs, virtio-scsi
> and virtio-vsock events, virtio-balloon, virtio-input all use
> single-buffer requests exclusively.
> 
> - indirect requests, where the length of the indirect descriptor can be
> used to figure out the number of buffers
> 
> Let's just drop the added complication of multi-buffer direct requests.
> Yes, this can cause one extra cache miss but 1) as shown above a lot of
> performance-critical cases use single-buffer requests 2) no one has ever
> bothered to use multi-buffer direct requests in Linux drivers, at least
> when indirect descriptors are available.
> 
> This simplifies the device logic like this
> 
> read lookahead descriptor from ring
> while (DESC_HW set in descriptor) {
> advance internal used index
> if VRING_DESC_F_INDIRECT
> collect len/16 descriptors from address
> else
> use (len,addr,flags & VRING_DESC_F_WRITE) as a single buffer
> process request
> read lookahead descriptor from ring
> }
> 
> and on completion
> 
> if (next descriptor has DESC_HW set, i.e. ring full) {
> set broken flag
> } else {
> write addr,index from original descriptor into addr,index
> write number of written bytes into len
> wmb
> write flags from original descriptor & ~DESC_HW into flags
> }
> 
> Paolo
> 




-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal

2016-09-16 Thread Michael S. Tsirkin
On Fri, Sep 16, 2016 at 10:46:52AM +0100, Stefan Hajnoczi wrote:
> On Fri, Sep 16, 2016 at 01:39:15AM +0300, Michael S. Tsirkin wrote:
> > The idea is to have a r/w descriptor in a ring structure,
> > replacing the used and available ring, index and descriptor
> > buffer.
> 
> This is probably obvious but:
> 
> Why is the desc.index field needed?
> 
> The index field would be needed if the driver had to jump over a DESC_HW
> descriptor when adding descriptors, but that doesn't seem necessary
> since the device is supposed to overwrite descriptors in ascending ring
> order when they complete out-of-order.
> 
> Stefan


This should be desc.id. The point is to be able to
figure out which descriptor was used. I'll rename it.



-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal

2016-09-16 Thread Michael S. Tsirkin
On Fri, Sep 16, 2016 at 12:11:59PM +0200, Paolo Bonzini wrote:
> 
> 
> On 16/09/2016 11:46, Stefan Hajnoczi wrote:
> >> > The idea is to have a r/w descriptor in a ring structure,
> >> > replacing the used and available ring, index and descriptor
> >> > buffer.
> > This is probably obvious but:
> > 
> > Why is the desc.index field needed?
> > 
> > The index field would be needed if the driver had to jump over a DESC_HW
> > descriptor when adding descriptors, but that doesn't seem necessary
> > since the device is supposed to overwrite descriptors in ascending ring
> > order when they complete out-of-order.
> 
> AIUI, when reading completed descriptors, the driver uses the index in
> order to map the descriptor to the original request.
> 
> It's more of an id; it's certainly not a linked list as in virtio 1.0 rings.
> 
> I think we can refine the description of the index, saying that only the
> first descriptor in a request needs an index; descriptors reached via
> VRING_DESC_F_NEXT do not need an index.  The advantage is that the
> device only needs to remember the index of the first descriptor.
> 
> However...
> 
> - is there any reason for the device or driver to even look at DESC_HW
> if VRING_DESC_F_NEXT is set?  DESC_HW only matters on the first buffer.
> 
> - why not drop VRING_DESC_F_NEXT altogether?
> 
> Because only two cases are strictly necessary:
> 
> - single-buffer requests.  These have some very important cases:
> virtio-rng, virtio-serial, virtio-net with merged rxbufs, virtio-scsi
> and virtio-vsock events, virtio-balloon, virtio-input all use
> single-buffer requests exclusively.
> 
> - indirect requests, where the length of the indirect descriptor can be
> used to figure out the number of buffers
> 
> Let's just drop the added complication of multi-buffer direct requests.
> Yes, this can cause one extra cache miss but 1) as shown above a lot of
> performance-critical cases use single-buffer requests 2) no one has ever
> bothered to use multi-buffer direct requests in Linux drivers, at least
> when indirect descriptors are available.

True. I note that dpdk used direct descriptors exclusively until
very recently.
commit 6dc5de3a6aefba3946fe04368d93994db3f7a5fd
Author: Stephen Hemminger 
Date:   Fri Mar 4 10:19:19 2016 -0800

virtio: use indirect ring elements

I'd like to make sure removing these doesn't cause regressions.
Stephen, could you please confirm that
1. with your patch, each packet takes at most 1 slot
2. your patch was a win for most workloads
   (as opposed to e.g. enabling/disabling this depending on workload)


Also, it does not look like vhost in dpdk supports
indirect descriptors.


> This simplifies the device logic like this
> 
> read lookahead descriptor from ring
> while (DESC_HW set in descriptor) {
> advance internal used index
> if VRING_DESC_F_INDIRECT
> collect len/16 descriptors from address
> else
> use (len,addr,flags & VRING_DESC_F_WRITE) as a single buffer
> process request
> read lookahead descriptor from ring
> }
> 
> and on completion
> 
> if (next descriptor has DESC_HW set, i.e. ring full) {
> set broken flag
> } else {
> write addr,index from original descriptor into addr,index
> write number of written bytes into len
> wmb
> write flags from original descriptor & ~DESC_HW into flags
> }
> 
> Paolo
> 




-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal

2016-09-16 Thread Michael S. Tsirkin
On Fri, Sep 16, 2016 at 10:28:19AM +0100, Stefan Hajnoczi wrote:
> On Fri, Sep 16, 2016 at 01:39:15AM +0300, Michael S. Tsirkin wrote:
> > Let's start a discussion around an alternative ring layout.
> > This has been in my kvm forum 2016 presentation.
> > The idea is to have a r/w descriptor in a ring structure,
> > replacing the used and available ring, index and descriptor
> > buffer.
> 
> Do you have pseudo-code?  I find it difficult to infer how it's supposed
> to work from the description.

Find a prototype implementation under tools/virtio/ringtest/ring.c in
latest Linux.


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] packed ring layout proposal

2016-09-16 Thread Paolo Bonzini


On 16/09/2016 11:46, Stefan Hajnoczi wrote:
>> > The idea is to have a r/w descriptor in a ring structure,
>> > replacing the used and available ring, index and descriptor
>> > buffer.
> This is probably obvious but:
> 
> Why is the desc.index field needed?
> 
> The index field would be needed if the driver had to jump over a DESC_HW
> descriptor when adding descriptors, but that doesn't seem necessary
> since the device is supposed to overwrite descriptors in ascending ring
> order when they complete out-of-order.

AIUI, when reading completed descriptors, the driver uses the index in
order to map the descriptor to the original request.

It's more of an id; it's certainly not a linked list as in virtio 1.0 rings.

I think we can refine the description of the index, saying that only the
first descriptor in a request needs an index; descriptors reached via
VRING_DESC_F_NEXT do not need an index.  The advantage is that the
device only needs to remember the index of the first descriptor.

However...

- is there any reason for the device or driver to even look at DESC_HW
if VRING_DESC_F_NEXT is set?  DESC_HW only matters on the first buffer.

- why not drop VRING_DESC_F_NEXT altogether?

Because only two cases are strictly necessary:

- single-buffer requests.  These have some very important cases:
virtio-rng, virtio-serial, virtio-net with merged rxbufs, virtio-scsi
and virtio-vsock events, virtio-balloon, virtio-input all use
single-buffer requests exclusively.

- indirect requests, where the length of the indirect descriptor can be
used to figure out the number of buffers

Let's just drop the added complication of multi-buffer direct requests.
Yes, this can cause one extra cache miss but 1) as shown above a lot of
performance-critical cases use single-buffer requests 2) no one has ever
bothered to use multi-buffer direct requests in Linux drivers, at least
when indirect descriptors are available.

This simplifies the device logic like this

read lookahead descriptor from ring
while (DESC_HW set in descriptor) {
advance internal used index
if VRING_DESC_F_INDIRECT
collect len/16 descriptors from address
else
use (len,addr,flags & VRING_DESC_F_WRITE) as a single buffer
process request
read lookahead descriptor from ring
}

and on completion

if (next descriptor has DESC_HW set, i.e. ring full) {
set broken flag
} else {
write addr,index from original descriptor into addr,index
write number of written bytes into len
wmb
write flags from original descriptor & ~DESC_HW into flags
}

Paolo





Re: [virtio-dev] packed ring layout proposal

2016-09-16 Thread Stefan Hajnoczi
On Fri, Sep 16, 2016 at 01:39:15AM +0300, Michael S. Tsirkin wrote:
> The idea is to have a r/w descriptor in a ring structure,
> replacing the used and available ring, index and descriptor
> buffer.

This is probably obvious but:

Why is the desc.index field needed?

The index field would be needed if the driver had to jump over a DESC_HW
descriptor when adding descriptors, but that doesn't seem necessary
since the device is supposed to overwrite descriptors in ascending ring
order when they complete out-of-order.

Stefan




Re: [virtio-dev] packed ring layout proposal

2016-09-16 Thread Stefan Hajnoczi
On Fri, Sep 16, 2016 at 01:39:15AM +0300, Michael S. Tsirkin wrote:
> Let's start a discussion around an alternative ring layout.
> This has been in my kvm forum 2016 presentation.
> The idea is to have a r/w descriptor in a ring structure,
> replacing the used and available ring, index and descriptor
> buffer.

Do you have pseudo-code?  I find it difficult to infer how it's supposed
to work from the description.




Re: [virtio-dev] packed ring layout proposal

2016-09-15 Thread Paolo Bonzini


On 16/09/2016 00:39, Michael S. Tsirkin wrote:
> #define DESC_HW 0x0080

0x8000 seems easier to spot when looking at dumps.

> virtio 1.0 allows passing a batch of descriptors in both directions, by
> incrementing the used/avail index by values > 1.  We can support this by
> tagging first/middle/last descriptors in the flags field. For
> compatibility, polarity is reversed so bit is set for non-first and
> non-last descriptors in a batch:
> 
> #define BATCH_NOT_FIRST 0x0010
> #define BATCH_NOT_LAST  0x0020

What are the intended uses of this in the device and driver?  (Spoiler:
I think it's unnecessary)

- I cannot think of the desired meaning of BATCH_NOT_FIRST, it seems
like it shouldn't have any effect.

- For BATCH_NOT_LAST, you might not know that another descriptor will
have come at the time of the kick.  So, when the second descriptor in
the batch comes you have to go back and add BATCH_NOT_LAST to the
previous descriptor.  This is problematic because:

  * if you had already flipped DESC_HW, you aren't allowed to do that

  * if you do not flip DESC_HW as early as possible you have introduced
  additional latency for a polling-mode device

So BATCH_NOT_LAST is not necessary either, *but* when the device finds a
terminating descriptor (!VRING_DESC_F_NEXT or VRING_DESC_F_INDIRECT) the
device must look one element ahead in the ring to see if there's another
DESC_HW descriptor.  But hey, such lookahead is not any more inefficient
than separate memory locations for used index so it should be okay.  The
logic would be:

read lookahead descriptor from ring
while (DESC_HW set in descriptor) {
do {
advance internal used index
if (not processing indirect desc. and VRING_DESC_F_INDIRECT)
prepare to read first indirect descriptor
else if (inside indirect descriptor) {
check that VRING_DESC_F_INDIRECT is clear
check that descriptor is valid
add descriptor to request
/* next test done based on length field of indirect
   descriptor */
if (last descriptor)
break
} else {
/* for polling mode device: wait until DESC_HW is set */
check that DESC_HW is set
check that descriptor is valid
add descriptor to request
if (!VRING_DESC_F_NEXT)
break
}
read next descriptor
}
process request
read lookahead descriptor from ring
}

BATCH_NOT_LAST is not necessary in the device->driver direction either.
And in this case, the driver knows the index of the last descriptor it
wrote, so it can avoid the lookahead.
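A minimal sketch of that driver-side reclaim without lookahead; the
struct desc is the proposal's, while last_used/last_added and the
completion hook are assumed bookkeeping, not from the thread.

#include <stdint.h>

#define DESC_HW 0x0080

struct desc {
    uint64_t addr;
    uint32_t len;
    uint16_t index;
    uint16_t flags;
};

extern void complete_request(uint16_t id);   /* stand-in */

/* stop at last_added, the highest slot the driver ever published, so
 * no lookahead past its own writes is needed */
static void reclaim_used(struct desc *ring, uint16_t size,
                         uint16_t *last_used, uint16_t last_added)
{
    while (*last_used != last_added &&
           !(ring[*last_used & (size - 1)].flags & DESC_HW)) {
        complete_request(ring[*last_used & (size - 1)].index);
        (*last_used)++;
    }
}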


I thought of adding BATCH_LAST for the driver->device direction only.
It would only be valid for a terminating descriptor, and it would tell
the device "Once I set DESC_HW in the following descriptor, there's
going to be a virtqueue kick".  BATCH_LAST would eliminate lookahead in
the device.  However, when the virtqueue kick is processed
asynchronously (this is the case when using KVM ioeventfd) it has a
great potential for introducing races.  So I don't like it either and
discourage introducing it.


> Also, length in the indirect descriptor should mark
> the length of the chain.
> 
> virtio 1.0 seems to allow a s/g entry followed by
> an indirect descriptor. This does not seem useful,
> so let's not allow that anymore.

Agreed.

> In the descriptors in the indirect buffers, I think we should drop index
> field altogether, just put s/g entries one after the other.

That causes worse packing and alignment (14 bytes per entry), so I'm not
sure it's a good idea.
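(Worked out, assuming 64-byte cache lines: a full entry is
8 + 4 + 2 + 2 = 16 bytes, so four entries fill a line with the 8-byte
addr field always naturally aligned; dropping the 2-byte index gives
14-byte entries, which leaves addr misaligned in all but every fourth
entry.)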

Also, DESC_HW is not used by indirect descriptors, right?

Paolo

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org