Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload

2019-02-11 Thread Jason Wang



On 2019/2/12 1:08 PM, Michael S. Tsirkin wrote:

On Mon, Feb 11, 2019 at 02:58:30PM +0000, David Riddoch wrote:

This can result in a very high rate of doorbells with some
drivers, which can become a severe bottleneck (because x86 CPUs can't emit
MMIOs at very high rates).

Interesting. Is there any data you could share to help guide the design?
E.g. what's the highest rate of MMIO writes supported etc?

On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.

On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s.  On
PCIe-remote socket ~8M/s.

Is that mega bytes? Or million writes?
With 32 bit writes are we limited to 2M then?

Sorry, this is million-writes-per-second.  The size of the write doesn't
matter.

So it's not too bad, you only need one per batch after all.


Also I thought some more about this. In fact on x86/Intel specifically,
PCI descriptor reads are atomic with cache-line granularity
and writes are ordered. So in fact you can safely prefetch
descriptors and if you see them valid, you can go ahead
and use them.
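To make this concrete, here is a minimal device-side sketch in C of the
speculative prefetch being described, assuming the packed-ring descriptor
layout from the virtio 1.1 spec; the helper names and the batch policy are
illustrative, not from any spec or implementation:

#include <stdint.h>
#include <stdbool.h>

#define VRING_PACKED_DESC_F_AVAIL (1 << 7)
#define VRING_PACKED_DESC_F_USED  (1 << 15)

struct vring_packed_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* A descriptor is driver-owned (available) when avail != used and
 * avail equals the current ring wrap counter. */
static bool desc_is_avail(const struct vring_packed_desc *d, bool wrap)
{
    bool avail = d->flags & VRING_PACKED_DESC_F_AVAIL;
    bool used  = d->flags & VRING_PACKED_DESC_F_USED;
    return (avail != used) && (avail == wrap);
}

/* Speculatively fetch up to 'batch' descriptors starting at 'head'.
 * Per the discussion above, on x86 reads within a cache line are
 * atomic and stores are ordered, so any descriptor that looks valid
 * can be used immediately.  Returns the count of valid entries. */
static int prefetch_descs(const struct vring_packed_desc *ring,
                          uint16_t head, uint16_t size, bool wrap,
                          int batch)
{
    int n;
    for (n = 0; n < batch; n++) {
        uint16_t i = (head + n) % size;
        if (!desc_is_avail(&ring[i], wrap))
            break;              /* stop at first not-yet-valid entry */
        if (i == size - 1)
            wrap = !wrap;       /* wrap counter flips at ring end */
    }
    return n;
}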



If I understand this correctly: unless the driver sets avail in order for 
several descriptors, the device cannot assume that if N is avail then [0, 
N) are also available.


But prefetching does indeed improve performance when I experiment with the 
vhost kernel implementation. It reduces the overhead of memory accesses, 
and the situation should be similar for PCI.





This makes the descriptor index merely a hint for performance,
which device can play with at will.


Other platforms are not like this so you need the data
but do they have the same problem?






This doesn't just impose a limit on aggregate packet rate: If you hit this
bottleneck then the CPU core is back-pressured by the MMIO writes, and so
instr/cycle takes a huge hit.


The proposed offset/wrap field allows devices to disable doorbells when
appropriate, and determine the latest fill level via a PCIe read.

This kind of looks like a split ring though, does it not?

I don't agree, because the ring isn't split.  Split-ring is very painful for
hardware because of the potentially random accesses to the descriptor table
(including pointer chasing) which inhibit batching of descriptor fetches, as
well as causing complexity in implementation.

That's different with IN_ORDER, right?

Yes, IN_ORDER will give some of the benefit of packed ring, and is also a
win because you can merge 'used' writes.  (But packed-ring still wins due to
not having to fetch ring and descriptor table separately).
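As an illustration of merging 'used' writes, here is a sketch of what an
IN_ORDER packed-ring device might do, reusing the descriptor struct and
flag bits from the sketch above; the function name is made up and the
skipped-slot accounting is omitted for brevity:

/* With VIRTIO_F_IN_ORDER the device can retire a whole batch with one
 * used-descriptor write: it writes the buffer id of the *last* buffer
 * in the batch, and the driver infers that all earlier buffers are
 * also used. */
static void mark_batch_used(struct vring_packed_desc *ring,
                            uint16_t slot, uint16_t last_buffer_id,
                            uint32_t written_len, bool device_wrap)
{
    uint16_t flags = device_wrap ?
        (VRING_PACKED_DESC_F_AVAIL | VRING_PACKED_DESC_F_USED) : 0;

    ring[slot].id  = last_buffer_id;   /* implies all prior ids used */
    ring[slot].len = written_len;
    /* Make id/len visible before the flags write that publishes them. */
    __atomic_thread_fence(__ATOMIC_RELEASE);
    ring[slot].flags = flags;
}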



Note that with IN_ORDER there's no need to read the available ring at all, 
even for split ring. And in some cases there's no need to update the used 
ring at all.






Packed ring is a huge improvement on both dimensions, but introduces a
specific pain point for hardware offload.


The issue is
we will again need to bounce more cache lines to communicate.

You'll only bounce cache lines if the device chooses to read the idx.  A PV
device need not offer this feature.  A PCIe device will, but the cost to the
driver of writing an in-memory idx is much smaller than the cost of an MMIO,
which is always a strongly ordered barrier on x86/64.

With vDPA you ideally would have this feature enabled, and the device would
sometimes be PV and sometimes PCIe.  The PV device would ignore the new idx
field and so cache lines would not bounce then either.

Ie. The only time cache lines are shared is when sharing with a PCIe device,
which is the scenario when this is a win.
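A sketch of the driver-side publish path being argued for, assuming the
proposed avail idx (offset/wrap) field lives in the driver area and is
gated by a feature bit; the struct and function names here are
hypothetical:

#include <stdint.h>
#include <stdbool.h>

struct pvirtq_driver_area {
    uint16_t off_wrap;     /* existing event suppression field */
    uint16_t flags;        /* existing event suppression field */
    uint16_t avail_idx;    /* proposed: descriptors made available,
                              with the wrap counter in the top bit */
};

static void publish_avail(struct pvirtq_driver_area *da,
                          volatile uint16_t *doorbell,
                          uint16_t new_avail_idx,
                          bool device_wants_kick)
{
    /* Order the descriptor writes before the idx the device may read. */
    __atomic_thread_fence(__ATOMIC_RELEASE);
    da->avail_idx = new_avail_idx;   /* cheap cacheable write, always */

    if (device_wants_kick)
        *doorbell = new_avail_idx;   /* expensive MMIO, only if enabled */
}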

True. OTOH on non-x86 you will need more write barriers :( It would be
natural to say driver can reuse the barrier before flags update, but note
that that would mean hardware still needs to support re-fetching if
flags value is wrong. Such re-fetch is probably rare so fine by me, but it
does add a bit more complexity.

I would prefer to have the write barrier before writing the idx.

Well that's driver overhead for something device might never utilise in
a given workload. If we are optimizing let's optimize for speed.


Note that drivers could take advantage of this feature to avoid read
barriers when consuming descriptors: At the moment there is a virtio_rmb()
per descriptor read.  With the proposed feature the driver can do just one
barrier after reading idx.
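For example, a hedged sketch of that barrier saving, assuming the device
also writes a used index the driver can poll; virtio_rmb() stands in for
the kernel's read barrier, and the counter handling is simplified:

#include <stdint.h>

extern void virtio_rmb(void);  /* read barrier, as in the Linux kernel */

/* 'device_used_idx' is the (hypothetical) index the device writes;
 * 'last_seen' is a free-running driver-side counter; complete() is
 * whatever per-buffer completion work the driver does. */
static void consume_used(const volatile uint16_t *device_used_idx,
                         uint16_t *last_seen,
                         void (*complete)(uint16_t idx))
{
    uint16_t idx = *device_used_idx;  /* single read of the index */
    virtio_rmb();                     /* one barrier for the batch */
    while (*last_seen != idx)         /* no per-descriptor barrier */
        complete((*last_seen)++);
}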

Oh you want device to write a used index too? Hmm. Isn't this
contrary to your efforts to consume PCIe BW?


I expect that on platforms where write barriers
have a cost read barriers likely have a significant cost too, so this might
be a win with PV devices too.

I'm not sure.  It is pretty easy to replace an rmb with a dependency
which is generally quite cheap in my testing.

But since it's supposed to benefit PV, at this point we already have
implementations so rather than speculate (pun intended), people can
write patches and experiment.

For the proposed avail idx I think Jason (CC'd) was interested in adding
something like this for vhost.



Yes, it's kind of IN_ORDER + split ring I believe.

Thanks






So I wonder: what if we made a change to spec that would allow prefetch
of ring entries?  E.g. you would be able to read at random and if you
guessed right then you can just use what you have read, no need to
re-fetch?

Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload

2019-02-11 Thread Michael S. Tsirkin
On Mon, Feb 11, 2019 at 02:58:30PM +0000, David Riddoch wrote:
> > > > > This can result in a very high rate of doorbells with some
> > > > > drivers, which can become a severe bottleneck (because x86 CPUs
> > > > > can't emit MMIOs at very high rates).
> > > > Interesting. Is there any data you could share to help guide the design?
> > > > E.g. what's the highest rate of MMIO writes supported etc?
> > > On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.
> > > 
> > > On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s.  On
> > > PCIe-remote socket ~8M/s.
> > Is that mega bytes? Or million writes?
> > With 32 bit writes are we limited to 2M then?
> 
> Sorry, this is million-writes-per-second.  The size of the write doesn't
> matter.

So it's not too bad, you only need one per batch after all.


Also I thought some more about this. In fact on x86/Intel specifically,
PCI descriptor reads are atomic with cache-line granularity
and writes are ordered. So in fact you can safely prefetch
descriptors and if you see them valid, you can go ahead
and use them.

This makes the descriptor index merely a hint for performance,
which device can play with at will.


Other platforms are not like this so you need the data
but do they have the same problem?





> > > This doesn't just impose a limit on aggregate packet rate: If you hit this
> > > bottleneck then the CPU core is back-pressured by the MMIO writes, and so
> > > instr/cycle takes a huge hit.
> > > 
> > > > > The proposed offset/wrap field allows devices to disable doorbells
> > > > > when appropriate, and determine the latest fill level via a PCIe read.
> > > > This kind of looks like a split ring though, does it not?
> > > I don't agree, because the ring isn't split.  Split-ring is very painful
> > > for hardware because of the potentially random accesses to the descriptor
> > > table (including pointer chasing) which inhibit batching of descriptor
> > > fetches, as well as causing complexity in implementation.
> > That's different with IN_ORDER, right?
> 
> Yes, IN_ORDER will give some of the benefit of packed ring, and is also a
> win because you can merge 'used' writes.  (But packed-ring still wins due to
> not having to fetch ring and descriptor table separately).
>
> > > Packed ring is a huge improvement on both dimensions, but introduces a
> > > specific pain point for hardware offload.
> > > 
> > > > The issue is
> > > > we will again need to bounce more cache lines to communicate.
> > > You'll only bounce cache lines if the device chooses to read the idx.
> > > A PV device need not offer this feature.  A PCIe device will, but the
> > > cost to the driver of writing an in-memory idx is much smaller than the
> > > cost of an MMIO, which is always a strongly ordered barrier on x86/64.
> > > 
> > > With vDPA you ideally would have this feature enabled, and the device
> > > would sometimes be PV and sometimes PCIe.  The PV device would ignore
> > > the new idx field and so cache lines would not bounce then either.
> > > 
> > > Ie. The only time cache lines are shared is when sharing with a PCIe
> > > device, which is the scenario when this is a win.
> > True. OTOH on non-x86 you will need more write barriers :( It would be
> > natural to say driver can reuse the barrier before flags update, but note
> > that that would mean hardware still needs to support re-fetching if
> > flags value is wrong. Such re-fetch is probably rare so fine by me, but it
> > does add a bit more complexity.
> 
> I would prefer to have the write barrier before writing the idx.

Well that's driver overhead for something device might never utilise in
a given workload. If we are optimizing let's optimize for speed.

> Note that drivers could take advantage of this feature to avoid read
> barriers when consuming descriptors: At the moment there is a virtio_rmb()
> per descriptor read.  With the proposed feature the driver can do just one
> barrier after reading idx.

Oh you want device to write a used index too? Hmm. Isn't this
contrary to your efforts to consume PCIe BW?

> I expect that on platforms where write barriers
> have a cost read barriers likely have a significant cost too, so this might
> be a win with PV devices too.

I'm not sure.  It is pretty easy to replace an rmb with a dependency
which is generally quite cheap in my testing.

But since it's supposed to benefit PV, at this point we already have
implementations so rather than speculate (pun intended), people can
write patches and experiment.

For the proposed avail idx I think Jason (CC'd) was interested in adding
something like this for vhost.


> > > > So I wonder: what if we made a change to spec that would allow prefetch
> > > > of ring entries?  E.g. you would be able to read at random and if you
> > > > guessed right then you can just use what you have read, no need to
> > > > re-fetch?
> > > Unless I've misunderstood I think this would imply that the driver would
> > > have to ensure strict ordering for each descriptor it wrote, which would
> > > impose a cost to the driver.  At the moment a write barrier is only needed
> > > before writing the flags of the first descriptor in a batch.

Re: [virtio-dev] Memory sharing device

2019-02-11 Thread Michael S. Tsirkin
On Mon, Feb 11, 2019 at 07:14:53AM -0800, Frank Yang wrote:
> 
> 
> On Mon, Feb 11, 2019 at 6:49 AM Michael S. Tsirkin  wrote:
> 
> On Mon, Feb 04, 2019 at 11:42:25PM -0800, Roman Kiryanov wrote:
> > Hi Gerd,
> >
> > > virtio-gpu specifically needs that to support vulkan and opengl
> > > extensions for coherent buffers, which must be allocated by the host
> gpu
> > > driver.  It's WIP still.
> >
> > the proposed spec says:
> >
> > +Shared memory regions MUST NOT be used to control the operation
> > +of the device, nor to stream data; those should still be performed
> > +using virtqueues.
> >
> > Is there a strong reason to prohibit using memory regions for control
> > purposes?
> 
> That's in order to ensure virtio's portability: if people see a virtio
> device in lspci they know there's no lock-in; their guest can be moved
> between hypervisors and will still work.
> 
> > Our long term goal is to have as few kernel drivers as possible and to
> > move "drivers" into userspace. If we go with the virtqueues, is there a
> > general-purpose device/driver to talk between our host and guest to
> > support custom hardware (with own blobs)?
> 
> The challenge is to answer the following question:
> how to do this without losing the benefits of standardization?
> 
> 
> Draft spec is incoming, but the basic idea is to standardize how to enumerate,
> discover, and operate (with high performance) such userspace drivers/devices;
> the basic operations would be standardized, and userspace drivers would be
> constructed out of the resulting primitives.

As long as standardization facilitates functionality, e.g.
if we can support moving between hypervisors, this seems
in-scope for virtio.


> > Could you please advise if we can use something else to
> > achieve this goal?
> 
> I am not sure what the goal is though. Blobs is a means I guess
> or it should be :) E.g. is it about being able to iterate quickly?
> 
> Maybe you should look at vhost-user-gpu patches on qemu?
> Would this address your need?
> Acks for these patches would be a good thing.
> 
> 
> 
> Is this it:
> 
> https://patchwork.kernel.org/patch/10444089/ ?
> 
> I'll check it out and try to discuss. Is there a draft spec for it as well?


virtio gpu is part of the csprd01 draft.

> 
> 
> --
> MST
> 




Re: [virtio-dev] Memory sharing device

2019-02-11 Thread Frank Yang
BTW, I have a few concerns about the upcoming shared-mem virtio type. This
is mostly directed at David and kraxel.

We've found that for many applications, having the guest create a new host
pointer for Vulkan or OpenGL has quite some overhead just in telling the
hypervisor to map it, and in fact it's easy to run out of KVM memory slots
by doing so. So for Vulkan we rely on one large host-visible region: a
single region of host shared memory that is then sub-allocated for the
guest. So there is no Vulkan host pointer that is shared to the guest 1:1;
we sub-allocate, then generate the right 'underlying' Vulkan device memory
offset and size parameters for the host.
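For illustration, a sketch of that sub-allocation scheme, assuming a simple
bump allocator over the one large host-visible region; the struct and field
names are made up for the example:

#include <stdint.h>

struct shm_region {
    uint64_t size;    /* size of the one large host-visible region */
    uint64_t next;    /* bump pointer; a real allocator also frees */
};

/* Carve a sub-allocation out of the region and return its offset,
 * which becomes the 'underlying' Vulkan device memory offset. */
static int shm_suballoc(struct shm_region *r, uint64_t len,
                        uint64_t align, uint64_t *offset_out)
{
    uint64_t off = (r->next + align - 1) & ~(align - 1);
    if (off + len > r->size)
        return -1;                 /* region exhausted */
    r->next = off + len;
    *offset_out = off;
    return 0;
}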

In general though, this means that the ideal usage of host pointers would
be to set a few regions up front for certain purposes, then share them out
amongst other device contexts. This also facilitates sharing the memory
between guest processes, which is useful for implementing things like
compositors, and it features heavily in our "virtio userspace" idea.
Since this is a common pattern, should this sharing concept be standardized
somehow? I.e., should there be a standard way to send shmid/offset/size to
other devices, or have that be a standard struct in the hypervisor?
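If this were standardized, one could imagine something like the following
struct for passing such references between devices; this is purely a
sketch, not part of any current draft:

#include <stdint.h>

struct virtio_shm_ref {
    uint32_t shmid;    /* which shared memory region of the device */
    uint32_t reserved;
    uint64_t offset;   /* byte offset of the sub-allocation */
    uint64_t len;      /* length of the sub-allocation */
};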

On Mon, Feb 11, 2019 at 7:14 AM Frank Yang  wrote:

>
>
> On Mon, Feb 11, 2019 at 6:49 AM Michael S. Tsirkin  wrote:
>
>> On Mon, Feb 04, 2019 at 11:42:25PM -0800, Roman Kiryanov wrote:
>> > Hi Gerd,
>> >
>> > > virtio-gpu specifically needs that to support vulkan and opengl
>> > > extensions for coherent buffers, which must be allocated by the host
>> gpu
>> > > driver.  It's WIP still.
>> >
>> > the proposed spec says:
>> >
>> > +Shared memory regions MUST NOT be used to control the operation
>> > +of the device, nor to stream data; those should still be performed
>> > +using virtqueues.
>> >
>> > Is there a strong reason to prohibit using memory regions for control
>> > purposes?
>>
>> That's in order to ensure virtio's portability: if people see a virtio
>> device in lspci they know there's no lock-in; their guest can be moved
>> between hypervisors and will still work.
>>
>> > Our long term goal is to have as few kernel drivers as possible and to
>> > move "drivers" into userspace. If we go with the virtqueues, is there a
>> > general-purpose device/driver to talk between our host and guest to
>> > support custom hardware (with own blobs)?
>>
>> The challenge is to answer the following question:
>> how to do this without losing the benefits of standardization?
>>
> Draft spec is incoming, but the basic idea is to standardize how to
> enumerate, discover, and operate (with high performance) such userspace
> drivers/devices; the basic operations would be standardized, and userspace
> drivers would be constructed out of the resulting primitives.
>
>> > Could you please advise if we can use something else to
>> > achieve this goal?
>>
>> I am not sure what the goal is though. Blobs is a means I guess
>> or it should be :) E.g. is it about being able to iterate quickly?
>>
>> Maybe you should look at vhost-user-gpu patches on qemu?
>> Would this address your need?
>> Acks for these patches would be a good thing.
>>
>>
> Is this it:
>
> https://patchwork.kernel.org/patch/10444089/ ?
>
> I'll check it out and try to discuss. Is there a draft spec for it as well?
>
>
>>
>> --
>> MST
>>
>


Re: [virtio-dev] Memory sharing device

2019-02-11 Thread Frank Yang
On Mon, Feb 11, 2019 at 6:49 AM Michael S. Tsirkin  wrote:

> On Mon, Feb 04, 2019 at 11:42:25PM -0800, Roman Kiryanov wrote:
> > Hi Gerd,
> >
> > > virtio-gpu specifically needs that to support vulkan and opengl
> > > extensions for coherent buffers, which must be allocated by the host
> gpu
> > > driver.  It's WIP still.
> >
> > the proposed spec says:
> >
> > +Shared memory regions MUST NOT be used to control the operation
> > +of the device, nor to stream data; those should still be performed
> > +using virtqueues.
> >
> > Is there a strong reason to prohibit using memory regions for control
> > purposes?
>
> That's in order to ensure virtio's portability: if people see a virtio
> device in lspci they know there's no lock-in; their guest can be moved
> between hypervisors and will still work.
>
> > Our long term goal is to have as few kernel drivers as possible and to
> > move "drivers" into userspace. If we go with the virtqueues, is there a
> > general-purpose device/driver to talk between our host and guest to
> > support custom hardware (with own blobs)?
>
> The challenge is to answer the following question:
> how to do this without losing the benefits of standardization?
>
Draft spec is incoming, but the basic idea is to standardize how to
enumerate, discover, and operate (with high performance) such userspace
drivers/devices; the basic operations would be standardized, and userspace
drivers would be constructed out of the resulting primitives.

> > Could you please advise if we can use something else to
> > achieve this goal?
>
> I am not sure what the goal is though. Blobs is a means I guess
> or it should be :) E.g. is it about being able to iterate quickly?
>
> Maybe you should look at vhost-user-gpu patches on qemu?
> Would this address your need?
> Acks for these patches would be a good thing.
>
>
Is this it:

https://patchwork.kernel.org/patch/10444089/ ?

I'll check it out and try to discuss. Is there a draft spec for it as well?


>
> --
> MST
>


Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload

2019-02-11 Thread David Riddoch

This can result in a very high rate of doorbells with some
drivers, which can become a severe bottleneck (because x86 CPUs can't emit
MMIOs at very high rates).

Interesting. Is there any data you could share to help guide the design?
E.g. what's the highest rate of MMIO writes supported etc?

On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.

On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s.  On
PCIe-remote socket ~8M/s.

Is that mega bytes? Or million writes?
With 32 bit writes are we limited to 2M then?


Sorry, this is million-writes-per-second.  The size of the write doesn't 
matter.



This doesn't just impose a limit on aggregate packet rate: If you hit this
bottleneck then the CPU core is back-pressured by the MMIO writes, and so
instr/cycle takes a huge hit.


The proposed offset/wrap field allows devices to disable doorbells when
appropriate, and determine the latest fill level via a PCIe read.

This kind of looks like a split ring though, does it not?

I don't agree, because the ring isn't split.  Split-ring is very painful for
hardware because of the potentially random accesses to the descriptor table
(including pointer chasing) which inhibit batching of descriptor fetches, as
well as causing complexity in implementation.

That's different with IN_ORDER, right?


Yes, IN_ORDER will give some of the benefit of packed ring, and is also 
a win because you can merge 'used' writes.  (But packed-ring still wins 
due to not having to fetch ring and descriptor table separately).



Packed ring is a huge improvement on both dimensions, but introduces a
specific pain point for hardware offload.


The issue is
we will again need to bounce more cache lines to communicate.

You'll only bounce cache lines if the device chooses to read the idx.  A PV
device need not offer this feature.  A PCIe device will, but the cost to the
driver of writing an in-memory idx is much smaller than the cost of an MMIO,
which is always a strongly ordered barrier on x86/64.

With vDPA you ideally would have this feature enabled, and the device would
sometimes be PV and sometimes PCIe.  The PV device would ignore the new idx
field and so cache lines would not bounce then either.

Ie. The only time cache lines are shared is when sharing with a PCIe device,
which is the scenario when this is a win.

True. OTOH on non-x86 you will need more write barriers :( It would be
natural to say driver can reuse the barrier before flags update, but note
that that would mean hardware still needs to support re-fetching if
flags value is wrong. Such re-fetch is probably rare so fine by me, but it
does add a bit more complexity.


I would prefer to have the write barrier before writing the idx.

Note that drivers could take advantage of this feature to avoid read 
barriers when consuming descriptors: At the moment there is a 
virtio_rmb() per descriptor read.  With the proposed feature the driver 
can do just one barrier after reading idx.  I expect that on platforms 
where write barriers have a cost read barriers likely have a significant 
cost too, so this might be a win with PV devices too.



So I wonder: what if we made a change to spec that would allow prefetch
of ring entries?  E.g. you would be able to read at random and if you
guessed right then you can just use what you have read, no need to
re-fetch?

Unless I've misunderstood I think this would imply that the driver would
have to ensure strict ordering for each descriptor it wrote, which would
impose a cost to the driver.  At the moment a write barrier is only needed
before writing the flags of the first descriptor in a batch.

On non-x86 right?  OTOH the extra data structure also adds more ordering
requirements.


Yes on non-x86.  The extra data structure only adds an ordering once per 
request (when enabled) whereas allowing prefetch would add an ordering 
per descriptor.  The number of descriptors is always >= the number of 
requests, and can be much larger (esp. for a net rx virt-queue).
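A sketch of the per-batch ordering David describes, assuming packed-ring
semantics where the device ignores the rest of a batch until its first
descriptor is published; the function name is illustrative:

#include <stdint.h>

/* 'first_desc_flags' points at the flags word of the first descriptor
 * of the batch; all later descriptors were already written with valid
 * flags, but the device ignores them until this one is published. */
static void publish_batch(volatile uint16_t *first_desc_flags,
                          uint16_t flags_value)
{
    __atomic_thread_fence(__ATOMIC_RELEASE);  /* one barrier per batch */
    *first_desc_flags = flags_value;          /* publishes the batch */
}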



I suggest the best place to put this would be in the driver area,
immediately after the event suppression structure.

Could you comment on why is that a good place though?

The new field is written by the driver, as are the other fields in the
driver area.  Also I expect that devices might be able to read this new idx
together with the interrupt suppression fields in a single read, reducing
PCIe overheads.

OK, then you need it within the same 256-byte-aligned boundary.


The event suppression structs currently require 4-byte alignment.  Are you 
suggesting we increase the required alignment when 
VIRTIO_F_RING_PACKED_AVAIL_IDX is negotiated?  Sounds good to me, but 
8 bytes would presumably suffice.
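For concreteness, a sketch of the layout under discussion, using the event
suppression field names from the packed-ring spec; the extended struct, the
avail idx field, and the 8-byte alignment are the proposal being debated,
not spec text:

#include <stdint.h>

struct pvirtq_event_suppress {       /* per the packed-ring spec */
    uint16_t desc;                   /* descriptor event off_wrap */
    uint16_t flags;                  /* event flags */
};

struct pvirtq_driver_area {          /* hypothetical extension */
    struct pvirtq_event_suppress evt;
    uint16_t avail_idx;              /* proposed offset/wrap field */
    uint16_t reserved;               /* pad to 8 bytes */
} __attribute__((aligned(8)));       /* one aligned read fetches all */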


Thanks for the feedback.

David

--
David Riddoch   -- Chief Architect, Solarflare



Re: [virtio-dev] Memory sharing device

2019-02-11 Thread Michael S. Tsirkin
On Mon, Feb 04, 2019 at 11:42:25PM -0800, Roman Kiryanov wrote:
> Hi Gerd,
> 
> > virtio-gpu specifically needs that to support vulkan and opengl
> > extensions for coherent buffers, which must be allocated by the host gpu
> > driver.  It's WIP still.
> 
> the proposed spec says:
> 
> +Shared memory regions MUST NOT be used to control the operation
> +of the device, nor to stream data; those should still be performed
> +using virtqueues.
> 
> Is there a strong reason to prohibit using memory regions for control 
> purposes?

That's in order to ensure virtio's portability: if people see a virtio
device in lspci they know there's no lock-in; their guest can be moved
between hypervisors and will still work.

> Our long term goal is to have as few kernel drivers as possible and to move
> "drivers" into userspace. If we go with the virtqueues, is there a
> general-purpose device/driver to talk between our host and guest to support
> custom hardware (with own blobs)?

The challenge is to answer the following question:
how to do this without losing the benefits of standardization?

> Could you please advise if we can use something else to
> achieve this goal?

I am not sure what the goal is though. Blobs is a means I guess
or it should be :) E.g. is it about being able to iterate quickly?

Maybe you should look at vhost-user-gpu patches on qemu?
Would this address your need?
Acks for these patches would be a good thing.


-- 
MST




Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload

2019-02-11 Thread Michael S. Tsirkin
On Mon, Feb 11, 2019 at 08:52:01AM +0000, David Riddoch wrote:
> On 11/02/2019 07:33, Michael S. Tsirkin wrote:
> > On Fri, Feb 01, 2019 at 02:23:55PM +0000, David Riddoch wrote:
> > > All,
> > > 
> > > I'd like to propose a small extension to the packed virtqueue mode.  My
> > > proposal is to add an offset/wrap field, written by the driver, indicating
> > > how many available descriptors have been added to the ring.
> > > 
> > > The reason for wanting this is to improve performance of hardware devices.
> > > Because of high read latency over a PCIe bus, it is important for hardware
> > > devices to read multiple ring entries in parallel.  It is desirable
> > > to know
> > > how many descriptors are available prior to issuing these reads, else you
> > > risk fetching descriptors that are not yet available.  As well as wasting
> > > bus bandwidth this adds complexity.
> > > 
> > > I'd previously hoped that VIRTIO_F_NOTIFICATION_DATA would solve this
> > > problem,
> > Right. And this seems like the ideal solution latency-wise since
> > this pushes data out to device without need for round-trips
> > over PCIe.
> 
> Yes, and I'm not proposing getting rid of it.  I'd expect a PCIe device to
> use both features together.
> 
> > > but we still have a problem.  If you rely on doorbells to tell you
> > > how many descriptors are available, then you have to keep doorbells
> > > enabled at all times.
> > I would say right now there are two modes and device can transition
> > between them at will:
> > 
> > 1. read each descriptor twice - once speculatively, once
> > to get the actual data
> > 
> > optimal for driver, suboptimal for device
> 
> You might read each descriptor multiple times in some scenarios. Reading
> descriptors in batches is hugely important given the latency and overheads
> of PCIe (and lack of adjacent data fetching that caching gives you).
> 
> > 2. enable notification for each descriptor and rely on
> > these notifications
> > 
> > optimal for device, suboptimal for driver
> > 
> > > This can result in a very high rate of doorbells with some
> > > drivers, which can become a severe bottleneck (because x86 CPUs can't emit
> > > MMIOs at very high rates).
> > Interesting. Is there any data you could share to help guide the design?
> > E.g. what's the highest rate of MMIO writes supported etc?
> 
> On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.
> 
> On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s.  On
> PCIe-remote socket ~8M/s.

Is that mega bytes? Or million writes?
With 32 bit writes are we limited to 2M then?

> This doesn't just impose a limit on aggregate packet rate: If you hit this
> bottleneck then the CPU core is back-pressured by the MMIO writes, and so
> instr/cycle takes a huge hit.
>
> > > The proposed offset/wrap field allows devices to disable doorbells when
> > > appropriate, and determine the latest fill level via a PCIe read.
> > This kind of looks like a split ring though, does it not?
> 
> I don't agree, because the ring isn't split.  Split-ring is very painful for
> hardware because of the potentially random accesses to the descriptor table
> (including pointer chasing) which inhibit batching of descriptor fetches, as
> well as causing complexity in implementation.

That's different with IN_ORDER, right?


> Packed ring is a huge improvement on both dimensions, but introduces a
> specific pain point for hardware offload.
> 
> > The issue is
> > we will again need to bounce more cache lines to communicate.
> 
> You'll only bounce cache lines if the device chooses to read the idx.  A PV
> device need not offer this feature.  A PCIe device will, but the cost to the
> driver of writing an in-memory idx is much smaller than the cost of an MMIO,
> which is always a strongly ordered barrier on x86/64.
> 
> With vDPA you ideally would have this feature enabled, and the device would
> sometimes be PV and sometimes PCIe.  The PV device would ignore the new idx
> field and so cache lines would not bounce then either.
> 
> Ie. The only time cache lines are shared is when sharing with a PCIe device,
> which is the scenario when this is a win.

True. OTOH on non-x86 you will need more write barriers :( It would be
natural to say driver can reuse the barrier before flags update, but note
that that would mean hardware still needs to support re-fetching if
flags value is wrong. Such re-fetch is probably rare so fine by me, but it
does add a bit more complexity.


> > So I wonder: what if we made a change to spec that would allow prefetch
> > of ring entries?  E.g. you would be able to read at random and if you
> > guessed right then you can just use what you have read, no need to
> > re-fetch?
> 
> Unless I've misunderstood I think this would imply that the driver would
> have to ensure strict ordering for each descriptor it wrote, which would
> impose a cost to the driver.  At the moment a write barrier is only needed
> before writing the flags of the first descriptor in a batch.

[virtio-dev] Re: [virtio] RE: Re: [virtio-dev] Device-to-driver notification management for hardware implementations

2019-02-11 Thread David Riddoch
(Apologies if you got this a second time: The first copy went to the 
wrong list).


I think your proposal of a write to p_int_en_doorbell will work for us. 
The write event to this new address in the BAR itself implicitly tells the 
device that interrupts are enabled, and it also conveys the last used 
entry that the driver processed.  Then the device can provide nice 
interrupt moderation by making sure at least one new used entry was 
written and some time has elapsed since the previous interrupt to the 
driver.

It would also be helpful if you could outline the new structure and your 
proposal for the offset in the BAR.


Chandra

So this boils down to adding the ability to notify devices about enabling
interrupts. The question would be: this slows down the driver, so is it
still worth doing? Right now we are trying to offload as
much as we can to the device.
I don't think it slows down the driver significantly, because the 
interrupt rate is usually moderated so that it is not too high. (So 
these notify MMIOs should not be issued at a high enough rate to get 
close to causing a bottleneck).


It is a win in terms of PCIe overheads, as the device never needs to 
read the suppression structure.  But again the win is limited because 
interrupt rate will usually be limited by moderation.


It can be a win for latency.  In the case of a net rx virt-queue the 
device can issue an interrupt without first reading the suppression 
structure.
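A sketch of the device-side moderation logic this enables, assuming a
hypothetical int_en doorbell whose written value is the last used index
the driver has processed; all names here are illustrative:

#include <stdint.h>
#include <stdbool.h>

struct vq_irq_state {
    uint16_t last_used_seen_by_driver;  /* from int_en doorbell write */
    uint16_t device_used_idx;           /* last used entry we wrote */
    uint64_t last_irq_ns;
    bool     irq_enabled;               /* set by the doorbell write */
};

static bool should_interrupt(struct vq_irq_state *s, uint64_t now_ns,
                             uint64_t min_gap_ns)
{
    if (!s->irq_enabled)
        return false;
    /* At least one used entry the driver has not yet seen ... */
    if (s->device_used_idx == s->last_used_seen_by_driver)
        return false;
    /* ... and enough time elapsed since the previous interrupt. */
    if (now_ns - s->last_irq_ns < min_gap_ns)
        return false;
    s->last_irq_ns = now_ns;
    s->irq_enabled = false;   /* one-shot until re-enabled by driver */
    return true;
}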


However, despite having made the proposal, I no longer consider this 
high priority: Because it is an optional feature it would not reduce 
complexity since a device has to be able to operate without it.  I'm not 
sure the benefits are worth the complexity added.


David

--
David Riddoch   -- Chief Architect, Solarflare





Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload

2019-02-11 Thread David Riddoch

On 11/02/2019 07:33, Michael S. Tsirkin wrote:

On Fri, Feb 01, 2019 at 02:23:55PM +0000, David Riddoch wrote:

All,

I'd like to propose a small extension to the packed virtqueue mode.  My
proposal is to add an offset/wrap field, written by the driver, indicating
how many available descriptors have been added to the ring.

The reason for wanting this is to improve performance of hardware devices.
Because of high read latency over a PCIe bus, it is important for hardware
devices to read multiple ring entries in parallel.  It is desirable to know
how many descriptors are available prior to issuing these reads, else you
risk fetching descriptors that are not yet available.  As well as wasting
bus bandwidth this adds complexity.

I'd previously hoped that VIRTIO_F_NOTIFICATION_DATA would solve this
problem,

Right. And this seems like the ideal solution latency-wise since
this pushes data out to device without need for round-trips
over PCIe.


Yes, and I'm not proposing getting rid of it.  I'd expect a PCIe device 
to use both features together.



but we still have a problem.  If you rely on doorbells to tell you
how many descriptors are available, then you have to keep doorbells enabled
at all times.

I would say right now there are two modes and device can transition
between them at will:

1. read each descriptor twice - once speculatively, once
to get the actual data

optimal for driver, suboptimal for device


You might read each descriptor multiple times in some scenarios. Reading 
descriptors in batches is hugely important given the latency and 
overheads of PCIe (and lack of adjacent data fetching that caching gives 
you).



2. enable notification for each descriptor and rely on
these notifications

optimal for device, suboptimal for driver


This can result in a very high rate of doorbells with some
drivers, which can become a severe bottleneck (because x86 CPUs can't emit
MMIOs at very high rates).

Interesting. Is there any data you could share to help guide the design?
E.g. what's the highest rate of MMIO writes supported etc?


On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.

On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s.  On 
PCIe-remote socket ~8M/s.


This doesn't just impose a limit on aggregate packet rate: If you hit 
this bottleneck then the CPU core is back-pressured by the MMIO writes, 
and so instr/cycle takes a huge hit.



The proposed offset/wrap field allows devices to disable doorbells when
appropriate, and determine the latest fill level via a PCIe read.

This kind of looks like a split ring though, does it not?


I don't agree, because the ring isn't split.  Split-ring is very painful 
for hardware because of the potentially random accesses to the 
descriptor table (including pointer chasing) which inhibit batching of 
descriptor fetches, as well as causing complexity in implementation.


Packed ring is a huge improvement on both dimensions, but introduces a 
specific pain point for hardware offload.



The issue is
we will again need to bounce more cache lines to communicate.


You'll only bounce cache lines if the device chooses to read the idx.  A 
PV device need not offer this feature.  A PCIe device will, but the cost 
to the driver of writing an in-memory idx is much smaller than the cost 
of an MMIO, which is always a strongly ordered barrier on x86/64.


With vDPA you ideally would have this feature enabled, and the device 
would sometimes be PV and sometimes PCIe.  The PV device would ignore 
the new idx field and so cache lines would not bounce then either.


Ie. The only time cache lines are shared is when sharing with a PCIe 
device, which is the scenario when this is a win.



So I wonder: what if we made a change to spec that would allow prefetch
of ring entries?  E.g. you would be able to read at random and if you
guessed right then you can just use what you have read, no need to
re-fetch?


Unless I've misunderstood I think this would imply that the driver would 
have to ensure strict ordering for each descriptor it wrote, which would 
impose a cost to the driver.  At the moment a write barrier is only 
needed before writing the flags of the first descriptor in a batch.



I suggest the best place to put this would be in the driver area,
immediately after the event suppression structure.

Could you comment on why is that a good place though?


The new field is written by the driver, as are the other fields in the 
driver area.  Also I expect that devices might be able to read this new 
idx together with the interrupt suppression fields in a single read, 
reducing PCIe overheads.


Placing the new field immediately after the descriptor ring would also 
work, but lose the benefit of combining reads, and potentially cause 
drivers to allocate a substantially bigger buffer (as I expect the 
descriptor ring is typically a power-of-2 size and drivers allocate 
multiples of page size).


Placing the new field in a