Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
On 2019/2/12 1:08 PM, Michael S. Tsirkin wrote:

On Mon, Feb 11, 2019 at 02:58:30PM +0000, David Riddoch wrote:

This can result in a very high rate of doorbells with some drivers, which can become a severe bottleneck (because x86 CPUs can't emit MMIOs at very high rates).

Interesting. Is there any data you could share to help guide the design? E.g. what's the highest rate of MMIO writes supported etc?

On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.

On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s. On PCIe-remote socket ~8M/s.

Is that megabytes? Or million writes? With 32-bit writes are we limited to 2M then?

Sorry, this is million writes per second. The size of the write doesn't matter.

So it's not too bad, you only need one per batch after all.

Also I thought some more about this. In fact, on x86/Intel specifically, PCI descriptor reads are atomic with cache-line granularity and writes are ordered. So in fact you can safely prefetch descriptors and, if you see them valid, you can go ahead and use them.

If I understand this correctly: unless the driver sets avail in order for several descriptors, the device cannot assume that if N is avail then [0, N) is also available. But prefetching does indeed improve performance when I play with the vhost kernel implementation. It reduces the overhead of memory accesses, which is similar to PCI.

This makes the descriptor index merely a hint for performance, which the device can play with at will. Other platforms are not like this, so you need the data; but do they have the same problem?

This doesn't just impose a limit on aggregate packet rate: if you hit this bottleneck then the CPU core is back-pressured by the MMIO writes, and so instr/cycle takes a huge hit.

The proposed offset/wrap field allows devices to disable doorbells when appropriate, and determine the latest fill level via a PCIe read.

This kind of looks like a split ring though, does it not?

I don't agree, because the ring isn't split. Split-ring is very painful for hardware because of the potentially random accesses to the descriptor table (including pointer chasing), which inhibit batching of descriptor fetches, as well as causing complexity in implementation.

That's different with IN_ORDER, right?

Yes, IN_ORDER will give some of the benefit of packed ring, and is also a win because you can merge 'used' writes. (But packed ring still wins due to not having to fetch ring and descriptor table separately.)

Note, with IN_ORDER there's no need to read the available ring at all, even for split ring. And in some cases, no need to update the used ring at all.

Packed ring is a huge improvement on both dimensions, but introduces a specific pain point for hardware offload.

The issue is we will again need to bounce more cache lines to communicate.

You'll only bounce cache lines if the device chooses to read the idx. A PV device need not offer this feature. A PCIe device will, but the cost to the driver of writing an in-memory idx is much smaller than the cost of an MMIO, which is always a strongly ordered barrier on x86/64.

With vDPA you ideally would have this feature enabled, and the device would sometimes be PV and sometimes PCIe. The PV device would ignore the new idx field and so cache lines would not bounce then either.

I.e. the only time cache lines are shared is when sharing with a PCIe device, which is the scenario when this is a win.

True. OTOH on non-x86 you will need more write barriers :( It would be natural to say the driver can reuse the barrier before the flags update, but note that that would mean hardware still needs to support re-fetching if the flags value is wrong. Such re-fetch is probably rare, so fine by me, but it does add a bit more complexity.

I would prefer to have the write barrier before writing the idx.

Well, that's driver overhead for something the device might never utilise in a given workload. If we are optimizing, let's optimize for speed.

Note that drivers could take advantage of this feature to avoid read barriers when consuming descriptors: at the moment there is a virtio_rmb() per descriptor read. With the proposed feature the driver can do just one barrier after reading idx.

Oh, you want the device to write a used index too? Hmm. Isn't this contrary to your efforts to conserve PCIe BW?

I expect that on platforms where write barriers have a cost, read barriers likely have a significant cost too, so this might be a win with PV devices too.

I'm not sure. It is pretty easy to replace an rmb with a dependency, which is generally quite cheap in my testing. But since it's supposed to benefit PV, at this point we already have implementations, so rather than speculate (pun intended), people can write patches and experiment. For the proposed avail idx I think Jason (CC'd) was interested in adding something like this for vhost.

Yes, it's kind of IN_ORDER + split ring I believe.

Thanks

So I wonder: what if we made a change to spec that would allow prefetch of ring entries? E.g. you would be able to read at random and if you guessed right then you can just use what you have read, no need to re-fetch?
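Jason's caveat and MST's prefetch idea above can be made concrete. Below is a minimal device-side sketch (all names hypothetical; a real offload device would implement this in hardware, and the safety argument is the x86-specific ordering MST describes): fetch a batch of descriptors speculatively, then validate each one's AVAIL/USED flags against the expected wrap counter, stopping at the first miss.

#include <stdbool.h>
#include <stdint.h>

#define VRING_PACKED_DESC_F_AVAIL (1u << 7)
#define VRING_PACKED_DESC_F_USED  (1u << 15)

struct vring_packed_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* A descriptor is available when AVAIL != USED and AVAIL matches the
 * wrap counter the device expects. */
static bool desc_is_avail(const struct vring_packed_desc *d, bool wrap)
{
    bool avail = d->flags & VRING_PACKED_DESC_F_AVAIL;
    bool used  = d->flags & VRING_PACKED_DESC_F_USED;
    return (avail != used) && (avail == wrap);
}

/* Speculatively read up to 'batch' descriptors; returns how many were
 * actually valid. The rest must be re-fetched later. */
static unsigned prefetch_batch(const struct vring_packed_desc *ring,
                               unsigned head, unsigned size, bool wrap,
                               struct vring_packed_desc *out, unsigned batch)
{
    unsigned n;
    for (n = 0; n < batch; n++) {
        out[n] = ring[(head + n) % size]; /* speculative read; wrap-counter
                                             flip at ring end elided */
        if (!desc_is_avail(&out[n], wrap))
            break; /* guessed wrong: per Jason's point, descriptor N being
                      avail says nothing about [0, N) unless IN_ORDER */
    }
    return n;
}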
Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
On Mon, Feb 11, 2019 at 02:58:30PM +0000, David Riddoch wrote:
> > > > > This can result in a very high rate of doorbells with some
> > > > > drivers, which can become a severe bottleneck (because x86 CPUs
> > > > > can't emit MMIOs at very high rates).
> > > > Interesting. Is there any data you could share to help guide the
> > > > design? E.g. what's the highest rate of MMIO writes supported etc?
> > > On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.
> > >
> > > On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s.
> > > On PCIe-remote socket ~8M/s.
> > Is that megabytes? Or million writes? With 32-bit writes are we limited
> > to 2M then?
> Sorry, this is million writes per second. The size of the write doesn't
> matter.

So it's not too bad, you only need one per batch after all.

Also I thought some more about this. In fact, on x86/Intel specifically, PCI descriptor reads are atomic with cache-line granularity and writes are ordered. So in fact you can safely prefetch descriptors and, if you see them valid, you can go ahead and use them. This makes the descriptor index merely a hint for performance, which the device can play with at will. Other platforms are not like this, so you need the data; but do they have the same problem?

> > > This doesn't just impose a limit on aggregate packet rate: if you hit
> > > this bottleneck then the CPU core is back-pressured by the MMIO
> > > writes, and so instr/cycle takes a huge hit.
> > > > > The proposed offset/wrap field allows devices to disable
> > > > > doorbells when appropriate, and determine the latest fill level
> > > > > via a PCIe read.
> > > > This kind of looks like a split ring though, does it not?
> > > I don't agree, because the ring isn't split. Split-ring is very
> > > painful for hardware because of the potentially random accesses to
> > > the descriptor table (including pointer chasing), which inhibit
> > > batching of descriptor fetches, as well as causing complexity in
> > > implementation.
> > That's different with IN_ORDER, right?
> Yes, IN_ORDER will give some of the benefit of packed ring, and is also
> a win because you can merge 'used' writes. (But packed ring still wins
> due to not having to fetch ring and descriptor table separately.)
> > > Packed ring is a huge improvement on both dimensions, but introduces
> > > a specific pain point for hardware offload.
> > > > The issue is we will again need to bounce more cache lines to
> > > > communicate.
> > > You'll only bounce cache lines if the device chooses to read the idx.
> > > A PV device need not offer this feature. A PCIe device will, but the
> > > cost to the driver of writing an in-memory idx is much smaller than
> > > the cost of an MMIO, which is always a strongly ordered barrier on
> > > x86/64.
> > >
> > > With vDPA you ideally would have this feature enabled, and the device
> > > would sometimes be PV and sometimes PCIe. The PV device would ignore
> > > the new idx field and so cache lines would not bounce then either.
> > >
> > > I.e. the only time cache lines are shared is when sharing with a PCIe
> > > device, which is the scenario when this is a win.
> > True. OTOH on non-x86 you will need more write barriers :( It would be
> > natural to say the driver can reuse the barrier before the flags
> > update, but note that that would mean hardware still needs to support
> > re-fetching if the flags value is wrong. Such re-fetch is probably
> > rare, so fine by me, but it does add a bit more complexity.
> I would prefer to have the write barrier before writing the idx.

Well, that's driver overhead for something the device might never utilise in a given workload. If we are optimizing, let's optimize for speed.

> Note that drivers could take advantage of this feature to avoid read
> barriers when consuming descriptors: at the moment there is a
> virtio_rmb() per descriptor read. With the proposed feature the driver
> can do just one barrier after reading idx.

Oh, you want the device to write a used index too? Hmm. Isn't this contrary to your efforts to conserve PCIe BW?

> I expect that on platforms where write barriers have a cost, read
> barriers likely have a significant cost too, so this might be a win with
> PV devices too.

I'm not sure. It is pretty easy to replace an rmb with a dependency, which is generally quite cheap in my testing. But since it's supposed to benefit PV, at this point we already have implementations, so rather than speculate (pun intended), people can write patches and experiment. For the proposed avail idx I think Jason (CC'd) was interested in adding something like this for vhost.

> > > > So I wonder: what if we made a change to spec that would allow
> > > > prefetch of ring entries? E.g. you would be able to read at random
> > > > and if you guessed right then you can just use what you have read,
> > > > no need to re-fetch?
> > > Unless I've misunderstood, I think this would imply that the driver
> > > would have to ensure strict ordering for each descriptor it wrote,
> > > which would impose a cost to the driver.
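To see why the barrier count is per batch rather than per descriptor, here is a sketch of the driver-side publish path under the proposal (hypothetical names throughout, in particular avail_idx; whether the second barrier is needed is exactly the point David and MST are debating above; the barrier is modelled as a release fence):

#include <stdint.h>

struct vring_packed_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags; /* includes the AVAIL/USED wrap flags */
};

/* Compiler/CPU write barrier, in the spirit of Linux's virtio_wmb(). */
static inline void virtio_wmb(void)
{
    __atomic_thread_fence(__ATOMIC_RELEASE);
}

/* Publish n descriptors starting at 'head' (ring wrap elided for
 * brevity). 'avail_idx' is the proposed driver-written fill level. */
static void publish_batch(struct vring_packed_desc *ring,
                          volatile uint16_t *avail_idx,
                          const struct vring_packed_desc *batch,
                          uint16_t head, unsigned n)
{
    /* Write every descriptor fully except the head's flags: flipping
     * the head's flags is what makes the whole batch visible. */
    for (unsigned i = 1; i < n; i++)
        ring[head + i] = batch[i];
    ring[head].addr = batch[0].addr;
    ring[head].len  = batch[0].len;
    ring[head].id   = batch[0].id;

    virtio_wmb();                      /* one barrier for the whole batch */
    ring[head].flags = batch[0].flags; /* batch becomes available */

    virtio_wmb();                      /* the extra barrier under debate: */
    *avail_idx = head + n;             /* orders flags before the new idx */
}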
Re: [virtio-dev] Memory sharing device
On Mon, Feb 11, 2019 at 07:14:53AM -0800, Frank Yang wrote:
> On Mon, Feb 11, 2019 at 6:49 AM Michael S. Tsirkin wrote:
> > On Mon, Feb 04, 2019 at 11:42:25PM -0800, Roman Kiryanov wrote:
> > > Hi Gerd,
> > >
> > > > virtio-gpu specifically needs that to support vulkan and opengl
> > > > extensions for coherent buffers, which must be allocated by the
> > > > host gpu driver. It's WIP still.
> > >
> > > the proposed spec says:
> > >
> > > +Shared memory regions MUST NOT be used to control the operation
> > > +of the device, nor to stream data; those should still be performed
> > > +using virtqueues.
> > >
> > > Is there a strong reason to prohibit using memory regions for
> > > control purposes?
> >
> > That's in order to make virtio carry portability implications, such
> > that if people see a virtio device in lspci they know there's no
> > lock-in: their guest can be moved between hypervisors and will still
> > work.
> >
> > > Our long term goal is to have as few kernel drivers as possible and
> > > to move "drivers" into userspace. If we go with the virtqueues, is
> > > there a general-purpose device/driver to talk between our host and
> > > guest to support custom hardware (with its own blobs)?
> >
> > The challenge is to answer the following question: how to do this
> > without losing the benefits of standardization?
>
> Draft spec is incoming, but the basic idea is to standardize how to
> enumerate, discover, and operate (with high performance) such userspace
> drivers/devices; the basic operations would be standardized, and
> userspace drivers would be constructed out of the resulting primitives.

As long as standardization facilitates functionality, e.g. if we can support moving between hypervisors, this seems in-scope for virtio.

> > > Could you please advise if we can use something else to achieve this
> > > goal?
> >
> > I am not sure what the goal is though. Blobs are a means I guess, or
> > they should be :) E.g. is it about being able to iterate quickly?
> >
> > Maybe you should look at the vhost-user-gpu patches on qemu? Would
> > this address your need? Acks for these patches would be a good thing.
>
> Is this it:
>
> https://patchwork.kernel.org/patch/10444089/ ?
>
> I'll check it out and try to discuss. Is there a draft spec for it as
> well?

virtio gpu is part of the csprd01 draft.

--
MST
Re: [virtio-dev] Memory sharing device
BTW, I have a few concerns about the upcoming shared-mem virtio type. This is mostly directed at David and kraxel.

We've found that for many applications, simply creating a new host pointer for each Vulkan or OpenGL allocation has quite some overhead, just in telling the hypervisor to map it; and in fact, it's easy to run out of KVM slots by doing so. So for Vulkan, we rely on having one large host-visible region on the host: a single region of host shared memory. That is then sub-allocated for the guest. So there is no Vulkan host pointer that is shared with the guest 1:1; we sub-allocate, then generate the right 'underlying' Vulkan device memory offset and size parameters for the host.

In general, this means that the ideal usage of host pointers would be to set up a few regions up front for certain purposes, then share them out amongst other device contexts. This also facilitates sharing the memory between guest processes, which is useful for implementing things like compositors. It also features heavily in our "virtio userspace" thing.

Since this is a common pattern, should this sharing concept be standardized somehow? I.e., should there be a standard way to send shmid/offset/size to other devices, or should that be a standard struct in the hypervisor? (A sketch of what such a struct might look like follows the quoted thread below.)

On Mon, Feb 11, 2019 at 7:14 AM Frank Yang wrote:
> On Mon, Feb 11, 2019 at 6:49 AM Michael S. Tsirkin wrote:
>> On Mon, Feb 04, 2019 at 11:42:25PM -0800, Roman Kiryanov wrote:
>> > Hi Gerd,
>> >
>> > > virtio-gpu specifically needs that to support vulkan and opengl
>> > > extensions for coherent buffers, which must be allocated by the
>> > > host gpu driver. It's WIP still.
>> >
>> > the proposed spec says:
>> >
>> > +Shared memory regions MUST NOT be used to control the operation
>> > +of the device, nor to stream data; those should still be performed
>> > +using virtqueues.
>> >
>> > Is there a strong reason to prohibit using memory regions for
>> > control purposes?
>>
>> That's in order to make virtio carry portability implications, such
>> that if people see a virtio device in lspci they know there's no
>> lock-in: their guest can be moved between hypervisors and will still
>> work.
>>
>> > Our long term goal is to have as few kernel drivers as possible and
>> > to move "drivers" into userspace. If we go with the virtqueues, is
>> > there a general-purpose device/driver to talk between our host and
>> > guest to support custom hardware (with its own blobs)?
>>
>> The challenge is to answer the following question: how to do this
>> without losing the benefits of standardization?
>>
> Draft spec is incoming, but the basic idea is to standardize how to
> enumerate, discover, and operate (with high performance) such userspace
> drivers/devices; the basic operations would be standardized, and
> userspace drivers would be constructed out of the resulting primitives.
>
>> > Could you please advise if we can use something else to achieve this
>> > goal?
>>
>> I am not sure what the goal is though. Blobs are a means I guess, or
>> they should be :) E.g. is it about being able to iterate quickly?
>>
>> Maybe you should look at the vhost-user-gpu patches on qemu? Would
>> this address your need? Acks for these patches would be a good thing.
>>
> Is this it:
>
> https://patchwork.kernel.org/patch/10444089/ ?
>
> I'll check it out and try to discuss. Is there a draft spec for it as
> well?
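If the shmid/offset/size pattern Frank describes at the top were standardized, the wire format might be no more than a tuple like the following (purely illustrative; no such struct exists in the spec today, and all field names are assumptions):

#include <stdint.h>

/* Hypothetical device-agnostic reference to a sub-allocation of a
 * shared memory region: enough for one device context to hand a buffer
 * to another (e.g. compositor-style sharing between guest processes). */
struct virtio_shm_ref {
    uint8_t  shmid;       /* which shared memory region of the device */
    uint8_t  reserved[7];
    uint64_t offset;      /* byte offset of the sub-allocation */
    uint64_t len;         /* length of the sub-allocation in bytes */
};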
Re: [virtio-dev] Memory sharing device
On Mon, Feb 11, 2019 at 6:49 AM Michael S. Tsirkin wrote:
> On Mon, Feb 04, 2019 at 11:42:25PM -0800, Roman Kiryanov wrote:
> > Hi Gerd,
> >
> > > virtio-gpu specifically needs that to support vulkan and opengl
> > > extensions for coherent buffers, which must be allocated by the host
> > > gpu driver. It's WIP still.
> >
> > the proposed spec says:
> >
> > +Shared memory regions MUST NOT be used to control the operation
> > +of the device, nor to stream data; those should still be performed
> > +using virtqueues.
> >
> > Is there a strong reason to prohibit using memory regions for control
> > purposes?
>
> That's in order to make virtio carry portability implications, such
> that if people see a virtio device in lspci they know there's no
> lock-in: their guest can be moved between hypervisors and will still
> work.
>
> > Our long term goal is to have as few kernel drivers as possible and to
> > move "drivers" into userspace. If we go with the virtqueues, is there
> > a general-purpose device/driver to talk between our host and guest to
> > support custom hardware (with its own blobs)?
>
> The challenge is to answer the following question: how to do this
> without losing the benefits of standardization?

Draft spec is incoming, but the basic idea is to standardize how to enumerate, discover, and operate (with high performance) such userspace drivers/devices; the basic operations would be standardized, and userspace drivers would be constructed out of the resulting primitives.

> > Could you please advise if we can use something else to achieve this
> > goal?
>
> I am not sure what the goal is though. Blobs are a means I guess, or
> they should be :) E.g. is it about being able to iterate quickly?
>
> Maybe you should look at the vhost-user-gpu patches on qemu? Would this
> address your need? Acks for these patches would be a good thing.

Is this it:

https://patchwork.kernel.org/patch/10444089/ ?

I'll check it out and try to discuss. Is there a draft spec for it as well?
Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
This can result in a very high rate of doorbells with some drivers, which can become a severe bottleneck (because x86 CPUs can't emit MMIOs at very high rates).

Interesting. Is there any data you could share to help guide the design? E.g. what's the highest rate of MMIO writes supported etc?

On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.

On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s. On PCIe-remote socket ~8M/s.

Is that megabytes? Or million writes? With 32-bit writes are we limited to 2M then?

Sorry, this is million writes per second. The size of the write doesn't matter.

This doesn't just impose a limit on aggregate packet rate: if you hit this bottleneck then the CPU core is back-pressured by the MMIO writes, and so instr/cycle takes a huge hit.

The proposed offset/wrap field allows devices to disable doorbells when appropriate, and determine the latest fill level via a PCIe read.

This kind of looks like a split ring though, does it not?

I don't agree, because the ring isn't split. Split-ring is very painful for hardware because of the potentially random accesses to the descriptor table (including pointer chasing), which inhibit batching of descriptor fetches, as well as causing complexity in implementation.

That's different with IN_ORDER, right?

Yes, IN_ORDER will give some of the benefit of packed ring, and is also a win because you can merge 'used' writes. (But packed ring still wins due to not having to fetch ring and descriptor table separately.)

Packed ring is a huge improvement on both dimensions, but introduces a specific pain point for hardware offload.

The issue is we will again need to bounce more cache lines to communicate.

You'll only bounce cache lines if the device chooses to read the idx. A PV device need not offer this feature. A PCIe device will, but the cost to the driver of writing an in-memory idx is much smaller than the cost of an MMIO, which is always a strongly ordered barrier on x86/64.

With vDPA you ideally would have this feature enabled, and the device would sometimes be PV and sometimes PCIe. The PV device would ignore the new idx field and so cache lines would not bounce then either.

I.e. the only time cache lines are shared is when sharing with a PCIe device, which is the scenario when this is a win.

True. OTOH on non-x86 you will need more write barriers :( It would be natural to say the driver can reuse the barrier before the flags update, but note that that would mean hardware still needs to support re-fetching if the flags value is wrong. Such re-fetch is probably rare, so fine by me, but it does add a bit more complexity.

I would prefer to have the write barrier before writing the idx.

Note that drivers could take advantage of this feature to avoid read barriers when consuming descriptors: at the moment there is a virtio_rmb() per descriptor read. With the proposed feature the driver can do just one barrier after reading idx. I expect that on platforms where write barriers have a cost, read barriers likely have a significant cost too, so this might be a win with PV devices too.

So I wonder: what if we made a change to spec that would allow prefetch of ring entries? E.g. you would be able to read at random and if you guessed right then you can just use what you have read, no need to re-fetch?

Unless I've misunderstood, I think this would imply that the driver would have to ensure strict ordering for each descriptor it wrote, which would impose a cost to the driver. At the moment a write barrier is only needed before writing the flags of the first descriptor in a batch.

On non-x86, right? OTOH the extra data structure also adds more ordering requirements.

Yes, on non-x86. The extra data structure only adds an ordering once per request (when enabled), whereas allowing prefetch would add an ordering per descriptor. The number of descriptors is always >= the number of requests, and can be much larger (esp. for a net rx virtqueue).

I suggest the best place to put this would be in the driver area, immediately after the event suppression structure.

Could you comment on why that is a good place though?

The new field is written by the driver, as are the other fields in the driver area. Also I expect that devices might be able to read this new idx together with the interrupt suppression fields in a single read, reducing PCIe overheads.

OK, then you need it in the same aligned 256-byte boundary.

The event suppression structs currently require 4-byte alignment. Are you suggesting we increase the alignment required when VIRTIO_F_RING_PACKED_AVAIL_IDX is negotiated? Sounds good to me, but 8 bytes would presumably suffice.

Thanks for the feedback.

David

--
David Riddoch -- Chief Architect, Solarflare
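For concreteness, the placement David describes would look something like this (a sketch: only the two event suppression fields exist in the current spec; the new field's name, its format, and the 8-byte padding are illustrative):

#include <stdint.h>

struct vring_packed_driver_area {
    /* Existing driver event suppression structure (4-byte aligned). */
    uint16_t off_wrap;        /* event offset (15 bits) plus wrap bit */
    uint16_t flags;           /* event notification mode */

    /* Proposed addition, placed immediately after so a device can fetch
     * both in a single read: the latest fill level written by the
     * driver, mirroring the 15-bit offset + wrap-bit format above. */
    uint16_t avail_off_wrap;
    uint16_t reserved;        /* pad to the 8 bytes mooted above */
};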
Re: [virtio-dev] Memory sharing device
On Mon, Feb 04, 2019 at 11:42:25PM -0800, Roman Kiryanov wrote:
> Hi Gerd,
>
> > virtio-gpu specifically needs that to support vulkan and opengl
> > extensions for coherent buffers, which must be allocated by the host
> > gpu driver. It's WIP still.
>
> the proposed spec says:
>
> +Shared memory regions MUST NOT be used to control the operation
> +of the device, nor to stream data; those should still be performed
> +using virtqueues.
>
> Is there a strong reason to prohibit using memory regions for control
> purposes?

That's in order to make virtio carry portability implications, such that if people see a virtio device in lspci they know there's no lock-in: their guest can be moved between hypervisors and will still work.

> Our long term goal is to have as few kernel drivers as possible and to
> move "drivers" into userspace. If we go with the virtqueues, is there a
> general-purpose device/driver to talk between our host and guest to
> support custom hardware (with its own blobs)?

The challenge is to answer the following question: how to do this without losing the benefits of standardization?

> Could you please advise if we can use something else to achieve this
> goal?

I am not sure what the goal is though. Blobs are a means I guess, or they should be :) E.g. is it about being able to iterate quickly?

Maybe you should look at the vhost-user-gpu patches on qemu? Would this address your need? Acks for these patches would be a good thing.

--
MST
Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
On Mon, Feb 11, 2019 at 08:52:01AM +0000, David Riddoch wrote:
> On 11/02/2019 07:33, Michael S. Tsirkin wrote:
> > On Fri, Feb 01, 2019 at 02:23:55PM +0000, David Riddoch wrote:
> > > All,
> > >
> > > I'd like to propose a small extension to the packed virtqueue mode.
> > > My proposal is to add an offset/wrap field, written by the driver,
> > > indicating how many available descriptors have been added to the
> > > ring.
> > >
> > > The reason for wanting this is to improve performance of hardware
> > > devices. Because of high read latency over a PCIe bus, it is
> > > important for hardware devices to read multiple ring entries in
> > > parallel. It is desirable to know how many descriptors are available
> > > prior to issuing these reads, else you risk fetching descriptors
> > > that are not yet available. As well as wasting bus bandwidth, this
> > > adds complexity.
> > >
> > > I'd previously hoped that VIRTIO_F_NOTIFICATION_DATA would solve
> > > this problem,
> >
> > Right. And this seems like the ideal solution latency-wise, since this
> > pushes data out to the device without need for round-trips over PCIe.
>
> Yes, and I'm not proposing getting rid of it. I'd expect a PCIe device
> to use both features together.
>
> > > but we still have a problem. If you rely on doorbells to tell you
> > > how many descriptors are available, then you have to keep doorbells
> > > enabled at all times.
> >
> > I would say right now there are two modes, and the device can
> > transition between them at will:
> >
> > 1. read each descriptor twice - once speculatively, once to get the
> >    actual data
> >
> >    optimal for driver, suboptimal for device
>
> You might read each descriptor multiple times in some scenarios.
> Reading descriptors in batches is hugely important given the latency
> and overheads of PCIe (and the lack of adjacent-data fetching that
> caching gives you).
>
> > 2. enable notification for each descriptor and rely on these
> >    notifications
> >
> >    optimal for device, suboptimal for driver
>
> > > This can result in a very high rate of doorbells with some drivers,
> > > which can become a severe bottleneck (because x86 CPUs can't emit
> > > MMIOs at very high rates).
> >
> > Interesting. Is there any data you could share to help guide the
> > design? E.g. what's the highest rate of MMIO writes supported etc?
>
> On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.
>
> On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s. On
> PCIe-remote socket ~8M/s.

Is that megabytes? Or million writes? With 32-bit writes are we limited to 2M then?

> This doesn't just impose a limit on aggregate packet rate: if you hit
> this bottleneck then the CPU core is back-pressured by the MMIO writes,
> and so instr/cycle takes a huge hit.
>
> > > The proposed offset/wrap field allows devices to disable doorbells
> > > when appropriate, and determine the latest fill level via a PCIe
> > > read.
> >
> > This kind of looks like a split ring though, does it not?
>
> I don't agree, because the ring isn't split. Split-ring is very painful
> for hardware because of the potentially random accesses to the
> descriptor table (including pointer chasing), which inhibit batching of
> descriptor fetches, as well as causing complexity in implementation.

That's different with IN_ORDER, right?

> Packed ring is a huge improvement on both dimensions, but introduces a
> specific pain point for hardware offload.
>
> > The issue is we will again need to bounce more cache lines to
> > communicate.
>
> You'll only bounce cache lines if the device chooses to read the idx. A
> PV device need not offer this feature. A PCIe device will, but the cost
> to the driver of writing an in-memory idx is much smaller than the cost
> of an MMIO, which is always a strongly ordered barrier on x86/64.
>
> With vDPA you ideally would have this feature enabled, and the device
> would sometimes be PV and sometimes PCIe. The PV device would ignore
> the new idx field and so cache lines would not bounce then either.
>
> I.e. the only time cache lines are shared is when sharing with a PCIe
> device, which is the scenario when this is a win.

True. OTOH on non-x86 you will need more write barriers :( It would be natural to say the driver can reuse the barrier before the flags update, but note that that would mean hardware still needs to support re-fetching if the flags value is wrong. Such re-fetch is probably rare, so fine by me, but it does add a bit more complexity.

> > So I wonder: what if we made a change to spec that would allow
> > prefetch of ring entries? E.g. you would be able to read at random
> > and if you guessed right then you can just use what you have read, no
> > need to re-fetch?
>
> Unless I've misunderstood, I think this would imply that the driver
> would have to ensure strict ordering for each descriptor it wrote,
> which would impose a cost to the driver. At the moment a write barrier
> is only needed before writing the flags of the first descriptor in a
> batch.
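Back-of-envelope, using the figures quoted above: 10GbE at minimum frame size is ~14.88M packets/s, so one doorbell per packet would by itself consume the E5-2637's entire ~14M/s PCIe-local MMIO budget (and nearly double the ~8M/s cross-socket budget), while batching even four descriptors per doorbell drops the requirement to ~3.7M/s, leaving ample headroom.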
[virtio-dev] Re: [virtio] RE: Re: [virtio-dev] Device-to-driver notification management for hardware implementations
(Apologies if you got this a second time: the first copy went to the wrong list.)

I think your proposal of a write to p_int_en_doorbell will work for us. The write event to this new address in the BAR itself implicitly tells the device that interrupts are enabled, and it also conveys the last used entry that the driver processed. The device can then provide nice interrupt moderation by making sure at least one new used entry has been written and some time has elapsed since the previous interrupt to the driver.

It would also be helpful if you could outline the new structure and your proposal for its offset in the BAR.

Chandra

So this boils down to adding the ability to notify devices about enabling interrupts. The question would be: this slows down the driver. Is this still worth doing? Right now we are trying to offload as much as we can to the device.

I don't think it slows down the driver significantly, because the interrupt rate is usually moderated so that it is not too high. (So these notify MMIOs should not be issued at a high enough rate to get close to causing a bottleneck.)

It is a win in terms of PCIe overheads, as the device never needs to read the suppression structure. But again the win is limited, because the interrupt rate will usually be limited by moderation.

It can be a win for latency: in the case of a net rx virtqueue the device can issue an interrupt without first reading the suppression structure.

However, despite having made the proposal, I no longer consider this high priority: because it is an optional feature it would not reduce complexity, since a device has to be able to operate without it. I'm not sure the benefits are worth the complexity added.

David

--
David Riddoch -- Chief Architect, Solarflare
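A sketch of the driver side of what Chandra describes (the register name p_int_en_doorbell comes from this thread; its BAR offset and the exact write format below are assumptions for illustration only):

#include <stdint.h>

#define P_INT_EN_DOORBELL_OFF 0x80u /* illustrative BAR offset, not real */

static inline void mmio_write32(volatile void *bar, uint32_t off,
                                uint32_t v)
{
    *(volatile uint32_t *)((volatile char *)bar + off) = v;
}

/* Re-enable interrupts after a polling phase. The single MMIO write
 * both tells the device that interrupts are now enabled and conveys the
 * last used-ring entry the driver processed (15-bit offset + wrap bit),
 * so the device can moderate interrupts without ever reading the event
 * suppression structure. */
static void enable_interrupts(volatile void *bar,
                              uint16_t last_used_off, int wrap)
{
    uint32_t val = (last_used_off & 0x7fffu) |
                   ((uint32_t)(wrap & 1) << 15);
    mmio_write32(bar, P_INT_EN_DOORBELL_OFF, val);
}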
Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
On 11/02/2019 07:33, Michael S. Tsirkin wrote:

On Fri, Feb 01, 2019 at 02:23:55PM +0000, David Riddoch wrote:

All,

I'd like to propose a small extension to the packed virtqueue mode. My proposal is to add an offset/wrap field, written by the driver, indicating how many available descriptors have been added to the ring.

The reason for wanting this is to improve performance of hardware devices. Because of high read latency over a PCIe bus, it is important for hardware devices to read multiple ring entries in parallel. It is desirable to know how many descriptors are available prior to issuing these reads, else you risk fetching descriptors that are not yet available. As well as wasting bus bandwidth, this adds complexity.

I'd previously hoped that VIRTIO_F_NOTIFICATION_DATA would solve this problem,

Right. And this seems like the ideal solution latency-wise, since this pushes data out to the device without need for round-trips over PCIe.

Yes, and I'm not proposing getting rid of it. I'd expect a PCIe device to use both features together.

but we still have a problem. If you rely on doorbells to tell you how many descriptors are available, then you have to keep doorbells enabled at all times.

I would say right now there are two modes, and the device can transition between them at will:

1. read each descriptor twice - once speculatively, once to get the actual data: optimal for driver, suboptimal for device

You might read each descriptor multiple times in some scenarios. Reading descriptors in batches is hugely important given the latency and overheads of PCIe (and the lack of adjacent-data fetching that caching gives you).

2. enable notification for each descriptor and rely on these notifications: optimal for device, suboptimal for driver

This can result in a very high rate of doorbells with some drivers, which can become a severe bottleneck (because x86 CPUs can't emit MMIOs at very high rates).

Interesting. Is there any data you could share to help guide the design? E.g. what's the highest rate of MMIO writes supported etc?

On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.

On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s. On PCIe-remote socket ~8M/s.

This doesn't just impose a limit on aggregate packet rate: if you hit this bottleneck then the CPU core is back-pressured by the MMIO writes, and so instr/cycle takes a huge hit.

The proposed offset/wrap field allows devices to disable doorbells when appropriate, and determine the latest fill level via a PCIe read.

This kind of looks like a split ring though, does it not?

I don't agree, because the ring isn't split. Split-ring is very painful for hardware because of the potentially random accesses to the descriptor table (including pointer chasing), which inhibit batching of descriptor fetches, as well as causing complexity in implementation.

Packed ring is a huge improvement on both dimensions, but introduces a specific pain point for hardware offload.

The issue is we will again need to bounce more cache lines to communicate.

You'll only bounce cache lines if the device chooses to read the idx. A PV device need not offer this feature. A PCIe device will, but the cost to the driver of writing an in-memory idx is much smaller than the cost of an MMIO, which is always a strongly ordered barrier on x86/64.

With vDPA you ideally would have this feature enabled, and the device would sometimes be PV and sometimes PCIe. The PV device would ignore the new idx field and so cache lines would not bounce then either.

I.e. the only time cache lines are shared is when sharing with a PCIe device, which is the scenario when this is a win.

So I wonder: what if we made a change to spec that would allow prefetch of ring entries? E.g. you would be able to read at random and if you guessed right then you can just use what you have read, no need to re-fetch?

Unless I've misunderstood, I think this would imply that the driver would have to ensure strict ordering for each descriptor it wrote, which would impose a cost to the driver. At the moment a write barrier is only needed before writing the flags of the first descriptor in a batch.

I suggest the best place to put this would be in the driver area, immediately after the event suppression structure.

Could you comment on why that is a good place though?

The new field is written by the driver, as are the other fields in the driver area. Also I expect that devices might be able to read this new idx together with the interrupt suppression fields in a single read, reducing PCIe overheads. Placing the new field immediately after the descriptor ring would also work, but would lose the benefit of combining reads, and potentially cause drivers to allocate a substantially bigger buffer (as I expect the descriptor ring is typically a power-of-2 size and drivers allocate multiples of page size). Placing the new field in a