Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-15 Thread Jinhao Fan



> On Jun 15, 2022, at 5:38 PM, Klaus Jensen  wrote:
> 
> I prefer we use the NVMe terminology to minimize misunderstandings, so
> "host" means the driver and "device" means the qemu side of things
> 

Thanks for helping me disambiguate this!

Now that we have resolved all issues in v1, I’ve submitted a v2 patch.




Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-15 Thread John Levon
On Wed, Jun 15, 2022 at 07:22:22PM +0800, Jinhao Fan wrote:

> >>> Isn't this racy against the driver? Compare
> >>> https://github.com/spdk/spdk/blob/master/lib/nvmf/vfio_user.c#L1317
> >> 
> >> QEMU has full memory barriers on dma read/write, so I believe this is
> >> safe?
> > 
> > But don't you need to re-read the tail still, for example:
> 
> I think we also have a check for concurrent update on the tail. After writing 
> eventidx, we read the tail again. It is here:
> 
> @@ -5854,6 +5943,11 @@ static void nvme_process_sq(void *opaque)
>          req->status = status;
>          nvme_enqueue_req_completion(cq, req);
>      }
> +
> +    if (n->dbbuf_enabled) {
> +        nvme_update_sq_eventidx(sq);
> +        nvme_update_sq_tail(sq);
> +    }

Ah, and we go around the loop another time in this case.

> > driver                  device
> > 
> >                         eventidx is 3
> > 
> > write 4 to tail
> >                         read tail of 4
> > write 5 to tail
> > read eventidx of 3
> > nvme_dbbuf_need_event (1)
> > 
> >                         set eventidx to 4
> 
> Therefore, at this point, we read the tail of 5.

The driver could still update the tail after the nvme_update_sq_tail() above.
However, the driver ordering (read tail, then eventidx) does mean that it would
then do an mmio write, so yes, this looks safe, thank you.
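[Editor's note: the eventidx-then-re-read-tail interplay discussed above can be simulated standalone. This is an illustrative sketch, not the QEMU code itself; `shadow_tail` stands in for the DMA read of the shadow doorbell, and the function names in the comments refer to the helpers quoted in the thread.]

```c
#include <assert.h>
#include <stdint.h>

static uint16_t shadow_tail; /* driver-written shadow doorbell */
static uint16_t event_idx;   /* device-published eventidx */
static int processed;        /* number of SQEs consumed */

/* Device-side drain mirroring the structure of nvme_process_sq with the
 * dbbuf hunk above: consume entries, publish eventidx, then re-read the
 * shadow tail so a concurrent driver update is seen before sleeping. */
static void process_sq(uint16_t *head, uint16_t *tail)
{
    do {
        while (*head != *tail) {
            (*head)++;          /* consume one SQE */
            processed++;
        }
        event_idx = *tail;      /* nvme_update_sq_eventidx() */
        *tail = shadow_tail;    /* nvme_update_sq_tail() */
    } while (*head != *tail);   /* driver raced in more work: loop again */
}
```

If the driver advances the shadow tail between the drain and the eventidx write, the re-read catches it and the outer loop runs once more instead of the device sleeping with work pending.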

regards
john



Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-15 Thread Jinhao Fan



> On Jun 15, 2022, at 6:11 PM, John Levon  wrote:
> 
> On Wed, Jun 15, 2022 at 11:33:02AM +0200, Klaus Jensen wrote:
> 
>>> BTW I'm surprised that this patch has just this:
>>> 
>>> +static void nvme_update_sq_eventidx(const NvmeSQueue *sq)
>>> +{
>>> +    pci_dma_write(&sq->ctrl->parent_obj, sq->ei_addr, &sq->tail,
>>> +                  sizeof(sq->tail));
>>> +}
>>> 
>>> Isn't this racy against the driver? Compare
>>> https://github.com/spdk/spdk/blob/master/lib/nvmf/vfio_user.c#L1317
>>> 
>>> thanks
>>> john
>> 
>> QEMU has full memory barriers on dma read/write, so I believe this is
>> safe?
> 
> But don't you need to re-read the tail still, for example:


Hi John,

I think we also have a check for concurrent update on the tail. After writing 
eventidx, we read the tail again. It is here:

@@ -5854,6 +5943,11 @@ static void nvme_process_sq(void *opaque)
         req->status = status;
         nvme_enqueue_req_completion(cq, req);
     }
+
+    if (n->dbbuf_enabled) {
+        nvme_update_sq_eventidx(sq);
+        nvme_update_sq_tail(sq);
+    }

> 
> 
> driver                  device
> 
>                         eventidx is 3
> 
> write 4 to tail
>                         read tail of 4
> write 5 to tail
> read eventidx of 3
> nvme_dbbuf_need_event (1)
> 
>                         set eventidx to 4

Therefore, at this point, we read the tail of 5.

>                         go to sleep
> 
> at (1), our tail update of 4->5 doesn't straddle the eventidx, so we don't send
> any MMIO, and the device won't wake up. This is why the above code checks the
> tail twice for any concurrent update.

Thanks,
Jinhao Fan

> 
> regards
> john




Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-15 Thread John Levon
On Wed, Jun 15, 2022 at 11:33:02AM +0200, Klaus Jensen wrote:

> > BTW I'm surprised that this patch has just this:
> > 
> > +static void nvme_update_sq_eventidx(const NvmeSQueue *sq)
> > +{
> > +    pci_dma_write(&sq->ctrl->parent_obj, sq->ei_addr, &sq->tail,
> > +                  sizeof(sq->tail));
> > +}
> > 
> > Isn't this racy against the driver? Compare
> > https://github.com/spdk/spdk/blob/master/lib/nvmf/vfio_user.c#L1317
> > 
> > thanks
> > john
> 
> QEMU has full memory barriers on dma read/write, so I believe this is
> safe?

But don't you need to re-read the tail still, for example:


driver                  device

                        eventidx is 3

write 4 to tail
                        read tail of 4
write 5 to tail
read eventidx of 3
nvme_dbbuf_need_event (1)

                        set eventidx to 4
                        go to sleep

at (1), our tail update of 4->5 doesn't straddle the eventidx, so we don't send
any MMIO, and the device won't wake up. This is why the above code checks the
tail twice for any concurrent update.
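[Editor's note: the check at (1) is the classic wraparound-safe "straddle" test. The sketch below mirrors the `nvme_dbbuf_need_event` helper of the same name in the Linux nvme driver; the casts to `uint16_t` are the point, as they keep the comparison correct across index wraparound.]

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero when the tail move old -> new_idx straddles event_idx,
 * i.e. the driver must issue an MMIO doorbell write to wake the device.
 * The uint16_t arithmetic makes the test safe across wraparound. */
static int nvme_dbbuf_need_event(uint16_t event_idx, uint16_t new_idx,
                                 uint16_t old)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
}
```

With eventidx still 3, the 4->5 update gives (5 - 3 - 1) = 1, which is not below (5 - 4) = 1, so no MMIO is issued - exactly the lost-wakeup scenario in the diagram above. Once the device has set eventidx to 4, the same 4->5 update does trigger the MMIO.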

regards
john





Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-15 Thread Klaus Jensen
On Jun 15 11:58, Jinhao Fan wrote:
> 
> > On Jun 14, 2022, at 11:41 PM, Keith Busch  wrote:
> > 
> > It's a pretty nasty hack, and definitely not in compliance with the spec: the
> > db_addr is supposed to be read-only from the device side, though I do think
> > it's safe for this environment. Unless Klaus or anyone finds something I'm
> > missing, I feel this is an acceptable compromise to address this odd
> > discrepancy.
> 
> :) In my next patch I will check the performance numbers with this hack. Not
> sure if updating db_addr value from the host will have any performance 
> implications but I guess it should be OK.
> 

I prefer we use the NVMe terminology to minimize misunderstandings, so
"host" means the driver and "device" means the qemu side of things

> > By the way, I noticed that the patch never updates the cq's ei_addr value. Is
> > that on purpose?
> 
> Klaus also raised a similar question in a prior comment. I think we need to
> figure this out before we move on to the v2 patch. I did this because the
> original Google extension patch did not update cq's ei_addr. This seems to
> make sense because the purpose of cq's ei_addr is for the guest to notify the
> host about cq head changes when necessary. However, the host does not need
> this notification because we let the host proactively check for cq's db_addr
> value when it wants to post a new cqe. This is also the only point where the
> host uses the cq's db_addr. Therefore, it is OK to postpone the check for
> cq's db_addr to this point, instead of getting timely but not useful
> notifications by updating cq's ei_addr. This helps to reduce the number of
> MMIOs on the cq's doorbell register.
> 

True, it does reduce it, but it may leave CQEs "lingering" on the device
side (since the device has not been notified that the host has consumed
them).

> Klaus, Keith, do you think this design makes sense?

As I mentioned in my reply to John, I can see why this is a perfectly
reasonable optimization; we don't really care about the lingering CQEs
since we read the head anyway prior to posting completions. I jumped the
gun here in my eagerness to be "spec compliant" ;)
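[Editor's note: the device-side reasoning - a fresh head is fetched from the shadow buffer before every post, so no cq eventidx is needed - can be sketched with a toy ring. Illustrative names and a 4-slot queue; this is not the QEMU code itself.]

```c
#include <assert.h>
#include <stdint.h>

#define CQ_SIZE 4

static uint16_t shadow_cq_head;     /* driver-maintained shadow doorbell */

struct cq { uint16_t head, tail; };

static int cq_full(const struct cq *cq)
{
    return (uint16_t)((cq->tail + 1) % CQ_SIZE) == cq->head;
}

/* Post one CQE, refreshing head from the shadow buffer first, the way
 * nvme_post_cqes calls nvme_update_cq_head. Returns 1 on success,
 * 0 if the queue is (still) full. */
static int post_cqe(struct cq *cq)
{
    cq->head = shadow_cq_head;      /* proactive DMA read, no MMIO needed */
    if (cq_full(cq)) {
        return 0;
    }
    cq->tail = (cq->tail + 1) % CQ_SIZE;
    return 1;
}
```

A stale device-side head can at worst make the ring look full momentarily; the refresh before the full-check clears that without the driver ever ringing the cq doorbell.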




Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-15 Thread Klaus Jensen
On Jun 15 10:07, John Levon wrote:
> On Wed, Jun 15, 2022 at 10:48:26AM +0200, Klaus Jensen wrote:
> 
> > > By the way, I noticed that the patch never updates the cq's ei_addr value. Is
> > > that on purpose?
> > 
> > Yeah, I also mentioned this previously[1] and I still think we need to
> > update the event index. Otherwise (and my testing confirms this), we end
> > up in a situation where the driver skips the mmio, leaving a completion
> > queue entry "in use" on the device until some other completion comes
> > along.
> 
> Hmm, can you expand on this a little bit? We don't touch cq eventidx in
> SPDK either, on the basis that mmio exits are expensive, and we only ever need
> to look at cq_head when we're checking for room when posting a completion - and
> in that case, we can just look directly at the shadow cq_head value.
> 
> Can you clarify the exact circumstance that needs an mmio write when the driver
> updates cq_head?
> 

No, I see, you are correct that not updating the eventidx reduces MMIO
and that we read the cq head anyway prior to posting completions.
I guess it's a perfectly reasonable device-side optimization in this
case. We can safely drop that addition again, I think.

> BTW I'm surprised that this patch has just this:
> 
> +static void nvme_update_sq_eventidx(const NvmeSQueue *sq)
> +{
> +    pci_dma_write(&sq->ctrl->parent_obj, sq->ei_addr, &sq->tail,
> +                  sizeof(sq->tail));
> +}
> 
> Isn't this racy against the driver? Compare
> https://github.com/spdk/spdk/blob/master/lib/nvmf/vfio_user.c#L1317
> 
> thanks
> john

QEMU has full memory barriers on dma read/write, so I believe this is
safe?




Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-15 Thread John Levon
On Wed, Jun 15, 2022 at 10:48:26AM +0200, Klaus Jensen wrote:

> > By the way, I noticed that the patch never updates the cq's ei_addr value. Is
> > that on purpose?
> 
> Yeah, I also mentioned this previously[1] and I still think we need to
> update the event index. Otherwise (and my testing confirms this), we end
> up in a situation where the driver skips the mmio, leaving a completion
> queue entry "in use" on the device until some other completion comes
> along.

Hmm, can you expand on this a little bit? We don't touch cq eventidx in
SPDK either, on the basis that mmio exits are expensive, and we only ever need
to look at cq_head when we're checking for room when posting a completion - and
in that case, we can just look directly at the shadow cq_head value.

Can you clarify the exact circumstance that needs an mmio write when the driver
updates cq_head?

BTW I'm surprised that this patch has just this:

+static void nvme_update_sq_eventidx(const NvmeSQueue *sq)
+{
+    pci_dma_write(&sq->ctrl->parent_obj, sq->ei_addr, &sq->tail,
+                  sizeof(sq->tail));
+}

Isn't this racy against the driver? Compare
https://github.com/spdk/spdk/blob/master/lib/nvmf/vfio_user.c#L1317

thanks
john



Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-15 Thread Klaus Jensen
On Jun 14 08:41, Keith Busch wrote:
> It's a pretty nasty hack, and definitely not in compliance with the spec: the
> db_addr is supposed to be read-only from the device side, though I do think
> it's safe for this environment. Unless Klaus or anyone finds something I'm
> missing, I feel this is an acceptable compromise to address this odd
> discrepancy.
> 

No, I love this hack! :D

I have tested your hack against a dbbuf enabled driver that enables
shadow doorbells on the admin queue by default. I can confirm that this
works as well as on "broken" (or, let's call them "reasonable") drivers.

> By the way, I noticed that the patch never updates the cq's ei_addr value. Is
> that on purpose?

Yeah, I also mentioned this previously[1] and I still think we need to
update the event index. Otherwise (and my testing confirms this), we end
up in a situation where the driver skips the mmio, leaving a completion
queue entry "in use" on the device until some other completion comes
along.

I have folded these changes into a patch for testing[2]. Note, your
patch was missing equivalent changes in nvme_post_cqes(), so I added
that as well as updating of the event index.


  [1]: https://lore.kernel.org/qemu-devel/YqEMwsclktptJvQI@apples/
  [2]: http://git.infradead.org/qemu-nvme.git/commitdiff/60712930e441b684490a792b00ef6698cc85f116


Cheers,
Klaus




Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-14 Thread Jinhao Fan


> On Jun 14, 2022, at 11:41 PM, Keith Busch  wrote:
> 
> It's a pretty nasty hack, and definitely not in compliance with the spec: the
> db_addr is supposed to be read-only from the device side, though I do think
> it's safe for this environment. Unless Klaus or anyone finds something I'm
> missing, I feel this is an acceptable compromise to address this odd
> discrepancy.

:) In my next patch I will check the performance numbers with this hack. Not
sure if updating db_addr value from the host will have any performance 
implications but I guess it should be OK.

> 
> I believe the recommended tag for something like this is "Suggested-by:", but
> no need to credit me. Just fold it into your first patch and send a v2.

Sure. Thanks!

> 
> By the way, I noticed that the patch never updates the cq's ei_addr value. Is
> that on purpose?

Klaus also raised a similar question in a prior comment. I think we need to
figure this out before we move on to the v2 patch. I did this because the
original Google extension patch did not update cq's ei_addr. This seems to make
sense because the purpose of cq's ei_addr is for the guest to notify the host
about cq head changes when necessary. However, the host does not need this
notification because we let the host proactively check for cq's db_addr value
when it wants to post a new cqe. This is also the only point where the host
uses the cq's db_addr. Therefore, it is OK to postpone the check for cq's
db_addr to this point, instead of getting timely but not useful notifications
by updating cq's ei_addr. This helps to reduce the number of MMIOs on the cq's
doorbell register.

Klaus, Keith, do you think this design makes sense?



Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-14 Thread Keith Busch
On Tue, Jun 14, 2022 at 03:24:37PM +0800, Jinhao Fan wrote:
> > On Jun 14, 2022, at 5:15 AM, Keith Busch  wrote:
> > @@ -6538,9 +6544,25 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
> > 
> >          trace_pci_nvme_mmio_doorbell_sq(sq->sqid, new_tail);
> > 
> > -        if (!sq->db_addr) {
> >              sq->tail = new_tail;
> > +        if (sq->db_addr) {
> > +            /*
> > +             * The spec states "the host shall also update the controller's
> > +             * corresponding doorbell property to match the value of that entry
> > +             * in the Shadow Doorbell buffer."
> > +             *
> > +             * Since this context is currently a VM trap, we can safely enforce
> > +             * the requirement from the device side in case the host is
> > +             * misbehaving.
> > +             *
> > +             * Note, we shouldn't have to do this, but various drivers
> > +             * including ones that run on Linux, are not updating Admin Queues,
> > +             * so we can't trust reading it for an appropriate sq tail.
> > +             */
> > +            pci_dma_write(&n->parent_obj, sq->db_addr, &sq->tail,
> > +                          sizeof(sq->tail));
> >          }
> > +
> >          timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> >      }
> >  }
> > --
> 
> Thanks Keith,
> 
> This is an interesting hack. I wonder how I should incorporate your changes
> in my patch. I guess I can modify the code in PATCH 1/2 and add a
> “Proposed-by” tag. Is this the correct way?

It's a pretty nasty hack, and definitely not in compliance with the spec: the
db_addr is supposed to be read-only from the device side, though I do think
it's safe for this environment. Unless Klaus or anyone finds something I'm
missing, I feel this is an acceptable compromise to address this odd
discrepancy.

I believe the recommended tag for something like this is "Suggested-by:", but
no need to credit me. Just fold it into your first patch and send a v2.

By the way, I noticed that the patch never updates the cq's ei_addr value. Is
that on purpose?



Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-14 Thread Jinhao Fan



> On Jun 14, 2022, at 5:15 AM, Keith Busch  wrote:
> 
> 
> @@ -6538,9 +6544,25 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
> 
>          trace_pci_nvme_mmio_doorbell_sq(sq->sqid, new_tail);
> 
> -        if (!sq->db_addr) {
>              sq->tail = new_tail;
> +        if (sq->db_addr) {
> +            /*
> +             * The spec states "the host shall also update the controller's
> +             * corresponding doorbell property to match the value of that entry
> +             * in the Shadow Doorbell buffer."
> +             *
> +             * Since this context is currently a VM trap, we can safely enforce
> +             * the requirement from the device side in case the host is
> +             * misbehaving.
> +             *
> +             * Note, we shouldn't have to do this, but various drivers
> +             * including ones that run on Linux, are not updating Admin Queues,
> +             * so we can't trust reading it for an appropriate sq tail.
> +             */
> +            pci_dma_write(&n->parent_obj, sq->db_addr, &sq->tail,
> +                          sizeof(sq->tail));
>          }
> +
>          timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
>      }
>  }
> --

Thanks Keith,

This is an interesting hack. I wonder how I should incorporate your changes in
my patch. I guess I can modify the code in PATCH 1/2 and add a “Proposed-by”
tag. Is this the correct way?

Regards,
Jinhao Fan




Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-13 Thread Keith Busch
On Sun, Jun 12, 2022 at 07:40:55PM +0800, Jinhao Fan wrote:
> 
> > On Jun 10, 2022, at 1:27 AM, Klaus Jensen  wrote:
> > 
> > I'm ok with following the consensus here, but we all agree that this is
> > a blatant spec violation that ended up manifesting itself down the
> > stack, right?
> > 
> > So... if QEMU wants to be compliant here, I guess we could ask the
> > kernel to introduce a quirk for *compliant* controllers. Now, THAT would
> > be a first! Not sure if I am being serious or not here ;)
> 
> Hi all,
> 
> Is this our final decision?

What a mess...

The spec should have gone into more details on initializing the shadow and
event buffers if they really intended it to be run on a live queue.

Anyway, the following hack on top of your patch should allow the host to use
admin shadow queues, and also remain backward compatible for the "broken"
hosts, like Linux and SPDK.

---
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index a0a9208c0f..03d84feecf 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -4267,7 +4267,7 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
     }
     sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq);
 
-    if (sqid && n->dbbuf_dbs && n->dbbuf_eis) {
+    if (n->dbbuf_dbs && n->dbbuf_eis) {
         sq->db_addr = n->dbbuf_dbs + 2 * sqid * stride;
         sq->ei_addr = n->dbbuf_eis + 2 * sqid * stride;
     }
@@ -4632,7 +4632,7 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
     cq->head = cq->tail = 0;
     QTAILQ_INIT(&cq->req_list);
     QTAILQ_INIT(&cq->sq_list);
-    if (cqid && n->dbbuf_dbs && n->dbbuf_eis) {
+    if (n->dbbuf_dbs && n->dbbuf_eis) {
         cq->db_addr = n->dbbuf_dbs + (2 * cqid + 1) * stride;
         cq->ei_addr = n->dbbuf_eis + (2 * cqid + 1) * stride;
     }
@@ -5805,7 +5805,7 @@ static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
     n->dbbuf_dbs = dbs_addr;
     n->dbbuf_eis = eis_addr;
 
-    for (i = 1; i < n->params.max_ioqpairs + 1; i++) {
+    for (i = 0; i < n->params.max_ioqpairs + 1; i++) {
         NvmeSQueue *sq = n->sq[i];
         NvmeCQueue *cq = n->cq[i];
 
@@ -5813,12 +5813,16 @@ static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
             /* Submission queue tail pointer location, 2 * QID * stride */
             sq->db_addr = dbs_addr + 2 * i * stride;
             sq->ei_addr = eis_addr + 2 * i * stride;
+            pci_dma_write(&n->parent_obj, sq->db_addr, &sq->tail,
+                          sizeof(sq->tail));
         }
 
         if (cq) {
             /* Completion queue head pointer location, (2 * QID + 1) * stride */
             cq->db_addr = dbs_addr + (2 * i + 1) * stride;
             cq->ei_addr = eis_addr + (2 * i + 1) * stride;
+            pci_dma_write(&n->parent_obj, cq->db_addr, &cq->head,
+                          sizeof(cq->head));
         }
     }
 
@@ -6479,8 +6483,10 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
         trace_pci_nvme_mmio_doorbell_cq(cq->cqid, new_head);
 
         start_sqs = nvme_cq_full(cq) ? 1 : 0;
-        if (!cq->db_addr) {
             cq->head = new_head;
+        if (cq->db_addr) {
+            pci_dma_write(&n->parent_obj, cq->db_addr, &cq->head,
+                          sizeof(cq->head));
         }
         if (start_sqs) {
             NvmeSQueue *sq;
@@ -6538,9 +6544,25 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
 
         trace_pci_nvme_mmio_doorbell_sq(sq->sqid, new_tail);
 
-        if (!sq->db_addr) {
             sq->tail = new_tail;
+        if (sq->db_addr) {
+            /*
+             * The spec states "the host shall also update the controller's
+             * corresponding doorbell property to match the value of that entry
+             * in the Shadow Doorbell buffer."
+             *
+             * Since this context is currently a VM trap, we can safely enforce
+             * the requirement from the device side in case the host is
+             * misbehaving.
+             *
+             * Note, we shouldn't have to do this, but various drivers
+             * including ones that run on Linux, are not updating Admin Queues,
+             * so we can't trust reading it for an appropriate sq tail.
+             */
+            pci_dma_write(&n->parent_obj, sq->db_addr, &sq->tail,
+                          sizeof(sq->tail));
         }
+
         timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
     }
 }
--
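[Editor's note: the doorbell-offset arithmetic that recurs throughout these hunks (stride = 4 << CAP.DSTRD bytes, SQ tail at 2 * QID * stride, CQ head at (2 * QID + 1) * stride) can be checked in isolation. Hypothetical helper names, for illustration only.]

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of a submission queue's shadow tail within the shadow
 * doorbell buffer; dstrd is the CAP.DSTRD field, so adjacent doorbells
 * are 4 << dstrd bytes apart. */
static uint64_t sq_db_offset(uint16_t qid, uint8_t dstrd)
{
    uint32_t stride = 4 << dstrd;
    return 2 * (uint64_t)qid * stride;
}

/* Byte offset of a completion queue's shadow head: the odd slot of the
 * same queue pair. */
static uint64_t cq_db_offset(uint16_t qid, uint8_t dstrd)
{
    uint32_t stride = 4 << dstrd;
    return (2 * (uint64_t)qid + 1) * stride;
}
```

With qid 0 included in the loop (the `for (i = 0; ...)` change above), the admin queue pair occupies the first two slots of the buffer.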



Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-12 Thread Jinhao Fan


> On Jun 10, 2022, at 1:27 AM, Klaus Jensen  wrote:
> 
> I'm ok with following the consensus here, but we all agree that this is
> a blatant spec violation that ended up manifesting itself down the
> stack, right?
> 
> So... if QEMU wants to be compliant here, I guess we could ask the
> kernel to introduce a quirk for *compliant* controllers. Now, THAT would
> be a first! Not sure if I am being serious or not here ;)

Hi all,

Is this our final decision?

Thanks,
Jinhao Fan




Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-09 Thread John Levon
On Thu, Jun 09, 2022 at 07:27:57PM +0200, Klaus Jensen wrote:

> > It's not just unnecessary, but enabling shadow doorbells on admin queues will
> > actively break device implementations (such as SPDK), which have had to presume
> > that driver implementations don't use shadow doorbells for the admin queue.
> 
> I'm ok with following the consensus here, but we all agree that this is
> a blatant spec violation that ended up manifesting itself down the
> stack, right?

Yup. I've been meaning to try following up to get the spec updated to reflect
reality, but haven't made progress there yet.

> So... if QEMU wants to be compliant here, I guess we could ask the
> kernel to introduce a quirk for *compliant* controllers. Now, THAT would
> be a first! Not sure if I am being serious or not here ;)

:)

regards
john



Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-09 Thread Klaus Jensen
On Jun  9 16:52, John Levon wrote:
> On Thu, Jun 09, 2022 at 08:29:30AM -0600, Keith Busch wrote:
> 
> > On Wed, Jun 08, 2022 at 10:55:30PM +0200, Klaus Jensen wrote:
> > > 
> > > Keith, is this a bug in the kernel? If the code here would expect the
> > > doorbell buffer to be updated for the admin queue as well, would we
> > > break?
> > 
> > The admin queue has to be created before db buffer can be set up, so we can't
> > rely on it. And this is a performance feature. Admin commands have never been
> > considered performant, so we decided not to consider it though the spec allows
> > it.

It's not really a question of whether or not the spec *allows* it. If
our device chooses to be compliant here, then it will read invalid
doorbell values if drivers don't update the admin doorbell buffer.

> 
> It's not just unnecessary, but enabling shadow doorbells on admin queues will
> actively break device implementations (such as SPDK), which have had to presume
> that driver implementations don't use shadow doorbells for the admin queue.
> 

I'm ok with following the consensus here, but we all agree that this is
a blatant spec violation that ended up manifesting itself down the
stack, right?

So... if QEMU wants to be compliant here, I guess we could ask the
kernel to introduce a quirk for *compliant* controllers. Now, THAT would
be a first! Not sure if I am being serious or not here ;)




Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-09 Thread John Levon
On Thu, Jun 09, 2022 at 08:29:30AM -0600, Keith Busch wrote:

> On Wed, Jun 08, 2022 at 10:55:30PM +0200, Klaus Jensen wrote:
> > 
> > Keith, is this a bug in the kernel? If the code here would expect the
> > doorbell buffer to be updated for the admin queue as well, would we
> > break?
> 
> The admin queue has to be created before db buffer can be set up, so we can't
> rely on it. And this is a performance feature. Admin commands have never been
> considered performant, so we decided not to consider it though the spec allows
> it.

It's not just unnecessary, but enabling shadow doorbells on admin queues will
actively break device implementations (such as SPDK), which have had to presume
that driver implementations don't use shadow doorbells for the admin queue.

regards
john



Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-09 Thread Keith Busch
On Wed, Jun 08, 2022 at 10:55:30PM +0200, Klaus Jensen wrote:
> 
> Keith, is this a bug in the kernel? If the code here would expect the
> doorbell buffer to be updated for the admin queue as well, would we
> break?

The admin queue has to be created before db buffer can be set up, so we can't
rely on it. And this is a performance feature. Admin commands have never been
considered performant, so we decided not to consider it though the spec allows
it.



Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-08 Thread Jinhao Fan


> On Jun 9, 2022, at 4:55 AM, Klaus Jensen  wrote:
> 
> On Jun  8 09:36, Jinhao Fan wrote:
>> Implement Doorbell Buffer Config command (Section 5.7 in NVMe Spec 1.3)
>> and Shadow Doorbell buffer & EventIdx buffer handling logic (Section 7.13
>> in NVMe Spec 1.3). For queues created before the Doorbell Buffer Config
>> command, the nvme_dbbuf_config function tries to associate each existing
>> SQ and CQ with its Shadow Doorbell buffer and EventIdx buffer address.
>> Queues created after the Doorbell Buffer Config command will have the
>> doorbell buffers associated with them when they are initialized.
>> 
>> In nvme_process_sq and nvme_post_cqe, proactively check for Shadow
>> Doorbell buffer changes instead of waiting for doorbell register changes.
>> This reduces the number of MMIOs.
>> 
>> Signed-off-by: Jinhao Fan 
>> ---
>> hw/nvme/ctrl.c   | 95 ++--
>> hw/nvme/nvme.h   |  8 
>> include/block/nvme.h |  2 +
>> 3 files changed, 102 insertions(+), 3 deletions(-)
>> 
>> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
>> index 03760ddeae..d3f6c432df 100644
>> --- a/hw/nvme/ctrl.c
>> +++ b/hw/nvme/ctrl.c
>> @@ -223,6 +223,7 @@ static const uint32_t nvme_cse_acs[256] = {
>> [NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFF_CSUPP,
>> [NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFF_CSUPP,
>> [NVME_ADM_CMD_NS_ATTACHMENT]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC,
>> +[NVME_ADM_CMD_DBBUF_CONFIG] = NVME_CMD_EFF_CSUPP,
>> [NVME_ADM_CMD_FORMAT_NVM]   = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
>> };
>> 
>> @@ -1304,6 +1305,12 @@ static inline void nvme_blk_write(BlockBackend *blk, int64_t offset,
>>      }
>>  }
>> 
>> +static void nvme_update_cq_head(NvmeCQueue *cq)
>> +{
>> +    pci_dma_read(&cq->ctrl->parent_obj, cq->db_addr, &cq->head,
>> +                 sizeof(cq->head));
>> +}
>> +
>> static void nvme_post_cqes(void *opaque)
>> {
>> NvmeCQueue *cq = opaque;
>> @@ -1316,6 +1323,10 @@ static void nvme_post_cqes(void *opaque)
>>          NvmeSQueue *sq;
>>          hwaddr addr;
>> 
>> +        if (cq->cqid && n->dbbuf_enabled) {
>> +            nvme_update_cq_head(cq);
> 
> Shouldn't we update the cq head eventidx here (prior to reading the
> doorbell buffer)? Like we do for the sq tail?

I’m not sure whether updating cq head eventidx is necessary. My understanding 
is that the purpose of updating eventidx is for the guest to notify the host 
about cq head or sq tail changes with MMIO. For sq tail this is necessary 
because other than MMIO, there is no way to trigger the host to check for sq 
tail updates (shadow doorbell here). For cq head this is different. The host 
will read cq head shadow doorbell every time it wants to post a cqe. Therefore, 
letting the guest notify the host on cq head changes seems unnecessary. 

Please correct me if I am missing something.

> 
>> +        }
>> +
>>          if (nvme_cq_full(cq)) {
>>              break;
>>          }
>> @@ -4237,6 +4248,7 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest *req)
>> static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
>>  uint16_t sqid, uint16_t cqid, uint16_t size)
>> {
>> +uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
>> int i;
>> NvmeCQueue *cq;
>> 
>> @@ -4256,6 +4268,11 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
>> }
>> sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq);
>> 
>> +    if (sqid && n->dbbuf_dbs && n->dbbuf_eis) {
>> +        sq->db_addr = n->dbbuf_dbs + 2 * sqid * stride;
>> +        sq->ei_addr = n->dbbuf_eis + 2 * sqid * stride;
>> +    }
>> +
>>      assert(n->cq[cqid]);
>>      cq = n->cq[cqid];
>>      QTAILQ_INSERT_TAIL(&(cq->sq_list), sq, entry);
>> @@ -4599,6 +4616,7 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
>>  uint16_t cqid, uint16_t vector, uint16_t size,
>>  uint16_t irq_enabled)
>> {
>> +uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
>> int ret;
>> 
>>      if (msix_enabled(&n->parent_obj)) {
>> @@ -4615,6 +4633,10 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
>>      cq->head = cq->tail = 0;
>>      QTAILQ_INIT(&cq->req_list);
>>      QTAILQ_INIT(&cq->sq_list);
>> +    if (cqid && n->dbbuf_dbs && n->dbbuf_eis) {
>> +        cq->db_addr = n->dbbuf_dbs + (2 * cqid + 1) * stride;
>> +        cq->ei_addr = n->dbbuf_eis + (2 * cqid + 1) * stride;
>> +    }
>> n->cq[cqid] = cq;
>> cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
>> }
>> @@ -5767,6 +5789,43 @@ out:
>> return status;
>> }
>> 
>> +static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
>> +{
>> +    uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
>> +    uint64_t dbs_addr = le64_to_cpu(req->cmd.dptr.prp1);
>> +    uint64_t eis_addr = le64_to_cpu(req->cmd.dptr.prp2);
>> +    int i;
>> +
>> +/* Address should 

Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-08 Thread Klaus Jensen
On Jun  8 09:36, Jinhao Fan wrote:
> Implement Doorbell Buffer Config command (Section 5.7 in NVMe Spec 1.3)
> and Shadow Doorbell buffer & EventIdx buffer handling logic (Section 7.13
> in NVMe Spec 1.3). For queues created before the Doorbell Buffer Config
> command, the nvme_dbbuf_config function tries to associate each existing
> SQ and CQ with its Shadow Doorbell buffer and EventIdx buffer address.
> Queues created after the Doorbell Buffer Config command will have the
> doorbell buffers associated with them when they are initialized.
> 
> In nvme_process_sq and nvme_post_cqe, proactively check for Shadow
> Doorbell buffer changes instead of waiting for doorbell register changes.
> This reduces the number of MMIOs.
> 
> Signed-off-by: Jinhao Fan 
> ---
>  hw/nvme/ctrl.c   | 95 ++--
>  hw/nvme/nvme.h   |  8 
>  include/block/nvme.h |  2 +
>  3 files changed, 102 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 03760ddeae..d3f6c432df 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -223,6 +223,7 @@ static const uint32_t nvme_cse_acs[256] = {
>  [NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFF_CSUPP,
>  [NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFF_CSUPP,
>  [NVME_ADM_CMD_NS_ATTACHMENT]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC,
> +[NVME_ADM_CMD_DBBUF_CONFIG] = NVME_CMD_EFF_CSUPP,
>  [NVME_ADM_CMD_FORMAT_NVM]   = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
>  };
>  
> @@ -1304,6 +1305,12 @@ static inline void nvme_blk_write(BlockBackend *blk, int64_t offset,
>      }
>  }
>  
> +static void nvme_update_cq_head(NvmeCQueue *cq)
> +{
> +    pci_dma_read(&cq->ctrl->parent_obj, cq->db_addr, &cq->head,
> +                 sizeof(cq->head));
> +}
> +
>  static void nvme_post_cqes(void *opaque)
>  {
>  NvmeCQueue *cq = opaque;
> @@ -1316,6 +1323,10 @@ static void nvme_post_cqes(void *opaque)
>          NvmeSQueue *sq;
>          hwaddr addr;
>  
> +        if (cq->cqid && n->dbbuf_enabled) {
> +            nvme_update_cq_head(cq);

Shouldn't we update the cq head eventidx here (prior to reading the
doorbell buffer)? Like we do for the sq tail?

> +        }
> +
>          if (nvme_cq_full(cq)) {
>              break;
>          }
> @@ -4237,6 +4248,7 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest *req)
>  static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
>   uint16_t sqid, uint16_t cqid, uint16_t size)
>  {
> +uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
>  int i;
>  NvmeCQueue *cq;
>  
> @@ -4256,6 +4268,11 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
>  }
>  sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq);
>  
> +if (sqid && n->dbbuf_dbs && n->dbbuf_eis) {
> +sq->db_addr = n->dbbuf_dbs + 2 * sqid * stride;
> +sq->ei_addr = n->dbbuf_eis + 2 * sqid * stride;
> +}
> +
>  assert(n->cq[cqid]);
>  cq = n->cq[cqid];
>  QTAILQ_INSERT_TAIL(&(cq->sq_list), sq, entry);
> @@ -4599,6 +4616,7 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
>   uint16_t cqid, uint16_t vector, uint16_t size,
>   uint16_t irq_enabled)
>  {
> +uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
>  int ret;
>  
>  if (msix_enabled(&n->parent_obj)) {
> @@ -4615,6 +4633,10 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
>  cq->head = cq->tail = 0;
>  QTAILQ_INIT(&cq->req_list);
>  QTAILQ_INIT(&cq->sq_list);
> +if (cqid && n->dbbuf_dbs && n->dbbuf_eis) {
> +cq->db_addr = n->dbbuf_dbs + (2 * cqid + 1) * stride;
> +cq->ei_addr = n->dbbuf_eis + (2 * cqid + 1) * stride;
> +}
>  n->cq[cqid] = cq;
>  cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
>  }
> @@ -5767,6 +5789,43 @@ out:
>  return status;
>  }
>  
> +static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
> +{
> +uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
> +uint64_t dbs_addr = le64_to_cpu(req->cmd.dptr.prp1);
> +uint64_t eis_addr = le64_to_cpu(req->cmd.dptr.prp2);
> +int i;
> +
> +/* Address should be page aligned */
> +if (dbs_addr & (n->page_size - 1) || eis_addr & (n->page_size - 1)) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +/* Save shadow buffer base addr for use during queue creation */
> +n->dbbuf_dbs = dbs_addr;
> +n->dbbuf_eis = eis_addr;
> +n->dbbuf_enabled = true;
> +
> +for (i = 1; i < n->params.max_ioqpairs + 1; i++) {
> +NvmeSQueue *sq = n->sq[i];
> +NvmeCQueue *cq = n->cq[i];
> +
> +if (sq) {
> +/* Submission queue tail pointer location, 2 * QID * stride */
> +sq->db_addr = dbs_addr + 2 * i * stride;
> +sq->ei_addr = eis_addr + 2 * i * stride;
> +}
> +
> +if (cq) {
> +/* Completion queue head pointer location, (2 * QID + 1) * stride */
> +cq->db_addr = dbs_addr + (2 * i + 1) * stride;
> +cq->ei_addr = eis_addr + (2 * i + 1) * stride;
> +}
> +}
> +
> +return NVME_SUCCESS;
> +}

[PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-07 Thread Jinhao Fan
Implement Doorbell Buffer Config command (Section 5.7 in NVMe Spec 1.3)
and Shadow Doorbell buffer & EventIdx buffer handling logic (Section 7.13
in NVMe Spec 1.3). For queues created before the Doorbell Buffer Config
command, the nvme_dbbuf_config function tries to associate each existing
SQ and CQ with its Shadow Doorbell buffer and EventIdx buffer address.
Queues created after the Doorbell Buffer Config command will have the
doorbell buffers associated with them when they are initialized.

In nvme_process_sq and nvme_post_cqes, proactively check for Shadow
Doorbell buffer changes instead of waiting for doorbell register changes.
This reduces the number of MMIOs.

Signed-off-by: Jinhao Fan 
---
 hw/nvme/ctrl.c   | 95 ++--
 hw/nvme/nvme.h   |  8 
 include/block/nvme.h |  2 +
 3 files changed, 102 insertions(+), 3 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 03760ddeae..d3f6c432df 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -223,6 +223,7 @@ static const uint32_t nvme_cse_acs[256] = {
 [NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_NS_ATTACHMENT]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC,
+[NVME_ADM_CMD_DBBUF_CONFIG] = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_FORMAT_NVM]   = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
 };
 
@@ -1304,6 +1305,12 @@ static inline void nvme_blk_write(BlockBackend *blk, int64_t offset,
 }
 }
 
+static void nvme_update_cq_head(NvmeCQueue *cq)
+{
+pci_dma_read(&cq->ctrl->parent_obj, cq->db_addr, &cq->head,
+             sizeof(cq->head));
+}
+
 static void nvme_post_cqes(void *opaque)
 {
 NvmeCQueue *cq = opaque;
@@ -1316,6 +1323,10 @@ static void nvme_post_cqes(void *opaque)
 NvmeSQueue *sq;
 hwaddr addr;
 
+if (cq->cqid && n->dbbuf_enabled) {
+nvme_update_cq_head(cq);
+}
+
 if (nvme_cq_full(cq)) {
 break;
 }
@@ -4237,6 +4248,7 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest *req)
 static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
  uint16_t sqid, uint16_t cqid, uint16_t size)
 {
+uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
 int i;
 NvmeCQueue *cq;
 
@@ -4256,6 +4268,11 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
 }
 sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq);
 
+if (sqid && n->dbbuf_dbs && n->dbbuf_eis) {
+sq->db_addr = n->dbbuf_dbs + 2 * sqid * stride;
+sq->ei_addr = n->dbbuf_eis + 2 * sqid * stride;
+}
+
 assert(n->cq[cqid]);
 cq = n->cq[cqid];
 QTAILQ_INSERT_TAIL(&(cq->sq_list), sq, entry);
@@ -4599,6 +4616,7 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
  uint16_t cqid, uint16_t vector, uint16_t size,
  uint16_t irq_enabled)
 {
+uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
 int ret;
 
 if (msix_enabled(&n->parent_obj)) {
@@ -4615,6 +4633,10 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
 cq->head = cq->tail = 0;
 QTAILQ_INIT(&cq->req_list);
 QTAILQ_INIT(&cq->sq_list);
+if (cqid && n->dbbuf_dbs && n->dbbuf_eis) {
+cq->db_addr = n->dbbuf_dbs + (2 * cqid + 1) * stride;
+cq->ei_addr = n->dbbuf_eis + (2 * cqid + 1) * stride;
+}
 n->cq[cqid] = cq;
 cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
 }
@@ -5767,6 +5789,43 @@ out:
 return status;
 }
 
+static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
+{
+uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
+uint64_t dbs_addr = le64_to_cpu(req->cmd.dptr.prp1);
+uint64_t eis_addr = le64_to_cpu(req->cmd.dptr.prp2);
+int i;
+
+/* Address should be page aligned */
+if (dbs_addr & (n->page_size - 1) || eis_addr & (n->page_size - 1)) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+/* Save shadow buffer base addr for use during queue creation */
+n->dbbuf_dbs = dbs_addr;
+n->dbbuf_eis = eis_addr;
+n->dbbuf_enabled = true;
+
+for (i = 1; i < n->params.max_ioqpairs + 1; i++) {
+NvmeSQueue *sq = n->sq[i];
+NvmeCQueue *cq = n->cq[i];
+
+if (sq) {
+/* Submission queue tail pointer location, 2 * QID * stride */
+sq->db_addr = dbs_addr + 2 * i * stride;
+sq->ei_addr = eis_addr + 2 * i * stride;
+}
+
+if (cq) {
+/* Completion queue head pointer location, (2 * QID + 1) * stride */
+cq->db_addr = dbs_addr + (2 * i + 1) * stride;
+cq->ei_addr = eis_addr + (2 * i + 1) * stride;
+}
+}
+
+return NVME_SUCCESS;
+}
+
 static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
 {
 trace_pci_nvme_admin_cmd(nvme_cid(req), nvme_sqid(req), req->cmd.opcode,
@@ -5809,6