Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring

2018-01-22 Thread Keith Busch
On Wed, Jan 03, 2018 at 01:21:05PM -0700, Keith Busch wrote:
> 
> I've removed the submission side poll in a local build, and amazingly I
> am observing a not insignificant increase in latency without it on some
> workloads with certain hardware. I will have to delay recommending/posting
> removal of this pending further investigation.

This took longer to narrow down than I hoped. I had been getting very
wild results, and there were two things to blame: a kernel irq issue[*]
and a platform BIOS causing a lot of CPU thermal frequency throttling.

Now that that's sorted out, I am back to tuning this, and the
opportunistic nvme submission side polling once again does not produce
a measurable difference. My world is sane again. Will try to send an
update for consideration this week after some more testing.

 * 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=a0c9259dc4
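
For reference, a minimal sketch of what removing the opportunistic poll
could look like, against the nvme_queue_rq() from the patch at the end
of this thread (an illustration only, not the actual update being
tested):

         __nvme_submit_cmd(nvmeq, &cmnd);
         blk_mq_start_request(req);
-        nvme_process_cq(nvmeq);
         spin_unlock_irq(&nvmeq->q_lock);
         return BLK_STS_OK;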


Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring

2018-01-03 Thread Keith Busch
On Thu, Dec 21, 2017 at 02:01:44PM -0700, Jens Axboe wrote:
> I haven't either, but curious if others had. It's mostly just extra
> overhead, I haven't seen it provide a latency reduction of any kind.

I've removed the submission side poll in a local build, and amazingly I
am observing a not insignificant increase in latency without it on some
workloads with certain hardware. I will have to delay recommending/posting
removal of this pending further investigation.


Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring

2017-12-29 Thread Christoph Hellwig
On Mon, Dec 25, 2017 at 12:12:52PM +0200, Sagi Grimberg wrote:
>
>>> But that does bring up the question of whether we should always be
>>> doing the nvme_process_cq(nvmeq) after IO submission. For direct/hipri
>>> IO, maybe it's better to make the submission path faster and skip it?
>>
>> Yes, I am okay with removing the opportunistic nvme_process_cq in the
>> submission path. Even under deeply queued IO, I've not seen this provide
>> any measurable benefit.
>
> +100 for removing it. Never really understood why it gets us anything...

Yes, we should get rid of it.


Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring

2017-12-29 Thread Christoph Hellwig
Looks fine, although I agree a comment on the q_lock exclusion would
be useful.
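
Something along these lines, perhaps (a hypothetical wording, not a
comment anyone actually posted):

        __nvme_submit_cmd(nvmeq, &cmnd);
        /*
         * Safe only because we hold q_lock: the interrupt handler takes
         * the same lock around nvme_process_cq(), so this command's
         * completion cannot be reaped before the request is started.
         */
        blk_mq_start_request(req);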


Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring

2017-12-27 Thread Sagi Grimberg

>> Curious to know what this buys us?
>
> blk_mq_start_request doesn't do anything to make the command ready to
> submit to hardware, so this is pure software overhead, somewhere between
> 200-300 nanoseconds as far as I can measure (YMMV). We can post the
> command to hardware before taking on this software overhead so the
> hardware gets to fetch the command that much sooner.
>
> One thing to remember is we need to ensure the driver can't complete the
> request before starting it. The driver's existing locking does ensure
> that with this patch, so it's just something to keep in mind if anything
> should ever change in the future.

Needs to be well documented.

This means we won't be able to split the submission and completion locks
now... So it does present a tradeoff.
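
To illustrate the tradeoff (a hypothetical sketch; sq_lock/cq_lock are
made-up names, not actual driver fields):

        /* Submission path, holding only a submission-side lock: */
        spin_lock_irq(&nvmeq->sq_lock);
        __nvme_submit_cmd(nvmeq, &cmnd);        /* hardware can see it now */
        blk_mq_start_request(req);
        spin_unlock_irq(&nvmeq->sq_lock);

        /* Completion path under a separate lock could run concurrently
         * and complete the command between the two calls above: */
        spin_lock(&nvmeq->cq_lock);
        nvme_process_cq(nvmeq);
        spin_unlock(&nvmeq->cq_lock);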


Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring

2017-12-26 Thread Keith Busch
On Mon, Dec 25, 2017 at 12:11:47PM +0200, Sagi Grimberg wrote:
> > This is a performance optimization that allows the hardware to work on
> > a command in parallel with the kernel's stats and timeout tracking.
> 
> Curious to know what this buys us?

blk_mq_start_request doesn't do anything to make the command ready to
submit to hardware, so this is pure software overhead, somewhere between
200-300 nanoseconds as far as I can measure (YMMV). We can post the command
to hardware before taking on this software overhead so the hardware gets
to fetch the command that much sooner.
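
For what it's worth, a minimal sketch of how such an overhead could be
measured in a local instrumented build (an assumption about methodology,
not necessarily what was actually used):

        u64 t0, t1;

        t0 = ktime_get_ns();
        blk_mq_start_request(req);
        t1 = ktime_get_ns();
        pr_debug("blk_mq_start_request: %llu ns\n",
                 (unsigned long long)(t1 - t0));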

One thing to remember is we need to ensure the driver can't complete the
request before starting it. The driver's existing locking does ensure
that with this patch, so it's just something to keep in mind if anything
should ever change in the future.
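
A simplified sketch of why the existing locking covers this (condensed
from the driver, not verbatim):

        /* nvme_queue_rq(), submission side: */
        spin_lock_irq(&nvmeq->q_lock);
        __nvme_submit_cmd(nvmeq, &cmnd);        /* doorbell rings here */
        blk_mq_start_request(req);              /* still under q_lock */
        nvme_process_cq(nvmeq);
        spin_unlock_irq(&nvmeq->q_lock);

        /* nvme_irq(), completion side, serialized on the same lock, so
         * it cannot complete this command between the submit and the
         * start above: */
        spin_lock(&nvmeq->q_lock);
        nvme_process_cq(nvmeq);
        spin_unlock(&nvmeq->q_lock);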


Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring

2017-12-25 Thread Sagi Grimberg

>> But that does bring up the question of whether we should always be
>> doing the nvme_process_cq(nvmeq) after IO submission. For direct/hipri
>> IO, maybe it's better to make the submission path faster and skip it?
>
> Yes, I am okay with removing the opportunistic nvme_process_cq in the
> submission path. Even under deeply queued IO, I've not seen this provide
> any measurable benefit.

+100 for removing it. Never really understood why it gets us anything...


Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring

2017-12-25 Thread Sagi Grimberg

> This is a performance optimization that allows the hardware to work on
> a command in parallel with the kernel's stats and timeout tracking.

Curious to know what this buys us?


Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring

2017-12-21 Thread Jens Axboe
On 12/21/17 2:02 PM, Keith Busch wrote:
> On Thu, Dec 21, 2017 at 01:53:44PM -0700, Jens Axboe wrote:
>> Turns out that wasn't what patch 2 was. And the code is right there
>> above as well, and under the q_lock, so I guess that race doesn't
>> exist.
>>
>> But that does bring up the question of whether we should always be
>> doing the nvme_process_cq(nvmeq) after IO submission. For direct/hipri
>> IO, maybe it's better to make the submission path faster and skip it?
> 
> Yes, I am okay with removing the opportunistic nvme_process_cq in the
> submission path. Even under deeply queued IO, I've not seen this provide
> any measurable benefit.

I haven't either, but curious if others had. It's mostly just extra
overhead; I haven't seen it provide a latency reduction of any kind.

-- 
Jens Axboe



Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring

2017-12-21 Thread Keith Busch
On Thu, Dec 21, 2017 at 01:53:44PM -0700, Jens Axboe wrote:
> Turns out that wasn't what patch 2 was. And the code is right there
> above as well, and under the q_lock, so I guess that race doesn't
> exist.
> 
> But that does bring up the question of whether we should always be
> doing the nvme_process_cq(nvmeq) after IO submission. For direct/hipri
> IO, maybe it's better to make the submission path faster and skip it?

Yes, I am okay with removing the opportunistic nvme_process_cq in the
submission path. Even under deeply queued IO, I've not seen this provide
any measurable benefit.


Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring

2017-12-21 Thread Jens Axboe
On 12/21/17 1:49 PM, Jens Axboe wrote:
> On 12/21/17 1:46 PM, Keith Busch wrote:
>> This is a performance optimization that allows the hardware to work on
>> a command in parallel with the kernel's stats and timeout tracking.
>>
>> Signed-off-by: Keith Busch 
>> ---
>>  drivers/nvme/host/pci.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index f5800c3c9082..df5550ce0531 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -886,8 +886,6 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>>                          goto out_cleanup_iod;
>>          }
>>  
>> -        blk_mq_start_request(req);
>> -
>>          spin_lock_irq(&nvmeq->q_lock);
>>          if (unlikely(nvmeq->cq_vector < 0)) {
>>                  ret = BLK_STS_IOERR;
>> @@ -895,6 +893,7 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>>                  goto out_cleanup_iod;
>>          }
>>          __nvme_submit_cmd(nvmeq, &cmnd);
>> +        blk_mq_start_request(req);
>>          nvme_process_cq(nvmeq);
>>          spin_unlock_irq(&nvmeq->q_lock);
>>          return BLK_STS_OK;
> 
> I guess this will work since we hold the q_lock, but you should probably
> include a comment to that effect since it's important that we can't find
> the completion before we've marked the request started.
> 
> Actually, you'd need to reorder this after #2 (I'm assuming; it hasn't
> shown up yet, but I'm guessing it kills the cq check after submission),
> since if we have multiple CPUs on the same hardware queue, one of
> the other CPUs could find a completion event between your
> __nvme_submit_cmd() and blk_mq_start_request() calls. Very short window,
> but it does exist.

Turns out that wasn't what patch 2 was. And the code is right there
above as well, and under the q_lock, so I guess that race doesn't
exist.

But that does bring up the question of whether we should always be doing
the nvme_process_cq(nvmeq) after IO submission. For direct/hipri IO,
maybe it's better to make the submission path faster and skip it?

-- 
Jens Axboe



Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring

2017-12-21 Thread Jens Axboe
On 12/21/17 1:46 PM, Keith Busch wrote:
> This is a performance optimization that allows the hardware to work on
> a command in parallel with the kernel's stats and timeout tracking.
> 
> Signed-off-by: Keith Busch 
> ---
>  drivers/nvme/host/pci.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index f5800c3c9082..df5550ce0531 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -886,8 +886,6 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>                          goto out_cleanup_iod;
>          }
>  
> -        blk_mq_start_request(req);
> -
>          spin_lock_irq(&nvmeq->q_lock);
>          if (unlikely(nvmeq->cq_vector < 0)) {
>                  ret = BLK_STS_IOERR;
> @@ -895,6 +893,7 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>                  goto out_cleanup_iod;
>          }
>          __nvme_submit_cmd(nvmeq, &cmnd);
> +        blk_mq_start_request(req);
>          nvme_process_cq(nvmeq);
>          spin_unlock_irq(&nvmeq->q_lock);
>          return BLK_STS_OK;

I guess this will work since we hold the q_lock, but you should probably
include a comment to that effect since it's important that we can't find
the completion before we've marked the request started.

Actually, you'd need to reorder this after #2 (I'm assuming; it hasn't
shown up yet, but I'm guessing it kills the cq check after submission),
since if we have multiple CPUs on the same hardware queue, one of
the other CPUs could find a completion event between your
__nvme_submit_cmd() and blk_mq_start_request() calls. Very short window,
but it does exist.
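
A sketch of that window, assuming the completion check could run outside
the submission lock (hypothetical; the posted patch keeps both under
q_lock):

        /* CPU 0 (submission)                   CPU 1 (same hw queue)
         *
         * __nvme_submit_cmd(nvmeq, &cmnd);
         *   doorbell rung; the device may
         *   complete the command at once
         *                                      nvme_process_cq(nvmeq);
         *                                        completes a request that
         *                                        was never marked started
         * blk_mq_start_request(req);
         *   too late
         */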

-- 
Jens Axboe



[PATCH 1/3] nvme/pci: Start request after doorbell ring

2017-12-21 Thread Keith Busch
This is a performance optimization that allows the hardware to work on
a command in parallel with the kernel's stats and timeout tracking.

Signed-off-by: Keith Busch 
---
 drivers/nvme/host/pci.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f5800c3c9082..df5550ce0531 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -886,8 +886,6 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
                         goto out_cleanup_iod;
         }
 
-        blk_mq_start_request(req);
-
         spin_lock_irq(&nvmeq->q_lock);
         if (unlikely(nvmeq->cq_vector < 0)) {
                 ret = BLK_STS_IOERR;
@@ -895,6 +893,7 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
                 goto out_cleanup_iod;
         }
         __nvme_submit_cmd(nvmeq, &cmnd);
+        blk_mq_start_request(req);
         nvme_process_cq(nvmeq);
         spin_unlock_irq(&nvmeq->q_lock);
         return BLK_STS_OK;
-- 
2.13.6