Re: [PATCH 1/4] block: add scalable completion tracking of requests

2016-11-05 Thread Jens Axboe

On 11/04/2016 05:13 PM, Ming Lei wrote:

Yes, that might be a good idea, since it doesn't cost us anything.
For the mq case, I'm hard pressed to think of areas where we could
complete IO in parallel on the same software queue. You'll never have
a software queue mapped to multiple hardware queues. So we should
essentially be serialized.


For blk-mq, blk_mq_stat_add() is called in __blk_mq_complete_request(),
which is often run from an interrupt handler, and the CPU serving the
interrupt can be different from the submitting CPU for rq->mq_ctx. And
there can be several CPUs handling the interrupts originating from the
same sw queue.


BTW, one small improvement might be to call blk_mq_stat_add() on the
current ctx, in case it's different from rq->mq_ctx. That can happen if
we have multiple CPUs per hardware queue. In reality, even for that
case, a real race is rare. You'd have to rebalance interrupt masks
basically, at least on x86 where multiple CPUs in the IRQ affinity mask
still always trigger on the first one.

So while we could just grab the current ctx instead, I don't think it's
going to make a difference in practice.
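For illustration only, a rough sketch of accounting on the completing CPU's
ctx might look like the below, using blk_mq_get_ctx()/blk_mq_put_ctx() to pin
the local ctx; this is not part of the posted patch:

        /*
         * Hypothetical: account on the ctx of the CPU doing the completion
         * instead of rq->mq_ctx from submit time.
         */
        struct blk_mq_ctx *ctx = blk_mq_get_ctx(rq->q);

        blk_stat_add(&ctx->stat[rq_data_dir(rq)], rq);
        blk_mq_put_ctx(ctx);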


--
Jens Axboe


Re: [PATCH 1/4] block: add scalable completion tracking of requests

2016-11-05 Thread Jens Axboe

On 11/04/2016 05:13 PM, Ming Lei wrote:

Even if that is true, the statistics may still become a mess with rare
collisions.



How so? Not saying we could not improve it, but we're trading off
precision for scalability. My claim is that the existing code is good
enough. I've run a TON of testing on it, since I've used it for multiple
projects, and it's been solid.


+static void blk_stat_flush_batch(struct blk_rq_stat *stat)
+{
+   if (!stat->nr_batch)
+   return;
+   if (!stat->nr_samples)
+   stat->mean = div64_s64(stat->batch, stat->nr_batch);

For example, two reqs (A & B) are completed at the same time, with A
on CPU0 and B on CPU1.

If the last two writes in the function are reordered as observed from
CPU1, then for B, CPU1 can take the above branch while nr_samples hasn't
been updated yet, but see stat->nr_batch already cleared to zero when it
does the division, so a divide-by-zero is triggered.


We should probably just have the nr_batch be a READ_ONCE(). I'm fine
with the stats being a bit off in the rare case of a collision, but we
can't have a divide-by-zero, obviously.
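Something along these lines, maybe (just a sketch with the counters
snapshotted up front; field types assumed, not necessarily what will get
committed):

static void blk_stat_flush_batch(struct blk_rq_stat *stat)
{
        /* snapshot once so a concurrent flush can't zero the divisor under us */
        const s64 nr_batch = READ_ONCE(stat->nr_batch);
        const s64 nr_samples = READ_ONCE(stat->nr_samples);

        if (!nr_batch)
                return;
        if (!nr_samples)
                stat->mean = div64_s64(stat->batch, nr_batch);
        else {
                stat->mean = div64_s64((stat->mean * nr_samples) +
                                        stat->batch,
                                        nr_batch + nr_samples);
        }

        stat->nr_samples = nr_samples + nr_batch;
        stat->nr_batch = stat->batch = 0;
}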



+   else {
+   stat->mean = div64_s64((stat->mean * stat->nr_samples) +
+   stat->batch,
+   stat->nr_samples + stat->nr_batch);
+   }

BTW, the above 'if/else' can be removed, and 'stat->mean' can always be
computed with the second expression.


True, they could be collapsed.
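For reference, the collapsed form is just the second expression, which
degenerates to batch / nr_batch when nr_samples is zero (same fields as the
quoted patch):

        /* with nr_samples == 0 the first term vanishes, so one expression
         * covers both cases */
        stat->mean = div64_s64((stat->mean * stat->nr_samples) + stat->batch,
                               stat->nr_samples + stat->nr_batch);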


Yes, that might be a good idea, since it doesn't cost us anything. For
the mq case, I'm hard pressed to think of areas where we could complete
IO in parallel on the same software queue. You'll never have a software
queue mapped to multiple hardware queues. So we should essentially be
serialized.


For blk-mq, blk_mq_stat_add() is called in __blk_mq_complete_request(),
which is often run from an interrupt handler, and the CPU serving the
interrupt can be different from the submitting CPU for rq->mq_ctx. And there
can be several CPUs handling the interrupts originating from the same sw queue.

BTW, I don't actually object to this patch, but maybe we should add a
comment about this kind of race, because in the future someone might find
that the statistics have become inaccurate, and the comment would help
them understand it or try to improve it.


I'm fine with documenting it. For the two use cases I have, I'm fine
with it not being 100% stable. For by far the majority of the windows,
it'll be just fine. I'll fix the divide-by-zero, though.
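For example, a short note near the stat helpers along these lines might be
enough (wording is just a suggestion, not committed text):

/*
 * The per-ctx stats are updated without locking.  Two completions racing
 * on the same window can drop or double-count a sample; that's an accepted
 * precision-for-scalability trade-off, so treat the numbers as approximate.
 * The batch flush must still never divide by zero, hence nr_batch is read
 * with READ_ONCE() before being used as a divisor.
 */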

--
Jens Axboe


Re: [PATCH 1/4] block: add scalable completion tracking of requests

2016-11-04 Thread Ming Lei
On Fri, Nov 4, 2016 at 12:55 AM, Jens Axboe  wrote:
> On 11/03/2016 08:57 AM, Ming Lei wrote:
>>
>> On Thu, Nov 3, 2016 at 9:38 PM, Jens Axboe  wrote:
>>>
>>> On 11/03/2016 05:17 AM, Ming Lei wrote:
>
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 0bfaa54d3e9f..ca77c725b4e5 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2462,6 +2462,8 @@ void blk_start_request(struct request *req)
>  {
> blk_dequeue_request(req);
>
> +   blk_stat_set_issue_time(&req->issue_stat);
> +
> /*
>  * We are now handing the request to the hardware, initialize
>  * resid_len to full count and add the timeout handler.
> @@ -2529,6 +2531,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
>
> trace_block_rq_complete(req->q, req, nr_bytes);
>
> +   blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);



 blk_update_request() is often called lockless, so it isn't good to
 do it here.
>>>
>>>
>>>
>>> It's not really a concern, not for the legacy path here nor the mq one
>>> where it is per sw context. The collisions are rare enough that it'll
>>
>>
>> How do you reach the conclusion that the collisions are rare enough
>> when the counting becomes completely lockless?
>
>
> Of all the races I can spot, it basically boils down to accounting one
> IO too few or too many.
>
>> Even if that is true, the statistics may still become a mess with rare
>> collisions.
>
>
> How so? Not saying we could not improve it, but we're trading off
> precision for scalability. My claim is that the existing code is good
> enough. I've run a TON of testing on it, since I've used it for multiple
> projects, and it's been solid.

+static void blk_stat_flush_batch(struct blk_rq_stat *stat)
+{
+   if (!stat->nr_batch)
+   return;
+   if (!stat->nr_samples)
+   stat->mean = div64_s64(stat->batch, stat->nr_batch);

For example, two reqs (A & B) are completed at the same time, with
A on CPU0 and B on CPU1.

If the last two writes in the function are reordered as observed from CPU1,
then for B, CPU1 can take the above branch while nr_samples hasn't been
updated yet, but see stat->nr_batch already cleared to zero when it does
the division, so a divide-by-zero is triggered.


+   else {
+   stat->mean = div64_s64((stat->mean * stat->nr_samples) +
+   stat->batch,
+   stat->nr_samples + stat->nr_batch);
+   }

BTW, the above 'if/else' can be removed, and 'stat->mean' can always be
computed with the second expression.

+
+   stat->nr_samples += stat->nr_batch;

One addition of stat->nr_batch can be overwritten, so it may not
be accounted into stat->mean at all.

+   stat->nr_batch = stat->batch = 0;
+}
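To spell out one interleaving that could hit the divide-by-zero above
(hypothetical ordering, both CPUs flushing the same stat):

        /*
         * CPU1: if (!stat->nr_batch)   -> sees a non-zero value, continues
         * CPU0: stat->nr_batch = stat->batch = 0   (store becomes visible
         *       before CPU0's nr_samples update does)
         * CPU1: if (!stat->nr_samples) -> still sees 0, takes the first branch
         * CPU1: div64_s64(stat->batch, stat->nr_batch)
         *       -> nr_batch may be reloaded here and is now 0
         */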

>
>>> skew the latencies a bit for that short window, but then go away again.
>>> I'd much rather take that, than adding locking for this part.
>>
>>
>> For the legacy case, blk_stat_add() can be moved into blk_finish_request()
>> to avoid the collision.
>
>
> Yes, that might be a good idea, since it doesn't cost us anything. For
> the mq case, I'm hard pressed to think of areas where we could complete
> IO in parallel on the same software queue. You'll never have a software
> queue mapped to multiple hardware queues. So we should essentially be
> serialized.

For blk-mq, blk_mq_stat_add() is called in __blk_mq_complete_request(),
which is often run from an interrupt handler, and the CPU serving the
interrupt can be different from the submitting CPU for rq->mq_ctx. And there
can be several CPUs handling the interrupts originating from the same sw queue.

BTW, I don't actually object to this patch, but maybe we should add a
comment about this kind of race, because in the future someone might find
that the statistics have become inaccurate, and the comment would help
them understand it or try to improve it.


Thanks,
Ming Lei


Re: [PATCH 1/4] block: add scalable completion tracking of requests

2016-11-03 Thread Jens Axboe

On 11/03/2016 08:57 AM, Ming Lei wrote:

On Thu, Nov 3, 2016 at 9:38 PM, Jens Axboe  wrote:

On 11/03/2016 05:17 AM, Ming Lei wrote:


diff --git a/block/blk-core.c b/block/blk-core.c
index 0bfaa54d3e9f..ca77c725b4e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2462,6 +2462,8 @@ void blk_start_request(struct request *req)
 {
blk_dequeue_request(req);

+   blk_stat_set_issue_time(&req->issue_stat);
+
/*
 * We are now handing the request to the hardware, initialize
 * resid_len to full count and add the timeout handler.
@@ -2529,6 +2531,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)

trace_block_rq_complete(req->q, req, nr_bytes);

+   blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);



blk_update_request() is often called lockless, so it isn't good to
do it here.



It's not really a concern, not for the legacy path here nor the mq one
where it is per sw context. The collisions are rare enough that it'll


How do you reach the conclusion that the collisions are rare enough
when the counting becomes completely lockless?


Of all the races I can spot, it basically boils down to accounting one
IO too few or too many.


Even if that is true, the statistics may still become a mess with rare
collisions.


How so? Not saying we could not improve it, but we're trading off
precision for scalability. My claim is that the existing code is good
enough. I've run a TON of testing on it, since I've used it for multiple
projects, and it's been solid.


skew the latencies a bit for that short window, but then go away again.
I'd much rather take that, than adding locking for this part.


For the legacy case, blk_stat_add() can be moved into blk_finish_request()
to avoid the collision.


Yes, that might be a good idea, since it doesn't cost us anything. For
the mq case, I'm hard pressed to think of areas where we could complete
IO in parallel on the same software queue. You'll never have a software
queue mapped to multiple hardware queues. So we should essentially be
serialized.

In short, I don't see any problems with this.

--
Jens Axboe



Re: [PATCH 1/4] block: add scalable completion tracking of requests

2016-11-03 Thread Ming Lei
On Thu, Nov 3, 2016 at 9:38 PM, Jens Axboe  wrote:
> On 11/03/2016 05:17 AM, Ming Lei wrote:
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 0bfaa54d3e9f..ca77c725b4e5 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -2462,6 +2462,8 @@ void blk_start_request(struct request *req)
>>>  {
>>> blk_dequeue_request(req);
>>>
>>> +   blk_stat_set_issue_time(&req->issue_stat);
>>> +
>>> /*
>>>  * We are now handing the request to the hardware, initialize
>>>  * resid_len to full count and add the timeout handler.
>>> @@ -2529,6 +2531,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
>>>
>>> trace_block_rq_complete(req->q, req, nr_bytes);
>>>
>>> +   blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);
>>
>>
>> blk_update_request() is often called lockless, so it isn't good to
>> do it here.
>
>
> It's not really a concern, not for the legacy path here nor the mq one
> where it is per sw context. The collisions are rare enough that it'll

How do you reach the conclusion that the collisions are rare enough
when the counting becomes completely lockless?

Even if that is true, the statistics may still become a mess with rare
collisions.

> skew the latencies a bit for that short window, but then go away again.
> I'd much rather take that, than adding locking for this part.

For the legacy case, blk_stat_add() can be moved into blk_finish_request()
to avoid the collision.
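Roughly like the sketch below (hypothetical, context trimmed;
blk_finish_request() runs with the queue lock held on the legacy path, so
the stat update can't race there):

void blk_finish_request(struct request *req, int error)
{
        /* queue_lock is held here, so concurrent completions can't race
         * on the per-queue stats */
        blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);

        /* ... existing timeout removal, accounting and request put ... */
}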


-- 
Ming Lei


Re: [PATCH 1/4] block: add scalable completion tracking of requests

2016-11-03 Thread Bart Van Assche

On 11/01/2016 03:05 PM, Jens Axboe wrote:

+void blk_stat_init(struct blk_rq_stat *stat)
+{
+   __blk_stat_init(stat, ktime_to_ns(ktime_get()));
+}
+
+static bool __blk_stat_is_current(struct blk_rq_stat *stat, s64 now)
+{
+   return (now & BLK_STAT_NSEC_MASK) == (stat->time & BLK_STAT_NSEC_MASK);
+}
+
+bool blk_stat_is_current(struct blk_rq_stat *stat)
+{
+   return __blk_stat_is_current(stat, ktime_to_ns(ktime_get()));
+}


Hello Jens,

What is the performance impact of these patches? My experience is that 
introducing ktime_get() in the I/O path of high-performance I/O devices 
measurably slows down I/O. On https://lkml.org/lkml/2016/4/21/107 I read 
that a single ktime_get() call takes about 100 ns.


Bart.


Re: [PATCH 1/4] block: add scalable completion tracking of requests

2016-11-03 Thread Jens Axboe

On 11/03/2016 08:10 AM, Bart Van Assche wrote:

On 11/01/2016 03:05 PM, Jens Axboe wrote:

+void blk_stat_init(struct blk_rq_stat *stat)
+{
+   __blk_stat_init(stat, ktime_to_ns(ktime_get()));
+}
+
+static bool __blk_stat_is_current(struct blk_rq_stat *stat, s64 now)
+{
+   return (now & BLK_STAT_NSEC_MASK) == (stat->time & BLK_STAT_NSEC_MASK);
+}
+
+bool blk_stat_is_current(struct blk_rq_stat *stat)
+{
+   return __blk_stat_is_current(stat, ktime_to_ns(ktime_get()));
+}


Hello Jens,

What is the performance impact of these patches? My experience is that
introducing ktime_get() in the I/O path of high-performance I/O devices
measurably slows down I/O. On https://lkml.org/lkml/2016/4/21/107 I read
that a single ktime_get() call takes about 100 ns.


Hmm, on the testing I did, it didn't seem to have any noticeable
slowdown. If we do see a slowdown, we can look into enabling it only
when we need it.

Outside of the polling, my buffered writeback throttling patches also
use this stat tracking. For that patchset, it's easy enough to enable it
if we have wbt enabled. For polling, it's a bit more difficult. One easy
way would be to have a queue flag for it, and the first poll would
enable it unless it has been explicitly turned off.
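A rough sketch of that gating (QUEUE_FLAG_STATS and the helper name are made
up here, not part of this series):

/* only pay for ktime_get() and the accounting when someone consumes stats */
static inline void blk_stat_maybe_add(struct request_queue *q,
                                      struct request *rq)
{
        if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags))
                blk_stat_add(&q->rq_stats[rq_data_dir(rq)], rq);
}

The first poll (or wbt init) would then just set the flag, and a sysfs
attribute could clear it to opt out.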

--
Jens Axboe



Re: [PATCH 1/4] block: add scalable completion tracking of requests

2016-11-03 Thread Jens Axboe

On 11/03/2016 05:17 AM, Ming Lei wrote:

diff --git a/block/blk-core.c b/block/blk-core.c
index 0bfaa54d3e9f..ca77c725b4e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2462,6 +2462,8 @@ void blk_start_request(struct request *req)
 {
blk_dequeue_request(req);

+   blk_stat_set_issue_time(&req->issue_stat);
+
/*
 * We are now handing the request to the hardware, initialize
 * resid_len to full count and add the timeout handler.
@@ -2529,6 +2531,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)

trace_block_rq_complete(req->q, req, nr_bytes);

+   blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);


blk_update_request() is often called lockless, so it isn't good to
do it here.


It's not really a concern, not for the legacy path here nor the mq one
where it is per sw context. The collisions are rare enough that it'll
skew the latencies a bit for that short window, but then go away again.
I'd much rather take that, than adding locking for this part.

--
Jens Axboe



Re: [PATCH 1/4] block: add scalable completion tracking of requests

2016-11-03 Thread Ming Lei
On Wed, Nov 2, 2016 at 5:05 AM, Jens Axboe  wrote:
> For legacy block, we simply track them in the request queue. For
> blk-mq, we track them on a per-sw queue basis, which we can then
> sum up through the hardware queues and finally to a per device
> state.
>
> The stats are tracked in, roughly, 0.1s interval windows.
>
> Add sysfs files to display the stats.
>
> Signed-off-by: Jens Axboe 
> ---
>  block/Makefile|   2 +-
>  block/blk-core.c  |   4 +
>  block/blk-mq-sysfs.c  |  47 ++
>  block/blk-mq.c|  14 +++
>  block/blk-mq.h|   3 +
>  block/blk-stat.c  | 226 
> ++
>  block/blk-stat.h  |  37 
>  block/blk-sysfs.c |  26 ++
>  include/linux/blk_types.h |  16 
>  include/linux/blkdev.h|   4 +
>  10 files changed, 378 insertions(+), 1 deletion(-)
>  create mode 100644 block/blk-stat.c
>  create mode 100644 block/blk-stat.h
>
> diff --git a/block/Makefile b/block/Makefile
> index 934dac73fb37..2528c596f7ec 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -5,7 +5,7 @@
>  obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
> blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
> blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
> -   blk-lib.o blk-mq.o blk-mq-tag.o \
> +   blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
> blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \
> genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
> badblocks.o partitions/
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 0bfaa54d3e9f..ca77c725b4e5 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2462,6 +2462,8 @@ void blk_start_request(struct request *req)
>  {
> blk_dequeue_request(req);
>
> +   blk_stat_set_issue_time(&req->issue_stat);
> +
> /*
>  * We are now handing the request to the hardware, initialize
>  * resid_len to full count and add the timeout handler.
> @@ -2529,6 +2531,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
>
> trace_block_rq_complete(req->q, req, nr_bytes);
>
> +   blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);

blk_update_request() is often called lockless, so it isn't good to
do it here.

> +
> if (!req->bio)
> return false;
>
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index 01fb455d3377..633c79a538ea 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -259,6 +259,47 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
> return ret;
>  }
>
> +static void blk_mq_stat_clear(struct blk_mq_hw_ctx *hctx)
> +{
> +   struct blk_mq_ctx *ctx;
> +   unsigned int i;
> +
> +   hctx_for_each_ctx(hctx, ctx, i) {
> +   blk_stat_init(&ctx->stat[0]);
> +   blk_stat_init(&ctx->stat[1]);
> +   }
> +}
> +
> +static ssize_t blk_mq_hw_sysfs_stat_store(struct blk_mq_hw_ctx *hctx,
> + const char *page, size_t count)
> +{
> +   blk_mq_stat_clear(hctx);
> +   return count;
> +}
> +
> +static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
> +{
> +   return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
> +   pre, (long long) stat->nr_samples,
> +   (long long) stat->mean, (long long) stat->min,
> +   (long long) stat->max);
> +}
> +
> +static ssize_t blk_mq_hw_sysfs_stat_show(struct blk_mq_hw_ctx *hctx, char *page)
> +{
> +   struct blk_rq_stat stat[2];
> +   ssize_t ret;
> +
> +   blk_stat_init(&stat[0]);
> +   blk_stat_init(&stat[1]);
> +
> +   blk_hctx_stat_get(hctx, stat);
> +
> +   ret = print_stat(page, &stat[0], "read :");
> +   ret += print_stat(page + ret, &stat[1], "write:");
> +   return ret;
> +}
> +
>  static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = {
> .attr = {.name = "dispatched", .mode = S_IRUGO },
> .show = blk_mq_sysfs_dispatched_show,
> @@ -317,6 +358,11 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_poll = {
> .show = blk_mq_hw_sysfs_poll_show,
> .store = blk_mq_hw_sysfs_poll_store,
>  };
> +static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_stat = {
> +   .attr = {.name = "stats", .mode = S_IRUGO | S_IWUSR },
> +   .show = blk_mq_hw_sysfs_stat_show,
> +   .store = blk_mq_hw_sysfs_stat_store,
> +};
>
>  static struct attribute *default_hw_ctx_attrs[] = {
> &blk_mq_hw_sysfs_queued.attr,
> @@ -327,6 +373,7 @@ static struct attribute *default_hw_ctx_attrs[] = {
> &blk_mq_hw_sysfs_cpus.attr,
> &blk_mq_hw_sysfs_active.attr,
> &blk_mq_hw_sy

Re: [PATCH 1/4] block: add scalable completion tracking of requests

2016-11-02 Thread Christoph Hellwig
On Wed, Nov 02, 2016 at 08:55:16AM -0600, Jens Axboe wrote:
> On 11/02/2016 08:52 AM, Christoph Hellwig wrote:
>> On Tue, Nov 01, 2016 at 03:05:22PM -0600, Jens Axboe wrote:
>>> For legacy block, we simply track them in the request queue. For
>>> blk-mq, we track them on a per-sw queue basis, which we can then
>>> sum up through the hardware queues and finally to a per device
>>> state.
>>
>> what is the use case for the legacy request tracking?
>
> Buffered writeback code uses the same base. Additionally, it could
> replace some user space tracking for latency outliers that people are
> currently running.

Ok.


Re: [PATCH 1/4] block: add scalable completion tracking of requests

2016-11-02 Thread Jens Axboe

On 11/02/2016 08:52 AM, Christoph Hellwig wrote:

On Tue, Nov 01, 2016 at 03:05:22PM -0600, Jens Axboe wrote:

For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw queue basis, which we can then
sum up through the hardware queues and finally to a per device
state.


what is the use case for the legacy request tracking?


Buffered writeback code uses the same base. Additionally, it could
replace some user space tracking for latency outliers that people are
currently running.

--
Jens Axboe



Re: [PATCH 1/4] block: add scalable completion tracking of requests

2016-11-02 Thread Christoph Hellwig
On Tue, Nov 01, 2016 at 03:05:22PM -0600, Jens Axboe wrote:
> For legacy block, we simply track them in the request queue. For
> blk-mq, we track them on a per-sw queue basis, which we can then
> sum up through the hardware queues and finally to a per device
> state.

what is the use case for the legacy request tracking?


Re: [PATCH 1/4] block: add scalable completion tracking of requests

2016-11-01 Thread Jens Axboe
On Tue, Nov 01 2016, Johannes Thumshirn wrote:
> On Tue, Nov 01, 2016 at 03:05:22PM -0600, Jens Axboe wrote:
> > For legacy block, we simply track them in the request queue. For
> > blk-mq, we track them on a per-sw queue basis, which we can then
> > sum up through the hardware queues and finally to a per device
> > state.
> > 
> > The stats are tracked in, roughly, 0.1s interval windows.
> > 
> > Add sysfs files to display the stats.
> > 
> > Signed-off-by: Jens Axboe 
> > ---
> 
> [...]
> 
> >  
> > /* incremented at completion time */
> > unsigned long   cacheline_aligned_in_smp rq_completed[2];
> > +   struct blk_rq_stat  stat[2];
> 
> Can you add an enum or define for the directions? Just 0 and 1 aren't very
> intuitive.

Good point, I added that and updated both the stats patch and the
subsequent blk-mq poll code.
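Something like the below, for reference (names illustrative, not necessarily
what went into the updated patches):

enum {
        BLK_STAT_READ   = 0,
        BLK_STAT_WRITE,
};

        /* e.g. in blk_mq_stat_clear(), instead of bare 0/1: */
        blk_stat_init(&ctx->stat[BLK_STAT_READ]);
        blk_stat_init(&ctx->stat[BLK_STAT_WRITE]);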

-- 
Jens Axboe



Re: [PATCH 1/4] block: add scalable completion tracking of requests

2016-11-01 Thread Johannes Thumshirn
On Tue, Nov 01, 2016 at 03:05:22PM -0600, Jens Axboe wrote:
> For legacy block, we simply track them in the request queue. For
> blk-mq, we track them on a per-sw queue basis, which we can then
> sum up through the hardware queues and finally to a per device
> state.
> 
> The stats are tracked in, roughly, 0.1s interval windows.
> 
> Add sysfs files to display the stats.
> 
> Signed-off-by: Jens Axboe 
> ---

[...]

>  
>   /* incremented at completion time */
>   unsigned long   cacheline_aligned_in_smp rq_completed[2];
> + struct blk_rq_stat  stat[2];

Can you add an enum or define for the directions? Just 0 and 1 aren't very
intuitive.

Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


[PATCH 1/4] block: add scalable completion tracking of requests

2016-11-01 Thread Jens Axboe
For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw queue basis, which we can then
sum up through the hardware queues and finally to a per device
state.

The stats are tracked in, roughly, 0.1s interval windows.

Add sysfs files to display the stats.

Signed-off-by: Jens Axboe 
---
 block/Makefile|   2 +-
 block/blk-core.c  |   4 +
 block/blk-mq-sysfs.c  |  47 ++
 block/blk-mq.c|  14 +++
 block/blk-mq.h|   3 +
 block/blk-stat.c  | 226 ++
 block/blk-stat.h  |  37 
 block/blk-sysfs.c |  26 ++
 include/linux/blk_types.h |  16 
 include/linux/blkdev.h|   4 +
 10 files changed, 378 insertions(+), 1 deletion(-)
 create mode 100644 block/blk-stat.c
 create mode 100644 block/blk-stat.h

diff --git a/block/Makefile b/block/Makefile
index 934dac73fb37..2528c596f7ec 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
 obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
-   blk-lib.o blk-mq.o blk-mq-tag.o \
+   blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
badblocks.o partitions/
diff --git a/block/blk-core.c b/block/blk-core.c
index 0bfaa54d3e9f..ca77c725b4e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2462,6 +2462,8 @@ void blk_start_request(struct request *req)
 {
blk_dequeue_request(req);
 
+   blk_stat_set_issue_time(&req->issue_stat);
+
/*
 * We are now handing the request to the hardware, initialize
 * resid_len to full count and add the timeout handler.
@@ -2529,6 +2531,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
 
trace_block_rq_complete(req->q, req, nr_bytes);
 
+   blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);
+
if (!req->bio)
return false;
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 01fb455d3377..633c79a538ea 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -259,6 +259,47 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
return ret;
 }
 
+static void blk_mq_stat_clear(struct blk_mq_hw_ctx *hctx)
+{
+   struct blk_mq_ctx *ctx;
+   unsigned int i;
+
+   hctx_for_each_ctx(hctx, ctx, i) {
+   blk_stat_init(&ctx->stat[0]);
+   blk_stat_init(&ctx->stat[1]);
+   }
+}
+
+static ssize_t blk_mq_hw_sysfs_stat_store(struct blk_mq_hw_ctx *hctx,
+ const char *page, size_t count)
+{
+   blk_mq_stat_clear(hctx);
+   return count;
+}
+
+static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
+{
+   return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
+   pre, (long long) stat->nr_samples,
+   (long long) stat->mean, (long long) stat->min,
+   (long long) stat->max);
+}
+
+static ssize_t blk_mq_hw_sysfs_stat_show(struct blk_mq_hw_ctx *hctx, char *page)
+{
+   struct blk_rq_stat stat[2];
+   ssize_t ret;
+
+   blk_stat_init(&stat[0]);
+   blk_stat_init(&stat[1]);
+
+   blk_hctx_stat_get(hctx, stat);
+
+   ret = print_stat(page, &stat[0], "read :");
+   ret += print_stat(page + ret, &stat[1], "write:");
+   return ret;
+}
+
 static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = {
.attr = {.name = "dispatched", .mode = S_IRUGO },
.show = blk_mq_sysfs_dispatched_show,
@@ -317,6 +358,11 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_poll = {
.show = blk_mq_hw_sysfs_poll_show,
.store = blk_mq_hw_sysfs_poll_store,
 };
+static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_stat = {
+   .attr = {.name = "stats", .mode = S_IRUGO | S_IWUSR },
+   .show = blk_mq_hw_sysfs_stat_show,
+   .store = blk_mq_hw_sysfs_stat_store,
+};
 
 static struct attribute *default_hw_ctx_attrs[] = {
&blk_mq_hw_sysfs_queued.attr,
@@ -327,6 +373,7 @@ static struct attribute *default_hw_ctx_attrs[] = {
&blk_mq_hw_sysfs_cpus.attr,
&blk_mq_hw_sysfs_active.attr,
&blk_mq_hw_sysfs_poll.attr,
+   &blk_mq_hw_sysfs_stat.attr,
NULL,
 };
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2da1a0ee3318..4555a76d22a7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -30,6 +30,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
+#include "blk-stat.h"
 
 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
@@ -376,10 +377,1