Re: [PATCH] block: kyber: make kyber more friendly with merging

2018-05-30 Thread Ming Lei
On Wed, May 30, 2018 at 4:36 PM, jianchao.wang wrote:
> Hi Ming
>
> Thanks for your kind response.
>
> On 05/30/2018 04:22 PM, Ming Lei wrote:
>>>> you could keep the software queues as-is but add our own version of
>>>> flush_busy_ctxs() that only removes requests of the domain that we want.
>>>> If one domain gets backed up, that might get messy with long iterations,
>>>> though.
>>> Yes, I also considered this approach :)
>>> But the long iterations on every ctx->rq_list looks really inefficient.
>> Right, this list can be quite long if dispatch token is used up.
>>
>> You might try to introduce per-domain list into ctx directly, then 'none'
>> may benefit from this change too since bio merge should be done
>> on the per-domain list actually.
>
> Yes, it may be good for merging with 'none', because the rq_list is split into
> three lists and there is no need to iterate the whole rq_list any more.
> But what about dispatch when there is no I/O scheduler?

blk_mq_flush_busy_ctxs() and blk_mq_dequeue_from_ctx() should work
fine in the case of 'none' if a per-domain list is added to ctx. Then we can
make 'none' a bit fairer between READ and WRITE.

> We will dispatch requests from ctx one by one at the moment.
> If we have per-domain lists in ctx, we have to introduce some policy to
> determine which domain to dispatch, and that policy should really live in
> the I/O scheduler.

The policy is decided by the I/O scheduler, and you can easily pick up
requests from the ctx/domain list by introducing one blk-mq core API.

Thanks,
Ming Lei


Re: [PATCH] block: kyber: make kyber more friendly with merging

2018-05-22 Thread jianchao.wang
Hi Jens and Holger

Thanks for your kind response.
That's really appreciated.

I will post next version based on Jens' patch.

Thanks
Jianchao

On 05/23/2018 02:32 AM, Holger Hoffstätte wrote:
>>>> This looks great but prevents kyber from being built as module,
>>>> which is AFAIK supposed to work (and works now):
>>>>
>>>> ..
>>>>   CC [M]  block/kyber-iosched.o
>>>>   Building modules, stage 2.
>>>>   MODPOST 313 modules
>>>> ERROR: "bio_attempt_back_merge" [block/kyber-iosched.ko] undefined!
>>>> ERROR: "bio_attempt_front_merge" [block/kyber-iosched.ko] undefined!
>>>> ERROR: "bio_attempt_discard_merge" [block/kyber-iosched.ko] undefined!
>>>> ERROR: "blk_try_merge" [block/kyber-iosched.ko] undefined!
>>>> ERROR: "blk_rq_merge_ok" [block/kyber-iosched.ko] undefined!
>>>> ..
>>>>
>>>> It does build fine when compiled in, obviously. :)
>>> It's basically duplicating the contents of blk_attempt_plug_merge().
>>> I would suggest abstracting out the list loop and merge check
>>> into a helper, that could then both be called from kyber and the
>>> plug merge function.
>> See attached, prep patch and yours rebased on top of it.
> That was quick. :)
> 
> Applies smoothly on top of my 4.16++ tree, now builds correctly as
> module and is reproducibly (slightly) faster even on my pedestrian
> SATA SSDs, now on par or occasionally even faster than mq-deadline.
> What's not to like? So:
> 
> Tested-by: Holger Hoffstätte 


Re: [PATCH] block: kyber: make kyber more friendly with merging

2018-05-22 Thread jianchao.wang
Hi Omar

Thanks for your kind response.

On 05/23/2018 04:02 AM, Omar Sandoval wrote:
> On Tue, May 22, 2018 at 10:48:29PM +0800, Jianchao Wang wrote:
>> Currently, kyber is very unfriendly with merging. kyber depends
>> on ctx rq_list to do merging, however, most of time, it will not
>> leave any requests in ctx rq_list. This is because even if tokens
>> of one domain is used up, kyber will try to dispatch requests
>> from other domain and flush the rq_list there.
> 
> That's a great catch, I totally missed this.
> 
> This approach does end up duplicating a lot of code with the blk-mq core
> even after Jens' change, so I'm curious if you tried other approaches.
> One idea I had is to try the bio merge against the kqd->rqs lists. Since
> that's per-queue, the locking overhead might be too high. Alternatively,

Yes, I made such a patch, trying the bio merge against kqd->rqs directly.
The patch looks even simpler. However, because the khd->lock is needed
every time we try a bio merge, there may be high contention overhead on
khd->lock when the cpu-to-hctx mapping is not 1:1.

> you could keep the software queues as-is but add our own version of
> flush_busy_ctxs() that only removes requests of the domain that we want.
> If one domain gets backed up, that might get messy with long iterations,
> though.

Yes, I also considered this approach :)
But the long iterations over every ctx->rq_list look really inefficient.

> 
> Regarding this approach, a couple of comments below.
...
>>  }
>> @@ -379,12 +414,33 @@ static void kyber_exit_sched(struct elevator_queue *e)
>>  static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
>>  {
>>  struct kyber_hctx_data *khd;
>> +struct kyber_queue_data *kqd = hctx->queue->elevator->elevator_data;
>>  int i;
>> +int sd;
>>  
>>  khd = kmalloc_node(sizeof(*khd), GFP_KERNEL, hctx->numa_node);
>>  if (!khd)
>>  return -ENOMEM;
>>  
>> +khd->kcqs = kmalloc_array_node(nr_cpu_ids, sizeof(void *),
>> +GFP_KERNEL, hctx->numa_node);
>> +if (!khd->kcqs)
>> +goto err_khd;
> 
> Why the double indirection of a percpu allocation per hardware queue
> here? With, say, 56 cpus and that many hardware queues, that's 3136
> pointers, which seems like overkill. Can't you just use the percpu array
> in the kqd directly, or make it per-hardware queue instead?

Oops, I forgot to change nr_cpu_ids to hctx->nr_ctx.
The mapping between cpu and hctx has been set up when kyber_init_hctx() is
invoked, so we just need to allocate hctx->nr_ctx kyber_ctx_queue structures
per khd.

...
>> +static int bio_sched_domain(const struct bio *bio)
>> +{
>> +unsigned int op = bio->bi_opf;
>> +
>> +if ((op & REQ_OP_MASK) == REQ_OP_READ)
>> +return KYBER_READ;
>> +else if ((op & REQ_OP_MASK) == REQ_OP_WRITE && op_is_sync(op))
>> +return KYBER_SYNC_WRITE;
>> +else
>> +return KYBER_OTHER;
>> +}
> 
> Please add a common helper for rq_sched_domain() and bio_sched_domain()
> instead of duplicating the logic.
> 

Yes, I will do it in next version.

Thanks
Jianchao


Re: [PATCH] block: kyber: make kyber more friendly with merging

2018-05-22 Thread Omar Sandoval
On Tue, May 22, 2018 at 10:48:29PM +0800, Jianchao Wang wrote:
> Currently, kyber is very unfriendly with merging. kyber depends
> on ctx rq_list to do merging, however, most of time, it will not
> leave any requests in ctx rq_list. This is because even if tokens
> of one domain is used up, kyber will try to dispatch requests
> from other domain and flush the rq_list there.

That's a great catch, I totally missed this.

This approach does end up duplicating a lot of code with the blk-mq core
even after Jens' change, so I'm curious if you tried other approaches.
One idea I had is to try the bio merge against the kqd->rqs lists. Since
that's per-queue, the locking overhead might be too high. Alternatively,
you could keep the software queues as-is but add our own version of
flush_busy_ctxs() that only removes requests of the domain that we want.
If one domain gets backed up, that might get messy with long iterations,
though.

Regarding this approach, a couple of comments below.

> To improve this, we setup kyber_ctx_queue (kcq) which is similar
> with ctx, but it has rq_lists for different domain and build same
> mapping between kcq and khd as the ctx & hctx. Then we could merge,
> insert and dispatch for different domains separately. If one domain
> token is used up, the requests could be left in the rq_list of
> that domain and maybe merged with following io.
> 
> Following is my test result on machine with 8 cores and NVMe card
> INTEL SSDPEKKR128G7
> 
> fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
> seq/random
> +------+----------+-----------+------------+-----------------+---------+
> |patch?| bw(MB/s) |   iops    | slat(usec) |   clat(usec)    |  merge  |
> +------+----------+-----------+------------+-----------------+---------+
> | w/o  |  606/612 | 151k/153k |  6.89/7.03 | 3349.21/3305.40 |   0/0   |
> +------+----------+-----------+------------+-----------------+---------+
> | w/   | 1083/616 | 277k/154k |  4.93/6.95 | 1830.62/3279.95 | 223k/3k |
> +------+----------+-----------+------------+-----------------+---------+
> When set numjobs to 16, the bw and iops could reach 1662MB/s and 425k
> on my platform.
> 
> Signed-off-by: Jianchao Wang 
> ---
>  block/kyber-iosched.c | 240 
> --
>  1 file changed, 212 insertions(+), 28 deletions(-)
> 
> diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
> index 0d6d25e3..04da05b 100644
> --- a/block/kyber-iosched.c
> +++ b/block/kyber-iosched.c
> @@ -72,6 +72,15 @@ static const unsigned int kyber_batch_size[] = {
>   [KYBER_OTHER] = 8,
>  };
>  
> +struct kyber_ctx_queue {
> + /*
> +  * Copied from blk_mq_ctx->index_hw
> +  */
> + unsigned int index;
> + spinlock_t lock;
> + struct list_head rq_list[KYBER_NUM_DOMAINS];
> +} cacheline_aligned_in_smp;
> +
>  struct kyber_queue_data {
>   struct request_queue *q;
>  
> @@ -84,6 +93,7 @@ struct kyber_queue_data {
>*/
>   struct sbitmap_queue domain_tokens[KYBER_NUM_DOMAINS];
>  
> + struct kyber_ctx_queue *ctx_queue;
>   /*
>* Async request percentage, converted to per-word depth for
>* sbitmap_get_shallow().
> @@ -99,6 +109,8 @@ struct kyber_hctx_data {
>   struct list_head rqs[KYBER_NUM_DOMAINS];
>   unsigned int cur_domain;
>   unsigned int batching;
> + struct kyber_ctx_queue **kcqs;
> + struct sbitmap kcq_map[KYBER_NUM_DOMAINS];
>   wait_queue_entry_t domain_wait[KYBER_NUM_DOMAINS];
>   struct sbq_wait_state *domain_ws[KYBER_NUM_DOMAINS];
>   atomic_t wait_index[KYBER_NUM_DOMAINS];
> @@ -284,6 +296,19 @@ static unsigned int kyber_sched_tags_shift(struct kyber_queue_data *kqd)
>   return kqd->q->queue_hw_ctx[0]->sched_tags->bitmap_tags.sb.shift;
>  }
>  
> +static void kyber_ctx_queue_init(struct kyber_queue_data *kqd)
> +{
> + unsigned int i, j;
> +
> + for (i = 0; i < nr_cpu_ids; i++) {
> + struct kyber_ctx_queue *kcq = &kqd->ctx_queue[i];
> +
> + spin_lock_init(&kcq->lock);
> + for (j = 0; j < KYBER_NUM_DOMAINS; j++)
> + INIT_LIST_HEAD(&kcq->rq_list[j]);
> + }
> +}
> +
>  static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
>  {
>   struct kyber_queue_data *kqd;
> @@ -302,6 +327,13 @@ static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
>   if (!kqd->cb)
>   goto err_kqd;
>  
> + kqd->ctx_queue = kmalloc_array_node(nr_cpu_ids,
> + sizeof(struct kyber_ctx_queue), GFP_KERNEL, -1);

The whitespace here and in several other places is weird, please run
this through checkpatch.

> + if (!kqd->ctx_queue)
> + goto err_cb;
> +
> + kyber_ctx_queue_init(kqd);
> +
>   /*
>* The maximum number of tokens for any scheduling domain is at least
>* the queue depth of a single hardware queue. If the hardware doesn't
> @@ 


Re: [PATCH] block: kyber: make kyber more friendly with merging

2018-05-22 Thread Holger Hoffstätte
On 05/22/18 19:46, Jens Axboe wrote:
> On 5/22/18 10:20 AM, Jens Axboe wrote:
>> On 5/22/18 10:17 AM, Holger Hoffstätte wrote:
>>> On 05/22/18 16:48, Jianchao Wang wrote:
>>>> Currently, kyber is very unfriendly with merging. kyber depends
>>>> on ctx rq_list to do merging, however, most of time, it will not
>>>> leave any requests in ctx rq_list. This is because even if tokens
>>>> of one domain is used up, kyber will try to dispatch requests
>>>> from other domain and flush the rq_list there.
>>>>
>>>> To improve this, we setup kyber_ctx_queue (kcq) which is similar
>>>> with ctx, but it has rq_lists for different domain and build same
>>>> mapping between kcq and khd as the ctx & hctx. Then we could merge,
>>>> insert and dispatch for different domains separately. If one domain
>>>> token is used up, the requests could be left in the rq_list of
>>>> that domain and maybe merged with following io.
>>>>
>>>> Following is my test result on machine with 8 cores and NVMe card
>>>> INTEL SSDPEKKR128G7
>>>>
>>>> fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
>>>> seq/random
>>>> +------+----------+-----------+------------+-----------------+---------+
>>>> |patch?| bw(MB/s) |   iops    | slat(usec) |   clat(usec)    |  merge  |
>>>> +------+----------+-----------+------------+-----------------+---------+
>>>> | w/o  |  606/612 | 151k/153k |  6.89/7.03 | 3349.21/3305.40 |   0/0   |
>>>> +------+----------+-----------+------------+-----------------+---------+
>>>> | w/   | 1083/616 | 277k/154k |  4.93/6.95 | 1830.62/3279.95 | 223k/3k |
>>>> +------+----------+-----------+------------+-----------------+---------+
>>>> When set numjobs to 16, the bw and iops could reach 1662MB/s and 425k
>>>> on my platform.
>>>>
>>>> Signed-off-by: Jianchao Wang 
>>>
>>> 
>>>
>>> This looks great but prevents kyber from being built as module,
>>> which is AFAIK supposed to work (and works now):
>>>
>>> ..
>>>   CC [M]  block/kyber-iosched.o
>>>   Building modules, stage 2.
>>>   MODPOST 313 modules
>>> ERROR: "bio_attempt_back_merge" [block/kyber-iosched.ko] undefined!
>>> ERROR: "bio_attempt_front_merge" [block/kyber-iosched.ko] undefined!
>>> ERROR: "bio_attempt_discard_merge" [block/kyber-iosched.ko] undefined!
>>> ERROR: "blk_try_merge" [block/kyber-iosched.ko] undefined!
>>> ERROR: "blk_rq_merge_ok" [block/kyber-iosched.ko] undefined!
>>> ..
>>>
>>> It does build fine when compiled in, obviously. :)
>>
>> It's basically duplicating the contents of blk_attempt_plug_merge().
>> I would suggest abstracting out the list loop and merge check
>> into a helper, that could then both be called from kyber and the
>> plug merge function.
> 
> See attached, prep patch and yours rebased on top of it.

That was quick. :)

Applies smoothly on top of my 4.16++ tree, now builds correctly as
module and is reproducibly (slightly) faster even on my pedestrian
SATA SSDs, now on par or occasionally even faster than mq-deadline.
What's not to like? So:

Tested-by: Holger Hoffstätte 

cheers,
Holger


Re: [PATCH] block: kyber: make kyber more friendly with merging

2018-05-22 Thread Jens Axboe
On 5/22/18 10:20 AM, Jens Axboe wrote:
> On 5/22/18 10:17 AM, Holger Hoffstätte wrote:
>> On 05/22/18 16:48, Jianchao Wang wrote:
>>> Currently, kyber is very unfriendly with merging. kyber depends
>>> on ctx rq_list to do merging, however, most of time, it will not
>>> leave any requests in ctx rq_list. This is because even if tokens
>>> of one domain is used up, kyber will try to dispatch requests
>>> from other domain and flush the rq_list there.
>>>
>>> To improve this, we setup kyber_ctx_queue (kcq) which is similar
>>> with ctx, but it has rq_lists for different domain and build same
>>> mapping between kcq and khd as the ctx & hctx. Then we could merge,
>>> insert and dispatch for different domains separately. If one domain
>>> token is used up, the requests could be left in the rq_list of
>>> that domain and maybe merged with following io.
>>>
>>> Following is my test result on machine with 8 cores and NVMe card
>>> INTEL SSDPEKKR128G7
>>>
>>> fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
>>> seq/random
>>> +------+----------+-----------+------------+-----------------+---------+
>>> |patch?| bw(MB/s) |   iops    | slat(usec) |   clat(usec)    |  merge  |
>>> +------+----------+-----------+------------+-----------------+---------+
>>> | w/o  |  606/612 | 151k/153k |  6.89/7.03 | 3349.21/3305.40 |   0/0   |
>>> +------+----------+-----------+------------+-----------------+---------+
>>> | w/   | 1083/616 | 277k/154k |  4.93/6.95 | 1830.62/3279.95 | 223k/3k |
>>> +------+----------+-----------+------------+-----------------+---------+
>>> When set numjobs to 16, the bw and iops could reach 1662MB/s and 425k
>>> on my platform.
>>>
>>> Signed-off-by: Jianchao Wang 
>>
>> 
>>
>> This looks great but prevents kyber from being built as module,
>> which is AFAIK supposed to work (and works now):
>>
>> ..
>>   CC [M]  block/kyber-iosched.o
>>   Building modules, stage 2.
>>   MODPOST 313 modules
>> ERROR: "bio_attempt_back_merge" [block/kyber-iosched.ko] undefined!
>> ERROR: "bio_attempt_front_merge" [block/kyber-iosched.ko] undefined!
>> ERROR: "bio_attempt_discard_merge" [block/kyber-iosched.ko] undefined!
>> ERROR: "blk_try_merge" [block/kyber-iosched.ko] undefined!
>> ERROR: "blk_rq_merge_ok" [block/kyber-iosched.ko] undefined!
>> ..
>>
>> It does build fine when compiled in, obviously. :)
> 
> It's basically duplicating the contents of blk_attempt_plug_merge().
> I would suggest abstracting out the list loop and merge check
> into a helper, that could then both be called from kyber and the
> plug merge function.

See attached, prep patch and yours rebased on top of it.

-- 
Jens Axboe

From edb57af07cf619f27a3fee85705870875f853f58 Mon Sep 17 00:00:00 2001
From: Jianchao Wang 
Date: Tue, 22 May 2018 11:45:01 -0600
Subject: [PATCH 2/2] block: kyber: make kyber more friendly with merging

Currently, kyber is very unfriendly with merging. kyber depends
on ctx rq_list to do merging, however, most of time, it will not
leave any requests in ctx rq_list. This is because even if tokens
of one domain is used up, kyber will try to dispatch requests
from other domain and flush the rq_list there.

To improve this, we setup kyber_ctx_queue (kcq) which is similar
with ctx, but it has rq_lists for different domain and build same
mapping between kcq and khd as the ctx & hctx. Then we could merge,
insert and dispatch for different domains separately. If one domain
token is used up, the requests could be left in the rq_list of
that domain and maybe merged with following io.

Following is my test result on machine with 8 cores and NVMe card
INTEL SSDPEKKR128G7

fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
seq/random
+------+----------+-----------+------------+-----------------+---------+
|patch?| bw(MB/s) |   iops    | slat(usec) |   clat(usec)    |  merge  |
+------+----------+-----------+------------+-----------------+---------+
| w/o  |  606/612 | 151k/153k |  6.89/7.03 | 3349.21/3305.40 |   0/0   |
+------+----------+-----------+------------+-----------------+---------+
| w/   | 1083/616 | 277k/154k |  4.93/6.95 | 1830.62/3279.95 | 223k/3k |
+------+----------+-----------+------------+-----------------+---------+
When set numjobs to 16, the bw and iops could reach 1662MB/s and 425k
on my platform.

Signed-off-by: Jianchao Wang 
---
 block/kyber-iosched.c | 212 +++---
 1 file changed, 184 insertions(+), 28 deletions(-)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 5b33dc394cc7..f28147be729b 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -72,6 +72,15 @@ static const unsigned int kyber_batch_size[] = {
 	[KYBER_OTHER] = 8,
 };
 
+struct kyber_ctx_queue {
+	/*
+	 * Copied from blk_mq_ctx->index_hw
+	 */
+	unsigned int index;
+	spinlock_t lock;
+	struct list_head rq_list[KYBER_NUM_DOMAINS];
+} cacheline_aligned_in_smp;
+
 

Re: [PATCH] block: kyber: make kyber more friendly with merging

2018-05-22 Thread Jens Axboe
On 5/22/18 10:20 AM, Jens Axboe wrote:
> On 5/22/18 10:17 AM, Holger Hoffstätte wrote:
>> On 05/22/18 16:48, Jianchao Wang wrote:
>>> Currently, kyber is very unfriendly with merging. kyber depends
>>> on ctx rq_list to do merging, however, most of time, it will not
>>> leave any requests in ctx rq_list. This is because even if tokens
>>> of one domain is used up, kyber will try to dispatch requests
>>> from other domain and flush the rq_list there.
>>>
>>> To improve this, we setup kyber_ctx_queue (kcq) which is similar
>>> with ctx, but it has rq_lists for different domain and build same
>>> mapping between kcq and khd as the ctx & hctx. Then we could merge,
>>> insert and dispatch for different domains separately. If one domain
>>> token is used up, the requests could be left in the rq_list of
>>> that domain and maybe merged with following io.
>>>
>>> Following is my test result on machine with 8 cores and NVMe card
>>> INTEL SSDPEKKR128G7
>>>
>>> fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
>>> seq/random
>>> +------+----------+-----------+------------+-----------------+---------+
>>> |patch?| bw(MB/s) |   iops    | slat(usec) |   clat(usec)    |  merge  |
>>> +------+----------+-----------+------------+-----------------+---------+
>>> | w/o  |  606/612 | 151k/153k |  6.89/7.03 | 3349.21/3305.40 |   0/0   |
>>> +------+----------+-----------+------------+-----------------+---------+
>>> | w/   | 1083/616 | 277k/154k |  4.93/6.95 | 1830.62/3279.95 | 223k/3k |
>>> +------+----------+-----------+------------+-----------------+---------+
>>> When set numjobs to 16, the bw and iops could reach 1662MB/s and 425k
>>> on my platform.
>>>
>>> Signed-off-by: Jianchao Wang 
>>
>> 
>>
>> This looks great but prevents kyber from being built as module,
>> which is AFAIK supposed to work (and works now):
>>
>> ..
>>   CC [M]  block/kyber-iosched.o
>>   Building modules, stage 2.
>>   MODPOST 313 modules
>> ERROR: "bio_attempt_back_merge" [block/kyber-iosched.ko] undefined!
>> ERROR: "bio_attempt_front_merge" [block/kyber-iosched.ko] undefined!
>> ERROR: "bio_attempt_discard_merge" [block/kyber-iosched.ko] undefined!
>> ERROR: "blk_try_merge" [block/kyber-iosched.ko] undefined!
>> ERROR: "blk_rq_merge_ok" [block/kyber-iosched.ko] undefined!
>> ..
>>
>> It does build fine when compiled in, obviously. :)
> 
> It's basically duplicating the contents of blk_attempt_plug_merge().
> I would suggest abstracting out the list loop and merge check
> into a helper, that could then both be called from kyber and the
> plug merge function.

See attached, prep patch and yours rebased on top of it.

-- 
Jens Axboe

From edb57af07cf619f27a3fee85705870875f853f58 Mon Sep 17 00:00:00 2001
From: Jianchao Wang 
Date: Tue, 22 May 2018 11:45:01 -0600
Subject: [PATCH 2/2] block: kyber: make kyber more friendly with merging

Currently, kyber is very unfriendly to merging. kyber depends on the
ctx rq_list for merging; however, most of the time it will not leave
any requests in the ctx rq_list. This is because even if the tokens of
one domain are used up, kyber will still try to dispatch requests from
the other domains and flush the rq_list in the process.

To improve this, we set up a kyber_ctx_queue (kcq), which is similar
to a ctx but has a separate rq_list per domain, and build the same
mapping between kcq and khd as between ctx and hctx. Then we can merge,
insert and dispatch for the different domains separately. If one
domain's tokens are used up, its requests can be left in that domain's
rq_list and possibly merged with following I/O.

The following is my test result on a machine with 8 cores and an
Intel SSDPEKKR128G7 NVMe card:

fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
seq/random
+------+----------+-----------+------------+-----------------+---------+
|patch?| bw(MB/s) |   iops    | slat(usec) |   clat(usec)    |  merge  |
+------+----------+-----------+------------+-----------------+---------+
| w/o  |  606/612 | 151k/153k |  6.89/7.03 | 3349.21/3305.40 |   0/0   |
+------+----------+-----------+------------+-----------------+---------+
| w/   | 1083/616 | 277k/154k |  4.93/6.95 | 1830.62/3279.95 | 223k/3k |
+------+----------+-----------+------------+-----------------+---------+
When numjobs is set to 16, the bandwidth and IOPS reach 1662MB/s and 425k
on my platform.

Signed-off-by: Jianchao Wang 
---
 block/kyber-iosched.c | 212 +++---
 1 file changed, 184 insertions(+), 28 deletions(-)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 5b33dc394cc7..f28147be729b 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -72,6 +72,15 @@ static const unsigned int kyber_batch_size[] = {
 	[KYBER_OTHER] = 8,
 };
 
+struct kyber_ctx_queue {
+	/*
+	 * Copied from blk_mq_ctx->index_hw
+	 */
+	unsigned int index;
+	spinlock_t lock;
+	struct list_head rq_list[KYBER_NUM_DOMAINS];
+} ____cacheline_aligned_in_smp;
+
 struct kyber_queue_data {
 	struct request_queue *q;
 
@@ -84,6 +93,7 @@ struct 

Re: [PATCH] block: kyber: make kyber more friendly with merging

2018-05-22 Thread Holger Hoffstätte
On 05/22/18 16:48, Jianchao Wang wrote:
> Currently, kyber is very unfriendly with merging. kyber depends
> on ctx rq_list to do merging, however, most of time, it will not
> leave any requests in ctx rq_list. This is because even if tokens
> of one domain is used up, kyber will try to dispatch requests
> from other domain and flush the rq_list there.
> 
> To improve this, we setup kyber_ctx_queue (kcq) which is similar
> with ctx, but it has rq_lists for different domain and build same
> mapping between kcq and khd as the ctx & hctx. Then we could merge,
> insert and dispatch for different domains separately. If one domain
> token is used up, the requests could be left in the rq_list of
> that domain and maybe merged with following io.
> 
> Following is my test result on machine with 8 cores and NVMe card
> INTEL SSDPEKKR128G7
> 
> fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
> seq/random
> +------+----------+-----------+------------+-----------------+---------+
> |patch?| bw(MB/s) |   iops    | slat(usec) |   clat(usec)    |  merge  |
> +------+----------+-----------+------------+-----------------+---------+
> | w/o  |  606/612 | 151k/153k |  6.89/7.03 | 3349.21/3305.40 |   0/0   |
> +------+----------+-----------+------------+-----------------+---------+
> | w/   | 1083/616 | 277k/154k |  4.93/6.95 | 1830.62/3279.95 | 223k/3k |
> +------+----------+-----------+------------+-----------------+---------+
> When set numjobs to 16, the bw and iops could reach 1662MB/s and 425k
> on my platform.
> 
> Signed-off-by: Jianchao Wang 



This looks great but prevents kyber from being built as module,
which is AFAIK supposed to work (and works now):

..
  CC [M]  block/kyber-iosched.o
  Building modules, stage 2.
  MODPOST 313 modules
ERROR: "bio_attempt_back_merge" [block/kyber-iosched.ko] undefined!
ERROR: "bio_attempt_front_merge" [block/kyber-iosched.ko] undefined!
ERROR: "bio_attempt_discard_merge" [block/kyber-iosched.ko] undefined!
ERROR: "blk_try_merge" [block/kyber-iosched.ko] undefined!
ERROR: "blk_rq_merge_ok" [block/kyber-iosched.ko] undefined!
..

It does build fine when compiled in, obviously. :)

cheers,
Holger


Re: [PATCH] block: kyber: make kyber more friendly with merging

2018-05-22 Thread Jens Axboe
On 5/22/18 10:17 AM, Holger Hoffstätte wrote:
> On 05/22/18 16:48, Jianchao Wang wrote:
>> Currently, kyber is very unfriendly with merging. kyber depends
>> on ctx rq_list to do merging, however, most of time, it will not
>> leave any requests in ctx rq_list. This is because even if tokens
>> of one domain is used up, kyber will try to dispatch requests
>> from other domain and flush the rq_list there.
>>
>> To improve this, we setup kyber_ctx_queue (kcq) which is similar
>> with ctx, but it has rq_lists for different domain and build same
>> mapping between kcq and khd as the ctx & hctx. Then we could merge,
>> insert and dispatch for different domains separately. If one domain
>> token is used up, the requests could be left in the rq_list of
>> that domain and maybe merged with following io.
>>
>> Following is my test result on machine with 8 cores and NVMe card
>> INTEL SSDPEKKR128G7
>>
>> fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
>> seq/random
>> +------+----------+-----------+------------+-----------------+---------+
>> |patch?| bw(MB/s) |   iops    | slat(usec) |   clat(usec)    |  merge  |
>> +------+----------+-----------+------------+-----------------+---------+
>> | w/o  |  606/612 | 151k/153k |  6.89/7.03 | 3349.21/3305.40 |   0/0   |
>> +------+----------+-----------+------------+-----------------+---------+
>> | w/   | 1083/616 | 277k/154k |  4.93/6.95 | 1830.62/3279.95 | 223k/3k |
>> +------+----------+-----------+------------+-----------------+---------+
>> When set numjobs to 16, the bw and iops could reach 1662MB/s and 425k
>> on my platform.
>>
>> Signed-off-by: Jianchao Wang 
> 
> 
> 
> This looks great but prevents kyber from being built as module,
> which is AFAIK supposed to work (and works now):
> 
> ..
>   CC [M]  block/kyber-iosched.o
>   Building modules, stage 2.
>   MODPOST 313 modules
> ERROR: "bio_attempt_back_merge" [block/kyber-iosched.ko] undefined!
> ERROR: "bio_attempt_front_merge" [block/kyber-iosched.ko] undefined!
> ERROR: "bio_attempt_discard_merge" [block/kyber-iosched.ko] undefined!
> ERROR: "blk_try_merge" [block/kyber-iosched.ko] undefined!
> ERROR: "blk_rq_merge_ok" [block/kyber-iosched.ko] undefined!
> ..
> 
> It does build fine when compiled in, obviously. :)

It's basically duplicating the contents of blk_attempt_plug_merge().
I would suggest abstracting out the list loop and merge check
into a helper, that could then both be called from kyber and the
plug merge function.

-- 
Jens Axboe



[PATCH] block: kyber: make kyber more friendly with merging

2018-05-22 Thread Jianchao Wang
Currently, kyber is very unfriendly to merging. kyber depends on the
ctx rq_list for merging; however, most of the time it will not leave
any requests in the ctx rq_list. This is because even if the tokens of
one domain are used up, kyber will still try to dispatch requests from
the other domains and flush the rq_list in the process.

To improve this, we set up a kyber_ctx_queue (kcq), which is similar
to a ctx but has a separate rq_list per domain, and build the same
mapping between kcq and khd as between ctx and hctx. Then we can merge,
insert and dispatch for the different domains separately. If one
domain's tokens are used up, its requests can be left in that domain's
rq_list and possibly merged with following I/O.

The following is my test result on a machine with 8 cores and an
Intel SSDPEKKR128G7 NVMe card:

fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
seq/random
+------+----------+-----------+------------+-----------------+---------+
|patch?| bw(MB/s) |   iops    | slat(usec) |   clat(usec)    |  merge  |
+------+----------+-----------+------------+-----------------+---------+
| w/o  |  606/612 | 151k/153k |  6.89/7.03 | 3349.21/3305.40 |   0/0   |
+------+----------+-----------+------------+-----------------+---------+
| w/   | 1083/616 | 277k/154k |  4.93/6.95 | 1830.62/3279.95 | 223k/3k |
+------+----------+-----------+------------+-----------------+---------+
When numjobs is set to 16, the bandwidth and IOPS reach 1662MB/s and 425k
on my platform.

Signed-off-by: Jianchao Wang 
---
 block/kyber-iosched.c | 240 --
 1 file changed, 212 insertions(+), 28 deletions(-)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 0d6d25e3..04da05b 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -72,6 +72,15 @@ static const unsigned int kyber_batch_size[] = {
[KYBER_OTHER] = 8,
 };
 
+struct kyber_ctx_queue {
+	/*
+	 * Copied from blk_mq_ctx->index_hw
+	 */
+	unsigned int index;
+	spinlock_t lock;
+	struct list_head rq_list[KYBER_NUM_DOMAINS];
+} ____cacheline_aligned_in_smp;
+
 struct kyber_queue_data {
struct request_queue *q;
 
@@ -84,6 +93,7 @@ struct kyber_queue_data {
 */
struct sbitmap_queue domain_tokens[KYBER_NUM_DOMAINS];
 
+   struct kyber_ctx_queue *ctx_queue;
/*
 * Async request percentage, converted to per-word depth for
 * sbitmap_get_shallow().
@@ -99,6 +109,8 @@ struct kyber_hctx_data {
struct list_head rqs[KYBER_NUM_DOMAINS];
unsigned int cur_domain;
unsigned int batching;
+   struct kyber_ctx_queue **kcqs;
+   struct sbitmap kcq_map[KYBER_NUM_DOMAINS];
wait_queue_entry_t domain_wait[KYBER_NUM_DOMAINS];
struct sbq_wait_state *domain_ws[KYBER_NUM_DOMAINS];
atomic_t wait_index[KYBER_NUM_DOMAINS];
@@ -284,6 +296,19 @@ static unsigned int kyber_sched_tags_shift(struct kyber_queue_data *kqd)
return kqd->q->queue_hw_ctx[0]->sched_tags->bitmap_tags.sb.shift;
 }
 
+static void kyber_ctx_queue_init(struct kyber_queue_data *kqd)
+{
+	unsigned int i, j;
+
+	for (i = 0; i < nr_cpu_ids; i++) {
+		struct kyber_ctx_queue *kcq = &kqd->ctx_queue[i];
+
+		spin_lock_init(&kcq->lock);
+		for (j = 0; j < KYBER_NUM_DOMAINS; j++)
+			INIT_LIST_HEAD(&kcq->rq_list[j]);
+	}
+}
+
 static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
 {
struct kyber_queue_data *kqd;
@@ -302,6 +327,13 @@ static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
if (!kqd->cb)
goto err_kqd;
 
+   kqd->ctx_queue = kmalloc_array_node(nr_cpu_ids,
+   sizeof(struct kyber_ctx_queue), GFP_KERNEL, -1);
+   if (!kqd->ctx_queue)
+   goto err_cb;
+
+   kyber_ctx_queue_init(kqd);
+
/*
 * The maximum number of tokens for any scheduling domain is at least
 * the queue depth of a single hardware queue. If the hardware doesn't
@@ -318,7 +350,7 @@ static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
if (ret) {
while (--i >= 0)
sbitmap_queue_free(&kqd->domain_tokens[i]);
-   goto err_cb;
+   goto err_kcq;
}
sbitmap_queue_resize(>domain_tokens[i], kyber_depth[i]);
}
@@ -331,6 +363,8 @@ static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
 
return kqd;
 
+err_kcq:
+   kfree(kqd->ctx_queue);
 err_cb:
blk_stat_free_callback(kqd->cb);
 err_kqd:
@@ -372,6 +406,7 @@ static void kyber_exit_sched(struct elevator_queue *e)
 
for (i = 0; i < KYBER_NUM_DOMAINS; i++)
sbitmap_queue_free(&kqd->domain_tokens[i]);
+   kfree(kqd->ctx_queue);
blk_stat_free_callback(kqd->cb);
kfree(kqd);
 }
