Re: [PATCH] block: kyber: make kyber more friendly with merging
On Wed, May 30, 2018 at 4:36 PM, jianchao.wang wrote:
> Hi Ming
>
> Thanks for your kind response.
>
> On 05/30/2018 04:22 PM, Ming Lei wrote:
>>>> you could keep the software queues as-is but add our own version of
>>>> flush_busy_ctxs() that only removes requests of the domain that we want.
>>>> If one domain gets backed up, that might get messy with long iterations,
>>>> though.
>>> Yes, I also considered this approach :)
>>> But the long iterations on every ctx->rq_list look really inefficient.
>> Right, this list can be quite long if the dispatch token is used up.
>>
>> You might try to introduce a per-domain list into ctx directly; then
>> 'none' may benefit from this change too, since bio merge should actually
>> be done on the per-domain list.
>
> Yes, it may be good for merging with 'none', because the rq_list is split
> into 3 lists, so there is no need to iterate the whole rq_list any more.
> But what about dispatch when there is no io scheduler?

blk_mq_flush_busy_ctxs() and blk_mq_dequeue_from_ctx() should work fine in
the case of 'none' if a per-domain list is added to ctx. Then we can make
'none' a bit fair on READ/WRITE.

> We will dispatch requests from ctx one by one at the moment.
> If we have a per-domain list in ctx, we have to introduce some policies
> to determine which domain to dispatch from, and these policies should
> actually be in the io scheduler.

The policy is decided by the IO scheduler, and you can just pick requests
from the ctx/domain list easily by introducing one blk-mq core API.

Thanks,
Ming Lei
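The per-domain-list idea Ming sketches above can be modeled in a few lines of userspace C. This is a hypothetical illustration, not the real blk-mq API: each software queue (ctx) keeps one request list per scheduling domain, so both bio merging and the 'none' dispatch path can look at a single domain's list instead of walking one long mixed rq_list, and the "blk-mq core API" he mentions reduces to popping from a chosen domain's list.

```c
#include <assert.h>
#include <stddef.h>

/* illustrative domain names, mirroring kyber's READ/SYNC_WRITE/OTHER split */
enum { DOMAIN_READ, DOMAIN_SYNC_WRITE, DOMAIN_OTHER, NUM_DOMAINS };

struct request {
	int domain;
	struct request *next;
};

struct ctx {
	/* one list per domain instead of a single mixed rq_list */
	struct request *rq_list[NUM_DOMAINS];
};

/* insert at the head (simplified: the real code keeps FIFO order) */
static void ctx_insert(struct ctx *c, struct request *rq)
{
	rq->next = c->rq_list[rq->domain];
	c->rq_list[rq->domain] = rq;
}

/*
 * The core-API shape Ming suggests: the scheduler decides which domain
 * to serve; the core just pops from that domain's list, with no policy.
 */
static struct request *ctx_dequeue_domain(struct ctx *c, int domain)
{
	struct request *rq = c->rq_list[domain];

	if (rq)
		c->rq_list[domain] = rq->next;
	return rq;
}
```

With this shape, merging against only same-domain requests and leaving a backed-up domain's requests queued both fall out naturally.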
Re: [PATCH] block: kyber: make kyber more friendly with merging
Hi Jens and Holger,

Thanks for your kind responses. That's really appreciated.
I will post the next version based on Jens' patch.

Thanks,
Jianchao

On 05/23/2018 02:32 AM, Holger Hoffstätte wrote:
>>>> This looks great but prevents kyber from being built as a module,
>>>> which is AFAIK supposed to work (and works now):
>>>>
>>>> ..
>>>>   CC [M]  block/kyber-iosched.o
>>>>   Building modules, stage 2.
>>>>   MODPOST 313 modules
>>>> ERROR: "bio_attempt_back_merge" [block/kyber-iosched.ko] undefined!
>>>> ERROR: "bio_attempt_front_merge" [block/kyber-iosched.ko] undefined!
>>>> ERROR: "bio_attempt_discard_merge" [block/kyber-iosched.ko] undefined!
>>>> ERROR: "blk_try_merge" [block/kyber-iosched.ko] undefined!
>>>> ERROR: "blk_rq_merge_ok" [block/kyber-iosched.ko] undefined!
>>>> ..
>>>>
>>>> It does build fine when compiled in, obviously. :)
>>> It's basically duplicating the contents of blk_attempt_plug_merge().
>>> I would suggest abstracting out the list loop and merge check
>>> into a helper, that could then both be called from kyber and the
>>> plug merge function.
>> See attached, prep patch and yours rebased on top of it.
> That was quick. :)
>
> Applies smoothly on top of my 4.16++ tree, now builds correctly as
> a module and is reproducibly (slightly) faster even on my pedestrian
> SATA SSDs, now on par with or occasionally even faster than mq-deadline.
> What's not to like? So:
>
> Tested-by: Holger Hoffstätte
Re: [PATCH] block: kyber: make kyber more friendly with merging
Hi Omar,

Thanks for your kind response.

On 05/23/2018 04:02 AM, Omar Sandoval wrote:
> On Tue, May 22, 2018 at 10:48:29PM +0800, Jianchao Wang wrote:
>> Currently, kyber is very unfriendly with merging. kyber depends
>> on ctx rq_list to do merging, however, most of the time, it will not
>> leave any requests in ctx rq_list. This is because even if the tokens
>> of one domain are used up, kyber will try to dispatch requests
>> from the other domains and flush the rq_list there.
>
> That's a great catch, I totally missed this.
>
> This approach does end up duplicating a lot of code with the blk-mq core
> even after Jens' change, so I'm curious if you tried other approaches.
> One idea I had is to try the bio merge against the kqd->rqs lists. Since
> that's per-queue, the locking overhead might be too high. Alternatively,

Yes, I made a patch as you describe, trying the bio merge against
kqd->rqs directly. That patch looks even simpler. However, because
khd->lock is needed every time we try a bio merge, there may be high
contention overhead on khd->lock when the cpu-hctx mapping is not 1:1.

> you could keep the software queues as-is but add our own version of
> flush_busy_ctxs() that only removes requests of the domain that we want.
> If one domain gets backed up, that might get messy with long iterations,
> though.

Yes, I also considered this approach :)
But the long iterations on every ctx->rq_list look really inefficient.

>
> Regarding this approach, a couple of comments below.

...

>>  }
>> @@ -379,12 +414,33 @@ static void kyber_exit_sched(struct elevator_queue *e)
>>  static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
>>  {
>>  	struct kyber_hctx_data *khd;
>> +	struct kyber_queue_data *kqd = hctx->queue->elevator->elevator_data;
>>  	int i;
>> +	int sd;
>>
>>  	khd = kmalloc_node(sizeof(*khd), GFP_KERNEL, hctx->numa_node);
>>  	if (!khd)
>>  		return -ENOMEM;
>>
>> +	khd->kcqs = kmalloc_array_node(nr_cpu_ids, sizeof(void *),
>> +				       GFP_KERNEL, hctx->numa_node);
>> +	if (!khd->kcqs)
>> +		goto err_khd;
>
> Why the double indirection of a percpu allocation per hardware queue
> here? With, say, 56 cpus and that many hardware queues, that's 3136
> pointers, which seems like overkill. Can't you just use the percpu array
> in the kqd directly, or make it per-hardware queue instead?

Oops, I forgot to change nr_cpu_ids to hctx->nr_ctx.
The mapping between cpu and hctx has already been set up when
kyber_init_hctx() is invoked, so we just need to allocate
hctx->nr_ctx * sizeof(struct kyber_ctx_queue) per khd.

...

>> +static int bio_sched_domain(const struct bio *bio)
>> +{
>> +	unsigned int op = bio->bi_opf;
>> +
>> +	if ((op & REQ_OP_MASK) == REQ_OP_READ)
>> +		return KYBER_READ;
>> +	else if ((op & REQ_OP_MASK) == REQ_OP_WRITE && op_is_sync(op))
>> +		return KYBER_SYNC_WRITE;
>> +	else
>> +		return KYBER_OTHER;
>> +}
>
> Please add a common helper for rq_sched_domain() and bio_sched_domain()
> instead of duplicating the logic.

Yes, I will do it in the next version.

Thanks,
Jianchao
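The common helper Omar asks for can be sketched as a single classifier over the operation flags, which both bio_sched_domain() and rq_sched_domain() then call. The sketch below is a standalone userspace model: the `REQ_OP_*`/`REQ_SYNC` constants and the one-flag sync check are simplified stand-ins for the kernel's definitions (where op_is_sync() checks more than one flag), not the real ones.

```c
#include <assert.h>

enum kyber_domain { KYBER_READ, KYBER_SYNC_WRITE, KYBER_OTHER };

/* simplified stand-ins for the kernel's blk_types.h flag layout */
#define REQ_OP_BITS  8
#define REQ_OP_MASK  ((1u << REQ_OP_BITS) - 1)
#define REQ_OP_READ  0u
#define REQ_OP_WRITE 1u
#define REQ_SYNC     (1u << REQ_OP_BITS)   /* illustrative sync flag */

/* the shared classifier: one copy of the domain logic */
static int kyber_sched_domain(unsigned int op)
{
	if ((op & REQ_OP_MASK) == REQ_OP_READ)
		return KYBER_READ;
	if ((op & REQ_OP_MASK) == REQ_OP_WRITE && (op & REQ_SYNC))
		return KYBER_SYNC_WRITE;
	return KYBER_OTHER;
}

/* both call sites reduce to one line each */
struct bio { unsigned int bi_opf; };
struct request { unsigned int cmd_flags; };

static int bio_sched_domain(const struct bio *bio)
{
	return kyber_sched_domain(bio->bi_opf);
}

static int rq_sched_domain(const struct request *rq)
{
	return kyber_sched_domain(rq->cmd_flags);
}
```

Keeping the flag test in one place means a future change to the domain rules (say, splitting out discards) only touches kyber_sched_domain().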
Re: [PATCH] block: kyber: make kyber more friendly with merging
On Tue, May 22, 2018 at 10:48:29PM +0800, Jianchao Wang wrote:
> Currently, kyber is very unfriendly with merging. kyber depends
> on ctx rq_list to do merging, however, most of the time, it will not
> leave any requests in ctx rq_list. This is because even if the tokens
> of one domain are used up, kyber will try to dispatch requests
> from the other domains and flush the rq_list there.

That's a great catch, I totally missed this.

This approach does end up duplicating a lot of code with the blk-mq core
even after Jens' change, so I'm curious if you tried other approaches.
One idea I had is to try the bio merge against the kqd->rqs lists. Since
that's per-queue, the locking overhead might be too high. Alternatively,
you could keep the software queues as-is but add our own version of
flush_busy_ctxs() that only removes requests of the domain that we want.
If one domain gets backed up, that might get messy with long iterations,
though.

Regarding this approach, a couple of comments below.

> To improve this, we set up kyber_ctx_queue (kcq), which is similar
> to ctx, but has rq_lists for the different domains, and build the same
> mapping between kcq and khd as between ctx and hctx. Then we can merge,
> insert and dispatch for the domains separately. If one domain's tokens
> are used up, the requests can be left in that domain's rq_list and
> maybe merged with following io.
>
> Following is my test result on a machine with 8 cores and an NVMe card
> INTEL SSDPEKKR128G7
>
> fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
> seq/random
> +------+----------+-----------+------------+-----------------+---------+
> |patch?| bw(MB/s) | iops      | slat(usec) | clat(usec)      | merge   |
> +------+----------+-----------+------------+-----------------+---------+
> | w/o  | 606/612  | 151k/153k | 6.89/7.03  | 3349.21/3305.40 | 0/0     |
> +------+----------+-----------+------------+-----------------+---------+
> | w/   | 1083/616 | 277k/154k | 4.93/6.95  | 1830.62/3279.95 | 223k/3k |
> +------+----------+-----------+------------+-----------------+---------+
> When numjobs is set to 16, the bw and iops can reach 1662MB/s and 425k
> on my platform.
>
> Signed-off-by: Jianchao Wang
> ---
>  block/kyber-iosched.c | 240 ++++++++++++++++++++++++++++++++++++------
>  1 file changed, 212 insertions(+), 28 deletions(-)
>
> diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
> index 0d6d25e3..04da05b 100644
> --- a/block/kyber-iosched.c
> +++ b/block/kyber-iosched.c
> @@ -72,6 +72,15 @@ static const unsigned int kyber_batch_size[] = {
>  	[KYBER_OTHER] = 8,
>  };
>
> +struct kyber_ctx_queue {
> +	/*
> +	 * Copied from blk_mq_ctx->index_hw
> +	 */
> +	unsigned int index;
> +	spinlock_t lock;
> +	struct list_head rq_list[KYBER_NUM_DOMAINS];
> +} ____cacheline_aligned_in_smp;
> +
>  struct kyber_queue_data {
>  	struct request_queue *q;
>
> @@ -84,6 +93,7 @@ struct kyber_queue_data {
>  	 */
>  	struct sbitmap_queue domain_tokens[KYBER_NUM_DOMAINS];
>
> +	struct kyber_ctx_queue *ctx_queue;
>  	/*
>  	 * Async request percentage, converted to per-word depth for
>  	 * sbitmap_get_shallow().
> @@ -99,6 +109,8 @@ struct kyber_hctx_data {
>  	struct list_head rqs[KYBER_NUM_DOMAINS];
>  	unsigned int cur_domain;
>  	unsigned int batching;
> +	struct kyber_ctx_queue **kcqs;
> +	struct sbitmap kcq_map[KYBER_NUM_DOMAINS];
>  	wait_queue_entry_t domain_wait[KYBER_NUM_DOMAINS];
>  	struct sbq_wait_state *domain_ws[KYBER_NUM_DOMAINS];
>  	atomic_t wait_index[KYBER_NUM_DOMAINS];
> @@ -284,6 +296,19 @@ static unsigned int kyber_sched_tags_shift(struct kyber_queue_data *kqd)
>  	return kqd->q->queue_hw_ctx[0]->sched_tags->bitmap_tags.sb.shift;
>  }
>
> +static void kyber_ctx_queue_init(struct kyber_queue_data *kqd)
> +{
> +	unsigned int i, j;
> +
> +	for (i = 0; i < nr_cpu_ids; i++) {
> +		struct kyber_ctx_queue *kcq = &kqd->ctx_queue[i];
> +
> +		spin_lock_init(&kcq->lock);
> +		for (j = 0; j < KYBER_NUM_DOMAINS; j++)
> +			INIT_LIST_HEAD(&kcq->rq_list[j]);
> +	}
> +}
> +
>  static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
>  {
>  	struct kyber_queue_data *kqd;
> @@ -302,6 +327,13 @@ static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
>  	if (!kqd->cb)
>  		goto err_kqd;
>
> +	kqd->ctx_queue = kmalloc_array_node(nr_cpu_ids,
> +		sizeof(struct kyber_ctx_queue), GFP_KERNEL, -1);

The whitespace here and in several other places is weird, please run
this through checkpatch.

> +	if (!kqd->ctx_queue)
> +		goto err_cb;
> +
> +	kyber_ctx_queue_init(kqd);
> +
>  	/*
>  	 * The maximum number of tokens for any scheduling domain is at least
>  	 * the queue depth of a single hardware queue. If the hardware doesn't
> @@
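Omar's alternative, a flush_busy_ctxs() variant that removes only one domain's requests from the existing mixed per-ctx lists, can be modeled in userspace C. The sketch below is illustrative, not the kernel's list API; the downside he notes is visible in the loop, which must still walk every request of every ctx, including requests of backed-up domains it will not take.

```c
#include <assert.h>
#include <stddef.h>

struct request {
	int domain;
	struct request *next;
};

/*
 * Remove all requests of @domain from the singly linked *list and return
 * them; requests of other domains stay queued in place.
 */
static struct request *flush_domain(struct request **list, int domain)
{
	struct request *out = NULL, **pp = list;

	while (*pp) {
		if ((*pp)->domain == domain) {
			struct request *rq = *pp;

			*pp = rq->next;    /* unlink from the ctx list */
			rq->next = out;    /* move onto the flushed list */
			out = rq;
		} else {
			pp = &(*pp)->next; /* skip other domains' requests */
		}
	}
	return out;
}
```

The per-domain kcq lists in the patch avoid this walk entirely: requests are already segregated at insert time, so flushing a domain is a constant-time list splice.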
Re: [PATCH] block: kyber: make kyber more friendly with merging
On 05/22/18 19:46, Jens Axboe wrote:
> On 5/22/18 10:20 AM, Jens Axboe wrote:
>> On 5/22/18 10:17 AM, Holger Hoffstätte wrote:
>>> On 05/22/18 16:48, Jianchao Wang wrote:
>>>> Currently, kyber is very unfriendly with merging. kyber depends
>>>> on ctx rq_list to do merging, however, most of the time, it will not
>>>> leave any requests in ctx rq_list. This is because even if the tokens
>>>> of one domain are used up, kyber will try to dispatch requests
>>>> from the other domains and flush the rq_list there.
>>>>
>>>> To improve this, we set up kyber_ctx_queue (kcq), which is similar
>>>> to ctx, but has rq_lists for the different domains, and build the same
>>>> mapping between kcq and khd as between ctx and hctx. Then we can merge,
>>>> insert and dispatch for the domains separately. If one domain's tokens
>>>> are used up, the requests can be left in that domain's rq_list and
>>>> maybe merged with following io.
>>>>
>>>> Following is my test result on a machine with 8 cores and an NVMe card
>>>> INTEL SSDPEKKR128G7
>>>>
>>>> fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
>>>> seq/random
>>>> +------+----------+-----------+------------+-----------------+---------+
>>>> |patch?| bw(MB/s) | iops      | slat(usec) | clat(usec)      | merge   |
>>>> +------+----------+-----------+------------+-----------------+---------+
>>>> | w/o  | 606/612  | 151k/153k | 6.89/7.03  | 3349.21/3305.40 | 0/0     |
>>>> +------+----------+-----------+------------+-----------------+---------+
>>>> | w/   | 1083/616 | 277k/154k | 4.93/6.95  | 1830.62/3279.95 | 223k/3k |
>>>> +------+----------+-----------+------------+-----------------+---------+
>>>> When numjobs is set to 16, the bw and iops can reach 1662MB/s and 425k
>>>> on my platform.
>>>>
>>>> Signed-off-by: Jianchao Wang
>>>
>>> This looks great but prevents kyber from being built as a module,
>>> which is AFAIK supposed to work (and works now):
>>>
>>> ..
>>>   CC [M]  block/kyber-iosched.o
>>>   Building modules, stage 2.
>>>   MODPOST 313 modules
>>> ERROR: "bio_attempt_back_merge" [block/kyber-iosched.ko] undefined!
>>> ERROR: "bio_attempt_front_merge" [block/kyber-iosched.ko] undefined!
>>> ERROR: "bio_attempt_discard_merge" [block/kyber-iosched.ko] undefined!
>>> ERROR: "blk_try_merge" [block/kyber-iosched.ko] undefined!
>>> ERROR: "blk_rq_merge_ok" [block/kyber-iosched.ko] undefined!
>>> ..
>>>
>>> It does build fine when compiled in, obviously. :)
>>
>> It's basically duplicating the contents of blk_attempt_plug_merge().
>> I would suggest abstracting out the list loop and merge check
>> into a helper, that could then both be called from kyber and the
>> plug merge function.
>
> See attached, prep patch and yours rebased on top of it.

That was quick. :)

Applies smoothly on top of my 4.16++ tree, now builds correctly as
a module and is reproducibly (slightly) faster even on my pedestrian
SATA SSDs, now on par with or occasionally even faster than mq-deadline.
What's not to like? So:

Tested-by: Holger Hoffstätte

cheers,
Holger
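The helper Jens suggests, one list-walking merge loop shared by blk_attempt_plug_merge() and kyber, can be sketched as below. This is a hedged userspace model under simplifying assumptions: sector contiguity stands in for the real blk_rq_merge_ok()/blk_try_merge() checks, only back-merges are modeled, and the names are illustrative rather than the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct bio { unsigned long sector; unsigned long sectors; };

struct request {
	unsigned long sector;
	unsigned long sectors;
	struct request *next;
};

/* back-merge only: the bio starts exactly where the request ends */
static bool try_merge(struct request *rq, const struct bio *bio)
{
	if (bio->sector == rq->sector + rq->sectors) {
		rq->sectors += bio->sectors;
		return true;
	}
	return false;
}

/*
 * The shared helper: walk @list and try to merge @bio into each candidate.
 * Both the plug-merge path and kyber's kcq merge path would call this
 * instead of open-coding the loop.
 */
static bool bio_list_merge(struct request *list, const struct bio *bio)
{
	for (struct request *rq = list; rq; rq = rq->next)
		if (try_merge(rq, bio))
			return true;
	return false;
}
```

Factoring the loop out this way also avoids the module-build failure above: kyber no longer calls the unexported bio_attempt_*_merge() functions directly, only the one shared (exported) helper.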
Re: [PATCH] block: kyber: make kyber more friendly with merging
On 5/22/18 10:20 AM, Jens Axboe wrote:
> On 5/22/18 10:17 AM, Holger Hoffstätte wrote:
>> On 05/22/18 16:48, Jianchao Wang wrote:
>>> Currently, kyber is very unfriendly with merging. kyber depends
>>> on ctx rq_list to do merging, however, most of the time, it will not
>>> leave any requests in ctx rq_list. This is because even if the tokens
>>> of one domain are used up, kyber will try to dispatch requests
>>> from the other domains and flush the rq_list there.
>>>
>>> To improve this, we set up kyber_ctx_queue (kcq), which is similar
>>> to ctx, but has rq_lists for the different domains, and build the same
>>> mapping between kcq and khd as between ctx and hctx. Then we can merge,
>>> insert and dispatch for the domains separately. If one domain's tokens
>>> are used up, the requests can be left in that domain's rq_list and
>>> maybe merged with following io.
>>>
>>> Following is my test result on a machine with 8 cores and an NVMe card
>>> INTEL SSDPEKKR128G7
>>>
>>> fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
>>> seq/random
>>> +------+----------+-----------+------------+-----------------+---------+
>>> |patch?| bw(MB/s) | iops      | slat(usec) | clat(usec)      | merge   |
>>> +------+----------+-----------+------------+-----------------+---------+
>>> | w/o  | 606/612  | 151k/153k | 6.89/7.03  | 3349.21/3305.40 | 0/0     |
>>> +------+----------+-----------+------------+-----------------+---------+
>>> | w/   | 1083/616 | 277k/154k | 4.93/6.95  | 1830.62/3279.95 | 223k/3k |
>>> +------+----------+-----------+------------+-----------------+---------+
>>> When numjobs is set to 16, the bw and iops can reach 1662MB/s and 425k
>>> on my platform.
>>>
>>> Signed-off-by: Jianchao Wang
>>
>> This looks great but prevents kyber from being built as a module,
>> which is AFAIK supposed to work (and works now):
>>
>> ..
>>   CC [M]  block/kyber-iosched.o
>>   Building modules, stage 2.
>>   MODPOST 313 modules
>> ERROR: "bio_attempt_back_merge" [block/kyber-iosched.ko] undefined!
>> ERROR: "bio_attempt_front_merge" [block/kyber-iosched.ko] undefined!
>> ERROR: "bio_attempt_discard_merge" [block/kyber-iosched.ko] undefined!
>> ERROR: "blk_try_merge" [block/kyber-iosched.ko] undefined!
>> ERROR: "blk_rq_merge_ok" [block/kyber-iosched.ko] undefined!
>> ..
>>
>> It does build fine when compiled in, obviously. :)

It's basically duplicating the contents of blk_attempt_plug_merge().
I would suggest abstracting out the list loop and merge check
into a helper, that could then both be called from kyber and the
plug merge function.

See attached, prep patch and yours rebased on top of it.

-- 
Jens Axboe

From edb57af07cf619f27a3fee85705870875f853f58 Mon Sep 17 00:00:00 2001
From: Jianchao Wang
Date: Tue, 22 May 2018 11:45:01 -0600
Subject: [PATCH 2/2] block: kyber: make kyber more friendly with merging

Currently, kyber is very unfriendly with merging. kyber depends
on ctx rq_list to do merging, however, most of the time, it will not
leave any requests in ctx rq_list. This is because even if the tokens
of one domain are used up, kyber will try to dispatch requests
from the other domains and flush the rq_list there.

To improve this, we set up kyber_ctx_queue (kcq), which is similar
to ctx, but has rq_lists for the different domains, and build the same
mapping between kcq and khd as between ctx and hctx. Then we can merge,
insert and dispatch for the domains separately. If one domain's tokens
are used up, the requests can be left in that domain's rq_list and
maybe merged with following io.

Following is my test result on a machine with 8 cores and an NVMe card
INTEL SSDPEKKR128G7

fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
seq/random
+------+----------+-----------+------------+-----------------+---------+
|patch?| bw(MB/s) | iops      | slat(usec) | clat(usec)      | merge   |
+------+----------+-----------+------------+-----------------+---------+
| w/o  | 606/612  | 151k/153k | 6.89/7.03  | 3349.21/3305.40 | 0/0     |
+------+----------+-----------+------------+-----------------+---------+
| w/   | 1083/616 | 277k/154k | 4.93/6.95  | 1830.62/3279.95 | 223k/3k |
+------+----------+-----------+------------+-----------------+---------+
When numjobs is set to 16, the bw and iops can reach 1662MB/s and 425k
on my platform.

Signed-off-by: Jianchao Wang
---
 block/kyber-iosched.c | 212 +++++++++++++++++++++++++++++++++++------
 1 file changed, 184 insertions(+), 28 deletions(-)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 5b33dc394cc7..f28147be729b 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -72,6 +72,15 @@ static const unsigned int kyber_batch_size[] = {
 	[KYBER_OTHER] = 8,
 };
 
+struct kyber_ctx_queue {
+	/*
+	 * Copied from blk_mq_ctx->index_hw
+	 */
+	unsigned int index;
+	spinlock_t lock;
+	struct list_head rq_list[KYBER_NUM_DOMAINS];
+} ____cacheline_aligned_in_smp;
+
 struct kyber_queue_data {
 	struct request_queue *q;
 
@@ -84,6 +93,7 @@ struct
Re: [PATCH] block: kyber: make kyber more friendly with merging
On 5/22/18 10:20 AM, Jens Axboe wrote:
> On 5/22/18 10:17 AM, Holger Hoffstätte wrote:
>> On 05/22/18 16:48, Jianchao Wang wrote:
>>> Currently, kyber is very unfriendly with merging. kyber depends on
>>> the ctx rq_list to do merging; however, most of the time it will
>>> not leave any requests in the ctx rq_list. This is because even if
>>> the tokens of one domain are used up, kyber will try to dispatch
>>> requests from the other domains and flush the rq_list there.
>>>
>>> To improve this, we set up a kyber_ctx_queue (kcq), which is similar
>>> to ctx but has one rq_list per domain, and build the same mapping
>>> between kcq and khd as between ctx and hctx. Then we can merge,
>>> insert and dispatch for the different domains separately. If one
>>> domain's tokens are used up, its requests can be left on that
>>> domain's rq_list and may be merged with following IO.
>>>
>>> Following is my test result on a machine with 8 cores and an NVMe
>>> card (INTEL SSDPEKKR128G7):
>>>
>>> fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
>>> seq/random
>>>
>>> +------+----------+-----------+------------+-----------------+---------+
>>> |patch?| bw(MB/s) | iops      | slat(usec) | clat(usec)      | merge   |
>>> +------+----------+-----------+------------+-----------------+---------+
>>> | w/o  | 606/612  | 151k/153k | 6.89/7.03  | 3349.21/3305.40 | 0/0     |
>>> +------+----------+-----------+------------+-----------------+---------+
>>> | w/   | 1083/616 | 277k/154k | 4.93/6.95  | 1830.62/3279.95 | 223k/3k |
>>> +------+----------+-----------+------------+-----------------+---------+
>>>
>>> When numjobs is set to 16, the bw and iops can reach 1662MB/s and
>>> 425k on my platform.
>>>
>>> Signed-off-by: Jianchao Wang
>>
>> This looks great but prevents kyber from being built as a module,
>> which is AFAIK supposed to work (and works now):
>>
>> ..
>>   CC [M]  block/kyber-iosched.o
>>   Building modules, stage 2.
>>   MODPOST 313 modules
>> ERROR: "bio_attempt_back_merge" [block/kyber-iosched.ko] undefined!
>> ERROR: "bio_attempt_front_merge" [block/kyber-iosched.ko] undefined!
>> ERROR: "bio_attempt_discard_merge" [block/kyber-iosched.ko] undefined!
>> ERROR: "blk_try_merge" [block/kyber-iosched.ko] undefined!
>> ERROR: "blk_rq_merge_ok" [block/kyber-iosched.ko] undefined!
>> ..
>>
>> It does build fine when compiled in, obviously. :)
>
> It's basically duplicating the contents of blk_attempt_plug_merge().
> I would suggest abstracting out the list loop and merge check into a
> helper, which could then be called from both kyber and the plug merge
> function.

See attached, prep patch and yours rebased on top of it.

-- 
Jens Axboe

From edb57af07cf619f27a3fee85705870875f853f58 Mon Sep 17 00:00:00 2001
From: Jianchao Wang
Date: Tue, 22 May 2018 11:45:01 -0600
Subject: [PATCH 2/2] block: kyber: make kyber more friendly with merging

Currently, kyber is very unfriendly with merging. kyber depends on
the ctx rq_list to do merging; however, most of the time it will
not leave any requests in the ctx rq_list. This is because even if
the tokens of one domain are used up, kyber will try to dispatch
requests from the other domains and flush the rq_list there.

To improve this, we set up a kyber_ctx_queue (kcq), which is similar
to ctx but has one rq_list per domain, and build the same mapping
between kcq and khd as between ctx and hctx. Then we can merge,
insert and dispatch for the different domains separately. If one
domain's tokens are used up, its requests can be left on that
domain's rq_list and may be merged with following IO.

Following is my test result on a machine with 8 cores and an NVMe
card (INTEL SSDPEKKR128G7):

fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
seq/random

+------+----------+-----------+------------+-----------------+---------+
|patch?| bw(MB/s) | iops      | slat(usec) | clat(usec)      | merge   |
+------+----------+-----------+------------+-----------------+---------+
| w/o  | 606/612  | 151k/153k | 6.89/7.03  | 3349.21/3305.40 | 0/0     |
+------+----------+-----------+------------+-----------------+---------+
| w/   | 1083/616 | 277k/154k | 4.93/6.95  | 1830.62/3279.95 | 223k/3k |
+------+----------+-----------+------------+-----------------+---------+

When numjobs is set to 16, the bw and iops can reach 1662MB/s and
425k on my platform.

Signed-off-by: Jianchao Wang
---
 block/kyber-iosched.c | 212 +++---
 1 file changed, 184 insertions(+), 28 deletions(-)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 5b33dc394cc7..f28147be729b 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -72,6 +72,15 @@ static const unsigned int kyber_batch_size[] = {
 	[KYBER_OTHER] = 8,
 };
 
+struct kyber_ctx_queue {
+	/*
+	 * Copied from blk_mq_ctx->index_hw
+	 */
+	unsigned int index;
+	spinlock_t lock;
+	struct list_head rq_list[KYBER_NUM_DOMAINS];
+} ____cacheline_aligned_in_smp;
+
 struct kyber_queue_data {
 	struct request_queue *q;
 
@@ -84,6 +93,7 @@ struct
Re: [PATCH] block: kyber: make kyber more friendly with merging
On 05/22/18 16:48, Jianchao Wang wrote:
> Currently, kyber is very unfriendly with merging. kyber depends on
> the ctx rq_list to do merging; however, most of the time it will
> not leave any requests in the ctx rq_list. This is because even if
> the tokens of one domain are used up, kyber will try to dispatch
> requests from the other domains and flush the rq_list there.
>
> To improve this, we set up a kyber_ctx_queue (kcq), which is similar
> to ctx but has one rq_list per domain, and build the same mapping
> between kcq and khd as between ctx and hctx. Then we can merge,
> insert and dispatch for the different domains separately. If one
> domain's tokens are used up, its requests can be left on that
> domain's rq_list and may be merged with following IO.
>
> Following is my test result on a machine with 8 cores and an NVMe
> card (INTEL SSDPEKKR128G7):
>
> fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
> seq/random
>
> +------+----------+-----------+------------+-----------------+---------+
> |patch?| bw(MB/s) | iops      | slat(usec) | clat(usec)      | merge   |
> +------+----------+-----------+------------+-----------------+---------+
> | w/o  | 606/612  | 151k/153k | 6.89/7.03  | 3349.21/3305.40 | 0/0     |
> +------+----------+-----------+------------+-----------------+---------+
> | w/   | 1083/616 | 277k/154k | 4.93/6.95  | 1830.62/3279.95 | 223k/3k |
> +------+----------+-----------+------------+-----------------+---------+
>
> When numjobs is set to 16, the bw and iops can reach 1662MB/s and
> 425k on my platform.
>
> Signed-off-by: Jianchao Wang

This looks great but prevents kyber from being built as a module,
which is AFAIK supposed to work (and works now):

..
  CC [M]  block/kyber-iosched.o
  Building modules, stage 2.
  MODPOST 313 modules
ERROR: "bio_attempt_back_merge" [block/kyber-iosched.ko] undefined!
ERROR: "bio_attempt_front_merge" [block/kyber-iosched.ko] undefined!
ERROR: "bio_attempt_discard_merge" [block/kyber-iosched.ko] undefined!
ERROR: "blk_try_merge" [block/kyber-iosched.ko] undefined!
ERROR: "blk_rq_merge_ok" [block/kyber-iosched.ko] undefined!
..

It does build fine when compiled in, obviously. :)

cheers,
Holger
Re: [PATCH] block: kyber: make kyber more friendly with merging
On 5/22/18 10:17 AM, Holger Hoffstätte wrote:
> On 05/22/18 16:48, Jianchao Wang wrote:
>> Currently, kyber is very unfriendly with merging. kyber depends on
>> the ctx rq_list to do merging; however, most of the time it will
>> not leave any requests in the ctx rq_list. This is because even if
>> the tokens of one domain are used up, kyber will try to dispatch
>> requests from the other domains and flush the rq_list there.
>>
>> To improve this, we set up a kyber_ctx_queue (kcq), which is similar
>> to ctx but has one rq_list per domain, and build the same mapping
>> between kcq and khd as between ctx and hctx. Then we can merge,
>> insert and dispatch for the different domains separately. If one
>> domain's tokens are used up, its requests can be left on that
>> domain's rq_list and may be merged with following IO.
>>
>> Following is my test result on a machine with 8 cores and an NVMe
>> card (INTEL SSDPEKKR128G7):
>>
>> fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
>> seq/random
>>
>> +------+----------+-----------+------------+-----------------+---------+
>> |patch?| bw(MB/s) | iops      | slat(usec) | clat(usec)      | merge   |
>> +------+----------+-----------+------------+-----------------+---------+
>> | w/o  | 606/612  | 151k/153k | 6.89/7.03  | 3349.21/3305.40 | 0/0     |
>> +------+----------+-----------+------------+-----------------+---------+
>> | w/   | 1083/616 | 277k/154k | 4.93/6.95  | 1830.62/3279.95 | 223k/3k |
>> +------+----------+-----------+------------+-----------------+---------+
>>
>> When numjobs is set to 16, the bw and iops can reach 1662MB/s and
>> 425k on my platform.
>>
>> Signed-off-by: Jianchao Wang
>
> This looks great but prevents kyber from being built as a module,
> which is AFAIK supposed to work (and works now):
>
> ..
>   CC [M]  block/kyber-iosched.o
>   Building modules, stage 2.
>   MODPOST 313 modules
> ERROR: "bio_attempt_back_merge" [block/kyber-iosched.ko] undefined!
> ERROR: "bio_attempt_front_merge" [block/kyber-iosched.ko] undefined!
> ERROR: "bio_attempt_discard_merge" [block/kyber-iosched.ko] undefined!
> ERROR: "blk_try_merge" [block/kyber-iosched.ko] undefined!
> ERROR: "blk_rq_merge_ok" [block/kyber-iosched.ko] undefined!
> ..
>
> It does build fine when compiled in, obviously. :)

It's basically duplicating the contents of blk_attempt_plug_merge().
I would suggest abstracting out the list loop and merge check into a
helper, which could then be called from both kyber and the plug merge
function.

-- 
Jens Axboe
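The loop-plus-merge-check being factored out here can be modeled in user-space C. This is a simplified sketch, not the kernel code: `struct bio` and `struct request` are stand-in types, and `try_merge()` plays the role of blk_try_merge()/bio_attempt_back_merge() for the back-merge case only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct bio and struct request. */
struct bio     { unsigned long sector; unsigned int nr_sectors; };
struct request { unsigned long sector; unsigned int nr_sectors; struct request *prev; };

/* Back-merge check: the bio starts exactly where the request ends. */
static bool try_merge(struct request *rq, const struct bio *bio)
{
	if (rq->sector + rq->nr_sectors == bio->sector) {
		rq->nr_sectors += bio->nr_sectors;
		return true;
	}
	return false;
}

/*
 * The abstracted helper: walk the list from the newest request and
 * attempt a merge, giving up after a handful of candidates (the
 * kernel's plug-merge loop similarly bounds how far back it looks).
 */
static bool bio_list_merge(struct request *newest, struct bio *bio)
{
	int candidates = 8;

	for (struct request *rq = newest; rq && candidates--; rq = rq->prev)
		if (try_merge(rq, bio))
			return true;
	return false;
}
```

Both kyber's per-domain lists and the plug list could then call one such helper instead of each duplicating the scan.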
[PATCH] block: kyber: make kyber more friendly with merging
Currently, kyber is very unfriendly with merging. kyber depends on
the ctx rq_list to do merging; however, most of the time it will
not leave any requests in the ctx rq_list. This is because even if
the tokens of one domain are used up, kyber will try to dispatch
requests from the other domains and flush the rq_list there.

To improve this, we set up a kyber_ctx_queue (kcq), which is similar
to ctx but has one rq_list per domain, and build the same mapping
between kcq and khd as between ctx and hctx. Then we can merge,
insert and dispatch for the different domains separately. If one
domain's tokens are used up, its requests can be left on that
domain's rq_list and may be merged with following IO.

Following is my test result on a machine with 8 cores and an NVMe
card (INTEL SSDPEKKR128G7):

fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
seq/random

+------+----------+-----------+------------+-----------------+---------+
|patch?| bw(MB/s) | iops      | slat(usec) | clat(usec)      | merge   |
+------+----------+-----------+------------+-----------------+---------+
| w/o  | 606/612  | 151k/153k | 6.89/7.03  | 3349.21/3305.40 | 0/0     |
+------+----------+-----------+------------+-----------------+---------+
| w/   | 1083/616 | 277k/154k | 4.93/6.95  | 1830.62/3279.95 | 223k/3k |
+------+----------+-----------+------------+-----------------+---------+

When numjobs is set to 16, the bw and iops can reach 1662MB/s and
425k on my platform.

Signed-off-by: Jianchao Wang
---
 block/kyber-iosched.c | 240 --
 1 file changed, 212 insertions(+), 28 deletions(-)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 0d6d25e3..04da05b 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -72,6 +72,15 @@ static const unsigned int kyber_batch_size[] = {
 	[KYBER_OTHER] = 8,
 };
 
+struct kyber_ctx_queue {
+	/*
+	 * Copied from blk_mq_ctx->index_hw
+	 */
+	unsigned int index;
+	spinlock_t lock;
+	struct list_head rq_list[KYBER_NUM_DOMAINS];
+} ____cacheline_aligned_in_smp;
+
 struct kyber_queue_data {
 	struct request_queue *q;
 
@@ -84,6 +93,7 @@ struct kyber_queue_data {
 	 */
 	struct sbitmap_queue domain_tokens[KYBER_NUM_DOMAINS];
 
+	struct kyber_ctx_queue *ctx_queue;
 	/*
 	 * Async request percentage, converted to per-word depth for
 	 * sbitmap_get_shallow().
@@ -99,6 +109,8 @@ struct kyber_hctx_data {
 	struct list_head rqs[KYBER_NUM_DOMAINS];
 	unsigned int cur_domain;
 	unsigned int batching;
+	struct kyber_ctx_queue **kcqs;
+	struct sbitmap kcq_map[KYBER_NUM_DOMAINS];
 	wait_queue_entry_t domain_wait[KYBER_NUM_DOMAINS];
 	struct sbq_wait_state *domain_ws[KYBER_NUM_DOMAINS];
 	atomic_t wait_index[KYBER_NUM_DOMAINS];
@@ -284,6 +296,19 @@ static unsigned int kyber_sched_tags_shift(struct kyber_queue_data *kqd)
 	return kqd->q->queue_hw_ctx[0]->sched_tags->bitmap_tags.sb.shift;
 }
 
+static void kyber_ctx_queue_init(struct kyber_queue_data *kqd)
+{
+	unsigned int i, j;
+
+	for (i = 0; i < nr_cpu_ids; i++) {
+		struct kyber_ctx_queue *kcq = &kqd->ctx_queue[i];
+
+		spin_lock_init(&kcq->lock);
+		for (j = 0; j < KYBER_NUM_DOMAINS; j++)
+			INIT_LIST_HEAD(&kcq->rq_list[j]);
+	}
+}
+
 static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
 {
 	struct kyber_queue_data *kqd;
@@ -302,6 +327,13 @@ static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
 	if (!kqd->cb)
 		goto err_kqd;
 
+	kqd->ctx_queue = kmalloc_array_node(nr_cpu_ids,
+			sizeof(struct kyber_ctx_queue), GFP_KERNEL, -1);
+	if (!kqd->ctx_queue)
+		goto err_cb;
+
+	kyber_ctx_queue_init(kqd);
+
 	/*
 	 * The maximum number of tokens for any scheduling domain is at least
 	 * the queue depth of a single hardware queue. If the hardware doesn't
@@ -318,7 +350,7 @@ static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
 		if (ret) {
 			while (--i >= 0)
 				sbitmap_queue_free(&kqd->domain_tokens[i]);
-			goto err_cb;
+			goto err_kcq;
 		}
 		sbitmap_queue_resize(&kqd->domain_tokens[i], kyber_depth[i]);
 	}
@@ -331,6 +363,8 @@ static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
 
 	return kqd;
 
+err_kcq:
+	kfree(kqd->ctx_queue);
 err_cb:
 	blk_stat_free_callback(kqd->cb);
 err_kqd:
@@ -372,6 +406,7 @@ static void kyber_exit_sched(struct elevator_queue *e)
 	for (i = 0; i < KYBER_NUM_DOMAINS; i++)
 		sbitmap_queue_free(&kqd->domain_tokens[i]);
 
+	kfree(kqd->ctx_queue);
 	blk_stat_free_callback(kqd->cb);
 	kfree(kqd);
 }
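The behavior the commit message describes — requests staying on their own domain's list when that domain's tokens are exhausted, rather than being flushed along with everything else — can be sketched in plain C. These are simplified, hypothetical types: the real code uses struct request, list_head, and sbitmap_queue tokens, and kyber's actual domains at the time were READ, SYNC_WRITE, and OTHER.

```c
#include <assert.h>
#include <stddef.h>

enum domain { DOM_READ, DOM_WRITE, DOM_OTHER, NUM_DOMAINS };

struct req { enum domain dom; struct req *next; };

/* Per-CPU context with one list per domain, like kyber_ctx_queue. */
struct kcq { struct req *head[NUM_DOMAINS]; };

static void kcq_insert(struct kcq *kcq, struct req *rq)
{
	rq->next = kcq->head[rq->dom];
	kcq->head[rq->dom] = rq;
}

/*
 * Dispatch touches only the requested domain.  Without a token the
 * request stays queued -- and remains a merge candidate -- instead
 * of being flushed out along with other domains' requests.
 */
static struct req *kcq_dispatch(struct kcq *kcq, enum domain d, int *tokens)
{
	struct req *rq = kcq->head[d];

	if (!rq || *tokens <= 0)
		return NULL;
	(*tokens)--;
	kcq->head[d] = rq->next;
	return rq;
}
```

With a single shared rq_list, dispatching reads would drag the token-starved writes off the list too; with per-domain lists the starved domain simply waits, which is where the merge-rate improvement in the table comes from.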