Re: [PATCH 08/10] blk-mq-sched: add framework for MQ capable IO schedulers
On 01/13/2017 05:41 PM, Omar Sandoval wrote:
> On Fri, Jan 13, 2017 at 12:15:17PM +0100, Hannes Reinecke wrote:
>> On 01/11/2017 10:40 PM, Jens Axboe wrote:
>>> This adds a set of hooks that intercepts the blk-mq path of
>>> allocating/inserting/issuing/completing requests, allowing
>>> us to develop a scheduler within that framework.
>>>
>>> We reuse the existing elevator scheduler API on the registration
>>> side, but augment that with the scheduler flagging support for
>>> the blk-mq interface, and with a separate set of ops hooks for MQ
>>> devices.
>>>
>>> We split driver and scheduler tags, so we can run the scheduling
>>> independent of device queue depth.
>>>
>>> Signed-off-by: Jens Axboe
>> [ .. ]
>>> @@ -823,6 +847,35 @@ static inline unsigned int queued_to_index(unsigned int queued)
>>>  	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
>>>  }
>>>
>>> +static bool blk_mq_get_driver_tag(struct request *rq,
>>> +				  struct blk_mq_hw_ctx **hctx, bool wait)
>>> +{
>>> +	struct blk_mq_alloc_data data = {
>>> +		.q = rq->q,
>>> +		.ctx = rq->mq_ctx,
>>> +		.hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu),
>>> +		.flags = wait ? 0 : BLK_MQ_REQ_NOWAIT,
>>> +	};
>>> +
>>> +	if (blk_mq_hctx_stopped(data.hctx))
>>> +		return false;
>>> +
>>> +	if (rq->tag != -1) {
>>> +done:
>>> +		if (hctx)
>>> +			*hctx = data.hctx;
>>> +		return true;
>>> +	}
>>> +
>>> +	rq->tag = blk_mq_get_tag(&data);
>>> +	if (rq->tag >= 0) {
>>> +		data.hctx->tags->rqs[rq->tag] = rq;
>>> +		goto done;
>>> +	}
>>> +
>>> +	return false;
>>> +}
>>> +
>> What happens with the existing request at 'rqs[rq->tag]'?
>> Surely there is one already, right?
>> Things like '->init_request' assume a fully populated array, so moving
>> one entry to another location is ... interesting.
>>
>> I would have thought we need to do request cloning here,
>> otherwise this would introduce a memory leak, right?
>> (Not to mention a potential double completion, as the request is now at
>> two positions in the array)
>>
>> Cheers,
>>
>> Hannes
>
> The entries in tags->rqs aren't slab objects, they're pointers into
> pages allocated separately and tracked on tags->page_list. See
> blk_mq_alloc_rqs(). In blk_mq_free_rqs(), we free all of the pages on
> tags->page_list, so there shouldn't be a memory leak.
>
> As for hctx->tags->rqs, entries are only overwritten when a scheduler is
> enabled. In that case, the rqs array is storing pointers to requests
> actually from hctx->sched_tags, so overwriting/leaking isn't an issue.

Ah. Thanks. That explains it.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
h...@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
Re: [PATCH 08/10] blk-mq-sched: add framework for MQ capable IO schedulers
On Fri, Jan 13, 2017 at 12:15:17PM +0100, Hannes Reinecke wrote:
> On 01/11/2017 10:40 PM, Jens Axboe wrote:
> > This adds a set of hooks that intercepts the blk-mq path of
> > allocating/inserting/issuing/completing requests, allowing
> > us to develop a scheduler within that framework.
> >
> > We reuse the existing elevator scheduler API on the registration
> > side, but augment that with the scheduler flagging support for
> > the blk-mq interface, and with a separate set of ops hooks for MQ
> > devices.
> >
> > We split driver and scheduler tags, so we can run the scheduling
> > independent of device queue depth.
> >
> > Signed-off-by: Jens Axboe
> [ .. ]
> > @@ -823,6 +847,35 @@ static inline unsigned int queued_to_index(unsigned int queued)
> >  	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
> >  }
> >
> > +static bool blk_mq_get_driver_tag(struct request *rq,
> > +				  struct blk_mq_hw_ctx **hctx, bool wait)
> > +{
> > +	struct blk_mq_alloc_data data = {
> > +		.q = rq->q,
> > +		.ctx = rq->mq_ctx,
> > +		.hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu),
> > +		.flags = wait ? 0 : BLK_MQ_REQ_NOWAIT,
> > +	};
> > +
> > +	if (blk_mq_hctx_stopped(data.hctx))
> > +		return false;
> > +
> > +	if (rq->tag != -1) {
> > +done:
> > +		if (hctx)
> > +			*hctx = data.hctx;
> > +		return true;
> > +	}
> > +
> > +	rq->tag = blk_mq_get_tag(&data);
> > +	if (rq->tag >= 0) {
> > +		data.hctx->tags->rqs[rq->tag] = rq;
> > +		goto done;
> > +	}
> > +
> > +	return false;
> > +}
> > +
> What happens with the existing request at 'rqs[rq->tag]'?
> Surely there is one already, right?
> Things like '->init_request' assume a fully populated array, so moving
> one entry to another location is ... interesting.
>
> I would have thought we need to do request cloning here,
> otherwise this would introduce a memory leak, right?
> (Not to mention a potential double completion, as the request is now at
> two positions in the array)
>
> Cheers,
>
> Hannes

The entries in tags->rqs aren't slab objects, they're pointers into
pages allocated separately and tracked on tags->page_list. See
blk_mq_alloc_rqs(). In blk_mq_free_rqs(), we free all of the pages on
tags->page_list, so there shouldn't be a memory leak.

As for hctx->tags->rqs, entries are only overwritten when a scheduler is
enabled. In that case, the rqs array is storing pointers to requests
actually from hctx->sched_tags, so overwriting/leaking isn't an issue.
Re: [PATCH 08/10] blk-mq-sched: add framework for MQ capable IO schedulers
On Fri, 2017-01-13 at 12:15 +0100, Hannes Reinecke wrote:
> On 01/11/2017 10:40 PM, Jens Axboe wrote:
> > This adds a set of hooks that intercepts the blk-mq path of
> > allocating/inserting/issuing/completing requests, allowing
> > us to develop a scheduler within that framework.
> > 
> > We reuse the existing elevator scheduler API on the registration
> > side, but augment that with the scheduler flagging support for
> > the blk-mq interface, and with a separate set of ops hooks for MQ
> > devices.
> > 
> > We split driver and scheduler tags, so we can run the scheduling
> > independent of device queue depth.
> > 
> > Signed-off-by: Jens Axboe
> [ .. ]
> > @@ -823,6 +847,35 @@ static inline unsigned int queued_to_index(unsigned int queued)
> >  	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
> >  }
> > 
> > +static bool blk_mq_get_driver_tag(struct request *rq,
> > +				  struct blk_mq_hw_ctx **hctx, bool wait)
> > +{
> > +	struct blk_mq_alloc_data data = {
> > +		.q = rq->q,
> > +		.ctx = rq->mq_ctx,
> > +		.hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu),
> > +		.flags = wait ? 0 : BLK_MQ_REQ_NOWAIT,
> > +	};
> > +
> > +	if (blk_mq_hctx_stopped(data.hctx))
> > +		return false;
> > +
> > +	if (rq->tag != -1) {
> > +done:
> > +		if (hctx)
> > +			*hctx = data.hctx;
> > +		return true;
> > +	}
> > +
> > +	rq->tag = blk_mq_get_tag(&data);
> > +	if (rq->tag >= 0) {
> > +		data.hctx->tags->rqs[rq->tag] = rq;
> > +		goto done;
> > +	}
> > +
> > +	return false;
> > +}
> > +
> 
> What happens with the existing request at 'rqs[rq->tag]'?
> Surely there is one already, right?
> Things like '->init_request' assume a fully populated array, so moving
> one entry to another location is ... interesting.
> 
> I would have thought we need to do request cloning here,
> otherwise this would introduce a memory leak, right?
> (Not to mention a potential double completion, as the request is now at
> two positions in the array)

Hello Hannes,

Have you noticed that there are two .rqs[] arrays - tags->rqs[] and
sched_tags->rqs[]? .init_request() loops over sched_tags->rqs[]. The
above assignment applies to tags->rqs[].

Bart.
Re: [PATCH 08/10] blk-mq-sched: add framework for MQ capable IO schedulers
On 01/11/2017 10:40 PM, Jens Axboe wrote:
> This adds a set of hooks that intercepts the blk-mq path of
> allocating/inserting/issuing/completing requests, allowing
> us to develop a scheduler within that framework.
> 
> We reuse the existing elevator scheduler API on the registration
> side, but augment that with the scheduler flagging support for
> the blk-mq interface, and with a separate set of ops hooks for MQ
> devices.
> 
> We split driver and scheduler tags, so we can run the scheduling
> independent of device queue depth.
> 
> Signed-off-by: Jens Axboe
[ .. ]
> @@ -823,6 +847,35 @@ static inline unsigned int queued_to_index(unsigned int queued)
>  	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
>  }
> 
> +static bool blk_mq_get_driver_tag(struct request *rq,
> +				  struct blk_mq_hw_ctx **hctx, bool wait)
> +{
> +	struct blk_mq_alloc_data data = {
> +		.q = rq->q,
> +		.ctx = rq->mq_ctx,
> +		.hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu),
> +		.flags = wait ? 0 : BLK_MQ_REQ_NOWAIT,
> +	};
> +
> +	if (blk_mq_hctx_stopped(data.hctx))
> +		return false;
> +
> +	if (rq->tag != -1) {
> +done:
> +		if (hctx)
> +			*hctx = data.hctx;
> +		return true;
> +	}
> +
> +	rq->tag = blk_mq_get_tag(&data);
> +	if (rq->tag >= 0) {
> +		data.hctx->tags->rqs[rq->tag] = rq;
> +		goto done;
> +	}
> +
> +	return false;
> +}
> +

What happens with the existing request at 'rqs[rq->tag]'?
Surely there is one already, right?
Things like '->init_request' assume a fully populated array, so moving
one entry to another location is ... interesting.

I would have thought we need to do request cloning here,
otherwise this would introduce a memory leak, right?
(Not to mention a potential double completion, as the request is now at
two positions in the array)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [PATCH 08/10] blk-mq-sched: add framework for MQ capable IO schedulers
On Thu, Jan 12 2017, Bart Van Assche wrote:
> On Wed, 2017-01-11 at 14:40 -0700, Jens Axboe wrote:
> > @@ -451,11 +456,11 @@ void blk_insert_flush(struct request *rq)
> >  	 * processed directly without going through flush machinery.  Queue
> >  	 * for normal execution.
> >  	 */
> > -	if ((policy & REQ_FSEQ_DATA) &&
> > -	    !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
> > -		if (q->mq_ops) {
> > -			blk_mq_insert_request(rq, false, true, false);
> > -		} else
> > +	if (((policy & REQ_FSEQ_DATA) &&
> > +	     !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH)))) {
> > +		if (q->mq_ops)
> > +			blk_mq_sched_insert_request(rq, false, true, false);
> > +		else
> >  			list_add_tail(&rq->queuelist, &q->queue_head);
> >  		return;
> >  	}
> 
> Not that it really matters, but this change adds a pair of parentheses --
> "if (e)" is changed into "if ((e))". Is this necessary?

I fixed that up earlier today, as I noticed the same. So that's gone in
the current -git tree.

> > +void blk_mq_sched_free_hctx_data(struct request_queue *q,
> > +				 void (*exit)(struct blk_mq_hw_ctx *))
> > +{
> > +	struct blk_mq_hw_ctx *hctx;
> > +	int i;
> > +
> > +	queue_for_each_hw_ctx(q, hctx, i) {
> > +		if (exit)
> > +			exit(hctx);
> > +		kfree(hctx->sched_data);
> > +		hctx->sched_data = NULL;
> > +	}
> > +}
> > +EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
> > +
> > +int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
> > +				int (*init)(struct blk_mq_hw_ctx *),
> > +				void (*exit)(struct blk_mq_hw_ctx *))
> > +{
> > +	struct blk_mq_hw_ctx *hctx;
> > +	int ret;
> > +	int i;
> > +
> > +	queue_for_each_hw_ctx(q, hctx, i) {
> > +		hctx->sched_data = kmalloc_node(size, GFP_KERNEL, hctx->numa_node);
> > +		if (!hctx->sched_data) {
> > +			ret = -ENOMEM;
> > +			goto error;
> > +		}
> > +
> > +		if (init) {
> > +			ret = init(hctx);
> > +			if (ret) {
> > +				/*
> > +				 * We don't want to give exit() a partially
> > +				 * initialized sched_data. init() must clean up
> > +				 * if it fails.
> > +				 */
> > +				kfree(hctx->sched_data);
> > +				hctx->sched_data = NULL;
> > +				goto error;
> > +			}
> > +		}
> > +	}
> > +
> > +	return 0;
> > +error:
> > +	blk_mq_sched_free_hctx_data(q, exit);
> > +	return ret;
> > +}
> 
> If one of the init() calls by blk_mq_sched_init_hctx_data() fails then
> blk_mq_sched_free_hctx_data() will call exit() even for hctx's for which
> init() has not been called. How about changing "if (exit)" into "if (exit &&
> hctx->sched_data)" such that exit() is only called for hctx's for which
> init() has been called?

Good point, I'll make that change to the exit function.

> > +struct request *blk_mq_sched_get_request(struct request_queue *q,
> > +					 struct bio *bio,
> > +					 unsigned int op,
> > +					 struct blk_mq_alloc_data *data)
> > +{
> > +	struct elevator_queue *e = q->elevator;
> > +	struct blk_mq_hw_ctx *hctx;
> > +	struct blk_mq_ctx *ctx;
> > +	struct request *rq;
> > +
> > +	blk_queue_enter_live(q);
> > +	ctx = blk_mq_get_ctx(q);
> > +	hctx = blk_mq_map_queue(q, ctx->cpu);
> > +
> > +	blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
> > +
> > +	if (e) {
> > +		data->flags |= BLK_MQ_REQ_INTERNAL;
> > +		if (e->type->ops.mq.get_request)
> > +			rq = e->type->ops.mq.get_request(q, op, data);
> > +		else
> > +			rq = __blk_mq_alloc_request(data, op);
> > +	} else {
> > +		rq = __blk_mq_alloc_request(data, op);
> > +		if (rq) {
> > +			rq->tag = rq->internal_tag;
> > +			rq->internal_tag = -1;
> > +		}
> > +	}
> > +
> > +	if (rq) {
> > +		rq->elv.icq = NULL;
> > +		if (e && e->type->icq_cache)
> > +			blk_mq_sched_assign_ioc(q, rq, bio);
> > +		data->hctx->queued++;
> > +		return rq;
> > +	}
> > +
> > +	blk_queue_exit(q);
> > +	return NULL;
> > +}
> 
> The "rq->tag = rq->internal_tag; rq->internal_tag = -1;" occurs not only
> here but also in blk_mq_alloc_request_hctx(). Has it been considered to move
> that code into __blk_mq_alloc_request()?

Yes, it's in two locations. I wanted to keep it out of
__blk_mq_alloc_request(), so we can still use that for normal tag
allocations.
But maybe it's better for __blk_mq_alloc_request() to
Re: [PATCH 08/10] blk-mq-sched: add framework for MQ capable IO schedulers
On Wed, 2017-01-11 at 14:40 -0700, Jens Axboe wrote:
> @@ -451,11 +456,11 @@ void blk_insert_flush(struct request *rq)
>  	 * processed directly without going through flush machinery.  Queue
>  	 * for normal execution.
>  	 */
> -	if ((policy & REQ_FSEQ_DATA) &&
> -	    !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
> -		if (q->mq_ops) {
> -			blk_mq_insert_request(rq, false, true, false);
> -		} else
> +	if (((policy & REQ_FSEQ_DATA) &&
> +	     !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH)))) {
> +		if (q->mq_ops)
> +			blk_mq_sched_insert_request(rq, false, true, false);
> +		else
>  			list_add_tail(&rq->queuelist, &q->queue_head);
>  		return;
>  	}

Not that it really matters, but this change adds a pair of parentheses --
"if (e)" is changed into "if ((e))". Is this necessary?

> +void blk_mq_sched_free_hctx_data(struct request_queue *q,
> +				 void (*exit)(struct blk_mq_hw_ctx *))
> +{
> +	struct blk_mq_hw_ctx *hctx;
> +	int i;
> +
> +	queue_for_each_hw_ctx(q, hctx, i) {
> +		if (exit)
> +			exit(hctx);
> +		kfree(hctx->sched_data);
> +		hctx->sched_data = NULL;
> +	}
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
> +
> +int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
> +				int (*init)(struct blk_mq_hw_ctx *),
> +				void (*exit)(struct blk_mq_hw_ctx *))
> +{
> +	struct blk_mq_hw_ctx *hctx;
> +	int ret;
> +	int i;
> +
> +	queue_for_each_hw_ctx(q, hctx, i) {
> +		hctx->sched_data = kmalloc_node(size, GFP_KERNEL, hctx->numa_node);
> +		if (!hctx->sched_data) {
> +			ret = -ENOMEM;
> +			goto error;
> +		}
> +
> +		if (init) {
> +			ret = init(hctx);
> +			if (ret) {
> +				/*
> +				 * We don't want to give exit() a partially
> +				 * initialized sched_data. init() must clean up
> +				 * if it fails.
> +				 */
> +				kfree(hctx->sched_data);
> +				hctx->sched_data = NULL;
> +				goto error;
> +			}
> +		}
> +	}
> +
> +	return 0;
> +error:
> +	blk_mq_sched_free_hctx_data(q, exit);
> +	return ret;
> +}

If one of the init() calls by blk_mq_sched_init_hctx_data() fails then
blk_mq_sched_free_hctx_data() will call exit() even for hctx's for which
init() has not been called. How about changing "if (exit)" into "if (exit &&
hctx->sched_data)" such that exit() is only called for hctx's for which
init() has been called?

> +struct request *blk_mq_sched_get_request(struct request_queue *q,
> +					 struct bio *bio,
> +					 unsigned int op,
> +					 struct blk_mq_alloc_data *data)
> +{
> +	struct elevator_queue *e = q->elevator;
> +	struct blk_mq_hw_ctx *hctx;
> +	struct blk_mq_ctx *ctx;
> +	struct request *rq;
> +
> +	blk_queue_enter_live(q);
> +	ctx = blk_mq_get_ctx(q);
> +	hctx = blk_mq_map_queue(q, ctx->cpu);
> +
> +	blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
> +
> +	if (e) {
> +		data->flags |= BLK_MQ_REQ_INTERNAL;
> +		if (e->type->ops.mq.get_request)
> +			rq = e->type->ops.mq.get_request(q, op, data);
> +		else
> +			rq = __blk_mq_alloc_request(data, op);
> +	} else {
> +		rq = __blk_mq_alloc_request(data, op);
> +		if (rq) {
> +			rq->tag = rq->internal_tag;
> +			rq->internal_tag = -1;
> +		}
> +	}
> +
> +	if (rq) {
> +		rq->elv.icq = NULL;
> +		if (e && e->type->icq_cache)
> +			blk_mq_sched_assign_ioc(q, rq, bio);
> +		data->hctx->queued++;
> +		return rq;
> +	}
> +
> +	blk_queue_exit(q);
> +	return NULL;
> +}

The "rq->tag = rq->internal_tag; rq->internal_tag = -1;" occurs not only
here but also in blk_mq_alloc_request_hctx(). Has it been considered to move
that code into __blk_mq_alloc_request()?

> @@ -223,14 +225,17 @@ struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
> 
>  	tag = blk_mq_get_tag(data);
>  	if (tag != BLK_MQ_TAG_FAIL) {
> -		rq = data->hctx->tags->rqs[tag];
> +		struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
> +
> +		rq = tags->rqs[tag];
> 
>  		if (blk_mq_tag_busy(data->hctx)) {
>  			rq->rq_flags = RQF_MQ_INFLIGHT;
>