Re: [PATCH 08/10] blk-mq-sched: add framework for MQ capable IO schedulers

2017-01-13 Thread Hannes Reinecke
On 01/13/2017 05:41 PM, Omar Sandoval wrote:
> On Fri, Jan 13, 2017 at 12:15:17PM +0100, Hannes Reinecke wrote:
>> On 01/11/2017 10:40 PM, Jens Axboe wrote:
>>> This adds a set of hooks that intercepts the blk-mq path of
>>> allocating/inserting/issuing/completing requests, allowing
>>> us to develop a scheduler within that framework.
>>>
>>> We reuse the existing elevator scheduler API on the registration
>>> side, but augment that with the scheduler flagging support for
>>> the blk-mq interface, and with a separate set of ops hooks for MQ
>>> devices.
>>>
>>> We split driver and scheduler tags, so we can run the scheduling
>>> independent of device queue depth.
>>>
>>> Signed-off-by: Jens Axboe 
>> [ .. ]
>>> @@ -823,6 +847,35 @@ static inline unsigned int queued_to_index(unsigned int queued)
>>> return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
>>>  }
>>>  
>>> +static bool blk_mq_get_driver_tag(struct request *rq,
>>> + struct blk_mq_hw_ctx **hctx, bool wait)
>>> +{
>>> +   struct blk_mq_alloc_data data = {
>>> +   .q = rq->q,
>>> +   .ctx = rq->mq_ctx,
>>> +   .hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu),
>>> +   .flags = wait ? 0 : BLK_MQ_REQ_NOWAIT,
>>> +   };
>>> +
>>> +   if (blk_mq_hctx_stopped(data.hctx))
>>> +   return false;
>>> +
>>> +   if (rq->tag != -1) {
>>> +done:
>>> +   if (hctx)
>>> +   *hctx = data.hctx;
>>> +   return true;
>>> +   }
>>> +
>>> +   rq->tag = blk_mq_get_tag(&data);
>>> +   if (rq->tag >= 0) {
>>> +   data.hctx->tags->rqs[rq->tag] = rq;
>>> +   goto done;
>>> +   }
>>> +
>>> +   return false;
>>> +}
>>> +
>> What happens with the existing request at 'rqs[rq->tag]' ?
>> Surely there is one already, right?
>> Things like '->init_request' assume a fully populated array, so moving
>> one entry to another location is ... interesting.
>>
>> I would have thought we need to clone the request here,
>> otherwise this would introduce a memory leak, right?
>> (Not to mention a potential double completion, as the request is now at
>> two positions in the array)
>>
>> Cheers,
>>
>> Hannes
> 
> The entries in tags->rqs aren't slab objects, they're pointers into
> pages allocated separately and tracked on tags->page_list. See
> blk_mq_alloc_rqs(). In blk_mq_free_rqs(), we free all of the pages on
> tags->page_list, so there shouldn't be a memory leak.
> 
> As for hctx->tags->rqs, entries are only overwritten when a scheduler is
> enabled. In that case, the rqs array is storing pointers to requests
> actually from hctx->sched_tags, so overwriting/leaking isn't an issue.

Ah. Thanks.
That explains it.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


Re: [PATCH 08/10] blk-mq-sched: add framework for MQ capable IO schedulers

2017-01-13 Thread Omar Sandoval
On Fri, Jan 13, 2017 at 12:15:17PM +0100, Hannes Reinecke wrote:
> On 01/11/2017 10:40 PM, Jens Axboe wrote:
> > This adds a set of hooks that intercepts the blk-mq path of
> > allocating/inserting/issuing/completing requests, allowing
> > us to develop a scheduler within that framework.
> > 
> > We reuse the existing elevator scheduler API on the registration
> > side, but augment that with the scheduler flagging support for
> > the blk-mq interface, and with a separate set of ops hooks for MQ
> > devices.
> > 
> > We split driver and scheduler tags, so we can run the scheduling
> > independent of device queue depth.
> > 
> > Signed-off-by: Jens Axboe 
> [ .. ]
> > @@ -823,6 +847,35 @@ static inline unsigned int queued_to_index(unsigned int queued)
> > return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
> >  }
> >  
> > +static bool blk_mq_get_driver_tag(struct request *rq,
> > + struct blk_mq_hw_ctx **hctx, bool wait)
> > +{
> > +   struct blk_mq_alloc_data data = {
> > +   .q = rq->q,
> > +   .ctx = rq->mq_ctx,
> > +   .hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu),
> > +   .flags = wait ? 0 : BLK_MQ_REQ_NOWAIT,
> > +   };
> > +
> > +   if (blk_mq_hctx_stopped(data.hctx))
> > +   return false;
> > +
> > +   if (rq->tag != -1) {
> > +done:
> > +   if (hctx)
> > +   *hctx = data.hctx;
> > +   return true;
> > +   }
> > +
> > +   rq->tag = blk_mq_get_tag(&data);
> > +   if (rq->tag >= 0) {
> > +   data.hctx->tags->rqs[rq->tag] = rq;
> > +   goto done;
> > +   }
> > +
> > +   return false;
> > +}
> > +
> What happens with the existing request at 'rqs[rq->tag]' ?
> Surely there is one already, right?
> Things like '->init_request' assume a fully populated array, so moving
> one entry to another location is ... interesting.
> 
> I would have thought we need to clone the request here,
> otherwise this would introduce a memory leak, right?
> (Not to mention a potential double completion, as the request is now at
> two positions in the array)
> 
> Cheers,
> 
> Hannes

The entries in tags->rqs aren't slab objects, they're pointers into
pages allocated separately and tracked on tags->page_list. See
blk_mq_alloc_rqs(). In blk_mq_free_rqs(), we free all of the pages on
tags->page_list, so there shouldn't be a memory leak.

As for hctx->tags->rqs, entries are only overwritten when a scheduler is
enabled. In that case, the rqs array is storing pointers to requests
actually from hctx->sched_tags, so overwriting/leaking isn't an issue.
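
For illustration, a stripped-down sketch of that allocation scheme (the
names tags_sketch/alloc_rqs_sketch are made up here; the real code is
blk_mq_alloc_rqs(), which additionally handles node affinity, cache
alignment, and ->init_request(); rqs[] and page_list are assumed to be
set up by the caller):

struct tags_sketch {
	struct request **rqs;		/* tag -> request lookup table */
	struct list_head page_list;	/* owns the backing memory */
};

static int alloc_rqs_sketch(struct tags_sketch *tags, unsigned int depth,
			    size_t rq_size)
{
	struct page *page;
	void *p;
	unsigned int i;

	page = alloc_pages(GFP_KERNEL, get_order(depth * rq_size));
	if (!page)
		return -ENOMEM;
	/* Track the page itself; freeing later walks page_list, so
	 * individual rqs[] entries are never freed (or leaked) on
	 * their own. */
	list_add_tail(&page->lru, &tags->page_list);

	p = page_address(page);
	for (i = 0; i < depth; i++) {
		/* Each entry is a pointer into the page, not a slab
		 * object, so overwriting it is harmless. */
		tags->rqs[i] = p;
		p += rq_size;
	}
	return 0;
}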


Re: [PATCH 08/10] blk-mq-sched: add framework for MQ capable IO schedulers

2017-01-13 Thread Bart Van Assche
On Fri, 2017-01-13 at 12:15 +0100, Hannes Reinecke wrote:
> On 01/11/2017 10:40 PM, Jens Axboe wrote:
> > This adds a set of hooks that intercepts the blk-mq path of
> > allocating/inserting/issuing/completing requests, allowing
> > us to develop a scheduler within that framework.
> > 
> > We reuse the existing elevator scheduler API on the registration
> > side, but augment that with the scheduler flagging support for
> > the blk-mq interface, and with a separate set of ops hooks for MQ
> > devices.
> > 
> > We split driver and scheduler tags, so we can run the scheduling
> > independent of device queue depth.
> > 
> > Signed-off-by: Jens Axboe 
> 
> [ .. ]
> > @@ -823,6 +847,35 @@ static inline unsigned int queued_to_index(unsigned int queued)
> > return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
> >  }
> >  
> > +static bool blk_mq_get_driver_tag(struct request *rq,
> > + struct blk_mq_hw_ctx **hctx, bool wait)
> > +{
> > +   struct blk_mq_alloc_data data = {
> > +   .q = rq->q,
> > +   .ctx = rq->mq_ctx,
> > +   .hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu),
> > +   .flags = wait ? 0 : BLK_MQ_REQ_NOWAIT,
> > +   };
> > +
> > +   if (blk_mq_hctx_stopped(data.hctx))
> > +   return false;
> > +
> > +   if (rq->tag != -1) {
> > +done:
> > +   if (hctx)
> > +   *hctx = data.hctx;
> > +   return true;
> > +   }
> > +
> > +   rq->tag = blk_mq_get_tag(&data);
> > +   if (rq->tag >= 0) {
> > +   data.hctx->tags->rqs[rq->tag] = rq;
> > +   goto done;
> > +   }
> > +
> > +   return false;
> > +}
> > +
> 
> What happens with the existing request at 'rqs[rq->tag]' ?
> Surely there is one already, right?
> Things like '->init_request' assume a fully populated array, so moving
> one entry to another location is ... interesting.
> 
> I would have thought we need to clone the request here,
> otherwise this would introduce a memory leak, right?
> (Not to mention a potential double completion, as the request is now at
> two positions in the array)

Hello Hannes,

Have you noticed that there are two .rqs[] arrays - tags->rqs and
sched_tags->rqs[]? .init_request() loops over sched_tags->rqs[]. The
above assignment applies to tags->rqs[].
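
Schematically (abridged sketch; the field layout is illustrative only,
not the real struct definition):

struct blk_mq_hw_ctx {				/* abridged */
	struct blk_mq_tags *tags;		/* driver tags: rqs[] is a
						 * lookup table, rewritten at
						 * dispatch time when a
						 * scheduler is attached */
	struct blk_mq_tags *sched_tags;		/* scheduler tags: own the
						 * requests that
						 * ->init_request() set up */
	/* ... */
};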

Bart.

Re: [PATCH 08/10] blk-mq-sched: add framework for MQ capable IO schedulers

2017-01-13 Thread Hannes Reinecke
On 01/11/2017 10:40 PM, Jens Axboe wrote:
> This adds a set of hooks that intercepts the blk-mq path of
> allocating/inserting/issuing/completing requests, allowing
> us to develop a scheduler within that framework.
> 
> We reuse the existing elevator scheduler API on the registration
> side, but augment that with the scheduler flagging support for
> the blk-mq interface, and with a separate set of ops hooks for MQ
> devices.
> 
> We split driver and scheduler tags, so we can run the scheduling
> independent of device queue depth.
> 
> Signed-off-by: Jens Axboe 
[ .. ]
> @@ -823,6 +847,35 @@ static inline unsigned int queued_to_index(unsigned int queued)
>   return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
>  }
>  
> +static bool blk_mq_get_driver_tag(struct request *rq,
> +   struct blk_mq_hw_ctx **hctx, bool wait)
> +{
> + struct blk_mq_alloc_data data = {
> + .q = rq->q,
> + .ctx = rq->mq_ctx,
> + .hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu),
> + .flags = wait ? 0 : BLK_MQ_REQ_NOWAIT,
> + };
> +
> + if (blk_mq_hctx_stopped(data.hctx))
> + return false;
> +
> + if (rq->tag != -1) {
> +done:
> + if (hctx)
> + *hctx = data.hctx;
> + return true;
> + }
> +
> + rq->tag = blk_mq_get_tag(&data);
> + if (rq->tag >= 0) {
> + data.hctx->tags->rqs[rq->tag] = rq;
> + goto done;
> + }
> +
> + return false;
> +}
> +
What happens with the existing request at 'rqs[rq->tag]' ?
Surely there is one already, right?
Things like '->init_request' assume a fully populated array, so moving
one entry to another location is ... interesting.

I would have thought we need to clone the request here,
otherwise this would introduce a memory leak, right?
(Not to mention a potential double completion, as the request is now at
two positions in the array)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 08/10] blk-mq-sched: add framework for MQ capable IO schedulers

2017-01-12 Thread Jens Axboe
On Thu, Jan 12 2017, Bart Van Assche wrote:
> On Wed, 2017-01-11 at 14:40 -0700, Jens Axboe wrote:
> > @@ -451,11 +456,11 @@ void blk_insert_flush(struct request *rq)
> >  * processed directly without going through flush machinery.  Queue
> >  * for normal execution.
> >  */
> > -   if ((policy & REQ_FSEQ_DATA) &&
> > -   !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
> > -   if (q->mq_ops) {
> > -   blk_mq_insert_request(rq, false, true, false);
> > -   } else
> > +   if (((policy & REQ_FSEQ_DATA) &&
> > +   !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH)))) {
> > +   if (q->mq_ops)
> > +   blk_mq_sched_insert_request(rq, false, true, false);
> > +   else
> > 		list_add_tail(&rq->queuelist, &q->queue_head);
> > return;
> > }
> 
> Not that it really matters, but this change adds a pair of parentheses --
> "if (e)" is changed into "if ((e))". Is this necessary?

I fixed that up earlier today, as I noticed the same. So that's gone in
the current -git tree.

> > +void blk_mq_sched_free_hctx_data(struct request_queue *q,
> > +void (*exit)(struct blk_mq_hw_ctx *))
> > +{
> > +   struct blk_mq_hw_ctx *hctx;
> > +   int i;
> > +
> > +   queue_for_each_hw_ctx(q, hctx, i) {
> > +   if (exit)
> > +   exit(hctx);
> > +   kfree(hctx->sched_data);
> > +   hctx->sched_data = NULL;
> > +   }
> > +}
> > +EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
> > +
> > +int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
> > +   int (*init)(struct blk_mq_hw_ctx *),
> > +   void (*exit)(struct blk_mq_hw_ctx *))
> > +{
> > +   struct blk_mq_hw_ctx *hctx;
> > +   int ret;
> > +   int i;
> > +
> > +   queue_for_each_hw_ctx(q, hctx, i) {
> > +   hctx->sched_data = kmalloc_node(size, GFP_KERNEL, 
> > hctx->numa_node);
> > +   if (!hctx->sched_data) {
> > +   ret = -ENOMEM;
> > +   goto error;
> > +   }
> > +
> > +   if (init) {
> > +   ret = init(hctx);
> > +   if (ret) {
> > +   /*
> > +* We don't want to give exit() a partially
> > +* initialized sched_data. init() must clean up
> > +* if it fails.
> > +*/
> > +   kfree(hctx->sched_data);
> > +   hctx->sched_data = NULL;
> > +   goto error;
> > +   }
> > +   }
> > +   }
> > +
> > +   return 0;
> > +error:
> > +   blk_mq_sched_free_hctx_data(q, exit);
> > +   return ret;
> > +}
> 
> If one of the init() calls by blk_mq_sched_init_hctx_data() fails then
> blk_mq_sched_free_hctx_data() will call exit() even for hctx's for which
> init() has not been called. How about changing "if (exit)" into "if (exit &&
> hctx->sched_data)" such that exit() is only called for hctx's for which
> init() has been called?

Good point, I'll make that change to the exit function.
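
A sketch of the suggested change (untested):

void blk_mq_sched_free_hctx_data(struct request_queue *q,
				 void (*exit)(struct blk_mq_hw_ctx *))
{
	struct blk_mq_hw_ctx *hctx;
	int i;

	queue_for_each_hw_ctx(q, hctx, i) {
		/* Only call exit() for hctx's whose init() actually ran;
		 * sched_data is still NULL for the ones we never reached. */
		if (exit && hctx->sched_data)
			exit(hctx);
		kfree(hctx->sched_data);
		hctx->sched_data = NULL;
	}
}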

> > +struct request *blk_mq_sched_get_request(struct request_queue *q,
> > +struct bio *bio,
> > +unsigned int op,
> > +struct blk_mq_alloc_data *data)
> > +{
> > +   struct elevator_queue *e = q->elevator;
> > +   struct blk_mq_hw_ctx *hctx;
> > +   struct blk_mq_ctx *ctx;
> > +   struct request *rq;
> > +
> > +   blk_queue_enter_live(q);
> > +   ctx = blk_mq_get_ctx(q);
> > +   hctx = blk_mq_map_queue(q, ctx->cpu);
> > +
> > +   blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
> > +
> > +   if (e) {
> > +   data->flags |= BLK_MQ_REQ_INTERNAL;
> > +   if (e->type->ops.mq.get_request)
> > +   rq = e->type->ops.mq.get_request(q, op, data);
> > +   else
> > +   rq = __blk_mq_alloc_request(data, op);
> > +   } else {
> > +   rq = __blk_mq_alloc_request(data, op);
> > +   if (rq) {
> > +   rq->tag = rq->internal_tag;
> > +   rq->internal_tag = -1;
> > +   }
> > +   }
> > +
> > +   if (rq) {
> > +   rq->elv.icq = NULL;
> > +   if (e && e->type->icq_cache)
> > +   blk_mq_sched_assign_ioc(q, rq, bio);
> > +   data->hctx->queued++;
> > +   return rq;
> > +   }
> > +
> > +   blk_queue_exit(q);
> > +   return NULL;
> > +}
> 
> The "rq->tag = rq->internal_tag; rq->internal_tag = -1;" occurs not only
> here but also in blk_mq_alloc_request_hctx(). Has it been considered to move
> that code into __blk_mq_alloc_request()?

Yes, it's in two locations. I wanted to keep it out of
__blk_mq_alloc_request(), so we can still use that for normal tag
allocations. But maybe it's better for __blk_mq_alloc_request() to 
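
A rough sketch of what that consolidation could look like (hypothetical
and untested; it assumes BLK_MQ_REQ_INTERNAL in data->flags decides
which tag field is used, and omits the blk_mq_tag_busy() handling from
the real function):

struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
				       unsigned int op)
{
	struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
	struct request *rq;
	unsigned int tag;

	tag = blk_mq_get_tag(data);
	if (tag == BLK_MQ_TAG_FAIL)
		return NULL;

	rq = tags->rqs[tag];
	blk_mq_rq_ctx_init(data->q, data->ctx, rq, op);

	/* Hypothetical: centralize the tag/internal_tag bookkeeping so
	 * both the scheduler path and blk_mq_alloc_request_hctx() share
	 * it instead of open-coding the swap. */
	if (data->flags & BLK_MQ_REQ_INTERNAL) {
		rq->internal_tag = tag;
		rq->tag = -1;
	} else {
		rq->tag = tag;
		rq->internal_tag = -1;
	}
	return rq;
}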

Re: [PATCH 08/10] blk-mq-sched: add framework for MQ capable IO schedulers

2017-01-12 Thread Bart Van Assche
On Wed, 2017-01-11 at 14:40 -0700, Jens Axboe wrote:
> @@ -451,11 +456,11 @@ void blk_insert_flush(struct request *rq)
>* processed directly without going through flush machinery.  Queue
>* for normal execution.
>*/
> - if ((policy & REQ_FSEQ_DATA) &&
> - !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
> - if (q->mq_ops) {
> - blk_mq_insert_request(rq, false, true, false);
> - } else
> + if (((policy & REQ_FSEQ_DATA) &&
> +      !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH)))) {
> + if (q->mq_ops)
> + blk_mq_sched_insert_request(rq, false, true, false);
> + else
>   list_add_tail(&rq->queuelist, &q->queue_head);
>   return;
>   }

Not that it really matters, but this change adds a pair of parentheses --
"if (e)" is changed into "if ((e))". Is this necessary?

> +void blk_mq_sched_free_hctx_data(struct request_queue *q,
> +  void (*exit)(struct blk_mq_hw_ctx *))
> +{
> + struct blk_mq_hw_ctx *hctx;
> + int i;
> +
> + queue_for_each_hw_ctx(q, hctx, i) {
> + if (exit)
> + exit(hctx);
> + kfree(hctx->sched_data);
> + hctx->sched_data = NULL;
> + }
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
> +
> +int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
> + int (*init)(struct blk_mq_hw_ctx *),
> + void (*exit)(struct blk_mq_hw_ctx *))
> +{
> + struct blk_mq_hw_ctx *hctx;
> + int ret;
> + int i;
> +
> + queue_for_each_hw_ctx(q, hctx, i) {
> + hctx->sched_data = kmalloc_node(size, GFP_KERNEL, 
> hctx->numa_node);
> + if (!hctx->sched_data) {
> + ret = -ENOMEM;
> + goto error;
> + }
> +
> + if (init) {
> + ret = init(hctx);
> + if (ret) {
> + /*
> +  * We don't want to give exit() a partially
> +  * initialized sched_data. init() must clean up
> +  * if it fails.
> +  */
> + kfree(hctx->sched_data);
> + hctx->sched_data = NULL;
> + goto error;
> + }
> + }
> + }
> +
> + return 0;
> +error:
> + blk_mq_sched_free_hctx_data(q, exit);
> + return ret;
> +}

If one of the init() calls by blk_mq_sched_init_hctx_data() fails then
blk_mq_sched_free_hctx_data() will call exit() even for hctx's for which
init() has not been called. How about changing "if (exit)" into "if (exit &&
hctx->sched_data)" such that exit() is only called for hctx's for which
init() has been called?

> +struct request *blk_mq_sched_get_request(struct request_queue *q,
> +  struct bio *bio,
> +  unsigned int op,
> +  struct blk_mq_alloc_data *data)
> +{
> + struct elevator_queue *e = q->elevator;
> + struct blk_mq_hw_ctx *hctx;
> + struct blk_mq_ctx *ctx;
> + struct request *rq;
> +
> + blk_queue_enter_live(q);
> + ctx = blk_mq_get_ctx(q);
> + hctx = blk_mq_map_queue(q, ctx->cpu);
> +
> + blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
> +
> + if (e) {
> + data->flags |= BLK_MQ_REQ_INTERNAL;
> + if (e->type->ops.mq.get_request)
> + rq = e->type->ops.mq.get_request(q, op, data);
> + else
> + rq = __blk_mq_alloc_request(data, op);
> + } else {
> + rq = __blk_mq_alloc_request(data, op);
> + if (rq) {
> + rq->tag = rq->internal_tag;
> + rq->internal_tag = -1;
> + }
> + }
> +
> + if (rq) {
> + rq->elv.icq = NULL;
> + if (e && e->type->icq_cache)
> + blk_mq_sched_assign_ioc(q, rq, bio);
> + data->hctx->queued++;
> + return rq;
> + }
> +
> + blk_queue_exit(q);
> + return NULL;
> +}

The "rq->tag = rq->internal_tag; rq->internal_tag = -1;" occurs not only
here but also in blk_mq_alloc_request_hctx(). Has it been considered to move
that code into __blk_mq_alloc_request()?

> @@ -223,14 +225,17 @@ struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
>  
>   tag = blk_mq_get_tag(data);
>   if (tag != BLK_MQ_TAG_FAIL) {
> - rq = data->hctx->tags->rqs[tag];
> + struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
> +
> + rq = tags->rqs[tag];
>  
>   if (blk_mq_tag_busy(data->hctx)) {
>   rq->rq_flags = RQF_MQ_INFLIGHT;
>  
