Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Ming Lei
On Fri, Jan 19, 2018 at 05:09:46AM +, Bart Van Assche wrote:
> On Fri, 2018-01-19 at 10:32 +0800, Ming Lei wrote:
> > Now most of the time both NVMe and SCSI won't return BLK_STS_RESOURCE, and
> > it should be DM-only which returns STS_RESOURCE so often.
> 
> That's wrong at least for SCSI. See also 
> https://marc.info/?l=linux-block&m=151578329417076.
> 

> For other scenarios, e.g. if a SCSI initiator submits a
> SCSI request over a fabric and the SCSI target replies with "BUSY" then the

Could you explain a bit about when a SCSI target replies with BUSY very often?

Inside the initiator, we have already limited the max per-LUN and per-host
outstanding requests before calling .queue_rq().
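
The initiator-side limiting Ming refers to can be sketched as a toy model. The
counter names (`device_busy`, `host_busy`, `queue_depth`, `can_queue`) mirror
the SCSI midlayer's accounting, but the code below is an illustrative model,
not the actual scsi_mq_queue_rq() logic:

```python
class ToyScsiHost:
    """Illustrative per-host budget, analogous to shost->can_queue."""
    def __init__(self, can_queue):
        self.can_queue = can_queue
        self.host_busy = 0

class ToyScsiDevice:
    """Illustrative per-LUN budget, analogous to sdev->queue_depth."""
    def __init__(self, host, queue_depth):
        self.host = host
        self.queue_depth = queue_depth
        self.device_busy = 0

def can_dispatch(sdev):
    # Checked before the driver's .queue_rq(): both the per-LUN and the
    # per-host budget must have room, otherwise the request stays queued
    # in blk-mq rather than being sent to the target.
    return (sdev.device_busy < sdev.queue_depth and
            sdev.host.host_busy < sdev.host.can_queue)

def dispatch(sdev):
    if not can_dispatch(sdev):
        return False
    sdev.device_busy += 1
    sdev.host.host_busy += 1
    return True
```

With these budgets in place, the initiator should rarely overrun the target,
which is why frequent BUSY replies are surprising.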

> SCSI core will end the I/O request with status BLK_STS_RESOURCE after the
> maximum number of retries has been reached (see also scsi_io_completion()).
> In that last case, if a SCSI target sends a "BUSY" reply over the wire back
> to the initiator, there is no other approach for the SCSI initiator to
> figure out whether it can queue another request than to resubmit the
> request. The worst possible strategy is to resubmit a request immediately
> because that will cause a significant fraction of the fabric bandwidth to
> be used just for replying "BUSY" to requests that can't be processed
> immediately.


-- 
Ming


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Ming Lei
On Thu, Jan 18, 2018 at 09:02:45PM -0700, Jens Axboe wrote:
> On 1/18/18 7:32 PM, Ming Lei wrote:
> > On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote:
> >> On 1/18/18 11:47 AM, Bart Van Assche wrote:
>  This is all very tiresome.
> >>>
> >>> Yes, this is tiresome. It is very annoying to me that others keep
> >>> introducing so many regressions in such important parts of the kernel.
> >>> It is also annoying to me that I get blamed if I report a regression
> >>> instead of seeing that the regression gets fixed.
> >>
> >> I agree, it sucks that any change there introduces the regression. I'm
> >> fine with doing the delay insert again until a new patch is proven to be
> >> better.
> > 
> > That way is still buggy as I explained, since rerun queue before adding
> > request to hctx->dispatch_list isn't correct. Who can make sure the request
> > is visible when __blk_mq_run_hw_queue() is called?
> 
> That race basically doesn't exist for a 10ms gap.
> 
> > Not to mention this way will cause a performance regression again.
> 
> How so? It's _exactly_ the same as what you are proposing, except mine
> will potentially run the queue when it need not do so. But given that
> these are random 10ms queue kicks because we are screwed, it should not
> matter. The key point is that it only should be if we have NO better
> options. If it's a frequently occurring event that we have to return
> BLK_STS_RESOURCE, then we need to get a way to register an event for
> when that condition clears. That event will then kick the necessary
> queue(s).

Please see queue_delayed_work_on(): hctx->run_work is shared by all
scheduling, so once blk_mq_delay_run_hw_queue(100ms) has armed it, no new
scheduling can make progress during the 100ms.
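
Ming's point can be modeled in a few lines: because there is a single shared
delayed-work slot per hctx, arming it with a long delay blocks any earlier
rerun attempt (this is a conceptual sketch of queue_delayed_work_on()'s
"no-op if already pending" behavior, not kernel code):

```python
class HwCtx:
    """Toy model of hctx->run_work: one shared pending-work slot."""
    def __init__(self):
        self.pending_at = None  # time at which the armed run_work fires

    def delay_run(self, now_ms, delay_ms):
        # queue_delayed_work_on() does nothing if the work item is
        # already pending, so a later request cannot shorten the delay.
        if self.pending_at is None:
            self.pending_at = now_ms + delay_ms
            return True
        return False

hctx = HwCtx()
hctx.delay_run(0, 100)          # blk_mq_delay_run_hw_queue(hctx, 100)
armed = hctx.delay_run(5, 0)    # immediate rerun attempt at t=5ms: absorbed
```

The second call returns False and the queue still waits until t=100ms,
which is exactly the stall Ming is describing.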

> 
> >> From the original topic of this email, we have conditions that can cause
> >> the driver to not be able to submit an IO. A set of those conditions can
> >> only happen if IO is in flight, and those cases we have covered just
> >> fine. Another set can potentially trigger without IO being in flight.
> >> These are cases where a non-device resource is unavailable at the time
> >> of submission. This might be iommu running out of space, for instance,
> >> or it might be a memory allocation of some sort. For these cases, we
> >> don't get any notification when the shortage clears. All we can do is
> >> ensure that we restart operations at some point in the future. We're SOL
> >> at that point, but we have to ensure that we make forward progress.
> > 
> > Right, it is a generic issue, not DM-specific one, almost all drivers
> > call kmalloc(GFP_ATOMIC) in IO path.
> 
> GFP_ATOMIC basically never fails, unless we are out of memory. The

I guess GFP_KERNEL may never fail, but GFP_ATOMIC failure is possible, and
it is mentioned[1] that there is code like the following in the mm
allocation path; OOM can happen too.

  if ((some randomly generated condition) && (request is atomic))
  return NULL;

[1] https://lwn.net/Articles/276731/
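
The fault-injection idea from the LWN article can be sketched as a runnable
model: only atomic (non-sleeping) allocations are candidates for an injected
failure, while GFP_KERNEL can block and reclaim. Names and the 1% failure
rate are illustrative, not the kernel's actual fault-injection code:

```python
import random

GFP_ATOMIC = object()
GFP_KERNEL = object()

def should_fail():
    # Stand-in for the mm fault-injection knob.
    return random.random() < 0.01

def kmalloc(size, flags):
    # Model: atomic allocations may fail (injected or under memory
    # pressure); GFP_KERNEL allocations block until they succeed here.
    if flags is GFP_ATOMIC and should_fail():
        return None
    return bytearray(size)
```

A driver calling this in its IO path must therefore handle a NULL return
from every atomic allocation.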

> exception is higher order allocations. If a driver has a higher order
> atomic allocation in its IO path, the device driver writer needs to be
> taken out behind the barn and shot. Simple as that. It will NEVER work
> well in a production environment. Witness the disaster that so many NIC
> driver writers have learned.
> 
> This is NOT the case we care about here. It's resources that are more
> readily depleted because other devices are using them. If it's a high
> frequency or generally occurring event, then we simply must have a
> callback to restart the queue from that. The condition then becomes
> identical to device private starvation, the only difference being from
> where we restart the queue.
> 
> > IMO, there is enough time for figuring out a generic solution before
> > 4.16 release.
> 
> I would hope so, but the proposed solutions have not filled me with
> a lot of confidence in the end result so far.
> 
> >> That last set of conditions better not be a common occurrence, since
> >> performance is down the toilet at that point. I don't want to introduce
> >> hot path code to rectify it. Have the driver return if that happens in a
> >> way that is DIFFERENT from needing a normal restart. The driver knows if
> >> this is a resource that will become available when IO completes on this
> >> device or not. If we get that return, we have a generic run-again delay.
> > 
> > Now most of the time both NVMe and SCSI won't return BLK_STS_RESOURCE, and
> > it should be DM-only which returns STS_RESOURCE so often.
> 
> Where does the dm STS_RESOURCE error usually come from - what exact
> resource are we running out of?

It is from blk_get_request(underlying queue), see multipath_clone_and_map().
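
The path Ming names can be sketched as a toy model: dm-mpath clones the
request onto an underlying path's queue, and when that queue's tag space is
exhausted blk_get_request() returns NULL, forcing BLK_STS_RESOURCE back to
blk-mq. The numeric status values and class names below are illustrative:

```python
BLK_STS_OK, BLK_STS_RESOURCE = 0, 9   # illustrative numeric stand-ins

class UnderlyingQueue:
    """Toy model of a path's request queue with a fixed tag budget."""
    def __init__(self, tags):
        self.free_tags = tags

    def blk_get_request(self):
        if self.free_tags == 0:
            return None          # tag space exhausted
        self.free_tags -= 1
        return object()          # the clone request

def multipath_clone_and_map(path_queue):
    clone = path_queue.blk_get_request()
    if clone is None:
        # All tags of the underlying path are in flight: dm-mpath can
        # only tell blk-mq "no resource, try again later".
        return BLK_STS_RESOURCE, None
    return BLK_STS_OK, clone
```

Under heavy load against a shallow underlying queue, the second branch
fires often, which is why DM returns STS_RESOURCE so much more than NVMe
or SCSI do directly.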

Thanks,
Ming


Re: [PATCH] dm rq: Avoid that request processing stalls sporadically

2018-01-18 Thread Ming Lei
On Thu, Jan 18, 2018 at 05:49:10PM -0700, Jens Axboe wrote:
> On 1/18/18 5:35 PM, Ming Lei wrote:
> > On Thu, Jan 18, 2018 at 05:24:29PM -0700, Jens Axboe wrote:
> >> On 1/18/18 5:18 PM, Ming Lei wrote:
> >>> On Fri, Jan 19, 2018 at 12:14:24AM +, Bart Van Assche wrote:
>  On Fri, 2018-01-19 at 08:11 +0800, Ming Lei wrote:
> > On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
> >> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> >> index f16096af879a..c59c59cfd2a5 100644
> >> --- a/drivers/md/dm-rq.c
> >> +++ b/drivers/md/dm-rq.c
> >> @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
> >>/* Undo dm_start_request() before requeuing */
> >>rq_end_stats(md, rq);
> >>rq_completed(md, rq_data_dir(rq), false);
> >> +  blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
> >>return BLK_STS_RESOURCE;
> >>}
> >>  
> >
> > Nak.
> 
>  This patch fixes a regression that was introduced by you. You should know
>  that regressions are not acceptable. If you don't agree with this patch,
>  please fix the root cause.
> >>>
> >>> Yesterday I sent a patch, did you test that?
> >>
> >> That patch isn't going to be of much help. It might prevent you from
> >> completely stalling, but it might take you tens of seconds to get there.
> > 
> > Yeah, that is true, and that is why I marked it as RFC, but the case rarely
> > happens.
> 
> You don't know that. If the resource is very limited, or someone else
> is gobbling up all of it, then it may trigger quite often. Granted,
> in that case, you need some way of signaling the freeing of those
> resources to make it efficient.
> 
> >> On top of that, it's a rolling timer that gets added when IO is queued,
> >> and re-added if IO is pending when it fires. If the latter case is not
> >> true, then it won't get armed again. So if IO fails to queue without
> >> any other IO pending, you're pretty much in the same situation as if
> > 
> > Detecting that no IO is pending may take a bit of time; we can do it after
> > BLK_STS_RESOURCE is returned and SCHED_RESTART is set. I have done it
> > this way in the following patch, what do you think of it?
> > 
> > https://github.com/ming1/linux/commit/f866ce0b97b0ae22d033881da7eb07706fd458b4
> 
> I think it's overly complicated. As I wrote in a lengthier email earlier
> today, just ensure that we have a unique return code from the driver for
> when it aborts queuing an IO due to a resource shortage that isn't tied
> to its own consumption of it. When we get that return, do a delayed
> queue run after X amount of time to ensure we always try again. If you
> want to get fancy (later on), you could track the frequency of such
> returns and complain if it's too high.

Suppose the new introduced return code is BLK_STS_NO_DEV_RESOURCE.

This way may degrade performance. For example, DM-MPATH returns
BLK_STS_NO_DEV_RESOURCE when blk_get_request() returns NULL, and blk-mq
handles it with blk_mq_delay_run_hw_queue(10ms). blk_mq_sched_restart()
is then triggered when one DM-MPATH request completes, which means one
request of the underlying queue has completed too, but the restart still
can't make progress during the 10ms.

That means we should only run blk_mq_delay_run_hw_queue(10ms/100ms) when
the queue is idle.
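
Ming's idle-only policy can be sketched as a small model (names are
illustrative; this is the decision rule, not the actual blk-mq patch):

```python
class ToyHctx:
    def __init__(self):
        self.in_flight = 0        # requests dispatched but not completed
        self.delayed_run_at = None

def handle_no_dev_resource(hctx, now_ms, delay_ms=100):
    # Arm the safety timer only when the queue is idle.  If requests are
    # still in flight, their completions will call blk_mq_sched_restart()
    # and rerun the queue immediately, so an armed timer would only
    # postpone that completion-driven rerun.
    if hctx.in_flight == 0:
        hctx.delayed_run_at = now_ms + delay_ms
```

With in-flight IO the queue relies on restarts; the delayed run exists
purely as a forward-progress guarantee for the truly idle case.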

I will post out the patch in github for discussion.

Thanks,
Ming


Re: [PATCH v5 1/2] Return bytes transferred for partial direct I/O

2018-01-18 Thread Al Viro
On Thu, Jan 18, 2018 at 10:31:18PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 19, 2018 at 02:59:51PM +1100, Dave Chinner wrote:
> > On Fri, Jan 19, 2018 at 02:13:53AM +, Al Viro wrote:
> > > On Thu, Jan 18, 2018 at 06:57:40PM -0600, Goldwyn Rodrigues wrote:
> > > > From: Goldwyn Rodrigues 
> > > > 
> > > > In case direct I/O encounters an error midway, it returns the error.
> > > > Instead it should be returning the number of bytes transferred so far.
> > > > 
> > > > Test case for filesystems (with ENOSPC):
> > > > 1. Create an almost full filesystem
> > > > 2. Create a file, say /mnt/lastfile, until the filesystem is full.
> > > > 3. Direct write() with count > sizeof /mnt/lastfile.
> > > > 
> > > > Result: write() returns -ENOSPC. However, file content has data written
> > > > in step 3.
> > > > 
> > > > This fixes fstest generic/472.
> > > 
> > > OK...  I can live with that.  What about the XFS side?  It should be
> > > a prereq, to avoid bisection hazard; I can throw both into vfs.git,
> > > if XFS folks are OK with that.  Objections?
> > 
> > Going through the VFS tree seems the best approach to me - it's a
> > trivial change. I'm sure Darrick will shout if it's going to be a
> > problem, though.
> 
> vfs.git is fine, though the second patch to remove the xfs assert should
> go first, as Al points out.
> 
> For both patches,
> Reviewed-by: Darrick J. Wong 

Applied; will be in -next tomorrow morning after the tree I'm putting together
gets through local beating.
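
The return convention the patch establishes can be sketched as: report the
bytes transferred when an error strikes midway, and return the error itself
only if nothing was written. A minimal model (ENOSPC as the example error,
as in the commit message's test case):

```python
import errno

def dio_write(fs_free_bytes, count):
    """Model of the fixed partial direct-I/O return convention."""
    written = min(fs_free_bytes, count)
    if written == count:
        return count             # full success
    if written > 0:
        return written           # partial: report progress, not the error
    return -errno.ENOSPC         # nothing written: report the error
```

This matches the described fstest generic/472 scenario: the write that hits
ENOSPC after transferring some data now returns the short count instead of
the error, so the caller and the on-disk contents agree.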


Re: [PATCH v5 1/2] Return bytes transferred for partial direct I/O

2018-01-18 Thread Darrick J. Wong
On Fri, Jan 19, 2018 at 02:59:51PM +1100, Dave Chinner wrote:
> On Fri, Jan 19, 2018 at 02:13:53AM +, Al Viro wrote:
> > On Thu, Jan 18, 2018 at 06:57:40PM -0600, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues 
> > > 
> > > In case direct I/O encounters an error midway, it returns the error.
> > > Instead it should be returning the number of bytes transferred so far.
> > > 
> > > Test case for filesystems (with ENOSPC):
> > > 1. Create an almost full filesystem
> > > 2. Create a file, say /mnt/lastfile, until the filesystem is full.
> > > 3. Direct write() with count > sizeof /mnt/lastfile.
> > > 
> > > Result: write() returns -ENOSPC. However, file content has data written
> > > in step 3.
> > > 
> > > This fixes fstest generic/472.
> > 
> > OK...  I can live with that.  What about the XFS side?  It should be
> > a prereq, to avoid bisection hazard; I can throw both into vfs.git,
> > if XFS folks are OK with that.  Objections?
> 
> > Going through the VFS tree seems the best approach to me - it's a
> trivial change. I'm sure Darrick will shout if it's going to be a
> problem, though.

vfs.git is fine, though the second patch to remove the xfs assert should
go first, as Al points out.

For both patches,
Reviewed-by: Darrick J. Wong 

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Bart Van Assche
On Fri, 2018-01-19 at 10:32 +0800, Ming Lei wrote:
> Now most of the time both NVMe and SCSI won't return BLK_STS_RESOURCE, and
> it should be DM-only which returns STS_RESOURCE so often.

That's wrong at least for SCSI. See also
https://marc.info/?l=linux-block&m=151578329417076.

Bart.

Re: [PATCH 2/2] xfs: remove assert to check bytes returned

2018-01-18 Thread Dave Chinner
On Fri, Jan 19, 2018 at 02:23:16AM -0200, Raphael Carvalho wrote:
> On Fri, Jan 19, 2018 at 1:57 AM, Dave Chinner  wrote:
> >
> > On Thu, Jan 18, 2018 at 06:57:41PM -0600, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues 
> > >
> > > Since we can return less than count in case of partial direct
> > > writes, remove the ASSERT.
> > >
> > > Signed-off-by: Goldwyn Rodrigues 
> > > ---
> > >  fs/xfs/xfs_file.c | 6 --
> > >  1 file changed, 6 deletions(-)
> > >
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 8601275cc5e6..8fc4dbf66910 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -590,12 +590,6 @@ xfs_file_dio_aio_write(
> > >   ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
> > >  out:
> > >   xfs_iunlock(ip, iolock);
> > > -
> > > - /*
> > > -  * No fallback to buffered IO on errors for XFS, direct IO will 
> > > either
> > > -  * complete fully or fail.
> > > -  */
> > > - ASSERT(ret < 0 || ret == count);
> > >   return ret;
> > >  }
> >
> > Acked-by: Dave Chinner 
> 
> 
> Is this really correct?

Yes.

> Isn't this check with regards to DIO
> submission?

Yes, if there is an error during submission.

But it also checked synchronous IO completion (i.e. error or bytes
written), because iomap_dio_rw() waits for non-AIO DIO to complete
and returns the IO completion status in that case.

> The bytes written is returned in a different asynchronous
> path due to AIO support, no?!

That is correct. For AIO we'll get -EIOCBQUEUED here on successful
submission.
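
Dave's description of the two completion paths can be modeled in a few
lines. This toy function reuses the iomap_dio_rw name for clarity only; the
real function takes an iocb and iterator, and -EIOCBQUEUED is the kernel's
"submitted, completion comes later" code (numerically 529):

```python
EIOCBQUEUED = 529

def iomap_dio_rw(is_aio, completion_result):
    """Toy model of the sync-vs-AIO return split.

    For AIO, successful *submission* returns -EIOCBQUEUED and the byte
    count travels later through the completion path.  For synchronous
    direct I/O the function waits and returns the completion status
    (bytes written, or a negative error) directly.
    """
    if is_aio:
        return -EIOCBQUEUED
    return completion_result
```

So the removed ASSERT really did check synchronous completion status too,
not just submission, which is why partial-write support invalidates it.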

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/2] xfs: remove assert to check bytes returned

2018-01-18 Thread Raphael Carvalho
On Fri, Jan 19, 2018 at 1:57 AM, Dave Chinner  wrote:
>
> On Thu, Jan 18, 2018 at 06:57:41PM -0600, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues 
> >
> > Since we can return less than count in case of partial direct
> > writes, remove the ASSERT.
> >
> > Signed-off-by: Goldwyn Rodrigues 
> > ---
> >  fs/xfs/xfs_file.c | 6 --
> >  1 file changed, 6 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 8601275cc5e6..8fc4dbf66910 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -590,12 +590,6 @@ xfs_file_dio_aio_write(
> > >   ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
> >  out:
> >   xfs_iunlock(ip, iolock);
> > -
> > - /*
> > -  * No fallback to buffered IO on errors for XFS, direct IO will either
> > -  * complete fully or fail.
> > -  */
> > - ASSERT(ret < 0 || ret == count);
> >   return ret;
> >  }
>
> Acked-by: Dave Chinner 


Is this really correct? Isn't this check with regards to DIO
submission? The bytes written is returned in a different asynchronous
path due to AIO support, no?!

>
>
> --
> Dave Chinner
> da...@fromorbit.com


Regards,
Raphael S. Carvalho


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Jens Axboe
On 1/18/18 7:32 PM, Ming Lei wrote:
> On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote:
>> On 1/18/18 11:47 AM, Bart Van Assche wrote:
 This is all very tiresome.
>>>
>>> Yes, this is tiresome. It is very annoying to me that others keep
>>> introducing so many regressions in such important parts of the kernel.
>>> It is also annoying to me that I get blamed if I report a regression
>>> instead of seeing that the regression gets fixed.
>>
>> I agree, it sucks that any change there introduces the regression. I'm
>> fine with doing the delay insert again until a new patch is proven to be
>> better.
> 
> That way is still buggy as I explained, since rerun queue before adding
> request to hctx->dispatch_list isn't correct. Who can make sure the request
> is visible when __blk_mq_run_hw_queue() is called?

That race basically doesn't exist for a 10ms gap.

> Not to mention this way will cause a performance regression again.

How so? It's _exactly_ the same as what you are proposing, except mine
will potentially run the queue when it need not do so. But given that
these are random 10ms queue kicks because we are screwed, it should not
matter. The key point is that it only should be if we have NO better
options. If it's a frequently occurring event that we have to return
BLK_STS_RESOURCE, then we need to get a way to register an event for
when that condition clears. That event will then kick the necessary
queue(s).

>> From the original topic of this email, we have conditions that can cause
>> the driver to not be able to submit an IO. A set of those conditions can
>> only happen if IO is in flight, and those cases we have covered just
>> fine. Another set can potentially trigger without IO being in flight.
>> These are cases where a non-device resource is unavailable at the time
>> of submission. This might be iommu running out of space, for instance,
>> or it might be a memory allocation of some sort. For these cases, we
>> don't get any notification when the shortage clears. All we can do is
>> ensure that we restart operations at some point in the future. We're SOL
>> at that point, but we have to ensure that we make forward progress.
> 
> Right, it is a generic issue, not DM-specific one, almost all drivers
> call kmalloc(GFP_ATOMIC) in IO path.

GFP_ATOMIC basically never fails, unless we are out of memory. The
exception is higher order allocations. If a driver has a higher order
atomic allocation in its IO path, the device driver writer needs to be
taken out behind the barn and shot. Simple as that. It will NEVER work
well in a production environment. Witness the disaster that so many NIC
driver writers have learned.

This is NOT the case we care about here. It's resources that are more
readily depleted because other devices are using them. If it's a high
frequency or generally occurring event, then we simply must have a
callback to restart the queue from that. The condition then becomes
identical to device private starvation, the only difference being from
where we restart the queue.

> IMO, there is enough time for figuring out a generic solution before
> 4.16 release.

I would hope so, but the proposed solutions have not filled me with
a lot of confidence in the end result so far.

> >> That last set of conditions better not be a common occurrence, since
>> performance is down the toilet at that point. I don't want to introduce
>> hot path code to rectify it. Have the driver return if that happens in a
>> way that is DIFFERENT from needing a normal restart. The driver knows if
>> this is a resource that will become available when IO completes on this
>> device or not. If we get that return, we have a generic run-again delay.
> 
> Now most of the time both NVMe and SCSI won't return BLK_STS_RESOURCE, and
> it should be DM-only which returns STS_RESOURCE so often.

Where does the dm STS_RESOURCE error usually come from - what exact
resource are we running out of?

-- 
Jens Axboe



Re: [PATCH v3 RESEND 2/2] blk-throttle: fix wrong initialization in case of dm device

2018-01-18 Thread Joseph Qi
Hi Mike,

On 18/1/19 11:29, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at 10:09pm -0500,
> Joseph Qi  wrote:
> 
>> From: Joseph Qi 
>>
>> DM device sets QUEUE_FLAG_NONROT after the queue is registered. That is
>> to mean, the previous initialization in blk_throtl_register_queue is
>> wrong in this case.
>> Fix it by checking and then updating the info during root tg
>> initialization as we don't have a better choice.
>>
>> Signed-off-by: Joseph Qi 
>> Reviewed-by: Shaohua Li 
>> ---
>>  block/blk-throttle.c | 19 +++
>>  1 file changed, 19 insertions(+)
>>
>> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
>> index bf52035..7150f14 100644
>> --- a/block/blk-throttle.c
>> +++ b/block/blk-throttle.c
>> @@ -541,6 +541,25 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
>>  if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent)
> >>  sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
>>  tg->td = td;
>> +
>> +#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
>> +/*
>> + * DM device sets QUEUE_FLAG_NONROT after the queue is registered,
>> + * so the previous initialization is wrong in this case. Check and
>> + * update it here.
>> + */
>> +if (blk_queue_nonrot(blkg->q) &&
>> +td->filtered_latency != LATENCY_FILTERED_SSD) {
>> +int i;
>> +
>> +td->throtl_slice = DFL_THROTL_SLICE_SSD;
>> +td->filtered_latency = LATENCY_FILTERED_SSD;
>> +for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
>> +td->avg_buckets[READ][i].latency = 0;
>> +td->avg_buckets[WRITE][i].latency = 0;
>> +}
>> +}
>> +#endif
>>  }
>>  
>>  /*
>> -- 
>> 1.9.4
> 
> This should be fixed for 4.16, please see these block tree commits:
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.16/block&id=fa70d2e2c4a0a54ced98260c6a176cc94c876d27
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.16/block&id=c100ec49fdd836ff8a17c7bfcc7611d2ee2b
> 
> The last commit's patch header even references the previous submission
> you had for this patch with:
> 
> "These changes also stave off the need to introduce new DM-specific
> workarounds in block core, e.g. this proposal:
> https://patchwork.kernel.org/patch/10067961/"
> 
Yes, if we call dm_table_set_restrictions before blk_register_queue now,
we can make sure the initialization is correct by checking whether the
QUEUE_FLAG_NONROT flag is set or not. So my patch is no longer needed.
Jens, please ignore this patch, thanks.
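
The ordering issue can be sketched as a toy model: the throttle slice is
sampled from QUEUE_FLAG_NONROT at registration time, so whether the flag is
set before or after blk_register_queue() determines the result. The slice
values below correspond to HZ/10 and HZ/50 expressed in milliseconds; the
function is an illustration, not the real blk-throttle code:

```python
DFL_THROTL_SLICE_HD_MS = 100   # HZ / 10 in ms
DFL_THROTL_SLICE_SSD_MS = 20   # HZ / 50 in ms

def blk_throtl_register_queue(nonrot_flag_set):
    # Initialization samples QUEUE_FLAG_NONROT once, at registration.
    if nonrot_flag_set:
        return DFL_THROTL_SLICE_SSD_MS
    return DFL_THROTL_SLICE_HD_MS

# Old ordering: DM sets NONROT only after blk_register_queue(), so the
# throttle slice is initialized from the rotational default (wrong for
# SSD-backed paths).  Fixed ordering: dm_table_set_restrictions() runs
# first and the flag is visible at registration.
slice_old = blk_throtl_register_queue(nonrot_flag_set=False)
slice_new = blk_throtl_register_queue(nonrot_flag_set=True)
```

With the block-tree commits applied, the "new" path is what happens, which
is why the in-throtl_pd_init workaround became unnecessary.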

Thanks,
Joseph


Re: [PATCH 2/2] xfs: remove assert to check bytes returned

2018-01-18 Thread Dave Chinner
On Thu, Jan 18, 2018 at 06:57:41PM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Since we can return less than count in case of partial direct
> writes, remove the ASSERT.
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/xfs/xfs_file.c | 6 --
>  1 file changed, 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 8601275cc5e6..8fc4dbf66910 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -590,12 +590,6 @@ xfs_file_dio_aio_write(
> >   ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
>  out:
>   xfs_iunlock(ip, iolock);
> -
> - /*
> -  * No fallback to buffered IO on errors for XFS, direct IO will either
> -  * complete fully or fail.
> -  */
> - ASSERT(ret < 0 || ret == count);
>   return ret;
>  }

Acked-by: Dave Chinner 

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v3 RESEND 2/2] blk-throttle: fix wrong initialization in case of dm device

2018-01-18 Thread Jens Axboe
On 1/18/18 8:29 PM, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at 10:09pm -0500,
> Joseph Qi  wrote:
> 
>> From: Joseph Qi 
>>
>> DM device sets QUEUE_FLAG_NONROT after the queue is registered. That is
>> to mean, the previous initialization in blk_throtl_register_queue is
>> wrong in this case.
>> Fix it by checking and then updating the info during root tg
>> initialization as we don't have a better choice.
>>
>> Signed-off-by: Joseph Qi 
>> Reviewed-by: Shaohua Li 
>> ---
>>  block/blk-throttle.c | 19 +++
>>  1 file changed, 19 insertions(+)
>>
>> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
>> index bf52035..7150f14 100644
>> --- a/block/blk-throttle.c
>> +++ b/block/blk-throttle.c
>> @@ -541,6 +541,25 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
>>  if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent)
> >>  sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
>>  tg->td = td;
>> +
>> +#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
>> +/*
>> + * DM device sets QUEUE_FLAG_NONROT after the queue is registered,
>> + * so the previous initialization is wrong in this case. Check and
>> + * update it here.
>> + */
>> +if (blk_queue_nonrot(blkg->q) &&
>> +td->filtered_latency != LATENCY_FILTERED_SSD) {
>> +int i;
>> +
>> +td->throtl_slice = DFL_THROTL_SLICE_SSD;
>> +td->filtered_latency = LATENCY_FILTERED_SSD;
>> +for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
>> +td->avg_buckets[READ][i].latency = 0;
>> +td->avg_buckets[WRITE][i].latency = 0;
>> +}
>> +}
>> +#endif
>>  }
>>  
>>  /*
>> -- 
>> 1.9.4
> 
> This should be fixed for 4.16, please see these block tree commits:
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.16/block&id=fa70d2e2c4a0a54ced98260c6a176cc94c876d27
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.16/block&id=c100ec49fdd836ff8a17c7bfcc7611d2ee2b
> 
> The last commit's patch header even references the previous submission
> you had for this patch with:
> 
> "These changes also stave off the need to introduce new DM-specific
> workarounds in block core, e.g. this proposal:
> https://patchwork.kernel.org/patch/10067961/"
> 

Joseph, please confirm if it works like it should with the dm tree.  The
point of Mike's work was to avoid special casing stuff like this, so
hopefully it took care of yours too.

-- 
Jens Axboe



Re: [PATCH v3 RESEND 2/2] blk-throttle: fix wrong initialization in case of dm device

2018-01-18 Thread Mike Snitzer
On Thu, Jan 18 2018 at 10:09pm -0500,
Joseph Qi  wrote:

> From: Joseph Qi 
> 
> DM device sets QUEUE_FLAG_NONROT after the queue is registered. That is
> to mean, the previous initialization in blk_throtl_register_queue is
> wrong in this case.
> Fix it by checking and then updating the info during root tg
> initialization as we don't have a better choice.
> 
> Signed-off-by: Joseph Qi 
> Reviewed-by: Shaohua Li 
> ---
>  block/blk-throttle.c | 19 +++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index bf52035..7150f14 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -541,6 +541,25 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
>   if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent)
> >   sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
>   tg->td = td;
> +
> +#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
> + /*
> +  * DM device sets QUEUE_FLAG_NONROT after the queue is registered,
> +  * so the previous initialization is wrong in this case. Check and
> +  * update it here.
> +  */
> + if (blk_queue_nonrot(blkg->q) &&
> + td->filtered_latency != LATENCY_FILTERED_SSD) {
> + int i;
> +
> + td->throtl_slice = DFL_THROTL_SLICE_SSD;
> + td->filtered_latency = LATENCY_FILTERED_SSD;
> + for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
> + td->avg_buckets[READ][i].latency = 0;
> + td->avg_buckets[WRITE][i].latency = 0;
> + }
> + }
> +#endif
>  }
>  
>  /*
> -- 
> 1.9.4

This should be fixed for 4.16, please see these block tree commits:
http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.16/block&id=fa70d2e2c4a0a54ced98260c6a176cc94c876d27
http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.16/block&id=c100ec49fdd836ff8a17c7bfcc7611d2ee2b

The last commit's patch header even references the previous submission
you had for this patch with:

"These changes also stave off the need to introduce new DM-specific
workarounds in block core, e.g. this proposal:
https://patchwork.kernel.org/patch/10067961/"


[PATCH v3 RESEND 2/2] blk-throttle: fix wrong initialization in case of dm device

2018-01-18 Thread Joseph Qi
From: Joseph Qi 

DM device sets QUEUE_FLAG_NONROT after the queue is registered. That is
to mean, the previous initialization in blk_throtl_register_queue is
wrong in this case.
Fix it by checking and then updating the info during root tg
initialization as we don't have a better choice.

Signed-off-by: Joseph Qi 
Reviewed-by: Shaohua Li 
---
 block/blk-throttle.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index bf52035..7150f14 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -541,6 +541,25 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent)
sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
tg->td = td;
+
+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+   /*
+* DM device sets QUEUE_FLAG_NONROT after the queue is registered,
+* so the previous initialization is wrong in this case. Check and
+* update it here.
+*/
+   if (blk_queue_nonrot(blkg->q) &&
+   td->filtered_latency != LATENCY_FILTERED_SSD) {
+   int i;
+
+   td->throtl_slice = DFL_THROTL_SLICE_SSD;
+   td->filtered_latency = LATENCY_FILTERED_SSD;
+   for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+   td->avg_buckets[READ][i].latency = 0;
+   td->avg_buckets[WRITE][i].latency = 0;
+   }
+   }
+#endif
 }
 
 /*
-- 
1.9.4


Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU

2018-01-18 Thread jianchao.wang
Hi Ming,

Sorry for the delayed report on this.

On 01/17/2018 05:57 PM, Ming Lei wrote:
> 2) hctx->next_cpu can become offline from online before __blk_mq_run_hw_queue
> is run, there isn't warning, but once the IO is submitted to hardware,
> after it is completed, how does the HBA/hw queue notify CPU since CPUs
> assigned to this hw queue(irq vector) are offline? blk-mq's timeout
> handler may cover that, but looks too tricky.

In theory, the irq affinity will be migrated to other cpu. This is done by
fixup_irqs() in the context of stop_machine.
However, in my test, I found this log:

[  267.161043] do_IRQ: 7.33 No irq handler for vector

The 33 is the vector used by nvme cq.
The irq seems to be missed and sometimes an IO hang occurred.
It does not happen every time; I think that may be due to nvme_process_cq
in nvme_queue_rq.

I add dump stack behind the error log and get following:
[  267.161043] do_IRQ: 7.33 No irq handler for vector migration/7
[  267.161045] CPU: 7 PID: 52 Comm: migration/7 Not tainted 4.15.0-rc7+ #27
[  267.161045] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
[  267.161046] Call Trace:
[  267.161047]  
[  267.161052]  dump_stack+0x7c/0xb5
[  267.161054]  do_IRQ+0xb9/0xf0
[  267.161056]  common_interrupt+0xa2/0xa2
[  267.161057]  
[  267.161059] RIP: 0010:multi_cpu_stop+0xb0/0x120
[  267.161060] RSP: 0018:bb6c81af7e70 EFLAGS: 0202 ORIG_RAX: ffde
[  267.161061] RAX: 0001 RBX: 0004 RCX: 
[  267.161062] RDX: 0006 RSI: 898c4591 RDI: 0202
[  267.161063] RBP: bb6c826e7c88 R08: 991abc1256bc R09: 0005
[  267.161063] R10: bb6c81af7db8 R11: 89c91d20 R12: 0001
[  267.161064] R13: bb6c826e7cac R14: 0003 R15: 
[  267.161067]  ? cpu_stop_queue_work+0x90/0x90
[  267.161068]  cpu_stopper_thread+0x83/0x100
[  267.161070]  smpboot_thread_fn+0x161/0x220
[  267.161072]  kthread+0xf5/0x130
[  267.161073]  ? sort_range+0x20/0x20
[  267.161074]  ? kthread_associate_blkcg+0xe0/0xe0
[  267.161076]  ret_from_fork+0x24/0x30

The irq arrived just after interrupts were re-enabled in multi_cpu_stop.

0x8112d655 is in multi_cpu_stop 
(/home/will/u04/source_code/linux-block/kernel/stop_machine.c:223).
218  */
219 touch_nmi_watchdog();
220 }
221 } while (curstate != MULTI_STOP_EXIT);
222 
223 local_irq_restore(flags);
224 return err;
225 }


Thanks
Jianchao


Re: [PATCH v3 RESEND 1/2] blk-throttle: track read and write request individually

2018-01-18 Thread Joseph Qi
Sure, I will resend it.
Thanks.

Joseph

On 18/1/19 10:52, Jens Axboe wrote:
> On 1/18/18 7:11 PM, Joseph Qi wrote:
>> Hi Jens,
>> Could you please pick the two pending patches for 4.16?
>> They all have been reviewed by Shaohua.
> 
> Sorry, I guess I forgot about this series. I picked up 1/2,
> for some reason I don't have 2/2 and I can't find it on
> linux-block either. The thread only shows the first patch.
> 
> Can you resend 2/2?
> 


Re: [PATCH v3 RESEND 1/2] blk-throttle: track read and write request individually

2018-01-18 Thread Jens Axboe
On 1/18/18 7:11 PM, Joseph Qi wrote:
> Hi Jens,
> Could you please pick the two pending patches for 4.16?
> They all have been reviewed by Shaohua.

Sorry, I guess I forgot about this series. I picked up 1/2,
for some reason I don't have 2/2 and I can't find it on
linux-block either. The thread only shows the first patch.

Can you resend 2/2?

-- 
Jens Axboe



Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Ming Lei
On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote:
> On 1/18/18 11:47 AM, Bart Van Assche wrote:
> >> This is all very tiresome.
> > 
> > Yes, this is tiresome. It is very annoying to me that others keep
> > introducing so many regressions in such important parts of the kernel.
> > It is also annoying to me that I get blamed if I report a regression
> > instead of seeing that the regression gets fixed.
> 
> I agree, it sucks that any change there introduces the regression. I'm
> fine with doing the delay insert again until a new patch is proven to be
> better.

That way is still buggy, as I explained, since rerunning the queue before adding
the request to hctx->dispatch_list isn't correct. Who can make sure the request
is visible when __blk_mq_run_hw_queue() is called?

Not to mention this way will cause a performance regression again.

> 
> From the original topic of this email, we have conditions that can cause
> the driver to not be able to submit an IO. A set of those conditions can
> only happen if IO is in flight, and those cases we have covered just
> fine. Another set can potentially trigger without IO being in flight.
> These are cases where a non-device resource is unavailable at the time
> of submission. This might be iommu running out of space, for instance,
> or it might be a memory allocation of some sort. For these cases, we
> don't get any notification when the shortage clears. All we can do is
> ensure that we restart operations at some point in the future. We're SOL
> at that point, but we have to ensure that we make forward progress.

Right, it is a generic issue, not a DM-specific one; almost all drivers
call kmalloc(GFP_ATOMIC) in the IO path.

IMO, there is enough time for figuring out a generic solution before
4.16 release.

> 
> That last set of conditions had better not be a common occurrence, since
> performance is down the toilet at that point. I don't want to introduce
> hot path code to rectify it. Have the driver return if that happens in a
> way that is DIFFERENT from needing a normal restart. The driver knows if
> this is a resource that will become available when IO completes on this
> device or not. If we get that return, we have a generic run-again delay.

Most of the time neither NVMe nor SCSI returns BLK_STS_RESOURCE, and
it should be DM only which returns STS_RESOURCE so often.

> 
> This basically becomes the same as doing the delay queue thing from DM,
> but just in a generic fashion.

Yeah, it is right.

-- 
Ming


Re: [PATCH v5 1/2] Return bytes transferred for partial direct I/O

2018-01-18 Thread Al Viro
On Thu, Jan 18, 2018 at 06:57:40PM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> In case direct I/O encounters an error midway, it returns the error.
> Instead it should be returning the number of bytes transferred so far.
> 
> Test case for filesystems (with ENOSPC):
> 1. Create an almost full filesystem
> 2. Create a file, say /mnt/lastfile, until the filesystem is full.
> 3. Direct write() with count > sizeof /mnt/lastfile.
> 
> Result: write() returns -ENOSPC. However, file content has data written
> in step 3.
> 
> This fixes fstest generic/472.

OK...  I can live with that.  What about the XFS side?  It should be
a prereq, to avoid bisection hazard; I can throw both into vfs.git,
if XFS folks are OK with that.  Objections?


Re: [PATCH v3 RESEND 1/2] blk-throttle: track read and write request individually

2018-01-18 Thread Joseph Qi
Hi Jens,
Could you please pick the two pending patches for 4.16?
They all have been reviewed by Shaohua.

Thanks,
Joseph

On 18/1/8 20:05, Joseph Qi wrote:
> A polite ping for the two pending patches...
> 
> Thanks,
> Joseph
> 
> On 17/11/24 13:13, Jens Axboe wrote:
>> On 11/23/2017 06:31 PM, Joseph Qi wrote:
>>> Hi Jens,
>>> Could you please give your advice for the two patches or pick them up if
>>> you think they are good?
>>
>> It looks OK to me, but my preference would be to push this until
>> 4.16.
>>


[PATCH v5 1/2] Return bytes transferred for partial direct I/O

2018-01-18 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

In case direct I/O encounters an error midway, it returns the error.
Instead it should be returning the number of bytes transferred so far.

Test case for filesystems (with ENOSPC):
1. Create an almost full filesystem
2. Create a file, say /mnt/lastfile, until the filesystem is full.
3. Direct write() with count > sizeof /mnt/lastfile.

Result: write() returns -ENOSPC. However, file content has data written
in step 3.

This fixes fstest generic/472.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Goldwyn Rodrigues 
---
 fs/block_dev.c |  2 +-
 fs/direct-io.c |  4 +---
 fs/iomap.c | 20 ++--
 3 files changed, 12 insertions(+), 14 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4a181fcb5175..49d94360bb51 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -409,7 +409,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 
if (!ret)
ret = blk_status_to_errno(dio->bio.bi_status);
-   if (likely(!ret))
+   if (likely(dio->size))
ret = dio->size;
 
	bio_put(&dio->bio);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 3aafb3343a65..0c98b6e65d7c 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -262,8 +262,6 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
ret = dio->page_errors;
if (ret == 0)
ret = dio->io_error;
-   if (ret == 0)
-   ret = transferred;
 
if (dio->end_io) {
// XXX: ki_pos??
@@ -310,7 +308,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
}
 
kmem_cache_free(dio_cache, dio);
-   return ret;
+   return transferred ? transferred : ret;
 }
 
 static void dio_aio_complete_work(struct work_struct *work)
diff --git a/fs/iomap.c b/fs/iomap.c
index 47d29ccffaef..cab57d85404e 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -716,23 +716,23 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
struct kiocb *iocb = dio->iocb;
struct inode *inode = file_inode(iocb->ki_filp);
loff_t offset = iocb->ki_pos;
-   ssize_t ret;
+   ssize_t err;
+   ssize_t transferred = dio->size;
 
if (dio->end_io) {
-   ret = dio->end_io(iocb,
-   dio->error ? dio->error : dio->size,
+   err = dio->end_io(iocb,
+   transferred ? transferred : dio->error,
dio->flags);
} else {
-   ret = dio->error;
+   err = dio->error;
}
 
-   if (likely(!ret)) {
-   ret = dio->size;
+   if (likely(transferred)) {
/* check for short read */
-   if (offset + ret > dio->i_size &&
+   if (offset + transferred > dio->i_size &&
!(dio->flags & IOMAP_DIO_WRITE))
-   ret = dio->i_size - offset;
-   iocb->ki_pos += ret;
+   transferred = dio->i_size - offset;
+   iocb->ki_pos += transferred;
}
 
/*
@@ -759,7 +759,7 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
inode_dio_end(file_inode(iocb->ki_filp));
kfree(dio);
 
-   return ret;
+   return transferred ? transferred : err;
 }
 
 static void iomap_dio_complete_work(struct work_struct *work)
-- 
2.15.1



[PATCH 2/2] xfs: remove assert to check bytes returned

2018-01-18 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Since we can return less than count in case of partial direct
writes, remove the ASSERT.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/xfs/xfs_file.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 8601275cc5e6..8fc4dbf66910 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -590,12 +590,6 @@ xfs_file_dio_aio_write(
	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
 out:
xfs_iunlock(ip, iolock);
-
-   /*
-* No fallback to buffered IO on errors for XFS, direct IO will either
-* complete fully or fail.
-*/
-   ASSERT(ret < 0 || ret == count);
return ret;
 }
 
-- 
2.15.1



Re: [PATCH] dm rq: Avoid that request processing stalls sporadically

2018-01-18 Thread Jens Axboe
On 1/18/18 5:35 PM, Ming Lei wrote:
> On Thu, Jan 18, 2018 at 05:24:29PM -0700, Jens Axboe wrote:
>> On 1/18/18 5:18 PM, Ming Lei wrote:
>>> On Fri, Jan 19, 2018 at 12:14:24AM +, Bart Van Assche wrote:
 On Fri, 2018-01-19 at 08:11 +0800, Ming Lei wrote:
> On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
>> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
>> index f16096af879a..c59c59cfd2a5 100644
>> --- a/drivers/md/dm-rq.c
>> +++ b/drivers/md/dm-rq.c
>> @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
>>  /* Undo dm_start_request() before requeuing */
>>  rq_end_stats(md, rq);
>>  rq_completed(md, rq_data_dir(rq), false);
>> +blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
>>  return BLK_STS_RESOURCE;
>>  }
>>  
>
> Nak.

 This patch fixes a regression that was introduced by you. You should know
 that regressions are not acceptable. If you don't agree with this patch,
 please fix the root cause.
>>>
>>> Yesterday I sent a patch, did you test that?
>>
>> That patch isn't going to be of much help. It might prevent you from
>> completely stalling, but it might take you tens of seconds to get there.
> 
> Yeah, that is true, and that is why I marked it as RFC, but this case
> rarely happens.

You don't know that. If the resource is very limited, or someone else
is gobbling up all of it, then it may trigger quite often. Granted,
in that case, you need some way of signaling the freeing of those
resources to make it efficient.

>> On top of that, it's a rolling timer that gets added when IO is queued,
>> and re-added if IO is pending when it fires. If the latter case is not
>> true, then it won't get armed again. So if IO fails to queue without
>> any other IO pending, you're pretty much in the same situation as if
> 
> Detecting that no IO is pending may take a bit of time; we can do it after
> BLK_STS_RESOURCE is returned and SCHED_RESTART is set. I have done it
> this way in the following patch, what do you think of it?
> 
> https://github.com/ming1/linux/commit/f866ce0b97b0ae22d033881da7eb07706fd458b4

I think it's overly complicated. As I wrote in a lengthier email earlier
today, just ensure that we have a unique return code from the driver for
when it aborts queuing an IO due to a resource shortage that isn't tied
to it's own consumption of it. When we get that return, do a delayed
queue run after X amount of time to ensure we always try again. If you
want to get fancy (later on), you could track the frequency of such
returns and complain if it's too high.

There's no point in adding a lot of code to check whether we need to run
the queue or not. Just always do it. We won't be doing it more than 100
times per second for the worst case of the condition taking a while to
clear, if we stick to the 10ms re-run time.

-- 
Jens Axboe



Re: dm rq: Avoid that request processing stalls sporadically

2018-01-18 Thread Bart Van Assche
On Fri, 2018-01-19 at 08:26 +0800, Ming Lei wrote:
> No, this case won't return BLK_STS_RESOURCE.

It's possible that my attempt to reverse engineer the latest blk-mq changes
was wrong. But the queue stall is real and needs to be fixed.

Bart.

Re: dm rq: Avoid that request processing stalls sporadically

2018-01-18 Thread Ming Lei
On Thu, Jan 18, 2018 at 05:13:53PM +, Bart Van Assche wrote:
> On Thu, 2018-01-18 at 11:50 -0500, Mike Snitzer wrote:
> > The issue you say it was originally intended to fix _should_ be
> > addressed with this change:
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=4dd6edd23e7ea971efddc303f9e67eb79e95808e
> 
> Hello Mike,
> 
> Sorry but I'm not convinced that that patch is sufficient. That patch helps
> if .end_io() is called with status BLK_STS_RESOURCE and also if
> blk_insert_cloned_request() returns the .queue_rq() return value. It does not
> help if .queue_rq() returns BLK_STS_RESOURCE and that return value gets
> ignored.

The return value from .queue_rq() is handled by blk-mq, why do you think
it can be ignored? Please see blk_mq_dispatch_rq_list().

> I think that can happen as follows:
> - Request cloning in multipath_clone_and_map() succeeds and that function
>   returns DM_MAPIO_REMAPPED.
> - dm_dispatch_clone_request() calls blk_insert_cloned_request().
> - blk_insert_cloned_request() calls blk_mq_request_direct_issue(), which
>   results in a call of __blk_mq_try_issue_directly().
> - __blk_mq_try_issue_directly() calls blk_mq_sched_insert_request(). In this

This only happens if the queue is stopped or quiesced, and then we return
BLK_STS_OK to blk-mq via .queue_rq(); please see __blk_mq_try_issue_directly().
How does this cause an IO hang?

>   case the BLK_STS_RESOURCE returned by the .queue_rq() implementation of the
>   underlying path will be ignored.

No, this case won't return BLK_STS_RESOURCE.

-- 
Ming


Re: [PATCH] dm rq: Avoid that request processing stalls sporadically

2018-01-18 Thread Jens Axboe
On 1/18/18 5:18 PM, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 12:14:24AM +, Bart Van Assche wrote:
>> On Fri, 2018-01-19 at 08:11 +0800, Ming Lei wrote:
>>> On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
 diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
 index f16096af879a..c59c59cfd2a5 100644
 --- a/drivers/md/dm-rq.c
 +++ b/drivers/md/dm-rq.c
 @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
/* Undo dm_start_request() before requeuing */
rq_end_stats(md, rq);
rq_completed(md, rq_data_dir(rq), false);
 +  blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
return BLK_STS_RESOURCE;
}
  
>>>
>>> Nak.
>>
>> This patch fixes a regression that was introduced by you. You should know
>> that regressions are not acceptable. If you don't agree with this patch,
>> please fix the root cause.
> 
> Yesterday I sent a patch, did you test that?

That patch isn't going to be of much help. It might prevent you from
completely stalling, but it might take you tens of seconds to get there.

On top of that, it's a rolling timer that gets added when IO is queued,
and re-added if IO is pending when it fires. If the latter case is not
true, then it won't get armed again. So if IO fails to queue without
any other IO pending, you're pretty much in the same situation as if
you marked the queue as restart. Nobody is going to be noticing
either of those conditions.

-- 
Jens Axboe



Re: [PATCH] dm rq: Avoid that request processing stalls sporadically

2018-01-18 Thread Bart Van Assche
On Fri, 2018-01-19 at 08:18 +0800, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 12:14:24AM +, Bart Van Assche wrote:
> > On Fri, 2018-01-19 at 08:11 +0800, Ming Lei wrote:
> > > On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
> > > > diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> > > > index f16096af879a..c59c59cfd2a5 100644
> > > > --- a/drivers/md/dm-rq.c
> > > > +++ b/drivers/md/dm-rq.c
> > > > @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
> > > > /* Undo dm_start_request() before requeuing */
> > > > rq_end_stats(md, rq);
> > > > rq_completed(md, rq_data_dir(rq), false);
> > > > +   blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
> > > > return BLK_STS_RESOURCE;
> > > > }
> > > >  
> > > 
> > > Nak.
> > 
> > This patch fixes a regression that was introduced by you. You should know
> > that regressions are not acceptable. If you don't agree with this patch,
> > please fix the root cause.
> 
> Yesterday I sent a patch, did you test that?

Yes, I did. It caused queue stalls that were so bad that sending "kick" to the
debugfs "state" attribute was not sufficient to resolve the queue stall.

Bart.

Re: [PATCH] dm rq: Avoid that request processing stalls sporadically

2018-01-18 Thread Ming Lei
On Fri, Jan 19, 2018 at 12:14:24AM +, Bart Van Assche wrote:
> On Fri, 2018-01-19 at 08:11 +0800, Ming Lei wrote:
> > On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
> > > diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> > > index f16096af879a..c59c59cfd2a5 100644
> > > --- a/drivers/md/dm-rq.c
> > > +++ b/drivers/md/dm-rq.c
> > > @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
> > >   /* Undo dm_start_request() before requeuing */
> > >   rq_end_stats(md, rq);
> > >   rq_completed(md, rq_data_dir(rq), false);
> > > + blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
> > >   return BLK_STS_RESOURCE;
> > >   }
> > >  
> > 
> > Nak.
> 
> This patch fixes a regression that was introduced by you. You should know
> that regressions are not acceptable. If you don't agree with this patch,
> please fix the root cause.

Yesterday I sent a patch, did you test that?

-- 
Ming


Re: [PATCH] dm rq: Avoid that request processing stalls sporadically

2018-01-18 Thread Bart Van Assche
On Fri, 2018-01-19 at 08:11 +0800, Ming Lei wrote:
> On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
> > diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> > index f16096af879a..c59c59cfd2a5 100644
> > --- a/drivers/md/dm-rq.c
> > +++ b/drivers/md/dm-rq.c
> > @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
> > /* Undo dm_start_request() before requeuing */
> > rq_end_stats(md, rq);
> > rq_completed(md, rq_data_dir(rq), false);
> > +   blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
> > return BLK_STS_RESOURCE;
> > }
> >  
> 
> Nak.

This patch fixes a regression that was introduced by you. You should know
that regressions are not acceptable. If you don't agree with this patch,
please fix the root cause.

Bart.

Re: [PATCH] dm rq: Avoid that request processing stalls sporadically

2018-01-18 Thread Ming Lei
On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
> If the .queue_rq() implementation of a block driver returns
> BLK_STS_RESOURCE then that block driver is responsible for
> rerunning the queue once the condition that caused it to return
> BLK_STS_RESOURCE has been cleared. The dm-mpath driver tells the
> dm core to requeue a request if e.g. not enough memory is
> available for cloning a request or if the underlying path is
> busy. Since the dm-mpath driver does not receive any kind of
> notification if the condition that caused it to return "requeue"
> is cleared, the only solution to avoid that dm-mpath request
> processing stalls is to call blk_mq_delay_run_hw_queue(). Hence
> this patch.
> 
> Fixes: ec3eaf9a6731 ("dm mpath: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE")
> Signed-off-by: Bart Van Assche 
> Cc: Ming Lei 
> ---
>  drivers/md/dm-rq.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index f16096af879a..c59c59cfd2a5 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
>   /* Undo dm_start_request() before requeuing */
>   rq_end_stats(md, rq);
>   rq_completed(md, rq_data_dir(rq), false);
> + blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
>   return BLK_STS_RESOURCE;
>   }
>  

Nak.

It still takes some time to add this request to hctx->dispatch_list
from here, so suppose that takes longer than 100ms because of an interrupt,
preemption or whatever; then this request can't be observed by the scheduled
queue run (__blk_mq_run_hw_queue()).

Not to mention it is just an ugly workaround, which degrades performance
a lot.

-- 
Ming


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Bart Van Assche
On Thu, 2018-01-18 at 15:39 -0700, Jens Axboe wrote:
> When you do have a solid test case, please please submit a blktests
> test case for it! This needs to be something we can run regularly in
> testing.

Hello Jens,

That sounds like a good idea to me. BTW, I think the reason why so far I
can reproduce these queue stalls more easily than others is that I modified the
SRP initiator to make it easy to cause the .get_budget() call to succeed and
the scsi_queue_rq() to return BLK_STS_BUSY. A possible next step is to apply
a similar change to the scsi_debug driver. The change I made is as follows:

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 0c887ebfbc64..7f3c4a197425 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3135,6 +3135,16 @@ static int srp_reset_host(struct scsi_cmnd *scmnd)
return srp_reconnect_rport(target->rport) == 0 ? SUCCESS : FAILED;
 }
 
+static int srp_target_alloc(struct scsi_target *starget)
+{
+   struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
+   struct srp_target_port *target = host_to_target(shost);
+
+   if (target->target_can_queue)
+   starget->can_queue = target->target_can_queue;
+   return 0;
+}
+
 static int srp_slave_alloc(struct scsi_device *sdev)
 {
struct Scsi_Host *shost = sdev->host;
@@ -3348,6 +3358,7 @@ static struct scsi_host_template srp_template = {
.module = THIS_MODULE,
.name   = "InfiniBand SRP initiator",
.proc_name  = DRV_NAME,
+   .target_alloc   = srp_target_alloc,
.slave_alloc= srp_slave_alloc,
.slave_configure= srp_slave_configure,
.info   = srp_target_info,
@@ -3515,6 +3526,7 @@ enum {
SRP_OPT_QUEUE_SIZE  = 1 << 14,
SRP_OPT_IP_SRC  = 1 << 15,
SRP_OPT_IP_DEST = 1 << 16,
+   SRP_OPT_TARGET_CAN_QUEUE= 1 << 17,
 };
 
 static unsigned int srp_opt_mandatory[] = {
@@ -3536,6 +3548,7 @@ static const match_table_t srp_opt_tokens = {
{ SRP_OPT_SERVICE_ID,   "service_id=%s" },
{ SRP_OPT_MAX_SECT, "max_sect=%d"   },
{ SRP_OPT_MAX_CMD_PER_LUN,  "max_cmd_per_lun=%d"},
+   { SRP_OPT_TARGET_CAN_QUEUE, "target_can_queue=%d"   },
{ SRP_OPT_IO_CLASS, "io_class=%x"   },
{ SRP_OPT_INITIATOR_EXT,"initiator_ext=%s"  },
{ SRP_OPT_CMD_SG_ENTRIES,   "cmd_sg_entries=%u" },
@@ -3724,6 +3737,15 @@ static int srp_parse_options(struct net *net, const char *buf,
target->scsi_host->cmd_per_lun = token;
break;
 
+   case SRP_OPT_TARGET_CAN_QUEUE:
+   if (match_int(args, &token) || token < 1) {
+   pr_warn("bad max target_can_queue parameter '%s'\n",
+   p);
+   goto out;
+   }
+   target->target_can_queue = token;
+   break;
+
case SRP_OPT_IO_CLASS:
if (match_hex(args, &token)) {
pr_warn("bad IO class parameter '%s'\n", p);
diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h
index d66c9057d5ea..70334fa3de8e 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -216,6 +216,7 @@ struct srp_target_port {
chartarget_name[32];
unsigned intscsi_id;
unsigned intsg_tablesize;
+   unsigned inttarget_can_queue;
int mr_pool_size;
int mr_per_cmd;
int queue_size;

Bart.

Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Jens Axboe
On 1/18/18 3:35 PM, Laurence Oberman wrote:
> On Thu, 2018-01-18 at 22:24 +, Bart Van Assche wrote:
>> On Thu, 2018-01-18 at 17:18 -0500, Laurence Oberman wrote:
>>> OK, I ran 5 at once of 5 separate mount points.
>>> I am using 4k block sizes
>>> Its solid consistent for me. No stalls no gaps.
>>
>> Hi Laurence,
>>
>> That's great news and thank you for having shared this information
>> but I think
>> it should be mentioned that you have been running my tree in which
>> some recent
>> block and dm patches were reverted
>> (https://github.com/bvanassche/linux/tree/block-scsi-for-next)
>>
>>>
>>> [global]
>>> name=02-mq
>>> filename=fio-output-02-mq.txt
>>> rw=randwrite
>>> verify=md5
>>> ;rwmixread=60
>>> ;rwmixwrite=40
>>> bs=4K
>>> ;direct=1
>>> ;numjobs=4
>>> ;time_based=1
>>> runtime=120
>>>
>>> [file1]
>>> size=3G
>>> ioengine=libaio
>>> iodepth=16
>>>
>>> I watch I/O and I see it ramp up but fio still runs and it kind of
>>> shuts down.
>>
>> In test "file1" I see an I/O size of 3G. Does that mean that the
>> patch that
>> should fix the sgl_alloc() issue is working?
>>
>> Thanks,
>>
>> Bart.
> 
> Hello Bart Thank you.
> 
> OK, so booting into Mike tree now and then hopefully I get the lockups.
> Can you give me some idea of what to look for.
> I assume I/O just stops.
> 
> I want to get this happening in-house so we have an avenue to fix this.
> Following getting the stall I will attend to the slg patch test for the
> SRPT side.
> 
> Please note I am not running latest SRPT right now as you know.

When you do have a solid test case, please please submit a blktests
test case for it! This needs to be something we can run regularly in
testing.

-- 
Jens Axboe



Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Laurence Oberman
On Thu, 2018-01-18 at 22:24 +, Bart Van Assche wrote:
> On Thu, 2018-01-18 at 17:18 -0500, Laurence Oberman wrote:
> > OK, I ran 5 at once of 5 separate mount points.
> > I am using 4k block sizes
> > Its solid consistent for me. No stalls no gaps.
> 
> Hi Laurence,
> 
> That's great news and thank you for having shared this information
> but I think
> it should be mentioned that you have been running my tree in which
> some recent
> block and dm patches were reverted
> (https://github.com/bvanassche/linux/tree/block-scsi-for-next)
> 
> > 
> > [global]
> > name=02-mq
> > filename=fio-output-02-mq.txt
> > rw=randwrite
> > verify=md5
> > ;rwmixread=60
> > ;rwmixwrite=40
> > bs=4K
> > ;direct=1
> > ;numjobs=4
> > ;time_based=1
> > runtime=120
> > 
> > [file1]
> > size=3G
> > ioengine=libaio
> > iodepth=16
> > 
> > I watch I/O and I see it ramp up but fio still runs and it kind of
> > shuts down.
> 
> In test "file1" I see an I/O size of 3G. Does that mean that the
> patch that
> should fix the sgl_alloc() issue is working?
> 
> Thanks,
> 
> Bart.

Hello Bart Thank you.

OK, so booting into Mike tree now and then hopefully I get the lockups.
Can you give me some idea of what to look for.
I assume I/O just stops.

I want to get this happening in-house so we have an avenue to fix this.
Following getting the stall I will attend to the slg patch test for the
SRPT side.

Please note I am not running latest SRPT right now as you know.

Regards
Back later with results



Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Bart Van Assche
On Thu, 2018-01-18 at 17:18 -0500, Laurence Oberman wrote:
> OK, I ran 5 at once of 5 separate mount points.
> I am using 4k block sizes
> Its solid consistent for me. No stalls no gaps.

Hi Laurence,

That's great news and thank you for having shared this information but I think
it should be mentioned that you have been running my tree in which some recent
block and dm patches were reverted
(https://github.com/bvanassche/linux/tree/block-scsi-for-next)

> 
> [global]
> name=02-mq
> filename=fio-output-02-mq.txt
> rw=randwrite
> verify=md5
> ;rwmixread=60
> ;rwmixwrite=40
> bs=4K
> ;direct=1
> ;numjobs=4
> ;time_based=1
> runtime=120
> 
> [file1]
> size=3G
> ioengine=libaio
> iodepth=16
> 
> I watch I/O and I see it ramp up but fio still runs and it kind of
> shuts down.

In test "file1" I see an I/O size of 3G. Does that mean that the patch that
should fix the sgl_alloc() issue is working?

Thanks,

Bart.

Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Bart Van Assche
On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote:
> And yet Laurence cannot reproduce any such lockups with your test...

Hmm ... maybe I misunderstood Laurence but I don't think that Laurence has
already succeeded at running an unmodified version of my tests. In one of the
e-mails Laurence sent me this morning I read that he modified these scripts
to get past a kernel module unload failure that was reported while starting
these tests. So the next step is to check which changes were made to the test
scripts and also whether the test results are still valid.

> Are you absolutely certain this patch doesn't help you?
> https://patchwork.kernel.org/patch/10174037/
> 
> If it doesn't then that is actually very useful to know.

The first I tried this morning is to run the srp-test software against a merge
of Jens' for-next branch and your dm-4.16 branch. Since I noticed that the dm
queue locked up I reinserted a blk_mq_delay_run_hw_queue() call in the dm code.
Since even that was not sufficient I tried to kick the queues via debugfs (for
s in /sys/kernel/debug/block/*/state; do echo kick >$s; done). Since that was
not sufficient to resolve the queue stall I reverted the following tree patches
that are in Jens' tree:
* "blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback"
* "blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request"
* "blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy"

Only after I had done this the srp-test software ran again without triggering
dm queue lockups. Sorry but I have not yet had the time to test patch "[RFC]
blk-mq: fixup RESTART when queue becomes idle".

> Please just focus on helping Laurence get his very capable testbed to
> reproduce this issue.  Once we can reproduce these "unkillable" "stalls"
> in-house it'll be _much_ easier to analyze and fix.

OK, I will work with Laurence on this. Maybe Laurence and I should work on this
before analyzing the lockup that was mentioned above further?

Bart.

Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Laurence Oberman
On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at  4:39pm -0500,
> Bart Van Assche  wrote:
> 
> > On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> > > On Thu, Jan 18 2018 at  3:58P -0500,
> > > Bart Van Assche  wrote:
> > > 
> > > > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > > > For Bart's test the underlying scsi-mq driver is what is
> > > > > regularly
> > > > > hitting this case in __blk_mq_try_issue_directly():
> > > > > 
> > > > > if (blk_mq_hctx_stopped(hctx) ||
> > > > > blk_queue_quiesced(q))
> > > > 
> > > > These lockups were all triggered by incorrect handling of
> > > > .queue_rq() returning BLK_STS_RESOURCE.
> > > 
> > > Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
> > > "Incorrect" because it no longer runs
> > > blk_mq_delay_run_hw_queue()?
> > 
> > In what I wrote I was referring to both dm_mq_queue_rq() and
> > scsi_queue_rq().
> > With "incorrect" I meant that queue lockups are introduced that
> > make user
> > space processes unkillable. That's a severe bug.
> 
> And yet Laurence cannot reproduce any such lockups with your test...
> 
> Are you absolutely certain this patch doesn't help you?
> https://patchwork.kernel.org/patch/10174037/
> 
> If it doesn't then that is actually very useful to know.
> 
> > > We have time to get this right, please stop hyperventilating
> > > about
> > > "regressions".
> > 
> > Sorry Mike but that's something I consider as an unfair comment. If
> > Ming and
> > you work on patches together, it's your job to make sure that no
> > regressions
> > are introduced. Instead of blaming me because I report these
> > regressions you
> > should be grateful that I take the time and effort to report these
> > regressions
> > early. And since you are employed by a large organization that
> > sells Linux
> > support services, your employer should invest in developing test
> > cases that
> > reach a higher coverage of the dm, SCSI and block layer code. I
> > don't think
> > that it's normal that my tests discovered several issues that were
> > not
> > discovered by Red Hat's internal test suite. That's something Red
> > Hat has to
> > address.
> 
> You have no self-awareness of just how myopic you're being about
> this.
> 
> I'm not ignoring or blaming you for your test no longer passing.  Far
> from it.  I very much want to fix this.  But I want it fixed in a way
> that doesn't paper over the real bug(s) while also introducing blind
> queue runs that compromise the blk-mq RESTART code's ability to work
> as
> intended.
> 
> I'd have thought you could appreciate this.  We need a root cause on
> this, not hand-waving justifications on why problematic delayed queue
> runs are correct.
> 
> Please just focus on helping Laurence get his very capable testbed to
> reproduce this issue.  Once we can reproduce these "unkillable"
> "stalls"
> in-house it'll be _much_ easier to analyze and fix.
> 
> Thanks,
> Mike

Hello Bart

OK, I ran 5 at once of 5 separate mount points.
I am using 4k block sizes
It's solid and consistent for me. No stalls, no gaps.

[global]
name=02-mq
filename=fio-output-02-mq.txt
rw=randwrite
verify=md5
;rwmixread=60
;rwmixwrite=40
bs=4K
;direct=1
;numjobs=4
;time_based=1
runtime=120

[file1]
size=3G
ioengine=libaio
iodepth=16

I watch I/O and I see it ramp up but fio still runs and it kind of
shuts down.

#!/bin/bash
for i in 1 2 3 4 5
do
cd /data-$i 
fio /root/srp-test.lobe/fio_tests/02-mq 1>/data-$i.out 2>&1 &
done


#Time cpu sys inter  ctxsw KBRead  Reads KBWrit Writes 

17:16:09   34  12 34431  10393  0  0 249872  36094 
17:16:10   31  10 13001   2679  0  0  57652   7980 
17:16:11   32  11 19473   4416  0  0 143248  17362 
17:16:12   31   9  8068   1606  0  0  20088   2026 
17:16:13   31   9  7718   1518  0  0  15908   1354 
17:16:14   36  14 41132   9710  0  0 686216  46661 
17:16:15   39  18 63622  21033  0  0  1246K  75108 
17:16:16   38  16 76114   9013  0  0  1146K  82338 
17:16:17   33  11 31941   2735  0  0 237340  25034 
17:16:18   36  14 23900   4982  0  1  1567K  43244 
17:16:19   37  15 55884   4974  0  4  1003K  67253 
17:16:20   28  12  7542   4975  0  0  1630K   3266 
17:16:21    8   6 14650  34721  0  0  2535K  21206 
17:16:22    2   2  6655  15839  0  0  2319K   9897 
17:16:23   11  11 37305 134030  0  0  1275K  87010 
17:16:24   23  22 59266 119350   6560   1640  1102K  78034 
17:16:25   21  17 65699 144462 148052  37013 159900  22120 
17:16:26   14   9 80700 164034 216588  54147  4  1 
# <--Disks--->
#Time cpu sys inter  ctxsw KBRead  Reads KBWrit Writes 
17:16:27   14   9 78699 162095 214412  53603  0  0 
17:16:28   14   9 75895 155352 204932  51233  0  0 
17:16:29   14   9 75377 161871 214136  53534  

Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Mike Snitzer
On Thu, Jan 18 2018 at  4:39pm -0500,
Bart Van Assche  wrote:

> On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> > On Thu, Jan 18 2018 at  3:58P -0500,
> > Bart Van Assche  wrote:
> > 
> > > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > > For Bart's test the underlying scsi-mq driver is what is regularly
> > > > hitting this case in __blk_mq_try_issue_directly():
> > > > 
> > > > if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
> > >
> > > These lockups were all triggered by incorrect handling of
> > > .queue_rq() returning BLK_STS_RESOURCE.
> > 
> > Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
> > "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?
> 
> In what I wrote I was referring to both dm_mq_queue_rq() and scsi_queue_rq().
> With "incorrect" I meant that queue lockups are introduced that make user
> space processes unkillable. That's a severe bug.

And yet Laurence cannot reproduce any such lockups with your test...

Are you absolutely certain this patch doesn't help you?
https://patchwork.kernel.org/patch/10174037/

If it doesn't then that is actually very useful to know.

> > We have time to get this right, please stop hyperventilating about
> > "regressions".
> 
> Sorry Mike but that's something I consider as an unfair comment. If Ming and
> you work on patches together, it's your job to make sure that no regressions
> are introduced. Instead of blaming me because I report these regressions you
> should be grateful that I take the time and effort to report these regressions
> early. And since you are employed by a large organization that sells Linux
> support services, your employer should invest in developing test cases that
> reach a higher coverage of the dm, SCSI and block layer code. I don't think
> that it's normal that my tests discovered several issues that were not
> discovered by Red Hat's internal test suite. That's something Red Hat has to
> address.

You have no self-awareness of just how myopic you're being about this.

I'm not ignoring or blaming you for your test no longer passing.  Far
from it.  I very much want to fix this.  But I want it fixed in a way
that doesn't paper over the real bug(s) while also introducing blind
queue runs that compromise the blk-mq RESTART code's ability to work as
intended.

I'd have thought you could appreciate this.  We need a root cause on
this, not hand-waving justifications on why problematic delayed queue
runs are correct.

Please just focus on helping Laurence get his very capable testbed to
reproduce this issue.  Once we can reproduce these "unkillable" "stalls"
in-house it'll be _much_ easier to analyze and fix.

Thanks,
Mike


Re: [dm-devel] [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Bart Van Assche
On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at  3:58P -0500,
> Bart Van Assche  wrote:
> 
> > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > For Bart's test the underlying scsi-mq driver is what is regularly
> > > hitting this case in __blk_mq_try_issue_directly():
> > > 
> > > if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
> >
> > These lockups were all triggered by incorrect handling of
> > .queue_rq() returning BLK_STS_RESOURCE.
> 
> Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
> "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?

In what I wrote I was referring to both dm_mq_queue_rq() and scsi_queue_rq().
With "incorrect" I meant that queue lockups are introduced that make user
space processes unkillable. That's a severe bug.

> Please try to do more work analyzing the test case that only you can
> easily run (due to srp_test being a PITA).

It is not correct that I'm the only one who is able to run that software.
Anyone who is willing to merge the latest SRP initiator and target driver
patches in his or her tree can run that software in
any VM. I'm working hard
on getting the patches upstream that make it possible to run the srp-test
software on a setup that is not equipped with InfiniBand hardware.

> We have time to get this right, please stop hyperventilating about
> "regressions".

Sorry Mike but that's something I consider as an unfair comment. If Ming and
you work on patches together, it's your job to make sure that no regressions
are introduced. Instead of blaming me because I report these regressions you
should be grateful that I take the time and effort to report these regressions
early. And since you are employed by a large organization that sells Linux
support services, your employer should invest in developing test cases that
reach a higher coverage of the dm, SCSI and block layer code. I don't think
that it's normal that my tests discovered several issues that were not
discovered by Red Hat's internal test suite. That's something Red Hat has to
address.

Bart.

Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Laurence Oberman
On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at  3:58P -0500,
> Bart Van Assche  wrote:
> 
> > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > For Bart's test the underlying scsi-mq driver is what is
> > > regularly
> > > hitting this case in __blk_mq_try_issue_directly():
> > > 
> > > if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
> > 
> > Hello Mike,
> > 
> > That code path is not the code path that triggered the lockups that
> > I reported
> > during the past days.
> 
> If you're hitting blk_mq_sched_insert_request() then you most
> certainly
> are hitting that code path.
> 
> If you aren't then what was your earlier email going on about?
> https://www.redhat.com/archives/dm-devel/2018-January/msg00372.html
> 
> If you were just focusing on that as one possible reason, that isn't
> very helpful.  By this point you really should _know_ what is
> triggering
> the stall based on the code paths taken.  Please use ftrace's
> function_graph tracer if need be.
> 
> > These lockups were all triggered by incorrect handling of
> > .queue_rq() returning BLK_STS_RESOURCE.
> 
> Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
> "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?
> 
> Please try to do more work analyzing the test case that only you can
> easily run (due to srp_test being a PITA).  And less time lobbying
> for
> a change that you don't understand to _really_ be correct.
> 
> We have time to get this right, please stop hyperventilating about
> "regressions".
> 
> Thanks,
> Mike

Hello Bart
I have run a good few loops of 02-mq and it's stable for me on your
tree.
I am not running the entire disconnect re-connect loops and un-mounts
etc. for good reason.
I have 35 LUNs so it's very impactful to lose them and have them come
back all the time.

Anyway
I am very happy to try reproduce this in-house so Mike and Ming can
focus on it but I need to know if all I need to do is loop over 02-mq
over and over.

Also please let me know whats debugfs and sysfs to capture and I am
happy to try help move this along.

Regards
Laurence




Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Mike Snitzer
On Thu, Jan 18 2018 at  3:58P -0500,
Bart Van Assche  wrote:

> On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > For Bart's test the underlying scsi-mq driver is what is regularly
> > hitting this case in __blk_mq_try_issue_directly():
> > 
> > if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
> 
> Hello Mike,
> 
> That code path is not the code path that triggered the lockups that I reported
> during the past days.

If you're hitting blk_mq_sched_insert_request() then you most certainly
are hitting that code path.

If you aren't then what was your earlier email going on about?
https://www.redhat.com/archives/dm-devel/2018-January/msg00372.html

If you were just focusing on that as one possible reason, that isn't
very helpful.  By this point you really should _know_ what is triggering
the stall based on the code paths taken.  Please use ftrace's
function_graph tracer if need be.

> These lockups were all triggered by incorrect handling of
> .queue_rq() returning BLK_STS_RESOURCE.

Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
"Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?

Please try to do more work analyzing the test case that only you can
easily run (due to srp_test being a PITA).  And less time lobbying for
a change that you don't understand to _really_ be correct.

We have time to get this right, please stop hyperventilating about
"regressions".

Thanks,
Mike


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Bart Van Assche
On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> For Bart's test the underlying scsi-mq driver is what is regularly
> hitting this case in __blk_mq_try_issue_directly():
> 
> if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))

Hello Mike,

That code path is not the code path that triggered the lockups that I reported
during the past days. These lockups were all triggered by incorrect handling of
.queue_rq() returning BLK_STS_RESOURCE.

Bart.

Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Mike Snitzer
On Thu, Jan 18 2018 at  3:11pm -0500,
Jens Axboe  wrote:

> On 1/18/18 11:47 AM, Bart Van Assche wrote:
> >> This is all very tiresome.
> > 
> > Yes, this is tiresome. It is very annoying to me that others keep
> > introducing so many regressions in such important parts of the kernel.
> > It is also annoying to me that I get blamed if I report a regression
> > instead of seeing that the regression gets fixed.
> 
> I agree, it sucks that any change there introduces the regression. I'm
> fine with doing the delay insert again until a new patch is proven to be
> better.
> 
> From the original topic of this email, we have conditions that can cause
> the driver to not be able to submit an IO. A set of those conditions can
> only happen if IO is in flight, and those cases we have covered just
> fine. Another set can potentially trigger without IO being in flight.
> These are cases where a non-device resource is unavailable at the time
> of submission. This might be iommu running out of space, for instance,
> or it might be a memory allocation of some sort. For these cases, we
> don't get any notification when the shortage clears. All we can do is
> ensure that we restart operations at some point in the future. We're SOL
> at that point, but we have to ensure that we make forward progress.
> 
> That last set of conditions better not be a common occurrence, since
> performance is down the toilet at that point. I don't want to introduce
> hot path code to rectify it. Have the driver return if that happens in a
> way that is DIFFERENT from needing a normal restart. The driver knows if
> this is a resource that will become available when IO completes on this
> device or not. If we get that return, we have a generic run-again delay.
> 
> This basically becomes the same as doing the delay queue thing from DM,
> but just in a generic fashion.

This is a bit confusing for me (as I see it we have 2 blk-mq drivers
trying to collaborate, so your referring to "driver" lacks precision; but
I could just be missing something)...

For Bart's test the underlying scsi-mq driver is what is regularly
hitting this case in __blk_mq_try_issue_directly():

if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))

It certainly better not be the norm (Bart's test hammering on this aside).

For starters, it'd be very useful to know if Bart is hitting the
blk_mq_hctx_stopped() or blk_queue_quiesced() for this case that is
triggering the use of blk_mq_sched_insert_request() -- I'd wager it is
due to blk_queue_quiesced() but Bart _please_ try to figure it out.

Anyway, in response to this case Bart would like the upper layer dm-mq
driver to blk_mq_delay_run_hw_queue().  Certainly is quite the hammer.

But that hammer aside, in general for this case, I'm concerned about: is
it really correct to allow an already stopped/quiesced underlying queue
to retain responsibility for processing the request?  Or would the
upper-layer dm-mq benefit from being able to retry the request on its
terms (via a "DIFFERENT" return from blk-mq core)?

Like this?  The (ab)use of BLK_STS_DM_REQUEUE certainly seems fitting in
this case but...

(Bart please note that this patch applies on linux-dm.git's 'for-next';
which is just a merge of Jens' 4.16 tree and dm-4.16)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 74a4f237ba91..371a1b97bf56 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1781,16 +1781,11 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
struct request_queue *q = rq->q;
bool run_queue = true;
 
-   /*
-* RCU or SRCU read lock is needed before checking quiesced flag.
-*
-* When queue is stopped or quiesced, ignore 'bypass_insert' from
-* blk_mq_request_direct_issue(), and return BLK_STS_OK to caller,
-* and avoid driver to try to dispatch again.
-*/
+   /* RCU or SRCU read lock is needed before checking quiesced flag */
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
run_queue = false;
-   bypass_insert = false;
+   if (bypass_insert)
+   return BLK_STS_DM_REQUEUE;
goto insert;
}
 
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index d8519ddd7e1a..2f554ea485c3 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -408,7 +408,7 @@ static blk_status_t dm_dispatch_clone_request(struct 
request *clone, struct requ
 
clone->start_time = jiffies;
r = blk_insert_cloned_request(clone->q, clone);
-   if (r != BLK_STS_OK && r != BLK_STS_RESOURCE)
+   if (r != BLK_STS_OK && r != BLK_STS_RESOURCE && r != BLK_STS_DM_REQUEUE)
/* must complete clone in terms of original request */
dm_complete_request(rq, r);
return r;
@@ -472,6 +472,7 @@ static void init_tio(struct dm_rq_target_io *tio, struct 
request *rq,
  * Returns:
  * DM_MAPIO_*   : the request has 

Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Jens Axboe
On 1/18/18 11:47 AM, Bart Van Assche wrote:
>> This is all very tiresome.
> 
> Yes, this is tiresome. It is very annoying to me that others keep
> introducing so many regressions in such important parts of the kernel.
> It is also annoying to me that I get blamed if I report a regression
> instead of seeing that the regression gets fixed.

I agree, it sucks that any change there introduces the regression. I'm
fine with doing the delay insert again until a new patch is proven to be
better.

From the original topic of this email, we have conditions that can cause
the driver to not be able to submit an IO. A set of those conditions can
only happen if IO is in flight, and those cases we have covered just
fine. Another set can potentially trigger without IO being in flight.
These are cases where a non-device resource is unavailable at the time
of submission. This might be iommu running out of space, for instance,
or it might be a memory allocation of some sort. For these cases, we
don't get any notification when the shortage clears. All we can do is
ensure that we restart operations at some point in the future. We're SOL
at that point, but we have to ensure that we make forward progress.

That last set of conditions better not be a common occurrence, since
performance is down the toilet at that point. I don't want to introduce
hot path code to rectify it. Have the driver return if that happens in a
way that is DIFFERENT from needing a normal restart. The driver knows if
this is a resource that will become available when IO completes on this
device or not. If we get that return, we have a generic run-again delay.

This basically becomes the same as doing the delay queue thing from DM,
but just in a generic fashion.

-- 
Jens Axboe



Re: [PATCH v3 0/2] block: enforce ioctl(BLKROSET) and set_disk_ro()

2018-01-18 Thread Jens Axboe
On 1/11/18 6:09 AM, Ilya Dryomov wrote:
> Hello,
> 
> I was doing some cleanup work on rbd BLKROSET handler and discovered
> that we ignore partition rw/ro setting (hd_struct->policy) for pretty
> much everything but straight writes.
> 
> David (CCed) has blktests patches standing by.
> 
> (Another aspect of this is that we don't enforce open(2) mode.  Tejun
> took a stab at this a few years ago, but his patch had to be reverted:
> 
>   75f1dc0d076d ("block: check bdev_read_only() from blkdev_get()")
>   e51900f7d38c ("block: revert block_dev read-only check")
> 
> It is a separate issue and refusing writes to read-only devices is
> obviously more important, but perhaps it's time to revisit that as
> well?)

Applied for 4.16, thanks.

-- 
Jens Axboe



Re: [RFC] blk-throttle: export io_serviced_recursive, io_service_bytes_recursive

2018-01-18 Thread Jens Axboe
On 12/11/17 7:56 AM, weiping zhang wrote:
> export these two interface for cgroup-v1.

Applied, thanks.

-- 
Jens Axboe



Re: [PATCH v2 0/3] Avoid that blk_{un,}register_queue() trigger lock inversion

2018-01-18 Thread Jens Axboe
On 1/17/18 12:48 PM, Bart Van Assche wrote:
> Hello Jens,
> 
> The three patches in this series are what I came up with after having
> analyzed a lockdep complaint triggered by blk_unregister_queue(). Please
> consider these patches for kernel v4.16.

Thanks Bart, applied.

-- 
Jens Axboe



Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Bart Van Assche
On Thu, 2018-01-18 at 13:30 -0500, Mike Snitzer wrote:
> 1%!?  Where are you getting that number?  Ming has detailed more
> significant performance gains than 1%.. and not just on lpfc (though you
> keep seizing on lpfc because of the low queue_depth of 3).

That's what I derived from the numbers you posted for null_blk. If Ming has
posted performance results for other drivers than lpfc, please let me know
where I can find these. I have not yet seen these numbers.

> This is all very tiresome.

Yes, this is tiresome. It is very annoying to me that others keep introducing
so many regressions in such important parts of the kernel. It is also annoying
to me that I get blamed if I report a regression instead of seeing that the
regression gets fixed.

Bart.

Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Mike Snitzer
On Thu, Jan 18 2018 at 12:20pm -0500,
Bart Van Assche  wrote:

> On Thu, 2018-01-18 at 12:03 -0500, Mike Snitzer wrote:
> > On Thu, Jan 18 2018 at 11:50am -0500,
> > Bart Van Assche  wrote:
> > > My comments about the above are as follows:
> > > - It can take up to q->rq_timeout jiffies after a .queue_rq()
> > >   implementation returned BLK_STS_RESOURCE before blk_mq_timeout_work()
> > >   gets called. However, it can happen that only a few milliseconds after
> > >   .queue_rq() returned BLK_STS_RESOURCE that the condition that caused
> > >   it to return BLK_STS_RESOURCE gets cleared. So the above approach can
> > >   result in long delays during which it will seem like the queue got
> > >   stuck. Additionally, I think that the block driver should decide how
> > >   long it takes before a queue is rerun and not the block layer core.
> > 
> > So configure q->rq_timeout to be shorter?  Which is configurable though
> > blk_mq_tag_set's 'timeout' member.  It apparently defaults to 30 * HZ.
> > 
> > That is the problem with timeouts, there is generally no one size fits
> > all.
> 
> Sorry but I think that would be wrong. The delay after which a queue is rerun
> should not be coupled to the request timeout. These two should be independent.

That's fair.  Not saying I think that is a fix anyway.

> > > - The lockup that I reported only occurs with the dm driver but not any
> > >   other block driver. So why to modify the block layer core since this
> > >   can be fixed by modifying the dm driver?
> > 
> > Hard to know it is only DM's blk-mq that is impacted.  That is the only
> > blk-mq driver that you're testing like this (that is also able to handle
> > faults, etc).
> 
> That's not correct. I'm also testing the SCSI core, which is one of the most
> complicated block drivers.

OK, but SCSI mq is part of the problem here.  It is a snowflake that
has more exotic reasons for returning BLK_STS_RESOURCE.

> > > - A much simpler fix and a fix that is known to work exists, namely
> > >   inserting a blk_mq_delay_run_hw_queue() call in the dm driver.
> > 
> > Because your "much simpler" fix actively hurts performance, as is
> > detailed in this header:
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=ec3eaf9a673106f66606896aed6ddd20180b02ec
> 
> We are close to the start of the merge window so I think it's better to fall
> back to an old approach that is known to work than to keep a new approach
> that is known not to work. Additionally, the performance issue you referred
> to only affects IOPS and bandwidth by more than 1% with the lpfc driver and that
> is because the queue depth it supports is much lower than for other SCSI HBAs,
> namely 3 instead of 64.

1%!?  Where are you getting that number?  Ming has detailed more
significant performance gains than 1%.. and not just on lpfc (though you
keep seizing on lpfc because of the low queue_depth of 3).

This is all very tiresome.  I'm _really_ not interested in this debate
any more.  The specific case that causes the stall need to be identified
and a real fix needs to be developed.  Ming is doing a lot of that hard
work.  Please contribute or at least stop pleading for your hack to be
reintroduced.

If at the end of the 4.16 release we still don't have a handle on the
stall you're seeing I'll revisit this and likely revert to blindly
kicking the queue after an arbitrary delay.  But I'm willing to let this
issue get more time without papering over it.

Mike


[GIT PULL] Two NVMe fixes for 4.15 final

2018-01-18 Thread Jens Axboe
Hi Linus,

Two important fixes for the sgl support for nvme that is new in this
series. Please pull.


  git://git.kernel.dk/linux-block.git for-linus



Christoph Hellwig (1):
  nvme-pci: take sglist coalescing in dma_map_sg into account

Keith Busch (1):
  nvme-pci: check segement valid for SGL use

 drivers/nvme/host/pci.c | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

-- 
Jens Axboe



Re: [PATCH v5] Return bytes transferred for partial direct I/O

2018-01-18 Thread Darrick J. Wong
On Fri, Jan 05, 2018 at 06:15:55AM -0600, Goldwyn Rodrigues wrote:
> 
> 
> On 01/04/2018 08:10 PM, Darrick J. Wong wrote:
> > On Wed, Nov 22, 2017 at 06:29:01AM -0600, Goldwyn Rodrigues wrote:
> >> From: Goldwyn Rodrigues 
> >>
> >> In case direct I/O encounters an error midway, it returns the error.
> >> Instead it should be returning the number of bytes transferred so far.
> >>
> >> Test case for filesystems (with ENOSPC):
> >> 1. Create an almost full filesystem
> >> 2. Create a file, say /mnt/lastfile, until the filesystem is full.
> >> 3. Direct write() with count > sizeof /mnt/lastfile.
> >>
> >> Result: write() returns -ENOSPC. However, file content has data written
> >> in step 3.
> >>
> >> Changes since v1:
> >>  - incorporated iomap and block devices
> >>
> >> Changes since v2:
> >>  - realized that file size was not increasing when performing a (partial)
> >>direct I/O because end_io function was receiving the error instead of
> >>size. Fixed.
> >>
> >> Changes since v3:
> >>  - [hch] initialize transferred with dio->size and use transferred instead
> >>of dio->size.
> >>
> >> Changes since v4:
> >>  - Refreshed to v4.14
> >>
> >> Signed-off-by: Goldwyn Rodrigues 
> >> Reviewed-by: Christoph Hellwig 
> > 
> > Hmmm, I shoved this into 4.15-rc6 and saw this horrible splat running
> > generic/472 (that is the test that goes with this patch, right?) on XFS:
> 
> Yes, I will add it to the patch header.
> 
> > 
> > [ 2545.534331] XFS: Assertion failed: ret < 0 || ret == count, file: 
> > /storage/home/djwong/cdev/work/linux-xfs/fs/xfs/xfs_file.c, line: 598
> > [ 2545.538177] WARNING: CPU: 2 PID: 16340 at 
> > /storage/home/djwong/cdev/work/linux-xfs/fs/xfs/xfs_message.c:116 
> > assfail+0x58/0x90 [xfs]
> > [ 2545.539748] Modules linked in: xfs libcrc32c dax_pmem device_dax nd_pmem 
> > sch_fq_codel af_packet [last unloaded: xfs]
> > [ 2545.541191] CPU: 2 PID: 16340 Comm: xfs_io Not tainted 4.15.0-rc6-xfsx #5
> > [ 2545.542119] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> > 1.10.2-1ubuntu1djwong0 04/01/2014
> > [ 2545.543414] RIP: 0010:assfail+0x58/0x90 [xfs]
> > [ 2545.544032] RSP: 0018:88002b56f990 EFLAGS: 00010246
> > [ 2545.544760] RAX:  RBX:  RCX: 
> > 
> > [ 2545.545721] RDX: 0004 RSI: 000a RDI: 
> > a0a1eac4
> > [ 2545.546681] RBP: 88002b56fdd0 R08: 88002b56f6e0 R09: 
> > 
> > [ 2545.547646] R10: 11000fe14f6e R11:  R12: 
> > 1100056adf3a
> > [ 2545.548613] R13: 0040 R14: 88002b56fc38 R15: 
> > 
> > [ 2545.549579] FS:  7fc1f601e700() GS:88007f00() 
> > knlGS:
> > [ 2545.550666] CS:  0010 DS:  ES:  CR0: 80050033
> > [ 2545.551460] CR2: 7fc1f4bac000 CR3: 75165001 CR4: 
> > 001606e0
> > [ 2545.552423] Call Trace:
> > [ 2545.552850]  xfs_file_dio_aio_write+0x6e3/0xe40 [xfs]
> > [ 2545.553581]  ? kvm_clock_read+0x1f/0x30
> > [ 2545.554191]  ? xfs_file_dax_write+0x6a0/0x6a0 [xfs]
> > [ 2545.554885]  ? kvm_clock_read+0x1f/0x30
> > [ 2545.555439]  ? kvm_sched_clock_read+0x5/0x10
> > [ 2545.556046]  ? sched_clock+0x5/0x10
> > [ 2545.556553]  ? sched_clock_cpu+0x18/0x1e0
> > [ 2545.557136]  ? __lock_acquire+0xbbf/0x40f0
> > [ 2545.557719]  ? kvm_clock_read+0x1f/0x30
> > [ 2545.558272]  ? sched_clock+0x5/0x10
> > [ 2545.558848]  xfs_file_write_iter+0x34a/0xb50 [xfs]
> > [ 2545.559544]  do_iter_readv_writev+0x44c/0x700
> > [ 2545.560170]  ? _copy_from_user+0x96/0xd0
> > [ 2545.560729]  ? vfs_dedupe_file_range+0x820/0x820
> > [ 2545.561398]  do_iter_write+0x12c/0x550
> > [ 2545.561939]  ? rcu_lockdep_current_cpu_online+0xd7/0x120
> > [ 2545.562682]  ? rcu_read_lock_sched_held+0xa3/0x120
> > [ 2545.563366]  vfs_writev+0x175/0x2d0
> > [ 2545.563874]  ? vfs_iter_write+0xc0/0xc0
> > [ 2545.564434]  ? get_lock_stats+0x16/0x90
> > [ 2545.564992]  ? lock_downgrade+0x580/0x580
> > [ 2545.565583]  ? __fget+0x1e7/0x350
> > [ 2545.566077]  ? entry_SYSCALL_64_fastpath+0x5/0x96
> > [ 2545.566744]  ? do_pwritev+0x125/0x160
> > [ 2545.567277]  do_pwritev+0x125/0x160
> > [ 2545.567787]  entry_SYSCALL_64_fastpath+0x1f/0x96
> > [ 2545.568437] RIP: 0033:0x7fc1f56d6110
> > [ 2545.568948] RSP: 002b:7ffc07f11340 EFLAGS: 0293 ORIG_RAX: 
> > 0128
> > [ 2545.569971] RAX: ffda RBX: 01ef9170 RCX: 
> > 7fc1f56d6110
> > [ 2545.570932] RDX: 0001 RSI: 01ef9170 RDI: 
> > 0003
> > [ 2545.571897] RBP:  R08:  R09: 
> > 0040
> > [ 2545.572857] R10:  R11: 0293 R12: 
> > 
> > [ 2545.573818] R13:  R14: 0040 R15: 
> > 01ef9170
> > [ 2545.574802] Code: df 48 89 fa 48 c1 ea 03 0f b6 04 02 48 89 fa 83 e2 07 
> > 38 d0 7f 04 84 c0 75 15 0f b6 1d 

Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Bart Van Assche
On Thu, 2018-01-18 at 12:03 -0500, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at 11:50am -0500,
> Bart Van Assche  wrote:
> > My comments about the above are as follows:
> > - It can take up to q->rq_timeout jiffies after a .queue_rq()
> >   implementation returned BLK_STS_RESOURCE before blk_mq_timeout_work()
> >   gets called. However, it can happen that only a few milliseconds after
> >   .queue_rq() returned BLK_STS_RESOURCE that the condition that caused
> >   it to return BLK_STS_RESOURCE gets cleared. So the above approach can
> >   result in long delays during which it will seem like the queue got
> >   stuck. Additionally, I think that the block driver should decide how
> >   long it takes before a queue is rerun and not the block layer core.
> 
> So configure q->rq_timeout to be shorter?  Which is configurable though
> blk_mq_tag_set's 'timeout' member.  It apparently defaults to 30 * HZ.
> 
> That is the problem with timeouts, there is generally no one size fits
> all.

Sorry but I think that would be wrong. The delay after which a queue is rerun
should not be coupled to the request timeout. These two should be independent.

> > - The lockup that I reported only occurs with the dm driver but not any
> >   other block driver. So why to modify the block layer core since this
> >   can be fixed by modifying the dm driver?
> 
> Hard to know it is only DM's blk-mq that is impacted.  That is the only
> blk-mq driver that you're testing like this (that is also able to handle
> faults, etc).

That's not correct. I'm also testing the SCSI core, which is one of the most
complicated block drivers.

> > - A much simpler fix and a fix that is known to work exists, namely
> >   inserting a blk_mq_delay_run_hw_queue() call in the dm driver.
> 
> Because your "much simpler" fix actively hurts performance, as is
> detailed in this header:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=ec3eaf9a673106f66606896aed6ddd20180b02ec

We are close to the start of the merge window, so I think it's better to fall
back to an old approach that is known to work than to keep a new approach
that is known not to work. Additionally, the performance issue you referred
to only affects IOPS and bandwidth by more than 1% with the lpfc driver, and
that is because the queue depth it supports is much lower than that of other
SCSI HBAs, namely 3 instead of 64.

Thanks,

Bart.

Re: dm rq: Avoid that request processing stalls sporadically

2018-01-18 Thread Bart Van Assche
On Thu, 2018-01-18 at 11:50 -0500, Mike Snitzer wrote:
> The issue you say this patch was originally intended to fix _should_ be
> addressed with this change:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=4dd6edd23e7ea971efddc303f9e67eb79e95808e

Hello Mike,

Sorry but I'm not convinced that that patch is sufficient. That patch helps
if .end_io() is called with status BLK_STS_RESOURCE and also if
blk_insert_cloned_request() returns the .queue_rq() return value. It does not
help if .queue_rq() returns BLK_STS_RESOURCE and that return value gets
ignored. I think that can happen as follows:
- Request cloning in multipath_clone_and_map() succeeds and that function
  returns DM_MAPIO_REMAPPED.
- dm_dispatch_clone_request() calls blk_insert_cloned_request().
- blk_insert_cloned_request() calls blk_mq_request_direct_issue(), which
  results in a call of __blk_mq_try_issue_directly().
- __blk_mq_try_issue_directly() calls blk_mq_sched_insert_request(). In this
  case the BLK_STS_RESOURCE returned by the .queue_rq() implementation of the
  underlying path will be ignored.

Please note that I have not tried to find all paths that ignore the
.queue_rq() return value.

Thanks,

Bart.

Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Mike Snitzer
On Thu, Jan 18 2018 at 11:50am -0500,
Bart Van Assche  wrote:

> On 01/17/18 18:41, Ming Lei wrote:
> >BLK_STS_RESOURCE can be returned from a driver when any resource runs
> >out. The resource may not be related to tags, e.g. kmalloc(GFP_ATOMIC).
> >When the queue is idle under this kind of BLK_STS_RESOURCE, restart
> >can't work any more, and an I/O hang may be caused.
> >
> >Most drivers may call kmalloc(GFP_ATOMIC) in the I/O path, and almost
> >all return BLK_STS_RESOURCE in this situation. For dm-mpath it may be
> >triggered a bit more easily, since the request pool of the underlying
> >queue can be consumed much more quickly. In reality it is still not easy
> >to trigger: I ran all kinds of tests on dm-mpath/scsi-debug with all
> >kinds of scsi_debug parameters and couldn't trigger this issue at all.
> >But it was finally triggered by Bart's SRP test, which seems made by a
> >genius, :-)
> >
> >[ ... ]
> >
> >  static void blk_mq_timeout_work(struct work_struct *work)
> >  {
> > struct request_queue *q =
> >@@ -966,8 +1045,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
> >  */
> > queue_for_each_hw_ctx(q, hctx, i) {
> > /* the hctx may be unmapped, so check it here */
> >-if (blk_mq_hw_queue_mapped(hctx))
> >+if (blk_mq_hw_queue_mapped(hctx)) {
> > blk_mq_tag_idle(hctx);
> >+blk_mq_fixup_restart(hctx);
> >+}
> > }
> > }
> > blk_queue_exit(q);
> 
> Hello Ming,
> 
> My comments about the above are as follows:
> - It can take up to q->rq_timeout jiffies after a .queue_rq()
>   implementation returned BLK_STS_RESOURCE before blk_mq_timeout_work()
>   gets called. However, the condition that caused .queue_rq() to return
>   BLK_STS_RESOURCE may be cleared only a few milliseconds later. So the
>   above approach can result in long delays during which it will seem
>   like the queue got stuck. Additionally, I think that the block driver,
>   and not the block layer core, should decide how long it takes before
>   a queue is rerun.

So configure q->rq_timeout to be shorter?  It is configurable through
blk_mq_tag_set's 'timeout' member.  It apparently defaults to 30 * HZ.
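
For reference, that member is set when a driver initializes its tag set.
A minimal sketch of the 4.15-era API (my_mq_ops is a hypothetical driver's
ops table, and error handling is omitted):

```c
/* Sketch only: how a blk-mq driver chooses its request timeout at
 * tag-set initialization time. */
struct blk_mq_tag_set set = {
	.ops		= &my_mq_ops,	/* hypothetical driver ops */
	.nr_hw_queues	= 1,
	.queue_depth	= 64,
	.timeout	= 5 * HZ,	/* instead of the 30 * HZ default */
};
```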

That is the problem with timeouts, there is generally no one size fits
all.

> - The lockup that I reported only occurs with the dm driver and not with
>   any other block driver. So why modify the block layer core when this
>   can be fixed by modifying the dm driver?

It's hard to know whether only DM's blk-mq is impacted.  That is the only
blk-mq driver that you're testing like this (and the only one that is also
able to handle faults, etc.).

> - A much simpler fix and a fix that is known to work exists, namely
>   inserting a blk_mq_delay_run_hw_queue() call in the dm driver.

Because your "much simpler" fix actively hurts performance, as is
detailed in this header:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=ec3eaf9a673106f66606896aed6ddd20180b02ec

I'm not going to take your band-aid fix, given that it very much seems to be
papering over a real blk-mq issue.

Mike


Re: dm rq: Avoid that request processing stalls sporadically

2018-01-18 Thread Mike Snitzer
On Thu, Jan 18 2018 at 11:37am -0500,
Bart Van Assche  wrote:

> If the .queue_rq() implementation of a block driver returns
> BLK_STS_RESOURCE then that block driver is responsible for
> rerunning the queue once the condition that caused it to return
> BLK_STS_RESOURCE has been cleared. The dm-mpath driver tells the
> dm core to requeue a request if e.g. not enough memory is
> available for cloning a request or if the underlying path is
> busy. Since the dm-mpath driver does not receive any kind of
> notification when the condition that caused it to return "requeue"
> is cleared, the only way to avoid stalls in dm-mpath request
> processing is to call blk_mq_delay_run_hw_queue(). Hence
> this patch.
> 
> Fixes: ec3eaf9a6731 ("dm mpath: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE")
> Signed-off-by: Bart Van Assche 
> Cc: Ming Lei 
> ---
>  drivers/md/dm-rq.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index f16096af879a..c59c59cfd2a5 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
>   /* Undo dm_start_request() before requeuing */
>   rq_end_stats(md, rq);
>   rq_completed(md, rq_data_dir(rq), false);
> + blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
>   return BLK_STS_RESOURCE;
>   }
>  
> -- 
> 2.15.1
> 

Sorry but we need to understand why you still need this.

The issue you say this patch was originally intended to fix _should_ be
addressed with this change:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=4dd6edd23e7ea971efddc303f9e67eb79e95808e

So if you still feel you need to blindly kick the queue, then it is very
likely a bug in blk-mq (either in its RESTART heuristics or in whatever
internal implementation detail is lacking).

Did you try Ming's RFC patch to "fixup RESTART" before resorting to the
above again? See: https://patchwork.kernel.org/patch/10172315/

Thanks,
Mike


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Bart Van Assche

On 01/17/18 18:41, Ming Lei wrote:

BLK_STS_RESOURCE can be returned from a driver when any resource runs
out. The resource may not be related to tags, e.g. kmalloc(GFP_ATOMIC).
When the queue is idle under this kind of BLK_STS_RESOURCE, restart
can't work any more, and an I/O hang may be caused.

Most drivers may call kmalloc(GFP_ATOMIC) in the I/O path, and almost
all return BLK_STS_RESOURCE in this situation. For dm-mpath it may be
triggered a bit more easily, since the request pool of the underlying
queue can be consumed much more quickly. In reality it is still not easy
to trigger: I ran all kinds of tests on dm-mpath/scsi-debug with all
kinds of scsi_debug parameters and couldn't trigger this issue at all.
But it was finally triggered by Bart's SRP test, which seems made by a
genius, :-)

[ ... ]


  static void blk_mq_timeout_work(struct work_struct *work)
  {
struct request_queue *q =
@@ -966,8 +1045,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
 */
queue_for_each_hw_ctx(q, hctx, i) {
/* the hctx may be unmapped, so check it here */
-   if (blk_mq_hw_queue_mapped(hctx))
+   if (blk_mq_hw_queue_mapped(hctx)) {
blk_mq_tag_idle(hctx);
+   blk_mq_fixup_restart(hctx);
+   }
}
}
blk_queue_exit(q);


Hello Ming,

My comments about the above are as follows:
- It can take up to q->rq_timeout jiffies after a .queue_rq()
  implementation returned BLK_STS_RESOURCE before blk_mq_timeout_work()
  gets called. However, the condition that caused .queue_rq() to return
  BLK_STS_RESOURCE may be cleared only a few milliseconds later. So the
  above approach can result in long delays during which it will seem
  like the queue got stuck. Additionally, I think that the block driver,
  and not the block layer core, should decide how long it takes before
  a queue is rerun.
- The lockup that I reported only occurs with the dm driver and not with
  any other block driver. So why modify the block layer core when this
  can be fixed by modifying the dm driver?
- A much simpler fix and a fix that is known to work exists, namely
  inserting a blk_mq_delay_run_hw_queue() call in the dm driver.

Bart.


[PATCH] dm rq: Avoid that request processing stalls sporadically

2018-01-18 Thread Bart Van Assche
If the .queue_rq() implementation of a block driver returns
BLK_STS_RESOURCE then that block driver is responsible for
rerunning the queue once the condition that caused it to return
BLK_STS_RESOURCE has been cleared. The dm-mpath driver tells the
dm core to requeue a request if e.g. not enough memory is
available for cloning a request or if the underlying path is
busy. Since the dm-mpath driver does not receive any kind of
notification when the condition that caused it to return "requeue"
is cleared, the only way to avoid stalls in dm-mpath request
processing is to call blk_mq_delay_run_hw_queue(). Hence
this patch.

Fixes: ec3eaf9a6731 ("dm mpath: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE")
Signed-off-by: Bart Van Assche 
Cc: Ming Lei 
---
 drivers/md/dm-rq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index f16096af879a..c59c59cfd2a5 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
/* Undo dm_start_request() before requeuing */
rq_end_stats(md, rq);
rq_completed(md, rq_data_dir(rq), false);
+   blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
return BLK_STS_RESOURCE;
}
 
-- 
2.15.1



Re: [PATCH v2 3/3] block: Protect less code with sysfs_lock in blk_{un,}register_queue()

2018-01-18 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH v2 2/3] block: Document scheduler modification locking requirements

2018-01-18 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH v2 1/3] block: Unexport elv_register_queue() and elv_unregister_queue()

2018-01-18 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [RFC] blk-throttle: export io_serviced_recursive, io_service_bytes_recursive

2018-01-18 Thread weiping zhang
2017-12-11 23:17 GMT+08:00 Tejun Heo :
> On Mon, Dec 11, 2017 at 10:56:25PM +0800, weiping zhang wrote:
>> export these two interface for cgroup-v1.
>>
>> Signed-off-by: weiping zhang 
>
> Acked-by: Tejun Heo 
>

Hi Jens,

Have you had time to check this patch?

Thanks

> Thanks.
>
> --
> tejun


Re: [PATCH BUGFIX/IMPROVEMENT 0/2] block, bfq: two pending patches

2018-01-18 Thread Jens Axboe
On 1/13/18 4:05 AM, Paolo Valente wrote:
> Hi Jens,
> here are again the two pending patches you asked me to resend [1]. One
> of them, fixing read-starvation problems, was accompanied by a cover
> letter. I'm pasting the content of that cover letter below.
> 
> The patch addresses (serious) starvation problems caused by
> request-tag exhaustion, as explained in more detail in the commit
> message. I started from the solution in the function
> kyber_limit_depth, but then I had to define more elaborate limits, to
> counter starvation also in cases not covered in kyber_limit_depth.
> If this solution proves to be effective, I'm willing to port it
> somehow to the other schedulers.

It's something we've been doing in the old request layer for tagging for
a long time (more than a decade), so a generic (and fast) solution that
covers all cases for blk-mq-tag would indeed be great.

For now, I have applied these for 4.16, thanks.

-- 
Jens Axboe



Re: [PATCH v3 0/2] block: enforce ioctl(BLKROSET) and set_disk_ro()

2018-01-18 Thread Ilya Dryomov
On Thu, Jan 11, 2018 at 2:09 PM, Ilya Dryomov  wrote:
> Hello,
>
> I was doing some cleanup work on rbd BLKROSET handler and discovered
> that we ignore partition rw/ro setting (hd_struct->policy) for pretty
> much everything but straight writes.
>
> David (CCed) has blktests patches standing by.
>
> (Another aspect of this is that we don't enforce open(2) mode.  Tejun
> took a stab at this a few years ago, but his patch had to be reverted:
>
>   75f1dc0d076d ("block: check bdev_read_only() from blkdev_get()")
>   e51900f7d38c ("block: revert block_dev read-only check")
>
> It is a separate issue and refusing writes to read-only devices is
> obviously more important, but perhaps it's time to revisit that as
> well?)
>
> v2 -> v3:
> - lookup part only once; combine read-only check with existing
>   should_fail_request check
>
> v1 -> v2:
> - added unlikely() per Sagi's suggestion
>
> Thanks,
>
> Ilya
>
>
> Ilya Dryomov (2):
>   block: fail op_is_write() requests to read-only partitions
>   block: add bdev_read_only() checks to common helpers
>
>  block/blk-core.c | 56 ++--
>  block/blk-lib.c  | 12 
>  2 files changed, 50 insertions(+), 18 deletions(-)

Jens, could you please pick these up for 4.16?

Thanks,

Ilya


Re: [PATCH BUGFIX/IMPROVEMENT 0/2] block, bfq: two pending patches

2018-01-18 Thread Paolo Valente


> Il giorno 13 gen 2018, alle ore 12:05, Paolo Valente 
>  ha scritto:
> 
> Hi Jens,
> here are again the two pending patches you asked me to resend [1]. One
> of them, fixing read-starvation problems, was accompanied by a cover
> letter. I'm pasting the content of that cover letter below.
> 
> The patch addresses (serious) starvation problems caused by
> request-tag exhaustion, as explained in more detail in the commit
> message. I started from the solution in the function
> kyber_limit_depth, but then I had to define more elaborate limits, to
> counter starvation also in cases not covered in kyber_limit_depth.
> If this solution proves to be effective, I'm willing to port it
> somehow to the other schedulers.
> 

Hi Jens,
have you had time to check these patches?  Sorry for pushing, but I
guess 4.16 is getting closer, and these patches are performance
critical; especially the first, which solves a starvation problem.

Thanks,
Paolo

> Thanks,
> Paolo
> 
> [1] https://www.spinics.net/lists/linux-block/msg21586.html
> 
> Paolo Valente (2):
>  block, bfq: limit tags for writes and async I/O
>  block, bfq: limit sectors served with interactive weight raising
> 
> block/bfq-iosched.c | 158 +---
> block/bfq-iosched.h |  17 ++
> block/bfq-wf2q.c|   3 +
> 3 files changed, 169 insertions(+), 9 deletions(-)
> 
> --
> 2.15.1