Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Fri, Jan 19, 2018 at 05:09:46AM +0000, Bart Van Assche wrote:
> On Fri, 2018-01-19 at 10:32 +0800, Ming Lei wrote:
> > Now most of times both NVMe and SCSI won't return BLK_STS_RESOURCE, and
> > it should be DM-only which returns STS_RESOURCE so often.
>
> That's wrong at least for SCSI. See also
> https://marc.info/?l=linux-block&m=151578329417076.
>
> For other scenarios, e.g. if a SCSI initiator submits a
> SCSI request over a fabric and the SCSI target replies with "BUSY" then the

Could you explain a bit when the SCSI target replies with BUSY very often? Inside the initiator, we have already limited the max per-LUN requests and per-host requests before calling .queue_rq().

> SCSI core will end the I/O request with status BLK_STS_RESOURCE after the
> maximum number of retries has been reached (see also scsi_io_completion()).
> In that last case, if a SCSI target sends a "BUSY" reply over the wire back
> to the initiator, there is no other approach for the SCSI initiator to
> figure out whether it can queue another request than to resubmit the
> request. The worst possible strategy is to resubmit a request immediately
> because that will cause a significant fraction of the fabric bandwidth to
> be used just for replying "BUSY" to requests that can't be processed
> immediately.

-- Ming
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Thu, Jan 18, 2018 at 09:02:45PM -0700, Jens Axboe wrote: > On 1/18/18 7:32 PM, Ming Lei wrote: > > On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote: > >> On 1/18/18 11:47 AM, Bart Van Assche wrote: > This is all very tiresome. > >>> > >>> Yes, this is tiresome. It is very annoying to me that others keep > >>> introducing so many regressions in such important parts of the kernel. > >>> It is also annoying to me that I get blamed if I report a regression > >>> instead of seeing that the regression gets fixed. > >> > >> I agree, it sucks that any change there introduces the regression. I'm > >> fine with doing the delay insert again until a new patch is proven to be > >> better. > > > > That way is still buggy as I explained, since rerun queue before adding > > request to hctx->dispatch_list isn't correct. Who can make sure the request > > is visible when __blk_mq_run_hw_queue() is called? > > That race basically doesn't exist for a 10ms gap. > > > Not mention this way will cause performance regression again. > > How so? It's _exactly_ the same as what you are proposing, except mine > will potentially run the queue when it need not do so. But given that > these are random 10ms queue kicks because we are screwed, it should not > matter. The key point is that it only should be if we have NO better > options. If it's a frequently occurring event that we have to return > BLK_STS_RESOURCE, then we need to get a way to register an event for > when that condition clears. That event will then kick the necessary > queue(s). Please see queue_delayed_work_on(), hctx->run_work is shared by all scheduling, once blk_mq_delay_run_hw_queue(100ms) returns, no new scheduling can make progress during the 100ms. > > >> From the original topic of this email, we have conditions that can cause > >> the driver to not be able to submit an IO. A set of those conditions can > >> only happen if IO is in flight, and those cases we have covered just > >> fine. 
Another set can potentially trigger without IO being in flight. > >> These are cases where a non-device resource is unavailable at the time > >> of submission. This might be iommu running out of space, for instance, > >> or it might be a memory allocation of some sort. For these cases, we > >> don't get any notification when the shortage clears. All we can do is > >> ensure that we restart operations at some point in the future. We're SOL > >> at that point, but we have to ensure that we make forward progress. > > > > Right, it is a generic issue, not DM-specific one, almost all drivers > > call kmalloc(GFP_ATOMIC) in IO path. > > GFP_ATOMIC basically never fails, unless we are out of memory. The I guess GFP_KERNEL may never fail, but GFP_ATOMIC failure might be possible, and it is mentioned[1] there is such code in the mm allocation path, also OOM can happen too.

	if ((some randomly generated condition) && (request is atomic))
		return NULL;

[1] https://lwn.net/Articles/276731/

> exception is higher order allocations. If a driver has a higher order > atomic allocation in its IO path, the device driver writer needs to be > taken out behind the barn and shot. Simple as that. It will NEVER work > well in a production environment. Witness the disaster that so many NIC > driver writers have learned. > > This is NOT the case we care about here. It's resources that are more > readily depleted because other devices are using them. If it's a high > frequency or generally occurring event, then we simply must have a > callback to restart the queue from that. The condition then becomes > identical to device private starvation, the only difference being from > where we restart the queue. > > > IMO, there is enough time for figuring out a generic solution before > > 4.16 release. > > I would hope so, but the proposed solutions have not filled me with > a lot of confidence in the end result so far.
> > >> That last set of conditions better not be a common occurrence, since > >> performance is down the toilet at that point. I don't want to introduce > >> hot path code to rectify it. Have the driver return if that happens in a > >> way that is DIFFERENT from needing a normal restart. The driver knows if > >> this is a resource that will become available when IO completes on this > >> device or not. If we get that return, we have a generic run-again delay. > > > > Now most of times both NVMe and SCSI won't return BLK_STS_RESOURCE, and > > it should be DM-only which returns STS_RESOURCE so often. > > Where does the dm STS_RESOURCE error usually come from - what exact > resource are we running out of? It is from blk_get_request(underlying queue), see multipath_clone_and_map(). Thanks, Ming
Re: [PATCH] dm rq: Avoid that request processing stalls sporadically
On Thu, Jan 18, 2018 at 05:49:10PM -0700, Jens Axboe wrote: > On 1/18/18 5:35 PM, Ming Lei wrote: > > On Thu, Jan 18, 2018 at 05:24:29PM -0700, Jens Axboe wrote: > >> On 1/18/18 5:18 PM, Ming Lei wrote: > >>> On Fri, Jan 19, 2018 at 12:14:24AM +, Bart Van Assche wrote: > On Fri, 2018-01-19 at 08:11 +0800, Ming Lei wrote: > > On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote: > >> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c > >> index f16096af879a..c59c59cfd2a5 100644 > >> --- a/drivers/md/dm-rq.c > >> +++ b/drivers/md/dm-rq.c > >> @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct > >> blk_mq_hw_ctx *hctx, > >>/* Undo dm_start_request() before requeuing */ > >>rq_end_stats(md, rq); > >>rq_completed(md, rq_data_dir(rq), false); > >> + blk_mq_delay_run_hw_queue(hctx, 100/*ms*/); > >>return BLK_STS_RESOURCE; > >>} > >> > > > > Nak. > > This patch fixes a regression that was introduced by you. You should know > that regressions are not acceptable. If you don't agree with this patch, > please fix the root cause. > >>> > >>> Yesterday I sent a patch, did you test that? > >> > >> That patch isn't going to be of much help. It might prevent you from > >> completely stalling, but it might take you tens of seconds to get there. > > > > Yeah, that is true, and why I marked it as RFC, but the case is so rare to > > happen. > > You don't know that. If the resource is very limited, or someone else > is gobbling up all of it, then it may trigger quite often. Granted, > in that case, you need some way of signaling the freeing of those > resources to make it efficient. > > >> On top of that, it's a rolling timer that gets added when IO is queued, > >> and re-added if IO is pending when it fires. If the latter case is not > >> true, then it won't get armed again. 
So if IO fails to queue without > >> any other IO pending, you're pretty much in the same situation as if > > > > No IO pending detection may take a bit time, we can do it after > > BLK_STS_RESOURCE is returned and SCHED_RESTART is set. I have done > > this way in the following patch, what do you think of it? > > > > https://github.com/ming1/linux/commit/f866ce0b97b0ae22d033881da7eb07706fd458b4 > > I think it's overly complicated. As I wrote in a lengthier email earlier > today, just ensure that we have a unique return code from the driver for > when it aborts queuing an IO due to a resource shortage that isn't tied > to its own consumption of it. When we get that return, do a delayed > queue run after X amount of time to ensure we always try again. If you > want to get fancy (later on), you could track the frequency of such > returns and complain if it's too high. Suppose the newly introduced return code is BLK_STS_NO_DEV_RESOURCE. This way may degrade performance, for example, DM-MPATH returns BLK_STS_NO_DEV_RESOURCE when blk_get_request() returns NULL, blk-mq handles it by blk_mq_delay_run_hw_queue(10ms). Then blk_mq_sched_restart() is just triggered when one DM-MPATH request is completed, that means one request of the underlying queue is completed too, but blk_mq_sched_restart() still can't make progress during the 10ms. That means we should only run blk_mq_delay_run_hw_queue(10ms/100ms) when the queue is idle. I will post the patch on github for discussion. Thanks, Ming
Re: [PATCH v5 1/2] Return bytes transferred for partial direct I/O
On Thu, Jan 18, 2018 at 10:31:18PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 19, 2018 at 02:59:51PM +1100, Dave Chinner wrote:
> > On Fri, Jan 19, 2018 at 02:13:53AM +0000, Al Viro wrote:
> > > On Thu, Jan 18, 2018 at 06:57:40PM -0600, Goldwyn Rodrigues wrote:
> > > > From: Goldwyn Rodrigues
> > > >
> > > > In case direct I/O encounters an error midway, it returns the error.
> > > > Instead it should be returning the number of bytes transferred so far.
> > > >
> > > > Test case for filesystems (with ENOSPC):
> > > > 1. Create an almost full filesystem
> > > > 2. Create a file, say /mnt/lastfile, until the filesystem is full.
> > > > 3. Direct write() with count > sizeof /mnt/lastfile.
> > > >
> > > > Result: write() returns -ENOSPC. However, file content has data written
> > > > in step 3.
> > > >
> > > > This fixes fstest generic/472.
> > >
> > > OK... I can live with that. What about the XFS side? It should be
> > > a prereq, to avoid bisection hazard; I can throw both into vfs.git,
> > > if XFS folks are OK with that. Objections?
> >
> > Going through the VFS tree seems the best approach to me - it's a
> > trivial change. I'm sure Darrick will shout if it's going to be a
> > problem, though.
>
> vfs.git is fine, though the second patch to remove the xfs assert should
> go first, as Al points out.
>
> For both patches,
> Reviewed-by: Darrick J. Wong

Applied; will be in -next tomorrow morning after the tree I'm putting together gets through local beating.
Re: [PATCH v5 1/2] Return bytes transferred for partial direct I/O
On Fri, Jan 19, 2018 at 02:59:51PM +1100, Dave Chinner wrote:
> On Fri, Jan 19, 2018 at 02:13:53AM +0000, Al Viro wrote:
> > On Thu, Jan 18, 2018 at 06:57:40PM -0600, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues
> > >
> > > In case direct I/O encounters an error midway, it returns the error.
> > > Instead it should be returning the number of bytes transferred so far.
> > >
> > > Test case for filesystems (with ENOSPC):
> > > 1. Create an almost full filesystem
> > > 2. Create a file, say /mnt/lastfile, until the filesystem is full.
> > > 3. Direct write() with count > sizeof /mnt/lastfile.
> > >
> > > Result: write() returns -ENOSPC. However, file content has data written
> > > in step 3.
> > >
> > > This fixes fstest generic/472.
> >
> > OK... I can live with that. What about the XFS side? It should be
> > a prereq, to avoid bisection hazard; I can throw both into vfs.git,
> > if XFS folks are OK with that. Objections?
>
> Going through the VFS tree seems the best approach to me - it's a
> trivial change. I'm sure Darrick will shout if it's going to be a
> problem, though.

vfs.git is fine, though the second patch to remove the xfs assert should go first, as Al points out.

For both patches,
Reviewed-by: Darrick J. Wong

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Fri, 2018-01-19 at 10:32 +0800, Ming Lei wrote:
> Now most of times both NVMe and SCSI won't return BLK_STS_RESOURCE, and
> it should be DM-only which returns STS_RESOURCE so often.

That's wrong at least for SCSI. See also https://marc.info/?l=linux-block&m=151578329417076.

Bart.
Re: [PATCH 2/2] xfs: remove assert to check bytes returned
On Fri, Jan 19, 2018 at 02:23:16AM -0200, Raphael Carvalho wrote:
> On Fri, Jan 19, 2018 at 1:57 AM, Dave Chinner wrote:
> >
> > On Thu, Jan 18, 2018 at 06:57:41PM -0600, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues
> > >
> > > Since we can return less than count in case of partial direct
> > > writes, remove the ASSERT.
> > >
> > > Signed-off-by: Goldwyn Rodrigues
> > > ---
> > > fs/xfs/xfs_file.c | 6 --
> > > 1 file changed, 6 deletions(-)
> > >
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 8601275cc5e6..8fc4dbf66910 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -590,12 +590,6 @@ xfs_file_dio_aio_write(
> > > ret = iomap_dio_rw(iocb, from, _iomap_ops,
> > > xfs_dio_write_end_io);
> > > out:
> > > xfs_iunlock(ip, iolock);
> > > -
> > > - /*
> > > - * No fallback to buffered IO on errors for XFS, direct IO will
> > > either
> > > - * complete fully or fail.
> > > - */
> > > - ASSERT(ret < 0 || ret == count);
> > > return ret;
> > > }
> >
> > Acked-by: Dave Chinner
>
> Is this really correct?

Yes.

> Isn't this check with regards to DIO
> submission?

Yes, if there is an error during submission. But it also checks synchronous IO completion (i.e. error or bytes written), because iomap_dio_rw() waits for non-AIO DIO to complete and returns the IO completion status in that case.

> The bytes written is returned in a different asynchronous
> path due to AIO support, no?!

That is correct. For AIO we'll get -EIOCBQUEUED here on successful submission.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 2/2] xfs: remove assert to check bytes returned
On Fri, Jan 19, 2018 at 1:57 AM, Dave Chinner wrote:
>
> On Thu, Jan 18, 2018 at 06:57:41PM -0600, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues
> >
> > Since we can return less than count in case of partial direct
> > writes, remove the ASSERT.
> >
> > Signed-off-by: Goldwyn Rodrigues
> > ---
> > fs/xfs/xfs_file.c | 6 --
> > 1 file changed, 6 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 8601275cc5e6..8fc4dbf66910 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -590,12 +590,6 @@ xfs_file_dio_aio_write(
> > ret = iomap_dio_rw(iocb, from, _iomap_ops, xfs_dio_write_end_io);
> > out:
> > xfs_iunlock(ip, iolock);
> > -
> > - /*
> > - * No fallback to buffered IO on errors for XFS, direct IO will either
> > - * complete fully or fail.
> > - */
> > - ASSERT(ret < 0 || ret == count);
> > return ret;
> > }
>
> Acked-by: Dave Chinner

Is this really correct? Isn't this check with regards to DIO submission? The bytes written is returned in a different asynchronous path due to AIO support, no?!

> --
> Dave Chinner
> da...@fromorbit.com

Regards,
Raphael S. Carvalho
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On 1/18/18 7:32 PM, Ming Lei wrote: > On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote: >> On 1/18/18 11:47 AM, Bart Van Assche wrote: This is all very tiresome. >>> >>> Yes, this is tiresome. It is very annoying to me that others keep >>> introducing so many regressions in such important parts of the kernel. >>> It is also annoying to me that I get blamed if I report a regression >>> instead of seeing that the regression gets fixed. >> >> I agree, it sucks that any change there introduces the regression. I'm >> fine with doing the delay insert again until a new patch is proven to be >> better. > > That way is still buggy as I explained, since rerun queue before adding > request to hctx->dispatch_list isn't correct. Who can make sure the request > is visible when __blk_mq_run_hw_queue() is called? That race basically doesn't exist for a 10ms gap. > Not mention this way will cause performance regression again. How so? It's _exactly_ the same as what you are proposing, except mine will potentially run the queue when it need not do so. But given that these are random 10ms queue kicks because we are screwed, it should not matter. The key point is that it only should be if we have NO better options. If it's a frequently occurring event that we have to return BLK_STS_RESOURCE, then we need to get a way to register an event for when that condition clears. That event will then kick the necessary queue(s). >> From the original topic of this email, we have conditions that can cause >> the driver to not be able to submit an IO. A set of those conditions can >> only happen if IO is in flight, and those cases we have covered just >> fine. Another set can potentially trigger without IO being in flight. >> These are cases where a non-device resource is unavailable at the time >> of submission. This might be iommu running out of space, for instance, >> or it might be a memory allocation of some sort. 
For these cases, we >> don't get any notification when the shortage clears. All we can do is >> ensure that we restart operations at some point in the future. We're SOL >> at that point, but we have to ensure that we make forward progress. > > Right, it is a generic issue, not DM-specific one, almost all drivers > call kmalloc(GFP_ATOMIC) in IO path. GFP_ATOMIC basically never fails, unless we are out of memory. The exception is higher order allocations. If a driver has a higher order atomic allocation in its IO path, the device driver writer needs to be taken out behind the barn and shot. Simple as that. It will NEVER work well in a production environment. Witness the disaster that so many NIC driver writers have learned. This is NOT the case we care about here. It's resources that are more readily depleted because other devices are using them. If it's a high frequency or generally occurring event, then we simply must have a callback to restart the queue from that. The condition then becomes identical to device private starvation, the only difference being from where we restart the queue. > IMO, there is enough time for figuring out a generic solution before > 4.16 release. I would hope so, but the proposed solutions have not filled me with a lot of confidence in the end result so far. >> That last set of conditions better not be a common occurrence, since >> performance is down the toilet at that point. I don't want to introduce >> hot path code to rectify it. Have the driver return if that happens in a >> way that is DIFFERENT from needing a normal restart. The driver knows if >> this is a resource that will become available when IO completes on this >> device or not. If we get that return, we have a generic run-again delay. > > Now most of times both NVMe and SCSI won't return BLK_STS_RESOURCE, and > it should be DM-only which returns STS_RESOURCE so often. Where does the dm STS_RESOURCE error usually come from - what exact resource are we running out of? 
-- Jens Axboe
Re: [PATCH v3 RESEND 2/2] blk-throttle: fix wrong initialization in case of dm device
Hi Mike, On 2018/1/19 11:29, Mike Snitzer wrote: > On Thu, Jan 18 2018 at 10:09pm -0500, > Joseph Qi wrote: > >> From: Joseph Qi >> >> DM device sets QUEUE_FLAG_NONROT after the queue is registered. That is >> to mean, the previous initialization in blk_throtl_register_queue is >> wrong in this case. >> Fix it by checking and then updating the info during root tg >> initialization as we don't have a better choice. >> >> Signed-off-by: Joseph Qi >> Reviewed-by: Shaohua Li >> --- >> block/blk-throttle.c | 19 +++ >> 1 file changed, 19 insertions(+) >> >> diff --git a/block/blk-throttle.c b/block/blk-throttle.c >> index bf52035..7150f14 100644 >> --- a/block/blk-throttle.c >> +++ b/block/blk-throttle.c >> @@ -541,6 +541,25 @@ static void throtl_pd_init(struct blkg_policy_data *pd) >> if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent) >> sq->parent_sq = _to_tg(blkg->parent)->service_queue; >> tg->td = td; >> + >> +#ifdef CONFIG_BLK_DEV_THROTTLING_LOW >> +/* >> + * DM device sets QUEUE_FLAG_NONROT after the queue is registered, >> + * so the previous initialization is wrong in this case. Check and >> + * update it here. 
>> + */ >> +if (blk_queue_nonrot(blkg->q) && >> +td->filtered_latency != LATENCY_FILTERED_SSD) { >> +int i; >> + >> +td->throtl_slice = DFL_THROTL_SLICE_SSD; >> +td->filtered_latency = LATENCY_FILTERED_SSD; >> +for (i = 0; i < LATENCY_BUCKET_SIZE; i++) { >> +td->avg_buckets[READ][i].latency = 0; >> +td->avg_buckets[WRITE][i].latency = 0; >> +} >> +} >> +#endif >> } >> >> /* >> -- >> 1.9.4 > > This should be fixed for 4.16, please see these block tree commits: > http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.16/block&id=fa70d2e2c4a0a54ced98260c6a176cc94c876d27 > http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.16/block&id=c100ec49fdd836ff8a17c7bfcc7611d2ee2b > > The last commit's patch header even references the previous submission > you had for this patch with: > > "These changes also stave off the need to introduce new DM-specific > workarounds in block core, e.g. this proposal: > https://patchwork.kernel.org/patch/10067961/" > Yes, if we call dm_table_set_restrictions before blk_register_queue now, we can make sure the initialization is correct by checking whether flag QUEUE_FLAG_NONROT is set or not. So my patch is no longer needed. Jens, please ignore this patch, thanks. Thanks, Joseph
Re: [PATCH 2/2] xfs: remove assert to check bytes returned
On Thu, Jan 18, 2018 at 06:57:41PM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues
>
> Since we can return less than count in case of partial direct
> writes, remove the ASSERT.
>
> Signed-off-by: Goldwyn Rodrigues
> ---
> fs/xfs/xfs_file.c | 6 --
> 1 file changed, 6 deletions(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 8601275cc5e6..8fc4dbf66910 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -590,12 +590,6 @@ xfs_file_dio_aio_write(
> ret = iomap_dio_rw(iocb, from, _iomap_ops, xfs_dio_write_end_io);
> out:
> xfs_iunlock(ip, iolock);
> -
> - /*
> - * No fallback to buffered IO on errors for XFS, direct IO will either
> - * complete fully or fail.
> - */
> - ASSERT(ret < 0 || ret == count);
> return ret;
> }

Acked-by: Dave Chinner

--
Dave Chinner
da...@fromorbit.com
Re: [PATCH v3 RESEND 2/2] blk-throttle: fix wrong initialization in case of dm device
On 1/18/18 8:29 PM, Mike Snitzer wrote: > On Thu, Jan 18 2018 at 10:09pm -0500, > Joseph Qi wrote: > >> From: Joseph Qi >> >> DM device sets QUEUE_FLAG_NONROT after the queue is registered. That is >> to mean, the previous initialization in blk_throtl_register_queue is >> wrong in this case. >> Fix it by checking and then updating the info during root tg >> initialization as we don't have a better choice. >> >> Signed-off-by: Joseph Qi >> Reviewed-by: Shaohua Li >> --- >> block/blk-throttle.c | 19 +++ >> 1 file changed, 19 insertions(+) >> >> diff --git a/block/blk-throttle.c b/block/blk-throttle.c >> index bf52035..7150f14 100644 >> --- a/block/blk-throttle.c >> +++ b/block/blk-throttle.c >> @@ -541,6 +541,25 @@ static void throtl_pd_init(struct blkg_policy_data *pd) >> if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent) >> sq->parent_sq = _to_tg(blkg->parent)->service_queue; >> tg->td = td; >> + >> +#ifdef CONFIG_BLK_DEV_THROTTLING_LOW >> +/* >> + * DM device sets QUEUE_FLAG_NONROT after the queue is registered, >> + * so the previous initialization is wrong in this case. Check and >> + * update it here. 
>> + */ >> +if (blk_queue_nonrot(blkg->q) && >> +td->filtered_latency != LATENCY_FILTERED_SSD) { >> +int i; >> + >> +td->throtl_slice = DFL_THROTL_SLICE_SSD; >> +td->filtered_latency = LATENCY_FILTERED_SSD; >> +for (i = 0; i < LATENCY_BUCKET_SIZE; i++) { >> +td->avg_buckets[READ][i].latency = 0; >> +td->avg_buckets[WRITE][i].latency = 0; >> +} >> +} >> +#endif >> } >> >> /* >> -- >> 1.9.4 > > This should be fixed for 4.16, please see these block tree commits: > http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.16/block&id=fa70d2e2c4a0a54ced98260c6a176cc94c876d27 > http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.16/block&id=c100ec49fdd836ff8a17c7bfcc7611d2ee2b > > The last commit's patch header even references the previous submission > you had for this patch with: > > "These changes also stave off the need to introduce new DM-specific > workarounds in block core, e.g. this proposal: > https://patchwork.kernel.org/patch/10067961/" > Joseph, please confirm if it works like it should with the dm tree. The point of Mike's work was to avoid special casing stuff like this, so hopefully it took care of yours too. -- Jens Axboe
Re: [PATCH v3 RESEND 2/2] blk-throttle: fix wrong initialization in case of dm device
On Thu, Jan 18 2018 at 10:09pm -0500,
Joseph Qi wrote:

> From: Joseph Qi
>
> DM device sets QUEUE_FLAG_NONROT after the queue is registered. That is
> to mean, the previous initialization in blk_throtl_register_queue is
> wrong in this case.
> Fix it by checking and then updating the info during root tg
> initialization as we don't have a better choice.
>
> Signed-off-by: Joseph Qi
> Reviewed-by: Shaohua Li
> ---
> block/blk-throttle.c | 19 +++
> 1 file changed, 19 insertions(+)
>
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index bf52035..7150f14 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -541,6 +541,25 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
> if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent)
> sq->parent_sq = _to_tg(blkg->parent)->service_queue;
> tg->td = td;
> +
> +#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
> + /*
> +* DM device sets QUEUE_FLAG_NONROT after the queue is registered,
> +* so the previous initialization is wrong in this case. Check and
> +* update it here.
> +*/
> + if (blk_queue_nonrot(blkg->q) &&
> + td->filtered_latency != LATENCY_FILTERED_SSD) {
> + int i;
> +
> + td->throtl_slice = DFL_THROTL_SLICE_SSD;
> + td->filtered_latency = LATENCY_FILTERED_SSD;
> + for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
> + td->avg_buckets[READ][i].latency = 0;
> + td->avg_buckets[WRITE][i].latency = 0;
> + }
> + }
> +#endif
> }
>
> /*
> --
> 1.9.4

This should be fixed for 4.16, please see these block tree commits:
http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.16/block&id=fa70d2e2c4a0a54ced98260c6a176cc94c876d27
http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.16/block&id=c100ec49fdd836ff8a17c7bfcc7611d2ee2b

The last commit's patch header even references the previous submission
you had for this patch with:

"These changes also stave off the need to introduce new DM-specific
workarounds in block core, e.g. this proposal:
https://patchwork.kernel.org/patch/10067961/"
[PATCH v3 RESEND 2/2] blk-throttle: fix wrong initialization in case of dm device
From: Joseph Qi

DM device sets QUEUE_FLAG_NONROT after the queue is registered. That is
to mean, the previous initialization in blk_throtl_register_queue is
wrong in this case.
Fix it by checking and then updating the info during root tg
initialization as we don't have a better choice.

Signed-off-by: Joseph Qi
Reviewed-by: Shaohua Li
---
 block/blk-throttle.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index bf52035..7150f14 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -541,6 +541,25 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent)
 		sq->parent_sq = _to_tg(blkg->parent)->service_queue;
 	tg->td = td;
+
+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+	/*
+	 * DM device sets QUEUE_FLAG_NONROT after the queue is registered,
+	 * so the previous initialization is wrong in this case. Check and
+	 * update it here.
+	 */
+	if (blk_queue_nonrot(blkg->q) &&
+	    td->filtered_latency != LATENCY_FILTERED_SSD) {
+		int i;
+
+		td->throtl_slice = DFL_THROTL_SLICE_SSD;
+		td->filtered_latency = LATENCY_FILTERED_SSD;
+		for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+			td->avg_buckets[READ][i].latency = 0;
+			td->avg_buckets[WRITE][i].latency = 0;
+		}
+	}
+#endif
 }
 
 /*
-- 
1.9.4
Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU
Hi Ming,

Sorry for the delayed report on this.

On 01/17/2018 05:57 PM, Ming Lei wrote:
> 2) hctx->next_cpu can become offline from online before __blk_mq_run_hw_queue
> is run, there isn't warning, but once the IO is submitted to hardware,
> after it is completed, how does the HBA/hw queue notify CPU since CPUs
> assigned to this hw queue(irq vector) are offline? blk-mq's timeout
> handler may cover that, but looks too tricky.

In theory, the irq affinity will be migrated to another cpu. This is done by fixup_irqs() in the context of stop_machine. However, in my test, I found this log:

[  267.161043] do_IRQ: 7.33 No irq handler for vector

The 33 is the vector used by the nvme cq. The irq seems to be missed and sometimes an IO hang occurred. It does not happen every time; I think it may be due to nvme_process_cq in nvme_queue_rq. I added a dump_stack() behind the error log and got the following:

[  267.161043] do_IRQ: 7.33 No irq handler for vector migration/7
[  267.161045] CPU: 7 PID: 52 Comm: migration/7 Not tainted 4.15.0-rc7+ #27
[  267.161045] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
[  267.161046] Call Trace:
[  267.161047]
[  267.161052] dump_stack+0x7c/0xb5
[  267.161054] do_IRQ+0xb9/0xf0
[  267.161056] common_interrupt+0xa2/0xa2
[  267.161057]
[  267.161059] RIP: 0010:multi_cpu_stop+0xb0/0x120
[  267.161060] RSP: 0018:bb6c81af7e70 EFLAGS: 0202 ORIG_RAX: ffde
[  267.161061] RAX: 0001 RBX: 0004 RCX:
[  267.161062] RDX: 0006 RSI: 898c4591 RDI: 0202
[  267.161063] RBP: bb6c826e7c88 R08: 991abc1256bc R09: 0005
[  267.161063] R10: bb6c81af7db8 R11: 89c91d20 R12: 0001
[  267.161064] R13: bb6c826e7cac R14: 0003 R15:
[  267.161067] ? cpu_stop_queue_work+0x90/0x90
[  267.161068] cpu_stopper_thread+0x83/0x100
[  267.161070] smpboot_thread_fn+0x161/0x220
[  267.161072] kthread+0xf5/0x130
[  267.161073] ? sort_range+0x20/0x20
[  267.161074] ? kthread_associate_blkcg+0xe0/0xe0
[  267.161076] ret_from_fork+0x24/0x30

The irq just occurred after the irq is enabled in multi_cpu_stop.

0x8112d655 is in multi_cpu_stop (/home/will/u04/source_code/linux-block/kernel/stop_machine.c:223).
218				 */
219				touch_nmi_watchdog();
220			}
221		} while (curstate != MULTI_STOP_EXIT);
222
223		local_irq_restore(flags);
224		return err;
225	}

Thanks
Jianchao
Re: [PATCH v3 RESEND 1/2] blk-throttle: track read and write request individually
Sure, I will resend it.

Thanks,
Joseph

On 2018/1/19 10:52, Jens Axboe wrote:
> On 1/18/18 7:11 PM, Joseph Qi wrote:
>> Hi Jens,
>> Could you please pick the two pending patches for 4.16?
>> They all have been reviewed by Shaohua.
>
> Sorry, I guess I forgot about this series. I picked up 1/2,
> for some reason I don't have 2/2 and I can't find it on
> linux-block either. The thread only shows the first patch.
>
> Can you resend 2/2?
Re: [PATCH v3 RESEND 1/2] blk-throttle: track read and write request individually
On 1/18/18 7:11 PM, Joseph Qi wrote: > Hi Jens, > Could you please pick the two pending patches for 4.16? > They all have been reviewed by Shaohua. Sorry, I guess I forgot about this series. I picked up 1/2, for some reason I don't have 2/2 and I can't find it on linux-block either. The thread only shows the first patch. Can you resend 2/2? -- Jens Axboe
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote:
> On 1/18/18 11:47 AM, Bart Van Assche wrote:
> >> This is all very tiresome.
> > 
> > Yes, this is tiresome. It is very annoying to me that others keep
> > introducing so many regressions in such important parts of the kernel.
> > It is also annoying to me that I get blamed if I report a regression
> > instead of seeing that the regression gets fixed.
> 
> I agree, it sucks that any change there introduces the regression. I'm
> fine with doing the delay insert again until a new patch is proven to be
> better.

That way is still buggy, as I explained, since rerunning the queue before
adding the request to hctx->dispatch_list isn't correct. Who can make sure
the request is visible when __blk_mq_run_hw_queue() is called?

Not to mention that this way will cause a performance regression again.

> 
> From the original topic of this email, we have conditions that can cause
> the driver to not be able to submit an IO. A set of those conditions can
> only happen if IO is in flight, and those cases we have covered just
> fine. Another set can potentially trigger without IO being in flight.
> These are cases where a non-device resource is unavailable at the time
> of submission. This might be iommu running out of space, for instance,
> or it might be a memory allocation of some sort. For these cases, we
> don't get any notification when the shortage clears. All we can do is
> ensure that we restart operations at some point in the future. We're SOL
> at that point, but we have to ensure that we make forward progress.

Right, it is a generic issue, not a DM-specific one; almost all drivers
call kmalloc(GFP_ATOMIC) in the IO path.

IMO, there is enough time to figure out a generic solution before the 4.16
release.

> 
> That last set of conditions better not be a common occurrence, since
> performance is down the toilet at that point. I don't want to introduce
> hot path code to rectify it.
> Have the driver return if that happens in a
> way that is DIFFERENT from needing a normal restart. The driver knows if
> this is a resource that will become available when IO completes on this
> device or not. If we get that return, we have a generic run-again delay.

Now most of the time neither NVMe nor SCSI will return BLK_STS_RESOURCE,
and it should be DM only which returns STS_RESOURCE so often.

> 
> This basically becomes the same as doing the delay queue thing from DM,
> but just in a generic fashion.

Yeah, that is right.

-- 
Ming
Re: [PATCH v5 1/2] Return bytes transferred for partial direct I/O
On Thu, Jan 18, 2018 at 06:57:40PM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues
> 
> In case direct I/O encounters an error midway, it returns the error.
> Instead it should be returning the number of bytes transferred so far.
> 
> Test case for filesystems (with ENOSPC):
> 1. Create an almost full filesystem
> 2. Create a file, say /mnt/lastfile, until the filesystem is full.
> 3. Direct write() with count > sizeof /mnt/lastfile.
> 
> Result: write() returns -ENOSPC. However, file content has data written
> in step 3.
> 
> This fixes fstest generic/472.

OK... I can live with that. What about the XFS side? It should be a prereq,
to avoid a bisection hazard; I can throw both into vfs.git, if XFS folks are
OK with that. Objections?
Re: [PATCH v3 RESEND 1/2] blk-throttle: track read and write request individually
Hi Jens,
Could you please pick the two pending patches for 4.16?
They all have been reviewed by Shaohua.

Thanks,
Joseph

On 18/1/8 20:05, Joseph Qi wrote:
> A polite ping for the two pending patches...
> 
> Thanks,
> Joseph
> 
> On 17/11/24 13:13, Jens Axboe wrote:
>> On 11/23/2017 06:31 PM, Joseph Qi wrote:
>>> Hi Jens,
>>> Could you please give your advice for the two patches or pick them up if
>>> you think they are good?
>> 
>> It looks OK to me, but my preference would be to push this until
>> 4.16.
>> 
[PATCH v5 1/2] Return bytes transferred for partial direct I/O
From: Goldwyn Rodrigues

In case direct I/O encounters an error midway, it returns the error.
Instead it should be returning the number of bytes transferred so far.

Test case for filesystems (with ENOSPC):
1. Create an almost full filesystem
2. Create a file, say /mnt/lastfile, until the filesystem is full.
3. Direct write() with count > sizeof /mnt/lastfile.

Result: write() returns -ENOSPC. However, file content has data written
in step 3.

This fixes fstest generic/472.

Reviewed-by: Christoph Hellwig
Signed-off-by: Goldwyn Rodrigues
---
 fs/block_dev.c |  2 +-
 fs/direct-io.c |  4 +---
 fs/iomap.c     | 20 ++++++++++----------
 3 files changed, 12 insertions(+), 14 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4a181fcb5175..49d94360bb51 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -409,7 +409,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 	if (!ret)
 		ret = blk_status_to_errno(dio->bio.bi_status);
-	if (likely(!ret))
+	if (likely(dio->size))
 		ret = dio->size;
 
 	bio_put(&dio->bio);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 3aafb3343a65..0c98b6e65d7c 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -262,8 +262,6 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
 		ret = dio->page_errors;
 	if (ret == 0)
 		ret = dio->io_error;
-	if (ret == 0)
-		ret = transferred;
 
 	if (dio->end_io) {
 		// XXX: ki_pos??
@@ -310,7 +308,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
 	}
 
 	kmem_cache_free(dio_cache, dio);
-	return ret;
+	return transferred ?
transferred : ret;
 }
 
 static void dio_aio_complete_work(struct work_struct *work)
diff --git a/fs/iomap.c b/fs/iomap.c
index 47d29ccffaef..cab57d85404e 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -716,23 +716,23 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 	struct kiocb *iocb = dio->iocb;
 	struct inode *inode = file_inode(iocb->ki_filp);
 	loff_t offset = iocb->ki_pos;
-	ssize_t ret;
+	ssize_t err;
+	ssize_t transferred = dio->size;
 
 	if (dio->end_io) {
-		ret = dio->end_io(iocb,
-				dio->error ? dio->error : dio->size,
+		err = dio->end_io(iocb,
+				transferred ? transferred : dio->error,
 				dio->flags);
 	} else {
-		ret = dio->error;
+		err = dio->error;
 	}
 
-	if (likely(!ret)) {
-		ret = dio->size;
+	if (likely(transferred)) {
 		/* check for short read */
-		if (offset + ret > dio->i_size &&
+		if (offset + transferred > dio->i_size &&
 		    !(dio->flags & IOMAP_DIO_WRITE))
-			ret = dio->i_size - offset;
-		iocb->ki_pos += ret;
+			transferred = dio->i_size - offset;
+		iocb->ki_pos += transferred;
 	}
 
 	/*
@@ -759,7 +759,7 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 	inode_dio_end(file_inode(iocb->ki_filp));
 	kfree(dio);
 
-	return ret;
+	return transferred ? transferred : err;
 }
 
 static void iomap_dio_complete_work(struct work_struct *work)
-- 
2.15.1
[PATCH 2/2] xfs: remove assert to check bytes returned
From: Goldwyn Rodrigues

Since we can return less than count in case of partial direct writes,
remove the ASSERT.

Signed-off-by: Goldwyn Rodrigues
---
 fs/xfs/xfs_file.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 8601275cc5e6..8fc4dbf66910 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -590,12 +590,6 @@ xfs_file_dio_aio_write(
 	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
 out:
 	xfs_iunlock(ip, iolock);
-
-	/*
-	 * No fallback to buffered IO on errors for XFS, direct IO will either
-	 * complete fully or fail.
-	 */
-	ASSERT(ret < 0 || ret == count);
 	return ret;
 }
-- 
2.15.1
Re: [PATCH] dm rq: Avoid that request processing stalls sporadically
On 1/18/18 5:35 PM, Ming Lei wrote:
> On Thu, Jan 18, 2018 at 05:24:29PM -0700, Jens Axboe wrote:
>> On 1/18/18 5:18 PM, Ming Lei wrote:
>>> On Fri, Jan 19, 2018 at 12:14:24AM +, Bart Van Assche wrote:
>>>> On Fri, 2018-01-19 at 08:11 +0800, Ming Lei wrote:
>>>>> On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
>>>>>> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
>>>>>> index f16096af879a..c59c59cfd2a5 100644
>>>>>> --- a/drivers/md/dm-rq.c
>>>>>> +++ b/drivers/md/dm-rq.c
>>>>>> @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
>>>>>>  		/* Undo dm_start_request() before requeuing */
>>>>>>  		rq_end_stats(md, rq);
>>>>>>  		rq_completed(md, rq_data_dir(rq), false);
>>>>>> +		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
>>>>>>  		return BLK_STS_RESOURCE;
>>>>>>  	}
>>>>> 
>>>>> Nak.
>>>> 
>>>> This patch fixes a regression that was introduced by you. You should
>>>> know that regressions are not acceptable. If you don't agree with this
>>>> patch, please fix the root cause.
>>> 
>>> Yesterday I sent a patch, did you test that?
>> 
>> That patch isn't going to be of much help. It might prevent you from
>> completely stalling, but it might take you tens of seconds to get there.
> 
> Yeah, that is true, and why I marked it as RFC, but the case is so rare
> to happen.

You don't know that. If the resource is very limited, or someone else is
gobbling up all of it, then it may trigger quite often. Granted, in that
case, you need some way of signaling the freeing of those resources to make
it efficient.

>> On top of that, it's a rolling timer that gets added when IO is queued,
>> and re-added if IO is pending when it fires. If the latter case is not
>> true, then it won't get armed again. So if IO fails to queue without
>> any other IO pending, you're pretty much in the same situation as if
> 
> Detecting "no IO pending" may take a bit of time, but we can do it after
> BLK_STS_RESOURCE is returned and SCHED_RESTART is set. I have done it
> this way in the following patch; what do you think of it?
> 
> https://github.com/ming1/linux/commit/f866ce0b97b0ae22d033881da7eb07706fd458b4

I think it's overly complicated. As I wrote in a lengthier email earlier
today, just ensure that we have a unique return code from the driver for
when it aborts queuing an IO due to a resource shortage that isn't tied to
its own consumption of it. When we get that return, do a delayed queue run
after X amount of time to ensure we always try again. If you want to get
fancy (later on), you could track the frequency of such returns and
complain if it's too high.

There's no point in adding a lot of code to check whether we need to run
the queue or not. Just always do it. We won't be doing it more than 100
times per second for the worst case of the condition taking a while to
clear, if we stick to the 10ms re-run time.

-- 
Jens Axboe
Re: dm rq: Avoid that request processing stalls sporadically
On Fri, 2018-01-19 at 08:26 +0800, Ming Lei wrote:
> No, this case won't return BLK_STS_RESOURCE.

It's possible that my attempt to reverse engineer the latest blk-mq changes
was wrong. But the queue stall is real and needs to be fixed.

Bart.
Re: dm rq: Avoid that request processing stalls sporadically
On Thu, Jan 18, 2018 at 05:13:53PM +, Bart Van Assche wrote:
> On Thu, 2018-01-18 at 11:50 -0500, Mike Snitzer wrote:
> > The issue you say it was originally intended to fix _should_ be
> > addressed with this change:
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=4dd6edd23e7ea971efddc303f9e67eb79e95808e
> 
> Hello Mike,
> 
> Sorry but I'm not convinced that that patch is sufficient. That patch helps
> if .end_io() is called with status BLK_STS_RESOURCE and also if
> blk_insert_cloned_request() returns the .queue_rq() return value. It does
> not help if .queue_rq() returns BLK_STS_RESOURCE and that return value gets
> ignored.

The return value from .queue_rq() is handled by blk-mq, so why do you think
it can be ignored? Please see blk_mq_dispatch_rq_list().

> I think that can happen as follows:
> - Request cloning in multipath_clone_and_map() succeeds and that function
>   returns DM_MAPIO_REMAPPED.
> - dm_dispatch_clone_request() calls blk_insert_cloned_request().
> - blk_insert_cloned_request() calls blk_mq_request_direct_issue(), which
>   results in a call of __blk_mq_try_issue_directly().
> - __blk_mq_try_issue_directly() calls blk_mq_sched_insert_request(). In this

This only happens if the queue is stopped or quiesced, and then we return
BLK_STS_OK to blk-mq via .queue_rq(); please see __blk_mq_try_issue_directly().
How does this cause an IO hang?

> case the BLK_STS_RESOURCE returned by the .queue_rq() implementation of the
> underlying path will be ignored.

No, this case won't return BLK_STS_RESOURCE.

-- 
Ming
Re: [PATCH] dm rq: Avoid that request processing stalls sporadically
On 1/18/18 5:18 PM, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 12:14:24AM +, Bart Van Assche wrote:
>> On Fri, 2018-01-19 at 08:11 +0800, Ming Lei wrote:
>>> On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
>>>> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
>>>> index f16096af879a..c59c59cfd2a5 100644
>>>> --- a/drivers/md/dm-rq.c
>>>> +++ b/drivers/md/dm-rq.c
>>>> @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
>>>>  		/* Undo dm_start_request() before requeuing */
>>>>  		rq_end_stats(md, rq);
>>>>  		rq_completed(md, rq_data_dir(rq), false);
>>>> +		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
>>>>  		return BLK_STS_RESOURCE;
>>>>  	}
>>> 
>>> Nak.
>> 
>> This patch fixes a regression that was introduced by you. You should know
>> that regressions are not acceptable. If you don't agree with this patch,
>> please fix the root cause.
> 
> Yesterday I sent a patch, did you test that?

That patch isn't going to be of much help. It might prevent you from
completely stalling, but it might take you tens of seconds to get there.

On top of that, it's a rolling timer that gets added when IO is queued,
and re-added if IO is pending when it fires. If the latter case is not
true, then it won't get armed again. So if IO fails to queue without
any other IO pending, you're pretty much in the same situation as if
you marked the queue as restart. Nobody is going to be noticing either
of those conditions.

-- 
Jens Axboe
Re: [PATCH] dm rq: Avoid that request processing stalls sporadically
On Fri, 2018-01-19 at 08:18 +0800, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 12:14:24AM +, Bart Van Assche wrote:
> > On Fri, 2018-01-19 at 08:11 +0800, Ming Lei wrote:
> > > On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
> > > > diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> > > > index f16096af879a..c59c59cfd2a5 100644
> > > > --- a/drivers/md/dm-rq.c
> > > > +++ b/drivers/md/dm-rq.c
> > > > @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
> > > >  		/* Undo dm_start_request() before requeuing */
> > > >  		rq_end_stats(md, rq);
> > > >  		rq_completed(md, rq_data_dir(rq), false);
> > > > +		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
> > > >  		return BLK_STS_RESOURCE;
> > > >  	}
> > > 
> > > Nak.
> > 
> > This patch fixes a regression that was introduced by you. You should know
> > that regressions are not acceptable. If you don't agree with this patch,
> > please fix the root cause.
> 
> Yesterday I sent a patch, did you test that?

Yes, I did. It caused queue stalls that were so bad that sending "kick" to
the debugfs "state" attribute was not sufficient to resolve the queue stall.

Bart.
Re: [PATCH] dm rq: Avoid that request processing stalls sporadically
On Fri, Jan 19, 2018 at 12:14:24AM +, Bart Van Assche wrote:
> On Fri, 2018-01-19 at 08:11 +0800, Ming Lei wrote:
> > On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
> > > diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> > > index f16096af879a..c59c59cfd2a5 100644
> > > --- a/drivers/md/dm-rq.c
> > > +++ b/drivers/md/dm-rq.c
> > > @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
> > >  		/* Undo dm_start_request() before requeuing */
> > >  		rq_end_stats(md, rq);
> > >  		rq_completed(md, rq_data_dir(rq), false);
> > > +		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
> > >  		return BLK_STS_RESOURCE;
> > >  	}
> > 
> > Nak.
> 
> This patch fixes a regression that was introduced by you. You should know
> that regressions are not acceptable. If you don't agree with this patch,
> please fix the root cause.

Yesterday I sent a patch, did you test that?

-- 
Ming
Re: [PATCH] dm rq: Avoid that request processing stalls sporadically
On Fri, 2018-01-19 at 08:11 +0800, Ming Lei wrote:
> On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
> > diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> > index f16096af879a..c59c59cfd2a5 100644
> > --- a/drivers/md/dm-rq.c
> > +++ b/drivers/md/dm-rq.c
> > @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
> >  		/* Undo dm_start_request() before requeuing */
> >  		rq_end_stats(md, rq);
> >  		rq_completed(md, rq_data_dir(rq), false);
> > +		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
> >  		return BLK_STS_RESOURCE;
> >  	}
> 
> Nak.

This patch fixes a regression that was introduced by you. You should know
that regressions are not acceptable. If you don't agree with this patch,
please fix the root cause.

Bart.
Re: [PATCH] dm rq: Avoid that request processing stalls sporadically
On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
> If the .queue_rq() implementation of a block driver returns
> BLK_STS_RESOURCE then that block driver is responsible for
> rerunning the queue once the condition that caused it to return
> BLK_STS_RESOURCE has been cleared. The dm-mpath driver tells the
> dm core to requeue a request if e.g. not enough memory is
> available for cloning a request or if the underlying path is
> busy. Since the dm-mpath driver does not receive any kind of
> notification if the condition that caused it to return "requeue"
> is cleared, the only solution to avoid that dm-mpath request
> processing stalls is to call blk_mq_delay_run_hw_queue(). Hence
> this patch.
> 
> Fixes: ec3eaf9a6731 ("dm mpath: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE")
> Signed-off-by: Bart Van Assche
> Cc: Ming Lei
> ---
>  drivers/md/dm-rq.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index f16096af879a..c59c59cfd2a5 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		/* Undo dm_start_request() before requeuing */
>  		rq_end_stats(md, rq);
>  		rq_completed(md, rq_data_dir(rq), false);
> +		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
>  		return BLK_STS_RESOURCE;
>  	}
> 

Nak.

It still takes a bit of time to add this request to hctx->dispatch_list from
here, so suppose that time is longer than 100ms because of an interrupt,
preemption or whatever: then this request can't be observed in the scheduled
run of the queue (__blk_mq_run_hw_queue).

Not to mention it is just an ugly workaround, which degrades performance a
lot.

-- 
Ming
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Thu, 2018-01-18 at 15:39 -0700, Jens Axboe wrote:
> When you do have a solid test case, please please submit a blktests
> test case for it! This needs to be something we can run regularly in
> testing.

Hello Jens,

That sounds like a good idea to me. BTW, I think the reason why so far I can
reproduce these queue stalls more easily than others is because I modified
the SRP initiator to make it easy to cause the .get_budget() call to succeed
and scsi_queue_rq() to return BLK_STS_BUSY. A possible next step is to apply
a similar change to the scsi_debug driver. The change I made is as follows:

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 0c887ebfbc64..7f3c4a197425 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3135,6 +3135,16 @@ static int srp_reset_host(struct scsi_cmnd *scmnd)
 	return srp_reconnect_rport(target->rport) == 0 ? SUCCESS : FAILED;
 }
 
+static int srp_target_alloc(struct scsi_target *starget)
+{
+	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
+	struct srp_target_port *target = host_to_target(shost);
+
+	if (target->target_can_queue)
+		starget->can_queue = target->target_can_queue;
+	return 0;
+}
+
 static int srp_slave_alloc(struct scsi_device *sdev)
 {
 	struct Scsi_Host *shost = sdev->host;
@@ -3348,6 +3358,7 @@ static struct scsi_host_template srp_template = {
 	.module			= THIS_MODULE,
 	.name			= "InfiniBand SRP initiator",
 	.proc_name		= DRV_NAME,
+	.target_alloc		= srp_target_alloc,
 	.slave_alloc		= srp_slave_alloc,
 	.slave_configure	= srp_slave_configure,
 	.info			= srp_target_info,
@@ -3515,6 +3526,7 @@ enum {
 	SRP_OPT_QUEUE_SIZE	= 1 << 14,
 	SRP_OPT_IP_SRC		= 1 << 15,
 	SRP_OPT_IP_DEST		= 1 << 16,
+	SRP_OPT_TARGET_CAN_QUEUE = 1 << 17,
 };
 
 static unsigned int srp_opt_mandatory[] = {
@@ -3536,6 +3548,7 @@ static const match_table_t srp_opt_tokens = {
 	{ SRP_OPT_SERVICE_ID,		"service_id=%s"		},
 	{ SRP_OPT_MAX_SECT,		"max_sect=%d"		},
 	{ SRP_OPT_MAX_CMD_PER_LUN,	"max_cmd_per_lun=%d"	},
+	{ SRP_OPT_TARGET_CAN_QUEUE,	"target_can_queue=%d"	},
 	{ SRP_OPT_IO_CLASS,		"io_class=%x"		},
 	{ SRP_OPT_INITIATOR_EXT,	"initiator_ext=%s"	},
 	{ SRP_OPT_CMD_SG_ENTRIES,	"cmd_sg_entries=%u"	},
@@ -3724,6 +3737,15 @@ static int srp_parse_options(struct net *net, const char *buf,
 			target->scsi_host->cmd_per_lun = token;
 			break;
 
+		case SRP_OPT_TARGET_CAN_QUEUE:
+			if (match_int(args, &token) || token < 1) {
+				pr_warn("bad max target_can_queue parameter '%s'\n",
+					p);
+				goto out;
+			}
+			target->target_can_queue = token;
+			break;
+
 		case SRP_OPT_IO_CLASS:
 			if (match_hex(args, &token)) {
 				pr_warn("bad IO class parameter '%s'\n", p);
diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h
index d66c9057d5ea..70334fa3de8e 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -216,6 +216,7 @@ struct srp_target_port {
 	char			target_name[32];
 	unsigned int		scsi_id;
 	unsigned int		sg_tablesize;
+	unsigned int		target_can_queue;
 	int			mr_pool_size;
 	int			mr_per_cmd;
 	int			queue_size;

Bart.
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On 1/18/18 3:35 PM, Laurence Oberman wrote:
> On Thu, 2018-01-18 at 22:24 +, Bart Van Assche wrote:
>> On Thu, 2018-01-18 at 17:18 -0500, Laurence Oberman wrote:
>>> OK, I ran 5 at once of 5 separate mount points.
>>> I am using 4k block sizes.
>>> It's solid and consistent for me. No stalls, no gaps.
>> 
>> Hi Laurence,
>> 
>> That's great news and thank you for having shared this information,
>> but I think it should be mentioned that you have been running my tree
>> in which some recent block and dm patches were reverted
>> (https://github.com/bvanassche/linux/tree/block-scsi-for-next)
>> 
>>> [global]
>>> name=02-mq
>>> filename=fio-output-02-mq.txt
>>> rw=randwrite
>>> verify=md5
>>> ;rwmixread=60
>>> ;rwmixwrite=40
>>> bs=4K
>>> ;direct=1
>>> ;numjobs=4
>>> ;time_based=1
>>> runtime=120
>>> 
>>> [file1]
>>> size=3G
>>> ioengine=libaio
>>> iodepth=16
>>> 
>>> I watch I/O and I see it ramp up but fio still runs and it kind of
>>> shuts down.
>> 
>> In test "file1" I see an I/O size of 3G. Does that mean that the patch
>> that should fix the sgl_alloc() issue is working?
>> 
>> Thanks,
>> 
>> Bart.
> 
> Hello Bart, thank you.
> 
> OK, so booting into Mike's tree now and then hopefully I get the
> lockups. Can you give me some idea of what to look for?
> I assume I/O just stops.
> 
> I want to get this happening in-house so we have an avenue to fix this.
> After getting the stall I will attend to the sgl patch test for the
> SRPT side.
> 
> Please note I am not running the latest SRPT right now, as you know.

When you do have a solid test case, please please submit a blktests
test case for it! This needs to be something we can run regularly in
testing.

-- 
Jens Axboe
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Thu, 2018-01-18 at 22:24 +, Bart Van Assche wrote:
> On Thu, 2018-01-18 at 17:18 -0500, Laurence Oberman wrote:
> > OK, I ran 5 at once of 5 separate mount points.
> > I am using 4k block sizes.
> > It's solid and consistent for me. No stalls, no gaps.
> 
> Hi Laurence,
> 
> That's great news and thank you for having shared this information,
> but I think it should be mentioned that you have been running my tree
> in which some recent block and dm patches were reverted
> (https://github.com/bvanassche/linux/tree/block-scsi-for-next)
> 
> > [global]
> > name=02-mq
> > filename=fio-output-02-mq.txt
> > rw=randwrite
> > verify=md5
> > ;rwmixread=60
> > ;rwmixwrite=40
> > bs=4K
> > ;direct=1
> > ;numjobs=4
> > ;time_based=1
> > runtime=120
> > 
> > [file1]
> > size=3G
> > ioengine=libaio
> > iodepth=16
> > 
> > I watch I/O and I see it ramp up but fio still runs and it kind of
> > shuts down.
> 
> In test "file1" I see an I/O size of 3G. Does that mean that the patch
> that should fix the sgl_alloc() issue is working?
> 
> Thanks,
> 
> Bart.

Hello Bart

Thank you.

OK, so booting into Mike's tree now and then hopefully I get the lockups.
Can you give me some idea of what to look for?
I assume I/O just stops.

I want to get this happening in-house so we have an avenue to fix this.
After getting the stall I will attend to the sgl patch test for the
SRPT side.

Please note I am not running the latest SRPT right now, as you know.

Regards
Back later with results
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Thu, 2018-01-18 at 17:18 -0500, Laurence Oberman wrote:
> OK, I ran 5 at once of 5 separate mount points.
> I am using 4k block sizes.
> It's solid and consistent for me. No stalls, no gaps.

Hi Laurence,

That's great news and thank you for having shared this information, but I
think it should be mentioned that you have been running my tree in which
some recent block and dm patches were reverted
(https://github.com/bvanassche/linux/tree/block-scsi-for-next)

> [global]
> name=02-mq
> filename=fio-output-02-mq.txt
> rw=randwrite
> verify=md5
> ;rwmixread=60
> ;rwmixwrite=40
> bs=4K
> ;direct=1
> ;numjobs=4
> ;time_based=1
> runtime=120
> 
> [file1]
> size=3G
> ioengine=libaio
> iodepth=16
> 
> I watch I/O and I see it ramp up but fio still runs and it kind of
> shuts down.

In test "file1" I see an I/O size of 3G. Does that mean that the patch that
should fix the sgl_alloc() issue is working?

Thanks,

Bart.
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote:
> And yet Laurence cannot reproduce any such lockups with your test...

Hmm ... maybe I misunderstood Laurence, but I don't think that Laurence has
already succeeded at running an unmodified version of my tests. In one of the
e-mails Laurence sent me this morning I read that he modified these scripts
to get past a kernel module unload failure that was reported while starting
these tests. So the next step is to check which changes were made to the test
scripts and also whether the test results are still valid.

> Are you absolutely certain this patch doesn't help you?
> https://patchwork.kernel.org/patch/10174037/
> 
> If it doesn't then that is actually very useful to know.

The first thing I tried this morning was to run the srp-test software against
a merge of Jens' for-next branch and your dm-4.16 branch. Since I noticed
that the dm queue locked up, I reinserted a blk_mq_delay_run_hw_queue() call
in the dm code. Since even that was not sufficient, I tried to kick the
queues via debugfs (for s in /sys/kernel/debug/block/*/state; do echo kick
>$s; done). Since that was not sufficient to resolve the queue stall, I
reverted the following three patches that are in Jens' tree:
* "blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback"
* "blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request"
* "blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy"

Only after I had done this did the srp-test software run again without
triggering dm queue lockups. Sorry, but I have not yet had the time to test
patch "[RFC] blk-mq: fixup RESTART when queue becomes idle".

> Please just focus on helping Laurence get his very capable testbed to
> reproduce this issue. Once we can reproduce these "unkillable" "stalls"
> in-house it'll be _much_ easier to analyze and fix.

OK, I will work with Laurence on this.
Maybe Laurence and I should work on this before analyzing the lockup
mentioned above any further?

Bart.
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at 4:39pm -0500,
> Bart Van Assche wrote:
> 
> > On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> > > On Thu, Jan 18 2018 at 3:58P -0500,
> > > Bart Van Assche wrote:
> > > 
> > > > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > > > For Bart's test the underlying scsi-mq driver is what is
> > > > > regularly hitting this case in __blk_mq_try_issue_directly():
> > > > > 
> > > > >         if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
> > > > 
> > > > These lockups were all triggered by incorrect handling of
> > > > .queue_rq() returning BLK_STS_RESOURCE.
> > > 
> > > Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
> > > "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?
> > 
> > In what I wrote I was referring to both dm_mq_queue_rq() and
> > scsi_queue_rq(). With "incorrect" I meant that queue lockups are
> > introduced that make user space processes unkillable. That's a
> > severe bug.
> 
> And yet Laurence cannot reproduce any such lockups with your test...
> 
> Are you absolutely certain this patch doesn't help you?
> https://patchwork.kernel.org/patch/10174037/
> 
> If it doesn't then that is actually very useful to know.
> 
> > > We have time to get this right, please stop hyperventilating about
> > > "regressions".
> > 
> > Sorry Mike but that's something I consider as an unfair comment. If
> > Ming and you work on patches together, it's your job to make sure
> > that no regressions are introduced. Instead of blaming me because I
> > report these regressions you should be grateful that I take the time
> > and effort to report these regressions early. And since you are
> > employed by a large organization that sells Linux support services,
> > your employer should invest in developing test cases that reach a
> > higher coverage of the dm, SCSI and block layer code.
> > I don't think that it's normal that my tests discovered several
> > issues that were not discovered by Red Hat's internal test suite.
> > That's something Red Hat has to address.
> 
> You have no self-awareness of just how myopic you're being about this.
> 
> I'm not ignoring or blaming you for your test no longer passing. Far
> from it. I very much want to fix this. But I want it fixed in a way
> that doesn't paper over the real bug(s) while also introducing blind
> queue runs that compromise the blk-mq RESTART code's ability to work
> as intended.
> 
> I'd have thought you could appreciate this. We need a root cause on
> this, not hand-waving justifications on why problematic delayed queue
> runs are correct.
> 
> Please just focus on helping Laurence get his very capable testbed to
> reproduce this issue. Once we can reproduce these "unkillable"
> "stalls" in-house it'll be _much_ easier to analyze and fix.
> 
> Thanks,
> Mike

Hello Bart

OK, I ran 5 at once of 5 separate mount points.
I am using 4k block sizes.
It's solid and consistent for me. No stalls, no gaps.

[global]
name=02-mq
filename=fio-output-02-mq.txt
rw=randwrite
verify=md5
;rwmixread=60
;rwmixwrite=40
bs=4K
;direct=1
;numjobs=4
;time_based=1
runtime=120

[file1]
size=3G
ioengine=libaio
iodepth=16

I watch I/O and I see it ramp up but fio still runs and it kind of
shuts down.
#!/bin/bash
for i in 1 2 3 4 5
do
	cd /data-$i
	fio /root/srp-test.lobe/fio_tests/02-mq 1>/data-$i.out 2>&1 &
done

#Time     cpu sys inter  ctxsw KBRead  Reads KBWrit Writes
17:16:09   34  12 34431  10393      0      0 249872  36094
17:16:10   31  10 13001   2679      0      0  57652   7980
17:16:11   32  11 19473   4416      0      0 143248  17362
17:16:12   31   9  8068   1606      0      0  20088   2026
17:16:13   31   9  7718   1518      0      0  15908   1354
17:16:14   36  14 41132   9710      0      0 686216  46661
17:16:15   39  18 63622  21033      0      0  1246K  75108
17:16:16   38  16 76114   9013      0      0  1146K  82338
17:16:17   33  11 31941   2735      0      0 237340  25034
17:16:18   36  14 23900   4982      0      1  1567K  43244
17:16:19   37  15 55884   4974      0      4  1003K  67253
17:16:20   28  12  7542   4975      0      0  1630K   3266
17:16:21    8   6 14650  34721      0      0  2535K  21206
17:16:22    2   2  6655  15839      0      0  2319K   9897
17:16:23   11  11 37305 134030      0      0  1275K  87010
17:16:24   23  22 59266 119350   6560   1640  1102K  78034
17:16:25   21  17 65699 144462 148052  37013 159900  22120
17:16:26   14   9 80700 164034 216588  54147      4      1

#         <---------------Disks--------------->
#Time     cpu sys inter  ctxsw KBRead  Reads KBWrit Writes
17:16:27   14   9 78699 162095 214412  53603      0      0
17:16:28   14   9 75895 155352 204932  51233      0      0
17:16:29   14   9 75377 161871 214136  53534
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Thu, Jan 18 2018 at 4:39pm -0500, Bart Van Assche wrote:

> On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> > On Thu, Jan 18 2018 at 3:58pm -0500, Bart Van Assche wrote:
> >
> > > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > > For Bart's test the underlying scsi-mq driver is what is regularly hitting this case in __blk_mq_try_issue_directly():
> > > >
> > > >     if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
> > >
> > > These lockups were all triggered by incorrect handling of .queue_rq() returning BLK_STS_RESOURCE.
> >
> > Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE? "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?
>
> In what I wrote I was referring to both dm_mq_queue_rq() and scsi_queue_rq(). With "incorrect" I meant that queue lockups are introduced that make user space processes unkillable. That's a severe bug.

And yet Laurence cannot reproduce any such lockups with your test...

Are you absolutely certain this patch doesn't help you?
https://patchwork.kernel.org/patch/10174037/

If it doesn't then that is actually very useful to know.

> > We have time to get this right, please stop hyperventilating about "regressions".
>
> Sorry Mike but that's something I consider as an unfair comment. If Ming and you work on patches together, it's your job to make sure that no regressions are introduced. Instead of blaming me because I report these regressions you should be grateful that I take the time and effort to report these regressions early. And since you are employed by a large organization that sells Linux support services, your employer should invest in developing test cases that reach a higher coverage of the dm, SCSI and block layer code. I don't think that it's normal that my tests discovered several issues that were not discovered by Red Hat's internal test suite. That's something Red Hat has to address.
You have no self-awareness of just how myopic you're being about this.

I'm not ignoring or blaming you for your test no longer passing. Far from it. I very much want to fix this. But I want it fixed in a way that doesn't paper over the real bug(s) while also introducing blind queue runs that compromise the blk-mq RESTART code's ability to work as intended.

I'd have thought you could appreciate this. We need a root cause on this, not hand-waving justifications on why problematic delayed queue runs are correct.

Please just focus on helping Laurence get his very capable testbed to reproduce this issue. Once we can reproduce these "unkillable" "stalls" in-house it'll be _much_ easier to analyze and fix.

Thanks,
Mike
Re: [dm-devel] [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at 3:58pm -0500, Bart Van Assche wrote:
> >
> > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > For Bart's test the underlying scsi-mq driver is what is regularly hitting this case in __blk_mq_try_issue_directly():
> > >
> > >     if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
> >
> > These lockups were all triggered by incorrect handling of .queue_rq() returning BLK_STS_RESOURCE.
>
> Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE? "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?

In what I wrote I was referring to both dm_mq_queue_rq() and scsi_queue_rq(). With "incorrect" I meant that queue lockups are introduced that make user space processes unkillable. That's a severe bug.

> Please try to do more work analyzing the test case that only you can easily run (due to srp_test being a PITA).

It is not correct that I'm the only one who is able to run that software. Anyone who is willing to merge the latest SRP initiator and target driver patches in his or her tree can run that software in any VM. I'm working hard on getting the patches upstream that make it possible to run the srp-test software on a setup that is not equipped with InfiniBand hardware.

> We have time to get this right, please stop hyperventilating about "regressions".

Sorry Mike but that's something I consider as an unfair comment. If Ming and you work on patches together, it's your job to make sure that no regressions are introduced. Instead of blaming me because I report these regressions you should be grateful that I take the time and effort to report these regressions early. And since you are employed by a large organization that sells Linux support services, your employer should invest in developing test cases that reach a higher coverage of the dm, SCSI and block layer code.
I don't think that it's normal that my tests discovered several issues that were not discovered by Red Hat's internal test suite. That's something Red Hat has to address. Bart.
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at 3:58pm -0500, Bart Van Assche wrote:
> >
> > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > For Bart's test the underlying scsi-mq driver is what is regularly hitting this case in __blk_mq_try_issue_directly():
> > >
> > >     if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
> >
> > Hello Mike,
> >
> > That code path is not the code path that triggered the lockups that I reported during the past days.
>
> If you're hitting blk_mq_sched_insert_request() then you most certainly are hitting that code path.
>
> If you aren't then what was your earlier email going on about?
> https://www.redhat.com/archives/dm-devel/2018-January/msg00372.html
>
> If you were just focusing on that as one possible reason, that isn't very helpful. By this point you really should _know_ what is triggering the stall based on the code paths taken. Please use ftrace's function_graph tracer if need be.
>
> > These lockups were all triggered by incorrect handling of .queue_rq() returning BLK_STS_RESOURCE.
>
> Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE? "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?
>
> Please try to do more work analyzing the test case that only you can easily run (due to srp_test being a PITA). And less time lobbying for a change that you don't understand to _really_ be correct.
>
> We have time to get this right, please stop hyperventilating about "regressions".
>
> Thanks,
> Mike

Hello Bart

I have run a good few loops of 02-mq and it's stable for me on your tree. I am not running the entire disconnect/re-connect loops and un-mounts etc. for good reason. I have 35 LUNs so it's very impactful to lose them and have them come back all the time.
Anyway I am very happy to try to reproduce this in-house so Mike and Ming can focus on it, but I need to know if all I need to do is loop over 02-mq over and over. Also please let me know what debugfs and sysfs entries to capture and I am happy to try to help move this along.

Regards
Laurence
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Thu, Jan 18 2018 at 3:58pm -0500, Bart Van Assche wrote:

> On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > For Bart's test the underlying scsi-mq driver is what is regularly hitting this case in __blk_mq_try_issue_directly():
> >
> >     if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
>
> Hello Mike,
>
> That code path is not the code path that triggered the lockups that I reported during the past days.

If you're hitting blk_mq_sched_insert_request() then you most certainly are hitting that code path.

If you aren't then what was your earlier email going on about?
https://www.redhat.com/archives/dm-devel/2018-January/msg00372.html

If you were just focusing on that as one possible reason, that isn't very helpful. By this point you really should _know_ what is triggering the stall based on the code paths taken. Please use ftrace's function_graph tracer if need be.

> These lockups were all triggered by incorrect handling of .queue_rq() returning BLK_STS_RESOURCE.

Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE? "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?

Please try to do more work analyzing the test case that only you can easily run (due to srp_test being a PITA). And less time lobbying for a change that you don't understand to _really_ be correct.

We have time to get this right, please stop hyperventilating about "regressions".

Thanks,
Mike
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote: > For Bart's test the underlying scsi-mq driver is what is regularly > hitting this case in __blk_mq_try_issue_directly(): > > if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) Hello Mike, That code path is not the code path that triggered the lockups that I reported during the past days. These lockups were all triggered by incorrect handling of .queue_rq() returning BLK_STS_RESOURCE. Bart.
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Thu, Jan 18 2018 at 3:11pm -0500, Jens Axboe wrote:

> On 1/18/18 11:47 AM, Bart Van Assche wrote:
> >> This is all very tiresome.
> >
> > Yes, this is tiresome. It is very annoying to me that others keep introducing so many regressions in such important parts of the kernel. It is also annoying to me that I get blamed if I report a regression instead of seeing that the regression gets fixed.
>
> I agree, it sucks that any change there introduces the regression. I'm fine with doing the delay insert again until a new patch is proven to be better.
>
> From the original topic of this email, we have conditions that can cause the driver to not be able to submit an IO. A set of those conditions can only happen if IO is in flight, and those cases we have covered just fine. Another set can potentially trigger without IO being in flight. These are cases where a non-device resource is unavailable at the time of submission. This might be iommu running out of space, for instance, or it might be a memory allocation of some sort. For these cases, we don't get any notification when the shortage clears. All we can do is ensure that we restart operations at some point in the future. We're SOL at that point, but we have to ensure that we make forward progress.
>
> That last set of conditions better not be a common occurrence, since performance is down the toilet at that point. I don't want to introduce hot path code to rectify it. Have the driver return if that happens in a way that is DIFFERENT from needing a normal restart. The driver knows if this is a resource that will become available when IO completes on this device or not. If we get that return, we have a generic run-again delay.
>
> This basically becomes the same as doing the delay queue thing from DM, but just in a generic fashion.
This is a bit confusing for me (as I see it we have 2 blk-mq drivers trying to collaborate, so your referring to "driver" lacks precision; but I could just be missing something)...

For Bart's test the underlying scsi-mq driver is what is regularly hitting this case in __blk_mq_try_issue_directly():

	if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))

It certainly better not be the norm (Bart's test hammering on this aside). For starters, it'd be very useful to know if Bart is hitting the blk_mq_hctx_stopped() or blk_queue_quiesced() for this case that is triggering the use of blk_mq_sched_insert_request() -- I'd wager it is due to blk_queue_quiesced() but Bart _please_ try to figure it out.

Anyway, in response to this case Bart would like the upper layer dm-mq driver to blk_mq_delay_run_hw_queue(). Certainly is quite the hammer.

But that hammer aside, in general for this case, I'm concerned about: is it really correct to allow an already stopped/quiesced underlying queue to retain responsibility for processing the request? Or would the upper-layer dm-mq benefit from being able to retry the request on its terms (via a "DIFFERENT" return from blk-mq core)?

Like this? The (ab)use of BLK_STS_DM_REQUEUE certainly seems fitting in this case but...

(Bart please note that this patch applies on linux-dm.git's 'for-next'; which is just a merge of Jens' 4.16 tree and dm-4.16)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 74a4f237ba91..371a1b97bf56 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1781,16 +1781,11 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	struct request_queue *q = rq->q;
 	bool run_queue = true;
 
-	/*
-	 * RCU or SRCU read lock is needed before checking quiesced flag.
-	 *
-	 * When queue is stopped or quiesced, ignore 'bypass_insert' from
-	 * blk_mq_request_direct_issue(), and return BLK_STS_OK to caller,
-	 * and avoid driver to try to dispatch again.
-	 */
+	/* RCU or SRCU read lock is needed before checking quiesced flag */
 	if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
 		run_queue = false;
-		bypass_insert = false;
+		if (bypass_insert)
+			return BLK_STS_DM_REQUEUE;
 		goto insert;
 	}
 
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index d8519ddd7e1a..2f554ea485c3 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -408,7 +408,7 @@ static blk_status_t dm_dispatch_clone_request(struct request *clone, struct requ
 	clone->start_time = jiffies;
 	r = blk_insert_cloned_request(clone->q, clone);
-	if (r != BLK_STS_OK && r != BLK_STS_RESOURCE)
+	if (r != BLK_STS_OK && r != BLK_STS_RESOURCE && r != BLK_STS_DM_REQUEUE)
 		/* must complete clone in terms of original request */
 		dm_complete_request(rq, r);
 	return r;
@@ -472,6 +472,7 @@ static void init_tio(struct dm_rq_target_io *tio, struct request *rq,
 	 * Returns:
 	 * DM_MAPIO_* : the request has
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On 1/18/18 11:47 AM, Bart Van Assche wrote:
>> This is all very tiresome.
>
> Yes, this is tiresome. It is very annoying to me that others keep introducing so many regressions in such important parts of the kernel. It is also annoying to me that I get blamed if I report a regression instead of seeing that the regression gets fixed.

I agree, it sucks that any change there introduces the regression. I'm fine with doing the delay insert again until a new patch is proven to be better.

From the original topic of this email, we have conditions that can cause the driver to not be able to submit an IO. A set of those conditions can only happen if IO is in flight, and those cases we have covered just fine. Another set can potentially trigger without IO being in flight. These are cases where a non-device resource is unavailable at the time of submission. This might be iommu running out of space, for instance, or it might be a memory allocation of some sort. For these cases, we don't get any notification when the shortage clears. All we can do is ensure that we restart operations at some point in the future. We're SOL at that point, but we have to ensure that we make forward progress.

That last set of conditions better not be a common occurrence, since performance is down the toilet at that point. I don't want to introduce hot path code to rectify it. Have the driver return if that happens in a way that is DIFFERENT from needing a normal restart. The driver knows if this is a resource that will become available when IO completes on this device or not. If we get that return, we have a generic run-again delay.

This basically becomes the same as doing the delay queue thing from DM, but just in a generic fashion.

--
Jens Axboe
Re: [PATCH v3 0/2] block: enforce ioctl(BLKROSET) and set_disk_ro()
On 1/11/18 6:09 AM, Ilya Dryomov wrote: > Hello, > > I was doing some cleanup work on rbd BLKROSET handler and discovered > that we ignore partition rw/ro setting (hd_struct->policy) for pretty > much everything but straight writes. > > David (CCed) has blktests patches standing by. > > (Another aspect of this is that we don't enforce open(2) mode. Tejun > took a stab at this a few years ago, but his patch had to be reverted: > > 75f1dc0d076d ("block: check bdev_read_only() from blkdev_get()") > e51900f7d38c ("block: revert block_dev read-only check") > > It is a separate issue and refusing writes to read-only devices is > obviously more important, but perhaps it's time to revisit that as > well?) Applied for 4.16, thanks. -- Jens Axboe
Re: [RFC] blk-throttle: export io_serviced_recursive, io_service_bytes_recursive
On 12/11/17 7:56 AM, weiping zhang wrote:
> export these two interfaces for cgroup-v1.

Applied, thanks.

--
Jens Axboe
Re: [PATCH v2 0/3] Avoid that blk_{un,}register_queue() trigger lock inversion
On 1/17/18 12:48 PM, Bart Van Assche wrote: > Hello Jens, > > The three patches in this series are what I came up with after having > analyzed a lockdep complaint triggered by blk_unregister_queue(). Please > consider these patches for kernel v4.16. Thanks Bart, applied. -- Jens Axboe
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Thu, 2018-01-18 at 13:30 -0500, Mike Snitzer wrote: > 1%!? Where are you getting that number? Ming has detailed more > significant performance gains than 1%.. and not just on lpfc (though you > keep seizing on lpfc because of the low queue_depth of 3). That's what I derived from the numbers you posted for null_blk. If Ming has posted performance results for other drivers than lpfc, please let me know where I can find these. I have not yet seen these numbers. > This is all very tiresome. Yes, this is tiresome. It is very annoying to me that others keep introducing so many regressions in such important parts of the kernel. It is also annoying to me that I get blamed if I report a regression instead of seeing that the regression gets fixed. Bart.
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Thu, Jan 18 2018 at 12:20pm -0500, Bart Van Assche wrote:

> On Thu, 2018-01-18 at 12:03 -0500, Mike Snitzer wrote:
> > On Thu, Jan 18 2018 at 11:50am -0500, Bart Van Assche wrote:
> > > My comments about the above are as follows:
> > > - It can take up to q->rq_timeout jiffies after a .queue_rq() implementation returned BLK_STS_RESOURCE before blk_mq_timeout_work() gets called. However, it can happen that only a few milliseconds after .queue_rq() returned BLK_STS_RESOURCE that the condition that caused it to return BLK_STS_RESOURCE gets cleared. So the above approach can result in long delays during which it will seem like the queue got stuck. Additionally, I think that the block driver should decide how long it takes before a queue is rerun and not the block layer core.
> >
> > So configure q->rq_timeout to be shorter? Which is configurable through blk_mq_tag_set's 'timeout' member. It apparently defaults to 30 * HZ.
> >
> > That is the problem with timeouts, there is generally no one size fits all.
>
> Sorry but I think that would be wrong. The delay after which a queue is rerun should not be coupled to the request timeout. These two should be independent.

That's fair. Not saying I think that is a fix anyway.

> > > - The lockup that I reported only occurs with the dm driver but not any other block driver. So why to modify the block layer core since this can be fixed by modifying the dm driver?
> >
> > Hard to know it is only DM's blk-mq that is impacted. That is the only blk-mq driver that you're testing like this (that is also able to handle faults, etc).
>
> That's not correct. I'm also testing the SCSI core, which is one of the most complicated block drivers.

OK, but SCSI mq is part of the problem here. It is a snowflake that has more exotic reasons for returning BLK_STS_RESOURCE.
> > > - A much simpler fix and a fix that is known to work exists, namely inserting a blk_mq_delay_run_hw_queue() call in the dm driver.
> >
> > Because your "much simpler" fix actively hurts performance, as is detailed in this header:
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=ec3eaf9a673106f66606896aed6ddd20180b02ec
>
> We are close to the start of the merge window so I think it's better to fall back to an old approach that is known to work than to keep a new approach that is known not to work. Additionally, the performance issue you referred to only affects IOPS and bandwidth more than 1% with the lpfc driver and that is because the queue depth it supports is much lower than for other SCSI HBAs, namely 3 instead of 64.

1%!? Where are you getting that number? Ming has detailed more significant performance gains than 1%.. and not just on lpfc (though you keep seizing on lpfc because of the low queue_depth of 3).

This is all very tiresome. I'm _really_ not interested in this debate any more. The specific case that causes the stall needs to be identified and a real fix needs to be developed. Ming is doing a lot of that hard work. Please contribute or at least stop pleading for your hack to be reintroduced.

If at the end of the 4.16 release we still don't have a handle on the stall you're seeing I'll revisit this and likely revert to blindly kicking the queue after an arbitrary delay. But I'm willing to let this issue get more time without papering over it.

Mike
[GIT PULL] Two NVMe fixes for 4.15 final
Hi Linus, Two important fixes for the sgl support for nvme that is new in this series. Please pull. git://git.kernel.dk/linux-block.git for-linus Christoph Hellwig (1): nvme-pci: take sglist coalescing in dma_map_sg into account Keith Busch (1): nvme-pci: check segement valid for SGL use drivers/nvme/host/pci.c | 28 ++-- 1 file changed, 14 insertions(+), 14 deletions(-) -- Jens Axboe
Re: [PATCH v5] Return bytes transferred for partial direct I/O
On Fri, Jan 05, 2018 at 06:15:55AM -0600, Goldwyn Rodrigues wrote: > > > On 01/04/2018 08:10 PM, Darrick J. Wong wrote: > > On Wed, Nov 22, 2017 at 06:29:01AM -0600, Goldwyn Rodrigues wrote: > >> From: Goldwyn Rodrigues> >> > >> In case direct I/O encounters an error midway, it returns the error. > >> Instead it should be returning the number of bytes transferred so far. > >> > >> Test case for filesystems (with ENOSPC): > >> 1. Create an almost full filesystem > >> 2. Create a file, say /mnt/lastfile, until the filesystem is full. > >> 3. Direct write() with count > sizeof /mnt/lastfile. > >> > >> Result: write() returns -ENOSPC. However, file content has data written > >> in step 3. > >> > >> Changes since v1: > >> - incorporated iomap and block devices > >> > >> Changes since v2: > >> - realized that file size was not increasing when performing a (partial) > >>direct I/O because end_io function was receiving the error instead of > >>size. Fixed. > >> > >> Changes since v3: > >> - [hch] initialize transferred with dio->size and use transferred instead > >>of dio->size. > >> > >> Changes since v4: > >> - Refreshed to v4.14 > >> > >> Signed-off-by: Goldwyn Rodrigues > >> Reviewed-by: Christoph Hellwig > > > > Hmmm, I shoved this into 4.15-rc6 and saw this horrible splat running > > generic/472 (that is the test that goes with this patch, right?) on XFS: > > Yes, I will add it to the patch header. 
> > > > > [ 2545.534331] XFS: Assertion failed: ret < 0 || ret == count, file: > > /storage/home/djwong/cdev/work/linux-xfs/fs/xfs/xfs_file.c, line: 598 > > [ 2545.538177] WARNING: CPU: 2 PID: 16340 at > > /storage/home/djwong/cdev/work/linux-xfs/fs/xfs/xfs_message.c:116 > > assfail+0x58/0x90 [xfs] > > [ 2545.539748] Modules linked in: xfs libcrc32c dax_pmem device_dax nd_pmem > > sch_fq_codel af_packet [last unloaded: xfs] > > [ 2545.541191] CPU: 2 PID: 16340 Comm: xfs_io Not tainted 4.15.0-rc6-xfsx #5 > > [ 2545.542119] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS > > 1.10.2-1ubuntu1djwong0 04/01/2014 > > [ 2545.543414] RIP: 0010:assfail+0x58/0x90 [xfs] > > [ 2545.544032] RSP: 0018:88002b56f990 EFLAGS: 00010246 > > [ 2545.544760] RAX: RBX: RCX: > > > > [ 2545.545721] RDX: 0004 RSI: 000a RDI: > > a0a1eac4 > > [ 2545.546681] RBP: 88002b56fdd0 R08: 88002b56f6e0 R09: > > > > [ 2545.547646] R10: 11000fe14f6e R11: R12: > > 1100056adf3a > > [ 2545.548613] R13: 0040 R14: 88002b56fc38 R15: > > > > [ 2545.549579] FS: 7fc1f601e700() GS:88007f00() > > knlGS: > > [ 2545.550666] CS: 0010 DS: ES: CR0: 80050033 > > [ 2545.551460] CR2: 7fc1f4bac000 CR3: 75165001 CR4: > > 001606e0 > > [ 2545.552423] Call Trace: > > [ 2545.552850] xfs_file_dio_aio_write+0x6e3/0xe40 [xfs] > > [ 2545.553581] ? kvm_clock_read+0x1f/0x30 > > [ 2545.554191] ? xfs_file_dax_write+0x6a0/0x6a0 [xfs] > > [ 2545.554885] ? kvm_clock_read+0x1f/0x30 > > [ 2545.555439] ? kvm_sched_clock_read+0x5/0x10 > > [ 2545.556046] ? sched_clock+0x5/0x10 > > [ 2545.556553] ? sched_clock_cpu+0x18/0x1e0 > > [ 2545.557136] ? __lock_acquire+0xbbf/0x40f0 > > [ 2545.557719] ? kvm_clock_read+0x1f/0x30 > > [ 2545.558272] ? sched_clock+0x5/0x10 > > [ 2545.558848] xfs_file_write_iter+0x34a/0xb50 [xfs] > > [ 2545.559544] do_iter_readv_writev+0x44c/0x700 > > [ 2545.560170] ? _copy_from_user+0x96/0xd0 > > [ 2545.560729] ? vfs_dedupe_file_range+0x820/0x820 > > [ 2545.561398] do_iter_write+0x12c/0x550 > > [ 2545.561939] ? 
rcu_lockdep_current_cpu_online+0xd7/0x120 > > [ 2545.562682] ? rcu_read_lock_sched_held+0xa3/0x120 > > [ 2545.563366] vfs_writev+0x175/0x2d0 > > [ 2545.563874] ? vfs_iter_write+0xc0/0xc0 > > [ 2545.564434] ? get_lock_stats+0x16/0x90 > > [ 2545.564992] ? lock_downgrade+0x580/0x580 > > [ 2545.565583] ? __fget+0x1e7/0x350 > > [ 2545.566077] ? entry_SYSCALL_64_fastpath+0x5/0x96 > > [ 2545.566744] ? do_pwritev+0x125/0x160 > > [ 2545.567277] do_pwritev+0x125/0x160 > > [ 2545.567787] entry_SYSCALL_64_fastpath+0x1f/0x96 > > [ 2545.568437] RIP: 0033:0x7fc1f56d6110 > > [ 2545.568948] RSP: 002b:7ffc07f11340 EFLAGS: 0293 ORIG_RAX: > > 0128 > > [ 2545.569971] RAX: ffda RBX: 01ef9170 RCX: > > 7fc1f56d6110 > > [ 2545.570932] RDX: 0001 RSI: 01ef9170 RDI: > > 0003 > > [ 2545.571897] RBP: R08: R09: > > 0040 > > [ 2545.572857] R10: R11: 0293 R12: > > > > [ 2545.573818] R13: R14: 0040 R15: > > 01ef9170 > > [ 2545.574802] Code: df 48 89 fa 48 c1 ea 03 0f b6 04 02 48 89 fa 83 e2 07 > > 38 d0 7f 04 84 c0 75 15 0f b6 1d
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Thu, 2018-01-18 at 12:03 -0500, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at 11:50am -0500, Bart Van Assche wrote:
> > My comments about the above are as follows:
> > - It can take up to q->rq_timeout jiffies after a .queue_rq() implementation returned BLK_STS_RESOURCE before blk_mq_timeout_work() gets called. However, it can happen that only a few milliseconds after .queue_rq() returned BLK_STS_RESOURCE that the condition that caused it to return BLK_STS_RESOURCE gets cleared. So the above approach can result in long delays during which it will seem like the queue got stuck. Additionally, I think that the block driver should decide how long it takes before a queue is rerun and not the block layer core.
>
> So configure q->rq_timeout to be shorter? Which is configurable through blk_mq_tag_set's 'timeout' member. It apparently defaults to 30 * HZ.
>
> That is the problem with timeouts, there is generally no one size fits all.

Sorry but I think that would be wrong. The delay after which a queue is rerun should not be coupled to the request timeout. These two should be independent.

> > - The lockup that I reported only occurs with the dm driver but not any other block driver. So why to modify the block layer core since this can be fixed by modifying the dm driver?
>
> Hard to know it is only DM's blk-mq that is impacted. That is the only blk-mq driver that you're testing like this (that is also able to handle faults, etc).

That's not correct. I'm also testing the SCSI core, which is one of the most complicated block drivers.

> > - A much simpler fix and a fix that is known to work exists, namely inserting a blk_mq_delay_run_hw_queue() call in the dm driver.
> Because your "much simpler" fix actively hurts performance, as is detailed in this header:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=ec3eaf9a673106f66606896aed6ddd20180b02ec

We are close to the start of the merge window so I think it's better to fall back to an old approach that is known to work than to keep a new approach that is known not to work. Additionally, the performance issue you referred to only affects IOPS and bandwidth more than 1% with the lpfc driver and that is because the queue depth it supports is much lower than for other SCSI HBAs, namely 3 instead of 64.

Thanks,

Bart.
Re: dm rq: Avoid that request processing stalls sporadically
On Thu, 2018-01-18 at 11:50 -0500, Mike Snitzer wrote: > The issue you say it was originally intended to fix _should_ be > addressed with this change: > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=4dd6edd23e7ea971efddc303f9e67eb79e95808e Hello Mike, Sorry but I'm not convinced that that patch is sufficient. That patch helps if .end_io() is called with status BLK_STS_RESOURCE and also if blk_insert_cloned_request() returns the .queue_rq() return value. It does not help if .queue_rq() returns BLK_STS_RESOURCE and that return value gets ignored. I think that can happen as follows: - Request cloning in multipath_clone_and_map() succeeds and that function returns DM_MAPIO_REMAPPED. - dm_dispatch_clone_request() calls blk_insert_cloned_request(). - blk_insert_cloned_request() calls blk_mq_request_direct_issue(), which results in a call of __blk_mq_try_issue_directly(). - __blk_mq_try_issue_directly() calls blk_mq_sched_insert_request(). In this case the BLK_STS_RESOURCE returned by the .queue_rq() implementation of the underlying path will be ignored. Please note that I have not tried to find all paths that ignore the .queue_rq() return value. Thanks, Bart.
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Thu, Jan 18 2018 at 11:50am -0500, Bart Van Assche wrote:

> On 01/17/18 18:41, Ming Lei wrote:
> > BLK_STS_RESOURCE can be returned from driver when any resource is running out of. And the resource may not be related with tags, such as kmalloc(GFP_ATOMIC), when queue is idle under this kind of BLK_STS_RESOURCE, restart can't work any more, then IO hang may be caused.
> >
> > Most of drivers may call kmalloc(GFP_ATOMIC) in IO path, and almost all returns BLK_STS_RESOURCE under this situation. But for dm-mpath, it may be triggered a bit easier since the request pool of underlying queue may be consumed up much easier. But in reality, it is still not easy to trigger it. I run all kinds of test on dm-mpath/scsi-debug with all kinds of scsi_debug parameters, can't trigger this issue at all. But finally it is triggered in Bart's SRP test, which seems made by genius, :-)
> >
> > [ ... ]
> >
> >  static void blk_mq_timeout_work(struct work_struct *work)
> >  {
> >  	struct request_queue *q =
> > @@ -966,8 +1045,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
> >  	 */
> >  	queue_for_each_hw_ctx(q, hctx, i) {
> >  		/* the hctx may be unmapped, so check it here */
> > -		if (blk_mq_hw_queue_mapped(hctx))
> > +		if (blk_mq_hw_queue_mapped(hctx)) {
> >  			blk_mq_tag_idle(hctx);
> > +			blk_mq_fixup_restart(hctx);
> > +		}
> >  	}
> >  }
> >  blk_queue_exit(q);
>
> Hello Ming,
>
> My comments about the above are as follows:
> - It can take up to q->rq_timeout jiffies after a .queue_rq() implementation returned BLK_STS_RESOURCE before blk_mq_timeout_work() gets called. However, it can happen that only a few milliseconds after .queue_rq() returned BLK_STS_RESOURCE that the condition that caused it to return BLK_STS_RESOURCE gets cleared. So the above approach can result in long delays during which it will seem like the queue got stuck. Additionally, I think that the block driver should decide how long it takes before a queue is rerun and not the block layer core.

So configure q->rq_timeout to be shorter? Which is configurable through blk_mq_tag_set's 'timeout' member. It apparently defaults to 30 * HZ.

That is the problem with timeouts, there is generally no one size fits all.

> - The lockup that I reported only occurs with the dm driver but not any other block driver. So why to modify the block layer core since this can be fixed by modifying the dm driver?

Hard to know it is only DM's blk-mq that is impacted. That is the only blk-mq driver that you're testing like this (that is also able to handle faults, etc).

> - A much simpler fix and a fix that is known to work exists, namely inserting a blk_mq_delay_run_hw_queue() call in the dm driver.

Because your "much simpler" fix actively hurts performance, as is detailed in this header:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=ec3eaf9a673106f66606896aed6ddd20180b02ec

I'm not going to take your bandaid fix given it very much seems to be papering over a real blk-mq issue.

Mike
Re: dm rq: Avoid that request processing stalls sporadically
On Thu, Jan 18 2018 at 11:37am -0500, Bart Van Assche wrote:

> If the .queue_rq() implementation of a block driver returns
> BLK_STS_RESOURCE then that block driver is responsible for
> rerunning the queue once the condition that caused it to return
> BLK_STS_RESOURCE has been cleared. The dm-mpath driver tells the
> dm core to requeue a request if e.g. not enough memory is
> available for cloning a request or if the underlying path is
> busy. Since the dm-mpath driver does not receive any kind of
> notification if the condition that caused it to return "requeue"
> is cleared, the only solution to avoid that dm-mpath request
> processing stalls is to call blk_mq_delay_run_hw_queue(). Hence
> this patch.
>
> Fixes: ec3eaf9a6731 ("dm mpath: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE")
> Signed-off-by: Bart Van Assche
> Cc: Ming Lei
> ---
>  drivers/md/dm-rq.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index f16096af879a..c59c59cfd2a5 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		/* Undo dm_start_request() before requeuing */
>  		rq_end_stats(md, rq);
>  		rq_completed(md, rq_data_dir(rq), false);
> +		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
>  		return BLK_STS_RESOURCE;
>  	}
> --
> 2.15.1

Sorry, but we need to understand why you still need this.

The issue you say it was originally intended to fix _should_ be
addressed with this change:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=4dd6edd23e7ea971efddc303f9e67eb79e95808e

So if you still feel you need to blindly kick the queue then it is very
likely a bug in blk-mq (either its RESTART heuristics or whatever
internal implementation is lacking).

Did you try Ming's RFC patch to "fixup RESTART" before resorting to the
above again?  See: https://patchwork.kernel.org/patch/10172315/

Thanks,
Mike
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On 01/17/18 18:41, Ming Lei wrote:
> BLK_STS_RESOURCE can be returned from driver when any resource
> is running out of. And the resource may not be related with tags,
> such as kmalloc(GFP_ATOMIC), when queue is idle under this kind of
> BLK_STS_RESOURCE, restart can't work any more, then IO hang may
> be caused.
>
> Most of drivers may call kmalloc(GFP_ATOMIC) in IO path, and almost
> all returns BLK_STS_RESOURCE under this situation. But for dm-mpath,
> it may be triggered a bit easier since the request pool of underlying
> queue may be consumed up much easier. But in reality, it is still not
> easy to trigger it. I run all kinds of test on dm-mpath/scsi-debug
> with all kinds of scsi_debug parameters, can't trigger this issue
> at all. But finally it is triggered in Bart's SRP test, which seems
> made by genius, :-)
>
> [ ... ]
>
>  static void blk_mq_timeout_work(struct work_struct *work)
>  {
>  	struct request_queue *q =
>
> @@ -966,8 +1045,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
>  	 */
>  	queue_for_each_hw_ctx(q, hctx, i) {
>  		/* the hctx may be unmapped, so check it here */
> -		if (blk_mq_hw_queue_mapped(hctx))
> +		if (blk_mq_hw_queue_mapped(hctx)) {
>  			blk_mq_tag_idle(hctx);
> +			blk_mq_fixup_restart(hctx);
> +		}
>  	}
>  }
>  blk_queue_exit(q);

Hello Ming,

My comments about the above are as follows:

- It can take up to q->rq_timeout jiffies after a .queue_rq()
  implementation returned BLK_STS_RESOURCE before blk_mq_timeout_work()
  gets called. However, it can happen that only a few milliseconds after
  .queue_rq() returned BLK_STS_RESOURCE the condition that caused it to
  return BLK_STS_RESOURCE gets cleared. So the above approach can result
  in long delays during which it will seem like the queue got stuck.
  Additionally, I think that the block driver should decide how long it
  takes before a queue is rerun and not the block layer core.
- The lockup that I reported only occurs with the dm driver but not with
  any other block driver. So why modify the block layer core when this
  can be fixed by modifying the dm driver?
- A much simpler fix, and a fix that is known to work, exists, namely
  inserting a blk_mq_delay_run_hw_queue() call in the dm driver.

Bart.
[PATCH] dm rq: Avoid that request processing stalls sporadically
If the .queue_rq() implementation of a block driver returns
BLK_STS_RESOURCE then that block driver is responsible for
rerunning the queue once the condition that caused it to return
BLK_STS_RESOURCE has been cleared. The dm-mpath driver tells the
dm core to requeue a request if e.g. not enough memory is
available for cloning a request or if the underlying path is
busy. Since the dm-mpath driver does not receive any kind of
notification if the condition that caused it to return "requeue"
is cleared, the only solution to avoid that dm-mpath request
processing stalls is to call blk_mq_delay_run_hw_queue(). Hence
this patch.

Fixes: ec3eaf9a6731 ("dm mpath: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE")
Signed-off-by: Bart Van Assche
Cc: Ming Lei
---
 drivers/md/dm-rq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index f16096af879a..c59c59cfd2a5 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 		/* Undo dm_start_request() before requeuing */
 		rq_end_stats(md, rq);
 		rq_completed(md, rq_data_dir(rq), false);
+		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
 		return BLK_STS_RESOURCE;
 	}
-- 
2.15.1
Re: [PATCH v2 3/3] block: Protect less code with sysfs_lock in blk_{un,}register_queue()
Looks good, Reviewed-by: Christoph Hellwig
Re: [PATCH v2 2/3] block: Document scheduler modification locking requirements
Looks fine, Reviewed-by: Christoph Hellwig
Re: [PATCH v2 1/3] block: Unexport elv_register_queue() and elv_unregister_queue()
Looks fine, Reviewed-by: Christoph Hellwig
Re: [RFC] blk-throttle: export io_serviced_recursive, io_service_bytes_recursive
2017-12-11 23:17 GMT+08:00 Tejun Heo:
> On Mon, Dec 11, 2017 at 10:56:25PM +0800, weiping zhang wrote:
>> export these two interfaces for cgroup-v1.
>>
>> Signed-off-by: weiping zhang
>
> Acked-by: Tejun Heo
>

Hi Jens,

Have you got time to check this patch?

Thanks

> Thanks.
>
> --
> tejun
Re: [PATCH BUGFIX/IMPROVEMENT 0/2] block, bfq: two pending patches
On 1/13/18 4:05 AM, Paolo Valente wrote:
> Hi Jens,
> here are again the two pending patches you asked me to resend [1]. One
> of them, fixing read-starvation problems, was accompanied by a cover
> letter. I'm pasting the content of that cover letter below.
>
> The patch addresses (serious) starvation problems caused by
> request-tag exhaustion, as explained in more detail in the commit
> message. I started from the solution in the function
> kyber_limit_depth, but then I had to define more articulate limits, to
> counter starvation also in cases not covered in kyber_limit_depth.
> If this solution proves to be effective, I'm willing to port it
> somehow to the other schedulers.

It's something we've been doing in the old request layer for tagging for
a long time (more than a decade), so a generic (and fast) solution that
covers all cases for blk-mq-tag would indeed be great.

For now, I have applied these for 4.16, thanks.

-- 
Jens Axboe
Re: [PATCH v3 0/2] block: enforce ioctl(BLKROSET) and set_disk_ro()
On Thu, Jan 11, 2018 at 2:09 PM, Ilya Dryomov wrote:
> Hello,
>
> I was doing some cleanup work on rbd BLKROSET handler and discovered
> that we ignore partition rw/ro setting (hd_struct->policy) for pretty
> much everything but straight writes.
>
> David (CCed) has blktests patches standing by.
>
> (Another aspect of this is that we don't enforce open(2) mode. Tejun
> took a stab at this a few years ago, but his patch had to be reverted:
>
>   75f1dc0d076d ("block: check bdev_read_only() from blkdev_get()")
>   e51900f7d38c ("block: revert block_dev read-only check")
>
> It is a separate issue and refusing writes to read-only devices is
> obviously more important, but perhaps it's time to revisit that as
> well?)
>
> v2 -> v3:
> - lookup part only once; combine read-only check with existing
>   should_fail_request check
>
> v1 -> v2:
> - added unlikely() per Sagi's suggestion
>
> Thanks,
>
>                 Ilya
>
> Ilya Dryomov (2):
>   block: fail op_is_write() requests to read-only partitions
>   block: add bdev_read_only() checks to common helpers
>
>  block/blk-core.c | 56 ++--
>  block/blk-lib.c  | 12 
>  2 files changed, 50 insertions(+), 18 deletions(-)

Jens, could you please pick these up for 4.16?

Thanks,

                Ilya
Re: [PATCH BUGFIX/IMPROVEMENT 0/2] block, bfq: two pending patches
> On 13 Jan 2018, at 12:05, Paolo Valente wrote:
>
> Hi Jens,
> here are again the two pending patches you asked me to resend [1]. One
> of them, fixing read-starvation problems, was accompanied by a cover
> letter. I'm pasting the content of that cover letter below.
>
> The patch addresses (serious) starvation problems caused by
> request-tag exhaustion, as explained in more detail in the commit
> message. I started from the solution in the function
> kyber_limit_depth, but then I had to define more articulate limits, to
> counter starvation also in cases not covered in kyber_limit_depth.
> If this solution proves to be effective, I'm willing to port it
> somehow to the other schedulers.

Hi Jens,
have you had time to check these patches? Sorry for pushing, but I
guess 4.16 is getting closer, and these patches are performance
critical; especially the first, which solves a starvation problem.

Thanks,
Paolo

> Thanks,
> Paolo
>
> [1] https://www.spinics.net/lists/linux-block/msg21586.html
>
> Paolo Valente (2):
>   block, bfq: limit tags for writes and async I/O
>   block, bfq: limit sectors served with interactive weight raising
>
>  block/bfq-iosched.c | 158 +---
>  block/bfq-iosched.h |  17 ++
>  block/bfq-wf2q.c    |   3 +
>  3 files changed, 169 insertions(+), 9 deletions(-)
>
> --
> 2.15.1