[PATCH] SCSI: run queue if SCSI device queue isn't ready and queue is idle
Before commit 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for blk-mq"), we ran the queue after 3ms if the queue was idle and the SCSI device queue wasn't ready, which was done in the handling of BLK_STS_RESOURCE. After commit 0df21c86bdbf was introduced, the queue won't be run any more in this situation. An IO hang is observed when a timeout happens, and this patch fixes the IO hang issue by running the queue after a delay in scsi_dev_queue_ready(), just like non-mq. This issue can be triggered by the following script [1].

There is another issue which can be covered by running the idle queue: when .get_budget() is called on a request coming from hctx->dispatch_list, if one request just completes during .get_budget(), we can't depend on SCSI's restart to make progress any more. This patch fixes that race too.

With this patch, we basically recover the previous behaviour (before commit 0df21c86bdbf) of handling an idle queue when running out of resources.

[1] script for testing/verifying the SCSI timeout:

	rmmod scsi_debug
	modprobe scsi_debug max_queue=1
	DEVICE=`ls -d /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/* | head -1 | xargs basename`
	DISK_DIR=`ls -d /sys/block/$DEVICE/device/scsi_disk/*`
	echo "using scsi device $DEVICE"
	echo "-1" >/sys/bus/pseudo/drivers/scsi_debug/every_nth
	echo "temporary write through" >$DISK_DIR/cache_type
	echo "128" >/sys/bus/pseudo/drivers/scsi_debug/opts
	echo none > /sys/block/$DEVICE/queue/scheduler
	dd if=/dev/$DEVICE of=/dev/null bs=1M iflag=direct count=1 &
	sleep 5
	echo "0" >/sys/bus/pseudo/drivers/scsi_debug/opts
	wait
	echo "SUCCESS"

Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for blk-mq")
Signed-off-by: Ming Lei
---
 drivers/scsi/scsi_lib.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index db9556662e27..1816dd8259b3 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1967,6 +1967,8 @@ static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
 out_put_device:
 	put_device(&sdev->sdev_gendev);
 out:
+	if (atomic_read(&sdev->device_busy) == 0 && !scsi_device_blocked(sdev))
+		blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
 	return false;
 }
-- 
2.9.5
Re: [PATCH v2 0/4] lockdep/crossrelease: Apply crossrelease to page locks
On Tue, Dec 05, 2017 at 03:19:46PM +0900, Byungchul Park wrote:
> On 12/5/2017 2:46 PM, Byungchul Park wrote:
> > On 12/5/2017 2:30 PM, Matthew Wilcox wrote:
> > > On Mon, Dec 04, 2017 at 02:16:19PM +0900, Byungchul Park wrote:
> > > > For now, wait_for_completion() / complete() works with lockdep, add
> > > > lock_page() / unlock_page() and its family to lockdep support.
> > > >
> > > > Changes from v1
> > > > - Move lockdep_map_cross outside of page_ext to make it flexible
> > > > - Prevent allocating lockdep_map per page by default
> > > > - Add a boot parameter allowing the allocation for debugging
> > > >
> > > > Byungchul Park (4):
> > > >   lockdep: Apply crossrelease to PG_locked locks
> > > >   lockdep: Apply lock_acquire(release) on __Set(__Clear)PageLocked
> > > >   lockdep: Move data of CONFIG_LOCKDEP_PAGELOCK from page to page_ext
> > > >   lockdep: Add a boot parameter enabling to track page locks using
> > > >     lockdep and disable it by default
> > >
> > > I don't like the way you've structured this patch series; first adding
> > > the lockdep map to struct page, then moving it to page_ext.
> >
> > Hello,
> >
> > I will make them into one patch.
>
> I've thought it more.
>
> Actually I did it because I thought I'd better make it into two
> patches since it makes reviewers easier to review. It doesn't matter
> which one I choose, but I prefer to split it.

I don't know whether it's better to make it all one patch or split it
into multiple patches. But it makes no sense to introduce it in struct
page, then move it to struct page_ext.
Re: [PATCH v2] bcache: segregate flash only volume write streams from cached devices
Tang Junhui--

Hi

On 12/03/2017 10:11 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui
>
> Hello Mike & Coly
>
> Could you please have a reveiw for this patch?
>
>> From: Tang Junhui
>>
>> In such scenario that there are some flash only volumes
>> , and some cached devices, when many tasks request these devices in
>> writeback mode, the write IOs may fall to the same bucket as bellow:
>> | cached data | flash data | cached data | cached data | flash data |
>> then after writeback of these cached devices, the bucket would
>> be like bellow bucket:
>> | free | flash data | free | free | flash data |
>>
>> So, there are many free space in this bucket, but since data of flash
>> only volumes still exists, so this bucket cannot be reclaimable,
>> which would cause waste of bucket space.
>>
>> In this patch, we segregate flash only volume write streams from
>> cached devices, so data from flash only volumes and cached devices
>> can store in different buckets.
>>
>> Compare to v1 patch, this patch do not add a additionally open bucket
>> list, and it is try best to segregate flash only volume write streams
>> from cached devices, sectors of flash only volumes may still be mixed
>> with dirty sectors of cached device, but the number is very small.
>>
>> Signed-off-by: Tang Junhui

LGTM, and I have added it to my staging tree, though it required cleanpatch
fixes (executable bit, line endings).

Thanks,
Mike
Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()
On Tue, Dec 05, 2017 at 01:16:24PM +0800, Ming Lei wrote:
> On Mon, Dec 04, 2017 at 11:48:07PM +, Holger Hoffstätte wrote:
> > On Tue, 05 Dec 2017 06:45:08 +0800, Ming Lei wrote:
> >
> > > On Mon, Dec 04, 2017 at 03:09:20PM +, Bart Van Assche wrote:
> > >> On Sun, 2017-12-03 at 00:31 +0800, Ming Lei wrote:
> > >> > Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for
> > >> > blk-mq")
> > >>
> > >> It might be safer to revert commit 0df21c86bdbf instead of trying to fix all
> > >> issues introduced by that commit for kernel version v4.15 ...
> > >
> > > What are all issues in v4.15-rc? Up to now, it is the only issue reported,
> > > and can be fixed by this simple patch, which one can be thought as cleanup
> > > too.
> >
> > Even with this patch I've encountered at least one hang that
> > seemed related. I'm using most of block/scsi-4.15 on top of 4.14 and
> > the hang in question was on a rotating disk. It could be solved by activating
> > a different scheduler on the hanging device; all hanging sync/df processes got
> > unstuck and all was fine again, which leads me to believe that there is at least
> > one more rare condition where delaying requests (as done in the budget patch)
> > leads to a hang.
> >
> > This happened with mq-deadline which I was testing specifically to avoid
> > any BFQ-related side effects.
>
> OK, this looks a new report.
>
> Without any log, we can't make any progress, and even we can't guess
> what the issue is related with.
>
> Could you post your dmesg log (include the hang process stack trace)? And
> dump the debugfs log by the following script when this hang happens?
>
> http://people.redhat.com/minlei/tests/tools/dump-blk-info
>
> BTW, you just need to pass the disk name to the script, such as: /dev/sda.

Thinking about the issue further, this patch only covers the case of
scsi_set_blocked(), but doesn't consider the case in which .get_budget()
is called inside blk_mq_dispatch_rq_list() for a request coming from
hctx->dispatch_list.

If .get_budget() is called in blk_mq_do_dispatch_sched() or
blk_mq_do_dispatch_ctx(), we don't need to run the queue if the queue is
idle. But if it is called from blk_mq_dispatch_rq_list() for a request
coming from hctx->dispatch_list, we have to run the queue if the queue
is idle, as before.

So please ignore this patch; I will submit a V2 to cover both cases.

Thanks,
Ming
Re: [PATCH v2 0/4] lockdep/crossrelease: Apply crossrelease to page locks
On 12/5/2017 2:46 PM, Byungchul Park wrote:
> On 12/5/2017 2:30 PM, Matthew Wilcox wrote:
> > On Mon, Dec 04, 2017 at 02:16:19PM +0900, Byungchul Park wrote:
> > > For now, wait_for_completion() / complete() works with lockdep, add
> > > lock_page() / unlock_page() and its family to lockdep support.
> > >
> > > Changes from v1
> > > - Move lockdep_map_cross outside of page_ext to make it flexible
> > > - Prevent allocating lockdep_map per page by default
> > > - Add a boot parameter allowing the allocation for debugging
> > >
> > > Byungchul Park (4):
> > >   lockdep: Apply crossrelease to PG_locked locks
> > >   lockdep: Apply lock_acquire(release) on __Set(__Clear)PageLocked
> > >   lockdep: Move data of CONFIG_LOCKDEP_PAGELOCK from page to page_ext
> > >   lockdep: Add a boot parameter enabling to track page locks using
> > >     lockdep and disable it by default
> >
> > I don't like the way you've structured this patch series; first adding
> > the lockdep map to struct page, then moving it to page_ext.
>
> Hello,
>
> I will make them into one patch.

I've thought about it more.

Actually I did it because I thought I'd better make it into two
patches, since that makes it easier for reviewers to review. It doesn't
matter which one I choose, but I prefer to split it.

But, if you are strongly against it, then I will follow you.

-- 
Thanks,
Byungchul
Re: [PATCH v2 0/4] lockdep/crossrelease: Apply crossrelease to page locks
On 12/5/2017 2:30 PM, Matthew Wilcox wrote:
> On Mon, Dec 04, 2017 at 02:16:19PM +0900, Byungchul Park wrote:
> > For now, wait_for_completion() / complete() works with lockdep, add
> > lock_page() / unlock_page() and its family to lockdep support.
> >
> > Changes from v1
> > - Move lockdep_map_cross outside of page_ext to make it flexible
> > - Prevent allocating lockdep_map per page by default
> > - Add a boot parameter allowing the allocation for debugging
> >
> > Byungchul Park (4):
> >   lockdep: Apply crossrelease to PG_locked locks
> >   lockdep: Apply lock_acquire(release) on __Set(__Clear)PageLocked
> >   lockdep: Move data of CONFIG_LOCKDEP_PAGELOCK from page to page_ext
> >   lockdep: Add a boot parameter enabling to track page locks using
> >     lockdep and disable it by default
>
> I don't like the way you've structured this patch series; first adding
> the lockdep map to struct page, then moving it to page_ext.

Hello,

I will make them into one patch.

> I also don't like it that you've made CONFIG_LOCKDEP_PAGELOCK not
> individually selectable. I might well want a kernel with crosslock
> support, but only for completions.

OK then, I will make it individually selectable. I want to know others'
opinions as well.

Thank you for the opinions. I will apply yours in the next spin.

-- 
Thanks,
Byungchul
Re: [PATCH v2 0/4] lockdep/crossrelease: Apply crossrelease to page locks
On Mon, Dec 04, 2017 at 02:16:19PM +0900, Byungchul Park wrote:
> For now, wait_for_completion() / complete() works with lockdep, add
> lock_page() / unlock_page() and its family to lockdep support.
>
> Changes from v1
> - Move lockdep_map_cross outside of page_ext to make it flexible
> - Prevent allocating lockdep_map per page by default
> - Add a boot parameter allowing the allocation for debugging
>
> Byungchul Park (4):
>   lockdep: Apply crossrelease to PG_locked locks
>   lockdep: Apply lock_acquire(release) on __Set(__Clear)PageLocked
>   lockdep: Move data of CONFIG_LOCKDEP_PAGELOCK from page to page_ext
>   lockdep: Add a boot parameter enabling to track page locks using
>     lockdep and disable it by default

I don't like the way you've structured this patch series; first adding
the lockdep map to struct page, then moving it to page_ext.

I also don't like it that you've made CONFIG_LOCKDEP_PAGELOCK not
individually selectable. I might well want a kernel with crosslock
support, but only for completions.
Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()
On Mon, Dec 04, 2017 at 11:48:07PM +, Holger Hoffstätte wrote:
> On Tue, 05 Dec 2017 06:45:08 +0800, Ming Lei wrote:
>
> > On Mon, Dec 04, 2017 at 03:09:20PM +, Bart Van Assche wrote:
> >> On Sun, 2017-12-03 at 00:31 +0800, Ming Lei wrote:
> >> > Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for
> >> > blk-mq")
> >>
> >> It might be safer to revert commit 0df21c86bdbf instead of trying to fix all
> >> issues introduced by that commit for kernel version v4.15 ...
> >
> > What are all issues in v4.15-rc? Up to now, it is the only issue reported,
> > and can be fixed by this simple patch, which one can be thought as cleanup
> > too.
>
> Even with this patch I've encountered at least one hang that
> seemed related. I'm using most of block/scsi-4.15 on top of 4.14 and
> the hang in question was on a rotating disk. It could be solved by activating
> a different scheduler on the hanging device; all hanging sync/df processes got
> unstuck and all was fine again, which leads me to believe that there is at least
> one more rare condition where delaying requests (as done in the budget patch)
> leads to a hang.
>
> This happened with mq-deadline which I was testing specifically to avoid
> any BFQ-related side effects.

OK, this looks like a new report.

Without any log we can't make any progress, and we can't even guess
what the issue is related to.

Could you post your dmesg log (including the hang process stack trace)? And
dump the debugfs log with the following script when this hang happens?

http://people.redhat.com/minlei/tests/tools/dump-blk-info

BTW, you just need to pass the disk name to the script, such as: /dev/sda.

-- 
Ming
Re: [PATCH 1/2] scsi-mq: Only show the CDB if available
On Tue, Dec 05, 2017 at 01:59:51AM +, Bart Van Assche wrote:
> On Tue, 2017-12-05 at 09:15 +0800, Ming Lei wrote:
> > On Mon, Dec 04, 2017 at 04:38:08PM -0800, Bart Van Assche wrote:
> > > Since the next patch will make it possible that scsi_show_rq() gets
> > > called before the CDB pointer is changed into a non-NULL value,
> > > only show the CDB if the CDB pointer is not NULL. Additionally,
> > > show the request timeout and SCSI command flags. This patch also
> > > fixes a bug that was reported by Ming Lei. See also Ming Lei,
> > > scsi_debugfs: fix crash in scsi_show_rq(), linux-scsi, 7 November
> > > 2017 (https://marc.info/?l=linux-block&m=151006655317188).
> >
> > Please cook a patch for fixing the crash issue only, since we need
> > to backport the fix to stable kernel.
>
> The code that is touched by this patch is only used for kernel debugging.
> I will do this if others agree with your opinion.

No, do not mix two different things in one patch, especially when the fix
part needs to be backported to stable.

The fix part should aim at v4.15, and the other part can be v4.16 material.

-- 
Ming
Re: [PATCH 1/2] scsi-mq: Only show the CDB if available
On Mon, Dec 04, 2017 at 10:42:28PM -0500, Martin K. Petersen wrote:
>
> Hi Ming,
>
> > Please cook a patch for fixing the crash issue only, since we need
> > to backport the fix to stable kernel.
>
> I thought you were going to submit a V5 that addressed James' concerns?
>
> -- 
> Martin K. Petersen	Oracle Linux Engineering

Hi Martin,

I replied to James's concerns in the following link:

https://marc.info/?l=linux-block&m=151074751321108&w=2

The fact is that the use-after-free can't be avoided at all, no matter
whether we set the cmnd to NULL before calling free. That means we have
to handle the use-after-free well in scsi_show_rq(), so we don't need to
touch the free code.

So V4 is good enough for merge, IMO.

Thanks,
Ming
Re: [PATCH 1/2] scsi-mq: Only show the CDB if available
On Tue, 2017-12-05 at 09:15 +0800, Ming Lei wrote:
> On Mon, Dec 04, 2017 at 04:38:08PM -0800, Bart Van Assche wrote:
> > Since the next patch will make it possible that scsi_show_rq() gets
> > called before the CDB pointer is changed into a non-NULL value,
> > only show the CDB if the CDB pointer is not NULL. Additionally,
> > show the request timeout and SCSI command flags. This patch also
> > fixes a bug that was reported by Ming Lei. See also Ming Lei,
> > scsi_debugfs: fix crash in scsi_show_rq(), linux-scsi, 7 November
> > 2017 (https://marc.info/?l=linux-block&m=151006655317188).
>
> Please cook a patch for fixing the crash issue only, since we need
> to backport the fix to stable kernel.

The code that is touched by this patch is only used for kernel debugging.
I will do this if others agree with your opinion.

Bart.
Re: [PATCH] blk-mq: Fix several SCSI request queue lockups
On Tue, Dec 05, 2017 at 01:13:43AM +, Bart Van Assche wrote:
> On Tue, 2017-12-05 at 09:04 +0800, Ming Lei wrote:
> > Then no reason to revert commit(0df21c86bdbf scsi: implement .get_budget an
> > .put_budget for blk-mq) for one issue which may never happen in reality since
> > this reproducer need out-of-tree patch.
>
> Sorry but I disagree completely. You seem to overlook that there may be other
> circumstances that trigger the same lockup, e.g. a SCSI queue full condition.

If scsi_dev_queue_ready() returns false, .get_budget() catches that and
never adds the request to hctx->dispatch. And scsi_host_queue_ready()
always returns true, since we respect the per-host queue depth via
blk_mq_get_driver_tag() before calling .queue_rq().

If I am missing other cases, please point them out.

-- 
Ming
Re: [PATCH] blk-mq: Fix several SCSI request queue lockups
On Tue, 2017-12-05 at 09:04 +0800, Ming Lei wrote:
> Then no reason to revert commit(0df21c86bdbf scsi: implement .get_budget an
> .put_budget for blk-mq) for one issue which may never happen in reality since
> this reproducer need out-of-tree patch.

Sorry but I disagree completely. You seem to overlook that there may be other
circumstances that trigger the same lockup, e.g. a SCSI queue full condition.

Bart.
Re: [PATCH 1/2] scsi-mq: Only show the CDB if available
On Mon, Dec 04, 2017 at 04:38:08PM -0800, Bart Van Assche wrote:
> Since the next patch will make it possible that scsi_show_rq() gets
> called before the CDB pointer is changed into a non-NULL value,
> only show the CDB if the CDB pointer is not NULL. Additionally,
> show the request timeout and SCSI command flags. This patch also
> fixes a bug that was reported by Ming Lei. See also Ming Lei,
> scsi_debugfs: fix crash in scsi_show_rq(), linux-scsi, 7 November
> 2017 (https://marc.info/?l=linux-block&m=151006655317188).

Please cook a patch for fixing the crash issue only, since we need
to backport the fix to stable kernel.

>
> Signed-off-by: Bart Van Assche
> Cc: James E.J. Bottomley
> Cc: Martin K. Petersen
> Cc: Ming Lei
> Cc: Christoph Hellwig
> Cc: Hannes Reinecke
> Cc: Johannes Thumshirn

Please Cc:

> ---
>  drivers/scsi/scsi_debugfs.c | 47 -
>  1 file changed, 42 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/scsi/scsi_debugfs.c b/drivers/scsi/scsi_debugfs.c
> index 01f08c03f2c1..37ed6bb8e6ec 100644
> --- a/drivers/scsi/scsi_debugfs.c
> +++ b/drivers/scsi/scsi_debugfs.c
> @@ -4,13 +4,50 @@
>  #include
>  #include "scsi_debugfs.h"
>
> +#define SCSI_CMD_FLAG_NAME(name) [ilog2(SCMD_##name)] = #name
> +static const char *const scsi_cmd_flags[] = {
> +	SCSI_CMD_FLAG_NAME(TAGGED),
> +	SCSI_CMD_FLAG_NAME(UNCHECKED_ISA_DMA),
> +	SCSI_CMD_FLAG_NAME(ZONE_WRITE_LOCK),
> +	SCSI_CMD_FLAG_NAME(INITIALIZED),
> +};
> +#undef SCSI_CMD_FLAG_NAME
> +
> +static int scsi_flags_show(struct seq_file *m, const unsigned long flags,
> +			   const char *const *flag_name, int flag_name_count)
> +{
> +	bool sep = false;
> +	int i;
> +
> +	for (i = 0; i < sizeof(flags) * BITS_PER_BYTE; i++) {
> +		if (!(flags & BIT(i)))
> +			continue;
> +		if (sep)
> +			seq_puts(m, "|");
> +		sep = true;
> +		if (i < flag_name_count && flag_name[i])
> +			seq_puts(m, flag_name[i]);
> +		else
> +			seq_printf(m, "%d", i);
> +	}
> +	return 0;
> +}
> +
>  void scsi_show_rq(struct seq_file *m, struct request *rq)
>  {
>  	struct scsi_cmnd *cmd = container_of(scsi_req(rq), typeof(*cmd), req);
> -	int msecs = jiffies_to_msecs(jiffies - cmd->jiffies_at_alloc);
> -	char buf[80];
> +	int alloc_ms = jiffies_to_msecs(jiffies - cmd->jiffies_at_alloc);
> +	int timeout_ms = jiffies_to_msecs(rq->timeout);
> +	const u8 *const cdb = READ_ONCE(cmd->cmnd);
> +	char buf[80] = "(?)";
>
> -	__scsi_format_command(buf, sizeof(buf), cmd->cmnd, cmd->cmd_len);
> -	seq_printf(m, ", .cmd=%s, .retries=%d, allocated %d.%03d s ago", buf,
> -		   cmd->retries, msecs / 1000, msecs % 1000);
> +	if (cdb)
> +		__scsi_format_command(buf, sizeof(buf), cdb, cmd->cmd_len);
> +	seq_printf(m, ", .cmd=%s, .retries=%d, .result = %#x, .flags=", buf,
> +		   cmd->retries, cmd->result);
> +	scsi_flags_show(m, cmd->flags, scsi_cmd_flags,
> +			ARRAY_SIZE(scsi_cmd_flags));
> +	seq_printf(m, ", .timeout=%d.%03d, allocated %d.%03d s ago",
> +		   timeout_ms / 1000, timeout_ms % 1000,
> +		   alloc_ms / 1000, alloc_ms % 1000);
>  }
> -- 
> 2.15.0
>

-- 
Ming
Re: [PATCH] blk-mq: Fix several SCSI request queue lockups
On Tue, Dec 05, 2017 at 12:29:59AM +, Bart Van Assche wrote:
> On Tue, 2017-12-05 at 08:20 +0800, Ming Lei wrote:
> > Also it is a bit odd to see request in hctx->dispatch now, and it can only
> > happen now when scsi_target_queue_ready() returns false, so I guess you apply
> > some change on target->can_queue(such as setting it as 1 in srp/ib code
> > manually)?
>
> Yes, but that had already been mentioned. From the e-mail at the start of
> this e-mail thread: "Change the SRP initiator such that SCSI target queue
> depth is limited to 1." The changes I made in the SRP initiator are the same
> as those described in the following message from about one month ago:
> https://www.spinics.net/lists/linux-scsi/msg114720.html.

OK, got it.

Then there is no reason to revert commit 0df21c86bdbf ("scsi: implement
.get_budget and .put_budget for blk-mq") for one issue which may never
happen in reality, since this reproducer needs an out-of-tree patch.

I don't mean it isn't an issue, but I don't think it has top priority for
reverting commit 0df21c86bdbf. Especially, no proof has been shown that
0df21c86bdbf causes this issue, since this commit doesn't change how the
queue is run for requests in hctx->dispatch_list.

I'd like to take a look if someone would like to cooperate, such as by
providing a kernel log, testing a debug patch, and that kind of thing.
Or when I get this hardware to reproduce it.

-- 
Ming
[PATCH 0/2] Show commands stuck in a timeout handler in debugfs
Hello Jens,

While debugging an issue with the SCSI error handler I noticed that
commands that got stuck in that error handler are not shown in debugfs.
That is very annoying for anyone who relies on the information in
debugfs for root-causing such an issue. Hence this patch series that
makes sure that commands that got stuck in a block driver timeout
handler are also shown in debugfs. Please consider these patches for
kernel v4.16.

Thanks,

Bart.

Bart Van Assche (2):
  scsi-mq: Only show the CDB if available
  blk-mq-debugfs: Also show requests that have not yet been started

 block/blk-mq-debugfs.c      |  2 +-
 drivers/scsi/scsi_debugfs.c | 47 -
 2 files changed, 43 insertions(+), 6 deletions(-)

-- 
2.15.0
[PATCH 1/2] scsi-mq: Only show the CDB if available
Since the next patch will make it possible that scsi_show_rq() gets
called before the CDB pointer is changed into a non-NULL value,
only show the CDB if the CDB pointer is not NULL. Additionally,
show the request timeout and SCSI command flags. This patch also
fixes a bug that was reported by Ming Lei. See also Ming Lei,
scsi_debugfs: fix crash in scsi_show_rq(), linux-scsi, 7 November
2017 (https://marc.info/?l=linux-block&m=151006655317188).

Signed-off-by: Bart Van Assche
Cc: James E.J. Bottomley
Cc: Martin K. Petersen
Cc: Ming Lei
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 drivers/scsi/scsi_debugfs.c | 47 -
 1 file changed, 42 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/scsi_debugfs.c b/drivers/scsi/scsi_debugfs.c
index 01f08c03f2c1..37ed6bb8e6ec 100644
--- a/drivers/scsi/scsi_debugfs.c
+++ b/drivers/scsi/scsi_debugfs.c
@@ -4,13 +4,50 @@
 #include
 #include "scsi_debugfs.h"

+#define SCSI_CMD_FLAG_NAME(name) [ilog2(SCMD_##name)] = #name
+static const char *const scsi_cmd_flags[] = {
+	SCSI_CMD_FLAG_NAME(TAGGED),
+	SCSI_CMD_FLAG_NAME(UNCHECKED_ISA_DMA),
+	SCSI_CMD_FLAG_NAME(ZONE_WRITE_LOCK),
+	SCSI_CMD_FLAG_NAME(INITIALIZED),
+};
+#undef SCSI_CMD_FLAG_NAME
+
+static int scsi_flags_show(struct seq_file *m, const unsigned long flags,
+			   const char *const *flag_name, int flag_name_count)
+{
+	bool sep = false;
+	int i;
+
+	for (i = 0; i < sizeof(flags) * BITS_PER_BYTE; i++) {
+		if (!(flags & BIT(i)))
+			continue;
+		if (sep)
+			seq_puts(m, "|");
+		sep = true;
+		if (i < flag_name_count && flag_name[i])
+			seq_puts(m, flag_name[i]);
+		else
+			seq_printf(m, "%d", i);
+	}
+	return 0;
+}
+
 void scsi_show_rq(struct seq_file *m, struct request *rq)
 {
 	struct scsi_cmnd *cmd = container_of(scsi_req(rq), typeof(*cmd), req);
-	int msecs = jiffies_to_msecs(jiffies - cmd->jiffies_at_alloc);
-	char buf[80];
+	int alloc_ms = jiffies_to_msecs(jiffies - cmd->jiffies_at_alloc);
+	int timeout_ms = jiffies_to_msecs(rq->timeout);
+	const u8 *const cdb = READ_ONCE(cmd->cmnd);
+	char buf[80] = "(?)";

-	__scsi_format_command(buf, sizeof(buf), cmd->cmnd, cmd->cmd_len);
-	seq_printf(m, ", .cmd=%s, .retries=%d, allocated %d.%03d s ago", buf,
-		   cmd->retries, msecs / 1000, msecs % 1000);
+	if (cdb)
+		__scsi_format_command(buf, sizeof(buf), cdb, cmd->cmd_len);
+	seq_printf(m, ", .cmd=%s, .retries=%d, .result = %#x, .flags=", buf,
+		   cmd->retries, cmd->result);
+	scsi_flags_show(m, cmd->flags, scsi_cmd_flags,
+			ARRAY_SIZE(scsi_cmd_flags));
+	seq_printf(m, ", .timeout=%d.%03d, allocated %d.%03d s ago",
+		   timeout_ms / 1000, timeout_ms % 1000,
+		   alloc_ms / 1000, alloc_ms % 1000);
 }
-- 
2.15.0
[PATCH 2/2] blk-mq-debugfs: Also show requests that have not yet been started
When debugging e.g. the SCSI timeout handler it is important that
requests that have not yet been started or that already have completed
are also reported through debugfs.

Signed-off-by: Bart Van Assche
Cc: Ming Lei
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
Cc: Martin K. Petersen
---
 block/blk-mq-debugfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index f7db73f1698e..886b37163f17 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -409,7 +409,7 @@ static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
 	const struct show_busy_params *params = data;

 	if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
-	    test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
+	    list_empty(&rq->queuelist))
 		__blk_mq_debugfs_rq_show(params->m,
					 list_entry_rq(&rq->queuelist));
 }
-- 
2.15.0
Re: [PATCH] blk-mq: Fix several SCSI request queue lockups
On Tue, 2017-12-05 at 08:20 +0800, Ming Lei wrote:
> Also it is a bit odd to see request in hctx->dispatch now, and it can only
> happen now when scsi_target_queue_ready() returns false, so I guess you apply
> some change on target->can_queue(such as setting it as 1 in srp/ib code
> manually)?

Yes, but that had already been mentioned. From the e-mail at the start of
this e-mail thread: "Change the SRP initiator such that SCSI target queue
depth is limited to 1." The changes I made in the SRP initiator are the same
as those described in the following message from about one month ago:
https://www.spinics.net/lists/linux-scsi/msg114720.html.

Bart.
Re: [PATCH] blk-mq: Fix several SCSI request queue lockups
On Mon, Dec 04, 2017 at 11:32:27PM +, Bart Van Assche wrote:
> On Tue, 2017-12-05 at 07:01 +0800, Ming Lei wrote:
> > On Mon, Dec 04, 2017 at 10:48:18PM +, Bart Van Assche wrote:
> > > On Tue, 2017-12-05 at 06:42 +0800, Ming Lei wrote:
> > > > On Mon, Dec 04, 2017 at 09:30:32AM -0800, Bart Van Assche wrote:
> > > > > * A systematic lockup for SCSI queues with queue depth 1. The
> > > > >   following test reproduces that bug systematically:
> > > > >   - Change the SRP initiator such that SCSI target queue depth is
> > > > >     limited to 1.
> > > > >   - Run the following command:
> > > > >     srp-test/run_tests -f xfs -d -e none -r 60 -t 01
> > > > >   See also "[PATCH 4/7] blk-mq: Avoid that request processing
> > > > >   stalls when sharing tags"
> > > > >   (https://marc.info/?l=linux-block&m=151208695316857). Note:
> > > > >   reverting commit 0df21c86bdbf also fixes a sporadic SCSI request
> > > > >   queue lockup while inserting a blk_mq_sched_mark_restart_hctx()
> > > > >   before all blk_mq_dispatch_rq_list() calls only fixes the
> > > > >   systematic lockup for queue depth 1.
> > > >
> > > > You are the only reproducer [ ... ]
> > >
> > > That's not correct. I'm pretty sure if you try to reproduce this that
> > > you will see the same hang I ran into. Does this mean that you have not
> > > yet tried to reproduce the hang I reported?
> >
> > Do you mean every kernel developer has to own one SRP/IB hardware?
>
> When I have the time I will make it possible to run this test on any system
> equipped with at least one Ethernet port. But the fact that the test I
> mentioned requires IB hardware should not prevent you from running this test
> since you have run this test software before.
>
> > I don't have your hardware to reproduce that,
>
> That's not true. Your employer definitely owns IB hardware. E.g. the
> following message shows that you have run the srp-test yourself on IB hardware
> only four weeks ago:
>
> https://www.spinics.net/lists/linux-block/msg19511.html

The hardware belongs to Laurence; at that time I could borrow it from him,
and now I am not sure whether it is available.

> > Otherwise, there should have be such similar reports from others, not from
> > only you.
>
> That's not correct either. How long was it ago that kernel v4.15-rc1 was
> released? One week? How many SRP users do you think have tried to trigger a
> queue full condition with that kernel version?

OK, we can wait for further reporters to provide a kernel log if you don't
want to.

> > More importantly I don't understand why you can't share the kernel
> > log/debugfs log when IO hang happens?
> >
> > Without any kernel log, how can we confirm that it is a valid report?
>
> It's really unfortunate that you are focussing on denying that the bug I
> reported exists instead of trying to fix the bugs introduced by commit

If you look at bug reports on kernel mailing lists, you will see that most
reports include some kind of log; attaching logs is common practice when
reporting an issue. It can save us much time talking in mails.

> b347689ffbca. BTW, I have more than enough experience to decide myself what
> a valid report is and what not. I can easily send you several MB of kernel

As I mentioned, only the dmesg with the hang trace plus the debugfs log
should be enough, and neither can be that big, right?

> logs. The reason I have not yet done this is because I'm 99.9% sure that
> these won't help to root cause the reported issue. But I can tell you what

That is your opinion. Most of the time I can find some clue from debugfs
about a hang issue; then I can add tracing in just a few likely places to
narrow the issue down.

> I learned from analyzing the information under /sys/kernel/debug/block:
> every time a hang occurred I noticed that no requests were "busy", that two
> requests occurred in rq_lists and one request occurred in a hctx dispatch

Then what do the other attributes show, such as the queue/hctx state? The
following script can collect all of this info easily:

http://people.redhat.com/minlei/tests/tools/dump-blk-info

Also it is a bit odd to see a request in hctx->dispatch now; it can only
happen when scsi_target_queue_ready() returns false, so I guess you applied
some change to target->can_queue (such as setting it to 1 in the srp/ib
code manually)?

Please reply; if yes, I will try to see whether I can reproduce it with
this kind of change on scsi_debug.

> list. This is enough to conclude that a queue run was missing. And I think

In this case it seems unrelated to both commit b347689ffbca and
0df21c86bdbf, since neither changes RESTART for hctx->dispatch, and
neither should affect running the queue.

> that the patch at the start of this e-mail thread not only shows that the
> root cause is in the block layer but also that this bug was introduced by
> commit b347689ffbca.
>
> > > > You said that your patch fixes 'commit b347689ffbca ("blk-mq-sched:
> > > improve dispatching from sw queue")', but you don't mention any i
Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()
On Tue, 05 Dec 2017 06:45:08 +0800, Ming Lei wrote:
> On Mon, Dec 04, 2017 at 03:09:20PM +, Bart Van Assche wrote:
>> On Sun, 2017-12-03 at 00:31 +0800, Ming Lei wrote:
>> > Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for
>> > blk-mq")
>>
>> It might be safer to revert commit 0df21c86bdbf instead of trying to fix
>> all issues introduced by that commit for kernel version v4.15 ...
>
> What are all issues in v4.15-rc? Up to now, it is the only issue reported,
> and can be fixed by this simple patch, which one can be thought as cleanup
> too.

Even with this patch I've encountered at least one hang that seemed related.
I'm using most of block/scsi-4.15 on top of 4.14, and the hang in question
was on a rotating disk. It could be solved by activating a different
scheduler on the hanging device; all hanging sync/df processes got unstuck
and all was fine again, which leads me to believe that there is at least one
more rare condition where delaying requests (as done in the budget patch)
leads to a hang. This happened with mq-deadline, which I was testing
specifically to avoid any BFQ-related side effects.

I didn't do anything specific to trigger the hang and have not been able to
reproduce it during regular usage.

-h
Re: [PATCH] blk-mq: Fix several SCSI request queue lockups
On Tue, 2017-12-05 at 07:01 +0800, Ming Lei wrote:
> On Mon, Dec 04, 2017 at 10:48:18PM +, Bart Van Assche wrote:
> > On Tue, 2017-12-05 at 06:42 +0800, Ming Lei wrote:
> > > On Mon, Dec 04, 2017 at 09:30:32AM -0800, Bart Van Assche wrote:
> > > > * A systematic lockup for SCSI queues with queue depth 1. The
> > > >   following test reproduces that bug systematically:
> > > >   - Change the SRP initiator such that the SCSI target queue depth
> > > >     is limited to 1.
> > > >   - Run the following command:
> > > >     srp-test/run_tests -f xfs -d -e none -r 60 -t 01
> > > >   See also "[PATCH 4/7] blk-mq: Avoid that request processing
> > > >   stalls when sharing tags"
> > > >   (https://marc.info/?l=linux-block&m=151208695316857). Note:
> > > >   reverting commit 0df21c86bdbf also fixes a sporadic SCSI request
> > > >   queue lockup, while inserting a blk_mq_sched_mark_restart_hctx()
> > > >   before all blk_mq_dispatch_rq_list() calls only fixes the
> > > >   systematic lockup for queue depth 1.
> > >
> > > You are the only reproducer [ ... ]
> >
> > That's not correct. I'm pretty sure if you try to reproduce this that
> > you will see the same hang I ran into. Does this mean that you have not
> > yet tried to reproduce the hang I reported?
>
> Do you mean every kernel developer has to own SRP/IB hardware?

When I have the time I will make it possible to run this test on any system
equipped with at least one Ethernet port. But the fact that the test I
mentioned requires IB hardware should not prevent you from running this
test, since you have run this test software before.

> I don't have your hardware to reproduce that,

That's not true. Your employer definitely owns IB hardware. E.g. the
following message shows that you have run the srp-test yourself on IB
hardware only four weeks ago:

https://www.spinics.net/lists/linux-block/msg19511.html

> Otherwise, there should have been similar reports from others, not only
> from you.

That's not correct either. How long ago was it that kernel v4.15-rc1 was
released? One week? How many SRP users do you think have tried to trigger a
queue full condition with that kernel version?

> More importantly, I don't understand why you can't share the kernel
> log/debugfs log when the IO hang happens?
>
> Without any kernel log, how can we confirm that it is a valid report?

It's really unfortunate that you are focussing on denying that the bug I
reported exists instead of trying to fix the bugs introduced by commit
b347689ffbca. BTW, I have more than enough experience to decide myself what
a valid report is and what not. I can easily send you several MB of kernel
logs. The reason I have not yet done this is because I'm 99.9% sure that
these won't help to root cause the reported issue. But I can tell you what
I learned from analyzing the information under /sys/kernel/debug/block:
every time a hang occurred I noticed that no requests were "busy", that two
requests occurred in rq_lists, and one request occurred in a hctx dispatch
list. This is enough to conclude that a queue run was missing. And I think
that the patch at the start of this e-mail thread not only shows that the
root cause is in the block layer but also that this bug was introduced by
commit b347689ffbca.

> > > You said that your patch fixes 'commit b347689ffbca ("blk-mq-sched:
> > > improve dispatching from sw queue")', but you don't mention any issue
> > > about that commit.
> >
> > That's not correct either. From the commit message: "A systematic lockup
> > for SCSI queues with queue depth 1."
>
> I mean you mentioned your patch can fix 'commit b347689ffbca
> ("blk-mq-sched: improve dispatching from sw queue")', but you never
> point out where commit b347689ffbca is wrong, or how your patch fixes
> the mistake of that commit.

You should know that it is not required to perform a root cause analysis
before posting a revert. Having performed a bisect is sufficient.
BTW, it seems like you forgot that last Friday I explained to you that there
is an obvious bug in commit b347689ffbca, namely that a
blk_mq_sched_mark_restart_hctx() call is missing in
blk_mq_sched_dispatch_requests() before the blk_mq_do_dispatch_ctx() call.
See also https://marc.info/?l=linux-block&m=151215794224401.

Bart.
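To make the suggestion above concrete: the missing call would sit in blk_mq_sched_dispatch_requests() in block/blk-mq-sched.c. The snippet below is only a sketch of the idea against the rough v4.15-rc code structure, not the actual committed fix; the surrounding control flow is paraphrased and may not match the source exactly.

```c
/* blk_mq_sched_dispatch_requests(), heavily abbreviated sketch */
...
if (has_sched_dispatch) {
	blk_mq_do_dispatch_sched(hctx);
} else {
	/*
	 * Suggested addition per this thread: set BLK_MQ_S_SCHED_RESTART
	 * before dispatching from the sw queues, so that a completion
	 * which frees budget/tags while we are dispatching triggers a
	 * queue rerun instead of leaving the queue stalled.
	 */
	blk_mq_sched_mark_restart_hctx(hctx);
	blk_mq_do_dispatch_ctx(hctx);
}
```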
Re: [PATCH] kyber: fix another domain token wait queue hang
On Mon, Dec 04, 2017 at 03:12:15PM -0800, Omar Sandoval wrote:
> From: Omar Sandoval
>
> Commit 8cf466602028 ("kyber: fix hang on domain token wait queue") fixed
> a hang caused by leaving wait entries on the domain token wait queue
> after the __sbitmap_queue_get() retry succeeded, making that wait entry
> a "dud" which won't in turn wake more entries up. However, we can also
> get a dud entry if kyber_get_domain_token() fails once but is then
> called again and succeeds. This can happen if the hardware queue is
> rerun for some other reason, or, more likely, kyber_dispatch_request()
> tries the same domain twice.
>
> The fix is to remove our entry from the wait queue whenever we
> successfully get a token. The only complication is that we might be on
> one of many wait queues in the struct sbitmap_queue, but that's easily
> fixed by remembering which wait queue we were put on.
>
> While we're here, only initialize the wait queue entry once instead on

s/instead on/instead of on/

> every wait, and use spin_lock_irq() instead of spin_lock_irqsave(),
> since this is always called from process context with irqs enabled.
>
> Signed-off-by: Omar Sandoval
> ---
> I have another, rarer hang I'm still working out, but I can get this one
> out of the way.
>
> block/kyber-iosched.c | 39 +++++++++++++++++++++++--------------
> 1 file changed, 25 insertions(+), 14 deletions(-)
>
> diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
> index b4df317c2916..00cf624ce3ed 100644
> --- a/block/kyber-iosched.c
> +++ b/block/kyber-iosched.c
> @@ -100,9 +100,13 @@ struct kyber_hctx_data {
> 	unsigned int cur_domain;
> 	unsigned int batching;
> 	wait_queue_entry_t domain_wait[KYBER_NUM_DOMAINS];
> +	struct sbq_wait_state *domain_ws[KYBER_NUM_DOMAINS];
> 	atomic_t wait_index[KYBER_NUM_DOMAINS];
> };
>
> +static int kyber_domain_wake(wait_queue_entry_t *wait, unsigned mode, int flags,
> +			     void *key);
> +
> static int rq_sched_domain(const struct request *rq)
> {
> 	unsigned int op = rq->cmd_flags;
> @@ -385,6 +389,9 @@ static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
>
> 	for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
> 		INIT_LIST_HEAD(&khd->rqs[i]);
> +		init_waitqueue_func_entry(&khd->domain_wait[i],
> +					  kyber_domain_wake);
> +		khd->domain_wait[i].private = hctx;
> 		INIT_LIST_HEAD(&khd->domain_wait[i].entry);
> 		atomic_set(&khd->wait_index[i], 0);
> 	}
> @@ -524,35 +531,39 @@ static int kyber_get_domain_token(struct kyber_queue_data *kqd,
> 	int nr;
>
> 	nr = __sbitmap_queue_get(domain_tokens);
> -	if (nr >= 0)
> -		return nr;
>
> 	/*
> 	 * If we failed to get a domain token, make sure the hardware queue is
> 	 * run when one becomes available. Note that this is serialized on
> 	 * khd->lock, but we still need to be careful about the waker.
> 	 */
> -	if (list_empty_careful(&wait->entry)) {
> -		init_waitqueue_func_entry(wait, kyber_domain_wake);
> -		wait->private = hctx;
> +	if (nr < 0 && list_empty_careful(&wait->entry)) {
> 		ws = sbq_wait_ptr(domain_tokens,
> 				  &khd->wait_index[sched_domain]);
> -		add_wait_queue(&ws->wait, wait);
> +		khd->domain_ws[sched_domain] = ws;
> +		add_wait_queue_exclusive(&ws->wait, wait);
>
> 		/*
> 		 * Try again in case a token was freed before we got on the wait
> -		 * queue. The waker may have already removed the entry from the
> -		 * wait queue, but list_del_init() is okay with that.
> +		 * queue.
> 		 */
> 		nr = __sbitmap_queue_get(domain_tokens);
> -		if (nr >= 0) {
> -			unsigned long flags;
> +	}
>
> -			spin_lock_irqsave(&ws->wait.lock, flags);
> -			list_del_init(&wait->entry);
> -			spin_unlock_irqrestore(&ws->wait.lock, flags);
> -		}
> +	/*
> +	 * If we got a token while we were on the wait queue, remove ourselves
> +	 * from the wait queue to ensure that all wake ups make forward
> +	 * progress. It's possible that the waker already deleted the entry
> +	 * between the !list_empty_careful() check and us grabbing the lock, but
> +	 * list_del_init() is okay with that.
> +	 */
> +	if (nr >= 0 && !list_empty_careful(&wait->entry)) {
> +		ws = khd->domain_ws[sched_domain];
> +		spin_lock_irq(&ws->wait.lock);
> +		list_del_init(&wait->entry);
> +		spin_unlock_irq(&ws->wait.lock);
> 	}
> +
> 	return nr;
> }
>
> --
> 2.15.1
>
Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
On 05.12.2017 01:59, Jeff Moyer wrote:
> Kirill Tkhai writes:
>
>> On 05.12.2017 00:52, Tejun Heo wrote:
>>> Hello, Kirill.
>>>
>>> On Tue, Dec 05, 2017 at 12:44:00AM +0300, Kirill Tkhai wrote:
>>>>> Can you please explain how this is a fundamental resource which can't
>>>>> be controlled otherwise?
>>>>
>>>> Currently, aio_nr and aio_max_nr are global. In the case of containers
>>>> this means that a single container may occupy all the aio requests
>>>> available in the system and deprive the others of the possibility to
>>>> use aio at all. This may happen because of evil intentions of the
>>>> container's user or because of a program error, when the user does
>>>> this accidentally.
>>>
>>> Hmm... I see. It feels really wrong to me to make this a first class
>>> resource because there is a system wide limit. The only reason I can
>>> think of for the system wide limit is to prevent too much kernel
>>> memory being consumed by creating a lot of aios, but that squarely
>>> falls inside cgroup memory controller protection. If there are other
>>> reasons why the number of aios should be limited system-wide, please
>>> bring them up.
>>>
>>> If the only reason is kernel memory consumption protection, the only
>>> thing we need to do is making sure that memory used for aio commands
>>> is accounted against cgroup kernel memory consumption and
>>> relaxing/removing the system wide limit.
>>
>> So, we just use the GFP_KERNEL_ACCOUNT flag for allocation of internal
>> aio structures and pages, and all the memory will be accounted in kmem
>> and limited by memcg. Looks very good.
>>
>> One detail about memory consumption. io_submit() calls the primitives
>> file_operations::write_iter and read_iter. It's not clear to me whether
>> they consume the same memory as if the writev() or readv() system calls
>> were used instead. writev() may delay the actual write until the dirty
>> pages limit is reached, so it seems the accounting logic should be the
>> same. So aio must not use more unaccounted system memory in filesystem
>> internals than a simple writev().
>>
>> Could you please say whether you have thoughts about this?
>
> I think you just need to account the completion ring.

A request of type struct aio_kiocb consumes much more memory than a
struct io_event does. Shouldn't we account it too?

Kirill
[PATCH] kyber: fix another domain token wait queue hang
From: Omar Sandoval

Commit 8cf466602028 ("kyber: fix hang on domain token wait queue") fixed
a hang caused by leaving wait entries on the domain token wait queue
after the __sbitmap_queue_get() retry succeeded, making that wait entry
a "dud" which won't in turn wake more entries up. However, we can also
get a dud entry if kyber_get_domain_token() fails once but is then
called again and succeeds. This can happen if the hardware queue is
rerun for some other reason, or, more likely, kyber_dispatch_request()
tries the same domain twice.

The fix is to remove our entry from the wait queue whenever we
successfully get a token. The only complication is that we might be on
one of many wait queues in the struct sbitmap_queue, but that's easily
fixed by remembering which wait queue we were put on.

While we're here, only initialize the wait queue entry once instead of on
every wait, and use spin_lock_irq() instead of spin_lock_irqsave(),
since this is always called from process context with irqs enabled.

Signed-off-by: Omar Sandoval
---
I have another, rarer hang I'm still working out, but I can get this one
out of the way.
 block/kyber-iosched.c | 39 +++++++++++++++++++++++--------------
 1 file changed, 25 insertions(+), 14 deletions(-)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index b4df317c2916..00cf624ce3ed 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -100,9 +100,13 @@ struct kyber_hctx_data {
 	unsigned int cur_domain;
 	unsigned int batching;
 	wait_queue_entry_t domain_wait[KYBER_NUM_DOMAINS];
+	struct sbq_wait_state *domain_ws[KYBER_NUM_DOMAINS];
 	atomic_t wait_index[KYBER_NUM_DOMAINS];
 };
 
+static int kyber_domain_wake(wait_queue_entry_t *wait, unsigned mode, int flags,
+			     void *key);
+
 static int rq_sched_domain(const struct request *rq)
 {
 	unsigned int op = rq->cmd_flags;
@@ -385,6 +389,9 @@ static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
 
 	for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
 		INIT_LIST_HEAD(&khd->rqs[i]);
+		init_waitqueue_func_entry(&khd->domain_wait[i],
+					  kyber_domain_wake);
+		khd->domain_wait[i].private = hctx;
 		INIT_LIST_HEAD(&khd->domain_wait[i].entry);
 		atomic_set(&khd->wait_index[i], 0);
 	}
@@ -524,35 +531,39 @@ static int kyber_get_domain_token(struct kyber_queue_data *kqd,
 	int nr;
 
 	nr = __sbitmap_queue_get(domain_tokens);
-	if (nr >= 0)
-		return nr;
 
 	/*
	 * If we failed to get a domain token, make sure the hardware queue is
	 * run when one becomes available. Note that this is serialized on
	 * khd->lock, but we still need to be careful about the waker.
	 */
-	if (list_empty_careful(&wait->entry)) {
-		init_waitqueue_func_entry(wait, kyber_domain_wake);
-		wait->private = hctx;
+	if (nr < 0 && list_empty_careful(&wait->entry)) {
 		ws = sbq_wait_ptr(domain_tokens,
 				  &khd->wait_index[sched_domain]);
-		add_wait_queue(&ws->wait, wait);
+		khd->domain_ws[sched_domain] = ws;
+		add_wait_queue_exclusive(&ws->wait, wait);
 
 		/*
		 * Try again in case a token was freed before we got on the wait
-		 * queue. The waker may have already removed the entry from the
-		 * wait queue, but list_del_init() is okay with that.
+		 * queue.
		 */
 		nr = __sbitmap_queue_get(domain_tokens);
-		if (nr >= 0) {
-			unsigned long flags;
+	}
 
-			spin_lock_irqsave(&ws->wait.lock, flags);
-			list_del_init(&wait->entry);
-			spin_unlock_irqrestore(&ws->wait.lock, flags);
-		}
+	/*
+	 * If we got a token while we were on the wait queue, remove ourselves
+	 * from the wait queue to ensure that all wake ups make forward
+	 * progress. It's possible that the waker already deleted the entry
+	 * between the !list_empty_careful() check and us grabbing the lock, but
+	 * list_del_init() is okay with that.
+	 */
+	if (nr >= 0 && !list_empty_careful(&wait->entry)) {
+		ws = khd->domain_ws[sched_domain];
+		spin_lock_irq(&ws->wait.lock);
+		list_del_init(&wait->entry);
+		spin_unlock_irq(&ws->wait.lock);
 	}
+
 	return nr;
 }

-- 
2.15.1
Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
On 05.12.2017 02:02, Tejun Heo wrote:
> Hello, Kirill.
>
> On Tue, Dec 05, 2017 at 01:49:42AM +0300, Kirill Tkhai wrote:
>>> If the only reason is kernel memory consumption protection, the only
>>> thing we need to do is making sure that memory used for aio commands
>>> is accounted against cgroup kernel memory consumption and
>>> relaxing/removing the system wide limit.
>>
>> So, we just use the GFP_KERNEL_ACCOUNT flag for allocation of internal
>> aio structures and pages, and all the memory will be accounted in kmem
>> and limited by memcg. Looks very good.
>
> Yeah.
>
>> One detail about memory consumption. io_submit() calls the primitives
>> file_operations::write_iter and read_iter. It's not clear to me whether
>> they consume the same memory as if the writev() or readv() system calls
>> were used instead. writev() may delay the actual write until the dirty
>> pages limit is reached, so it seems the accounting logic should be the
>> same. So aio must not use more unaccounted system memory in filesystem
>> internals than a simple writev().
>>
>> Could you please say whether you have thoughts about this?
>
> I'm not too familiar with vfs / filesystems but I don't think there's
> gonna be significant unaccounted memory consumption. It shouldn't be
> too difficult to find out with experiments too.
>
> Thanks.

Thanks, Tejun!

Kirill
Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
Hello, Kirill.

On Tue, Dec 05, 2017 at 01:49:42AM +0300, Kirill Tkhai wrote:
> > If the only reason is kernel memory consumption protection, the only
> > thing we need to do is making sure that memory used for aio commands
> > is accounted against cgroup kernel memory consumption and
> > relaxing/removing the system wide limit.
>
> So, we just use the GFP_KERNEL_ACCOUNT flag for allocation of internal
> aio structures and pages, and all the memory will be accounted in kmem
> and limited by memcg. Looks very good.

Yeah.

> One detail about memory consumption. io_submit() calls the primitives
> file_operations::write_iter and read_iter. It's not clear to me whether
> they consume the same memory as if the writev() or readv() system calls
> were used instead. writev() may delay the actual write until the dirty
> pages limit is reached, so it seems the accounting logic should be the
> same. So aio must not use more unaccounted system memory in filesystem
> internals than a simple writev().
>
> Could you please say whether you have thoughts about this?

I'm not too familiar with vfs / filesystems but I don't think there's
gonna be significant unaccounted memory consumption. It shouldn't be
too difficult to find out with experiments too.

Thanks.

-- 
tejun
Re: [PATCH] blk-mq: Fix several SCSI request queue lockups
On Mon, Dec 04, 2017 at 10:48:18PM +, Bart Van Assche wrote:
> On Tue, 2017-12-05 at 06:42 +0800, Ming Lei wrote:
> > On Mon, Dec 04, 2017 at 09:30:32AM -0800, Bart Van Assche wrote:
> > > * A systematic lockup for SCSI queues with queue depth 1. The
> > >   following test reproduces that bug systematically:
> > >   - Change the SRP initiator such that the SCSI target queue depth is
> > >     limited to 1.
> > >   - Run the following command:
> > >     srp-test/run_tests -f xfs -d -e none -r 60 -t 01
> > >   See also "[PATCH 4/7] blk-mq: Avoid that request processing
> > >   stalls when sharing tags"
> > >   (https://marc.info/?l=linux-block&m=151208695316857). Note:
> > >   reverting commit 0df21c86bdbf also fixes a sporadic SCSI request
> > >   queue lockup, while inserting a blk_mq_sched_mark_restart_hctx()
> > >   before all blk_mq_dispatch_rq_list() calls only fixes the
> > >   systematic lockup for queue depth 1.
> >
> > You are the only reproducer [ ... ]
>
> That's not correct. I'm pretty sure if you try to reproduce this that
> you will see the same hang I ran into. Does this mean that you have not
> yet tried to reproduce the hang I reported?

Do you mean every kernel developer has to own SRP/IB hardware? I don't have
your hardware to reproduce that, and I don't think most people have it.
Otherwise, there should have been similar reports from others, not only
from you.

More importantly, I don't understand why you can't share the kernel
log/debugfs log when the IO hang happens?

Without any kernel log, how can we confirm that it is a valid report?

> > You said that your patch fixes 'commit b347689ffbca ("blk-mq-sched:
> > improve dispatching from sw queue")', but you don't mention any issue
> > about that commit.
>
> That's not correct either. From the commit message: "A systematic lockup
> for SCSI queues with queue depth 1."

I mean you mentioned your patch can fix 'commit b347689ffbca
("blk-mq-sched: improve dispatching from sw queue")', but you never point
out where commit b347689ffbca is wrong, or how your patch fixes the
mistake of that commit.

> > > I think the above means that it is too risky to try to fix all bugs
> > > introduced by commit 0df21c86bdbf before kernel v4.15 is released.
> > > Hence revert that commit.
> >
> > What is the risk?
>
> That more bugs were introduced by commit 0df21c86bdbf than the ones that
> have been discovered so far.

If you don't provide any log, I have to simply ignore your report. So there
is only one real issue, which can be addressed easily by the following
patch:

https://marc.info/?l=linux-scsi&m=151223234607157&w=2

-- 
Ming
Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
Kirill Tkhai writes:

> On 05.12.2017 00:52, Tejun Heo wrote:
>> Hello, Kirill.
>>
>> On Tue, Dec 05, 2017 at 12:44:00AM +0300, Kirill Tkhai wrote:
>>>> Can you please explain how this is a fundamental resource which can't
>>>> be controlled otherwise?
>>>
>>> Currently, aio_nr and aio_max_nr are global. In the case of containers
>>> this means that a single container may occupy all the aio requests
>>> available in the system and deprive the others of the possibility to
>>> use aio at all. This may happen because of evil intentions of the
>>> container's user or because of a program error, when the user does this
>>> accidentally.
>>
>> Hmm... I see. It feels really wrong to me to make this a first class
>> resource because there is a system wide limit. The only reason I can
>> think of for the system wide limit is to prevent too much kernel
>> memory being consumed by creating a lot of aios, but that squarely falls
>> inside cgroup memory controller protection. If there are other
>> reasons why the number of aios should be limited system-wide, please
>> bring them up.
>>
>> If the only reason is kernel memory consumption protection, the only
>> thing we need to do is making sure that memory used for aio commands
>> is accounted against cgroup kernel memory consumption and
>> relaxing/removing the system wide limit.
>
> So, we just use the GFP_KERNEL_ACCOUNT flag for allocation of internal
> aio structures and pages, and all the memory will be accounted in kmem
> and limited by memcg. Looks very good.
>
> One detail about memory consumption. io_submit() calls the primitives
> file_operations::write_iter and read_iter. It's not clear to me whether
> they consume the same memory as if the writev() or readv() system calls
> were used instead. writev() may delay the actual write until the dirty
> pages limit is reached, so it seems the accounting logic should be the
> same. So aio must not use more unaccounted system memory in filesystem
> internals than a simple writev().
>
> Could you please say whether you have thoughts about this?

I think you just need to account the completion ring.

Cheers,
Jeff
Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()
On Tue, 2017-12-05 at 06:45 +0800, Ming Lei wrote:
> On Mon, Dec 04, 2017 at 03:09:20PM +, Bart Van Assche wrote:
> > On Sun, 2017-12-03 at 00:31 +0800, Ming Lei wrote:
> > > Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for
> > > blk-mq")
> >
> > It might be safer to revert commit 0df21c86bdbf instead of trying to fix
> > all issues introduced by that commit for kernel version v4.15 ...
>
> What are all issues in v4.15-rc? Up to now, it is the only issue reported,
> and can be fixed by this simple patch, which one can be thought as cleanup
> too.

The three issues I described in the commit message of the patch that is
available at https://marc.info/?l=linux-block&m=151240866010572.

Bart.
Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
On 05.12.2017 00:52, Tejun Heo wrote:
> Hello, Kirill.
>
> On Tue, Dec 05, 2017 at 12:44:00AM +0300, Kirill Tkhai wrote:
>>> Can you please explain how this is a fundamental resource which can't
>>> be controlled otherwise?
>>
>> Currently, aio_nr and aio_max_nr are global. In the case of containers
>> this means that a single container may occupy all the aio requests
>> available in the system and deprive the others of the possibility to use
>> aio at all. This may happen because of evil intentions of the
>> container's user or because of a program error, when the user does this
>> accidentally.
>
> Hmm... I see. It feels really wrong to me to make this a first class
> resource because there is a system wide limit. The only reason I can
> think of for the system wide limit is to prevent too much kernel
> memory being consumed by creating a lot of aios, but that squarely falls
> inside cgroup memory controller protection. If there are other
> reasons why the number of aios should be limited system-wide, please
> bring them up.
>
> If the only reason is kernel memory consumption protection, the only
> thing we need to do is making sure that memory used for aio commands
> is accounted against cgroup kernel memory consumption and
> relaxing/removing the system wide limit.

So, we just use the GFP_KERNEL_ACCOUNT flag for allocation of internal aio
structures and pages, and all the memory will be accounted in kmem and
limited by memcg. Looks very good.

One detail about memory consumption. io_submit() calls the primitives
file_operations::write_iter and read_iter. It's not clear to me whether
they consume the same memory as if the writev() or readv() system calls
were used instead. writev() may delay the actual write until the dirty
pages limit is reached, so it seems the accounting logic should be the
same. So aio must not use more unaccounted system memory in filesystem
internals than a simple writev().

Could you please say whether you have thoughts about this?

Kirill
Re: [PATCH] blk-mq: Fix several SCSI request queue lockups
On Tue, 2017-12-05 at 06:42 +0800, Ming Lei wrote:
> On Mon, Dec 04, 2017 at 09:30:32AM -0800, Bart Van Assche wrote:
> > * A systematic lockup for SCSI queues with queue depth 1. The
> >   following test reproduces that bug systematically:
> >   - Change the SRP initiator such that the SCSI target queue depth is
> >     limited to 1.
> >   - Run the following command:
> >     srp-test/run_tests -f xfs -d -e none -r 60 -t 01
> >   See also "[PATCH 4/7] blk-mq: Avoid that request processing
> >   stalls when sharing tags"
> >   (https://marc.info/?l=linux-block&m=151208695316857). Note:
> >   reverting commit 0df21c86bdbf also fixes a sporadic SCSI request
> >   queue lockup, while inserting a blk_mq_sched_mark_restart_hctx()
> >   before all blk_mq_dispatch_rq_list() calls only fixes the
> >   systematic lockup for queue depth 1.
>
> You are the only reproducer [ ... ]

That's not correct. I'm pretty sure if you try to reproduce this that
you will see the same hang I ran into. Does this mean that you have not
yet tried to reproduce the hang I reported?

> You said that your patch fixes 'commit b347689ffbca ("blk-mq-sched:
> improve dispatching from sw queue")', but you don't mention any issue
> about that commit.

That's not correct either. From the commit message: "A systematic lockup
for SCSI queues with queue depth 1."

> > I think the above means that it is too risky to try to fix all bugs
> > introduced by commit 0df21c86bdbf before kernel v4.15 is released.
> > Hence revert that commit.
>
> What is the risk?

That more bugs were introduced by commit 0df21c86bdbf than the ones that
have been discovered so far.

Bart.
Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()
On Mon, Dec 04, 2017 at 03:09:20PM +, Bart Van Assche wrote:
> On Sun, 2017-12-03 at 00:31 +0800, Ming Lei wrote:
> > Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for
> > blk-mq")
>
> It might be safer to revert commit 0df21c86bdbf instead of trying to fix
> all issues introduced by that commit for kernel version v4.15 ...

What are all the issues in v4.15-rc? Up to now, this is the only issue
reported, and it can be fixed by this simple patch, which can be thought
of as a cleanup too.

-- 
Ming
Re: [PATCH] blk-mq: Fix several SCSI request queue lockups
On Mon, Dec 04, 2017 at 09:30:32AM -0800, Bart Van Assche wrote:
> Commit 0df21c86bdbf introduced several bugs:
> * A SCSI queue stall for queue depths > 1, addressed by commit
>   88022d7201e9 ("blk-mq: don't handle failure in .get_budget")

This one is committed already.

> * A systematic lockup for SCSI queues with queue depth 1. The
>   following test reproduces that bug systematically:
>   - Change the SRP initiator such that the SCSI target queue depth is
>     limited to 1.
>   - Run the following command:
>     srp-test/run_tests -f xfs -d -e none -r 60 -t 01
>   See also "[PATCH 4/7] blk-mq: Avoid that request processing
>   stalls when sharing tags"
>   (https://marc.info/?l=linux-block&m=151208695316857). Note:
>   reverting commit 0df21c86bdbf also fixes a sporadic SCSI request
>   queue lockup, while inserting a blk_mq_sched_mark_restart_hctx()
>   before all blk_mq_dispatch_rq_list() calls only fixes the
>   systematic lockup for queue depth 1.

You are the only reproducer, and you don't want to provide any kernel log
about this issue, so how can we help you fix it?

You said that your patch fixes 'commit b347689ffbca ("blk-mq-sched:
improve dispatching from sw queue")', but you don't mention any issue with
that commit, and your patch actually has nothing to do with commit
b347689ffbca; it seems your approach is just to try and guess.

Also, both Jens and I have run tests on null_blk and scsi_debug with
queue_depth set to one, and neither of us can see an IO hang with current
blk-mq.

> * A scsi_debug lockup - see also "[PATCH] SCSI: delay run queue if
>   device is blocked in scsi_dev_queue_ready()"
>   (https://marc.info/?l=linux-block&m=151223233407154).

This issue is clearly explained in theory, and it can be reproduced and
verified with scsi_debug, so why can't we apply that patch to fix it? The
fix is simple and can be thought of as a cleanup too, since the handling
of this case now becomes the same as in the non-mq path.

> I think the above means that it is too risky to try to fix all bugs
> introduced by commit 0df21c86bdbf before kernel v4.15 is released.
> Hence revert that commit.

What is the risk?

> Fixes: commit 0df21c86bdbf ("scsi: implement .get_budget and .put_budget
> for blk-mq")
> Signed-off-by: Bart Van Assche
> Cc: Ming Lei
> Cc: Christoph Hellwig
> Cc: Hannes Reinecke
> Cc: Johannes Thumshirn
> Cc: James E.J. Bottomley
> Cc: Martin K. Petersen
> Cc: linux-s...@vger.kernel.org

This commit fixes an important SCSI_MQ performance issue, so we can't
simply revert it just because of one unconfirmed report from you alone
(without any kernel log provided).

So Nak.

-- 
Ming
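For reference, the patch Ming links to above (also quoted at the head of this thread: two added lines in drivers/scsi/scsi_lib.c) takes roughly the following shape. This is a sketch reconstructed from the patch description — rerun the queue after a delay when the device queue isn't ready and the queue is idle, like the non-mq path did — so the exact guard condition may differ from the committed version.

```c
/* scsi_mq_get_budget(), abbreviated sketch of the failure path */
out_put_device:
	put_device(&sdev->sdev_gendev);
out:
	/*
	 * Like the legacy (non-mq) path did when handling BLK_STS_RESOURCE,
	 * rerun the queue after a short delay when the device queue isn't
	 * ready and the queue is idle; otherwise nothing restarts the queue
	 * and IO hangs once a timeout occurs.
	 */
	if (atomic_read(&sdev->device_busy) == 0 && !scsi_device_blocked(sdev))
		blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
	return false;
```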
Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
Hello, Kirill.

On Tue, Dec 05, 2017 at 12:44:00AM +0300, Kirill Tkhai wrote:
> > Can you please explain how this is a fundamental resource which can't
> > be controlled otherwise?
>
> Currently, aio_nr and aio_max_nr are global. In the case of containers
> this means that a single container may occupy all the aio requests
> available in the system and deprive the others of the possibility to use
> aio at all. This may happen because of evil intentions of the container's
> user or because of a program error, when the user does this accidentally.

Hmm... I see. It feels really wrong to me to make this a first class
resource because there is a system wide limit. The only reason I can
think of for the system wide limit is to prevent too much kernel
memory being consumed by creating a lot of aios, but that squarely falls
inside cgroup memory controller protection. If there are other
reasons why the number of aios should be limited system-wide, please
bring them up.

If the only reason is kernel memory consumption protection, the only
thing we need to do is making sure that memory used for aio commands
is accounted against cgroup kernel memory consumption and
relaxing/removing the system wide limit.

Thanks.

-- 
tejun
Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
On 05.12.2017 00:35, Jeff Moyer wrote: > Kirill Tkhai writes: > >> Hi, Benjamin, >> >> On 04.12.2017 19:52, Benjamin LaHaise wrote: >>> Hi Kirill, >>> >>> On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote: Hi, this patch set introduces accounting aio_nr and aio_max_nr per blkio cgroup. It may be used to limit number of aio requests, which are available for a cgroup, and could be useful for containers. The accounting is hierarchical, and aio contexts, allocated in child cgroup, are accounted in parent cgroups too. Also, new cgroup file "io.aio_nr" is introduced. It's used to set cgroup aio requests limit, to show current limit and number of currenly occupied requests. >>> >>> Where are your test cases to check this functionality in the libaio test >>> suite? >> >> I tried to find actual libaio test suite repository url in google, >> but there is no certain answer. Also, there is no information in kernel >> anywhere (I hope I grepped right). >> >> Could you please provide url where actual upstream libaio could be obtained? > > https://pagure.io/libaio > > Patches can be sent to this list (linux-aio). > Thank you, Jeff. Kirill Patches 1-3 are refactoring. Patch 4 is the place where the accounting actually introduced. Patch 5 adds "io.aio_nr" file. --- Kirill Tkhai (5): aio: Move aio_nr increment to separate function aio: Export aio_nr_lock and aio_max_nr initial value to include/linux/aio.h blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr blkcg: Charge aio requests in blkio cgroup hierarchy blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr block/blk-cgroup.c | 88 +- fs/aio.c | 151 include/linux/aio.h| 21 ++ include/linux/blk-cgroup.h |4 + 4 files changed, 247 insertions(+), 17 deletions(-) -- Signed-off-by: Kirill Tkhai >>> >> >> -- >> To unsubscribe, send a message with 'unsubscribe linux-aio' in >> the body to majord...@kvack.org. 
Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
Hello, Tejun,

On 04.12.2017 23:07, Tejun Heo wrote:
> On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
>> this patch set introduces accounting aio_nr and aio_max_nr per blkio cgroup.
>> It may be used to limit number of aio requests, which are available for
>> a cgroup, and could be useful for containers.
>
> Can you please explain how this is a fundamental resource which can't
> be controlled otherwise?

Currently, aio_nr and aio_max_nr are global. For containers this means
that a single container may occupy all the aio requests available in the
system and deprive the others of the possibility to use aio at all. This
may happen because of the evil intentions of a container's user, or
accidentally, because of a program error.

My patch set allows to guarantee that every container or cgroup has its
own number of allowed aios which nobody can steal, so no container can
slow down another one or force its programs to fall back to direct io.
AIO gives certain advantages to its user, so this patch set simply does
not allow those advantages to be taken away without any possibility of
protecting against that. This could be used by LXC or for starting some
critical micro-services, for example.

Kirill
Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
Kirill Tkhai writes: > Hi, Benjamin, > > On 04.12.2017 19:52, Benjamin LaHaise wrote: >> Hi Kirill, >> >> On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote: >>> Hi, >>> >>> this patch set introduces accounting aio_nr and aio_max_nr per blkio cgroup. >>> It may be used to limit number of aio requests, which are available for >>> a cgroup, and could be useful for containers. >>> >>> The accounting is hierarchical, and aio contexts, allocated in child cgroup, >>> are accounted in parent cgroups too. >>> >>> Also, new cgroup file "io.aio_nr" is introduced. It's used to set cgroup >>> aio requests limit, to show current limit and number of currenly occupied >>> requests. >> >> Where are your test cases to check this functionality in the libaio test >> suite? > > I tried to find actual libaio test suite repository url in google, > but there is no certain answer. Also, there is no information in kernel > anywhere (I hope I grepped right). > > Could you please provide url where actual upstream libaio could be obtained? https://pagure.io/libaio Patches can be sent to this list (linux-aio). Cheers, Jeff >>> Patches 1-3 are refactoring. >>> Patch 4 is the place where the accounting actually introduced. >>> Patch 5 adds "io.aio_nr" file. >>> >>> --- >>> >>> Kirill Tkhai (5): >>> aio: Move aio_nr increment to separate function >>> aio: Export aio_nr_lock and aio_max_nr initial value to >>> include/linux/aio.h >>> blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr >>> blkcg: Charge aio requests in blkio cgroup hierarchy >>> blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr >>> >>> >>> block/blk-cgroup.c | 88 +- >>> fs/aio.c | 151 >>> >>> include/linux/aio.h| 21 ++ >>> include/linux/blk-cgroup.h |4 + >>> 4 files changed, 247 insertions(+), 17 deletions(-) >>> >>> -- >>> Signed-off-by: Kirill Tkhai >>> >> > > -- > To unsubscribe, send a message with 'unsubscribe linux-aio' in > the body to majord...@kvack.org. 
Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
Hi, Benjamin, On 04.12.2017 19:52, Benjamin LaHaise wrote: > Hi Kirill, > > On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote: >> Hi, >> >> this patch set introduces accounting aio_nr and aio_max_nr per blkio cgroup. >> It may be used to limit number of aio requests, which are available for >> a cgroup, and could be useful for containers. >> >> The accounting is hierarchical, and aio contexts, allocated in child cgroup, >> are accounted in parent cgroups too. >> >> Also, new cgroup file "io.aio_nr" is introduced. It's used to set cgroup >> aio requests limit, to show current limit and number of currenly occupied >> requests. > > Where are your test cases to check this functionality in the libaio test > suite? I tried to find actual libaio test suite repository url in google, but there is no certain answer. Also, there is no information in kernel anywhere (I hope I grepped right). Could you please provide url where actual upstream libaio could be obtained? Kirill > > -ben > >> Patches 1-3 are refactoring. >> Patch 4 is the place where the accounting actually introduced. >> Patch 5 adds "io.aio_nr" file. >> >> --- >> >> Kirill Tkhai (5): >> aio: Move aio_nr increment to separate function >> aio: Export aio_nr_lock and aio_max_nr initial value to >> include/linux/aio.h >> blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr >> blkcg: Charge aio requests in blkio cgroup hierarchy >> blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr >> >> >> block/blk-cgroup.c | 88 +- >> fs/aio.c | 151 >> >> include/linux/aio.h| 21 ++ >> include/linux/blk-cgroup.h |4 + >> 4 files changed, 247 insertions(+), 17 deletions(-) >> >> -- >> Signed-off-by: Kirill Tkhai >> >
Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
Hello, Kirill.

On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
> this patch set introduces accounting aio_nr and aio_max_nr per blkio cgroup.
> It may be used to limit number of aio requests, which are available for
> a cgroup, and could be useful for containers.

Can you please explain how this is a fundamental resource which can't
be controlled otherwise?

Thanks.

-- 
tejun
Re: [PATCH] generic/473: test return EBUSY from BLKRRPART for mounted whole-dev
On Mon, Dec 04, 2017 at 05:48:36PM +0800, Xiao Yang wrote:
> On 2017/12/04 17:25, Eryu Guan wrote:
> > On Mon, Dec 04, 2017 at 05:15:17PM +0800, Xiao Yang wrote:
> > > On 2017/12/04 16:29, Eryu Guan wrote:
> > > > On Wed, Nov 29, 2017 at 08:02:26PM +0800, Xiao Yang wrote:
> > > > > If the entire block device is formatted with a filesystem and
> > > > > mounted, running "blockdev --rereadpt" should fail and return
> > > > > EBUSY instead of pass.
> > > > >
> > > > > Signed-off-by: Xiao Yang
> > > >
> > > > As we have blktests[1] now, I think this may fit in blktests better?
> > >
> > > Hi Eryu,
> > >
> > > Do you think test cases which use the scsi_debug module should be
> > > moved into blktests?
> > > (e.g. generic/108, generic/349, generic/350, generic/351)
> >
> > I don't think they need to be moved to blktests. Most other tests that
> > make use of scsi_debug are for filesystem testing, e.g. generic/108.
> > generic/349 and generic/35[01] are a bit special: they were there before
> > blktests was announced available, so they're in a special blockdev group
> > and not in the auto group. If Omar agrees, I think they can be ported to
> > blktests.
>
> Hi Eryu,
>
> Thanks for your explanation, and I will try to send it to blktests.
>
> Thanks,
> Xiao Yang

I agree, the three tests Eryu mentioned and this new test would be a
good fit for blktests. Let me know if you need any help porting, things
are a little different from xfstests.
[PATCH] blk-mq: Fix several SCSI request queue lockups
Commit 0df21c86bdbf introduced several bugs:
* A SCSI queue stall for queue depths > 1, addressed by commit
  88022d7201e9 ("blk-mq: don't handle failure in .get_budget")
* A systematic lockup for SCSI queues with queue depth 1. The
  following test reproduces that bug systematically:
  - Change the SRP initiator such that SCSI target queue depth is
    limited to 1.
  - Run the following command:
    srp-test/run_tests -f xfs -d -e none -r 60 -t 01
  See also "[PATCH 4/7] blk-mq: Avoid that request processing
  stalls when sharing tags"
  (https://marc.info/?l=linux-block&m=151208695316857). Note:
  reverting commit 0df21c86bdbf also fixes a sporadic SCSI request
  queue lockup while inserting a blk_mq_sched_mark_restart_hctx()
  before all blk_mq_dispatch_rq_list() calls only fixes the
  systematic lockup for queue depth 1.
* A scsi_debug lockup - see also "[PATCH] SCSI: delay run queue if
  device is blocked in scsi_dev_queue_ready()"
  (https://marc.info/?l=linux-block&m=151223233407154).

I think the above means that it is too risky to try to fix all bugs
introduced by commit 0df21c86bdbf before kernel v4.15 is released.
Hence revert that commit.

Fixes: commit 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for blk-mq")
Signed-off-by: Bart Van Assche
Cc: Ming Lei
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
Cc: James E.J. Bottomley
Cc: Martin K. Petersen
Cc: linux-s...@vger.kernel.org
---
 drivers/scsi/scsi_lib.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 84bd2b16d216..a7e7966f1477 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1976,9 +1976,11 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	struct scsi_device *sdev = q->queuedata;
 	struct Scsi_Host *shost = sdev->host;
 	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
-	blk_status_t ret;
+	blk_status_t ret = BLK_STS_RESOURCE;
 	int reason;
 
+	if (!scsi_mq_get_budget(hctx))
+		goto out;
 	ret = prep_to_mq(scsi_prep_state_check(sdev, req));
 	if (ret != BLK_STS_OK)
 		goto out_put_budget;
@@ -2022,6 +2024,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	atomic_dec(&scsi_target(sdev)->target_busy);
 out_put_budget:
 	scsi_mq_put_budget(hctx);
+out:
 	switch (ret) {
 	case BLK_STS_OK:
 		break;
@@ -2225,8 +2228,6 @@ struct request_queue *scsi_old_alloc_queue(struct scsi_device *sdev)
 }
 
 static const struct blk_mq_ops scsi_mq_ops = {
-	.get_budget	= scsi_mq_get_budget,
-	.put_budget	= scsi_mq_put_budget,
 	.queue_rq	= scsi_queue_rq,
 	.complete	= scsi_softirq_done,
 	.timeout	= scsi_timeout,
-- 
2.15.0
Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
Hi Kirill, On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote: > Hi, > > this patch set introduces accounting aio_nr and aio_max_nr per blkio cgroup. > It may be used to limit number of aio requests, which are available for > a cgroup, and could be useful for containers. > > The accounting is hierarchical, and aio contexts, allocated in child cgroup, > are accounted in parent cgroups too. > > Also, new cgroup file "io.aio_nr" is introduced. It's used to set cgroup > aio requests limit, to show current limit and number of currenly occupied > requests. Where are your test cases to check this functionality in the libaio test suite? -ben > Patches 1-3 are refactoring. > Patch 4 is the place where the accounting actually introduced. > Patch 5 adds "io.aio_nr" file. > > --- > > Kirill Tkhai (5): > aio: Move aio_nr increment to separate function > aio: Export aio_nr_lock and aio_max_nr initial value to > include/linux/aio.h > blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr > blkcg: Charge aio requests in blkio cgroup hierarchy > blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr > > > block/blk-cgroup.c | 88 +- > fs/aio.c | 151 > > include/linux/aio.h| 21 ++ > include/linux/blk-cgroup.h |4 + > 4 files changed, 247 insertions(+), 17 deletions(-) > > -- > Signed-off-by: Kirill Tkhai > -- "Thought is the essence of where you are now."
[PATCH 2/3] arm64: don't override dma_max_pfn
The generic version now takes dma_pfn_offset into account, so there is
no more need for an architecture override.

Signed-off-by: Christoph Hellwig
---
 arch/arm64/include/asm/dma-mapping.h | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/arch/arm64/include/asm/dma-mapping.h b/arch/arm64/include/asm/dma-mapping.h
index 0df756b24863..eada887a93bf 100644
--- a/arch/arm64/include/asm/dma-mapping.h
+++ b/arch/arm64/include/asm/dma-mapping.h
@@ -76,14 +76,5 @@ static inline void dma_mark_clean(void *addr, size_t size)
 {
 }
 
-/* Override for dma_max_pfn() */
-static inline unsigned long dma_max_pfn(struct device *dev)
-{
-	dma_addr_t dma_max = (dma_addr_t)*dev->dma_mask;
-
-	return (ulong)dma_to_phys(dev, dma_max) >> PAGE_SHIFT;
-}
-#define dma_max_pfn(dev) dma_max_pfn(dev)
-
 #endif	/* __KERNEL__ */
 #endif	/* __ASM_DMA_MAPPING_H */
-- 
2.14.2
[PATCH 1/3] dma-mapping: take dma_pfn_offset into account in dma_max_pfn
This makes sure the generic version can be used with architectures /
devices that have a DMA offset in the direct mapping.

Signed-off-by: Christoph Hellwig
---
 include/linux/dma-mapping.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index e8f8e8fb244d..86beb9861618 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -692,7 +692,7 @@ static inline int dma_set_seg_boundary(struct device *dev, unsigned long mask)
 #ifndef dma_max_pfn
 static inline unsigned long dma_max_pfn(struct device *dev)
 {
-	return *dev->dma_mask >> PAGE_SHIFT;
+	return (*dev->dma_mask >> PAGE_SHIFT) + dev->dma_pfn_offset;
 }
 #endif
-- 
2.14.2
[PATCH 3/3] dma-mapping: replace PCI_DMA_BUS_IS_PHYS with a flag in struct dma_map_ops
The current PCI_DMA_BUS_IS_PHYS macro decides whether a dma
implementation is bound by the dma mask in the device, because it
directly maps to a physical address range (modulo an offset in the
device), or whether it is virtualized by an iommu and can map any
address (this includes virtual iommus like swiotlb).

The problem with this scheme is that it is per-architecture and not per
dma_map_ops instance, and we are growing more and more setups that have
multiple different dma operations in use on a single system, for which
this scheme can't provide a correct answer. Depending on the
architecture that means we currently get either a false positive or a
false negative.

This patch instead adds a new is_iommu flag in struct dma_map_ops that
tells whether a dma_map_ops instance can map any possible physical
address, replacing both the PCI_DMA_BUS_IS_PHYS macro and the is_phys
field in struct dma_map_ops used by it on a few architectures.

Note that this means that we now need a struct device parent in the
Scsi_Host or netdevice. Every modern driver has these, but there might
still be a few outdated legacy drivers out there, which now won't make
an intelligent decision.
Signed-off-by: Christoph Hellwig --- arch/alpha/include/asm/pci.h | 5 - arch/alpha/kernel/pci_iommu.c| 1 + arch/arc/include/asm/pci.h | 6 -- arch/arm/include/asm/pci.h | 7 --- arch/arm64/include/asm/pci.h | 5 - arch/arm64/mm/dma-mapping.c | 2 ++ arch/cris/include/asm/pci.h | 6 -- arch/h8300/include/asm/pci.h | 2 -- arch/hexagon/kernel/dma.c| 1 - arch/ia64/hp/common/sba_iommu.c | 4 +--- arch/ia64/include/asm/pci.h | 17 - arch/ia64/kernel/pci-swiotlb.c | 1 + arch/ia64/kernel/setup.c | 12 arch/ia64/sn/kernel/io_common.c | 5 - arch/ia64/sn/pci/pci_dma.c | 1 + arch/m68k/include/asm/pci.h | 6 -- arch/microblaze/include/asm/pci.h| 6 -- arch/mips/include/asm/pci.h | 7 --- arch/mn10300/include/asm/pci.h | 6 -- arch/parisc/include/asm/pci.h| 23 --- arch/parisc/kernel/setup.c | 5 - arch/powerpc/include/asm/pci.h | 18 -- arch/powerpc/kernel/dma-iommu.c | 1 + arch/powerpc/kernel/dma-swiotlb.c| 1 + arch/powerpc/platforms/cell/iommu.c | 1 + arch/powerpc/platforms/ps3/system-bus.c | 2 ++ arch/powerpc/platforms/pseries/ibmebus.c | 1 + arch/powerpc/platforms/pseries/vio.c | 1 + arch/riscv/include/asm/pci.h | 3 --- arch/s390/include/asm/pci.h | 2 -- arch/s390/pci/pci_dma.c | 3 +-- arch/sh/include/asm/pci.h| 6 -- arch/sh/kernel/dma-nommu.c | 1 - arch/sparc/include/asm/pci_32.h | 4 arch/sparc/include/asm/pci_64.h | 6 -- arch/sparc/kernel/iommu.c| 1 + arch/sparc/kernel/ioport.c | 1 + arch/sparc/kernel/pci_sun4v.c| 1 + arch/tile/include/asm/pci.h | 14 -- arch/tile/kernel/pci-dma.c | 2 ++ arch/x86/include/asm/pci.h | 2 -- arch/x86/kernel/amd_gart_64.c| 1 + arch/x86/kernel/pci-calgary_64.c | 1 + arch/x86/kernel/pci-nommu.c | 1 - arch/x86/kernel/pci-swiotlb.c| 1 + arch/x86/mm/mem_encrypt.c| 1 + arch/x86/pci/sta2x11-fixup.c | 1 + arch/xtensa/include/asm/pci.h| 7 --- drivers/ide/ide-lib.c| 5 ++--- drivers/ide/ide-probe.c | 2 +- drivers/iommu/amd_iommu.c| 1 + drivers/iommu/intel-iommu.c | 1 + drivers/parisc/ccio-dma.c| 3 +-- drivers/parisc/sba_iommu.c | 3 +-- drivers/pci/host/vmd.c | 1 + 
drivers/scsi/scsi_lib.c | 14 ++ drivers/xen/swiotlb-xen.c| 1 + include/asm-generic/pci.h| 8 include/linux/dma-mapping.h | 16 +++- lib/dma-virt.c | 1 + net/core/dev.c | 18 -- tools/virtio/linux/dma-mapping.h | 2 -- 62 files changed, 63 insertions(+), 225 deletions(-) diff --git a/arch/alpha/include/asm/pci.h b/arch/alpha/include/asm/pci.h index b9ec55351924..cf6bc1e64d66 100644 --- a/arch/alpha/include/asm/pci.h +++ b/arch/alpha/include/asm/pci.h @@ -56,11 +56,6 @@ struct pci_controller { /* IOMMU controls. */ -/* The PCI address space does not equal the physical memory address space. - The networking and block device layers use this boolean for bounce buffer - decisions. */ -#define PCI_DMA_BUS_IS_PHYS 0 - /* TODO: integrate
replace PCI_DMA_BUS_IS_PHYS with a per-instance flag
Hi all, this small series tries to get rid of the global and misnamed PCI_DMA_BUS_IS_PHYS flag, and replace it with a setting in each struct dma_map_ops instance.
Re: [PATCH] um: Convert ubd driver to blk-mq
On Sun, Dec 03, 2017 at 10:49:23PM +0100, Richard Weinberger wrote: > Convert the driver to the modern blk-mq framework. > As byproduct we get rid of our open coded restart logic and let > blk-mq handle it. > > Signed-off-by: Richard Weinberger > --- > arch/um/drivers/ubd_kern.c | 178 > +++-- > 1 file changed, 93 insertions(+), 85 deletions(-) > > diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c > index b55fe9bf5d3e..deceb8022a28 100644 > --- a/arch/um/drivers/ubd_kern.c > +++ b/arch/um/drivers/ubd_kern.c > @@ -23,6 +23,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -142,7 +143,6 @@ struct cow { > #define MAX_SG 64 > > struct ubd { > - struct list_head restart; > /* name (and fd, below) of the file opened for writing, either the >* backing or the cow file. */ > char *file; > @@ -156,9 +156,12 @@ struct ubd { > struct cow cow; > struct platform_device pdev; > struct request_queue *queue; > + struct blk_mq_tag_set tag_set; > spinlock_t lock; > +}; > + > +struct ubd_pdu { > struct scatterlist sg[MAX_SG]; > - struct request *request; > int start_sg, end_sg; > sector_t rq_pos; > }; > @@ -182,10 +185,6 @@ struct ubd { > .shared = 0, \ > .cow = DEFAULT_COW, \ > .lock = __SPIN_LOCK_UNLOCKED(ubd_devs.lock), \ > - .request = NULL, \ > - .start_sg = 0, \ > - .end_sg = 0, \ > - .rq_pos = 0, \ > } > > /* Protected by ubd_lock */ > @@ -196,6 +195,12 @@ static int fake_ide = 0; > static struct proc_dir_entry *proc_ide_root = NULL; > static struct proc_dir_entry *proc_ide = NULL; > > +static blk_status_t ubd_queue_rq(struct blk_mq_hw_ctx *hctx, > + const struct blk_mq_queue_data *bd); > +static int ubd_init_request(struct blk_mq_tag_set *set, > + struct request *req, unsigned int hctx_idx, > + unsigned int numa_node); > + > static void make_proc_ide(void) > { > proc_ide_root = proc_mkdir("ide", NULL); > @@ -448,11 +453,8 @@ __uml_help(udb_setup, > "in the boot output.\n\n" > ); > > -static void do_ubd_request(struct 
request_queue * q); > - > /* Only changed by ubd_init, which is an initcall. */ > static int thread_fd = -1; > -static LIST_HEAD(restart); > > /* Function to read several request pointers at a time > * handling fractional reads if (and as) needed > @@ -510,9 +512,6 @@ static int bulk_req_safe_read( > /* Called without dev->lock held, and only in interrupt context. */ > static void ubd_handler(void) > { > - struct ubd *ubd; > - struct list_head *list, *next_ele; > - unsigned long flags; > int n; > int count; > > @@ -532,23 +531,17 @@ static void ubd_handler(void) > return; > } > for (count = 0; count < n/sizeof(struct io_thread_req *); > count++) { > - blk_end_request( > - (*irq_req_buffer)[count]->req, > - BLK_STS_OK, > - (*irq_req_buffer)[count]->length > - ); > - kfree((*irq_req_buffer)[count]); > + struct io_thread_req *io_req = (*irq_req_buffer)[count]; > + int err = io_req->error ? BLK_STS_IOERR : BLK_STS_OK; > + > + if (!blk_update_request(io_req->req, err, > io_req->length)) > + __blk_mq_end_request(io_req->req, err); > + > + kfree(io_req); > } > } > - reactivate_fd(thread_fd, UBD_IRQ); > > - list_for_each_safe(list, next_ele, &restart){ > - ubd = container_of(list, struct ubd, restart); > - list_del_init(&ubd->restart); > - spin_lock_irqsave(&ubd->lock, flags); > - do_ubd_request(ubd->queue); > - spin_unlock_irqrestore(&ubd->lock, flags); > - } > + reactivate_fd(thread_fd, UBD_IRQ); > } > > static irqreturn_t ubd_intr(int irq, void *dev) > @@ -869,6 +862,7 @@ static void ubd_device_release(struct device *dev) > struct ubd *ubd_dev = dev_get_drvdata(dev); > > blk_cleanup_queue(ubd_dev->queue); > + blk_mq_free_tag_set(&ubd_dev->tag_set); > *ubd_dev = ((struct ubd) DEFAULT_UBD); > } > > @@ -911,6 +905,11 @@ static int ubd_disk_register(int major, u64 size, int > unit, > > #define ROUND_BLOCK(n) ((n + ((1 << 9) - 1)) & (-1 << 9)) > > +static const struct blk_mq_ops ubd_mq_ops = { > + .queue_rq = ubd_queue_rq, > + .init_request = ubd_init_request, > +}; Use 
tables to align the "=" signs? > + blk_mq_start_request(req); > + > + pdu->rq_pos = blk_rq_pos(req); > + pdu->start_sg = 0; > + pdu->end
Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)
On Wed, Nov 29, 2017 at 08:18:09PM +0100, Christian Borntraeger wrote:
> Works fine under KVM with virtio-blk, but still hangs during boot in an LPAR.
> FWIW, the system not only has scsi disks via fcp but also DASDs as a boot
> disk.
> Seems that this is the place where the system stops. (see the sysrq-t output
> at the bottom).

Can you check which of the patches in the tree is the culprit?
[PATCH 2/5] aio: Export aio_nr_lock and aio_max_nr initial value to include/linux/aio.h
Next patch will use the values in more files, so let's make them
visible externally.

Signed-off-by: Kirill Tkhai
---
 fs/aio.c            | 4 ++--
 include/linux/aio.h | 4 ++++
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 04209c0561b2..9dc98a29077c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -188,10 +188,10 @@ struct aio_kiocb {
 	struct eventfd_ctx	*ki_eventfd;
 };
 
+DEFINE_SPINLOCK(aio_nr_lock);
 /*-- sysctl variables*/
-static DEFINE_SPINLOCK(aio_nr_lock);
 unsigned long aio_nr;		/* current system wide number of aio requests */
-unsigned long aio_max_nr = 0x1; /* system wide maximum number of aio requests */
+unsigned long aio_max_nr = AIO_NR_DEF; /* system wide maximum number of aio requests */
 /*end sysctl variables---*/
 
 static struct kmem_cache	*kiocb_cachep;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 9d8aabecfe2d..5dda2663802f 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -10,6 +10,10 @@ struct mm_struct;
 
 #define KIOCB_KEY		0
 
+#define AIO_NR_INF		(-1UL)
+#define AIO_NR_DEF		0x1
+
+extern spinlock_t aio_nr_lock;
 typedef int (kiocb_cancel_fn)(struct kiocb *);
 
 /* prototypes */
[PATCH 1/5] aio: Move aio_nr increment to separate function
There is no functional changes, only a preparation for next patches. Signed-off-by: Kirill Tkhai --- fs/aio.c | 44 1 file changed, 32 insertions(+), 12 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index e6de7715228c..04209c0561b2 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -694,13 +694,39 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm) } } -static void aio_nr_sub(unsigned nr) +static bool __try_to_charge_aio_nr(unsigned nr) +{ + if (aio_nr + nr > aio_max_nr || + aio_nr + nr < aio_nr) + return false; + + aio_nr += nr; + return true; +} + +static void __uncharge_aio_nr(unsigned nr) { - spin_lock(&aio_nr_lock); if (WARN_ON(aio_nr - nr > aio_nr)) aio_nr = 0; else aio_nr -= nr; +} + +static bool try_to_charge_aio_nr(unsigned nr) +{ + bool ret; + + spin_lock(&aio_nr_lock); + ret = __try_to_charge_aio_nr(nr); + spin_unlock(&aio_nr_lock); + + return ret; +} + +static void uncharge_aio_nr(unsigned nr) +{ + spin_lock(&aio_nr_lock); + __uncharge_aio_nr(nr); spin_unlock(&aio_nr_lock); } @@ -776,15 +802,9 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) ctx->req_batch = 1; /* limit the number of system wide aios */ - spin_lock(&aio_nr_lock); - if (aio_nr + ctx->max_reqs > aio_max_nr || - aio_nr + ctx->max_reqs < aio_nr) { - spin_unlock(&aio_nr_lock); - err = -EAGAIN; + err = -EAGAIN; + if (!try_to_charge_aio_nr(ctx->max_reqs)) goto err_ctx; - } - aio_nr += ctx->max_reqs; - spin_unlock(&aio_nr_lock); percpu_ref_get(&ctx->users);/* io_setup() will drop this ref */ percpu_ref_get(&ctx->reqs); /* free_ioctx_users() will drop this */ @@ -801,7 +821,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) return ctx; err_cleanup: - aio_nr_sub(ctx->max_reqs); + uncharge_aio_nr(ctx->max_reqs); err_ctx: atomic_set(&ctx->dead, 1); if (ctx->mmap_size) @@ -848,7 +868,7 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx, * -EAGAIN with no ioctxs actually in use (as far as userspace * could tell). 
*/ - aio_nr_sub(ctx->max_reqs); + uncharge_aio_nr(ctx->max_reqs); if (ctx->mmap_size) vm_munmap(ctx->mmap_base, ctx->mmap_size);
[PATCH 5/5] blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr
Add a file to configure the per-cgroup maximum number of available aio
requests. Refuse to write values which are less than the currently
allocated number of requests. Show the numbers of currently allocated
and maximum available requests.

Signed-off-by: Kirill Tkhai
---
 block/blk-cgroup.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 9cc6e9574946..dc5600ef4523 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -981,12 +981,52 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
 	return 0;
 }
 
+#ifdef CONFIG_AIO
+static int blkcg_write_aio_max_nr(struct cgroup_subsys_state *css,
+				  struct cftype *cftype, u64 val)
+{
+	struct blkcg *blkg = css_to_blkcg(css);
+	int ret = 0;
+
+	percpu_down_read(&cgroup_threadgroup_rwsem);
+	spin_lock(&aio_nr_lock);
+	if (val >= blkg->blkg_aio_nr)
+		blkg->blkg_aio_max_nr = val;
+	else
+		ret = -EBUSY;
+	spin_unlock(&aio_nr_lock);
+	percpu_up_read(&cgroup_threadgroup_rwsem);
+	return ret;
+}
+
+static int blkcg_show_aio_nrs(struct seq_file *sf, void *v)
+{
+	struct blkcg *blkg = css_to_blkcg(seq_css(sf));
+	unsigned long max_nr, nr;
+
+	spin_lock(&aio_nr_lock);
+	max_nr = blkg->blkg_aio_max_nr;
+	nr = blkg->blkg_aio_nr;
+	spin_unlock(&aio_nr_lock);
+
+	seq_printf(sf, "used=%lu, max=%lu\n", nr, max_nr);
+	return 0;
+}
+#endif /* CONFIG_AIO */
+
 static struct cftype blkcg_files[] = {
 	{
 		.name = "stat",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = blkcg_print_stat,
 	},
+#ifdef CONFIG_AIO
+	{
+		.name = "aio_nr",
+		.write_u64 = blkcg_write_aio_max_nr,
+		.seq_show = blkcg_show_aio_nrs,
+	},
+#endif
 	{ }	/* terminate */
 };
[PATCH 4/5] blkcg: Charge aio requests in blkio cgroup hierarchy
This patch adds accounting of number of requests of allocated aio contexts per blkio cgroup, and aggregates child cgroups requests up the hierarhy. This may be used to limit aio requests available for containers. By default, newly allocated blkcg::blkg_aio_max_nr is set to "unlimited" value (see blkcg_css_alloc() in previous patch). This guarantees that applications, which do not know about blkcg::blkg_aio_max_nr, will be able to run like they used to do before, without configuring child cgroup's blkg_aio_max_nr. For protection "task attach" vs "io_context create/destroy" read locked cgroup_threadgroup_rwsem is used. We take it via cgroup_threadgroup_change_*() interfaces, which are used around the places we charge kioctx::max_reqs and link a ctx to mm_struct::ioctx_table. Single allocation are protected aio_nr_lock like it used before. Signed-off-by: Kirill Tkhai --- block/blk-cgroup.c | 44 +- fs/aio.c| 101 +++ include/linux/aio.h | 11 ++ 3 files changed, 153 insertions(+), 3 deletions(-) diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 774560469b01..9cc6e9574946 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -1217,8 +1217,8 @@ void blkcg_exit_queue(struct request_queue *q) */ static int blkcg_can_attach(struct cgroup_taskset *tset) { - struct task_struct *task; - struct cgroup_subsys_state *dst_css; + struct cgroup_subsys_state *dst_css, *old_css; + struct task_struct *task, *p; struct io_context *ioc; int ret = 0; @@ -1230,11 +1230,46 @@ static int blkcg_can_attach(struct cgroup_taskset *tset) ret = -EINVAL; task_unlock(task); if (ret) - break; + goto err; + if (!thread_group_leader(task)) + continue; + ret = charge_task_aio_nr(task, css_to_blkcg(dst_css)); + if (ret) + goto err; + old_css = task_css(task, io_cgrp_id); + uncharge_task_aio_nr(task, css_to_blkcg(old_css)); + } +err: + if (ret) { + cgroup_taskset_for_each(p, dst_css, tset) { + if (p == task) + break; + if (!thread_group_leader(p)) + continue; + uncharge_task_aio_nr(p, 
css_to_blkcg(dst_css)); + old_css = task_css(p, io_cgrp_id); + WARN_ON_ONCE(charge_task_aio_nr(p, css_to_blkcg(old_css))); + } } return ret; } +#ifdef CONFIG_AIO +static void blkcg_cancel_attach(struct cgroup_taskset *tset) +{ + struct cgroup_subsys_state *dst_css, *old_css; + struct task_struct *p; + + cgroup_taskset_for_each(p, dst_css, tset) { + if (!thread_group_leader(p)) + continue; + uncharge_task_aio_nr(p, css_to_blkcg(dst_css)); + old_css = task_css(p, io_cgrp_id); + WARN_ON_ONCE(charge_task_aio_nr(p, css_to_blkcg(old_css))); + } +} +#endif + static void blkcg_bind(struct cgroup_subsys_state *root_css) { int i; @@ -1260,6 +1295,9 @@ struct cgroup_subsys io_cgrp_subsys = { .css_offline = blkcg_css_offline, .css_free = blkcg_css_free, .can_attach = blkcg_can_attach, +#ifdef CONFIG_AIO + .cancel_attach = blkcg_cancel_attach, +#endif .bind = blkcg_bind, .dfl_cftypes = blkcg_files, .legacy_cftypes = blkcg_legacy_files, diff --git a/fs/aio.c b/fs/aio.c index 755f97a42ebe..2e63f5c582c0 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -40,6 +40,7 @@ #include #include #include +#include #include #include @@ -696,6 +697,97 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm) } } +#ifdef CONFIG_BLK_CGROUP +static bool try_to_charge_blkcg(unsigned long nr, struct blkcg *blkg) +{ + struct blkcg *tmp = blkg; + + while (blkg) { + if (nr + blkg->blkg_aio_nr > blkg->blkg_aio_max_nr || + nr + blkg->blkg_aio_nr < nr) + goto fail; + + blkg->blkg_aio_nr += nr; + blkg = blkcg_parent(blkg); + } + + return true; +fail: + while (tmp != blkg) { + tmp->blkg_aio_nr -= nr; + tmp = blkcg_parent(tmp); + } + return false; +} + + +static void uncharge_blkcg(unsigned long nr, struct blkcg *blkg) +{ + while (blkg) { + blkg->blkg_aio_nr -= nr; + blkg = blkcg_parent(blkg); + } +} + +static bool __try_to_charge_aio_nr(unsigned nr) +{ + struct blkcg *blkg; + + percpu_rwsem_assert_held(&cgroup_threadgroup_rwsem); + blkg = container_of(task_css_check(current, io_cgrp_id, true), +
[PATCH 3/5] blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr
This adds new members to struct blkcg, which will be used to account the number of a cgroup's aio requests. Also, blkcg_root is used to store the sysctl variables aio_nr and aio_max_nr.

Signed-off-by: Kirill Tkhai
---
 block/blk-cgroup.c         | 4 ++++
 fs/aio.c                   | 2 ++
 include/linux/aio.h        | 6 ++++++
 include/linux/blk-cgroup.h | 4 ++++
 4 files changed, 16 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d3f56baee936..774560469b01 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -27,6 +27,7 @@
 #include
 #include
 #include
+#include
 #include "blk.h"
 #define MAX_KEY_LEN 100
@@ -1101,6 +1102,9 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
         INIT_HLIST_HEAD(&blkcg->blkg_list);
 #ifdef CONFIG_CGROUP_WRITEBACK
         INIT_LIST_HEAD(&blkcg->cgwb_list);
+#endif
+#ifdef CONFIG_AIO
+        blkcg->blkg_aio_max_nr = parent_css ? AIO_NR_INF : AIO_NR_DEF;
 #endif
         list_add_tail(&blkcg->all_blkcgs_node, &all_blkcgs);
diff --git a/fs/aio.c b/fs/aio.c
index 9dc98a29077c..755f97a42ebe 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -189,10 +189,12 @@ struct aio_kiocb {
 };
 DEFINE_SPINLOCK(aio_nr_lock);
+#ifndef CONFIG_BLK_CGROUP
 /*-- sysctl variables*/
 unsigned long aio_nr;           /* current system wide number of aio requests */
 unsigned long aio_max_nr = AIO_NR_DEF; /* system wide maximum number of aio requests */
 /*end sysctl variables---*/
+#endif
 static struct kmem_cache *kiocb_cachep;
 static struct kmem_cache *kioctx_cachep;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 5dda2663802f..de929a8c9c59 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -2,6 +2,7 @@
 #ifndef __LINUX__AIO_H
 #define __LINUX__AIO_H
+#include
 #include
 struct kioctx;
@@ -26,8 +27,13 @@ static inline void kiocb_set_cancel_fn(struct kiocb *req,
                                        kiocb_cancel_fn *cancel) { }
 #endif /* CONFIG_AIO */
+#if !defined(CONFIG_BLK_CGROUP) || !defined(CONFIG_AIO)
 /* for sysctl: */
 extern unsigned long aio_nr;
 extern unsigned long aio_max_nr;
+#else
+#define aio_nr blkcg_root.blkg_aio_nr
+#define aio_max_nr blkcg_root.blkg_aio_max_nr
+#endif /* !CONFIG_BLK_CGROUP || !CONFIG_AIO */
 #endif /* __LINUX__AIO_H */
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 8bbc3716507a..3d9c176fc173 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -55,6 +55,10 @@ struct blkcg {
 #ifdef CONFIG_CGROUP_WRITEBACK
         struct list_head        cgwb_list;
 #endif
+#ifdef CONFIG_AIO
+        unsigned long           blkg_aio_nr;
+        unsigned long           blkg_aio_max_nr;
+#endif
 };
 /*
[PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
Hi,

this patch set introduces accounting of aio_nr and aio_max_nr per blkio cgroup. It may be used to limit the number of aio requests available to a cgroup, and could be useful for containers.

The accounting is hierarchical, and aio contexts allocated in a child cgroup are accounted in the parent cgroups too. Also, a new cgroup file "io.aio_nr" is introduced. It's used to set a cgroup's aio request limit, and to show the current limit and the number of currently occupied requests.

Patches 1-3 are refactoring. Patch 4 is where the accounting is actually introduced. Patch 5 adds the "io.aio_nr" file.

---

Kirill Tkhai (5):
      aio: Move aio_nr increment to separate function
      aio: Export aio_nr_lock and aio_max_nr initial value to include/linux/aio.h
      blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr
      blkcg: Charge aio requests in blkio cgroup hierarchy
      blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr

 block/blk-cgroup.c         |  88 +-
 fs/aio.c                   | 151 
 include/linux/aio.h        |  21 ++
 include/linux/blk-cgroup.h |   4 +
 4 files changed, 247 insertions(+), 17 deletions(-)

--
Signed-off-by: Kirill Tkhai
Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()
On Sun, 2017-12-03 at 00:31 +0800, Ming Lei wrote:
> Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for blk-mq")

It might be safer to revert commit 0df21c86bdbf instead of trying to fix all
issues introduced by that commit for kernel version v4.15 ...

Bart.
Re: [PATCH V2 0/2] block: fix queue freeze and cleanup
On 12/01/2017 10:49 PM, Ming Lei wrote:

Unfortunately v2 has been found to exhibit another problem, an Oops in the
systemd shutdown path for device-mapper LVM volumes at least, when calling
'q->request_fn(q)' if 'q->request_fn' is NULL.

That issue can be fixed easily in V2 and we understand DM's case clearly.

Yes. It just had not been mentioned upstream yet.

But the most important thing is that V2 can't fix Chenxiang's issue, but
V3/V1 can.

Agree.

-- 
Mauricio Faria de Oliveira
IBM Linux Technology Center
Re: WARNING in kmalloc_slab (3)
On Sun, Dec 03, 2017 at 12:16:08PM -0800, Eric Biggers wrote:
> Looks like BLKTRACESETUP doesn't limit the '.buf_nr' parameter, allowing anyone
> who can open a block device to cause an extremely large kmalloc. Here's a
> simplified reproducer:
>

There are lots of places which allow people to allocate as much as they
want. With Syzkaller, you might want to just hard code a __GFP_NOWARN
in to disable it.

regards,
dan carpenter
Re: blk-mq + bfq: udevd hang on usb2 storages
On Fri, Dec 01, 2017 at 06:04:29PM +0100, Alban Browaeys wrote:
> I initially reported as https://bugzilla.kernel.org/show_bug.cgi?id=198023 .
>
> I have now bisected this issue to commit a6a252e6491443c1c1 "blk-mq-sched:
> decide how to handle flush rq via RQF_FLUSH_SEQ".
>
> This is with an USB stick Sandisk Cruzer (USB Version: 2.10) I regressed with.
> systemctl restart systemd-udevd restores sanity.
>
> PS: With an USB3 Lexar (USB Version: 3.00) I get more severe an issue
> (not bisected) where I find no way out of reboot. My report to bugzilla
> has logs when I was swapping between these keys. The logs attached there
> mix what looks like two different behaviors.

Hi Paolo,

From both Alban's trace and my trace, it looks like this issue is in BFQ,
since a request can't be retrieved via e->type->ops.mq.dispatch_request()
in blk_mq_do_dispatch_sched() after it is inserted into BFQ's queue.

https://bugzilla.kernel.org/show_bug.cgi?id=198023#c4
https://marc.info/?l=linux-block&m=151214241518562&w=2

BTW, I have tried to reproduce the issue with scsi_debug, but did not
succeed, and it can't be reproduced with other schedulers (mq-deadline,
none) either.

So could you take a look?

Thanks,
Ming
[PATCH V2] block, bfq: remove batches of confusing ifdefs
Commit a33801e8b473 ("block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP") introduced two batches of confusing ifdefs: one reported in [1], plus a similar one in another function. This commit removes both batches, in the way suggested in [1].

[1] https://www.spinics.net/lists/linux-block/msg20043.html

Fixes: a33801e8b473 ("block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP")
Reported-by: Linus Torvalds
Tested-by: Luca Miccio
Signed-off-by: Paolo Valente
---
 block/bfq-iosched.c | 127 +---
 1 file changed, 72 insertions(+), 55 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index bcb6d21..3feed64 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3689,35 +3689,16 @@ static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
         return rq;
 }
-static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
-{
-        struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
-        struct request *rq;
 #if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
-        struct bfq_queue *in_serv_queue, *bfqq;
-        bool waiting_rq, idle_timer_disabled;
-#endif
-
-        spin_lock_irq(&bfqd->lock);
-
-#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
-        in_serv_queue = bfqd->in_service_queue;
-        waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);
-
-        rq = __bfq_dispatch_request(hctx);
-
-        idle_timer_disabled =
-                waiting_rq && !bfq_bfqq_wait_request(in_serv_queue);
-
-#else
-        rq = __bfq_dispatch_request(hctx);
-#endif
-        spin_unlock_irq(&bfqd->lock);
+static void bfq_update_dispatch_stats(struct request_queue *q,
+                                      struct request *rq,
+                                      struct bfq_queue *in_serv_queue,
+                                      bool idle_timer_disabled)
+{
+        struct bfq_queue *bfqq = rq ? RQ_BFQQ(rq) : NULL;
-#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
-        bfqq = rq ? RQ_BFQQ(rq) : NULL;
         if (!idle_timer_disabled && !bfqq)
-                return rq;
+                return;
         /*
          * rq and bfqq are guaranteed to exist until this function
@@ -3732,7 +3713,7 @@ static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
          * In addition, the following queue lock guarantees that
          * bfqq_group(bfqq) exists as well.
          */
-        spin_lock_irq(hctx->queue->queue_lock);
+        spin_lock_irq(q->queue_lock);
         if (idle_timer_disabled)
                 /*
                  * Since the idle timer has been disabled,
@@ -3751,9 +3732,37 @@ static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
                 bfqg_stats_set_start_empty_time(bfqg);
                 bfqg_stats_update_io_remove(bfqg, rq->cmd_flags);
         }
-        spin_unlock_irq(hctx->queue->queue_lock);
+        spin_unlock_irq(q->queue_lock);
+}
+#else
+static inline void bfq_update_dispatch_stats(struct request_queue *q,
+                                             struct request *rq,
+                                             struct bfq_queue *in_serv_queue,
+                                             bool idle_timer_disabled) {}
 #endif
+static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+        struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
+        struct request *rq;
+        struct bfq_queue *in_serv_queue;
+        bool waiting_rq, idle_timer_disabled;
+
+        spin_lock_irq(&bfqd->lock);
+
+        in_serv_queue = bfqd->in_service_queue;
+        waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);
+
+        rq = __bfq_dispatch_request(hctx);
+
+        idle_timer_disabled =
+                waiting_rq && !bfq_bfqq_wait_request(in_serv_queue);
+
+        spin_unlock_irq(&bfqd->lock);
+
+        bfq_update_dispatch_stats(hctx->queue, rq, in_serv_queue,
+                                  idle_timer_disabled);
+
         return rq;
 }
@@ -4276,16 +4285,46 @@ static bool __bfq_insert_request(struct bfq_data *bfqd, struct request *rq)
         return idle_timer_disabled;
 }
+#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
+static void bfq_update_insert_stats(struct request_queue *q,
+                                    struct bfq_queue *bfqq,
+                                    bool idle_timer_disabled,
+                                    unsigned int cmd_flags)
+{
+        if (!bfqq)
+                return;
+
+        /*
+         * bfqq still exists, because it can disappear only after
+         * either it is merged with another queue, or the process it
+         * is associated with exits. But both actions must be taken by
+         * the same process currently executing this flow of
+         * instructions.
+         *
+         * In addition, the following queue
Re: [PATCH] generic/473: test return EBUSY from BLKRRPART for mounted whole-dev
On 2017/12/04 17:25, Eryu Guan wrote:

On Mon, Dec 04, 2017 at 05:15:17PM +0800, Xiao Yang wrote:

On 2017/12/04 16:29, Eryu Guan wrote:

On Wed, Nov 29, 2017 at 08:02:26PM +0800, Xiao Yang wrote:

If the entire block device is formatted with a filesystem and mounted,
running "blockdev --rereadpt" should fail and return EBUSY instead of pass.

Signed-off-by: Xiao Yang

As we have blktests[1] now, I think this may fit in blktests better?

Hi Eryu,

Do you think test cases which use the scsi_debug module should be moved into
blktests? (e.g. generic/108, generic/349, generic/350, generic/351)

I don't think they need to be moved to blktests. Most other tests that
make use of scsi_debug are for filesystem testing, e.g. generic/108.

generic/349 and generic/35[01] are a bit special, they were there before
blktests was announced available, so they're in a special blockdev group
and not in the auto group. If Omar agrees, I think they can be ported to
blktests.

Hi Eryu,

Thanks for your explanation, and I will try to send it to blktests.

Thanks,
Xiao Yang

Thanks,
Eryu
Re: WARNING in kmalloc_slab (3)
On Mon, Dec 04, 2017 at 09:18:05AM +0100, Dmitry Vyukov wrote:
> On Mon, Dec 4, 2017 at 9:14 AM, Dan Carpenter wrote:
> > On Sun, Dec 03, 2017 at 12:16:08PM -0800, Eric Biggers wrote:
> >> Looks like BLKTRACESETUP doesn't limit the '.buf_nr' parameter, allowing anyone
> >> who can open a block device to cause an extremely large kmalloc. Here's a
> >> simplified reproducer:
> >>
> >
> > There are lots of places which allow people to allocate as much as they
> > want. With Syzkaller, you might want to just hard code a __GFP_NOWARN
> > in to disable it.
>
> Hi,
>
> Hard code it where? My idea was to just make warn_alloc() a no-op.
>
> User-controllable allocations are supposed to use __GFP_NOWARN.

No, that's not right. What we don't want is unprivileged users using all
the memory, and we don't want unprivileged users spamming /var/log/messages.
But you have to have slightly elevated permissions to open block devices,
right? The warning is helpful. Admins should "don't do that" if they don't
want the warning.

The kernel really isn't designed to work with Oops on Warn.

I try to tell people simple things like not printing a warning when
copy_from_user() fails, because I don't want /var/log/messages to get
spammed. But there are lots and lots of places which generate warnings.

regards,
dan carpenter
Re: [PATCH] generic/473: test return EBUSY from BLKRRPART for mounted whole-dev
On Mon, Dec 04, 2017 at 05:15:17PM +0800, Xiao Yang wrote:
> On 2017/12/04 16:29, Eryu Guan wrote:
> > On Wed, Nov 29, 2017 at 08:02:26PM +0800, Xiao Yang wrote:
> > > If the entire block device is formatted with a filesystem and
> > > mounted, running "blockdev --rereadpt" should fail and return
> > > EBUSY instead of pass.
> > >
> > > Signed-off-by: Xiao Yang
> > As we have blktests[1] now, I think this may fit in blktests better?
> Hi Eryu,
>
> Do you think test cases which use scsi_debug module should be moved into
> blktests? (e.g. generic/108, generic/349, generic/350, generic/351)

I don't think they need to be moved to blktests. Most other tests that
make use of scsi_debug are for filesystem testing, e.g. generic/108.

generic/349 and generic/35[01] are a bit special, they were there before
blktests was announced available, so they're in a special blockdev group
and not in the auto group. If Omar agrees, I think they can be ported to
blktests.

Thanks,
Eryu
Re: [PATCH] generic/473: test return EBUSY from BLKRRPART for mounted whole-dev
On 2017/12/04 16:29, Eryu Guan wrote:

On Wed, Nov 29, 2017 at 08:02:26PM +0800, Xiao Yang wrote:

If the entire block device is formatted with a filesystem and mounted,
running "blockdev --rereadpt" should fail and return EBUSY instead of pass.

Signed-off-by: Xiao Yang

As we have blktests[1] now, I think this may fit in blktests better?

Hi Eryu,

Do you think test cases which use the scsi_debug module should be moved into
blktests? (e.g. generic/108, generic/349, generic/350, generic/351)

Thanks,
Xiao Yang

Thanks,
Eryu

[1] https://github.com/osandov/blktests

---
 tests/generic/473     | 83 +++
 tests/generic/473.out |  2 ++
 tests/generic/group   |  1 +
 3 files changed, 86 insertions(+)
 create mode 100755 tests/generic/473
 create mode 100644 tests/generic/473.out

diff --git a/tests/generic/473 b/tests/generic/473
new file mode 100755
index 000..d7998cd
--- /dev/null
+++ b/tests/generic/473
@@ -0,0 +1,83 @@
+#! /bin/bash
+# FS QA Test No. 473
+#
+# Regression test for commit:
+# 77032ca ("Return EBUSY from BLKRRPART for mounted whole-dev fs")
+#
+# If the entire block device is formatted with a filesystem and
+# mounted, running "blockdev --rereadpt" should fail and return
+# EBUSY. On buggy kernel, it passes unexpectedly.
+#
+#---
+# Copyright (c) 2017 FUJITSU LIMITED. All rights reserved.
+# Author: Xiao Yang
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmpdir=`mktemp -d`
+status=1        # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 7 15
+
+_cleanup()
+{
+        # Umount
+        $UMOUNT_PROG $tmpdir >>$seqres.full 2>&1
+        # Destroy device
+        _put_scsi_debug_dev
+        rm -rf $tmpdir
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/scsi_debug
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+_supported_os Linux
+_require_scsi_debug
+
+# Create and format
+test_dev=`_get_scsi_debug_dev`
+_mkfs_dev $test_dev >>$seqres.full 2>&1
+
+# Mount and check mounted whole-dev
+_mount $test_dev $tmpdir
+
+out=$(blockdev --rereadpt $test_dev 2>&1)
+res=$?
+
+echo $out | grep -q "Unknown command" && \
+        _notrun "blockdev --rereadpt was not supported"
+
+[ $res -eq 0 ] && _fail "blockdev --rereadpt passed when checking mounted whole-dev"
+
+echo $out | grep -q "Device or resource busy" || \
+        _fail "blockdev --rereadpt returned unexpected error when checking mounted whole-dev"
+
+echo $out >> $seqres.full
+
+echo "Silence is golden"
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/473.out b/tests/generic/473.out
new file mode 100644
index 000..854fbcd
--- /dev/null
+++ b/tests/generic/473.out
@@ -0,0 +1,2 @@
+QA output created by 473
+Silence is golden
diff --git a/tests/generic/group b/tests/generic/group
index 6c3bb03..54b9404 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -472,3 +472,4 @@
 467 auto quick exportfs
 468 shutdown auto quick metadata
 469 auto quick
+473 auto quick blockdev
--
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe fstests" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()
On Mon, Dec 04, 2017 at 09:44:55AM +0100, Johannes Thumshirn wrote:
> Ming Lei writes:
> >
> > I am happy to do that, but recently I am very busy, so it may be done
> > a bit late by me.
> >
> > But anyone should reproduce the issue 100% with V4.15-rc kernel by just
> > running the above script, not any specific hardware is required at all,
> > so that means anyone can make a patch for blktest to test block/SCSI
> > timeout if he/she is interested in doing that.
>
> OK, let me see if I can spend some time on this the next days.

That is great, thanks!

--
Ming
Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()
Ming Lei writes:
> I am happy to do that, but recently I am very busy, so it may be done
> a bit late by me.
>
> But anyone should reproduce the issue 100% with V4.15-rc kernel by just
> running the above script, not any specific hardware is required at all,
> so that means anyone can make a patch for blktest to test block/SCSI
> timeout if he/she is interested in doing that.

OK, let me see if I can spend some time on this the next days.

Byte,
Johannes

-- 
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()
On Mon, Dec 04, 2017 at 09:19:33AM +0100, Johannes Thumshirn wrote:
>
> Hi Ming,
>
> Ming Lei writes:
> > This issue can be triggered by the following script:
> >
> > #!/bin/sh
> > rmmod scsi_debug
> > modprobe scsi_debug max_queue=1
> >
> > DEVICE=`ls -d /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/* | head -1 | xargs basename`
> >
> > DISK_DIR=`ls -d /sys/block/$DEVICE/device/scsi_disk/*`
> >
> > echo "using scsi device $DEVICE"
> > echo "-1" >/sys/bus/pseudo/drivers/scsi_debug/every_nth
> > echo starting loop $i
> > echo "temporary write through" >$DISK_DIR/cache_type
> > echo "128" >/sys/bus/pseudo/drivers/scsi_debug/opts
> > echo none > /sys/block/$DEVICE/queue/scheduler
> > dd if=/dev/$DEVICE of=/dev/null bs=1M iflag=direct count=1 &
> > sleep 5
> > echo "0" >/sys/bus/pseudo/drivers/scsi_debug/opts
> > wait
> > echo "SUCCESS"
>
> Can you please submit a test-case for blktest as well, given you have a
> nice reproducer?

Hi Johannes,

I am happy to do that, but recently I am very busy, so it may be done
a bit late by me.

But anyone should reproduce the issue 100% with V4.15-rc kernel by just
running the above script, not any specific hardware is required at all,
so that means anyone can make a patch for blktest to test block/SCSI
timeout if he/she is interested in doing that.

Thanks,
Ming
Re: [PATCH] generic/473: test return EBUSY from BLKRRPART for mounted whole-dev
On Wed, Nov 29, 2017 at 08:02:26PM +0800, Xiao Yang wrote:
> If the entire block device is formatted with a filesystem and
> mounted, running "blockdev --rereadpt" should fail and return
> EBUSY instead of pass.
>
> Signed-off-by: Xiao Yang

As we have blktests[1] now, I think this may fit in blktests better?

Thanks,
Eryu

[1] https://github.com/osandov/blktests

> ---
>  tests/generic/473     | 83 +++
>  tests/generic/473.out |  2 ++
>  tests/generic/group   |  1 +
>  3 files changed, 86 insertions(+)
>  create mode 100755 tests/generic/473
>  create mode 100644 tests/generic/473.out
>
> diff --git a/tests/generic/473 b/tests/generic/473
> new file mode 100755
> index 000..d7998cd
> --- /dev/null
> +++ b/tests/generic/473
> @@ -0,0 +1,83 @@
> +#! /bin/bash
> +# FS QA Test No. 473
> +#
> +# Regression test for commit:
> +# 77032ca ("Return EBUSY from BLKRRPART for mounted whole-dev fs")
> +#
> +# If the entire block device is formatted with a filesystem and
> +# mounted, running "blockdev --rereadpt" should fail and return
> +# EBUSY. On buggy kernel, it passes unexpectedly.
> +#
> +#---
> +# Copyright (c) 2017 FUJITSU LIMITED. All rights reserved.
> +# Author: Xiao Yang
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmpdir=`mktemp -d`
> +status=1        # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 7 15
> +
> +_cleanup()
> +{
> +        # Umount
> +        $UMOUNT_PROG $tmpdir >>$seqres.full 2>&1
> +        # Destroy device
> +        _put_scsi_debug_dev
> +        rm -rf $tmpdir
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/scsi_debug
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +_supported_os Linux
> +_require_scsi_debug
> +
> +# Create and format
> +test_dev=`_get_scsi_debug_dev`
> +_mkfs_dev $test_dev >>$seqres.full 2>&1
> +
> +# Mount and check mounted whole-dev
> +_mount $test_dev $tmpdir
> +
> +out=$(blockdev --rereadpt $test_dev 2>&1)
> +res=$?
> +
> +echo $out | grep -q "Unknown command" && \
> +        _notrun "blockdev --rereadpt was not supported"
> +
> +[ $res -eq 0 ] && _fail "blockdev --rereadpt passed when checking mounted whole-dev"
> +
> +echo $out | grep -q "Device or resource busy" || \
> +        _fail "blockdev --rereadpt returned unexpected error when checking mounted whole-dev"
> +
> +echo $out >> $seqres.full
> +
> +echo "Silence is golden"
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/473.out b/tests/generic/473.out
> new file mode 100644
> index 000..854fbcd
> --- /dev/null
> +++ b/tests/generic/473.out
> @@ -0,0 +1,2 @@
> +QA output created by 473
> +Silence is golden
> diff --git a/tests/generic/group b/tests/generic/group
> index 6c3bb03..54b9404 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -472,3 +472,4 @@
>  467 auto quick exportfs
>  468 shutdown auto quick metadata
>  469 auto quick
> +473 auto quick blockdev
> --
> 1.8.3.1
Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()
Hi Ming,

Ming Lei writes:
> This issue can be triggered by the following script:
>
> #!/bin/sh
> rmmod scsi_debug
> modprobe scsi_debug max_queue=1
>
> DEVICE=`ls -d /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/* | head -1 | xargs basename`
>
> DISK_DIR=`ls -d /sys/block/$DEVICE/device/scsi_disk/*`
>
> echo "using scsi device $DEVICE"
> echo "-1" >/sys/bus/pseudo/drivers/scsi_debug/every_nth
> echo starting loop $i
> echo "temporary write through" >$DISK_DIR/cache_type
> echo "128" >/sys/bus/pseudo/drivers/scsi_debug/opts
> echo none > /sys/block/$DEVICE/queue/scheduler
> dd if=/dev/$DEVICE of=/dev/null bs=1M iflag=direct count=1 &
> sleep 5
> echo "0" >/sys/bus/pseudo/drivers/scsi_debug/opts
> wait
> echo "SUCCESS"

Can you please submit a test-case for blktest as well, given you have a
nice reproducer?

Thanks,
Johannes

-- 
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: WARNING in kmalloc_slab (3)
On Mon, Dec 4, 2017 at 9:14 AM, Dan Carpenter wrote:
> On Sun, Dec 03, 2017 at 12:16:08PM -0800, Eric Biggers wrote:
>> Looks like BLKTRACESETUP doesn't limit the '.buf_nr' parameter, allowing anyone
>> who can open a block device to cause an extremely large kmalloc. Here's a
>> simplified reproducer:
>>
>
> There are lots of places which allow people to allocate as much as they
> want. With Syzkaller, you might want to just hard code a __GFP_NOWARN
> in to disable it.

Hi,

Hard code it where? My idea was to just make warn_alloc() a no-op.

User-controllable allocations are supposed to use __GFP_NOWARN.