[PATCH] SCSI: run queue if SCSI device queue isn't ready and queue is idle

2017-12-04 Thread Ming Lei
Before commit 0df21c86bdbf ("scsi: implement .get_budget and .put_budget
for blk-mq"), we would run the queue again after 3 ms if the queue was idle
and the SCSI device queue wasn't ready; this was done when handling
BLK_STS_RESOURCE. Since commit 0df21c86bdbf was introduced, the queue is
no longer run in this situation.

An IO hang is observed when a timeout happens, and this patch fixes the
hang by running the queue after a delay in scsi_dev_queue_ready, just like
the non-mq path. The issue can be triggered by the script below [1].

There is another issue that running the idle queue also covers: when
.get_budget() is called for a request coming from hctx->dispatch_list, and
another request completes exactly during .get_budget(), we can no longer
depend on SCSI's restart mechanism to make progress. This patch fixes that
race too.

With this patch, we basically restore the previous behaviour (before commit
0df21c86bdbf) of handling an idle queue when running out of resources.

[1] script for test/verify SCSI timeout
rmmod scsi_debug
modprobe scsi_debug max_queue=1

DEVICE=`ls -d /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/* | head -1 | xargs basename`
DISK_DIR=`ls -d /sys/block/$DEVICE/device/scsi_disk/*`

echo "using scsi device $DEVICE"
echo "-1" >/sys/bus/pseudo/drivers/scsi_debug/every_nth
echo "temporary write through" >$DISK_DIR/cache_type
echo "128" >/sys/bus/pseudo/drivers/scsi_debug/opts
echo none > /sys/block/$DEVICE/queue/scheduler
dd if=/dev/$DEVICE of=/dev/null bs=1M iflag=direct count=1 &
sleep 5
echo "0" >/sys/bus/pseudo/drivers/scsi_debug/opts
wait
echo "SUCCESS"

Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for blk-mq")
Signed-off-by: Ming Lei 
---
 drivers/scsi/scsi_lib.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index db9556662e27..1816dd8259b3 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1967,6 +1967,8 @@ static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
 out_put_device:
put_device(&sdev->sdev_gendev);
 out:
+   if (atomic_read(&sdev->device_busy) == 0 && !scsi_device_blocked(sdev))
+   blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
return false;
 }
 
-- 
2.9.5



Re: [PATCH v2 0/4] lockdep/crossrelease: Apply crossrelease to page locks

2017-12-04 Thread Matthew Wilcox
On Tue, Dec 05, 2017 at 03:19:46PM +0900, Byungchul Park wrote:
> On 12/5/2017 2:46 PM, Byungchul Park wrote:
> > On 12/5/2017 2:30 PM, Matthew Wilcox wrote:
> > > On Mon, Dec 04, 2017 at 02:16:19PM +0900, Byungchul Park wrote:
> > > > For now, wait_for_completion() / complete() works with lockdep, add
> > > > lock_page() / unlock_page() and its family to lockdep support.
> > > > 
> > > > Changes from v1
> > > >   - Move lockdep_map_cross outside of page_ext to make it flexible
> > > >   - Prevent allocating lockdep_map per page by default
> > > >   - Add a boot parameter allowing the allocation for debugging
> > > > 
> > > > Byungchul Park (4):
> > > >    lockdep: Apply crossrelease to PG_locked locks
> > > >    lockdep: Apply lock_acquire(release) on __Set(__Clear)PageLocked
> > > >    lockdep: Move data of CONFIG_LOCKDEP_PAGELOCK from page to page_ext
> > > >    lockdep: Add a boot parameter enabling to track page locks using
> > > >  lockdep and disable it by default
> > > 
> > > I don't like the way you've structured this patch series; first adding
> > > the lockdep map to struct page, then moving it to page_ext.
> > 
> > Hello,
> > 
> > I will make them into one patch.
> 
> I've thought about it more.
> 
> Actually I did it because I thought I'd better make it into two
> patches, since that makes it easier for reviewers. It doesn't matter
> which one I choose, but I prefer to split it.

I don't know whether it's better to make it all one patch or split it
into multiple patches.  But it makes no sense to introduce it in struct
page, then move it to struct page_ext.


Re: [PATCH] [PATCH v2] bcache: segregate flash only volume write streams from cached devices

2017-12-04 Thread Michael Lyle
Tang Junhui--

Hi

On 12/03/2017 10:11 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> Hello Mike & Coly
> 
> Could you please have a review of this patch?
> 
>> From: Tang Junhui 
>>
>> In a scenario where there are some flash-only volumes and some cached
>> devices, when many tasks issue requests to these devices in
>> writeback mode, the write IOs may fall into the same bucket, as below:
>> | cached data | flash data | cached data | cached data| flash data|
>> then, after writeback of these cached devices, the bucket would
>> look like this:
>> | free | flash data | free | free | flash data |
>>
>> So there is a lot of free space in this bucket, but since data of the
>> flash-only volumes still exists, the bucket cannot be reclaimed,
>> which wastes bucket space.
>>
>> In this patch, we segregate flash-only volume write streams from
>> cached devices, so data from flash-only volumes and cached devices
>> can be stored in different buckets.
>>
>> Compared to the v1 patch, this patch does not add an additional open
>> bucket list; it makes a best effort to segregate flash-only volume
>> write streams from cached devices. Sectors of flash-only volumes may
>> still be mixed with dirty sectors of cached devices, but the number
>> is very small.
>>
>> Signed-off-by: Tang Junhui 

LGTM, and I have added it to my staging tree, though it required
cleanpatch (executable bit, line endings).

Thanks,

Mike


Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()

2017-12-04 Thread Ming Lei
On Tue, Dec 05, 2017 at 01:16:24PM +0800, Ming Lei wrote:
> On Mon, Dec 04, 2017 at 11:48:07PM +, Holger Hoffstätte wrote:
> > On Tue, 05 Dec 2017 06:45:08 +0800, Ming Lei wrote:
> > 
> > > On Mon, Dec 04, 2017 at 03:09:20PM +, Bart Van Assche wrote:
> > >> On Sun, 2017-12-03 at 00:31 +0800, Ming Lei wrote:
> > >> > Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for blk-mq")
> > >> 
> > >> It might be safer to revert commit 0df21c86bdbf instead of trying to fix all
> > >> issues introduced by that commit for kernel version v4.15 ...
> > > 
> > > What are all the issues in v4.15-rc? Up to now, this is the only issue reported,
> > > and it can be fixed by this simple patch, which can be thought of as a cleanup
> > > too.
> > 
> > Even with this patch I've encountered at least one hang that
> > seemed related. I'm using most of block/scsi-4.15 on top of 4.14 and
> > the hang in question was on a rotating disk. It could be solved by activating
> > a different scheduler on the hanging device; all hanging sync/df processes got
> > unstuck and all was fine again, which leads me to believe that there is at least
> > one more rare condition where delaying requests (as done in the budget patch)
> > leads to a hang.
> > 
> > This happened with mq-deadline which I was testing specifically to avoid
> > any BFQ-related side effects.
> 
> OK, this looks like a new report.
> 
> Without any log we can't make any progress; we can't even guess
> what the issue is related to.
> 
> Could you post your dmesg log (including the hung process stack traces)? And
> dump the debugfs log with the following script when the hang happens?
> 
>   http://people.redhat.com/minlei/tests/tools/dump-blk-info
> 
> BTW, you just need to pass the disk name to the script, such as: /dev/sda.

Thinking about the issue further: this patch only covers the
scsi_set_blocked() case, but doesn't consider the case in which .get_budget()
is called inside blk_mq_dispatch_rq_list() for a request coming from
hctx->dispatch_list.

When .get_budget() is called from blk_mq_do_dispatch_sched() or
blk_mq_do_dispatch_ctx(), we don't need to run the queue if it is idle. But
when it is called from blk_mq_dispatch_rq_list() for a request coming from
hctx->dispatch_list, we have to run the queue if it is idle, as before.

So please ignore this patch; I will submit a V2 that covers both cases.

Thanks,
Ming


Re: [PATCH v2 0/4] lockdep/crossrelease: Apply crossrelease to page locks

2017-12-04 Thread Byungchul Park

On 12/5/2017 2:46 PM, Byungchul Park wrote:

On 12/5/2017 2:30 PM, Matthew Wilcox wrote:

On Mon, Dec 04, 2017 at 02:16:19PM +0900, Byungchul Park wrote:

For now, wait_for_completion() / complete() works with lockdep, add
lock_page() / unlock_page() and its family to lockdep support.

Changes from v1
  - Move lockdep_map_cross outside of page_ext to make it flexible
  - Prevent allocating lockdep_map per page by default
  - Add a boot parameter allowing the allocation for debugging

Byungchul Park (4):
   lockdep: Apply crossrelease to PG_locked locks
   lockdep: Apply lock_acquire(release) on __Set(__Clear)PageLocked
   lockdep: Move data of CONFIG_LOCKDEP_PAGELOCK from page to page_ext
   lockdep: Add a boot parameter enabling to track page locks using
 lockdep and disable it by default


I don't like the way you've structured this patch series; first adding
the lockdep map to struct page, then moving it to page_ext.


Hello,

I will make them into one patch.


I've thought about it more.

Actually I did it because I thought I'd better make it into two
patches, since that makes it easier for reviewers. It doesn't matter
which one I choose, but I prefer to split it.

But if you are strongly against it, then I will follow you.

--
Thanks,
Byungchul


Re: [PATCH v2 0/4] lockdep/crossrelease: Apply crossrelease to page locks

2017-12-04 Thread Byungchul Park

On 12/5/2017 2:30 PM, Matthew Wilcox wrote:

On Mon, Dec 04, 2017 at 02:16:19PM +0900, Byungchul Park wrote:

For now, wait_for_completion() / complete() works with lockdep, add
lock_page() / unlock_page() and its family to lockdep support.

Changes from v1
  - Move lockdep_map_cross outside of page_ext to make it flexible
  - Prevent allocating lockdep_map per page by default
  - Add a boot parameter allowing the allocation for debugging

Byungchul Park (4):
   lockdep: Apply crossrelease to PG_locked locks
   lockdep: Apply lock_acquire(release) on __Set(__Clear)PageLocked
   lockdep: Move data of CONFIG_LOCKDEP_PAGELOCK from page to page_ext
   lockdep: Add a boot parameter enabling to track page locks using
 lockdep and disable it by default


I don't like the way you've structured this patch series; first adding
the lockdep map to struct page, then moving it to page_ext.


Hello,

I will make them into one patch.


I also don't like it that you've made CONFIG_LOCKDEP_PAGELOCK not
individually selectable.  I might well want a kernel with crosslock
support, but only for completions.


OK then, I will make it individually selectable.

I want to know others' opinions as well.

Thank you for the opinions. I will apply yours in the next spin.

--
Thanks,
Byungchul


Re: [PATCH v2 0/4] lockdep/crossrelease: Apply crossrelease to page locks

2017-12-04 Thread Matthew Wilcox
On Mon, Dec 04, 2017 at 02:16:19PM +0900, Byungchul Park wrote:
> For now, wait_for_completion() / complete() works with lockdep, add
> lock_page() / unlock_page() and its family to lockdep support.
> 
> Changes from v1
>  - Move lockdep_map_cross outside of page_ext to make it flexible
>  - Prevent allocating lockdep_map per page by default
>  - Add a boot parameter allowing the allocation for debugging
> 
> Byungchul Park (4):
>   lockdep: Apply crossrelease to PG_locked locks
>   lockdep: Apply lock_acquire(release) on __Set(__Clear)PageLocked
>   lockdep: Move data of CONFIG_LOCKDEP_PAGELOCK from page to page_ext
>   lockdep: Add a boot parameter enabling to track page locks using
> lockdep and disable it by default

I don't like the way you've structured this patch series; first adding
the lockdep map to struct page, then moving it to page_ext.

I also don't like it that you've made CONFIG_LOCKDEP_PAGELOCK not
individually selectable.  I might well want a kernel with crosslock
support, but only for completions.


Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()

2017-12-04 Thread Ming Lei
On Mon, Dec 04, 2017 at 11:48:07PM +, Holger Hoffstätte wrote:
> On Tue, 05 Dec 2017 06:45:08 +0800, Ming Lei wrote:
> 
> > On Mon, Dec 04, 2017 at 03:09:20PM +, Bart Van Assche wrote:
> >> On Sun, 2017-12-03 at 00:31 +0800, Ming Lei wrote:
> >> > Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for blk-mq")
> >> 
> >> It might be safer to revert commit 0df21c86bdbf instead of trying to fix all
> >> issues introduced by that commit for kernel version v4.15 ...
> > 
> > What are all the issues in v4.15-rc? Up to now, this is the only issue reported,
> > and it can be fixed by this simple patch, which can be thought of as a cleanup
> > too.
> 
> Even with this patch I've encountered at least one hang that
> seemed related. I'm using most of block/scsi-4.15 on top of 4.14 and
> the hang in question was on a rotating disk. It could be solved by activating
> a different scheduler on the hanging device; all hanging sync/df processes got
> unstuck and all was fine again, which leads me to believe that there is at least
> one more rare condition where delaying requests (as done in the budget patch)
> leads to a hang.
> 
> This happened with mq-deadline which I was testing specifically to avoid
> any BFQ-related side effects.

OK, this looks like a new report.

Without any log we can't make any progress; we can't even guess
what the issue is related to.

Could you post your dmesg log (including the hung process stack traces)? And
dump the debugfs log with the following script when the hang happens?

http://people.redhat.com/minlei/tests/tools/dump-blk-info

BTW, you just need to pass the disk name to the script, such as: /dev/sda.
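For example, a minimal invocation could look like this (a sketch, assuming
the script runs under a POSIX shell and takes only the disk name):

  wget http://people.redhat.com/minlei/tests/tools/dump-blk-info
  sh dump-blk-info /dev/sda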

-- 
Ming


Re: [PATCH 1/2] scsi-mq: Only show the CDB if available

2017-12-04 Thread Ming Lei
On Tue, Dec 05, 2017 at 01:59:51AM +, Bart Van Assche wrote:
> On Tue, 2017-12-05 at 09:15 +0800, Ming Lei wrote:
> > On Mon, Dec 04, 2017 at 04:38:08PM -0800, Bart Van Assche wrote:
> > > Since the next patch will make it possible that scsi_show_rq() gets
> > > called before the CDB pointer is changed into a non-NULL value,
> > > only show the CDB if the CDB pointer is not NULL. Additionally,
> > > show the request timeout and SCSI command flags. This patch also
> > > fixes a bug that was reported by Ming Lei. See also Ming Lei,
> > > scsi_debugfs: fix crash in scsi_show_rq(), linux-scsi, 7 November
> > > 2017 (https://marc.info/?l=linux-block&m=151006655317188).
> > 
> > Please cook a patch for fixing the crash issue only, since we need
> > to backport the fix to stable kernel.
> 
> The code that is touched by this patch is only used for kernel debugging.
> I will do this if others agree with your opinion.

No, do not mix two different things in one patch, especially since the fix
part needs to be backported to stable.

The fix part should aim at v4.15, and the other part can be v4.16
material.

-- 
Ming


Re: [PATCH 1/2] scsi-mq: Only show the CDB if available

2017-12-04 Thread Ming Lei
On Mon, Dec 04, 2017 at 10:42:28PM -0500, Martin K. Petersen wrote:
> 
> Hi Ming,
> 
> > Please cook a patch for fixing the crash issue only, since we need
> > to backport the fix to stable kernel.
> 
> I thought you were going to submit a V5 that addressed James' concerns?
> 
> -- 
> Martin K. Petersen  Oracle Linux Engineering

Hi Martin,

I replied to James's concerns at the following link:

https://marc.info/?l=linux-block&m=151074751321108&w=2

The fact is that the use-after-free can't be avoided at all, no matter
whether we set the cmnd to NULL before calling free; that means we have to
handle the use-after-free in scsi_show_rq() itself, so we don't need to
touch the free code.

So V4 is good enough for merging, IMO.


Thanks,
Ming


Re: [PATCH 1/2] scsi-mq: Only show the CDB if available

2017-12-04 Thread Bart Van Assche
On Tue, 2017-12-05 at 09:15 +0800, Ming Lei wrote:
> On Mon, Dec 04, 2017 at 04:38:08PM -0800, Bart Van Assche wrote:
> > Since the next patch will make it possible that scsi_show_rq() gets
> > called before the CDB pointer is changed into a non-NULL value,
> > only show the CDB if the CDB pointer is not NULL. Additionally,
> > show the request timeout and SCSI command flags. This patch also
> > fixes a bug that was reported by Ming Lei. See also Ming Lei,
> > scsi_debugfs: fix crash in scsi_show_rq(), linux-scsi, 7 November
> > 2017 (https://marc.info/?l=linux-block&m=151006655317188).
> 
> Please cook a patch for fixing the crash issue only, since we need
> to backport the fix to stable kernel.

The code that is touched by this patch is only used for kernel debugging.
I will do this if others agree with your opinion.

Bart.

Re: [PATCH] blk-mq: Fix several SCSI request queue lockups

2017-12-04 Thread Ming Lei
On Tue, Dec 05, 2017 at 01:13:43AM +, Bart Van Assche wrote:
> On Tue, 2017-12-05 at 09:04 +0800, Ming Lei wrote:
> > Then there is no reason to revert commit 0df21c86bdbf ("scsi: implement .get_budget
> > and .put_budget for blk-mq") for one issue which may never happen in reality, since
> > this reproducer needs an out-of-tree patch.
> 
> Sorry but I disagree completely. You seem to overlook that there may be other
> circumstances that trigger the same lockup, e.g. a SCSI queue full condition.

If scsi_dev_queue_ready() returns false, .get_budget() catches that
and never adds the request to hctx->dispatch. And scsi_host_queue_ready()
always returns true, since we respect the per-host queue depth via
blk_mq_get_driver_tag() before calling .queue_rq().

If I've missed other cases, please point them out.
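For reference, the path under discussion looks roughly like this (a
paraphrased sketch of scsi_mq_get_budget() based on the hunk at the top of
this digest; not complete source):

  static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
  {
          struct request_queue *q = hctx->queue;
          struct scsi_device *sdev = q->queuedata;

          if (!get_device(&sdev->sdev_gendev))
                  goto out;
          if (!scsi_dev_queue_ready(q, sdev))
                  goto out_put_device;
          return true;    /* budget granted; .queue_rq() may be called */

  out_put_device:
          put_device(&sdev->sdev_gendev);
  out:
          /* No budget: the request is not dispatched, so unless the queue
           * is rerun (e.g. by the delayed run added in the patch at the
           * top of this digest), it can stall here. */
          return false;
  }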

-- 
Ming


Re: [PATCH] blk-mq: Fix several SCSI request queue lockups

2017-12-04 Thread Bart Van Assche
On Tue, 2017-12-05 at 09:04 +0800, Ming Lei wrote:
> Then there is no reason to revert commit 0df21c86bdbf ("scsi: implement .get_budget
> and .put_budget for blk-mq") for one issue which may never happen in reality, since
> this reproducer needs an out-of-tree patch.

Sorry but I disagree completely. You seem to overlook that there may be other
circumstances that trigger the same lockup, e.g. a SCSI queue full condition.

Bart.

Re: [PATCH 1/2] scsi-mq: Only show the CDB if available

2017-12-04 Thread Ming Lei
On Mon, Dec 04, 2017 at 04:38:08PM -0800, Bart Van Assche wrote:
> Since the next patch will make it possible that scsi_show_rq() gets
> called before the CDB pointer is changed into a non-NULL value,
> only show the CDB if the CDB pointer is not NULL. Additionally,
> show the request timeout and SCSI command flags. This patch also
> fixes a bug that was reported by Ming Lei. See also Ming Lei,
> scsi_debugfs: fix crash in scsi_show_rq(), linux-scsi, 7 November
> 2017 (https://marc.info/?l=linux-block&m=151006655317188).

Please cook a patch for fixing the crash issue only, since we need
to backport the fix to stable kernel.

> 
> Signed-off-by: Bart Van Assche 
> Cc: James E.J. Bottomley 
> Cc: Martin K. Petersen 
> Cc: Ming Lei 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 

Please Cc: 

> ---
>  drivers/scsi/scsi_debugfs.c | 47 -
>  1 file changed, 42 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_debugfs.c b/drivers/scsi/scsi_debugfs.c
> index 01f08c03f2c1..37ed6bb8e6ec 100644
> --- a/drivers/scsi/scsi_debugfs.c
> +++ b/drivers/scsi/scsi_debugfs.c
> @@ -4,13 +4,50 @@
>  #include 
>  #include "scsi_debugfs.h"
>  
> +#define SCSI_CMD_FLAG_NAME(name) [ilog2(SCMD_##name)] = #name
> +static const char *const scsi_cmd_flags[] = {
> + SCSI_CMD_FLAG_NAME(TAGGED),
> + SCSI_CMD_FLAG_NAME(UNCHECKED_ISA_DMA),
> + SCSI_CMD_FLAG_NAME(ZONE_WRITE_LOCK),
> + SCSI_CMD_FLAG_NAME(INITIALIZED),
> +};
> +#undef SCSI_CMD_FLAG_NAME
> +
> +static int scsi_flags_show(struct seq_file *m, const unsigned long flags,
> +const char *const *flag_name, int flag_name_count)
> +{
> + bool sep = false;
> + int i;
> +
> + for (i = 0; i < sizeof(flags) * BITS_PER_BYTE; i++) {
> + if (!(flags & BIT(i)))
> + continue;
> + if (sep)
> + seq_puts(m, "|");
> + sep = true;
> + if (i < flag_name_count && flag_name[i])
> + seq_puts(m, flag_name[i]);
> + else
> + seq_printf(m, "%d", i);
> + }
> + return 0;
> +}
> +
>  void scsi_show_rq(struct seq_file *m, struct request *rq)
>  {
>   struct scsi_cmnd *cmd = container_of(scsi_req(rq), typeof(*cmd), req);
> - int msecs = jiffies_to_msecs(jiffies - cmd->jiffies_at_alloc);
> - char buf[80];
> + int alloc_ms = jiffies_to_msecs(jiffies - cmd->jiffies_at_alloc);
> + int timeout_ms = jiffies_to_msecs(rq->timeout);
> + const u8 *const cdb = READ_ONCE(cmd->cmnd);
> + char buf[80] = "(?)";
>  
> - __scsi_format_command(buf, sizeof(buf), cmd->cmnd, cmd->cmd_len);
> - seq_printf(m, ", .cmd=%s, .retries=%d, allocated %d.%03d s ago", buf,
> -cmd->retries, msecs / 1000, msecs % 1000);
> + if (cdb)
> + __scsi_format_command(buf, sizeof(buf), cdb, cmd->cmd_len);
> + seq_printf(m, ", .cmd=%s, .retries=%d, .result = %#x, .flags=", buf,
> +cmd->retries, cmd->result);
> + scsi_flags_show(m, cmd->flags, scsi_cmd_flags,
> + ARRAY_SIZE(scsi_cmd_flags));
> + seq_printf(m, ", .timeout=%d.%03d, allocated %d.%03d s ago",
> +timeout_ms / 1000, timeout_ms % 1000,
> +alloc_ms / 1000, alloc_ms % 1000);
>  }
> -- 
> 2.15.0
> 

-- 
Ming


Re: [PATCH] blk-mq: Fix several SCSI request queue lockups

2017-12-04 Thread Ming Lei
On Tue, Dec 05, 2017 at 12:29:59AM +, Bart Van Assche wrote:
> On Tue, 2017-12-05 at 08:20 +0800, Ming Lei wrote:
> > Also it is a bit odd to see a request in hctx->dispatch now; it can only
> > happen when scsi_target_queue_ready() returns false, so I guess you applied
> > some change to target->can_queue (such as setting it to 1 in the srp/ib code
> > manually)?
> 
> Yes, but that had already been mentioned. From the e-mail at the start of
> this e-mail thread: "Change the SRP initiator such that SCSI target queue
> depth is limited to 1." The changes I made in the SRP initiator are the same
> as those described in the following message from about one month ago:
> https://www.spinics.net/lists/linux-scsi/msg114720.html.

OK, got it.

Then there is no reason to revert commit 0df21c86bdbf ("scsi: implement
.get_budget and .put_budget for blk-mq") for one issue which may never happen
in reality, since this reproducer needs an out-of-tree patch.

I don't mean it isn't an issue, but I don't think it makes reverting commit
0df21c86bdbf a top priority. Especially since there is no proof that
0df21c86bdbf causes this issue, as that commit doesn't change how the queue
is run for requests in hctx->dispatch_list.

I'd like to take a look if someone is willing to cooperate, such as by
providing kernel logs and testing debug patches. Or when I get the
hardware to reproduce it.

--
Ming


[PATCH 0/2] Show commands stuck in a timeout handler in debugfs

2017-12-04 Thread Bart Van Assche
Hello Jens,

While debugging an issue with the SCSI error handler I noticed that commands
that got stuck in that error handler are not shown in debugfs. That is very
annoying for anyone who relies on the information in debugfs for root-causing
such an issue. Hence this patch series that makes sure that commands that got
stuck in a block driver timeout handler are also shown in debugfs. Please
consider these patches for kernel v4.16.

Thanks,

Bart.

Bart Van Assche (2):
  scsi-mq: Only show the CDB if available
  blk-mq-debugfs: Also show requests that have not yet been started

 block/blk-mq-debugfs.c  |  2 +-
 drivers/scsi/scsi_debugfs.c | 47 -
 2 files changed, 43 insertions(+), 6 deletions(-)

-- 
2.15.0



[PATCH 1/2] scsi-mq: Only show the CDB if available

2017-12-04 Thread Bart Van Assche
Since the next patch will make it possible that scsi_show_rq() gets
called before the CDB pointer is changed into a non-NULL value,
only show the CDB if the CDB pointer is not NULL. Additionally,
show the request timeout and SCSI command flags. This patch also
fixes a bug that was reported by Ming Lei. See also Ming Lei,
scsi_debugfs: fix crash in scsi_show_rq(), linux-scsi, 7 November
2017 (https://marc.info/?l=linux-block&m=151006655317188).

Signed-off-by: Bart Van Assche 
Cc: James E.J. Bottomley 
Cc: Martin K. Petersen 
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 drivers/scsi/scsi_debugfs.c | 47 -
 1 file changed, 42 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/scsi_debugfs.c b/drivers/scsi/scsi_debugfs.c
index 01f08c03f2c1..37ed6bb8e6ec 100644
--- a/drivers/scsi/scsi_debugfs.c
+++ b/drivers/scsi/scsi_debugfs.c
@@ -4,13 +4,50 @@
 #include 
 #include "scsi_debugfs.h"
 
+#define SCSI_CMD_FLAG_NAME(name) [ilog2(SCMD_##name)] = #name
+static const char *const scsi_cmd_flags[] = {
+   SCSI_CMD_FLAG_NAME(TAGGED),
+   SCSI_CMD_FLAG_NAME(UNCHECKED_ISA_DMA),
+   SCSI_CMD_FLAG_NAME(ZONE_WRITE_LOCK),
+   SCSI_CMD_FLAG_NAME(INITIALIZED),
+};
+#undef SCSI_CMD_FLAG_NAME
+
+static int scsi_flags_show(struct seq_file *m, const unsigned long flags,
+  const char *const *flag_name, int flag_name_count)
+{
+   bool sep = false;
+   int i;
+
+   for (i = 0; i < sizeof(flags) * BITS_PER_BYTE; i++) {
+   if (!(flags & BIT(i)))
+   continue;
+   if (sep)
+   seq_puts(m, "|");
+   sep = true;
+   if (i < flag_name_count && flag_name[i])
+   seq_puts(m, flag_name[i]);
+   else
+   seq_printf(m, "%d", i);
+   }
+   return 0;
+}
+
 void scsi_show_rq(struct seq_file *m, struct request *rq)
 {
struct scsi_cmnd *cmd = container_of(scsi_req(rq), typeof(*cmd), req);
-   int msecs = jiffies_to_msecs(jiffies - cmd->jiffies_at_alloc);
-   char buf[80];
+   int alloc_ms = jiffies_to_msecs(jiffies - cmd->jiffies_at_alloc);
+   int timeout_ms = jiffies_to_msecs(rq->timeout);
+   const u8 *const cdb = READ_ONCE(cmd->cmnd);
+   char buf[80] = "(?)";
 
-   __scsi_format_command(buf, sizeof(buf), cmd->cmnd, cmd->cmd_len);
-   seq_printf(m, ", .cmd=%s, .retries=%d, allocated %d.%03d s ago", buf,
-  cmd->retries, msecs / 1000, msecs % 1000);
+   if (cdb)
+   __scsi_format_command(buf, sizeof(buf), cdb, cmd->cmd_len);
+   seq_printf(m, ", .cmd=%s, .retries=%d, .result = %#x, .flags=", buf,
+  cmd->retries, cmd->result);
+   scsi_flags_show(m, cmd->flags, scsi_cmd_flags,
+   ARRAY_SIZE(scsi_cmd_flags));
+   seq_printf(m, ", .timeout=%d.%03d, allocated %d.%03d s ago",
+  timeout_ms / 1000, timeout_ms % 1000,
+  alloc_ms / 1000, alloc_ms % 1000);
 }
-- 
2.15.0



[PATCH 2/2] blk-mq-debugfs: Also show requests that have not yet been started

2017-12-04 Thread Bart Van Assche
When debugging e.g. the SCSI timeout handler it is important that
requests that have not yet been started or that already have
completed are also reported through debugfs.

Signed-off-by: Bart Van Assche 
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Martin K. Petersen 
---
 block/blk-mq-debugfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index f7db73f1698e..886b37163f17 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -409,7 +409,7 @@ static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
const struct show_busy_params *params = data;
 
if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
-   test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
+   list_empty(&rq->queuelist))
__blk_mq_debugfs_rq_show(params->m,
 list_entry_rq(&rq->queuelist));
 }
-- 
2.15.0



Re: [PATCH] blk-mq: Fix several SCSI request queue lockups

2017-12-04 Thread Bart Van Assche
On Tue, 2017-12-05 at 08:20 +0800, Ming Lei wrote:
> Also it is a bit odd to see a request in hctx->dispatch now; it can only
> happen when scsi_target_queue_ready() returns false, so I guess you applied
> some change to target->can_queue (such as setting it to 1 in the srp/ib code
> manually)?

Yes, but that had already been mentioned. From the e-mail at the start of
this e-mail thread: "Change the SRP initiator such that SCSI target queue
depth is limited to 1." The changes I made in the SRP initiator are the same
as those described in the following message from about one month ago:
https://www.spinics.net/lists/linux-scsi/msg114720.html.

Bart.

Re: [PATCH] blk-mq: Fix several SCSI request queue lockups

2017-12-04 Thread Ming Lei
On Mon, Dec 04, 2017 at 11:32:27PM +, Bart Van Assche wrote:
> On Tue, 2017-12-05 at 07:01 +0800, Ming Lei wrote:
> > On Mon, Dec 04, 2017 at 10:48:18PM +, Bart Van Assche wrote:
> > > On Tue, 2017-12-05 at 06:42 +0800, Ming Lei wrote:
> > > > On Mon, Dec 04, 2017 at 09:30:32AM -0800, Bart Van Assche wrote:
> > > > > * A systematic lockup for SCSI queues with queue depth 1. The
> > > > >   following test reproduces that bug systematically:
> > > > >   - Change the SRP initiator such that SCSI target queue depth is
> > > > > limited to 1.
> > > > >   - Run the following command:
> > > > >   srp-test/run_tests -f xfs -d -e none -r 60 -t 01
> > > > >   See also "[PATCH 4/7] blk-mq: Avoid that request processing
> > > > >   stalls when sharing tags"
> > > > >   (https://marc.info/?l=linux-block&m=151208695316857). Note:
> > > > >   reverting commit 0df21c86bdbf also fixes a sporadic SCSI request
> > > > >   queue lockup while inserting a blk_mq_sched_mark_restart_hctx()
> > > > >   before all blk_mq_dispatch_rq_list() calls only fixes the
> > > > >   systematic lockup for queue depth 1.
> > > > 
> > > > You are the only reproducer [ ... ]
> > > 
> > > That's not correct. I'm pretty sure if you try to reproduce this that
> > > you will see the same hang I ran into. Does this mean that you have not
> > > yet tried to reproduce the hang I reported?
> > 
> > Do you mean every kernel developer has to own SRP/IB hardware?
> 
> When I have the time I will make it possible to run this test on any system
> equipped with at least one Ethernet port. But the fact that the test I
> mentioned requires IB hardware should not prevent you from running this test
> since you have run this test software before.
> 
> > I don't have your hardware to reproduce that,
> 
> That's not true. Your employer definitely owns IB hardware. E.g. the
> following message shows that you have run the srp-test yourself on IB hardware
> only four weeks ago:
> 
> https://www.spinics.net/lists/linux-block/msg19511.html

The hardware belongs to Laurence; at that time I could borrow it from him,
and now I am not sure whether it is available.

> 
> > Otherwise, there would have been similar reports from others, not only
> > from you.
> 
> That's not correct either. How long ago was kernel v4.15-rc1 released? One
> week? How many SRP users do you think have tried to trigger a queue-full
> condition with that kernel version?

OK, we can wait for other reporters to provide a kernel log if you
don't want to.

> 
> > More importantly, I don't understand why you can't share the kernel
> > log/debugfs log from when the IO hang happens.
> > 
> > Without any kernel log, how can we confirm that it is a valid report?
> 
> It's really unfortunate that you are focussing on denying that the bug I
> reported exists instead of trying to fix the bugs introduced by commit

If you look at bug reports on the kernel mailing lists, you will see that
most reports include some kind of log; it is common practice to report an
issue with a log attached. It can save us much time talking in mails.

> b347689ffbca. BTW, I have more than enough experience to decide myself what
> a valid report is and what not. I can easily send you several MB of kernel

As I mentioned, just the dmesg with the hang traces plus the debugfs log
should be enough; neither can be that big, right?

> logs. The reason I have not yet done this is because I'm 99.9% sure that
> these won't help to root cause the reported issue. But I can tell you what

That is your opinion. Most of the time I can find some clue about a hang
from debugfs, and then I can add tracing in the likely places to narrow
the issue down.

> I learned from analyzing the information under /sys/kernel/debug/block:
> every time a hang occurred I noticed that no requests were "busy", that two
> requests occurred in rq_lists and one request occurred in a hctx dispatch

Then what do other attributes show? Like queue/hctx state?

The following script can get all this info easily:

http://people.redhat.com/minlei/tests/tools/dump-blk-info

Also it is a bit odd to see a request in hctx->dispatch now; it can only
happen when scsi_target_queue_ready() returns false, so I guess you applied
some change to target->can_queue (such as setting it to 1 in the srp/ib code
manually)?

Please reply; if yes, I will try to see if I can reproduce it with that
kind of change on scsi_debug.

> list. This is enough to conclude that a queue run was missing. And I think

In this case, it seems unrelated to both commit b347689ffbca and 0df21c86bdbf,
since neither changes RESTART handling for hctx->dispatch, and neither should
affect queue running.

> that the patch at the start of this e-mail thread not only shows that the
> root cause is in the block layer but also that this bug was introduced by
> commit b347689ffbca.
> 
> > > > You said that your patch fixes 'commit b347689ffbca ("blk-mq-sched:
> > > > improve dispatching from sw queue")', but you don't mention any issue
> > > > about that commit.

Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()

2017-12-04 Thread Holger Hoffstätte
On Tue, 05 Dec 2017 06:45:08 +0800, Ming Lei wrote:

> On Mon, Dec 04, 2017 at 03:09:20PM +, Bart Van Assche wrote:
>> On Sun, 2017-12-03 at 00:31 +0800, Ming Lei wrote:
>> > Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for 
>> > blk-mq")
>> 
>> It might be safer to revert commit 0df21c86bdbf instead of trying to fix all
>> issues introduced by that commit for kernel version v4.15 ...
> 
> What are all the issues in v4.15-rc? Up to now, this is the only issue reported,
> and it can be fixed by this simple patch, which can be thought of as a cleanup
> too.

Even with this patch I've encountered at least one hang that
seemed related. I'm using most of block/scsi-4.15 on top of 4.14 and
the hang in question was on a rotating disk. It could be solved by activating
a different scheduler on the hanging device; all hanging sync/df processes got
unstuck and all was fine again, which leads me to believe that there is at least
one more rare condition where delaying requests (as done in the budget patch)
leads to a hang.

This happened with mq-deadline which I was testing specifically to avoid
any BFQ-related side effects.

I didn't do anything specific to trigger the hang and have not been able
to reproduce it during regular usage.

-h



Re: [PATCH] blk-mq: Fix several SCSI request queue lockups

2017-12-04 Thread Bart Van Assche
On Tue, 2017-12-05 at 07:01 +0800, Ming Lei wrote:
> On Mon, Dec 04, 2017 at 10:48:18PM +, Bart Van Assche wrote:
> > On Tue, 2017-12-05 at 06:42 +0800, Ming Lei wrote:
> > > On Mon, Dec 04, 2017 at 09:30:32AM -0800, Bart Van Assche wrote:
> > > > * A systematic lockup for SCSI queues with queue depth 1. The
> > > >   following test reproduces that bug systematically:
> > > >   - Change the SRP initiator such that SCSI target queue depth is
> > > > limited to 1.
> > > >   - Run the following command:
> > > >   srp-test/run_tests -f xfs -d -e none -r 60 -t 01
> > > >   See also "[PATCH 4/7] blk-mq: Avoid that request processing
> > > >   stalls when sharing tags"
> > > >   (https://marc.info/?l=linux-block&m=151208695316857). Note:
> > > >   reverting commit 0df21c86bdbf also fixes a sporadic SCSI request
> > > >   queue lockup while inserting a blk_mq_sched_mark_restart_hctx()
> > > >   before all blk_mq_dispatch_rq_list() calls only fixes the
> > > >   systematic lockup for queue depth 1.
> > > 
> > > You are the only reproducer [ ... ]
> > 
> > That's not correct. I'm pretty sure if you try to reproduce this that
> > you will see the same hang I ran into. Does this mean that you have not
> > yet tried to reproduce the hang I reported?
> 
> Do you mean every kernel developer has to own SRP/IB hardware?

When I have the time I will make it possible to run this test on any system
equipped with at least one Ethernet port. But the fact that the test I
mentioned requires IB hardware should not prevent you from running this test
since you have run this test software before.

> I don't have your hardware to reproduce that,

That's not true. Your employer definitely owns IB hardware. E.g. the
following message shows that you have run the srp-test yourself on IB hardware
only four weeks ago:

https://www.spinics.net/lists/linux-block/msg19511.html

> Otherwise, there would have been similar reports from others, not only
> from you.

That's not correct either. How long ago was kernel v4.15-rc1 released? One
week? How many SRP users do you think have tried to trigger a queue-full
condition with that kernel version?

> More importantly, I don't understand why you can't share the kernel
> log/debugfs log from when the IO hang happens.
> 
> Without any kernel log, how can we confirm that it is a valid report?

It's really unfortunate that you are focussing on denying that the bug I
reported exists instead of trying to fix the bugs introduced by commit
b347689ffbca. BTW, I have more than enough experience to decide for myself
what a valid report is and what is not. I can easily send you several MB of
kernel logs. The reason I have not yet done so is that I'm 99.9% sure
these won't help to root-cause the reported issue. But I can tell you what
I learned from analyzing the information under /sys/kernel/debug/block:
every time a hang occurred I noticed that no requests were "busy", that two
requests occurred in rq_lists, and one request occurred in an hctx dispatch
list. This is enough to conclude that a queue run was missing. And I think
that the patch at the start of this e-mail thread not only shows that the
root cause is in the block layer but also that this bug was introduced by
commit b347689ffbca.

> > > You said that your patch fixes 'commit b347689ffbca ("blk-mq-sched:
> > > improve dispatching from sw queue")', but you don't mention any issue
> > > about that commit.
> > 
> > That's not correct either. From the commit message "A systematic lockup
> > for SCSI queues with queue depth 1."
> 
> I mean that you mentioned your patch can fix 'commit b347689ffbca
> ("blk-mq-sched: improve dispatching from sw queue")', but you never
> point out where commit b347689ffbca is wrong or how your patch fixes
> that commit's mistake.

You should know that it is not required to perform a root cause analysis
before posting a revert. Having performed a bisect is sufficient.

BTW, it seems like you forgot that last Friday I explained to you that there
is an obvious bug in commit b347689ffbca, namely that a
blk_mq_sched_mark_restart_hctx() call is missing in
blk_mq_sched_dispatch_requests() before the blk_mq_do_dispatch_ctx() call.
See also https://marc.info/?l=linux-block&m=151215794224401.

Bart.

Re: [PATCH] kyber: fix another domain token wait queue hang

2017-12-04 Thread Omar Sandoval
On Mon, Dec 04, 2017 at 03:12:15PM -0800, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Commit 8cf466602028 ("kyber: fix hang on domain token wait queue") fixed
> a hang caused by leaving wait entries on the domain token wait queue
> after the __sbitmap_queue_get() retry succeeded, making that wait entry
> a "dud" which won't in turn wake more entries up. However, we can also
> get a dud entry if kyber_get_domain_token() fails once but is then
> called again and succeeds. This can happen if the hardware queue is
> rerun for some other reason, or, more likely, kyber_dispatch_request()
> tries the same domain twice.
> 
> The fix is to remove our entry from the wait queue whenever we
> successfully get a token. The only complication is that we might be on
> one of many wait queues in the struct sbitmap_queue, but that's easily
> fixed by remembering which wait queue we were put on.
> 
> While we're here, only initialize the wait queue entry once instead on

s/instead on/instead of on/

> every wait, and use spin_lock_irq() instead of spin_lock_irqsave(),
> since this is always called from process context with irqs enabled.
> 
> Signed-off-by: Omar Sandoval 
> ---
> I have another, rarer hang I'm still working out, but I can get this one
> out of the way.
> 
>  block/kyber-iosched.c | 39 +--
>  1 file changed, 25 insertions(+), 14 deletions(-)
> 
> diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
> index b4df317c2916..00cf624ce3ed 100644
> --- a/block/kyber-iosched.c
> +++ b/block/kyber-iosched.c
> @@ -100,9 +100,13 @@ struct kyber_hctx_data {
>   unsigned int cur_domain;
>   unsigned int batching;
>   wait_queue_entry_t domain_wait[KYBER_NUM_DOMAINS];
> + struct sbq_wait_state *domain_ws[KYBER_NUM_DOMAINS];
>   atomic_t wait_index[KYBER_NUM_DOMAINS];
>  };
>  
> +static int kyber_domain_wake(wait_queue_entry_t *wait, unsigned mode, int flags,
> +  void *key);
> +
>  static int rq_sched_domain(const struct request *rq)
>  {
>   unsigned int op = rq->cmd_flags;
> @@ -385,6 +389,9 @@ static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
>  
>   for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
>   INIT_LIST_HEAD(&khd->rqs[i]);
> + init_waitqueue_func_entry(&khd->domain_wait[i],
> +   kyber_domain_wake);
> + khd->domain_wait[i].private = hctx;
>   INIT_LIST_HEAD(&khd->domain_wait[i].entry);
>   atomic_set(&khd->wait_index[i], 0);
>   }
> @@ -524,35 +531,39 @@ static int kyber_get_domain_token(struct kyber_queue_data *kqd,
>   int nr;
>  
>   nr = __sbitmap_queue_get(domain_tokens);
> - if (nr >= 0)
> - return nr;
>  
>   /*
>* If we failed to get a domain token, make sure the hardware queue is
>* run when one becomes available. Note that this is serialized on
>* khd->lock, but we still need to be careful about the waker.
>*/
> - if (list_empty_careful(&wait->entry)) {
> - init_waitqueue_func_entry(wait, kyber_domain_wake);
> - wait->private = hctx;
> + if (nr < 0 && list_empty_careful(&wait->entry)) {
>   ws = sbq_wait_ptr(domain_tokens,
> &khd->wait_index[sched_domain]);
> - add_wait_queue(&ws->wait, wait);
> + khd->domain_ws[sched_domain] = ws;
> + add_wait_queue_exclusive(&ws->wait, wait);
>  
>   /*
>* Try again in case a token was freed before we got on the wait
> -  * queue. The waker may have already removed the entry from the
> -  * wait queue, but list_del_init() is okay with that.
> +  * queue.
>*/
>   nr = __sbitmap_queue_get(domain_tokens);
> - if (nr >= 0) {
> - unsigned long flags;
> + }
>  
> - spin_lock_irqsave(&ws->wait.lock, flags);
> - list_del_init(&wait->entry);
> - spin_unlock_irqrestore(&ws->wait.lock, flags);
> - }
> + /*
> +  * If we got a token while we were on the wait queue, remove ourselves
> +  * from the wait queue to ensure that all wake ups make forward
> +  * progress. It's possible that the waker already deleted the entry
> +  * between the !list_empty_careful() check and us grabbing the lock, but
> +  * list_del_init() is okay with that.
> +  */
> + if (nr >= 0 && !list_empty_careful(&wait->entry)) {
> + ws = khd->domain_ws[sched_domain];
> + spin_lock_irq(&ws->wait.lock);
> + list_del_init(&wait->entry);
> + spin_unlock_irq(&ws->wait.lock);
>   }
> +
>   return nr;
>  }
>  
> -- 
> 2.15.1
> 


Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

2017-12-04 Thread Kirill Tkhai
On 05.12.2017 01:59, Jeff Moyer wrote:
> Kirill Tkhai  writes:
> 
>> On 05.12.2017 00:52, Tejun Heo wrote:
>>> Hello, Kirill.
>>>
>>> On Tue, Dec 05, 2017 at 12:44:00AM +0300, Kirill Tkhai wrote:
> Can you please explain how this is a fundamental resource which can't
> be controlled otherwise?

 Currently, aio_nr and aio_max_nr are global. In case of containers this
 means that a single container may occupy all aio requests, which are
 available in the system, depriving others of the possibility to use aio
 at all. This may happen because of evil intentions of the container's
 user, or because of a program error that causes this accidentally.
>>>
>>> Hmm... I see.  It feels really wrong to me to make this a first class
>>> resource because there is a system wide limit.  The only reason I can
>>> think of for the system wide limit is to prevent too much kernel
>>> memory consumed by creating a lot of aios but that squarely falls
>>> inside cgroup memory controller protection.  If there are other
>>> reasons why the number of aios should be limited system-wide, please
>>> bring them up.
>>>
>>> If the only reason is kernel memory consumption protection, the only
>>> thing we need to do is making sure that memory used for aio commands
>>> are accounted against cgroup kernel memory consumption and
>>> relaxing/removing system wide limit.
>>
>> So, we just use GFP_KERNEL_ACCOUNT flag for allocation of internal aio
>> structures and pages, and all the memory will be accounted in kmem and
>> limited by memcg. Looks very good.
>>
>> One detail about memory consumption. io_submit() calls the primitives
>> file_operations::write_iter and read_iter. It's not clear to me whether
>> they consume the same memory as if the writev() or readv() system calls
>> were used instead. writev() may delay the actual write until the dirty-page
>> limit is reached, so it seems the accounting logic should be the same. So
>> aio mustn't use more unaccounted system memory in filesystem internals
>> than plain writev() does.
>>
>> Could you please say if you have thoughts about this?
> 
> I think you just need to account the completion ring.

A request of type struct aio_kiocb consumes much more memory than a
struct io_event does. Shouldn't we account for it too?
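For illustration, the kind of accounting change being discussed might look
like this in fs/aio.c's request allocation (a sketch, assuming aio_kiocb
objects come from kiocb_cachep; not a tested patch):

  /* aio_get_req() (sketch): charge per-request allocations to the
   * submitter's memcg so one container cannot pin unbounded kernel
   * memory via in-flight aio requests. */
  req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
  if (unlikely(!req))
          return NULL;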

Kirill


[PATCH] kyber: fix another domain token wait queue hang

2017-12-04 Thread Omar Sandoval
From: Omar Sandoval 

Commit 8cf466602028 ("kyber: fix hang on domain token wait queue") fixed
a hang caused by leaving wait entries on the domain token wait queue
after the __sbitmap_queue_get() retry succeeded, making that wait entry
a "dud" which won't in turn wake more entries up. However, we can also
get a dud entry if kyber_get_domain_token() fails once but is then
called again and succeeds. This can happen if the hardware queue is
rerun for some other reason, or, more likely, kyber_dispatch_request()
tries the same domain twice.

The fix is to remove our entry from the wait queue whenever we
successfully get a token. The only complication is that we might be on
one of many wait queues in the struct sbitmap_queue, but that's easily
fixed by remembering which wait queue we were put on.

While we're here, only initialize the wait queue entry once instead of on
every wait, and use spin_lock_irq() instead of spin_lock_irqsave(),
since this is always called from process context with irqs enabled.

Signed-off-by: Omar Sandoval 
---
I have another, rarer hang I'm still working out, but I can get this one
out of the way.

 block/kyber-iosched.c | 39 +--
 1 file changed, 25 insertions(+), 14 deletions(-)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index b4df317c2916..00cf624ce3ed 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -100,9 +100,13 @@ struct kyber_hctx_data {
unsigned int cur_domain;
unsigned int batching;
wait_queue_entry_t domain_wait[KYBER_NUM_DOMAINS];
+   struct sbq_wait_state *domain_ws[KYBER_NUM_DOMAINS];
atomic_t wait_index[KYBER_NUM_DOMAINS];
 };
 
+static int kyber_domain_wake(wait_queue_entry_t *wait, unsigned mode, int flags,
+void *key);
+
 static int rq_sched_domain(const struct request *rq)
 {
unsigned int op = rq->cmd_flags;
@@ -385,6 +389,9 @@ static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
 
for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
INIT_LIST_HEAD(&khd->rqs[i]);
+   init_waitqueue_func_entry(&khd->domain_wait[i],
+ kyber_domain_wake);
+   khd->domain_wait[i].private = hctx;
INIT_LIST_HEAD(&khd->domain_wait[i].entry);
atomic_set(&khd->wait_index[i], 0);
}
@@ -524,35 +531,39 @@ static int kyber_get_domain_token(struct kyber_queue_data *kqd,
int nr;
 
nr = __sbitmap_queue_get(domain_tokens);
-   if (nr >= 0)
-   return nr;
 
/*
 * If we failed to get a domain token, make sure the hardware queue is
 * run when one becomes available. Note that this is serialized on
 * khd->lock, but we still need to be careful about the waker.
 */
-   if (list_empty_careful(&wait->entry)) {
-   init_waitqueue_func_entry(wait, kyber_domain_wake);
-   wait->private = hctx;
+   if (nr < 0 && list_empty_careful(&wait->entry)) {
ws = sbq_wait_ptr(domain_tokens,
  &khd->wait_index[sched_domain]);
-   add_wait_queue(&ws->wait, wait);
+   khd->domain_ws[sched_domain] = ws;
+   add_wait_queue_exclusive(&ws->wait, wait);
 
/*
 * Try again in case a token was freed before we got on the wait
-* queue. The waker may have already removed the entry from the
-* wait queue, but list_del_init() is okay with that.
+* queue.
 */
nr = __sbitmap_queue_get(domain_tokens);
-   if (nr >= 0) {
-   unsigned long flags;
+   }
 
-   spin_lock_irqsave(&ws->wait.lock, flags);
-   list_del_init(&wait->entry);
-   spin_unlock_irqrestore(&ws->wait.lock, flags);
-   }
+   /*
+* If we got a token while we were on the wait queue, remove ourselves
+* from the wait queue to ensure that all wake ups make forward
+* progress. It's possible that the waker already deleted the entry
+* between the !list_empty_careful() check and us grabbing the lock, but
+* list_del_init() is okay with that.
+*/
+   if (nr >= 0 && !list_empty_careful(&wait->entry)) {
+   ws = khd->domain_ws[sched_domain];
+   spin_lock_irq(&ws->wait.lock);
+   list_del_init(&wait->entry);
+   spin_unlock_irq(&ws->wait.lock);
}
+
return nr;
 }
 
-- 
2.15.1



Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

2017-12-04 Thread Kirill Tkhai
On 05.12.2017 02:02, Tejun Heo wrote:
> Hello, Kirill.
> 
> On Tue, Dec 05, 2017 at 01:49:42AM +0300, Kirill Tkhai wrote:
>>> If the only reason is kernel memory consumption protection, the only
>>> thing we need to do is making sure that memory used for aio commands
>>> are accounted against cgroup kernel memory consumption and
>>> relaxing/removing system wide limit.
>>
>> So, we just use GFP_KERNEL_ACCOUNT flag for allocation of internal aio
>> structures and pages, and all the memory will be accounted in kmem and
>> limited by memcg. Looks very good.
> 
> Yeah.
> 
>> One detail about memory consumption. io_submit() calls the primitives
>> file_operations::write_iter and read_iter. It's not clear to me whether
>> they consume the same memory as if the writev() or readv() system calls
>> were used instead. writev() may delay the actual write until the dirty-page
>> limit is reached, so it seems the accounting logic should be the same. So
>> aio mustn't use more unaccounted system memory in filesystem internals
>> than plain writev() does.
>>
>> Could you please say if you have thoughts about this?
> 
> I'm not too familiar with vfs / filesystems but I don't think there's
> gonna be significant unaccounted memory consumption.  It shouldn't be
> too difficult to find out with experiments too.
> 
> Thanks.

Thanks, Tejun!

Kirill


Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

2017-12-04 Thread Tejun Heo
Hello, Kirill.

On Tue, Dec 05, 2017 at 01:49:42AM +0300, Kirill Tkhai wrote:
> > If the only reason is kernel memory consumption protection, the only
> > thing we need to do is making sure that memory used for aio commands
> > are accounted against cgroup kernel memory consumption and
> > relaxing/removing system wide limit.
> 
> So, we just use GFP_KERNEL_ACCOUNT flag for allocation of internal aio
> structures and pages, and all the memory will be accounted in kmem and
> limited by memcg. Looks very good.

Yeah.

> One detail about memory consumption. io_submit() calls the primitives
> file_operations::write_iter and read_iter. It's not clear to me whether
> they consume the same memory as if the writev() or readv() system calls
> were used instead. writev() may delay the actual write until the dirty-page
> limit is reached, so it seems the accounting logic should be the same. So
> aio mustn't use more unaccounted system memory in filesystem internals
> than plain writev() does.
> 
> Could you please say if you have thoughts about this?

I'm not too familiar with vfs / filesystems but I don't think there's
gonna be significant unaccounted memory consumption.  It shouldn't be
too difficult to find out with experiments too.

Thanks.

-- 
tejun


Re: [PATCH] blk-mq: Fix several SCSI request queue lockups

2017-12-04 Thread Ming Lei
On Mon, Dec 04, 2017 at 10:48:18PM +, Bart Van Assche wrote:
> On Tue, 2017-12-05 at 06:42 +0800, Ming Lei wrote:
> > On Mon, Dec 04, 2017 at 09:30:32AM -0800, Bart Van Assche wrote:
> > > * A systematic lockup for SCSI queues with queue depth 1. The
> > >   following test reproduces that bug systematically:
> > >   - Change the SRP initiator such that SCSI target queue depth is
> > > limited to 1.
> > >   - Run the following command:
> > >   srp-test/run_tests -f xfs -d -e none -r 60 -t 01
> > >   See also "[PATCH 4/7] blk-mq: Avoid that request processing
> > >   stalls when sharing tags"
> > >   (https://marc.info/?l=linux-block&m=151208695316857). Note:
> > >   reverting commit 0df21c86bdbf also fixes a sporadic SCSI request
> > >   queue lockup while inserting a blk_mq_sched_mark_restart_hctx()
> > >   before all blk_mq_dispatch_rq_list() calls only fixes the
> > >   systematic lockup for queue depth 1.
> > 
> > You are the only reproducer [ ... ]
> 
> That's not correct. I'm pretty sure if you try to reproduce this that
> you will see the same hang I ran into. Does this mean that you have not
> yet tried to reproduce the hang I reported?

Do you mean every kernel developer has to own SRP/IB hardware?
I don't have your hardware to reproduce this, and I don't think most
people do either. Otherwise, there would have been similar reports
from others, not only from you.

More importantly, I don't understand why you can't share the kernel
log/debugfs log from when the IO hang happens.

Without any kernel log, how can we confirm that it is a valid report?

> 
> > You said that your patch fixes 'commit b347689ffbca ("blk-mq-sched:
> > improve dispatching from sw queue")', but you don't mention any issue
> > about that commit.
> 
> That's not correct either. From the commit message "A systematic lockup
> for SCSI queues with queue depth 1."

I mean that you mentioned your patch can fix 'commit b347689ffbca
("blk-mq-sched: improve dispatching from sw queue")', but you never
point out where commit b347689ffbca is wrong or how your patch fixes
that commit's mistake.

> 
> > > I think the above means that it is too risky to try to fix all bugs
> > > introduced by commit 0df21c86bdbf before kernel v4.15 is released.
> > > Hence revert that commit.
> > 
> > What is the risk?
> 
> That more bugs were introduced by commit 0df21c86bdbf than the ones that
> have been discovered so far.

If you don't provide any log, I simply have to ignore your report.
So there is only one real issue, which can be addressed easily by
the following patch:

https://marc.info/?l=linux-scsi&m=151223234607157&w=2

-- 
Ming


Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

2017-12-04 Thread Jeff Moyer
Kirill Tkhai  writes:

> On 05.12.2017 00:52, Tejun Heo wrote:
>> Hello, Kirill.
>> 
>> On Tue, Dec 05, 2017 at 12:44:00AM +0300, Kirill Tkhai wrote:
 Can you please explain how this is a fundamental resource which can't
 be controlled otherwise?
>>>
>>> Currently, aio_nr and aio_max_nr are global. In case of containers this
>>> means that a single container may occupy all aio requests, which are
>>> available in the system, depriving others of the possibility to use aio
>>> at all. This may happen because of evil intentions of the container's
>>> user, or because of a program error that causes this accidentally.
>> 
>> Hmm... I see.  It feels really wrong to me to make this a first class
>> resource because there is a system wide limit.  The only reason I can
>> think of for the system wide limit is to prevent too much kernel
>> memory consumed by creating a lot of aios but that squarely falls
>> inside cgroup memory controller protection.  If there are other
>> reasons why the number of aios should be limited system-wide, please
>> bring them up.
>>
>> If the only reason is kernel memory consumption protection, the only
>> thing we need to do is making sure that memory used for aio commands
>> are accounted against cgroup kernel memory consumption and
>> relaxing/removing system wide limit.
>
> So, we just use GFP_KERNEL_ACCOUNT flag for allocation of internal aio
> structures and pages, and all the memory will be accounted in kmem and
> limited by memcg. Looks very good.
>
> One detail about memory consumption. io_submit() calls the primitives
> file_operations::write_iter and read_iter. It's not clear to me whether
> they consume the same memory as if the writev() or readv() system calls
> were used instead. writev() may delay the actual write until the dirty-page
> limit is reached, so it seems the accounting logic should be the same. So
> aio mustn't use more unaccounted system memory in filesystem internals
> than plain writev() does.
>
> Could you please say if you have thoughts about this?

I think you just need to account the completion ring.
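
For illustration, a minimal sketch of what that could look like at the
ring page allocation in aio_setup_ring() (hypothetical one-line change;
the exact call site in fs/aio.c may differ):

	/* charge the completion ring pages to the caller's memcg (kmem) */
	page = find_or_create_page(file->f_mapping, i,
				   GFP_HIGHUSER | __GFP_ZERO | __GFP_ACCOUNT);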

Cheers,
Jeff


Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()

2017-12-04 Thread Bart Van Assche
On Tue, 2017-12-05 at 06:45 +0800, Ming Lei wrote:
> On Mon, Dec 04, 2017 at 03:09:20PM +, Bart Van Assche wrote:
> > On Sun, 2017-12-03 at 00:31 +0800, Ming Lei wrote:
> > > Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for 
> > > blk-mq")
> > 
> > It might be safer to revert commit 0df21c86bdbf instead of trying to fix all
> > issues introduced by that commit for kernel version v4.15 ...
> 
> What are all the issues in v4.15-rc? Up to now, this is the only issue
> reported, and it can be fixed by this simple patch, which can be thought
> of as a cleanup too.

The three issues I described in the commit message of the patch that is 
available at:
https://marc.info/?l=linux-block&m=151240866010572.

Bart.

Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

2017-12-04 Thread Kirill Tkhai
On 05.12.2017 00:52, Tejun Heo wrote:
> Hello, Kirill.
> 
> On Tue, Dec 05, 2017 at 12:44:00AM +0300, Kirill Tkhai wrote:
>>> Can you please explain how this is a fundamental resource which can't
>>> be controlled otherwise?
>>
>> Currently, aio_nr and aio_max_nr are global. In case of containers this
>> means that a single container may occupy all aio requests available in
>> the system and deprive the others of the possibility to use aio at all.
>> This may happen because of evil intentions of the container's user or
>> because of a program error, when the user does this accidentally.
> 
> Hmm... I see.  It feels really wrong to me to make this a first-class
> resource just because there is a system-wide limit.  The only reason I
> can think of for the system-wide limit is to prevent too much kernel
> memory being consumed by creating a lot of aios, but that squarely falls
> inside the cgroup memory controller's protection.  If there are other
> reasons why the number of aios should be limited system-wide, please
> bring them up.
>
> If the only reason is kernel memory consumption protection, the only
> thing we need to do is make sure that the memory used for aio commands
> is accounted against cgroup kernel memory consumption, and relax/remove
> the system-wide limit.

So we just use the GFP_KERNEL_ACCOUNT flag for allocating internal aio
structures and pages, and all the memory will be accounted in kmem and
limited by memcg. Looks very good.

One detail about memory consumption. io_submit() calls the
file_operations::write_iter and read_iter primitives. It's not clear to
me whether they consume the same memory as if the writev() or readv()
system calls were used instead. writev() may delay the actual write until
the dirty pages limit is reached, so it seems the accounting logic should
be the same: aio must not use more unaccounted system memory in
filesystem internals than a simple writev() would.

Could you please say if you have thoughts about this?
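
As a hedged illustration of the suggested GFP_KERNEL_ACCOUNT approach
(hypothetical call site; ioctx_alloc() in fs/aio.c currently uses plain
GFP_KERNEL):

	/* GFP_KERNEL_ACCOUNT = GFP_KERNEL | __GFP_ACCOUNT: the allocation
	 * is charged to the current task's memory cgroup */
	ctx = kmem_cache_zalloc(kioctx_cachep, GFP_KERNEL_ACCOUNT);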

Kirill


Re: [PATCH] blk-mq: Fix several SCSI request queue lockups

2017-12-04 Thread Bart Van Assche
On Tue, 2017-12-05 at 06:42 +0800, Ming Lei wrote:
> On Mon, Dec 04, 2017 at 09:30:32AM -0800, Bart Van Assche wrote:
> > * A systematic lockup for SCSI queues with queue depth 1. The
> >   following test reproduces that bug systematically:
> >   - Change the SRP initiator such that SCSI target queue depth is
> > limited to 1.
> >   - Run the following command:
> >   srp-test/run_tests -f xfs -d -e none -r 60 -t 01
> >   See also "[PATCH 4/7] blk-mq: Avoid that request processing
> >   stalls when sharing tags"
> >   (https://marc.info/?l=linux-block&m=151208695316857). Note:
> >   reverting commit 0df21c86bdbf also fixes a sporadic SCSI request
> >   queue lockup, while inserting a blk_mq_sched_mark_restart_hctx()
> >   before all blk_mq_dispatch_rq_list() calls only fixes the
> >   systematic lockup for queue depth 1.
> 
> You are the only reproducer [ ... ]

That's not correct. I'm pretty sure if you try to reproduce this that
you will see the same hang I ran into. Does this mean that you have not
yet tried to reproduce the hang I reported?

> You said that your patch fixes 'commit b347689ffbca ("blk-mq-sched:
> improve dispatching from sw queue")', but you don't mention any issue
> about that commit.

That's not correct either. From the commit message "A systematic lockup
for SCSI queues with queue depth 1."

> > I think the above means that it is too risky to try to fix all bugs
> > introduced by commit 0df21c86bdbf before kernel v4.15 is released.
> > Hence revert that commit.
> 
> What is the risk?

That more bugs were introduced by commit 0df21c86bdbf than the ones that
have been discovered so far.

Bart.

Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()

2017-12-04 Thread Ming Lei
On Mon, Dec 04, 2017 at 03:09:20PM +, Bart Van Assche wrote:
> On Sun, 2017-12-03 at 00:31 +0800, Ming Lei wrote:
> > Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for 
> > blk-mq")
> 
> It might be safer to revert commit 0df21c86bdbf instead of trying to fix all
> issues introduced by that commit for kernel version v4.15 ...

What are all the issues in v4.15-rc? Up to now, this is the only issue
reported, and it can be fixed by this simple patch, which can be thought
of as a cleanup too.

-- 
Ming


Re: [PATCH] blk-mq: Fix several SCSI request queue lockups

2017-12-04 Thread Ming Lei
On Mon, Dec 04, 2017 at 09:30:32AM -0800, Bart Van Assche wrote:
> Commit 0df21c86bdbf introduced several bugs:
> * A SCSI queue stall for queue depths > 1, addressed by commit
>   88022d7201e9 ("blk-mq: don't handle failure in .get_budget")

This one is committed already.

> * A systematic lockup for SCSI queues with queue depth 1. The
>   following test reproduces that bug systematically:
>   - Change the SRP initiator such that SCSI target queue depth is
> limited to 1.
>   - Run the following command:
>   srp-test/run_tests -f xfs -d -e none -r 60 -t 01
>   See also "[PATCH 4/7] blk-mq: Avoid that request processing
>   stalls when sharing tags"
>   (https://marc.info/?l=linux-block&m=151208695316857). Note:
>   reverting commit 0df21c86bdbf also fixes a sporadic SCSI request
>   queue lockup, while inserting a blk_mq_sched_mark_restart_hctx()
>   before all blk_mq_dispatch_rq_list() calls only fixes the
>   systematic lockup for queue depth 1.

You are the only reproducer, and you don't want to provide any kernel
log about this issue, so how can we help you fix it?

You said that your patch fixes 'commit b347689ffbca ("blk-mq-sched:
improve dispatching from sw queue")', but you don't mention any issue
about that commit, and your patch actually has nothing to do with
commit b347689ffbca; it seems your work style is just try and guess.

Also, both Jens and I have run tests on null_blk and scsi_debug with
queue_depth set to one, and neither of us can see an IO hang with
current blk-mq.

> * A scsi_debug lockup - see also "[PATCH] SCSI: delay run queue if
>   device is blocked in scsi_dev_queue_ready()"
>   (https://marc.info/?l=linux-block&m=151223233407154).

This issue is clearly explained in theory, and can be reproduced/verified
with scsi_debug, so why can't we apply the patch to fix the issue? The
fix is simple and can be thought of as a cleanup too, since the handling
for this case now becomes the same as in the non-mq path.

> 
> I think the above means that it is too risky to try to fix all bugs
> introduced by commit 0df21c86bdbf before kernel v4.15 is released.
> Hence revert that commit.

What is the risk?

> 
> Fixes: commit 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for 
> blk-mq")
> Signed-off-by: Bart Van Assche 
> Cc: Ming Lei 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 
> Cc: James E.J. Bottomley 
> Cc: Martin K. Petersen 
> Cc: linux-s...@vger.kernel.org

This commit fixes an important SCSI_MQ performance issue; we can't
simply revert it just because of one unconfirmed report from you
alone (without any kernel log provided).

So Nak.

-- 
Ming


Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

2017-12-04 Thread Tejun Heo
Hello, Kirill.

On Tue, Dec 05, 2017 at 12:44:00AM +0300, Kirill Tkhai wrote:
> > Can you please explain how this is a fundamental resource which can't
> > be controlled otherwise?
> 
> Currently, aio_nr and aio_max_nr are global. In case of containers this
> means that a single container may occupy all aio requests available in
> the system and deprive the others of the possibility to use aio at all.
> This may happen because of evil intentions of the container's user or
> because of a program error, when the user does this accidentally.

Hmm... I see.  It feels really wrong to me to make this a first-class
resource just because there is a system-wide limit.  The only reason I
can think of for the system-wide limit is to prevent too much kernel
memory being consumed by creating a lot of aios, but that squarely falls
inside the cgroup memory controller's protection.  If there are other
reasons why the number of aios should be limited system-wide, please
bring them up.

If the only reason is kernel memory consumption protection, the only
thing we need to do is make sure that the memory used for aio commands
is accounted against cgroup kernel memory consumption, and relax/remove
the system-wide limit.

Thanks.

-- 
tejun


Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

2017-12-04 Thread Kirill Tkhai
On 05.12.2017 00:35, Jeff Moyer wrote:
> Kirill Tkhai  writes:
> 
>> Hi, Benjamin,
>>
>> On 04.12.2017 19:52, Benjamin LaHaise wrote:
>>> Hi Kirill,
>>>
>>> On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
 Hi,

 this patch set introduces accounting of aio_nr and aio_max_nr per blkio
 cgroup. It may be used to limit the number of aio requests available for
 a cgroup, and could be useful for containers.

 The accounting is hierarchical, and aio contexts allocated in a child
 cgroup are accounted in the parent cgroups too.

 Also, a new cgroup file "io.aio_nr" is introduced. It's used to set the
 cgroup's aio request limit, and to show the current limit and the number
 of currently occupied requests.
>>>
>>> Where are your test cases to check this functionality in the libaio test
>>> suite?
>>
>> I tried to find the actual libaio test suite repository URL on Google,
>> but there is no definite answer. Also, there is no information anywhere
>> in the kernel tree (I hope I grepped right).
>>
>> Could you please provide the URL where the actual upstream libaio can be
>> obtained?
> 
> https://pagure.io/libaio
> 
> Patches can be sent to this list (linux-aio).
> 

Thank you, Jeff.

Kirill

 Patches 1-3 are refactoring.
 Patch 4 is the place where the accounting is actually introduced.
 Patch 5 adds "io.aio_nr" file.

 ---

 Kirill Tkhai (5):
   aio: Move aio_nr increment to separate function
   aio: Export aio_nr_lock and aio_max_nr initial value to 
 include/linux/aio.h
   blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr
   blkcg: Charge aio requests in blkio cgroup hierarchy
   blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr


  block/blk-cgroup.c |   88 +-
  fs/aio.c   |  151 
 
  include/linux/aio.h|   21 ++
  include/linux/blk-cgroup.h |4 +
  4 files changed, 247 insertions(+), 17 deletions(-)

 --
 Signed-off-by: Kirill Tkhai 

>>>
>>


Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

2017-12-04 Thread Kirill Tkhai
Hello, Tejun,

On 04.12.2017 23:07, Tejun Heo wrote:
> On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
>> this patch set introduces accounting of aio_nr and aio_max_nr per blkio
>> cgroup. It may be used to limit the number of aio requests available for
>> a cgroup, and could be useful for containers.
> 
> Can you please explain how this is a fundamental resource which can't
> be controlled otherwise?

Currently, aio_nr and aio_max_nr are global. In case of containers this
means that a single container may occupy all aio requests available in
the system and deprive the others of the possibility to use aio at all.
This may happen because of evil intentions of the container's user or
because of a program error, when the user does this accidentally.

My patch set makes it possible to guarantee that every container or
cgroup has its own number of allowed aios, which nobody can steal, and
therefore nobody can slow down other containers or force their programs
to use direct IO.

AIO gives certain advantages to its users, so this patch set simply
doesn't allow those advantages to be taken away without any possibility
of protecting against that.

This could be used by LXC or for starting some critical micro-services,
for example.

Kirill


Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

2017-12-04 Thread Jeff Moyer
Kirill Tkhai  writes:

> Hi, Benjamin,
>
> On 04.12.2017 19:52, Benjamin LaHaise wrote:
>> Hi Kirill,
>> 
>> On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
>>> Hi,
>>>
>>> this patch set introduces accounting of aio_nr and aio_max_nr per blkio
>>> cgroup. It may be used to limit the number of aio requests available for
>>> a cgroup, and could be useful for containers.
>>>
>>> The accounting is hierarchical, and aio contexts allocated in a child
>>> cgroup are accounted in the parent cgroups too.
>>>
>>> Also, a new cgroup file "io.aio_nr" is introduced. It's used to set the
>>> cgroup's aio request limit, and to show the current limit and the number
>>> of currently occupied requests.
>> 
>> Where are your test cases to check this functionality in the libaio test
>> suite?
>
> I tried to find the actual libaio test suite repository URL on Google,
> but there is no definite answer. Also, there is no information anywhere
> in the kernel tree (I hope I grepped right).
>
> Could you please provide the URL where the actual upstream libaio can be
> obtained?

https://pagure.io/libaio

Patches can be sent to this list (linux-aio).

Cheers,
Jeff

>>> Patches 1-3 are refactoring.
>>> Patch 4 is the place where the accounting is actually introduced.
>>> Patch 5 adds "io.aio_nr" file.
>>>
>>> ---
>>>
>>> Kirill Tkhai (5):
>>>   aio: Move aio_nr increment to separate function
>>>   aio: Export aio_nr_lock and aio_max_nr initial value to 
>>> include/linux/aio.h
>>>   blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr
>>>   blkcg: Charge aio requests in blkio cgroup hierarchy
>>>   blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr
>>>
>>>
>>>  block/blk-cgroup.c |   88 +-
>>>  fs/aio.c   |  151 
>>> 
>>>  include/linux/aio.h|   21 ++
>>>  include/linux/blk-cgroup.h |4 +
>>>  4 files changed, 247 insertions(+), 17 deletions(-)
>>>
>>> --
>>> Signed-off-by: Kirill Tkhai 
>>>
>> 
>


Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

2017-12-04 Thread Kirill Tkhai
Hi, Benjamin,

On 04.12.2017 19:52, Benjamin LaHaise wrote:
> Hi Kirill,
> 
> On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
>> Hi,
>>
>> this patch set introduces accounting of aio_nr and aio_max_nr per blkio
>> cgroup. It may be used to limit the number of aio requests available for
>> a cgroup, and could be useful for containers.
>>
>> The accounting is hierarchical, and aio contexts allocated in a child
>> cgroup are accounted in the parent cgroups too.
>>
>> Also, a new cgroup file "io.aio_nr" is introduced. It's used to set the
>> cgroup's aio request limit, and to show the current limit and the number
>> of currently occupied requests.
> 
> Where are your test cases to check this functionality in the libaio test
> suite?

I tried to find the actual libaio test suite repository URL on Google,
but there is no definite answer. Also, there is no information anywhere
in the kernel tree (I hope I grepped right).

Could you please provide the URL where the actual upstream libaio can be
obtained?

Kirill

> 
>   -ben
> 
>> Patches 1-3 are refactoring.
>> Patch 4 is the place where the accounting is actually introduced.
>> Patch 5 adds "io.aio_nr" file.
>>
>> ---
>>
>> Kirill Tkhai (5):
>>   aio: Move aio_nr increment to separate function
>>   aio: Export aio_nr_lock and aio_max_nr initial value to 
>> include/linux/aio.h
>>   blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr
>>   blkcg: Charge aio requests in blkio cgroup hierarchy
>>   blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr
>>
>>
>>  block/blk-cgroup.c |   88 +-
>>  fs/aio.c   |  151 
>> 
>>  include/linux/aio.h|   21 ++
>>  include/linux/blk-cgroup.h |4 +
>>  4 files changed, 247 insertions(+), 17 deletions(-)
>>
>> --
>> Signed-off-by: Kirill Tkhai 
>>
> 



Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

2017-12-04 Thread Tejun Heo
Hello, Kirill.

On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
> this patch set introduces accounting of aio_nr and aio_max_nr per blkio
> cgroup. It may be used to limit the number of aio requests available for
> a cgroup, and could be useful for containers.

Can you please explain how this is a fundamental resource which can't
be controlled otherwise?

Thanks.

-- 
tejun


Re: [PATCH] generic/473: test return EBUSY from BLKRRPART for mounted whole-dev

2017-12-04 Thread Omar Sandoval
On Mon, Dec 04, 2017 at 05:48:36PM +0800, Xiao Yang wrote:
> On 2017/12/04 17:25, Eryu Guan wrote:
> > On Mon, Dec 04, 2017 at 05:15:17PM +0800, Xiao Yang wrote:
> > > On 2017/12/04 16:29, Eryu Guan wrote:
> > > > On Wed, Nov 29, 2017 at 08:02:26PM +0800, Xiao Yang wrote:
> > > > > If the entire block device is formatted with a filesystem and
> > > > > mounted, running "blockdev --rereadpt" should fail and return
> > > > > EBUSY instead of passing.
> > > > > 
> > > > > Signed-off-by: Xiao Yang
> > > > As we have blktests[1] now, I think this may fit in blktests better?
> > > Hi Eryu,
> > > 
> > > Do you think test cases which use scsi_debug module should be moved into
> > > blktests?
> > > (e.g. generic/108, generic/349, generic/350, generic/351)
> > I don't think they need to be moved to blktests. Most other tests that
> > make use of scsi_debug are for filesystem testing, e.g. generic/108.
> > generic/349 generic/35[01] are a bit special, they were there before
> > blktests was announced available, so they're in a special blockdev group
> > and not in the auto group. If Omar agrees, I think they can be ported to
> > blktests.
> Hi Eryu,
> 
> Thanks for your explanation, and I will try to send it to blktests.
> 
> Thanks,
> Xiao Yang

I agree, the three tests Eryu mentioned and this new test would be a
good fit for blktests. Let me know if you need any help porting; things
are a little different from xfstests.


[PATCH] blk-mq: Fix several SCSI request queue lockups

2017-12-04 Thread Bart Van Assche
Commit 0df21c86bdbf introduced several bugs:
* A SCSI queue stall for queue depths > 1, addressed by commit
  88022d7201e9 ("blk-mq: don't handle failure in .get_budget")
* A systematic lockup for SCSI queues with queue depth 1. The
  following test reproduces that bug systematically:
  - Change the SRP initiator such that SCSI target queue depth is
limited to 1.
  - Run the following command:
  srp-test/run_tests -f xfs -d -e none -r 60 -t 01
  See also "[PATCH 4/7] blk-mq: Avoid that request processing
  stalls when sharing tags"
  (https://marc.info/?l=linux-block&m=151208695316857). Note:
  reverting commit 0df21c86bdbf also fixes a sporadic SCSI request
  queue lockup, while inserting a blk_mq_sched_mark_restart_hctx()
  before all blk_mq_dispatch_rq_list() calls only fixes the
  systematic lockup for queue depth 1.
* A scsi_debug lockup - see also "[PATCH] SCSI: delay run queue if
  device is blocked in scsi_dev_queue_ready()"
  (https://marc.info/?l=linux-block&m=151223233407154).

I think the above means that it is too risky to try to fix all bugs
introduced by commit 0df21c86bdbf before kernel v4.15 is released.
Hence revert that commit.

Fixes: commit 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for 
blk-mq")
Signed-off-by: Bart Van Assche 
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: James E.J. Bottomley 
Cc: Martin K. Petersen 
Cc: linux-s...@vger.kernel.org
---
 drivers/scsi/scsi_lib.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 84bd2b16d216..a7e7966f1477 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1976,9 +1976,11 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx 
*hctx,
struct scsi_device *sdev = q->queuedata;
struct Scsi_Host *shost = sdev->host;
struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
-   blk_status_t ret;
+   blk_status_t ret = BLK_STS_RESOURCE;
int reason;
 
+   if (!scsi_mq_get_budget(hctx))
+   goto out;
ret = prep_to_mq(scsi_prep_state_check(sdev, req));
if (ret != BLK_STS_OK)
goto out_put_budget;
@@ -2022,6 +2024,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx 
*hctx,
atomic_dec(&scsi_target(sdev)->target_busy);
 out_put_budget:
scsi_mq_put_budget(hctx);
+out:
switch (ret) {
case BLK_STS_OK:
break;
@@ -2225,8 +2228,6 @@ struct request_queue *scsi_old_alloc_queue(struct 
scsi_device *sdev)
 }
 
 static const struct blk_mq_ops scsi_mq_ops = {
-   .get_budget = scsi_mq_get_budget,
-   .put_budget = scsi_mq_put_budget,
.queue_rq   = scsi_queue_rq,
.complete   = scsi_softirq_done,
.timeout= scsi_timeout,
-- 
2.15.0



Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

2017-12-04 Thread Benjamin LaHaise
Hi Kirill,

On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
> Hi,
> 
> this patch set introduces accounting of aio_nr and aio_max_nr per blkio
> cgroup. It may be used to limit the number of aio requests available for
> a cgroup, and could be useful for containers.
>
> The accounting is hierarchical, and aio contexts allocated in a child
> cgroup are accounted in the parent cgroups too.
>
> Also, a new cgroup file "io.aio_nr" is introduced. It's used to set the
> cgroup's aio request limit, and to show the current limit and the number
> of currently occupied requests.

Where are your test cases to check this functionality in the libaio test
suite?

-ben

> Patches 1-3 are refactoring.
> Patch 4 is the place where the accounting is actually introduced.
> Patch 5 adds "io.aio_nr" file.
> 
> ---
> 
> Kirill Tkhai (5):
>   aio: Move aio_nr increment to separate function
>   aio: Export aio_nr_lock and aio_max_nr initial value to 
> include/linux/aio.h
>   blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr
>   blkcg: Charge aio requests in blkio cgroup hierarchy
>   blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr
> 
> 
>  block/blk-cgroup.c |   88 +-
>  fs/aio.c   |  151 
> 
>  include/linux/aio.h|   21 ++
>  include/linux/blk-cgroup.h |4 +
>  4 files changed, 247 insertions(+), 17 deletions(-)
> 
> --
> Signed-off-by: Kirill Tkhai 
> 

-- 
"Thought is the essence of where you are now."


[PATCH 2/3] arm64: don't override dma_max_pfn

2017-12-04 Thread Christoph Hellwig
The generic version now takes dma_pfn_offset into account, so there is no
more need for an architecture override.

Signed-off-by: Christoph Hellwig 
---
 arch/arm64/include/asm/dma-mapping.h | 9 -
 1 file changed, 9 deletions(-)

diff --git a/arch/arm64/include/asm/dma-mapping.h 
b/arch/arm64/include/asm/dma-mapping.h
index 0df756b24863..eada887a93bf 100644
--- a/arch/arm64/include/asm/dma-mapping.h
+++ b/arch/arm64/include/asm/dma-mapping.h
@@ -76,14 +76,5 @@ static inline void dma_mark_clean(void *addr, size_t size)
 {
 }
 
-/* Override for dma_max_pfn() */
-static inline unsigned long dma_max_pfn(struct device *dev)
-{
-   dma_addr_t dma_max = (dma_addr_t)*dev->dma_mask;
-
-   return (ulong)dma_to_phys(dev, dma_max) >> PAGE_SHIFT;
-}
-#define dma_max_pfn(dev) dma_max_pfn(dev)
-
 #endif /* __KERNEL__ */
 #endif /* __ASM_DMA_MAPPING_H */
-- 
2.14.2



[PATCH 1/3] dma-mapping: take dma_pfn_offset into account in dma_max_pfn

2017-12-04 Thread Christoph Hellwig
This makes sure the generic version can be used with architectures /
devices that have a DMA offset in the direct mapping.
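
As a worked illustration (made-up numbers, 4 KiB pages): for a device
with a 32-bit dma_mask whose bus address 0 corresponds to physical
address 2 GiB, dma_pfn_offset is 0x80000 and the generic helper yields:

	/* (0xffffffff >> 12) + 0x80000 = 0xfffff + 0x80000 = 0x17ffff,
	 * i.e. the highest physical PFN the device can reach */
	unsigned long max_pfn = (0xffffffffUL >> PAGE_SHIFT) + 0x80000;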

Signed-off-by: Christoph Hellwig 
---
 include/linux/dma-mapping.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index e8f8e8fb244d..86beb9861618 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -692,7 +692,7 @@ static inline int dma_set_seg_boundary(struct device *dev, 
unsigned long mask)
 #ifndef dma_max_pfn
 static inline unsigned long dma_max_pfn(struct device *dev)
 {
-   return *dev->dma_mask >> PAGE_SHIFT;
+   return (*dev->dma_mask >> PAGE_SHIFT) + dev->dma_pfn_offset;
 }
 #endif
 
-- 
2.14.2



[PATCH 3/3] dma-mapping: replace PCI_DMA_BUS_IS_PHYS with a flag in struct dma_map_ops

2017-12-04 Thread Christoph Hellwig
The current PCI_DMA_BUS_IS_PHYS decides if a dma implementation is bound
by the dma mask in the device because it directly maps to a physical
address range (modulo an offset in the device), or if it is virtualized
by an iommu and can map any address (that includes virtual iommus like
swiotlb).  The problem with this scheme is that it is per-architecture and
not per dma_ops instance, and we are growing more and more setups that
have multiple different dma operations in use on a single system, for
which this scheme can't provide a correct answer.  Depending on the
architecture that means we either get a false positive or false negative
at the moment.

This patch instead adds a new is_iommu flag in struct dma_map_ops that
tells if a dma_map_ops instance can map any possible physical address
to replace both the PCI_DMA_BUS_IS_PHYS macro and the is_phys field
in struct dma_map_ops used by it on a few architectures.

Note that this means that we now need a struct device parent in the
Scsi_Host or netdevice.  Every modern driver has these, but there might
still be a few outdated legacy drivers out there, which now won't make
an intelligent decision.
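
A sketch of how a subsystem could consult the new flag (assumed usage
based on the description above; the actual helpers in the patch may look
different):

	static bool can_dma_any_address(struct device *dev)
	{
		const struct dma_map_ops *ops = get_dma_ops(dev);

		/* an IOMMU (including swiotlb) can map any physical
		 * address, so no bounce buffering is needed */
		return ops->is_iommu;
	}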

Signed-off-by: Christoph Hellwig 
---
 arch/alpha/include/asm/pci.h |  5 -
 arch/alpha/kernel/pci_iommu.c|  1 +
 arch/arc/include/asm/pci.h   |  6 --
 arch/arm/include/asm/pci.h   |  7 ---
 arch/arm64/include/asm/pci.h |  5 -
 arch/arm64/mm/dma-mapping.c  |  2 ++
 arch/cris/include/asm/pci.h  |  6 --
 arch/h8300/include/asm/pci.h |  2 --
 arch/hexagon/kernel/dma.c|  1 -
 arch/ia64/hp/common/sba_iommu.c  |  4 +---
 arch/ia64/include/asm/pci.h  | 17 -
 arch/ia64/kernel/pci-swiotlb.c   |  1 +
 arch/ia64/kernel/setup.c | 12 
 arch/ia64/sn/kernel/io_common.c  |  5 -
 arch/ia64/sn/pci/pci_dma.c   |  1 +
 arch/m68k/include/asm/pci.h  |  6 --
 arch/microblaze/include/asm/pci.h|  6 --
 arch/mips/include/asm/pci.h  |  7 ---
 arch/mn10300/include/asm/pci.h   |  6 --
 arch/parisc/include/asm/pci.h| 23 ---
 arch/parisc/kernel/setup.c   |  5 -
 arch/powerpc/include/asm/pci.h   | 18 --
 arch/powerpc/kernel/dma-iommu.c  |  1 +
 arch/powerpc/kernel/dma-swiotlb.c|  1 +
 arch/powerpc/platforms/cell/iommu.c  |  1 +
 arch/powerpc/platforms/ps3/system-bus.c  |  2 ++
 arch/powerpc/platforms/pseries/ibmebus.c |  1 +
 arch/powerpc/platforms/pseries/vio.c |  1 +
 arch/riscv/include/asm/pci.h |  3 ---
 arch/s390/include/asm/pci.h  |  2 --
 arch/s390/pci/pci_dma.c  |  3 +--
 arch/sh/include/asm/pci.h|  6 --
 arch/sh/kernel/dma-nommu.c   |  1 -
 arch/sparc/include/asm/pci_32.h  |  4 
 arch/sparc/include/asm/pci_64.h  |  6 --
 arch/sparc/kernel/iommu.c|  1 +
 arch/sparc/kernel/ioport.c   |  1 +
 arch/sparc/kernel/pci_sun4v.c|  1 +
 arch/tile/include/asm/pci.h  | 14 --
 arch/tile/kernel/pci-dma.c   |  2 ++
 arch/x86/include/asm/pci.h   |  2 --
 arch/x86/kernel/amd_gart_64.c|  1 +
 arch/x86/kernel/pci-calgary_64.c |  1 +
 arch/x86/kernel/pci-nommu.c  |  1 -
 arch/x86/kernel/pci-swiotlb.c|  1 +
 arch/x86/mm/mem_encrypt.c|  1 +
 arch/x86/pci/sta2x11-fixup.c |  1 +
 arch/xtensa/include/asm/pci.h|  7 ---
 drivers/ide/ide-lib.c|  5 ++---
 drivers/ide/ide-probe.c  |  2 +-
 drivers/iommu/amd_iommu.c|  1 +
 drivers/iommu/intel-iommu.c  |  1 +
 drivers/parisc/ccio-dma.c|  3 +--
 drivers/parisc/sba_iommu.c   |  3 +--
 drivers/pci/host/vmd.c   |  1 +
 drivers/scsi/scsi_lib.c  | 14 ++
 drivers/xen/swiotlb-xen.c|  1 +
 include/asm-generic/pci.h|  8 
 include/linux/dma-mapping.h  | 16 +++-
 lib/dma-virt.c   |  1 +
 net/core/dev.c   | 18 --
 tools/virtio/linux/dma-mapping.h |  2 --
 62 files changed, 63 insertions(+), 225 deletions(-)

diff --git a/arch/alpha/include/asm/pci.h b/arch/alpha/include/asm/pci.h
index b9ec55351924..cf6bc1e64d66 100644
--- a/arch/alpha/include/asm/pci.h
+++ b/arch/alpha/include/asm/pci.h
@@ -56,11 +56,6 @@ struct pci_controller {
 
 /* IOMMU controls.  */
 
-/* The PCI address space does not equal the physical memory address space.
-   The networking and block device layers use this boolean for bounce buffer
-   decisions.  */
-#define PCI_DMA_BUS_IS_PHYS  0
-
 /* TODO: integrate

replace PCI_DMA_BUS_IS_PHYS with a per-instance flag

2017-12-04 Thread Christoph Hellwig
Hi all,

this small series tries to get rid of the global and misnamed
PCI_DMA_BUS_IS_PHYS flag, and replace it with a setting in each
struct dma_map_ops instance.


Re: [PATCH] um: Convert ubd driver to blk-mq

2017-12-04 Thread Christoph Hellwig
On Sun, Dec 03, 2017 at 10:49:23PM +0100, Richard Weinberger wrote:
> Convert the driver to the modern blk-mq framework.
> As a byproduct, we get rid of our open-coded restart logic and let
> blk-mq handle it.
> 
> Signed-off-by: Richard Weinberger 
> ---
>  arch/um/drivers/ubd_kern.c | 178 
> +++--
>  1 file changed, 93 insertions(+), 85 deletions(-)
> 
> diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c
> index b55fe9bf5d3e..deceb8022a28 100644
> --- a/arch/um/drivers/ubd_kern.c
> +++ b/arch/um/drivers/ubd_kern.c
> @@ -23,6 +23,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -142,7 +143,6 @@ struct cow {
>  #define MAX_SG 64
>  
>  struct ubd {
> - struct list_head restart;
>   /* name (and fd, below) of the file opened for writing, either the
>* backing or the cow file. */
>   char *file;
> @@ -156,9 +156,12 @@ struct ubd {
>   struct cow cow;
>   struct platform_device pdev;
>   struct request_queue *queue;
> + struct blk_mq_tag_set tag_set;
>   spinlock_t lock;
> +};
> +
> +struct ubd_pdu {
>   struct scatterlist sg[MAX_SG];
> - struct request *request;
>   int start_sg, end_sg;
>   sector_t rq_pos;
>  };
> @@ -182,10 +185,6 @@ struct ubd {
>   .shared =   0, \
>   .cow =  DEFAULT_COW, \
>   .lock = __SPIN_LOCK_UNLOCKED(ubd_devs.lock), \
> - .request =  NULL, \
> - .start_sg = 0, \
> - .end_sg =   0, \
> - .rq_pos =   0, \
>  }
>  
>  /* Protected by ubd_lock */
> @@ -196,6 +195,12 @@ static int fake_ide = 0;
>  static struct proc_dir_entry *proc_ide_root = NULL;
>  static struct proc_dir_entry *proc_ide = NULL;
>  
> +static blk_status_t ubd_queue_rq(struct blk_mq_hw_ctx *hctx,
> +  const struct blk_mq_queue_data *bd);
> +static int ubd_init_request(struct blk_mq_tag_set *set,
> + struct request *req, unsigned int hctx_idx,
> + unsigned int numa_node);
> +
>  static void make_proc_ide(void)
>  {
>   proc_ide_root = proc_mkdir("ide", NULL);
> @@ -448,11 +453,8 @@ __uml_help(udb_setup,
>  "in the boot output.\n\n"
>  );
>  
> -static void do_ubd_request(struct request_queue * q);
> -
>  /* Only changed by ubd_init, which is an initcall. */
>  static int thread_fd = -1;
> -static LIST_HEAD(restart);
>  
>  /* Function to read several request pointers at a time
>  * handling fractional reads if (and as) needed
> @@ -510,9 +512,6 @@ static int bulk_req_safe_read(
>  /* Called without dev->lock held, and only in interrupt context. */
>  static void ubd_handler(void)
>  {
> - struct ubd *ubd;
> - struct list_head *list, *next_ele;
> - unsigned long flags;
>   int n;
>   int count;
>  
> @@ -532,23 +531,17 @@ static void ubd_handler(void)
>   return;
>   }
>   for (count = 0; count < n/sizeof(struct io_thread_req *); 
> count++) {
> - blk_end_request(
> - (*irq_req_buffer)[count]->req,
> - BLK_STS_OK,
> - (*irq_req_buffer)[count]->length
> - );
> - kfree((*irq_req_buffer)[count]);
> + struct io_thread_req *io_req = (*irq_req_buffer)[count];
> + int err = io_req->error ? BLK_STS_IOERR : BLK_STS_OK;
> +
> + if (!blk_update_request(io_req->req, err, 
> io_req->length))
> + __blk_mq_end_request(io_req->req, err);
> +
> + kfree(io_req);
>   }
>   }
> - reactivate_fd(thread_fd, UBD_IRQ);
>  
> - list_for_each_safe(list, next_ele, &restart){
> - ubd = container_of(list, struct ubd, restart);
> - list_del_init(&ubd->restart);
> - spin_lock_irqsave(&ubd->lock, flags);
> - do_ubd_request(ubd->queue);
> - spin_unlock_irqrestore(&ubd->lock, flags);
> - }
> + reactivate_fd(thread_fd, UBD_IRQ);
>  }
>  
>  static irqreturn_t ubd_intr(int irq, void *dev)
> @@ -869,6 +862,7 @@ static void ubd_device_release(struct device *dev)
>   struct ubd *ubd_dev = dev_get_drvdata(dev);
>  
>   blk_cleanup_queue(ubd_dev->queue);
> + blk_mq_free_tag_set(&ubd_dev->tag_set);
>   *ubd_dev = ((struct ubd) DEFAULT_UBD);
>  }
>  
> @@ -911,6 +905,11 @@ static int ubd_disk_register(int major, u64 size, int 
> unit,
>  
>  #define ROUND_BLOCK(n) ((n + ((1 << 9) - 1)) & (-1 << 9))
>  
> +static const struct blk_mq_ops ubd_mq_ops = {
> + .queue_rq = ubd_queue_rq,
> + .init_request = ubd_init_request,
> +};

Use tabs to align the "=" signs?
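
For instance (same initializers from the patch above, tab-aligned):

	static const struct blk_mq_ops ubd_mq_ops = {
		.queue_rq	= ubd_queue_rq,
		.init_request	= ubd_init_request,
	};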

> + blk_mq_start_request(req);
> +
> + pdu->rq_pos = blk_rq_pos(req);
> + pdu->start_sg = 0;
> + pdu->end

Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)

2017-12-04 Thread Christoph Hellwig
On Wed, Nov 29, 2017 at 08:18:09PM +0100, Christian Borntraeger wrote:
> Works fine under KVM with virtio-blk, but still hangs during boot in an LPAR.
> FWIW, the system not only has scsi disks via fcp but also DASDs as a boot 
> disk.
> Seems that this is the place where the system stops. (see the sysrq-t output
> at the bottom).

Can you check which of the patches in the tree is the culprit?


[PATCH 2/5] aio: Export aio_nr_lock and aio_max_nr initial value to include/linux/aio.h

2017-12-04 Thread Kirill Tkhai
The next patch will use the values in more files, so let's make them
visible externally.

Signed-off-by: Kirill Tkhai 
---
 fs/aio.c|4 ++--
 include/linux/aio.h |4 
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 04209c0561b2..9dc98a29077c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -188,10 +188,10 @@ struct aio_kiocb {
struct eventfd_ctx  *ki_eventfd;
 };
 
+DEFINE_SPINLOCK(aio_nr_lock);
 /*-- sysctl variables*/
-static DEFINE_SPINLOCK(aio_nr_lock);
 unsigned long aio_nr;  /* current system wide number of aio requests */
-unsigned long aio_max_nr = 0x10000; /* system wide maximum number of aio 
requests */
+unsigned long aio_max_nr = AIO_NR_DEF; /* system wide maximum number of aio 
requests */
 /*end sysctl variables---*/
 
 static struct kmem_cache   *kiocb_cachep;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 9d8aabecfe2d..5dda2663802f 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -10,6 +10,10 @@ struct mm_struct;
 
 #define KIOCB_KEY  0
 
+#define AIO_NR_INF (-1UL)
+#define AIO_NR_DEF 0x10000
+
+extern spinlock_t aio_nr_lock;
 typedef int (kiocb_cancel_fn)(struct kiocb *);
 
 /* prototypes */



[PATCH 1/5] aio: Move aio_nr increment to separate function

2017-12-04 Thread Kirill Tkhai
There are no functional changes, only preparation
for the next patches.

Signed-off-by: Kirill Tkhai 
---
 fs/aio.c |   44 
 1 file changed, 32 insertions(+), 12 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index e6de7715228c..04209c0561b2 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -694,13 +694,39 @@ static int ioctx_add_table(struct kioctx *ctx, struct 
mm_struct *mm)
}
 }
 
-static void aio_nr_sub(unsigned nr)
+static bool __try_to_charge_aio_nr(unsigned nr)
+{
+   if (aio_nr + nr > aio_max_nr ||
+   aio_nr + nr < aio_nr)
+   return false;
+
+   aio_nr += nr;
+   return true;
+}
+
+static void __uncharge_aio_nr(unsigned nr)
 {
-   spin_lock(&aio_nr_lock);
if (WARN_ON(aio_nr - nr > aio_nr))
aio_nr = 0;
else
aio_nr -= nr;
+}
+
+static bool try_to_charge_aio_nr(unsigned nr)
+{
+   bool ret;
+
+   spin_lock(&aio_nr_lock);
+   ret = __try_to_charge_aio_nr(nr);
+   spin_unlock(&aio_nr_lock);
+
+   return ret;
+}
+
+static void uncharge_aio_nr(unsigned nr)
+{
+   spin_lock(&aio_nr_lock);
+   __uncharge_aio_nr(nr);
spin_unlock(&aio_nr_lock);
 }
 
@@ -776,15 +802,9 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
ctx->req_batch = 1;
 
/* limit the number of system wide aios */
-   spin_lock(&aio_nr_lock);
-   if (aio_nr + ctx->max_reqs > aio_max_nr ||
-   aio_nr + ctx->max_reqs < aio_nr) {
-   spin_unlock(&aio_nr_lock);
-   err = -EAGAIN;
+   err = -EAGAIN;
+   if (!try_to_charge_aio_nr(ctx->max_reqs))
goto err_ctx;
-   }
-   aio_nr += ctx->max_reqs;
-   spin_unlock(&aio_nr_lock);
 
percpu_ref_get(&ctx->users);/* io_setup() will drop this ref */
percpu_ref_get(&ctx->reqs); /* free_ioctx_users() will drop this */
@@ -801,7 +821,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
return ctx;
 
 err_cleanup:
-   aio_nr_sub(ctx->max_reqs);
+   uncharge_aio_nr(ctx->max_reqs);
 err_ctx:
atomic_set(&ctx->dead, 1);
if (ctx->mmap_size)
@@ -848,7 +868,7 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx 
*ctx,
 * -EAGAIN with no ioctxs actually in use (as far as userspace
 *  could tell).
 */
-   aio_nr_sub(ctx->max_reqs);
+   uncharge_aio_nr(ctx->max_reqs);
 
if (ctx->mmap_size)
vm_munmap(ctx->mmap_base, ctx->mmap_size);



[PATCH 5/5] blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr

2017-12-04 Thread Kirill Tkhai
Add a file to configure the per-cgroup maximum number of available
aio requests.

Writing a value which is less than the currently allocated number of
requests is rejected with -EBUSY.

Show the numbers of currently allocated and maximum available requests.

Signed-off-by: Kirill Tkhai 
---
 block/blk-cgroup.c |   40 
 1 file changed, 40 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 9cc6e9574946..dc5600ef4523 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -981,12 +981,52 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
return 0;
 }
 
+#ifdef CONFIG_AIO
+static int blkcg_write_aio_max_nr(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 val)
+{
+   struct blkcg *blkg = css_to_blkcg(css);
+   int ret = 0;
+
+   percpu_down_read(&cgroup_threadgroup_rwsem);
+   spin_lock(&aio_nr_lock);
+   if (val >= blkg->blkg_aio_nr)
+   blkg->blkg_aio_max_nr = val;
+   else
+   ret = -EBUSY;
+   spin_unlock(&aio_nr_lock);
+   percpu_up_read(&cgroup_threadgroup_rwsem);
+   return ret;
+}
+
+static int blkcg_show_aio_nrs(struct seq_file *sf, void *v)
+{
+   struct blkcg *blkg = css_to_blkcg(seq_css(sf));
+   unsigned long max_nr, nr;
+
+   spin_lock(&aio_nr_lock);
+   max_nr = blkg->blkg_aio_max_nr;
+   nr = blkg->blkg_aio_nr;
+   spin_unlock(&aio_nr_lock);
+
+   seq_printf(sf, "used=%lu, max=%lu\n", nr, max_nr);
+   return 0;
+}
+#endif /* CONFIG_AIO */
+
 static struct cftype blkcg_files[] = {
{
.name = "stat",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = blkcg_print_stat,
},
+#ifdef CONFIG_AIO
+   {
+   .name = "aio_nr",
+   .write_u64 = blkcg_write_aio_max_nr,
+   .seq_show = blkcg_show_aio_nrs,
+   },
+#endif
{ } /* terminate */
 };
 



[PATCH 4/5] blkcg: Charge aio requests in blkio cgroup hierarchy

2017-12-04 Thread Kirill Tkhai
This patch adds accounting of the number of requests of allocated aio
contexts per blkio cgroup, and aggregates child cgroups' requests
up the hierarchy. This may be used to limit the aio requests available
for containers.

By default, a newly allocated blkcg::blkg_aio_max_nr is set
to the "unlimited" value (see blkcg_css_alloc() in the previous patch).
This guarantees that applications which do not know about
blkcg::blkg_aio_max_nr will be able to run as they did before,
without configuring the child cgroup's blkg_aio_max_nr.

To protect "task attach" against "io_context create/destroy",
the read-locked cgroup_threadgroup_rwsem is used. We take it
via the cgroup_threadgroup_change_*() interfaces, which are used
around the places where we charge kioctx::max_reqs and link a ctx
to mm_struct::ioctx_table.

Single allocations are protected by aio_nr_lock as before.

Signed-off-by: Kirill Tkhai 
---
 block/blk-cgroup.c  |   44 +-
 fs/aio.c|  101 +++
 include/linux/aio.h |   11 ++
 3 files changed, 153 insertions(+), 3 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 774560469b01..9cc6e9574946 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1217,8 +1217,8 @@ void blkcg_exit_queue(struct request_queue *q)
  */
 static int blkcg_can_attach(struct cgroup_taskset *tset)
 {
-   struct task_struct *task;
-   struct cgroup_subsys_state *dst_css;
+   struct cgroup_subsys_state *dst_css, *old_css;
+   struct task_struct *task, *p;
struct io_context *ioc;
int ret = 0;
 
@@ -1230,11 +1230,46 @@ static int blkcg_can_attach(struct cgroup_taskset *tset)
ret = -EINVAL;
task_unlock(task);
if (ret)
-   break;
+   goto err;
+   if (!thread_group_leader(task))
+   continue;
+   ret = charge_task_aio_nr(task, css_to_blkcg(dst_css));
+   if (ret)
+   goto err;
+   old_css = task_css(task, io_cgrp_id);
+   uncharge_task_aio_nr(task, css_to_blkcg(old_css));
+   }
+err:
+   if (ret) {
+   cgroup_taskset_for_each(p, dst_css, tset) {
+   if (p == task)
+   break;
+   if (!thread_group_leader(p))
+   continue;
+   uncharge_task_aio_nr(p, css_to_blkcg(dst_css));
+   old_css = task_css(p, io_cgrp_id);
+   WARN_ON_ONCE(charge_task_aio_nr(p, 
css_to_blkcg(old_css)));
+   }
}
return ret;
 }
 
+#ifdef CONFIG_AIO
+static void blkcg_cancel_attach(struct cgroup_taskset *tset)
+{
+   struct cgroup_subsys_state *dst_css, *old_css;
+   struct task_struct *p;
+
+   cgroup_taskset_for_each(p, dst_css, tset) {
+   if (!thread_group_leader(p))
+   continue;
+   uncharge_task_aio_nr(p, css_to_blkcg(dst_css));
+   old_css = task_css(p, io_cgrp_id);
+   WARN_ON_ONCE(charge_task_aio_nr(p, css_to_blkcg(old_css)));
+   }
+}
+#endif
+
 static void blkcg_bind(struct cgroup_subsys_state *root_css)
 {
int i;
@@ -1260,6 +1295,9 @@ struct cgroup_subsys io_cgrp_subsys = {
.css_offline = blkcg_css_offline,
.css_free = blkcg_css_free,
.can_attach = blkcg_can_attach,
+#ifdef CONFIG_AIO
+   .cancel_attach = blkcg_cancel_attach,
+#endif
.bind = blkcg_bind,
.dfl_cftypes = blkcg_files,
.legacy_cftypes = blkcg_legacy_files,
diff --git a/fs/aio.c b/fs/aio.c
index 755f97a42ebe..2e63f5c582c0 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -696,6 +697,97 @@ static int ioctx_add_table(struct kioctx *ctx, struct 
mm_struct *mm)
}
 }
 
+#ifdef CONFIG_BLK_CGROUP
+static bool try_to_charge_blkcg(unsigned long nr, struct blkcg *blkg)
+{
+   struct blkcg *tmp = blkg;
+
+   while (blkg) {
+   if (nr + blkg->blkg_aio_nr > blkg->blkg_aio_max_nr ||
+   nr + blkg->blkg_aio_nr < nr)
+   goto fail;
+
+   blkg->blkg_aio_nr += nr;
+   blkg = blkcg_parent(blkg);
+   }
+
+   return true;
+fail:
+   while (tmp != blkg) {
+   tmp->blkg_aio_nr -= nr;
+   tmp = blkcg_parent(tmp);
+   }
+   return false;
+}
+
+
+static void uncharge_blkcg(unsigned long nr, struct blkcg *blkg)
+{
+   while (blkg) {
+   blkg->blkg_aio_nr -= nr;
+   blkg = blkcg_parent(blkg);
+   }
+}
+
+static bool __try_to_charge_aio_nr(unsigned nr)
+{
+   struct blkcg *blkg;
+
+   percpu_rwsem_assert_held(&cgroup_threadgroup_rwsem);
+   blkg = container_of(task_css_check(current, io_cgrp_id, true),
+

[PATCH 3/5] blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr

2017-12-04 Thread Kirill Tkhai
This adds new members to struct blkcg, which will
be used to account the number of a cgroup's aio requests.

Also, blkcg_root is used to store the sysctl variables
aio_nr and aio_max_nr.

Signed-off-by: Kirill Tkhai 
---
 block/blk-cgroup.c |4 
 fs/aio.c   |2 ++
 include/linux/aio.h|6 ++
 include/linux/blk-cgroup.h |4 
 4 files changed, 16 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d3f56baee936..774560469b01 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "blk.h"
 
 #define MAX_KEY_LEN 100
@@ -1101,6 +1102,9 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
INIT_HLIST_HEAD(&blkcg->blkg_list);
 #ifdef CONFIG_CGROUP_WRITEBACK
INIT_LIST_HEAD(&blkcg->cgwb_list);
+#endif
+#ifdef CONFIG_AIO
+   blkcg->blkg_aio_max_nr = parent_css ? AIO_NR_INF : AIO_NR_DEF;
 #endif
list_add_tail(&blkcg->all_blkcgs_node, &all_blkcgs);
 
diff --git a/fs/aio.c b/fs/aio.c
index 9dc98a29077c..755f97a42ebe 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -189,10 +189,12 @@ struct aio_kiocb {
 };
 
 DEFINE_SPINLOCK(aio_nr_lock);
+#ifndef CONFIG_BLK_CGROUP
 /*-- sysctl variables*/
 unsigned long aio_nr;  /* current system wide number of aio requests */
 unsigned long aio_max_nr = AIO_NR_DEF; /* system wide maximum number of aio 
requests */
 /*end sysctl variables---*/
+#endif
 
 static struct kmem_cache   *kiocb_cachep;
 static struct kmem_cache   *kioctx_cachep;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 5dda2663802f..de929a8c9c59 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -2,6 +2,7 @@
 #ifndef __LINUX__AIO_H
 #define __LINUX__AIO_H
 
+#include 
 #include 
 
 struct kioctx;
@@ -26,8 +27,13 @@ static inline void kiocb_set_cancel_fn(struct kiocb *req,
   kiocb_cancel_fn *cancel) { }
 #endif /* CONFIG_AIO */
 
+#if !defined(CONFIG_BLK_CGROUP) || !defined(CONFIG_AIO)
 /* for sysctl: */
 extern unsigned long aio_nr;
 extern unsigned long aio_max_nr;
+#else
+#define aio_nr blkcg_root.blkg_aio_nr
+#define aio_max_nr blkcg_root.blkg_aio_max_nr
+#endif /* !CONFIG_BLK_CGROUP || !CONFIG_AIO */
 
 #endif /* __LINUX__AIO_H */
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 8bbc3716507a..3d9c176fc173 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -55,6 +55,10 @@ struct blkcg {
 #ifdef CONFIG_CGROUP_WRITEBACK
struct list_headcgwb_list;
 #endif
+#ifdef CONFIG_AIO
+   unsigned long   blkg_aio_nr;
+   unsigned long   blkg_aio_max_nr;
+#endif
 };
 
 /*



[PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

2017-12-04 Thread Kirill Tkhai
Hi,

this patch set introduces accounting of aio_nr and aio_max_nr per blkio
cgroup. It may be used to limit the number of aio requests available for
a cgroup, and could be useful for containers.

The accounting is hierarchical, and aio contexts allocated in a child
cgroup are accounted in the parent cgroups too.

Also, a new cgroup file "io.aio_nr" is introduced. It's used to set the
cgroup's aio request limit, and to show the current limit and the number
of currently occupied requests.

Patches 1-3 are refactoring.
Patch 4 is the place where the accounting is actually introduced.
Patch 5 adds "io.aio_nr" file.

---

Kirill Tkhai (5):
  aio: Move aio_nr increment to separate function
  aio: Export aio_nr_lock and aio_max_nr initial value to 
include/linux/aio.h
  blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr
  blkcg: Charge aio requests in blkio cgroup hierarchy
  blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr


 block/blk-cgroup.c |   88 +-
 fs/aio.c   |  151 
 include/linux/aio.h|   21 ++
 include/linux/blk-cgroup.h |4 +
 4 files changed, 247 insertions(+), 17 deletions(-)

--
Signed-off-by: Kirill Tkhai 


Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()

2017-12-04 Thread Bart Van Assche
On Sun, 2017-12-03 at 00:31 +0800, Ming Lei wrote:
> Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for blk-mq")

It might be safer to revert commit 0df21c86bdbf instead of trying to fix all
issues introduced by that commit for kernel version v4.15 ...

Bart.

Re: [PATCH V2 0/2] block: fix queue freeze and cleanup

2017-12-04 Thread Mauricio Faria de Oliveira

On 12/01/2017 10:49 PM, Ming Lei wrote:

Unfortunately, v2 has been found to exhibit another problem: an Oops in
the systemd shutdown path (at least for device-mapper LVM volumes) when
calling 'q->request_fn(q)' while 'q->request_fn' is NULL.

That issue can be fixed easily in V2 and we understand DM's case clearly.


Yes. It just had not been mentioned upstream yet.


But the most important thing is that V2 can't fix Chenxiang's issue, but
V3/V1 can.


Agree.


--
Mauricio Faria de Oliveira
IBM Linux Technology Center



Re: WARNING in kmalloc_slab (3)

2017-12-04 Thread Dan Carpenter
On Sun, Dec 03, 2017 at 12:16:08PM -0800, Eric Biggers wrote:
> Looks like BLKTRACESETUP doesn't limit the '.buf_nr' parameter, allowing 
> anyone
> who can open a block device to cause an extremely large kmalloc.  Here's a
> simplified reproducer:
> 

There are lots of places which allow people to allocate as much as they
want.  With syzkaller, you might want to just hard-code a __GFP_NOWARN
in to disable it.
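
For example, the usual pattern for a user-sized allocation (hypothetical
call site, not the actual blktrace fix):

	/* 'buf_nr' comes from userspace; suppress the page-alloc warning
	 * and let the caller handle the failure */
	buf = kmalloc_array(buf_nr, buf_size, GFP_KERNEL | __GFP_NOWARN);
	if (!buf)
		return -ENOMEM;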

regards,
dan carpenter



Re: blk-mq + bfq: udevd hang on usb2 storages

2017-12-04 Thread Ming Lei
On Fri, Dec 01, 2017 at 06:04:29PM +0100, Alban Browaeys wrote:
> I initially reported this as
> https://bugzilla.kernel.org/show_bug.cgi?id=198023 .
> 
> I have now bisected this issue to commit a6a252e6491443c1c1 "blk-mq-
> sched: decide how to handle flush rq via RQF_FLUSH_SEQ".
> 
> This is with a Sandisk Cruzer USB stick (USB version 2.10) that I
> regressed with.
> systemctl restart systemd-udevd restores sanity.
> 
> PS: With a USB3 Lexar (USB version 3.00) I get a more severe issue
> (not bisected) where I find no way out other than a reboot. My report to
> bugzilla has logs from when I was swapping between these keys. The logs
> attached there mix what looks like two different behaviors.

Hi Paolo,

From both Alban's trace and my trace, it looks like this issue is in BFQ,
since a request can't be retrieved via e->type->ops.mq.dispatch_request()
in blk_mq_do_dispatch_sched() after it is inserted into BFQ's queue.

https://bugzilla.kernel.org/show_bug.cgi?id=198023#c4
https://marc.info/?l=linux-block&m=151214241518562&w=2
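
For reference, a condensed sketch of the loop in question (simplified
from blk_mq_do_dispatch_sched(); real signatures and locking differ):

	struct elevator_queue *e = hctx->queue->elevator;
	LIST_HEAD(rq_list);

	do {
		struct request *rq = e->type->ops.mq.dispatch_request(hctx);

		if (!rq)
			break;	/* elevator kept the request back */
		list_add(&rq->queuelist, &rq_list);
	} while (blk_mq_dispatch_rq_list(hctx->queue, &rq_list));

If dispatch_request() returns NULL while a request is still queued
inside BFQ, nothing re-runs the hardware queue afterwards, which matches
the hang seen in the traces above.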

BTW, I have tried to reproduce the issue with scsi_debug but did not
succeed, and it can't be reproduced with other schedulers (mq-deadline,
none) either.

So could you take a look?

Thanks,
Ming


[PATCH V2] block, bfq: remove batches of confusing ifdefs

2017-12-04 Thread Paolo Valente
Commit a33801e8b473 ("block, bfq: move debug blkio stats behind
CONFIG_DEBUG_BLK_CGROUP") introduced two batches of confusing ifdefs:
one reported in [1], plus a similar one in another function. This
commit removes both batches, in the way suggested in [1].

[1] https://www.spinics.net/lists/linux-block/msg20043.html
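
The shape of the cleanup is the usual stub pattern (generic sketch,
names made up): compile the stats helper only when both options are
enabled, and give callers an empty inline otherwise, so no ifdefs leak
into the main dispatch/insert paths:

	#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
	static void bfq_update_foo_stats(struct request_queue *q)
	{
		/* real stats update, under q->queue_lock */
	}
	#else
	static inline void bfq_update_foo_stats(struct request_queue *q) {}
	#endif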

Fixes: a33801e8b473 ("block, bfq: move debug blkio stats behind 
CONFIG_DEBUG_BLK_CGROUP")
Reported-by: Linus Torvalds 
Tested-by: Luca Miccio 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 127 +---
 1 file changed, 72 insertions(+), 55 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index bcb6d21..3feed64 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3689,35 +3689,16 @@ static struct request *__bfq_dispatch_request(struct 
blk_mq_hw_ctx *hctx)
return rq;
 }
 
-static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
-{
-   struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
-   struct request *rq;
 #if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
-   struct bfq_queue *in_serv_queue, *bfqq;
-   bool waiting_rq, idle_timer_disabled;
-#endif
-
-   spin_lock_irq(&bfqd->lock);
-
-#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
-   in_serv_queue = bfqd->in_service_queue;
-   waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);
-
-   rq = __bfq_dispatch_request(hctx);
-
-   idle_timer_disabled =
-   waiting_rq && !bfq_bfqq_wait_request(in_serv_queue);
-
-#else
-   rq = __bfq_dispatch_request(hctx);
-#endif
-   spin_unlock_irq(&bfqd->lock);
+static void bfq_update_dispatch_stats(struct request_queue *q,
+ struct request *rq,
+ struct bfq_queue *in_serv_queue,
+ bool idle_timer_disabled)
+{
+   struct bfq_queue *bfqq = rq ? RQ_BFQQ(rq) : NULL;
 
-#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
-   bfqq = rq ? RQ_BFQQ(rq) : NULL;
if (!idle_timer_disabled && !bfqq)
-   return rq;
+   return;
 
/*
 * rq and bfqq are guaranteed to exist until this function
@@ -3732,7 +3713,7 @@ static struct request *bfq_dispatch_request(struct 
blk_mq_hw_ctx *hctx)
 * In addition, the following queue lock guarantees that
 * bfqq_group(bfqq) exists as well.
 */
-   spin_lock_irq(hctx->queue->queue_lock);
+   spin_lock_irq(q->queue_lock);
if (idle_timer_disabled)
/*
 * Since the idle timer has been disabled,
@@ -3751,9 +3732,37 @@ static struct request *bfq_dispatch_request(struct 
blk_mq_hw_ctx *hctx)
bfqg_stats_set_start_empty_time(bfqg);
bfqg_stats_update_io_remove(bfqg, rq->cmd_flags);
}
-   spin_unlock_irq(hctx->queue->queue_lock);
+   spin_unlock_irq(q->queue_lock);
+}
+#else
+static inline void bfq_update_dispatch_stats(struct request_queue *q,
+struct request *rq,
+struct bfq_queue *in_serv_queue,
+bool idle_timer_disabled) {}
 #endif
 
+static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+   struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
+   struct request *rq;
+   struct bfq_queue *in_serv_queue;
+   bool waiting_rq, idle_timer_disabled;
+
+   spin_lock_irq(&bfqd->lock);
+
+   in_serv_queue = bfqd->in_service_queue;
+   waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);
+
+   rq = __bfq_dispatch_request(hctx);
+
+   idle_timer_disabled =
+   waiting_rq && !bfq_bfqq_wait_request(in_serv_queue);
+
+   spin_unlock_irq(&bfqd->lock);
+
+   bfq_update_dispatch_stats(hctx->queue, rq, in_serv_queue,
+ idle_timer_disabled);
+
return rq;
 }
 
@@ -4276,16 +4285,46 @@ static bool __bfq_insert_request(struct bfq_data *bfqd, 
struct request *rq)
return idle_timer_disabled;
 }
 
+#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
+static void bfq_update_insert_stats(struct request_queue *q,
+   struct bfq_queue *bfqq,
+   bool idle_timer_disabled,
+   unsigned int cmd_flags)
+{
+   if (!bfqq)
+   return;
+
+   /*
+* bfqq still exists, because it can disappear only after
+* either it is merged with another queue, or the process it
+* is associated with exits. But both actions must be taken by
+* the same process currently executing this flow of
+* instructions.
+*
+* In addition, the following queue 

Re: [PATCH] generic/473: test return EBUSY from BLKRRPART for mounted whole-dev

2017-12-04 Thread Xiao Yang

On 2017/12/04 17:25, Eryu Guan wrote:

On Mon, Dec 04, 2017 at 05:15:17PM +0800, Xiao Yang wrote:

On 2017/12/04 16:29, Eryu Guan wrote:

On Wed, Nov 29, 2017 at 08:02:26PM +0800, Xiao Yang wrote:

If the entire block device is formatted with a filesystem and
mounted, running "blockdev --rereadpt" should fail and return
EBUSY instead of passing.

Signed-off-by: Xiao Yang

As we have blktests[1] now, I think this may fit in blktests better?

Hi Eryu,

Do you think test cases which use scsi_debug module should be moved into
blktests?
(e.g. generic/108, generic/349, generic/350, generic/351)

I don't think they need to be moved to blktests. Most other tests that
make use of scsi_debug are for filesystem testing, e.g. generic/108.
generic/349 generic/35[01] are a bit special, they were there before
blktests was announced available, so they're in a special blockdev group
and not in the auto group. If Omar agrees, I think they can be ported to
blktests.

Hi Eryu,

Thanks for your explanation, and I will try to send it to blktests.

Thanks,
Xiao Yang

Thanks,
Eryu



Re: WARNING in kmalloc_slab (3)

2017-12-04 Thread Dan Carpenter
On Mon, Dec 04, 2017 at 09:18:05AM +0100, Dmitry Vyukov wrote:
> On Mon, Dec 4, 2017 at 9:14 AM, Dan Carpenter wrote:
> > On Sun, Dec 03, 2017 at 12:16:08PM -0800, Eric Biggers wrote:
> >> Looks like BLKTRACESETUP doesn't limit the '.buf_nr' parameter, allowing
> >> anyone who can open a block device to cause an extremely large kmalloc.
> >> Here's a simplified reproducer:
> >>
> >
> > There are lots of places which allow people to allocate as much as they
> > want.  With syzkaller, you might want to just hard-code a __GFP_NOWARN
> > in to disable it.
> 
> Hi,
> 
> Hard code it where?

My idea was to just make warn_alloc() a no-op.

> 
> User-controllable allocations are supposed to use __GFP_NOWARN.

No, that's not right.  What we don't want is unprivileged users using
all the memory, and we don't want unprivileged users spamming
/var/log/messages.  But you have to have slightly elevated permissions
to open block devices, right?  The warning is helpful.  The advice to
admins is "don't do that" if they don't want the warning.

The kernel really isn't designed to work with Oops on Warn.  I try to
tell people simple things like not printing a warning when
copy_from_user() fails, because I don't want /var/log/messages to get
spammed.  But there are lots and lots of places which generate warnings.

regards,
dan carpenter


Re: [PATCH] generic/473: test return EBUSY from BLKRRPART for mounted whole-dev

2017-12-04 Thread Eryu Guan
On Mon, Dec 04, 2017 at 05:15:17PM +0800, Xiao Yang wrote:
> On 2017/12/04 16:29, Eryu Guan wrote:
> > On Wed, Nov 29, 2017 at 08:02:26PM +0800, Xiao Yang wrote:
> > > If the entire block device is formatted with a filesystem and
> > > mounted, running "blockdev --rereadpt" should fail and return
> > > EBUSY instead of pass.
> > > 
> > > Signed-off-by: Xiao Yang
> > As we have blktests[1] now, I think this may fit in blktests better?
> Hi Eryu,
> 
> Do you think test cases which use the scsi_debug module should be moved
> into blktests?
> (e.g. generic/108, generic/349, generic/350, generic/351)

I don't think they need to be moved to blktests. Most other tests that
make use of scsi_debug are for filesystem testing, e.g. generic/108.
generic/349 and generic/35[01] are a bit special; they were there before
blktests was announced available, so they're in a special blockdev group
and not in the auto group. If Omar agrees, I think they can be ported to
blktests.

Thanks,
Eryu


Re: [PATCH] generic/473: test return EBUSY from BLKRRPART for mounted whole-dev

2017-12-04 Thread Xiao Yang

On 2017/12/04 16:29, Eryu Guan wrote:
> On Wed, Nov 29, 2017 at 08:02:26PM +0800, Xiao Yang wrote:
> > If the entire block device is formatted with a filesystem and
> > mounted, running "blockdev --rereadpt" should fail and return
> > EBUSY instead of passing.
> >
> > Signed-off-by: Xiao Yang
>
> As we have blktests[1] now, I think this may fit in blktests better?

Hi Eryu,

Do you think test cases which use the scsi_debug module should be moved
into blktests?
(e.g. generic/108, generic/349, generic/350, generic/351)

Thanks,
Xiao Yang

> Thanks,
> Eryu
>
> [1] https://github.com/osandov/blktests


Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()

2017-12-04 Thread Ming Lei
On Mon, Dec 04, 2017 at 09:44:55AM +0100, Johannes Thumshirn wrote:
> Ming Lei  writes:
> 
> 
> > I am happy to do that, but I am very busy at the moment, so it may
> > take me a while.
> >
> > But anyone should be able to reproduce the issue 100% of the time on a
> > v4.15-rc kernel just by running the above script; no specific hardware
> > is required at all, so anyone who is interested can write a blktests
> > patch to test block/SCSI timeouts.
> 
> OK, let me see if I can spend some time on this in the next few days.

That is great, thanks!

-- 
Ming


Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()

2017-12-04 Thread Johannes Thumshirn
Ming Lei  writes:


> I am happy to do that, but I am very busy at the moment, so it may
> take me a while.
>
> But anyone should be able to reproduce the issue 100% of the time on a
> v4.15-rc kernel just by running the above script; no specific hardware
> is required at all, so anyone who is interested can write a blktests
> patch to test block/SCSI timeouts.

OK, let me see if I can spend some time on this in the next few days.

Byte,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()

2017-12-04 Thread Ming Lei
On Mon, Dec 04, 2017 at 09:19:33AM +0100, Johannes Thumshirn wrote:
> 
> Hi Ming,
> 
> Ming Lei  writes:
> > This issue can be triggered by the following script:
> >
> > #!/bin/sh
> > rmmod scsi_debug
> > modprobe scsi_debug max_queue=1
> >
> > DEVICE=`ls -d 
> > /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/* | head 
> > -1 | xargs basename`
> >
> > DISK_DIR=`ls -d /sys/block/$DEVICE/device/scsi_disk/*`
> >
> > echo "using scsi device $DEVICE"
> > echo "-1" >/sys/bus/pseudo/drivers/scsi_debug/every_nth
> > echo "temporary write through" >$DISK_DIR/cache_type
> > echo "128" >/sys/bus/pseudo/drivers/scsi_debug/opts
> > echo none > /sys/block/$DEVICE/queue/scheduler
> > dd if=/dev/$DEVICE of=/dev/null bs=1M iflag=direct count=1 &
> > sleep 5
> > echo "0" >/sys/bus/pseudo/drivers/scsi_debug/opts
> > wait
> > echo "SUCCESS"
> 
> Can you please submit a test-case for blktest as well, given you have a
> nice reproducer?

Hi Johannes,

I am happy to do that, but I am very busy at the moment, so it may
take me a while.

But anyone should be able to reproduce the issue 100% of the time on a
v4.15-rc kernel just by running the above script; no specific hardware
is required at all, so anyone who is interested can write a blktests
patch to test block/SCSI timeouts.

Thanks,
Ming
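
For whoever picks this up, a rough sketch of the shape a blktests case
could take.  This is an untested outline: the _have_scsi_debug,
_init_scsi_debug, and _exit_scsi_debug helpers, the SCSI_DEBUG_DEVICES
array, and the DESCRIPTION/requires()/test() layout are assumptions about
blktests' conventions, not something verified against the tree:

#!/bin/bash
# Untested sketch of a blktests case for the scsi_debug timeout hang.
# Helper names and layout are assumptions about blktests' conventions.

DESCRIPTION="I/O hang after SCSI timeout with blk-mq budget handling"
QUICK=1

requires() {
	_have_scsi_debug
}

test() {
	echo "Running ${TEST_NAME}"

	if ! _init_scsi_debug max_queue=1; then
		return 1
	fi
	local dev="${SCSI_DEBUG_DEVICES[0]}"
	local disk_dir
	disk_dir=$(echo /sys/block/"${dev}"/device/scsi_disk/*)

	echo -1 > /sys/bus/pseudo/drivers/scsi_debug/every_nth
	echo "temporary write through" > "${disk_dir}/cache_type"
	echo 128 > /sys/bus/pseudo/drivers/scsi_debug/opts
	echo none > "/sys/block/${dev}/queue/scheduler"

	# On a buggy kernel this dd never completes, even after the
	# injected timeout condition is cleared below.
	dd if="/dev/${dev}" of=/dev/null bs=1M iflag=direct count=1 &
	sleep 5
	echo 0 > /sys/bus/pseudo/drivers/scsi_debug/opts
	wait

	_exit_scsi_debug
	echo "Test complete"
}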


Re: [PATCH] generic/473: test return EBUSY from BLKRRPART for mounted whole-dev

2017-12-04 Thread Eryu Guan
On Wed, Nov 29, 2017 at 08:02:26PM +0800, Xiao Yang wrote:
> If the entire block device is formatted with a filesystem and
> mounted, running "blockdev --rereadpt" should fail and return
> EBUSY instead of passing.
> 
> Signed-off-by: Xiao Yang 

As we have blktests[1] now, I think this may fit in blktests better?

Thanks,
Eryu

[1] https://github.com/osandov/blktests

> ---
>  tests/generic/473 | 83 
> +++
>  tests/generic/473.out |  2 ++
>  tests/generic/group   |  1 +
>  3 files changed, 86 insertions(+)
>  create mode 100755 tests/generic/473
>  create mode 100644 tests/generic/473.out
> 
> diff --git a/tests/generic/473 b/tests/generic/473
> new file mode 100755
> index 000..d7998cd
> --- /dev/null
> +++ b/tests/generic/473
> @@ -0,0 +1,83 @@
> +#! /bin/bash
> +# FS QA Test No. 473
> +#
> +# Regression test for commit:
> +# 77032ca ("Return EBUSY from BLKRRPART for mounted whole-dev fs")
> +#
> +# If the entire block device is formatted with a filesystem and
> +# mounted, running "blockdev --rereadpt" should fail and return
> +# EBUSY.  On a buggy kernel, it passes unexpectedly.
> +#
> +#---
> +# Copyright (c) 2017 FUJITSU LIMITED. All rights reserved.
> +# Author: Xiao Yang 
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmpdir=`mktemp -d`
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 7 15
> +
> +_cleanup()
> +{
> + # Umount
> + $UMOUNT_PROG $tmpdir >>$seqres.full 2>&1
> + # Destroy device
> +_put_scsi_debug_dev
> + rm -rf $tmpdir
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/scsi_debug
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +_supported_os Linux
> +_require_scsi_debug
> +
> +# Create and format
> +test_dev=`_get_scsi_debug_dev`
> +_mkfs_dev $test_dev >>$seqres.full 2>&1
> +
> +# Mount and check mounted whole-dev
> +_mount $test_dev $tmpdir
> +
> +out=$(blockdev --rereadpt $test_dev 2>&1)
> +res=$?
> +
> +echo $out | grep -q "Unknown command" && \
> + _notrun "blockdev --rereadpt was not supported"
> +
> +[ $res -eq 0 ] && _fail "blockdev --rereadpt passed when checking mounted whole-dev"
> +
> +echo $out | grep -q "Device or resource busy" || \
> + _fail "blockdev --rereadpt returned unexpected error when checking mounted whole-dev"
> +
> +echo $out >> $seqres.full
> +
> +echo "Silence is golden"
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/473.out b/tests/generic/473.out
> new file mode 100644
> index 000..854fbcd
> --- /dev/null
> +++ b/tests/generic/473.out
> @@ -0,0 +1,2 @@
> +QA output created by 473
> +Silence is golden
> diff --git a/tests/generic/group b/tests/generic/group
> index 6c3bb03..54b9404 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -472,3 +472,4 @@
>  467 auto quick exportfs
>  468 shutdown auto quick metadata
>  469 auto quick
> +473 auto quick blockdev
> -- 
> 1.8.3.1
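
Assuming the test stays in xfstests as generic/473 for now, running it is
the usual check invocation (this assumes an already configured xfstests
checkout; the test creates its own scsi_debug device):

# From the root of a configured xfstests checkout (local.config set up);
# generic/473 is the test number proposed in this patch.
./check generic/473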


Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()

2017-12-04 Thread Johannes Thumshirn

Hi Ming,

Ming Lei  writes:
> This issue can be triggered by the following script:
>
>   #!/bin/sh
>   rmmod scsi_debug
>   modprobe scsi_debug max_queue=1
>
>   DEVICE=`ls -d 
> /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/* | head -1 
> | xargs basename`
>
>   DISK_DIR=`ls -d /sys/block/$DEVICE/device/scsi_disk/*`
>
>   echo "using scsi device $DEVICE"
>   echo "-1" >/sys/bus/pseudo/drivers/scsi_debug/every_nth
>   echo "temporary write through" >$DISK_DIR/cache_type
>   echo "128" >/sys/bus/pseudo/drivers/scsi_debug/opts
>   echo none > /sys/block/$DEVICE/queue/scheduler
>   dd if=/dev/$DEVICE of=/dev/null bs=1M iflag=direct count=1 &
>   sleep 5
>   echo "0" >/sys/bus/pseudo/drivers/scsi_debug/opts
>   wait
>   echo "SUCCESS"

Can you please submit a test-case for blktest as well, given you have a
nice reproducer?

Thanks,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: WARNING in kmalloc_slab (3)

2017-12-04 Thread Dmitry Vyukov
On Mon, Dec 4, 2017 at 9:14 AM, Dan Carpenter  wrote:
> On Sun, Dec 03, 2017 at 12:16:08PM -0800, Eric Biggers wrote:
>> Looks like BLKTRACESETUP doesn't limit the '.buf_nr' parameter, allowing
>> anyone who can open a block device to cause an extremely large kmalloc.
>> Here's a simplified reproducer:
>>
>
> There are lots of places which allow people to allocate as much as they
> want.  With syzkaller, you might want to just hard-code a __GFP_NOWARN
> in to disable it.

Hi,

Hard code it where?

User-controllable allocations are supposed to use __GFP_NOWARN.
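
For concreteness, the pattern being argued over looks roughly like the
fragment below.  This is illustrative kernel-style code, not the actual
blktrace implementation and not the eventual fix; BUF_NR_MAX is an
invented bound showing one way to reject absurd user-controlled sizes
alongside __GFP_NOWARN:

/*
 * Illustrative kernel-style fragment; not the actual blktrace code, not
 * a buildable unit, and not the merged fix.  BUF_NR_MAX is invented.
 */
#define BUF_NR_MAX	1024	/* invented bound, for illustration */

static int setup_trace_buffers(struct blk_user_trace_setup *buts)
{
	void **bufs;

	/* Reject absurd user-controlled sizes up front. */
	if (buts->buf_nr == 0 || buts->buf_nr > BUF_NR_MAX)
		return -EINVAL;

	/* __GFP_NOWARN: a failed user-sized allocation is not a bug. */
	bufs = kmalloc_array(buts->buf_nr, sizeof(*bufs),
			     GFP_KERNEL | __GFP_NOWARN);
	if (!bufs)
		return -ENOMEM;

	/* ... hand the buffers to relay here ... */
	kfree(bufs);
	return 0;
}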