Re: [PATCH BUGFIX V3] block, bfq: add requeue-request hook

2018-02-23 Thread Paolo Valente


> On 23 Feb 2018, at 17:17, Ming Lei wrote:
> 
> Hi Paolo,
> 
> On Fri, Feb 23, 2018 at 04:41:36PM +0100, Paolo Valente wrote:
>> 
>> 
>>> On 23 Feb 2018, at 16:07, Ming Lei wrote:
>>> 
>>> Hi Paolo,
>>> 
>>> On Wed, Feb 07, 2018 at 10:19:20PM +0100, Paolo Valente wrote:
 Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
 RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
 be re-inserted into the active I/O scheduler for that device. As a
>>> 
>>> No, this behaviour isn't related with commit a6a252e64914, and
>>> it has been there since blk_mq_requeue_request() is introduced.
>>> 
>> 
>> Hi Ming,
>> actually, we wrote the above statement after simply following the call
>> chain that led to the failure.  In this respect, the change in commit
>> a6a252e64914:
>> 
>> static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
>> +  bool has_sched,
>>   struct request *rq)
>> {
>> -   if (rq->tag == -1) {
>> +   /* dispatch flush rq directly */
>> +   if (rq->rq_flags & RQF_FLUSH_SEQ) {
>> +   spin_lock(&hctx->lock);
>> +   list_add(&rq->queuelist, &hctx->dispatch);
>> +   spin_unlock(&hctx->lock);
>> +   return true;
>> +   }
>> +
>> +   if (has_sched) {
>>rq->rq_flags |= RQF_SORTED;
>> -   return false;
>> +   WARN_ON(rq->tag != -1);
>>}
>> 
>> -   /*
>> -* If we already have a real request tag, send directly to
>> -* the dispatch list.
>> -*/
>> -   spin_lock(&hctx->lock);
>> -   list_add(&rq->queuelist, &hctx->dispatch);
>> -   spin_unlock(&hctx->lock);
>> -   return true;
>> +   return false;
>> }
>> 
>> makes blk_mq_sched_bypass_insert return false for all non-flush
>> requests.  From that, the anomaly described in our commit follows, for
>> bfq, or any stateful scheduler that waits for the completion of requests
>> that passed through it.  I'm elaborating again a little bit on this in
>> my replies to your next points below.
> 
> Before a6a252e64914, follows blk_mq_sched_bypass_insert()
> 
>   if (rq->tag == -1) {
>   rq->rq_flags |= RQF_SORTED;
>   return false;
>  }
> 
>   /*
>* If we already have a real request tag, send directly to
>* the dispatch list.
>*/
>   spin_lock(&hctx->lock);
>   list_add(&rq->queuelist, &hctx->dispatch);
>   spin_unlock(&hctx->lock);
>   return true;
> 
> This function still returns false for all non-flush request, so nothing
> changes wrt. this kind of handling.
> 

Yep Ming.  I don't have the expertise to tell you why, but the failure
in the USB case was caused by an rq that is not flush, but for which
rq->tag != -1.  So, the previous version of blk_mq_sched_bypass_insert
returned true, and there was no failure, while after commit
a6a252e64914 the function returns false and the failure occurs if bfq
does not exploit the requeue hook.

You have actually shown it yourself, several months ago, through the
simple code instrumentation you made and used to show that bfq was
stuck.  And I guess it can still be reproduced very easily, unless
something else has changed in blk-mq.

BTW, if you can shed a light on this fact, that would be a great
occasion to learn for me.

>> 
>> I don't mean that this change is an error; it simply puts a stateful
>> scheduler into an inconsistent state, unless the scheduler properly
>> handles the requeue that precedes the re-insertion into the
>> scheduler.
>> 
>> If this clarifies the situation, but there is still some misleading
>> statement in the commit, just let me better understand, and I'll be
>> glad to rectify it, if possible somehow.
>> 
>> 
>>> And you can see blk_mq_requeue_request() is called by lots of drivers,
>>> especially it is often used in error handler, see SCSI's example.
>>> 
 consequence, I/O schedulers may get the same request inserted again,
 even several times, without a finish_request invoked on that request
 before each re-insertion.
 
>> 
>> ...
>> 
 @@ -5426,7 +5482,8 @@ static struct elevator_type iosched_bfq_mq = {
.ops.mq = {
.limit_depth= bfq_limit_depth,
.prepare_request= bfq_prepare_request,
 -  .finish_request = bfq_finish_request,
 +  .requeue_request= bfq_finish_requeue_request,
 +  .finish_request = bfq_finish_requeue_request,
.exit_icq   = bfq_exit_icq,
.insert_requests= bfq_insert_requests,
.dispatch_request   = bfq_dispatch_request,
>>> 
>>> This way may not be correct since blk_mq_sched_requeue_request() can be
>>> called for one request which won't enter io scheduler.
>>> 
>> 
>> Exactly, there are two cases: requeues that lead to subsequent
>> re-insertions, and requeues that don't.

[PATCH 1/1] extend BLKRRPART to update the readable size of media

2018-02-23 Thread Steve Kenton
The readable size of new, factory blank, optical media can change via user
space ioctl(SG_IO) commands to format overwritable media such as DVD+RW for
a UDF filesystem or write to sequential media such as DVD-R for an ISO-9660
filesystem. However there appears to be no easy way to update the size used
by the block layer from the initial value of zero. Traditionally, ejecting and
reinserting the media is the workaround to this long standing issue which 
programs such as growisofs use by automatically opening and closing the tray
at the end of a session so the newly created filesystem becomes mountable
and readable. This can be problematic when using a slot load drive.

An even more heavy handed approach is to delete and then rescan the drive
using sysfs. echo offline > state  echo 1 > delete  echo - - - > scan
without ever ejecting the media. It works but is slow and rather ugly.

Reading a factory blank DVD+RW needing to be "de-iced" gives this result
both before *AND* after formatting with $ xorriso -dev /dev/sr0 -format full

$ dd if=/dev/sr0 of=/dev/null bs=2k
dd: error reading ‘/dev/sr0’: Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.0275589 s, 0.0 kB/s

The crux of the problem is that the readable size after formatting is still
zero and does not change to the correct value without ejecting and reinserting
the newly formatted media to force a call to fops->revalidate_disk().

hdi@hdi-H110N:/sys/block/sr0$ cat size
0
hdi@hdi-H110N:/sys/block/sr0$ cat size
9180416

Making the BLKRRPART ioctl call fops->revalidate_disk() => sr_revalidate_disk()
in sr.c resolves the issue for optical media and should be a benign and
potentially useful operation if ever called for non-optical media =>
sd_revalidate_disk() in sd.c for any other devices which change size.
Programs would have to "opt in" to use ioctl(BLKRRPART) on optical media.

The downside is that BLKRRPART requires CAP_SYS_ADMIN for userspace programs.

Signed-off-by: Steve Kenton 
Signed-off-by: Thomas Schmitt 
---
I discussed this with Thomas Schmitt the maintainer of the xorriso burner
program to define the problem and suggest a fix.

V2: Add NULL pointer check
Checkpatch wrapping and update commit message
Send to linux-block@vger.kernel.org instead of linux-fsde...@vger.kernel.org

V3: Fix stupid copy paste omission of return value and build test again
Burn another DVD-R and verify size change after blkrrpart /dev/sr0
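
For reference, a minimal userspace helper along these lines might look like the
sketch below. This is only an illustration of what a "blkrrpart"-style tool
could do (open the whole-disk node and issue the ioctl); it is not the actual
program used for the test above.

/*
 * Minimal sketch: ask the kernel to re-read the device by issuing
 * BLKRRPART on the whole-disk node. Illustrative only; assumes
 * CAP_SYS_ADMIN (e.g. run as root).
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* BLKRRPART */

int main(int argc, char **argv)
{
	int fd, ret;

	if (argc != 2) {
		fprintf(stderr, "usage: %s /dev/srN\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY | O_NONBLOCK);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	ret = ioctl(fd, BLKRRPART);
	if (ret < 0)
		perror("ioctl(BLKRRPART)");

	close(fd);
	return ret < 0 ? 1 : 0;
}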

 block/ioctl.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

 block/ioctl.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 1668506d8ed8..8486389855d5 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -163,14 +163,20 @@ int __blkdev_reread_part(struct block_device *bdev)
 {
struct gendisk *disk = bdev->bd_disk;
 
-   if (!disk_part_scan_enabled(disk) || bdev != bdev->bd_contains)
+   if (bdev != bdev->bd_contains) /* must be whole disk */
return -EINVAL;
+
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
 
lockdep_assert_held(&bdev->bd_mutex);
 
-   return rescan_partitions(disk, bdev);
+   if (disk_part_scan_enabled(disk))
+   return rescan_partitions(disk, bdev); /* update partitions */
+   else if (disk->fops->revalidate_disk) /* could be sr or sd */
+   return disk->fops->revalidate_disk(disk); /* update size */
+   else
+   return -EINVAL;
 }
 EXPORT_SYMBOL(__blkdev_reread_part);
 
-- 
2.16.1.72.g5be1f00



Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir

2018-02-23 Thread Joseph Qi
Hi Tejun,

On 18/2/23 22:23, Tejun Heo wrote:
> Hello,
> 
> On Fri, Feb 23, 2018 at 09:56:54AM +0800, xuejiufei wrote:
>>> On Thu, Feb 22, 2018 at 02:14:34PM +0800, Joseph Qi wrote:
 I still don't get how css_tryget can work here.

 The race happens when:
 1) writeback kworker has found the blkg with rcu;
 2) blkcg is during offlining and blkg_destroy() has already been called.
 Then, writeback kworker will take queue lock and access the blkg with
 refcount 0.
>>>
>>> Yeah, then tryget would fail and it should go through the root.
>>>
>> In this race, the refcount of blkg becomes zero and is destroyed.
>> However css may still have refcount, and css_tryget can return success
>> before other callers put the refcount.
>> So I don't get how css_tryget can fix this race? Or I wonder if we can
>> add another function blkg_tryget?
> 
> IIRC, as long as the blkcg and the device are there, the blkgs aren't
> gonna be destroyed.  So, if you have a ref to the blkcg through
> tryget, the blkg shouldn't go away.
> 

Maybe we have a misunderstanding here.

In this case, the blkg doesn't go away as we have rcu protection, but
blkg_destroy() can be called, in which blkg_put() will put the last
refcnt and then schedule __blkg_release_rcu().

The css refcnt can't prevent the blkcg css from being offlined; what
prevents that is the css online_cnt.

css_tryget() only takes a refcnt on the blkcg css, and is not
guaranteed to fail once the css has been confirmed to be killed.

The race sequence:
writeback kworker   cgroup_rmdir
  cgroup_destroy_locked
kill_css
  css_killed_ref_fn
css_killed_work_fn
  offline_css
blkcg_css_offline
  blkcg_bio_issue_check
rcu_read_lock
blkg_lookup
  spin_trylock(q->queue_lock)
  blkg_destroy
  spin_unlock(q->queue_lock)
blk_throtl_bio
  spin_lock_irq(q->queue_lock)
  spin_unlock_irq(q->queue_lock)
rcu_read_unlock
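
To make the blkg_tryget() idea quoted above concrete: the point of such a
helper is that the get must fail once the refcount has already dropped to
zero, so a lookup racing with the final put cannot resurrect the blkg. A
small userspace model of that inc-if-nonzero semantics, using C11 atomics
(a sketch only, not kernel code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct obj {
	atomic_int refcnt;		/* starts at 1 on creation */
};

/* succeed only while the count is still non-zero */
static bool obj_tryget(struct obj *o)
{
	int old = atomic_load(&o->refcnt);

	while (old > 0) {
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

static void obj_put(struct obj *o)
{
	if (atomic_fetch_sub(&o->refcnt, 1) == 1)
		printf("last ref dropped: free the object\n");
}

int main(void)
{
	struct obj o = { .refcnt = 1 };

	printf("tryget while live: %d\n", obj_tryget(&o));	/* 1 */
	obj_put(&o);
	obj_put(&o);						/* count hits zero */
	printf("tryget after death: %d\n", obj_tryget(&o));	/* 0 */
	return 0;
}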

Thanks,
Joseph


Re: [PATCH 1/2] blk-mq: don't call io sched's .requeue_request when requeueing rq to ->dispatch

2018-02-23 Thread Bart Van Assche
On Fri, 2018-02-23 at 23:36 +0800, Ming Lei wrote:
> __blk_mq_requeue_request() covers two cases:
> 
> - one is that the requeued request is added to hctx->dispatch, such as
> blk_mq_dispatch_rq_list()
> 
> - another case is that the request is requeued to io scheduler, such as
> blk_mq_requeue_request().
> 
> We should call io sched's .requeue_request callback only for the 2nd
> case.

I think it would have been clearer if it had been mentioned that
blk_mq_sched_requeue_request() must only be called for requests that have
been started. Anyway, if you add the following to this patch:

Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Cc: sta...@vger.kernel.org

then you can also add:

Reviewed-by: Bart Van Assche 





Re: [PATCH V2] block: pass inclusive 'lend' parameter to truncate_inode_pages_range

2018-02-23 Thread Jens Axboe
On 2/9/18 5:46 PM, Ming Lei wrote:
> The 'lend' parameter of truncate_inode_pages_range is required to be
> inclusive, so follow the rule.
> 
> This patch fixes one memory corruption triggered by discard.

Applied, thanks.

-- 
Jens Axboe



Re: [PATCH V2] block: pass inclusive 'lend' parameter to truncate_inode_pages_range

2018-02-23 Thread Bart Van Assche
On Sat, 2018-02-10 at 08:46 +0800, Ming Lei wrote:
> The 'lend' parameter of truncate_inode_pages_range is required to be
> inclusive, so follow the rule.
> 
> This patch fixes one memory corruption triggered by discard.

Reviewed-by: Bart Van Assche 





Re: [PATCH] lightnvm: fix bad block initialization

2018-02-23 Thread Matias Bjørling

On 02/23/2018 06:40 PM, Heiner Litz wrote:

fix reading bad block device information to correctly setup the per line
blk_bitmap during lightnvm initialization

Signed-off-by: Heiner Litz 
---
  drivers/lightnvm/pblk-init.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 86a94a7faa96..e0eba150ac7e 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -461,10 +461,11 @@ static int pblk_bb_line(struct pblk *pblk, struct 
pblk_line *line,
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
int i, bb_cnt = 0;
+   int blk_per_lun = geo->nr_chks * geo->plane_mode;
  
  	for (i = 0; i < blk_per_line; i++) {

struct pblk_lun *rlun = &pblk->luns[i];
-   u8 *lun_bb_log = bb_log + i * blk_per_line;
+   u8 *lun_bb_log = bb_log + i * blk_per_lun;
  
  		if (lun_bb_log[line->id] == NVM_BLK_T_FREE)

continue;



Thanks Heiner. Applied.


Hangs in balance_dirty_pages with arm-32 LPAE + highmem

2018-02-23 Thread Laura Abbott

Hi,

The Fedora arm-32 build VMs have a somewhat long standing problem
of hanging when running mkfs.ext4 with a bunch of processes stuck
in D state. This has been seen as far back as 4.13 but is still
present on 4.14:

sysrq: SysRq : Show Blocked State   
 [255/1885]
  taskPC stack   pid father
auditd  D0   377  1 0x0020
[] (__schedule) from [] (schedule+0x98/0xbc)
[] (schedule) from [] (schedule_timeout+0x328/0x3ac)
[] (schedule_timeout) from [] 
(io_schedule_timeout+0x24/0x38)
[] (io_schedule_timeout) from [] 
(balance_dirty_pages.constprop.6+0xac8
/0xc5c)
[] (balance_dirty_pages.constprop.6) from [] 
(balance_dirty_pages_ratel
imited+0x2b8/0x43c)
[] (balance_dirty_pages_ratelimited) from [] 
(generic_perform_write+0x1
74/0x1a4)
[] (generic_perform_write) from [] 
(__generic_file_write_iter+0x16c/0x1
98)
[] (__generic_file_write_iter) from [] 
(ext4_file_write_iter+0x314/0x41
4)
[] (ext4_file_write_iter) from [] (__vfs_write+0x100/0x128)
[] (__vfs_write) from [] (vfs_write+0xc0/0x194)
[] (vfs_write) from [] (SyS_write+0x44/0x7c)
[] (SyS_write) from [] (__sys_trace_return+0x0/0x10)
rs:main Q:Reg   D0   441  1 0x
[] (__schedule) from [] (schedule+0x98/0xbc)
[] (schedule) from [] (schedule_timeout+0x328/0x3ac)
[] (schedule_timeout) from [] 
(io_schedule_timeout+0x24/0x38)
[] (io_schedule_timeout) from [] 
(balance_dirty_pages.constprop.6+0xac8
/0xc5c)
[] (balance_dirty_pages.constprop.6) from [] 
(balance_dirty_pages_ratel
imited+0x2b8/0x43c)
[] (balance_dirty_pages_ratelimited) from [] 
(generic_perform_write+0x1
74/0x1a4)
[] (generic_perform_write) from [] 
(__generic_file_write_iter+0x16c/0x1
98)
[] (__generic_file_write_iter) from [] 
(ext4_file_write_iter+0x314/0x41
4)
[] (ext4_file_write_iter) from [] (__vfs_write+0x100/0x128)
[] (__vfs_write) from [] (vfs_write+0xc0/0x194)
[] (vfs_write) from [] (SyS_write+0x44/0x7c)
[] (SyS_write) from [] (ret_fast_syscall+0x0/0x4c)
ntpdD0  1453  1 0x0001
[] (__schedule) from [] (schedule+0x98/0xbc)
[] (schedule) from [] (schedule_timeout+0x328/0x3ac)
[] (schedule_timeout) from [] 
(io_schedule_timeout+0x24/0x38)
[] (io_schedule_timeout) from [] 
(balance_dirty_pages.constprop.6+0xac8
/0xc5c)
[] (balance_dirty_pages.constprop.6) from [] 
(balance_dirty_pages_ratel
imited+0x2b8/0x43c)
[] (balance_dirty_pages_ratelimited) from [] 
(generic_perform_write+0x1
74/0x1a4)
[] (generic_perform_write) from [] 
(__generic_file_write_iter+0x16c/0x1
98)
[] (__generic_file_write_iter) from [] 
(ext4_file_write_iter+0x314/0x41
4)
[] (ext4_file_write_iter) from [] (__vfs_write+0x100/0x128)
[] (__vfs_write) from [] (vfs_write+0xc0/0x194)
[] (ext4_file_write_iter) from [] (__vfs_write+0x100/0x128) 
 [203/1885]
[] (__vfs_write) from [] (vfs_write+0xc0/0x194)
[] (vfs_write) from [] (SyS_write+0x44/0x7c)
[] (SyS_write) from [] (ret_fast_syscall+0x0/0x4c)
kojid   D0  4616  1 0x
[] (__schedule) from [] (schedule+0x98/0xbc)
[] (schedule) from [] (schedule_timeout+0x328/0x3ac)
[] (schedule_timeout) from [] 
(io_schedule_timeout+0x24/0x38)
[] (io_schedule_timeout) from [] 
(balance_dirty_pages.constprop.6+0xac8
/0xc5c)
[] (balance_dirty_pages.constprop.6) from [] 
(balance_dirty_pages_ratel
imited+0x2b8/0x43c)
[] (balance_dirty_pages_ratelimited) from [] 
(generic_perform_write+0x1
74/0x1a4)
[] (generic_perform_write) from [] 
(__generic_file_write_iter+0x16c/0x1
98)
[] (__generic_file_write_iter) from [] 
(ext4_file_write_iter+0x314/0x41
4)
[] (ext4_file_write_iter) from [] (__vfs_write+0x100/0x128)
[] (__vfs_write) from [] (vfs_write+0xc0/0x194)
[] (vfs_write) from [] (SyS_write+0x44/0x7c)
[] (SyS_write) from [] (ret_fast_syscall+0x0/0x4c)
kworker/u8:0D0 28525  2 0x
Workqueue: writeback wb_workfn (flush-7:0)
[] (__schedule) from [] (schedule+0x98/0xbc)
[] (schedule) from [] (io_schedule+0x1c/0x2c)
[] (io_schedule) from [] (wbt_wait+0x21c/0x300)
[] (wbt_wait) from [] (blk_mq_make_request+0xac/0x560)
[] (blk_mq_make_request) from [] 
(generic_make_request+0xd0/0x214)
[] (generic_make_request) from [] (submit_bio+0x114/0x16c)
[] (submit_bio) from [] (submit_bh_wbc+0x190/0x1a0)
[] (submit_bh_wbc) from [] 
(__block_write_full_page+0x2e8/0x43c)
[] (__block_write_full_page) from [] 
(block_write_full_page+0x80/0xec)
[] (block_write_full_page) from [] (__writepage+0x1c/0x4c)
[] (__writepage) from [] (write_cache_pages+0x350/0x3f0)
[] (write_cache_pages) from [] 
(generic_writepages+0x44/0x60)
[] (generic_writepages) from [] (do_writepages+0x3c/0x74)
[] (do_writepages) from [] 
(__writeback_single_inode+0xb4/0x404)
[] (__writeback_single_inode) from [] 
(writeback_sb_inodes+0x258/0x438)
[] (writeback_sb_inodes) from [] 
(__writeback_inodes_wb+0x6c/0xa8)
[] (__writeback_inodes_wb) from [] 
(wb_writeback+0x1c4/0x30c)
[] (wb_writeback) from [] (wb_workfn+0x130/0x450)
[] (wb_workfn) from [] (process_one_work+0x254/0x42c)
[] 

[PATCH 1/1] extend BLKRRPART to update the readable size of media

2018-02-23 Thread Steve Kenton
The readable size of new, factory blank, optical media can change via user
space ioctl(SG_IO) commands to format overwritable media such as DVD+RW for
a UDF filesystem or write to sequential media such as DVD-R for an ISO-9660
filesystem. However there appears to be no easy way to update the size used
by the block layer from the initial value of zero. Traditionally, ejecting and
reinserting the media is the workaround to this long standing issue which 
programs such as growisofs use by automatically opening and closing the tray
at the end of a session so the newly created filesystem becomes mountable
and readable. This can be problematic when using a slot load drive.

An even more heavy handed approach is to delete and then rescan the drive
using sysfs. echo offline > state  echo 1 > delete  echo - - - > scan
without ever ejecting the media. It works but is slow and rather ugly.

Reading a factory blank DVD+RW needing to be "de-iced" gives this result
both before *AND* after formatting with $ xorriso -dev /dev/sr0 -format full

$ dd if=/dev/sr0 of=/dev/null bs=2k
dd: error reading ‘/dev/sr0’: Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.0275589 s, 0.0 kB/s

The crux of the problem is that the readable size after formatting is still
zero and does not change to the correct value without ejecting and reinserting
the newly formatted media to force a call to fops->revalidate_disk().

hdi@hdi-H110N:/sys/block/sr0$ cat size
0
hdi@hdi-H110N:/sys/block/sr0$ cat size
9180416

Making the BLKRRPART ioctl call fops->revalidate_disk() => sr_revalidate_disk()
in sr.c resolves the issue for optical media and should be a benign and
potentially useful operation if ever called for non-optical media =>
sd_revalidate_disk() in sd.c for any other devices which change size.
Programs would have to "opt in" to use ioctl(BLKRRPART) on optical media.

The downside is that BLKRRPART requires CAP_SYS_ADMIN for userspace programs.

Signed-off-by: Steve Kenton 
Signed-off-by: Thomas Schmitt 
---
I discussed this with Thomas Schmitt the maintainer of the xorriso burner
program to define the problem and suggest a fix.

V2: Add NULL pointer check
Checkpatch wrapping and update commit message
Send to linux-block@vger.kernel.org instead of linux-fsde...@vger.kernel.org
 block/ioctl.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 1668506d8ed8..1d48dcad668a 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -163,14 +163,18 @@ int __blkdev_reread_part(struct block_device *bdev)
 {
struct gendisk *disk = bdev->bd_disk;
 
-   if (!disk_part_scan_enabled(disk) || bdev != bdev->bd_contains)
+   if (bdev != bdev->bd_contains) /* must be whole disk */
return -EINVAL;
+
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
 
lockdep_assert_held(&bdev->bd_mutex);
 
-   return rescan_partitions(disk, bdev);
+   if (disk_part_scan_enabled(disk))
+   return rescan_partitions(disk, bdev); /* update partitions */
+   else if (disk->fops->revalidate_disk) /* could be sr or sd */
+   return disk->fops->revalidate_disk(disk); /* update size */
 }
 EXPORT_SYMBOL(__blkdev_reread_part);
 
-- 
2.16.1.72.g5be1f00



Re: v4.16-rc2: I/O hang with dm-rq + Kyber

2018-02-23 Thread Bart Van Assche
On Sat, 2018-02-24 at 00:26 +0800, Ming Lei wrote:
> The following 2 patches fix one IO hang on kyber in my test on USB; could
> you test them and see if your case can be fixed?
> 
>   https://marc.info/?l=linux-block&m=151940022831994&w=2

These two patches are sufficient to make my test pass.

Thanks!

Bart.




Re: [PATCH 2/2] block: kyber: fix domain token leak during requeue

2018-02-23 Thread Bart Van Assche
On Fri, 2018-02-23 at 23:36 +0800, Ming Lei wrote:
> When requeuing a request, the domain token should have been freed
> before re-inserting the request into the io scheduler. Otherwise, the
> assigned domain token will be leaked, and an IO hang can be caused.

Please consider to add a "Cc: stable" tag to this patch.

Anyway:

Reviewed-by: Bart Van Assche 




Re: v4.16-rc2: I/O hang with dm-rq + Kyber

2018-02-23 Thread Ming Lei
On Thu, Feb 22, 2018 at 09:10:23PM +, Bart Van Assche wrote:
> Hello Omar,
> 
> I/O hangs if I run the following command on top of kernel v4.16-rc2 + the
> ib_srpt patch that adds RDMA/CM support:
> 
> srp-test/run_tests -c -d -r 10 -t 02-mq -e kyber
> 
> This does not happen with the deadline scheduler nor without a scheduler.
> This test passed a few months ago. I have attached the output of the following
> command to this e-mail:
> 
> (cd /sys/kernel/debug/block && grep -r .) | grep -v /poll_stat
> 
> Can you have a look?

The following 2 patches fix one IO hang on kyber in my test on USB; could
you test them and see if your case can be fixed?

https://marc.info/?l=linux-block&m=151940022831994&w=2

BTW, from your attached log, it looks like it belongs to the domain token
leak issue too, so it should be addressed by the above patches.

Thanks,
Ming


Re: [PATCH BUGFIX V3] block, bfq: add requeue-request hook

2018-02-23 Thread Ming Lei
Hi Paolo,

On Fri, Feb 23, 2018 at 04:41:36PM +0100, Paolo Valente wrote:
> 
> 
> > On 23 Feb 2018, at 16:07, Ming Lei wrote:
> > 
> > Hi Paolo,
> > 
> > On Wed, Feb 07, 2018 at 10:19:20PM +0100, Paolo Valente wrote:
> >> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
> >> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
> >> be re-inserted into the active I/O scheduler for that device. As a
> > 
> > No, this behaviour isn't related with commit a6a252e64914, and
> > it has been there since blk_mq_requeue_request() is introduced.
> > 
> 
> Hi Ming,
> actually, we wrote the above statement after simply following the call
> chain that led to the failure.  In this respect, the change in commit
> a6a252e64914:
> 
>  static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
> +  bool has_sched,
>struct request *rq)
>  {
> -   if (rq->tag == -1) {
> +   /* dispatch flush rq directly */
> +   if (rq->rq_flags & RQF_FLUSH_SEQ) {
> +   spin_lock(&hctx->lock);
> +   list_add(&rq->queuelist, &hctx->dispatch);
> +   spin_unlock(&hctx->lock);
> +   return true;
> +   }
> +
> +   if (has_sched) {
> rq->rq_flags |= RQF_SORTED;
> -   return false;
> +   WARN_ON(rq->tag != -1);
> }
>  
> -   /*
> -* If we already have a real request tag, send directly to
> -* the dispatch list.
> -*/
> -   spin_lock(&hctx->lock);
> -   list_add(&rq->queuelist, &hctx->dispatch);
> -   spin_unlock(&hctx->lock);
> -   return true;
> +   return false;
>  }
> 
> makes blk_mq_sched_bypass_insert return false for all non-flush
> requests.  From that, the anomaly described in our commit follows, for
> bfq, or any stateful scheduler that waits for the completion of requests
> that passed through it.  I'm elaborating again a little bit on this in
> my replies to your next points below.

Before a6a252e64914, follows blk_mq_sched_bypass_insert()

   if (rq->tag == -1) {
   rq->rq_flags |= RQF_SORTED;
   return false;
   }

   /*
* If we already have a real request tag, send directly to
* the dispatch list.
*/
   spin_lock(&hctx->lock);
   list_add(&rq->queuelist, &hctx->dispatch);
   spin_unlock(&hctx->lock);
   return true;

This function still returns false for all non-flush request, so nothing
changes wrt. this kind of handling.

> 
> I don't mean that this change is an error; it simply puts a stateful
> scheduler into an inconsistent state, unless the scheduler properly
> handles the requeue that precedes the re-insertion into the
> scheduler.
> 
> If this clarifies the situation, but there is still some misleading
> statement in the commit, just let me better understand, and I'll be
> glad to rectify it, if possible somehow.
> 
> 
> > And you can see blk_mq_requeue_request() is called by lots of drivers,
> > especially it is often used in error handler, see SCSI's example.
> > 
> >> consequence, I/O schedulers may get the same request inserted again,
> >> even several times, without a finish_request invoked on that request
> >> before each re-insertion.
> >> 
> 
> ...
> 
> >> @@ -5426,7 +5482,8 @@ static struct elevator_type iosched_bfq_mq = {
> >>.ops.mq = {
> >>.limit_depth= bfq_limit_depth,
> >>.prepare_request= bfq_prepare_request,
> >> -  .finish_request = bfq_finish_request,
> >> +  .requeue_request= bfq_finish_requeue_request,
> >> +  .finish_request = bfq_finish_requeue_request,
> >>.exit_icq   = bfq_exit_icq,
> >>.insert_requests= bfq_insert_requests,
> >>.dispatch_request   = bfq_dispatch_request,
> > 
> > This way may not be correct since blk_mq_sched_requeue_request() can be
> > called for one request which won't enter io scheduler.
> > 
> 
> Exactly, there are two cases: requeues that lead to subsequent
> re-insertions, and requeues that don't.  The function
> bfq_finish_requeue_request handles both, and both need to be handled,
> to inform bfq that it no longer has to wait for the completion of rq.
> 
> One special case is when bfq_finish_requeue_request gets invoked even
> if rq has nothing to do with any scheduler.  In that case,
> bfq_finish_requeue_request exits immediately.
> 
> 
> > __blk_mq_requeue_request() is called for two cases:
> > 
> > - one is that the requeued request is added to hctx->dispatch, such
> > as blk_mq_dispatch_rq_list()
> 
> yes
> 
> > - another case is that the request is requeued to io scheduler, such as
> > blk_mq_requeue_request().
> > 
> 
> yes
> 
> > For the 1st case, blk_mq_sched_requeue_request() shouldn't be called
> > since it is nothing to do with scheduler,
> 
> No, if that rq has been inserted and then extracted from the scheduler
> through a dispatch_request, then it has.

Re: [PATCH v4 4/6] block: Add a third argument to blk_alloc_queue_node()

2018-02-23 Thread Bart Van Assche
On Fri, 2018-02-23 at 09:26 +0100, Johannes Thumshirn wrote:
> how about "block: Add 'lock' as third argument to
> blk_alloc_queue_node()"?
> 
> So one actually sees early enough what the third argument will be?

Hello Johannes,

If I have to repost this patch series I will make that change.

Thanks for the review.

Bart.




Re: disk-io lockup in 4.14.13 kernel

2018-02-23 Thread Bart Van Assche
On Fri, 2018-02-23 at 11:58 +0200, Jaco Kroon wrote:
> On 22/02/2018 18:46, Bart Van Assche wrote:
> > (cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
> 
> I don't have a /sys/kernel/debug folder - I've enabled CONFIG_DEBUG_FS
> and BLK_DEBUG_FS, will reboot at the first opportunity.  As a general
> rule - is there additional overhead to having debugfs enabled?  Any
> other risks that I should be aware of?  In essence, are there any
> disadvantages to just enabling DEBUG_FS as a general rule?  I did note
> that a few extra DEBUG options pop up for other modules ... so my gut is
> towards leaving this disabled as a general rule and enabling when needed.

Hello Jaco,

The only disadvantages of enabling debugfs that I know of are:
- The additional memory required by debugfs (probably not that much).
- A security risk if not all users who have an account on the system are fully
  trusted. See e.g. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=681418.

Enabling debugfs doesn't cause any runtime overhead in the hot path of the
block layer if no software accesses the debugfs attributes.

Bart.

v4.16-rc2: virtio-block + ext4 lockdep splats / sleeping from invalid context

2018-02-23 Thread Mark Rutland
Hi all,

While fuzzing arm64/v4.16-rc2 with syzkaller, I simultaneously hit a
number of splats in the block layer:

* inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-R} usage in
  jbd2_trans_will_send_data_barrier

* BUG: sleeping function called from invalid context at mm/mempool.c:320

* WARNING: CPU: 0 PID: 0 at block/blk.h:297 
generic_make_request_checks+0x670/0x750

... I've included the full splats at the end of the mail.

These all happen in the context of the virtio block IRQ handler, so I
wonder if this calls something that doesn't expect to be called from IRQ
context. Is it valid to call blk_mq_complete_request() or
blk_mq_end_request() from an IRQ handler?

Syzkaller came up with a minimized reproducer, but it's a bit wacky (the
fcntl and bpf calls should have no practical effect), and I haven't
managed to come up with a C reproducer.

Any ideas?

Thanks,
Mark.


Syzkaller reproducer:
# {Threaded:true Collide:true Repeat:false Procs:1 Sandbox:setuid Fault:false 
FaultCall:-1 FaultNth:0 EnableTun:true UseTmpDir:true HandleSegv:true 
WaitRepeat:false Debug:false Repro:false}
mmap(&(0x7f00/0x24000)=nil, 0x24000, 0x3, 0x32, 0x, 0x0)
r0 = openat(0xff9c, &(0x7f019000-0x8)='./file0\x00', 0x42, 0x0)
fcntl$setstatus(r0, 0x4, 0x1)
ftruncate(r0, 0x400)
io_setup(0x1f, 

Re: [PATCH BUGFIX V3] block, bfq: add requeue-request hook

2018-02-23 Thread Paolo Valente


> On 23 Feb 2018, at 16:07, Ming Lei wrote:
> 
> Hi Paolo,
> 
> On Wed, Feb 07, 2018 at 10:19:20PM +0100, Paolo Valente wrote:
>> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
>> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
>> be re-inserted into the active I/O scheduler for that device. As a
> 
> No, this behaviour isn't related with commit a6a252e64914, and
> it has been there since blk_mq_requeue_request() is introduced.
> 

Hi Ming,
actually, we wrote the above statement after simply following the call
chain that led to the failure.  In this respect, the change in commit
a6a252e64914:

 static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
+  bool has_sched,
   struct request *rq)
 {
-   if (rq->tag == -1) {
+   /* dispatch flush rq directly */
+   if (rq->rq_flags & RQF_FLUSH_SEQ) {
+   spin_lock(&hctx->lock);
+   list_add(&rq->queuelist, &hctx->dispatch);
+   spin_unlock(&hctx->lock);
+   return true;
+   }
+
+   if (has_sched) {
rq->rq_flags |= RQF_SORTED;
-   return false;
+   WARN_ON(rq->tag != -1);
}
 
-   /*
-* If we already have a real request tag, send directly to
-* the dispatch list.
-*/
-   spin_lock(&hctx->lock);
-   list_add(&rq->queuelist, &hctx->dispatch);
-   spin_unlock(&hctx->lock);
-   return true;
+   return false;
 }

makes blk_mq_sched_bypass_insert return false for all non-flush
requests.  From that, the anomaly described in our commit follows, for
bfq, or any stateful scheduler that waits for the completion of requests
that passed through it.  I'm elaborating again a little bit on this in
my replies to your next points below.

I don't mean that this change is an error; it simply puts a stateful
scheduler into an inconsistent state, unless the scheduler properly
handles the requeue that precedes the re-insertion into the
scheduler.

If this clarifies the situation, but there is still some misleading
statement in the commit, just let me better understand, and I'll be
glad to rectify it, if possible somehow.


> And you can see blk_mq_requeue_request() is called by lots of drivers,
> especially it is often used in error handler, see SCSI's example.
> 
>> consequence, I/O schedulers may get the same request inserted again,
>> even several times, without a finish_request invoked on that request
>> before each re-insertion.
>> 

...

>> @@ -5426,7 +5482,8 @@ static struct elevator_type iosched_bfq_mq = {
>>  .ops.mq = {
>>  .limit_depth= bfq_limit_depth,
>>  .prepare_request= bfq_prepare_request,
>> -.finish_request = bfq_finish_request,
>> +.requeue_request= bfq_finish_requeue_request,
>> +.finish_request = bfq_finish_requeue_request,
>>  .exit_icq   = bfq_exit_icq,
>>  .insert_requests= bfq_insert_requests,
>>  .dispatch_request   = bfq_dispatch_request,
> 
> This way may not be correct since blk_mq_sched_requeue_request() can be
> called for one request which won't enter io scheduler.
> 

Exactly, there are two cases: requeues that lead to subsequent
re-insertions, and requeues that don't.  The function
bfq_finish_requeue_request handles both, and both need to be handled,
to inform bfq that it no longer has to wait for the completion of rq.

One special case is when bfq_finish_requeue_request gets invoked even
if rq has nothing to do with any scheduler.  In that case,
bfq_finish_requeue_request exits immediately.
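
For illustration, such a guard might look roughly like the sketch below. This
is not the exact code of the posted patch, just the shape of the check in the
context of block/bfq-iosched.c (RQ_BFQQ() is bfq's accessor for the
elevator-private data of a request):

/*
 * Rough, illustrative sketch only: bail out at once if rq never went
 * through bfq, i.e. if it carries no bfq_queue in its elevator-private
 * data; otherwise release the state bfq keeps for rq.
 */
static void bfq_finish_requeue_request_sketch(struct request *rq)
{
	struct bfq_queue *bfqq = RQ_BFQQ(rq);

	if (!bfqq)	/* rq has nothing to do with any scheduler */
		return;

	/* ... drop the scheduler state kept for rq ... */
}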


> __blk_mq_requeue_request() is called for two cases:
> 
>   - one is that the requeued request is added to hctx->dispatch, such
>   as blk_mq_dispatch_rq_list()

yes

>   - another case is that the request is requeued to io scheduler, such as
>   blk_mq_requeue_request().
> 

yes

> For the 1st case, blk_mq_sched_requeue_request() shouldn't be called
> since it is nothing to do with scheduler,

No, if that rq has been inserted and then extracted from the scheduler
through a dispatch_request, then it has.  The scheduler is stateful,
and keeps state for rq, because it must do so, until a completion or a
requeue arrive.  In particular, bfq may decide that no other
bfq_queues must be served before rq has been completed, because this
is necessary to preserve its target service guarantees.  If bfq is not
informed, either through its completion or its requeue hook, then it
will wait forever.

> seems we only need to do that
> for 2nd case.
> 

> So looks we need the following patch:
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 23de7fd8099a..a216f3c3c3ce 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -712,7 +714,6 @@ static void __blk_mq_requeue_request(struct request *rq)
> 

[PATCH 2/2] block: kyber: fix domain token leak during requeue

2018-02-23 Thread Ming Lei
When requeuing a request, the domain token should have been freed
before re-inserting the request into the io scheduler. Otherwise, the
assigned domain token will be leaked, and an IO hang can be caused.

Cc: Paolo Valente 
Cc: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/kyber-iosched.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index f95c60774ce8..0d6d25e32e1f 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -833,6 +833,7 @@ static struct elevator_type kyber_sched = {
.limit_depth = kyber_limit_depth,
.prepare_request = kyber_prepare_request,
.finish_request = kyber_finish_request,
+   .requeue_request = kyber_finish_request,
.completed_request = kyber_completed_request,
.dispatch_request = kyber_dispatch_request,
.has_work = kyber_has_work,
-- 
2.9.5



[PATCH 0/2] blk-mq: kyber: fix domain token leak during requeue

2018-02-23 Thread Ming Lei
Hi,

These two patches fix a domain token leak in the kyber scheduler, and
actually fix one IO hang issue.

Thanks,

Ming Lei (2):
  blk-mq: don't call io sched's .requeue_request when requeueing rq to
->dispatch
  block: kyber: fix domain token leak during requeue

 block/blk-mq.c| 4 +++-
 block/kyber-iosched.c | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

-- 
2.9.5



[PATCH 1/2] blk-mq: don't call io sched's .requeue_request when requeueing rq to ->dispatch

2018-02-23 Thread Ming Lei
__blk_mq_requeue_request() covers two cases:

- one is that the requeued request is added to hctx->dispatch, such as
blk_mq_dispatch_rq_list()

- another case is that the request is requeued to io scheduler, such as
blk_mq_requeue_request().

We should call io sched's .requeue_request callback only for the 2nd
case.

Cc: Paolo Valente 
Cc: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 357492712b0e..16e83e6df404 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -712,7 +712,6 @@ static void __blk_mq_requeue_request(struct request *rq)
 
trace_block_rq_requeue(q, rq);
wbt_requeue(q->rq_wb, &rq->issue_stat);
-   blk_mq_sched_requeue_request(rq);
 
if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
@@ -725,6 +724,9 @@ void blk_mq_requeue_request(struct request *rq, bool 
kick_requeue_list)
 {
__blk_mq_requeue_request(rq);
 
+   /* this request will be re-inserted to io scheduler queue */
+   blk_mq_sched_requeue_request(rq);
+
BUG_ON(blk_queued_rq(rq));
blk_mq_add_to_requeue_list(rq, true, kick_requeue_list);
 }
-- 
2.9.5



Re: [PATCH v2] block/loop: add documentation for sysfs interface

2018-02-23 Thread Jonathan Corbet
On Sat, 17 Feb 2018 11:43:04 +0530
Aishwarya Pant  wrote:

> Documentation has been compiled from git logs and by reading through
> code.
> 
> Signed-off-by: Aishwarya Pant 
> ---
> For drivers/block/loop.c, I don't see any maintainers or mailing lists except
> for LKML. I am guessing linux-block mailing list should be okay.
> 
> Changes in v2:
> - Add linux-block@vger.kernel.org to the recipient list.

Applied to the docs tree, thanks.

jon


Re: [RESEND PATCH v2] Documentation/ABI: clean up sysfs-class-pktcdvd

2018-02-23 Thread Jonathan Corbet
On Fri, 23 Feb 2018 18:46:32 +0530
Aishwarya Pant  wrote:

> Clean up the sysfs documentation such that it is in the same format as
> described in Documentation/ABI/README. Mainly, the patch moves the
> attribute names to the 'What:' field. This might be useful for scripting
> and tracking changes in the ABI.
> 
> Signed-off-by: Aishwarya Pant 

Applied to the docs tree, thanks.

In general, I've been setting aside the sysfs ABI doc patches to see what
the reaction is to them.  I'm trying to catch up with the ones that
haven't been merged via another tree, but it wouldn't be entirely
surprising if I miss one.  Please don't hesitate to resend anything that
doesn't appear in linux-next in the near future.

Thanks,

jon


Re: [PATCH BUGFIX V3] block, bfq: add requeue-request hook

2018-02-23 Thread Ming Lei
Hi Paolo,

On Wed, Feb 07, 2018 at 10:19:20PM +0100, Paolo Valente wrote:
> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
> be re-inserted into the active I/O scheduler for that device. As a

No, this behaviour isn't related with commit a6a252e64914, and
it has been there since blk_mq_requeue_request() is introduced.

And you can see blk_mq_requeue_request() is called by lots of drivers,
especially it is often used in error handler, see SCSI's example.

> consequence, I/O schedulers may get the same request inserted again,
> even several times, without a finish_request invoked on that request
> before each re-insertion.
> 
> This fact is the cause of the failure reported in [1]. For an I/O
> scheduler, every re-insertion of the same re-prepared request is
> equivalent to the insertion of a new request. For schedulers like
> mq-deadline or kyber, this fact causes no harm. In contrast, it
> confuses a stateful scheduler like BFQ, which keeps state for an I/O
> request, until the finish_request hook is invoked on the request. In
> particular, BFQ may get stuck, waiting forever for the number of
> request dispatches, of the same request, to be balanced by an equal
> number of request completions (while there will be one completion for
> that request). In this state, BFQ may refuse to serve I/O requests
> from other bfq_queues. The hang reported in [1] then follows.
> 
> However, the above re-prepared requests undergo a requeue, thus the
> requeue_request hook of the active elevator is invoked for these
> requests, if set. This commit then addresses the above issue by
> properly implementing the hook requeue_request in BFQ.
> 
> [1] https://marc.info/?l=linux-block=15127608676
> 
> Reported-by: Ivan Kozik 
> Reported-by: Alban Browaeys 
> Tested-by: Mike Galbraith 
> Signed-off-by: Paolo Valente 
> Signed-off-by: Serena Ziviani 
> ---
> V2: contains fix to bug reported in [2]
> V3: implements the improvement suggested in [3]
> 
> [2] https://lkml.org/lkml/2018/2/5/599
> [3] https://lkml.org/lkml/2018/2/7/532
> 
>  block/bfq-iosched.c | 107 
> 
>  1 file changed, 82 insertions(+), 25 deletions(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 47e6ec7427c4..aeca22d91101 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -3823,24 +3823,26 @@ static struct request *__bfq_dispatch_request(struct 
> blk_mq_hw_ctx *hctx)
>   }
> 
>   /*
> -  * We exploit the bfq_finish_request hook to decrement
> -  * rq_in_driver, but bfq_finish_request will not be
> -  * invoked on this request. So, to avoid unbalance,
> -  * just start this request, without incrementing
> -  * rq_in_driver. As a negative consequence,
> -  * rq_in_driver is deceptively lower than it should be
> -  * while this request is in service. This may cause
> -  * bfq_schedule_dispatch to be invoked uselessly.
> +  * We exploit the bfq_finish_requeue_request hook to
> +  * decrement rq_in_driver, but
> +  * bfq_finish_requeue_request will not be invoked on
> +  * this request. So, to avoid unbalance, just start
> +  * this request, without incrementing rq_in_driver. As
> +  * a negative consequence, rq_in_driver is deceptively
> +  * lower than it should be while this request is in
> +  * service. This may cause bfq_schedule_dispatch to be
> +  * invoked uselessly.
>*
>* As for implementing an exact solution, the
> -  * bfq_finish_request hook, if defined, is probably
> -  * invoked also on this request. So, by exploiting
> -  * this hook, we could 1) increment rq_in_driver here,
> -  * and 2) decrement it in bfq_finish_request. Such a
> -  * solution would let the value of the counter be
> -  * always accurate, but it would entail using an extra
> -  * interface function. This cost seems higher than the
> -  * benefit, being the frequency of non-elevator-private
> +  * bfq_finish_requeue_request hook, if defined, is
> +  * probably invoked also on this request. So, by
> +  * exploiting this hook, we could 1) increment
> +  * rq_in_driver here, and 2) decrement it in
> +  * bfq_finish_requeue_request. Such a solution would
> +  * let the value of the counter be always accurate,
> +  * but it would entail using an extra interface
> +  * function. This cost seems higher than the benefit,
> +  * being the 

Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir

2018-02-23 Thread Tejun Heo
Hello,

On Fri, Feb 23, 2018 at 09:56:54AM +0800, xuejiufei wrote:
> > On Thu, Feb 22, 2018 at 02:14:34PM +0800, Joseph Qi wrote:
> >> I still don't get how css_tryget can work here.
> >>
> >> The race happens when:
> >> 1) writeback kworker has found the blkg with rcu;
> >> 2) blkcg is during offlining and blkg_destroy() has already been called.
> >> Then, writeback kworker will take queue lock and access the blkg with
> >> refcount 0.
> > 
> > Yeah, then tryget would fail and it should go through the root.
> > 
> In this race, the refcount of blkg becomes zero and is destroyed.
> However css may still have refcount, and css_tryget can return success
> before other callers put the refcount.
> So I don't get how css_tryget can fix this race? Or I wonder if we can
> add another function blkg_tryget?

IIRC, as long as the blkcg and the device are there, the blkgs aren't
gonna be destroyed.  So, if you have a ref to the blkcg through
tryget, the blkg shouldn't go away.

Thanks.

-- 
tejun


Re: [PATCH] blk-throttle: avoid multiple counting for same bio

2018-02-23 Thread Tejun Heo
Hello,

On Fri, Feb 23, 2018 at 12:29:25PM +0800, Chengguang Xu wrote:
> > That's true, the issue Shaohua has fixed is double charge, but double
> > stat issue still exists.
> > 
> > Jiufei has posted a fix, which has already been tested by Bo Liu:
> > [PATCH RESEND] blk-throttle: avoid double counted
> > https://www.mail-archive.com/linux-block@vger.kernel.org/msg18516.html

Shaohua, can you please take a look at the patch?

Thanks.

-- 
tejun


[RESEND PATCH v2] Documentation/ABI: clean up sysfs-class-pktcdvd

2018-02-23 Thread Aishwarya Pant
Clean up the sysfs documentation such that it is in the same format as
described in Documentation/ABI/README. Mainly, the patch moves the
attribute names to the 'What:' field. This might be useful for scripting
and tracking changes in the ABI.

Signed-off-by: Aishwarya Pant 
---
Changes in v2:
- Convert spaces to tabs

 Documentation/ABI/testing/sysfs-class-pktcdvd | 129 +++---
 1 file changed, 75 insertions(+), 54 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-class-pktcdvd 
b/Documentation/ABI/testing/sysfs-class-pktcdvd
index b1c3f0263359..dde4f26d0780 100644
--- a/Documentation/ABI/testing/sysfs-class-pktcdvd
+++ b/Documentation/ABI/testing/sysfs-class-pktcdvd
@@ -1,60 +1,81 @@
-What:   /sys/class/pktcdvd/
-Date:   Oct. 2006
-KernelVersion:  2.6.20
-Contact:Thomas Maier 
-Description:
-
 sysfs interface
 ---
+The pktcdvd module (packet writing driver) creates the following files in the
+sysfs: ( is in the format major:minor)
+
+What:  /sys/class/pktcdvd/add
+What:  /sys/class/pktcdvd/remove
+What:  /sys/class/pktcdvd/device_map
+Date:  Oct. 2006
+KernelVersion: 2.6.20
+Contact:   Thomas Maier 
+Description:
+
+   add:(WO) Write a block device id (major:minor) to
+   create a new pktcdvd device and map it to the
+   block device.
+
+   remove: (WO) Write the pktcdvd device id (major:minor)
+   to remove the pktcdvd device.
+
+   device_map: (RO) Shows the device mapping in format:
+   pktcdvd[0-7]  
+
+
+What:  /sys/class/pktcdvd/pktcdvd[0-7]/dev
+What:  /sys/class/pktcdvd/pktcdvd[0-7]/uevent
+Date:  Oct. 2006
+KernelVersion: 2.6.20
+Contact:   Thomas Maier 
+Description:
+   dev:(RO) Device id
+
+   uevent: (WO) To send a uevent
+
+
+What:  /sys/class/pktcdvd/pktcdvd[0-7]/stat/packets_started
+What:  /sys/class/pktcdvd/pktcdvd[0-7]/stat/packets_finished
+What:  /sys/class/pktcdvd/pktcdvd[0-7]/stat/kb_written
+What:  /sys/class/pktcdvd/pktcdvd[0-7]/stat/kb_read
+What:  /sys/class/pktcdvd/pktcdvd[0-7]/stat/kb_read_gather
+What:  /sys/class/pktcdvd/pktcdvd[0-7]/stat/reset
+Date:  Oct. 2006
+KernelVersion: 2.6.20
+Contact:   Thomas Maier 
+Description:
+   packets_started:(RO) Number of started packets.
+
+   packets_finished:   (RO) Number of finished packets.
+
+   kb_written: (RO) kBytes written.
+
+   kb_read:(RO) kBytes read.
+
+   kb_read_gather: (RO) kBytes read to fill write packets.
+
+   reset:  (WO) Write any value to it to reset
+   pktcdvd device statistic values, like
+   bytes read/written.
+
+
+What:  /sys/class/pktcdvd/pktcdvd[0-7]/write_queue/size
+What:  /sys/class/pktcdvd/pktcdvd[0-7]/write_queue/congestion_off
+What:  /sys/class/pktcdvd/pktcdvd[0-7]/write_queue/congestion_on
+Date:  Oct. 2006
+KernelVersion: 2.6.20
+Contact:   Thomas Maier 
+Description:
+   size:   (RO) Contains the size of the bio write queue.
+
+   congestion_off: (RW) If bio write queue size is below this mark,
+   accept new bio requests from the block layer.
 
-The pktcdvd module (packet writing driver) creates
-these files in the sysfs:
-( is in format  major:minor )
-
-/sys/class/pktcdvd/
-add(0200)  Write a block device id (major:minor)
-   to create a new pktcdvd device and map
-   it to the block device.
-
-remove (0200)  Write the pktcdvd device id (major:minor)
-   to it to remove the pktcdvd device.
-
-device_map (0444)  Shows the device mapping in format:
- pktcdvd[0-7]  
-
-/sys/class/pktcdvd/pktcdvd[0-7]/
-dev   (0444) Device id
-uevent(0200) To send an uevent.
-
-/sys/class/pktcdvd/pktcdvd[0-7]/stat/
-packets_started   (0444) Number of started packets.
-packets_finished  (0444) Number of finished packets.
-
-kb_written(0444) kBytes written.
-kb_read   (0444) kBytes read.
-kb_read_gather(0444) kBytes read to fill write packets.
-
-reset (0200) Write any value to it to reset
- pktcdvd device statistic values, like
- bytes read/written.
-
-/sys/class/pktcdvd/pktcdvd[0-7]/write_queue/
-

Re: [PATCH 1/1] extend BLKRRPART to update the readable size of optical media

2018-02-23 Thread Carlos Maiolino
On Fri, Feb 23, 2018 at 11:14:41AM +0100, Carlos Maiolino wrote:
> On Thu, Feb 22, 2018 at 01:00:16PM -0600, Steve Kenton wrote:
> > The readable size of new, factory blank, optical media can change via user 
> > space ioctl(SG_IO) commands to format overwritable media such as DVD+RW for 
> > a UDF filesystem or write to sequential media such as DVD-R for an ISO-9660 
> > filesystem. However there appears to be no easy way to update the size used 
> > by the block layer from the initial value of zero other than ejecting and 
> > reinserting the media. This is a long standing issue which programs such as 
> > growisofs work around by automatically opening and closing the tray at the 
> > end of a session so the newly created filesystem becomes 
> > mountable/readable. This can be problematic when using a slot load drive.
> > 

This should go to linux-block btw, not linux-fsdevel, cc'ing them

> > Making the BLKRRPART ioctl call fops->revalidate_disk() => 
> > sr_revalidate_disk() in sr.c resolves the issue for optical media and 
> > should be a rare and benign operation if ever called for non-optical media 
> > => sd_revalidate_disk() in sd.c.
> > 
> 
> Please use checkpatch.pl and break the comment description to at most
> 80chars/line
> 
> > Signed-off-by: Steve Kenton 
> > Signed-off-by: Thomas Schmitt 
> > ---
> > I discussed this with Thomas Schmitt the maintainer of the xorriso burner 
> > program to define the problem and suggest a fix.
> > 
> > Loading  a factory blank DVD+RW needing to be "de-iced" gives the results 
> > below before and after format full using xorriso but *NOT* ejecting the 
> > media
> > 
> > hdi@hdi-H110N:/sys/block/sr0$ dd if=/dev/sr0 of=/dev/null bs=2k
> > dd: error reading ‘/dev/sr0’: Input/output error
> > 0+0 records in
> > 0+0 records out
> > 0 bytes (0 B) copied, 0.0275589 s, 0.0 kB/s
> > 
> > hdi@hdi-H110N:/sys/block/sr0$ cat size
> > 4
> > 
> > hdi@hdi-H110N:/sys/block/sr0$ xorriso -dev /dev/sr0 -format full
> > 
> > hdi@hdi-H110N:/sys/block/sr0$ dd if=/dev/sr0 of=/dev/null bs=2k
> > 0+0 records in
> > 0+0 records out
> > 0 bytes (0 B) copied, 0.000280081 s, 0.0 kB/s
> > 
> > hdi@hdi-H110N:/sys/block/sr0$ cat size
> > 0
> > 
> > Here are the results after deleting and then rescanning the optial drive 
> > via sysfs
> > echo offline > state  echo 1 > delete  echo - - - > scan
> > *WITHOUT* ejecting the media
> > 
> > hdi@hdi-H110N:/sys/block/sr0/device$ dd if=/dev/sr0 of=/dev/null bs=2k 
> > count=1
> > 1+0 records in
> > 1+0 records out
> > 2048 bytes (2.0 kB) copied, 0.0624783 s, 32.8 kB/s
> > 
> > hdi@hdi-H110N:/sys/block/sr0$ cat size
> > 9180416
> > 
> > With this patch applied running a small program to perform ioctl(BLKRRPART) 
> > on /dev/sr0 produces similar results with a blank DVD-R after writing to 
> > the disc using xorriso. Again, *WITHOUT* ejecting and reinserting the 
> > media. Much nicer than the spin down and delete then scan for the device 
> > again business, I think :-)
> > 
> > hdi@hdi-H110N:/sys/block/sr0$ cat size
> > 4
> > 
> > ./blkrrpart /dev/sr0
> > 
> > hdi@hdi-H110N:/sys/block/sr0$ cat size
> > 16256
> > 
> > 
> >  block/ioctl.c | 8 ++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/ioctl.c b/block/ioctl.c
> > index 1668506d8ed8..da79e9ba44ba 100644
> > --- a/block/ioctl.c
> > +++ b/block/ioctl.c
> > @@ -163,14 +163,18 @@ int __blkdev_reread_part(struct block_device *bdev)
> >  {
> > struct gendisk *disk = bdev->bd_disk;
> >  
> > -   if (!disk_part_scan_enabled(disk) || bdev != bdev->bd_contains)
> > +   if (bdev != bdev->bd_contains) // must be whole disk
> 
>   ^^ /*  Comment */
> > return -EINVAL;
> > +
> > if (!capable(CAP_SYS_ADMIN))
> > return -EACCES;
> >  
> > lockdep_assert_held(&bdev->bd_mutex);
> >  
> > -   return rescan_partitions(disk, bdev);
> > +   if (disk_part_scan_enabled(disk))
> > +   return rescan_partitions(disk, bdev);
> > +   else // update size but not partitions, could be sr or sd
> 
>   ^^ /* Comment */
> > +   return disk->fops->revalidate_disk(disk);
> 
> and what happens if a device without ->revalidate_disk() defined calls into 
> this
> function?
> 
> >  }
> >  EXPORT_SYMBOL(__blkdev_reread_part);
> >  
> > -- 
> > 2.16.1.72.g5be1f00
> > 
> 
> -- 
> Carlos

-- 
Carlos


Re: disk-io lockup in 4.14.13 kernel

2018-02-23 Thread Jaco Kroon
Hi Bart,

Thank you for your response.

On 22/02/2018 18:46, Bart Van Assche wrote:
> On 02/22/18 02:58, Jaco Kroon wrote:
>> We've been seeing sporadic IO lockups on recent kernels.
>
> Are you using the legacy I/O stack or blk-mq? If you are not yet using
> blk-mq, can you switch to blk-mq + scsi-mq + dm-mq? If the lockup is
> reproducible with blk-mq, can you share the output of the following
> command:
crowsnest ~ # zgrep MQ /proc/config.gz

CONFIG_SCSI_MQ_DEFAULT=y
# CONFIG_DM_MQ_DEFAULT is not set

... oi, so that's a very valid question.

blk-mq is thus off by default, I've now enabled it on the "live" system
with "echo 1 > /sys/module/dm_mod/parameters/use_blk_mq".

I've also modified the kernel config to set CONFIG_DM_MQ_DEFAULT (I know
I can just set this on cmdline too).

The only immediately visible effect is that I seem to be more
consistently get >300MB/s (and more frequently >400MB) off the array in
terms of read speed, where normally I'd expect 200MB consistent with
spikes just over 300MB and very infrequently over 400MB.  This is a very
simple spot check with iotop over approximately a minute.

I am seeing I/O errors in dmesg from time to time, this to me hints that
potentially it may be something related to some error path that's
causing problems.

Just so we're clear, we're seeing this happen approximately once a
month, so if switching on dm_mod.use_blk_mq solves it then I won't
really know beyond a shadow of a doubt, and the only way of "knowing" is
if we can get to an uptime of three months or so ...

>
> (cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
I don't have a /sys/kernel/debug folder - I've enabled CONFIG_DEBUG_FS
and BLK_DEBUG_FS, will reboot at the first opportunity.  As a general
rule - is there additional overhead to having debugfs enabled?  Any
other risks that I should be aware of?  In essence, are there any
disadvantages to just enabling DEBUG_FS as a general rule?  I did note
that a few extra DEBUG options pop up for other modules ... so my gut is
towards leaving this disabled as a general rule and enabling when needed.

>
> Thanks,
Thank you!

Kind Regards,
Jaco


Re: [PATCH v4 6/6] block: Fix a race between request queue removal and the block cgroup controller

2018-02-23 Thread Joseph Qi


On 18/2/23 09:08, Bart Van Assche wrote:
> Avoid that the following race can occur:
> 
> blk_cleanup_queue()   blkcg_print_blkgs()
>   spin_lock_irq(lock) (1)   spin_lock_irq(blkg->q->queue_lock) (2,5)
> q->queue_lock = &q->__queue_lock (3)
>   spin_unlock_irq(lock) (4)
> spin_unlock_irq(blkg->q->queue_lock) (6)
> 
> (1) take driver lock;
> (2) busy loop for driver lock;
> (3) override driver lock with internal lock;
> (4) unlock driver lock;
> (5) can take driver lock now;
> (6) but unlock internal lock.
> 
> This change is safe because only the SCSI core and the NVME core keep
> a reference on a request queue after having called blk_cleanup_queue().
> Neither driver accesses any of the removed data structures between its
> blk_cleanup_queue() and blk_put_queue() calls.
> 
> Reported-by: Joseph Qi 
> Signed-off-by: Bart Van Assche 
> Cc: Jan Kara 

Looks good.
Reviewed-by: Joseph Qi 


Re: [PATCH v4 5/6] block: Fix a race between the cgroup code and request queue initialization

2018-02-23 Thread Joseph Qi


On 18/2/23 09:08, Bart Van Assche wrote:
> Initialize the request queue lock earlier such that the following
> race can no longer occur:
> 
> blk_init_queue_node() blkcg_print_blkgs()
>   blk_alloc_queue_node (1)
> q->queue_lock = &q->__queue_lock (2)
> blkcg_init_queue(q) (3)
> spin_lock_irq(blkg->q->queue_lock) (4)
>   q->queue_lock = lock (5)
> spin_unlock_irq(blkg->q->queue_lock) (6)
> 
> (1) allocate an uninitialized queue;
> (2) initialize queue_lock to its default internal lock;
> (3) initialize blkcg part of request queue, which will create blkg and
> then insert it to blkg_list;
> (4) traverse blkg_list and find the created blkg, and then take its
> queue lock, here it is the default *internal lock*;
> (5) *race window*, now queue_lock is overridden with *driver specified
> lock*;
> (6) now unlock *driver specified lock*, not the locked *internal lock*,
> unlock balance breaks.
> 
> The changes in this patch are as follows:
> - Move the .queue_lock initialization from blk_init_queue_node() into
>   blk_alloc_queue_node().
> - Only override the .queue_lock pointer for legacy queues because it
>   is not useful for blk-mq queues to override this pointer.
> - For all all block drivers that initialize .queue_lock explicitly,
>   change the blk_alloc_queue() call in the driver into a
>   blk_alloc_queue_node() call and remove the explicit .queue_lock
>   initialization. Additionally, initialize the spin lock that will
>   be used as queue lock earlier if necessary.
> 
> Reported-by: Joseph Qi 
> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 
> Cc: Joseph Qi 
> Cc: Philipp Reisner 
> Cc: Ulf Hansson 
> Cc: Kees Cook 

Looks good.
Reviewed-by: Joseph Qi 


Re: [PATCH v4 3/6] zram: Delete gendisk before cleaning up the request queue

2018-02-23 Thread Joseph Qi


On 18/2/23 09:08, Bart Van Assche wrote:
> Remove the disk, partition and bdi sysfs attributes before cleaning up
> the request queue associated with the disk.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Minchan Kim 
> Cc: Nitin Gupta 
> Cc: Sergey Senozhatsky 

Looks good.
Reviewed-by: Joseph Qi 


Re: [PATCH v4 2/6] md: Delete gendisk before cleaning up the request queue

2018-02-23 Thread Joseph Qi


On 18/2/23 09:08, Bart Van Assche wrote:
> Remove the disk, partition and bdi sysfs attributes before cleaning up
> the request queue associated with the disk.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Shaohua Li 

Looks good.
Reviewed-by: Joseph Qi 


Re: [PATCH v4 1/6] block/loop: Delete gendisk before cleaning up the request queue

2018-02-23 Thread Joseph Qi


On 18/2/23 09:08, Bart Van Assche wrote:
> Remove the disk, partition and bdi sysfs attributes before cleaning up
> the request queue associated with the disk.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Josef Bacik 
> Cc: Shaohua Li 
> Cc: Omar Sandoval 
> Cc: Hannes Reinecke 
> Cc: Ming Lei 

Looks good.
Reviewed-by: Joseph Qi 


Re: [PATCH v4 4/6] block: Add a third argument to blk_alloc_queue_node()

2018-02-23 Thread Joseph Qi


On 18/2/23 09:08, Bart Van Assche wrote:
> This patch does not change any functionality.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 
> Cc: Joseph Qi 
> Cc: Philipp Reisner 
> Cc: Ulf Hansson 
> Cc: Kees Cook 

Looks good.
Reviewed-by: Joseph Qi 


Re: [PATCH] lightnvm: fix memory leak in pblk_luns_init

2018-02-23 Thread Javier González
> On 23 Feb 2018, at 07.12, Matias Bjørling  wrote:
> 
> On 02/23/2018 12:03 AM, Huaicheng Li wrote:
>> Please ignore my previous email as I found the memory is free'ed at
>> pblk_init()'s error handling logic.
>> Sorry for the interruption.
>> On Thu, Feb 22, 2018 at 3:01 PM, Huaicheng Li 
>> wrote:
>>> Signed-off-by: Huaicheng Li 
>>> ---
>>> drivers/lightnvm/pblk-init.c | 1 +
>>> 1 file changed, 1 insertion(+)
>>> 
>>> diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
>>> index 93d671ca518e..330665d91d8d 100644
>>> --- a/drivers/lightnvm/pblk-init.c
>>> +++ b/drivers/lightnvm/pblk-init.c
>>> @@ -510,6 +510,7 @@ static int pblk_luns_init(struct pblk *pblk, struct
>>> ppa_addr *luns)
>>> if (ret) {
>>> while (--i >= 0)
>>> kfree(pblk->luns[i].bb_list);
>>> +   kfree(pblk->luns);
>>> return ret;
>>> }
>>> }
>>> --
>>> 2.13.6
> 
> No worries. The initialization part is pretty complex due to all the
> data structures being set up. I would love to have a clean
> initialization function, which cleans up after it self. But being able
> to share the "fail" bringup and tear-down code is very helpful and
> makes other parts easier.

As part of the 2.0 patches, I took a first try at refactoring the
init/exit sequences to maintain the same fail paths but minimizing
dependencies across structures. One of the big parts has been the bad
block initialization, which is now much cleaner, but still needs more
work to be as I would like it to. Patches are very welcome :)

Javier


signature.asc
Description: Message signed with OpenPGP


Re: [PATCH v4 4/6] block: Add a third argument to blk_alloc_queue_node()

2018-02-23 Thread Johannes Thumshirn
Hi Bart,

how about "block: Add 'lock' as third argument to
blk_alloc_queue_node()"?

So one actually sees early enough what the third argument will be?

Thanks,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH v4 3/6] zram: Delete gendisk before cleaning up the request queue

2018-02-23 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH v4 2/6] md: Delete gendisk before cleaning up the request queue

2018-02-23 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH v4 1/6] block/loop: Delete gendisk before cleaning up the request queue

2018-02-23 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850