Re: [PATCH 0/15 v2] loop: Fix oops and possible deadlocks

2018-10-10 Thread Tetsuo Handa
On 2018/10/10 19:04, Jan Kara wrote:
> Hi,
> 
> this patch series fixes oops and possible deadlocks as reported by syzbot [1]
> [2]. The second patch in the series (from Tetsuo) fixes the oops, the 
> remaining
> patches are cleaning up the locking in the loop driver so that we can in the
> end reasonably easily switch to rereading partitions without holding mutex
> protecting the loop device.
> 
> I have lightly tested the patches by creating, deleting, and modifying loop
> devices but if there's some more comprehensive loopback device testsuite, I
> can try running it. Review is welcome!

Testing on linux-next by syzbot will be the most comprehensive. ;-)

> 
> Changes since v1:
> * Added patch moving fput() calls in loop_change_fd() from under 
> loop_ctl_mutex
> * Fixed bug in loop_control_ioctl() where it failed to return error properly
> 
>   Honza
> 
> [1] 
> https://syzkaller.appspot.com/bug?id=f3cfe26e785d85f9ee259f385515291d21bd80a3
> [2] 
> https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889
> 


Re: [PATCH 4/4] block/loop: Fix circular locking dependency at blkdev_reread_part().

2018-09-27 Thread Tetsuo Handa
On 2018/09/27 20:27, Jan Kara wrote:
> Hi,
> 
> On Wed 26-09-18 00:26:49, Tetsuo Handa wrote:
>> syzbot is reporting circular locking dependency between bdev->bd_mutex
>> and lo->lo_ctl_mutex [1] which is caused by calling blkdev_reread_part()
>> with lock held. We need to drop lo->lo_ctl_mutex in order to fix it.
>>
>> This patch fixes it by combining loop_index_mutex and loop_ctl_mutex into
>> loop_mutex, and releasing loop_mutex before calling blkdev_reread_part()
>> or fput() or path_put() or leaving ioctl().
>>
>> The rule is that current thread calls lock_loop() before accessing
>> "struct loop_device", and current thread no longer accesses "struct
>> loop_device" after unlock_loop() is called.
>>
>> Since syzbot is reporting various bugs [2] where a race in the loop module
>> is suspected, let's check whether this patch affects these bugs too.
>>
>> [1] 
>> https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889
>> [2] 
>> https://syzkaller.appspot.com/bug?id=b3c7e1440aa8ece16bf557dbac427fdff1dad9d6
>>
>> Signed-off-by: Tetsuo Handa 
>> Reported-by: syzbot 
>> 
>> ---
>>  drivers/block/loop.c | 187 
>> ---
>>  1 file changed, 101 insertions(+), 86 deletions(-)
> 
> I still don't like this patch. I'll post a patch series showing what I have
> in mind. Admittedly, it's a bit tedious but the locking is much saner
> afterwards...

Please be sure to Cc: me. I'm not subscribed to linux-block ML.

But if we have to release lo_ctl_mutex before calling blkdev_reread_part(),
what is the benefit of re-acquiring lo_ctl_mutex after blkdev_reread_part()
returns?
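
For readers following the lockdep reports, the inversion in question is,
schematically (an illustrative sketch, not actual driver code):

	/*
	 * ioctl path:                        release path:
	 *   mutex_lock(lo_ctl_mutex);          mutex_lock(bdev->bd_mutex);
	 *   blkdev_reread_part(bdev);          lo_release(disk);
	 *     mutex_lock(bdev->bd_mutex);        mutex_lock(lo_ctl_mutex);
	 *
	 * One side takes lo_ctl_mutex before bd_mutex, the other takes
	 * bd_mutex before lo_ctl_mutex, so lockdep reports a possible ABBA
	 * deadlock; dropping lo_ctl_mutex before the reread breaks the cycle.
	 */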



[PATCH 4/4] block/loop: Fix circular locking dependency at blkdev_reread_part().

2018-09-25 Thread Tetsuo Handa
syzbot is reporting circular locking dependency between bdev->bd_mutex
and lo->lo_ctl_mutex [1] which is caused by calling blkdev_reread_part()
with lock held. We need to drop lo->lo_ctl_mutex in order to fix it.

This patch fixes it by combining loop_index_mutex and loop_ctl_mutex into
loop_mutex, and releasing loop_mutex before calling blkdev_reread_part()
or fput() or path_put() or leaving ioctl().

The rule is that current thread calls lock_loop() before accessing
"struct loop_device", and current thread no longer accesses "struct
loop_device" after unlock_loop() is called.

Since syzbot is reporting various bugs [2] where a race in the loop module
is suspected, let's check whether this patch affects these bugs too.

[1] 
https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889
[2] 
https://syzkaller.appspot.com/bug?id=b3c7e1440aa8ece16bf557dbac427fdff1dad9d6

Signed-off-by: Tetsuo Handa 
Reported-by: syzbot 

---
 drivers/block/loop.c | 187 ---
 1 file changed, 101 insertions(+), 86 deletions(-)
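
As a reading aid for the diff below, the rule stated above amounts to the
following usage pattern (a minimal sketch with a hypothetical ioctl handler,
not part of the patch):

	static int example_lo_ioctl(struct loop_device *lo)	/* hypothetical */
	{
		int err = lock_loop_killable();

		if (err)
			return err;
		/* "struct loop_device" may only be accessed here */
		unlock_loop();	/* a callee such as loop_reread_partitions()
				 * may have dropped the lock already; the
				 * owner check makes this call a no-op then */
		return 0;	/* "lo" must not be touched after unlock_loop() */
	}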

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 4b05a27..04389bb 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -84,12 +84,50 @@
 #include 
 
 static DEFINE_IDR(loop_index_idr);
-static DEFINE_MUTEX(loop_index_mutex);
-static DEFINE_MUTEX(loop_ctl_mutex);
+static DEFINE_MUTEX(loop_mutex);
+static void *loop_mutex_owner; /* == __mutex_owner(&loop_mutex) */
 
 static int max_part;
 static int part_shift;
 
+/*
+ * lock_loop - Lock loop_mutex.
+ */
+static void lock_loop(void)
+{
+   mutex_lock(&loop_mutex);
+   loop_mutex_owner = current;
+}
+
+/*
+ * lock_loop_killable - Lock loop_mutex unless killed.
+ */
+static int lock_loop_killable(void)
+{
+   int err = mutex_lock_killable(&loop_mutex);
+
+   if (err)
+   return err;
+   loop_mutex_owner = current;
+   return 0;
+}
+
+/*
+ * unlock_loop - Unlock loop_mutex as needed.
+ *
+ * Explicitly call this function before calling fput() or blkdev_reread_part()
+ * in order to avoid circular lock dependency. After this function is called,
+ * current thread is no longer allowed to access "struct loop_device" memory,
+ * for another thread would access that memory as soon as loop_mutex is held.
+ */
+static void unlock_loop(void)
+{
+   if (loop_mutex_owner == current) {
+   loop_mutex_owner = NULL;
+   mutex_unlock(&loop_mutex);
+   }
+}
+
 static int transfer_xor(struct loop_device *lo, int cmd,
struct page *raw_page, unsigned raw_off,
struct page *loop_page, unsigned loop_off,
@@ -637,6 +675,7 @@ static void loop_reread_partitions(struct loop_device *lo,
const int count = atomic_read(&lo->lo_refcnt);
 
memcpy(filename, lo->lo_file_name, sizeof(filename));
+   unlock_loop();
 
/*
 * bd_mutex has been held already in release path, so don't
@@ -699,6 +738,7 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
struct file *file, *old_file;
int error;
 
+   lockdep_assert_held(&loop_mutex);
error = -ENXIO;
if (lo->lo_state != Lo_bound)
goto out;
@@ -737,10 +777,12 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
 
if (lo->lo_flags & LO_FLAGS_PARTSCAN)
loop_reread_partitions(lo, bdev);
+   unlock_loop();
fput(old_file);
return 0;
 
  out_putf:
+   unlock_loop();
fput(file);
  out:
return error;
@@ -918,6 +960,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
int error;
loff_t  size;
 
+   lockdep_assert_held(&loop_mutex);
/* This is safe, since we have a reference from open(). */
__module_get(THIS_MODULE);
 
@@ -991,6 +1034,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t 
mode,
return 0;
 
  out_putf:
+   unlock_loop();
fput(file);
  out:
/* This is safe: open() is still holding a reference. */
@@ -1042,6 +1086,7 @@ static int loop_clr_fd(struct loop_device *lo)
struct block_device *bdev = lo->lo_device;
bool reread;
 
+   lockdep_assert_held(&loop_mutex);
if (lo->lo_state != Lo_bound)
return -ENXIO;
 
@@ -1057,7 +1102,6 @@ static int loop_clr_fd(struct loop_device *lo)
 */
if (atomic_read(&lo->lo_refcnt) > 1) {
lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
-   mutex_unlock(&loop_ctl_mutex);
return 0;
}
 
@@ -,13 +1155,7 @@ static int loop_clr_fd(struct loop_device *lo)
lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
if (reread)
loop_reread_partitions(lo, bdev);
-   mutex_unlock(&loop_ctl_mutex);
-   /*
-* Need not hold loop_ctl_mutex to fput backing file.

[PATCH 3/4] block/loop: Reorganize loop_reread_partitions() callers.

2018-09-25 Thread Tetsuo Handa
We will drop loop_ctl_mutex before calling blkdev_reread_part() in
order to fix circular locking dependency between bdev->bd_mutex and
loop_ctl_mutex. To do that we need to make sure that we won't touch
"struct loop_device" after releasing loop_ctl_mutex.

As a preparation step, this patch reorganizes loop_reread_partitions()
callers. According to Ming Lei, calling loop_unprepare_queue() before
loop_reread_partitions() (like we did until 3.19) is fine. Therefore,
this patch will not cause user visible changes.
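
The point of the reordering is to leave loop_reread_partitions() as the very
last action that touches "lo" in each path, so that a later patch can drop
the mutex before the reread; schematically (a sketch of the intended shape,
not patch text):

	/* goal shape of every caller after this patch: */
	/* ... finish all work that needs "struct loop_device" ... */
	if (reread)
		loop_reread_partitions(lo, bdev);	/* nothing after this */
	return 0;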

Signed-off-by: Tetsuo Handa 
Cc: Ming Lei 
---
 drivers/block/loop.c | 29 +++--
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 920cbb1..4b05a27 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -632,6 +632,11 @@ static void loop_reread_partitions(struct loop_device *lo,
   struct block_device *bdev)
 {
int rc;
+   char filename[LO_NAME_SIZE];
+   const int num = lo->lo_number;
+   const int count = atomic_read(&lo->lo_refcnt);
+
+   memcpy(filename, lo->lo_file_name, sizeof(filename));
 
/*
 * bd_mutex has been held already in release path, so don't
@@ -641,13 +646,13 @@ static void loop_reread_partitions(struct loop_device *lo,
 * must be at least one and it can only become zero when the
 * current holder is released.
 */
-   if (!atomic_read(&lo->lo_refcnt))
+   if (!count)
rc = __blkdev_reread_part(bdev);
else
rc = blkdev_reread_part(bdev);
if (rc)
pr_warn("%s: partition scan of loop%d (%s) failed (rc=%d)\n",
-   __func__, lo->lo_number, lo->lo_file_name, rc);
+   __func__, num, filename, rc);
 }
 
 static inline int is_loop_device(struct file *file)
@@ -730,9 +735,9 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
loop_update_dio(lo);
blk_mq_unfreeze_queue(lo->lo_queue);
 
-   fput(old_file);
if (lo->lo_flags & LO_FLAGS_PARTSCAN)
loop_reread_partitions(lo, bdev);
+   fput(old_file);
return 0;
 
  out_putf:
@@ -971,16 +976,18 @@ static int loop_set_fd(struct loop_device *lo, fmode_t 
mode,
set_blocksize(bdev, S_ISBLK(inode->i_mode) ?
  block_size(inode->i_bdev) : PAGE_SIZE);
 
+   /*
+* Grab the block_device to prevent its destruction after we
+* put /dev/loopXX inode. Later in loop_clr_fd() we bdput(bdev).
+*/
+   bdgrab(bdev);
+
lo->lo_state = Lo_bound;
if (part_shift)
lo->lo_flags |= LO_FLAGS_PARTSCAN;
if (lo->lo_flags & LO_FLAGS_PARTSCAN)
loop_reread_partitions(lo, bdev);
 
-   /* Grab the block_device to prevent its destruction after we
-* put /dev/loopXX inode. Later in loop_clr_fd() we bdput(bdev).
-*/
-   bdgrab(bdev);
return 0;
 
  out_putf:
@@ -1033,6 +1040,7 @@ static int loop_clr_fd(struct loop_device *lo)
struct file *filp = lo->lo_backing_file;
gfp_t gfp = lo->old_gfp_mask;
struct block_device *bdev = lo->lo_device;
+   bool reread;
 
if (lo->lo_state != Lo_bound)
return -ENXIO;
@@ -1096,12 +1104,13 @@ static int loop_clr_fd(struct loop_device *lo)
module_put(THIS_MODULE);
blk_mq_unfreeze_queue(lo->lo_queue);
 
-   if (lo->lo_flags & LO_FLAGS_PARTSCAN && bdev)
-   loop_reread_partitions(lo, bdev);
+   reread = (lo->lo_flags & LO_FLAGS_PARTSCAN) && bdev;
+   loop_unprepare_queue(lo);
lo->lo_flags = 0;
if (!part_shift)
lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
-   loop_unprepare_queue(lo);
+   if (reread)
+   loop_reread_partitions(lo, bdev);
mutex_unlock(&loop_ctl_mutex);
/*
 * Need not hold loop_ctl_mutex to fput backing file.
-- 
1.8.3.1



[PATCH 1/4] block/loop: Don't grab "struct file" for vfs_getattr() operation.

2018-09-25 Thread Tetsuo Handa
vfs_getattr() needs "struct path" rather than "struct file".
Let's use path_get()/path_put() rather than get_file()/fput().
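
For reference, the callee's prototype (4.18-era include/linux/fs.h, quoted
from memory) is:

	int vfs_getattr(const struct path *path, struct kstat *stat,
			u32 request_mask, unsigned int query_flags);

so a pinned "struct path" is all it needs; pinning the whole "struct file"
just to reach file->f_path is unnecessary.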

Signed-off-by: Tetsuo Handa 
Reviewed-by: Jan Kara 
---
 drivers/block/loop.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index abad6d1..c2745e6 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1206,7 +1206,7 @@ static int loop_clr_fd(struct loop_device *lo)
 static int
 loop_get_status(struct loop_device *lo, struct loop_info64 *info)
 {
-   struct file *file;
+   struct path path;
struct kstat stat;
int ret;
 
@@ -1231,16 +1231,16 @@ static int loop_clr_fd(struct loop_device *lo)
}
 
/* Drop lo_ctl_mutex while we call into the filesystem. */
-   file = get_file(lo->lo_backing_file);
+   path = lo->lo_backing_file->f_path;
+   path_get(&path);
mutex_unlock(&lo->lo_ctl_mutex);
-   ret = vfs_getattr(&file->f_path, &stat, STATX_INO,
- AT_STATX_SYNC_AS_STAT);
+   ret = vfs_getattr(&path, &stat, STATX_INO, AT_STATX_SYNC_AS_STAT);
if (!ret) {
info->lo_device = huge_encode_dev(stat.dev);
info->lo_inode = stat.ino;
info->lo_rdevice = huge_encode_dev(stat.rdev);
}
-   fput(file);
+   path_put(&path);
return ret;
 }
 
-- 
1.8.3.1



[PATCH 2/4] block/loop: Use global lock for ioctl() operation.

2018-09-25 Thread Tetsuo Handa
syzbot is reporting NULL pointer dereference [1] which is caused by
race condition between ioctl(loop_fd, LOOP_CLR_FD, 0) versus
ioctl(other_loop_fd, LOOP_SET_FD, loop_fd) due to traversing other
loop devices at loop_validate_file() without holding corresponding
lo->lo_ctl_mutex locks.

Since ioctl() request on loop devices is not frequent operation, we don't
need fine grained locking. Let's use global lock in order to allow safe
traversal at loop_validate_file().

Note that syzbot is also reporting circular locking dependency between
bdev->bd_mutex and lo->lo_ctl_mutex [2] which is caused by calling
blkdev_reread_part() with lock held. This patch does not address it.

[1] 
https://syzkaller.appspot.com/bug?id=f3cfe26e785d85f9ee259f385515291d21bd80a3
[2] 
https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889
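
For background, the traversal that wants a global lock looks roughly like
this (paraphrased from the loop_validate_file() of that era; details from
memory):

	static int loop_validate_file(struct file *file, struct block_device *bdev)
	{
		struct file *f = file;

		/* Avoid loop-on-loop recursion. */
		while (is_loop_device(f)) {
			struct loop_device *l;

			if (f->f_mapping->host->i_bdev == bdev)
				return -EBADF;
			l = f->f_mapping->host->i_bdev->bd_disk->private_data;
			if (l->lo_state == Lo_rundown)
				return -EBADF;
			/* reads another device's state; only safe when a
			 * common lock serializes this against LOOP_CLR_FD */
			f = l->lo_backing_file;
		}
		return 0;
	}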

Signed-off-by: Tetsuo Handa 
Reported-by: syzbot 
Reviewed-by: Jan Kara 
---
 drivers/block/loop.c | 58 ++--
 drivers/block/loop.h |  1 -
 2 files changed, 29 insertions(+), 30 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c2745e6..920cbb1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -85,6 +85,7 @@
 
 static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_index_mutex);
+static DEFINE_MUTEX(loop_ctl_mutex);
 
 static int max_part;
 static int part_shift;
@@ -1048,7 +1049,7 @@ static int loop_clr_fd(struct loop_device *lo)
 */
if (atomic_read(&lo->lo_refcnt) > 1) {
lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return 0;
}
 
@@ -1101,12 +1102,12 @@ static int loop_clr_fd(struct loop_device *lo)
if (!part_shift)
lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
loop_unprepare_queue(lo);
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
/*
-* Need not hold lo_ctl_mutex to fput backing file.
-* Calling fput holding lo_ctl_mutex triggers a circular
+* Need not hold loop_ctl_mutex to fput backing file.
+* Calling fput holding loop_ctl_mutex triggers a circular
 * lock dependency possibility warning as fput can take
-* bd_mutex which is usually taken before lo_ctl_mutex.
+* bd_mutex which is usually taken before loop_ctl_mutex.
 */
fput(filp);
return 0;
@@ -1211,7 +1212,7 @@ static int loop_clr_fd(struct loop_device *lo)
int ret;
 
if (lo->lo_state != Lo_bound) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -ENXIO;
}
 
@@ -1230,10 +1231,10 @@ static int loop_clr_fd(struct loop_device *lo)
   lo->lo_encrypt_key_size);
}
 
-   /* Drop lo_ctl_mutex while we call into the filesystem. */
+   /* Drop loop_ctl_mutex while we call into the filesystem. */
path = lo->lo_backing_file->f_path;
path_get(&path);
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
ret = vfs_getattr(&path, &stat, STATX_INO, AT_STATX_SYNC_AS_STAT);
if (!ret) {
info->lo_device = huge_encode_dev(stat.dev);
@@ -1325,7 +1326,7 @@ static int loop_clr_fd(struct loop_device *lo)
int err;
 
if (!arg) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -EINVAL;
}
err = loop_get_status(lo, &info64);
@@ -1343,7 +1344,7 @@ static int loop_clr_fd(struct loop_device *lo)
int err;
 
if (!arg) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -EINVAL;
}
err = loop_get_status(lo, &info64);
@@ -1401,7 +1402,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
struct loop_device *lo = bdev->bd_disk->private_data;
int err;
 
-   err = mutex_lock_killable_nested(&lo->lo_ctl_mutex, 1);
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
if (err)
goto out_unlocked;
 
@@ -1413,7 +1414,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
err = loop_change_fd(lo, bdev, arg);
break;
case LOOP_CLR_FD:
-   /* loop_clr_fd would have unlocked lo_ctl_mutex on success */
+   /* loop_clr_fd would have unlocked loop_ctl_mutex on success */
err = loop_clr_fd(lo);
if (!err)
goto out_unlocked;
@@ -1426,7 +1427,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
break;
case LOOP_GET_STATUS:
err = loop_get_status_old(lo, (struct loop_info __user *) arg);
-   /* loop_get_status() unlocks lo_ctl_mutex */
+   /* loop_get_status() unlocks loop_ctl_mutex */

[PATCH] block/loop: Don't hold lock while rereading partition.

2018-09-24 Thread Tetsuo Handa
syzbot is reporting circular locking dependency between bdev->bd_mutex and
lo->lo_ctl_mutex [1] which is caused by calling blkdev_reread_part() with
lock held. Don't hold loop_ctl_mutex while calling blkdev_reread_part().
Also, bring bdgrab() at loop_set_fd() to before loop_reread_partitions()
in case loop_clr_fd() is called while blkdev_reread_part() from
loop_set_fd() is in progress.

[1] 
https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889

Signed-off-by: Tetsuo Handa 
Reported-by: syzbot 

---
 drivers/block/loop.c | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 920cbb1..877cca8 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -632,7 +632,12 @@ static void loop_reread_partitions(struct loop_device *lo,
   struct block_device *bdev)
 {
int rc;
+   char filename[LO_NAME_SIZE];
+   const int num = lo->lo_number;
+   const int count = atomic_read(&lo->lo_refcnt);
 
+   memcpy(filename, lo->lo_file_name, sizeof(filename));
+   mutex_unlock(&loop_ctl_mutex);
/*
 * bd_mutex has been held already in release path, so don't
 * acquire it if this function is called in such case.
@@ -641,13 +646,14 @@ static void loop_reread_partitions(struct loop_device *lo,
 * must be at least one and it can only become zero when the
 * current holder is released.
 */
-   if (!atomic_read(&lo->lo_refcnt))
+   if (!count)
rc = __blkdev_reread_part(bdev);
else
rc = blkdev_reread_part(bdev);
+   mutex_lock(&loop_ctl_mutex);
if (rc)
pr_warn("%s: partition scan of loop%d (%s) failed (rc=%d)\n",
-   __func__, lo->lo_number, lo->lo_file_name, rc);
+   __func__, num, filename, rc);
 }
 
 static inline int is_loop_device(struct file *file)
@@ -971,16 +977,18 @@ static int loop_set_fd(struct loop_device *lo, fmode_t 
mode,
set_blocksize(bdev, S_ISBLK(inode->i_mode) ?
  block_size(inode->i_bdev) : PAGE_SIZE);
 
+   /*
+* Grab the block_device to prevent its destruction after we
+* put /dev/loopXX inode. Later in loop_clr_fd() we bdput(bdev).
+*/
+   bdgrab(bdev);
+
lo->lo_state = Lo_bound;
if (part_shift)
lo->lo_flags |= LO_FLAGS_PARTSCAN;
if (lo->lo_flags & LO_FLAGS_PARTSCAN)
loop_reread_partitions(lo, bdev);
 
-   /* Grab the block_device to prevent its destruction after we
-* put /dev/loopXX inode. Later in loop_clr_fd() we bdput(bdev).
-*/
-   bdgrab(bdev);
return 0;
 
  out_putf:
-- 
1.8.3.1



[PATCH v2] block/loop: Use global lock for ioctl() operation.

2018-09-24 Thread Tetsuo Handa
syzbot is reporting NULL pointer dereference [1] which is caused by
race condition between ioctl(loop_fd, LOOP_CLR_FD, 0) versus
ioctl(other_loop_fd, LOOP_SET_FD, loop_fd) due to traversing other
loop devices at loop_validate_file() without holding corresponding
lo->lo_ctl_mutex locks.

Since ioctl() request on loop devices is not frequent operation, we don't
need fine grained locking. Let's use global lock in order to allow safe
traversal at loop_validate_file().

Note that syzbot is also reporting circular locking dependency between
bdev->bd_mutex and lo->lo_ctl_mutex [2] which is caused by calling
blkdev_reread_part() with lock held. This patch does not address it.

[1] 
https://syzkaller.appspot.com/bug?id=f3cfe26e785d85f9ee259f385515291d21bd80a3
[2] 
https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889

v2: Don't call mutex_init() upon loop_add() request.

Signed-off-by: Tetsuo Handa 
Reported-by: syzbot 
---
 drivers/block/loop.c | 58 ++--
 drivers/block/loop.h |  1 -
 2 files changed, 29 insertions(+), 30 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c2745e6..920cbb1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -85,6 +85,7 @@
 
 static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_index_mutex);
+static DEFINE_MUTEX(loop_ctl_mutex);
 
 static int max_part;
 static int part_shift;
@@ -1048,7 +1049,7 @@ static int loop_clr_fd(struct loop_device *lo)
 */
if (atomic_read(&lo->lo_refcnt) > 1) {
lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return 0;
}
 
@@ -1101,12 +1102,12 @@ static int loop_clr_fd(struct loop_device *lo)
if (!part_shift)
lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
loop_unprepare_queue(lo);
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
/*
-* Need not hold lo_ctl_mutex to fput backing file.
-* Calling fput holding lo_ctl_mutex triggers a circular
+* Need not hold loop_ctl_mutex to fput backing file.
+* Calling fput holding loop_ctl_mutex triggers a circular
 * lock dependency possibility warning as fput can take
-* bd_mutex which is usually taken before lo_ctl_mutex.
+* bd_mutex which is usually taken before loop_ctl_mutex.
 */
fput(filp);
return 0;
@@ -1211,7 +1212,7 @@ static int loop_clr_fd(struct loop_device *lo)
int ret;
 
if (lo->lo_state != Lo_bound) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -ENXIO;
}
 
@@ -1230,10 +1231,10 @@ static int loop_clr_fd(struct loop_device *lo)
   lo->lo_encrypt_key_size);
}
 
-   /* Drop lo_ctl_mutex while we call into the filesystem. */
+   /* Drop loop_ctl_mutex while we call into the filesystem. */
path = lo->lo_backing_file->f_path;
path_get(&path);
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
ret = vfs_getattr(&path, &stat, STATX_INO, AT_STATX_SYNC_AS_STAT);
if (!ret) {
info->lo_device = huge_encode_dev(stat.dev);
@@ -1325,7 +1326,7 @@ static int loop_clr_fd(struct loop_device *lo)
int err;
 
if (!arg) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -EINVAL;
}
err = loop_get_status(lo, &info64);
@@ -1343,7 +1344,7 @@ static int loop_clr_fd(struct loop_device *lo)
int err;
 
if (!arg) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -EINVAL;
}
err = loop_get_status(lo, &info64);
@@ -1401,7 +1402,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
struct loop_device *lo = bdev->bd_disk->private_data;
int err;
 
-   err = mutex_lock_killable_nested(&lo->lo_ctl_mutex, 1);
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
if (err)
goto out_unlocked;
 
@@ -1413,7 +1414,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
err = loop_change_fd(lo, bdev, arg);
break;
case LOOP_CLR_FD:
-   /* loop_clr_fd would have unlocked lo_ctl_mutex on success */
+   /* loop_clr_fd would have unlocked loop_ctl_mutex on success */
err = loop_clr_fd(lo);
if (!err)
goto out_unlocked;
@@ -1426,7 +1427,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
break;
case LOOP_GET_STATUS:
err = loop_get_status_old(lo, (struct loop_info __user *) arg);
-   /* loop_get_status() unlocks lo_ctl_mutex */

[PATCH] block/loop: Use global lock for ioctl() operation.

2018-09-24 Thread Tetsuo Handa
syzbot is reporting NULL pointer dereference [1] which is caused by
race condition between ioctl(loop_fd, LOOP_CLR_FD, 0) versus
ioctl(other_loop_fd, LOOP_SET_FD, loop_fd) due to traversing other
loop devices at loop_validate_file() without holding corresponding
lo->lo_ctl_mutex locks.

Since ioctl() request on loop devices is not frequent operation, we don't
need fine grained locking. Let's use global lock in order to allow safe
traversal at loop_validate_file().

Note that syzbot is also reporting circular locking dependency between
bdev->bd_mutex and lo->lo_ctl_mutex [2] which is caused by calling
blkdev_reread_part() with lock held. This patch does not address it.

[1] 
https://syzkaller.appspot.com/bug?id=f3cfe26e785d85f9ee259f385515291d21bd80a3
[2] 
https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889

Signed-off-by: Tetsuo Handa 
Reported-by: syzbot 
---
 drivers/block/loop.c | 59 ++--
 drivers/block/loop.h |  1 -
 2 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c2745e6..1477f52e 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -85,6 +85,7 @@
 
 static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_index_mutex);
+static DEFINE_MUTEX(loop_ctl_mutex);
 
 static int max_part;
 static int part_shift;
@@ -1048,7 +1049,7 @@ static int loop_clr_fd(struct loop_device *lo)
 */
if (atomic_read(&lo->lo_refcnt) > 1) {
lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return 0;
}
 
@@ -1101,12 +1102,12 @@ static int loop_clr_fd(struct loop_device *lo)
if (!part_shift)
lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
loop_unprepare_queue(lo);
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
/*
-* Need not hold lo_ctl_mutex to fput backing file.
-* Calling fput holding lo_ctl_mutex triggers a circular
+* Need not hold loop_ctl_mutex to fput backing file.
+* Calling fput holding loop_ctl_mutex triggers a circular
 * lock dependency possibility warning as fput can take
-* bd_mutex which is usually taken before lo_ctl_mutex.
+* bd_mutex which is usually taken before loop_ctl_mutex.
 */
fput(filp);
return 0;
@@ -1211,7 +1212,7 @@ static int loop_clr_fd(struct loop_device *lo)
int ret;
 
if (lo->lo_state != Lo_bound) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -ENXIO;
}
 
@@ -1230,10 +1231,10 @@ static int loop_clr_fd(struct loop_device *lo)
   lo->lo_encrypt_key_size);
}
 
-   /* Drop lo_ctl_mutex while we call into the filesystem. */
+   /* Drop loop_ctl_mutex while we call into the filesystem. */
path = lo->lo_backing_file->f_path;
path_get(&path);
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
ret = vfs_getattr(&path, &stat, STATX_INO, AT_STATX_SYNC_AS_STAT);
if (!ret) {
info->lo_device = huge_encode_dev(stat.dev);
@@ -1325,7 +1326,7 @@ static int loop_clr_fd(struct loop_device *lo)
int err;
 
if (!arg) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -EINVAL;
}
err = loop_get_status(lo, &info64);
@@ -1343,7 +1344,7 @@ static int loop_clr_fd(struct loop_device *lo)
int err;
 
if (!arg) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -EINVAL;
}
err = loop_get_status(lo, &info64);
@@ -1401,7 +1402,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
struct loop_device *lo = bdev->bd_disk->private_data;
int err;
 
-   err = mutex_lock_killable_nested(&lo->lo_ctl_mutex, 1);
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
if (err)
goto out_unlocked;
 
@@ -1413,7 +1414,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
err = loop_change_fd(lo, bdev, arg);
break;
case LOOP_CLR_FD:
-   /* loop_clr_fd would have unlocked lo_ctl_mutex on success */
+   /* loop_clr_fd would have unlocked loop_ctl_mutex on success */
err = loop_clr_fd(lo);
if (!err)
goto out_unlocked;
@@ -1426,7 +1427,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
break;
case LOOP_GET_STATUS:
err = loop_get_status_old(lo, (struct loop_info __user *) arg);
-   /* loop_get_status() unlocks lo_ctl_mutex */
+   /* loop_get_status() unlocks loop_ctl_mutex */

[PATCH] block/loop: Don't grab "struct file" for vfs_getattr() operation.

2018-09-24 Thread Tetsuo Handa
vfs_getattr() needs "struct path" rather than "struct file".
Let's use path_get()/path_put() rather than get_file()/fput().

Signed-off-by: Tetsuo Handa 
---
 drivers/block/loop.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index abad6d1..c2745e6 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1206,7 +1206,7 @@ static int loop_clr_fd(struct loop_device *lo)
 static int
 loop_get_status(struct loop_device *lo, struct loop_info64 *info)
 {
-   struct file *file;
+   struct path path;
struct kstat stat;
int ret;
 
@@ -1231,16 +1231,16 @@ static int loop_clr_fd(struct loop_device *lo)
}
 
/* Drop lo_ctl_mutex while we call into the filesystem. */
-   file = get_file(lo->lo_backing_file);
+   path = lo->lo_backing_file->f_path;
+   path_get(&path);
mutex_unlock(&lo->lo_ctl_mutex);
-   ret = vfs_getattr(&file->f_path, &stat, STATX_INO,
- AT_STATX_SYNC_AS_STAT);
+   ret = vfs_getattr(&path, &stat, STATX_INO, AT_STATX_SYNC_AS_STAT);
if (!ret) {
info->lo_device = huge_encode_dev(stat.dev);
info->lo_inode = stat.ino;
info->lo_rdevice = huge_encode_dev(stat.rdev);
}
-   fput(file);
+   path_put(&path);
return ret;
 }
 
-- 
1.8.3.1



Re: [PATCH v4] block/loop: Serialize ioctl operations.

2018-09-24 Thread Tetsuo Handa
On 2018/09/25 3:47, Jan Kara wrote:
>> +/*
>> + * unlock_loop - Unlock loop_mutex as needed.
>> + *
>> + * Explicitly call this function before calling fput() or 
>> blkdev_reread_part()
>> + * in order to avoid circular lock dependency. After this function is 
>> called,
>> + * current thread is no longer allowed to access "struct loop_device" 
>> memory,
>> + * for another thread would access that memory as soon as loop_mutex is 
>> held.
>> + */
>> +static void unlock_loop(void)
>> +{
>> +if (loop_mutex_owner == current) {
> 
> Urgh, why this check? Conditional locking / unlocking is evil so it has to
> have *very* good reasons and there is not any explanation here. So far I
> don't see a reason why this is needed at all.

Yeah, this is why Jens hates this patch. But any alternative?



>> @@ -630,7 +669,12 @@ static void loop_reread_partitions(struct loop_device 
>> *lo,
>> +unlock_loop();
> 
> Unlocking in loop_reread_partitions() makes the locking rules ugly. And
> unnecessarily AFAICT. Can't we just use lo_refcnt to protect us against
> loop_clr_fd() and freeing of 'lo' structure itself?

Really? I think that just elevating lo->lo_refcnt will cause another lockdep
warning, because __blkdev_reread_part() requires bdev->bd_mutex to be held.
Don't we need to drop the lock in order to solve the original lockdep warning
at [2]?

int __blkdev_reread_part(struct block_device *bdev)
{
struct gendisk *disk = bdev->bd_disk;

if (!disk_part_scan_enabled(disk) || bdev != bdev->bd_contains)
return -EINVAL;
if (!capable(CAP_SYS_ADMIN))
return -EACCES;

lockdep_assert_held(&bdev->bd_mutex);

return rescan_partitions(disk, bdev);
}

int blkdev_reread_part(struct block_device *bdev)
{
int res;

mutex_lock(&bdev->bd_mutex);
res = __blkdev_reread_part(bdev);
mutex_unlock(&bdev->bd_mutex);

return res;
}
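
In other words, the lock chain at issue is (an illustration of the argument
above, not driver code):

	/*
	 *   lo_ctl_mutex -> bd_mutex   (ioctl -> blkdev_reread_part())
	 *   bd_mutex -> lo_ctl_mutex   (blkdev_close() -> lo_release())
	 *
	 * Elevating lo->lo_refcnt keeps "lo" alive, but it does not change
	 * this ordering, so lockdep would still complain; the mutex really
	 * has to be dropped before bd_mutex is taken.
	 */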



Re: [PATCH v3] block/loop: Serialize ioctl operations.

2018-07-11 Thread Tetsuo Handa
Since syzbot restarted testing linux-next.git, it is a good chance
to test this patch for unexpected regressions (until you come up with
a good alternative).

On 2018/06/26 23:34, Tetsuo Handa wrote:
> Did you get any idea?
> 
> On 2018/06/05 3:13, Jens Axboe wrote:
>> On 6/4/18 5:19 AM, Tetsuo Handa wrote:
>>> This problem was already ignored for 8 months. Unless we boost priority,
>>> this problem will be ignored for years. Jens, can we test this patch?
>>
>> Sorry, it's just that I _really_ hate this patch. We're making up
>> a weird locking primitive that tracks the process, with a weird
>> unlock helper that only unlocks if it's the process that is
>> holding the mutex.
>>
>> I'll try and think about this a bit, it would be nice if we had
>> a better alternative than the above.
>>


Re: [PATCH] bdi: Fix another oops in wb_workfn()

2018-06-18 Thread Tetsuo Handa
On 2018/06/18 22:46, Jan Kara wrote:
> syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to

The line

[1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206

is missing.

> wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
> WB_shutting_down after wb->bdi->dev became NULL. This indicates that
> unregister_bdi() failed to call wb_shutdown() on one of wb objects.
> 
> The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
> drops bdi's reference to wb structures before going through the list of
> wbs again and calling wb_shutdown() on each of them. This way the loop
> iterating through all wbs can easily miss a wb if that wb has already
> passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
> from cgwb_release_workfn() and as a result fully shutdown bdi although
> wb_workfn() for this wb structure is still running. In fact there are
> also other ways cgwb_bdi_unregister() can race with
> cgwb_release_workfn() leading e.g. to use-after-free issues:
> 
> CPU1                            CPU2
> cgwb_bdi_unregister()
>   cgwb_kill(*slot);
> 
> cgwb_release()
>   queue_work(cgwb_release_wq, &wb->release_work);
> cgwb_release_workfn()
>   wb = list_first_entry(&bdi->wb_list, ...)
>   spin_unlock_irq(&cgwb_lock);
>   wb_shutdown(wb);
>   ...
>   kfree_rcu(wb, rcu);
>   wb_shutdown(wb); -> oops use-after-free
> 
> We solve these issues by synchronizing writeback structure shutdown from
> cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
> way we also no longer need synchronization using WB_shutting_down as the
> mutex provides it for CONFIG_CGROUP_WRITEBACK case and without
> CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
> bdi_unregister().

Wow, this patch removes WB_shutting_down.

One worry for me is how long this mutex_lock() might sleep: if there are
a lot of wb objects to shut down, sequentially doing wb_shutdown() could
block someone else's mutex_lock() for longer than khungtaskd's timeout
period (typically 120 seconds).
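
Conceptually, the serialization the patch description talks about looks like
this (a sketch only; the mutex name and exact placement here are assumptions,
not the patch body):

	static DEFINE_MUTEX(cgwb_shutdown_mutex);	/* hypothetical name */

	/* cgwb_release_workfn(), per-wb release path */
	mutex_lock(&cgwb_shutdown_mutex);
	wb_shutdown(wb);		/* cannot race with unregister anymore */
	mutex_unlock(&cgwb_shutdown_mutex);

	/* cgwb_bdi_unregister(), whole-bdi teardown path */
	mutex_lock(&cgwb_shutdown_mutex);
	list_for_each_entry(wb, &bdi->wb_list, bdi_node)
		wb_shutdown(wb);
	mutex_unlock(&cgwb_shutdown_mutex);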

> 
> Reported-by: syzbot 
> Signed-off-by: Jan Kara 
> ---
>  include/linux/backing-dev-defs.h |  2 +-
>  mm/backing-dev.c | 20 +++-
>  2 files changed, 8 insertions(+), 14 deletions(-)


Re: [PATCH -v2] loop: add recursion validation to LOOP_CHANGE_FD

2018-06-05 Thread Tetsuo Handa
I noticed that this patch was forgotten in ext4.git#loop-fix and therefore
is not available in linux-next.git. Please be sure to include it for 4.18.

On 2018/05/08 0:37, Theodore Ts'o wrote:
> Refactor the validation code used in LOOP_SET_FD so it is also used in
> LOOP_CHANGE_FD.  Otherwise it is possible to construct a set of loop
> devices that all refer to each other.  This can lead to an infinite
> loop, starting with "while (is_loop_device(f)) .." in loop_set_fd().
> 
> Fix this by refactoring out the validation code and using it for
> LOOP_CHANGE_FD as well as LOOP_SET_FD.
> 
> Reported-by: 
> syzbot+4349872271ece473a7c91190b68b4bac7c5db...@syzkaller.appspotmail.com
> Reported-by: syzbot+40bd32c4d9a3cc12a...@syzkaller.appspotmail.com
> Reported-by: syzbot+769c54e66f994b041...@syzkaller.appspotmail.com
> Reported-by: syzbot+0a89a9ce473936c57...@syzkaller.appspotmail.com
> Signed-off-by: Theodore Ts'o 
> ---
>  drivers/block/loop.c | 68 
> +---
>  1 file changed, 38 insertions(+), 30 deletions(-)
> 


Re: [PATCH] bdi: Fix another oops in wb_workfn()

2018-05-26 Thread Tetsuo Handa
Tejun Heo wrote:
> On Sun, May 27, 2018 at 11:21:25AM +0900, Tetsuo Handa wrote:
> > syzbot is still hitting NULL pointer dereference at wb_workfn() [1].
> > This might be because we overlooked that delayed_work_timer_fn() does not
> > check WB_registered before calling __queue_work() while mod_delayed_work()
> > does not wait for already started delayed_work_timer_fn() because it uses
> > del_timer() rather than del_timer_sync().
> 
> It shouldn't be that as dwork timer is an irq safe timer.  Even if
> that's the case, the right thing to do would be fixing workqueue
> rather than reaching into workqueue internals from backing-dev code.
> 

Do you think there is a possibility that __queue_work() is executed almost
concurrently from two CPUs: one from the mod_delayed_work(bdi_wq, &wb->dwork, 0)
call in the wb_shutdown() path (which runs without spin_lock_bh(&wb->work_lock)),
and the other from the delayed_work_timer_fn() path (which runs without checking
the WB_registered bit under spin_lock_bh(&wb->work_lock))?


[PATCH] bdi: Fix another oops in wb_workfn()

2018-05-26 Thread Tetsuo Handa
From 8a8222698163d1fe180258566e9a3ff43f54fcd9 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Date: Sun, 27 May 2018 11:08:20 +0900
Subject: [PATCH] bdi: Fix another oops in wb_workfn()

syzbot is still hitting NULL pointer dereference at wb_workfn() [1].
This might be because we overlooked that delayed_work_timer_fn() does not
check WB_registered before calling __queue_work() while mod_delayed_work()
does not wait for already started delayed_work_timer_fn() because it uses
del_timer() rather than del_timer_sync().

Make wb_shutdown() be careful about wb_wakeup_delayed() path.

[1] 
https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Reported-by: syzbot <syzbot+4a7438e774b21ddd8...@syzkaller.appspotmail.com>
Cc: Tejun Heo <t...@kernel.org>
Cc: Dave Chinner <dchin...@redhat.com>
Cc: Jan Kara <j...@suse.cz>
Cc: Jens Axboe <ax...@kernel.dk>
---
 mm/backing-dev.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7441bd9..31e1d7e 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -372,11 +372,24 @@ static void wb_shutdown(struct bdi_writeback *wb)
 
cgwb_remove_from_bdi_list(wb);
/*
+* mod_delayed_work() is not appropriate here, for
+* delayed_work_timer_fn() from wb_wakeup_delayed() does not check
+* WB_registered before calling __queue_work().
+*/
+   del_timer_sync(&wb->dwork.timer);
+   /*
+* Clear WORK_STRUCT_PENDING_BIT in order to make sure that next
+* queue_delayed_work() actually enqueues this work to the tail, for
+* wb_wakeup_delayed() already set WORK_STRUCT_PENDING_BIT before
+* scheduling delayed_work_timer_fn().
+*/
+   cancel_delayed_work_sync(&wb->dwork);
+   /*
 * Drain work list and shutdown the delayed_work.  !WB_registered
 * tells wb_workfn() that @wb is dying and its work_list needs to
 * be drained no matter what.
 */
-   mod_delayed_work(bdi_wq, &wb->dwork, 0);
+   queue_delayed_work(bdi_wq, &wb->dwork, 0);
flush_delayed_work(&wb->dwork);
WARN_ON(!list_empty(&wb->work_list));
/*
-- 
1.8.3.1




Re: general protection fault in wb_workfn (2)

2018-05-26 Thread Tetsuo Handa
Forwarding 
http://lkml.kernel.org/r/201805251915.fgh64517.hvfjoolffmq...@i-love.sakura.ne.jp
 .

Jan Kara wrote:
> > void delayed_work_timer_fn(struct timer_list *t)
> > {
> > struct delayed_work *dwork = from_timer(dwork, t, timer);
> > 
> > /* should have been called from irqsafe timer with irq already off */
> > __queue_work(dwork->cpu, dwork->wq, &dwork->work);
> > }
> > 
> > Then, wb_workfn() is after all scheduled even if we check for
> > WB_registered bit, isn't it?
> 
> It can be queued after WB_registered bit is cleared but it cannot be queued
> after mod_delayed_work(bdi_wq, &wb->dwork, 0) has finished. That function
> deletes the pending timer (the timer cannot be armed again because
> WB_registered is cleared) and queues what should be the last round of
> wb_workfn().

mod_delayed_work() deletes the pending timer but does not wait for an already
invoked timer handler to complete, because it uses del_timer() rather than
del_timer_sync(). Then, what happens if __queue_work() is executed almost
concurrently from two CPUs: one from the mod_delayed_work(bdi_wq, &wb->dwork, 0)
call in the wb_shutdown() path (which runs without spin_lock_bh(&wb->work_lock)),
and the other from the delayed_work_timer_fn() path (which runs without checking
the WB_registered bit under spin_lock_bh(&wb->work_lock))?

wb_wakeup_delayed() {
  spin_lock_bh(&wb->work_lock);
  if (test_bit(WB_registered, &wb->state)) // succeeds
    queue_delayed_work(bdi_wq, &wb->dwork, timeout) {
      queue_delayed_work_on(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork, timeout) {
         if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&wb->dwork.work))) { // succeeds
           __queue_delayed_work(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork, timeout) {
             add_timer(timer); // schedules for delayed_work_timer_fn()
           }
         }
      }
    }
  spin_unlock_bh(&wb->work_lock);
}

delayed_work_timer_fn() {
  // del_timer() already returns false at this point because this timer
  // is already inside the handler. But something took long enough here to
  // wait for __queue_work() from the wb_shutdown() path to finish?
  __queue_work(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork.work) {
    insert_work(pwq, work, worklist, work_flags);
  }
}

wb_shutdown() {
  mod_delayed_work(bdi_wq, &wb->dwork, 0) {
    mod_delayed_work_on(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork, 0) {
      ret = try_to_grab_pending(&wb->dwork.work, true, &flags) {
        if (likely(del_timer(&wb->dwork.timer))) // fails because already in delayed_work_timer_fn()
          return 1;
        if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&wb->dwork.work))) // fails because already set by queue_delayed_work()
          return 0;
        // Returns 1 or -ENOENT after doing something?
      }
      if (ret >= 0)
        __queue_delayed_work(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork, 0) {
          __queue_work(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork.work) {
            insert_work(pwq, work, worklist, work_flags);
          }
        }
    }
  }
}



[PATCH v3] block/loop: Serialize ioctl operations.

2018-05-25 Thread Tetsuo Handa
syzbot is reporting NULL pointer dereference [1] which is caused by
race condition between ioctl(loop_fd, LOOP_CLR_FD, 0) versus
ioctl(other_loop_fd, LOOP_SET_FD, loop_fd) due to traversing other
loop devices without holding corresponding locks.

syzbot is also reporting circular locking dependency between bdev->bd_mutex
and lo->lo_ctl_mutex [2] which is caused by calling blkdev_reread_part()
with lock held.

Since ioctl() request on loop devices is not frequent operation, we don't
need fine grained locking. Let's use global lock and simplify the locking
order.

Strategy is that the global lock is held upon entry of ioctl() request,
and release it before either starting operations which might deadlock or
leaving ioctl() request. After the global lock is released, current thread
no longer uses "struct loop_device" memory because it might be modified
by next ioctl() request which was waiting for current ioctl() request.

In order to enforce this strategy, this patch inverted the order of
loop_reread_partitions() and loop_unprepare_queue() in loop_clr_fd().
I don't know whether that breaks something, but I don't have test cases.

Since this patch serializes everything using a global lock, race bugs should
no longer exist. Thus, it will be easy to test whether this patch broke
something.

[1] 
https://syzkaller.appspot.com/bug?id=f3cfe26e785d85f9ee259f385515291d21bd80a3
[2] 
https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Reported-by: syzbot <syzbot+bf89c128e05dd6c62...@syzkaller.appspotmail.com>
Reported-by: syzbot 
<syzbot+4684a000d5abdade83fac55b1e7d1f935ef19...@syzkaller.appspotmail.com>
Cc: Jens Axboe <ax...@kernel.dk>
---
 drivers/block/loop.c | 231 ---
 drivers/block/loop.h |   1 -
 2 files changed, 128 insertions(+), 104 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 55cf554..feb9fa7 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -81,11 +81,50 @@
 #include 
 
 static DEFINE_IDR(loop_index_idr);
-static DEFINE_MUTEX(loop_index_mutex);
+static DEFINE_MUTEX(loop_mutex);
+static void *loop_mutex_owner; /* == __mutex_owner(&loop_mutex) */
 
 static int max_part;
 static int part_shift;
 
+/*
+ * lock_loop - Lock loop_mutex.
+ */
+static void lock_loop(void)
+{
+   mutex_lock(&loop_mutex);
+   loop_mutex_owner = current;
+}
+
+/*
+ * lock_loop_killable - Lock loop_mutex unless killed.
+ */
+static int lock_loop_killable(void)
+{
+   int err = mutex_lock_killable(&loop_mutex);
+
+   if (err)
+   return err;
+   loop_mutex_owner = current;
+   return 0;
+}
+
+/*
+ * unlock_loop - Unlock loop_mutex as needed.
+ *
+ * Explicitly call this function before calling fput() or blkdev_reread_part()
+ * in order to avoid circular lock dependency. After this function is called,
+ * current thread is no longer allowed to access "struct loop_device" memory,
+ * for another thread would access that memory as soon as loop_mutex is held.
+ */
+static void unlock_loop(void)
+{
+   if (loop_mutex_owner == current) {
+   loop_mutex_owner = NULL;
+   mutex_unlock(&loop_mutex);
+   }
+}
+
 static int transfer_xor(struct loop_device *lo, int cmd,
struct page *raw_page, unsigned raw_off,
struct page *loop_page, unsigned loop_off,
@@ -626,7 +665,12 @@ static void loop_reread_partitions(struct loop_device *lo,
   struct block_device *bdev)
 {
int rc;
+   /* Save variables which might change after unlock_loop() is called. */
+   char filename[sizeof(lo->lo_file_name)];
+   const int num = lo->lo_number;
 
+   memmove(filename, lo->lo_file_name, sizeof(filename));
+   unlock_loop();
/*
 * bd_mutex has been held already in release path, so don't
 * acquire it if this function is called in such case.
@@ -641,7 +685,7 @@ static void loop_reread_partitions(struct loop_device *lo,
rc = blkdev_reread_part(bdev);
if (rc)
pr_warn("%s: partition scan of loop%d (%s) failed (rc=%d)\n",
-   __func__, lo->lo_number, lo->lo_file_name, rc);
+   __func__, num, filename, rc);
 }
 
 /*
@@ -659,6 +703,7 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
struct inode*inode;
int error;
 
+   lockdep_assert_held(&loop_mutex);
error = -ENXIO;
if (lo->lo_state != Lo_bound)
goto out;
@@ -695,12 +740,14 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
loop_update_dio(lo);
blk_mq_unfreeze_queue(lo->lo_queue);
 
-   fput(old_file);
if (lo->lo_flags & LO_FLAGS_PARTSCAN)
-   loop_reread_partitions(lo, bdev);

Re: [PATCH] bdi: Fix oops in wb_workfn()

2018-05-25 Thread Tetsuo Handa
Jan Kara wrote:
> > void delayed_work_timer_fn(struct timer_list *t)
> > {
> > struct delayed_work *dwork = from_timer(dwork, t, timer);
> > 
> > /* should have been called from irqsafe timer with irq already off */
> > __queue_work(dwork->cpu, dwork->wq, >work);
> > }
> > 
> > Then, wb_workfn() is after all scheduled even if we check for
> > WB_registered bit, isn't it?
> 
> It can be queued after WB_registered bit is cleared but it cannot be queued
> after mod_delayed_work(bdi_wq, >dwork, 0) has finished. That function
> deletes the pending timer (the timer cannot be armed again because
> WB_registered is cleared) and queues what should be the last round of
> wb_workfn().

mod_delayed_work() deletes the pending timer but does not wait for an already
invoked timer handler to complete, because it uses del_timer() rather than
del_timer_sync(). Then, what happens if __queue_work() is executed almost
concurrently from two CPUs: one from the mod_delayed_work(bdi_wq, &wb->dwork, 0)
call in the wb_shutdown() path (which runs without spin_lock_bh(&wb->work_lock)),
and the other from the delayed_work_timer_fn() path (which runs without checking
the WB_registered bit under spin_lock_bh(&wb->work_lock))?

wb_wakeup_delayed() {
  spin_lock_bh(&wb->work_lock);
  if (test_bit(WB_registered, &wb->state)) // succeeds
    queue_delayed_work(bdi_wq, &wb->dwork, timeout) {
      queue_delayed_work_on(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork, timeout) {
         if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&wb->dwork.work))) { // succeeds
           __queue_delayed_work(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork, timeout) {
             add_timer(timer); // schedules for delayed_work_timer_fn()
           }
         }
      }
    }
  spin_unlock_bh(&wb->work_lock);
}

delayed_work_timer_fn() {
  // del_timer() already returns false at this point because this timer
  // is already inside the handler. But something took long enough here to
  // wait for __queue_work() from the wb_shutdown() path to finish?
  __queue_work(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork.work) {
    insert_work(pwq, work, worklist, work_flags);
  }
}

wb_shutdown() {
  mod_delayed_work(bdi_wq, &wb->dwork, 0) {
    mod_delayed_work_on(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork, 0) {
      ret = try_to_grab_pending(&wb->dwork.work, true, &flags) {
        if (likely(del_timer(&wb->dwork.timer))) // fails because already in delayed_work_timer_fn()
          return 1;
        if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&wb->dwork.work))) // fails because already set by queue_delayed_work()
          return 0;
        // Returns 1 or -ENOENT after doing something?
      }
      if (ret >= 0)
        __queue_delayed_work(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork, 0) {
          __queue_work(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork.work) {
            insert_work(pwq, work, worklist, work_flags);
          }
        }
    }
  }
}


Re: INFO: task hung in blk_queue_enter

2018-05-22 Thread Tetsuo Handa
I checked counter values using debug printk() patch shown below, and
found that q->q_usage_counter.count == 1 when this deadlock occurs.

Since the sum of percpu_count did not change after percpu_ref_kill(), this is
not a race condition while folding the percpu counter values into the atomic
counter value. That is, for some reason, whoever is responsible for calling
percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
unable to call percpu_ref_put().
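
For context, the pairing under discussion is roughly this (simplified sketch
of block/blk-core.c of that time, from memory):

	int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
	{
		while (true) {
			if (percpu_ref_tryget_live(&q->q_usage_counter))
				return 0;	/* took a queue reference */
			/* ... wait on q->mq_freeze_wq, or fail if dying ... */
		}
	}

	void blk_queue_exit(struct request_queue *q)
	{
		percpu_ref_put(&q->q_usage_counter);	/* the put that never comes */
	}

A counter stuck at 1 therefore means some code path entered the queue and
never called blk_queue_exit().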

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b4..6933020 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -908,6 +908,12 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
+static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
+{
+   return (unsigned long __percpu *)
+   (ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD);
+}
+
 /**
  * blk_queue_enter() - try to increase q->q_usage_counter
  * @q: request queue pointer
@@ -950,10 +956,22 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 */
smp_rmb();
 
-   wait_event(q->mq_freeze_wq,
-  (atomic_read(&q->mq_freeze_depth) == 0 &&
-   (preempt || !blk_queue_preempt_only(q))) ||
-  blk_queue_dying(q));
+   while (wait_event_timeout(q->mq_freeze_wq,
+ (atomic_read(&q->mq_freeze_depth) == 0 &&
+  (preempt || !blk_queue_preempt_only(q))) ||
+ blk_queue_dying(q), 3 * HZ) == 0) {
+   struct percpu_ref *ref = &q->q_usage_counter;
+   unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
+   unsigned long count = 0;
+   int cpu;
+
+   for_each_possible_cpu(cpu)
+   count += *per_cpu_ptr(percpu_count, cpu);
+
+   printk("%s(%d): %px %ld %ld\n", current->comm, current->pid,
+  ref, atomic_long_read(&ref->count), count);
+   }
+
if (blk_queue_dying(q))
return -ENODEV;
}
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 9f96fa7..72773ce 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -133,8 +133,8 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
for_each_possible_cpu(cpu)
count += *per_cpu_ptr(percpu_count, cpu);
 
-   pr_debug("global %ld percpu %ld",
-atomic_long_read(&ref->count), (long)count);
+   printk("%px global %ld percpu %ld\n", ref,
+  atomic_long_read(&ref->count), (long)count);
 
/*
 * It's crucial that we sum the percpu counters _before_ adding the sum
@@ -150,6 +150,8 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
 */
atomic_long_add((long)count - PERCPU_COUNT_BIAS, &ref->count);
 
+   printk("%px global %ld\n", ref, atomic_long_read(&ref->count));
+
WARN_ONCE(atomic_long_read(&ref->count) <= 0,
  "percpu ref (%pf) <= 0 (%ld) after switching to atomic",
  ref->release, atomic_long_read(&ref->count));

If I change blk_queue_enter() not to wait at wait_event() if
q->mq_freeze_depth != 0, this deadlock problem does not occur.

Also, I found that if blk_freeze_queue_start() tries to wait for
counters to become 1 before calling percpu_ref_kill() (like shown
below), this deadlock problem does not occur.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9ce9cac..4bff534 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -134,12 +134,36 @@ void blk_mq_in_flight_rw(struct request_queue *q, struct hd_struct *part,
blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight_rw, &mi);
 }
 
+#define PERCPU_COUNT_BIAS   (1LU << (BITS_PER_LONG - 1))
+
+static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
+{
+   return (unsigned long __percpu *)
+   (ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD);
+}
+
 void blk_freeze_queue_start(struct request_queue *q)
 {
int freeze_depth;
 
freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
if (freeze_depth == 1) {
+   int i;
+   for (i = 0; i < 10; i++) {
+   struct percpu_ref *ref = &q->q_usage_counter;
+   unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
+   unsigned long count = 0;
+   int cpu;
+
+   for_each_possible_cpu(cpu)
+   count += *per_cpu_ptr(percpu_count, cpu);
+
+   if (atomic_long_read(&ref->count) + count - PERCPU_COUNT_BIAS == 1)
+   break;
+   printk("%s(%d):! %px %ld %ld\n", current->comm, 

Re: INFO: task hung in blk_queue_enter

2018-05-21 Thread Tetsuo Handa
Bart Van Assche wrote:
> On Wed, 2018-05-16 at 17:16 +0200, Dmitry Vyukov wrote:
> > On Wed, May 16, 2018 at 4:56 PM, Bart Van Assche <bart.vanass...@wdc.com> 
> > wrote:
> > > On Wed, 2018-05-16 at 22:05 +0900, Tetsuo Handa wrote:
> > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > index 85909b4..59e2496 100644
> > > > --- a/block/blk-core.c
> > > > +++ b/block/blk-core.c
> > > > @@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, 
> > > > blk_mq_req_flags_t flags)
> > > >   smp_rmb();
> > > > 
> > > >   wait_event(q->mq_freeze_wq,
> > > > -(atomic_read(&q->mq_freeze_depth) == 0 &&
> > > > - (preempt || !blk_queue_preempt_only(q))) ||
> > > > +atomic_read(&q->mq_freeze_depth) ||
> > > > +(preempt || !blk_queue_preempt_only(q)) ||
> > > >  blk_queue_dying(q));
> > > > - if (blk_queue_dying(q))
> > > > + if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
> > > >   return -ENODEV;
> > > >   }
> > > >  }
> > > 
> > > That change looks wrong to me.
> > 
> > Hi Bart,
> > 
> > Why does it look wrong to you?
> 
> Because that change conflicts with the purpose of queue freezing and also 
> because
> that change would inject I/O errors in code paths that shouldn't inject I/O 
> errors.

But waiting there until atomic_read(&q->mq_freeze_depth) becomes 0 is causing
a deadlock. A wait_event() that never returns is a bug. I think we should not
wait for q->mq_freeze_depth.


Re: [PATCH] bdi: Fix oops in wb_workfn()

2018-05-19 Thread Tetsuo Handa
Tetsuo Handa wrote:
> Jan Kara wrote:
> > Make wb_workfn() use wakeup_wb() for requeueing the work which takes all
> > the necessary precautions against racing with bdi unregistration.
> 
> Yes, this patch will solve the NULL pointer dereference bug. But is it OK to
> leave a list_empty(&wb->work_list) == false situation? Who takes over the
> role of making list_empty(&wb->work_list) == true?

syzbot is again reporting the same NULL pointer dereference.

  general protection fault in wb_workfn (2)
  https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206

Didn't we overlook something obvious in commit b8b784958eccbf8f ("bdi: Fix oops 
in wb_workfn()") ?

At first, I thought that commit would solve the NULL pointer dereference bug.
But what does

if (!list_empty(&wb->work_list))
-   mod_delayed_work(bdi_wq, &wb->dwork, 0);
+   wb_wakeup(wb);
else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
wb_wakeup_delayed(wb);

mean?

static void wb_wakeup(struct bdi_writeback *wb)
{
spin_lock_bh(&wb->work_lock);
if (test_bit(WB_registered, &wb->state))
mod_delayed_work(bdi_wq, &wb->dwork, 0);
spin_unlock_bh(&wb->work_lock);
}

It means nothing but "we don't call mod_delayed_work() if WB_registered bit was
already cleared".

But what if the WB_registered bit is not yet cleared when we hit the
wb_wakeup_delayed() path?

void wb_wakeup_delayed(struct bdi_writeback *wb)
{
unsigned long timeout;

timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
spin_lock_bh(&wb->work_lock);
if (test_bit(WB_registered, &wb->state))
queue_delayed_work(bdi_wq, &wb->dwork, timeout);
spin_unlock_bh(&wb->work_lock);
}

add_timer() is called because (presumably) timeout > 0. And after that timeout
expires, __queue_work() is called even if WB_registered bit is already cleared
before that timeout expires, isn't it?

void delayed_work_timer_fn(struct timer_list *t)
{
struct delayed_work *dwork = from_timer(dwork, t, timer);

/* should have been called from irqsafe timer with irq already off */
__queue_work(dwork->cpu, dwork->wq, &dwork->work);
}

Then, wb_workfn() is after all scheduled even if we check for WB_registered bit,
isn't it?
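
The distinction that matters here, as a sketch (semantics paraphrased from the
kernel timer API, stated from memory):

	/*
	 * del_timer(t)      - deactivate a pending timer; returns whether it
	 *                     was pending. Does NOT wait for a handler that
	 *                     is already running on another CPU.
	 * del_timer_sync(t) - additionally waits for a concurrently running
	 *                     handler to finish before returning.
	 *
	 * mod_delayed_work() grabs the pending work with del_timer(), so a
	 * delayed_work_timer_fn() that has already started can still call
	 * __queue_work() afterwards.
	 */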

Then, don't we need to check that

mod_delayed_work(bdi_wq, &wb->dwork, 0);
flush_delayed_work(&wb->dwork);

is really waiting for completion? At least, shouldn't we try the debug output
below (not only for debugging this report but also generally desirable)?

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7441bd9..ccec8cd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -376,8 +376,10 @@ static void wb_shutdown(struct bdi_writeback *wb)
 * tells wb_workfn() that @wb is dying and its work_list needs to
 * be drained no matter what.
 */
-   mod_delayed_work(bdi_wq, &wb->dwork, 0);
-   flush_delayed_work(&wb->dwork);
+   if (!mod_delayed_work(bdi_wq, &wb->dwork, 0))
+   printk(KERN_WARNING "wb_shutdown: mod_delayed_work() failed\n");
+   if (!flush_delayed_work(&wb->dwork))
+   printk(KERN_WARNING "wb_shutdown: flush_delayed_work() failed\n");
WARN_ON(!list_empty(&wb->work_list));
/*
 * Make sure bit gets cleared after shutdown is finished. Matches with


Re: INFO: task hung in blk_queue_enter

2018-05-16 Thread Tetsuo Handa
Tetsuo Handa wrote:
> I couldn't check whether freeze_depth in blk_freeze_queue_start() was 1,
> but presumably q->mq_freeze_depth > 0 because syz-executor7(PID=5010) is
> stuck at wait_event() in blk_queue_enter().
> 
> Since flags == 0, preempt == false. Since stuck at wait_event(), success ==
> false. Thus, atomic_read(&q->mq_freeze_depth) > 0 if blk_queue_dying(q) ==
> false. And I guess blk_queue_dying(q) == false because we are just trying
> to freeze/unfreeze.
> 

I was able to reproduce the hang using a modified reproducer, and got values
using below debug printk() patch.

   --- a/block/blk-core.c
   +++ b/block/blk-core.c
   @@ -950,10 +950,12 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
    		 */
    		smp_rmb();
    
   -		wait_event(q->mq_freeze_wq,
   -			   (atomic_read(&q->mq_freeze_depth) == 0 &&
   -			    (preempt || !blk_queue_preempt_only(q))) ||
   -			   blk_queue_dying(q));
   +		while (wait_event_timeout(q->mq_freeze_wq,
   +					  (atomic_read(&q->mq_freeze_depth) == 0 &&
   +					   (preempt || !blk_queue_preempt_only(q))) ||
   +					  blk_queue_dying(q), 10 * HZ) == 0)
   +			printk("%s(%u): q->mq_freeze_depth=%d preempt=%d blk_queue_preempt_only(q)=%d blk_queue_dying(q)=%d\n",
   +			       current->comm, current->pid, atomic_read(&q->mq_freeze_depth), preempt, blk_queue_preempt_only(q), blk_queue_dying(q));
    		if (blk_queue_dying(q))
    			return -ENODEV;
    	}

[   75.869126] print_req_error: I/O error, dev loop0, sector 0
[   85.983146] a.out(8838): q->mq_freeze_depth=1 preempt=0 blk_queue_preempt_only(q)=0 blk_queue_dying(q)=0
[   96.222884] a.out(8838): q->mq_freeze_depth=1 preempt=0 blk_queue_preempt_only(q)=0 blk_queue_dying(q)=0
[  106.463322] a.out(8838): q->mq_freeze_depth=1 preempt=0 blk_queue_preempt_only(q)=0 blk_queue_dying(q)=0
[  116.702912] a.out(8838): q->mq_freeze_depth=1 preempt=0 blk_queue_preempt_only(q)=0 blk_queue_dying(q)=0

One or more threads are waiting for q->mq_freeze_depth to become 0. But the
thread who incremented q->mq_freeze_depth at blk_freeze_queue_start(q) from
blk_freeze_queue() is waiting at blk_mq_freeze_queue_wait(). Therefore,
the atomic_read(&q->mq_freeze_depth) == 0 condition for wait_event() in
blk_queue_enter() will never be satisfied. But what does that wait_event()
want to do? Isn't "start freezing" a sort of blk_queue_dying(q) == true?
Since percpu_ref_tryget_live(&q->q_usage_counter) failed and the queue is
about to be frozen, shouldn't we treat atomic_read(&q->mq_freeze_depth) != 0
as if blk_queue_dying(q) == true? That is, something like below:

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b4..59e2496 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 		smp_rmb();
 
 		wait_event(q->mq_freeze_wq,
-			   (atomic_read(&q->mq_freeze_depth) == 0 &&
-			    (preempt || !blk_queue_preempt_only(q))) ||
+			   atomic_read(&q->mq_freeze_depth) ||
+			   (preempt || !blk_queue_preempt_only(q)) ||
 			   blk_queue_dying(q));
-		if (blk_queue_dying(q))
+		if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
 			return -ENODEV;
 	}
 }



Re: INFO: task hung in blk_queue_enter

2018-05-15 Thread Tetsuo Handa
I managed to obtain SysRq-t when khungtaskd fires using debug printk()
( https://groups.google.com/forum/#!topic/syzkaller-bugs/OTuOsVebAiE ).
Only 4 threads shown below seems to be relevant to this problem.

[  246.929688]   taskPC stack   pid father
[  249.888937] jbd2/sda1-8 D17736  2271  2 0x8000
[  249.894586] Call Trace:
[  249.897193]  __schedule+0x801/0x1e30
[  249.900954]  schedule+0xef/0x430
[  249.904356]  io_schedule+0x1c/0x70
[  249.907935]  bit_wait_io+0x18/0x90
[  249.911492]  __wait_on_bit+0xb3/0x130
[  249.915313]  out_of_line_wait_on_bit+0x204/0x3a0
[  249.920105]  __wait_on_buffer+0x76/0x90
[  249.924102]  jbd2_journal_commit_transaction+0x655b/0x8c18
[  249.929893]  kjournald2+0x26c/0xb30
[  249.933579]  kthread+0x345/0x410
[  249.936966]  ret_from_fork+0x3a/0x50
[  254.408972] syz-executor7   D18976  5010   4828 0x0004
[  254.414639] Call Trace:
[  254.417263]  __schedule+0x801/0x1e30
[  254.421070]  schedule+0xef/0x430
[  254.424467]  blk_queue_enter+0x8da/0xdf0
[  254.428584]  generic_make_request+0x144/0x1510
[  254.433217]  blk_queue_split+0x374/0x2090
[  254.437425]  blk_mq_make_request+0x2d0/0x25c0
[  254.442004]  generic_make_request+0x795/0x1510
[  254.446663]  submit_bio+0xba/0x460
[  254.451104]  mpage_readpages+0x6d7/0x970
[  254.455224]  blkdev_readpages+0x2c/0x40
[  254.459220]  __do_page_cache_readahead+0x79a/0xdc0
[  254.464205]  ondemand_readahead+0x550/0xc40
[  254.468559]  page_cache_sync_readahead+0xd1/0x110
[  254.473430]  generic_file_read_iter+0x1a74/0x2f00
[  254.478423]  blkdev_read_iter+0x120/0x190
[  254.482594]  generic_file_splice_read+0x552/0x910
[  254.487484]  do_splice_to+0x12e/0x190
[  254.491311]  splice_direct_to_actor+0x268/0x8d0
[  254.496017]  do_splice_direct+0x2cc/0x400
[  254.500224]  do_sendfile+0x60f/0xe00
[  254.503971]  __x64_sys_sendfile64+0x155/0x240
[  254.508507]  do_syscall_64+0x1b1/0x800
[  254.512431]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  254.568779] syz-executor7   D25408  5021   4828 0x0004
[  254.574429] Call Trace:
[  254.577125]  __schedule+0x801/0x1e30
[  254.580882]  schedule+0xef/0x430
[  254.584297]  blk_mq_freeze_queue_wait+0x1ce/0x460
[  254.589192]  blk_freeze_queue+0x4a/0x80
[  254.593209]  blk_mq_freeze_queue+0x15/0x20
[  254.597464]  loop_clr_fd+0x226/0xb80
[  254.601208]  lo_ioctl+0x642/0x2130
[  254.604774]  blkdev_ioctl+0x9b6/0x2020
[  254.608715]  block_ioctl+0xee/0x130
[  254.612362]  do_vfs_ioctl+0x1cf/0x16a0
[  254.616294]  ksys_ioctl+0xa9/0xd0
[  254.619770]  __x64_sys_ioctl+0x73/0xb0
[  254.623674]  do_syscall_64+0x1b1/0x800
[  254.627596]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  255.022839] 1 lock held by syz-executor7/5021:
[  255.028209]  #0: (____ptrval____) (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x2130
[  254.719133] blkid   D23880  5071   2713 0x
[  254.724791] Call Trace:
[  254.727402]  __schedule+0x801/0x1e30
[  254.731159]  schedule+0xef/0x430
[  254.734562]  schedule_preempt_disabled+0x10/0x20
[  254.739333]  __mutex_lock+0xe38/0x17f0
[  254.743280]  mutex_lock_killable_nested+0x16/0x20
[  254.748135]  lo_ioctl+0x8d/0x2130
[  254.751611]  blkdev_ioctl+0x9b6/0x2020
[  254.70]  block_ioctl+0xee/0x130
[  254.759200]  do_vfs_ioctl+0x1cf/0x16a0
[  254.763123]  ksys_ioctl+0xa9/0xd0
[  254.766601]  __x64_sys_ioctl+0x73/0xb0
[  254.770506]  do_syscall_64+0x1b1/0x800
[  254.774425]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  255.036399] 1 lock held by blkid/5071:
[  255.041075]  #0: (____ptrval____) (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x2130



syz-executor7(PID=5021) is doing ioctl(LOOP_CLR_FD).

blkid(PID=5071) is doing ioctl() but is blocked on lo_ctl_mutex which is
held by syz-executor7(PID=5021). Therefore, I assume that blkid(PID=5071)
is irrelevant for this hung up problem.



syz-executor7(PID=5021) is stuck at blk_mq_freeze_queue_wait() from
blk_freeze_queue() from blk_mq_freeze_queue() from loop_clr_fd().

/*
 * Guarantee no request is in use, so we can change any data structure of
 * the queue afterward.
 */
void blk_freeze_queue(struct request_queue *q)
{
	/*
	 * In the !blk_mq case we are only calling this to kill the
	 * q_usage_counter, otherwise this increases the freeze depth
	 * and waits for it to return to zero.  For this reason there is
	 * no blk_unfreeze_queue(), and blk_freeze_queue() is not
	 * exported to drivers as the only user for unfreeze is blk_mq.
	 */
	blk_freeze_queue_start(q);
	if (!q->mq_ops)
		blk_drain_queue(q);
	blk_mq_freeze_queue_wait(q);
}

q->mq_freeze_depth is incremented at blk_freeze_queue_start() and
the caller waits until q->q_usage_counter becomes 0.

void blk_freeze_queue_start(struct request_queue *q)
{
	int freeze_depth;

	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
	if (freeze_depth == 1) {
		percpu_ref_kill(&q->q_usage_counter);

[PATCH v2] block/loop: Serialize ioctl operations.

2018-05-09 Thread Tetsuo Handa
>From 86acfa7288c5e6ddab26f430e2bd2f42ad1407f0 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Date: Wed, 9 May 2018 13:01:31 +0900
Subject: [PATCH v2] block/loop: Serialize ioctl operations.

syzbot is reporting NULL pointer dereference [1] which is caused by
race condition between ioctl(loop_fd, LOOP_CLR_FD, 0) versus
ioctl(other_loop_fd, LOOP_SET_FD, loop_fd) due to traversing other
loop devices without holding corresponding locks.

syzbot is also reporting circular locking dependency between bdev->bd_mutex
and lo->lo_ctl_mutex [2] which is caused by calling blkdev_reread_part()
with lock held.

Since ioctl() request on loop devices is not frequent operation, we don't
need fine grained locking. Let's use global lock and simplify the locking
order.

Strategy is that the global lock is held upon entry of an ioctl() request,
and released before either starting operations which might deadlock or
leaving the ioctl() request. After the global lock is released, the current
thread no longer uses "struct loop_device" memory because it might be
modified by the next ioctl() request which was waiting for the current one.

In order to enforce this strategy, this patch inverted the order of
loop_reread_partitions() and loop_unprepare_queue() in loop_clr_fd().
I don't know whether it breaks something, but I don't have testcases.
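
(A hypothetical, untested smoke test for this reordering might look like the
sketch below: it enables LO_FLAGS_PARTSCAN so that LOOP_CLR_FD exercises the
loop_reread_partitions()/loop_unprepare_queue() ordering. The paths and the
device name are placeholders.)

--
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <linux/loop.h>

int main(void)
{
	int backing = open("/tmp/backing", O_RDWR | O_CREAT | O_TRUNC, 0600);
	int lo = open("/dev/loop0", O_RDWR);
	struct loop_info64 info;

	ftruncate(backing, 1 << 20);	/* give the backing file some size */
	if (ioctl(lo, LOOP_SET_FD, backing))
		return 1;
	memset(&info, 0, sizeof(info));
	info.lo_flags = LO_FLAGS_PARTSCAN;	/* make CLR_FD rescan partitions */
	ioctl(lo, LOOP_SET_STATUS64, &info);
	ioctl(lo, LOOP_CLR_FD, 0);	/* exercises the reordered teardown */
	return 0;
}
--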

Since this patch serializes using global lock, race bugs should no longer
exist. Thus, it will be easy to test whether this patch broke something.

[1] https://syzkaller.appspot.com/bug?id=f3cfe26e785d85f9ee259f385515291d21bd80a3
[2] https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Reported-by: syzbot <syzbot+bf89c128e05dd6c62...@syzkaller.appspotmail.com>
Reported-by: syzbot <syzbot+4684a000d5abdade83fac55b1e7d1f935ef19...@syzkaller.appspotmail.com>
Cc: Jens Axboe <ax...@kernel.dk>
---
 drivers/block/loop.c | 231 ---
 drivers/block/loop.h |   1 -
 2 files changed, 128 insertions(+), 104 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 24f6682..b17fd05 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -81,11 +81,50 @@
 #include 
 
 static DEFINE_IDR(loop_index_idr);
-static DEFINE_MUTEX(loop_index_mutex);
+static DEFINE_MUTEX(loop_mutex);
+static void *loop_mutex_owner; /* == __mutex_owner(&loop_mutex) */
 
 static int max_part;
 static int part_shift;
 
+/*
+ * lock_loop - Lock loop_mutex.
+ */
+static void lock_loop(void)
+{
+   mutex_lock(&loop_mutex);
+   loop_mutex_owner = current;
+}
+
+/*
+ * lock_loop_killable - Lock loop_mutex unless killed.
+ */
+static int lock_loop_killable(void)
+{
+   int err = mutex_lock_killable(&loop_mutex);
+
+   if (err)
+   return err;
+   loop_mutex_owner = current;
+   return 0;
+}
+
+/*
+ * unlock_loop - Unlock loop_mutex as needed.
+ *
+ * Explicitly call this function before calling fput() or blkdev_reread_part()
+ * in order to avoid circular lock dependency. After this function is called,
+ * current thread is no longer allowed to access "struct loop_device" memory,
+ * for another thread would access that memory as soon as loop_mutex is held.
+ */
+static void unlock_loop(void)
+{
+   if (loop_mutex_owner == current) {
+   loop_mutex_owner = NULL;
+   mutex_unlock(&loop_mutex);
+   }
+}
+
 static int transfer_xor(struct loop_device *lo, int cmd,
struct page *raw_page, unsigned raw_off,
struct page *loop_page, unsigned loop_off,
@@ -626,7 +665,12 @@ static void loop_reread_partitions(struct loop_device *lo,
   struct block_device *bdev)
 {
int rc;
+   /* Save variables which might change after unlock_loop() is called. */
+   char filename[sizeof(lo->lo_file_name)];
+   const int num = lo->lo_number;
 
+   memmove(filename, lo->lo_file_name, sizeof(filename));
+   unlock_loop();
/*
 * bd_mutex has been held already in release path, so don't
 * acquire it if this function is called in such case.
@@ -641,7 +685,7 @@ static void loop_reread_partitions(struct loop_device *lo,
rc = blkdev_reread_part(bdev);
if (rc)
pr_warn("%s: partition scan of loop%d (%s) failed (rc=%d)\n",
-   __func__, lo->lo_number, lo->lo_file_name, rc);
+   __func__, num, filename, rc);
 }
 
 static inline int is_loop_device(struct file *file)
@@ -689,6 +733,7 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
struct inode*inode;
int error;
 
+   lockdep_assert_held(&loop_mutex);
error = -ENXIO;
if (lo->lo_state != Lo_bound)
goto out;
@@ -726,12 

Re: general protection fault in lo_ioctl (2)

2018-05-08 Thread Tetsuo Handa
Theodore Y. Ts'o wrote:
> On Tue, May 08, 2018 at 08:05:12PM +0900, Tetsuo Handa wrote:
> > 
> > So, it is time to think how to solve this race condition, as well as how to 
> > solve
> > lockdep's deadlock warning (and I guess that syzbot is actually hitting 
> > deadlocks).
> > An approach which serializes loop operations using global lock was proposed 
> > at
> > https://groups.google.com/d/msg/syzkaller-bugs/2Rw8-OM6IbM/PzdobV8kAgAJ .
> > Please respond...
> 
> I'm looking at your patch which you proposed on this, and the locking
> architecture still looks way too complex.  Things like
> loop_mutex_owner, and all of the infrastructure around
> lo->ioctl_in_progress should be removed, if at all possible.

The patch in the above link no longer uses "lo->ioctl_in_progress".
You looked at previous version rather than current version.

> 
> I believe it should be possible to do things with a single global
> mutex, some code refactoring, and some unlocked versions of some of
> the functions.

The patch in the above link uses single global mutex "loop_mutex".

> 
> Again, this requires root, and it requires someone deliberately trying
> to induce a race.  So "it's time" is not necessarily the priority I
> would set for this item.  But if we are going to fix it, let's fix it
> right, and not make the code more complex and less maintainable, all
> in the name of trying to make a rare, not-likely-to-happen-in-real-life
> syzbot reported problem to go away.

While NULL pointer dereference would be rare, deadlocks might not be rare
enough to postpone the patch. Deadlocks can cause pile of false-positive
hung task reports and can prevent syzbot from finding other bugs. That's
why I say "it is time to think".


Re: general protection fault in lo_ioctl (2)

2018-05-08 Thread Tetsuo Handa
On 2018/05/08 5:56, Tetsuo Handa wrote:
> On 2018/05/02 20:23, Dmitry Vyukov wrote:
>> #syz dup: INFO: rcu detected stall in blkdev_ioctl
> 
> The cause of stall turned out to be ioctl(loop_fd, LOOP_CHANGE_FD, loop_fd).
> 
> But we haven't explained the cause of NULL pointer dereference which can
> occur when raced with ioctl(LOOP_CLR_FD). Therefore,
> 
> #syz undup
> 

Using sleep injection patch and reproducer shown below, I can reproduce
the crashes. It is a race between ioctl(loop_fd, LOOP_CLR_FD, 0) versus
ioctl(other_loop_fd, LOOP_SET_FD, loop_fd).

Unless we hold corresponding lo->lo_ctl_mutex (or keep corresponding
lo->lo_refcnt elevated) when traversing other loop devices,
"/* Avoid recursion */" loop from loop_set_fd()/loop_change_fd() will
suffer from races by loop_clr_fd().

So, it is time to think how to solve this race condition, as well as how to
solve lockdep's deadlock warning (and I guess that syzbot is actually hitting
deadlocks).
An approach which serializes loop operations using global lock was proposed at
https://groups.google.com/d/msg/syzkaller-bugs/2Rw8-OM6IbM/PzdobV8kAgAJ .
Please respond...


--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -909,6 +909,9 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 			error = -EINVAL;
 			goto out_putf;
 		}
+		pr_err("Start sleeping\n");
+		schedule_timeout_killable(3 * HZ);
+		pr_err("End sleeping\n");
 		f = l->lo_backing_file;
 	}
 



#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#include <linux/loop.h>

int main(int argc, char *argv[])
{
int fd0 = open("/dev/loop0", O_RDONLY);
int fd1 = open("/dev/loop1", O_RDONLY);
int fd2 = open("/tmp/file", O_RDWR | O_CREAT | O_TRUNC, 0600);
ioctl(fd1, LOOP_SET_FD, fd2);
if (fork() == 0) {
sleep(1);
ioctl(fd1, LOOP_CLR_FD, 0);
_exit(0);
}
ioctl(fd0, LOOP_SET_FD, fd1);
return 0;
}



[   14.119073] loop: module loaded
[   17.363610] Start sleeping
[   20.383442] End sleeping
[   20.386511] BUG: unable to handle kernel NULL pointer dereference at 0008
[   20.394779] PGD 13377d067 P4D 13377d067 PUD 131509067 PMD 0 
[   20.400847] Oops:  [#1] SMP
[   20.403875] Modules linked in: loop
[   20.406188] CPU: 6 PID: 6470 Comm: a.out Tainted: GT 4.17.0-rc4+ #540
[   20.411266] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[   20.418169] RIP: 0010:lo_ioctl+0x7ef/0x840 [loop]
[   20.421272] RSP: 0018:c9bbbd88 EFLAGS: 00010282
[   20.424661] RAX:  RBX:  RCX: 83679478
[   20.429271] RDX: 8801332e9c00 RSI: 0086 RDI: 0286
[   20.434517] RBP: c9bbbdd8 R08: 0638 R09: 
[   20.436879] R10: 0190 R11: 0720072007200720 R12: 8801314ab118
[   20.439076] R13: 880138deae40 R14: 8801311f7780 R15: 8801314ab000
[   20.441144] FS:  7f0b57743740() GS:88013a78() knlGS:
[   20.443588] CS:  0010 DS:  ES:  CR0: 80050033
[   20.445284] CR2: 0008 CR3: 000138efb002 CR4: 000606e0
[   20.447381] Call Trace:
[   20.448149]  blkdev_ioctl+0x88d/0x950
[   20.449237]  block_ioctl+0x38/0x40
[   20.450269]  do_vfs_ioctl+0xaa/0x650
[   20.451479]  ? handle_mm_fault+0x108/0x250
[   20.452704]  ksys_ioctl+0x70/0x80
[   20.453737]  __x64_sys_ioctl+0x15/0x20
[   20.454887]  do_syscall_64+0x5d/0x100
[   20.456014]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   20.457519] RIP: 0033:0x7f0b57267107
[   20.458644] RSP: 002b:7fff8a0fd698 EFLAGS: 0246 ORIG_RAX: 0010
[   20.460853] RAX: ffda RBX: 0004 RCX: 7f0b57267107
[   20.462952] RDX: 0004 RSI: 4c00 RDI: 0003
[   20.465023] RBP: 0003 R08: 7f0b57743740 R09: 
[   20.467091] R10: 7f0b57743a10 R11: 0246 R12: 004005ef
[   20.469361] R13: 7fff8a0fd790 R14:  R15: 
[   20.471657] Code: a0 48 89 55 d0 e8 e0 5f 1d e1 bf b8 0b 00 00 e8 78 9e 7c e2 48 c7 c7 a9 40 00 a0 e8 ca 5f 1d e1 48 8b 55 d0 48 8b 82 f0 00 00 00 <48> 8b 40 08 48 8b 40 68 48 85 c0 0f 84 15 fd ff ff 0f b7 90 b8 
[   20.477207] RIP: lo_ioctl+0x7ef/0x840 [loop] RSP: c9bbbd88
[   20.479027] CR2: 0008
[ 

Re: general protection fault in lo_ioctl (2)

2018-05-07 Thread Tetsuo Handa
On 2018/05/02 20:23, Dmitry Vyukov wrote:
> #syz dup: INFO: rcu detected stall in blkdev_ioctl

The cause of stall turned out to be ioctl(loop_fd, LOOP_CHANGE_FD, loop_fd).

But we haven't explained the cause of NULL pointer dereference which can
occur when raced with ioctl(LOOP_CLR_FD). Therefore,

#syz undup


Re: [PATCH] loop: add recursion validation to LOOP_CHANGE_FD

2018-05-07 Thread Tetsuo Handa
Theodore Y. Ts'o wrote:
> On Mon, May 07, 2018 at 10:21:04PM +0900, Tetsuo Handa wrote:
> > > I don't understand your concern; where are we going to out_putf when
> > > error == 0?
> 
> Ah, now I see it, thanks.  I'll send a revised patch.
> 
> > By the way, are you aware that current "/* Avoid recursion */" loop is not 
> > thread safe?
> 
> Actually, it is safe.  While the child loop device has an open file on
> the parent, lo_refcnt is elevated, which prevents loop_clr_fd from
> actually set setting lo_state to Lo_rundown and clearing
> lo_backing_file

If you think it is safe, please explain that the crash referenced in a patch
at https://groups.google.com/d/msg/syzkaller-bugs/2Rw8-OM6IbM/PzdobV8kAgAJ is
no longer possible. syzbot is hitting crashes there.


Re: [PATCH] loop: add recursion validation to LOOP_CHANGE_FD

2018-05-07 Thread Tetsuo Handa
Theodore Y. Ts'o wrote:
> On Mon, May 07, 2018 at 08:16:58PM +0900, Tetsuo Handa wrote:
> > Oh, your message did not arrive at news.gmane.org and I didn't notice that 
> > you
> > already wrote this patch. But please update yours or review mine shown 
> > below.
> > 
> > > @@ -673,14 +703,13 @@ static int loop_change_fd(struct loop_device *lo, 
> > > struct block_device *bdev,
> > >   if (!file)
> > >   goto out;
> > >  
> > > + error = loop_validate_file(file, bdev);
> > > + if (error)
> > > + goto out_putf;
> > > +
> > >   inode = file->f_mapping->host;
> > >   old_file = lo->lo_backing_file;
> > >  
> > > - error = -EINVAL;
> > > -
> > > - if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
> > > - goto out_putf;
> > > -
> > >   /* size of the new backing store needs to be the same */
> > >   if (get_loop_size(lo, file) != get_loop_size(lo, old_file))
> > >   goto out_putf;
> > 
> > error == 0 upon "goto out_putf" is wrong.
> 
> I don't understand your concern; where are we going to out_putf when
> error == 0?

-   error = -EINVAL;  /* <= You are trying to remove this assignment. */
-
-   if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
-   goto out_putf;
/* size of the new backing store needs to be the same */
if (get_loop_size(lo, file) != get_loop_size(lo, old_file))
goto out_putf; /* <= Hence error == 0 at this point. */

By the way, are you aware that current "/* Avoid recursion */" loop is not 
thread safe?


Re: [PATCH] loop: add recursion validation to LOOP_CHANGE_FD

2018-05-07 Thread Tetsuo Handa
Oh, your message did not arrive at news.gmane.org and I didn't notice that you
already wrote this patch. But please update yours or review mine shown below.

> @@ -673,14 +703,13 @@ static int loop_change_fd(struct loop_device *lo, 
> struct block_device *bdev,
>   if (!file)
>   goto out;
>  
> + error = loop_validate_file(file, bdev);
> + if (error)
> + goto out_putf;
> +
>   inode = file->f_mapping->host;
>   old_file = lo->lo_backing_file;
>  
> - error = -EINVAL;
> -
> - if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
> - goto out_putf;
> -
>   /* size of the new backing store needs to be the same */
>   if (get_loop_size(lo, file) != get_loop_size(lo, old_file))
>   goto out_putf;

error == 0 upon "goto out_putf" is wrong.
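
A minimal sketch of the stale-status pitfall (generic C, not loop.c itself):
once the explicit -EINVAL assignment is removed, the later goto can leak a
stale success value to the caller.

--
#include <stdio.h>

int do_setup(int a, int b)
{
	int error = 0;		/* left over from an earlier successful step */

	if (a < 0) {
		error = -22;	/* explicit -EINVAL: fine */
		goto out;
	}
	if (a != b)
		goto out;	/* BUG: error is still 0, caller sees success */
	error = 1;		/* real work would happen here */
out:
	return error;
}

int main(void)
{
	printf("mismatched sizes -> %d (should be a negative error)\n",
	       do_setup(1, 2));
	return 0;
}
--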



>From eed54c6ae475860a9c63b37b58f34735e792eef7 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Date: Sat, 5 May 2018 12:59:12 +0900
Subject: [PATCH] block/loop: Add recursion check for LOOP_CHANGE_FD request.

syzbot is reporting hung tasks at blk_freeze_queue() [1]. This is
due to ioctl(loop_fd, LOOP_CHANGE_FD, loop_fd) request which should be
rejected. Fix this by adding same recursion check which is used by
LOOP_SET_FD request.

[1] https://syzkaller.appspot.com/bug?id=078805a692853552e08154b1bcd2bc2c729eda88

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Reported-by: syzbot <syzbot+2ab52b8d94df5e2ea...@syzkaller.appspotmail.com>
Cc: Jens Axboe <ax...@kernel.dk>
---
 drivers/block/loop.c | 59 
 1 file changed, 37 insertions(+), 22 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5d4e316..cee3c01 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -644,6 +644,34 @@ static void loop_reread_partitions(struct loop_device *lo,
__func__, lo->lo_number, lo->lo_file_name, rc);
 }
 
+static inline int is_loop_device(struct file *file)
+{
+   struct inode *i = file->f_mapping->host;
+
+   return i && S_ISBLK(i->i_mode) && MAJOR(i->i_rdev) == LOOP_MAJOR;
+}
+
+static int check_loop_recursion(struct file *f, struct block_device *bdev)
+{
+   /*
+* FIXME: Traversing on other loop devices without corresponding
+* lo_ctl_mutex is not safe. l->lo_state can become Lo_rundown and
+* l->lo_backing_file can become NULL when raced with LOOP_CLR_FD.
+*/
+   while (is_loop_device(f)) {
+   struct loop_device *l;
+
+   if (f->f_mapping->host->i_bdev == bdev)
+   return -EBUSY;
+
+   l = f->f_mapping->host->i_bdev->bd_disk->private_data;
+   if (l->lo_state == Lo_unbound)
+   return -EINVAL;
+   f = l->lo_backing_file;
+   }
+   return 0;
+}
+
 /*
  * loop_change_fd switched the backing store of a loopback device to
  * a new file. This is useful for operating system installers to free up
@@ -673,6 +701,11 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
if (!file)
goto out;
 
+   /* Avoid recursion */
+   error = check_loop_recursion(file, bdev);
+   if (error)
+   goto out_putf;
+
inode = file->f_mapping->host;
old_file = lo->lo_backing_file;
 
@@ -706,13 +739,6 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
return error;
 }
 
-static inline int is_loop_device(struct file *file)
-{
-   struct inode *i = file->f_mapping->host;
-
-   return i && S_ISBLK(i->i_mode) && MAJOR(i->i_rdev) == LOOP_MAJOR;
-}
-
 /* loop sysfs attributes */
 
 static ssize_t loop_attr_show(struct device *dev, char *page,
@@ -877,7 +903,7 @@ static int loop_prepare_queue(struct loop_device *lo)
 static int loop_set_fd(struct loop_device *lo, fmode_t mode,
   struct block_device *bdev, unsigned int arg)
 {
-   struct file *file, *f;
+   struct file *file;
struct inode*inode;
struct address_space *mapping;
int lo_flags = 0;
@@ -897,20 +923,9 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
goto out_putf;
 
/* Avoid recursion */
-   f = file;
-   while (is_loop_device(f)) {
-   struct loop_device *l;
-
-   if (f->f_mapping->host->i_bdev == bdev)
-   goto out_putf;
-
-   l = f->f_mapping->host->i_bdev->bd_disk->private_data;
-   if (l->lo_state == Lo_unbound) {
-   error = -EINVAL;
-   goto out_putf;
-   }
-   f = l->lo_backing_file;
-   }
+   error = check_loop_recursion(file, bdev);
+   if (error)
+   goto out_putf;
 
mapping = file->f_mapping;
inode = mapping->host;
-- 
1.8.3.1


Re: INFO: rcu detected stall in blkdev_ioctl

2018-05-05 Thread Tetsuo Handa
If various stall reports regarding loop_set_fd() are hitting below sequence,
a patch was proposed at
https://groups.google.com/d/msg/syzkaller-bugs/5pzXJ8yQFR0/vWeRytaQBAAJ .

--
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <linux/loop.h>

int main(int argc, char *argv[])
{
const int fd = open("/tmp/file", O_RDWR | O_CREAT | O_TRUNC, 0600);
const int fd0 = open("/dev/loop0", O_RDONLY);
const int fd1 = open("/dev/loop1", O_RDONLY);
ioctl(fd0, LOOP_SET_FD, fd);
ioctl(fd0, LOOP_CHANGE_FD, fd0);
ioctl(fd1, LOOP_SET_FD, fd0);
return 0;
}
--




Re: [PATCH] loop: remember whether sysfs_create_group() succeeded

2018-05-05 Thread Tetsuo Handa
Milan Broz wrote:
> On 05/04/2018 04:40 PM, Tetsuo Handa wrote:
> > The loop module ignores sysfs_create_group() failure and pretends that
> > LOOP_SET_FD request succeeded. I guess that the author of commit
> > ee86273062cbb310 ("loop: add some basic read-only sysfs attributes")
> > assumed that it is not a fatal error enough to abort LOOP_SET_FD request.
> 
> IIRC we added sysfs attributes to easily access loop info for a regular user
> in lsblk command (and perhaps even in udev rules).
> 
> The ioctl interface was still controlling the loop device, so the failure
> here was not meant to be fatal. TBH I think was not a great idea.

Thanks for reply.

>  
> > Do we want to abort LOOP_SET_FD request if sysfs_create_group() failed?
> 
> I would prefer failure - there are several utilities that expects attributes 
> in
> sysfs to be valid (for example I print info from here in cryptsetup status
> if the backing image is an image), so ignoring failure put the system
> in inconsistent state.

I see. But can we for now send v1 patch for 4.17 release (and postpone making
LOOP_SET_FD request fail if sysfs_create_group() failed)? This bug has so far
crashed syzbot tests 6432 times in 190 days.

We have a lot of bugs regarding loop module which prevent syzbot from
finding other bugs. I want to immediately squash bugs in block/loop so that
we can reduce false-positive hung task reports.

> 
> Thanks for fixing this,
> Milan
> 


Re: INFO: task hung in blk_freeze_queue

2018-05-04 Thread Tetsuo Handa
A patch for this specific report is ready. I don't know whether other
"dup" reports will be solved by this patch. Thus, I "undup" this report.

#syz undup

>From eed54c6ae475860a9c63b37b58f34735e792eef7 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Date: Sat, 5 May 2018 12:59:12 +0900
Subject: [PATCH] block/loop: Add recursion check for LOOP_CHANGE_FD request.

syzbot is reporting hung tasks at blk_freeze_queue() [1]. This is
due to ioctl(loop_fd, LOOP_CHANGE_FD, loop_fd) request which should be
rejected. Fix this by adding same recursion check which is used by
LOOP_SET_FD request.

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Reported-by: syzbot <syzbot+2ab52b8d94df5e2ea...@syzkaller.appspotmail.com>
Cc: Jens Axboe <ax...@kernel.dk>
---
 drivers/block/loop.c | 59 
 1 file changed, 37 insertions(+), 22 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5d4e316..cee3c01 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -644,6 +644,34 @@ static void loop_reread_partitions(struct loop_device *lo,
__func__, lo->lo_number, lo->lo_file_name, rc);
 }
 
+static inline int is_loop_device(struct file *file)
+{
+   struct inode *i = file->f_mapping->host;
+
+   return i && S_ISBLK(i->i_mode) && MAJOR(i->i_rdev) == LOOP_MAJOR;
+}
+
+static int check_loop_recursion(struct file *f, struct block_device *bdev)
+{
+   /*
+* FIXME: Traversing on other loop devices without corresponding
+* lo_ctl_mutex is not safe. l->lo_state can become Lo_rundown and
+* l->lo_backing_file can become NULL when raced with LOOP_CLR_FD.
+*/
+   while (is_loop_device(f)) {
+   struct loop_device *l;
+
+   if (f->f_mapping->host->i_bdev == bdev)
+   return -EBUSY;
+
+   l = f->f_mapping->host->i_bdev->bd_disk->private_data;
+   if (l->lo_state == Lo_unbound)
+   return -EINVAL;
+   f = l->lo_backing_file;
+   }
+   return 0;
+}
+
 /*
  * loop_change_fd switched the backing store of a loopback device to
  * a new file. This is useful for operating system installers to free up
@@ -673,6 +701,11 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
if (!file)
goto out;
 
+   /* Avoid recursion */
+   error = check_loop_recursion(file, bdev);
+   if (error)
+   goto out_putf;
+
inode = file->f_mapping->host;
old_file = lo->lo_backing_file;
 
@@ -706,13 +739,6 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
return error;
 }
 
-static inline int is_loop_device(struct file *file)
-{
-   struct inode *i = file->f_mapping->host;
-
-   return i && S_ISBLK(i->i_mode) && MAJOR(i->i_rdev) == LOOP_MAJOR;
-}
-
 /* loop sysfs attributes */
 
 static ssize_t loop_attr_show(struct device *dev, char *page,
@@ -877,7 +903,7 @@ static int loop_prepare_queue(struct loop_device *lo)
 static int loop_set_fd(struct loop_device *lo, fmode_t mode,
   struct block_device *bdev, unsigned int arg)
 {
-   struct file *file, *f;
+   struct file *file;
struct inode*inode;
struct address_space *mapping;
int lo_flags = 0;
@@ -897,20 +923,9 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
goto out_putf;
 
/* Avoid recursion */
-   f = file;
-   while (is_loop_device(f)) {
-   struct loop_device *l;
-
-   if (f->f_mapping->host->i_bdev == bdev)
-   goto out_putf;
-
-   l = f->f_mapping->host->i_bdev->bd_disk->private_data;
-   if (l->lo_state == Lo_unbound) {
-   error = -EINVAL;
-   goto out_putf;
-   }
-   f = l->lo_backing_file;
-   }
+   error = check_loop_recursion(file, bdev);
+   if (error)
+   goto out_putf;
 
mapping = file->f_mapping;
inode = mapping->host;
-- 
1.8.3.1




Re: [PATCH v2] loop: remember whether sysfs_create_group() succeeded

2018-05-04 Thread Tetsuo Handa
Jens Axboe wrote:
> On 5/4/18 10:14 AM, Tetsuo Handa wrote:
> >>>> If that's not easily done, then my next suggestion would be to
> >>>> use a loop flag for it, LO_FLAGS_SYSFS_SETUP or something like that.
> >>>
> >>> Yes, that would be possible.
> >>
> >> Let's make that change.
> > 
> > Since LO_FLAGS_* are defined in include/uapi/linux/loop.h as userspace 
> > visible
> > flags, I feel that using "struct loop_device"->lo_flags for recording 
> > whether
> > sysfs entry exists might be strange... Anyway, updated patch is shown below.
> 
> Hmm yes, I forgot about that, I guess that makes the flags approach
> pretty much useless. Let's just go with your v1 in that case.
> 

OK. You can s/sysfs_ready/sysfs_init_done/ before you apply to your tree.


[PATCH v2] loop: remember whether sysfs_create_group() succeeded

2018-05-04 Thread Tetsuo Handa
Jens Axboe wrote:
> > The loop module ignores sysfs_create_group() failure and pretends that
> > LOOP_SET_FD request succeeded. I guess that the author of commit
> > ee86273062cbb310 ("loop: add some basic read-only sysfs attributes")
> > assumed that it is not a fatal error enough to abort LOOP_SET_FD request.
> > 
> > Do we want to abort LOOP_SET_FD request if sysfs_create_group() failed?
> 
> Probably safer to retain that behavior.

OK.

> 
> >> If that's not easily done, then my next suggestion would be to
> >> use a loop flag for it, LO_FLAGS_SYSFS_SETUP or something like that.
> > 
> > Yes, that would be possible.
> 
> Let's make that change.

Since LO_FLAGS_* are defined in include/uapi/linux/loop.h as userspace visible
flags, I feel that using "struct loop_device"->lo_flags for recording whether
sysfs entry exists might be strange... Anyway, updated patch is shown below.



>From c9897b6387b13b533c32dcc624e12a93f23224d0 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Date: Sat, 5 May 2018 00:52:59 +0900
Subject: [PATCH v2] loop: remember whether sysfs_create_group() succeeded

syzbot is hitting WARN() triggered by memory allocation fault
injection [1] because loop module is calling sysfs_remove_group()
when sysfs_create_group() failed.

Fix this by remembering whether sysfs_create_group() succeeded.
To remember it, userspace visible API flag LO_FLAGS_SYSFS_ENTRY is
introduced. This flag indicates that sysfs entries are available, and
should always be set if CONFIG_SYSFS=y because sysfs entries will be
created unless fault injection prevents it.

[1] https://syzkaller.appspot.com/bug?id=3f86c0edf75c86d2633aeb9dd69eccc70bc7e90b

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Reported-by: syzbot <syzbot+9f03168400f56df89dbc6f1751f4458fe739f...@syzkaller.appspotmail.com>
Cc: Jens Axboe <ax...@kernel.dk>
---
 drivers/block/loop.c  | 6 --
 include/uapi/linux/loop.h | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5d4e316..499eafd 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -951,7 +951,8 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
loop_update_dio(lo);
set_capacity(lo->lo_disk, size);
bd_set_size(bdev, size << 9);
-   loop_sysfs_init(lo);
+   if (IS_ENABLED(CONFIG_SYSFS) && loop_sysfs_init(lo) == 0)
+   lo->lo_flags |= LO_FLAGS_SYSFS_ENTRY;
/* let user-space know about the new size */
	kobject_uevent(&disk_to_dev(bdev->bd_disk)->kobj, KOBJ_CHANGE);
 
@@ -1070,7 +1071,8 @@ static int loop_clr_fd(struct loop_device *lo)
invalidate_bdev(bdev);
}
set_capacity(lo->lo_disk, 0);
-   loop_sysfs_exit(lo);
+   if (lo->lo_flags & LO_FLAGS_SYSFS_ENTRY)
+   loop_sysfs_exit(lo);
if (bdev) {
bd_set_size(bdev, 0);
/* let user-space know about this change */
diff --git a/include/uapi/linux/loop.h b/include/uapi/linux/loop.h
index 080a8df..5de1eaa6 100644
--- a/include/uapi/linux/loop.h
+++ b/include/uapi/linux/loop.h
@@ -23,6 +23,7 @@ enum {
LO_FLAGS_AUTOCLEAR  = 4,
LO_FLAGS_PARTSCAN   = 8,
LO_FLAGS_DIRECT_IO  = 16,
+   LO_FLAGS_SYSFS_ENTRY= 32,
 };
 
#include <asm/posix_types.h>	/* for __kernel_old_dev_t */
-- 
1.8.3.1
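
To make the concern concrete: LO_FLAGS_* round-trip to userspace through
LOOP_GET_STATUS64, so any status reader would start observing such a private
bit. A short sketch (illustrative only; assumes /dev/loop0 exists):

--
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <linux/loop.h>

int main(void)
{
	int fd = open("/dev/loop0", O_RDONLY);
	struct loop_info64 info;

	memset(&info, 0, sizeof(info));
	/* every LOOP_GET_STATUS64 caller would see the new flag bit */
	if (fd >= 0 && ioctl(fd, LOOP_GET_STATUS64, &info) == 0)
		printf("lo_flags=0x%x\n", (unsigned int)info.lo_flags);
	return 0;
}
--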


Re: [PATCH] loop: remember whether sysfs_create_group() succeeded

2018-05-04 Thread Tetsuo Handa
Jens Axboe wrote:
> On 5/4/18 8:27 AM, Tetsuo Handa wrote:
> > Jens Axboe wrote:
> >> On 5/4/18 5:47 AM, Tetsuo Handa wrote:
> >>> >From 626d33de1b70b11ecaf95a9f83f7644998e54cbb Mon Sep 17 00:00:00 2001
> >>> From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> >>> Date: Wed, 2 May 2018 23:03:48 +0900
> >>> Subject: [PATCH] loop: remember whether sysfs_create_group() succeeded
> >>>
> >>> syzbot is hitting WARN() triggered by memory allocation fault
> >>> injection [1] because loop module is calling sysfs_remove_group()
> >>> when sysfs_create_group() failed.
> >>> Fix this by remembering whether sysfs_create_group() succeeded.
> >>
> >> Can we store this locally instead of in the loop_device? Also,
> >> naming wise, something like sysfs_init_done would be more readily
> >> understandable.
> > 
> > Whether sysfs entry for this loop device exists is per "struct loop_device"
> > flag, isn't it? What does "locally" mean?
> 
> I'm assuming this is calling remove in an error path when alloc fails.
> So it should be possible to know locally whether this was done or not,
> before calling the teardown. Storing this is in the loop_device seems
> like a bit of a hack.

The loop module ignores sysfs_create_group() failure and pretends that
LOOP_SET_FD request succeeded. I guess that the author of commit
ee86273062cbb310 ("loop: add some basic read-only sysfs attributes")
assumed that it is not a fatal error enough to abort LOOP_SET_FD request.

Do we want to abort LOOP_SET_FD request if sysfs_create_group() failed?

> 
> If that's not easily done, then my next suggestion would be to
> use a loop flag for it, LO_FLAGS_SYSFS_SETUP or something like that.

Yes, that would be possible.


Re: [PATCH] loop: remember whether sysfs_create_group() succeeded

2018-05-04 Thread Tetsuo Handa
Jens Axboe wrote:
> On 5/4/18 5:47 AM, Tetsuo Handa wrote:
> >>From 626d33de1b70b11ecaf95a9f83f7644998e54cbb Mon Sep 17 00:00:00 2001
> > From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> > Date: Wed, 2 May 2018 23:03:48 +0900
> > Subject: [PATCH] loop: remember whether sysfs_create_group() succeeded
> > 
> > syzbot is hitting WARN() triggered by memory allocation fault
> > injection [1] because loop module is calling sysfs_remove_group()
> > when sysfs_create_group() failed.
> > Fix this by remembering whether sysfs_create_group() succeeded.
> 
> Can we store this locally instead of in the loop_device? Also,
> naming wise, something like sysfs_init_done would be more readily
> understandable.

Whether sysfs entry for this loop device exists is per "struct loop_device"
flag, isn't it? What does "locally" mean?


[PATCH] loop: remember whether sysfs_create_group() succeeded

2018-05-04 Thread Tetsuo Handa
>From 626d33de1b70b11ecaf95a9f83f7644998e54cbb Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Date: Wed, 2 May 2018 23:03:48 +0900
Subject: [PATCH] loop: remember whether sysfs_create_group() succeeded

syzbot is hitting WARN() triggered by memory allocation fault
injection [1] because loop module is calling sysfs_remove_group()
when sysfs_create_group() failed.
Fix this by remembering whether sysfs_create_group() succeeded.

[1] https://syzkaller.appspot.com/bug?id=3f86c0edf75c86d2633aeb9dd69eccc70bc7e90b

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Reported-by: syzbot <syzbot+9f03168400f56df89dbc6f1751f4458fe739f...@syzkaller.appspotmail.com>
Reviewed-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
Cc: Jens Axboe <ax...@kernel.dk>
---
 drivers/block/loop.c | 11 ++-
 drivers/block/loop.h |  1 +
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5d4e316..1d758d8 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -809,16 +809,17 @@ static ssize_t loop_attr_dio_show(struct loop_device *lo, 
char *buf)
.attrs= loop_attrs,
 };
 
-static int loop_sysfs_init(struct loop_device *lo)
+static void loop_sysfs_init(struct loop_device *lo)
 {
-   return sysfs_create_group(&disk_to_dev(lo->lo_disk)->kobj,
- &loop_attribute_group);
+   lo->sysfs_ready = !sysfs_create_group(&disk_to_dev(lo->lo_disk)->kobj,
+ &loop_attribute_group);
 }
 
 static void loop_sysfs_exit(struct loop_device *lo)
 {
-   sysfs_remove_group(&disk_to_dev(lo->lo_disk)->kobj,
-  &loop_attribute_group);
+   if (lo->sysfs_ready)
+   sysfs_remove_group(&disk_to_dev(lo->lo_disk)->kobj,
+  &loop_attribute_group);
 }
 
 static void loop_config_discard(struct loop_device *lo)
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index b78de98..73c801f 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -58,6 +58,7 @@ struct loop_device {
struct kthread_worker   worker;
struct task_struct  *worker_task;
	bool		use_dio;
+	bool		sysfs_ready;
 
struct request_queue*lo_queue;
struct blk_mq_tag_set   tag_set;
-- 
1.8.3.1
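
The pattern the patch adopts is the usual "remember whether setup succeeded,
guard the teardown" idiom. A tiny userspace analog (illustrative only, with
buf_ready standing in for sysfs_ready):

--
#include <stdio.h>
#include <stdlib.h>

struct dev {
	char *buf;
	int buf_ready;	/* stand-in for lo->sysfs_ready */
};

static void setup(struct dev *d, int inject_failure)
{
	d->buf = inject_failure ? NULL : malloc(16);
	d->buf_ready = d->buf != NULL;	/* remember whether setup succeeded */
}

static void teardown(struct dev *d)
{
	if (d->buf_ready)	/* skip cleanup of what was never created */
		free(d->buf);
}

int main(void)
{
	struct dev d;

	setup(&d, 1);	/* simulate sysfs_create_group() failing */
	teardown(&d);	/* guarded teardown stays quiet instead of WARNing */
	printf("teardown handled the failed setup\n");
	return 0;
}
--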



Re: [PATCH] bdi: Fix oops in wb_workfn()

2018-05-03 Thread Tetsuo Handa
Jan Kara wrote:
> Make wb_workfn() use wakeup_wb() for requeueing the work which takes all
> the necessary precautions against racing with bdi unregistration.

Yes, this patch will solve NULL pointer dereference bug. But is it OK to leave
list_empty(&wb->work_list) == false situation? Who takes over the role of making
list_empty(&wb->work_list) == true?

Just a confirmation, since Fabiano Rosas is facing a problem that "write call
hangs in kernel space after virtio hot-remove" and is thinking that we might
need to go the opposite direction
( http://lkml.kernel.org/r/f0787b79-1e50-5f55-a400-44f715451...@linux.ibm.com ).


Re: INFO: task hung in wb_shutdown (2)

2018-05-01 Thread Tetsuo Handa
>From 1b90d7f71d60e743c69cdff3ba41edd1f9f86f93 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Date: Wed, 2 May 2018 07:07:55 +0900
Subject: [PATCH v2] bdi: wake up concurrent wb_shutdown() callers.

syzbot is reporting hung tasks at wait_on_bit(WB_shutting_down) in
wb_shutdown() [1]. This seems to be because commit 5318ce7d46866e1d ("bdi:
Shutdown writeback on all cgwbs in cgwb_bdi_destroy()") forgot to call
wake_up_bit(WB_shutting_down) after clear_bit(WB_shutting_down).

Introduce a helper function clear_and_wake_up_bit() and use it, in order
to avoid similar errors in future.

[1] 
https://syzkaller.appspot.com/bug?id=b297474817af98d5796bc544e1bb806fc3da0e5e

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Reported-by: syzbot <syzbot+c0cf869505e03bdf1...@syzkaller.appspotmail.com>
Fixes: 5318ce7d46866e1d ("bdi: Shutdown writeback on all cgwbs in 
cgwb_bdi_destroy()")
Cc: Tejun Heo <t...@kernel.org>
Cc: Jan Kara <j...@suse.cz>
Cc: Jens Axboe <ax...@fb.com>
Suggested-by: Linus Torvalds <torva...@linux-foundation.org>
---
 include/linux/wait_bit.h | 17 +
 mm/backing-dev.c |  2 +-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/linux/wait_bit.h b/include/linux/wait_bit.h
index 9318b21..2b0072f 100644
--- a/include/linux/wait_bit.h
+++ b/include/linux/wait_bit.h
@@ -305,4 +305,21 @@ struct wait_bit_queue_entry {
__ret;  \
 })
 
+/**
+ * clear_and_wake_up_bit - clear a bit and wake up anyone waiting on that bit
+ *
+ * @bit: the bit of the word being waited on
+ * @word: the word being waited on, a kernel virtual address
+ *
+ * You can use this helper if bitflags are manipulated atomically rather than
+ * non-atomically under a lock.
+ */
+static inline void clear_and_wake_up_bit(int bit, void *word)
+{
+   clear_bit_unlock(bit, word);
+   /* See wake_up_bit() for which memory barrier you need to use. */
+   smp_mb__after_atomic();
+   wake_up_bit(word, bit);
+}
+
 #endif /* _LINUX_WAIT_BIT_H */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 023190c..fa5e6d7 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -383,7 +383,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
 * the barrier provided by test_and_clear_bit() above.
 */
smp_wmb();
-	clear_bit(WB_shutting_down, &wb->state);
+	clear_and_wake_up_bit(WB_shutting_down, &wb->state);
 }
 
 static void wb_exit(struct bdi_writeback *wb)
-- 
1.8.3.1
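
The bug class this helper guards against is the classic lost wakeup: the
predicate changes but nobody wakes the sleeper. A userspace condition-variable
analog (illustrative only; build with "gcc -pthread", and add -DFIXED to see
the corrected behavior):

--
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int shutting_down = 1;	/* stand-in for WB_shutting_down */

static void *waiter(void *arg)	/* like wait_on_bit() in wb_shutdown() */
{
	pthread_mutex_lock(&lock);
	while (shutting_down)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	printf("waiter woke up\n");
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, waiter, NULL);
	sleep(1);
	pthread_mutex_lock(&lock);
	shutting_down = 0;	/* like clear_bit(WB_shutting_down, ...) */
#ifdef FIXED
	pthread_cond_broadcast(&cond);	/* the missing wake_up_bit() */
#endif
	pthread_mutex_unlock(&lock);
	sleep(1);	/* without -DFIXED, "waiter woke up" never prints */
	return 0;
}
--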


Re: INFO: task hung in wb_shutdown (2)

2018-05-01 Thread Tetsuo Handa
Tejun, Jan, Jens,

Can you review this patch? syzbot has hit this bug nearly 4000 times but
is still unable to find a reproducer. Therefore, the only way to test would be
to apply this patch upstream and test whether the problem is solved.

On 2018/04/24 21:19, Tetsuo Handa wrote:
>>From 39ed6be8a2c12dfe54feaa5abbc2ec46103022bf Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Date: Tue, 24 Apr 2018 11:59:08 +0900
> Subject: [PATCH] bdi: wake up concurrent wb_shutdown() callers.
> 
> syzbot is reporting hung tasks at wait_on_bit(WB_shutting_down) in
> wb_shutdown() [1]. This might be because commit 5318ce7d46866e1d ("bdi:
> Shutdown writeback on all cgwbs in cgwb_bdi_destroy()") forgot to call
> wake_up_bit(WB_shutting_down) after clear_bit(WB_shutting_down).
> 
> [1] 
> https://syzkaller.appspot.com/bug?id=b297474817af98d5796bc544e1bb806fc3da0e5e
> 
> Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Reported-by: syzbot <syzbot+c0cf869505e03bdf1...@syzkaller.appspotmail.com>
> Fixes: 5318ce7d46866e1d ("bdi: Shutdown writeback on all cgwbs in 
> cgwb_bdi_destroy()")
> Cc: Tejun Heo <t...@kernel.org>
> Cc: Jan Kara <j...@suse.cz>
> Cc: Jens Axboe <ax...@fb.com>
> ---
>  mm/backing-dev.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 023190c..dadac99 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -384,6 +384,8 @@ static void wb_shutdown(struct bdi_writeback *wb)
>*/
>   smp_wmb();
>   clear_bit(WB_shutting_down, &wb->state);
> + smp_mb(); /* advised by wake_up_bit() */
> + wake_up_bit(&wb->state, WB_shutting_down);
>  }
>  
>  static void wb_exit(struct bdi_writeback *wb)
> 



Re: INFO: rcu detected stall in blkdev_ioctl

2018-04-28 Thread Tetsuo Handa
Like I noted in a patch at

https://groups.google.com/d/msg/syzkaller-bugs/2Rw8-OM6IbM/PzdobV8kAgAJ

loop module is not thread safe. Can we use more global lock?



Re: write call hangs in kernel space after virtio hot-remove

2018-04-26 Thread Tetsuo Handa
Tejun Heo wrote:
> (cc'ing Tetuso and quoting whole message)
> 
> Tetuso, could this be the same problem as the hang in wb_shutdown()
> syszbot reported?

Excuse me, but I'm too unfamiliar to judge it. ;-)

Anyway, since Fabiano has a reproducer, I appreciate trying a patch at
https://groups.google.com/forum/#!msg/syzkaller-bugs/qeR_1VM0UvU/1zcLE0Y4BAAJ .

Also, Fabiano's attempt might be somewhat related to a NULL pointer race 
condition
at 
https://groups.google.com/forum/#!msg/syzkaller-bugs/bVx4WExoTDw/G68X4kPcAQAJ .


Re: general protection fault in wb_workfn

2018-04-23 Thread Tetsuo Handa
On 2018/04/23 19:09, Tetsuo Handa wrote:
> By the way, I got a newbie question regarding commit 5318ce7d46866e1d ("bdi:
> Shutdown writeback on all cgwbs in cgwb_bdi_destroy()"). It uses clear_bit()
> to clear WB_shutting_down bit so that threads waiting at wait_on_bit() will
> wake up. But clear_bit() itself does not wake up threads, does it? Who wakes
> them up (e.g. by calling wake_up_bit()) after clear_bit() was called?
> 

Below report might be waiting for wake_up_bit() ?

  INFO: task hung in wb_shutdown (2)
  https://syzkaller.appspot.com/bug?id=b297474817af98d5796bc544e1bb806fc3da0e5e


Re: general protection fault in wb_workfn

2018-04-23 Thread Tetsuo Handa
On 2018/04/20 1:05, syzbot wrote:
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault:  [#1] SMP KASAN
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Modules linked in:
> CPU: 0 PID: 28 Comm: kworker/u4:2 Not tainted 4.16.0-rc7+ #368
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Workqueue: writeback wb_workfn
> RIP: 0010:dev_name include/linux/device.h:981 [inline]
> RIP: 0010:wb_workfn+0x1a2/0x16b0 fs/fs-writeback.c:1936
> RSP: 0018:8801d951f038 EFLAGS: 00010206
> RAX: dc00 RBX:  RCX: 81bf6ea5
> RDX: 000a RSI: 87b44840 RDI: 0050
> RBP: 8801d951f558 R08: 11003b2a3def R09: 0004
> R10: 8801d951f438 R11: 0004 R12: 0100
> R13: 8801baee0dc0 R14: 8801d951f530 R15: 8801baee10d8
> FS:  () GS:8801db20() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 0047ff80 CR3: 07a22006 CR4: 001626f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  process_one_work+0xc47/0x1bb0 kernel/workqueue.c:2113
>  process_scheduled_works kernel/workqueue.c:2173 [inline]
>  worker_thread+0xa4b/0x1990 kernel/workqueue.c:2252
>  kthread+0x33c/0x400 kernel/kthread.c:238
>  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:406

This report says that wb->bdi->dev == NULL

  static inline const char *dev_name(const struct device *dev)
  {
/* Use the init name until the kobject becomes available */
if (dev->init_name)
  return dev->init_name;
  
return kobject_name(>kobj);
  }

  void wb_workfn(struct work_struct *work)
  {
  (...snipped...)
 set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
  (...snipped...)
  }

immediately after ioctl(LOOP_CTL_REMOVE) was requested. It is plausible
because ioctl(LOOP_CTL_REMOVE) sets bdi->dev to NULL after returning from
wb_shutdown().

loop_control_ioctl(LOOP_CTL_REMOVE) {
  loop_remove(lo) {
del_gendisk(lo->lo_disk) {
  bdi_unregister(disk->queue->backing_dev_info) {
bdi_remove_from_list(bdi);
wb_shutdown(>wb);
cgwb_bdi_unregister(bdi);
if (bdi->dev) {
  bdi_debug_unregister(bdi);
  device_unregister(bdi->dev);
  bdi->dev = NULL;
}
  }
}
  }
}

For some reason wb_shutdown() is not waiting for wb_workfn() to complete
( or something queues again after WB_registered bit was cleared ) ?

Anyway, I think that this is block layer problem rather than fs layer problem.



By the way, I got a newbie question regarding commit 5318ce7d46866e1d ("bdi:
Shutdown writeback on all cgwbs in cgwb_bdi_destroy()"). It uses clear_bit()
to clear WB_shutting_down bit so that threads waiting at wait_on_bit() will
wake up. But clear_bit() itself does not wake up threads, does it? Who wakes
them up (e.g. by calling wake_up_bit()) after clear_bit() was called?


Re: WARNING: lock held when returning to user space!

2018-04-14 Thread Tetsuo Handa
The patch was sent to linux.git as commit bdac616db9bbadb9.

#syz fix: loop: fix LOOP_GET_STATUS lock imbalance



Re: [PATCH] loop: fix LOOP_GET_STATUS lock imbalance

2018-04-07 Thread Tetsuo Handa
Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Commit 2d1d4c1e591f made loop_get_status() drop lo_ctl_mutex before
> returning, but the loop_get_status_old(), loop_get_status64(), and
> loop_get_status_compat() wrappers don't call loop_get_status() if the
> passed argument is NULL. The callers expect that the lock is dropped, so
> make sure we drop it in that case, too.
> 
> Reported-by: syzbot+31e8daa8b3fc129e7...@syzkaller.appspotmail.com

Well, it is me who reported this bug before syzbot reports it. ;-)

> Fixes: 2d1d4c1e591f ("loop: don't call into filesystem while holding 
> lo_ctl_mutex")

But I feel we should revert 2d1d4c1e591f rather than applying this patch.

> Signed-off-by: Omar Sandoval 

If the reason of dropping the lock is to avoid deadlock caused by recursing
into the same lock from vfs_getattr(), it is correct thing to drop the lock.

But when the reason is that vfs_getattr() cannot return when NFS server is
dead, there is no point with dropping the lock. Anybody who tries to call
loop_get_status() will get stuck. It is commit 3148ffbdb9162baa ("loop:
use killable lock in ioctls") which actually helps minimizing number of
stuck processes when NFS server is dead if we didn't drop the lock.
If we drop the lock before calling vfs_getattr(), all threads who called
loop_get_status() will reach vfs_getattr() and get stuck, won't it?


Re: Hangs in balance_dirty_pages with arm-32 LPAE + highmem

2018-03-06 Thread Tetsuo Handa
Laura Abbott wrote:
> On 02/26/2018 06:28 AM, Michal Hocko wrote:
> > On Fri 23-02-18 11:51:41, Laura Abbott wrote:
> >> Hi,
> >>
> >> The Fedora arm-32 build VMs have a somewhat long standing problem
> >> of hanging when running mkfs.ext4 with a bunch of processes stuck
> >> in D state. This has been seen as far back as 4.13 but is still
> >> present on 4.14:
> >>
> > [...]
> >> This looks like everything is blocked on the writeback completing but
> >> the writeback has been throttled. According to the infra team, this problem
> >> is _not_ seen without LPAE (i.e. only 4G of RAM). I did see
> >> https://patchwork.kernel.org/patch/10201593/ but that doesn't seem to
> >> quite match since this seems to be completely stuck. Any suggestions to
> >> narrow the problem down?
> > 
> > How much dirtyable memory does the system have? We do allow only lowmem
> > to be dirtyable by default on 32b highmem systems. Maybe you have the
> > lowmem mostly consumed by the kernel memory. Have you tried to enable
> > highmem_is_dirtyable?
> > 
> 
> Setting highmem_is_dirtyable did fix the problem. The infrastructure
> people seemed satisfied enough with this (and are happy to have the
> machines back).

That's good.

> I'll see if they are willing to run a few more tests
> to get some more state information.

Well, I'm far from understanding what is happening in your case, but I'm
interested in other threads which were trying to allocate memory. Therefore,
I would appreciate it if they can take SysRq-m + SysRq-t rather than SysRq-w (as described
at http://akari.osdn.jp/capturing-kernel-messages.html ).

Code which assumes that kswapd can make progress can get stuck when kswapd
is blocked somewhere. And wbt_wait() seems to change behavior based on
current_is_kswapd(). If everyone is waiting for kswapd but kswapd cannot
make progress, I worry that it leads to hangups like your case.



Below is a totally different case which I got today, but an example of
whether SysRq-m + SysRq-t can give us some clues.

Running below program on CPU 0 (using "taskset -c 0") on 4.16-rc4 against XFS
can trigger OOM lockups (hangup without being able to invoke the OOM killer).

--
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
static char buffer[4096] = { };
char *buf = NULL;
unsigned long size;
unsigned long i;
for (i = 0; i < 1024; i++) {
if (fork() == 0) {
int fd;
snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
memset(buffer, 0, sizeof(buffer));
sleep(1);
while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer));
_exit(0);
}
}
for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
char *cp = realloc(buf, size);
if (!cp) {
size >>= 1;
break;
}
buf = cp;
}
sleep(2);
/* Will cause OOM due to overcommit */
for (i = 0; i < size; i += 4096)
buf[i] = 0;
return 0;
}
--

MM people love to ignore such kind of problem with "It is a DoS attack", but
only one CPU out of 8 CPUs is occupied by this program, which means that other
threads (including kernel threads doing memory reclaim activities) are free to
use idle CPUs 1-7 as they need. Also, while CPU 0 was really busy processing
hundreds of threads doing direct reclaim, idle CPUs 1-7 should be able to invoke
the OOM killer shortly because there should be already little to reclaim. Also,
writepending: did not decrease (and no disk I/O was observed) during the OOM
lockup. Thus, I don't know whether this is just an overloaded system.

[  660.035957] Node 0 Normal free:17056kB min:17320kB low:21648kB high:25976kB active_anon:570132kB inactive_anon:13452kB active_file:15136kB inactive_file:13296kB unevictable:0kB writepending:42320kB present:1048576kB managed:951188kB mlocked:0kB kernel_stack:22448kB pagetables:37304kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  709.498421] Node 0 Normal free:16920kB min:17320kB low:21648kB high:25976kB active_anon:570132kB inactive_anon:13452kB active_file:19180kB inactive_file:17640kB unevictable:0kB writepending:42740kB present:1048576kB managed:951188kB mlocked:0kB kernel_stack:22400kB pagetables:37304kB bounce:0kB free_pcp:248kB local_pcp:0kB free_cma:0kB
[  751.290146] Node 0 Normal free:16920kB min:17320kB low:21648kB high:25976kB active_anon:570132kB inactive_anon:13452kB active_file:14556kB inactive_file:14452kB unevictable:0kB writepending:42740kB present:1048576kB managed:951188kB mlocked:0kB kernel_stack:22400kB pagetables:37304kB bounce:0kB free_pcp:248kB local_pcp:0kB 

block: mempool allocation hangs under OOM. (Re: A pitfall of mempool?)

2017-04-27 Thread Tetsuo Handa
Hello.

I noticed a hang up where kswapd was unable to get memory for bio from mempool
at bio_alloc_bioset(GFP_NOFS) request at
http://lkml.kernel.org/r/201704252022.dfb26076.fmoqvfotjos...@i-love.sakura.ne.jp .
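
For readers unfamiliar with mempool semantics: mempool_alloc() first tries the
underlying allocator and then falls back to a preallocated reserve, sleeping
(rather than failing) once both are empty. A rough userspace analog of that
fallback (a simplified sketch, not the kernel code):

--
#include <stdio.h>
#include <stdlib.h>

#define RESERVE 4

static void *reserve_pool[RESERVE];	/* preallocated elements */
static int curr_nr = RESERVE;		/* like pool->curr_nr */

/* Try the normal allocator first; if it fails, dip into the reserve.
 * The kernel's mempool_alloc() would sleep (not return NULL) once
 * both are empty, which is exactly where kswapd got stuck here. */
static void *pool_alloc(size_t size, int allocator_works)
{
	void *p = allocator_works ? malloc(size) : NULL;

	if (p)
		return p;
	if (curr_nr > 0)
		return reserve_pool[--curr_nr];
	return NULL;
}

int main(void)
{
	int i;

	for (i = 0; i < RESERVE; i++)
		reserve_pool[i] = malloc(64);
	for (i = 0; i < 6; i++)	/* allocator "fails": reserve drains, then NULL */
		printf("alloc %d -> %p\n", i, pool_alloc(64, 0));
	return 0;
}
--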

Since there is no means to check whether kswapd was making progress, I tried
below patch on top of allocation watchdog patch at
http://lkml.kernel.org/r/1489578541-81526-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp .

--
 include/linux/gfp.h | 8 ++++++++
 kernel/hung_task.c  | 5 ++++-
 mm/mempool.c        | 9 ++++++++-
 mm/page_alloc.c     | 6 ++----
 4 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 2b1a44f5..cc16050 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -469,6 +469,14 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
return __alloc_pages_node(nid, gfp_mask, order);
 }
 
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+extern void start_memalloc_timer(const gfp_t gfp_mask, const int order);
+extern void stop_memalloc_timer(const gfp_t gfp_mask);
+#else
+#define start_memalloc_timer(gfp_mask, order) do { } while (0)
+#define stop_memalloc_timer(gfp_mask) do { } while (0)
+#endif
+
 #ifdef CONFIG_NUMA
 extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);
 
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 8f237c0..7d11e8e 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -17,6 +17,7 @@
 #include <linux/sysctl.h>
 #include <linux/utsname.h>
 #include <linux/sched/signal.h>
+#include <linux/console.h>
 #include <linux/sched/debug.h>
 #include <trace/events/sched.h>
 
@@ -149,7 +150,9 @@ static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
get_task_struct(g);
get_task_struct(t);
rcu_read_unlock();
-   cond_resched();
+   if (console_trylock())
+   console_unlock();
+   //cond_resched();
rcu_read_lock();
can_cont = pid_alive(g) && pid_alive(t);
put_task_struct(t);
diff --git a/mm/mempool.c b/mm/mempool.c
index 47a659d..8b449af 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -324,11 +324,14 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 
gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
 
+   start_memalloc_timer(gfp_temp, -1);
 repeat_alloc:
 
element = pool->alloc(gfp_temp, pool->pool_data);
-   if (likely(element != NULL))
+   if (likely(element != NULL)) {
+   stop_memalloc_timer(gfp_temp);
return element;
+   }
 
spin_lock_irqsave(&pool->lock, flags);
if (likely(pool->curr_nr)) {
@@ -341,6 +344,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 * for debugging.
 */
kmemleak_update_trace(element);
+   stop_memalloc_timer(gfp_temp);
return element;
}
 
@@ -350,13 +354,16 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 */
if (gfp_temp != gfp_mask) {
spin_unlock_irqrestore(&pool->lock, flags);
+   stop_memalloc_timer(gfp_temp);
gfp_temp = gfp_mask;
+   start_memalloc_timer(gfp_temp, -1);
goto repeat_alloc;
}
 
/* We must not sleep if !__GFP_DIRECT_RECLAIM */
if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
spin_unlock_irqrestore(&pool->lock, flags);
+   stop_memalloc_timer(gfp_temp);
return NULL;
}
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f539752..652ba4f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4008,7 +4008,7 @@ bool memalloc_maybe_stalling(void)
return false;
 }
 
-static void start_memalloc_timer(const gfp_t gfp_mask, const int order)
+void start_memalloc_timer(const gfp_t gfp_mask, const int order)
 {
struct memalloc_info *m = &current->memalloc;
 
@@ -4032,7 +4032,7 @@ static void start_memalloc_timer(const gfp_t gfp_mask, const int order)
m->in_flight++;
 }
 
-static void stop_memalloc_timer(const gfp_t gfp_mask)
+void stop_memalloc_timer(const gfp_t gfp_mask)
 {
struct memalloc_info *m = &current->memalloc;
 
@@ -4055,8 +4055,6 @@ static void memalloc_counter_fold(int cpu)
 }
 
 #else
-#define start_memalloc_timer(gfp_mask, order) do { } while (0)
-#define stop_memalloc_timer(gfp_mask) do { } while (0)
 #define memalloc_counter_fold(cpu) do { } while (0)
 #endif
 
--

These patches reported that
mempool_alloc(GFP_NOFS|__GFP_NOWARN|__GFP_NORETRY|__GFP_NOMEMALLOC) was not
able to get memory for more than 15 minutes (and was thus unable to submit a
block I/O request for reclaiming memory via FS writeback, thus unable to
unblock __GFP_FS allocation requests waiting for FS writeback, thus unable to
invoke the OOM killer for reclaiming memory for more than 12 minutes, thus
unable to unblock mempool_alloc(), and thus a silent hang up).
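For readers unfamiliar with mempool internals, the hang is easiest to see in a
stripped-down model of the mempool_alloc() wait loop (a userspace sketch, not
the kernel code): once the reserved elements are exhausted, a caller that is
allowed to sleep (__GFP_DIRECT_RECLAIM) waits until mempool_free() returns an
element, and nothing wakes it up if freeing in turn depends on this very
allocation making progress.

--
#include <pthread.h>

#define POOL_SIZE 8

struct mempool {
	pthread_mutex_t lock;
	pthread_cond_t wait;		/* models pool->wait */
	void *elements[POOL_SIZE];	/* pre-allocated reserve */
	int curr_nr;			/* how many are still in reserve */
};

/* models mempool_alloc() for a caller that may sleep (__GFP_DIRECT_RECLAIM) */
static void *mempool_alloc_model(struct mempool *pool)
{
	void *element;

	pthread_mutex_lock(&pool->lock);
	while (pool->curr_nr == 0) {
		/*
		 * Reserve exhausted: wait for mempool_free() to return an
		 * element.  In the reported hang this never happens, because
		 * the in-flight users cannot finish without memory either,
		 * so the caller sleeps forever without any warning.
		 */
		pthread_cond_wait(&pool->wait, &pool->lock);
	}
	element = pool->elements[--pool->curr_nr];
	pthread_mutex_unlock(&pool->lock);
	return element;
}

/* models mempool_free(): returning an element is the only wake-up source */
static void mempool_free_model(struct mempool *pool, void *element)
{
	pthread_mutex_lock(&pool->lock);
	pool->elements[pool->curr_nr++] = element;
	pthread_cond_signal(&pool->wait);
	pthread_mutex_unlock(&pool->lock);
}

int main(void)
{
	struct mempool pool = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.wait = PTHREAD_COND_INITIALIZER,
		.curr_nr = 0,
	};
	static char reserve[POOL_SIZE];

	mempool_free_model(&pool, &reserve[0]);	/* stock one element */
	return mempool_alloc_model(&pool) ? 0 : 1;
}
--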

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170427.txt.xz .
--
[0.00] Linux version 4.11.0-rc8-next-20170426+ 

[4.11-rc1 block]: Boot failure due to memory corruption.

2017-03-08 Thread Tetsuo Handa
Hello.

I noticed that 4.11-rc1 crashes upon boot.


[5.358848] Fusion MPT base driver 3.04.20
[5.360468] Copyright (c) 1999-2008 LSI Corporation
[5.377993] e1000: Intel(R) PRO/1000 Network Driver - version 7.3.21-k8-NAPI
[5.380697] e1000: Copyright (c) 1999-2006 Intel Corporation.
[5.383907] Fusion MPT SPI Host driver 3.04.20
[5.384772] scsi host0: ata_piix
[5.389732] scsi host1: ata_piix
[5.391435] ata1: PATA max UDMA/33 cmd 0x1f0 ctl 0x3f6 bmdma 0x1060 irq 14
[5.400844] ata2: PATA max UDMA/33 cmd 0x170 ctl 0x376 bmdma 0x1068 irq 15
[5.408349] mptbase: ioc0: Initiating bringup
[5.446375] ioc0: LSI53C1030 B0: Capabilities={Initiator}
[5.535645] scsi host2: ioc0: LSI53C1030 B0, FwRev=01032920h, Ports=1, MaxQ=128, IRQ=17
[5.575721] ata2.00: ATAPI: VMware Virtual IDE CDROM Drive, 0001, max UDMA/33
[5.583595] ata2.00: configured for UDMA/33
[5.592101] scsi 2:0:0:0: Direct-Access VMware,  VMware Virtual S 1.0  PQ: 0 ANSI: 2
[5.595206] scsi target2:0:0: Beginning Domain Validation
[5.598405] scsi 1:0:0:0: CD-ROMNECVMWar VMware IDE CDR10 1.00 PQ: 0 ANSI: 5
[5.605254] [drm] DMA map mode: Using physical TTM page addresses.
[5.605571] scsi target2:0:0: Domain Validation skipping write tests
[5.605573] scsi target2:0:0: Ending Domain Validation
[5.605696] scsi target2:0:0: FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 127)
[5.625342] BUG: unable to handle kernel paging request at 88006b443e58
[5.625349] IP: rb_erase+0x301/0x350
[5.625350] PGD 31ae067 
[5.625350] PUD 31b1067 
[5.625351] PMD 7fda6067 
[5.625351] PTE 80006b443060
[5.625352] 
[5.625353] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
[5.625354] Modules linked in: serio_raw vmwgfx(+) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mptspi drm ata_piix scsi_transport_spi mptscsih e1000(+) i2c_core mptbase libata
[5.625366] CPU: 3 PID: 9 Comm: rcuos/0 Not tainted 4.11.0-rc1 #112
[5.625367] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[5.625368] task: 880074970240 task.stack: c936c000
[5.625369] RIP: 0010:rb_erase+0x301/0x350
[5.625370] RSP: 0018:c936fe18 EFLAGS: 00010046
[5.625371] RAX:  RBX: 88006ae742c0 RCX: 
[5.625372] RDX: 0001 RSI: 88006b443e58 RDI: 88006ae742e0
[5.625372] RBP: c936fe18 R08:  R09: 0001
[5.625373] R10:  R11:  R12: 0246
[5.625373] R13: 88006b440800 R14: 81359c30 R15: 88006b7c77a0
[5.625374] FS:  () GS:88007580() knlGS:
[5.625375] CS:  0010 DS:  ES:  CR0: 80050033
[5.625376] CR2: 88006b443e58 CR3: 6a6c7000 CR4: 001406e0
[5.625408] Call Trace:
[5.625411]  wb_congested_put+0x6a/0xb0
[5.625414]  __blkg_release_rcu+0xe3/0x1c0
[5.625418]  rcu_nocb_kthread+0x1c8/0x570
[5.625419]  ? rcu_nocb_kthread+0xf8/0x570
[5.625424]  kthread+0x10a/0x140
[5.625425]  ? rcu_all_qs+0x90/0x90
[5.625427]  ? kthread_create_on_node+0x60/0x60
[5.625429]  ret_from_fork+0x31/0x40
[5.625432] Code: 01 4c 89 07 49 89 c8 eb 9a 48 89 16 48 89 fa e9 e7 fe ff ff 48 89 51 10 48 89 fa e9 db fe ff ff 4c 89 06 5d c3 4c 89 42 10 5d c3 <4c> 89 06 e9 b1 fd ff ff 83 e2 01 0f 85 e1 fd ff ff 5d c3 4c 89
[5.625453] RIP: rb_erase+0x301/0x350 RSP: c936fe18
[5.625453] CR2: 88006b443e58
[5.625502] ---[ end trace 0415a22853a9a611 ]---
[5.625503] Kernel panic - not syncing: Fatal exception in interrupt
[6.708465] Shutting down cpus with NMI
[6.708610] Kernel Offset: disabled
[6.789338] ---[ end Kernel panic - not syncing: Fatal exception in interrupt


Judging from the timing, I suspect that this problem is caused by memory
corruption introduced by commit 165a5e22fafb127 ("block: Move bdi_unregister()
to del_gendisk()").


# bad: [a3b4924b027f9a4b95ce89a914c1e0459e76f18a] Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
# good: [15dd03811d99dcf828f4eeb2c2b6a02ddc1201c7] scsi: megaraid_sas: NVME Interface detection and prop settings
# good: [857de6e00778738dc3d61f75acbac35bdc48e533] scsi: use 'scsi_device_from_queue()' for scsi_dh
# good: [825c6abbc316f496cd2b66e1fa72892cf4b49a9f] scsi: lpfc: use proper format string for dma_addr_t
# good: [54d7989f476ca57fc3c5cc71524c480ccb74c481] Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
# good: [41dc529a4602ac737020f423f84686a81de38e6d] qla2xxx: Improve RSCN handling in driver
# good: [821fd6f6cb6500cd04a6c7e8f701f9b311a5c2b3] Merge branch 'for-next' of