Re: [PATCH 8/8] aio: support for IO polling

2018-11-22 Thread Jan Kara


On Tue 20-11-18 10:19:53, Jens Axboe wrote:
> +/*
> + * We can't just wait for polled events to come to us, we have to actively
> + * find and complete them.
> + */
> +static void aio_iopoll_reap_events(struct kioctx *ctx)
> +{
> + if (!(ctx->flags & IOCTX_FLAG_IOPOLL))
> + return;
> +
> + while (!list_empty_careful(&ctx->poll_submitted) ||
> +!list_empty(&ctx->poll_completing)) {
> + unsigned int nr_events = 0;
> +
> + __aio_iopoll_check(ctx, NULL, &nr_events, 1, UINT_MAX);
> + }
> +}
> +
> +static int aio_iopoll_check(struct kioctx *ctx, long min_nr, long nr,
> + struct io_event __user *event)
> +{
> + unsigned int nr_events = 0;
> + int ret = 0;
> +
> + /* Only allow one thread polling at a time */
> + if (test_and_set_bit(0, &ctx->getevents_busy))
> + return -EBUSY;
> +
> + while (!nr_events || !need_resched()) {
> + int tmin = 0;
> +
> + if (nr_events < min_nr)
> + tmin = min_nr - nr_events;
> +
> + ret = __aio_iopoll_check(ctx, event, &nr_events, tmin, nr);
> + if (ret <= 0)
> + break;
> + ret = 0;
> + }
> +
> + clear_bit(0, &ctx->getevents_busy);
> + return nr_events ? nr_events : ret;
> +}

Hum, what if userspace calls io_destroy() while another process is polling
for events on the same kioctx? It seems we'd be reaping events from two
processes in parallel in that case which will result in various
"interesting" effects like ctx->poll_completing list corruption...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 0/16 v3] loop: Fix oops and possible deadlocks

2018-11-12 Thread Jan Kara
On Thu 08-11-18 16:28:11, Theodore Y. Ts'o wrote:
> On Thu, Nov 08, 2018 at 02:01:00PM +0100, Jan Kara wrote:
> > Hi,
> > 
> > this patch series fixes oops and possible deadlocks as reported by
> > syzbot [1] [2]. The second patch in the series (from Tetsuo) fixes the
> > oops, the remaining patches are cleaning up the locking in the loop
> > driver so that we can in the end reasonably easily switch to rereading
> > partitions without holding mutex protecting the loop device.
> > 
> > I have tested the patches by creating, deleting, modifying loop devices,
> > and by running loop blktests (as well as creating new ones with the load
> > syzkaller has used to detect the problem). Review is welcome but I think
> > the patches are fine to go as far as I'm concerned! Jens, can you please
> > pick them up?
> > 
> > Changes since v1:
> > * Added patch moving fput() calls in loop_change_fd() from under 
> > loop_ctl_mutex
> > * Fixed bug in loop_control_ioctl() where it failed to return error properly
> > 
> > Changes since v2:
> > * Rebase on top of 4.20-rc1
> > * Add patch to stop fooling lockdep about loop_ctl_mutex
> > 
> > Honza
> 
> Thanks for working on fixing up the Loop driver to fix these races!
> 
> Is it worth adding some Cc: sta...@kernel.org lines?  Figuring out
> which Fixes they should apply to might be tricky, and from my
> experience because of some of the recent loop work, backporting to
> older stable kernels is not necessarily going to be trivial.  But
> since Dmitry also runs Syzkaller on stable kernels, it'd be great if
> we could get them backported without relying on Sasha's AUTOSTABLE.

That's a fair request but generally I've found this too intrusive for
stable. Patch 2/16 should be relatively easy to backport and closes the
possible use-after-free, which is the nastiest of the problems (but also
so rare that I was never able to hit it in my testing, and syzbot hit it
only a couple of times to date), so a CC to stable might make sense there.
The rest fix possible deadlocks that can be triggered only by root bashing
on reconfiguration of loop devices - IMO not worth the hassle for stable.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 0/16 v3] loop: Fix oops and possible deadlocks

2018-11-08 Thread Jan Kara
On Thu 08-11-18 06:21:21, Jens Axboe wrote:
> On 11/8/18 6:01 AM, Jan Kara wrote:
> > Hi,
> > 
> > this patch series fixes oops and possible deadlocks as reported by
> > syzbot [1] [2]. The second patch in the series (from Tetsuo) fixes the
> > oops, the remaining patches are cleaning up the locking in the loop
> > driver so that we can in the end reasonably easily switch to rereading
> > partitions without holding mutex protecting the loop device.
> > 
> > I have tested the patches by creating, deleting, modifying loop devices,
> > and by running loop blktests (as well as creating new ones with the load
> > syzkaller has used to detect the problem). Review is welcome but I think
> > the patches are fine to go as far as I'm concerned! Jens, can you please
> > pick them up?
> 
> I've been waiting for this series - it's a big series for 4.20, though... It
> would suck having to defer this to 4.21 since it's a long-standing issue, but
> the risk of a new regression is present as well.
> 
> Are you fine with this going in for 4.21?

Yeah, I'm fine with the delay so that it has time to soak in linux-next.

Honza
-- 
Jan Kara 
SUSE Labs, CR


[PATCH 06/16] loop: Split setting of lo_state from loop_clr_fd

2018-11-08 Thread Jan Kara
Move setting of lo_state to Lo_rundown out into the callers. That will
allow us to unlock loop_ctl_mutex while the loop device is protected
from other changes by its special state.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 52 +++-
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 4c37578989c4..eb01a685da4e 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -975,7 +975,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
loop_reread_partitions(lo, bdev);
 
/* Grab the block_device to prevent its destruction after we
-* put /dev/loopXX inode. Later in loop_clr_fd() we bdput(bdev).
+* put /dev/loopXX inode. Later in __loop_clr_fd() we bdput(bdev).
 */
bdgrab(bdev);
return 0;
@@ -1025,31 +1025,15 @@ loop_init_xfer(struct loop_device *lo, struct 
loop_func_table *xfer,
return err;
 }
 
-static int loop_clr_fd(struct loop_device *lo)
+static int __loop_clr_fd(struct loop_device *lo)
 {
struct file *filp = lo->lo_backing_file;
gfp_t gfp = lo->old_gfp_mask;
struct block_device *bdev = lo->lo_device;
 
-   if (lo->lo_state != Lo_bound)
+   if (WARN_ON_ONCE(lo->lo_state != Lo_rundown))
return -ENXIO;
 
-   /*
-* If we've explicitly asked to tear down the loop device,
-* and it has an elevated reference count, set it for auto-teardown when
-* the last reference goes away. This stops $!~#$@ udev from
-* preventing teardown because it decided that it needs to run blkid on
-* the loopback device whenever they appear. xfstests is notorious for
-* failing tests because blkid via udev races with a losetup
-* /do something like mkfs/losetup -d  causing the losetup -d
-* command to fail with EBUSY.
-*/
-   if (atomic_read(&lo->lo_refcnt) > 1) {
-   lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
-   mutex_unlock(&loop_ctl_mutex);
-   return 0;
-   }
-
if (filp == NULL)
return -EINVAL;
 
@@ -1057,7 +1041,6 @@ static int loop_clr_fd(struct loop_device *lo)
blk_mq_freeze_queue(lo->lo_queue);
 
spin_lock_irq(&lo->lo_lock);
-   lo->lo_state = Lo_rundown;
lo->lo_backing_file = NULL;
spin_unlock_irq(&lo->lo_lock);
 
@@ -1110,6 +1093,30 @@ static int loop_clr_fd(struct loop_device *lo)
return 0;
 }
 
+static int loop_clr_fd(struct loop_device *lo)
+{
+   if (lo->lo_state != Lo_bound)
+   return -ENXIO;
+   /*
+* If we've explicitly asked to tear down the loop device,
+* and it has an elevated reference count, set it for auto-teardown when
+* the last reference goes away. This stops $!~#$@ udev from
+* preventing teardown because it decided that it needs to run blkid on
+* the loopback device whenever they appear. xfstests is notorious for
+* failing tests because blkid via udev races with a losetup
+* /do something like mkfs/losetup -d  causing the losetup -d
+* command to fail with EBUSY.
+*/
+   if (atomic_read(&lo->lo_refcnt) > 1) {
+   lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
+   mutex_unlock(&loop_ctl_mutex);
+   return 0;
+   }
+   lo->lo_state = Lo_rundown;
+
+   return __loop_clr_fd(lo);
+}
+
 static int
 loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 {
@@ -1691,11 +1698,14 @@ static void lo_release(struct gendisk *disk, fmode_t 
mode)
goto out_unlock;
 
if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
+   if (lo->lo_state != Lo_bound)
+   goto out_unlock;
+   lo->lo_state = Lo_rundown;
/*
 * In autoclear mode, stop the loop thread
 * and remove configuration after last close.
 */
-   err = loop_clr_fd(lo);
+   err = __loop_clr_fd(lo);
if (!err)
return;
} else if (lo->lo_state == Lo_bound) {
-- 
2.16.4



[PATCH 16/16] loop: Get rid of 'nested' acquisition of loop_ctl_mutex

2018-11-08 Thread Jan Kara
The nested acquisition of loop_ctl_mutex (->lo_ctl_mutex back then) has
been introduced by commit f028f3b2f987e "loop: fix circular locking in
loop_clr_fd()" to fix lockdep complains about bd_mutex being acquired
after lo_ctl_mutex during partition rereading. Now that these are
properly fixed, let's stop fooling lockdep.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 112afc9bc604..bf6bc35aaf88 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -681,7 +681,7 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
int error;
boolpartscan;
 
-   error = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   error = mutex_lock_killable(&loop_ctl_mutex);
if (error)
return error;
error = -ENXIO;
@@ -919,7 +919,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
if (!file)
goto out;
 
-   error = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   error = mutex_lock_killable(&loop_ctl_mutex);
if (error)
goto out_putf;
 
@@ -1135,7 +1135,7 @@ static int loop_clr_fd(struct loop_device *lo)
 {
int err;
 
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   err = mutex_lock_killable(&loop_ctl_mutex);
if (err)
return err;
if (lo->lo_state != Lo_bound) {
@@ -1172,7 +1172,7 @@ loop_set_status(struct loop_device *lo, const struct 
loop_info64 *info)
struct block_device *bdev;
bool partscan = false;
 
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   err = mutex_lock_killable(&loop_ctl_mutex);
if (err)
return err;
if (lo->lo_encrypt_key_size &&
@@ -1277,7 +1277,7 @@ loop_get_status(struct loop_device *lo, struct 
loop_info64 *info)
struct kstat stat;
int ret;
 
-   ret = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   ret = mutex_lock_killable(&loop_ctl_mutex);
if (ret)
return ret;
if (lo->lo_state != Lo_bound) {
@@ -1466,7 +1466,7 @@ static int lo_simple_ioctl(struct loop_device *lo, 
unsigned int cmd,
 {
int err;
 
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   err = mutex_lock_killable(&loop_ctl_mutex);
if (err)
return err;
switch (cmd) {
-- 
2.16.4



[PATCH 11/16] loop: Push loop_ctl_mutex down to loop_change_fd()

2018-11-08 Thread Jan Kara
Push loop_ctl_mutex down to loop_change_fd(). We will need this to be
able to call loop_reread_partitions() without loop_ctl_mutex.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 161e2a08f2e8..ea5e313908b1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -691,19 +691,22 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
struct file *file, *old_file;
int error;
 
+   error = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (error)
+   return error;
error = -ENXIO;
if (lo->lo_state != Lo_bound)
-   goto out;
+   goto out_unlock;
 
/* the loop device has to be read-only */
error = -EINVAL;
if (!(lo->lo_flags & LO_FLAGS_READ_ONLY))
-   goto out;
+   goto out_unlock;
 
error = -EBADF;
file = fget(arg);
if (!file)
-   goto out;
+   goto out_unlock;
 
error = loop_validate_file(file, bdev);
if (error)
@@ -730,11 +733,13 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
fput(old_file);
if (lo->lo_flags & LO_FLAGS_PARTSCAN)
loop_reread_partitions(lo, bdev);
+   mutex_unlock(&loop_ctl_mutex);
return 0;
 
- out_putf:
+out_putf:
fput(file);
- out:
+out_unlock:
+   mutex_unlock(&loop_ctl_mutex);
return error;
 }
 
@@ -1469,12 +1474,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
case LOOP_SET_FD:
return loop_set_fd(lo, mode, bdev, arg);
case LOOP_CHANGE_FD:
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
-   err = loop_change_fd(lo, bdev, arg);
-   mutex_unlock(&loop_ctl_mutex);
-   break;
+   return loop_change_fd(lo, bdev, arg);
case LOOP_CLR_FD:
return loop_clr_fd(lo);
case LOOP_SET_STATUS:
-- 
2.16.4



[PATCH 09/16] loop: Push loop_ctl_mutex down to loop_set_status()

2018-11-08 Thread Jan Kara
Push loop_ctl_mutex down to loop_set_status(). We will need this to be
able to call loop_reread_partitions() without loop_ctl_mutex.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 51 +--
 1 file changed, 25 insertions(+), 26 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 2e814f8af4df..af79a59732b7 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1141,46 +1141,55 @@ loop_set_status(struct loop_device *lo, const struct 
loop_info64 *info)
struct loop_func_table *xfer;
kuid_t uid = current_uid();
 
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
if (lo->lo_encrypt_key_size &&
!uid_eq(lo->lo_key_owner, uid) &&
-   !capable(CAP_SYS_ADMIN))
-   return -EPERM;
-   if (lo->lo_state != Lo_bound)
-   return -ENXIO;
-   if ((unsigned int) info->lo_encrypt_key_size > LO_KEY_SIZE)
-   return -EINVAL;
+   !capable(CAP_SYS_ADMIN)) {
+   err = -EPERM;
+   goto out_unlock;
+   }
+   if (lo->lo_state != Lo_bound) {
+   err = -ENXIO;
+   goto out_unlock;
+   }
+   if ((unsigned int) info->lo_encrypt_key_size > LO_KEY_SIZE) {
+   err = -EINVAL;
+   goto out_unlock;
+   }
 
/* I/O need to be drained during transfer transition */
blk_mq_freeze_queue(lo->lo_queue);
 
err = loop_release_xfer(lo);
if (err)
-   goto exit;
+   goto out_unfreeze;
 
if (info->lo_encrypt_type) {
unsigned int type = info->lo_encrypt_type;
 
if (type >= MAX_LO_CRYPT) {
err = -EINVAL;
-   goto exit;
+   goto out_unfreeze;
}
xfer = xfer_funcs[type];
if (xfer == NULL) {
err = -EINVAL;
-   goto exit;
+   goto out_unfreeze;
}
} else
xfer = NULL;
 
err = loop_init_xfer(lo, xfer, info);
if (err)
-   goto exit;
+   goto out_unfreeze;
 
if (lo->lo_offset != info->lo_offset ||
lo->lo_sizelimit != info->lo_sizelimit) {
if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit)) {
err = -EFBIG;
-   goto exit;
+   goto out_unfreeze;
}
}
 
@@ -1212,7 +1221,7 @@ loop_set_status(struct loop_device *lo, const struct 
loop_info64 *info)
/* update dio if lo_offset or transfer is changed */
__loop_update_dio(lo, lo->use_dio);
 
- exit:
+out_unfreeze:
blk_mq_unfreeze_queue(lo->lo_queue);
 
if (!err && (info->lo_flags & LO_FLAGS_PARTSCAN) &&
@@ -1221,6 +1230,8 @@ loop_set_status(struct loop_device *lo, const struct 
loop_info64 *info)
lo->lo_disk->flags &= ~GENHD_FL_NO_PART_SCAN;
loop_reread_partitions(lo, lo->lo_device);
}
+out_unlock:
+   mutex_unlock(&loop_ctl_mutex);
 
return err;
 }
@@ -1467,12 +1478,8 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
case LOOP_SET_STATUS:
err = -EPERM;
if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
err = loop_set_status_old(lo,
(struct loop_info __user *)arg);
-   mutex_unlock(&loop_ctl_mutex);
}
break;
case LOOP_GET_STATUS:
@@ -1480,12 +1487,8 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
case LOOP_SET_STATUS64:
err = -EPERM;
if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
err = loop_set_status64(lo,
(struct loop_info64 __user *) arg);
-   mutex_unlock(&loop_ctl_mutex);
}
break;
case LOOP_GET_STATUS64:
@@ -1630,12 +1633,8 @@ static int lo_compat_ioctl(struct block_device *bdev, 
fmode_t mode,
 
switch(cmd) {
case LOOP_SET_STATUS:
-   err = mutex_lock_killable(&loop_ctl_mutex);
-   if (!err) {
-   err = loop_set_status_compat(lo,
-(const struct 
compat_loop_info __user *)a

[PATCH 13/16] loop: Move loop_reread_partitions() out of loop_ctl_mutex

2018-11-08 Thread Jan Kara
Calling loop_reread_partitions() under loop_ctl_mutex causes lockdep to
complain about circular lock dependency between bdev->bd_mutex and
lo->lo_ctl_mutex. The problem is that on loop device open or close
lo_open() and lo_release() get called with bdev->bd_mutex held and they
need to acquire loop_ctl_mutex. OTOH when loop_reread_partitions() is
called with loop_ctl_mutex held, it will call blkdev_reread_part() which
acquires bdev->bd_mutex. See syzbot report for details [1].

Move all calls of loop_reread_partitions() out of loop_ctl_mutex to
avoid the lockdep warning and fix the deadlock possibility.

[1] https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d1588

Reported-by: syzbot 

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f1d7a4fe30fc..cce5d4e8e863 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -679,6 +679,7 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
 {
struct file *file, *old_file;
int error;
+   boolpartscan;
 
error = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
if (error)
@@ -720,9 +721,10 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
blk_mq_unfreeze_queue(lo->lo_queue);
 
fput(old_file);
-   if (lo->lo_flags & LO_FLAGS_PARTSCAN)
-   loop_reread_partitions(lo, bdev);
+   partscan = lo->lo_flags & LO_FLAGS_PARTSCAN;
mutex_unlock(&loop_ctl_mutex);
+   if (partscan)
+   loop_reread_partitions(lo, bdev);
return 0;
 
 out_putf:
@@ -903,6 +905,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
int lo_flags = 0;
int error;
loff_t  size;
+   boolpartscan;
 
/* This is safe, since we have a reference from open(). */
__module_get(THIS_MODULE);
@@ -969,14 +972,15 @@ static int loop_set_fd(struct loop_device *lo, fmode_t 
mode,
lo->lo_state = Lo_bound;
if (part_shift)
lo->lo_flags |= LO_FLAGS_PARTSCAN;
-   if (lo->lo_flags & LO_FLAGS_PARTSCAN)
-   loop_reread_partitions(lo, bdev);
+   partscan = lo->lo_flags & LO_FLAGS_PARTSCAN;
 
/* Grab the block_device to prevent its destruction after we
 * put /dev/loopXX inode. Later in __loop_clr_fd() we bdput(bdev).
 */
bdgrab(bdev);
mutex_unlock(&loop_ctl_mutex);
+   if (partscan)
+   loop_reread_partitions(lo, bdev);
return 0;
 
 out_unlock:
@@ -1157,6 +1161,8 @@ loop_set_status(struct loop_device *lo, const struct 
loop_info64 *info)
int err;
struct loop_func_table *xfer;
kuid_t uid = current_uid();
+   struct block_device *bdev;
+   bool partscan = false;
 
err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
if (err)
@@ -1245,10 +1251,13 @@ loop_set_status(struct loop_device *lo, const struct 
loop_info64 *info)
 !(lo->lo_flags & LO_FLAGS_PARTSCAN)) {
lo->lo_flags |= LO_FLAGS_PARTSCAN;
lo->lo_disk->flags &= ~GENHD_FL_NO_PART_SCAN;
-   loop_reread_partitions(lo, lo->lo_device);
+   bdev = lo->lo_device;
+   partscan = true;
}
 out_unlock:
mutex_unlock(&loop_ctl_mutex);
+   if (partscan)
+   loop_reread_partitions(lo, bdev);
 
return err;
 }
-- 
2.16.4



[PATCH 14/16] loop: Fix deadlock when calling blkdev_reread_part()

2018-11-08 Thread Jan Kara
Calling blkdev_reread_part() under loop_ctl_mutex causes lockdep to
complain about circular lock dependency between bdev->bd_mutex and
lo->lo_ctl_mutex. The problem is that on loop device open or close
lo_open() and lo_release() get called with bdev->bd_mutex held and they
need to acquire loop_ctl_mutex. OTOH when loop_reread_partitions() is
called with loop_ctl_mutex held, it will call blkdev_reread_part() which
acquires bdev->bd_mutex. See syzbot report for details [1].

Move call to blkdev_reread_part() in __loop_clr_fd() from under
loop_ctl_mutex to finish fixing the lockdep warning and the possible
deadlock.

[1] https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d1588

Reported-by: syzbot 

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 28 
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cce5d4e8e863..b3f981ac8ef1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1030,12 +1030,14 @@ loop_init_xfer(struct loop_device *lo, struct 
loop_func_table *xfer,
return err;
 }
 
-static int __loop_clr_fd(struct loop_device *lo)
+static int __loop_clr_fd(struct loop_device *lo, bool release)
 {
struct file *filp = NULL;
gfp_t gfp = lo->old_gfp_mask;
struct block_device *bdev = lo->lo_device;
int err = 0;
+   bool partscan = false;
+   int lo_number;
 
mutex_lock(&loop_ctl_mutex);
if (WARN_ON_ONCE(lo->lo_state != Lo_rundown)) {
@@ -1088,7 +1090,15 @@ static int __loop_clr_fd(struct loop_device *lo)
module_put(THIS_MODULE);
blk_mq_unfreeze_queue(lo->lo_queue);
 
-   if (lo->lo_flags & LO_FLAGS_PARTSCAN && bdev) {
+   partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev;
+   lo_number = lo->lo_number;
+   lo->lo_flags = 0;
+   if (!part_shift)
+   lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
+   loop_unprepare_queue(lo);
+out_unlock:
+   mutex_unlock(&loop_ctl_mutex);
+   if (partscan) {
/*
 * bd_mutex has been held already in release path, so don't
 * acquire it if this function is called in such case.
@@ -1097,21 +1107,15 @@ static int __loop_clr_fd(struct loop_device *lo)
 * must be at least one and it can only become zero when the
 * current holder is released.
 */
-   if (!atomic_read(&lo->lo_refcnt))
+   if (release)
err = __blkdev_reread_part(bdev);
else
err = blkdev_reread_part(bdev);
pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
-   __func__, lo->lo_number, err);
+   __func__, lo_number, err);
/* Device is gone, no point in returning error */
err = 0;
}
-   lo->lo_flags = 0;
-   if (!part_shift)
-   lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
-   loop_unprepare_queue(lo);
-out_unlock:
-   mutex_unlock(&loop_ctl_mutex);
/*
 * Need not hold loop_ctl_mutex to fput backing file.
 * Calling fput holding loop_ctl_mutex triggers a circular
@@ -1152,7 +1156,7 @@ static int loop_clr_fd(struct loop_device *lo)
lo->lo_state = Lo_rundown;
mutex_unlock(&loop_ctl_mutex);
 
-   return __loop_clr_fd(lo);
+   return __loop_clr_fd(lo, false);
 }
 
 static int
@@ -1713,7 +1717,7 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 * In autoclear mode, stop the loop thread
 * and remove configuration after last close.
 */
-   __loop_clr_fd(lo);
+   __loop_clr_fd(lo, true);
return;
} else if (lo->lo_state == Lo_bound) {
/*
-- 
2.16.4



[PATCH 08/16] loop: Push loop_ctl_mutex down to loop_get_status()

2018-11-08 Thread Jan Kara
Push loop_ctl_mutex down to loop_get_status() to avoid the unusual
convention that the function gets called with loop_ctl_mutex held and
releases it.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 37 ++---
 1 file changed, 10 insertions(+), 27 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d8a7b5da881b..2e814f8af4df 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1232,6 +1232,9 @@ loop_get_status(struct loop_device *lo, struct 
loop_info64 *info)
struct kstat stat;
int ret;
 
+   ret = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (ret)
+   return ret;
if (lo->lo_state != Lo_bound) {
mutex_unlock(&loop_ctl_mutex);
return -ENXIO;
@@ -1346,10 +1349,8 @@ loop_get_status_old(struct loop_device *lo, struct 
loop_info __user *arg) {
struct loop_info64 info64;
int err;
 
-   if (!arg) {
-   mutex_unlock(&loop_ctl_mutex);
+   if (!arg)
return -EINVAL;
-   }
err = loop_get_status(lo, &info64);
if (!err)
err = loop_info64_to_old(&info64, &info);
@@ -1364,10 +1365,8 @@ loop_get_status64(struct loop_device *lo, struct 
loop_info64 __user *arg) {
struct loop_info64 info64;
int err;
 
-   if (!arg) {
-   mutex_unlock(&loop_ctl_mutex);
+   if (!arg)
return -EINVAL;
-   }
err = loop_get_status(lo, &info64);
if (!err && copy_to_user(arg, &info64, sizeof(info64)))
err = -EFAULT;
@@ -1477,12 +1476,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
}
break;
case LOOP_GET_STATUS:
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
-   err = loop_get_status_old(lo, (struct loop_info __user *) arg);
-   /* loop_get_status() unlocks loop_ctl_mutex */
-   break;
+   return loop_get_status_old(lo, (struct loop_info __user *) arg);
case LOOP_SET_STATUS64:
err = -EPERM;
if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
@@ -1495,12 +1489,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
}
break;
case LOOP_GET_STATUS64:
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
-   err = loop_get_status64(lo, (struct loop_info64 __user *) arg);
-   /* loop_get_status() unlocks loop_ctl_mutex */
-   break;
+   return loop_get_status64(lo, (struct loop_info64 __user *) arg);
case LOOP_SET_CAPACITY:
case LOOP_SET_DIRECT_IO:
case LOOP_SET_BLOCK_SIZE:
@@ -1625,10 +1614,8 @@ loop_get_status_compat(struct loop_device *lo,
struct loop_info64 info64;
int err;
 
-   if (!arg) {
-   mutex_unlock(&loop_ctl_mutex);
+   if (!arg)
return -EINVAL;
-   }
err = loop_get_status(lo, &info64);
if (!err)
err = loop_info64_to_compat(&info64, arg);
@@ -1651,12 +1638,8 @@ static int lo_compat_ioctl(struct block_device *bdev, 
fmode_t mode,
}
break;
case LOOP_GET_STATUS:
-   err = mutex_lock_killable(&loop_ctl_mutex);
-   if (!err) {
-   err = loop_get_status_compat(lo,
-(struct compat_loop_info 
__user *)arg);
-   /* loop_get_status() unlocks loop_ctl_mutex */
-   }
+   err = loop_get_status_compat(lo,
+(struct compat_loop_info __user *)arg);
break;
case LOOP_SET_CAPACITY:
case LOOP_CLR_FD:
-- 
2.16.4



[PATCH 07/16] loop: Push loop_ctl_mutex down into loop_clr_fd()

2018-11-08 Thread Jan Kara
loop_clr_fd() has a weird locking convention: it expects loop_ctl_mutex
to be held, releases it on success, and keeps it on failure.
Untangle the mess by moving locking of loop_ctl_mutex into
loop_clr_fd().

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 49 +
 1 file changed, 29 insertions(+), 20 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index eb01a685da4e..d8a7b5da881b 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1027,15 +1027,22 @@ loop_init_xfer(struct loop_device *lo, struct 
loop_func_table *xfer,
 
 static int __loop_clr_fd(struct loop_device *lo)
 {
-   struct file *filp = lo->lo_backing_file;
+   struct file *filp = NULL;
gfp_t gfp = lo->old_gfp_mask;
struct block_device *bdev = lo->lo_device;
+   int err = 0;
 
-   if (WARN_ON_ONCE(lo->lo_state != Lo_rundown))
-   return -ENXIO;
+   mutex_lock(&loop_ctl_mutex);
+   if (WARN_ON_ONCE(lo->lo_state != Lo_rundown)) {
+   err = -ENXIO;
+   goto out_unlock;
+   }
 
-   if (filp == NULL)
-   return -EINVAL;
+   filp = lo->lo_backing_file;
+   if (filp == NULL) {
+   err = -EINVAL;
+   goto out_unlock;
+   }
 
/* freeze request queue during the transition */
blk_mq_freeze_queue(lo->lo_queue);
@@ -1082,6 +1089,7 @@ static int __loop_clr_fd(struct loop_device *lo)
if (!part_shift)
lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
loop_unprepare_queue(lo);
+out_unlock:
mutex_unlock(&loop_ctl_mutex);
/*
 * Need not hold loop_ctl_mutex to fput backing file.
@@ -1089,14 +1097,22 @@ static int __loop_clr_fd(struct loop_device *lo)
 * lock dependency possibility warning as fput can take
 * bd_mutex which is usually taken before loop_ctl_mutex.
 */
-   fput(filp);
-   return 0;
+   if (filp)
+   fput(filp);
+   return err;
 }
 
 static int loop_clr_fd(struct loop_device *lo)
 {
-   if (lo->lo_state != Lo_bound)
+   int err;
+
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
+   if (lo->lo_state != Lo_bound) {
+   mutex_unlock(&loop_ctl_mutex);
return -ENXIO;
+   }
/*
 * If we've explicitly asked to tear down the loop device,
 * and it has an elevated reference count, set it for auto-teardown when
@@ -1113,6 +1129,7 @@ static int loop_clr_fd(struct loop_device *lo)
return 0;
}
lo->lo_state = Lo_rundown;
+   mutex_unlock(&loop_ctl_mutex);
 
return __loop_clr_fd(lo);
 }
@@ -1447,14 +1464,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
mutex_unlock(&loop_ctl_mutex);
break;
case LOOP_CLR_FD:
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
-   /* loop_clr_fd would have unlocked loop_ctl_mutex on success */
-   err = loop_clr_fd(lo);
-   if (err)
-   mutex_unlock(&loop_ctl_mutex);
-   break;
+   return loop_clr_fd(lo);
case LOOP_SET_STATUS:
err = -EPERM;
if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
@@ -1690,7 +1700,6 @@ static int lo_open(struct block_device *bdev, fmode_t 
mode)
 static void lo_release(struct gendisk *disk, fmode_t mode)
 {
struct loop_device *lo;
-   int err;
 
mutex_lock(&loop_ctl_mutex);
lo = disk->private_data;
@@ -1701,13 +1710,13 @@ static void lo_release(struct gendisk *disk, fmode_t 
mode)
if (lo->lo_state != Lo_bound)
goto out_unlock;
lo->lo_state = Lo_rundown;
+   mutex_unlock(&loop_ctl_mutex);
/*
 * In autoclear mode, stop the loop thread
 * and remove configuration after last close.
 */
-   err = __loop_clr_fd(lo);
-   if (!err)
-   return;
+   __loop_clr_fd(lo);
+   return;
} else if (lo->lo_state == Lo_bound) {
/*
 * Otherwise keep thread (if running) and config,
-- 
2.16.4



[PATCH 01/16] block/loop: Don't grab "struct file" for vfs_getattr() operation.

2018-11-08 Thread Jan Kara
From: Tetsuo Handa 

vfs_getattr() needs "struct path" rather than "struct file".
Let's use path_get()/path_put() rather than get_file()/fput().

Signed-off-by: Tetsuo Handa 
Reviewed-by: Jan Kara 
Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cb0cc8685076..a29ef169f360 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1204,7 +1204,7 @@ loop_set_status(struct loop_device *lo, const struct 
loop_info64 *info)
 static int
 loop_get_status(struct loop_device *lo, struct loop_info64 *info)
 {
-   struct file *file;
+   struct path path;
struct kstat stat;
int ret;
 
@@ -1229,16 +1229,16 @@ loop_get_status(struct loop_device *lo, struct 
loop_info64 *info)
}
 
/* Drop lo_ctl_mutex while we call into the filesystem. */
-   file = get_file(lo->lo_backing_file);
+   path = lo->lo_backing_file->f_path;
+   path_get(&path);
mutex_unlock(&lo->lo_ctl_mutex);
-   ret = vfs_getattr(&file->f_path, &stat, STATX_INO,
- AT_STATX_SYNC_AS_STAT);
+   ret = vfs_getattr(&path, &stat, STATX_INO, AT_STATX_SYNC_AS_STAT);
if (!ret) {
info->lo_device = huge_encode_dev(stat.dev);
info->lo_inode = stat.ino;
info->lo_rdevice = huge_encode_dev(stat.rdev);
}
-   fput(file);
+   path_put(&path);
return ret;
 }
 
-- 
2.16.4



[PATCH 15/16] loop: Avoid circular locking dependency between loop_ctl_mutex and bd_mutex

2018-11-08 Thread Jan Kara
Code in loop_change_fd() drops the reference to the old file (and also to
the new file in the failure case) under loop_ctl_mutex. Similarly to the
situation in loop_set_fd(), this can create a circular locking dependency
if this was the last reference holding the file open. Delay dropping of
the file reference until we have released loop_ctl_mutex.

Reported-by: Tetsuo Handa 
Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 26 +++---
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index b3f981ac8ef1..112afc9bc604 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -677,7 +677,7 @@ static int loop_validate_file(struct file *file, struct 
block_device *bdev)
 static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
  unsigned int arg)
 {
-   struct file *file, *old_file;
+   struct file *file = NULL, *old_file;
int error;
boolpartscan;
 
@@ -686,21 +686,21 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
return error;
error = -ENXIO;
if (lo->lo_state != Lo_bound)
-   goto out_unlock;
+   goto out_err;
 
/* the loop device has to be read-only */
error = -EINVAL;
if (!(lo->lo_flags & LO_FLAGS_READ_ONLY))
-   goto out_unlock;
+   goto out_err;
 
error = -EBADF;
file = fget(arg);
if (!file)
-   goto out_unlock;
+   goto out_err;
 
error = loop_validate_file(file, bdev);
if (error)
-   goto out_putf;
+   goto out_err;
 
old_file = lo->lo_backing_file;
 
@@ -708,7 +708,7 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
 
/* size of the new backing store needs to be the same */
if (get_loop_size(lo, file) != get_loop_size(lo, old_file))
-   goto out_putf;
+   goto out_err;
 
/* and ... switch */
blk_mq_freeze_queue(lo->lo_queue);
@@ -719,18 +719,22 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
 lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
loop_update_dio(lo);
blk_mq_unfreeze_queue(lo->lo_queue);
-
-   fput(old_file);
partscan = lo->lo_flags & LO_FLAGS_PARTSCAN;
	mutex_unlock(&loop_ctl_mutex);
+   /*
+* We must drop file reference outside of loop_ctl_mutex as dropping
+* the file ref can take bd_mutex which creates circular locking
+* dependency.
+*/
+   fput(old_file);
if (partscan)
loop_reread_partitions(lo, bdev);
return 0;
 
-out_putf:
-   fput(file);
-out_unlock:
+out_err:
	mutex_unlock(&loop_ctl_mutex);
+   if (file)
+   fput(file);
return error;
 }
 
-- 
2.16.4



[PATCH 05/16] loop: Push lo_ctl_mutex down into individual ioctls

2018-11-08 Thread Jan Kara
Push acquisition of lo_ctl_mutex down into individual ioctl handling
branches. This is a preparatory step for pushing the lock down into
individual ioctl handling functions so that they can release the lock as
they need it. We also factor out some simple ioctl handlers that will
not need any special handling to reduce unnecessary code duplication.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 88 +---
 1 file changed, 63 insertions(+), 25 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index dcdc96f4d2d4..4c37578989c4 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1393,70 +1393,108 @@ static int loop_set_block_size(struct loop_device *lo, unsigned long arg)
return 0;
 }
 
-static int lo_ioctl(struct block_device *bdev, fmode_t mode,
-   unsigned int cmd, unsigned long arg)
+static int lo_simple_ioctl(struct loop_device *lo, unsigned int cmd,
+  unsigned long arg)
 {
-   struct loop_device *lo = bdev->bd_disk->private_data;
int err;
 
	err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
if (err)
-   goto out_unlocked;
+   return err;
+   switch (cmd) {
+   case LOOP_SET_CAPACITY:
+   err = loop_set_capacity(lo);
+   break;
+   case LOOP_SET_DIRECT_IO:
+   err = loop_set_dio(lo, arg);
+   break;
+   case LOOP_SET_BLOCK_SIZE:
+   err = loop_set_block_size(lo, arg);
+   break;
+   default:
+   err = lo->ioctl ? lo->ioctl(lo, cmd, arg) : -EINVAL;
+   }
+   mutex_unlock(&loop_ctl_mutex);
+   return err;
+}
+
+static int lo_ioctl(struct block_device *bdev, fmode_t mode,
+   unsigned int cmd, unsigned long arg)
+{
+   struct loop_device *lo = bdev->bd_disk->private_data;
+   int err;
 
switch (cmd) {
case LOOP_SET_FD:
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
	err = loop_set_fd(lo, mode, bdev, arg);
+   mutex_unlock(&loop_ctl_mutex);
	break;
	case LOOP_CHANGE_FD:
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
	err = loop_change_fd(lo, bdev, arg);
+   mutex_unlock(&loop_ctl_mutex);
break;
case LOOP_CLR_FD:
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
	/* loop_clr_fd would have unlocked loop_ctl_mutex on success */
	err = loop_clr_fd(lo);
-   if (!err)
-   goto out_unlocked;
+   if (err)
+   mutex_unlock(&loop_ctl_mutex);
break;
case LOOP_SET_STATUS:
err = -EPERM;
-   if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
+   if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
	err = loop_set_status_old(lo,
	(struct loop_info __user *)arg);
+   mutex_unlock(&loop_ctl_mutex);
+   }
break;
case LOOP_GET_STATUS:
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
err = loop_get_status_old(lo, (struct loop_info __user *) arg);
/* loop_get_status() unlocks loop_ctl_mutex */
-   goto out_unlocked;
+   break;
case LOOP_SET_STATUS64:
err = -EPERM;
-   if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
+   if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
	err = loop_set_status64(lo,
	(struct loop_info64 __user *) arg);
+   mutex_unlock(&loop_ctl_mutex);
+   }
break;
case LOOP_GET_STATUS64:
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
err = loop_get_status64(lo, (struct loop_info64 __user *) arg);
/* loop_get_status() unlocks loop_ctl_mutex */
-   goto out_unlocked;
-   case LOOP_SET_CAPACITY:
-   err = -EPERM;
-   if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
-   err = loop_set_capacity(lo);
break;
+ 

[PATCH 04/16] loop: Get rid of loop_index_mutex

2018-11-08 Thread Jan Kara
Now that loop_ctl_mutex is global, just get rid of loop_index_mutex as
there is no good reason to keep these two separate and it just
complicates the locking.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 41 -
 1 file changed, 20 insertions(+), 21 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 3de2cd94225a..dcdc96f4d2d4 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -83,7 +83,6 @@
 #include 
 
 static DEFINE_IDR(loop_index_idr);
-static DEFINE_MUTEX(loop_index_mutex);
 static DEFINE_MUTEX(loop_ctl_mutex);
 
 static int max_part;
@@ -1626,9 +1625,11 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
 static int lo_open(struct block_device *bdev, fmode_t mode)
 {
struct loop_device *lo;
-   int err = 0;
+   int err;
 
-   mutex_lock(&loop_index_mutex);
+   err = mutex_lock_killable(&loop_ctl_mutex);
+   if (err)
+   return err;
lo = bdev->bd_disk->private_data;
if (!lo) {
err = -ENXIO;
@@ -1637,7 +1638,7 @@ static int lo_open(struct block_device *bdev, fmode_t mode)
 
	atomic_inc(&lo->lo_refcnt);
 out:
-   mutex_unlock(&loop_index_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return err;
 }
 
@@ -1646,12 +1647,11 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
struct loop_device *lo;
int err;
 
-   mutex_lock(&loop_index_mutex);
+   mutex_lock(&loop_ctl_mutex);
lo = disk->private_data;
	if (atomic_dec_return(&lo->lo_refcnt))
-   goto unlock_index;
+   goto out_unlock;
 
-   mutex_lock(&loop_ctl_mutex);
if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
/*
 * In autoclear mode, stop the loop thread
@@ -1659,7 +1659,7 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 */
err = loop_clr_fd(lo);
if (!err)
-   goto unlock_index;
+   return;
} else if (lo->lo_state == Lo_bound) {
/*
 * Otherwise keep thread (if running) and config,
@@ -1669,9 +1669,8 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
blk_mq_unfreeze_queue(lo->lo_queue);
}
 
+out_unlock:
	mutex_unlock(&loop_ctl_mutex);
-unlock_index:
-   mutex_unlock(&loop_index_mutex);
 }
 
 static const struct block_device_operations lo_fops = {
@@ -1972,7 +1971,7 @@ static struct kobject *loop_probe(dev_t dev, int *part, void *data)
struct kobject *kobj;
int err;
 
-   mutex_lock(&loop_index_mutex);
+   mutex_lock(&loop_ctl_mutex);
	err = loop_lookup(&lo, MINOR(dev) >> part_shift);
	if (err < 0)
	err = loop_add(&lo, MINOR(dev) >> part_shift);
@@ -1980,7 +1979,7 @@ static struct kobject *loop_probe(dev_t dev, int *part, void *data)
kobj = NULL;
else
kobj = get_disk_and_module(lo->lo_disk);
-   mutex_unlock(&loop_index_mutex);
+   mutex_unlock(&loop_ctl_mutex);
 
*part = 0;
return kobj;
@@ -1990,9 +1989,13 @@ static long loop_control_ioctl(struct file *file, unsigned int cmd,
   unsigned long parm)
 {
struct loop_device *lo;
-   int ret = -ENOSYS;
+   int ret;
 
-   mutex_lock(&loop_index_mutex);
+   ret = mutex_lock_killable(&loop_ctl_mutex);
+   if (ret)
+   return ret;
+
+   ret = -ENOSYS;
switch (cmd) {
case LOOP_CTL_ADD:
	ret = loop_lookup(&lo, parm);
@@ -2006,9 +2009,6 @@ static long loop_control_ioctl(struct file *file, unsigned int cmd,
	ret = loop_lookup(&lo, parm);
if (ret < 0)
break;
-   ret = mutex_lock_killable(&loop_ctl_mutex);
-   if (ret)
-   break;
if (lo->lo_state != Lo_unbound) {
ret = -EBUSY;
	mutex_unlock(&loop_ctl_mutex);
@@ -2020,7 +2020,6 @@ static long loop_control_ioctl(struct file *file, unsigned int cmd,
break;
}
lo->lo_disk->private_data = NULL;
-   mutex_unlock(&loop_ctl_mutex);
idr_remove(_index_idr, lo->lo_number);
loop_remove(lo);
break;
@@ -2030,7 +2029,7 @@ static long loop_control_ioctl(struct file *file, unsigned int cmd,
break;
	ret = loop_add(&lo, -1);
}
-   mutex_unlock(&loop_index_mutex);
+   mutex_unlock(&loop_ctl_mutex);
 
return ret;
 }
@@ -2114,10 +2113,10 @@ static int __init loop_init(void)
  THIS_MODULE, loop_probe, NULL, NULL);
 
/* pre-create number of devices given by config or max_loop */
-   mutex_lock(&loop_index_mutex);
+   mutex_lock(&loop_ctl_mutex);
	for (i = 0; i < nr; i++)

[PATCH 0/16 v3] loop: Fix oops and possible deadlocks

2018-11-08 Thread Jan Kara
Hi,

this patch series fixes oops and possible deadlocks as reported by syzbot [1]
[2]. The second patch in the series (from Tetsuo) fixes the oops, the remaining
patches are cleaning up the locking in the loop driver so that we can in the
end reasonably easily switch to rereading partitions without holding mutex
protecting the loop device.

I have tested the patches by creating, deleting, modifying loop devices, and by
running loop blktests (as well as creating new ones with the load syzkaller has
used to detect the problem). Review is welcome but I think the patches are fine
to go as far as I'm concerned! Jens, can you please pick them up?

Changes since v1:
* Added patch moving fput() calls in loop_change_fd() from under loop_ctl_mutex
* Fixed bug in loop_control_ioctl() where it failed to return error properly

Changes since v2:
* Rebase on top of 4.20-rc1
* Add patch to stop fooling lockdep about loop_ctl_mutex

Honza

[1] 
https://syzkaller.appspot.com/bug?id=f3cfe26e785d85f9ee259f385515291d21bd80a3
[2] 
https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889



[PATCH 02/16] block/loop: Use global lock for ioctl() operation.

2018-11-08 Thread Jan Kara
From: Tetsuo Handa 

syzbot is reporting NULL pointer dereference [1] which is caused by
race condition between ioctl(loop_fd, LOOP_CLR_FD, 0) versus
ioctl(other_loop_fd, LOOP_SET_FD, loop_fd) due to traversing other
loop devices at loop_validate_file() without holding corresponding
lo->lo_ctl_mutex locks.

Since ioctl() request on loop devices is not frequent operation, we don't
need fine grained locking. Let's use global lock in order to allow safe
traversal at loop_validate_file().

Note that syzbot is also reporting circular locking dependency between
bdev->bd_mutex and lo->lo_ctl_mutex [2] which is caused by calling
blkdev_reread_part() with lock held. This patch does not address it.

[1] 
https://syzkaller.appspot.com/bug?id=f3cfe26e785d85f9ee259f385515291d21bd80a3
[2] 
https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889

Signed-off-by: Tetsuo Handa 
Reported-by: syzbot 
Reviewed-by: Jan Kara 
Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 58 ++--
 drivers/block/loop.h |  1 -
 2 files changed, 29 insertions(+), 30 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index a29ef169f360..63008e879771 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -84,6 +84,7 @@
 
 static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_index_mutex);
+static DEFINE_MUTEX(loop_ctl_mutex);
 
 static int max_part;
 static int part_shift;
@@ -1046,7 +1047,7 @@ static int loop_clr_fd(struct loop_device *lo)
 */
	if (atomic_read(&lo->lo_refcnt) > 1) {
	lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return 0;
}
 
@@ -1099,12 +1100,12 @@ static int loop_clr_fd(struct loop_device *lo)
if (!part_shift)
lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
loop_unprepare_queue(lo);
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
/*
-* Need not hold lo_ctl_mutex to fput backing file.
-* Calling fput holding lo_ctl_mutex triggers a circular
+* Need not hold loop_ctl_mutex to fput backing file.
+* Calling fput holding loop_ctl_mutex triggers a circular
 * lock dependency possibility warning as fput can take
-* bd_mutex which is usually taken before lo_ctl_mutex.
+* bd_mutex which is usually taken before loop_ctl_mutex.
 */
fput(filp);
return 0;
@@ -1209,7 +1210,7 @@ loop_get_status(struct loop_device *lo, struct loop_info64 *info)
int ret;
 
if (lo->lo_state != Lo_bound) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -ENXIO;
}
 
@@ -1228,10 +1229,10 @@ loop_get_status(struct loop_device *lo, struct loop_info64 *info)
   lo->lo_encrypt_key_size);
}
 
-   /* Drop lo_ctl_mutex while we call into the filesystem. */
+   /* Drop loop_ctl_mutex while we call into the filesystem. */
path = lo->lo_backing_file->f_path;
	path_get(&path);
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
	ret = vfs_getattr(&path, &stat, STATX_INO, AT_STATX_SYNC_AS_STAT);
if (!ret) {
info->lo_device = huge_encode_dev(stat.dev);
@@ -1323,7 +1324,7 @@ loop_get_status_old(struct loop_device *lo, struct loop_info __user *arg) {
int err;
 
if (!arg) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -EINVAL;
}
	err = loop_get_status(lo, &info64);
@@ -1341,7 +1342,7 @@ loop_get_status64(struct loop_device *lo, struct loop_info64 __user *arg) {
int err;
 
if (!arg) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -EINVAL;
}
	err = loop_get_status(lo, &info64);
@@ -1399,7 +1400,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
struct loop_device *lo = bdev->bd_disk->private_data;
int err;
 
-   err = mutex_lock_killable_nested(&lo->lo_ctl_mutex, 1);
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
if (err)
goto out_unlocked;
 
@@ -1411,7 +1412,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
err = loop_change_fd(lo, bdev, arg);
break;
case LOOP_CLR_FD:
-   /* loop_clr_fd would have unlocked lo_ctl_mutex on success */
+   /* loop_clr_fd would have unlocked loop_ctl_mutex on success */
err = loop_clr_fd(lo);
if (!err)
goto out_unlocked;
@@ -1424,7 +1425,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
break;
case LOOP_GET_STATUS:
   

[PATCH 12/16] loop: Move special partition reread handling in loop_clr_fd()

2018-11-08 Thread Jan Kara
The call of __blkdev_reread_part() from loop_reread_partition() happens
only when we need to invalidate partitions from loop_release(). Thus
move a detection for this into loop_clr_fd() and simplify
loop_reread_partition().

This makes loop_reread_partition() safe to use without loop_ctl_mutex
because we use only lo->lo_number and lo->lo_file_name in case of error
for reporting purposes (thus possibly reporting outdated information is
not a big deal) and we are safe from 'lo' going away under us by
elevated lo->lo_refcnt.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 33 +++--
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index ea5e313908b1..f1d7a4fe30fc 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -630,18 +630,7 @@ static void loop_reread_partitions(struct loop_device *lo,
 {
int rc;
 
-   /*
-* bd_mutex has been held already in release path, so don't
-* acquire it if this function is called in such case.
-*
-* If the reread partition isn't from release path, lo_refcnt
-* must be at least one and it can only become zero when the
-* current holder is released.
-*/
-   if (!atomic_read(&lo->lo_refcnt))
-   rc = __blkdev_reread_part(bdev);
-   else
-   rc = blkdev_reread_part(bdev);
+   rc = blkdev_reread_part(bdev);
if (rc)
pr_warn("%s: partition scan of loop%d (%s) failed (rc=%d)\n",
__func__, lo->lo_number, lo->lo_file_name, rc);
@@ -1095,8 +1084,24 @@ static int __loop_clr_fd(struct loop_device *lo)
module_put(THIS_MODULE);
blk_mq_unfreeze_queue(lo->lo_queue);
 
-   if (lo->lo_flags & LO_FLAGS_PARTSCAN && bdev)
-   loop_reread_partitions(lo, bdev);
+   if (lo->lo_flags & LO_FLAGS_PARTSCAN && bdev) {
+   /*
+* bd_mutex has been held already in release path, so don't
+* acquire it if this function is called in such case.
+*
+* If the reread partition isn't from release path, lo_refcnt
+* must be at least one and it can only become zero when the
+* current holder is released.
+*/
+   if (!atomic_read(&lo->lo_refcnt))
+   err = __blkdev_reread_part(bdev);
+   else
+   err = blkdev_reread_part(bdev);
+   pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
+   __func__, lo->lo_number, err);
+   /* Device is gone, no point in returning error */
+   err = 0;
+   }
lo->lo_flags = 0;
if (!part_shift)
lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
-- 
2.16.4



[PATCH 10/16] loop: Push loop_ctl_mutex down to loop_set_fd()

2018-11-08 Thread Jan Kara
Push loop_ctl_mutex down to loop_set_fd(). We will need this to be able to
call loop_reread_partitions() without loop_ctl_mutex.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index af79a59732b7..161e2a08f2e8 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -918,13 +918,17 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
if (!file)
goto out;
 
+   error = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (error)
+   goto out_putf;
+
error = -EBUSY;
if (lo->lo_state != Lo_unbound)
-   goto out_putf;
+   goto out_unlock;
 
error = loop_validate_file(file, bdev);
if (error)
-   goto out_putf;
+   goto out_unlock;
 
mapping = file->f_mapping;
inode = mapping->host;
@@ -936,10 +940,10 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
error = -EFBIG;
size = get_loop_size(lo, file);
if ((loff_t)(sector_t)size != size)
-   goto out_putf;
+   goto out_unlock;
error = loop_prepare_queue(lo);
if (error)
-   goto out_putf;
+   goto out_unlock;
 
error = 0;
 
@@ -978,11 +982,14 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 * put /dev/loopXX inode. Later in __loop_clr_fd() we bdput(bdev).
 */
bdgrab(bdev);
+   mutex_unlock(&loop_ctl_mutex);
return 0;
 
- out_putf:
+out_unlock:
+   mutex_unlock(&loop_ctl_mutex);
+out_putf:
fput(file);
- out:
+out:
/* This is safe: open() is still holding a reference. */
module_put(THIS_MODULE);
return error;
@@ -1460,12 +1467,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
 
switch (cmd) {
case LOOP_SET_FD:
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
-   err = loop_set_fd(lo, mode, bdev, arg);
-   mutex_unlock(&loop_ctl_mutex);
-   break;
+   return loop_set_fd(lo, mode, bdev, arg);
case LOOP_CHANGE_FD:
	err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
if (err)
-- 
2.16.4



[PATCH 03/16] loop: Fold __loop_release into loop_release

2018-11-08 Thread Jan Kara
__loop_release() has a single call site. Fold it there. This is
currently not a huge win but it will make following replacement of
loop_index_mutex more obvious.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 63008e879771..3de2cd94225a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1641,12 +1641,15 @@ static int lo_open(struct block_device *bdev, fmode_t mode)
return err;
 }
 
-static void __lo_release(struct loop_device *lo)
+static void lo_release(struct gendisk *disk, fmode_t mode)
 {
+   struct loop_device *lo;
int err;
 
+   mutex_lock(&loop_index_mutex);
+   lo = disk->private_data;
	if (atomic_dec_return(&lo->lo_refcnt))
-   return;
+   goto unlock_index;

	mutex_lock(&loop_ctl_mutex);
if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
@@ -1656,7 +1659,7 @@ static void __lo_release(struct loop_device *lo)
 */
err = loop_clr_fd(lo);
if (!err)
-   return;
+   goto unlock_index;
} else if (lo->lo_state == Lo_bound) {
/*
 * Otherwise keep thread (if running) and config,
@@ -1667,12 +1670,7 @@ static void __lo_release(struct loop_device *lo)
}
 
	mutex_unlock(&loop_ctl_mutex);
-}
-
-static void lo_release(struct gendisk *disk, fmode_t mode)
-{
-   mutex_lock(&loop_index_mutex);
-   __lo_release(disk->private_data);
+unlock_index:
	mutex_unlock(&loop_index_mutex);
 }
 
-- 
2.16.4



Re: [PATCH 1/2] loop/006: Add test for setting partscan flag

2018-10-24 Thread Jan Kara
On Tue 23-10-18 11:57:50, Omar Sandoval wrote:
> On Tue, Oct 23, 2018 at 12:05:12PM +0200, Jan Kara wrote:
> > On Mon 22-10-18 15:52:55, Omar Sandoval wrote:
> > > On Thu, Oct 18, 2018 at 12:31:46PM +0200, Jan Kara wrote:
> > > > Add test for setting partscan flag.
> > > > 
> > > > Signed-off-by: Jan Kara 
> > > 
> > > Sorry I didn't notice this earlier, but loop/001 already does a
> > > partition rescan (via losetup -P). Does that cover this test case?
> > 
> > Yes I know. But the partition rescanning on device creation has been
> > handled properly while partition rescanning as a result of LOOP_SET_STATUS
> > was buggy. That's why I've added this test.
> 
> At least here, losetup -P does a LOOP_SET_STATUS:
> 
> $ sudo strace -e ioctl losetup -f --show -P test.img
> ioctl(3, LOOP_CTL_GET_FREE) = 0
> ioctl(4, LOOP_SET_FD, 3)= 0
> ioctl(4, LOOP_SET_STATUS64, {lo_offset=0, lo_number=0, 
> lo_flags=LO_FLAGS_PARTSCAN, lo_file_name="/home/osandov/test.img", ...}) = 0
> /dev/loop0
> +++ exited with 0 +++

Right, my bad. Just discard this test then. Thanks for noticing this!

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/2] loop/006: Add test for setting partscan flag

2018-10-23 Thread Jan Kara
On Mon 22-10-18 15:52:55, Omar Sandoval wrote:
> On Thu, Oct 18, 2018 at 12:31:46PM +0200, Jan Kara wrote:
> > Add test for setting partscan flag.
> > 
> > Signed-off-by: Jan Kara 
> 
> Sorry I didn't notice this earlier, but loop/001 already does a
> partition rescan (via losetup -P). Does that cover this test case?

Yes I know. But the partition rescanning on device creation has been
handled properly while partition rescanning as a result of LOOP_SET_STATUS
was buggy. That's why I've added this test.

> > +int main(int argc, char **argv)
> > +{
> > +   int ret;
> > +   int fd;
> > +   struct loop_info64 info;
> > +
> > +   if (argc != 2)
> > +   usage(argv[0]);
> > +
> > +   fd = open(argv[1], O_RDONLY);
> > +   if (fd == -1) {
> > +   perror("open");
> > +   return EXIT_FAILURE;
> > +   }
> > +
> > +   memset(&info, 0, sizeof(info));
> > +   info.lo_flags = LO_FLAGS_PARTSCAN;
> > +   memcpy(info.lo_file_name, "part", 5);
> 
> What's the significance of this file name?

Probably none, I guess I can just delete it. I think I've just copy-pasted
it from some other test exercising LOOP_SET_STATUS...

Honza
-- 
Jan Kara 
SUSE Labs, CR


[PATCH 2/2] loop/007: Add test for oops during backing file verification

2018-10-18 Thread Jan Kara
Add regression test for patch "block/loop: Use global lock for ioctl()
operation." where we can oops while traversing list of loop devices
backing newly created device.

Signed-off-by: Jan Kara 
---
 src/Makefile |  3 ++-
 src/loop_change_fd.c | 48 +
 tests/loop/007   | 75 
 tests/loop/007.out   |  2 ++
 4 files changed, 127 insertions(+), 1 deletion(-)
 create mode 100644 src/loop_change_fd.c
 create mode 100755 tests/loop/007
 create mode 100644 tests/loop/007.out

diff --git a/src/Makefile b/src/Makefile
index 6dadcbec8beb..acd169718b7d 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -5,7 +5,8 @@ C_TARGETS := \
sg/dxfer-from-dev \
sg/syzkaller1 \
nbdsetsize \
-   loop_set_status_partscan
+   loop_set_status_partscan \
+   loop_change_fd
 
 CXX_TARGETS := \
discontiguous-io
diff --git a/src/loop_change_fd.c b/src/loop_change_fd.c
new file mode 100644
index ..b124d829f380
--- /dev/null
+++ b/src/loop_change_fd.c
@@ -0,0 +1,48 @@
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/loop.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+void usage(const char *progname)
+{
+   fprintf(stderr, "usage: %s LOOPDEV PATH\n", progname);
+   exit(EXIT_FAILURE);
+}
+
+int main(int argc, char **argv)
+{
+   int ret;
+   int fd, filefd;
+
+   if (argc != 3)
+   usage(argv[0]);
+
+   fd = open(argv[1], O_RDWR);
+   if (fd == -1) {
+   perror("open");
+   return EXIT_FAILURE;
+   }
+
+   filefd = open(argv[2], O_RDWR);
+   if (filefd == -1) {
+   perror("open");
+   return EXIT_FAILURE;
+   }
+
+   ret = ioctl(fd, LOOP_CHANGE_FD, filefd);
+   if (ret == -1) {
+   perror("ioctl");
+   close(fd);
+   close(filefd);
+   return EXIT_FAILURE;
+   }
+   close(fd);
+   close(filefd);
+   return EXIT_SUCCESS;
+}
diff --git a/tests/loop/007 b/tests/loop/007
new file mode 100755
index ..98e4edbe7afa
--- /dev/null
+++ b/tests/loop/007
@@ -0,0 +1,75 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-3.0+
+# Copyright (C) 2018 Jan Kara
+#
# Regression test for patch "block/loop: Use global lock for ioctl() operation."
+# We swap file (through LOOP_CHANGE_FD) under one loopback device while
+# creating and deleting another loopback device pointing to the first one.
+# This checks for races in validation of backing fd.
+
+. tests/loop/rc
+
+DESCRIPTION="check for crash when validating new loop file"
+
+# Try for some time to trigger the race
+TIMED=1
+
+requires() {
+   _have_src_program loop_change_fd
+}
+
+# Setup and tear down loop device pointing to loop_dev
+run_setter() {
+   loop_dev="$1"
+
+   while top_dev=$(losetup -f --show "$loop_dev"); do
+   losetup -d "$top_dev"
+   done
+}
+
+# Switch backing file of loop_dev continuously
+run_switcher() {
+   loop_dev="$1"
+
+   i=1
+   while src/loop_change_fd "$loop_dev" "$TMPDIR/file$i"; do
+   i=$(((i+1)%2))
+   done
+}
+
+test() {
+   echo "Running ${TEST_NAME}"
+
+   timeout=${TIMEOUT:-30}
+
+   truncate -s 1M "$TMPDIR/file0"
+   truncate -s 1M "$TMPDIR/file1"
+
+   if ! loop_dev="$(losetup -r -f --show "$TMPDIR/file0")"; then
+   return 1
+   fi
+
+   run_switcher "$loop_dev" &
+   switch_pid=$!
+   run_setter "$loop_dev" &
+   set_pid=$!
+
+   sleep $timeout
+
+   # Discard KILLED messages from bash...
+   {
+   kill -9 $switch_pid
+   kill -9 $set_pid
+   wait
+   sleep 1
+   } 2>/dev/null
+
+   # Clean up devices
+   top_dev=$(losetup -j "$loop_dev" | sed -e 's/\([^:]*\): .*/\1/')
+   if [[ -n "$top_dev" ]]; then
+   losetup -d "$top_dev"
+   fi
+   losetup -d "$loop_dev"
+
+   echo "Test complete"
+}
diff --git a/tests/loop/007.out b/tests/loop/007.out
new file mode 100644
index ..32752934d48a
--- /dev/null
+++ b/tests/loop/007.out
@@ -0,0 +1,2 @@
+Running loop/007
+Test complete
-- 
2.16.4



[PATCH 0/2] blktests: New loop tests

2018-10-18 Thread Jan Kara


Hello,

these two patches create two new tests for blktests as regression tests
for my recently posted loopback device fixes. More details in individual
patches.

Honza


[PATCH 1/2] loop/006: Add test for setting partscan flag

2018-10-18 Thread Jan Kara
Add test for setting partscan flag.

Signed-off-by: Jan Kara 
---
 src/Makefile   |  3 ++-
 src/loop_set_status_partscan.c | 45 ++
 tests/loop/006 | 33 +++
 tests/loop/006.out |  2 ++
 4 files changed, 82 insertions(+), 1 deletion(-)
 create mode 100644 src/loop_set_status_partscan.c
 create mode 100755 tests/loop/006
 create mode 100644 tests/loop/006.out

diff --git a/src/Makefile b/src/Makefile
index f89f61701179..6dadcbec8beb 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -4,7 +4,8 @@ C_TARGETS := \
openclose \
sg/dxfer-from-dev \
sg/syzkaller1 \
-   nbdsetsize
+   nbdsetsize \
+   loop_set_status_partscan
 
 CXX_TARGETS := \
discontiguous-io
diff --git a/src/loop_set_status_partscan.c b/src/loop_set_status_partscan.c
new file mode 100644
index ..8873a12e4334
--- /dev/null
+++ b/src/loop_set_status_partscan.c
@@ -0,0 +1,45 @@
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/loop.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+void usage(const char *progname)
+{
+   fprintf(stderr, "usage: %s PATH\n", progname);
+   exit(EXIT_FAILURE);
+}
+
+int main(int argc, char **argv)
+{
+   int ret;
+   int fd;
+   struct loop_info64 info;
+
+   if (argc != 2)
+   usage(argv[0]);
+
+   fd = open(argv[1], O_RDONLY);
+   if (fd == -1) {
+   perror("open");
+   return EXIT_FAILURE;
+   }
+
+   memset(&info, 0, sizeof(info));
+   info.lo_flags = LO_FLAGS_PARTSCAN;
+   memcpy(info.lo_file_name, "part", 5);
+
+   ret = ioctl(fd, LOOP_SET_STATUS64, &info);
+   if (ret == -1) {
+   perror("ioctl");
+   close(fd);
+   return EXIT_FAILURE;
+   }
+   close(fd);
+   return EXIT_SUCCESS;
+}
diff --git a/tests/loop/006 b/tests/loop/006
new file mode 100755
index ..e468d3a20210
--- /dev/null
+++ b/tests/loop/006
@@ -0,0 +1,33 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-3.0+
+# Copyright (C) 2018 Jan Kara
+#
+# Regression test for patch "loop: Fix deadlock when calling
+# blkdev_reread_part()". We check whether setting partscan flag
+# and thus forcing partition reread won't trigger lockdep warning.
+#
+
+. tests/loop/rc
+
+DESCRIPTION="check for partition rereading deadlock"
+
+QUICK=1
+
+requires() {
+   _have_src_program loop_set_status_partscan
+}
+
+test() {
+   local loop_dev
+   echo "Running ${TEST_NAME}"
+
+   truncate -s 1M "$TMPDIR/file"
+   if ! loop_dev="$(losetup -f --show "$TMPDIR/file")"; then
+   return 1
+   fi
+
+   src/loop_set_status_partscan "$loop_dev"
+   losetup -d "$loop_dev"
+
+   echo "Test complete"
+}
diff --git a/tests/loop/006.out b/tests/loop/006.out
new file mode 100644
index ..50bf833f77b0
--- /dev/null
+++ b/tests/loop/006.out
@@ -0,0 +1,2 @@
+Running loop/006
+Test complete
-- 
2.16.4



Re: [PATCH v2] block: BFQ default for single queue devices

2018-10-18 Thread Jan Kara
On Wed 17-10-18 10:29:22, Jens Axboe wrote:
> On 10/17/18 4:05 AM, Jan Kara wrote:
> > On Tue 16-10-18 11:35:59, Jens Axboe wrote:
> >> On 10/15/18 1:44 PM, Paolo Valente wrote:
> >>> Here are some old results with a very simple configuration:
> >>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/
> >>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/
> >>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/
> >>>
> >>> Then I stopped repeating tests that always yielded the same good results.
> >>>
> >>> As for more professional systems, a well-known company doing
> >>> real-time packet-traffic dumping asked me to modify bfq so as to
> >>> guarantee lossless data writing also during queries.  The involved box
> >>> had a RAID reaching a few Gbps, and everything worked well.
> >>>
> >>> Anyway, if you have specific issues in mind, I can check more deeply.
> >>
> >> Do you have anything more recent? All of these predate the current
> >> code (by a lot), and isn't even mq. I'm mostly just interested in
> >> plain fast NVMe device, and a big box hardware raid setup with
> >> a ton of drives.
> >>
> >> I do still think that this should be going through the distros, they
> >> need to be the ones driving this, as they will ultimately be the
> >> ones getting customer reports on regressions. The qual/test cycle
> >> they do is useful for this. In mainline, if we make a change like
> >> this, we'll figure out if it worked many releases down the line.
> > 
> > Well, the problem with this is that big distro people really don't care
> > much because they already use udev for tuning the IO scheduler. So whatever
> > defaults the kernel is going to pick likely won't be seen by distro
> > customers. Embedded people seem to be driving this effort because they
> > either don't run udev or they feel not all their teams building new
> > products have enough expertise to come up with a proper set of rules...
> 
> Which is also the approach that I've been advocating for here, instead
> of a kernel patch...

I know you've been advocating the use of udev for IO scheduler selection.
But do you want to force everybody to use udev? And for people who build
their own (usually small) systems, do you want to force them to think about
IO scheduler selection and writing appropriate rules? These are the
problems people were mentioning, and I'm not sure what your opinion on
this is.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2] block: BFQ default for single queue devices

2018-10-17 Thread Jan Kara
On Tue 16-10-18 11:35:59, Jens Axboe wrote:
> On 10/15/18 1:44 PM, Paolo Valente wrote:
> > Here are some old results with a very simple configuration:
> > http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/
> > http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/
> > http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/
> > 
> > Then I stopped repeating tests that always yielded the same good results.
> > 
> > As for more professional systems, a well-known company doing
> > real-time packet-traffic dumping asked me to modify bfq so as to
> > guarantee lossless data writing also during queries.  The involved box
> > had a RAID reaching a few Gbps, and everything worked well.
> > 
> > Anyway, if you have specific issues in mind, I can check more deeply.
> 
> Do you have anything more recent? All of these predate the current
> code (by a lot), and isn't even mq. I'm mostly just interested in
> plain fast NVMe device, and a big box hardware raid setup with
> a ton of drives.
> 
> I do still think that this should be going through the distros, they
> need to be the ones driving this, as they will ultimately be the
> ones getting customer reports on regressions. The qual/test cycle
> they do is useful for this. In mainline, if we make a change like
> this, we'll figure out if it worked many releases down the line.

Well, the problem with this is that big distro people really don't care
much because they already use udev for tuning the IO scheduler. So whatever
defaults the kernel is going to pick likely won't be seen by distro
customers. Embedded people seem to be driving this effort because they
either don't run udev or they feel not all their teams building new
products have enough expertise to come up with a proper set of rules...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 0/15 v2] loop: Fix oops and possible deadlocks

2018-10-17 Thread Jan Kara
On Tue 16-10-18 11:16:22, Omar Sandoval wrote:
> On Tue, Oct 16, 2018 at 01:36:54PM +0200, Jan Kara wrote:
> > On Wed 10-10-18 14:28:09, Jan Kara wrote:
> > > On Wed 10-10-18 13:42:27, Johannes Thumshirn wrote:
> > > > On Wed, Oct 10, 2018 at 07:19:00PM +0900, Tetsuo Handa wrote:
> > > > > On 2018/10/10 19:04, Jan Kara wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > this patch series fixes oops and possible deadlocks as reported by 
> > > > > > syzbot [1]
> > > > > > [2]. The second patch in the series (from Tetsuo) fixes the oops, 
> > > > > > the remaining
> > > > > > patches are cleaning up the locking in the loop driver so that we 
> > > > > > can in the
> > > > > > end reasonably easily switch to rereading partitions without 
> > > > > > holding mutex
> > > > > > protecting the loop device.
> > > > > > 
> > > > > > I have lightly tested the patches by creating, deleting, and 
> > > > > > modifying loop
> > > > > > devices but if there's some more comprehensive loopback device 
> > > > > > testsuite, I
> > > > > > can try running it. Review is welcome!
> > > > > 
> > > > > Testing on linux-next by syzbot will be the most comprehensive. ;-)
> > > > 
> > > > Apart from that blktests has a loop category and I think it could also 
> > > > be
> > > > worthwhile to add the C reproducer from syzkaller to blktests.
> > > 
> > > Yeah, I did run loop tests now and they ran fine. I can try converting the
> > > syzbot reproducers into something legible but it will take a while.
> > 
> > So I took a stab at this. But I hit two issues:
> > 
> > 1) For the reproducer triggering the lockdep warning, you need a 32-bit
> > binary (so that it uses compat_ioctl). I don't think we want to introduce
> > 32-bit devel environment dependency to blktests. With 64-bits, the problem
> > is also there but someone noticed and silenced lockdep (with a reason that
> > I consider is incorrect)... I think the test is still worth it though as
> > I'll remove the lockdep-fooling code in my patches and thus new breakage
> > will be noticed.
> 
> Agreed, even if it doesn't trigger lockdep now, it's a good regression
> test.
> 
> > 2) For the oops (use-after-free) issue I was not able to reproduce that in
> > my test KVM in a couple of hours. The race window is rather narrow and syzbot
> > with KASAN and everything hit it only 11 times. So I'm not sure how useful
> > that test is. Any opinions?
> 
> I'd say we should add it anyways. If anything, it's a smoke test for
> changing fds on a loop device. You could add a note that the race it's
> testing for is very narrow.

OK, I'll post the patches later today.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 0/15 v2] loop: Fix oops and possible deadlocks

2018-10-16 Thread Jan Kara
On Wed 10-10-18 14:28:09, Jan Kara wrote:
> On Wed 10-10-18 13:42:27, Johannes Thumshirn wrote:
> > On Wed, Oct 10, 2018 at 07:19:00PM +0900, Tetsuo Handa wrote:
> > > On 2018/10/10 19:04, Jan Kara wrote:
> > > > Hi,
> > > > 
> > > > this patch series fixes oops and possible deadlocks as reported by 
> > > > syzbot [1]
> > > > [2]. The second patch in the series (from Tetsuo) fixes the oops, the 
> > > > remaining
> > > > patches are cleaning up the locking in the loop driver so that we can 
> > > > in the
> > > > end reasonably easily switch to rereading partitions without holding 
> > > > mutex
> > > > protecting the loop device.
> > > > 
> > > > I have lightly tested the patches by creating, deleting, and modifying 
> > > > loop
> > > > devices but if there's some more comprehensive loopback device 
> > > > testsuite, I
> > > > can try running it. Review is welcome!
> > > 
> > > Testing on linux-next by syzbot will be the most comprehensive. ;-)
> > 
> > Apart from that blktests has a loop category and I think it could also be
> > worthwhile to add the C reproducer from syzkaller to blktests.
> 
> Yeah, I did run loop tests now and they ran fine. I can try converting the
> syzbot reproducers into something legible but it will take a while.

So I took a stab at this. But I hit two issues:

1) For the reproducer triggering the lockdep warning, you need a 32-bit
binary (so that it uses compat_ioctl). I don't think we want to introduce
32-bit devel environment dependency to blktests. With 64-bits, the problem
is also there but someone noticed and silenced lockdep (with a reason that
I consider is incorrect)... I think the test is still worth it though as
I'll remove the lockdep-fooling code in my patches and thus new breakage
will be noticed.

2) For the oops (use-after-free) issue I was not able to reproduce that in
my test KVM in a couple of hours. The race window is rather narrow and syzbot
with KASAN and everything hit it only 11 times. So I'm not sure how useful
that test is. Any opinions?

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 0/15 v2] loop: Fix oops and possible deadlocks

2018-10-10 Thread Jan Kara
On Wed 10-10-18 13:42:27, Johannes Thumshirn wrote:
> On Wed, Oct 10, 2018 at 07:19:00PM +0900, Tetsuo Handa wrote:
> > On 2018/10/10 19:04, Jan Kara wrote:
> > > Hi,
> > > 
> > > this patch series fixes oops and possible deadlocks as reported by syzbot 
> > > [1]
> > > [2]. The second patch in the series (from Tetsuo) fixes the oops, the 
> > > remaining
> > > patches are cleaning up the locking in the loop driver so that we can in 
> > > the
> > > end reasonably easily switch to rereading partitions without holding mutex
> > > protecting the loop device.
> > > 
> > > I have lightly tested the patches by creating, deleting, and modifying 
> > > loop
> > > devices but if there's some more comprehensive loopback device testsuite, 
> > > I
> > > can try running it. Review is welcome!
> > 
> > Testing on linux-next by syzbot will be the most comprehensive. ;-)
> 
> Apart from that blktests has a loop category and I think it could also be
> worthwhile to add the C reproducer from syzkaller to blktests.

Yeah, I did run loop tests now and they ran fine. I can try converting the
syzbot reproducers into something legible but it will take a while.

Honza
-- 
Jan Kara 
SUSE Labs, CR


[PATCH 06/15] loop: Split setting of lo_state from loop_clr_fd

2018-10-10 Thread Jan Kara
Move setting of lo_state to Lo_rundown out into the callers. That will
allow us to unlock loop_ctl_mutex while the loop device is protected
from other changes by its special state.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 52 +++-
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index b7b3b4ae0896..2410921bd9ff 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -976,7 +976,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
loop_reread_partitions(lo, bdev);
 
/* Grab the block_device to prevent its destruction after we
-* put /dev/loopXX inode. Later in loop_clr_fd() we bdput(bdev).
+* put /dev/loopXX inode. Later in __loop_clr_fd() we bdput(bdev).
 */
bdgrab(bdev);
return 0;
@@ -1026,31 +1026,15 @@ loop_init_xfer(struct loop_device *lo, struct 
loop_func_table *xfer,
return err;
 }
 
-static int loop_clr_fd(struct loop_device *lo)
+static int __loop_clr_fd(struct loop_device *lo)
 {
struct file *filp = lo->lo_backing_file;
gfp_t gfp = lo->old_gfp_mask;
struct block_device *bdev = lo->lo_device;
 
-   if (lo->lo_state != Lo_bound)
+   if (WARN_ON_ONCE(lo->lo_state != Lo_rundown))
return -ENXIO;
 
-   /*
-* If we've explicitly asked to tear down the loop device,
-* and it has an elevated reference count, set it for auto-teardown when
-* the last reference goes away. This stops $!~#$@ udev from
-* preventing teardown because it decided that it needs to run blkid on
-* the loopback device whenever they appear. xfstests is notorious for
-* failing tests because blkid via udev races with a losetup
-* <dev>/do something like mkfs/losetup -d <dev> causing the losetup -d
-* command to fail with EBUSY.
-*/
-   if (atomic_read(&lo->lo_refcnt) > 1) {
-   lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
-   mutex_unlock(&loop_ctl_mutex);
-   return 0;
-   }
-
if (filp == NULL)
return -EINVAL;
 
@@ -1058,7 +1042,6 @@ static int loop_clr_fd(struct loop_device *lo)
blk_mq_freeze_queue(lo->lo_queue);
 
spin_lock_irq(&lo->lo_lock);
-   lo->lo_state = Lo_rundown;
lo->lo_backing_file = NULL;
spin_unlock_irq(&lo->lo_lock);
 
@@ -1111,6 +1094,30 @@ static int loop_clr_fd(struct loop_device *lo)
return 0;
 }
 
+static int loop_clr_fd(struct loop_device *lo)
+{
+   if (lo->lo_state != Lo_bound)
+   return -ENXIO;
+   /*
+* If we've explicitly asked to tear down the loop device,
+* and it has an elevated reference count, set it for auto-teardown when
+* the last reference goes away. This stops $!~#$@ udev from
+* preventing teardown because it decided that it needs to run blkid on
+* the loopback device whenever they appear. xfstests is notorious for
+* failing tests because blkid via udev races with a losetup
+* <dev>/do something like mkfs/losetup -d <dev> causing the losetup -d
+* command to fail with EBUSY.
+*/
+   if (atomic_read(&lo->lo_refcnt) > 1) {
+   lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
+   mutex_unlock(&loop_ctl_mutex);
+   return 0;
+   }
+   lo->lo_state = Lo_rundown;
+
+   return __loop_clr_fd(lo);
+}
+
 static int
 loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 {
@@ -1692,11 +1699,14 @@ static void lo_release(struct gendisk *disk, fmode_t 
mode)
goto out_unlock;
 
if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
+   if (lo->lo_state != Lo_bound)
+   goto out_unlock;
+   lo->lo_state = Lo_rundown;
/*
 * In autoclear mode, stop the loop thread
 * and remove configuration after last close.
 */
-   err = loop_clr_fd(lo);
+   err = __loop_clr_fd(lo);
if (!err)
return;
} else if (lo->lo_state == Lo_bound) {
-- 
2.16.4



[PATCH 15/15] loop: Avoid circular locking dependency between loop_ctl_mutex and bd_mutex

2018-10-10 Thread Jan Kara
Code in loop_change_fd() drops reference to the old file (and also the
new file in a failure case) under loop_ctl_mutex. Similarly to a
situation in loop_set_fd() this can create a circular locking dependency
if this was the last reference holding the file open. Delay dropping of
the file reference until we have released loop_ctl_mutex.

Reported-by: Tetsuo Handa 
Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 26 +++---
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d6b3fac25040..a517247a32fa 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -678,7 +678,7 @@ static int loop_validate_file(struct file *file, struct 
block_device *bdev)
 static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
  unsigned int arg)
 {
-   struct file *file, *old_file;
+   struct file *file = NULL, *old_file;
int error;
boolpartscan;
 
@@ -687,21 +687,21 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
return error;
error = -ENXIO;
if (lo->lo_state != Lo_bound)
-   goto out_unlock;
+   goto out_err;
 
/* the loop device has to be read-only */
error = -EINVAL;
if (!(lo->lo_flags & LO_FLAGS_READ_ONLY))
-   goto out_unlock;
+   goto out_err;
 
error = -EBADF;
file = fget(arg);
if (!file)
-   goto out_unlock;
+   goto out_err;
 
error = loop_validate_file(file, bdev);
if (error)
-   goto out_putf;
+   goto out_err;
 
old_file = lo->lo_backing_file;
 
@@ -709,7 +709,7 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
 
/* size of the new backing store needs to be the same */
if (get_loop_size(lo, file) != get_loop_size(lo, old_file))
-   goto out_putf;
+   goto out_err;
 
/* and ... switch */
blk_mq_freeze_queue(lo->lo_queue);
@@ -720,18 +720,22 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
 lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
loop_update_dio(lo);
blk_mq_unfreeze_queue(lo->lo_queue);
-
-   fput(old_file);
partscan = lo->lo_flags & LO_FLAGS_PARTSCAN;
mutex_unlock(&loop_ctl_mutex);
+   /*
+* We must drop file reference outside of loop_ctl_mutex as dropping
+* the file ref can take bd_mutex which creates circular locking
+* dependency.
+*/
+   fput(old_file);
if (partscan)
loop_reread_partitions(lo, bdev);
return 0;
 
-out_putf:
-   fput(file);
-out_unlock:
+out_err:
mutex_unlock(&loop_ctl_mutex);
+   if (file)
+   fput(file);
return error;
 }
 
-- 
2.16.4



[PATCH 07/15] loop: Push loop_ctl_mutex down into loop_clr_fd()

2018-10-10 Thread Jan Kara
loop_clr_fd() has a weird locking convention: it expects loop_ctl_mutex
to be held, releases it on success, and keeps it held on failure.
Untangle the mess by moving the locking of loop_ctl_mutex into
loop_clr_fd().

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 49 +
 1 file changed, 29 insertions(+), 20 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 2410921bd9ff..be41d5dcecd2 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1028,15 +1028,22 @@ loop_init_xfer(struct loop_device *lo, struct 
loop_func_table *xfer,
 
 static int __loop_clr_fd(struct loop_device *lo)
 {
-   struct file *filp = lo->lo_backing_file;
+   struct file *filp = NULL;
gfp_t gfp = lo->old_gfp_mask;
struct block_device *bdev = lo->lo_device;
+   int err = 0;
 
-   if (WARN_ON_ONCE(lo->lo_state != Lo_rundown))
-   return -ENXIO;
+   mutex_lock(&loop_ctl_mutex);
+   if (WARN_ON_ONCE(lo->lo_state != Lo_rundown)) {
+   err = -ENXIO;
+   goto out_unlock;
+   }
 
-   if (filp == NULL)
-   return -EINVAL;
+   filp = lo->lo_backing_file;
+   if (filp == NULL) {
+   err = -EINVAL;
+   goto out_unlock;
+   }
 
/* freeze request queue during the transition */
blk_mq_freeze_queue(lo->lo_queue);
@@ -1083,6 +1090,7 @@ static int __loop_clr_fd(struct loop_device *lo)
if (!part_shift)
lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
loop_unprepare_queue(lo);
+out_unlock:
mutex_unlock(&loop_ctl_mutex);
/*
 * Need not hold loop_ctl_mutex to fput backing file.
@@ -1090,14 +1098,22 @@ static int __loop_clr_fd(struct loop_device *lo)
 * lock dependency possibility warning as fput can take
 * bd_mutex which is usually taken before loop_ctl_mutex.
 */
-   fput(filp);
-   return 0;
+   if (filp)
+   fput(filp);
+   return err;
 }
 
 static int loop_clr_fd(struct loop_device *lo)
 {
-   if (lo->lo_state != Lo_bound)
+   int err;
+
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
+   if (lo->lo_state != Lo_bound) {
+   mutex_unlock(&loop_ctl_mutex);
return -ENXIO;
+   }
/*
 * If we've explicitly asked to tear down the loop device,
 * and it has an elevated reference count, set it for auto-teardown when
@@ -1114,6 +1130,7 @@ static int loop_clr_fd(struct loop_device *lo)
return 0;
}
lo->lo_state = Lo_rundown;
+   mutex_unlock(&loop_ctl_mutex);
 
return __loop_clr_fd(lo);
 }
@@ -1448,14 +1465,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
mutex_unlock(&loop_ctl_mutex);
break;
case LOOP_CLR_FD:
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
-   /* loop_clr_fd would have unlocked loop_ctl_mutex on success */
-   err = loop_clr_fd(lo);
-   if (err)
-   mutex_unlock(&loop_ctl_mutex);
-   break;
+   return loop_clr_fd(lo);
case LOOP_SET_STATUS:
err = -EPERM;
if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
@@ -1691,7 +1701,6 @@ static int lo_open(struct block_device *bdev, fmode_t 
mode)
 static void lo_release(struct gendisk *disk, fmode_t mode)
 {
struct loop_device *lo;
-   int err;
 
mutex_lock(&loop_ctl_mutex);
lo = disk->private_data;
@@ -1702,13 +1711,13 @@ static void lo_release(struct gendisk *disk, fmode_t 
mode)
if (lo->lo_state != Lo_bound)
goto out_unlock;
lo->lo_state = Lo_rundown;
+   mutex_unlock(&loop_ctl_mutex);
/*
 * In autoclear mode, stop the loop thread
 * and remove configuration after last close.
 */
-   err = __loop_clr_fd(lo);
-   if (!err)
-   return;
+   __loop_clr_fd(lo);
+   return;
} else if (lo->lo_state == Lo_bound) {
/*
 * Otherwise keep thread (if running) and config,
-- 
2.16.4



[PATCH 01/15] block/loop: Don't grab "struct file" for vfs_getattr() operation.

2018-10-10 Thread Jan Kara
From: Tetsuo Handa 

vfs_getattr() needs "struct path" rather than "struct file".
Let's use path_get()/path_put() rather than get_file()/fput().

Signed-off-by: Tetsuo Handa 
Reviewed-by: Jan Kara 
Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index ea9debf59b22..50c81ff44ae2 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1205,7 +1205,7 @@ loop_set_status(struct loop_device *lo, const struct 
loop_info64 *info)
 static int
 loop_get_status(struct loop_device *lo, struct loop_info64 *info)
 {
-   struct file *file;
+   struct path path;
struct kstat stat;
int ret;
 
@@ -1230,16 +1230,16 @@ loop_get_status(struct loop_device *lo, struct 
loop_info64 *info)
}
 
/* Drop lo_ctl_mutex while we call into the filesystem. */
-   file = get_file(lo->lo_backing_file);
+   path = lo->lo_backing_file->f_path;
+   path_get(&path);
mutex_unlock(&lo->lo_ctl_mutex);
-   ret = vfs_getattr(&file->f_path, &stat, STATX_INO,
- AT_STATX_SYNC_AS_STAT);
+   ret = vfs_getattr(&path, &stat, STATX_INO, AT_STATX_SYNC_AS_STAT);
if (!ret) {
info->lo_device = huge_encode_dev(stat.dev);
info->lo_inode = stat.ino;
info->lo_rdevice = huge_encode_dev(stat.rdev);
}
-   fput(file);
+   path_put(&path);
return ret;
 }
 
-- 
2.16.4



[PATCH 11/15] loop: Push loop_ctl_mutex down to loop_change_fd()

2018-10-10 Thread Jan Kara
Push loop_ctl_mutex down to loop_change_fd(). We will need this to be
able to call loop_reread_partitions() without loop_ctl_mutex.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index b4500d82238d..c53ad5e88a7d 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -692,19 +692,22 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
struct file *file, *old_file;
int error;
 
+   error = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (error)
+   return error;
error = -ENXIO;
if (lo->lo_state != Lo_bound)
-   goto out;
+   goto out_unlock;
 
/* the loop device has to be read-only */
error = -EINVAL;
if (!(lo->lo_flags & LO_FLAGS_READ_ONLY))
-   goto out;
+   goto out_unlock;
 
error = -EBADF;
file = fget(arg);
if (!file)
-   goto out;
+   goto out_unlock;
 
error = loop_validate_file(file, bdev);
if (error)
@@ -731,11 +734,13 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
fput(old_file);
if (lo->lo_flags & LO_FLAGS_PARTSCAN)
loop_reread_partitions(lo, bdev);
+   mutex_unlock(&loop_ctl_mutex);
return 0;
 
- out_putf:
+out_putf:
fput(file);
- out:
+out_unlock:
+   mutex_unlock(&loop_ctl_mutex);
return error;
 }
 
@@ -1470,12 +1475,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
case LOOP_SET_FD:
return loop_set_fd(lo, mode, bdev, arg);
case LOOP_CHANGE_FD:
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
-   err = loop_change_fd(lo, bdev, arg);
-   mutex_unlock(&loop_ctl_mutex);
-   break;
+   return loop_change_fd(lo, bdev, arg);
case LOOP_CLR_FD:
return loop_clr_fd(lo);
case LOOP_SET_STATUS:
-- 
2.16.4



[PATCH 03/15] loop: Fold __loop_release into loop_release

2018-10-10 Thread Jan Kara
__loop_release() has a single call site. Fold it there. This is
currently not a huge win but it will make following replacement of
loop_index_mutex more obvious.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d0f1b7106572..cc43d835fe6f 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1642,12 +1642,15 @@ static int lo_open(struct block_device *bdev, fmode_t 
mode)
return err;
 }
 
-static void __lo_release(struct loop_device *lo)
+static void lo_release(struct gendisk *disk, fmode_t mode)
 {
+   struct loop_device *lo;
int err;
 
+   mutex_lock(&loop_index_mutex);
+   lo = disk->private_data;
if (atomic_dec_return(&lo->lo_refcnt))
-   return;
+   goto unlock_index;

mutex_lock(&loop_ctl_mutex);
if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
@@ -1657,7 +1660,7 @@ static void __lo_release(struct loop_device *lo)
 */
err = loop_clr_fd(lo);
if (!err)
-   return;
+   goto unlock_index;
} else if (lo->lo_state == Lo_bound) {
/*
 * Otherwise keep thread (if running) and config,
@@ -1668,12 +1671,7 @@ static void __lo_release(struct loop_device *lo)
}
 
mutex_unlock(&loop_ctl_mutex);
-}
-
-static void lo_release(struct gendisk *disk, fmode_t mode)
-{
-   mutex_lock(&loop_index_mutex);
-   __lo_release(disk->private_data);
+unlock_index:
mutex_unlock(&loop_index_mutex);
 }
 
-- 
2.16.4



[PATCH 05/15] loop: Push lo_ctl_mutex down into individual ioctls

2018-10-10 Thread Jan Kara
Push acquisition of lo_ctl_mutex down into individual ioctl handling
branches. This is a preparatory step for pushing the lock down into
individual ioctl handling functions so that they can release the lock as
they need it. We also factor out some simple ioctl handlers that will
not need any special handling to reduce unnecessary code duplication.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 88 +---
 1 file changed, 63 insertions(+), 25 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 3fa5e63944a4..b7b3b4ae0896 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1394,70 +1394,108 @@ static int loop_set_block_size(struct loop_device *lo, 
unsigned long arg)
return 0;
 }
 
-static int lo_ioctl(struct block_device *bdev, fmode_t mode,
-   unsigned int cmd, unsigned long arg)
+static int lo_simple_ioctl(struct loop_device *lo, unsigned int cmd,
+  unsigned long arg)
 {
-   struct loop_device *lo = bdev->bd_disk->private_data;
int err;
 
err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
if (err)
-   goto out_unlocked;
+   return err;
+   switch (cmd) {
+   case LOOP_SET_CAPACITY:
+   err = loop_set_capacity(lo);
+   break;
+   case LOOP_SET_DIRECT_IO:
+   err = loop_set_dio(lo, arg);
+   break;
+   case LOOP_SET_BLOCK_SIZE:
+   err = loop_set_block_size(lo, arg);
+   break;
+   default:
+   err = lo->ioctl ? lo->ioctl(lo, cmd, arg) : -EINVAL;
+   }
+   mutex_unlock(&loop_ctl_mutex);
+   return err;
+}
+
+static int lo_ioctl(struct block_device *bdev, fmode_t mode,
+   unsigned int cmd, unsigned long arg)
+{
+   struct loop_device *lo = bdev->bd_disk->private_data;
+   int err;
 
switch (cmd) {
case LOOP_SET_FD:
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
err = loop_set_fd(lo, mode, bdev, arg);
+   mutex_unlock(&loop_ctl_mutex);
break;
case LOOP_CHANGE_FD:
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
err = loop_change_fd(lo, bdev, arg);
+   mutex_unlock(&loop_ctl_mutex);
break;
case LOOP_CLR_FD:
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
/* loop_clr_fd would have unlocked loop_ctl_mutex on success */
err = loop_clr_fd(lo);
-   if (!err)
-   goto out_unlocked;
+   if (err)
+   mutex_unlock(&loop_ctl_mutex);
break;
case LOOP_SET_STATUS:
err = -EPERM;
-   if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
+   if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
err = loop_set_status_old(lo,
(struct loop_info __user *)arg);
+   mutex_unlock(&loop_ctl_mutex);
+   }
break;
case LOOP_GET_STATUS:
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
err = loop_get_status_old(lo, (struct loop_info __user *) arg);
/* loop_get_status() unlocks loop_ctl_mutex */
-   goto out_unlocked;
+   break;
case LOOP_SET_STATUS64:
err = -EPERM;
-   if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
+   if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
err = loop_set_status64(lo,
(struct loop_info64 __user *) arg);
+   mutex_unlock(&loop_ctl_mutex);
+   }
break;
case LOOP_GET_STATUS64:
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
err = loop_get_status64(lo, (struct loop_info64 __user *) arg);
/* loop_get_status() unlocks loop_ctl_mutex */
-   goto out_unlocked;
-   case LOOP_SET_CAPACITY:
-   err = -EPERM;
-   if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
-   err = loop_set_capacity(lo);
break;
+ 

[PATCH 09/15] loop: Push loop_ctl_mutex down to loop_set_status()

2018-10-10 Thread Jan Kara
Push loop_ctl_mutex down to loop_set_status(). We will need this to be
able to call loop_reread_partitions() without loop_ctl_mutex.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 51 +--
 1 file changed, 25 insertions(+), 26 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cb4eeff91238..5661489d11a7 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1142,46 +1142,55 @@ loop_set_status(struct loop_device *lo, const struct 
loop_info64 *info)
struct loop_func_table *xfer;
kuid_t uid = current_uid();
 
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
if (lo->lo_encrypt_key_size &&
!uid_eq(lo->lo_key_owner, uid) &&
-   !capable(CAP_SYS_ADMIN))
-   return -EPERM;
-   if (lo->lo_state != Lo_bound)
-   return -ENXIO;
-   if ((unsigned int) info->lo_encrypt_key_size > LO_KEY_SIZE)
-   return -EINVAL;
+   !capable(CAP_SYS_ADMIN)) {
+   err = -EPERM;
+   goto out_unlock;
+   }
+   if (lo->lo_state != Lo_bound) {
+   err = -ENXIO;
+   goto out_unlock;
+   }
+   if ((unsigned int) info->lo_encrypt_key_size > LO_KEY_SIZE) {
+   err = -EINVAL;
+   goto out_unlock;
+   }
 
/* I/O need to be drained during transfer transition */
blk_mq_freeze_queue(lo->lo_queue);
 
err = loop_release_xfer(lo);
if (err)
-   goto exit;
+   goto out_unfreeze;
 
if (info->lo_encrypt_type) {
unsigned int type = info->lo_encrypt_type;
 
if (type >= MAX_LO_CRYPT) {
err = -EINVAL;
-   goto exit;
+   goto out_unfreeze;
}
xfer = xfer_funcs[type];
if (xfer == NULL) {
err = -EINVAL;
-   goto exit;
+   goto out_unfreeze;
}
} else
xfer = NULL;
 
err = loop_init_xfer(lo, xfer, info);
if (err)
-   goto exit;
+   goto out_unfreeze;
 
if (lo->lo_offset != info->lo_offset ||
lo->lo_sizelimit != info->lo_sizelimit) {
if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit)) {
err = -EFBIG;
-   goto exit;
+   goto out_unfreeze;
}
}
 
@@ -1213,7 +1222,7 @@ loop_set_status(struct loop_device *lo, const struct 
loop_info64 *info)
/* update dio if lo_offset or transfer is changed */
__loop_update_dio(lo, lo->use_dio);
 
- exit:
+out_unfreeze:
blk_mq_unfreeze_queue(lo->lo_queue);
 
if (!err && (info->lo_flags & LO_FLAGS_PARTSCAN) &&
@@ -1222,6 +1231,8 @@ loop_set_status(struct loop_device *lo, const struct 
loop_info64 *info)
lo->lo_disk->flags &= ~GENHD_FL_NO_PART_SCAN;
loop_reread_partitions(lo, lo->lo_device);
}
+out_unlock:
+   mutex_unlock(&loop_ctl_mutex);
 
return err;
 }
@@ -1468,12 +1479,8 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
case LOOP_SET_STATUS:
err = -EPERM;
if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
err = loop_set_status_old(lo,
(struct loop_info __user *)arg);
-   mutex_unlock(&loop_ctl_mutex);
}
break;
case LOOP_GET_STATUS:
@@ -1481,12 +1488,8 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
case LOOP_SET_STATUS64:
err = -EPERM;
if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
err = loop_set_status64(lo,
(struct loop_info64 __user *) arg);
-   mutex_unlock(&loop_ctl_mutex);
}
break;
case LOOP_GET_STATUS64:
@@ -1631,12 +1634,8 @@ static int lo_compat_ioctl(struct block_device *bdev, 
fmode_t mode,
 
switch(cmd) {
case LOOP_SET_STATUS:
-   err = mutex_lock_killable(&loop_ctl_mutex);
-   if (!err) {
-   err = loop_set_status_compat(lo,
-(const struct 
compat_loop_info __user *)a

[PATCH 14/15] loop: Fix deadlock when calling blkdev_reread_part()

2018-10-10 Thread Jan Kara
Calling blkdev_reread_part() under loop_ctl_mutex causes lockdep to
complain about a circular lock dependency between bdev->bd_mutex and
lo->lo_ctl_mutex. The problem is that on loop device open or close
lo_open() and lo_release() get called with bdev->bd_mutex held and they
need to acquire loop_ctl_mutex. OTOH when loop_reread_partitions() is
called with loop_ctl_mutex held, it will call blkdev_reread_part() which
acquires bdev->bd_mutex. See syzbot report for details [1].

Move call to blkdev_reread_part() in __loop_clr_fd() from under
loop_ctl_mutex to finish fixing of the lockdep warning and the possible
deadlock.

[1] https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d1588

Reported-by: syzbot 

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 28 
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 0d54c3ee3a96..d6b3fac25040 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1031,12 +1031,14 @@ loop_init_xfer(struct loop_device *lo, struct 
loop_func_table *xfer,
return err;
 }
 
-static int __loop_clr_fd(struct loop_device *lo)
+static int __loop_clr_fd(struct loop_device *lo, bool release)
 {
struct file *filp = NULL;
gfp_t gfp = lo->old_gfp_mask;
struct block_device *bdev = lo->lo_device;
int err = 0;
+   bool partscan = false;
+   int lo_number;
 
mutex_lock(&loop_ctl_mutex);
if (WARN_ON_ONCE(lo->lo_state != Lo_rundown)) {
@@ -1089,7 +1091,15 @@ static int __loop_clr_fd(struct loop_device *lo)
module_put(THIS_MODULE);
blk_mq_unfreeze_queue(lo->lo_queue);
 
-   if (lo->lo_flags & LO_FLAGS_PARTSCAN && bdev) {
+   partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev;
+   lo_number = lo->lo_number;
+   lo->lo_flags = 0;
+   if (!part_shift)
+   lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
+   loop_unprepare_queue(lo);
+out_unlock:
+   mutex_unlock(&loop_ctl_mutex);
+   if (partscan) {
/*
 * bd_mutex has been held already in release path, so don't
 * acquire it if this function is called in such case.
@@ -1098,21 +1108,15 @@ static int __loop_clr_fd(struct loop_device *lo)
 * must be at least one and it can only become zero when the
 * current holder is released.
 */
-   if (!atomic_read(&lo->lo_refcnt))
+   if (release)
err = __blkdev_reread_part(bdev);
else
err = blkdev_reread_part(bdev);
pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
-   __func__, lo->lo_number, err);
+   __func__, lo_number, err);
/* Device is gone, no point in returning error */
err = 0;
}
-   lo->lo_flags = 0;
-   if (!part_shift)
-   lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
-   loop_unprepare_queue(lo);
-out_unlock:
-   mutex_unlock(&loop_ctl_mutex);
/*
 * Need not hold loop_ctl_mutex to fput backing file.
 * Calling fput holding loop_ctl_mutex triggers a circular
@@ -1153,7 +1157,7 @@ static int loop_clr_fd(struct loop_device *lo)
lo->lo_state = Lo_rundown;
mutex_unlock(&loop_ctl_mutex);
 
-   return __loop_clr_fd(lo);
+   return __loop_clr_fd(lo, false);
 }
 
 static int
@@ -1714,7 +1718,7 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 * In autoclear mode, stop the loop thread
 * and remove configuration after last close.
 */
-   __loop_clr_fd(lo);
+   __loop_clr_fd(lo, true);
return;
} else if (lo->lo_state == Lo_bound) {
/*
-- 
2.16.4



[PATCH 08/15] loop: Push loop_ctl_mutex down to loop_get_status()

2018-10-10 Thread Jan Kara
Push loop_ctl_mutex down to loop_get_status() to avoid the unusual
convention that the function gets called with loop_ctl_mutex held and
releases it.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 37 ++---
 1 file changed, 10 insertions(+), 27 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index be41d5dcecd2..cb4eeff91238 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1233,6 +1233,9 @@ loop_get_status(struct loop_device *lo, struct 
loop_info64 *info)
struct kstat stat;
int ret;
 
+   ret = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (ret)
+   return ret;
if (lo->lo_state != Lo_bound) {
mutex_unlock(&loop_ctl_mutex);
return -ENXIO;
@@ -1347,10 +1350,8 @@ loop_get_status_old(struct loop_device *lo, struct 
loop_info __user *arg) {
struct loop_info64 info64;
int err;
 
-   if (!arg) {
-   mutex_unlock(&loop_ctl_mutex);
+   if (!arg)
return -EINVAL;
-   }
err = loop_get_status(lo, &info64);
if (!err)
err = loop_info64_to_old(&info64, &info);
@@ -1365,10 +1366,8 @@ loop_get_status64(struct loop_device *lo, struct 
loop_info64 __user *arg) {
struct loop_info64 info64;
int err;
 
-   if (!arg) {
-   mutex_unlock(&loop_ctl_mutex);
+   if (!arg)
return -EINVAL;
-   }
err = loop_get_status(lo, &info64);
if (!err && copy_to_user(arg, &info64, sizeof(info64)))
err = -EFAULT;
@@ -1478,12 +1477,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
}
break;
case LOOP_GET_STATUS:
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
-   err = loop_get_status_old(lo, (struct loop_info __user *) arg);
-   /* loop_get_status() unlocks loop_ctl_mutex */
-   break;
+   return loop_get_status_old(lo, (struct loop_info __user *) arg);
case LOOP_SET_STATUS64:
err = -EPERM;
if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
@@ -1496,12 +1490,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
}
break;
case LOOP_GET_STATUS64:
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
-   err = loop_get_status64(lo, (struct loop_info64 __user *) arg);
-   /* loop_get_status() unlocks loop_ctl_mutex */
-   break;
+   return loop_get_status64(lo, (struct loop_info64 __user *) arg);
case LOOP_SET_CAPACITY:
case LOOP_SET_DIRECT_IO:
case LOOP_SET_BLOCK_SIZE:
@@ -1626,10 +1615,8 @@ loop_get_status_compat(struct loop_device *lo,
struct loop_info64 info64;
int err;
 
-   if (!arg) {
-   mutex_unlock(&loop_ctl_mutex);
+   if (!arg)
return -EINVAL;
-   }
err = loop_get_status(lo, &info64);
if (!err)
err = loop_info64_to_compat(&info64, arg);
@@ -1652,12 +1639,8 @@ static int lo_compat_ioctl(struct block_device *bdev, 
fmode_t mode,
}
break;
case LOOP_GET_STATUS:
-   err = mutex_lock_killable(&loop_ctl_mutex);
-   if (!err) {
-   err = loop_get_status_compat(lo,
-(struct compat_loop_info __user *)arg);
-   /* loop_get_status() unlocks loop_ctl_mutex */
-   }
+   err = loop_get_status_compat(lo,
+(struct compat_loop_info __user *)arg);
break;
case LOOP_SET_CAPACITY:
case LOOP_CLR_FD:
-- 
2.16.4



[PATCH 12/15] loop: Move special partition reread handling in loop_clr_fd()

2018-10-10 Thread Jan Kara
The call of __blkdev_reread_part() from loop_reread_partition() happens
only when we need to invalidate partitions from loop_release(). Thus
move a detection for this into loop_clr_fd() and simplify
loop_reread_partition().

This makes loop_reread_partition() safe to use without loop_ctl_mutex
because we use only lo->lo_number and lo->lo_file_name in case of error
for reporting purposes (thus possibly reporting outdated information is
not a big deal) and we are safe from 'lo' going away under us by
elevated lo->lo_refcnt.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 33 +++--
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c53ad5e88a7d..db73fb5f16c7 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -631,18 +631,7 @@ static void loop_reread_partitions(struct loop_device *lo,
 {
int rc;
 
-   /*
-* bd_mutex has been held already in release path, so don't
-* acquire it if this function is called in such case.
-*
-* If the reread partition isn't from release path, lo_refcnt
-* must be at least one and it can only become zero when the
-* current holder is released.
-*/
-   if (!atomic_read(&lo->lo_refcnt))
-   rc = __blkdev_reread_part(bdev);
-   else
-   rc = blkdev_reread_part(bdev);
+   rc = blkdev_reread_part(bdev);
if (rc)
pr_warn("%s: partition scan of loop%d (%s) failed (rc=%d)\n",
__func__, lo->lo_number, lo->lo_file_name, rc);
@@ -1096,8 +1085,24 @@ static int __loop_clr_fd(struct loop_device *lo)
module_put(THIS_MODULE);
blk_mq_unfreeze_queue(lo->lo_queue);
 
-   if (lo->lo_flags & LO_FLAGS_PARTSCAN && bdev)
-   loop_reread_partitions(lo, bdev);
+   if (lo->lo_flags & LO_FLAGS_PARTSCAN && bdev) {
+   /*
+* bd_mutex has been held already in release path, so don't
+* acquire it if this function is called in such case.
+*
+* If the reread partition isn't from release path, lo_refcnt
+* must be at least one and it can only become zero when the
+* current holder is released.
+*/
+   if (!atomic_read(&lo->lo_refcnt))
+   err = __blkdev_reread_part(bdev);
+   else
+   err = blkdev_reread_part(bdev);
+   pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
+   __func__, lo->lo_number, err);
+   /* Device is gone, no point in returning error */
+   err = 0;
+   }
lo->lo_flags = 0;
if (!part_shift)
lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
-- 
2.16.4



[PATCH 04/15] loop: Get rid of loop_index_mutex

2018-10-10 Thread Jan Kara
Now that loop_ctl_mutex is global, just get rid of loop_index_mutex as
there is no good reason to keep these two separate and it just
complicates the locking.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 41 -
 1 file changed, 20 insertions(+), 21 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cc43d835fe6f..3fa5e63944a4 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -83,7 +83,6 @@
 #include 
 
 static DEFINE_IDR(loop_index_idr);
-static DEFINE_MUTEX(loop_index_mutex);
 static DEFINE_MUTEX(loop_ctl_mutex);
 
 static int max_part;
@@ -1627,9 +1626,11 @@ static int lo_compat_ioctl(struct block_device *bdev, 
fmode_t mode,
 static int lo_open(struct block_device *bdev, fmode_t mode)
 {
struct loop_device *lo;
-   int err = 0;
+   int err;
 
-   mutex_lock(&loop_index_mutex);
+   err = mutex_lock_killable(&loop_ctl_mutex);
+   if (err)
+   return err;
lo = bdev->bd_disk->private_data;
if (!lo) {
err = -ENXIO;
@@ -1638,7 +1639,7 @@ static int lo_open(struct block_device *bdev, fmode_t 
mode)
 
atomic_inc(>lo_refcnt);
 out:
-   mutex_unlock(&loop_index_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return err;
 }
 
@@ -1647,12 +1648,11 @@ static void lo_release(struct gendisk *disk, fmode_t 
mode)
struct loop_device *lo;
int err;
 
-   mutex_lock(&loop_index_mutex);
+   mutex_lock(&loop_ctl_mutex);
lo = disk->private_data;
if (atomic_dec_return(&lo->lo_refcnt))
-   goto unlock_index;
+   goto out_unlock;
 
-   mutex_lock(&loop_ctl_mutex);
if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
/*
 * In autoclear mode, stop the loop thread
@@ -1660,7 +1660,7 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 */
err = loop_clr_fd(lo);
if (!err)
-   goto unlock_index;
+   return;
} else if (lo->lo_state == Lo_bound) {
/*
 * Otherwise keep thread (if running) and config,
@@ -1670,9 +1670,8 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
blk_mq_unfreeze_queue(lo->lo_queue);
}
 
+out_unlock:
mutex_unlock(&loop_ctl_mutex);
-unlock_index:
-   mutex_unlock(&loop_index_mutex);
 }
 
 static const struct block_device_operations lo_fops = {
@@ -1973,7 +1972,7 @@ static struct kobject *loop_probe(dev_t dev, int *part, 
void *data)
struct kobject *kobj;
int err;
 
-   mutex_lock(&loop_index_mutex);
+   mutex_lock(&loop_ctl_mutex);
err = loop_lookup(&lo, MINOR(dev) >> part_shift);
if (err < 0)
err = loop_add(&lo, MINOR(dev) >> part_shift);
@@ -1981,7 +1980,7 @@ static struct kobject *loop_probe(dev_t dev, int *part, 
void *data)
kobj = NULL;
else
kobj = get_disk_and_module(lo->lo_disk);
-   mutex_unlock(&loop_index_mutex);
+   mutex_unlock(&loop_ctl_mutex);
 
*part = 0;
return kobj;
@@ -1991,9 +1990,13 @@ static long loop_control_ioctl(struct file *file, 
unsigned int cmd,
   unsigned long parm)
 {
struct loop_device *lo;
-   int ret = -ENOSYS;
+   int ret;
 
-   mutex_lock(&loop_index_mutex);
+   ret = mutex_lock_killable(&loop_ctl_mutex);
+   if (ret)
+   return ret;
+
+   ret = -ENOSYS;
switch (cmd) {
case LOOP_CTL_ADD:
ret = loop_lookup(&lo, parm);
@@ -2007,9 +2010,6 @@ static long loop_control_ioctl(struct file *file, 
unsigned int cmd,
ret = loop_lookup(&lo, parm);
if (ret < 0)
break;
-   ret = mutex_lock_killable(&loop_ctl_mutex);
-   if (ret)
-   break;
if (lo->lo_state != Lo_unbound) {
ret = -EBUSY;
mutex_unlock(&loop_ctl_mutex);
@@ -2021,7 +2021,6 @@ static long loop_control_ioctl(struct file *file, 
unsigned int cmd,
break;
}
lo->lo_disk->private_data = NULL;
-   mutex_unlock(&loop_ctl_mutex);
idr_remove(&loop_index_idr, lo->lo_number);
loop_remove(lo);
break;
@@ -2031,7 +2030,7 @@ static long loop_control_ioctl(struct file *file, 
unsigned int cmd,
break;
ret = loop_add(&lo, -1);
}
-   mutex_unlock(&loop_index_mutex);
+   mutex_unlock(&loop_ctl_mutex);
 
return ret;
 }
@@ -2115,10 +2114,10 @@ static int __init loop_init(void)
  THIS_MODULE, loop_probe, NULL, NULL);
 
/* pre-create number of devices given by config or max_loop */
-   mutex_lock(&loop_index_mutex);
+   mutex_lock(&loop_ctl_mutex);
for (i = 0; i < nr; i++)
loop_add(&lo, i);
-   mutex_unlock(&loop_index_mutex);
+   mutex_unlock(&loop_ctl_mutex);

[PATCH 02/15] block/loop: Use global lock for ioctl() operation.

2018-10-10 Thread Jan Kara
From: Tetsuo Handa 

syzbot is reporting NULL pointer dereference [1] which is caused by
race condition between ioctl(loop_fd, LOOP_CLR_FD, 0) versus
ioctl(other_loop_fd, LOOP_SET_FD, loop_fd) due to traversing other
loop devices at loop_validate_file() without holding corresponding
lo->lo_ctl_mutex locks.

Since ioctl() request on loop devices is not frequent operation, we don't
need fine grained locking. Let's use global lock in order to allow safe
traversal at loop_validate_file().

Note that syzbot is also reporting circular locking dependency between
bdev->bd_mutex and lo->lo_ctl_mutex [2] which is caused by calling
blkdev_reread_part() with lock held. This patch does not address it.

[1] 
https://syzkaller.appspot.com/bug?id=f3cfe26e785d85f9ee259f385515291d21bd80a3
[2] 
https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889

Signed-off-by: Tetsuo Handa 
Reported-by: syzbot 
Reviewed-by: Jan Kara 
Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 58 ++--
 drivers/block/loop.h |  1 -
 2 files changed, 29 insertions(+), 30 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 50c81ff44ae2..d0f1b7106572 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -84,6 +84,7 @@
 
 static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_index_mutex);
+static DEFINE_MUTEX(loop_ctl_mutex);
 
 static int max_part;
 static int part_shift;
@@ -1047,7 +1048,7 @@ static int loop_clr_fd(struct loop_device *lo)
 */
if (atomic_read(&lo->lo_refcnt) > 1) {
lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return 0;
}
 
@@ -1100,12 +1101,12 @@ static int loop_clr_fd(struct loop_device *lo)
if (!part_shift)
lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
loop_unprepare_queue(lo);
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
/*
-* Need not hold lo_ctl_mutex to fput backing file.
-* Calling fput holding lo_ctl_mutex triggers a circular
+* Need not hold loop_ctl_mutex to fput backing file.
+* Calling fput holding loop_ctl_mutex triggers a circular
 * lock dependency possibility warning as fput can take
-* bd_mutex which is usually taken before lo_ctl_mutex.
+* bd_mutex which is usually taken before loop_ctl_mutex.
 */
fput(filp);
return 0;
@@ -1210,7 +1211,7 @@ loop_get_status(struct loop_device *lo, struct 
loop_info64 *info)
int ret;
 
if (lo->lo_state != Lo_bound) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -ENXIO;
}
 
@@ -1229,10 +1230,10 @@ loop_get_status(struct loop_device *lo, struct 
loop_info64 *info)
   lo->lo_encrypt_key_size);
}
 
-   /* Drop lo_ctl_mutex while we call into the filesystem. */
+   /* Drop loop_ctl_mutex while we call into the filesystem. */
path = lo->lo_backing_file->f_path;
path_get(&path);
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
ret = vfs_getattr(&path, &stat, STATX_INO, AT_STATX_SYNC_AS_STAT);
if (!ret) {
info->lo_device = huge_encode_dev(stat.dev);
@@ -1324,7 +1325,7 @@ loop_get_status_old(struct loop_device *lo, struct 
loop_info __user *arg) {
int err;
 
if (!arg) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -EINVAL;
}
err = loop_get_status(lo, &info64);
@@ -1342,7 +1343,7 @@ loop_get_status64(struct loop_device *lo, struct 
loop_info64 __user *arg) {
int err;
 
if (!arg) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -EINVAL;
}
err = loop_get_status(lo, &info64);
@@ -1400,7 +1401,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
struct loop_device *lo = bdev->bd_disk->private_data;
int err;
 
-   err = mutex_lock_killable_nested(&lo->lo_ctl_mutex, 1);
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
if (err)
goto out_unlocked;
 
@@ -1412,7 +1413,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
err = loop_change_fd(lo, bdev, arg);
break;
case LOOP_CLR_FD:
-   /* loop_clr_fd would have unlocked lo_ctl_mutex on success */
+   /* loop_clr_fd would have unlocked loop_ctl_mutex on success */
err = loop_clr_fd(lo);
if (!err)
goto out_unlocked;
@@ -1425,7 +1426,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
break;
case LOOP_GET_STATUS:
   

[PATCH 10/15] loop: Push loop_ctl_mutex down to loop_set_fd()

2018-10-10 Thread Jan Kara
Push lo_ctl_mutex down to loop_set_fd(). We will need this to be able to
call loop_reread_partitions() without lo_ctl_mutex.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5661489d11a7..b4500d82238d 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -919,13 +919,17 @@ static int loop_set_fd(struct loop_device *lo, fmode_t 
mode,
if (!file)
goto out;
 
+   error = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (error)
+   goto out_putf;
+
error = -EBUSY;
if (lo->lo_state != Lo_unbound)
-   goto out_putf;
+   goto out_unlock;
 
error = loop_validate_file(file, bdev);
if (error)
-   goto out_putf;
+   goto out_unlock;
 
mapping = file->f_mapping;
inode = mapping->host;
@@ -937,10 +941,10 @@ static int loop_set_fd(struct loop_device *lo, fmode_t 
mode,
error = -EFBIG;
size = get_loop_size(lo, file);
if ((loff_t)(sector_t)size != size)
-   goto out_putf;
+   goto out_unlock;
error = loop_prepare_queue(lo);
if (error)
-   goto out_putf;
+   goto out_unlock;
 
error = 0;
 
@@ -979,11 +983,14 @@ static int loop_set_fd(struct loop_device *lo, fmode_t 
mode,
 * put /dev/loopXX inode. Later in __loop_clr_fd() we bdput(bdev).
 */
bdgrab(bdev);
+   mutex_unlock(&loop_ctl_mutex);
return 0;
 
- out_putf:
+out_unlock:
+   mutex_unlock(&loop_ctl_mutex);
+out_putf:
fput(file);
- out:
+out:
/* This is safe: open() is still holding a reference. */
module_put(THIS_MODULE);
return error;
@@ -1461,12 +1468,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
 
switch (cmd) {
case LOOP_SET_FD:
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
-   err = loop_set_fd(lo, mode, bdev, arg);
-   mutex_unlock(&loop_ctl_mutex);
-   break;
+   return loop_set_fd(lo, mode, bdev, arg);
case LOOP_CHANGE_FD:
err = mutex_lock_killable_nested(_ctl_mutex, 1);
if (err)
-- 
2.16.4



[PATCH 0/15 v2] loop: Fix oops and possible deadlocks

2018-10-10 Thread Jan Kara
Hi,

this patch series fixes oops and possible deadlocks as reported by syzbot [1]
[2]. The second patch in the series (from Tetsuo) fixes the oops, the remaining
patches are cleaning up the locking in the loop driver so that we can in the
end reasonably easily switch to rereading partitions without holding mutex
protecting the loop device.

I have lightly tested the patches by creating, deleting, and modifying loop
devices but if there's some more comprehensive loopback device testsuite, I
can try running it. Review is welcome!

Changes since v1:
* Added patch moving fput() calls in loop_change_fd() from under loop_ctl_mutex
* Fixed bug in loop_control_ioctl() where it failed to return error properly

Honza

[1] 
https://syzkaller.appspot.com/bug?id=f3cfe26e785d85f9ee259f385515291d21bd80a3
[2] 
https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889



Re: [PATCH 0/14] loop: Fix oops and possible deadlocks

2018-10-04 Thread Jan Kara
On Thu 27-09-18 23:47:01, Tetsuo Handa wrote:
> Possible changes folded into this series.

Thanks for having a look. But please comment on individual patches at
appropriate places instead of sending this patch where everything is just
mixed together. It is much easier to find out what we are talking about
that way.

>   (1) (I guess) no need to use _nested version.

I just preserved the current status as I didn't want to dig into that hole.
Even if you're right, that would be a separate change. Not something these
patches should deal with.

>   (2) Use mutex_lock_killable() where possible.

Where exactly? I've only noticed you've changed loop_probe() where I think
the change is just bogus. That gets called on device insertion and you have
nowhere to deliver the signal in that case.

>   (3) Move fput() to after mutex_unlock().

Again, independent change. I've just preserved the current situation. But
probably worth including in this series as a separate patch. Care to send a
follow up patch with proper changelog etc.?

>   (4) Don't return 0 upon invalid loop_control_ioctl().

Good catch, I'll fix that up.

>   (5) No need to mutex_lock()/mutex_unlock() on each loop device at
>   unregister_transfer_cb() callback.

Another independent optimization. Will you send a follow up patch? I can
write the patch (and the one above) but I don't want to steal the credit
from you...

>   (6) No need to mutex_unlock()+mutex_lock() when calling __loop_clr_fd().

This is deliberate so that we get rid of the weird "__loop_clr_fd()
releases mutex it did not acquire". This is not performance critical path
by any means so better keep the locking simple.

    Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 4/4] block/loop: Fix circular locking dependency at blkdev_reread_part().

2018-09-27 Thread Jan Kara
On Thu 27-09-18 20:35:27, Tetsuo Handa wrote:
> On 2018/09/27 20:27, Jan Kara wrote:
> > Hi,
> > 
> > On Wed 26-09-18 00:26:49, Tetsuo Handa wrote:
> >> syzbot is reporting circular locking dependency between bdev->bd_mutex
> >> and lo->lo_ctl_mutex [1] which is caused by calling blkdev_reread_part()
> >> with lock held. We need to drop lo->lo_ctl_mutex in order to fix it.
> >>
> >> This patch fixes it by combining loop_index_mutex and loop_ctl_mutex into
> >> loop_mutex, and releasing loop_mutex before calling blkdev_reread_part()
> >> or fput() or path_put() or leaving ioctl().
> >>
> >> The rule is that current thread calls lock_loop() before accessing
> >> "struct loop_device", and current thread no longer accesses "struct
> >> loop_device" after unlock_loop() is called.
> >>
> >> Since syzbot is reporting various bugs [2] where a race in the loop module
> >> is suspected, let's check whether this patch affects these bugs too.
> >>
> >> [1] 
> >> https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889
> >> [2] 
> >> https://syzkaller.appspot.com/bug?id=b3c7e1440aa8ece16bf557dbac427fdff1dad9d6
> >>
> >> Signed-off-by: Tetsuo Handa 
> >> Reported-by: syzbot 
> >> 
> >> ---
> >>  drivers/block/loop.c | 187 
> >> ---
> >>  1 file changed, 101 insertions(+), 86 deletions(-)
> > 
> > I still don't like this patch. I'll post a patch series showing what I have
> > in mind. Admittedly, it's a bit tedious but the locking is much saner
> > afterwards...
> 
> Please be sure to Cc: me. I'm not subscribed to linux-block ML.

Yes, I've CCed you.

> But if we have to release lo_ctl_mutex before calling blkdev_reread_part(),
> what is nice with re-acquiring lo_ctl_mutex after blkdev_reread_part() ?

We don't reacquire the mutex after blkdev_reread_part(). Just the code
needed to be cleaned up so that loop_reread_part() does not need
lo_ctl_mutex for anything.

Honza

-- 
Jan Kara 
SUSE Labs, CR


[PATCH 07/14] loop: Push loop_ctl_mutex down into loop_clr_fd()

2018-09-27 Thread Jan Kara
loop_clr_fd() has a weird locking convention that expects
loop_ctl_mutex held, releases it on success and keeps it on failure.
Untangle the mess by moving locking of loop_ctl_mutex into
loop_clr_fd().

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 49 +
 1 file changed, 29 insertions(+), 20 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 51d11898e170..e4b82ca49286 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1028,15 +1028,22 @@ loop_init_xfer(struct loop_device *lo, struct 
loop_func_table *xfer,
 
 static int __loop_clr_fd(struct loop_device *lo)
 {
-   struct file *filp = lo->lo_backing_file;
+   struct file *filp = NULL;
gfp_t gfp = lo->old_gfp_mask;
struct block_device *bdev = lo->lo_device;
+   int err = 0;
 
-   if (WARN_ON_ONCE(lo->lo_state != Lo_rundown))
-   return -ENXIO;
+   mutex_lock(&loop_ctl_mutex);
+   if (WARN_ON_ONCE(lo->lo_state != Lo_rundown)) {
+   err = -ENXIO;
+   goto out_unlock;
+   }
 
-   if (filp == NULL)
-   return -EINVAL;
+   filp = lo->lo_backing_file;
+   if (filp == NULL) {
+   err = -EINVAL;
+   goto out_unlock;
+   }
 
/* freeze request queue during the transition */
blk_mq_freeze_queue(lo->lo_queue);
@@ -1083,6 +1090,7 @@ static int __loop_clr_fd(struct loop_device *lo)
if (!part_shift)
lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
loop_unprepare_queue(lo);
+out_unlock:
mutex_unlock(&loop_ctl_mutex);
/*
 * Need not hold loop_ctl_mutex to fput backing file.
@@ -1090,14 +1098,22 @@ static int __loop_clr_fd(struct loop_device *lo)
 * lock dependency possibility warning as fput can take
 * bd_mutex which is usually taken before loop_ctl_mutex.
 */
-   fput(filp);
-   return 0;
+   if (filp)
+   fput(filp);
+   return err;
 }
 
 static int loop_clr_fd(struct loop_device *lo)
 {
-   if (lo->lo_state != Lo_bound)
+   int err;
+
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
+   if (lo->lo_state != Lo_bound) {
+   mutex_unlock(&loop_ctl_mutex);
return -ENXIO;
+   }
/*
 * If we've explicitly asked to tear down the loop device,
 * and it has an elevated reference count, set it for auto-teardown when
@@ -1114,6 +1130,7 @@ static int loop_clr_fd(struct loop_device *lo)
return 0;
}
lo->lo_state = Lo_rundown;
+   mutex_unlock(&loop_ctl_mutex);
 
return __loop_clr_fd(lo);
 }
@@ -1448,14 +1465,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
mutex_unlock(&loop_ctl_mutex);
break;
case LOOP_CLR_FD:
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
-   /* loop_clr_fd would have unlocked loop_ctl_mutex on success */
-   err = loop_clr_fd(lo);
-   if (err)
-   mutex_unlock(&loop_ctl_mutex);
-   break;
+   return loop_clr_fd(lo);
case LOOP_SET_STATUS:
err = -EPERM;
if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
@@ -1691,7 +1701,6 @@ static int lo_open(struct block_device *bdev, fmode_t 
mode)
 static void lo_release(struct gendisk *disk, fmode_t mode)
 {
struct loop_device *lo;
-   int err;
 
mutex_lock(&loop_ctl_mutex);
lo = disk->private_data;
@@ -1702,13 +1711,13 @@ static void lo_release(struct gendisk *disk, fmode_t 
mode)
if (lo->lo_state != Lo_bound)
goto out_unlock;
lo->lo_state = Lo_rundown;
+   mutex_unlock(&loop_ctl_mutex);
/*
 * In autoclear mode, stop the loop thread
 * and remove configuration after last close.
 */
-   err = __loop_clr_fd(lo);
-   if (!err)
-   return;
+   __loop_clr_fd(lo);
+   return;
} else if (lo->lo_state == Lo_bound) {
/*
 * Otherwise keep thread (if running) and config,
-- 
2.16.4



[PATCH 04/14] loop: Get rid of loop_index_mutex

2018-09-27 Thread Jan Kara
Now that loop_ctl_mutex is global, just get rid of loop_index_mutex as
there is no good reason to keep these two separate and it just
complicates the locking.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 38 ++
 1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cc43d835fe6f..e35707fb8318 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -83,7 +83,6 @@
 #include 
 
 static DEFINE_IDR(loop_index_idr);
-static DEFINE_MUTEX(loop_index_mutex);
 static DEFINE_MUTEX(loop_ctl_mutex);
 
 static int max_part;
@@ -1627,9 +1626,11 @@ static int lo_compat_ioctl(struct block_device *bdev, 
fmode_t mode,
 static int lo_open(struct block_device *bdev, fmode_t mode)
 {
struct loop_device *lo;
-   int err = 0;
+   int err;
 
-   mutex_lock(&loop_index_mutex);
+   err = mutex_lock_killable(&loop_ctl_mutex);
+   if (err)
+   return err;
lo = bdev->bd_disk->private_data;
if (!lo) {
err = -ENXIO;
@@ -1638,7 +1639,7 @@ static int lo_open(struct block_device *bdev, fmode_t 
mode)
 
atomic_inc(>lo_refcnt);
 out:
-   mutex_unlock(&loop_index_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return err;
 }
 
@@ -1647,12 +1648,11 @@ static void lo_release(struct gendisk *disk, fmode_t 
mode)
struct loop_device *lo;
int err;
 
-   mutex_lock(&loop_index_mutex);
+   mutex_lock(&loop_ctl_mutex);
lo = disk->private_data;
if (atomic_dec_return(&lo->lo_refcnt))
-   goto unlock_index;
+   goto out_unlock;
 
-   mutex_lock(&loop_ctl_mutex);
if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
/*
 * In autoclear mode, stop the loop thread
@@ -1660,7 +1660,7 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 */
err = loop_clr_fd(lo);
if (!err)
-   goto unlock_index;
+   return;
} else if (lo->lo_state == Lo_bound) {
/*
 * Otherwise keep thread (if running) and config,
@@ -1670,9 +1670,8 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
blk_mq_unfreeze_queue(lo->lo_queue);
}
 
+out_unlock:
mutex_unlock(&loop_ctl_mutex);
-unlock_index:
-   mutex_unlock(_index_mutex);
 }
 
 static const struct block_device_operations lo_fops = {
@@ -1973,7 +1972,7 @@ static struct kobject *loop_probe(dev_t dev, int *part, 
void *data)
struct kobject *kobj;
int err;
 
-   mutex_lock(&loop_index_mutex);
+   mutex_lock(&loop_ctl_mutex);
err = loop_lookup(&lo, MINOR(dev) >> part_shift);
if (err < 0)
err = loop_add(&lo, MINOR(dev) >> part_shift);
@@ -1981,7 +1980,7 @@ static struct kobject *loop_probe(dev_t dev, int *part, 
void *data)
kobj = NULL;
else
kobj = get_disk_and_module(lo->lo_disk);
-   mutex_unlock(&loop_index_mutex);
+   mutex_unlock(&loop_ctl_mutex);
 
*part = 0;
return kobj;
@@ -1993,7 +1992,10 @@ static long loop_control_ioctl(struct file *file, 
unsigned int cmd,
struct loop_device *lo;
int ret = -ENOSYS;
 
-   mutex_lock(&loop_index_mutex);
+   ret = mutex_lock_killable(&loop_ctl_mutex);
+   if (ret)
+   return ret;
+
switch (cmd) {
case LOOP_CTL_ADD:
ret = loop_lookup(&lo, parm);
@@ -2007,9 +2009,6 @@ static long loop_control_ioctl(struct file *file, 
unsigned int cmd,
ret = loop_lookup(&lo, parm);
if (ret < 0)
break;
-   ret = mutex_lock_killable(&loop_ctl_mutex);
-   if (ret)
-   break;
if (lo->lo_state != Lo_unbound) {
ret = -EBUSY;
mutex_unlock(&loop_ctl_mutex);
@@ -2021,7 +2020,6 @@ static long loop_control_ioctl(struct file *file, 
unsigned int cmd,
break;
}
lo->lo_disk->private_data = NULL;
-   mutex_unlock(&loop_ctl_mutex);
idr_remove(&loop_index_idr, lo->lo_number);
loop_remove(lo);
break;
@@ -2031,7 +2029,7 @@ static long loop_control_ioctl(struct file *file, 
unsigned int cmd,
break;
ret = loop_add(&lo, -1);
}
-   mutex_unlock(&loop_index_mutex);
+   mutex_unlock(&loop_ctl_mutex);
 
return ret;
 }
@@ -2115,10 +2113,10 @@ static int __init loop_init(void)
  THIS_MODULE, loop_probe, NULL, NULL);
 
/* pre-create number of devices given by config or max_loop */
-   mutex_lock(&loop_index_mutex);
+   mutex_lock(&loop_ctl_mutex);
for (i = 0; i < nr; i++)
loop_add(&lo, i);
-   mutex_unlock(&loop_index_mutex);
+   mutex_unlock(&loop_ctl_mutex);

[PATCH 13/14] loop: Move loop_reread_partitions() out of loop_ctl_mutex

2018-09-27 Thread Jan Kara
Calling loop_reread_partitions() under loop_ctl_mutex causes lockdep to
complain about circular lock dependency between bdev->bd_mutex and
lo->lo_ctl_mutex. The problem is that on loop device open or close
lo_open() and lo_release() get called with bdev->bd_mutex held and they
need to acquire loop_ctl_mutex. OTOH when loop_reread_partitions() is
called with loop_ctl_mutex held, it will call blkdev_reread_part() which
acquires bdev->bd_mutex. See syzbot report for details [1].

Move all calls of loop_reread_partitions() out of loop_ctl_mutex to
avoid lockdep warning and fix deadlock possibility.

[1] https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889

Reported-by: syzbot 

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cce1a32ae6d0..d2be85c48f03 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -680,6 +680,7 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
 {
struct file *file, *old_file;
int error;
+   bool partscan;
 
error = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
if (error)
@@ -721,9 +722,10 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
blk_mq_unfreeze_queue(lo->lo_queue);
 
fput(old_file);
-   if (lo->lo_flags & LO_FLAGS_PARTSCAN)
-   loop_reread_partitions(lo, bdev);
+   partscan = lo->lo_flags & LO_FLAGS_PARTSCAN;
mutex_unlock(&loop_ctl_mutex);
+   if (partscan)
+   loop_reread_partitions(lo, bdev);
return 0;
 
 out_putf:
@@ -904,6 +906,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
int lo_flags = 0;
int error;
loff_t  size;
+   bool partscan;
 
/* This is safe, since we have a reference from open(). */
__module_get(THIS_MODULE);
@@ -970,14 +973,15 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
lo->lo_state = Lo_bound;
if (part_shift)
lo->lo_flags |= LO_FLAGS_PARTSCAN;
-   if (lo->lo_flags & LO_FLAGS_PARTSCAN)
-   loop_reread_partitions(lo, bdev);
+   partscan = lo->lo_flags & LO_FLAGS_PARTSCAN;
 
/* Grab the block_device to prevent its destruction after we
 * put /dev/loopXX inode. Later in __loop_clr_fd() we bdput(bdev).
 */
bdgrab(bdev);
mutex_unlock(&loop_ctl_mutex);
+   if (partscan)
+   loop_reread_partitions(lo, bdev);
return 0;
 
 out_unlock:
@@ -1158,6 +1162,8 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
int err;
struct loop_func_table *xfer;
kuid_t uid = current_uid();
+   struct block_device *bdev;
+   bool partscan = false;
 
err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
if (err)
@@ -1246,10 +1252,13 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 !(lo->lo_flags & LO_FLAGS_PARTSCAN)) {
lo->lo_flags |= LO_FLAGS_PARTSCAN;
lo->lo_disk->flags &= ~GENHD_FL_NO_PART_SCAN;
-   loop_reread_partitions(lo, lo->lo_device);
+   bdev = lo->lo_device;
+   partscan = true;
}
 out_unlock:
mutex_unlock(&loop_ctl_mutex);
+   if (partscan)
+   loop_reread_partitions(lo, bdev);
 
return err;
 }
-- 
2.16.4



[PATCH 03/14] loop: Fold __loop_release into loop_release

2018-09-27 Thread Jan Kara
__loop_release() has a single call site. Fold it there. This is
currently not a huge win but it will make following replacement of
loop_index_mutex more obvious.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d0f1b7106572..cc43d835fe6f 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1642,12 +1642,15 @@ static int lo_open(struct block_device *bdev, fmode_t mode)
return err;
 }
 
-static void __lo_release(struct loop_device *lo)
+static void lo_release(struct gendisk *disk, fmode_t mode)
 {
+   struct loop_device *lo;
int err;
 
+   mutex_lock(&loop_index_mutex);
+   lo = disk->private_data;
if (atomic_dec_return(&lo->lo_refcnt))
-   return;
+   goto unlock_index;
 
mutex_lock(&loop_ctl_mutex);
if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
@@ -1657,7 +1660,7 @@ static void __lo_release(struct loop_device *lo)
 */
err = loop_clr_fd(lo);
if (!err)
-   return;
+   goto unlock_index;
} else if (lo->lo_state == Lo_bound) {
/*
 * Otherwise keep thread (if running) and config,
@@ -1668,12 +1671,7 @@ static void __lo_release(struct loop_device *lo)
}
 
mutex_unlock(&loop_ctl_mutex);
-}
-
-static void lo_release(struct gendisk *disk, fmode_t mode)
-{
-   mutex_lock(&loop_index_mutex);
-   __lo_release(disk->private_data);
+unlock_index:
mutex_unlock(&loop_index_mutex);
 }
 
-- 
2.16.4



[PATCH 14/14] loop: Fix deadlock when calling blkdev_reread_part()

2018-09-27 Thread Jan Kara
Calling blkdev_reread_part() under loop_ctl_mutex causes lockdep to
complain about circular lock dependency between bdev->bd_mutex and
lo->lo_ctl_mutex. The problem is that on loop device open or close
lo_open() and lo_release() get called with bdev->bd_mutex held and they
need to acquire loop_ctl_mutex. OTOH when loop_reread_partitions() is
called with loop_ctl_mutex held, it will call blkdev_reread_part() which
acquires bdev->bd_mutex. See syzbot report for details [1].

Move the call to blkdev_reread_part() in __loop_clr_fd() from under
loop_ctl_mutex to finish fixing the lockdep warning and the possible
deadlock.

[1] https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889

Reported-by: syzbot 

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 28 
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d2be85c48f03..a0fb7bf62b29 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1031,12 +1031,14 @@ loop_init_xfer(struct loop_device *lo, struct loop_func_table *xfer,
return err;
 }
 
-static int __loop_clr_fd(struct loop_device *lo)
+static int __loop_clr_fd(struct loop_device *lo, bool release)
 {
struct file *filp = NULL;
gfp_t gfp = lo->old_gfp_mask;
struct block_device *bdev = lo->lo_device;
int err = 0;
+   bool partscan = false;
+   int lo_number;
 
mutex_lock(&loop_ctl_mutex);
if (WARN_ON_ONCE(lo->lo_state != Lo_rundown)) {
@@ -1089,7 +1091,15 @@ static int __loop_clr_fd(struct loop_device *lo)
module_put(THIS_MODULE);
blk_mq_unfreeze_queue(lo->lo_queue);
 
-   if (lo->lo_flags & LO_FLAGS_PARTSCAN && bdev) {
+   partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev;
+   lo_number = lo->lo_number;
+   lo->lo_flags = 0;
+   if (!part_shift)
+   lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
+   loop_unprepare_queue(lo);
+out_unlock:
+   mutex_unlock(&loop_ctl_mutex);
+   if (partscan) {
/*
 * bd_mutex has been held already in release path, so don't
 * acquire it if this function is called in such case.
@@ -1098,21 +1108,15 @@ static int __loop_clr_fd(struct loop_device *lo)
 * must be at least one and it can only become zero when the
 * current holder is released.
 */
-   if (!atomic_read(&lo->lo_refcnt))
+   if (release)
err = __blkdev_reread_part(bdev);
else
err = blkdev_reread_part(bdev);
pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
-   __func__, lo->lo_number, err);
+   __func__, lo_number, err);
/* Device is gone, no point in returning error */
err = 0;
}
-   lo->lo_flags = 0;
-   if (!part_shift)
-   lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
-   loop_unprepare_queue(lo);
-out_unlock:
-   mutex_unlock(&loop_ctl_mutex);
/*
 * Need not hold loop_ctl_mutex to fput backing file.
 * Calling fput holding loop_ctl_mutex triggers a circular
@@ -1153,7 +1157,7 @@ static int loop_clr_fd(struct loop_device *lo)
lo->lo_state = Lo_rundown;
mutex_unlock(_ctl_mutex);
 
-   return __loop_clr_fd(lo);
+   return __loop_clr_fd(lo, false);
 }
 
 static int
@@ -1714,7 +1718,7 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 * In autoclear mode, stop the loop thread
 * and remove configuration after last close.
 */
-   __loop_clr_fd(lo);
+   __loop_clr_fd(lo, true);
return;
} else if (lo->lo_state == Lo_bound) {
/*
-- 
2.16.4



[PATCH 09/14] loop: Push loop_ctl_mutex down to loop_set_status()

2018-09-27 Thread Jan Kara
Push loop_ctl_mutex down to loop_set_status(). We will need this to be
able to call loop_reread_partitions() without loop_ctl_mutex.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 51 +--
 1 file changed, 25 insertions(+), 26 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 577d5e5a9312..1cc29bd77d67 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1142,46 +1142,55 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
struct loop_func_table *xfer;
kuid_t uid = current_uid();
 
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
if (lo->lo_encrypt_key_size &&
!uid_eq(lo->lo_key_owner, uid) &&
-   !capable(CAP_SYS_ADMIN))
-   return -EPERM;
-   if (lo->lo_state != Lo_bound)
-   return -ENXIO;
-   if ((unsigned int) info->lo_encrypt_key_size > LO_KEY_SIZE)
-   return -EINVAL;
+   !capable(CAP_SYS_ADMIN)) {
+   err = -EPERM;
+   goto out_unlock;
+   }
+   if (lo->lo_state != Lo_bound) {
+   err = -ENXIO;
+   goto out_unlock;
+   }
+   if ((unsigned int) info->lo_encrypt_key_size > LO_KEY_SIZE) {
+   err = -EINVAL;
+   goto out_unlock;
+   }
 
/* I/O need to be drained during transfer transition */
blk_mq_freeze_queue(lo->lo_queue);
 
err = loop_release_xfer(lo);
if (err)
-   goto exit;
+   goto out_unfreeze;
 
if (info->lo_encrypt_type) {
unsigned int type = info->lo_encrypt_type;
 
if (type >= MAX_LO_CRYPT) {
err = -EINVAL;
-   goto exit;
+   goto out_unfreeze;
}
xfer = xfer_funcs[type];
if (xfer == NULL) {
err = -EINVAL;
-   goto exit;
+   goto out_unfreeze;
}
} else
xfer = NULL;
 
err = loop_init_xfer(lo, xfer, info);
if (err)
-   goto exit;
+   goto out_unfreeze;
 
if (lo->lo_offset != info->lo_offset ||
lo->lo_sizelimit != info->lo_sizelimit) {
if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit)) {
err = -EFBIG;
-   goto exit;
+   goto out_unfreeze;
}
}
 
@@ -1213,7 +1222,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
/* update dio if lo_offset or transfer is changed */
__loop_update_dio(lo, lo->use_dio);
 
- exit:
+out_unfreeze:
blk_mq_unfreeze_queue(lo->lo_queue);
 
if (!err && (info->lo_flags & LO_FLAGS_PARTSCAN) &&
@@ -1222,6 +1231,8 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
lo->lo_disk->flags &= ~GENHD_FL_NO_PART_SCAN;
loop_reread_partitions(lo, lo->lo_device);
}
+out_unlock:
+   mutex_unlock(&loop_ctl_mutex);
 
return err;
 }
@@ -1468,12 +1479,8 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
case LOOP_SET_STATUS:
err = -EPERM;
if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
err = loop_set_status_old(lo,
(struct loop_info __user *)arg);
-   mutex_unlock(&loop_ctl_mutex);
}
break;
case LOOP_GET_STATUS:
@@ -1481,12 +1488,8 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
case LOOP_SET_STATUS64:
err = -EPERM;
if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
err = loop_set_status64(lo,
(struct loop_info64 __user *) arg);
-   mutex_unlock(&loop_ctl_mutex);
}
break;
case LOOP_GET_STATUS64:
@@ -1631,12 +1634,8 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
 
switch(cmd) {
case LOOP_SET_STATUS:
-   err = mutex_lock_killable(&loop_ctl_mutex);
-   if (!err) {
-   err = loop_set_status_compat(lo,
-                                (const struct compat_loop_info __user *)a

[PATCH 10/14] loop: Push loop_ctl_mutex down to loop_set_fd()

2018-09-27 Thread Jan Kara
Push lo_ctl_mutex down to loop_set_fd(). We will need this to be able to
call loop_reread_partitions() without lo_ctl_mutex.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 1cc29bd77d67..504e5ef07509 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -919,13 +919,17 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
if (!file)
goto out;
 
+   error = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (error)
+   goto out_putf;
+
error = -EBUSY;
if (lo->lo_state != Lo_unbound)
-   goto out_putf;
+   goto out_unlock;
 
error = loop_validate_file(file, bdev);
if (error)
-   goto out_putf;
+   goto out_unlock;
 
mapping = file->f_mapping;
inode = mapping->host;
@@ -937,10 +941,10 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
error = -EFBIG;
size = get_loop_size(lo, file);
if ((loff_t)(sector_t)size != size)
-   goto out_putf;
+   goto out_unlock;
error = loop_prepare_queue(lo);
if (error)
-   goto out_putf;
+   goto out_unlock;
 
error = 0;
 
@@ -979,11 +983,14 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 * put /dev/loopXX inode. Later in __loop_clr_fd() we bdput(bdev).
 */
bdgrab(bdev);
+   mutex_unlock(&loop_ctl_mutex);
return 0;
 
- out_putf:
+out_unlock:
+   mutex_unlock(&loop_ctl_mutex);
+out_putf:
fput(file);
- out:
+out:
/* This is safe: open() is still holding a reference. */
module_put(THIS_MODULE);
return error;
@@ -1461,12 +1468,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
 
switch (cmd) {
case LOOP_SET_FD:
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
-   err = loop_set_fd(lo, mode, bdev, arg);
-   mutex_unlock(&loop_ctl_mutex);
-   break;
+   return loop_set_fd(lo, mode, bdev, arg);
case LOOP_CHANGE_FD:
err = mutex_lock_killable_nested(_ctl_mutex, 1);
if (err)
-- 
2.16.4



[PATCH 12/14] loop: Move special partition reread handling in loop_clr_fd()

2018-09-27 Thread Jan Kara
The call of __blkdev_reread_part() from loop_reread_partitions() happens
only when we need to invalidate partitions from loop_release(). Thus
move the detection for this into loop_clr_fd() and simplify
loop_reread_partitions().

This makes loop_reread_partitions() safe to use without loop_ctl_mutex
because we use only lo->lo_number and lo->lo_file_name in case of error
for reporting purposes (thus possibly reporting outdated information is
not a big deal) and we are safe from 'lo' going away under us by
elevated lo->lo_refcnt.
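The lifetime argument can be illustrated with a minimal userspace sketch. The struct and function names below are invented, and C11 atomics stand in for the kernel's refcounting; this is not the driver code. As long as a caller holds an elevated reference, the object cannot be freed, so read-mostly identity fields may be read without any mutex.

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of a refcounted device (models lo->lo_refcnt,
 * lo->lo_number, lo->lo_file_name). */
struct dev {
	atomic_int refcnt;
	int number;
	char file_name[64];
};

static struct dev *dev_get(struct dev *d)
{
	atomic_fetch_add(&d->refcnt, 1);
	return d;
}

static void dev_put(struct dev *d)
{
	/* Free only when the last reference is dropped. */
	if (atomic_fetch_sub(&d->refcnt, 1) == 1)
		free(d);
}

int demo(void)
{
	struct dev *d = calloc(1, sizeof(*d));
	atomic_init(&d->refcnt, 1);
	d->number = 7;
	strcpy(d->file_name, "backing.img");

	dev_get(d);		/* caller's elevated reference */
	int n = d->number;	/* safe: the object cannot be freed here */
	dev_put(d);		/* drop the caller's reference */
	dev_put(d);		/* drop the original one; object is freed */
	return n;
}
```

The worst that can happen while reading without the mutex is seeing slightly stale values, which the commit message argues is acceptable for error reporting.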

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 33 +++--
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index b4ea862f14fd..cce1a32ae6d0 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -631,18 +631,7 @@ static void loop_reread_partitions(struct loop_device *lo,
 {
int rc;
 
-   /*
-* bd_mutex has been held already in release path, so don't
-* acquire it if this function is called in such case.
-*
-* If the reread partition isn't from release path, lo_refcnt
-* must be at least one and it can only become zero when the
-* current holder is released.
-*/
-   if (!atomic_read(&lo->lo_refcnt))
-   rc = __blkdev_reread_part(bdev);
-   else
-   rc = blkdev_reread_part(bdev);
+   rc = blkdev_reread_part(bdev);
if (rc)
pr_warn("%s: partition scan of loop%d (%s) failed (rc=%d)\n",
__func__, lo->lo_number, lo->lo_file_name, rc);
@@ -1096,8 +1085,24 @@ static int __loop_clr_fd(struct loop_device *lo)
module_put(THIS_MODULE);
blk_mq_unfreeze_queue(lo->lo_queue);
 
-   if (lo->lo_flags & LO_FLAGS_PARTSCAN && bdev)
-   loop_reread_partitions(lo, bdev);
+   if (lo->lo_flags & LO_FLAGS_PARTSCAN && bdev) {
+   /*
+* bd_mutex has been held already in release path, so don't
+* acquire it if this function is called in such case.
+*
+* If the reread partition isn't from release path, lo_refcnt
+* must be at least one and it can only become zero when the
+* current holder is released.
+*/
+   if (!atomic_read(&lo->lo_refcnt))
+   err = __blkdev_reread_part(bdev);
+   else
+   err = blkdev_reread_part(bdev);
+   pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
+   __func__, lo->lo_number, err);
+   /* Device is gone, no point in returning error */
+   err = 0;
+   }
lo->lo_flags = 0;
if (!part_shift)
lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
-- 
2.16.4



[PATCH 11/14] loop: Push loop_ctl_mutex down to loop_change_fd()

2018-09-27 Thread Jan Kara
Push loop_ctl_mutex down to loop_change_fd(). We will need this to be
able to call loop_reread_partitions() without loop_ctl_mutex.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 504e5ef07509..b4ea862f14fd 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -692,19 +692,22 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
struct file *file, *old_file;
int error;
 
+   error = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (error)
+   return error;
error = -ENXIO;
if (lo->lo_state != Lo_bound)
-   goto out;
+   goto out_unlock;
 
/* the loop device has to be read-only */
error = -EINVAL;
if (!(lo->lo_flags & LO_FLAGS_READ_ONLY))
-   goto out;
+   goto out_unlock;
 
error = -EBADF;
file = fget(arg);
if (!file)
-   goto out;
+   goto out_unlock;
 
error = loop_validate_file(file, bdev);
if (error)
@@ -731,11 +734,13 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
fput(old_file);
if (lo->lo_flags & LO_FLAGS_PARTSCAN)
loop_reread_partitions(lo, bdev);
+   mutex_unlock(&loop_ctl_mutex);
return 0;
 
- out_putf:
+out_putf:
fput(file);
- out:
+out_unlock:
+   mutex_unlock(&loop_ctl_mutex);
return error;
 }
 
@@ -1470,12 +1475,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
case LOOP_SET_FD:
return loop_set_fd(lo, mode, bdev, arg);
case LOOP_CHANGE_FD:
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
-   err = loop_change_fd(lo, bdev, arg);
-   mutex_unlock(&loop_ctl_mutex);
-   break;
+   return loop_change_fd(lo, bdev, arg);
case LOOP_CLR_FD:
return loop_clr_fd(lo);
case LOOP_SET_STATUS:
-- 
2.16.4



[PATCH 05/14] loop: Push lo_ctl_mutex down into individual ioctls

2018-09-27 Thread Jan Kara
Push acquisition of lo_ctl_mutex down into individual ioctl handling
branches. This is a preparatory step for pushing the lock down into
individual ioctl handling functions so that they can release the lock as
they need it. We also factor out some simple ioctl handlers that will
not need any special handling to reduce unnecessary code duplication.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 88 +---
 1 file changed, 63 insertions(+), 25 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index e35707fb8318..a86ef20c15e2 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1394,70 +1394,108 @@ static int loop_set_block_size(struct loop_device *lo, unsigned long arg)
return 0;
 }
 
-static int lo_ioctl(struct block_device *bdev, fmode_t mode,
-   unsigned int cmd, unsigned long arg)
+static int lo_simple_ioctl(struct loop_device *lo, unsigned int cmd,
+  unsigned long arg)
 {
-   struct loop_device *lo = bdev->bd_disk->private_data;
int err;
 
err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
if (err)
-   goto out_unlocked;
+   return err;
+   switch (cmd) {
+   case LOOP_SET_CAPACITY:
+   err = loop_set_capacity(lo);
+   break;
+   case LOOP_SET_DIRECT_IO:
+   err = loop_set_dio(lo, arg);
+   break;
+   case LOOP_SET_BLOCK_SIZE:
+   err = loop_set_block_size(lo, arg);
+   break;
+   default:
+   err = lo->ioctl ? lo->ioctl(lo, cmd, arg) : -EINVAL;
+   }
+   mutex_unlock(&loop_ctl_mutex);
+   return err;
+}
+
+static int lo_ioctl(struct block_device *bdev, fmode_t mode,
+   unsigned int cmd, unsigned long arg)
+{
+   struct loop_device *lo = bdev->bd_disk->private_data;
+   int err;
 
switch (cmd) {
case LOOP_SET_FD:
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
err = loop_set_fd(lo, mode, bdev, arg);
+   mutex_unlock(&loop_ctl_mutex);
break;
case LOOP_CHANGE_FD:
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
err = loop_change_fd(lo, bdev, arg);
+   mutex_unlock(&loop_ctl_mutex);
break;
case LOOP_CLR_FD:
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
/* loop_clr_fd would have unlocked loop_ctl_mutex on success */
err = loop_clr_fd(lo);
-   if (!err)
-   goto out_unlocked;
+   if (err)
+   mutex_unlock(&loop_ctl_mutex);
break;
case LOOP_SET_STATUS:
err = -EPERM;
-   if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
+   if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
err = loop_set_status_old(lo,
(struct loop_info __user *)arg);
+   mutex_unlock(&loop_ctl_mutex);
+   }
break;
case LOOP_GET_STATUS:
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
err = loop_get_status_old(lo, (struct loop_info __user *) arg);
/* loop_get_status() unlocks loop_ctl_mutex */
-   goto out_unlocked;
+   break;
case LOOP_SET_STATUS64:
err = -EPERM;
-   if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
+   if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
err = loop_set_status64(lo,
(struct loop_info64 __user *) arg);
+   mutex_unlock(&loop_ctl_mutex);
+   }
break;
case LOOP_GET_STATUS64:
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (err)
+   return err;
err = loop_get_status64(lo, (struct loop_info64 __user *) arg);
/* loop_get_status() unlocks loop_ctl_mutex */
-   goto out_unlocked;
-   case LOOP_SET_CAPACITY:
-   err = -EPERM;
-   if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
-   err = loop_set_capacity(lo);
break;
+ 

[PATCH 0/14] loop: Fix oops and possible deadlocks

2018-09-27 Thread Jan Kara
Hi,

this patch series fixes an oops and possible deadlocks as reported by syzbot [1]
[2]. The second patch in the series (from Tetsuo) fixes the oops, the remaining
patches are cleaning up the locking in the loop driver so that we can in the
end reasonably easily switch to rereading partitions without holding mutex
protecting the loop device.

I have lightly tested the patches by creating, deleting, and modifying loop
devices but if there's some more comprehensive loopback device testsuite, I
can try running it. Review is welcome!

Honza

[1] https://syzkaller.appspot.com/bug?id=f3cfe26e785d85f9ee259f385515291d21bd80a3
[2] https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889



[PATCH 08/14] loop: Push loop_ctl_mutex down to loop_get_status()

2018-09-27 Thread Jan Kara
Push loop_ctl_mutex down to loop_get_status() to avoid the unusual
convention that the function gets called with loop_ctl_mutex held and
releases it.

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 37 ++---
 1 file changed, 10 insertions(+), 27 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index e4b82ca49286..577d5e5a9312 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1233,6 +1233,9 @@ loop_get_status(struct loop_device *lo, struct loop_info64 *info)
struct kstat stat;
int ret;
 
+   ret = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
+   if (ret)
+   return ret;
if (lo->lo_state != Lo_bound) {
mutex_unlock(&loop_ctl_mutex);
return -ENXIO;
@@ -1347,10 +1350,8 @@ loop_get_status_old(struct loop_device *lo, struct loop_info __user *arg) {
struct loop_info64 info64;
int err;
 
-   if (!arg) {
-   mutex_unlock(&loop_ctl_mutex);
+   if (!arg)
return -EINVAL;
-   }
err = loop_get_status(lo, &info64);
if (!err)
err = loop_info64_to_old(&info64, &info);
@@ -1365,10 +1366,8 @@ loop_get_status64(struct loop_device *lo, struct loop_info64 __user *arg) {
struct loop_info64 info64;
int err;
 
-   if (!arg) {
-   mutex_unlock(&loop_ctl_mutex);
+   if (!arg)
return -EINVAL;
-   }
err = loop_get_status(lo, &info64);
if (!err && copy_to_user(arg, &info64, sizeof(info64)))
err = -EFAULT;
@@ -1478,12 +1477,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
}
break;
case LOOP_GET_STATUS:
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
-   err = loop_get_status_old(lo, (struct loop_info __user *) arg);
-   /* loop_get_status() unlocks loop_ctl_mutex */
-   break;
+   return loop_get_status_old(lo, (struct loop_info __user *) arg);
case LOOP_SET_STATUS64:
err = -EPERM;
if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
@@ -1496,12 +1490,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
}
break;
case LOOP_GET_STATUS64:
-   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
-   if (err)
-   return err;
-   err = loop_get_status64(lo, (struct loop_info64 __user *) arg);
-   /* loop_get_status() unlocks loop_ctl_mutex */
-   break;
+   return loop_get_status64(lo, (struct loop_info64 __user *) arg);
case LOOP_SET_CAPACITY:
case LOOP_SET_DIRECT_IO:
case LOOP_SET_BLOCK_SIZE:
@@ -1626,10 +1615,8 @@ loop_get_status_compat(struct loop_device *lo,
struct loop_info64 info64;
int err;
 
-   if (!arg) {
-   mutex_unlock(&loop_ctl_mutex);
+   if (!arg)
return -EINVAL;
-   }
err = loop_get_status(lo, &info64);
if (!err)
err = loop_info64_to_compat(&info64, arg);
@@ -1652,12 +1639,8 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
}
break;
case LOOP_GET_STATUS:
-   err = mutex_lock_killable(&loop_ctl_mutex);
-   if (!err) {
-   err = loop_get_status_compat(lo,
-                                (struct compat_loop_info __user *)arg);
-   /* loop_get_status() unlocks loop_ctl_mutex */
-   }
+   err = loop_get_status_compat(lo,
+                                (struct compat_loop_info __user *)arg);
break;
case LOOP_SET_CAPACITY:
case LOOP_CLR_FD:
-- 
2.16.4



[PATCH 06/14] loop: Split setting of lo_state from loop_clr_fd

2018-09-27 Thread Jan Kara
Move setting of lo_state to Lo_rundown out into the callers. That will
allow us to unlock loop_ctl_mutex while the loop device is protected
from other changes by its special state.
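A minimal sketch of that idea, with invented names (set_status, clr_fd, a plain enum instead of the driver's Lo_* states; this is an illustration, not the driver code): once the teardown path has marked the device RUNDOWN under the mutex, concurrent operations refuse to touch it, so the mutex can then be dropped for the slow part of the teardown.

```c
#include <pthread.h>

/* Hypothetical reduction of the Lo_bound -> Lo_rundown transition. */
enum state { UNBOUND, BOUND, RUNDOWN };

static pthread_mutex_t ctl_mutex = PTHREAD_MUTEX_INITIALIZER;
static enum state dev_state = BOUND;

/* A configuration operation: only legal while the device is BOUND. */
int set_status(void)
{
	int err = 0;

	pthread_mutex_lock(&ctl_mutex);
	if (dev_state != BOUND)
		err = -1; /* would be -ENXIO in the real driver */
	pthread_mutex_unlock(&ctl_mutex);
	return err;
}

/* Teardown: flip the state under the lock, then work unlocked. */
int clr_fd(void)
{
	pthread_mutex_lock(&ctl_mutex);
	dev_state = RUNDOWN; /* guards the device from here on */
	pthread_mutex_unlock(&ctl_mutex);

	/* ... long teardown runs without ctl_mutex held ... */
	return set_status(); /* any concurrent config op now fails */
}
```

The state field acts as a lightweight guard: it is cheap to test under the mutex, yet it protects the device across a window where the mutex is not held.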

Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 52 +++-
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index a86ef20c15e2..51d11898e170 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -976,7 +976,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
loop_reread_partitions(lo, bdev);
 
/* Grab the block_device to prevent its destruction after we
-* put /dev/loopXX inode. Later in loop_clr_fd() we bdput(bdev).
+* put /dev/loopXX inode. Later in __loop_clr_fd() we bdput(bdev).
 */
bdgrab(bdev);
return 0;
@@ -1026,31 +1026,15 @@ loop_init_xfer(struct loop_device *lo, struct loop_func_table *xfer,
return err;
 }
 
-static int loop_clr_fd(struct loop_device *lo)
+static int __loop_clr_fd(struct loop_device *lo)
 {
struct file *filp = lo->lo_backing_file;
gfp_t gfp = lo->old_gfp_mask;
struct block_device *bdev = lo->lo_device;
 
-   if (lo->lo_state != Lo_bound)
+   if (WARN_ON_ONCE(lo->lo_state != Lo_rundown))
return -ENXIO;
 
-   /*
-* If we've explicitly asked to tear down the loop device,
-* and it has an elevated reference count, set it for auto-teardown when
-* the last reference goes away. This stops $!~#$@ udev from
-* preventing teardown because it decided that it needs to run blkid on
-* the loopback device whenever they appear. xfstests is notorious for
-* failing tests because blkid via udev races with a losetup
-* <dev>/do something like mkfs/losetup -d <dev> causing the losetup -d
-* command to fail with EBUSY.
-*/
-   if (atomic_read(&lo->lo_refcnt) > 1) {
-   lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
-   mutex_unlock(&loop_ctl_mutex);
-   return 0;
-   }
-
if (filp == NULL)
return -EINVAL;
 
@@ -1058,7 +1042,6 @@ static int loop_clr_fd(struct loop_device *lo)
blk_mq_freeze_queue(lo->lo_queue);
 
spin_lock_irq(&lo->lo_lock);
-   lo->lo_state = Lo_rundown;
lo->lo_backing_file = NULL;
spin_unlock_irq(&lo->lo_lock);
 
@@ -,6 +1094,30 @@ static int loop_clr_fd(struct loop_device *lo)
return 0;
 }
 
+static int loop_clr_fd(struct loop_device *lo)
+{
+   if (lo->lo_state != Lo_bound)
+   return -ENXIO;
+   /*
+* If we've explicitly asked to tear down the loop device,
+* and it has an elevated reference count, set it for auto-teardown when
+* the last reference goes away. This stops $!~#$@ udev from
+* preventing teardown because it decided that it needs to run blkid on
+* the loopback device whenever they appear. xfstests is notorious for
+* failing tests because blkid via udev races with a losetup
+   * <dev>/do something like mkfs/losetup -d <dev> causing the losetup -d
+* command to fail with EBUSY.
+*/
+   if (atomic_read(&lo->lo_refcnt) > 1) {
+   lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
+   mutex_unlock(&loop_ctl_mutex);
+   return 0;
+   }
+   lo->lo_state = Lo_rundown;
+
+   return __loop_clr_fd(lo);
+}
+
 static int
 loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 {
@@ -1692,11 +1699,14 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
goto out_unlock;
 
if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
+   if (lo->lo_state != Lo_bound)
+   goto out_unlock;
+   lo->lo_state = Lo_rundown;
/*
 * In autoclear mode, stop the loop thread
 * and remove configuration after last close.
 */
-   err = loop_clr_fd(lo);
+   err = __loop_clr_fd(lo);
if (!err)
return;
} else if (lo->lo_state == Lo_bound) {
-- 
2.16.4



[PATCH 02/14] block/loop: Use global lock for ioctl() operation.

2018-09-27 Thread Jan Kara
From: Tetsuo Handa 

syzbot is reporting a NULL pointer dereference [1] caused by a race
condition between ioctl(loop_fd, LOOP_CLR_FD, 0) and
ioctl(other_loop_fd, LOOP_SET_FD, loop_fd) due to traversing other
loop devices at loop_validate_file() without holding corresponding
lo->lo_ctl_mutex locks.

Since ioctl() requests on loop devices are not frequent operations, we don't
need fine-grained locking. Let's use a global lock in order to allow safe
traversal at loop_validate_file().

Note that syzbot is also reporting circular locking dependency between
bdev->bd_mutex and lo->lo_ctl_mutex [2] which is caused by calling
blkdev_reread_part() with lock held. This patch does not address it.

[1] https://syzkaller.appspot.com/bug?id=f3cfe26e785d85f9ee259f385515291d21bd80a3
[2] https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889
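Why a single global lock is the right tool here can be sketched as follows (hypothetical struct loopdev and validate_chain, loosely modelled on loop_validate_file(); not the driver code): validating a backing file may require walking a chain of other loop devices, and a per-device mutex can only protect its own device, whereas one global mutex covers the entire traversal.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical global lock, like loop_ctl_mutex after this patch. */
static pthread_mutex_t loop_ctl_mutex = PTHREAD_MUTEX_INITIALIZER;

struct loopdev {
	struct loopdev *backing; /* another loop device, or NULL */
};

/* Walks the whole backing chain. Safe only because every state change
 * to any device in the chain also takes loop_ctl_mutex. */
static int validate_chain(struct loopdev *d, int max_depth)
{
	int depth = 0;

	pthread_mutex_lock(&loop_ctl_mutex);
	for (; d; d = d->backing)
		if (++depth > max_depth)
			break; /* cycle or over-deep stacking */
	pthread_mutex_unlock(&loop_ctl_mutex);
	return depth;
}

int demo(void)
{
	struct loopdev a = { NULL }, b = { &a }, c = { &b };
	return validate_chain(&c, 8); /* chain c -> b -> a */
}
```

With per-device locks, the walker would have to lock devices it does not own while they might be reconfigured concurrently, which is exactly the race syzbot hit.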

Signed-off-by: Tetsuo Handa 
Reported-by: syzbot 
Reviewed-by: Jan Kara 
Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 58 ++--
 drivers/block/loop.h |  1 -
 2 files changed, 29 insertions(+), 30 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 50c81ff44ae2..d0f1b7106572 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -84,6 +84,7 @@
 
 static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_index_mutex);
+static DEFINE_MUTEX(loop_ctl_mutex);
 
 static int max_part;
 static int part_shift;
@@ -1047,7 +1048,7 @@ static int loop_clr_fd(struct loop_device *lo)
 */
if (atomic_read(&lo->lo_refcnt) > 1) {
lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return 0;
}
 
@@ -1100,12 +1101,12 @@ static int loop_clr_fd(struct loop_device *lo)
if (!part_shift)
lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
loop_unprepare_queue(lo);
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
/*
-* Need not hold lo_ctl_mutex to fput backing file.
-* Calling fput holding lo_ctl_mutex triggers a circular
+* Need not hold loop_ctl_mutex to fput backing file.
+* Calling fput holding loop_ctl_mutex triggers a circular
 * lock dependency possibility warning as fput can take
-* bd_mutex which is usually taken before lo_ctl_mutex.
+* bd_mutex which is usually taken before loop_ctl_mutex.
 */
fput(filp);
return 0;
@@ -1210,7 +1211,7 @@ loop_get_status(struct loop_device *lo, struct loop_info64 *info)
int ret;
 
if (lo->lo_state != Lo_bound) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -ENXIO;
}
 
@@ -1229,10 +1230,10 @@ loop_get_status(struct loop_device *lo, struct loop_info64 *info)
   lo->lo_encrypt_key_size);
}
 
-   /* Drop lo_ctl_mutex while we call into the filesystem. */
+   /* Drop loop_ctl_mutex while we call into the filesystem. */
path = lo->lo_backing_file->f_path;
+   path_get(&path);
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
ret = vfs_getattr(&path, &stat, STATX_INO, AT_STATX_SYNC_AS_STAT);
if (!ret) {
info->lo_device = huge_encode_dev(stat.dev);
@@ -1324,7 +1325,7 @@ loop_get_status_old(struct loop_device *lo, struct loop_info __user *arg) {
int err;
 
if (!arg) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -EINVAL;
}
err = loop_get_status(lo, &info64);
@@ -1342,7 +1343,7 @@ loop_get_status64(struct loop_device *lo, struct loop_info64 __user *arg) {
int err;
 
if (!arg) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -EINVAL;
}
err = loop_get_status(lo, &info64);
@@ -1400,7 +1401,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
struct loop_device *lo = bdev->bd_disk->private_data;
int err;
 
-   err = mutex_lock_killable_nested(&lo->lo_ctl_mutex, 1);
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
if (err)
goto out_unlocked;
 
@@ -1412,7 +1413,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
err = loop_change_fd(lo, bdev, arg);
break;
case LOOP_CLR_FD:
-   /* loop_clr_fd would have unlocked lo_ctl_mutex on success */
+   /* loop_clr_fd would have unlocked loop_ctl_mutex on success */
err = loop_clr_fd(lo);
if (!err)
goto out_unlocked;
@@ -1425,7 +1426,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
break;
case LOOP_GET_STATUS:
   

[PATCH 01/14] block/loop: Don't grab "struct file" for vfs_getattr() operation.

2018-09-27 Thread Jan Kara
From: Tetsuo Handa 

vfs_getattr() needs "struct path" rather than "struct file".
Let's use path_get()/path_put() rather than get_file()/fput().

Signed-off-by: Tetsuo Handa 
Reviewed-by: Jan Kara 
Signed-off-by: Jan Kara 
---
 drivers/block/loop.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index ea9debf59b22..50c81ff44ae2 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1205,7 +1205,7 @@ loop_set_status(struct loop_device *lo, const struct 
loop_info64 *info)
 static int
 loop_get_status(struct loop_device *lo, struct loop_info64 *info)
 {
-   struct file *file;
+   struct path path;
struct kstat stat;
int ret;
 
@@ -1230,16 +1230,16 @@ loop_get_status(struct loop_device *lo, struct 
loop_info64 *info)
}
 
/* Drop lo_ctl_mutex while we call into the filesystem. */
-   file = get_file(lo->lo_backing_file);
+   path = lo->lo_backing_file->f_path;
+   path_get(&path);
mutex_unlock(&lo->lo_ctl_mutex);
-   ret = vfs_getattr(&file->f_path, &stat, STATX_INO,
- AT_STATX_SYNC_AS_STAT);
+   ret = vfs_getattr(&path, &stat, STATX_INO, AT_STATX_SYNC_AS_STAT);
if (!ret) {
info->lo_device = huge_encode_dev(stat.dev);
info->lo_inode = stat.ino;
info->lo_rdevice = huge_encode_dev(stat.rdev);
}
-   fput(file);
+   path_put(&path);
return ret;
 }
 
-- 
2.16.4



Re: [PATCH 4/4] block/loop: Fix circular locking dependency at blkdev_reread_part().

2018-09-27 Thread Jan Kara
 * Otherwise keep thread (if running) and config,
> @@ -1676,15 +1695,13 @@ static void __lo_release(struct loop_device *lo)
>   blk_mq_freeze_queue(lo->lo_queue);
>   blk_mq_unfreeze_queue(lo->lo_queue);
>   }
> -
> - mutex_unlock(&loop_ctl_mutex);
>  }
>  
>  static void lo_release(struct gendisk *disk, fmode_t mode)
>  {
> - mutex_lock(&loop_index_mutex);
> + lock_loop();
>   __lo_release(disk->private_data);
> - mutex_unlock(&loop_index_mutex);
> + unlock_loop();
>  }
>  
>  static const struct block_device_operations lo_fops = {
> @@ -1723,10 +1740,8 @@ static int unregister_transfer_cb(int id, void *ptr, 
> void *data)
>   struct loop_device *lo = ptr;
>   struct loop_func_table *xfer = data;
>  
> - mutex_lock(&loop_ctl_mutex);
>   if (lo->lo_encryption == xfer)
>   loop_release_xfer(lo);
> - mutex_unlock(&loop_ctl_mutex);
>   return 0;
>  }
>  
> @@ -1738,8 +1753,14 @@ int loop_unregister_transfer(int number)
>   if (n == 0 || n >= MAX_LO_CRYPT || (xfer = xfer_funcs[n]) == NULL)
>   return -EINVAL;
>  
> + /*
> +  * cleanup_cryptoloop() cannot handle errors because it is called
> +  * from module_exit(). Thus, don't give up upon SIGKILL here.
> +  */
> + lock_loop();
>   xfer_funcs[n] = NULL;
>   idr_for_each(&loop_index_idr, &unregister_transfer_cb, xfer);
> + unlock_loop();
>   return 0;
>  }
>  
> @@ -1982,20 +2003,18 @@ static int loop_lookup(struct loop_device **l, int i)
>  static struct kobject *loop_probe(dev_t dev, int *part, void *data)
>  {
>   struct loop_device *lo;
> - struct kobject *kobj;
> - int err;
> + struct kobject *kobj = NULL;
> + int err = lock_loop_killable();
>  
> - mutex_lock(&loop_index_mutex);
> + *part = 0;
> + if (err)
> + return NULL;
>   err = loop_lookup(&lo, MINOR(dev) >> part_shift);
>   if (err < 0)
>   err = loop_add(&lo, MINOR(dev) >> part_shift);
> - if (err < 0)
> - kobj = NULL;
> - else
> + if (err >= 0)
>   kobj = get_disk_and_module(lo->lo_disk);
> - mutex_unlock(&loop_index_mutex);
> -
> - *part = 0;
> + unlock_loop();
>   return kobj;
>  }
>  
> @@ -2003,9 +2022,11 @@ static long loop_control_ioctl(struct file *file, 
> unsigned int cmd,
>  unsigned long parm)
>  {
>   struct loop_device *lo;
> - int ret = -ENOSYS;
> + int ret = lock_loop_killable();
>  
> - mutex_lock(&loop_index_mutex);
> + if (ret)
> + return ret;
> + ret = -ENOSYS;
>   switch (cmd) {
>   case LOOP_CTL_ADD:
>   ret = loop_lookup(&lo, parm);
> @@ -2019,21 +2040,15 @@ static long loop_control_ioctl(struct file *file, 
> unsigned int cmd,
>   ret = loop_lookup(&lo, parm);
>   if (ret < 0)
>   break;
> - ret = mutex_lock_killable(&loop_ctl_mutex);
> - if (ret)
> - break;
>   if (lo->lo_state != Lo_unbound) {
>   ret = -EBUSY;
> - mutex_unlock(&loop_ctl_mutex);
>   break;
>   }
>   if (atomic_read(&lo->lo_refcnt) > 0) {
>   ret = -EBUSY;
> - mutex_unlock(&loop_ctl_mutex);
>   break;
>   }
>   lo->lo_disk->private_data = NULL;
> - mutex_unlock(&loop_ctl_mutex);
>   idr_remove(&loop_index_idr, lo->lo_number);
>   loop_remove(lo);
>   break;
> @@ -2043,7 +2058,7 @@ static long loop_control_ioctl(struct file *file, 
> unsigned int cmd,
>   break;
>   ret = loop_add(&lo, -1);
>   }
> - mutex_unlock(&loop_index_mutex);
> + unlock_loop();
>  
>   return ret;
>  }
> @@ -2127,10 +2142,10 @@ static int __init loop_init(void)
> THIS_MODULE, loop_probe, NULL, NULL);
>  
>   /* pre-create number of devices given by config or max_loop */
> - mutex_lock(&loop_index_mutex);
> + lock_loop();
>   for (i = 0; i < nr; i++)
>   loop_add(&lo, i);
> - mutex_unlock(&loop_index_mutex);
> + unlock_loop();
>  
>   printk(KERN_INFO "loop: module loaded\n");
>   return 0;
> -- 
> 1.8.3.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] block/loop: Don't hold lock while rereading partition.

2018-09-25 Thread Jan Kara
On Tue 25-09-18 14:10:03, Tetsuo Handa wrote:
> syzbot is reporting circular locking dependency between bdev->bd_mutex and
> lo->lo_ctl_mutex [1] which is caused by calling blkdev_reread_part() with
> lock held. Don't hold loop_ctl_mutex while calling blkdev_reread_part().
> Also, bring bdgrab() at loop_set_fd() to before loop_reread_partitions()
> in case loop_clr_fd() is called while blkdev_reread_part() from
> loop_set_fd() is in progress.
> 
> [1] 
> https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889
> 
> Signed-off-by: Tetsuo Handa 
> Reported-by: syzbot 
> 

Thank you for splitting out this patch. Some comments below.

> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 920cbb1..877cca8 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -632,7 +632,12 @@ static void loop_reread_partitions(struct loop_device 
> *lo,
>  struct block_device *bdev)
>  {
>   int rc;
> + char filename[LO_NAME_SIZE];
> + const int num = lo->lo_number;
> + const int count = atomic_read(&lo->lo_refcnt);
>  
> + memcpy(filename, lo->lo_file_name, sizeof(filename));
> + mutex_unlock(&loop_ctl_mutex);
>   /*
>* bd_mutex has been held already in release path, so don't
>* acquire it if this function is called in such case.
> @@ -641,13 +646,14 @@ static void loop_reread_partitions(struct loop_device 
> *lo,
>* must be at least one and it can only become zero when the
>* current holder is released.
>*/
> - if (!atomic_read(&lo->lo_refcnt))
> + if (!count)
>   rc = __blkdev_reread_part(bdev);
>   else
>   rc = blkdev_reread_part(bdev);
> + mutex_lock(&loop_ctl_mutex);
>   if (rc)
>   pr_warn("%s: partition scan of loop%d (%s) failed (rc=%d)\n",
> - __func__, lo->lo_number, lo->lo_file_name, rc);
> + __func__, num, filename, rc);
>  }

I still don't quite like this. It is non-trivial to argue that the
temporary dropping of loop_ctl_mutex in loop_reread_partitions() is OK for
all its callers. I'm really strongly in favor of unlocking the mutex in
the callers of loop_reread_partitions() and reorganizing the code there so
that loop_reread_partitions() is called as late as possible so that it is
clear that dropping the mutex is fine (and usually we don't even have to
reacquire it). Plus your patch does not seem to take care of the possible
races of loop_clr_fd() with LOOP_CTL_REMOVE? See my other mail for
details...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] block/loop: Don't grab "struct file" for vfs_getattr() operation.

2018-09-25 Thread Jan Kara
On Tue 25-09-18 12:51:01, Tetsuo Handa wrote:
> vfs_getattr() needs "struct path" rather than "struct file".
> Let's use path_get()/path_put() rather than get_file()/fput().
> 
> Signed-off-by: Tetsuo Handa 

Thanks for splitting off the cleanup. The patch looks good to me. Feel free
to add:

Reviewed-by: Jan Kara 

Honza

> ---
>  drivers/block/loop.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index abad6d1..c2745e6 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1206,7 +1206,7 @@ static int loop_clr_fd(struct loop_device *lo)
>  static int
>  loop_get_status(struct loop_device *lo, struct loop_info64 *info)
>  {
> - struct file *file;
> + struct path path;
>   struct kstat stat;
>   int ret;
>  
> @@ -1231,16 +1231,16 @@ static int loop_clr_fd(struct loop_device *lo)
>   }
>  
>   /* Drop lo_ctl_mutex while we call into the filesystem. */
> - file = get_file(lo->lo_backing_file);
> + path = lo->lo_backing_file->f_path;
> + path_get(&path);
>   mutex_unlock(&lo->lo_ctl_mutex);
> - ret = vfs_getattr(&file->f_path, &stat, STATX_INO,
> -   AT_STATX_SYNC_AS_STAT);
> + ret = vfs_getattr(&path, &stat, STATX_INO, AT_STATX_SYNC_AS_STAT);
>   if (!ret) {
>   info->lo_device = huge_encode_dev(stat.dev);
>   info->lo_inode = stat.ino;
>   info->lo_rdevice = huge_encode_dev(stat.rdev);
>   }
> - fput(file);
> + path_put(&path);
>   return ret;
>  }
>  
> -- 
> 1.8.3.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v5 3/3] block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs

2018-08-22 Thread Jan Kara
On Wed 22-08-18 18:50:53, Ming Lei wrote:
> On Wed, Aug 22, 2018 at 12:33:05PM +0200, Jan Kara wrote:
> > On Wed 22-08-18 10:02:49, Martin Wilck wrote:
> > > On Mon, 2018-07-30 at 20:37 +0800, Ming Lei wrote:
> > > > On Wed, Jul 25, 2018 at 11:15:09PM +0200, Martin Wilck wrote:
> > > > > 
> > > > > +/**
> > > > > + * bio_iov_iter_get_pages - pin user or kernel pages and add them
> > > > > to a bio
> > > > > + * @bio: bio to add pages to
> > > > > + * @iter: iov iterator describing the region to be mapped
> > > > > + *
> > > > > + * Pins pages from *iter and appends them to @bio's bvec array.
> > > > > The
> > > > > + * pages will have to be released using put_page() when done.
> > > > > + * The function tries, but does not guarantee, to pin as many
> > > > > pages as
> > > > > + * fit into the bio, or are requested in *iter, whatever is
> > > > > smaller.
> > > > > + * If MM encounters an error pinning the requested pages, it
> > > > > stops.
> > > > > + * Error is returned only if 0 pages could be pinned.
> > > > > + */
> > > > > +int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> > > > > +{
> > > > > + unsigned short orig_vcnt = bio->bi_vcnt;
> > > > > +
> > > > > + do {
> > > > > + int ret = __bio_iov_iter_get_pages(bio, iter);
> > > > > +
> > > > > + if (unlikely(ret))
> > > > > + return bio->bi_vcnt > orig_vcnt ? 0 : ret;
> > > > > +
> > > > > + } while (iov_iter_count(iter) && !bio_full(bio));
> > > > 
> > > > When 'ret' isn't zero, and some partial progress has been made, seems
> > > > less pages
> > > > might be obtained than requested too. Is that something we need to
> > > > worry about?
> > > 
> > > This would be the case when VM isn't willing or able to fulfill the
> > > page-pinning request. Previously, we came to the conclusion that VM has
> > > the right to do so. This is the reason why callers have to check the
> > > number of pages allocated, and either loop over
> > > bio_iov_iter_get_pages(), or fall back to buffered I/O, until all pages
> > > have been obtained. All callers except the blockdev fast path do the
> > > former. 
> > > 
> > > We could add looping in __blkdev_direct_IO_simple() on top of the
> > > current patch set, to avoid fallback to buffered IO in this corner
> > > case. Should we? If yes, only for WRITEs, or for READs as well?
> > > 
> > > I haven't encountered this situation in my tests, and I'm unsure how to
> > > provoke it - run a direct IO test under high memory pressure?
> > 
> > Currently, iov_iter_get_pages() is always guaranteed to get at least one
> > page as that is current guarantee of get_user_pages() (unless we hit
> > EFAULT obviously). So bio_iov_iter_get_pages() as is now is guaranteed to
> 
> Is it possible for this EFAULT to happen on the user-space VM?

Certainly if the user passes bogus address...

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v5 3/3] block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs

2018-08-22 Thread Jan Kara
On Wed 22-08-18 10:02:49, Martin Wilck wrote:
> On Mon, 2018-07-30 at 20:37 +0800, Ming Lei wrote:
> > On Wed, Jul 25, 2018 at 11:15:09PM +0200, Martin Wilck wrote:
> > > 
> > > +/**
> > > + * bio_iov_iter_get_pages - pin user or kernel pages and add them
> > > to a bio
> > > + * @bio: bio to add pages to
> > > + * @iter: iov iterator describing the region to be mapped
> > > + *
> > > + * Pins pages from *iter and appends them to @bio's bvec array.
> > > The
> > > + * pages will have to be released using put_page() when done.
> > > + * The function tries, but does not guarantee, to pin as many
> > > pages as
> > > + * fit into the bio, or are requested in *iter, whatever is
> > > smaller.
> > > + * If MM encounters an error pinning the requested pages, it
> > > stops.
> > > + * Error is returned only if 0 pages could be pinned.
> > > + */
> > > +int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> > > +{
> > > + unsigned short orig_vcnt = bio->bi_vcnt;
> > > +
> > > + do {
> > > + int ret = __bio_iov_iter_get_pages(bio, iter);
> > > +
> > > + if (unlikely(ret))
> > > + return bio->bi_vcnt > orig_vcnt ? 0 : ret;
> > > +
> > > + } while (iov_iter_count(iter) && !bio_full(bio));
> > 
> > When 'ret' isn't zero, and some partial progress has been made, seems
> > less pages
> > might be obtained than requested too. Is that something we need to
> > worry about?
> 
> This would be the case when VM isn't willing or able to fulfill the
> page-pinning request. Previously, we came to the conclusion that VM has
> the right to do so. This is the reason why callers have to check the
> number of pages allocated, and either loop over
> bio_iov_iter_get_pages(), or fall back to buffered I/O, until all pages
> have been obtained. All callers except the blockdev fast path do the
> former. 
> 
> We could add looping in __blkdev_direct_IO_simple() on top of the
> current patch set, to avoid fallback to buffered IO in this corner
> case. Should we? If yes, only for WRITEs, or for READs as well?
> 
> I haven't encountered this situation in my tests, and I'm unsure how to
> provoke it - run a direct IO test under high memory pressure?

Currently, iov_iter_get_pages() is always guaranteed to get at least one
page as that is current guarantee of get_user_pages() (unless we hit
EFAULT obviously). So bio_iov_iter_get_pages() as is now is guaranteed to
exhaust 'iter' or fill 'bio'. But in the future, the guarantee that
get_user_pages() will always pin at least one page may go away. But we'd
have to audit all users at that time anyway.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Jan Kara
On Thu 19-07-18 20:20:51, Ming Lei wrote:
> On Thu, Jul 19, 2018 at 01:56:16PM +0200, Jan Kara wrote:
> > On Thu 19-07-18 19:04:46, Ming Lei wrote:
> > > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > > > segment of the input iov_iter's iovec. This may be much less than the 
> > > > number
> > > > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > > > bio_iov_iter_get_pages() repeatedly until either the requested number
> > > > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> > > > short writes or reads may occur for direct synchronous IOs with multiple
> > > > iovec slots (such as generated by writev()). In that case,
> > > > __generic_file_write_iter() falls back to buffered writes, which
> > > > has been observed to cause data corruption in certain workloads.
> > > > 
> > > > Note: if segments aren't page-aligned in the input iovec, this patch may
> > > > result in multiple adjacent slots of the bi_io_vec array to reference 
> > > > the same
> > > > page (the byte ranges are guaranteed to be disjunct if the preceding 
> > > > patch is
> > > > applied). We haven't seen problems with that in our and the customer's
> > > > tests. It'd be possible to detect this situation and merge bi_io_vec 
> > > > slots
> > > > that refer to the same page, but I prefer to keep it simple for now.
> > > > 
> > > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for 
> > > > simplified bdev direct-io")
> > > > Signed-off-by: Martin Wilck 
> > > > ---
> > > >  fs/block_dev.c | 8 +++-
> > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > > index 0dd87aa..41643c4 100644
> > > > --- a/fs/block_dev.c
> > > > +++ b/fs/block_dev.c
> > > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, 
> > > > struct iov_iter *iter,
> > > >  
> > > > > ret = bio_iov_iter_get_pages(&bio, iter);
> > > > if (unlikely(ret))
> > > > -   return ret;
> > > > +   goto out;
> > > > +
> > > > +   while (ret == 0 &&
> > > > +  bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 
> > > > 0)
> > > > > +   ret = bio_iov_iter_get_pages(&bio, iter);
> > > > +
> > > > ret = bio.bi_iter.bi_size;
> > > >  
> > > > if (iov_iter_rw(iter) == READ) {
> > > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, 
> > > > struct iov_iter *iter,
> > > > put_page(bvec->bv_page);
> > > > }
> > > >  
> > > > +out:
> > > > if (vecs != inline_vecs)
> > > > kfree(vecs);
> > > >
> > > 
> > > You might put the 'vecs' leak fix into another patch, and resue the
> > > current code block for that.
> > > 
> > > Looks all users of bio_iov_iter_get_pages() need this kind of fix, so
> > > what do you think about the following way?
> > 
> > No. AFAICT all the other users of bio_iov_iter_get_pages() are perfectly
> > fine with it returning less pages and they loop appropriately.
> 
> OK, but this way still may make one bio to hold more data, especially
> the comment of bio_iov_iter_get_pages() says that 'Pins as many pages from
> *iter', so looks it is the correct way to do.

Well, but as Al pointed out MM may decide that user has already pinned too
many pages and refuse to pin more. So pinning full iter worth of pages may
just be impossible. Currently, there are no checks like this in MM but
eventually, I'd like to account pinned pages in mlock ulimit (or a limit of
similar kind) and then what Al speaks about would become very real. So I'd
prefer to not develop new locations that must be able to pin arbitrary
amount of pages.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Jan Kara
On Thu 19-07-18 14:23:53, Martin Wilck wrote:
> On Thu, 2018-07-19 at 12:45 +0200, Jan Kara wrote:
> > Secondly, I don't think it is good to discard error from
> > bio_iov_iter_get_pages() here and just submit partial IO. It will
> > again
> > lead to part of IO being done as direct and part attempted to be done
> > as
> > buffered. Also the "slow" direct IO path in __blkdev_direct_IO()
> > behaves
> > differently - it aborts and returns error if bio_iov_iter_get_pages()
> > ever
> > returned error. IMO we should do the same here.
> 
> Well, it aborts the loop, but then (in the sync case) it still waits
> for the already submitted IOs to finish. Here, too, I'd find it more
> logical to return the number of successfully transmitted bytes rather
> than an error code. In the async case, the submitted bios are left in
> place, and will probably sooner or later finish, changing iocb->ki_pos.

Well, both these behaviors make sense, just traditionally (defined by our
implementation) DIO returns error even if part of IO has actually been
successfully submitted. Making a userspace visible change like you suggest
thus has to be very carefully analyzed and frankly I don't think it's worth
the bother.

> I'm actually not quite certain if that's correct. In the sync case, it
> causes the already-performed IO to be done again, buffered. In the
> async case, it it may even cause two IOs for the same range to be in
> flight at the same time ... ?

It doesn't cause IO to be done again. Look at __generic_file_write_iter().
If generic_file_direct_write() returned error, we immediately return error
as well without retrying buffered IO. We only retry buffered IO for partial
(or 0) return value.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Jan Kara
On Thu 19-07-18 16:53:51, Christoph Hellwig wrote:
> On Thu, Jul 19, 2018 at 12:08:41PM +0100, Al Viro wrote:
> > > Well, there has never been a promise that it will grab *all* pages in the
> > > iter AFAIK. Practically, I think that it was just too hairy to implement 
> > > in
> > > the macro magic that iter processing is... Al might know more (added to
> > > CC).
> > 
> > Not really - it's more that VM has every right to refuse letting you pin
> > an arbitrary amount of pages anyway.
> 
> In which case the code after this patch isn't going to help either, because
> it still tries to pin it all, just in multiple calls to get_user_pages().

Yeah. Actually previous version of the fix (not posted publicly) submitted
partial bio and then reused the bio to submit more. This is also the way
__blkdev_direct_IO operates. Martin optimized this to fill the bio
completely (as we know we have enough bvecs) before submitting, which has
a chance of performing better. I'm fine with either approach, we just have to
decide which way to go.

    Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Jan Kara
On Thu 19-07-18 19:04:46, Ming Lei wrote:
> On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > segment of the input iov_iter's iovec. This may be much less than the number
> > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > bio_iov_iter_get_pages() repeatedly until either the requested number
> > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> > short writes or reads may occur for direct synchronous IOs with multiple
> > iovec slots (such as generated by writev()). In that case,
> > __generic_file_write_iter() falls back to buffered writes, which
> > has been observed to cause data corruption in certain workloads.
> > 
> > Note: if segments aren't page-aligned in the input iovec, this patch may
> > result in multiple adjacent slots of the bi_io_vec array to reference the 
> > same
> > page (the byte ranges are guaranteed to be disjunct if the preceding patch 
> > is
> > applied). We haven't seen problems with that in our and the customer's
> > tests. It'd be possible to detect this situation and merge bi_io_vec slots
> > that refer to the same page, but I prefer to keep it simple for now.
> > 
> > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified 
> > bdev direct-io")
> > Signed-off-by: Martin Wilck 
> > ---
> >  fs/block_dev.c | 8 +++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 0dd87aa..41643c4 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct 
> > iov_iter *iter,
> >  
> > ret = bio_iov_iter_get_pages(&bio, iter);
> > if (unlikely(ret))
> > -   return ret;
> > +   goto out;
> > +
> > +   while (ret == 0 &&
> > +  bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> > +   ret = bio_iov_iter_get_pages(&bio, iter);
> > +
> > ret = bio.bi_iter.bi_size;
> >  
> > if (iov_iter_rw(iter) == READ) {
> > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct 
> > iov_iter *iter,
> > put_page(bvec->bv_page);
> > }
> >  
> > +out:
> > if (vecs != inline_vecs)
> > kfree(vecs);
> >
> 
> You might put the 'vecs' leak fix into another patch, and resue the
> current code block for that.
> 
> Looks all users of bio_iov_iter_get_pages() need this kind of fix, so
> what do you think about the following way?

No. AFAICT all the other users of bio_iov_iter_get_pages() are perfectly
fine with it returning less pages and they loop appropriately.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Jan Kara
On Thu 19-07-18 11:39:18, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the number
> of pages __blkdev_direct_IO_simple() is supposed to process. Call
> bio_iov_iter_get_pages() repeatedly until either the requested number
> of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> short writes or reads may occur for direct synchronous IOs with multiple
> iovec slots (such as generated by writev()). In that case,
> __generic_file_write_iter() falls back to buffered writes, which
> has been observed to cause data corruption in certain workloads.
> 
> Note: if segments aren't page-aligned in the input iovec, this patch may
> result in multiple adjacent slots of the bi_io_vec array to reference the same
> page (the byte ranges are guaranteed to be disjunct if the preceding patch is
> applied). We haven't seen problems with that in our and the customer's
> tests. It'd be possible to detect this situation and merge bi_io_vec slots
> that refer to the same page, but I prefer to keep it simple for now.
> 
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified 
> bdev direct-io")
> Signed-off-by: Martin Wilck 
> ---
>  fs/block_dev.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0dd87aa..41643c4 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct 
> iov_iter *iter,
>  
>   ret = bio_iov_iter_get_pages(&bio, iter);
>   if (unlikely(ret))
> - return ret;
> + goto out;
> +
> + while (ret == 0 &&
> +bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> + ret = bio_iov_iter_get_pages(&bio, iter);
> +

I have two suggestions here (posting them now in public):

Condition bio.bi_vcnt < bio.bi_max_vecs should always be true - we made
sure we have enough vecs for pages in iter. So I'd WARN if this isn't true.

Secondly, I don't think it is good to discard error from
bio_iov_iter_get_pages() here and just submit partial IO. It will again
lead to part of IO being done as direct and part attempted to be done as
buffered. Also the "slow" direct IO path in __blkdev_direct_IO() behaves
differently - it aborts and returns error if bio_iov_iter_get_pages() ever
returned error. IMO we should do the same here.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Jan Kara
On Thu 19-07-18 18:21:23, Ming Lei wrote:
> On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > segment of the input iov_iter's iovec. This may be much less than the number
> > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> 
> In bio_iov_iter_get_pages(), iov_iter_get_pages() supposes to retrieve
> as many as possible pages since both 'maxsize' and 'maxpages' are provided
> to cover all.
> 
> So the question is why iov_iter_get_pages() doesn't work as expected?

Well, there has never been a promise that it will grab *all* pages in the
iter AFAIK. Practically, I think that it was just too hairy to implement in
the macro magic that iter processing is... Al might know more (added to
CC).

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec

2018-07-19 Thread Jan Kara
On Thu 19-07-18 11:39:17, Martin Wilck wrote:
> If the last page of the bio is not "full", the length of the last
> vector slot needs to be corrected. This slot has the index
> (bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper
> array, which is shifted by the value of bio->bi_vcnt at function
> invocation, the correct index is (nr_pages - 1).
> 
> V2: improved readability following suggestions from Ming Lei.
> 
> Fixes: 2cefe4dbaadf ("block: add bio_iov_iter_get_pages()")
> Signed-off-by: Martin Wilck 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

BTW, an explicit CC: stable@vger.kernel.org would be good. But Jens can add it
I guess.

Honza

> ---
>  block/bio.c | 18 --
>  1 file changed, 8 insertions(+), 10 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 67eff5e..0964328 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -912,16 +912,16 @@ EXPORT_SYMBOL(bio_add_page);
>   */
>  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  {
> - unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> + unsigned short idx, nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
>   struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
>   struct page **pages = (struct page **)bv;
> - size_t offset, diff;
> + size_t offset;
>   ssize_t size;
>  
>   size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
>   if (unlikely(size <= 0))
>   return size ? size : -EFAULT;
> - nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
> + idx = nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
>  
>   /*
>* Deep magic below:  We need to walk the pinned pages backwards
> @@ -934,17 +934,15 @@ int bio_iov_iter_get_pages(struct bio *bio, struct 
> iov_iter *iter)
>   bio->bi_iter.bi_size += size;
>   bio->bi_vcnt += nr_pages;
>  
> - diff = (nr_pages * PAGE_SIZE - offset) - size;
> - while (nr_pages--) {
> - bv[nr_pages].bv_page = pages[nr_pages];
> - bv[nr_pages].bv_len = PAGE_SIZE;
> - bv[nr_pages].bv_offset = 0;
> + while (idx--) {
> + bv[idx].bv_page = pages[idx];
> + bv[idx].bv_len = PAGE_SIZE;
> + bv[idx].bv_offset = 0;
>   }
>  
>   bv[0].bv_offset += offset;
>   bv[0].bv_len -= offset;
> - if (diff)
> - bv[bio->bi_vcnt - 1].bv_len -= diff;
> + bv[nr_pages - 1].bv_len -= nr_pages * PAGE_SIZE - offset - size;
>  
>   iov_iter_advance(iter, size);
>   return 0;
> -- 
> 2.17.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: Silent data corruption in blkdev_direct_IO()

2018-07-18 Thread Jan Kara
On Wed 18-07-18 13:40:07, Jan Kara wrote:
> On Wed 18-07-18 11:20:15, Johannes Thumshirn wrote:
> > On Wed, Jul 18, 2018 at 03:54:46PM +0800, Ming Lei wrote:
> > > Please go ahead and take care of it since you have the test cases.
> > 
> > Speaking of which, do we already know how it is triggered and can we
> > cook up a blktests testcase for it? This would be more than helpful
> > for all parties.
> 
> Using multiple iovecs with writev / readv trivially triggers the case of IO
> that is done partly as direct and partly as buffered. Neither Martin nor I
> was able to trigger the data corruption the customer is seeing with KVM
> though (since the generic code tries to maintain data integrity even if the
> IO is mixed). It should be possible to trigger the corruption by having two
> processes doing write to the same PAGE_SIZE region of a block device, just at
> different offsets. And if the first process happens to use direct IO while
> the second ends up doing read-modify-write cycle through page cache, the
> first write could end up being lost. I'll try whether something like this
> is able to see the corruption...

OK, when I run attached test program like:

blkdev-dio-test /dev/loop0 0 &
blkdev-dio-test /dev/loop0 2048 &

One of them reports lost write almost immediately. On kernel with my fix
the test program runs for quite a while without problems.

Honza
-- 
Jan Kara 
SUSE Labs, CR
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>

#define PAGE_SIZE 4096
#define SECT_SIZE 512
#define BUF_OFF (2*SECT_SIZE)

int main(int argc, char **argv)
{
	int fd = open(argv[1], O_RDWR | O_DIRECT);
	int ret;
	char *buf;
	loff_t off;
	struct iovec iov[2];
	unsigned int seq;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	off = strtol(argv[2], NULL, 10);

	buf = aligned_alloc(PAGE_SIZE, PAGE_SIZE);

	iov[0].iov_base = buf;
	iov[0].iov_len = SECT_SIZE;
	iov[1].iov_base = buf + BUF_OFF;
	iov[1].iov_len = SECT_SIZE;

	seq = 0;
	memset(buf, 0, PAGE_SIZE);
	while (1) {
		*(unsigned int *)buf = seq;
		*(unsigned int *)(buf + BUF_OFF) = seq;
		ret = pwritev(fd, iov, 2, off);
		if (ret < 0) {
			perror("pwritev");
			return 1;
		}
		if (ret != 2*SECT_SIZE) {
			fprintf(stderr, "Short pwritev: %d\n", ret);
			return 1;
		}
		ret = pread(fd, buf, PAGE_SIZE, off);
		if (ret < 0) {
			perror("pread");
			return 1;
		}
		if (ret != PAGE_SIZE) {
			fprintf(stderr, "Short read: %d\n", ret);
			return 1;
		}
		if (*(unsigned int *)buf != seq ||
		*(unsigned int *)(buf + SECT_SIZE) != seq) {
			printf("Lost write %u: %u %u\n", seq, *(unsigned int *)buf, *(unsigned int *)(buf + SECT_SIZE));
			return 1;
		}
		seq++;
	}

	return 0;
}


Re: Silent data corruption in blkdev_direct_IO()

2018-07-18 Thread Jan Kara
On Wed 18-07-18 11:20:15, Johannes Thumshirn wrote:
> On Wed, Jul 18, 2018 at 03:54:46PM +0800, Ming Lei wrote:
> > Please go ahead and take care of it since you have the test cases.
> 
> Speaking of which, do we already know how it is triggered and can we
> cook up a blktests testcase for it? This would be more than helpful
> for all parties.

Using multiple iovecs with writev / readv trivially triggers the case of IO
that is done partly as direct and partly as buffered. Neither Martin nor I
were able to trigger the data corruption the customer is seeing with KVM
though (since the generic code tries to maintain data integrity even if the
IO is mixed). It should be possible to trigger the corruption by having two
processes doing write to the same PAGE_SIZE region of a block device, just at
different offsets. And if the first process happens to use direct IO while
the second ends up doing read-modify-write cycle through page cache, the
first write could end up being lost. I'll try whether something like this
is able to see the corruption...

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] bdi: Fix another oops in wb_workfn()

2018-06-22 Thread Jan Kara
On Mon 18-06-18 10:40:14, Tejun Heo wrote:
> On Mon, Jun 18, 2018 at 03:46:58PM +0200, Jan Kara wrote:
> > syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
> > wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
> > WB_shutting_down after wb->bdi->dev became NULL. This indicates that
> > unregister_bdi() failed to call wb_shutdown() on one of wb objects.
> > 
> > The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
> > drops bdi's reference to wb structures before going through the list of
> > wbs again and calling wb_shutdown() on each of them. This way the loop
> > iterating through all wbs can easily miss a wb if that wb has already
> > passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
> > from cgwb_release_workfn() and as a result fully shutdown bdi although
> > wb_workfn() for this wb structure is still running. In fact there are
> > also other ways cgwb_bdi_unregister() can race with
> > cgwb_release_workfn() leading e.g. to use-after-free issues:
> > 
> > CPU1                                  CPU2
> >                                       cgwb_bdi_unregister()
> >                                         cgwb_kill(*slot);
> > 
> > cgwb_release()
> >   queue_work(cgwb_release_wq, &wb->release_work);
> > cgwb_release_workfn()
> >                                         wb = list_first_entry(&bdi->wb_list, ...)
> >                                         spin_unlock_irq(&cgwb_lock);
> >   wb_shutdown(wb);
> >   ...
> >   kfree_rcu(wb, rcu);
> >                                         wb_shutdown(wb); -> oops use-after-free
> > 
> > We solve these issues by synchronizing writeback structure shutdown from
> > cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
> > way we also no longer need synchronization using WB_shutting_down as the
> > mutex provides it for CONFIG_CGROUP_WRITEBACK case and without
> > CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
> > bdi_unregister().
> > 
> > Reported-by: syzbot 
> > Signed-off-by: Jan Kara 
> 
> Acked-by: Tejun Heo 

OK, Jens, can you please pick up the fix? Thanks!

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] bdi: Fix another oops in wb_workfn()

2018-06-19 Thread Jan Kara
On Mon 18-06-18 23:38:12, Tetsuo Handa wrote:
> On 2018/06/18 22:46, Jan Kara wrote:
> > syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
> 
> [1] 
> https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206
> 
> line is missing.
> 
> > wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
> > WB_shutting_down after wb->bdi->dev became NULL. This indicates that
> > unregister_bdi() failed to call wb_shutdown() on one of wb objects.
> > 
> > The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
> > drops bdi's reference to wb structures before going through the list of
> > wbs again and calling wb_shutdown() on each of them. This way the loop
> > iterating through all wbs can easily miss a wb if that wb has already
> > passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
> > from cgwb_release_workfn() and as a result fully shutdown bdi although
> > wb_workfn() for this wb structure is still running. In fact there are
> > also other ways cgwb_bdi_unregister() can race with
> > cgwb_release_workfn() leading e.g. to use-after-free issues:
> > 
> > CPU1                                  CPU2
> >                                       cgwb_bdi_unregister()
> >                                         cgwb_kill(*slot);
> > 
> > cgwb_release()
> >   queue_work(cgwb_release_wq, &wb->release_work);
> > cgwb_release_workfn()
> >                                         wb = list_first_entry(&bdi->wb_list, ...)
> >                                         spin_unlock_irq(&cgwb_lock);
> >   wb_shutdown(wb);
> >   ...
> >   kfree_rcu(wb, rcu);
> >                                         wb_shutdown(wb); -> oops use-after-free
> > 
> > We solve these issues by synchronizing writeback structure shutdown from
> > cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
> > way we also no longer need synchronization using WB_shutting_down as the
> > mutex provides it for CONFIG_CGROUP_WRITEBACK case and without
> > CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
> > bdi_unregister().
> 
> Wow, this patch removes WB_shutting_down.

Yes.

> A bit of worry for me is how long will this mutex_lock() sleep, for
> if there are a lot of wb objects to shutdown, sequentially doing
> wb_shutdown() might block someone's mutex_lock() for longer than
> khungtaskd's timeout period (typically 120 seconds) ?

That's a good question but since the bdi is going away in this case I
don't think the flusher work should take long to complete - the device is
removed from the system at this point so it won't do any IO.

Honza
-- 
Jan Kara 
SUSE Labs, CR


[PATCH] bdi: Fix another oops in wb_workfn()

2018-06-18 Thread Jan Kara
syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
WB_shutting_down after wb->bdi->dev became NULL. This indicates that
unregister_bdi() failed to call wb_shutdown() on one of wb objects.

The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
drops bdi's reference to wb structures before going through the list of
wbs again and calling wb_shutdown() on each of them. This way the loop
iterating through all wbs can easily miss a wb if that wb has already
passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
from cgwb_release_workfn() and as a result fully shutdown bdi although
wb_workfn() for this wb structure is still running. In fact there are
also other ways cgwb_bdi_unregister() can race with
cgwb_release_workfn() leading e.g. to use-after-free issues:

CPU1                                  CPU2
                                      cgwb_bdi_unregister()
                                        cgwb_kill(*slot);

cgwb_release()
  queue_work(cgwb_release_wq, &wb->release_work);
cgwb_release_workfn()
                                        wb = list_first_entry(&bdi->wb_list, ...)
                                        spin_unlock_irq(&cgwb_lock);
  wb_shutdown(wb);
  ...
  kfree_rcu(wb, rcu);
                                        wb_shutdown(wb); -> oops use-after-free

We solve these issues by synchronizing writeback structure shutdown from
cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
way we also no longer need synchronization using WB_shutting_down as the
mutex provides it for CONFIG_CGROUP_WRITEBACK case and without
CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
bdi_unregister().

Reported-by: syzbot 
Signed-off-by: Jan Kara 
---
 include/linux/backing-dev-defs.h |  2 +-
 mm/backing-dev.c                 | 20 +++++++-------------
 2 files changed, 8 insertions(+), 14 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 0bd432a4d7bd..24251762c20c 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -22,7 +22,6 @@ struct dentry;
  */
 enum wb_state {
 	WB_registered,		/* bdi_register() was done */
-	WB_shutting_down,	/* wb_shutdown() in progress */
 	WB_writeback_running,	/* Writeback is in progress */
 	WB_has_dirty_io,	/* Dirty inodes on ->b_{dirty|io|more_io} */
 	WB_start_all,		/* nr_pages == 0 (all) work pending */
@@ -189,6 +188,7 @@ struct backing_dev_info {
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
 	struct rb_root cgwb_congested_tree; /* their congested states */
+	struct mutex cgwb_release_mutex;  /* protect shutdown of wb structs */
 #else
 	struct bdi_writeback_congested *wb_congested;
 #endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 347cc834c04a..2e5d3df0853d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -359,15 +359,8 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	spin_lock_bh(&wb->work_lock);
 	if (!test_and_clear_bit(WB_registered, &wb->state)) {
 		spin_unlock_bh(&wb->work_lock);
-		/*
-		 * Wait for wb shutdown to finish if someone else is just
-		 * running wb_shutdown(). Otherwise we could proceed to wb /
-		 * bdi destruction before wb_shutdown() is finished.
-		 */
-		wait_on_bit(&wb->state, WB_shutting_down, TASK_UNINTERRUPTIBLE);
 		return;
 	}
-	set_bit(WB_shutting_down, &wb->state);
 	spin_unlock_bh(&wb->work_lock);
 
 	cgwb_remove_from_bdi_list(wb);
@@ -379,12 +372,6 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
 	flush_delayed_work(&wb->dwork);
 	WARN_ON(!list_empty(&wb->work_list));
-	/*
-	 * Make sure bit gets cleared after shutdown is finished. Matches with
-	 * the barrier provided by test_and_clear_bit() above.
-	 */
-	smp_wmb();
-	clear_and_wake_up_bit(WB_shutting_down, &wb->state);
 }
 
 static void wb_exit(struct bdi_writeback *wb)
@@ -508,10 +495,12 @@ static void cgwb_release_workfn(struct work_struct *work)
 	struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
 						release_work);
 
+	mutex_lock(&wb->bdi->cgwb_release_mutex);
 	wb_shutdown(wb);
 
 	css_put(wb->memcg_css);
 	css_put(wb->blkcg_css);
+	mutex_unlock(&wb->bdi->cgwb_release_mutex);
 
 	fprop_local_destroy_percpu(&wb->memcg_completions);
 	percpu_ref_exit(&wb->refcnt);
@@ -697,6 +686,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 
 	INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
 	bdi->cgwb_congested_tree = RB_ROOT;
+	mutex_init(&

Re: [PATCH] bdi: Fix another oops in wb_workfn()

2018-06-11 Thread Jan Kara
On Mon 11-06-18 09:01:31, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jun 11, 2018 at 11:12:48AM +0200, Jan Kara wrote:
> > However this is wrong and so is the patch. The problem is in
> > cgwb_bdi_unregister() which does cgwb_kill() and thus drops bdi's
> > reference to wb structures before going through the list of wbs again and
> > calling wb_shutdown() on each of them. The writeback structures we are
> > accessing at this point can be already freed in principle like:
> > 
> > CPU1                                  CPU2
> >                                       cgwb_bdi_unregister()
> >                                         cgwb_kill(*slot);
> > 
> > cgwb_release()
> >   queue_work(cgwb_release_wq, &wb->release_work);
> > cgwb_release_workfn()
> >                                         wb = list_first_entry(&bdi->wb_list, ...)
> >                                         spin_unlock_irq(&cgwb_lock);
> >   wb_shutdown(wb);
> >   ...
> >   kfree_rcu(wb, rcu);
> >                                         wb_shutdown(wb); -> oops use-after-free
> > 
> > I'm not 100% sure how to fix this. wb structures can be at various phases of
> > shutdown (or there may be other external references still existing) when we
> > enter cgwb_bdi_unregister() so I think adding a way for cgwb_bdi_unregister()
> > to wait for standard wb shutdown path to finish is the most robust way.
> > What do you think about attached patch Tejun? So far only compile tested...
> > 
> > Possible problem with it is that now cgwb_bdi_unregister() will wait for
> > all wb references to be dropped so it adds some implicit dependencies to
> > bdi shutdown path. 
> 
> Would something like the following work or am I missing the point
> entirely?

I was pondering the same solution for a while but I think it won't work.
The problem is that e.g. wb_memcg_offline() could have already removed
wb from the radix tree but it is still pending in bdi->wb_list
(wb_shutdown() has not run yet) and so we'd drop reference we didn't get.

Honza
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 347cc83..359cacd 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -715,14 +715,19 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
>  	WARN_ON(test_bit(WB_registered, &bdi->wb.state));
>  
>  	spin_lock_irq(&cgwb_lock);
> -	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
> -		cgwb_kill(*slot);
> +	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0) {
> +		struct bdi_writeback *wb = *slot;
> +
> +		wb_get(wb);
> +		cgwb_kill(wb);
> +	}
>  
>  	while (!list_empty(&bdi->wb_list)) {
>  		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
>  				      bdi_node);
>  		spin_unlock_irq(&cgwb_lock);
>  		wb_shutdown(wb);
> +		wb_put(wb);
>  		spin_lock_irq(&cgwb_lock);
>  	}
>  	spin_unlock_irq(&cgwb_lock);
> 
> 
> -- 
> tejun
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] bdi: Fix another oops in wb_workfn()

2018-06-11 Thread Jan Kara
On Sat 09-06-18 23:00:05, Tetsuo Handa wrote:
> From 014c4149f2e24cd26b278b32d5dfda056eecf093 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa 
> Date: Sat, 9 Jun 2018 22:47:52 +0900
> Subject: [PATCH] bdi: Fix another oops in wb_workfn()
> 
> syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
> wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
> WB_shutting_down after wb->bdi->dev became NULL. This indicates that
> unregister_bdi() failed to call wb_shutdown() on one of wb objects.
> 
> Since cgwb_bdi_unregister() from bdi_unregister() cannot call wb_shutdown()
> on wb objects which have already passed list_del_rcu() in wb_shutdown(),
> cgwb_bdi_unregister() from bdi_unregister() can return and set wb->bdi->dev
> to NULL before such wb objects enter final round of wb_workfn() via
> mod_delayed_work()/flush_delayed_work().

Thanks a lot for debugging the issue and also thanks a lot to Dmitry for
taking time to reproduce the race by hand with the debug patch! I really
appreciate it!

> Since WB_registered is already cleared by wb_shutdown(), only wb_shutdown()
> can schedule for final round of wb_workfn(). Since concurrent calls to
> wb_shutdown() on the same wb object is safe because of WB_shutting_down
> state, I think that wb_shutdown() can safely keep a wb object in the
> bdi->wb_list until that wb object leaves final round of wb_workfn().
> Thus, make wb_shutdown() call list_del_rcu() after flush_delayed_work().

However this is wrong and so is the patch. The problem is in
cgwb_bdi_unregister() which does cgwb_kill() and thus drops bdi's
reference to wb structures before going through the list of wbs again and
calling wb_shutdown() on each of them. The writeback structures we are
accessing at this point can be already freed in principle like:

CPU1                                  CPU2
                                      cgwb_bdi_unregister()
                                        cgwb_kill(*slot);

cgwb_release()
  queue_work(cgwb_release_wq, &wb->release_work);
cgwb_release_workfn()
                                        wb = list_first_entry(&bdi->wb_list, ...)
                                        spin_unlock_irq(&cgwb_lock);
  wb_shutdown(wb);
  ...
  kfree_rcu(wb, rcu);
                                        wb_shutdown(wb); -> oops use-after-free

I'm not 100% sure how to fix this. wb structures can be at various phases of
shutdown (or there may be other external references still existing) when we
enter cgwb_bdi_unregister() so I think adding a way for cgwb_bdi_unregister()
to wait for standard wb shutdown path to finish is the most robust way.
What do you think about attached patch Tejun? So far only compile tested...

Possible problem with it is that now cgwb_bdi_unregister() will wait for
all wb references to be dropped so it adds some implicit dependencies to
bdi shutdown path. 

Honza
-- 
Jan Kara 
SUSE Labs, CR
>From f5038c6e7a3d1a4a91879187b92ede8c868988ac Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Mon, 11 Jun 2018 10:56:04 +0200
Subject: [PATCH] bdi: Fix another oops in wb_workfn()

syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
WB_shutting_down after wb->bdi->dev became NULL. This indicates that
unregister_bdi() failed to call wb_shutdown() on one of wb objects.

The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
drops bdi's reference to wb structures before going through the list of
wbs again and calling wb_shutdown() on each of them. This way the loop
iterating through all wbs can easily miss a wb if that wb has already
passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
from cgwb_release_workfn() and as a result fully shutdown bdi although
wb_workfn() for this wb structure is still running. In fact there are
also other ways cgwb_bdi_unregister() can race with
cgwb_release_workfn() leading e.g. to use-after-free issues:

CPU1                                  CPU2
                                      cgwb_bdi_unregister()
                                        cgwb_kill(*slot);

cgwb_release()
  queue_work(cgwb_release_wq, &wb->release_work);
cgwb_release_workfn()
                                        wb = list_first_entry(&bdi->wb_list, ...)
                                        spin_unlock_irq(&cgwb_lock);
  wb_shutdown(wb);
  ...
  kfree_rcu(wb, rcu);
                                        wb_shutdown(wb); -> oops use-after-free

We solve all these issues by making cgwb_bdi_unregister() wait for
shutdown of all wb structures instead of going through them and trying
to actively shut them down.

[1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206

Cc: Dmitry Vyukov 
Cc: Tejun Heo 
Reported-and-analyzed-by: Tetsuo Handa 
Reported-by: syzbot 
Signed-off-by: Jan Kara 
--

Re: general protection fault in wb_workfn (2)

2018-05-28 Thread Jan Kara
On Sun 27-05-18 09:47:54, Tetsuo Handa wrote:
> Forwarding 
> http://lkml.kernel.org/r/201805251915.fgh64517.hvfjoolffmq...@i-love.sakura.ne.jp
>  .
> 
> Jan Kara wrote:
> > > void delayed_work_timer_fn(struct timer_list *t)
> > > {
> > >   struct delayed_work *dwork = from_timer(dwork, t, timer);
> > > 
> > >   /* should have been called from irqsafe timer with irq already off */
> > > 	__queue_work(dwork->cpu, dwork->wq, &dwork->work);
> > > }
> > > 
> > > Then, wb_workfn() is after all scheduled even if we check for
> > > WB_registered bit, isn't it?
> > 
> > It can be queued after WB_registered bit is cleared but it cannot be queued
> > after mod_delayed_work(bdi_wq, &wb->dwork, 0) has finished. That function
> > deletes the pending timer (the timer cannot be armed again because
> > WB_registered is cleared) and queues what should be the last round of
> > wb_workfn().
> 
> mod_delayed_work() deletes the pending timer but does not wait for already
> invoked timer handler to complete because it is using del_timer() rather than
> del_timer_sync(). Then, what happens if __queue_work() is almost concurrently
> executed from two CPUs, one from mod_delayed_work(bdi_wq, &wb->dwork, 0) from
> wb_shutdown() path (which is called without spin_lock_bh(&wb->work_lock)) and
> the other from delayed_work_timer_fn() path (which is called without checking
> WB_registered bit under spin_lock_bh(&wb->work_lock)) ?

In this case, work should still be queued only once. The synchronization in
this case should be provided by the WORK_STRUCT_PENDING_BIT. When a delayed
work is queued by mod_delayed_work(), this bit is set, and gets cleared
only once the work is started on some CPU. But admittedly this code is
rather convoluted so I may be missing something.

Also you should note that flush_delayed_work() which follows
mod_delayed_work() in wb_shutdown() does del_timer_sync() so I don't see
how anything could get past that. In fact mod_delayed_work() is in
wb_shutdown() path to make sure wb_workfn() gets executed at least once
before the bdi_writeback structure gets cleaned up so that all queued items
are finished. We do not rely on it to remove pending timers or queued
wb_workfn() executions.
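
The pending-bit idea can be sketched in a few lines of userspace C (illustrative only; the names are made up and this is not the kernel's workqueue implementation):

```c
#include <stdatomic.h>

/* Illustrative model of WORK_STRUCT_PENDING_BIT: both queueing paths
 * funnel through a test-and-set on a per-work pending flag, so the
 * work can be enqueued at most once until it starts executing. */
static atomic_flag work_pending = ATOMIC_FLAG_INIT;
static int times_queued;

/* Only the caller that wins the flag actually enqueues the work. */
static void try_queue_work(void)
{
	if (!atomic_flag_test_and_set(&work_pending))
		times_queued++;
}

/* Model the race discussed above: wb_shutdown()'s mod_delayed_work()
 * and an already-fired delayed_work_timer_fn() both try to queue. */
int racy_double_queue(void)
{
	try_queue_work();	/* mod_delayed_work() path */
	try_queue_work();	/* delayed_work_timer_fn() path racing in */
	return times_queued;	/* the work was enqueued exactly once */
}
```

Whichever path loses the test-and-set simply returns, which is why the near-simultaneous __queue_work() calls do not double-queue the work.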

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH] bdi: Fix oops in wb_workfn()

2018-05-21 Thread Jan Kara
On Sat 19-05-18 23:27:09, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > Jan Kara wrote:
> > > Make wb_workfn() use wakeup_wb() for requeueing the work which takes all
> > > the necessary precautions against racing with bdi unregistration.
> > 
> > Yes, this patch will solve NULL pointer dereference bug. But is it OK to leave
> > list_empty(&wb->work_list) == false situation? Who takes over the role of making
> > list_empty(&wb->work_list) == true?
> 
> syzbot is again reporting the same NULL pointer dereference.
> 
>   general protection fault in wb_workfn (2)
>   
> https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206

Gaah... So we are still missing something.

> Didn't we overlook something obvious in commit b8b784958eccbf8f ("bdi:
> Fix oops in wb_workfn()") ?
> 
> At first, I thought that that commit will solve NULL pointer dereference bug.
> But what does
> 
>  	if (!list_empty(&wb->work_list))
> -		mod_delayed_work(bdi_wq, &wb->dwork, 0);
> +		wb_wakeup(wb);
>  	else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
>  		wb_wakeup_delayed(wb);
> 
> mean?
> 
> static void wb_wakeup(struct bdi_writeback *wb)
> {
> 	spin_lock_bh(&wb->work_lock);
> 	if (test_bit(WB_registered, &wb->state))
> 		mod_delayed_work(bdi_wq, &wb->dwork, 0);
> 	spin_unlock_bh(&wb->work_lock);
> }
> 
> It means nothing but "we don't call mod_delayed_work() if WB_registered
> bit was already cleared".

Exactly.

> But if WB_registered bit is not yet cleared when we hit
> wb_wakeup_delayed() path?
> 
> void wb_wakeup_delayed(struct bdi_writeback *wb)
> {
> 	unsigned long timeout;
> 
> 	timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
> 	spin_lock_bh(&wb->work_lock);
> 	if (test_bit(WB_registered, &wb->state))
> 		queue_delayed_work(bdi_wq, &wb->dwork, timeout);
> 	spin_unlock_bh(&wb->work_lock);
> }
> 
> add_timer() is called because (presumably) timeout > 0. And after that
> timeout expires, __queue_work() is called even if WB_registered bit is
> already cleared before that timeout expires, isn't it?

Yes.

> void delayed_work_timer_fn(struct timer_list *t)
> {
> 	struct delayed_work *dwork = from_timer(dwork, t, timer);
> 
> 	/* should have been called from irqsafe timer with irq already off */
> 	__queue_work(dwork->cpu, dwork->wq, &dwork->work);
> }
> 
> Then, wb_workfn() is after all scheduled even if we check for
> WB_registered bit, isn't it?

It can be queued after WB_registered bit is cleared but it cannot be queued
after mod_delayed_work(bdi_wq, &wb->dwork, 0) has finished. That function
deletes the pending timer (the timer cannot be armed again because
WB_registered is cleared) and queues what should be the last round of
wb_workfn().

> Then, don't we need to check that
> 
> 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
> 	flush_delayed_work(&wb->dwork);
> 
> is really waiting for completion? At least, shouldn't we try below debug
> output (not only for debugging this report but also generally desirable)?
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 7441bd9..ccec8cd 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -376,8 +376,10 @@ static void wb_shutdown(struct bdi_writeback *wb)
>  	 * tells wb_workfn() that @wb is dying and its work_list needs to
>  	 * be drained no matter what.
>  	 */
> -	mod_delayed_work(bdi_wq, &wb->dwork, 0);
> -	flush_delayed_work(&wb->dwork);
> +	if (!mod_delayed_work(bdi_wq, &wb->dwork, 0))
> +		printk(KERN_WARNING "wb_shutdown: mod_delayed_work() failed\n");

false return from mod_delayed_work() just means that there was no timer
armed. That is a valid situation if there are no dirty data.

> +	if (!flush_delayed_work(&wb->dwork))
> +		printk(KERN_WARNING "wb_shutdown: flush_delayed_work() failed\n");

And this is valid as well (although unlikely) if the work managed to
complete on another CPU before flush_delayed_work() was called.

So I don't think your warnings will help us much. But yes, we need to debug
this somehow. For now I have no idea what could be still going wrong.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: write call hangs in kernel space after virtio hot-remove

2018-05-09 Thread Jan Kara
On Thu 03-05-18 10:48:20, Matthew Wilcox wrote:
> On Thu, May 03, 2018 at 12:05:14PM -0400, Jeff Layton wrote:
> > On Thu, 2018-05-03 at 16:42 +0200, Jan Kara wrote:
> > > On Wed 25-04-18 17:07:48, Fabiano Rosas wrote:
> > > > I'm looking into an issue where removing a virtio disk via sysfs while another
> > > > process is issuing write() calls results in the writing task going into a
> > > > livelock:
> >
> > > Thanks for the debugging of the problem. I agree with your analysis however
> > > I don't like your fix. The issue is that when bdi is unregistered we don't
> > > really expect any writeback to happen after that moment. This is what
> > > prevents various use-after-free issues and I'd like that to stay the way it
> > > is.
> > > 
> > > What I think we should do is that we'll prevent dirtying of new pages when
> > > we know the underlying device is gone. Because that will fix your problem
> > > and also make sure user sees the IO errors directly instead of just in the
> > > kernel log. The question is how to make this happen in the least painful
> > > way. I think we could intercept writes in grab_cache_page_write_begin()
> > > (which however requires that function to return a proper error code and not
> > > just NULL / non-NULL). And we should also intercept write faults to not
> > > allow page dirtying via mmap - probably somewhere in do_shared_fault() and
> > > do_wp_page(). I've added Jeff to CC since he's dealing with IO error
> > > handling a lot these days. Jeff, what do you think?
> > 
> > (cc'ing Willy too since he's given this more thought than me)
> > 
> > For the record, I've mostly been looking at error _reporting_. Handling
> > errors at this level is not something I've really considered in great
> > detail as of yet.
> > 
> > Still, I think the basic idea sounds reasonable. Not allowing pages to
> > be dirtied when we can't clean them seems like a reasonable thing to
> > do.
> > 
> > The big question is how we'll report this to userland:
> > 
> > Would your approach have it return an error on write() and such? What
> > sort of error if so? ENODEV? Would we have to SIGBUS when someone tries
> > to dirty the page through mmap?
> 
> I have been having some thoughts in this direction.  They are perhaps
> a little more long-term than this particular bug, so they may not be
> relevant to the immediate fix.
> 
> I want to separate removing hardware from tearing down the block device
> that represents something on those hardware devices.  That allows for
> better handling of intermittent transport failures, or accidental device
> removal, followed by a speedy insert.
> 
> In that happy future, the bug described here wouldn't be getting an
> -EIO, it'd be getting an -ENODEV because we've decided this device is
> permanently gone.  I think that should indeed be a SIGBUS on mapped
> writes.
> 
> Looking at the current code, filemap_fault() will return VM_FAULT_SIGBUS
> already if page_cache_read() returns any error other than -ENOMEM, so
> that's fine.  We probably want some code (and this is where I reach the
> edge of my knowledge about the current page cache) to rip all the pages
> out of the page cache for every file on an -ENODEV filesystem.

Note that that already (partially) happens through:

del_gendisk() -> invalidate_partition() -> __invalidate_device() ->
  invalidate_inodes()
  invalidate_bdev()

The slight trouble is this currently won't do anything for open files (as
such inodes will have elevated refcount) so that needs fixing. And then
we need to prevent new dirty pages from being created...

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH] loop: remember whether sysfs_create_group() succeeded

2018-05-09 Thread Jan Kara
On Fri 04-05-18 20:47:29, Tetsuo Handa wrote:
> >From 626d33de1b70b11ecaf95a9f83f7644998e54cbb Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Date: Wed, 2 May 2018 23:03:48 +0900
> Subject: [PATCH] loop: remember whether sysfs_create_group() succeeded
> 
> syzbot is hitting WARN() triggered by memory allocation fault
> injection [1] because loop module is calling sysfs_remove_group()
> when sysfs_create_group() failed.
> Fix this by remembering whether sysfs_create_group() succeeded.
> 
> [1] 
> https://syzkaller.appspot.com/bug?id=3f86c0edf75c86d2633aeb9dd69eccc70bc7e90b
> 
> Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Reported-by: syzbot 
> <syzbot+9f03168400f56df89dbc6f1751f4458fe739f...@syzkaller.appspotmail.com>
> Reviewed-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
> Cc: Jens Axboe <ax...@kernel.dk>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  drivers/block/loop.c | 11 ++-
>  drivers/block/loop.h |  1 +
>  2 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 5d4e316..1d758d8 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -809,16 +809,17 @@ static ssize_t loop_attr_dio_show(struct loop_device *lo, char *buf)
>   .attrs= loop_attrs,
>  };
>  
> -static int loop_sysfs_init(struct loop_device *lo)
> +static void loop_sysfs_init(struct loop_device *lo)
>  {
> -	return sysfs_create_group(&disk_to_dev(lo->lo_disk)->kobj,
> -				  &loop_attribute_group);
> +	lo->sysfs_ready = !sysfs_create_group(&disk_to_dev(lo->lo_disk)->kobj,
> +					      &loop_attribute_group);
>  }
>  
>  static void loop_sysfs_exit(struct loop_device *lo)
>  {
> -	sysfs_remove_group(&disk_to_dev(lo->lo_disk)->kobj,
> -			   &loop_attribute_group);
> +	if (lo->sysfs_ready)
> +		sysfs_remove_group(&disk_to_dev(lo->lo_disk)->kobj,
> +				   &loop_attribute_group);
>  }
>  
>  static void loop_config_discard(struct loop_device *lo)
> diff --git a/drivers/block/loop.h b/drivers/block/loop.h
> index b78de98..73c801f 100644
> --- a/drivers/block/loop.h
> +++ b/drivers/block/loop.h
> @@ -58,6 +58,7 @@ struct loop_device {
>   struct kthread_worker   worker;
>   struct task_struct  *worker_task;
>   booluse_dio;
> + boolsysfs_ready;
>  
>   struct request_queue*lo_queue;
>   struct blk_mq_tag_set   tag_set;
> -- 
> 1.8.3.1
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH] bdi: Fix oops in wb_workfn()

2018-05-09 Thread Jan Kara
On Thu 03-05-18 18:26:26, Jan Kara wrote:
> Syzbot has reported that it can hit a NULL pointer dereference in
> wb_workfn() due to wb->bdi->dev being NULL. This indicates that
> wb_workfn() was called for an already unregistered bdi which should not
> happen as wb_shutdown() called from bdi_unregister() should make sure
> all pending writeback works are completed before bdi is unregistered.
> Except that wb_workfn() itself can requeue the work with:
> 
> 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
> 
> and if this happens while wb_shutdown() is waiting in:
> 
> 	flush_delayed_work(&wb->dwork);
> 
> the dwork can get executed after wb_shutdown() has finished and
> bdi_unregister() has cleared wb->bdi->dev.
> 
> Make wb_workfn() use wakeup_wb() for requeueing the work which takes all
> the necessary precautions against racing with bdi unregistration.
> 
> CC: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> CC: Tejun Heo <t...@kernel.org>
> Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977
> Reported-by: syzbot <syzbot+9873874c735f2892e...@syzkaller.appspotmail.com>
> Signed-off-by: Jan Kara <j...@suse.cz>
> ---
>  fs/fs-writeback.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Jens, can you please pick up this patch? Probably for the next merge window
(I don't see a reason to rush this at this point in release cycle). Thanks!

Honza

> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 47d7c151fcba..471d863958bc 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1961,7 +1961,7 @@ void wb_workfn(struct work_struct *work)
>   }
>  
>   if (!list_empty(&wb->work_list))
> - mod_delayed_work(bdi_wq, &wb->dwork, 0);
> + wb_wakeup(wb);
>   else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
>   wb_wakeup_delayed(wb);
>  
> -- 
> 2.13.6
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH] bdi: Fix oops in wb_workfn()

2018-05-09 Thread Jan Kara
On Fri 04-05-18 07:55:58, Dave Chinner wrote:
> On Thu, May 03, 2018 at 06:26:26PM +0200, Jan Kara wrote:
> > Syzbot has reported that it can hit a NULL pointer dereference in
> > wb_workfn() due to wb->bdi->dev being NULL. This indicates that
> > wb_workfn() was called for an already unregistered bdi which should not
> > happen as wb_shutdown() called from bdi_unregister() should make sure
> > all pending writeback works are completed before bdi is unregistered.
> > Except that wb_workfn() itself can requeue the work with:
> > 
> > mod_delayed_work(bdi_wq, &wb->dwork, 0);
> > 
> > and if this happens while wb_shutdown() is waiting in:
> > 
> > flush_delayed_work(&wb->dwork);
> > 
> > the dwork can get executed after wb_shutdown() has finished and
> > bdi_unregister() has cleared wb->bdi->dev.
> > 
> > Make wb_workfn() use wakeup_wb() for requeueing the work which takes all
> > the necessary precautions against racing with bdi unregistration.
> > 
> > CC: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> > CC: Tejun Heo <t...@kernel.org>
> > Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977
> > Reported-by: syzbot <syzbot+9873874c735f2892e...@syzkaller.appspotmail.com>
> > Signed-off-by: Jan Kara <j...@suse.cz>
> > ---
> >  fs/fs-writeback.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 47d7c151fcba..471d863958bc 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -1961,7 +1961,7 @@ void wb_workfn(struct work_struct *work)
> > }
> >  
> > if (!list_empty(&wb->work_list))
> > -   mod_delayed_work(bdi_wq, &wb->dwork, 0);
> > +   wb_wakeup(wb);
> > else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
> > wb_wakeup_delayed(wb);
> 
> Yup, looks fine - I can't see any more of these open coded wakeup,
> either, so we should be good here.
> 
> Reviewed-by: Dave Chinner <dchin...@redhat.com>

Thanks!

> As an aside, why is half the wb infrastructure in fs/fs-writeback.c
> and the other half in mm/backing-dev.c? it seems pretty random as to
> what is where e.g. wb_wakeup() and wb_wakeup_delayed() are almost
> identical, but are in completely different files...

Yeah, it deserves a cleanup.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH] bdi: Fix oops in wb_workfn()

2018-05-09 Thread Jan Kara
On Fri 04-05-18 07:35:34, Tetsuo Handa wrote:
> Jan Kara wrote:
> > Make wb_workfn() use wakeup_wb() for requeueing the work which takes all
> > the necessary precautions against racing with bdi unregistration.
> 
> Yes, this patch will solve NULL pointer dereference bug. But is it OK to
> leave list_empty(>work_list) == false situation? Who takes over the
> role of making list_empty(>work_list) == true?

That's a good question. The reason is that the last running instance of
wb_workfn() cannot exit with work_list non-empty. Once WB_registered is
cleared we cannot add new entries to work_list, and then we queue and flush
a last wb_workfn() to clean up the list. The NULL ptr deref was triggered
not by this last running wb_workfn() but by one running independently in
parallel to wb_shutdown(). So something like:

CPU0                            CPU1                            CPU2
wb_workfn()
  do {
    ...
  } while (!list_empty(&wb->work_list));
                                wb_queue_work()
                                  if (test_bit(WB_registered, &wb->state)) {
                                    list_add_tail(&work->list, &wb->work_list);
                                    mod_delayed_work(bdi_wq, &wb->dwork, 0);
                                  }
                                                                wb_shutdown()
                                                                  if (!test_and_clear_bit(WB_registered, &wb->state)) {
                                                                    ...
                                                                  }
                                                                  mod_delayed_work(bdi_wq, &wb->dwork, 0);
                                                                  flush_delayed_work(&wb->dwork);
  if (!list_empty(&wb->work_list))
    mod_delayed_work(bdi_wq, &wb->dwork, 0); -> queues buggy work

> Just a confirmation, for Fabiano Rosas is facing a problem that "write call
> hangs in kernel space after virtio hot-remove" and is thinking that we might
> need to go the opposite direction
> ( http://lkml.kernel.org/r/f0787b79-1e50-5f55-a400-44f715451...@linux.ibm.com 
> ).

Yes, I'm aware of that report and I think it should be solved
differently than what Fabiano suggests.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


[PATCH] bdi: Fix oops in wb_workfn()

2018-05-03 Thread Jan Kara
Syzbot has reported that it can hit a NULL pointer dereference in
wb_workfn() due to wb->bdi->dev being NULL. This indicates that
wb_workfn() was called for an already unregistered bdi which should not
happen as wb_shutdown() called from bdi_unregister() should make sure
all pending writeback works are completed before bdi is unregistered.
Except that wb_workfn() itself can requeue the work with:

mod_delayed_work(bdi_wq, &wb->dwork, 0);

and if this happens while wb_shutdown() is waiting in:

flush_delayed_work(&wb->dwork);

the dwork can get executed after wb_shutdown() has finished and
bdi_unregister() has cleared wb->bdi->dev.

Make wb_workfn() use wakeup_wb() for requeueing the work which takes all
the necessary precautions against racing with bdi unregistration.

CC: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
CC: Tejun Heo <t...@kernel.org>
Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977
Reported-by: syzbot <syzbot+9873874c735f2892e...@syzkaller.appspotmail.com>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/fs-writeback.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 47d7c151fcba..471d863958bc 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1961,7 +1961,7 @@ void wb_workfn(struct work_struct *work)
}
 
if (!list_empty(&wb->work_list))
-   mod_delayed_work(bdi_wq, &wb->dwork, 0);
+   wb_wakeup(wb);
else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
wb_wakeup_delayed(wb);
 
-- 
2.13.6



Re: general protection fault in wb_workfn

2018-05-03 Thread Jan Kara
On Mon 23-04-18 19:09:51, Tetsuo Handa wrote:
> On 2018/04/20 1:05, syzbot wrote:
> > kasan: CONFIG_KASAN_INLINE enabled
> > kasan: GPF could be caused by NULL-ptr deref or user memory access
> > general protection fault:  [#1] SMP KASAN
> > Dumping ftrace buffer:
> >    (ftrace buffer empty)
> > Modules linked in:
> > CPU: 0 PID: 28 Comm: kworker/u4:2 Not tainted 4.16.0-rc7+ #368
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > Google 01/01/2011
> > Workqueue: writeback wb_workfn
> > RIP: 0010:dev_name include/linux/device.h:981 [inline]
> > RIP: 0010:wb_workfn+0x1a2/0x16b0 fs/fs-writeback.c:1936
> > RSP: 0018:8801d951f038 EFLAGS: 00010206
> > RAX: dc00 RBX:  RCX: 81bf6ea5
> > RDX: 000a RSI: 87b44840 RDI: 0050
> > RBP: 8801d951f558 R08: 11003b2a3def R09: 0004
> > R10: 8801d951f438 R11: 0004 R12: 0100
> > R13: 8801baee0dc0 R14: 8801d951f530 R15: 8801baee10d8
> > FS:  () GS:8801db20() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 0047ff80 CR3: 07a22006 CR4: 001626f0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe0ff0 DR7: 0400
> > Call Trace:
> >  process_one_work+0xc47/0x1bb0 kernel/workqueue.c:2113
> >  process_scheduled_works kernel/workqueue.c:2173 [inline]
> >  worker_thread+0xa4b/0x1990 kernel/workqueue.c:2252
> >  kthread+0x33c/0x400 kernel/kthread.c:238
> >  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:406
> 
> This report says that wb->bdi->dev == NULL
> 
>   static inline const char *dev_name(const struct device *dev)
>   {
> /* Use the init name until the kobject becomes available */
> if (dev->init_name)
>   return dev->init_name;
>   
> return kobject_name(&dev->kobj);
>   }
> 
>   void wb_workfn(struct work_struct *work)
>   {
>   (...snipped...)
>  set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
>   (...snipped...)
>   }
> 
> immediately after ioctl(LOOP_CTL_REMOVE) was requested. It is plausible
> because ioctl(LOOP_CTL_REMOVE) sets bdi->dev to NULL after returning from
> wb_shutdown().
> 
> loop_control_ioctl(LOOP_CTL_REMOVE) {
>   loop_remove(lo) {
> del_gendisk(lo->lo_disk) {
>   bdi_unregister(disk->queue->backing_dev_info) {
> bdi_remove_from_list(bdi);
> wb_shutdown(&bdi->wb);
> cgwb_bdi_unregister(bdi);
> if (bdi->dev) {
>   bdi_debug_unregister(bdi);
>   device_unregister(bdi->dev);
>   bdi->dev = NULL;
> }
>   }
> }
>   }
> }
> 
> For some reason wb_shutdown() is not waiting for wb_workfn() to complete
> ( or something queues again after WB_registered bit was cleared ) ?
> 
> Anyway, I think that this is block layer problem rather than fs layer
> problem.

Thanks for the analysis. I think I can see where the problem is -
wb_workfn() can requeue the work while wb_shutdown() is running. I'll send a
patch shortly.

> By the way, I got a newbie question regarding commit 5318ce7d46866e1d ("bdi:
> Shutdown writeback on all cgwbs in cgwb_bdi_destroy()"). It uses clear_bit()
> to clear WB_shutting_down bit so that threads waiting at wait_on_bit() will
> wake up. But clear_bit() itself does not wake up threads, does it? Who wakes
> them up (e.g. by calling wake_up_bit()) after clear_bit() was called?

Yeah, that's a bug. Thanks for fixing it.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: INFO: task hung in wb_shutdown (2)

2018-05-03 Thread Jan Kara
On Wed 02-05-18 07:14:51, Tetsuo Handa wrote:
> From 1b90d7f71d60e743c69cdff3ba41edd1f9f86f93 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Date: Wed, 2 May 2018 07:07:55 +0900
> Subject: [PATCH v2] bdi: wake up concurrent wb_shutdown() callers.
> 
> syzbot is reporting hung tasks at wait_on_bit(WB_shutting_down) in
> wb_shutdown() [1]. This seems to be because commit 5318ce7d46866e1d ("bdi:
> Shutdown writeback on all cgwbs in cgwb_bdi_destroy()") forgot to call
> wake_up_bit(WB_shutting_down) after clear_bit(WB_shutting_down).
> 
> Introduce a helper function clear_and_wake_up_bit() and use it, in order
> to avoid similar errors in future.
> 
> [1] 
> https://syzkaller.appspot.com/bug?id=b297474817af98d5796bc544e1bb806fc3da0e5e
> 
> Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Reported-by: syzbot <syzbot+c0cf869505e03bdf1...@syzkaller.appspotmail.com>
> Fixes: 5318ce7d46866e1d ("bdi: Shutdown writeback on all cgwbs in 
> cgwb_bdi_destroy()")
> Cc: Tejun Heo <t...@kernel.org>
> Cc: Jan Kara <j...@suse.cz>
> Cc: Jens Axboe <ax...@fb.com>
> Suggested-by: Linus Torvalds <torva...@linux-foundation.org>

Thanks for debugging this and for the fix Tetsuo! The patch looks good to
me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  include/linux/wait_bit.h | 17 +
>  mm/backing-dev.c |  2 +-
>  2 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/wait_bit.h b/include/linux/wait_bit.h
> index 9318b21..2b0072f 100644
> --- a/include/linux/wait_bit.h
> +++ b/include/linux/wait_bit.h
> @@ -305,4 +305,21 @@ struct wait_bit_queue_entry {
>   __ret;  \
>  })
>  
> +/**
> + * clear_and_wake_up_bit - clear a bit and wake up anyone waiting on that bit
> + *
> + * @bit: the bit of the word being waited on
> + * @word: the word being waited on, a kernel virtual address
> + *
> + * You can use this helper if bitflags are manipulated atomically rather than
> + * non-atomically under a lock.
> + */
> +static inline void clear_and_wake_up_bit(int bit, void *word)
> +{
> + clear_bit_unlock(bit, word);
> + /* See wake_up_bit() for which memory barrier you need to use. */
> + smp_mb__after_atomic();
> + wake_up_bit(word, bit);
> +}
> +
>  #endif /* _LINUX_WAIT_BIT_H */
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 023190c..fa5e6d7 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -383,7 +383,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
>* the barrier provided by test_and_clear_bit() above.
>*/
>   smp_wmb();
> - clear_bit(WB_shutting_down, &wb->state);
> + clear_and_wake_up_bit(WB_shutting_down, &wb->state);
>  }
>  
>  static void wb_exit(struct bdi_writeback *wb)
> -- 
> 1.8.3.1
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 2/3] fs: make thaw_super_locked() really just a helper

2018-05-03 Thread Jan Kara
On Fri 20-04-18 16:59:03, Luis R. Rodriguez wrote:
> thaw_super_locked() was added via commit 08fdc8a0138a ("buffer.c: call
> thaw_super during emergency thaw") merged on v4.17 to help with the
> ability so that the caller can take charge of handling the s_umount lock,
> however, it has left all* of the failure handling including unlocking
> lock of s_umount inside thaw_super_locked().
> 
> This does not make thaw_super_locked() flexible. For instance we may
> later want to use it with the abilty to handle unfolding of the locks
> ourselves.
> 
> Change thaw_super_locked() to really just be a helper, and let the
> callers deal with all the error handling.

And do you have use for the new thaw_super_locked()? Because the new
semantics with having to deal with deactivate_locked_super() does not seem
ideal either so as a standalone cleanup patch it does not look too
useful.
 
> This commit introeuces no functional changes.
^^^ introduces
 
Honza

> ---
>  fs/super.c | 27 ---
>  1 file changed, 20 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 9d0eb5e20a1f..82bc74a16f06 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -937,10 +937,15 @@ void emergency_remount(void)
>  
>  static void do_thaw_all_callback(struct super_block *sb)
>  {
> + int error;
> +
>   down_write(&sb->s_umount);
>   if (sb->s_root && sb->s_flags & MS_BORN) {
>   emergency_thaw_bdev(sb);
> - thaw_super_locked(sb);
> + error = thaw_super_locked(sb);
> + if (error)
> + up_write(&sb->s_umount);
> + deactivate_locked_super(sb);
>   } else {
>   up_write(&sb->s_umount);
>   }
> @@ -1532,14 +1537,13 @@ int freeze_super(struct super_block *sb)
>  }
>  EXPORT_SYMBOL(freeze_super);
>  
> +/* Caller takes the sb->s_umount rw_semaphore lock and handles active count 
> */
>  static int thaw_super_locked(struct super_block *sb)
>  {
>   int error;
>  
> - if (sb->s_writers.frozen != SB_FREEZE_COMPLETE) {
> - up_write(&sb->s_umount);
> + if (sb->s_writers.frozen != SB_FREEZE_COMPLETE)
>   return -EINVAL;
> - }
>  
>   if (sb_rdonly(sb)) {
>   sb->s_writers.frozen = SB_UNFROZEN;
> @@ -1554,7 +1558,6 @@ static int thaw_super_locked(struct super_block *sb)
>   printk(KERN_ERR
>   "VFS:Filesystem thaw failed\n");
>   lockdep_sb_freeze_release(sb);
> - up_write(&sb->s_umount);
>   return error;
>   }
>   }
> @@ -1563,7 +1566,6 @@ static int thaw_super_locked(struct super_block *sb)
>   sb_freeze_unlock(sb);
>  out:
>   wake_up(&sb->s_writers.wait_unfrozen);
> - deactivate_locked_super(sb);
>   return 0;
>  }
>  
> @@ -1575,7 +1577,18 @@ static int thaw_super_locked(struct super_block *sb)
>   */
>  int thaw_super(struct super_block *sb)
>  {
> + int error;
> +
>   down_write(&sb->s_umount);
> - return thaw_super_locked(sb);
> + error = thaw_super_locked(sb);
> + if (error) {
> + up_write(&sb->s_umount);
> + goto out;
> + }
> +
> + deactivate_locked_super(sb);
> +
> +out:
> + return error;
>  }
>  EXPORT_SYMBOL(thaw_super);
> -- 
> 2.16.3
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 3/3] fs: fix corner case race on freeze_bdev() when sb disappears

2018-05-03 Thread Jan Kara
On Fri 20-04-18 16:59:04, Luis R. Rodriguez wrote:
> freeze_bdev() will bail but leave the bd_fsfreeze_count incremented
> if the get_active_super() does not find the superblock on our
> super_blocks list to match.
> 
> This issue has been present since v2.6.29 during the introduction of the
> ioctl_fsfreeze() and ioctl_fsthaw() via commit fcccf502540e3 ("filesystem
> freeze: implement generic freeze feature").
> 
> I am not aware of any existing races which have triggered this
> situation, however, if it does trigger it could mean leaving a
> superblock with bd_fsfreeze_count always positive.
> 
> Fixes: fcccf502540e3 ("filesystem freeze: implement generic freeze feature")
> Signed-off-by: Luis R. Rodriguez <mcg...@kernel.org>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/block_dev.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index b54966679833..7a532aa58c07 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -507,8 +507,10 @@ struct super_block *freeze_bdev(struct block_device 
> *bdev)
>   }
>  
>   sb = get_active_super(bdev);
> - if (!sb)
> + if (!sb) {
> + bdev->bd_fsfreeze_count--;
>   goto out;
> + }
>   if (sb->s_op->freeze_super)
>   error = sb->s_op->freeze_super(sb);
>   else
> -- 
> 2.16.3
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 1/3] fs: move documentation for thaw_super() where appropriate

2018-05-03 Thread Jan Kara
On Fri 20-04-18 16:59:02, Luis R. Rodriguez wrote:
> On commit 08fdc8a0138a ("buffer.c: call thaw_super during emergency thaw")
> Mateusz added thaw_super_locked() and made thaw_super() use it, but
> forgot to move the documentation.
> 
> Signed-off-by: Luis R. Rodriguez <mcg...@kernel.org>

Looks good (modulo the -- which is probably worth fixing when touching the
comment anyway). You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/super.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 5fa9a8d8d865..9d0eb5e20a1f 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1532,12 +1532,6 @@ int freeze_super(struct super_block *sb)
>  }
>  EXPORT_SYMBOL(freeze_super);
>  
> -/**
> - * thaw_super -- unlock filesystem
> - * @sb: the super to thaw
> - *
> - * Unlocks the filesystem and marks it writeable again after freeze_super().
> - */
>  static int thaw_super_locked(struct super_block *sb)
>  {
>   int error;
> @@ -1573,6 +1567,12 @@ static int thaw_super_locked(struct super_block *sb)
>   return 0;
>  }
>  
> +/**
> + * thaw_super -- unlock filesystem
> + * @sb: the super to thaw
> + *
> + * Unlocks the filesystem and marks it writeable again after freeze_super().
> + */
>  int thaw_super(struct super_block *sb)
>  {
>   down_write(&sb->s_umount);
> -- 
> 2.16.3
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: write call hangs in kernel space after virtio hot-remove

2018-05-03 Thread Jan Kara
em
and also make sure user sees the IO errors directly instead of just in the
kernel log. The question is how to make this happen in the least painful
way. I think we could intercept writes in grab_cache_page_write_begin()
(which however requires that function to return a proper error code and not
just NULL / non-NULL). And we should also intercept write faults to not
allow page dirtying via mmap - probably somewhere in do_shared_fault() and
do_wp_page(). I've added Jeff to CC since he's dealing with IO error
handling a lot these days. Jeff, what do you think?

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 2/4] iomap: iomap_dio_rw() handles all sync writes

2018-05-03 Thread Jan Kara
On Wed 02-05-18 14:27:37, Robert Dorr wrote:
> In the current implementation the first write to the location updates the
> metadata and must issue the flush.   In Windows SQL Server can avoid this
> behavior.   SQL Server can issue DeviceIoControl with SET_FILE_VALID_DATA
> and then SetEndOfFile.  The SetEndOfFile acquires space and saves
> metadata without requiring the actual write.   This allows us to quickly
> create a large file and the writes do not need the added flush.
> 
> Is this something that fallocate could accommodate to avoid having to
> write once (triggers flush for metadata) and then secondary writes can
> use FUA and avoid the flush?

Well, the question then is what do you see in the file if you read those
blocks before writing them or if the system crashes before writing the
data. As you describe the feature, you'd see there just the old block
contents which is a security issue (you could see for example old
/etc/passwd contents there). That's why we've refused such feature in the
past [1].

Honza

[1] https://www.spinics.net/lists/linux-ext4/msg31637.html

> -Original Message-
> From: Dave Chinner <da...@fromorbit.com> 
> Sent: Tuesday, May 1, 2018 9:46 PM
> To: Jan Kara <j...@suse.cz>
> Cc: linux-...@vger.kernel.org; linux-fsde...@vger.kernel.org; 
> linux-block@vger.kernel.org; h...@lst.de; Robert Dorr <rd...@microsoft.com>
> Subject: Re: [PATCH 2/4] iomap: iomap_dio_rw() handles all sync writes
> 
> On Sat, Apr 21, 2018 at 03:03:09PM +0200, Jan Kara wrote:
> > On Wed 18-04-18 14:08:26, Dave Chinner wrote:
> > > From: Dave Chinner <dchin...@redhat.com>
> > > 
> > > Currently iomap_dio_rw() only handles (data)sync write completions 
> > > for AIO. This means we can't optimised non-AIO IO to minimise device 
> > > flushes as we can't tell the caller whether a flush is required or 
> > > not.
> > > 
> > > To solve this problem and enable further optimisations, make 
> > > iomap_dio_rw responsible for data sync behaviour for all IO, not 
> > > just AIO.
> > > 
> > > In doing so, the sync operation is now accounted as part of the DIO 
> > > IO by inode_dio_end(), hence post-IO data stability updates will no 
> > > long race against operations that serialise via inode_dio_wait() 
> > > such as truncate or hole punch.
> > > 
> > > Signed-Off-By: Dave Chinner <dchin...@redhat.com>
> > > Reviewed-by: Christoph Hellwig <h...@lst.de>
> > 
> > Looks good to me. You can add:
> > 
> > Reviewed-by: Jan Kara <j...@suse.cz>
> 
> It looks good, but it's broken in a subtle, nasty way. :/
> 
> > > @@ -768,14 +776,8 @@ static ssize_t iomap_dio_complete(struct 
> > > iomap_dio *dio)  static void iomap_dio_complete_work(struct 
> > > work_struct *work)  {
> > >   struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
> > > - struct kiocb *iocb = dio->iocb;
> > > - bool is_write = (dio->flags & IOMAP_DIO_WRITE);
> > > - ssize_t ret;
> > >  
> > > - ret = iomap_dio_complete(dio);
> > > - if (is_write && ret > 0)
> > > - ret = generic_write_sync(iocb, ret);
> > > - iocb->ki_complete(iocb, ret, 0);
> > > + dio->iocb->ki_complete(dio->iocb, iomap_dio_complete(dio), 0);
> 
> This generates a use after free from KASAN in generic/016. It appears the
> compiler orders the code so that it dereferences dio->iocb after
> iomap_dio_complete() has freed the dio structure (yay!).
> 
> I'll post a new version of the patchset now that I've got changes to
> 2 of the 3 remaining patches in it.
> 
> Cheers,
> 
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 2/4] iomap: iomap_dio_rw() handles all sync writes

2018-05-03 Thread Jan Kara
On Wed 02-05-18 12:45:40, Dave Chinner wrote:
> On Sat, Apr 21, 2018 at 03:03:09PM +0200, Jan Kara wrote:
> > On Wed 18-04-18 14:08:26, Dave Chinner wrote:
> > > From: Dave Chinner <dchin...@redhat.com>
> > > 
> > > Currently iomap_dio_rw() only handles (data)sync write completions
> > > for AIO. This means we can't optimised non-AIO IO to minimise device
> > > flushes as we can't tell the caller whether a flush is required or
> > > not.
> > > 
> > > To solve this problem and enable further optimisations, make
> > > iomap_dio_rw responsible for data sync behaviour for all IO, not
> > > just AIO.
> > > 
> > > In doing so, the sync operation is now accounted as part of the DIO
> > > IO by inode_dio_end(), hence post-IO data stability updates will no
> > > long race against operations that serialise via inode_dio_wait()
> > > such as truncate or hole punch.
> > > 
> > > Signed-Off-By: Dave Chinner <dchin...@redhat.com>
> > > Reviewed-by: Christoph Hellwig <h...@lst.de>
> > 
> > Looks good to me. You can add:
> > 
> > Reviewed-by: Jan Kara <j...@suse.cz>
> 
> It looks good, but it's broken in a subtle, nasty way. :/
> 
> > > @@ -768,14 +776,8 @@ static ssize_t iomap_dio_complete(struct iomap_dio 
> > > *dio)
> > >  static void iomap_dio_complete_work(struct work_struct *work)
> > >  {
> > >   struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
> > > - struct kiocb *iocb = dio->iocb;
> > > - bool is_write = (dio->flags & IOMAP_DIO_WRITE);
> > > - ssize_t ret;
> > >  
> > > - ret = iomap_dio_complete(dio);
> > > - if (is_write && ret > 0)
> > > - ret = generic_write_sync(iocb, ret);
> > > - iocb->ki_complete(iocb, ret, 0);
> > > + dio->iocb->ki_complete(dio->iocb, iomap_dio_complete(dio), 0);
> 
> This generates a use after free from KASAN in generic/016. It
> appears the compiler orders the code so that it dereferences
> dio->iocb after iomap_dio_complete() has freed the dio structure
> (yay!).

Yeah, very subtle but the compiler is indeed free to do this (in C the
sequence point is only the function call but the order of evaluation of
function arguments is unspecified). Thanks for catching this.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 4/4] iomap: Use FUA for pure data O_DSYNC DIO writes

2018-04-25 Thread Jan Kara
On Wed 25-04-18 00:07:07, Holger Hoffstätte wrote:
> On 04/24/18 19:34, Christoph Hellwig wrote:
> > On Sat, Apr 21, 2018 at 02:54:05PM +0200, Jan Kara wrote:
> > > > -   if (iocb->ki_flags & IOCB_DSYNC)
> > > > +   if (iocb->ki_flags & IOCB_DSYNC) {
> > > > dio->flags |= IOMAP_DIO_NEED_SYNC;
> > > > +   /*
> > > > +* We optimistically try using FUA for this IO. 
> > > >  Any
> > > > +* non-FUA write that occurs will clear this 
> > > > flag, hence
> > > > +* we know before completion whether a cache 
> > > > flush is
> > > > +* necessary.
> > > > +*/
> > > > +   dio->flags |= IOMAP_DIO_WRITE_FUA;
> > > > +   }
> > > 
> > > So I don't think this is quite correct. IOCB_DSYNC gets set also for 
> > > O_SYNC
> > > writes (in that case we also set IOCB_SYNC). And for those we cannot use
> > > the FUA optimization AFAICT (definitely IOMAP_F_DIRTY isn't a safe
> > > indicator of a need of full fsync for O_SYNC). Other than that the patch
> > > looks good to me.
> > 
> > Oops, good catch. I think the above if should just be
> > 
> > if (iocb->ki_flags & (IOCB_DSYNC | IOCB_SYNC) == IOCB_DSYNC)) {
> > 
> > and we are fine.
> 
> The above line just gives parenthesis salad errors, so why not compromise
> on:
> 
>   if ((iocb->ki_flags & (IOCB_DSYNC | IOCB_SYNC)) == IOCB_DSYNC) {
> 
> Unless my bit twiddling has completely left me I think this is what was
> intended, and it actually compiles too.

Yup, I agree this is what needs to happen.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 03/11] fs: add frozen sb state helpers

2018-04-21 Thread Jan Kara
On Fri 20-04-18 11:49:32, Luis R. Rodriguez wrote:
> On Tue, Apr 17, 2018 at 05:59:36PM -0700, Luis R. Rodriguez wrote:
> > On Thu, Dec 21, 2017 at 12:03:29PM +0100, Jan Kara wrote:
> > > Hello,
> > > 
> > > I think I owe you a reply here... Sorry that it took so long.
> > 
> > Took me just as long :)
> > 
> > > On Fri 01-12-17 22:13:27, Luis R. Rodriguez wrote:
> > > > 
> > > > I'll note that its still not perfectly clear if really the semantics 
> > > > behind
> > > > freeze_bdev() match what I described above fully. That still needs to be
> > > > vetted for. For instance, does thaw_bdev() keep a superblock frozen if 
> > > > we
> > > > an ioctl initiated freeze had occurred before? If so then great. 
> > > > Otherwise
> > > > I think we'll need to distinguish the ioctl interface. Worst possible 
> > > > case
> > > > is that bdev semantics and in-kernel semantics differ somehow, then that
> > > > will really create a holy fucking mess.
> > > 
> > > I believe nobody really thought about mixing those two interfaces to fs
> > > freezing and so the behavior is basically defined by the implementation.
> > > That is:
> > > 
> > > freeze_bdev() on sb frozen by ioctl_fsfreeze() -> EBUSY
> > > freeze_bdev() on sb frozen by freeze_bdev() -> success
> > > ioctl_fsfreeze() on sb frozen by freeze_bdev() -> EBUSY
> > > ioctl_fsfreeze() on sb frozen by ioctl_fsfreeze() -> EBUSY
> > > 
> > > thaw_bdev() on sb frozen by ioctl_fsfreeze() -> EINVAL
> > 
> > Phew, so this is what we want for the in-kernel freezing so we're good
> > and *can* combine these then.
> 
> I double checked, and I don't see where you get EINVAL for this case.
> We *do* keep the sb frozen though, which is good, and the worst fear
> I had was that we did not. However we return 0 if there was already
> a prior freeze_bdev() or ioctl_fsfreeze() other than the context that
> started the prior freeze (--bdev->bd_fsfreeze_count > 0).
> 
> The -EINVAL is only returned currently if there were no freezers.
> 
> int thaw_bdev(struct block_device *bdev, struct super_block *sb)
> {
>   int error = -EINVAL;
> 
>   mutex_lock(&bdev->bd_fsfreeze_mutex);
>   if (!bdev->bd_fsfreeze_count)
>   goto out;

But this is precisely where we'd bail if we freeze sb by ioctl_fsfreeze()
but try to thaw by thaw_bdev(). ioctl_fsfreeze() does not touch
bd_fsfreeze_count...

Honza

-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 2/4] iomap: iomap_dio_rw() handles all sync writes

2018-04-21 Thread Jan Kara
On Wed 18-04-18 14:08:26, Dave Chinner wrote:
> From: Dave Chinner <dchin...@redhat.com>
> 
> Currently iomap_dio_rw() only handles (data)sync write completions
> for AIO. This means we can't optimised non-AIO IO to minimise device
> flushes as we can't tell the caller whether a flush is required or
> not.
> 
> To solve this problem and enable further optimisations, make
> iomap_dio_rw responsible for data sync behaviour for all IO, not
> just AIO.
> 
> In doing so, the sync operation is now accounted as part of the DIO
> IO by inode_dio_end(), hence post-IO data stability updates will no
> long race against operations that serialise via inode_dio_wait()
> such as truncate or hole punch.
> 
> Signed-Off-By: Dave Chinner <dchin...@redhat.com>
> Reviewed-by: Christoph Hellwig <h...@lst.de>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/iomap.c| 22 +++---
>  fs/xfs/xfs_file.c |  5 -
>  2 files changed, 15 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index afd163586aa0..1f59c2d9ade6 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -685,6 +685,7 @@ EXPORT_SYMBOL_GPL(iomap_seek_data);
>   * Private flags for iomap_dio, must not overlap with the public ones in
>   * iomap.h:
>   */
> +#define IOMAP_DIO_NEED_SYNC  (1 << 29)
>  #define IOMAP_DIO_WRITE  (1 << 30)
>  #define IOMAP_DIO_DIRTY  (1 << 31)
>  
> @@ -759,6 +760,13 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>   dio_warn_stale_pagecache(iocb->ki_filp);
>   }
>  
> + /*
> +  * If this is a DSYNC write, make sure we push it to stable storage now
> +  * that we've written data.
> +  */
> + if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
> + ret = generic_write_sync(iocb, ret);
> +
>   inode_dio_end(file_inode(iocb->ki_filp));
>   kfree(dio);
>  
> @@ -768,14 +776,8 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  static void iomap_dio_complete_work(struct work_struct *work)
>  {
>   struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
> - struct kiocb *iocb = dio->iocb;
> - bool is_write = (dio->flags & IOMAP_DIO_WRITE);
> - ssize_t ret;
>  
> - ret = iomap_dio_complete(dio);
> - if (is_write && ret > 0)
> - ret = generic_write_sync(iocb, ret);
> - iocb->ki_complete(iocb, ret, 0);
> + dio->iocb->ki_complete(dio->iocb, iomap_dio_complete(dio), 0);
>  }
>  
>  /*
> @@ -961,6 +963,10 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t 
> length,
>   return copied;
>  }
>  
> +/*
> + * iomap_dio_rw() always completes O_[D]SYNC writes regardless of whether 
> the IO
> + * is being issued as AIO or not.
> + */
>  ssize_t
>  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   const struct iomap_ops *ops, iomap_dio_end_io_t end_io)
> @@ -1006,6 +1012,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   dio->flags |= IOMAP_DIO_DIRTY;
>   } else {
>   dio->flags |= IOMAP_DIO_WRITE;
> + if (iocb->ki_flags & IOCB_DSYNC)
> + dio->flags |= IOMAP_DIO_NEED_SYNC;
>   flags |= IOMAP_WRITE;
>   }
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 6f15027661b6..0c4b8313d544 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -570,11 +570,6 @@ xfs_file_dio_aio_write(
>* complete fully or fail.
>*/
>   ASSERT(ret < 0 || ret == count);
> -
> - if (ret > 0) {
> - /* Handle various SYNC-type writes */
> - ret = generic_write_sync(iocb, ret);
> - }
>   return ret;
>  }
>  
> -- 
> 2.16.1
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR

