[PATCH 11/11] block: Fix oops scsi_disk_get()

2017-03-13 Thread Jan Kara
When device open races with device shutdown, we can get the following
oops in scsi_disk_get():

[11863.044351] general protection fault:  [#1] SMP
[11863.045561] Modules linked in: scsi_debug xfs libcrc32c netconsole btrfs 
raid6_pq zlib_deflate lzo_compress xor [last unloaded: loop]
[11863.047853] CPU: 3 PID: 13042 Comm: hald-probe-stor Tainted: G W  
4.10.0-rc2-xen+ #35
[11863.048030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[11863.048030] task: 88007f438200 task.stack: c9fd
[11863.048030] RIP: 0010:scsi_disk_get+0x43/0x70
[11863.048030] RSP: 0018:c9fd3a08 EFLAGS: 00010202
[11863.048030] RAX: 6b6b6b6b6b6b6b6b RBX: 88007f56d000 RCX: 
[11863.048030] RDX: 0001 RSI: 0004 RDI: 81a8d880
[11863.048030] RBP: c9fd3a18 R08:  R09: 0001
[11863.059217] R10:  R11:  R12: fffa
[11863.059217] R13: 880078872800 R14: 880070915540 R15: 001d
[11863.059217] FS:  7f2611f71800() GS:88007f0c() 
knlGS:
[11863.059217] CS:  0010 DS:  ES:  CR0: 80050033
[11863.059217] CR2: 0060e048 CR3: 778d4000 CR4: 06e0
[11863.059217] Call Trace:
[11863.059217]  ? disk_get_part+0x22/0x1f0
[11863.059217]  sd_open+0x39/0x130
[11863.059217]  __blkdev_get+0x69/0x430
[11863.059217]  ? bd_acquire+0x7f/0xc0
[11863.059217]  ? bd_acquire+0x96/0xc0
[11863.059217]  ? blkdev_get+0x350/0x350
[11863.059217]  blkdev_get+0x126/0x350
[11863.059217]  ? _raw_spin_unlock+0x2b/0x40
[11863.059217]  ? bd_acquire+0x7f/0xc0
[11863.059217]  ? blkdev_get+0x350/0x350
[11863.059217]  blkdev_open+0x65/0x80
...

As you can see, the RAX value is already poisoned, showing that the gendisk
we got has already been freed. The problem is that get_gendisk() looks up
the device number in ext_devt_idr and then calls get_disk(), which does
kobject_get() on the disk's kobject. However, the disk gets removed from
ext_devt_idr only in disk_release() (through blk_free_devt()), at which
point it already has a zero refcount and is on its way to being freed.
Indeed, we got a warning from kobject_get() about the zero refcount shortly
before the oops.

We fix the problem by using kobject_get_unless_zero() in get_disk() so
that get_disk() cannot get reference on a disk that is already being
freed.

Tested-by: Lekshmi Pillai <lekshmicpil...@in.ibm.com>
Reviewed-by: Bart Van Assche <bart.vanass...@sandisk.com>
Acked-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/genhd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index a9c516a8b37d..510aac1486cb 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1352,7 +1352,7 @@ struct kobject *get_disk(struct gendisk *disk)
 	owner = disk->fops->owner;
 	if (owner && !try_module_get(owner))
 		return NULL;
-	kobj = kobject_get(&disk_to_dev(disk)->kobj);
+	kobj = kobject_get_unless_zero(&disk_to_dev(disk)->kobj);
 	if (kobj == NULL) {
 		module_put(owner);
 		return NULL;
-- 
2.10.2



[PATCH 02/11] block: Fix race of bdev open with gendisk shutdown

2017-03-13 Thread Jan Kara
blkdev_open() may race with gendisk shutdown in two different ways. Either
del_gendisk() has already unhashed the block device inode (and thus
bd_acquire() will end up creating a new block device inode) but
get_gendisk() will still return the gendisk that is being destroyed. Or the
bdev returned by bd_acquire() will get unhashed and the gendisk destroyed
before we get to get_gendisk(), and get_gendisk() will return a new gendisk
that got allocated for a device that reused the device number.

In both cases this will result in possible inconsistencies between
bdev->bd_disk and bdev->bd_bdi (in the first case after the gendisk gets
destroyed and the device number reused, in the second case immediately).

Fix the problem by checking whether the gendisk is still alive and the
inode still hashed when associating the bdev inode with the gendisk and its
bdi. That way we are sure that we will not associate a bdev inode with a
disk that got past blk_unregister_region() in del_gendisk() (and whose
device number can thus get reused). Similarly, we will not associate a bdev
that was once associated with a gendisk that is going away (and whose bdev
inode will thus get unhashed in del_gendisk()) with a new gendisk that is
just reusing the device number.

Also add a warning that will tell us about unexpected inconsistencies
between the bdi associated with the bdev inode and the bdi associated with
the disk.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/block_dev.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 53e2389ae4d4..5ec8750f5332 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1560,7 +1560,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 		if (!partno) {
 			ret = -ENXIO;
 			bdev->bd_part = disk_get_part(disk, partno);
-			if (!bdev->bd_part)
+			if (!(disk->flags & GENHD_FL_UP) || !bdev->bd_part ||
+			    inode_unhashed(bdev->bd_inode))
 				goto out_clear;
 
 			ret = 0;
@@ -1614,7 +1615,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 			bdev->bd_contains = whole;
 			bdev->bd_part = disk_get_part(disk, partno);
 			if (!(disk->flags & GENHD_FL_UP) ||
-			    !bdev->bd_part || !bdev->bd_part->nr_sects) {
+			    !bdev->bd_part || !bdev->bd_part->nr_sects ||
+			    inode_unhashed(bdev->bd_inode)) {
 				ret = -ENXIO;
 				goto out_clear;
 			}
@@ -1623,6 +1625,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 
 		if (bdev->bd_bdi == &noop_backing_dev_info)
 			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
+		else
+			WARN_ON_ONCE(bdev->bd_bdi !=
+				     disk->queue->backing_dev_info);
 	} else {
 		if (bdev->bd_contains == bdev) {
 			ret = 0;
-- 
2.10.2



[PATCH 06/11] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()

2017-03-13 Thread Jan Kara
Currently we wait for all cgwbs to get freed in cgwb_bdi_destroy(), which
also means that writeback has been shut down on them. Since this wait is
going away, directly shut down writeback on cgwbs from cgwb_bdi_destroy()
to avoid live writeback structures after bdi_unregister() has finished. To
make that safe against concurrent shutdown from cgwb_release_workfn(), we
also have to make sure wb_shutdown() returns only after the bdi_writeback
structure is really shut down.

Acked-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 include/linux/backing-dev-defs.h |  1 +
 mm/backing-dev.c | 22 ++
 2 files changed, 23 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8fb3dcdebc80..8af720f22a2d 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -21,6 +21,7 @@ struct dentry;
  */
 enum wb_state {
WB_registered,  /* bdi_register() was done */
+   WB_shutting_down,   /* wb_shutdown() in progress */
WB_writeback_running,   /* Writeback is in progress */
WB_has_dirty_io,/* Dirty inodes on ->b_{dirty|io|more_io} */
 };
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e3d56dba4da8..b67be4fc12c4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -356,8 +356,15 @@ static void wb_shutdown(struct bdi_writeback *wb)
spin_lock_bh(>work_lock);
if (!test_and_clear_bit(WB_registered, >state)) {
spin_unlock_bh(>work_lock);
+   /*
+* Wait for wb shutdown to finish if someone else is just
+* running wb_shutdown(). Otherwise we could proceed to wb /
+* bdi destruction before wb_shutdown() is finished.
+*/
+   wait_on_bit(>state, WB_shutting_down, TASK_UNINTERRUPTIBLE);
return;
}
+   set_bit(WB_shutting_down, >state);
spin_unlock_bh(>work_lock);
 
cgwb_remove_from_bdi_list(wb);
@@ -369,6 +376,12 @@ static void wb_shutdown(struct bdi_writeback *wb)
mod_delayed_work(bdi_wq, >dwork, 0);
flush_delayed_work(>dwork);
WARN_ON(!list_empty(>work_list));
+   /*
+* Make sure bit gets cleared after shutdown is finished. Matches with
+* the barrier provided by test_and_clear_bit() above.
+*/
+   smp_wmb();
+   clear_bit(WB_shutting_down, >state);
 }
 
 static void wb_exit(struct bdi_writeback *wb)
@@ -699,12 +712,21 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
 {
 	struct radix_tree_iter iter;
 	void **slot;
+	struct bdi_writeback *wb;
 
 	WARN_ON(test_bit(WB_registered, &bdi->wb.state));
 
 	spin_lock_irq(&cgwb_lock);
 	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
 		cgwb_kill(*slot);
+
+	while (!list_empty(&bdi->wb_list)) {
+		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
+				      bdi_node);
+		spin_unlock_irq(&cgwb_lock);
+		wb_shutdown(wb);
+		spin_lock_irq(&cgwb_lock);
+	}
 	spin_unlock_irq(&cgwb_lock);
 
 	/*
-- 
2.10.2



[PATCH 0/11 v4] block: Fix block device shutdown related races

2017-03-13 Thread Jan Kara
Hello,

this is a series with the remaining patches (on top of 4.11-rc2) to fix
several different races and issues I've found when testing device shutdown
and reuse. The first two patches fix possible (theoretical) problems when
opening of a block device races with shutdown of a gendisk structure.
Patches 3-9 fix an oops that is triggered by __blkdev_put() calling
inode_detach_wb() too early (the problem reported by Thiago). Patches 10
and 11 fix an oops due to a bug in gendisk code where get_gendisk() can
return an already freed gendisk structure (again triggered by Omar's stress
test).

Please have a look at the patches. They are mostly simple, however the
interactions are rather complex, so I may have missed something. I'm also
happy for any additional testing these patches can get - I've stressed them
with Omar's script, tested memcg writeback, and tested static (not
udev-managed) device inodes.

Changes since v3:
* Rebased on top of 4.11-rc2
* Reworked patch 2 (block: Fix race of bdev open with gendisk shutdown) based
  on Tejun's feedback
* Significantly updated patch 5 (and dropped previous Tejun's ack) to
  accommodate for fixes to SCSI re-registration of BDI that went to 4.11-rc2

Changes since v2:
* Added Tejun's acks
* Rebased on top of 4.11-rc1
* Fixed two possible races between blkdev_open() and del_gendisk()
* Fixed possible race between concurrent shutdown of cgwb spotted by Tejun

Changes since v1:
* Added Acks and Tested-by tags for patches in areas that did not change
* Reworked inode_detach_wb() related fixes based on Tejun's feedback

Honza


[PATCH 01/11] block: Fix bdi assignment to bdev inode when racing with disk delete

2017-03-13 Thread Jan Kara
When disk->fops->open() in __blkdev_get() returns -ERESTARTSYS, we restart
the process of opening the block device. However, we forget to switch
bdev->bd_bdi back to noop_backing_dev_info, and as a result the bdev inode
will point to a stale bdi. Fix the problem by setting bdev->bd_bdi later,
once __blkdev_get() is already guaranteed to succeed.

Acked-by: Tejun Heo <t...@kernel.org>
Reviewed-by: Hannes Reinecke <h...@suse.com>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/block_dev.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00ec4370..53e2389ae4d4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1556,8 +1556,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 		bdev->bd_disk = disk;
 		bdev->bd_queue = disk->queue;
 		bdev->bd_contains = bdev;
-		if (bdev->bd_bdi == &noop_backing_dev_info)
-			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
 
 		if (!partno) {
 			ret = -ENXIO;
@@ -1622,6 +1620,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 			}
 			bd_set_size(bdev, (loff_t)bdev->bd_part->nr_sects << 9);
 		}
+
+		if (bdev->bd_bdi == &noop_backing_dev_info)
+			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
 	} else {
 		if (bdev->bd_contains == bdev) {
 			ret = 0;
@@ -1653,8 +1654,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 	bdev->bd_disk = NULL;
 	bdev->bd_part = NULL;
 	bdev->bd_queue = NULL;
-	bdi_put(bdev->bd_bdi);
-	bdev->bd_bdi = &noop_backing_dev_info;
 	if (bdev != bdev->bd_contains)
 		__blkdev_put(bdev->bd_contains, mode, 1);
 	bdev->bd_contains = NULL;
-- 
2.10.2



[PATCH 08/11] bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister()

2017-03-13 Thread Jan Kara
Rename cgwb_bdi_destroy() to cgwb_bdi_unregister(), as it gets called from
bdi_unregister(), which is not necessarily called from bdi_destroy(), and
thus the name is somewhat misleading.

Acked-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 mm/backing-dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8c30b1a7aae5..3ea3bbd921d6 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -700,7 +700,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
return ret;
 }
 
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 {
struct radix_tree_iter iter;
void **slot;
@@ -801,7 +801,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
return 0;
 }
 
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi) { }
 
 static void cgwb_bdi_exit(struct backing_dev_info *bdi)
 {
@@ -925,7 +925,7 @@ void bdi_unregister(struct backing_dev_info *bdi)
 	/* make sure nobody finds us on the bdi_list anymore */
 	bdi_remove_from_list(bdi);
 	wb_shutdown(&bdi->wb);
-	cgwb_bdi_destroy(bdi);
+	cgwb_bdi_unregister(bdi);
 
 	if (bdi->dev) {
 		bdi_debug_unregister(bdi);
-- 
2.10.2



[PATCH 04/11] bdi: Make wb->bdi a proper reference

2017-03-13 Thread Jan Kara
Make wb->bdi a proper refcounted reference to bdi for all bdi_writeback
structures except for the one embedded inside struct backing_dev_info.
That will allow us to simplify bdi unregistration.

Acked-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 mm/backing-dev.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 12408f86783c..03d4ba27c133 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -294,6 +294,8 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 
 	memset(wb, 0, sizeof(*wb));
 
+	if (wb != &bdi->wb)
+		bdi_get(bdi);
 	wb->bdi = bdi;
 	wb->last_old_flush = jiffies;
 	INIT_LIST_HEAD(&wb->b_dirty);
@@ -314,8 +316,10 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	wb->dirty_sleep = jiffies;
 
 	wb->congested = wb_congested_get_create(bdi, blkcg_id, gfp);
-	if (!wb->congested)
-		return -ENOMEM;
+	if (!wb->congested) {
+		err = -ENOMEM;
+		goto out_put_bdi;
+	}
 
 	err = fprop_local_init_percpu(&wb->completions, gfp);
 	if (err)
@@ -335,6 +339,9 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	fprop_local_destroy_percpu(&wb->completions);
 out_put_cong:
 	wb_congested_put(wb->congested);
+out_put_bdi:
+	if (wb != &bdi->wb)
+		bdi_put(bdi);
 	return err;
 }
 
@@ -372,6 +379,8 @@ static void wb_exit(struct bdi_writeback *wb)
 
 	fprop_local_destroy_percpu(&wb->completions);
 	wb_congested_put(wb->congested);
+	if (wb != &wb->bdi->wb)
+		bdi_put(wb->bdi);
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
-- 
2.10.2



[PATCH 10/11] kobject: Export kobject_get_unless_zero()

2017-03-13 Thread Jan Kara
Make the function available for outside use and fortify it against a NULL
kobject pointer.

CC: Greg Kroah-Hartman <gre...@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bart.vanass...@sandisk.com>
Acked-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 include/linux/kobject.h | 2 ++
 lib/kobject.c   | 5 -
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index e6284591599e..ca85cb80e99a 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -108,6 +108,8 @@ extern int __must_check kobject_rename(struct kobject *, const char *new_name);
 extern int __must_check kobject_move(struct kobject *, struct kobject *);
 
 extern struct kobject *kobject_get(struct kobject *kobj);
+extern struct kobject * __must_check kobject_get_unless_zero(
+						struct kobject *kobj);
 extern void kobject_put(struct kobject *kobj);
 
 extern const void *kobject_namespace(struct kobject *kobj);
diff --git a/lib/kobject.c b/lib/kobject.c
index 445dcaeb0f56..763d70a18941 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -601,12 +601,15 @@ struct kobject *kobject_get(struct kobject *kobj)
 }
 EXPORT_SYMBOL(kobject_get);
 
-static struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
+struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
 {
+	if (!kobj)
+		return NULL;
 	if (!kref_get_unless_zero(&kobj->kref))
 		kobj = NULL;
 	return kobj;
 }
+EXPORT_SYMBOL(kobject_get_unless_zero);
 
 /*
  * kobject_cleanup - free kobject resources.
-- 
2.10.2



[PATCH 03/11] bdi: Mark congested->bdi as internal

2017-03-13 Thread Jan Kara
The congested->bdi pointer is used only to remove the congested structure
from bdi->cgwb_congested_tree on structure release. Moreover, the pointer
can become NULL when we unregister the bdi. Rename the field to __bdi and
add a comment to make it explicit that this is internal to the memcg
writeback code and that any other use would likely be race prone.

We do not bother with converting congested->bdi to a proper refcounted
reference. It would be slightly ugly to special-case bdi->wb.congested to
avoid what is effectively a cyclic reference of the bdi to itself, and the
reference gets cleared from bdi_unregister() anyway, making it impossible
to reference a freed bdi.

Acked-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 include/linux/backing-dev-defs.h |  4 +++-
 mm/backing-dev.c | 10 +-
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index ad955817916d..8fb3dcdebc80 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -54,7 +54,9 @@ struct bdi_writeback_congested {
 	atomic_t refcnt;		/* nr of attached wb's and blkg */
 
 #ifdef CONFIG_CGROUP_WRITEBACK
-	struct backing_dev_info *bdi;	/* the associated bdi */
+	struct backing_dev_info *__bdi;	/* the associated bdi, set to NULL
+					 * on bdi unregistration. For memcg-wb
+					 * internal use only! */
 	int blkcg_id;			/* ID of the associated blkcg */
 	struct rb_node rb_node;		/* on bdi->cgwb_congestion_tree */
 #endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c6f2a37028c2..12408f86783c 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -438,7 +438,7 @@ wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp)
 		return NULL;
 
 	atomic_set(&new_congested->refcnt, 0);
-	new_congested->bdi = bdi;
+	new_congested->__bdi = bdi;
 	new_congested->blkcg_id = blkcg_id;
 	goto retry;
 
@@ -466,10 +466,10 @@ void wb_congested_put(struct bdi_writeback_congested *congested)
 	}
 
 	/* bdi might already have been destroyed leaving @congested unlinked */
-	if (congested->bdi) {
+	if (congested->__bdi) {
 		rb_erase(&congested->rb_node,
-			 &congested->bdi->cgwb_congested_tree);
-		congested->bdi = NULL;
+			 &congested->__bdi->cgwb_congested_tree);
+		congested->__bdi = NULL;
 	}
 
 	spin_unlock_irqrestore(&cgwb_lock, flags);
@@ -752,7 +752,7 @@ static void cgwb_bdi_exit(struct backing_dev_info *bdi)
 			rb_entry(rbn, struct bdi_writeback_congested, rb_node);
 
 		rb_erase(rbn, &bdi->cgwb_congested_tree);
-		congested->bdi = NULL;	/* mark @congested unlinked */
+		congested->__bdi = NULL;	/* mark @congested unlinked */
 	}
 	spin_unlock_irq(&cgwb_lock);
 }
-- 
2.10.2



[PATCH] axonram: Fix gendisk handling

2017-03-08 Thread Jan Kara
It is invalid to call del_gendisk() when disk->queue is NULL. Fix error
handling in axon_ram_probe() to avoid doing that.

Also, del_gendisk() does not drop the reference to the gendisk allocated by
alloc_disk(). That has to be done by put_disk(). Add that call where
needed.

Reported-by: Dan Carpenter <dan.carpen...@oracle.com>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 arch/powerpc/sysdev/axonram.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

Warning: The patch is not tested in any way. I just based the fix on Smatch
warning and how things should be...

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index ada29eaed6e2..f523ac883150 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -274,7 +274,9 @@ static int axon_ram_probe(struct platform_device *device)
 			if (bank->disk->major > 0)
 				unregister_blkdev(bank->disk->major,
 						bank->disk->disk_name);
-			del_gendisk(bank->disk);
+			if (bank->disk->flags & GENHD_FL_UP)
+				del_gendisk(bank->disk);
+			put_disk(bank->disk);
 		}
 		device->dev.platform_data = NULL;
 		if (bank->io_addr != 0)
@@ -299,6 +301,7 @@ axon_ram_remove(struct platform_device *device)
 	device_remove_file(&device->dev, &dev_attr_ecc);
 	free_irq(bank->irq_id, device);
 	del_gendisk(bank->disk);
+	put_disk(bank->disk);
 	iounmap((void __iomem *) bank->io_addr);
 	kfree(bank);
 
-- 
2.10.2



Re: [PATCH 0/13 v2] block: Fix block device shutdown related races

2017-03-08 Thread Jan Kara
On Tue 07-03-17 08:28:23, Omar Sandoval wrote:
> On Tue, Mar 07, 2017 at 02:57:30PM +0100, Jan Kara wrote:
> > On Mon 06-03-17 12:38:18, Omar Sandoval wrote:
> > > Unfortunately, this patch actually seems to have introduced a
> > > regression. Reverting the patch fixes it.
> > > 
> > > I'm running a QEMU kvm vm with a virtio-scsi device and I get oopses
> > > like this:
> > 
> > What workload do you run? Or is it enough to just boot the kvm with
> > virtio-scsi enabled? Can you send me the kvm setup so that I can reproduce
> > this?
> 
> This is just while booting. In fact, you don't even need a virtio-scsi
> disk attached, it's enough to have the virtio-scsi-pci controller. This
> is my exact command line:
> 
> qemu-system-x86_64 -nodefaults -nographic -serial mon:stdio -cpu kvm64 \
>   -enable-kvm -smp 1 -m 2G -watchdog i6300esb \
>   -netdev user,id=vlan0,hostfwd=tcp:127.0.0.1:-:22 \
>   -device virtio-net,netdev=vlan0 \
>   -drive file=Silver/Silver.qcow2,index=0,media=disk,if=virtio,cache=none 
> \
>   -device virtio-scsi-pci \
>   -kernel 
> /home/osandov/linux/builds/virtio-scsi-crash/arch/x86/boot/bzImage \
>   -virtfs 
> local,path=/home/osandov/linux/builds/virtio-scsi-crash,security_model=none,readonly,mount_tag=modules
>  \
>   -append 'root=/dev/vda1 console=ttyS0,115200 scsi_mod.use_blk_mq=y'
> 
> My kernel config is here:
> https://github.com/osandov/osandov-linux/blob/master/configs/qemu.config

Thanks. I was able to reproduce. Do attached two patches help?

        Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
From 60623ea9de6309f7717fc3536e3594b079735af0 Mon Sep 17 00:00:00 2001
From: Jan Kara <j...@suse.cz>
Date: Tue, 7 Mar 2017 18:01:51 +0100
Subject: [PATCH 1/2] block: Allow bdi re-registration

SCSI can call device_add_disk() several times for one request queue when a
device is unbound and bound again, creating a new gendisk each time. This
will lead to the bdi being repeatedly registered and unregistered. This was
not a big problem until commit 165a5e22fafb "block: Move bdi_unregister()
to del_gendisk()", since the bdi was only registered repeatedly
(bdi_register() handles repeated calls fine, though we ended up leaking a
reference to the gendisk due to overwriting bdi->owner) but unregistered
only in blk_cleanup_queue(), which didn't get called repeatedly. After
165a5e22fafb we were doing correct bdi_register() - bdi_unregister()
cycles, however bdi_unregister() is not prepared for that. So make sure
bdi_unregister() cleans up the bdi in such a way that it is prepared for a
possible following bdi_register() call.

An easy way to provoke this behavior is to enable
CONFIG_DEBUG_TEST_DRIVER_REMOVE and use scsi_debug driver to create a
scsi disk which immediately hangs without this fix.

Fixes: 165a5e22fafb127ecb5914e12e8c32a1f0d3f820
Signed-off-by: Jan Kara <j...@suse.cz>
---
 mm/backing-dev.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6d861d090e9f..6ac932210f56 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -710,6 +710,11 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
 	 */
 	atomic_dec(&bdi->usage_cnt);
 	wait_event(cgwb_release_wait, !atomic_read(&bdi->usage_cnt));
+	/*
+	 * Grab back our reference so that we hold it when @bdi gets
+	 * re-registered.
+	 */
+	atomic_inc(&bdi->usage_cnt);
 }
 
 /**
@@ -857,6 +862,8 @@ int bdi_register_owner(struct backing_dev_info *bdi, struct device *owner)
 			MINOR(owner->devt));
 	if (rc)
 		return rc;
+	/* Leaking owner reference... */
+	WARN_ON(bdi->owner);
 	bdi->owner = owner;
 	get_device(owner);
 	return 0;
-- 
2.10.2

From b9a6424cabd0eb4a663a1faeb11afef268faebc6 Mon Sep 17 00:00:00 2001
From: Jan Kara <j...@suse.cz>
Date: Wed, 8 Mar 2017 13:24:47 +0100
Subject: [PATCH 2/2] bdi: Fix use-after-free in wb_congested_put()

bdi_writeback_congested structures get created for each blkcg and bdi
regardless of whether the bdi is registered or not. When they are created
for an unregistered bdi and the request queue (and thus the bdi) is then
destroyed while a blkg still holds a reference to the
bdi_writeback_congested structure, this structure ends up referencing the
freed bdi and the last wb_congested_put() will try to remove the structure
from the already-freed bdi.

With commit 165a5e22fafb "block: Move bdi_unregister() to
del_gendisk()", SCSI started to destroy bdis without calling
bdi_unregister() first (previously it was calling bdi_unregister() even
for unregistered bdis) and thus the code detaching
bdi_writeback_congested in cgwb_bdi_destroy() was not triggered and we
started hitting this use-after-free bug. It is enough to boot a KVM
instance with virtio-scsi device to trigger this behavior.

Fix the p

Re: [bdi_unregister] 165a5e22fa INFO: task swapper:1 blocked for more than 120 seconds.

2017-03-06 Thread Jan Kara
On Mon 06-03-17 06:35:21, James Bottomley wrote:
> On Mon, 2017-03-06 at 13:01 +0100, Jan Kara wrote:
> > On Mon 06-03-17 11:27:33, Jan Kara wrote:
> > > Hi,
> > > 
> > > On Sun 05-03-17 10:21:11, Wu Fengguang wrote:
> > > > FYI next-20170303 is good while mainline is bad with this error.
> > > > The attached reproduce-* may help reproduce the issue.
> > > 
> > > Thanks for report! So from the stacktrace we are in the path 
> > > testing removal of a device immediately after it has been probed 
> > > and for some reason bdi_unregister() hangs - likely waiting for 
> > > cgroup-writeback references to drop. Given how early this happens 
> > > my guess is we fail to initialize something but for now I don't see 
> > > how my patch could make a difference. I'm trying to reproduce this 
> > > to be able to debug more...
> > 
> > OK, so after some debugging I think this is yet another problem in 
> > SCSI initialization / destruction code which my patch only makes 
> > visible (added relevant maintainers).
> > 
> > I can reproduce the problem reliably with enabling:
> > 
> > CONFIG_DEBUG_TEST_DRIVER_REMOVE=y
> > CONFIG_SCSI_DEBUG=m
> > CONFIG_BLK_CGROUP=y
> > CONFIG_MEMCG=y
> > (and thus CONFIG_CGROUP_WRITEBACK=y)
> > 
> > then 'modprobe scsi_debug' is all it takes to reproduce hang. 
> > Relevant kernel messages with some of my debugging added (attached is 
> > a patch that adds those debug messages):
> 
> This looks to be precisely the same problem Dan Williams was debugging
> for us.
> 
> > [   58.721765] scsi host0: scsi_debug: version 1.86 [20160430]
> > [   58.721765]   dev_size_mb=8, opts=0x0, submit_queues=1,
> > statistics=0
> > [   58.728946] CGWB init 88007fbb2000
> > [   58.730095] Created sdev 880078e1a000
> > [   58.731611] scsi 0:0:0:0: Direct-Access Linuxscsi_debug
> > 0186 PQ : 0 ANSI: 7
> > [   58.782246] sd 0:0:0:0: [sda] 16384 512-byte logical blocks: (8.39
> > MB/8.00 MiB)
> > [   58.789687] sd 0:0:0:0: [sda] Write Protect is off
> > [   58.791140] sd 0:0:0:0: [sda] Mode Sense: 73 00 10 08
> > [   58.800879] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
> > enabled, supports DPO and FUA
> > [   58.893738] sd 0:0:0:0: [sda] Attached SCSI disk
> > [   58.896808] Unreg1
> > [   58.897960] Unreg2
> > [   58.898637] Unreg3
> > [   58.899100] CGWB 88007fbb2000 usage_cnt: 0
> > [   58.94] Unreg4
> > [   58.904976] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> 
> OK, can you put a WARN_ON trace in sd_shutdown and tell us where this
> is coming from.  For the device to be reused after this we have to be
> calling sd_shutdown() without going into SDEV_DEL.

Sure. The call trace is:

[   41.919244] [ cut here ]
[   41.919263] WARNING: CPU: 4 PID: 2335 at drivers/scsi/sd.c:3332 
sd_shutdown+0x2f/0x100
[   41.919268] Modules linked in: scsi_debug(+) netconsole loop btrfs raid6_pq 
zlib_deflate lzo_compress xor
[   41.919319] CPU: 4 PID: 2335 Comm: modprobe Not tainted 4.11.0-rc1-xen+ #49
[   41.919325] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   41.919331] Call Trace:
[   41.919343]  dump_stack+0x8e/0xf0
[   41.919354]  __warn+0x116/0x120
[   41.919361]  warn_slowpath_null+0x1d/0x20
[   41.919368]  sd_shutdown+0x2f/0x100
[   41.919374]  sd_remove+0x70/0xd0

*** Here is the unexpected step I guess...

[   41.919383]  driver_probe_device+0xe0/0x4c0
[   41.919395]  ? kobject_uevent_env+0x1bc/0x660
[   41.919403]  __device_attach_driver+0xfa/0x120
[   41.919410]  ? __driver_attach+0x110/0x110
[   41.919417]  bus_for_each_drv+0x68/0x90
[   41.919424]  __device_attach+0xb2/0x110
[   41.919432]  device_initial_probe+0x13/0x20
[   41.919439]  bus_probe_device+0xa8/0xc0
[   41.919446]  device_add+0x297/0x600
[   41.919453]  ? call_rcu_bh+0x20/0x20
[   41.919463]  ? mutex_unlock+0x12/0x20
[   41.919473]  scsi_sysfs_add_sdev+0x69/0x210
[   41.919480]  scsi_probe_and_add_lun+0xcfd/0xd10
[   41.919489]  ? find_next_zero_bit+0x10/0x20
[   41.919496]  __scsi_scan_target+0xd2/0x4c0
[   41.919506]  ? proc_register+0x41/0x130
[   41.919512]  ? proc_register+0x11c/0x130
[   41.919520]  ? trace_hardirqs_on+0xd/0x10
[   41.919527]  ? _raw_spin_unlock_irq+0x30/0x50
[   41.919534]  scsi_scan_channel+0x5f/0xa0
[   41.919541]  scsi_scan_host_selected+0xb9/0x120
[   41.919549]  do_scsi_scan_host+0x81/0x90
[   41.919555]  scsi_scan_host+0x17a/0x1b0
[   41.919568]  ? scsi_add_host_with_dma+0x302/0x310
[   41.919593]  sdebug_driver_probe+0x204/0x320 [scsi_debug]
[   41.919601]  ? driver_sysfs_add+0x90/0xb0
[   41.919608]  driver_probe_device+0xb3/0x4c0
[   41.919615]  ? kobject_uevent_env+0x1bc

Re: [bdi_unregister] 165a5e22fa INFO: task swapper:1 blocked for more than 120 seconds.

2017-03-06 Thread Jan Kara
On Mon 06-03-17 07:44:55, James Bottomley wrote:
> On Mon, 2017-03-06 at 16:14 +0100, Jan Kara wrote:
> > On Mon 06-03-17 06:35:21, James Bottomley wrote:
> > > On Mon, 2017-03-06 at 13:01 +0100, Jan Kara wrote:
> > > > On Mon 06-03-17 11:27:33, Jan Kara wrote:
> > > > > Hi,
> > > > > 
> > > > > On Sun 05-03-17 10:21:11, Wu Fengguang wrote:
> > > > > > FYI next-20170303 is good while mainline is bad with this
> > > > > > error.
> > > > > > The attached reproduce-* may help reproduce the issue.
> > > > > 
> > > > > Thanks for report! So from the stacktrace we are in the path 
> > > > > testing removal of a device immediately after it has been
> > > > > probed 
> > > > > and for some reason bdi_unregister() hangs - likely waiting for
> > > > > cgroup-writeback references to drop. Given how early this
> > > > > happens 
> > > > > my guess is we fail to initialize something but for now I don't
> > > > > see 
> > > > > how my patch could make a difference. I'm trying to reproduce
> > > > > this 
> > > > > to be able to debug more...
> > > > 
> > > > OK, so after some debugging I think this is yet another problem
> > > > in 
> > > > SCSI initialization / destruction code which my patch only makes 
> > > > visible (added relevant maintainers).
> > > > 
> > > > I can reproduce the problem reliably with enabling:
> > > > 
> > > > CONFIG_DEBUG_TEST_DRIVER_REMOVE=y
> > > > CONFIG_SCSI_DEBUG=m
> > > > CONFIG_BLK_CGROUP=y
> > > > CONFIG_MEMCG=y
> > > > (and thus CONFIG_CGROUP_WRITEBACK=y)
> > > > 
> > > > then 'modprobe scsi_debug' is all it takes to reproduce hang. 
> > > > Relevant kernel messages with some of my debugging added
> > > > (attached is 
> > > > a patch that adds those debug messages):
> > > 
> > > This looks to be precisely the same problem Dan Williams was
> > > debugging
> > > for us.
> > > 
> > > > [   58.721765] scsi host0: scsi_debug: version 1.86 [20160430]
> > > > [   58.721765]   dev_size_mb=8, opts=0x0, submit_queues=1,
> > > > statistics=0
> > > > [   58.728946] CGWB init 88007fbb2000
> > > > [   58.730095] Created sdev 880078e1a000
> > > > [   58.731611] scsi 0:0:0:0: Direct-Access Linux   
> > > >  scsi_debug
> > > > 0186 PQ : 0 ANSI: 7
> > > > [   58.782246] sd 0:0:0:0: [sda] 16384 512-byte logical blocks:
> > > > (8.39
> > > > MB/8.00 MiB)
> > > > [   58.789687] sd 0:0:0:0: [sda] Write Protect is off
> > > > [   58.791140] sd 0:0:0:0: [sda] Mode Sense: 73 00 10 08
> > > > [   58.800879] sd 0:0:0:0: [sda] Write cache: enabled, read
> > > > cache:
> > > > enabled, supports DPO and FUA
> > > > [   58.893738] sd 0:0:0:0: [sda] Attached SCSI disk
> > > > [   58.896808] Unreg1
> > > > [   58.897960] Unreg2
> > > > [   58.898637] Unreg3
> > > > [   58.899100] CGWB 88007fbb2000 usage_cnt: 0
> > > > [   58.94] Unreg4
> > > > [   58.904976] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> > > 
> > > OK, can you put a WARN_ON trace in sd_shutdown and tell us where
> > > this
> > > is coming from.  For the device to be reused after this we have to
> > > be
> > > calling sd_shutdown() without going into SDEV_DEL.
> > 
> > Sure. The call trace is:
> > 
> > [   41.919244] [ cut here ]
> > [   41.919263] WARNING: CPU: 4 PID: 2335 at drivers/scsi/sd.c:3332
> > sd_shutdown+0x2f/0x100
> > [   41.919268] Modules linked in: scsi_debug(+) netconsole loop btrfs
> > raid6_pq zlib_deflate lzo_compress xor
> > [   41.919319] CPU: 4 PID: 2335 Comm: modprobe Not tainted 4.11.0-rc1
> > -xen+ #49
> > [   41.919325] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > [   41.919331] Call Trace:
> > [   41.919343]  dump_stack+0x8e/0xf0
> > [   41.919354]  __warn+0x116/0x120
> > [   41.919361]  warn_slowpath_null+0x1d/0x20
> > [   41.919368]  sd_shutdown+0x2f/0x100
> > [   41.919374]  sd_remove+0x70/0xd0
> > 
> > *** Here is the unexpected step I guess...
> > 
> > [   41.919383]  driver_probe_device+0xe0/0x4c0
> 
> Exactly.  It's this, I think
> 
>   bool test_remove = IS_ENABLED(CONFIG_DEBUG_TEST_DRIVER_REMOVE) &&
>  !drv->suppress_bind_attrs;
> 
> You have that config option set.

Yes - or better said 0-day testing has it set. Maybe that is not a sane
default for 0-day tests? The option is explicitly marked as "unstable"...
Fengguang?

> So the drivers base layer is calling ->remove after probe and
> triggering the destruction of the queue.
> 
> What to do about this (apart from nuke such a stupid option) is
> somewhat more problematic.

I guess this is between you and Greg :).

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


[PATCH 05/11] bdi: Move removal from bdi->wb_list into wb_shutdown()

2017-03-06 Thread Jan Kara
Currently the removal from bdi->wb_list happens directly in
cgwb_release_workfn(). Move it to wb_shutdown() which is functionally
equivalent and more logical (the list gets only used for distributing
writeback works among bdi_writeback structures). It will also allow us
to simplify writeback shutdown in cgwb_bdi_destroy().

Acked-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 mm/backing-dev.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 37cd1adae979..4dffae6e2d7b 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -345,6 +345,8 @@ static int wb_init(struct bdi_writeback *wb, struct 
backing_dev_info *bdi,
return err;
 }
 
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb);
+
 /*
  * Remove bdi from the global list and shutdown any threads we have running
  */
@@ -358,6 +360,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
}
spin_unlock_bh(&wb->work_lock);
 
+   cgwb_remove_from_bdi_list(wb);
/*
 * Drain work list and shutdown the delayed_work.  !WB_registered
 * tells wb_workfn() that @wb is dying and its work_list needs to
@@ -491,10 +494,6 @@ static void cgwb_release_workfn(struct work_struct *work)
release_work);
struct backing_dev_info *bdi = wb->bdi;
 
-   spin_lock_irq(&cgwb_lock);
-   list_del_rcu(&wb->bdi_node);
-   spin_unlock_irq(&cgwb_lock);
-
wb_shutdown(wb);
 
css_put(wb->memcg_css);
@@ -526,6 +525,13 @@ static void cgwb_kill(struct bdi_writeback *wb)
percpu_ref_kill(&wb->refcnt);
 }
 
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb)
+{
+   spin_lock_irq(&cgwb_lock);
+   list_del_rcu(&wb->bdi_node);
+   spin_unlock_irq(&cgwb_lock);
+}
+
 static int cgwb_create(struct backing_dev_info *bdi,
   struct cgroup_subsys_state *memcg_css, gfp_t gfp)
 {
@@ -783,6 +789,8 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
wb_congested_put(bdi->wb_congested);
 }
 
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb) { }
+
 #endif /* CONFIG_CGROUP_WRITEBACK */
 
 int bdi_init(struct backing_dev_info *bdi)
-- 
2.10.2



[PATCH 02/11] block: Fix race of bdev open with gendisk shutdown

2017-03-06 Thread Jan Kara
blkdev_open() may race with gendisk shutdown in such a way that
del_gendisk() has already unhashed the block device inode (and thus
bd_acquire() will end up creating a new block device inode), however
get_gendisk() will still return the gendisk that is being destroyed.
This will result in the new bdev inode being associated with the bdi of
the request queue that is going away, and when the device number gets
eventually reused, the block device will still be pointing to the stale
bdi.

Fix the problem by checking whether the gendisk is still alive when
associating bdev inode with it and its bdi. That way we are sure that
once we are unhashing block device inodes in del_gendisk(), newly
created bdev inodes cannot be associated with bdi of the deleted gendisk
anymore. We actually already do this check when opening a partition so
we need to add it only for the case when the whole device is opened.

Also add a warning that will tell us about unexpected inconsistencies
between bdi associated with the bdev inode and bdi associated with the
disk.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/genhd.c  | 7 ++++++-
 fs/block_dev.c | 5 ++++-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index b26a5ea115d0..e8df37de03af 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -662,6 +662,12 @@ void del_gendisk(struct gendisk *disk)
struct disk_part_iter piter;
struct hd_struct *part;
 
+   disk->flags &= ~GENHD_FL_UP;
+   /*
+* Make sure __blkdev_open() sees the disk is going away before
+* starting to unhash bdev inodes.
+*/
+   smp_wmb();
blk_integrity_del(disk);
disk_del_events(disk);
 
@@ -678,7 +684,6 @@ void del_gendisk(struct gendisk *disk)
invalidate_partition(disk, 0);
bdev_unhash_inode(disk_devt(disk));
set_capacity(disk, 0);
-   disk->flags &= ~GENHD_FL_UP;
 
sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
/*
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 53e2389ae4d4..9e1993a2827f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1560,7 +1560,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
if (!partno) {
ret = -ENXIO;
bdev->bd_part = disk_get_part(disk, partno);
-   if (!bdev->bd_part)
+   if (!(disk->flags & GENHD_FL_UP) || !bdev->bd_part)
goto out_clear;
 
ret = 0;
@@ -1623,6 +1623,9 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
 
if (bdev->bd_bdi == &noop_backing_dev_info)
bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
+   else
+   WARN_ON_ONCE(bdev->bd_bdi !=
+disk->queue->backing_dev_info);
} else {
if (bdev->bd_contains == bdev) {
ret = 0;
-- 
2.10.2
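
The ordering this patch relies on — publish "the disk is dying" before the open path may observe it — can be sketched as a small userspace model. All `*_sim` names below are invented for illustration and are not kernel API; only the flag-test-before-associate idea mirrors the actual change.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for GENHD_FL_UP. */
#define GENHD_FL_UP_SIM 0x1

struct disk_sim { int flags; };
struct bdev_sim { struct disk_sim *disk; };

/* Model of the fixed __blkdev_get(): test the flag before associating
 * the bdev with the disk (and, in the real code, with its bdi). */
static int open_bdev_sim(struct bdev_sim *bdev, struct disk_sim *disk)
{
	if (!(disk->flags & GENHD_FL_UP_SIM))
		return -1;		/* disk is going away: fail the open */
	bdev->disk = disk;		/* safe to associate */
	return 0;
}

/* Model of del_gendisk(): clear the flag first, before unhashing
 * anything (the smp_wmb() in the real patch orders these stores so
 * that concurrent openers cannot see stale state). */
static void del_disk_sim(struct disk_sim *disk)
{
	disk->flags &= ~GENHD_FL_UP_SIM;
}
```

Once `del_disk_sim()` has run, any subsequent open attempt fails instead of binding to the dying disk.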



[PATCH 07/11] bdi: Do not wait for cgwbs release in bdi_unregister()

2017-03-06 Thread Jan Kara
Currently we wait for all cgwbs to get released in cgwb_bdi_destroy()
(called from bdi_unregister()). That is, however, unnecessary now that
cgwb->bdi is a proper refcounted reference (thus the bdi cannot get
released before all cgwbs are released) and cgwb_bdi_destroy() shuts
down writeback directly.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 include/linux/backing-dev-defs.h |  1 -
 mm/backing-dev.c                 | 18 +-----------------
 2 files changed, 1 insertion(+), 18 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8af720f22a2d..e66d4722db8e 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -164,7 +164,6 @@ struct backing_dev_info {
 #ifdef CONFIG_CGROUP_WRITEBACK
struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
struct rb_root cgwb_congested_tree; /* their congested states */
-   atomic_t usage_cnt; /* counts both cgwbs and cgwb_contested's */
 #else
struct bdi_writeback_congested *wb_congested;
 #endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index d084c89922f3..7592f4a36776 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -406,11 +406,9 @@ static void wb_exit(struct bdi_writeback *wb)
 /*
  * cgwb_lock protects bdi->cgwb_tree, bdi->cgwb_congested_tree,
  * blkcg->cgwb_list, and memcg->cgwb_list.  bdi->cgwb_tree is also RCU
- * protected.  cgwb_release_wait is used to wait for the completion of cgwb
- * releases from bdi destruction path.
+ * protected.
  */
 static DEFINE_SPINLOCK(cgwb_lock);
-static DECLARE_WAIT_QUEUE_HEAD(cgwb_release_wait);
 
 /**
  * wb_congested_get_create - get or create a wb_congested
@@ -505,7 +503,6 @@ static void cgwb_release_workfn(struct work_struct *work)
 {
struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
release_work);
-   struct backing_dev_info *bdi = wb->bdi;
 
wb_shutdown(wb);
 
@@ -516,9 +513,6 @@ static void cgwb_release_workfn(struct work_struct *work)
percpu_ref_exit(&wb->refcnt);
wb_exit(wb);
kfree_rcu(wb, rcu);
-
-   if (atomic_dec_and_test(&bdi->usage_cnt))
-   wake_up_all(&cgwb_release_wait);
 }
 
 static void cgwb_release(struct percpu_ref *refcnt)
@@ -608,7 +602,6 @@ static int cgwb_create(struct backing_dev_info *bdi,
/* we might have raced another instance of this function */
ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb);
if (!ret) {
-   atomic_inc(&bdi->usage_cnt);
list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list);
list_add(&wb->memcg_node, memcg_cgwb_list);
list_add(&wb->blkcg_node, blkcg_cgwb_list);
@@ -698,7 +691,6 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 
INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
bdi->cgwb_congested_tree = RB_ROOT;
-   atomic_set(&bdi->usage_cnt, 1);
 
ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
if (!ret) {
@@ -739,14 +731,6 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
}
 
spin_unlock_irq(&cgwb_lock);
-
-   /*
-* All cgwb's and their congested states must be shutdown and
-* released before returning.  Drain the usage counter to wait for
-* all cgwb's and cgwb_congested's ever created on @bdi.
-*/
-   atomic_dec(&bdi->usage_cnt);
-   wait_event(cgwb_release_wait, !atomic_read(&bdi->usage_cnt));
 }
 
 /**
-- 
2.10.2



[PATCH 10/11] kobject: Export kobject_get_unless_zero()

2017-03-06 Thread Jan Kara
Make the function available for outside use and fortify it against NULL
kobject.

CC: Greg Kroah-Hartman <gre...@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bart.vanass...@sandisk.com>
Acked-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 include/linux/kobject.h | 2 ++
 lib/kobject.c   | 5 ++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index e6284591599e..ca85cb80e99a 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -108,6 +108,8 @@ extern int __must_check kobject_rename(struct kobject *, 
const char *new_name);
 extern int __must_check kobject_move(struct kobject *, struct kobject *);
 
 extern struct kobject *kobject_get(struct kobject *kobj);
+extern struct kobject * __must_check kobject_get_unless_zero(
+   struct kobject *kobj);
 extern void kobject_put(struct kobject *kobj);
 
 extern const void *kobject_namespace(struct kobject *kobj);
diff --git a/lib/kobject.c b/lib/kobject.c
index 445dcaeb0f56..763d70a18941 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -601,12 +601,15 @@ struct kobject *kobject_get(struct kobject *kobj)
 }
 EXPORT_SYMBOL(kobject_get);
 
-static struct kobject * __must_check kobject_get_unless_zero(struct kobject 
*kobj)
+struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
 {
+   if (!kobj)
+   return NULL;
if (!kref_get_unless_zero(&kobj->kref))
kobj = NULL;
return kobj;
 }
+EXPORT_SYMBOL(kobject_get_unless_zero);
 
 /*
  * kobject_cleanup - free kobject resources.
-- 
2.10.2
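
The semantics being exported — take a reference only if the count has not already hit zero — can be modeled in plain C. This is illustrative only: the real kernel version is built on kref / atomic_add_unless and must be atomic, while this single-threaded sketch only captures the decision logic (including the NULL fortification the patch adds).

```c
#include <assert.h>
#include <stddef.h>

/* Toy refcounted object; names are invented for illustration. */
struct kobj_sim { int refcount; };

static struct kobj_sim *kobject_get_unless_zero_sim(struct kobj_sim *kobj)
{
	if (!kobj)
		return NULL;	/* fortified against NULL, as in the patch */
	if (kobj->refcount == 0)
		return NULL;	/* already on its way to being freed */
	kobj->refcount++;	/* safe to take a reference */
	return kobj;
}
```

A plain `kobject_get()` would have incremented the count even from zero, handing back an object that the release path is concurrently freeing — exactly the scsi_disk_get() oops described in patch 11.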



[PATCH 11/11] block: Fix oops scsi_disk_get()

2017-03-06 Thread Jan Kara
When device open races with device shutdown, we can get the following
oops in scsi_disk_get():

[11863.044351] general protection fault:  [#1] SMP
[11863.045561] Modules linked in: scsi_debug xfs libcrc32c netconsole btrfs 
raid6_pq zlib_deflate lzo_compress xor [last unloaded: loop]
[11863.047853] CPU: 3 PID: 13042 Comm: hald-probe-stor Tainted: G W  
4.10.0-rc2-xen+ #35
[11863.048030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[11863.048030] task: 88007f438200 task.stack: c9fd
[11863.048030] RIP: 0010:scsi_disk_get+0x43/0x70
[11863.048030] RSP: 0018:c9fd3a08 EFLAGS: 00010202
[11863.048030] RAX: 6b6b6b6b6b6b6b6b RBX: 88007f56d000 RCX: 
[11863.048030] RDX: 0001 RSI: 0004 RDI: 81a8d880
[11863.048030] RBP: c9fd3a18 R08:  R09: 0001
[11863.059217] R10:  R11:  R12: fffa
[11863.059217] R13: 880078872800 R14: 880070915540 R15: 001d
[11863.059217] FS:  7f2611f71800() GS:88007f0c() 
knlGS:
[11863.059217] CS:  0010 DS:  ES:  CR0: 80050033
[11863.059217] CR2: 0060e048 CR3: 778d4000 CR4: 06e0
[11863.059217] Call Trace:
[11863.059217]  ? disk_get_part+0x22/0x1f0
[11863.059217]  sd_open+0x39/0x130
[11863.059217]  __blkdev_get+0x69/0x430
[11863.059217]  ? bd_acquire+0x7f/0xc0
[11863.059217]  ? bd_acquire+0x96/0xc0
[11863.059217]  ? blkdev_get+0x350/0x350
[11863.059217]  blkdev_get+0x126/0x350
[11863.059217]  ? _raw_spin_unlock+0x2b/0x40
[11863.059217]  ? bd_acquire+0x7f/0xc0
[11863.059217]  ? blkdev_get+0x350/0x350
[11863.059217]  blkdev_open+0x65/0x80
...

As you can see, the RAX value is already poisoned, showing that the
gendisk we got has already been freed. The problem is that get_gendisk()
looks up the device number in ext_devt_idr and then calls get_disk(),
which does kobject_get() on the disk's kobject. However, the disk gets
removed from ext_devt_idr only in disk_release() (through
blk_free_devt()), at which moment it already has a 0 refcount and is on
its way to being freed. Indeed we've got a warning from kobject_get()
about a 0 refcount shortly before the oops.

We fix the problem by using kobject_get_unless_zero() in get_disk() so
that get_disk() cannot get reference on a disk that is already being
freed.

Tested-by: Lekshmi Pillai <lekshmicpil...@in.ibm.com>
Reviewed-by: Bart Van Assche <bart.vanass...@sandisk.com>
Acked-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/genhd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index e8df37de03af..7fa59bc231dd 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1374,7 +1374,7 @@ struct kobject *get_disk(struct gendisk *disk)
owner = disk->fops->owner;
if (owner && !try_module_get(owner))
return NULL;
-   kobj = kobject_get(&disk_to_dev(disk)->kobj);
+   kobj = kobject_get_unless_zero(&disk_to_dev(disk)->kobj);
if (kobj == NULL) {
module_put(owner);
return NULL;
-- 
2.10.2



[PATCH 06/11] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()

2017-03-06 Thread Jan Kara
Currently we wait for all cgwbs to get freed in cgwb_bdi_destroy(),
which also means that writeback has been shut down on them. Since this
wait is going away, directly shut down writeback on cgwbs from
cgwb_bdi_destroy() to avoid live writeback structures after
bdi_unregister() has finished. To make that safe against concurrent
shutdown from cgwb_release_workfn(), we also have to make sure
wb_shutdown() returns only after the bdi_writeback structure is really
shut down.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 include/linux/backing-dev-defs.h |  1 +
 mm/backing-dev.c                 | 22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8fb3dcdebc80..8af720f22a2d 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -21,6 +21,7 @@ struct dentry;
  */
 enum wb_state {
WB_registered,  /* bdi_register() was done */
+   WB_shutting_down,   /* wb_shutdown() in progress */
WB_writeback_running,   /* Writeback is in progress */
WB_has_dirty_io,/* Dirty inodes on ->b_{dirty|io|more_io} */
 };
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 4dffae6e2d7b..d084c89922f3 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -356,8 +356,15 @@ static void wb_shutdown(struct bdi_writeback *wb)
spin_lock_bh(&wb->work_lock);
if (!test_and_clear_bit(WB_registered, &wb->state)) {
spin_unlock_bh(&wb->work_lock);
+   /*
+* Wait for wb shutdown to finish if someone else is just
+* running wb_shutdown(). Otherwise we could proceed to wb /
+* bdi destruction before wb_shutdown() is finished.
+*/
+   wait_on_bit(&wb->state, WB_shutting_down, TASK_UNINTERRUPTIBLE);
return;
}
+   set_bit(WB_shutting_down, &wb->state);
spin_unlock_bh(&wb->work_lock);
 
cgwb_remove_from_bdi_list(wb);
@@ -369,6 +376,12 @@ static void wb_shutdown(struct bdi_writeback *wb)
mod_delayed_work(bdi_wq, &wb->dwork, 0);
flush_delayed_work(&wb->dwork);
WARN_ON(!list_empty(&wb->work_list));
+   /*
+* Make sure bit gets cleared after shutdown is finished. Matches with
+* the barrier provided by test_and_clear_bit() above.
+*/
+   smp_wmb();
+   clear_bit(WB_shutting_down, &wb->state);
 }
 
 static void wb_exit(struct bdi_writeback *wb)
@@ -700,6 +713,7 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
struct radix_tree_iter iter;
struct rb_node *rbn;
void **slot;
+   struct bdi_writeback *wb;
 
WARN_ON(test_bit(WB_registered, &bdi->wb.state));
 
@@ -716,6 +730,14 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
congested->__bdi = NULL;/* mark @congested unlinked */
}
 
+   while (!list_empty(&bdi->wb_list)) {
+   wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
+ bdi_node);
+   spin_unlock_irq(&cgwb_lock);
+   wb_shutdown(wb);
+   spin_lock_irq(&cgwb_lock);
+   }
+
spin_unlock_irq(&cgwb_lock);
 
/*
-- 
2.10.2
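
The exclusion protocol the patch adds to wb_shutdown() can be condensed into a single-threaded model. The names below are invented; the real code clears WB_registered with test_and_clear_bit() under wb->work_lock and uses wait_on_bit()/clear_bit() where the model simply returns early.

```c
#include <assert.h>

/* Illustrative stand-ins for the WB_* state bits. */
enum { WB_REGISTERED_SIM = 1, WB_SHUTTING_DOWN_SIM = 2 };

struct wb_sim {
	int state;
	int shutdown_count;	/* how many times the drain work ran */
};

static void wb_shutdown_sim(struct wb_sim *wb)
{
	if (!(wb->state & WB_REGISTERED_SIM)) {
		/* Someone else already claimed the shutdown; the real
		 * code waits on WB_shutting_down here so the caller
		 * does not proceed to destruction too early. */
		return;
	}
	wb->state &= ~WB_REGISTERED_SIM;	/* claim the shutdown */
	wb->state |= WB_SHUTTING_DOWN_SIM;
	wb->shutdown_count++;			/* drain/flush work */
	wb->state &= ~WB_SHUTTING_DOWN_SIM;	/* wakes waiters in reality */
}
```

Only the caller that observes and clears the registered bit performs the drain, so concurrent calls from cgwb_release_workfn() and cgwb_bdi_destroy() cannot both tear the structure down.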



[PATCH 04/11] bdi: Make wb->bdi a proper reference

2017-03-06 Thread Jan Kara
Make wb->bdi a proper refcounted reference to bdi for all bdi_writeback
structures except for the one embedded inside struct backing_dev_info.
That will allow us to simplify bdi unregistration.

Acked-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 mm/backing-dev.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index fec27da14aea..37cd1adae979 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -294,6 +294,8 @@ static int wb_init(struct bdi_writeback *wb, struct 
backing_dev_info *bdi,
 
memset(wb, 0, sizeof(*wb));
 
+   if (wb != &bdi->wb)
+   bdi_get(bdi);
wb->bdi = bdi;
wb->last_old_flush = jiffies;
INIT_LIST_HEAD(&wb->b_dirty);
@@ -314,8 +316,10 @@ static int wb_init(struct bdi_writeback *wb, struct 
backing_dev_info *bdi,
wb->dirty_sleep = jiffies;
 
wb->congested = wb_congested_get_create(bdi, blkcg_id, gfp);
-   if (!wb->congested)
-   return -ENOMEM;
+   if (!wb->congested) {
+   err = -ENOMEM;
+   goto out_put_bdi;
+   }
 
err = fprop_local_init_percpu(&wb->completions, gfp);
if (err)
@@ -335,6 +339,9 @@ static int wb_init(struct bdi_writeback *wb, struct 
backing_dev_info *bdi,
fprop_local_destroy_percpu(&wb->completions);
 out_put_cong:
wb_congested_put(wb->congested);
+out_put_bdi:
+   if (wb != &bdi->wb)
+   bdi_put(bdi);
return err;
 }
 
@@ -372,6 +379,8 @@ static void wb_exit(struct bdi_writeback *wb)
 
fprop_local_destroy_percpu(&wb->completions);
wb_congested_put(wb->congested);
+   if (wb != &wb->bdi->wb)
+   bdi_put(wb->bdi);
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
-- 
2.10.2



[PATCH 0/11 v3] block: Fix block device shutdown related races

2017-03-06 Thread Jan Kara
Hello,

this is a series with the remaining patches (on top of 4.11-rc1) to fix several
different races and issues I've found when testing device shutdown and reuse.
The first two patches fix possible (theoretical) problems when opening of a
block device races with shutdown of a gendisk structure. Patches 3-9 fix an
oops that is triggered by __blkdev_put() calling inode_detach_wb() too early
(the problem reported by Thiago). Patches 10 and 11 fix an oops due to a bug in
gendisk code where get_gendisk() can return an already freed gendisk structure
(again triggered by Omar's stress test).

People, please have a look at patches. They are mostly simple however the
interactions are rather complex so I may have missed something. Also I'm
happy for any additional testing these patches can get - I've stressed them
with Omar's script, tested memcg writeback, tested static (not udev managed)
device inodes.

Changes since v2:
* Added Tejun's acks
* Rebased on top of 4.11-rc1
* Fixed two possible races between blkdev_open() and del_gendisk()
* Fixed possible race between concurrent shutdown of cgwb spotted by Tejun

Changes since v1:
* Added Acks and Tested-by tags for patches in areas that did not change
* Reworked inode_detach_wb() related fixes based on Tejun's feedback

Honza


[PATCH 03/11] bdi: Mark congested->bdi as internal

2017-03-06 Thread Jan Kara
The congested->bdi pointer is used only to remove the congested
structure from bdi->cgwb_congested_tree on structure release. Moreover,
the pointer can become NULL when we unregister the bdi. Rename the field
to __bdi and add a comment to make it more explicit that this is
internal stuff of the memcg writeback code and that people should not
use the field, as such use will likely be race prone.

We do not bother with converting congested->bdi to a proper refcounted
reference. It would be slightly ugly to special-case bdi->wb.congested
to avoid effectively a cyclic reference of the bdi to itself, and the
reference gets cleared from bdi_unregister(), making it impossible to
reference a freed bdi.

Acked-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 include/linux/backing-dev-defs.h |  4 +++-
 mm/backing-dev.c                 | 10 +++++-----
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index ad955817916d..8fb3dcdebc80 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -54,7 +54,9 @@ struct bdi_writeback_congested {
atomic_t refcnt;/* nr of attached wb's and blkg */
 
 #ifdef CONFIG_CGROUP_WRITEBACK
-   struct backing_dev_info *bdi;   /* the associated bdi */
+   struct backing_dev_info *__bdi; /* the associated bdi, set to NULL
+* on bdi unregistration. For memcg-wb
+* internal use only! */
int blkcg_id;   /* ID of the associated blkcg */
struct rb_node rb_node; /* on bdi->cgwb_congestion_tree */
 #endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6d861d090e9f..fec27da14aea 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -438,7 +438,7 @@ wb_congested_get_create(struct backing_dev_info *bdi, int 
blkcg_id, gfp_t gfp)
return NULL;
 
atomic_set(&new_congested->refcnt, 0);
-   new_congested->bdi = bdi;
+   new_congested->__bdi = bdi;
new_congested->blkcg_id = blkcg_id;
goto retry;
 
@@ -466,10 +466,10 @@ void wb_congested_put(struct bdi_writeback_congested 
*congested)
}
 
/* bdi might already have been destroyed leaving @congested unlinked */
-   if (congested->bdi) {
+   if (congested->__bdi) {
rb_erase(&congested->rb_node,
-&congested->bdi->cgwb_congested_tree);
-   congested->bdi = NULL;
+&congested->__bdi->cgwb_congested_tree);
+   congested->__bdi = NULL;
}
 
spin_unlock_irqrestore(_lock, flags);
@@ -698,7 +698,7 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
rb_entry(rbn, struct bdi_writeback_congested, rb_node);
 
rb_erase(rbn, &bdi->cgwb_congested_tree);
-   congested->bdi = NULL;  /* mark @congested unlinked */
+   congested->__bdi = NULL;/* mark @congested unlinked */
}
 
spin_unlock_irq(&cgwb_lock);
-- 
2.10.2



[PATCH 01/11] block: Fix bdi assignment to bdev inode when racing with disk delete

2017-03-06 Thread Jan Kara
When disk->fops->open() in __blkdev_get() returns -ERESTARTSYS, we
restart the process of opening the block device. However we forget to
switch bdev->bd_bdi back to noop_backing_dev_info and as a result bdev
inode will be pointing to a stale bdi. Fix the problem by setting
bdev->bd_bdi later when __blkdev_get() is already guaranteed to succeed.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/block_dev.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00ec4370..53e2389ae4d4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1556,8 +1556,6 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
bdev->bd_disk = disk;
bdev->bd_queue = disk->queue;
bdev->bd_contains = bdev;
-   if (bdev->bd_bdi == &noop_backing_dev_info)
-   bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
 
if (!partno) {
ret = -ENXIO;
@@ -1622,6 +1620,9 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
}
bd_set_size(bdev, (loff_t)bdev->bd_part->nr_sects << 9);
}
+
+   if (bdev->bd_bdi == &noop_backing_dev_info)
+   bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
} else {
if (bdev->bd_contains == bdev) {
ret = 0;
@@ -1653,8 +1654,6 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
bdev->bd_disk = NULL;
bdev->bd_part = NULL;
bdev->bd_queue = NULL;
-   bdi_put(bdev->bd_bdi);
-   bdev->bd_bdi = &noop_backing_dev_info;
if (bdev != bdev->bd_contains)
__blkdev_put(bdev->bd_contains, mode, 1);
bdev->bd_contains = NULL;
-- 
2.10.2
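
The shape of the bug and of the fix can be modeled in a few lines. The names are invented for illustration; the point is only that the association happens after the last failure/restart point, so a failed open leaves the bdev pointing at the noop bdi as it should.

```c
#include <assert.h>

/* bdi_id 0 stands in for noop_backing_dev_info. */
struct queue_sim { int bdi_id; };
struct bdev_sim  { int bdi_id; };

/* Model of the fixed __blkdev_get(): the bdi is bound only once the
 * open can no longer fail, so a restartable failure (-ERESTARTSYS in
 * the real code) leaves no stale association behind. */
static int open_bdev_sim(struct bdev_sim *bdev, struct queue_sim *q,
			 int open_fails)
{
	if (open_fails)
		return -1;		/* restart: bdev->bdi_id untouched */
	bdev->bdi_id = q->bdi_id;	/* associate only on guaranteed success */
	return 0;
}
```

In the buggy ordering the assignment came first, and the restart path forgot to undo it, leaving the bdev inode pointing at a stale bdi.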



Re: [PATCH 0/13 v2] block: Fix block device shutdown related races

2017-03-07 Thread Jan Kara
On Mon 06-03-17 12:38:18, Omar Sandoval wrote:
> On Wed, Mar 01, 2017 at 04:11:09PM +0100, Jan Kara wrote:
> > On Tue 28-02-17 23:26:53, Omar Sandoval wrote:
> > > On Tue, Feb 28, 2017 at 11:25:28PM -0800, Omar Sandoval wrote:
> > > > On Wed, Feb 22, 2017 at 11:24:25AM +0100, Jan Kara wrote:
> > > > > On Tue 21-02-17 10:19:28, Jens Axboe wrote:
> > > > > > On 02/21/2017 10:09 AM, Jan Kara wrote:
> > > > > > > Hello,
> > > > > > > 
> > > > > > > this is a second revision of the patch set to fix several 
> > > > > > > different races and
> > > > > > > issues I've found when testing device shutdown and reuse. The 
> > > > > > > first three
> > > > > > > patches are fixes to problems in my previous series fixing BDI 
> > > > > > > lifetime issues.
> > > > > > > Patch 4 fixes issues with reuse of BDI name with scsi devices. 
> > > > > > > With it I cannot
> > > > > > > reproduce the BDI name reuse issues using Omar's stress test 
> > > > > > > using scsi_debug
> > > > > > > so it can be used as a replacement of Dan's patches. Patches 5-11 
> > > > > > > fix oops that
> > > > > > > is triggered by __blkdev_put() calling inode_detach_wb() too 
> > > > > > > early (the problem
> > > > > > > reported by Thiago). Patches 12 and 13 fix oops due to a bug in 
> > > > > > > gendisk code
> > > > > > > where get_gendisk() can return already freed gendisk structure 
> > > > > > > (again triggered
> > > > > > > by Omar's stress test).
> > > > > > > 
> > > > > > > People, please have a look at patches. They are mostly simple 
> > > > > > > however the
> > > > > > > interactions are rather complex so I may have missed something. 
> > > > > > > Also I'm
> > > > > > > happy for any additional testing these patches can get - I've 
> > > > > > > stressed them
> > > > > > > with Omar's script, tested memcg writeback, tested static (not 
> > > > > > > udev managed)
> > > > > > > device inodes.
> > > > > > > 
> > > > > > > Jens, I think at least patches 1-3 should go in together with my 
> > > > > > > fixes you
> > > > > > > already have in your tree (or shortly after them). It is up to 
> > > > > > > you whether
> > > > > > > you decide to delay my first fixes or pick these up quickly. 
> > > > > > > Patch 4 is
> > > > > > > (IMHO a cleaner) replacement of Dan's patches so consider whether 
> > > > > > > you want
> > > > > > > to use it instead of those patches.
> > > > > > 
> > > > > > I have applied 1-3 to my for-linus branch, which will go in after
> > > > > > the initial pull request has been pulled by Linus. Consider fixing 
> > > > > > up
> > > > > > #4 so it applies, I like it.
> > > > > 
> > > > > OK, attached is patch 4 rebased on top of Linus' tree from today which
> > > > > already has linux-block changes pulled in. I've left put_disk_devt() 
> > > > > in
> > > > > blk_cleanup_queue() to maintain the logic in the original patch (now 
> > > > > commit
> > > > > 0dba1314d4f8) that request_queue and gendisk each hold one devt 
> > > > > reference.
> > > > > The bdi_unregister() call that is moved to del_gendisk() by this 
> > > > > patch is
> > > > > now protected by the gendisk reference instead of the request_queue 
> > > > > one
> > > > > so it still maintains the property that devt reference protects bdi
> > > > > registration-unregistration lifetime (as much as that is not needed 
> > > > > anymore
> > > > > after this patch).
> > > > > 
> > > > > I have also updated the comment in the code and the changelog - they 
> > > > > were
> > > > > somewhat stale after changes to the whole series Tejun suggested.
> > > > > 
> > > > >   Honza

Re: [bdi_unregister] 165a5e22fa INFO: task swapper:1 blocked for more than 120 seconds.

2017-03-07 Thread Jan Kara
On Mon 06-03-17 09:25:42, James Bottomley wrote:
> On Mon, 2017-03-06 at 17:13 +0100, Jan Kara wrote:
> > On Mon 06-03-17 07:44:55, James Bottomley wrote:
...
> > > > Sure. The call trace is:
> > > > 
> > > > [   41.919244] [ cut here ]
> > > > [   41.919263] WARNING: CPU: 4 PID: 2335 at
> > > > drivers/scsi/sd.c:3332
> > > > sd_shutdown+0x2f/0x100
> > > > [   41.919268] Modules linked in: scsi_debug(+) netconsole loop
> > > > btrfs
> > > > raid6_pq zlib_deflate lzo_compress xor
> > > > [   41.919319] CPU: 4 PID: 2335 Comm: modprobe Not tainted 4.11.0
> > > > -rc1
> > > > -xen+ #49
> > > > [   41.919325] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > > > [   41.919331] Call Trace:
> > > > [   41.919343]  dump_stack+0x8e/0xf0
> > > > [   41.919354]  __warn+0x116/0x120
> > > > [   41.919361]  warn_slowpath_null+0x1d/0x20
> > > > [   41.919368]  sd_shutdown+0x2f/0x100
> > > > [   41.919374]  sd_remove+0x70/0xd0
> > > > 
> > > > *** Here is the unexpected step I guess...
> > > > 
> > > > [   41.919383]  driver_probe_device+0xe0/0x4c0
> > > 
> > > Exactly.  It's this, I think
> > > 
> > >   bool test_remove = IS_ENABLED(CONFIG_DEBUG_TEST_DRIVER_REMOVE)
> > > &&
> > >  !drv->suppress_bind_attrs;
> > > 
> > > You have that config option set.
> > 
> > Yes - or better said 0-day testing has it set. Maybe that is not a 
> > sane default for 0-day tests? The option is explicitly marked as
> > "unstable"... Fengguang?
> > 
> > > So the drivers base layer is calling ->remove after probe and
> > > triggering the destruction of the queue.
> > > 
> > > What to do about this (apart from nuke such a stupid option) is
> > > somewhat more problematic.
> > 
> > I guess this is between you and Greg :).
> 
> Nice try, sport ... you qualify just behind Dan in the "not my problem"
> olympic hurdles event.  I'm annoyed because there's no indication in
> the log that the driver add behaviour is radically altered, so I've
> been staring at the wrong code for weeks.  However, the unbind/rebind
> should work, so the real issue is that your bdi changes (or perhaps
> something else in block) have induced a regression in the unbinding of
> upper layer drivers.  If you're going to release the bdi in
> del_gendisk, you have to have some mechanism for re-acquiring it on re
> -probe (likely with the same name) otherwise rebind is broken for every
> block driver.

So my patch does not release the bdi in del_gendisk(). A bdi has two
initialization / destruction phases (similarly to a request queue). You
allocate and initialize the bdi through bdi_init(), then you call
bdi_register() to register it (which happens in device_add_disk()). On
shutdown you have to first call bdi_unregister() (it used to be called from
blk_cleanup_queue(); my patch moved it to del_gendisk()). Only after that
may the last reference to the bdi be dropped, which performs the final bdi
destruction.

So do I understand correctly that SCSI may call device_add_disk() and
del_gendisk() repeatedly for the same request queue? If yes, then indeed I
have a bug to fix... But the gendisk seems to get allocated from scratch on
each probe, so we don't call device_add_disk() and del_gendisk() multiple
times on the same disk, right?

> The fact that the second rebind tried with a different name indicates
> that sd_devt_release wasn't called, so some vestige of the devt remains
> on the subsequent rebind.

Yep, I guess that's caused by Dan's patch (commit 0dba1314d4f8 now) which
calls put_disk_devt() only in blk_cleanup_queue() which, if I understood
you correctly, does not get called during the unbind-bind cycle? In fact
Dan's patch would end up leaking devts because of repeated
device_add_disk() calls for the same request queue...

> Here's the problem: the queue belongs to SCSI (the lower layer), so it's
> not going to change because it doesn't see the release.  The gendisk and
> its allied stuff belongs to sd so it gets freed and re-created for the
> same queue.  Something in block is very confused when this happens.

Yep, I think the binding of a request queue to different gendisks is
something neither I nor Dan expected.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


[PATCH] block: Initialize bd_bdi on inode initialization

2017-03-02 Thread Jan Kara
So far we initialized bd_bdi only in bdget(). That is fine for normal
bdev inodes however for the special case of the root inode of
blockdev_superblock that function is never called and thus bd_bdi is
left uninitialized. As a result bdev_evict_inode() may oops doing
bdi_put(root->bd_bdi) on that inode as can be seen when doing:

mount -t bdev none /mnt

Fix the problem by initializing bd_bdi when first allocating the inode
and then reinitializing bd_bdi in bdev_evict_inode().

Thanks to syzkaller team for finding the problem.

Reported-by: Dmitry Vyukov <dvyu...@google.com>
Fixes: b1d2dc5659b41741f5a29b2ade76ffb4e5bb13d8
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/block_dev.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 77c30f15a02c..2eca00ec4370 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -870,6 +870,7 @@ static void init_once(void *foo)
 #ifdef CONFIG_SYSFS
 	INIT_LIST_HEAD(&bdev->bd_holder_disks);
 #endif
+	bdev->bd_bdi = &noop_backing_dev_info;
 	inode_init_once(&ei->vfs_inode);
 	/* Initialize mutex for freeze. */
 	mutex_init(&bdev->bd_fsfreeze_mutex);
@@ -884,8 +885,10 @@ static void bdev_evict_inode(struct inode *inode)
 	spin_lock(&bdev_lock);
 	list_del_init(&bdev->bd_list);
 	spin_unlock(&bdev_lock);
-	if (bdev->bd_bdi != &noop_backing_dev_info)
+	if (bdev->bd_bdi != &noop_backing_dev_info) {
 		bdi_put(bdev->bd_bdi);
+		bdev->bd_bdi = &noop_backing_dev_info;
+	}
 }
 
 static const struct super_operations bdev_sops = {
@@ -988,7 +991,6 @@ struct block_device *bdget(dev_t dev)
 	bdev->bd_contains = NULL;
 	bdev->bd_super = NULL;
 	bdev->bd_inode = inode;
-	bdev->bd_bdi = &noop_backing_dev_info;
 	bdev->bd_block_size = i_blocksize(inode);
 	bdev->bd_part_count = 0;
 	bdev->bd_invalidated = 0;
-- 
2.10.2



Re: [PATCH] block: Make writeback throttling defaults consistent for SQ devices

2017-04-19 Thread Jan Kara
On Wed 19-04-17 11:33:27, Jan Kara wrote:
> When CFQ is used as an elevator, it disables writeback throttling
> because they don't play well together. Later when a different elevator
> is chosen for the device, writeback throttling doesn't get enabled
> again as it should. Make sure CFQ enables writeback throttling (if it
> should be enabled by default) when we switch from it to another IO
> scheduler.
> 
> Signed-off-by: Jan Kara <j...@suse.cz>

I forgot to change the header to v2, so I'm rather writing it here so that
it's clear.

Honza

> ---
>  block/blk-sysfs.c | 19 +------------------
>  block/blk-wbt.c   | 19 +++++++++++++++++++
>  block/blk-wbt.h   |  4 ++++
>  block/elevator.c  |  3 +++
>  4 files changed, 27 insertions(+), 18 deletions(-)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index fc20489f0d2b..f85723332288 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -844,23 +844,6 @@ struct kobj_type blk_queue_ktype = {
>   .release= blk_release_queue,
>  };
>  
> -static void blk_wb_init(struct request_queue *q)
> -{
> -#ifndef CONFIG_BLK_WBT_MQ
> - if (q->mq_ops)
> - return;
> -#endif
> -#ifndef CONFIG_BLK_WBT_SQ
> - if (q->request_fn)
> - return;
> -#endif
> -
> - /*
> -  * If this fails, we don't get throttling
> -  */
> - wbt_init(q);
> -}
> -
>  int blk_register_queue(struct gendisk *disk)
>  {
>   int ret;
> @@ -908,7 +891,7 @@ int blk_register_queue(struct gendisk *disk)
>  
>  	kobject_uevent(&q->kobj, KOBJ_ADD);
>  
> - blk_wb_init(q);
> + wbt_enable_default(q);
>  
>   blk_throtl_register_queue(q);
>  
> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
> index b3b79149d3a0..26e1bb617877 100644
> --- a/block/blk-wbt.c
> +++ b/block/blk-wbt.c
> @@ -665,6 +665,25 @@ void wbt_disable_default(struct request_queue *q)
>  }
>  EXPORT_SYMBOL_GPL(wbt_disable_default);
>  
> +/*
> + * Enable wbt if defaults are configured that way
> + */
> +void wbt_enable_default(struct request_queue *q)
> +{
> + /* Throttling already enabled? */
> + if (q->rq_wb)
> + return;
> +
> + /* Queue not registered? Maybe shutting down... */
> +	if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags))
> + return;
> +
> + if ((q->mq_ops && IS_ENABLED(CONFIG_BLK_WBT_MQ)) ||
> + (q->request_fn && IS_ENABLED(CONFIG_BLK_WBT_SQ)))
> + wbt_init(q);
> +}
> +EXPORT_SYMBOL_GPL(wbt_enable_default);
> +
>  u64 wbt_default_latency_nsec(struct request_queue *q)
>  {
>   /*
> diff --git a/block/blk-wbt.h b/block/blk-wbt.h
> index ad6c78507c3a..df6de50c5d59 100644
> --- a/block/blk-wbt.h
> +++ b/block/blk-wbt.h
> @@ -117,6 +117,7 @@ void wbt_update_limits(struct rq_wb *);
>  void wbt_requeue(struct rq_wb *, struct blk_issue_stat *);
>  void wbt_issue(struct rq_wb *, struct blk_issue_stat *);
>  void wbt_disable_default(struct request_queue *);
> +void wbt_enable_default(struct request_queue *);
>  
>  void wbt_set_queue_depth(struct rq_wb *, unsigned int);
>  void wbt_set_write_cache(struct rq_wb *, bool);
> @@ -155,6 +156,9 @@ static inline void wbt_issue(struct rq_wb *rwb, struct blk_issue_stat *stat)
>  static inline void wbt_disable_default(struct request_queue *q)
>  {
>  }
> +static inline void wbt_enable_default(struct request_queue *q)
> +{
> +}
>  static inline void wbt_set_queue_depth(struct rq_wb *rwb, unsigned int depth)
>  {
>  }
> diff --git a/block/elevator.c b/block/elevator.c
> index dbeecf7be719..fb50416b5aae 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -41,6 +41,7 @@
>  
>  #include "blk.h"
>  #include "blk-mq-sched.h"
> +#include "blk-wbt.h"
>  
>  static DEFINE_SPINLOCK(elv_list_lock);
>  static LIST_HEAD(elv_list);
> @@ -877,6 +878,8 @@ void elv_unregister_queue(struct request_queue *q)
>  		kobject_uevent(&e->kobj, KOBJ_REMOVE);
>  		kobject_del(&e->kobj);
>   e->registered = 0;
> + /* Re-enable throttling in case elevator disabled it */
> + wbt_enable_default(q);
>   }
>  }
>  EXPORT_SYMBOL(elv_unregister_queue);
> -- 
> 2.12.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


[PATCH] block: Make writeback throttling defaults consistent for SQ devices

2017-04-19 Thread Jan Kara
When CFQ is used as an elevator, it disables writeback throttling
because they don't play well together. Later when a different elevator
is chosen for the device, writeback throttling doesn't get enabled
again as it should. Make sure CFQ enables writeback throttling (if it
should be enabled by default) when we switch from it to another IO
scheduler.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/blk-sysfs.c | 19 +------------------
 block/blk-wbt.c   | 19 +++++++++++++++++++
 block/blk-wbt.h   |  4 ++++
 block/elevator.c  |  3 +++
 4 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fc20489f0d2b..f85723332288 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -844,23 +844,6 @@ struct kobj_type blk_queue_ktype = {
.release= blk_release_queue,
 };
 
-static void blk_wb_init(struct request_queue *q)
-{
-#ifndef CONFIG_BLK_WBT_MQ
-   if (q->mq_ops)
-   return;
-#endif
-#ifndef CONFIG_BLK_WBT_SQ
-   if (q->request_fn)
-   return;
-#endif
-
-   /*
-* If this fails, we don't get throttling
-*/
-   wbt_init(q);
-}
-
 int blk_register_queue(struct gendisk *disk)
 {
int ret;
@@ -908,7 +891,7 @@ int blk_register_queue(struct gendisk *disk)
 
 	kobject_uevent(&q->kobj, KOBJ_ADD);
 
-   blk_wb_init(q);
+   wbt_enable_default(q);
 
blk_throtl_register_queue(q);
 
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index b3b79149d3a0..26e1bb617877 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -665,6 +665,25 @@ void wbt_disable_default(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(wbt_disable_default);
 
+/*
+ * Enable wbt if defaults are configured that way
+ */
+void wbt_enable_default(struct request_queue *q)
+{
+   /* Throttling already enabled? */
+   if (q->rq_wb)
+   return;
+
+   /* Queue not registered? Maybe shutting down... */
+	if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags))
+   return;
+
+   if ((q->mq_ops && IS_ENABLED(CONFIG_BLK_WBT_MQ)) ||
+   (q->request_fn && IS_ENABLED(CONFIG_BLK_WBT_SQ)))
+   wbt_init(q);
+}
+EXPORT_SYMBOL_GPL(wbt_enable_default);
+
 u64 wbt_default_latency_nsec(struct request_queue *q)
 {
/*
diff --git a/block/blk-wbt.h b/block/blk-wbt.h
index ad6c78507c3a..df6de50c5d59 100644
--- a/block/blk-wbt.h
+++ b/block/blk-wbt.h
@@ -117,6 +117,7 @@ void wbt_update_limits(struct rq_wb *);
 void wbt_requeue(struct rq_wb *, struct blk_issue_stat *);
 void wbt_issue(struct rq_wb *, struct blk_issue_stat *);
 void wbt_disable_default(struct request_queue *);
+void wbt_enable_default(struct request_queue *);
 
 void wbt_set_queue_depth(struct rq_wb *, unsigned int);
 void wbt_set_write_cache(struct rq_wb *, bool);
@@ -155,6 +156,9 @@ static inline void wbt_issue(struct rq_wb *rwb, struct blk_issue_stat *stat)
 static inline void wbt_disable_default(struct request_queue *q)
 {
 }
+static inline void wbt_enable_default(struct request_queue *q)
+{
+}
 static inline void wbt_set_queue_depth(struct rq_wb *rwb, unsigned int depth)
 {
 }
diff --git a/block/elevator.c b/block/elevator.c
index dbeecf7be719..fb50416b5aae 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -41,6 +41,7 @@
 
 #include "blk.h"
 #include "blk-mq-sched.h"
+#include "blk-wbt.h"
 
 static DEFINE_SPINLOCK(elv_list_lock);
 static LIST_HEAD(elv_list);
@@ -877,6 +878,8 @@ void elv_unregister_queue(struct request_queue *q)
 	kobject_uevent(&e->kobj, KOBJ_REMOVE);
 	kobject_del(&e->kobj);
e->registered = 0;
+   /* Re-enable throttling in case elevator disabled it */
+   wbt_enable_default(q);
}
 }
 EXPORT_SYMBOL(elv_unregister_queue);
-- 
2.12.0



Re: [PATCH v3 04/20] fs: check for writeback errors after syncing out buffers in generic_file_fsync

2017-04-24 Thread Jan Kara
On Mon 24-04-17 09:22:43, Jeff Layton wrote:
> ext2 currently does a test+clear of the AS_EIO flag, which is
> is problematic for some coming changes.
> 
> What we really need to do instead is call filemap_check_errors
> in __generic_file_fsync after syncing out the buffers. That
> will be sufficient for this case, and help other callers detect
> these errors properly as well.
> 
> With that, we don't need to twiddle it in ext2.
> 
> Suggested-by: Jan Kara <j...@suse.cz>
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza
> ---
>  fs/ext2/file.c | 2 +-
>  fs/libfs.c | 3 ++-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index b21891a6bfca..ed00e7ae0ef3 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -177,7 +177,7 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>   struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
>  
>   ret = generic_file_fsync(file, start, end, datasync);
> -	if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) {
> + if (ret == -EIO) {
>   /* We don't really know where the IO error happened... */
>   ext2_error(sb, __func__,
>  "detected IO error when writing metadata buffers");
> diff --git a/fs/libfs.c b/fs/libfs.c
> index a8b62e5d43a9..12a48ee442d3 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -991,7 +991,8 @@ int __generic_file_fsync(struct file *file, loff_t start, loff_t end,
>  
>  out:
>   inode_unlock(inode);
> - return ret;
> + err = filemap_check_errors(inode->i_mapping);
> + return ret ? : err;
>  }
>  EXPORT_SYMBOL(__generic_file_fsync);
>  
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v3 08/20] mm: ensure that we set mapping error if writeout() fails

2017-04-24 Thread Jan Kara
On Mon 24-04-17 09:22:47, Jeff Layton wrote:
> If writepage fails during a page migration, then we need to ensure that
> fsync will see it by flagging the mapping.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza
> ---
>  mm/migrate.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 738f1d5f8350..3a59830bdae2 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -792,7 +792,11 @@ static int writeout(struct address_space *mapping, struct page *page)
>   /* unlocked. Relock */
>   lock_page(page);
>  
> - return (rc < 0) ? -EIO : -EAGAIN;
> + if (rc < 0) {
> + mapping_set_error(mapping, rc);
> + return -EIO;
> + }
> + return -EAGAIN;
>  }
>  
>  /*
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v3 09/20] 9p: set mapping error when writeback fails in launder_page

2017-04-24 Thread Jan Kara
On Mon 24-04-17 09:22:48, Jeff Layton wrote:
> launder_page is just writeback under the page lock. We still need to
> mark the mapping for errors there when they occur.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/9p/vfs_addr.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
> index adaf6f6dd858..7af6e6501698 100644
> --- a/fs/9p/vfs_addr.c
> +++ b/fs/9p/vfs_addr.c
> @@ -223,8 +223,11 @@ static int v9fs_launder_page(struct page *page)
>   v9fs_fscache_wait_on_page_write(inode, page);
>   if (clear_page_dirty_for_io(page)) {
>   retval = v9fs_vfs_writepage_locked(page);
> - if (retval)
> + if (retval) {
> + if (retval != -EAGAIN)
> + mapping_set_error(page->mapping, retval);
>   return retval;
> + }
>   }
>   return 0;
>  }
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v3 06/20] dax: set errors in mapping when writeback fails

2017-04-24 Thread Jan Kara
On Mon 24-04-17 09:22:45, Jeff Layton wrote:
> In order to get proper error codes from fsync, we must set an error in
> the mapping range when writeback fails.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

So I'm fine with the change but please expand the changelog to something
like:

DAX currently doesn't set errors in the mapping when cache flushing fails
in dax_writeback_mapping_range(). Since this function can get called only
from fsync(2) or sync(2), this is actually as good as it can currently get
since we correctly propagate the error up from dax_writeback_mapping_range()
to filemap_fdatawrite(). However in the future better writeback error
handling will enable us to properly report these errors on fsync(2) even if
there are multiple file descriptors open against the file or if sync(2)
gets called before fsync(2). So convert DAX to using standard error
reporting through the mapping.

After improving the changelog you can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/dax.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 85abd741253d..9b6b04030c3f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -901,8 +901,10 @@ int dax_writeback_mapping_range(struct address_space *mapping,
>  
>   ret = dax_writeback_one(bdev, mapping, indices[i],
>   pvec.pages[i]);
> - if (ret < 0)
> + if (ret < 0) {
> + mapping_set_error(mapping, ret);
>   return ret;
> +     }
>   }
>   }
>   return 0;
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v3 10/20] fuse: set mapping error in writepage_locked when it fails

2017-04-24 Thread Jan Kara
On Mon 24-04-17 09:22:49, Jeff Layton wrote:
> This ensures that we see errors on fsync when writeback fails.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

Hum, but do we really want to clobber mapping errors with temporary stuff
like ENOMEM? Or do you want to handle that in mapping_set_error?

Honza

> ---
>  fs/fuse/file.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index ec238fb5a584..07d0efcb050c 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1669,6 +1669,7 @@ static int fuse_writepage_locked(struct page *page)
>  err_free:
>   fuse_request_free(req);
>  err:
> + mapping_set_error(page->mapping, error);
>   end_page_writeback(page);
>   return error;
>  }
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


[PATCH] block: Add comment to submit_bio_wait()

2017-08-02 Thread Jan Kara
submit_bio_wait() does not consume bio reference. Add comment about
that.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/bio.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 9a63597aaacc..e241bbc49f14 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -936,6 +936,10 @@ static void submit_bio_wait_endio(struct bio *bio)
  *
  * Simple wrapper around submit_bio(). Returns 0 on success, or the error from
  * bio_endio() on failure.
+ *
+ * WARNING: Unlike how submit_bio() is usually used, this function does not
+ * result in the bio reference being consumed. The caller must drop the
+ * reference on its own.
  */
 int submit_bio_wait(struct bio *bio)
 {
-- 
2.12.3



Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait

2017-08-10 Thread Jan Kara
On Thu 10-08-17 06:49:53, Goldwyn Rodrigues wrote:
> On 08/09/2017 09:17 PM, Jens Axboe wrote:
> > On 08/09/2017 08:07 PM, Goldwyn Rodrigues wrote:
> >>>>>>>> No, from a multi-device point of view, this is inconsistent. I
> >>>>>>>> have tried the request bio returns -EAGAIN before the split, but
> >>>>>>>> I shall check again. Where do you see this happening?
> >>>>>>>
> >>>>>>> No, this isn't multi-device specific, any driver can do it.
> >>>>>>> Please see blk_queue_split.
> >>>>>>>
> >>>>>>
> >>>>>> In that case, the bio end_io function is chained and the bio of
> >>>>>> the split will replicate the error to the parent (if not already
> >>>>>> set).
> >>>>>
> >>>>> this doesn't answer my question. So if a bio returns -EAGAIN, part
> >>>>> of the bio probably already dispatched to disk (if the bio is
> >>>>> split into 2 bios, one returns -EAGAIN, the other one doesn't
> >>>>> block and dispatch to disk), what will application be going to do?
> >>>>> I think this is different to other IO errors. FOr other IO errors,
> >>>>> application will handle the error, while we ask app to retry the
> >>>>> whole bio here and app doesn't know part of bio is already written
> >>>>> to disk.
> >>>>
> >>>> It is the same as for other I/O errors as well, such as EIO. You do
> >>>> not know which bio of all submitted bios returned the error EIO.
> >>>> The application would and should consider the whole I/O as failed.
> >>>>
> >>>> The user application does not know of bios, or how it is going to be
> >>>> split in the underlying layers. It knows at the system call level.
> >>>> In this case, the EAGAIN will be returned to the user for the whole
> >>>> I/O not as a part of the I/O. It is up to application to try the I/O
> >>>> again with or without RWF_NOWAIT set. In direct I/O, it is bubbled
> >>>> out using dio->io_error. You can read about it at the patch header
> >>>> for the initial patchset at [1].
> >>>>
> >>>> Use case: It is for applications having two threads, a compute
> >>>> thread and an I/O thread. It would try to push AIO as much as
> >>>> possible in the compute thread using RWF_NOWAIT, and if it fails,
> >>>> would pass it on to I/O thread which would perform without
> >>>> RWF_NOWAIT. End result if done right is you save on context switches
> >>>> and all the synchronization/messaging machinery to perform I/O.
> >>>>
> >>>> [1] http://marc.info/?l=linux-block=149789003305876=2
> >>>
> >>> Yes, I knew the concept, but I didn't see previous patches mentioned
> >>> the -EAGAIN actually should be taken as a real IO error. This means a
> >>> lot to applications and make the API hard to use. I'm wondering if we
> >>> should disable bio split for NOWAIT bio, which will make the -EAGAIN
> >>> only mean 'try again'.
> >>
> >> Don't take it as EAGAIN, but read it as EWOULDBLOCK. Why do you say
> >> the API is hard to use? Do you have a case to back it up?
> > 
> > Because it is hard to use, and potentially suboptimal. Let's say you're
> > doing a 1MB write, we hit EWOULDBLOCK for the last split. Do we return a
> > short write, or do we return EWOULDBLOCK? If the latter, then that
> > really sucks from an API point of view.
> > 
> >> No, not splitting the bio does not make sense here. I do not see any
> >> advantage in it, unless you can present a case otherwise.
> > 
> > It ties back into the "hard to use" that I do agree with IFF we don't
> > return the short write. It's hard for an application to use that
> > efficiently, if we write 1MB-128K but get EWOULDBLOCK, then re-write the
> > full 1MB from a different context.
> > 
> 
> It returns the error code only and not short reads/writes. But isn't
> that true for all system calls in case of error?
> 
> For aio, there are two result fields in io_event out of which one could
> be used for error while the other be used for amount of writes/reads
> performed. However, only one is used. This will not work with
> pread()/pwrite() calls though because of the limitation of return values.
> 
> Finally, what if the EWOULDBLOCK is returned for an earlier bio (say
> offset 128k) for a 1MB pwrite(), while the rest of the 7 128K are
> successful. What short return value should the system call return?

This is indeed tricky. If an application submits a 1MB write, I don't think
we can afford to just write an arbitrary subset of it. IMHO that violates
too much how writes have traditionally behaved. Even short writes trigger
bugs in various applications, but I'm willing to require that applications
using NOWAIT IO can handle those. However, writing an arbitrary subset
looks like a nasty catch. IMHO we should not submit further bios until we
are sure the current one does not return EWOULDBLOCK when splitting a
larger one...

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 17/17] block/cfq: cache rightmost rb_node

2017-07-19 Thread Jan Kara
On Tue 18-07-17 18:46:03, Davidlohr Bueso wrote:
> For the same reasons we already cache the leftmost pointer,
> apply the same optimization for rb_last() calls. Users must
> explicitly do this as rb_root_cached only deals with the
> smallest node.
> 
> Cc: ax...@fb.com
> Cc: linux-block@vger.kernel.org
> Signed-off-by: Davidlohr Bueso <dbu...@suse.de>

Hum, as I read the code, here we have roughly a 1:1 ratio of cached
lookups to tree inserts / deletes where we have to maintain the cached
value. Is the optimization worth it in such a case?

Honza

> ---
> This is part of the rbtree internal caching series:
> https://lwn.net/Articles/726809/
> 
>  block/cfq-iosched.c | 19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 92c31683a2bb..57ec45fd4590 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -94,11 +94,13 @@ struct cfq_ttime {
>   */
>  struct cfq_rb_root {
>   struct rb_root_cached rb;
> + struct rb_node *rb_rightmost;
>   unsigned count;
>   u64 min_vdisktime;
>   struct cfq_ttime ttime;
>  };
>  #define CFQ_RB_ROOT  (struct cfq_rb_root) { .rb = RB_ROOT_CACHED, \
> + .rb_rightmost = NULL,\
>   .ttime = {.last_end_request = ktime_get_ns(),},}
>  
>  /*
> @@ -1183,6 +1185,9 @@ static struct cfq_group *cfq_rb_first_group(struct cfq_rb_root *root)
>  
>  static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
>  {
> + if (root->rb_rightmost == n)
> + root->rb_rightmost = rb_next(n);
> +
>  	rb_erase_cached(n, &root->rb);
>   RB_CLEAR_NODE(n);
>  
> @@ -1239,20 +1244,24 @@ __cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
>   struct rb_node *parent = NULL;
>   struct cfq_group *__cfqg;
>   s64 key = cfqg_key(st, cfqg);
> - bool leftmost = true;
> + bool leftmost = true, rightmost = true;
>  
>   while (*node != NULL) {
>   parent = *node;
>   __cfqg = rb_entry_cfqg(parent);
>  
> -		if (key < cfqg_key(st, __cfqg))
> +		if (key < cfqg_key(st, __cfqg)) {
>  			node = &parent->rb_left;
> -		else {
> +			rightmost = false;
> +		} else {
>  			node = &parent->rb_right;
>  			leftmost = false;
>  		}
>  	}
>  
> +	if (rightmost)
> +		st->rb_rightmost = &cfqg->rb_node;
> +
>  	rb_link_node(&cfqg->rb_node, parent, node);
>  	rb_insert_color_cached(&cfqg->rb_node, &st->rb, leftmost);
>  }
> @@ -1355,7 +1364,7 @@ cfq_group_notify_queue_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
>* so that groups get lesser vtime based on their weights, so that
>* if group does not loose all if it was not continuously backlogged.
>*/
> -	n = rb_last(&st->rb.rb_root);
> + n = st->rb_rightmost;
>   if (n) {
>   __cfqg = rb_entry_cfqg(n);
>   cfqg->vdisktime = __cfqg->vdisktime +
> @@ -2204,7 +2213,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>   st = st_for(cfqq->cfqg, cfqq_class(cfqq), cfqq_type(cfqq));
>   if (cfq_class_idle(cfqq)) {
>       rb_key = CFQ_IDLE_DELAY;
> -		parent = rb_last(&st->rb.rb_root);
> +		parent = st->rb_rightmost;
>  		if (parent && parent != &cfqq->rb_node) {
>   __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
>   rb_key += __cfqq->rb_key;
> -- 
> 2.12.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 10/17] block/cfq: replace cfq_rb_root leftmost caching

2017-07-19 Thread Jan Kara
On Tue 18-07-17 18:45:56, Davidlohr Bueso wrote:
> ... with the generic rbtree flavor instead. No changes
> in semantics whatsoever.
> 
> Cc: ax...@fb.com
> Cc: linux-block@vger.kernel.org
> Acked-by: Peter Zijlstra (Intel) <pet...@infradead.org>
> Signed-off-by: Davidlohr Bueso <dbu...@suse.de>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  block/cfq-iosched.c | 70 ++++++++++++++++++++--------------------------------------------------
>  1 file changed, 20 insertions(+), 50 deletions(-)
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 3d5c28945719..92c31683a2bb 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -93,13 +93,12 @@ struct cfq_ttime {
>   * move this into the elevator for the rq sorting as well.
>   */
>  struct cfq_rb_root {
> - struct rb_root rb;
> - struct rb_node *left;
> + struct rb_root_cached rb;
>   unsigned count;
>   u64 min_vdisktime;
>   struct cfq_ttime ttime;
>  };
> -#define CFQ_RB_ROOT  (struct cfq_rb_root) { .rb = RB_ROOT, \
> +#define CFQ_RB_ROOT  (struct cfq_rb_root) { .rb = RB_ROOT_CACHED, \
>   .ttime = {.last_end_request = ktime_get_ns(),},}
>  
>  /*
> @@ -984,10 +983,9 @@ static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
>  
>  static void update_min_vdisktime(struct cfq_rb_root *st)
>  {
> - struct cfq_group *cfqg;
> +	if (!RB_EMPTY_ROOT(&st->rb.rb_root)) {
> + struct cfq_group *cfqg = rb_entry_cfqg(st->rb.rb_leftmost);
>  
> - if (st->left) {
> - cfqg = rb_entry_cfqg(st->left);
>   st->min_vdisktime = max_vdisktime(st->min_vdisktime,
> cfqg->vdisktime);
>   }
> @@ -1169,46 +1167,25 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2,
>   }
>  }
>  
> -/*
> - * The below is leftmost cache rbtree addon
> - */
>  static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
>  {
>   /* Service tree is empty */
>   if (!root->count)
>   return NULL;
>  
> -	if (!root->left)
> -		root->left = rb_first(&root->rb);
> -
> -	if (root->left)
> -		return rb_entry(root->left, struct cfq_queue, rb_node);
> -
> -	return NULL;
> +	return rb_entry(rb_first_cached(&root->rb), struct cfq_queue, rb_node);
>  }
>  
>  static struct cfq_group *cfq_rb_first_group(struct cfq_rb_root *root)
>  {
> -	if (!root->left)
> -		root->left = rb_first(&root->rb);
> -
> -	if (root->left)
> -		return rb_entry_cfqg(root->left);
> -
> -	return NULL;
> +	return rb_entry_cfqg(rb_first_cached(&root->rb));
>  }
>  
> -static void rb_erase_init(struct rb_node *n, struct rb_root *root)
> +static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
>  {
> -	rb_erase(n, root);
> +	rb_erase_cached(n, &root->rb);
>  	RB_CLEAR_NODE(n);
> -}
>  
> -static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
> -{
> -	if (root->left == n)
> -		root->left = NULL;
> -	rb_erase_init(n, &root->rb);
>  	--root->count;
>  }
>  
> @@ -1258,11 +1235,11 @@ cfqg_key(struct cfq_rb_root *st, struct cfq_group *cfqg)
>  static void
>  __cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
>  {
> -	struct rb_node **node = &st->rb.rb_node;
> +	struct rb_node **node = &st->rb.rb_root.rb_node;
>   struct rb_node *parent = NULL;
>   struct cfq_group *__cfqg;
>   s64 key = cfqg_key(st, cfqg);
> - int left = 1;
> + bool leftmost = true;
>  
>   while (*node != NULL) {
>   parent = *node;
> @@ -1272,15 +1249,12 @@ __cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
>  			node = &parent->rb_left;
>  		else {
>  			node = &parent->rb_right;
> -			left = 0;
> +			leftmost = false;
>  		}
>  	}
>  
> -	if (left)
> -		st->left = &cfqg->rb_node;
> -
>  	rb_link_node(&cfqg->rb_node, parent, node);
> -	rb_insert_color(&cfqg->rb_node, &st->rb);
> +	rb_insert_color_cached(&cfqg->rb_node, &st->rb, leftmost);
>  }
>  
>  /*
> @@ -1381,7 +1355,7 @@ cfq_group_notify_queue_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
>* so that groups get lesser vtime based on their weights, so that
>* if group does not 

[PATCH 01/19] bcache: Fix leak of bdev reference

2017-06-30 Thread Jan Kara
If blkdev_get_by_path() in register_bcache() fails, we try to lookup the
block device using lookup_bdev() to detect which situation we are in to
properly report error. However we never drop the reference returned to
us from lookup_bdev(). Fix that.

Signed-off-by: Jan Kara <j...@suse.cz>
Cc: sta...@vger.kernel.org
---
 drivers/md/bcache/super.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 8352fad..9a2c190 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1964,6 +1964,8 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			else
 				err = "device busy";
 			mutex_unlock(&bch_register_lock);
+			if (!IS_ERR(bdev))
+				bdput(bdev);
 			if (attr == &ksysfs_register_quiet)
 				goto out;
 		}
-- 
1.8.3.1




Re: [PATCH v3 10/20] fuse: set mapping error in writepage_locked when it fails

2017-04-25 Thread Jan Kara
On Mon 24-04-17 13:14:36, Jeff Layton wrote:
> On Mon, 2017-04-24 at 18:04 +0200, Jan Kara wrote:
> > On Mon 24-04-17 09:22:49, Jeff Layton wrote:
> > > This ensures that we see errors on fsync when writeback fails.
> > > 
> > > Signed-off-by: Jeff Layton <jlay...@redhat.com>
> > 
> > Hum, but do we really want to clobber mapping errors with temporary stuff
> > like ENOMEM? Or do you want to handle that in mapping_set_error?
> > 
> 
> Right now we don't really have such a thing as temporary errors in the
> writeback codepath. If you return an error here, the data doesn't stay
> dirty or anything, and I think we want to ensure that that gets reported
> via fsync.
> 
> I'd like to see us add better handling for retryable errors for stuff
> like ENOMEM or EAGAIN. I think this is the first step toward that
> though. Once we have more consistent handling of writeback errors in
> general, then we can start doing more interesting things with retryable
> errors.
> 
> So yeah, I think this is the right thing to do for now.

OK, fair enough. And question number 2):

Who is actually responsible for setting the error in the mapping when an
error happens inside ->writepage()? Is it the ->writepage() callback or the
caller of ->writepage()? Or something else? Currently it seems to be a
strange mix (e.g. mm/page-writeback.c: __writepage() calls
mapping_set_error() when ->writepage() returns error) so I'd like to
understand what's the plan and have that recorded in the changelogs.

Honza

> 
> > 
> > > ---
> > >  fs/fuse/file.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > > index ec238fb5a584..07d0efcb050c 100644
> > > --- a/fs/fuse/file.c
> > > +++ b/fs/fuse/file.c
> > > @@ -1669,6 +1669,7 @@ static int fuse_writepage_locked(struct page *page)
> > >  err_free:
> > >   fuse_request_free(req);
> > >  err:
> > > + mapping_set_error(page->mapping, error);
> > >   end_page_writeback(page);
> > >   return error;
> > >  }
> > > -- 
> > > 2.9.3
> > > 
> > > 
> 
> -- 
> Jeff Layton <jlay...@redhat.com>
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v3 10/20] fuse: set mapping error in writepage_locked when it fails

2017-04-25 Thread Jan Kara
On Tue 25-04-17 06:35:13, Jeff Layton wrote:
> On Tue, 2017-04-25 at 10:17 +0200, Jan Kara wrote:
> > On Mon 24-04-17 13:14:36, Jeff Layton wrote:
> > > On Mon, 2017-04-24 at 18:04 +0200, Jan Kara wrote:
> > > > On Mon 24-04-17 09:22:49, Jeff Layton wrote:
> > > > > This ensures that we see errors on fsync when writeback fails.
> > > > > 
> > > > > Signed-off-by: Jeff Layton <jlay...@redhat.com>
> > > > 
> > > > Hum, but do we really want to clobber mapping errors with temporary 
> > > > stuff
> > > > like ENOMEM? Or do you want to handle that in mapping_set_error?
> > > > 
> > > 
> > > Right now we don't really have such a thing as temporary errors in the
> > > writeback codepath. If you return an error here, the data doesn't stay
> > > dirty or anything, and I think we want to ensure that that gets reported
> > > via fsync.
> > > 
> > > I'd like to see us add better handling for retryable errors for stuff
> > > like ENOMEM or EAGAIN. I think this is the first step toward that
> > > though. Once we have more consistent handling of writeback errors in
> > > general, then we can start doing more interesting things with retryable
> > > errors.
> > > 
> > > So yeah, I think this is the right thing to do for now.
> > 
> > OK, fair enough. And question number 2):
> > 
> > Who is actually responsible for setting the error in the mapping when an
> > error happens inside ->writepage()? Is it the ->writepage() callback or the
> > caller of ->writepage()? Or something else? Currently it seems to be a
> > strange mix (e.g. mm/page-writeback.c: __writepage() calls
> > mapping_set_error() when ->writepage() returns error) so I'd like to
> > understand what's the plan and have that recorded in the changelogs.
> > 
> 
> That's an excellent question.
> 
> I think we probably want the writepage/launder_page operations to call
> mapping_set_error. That makes it possible for filesystems (e.g. NFS) to
> handle their own error tracking and reporting without using the new
> infrastructure. If they never call mapping_set_error then we'll always
> just return whatever their ->fsync operation returns on an fsync.

OK, makes sense. It is also in line with what you did for DAX, 9p, or here
for FUSE. So feel free to add:

Reviewed-by: Jan Kara <j...@suse.cz>

for this patch but please also add a sentence that ->writepage() is
responsible for calling mapping_set_error() if it fails and the page is not
redirtied to the changelogs of patches changing writepage handlers.

> I'll make another pass through the tree and see whether we have some
> mapping_set_error calls that should be removed, and will flesh out
> vfs.txt to state this. Maybe that file needs a whole section on
> writeback error reporting? Hmmm...

I think it would be nice to have all the logic described in one place. So
+1 from me.

> That probably also means that I should drop patch 8 from this series
> (mm: ensure that we set mapping error if writeout fails), since that
> should be happening in writepage already.

Yes.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v4 15/27] fs: retrofit old error reporting API onto new infrastructure

2017-05-15 Thread Jan Kara
On Tue 09-05-17 11:49:18, Jeff Layton wrote:
> Now that we have a better way to store and report errors that occur
> during writeback, we need to convert the existing codebase to use it. We
> could just adapt all of the filesystem code and related infrastructure
> to the new API, but that's a lot of churn.
> 
> When it comes to setting errors in the mapping, filemap_set_wb_error is
> a drop-in replacement for mapping_set_error. Turn that function into a
> simple wrapper around the new one.
> 
> Because we want to ensure that writeback errors are always reported at
> fsync time, inject filemap_report_wb_error calls much closer to the
> syscall boundary, in call_fsync.
> 
> For fsync calls (and things like the nfsd equivalent), we either return
> the error that the fsync operation returns, or the one returned by
> filemap_report_wb_error. In both cases, we advance the file->f_wb_err to
> the latest value. This allows us to provide new fsync semantics that
> will return errors that may have occurred previously and been viewed
> via other file descriptors.
> 
> The final piece of the puzzle is what to do about filemap_check_errors
> calls that are being called directly or via filemap_* functions. Here,
> we must take a little "creative license".
> 
> Since we now handle advancing the file->f_wb_err value at the generic
> filesystem layer, we no longer need those callers to clear errors out
> of the mapping or advance an errseq_t.
> 
> A lot of the existing codebase relies on getting an error back
> from those functions when there is a writeback problem, so we do still
> want to have them report writeback errors somehow.
> 
> When reporting writeback errors, we will always report errors that have
> occurred since a particular point in time. With the old writeback error
> reporting, the time we used was "since it was last tested/cleared" which
> is entirely arbitrary and potentially racy. Now, we can at least report
> the latest error that has occurred since an arbitrary point in time
> (represented as a sampled errseq_t value).
> 
> In the case where we don't have a struct file to work with, this patch
> just has the wrappers sample the current mapping->wb_err value, and use
> that as an arbitrary point from which to check for errors.

I think this is really dangerous and we shouldn't do this. You are quite
likely to lose IO errors in such calls because you will ignore all errors
that happened during previous background writeback or even for IO that
managed to complete before we called filemap_fdatawait(). Maybe we need to
keep the original set-clear-bit IO error reporting for these cases, until
we can convert them to fdatawait_range_since()?

> That's probably not "correct" in all cases, particularly in the case of
> something like filemap_fdatawait, but I'm not sure it's any worse than
> what we already have, and this gives us a basis from which to work.
> 
> A lot of those callers will likely want to change to a model where they
> sample the errseq_t much earlier (perhaps when starting a transaction),
> store it in an appropriate place and then use that value later when
> checking to see if an error occurred.
> 
> That will almost certainly take some involvement from other subsystem
> maintainers. I'm quite open to adding new API functions to help enable
> this if that would be helpful, but I don't really want to do that until
> I better understand what's needed.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

...

> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index 5f7317875a67..7ce13281925f 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -187,6 +187,7 @@ static int f2fs_do_sync_file(struct file *file, loff_t 
> start, loff_t end,
>   .nr_to_write = LONG_MAX,
>   .for_reclaim = 0,
>   };
> + errseq_t since = READ_ONCE(file->f_wb_err);
>  
>   if (unlikely(f2fs_readonly(inode->i_sb)))
>   return 0;
> @@ -265,6 +266,8 @@ static int f2fs_do_sync_file(struct file *file, loff_t 
> start, loff_t end,
>   }
>  
>   ret = wait_on_node_pages_writeback(sbi, ino);
> + if (ret == 0)
> + ret = filemap_check_wb_error(NODE_MAPPING(sbi), since);
>   if (ret)
>   goto out;

So this conversion looks wrong and actually points to a larger issue with
the scheme. The problem is there are two mappings that come into play here
- file_inode(file)->i_mapping which is the data mapping and
NODE_MAPPING(sbi) which is the metadata mapping (and this is not a problem
specific to f2fs. For example ext2 also uses this scheme where block
devices' mapping is the metadata mapping). And we need to merge error
information from these two mapping

Re: [PATCH v4 21/27] mm: clean up error handling in write_one_page

2017-05-15 Thread Jan Kara
On Tue 09-05-17 11:49:24, Jeff Layton wrote:
> Don't try to check PageError since that's potentially racy and not
> necessarily going to be set after writepage errors out.
> 
> Instead, sample the mapping error early on, and use that value to tell
> us whether we got a writeback error since then.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  mm/page-writeback.c | 11 +--
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index de0dbf12e2c1..1643456881b4 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2373,11 +2373,12 @@ int do_writepages(struct address_space *mapping, 
> struct writeback_control *wbc)
>  int write_one_page(struct page *page)
>  {
>   struct address_space *mapping = page->mapping;
> - int ret = 0;
> + int ret = 0, ret2;
>   struct writeback_control wbc = {
>   .sync_mode = WB_SYNC_ALL,
>   .nr_to_write = 1,
>   };
> + errseq_t since = filemap_sample_wb_error(mapping);
>  
>   BUG_ON(!PageLocked(page));
>  
> @@ -2386,16 +2387,14 @@ int write_one_page(struct page *page)
>   if (clear_page_dirty_for_io(page)) {
>   get_page(page);
>   ret = mapping->a_ops->writepage(page, );
> - if (ret == 0) {
> + if (ret == 0)
>   wait_on_page_writeback(page);
> - if (PageError(page))
> - ret = -EIO;
> - }
>   put_page(page);
>   } else {
>   unlock_page(page);
>   }
> - return ret;
> + ret2 = filemap_check_wb_error(mapping, since);
> + return ret ? : ret2;
>  }
>  EXPORT_SYMBOL(write_one_page);
>  
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v4 19/27] buffer: set errors in mapping at the time that the error occurs

2017-05-15 Thread Jan Kara
On Tue 09-05-17 11:49:22, Jeff Layton wrote:
> I noticed on xfs that I could still sometimes get back an error on fsync
> on a fd that was opened after the error condition had been cleared.
> 
> The problem is that the buffer code sets the write_io_error flag and
> then later checks that flag to set the error in the mapping. That flag
> perisists for quite a while however. If the file is later opened with
> O_TRUNC, the buffers will then be invalidated and the mapping's error
> set such that a subsequent fsync will return error. I think this is
> incorrect, as there was no writeback between the open and fsync.
> 
> Add a new mark_buffer_write_io_error operation that sets the flag and
> the error in the mapping at the same time. Replace all calls to
> set_buffer_write_io_error with mark_buffer_write_io_error, and remove
> the places that check this flag in order to set the error in the
> mapping.
> 
> This sets the error in the mapping earlier, at the time that it's first
> detected.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Small nits below.

> @@ -354,7 +354,7 @@ void end_buffer_async_write(struct buffer_head *bh, int 
> uptodate)
>   } else {
>   buffer_io_error(bh, ", lost async page write");
>   mapping_set_error(page->mapping, -EIO);
> - set_buffer_write_io_error(bh);
> + mark_buffer_write_io_error(bh);

No need to call mapping_set_error() here when it gets called in
mark_buffer_write_io_error() again?

> @@ -1182,6 +1180,17 @@ void mark_buffer_dirty(struct buffer_head *bh)
>  }
>  EXPORT_SYMBOL(mark_buffer_dirty);
>  
> +void mark_buffer_write_io_error(struct buffer_head *bh)
> +{
> + set_buffer_write_io_error(bh);
> + /* FIXME: do we need to set this in both places? */
> + if (bh->b_page && bh->b_page->mapping)
> + mapping_set_error(bh->b_page->mapping, -EIO);
> + if (bh->b_assoc_map)
> + mapping_set_error(bh->b_assoc_map, -EIO);
> +}
> +EXPORT_SYMBOL(mark_buffer_write_io_error);

So buffers that are shared by several inodes cannot have bh->b_assoc_map
set. So for filesystems that have metadata like this setting in
bh->b_assoc_map doesn't really help and they have to check blockdevice's
mapping anyway. OTOH if filesystem doesn't have such type of metadata
relevant for fsync, this could help it. So maybe it is worth it.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 04/10] fs: Introduce RWF_NOWAIT

2017-05-15 Thread Jan Kara
On Thu 11-05-17 14:17:04, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgold...@suse.com>
> 
> RWF_NOWAIT informs kernel to bail out if an AIO request will block
> for reasons such as file allocations, or a writeback triggered,
> or would block while allocating requests while performing
> direct I/O.
> 
> RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags.
> 
> The check for -EOPNOTSUPP is placed in generic_file_write_iter(). This
> is called by most filesystems, either through fsops.write_iter() or through
> the function defined by write_iter(). If not, we perform the check defined
> by .write_iter() which is called for direct IO specifically.
> 
> Filesystems xfs, btrfs and ext4 would be supported in the following patches.
...
> diff --git a/fs/aio.c b/fs/aio.c
> index 020fa0045e3c..34027b67e2f4 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1592,6 +1592,12 @@ static int io_submit_one(struct kioctx *ctx, struct 
> iocb __user *user_iocb,
>   goto out_put_req;
>   }
>  
> + if ((req->common.ki_flags & IOCB_NOWAIT) &&
> + !(req->common.ki_flags & IOCB_DIRECT)) {
> + ret = -EOPNOTSUPP;
> + goto out_put_req;
> + }
> +
>   ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
>   if (unlikely(ret)) {
>   pr_debug("EFAULT: aio_key\n");

I think you need to also check here that the IO is write. So that NOWAIT
reads don't silently pass.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v4 11/27] fuse: set mapping error in writepage_locked when it fails

2017-05-10 Thread Jan Kara
On Tue 09-05-17 11:49:14, Jeff Layton wrote:
> This ensures that we see errors on fsync when writeback fails.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> Reviewed-by: Christoph Hellwig <h...@lst.de>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/fuse/file.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index ec238fb5a584..07d0efcb050c 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1669,6 +1669,7 @@ static int fuse_writepage_locked(struct page *page)
>  err_free:
>   fuse_request_free(req);
>  err:
> + mapping_set_error(page->mapping, error);
>   end_page_writeback(page);
>   return error;
>  }
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v4 13/27] lib: add errseq_t type and infrastructure for handling it

2017-05-10 Thread Jan Kara
On Tue 09-05-17 11:49:16, Jeff Layton wrote:
> An errseq_t is a way of recording errors in one place, and allowing any
> number of "subscribers" to tell whether an error has been set again
> since a previous time.
> 
> It's implemented as an unsigned 32-bit value that is managed with atomic
> operations. The low order bits are designated to hold an error code
> (max size of MAX_ERRNO). The upper bits are used as a counter.
> 
> The API works with consumers sampling an errseq_t value at a particular
> point in time. Later, that value can be used to tell whether new errors
> have been set since that time.
> 
> Note that there is a 1 in 512k risk of collisions here if new errors
> are being recorded frequently, since we have so few bits to use as a
> counter. To mitigate this, one bit is used as a flag to tell whether the
> value has been sampled since a new value was recorded. That allows
> us to avoid bumping the counter if no one has sampled it since it
> was last bumped.
> 
> Later patches will build on this infrastructure to change how writeback
> errors are tracked in the kernel.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
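As a rough illustration of the scheme described in the commit message above -- low bits holding the error, a "seen" bit, and a counter bumped only when the value has been sampled -- here is a simplified, single-threaded C sketch. The kernel's lib/errseq.c uses atomic cmpxchg loops; the names and bit layout here are illustrative, not the kernel's exact ones:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified sketch of the errseq_t idea. Low bits hold a positive errno,
 * one bit marks "sampled since last change", and the remaining high bits
 * form a counter that is bumped only when someone has sampled the value.
 */
#define ERR_MASK  0xfffu    /* room for an errno value */
#define SEEN_FLAG 0x1000u   /* set once someone samples the value */
#define CTR_STEP  0x2000u   /* lowest bit of the counter */

typedef uint32_t errseq_t;

/* Record a (positive) errno; bump the counter only if the old value was seen. */
static errseq_t errseq_set(errseq_t cur, unsigned err)
{
	errseq_t nv = (cur & ~(ERR_MASK | SEEN_FLAG)) | (err & ERR_MASK);

	if (cur & SEEN_FLAG)
		nv += CTR_STEP;	/* old value was seen: make the new one distinct */
	return nv;
}

/* Sample the current value as a "since" point, marking it as seen. */
static errseq_t errseq_sample(errseq_t *eseq)
{
	*eseq |= SEEN_FLAG;
	return *eseq;
}

/* Report a negative errno if anything was recorded after "since". */
static int errseq_check(errseq_t cur, errseq_t since)
{
	return (cur == since) ? 0 : -(int)(cur & ERR_MASK);
}
```

Because the counter only advances when the previous value was sampled, frequent unobserved errors do not burn through the counter space, which is the collision mitigation the commit message describes.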

The patch looks good to me. Feel free to add:

Reviewed-by: Jan Kara <j...@suse.cz>

Just two nits below:
...
> +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
> +{
> + int err = 0;
> + errseq_t old, new;
> +
> + /*
> +  * Most callers will want to use the inline wrapper to check this,
> +  * so that the common case of no error is handled without needing
> +  * to lock.
> +  */

I'm not sure which locking you are speaking about here. Is the comment
stale?

> + old = READ_ONCE(*eseq);
> + if (old != *since) {
> + /*
> +  * Set the flag and try to swap it into place if it has
> +  * changed.
> +  *
> +  * We don't care about the outcome of the swap here. If the
> +  * swap doesn't occur, then it has either been updated by a
> +  * writer who is bumping the seq count anyway, or another

"bumping the seq count anyway" part is not quite true. Writer may see
ERRSEQ_SEEN not set and so just update the error code and leave seq count
as is. But since you compare full errseq_t for equality, this works out as
designed...

> +  * reader who is just setting the "seen" flag. Either outcome
> +  * is OK, and we can advance "since" and return an error based
> +  * on what we have.
> +  */
> + new = old | ERRSEQ_SEEN;
> + if (new != old)
> + cmpxchg(eseq, old, new);
> + *since = new;
> + err = -(new & MAX_ERRNO);
> + }
> + return err;
> +}
> +EXPORT_SYMBOL(errseq_check_and_advance);

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v4 14/27] fs: new infrastructure for writeback error handling and reporting

2017-05-10 Thread Jan Kara
On Tue 09-05-17 11:49:17, Jeff Layton wrote:
> Most filesystems currently use mapping_set_error and
> filemap_check_errors for setting and reporting/clearing writeback errors
> at the mapping level. filemap_check_errors is indirectly called from
> most of the filemap_fdatawait_* functions and from
> filemap_write_and_wait*. These functions are called from all sorts of
> contexts to wait on writeback to finish -- e.g. mostly in fsync, but
> also in truncate calls, getattr, etc.
> 
> The non-fsync callers are problematic. We should be reporting writeback
> errors during fsync, but many places spread over the tree clear out
> errors before they can be properly reported, or report errors at
> nonsensical times.
> 
> If I get -EIO on a stat() call, there is no reason for me to assume that
> it is because some previous writeback failed. The fact that it also
> clears out the error such that a subsequent fsync returns 0 is a bug,
> and a nasty one since that's potentially silent data corruption.
> 
> This patch adds a small bit of new infrastructure for setting and
> reporting errors during address_space writeback. While the above was my
> original impetus for adding this, I think it's also the case that
> current fsync semantics are just problematic for userland. Most
> applications that call fsync do so to ensure that the data they wrote
> has hit the backing store.
> 
> In the case where there are multiple writers to the file at the same
> time, this is really hard to determine. The first one to call fsync will
> see any stored error, and the rest get back 0. The processes with open
> fds may not be associated with one another in any way. They could even
> be in different containers, so ensuring coordination between all fsync
> callers is not really an option.
> 
> One way to remedy this would be to track what file descriptor was used
> to dirty the file, but that's rather cumbersome and would likely be
> slow. However, there is a simpler way to improve the semantics here
> without incurring too much overhead.
> 
> This set adds an errseq_t to struct address_space, and a corresponding
> one is added to struct file. Writeback errors are recorded in the
> mapping's errseq_t, and the one in struct file is used as the "since"
> value.
> 
> This changes the semantics of the Linux fsync implementation such that
> applications can now use it to determine whether there were any
> writeback errors since fsync(fd) was last called (or since the file was
> opened in the case of fsync having never been called).
> 
> Note that those writeback errors may have occurred when writing data
> that was dirtied via an entirely different fd, but that's the case now
> with the current mapping_set_error/filemap_check_error infrastructure.
> This will at least prevent you from getting a false report of success.
> 
> The new behavior is still consistent with the POSIX spec, and is more
> reliable for application developers. This patch just adds some basic
> infrastructure for doing this. Later patches will change the existing
> code to use this new infrastructure.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
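The per-file "since" value described above can be sketched with a toy model: one shared error record in the mapping and an independent cursor per open file, so each fd reports errors that happened after its own last check. The structures and names below are illustrative, not the kernel's:

```c
#include <assert.h>

/*
 * Toy model of per-fd writeback error reporting: the mapping keeps a
 * counter-style error record, each open file keeps its own snapshot,
 * and fsync on any fd compares and advances only its own cursor.
 */
struct mapping  { unsigned wb_err; };	/* bumped on each new writeback error */
struct file_ctx { unsigned f_wb_err; };	/* per-fd "since" snapshot */

static void record_error(struct mapping *m)
{
	m->wb_err++;	/* a writeback error occurred */
}

static int fsync_check(struct mapping *m, struct file_ctx *f)
{
	int err = (f->f_wb_err != m->wb_err) ? -5 /* -EIO */ : 0;

	f->f_wb_err = m->wb_err;	/* advance this fd's cursor */
	return err;
}
```

Unlike the old AS_EIO flag, checking the error on one fd does not consume it for the others: every fd opened before the error sees it exactly once.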

Just one nit below. Otherwise the patch looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

> diff --git a/fs/file_table.c b/fs/file_table.c
> index 954d510b765a..d6138b6411ff 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -168,6 +168,7 @@ struct file *alloc_file(const struct path *path, fmode_t 
> mode,
>   file->f_path = *path;
>   file->f_inode = path->dentry->d_inode;
>   file->f_mapping = path->dentry->d_inode->i_mapping;
> + file->f_wb_err = filemap_sample_wb_error(file->f_mapping);

Why do you sample here when you also sample in do_dentry_open()? I didn't
find any alloc_file() callers that would possibly care about writeback
errors... 

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v4 01/27] fs: remove unneeded forward definition of mm_struct from fs.h

2017-05-10 Thread Jan Kara
On Tue 09-05-17 11:49:04, Jeff Layton wrote:
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  include/linux/fs.h | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7251f7bb45e8..38adefd8e2a0 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1252,8 +1252,6 @@ extern void f_delown(struct file *filp);
>  extern pid_t f_getown(struct file *filp);
>  extern int send_sigurg(struct fown_struct *fown);
>  
> -struct mm_struct;
> -
>  /*
>   *   Umount options
>   */
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v4 12/27] cifs: set mapping error when page writeback fails in writepage or launder_pages

2017-05-10 Thread Jan Kara
On Tue 09-05-17 11:49:15, Jeff Layton wrote:
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> Reviewed-by: Christoph Hellwig <h...@lst.de>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/cifs/file.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 21d404535739..0bee7f8d91ad 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -2234,14 +2234,16 @@ cifs_writepage_locked(struct page *page, struct 
> writeback_control *wbc)
>   set_page_writeback(page);
>  retry_write:
>   rc = cifs_partialpagewrite(page, 0, PAGE_SIZE);
> - if (rc == -EAGAIN && wbc->sync_mode == WB_SYNC_ALL)
> - goto retry_write;
> - else if (rc == -EAGAIN)
> + if (rc == -EAGAIN) {
> + if (wbc->sync_mode == WB_SYNC_ALL)
> + goto retry_write;
>   redirty_page_for_writepage(wbc, page);
> - else if (rc != 0)
> + } else if (rc != 0) {
>   SetPageError(page);
> - else
> + mapping_set_error(page->mapping, rc);
> + } else {
>   SetPageUptodate(page);
> + }
>   end_page_writeback(page);
>   put_page(page);
>   free_xid(xid);
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 16/25] fuse: Convert to separately allocated bdi

2017-05-16 Thread Jan Kara
On Mon 15-05-17 23:34:00, Rakesh Pandit wrote:
> Hi Jan, Miklos,
> 
> On Wed, Apr 12, 2017 at 12:24:40PM +0200, Jan Kara wrote:
> > Allocate struct backing_dev_info separately instead of embedding it
> > inside the superblock. This unifies handling of bdi among users.
> >
> > CC: Miklos Szeredi <mik...@szeredi.hu>
> > CC: linux-fsde...@vger.kernel.org
> > Acked-by: Miklos Szeredi <mszer...@redhat.com>
> > Reviewed-by: Christoph Hellwig <h...@lst.de>
> > Signed-off-by: Jan Kara <j...@suse.cz>
> 
> ...
> 
> >  static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb)
> >  {
> > int err;
> > +   char *suffix = "";
> >  
> > -   fc->bdi.name = "fuse";
> > -   fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_SIZE;
> > -   /* fuse does it's own writeback accounting */
> > -   fc->bdi.capabilities = BDI_CAP_NO_ACCT_WB | BDI_CAP_STRICTLIMIT;
> > -
> > -   err = bdi_init(>bdi);
> > +   if (sb->s_bdev)
> > +   suffix = "-fuseblk";
> > +   err = super_setup_bdi_name(sb, "%u:%u%s", MAJOR(fc->dev),
> > +  MINOR(fc->dev), suffix);
> > if (err)
> > return err;
> >
> 
> This call to super_setup_bdi_name would only work with "fuse" but not
> with "fuseblk" as mounting a block device in userspace triggers
> mount_bdev call which results in set_bdev_super taking a reference
> from block device's BDI.  But super_setup_bdi_name allocates a new bdi
> and ignores the already existing reference which triggers:
> 
> WARN_ON(sb->s_bdi != &noop_backing_dev_info);
> 
> as sb->s_bdi already has a reference from set_bdev_super.  This works
> for "fuse" (without a blocking device) for obvious reasons.  I can
> reproduce this on -rc1 and also found a report on lkml:
> https://lkml.org/lkml/2017/5/2/445
> 
> The only sane solution seems to be maintaining a private bdi instance just
> for fuseblk and let fuse use the common new infrastructure.

Thanks for analysis! Does the attached patch fix the warning for you?

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
From 5b0cfc37b45670a35228c96cbaee2b99cd3d447c Mon Sep 17 00:00:00 2001
From: Jan Kara <j...@suse.cz>
Date: Tue, 16 May 2017 12:22:22 +0200
Subject: [PATCH] fuseblk: Fix warning in super_setup_bdi_name()

Commit 5f7f7543f52e "fuse: Convert to separately allocated bdi" didn't
properly handle fuseblk filesystem. When fuse_bdi_init() is called for
that filesystem type, sb->s_bdi is already initialized (by
set_bdev_super()) to point to block device's bdi and consequently
super_setup_bdi_name() complains about this fact when resetting bdi to
the private one.

Fix the problem by properly dropping bdi reference in fuse_bdi_init()
before creating a private bdi in super_setup_bdi_name().

Fixes: 5f7f7543f52eee03ed35c9d671fbb1cdbd4bc9b5
Reported-by: Rakesh Pandit <rak...@tuxera.com>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/fuse/inode.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 5a1b58f8fef4..65c88379a3a1 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -975,8 +975,15 @@ static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb)
 	int err;
 	char *suffix = "";
 
-	if (sb->s_bdev)
+	if (sb->s_bdev) {
 		suffix = "-fuseblk";
+		/*
+		 * sb->s_bdi points to blkdev's bdi however we want to redirect
+		 * it to our private bdi...
+		 */
+		bdi_put(sb->s_bdi);
+		sb->s_bdi = &noop_backing_dev_info;
+	}
 	err = super_setup_bdi_name(sb, "%u:%u%s", MAJOR(fc->dev),
    MINOR(fc->dev), suffix);
 	if (err)
-- 
2.12.0



Re: [PATCH 6/8] nowait aio: ext4

2017-05-09 Thread Jan Kara
On Tue 09-05-17 07:22:17, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgold...@suse.com>
> 
> Return EAGAIN if any of the following checks fail for direct I/O:
>   + i_rwsem is lockable
>   + Writing beyond end of file (will trigger allocation)
>   + Blocks are not allocated at the write location
> 
> Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>

The patch looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/ext4/file.c | 20 
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index cefa9835f275..2efdc6d4d3e8 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -216,7 +216,13 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
> *from)
>   return ext4_dax_write_iter(iocb, from);
>  #endif
>  
> - inode_lock(inode);
> + if (iocb->ki_flags & IOCB_NOWAIT) {
> + if (!inode_trylock(inode))
> + return -EAGAIN;
> + } else {
> + inode_lock(inode);
> + }
> +
>   ret = ext4_write_checks(iocb, from);
>   if (ret <= 0)
>   goto out;
> @@ -235,9 +241,15 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
> *from)
>  
>   iocb->private = &overwrite;
>   /* Check whether we do a DIO overwrite or not */
> - if (o_direct && ext4_should_dioread_nolock(inode) && !unaligned_aio &&
> - ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from)))
> - overwrite = 1;
> + if (o_direct && !unaligned_aio) {
> + if (ext4_overwrite_io(inode, iocb->ki_pos, 
> iov_iter_count(from))) {
> + if (ext4_should_dioread_nolock(inode))
> + overwrite = 1;
> + } else if (iocb->ki_flags & IOCB_NOWAIT) {
> + ret = -EAGAIN;
> + goto out;
> + }
> + }
>  
>   ret = __generic_file_write_iter(iocb, from);
>   inode_unlock(inode);
> -- 
> 2.12.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
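The IOCB_NOWAIT locking pattern reviewed above -- try the lock and fail fast with -EAGAIN rather than sleep -- can be illustrated in userspace C. This is a sketch only; the kernel path uses inode_trylock()/inode_lock(), and the spin below merely stands in for the blocking case:

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Userspace sketch of the nowait pattern: a contended lock makes the
 * nonblocking caller bail with -EAGAIN instead of waiting. */
typedef atomic_flag simple_lock_t;

static int simple_trylock(simple_lock_t *l)
{
	return !atomic_flag_test_and_set(l);	/* 1 on success, 0 if held */
}

static int do_write(simple_lock_t *l, int nowait)
{
	if (nowait) {
		if (!simple_trylock(l))
			return -EAGAIN;		/* would block: fail fast */
	} else {
		while (!simple_trylock(l))
			;			/* "blocking" path: spin */
	}
	/* ... perform the write under the lock ... */
	atomic_flag_clear(l);
	return 0;
}
```

The same shape repeats for the other IOCB_NOWAIT checks in the series: any step that could sleep (allocation, unwritten extents, writeback) becomes a cheap test that either proceeds or returns -EAGAIN.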


Re: [PATCH v7 05/22] jbd2: don't clear and reset errors after waiting on writeback

2017-06-20 Thread Jan Kara
On Fri 16-06-17 15:34:10, Jeff Layton wrote:
> Resetting this flag is almost certainly racy, and will be problematic
> with some coming changes.
> 
> Make filemap_fdatawait_keep_errors return int, but not clear the flag(s).
> Have jbd2 call it instead of filemap_fdatawait and don't attempt to
> re-set the error flag if it fails.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
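The behavioural difference the patch introduces -- reporting the error without consuming it -- can be sketched as two variants of a flag check. The flag names mirror the AS_* bits; the helpers are illustrative:

```c
#include <assert.h>

#define AS_EIO    (1u << 0)
#define AS_ENOSPC (1u << 1)

/* Old behaviour: reading the error clears it, so only the first caller
 * sees it and jbd2 had to re-set AS_EIO by hand for a later fsync. */
static int check_and_clear(unsigned *flags)
{
	if (*flags & AS_EIO)    { *flags &= ~AS_EIO;    return -5;  /* -EIO */ }
	if (*flags & AS_ENOSPC) { *flags &= ~AS_ENOSPC; return -28; /* -ENOSPC */ }
	return 0;
}

/* keep_errors variant: report without clearing, so later waiters
 * (e.g. a subsequent fsync) still observe the error. */
static int check_and_keep(const unsigned *flags)
{
	if (*flags & AS_EIO)
		return -5;
	if (*flags & AS_ENOSPC)
		return -28;
	return 0;
}
```

With the keep variant, the racy clear-then-reset dance in journal_finish_inode_data_buffers() becomes unnecessary.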

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/jbd2/commit.c   | 15 +++
>  include/linux/fs.h |  2 +-
>  mm/filemap.c   | 16 ++--
>  3 files changed, 18 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index b6b194ec1b4f..502110540598 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -263,18 +263,9 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
>   continue;
>   jinode->i_flags |= JI_COMMIT_RUNNING;
>   spin_unlock(&journal->j_list_lock);
> - err = filemap_fdatawait(jinode->i_vfs_inode->i_mapping);
> - if (err) {
> - /*
> -  * Because AS_EIO is cleared by
> -  * filemap_fdatawait_range(), set it again so
> -  * that user process can get -EIO from fsync().
> -  */
> - mapping_set_error(jinode->i_vfs_inode->i_mapping, -EIO);
> -
> - if (!ret)
> - ret = err;
> - }
> + err = filemap_fdatawait_keep_errors(jinode->i_vfs_inode->i_mapping);
> + if (!ret)
> + ret = err;
>   spin_lock(&journal->j_list_lock);
>   jinode->i_flags &= ~JI_COMMIT_RUNNING;
>   smp_mb();
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 1a135274b4f8..1b1233a1db1e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2509,7 +2509,7 @@ extern int write_inode_now(struct inode *, int);
>  extern int filemap_fdatawrite(struct address_space *);
>  extern int filemap_flush(struct address_space *);
>  extern int filemap_fdatawait(struct address_space *);
> -extern void filemap_fdatawait_keep_errors(struct address_space *);
> +extern int filemap_fdatawait_keep_errors(struct address_space *);
>  extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
>  loff_t lend);
>  extern int filemap_write_and_wait(struct address_space *mapping);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index b9e870600572..37f286df7c95 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -311,6 +311,16 @@ int filemap_check_errors(struct address_space *mapping)
>  }
>  EXPORT_SYMBOL(filemap_check_errors);
>  
> +static int filemap_check_and_keep_errors(struct address_space *mapping)
> +{
> + /* Check for outstanding write errors */
> + if (test_bit(AS_EIO, &mapping->flags))
> + return -EIO;
> + if (test_bit(AS_ENOSPC, &mapping->flags))
> + return -ENOSPC;
> + return 0;
> +}
> +
>  /**
>   * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
>   * @mapping: address space structure to write
> @@ -455,15 +465,17 @@ EXPORT_SYMBOL(filemap_fdatawait_range);
>   * call sites are system-wide / filesystem-wide data flushers: e.g. sync(2),
>   * fsfreeze(8)
>   */
> -void filemap_fdatawait_keep_errors(struct address_space *mapping)
> +int filemap_fdatawait_keep_errors(struct address_space *mapping)
>  {
>   loff_t i_size = i_size_read(mapping->host);
>  
>   if (i_size == 0)
> -         return;
> + return 0;
>  
>   __filemap_fdatawait_range(mapping, 0, i_size - 1);
> + return filemap_check_and_keep_errors(mapping);
>  }
> +EXPORT_SYMBOL(filemap_fdatawait_keep_errors);
>  
>  /**
>   * filemap_fdatawait - wait for all under-writeback pages to complete
> -- 
> 2.13.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
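The behavioural point of the patch above, reporting a writeback error without clearing it so that later waiters (e.g. fsync) still see it, can be sketched with plain bit flags (the TOY_AS_* values are made up for the sketch; the kernel's AS_* bits live in the address_space flags):

```c
#include <assert.h>
#include <errno.h>

/* Made-up bit positions standing in for the AS_* mapping flags. */
#define TOY_AS_EIO    (1UL << 0)
#define TOY_AS_ENOSPC (1UL << 1)

/* Like filemap_check_and_keep_errors(): report, but do not clear. */
static int check_and_keep_errors(const unsigned long *flags)
{
    if (*flags & TOY_AS_EIO)
        return -EIO;
    if (*flags & TOY_AS_ENOSPC)
        return -ENOSPC;
    return 0;
}
```

Contrast this with the old filemap_fdatawait() behaviour the patch replaces, which cleared the bits on check and forced jbd2 to re-set them by hand.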


Re: [PATCH v7 01/22] fs: remove call_fsync helper function

2017-06-20 Thread Jan Kara
On Fri 16-06-17 15:34:06, Jeff Layton wrote:
> Requested-by: Christoph Hellwig <h...@infradead.org>
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza
> ---
>  fs/sync.c  | 2 +-
>  include/linux/fs.h | 6 --
>  ipc/shm.c  | 2 +-
>  3 files changed, 2 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/sync.c b/fs/sync.c
> index 11ba023434b1..2a54c1f22035 100644
> --- a/fs/sync.c
> +++ b/fs/sync.c
> @@ -192,7 +192,7 @@ int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
>   spin_unlock(&inode->i_lock);
>   mark_inode_dirty_sync(inode);
>   }
> - return call_fsync(file, start, end, datasync);
> + return file->f_op->fsync(file, start, end, datasync);
>  }
>  EXPORT_SYMBOL(vfs_fsync_range);
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 4929a8f28cc3..1a135274b4f8 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1740,12 +1740,6 @@ static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
>   return file->f_op->mmap(file, vma);
>  }
>  
> -static inline int call_fsync(struct file *file, loff_t start, loff_t end,
> -  int datasync)
> -{
> - return file->f_op->fsync(file, start, end, datasync);
> -}
> -
>  ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
> unsigned long nr_segs, unsigned long fast_segs,
> struct iovec *fast_pointer,
> diff --git a/ipc/shm.c b/ipc/shm.c
> index ec5688e98f25..28a444861a8f 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -453,7 +453,7 @@ static int shm_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>  
>   if (!sfd->file->f_op->fsync)
>   return -EINVAL;
> - return call_fsync(sfd->file, start, end, datasync);
> + return sfd->file->f_op->fsync(sfd->file, start, end, datasync);
>  }
>  
>  static long shm_fallocate(struct file *file, int mode, loff_t offset,
> -- 
> 2.13.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 02/10] fs: Introduce filemap_range_has_page()

2017-05-25 Thread Jan Kara
On Wed 24-05-17 11:41:42, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgold...@suse.com>
> 
> filemap_range_has_page() returns true if the file's mapping has
> a page within the range mentioned. This function will be used
> to check if a write() call will cause a writeback of previous
> writes.
> 
> Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
> Reviewed-by: Christoph Hellwig <h...@lst.de>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  include/linux/fs.h |  2 ++
>  mm/filemap.c   | 33 +
>  2 files changed, 35 insertions(+)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f53867140f43..dc0ab585cd56 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2517,6 +2517,8 @@ extern int filemap_fdatawait(struct address_space *);
>  extern void filemap_fdatawait_keep_errors(struct address_space *);
>  extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
>  loff_t lend);
> +extern int filemap_range_has_page(struct address_space *, loff_t lstart,
> +   loff_t lend);
>  extern int filemap_write_and_wait(struct address_space *mapping);
>  extern int filemap_write_and_wait_range(struct address_space *mapping,
>   loff_t lstart, loff_t lend);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6f1be573a5e6..87aba7698584 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -376,6 +376,39 @@ int filemap_flush(struct address_space *mapping)
>  }
>  EXPORT_SYMBOL(filemap_flush);
>  
> +/**
> + * filemap_range_has_page - check if a page exists in range.
> + * @mapping:   address space structure to wait for
> + * @start_byte:offset in bytes where the range starts
> + * @end_byte:  offset in bytes where the range ends (inclusive)
> + *
> + * Find at least one page in the range supplied, usually used to check if
> + * direct writing in this range will trigger a writeback.
> + */
> +int filemap_range_has_page(struct address_space *mapping,
> +loff_t start_byte, loff_t end_byte)
> +{
> + pgoff_t index = start_byte >> PAGE_SHIFT;
> + pgoff_t end = end_byte >> PAGE_SHIFT;
> + struct pagevec pvec;
> + int ret;
> +
> + if (end_byte < start_byte)
> + return 0;
> +
> + if (mapping->nrpages == 0)
> + return 0;
> +
> + pagevec_init(&pvec, 0);
> + ret = pagevec_lookup(&pvec, mapping, index, 1);
> + if (!ret)
> + return 0;
> + ret = (pvec.pages[0]->index <= end);
> + pagevec_release(&pvec);
> + return ret;
> +}
> +EXPORT_SYMBOL(filemap_range_has_page);
> +
>  static int __filemap_fdatawait_range(struct address_space *mapping,
>loff_t start_byte, loff_t end_byte)
>  {
> -- 
> 2.12.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
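The core of filemap_range_has_page() above is byte-to-page-index arithmetic plus a single lookup of the first page at or after the start index. A sketch with a sorted array of page indices standing in for the page cache (TOY_PAGE_SHIFT of 12, i.e. 4K pages, is an assumption of the sketch):

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12  /* 4K pages, as on most architectures */

/* pages[] is a sorted list of cached page indices for the file. */
static int toy_range_has_page(const unsigned long *pages, size_t npages,
                              long long start_byte, long long end_byte)
{
    unsigned long index = (unsigned long)(start_byte >> TOY_PAGE_SHIFT);
    unsigned long end = (unsigned long)(end_byte >> TOY_PAGE_SHIFT);
    size_t i;

    if (end_byte < start_byte || npages == 0)
        return 0;

    /* Find the first cached page at or after 'index'... */
    for (i = 0; i < npages; i++) {
        if (pages[i] >= index)
            return pages[i] <= end;  /* ...and check it falls within range */
    }
    return 0;
}
```

The kernel version does the same with one pagevec_lookup() against the radix tree instead of a linear scan.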


Re: [PATCH 04/10] fs: Introduce RWF_NOWAIT

2017-05-25 Thread Jan Kara
On Wed 24-05-17 11:41:44, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgold...@suse.com>
> 
> RWF_NOWAIT informs kernel to bail out if an AIO request will block
> for reasons such as file allocations, or a writeback triggered,
> or would block while allocating requests while performing
> direct I/O.
> 
> RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags.
> 
> The check for -EOPNOTSUPP is placed in generic_file_write_iter(). This
> is called by most filesystems, either through fsops.write_iter() or through
> the function defined by write_iter(). If not, we perform the check defined
> by .write_iter() which is called for direct IO specifically.
> 
> Filesystems xfs, btrfs and ext4 would be supported in the following patches.
> 
> Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
> Reviewed-by: Christoph Hellwig <h...@lst.de>

Looks good now. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza


> ---
>  fs/9p/vfs_file.c|  3 +++
>  fs/aio.c| 13 +
>  fs/ceph/file.c  |  3 +++
>  fs/cifs/file.c  |  3 +++
>  fs/fuse/file.c  |  3 +++
>  fs/nfs/direct.c |  3 +++
>  fs/ocfs2/file.c |  3 +++
>  include/linux/fs.h  |  5 -
>  include/uapi/linux/fs.h |  1 +
>  mm/filemap.c|  3 +++
>  10 files changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> index 3de3b4a89d89..403681db7723 100644
> --- a/fs/9p/vfs_file.c
> +++ b/fs/9p/vfs_file.c
> @@ -411,6 +411,9 @@ v9fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>   loff_t origin;
>   int err = 0;
>  
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EOPNOTSUPP;
> +
>   retval = generic_write_checks(iocb, from);
>   if (retval <= 0)
>   return retval;
> diff --git a/fs/aio.c b/fs/aio.c
> index 020fa0045e3c..9616dc733103 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1592,6 +1592,19 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
>   goto out_put_req;
>   }
>  
> + if (req->common.ki_flags & IOCB_NOWAIT) {
> + if (!(req->common.ki_flags & IOCB_DIRECT)) {
> + ret = -EOPNOTSUPP;
> + goto out_put_req;
> + }
> +
> + if ((iocb->aio_lio_opcode != IOCB_CMD_PWRITE) &&
> + (iocb->aio_lio_opcode != IOCB_CMD_PWRITEV)) {
> + ret = -EINVAL;
> + goto out_put_req;
> + }
> + }
> +
> + ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
>   if (unlikely(ret)) {
>   pr_debug("EFAULT: aio_key\n");
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 3fdde0b283c9..a53fd2675b1b 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1300,6 +1300,9 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
>   int err, want, got;
>   loff_t pos;
>  
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EOPNOTSUPP;
> +
>   if (ceph_snap(inode) != CEPH_NOSNAP)
>   return -EROFS;
>  
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 0fd081bd2a2f..ff84fa9ddb6c 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -2725,6 +2725,9 @@ ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from)
>* write request.
>*/
>  
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EOPNOTSUPP;
> +
>   rc = generic_write_checks(iocb, from);
>   if (rc <= 0)
>   return rc;
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 3ee4fdc3da9e..812c7bd0c290 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1425,6 +1425,9 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
>   struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(file);
>   ssize_t res;
>  
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EOPNOTSUPP;
> +
>   if (is_bad_inode(inode))
>   return -EIO;
>  
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index 6fb9fad2d1e6..c8e7dd76126c 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -979,6 +979,9 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
>   dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
>   file, iov_iter_count(iter), (long long) iocb->ki_pos);
>  
> + if

Re: [PATCH 03/10] fs: Use RWF_* flags for AIO operations

2017-05-25 Thread Jan Kara
On Wed 24-05-17 11:41:43, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgold...@suse.com>
> 
> aio_rw_flags is introduced in struct iocb (using aio_reserved1) which will
> carry the RWF_* flags. We cannot use aio_flags because they are not
> checked for validity which may break existing applications.
> 
> Note, the only place RWF_HIPRI comes in effect is dio_await_one().
> All the rest of the locations, aio code return -EIOCBQUEUED before the
> checks for RWF_HIPRI.
> 
> Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
> Reviewed-by: Christoph Hellwig <h...@lst.de>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/aio.c | 8 +++-
>  include/uapi/linux/aio_abi.h | 2 +-
>  2 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index f52d925ee259..020fa0045e3c 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1541,7 +1541,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
>   ssize_t ret;
>  
>   /* enforce forwards compatibility on users */
> - if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
> + if (unlikely(iocb->aio_reserved2)) {
>   pr_debug("EINVAL: reserve field set\n");
>   return -EINVAL;
>   }
> @@ -1586,6 +1586,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
>   req->common.ki_flags |= IOCB_EVENTFD;
>   }
>  
> + ret = kiocb_set_rw_flags(&req->common, iocb->aio_rw_flags);
> + if (unlikely(ret)) {
> + pr_debug("EINVAL: aio_rw_flags\n");
> + goto out_put_req;
> + }
> +
> + ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
>   if (unlikely(ret)) {
>   pr_debug("EFAULT: aio_key\n");
> diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
> index bb2554f7fbd1..a2d4a8ac94ca 100644
> --- a/include/uapi/linux/aio_abi.h
> +++ b/include/uapi/linux/aio_abi.h
> @@ -79,7 +79,7 @@ struct io_event {
>  struct iocb {
>   /* these are internal to the kernel/libc. */
>   __u64   aio_data;   /* data to be returned in event's data */
> - __u32   PADDED(aio_key, aio_reserved1);
> + __u32   PADDED(aio_key, aio_rw_flags);
>   /* the kernel sets aio_key to the req # */
>  
>   /* common fields */
> -- 
> 2.12.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 06/10] fs: Introduce IOMAP_NOWAIT

2017-05-25 Thread Jan Kara
On Wed 24-05-17 11:41:46, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgold...@suse.com>
> 
> IOCB_NOWAIT translates to IOMAP_NOWAIT for iomaps.
> This is used by XFS in the XFS patch.
> 
> Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
> Reviewed-by: Christoph Hellwig <h...@lst.de>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/iomap.c| 2 ++
>  include/linux/iomap.h | 1 +
>  2 files changed, 3 insertions(+)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 4b10892967a5..5d85ec6e7b20 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -879,6 +879,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   } else {
>   dio->flags |= IOMAP_DIO_WRITE;
>   flags |= IOMAP_WRITE;
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + flags |= IOMAP_NOWAIT;
>   }
>  
>   ret = filemap_write_and_wait_range(mapping, start, end);
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index f753e788da31..69f4e9470084 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -52,6 +52,7 @@ struct iomap {
>  #define IOMAP_REPORT (1 << 2) /* report extent status, e.g. FIEMAP */
>  #define IOMAP_FAULT  (1 << 3) /* mapping for page fault */
>  #define IOMAP_DIRECT (1 << 4) /* direct I/O */
> +#define IOMAP_NOWAIT (1 << 5) /* Don't wait for writeback */
>  
>  struct iomap_ops {
>   /*
> -- 
> 2.12.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 01/10] fs: Separate out kiocb flags setup based on RWF_* flags

2017-05-25 Thread Jan Kara
On Wed 24-05-17 11:41:41, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgold...@suse.com>
> 
> Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
> Reviewed-by: Christoph Hellwig <h...@lst.de>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/read_write.c| 12 +++-
>  include/linux/fs.h | 14 ++
>  2 files changed, 17 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 47c1d4484df9..53c816c61122 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -678,16 +678,10 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
>   struct kiocb kiocb;
>   ssize_t ret;
>  
> - if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
> - return -EOPNOTSUPP;
> -
>   init_sync_kiocb(&kiocb, filp);
> - if (flags & RWF_HIPRI)
> - kiocb.ki_flags |= IOCB_HIPRI;
> - if (flags & RWF_DSYNC)
> - kiocb.ki_flags |= IOCB_DSYNC;
> - if (flags & RWF_SYNC)
> - kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
> + ret = kiocb_set_rw_flags(&kiocb, flags);
> + if (ret)
> + return ret;
>   kiocb.ki_pos = *ppos;
>  
>   if (type == READ)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 803e5a9b2654..f53867140f43 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3056,6 +3056,20 @@ static inline int iocb_flags(struct file *file)
>   return res;
>  }
>  
> +static inline int kiocb_set_rw_flags(struct kiocb *ki, int flags)
> +{
> + if (unlikely(flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC)))
> + return -EOPNOTSUPP;
> +
> + if (flags & RWF_HIPRI)
> + ki->ki_flags |= IOCB_HIPRI;
> + if (flags & RWF_DSYNC)
> + ki->ki_flags |= IOCB_DSYNC;
> +     if (flags & RWF_SYNC)
> + ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
> + return 0;
> +}
> +
>  static inline ino_t parent_ino(struct dentry *dentry)
>  {
>   ino_t res;
> -- 
> 2.12.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
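The helper factored out by the patch above is a pure flag translation. A user-space sketch of the same mapping (all flag values here are made up; the kernel's RWF_* and IOCB_* values differ):

```c
#include <assert.h>
#include <errno.h>

/* Made-up values for the sketch only. */
#define RWF_HIPRI  0x1
#define RWF_DSYNC  0x2
#define RWF_SYNC   0x4
#define IOCB_HIPRI 0x10
#define IOCB_DSYNC 0x20
#define IOCB_SYNC  0x40

/* Mirrors kiocb_set_rw_flags(): reject unknown bits, then translate. */
static int set_rw_flags(int *ki_flags, int flags)
{
    if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
        return -EOPNOTSUPP;
    if (flags & RWF_HIPRI)
        *ki_flags |= IOCB_HIPRI;
    if (flags & RWF_DSYNC)
        *ki_flags |= IOCB_DSYNC;
    if (flags & RWF_SYNC)
        *ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
    return 0;
}
```

Centralizing the check is what lets the later NOWAIT patches add one more RWF_* bit in a single place instead of in every syscall path.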


Re: [PATCH 05/10] fs: return if direct write will trigger writeback

2017-05-25 Thread Jan Kara
On Wed 24-05-17 11:41:45, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgold...@suse.com>
> 
> Find out if the write will trigger a wait due to writeback. If yes,
> return -EAGAIN.
> 
> Return -EINVAL for buffered AIO: there are multiple causes of
> delay such as page locks, dirty throttling logic, page loading
> from disk etc. which cannot be taken care of.
> 
> Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
> Reviewed-by: Christoph Hellwig <h...@lst.de>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  mm/filemap.c | 17 ++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 097213275461..bc146efa6815 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2675,6 +2675,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  
>   pos = iocb->ki_pos;
>  
> + if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
> + return -EINVAL;
> +
>   if (limit != RLIM_INFINITY) {
>   if (iocb->ki_pos >= limit) {
>   send_sig(SIGXFSZ, current, 0);
> @@ -2743,9 +2746,17 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
>   write_len = iov_iter_count(from);
>   end = (pos + write_len - 1) >> PAGE_SHIFT;
>  
> - written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
> - if (written)
> - goto out;
> + if (iocb->ki_flags & IOCB_NOWAIT) {
> + /* If there are pages to writeback, return */
> + if (filemap_range_has_page(inode->i_mapping, pos,
> +pos + iov_iter_count(from)))
> + return -EAGAIN;
> + } else {
> + written = filemap_write_and_wait_range(mapping, pos,
> + pos + write_len - 1);
> + if (written)
> + goto out;
> + }
>  
>   /*
>* After a write we want buffered reads to be sure to go to disk to get
> -- 
> 2.12.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
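The decision added above, returning -EAGAIN under IOCB_NOWAIT when cached pages in the range would force a writeback wait, and flushing synchronously otherwise, can be sketched as follows (toy types and a made-up flag value; the real code operates on an address_space):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_IOCB_NOWAIT 0x80  /* made-up value for the sketch */

/* Stand-in for filemap_range_has_page() on the file's mapping. */
typedef bool (*has_page_fn)(long long pos, long long len);

static int direct_write_prepare(int ki_flags, long long pos, long long len,
                                has_page_fn has_page, int *flushed)
{
    if (ki_flags & TOY_IOCB_NOWAIT) {
        /* Pages present would mean a writeback wait: bail out instead. */
        if (has_page(pos, len))
            return -EAGAIN;
    } else {
        /* Blocking path: flush and wait for the whole range. */
        *flushed = 1;
    }
    return 0;
}

static bool always_cached(long long pos, long long len)
{
    (void)pos; (void)len;
    return true;
}

static bool never_cached(long long pos, long long len)
{
    (void)pos; (void)len;
    return false;
}
```

Note this is only a best-effort check: a page can appear between the check and the actual I/O, which is part of why the series cannot guarantee the submission never blocks.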


Re: [PATCH 09/10] xfs: nowait aio support

2017-05-29 Thread Jan Kara
On Sun 28-05-17 21:38:26, Goldwyn Rodrigues wrote:
> On 05/28/2017 04:31 AM, Christoph Hellwig wrote:
> > Despite my previous reviewed-by tag this will need another fix:
> > 
> > xfs_file_iomap_begin needs to return EAGAIN if we don't have the extent
> > list in memory already.  E.g. something like this:
> > 
> > if ((flags & IOMAP_NOWAIT) && !(ip->i_d.if_flags & XFS_IFEXTENTS)) {
> > error = -EAGAIN;
> > goto out_unlock;
> > }
> > 
> > right after locking the ilock.
> 
> I am not sure if it is right to penalize the application to write to
> file which has been freshly opened (and is the first one to open). It
> basically means extent maps needs to be read from disk. Do you see a
> reason it would have a non-deterministic wait if it is the only user? I
> understand the block layer can block if it has too many requests though.

Well, submitting such write will have to wait for read of metadata from
disk. That is certainly considered blocking so Christoph is right that we
must return EAGAIN in such case.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 09/10] xfs: nowait aio support

2017-05-31 Thread Jan Kara
On Tue 30-05-17 11:13:29, Goldwyn Rodrigues wrote:
> > Btw, can you write a small blurb up for the man page to document these
> > Ñ•emantics in man-page like language?
> > 
> 
> Yes, but which man page would it belong to?
> Should it be a subsection of errors in io_getevents/io_submit. We don't
> want to add ERRORS to io_getevents() because it would be the return
> value of the io_getevents call, and not the ones in the iocb structure.
> Should it be a new man page, say for iocb(7/8)?

I think you should extend the manpage for io_submit(8). There you can add
definition of struct iocb in 'DESCRIPTION' section explaining at least the
most common fields. You can also explain there which flags can be passed
and what are they intended to do.

You can also expand EAGAIN error description to specifically mention that
in case of NOWAIT aio EAGAIN can be returned if io submission would block.

        Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 0/10 v12] No wait AIO

2017-06-16 Thread Jan Kara
On Thu 15-06-17 11:25:28, Andrew Morton wrote:
> On Thu, 15 Jun 2017 10:59:52 -0500 Goldwyn Rodrigues <rgold...@suse.de> wrote:
> 
> > This series adds nonblocking feature to asynchronous I/O writes.
> > io_submit() can be delayed because of a number of reason:
> >  - Block allocation for files
> >  - Data writebacks for direct I/O
> >  - Sleeping because of waiting to acquire i_rwsem
> >  - Congested block device
> > 
> > The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
> > any of these conditions are met. This way userspace can push most
> > of the write()s to the kernel to the best of its ability to complete
> > and if it returns -EAGAIN, can defer it to another thread.
> > 
> > In order to enable this, IOCB_RW_FLAG_NOWAIT is introduced in
> > uapi/linux/aio_abi.h. If set for aio_rw_flags, it translates to
> > IOCB_NOWAIT for struct iocb, REQ_NOWAIT for bio.bi_opf and IOMAP_NOWAIT for
> > iomap. aio_rw_flags is a new flag replacing aio_reserved1. We could
> > not use aio_flags because it is not currently checked for invalidity
> > in the kernel.
> > 
> > This feature is provided for direct I/O of asynchronous I/O only. I have
> > tested it against xfs, ext4, and btrfs while I intend to add more 
> > filesystems.
> > The nowait feature is for request based devices. In the future, I intend to
> > add support to stacked devices such as md.
> > 
> > Applications will have to check supportability by sending a async direct 
> > write
> > and any other error besides -EAGAIN would mean it is not supported.
> > 
> 
> How accurate is this?  For example, the changes to
> generic_file_direct_write() appear to greatly reduce the chances of
> blocking but there are surely race opportunities which will still
> result in userspace unexpectedly experiencing blocking in a succeeding
> write() call?

Yes, so you are right that there are still possibilities for blocking -
e.g. we could get blocked in reclaim when allocating memory somewhere. Now
we hope what Goldwyn did will be enough for practical purposes as in the
end this is an API to improve performance and so in the worst case app
won't get the performance it expects (this just has to be rare enough that
it all pays off in the end). Also if we spot some place that ends up to
cause blocking in practice, we'll work on improving that...
 
> If correct then I think there should be some discussion and perhaps
> testing results in the changelog.

Probably we could add a note to the first paragraph of the changelog of
patch 4/10 like: Note that we can still block (put the process submitting
IO to sleep) in some rare cases like when there is not enough free memory
or when acquiring some fs-internal sleeping locks.

WRT test results, Goldwyn has some functional tests (for xfstests). We also
have a customer that is working on testing the series with their workload
however that will take some time given it requires updating their software
stack. If you are looking for some synthetic benchmark results, I suppose
we can put something together however it's going to be just a synthetic
benchmark and as such the relevance is limited.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
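As the series stands, io_submit() accepts NOWAIT only for direct-I/O write requests (see the checks added in patch 04/10). That validation rule can be sketched as a pure function (constants here are made up for the sketch and do not match the kernel's values):

```c
#include <assert.h>
#include <errno.h>

/* Made-up values for the sketch. */
#define S_IOCB_NOWAIT 0x1
#define S_IOCB_DIRECT 0x2

enum toy_opcode { T_PREAD, T_PREADV, T_PWRITE, T_PWRITEV };

/* NOWAIT is valid only for direct-I/O writes; mirror the aio checks. */
static int validate_nowait(int ki_flags, enum toy_opcode op)
{
    if (!(ki_flags & S_IOCB_NOWAIT))
        return 0;
    if (!(ki_flags & S_IOCB_DIRECT))
        return -EOPNOTSUPP;
    if (op != T_PWRITE && op != T_PWRITEV)
        return -EINVAL;
    return 0;
}
```

An application probing for support would submit one direct async write with the NOWAIT flag: -EAGAIN means supported but would block; -EOPNOTSUPP or -EINVAL means the combination is not supported.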


[PATCH] nfs: Fix bdi handling for cloned superblocks

2017-05-04 Thread Jan Kara
In commit 0d3b12584972 "nfs: Convert to separately allocated bdi" I have
wrongly cloned bdi reference in nfs_clone_super(). Further inspection
has shown that originally the code was actually allocating a new bdi (in
->clone_server callback) which was later registered in
nfs_fs_mount_common() and used for sb->s_bdi in nfs_initialise_sb().
This could later result in bdi for the original superblock not getting
unregistered when that superblock got shutdown (as the cloned sb still
held bdi reference) and later when a new superblock was created under
the same anonymous device number, a clash in sysfs has happened on bdi
registration:

[ cut here ]
WARNING: CPU: 1 PID: 10284 at /linux-next/fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x74
sysfs: cannot create duplicate filename '/devices/virtual/bdi/0:32'
Modules linked in: axp20x_usb_power gpio_axp209 nvmem_sunxi_sid sun4i_dma sun4i_ss virt_dma
CPU: 1 PID: 10284 Comm: mount.nfs Not tainted 4.11.0-rc4+ #14
Hardware name: Allwinner sun7i (A20) Family
[] (unwind_backtrace) from [] (show_stack+0x10/0x14)
[] (show_stack) from [] (dump_stack+0x78/0x8c)
[] (dump_stack) from [] (__warn+0xe8/0x100)
[] (__warn) from [] (warn_slowpath_fmt+0x38/0x48)
[] (warn_slowpath_fmt) from [] (sysfs_warn_dup+0x64/0x74)
[] (sysfs_warn_dup) from [] (sysfs_create_dir_ns+0x84/0x94)
[] (sysfs_create_dir_ns) from [] 
(kobject_add_internal+0x9c/0x2ec)
[] (kobject_add_internal) from [] (kobject_add+0x48/0x98)
[] (kobject_add) from [] (device_add+0xe4/0x5a0)
[] (device_add) from [] 
(device_create_groups_vargs+0xac/0xbc)
[] (device_create_groups_vargs) from [] 
(device_create_vargs+0x20/0x28)
[] (device_create_vargs) from [] (bdi_register_va+0x44/0xfc)
[] (bdi_register_va) from [] 
(super_setup_bdi_name+0x48/0xa4)
[] (super_setup_bdi_name) from [] 
(nfs_fill_super+0x1a4/0x204)
[] (nfs_fill_super) from [] 
(nfs_fs_mount_common+0x140/0x1e8)
[] (nfs_fs_mount_common) from [] 
(nfs4_remote_mount+0x50/0x58)
[] (nfs4_remote_mount) from [] (mount_fs+0x14/0xa4)
[] (mount_fs) from [] (vfs_kern_mount+0x54/0x128)
[] (vfs_kern_mount) from [] (nfs_do_root_mount+0x80/0xa0)
[] (nfs_do_root_mount) from [] (nfs4_try_mount+0x28/0x3c)
[] (nfs4_try_mount) from [] (nfs_fs_mount+0x2cc/0x8c4)
[] (nfs_fs_mount) from [] (mount_fs+0x14/0xa4)
[] (mount_fs) from [] (vfs_kern_mount+0x54/0x128)
[] (vfs_kern_mount) from [] (do_mount+0x158/0xc7c)
[] (do_mount) from [] (SyS_mount+0x8c/0xb4)
[] (SyS_mount) from [] (ret_fast_syscall+0x0/0x3c)

Fix the problem by always creating new bdi for a superblock as we used
to do.

Reported-and-tested-by: Corentin Labbe <clabbe.montj...@gmail.com>
Fixes: 0d3b12584972ce5781179ad3f15cca3cdb5cae05
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/nfs/internal.h |  6 +++---
 fs/nfs/super.c| 28 ++--
 2 files changed, 13 insertions(+), 21 deletions(-)

Hi Jens,

can you please merge this fix to my series converting BDI handling in
filesystems? Thanks!

Honza

diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 9dc65d7ae754..7b38fedb7e03 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -139,7 +139,7 @@ struct nfs_mount_request {
 };
 
 struct nfs_mount_info {
-   int (*fill_super)(struct super_block *, struct nfs_mount_info *);
+   void (*fill_super)(struct super_block *, struct nfs_mount_info *);
int (*set_security)(struct super_block *, struct dentry *, struct nfs_mount_info *);
struct nfs_parsed_mount_data *parsed;
struct nfs_clone_mount *cloned;
@@ -407,7 +407,7 @@ struct dentry *nfs_fs_mount(struct file_system_type *, int, const char *, void *
 struct dentry * nfs_xdev_mount_common(struct file_system_type *, int,
const char *, struct nfs_mount_info *);
 void nfs_kill_super(struct super_block *);
-int nfs_fill_super(struct super_block *, struct nfs_mount_info *);
+void nfs_fill_super(struct super_block *, struct nfs_mount_info *);
 
 extern struct rpc_stat nfs_rpcstat;
 
@@ -458,7 +458,7 @@ extern void nfs_read_prepare(struct rpc_task *task, void *calldata);
 extern void nfs_pageio_reset_read_mds(struct nfs_pageio_descriptor *pgio);
 
 /* super.c */
-int nfs_clone_super(struct super_block *, struct nfs_mount_info *);
+void nfs_clone_super(struct super_block *, struct nfs_mount_info *);
 void nfs_umount_begin(struct super_block *);
 int  nfs_statfs(struct dentry *, struct kstatfs *);
 int  nfs_show_options(struct seq_file *, struct dentry *);
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index dc69314d455e..2f3822a4a7d5 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -2321,11 +2321,10 @@ inline void nfs_initialise_sb(struct super_block *sb)
 /*
  * Finish setting up an NFS2/3 superblock
  */
-int nfs_fill_super(struct super_block *sb, struct nfs_mount_info *mount_info)
+void nfs_fill_super(struct super_block *sb, struct nfs_mount_info *mount_info)
 {
struct nfs_parsed_mount_dat

Treat REQ_FUA and REQ_PREFLUSH as synchronous

2017-05-02 Thread Jan Kara
Hello,

Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as synchronous"
made requests with REQ_FUA and REQ_PREFLUSH to be treated as synchronous
and dropped REQ_SYNC from definitions of WRITE_FUA and friends. This
however introduced a bunch of bugs to filesystems (I know about ext4,
btrfs, f2fs, gfs2, reiserfs regressing because of this) because they
implicitly expected REQ_FUA or REQ_PREFLUSH to imply a synchronous request.
At first sight they do; however, generic_make_request_checks() will strip
the REQ_FUA and REQ_PREFLUSH flags from a bio if the underlying storage does
not have a volatile write cache, and so writes suddenly become async. I will
go and fix filesystems to explicitly set REQ_SYNC if they want sync
behavior as that is a good cleanup anyway but I wanted to question whether
it makes sense to treat REQ_FUA and REQ_PREFLUSH ops as synchronous in
op_is_sync() since callers cannot rely on this anyway... Thoughts?

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
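For reference, the behaviour under discussion is in op_is_sync(): reads are always treated as synchronous, and after commit b685d3d65ac7 so is any request carrying REQ_SYNC, REQ_FUA or REQ_PREFLUSH. A sketch of that predicate (flag values are made up here; the kernel's differ):

```c
#include <assert.h>
#include <stdbool.h>

/* Made-up flag values for the sketch; the kernel's differ. */
#define T_REQ_OP_READ  0u
#define T_REQ_OP_WRITE 1u
#define T_REQ_OP_MASK  0xffu
#define T_REQ_SYNC     (1u << 8)
#define T_REQ_FUA      (1u << 9)
#define T_REQ_PREFLUSH (1u << 10)

/* Mirrors op_is_sync() after b685d3d65ac7. */
static bool toy_op_is_sync(unsigned int op)
{
    return (op & T_REQ_OP_MASK) == T_REQ_OP_READ ||
           (op & (T_REQ_SYNC | T_REQ_FUA | T_REQ_PREFLUSH)) != 0;
}
```

The subtlety raised in the mail is that a lower layer may strip FUA/PREFLUSH before the request reaches this check, so filesystems wanting guaranteed sync behaviour must set REQ_SYNC themselves rather than rely on this predicate.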


Re: BUG: KASAN: use-after-free in scsi_exit_rq

2017-05-02 Thread Jan Kara
On Fri 28-04-17 17:46:47, Tejun Heo wrote:
> On Fri, Apr 21, 2017 at 09:49:17PM +, Bart Van Assche wrote:
> > On Thu, 2017-04-20 at 15:18 -0600, Scott Bauer wrote:
> > > [  642.638860] BUG: KASAN: use-after-free in scsi_exit_rq+0xf3/0x120 at addr 8802b7fedf00
> > > [  642.639362] Read of size 1 by task rcuos/5/53
> > > [  642.639713] CPU: 7 PID: 53 Comm: rcuos/6 Not tainted 4.11.0-rc5+ #13
> > > [  642.640170] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
> > > [  642.640923] Call Trace:
> > > [  642.641080]  dump_stack+0x63/0x8f
> > > [  642.641289]  kasan_object_err+0x21/0x70
> > > [  642.641531]  kasan_report.part.1+0x231/0x500
> > > [  642.641823]  ? scsi_exit_rq+0xf3/0x120
> > > [  642.642054]  ? _raw_spin_unlock_irqrestore+0xe/0x10
> > > [  642.642353]  ? free_percpu+0x1b7/0x340
> > > [  642.642586]  ? put_task_stack+0x117/0x2b0
> > > [  642.642837]  __asan_report_load1_noabort+0x2e/0x30
> > > [  642.643138]  scsi_exit_rq+0xf3/0x120
> > > [  642.643366]  free_request_size+0x44/0x60
> > > [  642.643614]  mempool_destroy.part.6+0x9b/0x150
> > > [  642.643899]  ? kasan_slab_free+0x87/0xb0
> > > [  642.644152]  mempool_destroy+0x13/0x20
> > > [  642.644394]  blk_exit_rl+0x36/0x40
> > > [  642.644614]  blkg_free+0x146/0x200
> > > [  642.644836]  __blkg_release_rcu+0x121/0x220
> > > [  642.645112]  rcu_nocb_kthread+0x61f/0xca0
> > > [  642.645376]  ? get_state_synchronize_rcu+0x20/0x20
> > > [  642.645690]  ? pci_mmcfg_check_reserved+0x110/0x110
> > > [  642.646011]  kthread+0x298/0x390
> > > [  642.646224]  ? get_state_synchronize_rcu+0x20/0x20
> > > [  642.646535]  ? kthread_park+0x160/0x160
> > > [  642.646787]  ret_from_fork+0x2c/0x40
> > 
> > I'm not familiar with cgroups but seeing this makes me wonder whether it
> > would be possible to move the blk_exit_rl() calls from blk_release_queue()
> > into blk_cleanup_queue()? The SCSI core frees a SCSI host after
> > blk_cleanup_queue() has finished for all associated SCSI devices. This is
> > why I think that calling blk_exit_rl() earlier would be sufficient to
> > avoid that scsi_exit_rq() dereferences a SCSI host pointer after it has
> > been freed.
> 
> Hmm... I see.  Didn't know request put path involved derefing those
> structs.  The blk_exit_rl() call just replaced open coded
> mempool_destroy call, so the destruction order was always like this.
> We shouldn't have any request in flight by cleanup, so moving it there
> should be fine.

So I'm also not aware of any particular breakage this would cause. However
logically the freeing of request mempools really belongs to
blk_release_queue() so it seems a bit dumb to move blk_exit_rl() just
because SCSI stores the fact from which slab cache it has allocated the
sense buffer in a structure (shost) that it frees under its hands by the
time blk_release_queue() is called. :-|

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: Treat REQ_FUA and REQ_PREFLUSH as synchronous

2017-05-02 Thread Jan Kara
On Tue 02-05-17 08:18:15, Christoph Hellwig wrote:
> On Tue, May 02, 2017 at 05:15:58PM +0200, Jan Kara wrote:
> > in generic_make_request_checks()? Or just set it unconditionally in that
> > function if we see REQ_FUA | REQ_PREFLUSH set?
> 
> Just add REQ_FUA and REQ_PREFLUSH to the test in op_is_sync.

It is there. The problem happens when REQ_FUA and REQ_PREFLUSH get stripped
from the bio because the underlying storage doesn't have a volatile cache;
the bio suddenly becomes treated as async, which regresses performance.

        Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: Treat REQ_FUA and REQ_PREFLUSH as synchronous

2017-05-02 Thread Jan Kara
On Tue 02-05-17 08:49:14, Jens Axboe wrote:
> On 05/02/2017 08:45 AM, Christoph Hellwig wrote:
> > On Tue, May 02, 2017 at 12:21:23PM +0200, Jan Kara wrote:
> >> it makes sense to treat REQ_FUA and REQ_PREFLUSH ops as synchronous in
> >> op_is_sync() since callers cannot rely on this anyway... Thoughts?
> > 
> > I'm fine with treating them as sync.
> 
> Yes me too. It makes sense to do so implicitly, and it avoids having
> to go and add REQ_SYNC in a bunch of places. That will also be more
> error proof in the future.
> 
> Jan, will you send a patch for that?

Hum, so I've just sent out fixes to individual filesystems to set REQ_SYNC
explicitly :-|. So would you prefer to do something like:

	if (op_is_flush(bio->bi_opf) &&
	    !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
		bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
+		bio->bi_opf |= REQ_SYNC;
		if (!nr_sectors) {
			err = 0;
			goto end_io;
		}
	}

in generic_make_request_checks()? Or just set it unconditionally in that
function if we see REQ_FUA | REQ_PREFLUSH set? Because we cannot just add
REQ_SYNC into REQ_FUA and REQ_PREFLUSH definitions as it used to be with
WRITE_FUA & co. definitions...

In the end I think that modifying filesystems (and drivers/md) to set REQ_SYNC
was the cleanest way to go as much as it was annoying...

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 3/4] fs: support RWF_NOWAIT for buffered reads

2017-08-24 Thread Jan Kara
On Tue 22-08-17 18:17:11, Christoph Hellwig wrote:
> This is based on the old idea and code from Milosz Tanski.  With the aio
> nowait code it becomes mostly trivial now.  Buffered writes continue to
> return -EOPNOTSUPP if RWF_NOWAIT is passed.
> 
> Signed-off-by: Christoph Hellwig <h...@lst.de>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/aio.c   |  6 --
>  fs/btrfs/file.c|  6 +-
>  fs/ext4/file.c |  6 +++---
>  fs/xfs/xfs_file.c  | 11 +--
>  include/linux/fs.h |  6 +++---
>  5 files changed, 20 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index dcad3a66748c..d93daa076726 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1593,12 +1593,6 @@ static int io_submit_one(struct kioctx *ctx, struct 
> iocb __user *user_iocb,
>   goto out_put_req;
>   }
>  
> - if ((req->common.ki_flags & IOCB_NOWAIT) &&
> - !(req->common.ki_flags & IOCB_DIRECT)) {
> - ret = -EOPNOTSUPP;
> - goto out_put_req;
> - }
> -
> 	ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
>   if (unlikely(ret)) {
>   pr_debug("EFAULT: aio_key\n");
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 9e75d8a39aac..e62dd55b4079 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1886,6 +1886,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb 
> *iocb,
>   loff_t oldsize;
>   int clean_page = 0;
>  
> + if (!(iocb->ki_flags & IOCB_DIRECT) &&
> + (iocb->ki_flags & IOCB_NOWAIT))
> + return -EOPNOTSUPP;
> +
>   if (!inode_trylock(inode)) {
>   if (iocb->ki_flags & IOCB_NOWAIT)
>   return -EAGAIN;
> @@ -3105,7 +3109,7 @@ static loff_t btrfs_file_llseek(struct file *file, 
> loff_t offset, int whence)
>  
>  static int btrfs_file_open(struct inode *inode, struct file *filp)
>  {
> - filp->f_mode |= FMODE_AIO_NOWAIT;
> + filp->f_mode |= FMODE_NOWAIT;
>   return generic_file_open(inode, filp);
>  }
>  
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 0d7cf0cc9b87..f83521337b8f 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -223,6 +223,8 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
> *from)
>   if (IS_DAX(inode))
>   return ext4_dax_write_iter(iocb, from);
>  #endif
> + if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
> + return -EOPNOTSUPP;
>  
>   if (!inode_trylock(inode)) {
>   if (iocb->ki_flags & IOCB_NOWAIT)
> @@ -448,9 +450,7 @@ static int ext4_file_open(struct inode * inode, struct 
> file * filp)
>   return ret;
>   }
>  
> - /* Set the flags to support nowait AIO */
> - filp->f_mode |= FMODE_AIO_NOWAIT;
> -
> + filp->f_mode |= FMODE_NOWAIT;
>   return dquot_file_open(inode, filp);
>  }
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index c4893e226fd8..1a09104b3eb0 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -259,7 +259,11 @@ xfs_file_buffered_aio_read(
>  
>   trace_xfs_file_buffered_read(ip, iov_iter_count(to), iocb->ki_pos);
>  
> - xfs_ilock(ip, XFS_IOLOCK_SHARED);
> + if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) {
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EAGAIN;
> + xfs_ilock(ip, XFS_IOLOCK_SHARED);
> + }
>   ret = generic_file_read_iter(iocb, to);
>   xfs_iunlock(ip, XFS_IOLOCK_SHARED);
>  
> @@ -636,6 +640,9 @@ xfs_file_buffered_aio_write(
>   int enospc = 0;
>   int iolock;
>  
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EOPNOTSUPP;
> +
>  write_retry:
>   iolock = XFS_IOLOCK_EXCL;
>   xfs_ilock(ip, iolock);
> @@ -912,7 +919,7 @@ xfs_file_open(
>   return -EFBIG;
>   if (XFS_FORCED_SHUTDOWN(XFS_M(inode->i_sb)))
>   return -EIO;
> - file->f_mode |= FMODE_AIO_NOWAIT;
> + file->f_mode |= FMODE_NOWAIT;
>   return 0;
>  }
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6e1fd5d21248..ded258edc425 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -146,8 +146,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t 
> offset,
>  /* File was opened by fanotify and shouldn't generate fanotify events */
>  #define FMODE_NONOTIFY   

Re: [PATCH 4/4] block_dev: support RWF_NOWAIT on block device nodes

2017-08-24 Thread Jan Kara
On Tue 22-08-17 18:17:12, Christoph Hellwig wrote:
> All support is already there in the generic code, we just need to wire
> it up.
> 
> Signed-off-by: Christoph Hellwig <h...@lst.de>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/block_dev.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 9941dc8342df..ea21d18d8e79 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1739,6 +1739,8 @@ static int blkdev_open(struct inode * inode, struct 
> file * filp)
>*/
>   filp->f_flags |= O_LARGEFILE;
>  
> + filp->f_mode |= FMODE_NOWAIT;
> +
>   if (filp->f_flags & O_NDELAY)
>   filp->f_mode |= FMODE_NDELAY;
>   if (filp->f_flags & O_EXCL)
> @@ -1891,6 +1893,9 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct 
> iov_iter *from)
>   if (iocb->ki_pos >= size)
>   return -ENOSPC;
>  
> + if ((iocb->ki_flags & (IOCB_NOWAIT | IOCB_DIRECT)) == IOCB_NOWAIT)
> +     return -EOPNOTSUPP;
> +
>   iov_iter_truncate(from, size - iocb->ki_pos);
>  
> 	blk_start_plug(&plug);
> -- 
> 2.11.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 1/4] fs: pass iocb to do_generic_file_read

2017-08-24 Thread Jan Kara
On Tue 22-08-17 18:17:09, Christoph Hellwig wrote:
> And rename it to the more descriptive generic_file_buffered_read while
> at it.
> 
> Signed-off-by: Christoph Hellwig <h...@lst.de>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  mm/filemap.c | 15 ---
>  1 file changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a49702445ce0..4bcfa74ad802 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1886,9 +1886,8 @@ static void shrink_readahead_size_eio(struct file *filp,
>  }
>  
>  /**
> - * do_generic_file_read - generic file read routine
> - * @filp:the file to read
> - * @ppos:current file position
> + * generic_file_buffered_read - generic file read routine
> + * @iocb:the iocb to read
>   * @iter:data destination
>   * @written: already copied
>   *
> @@ -1898,12 +1897,14 @@ static void shrink_readahead_size_eio(struct file 
> *filp,
>   * This is really ugly. But the goto's actually try to clarify some
>   * of the logic when it comes to error handling etc.
>   */
> -static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
> +static ssize_t generic_file_buffered_read(struct kiocb *iocb,
>   struct iov_iter *iter, ssize_t written)
>  {
> + struct file *filp = iocb->ki_filp;
>   struct address_space *mapping = filp->f_mapping;
>   struct inode *inode = mapping->host;
> 	struct file_ra_state *ra = &filp->f_ra;
> +	loff_t *ppos = &iocb->ki_pos;
>   pgoff_t index;
>   pgoff_t last_index;
>   pgoff_t prev_index;
> @@ -2151,14 +2152,14 @@ static ssize_t do_generic_file_read(struct file 
> *filp, loff_t *ppos,
>  ssize_t
>  generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>  {
> - struct file *file = iocb->ki_filp;
> - ssize_t retval = 0;
>   size_t count = iov_iter_count(iter);
> + ssize_t retval = 0;
>  
>   if (!count)
>   goto out; /* skip atime */
>  
>   if (iocb->ki_flags & IOCB_DIRECT) {
> + struct file *file = iocb->ki_filp;
>   struct address_space *mapping = file->f_mapping;
>   struct inode *inode = mapping->host;
>   loff_t size;
> @@ -2199,7 +2200,7 @@ generic_file_read_iter(struct kiocb *iocb, struct 
> iov_iter *iter)
>       goto out;
>   }
>  
> - retval = do_generic_file_read(file, >ki_pos, iter, retval);
> + retval = generic_file_buffered_read(iocb, iter, retval);
>  out:
>   return retval;
>  }
> -- 
> 2.11.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 2/4] fs: support IOCB_NOWAIT in generic_file_buffered_read

2017-08-24 Thread Jan Kara
On Tue 22-08-17 18:17:10, Christoph Hellwig wrote:
> From: Milosz Tanski <mil...@adfin.com>
> 
> Allow generic_file_buffered_read to bail out early instead of waiting for
> the page lock or reading a page if IOCB_NOWAIT is specified.
> 
> Signed-off-by: Milosz Tanski <mil...@adfin.com>
> Reviewed-by: Christoph Hellwig <h...@lst.de>
> Reviewed-by: Jeff Moyer <jmo...@redhat.com>
> Acked-by: Sage Weil <s...@redhat.com>
> ---
>  mm/filemap.c | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4bcfa74ad802..9f60255fb7bb 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1937,6 +1937,8 @@ static ssize_t generic_file_buffered_read(struct kiocb 
> *iocb,
>  
>   page = find_get_page(mapping, index);
>   if (!page) {
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + goto would_block;
>   page_cache_sync_readahead(mapping,
>   ra, filp,
>   index, last_index - index);

Hum, we have:

		if (!PageUptodate(page)) {
			/*
			 * See comment in do_read_cache_page on why
			 * wait_on_page_locked is used to avoid unnecessarily
			 * serialisations and why it's safe.
			 */
			error = wait_on_page_locked_killable(page);
			if (unlikely(error))
				goto readpage_error;
			if (PageUptodate(page))
				goto page_ok;

And wait_on_page_locked_killable() above does not seem to be handled in
your patch. I would just check IOCB_NOWAIT in !PageUptodate(page) branch
above and bail - which also removes the need for the two checks below...

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v2] block/laptop_mode: Convert timers to use timer_setup()

2017-10-09 Thread Jan Kara
t bdi_init(struct backing_dev_info *bdi)
> 	INIT_LIST_HEAD(&bdi->bdi_list);
> 	INIT_LIST_HEAD(&bdi->wb_list);
> 	init_waitqueue_head(&bdi->wb_waitq);
> +	setup_timer(&bdi->laptop_mode_wb_timer,
> +		    laptop_mode_timer_fn, (unsigned long)bdi);
>  
>   ret = cgwb_bdi_init(bdi);
>  
> @@ -916,6 +950,8 @@ EXPORT_SYMBOL(bdi_register_owner);
>   */
>  static void bdi_remove_from_list(struct backing_dev_info *bdi)
>  {
> +	del_timer_sync(&bdi->laptop_mode_wb_timer);
> +
> 	spin_lock_bh(&bdi_lock);
> 	list_del_rcu(&bdi->bdi_list);
> 	spin_unlock_bh(&bdi_lock);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 8d1fc593bce8..f8fe90dc529d 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1976,42 +1976,6 @@ int dirty_writeback_centisecs_handler(struct ctl_table 
> *table, int write,
>   return 0;
>  }
>  
> -#ifdef CONFIG_BLOCK
> -void laptop_mode_timer_fn(unsigned long data)
> -{
> - struct request_queue *q = (struct request_queue *)data;
> -
> - wakeup_flusher_threads_bdi(q->backing_dev_info, WB_REASON_LAPTOP_TIMER);
> -}
> -
> -/*
> - * We've spun up the disk and we're in laptop mode: schedule writeback
> - * of all dirty data a few seconds from now.  If the flush is already 
> scheduled
> - * then push it back - the user is still using the disk.
> - */
> -void laptop_io_completion(struct backing_dev_info *info)
> -{
> -	mod_timer(&info->laptop_mode_wb_timer, jiffies + laptop_mode);
> -}
> -
> -/*
> - * We're in laptop mode and we've just synced. The sync's writes will have
> - * caused another writeback to be scheduled by laptop_io_completion.
> - * Nothing needs to be written back anymore, so we unschedule the writeback.
> - */
> -void laptop_sync_completion(void)
> -{
> - struct backing_dev_info *bdi;
> -
> - rcu_read_lock();
> -
> -	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
> -		del_timer(&bdi->laptop_mode_wb_timer);
> -
> - rcu_read_unlock();
> -}
> -#endif
> -
>  /*
>   * If ratelimit_pages is too high then we can get into dirty-data overload
>   * if a large number of processes all perform writes at the same time.
> -- 
> 2.14.1
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 1/3] lockdep: Apply crossrelease to PG_locked locks

2017-11-24 Thread Jan Kara
On Fri 24-11-17 09:11:49, Michal Hocko wrote:
> On Fri 24-11-17 12:02:36, Byungchul Park wrote:
> > On Thu, Nov 16, 2017 at 02:07:46PM +0100, Michal Hocko wrote:
> > > On Thu 16-11-17 21:48:05, Byungchul Park wrote:
> > > > On 11/16/2017 9:02 PM, Michal Hocko wrote:
> > > > > for each struct page. So you are doubling the size. Who is going to
> > > > > enable this config option? You are moving this to page_ext in a later
> > > > > patch which is a good step but it doesn't go far enough because this
> > > > > still consumes those resources. Is there any problem to make this
> > > > > kernel command line controllable? Something we do for page_owner for
> > > > > example?
> > > > 
> > > > Sure. I will add it.
> > > > 
> > > > > Also it would be really great if you could give us some measures about
> > > > > the runtime overhead. I do not expect it to be very large but this is
> > > > 
> > > > The major overhead would come from the amount of additional memory
> > > > consumption for 'lockdep_map's.
> > > 
> > > yes
> > > 
> > > > Do you want me to measure the overhead by the additional memory
> > > > consumption?
> > > > 
> > > > Or do you expect another overhead?
> > > 
> > > I would be also interested how much impact this has on performance. I do
> > > not expect it would be too large but having some numbers for cache cold
> > > parallel kbuild or other heavy page lock workloads would be useful.
> > 
> > Hello Michal,
> > 
> > I measured 'cache cold parallel kbuild' on my qemu machine. The result
> > varies much so I cannot confirm, but I think there's no meaningful
> > difference between before and after applying crossrelease to page locks.
> > 
> > Actually, I expect little overhead in lock_page() and unlock_page() even
> > after applying crossreleas to page locks, but only expect a bit overhead
> > by additional memory consumption for 'lockdep_map's per page.
> > 
> > I run the following instructions within "QEMU x86_64 4GB memory 4 cpus":
> > 
> >make clean
> >echo 3 > drop_caches
> >time make -j4
> 
> Maybe FS people will help you find a more representative workload. E.g.
> linear cache cold file read should be good as well. Maybe there are some
> tests in fstests (or how they call xfstests these days).

So a relatively good test of page handling costs is to mmap cache hot file
and measure time to fault in all the pages in the mapping. That way IO and
filesystem stays out of the way and you measure only page table lookup,
page handling (taking page ref and locking the page), and instantiation of
the new PTE. Out of this page handling is actually the significant part.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 03/11] fs: add frozen sb state helpers

2017-11-30 Thread Jan Kara
On Wed 29-11-17 15:23:48, Luis R. Rodriguez wrote:
> The question of whether or not a superblock is frozen needs to be
> augmented in the future to account for differences between a user
> initiated freeze and a kernel initiated freeze done automatically
> on behalf of the kernel.
> 
> Provide helpers so that these can be used instead so that we don't
> have to expand checks later in these same call sites as we expand
> the definition of a frozen superblock.
> 
> Signed-off-by: Luis R. Rodriguez <mcg...@kernel.org>

So helpers are fine but...

> +/**
> + * sb_is_frozen_by_user - is superblock frozen by a user call
> + * @sb: the super to check
> + *
> + * Returns true if the super freeze was initiated by userspace, for instance,
> + * an ioctl call.
> + */
> +static inline bool sb_is_frozen_by_user(struct super_block *sb)
> +{
> + return sb->s_writers.frozen == SB_FREEZE_COMPLETE;
> +}

... I dislike the _by_user() suffix as there may be different places that
call freeze_super() (e.g. device mapper does this during some operations).
Clearly we need to distinguish "by system suspend" and "the other" cases.
So please make this clear in the naming.

In fact, what might be a cleaner solution is to introduce a 'freeze_count'
for superblock freezing (we already do have this for block devices). Then
you don't need to differentiate these two cases - but you'd still need to
properly handle cleanup if freezing of all superblocks fails in the middle.
So I'm not 100% sure this works out nicely in the end. But it's certainly worth
a consideration.

Honza

-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 07/11] xfs: remove not needed freezing calls

2017-11-30 Thread Jan Kara
On Wed 29-11-17 15:23:52, Luis R. Rodriguez wrote:
> This removes superfluous freezer calls as they are no longer needed
> as the VFS now performs filesystem freezing/thaw if the filesystem has
> support for it.
> 
> The following Coccinelle rule was used as follows:
> 
> spatch --sp-file fs-freeze-cleanup.cocci --in-place fs/$FS/

I think your rule misses WQ_FREEZABLE flag for workqueues? That would be
also good to get rid of...

Honza

> 
> @ has_freeze_fs @
> identifier super_ops;
> expression freeze_op;
> @@
> 
> struct super_operations super_ops = {
>   .freeze_fs = freeze_op,
> };
> 
> @ remove_set_freezable depends on has_freeze_fs @
> expression time;
> statement S, S2, S3;
> expression task;
> @@
> 
> (
> - set_freezable();
> |
> - if (try_to_freeze())
> - continue;
> |
> - try_to_freeze();
> |
> - freezable_schedule();
> + schedule();
> |
> - freezable_schedule_timeout(time);
> + schedule_timeout(time);
> |
> - if (freezing(task)) { S }
> |
> - if (freezing(task)) { S }
> - else
>   { S2 }
> |
> - freezing(current)
> )
> 
> Generated-by: Coccinelle SmPL
> Signed-off-by: Luis R. Rodriguez <mcg...@kernel.org>
> ---
>  fs/xfs/xfs_trans_ail.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index cef89f7127d3..1f3dd10a9d00 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -513,7 +513,6 @@ xfsaild(
> 	long	tout = 0;	/* milliseconds */
>  
>   current->flags |= PF_MEMALLOC;
> - set_freezable();
>  
>   while (1) {
>   if (tout && tout <= 20)
> @@ -551,19 +550,17 @@ xfsaild(
>   if (!xfs_ail_min(ailp) &&
>   ailp->xa_target == ailp->xa_target_prev) {
> 	spin_unlock(&ailp->xa_lock);
> - freezable_schedule();
> + schedule();
>   tout = 0;
>   continue;
>   }
> 	spin_unlock(&ailp->xa_lock);
>  
>   if (tout)
> - freezable_schedule_timeout(msecs_to_jiffies(tout));
> + schedule_timeout(msecs_to_jiffies(tout));
>  
>   __set_current_state(TASK_RUNNING);
>  
> - try_to_freeze();
> -
>   tout = xfsaild_push(ailp);
>   }
>  
> -- 
> 2.15.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 05/11] fs: add iterate_supers_excl() and iterate_supers_reverse_excl()

2017-11-30 Thread Jan Kara
On Wed 29-11-17 15:23:50, Luis R. Rodriguez wrote:
> There are use cases where we wish to traverse the superblock list
> but also capture errors, and in which case we want to avoid having
> our callers issue a lock themselves since we can do the locking for
> the callers. Provide a iterate_supers_excl() which calls a function
> with the write lock held. If an error occurs we capture it and
> propagate it.
> 
> Likewise there are use cases where we wish to traverse the superblock
> list but in reverse order. The new iterate_supers_reverse_excl() helper
> does this and also captures any errors encountered.
> 
> Signed-off-by: Luis R. Rodriguez <mcg...@kernel.org>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza


> ---
>  fs/super.c | 91 
> ++
>  include/linux/fs.h |  2 ++
>  2 files changed, 93 insertions(+)
> 
> diff --git a/fs/super.c b/fs/super.c
> index a63513d187e8..885711c1d35b 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -605,6 +605,97 @@ void iterate_supers(void (*f)(struct super_block *, void 
> *), void *arg)
> 	spin_unlock(&sb_lock);
>  }
>  
> +/**
> + *   iterate_supers_excl - exclusively call func for all active superblocks
> + *   @f: function to call
> + *   @arg: argument to pass to it
> + *
> + *   Scans the superblock list and calls given function, passing it
> + *   locked superblock and given argument. Returns 0 unless an error
> + *   occurred on calling the function on any superblock.
> + */
> +int iterate_supers_excl(int (*f)(struct super_block *, void *), void *arg)
> +{
> + struct super_block *sb, *p = NULL;
> + int error = 0;
> +
> +	spin_lock(&sb_lock);
> +	list_for_each_entry(sb, &super_blocks, s_list) {
> +		if (hlist_unhashed(&sb->s_instances))
> +			continue;
> +		sb->s_count++;
> +		spin_unlock(&sb_lock);
> +
> +		down_write(&sb->s_umount);
> +		if (sb->s_root && (sb->s_flags & SB_BORN)) {
> +			error = f(sb, arg);
> +			if (error) {
> +				up_write(&sb->s_umount);
> +				spin_lock(&sb_lock);
> +				__put_super(sb);
> +				break;
> +			}
> +		}
> +		up_write(&sb->s_umount);
> +
> +		spin_lock(&sb_lock);
> +		if (p)
> +			__put_super(p);
> +		p = sb;
> +	}
> +	if (p)
> +		__put_super(p);
> +	spin_unlock(&sb_lock);
> +
> +	return error;
> +}
> +
> +/**
> + *   iterate_supers_reverse_excl - exclusively calls func in reverse order
> + *   @f: function to call
> + *   @arg: argument to pass to it
> + *
> + *   Scans the superblock list and calls given function, passing it
> + *   locked superblock and given argument, in reverse order, and holding
> + *   the s_umount write lock. Returns if an error occurred.
> + */
> +int iterate_supers_reverse_excl(int (*f)(struct super_block *, void *),
> +  void *arg)
> +{
> + struct super_block *sb, *p = NULL;
> + int error = 0;
> +
> +	spin_lock(&sb_lock);
> +	list_for_each_entry_reverse(sb, &super_blocks, s_list) {
> +		if (hlist_unhashed(&sb->s_instances))
> +			continue;
> +		sb->s_count++;
> +		spin_unlock(&sb_lock);
> +
> +		down_write(&sb->s_umount);
> +		if (sb->s_root && (sb->s_flags & SB_BORN)) {
> +			error = f(sb, arg);
> +			if (error) {
> +				up_write(&sb->s_umount);
> +				spin_lock(&sb_lock);
> +				__put_super(sb);
> +				break;
> +			}
> +		}
> +		up_write(&sb->s_umount);
> +
> +		spin_lock(&sb_lock);
> +		if (p)
> +			__put_super(p);
> +		p = sb;
> +	}
> +	if (p)
> +		__put_super(p);
> +	spin_unlock(&sb_lock);
> +
> +	return error;
> +}
> +
>  /**
>   *   iterate_supers_type - call function for superblocks of given type
>   *   @type: fs type
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 107725b20fad..fe90b6542697 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3164,6 +3164,8 @@ extern struct super_block *get_active_super(struct

Re: [PATCH 02/11] fs: provide unlocked helper thaw_super()

2017-11-30 Thread Jan Kara
On Wed 29-11-17 15:23:47, Luis R. Rodriguez wrote:
> thaw_super() holds a write lock; however, we wish to also enable
> callers which already hold the write lock. To do this provide a helper
> and make thaw_super() use it. This way, all that thaw_super() does
> now is lock handling and active count management.
> 
> This change has no functional changes.
> 
> Suggested-by: Dave Chinner <da...@fromorbit.com>
> Signed-off-by: Luis R. Rodriguez <mcg...@kernel.org>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/super.c | 39 ++-
>  1 file changed, 26 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index a7650ff22f0e..cecc279beecd 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1493,21 +1493,13 @@ int freeze_super(struct super_block *sb)
>  }
>  EXPORT_SYMBOL(freeze_super);
>  
> -/**
> - * thaw_super -- unlock filesystem
> - * @sb: the super to thaw
> - *
> - * Unlocks the filesystem and marks it writeable again after freeze_super().
> - */
> -int thaw_super(struct super_block *sb)
> +/* Caller takes lock and handles active count */
> +static int thaw_locked_super(struct super_block *sb)
>  {
>   int error;
>  
> -	down_write(&sb->s_umount);
> -	if (sb->s_writers.frozen != SB_FREEZE_COMPLETE) {
> -		up_write(&sb->s_umount);
> + if (sb->s_writers.frozen != SB_FREEZE_COMPLETE)
>   return -EINVAL;
> - }
>  
>   if (sb_rdonly(sb)) {
>   sb->s_writers.frozen = SB_UNFROZEN;
> @@ -1522,7 +1514,6 @@ int thaw_super(struct super_block *sb)
>   printk(KERN_ERR
>   "VFS:Filesystem thaw failed\n");
>   lockdep_sb_freeze_release(sb);
> -			up_write(&sb->s_umount);
>   return error;
>   }
>   }
> @@ -1531,7 +1522,29 @@ int thaw_super(struct super_block *sb)
>   sb_freeze_unlock(sb);
>  out:
> 	wake_up(&sb->s_writers.wait_unfrozen);
> - deactivate_locked_super(sb);
>   return 0;
>  }
> +
> +/**
> + * thaw_super -- unlock filesystem
> + * @sb: the super to thaw
> + *
> + * Unlocks the filesystem and marks it writeable again after freeze_super().
> + */
> +int thaw_super(struct super_block *sb)
> +{
> + int error;
> +
> +	down_write(&sb->s_umount);
> +	error = thaw_locked_super(sb);
> +	if (error) {
> +		up_write(&sb->s_umount);
> + goto out;
> + }
> +
> + deactivate_locked_super(sb);
> +
> +out:
> + return error;
> +}
>  EXPORT_SYMBOL(thaw_super);
> -- 
> 2.15.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 01/11] fs: provide unlocked helper for freeze_super()

2017-11-30 Thread Jan Kara
On Wed 29-11-17 15:23:46, Luis R. Rodriguez wrote:
> freeze_super() holds a write lock, however we wish to also enable
> callers which already hold the write lock. To do this provide a helper
> and make freeze_super() use it. This way, all that freeze_super() does
> now is lock handling and active count management.
> 
> This change has no functional changes.
> 
> Suggested-by: Dave Chinner <da...@fromorbit.com>
> Signed-off-by: Luis R. Rodriguez <mcg...@kernel.org>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/super.c | 100 
> +
>  1 file changed, 55 insertions(+), 45 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index d4e33e8f1e6f..a7650ff22f0e 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1387,59 +1387,20 @@ static void sb_freeze_unlock(struct super_block *sb)
>   percpu_up_write(sb->s_writers.rw_sem + level);
>  }
>  
> -/**
> - * freeze_super - lock the filesystem and force it into a consistent state
> - * @sb: the super to lock
> - *
> - * Syncs the super to make sure the filesystem is consistent and calls the 
> fs's
> - * freeze_fs.  Subsequent calls to this without first thawing the fs will 
> return
> - * -EBUSY.
> - *
> - * During this function, sb->s_writers.frozen goes through these values:
> - *
> - * SB_UNFROZEN: File system is normal, all writes progress as usual.
> - *
> - * SB_FREEZE_WRITE: The file system is in the process of being frozen.  New
> - * writes should be blocked, though page faults are still allowed. We wait 
> for
> - * all writes to complete and then proceed to the next stage.
> - *
> - * SB_FREEZE_PAGEFAULT: Freezing continues. Now also page faults are blocked
> - * but internal fs threads can still modify the filesystem (although they
> - * should not dirty new pages or inodes), writeback can run etc. After 
> waiting
> - * for all running page faults we sync the filesystem which will clean all
> - * dirty pages and inodes (no new dirty pages or inodes can be created when
> - * sync is running).
> - *
> - * SB_FREEZE_FS: The file system is frozen. Now all internal sources of fs
> - * modification are blocked (e.g. XFS preallocation truncation on inode
> - * reclaim). This is usually implemented by blocking new transactions for
> - * filesystems that have them and need this additional guard. After all
> - * internal writers are finished we call ->freeze_fs() to finish filesystem
> - * freezing. Then we transition to SB_FREEZE_COMPLETE state. This state is
> - * mostly auxiliary for filesystems to verify they do not modify frozen fs.
> - *
> - * sb->s_writers.frozen is protected by sb->s_umount.
> - */
> -int freeze_super(struct super_block *sb)
> +/* Caller takes lock and handles active count */
> +static int freeze_locked_super(struct super_block *sb)
>  {
>   int ret;
>  
> -	atomic_inc(&sb->s_active);
> -	down_write(&sb->s_umount);
> - if (sb->s_writers.frozen != SB_UNFROZEN) {
> - deactivate_locked_super(sb);
> + if (sb->s_writers.frozen != SB_UNFROZEN)
>   return -EBUSY;
> - }
>  
> - if (!(sb->s_flags & SB_BORN)) {
> -		up_write(&sb->s_umount);
> + if (!(sb->s_flags & SB_BORN))
>   return 0;   /* sic - it's "nothing to do" */
> - }
>  
>   if (sb_rdonly(sb)) {
>   /* Nothing to do really... */
>   sb->s_writers.frozen = SB_FREEZE_COMPLETE;
> -		up_write(&sb->s_umount);
>   return 0;
>   }
>  
> @@ -1468,7 +1429,6 @@ int freeze_super(struct super_block *sb)
>   sb->s_writers.frozen = SB_UNFROZEN;
>   sb_freeze_unlock(sb);
> -		wake_up(&sb->s_writers.wait_unfrozen);
> - deactivate_locked_super(sb);
>   return ret;
>   }
>   }
> @@ -1478,9 +1438,59 @@ int freeze_super(struct super_block *sb)
>*/
>   sb->s_writers.frozen = SB_FREEZE_COMPLETE;
>   lockdep_sb_freeze_release(sb);
> - up_write(&sb->s_umount);
>   return 0;
>  }
> +
> +/**
> + * freeze_super - lock the filesystem and force it into a consistent state
> + * @sb: the super to lock
> + *
> + * Syncs the super to make sure the filesystem is consistent and calls the 
> fs's
> + * freeze_fs.  Subsequent calls to this without first thawing the fs will 
> return
> + * -EBUSY.
> + *
> + * During this function, sb->s_writers.frozen goes through th

Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown

2017-11-20 Thread Jan Kara
Hi Tao!

On Fri 17-11-17 14:51:18, Hou Tao wrote:
> On 2017/3/13 23:14, Jan Kara wrote:
> > blkdev_open() may race with gendisk shutdown in two different ways.
> > Either del_gendisk() has already unhashed block device inode (and thus
> > bd_acquire() will end up creating new block device inode) however
> > get_gendisk() will still return the gendisk that is being destroyed.
> > Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
> > before we get to get_gendisk() and get_gendisk() will return new gendisk
> > that got allocated for device that reused the device number.
> > 
> > In both cases this will result in possible inconsistencies between
> > bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
> > destroyed and device number reused, in the second case immediately).
> 
> > Fix the problem by checking whether the gendisk is still alive and inode
> > hashed when associating bdev inode with it and its bdi. That way we are
> > sure that we will not associate bdev inode with disk that got past
> > blk_unregister_region() in del_gendisk() (and thus device number can get
> > reused). Similarly, we will not associate bdev that was once associated
> > with gendisk that is going away (and thus the corresponding bdev inode
> > will get unhashed in del_gendisk()) with a new gendisk that is just
> > reusing the device number.
> 
> I have run some extended tests based on Omar Sandoval's script [1] these days,
> and found two new problems related to the race between bdev open and gendisk
> shutdown.
> 
> The first problem lies in __blkdev_get(), and will cause a use-after-free, or
> worse, an oops. It happens because there may be two instances of gendisk
> related to one bdev. When the process which owns the newer gendisk completes
> the invocation of __blkdev_get() first, the other process which owns the older
> gendisk will put the last reference of the older gendisk and cause a
> use-after-free.
> 
> The following call sequences illustrate the problem:
> 
> unhash the bdev_inode of /dev/sda
> 
> process A (blkdev_open()):
>   bd_acquire() returns new bdev_inode of /dev/sda
>   get_gendisk() returns gendisk (v1 gendisk)
> 
> remove gendisk from bdev_map
> device number is reused

^^ this is through blk_unregister_region() in del_gendisk(), isn't it?

> process B (blkdev_open()):
>   bd_acquire() returns new bdev_inode of /dev/sda
>   get_gendisk() returns a new gendisk (v2 gendisk)
>   increase bdev->bd_openers

So this needs to be a bit more complex - bd_acquire() must happen before
bdev_unhash_inode() in del_gendisk() above, and get_gendisk() must happen
after blk_unregister_region() from del_gendisk(). But yes, this seems
possible.

> process A (blkdev_open()):
>   find bdev->bd_openers != 0
>   put_disk(disk) // this is the last reference to v1 gendisk
>   disk_unblock_events(disk) // use-after-free occurs
> 
> The problem can be fixed by an extra check showed in the following snippet:
> 
> static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
> {
>         if (!bdev->bd_openers) {
>                 /* ... */
>         } else {
>                 if (bdev->bd_disk != disk) {
>                         ret = -ENXIO;
>                         goto out_unlock_bdev;
>                 }
>         }
> }
>
> And we also can simplify the check in your patch by moving
> blk_unregister_region() before bdev_unhash_inode(), so whenever we get a
> new bdev_inode, the gendisk returned by get_gendisk() will either be NULL
> or a new gendisk.

I don't think we can do that. If we call blk_unregister_region() before
bdev_unhash_inode(), the device number can get reused while bdev inode is
still visible. So you can get that bdev inode v1 associated with gendisk
v2. And open following unhashing of bdev inode v1 will get bdev inode v2
associated with gendisk v2. So you have two different bdev inodes
associated with the same gendisk and that will cause data integrity issues.

But what you suggest above is probably a reasonable fix. Will you submit a
proper patch please?

> And the only check we need to do in __blkdev_get() is
> checking the hash status of the bdev_inode after we get a gendisk from
> get_gendisk(). The check can be done like the following snippet:
> 
> static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
> {
>         /* ... */
>         ret = -ENXIO;
>         disk = get_gendisk(bdev->bd_dev, &partno);
>         if (!disk)
>                 goto out;
>         owner = disk->fops->owner;
> 
>         if (inode_unhashed(bdev->bd_inode))
>                 goto out_unmatched;
> }
> 
> The second problem 

Re: [PATCH v2 2/3] bdi: add error handle for bdi_debug_register

2017-11-01 Thread Jan Kara
On Tue 31-10-17 18:38:24, weiping zhang wrote:
> In order to make error handling cleaner, call bdi_debug_register before
> setting the state to WB_registered, so that we can avoid calling
> bdi_unregister in release_bdi().
> 
> Signed-off-by: weiping zhang <zhangweip...@didichuxing.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  mm/backing-dev.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index b5f940ce0143..84b2dc76f140 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -882,10 +882,13 @@ int bdi_register_va(struct backing_dev_info *bdi, const 
> char *fmt, va_list args)
>   if (IS_ERR(dev))
>   return PTR_ERR(dev);
>  
> + if (bdi_debug_register(bdi, dev_name(dev))) {
> + device_destroy(bdi_class, dev->devt);
> + return -ENOMEM;
> + }
>   cgwb_bdi_register(bdi);
>   bdi->dev = dev;
>  
> - bdi_debug_register(bdi, dev_name(dev));
> set_bit(WB_registered, &bdi->wb.state);
>  
> spin_lock_bh(&bdi_lock);
> -- 
> 2.14.2
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 03/11] fs: add frozen sb state helpers

2017-12-01 Thread Jan Kara
On Thu 30-11-17 20:05:48, Luis R. Rodriguez wrote:
> On Thu, Nov 30, 2017 at 06:13:10PM +0100, Jan Kara wrote:
> > ... I dislike the _by_user() suffix as there may be different places that
> > call freeze_super() (e.g. device mapper does this during some operations).
> > Clearly we need to distinguish "by system suspend" and "the other" cases.
> > So please make this clear in the naming.
> 
> Ah. How about sb_frozen_by_cb() ?

And what does 'cb' stand for? :)

> > In fact, what might be a cleaner solution is to introduce a 'freeze_count'
> > for superblock freezing (we already do have this for block devices). Then
> > you don't need to differentiate these two cases - but you'd still need to
> > properly handle cleanup if freezing of all superblocks fails in the middle.
> > So I'm not 100% this works out nicely in the end. But it's certainly worth
> > a consideration.
> 
> Ah, there are three important reasons for doing it the way I did it which are
> easy to miss, unless you read the commit log message very carefully.
> 
> 0) The ioctl interface causes a failure to be sent back to userspace if
> you issue two consecutive freezes, or two thaws. Ie, once a filesystem is
> frozen, a secondary call will result in an error. Likewise for thaw.

Yep. But also note that there's *another* interface to filesystem freezing
which behaves differently - freeze_bdev() (used internally by dm). That
interface uses the counter and freezing of already frozen device succeeds.
IOW it is a mess. We cannot change the behavior of the ioctl but we could
still provide an in-kernel interface to freeze_super() with the same
semantics as freeze_bdev() which might be easier to use by suspend - maybe
we could call it 'exclusive' (for the current freeze_super() semantics) and
'non-exclusive' (for the freeze_bdev() semantics) since this is very much
like O_EXCL open of block devices...

> 1) The new iterate supers stuff I added bail on the first error and return 
> that
> error. If we kept the ioctl() interface error scheme we'd be erroring out
> if on suspend if userspace had already frozen a filesystem. Clearly that'd
> be silly so we need to distinguish between the automatic kernel freezing
> and the old userspace ioctl initiated interface, so that we can keep the
> old behaviour but allow in-kernel auto freeze on suspend to work properly.

This would work fine with the non-exclusive semantics I believe.

> 2) If we fail to suspend we need to then thaw up all filesystems. The order
> in which we try to freeze is in reverse order on the super_block list. If we
> fail though we iterate in proper order on the super_block list and thaw. If
> you had two filesystems this means that if a failure happened on freezing
> the first filesystem, we'd first thaw the other filesystem -- and because of
> 0) if we don't distinguish between the ioctl interface or auto freezing, we'd
> also fail on thaw'ing given the other superblock wouldn't have been frozen.
> 
> So we need to keep two separate approaches. The count stuff would not suffice
> to distinguish origin of source for freeze call.
> 
> Come to think of it, I think I forgot to avoid thaw if the freeze was ioctl
> initiated..
> 
> thaw_unlocked(bool cb_call)
> {
>   if (sb_frozen_by_cb(sb) && !cb_call)
> return 0; /* skip as the user wanted to keep this fs frozen */
>   ...
> }
> 
> Even though the kernel auto call is new I think we need to keep ioctl 
> initiated
> frozen filesystems frozen to not break old userspace assumptions.
> 
> So, keeping all this in mind, does a count method still suffice?

The count method would need a different error recovery method - i.e. if you
fail freezing filesystems somewhere in the middle of iteration through
superblock list, you'd need to iterate from that point on to the superblock
where you've started. This is somewhat more complicated than your approach
but it seems cleaner to me:

1) Function freezing all superblocks either succeeds and all superblocks
are frozen or fails and no superblocks are (additionally) frozen.

2) It is not that normal users + one special user (who owns the "flag" in
the superblock in form of a special freeze state) setup. We'd simply have
exclusive and non-exclusive users of superblock freezing and there can be
arbitrary numbers of them.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 1/4] bdi: add check for bdi_debug_root

2017-10-30 Thread Jan Kara
On Fri 27-10-17 01:35:36, weiping zhang wrote:
> This patch adds a check for bdi_debug_root and error handling for it.
> We should make sure it was created successfully; otherwise, a new block
> device's bdi directory (e.g. 8:0) will be created in the debugfs root
> directory.
> 
> Signed-off-by: weiping zhang <zhangweip...@didichuxing.com>
> ---
>  mm/backing-dev.c | 17 ++---
>  1 file changed, 14 insertions(+), 3 deletions(-)

These functions get called only on system boot - ENOMEM in those cases is
generally considered fatal and oopsing is acceptable result. So I don't
think this patch is needed.

Honza

> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 74b52dfd5852..5072be19d9b2 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -36,9 +36,12 @@ struct workqueue_struct *bdi_wq;
>  
>  static struct dentry *bdi_debug_root;
>  
> -static void bdi_debug_init(void)
> +static int bdi_debug_init(void)
>  {
>   bdi_debug_root = debugfs_create_dir("bdi", NULL);
> + if (!bdi_debug_root)
> + return -ENOMEM;
> + return 0;
>  }
>  
>  static int bdi_debug_stats_show(struct seq_file *m, void *v)
> @@ -126,8 +129,9 @@ static void bdi_debug_unregister(struct backing_dev_info 
> *bdi)
>   debugfs_remove(bdi->debug_dir);
>  }
>  #else
> -static inline void bdi_debug_init(void)
> +static inline int bdi_debug_init(void)
>  {
> + return 0;
>  }
>  static inline void bdi_debug_register(struct backing_dev_info *bdi,
> const char *name)
> @@ -229,12 +233,19 @@ ATTRIBUTE_GROUPS(bdi_dev);
>  
>  static __init int bdi_class_init(void)
>  {
> + int ret;
> +
>   bdi_class = class_create(THIS_MODULE, "bdi");
>   if (IS_ERR(bdi_class))
>   return PTR_ERR(bdi_class);
>  
>   bdi_class->dev_groups = bdi_dev_groups;
> - bdi_debug_init();
> + ret = bdi_debug_init();
> + if (ret) {
> + class_destroy(bdi_class);
> + bdi_class = NULL;
> + return ret;
> + }
>  
>   return 0;
>  }
> -- 
> 2.14.2
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 2/4] bdi: convert bdi_debug_register to int

2017-10-30 Thread Jan Kara
On Fri 27-10-17 01:35:57, weiping zhang wrote:
> Convert bdi_debug_register to return int and add error handling for it.
> 
> Signed-off-by: weiping zhang <zhangweip...@didichuxing.com>

This patch looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  mm/backing-dev.c | 17 +++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 5072be19d9b2..e9d6a1ede12b 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -116,11 +116,23 @@ static const struct file_operations 
> bdi_debug_stats_fops = {
>   .release= single_release,
>  };
>  
> -static void bdi_debug_register(struct backing_dev_info *bdi, const char 
> *name)
> +static int bdi_debug_register(struct backing_dev_info *bdi, const char *name)
>  {
> + if (!bdi_debug_root)
> + return -ENOMEM;
> +
>   bdi->debug_dir = debugfs_create_dir(name, bdi_debug_root);
> + if (!bdi->debug_dir)
> + return -ENOMEM;
> +
>   bdi->debug_stats = debugfs_create_file("stats", 0444, bdi->debug_dir,
>  bdi, &bdi_debug_stats_fops);
> + if (!bdi->debug_stats) {
> + debugfs_remove(bdi->debug_dir);
> + return -ENOMEM;
> + }
> +
> + return 0;
>  }
>  
>  static void bdi_debug_unregister(struct backing_dev_info *bdi)
> @@ -133,9 +145,10 @@ static inline int bdi_debug_init(void)
>  {
>   return 0;
>  }
> -static inline void bdi_debug_register(struct backing_dev_info *bdi,
> +static inline int bdi_debug_register(struct backing_dev_info *bdi,
>     const char *name)
>  {
> + return 0;
>  }
>  static inline void bdi_debug_unregister(struct backing_dev_info *bdi)
>  {
> -- 
> 2.14.2
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 3/4] bdi: add error handle for bdi_debug_register

2017-10-30 Thread Jan Kara
On Fri 27-10-17 01:36:14, weiping zhang wrote:
> In order to make error handling cleaner, call bdi_debug_register before
> setting the state to WB_registered, so that we can avoid calling
> bdi_unregister in release_bdi().
> 
> Signed-off-by: weiping zhang <zhangweip...@didichuxing.com>
> ---
>  mm/backing-dev.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index e9d6a1ede12b..54396d53f471 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -893,10 +893,13 @@ int bdi_register_va(struct backing_dev_info *bdi, const 
> char *fmt, va_list args)
>   if (IS_ERR(dev))
>   return PTR_ERR(dev);
>  
> + if (bdi_debug_register(bdi, dev_name(dev))) {
> + device_destroy(bdi_class, dev->devt);
> + return -ENOMEM;
> + }
>   cgwb_bdi_register(bdi);
>   bdi->dev = dev;
>  
> - bdi_debug_register(bdi, dev_name(dev));
> set_bit(WB_registered, &bdi->wb.state);
>  
> spin_lock_bh(&bdi_lock);
> @@ -916,6 +919,8 @@ int bdi_register(struct backing_dev_info *bdi, const char 
> *fmt, ...)
>   va_start(args, fmt);
>   ret = bdi_register_va(bdi, fmt, args);
>   va_end(args);
> + if (ret)
> + bdi_put(bdi);

Why do you drop bdi reference here in case of error? We didn't do it
previously if bdi_register_va() failed for other reasons...

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 4/4] block: add WARN_ON if bdi register fail

2017-10-30 Thread Jan Kara
On Fri 27-10-17 01:36:42, weiping zhang wrote:
> device_add_disk() needs more thorough error handling, so for now this
> patch just adds a WARN_ON.
> 
> Signed-off-by: weiping zhang <zhangweip...@didichuxing.com>
> ---
>  block/genhd.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index dd305c65ffb0..cb55eea821eb 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -660,7 +660,9 @@ void device_add_disk(struct device *parent, struct 
> gendisk *disk)
>  
>   /* Register BDI before referencing it from bdev */
>   bdi = disk->queue->backing_dev_info;
> - bdi_register_owner(bdi, disk_to_dev(disk));
> + retval = bdi_register_owner(bdi, disk_to_dev(disk));
> + if (retval)
> + WARN_ON(1);

Just a nit: You can do

WARN_ON(retval);

Otherwise you can add:

Reviewed-by: Jan Kara <j...@suse.cz>

    Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 03/11] fs: add frozen sb state helpers

2017-12-21 Thread Jan Kara
Hello,

I think I owe you a reply here... Sorry that it took so long.

On Fri 01-12-17 22:13:27, Luis R. Rodriguez wrote:
> On Fri, Dec 01, 2017 at 12:47:24PM +0100, Jan Kara wrote:
> > On Thu 30-11-17 20:05:48, Luis R. Rodriguez wrote:
> > > > In fact, what might be a cleaner solution is to introduce a 
> > > > 'freeze_count'
> > > > for superblock freezing (we already do have this for block devices). 
> > > > Then
> > > > you don't need to differentiate these two cases - but you'd still need 
> > > > to
> > > > properly handle cleanup if freezing of all superblocks fails in the 
> > > > middle.
> > > > So I'm not 100% this works out nicely in the end. But it's certainly 
> > > > worth
> > > > a consideration.
> > > 
> > > Ah, there are three important reasons for doing it the way I did it which 
> > > are
> > > easy to miss, unless you read the commit log message very carefully.
> > > 
> > > 0) The ioctl interface causes a failure to be sent back to userspace if
> > > you issue two consecutive freezes, or two thaws. Ie, once a filesystem is
> > > frozen, a secondary call will result in an error. Likewise for thaw.
> > 
> > Yep. But also note that there's *another* interface to filesystem freezing
> > which behaves differently - freeze_bdev() (used internally by dm). That
> > interface uses the counter and freezing of already frozen device succeeds.
> 
> Ah... so freeze_bdev() semantics matches the same semantics I was looking
> for.
> 
> > IOW it is a mess.
> 
> To say the least.
> 
> > We cannot change the behavior of the ioctl but we could
> > still provide an in-kernel interface to freeze_super() with the same
> > semantics as freeze_bdev() which might be easier to use by suspend - maybe
> > we could call it 'exclusive' (for the current freeze_super() semantics) and
> > 'non-exclusive' (for the freeze_bdev() semantics) since this is very much
> > like O_EXCL open of block devices...
> 
> Sure, now typically I see we make exclusive calls with the postfix _excl() so
> I take it you'd be OK in renaming freeze_super() freeze_super_excl() 
> eventually
> then?

In principle yes but let's leave the naming disputes to a later time when
it is clear what API do we actually want to provide.

> I totally missed freeze_bdev() otherwise I think I would have picked up on the
> shared semantics stuff and I would have just made a helper out of what
> freeze_bdev() uses, and then have both in-kernel and freeze_bdev() use it.
> 
> I'll note that its still not perfectly clear if really the semantics behind
> freeze_bdev() match what I described above fully. That still needs to be
> vetted for. For instance, does thaw_bdev() keep a superblock frozen if we
> an ioctl initiated freeze had occurred before? If so then great. Otherwise
> I think we'll need to distinguish the ioctl interface. Worst possible case
> is that bdev semantics and in-kernel semantics differ somehow, then that
> will really create a holy fucking mess.

I believe nobody really thought about mixing those two interfaces to fs
freezing and so the behavior is basically defined by the implementation.
That is:

freeze_bdev() on sb frozen by ioctl_fsfreeze() -> EBUSY
freeze_bdev() on sb frozen by freeze_bdev() -> success
ioctl_fsfreeze() on sb frozen by freeze_bdev() -> EBUSY
ioctl_fsfreeze() on sb frozen by ioctl_fsfreeze() -> EBUSY

thaw_bdev() on sb frozen by ioctl_fsfreeze() -> EINVAL
ioctl_fsthaw() on sb frozen by freeze_bdev() -> success

What I propose is the following API:

freeze_super_excl()
  - freezes superblock, returns EBUSY if the superblock is already frozen
(either by another freeze_super_excl() or by freeze_super())
freeze_super()
  - this function will make sure superblock is frozen when the function
returns with success. It can be nested with other freeze_super() or
freeze_super_excl() calls (this second part is different from how
freeze_bdev() behaves currently but AFAICT this behavior is actually
what all current users of freeze_bdev() really want - just make sure
fs cannot be written to)
thaw_super()
  - counterpart to freeze_super(), would fail with EINVAL if we were to
drop the last "freeze reference" but sb was actually frozen with
freeze_super_excl()
thaw_super_excl()
  - counterpart to freeze_super_excl(). Fails with EINVAL if sb was not
frozen with freeze_super_excl() (this is different to current behavior
but I don't believe anyone relies on this and doing otherwise is asking
for data corruption).

I'd implement it by a freeze counter in the superblock (similar to what we
currently have in bdev) where every call to

Re: [PATCH] bdi: Fix oops in wb_workfn()

2018-05-09 Thread Jan Kara
On Fri 04-05-18 07:55:58, Dave Chinner wrote:
> On Thu, May 03, 2018 at 06:26:26PM +0200, Jan Kara wrote:
> > Syzbot has reported that it can hit a NULL pointer dereference in
> > wb_workfn() due to wb->bdi->dev being NULL. This indicates that
> > wb_workfn() was called for an already unregistered bdi which should not
> > happen as wb_shutdown() called from bdi_unregister() should make sure
> > all pending writeback works are completed before bdi is unregistered.
> > Except that wb_workfn() itself can requeue the work with:
> > 
> > mod_delayed_work(bdi_wq, &wb->dwork, 0);
> > 
> > and if this happens while wb_shutdown() is waiting in:
> > 
> > flush_delayed_work(&wb->dwork);
> > 
> > the dwork can get executed after wb_shutdown() has finished and
> > bdi_unregister() has cleared wb->bdi->dev.
> > 
> > Make wb_workfn() use wb_wakeup() for requeueing the work which takes all
> > the necessary precautions against racing with bdi unregistration.
> > 
> > CC: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> > CC: Tejun Heo <t...@kernel.org>
> > Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977
> > Reported-by: syzbot <syzbot+9873874c735f2892e...@syzkaller.appspotmail.com>
> > Signed-off-by: Jan Kara <j...@suse.cz>
> > ---
> >  fs/fs-writeback.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 47d7c151fcba..471d863958bc 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -1961,7 +1961,7 @@ void wb_workfn(struct work_struct *work)
> > }
> >  
> > if (!list_empty(&wb->work_list))
> > -   mod_delayed_work(bdi_wq, &wb->dwork, 0);
> > +   wb_wakeup(wb);
> > else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
> > wb_wakeup_delayed(wb);
> 
> Yup, looks fine - I can't see any more of these open coded wakeup,
> either, so we should be good here.
> 
> Reviewed-by: Dave Chinner <dchin...@redhat.com>

Thanks!

> As an aside, why is half the wb infrastructure in fs/fs-writeback.c
> and the other half in mm/backing-dev.c? it seems pretty random as to
> what is where e.g. wb_wakeup() and wb_wakeup_delayed() are almost
> identical, but are in completely different files...

Yeah, it deserves a cleanup.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH] bdi: Fix oops in wb_workfn()

2018-05-09 Thread Jan Kara
On Thu 03-05-18 18:26:26, Jan Kara wrote:
> Syzbot has reported that it can hit a NULL pointer dereference in
> wb_workfn() due to wb->bdi->dev being NULL. This indicates that
> wb_workfn() was called for an already unregistered bdi which should not
> happen as wb_shutdown() called from bdi_unregister() should make sure
> all pending writeback works are completed before bdi is unregistered.
> Except that wb_workfn() itself can requeue the work with:
> 
> >   mod_delayed_work(bdi_wq, &wb->dwork, 0);
> 
> and if this happens while wb_shutdown() is waiting in:
> 
> >   flush_delayed_work(&wb->dwork);
> 
> the dwork can get executed after wb_shutdown() has finished and
> bdi_unregister() has cleared wb->bdi->dev.
> 
> Make wb_workfn() use wb_wakeup() for requeueing the work which takes all
> the necessary precautions against racing with bdi unregistration.
> 
> CC: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> CC: Tejun Heo <t...@kernel.org>
> Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977
> Reported-by: syzbot <syzbot+9873874c735f2892e...@syzkaller.appspotmail.com>
> Signed-off-by: Jan Kara <j...@suse.cz>
> ---
>  fs/fs-writeback.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Jens, can you please pick up this patch? Probably for the next merge window
(I don't see a reason to rush this at this point in release cycle). Thanks!

Honza

> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 47d7c151fcba..471d863958bc 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1961,7 +1961,7 @@ void wb_workfn(struct work_struct *work)
>   }
>  
> >   if (!list_empty(&wb->work_list))
> > - mod_delayed_work(bdi_wq, &wb->dwork, 0);
> + wb_wakeup(wb);
>   else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
>   wb_wakeup_delayed(wb);
>  
> -- 
> 2.13.6
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH] bdi: Fix oops in wb_workfn()

2018-05-09 Thread Jan Kara
On Fri 04-05-18 07:35:34, Tetsuo Handa wrote:
> Jan Kara wrote:
> > Make wb_workfn() use wb_wakeup() for requeueing the work which takes all
> > the necessary precautions against racing with bdi unregistration.
> 
> Yes, this patch will solve NULL pointer dereference bug. But is it OK to
> leave list_empty(>work_list) == false situation? Who takes over the
> role of making list_empty(>work_list) == true?

That's a good question. The reason is that the last running instance of
wb_workfn() cannot exit with the work_list non-empty. Once WB_registered
is cleared we cannot add new entries to work_list, and then we queue and
flush the last wb_workfn() to clean up the list. The NULL ptr deref has
been triggered not by this last running wb_workfn() but by one running
independently in parallel to wb_shutdown(). So something like:

CPU0                            CPU1                            CPU2
wb_workfn()
  do {
    ...
  } while (!list_empty(&wb->work_list));
                                wb_queue_work()
                                  if (test_bit(WB_registered, &wb->state)) {
                                    list_add_tail(&work->list, &wb->work_list);
                                    mod_delayed_work(bdi_wq, &wb->dwork, 0);
                                  }
                                                                wb_shutdown()
                                                                  test_and_clear_bit(WB_registered,
                                                                                     &wb->state);
                                                                  ...
                                                                  mod_delayed_work(bdi_wq, &wb->dwork, 0);
                                                                  flush_delayed_work(&wb->dwork);
  if (!list_empty(&wb->work_list))
    mod_delayed_work(bdi_wq, &wb->dwork, 0); -> queues buggy work

> Just a confirmation, for Fabiano Rosas is facing a problem that "write call
> hangs in kernel space after virtio hot-remove" and is thinking that we might
> need to go the opposite direction
> ( http://lkml.kernel.org/r/f0787b79-1e50-5f55-a400-44f715451...@linux.ibm.com 
> ).

Yes, I'm aware of that report and I think it should be solved
differently than what Fabiano suggests.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: write call hangs in kernel space after virtio hot-remove

2018-05-09 Thread Jan Kara
On Thu 03-05-18 10:48:20, Matthew Wilcox wrote:
> On Thu, May 03, 2018 at 12:05:14PM -0400, Jeff Layton wrote:
> > On Thu, 2018-05-03 at 16:42 +0200, Jan Kara wrote:
> > > On Wed 25-04-18 17:07:48, Fabiano Rosas wrote:
> > > > I'm looking into an issue where removing a virtio disk via sysfs while 
> > > > another
> > > > process is issuing write() calls results in the writing task going into 
> > > > a
> > > > livelock:
> >
> > > Thanks for the debugging of the problem. I agree with your analysis 
> > > however
> > > I don't like your fix. The issue is that when bdi is unregistered we don't
> > > really expect any writeback to happen after that moment. This is what
> > > prevents various use-after-free issues and I'd like that to stay the way 
> > > it
> > > is.
> > > 
> > > What I think we should do is that we'll prevent dirtying of new pages when
> > > we know the underlying device is gone. Because that will fix your problem
> > > and also make sure user sees the IO errors directly instead of just in the
> > > kernel log. The question is how to make this happen in the least painful
> > > way. I think we could intercept writes in grab_cache_page_write_begin()
> > > (which however requires that function to return a proper error code and 
> > > not
> > > just NULL / non-NULL). And we should also intercept write faults to not
> > > allow page dirtying via mmap - probably somewhere in do_shared_fault() and
> > > do_wp_page(). I've added Jeff to CC since he's dealing with IO error
> > > handling a lot these days. Jeff, what do you think?
> > 
> > (cc'ing Willy too since he's given this more thought than me)
> > 
> > For the record, I've mostly been looking at error _reporting_. Handling
> > errors at this level is not something I've really considered in great
> > detail as of yet.
> > 
> > Still, I think the basic idea sounds reasonable. Not allowing pages to
> > be dirtied when we can't clean them seems like a reasonable thing to
> > do.
> > 
> > The big question is how we'll report this to userland:
> > 
> > Would your approach have it return an error on write() and such? What
> > sort of error if so? ENODEV? Would we have to SIGBUS when someone tries
> > to dirty the page through mmap?
> 
> I have been having some thoughts in this direction.  They are perhaps
> a little more long-term than this particular bug, so they may not be
> relevant to the immediate fix.
> 
> I want to separate removing hardware from tearing down the block device
> that represents something on those hardware devices.  That allows for
> better handling of intermittent transport failures, or accidental device
> removal, followed by a speedy insert.
> 
> In that happy future, the bug described here wouldn't be getting an
> -EIO, it'd be getting an -ENODEV because we've decided this device is
> permanently gone.  I think that should indeed be a SIGBUS on mapped
> writes.
> 
> Looking at the current code, filemap_fault() will return VM_FAULT_SIGBUS
> already if page_cache_read() returns any error other than -ENOMEM, so
> that's fine.  We probably want some code (and this is where I reach the
> edge of my knowledge about the current page cache) to rip all the pages
> out of the page cache for every file on an -ENODEV filesystem.

Note that that already (partially) happens through:

del_gendisk() -> invalidate_partition() -> __invalidate_device() ->
  invalidate_inodes()
  invalidate_bdev()

The slight trouble is this currently won't do anything for open files (as
such inodes will have elevated refcount) so that needs fixing. And then
we need to prevent new dirty pages from being created...

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH] loop: remember whether sysfs_create_group() succeeded

2018-05-09 Thread Jan Kara
On Fri 04-05-18 20:47:29, Tetsuo Handa wrote:
> >From 626d33de1b70b11ecaf95a9f83f7644998e54cbb Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Date: Wed, 2 May 2018 23:03:48 +0900
> Subject: [PATCH] loop: remember whether sysfs_create_group() succeeded
> 
> syzbot is hitting WARN() triggered by memory allocation fault
> injection [1] because loop module is calling sysfs_remove_group()
> when sysfs_create_group() failed.
> Fix this by remembering whether sysfs_create_group() succeeded.
> 
> [1] 
> https://syzkaller.appspot.com/bug?id=3f86c0edf75c86d2633aeb9dd69eccc70bc7e90b
> 
> Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Reported-by: syzbot <syzbot+9f03168400f56df89dbc6f1751f4458fe739f...@syzkaller.appspotmail.com>
> Reviewed-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
> Cc: Jens Axboe <ax...@kernel.dk>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  drivers/block/loop.c | 11 ++++++-----
>  drivers/block/loop.h |  1 +
>  2 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 5d4e316..1d758d8 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -809,16 +809,17 @@ static ssize_t loop_attr_dio_show(struct loop_device *lo, char *buf)
>   .attrs= loop_attrs,
>  };
>  
> -static int loop_sysfs_init(struct loop_device *lo)
> +static void loop_sysfs_init(struct loop_device *lo)
>  {
> - return sysfs_create_group(&disk_to_dev(lo->lo_disk)->kobj,
> -   &loop_attribute_group);
> + lo->sysfs_ready = !sysfs_create_group(&disk_to_dev(lo->lo_disk)->kobj,
> +   &loop_attribute_group);
>  }
>  
>  static void loop_sysfs_exit(struct loop_device *lo)
>  {
> - sysfs_remove_group(&disk_to_dev(lo->lo_disk)->kobj,
> -&loop_attribute_group);
> + if (lo->sysfs_ready)
> + sysfs_remove_group(&disk_to_dev(lo->lo_disk)->kobj,
> +&loop_attribute_group);
>  }
>  
>  static void loop_config_discard(struct loop_device *lo)
> diff --git a/drivers/block/loop.h b/drivers/block/loop.h
> index b78de98..73c801f 100644
> --- a/drivers/block/loop.h
> +++ b/drivers/block/loop.h
> @@ -58,6 +58,7 @@ struct loop_device {
>   struct kthread_worker   worker;
>   struct task_struct  *worker_task;
>   booluse_dio;
> + boolsysfs_ready;
>  
>   struct request_queue*lo_queue;
>   struct blk_mq_tag_set   tag_set;
> -- 
> 1.8.3.1
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH] bdi: Fix oops in wb_workfn()

2018-05-21 Thread Jan Kara
On Sat 19-05-18 23:27:09, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > Jan Kara wrote:
> > > Make wb_workfn() use wakeup_wb() for requeueing the work which takes all
> > > the necessary precautions against racing with bdi unregistration.
> > 
> > Yes, this patch will solve NULL pointer dereference bug. But is it OK to
> > leave list_empty(&wb->work_list) == false situation? Who takes over the
> > role of making list_empty(&wb->work_list) == true?
> 
> syzbot is again reporting the same NULL pointer dereference.
> 
>   general protection fault in wb_workfn (2)
>   
> https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206

Gaah... So we are still missing something.

> Didn't we overlook something obvious in commit b8b784958eccbf8f ("bdi:
> Fix oops in wb_workfn()") ?
> 
> At first, I thought that that commit will solve NULL pointer dereference bug.
> But what does
> 
>   if (!list_empty(&wb->work_list))
> - mod_delayed_work(bdi_wq, &wb->dwork, 0);
> + wb_wakeup(wb);
>   else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
>   wb_wakeup_delayed(wb);
> 
> mean?
> 
> static void wb_wakeup(struct bdi_writeback *wb)
> {
>   spin_lock_bh(&wb->work_lock);
>   if (test_bit(WB_registered, &wb->state))
>   mod_delayed_work(bdi_wq, &wb->dwork, 0);
>   spin_unlock_bh(&wb->work_lock);
> }
> 
> It means nothing but "we don't call mod_delayed_work() if WB_registered
> bit was already cleared".

Exactly.

> But if WB_registered bit is not yet cleared when we hit
> wb_wakeup_delayed() path?
> 
> void wb_wakeup_delayed(struct bdi_writeback *wb)
> {
>   unsigned long timeout;
> 
>   timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
>   spin_lock_bh(&wb->work_lock);
>   if (test_bit(WB_registered, &wb->state))
>   queue_delayed_work(bdi_wq, &wb->dwork, timeout);
>   spin_unlock_bh(&wb->work_lock);
> }
> 
> add_timer() is called because (presumably) timeout > 0. And after that
> timeout expires, __queue_work() is called even if WB_registered bit is
> already cleared before that timeout expires, isn't it?

Yes.

> void delayed_work_timer_fn(struct timer_list *t)
> {
>   struct delayed_work *dwork = from_timer(dwork, t, timer);
> 
>   /* should have been called from irqsafe timer with irq already off */
>   __queue_work(dwork->cpu, dwork->wq, &dwork->work);
> }
> 
> Then, wb_workfn() is after all scheduled even if we check for
> WB_registered bit, isn't it?

It can be queued after WB_registered bit is cleared but it cannot be queued
after mod_delayed_work(bdi_wq, >dwork, 0) has finished. That function
deletes the pending timer (the timer cannot be armed again because
WB_registered is cleared) and queues what should be the last round of
wb_workfn().

> Then, don't we need to check that
> 
>   mod_delayed_work(bdi_wq, &wb->dwork, 0);
>   flush_delayed_work(&wb->dwork);
> 
> is really waiting for completion? At least, shouldn't we try below debug
> output (not only for debugging this report but also generally desirable)?
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 7441bd9..ccec8cd 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -376,8 +376,10 @@ static void wb_shutdown(struct bdi_writeback *wb)
>* tells wb_workfn() that @wb is dying and its work_list needs to
>* be drained no matter what.
>*/
> - mod_delayed_work(bdi_wq, &wb->dwork, 0);
> - flush_delayed_work(&wb->dwork);
> + if (!mod_delayed_work(bdi_wq, &wb->dwork, 0))
> + printk(KERN_WARNING "wb_shutdown: mod_delayed_work() failed\n");

false return from mod_delayed_work() just means that there was no timer
armed. That is a valid situation if there are no dirty data.

> + if (!flush_delayed_work(&wb->dwork))
> + printk(KERN_WARNING "wb_shutdown: flush_delayed_work() failed\n");

And this is valid as well (although unlikely) if the work managed to
complete on another CPU before flush_delayed_work() was called.

So I don't think your warnings will help us much. But yes, we need to debug
this somehow. For now I have no idea what could be still going wrong.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


[PATCH] bdi: Fix another oops in wb_workfn()

2018-06-18 Thread Jan Kara
syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
WB_shutting_down after wb->bdi->dev became NULL. This indicates that
unregister_bdi() failed to call wb_shutdown() on one of wb objects.

The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
drops bdi's reference to wb structures before going through the list of
wbs again and calling wb_shutdown() on each of them. This way the loop
iterating through all wbs can easily miss a wb if that wb has already
passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
from cgwb_release_workfn() and as a result fully shutdown bdi although
wb_workfn() for this wb structure is still running. In fact there are
also other ways cgwb_bdi_unregister() can race with
cgwb_release_workfn() leading e.g. to use-after-free issues:

CPU1                                    CPU2
cgwb_bdi_unregister()
  cgwb_kill(*slot);

                                        cgwb_release()
                                          queue_work(cgwb_release_wq,
                                                     &wb->release_work);
                                        cgwb_release_workfn()
  wb = list_first_entry(&bdi->wb_list, ...)
  spin_unlock_irq(&cgwb_lock);
                                          wb_shutdown(wb);
                                          ...
                                          kfree_rcu(wb, rcu);
  wb_shutdown(wb); -> oops use-after-free

We solve these issues by synchronizing writeback structure shutdown from
cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
way we also no longer need synchronization using WB_shutting_down as the
mutex provides it for CONFIG_CGROUP_WRITEBACK case and without
CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
bdi_unregister().

Reported-by: syzbot 
Signed-off-by: Jan Kara 
---
 include/linux/backing-dev-defs.h |  2 +-
 mm/backing-dev.c | 20 +++++++-------------
 2 files changed, 8 insertions(+), 14 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 0bd432a4d7bd..24251762c20c 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -22,7 +22,6 @@ struct dentry;
  */
 enum wb_state {
WB_registered,  /* bdi_register() was done */
-   WB_shutting_down,   /* wb_shutdown() in progress */
WB_writeback_running,   /* Writeback is in progress */
WB_has_dirty_io,/* Dirty inodes on ->b_{dirty|io|more_io} */
WB_start_all,   /* nr_pages == 0 (all) work pending */
@@ -189,6 +188,7 @@ struct backing_dev_info {
 #ifdef CONFIG_CGROUP_WRITEBACK
struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
struct rb_root cgwb_congested_tree; /* their congested states */
+   struct mutex cgwb_release_mutex;  /* protect shutdown of wb structs */
 #else
struct bdi_writeback_congested *wb_congested;
 #endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 347cc834c04a..2e5d3df0853d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -359,15 +359,8 @@ static void wb_shutdown(struct bdi_writeback *wb)
	spin_lock_bh(&wb->work_lock);
	if (!test_and_clear_bit(WB_registered, &wb->state)) {
		spin_unlock_bh(&wb->work_lock);
-   /*
-* Wait for wb shutdown to finish if someone else is just
-* running wb_shutdown(). Otherwise we could proceed to wb /
-* bdi destruction before wb_shutdown() is finished.
-*/
-	wait_on_bit(&wb->state, WB_shutting_down, TASK_UNINTERRUPTIBLE);
return;
}
-	set_bit(WB_shutting_down, &wb->state);
	spin_unlock_bh(&wb->work_lock);
 
cgwb_remove_from_bdi_list(wb);
@@ -379,12 +372,6 @@ static void wb_shutdown(struct bdi_writeback *wb)
	mod_delayed_work(bdi_wq, &wb->dwork, 0);
	flush_delayed_work(&wb->dwork);
	WARN_ON(!list_empty(&wb->work_list));
-   /*
-* Make sure bit gets cleared after shutdown is finished. Matches with
-* the barrier provided by test_and_clear_bit() above.
-*/
-   smp_wmb();
-	clear_and_wake_up_bit(WB_shutting_down, &wb->state);
 }
 
 static void wb_exit(struct bdi_writeback *wb)
@@ -508,10 +495,12 @@ static void cgwb_release_workfn(struct work_struct *work)
struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
release_work);
 
+	mutex_lock(&wb->bdi->cgwb_release_mutex);
wb_shutdown(wb);
 
css_put(wb->memcg_css);
css_put(wb->blkcg_css);
+	mutex_unlock(&wb->bdi->cgwb_release_mutex);
 
	fprop_local_destroy_percpu(&wb->memcg_completions);
	percpu_ref_exit(&wb->refcnt);
@@ -697,6 +686,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 
	INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
bdi->cgwb_congested_tree = RB_ROOT;
+	mutex_init(&bdi->cgwb_release_mutex);

Re: [PATCH] bdi: Fix another oops in wb_workfn()

2018-06-11 Thread Jan Kara
On Sat 09-06-18 23:00:05, Tetsuo Handa wrote:
> From 014c4149f2e24cd26b278b32d5dfda056eecf093 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa 
> Date: Sat, 9 Jun 2018 22:47:52 +0900
> Subject: [PATCH] bdi: Fix another oops in wb_workfn()
> 
> syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
> wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
> WB_shutting_down after wb->bdi->dev became NULL. This indicates that
> unregister_bdi() failed to call wb_shutdown() on one of wb objects.
> 
> Since cgwb_bdi_unregister() from bdi_unregister() cannot call wb_shutdown()
> on wb objects which have already passed list_del_rcu() in wb_shutdown(),
> cgwb_bdi_unregister() from bdi_unregister() can return and set wb->bdi->dev
> to NULL before such wb objects enter final round of wb_workfn() via
> mod_delayed_work()/flush_delayed_work().

Thanks a lot for debugging the issue and also thanks a lot to Dmitry for
taking time to reproduce the race by hand with the debug patch! I really
appreciate it!

> Since WB_registered is already cleared by wb_shutdown(), only wb_shutdown()
> can schedule for final round of wb_workfn(). Since concurrent calls to
> wb_shutdown() on the same wb object is safe because of WB_shutting_down
> state, I think that wb_shutdown() can safely keep a wb object in the
> bdi->wb_list until that wb object leaves final round of wb_workfn().
> Thus, make wb_shutdown() call list_del_rcu() after flush_delayed_work().

However this is wrong and so is the patch. The problem is in
cgwb_bdi_unregister() which does cgwb_kill() and thus drops bdi's
reference to wb structures before going through the list of wbs again and
calling wb_shutdown() on each of them. The writeback structures we are
accessing at this point can be already freed in principle like:

CPU1                                    CPU2
cgwb_bdi_unregister()
  cgwb_kill(*slot);

                                        cgwb_release()
                                          queue_work(cgwb_release_wq,
                                                     &wb->release_work);
                                        cgwb_release_workfn()
  wb = list_first_entry(&bdi->wb_list, ...)
  spin_unlock_irq(&cgwb_lock);
                                          wb_shutdown(wb);
                                          ...
                                          kfree_rcu(wb, rcu);
  wb_shutdown(wb); -> oops use-after-free

I'm not 100% sure how to fix this. wb structures can be at various phases of
shutdown (or there may be other external references still existing) when we
enter cgwb_bdi_unregister() so I think adding a way for cgwb_bdi_unregister()
to wait for standard wb shutdown path to finish is the most robust way.
What do you think about attached patch Tejun? So far only compile tested...

Possible problem with it is that now cgwb_bdi_unregister() will wait for
all wb references to be dropped so it adds some implicit dependencies to
bdi shutdown path. 

Honza
-- 
Jan Kara 
SUSE Labs, CR
From f5038c6e7a3d1a4a91879187b92ede8c868988ac Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Mon, 11 Jun 2018 10:56:04 +0200
Subject: [PATCH] bdi: Fix another oops in wb_workfn()

syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
WB_shutting_down after wb->bdi->dev became NULL. This indicates that
unregister_bdi() failed to call wb_shutdown() on one of wb objects.

The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
drops bdi's reference to wb structures before going through the list of
wbs again and calling wb_shutdown() on each of them. This way the loop
iterating through all wbs can easily miss a wb if that wb has already
passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
from cgwb_release_workfn() and as a result fully shutdown bdi although
wb_workfn() for this wb structure is still running. In fact there are
also other ways cgwb_bdi_unregister() can race with
cgwb_release_workfn() leading e.g. to use-after-free issues:

CPU1                                    CPU2
cgwb_bdi_unregister()
  cgwb_kill(*slot);

                                        cgwb_release()
                                          queue_work(cgwb_release_wq,
                                                     &wb->release_work);
                                        cgwb_release_workfn()
  wb = list_first_entry(&bdi->wb_list, ...)
  spin_unlock_irq(&cgwb_lock);
                                          wb_shutdown(wb);
                                          ...
                                          kfree_rcu(wb, rcu);
  wb_shutdown(wb); -> oops use-after-free

We solve all these issues by making cgwb_bdi_unregister() wait for
shutdown of all wb structures instead of going through them and trying
to actively shut them down.

[1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206

Cc: Dmitry Vyukov 
Cc: Tejun Heo 
Reported-and-analyzed-by: Tetsuo Handa 
Reported-by: syzbot 
Signed-off-by: Jan Kara 
--

Re: [PATCH] bdi: Fix another oops in wb_workfn()

2018-06-11 Thread Jan Kara
On Mon 11-06-18 09:01:31, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jun 11, 2018 at 11:12:48AM +0200, Jan Kara wrote:
> > However this is wrong and so is the patch. The problem is in
> > cgwb_bdi_unregister() which does cgwb_kill() and thus drops bdi's
> > reference to wb structures before going through the list of wbs again and
> > calling wb_shutdown() on each of them. The writeback structures we are
> > accessing at this point can be already freed in principle like:
> > 
> > CPU1                                    CPU2
> > cgwb_bdi_unregister()
> >   cgwb_kill(*slot);
> > 
> >                                         cgwb_release()
> >                                           queue_work(cgwb_release_wq,
> >                                                      &wb->release_work);
> >                                         cgwb_release_workfn()
> >   wb = list_first_entry(&bdi->wb_list, ...)
> >   spin_unlock_irq(&cgwb_lock);
> >                                           wb_shutdown(wb);
> >                                           ...
> >                                           kfree_rcu(wb, rcu);
> >   wb_shutdown(wb); -> oops use-after-free
> > 
> > I'm not 100% sure how to fix this. wb structures can be at various phases of
> > shutdown (or there may be other external references still existing) when we
> > enter cgwb_bdi_unregister() so I think adding a way for 
> > cgwb_bdi_unregister()
> > to wait for standard wb shutdown path to finish is the most robust way.
> > What do you think about attached patch Tejun? So far only compile tested...
> > 
> > Possible problem with it is that now cgwb_bdi_unregister() will wait for
> > all wb references to be dropped so it adds some implicit dependencies to
> > bdi shutdown path. 
> 
> Would something like the following work or am I missing the point
> entirely?

I was pondering the same solution for a while but I think it won't work.
The problem is that e.g. wb_memcg_offline() could have already removed
wb from the radix tree but it is still pending in bdi->wb_list
(wb_shutdown() has not run yet) and so we'd drop reference we didn't get.

Honza
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 347cc83..359cacd 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -715,14 +715,19 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
>   WARN_ON(test_bit(WB_registered, &bdi->wb.state));
>  
>   spin_lock_irq(&cgwb_lock);
> - radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
> - cgwb_kill(*slot);
> + radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0) {
> + struct bdi_writeback *wb = *slot;
> +
> + wb_get(wb);
> + cgwb_kill(wb);
> + }
>  
>   while (!list_empty(&bdi->wb_list)) {
>   wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
> bdi_node);
>   spin_unlock_irq(&cgwb_lock);
>   wb_shutdown(wb);
> + wb_put(wb);
> + spin_lock_irq(&cgwb_lock);
>   }
>   spin_unlock_irq(&cgwb_lock);
> 
> 
> -- 
> tejun
-- 
Jan Kara 
SUSE Labs, CR


Re: general protection fault in wb_workfn (2)

2018-05-28 Thread Jan Kara
On Sun 27-05-18 09:47:54, Tetsuo Handa wrote:
> Forwarding 
> http://lkml.kernel.org/r/201805251915.fgh64517.hvfjoolffmq...@i-love.sakura.ne.jp
>  .
> 
> Jan Kara wrote:
> > > void delayed_work_timer_fn(struct timer_list *t)
> > > {
> > >   struct delayed_work *dwork = from_timer(dwork, t, timer);
> > > 
> > >   /* should have been called from irqsafe timer with irq already off */
> > >   __queue_work(dwork->cpu, dwork->wq, &dwork->work);
> > > }
> > > 
> > > Then, wb_workfn() is after all scheduled even if we check for
> > > WB_registered bit, isn't it?
> > 
> > It can be queued after WB_registered bit is cleared but it cannot be queued
> after mod_delayed_work(bdi_wq, &wb->dwork, 0) has finished. That function
> > deletes the pending timer (the timer cannot be armed again because
> > WB_registered is cleared) and queues what should be the last round of
> > wb_workfn().
> 
> mod_delayed_work() deletes the pending timer but does not wait for already
> invoked timer handler to complete because it is using del_timer() rather than
> del_timer_sync(). Then, what happens if __queue_work() is almost concurrently
> executed from two CPUs, one from mod_delayed_work(bdi_wq, &wb->dwork, 0) from
> wb_shutdown() path (which is called without spin_lock_bh(&wb->work_lock)) and
> the other from delayed_work_timer_fn() path (which is called without checking
> WB_registered bit under spin_lock_bh(&wb->work_lock)) ?

In this case, work should still be queued only once. The synchronization in
this case should be provided by the WORK_STRUCT_PENDING_BIT. When a delayed
work is queued by mod_delayed_work(), this bit is set, and gets cleared
only once the work is started on some CPU. But admittedly this code is
rather convoluted so I may be missing something.

Also you should note that flush_delayed_work() which follows
mod_delayed_work() in wb_shutdown() does del_timer_sync() so I don't see
how anything could get past that. In fact mod_delayed_work() is in
wb_shutdown() path to make sure wb_workfn() gets executed at least once
before the bdi_writeback structure gets cleaned up so that all queued items
are finished. We do not rely on it to remove pending timers or queued
wb_workfn() executions.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 1/3] fs: move documentation for thaw_super() where appropriate

2018-05-03 Thread Jan Kara
On Fri 20-04-18 16:59:02, Luis R. Rodriguez wrote:
> On commit 08fdc8a0138a ("buffer.c: call thaw_super during emergency thaw")
> Mateusz added thaw_super_locked() and made thaw_super() use it, but
> forgot to move the documentation.
> 
> Signed-off-by: Luis R. Rodriguez <mcg...@kernel.org>

Looks good (modulo the -- which is probably worth fixing when touching the
comment anyway). You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/super.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 5fa9a8d8d865..9d0eb5e20a1f 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1532,12 +1532,6 @@ int freeze_super(struct super_block *sb)
>  }
>  EXPORT_SYMBOL(freeze_super);
>  
> -/**
> - * thaw_super -- unlock filesystem
> - * @sb: the super to thaw
> - *
> - * Unlocks the filesystem and marks it writeable again after freeze_super().
> - */
>  static int thaw_super_locked(struct super_block *sb)
>  {
>   int error;
> @@ -1573,6 +1567,12 @@ static int thaw_super_locked(struct super_block *sb)
>   return 0;
>  }
>  
> +/**
> + * thaw_super -- unlock filesystem
> + * @sb: the super to thaw
> + *
> + * Unlocks the filesystem and marks it writeable again after freeze_super().
> + */
>  int thaw_super(struct super_block *sb)
>  {
>   down_write(&sb->s_umount);
> -- 
> 2.16.3
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 2/4] iomap: iomap_dio_rw() handles all sync writes

2018-05-03 Thread Jan Kara
On Wed 02-05-18 14:27:37, Robert Dorr wrote:
> In the current implementation the first write to the location updates the
> metadata and must issue the flush.   In Windows SQL Server can avoid this
> behavior.   SQL Server can issue DeviceIoControl with SET_FILE_VALID_DATA
> and then SetEndOfFile.  The SetEndOfFile acquires space and saves
> metadata without requiring the actual write.   This allows us to quickly
> create a large file and the writes do not need the added flush.
> 
> Is this something that fallocate could accommodate to avoid having to
> write once (triggers flush for metadata) and then secondary writes can
> use FUA and avoid the flush?

Well, the question then is what do you see in the file if you read those
blocks before writing them or if the system crashes before writing the
data. As you describe the feature, you'd see there just the old block
contents which is a security issue (you could see for example old
/etc/passwd contents there). That's why we've refused such feature in the
past [1].

Honza

[1] https://www.spinics.net/lists/linux-ext4/msg31637.html

> -Original Message-
> From: Dave Chinner <da...@fromorbit.com> 
> Sent: Tuesday, May 1, 2018 9:46 PM
> To: Jan Kara <j...@suse.cz>
> Cc: linux-...@vger.kernel.org; linux-fsde...@vger.kernel.org; 
> linux-block@vger.kernel.org; h...@lst.de; Robert Dorr <rd...@microsoft.com>
> Subject: Re: [PATCH 2/4] iomap: iomap_dio_rw() handles all sync writes
> 
> On Sat, Apr 21, 2018 at 03:03:09PM +0200, Jan Kara wrote:
> > On Wed 18-04-18 14:08:26, Dave Chinner wrote:
> > > From: Dave Chinner <dchin...@redhat.com>
> > > 
> > > Currently iomap_dio_rw() only handles (data)sync write completions 
> > > for AIO. This means we can't optimised non-AIO IO to minimise device 
> > > flushes as we can't tell the caller whether a flush is required or 
> > > not.
> > > 
> > > To solve this problem and enable further optimisations, make 
> > > iomap_dio_rw responsible for data sync behaviour for all IO, not 
> > > just AIO.
> > > 
> > > In doing so, the sync operation is now accounted as part of the DIO 
> > > IO by inode_dio_end(), hence post-IO data stability updates will no 
> > > long race against operations that serialise via inode_dio_wait() 
> > > such as truncate or hole punch.
> > > 
> > > Signed-Off-By: Dave Chinner <dchin...@redhat.com>
> > > Reviewed-by: Christoph Hellwig <h...@lst.de>
> > 
> > Looks good to me. You can add:
> > 
> > Reviewed-by: Jan Kara <j...@suse.cz>
> 
> It looks good, but it's broken in a subtle, nasty way. :/
> 
> > > @@ -768,14 +776,8 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
> > >  static void iomap_dio_complete_work(struct work_struct *work)
> > >  {
> > >   struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
> > > - struct kiocb *iocb = dio->iocb;
> > > - bool is_write = (dio->flags & IOMAP_DIO_WRITE);
> > > - ssize_t ret;
> > >  
> > > - ret = iomap_dio_complete(dio);
> > > - if (is_write && ret > 0)
> > > - ret = generic_write_sync(iocb, ret);
> > > - iocb->ki_complete(iocb, ret, 0);
> > > + dio->iocb->ki_complete(dio->iocb, iomap_dio_complete(dio), 0);
> 
> This generates a use after free from KASAN from generic/016. it appears the 
> compiler orders the code so that it dereferences
> dio->iocb after iomap_dio_complete() has freed the dio structure
> (yay!).
> 
> I'll post a new version of the patchset now that I've got changes to
> 2 of the 3 remaining patches in it
> 
> Cheers,
> 
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: write call hangs in kernel space after virtio hot-remove

2018-05-03 Thread Jan Kara
em
and also make sure user sees the IO errors directly instead of just in the
kernel log. The question is how to make this happen in the least painful
way. I think we could intercept writes in grab_cache_page_write_begin()
(which however requires that function to return a proper error code and not
just NULL / non-NULL). And we should also intercept write faults to not
allow page dirtying via mmap - probably somewhere in do_shared_fault() and
do_wp_page(). I've added Jeff to CC since he's dealing with IO error
handling a lot these days. Jeff, what do you think?

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR

