Re: [PATCH 0/4 RFC] BDI lifetime fix

2017-01-31 Thread Jan Kara
On Thu 26-01-17 22:15:06, Dan Williams wrote:
> On Thu, Jan 26, 2017 at 9:45 AM, Jan Kara <j...@suse.cz> wrote:
> > Hello,
> >
> > this patch series attempts to solve the problems with the life time of a
> > backing_dev_info structure. Currently it lives inside request_queue 
> > structure
> > and thus it gets destroyed as soon as request queue goes away. However
> > the block device inode still stays around and thus inode_to_bdi() call on
> > that inode (e.g. from flusher worker) may happen after request queue has 
> > been
> > destroyed resulting in oops.
> >
> > This patch set tries to solve these problems by making backing_dev_info
> > independent structure referenced from block device inode. That makes sure
> > inode_to_bdi() cannot ever oops. The patches are lightly tested for now
> > (they boot, basic tests with adding & removing loop devices seem to do what
> > I'd expect them to do ;). If someone is able to reproduce crashes on bdi
> > when device goes away, please test these patches.
> 
> This survives a several runs of the libnvdimm unit tests which stress
> del_gendisk() and blk_cleanup_queue(). I'll keep testing since the
> failure was intermittent, but this is looking good.

Can I add your Tested-by tag?

Honza

-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


[PATCH 1/4] block: Unhash block device inodes on gendisk destruction

2017-01-31 Thread Jan Kara
Currently, block device inodes stay around after the corresponding gendisk
has died until memory reclaim finds them and frees them. Since we will
make the block device inode pin the bdi, we want to free the block device
inode as soon as the device goes away so that the bdi does not stay around
unnecessarily. Furthermore we need to avoid issues when a new device with
the same major,minor pair gets created, since reusing the bdi structure
would be rather difficult in this case.

Unhashing the block device inode on gendisk destruction nicely deals with
these problems. Once the last block device inode reference is dropped (which
may be directly in del_gendisk()), the inode gets evicted. Furthermore if
the major,minor pair gets reallocated, we are guaranteed to get a new
block device inode even if the old block device inode is not yet evicted, and
thus we avoid issues with possible reuse of the bdi.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/genhd.c  |  2 ++
 fs/block_dev.c | 15 +++
 include/linux/fs.h |  1 +
 3 files changed, 18 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index fcd6d4fae657..f2f22d0e8e14 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -648,6 +648,8 @@ void del_gendisk(struct gendisk *disk)
disk_part_iter_init(&piter, disk,
 DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
while ((part = disk_part_iter_next(&piter))) {
+   bdev_unhash_inode(MKDEV(disk->major,
+   disk->first_minor + part->partno));
invalidate_partition(disk, part->partno);
delete_partition(disk, part->partno);
}
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 5db5d1340d69..ed6a34be7a1e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -954,6 +954,21 @@ static int bdev_set(struct inode *inode, void *data)
 
 static LIST_HEAD(all_bdevs);
 
+/*
+ * If there is a bdev inode for this device, unhash it so that it gets evicted
+ * as soon as last inode reference is dropped.
+ */
+void bdev_unhash_inode(dev_t dev)
+{
+   struct inode *inode;
+
+   inode = ilookup5(blockdev_superblock, hash(dev), bdev_test, &dev);
+   if (inode) {
+   remove_inode_hash(inode);
+   iput(inode);
+   }
+}
+
 struct block_device *bdget(dev_t dev)
 {
struct block_device *bdev;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2ba074328894..702cb6c50194 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2342,6 +2342,7 @@ extern struct kmem_cache *names_cachep;
 #ifdef CONFIG_BLOCK
 extern int register_blkdev(unsigned int, const char *);
 extern void unregister_blkdev(unsigned int, const char *);
+extern void bdev_unhash_inode(dev_t dev);
 extern struct block_device *bdget(dev_t);
 extern struct block_device *bdgrab(struct block_device *bdev);
 extern void bd_set_size(struct block_device *, loff_t size);
-- 
2.10.2



Re: [PATCH 3/4] block: Dynamically allocate and refcount backing_dev_info

2017-02-01 Thread Jan Kara
On Wed 01-02-17 01:50:07, Christoph Hellwig wrote:
> On Tue, Jan 31, 2017 at 01:54:28PM +0100, Jan Kara wrote:
> > Instead of storing backing_dev_info inside struct request_queue,
> > allocate it dynamically, reference count it, and free it when the last
> > reference is dropped. Currently only request_queue holds the reference
> > but in the following patch we add other users referencing
> > backing_dev_info.
> 
> Do we really need the separate slab cache?  Otherwise this looks
> fine to me.

Yeah, probably it is not worth it. I'll remove it.

    Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 4/4] block: Make blk_get_backing_dev_info() safe without open bdev

2017-02-01 Thread Jan Kara
On Wed 01-02-17 01:53:20, Christoph Hellwig wrote:
> Looks fine:
> 
> Reviewed-by: Christoph Hellwig <h...@lst.de>
> 
> But can you also add another patch to kill off blk_get_backing_dev_info?
> The direct dereference is short and cleaner.  Additionally the bt_bdi
> field in XFS could go away, too.

OK, I'll do that. Another cleanup I was considering is to remove all other
embedded occurences of backing_dev_info and make the structure only
dynamically allocated. It would unify the handling of backing_dev_info and
allow us to make bdi_init(), bdi_destroy(), etc. static inside
mm/backing_dev.c. What do you think?

        Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 17/24] fuse: Convert to separately allocated bdi

2017-02-07 Thread Jan Kara
On Tue 07-02-17 10:16:58, Miklos Szeredi wrote:
> On Thu, Feb 2, 2017 at 6:34 PM, Jan Kara <j...@suse.cz> wrote:
> > Allocate struct backing_dev_info separately instead of embedding it
> > inside the superblock. This unifies handling of bdi among users.
> 
> Acked-by: Miklos Szeredi <mszer...@redhat.com>
> 
> A follow on patch could get rid of fc->bdi_initialized too (replace
> remaining uses with fc->sb).

Yeah, I was looking at that but was not 100% sure about it from a quick
look. I'll do that.

Honza
> >
> > CC: Miklos Szeredi <mik...@szeredi.hu>
> > CC: linux-fsde...@vger.kernel.org
> > Signed-off-by: Jan Kara <j...@suse.cz>
> > ---
> >  fs/fuse/dev.c|  8 
> >  fs/fuse/fuse_i.h |  3 ---
> >  fs/fuse/inode.c  | 42 +-
> >  3 files changed, 17 insertions(+), 36 deletions(-)
> >
> > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > index 70ea57c7b6bb..1912164d57e9 100644
> > --- a/fs/fuse/dev.c
> > +++ b/fs/fuse/dev.c
> > @@ -382,8 +382,8 @@ static void request_end(struct fuse_conn *fc, struct 
> > fuse_req *req)
> >
> > if (fc->num_background == fc->congestion_threshold &&
> > fc->connected && fc->bdi_initialized) {
> > -   clear_bdi_congested(&fc->bdi, BLK_RW_SYNC);
> > -   clear_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
> > +   clear_bdi_congested(fc->sb->s_bdi, BLK_RW_SYNC);
> > +   clear_bdi_congested(fc->sb->s_bdi, BLK_RW_ASYNC);
> > }
> > fc->num_background--;
> > fc->active_background--;
> > @@ -570,8 +570,8 @@ void fuse_request_send_background_locked(struct 
> > fuse_conn *fc,
> > fc->blocked = 1;
> > if (fc->num_background == fc->congestion_threshold &&
> > fc->bdi_initialized) {
> > -   set_bdi_congested(&fc->bdi, BLK_RW_SYNC);
> > -   set_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
> > +   set_bdi_congested(fc->sb->s_bdi, BLK_RW_SYNC);
> > +   set_bdi_congested(fc->sb->s_bdi, BLK_RW_ASYNC);
> > }
> > list_add_tail(&req->list, &fc->bg_queue);
> > flush_bg_queue(fc);
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 91307940c8ac..effab9e9607f 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -631,9 +631,6 @@ struct fuse_conn {
> > /** Negotiated minor version */
> > unsigned minor;
> >
> > -   /** Backing dev info */
> > -   struct backing_dev_info bdi;
> > -
> > /** Entry on the fuse_conn_list */
> > struct list_head entry;
> >
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 6fe6a88ecb4a..90bacbc87fb3 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -386,12 +386,6 @@ static void fuse_send_destroy(struct fuse_conn *fc)
> > }
> >  }
> >
> > -static void fuse_bdi_destroy(struct fuse_conn *fc)
> > -{
> > -   if (fc->bdi_initialized)
> > -   bdi_destroy(&fc->bdi);
> > -}
> > -
> >  static void fuse_put_super(struct super_block *sb)
> >  {
> > struct fuse_conn *fc = get_fuse_conn_super(sb);
> > @@ -403,7 +397,6 @@ static void fuse_put_super(struct super_block *sb)
> > list_del(&fc->entry);
> > fuse_ctl_remove_conn(fc);
> > mutex_unlock(&fuse_mutex);
> > -   fuse_bdi_destroy(fc);
> >
> > fuse_conn_put(fc);
> >  }
> > @@ -928,7 +921,8 @@ static void process_init_reply(struct fuse_conn *fc, 
> > struct fuse_req *req)
> > fc->no_flock = 1;
> > }
> >
> > -   fc->bdi.ra_pages = min(fc->bdi.ra_pages, ra_pages);
> > +   fc->sb->s_bdi->ra_pages =
> > +   min(fc->sb->s_bdi->ra_pages, ra_pages);
> > fc->minor = arg->minor;
> > fc->max_write = arg->minor < 5 ? 4096 : arg->max_write;
> > fc->max_write = max_t(unsigned, 4096, fc->max_write);
> > @@ -944,7 +938,7 @@ static void fuse_send_init(struct fuse_conn *fc, struct 
> > fuse_req *req)
> >
> > arg->major = FUSE_KERNEL_VERSION;
> > arg->minor 

Re: [PATCH 0/4 v2] BDI lifetime fix

2017-02-07 Thread Jan Kara
On Mon 06-02-17 13:26:53, Thiago Jung Bauermann wrote:
> Am Montag, 6. Februar 2017, 12:48:42 BRST schrieb Thiago Jung Bauermann:
> > 216 static inline void wb_get(struct bdi_writeback *wb)
> > 217 {
> > 218 if (wb != &wb->bdi->wb)
> > 219 percpu_ref_get(&wb->refcnt);
> > 220 }
> > 
> > So it looks like wb->bdi is NULL.
> 
> Sorry, looking a little deeper, it's actually wb which is NULL:
> 
> ./include/linux/backing-dev.h:
> 371 return inode->i_wb;
> 0xc037999c <+76>:    ld      r31,256(r29)
> 
> ./include/linux/backing-dev-defs.h:
> 218 if (wb != &wb->bdi->wb)
> 0xc03799a0 <+80>:    ld      r9,0(r31)
> 0xc03799a4 <+84>:    addi    r9,r9,88
> 0xc03799a8 <+88>:    cmpld   cr7,r31,r9
> 0xc03799ac <+92>:    beq     cr7,0xc03799e0 
> <locked_inode_to_wb_and_lock_list+144>
> 
> We can see above that inode->i_wb is in r31, and the machine crashed at 
> 0xc03799a0 so it was trying to dereference wb and crashed.
> r31 is NULL in the crash information.

Thanks for report and the analysis. After some looking into the code I see
where the problem is. Writeback code assumes inode->i_wb can never become
invalid once it is set however we still call inode_detach_wb() from
__blkdev_put(). So in a way this is a different problem but closely
related.

It seems to me that instead of calling inode_detach_wb() in __blkdev_put()
we may just switch blkdev inode to bdi->wb (it is now guaranteed to stay
around). That way bdi_unregister() can complete (destroying all writeback
structures except for bdi->wb) while block device inode can still live with
a valid i_wb structure.
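
For concreteness, a rough sketch of that switch (a hypothetical helper, not an
actual patch; the synchronization against lockless i_wb users discussed later
in the thread is deliberately left out):

/*
 * Hypothetical sketch: instead of inode_detach_wb() in __blkdev_put(),
 * repoint the blockdev inode at the bdi's embedded root wb, which this
 * series keeps alive for as long as the inode itself. Real code would
 * additionally need to synchronize against unlocked i_wb users
 * (RCU grace period, I_WB_SWITCH handling).
 */
static void bdev_switch_inode_to_root_wb(struct inode *inode)
{
	struct bdi_writeback *old_wb = inode->i_wb;

	if (!old_wb || old_wb == &old_wb->bdi->wb)
		return;		/* never attached, or already on the root wb */

	spin_lock(&inode->i_lock);
	inode->i_wb = &old_wb->bdi->wb;	/* the root wb is not refcounted */
	spin_unlock(&inode->i_lock);

	wb_put(old_wb);			/* drop the reference on the cgroup wb */
}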

CCed Tejun who is more familiar with this code to verify my thoughts.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 3/4] block: Dynamically allocate and refcount backing_dev_info

2017-02-02 Thread Jan Kara
On Wed 01-02-17 14:45:20, Jens Axboe wrote:
> On 02/01/2017 04:22 AM, Jan Kara wrote:
> > On Wed 01-02-17 01:50:07, Christoph Hellwig wrote:
> >> On Tue, Jan 31, 2017 at 01:54:28PM +0100, Jan Kara wrote:
> >>> Instead of storing backing_dev_info inside struct request_queue,
> >>> allocate it dynamically, reference count it, and free it when the last
> >>> reference is dropped. Currently only request_queue holds the reference
> >>> but in the following patch we add other users referencing
> >>> backing_dev_info.
> >>
> >> Do we really need the separate slab cache?  Otherwise this looks
> >> fine to me.
> > 
> > Yeah, probably it is not worth it. I'll remove it.
> 
> I agree on that, it should not be a hot path. Will you respin the series
> after making this change? Would be great to get this queued up.

Yes, will send it later today. I was just waiting to see whether anyone else
had more comments on the series.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


[PATCH 1/5] block: Unhash block device inodes on gendisk destruction

2017-02-02 Thread Jan Kara
Currently, block device inodes stay around after the corresponding gendisk
has died until memory reclaim finds them and frees them. Since we will
make the block device inode pin the bdi, we want to free the block device
inode as soon as the device goes away so that the bdi does not stay around
unnecessarily. Furthermore we need to avoid issues when a new device with
the same major,minor pair gets created, since reusing the bdi structure
would be rather difficult in this case.

Unhashing the block device inode on gendisk destruction nicely deals with
these problems. Once the last block device inode reference is dropped (which
may be directly in del_gendisk()), the inode gets evicted. Furthermore if
the major,minor pair gets reallocated, we are guaranteed to get a new
block device inode even if the old block device inode is not yet evicted, and
thus we avoid issues with possible reuse of the bdi.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/genhd.c  |  2 ++
 fs/block_dev.c | 15 +++
 include/linux/fs.h |  1 +
 3 files changed, 18 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index fcd6d4fae657..f2f22d0e8e14 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -648,6 +648,8 @@ void del_gendisk(struct gendisk *disk)
disk_part_iter_init(&piter, disk,
 DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
while ((part = disk_part_iter_next(&piter))) {
+   bdev_unhash_inode(MKDEV(disk->major,
+   disk->first_minor + part->partno));
invalidate_partition(disk, part->partno);
delete_partition(disk, part->partno);
}
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 5db5d1340d69..ed6a34be7a1e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -954,6 +954,21 @@ static int bdev_set(struct inode *inode, void *data)
 
 static LIST_HEAD(all_bdevs);
 
+/*
+ * If there is a bdev inode for this device, unhash it so that it gets evicted
+ * as soon as last inode reference is dropped.
+ */
+void bdev_unhash_inode(dev_t dev)
+{
+   struct inode *inode;
+
+   inode = ilookup5(blockdev_superblock, hash(dev), bdev_test, &dev);
+   if (inode) {
+   remove_inode_hash(inode);
+   iput(inode);
+   }
+}
+
 struct block_device *bdget(dev_t dev)
 {
struct block_device *bdev;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2ba074328894..702cb6c50194 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2342,6 +2342,7 @@ extern struct kmem_cache *names_cachep;
 #ifdef CONFIG_BLOCK
 extern int register_blkdev(unsigned int, const char *);
 extern void unregister_blkdev(unsigned int, const char *);
+extern void bdev_unhash_inode(dev_t dev);
 extern struct block_device *bdget(dev_t);
 extern struct block_device *bdgrab(struct block_device *bdev);
 extern void bd_set_size(struct block_device *, loff_t size);
-- 
2.10.2



[PATCH 0/5 v3] BDI lifetime fix

2017-02-02 Thread Jan Kara
Hello,

this is the third version of the patch series that attempts to solve the
problems with the lifetime of a backing_dev_info structure. Currently it lives
inside the request_queue structure and thus it gets destroyed as soon as the
request queue goes away. However the block device inode still stays around and
thus an inode_to_bdi() call on that inode (e.g. from the flusher worker) may
happen after the request queue has been destroyed, resulting in an oops.

This patch set tries to solve these problems by making backing_dev_info an
independent structure referenced from the block device inode. That makes sure
inode_to_bdi() cannot ever oops. I gave the patches some basic testing in KVM
and on a real machine, and Dan ran them with the libnvdimm test suite which
was previously triggering the oops; things look good. So the patches should be
reasonably healthy.
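
In code terms, the core of the series is small. The following is a simplified
sketch only, condensed from patches 3 and 4 below; the helper names
bdev_attach_bdi()/bdev_detach_bdi() are made up here, the real changes live in
__blkdev_get() and bdev_evict_inode():

/* Patch 3: backing_dev_info becomes a separately allocated, kref-counted
 * object with bdi_get()/bdi_put() helpers. */
static inline struct backing_dev_info *bdi_get(struct backing_dev_info *bdi)
{
	kref_get(&bdi->refcnt);
	return bdi;
}

/* Patch 4: the first open of a block device pins the queue's bdi ... */
static void bdev_attach_bdi(struct block_device *bdev, struct gendisk *disk)
{
	if (bdev->bd_bdi == &noop_backing_dev_info)
		bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
}

/* ... and only eviction of the bdev inode drops the reference, so
 * inode_to_bdi() stays valid even after blk_cleanup_queue() has run. */
static void bdev_detach_bdi(struct block_device *bdev)
{
	if (bdev->bd_bdi != &noop_backing_dev_info)
		bdi_put(bdev->bd_bdi);
}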

Changes since v2:
* Added Reviewed-by tags
* Removed slab cache for backing_dev_info
* Added patch to remove blk_get_backing_dev_info()

Changes since v1:
* Use kref instead of atomic_t for refcount
* Get rid of free_on_put flag

Honza


[PATCH 4/5] block: Make blk_get_backing_dev_info() safe without open bdev

2017-02-02 Thread Jan Kara
Currently blk_get_backing_dev_info() is not safe to be called when the
block device is not open as bdev->bd_disk is NULL in that case. However
inode_to_bdi() uses this function and may be called from the flusher
worker or other writeback related functions without the bdev being open,
which leads to crashes such as:

[113031.075540] Unable to handle kernel paging request for data at address 
0x
[113031.075614] Faulting instruction address: 0xc03692e0
0:mon> t
[c000fb65f900] c036cb6c writeback_sb_inodes+0x30c/0x590
[c000fb65fa10] c036ced4 __writeback_inodes_wb+0xe4/0x150
[c000fb65fa70] c036d33c wb_writeback+0x30c/0x450
[c000fb65fb40] c036e198 wb_workfn+0x268/0x580
[c000fb65fc50] c00f3470 process_one_work+0x1e0/0x590
[c000fb65fce0] c00f38c8 worker_thread+0xa8/0x660
[c000fb65fd80] c00fc4b0 kthread+0x110/0x130
[c000fb65fe30] c00098f0 ret_from_kernel_thread+0x5c/0x6c
--- Exception: 0  at 
0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c000fb65f620]
pc: c03692e0: locked_inode_to_wb_and_lock_list+0x50/0x290
lr: c036cb6c: writeback_sb_inodes+0x30c/0x590
sp: c000fb65f8a0
   msr: 80010280b033
   dar: 0
 dsisr: 4000
  current = 0xc001d69be400
  paca= 0xc348   softe: 0irq_happened: 0x01
pid   = 18689, comm = kworker/u16:10

Fix the problem by grabbing a reference to the bdi on the first open of the
block device and dropping the reference only once the inode is evicted from
memory. This pins the struct backing_dev_info in memory and thus fixes the
crashes.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Reported-and-tested-by: Dan Williams <dan.j.willi...@intel.com>
Reported-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/blk-core.c   | 8 +++-
 fs/block_dev.c | 7 +++
 include/linux/fs.h | 1 +
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 545ccb4b96f3..84fabb51714a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -109,14 +109,12 @@ void blk_queue_congestion_threshold(struct request_queue 
*q)
  * @bdev:  device
  *
  * Locates the passed device's request queue and returns the address of its
- * backing_dev_info.  This function can only be called if @bdev is opened
- * and the return value is never NULL.
+ * backing_dev_info. The return value is never NULL however we may return
+ * &noop_backing_dev_info if the bdev is not currently open.
  */
 struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev)
 {
-   struct request_queue *q = bdev_get_queue(bdev);
-
-   return q->backing_dev_info;
+   return bdev->bd_bdi;
 }
 EXPORT_SYMBOL(blk_get_backing_dev_info);
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index ed6a34be7a1e..601b71b76d7f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -884,6 +884,8 @@ static void bdev_evict_inode(struct inode *inode)
spin_lock(&bdev_lock);
list_del_init(&bdev->bd_list);
spin_unlock(&bdev_lock);
+   if (bdev->bd_bdi != &noop_backing_dev_info)
+   bdi_put(bdev->bd_bdi);
 }
 
 static const struct super_operations bdev_sops = {
@@ -986,6 +988,7 @@ struct block_device *bdget(dev_t dev)
bdev->bd_contains = NULL;
bdev->bd_super = NULL;
bdev->bd_inode = inode;
+   bdev->bd_bdi = &noop_backing_dev_info;
bdev->bd_block_size = (1 << inode->i_blkbits);
bdev->bd_part_count = 0;
bdev->bd_invalidated = 0;
@@ -1542,6 +1545,8 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
bdev->bd_disk = disk;
bdev->bd_queue = disk->queue;
bdev->bd_contains = bdev;
+   if (bdev->bd_bdi == &noop_backing_dev_info)
+   bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
 
if (!partno) {
ret = -ENXIO;
@@ -1637,6 +1642,8 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
bdev->bd_disk = NULL;
bdev->bd_part = NULL;
bdev->bd_queue = NULL;
+   bdi_put(bdev->bd_bdi);
+   bdev->bd_bdi = &noop_backing_dev_info;
if (bdev != bdev->bd_contains)
__blkdev_put(bdev->bd_contains, mode, 1);
bdev->bd_contains = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 702cb6c50194..c930cbc19342 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -423,6 +423,7 @@ struct block_device {
int bd_invalidated;
struct gendisk *bd_disk;
struct request_queue *  bd_queue;
+   struct backing_dev_info *bd_bdi;
struct list_head        bd_list;
/*
  

[PATCH 5/5] block: Get rid of blk_get_backing_dev_info()

2017-02-02 Thread Jan Kara
blk_get_backing_dev_info() is now a simple dereference. Remove that
function and simplify some code around that.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/blk-core.c| 14 --
 block/compat_ioctl.c|  7 ++-
 block/ioctl.c   |  7 ++-
 fs/btrfs/disk-io.c  |  2 +-
 fs/btrfs/volumes.c  |  2 +-
 fs/xfs/xfs_buf.c|  3 +--
 fs/xfs/xfs_buf.h|  1 -
 include/linux/backing-dev.h |  2 +-
 include/linux/blkdev.h  |  1 -
 9 files changed, 8 insertions(+), 31 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 84fabb51714a..47104f6a398b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -104,20 +104,6 @@ void blk_queue_congestion_threshold(struct request_queue 
*q)
q->nr_congestion_off = nr;
 }
 
-/**
- * blk_get_backing_dev_info - get the address of a queue's backing_dev_info
- * @bdev:  device
- *
- * Locates the passed device's request queue and returns the address of its
- * backing_dev_info. The return value is never NULL however we may return
- * &noop_backing_dev_info if the bdev is not currently open.
- */
-struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev)
-{
-   return bdev->bd_bdi;
-}
-EXPORT_SYMBOL(blk_get_backing_dev_info);
-
 void blk_rq_init(struct request_queue *q, struct request *rq)
 {
memset(rq, 0, sizeof(*rq));
diff --git a/block/compat_ioctl.c b/block/compat_ioctl.c
index 556826ac7cb4..570021a0dc1c 100644
--- a/block/compat_ioctl.c
+++ b/block/compat_ioctl.c
@@ -661,7 +661,6 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, 
unsigned long arg)
struct block_device *bdev = inode->i_bdev;
struct gendisk *disk = bdev->bd_disk;
fmode_t mode = file->f_mode;
-   struct backing_dev_info *bdi;
loff_t size;
unsigned int max_sectors;
 
@@ -708,9 +707,8 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, 
unsigned long arg)
case BLKFRAGET:
if (!arg)
return -EINVAL;
-   bdi = blk_get_backing_dev_info(bdev);
return compat_put_long(arg,
-  (bdi->ra_pages * PAGE_SIZE) / 512);
+  (bdev->bd_bdi->ra_pages * PAGE_SIZE) / 512);
case BLKROGET: /* compatible */
return compat_put_int(arg, bdev_read_only(bdev) != 0);
case BLKBSZGET_32: /* get the logical block size (cf. BLKSSZGET) */
@@ -728,8 +726,7 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, 
unsigned long arg)
case BLKFRASET:
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
-   bdi = blk_get_backing_dev_info(bdev);
-   bdi->ra_pages = (arg * 512) / PAGE_SIZE;
+   bdev->bd_bdi->ra_pages = (arg * 512) / PAGE_SIZE;
return 0;
case BLKGETSIZE:
size = i_size_read(bdev->bd_inode);
diff --git a/block/ioctl.c b/block/ioctl.c
index be7f4de3eb3c..7b88820b93d9 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -505,7 +505,6 @@ static int blkdev_bszset(struct block_device *bdev, fmode_t 
mode,
 int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
unsigned long arg)
 {
-   struct backing_dev_info *bdi;
void __user *argp = (void __user *)arg;
loff_t size;
unsigned int max_sectors;
@@ -532,8 +531,7 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, 
unsigned cmd,
case BLKFRAGET:
if (!arg)
return -EINVAL;
-   bdi = blk_get_backing_dev_info(bdev);
-   return put_long(arg, (bdi->ra_pages * PAGE_SIZE) / 512);
+   return put_long(arg, (bdev->bd_bdi->ra_pages*PAGE_SIZE) / 512);
case BLKROGET:
return put_int(arg, bdev_read_only(bdev) != 0);
case BLKBSZGET: /* get block device soft block size (cf. BLKSSZGET) */
@@ -560,8 +558,7 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, 
unsigned cmd,
case BLKFRASET:
if(!capable(CAP_SYS_ADMIN))
return -EACCES;
-   bdi = blk_get_backing_dev_info(bdev);
-   bdi->ra_pages = (arg * 512) / PAGE_SIZE;
+   bdev->bd_bdi->ra_pages = (arg * 512) / PAGE_SIZE;
return 0;
case BLKBSZSET:
return blkdev_bszset(bdev, mode, argp);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 18004169552c..37a31b12bb0c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1800,7 +1800,7 @@ static int btrfs_congested_fn(void *congested_data, int 
bdi_bits)
list_for_each_entry_rcu(device, &info->fs_devices->devices, dev_list) {
if (!device->bdev)
continue;
-   bdi = blk_get_backi

[PATCH 2/5] block: Use pointer to backing_dev_info from request_queue

2017-02-02 Thread Jan Kara
We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add a pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/blk-cgroup.c |  6 +++---
 block/blk-core.c   | 27 ++-
 block/blk-integrity.c  |  4 ++--
 block/blk-settings.c   |  2 +-
 block/blk-sysfs.c  |  8 
 block/blk-wbt.c|  8 
 block/genhd.c  |  2 +-
 drivers/block/aoe/aoeblk.c |  4 ++--
 drivers/block/drbd/drbd_main.c |  6 +++---
 drivers/block/drbd/drbd_nl.c   | 12 +++-
 drivers/block/drbd/drbd_proc.c |  2 +-
 drivers/block/drbd/drbd_req.c  |  2 +-
 drivers/block/pktcdvd.c|  4 ++--
 drivers/block/rbd.c|  2 +-
 drivers/md/bcache/request.c| 10 +-
 drivers/md/bcache/super.c  |  8 
 drivers/md/dm-cache-target.c   |  2 +-
 drivers/md/dm-era-target.c |  2 +-
 drivers/md/dm-table.c  |  2 +-
 drivers/md/dm-thin.c   |  2 +-
 drivers/md/dm.c|  6 +++---
 drivers/md/linear.c|  2 +-
 drivers/md/md.c|  6 +++---
 drivers/md/multipath.c |  2 +-
 drivers/md/raid0.c |  6 +++---
 drivers/md/raid1.c |  4 ++--
 drivers/md/raid10.c| 10 +-
 drivers/md/raid5.c | 12 ++--
 fs/gfs2/ops_fstype.c   |  2 +-
 fs/nilfs2/super.c  |  2 +-
 fs/super.c |  2 +-
 include/linux/blkdev.h |  3 ++-
 mm/page-writeback.c|  4 ++--
 33 files changed, 90 insertions(+), 86 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 8ba0af780e88..d673a69b61b4 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -184,7 +184,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
goto err_free_blkg;
}
 
-   wb_congested = wb_congested_get_create(&q->backing_dev_info,
+   wb_congested = wb_congested_get_create(q->backing_dev_info,
   blkcg->css.id,
   GFP_NOWAIT | __GFP_NOWARN);
if (!wb_congested) {
@@ -469,8 +469,8 @@ static int blkcg_reset_stats(struct cgroup_subsys_state 
*css,
 const char *blkg_dev_name(struct blkcg_gq *blkg)
 {
/* some drivers (floppy) instantiate a queue w/o disk registered */
-   if (blkg->q->backing_dev_info.dev)
-   return dev_name(blkg->q->backing_dev_info.dev);
+   if (blkg->q->backing_dev_info->dev)
+   return dev_name(blkg->q->backing_dev_info->dev);
return NULL;
 }
 EXPORT_SYMBOL_GPL(blkg_dev_name);
diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c58b64..a9ff1b919ae7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -74,7 +74,7 @@ static void blk_clear_congested(struct request_list *rl, int 
sync)
 * flip its congestion state for events on other blkcgs.
 */
if (rl == &rl->q->root_rl)
-   clear_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
+   clear_wb_congested(rl->q->backing_dev_info->wb.congested, sync);
 #endif
 }
 
@@ -85,7 +85,7 @@ static void blk_set_congested(struct request_list *rl, int 
sync)
 #else
/* see blk_clear_congested() */
if (rl == &rl->q->root_rl)
-   set_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
+   set_wb_congested(rl->q->backing_dev_info->wb.congested, sync);
 #endif
 }
 
@@ -116,7 +116,7 @@ struct backing_dev_info *blk_get_backing_dev_info(struct 
block_device *bdev)
 {
struct request_queue *q = bdev_get_queue(bdev);
 
-   return &q->backing_dev_info;
+   return q->backing_dev_info;
 }
 EXPORT_SYMBOL(blk_get_backing_dev_info);
 
@@ -584,7 +584,7 @@ void blk_cleanup_queue(struct request_queue *q)
blk_flush_integrity();
 
/* @q won't process any more request, flush async actions */
-   del_timer_sync(&q->backing_dev_info.laptop_mode_wb_timer);
+   del_timer_sync(>backing_dev_info->laptop_mode_wb_timer);
blk_sync_queue(q);
 
if (q->mq_ops)
@@ -596,7 +596,7 @@ void blk_cleanup_queue(struct request_queue *q)
q->queue_lock = &q->__queue_lock;
spin_unlock_irq(lock);
 
-   bdi_unregister(&q->backing_dev_info);
+   bdi_unregister(q->backing_dev_info);
 
/* @q is and will stay empty, shutdown and put */
blk_put_queue(q);
@@ -708,17 +708,18 @@ struct request_queue *blk_alloc_queue_node(gfp_t 
gfp_mask, int node_id)
if (!q->bio_split)
goto fail_id;
 
-   q->backing_dev_info.ra_pages =
+   q->backing_

[PATCH 3/5] block: Dynamically allocate and refcount backing_dev_info

2017-02-02 Thread Jan Kara
Instead of storing backing_dev_info inside struct request_queue,
allocate it dynamically, reference count it, and free it when the last
reference is dropped. Currently only request_queue holds the reference
but in the following patch we add other users referencing
backing_dev_info.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/blk-core.c | 12 +---
 block/blk-sysfs.c|  2 +-
 include/linux/backing-dev-defs.h |  2 ++
 include/linux/backing-dev.h  | 10 +-
 include/linux/blkdev.h   |  1 -
 mm/backing-dev.c | 34 +-
 6 files changed, 50 insertions(+), 11 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a9ff1b919ae7..545ccb4b96f3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -693,7 +693,6 @@ static void blk_rq_timed_out_timer(unsigned long data)
 struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 {
struct request_queue *q;
-   int err;
 
q = kmem_cache_alloc_node(blk_requestq_cachep,
gfp_mask | __GFP_ZERO, node_id);
@@ -708,17 +707,16 @@ struct request_queue *blk_alloc_queue_node(gfp_t 
gfp_mask, int node_id)
if (!q->bio_split)
goto fail_id;
 
-   q->backing_dev_info = &q->_backing_dev_info;
+   q->backing_dev_info = bdi_alloc_node(gfp_mask, node_id);
+   if (!q->backing_dev_info)
+   goto fail_split;
+
q->backing_dev_info->ra_pages =
(VM_MAX_READAHEAD * 1024) / PAGE_SIZE;
q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK;
q->backing_dev_info->name = "block";
q->node = node_id;
 
-   err = bdi_init(q->backing_dev_info);
-   if (err)
-   goto fail_split;
-
setup_timer(&q->backing_dev_info->laptop_mode_wb_timer,
laptop_mode_timer_fn, (unsigned long) q);
setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
@@ -769,7 +767,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
 fail_ref:
percpu_ref_exit(&q->q_usage_counter);
 fail_bdi:
-   bdi_destroy(q->backing_dev_info);
+   bdi_put(q->backing_dev_info);
 fail_split:
bioset_free(q->bio_split);
 fail_id:
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 64fb54c6b41c..4cbaa519ec2d 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -799,7 +799,7 @@ static void blk_release_queue(struct kobject *kobj)
container_of(kobj, struct request_queue, kobj);
 
wbt_exit(q);
-   bdi_exit(q->backing_dev_info);
+   bdi_put(q->backing_dev_info);
blkcg_exit_queue(q);
 
if (q->elevator) {
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index e850e76acaaf..ad955817916d 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include <linux/kref.h>
 
 struct page;
 struct device;
@@ -144,6 +145,7 @@ struct backing_dev_info {
 
char *name;
 
+   struct kref refcnt; /* Reference counter for the structure */
unsigned int capabilities; /* Device capabilities */
unsigned int min_ratio;
unsigned int max_ratio, max_prop_frac;
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 43b93a947e61..efb6ca992d05 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -18,7 +18,14 @@
 #include 
 
 int __must_check bdi_init(struct backing_dev_info *bdi);
-void bdi_exit(struct backing_dev_info *bdi);
+
+static inline struct backing_dev_info *bdi_get(struct backing_dev_info *bdi)
+{
+   kref_get(&bdi->refcnt);
+   return bdi;
+}
+
+void bdi_put(struct backing_dev_info *bdi);
 
 __printf(3, 4)
 int bdi_register(struct backing_dev_info *bdi, struct device *parent,
@@ -29,6 +36,7 @@ void bdi_unregister(struct backing_dev_info *bdi);
 
 int __must_check bdi_setup_and_register(struct backing_dev_info *, char *);
 void bdi_destroy(struct backing_dev_info *bdi);
+struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, int node_id);
 
 void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
bool range_cyclic, enum wb_reason reason);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 069e4a102a73..de85701cc699 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -433,7 +433,6 @@ struct request_queue {
struct delayed_work delay_work;
 
struct backing_dev_info *backing_dev_info;
-   struct backing_dev_info _backing_dev_info;
 
/*
 * The queue owner gets to use this for whatever they like.
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 3bfed5ab2475..28ce6cf7b2ff 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -237,6 +237,7 @@

Re: [PATCH 0/4 v2] BDI lifetime fix

2017-02-08 Thread Jan Kara
On Tue 07-02-17 12:21:01, Tejun Heo wrote:
> Hello,
> 
> On Tue, Feb 07, 2017 at 01:33:31PM +0100, Jan Kara wrote:
> > > We can see above that inode->i_wb is in r31, and the machine crashed at 
> > > 0xc03799a0 so it was trying to dereference wb and crashed.
> > > r31 is NULL in the crash information.
> > 
> > Thanks for report and the analysis. After some looking into the code I see
> > where the problem is. Writeback code assumes inode->i_wb can never become
> > invalid once it is set however we still call inode_detach_wb() from
> > __blkdev_put(). So in a way this is a different problem but closely
> > related.
> 
> Heh, it feels like we're chasing our own tails.

Pretty much. I went through the history of bdi registration and
unregistration to understand various constraints and various different
reasons keep pushing that around and always something breaks...

> > It seems to me that instead of calling inode_detach_wb() in __blkdev_put()
> > we may just switch blkdev inode to bdi->wb (it is now guaranteed to stay
> > around). That way bdi_unregister() can complete (destroying all writeback
> > structures except for bdi->wb) while block device inode can still live with
> > a valid i_wb structure.
> 
> So, the problem there would be synchronizing get_wb against the
> transition.  We can do that and inode_switch_wbs_work_fn() already
> does it but it is a bit nasty.

Yeah, I have prototyped that and it is relatively simple although we have
to use synchronize_rcu() to be sure unlocked users of i_wb are done before
switching and that is somewhat ugly. So I'm looking for ways to avoid the
switching as well. Especially since from high-level POV the switching
should not be necessary. Everything is going away and there is no real work
to be done. Just we may be unlucky enough that e.g. flusher is looking
whether there's some work to do and we remove stuff under its hands. So
switching seems like a bit of an overkill.

> I'm getting a bit confused here, so the reason we added
> inode_detach_wb() in __blkdev_put() was because the root wb might go
> away because it's embedded in the bdi which is embedded in the
> request_queue which is put and may be released by put_disk().
> 
> Are you saying that we changed the behavior so that bdi->wb stays
> around?  If so, we can just remove the inode_detach_wb() call, no?

Yes, my patches (currently in linux-block) make bdi->wb stay around as long
as the block device inode. However things are complicated by the fact that
these days bdev_inode->i_wb may be pointing even to non-root wb_writeback
structure. If that happens and we don't call inode_detach_wb(),
bdi_unregister() will block waiting for reference count on wb_writeback to
drop to zero which happens only once bdev inode is evicted from inode cache
which may be far far in the future.

Now I think we can move bdi_unregister() into del_gendisk() (where it IMHO
belongs anyway as a counterpart to device_add_disk() in which we call
bdi_register()) and shutdown the block device inode there before calling
bdi_unregister(). But I'm still figuring out whether it will not break
something else because the code has lots of interactions...
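
Roughly like this (illustrative sketch only, helper name made up, partition
inodes and error paths omitted):

/*
 * Sketch of the ordering proposed above, not an actual patch: shut down the
 * block device inode in del_gendisk() and only then unregister the bdi there,
 * making bdi_unregister() the counterpart of the bdi_register() done in
 * device_add_disk() instead of leaving it to blk_cleanup_queue().
 */
static void example_del_gendisk_bdi_teardown(struct gendisk *disk)
{
	/* unhash the whole-disk bdev inode so it gets evicted promptly */
	bdev_unhash_inode(MKDEV(disk->major, disk->first_minor));

	/* only now tear down the writeback structures for the device */
	bdi_unregister(disk->queue->backing_dev_info);
}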

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 0/4 v2] BDI lifetime fix

2017-02-08 Thread Jan Kara
On Wed 08-02-17 08:51:42, Jan Kara wrote:
> On Tue 07-02-17 12:21:01, Tejun Heo wrote:
> > Hello,
> > 
> > On Tue, Feb 07, 2017 at 01:33:31PM +0100, Jan Kara wrote:
> > > > We can see above that inode->i_wb is in r31, and the machine crashed at 
> > > > 0xc03799a0 so it was trying to dereference wb and crashed.
> > > > r31 is NULL in the crash information.
> > > 
> > > Thanks for report and the analysis. After some looking into the code I see
> > > where the problem is. Writeback code assumes inode->i_wb can never become
> > > invalid once it is set however we still call inode_detach_wb() from
> > > __blkdev_put(). So in a way this is a different problem but closely
> > > related.
> > 
> > Heh, it feels like we're chasing our own tails.
> 
> Pretty much. I went through the history of bdi registration and
> unregistration to understand various constraints and various different
> reasons keep pushing that around and always something breaks...
> 
> > > It seems to me that instead of calling inode_detach_wb() in __blkdev_put()
> > > we may just switch blkdev inode to bdi->wb (it is now guaranteed to stay
> > > around). That way bdi_unregister() can complete (destroying all writeback
> > > structures except for bdi->wb) while block device inode can still live 
> > > with
> > > a valid i_wb structure.
> > 
> > So, the problem there would be synchronizing get_wb against the
> > transition.  We can do that and inode_switch_wbs_work_fn() already
> > does it but it is a bit nasty.
> 
> Yeah, I have prototyped that and it is relatively simple although we have
> to use synchronize_rcu() to be sure unlocked users of i_wb are done before
> switching and that is somewhat ugly. So I'm looking for ways to avoid the
> switching as well. Especially since from high-level POV the switching
> should not be necessary. Everything is going away and there is no real work
> to be done. Just we may be unlucky enough that e.g. flusher is looking
> whether there's some work to do and we remove stuff under its hands. So
> switching seems like a bit of an overkill.
> 
> > I'm getting a bit confused here, so the reason we added
> > inode_detach_wb() in __blkdev_put() was because the root wb might go
> > away because it's embedded in the bdi which is embedded in the
> > request_queue which is put and may be released by put_disk().
> > 
> > Are you saying that we changed the behavior so that bdi->wb stays
> > around?  If so, we can just remove the inode_detach_wb() call, no?
> 
> Yes, my patches (currently in linux-block) make bdi->wb stay around as long
> as the block device inode. However things are complicated by the fact that
> these days bdev_inode->i_wb may be pointing even to non-root wb_writeback
> structure. If that happens and we don't call inode_detach_wb(),
> bdi_unregister() will block waiting for reference count on wb_writeback to
> drop to zero which happens only once bdev inode is evicted from inode cache
> which may be far far in the future.
> 
> Now I think we can move bdi_unregister() into del_gendisk() (where it IMHO
> belongs anyway as a counterpart to device_add_disk() in which we call
> bdi_register()) and shutdown the block device inode there before calling
> bdi_unregister(). But I'm still figuring out whether it will not break
> something else because the code has lots of interactions...

More news from device shutdown world ;): I was looking more into how device
shutdown works. I was wondering what happens when a device gets hot-removed
and how we shut down stuff. If I tracked the callback maze correctly, when
we remove a scsi disk, we do __scsi_remove_device() -> device_del() ->
bus_remove_device() -> device_release_driver() -> sd_remove() ->
del_gendisk(). We also have __scsi_remove_device() -> blk_cleanup_queue()
-> bdi_unregister() shortly after the previous happening.

This ordering seems to be the real culprit of the bdi name reuse problems
Omar has reported? Same as described in commit 6cd18e711dd8 for MD BTW and
Dan's patch could be IMHO replaced by a move of bdi_unregister() from
blk_cleanup_queue() to del_gendisk() where it logically belongs as a
counterpart of device_add_disk(). I'll test that.


__scsi_remove_device() is also called when the device was hot-removed. At
that point the bdev can still be open and in active use and its i_wb can
point to some non-root wb_writeback struct. Thus bdi_unregister() will
block waiting for that wb_writeback to get released and thus SCSI device
removal will block basically indefinitely (at least until fs on top of bdev
gets unmounted). I believe this is a bug and __scsi_remove_device() is
expected to finish regardless of upper layers still using the bdev. So to
fix this I don't think we can really avoid the switching of bdev inode from
non-root wb_writeback structure to bdi->wb.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 22/24] ubifs: Convert to separately allocated bdi

2017-02-03 Thread Jan Kara
On Thu 02-02-17 21:34:32, Richard Weinberger wrote:
> Jan,
> 
> Am 02.02.2017 um 18:34 schrieb Jan Kara:
> > Allocate struct backing_dev_info separately instead of embedding it
> > inside the superblock. This unifies handling of bdi among users.
> > 
> > CC: Richard Weinberger <rich...@nod.at>
> > CC: Artem Bityutskiy <dedeki...@gmail.com>
> > CC: Adrian Hunter <adrian.hun...@intel.com>
> > CC: linux-...@lists.infradead.org
> > Signed-off-by: Jan Kara <j...@suse.cz>
> 
> Is this series available at some git tree, please?

I've pushed it out to:

git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git bdi

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 04/24] fs: Provide infrastructure for dynamic BDIs in filesystems

2017-02-03 Thread Jan Kara
On Thu 02-02-17 11:28:27, Liu Bo wrote:
> Hi,
> 
> On Thu, Feb 02, 2017 at 06:34:02PM +0100, Jan Kara wrote:
> > Provide helper functions for setting up dynamically allocated
> > backing_dev_info structures for filesystems and cleaning them up on
> > superblock destruction.
> 
> Just one concern, will this cause problems for multiple superblock cases
> like nfs with nosharecache?

Can you elaborate a bit? I've looked for a while at what nfs with
nosharecache does but I didn't see how it would influence anything with
bdis...
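
For reference, the shape of the helper that patch provides, as a simplified
sketch (not the actual implementation; the function name here is illustrative
and the bookkeeping that lets generic superblock teardown drop the reference
again is omitted):

static int example_super_setup_bdi_name(struct super_block *sb,
					 const char *fmt, ...)
{
	struct backing_dev_info *bdi;
	va_list args;
	int err;

	/* allocate a refcounted bdi (bdi_alloc()/bdi_put() from patch 3) */
	bdi = bdi_alloc(GFP_KERNEL);
	if (!bdi)
		return -ENOMEM;

	bdi->name = sb->s_type->name;

	/* register it under the caller-supplied name */
	va_start(args, fmt);
	err = bdi_register_va(bdi, NULL, fmt, args);
	va_end(args);
	if (err) {
		bdi_put(bdi);
		return err;
	}

	sb->s_bdi = bdi;	/* owned by the superblock from now on */
	return 0;
}

The filesystem conversions later in the series (cifs, fuse, nfs) then just call
super_setup_bdi() or super_setup_bdi_name() from their fill_super paths instead
of embedding and registering a bdi themselves.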

        Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


[PATCH 10/24] cifs: Convert to separately allocated bdi

2017-02-02 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Steve French <sfre...@samba.org>
CC: linux-c...@vger.kernel.org
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/cifs/cifs_fs_sb.h |  1 -
 fs/cifs/cifsfs.c |  7 ++-
 fs/cifs/connect.c| 10 --
 3 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/cifs/cifs_fs_sb.h b/fs/cifs/cifs_fs_sb.h
index 07ed81cf1552..cbd216b57239 100644
--- a/fs/cifs/cifs_fs_sb.h
+++ b/fs/cifs/cifs_fs_sb.h
@@ -68,7 +68,6 @@ struct cifs_sb_info {
umode_t mnt_dir_mode;
unsigned int mnt_cifs_flags;
char   *mountdata; /* options received at mount time or via DFS refs */
-   struct backing_dev_info bdi;
struct delayed_work prune_tlinks;
struct rcu_head rcu;
char *prepath;
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 70f4e65fced2..8dcf1c2555bf 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -138,7 +138,12 @@ cifs_read_super(struct super_block *sb)
sb->s_magic = CIFS_MAGIC_NUMBER;
sb->s_op = _super_ops;
sb->s_xattr = cifs_xattr_handlers;
-   sb->s_bdi = &cifs_sb->bdi;
+   rc = super_setup_bdi(sb);
+   if (rc)
+   goto out_no_root;
+   /* tune readahead according to rsize */
+   sb->s_bdi->ra_pages = cifs_sb->rsize / PAGE_SIZE;
+
sb->s_blocksize = CIFS_MAX_MSGSIZE;
sb->s_blocksize_bits = 14;  /* default 2**14 = CIFS_MAX_MSGSIZE */
inode = cifs_root_iget(sb);
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 35ae49ed1f76..6c1dfe56589d 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -3650,10 +3650,6 @@ cifs_mount(struct cifs_sb_info *cifs_sb, struct smb_vol 
*volume_info)
int referral_walks_count = 0;
 #endif
 
-   rc = bdi_setup_and_register(&cifs_sb->bdi, "cifs");
-   if (rc)
-   return rc;
-
 #ifdef CONFIG_CIFS_DFS_UPCALL
 try_mount_again:
/* cleanup activities if we're chasing a referral */
@@ -3681,7 +3677,6 @@ cifs_mount(struct cifs_sb_info *cifs_sb, struct smb_vol 
*volume_info)
server = cifs_get_tcp_session(volume_info);
if (IS_ERR(server)) {
rc = PTR_ERR(server);
-   bdi_destroy(&cifs_sb->bdi);
goto out;
}
if ((volume_info->max_credits < 20) ||
@@ -3735,9 +3730,6 @@ cifs_mount(struct cifs_sb_info *cifs_sb, struct smb_vol 
*volume_info)
cifs_sb->wsize = server->ops->negotiate_wsize(tcon, volume_info);
cifs_sb->rsize = server->ops->negotiate_rsize(tcon, volume_info);
 
-   /* tune readahead according to rsize */
-   cifs_sb->bdi.ra_pages = cifs_sb->rsize / PAGE_SIZE;
-
 remote_path_check:
 #ifdef CONFIG_CIFS_DFS_UPCALL
/*
@@ -3854,7 +3846,6 @@ cifs_mount(struct cifs_sb_info *cifs_sb, struct smb_vol 
*volume_info)
cifs_put_smb_ses(ses);
else
cifs_put_tcp_session(server, 0);
-   bdi_destroy(&cifs_sb->bdi);
}
 
 out:
@@ -4057,7 +4048,6 @@ cifs_umount(struct cifs_sb_info *cifs_sb)
}
spin_unlock(&cifs_sb->tlink_tree_lock);
 
-   bdi_destroy(&cifs_sb->bdi);
kfree(cifs_sb->mountdata);
kfree(cifs_sb->prepath);
call_rcu(&cifs_sb->rcu, delayed_free);
-- 
2.10.2



[PATCH 24/24] block: Remove unused functions

2017-02-02 Thread Jan Kara
Now that all backing_dev_info structures are allocated separately, we can
drop some unused functions.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 include/linux/backing-dev.h |  5 -
 mm/backing-dev.c| 54 +
 2 files changed, 5 insertions(+), 54 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 6865b1c8b122..f39822a06305 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -17,8 +17,6 @@
 #include 
 #include 
 
-int __must_check bdi_init(struct backing_dev_info *bdi);
-
 static inline struct backing_dev_info *bdi_get(struct backing_dev_info *bdi)
 {
kref_get(&bdi->refcnt);
@@ -32,12 +30,9 @@ int bdi_register(struct backing_dev_info *bdi, struct device 
*parent,
const char *fmt, ...);
 int bdi_register_va(struct backing_dev_info *bdi, struct device *parent,
const char *fmt, va_list args);
-int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
 int bdi_register_owner(struct backing_dev_info *bdi, struct device *owner);
 void bdi_unregister(struct backing_dev_info *bdi);
 
-int __must_check bdi_setup_and_register(struct backing_dev_info *, char *);
-void bdi_destroy(struct backing_dev_info *bdi);
 struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, int node_id);
 struct backing_dev_info *bdi_alloc(gfp_t gfp_mask);
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 82fee0f52d06..38b1197f7479 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -12,8 +12,6 @@
 #include 
 #include 
 
-static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
-
 struct backing_dev_info noop_backing_dev_info = {
.name   = "noop",
.capabilities   = BDI_CAP_NO_ACCT_AND_WRITEBACK,
@@ -242,6 +240,8 @@ static __init int bdi_class_init(void)
 }
 postcore_initcall(bdi_class_init);
 
+static int bdi_init(struct backing_dev_info *bdi);
+
 static int __init default_bdi_init(void)
 {
int err;
@@ -771,7 +771,7 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi) 
{ }
 
 #endif /* CONFIG_CGROUP_WRITEBACK */
 
-int bdi_init(struct backing_dev_info *bdi)
+static int bdi_init(struct backing_dev_info *bdi)
 {
int ret;
 
@@ -791,7 +791,6 @@ int bdi_init(struct backing_dev_info *bdi)
 
return ret;
 }
-EXPORT_SYMBOL(bdi_init);
 
 struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, int node_id)
 {
@@ -864,12 +863,6 @@ int bdi_register(struct backing_dev_info *bdi, struct 
device *parent,
 }
 EXPORT_SYMBOL(bdi_register);
 
-int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
-{
-   return bdi_register(bdi, NULL, "%u:%u", MAJOR(dev), MINOR(dev));
-}
-EXPORT_SYMBOL(bdi_register_dev);
-
 int bdi_register_owner(struct backing_dev_info *bdi, struct device *owner)
 {
int rc;
@@ -923,19 +916,14 @@ void bdi_unregister(struct backing_dev_info *bdi)
}
 }
 
-static void bdi_exit(struct backing_dev_info *bdi)
-{
-   WARN_ON_ONCE(bdi->dev);
-   wb_exit(&bdi->wb);
-}
-
 static void release_bdi(struct kref *ref)
 {
struct backing_dev_info *bdi =
container_of(ref, struct backing_dev_info, refcnt);
 
bdi_unregister(bdi);
-   bdi_exit(bdi);
+   WARN_ON_ONCE(bdi->dev);
+   wb_exit(&bdi->wb);
kfree(bdi);
 }
 
@@ -944,38 +932,6 @@ void bdi_put(struct backing_dev_info *bdi)
kref_put(&bdi->refcnt, release_bdi);
 }
 
-void bdi_destroy(struct backing_dev_info *bdi)
-{
-   bdi_unregister(bdi);
-   bdi_exit(bdi);
-}
-EXPORT_SYMBOL(bdi_destroy);
-
-/*
- * For use from filesystems to quickly init and register a bdi associated
- * with dirty writeback
- */
-int bdi_setup_and_register(struct backing_dev_info *bdi, char *name)
-{
-   int err;
-
-   bdi->name = name;
-   bdi->capabilities = 0;
-   err = bdi_init(bdi);
-   if (err)
-   return err;
-
-   err = bdi_register(bdi, NULL, "%.28s-%ld", name,
-  atomic_long_inc_return(&bdi_seq));
-   if (err) {
-   bdi_destroy(bdi);
-   return err;
-   }
-
-   return 0;
-}
-EXPORT_SYMBOL(bdi_setup_and_register);
-
 static wait_queue_head_t congestion_wqh[2] = {
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
-- 
2.10.2



[PATCH 17/24] fuse: Convert to separately allocated bdi

2017-02-02 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Miklos Szeredi <mik...@szeredi.hu>
CC: linux-fsde...@vger.kernel.org
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/fuse/dev.c|  8 
 fs/fuse/fuse_i.h |  3 ---
 fs/fuse/inode.c  | 42 +-
 3 files changed, 17 insertions(+), 36 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 70ea57c7b6bb..1912164d57e9 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -382,8 +382,8 @@ static void request_end(struct fuse_conn *fc, struct 
fuse_req *req)
 
if (fc->num_background == fc->congestion_threshold &&
fc->connected && fc->bdi_initialized) {
-   clear_bdi_congested(&fc->bdi, BLK_RW_SYNC);
-   clear_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
+   clear_bdi_congested(fc->sb->s_bdi, BLK_RW_SYNC);
+   clear_bdi_congested(fc->sb->s_bdi, BLK_RW_ASYNC);
}
fc->num_background--;
fc->active_background--;
@@ -570,8 +570,8 @@ void fuse_request_send_background_locked(struct fuse_conn 
*fc,
fc->blocked = 1;
if (fc->num_background == fc->congestion_threshold &&
fc->bdi_initialized) {
-   set_bdi_congested(&fc->bdi, BLK_RW_SYNC);
-   set_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
+   set_bdi_congested(fc->sb->s_bdi, BLK_RW_SYNC);
+   set_bdi_congested(fc->sb->s_bdi, BLK_RW_ASYNC);
}
list_add_tail(&req->list, &fc->bg_queue);
flush_bg_queue(fc);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 91307940c8ac..effab9e9607f 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -631,9 +631,6 @@ struct fuse_conn {
/** Negotiated minor version */
unsigned minor;
 
-   /** Backing dev info */
-   struct backing_dev_info bdi;
-
/** Entry on the fuse_conn_list */
struct list_head entry;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 6fe6a88ecb4a..90bacbc87fb3 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -386,12 +386,6 @@ static void fuse_send_destroy(struct fuse_conn *fc)
}
 }
 
-static void fuse_bdi_destroy(struct fuse_conn *fc)
-{
-   if (fc->bdi_initialized)
-   bdi_destroy(&fc->bdi);
-}
-
 static void fuse_put_super(struct super_block *sb)
 {
struct fuse_conn *fc = get_fuse_conn_super(sb);
@@ -403,7 +397,6 @@ static void fuse_put_super(struct super_block *sb)
list_del(&fc->entry);
fuse_ctl_remove_conn(fc);
mutex_unlock(&fuse_mutex);
-   fuse_bdi_destroy(fc);
 
fuse_conn_put(fc);
 }
@@ -928,7 +921,8 @@ static void process_init_reply(struct fuse_conn *fc, struct 
fuse_req *req)
fc->no_flock = 1;
}
 
-   fc->bdi.ra_pages = min(fc->bdi.ra_pages, ra_pages);
+   fc->sb->s_bdi->ra_pages =
+   min(fc->sb->s_bdi->ra_pages, ra_pages);
fc->minor = arg->minor;
fc->max_write = arg->minor < 5 ? 4096 : arg->max_write;
fc->max_write = max_t(unsigned, 4096, fc->max_write);
@@ -944,7 +938,7 @@ static void fuse_send_init(struct fuse_conn *fc, struct 
fuse_req *req)
 
arg->major = FUSE_KERNEL_VERSION;
arg->minor = FUSE_KERNEL_MINOR_VERSION;
-   arg->max_readahead = fc->bdi.ra_pages * PAGE_SIZE;
+   arg->max_readahead = fc->sb->s_bdi->ra_pages * PAGE_SIZE;
arg->flags |= FUSE_ASYNC_READ | FUSE_POSIX_LOCKS | FUSE_ATOMIC_O_TRUNC |
FUSE_EXPORT_SUPPORT | FUSE_BIG_WRITES | FUSE_DONT_MASK |
FUSE_SPLICE_WRITE | FUSE_SPLICE_MOVE | FUSE_SPLICE_READ |
@@ -976,27 +970,20 @@ static void fuse_free_conn(struct fuse_conn *fc)
 static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb)
 {
int err;
+   char *suffix = "";
 
-   fc->bdi.name = "fuse";
-   fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_SIZE;
-   /* fuse does it's own writeback accounting */
-   fc->bdi.capabilities = BDI_CAP_NO_ACCT_WB | BDI_CAP_STRICTLIMIT;
-
-   err = bdi_init(&fc->bdi);
+   if (sb->s_bdev)
+   suffix = "-fuseblk";
+   err = super_setup_bdi_name(sb, "%u:%u%s", MAJOR(fc->dev),
+  MINOR(fc->dev), suffix);
if (err)
return err;
 
-   fc->bdi_initialized = 1;
-
-   if (sb->s_bdev) {
-   err =  bdi_register(&fc->bdi, NULL, "%u:%u-fuseblk",
-   MAJOR(fc->dev), MINOR(fc->dev));
-   } else {

[PATCH 21/24] nfs: Convert to separately allocated bdi

2017-02-02 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Trond Myklebust <trond.mykleb...@primarydata.com>
CC: Anna Schumaker <anna.schuma...@netapp.com>
CC: linux-...@vger.kernel.org
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/nfs/client.c   | 10 --
 fs/nfs/internal.h |  6 +++---
 fs/nfs/super.c| 34 +++---
 fs/nfs/write.c| 13 ++---
 include/linux/nfs_fs_sb.h |  1 -
 5 files changed, 28 insertions(+), 36 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 91a8d610ba0f..479afae529d8 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -738,9 +738,6 @@ static void nfs_server_set_fsinfo(struct nfs_server *server,
server->rsize = NFS_MAX_FILE_IO_SIZE;
server->rpages = (server->rsize + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
-   server->backing_dev_info.name = "nfs";
-   server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
-
if (server->wsize > max_rpc_payload)
server->wsize = max_rpc_payload;
if (server->wsize > NFS_MAX_FILE_IO_SIZE)
@@ -894,12 +891,6 @@ struct nfs_server *nfs_alloc_server(void)
return NULL;
}
 
-   if (bdi_init(>backing_dev_info)) {
-   nfs_free_iostats(server->io_stats);
-   kfree(server);
-   return NULL;
-   }
-
ida_init(>openowner_id);
ida_init(>lockowner_id);
pnfs_init_server(server);
@@ -930,7 +921,6 @@ void nfs_free_server(struct nfs_server *server)
ida_destroy(>lockowner_id);
ida_destroy(>openowner_id);
nfs_free_iostats(server->io_stats);
-   bdi_destroy(>backing_dev_info);
kfree(server);
nfs_release_automount_timer();
dprintk("<-- nfs_free_server()\n");
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 09ca5095c04e..55591c06b5d0 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -139,7 +139,7 @@ struct nfs_mount_request {
 };
 
 struct nfs_mount_info {
-   void (*fill_super)(struct super_block *, struct nfs_mount_info *);
+   int (*fill_super)(struct super_block *, struct nfs_mount_info *);
int (*set_security)(struct super_block *, struct dentry *, struct 
nfs_mount_info *);
struct nfs_parsed_mount_data *parsed;
struct nfs_clone_mount *cloned;
@@ -405,7 +405,7 @@ struct dentry *nfs_fs_mount(struct file_system_type *, int, 
const char *, void *
 struct dentry * nfs_xdev_mount_common(struct file_system_type *, int,
const char *, struct nfs_mount_info *);
 void nfs_kill_super(struct super_block *);
-void nfs_fill_super(struct super_block *, struct nfs_mount_info *);
+int nfs_fill_super(struct super_block *, struct nfs_mount_info *);
 
 extern struct rpc_stat nfs_rpcstat;
 
@@ -456,7 +456,7 @@ extern void nfs_read_prepare(struct rpc_task *task, void 
*calldata);
 extern void nfs_pageio_reset_read_mds(struct nfs_pageio_descriptor *pgio);
 
 /* super.c */
-void nfs_clone_super(struct super_block *, struct nfs_mount_info *);
+int nfs_clone_super(struct super_block *, struct nfs_mount_info *);
 void nfs_umount_begin(struct super_block *);
 int  nfs_statfs(struct dentry *, struct kstatfs *);
 int  nfs_show_options(struct seq_file *, struct dentry *);
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 6bca17883b93..16f4d92a96ec 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -2322,18 +2322,17 @@ inline void nfs_initialise_sb(struct super_block *sb)
sb->s_blocksize = nfs_block_bits(server->wsize,
 >s_blocksize_bits);
 
-   sb->s_bdi = >backing_dev_info;
-
nfs_super_set_maxbytes(sb, server->maxfilesize);
 }
 
 /*
  * Finish setting up an NFS2/3 superblock
  */
-void nfs_fill_super(struct super_block *sb, struct nfs_mount_info *mount_info)
+int nfs_fill_super(struct super_block *sb, struct nfs_mount_info *mount_info)
 {
struct nfs_parsed_mount_data *data = mount_info->parsed;
struct nfs_server *server = NFS_SB(sb);
+   int ret;
 
sb->s_blocksize_bits = 0;
sb->s_blocksize = 0;
@@ -2351,13 +2350,21 @@ void nfs_fill_super(struct super_block *sb, struct 
nfs_mount_info *mount_info)
}
 
nfs_initialise_sb(sb);
+
+   ret = super_setup_bdi_name(sb, "%u:%u", MAJOR(server->s_dev),
+  MINOR(server->s_dev));
+   if (ret)
+   return ret;
+   sb->s_bdi->ra_pages = server->rpages * NFS_MAX_READAHEAD;
+   return 0;
+
 }
 EXPORT_SYMBOL_GPL(nfs_fill_super);
 
 /*
  * Finish setting up a cloned NFS2/3/4 superblock
  */
-void nfs_clone_super(struct super_block *sb, struct nfs_mount_info *mount_info)
+int nfs

[PATCH 22/24] ubifs: Convert to separately allocated bdi

2017-02-02 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Richard Weinberger <rich...@nod.at>
CC: Artem Bityutskiy <dedeki...@gmail.com>
CC: Adrian Hunter <adrian.hun...@intel.com>
CC: linux-...@lists.infradead.org
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/ubifs/super.c | 23 +++
 fs/ubifs/ubifs.h |  3 ---
 2 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index e08aa04fc835..34810eb52b22 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -1827,7 +1827,6 @@ static void ubifs_put_super(struct super_block *sb)
}
 
ubifs_umount(c);
-   bdi_destroy(>bdi);
ubi_close_volume(c->ubi);
mutex_unlock(>umount_mutex);
 }
@@ -2019,29 +2018,23 @@ static int ubifs_fill_super(struct super_block *sb, 
void *data, int silent)
goto out;
}
 
+   err = ubifs_parse_options(c, data, 0);
+   if (err)
+   goto out_close;
+
/*
 * UBIFS provides 'backing_dev_info' in order to disable read-ahead. For
 * UBIFS, I/O is not deferred, it is done immediately in readpage,
 * which means the user would have to wait not just for their own I/O
 * but the read-ahead I/O as well i.e. completely pointless.
 *
-* Read-ahead will be disabled because @c->bdi.ra_pages is 0.
+* Read-ahead will be disabled because @sb->s_bdi->ra_pages is 0.
 */
-   c->bdi.name = "ubifs",
-   c->bdi.capabilities = 0;
-   err  = bdi_init(>bdi);
+   err = super_setup_bdi_name(sb, "ubifs_%d_%d", c->vi.ubi_num,
+  c->vi.vol_id);
if (err)
goto out_close;
-   err = bdi_register(>bdi, NULL, "ubifs_%d_%d",
-  c->vi.ubi_num, c->vi.vol_id);
-   if (err)
-   goto out_bdi;
-
-   err = ubifs_parse_options(c, data, 0);
-   if (err)
-   goto out_bdi;
 
-   sb->s_bdi = >bdi;
sb->s_fs_info = c;
sb->s_magic = UBIFS_SUPER_MAGIC;
sb->s_blocksize = UBIFS_BLOCK_SIZE;
@@ -2080,8 +2073,6 @@ static int ubifs_fill_super(struct super_block *sb, void 
*data, int silent)
ubifs_umount(c);
 out_unlock:
mutex_unlock(>umount_mutex);
-out_bdi:
-   bdi_destroy(>bdi);
 out_close:
ubi_close_volume(c->ubi);
 out:
diff --git a/fs/ubifs/ubifs.h b/fs/ubifs/ubifs.h
index ca72382ce6cc..41b42a425b42 100644
--- a/fs/ubifs/ubifs.h
+++ b/fs/ubifs/ubifs.h
@@ -968,7 +968,6 @@ struct ubifs_debug_info;
  * struct ubifs_info - UBIFS file-system description data structure
  * (per-superblock).
  * @vfs_sb: VFS @struct super_block object
- * @bdi: backing device info object to make VFS happy and disable read-ahead
  *
  * @highest_inum: highest used inode number
  * @max_sqnum: current global sequence number
@@ -1216,7 +1215,6 @@ struct ubifs_debug_info;
  */
 struct ubifs_info {
struct super_block *vfs_sb;
-   struct backing_dev_info bdi;
 
ino_t highest_inum;
unsigned long long max_sqnum;
@@ -1457,7 +1455,6 @@ extern const struct inode_operations 
ubifs_file_inode_operations;
 extern const struct file_operations ubifs_dir_operations;
 extern const struct inode_operations ubifs_dir_inode_operations;
 extern const struct inode_operations ubifs_symlink_inode_operations;
-extern struct backing_dev_info ubifs_backing_dev_info;
 extern struct ubifs_compressor *ubifs_compressors[UBIFS_COMPR_TYPES_CNT];
 
 /* io.c */
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 03/24] block: Unregister bdi on last reference drop

2017-02-02 Thread Jan Kara
Most users will want to unregister the bdi when dropping the last
reference to it. Only a few users (like block devices) want to play more
complex tricks with bdi registration and unregistration. So unregister
the bdi when the last reference is dropped and just make sure we don't
unregister it a second time if it is already unregistered.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 include/linux/backing-dev-defs.h |  3 ++-
 mm/backing-dev.c | 10 ++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index ad955817916d..2ecafc8a2d06 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -146,7 +146,8 @@ struct backing_dev_info {
char *name;
 
struct kref refcnt; /* Reference counter for the structure */
-   unsigned int capabilities; /* Device capabilities */
+   unsigned int registered:1;  /* Is bdi registered? */
+   unsigned int capabilities:31;   /* Device capabilities */
unsigned int min_ratio;
unsigned int max_ratio, max_prop_frac;
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index d59571023df7..82fee0f52d06 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -843,6 +843,7 @@ int bdi_register_va(struct backing_dev_info *bdi, struct 
device *parent,
 
spin_lock_bh(_lock);
list_add_tail_rcu(>bdi_list, _list);
+   bdi->registered = 1;
spin_unlock_bh(_lock);
 
trace_writeback_bdi_register(bdi);
@@ -897,6 +898,14 @@ static void bdi_remove_from_list(struct backing_dev_info 
*bdi)
 
 void bdi_unregister(struct backing_dev_info *bdi)
 {
+   spin_lock_bh(_lock);
+   if (!bdi->registered) {
+   spin_unlock_bh(_lock);
+   return;
+   }
+   bdi->registered = 0;
+   spin_unlock_bh(_lock);
+
/* make sure nobody finds us on the bdi_list anymore */
bdi_remove_from_list(bdi);
wb_shutdown(>wb);
@@ -925,6 +934,7 @@ static void release_bdi(struct kref *ref)
struct backing_dev_info *bdi =
container_of(ref, struct backing_dev_info, refcnt);
 
+   bdi_unregister(bdi);
bdi_exit(bdi);
kfree(bdi);
 }
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
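
With this change, most dynamic-bdi users need no explicit unregistration at all:
dropping the last reference both unregisters and frees the bdi. A minimal sketch
of a teardown path under that model; the mydrv_* names are hypothetical and not
part of the patch:

/*
 * Sketch only: teardown for a hypothetical driver that allocated its
 * bdi dynamically. The final bdi_put() now also unregisters the bdi,
 * and an earlier explicit bdi_unregister() (e.g. on device removal)
 * remains safe because a second unregister is a no-op.
 */
static void mydrv_remove(struct mydrv_device *dev)
{
	bdi_unregister(dev->bdi);	/* optional early unregister */
	bdi_put(dev->bdi);		/* last ref: unregisters if needed, then frees */
	dev->bdi = NULL;
}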


[PATCH 02/24] bdi: Provide bdi_register_va()

2017-02-02 Thread Jan Kara
Add function that registers bdi and takes va_list instead of variable
number of arguments.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 include/linux/backing-dev.h |  2 ++
 mm/backing-dev.c| 20 +++-
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 81c07ade4305..6865b1c8b122 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -30,6 +30,8 @@ void bdi_put(struct backing_dev_info *bdi);
 __printf(3, 4)
 int bdi_register(struct backing_dev_info *bdi, struct device *parent,
const char *fmt, ...);
+int bdi_register_va(struct backing_dev_info *bdi, struct device *parent,
+   const char *fmt, va_list args);
 int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
 int bdi_register_owner(struct backing_dev_info *bdi, struct device *owner);
 void bdi_unregister(struct backing_dev_info *bdi);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7a5ba4163656..d59571023df7 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -824,18 +824,15 @@ struct backing_dev_info *bdi_alloc(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(bdi_alloc);
 
-int bdi_register(struct backing_dev_info *bdi, struct device *parent,
-   const char *fmt, ...)
+int bdi_register_va(struct backing_dev_info *bdi, struct device *parent,
+   const char *fmt, va_list args)
 {
-   va_list args;
struct device *dev;
 
if (bdi->dev)   /* The driver needs to use separate queues per device */
return 0;
 
-   va_start(args, fmt);
dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, 
args);
-   va_end(args);
if (IS_ERR(dev))
return PTR_ERR(dev);
 
@@ -851,6 +848,19 @@ int bdi_register(struct backing_dev_info *bdi, struct 
device *parent,
trace_writeback_bdi_register(bdi);
return 0;
 }
+EXPORT_SYMBOL(bdi_register_va);
+
+int bdi_register(struct backing_dev_info *bdi, struct device *parent,
+   const char *fmt, ...)
+{
+   va_list args;
+   int ret;
+
+   va_start(args, fmt);
+   ret = bdi_register_va(bdi, parent, fmt, args);
+   va_end(args);
+   return ret;
+}
 EXPORT_SYMBOL(bdi_register);
 
 int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/24] fs: Provide infrastructure for dynamic BDIs in filesystems

2017-02-02 Thread Jan Kara
Provide helper functions for setting up dynamically allocated
backing_dev_info structures for filesystems and cleaning them up on
superblock destruction.

CC: linux-...@lists.infradead.org
CC: linux-...@vger.kernel.org
CC: Petr Vandrovec <p...@vandrovec.name>
CC: linux-ni...@vger.kernel.org
CC: cluster-de...@redhat.com
CC: osd-...@open-osd.org
CC: codal...@coda.cs.cmu.edu
CC: linux-...@lists.infradead.org
CC: ecryp...@vger.kernel.org
CC: linux-c...@vger.kernel.org
CC: ceph-de...@vger.kernel.org
CC: linux-bt...@vger.kernel.org
CC: v9fs-develo...@lists.sourceforge.net
CC: lustre-de...@lists.lustre.org
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/super.c   | 49 
 include/linux/backing-dev-defs.h |  2 +-
 include/linux/fs.h   |  6 +
 3 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index ea662b0e5e78..31dc4c6450ef 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -446,6 +446,11 @@ void generic_shutdown_super(struct super_block *sb)
hlist_del_init(>s_instances);
spin_unlock(_lock);
up_write(>s_umount);
+   if (sb->s_iflags & SB_I_DYNBDI) {
+   bdi_put(sb->s_bdi);
+   sb->s_bdi = _backing_dev_info;
+   sb->s_iflags &= ~SB_I_DYNBDI;
+   }
 }
 
 EXPORT_SYMBOL(generic_shutdown_super);
@@ -1249,6 +1254,50 @@ mount_fs(struct file_system_type *type, int flags, const 
char *name, void *data)
 }
 
 /*
+ * Setup private BDI for given superblock. It gets automatically cleaned up
+ * in generic_shutdown_super().
+ */
+int super_setup_bdi_name(struct super_block *sb, char *fmt, ...)
+{
+   struct backing_dev_info *bdi;
+   int err;
+   va_list args;
+
+   bdi = bdi_alloc(GFP_KERNEL);
+   if (!bdi)
+   return -ENOMEM;
+
+   bdi->name = sb->s_type->name;
+
+   va_start(args, fmt);
+   err = bdi_register_va(bdi, NULL, fmt, args);
+   va_end(args);
+   if (err) {
+   bdi_put(bdi);
+   return err;
+   }
+   WARN_ON(sb->s_bdi != _backing_dev_info);
+   sb->s_bdi = bdi;
+   sb->s_iflags |= SB_I_DYNBDI;
+
+   return 0;
+}
+EXPORT_SYMBOL(super_setup_bdi_name);
+
+/*
+ * Setup private BDI for given superblock. It gets automatically cleaned up
+ * in generic_shutdown_super().
+ */
+int super_setup_bdi(struct super_block *sb)
+{
+   static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
+
+   return super_setup_bdi_name(sb, "%.28s-%ld", sb->s_type->name,
+   atomic_long_inc_return(_seq));
+}
+EXPORT_SYMBOL(super_setup_bdi);
+
+/*
  * This is an internal function, please use sb_end_{write,pagefault,intwrite}
  * instead.
  */
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 2ecafc8a2d06..70080b4217f4 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -143,7 +143,7 @@ struct backing_dev_info {
congested_fn *congested_fn; /* Function pointer if device is md/dm */
void *congested_data;   /* Pointer to aux data for congested func */
 
-   char *name;
+   const char *name;
 
struct kref refcnt; /* Reference counter for the structure */
unsigned int registered:1;  /* Is bdi registered? */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c930cbc19342..8ed8b6d1bc54 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1267,6 +1267,9 @@ struct mm_struct;
 /* sb->s_iflags to limit user namespace mounts */
 #define SB_I_USERNS_VISIBLE0x0010 /* fstype already mounted */
 
+/* Temporary flag until all filesystems are converted to dynamic bdis */
+#define SB_I_DYNBDI	0x0100
+
 /* Possible states of 'frozen' field */
 enum {
SB_UNFROZEN = 0,/* FS is unfrozen */
@@ -2103,6 +2106,9 @@ extern int vfs_ustat(dev_t, struct kstatfs *);
 extern int freeze_super(struct super_block *super);
 extern int thaw_super(struct super_block *super);
 extern bool our_mnt(struct vfsmount *mnt);
+extern __printf(2, 3)
+int super_setup_bdi_name(struct super_block *sb, char *fmt, ...);
+extern int super_setup_bdi(struct super_block *sb);
 
 extern int current_umask(void);
 
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
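
For filesystems converted later in the series, using this helper from fill_super
reduces the old bdi boilerplate to one call plus any per-fs tuning. A minimal
sketch, assuming a hypothetical filesystem "myfs" (not taken from the series)
that names its bdi after the device numbers:

/*
 * Sketch only: fill_super for a hypothetical filesystem using the
 * infrastructure added above. generic_shutdown_super() drops the bdi
 * reference automatically on unmount, so no error-path or put_super
 * cleanup for the bdi is needed.
 */
static int myfs_fill_super(struct super_block *sb, void *data, int silent)
{
	int err;

	err = super_setup_bdi_name(sb, "myfs-%u:%u",
				   MAJOR(sb->s_dev), MINOR(sb->s_dev));
	if (err)
		return err;

	/* Tune the freshly allocated bdi instead of an embedded one. */
	sb->s_bdi->ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_SIZE;

	/* ... remaining superblock setup ... */
	return 0;
}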


[PATCH 16/24] exofs: Convert to separately allocated bdi

2017-02-02 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Boaz Harrosh <o...@electrozaur.com>
CC: Benny Halevy <bhal...@primarydata.com>
CC: osd-...@open-osd.org
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/exofs/exofs.h |  1 -
 fs/exofs/super.c | 17 ++---
 2 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
index 2e86086bc940..5dc392404559 100644
--- a/fs/exofs/exofs.h
+++ b/fs/exofs/exofs.h
@@ -64,7 +64,6 @@ struct exofs_dev {
  * our extension to the in-memory superblock
  */
 struct exofs_sb_info {
-   struct backing_dev_info bdi;/* register our bdi with VFS  */
struct exofs_sb_stats s_ess;/* Written often, pre-allocate*/
int s_timeout;  /* timeout for OSD operations */
uint64_ts_nextid;   /* highest object ID used */
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
index 1076a4233b39..819624cfc8da 100644
--- a/fs/exofs/super.c
+++ b/fs/exofs/super.c
@@ -464,7 +464,6 @@ static void exofs_put_super(struct super_block *sb)
sbi->one_comp.obj.partition);
 
exofs_sysfs_sb_del(sbi);
-   bdi_destroy(>bdi);
exofs_free_sbi(sbi);
sb->s_fs_info = NULL;
 }
@@ -809,8 +808,12 @@ static int exofs_fill_super(struct super_block *sb, void 
*data, int silent)
__sbi_read_stats(sbi);
 
/* set up operation vectors */
-   sbi->bdi.ra_pages = __ra_pages(>layout);
-   sb->s_bdi = >bdi;
+   ret = super_setup_bdi(sb);
+   if (ret) {
+   EXOFS_DBGMSG("Failed to super_setup_bdi\n");
+   goto free_sbi;
+   }
+   sb->s_bdi->ra_pages = __ra_pages(>layout);
sb->s_fs_info = sbi;
sb->s_op = _sops;
sb->s_export_op = _export_ops;
@@ -836,14 +839,6 @@ static int exofs_fill_super(struct super_block *sb, void 
*data, int silent)
goto free_sbi;
}
 
-   ret = bdi_setup_and_register(>bdi, "exofs");
-   if (ret) {
-   EXOFS_DBGMSG("Failed to bdi_setup_and_register\n");
-   dput(sb->s_root);
-   sb->s_root = NULL;
-   goto free_sbi;
-   }
-
exofs_sysfs_dbg_print();
_exofs_print_device("Mounting", opts->dev_name,
ore_comp_dev(>oc, 0),
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 08/24] btrfs: Convert to separately allocated bdi

2017-02-02 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside superblock. This unifies handling of bdi among users.

CC: Chris Mason <c...@fb.com>
CC: Josef Bacik <jba...@fb.com>
CC: David Sterba <dste...@suse.com>
CC: linux-bt...@vger.kernel.org
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/btrfs/ctree.h   |  1 -
 fs/btrfs/disk-io.c | 36 +++-
 fs/btrfs/super.c   |  7 +++
 3 files changed, 14 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6a823719b6c5..1dc06f66dfcf 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -801,7 +801,6 @@ struct btrfs_fs_info {
struct btrfs_super_block *super_for_commit;
struct super_block *sb;
struct inode *btree_inode;
-   struct backing_dev_info bdi;
struct mutex tree_log_mutex;
struct mutex transaction_kthread_mutex;
struct mutex cleaner_mutex;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 37a31b12bb0c..b25723e729c0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1810,21 +1810,6 @@ static int btrfs_congested_fn(void *congested_data, int 
bdi_bits)
return ret;
 }
 
-static int setup_bdi(struct btrfs_fs_info *info, struct backing_dev_info *bdi)
-{
-   int err;
-
-   err = bdi_setup_and_register(bdi, "btrfs");
-   if (err)
-   return err;
-
-   bdi->ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_SIZE;
-   bdi->congested_fn   = btrfs_congested_fn;
-   bdi->congested_data = info;
-   bdi->capabilities |= BDI_CAP_CGROUP_WRITEBACK;
-   return 0;
-}
-
 /*
  * called by the kthread helper functions to finally call the bio end_io
  * functions.  This is where read checksum verification actually happens
@@ -2598,16 +2583,10 @@ int open_ctree(struct super_block *sb,
goto fail;
}
 
-   ret = setup_bdi(fs_info, _info->bdi);
-   if (ret) {
-   err = ret;
-   goto fail_srcu;
-   }
-
ret = percpu_counter_init(_info->dirty_metadata_bytes, 0, 
GFP_KERNEL);
if (ret) {
err = ret;
-   goto fail_bdi;
+   goto fail_srcu;
}
fs_info->dirty_metadata_batch = PAGE_SIZE *
(1 + ilog2(nr_cpu_ids));
@@ -2715,7 +2694,6 @@ int open_ctree(struct super_block *sb,
 
sb->s_blocksize = 4096;
sb->s_blocksize_bits = blksize_bits(4096);
-   sb->s_bdi = _info->bdi;
 
btrfs_init_btree_inode(fs_info);
 
@@ -2912,9 +2890,12 @@ int open_ctree(struct super_block *sb,
goto fail_sb_buffer;
}
 
-   fs_info->bdi.ra_pages *= btrfs_super_num_devices(disk_super);
-   fs_info->bdi.ra_pages = max(fs_info->bdi.ra_pages,
-   SZ_4M / PAGE_SIZE);
+   sb->s_bdi->congested_fn = btrfs_congested_fn;
+   sb->s_bdi->congested_data = fs_info;
+   sb->s_bdi->capabilities |= BDI_CAP_CGROUP_WRITEBACK;
+   sb->s_bdi->ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_SIZE;
+   sb->s_bdi->ra_pages *= btrfs_super_num_devices(disk_super);
+   sb->s_bdi->ra_pages = max(sb->s_bdi->ra_pages, SZ_4M / PAGE_SIZE);
 
sb->s_blocksize = sectorsize;
sb->s_blocksize_bits = blksize_bits(sectorsize);
@@ -3282,8 +3263,6 @@ int open_ctree(struct super_block *sb,
percpu_counter_destroy(_info->delalloc_bytes);
 fail_dirty_metadata_bytes:
percpu_counter_destroy(_info->dirty_metadata_bytes);
-fail_bdi:
-   bdi_destroy(_info->bdi);
 fail_srcu:
cleanup_srcu_struct(_info->subvol_srcu);
 fail:
@@ -4010,7 +3989,6 @@ void close_ctree(struct btrfs_fs_info *fs_info)
percpu_counter_destroy(_info->dirty_metadata_bytes);
percpu_counter_destroy(_info->delalloc_bytes);
percpu_counter_destroy(_info->bio_counter);
-   bdi_destroy(_info->bdi);
cleanup_srcu_struct(_info->subvol_srcu);
 
btrfs_free_stripe_hash_table(fs_info);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b5ae7d3d1896..08ef08b63132 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1133,6 +1133,13 @@ static int btrfs_fill_super(struct super_block *sb,
 #endif
sb->s_flags |= MS_I_VERSION;
sb->s_iflags |= SB_I_CGROUPWB;
+
+   err = super_setup_bdi(sb);
+   if (err) {
+   btrfs_err(fs_info, "super_setup_bdi failed");
+   return err;
+   }
+
err = open_ctree(sb, fs_devices, (char *)data);
if (err) {
btrfs_err(fs_info, "open_ctree failed");
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 07/24] 9p: Convert to separately allocated bdi

2017-02-02 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside session. This unifies handling of bdi among users.

CC: Eric Van Hensbergen <eri...@gmail.com>
CC: Ron Minnich <rminn...@sandia.gov>
CC: Latchesar Ionkov <lu...@ionkov.net>
CC: v9fs-develo...@lists.sourceforge.net
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/9p/v9fs.c  | 10 +-
 fs/9p/v9fs.h  |  1 -
 fs/9p/vfs_super.c | 15 ---
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/9p/v9fs.c b/fs/9p/v9fs.c
index 072e7599583a..0898a1a774fa 100644
--- a/fs/9p/v9fs.c
+++ b/fs/9p/v9fs.c
@@ -332,10 +332,6 @@ struct p9_fid *v9fs_session_init(struct v9fs_session_info 
*v9ses,
goto err_names;
init_rwsem(>rename_sem);
 
-   rc = bdi_setup_and_register(>bdi, "9p");
-   if (rc)
-   goto err_names;
-
v9ses->uid = INVALID_UID;
v9ses->dfltuid = V9FS_DEFUID;
v9ses->dfltgid = V9FS_DEFGID;
@@ -344,7 +340,7 @@ struct p9_fid *v9fs_session_init(struct v9fs_session_info 
*v9ses,
if (IS_ERR(v9ses->clnt)) {
rc = PTR_ERR(v9ses->clnt);
p9_debug(P9_DEBUG_ERROR, "problem initializing 9p client\n");
-   goto err_bdi;
+   goto err_names;
}
 
v9ses->flags = V9FS_ACCESS_USER;
@@ -414,8 +410,6 @@ struct p9_fid *v9fs_session_init(struct v9fs_session_info 
*v9ses,
 
 err_clnt:
p9_client_destroy(v9ses->clnt);
-err_bdi:
-   bdi_destroy(>bdi);
 err_names:
kfree(v9ses->uname);
kfree(v9ses->aname);
@@ -444,8 +438,6 @@ void v9fs_session_close(struct v9fs_session_info *v9ses)
kfree(v9ses->uname);
kfree(v9ses->aname);
 
-   bdi_destroy(>bdi);
-
spin_lock(_sessionlist_lock);
list_del(>slist);
spin_unlock(_sessionlist_lock);
diff --git a/fs/9p/v9fs.h b/fs/9p/v9fs.h
index 443d12e02043..76eaf49abd3a 100644
--- a/fs/9p/v9fs.h
+++ b/fs/9p/v9fs.h
@@ -114,7 +114,6 @@ struct v9fs_session_info {
kuid_t uid; /* if ACCESS_SINGLE, the uid that has access */
struct p9_client *clnt; /* 9p client */
struct list_head slist; /* list of sessions registered with v9fs */
-   struct backing_dev_info bdi;
struct rw_semaphore rename_sem;
 };
 
diff --git a/fs/9p/vfs_super.c b/fs/9p/vfs_super.c
index de3ed8629196..a0965fb587a5 100644
--- a/fs/9p/vfs_super.c
+++ b/fs/9p/vfs_super.c
@@ -72,10 +72,12 @@ static int v9fs_set_super(struct super_block *s, void *data)
  *
  */
 
-static void
+static int
 v9fs_fill_super(struct super_block *sb, struct v9fs_session_info *v9ses,
int flags, void *data)
 {
+   int ret;
+
sb->s_maxbytes = MAX_LFS_FILESIZE;
sb->s_blocksize_bits = fls(v9ses->maxdata - 1);
sb->s_blocksize = 1 << sb->s_blocksize_bits;
@@ -85,7 +87,11 @@ v9fs_fill_super(struct super_block *sb, struct 
v9fs_session_info *v9ses,
sb->s_xattr = v9fs_xattr_handlers;
} else
sb->s_op = _super_ops;
-   sb->s_bdi = >bdi;
+
+   ret = super_setup_bdi(sb);
+   if (ret)
+   return ret;
+
if (v9ses->cache)
sb->s_bdi->ra_pages = (VM_MAX_READAHEAD * 1024)/PAGE_SIZE;
 
@@ -99,6 +105,7 @@ v9fs_fill_super(struct super_block *sb, struct 
v9fs_session_info *v9ses,
 #endif
 
save_mount_options(sb, data);
+   return 0;
 }
 
 /**
@@ -138,7 +145,9 @@ static struct dentry *v9fs_mount(struct file_system_type 
*fs_type, int flags,
retval = PTR_ERR(sb);
goto clunk_fid;
}
-   v9fs_fill_super(sb, v9ses, flags, data);
+   retval = v9fs_fill_super(sb, v9ses, flags, data);
+   if (retval)
+   goto release_sb;
 
if (v9ses->cache == CACHE_LOOSE || v9ses->cache == CACHE_FSCACHE)
sb->s_d_op = _cached_dentry_operations;
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 05/24] fs: Get proper reference for s_bdi

2017-02-02 Thread Jan Kara
So far we just relied on block device to hold a bdi reference for us
while the filesystem is mounted. While that works perfectly fine, it is
a bit awkward that we have a pointer to a refcounted structure in the
superblock without proper reference. So make s_bdi hold a proper
reference to block device's BDI. No filesystem using mount_bdev()
actually changes s_bdi so this is safe and will make bdev filesystems
work the same way as filesystems needing to set up their private bdi.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/super.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 31dc4c6450ef..dfb95ccd4351 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1047,12 +1047,9 @@ static int set_bdev_super(struct super_block *s, void 
*data)
 {
s->s_bdev = data;
s->s_dev = s->s_bdev->bd_dev;
+   s->s_bdi = bdi_get(s->s_bdev->bd_bdi);
+   s->s_iflags |= SB_I_DYNBDI;
 
-   /*
-* We set the bdi here to the queue backing, file systems can
-* overwrite this in ->fill_super()
-*/
-   s->s_bdi = bdev_get_queue(s->s_bdev)->backing_dev_info;
return 0;
 }
 
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 09/24] ceph: Convert to separately allocated bdi

2017-02-02 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside client structure. This unifies handling of bdi among users.

CC: Ilya Dryomov <idryo...@gmail.com>
CC: "Yan, Zheng" <z...@redhat.com>
CC: Sage Weil <s...@redhat.com>
CC: ceph-de...@vger.kernel.org
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/ceph/addr.c|  6 +++---
 fs/ceph/debugfs.c |  2 +-
 fs/ceph/super.c   | 32 +++-
 fs/ceph/super.h   |  2 --
 4 files changed, 15 insertions(+), 27 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 9cd0c0ea7cdb..f83d00cf3e66 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -576,7 +576,7 @@ static int writepage_nounlock(struct page *page, struct 
writeback_control *wbc)
writeback_stat = atomic_long_inc_return(>writeback_count);
if (writeback_stat >
CONGESTION_ON_THRESH(fsc->mount_options->congestion_kb))
-   set_bdi_congested(>backing_dev_info, BLK_RW_ASYNC);
+   set_bdi_congested(inode_to_bdi(inode), BLK_RW_ASYNC);
 
set_page_writeback(page);
err = ceph_osdc_writepages(osdc, ceph_vino(inode),
@@ -698,7 +698,7 @@ static void writepages_finish(struct ceph_osd_request *req)
if (atomic_long_dec_return(>writeback_count) <
 CONGESTION_OFF_THRESH(
fsc->mount_options->congestion_kb))
-   clear_bdi_congested(>backing_dev_info,
+   clear_bdi_congested(inode_to_bdi(inode),
BLK_RW_ASYNC);
 
if (rc < 0)
@@ -977,7 +977,7 @@ static int ceph_writepages_start(struct address_space 
*mapping,
if (atomic_long_inc_return(>writeback_count) >
CONGESTION_ON_THRESH(
fsc->mount_options->congestion_kb)) {
-   set_bdi_congested(>backing_dev_info,
+   set_bdi_congested(inode_to_bdi(inode),
  BLK_RW_ASYNC);
}
 
diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
index 39ff678e567f..5da595c0edf1 100644
--- a/fs/ceph/debugfs.c
+++ b/fs/ceph/debugfs.c
@@ -251,7 +251,7 @@ int ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
goto out;
 
snprintf(name, sizeof(name), "../../bdi/%s",
-dev_name(fsc->backing_dev_info.dev));
+dev_name(fsc->sb->s_bdi->dev));
fsc->debugfs_bdi =
debugfs_create_symlink("bdi",
   fsc->client->debugfs_dir,
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index 6bd20d707bfd..ecc411fa7c06 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -579,10 +579,6 @@ static struct ceph_fs_client *create_fs_client(struct 
ceph_mount_options *fsopt,
 
atomic_long_set(>writeback_count, 0);
 
-   err = bdi_init(>backing_dev_info);
-   if (err < 0)
-   goto fail_client;
-
err = -ENOMEM;
/*
 * The number of concurrent works can be high but they don't need
@@ -590,7 +586,7 @@ static struct ceph_fs_client *create_fs_client(struct 
ceph_mount_options *fsopt,
 */
fsc->wb_wq = alloc_workqueue("ceph-writeback", 0, 1);
if (fsc->wb_wq == NULL)
-   goto fail_bdi;
+   goto fail_client;
fsc->pg_inv_wq = alloc_workqueue("ceph-pg-invalid", 0, 1);
if (fsc->pg_inv_wq == NULL)
goto fail_wb_wq;
@@ -624,8 +620,6 @@ static struct ceph_fs_client *create_fs_client(struct 
ceph_mount_options *fsopt,
destroy_workqueue(fsc->pg_inv_wq);
 fail_wb_wq:
destroy_workqueue(fsc->wb_wq);
-fail_bdi:
-   bdi_destroy(>backing_dev_info);
 fail_client:
ceph_destroy_client(fsc->client);
 fail:
@@ -643,8 +637,6 @@ static void destroy_fs_client(struct ceph_fs_client *fsc)
destroy_workqueue(fsc->pg_inv_wq);
destroy_workqueue(fsc->trunc_wq);
 
-   bdi_destroy(>backing_dev_info);
-
mempool_destroy(fsc->wb_pagevec_pool);
 
destroy_mount_options(fsc->mount_options);
@@ -938,25 +930,23 @@ static int ceph_compare_super(struct super_block *sb, 
void *data)
  */
 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
 
-static int ceph_register_bdi(struct super_block *sb,
-struct ceph_fs_client *fsc)
+static int ceph_setup_bdi(struct super_block *sb, struct ceph_fs_client *fsc)
 {
int err;
 
+   err = super_setup_bdi_name(sb, "ceph-%ld",
+  atomic_long_inc_return(_seq));
+   if (err)
+   return err;
+
/* set ra_pages based on rasize

[PATCH 11/24] ecryptfs: Convert to separately allocated bdi

2017-02-02 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Tyler Hicks <tyhi...@canonical.com>
CC: ecryp...@vger.kernel.org
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/ecryptfs/ecryptfs_kernel.h | 1 -
 fs/ecryptfs/main.c| 4 +---
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/ecryptfs/ecryptfs_kernel.h b/fs/ecryptfs/ecryptfs_kernel.h
index 599a29237cfe..e93444a4c4b1 100644
--- a/fs/ecryptfs/ecryptfs_kernel.h
+++ b/fs/ecryptfs/ecryptfs_kernel.h
@@ -349,7 +349,6 @@ struct ecryptfs_mount_crypt_stat {
 struct ecryptfs_sb_info {
struct super_block *wsi_sb;
struct ecryptfs_mount_crypt_stat mount_crypt_stat;
-   struct backing_dev_info bdi;
 };
 
 /* file private data. */
diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
index 151872dcc1f4..9014479d0160 100644
--- a/fs/ecryptfs/main.c
+++ b/fs/ecryptfs/main.c
@@ -519,12 +519,11 @@ static struct dentry *ecryptfs_mount(struct 
file_system_type *fs_type, int flags
goto out;
}
 
-   rc = bdi_setup_and_register(>bdi, "ecryptfs");
+   rc = super_setup_bdi(s);
if (rc)
goto out1;
 
ecryptfs_set_superblock_private(s, sbi);
-   s->s_bdi = >bdi;
 
/* ->kill_sb() will take care of sbi after that point */
sbi = NULL;
@@ -633,7 +632,6 @@ static void ecryptfs_kill_block_super(struct super_block 
*sb)
if (!sb_info)
return;
ecryptfs_destroy_mount_crypt_stat(_info->mount_crypt_stat);
-   bdi_destroy(_info->bdi);
kmem_cache_free(ecryptfs_sb_info_cache, sb_info);
 }
 
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 12/24] afs: Convert to separately allocated bdi

2017-02-02 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: David Howells <dhowe...@redhat.com>
CC: linux-...@lists.infradead.org
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/afs/internal.h | 1 -
 fs/afs/super.c| 4 +++-
 fs/afs/volume.c   | 7 ---
 3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 535a38d2c1d0..f8d52c36e6ec 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -310,7 +310,6 @@ struct afs_volume {
unsigned short  rjservers;  /* number of servers discarded 
due to -ENOMEDIUM */
struct afs_server   *servers[8];/* servers on which volume 
resides (ordered) */
struct rw_semaphore server_sem; /* lock for accessing current 
server */
-   struct backing_dev_info bdi;
 };
 
 /*
diff --git a/fs/afs/super.c b/fs/afs/super.c
index fbdb022b75a2..3bae29cd277f 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -319,7 +319,9 @@ static int afs_fill_super(struct super_block *sb,
sb->s_blocksize_bits= PAGE_SHIFT;
sb->s_magic = AFS_FS_MAGIC;
sb->s_op= _super_ops;
-   sb->s_bdi   = >volume->bdi;
+   ret = super_setup_bdi(sb);
+   if (ret)
+   return ret;
strlcpy(sb->s_id, as->volume->vlocation->vldb.name, sizeof(sb->s_id));
 
/* allocate the root inode and dentry */
diff --git a/fs/afs/volume.c b/fs/afs/volume.c
index d142a2449e65..db73d6dad02b 100644
--- a/fs/afs/volume.c
+++ b/fs/afs/volume.c
@@ -106,10 +106,6 @@ struct afs_volume *afs_volume_lookup(struct 
afs_mount_params *params)
volume->cell= params->cell;
volume->vid = vlocation->vldb.vid[params->type];
 
-   ret = bdi_setup_and_register(>bdi, "afs");
-   if (ret)
-   goto error_bdi;
-
init_rwsem(>server_sem);
 
/* look up all the applicable server records */
@@ -155,8 +151,6 @@ struct afs_volume *afs_volume_lookup(struct 
afs_mount_params *params)
return ERR_PTR(ret);
 
 error_discard:
-   bdi_destroy(>bdi);
-error_bdi:
up_write(>cell->vl_sem);
 
for (loop = volume->nservers - 1; loop >= 0; loop--)
@@ -206,7 +200,6 @@ void afs_put_volume(struct afs_volume *volume)
for (loop = volume->nservers - 1; loop >= 0; loop--)
afs_put_server(volume->servers[loop]);
 
-   bdi_destroy(>bdi);
kfree(volume);
 
_leave(" [destroyed]");
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 13/24] orangefs: Remove orangefs_backing_dev_info

2017-02-02 Thread Jan Kara
It is not used anywhere.

CC: Mike Marshall <hub...@omnibond.com>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/orangefs/inode.c   |  6 --
 fs/orangefs/orangefs-kernel.h |  1 -
 fs/orangefs/orangefs-mod.c| 12 +---
 3 files changed, 1 insertion(+), 18 deletions(-)

diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index 551bc74ed2b8..5cd617980fbf 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -136,12 +136,6 @@ static ssize_t orangefs_direct_IO(struct kiocb *iocb,
return -EINVAL;
 }
 
-struct backing_dev_info orangefs_backing_dev_info = {
-   .name = "orangefs",
-   .ra_pages = 0,
-   .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
-};
-
 /** ORANGEFS2 implementation of address space operations */
 const struct address_space_operations orangefs_address_operations = {
.readpage = orangefs_readpage,
diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
index 3bf803d732c5..70355a9a2596 100644
--- a/fs/orangefs/orangefs-kernel.h
+++ b/fs/orangefs/orangefs-kernel.h
@@ -529,7 +529,6 @@ extern spinlock_t orangefs_htable_ops_in_progress_lock;
 extern int hash_table_size;
 
 extern const struct address_space_operations orangefs_address_operations;
-extern struct backing_dev_info orangefs_backing_dev_info;
 extern const struct inode_operations orangefs_file_inode_operations;
 extern const struct file_operations orangefs_file_operations;
 extern const struct inode_operations orangefs_symlink_inode_operations;
diff --git a/fs/orangefs/orangefs-mod.c b/fs/orangefs/orangefs-mod.c
index 4113eb0495bf..c1b5174cb5a9 100644
--- a/fs/orangefs/orangefs-mod.c
+++ b/fs/orangefs/orangefs-mod.c
@@ -80,11 +80,6 @@ static int __init orangefs_init(void)
int ret = -1;
__u32 i = 0;
 
-   ret = bdi_init(_backing_dev_info);
-
-   if (ret)
-   return ret;
-
if (op_timeout_secs < 0)
op_timeout_secs = 0;
 
@@ -94,7 +89,7 @@ static int __init orangefs_init(void)
/* initialize global book keeping data structures */
ret = op_cache_initialize();
if (ret < 0)
-   goto err;
+   goto out;
 
ret = orangefs_inode_cache_initialize();
if (ret < 0)
@@ -181,9 +176,6 @@ static int __init orangefs_init(void)
 cleanup_op:
op_cache_finalize();
 
-err:
-   bdi_destroy(_backing_dev_info);
-
 out:
return ret;
 }
@@ -207,8 +199,6 @@ static void __exit orangefs_exit(void)
 
kfree(orangefs_htable_ops_in_progress);
 
-   bdi_destroy(_backing_dev_info);
-
pr_info("orangefs: module version %s unloaded\n", ORANGEFS_VERSION);
 }
 
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 14/24] mtd: Convert to dynamically allocated bdi infrastructure

2017-02-02 Thread Jan Kara
MTD already allocates backing_dev_info dynamically. Convert it to use
generic infrastructure for this including proper refcounting. We drop
mtd->backing_dev_info as its only use was to pass mtd_bdi pointer from
one file into another and if we wanted to keep that in a clean way, we'd
have to make mtd hold and drop bdi reference as needed which seems
pointless for passing one global pointer...

CC: David Woodhouse <dw...@infradead.org>
CC: Brian Norris <computersforpe...@gmail.com>
CC: linux-...@lists.infradead.org
Signed-off-by: Jan Kara <j...@suse.cz>
---
 drivers/mtd/mtdcore.c   | 23 ---
 drivers/mtd/mtdsuper.c  |  7 ++-
 include/linux/mtd/mtd.h |  5 -
 3 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
index 052772f7caef..a05063de362f 100644
--- a/drivers/mtd/mtdcore.c
+++ b/drivers/mtd/mtdcore.c
@@ -46,7 +46,7 @@
 
 #include "mtdcore.h"
 
-static struct backing_dev_info *mtd_bdi;
+struct backing_dev_info *mtd_bdi;
 
 #ifdef CONFIG_PM_SLEEP
 
@@ -496,11 +496,9 @@ int add_mtd_device(struct mtd_info *mtd)
 * mtd_device_parse_register() multiple times on the same master MTD,
 * especially with CONFIG_MTD_PARTITIONED_MASTER=y.
 */
-   if (WARN_ONCE(mtd->backing_dev_info, "MTD already registered\n"))
+   if (WARN_ONCE(mtd->dev.type, "MTD already registered\n"))
return -EEXIST;
 
-   mtd->backing_dev_info = mtd_bdi;
-
BUG_ON(mtd->writesize == 0);
mutex_lock(_table_mutex);
 
@@ -1775,13 +1773,18 @@ static struct backing_dev_info * __init 
mtd_bdi_init(char *name)
struct backing_dev_info *bdi;
int ret;
 
-   bdi = kzalloc(sizeof(*bdi), GFP_KERNEL);
+   bdi = bdi_alloc(GFP_KERNEL);
if (!bdi)
return ERR_PTR(-ENOMEM);
 
-   ret = bdi_setup_and_register(bdi, name);
+   bdi->name = name;
+   /*
+* We put a '-0' suffix on the name to get the same name format as we
+* used to get. Since this is called only once, we get a unique name. 
+*/
+   ret = bdi_register(bdi, NULL, "%.28s-0", name);
if (ret)
-   kfree(bdi);
+   bdi_put(bdi);
 
return ret ? ERR_PTR(ret) : bdi;
 }
@@ -1813,8 +1816,7 @@ static int __init init_mtd(void)
 out_procfs:
if (proc_mtd)
remove_proc_entry("mtd", NULL);
-   bdi_destroy(mtd_bdi);
-   kfree(mtd_bdi);
+   bdi_put(mtd_bdi);
 err_bdi:
class_unregister(_class);
 err_reg:
@@ -1828,8 +1830,7 @@ static void __exit cleanup_mtd(void)
if (proc_mtd)
remove_proc_entry("mtd", NULL);
class_unregister(_class);
-   bdi_destroy(mtd_bdi);
-   kfree(mtd_bdi);
+   bdi_put(mtd_bdi);
idr_destroy(_idr);
 }
 
diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
index 20c02a3b7417..e69e7855e31f 100644
--- a/drivers/mtd/mtdsuper.c
+++ b/drivers/mtd/mtdsuper.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * compare superblocks to see if they're equivalent
@@ -38,6 +39,8 @@ static int get_sb_mtd_compare(struct super_block *sb, void 
*_mtd)
return 0;
 }
 
+extern struct backing_dev_info *mtd_bdi;
+
 /*
  * mark the superblock by the MTD device it is using
  * - set the device number to be the correct MTD block device for persistence
@@ -49,7 +52,9 @@ static int get_sb_mtd_set(struct super_block *sb, void *_mtd)
 
sb->s_mtd = mtd;
sb->s_dev = MKDEV(MTD_BLOCK_MAJOR, mtd->index);
-   sb->s_bdi = mtd->backing_dev_info;
+   sb->s_bdi = bdi_get(mtd_bdi);
+   sb->s_iflags |= SB_I_DYNBDI;
+
return 0;
 }
 
diff --git a/include/linux/mtd/mtd.h b/include/linux/mtd/mtd.h
index 13f8052b9ff9..de64f87abbe0 100644
--- a/include/linux/mtd/mtd.h
+++ b/include/linux/mtd/mtd.h
@@ -332,11 +332,6 @@ struct mtd_info {
int (*_get_device) (struct mtd_info *mtd);
void (*_put_device) (struct mtd_info *mtd);
 
-   /* Backing device capabilities for this device
-* - provides mmap capabilities
-*/
-   struct backing_dev_info *backing_dev_info;
-
struct notifier_block reboot_notifier;  /* default mode before reboot */
 
/* ECC status information */
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 06/24] lustre: Convert to separately allocated bdi

2017-02-02 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside superblock. This unifies handling of bdi among users.

CC: Oleg Drokin <oleg.dro...@intel.com>
CC: Andreas Dilger <andreas.dil...@intel.com>
CC: James Simmons <jsimm...@infradead.org>
CC: lustre-de...@lists.lustre.org
Signed-off-by: Jan Kara <j...@suse.cz>
---
 .../staging/lustre/lustre/include/lustre_disk.h|  4 
 drivers/staging/lustre/lustre/llite/llite_lib.c| 24 +++---
 2 files changed, 3 insertions(+), 25 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/lustre_disk.h 
b/drivers/staging/lustre/lustre/include/lustre_disk.h
index 8886458748c1..a676bccabd43 100644
--- a/drivers/staging/lustre/lustre/include/lustre_disk.h
+++ b/drivers/staging/lustre/lustre/include/lustre_disk.h
@@ -133,13 +133,9 @@ struct lustre_sb_info {
struct obd_export*lsi_osd_exp;
char  lsi_osd_type[16];
char  lsi_fstype[16];
-   struct backing_dev_info   lsi_bdi; /* each client mountpoint needs
-   * own backing_dev_info
-   */
 };
 
 #define LSI_UMOUNT_FAILOVER  0x0020
-#define LSI_BDI_INITIALIZED  0x0040
 
 #define s2lsi(sb)  ((struct lustre_sb_info *)((sb)->s_fs_info))
 #define s2lsi_nocast(sb) ((sb)->s_fs_info)
diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c 
b/drivers/staging/lustre/lustre/llite/llite_lib.c
index 25f5aed97f63..4f07d2e60d40 100644
--- a/drivers/staging/lustre/lustre/llite/llite_lib.c
+++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
@@ -861,15 +861,6 @@ void ll_lli_init(struct ll_inode_info *lli)
mutex_init(>lli_layout_mutex);
 }
 
-static inline int ll_bdi_register(struct backing_dev_info *bdi)
-{
-   static atomic_t ll_bdi_num = ATOMIC_INIT(0);
-
-   bdi->name = "lustre";
-   return bdi_register(bdi, NULL, "lustre-%d",
-   atomic_inc_return(_bdi_num));
-}
-
 int ll_fill_super(struct super_block *sb, struct vfsmount *mnt)
 {
struct lustre_profile *lprof = NULL;
@@ -879,6 +870,7 @@ int ll_fill_super(struct super_block *sb, struct vfsmount 
*mnt)
char  *profilenm = get_profile_name(sb);
struct config_llog_instance *cfg;
interr;
+   static atomic_t ll_bdi_num = ATOMIC_INIT(0);
 
CDEBUG(D_VFSTRACE, "VFS Op: sb %p\n", sb);
 
@@ -901,16 +893,11 @@ int ll_fill_super(struct super_block *sb, struct vfsmount 
*mnt)
if (err)
goto out_free;
 
-   err = bdi_init(>lsi_bdi);
-   if (err)
-   goto out_free;
-   lsi->lsi_flags |= LSI_BDI_INITIALIZED;
-   lsi->lsi_bdi.capabilities = 0;
-   err = ll_bdi_register(>lsi_bdi);
+   err = super_setup_bdi_name(sb, "lustre-%d",
+  atomic_inc_return(_bdi_num));
if (err)
goto out_free;
 
-   sb->s_bdi = >lsi_bdi;
/* kernel >= 2.6.38 store dentry operations in sb->s_d_op. */
sb->s_d_op = _d_ops;
 
@@ -1031,11 +1018,6 @@ void ll_put_super(struct super_block *sb)
if (profilenm)
class_del_profile(profilenm);
 
-   if (lsi->lsi_flags & LSI_BDI_INITIALIZED) {
-   bdi_destroy(>lsi_bdi);
-   lsi->lsi_flags &= ~LSI_BDI_INITIALIZED;
-   }
-
ll_free_sbi(sb);
lsi->lsi_llsbi = NULL;
 
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 15/24] coda: Convert to separately allocated bdi

2017-02-02 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Jan Harkes <jahar...@cs.cmu.edu>
CC: c...@cs.cmu.edu
CC: codal...@coda.cs.cmu.edu
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/coda/inode.c| 11 ---
 include/linux/coda_psdev.h |  1 -
 2 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/coda/inode.c b/fs/coda/inode.c
index 71dbe7e287ce..b72eb96f3f96 100644
--- a/fs/coda/inode.c
+++ b/fs/coda/inode.c
@@ -183,10 +183,6 @@ static int coda_fill_super(struct super_block *sb, void 
*data, int silent)
goto unlock_out;
}
 
-   error = bdi_setup_and_register(>bdi, "coda");
-   if (error)
-   goto unlock_out;
-
vc->vc_sb = sb;
mutex_unlock(>vc_mutex);
 
@@ -197,7 +193,10 @@ static int coda_fill_super(struct super_block *sb, void 
*data, int silent)
sb->s_magic = CODA_SUPER_MAGIC;
sb->s_op = _super_operations;
sb->s_d_op = _dentry_operations;
-   sb->s_bdi = >bdi;
+
+   error = super_setup_bdi(sb);
+   if (error)
+   goto error;
 
/* get root fid from Venus: this needs the root inode */
error = venus_rootfid(sb, );
@@ -228,7 +227,6 @@ static int coda_fill_super(struct super_block *sb, void 
*data, int silent)
 
 error:
mutex_lock(>vc_mutex);
-   bdi_destroy(>bdi);
vc->vc_sb = NULL;
sb->s_fs_info = NULL;
 unlock_out:
@@ -240,7 +238,6 @@ static void coda_put_super(struct super_block *sb)
 {
struct venus_comm *vcp = coda_vcp(sb);
mutex_lock(>vc_mutex);
-   bdi_destroy(>bdi);
vcp->vc_sb = NULL;
sb->s_fs_info = NULL;
mutex_unlock(>vc_mutex);
diff --git a/include/linux/coda_psdev.h b/include/linux/coda_psdev.h
index 5b8721efa948..31e4e1f1547c 100644
--- a/include/linux/coda_psdev.h
+++ b/include/linux/coda_psdev.h
@@ -15,7 +15,6 @@ struct venus_comm {
struct list_headvc_processing;
int vc_inuse;
struct super_block *vc_sb;
-   struct backing_dev_info bdi;
struct mutexvc_mutex;
 };
 
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/24 RFC] fs: Convert all embedded bdis into separate ones

2017-02-02 Thread Jan Kara
Hello,

this patch series converts all embedded occurrences of struct backing_dev_info
to use standalone dynamically allocated structures. This makes bdi handling
unified across all bdi users and generally removes some boilerplate code from
filesystems setting up their own bdi. It also allows us to remove some code
from generic bdi implementation.

The patches were only compile-tested for most filesystems (I've tested
mounting only for NFS & btrfs) so fs maintainers please have a look whether
the changes look sound to you.

This series is based on top of bdi fixes that were merged into linux-block
git tree.

Honza
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/24] block: Provide bdi_alloc()

2017-02-02 Thread Jan Kara
Provide bdi_alloc() for simple allocation of a BDI that can be used by
filesystems that don't need anything fancy. We use this function when
converting filesystems from embedded struct backing_dev_info into a
dynamically allocated one.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 include/linux/backing-dev.h |  1 +
 mm/backing-dev.c| 15 +++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c52a48cb9a66..81c07ade4305 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -37,6 +37,7 @@ void bdi_unregister(struct backing_dev_info *bdi);
 int __must_check bdi_setup_and_register(struct backing_dev_info *, char *);
 void bdi_destroy(struct backing_dev_info *bdi);
 struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, int node_id);
+struct backing_dev_info *bdi_alloc(gfp_t gfp_mask);
 
 void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
bool range_cyclic, enum wb_reason reason);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 28ce6cf7b2ff..7a5ba4163656 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -809,6 +809,21 @@ struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, 
int node_id)
return bdi;
 }
 
+struct backing_dev_info *bdi_alloc(gfp_t gfp_mask)
+{
+   struct backing_dev_info *bdi;
+
+   bdi = kmalloc(sizeof(struct backing_dev_info), gfp_mask | __GFP_ZERO);
+   if (!bdi)
+   return NULL;
+   if (bdi_init(bdi)) {
+   kfree(bdi);
+   return NULL;
+   }
+   return bdi;
+}
+EXPORT_SYMBOL(bdi_alloc);
+
 int bdi_register(struct backing_dev_info *bdi, struct device *parent,
const char *fmt, ...)
 {
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
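
Outside of filesystems, a caller combines this allocator with the registration
and refcounting helpers from the rest of the series roughly as follows; the
mydrv names are hypothetical and only illustrate the intended call sequence:

/*
 * Sketch only: allocate, register and later release a dynamically
 * allocated bdi. After the "unregister on last reference drop" patch
 * in this series, releasing it is a single bdi_put().
 */
static struct backing_dev_info *mydrv_setup_bdi(int id)
{
	struct backing_dev_info *bdi;
	int err;

	bdi = bdi_alloc(GFP_KERNEL);
	if (!bdi)
		return ERR_PTR(-ENOMEM);
	bdi->name = "mydrv";

	err = bdi_register(bdi, NULL, "mydrv-%d", id);
	if (err) {
		bdi_put(bdi);		/* drops the initial reference and frees */
		return ERR_PTR(err);
	}
	return bdi;			/* released later with one bdi_put() */
}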


Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-23 Thread Jan Kara
On Fri 20-01-17 07:42:09, Dan Williams wrote:
> On Fri, Jan 20, 2017 at 1:47 AM, Jan Kara <j...@suse.cz> wrote:
> > On Thu 19-01-17 14:17:19, Vishal Verma wrote:
> >> On 01/18, Jan Kara wrote:
> >> > On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> >> > 2) PMEM is exposed for DAX aware filesystem. This seems to be what you 
> >> > are
> >> > mostly interested in. We could possibly do something more efficient than
> >> > what NVDIMM driver does however the complexity would be relatively high 
> >> > and
> >> > frankly I'm far from convinced this is really worth it. If there are so
> >> > many badblocks this would matter, the HW has IMHO bigger problems than
> >> > performance.
> >>
> >> Correct, and Dave was of the opinion that once at least XFS has reverse
> >> mapping support (which it does now), adding badblocks information to
> >> that should not be a hard lift, and should be a better solution. I
> >> suppose I should try to benchmark how much of a penalty the current badblock
> >> checking in the NVDIMM driver imposes. The penalty is not because there
> >> may be a large number of badblocks, but just due to the fact that we
> >> have to do this check for every IO, in fact, every 'bvec' in a bio.
> >
> > Well, letting filesystem know is certainly good from error reporting quality
> > POV. I guess I'll leave it upto XFS guys to tell whether they can be more
> > efficient in checking whether current IO overlaps with any of given bad
> > blocks.
> >
> >> > Now my question: Why do we bother with badblocks at all? In cases 1) and 
> >> > 2)
> >> > if the platform can recover from MCE, we can just always access 
> >> > persistent
> >> > memory using memcpy_mcsafe(), if that fails, return -EIO. Actually that
> >> > seems to already happen so we just need to make sure all places handle
> >> > returned errors properly (e.g. fs/dax.c does not seem to) and we are 
> >> > done.
> >> > No need for bad blocks list at all, no slow down unless we hit a bad cell
> >> > and in that case who cares about performance when the data is gone...
> >>
> >> Even when we have MCE recovery, we cannot do away with the badblocks
> >> list:
> >> 1. My understanding is that the hardware's ability to do MCE recovery is
> >> limited/best-effort, and is not guaranteed. There can be circumstances
> >> that cause a "Processor Context Corrupt" state, which is unrecoverable.
> >
> > Well, then they have to work on improving the hardware. Because having HW
> > that just sometimes gets stuck instead of reporting bad storage is simply
> > not acceptable. And no matter how hard you try you cannot avoid MCEs from
> > OS when accessing persistent memory so OS just has no way to avoid that
> > risk.
> >
> >> 2. We still need to maintain a badblocks list so that we know what
> >> blocks need to be cleared (via the ACPI method) on writes.
> >
> > Well, why cannot we just do the write, see whether we got CMCI and if yes,
> > clear the error via the ACPI method?
> 
> I would need to check if you get the address reported in the CMCI, but
> it would only fire if the write triggered a read-modify-write cycle. I
> suspect most copies to pmem, through something like
> arch_memcpy_to_pmem(), are not triggering any reads. It also triggers
> asynchronously, so what data do you write after clearing the error?
> There may have been more writes while the CMCI was being delivered.

OK, I see. And if we just write new data but don't clear error on write
through the ACPI method, will we still get MCE on following read of that
data? But regardless whether we get MCE or not, I suppose that the memory
location will be still marked as bad in some ACPI table, won't it?

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
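
The memcpy_mcsafe() idea discussed in this thread amounts to turning a consumed
machine check into an ordinary error return instead of consulting a badblocks
list on every I/O. A rough sketch, assuming (as the discussion does) that
memcpy_mcsafe() reports failure with a non-zero return; the helper name is made
up and the exact return convention of the real function may differ:

/*
 * Sketch only: error-aware copy from persistent memory. If the copy
 * trips over a poisoned line and the platform recovers via MCE, the
 * caller sees -EIO instead of the machine oopsing in the read or
 * writeback path.
 */
static int pmem_copy_from_pmem(void *dst, const void *pmem_addr, size_t len)
{
	if (memcpy_mcsafe(dst, pmem_addr, len))
		return -EIO;
	return 0;
}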


Re: [Lsf-pc] [LSF/MM TOPIC] block level event logging for storage media management

2017-01-25 Thread Jan Kara
On Tue 24-01-17 15:18:57, Oleg Drokin wrote:
> 
> On Jan 23, 2017, at 2:27 AM, Dan Williams wrote:
> 
> > [ adding Oleg ]
> > 
> > On Sun, Jan 22, 2017 at 10:00 PM, Song Liu <songliubrav...@fb.com> wrote:
> >> Hi Dan,
> >> 
> >> I think the block level event log is more like a log-only system. When 
> >> an event
> >> happens,  it is not necessary to take immediate action. (I guess this is 
> >> different
> >> to bad block list?).
> >> 
> >> I would hope the event log tracks more information. Some of these 
> >> individual
> >> events may not be very interesting, for example, soft errors or latency 
> >> outliers.
> >> However, when we gather event logs for a fleet of devices, these "soft 
> >> events"
> >> may become valuable for health monitoring.
> > 
> > I'd be interested in this. It sounds like you're trying to fill a gap
> > between tracing and console log messages which I believe others have
> > encountered as well.
> 
> We have a somewhat similar problem in Lustre and I guess it's not
> just Lustre.  Currently there are all sorts of conditional debug code all
> over the place that goes to the console and when you enable it for
> anything verbose, you quickly overflow your dmesg buffer no matter the
> size, that might be mostly ok for local "block level" stuff, but once you
> become distributed, it start to be a mess and once you get to be super
> large it worsens even more since you need to somehow coordinate data from
> multiple nodes, ensure all of it is not lost and still you don't end up
> using a lot of it since only a few nodes end up being useful.  (I don't
> know how NFS people manage to debug complicated issues using just this,
> could not be super easy).
> 
> Having some sort of a buffer of a (potentially very) large size that
> could be storing the data until it's needed, or eagerly polled by some
> daemon for storage (helpful when you expect a lot of data that definitely
> won't fit in RAM).
> 
> Tracepoints have the buffer and the daemon, but creating new messages is
> very cumbersome, so converting every debug message into one does not look
> very feasible.  Also it's convenient to have "event masks" one wants
> logged, which I don't think you could do with tracepoints.

So creating tracepoints IMO isn't that cumbersome. I agree that converting
hundreds or thousands of debug printks into tracepoints is a pain in the
ass but still it is doable. WRT filtering, you can enable each tracepoint
individually. Granted, that is not exactly the 'event mask' feature you ask
about, but that can be easily scripted in userspace if you give some
structure to tracepoint names. Finally, tracepoints provide a fine-grained
control you never get with printk - e.g. with trace filters you can make a
tracepoint trigger only if a specific inode is involved, which greatly
reduces the amount of output.
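To give an idea of what "not that cumbersome" means in practice, converting one
debug printk involves roughly the following boilerplate (the event and field
names here are invented for the example):

/* include/trace/events/blkevent.h -- hypothetical header for the example */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM blkevent

#if !defined(_TRACE_BLKEVENT_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_BLKEVENT_H

#include <linux/tracepoint.h>

TRACE_EVENT(blkevent_media_error,
	TP_PROTO(dev_t dev, sector_t sector, int error),
	TP_ARGS(dev, sector, error),

	TP_STRUCT__entry(
		__field(dev_t, dev)
		__field(sector_t, sector)
		__field(int, error)
	),

	TP_fast_assign(
		__entry->dev = dev;
		__entry->sector = sector;
		__entry->error = error;
	),

	TP_printk("dev %d:%d sector %llu error %d",
		  MAJOR(__entry->dev), MINOR(__entry->dev),
		  (unsigned long long)__entry->sector, __entry->error)
);

#endif /* _TRACE_BLKEVENT_H */

/* This part must be outside the include guard */
#include <trace/define_trace.h>

One .c file then defines CREATE_TRACE_POINTS before including the header, the
old printk becomes trace_blkevent_media_error(dev, sector, error), and the
'event mask' style selection plus per-field filtering is done by writing to the
enable and filter files under /sys/kernel/debug/tracing/events/blkevent/.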

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


[PATCH 0/4 RFC] BDI lifetime fix

2017-01-26 Thread Jan Kara
Hello,

this patch series attempts to solve the problems with the life time of a
backing_dev_info structure. Currently it lives inside request_queue structure
and thus it gets destroyed as soon as request queue goes away. However
the block device inode still stays around and thus inode_to_bdi() call on
that inode (e.g. from flusher worker) may happen after request queue has been
destroyed resulting in oops.

This patch set tries to solve these problems by making backing_dev_info
independent structure referenced from block device inode. That makes sure
inode_to_bdi() cannot ever oops. The patches are lightly tested for now
(they boot, basic tests with adding & removing loop devices seem to do what
I'd expect them to do ;). If someone is able to reproduce crashes on bdi
when device goes away, please test these patches.
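To make the effect concrete, the lookup path this series protects looks roughly
like the following sketch (condensed from the 4.10-era backing-dev.h, not a
verbatim copy):

/* Condensed sketch, not a verbatim copy of the kernel source. */
static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;

	if (sb_is_blkdev_sb(sb))
		/*
		 * With this series, blk_get_backing_dev_info() returns
		 * bdev->bd_bdi, a reference the bdev inode itself holds,
		 * so this stays valid even after the request queue is gone.
		 */
		return blk_get_backing_dev_info(I_BDEV(inode));
	return sb->s_bdi;
}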

I'd also appreciate if people had a look whether the approach I took looks
sensible.

Honza


[PATCH 4/4] block: Make blk_get_backing_dev_info() safe without open bdev

2017-01-26 Thread Jan Kara
Currently blk_get_backing_dev_info() is not safe to be called when the
block device is not open as bdev->bd_disk is NULL in that case. However
inode_to_bdi() uses this function and may be called from the flusher
worker or other writeback related functions without the bdev being open,
which leads to crashes such as:

[113031.075540] Unable to handle kernel paging request for data at address 
0x
[113031.075614] Faulting instruction address: 0xc03692e0
0:mon> t
[c000fb65f900] c036cb6c writeback_sb_inodes+0x30c/0x590
[c000fb65fa10] c036ced4 __writeback_inodes_wb+0xe4/0x150
[c000fb65fa70] c036d33c wb_writeback+0x30c/0x450
[c000fb65fb40] c036e198 wb_workfn+0x268/0x580
[c000fb65fc50] c00f3470 process_one_work+0x1e0/0x590
[c000fb65fce0] c00f38c8 worker_thread+0xa8/0x660
[c000fb65fd80] c00fc4b0 kthread+0x110/0x130
[c000fb65fe30] c00098f0 ret_from_kernel_thread+0x5c/0x6c
--- Exception: 0  at 
0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c000fb65f620]
pc: c03692e0: locked_inode_to_wb_and_lock_list+0x50/0x290
lr: c036cb6c: writeback_sb_inodes+0x30c/0x590
sp: c000fb65f8a0
   msr: 80010280b033
   dar: 0
 dsisr: 4000
  current = 0xc001d69be400
  paca= 0xc348   softe: 0irq_happened: 0x01
pid   = 18689, comm = kworker/u16:10

Fix the problem by grabbing reference to bdi on first open of the block
device and drop the reference only once the inode is evicted from
memory. This pins struct backing_dev_info in memory and thus fixes the
crashes.

Reported-by: Dan Williams <dan.j.willi...@intel.com>
Reported-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/blk-core.c   | 8 +++-
 fs/block_dev.c | 7 +++
 include/linux/fs.h | 1 +
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 5613d3e0821e..34056a37361c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -109,14 +109,12 @@ void blk_queue_congestion_threshold(struct request_queue 
*q)
  * @bdev:  device
  *
  * Locates the passed device's request queue and returns the address of its
- * backing_dev_info.  This function can only be called if @bdev is opened
- * and the return value is never NULL.
+ * backing_dev_info. The return value is never NULL however we may return
+ * &noop_backing_dev_info if the bdev is not currently open.
  */
 struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev)
 {
-   struct request_queue *q = bdev_get_queue(bdev);
-
-   return q->backing_dev_info;
+   return bdev->bd_bdi;
 }
 EXPORT_SYMBOL(blk_get_backing_dev_info);
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index ed6a34be7a1e..601b71b76d7f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -884,6 +884,8 @@ static void bdev_evict_inode(struct inode *inode)
	spin_lock(&bdev_lock);
	list_del_init(&bdev->bd_list);
	spin_unlock(&bdev_lock);
+   if (bdev->bd_bdi != &noop_backing_dev_info)
+   bdi_put(bdev->bd_bdi);
 }
 
 static const struct super_operations bdev_sops = {
@@ -986,6 +988,7 @@ struct block_device *bdget(dev_t dev)
bdev->bd_contains = NULL;
bdev->bd_super = NULL;
bdev->bd_inode = inode;
+   bdev->bd_bdi = &noop_backing_dev_info;
bdev->bd_block_size = (1 << inode->i_blkbits);
bdev->bd_part_count = 0;
bdev->bd_invalidated = 0;
@@ -1542,6 +1545,8 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
bdev->bd_disk = disk;
bdev->bd_queue = disk->queue;
bdev->bd_contains = bdev;
+   if (bdev->bd_bdi == &noop_backing_dev_info)
+   bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
 
if (!partno) {
ret = -ENXIO;
@@ -1637,6 +1642,8 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
bdev->bd_disk = NULL;
bdev->bd_part = NULL;
bdev->bd_queue = NULL;
+   bdi_put(bdev->bd_bdi);
+   bdev->bd_bdi = &noop_backing_dev_info;
if (bdev != bdev->bd_contains)
__blkdev_put(bdev->bd_contains, mode, 1);
bdev->bd_contains = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 702cb6c50194..c930cbc19342 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -423,6 +423,7 @@ struct block_device {
int bd_invalidated;
struct gendisk *bd_disk;
struct request_queue *  bd_queue;
+   struct backing_dev_info *bd_bdi;
struct list_headbd_list;
/*
 * Private data.  You must have bd_claim'ed the blo

[PATCH 1/4] block: Unhash block device inodes on gendisk destruction

2017-01-26 Thread Jan Kara
Currently, block device inodes stay around after the corresponding gendisk
has died until memory reclaim finds them and frees them. Since we will
make block device inode pin the bdi, we want to free the block device
inode as soon as the device goes away so that bdi does not stay around
unnecessarily. Furthermore we need to avoid issues when new device with
the same major,minor pair gets created since reusing the bdi structure
would be rather difficult in this case.

Unhashing block device inode on gendisk destruction nicely deals with
these problems. Once last block device inode reference is dropped (which
may be directly in del_gendisk()), the inode gets evicted. Furthermore if
the major,minor pair gets reallocated, we are guaranteed to get new
block device inode even if old block device inode is not yet evicted and
thus we avoid issues with possible reuse of bdi.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/genhd.c  |  2 ++
 fs/block_dev.c | 15 +++
 include/linux/fs.h |  1 +
 3 files changed, 18 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index fcd6d4fae657..f2f22d0e8e14 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -648,6 +648,8 @@ void del_gendisk(struct gendisk *disk)
	disk_part_iter_init(&piter, disk,
 DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
	while ((part = disk_part_iter_next(&piter))) {
+   bdev_unhash_inode(MKDEV(disk->major,
+   disk->first_minor + part->partno));
invalidate_partition(disk, part->partno);
delete_partition(disk, part->partno);
}
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 5db5d1340d69..ed6a34be7a1e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -954,6 +954,21 @@ static int bdev_set(struct inode *inode, void *data)
 
 static LIST_HEAD(all_bdevs);
 
+/*
+ * If there is a bdev inode for this device, unhash it so that it gets evicted
+ * as soon as last inode reference is dropped.
+ */
+void bdev_unhash_inode(dev_t dev)
+{
+   struct inode *inode;
+
+   inode = ilookup5(blockdev_superblock, hash(dev), bdev_test, &dev);
+   if (inode) {
+   remove_inode_hash(inode);
+   iput(inode);
+   }
+}
+
 struct block_device *bdget(dev_t dev)
 {
struct block_device *bdev;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2ba074328894..702cb6c50194 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2342,6 +2342,7 @@ extern struct kmem_cache *names_cachep;
 #ifdef CONFIG_BLOCK
 extern int register_blkdev(unsigned int, const char *);
 extern void unregister_blkdev(unsigned int, const char *);
+extern void bdev_unhash_inode(dev_t dev);
 extern struct block_device *bdget(dev_t);
 extern struct block_device *bdgrab(struct block_device *bdev);
 extern void bd_set_size(struct block_device *, loff_t size);
-- 
2.10.2



[PATCH 3/4] block: Dynamically allocate and refcount backing_dev_info

2017-01-26 Thread Jan Kara
Instead of storing backing_dev_info inside struct request_queue,
allocate it dynamically, reference count it, and free it when the last
reference is dropped. Currently only request_queue holds the reference
but in the following patch we add other users referencing
backing_dev_info.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/blk-core.c | 10 --
 block/blk-sysfs.c|  2 +-
 include/linux/backing-dev-defs.h |  4 +++-
 include/linux/backing-dev.h  | 10 +-
 include/linux/blkdev.h   |  1 -
 mm/backing-dev.c | 35 +++
 6 files changed, 48 insertions(+), 14 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a9ff1b919ae7..5613d3e0821e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -693,7 +693,6 @@ static void blk_rq_timed_out_timer(unsigned long data)
 struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 {
struct request_queue *q;
-   int err;
 
q = kmem_cache_alloc_node(blk_requestq_cachep,
gfp_mask | __GFP_ZERO, node_id);
@@ -708,17 +707,16 @@ struct request_queue *blk_alloc_queue_node(gfp_t 
gfp_mask, int node_id)
if (!q->bio_split)
goto fail_id;
 
-   q->backing_dev_info = &q->_backing_dev_info;
+   q->backing_dev_info = bdi_alloc_node(gfp_mask, node_id);
+   if (!q->backing_dev_info)
+   goto fail_split;
+
q->backing_dev_info->ra_pages =
(VM_MAX_READAHEAD * 1024) / PAGE_SIZE;
q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK;
q->backing_dev_info->name = "block";
q->node = node_id;
 
-   err = bdi_init(q->backing_dev_info);
-   if (err)
-   goto fail_split;
-
	setup_timer(&q->backing_dev_info->laptop_mode_wb_timer,
laptop_mode_timer_fn, (unsigned long) q);
	setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 64fb54c6b41c..4cbaa519ec2d 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -799,7 +799,7 @@ static void blk_release_queue(struct kobject *kobj)
container_of(kobj, struct request_queue, kobj);
 
wbt_exit(q);
-   bdi_exit(q->backing_dev_info);
+   bdi_put(q->backing_dev_info);
blkcg_exit_queue(q);
 
if (q->elevator) {
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index e850e76acaaf..4282f21b1611 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -144,7 +144,9 @@ struct backing_dev_info {
 
char *name;
 
-   unsigned int capabilities; /* Device capabilities */
+   atomic_t refcnt;/* Reference counter for the structure */
+   unsigned int capabilities:31;   /* Device capabilities */
+   unsigned int free_on_put:1; /* Structure will be freed on last bdi_put() */
unsigned int min_ratio;
unsigned int max_ratio, max_prop_frac;
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 43b93a947e61..c1601c23391a 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -18,7 +18,14 @@
 #include 
 
 int __must_check bdi_init(struct backing_dev_info *bdi);
-void bdi_exit(struct backing_dev_info *bdi);
+
+static inline struct backing_dev_info *bdi_get(struct backing_dev_info *bdi)
+{
+   atomic_inc(&bdi->refcnt);
+   return bdi;
+}
+
+void bdi_put(struct backing_dev_info *bdi);
 
 __printf(3, 4)
 int bdi_register(struct backing_dev_info *bdi, struct device *parent,
@@ -29,6 +36,7 @@ void bdi_unregister(struct backing_dev_info *bdi);
 
 int __must_check bdi_setup_and_register(struct backing_dev_info *, char *);
 void bdi_destroy(struct backing_dev_info *bdi);
+struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, int node_id);
 
 void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
bool range_cyclic, enum wb_reason reason);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 069e4a102a73..de85701cc699 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -433,7 +433,6 @@ struct request_queue {
struct delayed_work delay_work;
 
struct backing_dev_info *backing_dev_info;
-   struct backing_dev_info _backing_dev_info;
 
/*
 * The queue owner gets to use this for whatever they like.
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 3bfed5ab2475..14196e43b9ca 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -20,6 +20,8 @@ struct backing_dev_info noop_backing_dev_info = {
 };
 EXPORT_SYMBOL_GPL(noop_backing_dev_info);
 
+static struct kmem_cache *bdi_cachep;
+
 static struct class *bdi_class;
 
 /*
@@ -237,6 +239,9 @@ static __init int bdi_class_init(vo
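The refcount introduced above is a bare atomic_t. For comparison, a kref-based
variant of the same get/put pair could be sketched as below; this is an
illustration only, not part of this posting, and it assumes refcnt were
declared as a struct kref, with the teardown details elided:

#include <linux/kref.h>

/*
 * Illustration only: bdi_get()/bdi_put() built on a kref instead of the
 * bare atomic_t, assuming "refcnt" were a struct kref.  bdi_release()
 * stands in for the teardown the real bdi_put() performs (bdi_exit()
 * plus freeing the structure when it was dynamically allocated).
 */
static void bdi_release(struct kref *ref)
{
	struct backing_dev_info *bdi =
		container_of(ref, struct backing_dev_info, refcnt);

	/* bdi_exit(bdi) and kmem_cache_free(bdi_cachep, bdi) would go here */
	(void)bdi;
}

static inline struct backing_dev_info *bdi_get(struct backing_dev_info *bdi)
{
	kref_get(&bdi->refcnt);
	return bdi;
}

static inline void bdi_put(struct backing_dev_info *bdi)
{
	kref_put(&bdi->refcnt, bdi_release);
}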

Re: [PATCH 0/4 RFC] BDI lifetime fix

2017-01-30 Thread Jan Kara
On Thu 26-01-17 22:15:06, Dan Williams wrote:
> On Thu, Jan 26, 2017 at 9:45 AM, Jan Kara <j...@suse.cz> wrote:
> > Hello,
> >
> > this patch series attempts to solve the problems with the life time of a
> > backing_dev_info structure. Currently it lives inside request_queue 
> > structure
> > and thus it gets destroyed as soon as request queue goes away. However
> > the block device inode still stays around and thus inode_to_bdi() call on
> > that inode (e.g. from flusher worker) may happen after request queue has 
> > been
> > destroyed resulting in oops.
> >
> > This patch set tries to solve these problems by making backing_dev_info
> > independent structure referenced from block device inode. That makes sure
> > inode_to_bdi() cannot ever oops. The patches are lightly tested for now
> > (they boot, basic tests with adding & removing loop devices seem to do what
> > I'd expect them to do ;). If someone is able to reproduce crashes on bdi
> > when device goes away, please test these patches.
> 
> This survives several runs of the libnvdimm unit tests which stress
> del_gendisk() and blk_cleanup_queue(). I'll keep testing since the
> failure was intermittent, but this is looking good.

Thanks for testing!

> > I'd also appreciate if people had a look whether the approach I took looks
> > sensible.
> 
> Looks sensible, just the kref comment.
> 
> I also don't see a need to try to tag on the bdi device name reuse
> into this series. I'm wondering if we can handle that separately with
> device_rename(bdi->dev, ...) when we know scsi is done with the old
> bdi but it has not finished being deleted

Do you mean I should not speak about it in the changelog? The problems I
have are not so much with reusing the device *name* here (and the resulting
sysfs conflicts) but rather with reusing the major:minor number pair, which
results in reusing the block device inode. We are not prepared for that
since the bdi associated with that inode may already be unregistered and
reusing it would be difficult.

Honza

-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [RFC PATCH v2 0/2] block: fix backing_dev_info lifetime

2017-01-26 Thread Jan Kara
On Wed 25-01-17 13:43:58, Dan Williams wrote:
> On Mon, Jan 23, 2017 at 1:17 PM, Thiago Jung Bauermann
> <bauer...@linux.vnet.ibm.com> wrote:
> > Hello Dan,
> >
> > Am Freitag, 6. Januar 2017, 17:02:51 BRST schrieb Dan Williams:
> >> v1 of these changes [1] was a one line change to bdev_get_queue() to
> >> prevent a shutdown crash when del_gendisk() races the final
> >> __blkdev_put().
> >>
> >> While it is known at del_gendisk() time that the queue is still alive,
> >> Jan Kara points to other paths [2] that are racing __blkdev_put() where
> >> the assumption that ->bd_queue, or inode->i_wb is valid does not hold.
> >>
> >> Fix that broken assumption, make it the case that if you have a live
> >> block_device, or block_device-inode that the corresponding queue and
> >> inode-write-back data is still valid.
> >>
> >> These changes survive a run of the libnvdimm unit test suite which puts
> >> some stress on the block_device shutdown path.
> >
> > I realize that the kernel test robot found problems with this series, but 
> > FWIW
> > it fixes the bug mentioned in [2].
> >
> 
> Thanks for the test result. I might take a look at cleaning up the
> test robot reports and resubmitting this approach unless Jan beats me
> to the punch with his backing_dev_info lifetime change patches.

Yeah, so my patches (and I suspect yours as well) have a problem when the
backing_dev_info stays around because the blkdev inode still exists: the
device gets removed (e.g. a USB disk gets unplugged) but the blkdev inode
stays around (there doesn't appear to be anything that would force the
blkdev inode out of cache on device removal, and there cannot be because
different processes may hold inode references), and then some other device
gets plugged in and reuses the same MAJOR:MINOR combination. Things get
awkward there. I think we need to unhash the blkdev inode on device removal
but so far I didn't make this work...

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 04/13] block: Move bdi_unregister() to del_gendisk()

2017-02-22 Thread Jan Kara
On Tue 21-02-17 19:53:29, Bart Van Assche wrote:
> On 02/21/2017 09:55 AM, Jan Kara wrote:
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 47104f6a398b..9a901dcfdd5c 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -580,8 +580,6 @@ void blk_cleanup_queue(struct request_queue *q)
> > q->queue_lock = &q->__queue_lock;
> > spin_unlock_irq(lock);
> >  
> > -   bdi_unregister(q->backing_dev_info);
> > -
> > /* @q is and will stay empty, shutdown and put */
> > blk_put_queue(q);
> >  }
> > diff --git a/block/genhd.c b/block/genhd.c
> > index f6c4d4400759..68c613edb93a 100644
> > --- a/block/genhd.c
> > +++ b/block/genhd.c
> > @@ -660,6 +660,13 @@ void del_gendisk(struct gendisk *disk)
> > disk->flags &= ~GENHD_FL_UP;
> >  
> > sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
> > +   /*
> > +* Unregister bdi before releasing device numbers (as they can get
> > +* reused and we'd get clashes in sysfs) but after bdev inodes are
> > +* unhashed and thus will be soon destroyed as bdev inode's reference
> > +* to wb_writeback can block bdi_unregister().
> > +*/
> > +   bdi_unregister(disk->queue->backing_dev_info);
> > blk_unregister_queue(disk);
> > blk_unregister_region(disk_devt(disk), disk->minors);
> 
> This change looks suspicious to me. There are drivers that create a
> block layer queue but neither call device_add_disk() nor del_gendisk(),
> e.g. drivers/scsi/st.c. Although bdi_init() will be called for the
> queues created by these drivers, this patch will cause the
> bdi_unregister() call to be skipped for these drivers.

Well, the thing is that bdi_unregister() is the counterpart to
bdi_register(). Unless you call bdi_register(), which happens only in
device_add_disk() (and in some filesystems which create their private bdis),
there's no point in calling bdi_unregister(). The counterpart to bdi_init()
is bdi_exit(), and that always gets called once the bdi reference count
drops to 0.
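Spelled out for a hypothetical driver with a private bdi, the pairing described
above looks roughly like this (names such as "example-%d" are invented and
error handling is trimmed):

/*
 * Hypothetical driver with a private bdi; illustration of the pairing
 * described above, not code from any particular driver.
 */
static int example_bdi_setup(struct backing_dev_info *bdi,
			     struct device *parent, int id)
{
	int err;

	err = bdi_init(bdi);
	if (err)
		return err;

	/* Paired with bdi_unregister() in the teardown path. */
	err = bdi_register(bdi, parent, "example-%d", id);
	if (err)
		bdi_put(bdi);	/* last reference drop runs the bdi_exit() side */
	return err;
}

static void example_bdi_teardown(struct backing_dev_info *bdi)
{
	bdi_unregister(bdi);	/* counterpart of bdi_register() */
	bdi_put(bdi);		/* counterpart of bdi_init(): exit runs on last put */
}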

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 0/13 v2] block: Fix block device shutdown related races

2017-02-22 Thread Jan Kara
On Tue 21-02-17 10:19:28, Jens Axboe wrote:
> On 02/21/2017 10:09 AM, Jan Kara wrote:
> > Hello,
> > 
> > this is a second revision of the patch set to fix several different races 
> > and
> > issues I've found when testing device shutdown and reuse. The first three
> > patches are fixes to problems in my previous series fixing BDI lifetime 
> > issues.
> > Patch 4 fixes issues with reuse of BDI name with scsi devices. With it I 
> > cannot
> > reproduce the BDI name reuse issues using Omar's stress test using 
> > scsi_debug
> > so it can be used as a replacement of Dan's patches. Patches 5-11 fix oops 
> > that
> > is triggered by __blkdev_put() calling inode_detach_wb() too early (the 
> > problem
> > reported by Thiago). Patches 12 and 13 fix oops due to a bug in gendisk code
> > where get_gendisk() can return already freed gendisk structure (again 
> > triggered
> > by Omar's stress test).
> > 
> > People, please have a look at patches. They are mostly simple however the
> > interactions are rather complex so I may have missed something. Also I'm
> > happy for any additional testing these patches can get - I've stressed them
> > with Omar's script, tested memcg writeback, tested static (not udev managed)
> > device inodes.
> > 
> > Jens, I think at least patches 1-3 should go in together with my fixes you
> > already have in your tree (or shortly after them). It is up to you whether
> > you decide to delay my first fixes or pick these up quickly. Patch 4 is
> > (IMHO a cleaner) replacement of Dan's patches so consider whether you want
> > to use it instead of those patches.
> 
> I have applied 1-3 to my for-linus branch, which will go in after
> the initial pull request has been pulled by Linus. Consider fixing up
> #4 so it applies, I like it.

OK, attached is patch 4 rebased on top of Linus' tree from today which
already has linux-block changes pulled in. I've left put_disk_devt() in
blk_cleanup_queue() to maintain the logic in the original patch (now commit
0dba1314d4f8) that request_queue and gendisk each hold one devt reference.
The bdi_unregister() call that is moved to del_gendisk() by this patch is
now protected by the gendisk reference instead of the request_queue one,
so it still maintains the property that the devt reference protects the bdi
registration-unregistration lifetime (even though that is not strictly
needed anymore after this patch).

I have also updated the comment in the code and the changelog - they were
somewhat stale after changes to the whole series Tejun suggested.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
From 9abe9565c83af6b653b159a7bf5b895aff638c65 Mon Sep 17 00:00:00 2001
From: Jan Kara <j...@suse.cz>
Date: Wed, 8 Feb 2017 08:05:56 +0100
Subject: [PATCH] block: Move bdi_unregister() to del_gendisk()

Commit 6cd18e711dd8 "block: destroy bdi before blockdev is
unregistered." moved bdi unregistration (at that time through
bdi_destroy()) from blk_release_queue() to blk_cleanup_queue() because
it needs to happen before blk_unregister_region() call in del_gendisk()
for MD. SCSI though will free up the device number from sd_remove()
called through a maze of callbacks from device_del() in
__scsi_remove_device() before blk_cleanup_queue() and thus similar races
as described in 6cd18e711dd8 can happen for SCSI as well as reported by
Omar [1].

Moving bdi_unregister() to del_gendisk() works for MD and fixes the
problem for SCSI since del_gendisk() gets called from sd_remove() before
freeing the device number.

This also makes device_add_disk() (calling bdi_register_owner()) more
symmetric with del_gendisk().

[1] http://marc.info/?l=linux-block&m=148554717109098&w=2

Tested-by: Lekshmi Pillai <lekshmicpil...@in.ibm.com>
Acked-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/blk-core.c | 1 -
 block/genhd.c| 5 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b9e857f4afe8..1086dac8724c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -578,7 +578,6 @@ void blk_cleanup_queue(struct request_queue *q)
 		q->queue_lock = &q->__queue_lock;
 	spin_unlock_irq(lock);
 
-	bdi_unregister(q->backing_dev_info);
 	put_disk_devt(q->disk_devt);
 
 	/* @q is and will stay empty, shutdown and put */
diff --git a/block/genhd.c b/block/genhd.c
index 2f444b87a5f2..b26a5ea115d0 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -681,6 +681,11 @@ void del_gendisk(struct gendisk *disk)
 	disk->flags &= ~GENHD_FL_UP;
 
 	sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
+	/*
+	 * Unregister bdi before releasing device numbers (as they can get
+	 * reused and we'd get clashes in sysfs).
+	 */
+	bdi_unregister(disk->queue->backing_dev_info);
 	blk_unregister_queue(disk);
 	blk_unregister_region(disk_devt(disk), disk->minors);
 
-- 
2.10.2



Re: [PATCH 08/10] block: Fix oops in locked_inode_to_wb_and_lock_list()

2017-02-20 Thread Jan Kara
On Sun 12-02-17 13:40:27, Tejun Heo wrote:
> Hello, Jan.
> 
> On Thu, Feb 09, 2017 at 01:44:31PM +0100, Jan Kara wrote:
> > When block device is closed, we call inode_detach_wb() in __blkdev_put()
> > which sets inode->i_wb to NULL. That is contrary to expectations that
> > inode->i_wb stays valid once set during the whole inode's lifetime and
> > leads to oops in wb_get() in locked_inode_to_wb_and_lock_list() because
> > inode_to_wb() returned NULL.
> > 
> > The reason why we called inode_detach_wb() is not valid anymore though.
> > BDI is guaranteed to stay along until we call bdi_put() from
> > bdev_evict_inode() so we can postpone calling inode_detach_wb() to that
> > moment. A complication is that i_wb can point to non-root wb_writeback
> > structure and in that case we do need to clean it up as bdi_unregister()
> > blocks waiting for all non-root wb_writeback references to get dropped.
> > Thus this i_wb reference could block device removal e.g. from
> > __scsi_remove_device() (which indirectly ends up calling
> > bdi_unregister()). We cannot rely on block device inode to go away soon
> > (and thus i_wb reference to get dropped) as the device may got
> > hot-removed e.g. under a mounted filesystem. We deal with these issues
> > by switching block device inode from non-root wb_writeback structure to
> > bdi->wb when needed.  Since this is rather expensive (requires
> > synchronize_rcu()) we do the switching only in del_gendisk() when we
> > know the device is going away.
> 
> So, the only reason cgwb_bdi_destroy() is synchronous is because bdi
> destruction was synchronous.  Now that bdi is properly reference
> counted and can be decoupled from gendisk / q destruction, I can't
> think of a reason to keep cgwb destruction synchronous.  Switching
> wb's on destruction is kinda clumsy and it almost always hurts to
> expose synchronize_rcu() in userland visible paths.
> 
> Wouldn't something like the following work?
> 
> * Remove bdi->usage_cnt and the synchronous waiting in
>   cgwb_bdi_destroy().
> 
> * Instead, make cgwb's hold bdi->refcnt and put it from
>   cgwb_release_workfn().
> 
> Then, we don't have to switch during shutdown and can just let things
> drain.

At first sight this looks workable and would mean less special code so I
like it. I'll experiment with it and see how it works out.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 01/10] block: Move bdev_unhash_inode() after invalidate_partition()

2017-02-20 Thread Jan Kara
On Sun 12-02-17 12:58:33, Tejun Heo wrote:
> On Thu, Feb 09, 2017 at 01:44:24PM +0100, Jan Kara wrote:
> > Move bdev_unhash_inode() after invalidate_partition() as
> > invalidate_partition() looks up bdev and will unnecessarily recreate it
> > if bdev_unhash_inode() destroyed it. Also use part_devt() when calling
> > bdev_unhash_inode() instead of manually creating the device number.
> > 
> > Signed-off-by: Jan Kara <j...@suse.cz>
> 
> Acked-by: Tejun Heo <t...@kernel.org>
> 
> > @@ -648,9 +648,8 @@ void del_gendisk(struct gendisk *disk)
> > disk_part_iter_init(&piter, disk,
> >  DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
> > while ((part = disk_part_iter_next(&piter))) {
> > -   bdev_unhash_inode(MKDEV(disk->major,
> > -   disk->first_minor + part->partno));
> > invalidate_partition(disk, part->partno);
> > +   bdev_unhash_inode(part_devt(part));
> > delete_partition(disk, part->partno);
> 
> So, before this patch, invalidate_partition() would have operated on a
> newly created inode and thus wouldn't have actually invalidated the
> existing bdev / mapping, right?

Yes, that is another effect. I'll add a note about this to the changelog.
Thanks for noting this.

Honza

-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


[PATCH 02/13] block: Unhash also block device inode for the whole device

2017-02-21 Thread Jan Kara
Iteration over partitions in del_gendisk() omits part0. Add a
bdev_unhash_inode() call for the whole device. Otherwise, if the device
number gets reused, the bdev inode will still be associated with the old
(stale) bdi.

Tested-by: Lekshmi Pillai <lekshmicpil...@in.ibm.com>
Acked-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/genhd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/genhd.c b/block/genhd.c
index 6cb9f3a34a92..f6c4d4400759 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -655,6 +655,7 @@ void del_gendisk(struct gendisk *disk)
	disk_part_iter_exit(&piter);
 
invalidate_partition(disk, 0);
+   bdev_unhash_inode(disk_devt(disk));
set_capacity(disk, 0);
disk->flags &= ~GENHD_FL_UP;
 
-- 
2.10.2



[PATCH 11/13] block: Fix oops in locked_inode_to_wb_and_lock_list()

2017-02-21 Thread Jan Kara
When block device is closed, we call inode_detach_wb() in __blkdev_put()
which sets inode->i_wb to NULL. That is contrary to expectations that
inode->i_wb stays valid once set during the whole inode's lifetime and
leads to oops in wb_get() in locked_inode_to_wb_and_lock_list() because
inode_to_wb() returned NULL.

The reason why we called inode_detach_wb() is not valid anymore though.
BDI is guaranteed to stay around until we call bdi_put() from
bdev_evict_inode() so we can postpone calling inode_detach_wb() to that
moment.

Also add a warning to catch if someone uses inode_detach_wb() in a
dangerous way.

Reported-by: Thiago Jung Bauermann <bauer...@linux.vnet.ibm.com>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/block_dev.c| 8 ++--
 include/linux/writeback.h | 1 +
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 68e855fdce58..a0a8a000bdde 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -884,6 +884,8 @@ static void bdev_evict_inode(struct inode *inode)
	spin_lock(&bdev_lock);
	list_del_init(&bdev->bd_list);
	spin_unlock(&bdev_lock);
+   /* Detach inode from wb early as bdi_put() may free bdi->wb */
+   inode_detach_wb(inode);
if (bdev->bd_bdi != _backing_dev_info)
bdi_put(bdev->bd_bdi);
 }
@@ -1874,12 +1876,6 @@ static void __blkdev_put(struct block_device *bdev, 
fmode_t mode, int for_part)
kill_bdev(bdev);
 
bdev_write_inode(bdev);
-   /*
-* Detaching bdev inode from its wb in __destroy_inode()
-* is too late: the queue which embeds its bdi (along with
-* root wb) can be gone as soon as we put_disk() below.
-*/
-   inode_detach_wb(bdev->bd_inode);
}
if (bdev->bd_contains == bdev) {
if (disk->fops->release)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 5527d910ba3d..f1af3f67ce5a 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -237,6 +237,7 @@ static inline void inode_attach_wb(struct inode *inode, 
struct page *page)
 static inline void inode_detach_wb(struct inode *inode)
 {
if (inode->i_wb) {
+   WARN_ON_ONCE(!(inode->i_state & I_CLEAR));
wb_put(inode->i_wb);
inode->i_wb = NULL;
}
-- 
2.10.2



[PATCH 06/13] bdi: Make wb->bdi a proper reference

2017-02-21 Thread Jan Kara
Make wb->bdi a proper refcounted reference to bdi for all bdi_writeback
structures except for the one embedded inside struct backing_dev_info.
That will allow us to simplify bdi unregistration.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 mm/backing-dev.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c324eae17f0d..d7aaf2517c30 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -294,6 +294,8 @@ static int wb_init(struct bdi_writeback *wb, struct 
backing_dev_info *bdi,
 
memset(wb, 0, sizeof(*wb));
 
+   if (wb != &bdi->wb)
+   bdi_get(bdi);
wb->bdi = bdi;
wb->last_old_flush = jiffies;
	INIT_LIST_HEAD(&wb->b_dirty);
@@ -314,8 +316,10 @@ static int wb_init(struct bdi_writeback *wb, struct 
backing_dev_info *bdi,
wb->dirty_sleep = jiffies;
 
wb->congested = wb_congested_get_create(bdi, blkcg_id, gfp);
-   if (!wb->congested)
-   return -ENOMEM;
+   if (!wb->congested) {
+   err = -ENOMEM;
+   goto out_put_bdi;
+   }
 
	err = fprop_local_init_percpu(&wb->completions, gfp);
if (err)
@@ -335,6 +339,9 @@ static int wb_init(struct bdi_writeback *wb, struct 
backing_dev_info *bdi,
	fprop_local_destroy_percpu(&wb->completions);
 out_put_cong:
wb_congested_put(wb->congested);
+out_put_bdi:
+   if (wb != &bdi->wb)
+   bdi_put(bdi);
return err;
 }
 
@@ -372,6 +379,8 @@ static void wb_exit(struct bdi_writeback *wb)
 
	fprop_local_destroy_percpu(&wb->completions);
wb_congested_put(wb->congested);
+   if (wb != &wb->bdi->wb)
+   bdi_put(wb->bdi);
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
-- 
2.10.2



[PATCH 07/13] bdi: Move removal from bdi->wb_list into wb_shutdown()

2017-02-21 Thread Jan Kara
Currently the removal from bdi->wb_list happens directly in
cgwb_release_workfn(). Move it to wb_shutdown() which is functionally
equivalent and more logical (the list is only used for distributing
writeback work among bdi_writeback structures). It will also allow us
to simplify writeback shutdown in cgwb_bdi_destroy().

Signed-off-by: Jan Kara <j...@suse.cz>
---
 mm/backing-dev.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index d7aaf2517c30..54b9e934eef4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -345,6 +345,8 @@ static int wb_init(struct bdi_writeback *wb, struct 
backing_dev_info *bdi,
return err;
 }
 
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb);
+
 /*
  * Remove bdi from the global list and shutdown any threads we have running
  */
@@ -358,6 +360,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
}
	spin_unlock_bh(&wb->work_lock);
 
+   cgwb_remove_from_bdi_list(wb);
/*
 * Drain work list and shutdown the delayed_work.  !WB_registered
 * tells wb_workfn() that @wb is dying and its work_list needs to
@@ -491,10 +494,6 @@ static void cgwb_release_workfn(struct work_struct *work)
release_work);
struct backing_dev_info *bdi = wb->bdi;
 
-   spin_lock_irq(&cgwb_lock);
-   list_del_rcu(&wb->bdi_node);
-   spin_unlock_irq(&cgwb_lock);
-
wb_shutdown(wb);
 
css_put(wb->memcg_css);
@@ -526,6 +525,13 @@ static void cgwb_kill(struct bdi_writeback *wb)
percpu_ref_kill(>refcnt);
 }
 
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb)
+{
+   spin_lock_irq(&cgwb_lock);
+   list_del_rcu(&wb->bdi_node);
+   spin_unlock_irq(&cgwb_lock);
+}
+
 static int cgwb_create(struct backing_dev_info *bdi,
   struct cgroup_subsys_state *memcg_css, gfp_t gfp)
 {
@@ -776,6 +782,8 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
return 0;
 }
 
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb) { }
+
 static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
 
 #endif /* CONFIG_CGROUP_WRITEBACK */
-- 
2.10.2



[PATCH 08/13] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()

2017-02-21 Thread Jan Kara
Currently we waited for all cgwbs to get freed in cgwb_bdi_destroy(),
which also meant that writeback had been shut down on them. Since this
wait is going away, directly shut down writeback on cgwbs from
cgwb_bdi_destroy() to avoid live writeback structures after
bdi_unregister() has finished.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 mm/backing-dev.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 54b9e934eef4..c9623b410170 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -700,6 +700,7 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
struct radix_tree_iter iter;
struct rb_node *rbn;
void **slot;
+   struct bdi_writeback *wb;
 
	WARN_ON(test_bit(WB_registered, &bdi->wb.state));
 
@@ -716,6 +717,14 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
congested->__bdi = NULL;/* mark @congested unlinked */
}
 
+   while (!list_empty(&bdi->wb_list)) {
+   wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
+ bdi_node);
+   spin_unlock_irq(&cgwb_lock);
+   wb_shutdown(wb);
+   spin_lock_irq(&cgwb_lock);
+   }
+
	spin_unlock_irq(&cgwb_lock);
 
/*
-- 
2.10.2



[PATCH 12/13] kobject: Export kobject_get_unless_zero()

2017-02-21 Thread Jan Kara
Make the function available for outside use and fortify it against NULL
kobject.

CC: Greg Kroah-Hartman <gre...@linuxfoundation.org>
Acked-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 include/linux/kobject.h | 2 ++
 lib/kobject.c   | 5 -
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index e6284591599e..ca85cb80e99a 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -108,6 +108,8 @@ extern int __must_check kobject_rename(struct kobject *, 
const char *new_name);
 extern int __must_check kobject_move(struct kobject *, struct kobject *);
 
 extern struct kobject *kobject_get(struct kobject *kobj);
+extern struct kobject * __must_check kobject_get_unless_zero(
+   struct kobject *kobj);
 extern void kobject_put(struct kobject *kobj);
 
 extern const void *kobject_namespace(struct kobject *kobj);
diff --git a/lib/kobject.c b/lib/kobject.c
index 445dcaeb0f56..763d70a18941 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -601,12 +601,15 @@ struct kobject *kobject_get(struct kobject *kobj)
 }
 EXPORT_SYMBOL(kobject_get);
 
-static struct kobject * __must_check kobject_get_unless_zero(struct kobject 
*kobj)
+struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
 {
+   if (!kobj)
+   return NULL;
	if (!kref_get_unless_zero(&kobj->kref))
kobj = NULL;
return kobj;
 }
+EXPORT_SYMBOL(kobject_get_unless_zero);
 
 /*
  * kobject_cleanup - free kobject resources.
-- 
2.10.2



[PATCH 01/13] block: Move bdev_unhash_inode() after invalidate_partition()

2017-02-21 Thread Jan Kara
Move bdev_unhash_inode() after invalidate_partition() as
invalidate_partition() looks up the bdev and cannot find the right bdev
inode after bdev_unhash_inode() has been called. Thus invalidate_partition()
would not invalidate the page cache of the previously used bdev. Also use
part_devt() when calling bdev_unhash_inode() instead of manually
creating the device number.

Tested-by: Lekshmi Pillai <lekshmicpil...@in.ibm.com>
Acked-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/genhd.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index d9ccd42f3675..6cb9f3a34a92 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -648,9 +648,8 @@ void del_gendisk(struct gendisk *disk)
	disk_part_iter_init(&piter, disk,
 DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
	while ((part = disk_part_iter_next(&piter))) {
-   bdev_unhash_inode(MKDEV(disk->major,
-   disk->first_minor + part->partno));
invalidate_partition(disk, part->partno);
+   bdev_unhash_inode(part_devt(part));
delete_partition(disk, part->partno);
}
	disk_part_iter_exit(&piter);
-- 
2.10.2



[PATCH 10/13] bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister()

2017-02-21 Thread Jan Kara
Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() as it gets called
from bdi_unregister() which is not necessarily called from bdi_destroy()
and thus the name is somewhat misleading.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 mm/backing-dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 31cdee91e826..b1b50cce3f36 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -687,7 +687,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
return ret;
 }
 
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 {
struct radix_tree_iter iter;
struct rb_node *rbn;
@@ -777,7 +777,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 
 static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb) { }
 
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi) { }
 
 #endif /* CONFIG_CGROUP_WRITEBACK */
 
@@ -885,7 +885,7 @@ void bdi_unregister(struct backing_dev_info *bdi)
/* make sure nobody finds us on the bdi_list anymore */
bdi_remove_from_list(bdi);
	wb_shutdown(&bdi->wb);
-   cgwb_bdi_destroy(bdi);
+   cgwb_bdi_unregister(bdi);
 
if (bdi->dev) {
bdi_debug_unregister(bdi);
-- 
2.10.2



[PATCH 0/13 v2] block: Fix block device shutdown related races

2017-02-21 Thread Jan Kara
Hello,

this is a second revision of the patch set to fix several different races and
issues I've found when testing device shutdown and reuse. The first three
patches are fixes to problems in my previous series fixing BDI lifetime issues.
Patch 4 fixes issues with reuse of BDI name with scsi devices. With it I cannot
reproduce the BDI name reuse issues using Omar's stress test using scsi_debug
so it can be used as a replacement of Dan's patches. Patches 5-11 fix oops that
is triggered by __blkdev_put() calling inode_detach_wb() too early (the problem
reported by Thiago). Patches 12 and 13 fix oops due to a bug in gendisk code
where get_gendisk() can return already freed gendisk structure (again triggered
by Omar's stress test).

People, please have a look at patches. They are mostly simple however the
interactions are rather complex so I may have missed something. Also I'm
happy for any additional testing these patches can get - I've stressed them
with Omar's script, tested memcg writeback, tested static (not udev managed)
device inodes.

Jens, I think at least patches 1-3 should go in together with my fixes you
already have in your tree (or shortly after them). It is up to you whether
you decide to delay my first fixes or pick these up quickly. Patch 4 is
(IMHO a cleaner) replacement of Dan's patches so consider whether you want
to use it instead of those patches.

Changes since v1:
* Added Acks and Tested-by tags for patches in areas that did not change
* Reworked inode_detach_wb() related fixes based on Tejun's feedback

Honza


[PATCH 13/13] block: Fix oops scsi_disk_get()

2017-02-21 Thread Jan Kara
When device open races with device shutdown, we can get the following
oops in scsi_disk_get():

[11863.044351] general protection fault:  [#1] SMP
[11863.045561] Modules linked in: scsi_debug xfs libcrc32c netconsole btrfs 
raid6_pq zlib_deflate lzo_compress xor [last unloaded: loop]
[11863.047853] CPU: 3 PID: 13042 Comm: hald-probe-stor Tainted: G W  
4.10.0-rc2-xen+ #35
[11863.048030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[11863.048030] task: 88007f438200 task.stack: c9fd
[11863.048030] RIP: 0010:scsi_disk_get+0x43/0x70
[11863.048030] RSP: 0018:c9fd3a08 EFLAGS: 00010202
[11863.048030] RAX: 6b6b6b6b6b6b6b6b RBX: 88007f56d000 RCX: 
[11863.048030] RDX: 0001 RSI: 0004 RDI: 81a8d880
[11863.048030] RBP: c9fd3a18 R08:  R09: 0001
[11863.059217] R10:  R11:  R12: fffa
[11863.059217] R13: 880078872800 R14: 880070915540 R15: 001d
[11863.059217] FS:  7f2611f71800() GS:88007f0c() 
knlGS:
[11863.059217] CS:  0010 DS:  ES:  CR0: 80050033
[11863.059217] CR2: 0060e048 CR3: 778d4000 CR4: 06e0
[11863.059217] Call Trace:
[11863.059217]  ? disk_get_part+0x22/0x1f0
[11863.059217]  sd_open+0x39/0x130
[11863.059217]  __blkdev_get+0x69/0x430
[11863.059217]  ? bd_acquire+0x7f/0xc0
[11863.059217]  ? bd_acquire+0x96/0xc0
[11863.059217]  ? blkdev_get+0x350/0x350
[11863.059217]  blkdev_get+0x126/0x350
[11863.059217]  ? _raw_spin_unlock+0x2b/0x40
[11863.059217]  ? bd_acquire+0x7f/0xc0
[11863.059217]  ? blkdev_get+0x350/0x350
[11863.059217]  blkdev_open+0x65/0x80
...

As you can see, the RAX value is already poisoned, showing that the gendisk
we got is already freed. The problem is that get_gendisk() looks up the
device number in ext_devt_idr and then does get_disk(), which does
kobject_get() on the disk's kobject. However the disk gets removed from
ext_devt_idr only in disk_release() (through blk_free_devt()), at which
moment it already has a 0 refcount and is already on its way to be freed.
Indeed we've got a warning from kobject_get() about the 0 refcount shortly
before the oops.

We fix the problem by using kobject_get_unless_zero() in get_disk() so
that get_disk() cannot get a reference on a disk that is already being
freed.

Tested-by: Lekshmi Pillai <lekshmicpil...@in.ibm.com>
Acked-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/genhd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index 68c613edb93a..2baacfea7b5e 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1350,7 +1350,7 @@ struct kobject *get_disk(struct gendisk *disk)
owner = disk->fops->owner;
if (owner && !try_module_get(owner))
return NULL;
-   kobj = kobject_get(&disk_to_dev(disk)->kobj);
+   kobj = kobject_get_unless_zero(&disk_to_dev(disk)->kobj);
if (kobj == NULL) {
module_put(owner);
return NULL;
-- 
2.10.2



[PATCH 09/13] bdi: Do not wait for cgwbs release in bdi_unregister()

2017-02-21 Thread Jan Kara
Currently we wait for all cgwbs to get released in cgwb_bdi_destroy()
(called from bdi_unregister()). That is however unnecessary now when
cgwb->bdi is a proper refcounted reference (thus bdi cannot get
released before all cgwbs are released) and when cgwb_bdi_destroy()
shuts down writeback directly.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 include/linux/backing-dev-defs.h |  1 -
 mm/backing-dev.c | 18 +-
 2 files changed, 1 insertion(+), 18 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8fb3dcdebc80..7bd5ba9890b0 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -163,7 +163,6 @@ struct backing_dev_info {
 #ifdef CONFIG_CGROUP_WRITEBACK
struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
struct rb_root cgwb_congested_tree; /* their congested states */
-   atomic_t usage_cnt; /* counts both cgwbs and cgwb_contested's */
 #else
struct bdi_writeback_congested *wb_congested;
 #endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c9623b410170..31cdee91e826 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -393,11 +393,9 @@ static void wb_exit(struct bdi_writeback *wb)
 /*
  * cgwb_lock protects bdi->cgwb_tree, bdi->cgwb_congested_tree,
  * blkcg->cgwb_list, and memcg->cgwb_list.  bdi->cgwb_tree is also RCU
- * protected.  cgwb_release_wait is used to wait for the completion of cgwb
- * releases from bdi destruction path.
+ * protected.
  */
 static DEFINE_SPINLOCK(cgwb_lock);
-static DECLARE_WAIT_QUEUE_HEAD(cgwb_release_wait);
 
 /**
  * wb_congested_get_create - get or create a wb_congested
@@ -492,7 +490,6 @@ static void cgwb_release_workfn(struct work_struct *work)
 {
struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
release_work);
-   struct backing_dev_info *bdi = wb->bdi;
 
wb_shutdown(wb);
 
@@ -503,9 +500,6 @@ static void cgwb_release_workfn(struct work_struct *work)
percpu_ref_exit(>refcnt);
wb_exit(wb);
kfree_rcu(wb, rcu);
-
-   if (atomic_dec_and_test(&bdi->usage_cnt))
-   wake_up_all(&cgwb_release_wait);
 }
 
 static void cgwb_release(struct percpu_ref *refcnt)
@@ -595,7 +589,6 @@ static int cgwb_create(struct backing_dev_info *bdi,
/* we might have raced another instance of this function */
	ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb);
if (!ret) {
-   atomic_inc(&bdi->usage_cnt);
	list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list);
	list_add(&wb->memcg_node, memcg_cgwb_list);
	list_add(&wb->blkcg_node, blkcg_cgwb_list);
@@ -685,7 +678,6 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 
	INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
bdi->cgwb_congested_tree = RB_ROOT;
-   atomic_set(&bdi->usage_cnt, 1);
 
	ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
if (!ret) {
@@ -726,14 +718,6 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
}
 
	spin_unlock_irq(&cgwb_lock);
-
-   /*
-* All cgwb's and their congested states must be shutdown and
-* released before returning.  Drain the usage counter to wait for
-* all cgwb's and cgwb_congested's ever created on @bdi.
-*/
-   atomic_dec(&bdi->usage_cnt);
-   wait_event(cgwb_release_wait, !atomic_read(&bdi->usage_cnt));
 }
 
 /**
-- 
2.10.2



[PATCH 03/13] block: Revalidate i_bdev reference in bd_aquire()

2017-02-21 Thread Jan Kara
When a device gets removed, the block device inode is unhashed so that it is
not used anymore (bdget() will not find it anymore). Later, when a new device
gets created with the same device number, we create a new block device
inode. However there may be file system device inodes whose i_bdev still
points to the original block device inode and thus we get two active
block device inodes for the same device. They will share the same
gendisk so the only visible differences will be that their page caches will
not be coherent and their BDIs will be different (the old block device inode
still points to an unregistered BDI).

Fix the problem by checking in bd_acquire() whether i_bdev still points
to an active block device inode and re-looking up the block device if not.
That way any open of a block device happening after the old device has been
removed will get the correct block device inode.

Tested-by: Lekshmi Pillai <lekshmicpil...@in.ibm.com>
Acked-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/block_dev.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 601b71b76d7f..68e855fdce58 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1043,13 +1043,22 @@ static struct block_device *bd_acquire(struct inode 
*inode)
 
	spin_lock(&bdev_lock);
bdev = inode->i_bdev;
-   if (bdev) {
+   if (bdev && !inode_unhashed(bdev->bd_inode)) {
bdgrab(bdev);
	spin_unlock(&bdev_lock);
return bdev;
}
	spin_unlock(&bdev_lock);
 
+   /*
+* i_bdev references block device inode that was already shut down
+* (corresponding device got removed).  Remove the reference and look
+* up block device inode again just in case new device got
+* reestablished under the same device number.
+*/
+   if (bdev)
+   bd_forget(inode);
+
bdev = bdget(inode->i_rdev);
if (bdev) {
	spin_lock(&bdev_lock);
-- 
2.10.2



[PATCH 09/10] kobject: Export kobject_get_unless_zero()

2017-02-09 Thread Jan Kara
Make the function available for outside use and fortify it against NULL
kobject.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 include/linux/kobject.h | 2 ++
 lib/kobject.c   | 5 -
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index e6284591599e..ca85cb80e99a 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -108,6 +108,8 @@ extern int __must_check kobject_rename(struct kobject *, 
const char *new_name);
 extern int __must_check kobject_move(struct kobject *, struct kobject *);
 
 extern struct kobject *kobject_get(struct kobject *kobj);
+extern struct kobject * __must_check kobject_get_unless_zero(
+   struct kobject *kobj);
 extern void kobject_put(struct kobject *kobj);
 
 extern const void *kobject_namespace(struct kobject *kobj);
diff --git a/lib/kobject.c b/lib/kobject.c
index 445dcaeb0f56..763d70a18941 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -601,12 +601,15 @@ struct kobject *kobject_get(struct kobject *kobj)
 }
 EXPORT_SYMBOL(kobject_get);
 
-static struct kobject * __must_check kobject_get_unless_zero(struct kobject 
*kobj)
+struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
 {
+   if (!kobj)
+   return NULL;
	if (!kref_get_unless_zero(&kobj->kref))
kobj = NULL;
return kobj;
 }
+EXPORT_SYMBOL(kobject_get_unless_zero);
 
 /*
  * kobject_cleanup - free kobject resources.
-- 
2.10.2



[PATCH 10/10] block: Fix oops scsi_disk_get()

2017-02-09 Thread Jan Kara
When device open races with device shutdown, we can get the following
oops in scsi_disk_get():

[11863.044351] general protection fault:  [#1] SMP
[11863.045561] Modules linked in: scsi_debug xfs libcrc32c netconsole btrfs 
raid6_pq zlib_deflate lzo_compress xor [last unloaded: loop]
[11863.047853] CPU: 3 PID: 13042 Comm: hald-probe-stor Tainted: G W  
4.10.0-rc2-xen+ #35
[11863.048030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[11863.048030] task: 88007f438200 task.stack: c9fd
[11863.048030] RIP: 0010:scsi_disk_get+0x43/0x70
[11863.048030] RSP: 0018:c9fd3a08 EFLAGS: 00010202
[11863.048030] RAX: 6b6b6b6b6b6b6b6b RBX: 88007f56d000 RCX: 
[11863.048030] RDX: 0001 RSI: 0004 RDI: 81a8d880
[11863.048030] RBP: c9fd3a18 R08:  R09: 0001
[11863.059217] R10:  R11:  R12: fffa
[11863.059217] R13: 880078872800 R14: 880070915540 R15: 001d
[11863.059217] FS:  7f2611f71800() GS:88007f0c() 
knlGS:
[11863.059217] CS:  0010 DS:  ES:  CR0: 80050033
[11863.059217] CR2: 0060e048 CR3: 778d4000 CR4: 06e0
[11863.059217] Call Trace:
[11863.059217]  ? disk_get_part+0x22/0x1f0
[11863.059217]  sd_open+0x39/0x130
[11863.059217]  __blkdev_get+0x69/0x430
[11863.059217]  ? bd_acquire+0x7f/0xc0
[11863.059217]  ? bd_acquire+0x96/0xc0
[11863.059217]  ? blkdev_get+0x350/0x350
[11863.059217]  blkdev_get+0x126/0x350
[11863.059217]  ? _raw_spin_unlock+0x2b/0x40
[11863.059217]  ? bd_acquire+0x7f/0xc0
[11863.059217]  ? blkdev_get+0x350/0x350
[11863.059217]  blkdev_open+0x65/0x80
...

As you can see, the RAX value is already poisoned, showing that the gendisk
we got is already freed. The problem is that get_gendisk() looks up the
device number in ext_devt_idr and then does get_disk(), which does
kobject_get() on the disk's kobject. However the disk gets removed from
ext_devt_idr only in disk_release() (through blk_free_devt()), at which
moment it already has a 0 refcount and is already on its way to be freed.
Indeed we've got a warning from kobject_get() about the 0 refcount shortly
before the oops.

We fix the problem by using kobject_get_unless_zero() in get_disk() so
that get_disk() cannot get a reference on a disk that is already being
freed.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/genhd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index 721921a140cc..e3dbc311c323 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1350,7 +1350,7 @@ struct kobject *get_disk(struct gendisk *disk)
owner = disk->fops->owner;
if (owner && !try_module_get(owner))
return NULL;
-   kobj = kobject_get(&disk_to_dev(disk)->kobj);
+   kobj = kobject_get_unless_zero(&disk_to_dev(disk)->kobj);
if (kobj == NULL) {
module_put(owner);
return NULL;
-- 
2.10.2



[PATCH 06/10] writeback: Move __inode_wait_for_state_bit

2017-02-09 Thread Jan Kara
Move it up in fs/fs-writeback.c so that we don't have to use forward
declarations. No code change.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/fs-writeback.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c9770de11650..23dc97cf2a50 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -99,6 +99,21 @@ static inline struct inode *wb_inode(struct list_head *head)
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(wbc_writepage);
 
+/*
+ * Wait for bit in inode->i_state to clear. Called with i_lock held.
+ * Caller must make sure inode cannot go away when we drop i_lock.
+ */
+static void __inode_wait_for_state_bit(struct inode *inode, int bit_nr)
+   __releases(inode->i_lock)
+   __acquires(inode->i_lock)
+{
+   while (inode->i_state & (1 << bit_nr)) {
+   spin_unlock(&inode->i_lock);
+   wait_on_bit(&inode->i_state, bit_nr, TASK_UNINTERRUPTIBLE);
+   spin_lock(&inode->i_lock);
+   }
+}
+
 static bool wb_io_lists_populated(struct bdi_writeback *wb)
 {
if (wb_has_dirty_io(wb)) {
@@ -1170,21 +1185,6 @@ static int write_inode(struct inode *inode, struct 
writeback_control *wbc)
 }
 
 /*
- * Wait for bit in inode->i_state to clear. Called with i_lock held.
- * Caller must make sure inode cannot go away when we drop i_lock.
- */
-static void __inode_wait_for_state_bit(struct inode *inode, int bit_nr)
-   __releases(inode->i_lock)
-   __acquires(inode->i_lock)
-{
-   while (inode->i_state & (1 << bit_nr)) {
-   spin_unlock(&inode->i_lock);
-   wait_on_bit(&inode->i_state, bit_nr, TASK_UNINTERRUPTIBLE);
-   spin_lock(&inode->i_lock);
-   }
-}
-
-/*
  * Wait for writeback on an inode to complete. Caller must have inode pinned.
  */
 void inode_wait_for_writeback(struct inode *inode)
-- 
2.10.2



[PATCH 08/10] block: Fix oops in locked_inode_to_wb_and_lock_list()

2017-02-09 Thread Jan Kara
When a block device is closed, we call inode_detach_wb() in __blkdev_put()
which sets inode->i_wb to NULL. That is contrary to the expectation that
inode->i_wb stays valid once set during the whole inode's lifetime, and it
leads to an oops in wb_get() in locked_inode_to_wb_and_lock_list() because
inode_to_wb() returned NULL.

The reason why we called inode_detach_wb() is not valid anymore though.
The BDI is guaranteed to stay around until we call bdi_put() from
bdev_evict_inode() so we can postpone calling inode_detach_wb() to that
moment. A complication is that i_wb can point to a non-root bdi_writeback
structure and in that case we do need to clean it up, as bdi_unregister()
blocks waiting for all non-root bdi_writeback references to get dropped.
Thus this i_wb reference could block device removal e.g. from
__scsi_remove_device() (which indirectly ends up calling
bdi_unregister()). We cannot rely on the block device inode going away soon
(and thus the i_wb reference getting dropped) as the device may get
hot-removed e.g. under a mounted filesystem. We deal with these issues
by switching the block device inode from the non-root bdi_writeback
structure to bdi->wb when needed.  Since this is rather expensive (requires
synchronize_rcu()) we do the switching only in del_gendisk() when we
know the device is going away.

Also add a warning to catch if someone uses inode_detach_wb() in a
dangerous way.
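
To make the ordering change concrete, a simplified before/after sketch of
the lifetimes described above (an illustration, not code from the patch):

	/*
	 *  before:  __blkdev_put()
	 *               inode_detach_wb()           -> inode->i_wb = NULL
	 *           flusher still finds the bdev inode
	 *           locked_inode_to_wb_and_lock_list()
	 *               wb_get(inode_to_wb(inode))  -> NULL pointer oops
	 *
	 *  after:   del_gendisk()
	 *               bdev_cleanup_inode()        -> i_wb switched back to bdi->wb
	 *           bdev_evict_inode()
	 *               inode_detach_wb()           -> no i_wb user can remain
	 *               bdi_put(bdev->bd_bdi)
	 */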

Reported-by: Thiago Jung Bauermann <bauer...@linux.vnet.ibm.com>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/genhd.c |  4 ++--
 fs/block_dev.c| 11 ---
 include/linux/fs.h|  2 +-
 include/linux/writeback.h |  1 +
 4 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 68c613edb93a..721921a140cc 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -649,13 +649,13 @@ void del_gendisk(struct gendisk *disk)
 DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
	while ((part = disk_part_iter_next(&piter))) {
invalidate_partition(disk, part->partno);
-   bdev_unhash_inode(part_devt(part));
+   bdev_cleanup_inode(part_devt(part));
delete_partition(disk, part->partno);
}
	disk_part_iter_exit(&piter);
 
invalidate_partition(disk, 0);
-   bdev_unhash_inode(disk_devt(disk));
+   bdev_cleanup_inode(disk_devt(disk));
set_capacity(disk, 0);
disk->flags &= ~GENHD_FL_UP;
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 360439373a66..65ac3a60ac8e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -884,6 +884,8 @@ static void bdev_evict_inode(struct inode *inode)
	spin_lock(&bdev_lock);
	list_del_init(&bdev->bd_list);
	spin_unlock(&bdev_lock);
+   /* Detach inode from wb early as bdi_put() may free bdi->wb */
+   inode_detach_wb(inode);
	if (bdev->bd_bdi != &noop_backing_dev_info)
		bdi_put(bdev->bd_bdi);
 }
@@ -960,13 +962,14 @@ static LIST_HEAD(all_bdevs);
  * If there is a bdev inode for this device, unhash it so that it gets evicted
  * as soon as last inode reference is dropped.
  */
-void bdev_unhash_inode(dev_t dev)
+void bdev_cleanup_inode(dev_t dev)
 {
struct inode *inode;
 
	inode = ilookup5(blockdev_superblock, hash(dev), bdev_test, &dev);
if (inode) {
remove_inode_hash(inode);
+   inode_switch_to_default_wb_sync(inode);
iput(inode);
}
 }
@@ -1874,12 +1877,6 @@ static void __blkdev_put(struct block_device *bdev, 
fmode_t mode, int for_part)
kill_bdev(bdev);
 
bdev_write_inode(bdev);
-   /*
-* Detaching bdev inode from its wb in __destroy_inode()
-* is too late: the queue which embeds its bdi (along with
-* root wb) can be gone as soon as we put_disk() below.
-*/
-   inode_detach_wb(bdev->bd_inode);
}
if (bdev->bd_contains == bdev) {
if (disk->fops->release)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 319fb76f9081..f8c86b9c31d5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2344,7 +2344,7 @@ extern struct kmem_cache *names_cachep;
 #ifdef CONFIG_BLOCK
 extern int register_blkdev(unsigned int, const char *);
 extern void unregister_blkdev(unsigned int, const char *);
-extern void bdev_unhash_inode(dev_t dev);
+extern void bdev_cleanup_inode(dev_t dev);
 extern struct block_device *bdget(dev_t);
 extern struct block_device *bdgrab(struct block_device *bdev);
 extern void bd_set_size(struct block_device *, loff_t size);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 0d3ba83a0f7f..6d27b78c9a79 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -237,6 +237,7 @@ static inline void inode_attach_wb(struct inode *inode, 
struct page *page)
 static inline void inode

[PATCH 07/10] writeback: Implement reliable switching to default writeback structure

2017-02-09 Thread Jan Kara
Currently switching of an inode between different writeback structures is
asynchronous and not guaranteed to succeed. Add a variant of switching
that is synchronous and reliable so that it can reliably move an inode to
the default writeback structure (bdi->wb) when writeback on the bdi is going
to be shut down.
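
For context, the intended caller is the device shutdown path in patch 08 of
this series; a sketch of that usage, mirroring the bdev_cleanup_inode()
helper introduced there:

	void bdev_cleanup_inode(dev_t dev)
	{
		struct inode *inode;

		inode = ilookup5(blockdev_superblock, hash(dev), bdev_test, &dev);
		if (inode) {
			remove_inode_hash(inode);
			/* move i_wb back to bdi->wb before the bdi is shut down */
			inode_switch_to_default_wb_sync(inode);
			iput(inode);
		}
	}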

Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/fs-writeback.c | 60 ---
 include/linux/fs.h|  3 ++-
 include/linux/writeback.h |  6 +
 3 files changed, 60 insertions(+), 9 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 23dc97cf2a50..52992a1036b1 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -332,14 +332,11 @@ struct inode_switch_wbs_context {
struct work_struct  work;
 };
 
-static void inode_switch_wbs_work_fn(struct work_struct *work)
+static void do_inode_switch_wbs(struct inode *inode,
+   struct bdi_writeback *new_wb)
 {
-   struct inode_switch_wbs_context *isw =
-   container_of(work, struct inode_switch_wbs_context, work);
-   struct inode *inode = isw->inode;
struct address_space *mapping = inode->i_mapping;
struct bdi_writeback *old_wb = inode->i_wb;
-   struct bdi_writeback *new_wb = isw->new_wb;
struct radix_tree_iter iter;
bool switched = false;
void **slot;
@@ -436,15 +433,29 @@ static void inode_switch_wbs_work_fn(struct work_struct 
*work)
	spin_unlock(&new_wb->list_lock);
	spin_unlock(&old_wb->list_lock);
 
+   /*
+* Make sure waitqueue_active() check in wake_up_bit() cannot happen
+* before I_WB_SWITCH is cleared. Pairs with the barrier in
+* set_task_state() after wait_on_bit() added waiter to the wait queue.
+*/
+   smp_mb();
+   wake_up_bit(&inode->i_state, __I_WB_SWITCH);
+
if (switched) {
wb_wakeup(new_wb);
wb_put(old_wb);
}
-   wb_put(new_wb);
+}
 
-   iput(inode);
-   kfree(isw);
+static void inode_switch_wbs_work_fn(struct work_struct *work)
+{
+   struct inode_switch_wbs_context *isw =
+   container_of(work, struct inode_switch_wbs_context, work);
 
+   do_inode_switch_wbs(isw->inode, isw->new_wb);
+   wb_put(isw->new_wb);
+   iput(isw->inode);
+   kfree(isw);
	atomic_dec(&isw_nr_in_flight);
 }
 
@@ -521,6 +532,39 @@ static void inode_switch_wbs(struct inode *inode, int 
new_wb_id)
 }
 
 /**
+ * inode_switch_to_default_wb_sync - change the wb association of an inode to
+ * the default writeback structure synchronously
+ * @inode: target inode
+ *
+ * Switch @inode's wb association to the default writeback structure (bdi->wb).
+ * Unlike inode_switch_wbs() the switching is performed synchronously and we
+ * guarantee the inode is switched to the default writeback structure when this
+ * function returns. Nothing prevents someone else from switching the inode to
+ * another writeback structure just when we are done though. Preventing that is
+ * up to the caller if needed.
+ */
+void inode_switch_to_default_wb_sync(struct inode *inode)
+{
+   struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+   /* while holding I_WB_SWITCH, no one else can update the association */
+   spin_lock(&inode->i_lock);
+   if (WARN_ON_ONCE(inode->i_state & I_FREEING) ||
+   !inode_to_wb_is_valid(inode) || inode_to_wb(inode) == &bdi->wb) {
+   spin_unlock(&inode->i_lock);
+   return;
+   }
+   __inode_wait_for_state_bit(inode, __I_WB_SWITCH);
+   inode->i_state |= I_WB_SWITCH;
+   spin_unlock(&inode->i_lock);
+
+   /* Make I_WB_SWITCH setting visible to unlocked users of i_wb */
+   synchronize_rcu();
+
+   do_inode_switch_wbs(inode, &bdi->wb);
+}
+
+/**
  * wbc_attach_and_unlock_inode - associate wbc with target inode and unlock it
  * @wbc: writeback_control of interest
  * @inode: target inode
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c930cbc19342..319fb76f9081 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1929,7 +1929,8 @@ static inline bool HAS_UNMAPPED_ID(struct inode *inode)
 #define I_DIRTY_TIME   (1 << 11)
 #define __I_DIRTY_TIME_EXPIRED 12
 #define I_DIRTY_TIME_EXPIRED   (1 << __I_DIRTY_TIME_EXPIRED)
-#define I_WB_SWITCH(1 << 13)
+#define __I_WB_SWITCH  13
+#define I_WB_SWITCH(1 << __I_WB_SWITCH)
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 #define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 5527d910ba3d..0d3ba83a0f7f 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -280,6 +280,8 @@ static inline void wbc_init_bio(struct writeback_control 
*wbc, struct bio *bio)
bio_associate_blkcg(bio, wbc->wb->blkcg_css);
 

Re: [PATCH 03/10] block: Revalidate i_bdev reference in bd_acquire()

2017-02-09 Thread Jan Kara
On Thu 09-02-17 13:44:26, Jan Kara wrote:
> When a device gets removed, the block device inode is unhashed so that it is
> not used anymore (bdget() will not find it anymore). Later, when a new device
> gets created with the same device number, we create a new block device
> inode. However, there may be filesystem device inodes whose i_bdev still
> points to the original block device inode and thus we get two active
> block device inodes for the same device. They will share the same
> gendisk so the only visible differences will be that page caches will
> not be coherent and the BDIs will be different (the old block device inode
> still points to an unregistered BDI).
> 
> Fix the problem by checking in bd_acquire() whether i_bdev still points
> to an active block device inode and looking the block device up again if
> not. That way any open of a block device happening after the old device has
> been removed will get the correct block device inode.

Thiago spotted a stupid bug in this patch (calling bd_forget() on the bdev
instead of the inode). A fixed version is attached.

    Honza
> 
> Signed-off-by: Jan Kara <j...@suse.cz>
> ---
>  fs/block_dev.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 601b71b76d7f..360439373a66 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1043,13 +1043,22 @@ static struct block_device *bd_acquire(struct inode 
> *inode)
>  
>   spin_lock(&bdev_lock);
>   bdev = inode->i_bdev;
> - if (bdev) {
> + if (bdev && !inode_unhashed(bdev->bd_inode)) {
>   bdgrab(bdev);
>   spin_unlock(&bdev_lock);
>   return bdev;
>   }
>   spin_unlock(&bdev_lock);
>  
> + /*
> +  * i_bdev references block device inode that was already shut down
> +  * (corresponding device got removed).  Remove the reference and look
> +  * up block device inode again just in case new device got
> +  * reestablished under the same device number.
> +  */
> + if (bdev)
> + bd_forget(bdev);
> +
>   bdev = bdget(inode->i_rdev);
>   if (bdev) {
>       spin_lock(&bdev_lock);
> -- 
> 2.10.2
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
>From aaf612333753b948a96aebe4a2f8066ed45ef164 Mon Sep 17 00:00:00 2001
From: Jan Kara <j...@suse.cz>
Date: Thu, 9 Feb 2017 12:16:30 +0100
Subject: [PATCH 03/10] block: Revalidate i_bdev reference in bd_acquire()

When a device gets removed, the block device inode is unhashed so that it is
not used anymore (bdget() will not find it anymore). Later, when a new device
gets created with the same device number, we create a new block device
inode. However, there may be filesystem device inodes whose i_bdev still
points to the original block device inode and thus we get two active
block device inodes for the same device. They will share the same
gendisk so the only visible differences will be that page caches will
not be coherent and the BDIs will be different (the old block device inode
still points to an unregistered BDI).

Fix the problem by checking in bd_acquire() whether i_bdev still points
to an active block device inode and looking the block device up again if
not. That way any open of a block device happening after the old device has
been removed will get the correct block device inode.
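
To make the failure mode concrete, a simplified timeline of how the stale
reference survives (illustration only, not code from the patch):

	/*
	 *   open("/dev/sdX")    bd_acquire() caches the bdev in inode->i_bdev
	 *   device removed      bdev inode unhashed, its BDI unregistered
	 *   same devt reused    bdget() creates a fresh bdev inode
	 *   open("/dev/sdX")    old code reused the cached, unhashed bdev,
	 *                       leaving two active bdev inodes for one devt
	 *
	 * The inode_unhashed() check below makes bd_acquire() drop the stale
	 * reference with bd_forget(inode) and look the device up again.
	 */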

Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/block_dev.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 601b71b76d7f..68e855fdce58 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1043,13 +1043,22 @@ static struct block_device *bd_acquire(struct inode *inode)
 
 	spin_lock(&bdev_lock);
 	bdev = inode->i_bdev;
-	if (bdev) {
+	if (bdev && !inode_unhashed(bdev->bd_inode)) {
 		bdgrab(bdev);
 		spin_unlock(&bdev_lock);
 		return bdev;
 	}
 	spin_unlock(&bdev_lock);
 
+	/*
+	 * i_bdev references block device inode that was already shut down
+	 * (corresponding device got removed).  Remove the reference and look
+	 * up block device inode again just in case new device got
+	 * reestablished under the same device number.
+	 */
+	if (bdev)
+		bd_forget(inode);
+
 	bdev = bdget(inode->i_rdev);
 	if (bdev) {
 		spin_lock(&bdev_lock);
-- 
2.10.2



Re: [PATCH 05/24] fs: Get proper reference for s_bdi

2017-02-09 Thread Jan Kara
On Thu 09-02-17 16:36:13, Boaz Harrosh wrote:
> On 02/02/2017 07:34 PM, Jan Kara wrote:
> > So far we just relied on block device to hold a bdi reference for us
> > while the filesystem is mounted. While that works perfectly fine, it is
> > a bit awkward that we have a pointer to a refcounted structure in the
> > superblock without proper reference. So make s_bdi hold a proper
> > reference to block device's BDI. No filesystem using mount_bdev()
> > actually changes s_bdi so this is safe and will make bdev filesystems
> > work the same way as filesystems needing to set up their private bdi.
> > 
> > Signed-off-by: Jan Kara <j...@suse.cz>
> > ---
> >  fs/super.c | 7 ++-
> >  1 file changed, 2 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/super.c b/fs/super.c
> > index 31dc4c6450ef..dfb95ccd4351 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -1047,12 +1047,9 @@ static int set_bdev_super(struct super_block *s, 
> > void *data)
> >  {
> > s->s_bdev = data;
> > s->s_dev = s->s_bdev->bd_dev;
> > +   s->s_bdi = bdi_get(s->s_bdev->bd_bdi);
> > +   s->s_iflags |= SB_I_DYNBDI;
> >  
> > -   /*
> > -* We set the bdi here to the queue backing, file systems can
> > -* overwrite this in ->fill_super()
> > -*/
> 
> Question: So I have an FS that uses mount_bdev but then goes and
> overrides sb->s_bdev in ->fill_super() anyway. This is for two
> reasons. One, because I have many more devices (like btrfs I'm
> multi-dev), but I like to use mount_bdev because of the somewhat delicate
> handling of automatic bind-mounts.
> 
> For me it is a bigger hack to get the ref-counting and bind-mounts
> locking correct than to bdi_put and say the new super_setup_bdi(sb) in
> fill_super. Would you expect problems?

No, that should work just fine.
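
For illustration only, a rough sketch of the approach being discussed. The
filesystem-side names are made up and the exact refcounting/flag handling
depends on the final shape of the series; the point is simply that a
mount_bdev() filesystem can drop the bdev-provided bdi and install its own
private one from ->fill_super():

	static int examplefs_fill_super(struct super_block *sb, void *data,
					int silent)
	{
		int err;

		/* drop the bdev bdi reference installed by set_bdev_super() */
		bdi_put(sb->s_bdi);
		sb->s_bdi = &noop_backing_dev_info;

		/* allocate, register and attach a private, refcounted bdi */
		err = super_setup_bdi(sb);
		if (err)
			return err;

		/* ... the usual fill_super work continues here ... */
		return 0;
	}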

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 0/10] block: Fix block device shutdown related races

2017-02-09 Thread Jan Kara
On Thu 09-02-17 12:52:47, Thiago Jung Bauermann wrote:
> Hello Jan,
> 
> Am Donnerstag, 9. Februar 2017, 13:44:23 BRST schrieb Jan Kara:
> > People, please have a look at patches. The are mostly simple however the
> > interactions are rather complex so I may have missed something. Also I'm
> > happy for any additional testing these patches can get - I've stressed them
> > with Omar's script, tested memcg writeback, tested static (not udev managed)
> > device inodes.
> 
> Thank you for these fixes. I will have them tested and report back how it 
> goes.
> 
> Can you tell which branch I should apply them on? I tried a number of 
> branches 
> in linux-block (and applied the bdi lifetime v3 patches if the branch didn't 
> already have them) but this series either didn't apply or the build failed 
> with:
> 
> /home/bauermann/trabalho/src/linux-2.6.git/fs/block_dev.c: In function 
> ‘bd_acquire’:
> /home/bauermann/trabalho/src/linux-2.6.git/fs/block_dev.c:1063:13: error: 
> passing argument 1 of ‘bd_forget’ from incompatible pointer type [-
> Werror=incompatible-pointer-types]
>bd_forget(bdev);
>  ^
> In file included from /home/bauermann/trabalho/src/linux-2.6.git/include/
> linux/device_cgroup.h:1:0,
>  from /home/bauermann/trabalho/src/linux-2.6.git/fs/
> block_dev.c:14:
> /home/bauermann/trabalho/src/linux-2.6.git/include/linux/fs.h:2351:13: note: 
> expected ‘struct inode *’ but argument is of type ‘struct block_device *’
>  extern void bd_forget(struct inode *inode);
>  ^
> cc1: some warnings being treated as errors

Indeed, I'm wondering how this could pass one of the tests I did... Hum.
Anyway, thanks for spotting this; attached is a fixed-up version of
patch 3.

I've pushed out a branch with all BDI patches I have accumulated to

git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git bdi

It includes filesystem-bdi cleanup patches as well on top of these fixes.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
>From aaf612333753b948a96aebe4a2f8066ed45ef164 Mon Sep 17 00:00:00 2001
From: Jan Kara <j...@suse.cz>
Date: Thu, 9 Feb 2017 12:16:30 +0100
Subject: [PATCH 03/10] block: Revalidate i_bdev reference in bd_acquire()

When a device gets removed, the block device inode is unhashed so that it is
not used anymore (bdget() will not find it anymore). Later, when a new device
gets created with the same device number, we create a new block device
inode. However, there may be filesystem device inodes whose i_bdev still
points to the original block device inode and thus we get two active
block device inodes for the same device. They will share the same
gendisk so the only visible differences will be that page caches will
not be coherent and the BDIs will be different (the old block device inode
still points to an unregistered BDI).

Fix the problem by checking in bd_acquire() whether i_bdev still points
to an active block device inode and looking the block device up again if
not. That way any open of a block device happening after the old device has
been removed will get the correct block device inode.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/block_dev.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 601b71b76d7f..68e855fdce58 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1043,13 +1043,22 @@ static struct block_device *bd_acquire(struct inode *inode)
 
 	spin_lock(&bdev_lock);
 	bdev = inode->i_bdev;
-	if (bdev) {
+	if (bdev && !inode_unhashed(bdev->bd_inode)) {
 		bdgrab(bdev);
 		spin_unlock(&bdev_lock);
 		return bdev;
 	}
 	spin_unlock(&bdev_lock);
 
+	/*
+	 * i_bdev references block device inode that was already shut down
+	 * (corresponding device got removed).  Remove the reference and look
+	 * up block device inode again just in case new device got
+	 * reestablished under the same device number.
+	 */
+	if (bdev)
+		bd_forget(inode);
+
 	bdev = bdget(inode->i_rdev);
 	if (bdev) {
 		spin_lock(&bdev_lock);
-- 
2.10.2



[PATCH 01/10] block: Move bdev_unhash_inode() after invalidate_partition()

2017-02-09 Thread Jan Kara
Move bdev_unhash_inode() after invalidate_partition() as
invalidate_partition() looks up bdev and will unnecessarily recreate it
if bdev_unhash_inode() destroyed it. Also use part_devt() when calling
bdev_unhash_inode() instead of manually creating the device number.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 block/genhd.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index d9ccd42f3675..6cb9f3a34a92 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -648,9 +648,8 @@ void del_gendisk(struct gendisk *disk)
	disk_part_iter_init(&piter, disk,
			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
	while ((part = disk_part_iter_next(&piter))) {
-   bdev_unhash_inode(MKDEV(disk->major,
-   disk->first_minor + part->partno));
invalidate_partition(disk, part->partno);
+   bdev_unhash_inode(part_devt(part));
delete_partition(disk, part->partno);
}
	disk_part_iter_exit(&piter);
-- 
2.10.2



[PATCH 05/10] writeback: Generalize and standardize I_SYNC waiting function

2017-02-09 Thread Jan Kara
__inode_wait_for_writeback() waits for I_SYNC on an inode to get cleared.
There's nothing specific regarding I_SYNC in that function. Generalize
it so that we can use it also for the I_WB_SWITCH bit. Also the function
uses __wait_on_bit() unnecessarily. Switch it to wait_on_bit() to remove
some code.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/fs-writeback.c | 17 ++---
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ef600591d96f..c9770de11650 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1170,21 +1170,16 @@ static int write_inode(struct inode *inode, struct 
writeback_control *wbc)
 }
 
 /*
- * Wait for writeback on an inode to complete. Called with i_lock held.
+ * Wait for bit in inode->i_state to clear. Called with i_lock held.
  * Caller must make sure inode cannot go away when we drop i_lock.
  */
-static void __inode_wait_for_writeback(struct inode *inode)
+static void __inode_wait_for_state_bit(struct inode *inode, int bit_nr)
__releases(inode->i_lock)
__acquires(inode->i_lock)
 {
-   DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
-   wait_queue_head_t *wqh;
-
-   wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
-   while (inode->i_state & I_SYNC) {
+   while (inode->i_state & (1 << bit_nr)) {
	spin_unlock(&inode->i_lock);
-   __wait_on_bit(wqh, &wq, bit_wait,
- TASK_UNINTERRUPTIBLE);
+   wait_on_bit(&inode->i_state, bit_nr, TASK_UNINTERRUPTIBLE);
	spin_lock(&inode->i_lock);
}
 }
@@ -1195,7 +1190,7 @@ static void __inode_wait_for_writeback(struct inode 
*inode)
 void inode_wait_for_writeback(struct inode *inode)
 {
	spin_lock(&inode->i_lock);
-   __inode_wait_for_writeback(inode);
+   __inode_wait_for_state_bit(inode, __I_SYNC);
	spin_unlock(&inode->i_lock);
 }
 
@@ -1397,7 +1392,7 @@ static int writeback_single_inode(struct inode *inode,
 * inode reference or inode has I_WILL_FREE set, it cannot go
 * away under us.
 */
-   __inode_wait_for_writeback(inode);
+   __inode_wait_for_state_bit(inode, __I_SYNC);
}
WARN_ON(inode->i_state & I_SYNC);
/*
-- 
2.10.2



Re: [PATCH 22/24] ubifs: Convert to separately allocated bdi

2017-02-09 Thread Jan Kara
On Wed 08-02-17 12:24:00, Richard Weinberger wrote:
> Am 02.02.2017 um 18:34 schrieb Jan Kara:
> > Allocate struct backing_dev_info separately instead of embedding it
> > inside the superblock. This unifies handling of bdi among users.
> > 
> > CC: Richard Weinberger <rich...@nod.at>
> > CC: Artem Bityutskiy <dedeki...@gmail.com>
> > CC: Adrian Hunter <adrian.hun...@intel.com>
> > CC: linux-...@lists.infradead.org
> > Signed-off-by: Jan Kara <j...@suse.cz>
> > ---
> >  fs/ubifs/super.c | 23 +++
> >  fs/ubifs/ubifs.h |  3 ---
> >  2 files changed, 7 insertions(+), 19 deletions(-)
> > 
> > diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
> > index e08aa04fc835..34810eb52b22 100644
> > --- a/fs/ubifs/super.c
> > +++ b/fs/ubifs/super.c
> > @@ -1827,7 +1827,6 @@ static void ubifs_put_super(struct super_block *sb)
> > }
> >  
> > ubifs_umount(c);
> > -   bdi_destroy(>bdi);
> > ubi_close_volume(c->ubi);
> > mutex_unlock(>umount_mutex);
> >  }
> > @@ -2019,29 +2018,23 @@ static int ubifs_fill_super(struct super_block *sb, 
> > void *data, int silent)
> > goto out;
> > }
> >  
> > +   err = ubifs_parse_options(c, data, 0);
> > +   if (err)
> > +   goto out_close;
> > +
> > /*
> >  * UBIFS provides 'backing_dev_info' in order to disable read-ahead. For
> >  * UBIFS, I/O is not deferred, it is done immediately in readpage,
> >  * which means the user would have to wait not just for their own I/O
> >  * but the read-ahead I/O as well i.e. completely pointless.
> >  *
> > -* Read-ahead will be disabled because @c->bdi.ra_pages is 0.
> > +* Read-ahead will be disabled because @sb->s_bdi->ra_pages is 0.
> >  */
> > -   c->bdi.name = "ubifs",
> > -   c->bdi.capabilities = 0;
> 
> So ->capabilities is now zero by default since you use __GFP_ZERO in
> bdi_alloc().
> At least for UBIFS I'll add a comment on this, otherwise it is not so
> clear that UBIFS wants a BDI with no capabilities and how it achieves that.

OK, I've modified the comment to:

 * Read-ahead will be disabled because @sb->s_bdi->ra_pages is 0. Also
 * @sb->s_bdi->capabilities are initialized to 0 so there won't be any
 * writeback happening.
 */


> Acked-by: Richard Weinberger <rich...@nod.at>

Thanks.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [lustre-devel] [PATCH 04/24] fs: Provide infrastructure for dynamic BDIs in filesystems

2017-02-09 Thread Jan Kara
> > @@ -1249,6 +1254,50 @@ mount_fs(struct file_system_type *type, int flags, 
> > const char *name, void *data)
> > }
> > 
> > /*
> > + * Setup private BDI for given superblock. I gets automatically cleaned up
> 
> (typo) s/I/It/
> 
> Looks fine otherwise.

Thanks, fixed.

        Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 07/10] writeback: Implement reliable switching to default writeback structure

2017-02-10 Thread Jan Kara
On Fri 10-02-17 13:19:44, NeilBrown wrote:
> On Thu, Feb 09 2017, Jan Kara wrote:
> 
> > Currently switching of inode between different writeback structures is
> > asynchronous and not guaranteed to succeed. Add a variant of switching
> > that is synchronous and reliable so that it can reliably move inode to
> > the default writeback structure (bdi->wb) when writeback on bdi is going
> > to be shutdown.
> >
> > Signed-off-by: Jan Kara <j...@suse.cz>
> > ---
> >  fs/fs-writeback.c | 60 
> > ---
> >  include/linux/fs.h|  3 ++-
> >  include/linux/writeback.h |  6 +
> >  3 files changed, 60 insertions(+), 9 deletions(-)
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 23dc97cf2a50..52992a1036b1 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -332,14 +332,11 @@ struct inode_switch_wbs_context {
> > struct work_struct  work;
> >  };
> >  
> > -static void inode_switch_wbs_work_fn(struct work_struct *work)
> > +static void do_inode_switch_wbs(struct inode *inode,
> > +   struct bdi_writeback *new_wb)
> >  {
> > -   struct inode_switch_wbs_context *isw =
> > -   container_of(work, struct inode_switch_wbs_context, work);
> > -   struct inode *inode = isw->inode;
> > struct address_space *mapping = inode->i_mapping;
> > struct bdi_writeback *old_wb = inode->i_wb;
> > -   struct bdi_writeback *new_wb = isw->new_wb;
> > struct radix_tree_iter iter;
> > bool switched = false;
> > void **slot;
> > @@ -436,15 +433,29 @@ static void inode_switch_wbs_work_fn(struct 
> > work_struct *work)
> > spin_unlock(&new_wb->list_lock);
> > spin_unlock(&old_wb->list_lock);
> >  
> > +   /*
> > +* Make sure waitqueue_active() check in wake_up_bit() cannot happen
> > +* before I_WB_SWITCH is cleared. Pairs with the barrier in
> > +* set_task_state() after wait_on_bit() added waiter to the wait queue.
> 
> I think you mean "set_current_state()" ??

Yes, I'll fix that.

> It's rather a trap for the unwary, this need for a smp_mb().
> Grepping for wake_up_bit(), I find quite a few places with barriers -
> sometimes clear_bit_unlock() or spin_unlock() - but
> 
> fs/block_dev.c- whole->bd_claiming = NULL;
fs/block_dev.c: wake_up_bit(&whole->bd_claiming, 0);
> 
fs/cifs/connect.c-  clear_bit(TCON_LINK_PENDING, &tlink->tl_flags);
fs/cifs/connect.c:  wake_up_bit(&tlink->tl_flags, TCON_LINK_PENDING);
> 
fs/cifs/misc.c- clear_bit(CIFS_INODE_PENDING_WRITERS, &cinode->flags);
fs/cifs/misc.c: wake_up_bit(&cinode->flags, CIFS_INODE_PENDING_WRITERS);
> 
> (several more in cifs)
> 
net/sunrpc/xprt.c-  clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
net/sunrpc/xprt.c-  xprt->ops->close(xprt);
net/sunrpc/xprt.c-  xprt_release_write(xprt, NULL);
net/sunrpc/xprt.c:  wake_up_bit(&xprt->state, XPRT_LOCKED);
> (there might be a barrier in ->close or xprt_release_write() I guess)
> 
security/keys/gc.c- clear_bit(KEY_GC_REAPING_KEYTYPE, &key_gc_flags);
security/keys/gc.c: wake_up_bit(&key_gc_flags, KEY_GC_REAPING_KEYTYPE);

Yup, the above look like bugs.

> I wonder if there is a good way to make this less error-prone.
> I would suggest that wake_up_bit() should always have a barrier, and
> __wake_up_bit() is needed to avoid it, but there is already a
> __wake_up_bit() with a slightly different interface.

Yeah, it is error-prone, as are all waitqueue_active() optimizations...
 
> In this case, you have a spin_unlock() just before the wake_up_bit().
> It is my understanding that it would provide enough of a barrier (all
> writes before are globally visible after), so do you really need
> the barrier here?

I believe I do. spin_unlock() is a semi-permeable barrier - i.e., any read
or write from "outside" can be moved inside. So the CPU is free to prefetch
values for the waitqueue_active() check before the spinlock is unlocked, or
even before the I_WB_SWITCH bit is cleared.
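
A minimal sketch of the waker/waiter pairing being discussed here (the
generic pattern, not the patch itself):

	/* waker: publish the state change, then wake */
	spin_lock(&inode->i_lock);
	inode->i_state &= ~I_WB_SWITCH;
	spin_unlock(&inode->i_lock);
	/*
	 * spin_unlock() is only a semi-permeable barrier, so the unlocked
	 * waitqueue_active() test inside wake_up_bit() could otherwise be
	 * speculated before the store clearing the bit.
	 */
	smp_mb();
	wake_up_bit(&inode->i_state, __I_WB_SWITCH);

	/*
	 * waiter: wait_on_bit() re-checks the bit after set_current_state()
	 * added it to the waitqueue, which provides the pairing barrier.
	 */
	wait_on_bit(&inode->i_state, __I_WB_SWITCH, TASK_UNINTERRUPTIBLE);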

> > +*/
> > +   smp_mb();
> > +   wake_up_bit(&inode->i_state, __I_WB_SWITCH);
> > +
> > if (switched) {
> > wb_wakeup(new_wb);
> > wb_put(old_wb);
> > }
> > -   wb_put(new_wb);
> > +}
> >  
> > -   iput(inode);
> > -   kfree(isw);
> > +static void inode_switch_wbs_work_fn(struct work_struct *work)
> > +{
> > +   struct inode_switch_wbs_context *isw =
> > +   container_of(work, struct

Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-18 Thread Jan Kara
On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> Your note on the online repair does raise another tangentially related
> topic. Currently, if there are badblocks, writes via the bio submission
> path will clear the error (if the hardware is able to remap the bad
> locations). However, if the filesystem is mounted eith DAX, even
> non-mmap operations - read() and write() will go through the dax paths
> (dax_do_io()). We haven't found a good/agreeable way to perform
> error-clearing in this case. So currently, if a dax mounted filesystem
> has badblocks, the only way to clear those badblocks is to mount it
> without DAX, and overwrite/zero the bad locations. This is a pretty
> terrible user experience, and I'm hoping this can be solved in a better
> way.

Please remind me, what is the problem with DAX code doing necessary work to
clear the error when it gets EIO from memcpy on write?

        Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-18 Thread Jan Kara
On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> I do mean that in the filesystem, for every IO, the badblocks will be
> checked. Currently, the pmem driver does this, and the hope is that the
> filesystem can do a better job at it. The driver unconditionally checks
> every IO for badblocks on the whole device. Depending on how the
> badblocks are represented in the filesystem, we might be able to quickly
> tell if a file/range has existing badblocks, and error out the IO
> accordingly.
> 
> At mount, the fs would read the existing badblocks on the block
> device, and build its own representation of them. Then during normal
> use, if the underlying badblocks change, the fs would get a notification
> that would allow it to also update its own representation.

So I believe we have to distinguish three cases so that we are on the same
page.

1) PMEM is exposed only via a block interface for legacy filesystems to
use. Here, all the bad blocks handling IMO must happen in the NVDIMM driver.
Looking from the outside, the IO either returns with EIO or succeeds. As a
result you cannot ever get rid of bad blocks handling in the NVDIMM driver.

2) PMEM is exposed for a DAX-aware filesystem. This seems to be what you are
mostly interested in. We could possibly do something more efficient than
what the NVDIMM driver does, however the complexity would be relatively high
and frankly I'm far from convinced this is really worth it. If there are so
many badblocks that this would matter, the HW has IMHO bigger problems than
performance.

3) PMEM filesystem - there things are even more difficult, as was already
noted elsewhere in the thread. But for now I'd like to leave those aside so
as not to complicate things too much.

Now my question: Why do we bother with badblocks at all? In cases 1) and 2),
if the platform can recover from MCE, we can just always access persistent
memory using memcpy_mcsafe(); if that fails, return -EIO. Actually that
seems to already happen, so we just need to make sure all places handle
returned errors properly (e.g. fs/dax.c does not seem to) and we are done.
No need for a bad blocks list at all, no slowdown unless we hit a bad cell,
and in that case who cares about performance when the data is gone...
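
As a rough sketch of what that would look like (names made up; this assumes
memcpy_mcsafe()'s current convention of returning 0 on success and an error
once a machine check was consumed):

	/* Sketch only: access pmem without consulting a badblocks list. */
	static int pmem_copy_sketch(void *dst, const void *pmem_addr, size_t len)
	{
		if (memcpy_mcsafe(dst, pmem_addr, len))
			return -EIO;	/* poisoned line hit; report it, don't reboot */
		return 0;
	}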

For platforms that cannot recover from MCE - just buy better hardware ;).
Seriously, I have doubts people can seriously use a machine that will
unavoidably randomly reboot (as there is always a risk you hit an error that
has not been uncovered by the background scrub). But maybe for big cloud
providers the cost savings may offset the inconvenience, I don't know. But
still, for that case, bad blocks handling in the NVDIMM code like we do now
looks good enough?

        Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [Lsf-pc] [LSF/MM ATTEND] Un-addressable device memory and block/fs implications

2017-01-18 Thread Jan Kara
On Fri 16-12-16 08:44:11, Aneesh Kumar K.V wrote:
> Jerome Glisse <jgli...@redhat.com> writes:
> 
> > I would like to discuss un-addressable device memory in the context of
> > filesystem and block device. Specifically how to handle write-back, read,
> > ... when a filesystem page is migrated to device memory that the CPU cannot
> > access.
> >
> > I intend to post a patchset leveraging the same idea as the existing
> > block bounce helper (block/bounce.c) to handle this. I believe this is
> > worth discussing during summit see how people feels about such plan and
> > if they have better ideas.
> >
> >
> > I also like to join discussions on:
> >   - Peer-to-Peer DMAs between PCIe devices
> >   - CDM coherent device memory
> >   - PMEM
> >   - overall mm discussions
> 
> I would like to attend this discussion. I can talk about coherent device
> memory and how having HMM handle that will make it easy to have one
> interface for device driver. For Coherent device case we definitely need
> page cache migration support.

Aneesh, did you intend this as your request to attend? You posted it as a
reply to another email so it is not really clear. Note that each attend
request should be a separate email so that it does not get lost...

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [Lsf-pc] [LSF/MM TOPIC] Future direction of DAX

2017-01-17 Thread Jan Kara
On Fri 13-01-17 17:20:08, Ross Zwisler wrote:
> - The DAX fsync/msync model was built for platforms that need to flush dirty
>   processor cache lines in order to make data durable on NVDIMMs.  There exist
>   platforms, however, that are set up so that the processor caches are
>   effectively part of the ADR safe zone.  This means that dirty data can be
>   assumed to be durable even in the processor cache, obviating the need to
>   manually flush the cache during fsync/msync.  These platforms still need to
>   call fsync/msync to ensure that filesystem metadata updates are properly
>   written to media.  Our first idea on how to properly support these platforms
>   would be for DAX to be made aware that in some cases doesn't need to keep
>   metadata about dirty cache lines.  A similar issue exists for volatile uses
>   of DAX such as with BRD or with PMEM and the memmap command line parameter,
>   and we'd like a solution that covers them all.

Well, we still need the radix tree entries for locking. And you still need
to keep track of which file offsets are writeably mapped (which we
currently implicitly keep via dirty radix tree entries) so that you can
write-protect them if needed (during filesystem freezing, for reflink, ...).
So I think what is going to gain the most by far is simply to avoid doing
the writeback at all in such situations.

> - If I recall correctly, at one point Dave Chinner suggested that we change
>   DAX so that I/O would use cached stores instead of the non-temporal stores
>   that it currently uses.  We would then track pages that were written to by
>   DAX in the radix tree so that they would be flushed later during
>   fsync/msync.  Does this sound like a win?  Also, assuming that we can find a
>   solution for platforms where the processor cache is part of the ADR safe
>   zone (above topic) this would be a clear improvement, moving us from using
>   non-temporal stores to faster cached stores with no downside.

I guess this needs measurements. But it is worth a try.

> - Jan suggested [2] that we could use the radix tree as a cache to service DAX
>   faults without needing to call into the filesystem.  Are there any issues
>   with this approach, and should we move forward with it as an optimization?

Yup, I'm still for it.

> - Whenever you mount a filesystem with DAX, it spits out a message that says
>   "DAX enabled. Warning: EXPERIMENTAL, use at your own risk".  What criteria
>   needs to be met for DAX to no longer be considered experimental?

So from my POV I'd be OK with removing the warning but still the code is
new so there are clearly bugs lurking ;).

> - When we msync() a huge page, if the range is less than the entire huge page,
>   should we flush the entire huge page and mark it clean in the radix tree, or
>   should we only flush the requested range and leave the radix tree entry
>   dirty?

If you do partial msync(), then you have the problem that msync(0, x),
msync(x, EOF) will not yield a clean file, which may surprise somebody. So
I'm slightly skeptical.
 
> - Should we enable 1 GiB huge pages in filesystem DAX?  Does anyone have any
>   specific customer requests for this or performance data suggesting it would
>   be a win?  If so, what work needs to be done to get 1 GiB sized and aligned
>   filesystem block allocations, to get the required enabling in the MM layer,
>   etc?

I'm not convinced it is worth it now. Maybe later...

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-20 Thread Jan Kara
On Thu 19-01-17 11:03:12, Dan Williams wrote:
> On Thu, Jan 19, 2017 at 10:59 AM, Vishal Verma <vishal.l.ve...@intel.com> 
> wrote:
> > On 01/19, Jan Kara wrote:
> >> On Wed 18-01-17 21:56:58, Verma, Vishal L wrote:
> >> > On Wed, 2017-01-18 at 13:32 -0800, Dan Williams wrote:
> >> > > On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong
> >> > > <darrick.w...@oracle.com> wrote:
> >> > > > On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
> >> > > > > Jan Kara <j...@suse.cz> writes:
> >> > > > >
> >> > > > > > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> >> > > > > > > Your note on the online repair does raise another tangentially
> >> > > > > > > related
> >> > > > > > > topic. Currently, if there are badblocks, writes via the bio
> >> > > > > > > submission
> >> > > > > > > path will clear the error (if the hardware is able to remap
> >> > > > > > > the bad
> >> > > > > > > locations). However, if the filesystem is mounted eith DAX,
> >> > > > > > > even
> >> > > > > > > non-mmap operations - read() and write() will go through the
> >> > > > > > > dax paths
> >> > > > > > > (dax_do_io()). We haven't found a good/agreeable way to
> >> > > > > > > perform
> >> > > > > > > error-clearing in this case. So currently, if a dax mounted
> >> > > > > > > filesystem
> >> > > > > > > has badblocks, the only way to clear those badblocks is to
> >> > > > > > > mount it
> >> > > > > > > without DAX, and overwrite/zero the bad locations. This is a
> >> > > > > > > pretty
> >> > > > > > > terrible user experience, and I'm hoping this can be solved in
> >> > > > > > > a better
> >> > > > > > > way.
> >> > > > > >
> >> > > > > > Please remind me, what is the problem with DAX code doing
> >> > > > > > necessary work to
> >> > > > > > clear the error when it gets EIO from memcpy on write?
> >> > > > >
> >> > > > > You won't get an MCE for a store;  only loads generate them.
> >> > > > >
> >> > > > > Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with
> >> > > > > -o dax?
> >> > > >
> >> > > > Not necessarily; XFS usually implements this by punching out the
> >> > > > range
> >> > > > and then reallocating it as unwritten blocks.
> >> > > >
> >> > >
> >> > > That does clear the error because the unwritten blocks are zeroed and
> >> > > errors cleared when they become allocated again.
> >> >
> >> > Yes, the problem was that writes won't clear errors. zeroing through
> >> > either hole-punch, truncate, unlinking the file should all work
> >> > (assuming the hole-punch or truncate ranges wholly contain the
> >> > 'badblock' sector).
> >>
> >> Let me repeat my question: You have mentioned that if we do IO through DAX,
> >> writes won't clear errors and we should fall back to normal block path to
> >> do write to clear the error. What does prevent us from directly clearing
> >> the error from DAX path?
> >>
> > With DAX, all IO goes through DAX paths. There are two cases:
> > 1. mmap and loads/stores: Obviously there is no kernel intervention
> > here, and no badblocks handling is possible.
> > 2. read() or write() IO: In the absence of dax, this would go through
> > the bio submission path, through the pmem driver, and that would handle
> > error clearing. With DAX, this goes through dax_iomap_actor, which also
> > doesn't go through the pmem driver (it does a dax mapping, followed by
> > essentially memcpy), and hence cannot handle badblocks.
> 
> Hmm, that may no longer be true after my changes to push dax flushing
> to the driver. I.e. we could have a copy_from_iter() implementation
> that attempts to clear errors... I'll get that series out and we can
> discuss there.

Yeah, that was precisely my point - doing copy_from_iter() that clears
errors should be possible...

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-20 Thread Jan Kara
On Thu 19-01-17 14:17:19, Vishal Verma wrote:
> On 01/18, Jan Kara wrote:
> > On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> > 2) PMEM is exposed for DAX aware filesystem. This seems to be what you are
> > mostly interested in. We could possibly do something more efficient than
> > what NVDIMM driver does however the complexity would be relatively high and
> > frankly I'm far from convinced this is really worth it. If there are so
> > many badblocks this would matter, the HW has IMHO bigger problems than
> > performance.
> 
> Correct, and Dave was of the opinion that once at least XFS has reverse
> mapping support (which it does now), adding badblocks information to
> that should not be a hard lift, and should be a better solution. I
> suppose I should try to benchmark how much of a penalty the current badblock
> checking in the NVDIMM driver imposes. The penalty is not because there
> may be a large number of badblocks, but just due to the fact that we
> have to do this check for every IO, in fact, every 'bvec' in a bio.

Well, letting the filesystem know is certainly good from an error-reporting
quality POV. I guess I'll leave it up to the XFS guys to tell whether they
can be more efficient in checking whether the current IO overlaps with any of
the given bad blocks.
 
> > Now my question: Why do we bother with badblocks at all? In cases 1) and 2)
> > if the platform can recover from MCE, we can just always access persistent
> > memory using memcpy_mcsafe(), if that fails, return -EIO. Actually that
> > seems to already happen so we just need to make sure all places handle
> > returned errors properly (e.g. fs/dax.c does not seem to) and we are done.
> > No need for bad blocks list at all, no slow down unless we hit a bad cell
> > and in that case who cares about performance when the data is gone...
> 
> Even when we have MCE recovery, we cannot do away with the badblocks
> list:
> 1. My understanding is that the hardware's ability to do MCE recovery is
> limited/best-effort, and is not guaranteed. There can be circumstances
> that cause a "Processor Context Corrupt" state, which is unrecoverable.

Well, then they have to work on improving the hardware. Because having HW
that just sometimes gets stuck instead of reporting bad storage is simply
not acceptable. And no matter how hard you try, you cannot avoid MCEs in the
OS when accessing persistent memory, so the OS just has no way to avoid that
risk.

> 2. We still need to maintain a badblocks list so that we know what
> blocks need to be cleared (via the ACPI method) on writes.

Well, why can't we just do the write, see whether we got a CMCI, and if yes,
clear the error via the ACPI method?

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 0/13 v2] block: Fix block device shutdown related races

2017-03-01 Thread Jan Kara
Hello,

On Tue 28-02-17 11:54:41, Tejun Heo wrote:
> It generally looks good to me.

Thanks for review!

> The only worry I have is around wb_shutdown() synchronization and if that
> is actually an issue it shouldn't be too difficult to fix.

Yeah, I'll have a look at that.

> The other thing which came to mind is the congested->__bdi sever
> semantics.  IIRC, that one was also to support the "bdi must go away now"
> behavior.  As bdi is refcnted now, I think we can probably just let cong
> hold onto the bdi rather than try to sever the ref there.

So currently I get away with __bdi not being a proper refcounted reference.
If we were to remove the clearing of __bdi, we'd have to make it into a
refcounted reference, which is slightly ugly as we need to special-case
embedded bdi_writeback_congested structures. Maybe it will be a worthwhile
cleanup but for now I left it alone...

    Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 0/13 v2] block: Fix block device shutdown related races

2017-03-01 Thread Jan Kara
On Tue 28-02-17 23:26:53, Omar Sandoval wrote:
> On Tue, Feb 28, 2017 at 11:25:28PM -0800, Omar Sandoval wrote:
> > On Wed, Feb 22, 2017 at 11:24:25AM +0100, Jan Kara wrote:
> > > On Tue 21-02-17 10:19:28, Jens Axboe wrote:
> > > > On 02/21/2017 10:09 AM, Jan Kara wrote:
> > > > > Hello,
> > > > > 
> > > > > this is a second revision of the patch set to fix several different 
> > > > > races and
> > > > > issues I've found when testing device shutdown and reuse. The first 
> > > > > three
> > > > > patches are fixes to problems in my previous series fixing BDI 
> > > > > lifetime issues.
> > > > > Patch 4 fixes issues with reuse of BDI name with scsi devices. With 
> > > > > it I cannot
> > > > > reproduce the BDI name reuse issues using Omar's stress test using 
> > > > > scsi_debug
> > > > > so it can be used as a replacement of Dan's patches. Patches 5-11 fix 
> > > > > oops that
> > > > > is triggered by __blkdev_put() calling inode_detach_wb() too early 
> > > > > (the problem
> > > > > reported by Thiago). Patches 12 and 13 fix oops due to a bug in 
> > > > > gendisk code
> > > > > where get_gendisk() can return already freed gendisk structure (again 
> > > > > triggered
> > > > > by Omar's stress test).
> > > > > 
> > > > > People, please have a look at patches. They are mostly simple however 
> > > > > the
> > > > > interactions are rather complex so I may have missed something. Also 
> > > > > I'm
> > > > > happy for any additional testing these patches can get - I've 
> > > > > stressed them
> > > > > with Omar's script, tested memcg writeback, tested static (not udev 
> > > > > managed)
> > > > > device inodes.
> > > > > 
> > > > > Jens, I think at least patches 1-3 should go in together with my 
> > > > > fixes you
> > > > > already have in your tree (or shortly after them). It is up to you 
> > > > > whether
> > > > > you decide to delay my first fixes or pick these up quickly. Patch 4 
> > > > > is
> > > > > (IMHO a cleaner) replacement of Dan's patches so consider whether you 
> > > > > want
> > > > > to use it instead of those patches.
> > > > 
> > > > I have applied 1-3 to my for-linus branch, which will go in after
> > > > the initial pull request has been pulled by Linus. Consider fixing up
> > > > #4 so it applies, I like it.
> > > 
> > > OK, attached is patch 4 rebased on top of Linus' tree from today which
> > > already has linux-block changes pulled in. I've left put_disk_devt() in
> > > blk_cleanup_queue() to maintain the logic in the original patch (now 
> > > commit
> > > 0dba1314d4f8) that request_queue and gendisk each hold one devt reference.
> > > The bdi_unregister() call that is moved to del_gendisk() by this patch is
> > > now protected by the gendisk reference instead of the request_queue one
> > > so it still maintains the property that devt reference protects bdi
> > > registration-unregistration lifetime (as much as that is not needed 
> > > anymore
> > > after this patch).
> > > 
> > > I have also updated the comment in the code and the changelog - they were
> > > somewhat stale after changes to the whole series Tejun suggested.
> > > 
> > >   Honza
> > 
> > Hey, Jan, I just tested this out when I was seeing similar crashes with
> > sr instead of sd, and this fixed it.
> > 
> > Tested-by: Omar Sandoval <osan...@fb.com>
> 
> Just realized it wasn't clear, I'm talking about patch 4 specifically.

Thanks for confirmation!

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH] cfq-iosched: fix the delay of cfq_group's vdisktime under iops mode

2017-03-02 Thread Jan Kara
On Wed 01-03-17 10:07:44, Hou Tao wrote:
> When adding a cfq_group into the cfq service tree, we use CFQ_IDLE_DELAY
> as the delay of cfq_group's vdisktime if there have been other cfq_groups
> already.
> 
> When cfq is under iops mode, commit 9a7f38c42c2b ("cfq-iosched: Convert
> from jiffies to nanoseconds") could result in a large iops delay and
> lead to an abnormal io schedule delay for the added cfq_group. To fix
> it, we just need to revert to the old CFQ_IDLE_DELAY value: HZ / 5
> when iops mode is enabled.
> 
> Cc: <sta...@vger.kernel.org> # 4.8+
> Signed-off-by: Hou Tao <hout...@huawei.com>

OK, I agree my commit broke the logic in this case. Thanks for the fix.
Please also add the tag:

Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4

I somewhat disagree with the fix though. See below:

> +static inline u64 cfq_get_cfqg_vdisktime_delay(struct cfq_data *cfqd)
> +{
> + if (!iops_mode(cfqd))
> + return CFQ_IDLE_DELAY;
> + else
> + return nsecs_to_jiffies64(CFQ_IDLE_DELAY);
> +}
> +

So using nsecs_to_jiffies64(CFQ_IDLE_DELAY) when in iops mode just does not
make any sense. AFAIU the code in cfq_group_notify_queue_add(), we just want
to add the cfqg as the last one in the tree, so returning 1 from
cfq_get_cfqg_vdisktime_delay() in iops mode should be fine as well.

Frankly, vdisktime is in fixed-point precision shifted by
CFQ_SERVICE_SHIFT, so using CFQ_IDLE_DELAY does not make much sense in any
case and just adding 1 to the maximum vdisktime should be fine in all the
cases. But that would require more testing to verify I did not miss anything
subtle.
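
I.e., something along these lines (an untested sketch of the suggestion, not
a replacement patch):

	static inline u64 cfq_get_cfqg_vdisktime_delay(struct cfq_data *cfqd)
	{
		/*
		 * In iops mode vdisktime is not measured in time units, so
		 * any small positive increment just queues the new group
		 * last in the service tree.
		 */
		if (iops_mode(cfqd))
			return 1;
		return CFQ_IDLE_DELAY;
	}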

Honza

>  static void
>  cfq_group_notify_queue_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
>  {
> @@ -1380,7 +1388,8 @@ cfq_group_notify_queue_add(struct cfq_data *cfqd, 
> struct cfq_group *cfqg)
>   n = rb_last(&st->rb);
>   if (n) {
>   __cfqg = rb_entry_cfqg(n);
> - cfqg->vdisktime = __cfqg->vdisktime + CFQ_IDLE_DELAY;
> + cfqg->vdisktime = __cfqg->vdisktime +
> + cfq_get_cfqg_vdisktime_delay(cfqd);
>       } else
>   cfqg->vdisktime = st->min_vdisktime;
>   cfq_group_service_tree_add(st, cfqg);
> -- 
> 2.5.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 7/8] blk-wbt: add general throttling mechanism

2016-11-08 Thread Jan Kara
On Tue 01-11-16 15:08:50, Jens Axboe wrote:
> We can hook this up to the block layer, to help throttle buffered
> writes.
> 
> wbt registers a few trace points that can be used to track what is
> happening in the system:
> 
> wbt_lat: 259:0: latency 2446318
> wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
>wmean=518866, wmin=15522, wmax=5330353, wsamples=57
> wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, 
> max=32
> 
> This shows a sync issue event (wbt_lat) that exceeded its time. wbt_stat
> dumps the current read/write stats for that window, and wbt_step shows a
> step down event where we now scale back writes. Each trace includes the
> device, 259:0 in this case.

Just one serious question and one nit below:

> +void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
> +{
> + struct rq_wait *rqw;
> + int inflight, limit;
> +
> + if (!(wb_acct & WBT_TRACKED))
> + return;
> +
> + rqw = get_rq_wait(rwb, wb_acct & WBT_KSWAPD);
> + inflight = atomic_dec_return(&rqw->inflight);
> +
> + /*
> +  * wbt got disabled with IO in flight. Wake up any potential
> +  * waiters, we don't have to do more than that.
> +  */
> + if (unlikely(!rwb_enabled(rwb))) {
> + rwb_wake_all(rwb);
> + return;
> + }
> +
> + /*
> +  * If the device does write back caching, drop further down
> +  * before we wake people up.
> +  */
> + if (rwb->wc && !wb_recent_wait(rwb))
> + limit = 0;
> + else
> + limit = rwb->wb_normal;

So for devices with write cache, you will completely drain the device
before waking anybody waiting to issue new requests. Isn't it too strict?
In particular may_queue() will allow new writers to issue new writes once
we drop below the limit so it can happen that some processes will be
effectively starved waiting in may_queue?

> +static void wb_timer_fn(unsigned long data)
> +{
> + struct rq_wb *rwb = (struct rq_wb *) data;
> + unsigned int inflight = wbt_inflight(rwb);
> + int status;
> +
> + status = latency_exceeded(rwb);
> +
> + trace_wbt_timer(rwb->bdi, status, rwb->scale_step, inflight);
> +
> + /*
> +  * If we exceeded the latency target, step down. If we did not,
> +  * step one level up. If we don't know enough to say either exceeded
> +  * or ok, then don't do anything.
> +  */
> + switch (status) {
> + case LAT_EXCEEDED:
> + scale_down(rwb, true);
> + break;
> + case LAT_OK:
> + scale_up(rwb);
> + break;
> + case LAT_UNKNOWN_WRITES:
> + scale_up(rwb);
> + break;
> + case LAT_UNKNOWN:
> + if (++rwb->unknown_cnt < RWB_UNKNOWN_BUMP)
> + break;
> + /*
> +  * We get here for two reasons:
> +  *
> +  * 1) We previously scaled reduced depth, and we currently
> +  *don't have a valid read/write sample. For that case,
> +  *slowly return to center state (step == 0).
> +  * 2) We started at the center step, but don't have a valid
> +  *read/write sample, but we do have writes going on.
> +  *Allow step to go negative, to increase write perf.
> +  */

I think part 2) of the comment now belongs to LAT_UNKNOWN_WRITES label.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 8/8] block: hook up writeback throttling

2016-11-08 Thread Jan Kara
On Tue 01-11-16 15:08:51, Jens Axboe wrote:
> Enable throttling of buffered writeback to make it a lot
> smoother and have way less impact on other system activity.
> Background writeback should be, by definition, background
> activity. The fact that we flush huge bundles of it at the time
> means that it potentially has heavy impacts on foreground workloads,
> which isn't ideal. We can't easily limit the sizes of writes that
> we do, since that would impact file system layout in the presence
> of delayed allocation. So just throttle back buffered writeback,
> unless someone is waiting for it.
> 
> The algorithm for when to throttle takes its inspiration in the
> CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
> the minimum latencies of requests over a window of time. In that
> window of time, if the minimum latency of any request exceeds a
> given target, then a scale count is incremented and the queue depth
> is shrunk. The next monitoring window is shrunk accordingly. Unlike
> CoDel, if we hit a window that exhibits good behavior, then we
> simply increment the scale count and re-calculate the limits for that
> scale value. This prevents us from oscillating between a
> close-to-ideal value and max all the time, instead remaining in the
> windows where we get good behavior.
> 
> Unlike CoDel, blk-wb allows the scale count to go negative. This
> happens if we primarily have writes going on. Unlike positive
> scale counts, this doesn't change the size of the monitoring window.
> When the heavy writers finish, blk-wb quickly snaps back to its
> stable state of a zero scale count.
> 
> The patch registers two sysfs entries. The first one, 'wb_window_usec',
> defines the window of monitoring. The second one, 'wb_lat_usec',
> sets the latency target for the window. It defaults to 2 msec for
> non-rotational storage, and 75 msec for rotational storage. Setting
> this value to '0' disables blk-wb. Generally, a user would not have
> to touch these settings.
> 
> We don't enable WBT on devices that are managed with CFQ, and have
> a non-root block cgroup attached. If we have a proportional share setup
> on this particular disk, then the wbt throttling will interfere with
> that. We don't have a strong need for wbt for that case, since we will
> rely on CFQ doing that for us.
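
As a rough, simplified model of the CoDel-like scaling described in the
changelog above: the base depth, base window and the exact arithmetic below
are invented for illustration and differ from what blk-wbt actually computes.

#include <stdio.h>

#define BASE_DEPTH      32
#define BASE_WIN_MS     100

static int scale_step;          /* > 0: throttled harder, < 0: opened up */

static int wb_depth(void)
{
        int depth;

        if (scale_step >= 0) {
                depth = BASE_DEPTH >> scale_step;
                return depth ? depth : 1;
        }
        return BASE_DEPTH << -scale_step;
}

static int window_ms(void)
{
        /* only positive scale steps shrink the monitoring window */
        return scale_step > 0 ? BASE_WIN_MS >> scale_step : BASE_WIN_MS;
}

static void latency_exceeded(void) { scale_step++; }    /* step down */
static void window_was_good(void)  { scale_step--; }    /* step up   */

int main(void)
{
        printf("start:      depth=%2d window=%3dms\n", wb_depth(), window_ms());
        latency_exceeded();
        latency_exceeded();
        printf("2 bad wins: depth=%2d window=%3dms\n", wb_depth(), window_ms());
        window_was_good();
        printf("1 good win: depth=%2d window=%3dms\n", wb_depth(), window_ms());
        window_was_good();
        window_was_good();      /* write-only workload: scale count goes negative */
        printf("2 more:     depth=%2d window=%3dms\n", wb_depth(), window_ms());
        return 0;
}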

Just one nit: Don't you miss a wbt_exit() call for the legacy block layer? I
don't see where that happens.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 6/8] block: add scalable completion tracking of requests

2016-11-08 Thread Jan Kara
>  static struct queue_sysfs_entry queue_requests_entry = {
>   .attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
>   .show = queue_requests_show,
> @@ -553,6 +573,11 @@ static struct queue_sysfs_entry queue_dax_entry = {
>   .show = queue_dax_show,
>  };
>  
> +static struct queue_sysfs_entry queue_stats_entry = {
> + .attr = {.name = "stats", .mode = S_IRUGO },
> + .show = queue_stats_show,
> +};
> +
>  static struct attribute *default_attrs[] = {
>   &queue_requests_entry.attr,
>   &queue_ra_entry.attr,
> @@ -582,6 +607,7 @@ static struct attribute *default_attrs[] = {
>   &queue_poll_entry.attr,
>   &queue_wc_entry.attr,
>   &queue_dax_entry.attr,
> + &queue_stats_entry.attr,
>   NULL,
>  };
>  
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 562ac46cb790..4d0044d09984 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -250,4 +250,20 @@ static inline unsigned int blk_qc_t_to_tag(blk_qc_t 
> cookie)
>   return cookie & ((1u << BLK_QC_T_SHIFT) - 1);
>  }
>  
> +struct blk_issue_stat {
> + u64 time;
> +};
> +
> +#define BLK_RQ_STAT_BATCH   64
> +
> +struct blk_rq_stat {
> + s64 mean;
> + u64 min;
> + u64 max;
> + s32 nr_samples;
> + s32 nr_batch;
> + u64 batch;
> + s64 time;
> +};
> +
>  #endif /* __LINUX_BLK_TYPES_H */
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 0c677fb35ce4..6bd5eb56894e 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -197,6 +197,7 @@ struct request {
>   struct gendisk *rq_disk;
>   struct hd_struct *part;
>   unsigned long start_time;
> + struct blk_issue_stat issue_stat;
>  #ifdef CONFIG_BLK_CGROUP
>   struct request_list *rl;/* rl this rq is alloced from */
>   unsigned long long start_time_ns;
> @@ -492,6 +493,9 @@ struct request_queue {
>  
>   unsigned int        nr_sorted;
>   unsigned int        in_flight[2];
> +
> + struct blk_rq_stat  rq_stats[2];
> +
>   /*
>* Number of active block driver functions for which blk_drain_queue()
>* must wait. Must be incremented around functions that unlock the
> -- 
> 2.7.4
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
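
The blk_rq_stat layout quoted above suggests batched accumulation: samples are
summed into 'batch' and folded into the running mean with a single division
per batch instead of one per sample. A rough user-space sketch of that idea,
with an assumed batch size of 64 and simplified fold arithmetic:

#include <stdio.h>
#include <stdint.h>

#define STAT_BATCH 64

struct rq_stat {
        int64_t  mean;
        uint64_t min, max;
        int32_t  nr_samples;
        int32_t  nr_batch;
        uint64_t batch;         /* sum of not-yet-folded samples */
};

static void stat_init(struct rq_stat *s)
{
        s->mean = 0;
        s->min = UINT64_MAX;
        s->max = 0;
        s->nr_samples = s->nr_batch = 0;
        s->batch = 0;
}

static void stat_flush_batch(struct rq_stat *s)
{
        if (!s->nr_batch)
                return;
        /* fold the pending batch into the running mean with one division */
        s->mean = (s->mean * s->nr_samples + s->batch) /
                  (s->nr_samples + s->nr_batch);
        s->nr_samples += s->nr_batch;
        s->nr_batch = 0;
        s->batch = 0;
}

static void stat_add(struct rq_stat *s, uint64_t value)
{
        if (value < s->min)
                s->min = value;
        if (value > s->max)
                s->max = value;
        s->batch += value;
        if (++s->nr_batch >= STAT_BATCH)
                stat_flush_batch(s);
}

int main(void)
{
        struct rq_stat s;
        uint64_t v;

        stat_init(&s);
        for (v = 100; v <= 1000; v += 100)      /* ten fake latencies, usec */
                stat_add(&s, v);
        stat_flush_batch(&s);
        printf("samples=%d mean=%lld min=%llu max=%llu\n", s.nr_samples,
               (long long)s.mean, (unsigned long long)s.min,
               (unsigned long long)s.max);
        return 0;
}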


Re: [PATCH 6/8] block: add scalable completion tracking of requests

2016-11-10 Thread Jan Kara
On Wed 09-11-16 12:52:25, Jens Axboe wrote:
> On 11/09/2016 09:09 AM, Jens Axboe wrote:
> >On 11/09/2016 02:01 AM, Jan Kara wrote:
> >>On Tue 08-11-16 08:25:52, Jens Axboe wrote:
> >>>On 11/08/2016 06:30 AM, Jan Kara wrote:
> >>>>On Tue 01-11-16 15:08:49, Jens Axboe wrote:
> >>>>>For legacy block, we simply track them in the request queue. For
> >>>>>blk-mq, we track them on a per-sw queue basis, which we can then
> >>>>>sum up through the hardware queues and finally to a per device
> >>>>>state.
> >>>>>
> >>>>>The stats are tracked in, roughly, 0.1s interval windows.
> >>>>>
> >>>>>Add sysfs files to display the stats.
> >>>>>
> >>>>>Signed-off-by: Jens Axboe <ax...@fb.com>
> >>>>
> >>>>This patch looks mostly good to me but I have one concern: You track
> >>>>statistics in a fixed 134ms window, stats get cleared at the
> >>>>beginning of
> >>>>each window. Now this can interact with the writeback window and
> >>>>latency
> >>>>settings which are dynamic and settable from userspace - so if the
> >>>>writeback code observation window gets set larger than the stats
> >>>>window,
> >>>>things become strange since you'll likely miss quite some observations
> >>>>about read latencies. So I think you need to make sure stats window is
> >>>>always larger than writeback window. Or actually, why do you have
> >>>>something
> >>>>like stats window and don't leave clearing of statistics completely
> >>>>to the
> >>>>writeback tracking code?
> >>>
> >>>That's a good point, and there actually used to be a comment to that
> >>>effect in the code. I think the best solution here would be to make the
> >>>stats code mask available somewhere, and allow a consumer of the stats
> >>>to request a larger window.
> >>>
> >>>Similarly, we could make the stat window be driven by the consumer, as
> >>>you suggest.
> >>>
> >>>Currently there are two pending submissions that depend on the stats
> >>>code. One is this writeback series, and the other one is the hybrid
> >>>polling code. The latter does not really care about the window size as
> >>>such, since it has no monitoring window of its own, and it wants the
> >>>auto-clearing as well.
> >>>
> >>>I don't mind working on additions for this, but I'd prefer if we could
> >>>layer them on top of the existing series instead of respinning it.
> >>>There's considerable test time on the existing patchset. Would that work
> >>>for you? Especially collapsing the stats and wbt windows would require
> >>>some re-architecting.
> >>
> >>OK, that works for me. Actually, when thinking about this, I have one
> >>more
> >>suggestion: Do we really want to expose the wbt window as a sysfs
> >>tunable?
> >>I guess it is good for initial experiments but longer term having the wbt
> >>window length be a function of target read latency might be better.
> >>Generally you want the window length to be considerably larger than the
> >>target latency but OTOH not too large so that the algorithm can react
> >>reasonably quickly, which suggests it could really be autotuned (and we
> >>scale the window anyway to adapt it to the current situation).
> >
> >That's not a bad idea, I have thought about that as well before. We
> >don't need the window tunable, and you are right, it can be a function
> >of the desired latency.
> >
> >I'll hardwire the 100msec latency window for now and get rid of the
> >exposed tunable. It's harder to remove sysfs files once they have made
> >it into the kernel...
> 
> Killed the sysfs variable, so for now it'll be a 100msec window by
> default.

OK, I guess good enough to get this merged.
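
As a sketch of the autotuning idea discussed above -- deriving the monitoring
window from the latency target instead of exposing it as a knob -- with a
multiplier and clamps that are purely illustrative, not what the kernel does:

#include <stdio.h>

#define MIN_WIN_USEC     10000ULL       /* don't react on too little data    */
#define MAX_WIN_USEC    500000ULL       /* ...but stay reasonably responsive */

static unsigned long long win_from_target(unsigned long long target_lat_usec)
{
        unsigned long long win = target_lat_usec * 10;  /* window >> target */

        if (win < MIN_WIN_USEC)
                win = MIN_WIN_USEC;
        if (win > MAX_WIN_USEC)
                win = MAX_WIN_USEC;
        return win;
}

int main(void)
{
        /* defaults quoted earlier: 2msec non-rotational, 75msec rotational */
        printf("target 2000us  -> window %lluus\n", win_from_target(2000));
        printf("target 75000us -> window %lluus\n", win_from_target(75000));
        return 0;
}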

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 5/8] block: add code to track actual device queue depth

2016-11-05 Thread Jan Kara
On Tue 01-11-16 15:08:48, Jens Axboe wrote:
> For blk-mq, ->nr_requests does track queue depth, at least at init
> time. But for the older queue paths, it's simply a soft setting.
> On top of that, it's generally larger than the hardware setting
> on purpose, to allow backup of requests for merging.
> 
> Fill a hole in struct request_queue with a 'queue_depth' member that
> drivers can set to more closely inform the block layer of the
> real queue depth.
> 
> Signed-off-by: Jens Axboe <ax...@fb.com>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza
> ---
>  block/blk-settings.c   | 12 
>  drivers/scsi/scsi.c|  3 +++
>  include/linux/blkdev.h | 11 +++
>  3 files changed, 26 insertions(+)
> 
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 55369a65dea2..9cf053759363 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -837,6 +837,18 @@ void blk_queue_flush_queueable(struct request_queue *q, 
> bool queueable)
>  EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);
>  
>  /**
> + * blk_set_queue_depth - tell the block layer about the device queue depth
> + * @q:   the request queue for the device
> + * @depth:   queue depth
> + *
> + */
> +void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
> +{
> + q->queue_depth = depth;
> +}
> +EXPORT_SYMBOL(blk_set_queue_depth);
> +
> +/**
>   * blk_queue_write_cache - configure queue's write cache
>   * @q:   the request queue for the device
>   * @wc:  write back cache on or off
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index 1deb6adc411f..75455d4dab68 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -621,6 +621,9 @@ int scsi_change_queue_depth(struct scsi_device *sdev, int 
> depth)
>   wmb();
>   }
>  
> + if (sdev->request_queue)
> + blk_set_queue_depth(sdev->request_queue, depth);
> +
>   return sdev->queue_depth;
>  }
>  EXPORT_SYMBOL(scsi_change_queue_depth);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 8396da2bb698..0c677fb35ce4 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -405,6 +405,8 @@ struct request_queue {
>   struct blk_mq_ctx __percpu  *queue_ctx;
>   unsigned int            nr_queues;
>  
> + unsigned int            queue_depth;
> +
>   /* hw dispatch queues */
>   struct blk_mq_hw_ctx        **queue_hw_ctx;
>   unsigned int            nr_hw_queues;
> @@ -777,6 +779,14 @@ static inline bool blk_write_same_mergeable(struct bio 
> *a, struct bio *b)
>   return false;
>  }
>  
> +static inline unsigned int blk_queue_depth(struct request_queue *q)
> +{
> + if (q->queue_depth)
> + return q->queue_depth;
> +
> + return q->nr_requests;
> +}
> +
>  /*
>   * q->prep_rq_fn return values
>   */
> @@ -1093,6 +1103,7 @@ extern void blk_limits_io_min(struct queue_limits 
> *limits, unsigned int min);
>  extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
>  extern void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt);
>  extern void blk_queue_io_opt(struct request_queue *q, unsigned int opt);
> +extern void blk_set_queue_depth(struct request_queue *q, unsigned int depth);
>  extern void blk_set_default_limits(struct queue_limits *lim);
>  extern void blk_set_stacking_limits(struct queue_limits *lim);
>  extern int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> -- 
> 2.7.4
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
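
A small user-space model of the fallback behaviour of blk_queue_depth() quoted
above: the soft nr_requests value is used until a driver reports the real
hardware depth. The numbers and the "driver" here are hypothetical.

#include <stdio.h>

struct toy_queue {
        unsigned int nr_requests;       /* soft setting, allows a merge backlog    */
        unsigned int queue_depth;       /* 0 until a driver reports the real depth */
};

static void set_queue_depth(struct toy_queue *q, unsigned int depth)
{
        q->queue_depth = depth;
}

static unsigned int queue_depth(const struct toy_queue *q)
{
        return q->queue_depth ? q->queue_depth : q->nr_requests;
}

int main(void)
{
        struct toy_queue q = { .nr_requests = 128, .queue_depth = 0 };

        printf("before driver reports depth: %u\n", queue_depth(&q));  /* 128 */
        set_queue_depth(&q, 32);        /* e.g. a device exposing 32 tags */
        printf("after driver reports depth:  %u\n", queue_depth(&q));  /* 32  */
        return 0;
}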


Re: [PATCH 3/8] writeback: mark background writeback as such

2016-11-05 Thread Jan Kara
On Tue 01-11-16 15:08:46, Jens Axboe wrote:
> If we're doing background type writes, then use the appropriate
> background write flags for that.
> 
> Signed-off-by: Jens Axboe <ax...@fb.com>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  include/linux/writeback.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 50c96ee8108f..c78f9f0920b5 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -107,6 +107,8 @@ static inline int wbc_to_write_flags(struct 
> writeback_control *wbc)
>  {
>   if (wbc->sync_mode == WB_SYNC_ALL)
>   return REQ_SYNC;
> + else if (wbc->for_kupdate || wbc->for_background)
> +         return REQ_BACKGROUND;
>  
>   return 0;
>  }
> -- 
> 2.7.4
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 1/8] block: add WRITE_BACKGROUND

2016-11-05 Thread Jan Kara
On Tue 01-11-16 15:08:44, Jens Axboe wrote:
> This adds a new request flag, REQ_BACKGROUND, that callers can use to
> tell the block layer that this is background (non-urgent) IO.
> 
> Signed-off-by: Jens Axboe <ax...@fb.com>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  include/linux/blk_types.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index bb921028e7c5..562ac46cb790 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -177,6 +177,7 @@ enum req_flag_bits {
>   __REQ_FUA,  /* forced unit access */
>   __REQ_PREFLUSH, /* request for cache flush */
>   __REQ_RAHEAD,   /* read ahead, can fail anytime */
> + __REQ_BACKGROUND,   /* background IO */
>   __REQ_NR_BITS,  /* stops here */
>  };
>  
> @@ -192,6 +193,7 @@ enum req_flag_bits {
>  #define REQ_FUA  (1ULL << __REQ_FUA)
>  #define REQ_PREFLUSH (1ULL << __REQ_PREFLUSH)
>  #define REQ_RAHEAD   (1ULL << __REQ_RAHEAD)
> +#define REQ_BACKGROUND   (1ULL << __REQ_BACKGROUND)
>  
>  #define REQ_FAILFAST_MASK \
>   (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
> -- 
> 2.7.4
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
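
The new flag follows the existing pattern of a bit-index enum plus a shifted
mask. A minimal stand-alone sketch of that pattern, with arbitrary bit
positions rather than the kernel's actual layout:

#include <stdio.h>

enum toy_req_flag_bits {
        __TOY_REQ_SYNC,
        __TOY_REQ_RAHEAD,
        __TOY_REQ_BACKGROUND,   /* the newly added bit */
        __TOY_REQ_NR_BITS,
};

#define TOY_REQ_SYNC            (1ULL << __TOY_REQ_SYNC)
#define TOY_REQ_RAHEAD          (1ULL << __TOY_REQ_RAHEAD)
#define TOY_REQ_BACKGROUND      (1ULL << __TOY_REQ_BACKGROUND)

int main(void)
{
        unsigned long long flags = TOY_REQ_BACKGROUND;

        if (flags & TOY_REQ_BACKGROUND)
                printf("this IO is background (mask 0x%llx)\n",
                       TOY_REQ_BACKGROUND);
        return 0;
}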


[PATCH 1/2] brd: Make rd_size argument static

2016-10-25 Thread Jan Kara
rd_size does not appear to be used outside of brd. Make it static.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 drivers/block/brd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 0c76d4016eeb..8b22e4a04918 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -443,7 +443,7 @@ static int rd_nr = CONFIG_BLK_DEV_RAM_COUNT;
 module_param(rd_nr, int, S_IRUGO);
 MODULE_PARM_DESC(rd_nr, "Maximum number of brd devices");
 
-int rd_size = CONFIG_BLK_DEV_RAM_SIZE;
+static int rd_size = CONFIG_BLK_DEV_RAM_SIZE;
 module_param(rd_size, int, S_IRUGO);
 MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes.");
 
-- 
2.6.6



Re: [PATCH 2/2] blk-wbt: remove stat ops

2016-11-14 Thread Jan Kara
On Fri 11-11-16 08:21:57, Jens Axboe wrote:
> Again a leftover from when the throttling code was generic. Now that we
> just have the block user, get rid of the stat ops and indirections.
> 
> Signed-off-by: Jens Axboe <ax...@fb.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  block/blk-sysfs.c | 23 +--
>  block/blk-wbt.c   | 15 +--
>  block/blk-wbt.h   | 13 ++---
>  3 files changed, 8 insertions(+), 43 deletions(-)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 9262d2d60a09..415e764807d0 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -770,27 +770,6 @@ struct kobj_type blk_queue_ktype = {
>   .release= blk_release_queue,
>  };
>  
> -static void blk_wb_stat_get(void *data, struct blk_rq_stat *stat)
> -{
> - blk_queue_stat_get(data, stat);
> -}
> -
> -static void blk_wb_stat_clear(void *data)
> -{
> - blk_stat_clear(data);
> -}
> -
> -static bool blk_wb_stat_is_current(struct blk_rq_stat *stat)
> -{
> - return blk_stat_is_current(stat);
> -}
> -
> -static struct wb_stat_ops wb_stat_ops = {
> - .get= blk_wb_stat_get,
> - .is_current = blk_wb_stat_is_current,
> - .clear  = blk_wb_stat_clear,
> -};
> -
>  static void blk_wb_init(struct request_queue *q)
>  {
>  #ifndef CONFIG_BLK_WBT_MQ
> @@ -805,7 +784,7 @@ static void blk_wb_init(struct request_queue *q)
>   /*
>* If this fails, we don't get throttling
>*/
> - wbt_init(q, _stat_ops);
> + wbt_init(q);
>  }
>  
>  int blk_register_queue(struct gendisk *disk)
> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
> index 4ab9cebc8003..f6ec7e587fa6 100644
> --- a/block/blk-wbt.c
> +++ b/block/blk-wbt.c
> @@ -308,7 +308,7 @@ static int __latency_exceeded(struct rq_wb *rwb, struct 
> blk_rq_stat *stat)
>* waited or still has writes in flights, consider us doing
>* just writes as well.
>*/
> - if ((stat[1].nr_samples && rwb->stat_ops->is_current(stat)) ||
> + if ((stat[1].nr_samples && blk_stat_is_current(stat)) ||
>   wb_recent_wait(rwb) || wbt_inflight(rwb))
>   return LAT_UNKNOWN_WRITES;
>   return LAT_UNKNOWN;
> @@ -333,7 +333,7 @@ static int latency_exceeded(struct rq_wb *rwb)
>  {
>   struct blk_rq_stat stat[2];
>  
> - rwb->stat_ops->get(rwb->ops_data, stat);
> + blk_queue_stat_get(rwb->queue, stat);
>   return __latency_exceeded(rwb, stat);
>  }
>  
> @@ -355,7 +355,7 @@ static void scale_up(struct rq_wb *rwb)
>  
>   rwb->scale_step--;
>   rwb->unknown_cnt = 0;
> - rwb->stat_ops->clear(rwb->ops_data);
> + blk_stat_clear(rwb->queue);
>  
>   rwb->scaled_max = calc_wb_limits(rwb);
>  
> @@ -385,7 +385,7 @@ static void scale_down(struct rq_wb *rwb, bool 
> hard_throttle)
>  
>   rwb->scaled_max = false;
>   rwb->unknown_cnt = 0;
> - rwb->stat_ops->clear(rwb->ops_data);
> + blk_stat_clear(rwb->queue);
>   calc_wb_limits(rwb);
>   rwb_trace_step(rwb, "step down");
>  }
> @@ -675,7 +675,7 @@ void wbt_disable(struct rq_wb *rwb)
>  }
>  EXPORT_SYMBOL_GPL(wbt_disable);
>  
> -int wbt_init(struct request_queue *q, struct wb_stat_ops *ops)
> +int wbt_init(struct request_queue *q)
>  {
>   struct rq_wb *rwb;
>   int i;
> @@ -688,9 +688,6 @@ int wbt_init(struct request_queue *q, struct wb_stat_ops 
> *ops)
>   BUILD_BUG_ON(RWB_WINDOW_NSEC > BLK_STAT_NSEC);
>   BUILD_BUG_ON(WBT_NR_BITS > BLK_STAT_RES_BITS);
>  
> - if (!ops->get || !ops->is_current || !ops->clear)
> - return -EINVAL;
> -
>   rwb = kzalloc(sizeof(*rwb), GFP_KERNEL);
>   if (!rwb)
>   return -ENOMEM;
> @@ -706,8 +703,6 @@ int wbt_init(struct request_queue *q, struct wb_stat_ops 
> *ops)
>   rwb->last_comp = rwb->last_issue = jiffies;
>   rwb->queue = q;
>   rwb->win_nsec = RWB_WINDOW_NSEC;
> - rwb->stat_ops = ops;
> - rwb->ops_data = q;
>   wbt_update_limits(rwb);
>  
>   /*
> diff --git a/block/blk-wbt.h b/block/blk-wbt.h
> index 09c61a3f8295..44dc2173dc1f 100644
> --- a/block/blk-wbt.h
> +++ b/block/blk-wbt.h
> @@ -46,12 +46,6 @@ static inline bool wbt_is_read(struct blk_issue_stat *stat)
>   return (stat->time >> BLK_STAT_SHIFT) & WBT_READ;
>  }
>  

Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()

2016-11-01 Thread Jan Kara
On Mon 31-10-16 21:10:35, Kirill A. Shutemov wrote:
> [ My mail system got broken and the original reply didn't get through.
> Resent. ]

OK, this answers some of my questions from previous email so disregard that
one.

> On Thu, Oct 13, 2016 at 11:33:13AM +0200, Jan Kara wrote:
> > On Thu 15-09-16 14:54:57, Kirill A. Shutemov wrote:
> > > Most of the work happens on the head page. Only when we need to copy data
> > > to userspace do we find the relevant subpage.
> > > 
> > > We are still limited by PAGE_SIZE per iteration. Lifting this limitation
> > > would require some more work.
> >
> > Hum, I'm kind of lost.
> 
> The limitation here comes from how copy_page_to_iter() and
> copy_page_from_iter() work wrt. highmem: it can only handle one small
> page a time.
> 
> On the write side, we also have a problem with assuming small pages: write
> length and offset within the page are calculated before we know if a small or
> huge page is allocated. It's not easy to fix. Looks like it would require a
> change in the ->write_begin() interface to accept len > PAGE_SIZE.
>
> > Can you point me to some design document / email that would explain some
> > high level ideas how are huge pages in page cache supposed to work?
> 
> I'll elaborate more in cover letter to next revision.
> 
> > When are we supposed to operate on the head page and when on subpage?
> 
> It's case-by-case. See above explanation why we're limited to PAGE_SIZE
> here.
> 
> > What is protected by the page lock of the head page?
> 
> Whole huge page. As with anon pages.
> 
> > Do page locks of subpages play any role?
> 
> lock_page() on any subpage would lock whole huge page.
> 
> > If understand right, e.g.  pagecache_get_page() will return subpages but
> > is it generally safe to operate on subpages individually or do we have
> > to be aware that they are part of a huge page?
> 
> I tried to make it as transparent as possible: page flag operations will
> be redirected to head page, if necessary. Things like page_mapping() and
> page_to_pgoff() know about huge pages.
> 
> Direct access to struct page fields must be avoided for tail pages as most
> of them don't have the meaning you would expect for small pages.

OK, good to know.
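
For reference, the head-page/subpage index arithmetic implied above, assuming
a PMD-sized huge page spans 512 contiguous 4k subpages; the helpers below are
stand-ins rather than the kernel's compound-page API:

#include <stdio.h>

#define HPAGE_PMD_NR 512UL      /* 2MB / 4KB */

/* page-cache index of the head page for a given file index */
static unsigned long head_index(unsigned long index)
{
        return index & ~(HPAGE_PMD_NR - 1);
}

/* which subpage inside the huge page a given index refers to */
static unsigned long subpage_offset(unsigned long index)
{
        return index & (HPAGE_PMD_NR - 1);
}

int main(void)
{
        unsigned long index = 1000;     /* some 4k-granular page-cache index */

        printf("index %lu -> head page at %lu, subpage %lu of %lu\n",
               index, head_index(index), subpage_offset(index), HPAGE_PMD_NR);
        return 0;
}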

> > If I understand the motivation right, it is mostly about being able to mmap
> > PMD-sized chunks to userspace. So my naive idea would be that we could just
> > implement it by allocating PMD sized chunks of pages when adding pages to
> > page cache, we don't even have to read them all unless we come from PMD
> > fault path.
> 
> Well, no. We have one PG_{uptodate,dirty,writeback,mappedtodisk,etc}
> per-hugepage, one common list of buffer heads...
> 
> PG_dirty and PG_uptodate behaviour is inherited from anon-THP (where handling
> it otherwise doesn't make sense) and handling it differently for file-THP
> is a nightmare from a maintenance POV.

But the complexity of two different page sizes for page cache and *each*
filesystem that wants to support it does not make the maintenance easy
either. So I'm not convinced that using the same rules for anon-THP and
file-THP is a clear win. And if we have these two options neither of which
has negligible maintenance cost, I'd also like to see more justification
for why it is a good idea to have file-THP for normal filesystems. Do you
have any performance numbers that show it is a win under some realistic
workload?

I'd also note that having PMD-sized pages has some obvious disadvantages as
well:

1) I'm not sure buffer head handling code will quite scale to 512 or even
2048 buffer_heads on a linked list referenced from a page. It may work but
I suspect the performance will suck. 

2) PMD-sized pages result in increased space & memory usage.

3) In ext4 we have to estimate how much metadata we may need to modify when
allocating blocks underlying a page in the worst case (you don't seem to
update this estimate in your patch set). With 2048 blocks underlying a page,
each possibly in a different block group, it is a lot of metadata forcing
us to reserve a large transaction (not sure if you'll be able to even
reserve such large transaction with the default journal size), which again
makes things slower.

4) As you have noted some places like write_begin() still depend on 4k
pages which creates a strange mix of places that use subpages and that use
head pages.

All this would be a non-issue (well, except 2 I guess) if we just didn't
expose filesystems to the fact that something like file-THP exists.

> > Reclaim may need to be aware not to split pages unnecessarily
> > but that's about it. So I'd like to understand what's wrong with this
> > naive idea and why do filesystems need to be aware that someone wants to
> > map in PMD sized chun

Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler

2016-10-28 Thread Jan Kara
On Thu 27-10-16 10:26:18, Jens Axboe wrote:
> On 10/27/2016 03:26 AM, Jan Kara wrote:
> >On Wed 26-10-16 10:12:38, Jens Axboe wrote:
> >>On 10/26/2016 10:04 AM, Paolo Valente wrote:
> >>>
> >>>>Il giorno 26 ott 2016, alle ore 17:32, Jens Axboe <ax...@kernel.dk> ha 
> >>>>scritto:
> >>>>
> >>>>On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
> >>>>>On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
> >>>>>>The question to ask first is whether to actually have pluggable
> >>>>>>schedulers on blk-mq at all, or just have one that is meant to
> >>>>>>do the right thing in every case (and possibly can be bypassed
> >>>>>>completely).
> >>>>>
> >>>>>That would be my preference.  Have a BFQ-variant for blk-mq as an
> >>>>>option (default to off unless opted in by the driver or user), and
> >>>>>not other scheduler for blk-mq.  Don't bother with bfq for non
> >>>>>blk-mq.  It's not like there is any advantage in the legacy-request
> >>>>>device even for slow devices, except for the option of having I/O
> >>>>>scheduling.
> >>>>
> >>>>It's the only right way forward. blk-mq might not offer any substantial
> >>>>advantages to rotating storage, but with scheduling, it won't offer a
> >>>>downside either. And it'll take us towards the real goal, which is to
> >>>>have just one IO path.
> >>>
> >>>ok
> >>>
> >>>>Adding a new scheduler for the legacy IO path
> >>>>makes no sense.
> >>>
> >>>I would fully agree if effective and stable I/O scheduling would be
> >>>available in blk-mq in one or two months.  But I guess that it will
> >>>take at least one year optimistically, given the current status of the
> >>>needed infrastructure, and given the great difficulties of doing
> >>>effective scheduling at the high parallelism and extreme target speeds
> >>>of blk-mq.  Of course, this holds true unless little clever scheduling
> >>>is performed.
> >>>
> >>>So, what's the point in forcing a lot of users wait another year or
> >>>more, for a solution that has yet to be even defined, while they could
> >>>enjoy a much better system, and then switch an even better system when
> >>>scheduling is ready in blk-mq too?
> >>
> >>That same argument could have been made 2 years ago. Saying no to a new
> >>scheduler for the legacy framework goes back roughly that long. We could
> >>have had BFQ for mq NOW, if we didn't keep coming back to this very
> >>point.
> >>
> >>I'm hesitant to add a new scheduler because it's very easy to add, very
> >>difficult to get rid of. If we do add BFQ as a legacy scheduler now,
> >>it'll take us years and years to get rid of it again. We should be
> >>moving towards LESS moving parts in the legacy path, not more.
> >>
> >>We can keep having this discussion every few years, but I think we'd
> >>both prefer to make some actual progress here. It's perfectly fine to
> >>add an interface for a single queue interface for an IO scheduler for
> >>blk-mq, since we don't care too much about scalability there. And that
> >>won't take years, that should be a few weeks. Retrofitting BFQ on top of
> >>that should not be hard either. That can co-exist with a real multiqueue
> >>scheduler as well, something that's geared towards some fairness for
> >>faster devices.
> >
> >OK, so some solution like having a variant of blk_sq_make_request() that
> >will consume requests, do IO scheduling decisions on them, and feed them
> >into the HW queue as it sees fit would be acceptable? That will provide the
> >IO scheduler a global view that it needs for complex scheduling decisions
> >so it should indeed be relatively easy to port BFQ to work like that.
> 
> I'd probably start off Omar's base [1] that switches the software queues
> to store bios instead of requests, since that lifts the of the 1:1
> mapping between what we can queue up and what we can dispatch. Without
> that, the IO scheduler won't have too much to work with. And with that
> in place, it'll be a "bio in, request out" type of setup, which is
> similar to what we have in the legacy path.
>
> I'd keep the software queues, but as a starting point, mandate 1
> h

Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler

2016-10-27 Thread Jan Kara
On Wed 26-10-16 10:12:38, Jens Axboe wrote:
> On 10/26/2016 10:04 AM, Paolo Valente wrote:
> >
> >>Il giorno 26 ott 2016, alle ore 17:32, Jens Axboe <ax...@kernel.dk> ha 
> >>scritto:
> >>
> >>On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
> >>>On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
> >>>>The question to ask first is whether to actually have pluggable
> >>>>schedulers on blk-mq at all, or just have one that is meant to
> >>>>do the right thing in every case (and possibly can be bypassed
> >>>>completely).
> >>>
> >>>That would be my preference.  Have a BFQ-variant for blk-mq as an
> >>>option (default to off unless opted in by the driver or user), and
> >>>not other scheduler for blk-mq.  Don't bother with bfq for non
> >>>blk-mq.  It's not like there is any advantage in the legacy-request
> >>>device even for slow devices, except for the option of having I/O
> >>>scheduling.
> >>
> >>It's the only right way forward. blk-mq might not offer any substantial
> >>advantages to rotating storage, but with scheduling, it won't offer a
> >>downside either. And it'll take us towards the real goal, which is to
> >>have just one IO path.
> >
> >ok
> >
> >>Adding a new scheduler for the legacy IO path
> >>makes no sense.
> >
> >I would fully agree if effective and stable I/O scheduling would be
> >available in blk-mq in one or two months.  But I guess that it will
> >take at least one year optimistically, given the current status of the
> >needed infrastructure, and given the great difficulties of doing
> >effective scheduling at the high parallelism and extreme target speeds
> >of blk-mq.  Of course, this holds true unless little clever scheduling
> >is performed.
> >
> >So, what's the point in forcing a lot of users wait another year or
> >more, for a solution that has yet to be even defined, while they could
> >enjoy a much better system, and then switch an even better system when
> >scheduling is ready in blk-mq too?
> 
> That same argument could have been made 2 years ago. Saying no to a new
> scheduler for the legacy framework goes back roughly that long. We could
> have had BFQ for mq NOW, if we didn't keep coming back to this very
> point.
> 
> I'm hesitant to add a new scheduler because it's very easy to add, very
> difficult to get rid of. If we do add BFQ as a legacy scheduler now,
> it'll take us years and years to get rid of it again. We should be
> moving towards LESS moving parts in the legacy path, not more.
> 
> We can keep having this discussion every few years, but I think we'd
> both prefer to make some actual progress here. It's perfectly fine to
> add an interface for a single queue interface for an IO scheduler for
> blk-mq, since we don't care too much about scalability there. And that
> won't take years, that should be a few weeks. Retrofitting BFQ on top of
> that should not be hard either. That can co-exist with a real multiqueue
> scheduler as well, something that's geared towards some fairness for
> faster devices.

OK, so some solution like having a variant of blk_sq_make_request() that
will consume requests, do IO scheduling decisions on them, and feed them
into the HW queue as it sees fit would be acceptable? That will provide the
IO scheduler a global view that it needs for complex scheduling decisions
so it should indeed be relatively easy to port BFQ to work like that.
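
Purely as an illustration of that flow (incoming IO staged where a scheduler
can reorder it before being fed to the hardware queue), a user-space toy that
does not correspond to any real block layer API:

#include <stdio.h>
#include <stdlib.h>

struct toy_bio {
        unsigned long sector;
};

#define STAGE_MAX 16

static struct toy_bio staged[STAGE_MAX];
static int nr_staged;

/* the make_request-like entry point: just stage the bio for the scheduler */
static void submit_bio(unsigned long sector)
{
        if (nr_staged < STAGE_MAX)
                staged[nr_staged++].sector = sector;
}

static int cmp_sector(const void *a, const void *b)
{
        const struct toy_bio *x = a, *y = b;

        return (x->sector > y->sector) - (x->sector < y->sector);
}

/* the scheduler's dispatch decision: here a trivial sector-order elevator */
static void dispatch_all(void)
{
        int i;

        qsort(staged, nr_staged, sizeof(staged[0]), cmp_sector);
        for (i = 0; i < nr_staged; i++)
                printf("dispatch sector %lu to HW queue\n", staged[i].sector);
        nr_staged = 0;
}

int main(void)
{
        submit_bio(900);
        submit_bio(10);
        submit_bio(400);
        dispatch_all();
        return 0;
}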

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCHv3 17/41] filemap: handle huge pages in filemap_fdatawait_range()

2016-10-31 Thread Jan Kara
On Mon 24-10-16 14:36:25, Kirill A. Shutemov wrote:
> On Thu, Oct 13, 2016 at 03:18:02PM +0200, Jan Kara wrote:
> > On Thu 13-10-16 15:08:44, Kirill A. Shutemov wrote:
> > > On Thu, Oct 13, 2016 at 11:44:41AM +0200, Jan Kara wrote:
> > > > On Thu 15-09-16 14:54:59, Kirill A. Shutemov wrote:
> > > > > We writeback whole huge page a time.
> > > > 
> > > > This is one of the things I don't understand. Firstly I didn't see where
> > > > changes of writeback like this would happen (maybe they come later).
> > > > Secondly I'm not sure why e.g. writeback should behave atomically wrt 
> > > > huge
> > > > pages. Is this because radix-tree multiorder entry tracks dirtiness for 
> > > > us
> > > > at that granularity?
> > > 
> > > We track dirty/writeback on per-compound pages: meaning we have one
> > > dirty/writeback flag for whole compound page, not on every individual
> > > 4k subpage. The same story for radix-tree tags.
> > > 
> > > > BTW, can you also explain why do we need multiorder entries? What do
> > > > they solve for us?
> > > 
> > > It helps us having coherent view on tags in radix-tree: no matter which
> > > index we refer from the range huge page covers we will get the same
> > > answer on which tags set.
> > 
> > OK, understand that. But why do we need a coherent view? For which purposes
> > exactly do we care that it is not just a bunch of 4k pages that happen to
> > be physically contiguous and thus can be mapped in one PMD?
> 
> My understanding is that things like PageDirty() should be handled on the
> same granularity as PAGECACHE_TAG_DIRTY, otherwise things can go horribly
> wrong...

Yeah, I agree with that. My question was rather aiming in the direction:
Why don't we keep PageDirty and PAGECACHE_TAG_DIRTY on a page granularity?
Why do we push all this to happen only in the head page?

In your cover letter for the latest version (BTW thanks for expanding
explanations there) you write:
  - head page (the first subpage) on LRU represents whole huge page;
  - head page's flags represent state of whole huge page (with few
exceptions);
  - mm can't migrate subpages of the compound page individually;

So the fact that flags of a head page represent flags of each individual
page is the decision that I'm questioning, at least for PageDirty and
PageWriteback flags. I'm asking because frankly, I don't like the series
much. IMHO too many places need to know about huge pages and things will
get broken frequently. And from filesystem POV I don't really see why a
filesystem should care about huge pages *at all*. Sure functions allocating
pages into page cache need to care, sure functions mapping pages into page
tables need to care. But nobody else should need to be aware we are playing
some huge page games... At least that is my idea how things ought to work
;)

Your solution seems to go more towards the direction where we have two
different sizes of pages in the system and everyone has to cope with it.
But I'd also note that you go only half way there - e.g. page lookup
functions still work with subpages, some places still use PAGE_SIZE &
page->index, ... - so the result is a strange mix.

So what are the reasons for having pages forming a huge page bound so
tightly?


Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()

2016-11-03 Thread Jan Kara
On Wed 02-11-16 07:36:12, Christoph Hellwig wrote:
> On Tue, Nov 01, 2016 at 05:39:40PM +0100, Jan Kara wrote:
> > I'd also note that having PMD-sized pages has some obvious disadvantages as
> > well:
> > 
> > 1) I'm not sure buffer head handling code will quite scale to 512 or even
> > 2048 buffer_heads on a linked list referenced from a page. It may work but
> > I suspect the performance will suck. 
> 
> buffer_head handling always sucks.  For the iomap based bufferd write
> path I plan to support a buffer_head-less mode for the block size ==
> PAGE_SIZE case in 4.11 latest, but if I get enough other things of my
> plate in time even for 4.10.  I think that's the right way to go for
> THP, especially if we require the fs to allocate the whole huge page
> as a single extent, similar to the DAX PMD mapping case.

Yeah, if we require whole THP to be backed by a single extent, things get
simpler. But still there's the issue that ext4 cannot easily use iomap code
for buffered writes because of the data exposure issue we already talked
about - well, ext4 could actually work (it supports unwritten extents) but
old compatibility modes won't work and I'd strongly prefer not to have two
independent write paths in ext4... But I'll put more thought into this, I
have some idea how we could hack around the problem even for on-disk formats
that don't support unwritten extents. The trick we could use is that we'd
just mark the range of file as unwritten in memory in extent cache we have,
that should protect us against exposing uninitialized pages in racing
faults.

> > 2) PMD-sized pages result in increased space & memory usage.
> 
> How so?

Well, memory usage is clear I guess - if the files are smaller than THP
size, or if you don't use all the 4k pages that are forming THP you are
wasting memory. Sure it can be somewhat controlled by the heuristics
deciding when to use THP in pagecache and when to fall back to 4k pages.

Regarding space usage - it is mostly the case for sparse mmaped IO where
you always have to allocate (and write out) all the blocks underlying a THP
that gets written to, even though you may only need 4K from that area...

> > 3) In ext4 we have to estimate how much metadata we may need to modify when
> > allocating blocks underlying a page in the worst case (you don't seem to
> > update this estimate in your patch set). With 2048 blocks underlying a page,
> > each possibly in a different block group, it is a lot of metadata forcing
> > us to reserve a large transaction (not sure if you'll be able to even
> > reserve such large transaction with the default journal size), which again
> > makes things slower.
> 
> As said above I think we should only use huge page mappings if there is
> a single underlying extent, same as in DAX to keep the complexity down.
> 
> > 4) As you have noted some places like write_begin() still depend on 4k
> > pages which creates a strange mix of places that use subpages and that use
> > head pages.
> 
> Just use the iomap bufferd I/O code and all these issues will go away.

Yep, the above two things would make things somewhat less ugly I agree.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()

2016-11-03 Thread Jan Kara
On Wed 02-11-16 11:32:04, Kirill A. Shutemov wrote:
> On Tue, Nov 01, 2016 at 05:39:40PM +0100, Jan Kara wrote:
> > On Mon 31-10-16 21:10:35, Kirill A. Shutemov wrote:
> > > > If I understand the motivation right, it is mostly about being able to 
> > > > mmap
> > > > PMD-sized chunks to userspace. So my naive idea would be that we could 
> > > > just
> > > > implement it by allocating PMD sized chunks of pages when adding pages 
> > > > to
> > > > page cache, we don't even have to read them all unless we come from PMD
> > > > fault path.
> > > 
> > > Well, no. We have one PG_{uptodate,dirty,writeback,mappedtodisk,etc}
> > > per-hugepage, one common list of buffer heads...
> > > 
> > > PG_dirty and PG_uptodate behaviour is inherited from anon-THP (where handling
> > > it otherwise doesn't make sense) and handling it differently for file-THP
> > > is a nightmare from a maintenance POV.
> > 
> > But the complexity of two different page sizes for page cache and *each*
> > filesystem that wants to support it does not make the maintenance easy
> > either.
> 
> I think with time we can make small pages just a subcase of huge pages.
> And some generalization can be made once more than one filesystem with
> backing storage will adopt huge pages.

My objection is that IMHO currently the code is too ugly to go in. Too many
places need to know about THP and I'm not even sure you have patched all
the places, whether some corner cases remain unfixed, or how I should find
that out.

> > So I'm not convinced that using the same rules for anon-THP and
> > file-THP is a clear win.
> 
> We already have file-THP with the same rules: tmpfs. Backing storage is
> what changes the picture.

Right, the ugliness comes from access to backing storage having to deal
with huge pages.

> > I'd also note that having PMD-sized pages has some obvious disadvantages as
> > well:
> >
> > 1) I'm not sure buffer head handling code will quite scale to 512 or even
> > 2048 buffer_heads on a linked list referenced from a page. It may work but
> > I suspect the performance will suck.
> 
> Yes, buffer_head list doesn't scale. That's the main reason (along with 4)
> why syscall-based IO sucks. We spend a lot of time looking for desired
> block.
> 
> We need to switch to some other data structure for storing buffer_heads.
> Is there a reason why we have a list there in the first place?
> Why not just an array?
> 
> I will look into it, but this sounds like a separate infrastructure change
> project.

As Christoph said iomap code should help you with that and make things
simpler. If things go as we imagine, we should be able to pretty much avoid
buffer heads. But it will take some time to get there.

> > 2) PMD-sized pages result in increased space & memory usage.
> 
> Space? Do you mean disk space? Not really: we still don't write beyond
> i_size or into holes.
> 
> Behaviour wrt to holes may change with mmap()-IO as we have less
> granularity, but the same can be seen just between different
> architectures: 4k vs. 64k base page size.

Yes, I meant different granularity of mmap based IO. And I agree it isn't a
new problem but the scale of the problem is much larger with 2MB pages than
with say 64K pages. And actually the overhead of higher IO granularity of
64K pages has been one of the reasons we have switched SLES PPC kernels
from 64K pages to 4K pages (we've got complaints from customers). 

> > 3) In ext4 we have to estimate how much metadata we may need to modify when
> > allocating blocks underlying a page in the worst case (you don't seem to
> > update this estimate in your patch set). With 2048 blocks underlying a page,
> > each possibly in a different block group, it is a lot of metadata forcing
> > us to reserve a large transaction (not sure if you'll be able to even
> > reserve such large transaction with the default journal size), which again
> > makes things slower.
> 
> I didn't see this in profiles. And xfstests looks fine. I probably need to
> run them with 1k blocks once again.

You wouldn't see this in profiles - it is a correctness thing. And it won't
be triggered unless the file is heavily fragmented which likely does not
happen with any test in xfstests. If it happens you'll notice though - the
filesystem will just report error and shut itself down.

> The numbers below generated with fio. The working set is relatively small,
> so it fits into page cache and writing set doesn't hit dirty_ratio.
> 
> I think the mmap performance should be enough to justify initial inclusion
> of an experimental feature: it's useful for workloads that target mmap()-IO.
> 

Re: [PATCHv3 14/41] filemap: allocate huge page in page_cache_read(), if allowed

2016-10-11 Thread Jan Kara
On Thu 15-09-16 14:54:56, Kirill A. Shutemov wrote:
> This patch adds basic functionality to put a huge page into the page cache.
> 
> At the moment we only put huge pages into the radix-tree if the range covered
> by the huge page is empty.
> 
> We ignore shadow entries for now, just remove them from the tree before
> inserting the huge page.
> 
> Later we can add logic to accumulate information from shadow entries to
> return to the caller (average eviction time?).
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
> ---
>  include/linux/fs.h  |   5 ++
>  include/linux/pagemap.h |  21 ++-
>  mm/filemap.c| 148 
> +++-
>  3 files changed, 157 insertions(+), 17 deletions(-)
> 
...
> @@ -663,16 +663,55 @@ static int __add_to_page_cache_locked(struct page *page,
>   page->index = offset;
>  
>   spin_lock_irq(&mapping->tree_lock);
> - error = page_cache_tree_insert(mapping, page, shadowp);
> + if (PageTransHuge(page)) {
> + struct radix_tree_iter iter;
> + void **slot;
> + void *p;
> +
> + error = 0;
> +
> + /* Wipe shadow entries */
> + radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, 
> offset) {
> + if (iter.index >= offset + HPAGE_PMD_NR)
> + break;
> +
> + p = radix_tree_deref_slot_protected(slot,
> + &mapping->tree_lock);
> + if (!p)
> + continue;
> +
> + if (!radix_tree_exception(p)) {
> + error = -EEXIST;
> + break;
> + }
> +
> + mapping->nrexceptional--;
> + rcu_assign_pointer(*slot, NULL);

I think you also need something like workingset_node_shadows_dec(node)
here. It would be even better if you used something like
clear_exceptional_entry() to have the logic in one place (you obviously
need to factor out only part of clear_exceptional_entry() first).

> + }
> +
> + if (!error)
> + error = __radix_tree_insert(&mapping->page_tree, offset,
> + compound_order(page), page);
> +
> + if (!error) {
> + count_vm_event(THP_FILE_ALLOC);
> + mapping->nrpages += HPAGE_PMD_NR;
> + *shadowp = NULL;
> + __inc_node_page_state(page, NR_FILE_THPS);
> + }
> + } else {
> + error = page_cache_tree_insert(mapping, page, shadowp);
> + }

And I'd prefer to have this logic moved to page_cache_tree_insert() because
logically it IMHO belongs there - it is a simply another case of handling
of radix tree used for page cache.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR