Re: [PATCH V3 00/16] Introduce the BFQ I/O scheduler

2017-04-12 Thread Paolo Valente

> On 11 Apr 2017, at 20:31, Bart Van Assche
> wrote:
> 
> On Tue, 2017-04-11 at 19:37 +0200, Paolo Valente wrote:
>> Just pushed:
>> https://github.com/Algodev-github/bfq-mq/tree/add-bfq-mq-logical
> 
> Thanks!
> 
> But are you aware that the code on that branch doesn't build?
> 
> $ make all
> [ ... ]
> ERROR: "bfq_mark_bfqq_busy" [block/bfq-wf2q.ko] undefined!
> ERROR: "bfqg_stats_update_dequeue" [block/bfq-wf2q.ko] undefined!
> [ ... ]
> 
> $ PAGER= git grep bfq_mark_bfqq_busy
> block/bfq-wf2q.c:   bfq_mark_bfqq_busy(bfqq);
> 

That's exactly the complaint of the kbuild test robot.  As I wrote,
the build completes with no problem on my test system (Ubuntu 16.04,
gcc 5.4.0), even with the exact offending tree and .config that the
robot reports.

I don't understand what is going on.  In your case, as for the test
robot, the compilation of the file block/bfq-wf2q.c as a module
component fails, because that file does not contain the definitions
of the reported functions.  But those definitions are (uniquely) in
the file block/bfq-iosched.c, which is supposed to be compiled
together with the former file, according to the following rule in
block/Makefile:
obj-$(CONFIG_IOSCHED_BFQ)   += bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
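
For comparison, kbuild's composite-object syntax, which links several
objects into a single module, would normally look like this (just a
sketch of the usual pattern, not what the branch currently contains):

bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
obj-$(CONFIG_IOSCHED_BFQ) += bfq.o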

I have tried all combinations of configurations for bfq (built-in or
module, with or without cgroups support), always successfully.  If it
makes any sense to share this information, these are the exact
commands I used to test all combinations (in addition to making full
builds in some cases, and trying make all as in your case):

make O=builddir M=block

and

make O=builddir M=block modules

Where is my mistake?

Thanks,
Paolo

> Bart.



Re: [PATCH 04/25] fs: Provide infrastructure for dynamic BDIs in filesystems

2017-04-12 Thread Christoph Hellwig
> + if (sb->s_iflags & SB_I_DYNBDI) {
> + bdi_put(sb->s_bdi);
> + sb->s_bdi = &noop_backing_dev_info;

At some point I'd really like to get rid of noop_backing_dev_info and
have a NULL here..
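
Teardown in generic_shutdown_super() could then presumably collapse to
something like this (a sketch only, not part of this series):

	if (sb->s_bdi) {
		bdi_put(sb->s_bdi);
		sb->s_bdi = NULL;
	}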

Otherwise this looks fine..

Reviewed-by: Christoph Hellwig 


Re: [PATCH 05/25] fs: Get proper reference for s_bdi

2017-04-12 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 06/25] lustre: Convert to separately allocated bdi

2017-04-12 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH v5] lightnvm: pblk

2017-04-12 Thread Javier González
> On 12 Apr 2017, at 00.23, Bart Van Assche  wrote:
> 
> On Wed, 2017-04-12 at 00:13 +0200, Javier González wrote:
>> please point out any other tools/concerns you may have.
> 
> Hello Javier,
> 
> Do you already have an account at https://scan.coverity.com/? Any Linux
> kernel developer can get an account for free. A full Coverity scan of
> Linus' tree is available at https://scan.coverity.com/projects/linux.

Hi Bart,

No, I did not. Thanks for the invite. I just created an account and am
waiting for approval.

Javier




Re: [PATCH 03/25] bdi: Export bdi_alloc_node() and bdi_put()

2017-04-12 Thread Christoph Hellwig
On Wed, Mar 29, 2017 at 12:56:01PM +0200, Jan Kara wrote:
> MTD will want to call bdi_alloc_node() and bdi_put() directly. Export
> these functions.
> 
> Signed-off-by: Jan Kara 

Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 18/25] gfs2: Convert to properly refcounting bdi

2017-04-12 Thread Christoph Hellwig
On Wed, Mar 29, 2017 at 12:56:16PM +0200, Jan Kara wrote:
> Similarly to set_bdev_super(), GFS2 just used the block device's
> reference to the bdi. Convert it to properly getting its own bdi
> reference. The reference will get automatically dropped on superblock
> destruction.

Hmm, why isn't gfs2 simply using the generic mount_bdev code?

Otherwise looks fine:

Reviewed-by: Christoph Hellwig 


Re: [PATCH rfc 0/6] Automatic affinity settings for nvme over rdma

2017-04-12 Thread Christoph Hellwig
On Mon, Apr 10, 2017 at 01:05:50PM -0500, Steve Wise wrote:
> I'll test cxgb4 if you convert it. :)

That will take a lot of work.  The problem with cxgb4 is that it
allocates all the interrupts at device enable time, but then only
allocates them to ULDs when they attach, while this scheme assumes a
way to map out queues / vectors at initialization time.
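
For reference, the initialization-time model this series builds on is
managed MSI-X affinity set up once at probe time, roughly like this (a
sketch with illustrative names and numbers, not actual cxgb4 code):

	/* Reserve one vector for non-queue interrupts and let the core
	 * spread the remaining queue vectors across CPUs. */
	struct irq_affinity affd = { .pre_vectors = 1 };
	int nvec;

	nvec = pci_alloc_irq_vectors_affinity(pdev, 2, nr_queues + 1,
			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
	if (nvec < 0)
		return nvec;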


Re: [PATCH 11/25] ecryptfs: Convert to separately allocated bdi

2017-04-12 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 12/25] afs: Convert to separately allocated bdi

2017-04-12 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 16/25] fuse: Convert to separately allocated bdi

2017-04-12 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 25/25] bdi: Drop 'parent' argument from bdi_register[_va]()

2017-04-12 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 22/25] ubifs: Convert to separately allocated bdi

2017-04-12 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 21/25] nfs: Convert to separately allocated bdi

2017-04-12 Thread Christoph Hellwig
>  /*
>   * Finish setting up an NFS2/3 superblock
>   */

I was just looking at why you didn't update the v4 variant, but it
seems like the comment above is simply incorrect..

Thus the patch looks fine:

Reviewed-by: Christoph Hellwig 


Re: [PATCH 23/25] fs: Remove SB_I_DYNBDI flag

2017-04-12 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 24/25] block: Remove unused functions

2017-04-12 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH] remove the mg_disk driver

2017-04-12 Thread Hannes Reinecke
On 04/12/2017 07:58 AM, Christoph Hellwig wrote:
> Any comments?  Getting rid of this driver which was never wired up
> at all would help with some of the pending block work..
> 
> On Thu, Apr 06, 2017 at 01:28:46PM +0200, Christoph Hellwig wrote:
>> This driver was added in 2008, but as far as I can tell we never had a
>> single platform that actually registered resources for the platform driver.
>>
>> It's also been unmaintained for a long time and apparently has an ATA mode
>> that can be driven using the IDE/libata subsystem.
>>
>> Signed-off-by: Christoph Hellwig 
>> ---
>>  Documentation/blockdev/mflash.txt |   84 ---
>>  drivers/block/Kconfig |   17 -
>>  drivers/block/Makefile|1 -
>>  drivers/block/mg_disk.c   | 1110 -
>>  include/linux/mg_disk.h   |   45 --
>>  5 files changed, 1257 deletions(-)
>>  delete mode 100644 Documentation/blockdev/mflash.txt
>>  delete mode 100644 drivers/block/mg_disk.c
>>  delete mode 100644 include/linux/mg_disk.h
>>
Go.

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 13/25] mtd: Convert to dynamically allocated bdi infrastructure

2017-04-12 Thread Christoph Hellwig
> + sb->s_bdi = bdi_get(mtd_bdi);
> + sb->s_iflags |= SB_I_DYNBDI;

FYI, while I think this is a faithful conversion of the existing code,
the single bdi for all MTD devices looks rather bogus to me..

Otherwise this looks good:

Reviewed-by: Christoph Hellwig 


Re: [PATCH 14/25] coda: Convert to separately allocated bdi

2017-04-12 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 15/25] exofs: Convert to separately allocated bdi

2017-04-12 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 19/25] nilfs2: Convert to properly refcounting bdi

2017-04-12 Thread Christoph Hellwig
On Wed, Mar 29, 2017 at 12:56:17PM +0200, Jan Kara wrote:
> Similarly to set_bdev_super(), NILFS2 just used the block device's
> reference to the bdi. Convert it to properly getting its own bdi
> reference. The reference will get automatically dropped on superblock
> destruction.

I really wish we could get rid of this open coding in block based
file systems..

Otherwise looks fine:

Reviewed-by: Christoph Hellwig 


Re: [PATCH 20/25] ncpfs: Convert to separately allocated bdi

2017-04-12 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 1/9] Use RWF_* flags for AIO operations

2017-04-12 Thread Christoph Hellwig
>  
> + if (unlikely(iocb->aio_rw_flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))) {
> + pr_debug("EINVAL: aio_rw_flags set with incompatible flags\n");
> + return -EINVAL;
> + }

> + if (iocb->aio_rw_flags & RWF_HIPRI)
> + req->common.ki_flags |= IOCB_HIPRI;
> + if (iocb->aio_rw_flags & RWF_DSYNC)
> + req->common.ki_flags |= IOCB_DSYNC;
> + if (iocb->aio_rw_flags & RWF_SYNC)
> + req->common.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);

Please introduce a common helper to share this code between the
synchronous and the aio paths.
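
A minimal sketch of what such a helper could look like (name and
placement are illustrative, not part of the posted patch):

	static inline int kiocb_set_rw_flags(struct kiocb *ki, int flags)
	{
		if (unlikely(flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC)))
			return -EINVAL;

		if (flags & RWF_HIPRI)
			ki->ki_flags |= IOCB_HIPRI;
		if (flags & RWF_DSYNC)
			ki->ki_flags |= IOCB_DSYNC;
		if (flags & RWF_SYNC)
			ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
		return 0;
	}

Both the synchronous read/write path and the aio path could then
translate the user-supplied RWF_* bits with a single call.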



Re: [PATCH 13/25] mtd: Convert to dynamically allocated bdi infrastructure

2017-04-12 Thread Jan Kara
On Wed 12-04-17 01:14:06, Christoph Hellwig wrote:
> > +   sb->s_bdi = bdi_get(mtd_bdi);
> > +   sb->s_iflags |= SB_I_DYNBDI;
> 
> FYI, while I think this is a faithful conversion of the existing code,
> the single bdi for all MTD devices looks rather bogus to me..

Yeah, I don't understand why they don't allocate a backing_dev_info per
superblock as other places do, but I didn't dare to change that since
there are user-visible consequences. I suspect they don't really want
anything from the bdi and allocate it only so that other parts of the
kernel are happy.
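
For comparison, the per-superblock pattern the rest of this series
converts filesystems to is essentially (sketch):

	ret = super_setup_bdi(sb);	/* sb->s_bdi gets its own refcounted bdi */
	if (ret)
		return ret;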

> Otherwise this looks good:
> 
> Reviewed-by: Christoph Hellwig 

Thanks.

Honza
-- 
Jan Kara 
SUSE Labs, CR


[PATCH 18/25] gfs2: Convert to properly refcounting bdi

2017-04-12 Thread Jan Kara
Similarly to set_bdev_super(), GFS2 just used the block device's
reference to the bdi. Convert it to properly getting its own bdi
reference. The reference will get automatically dropped on superblock
destruction.

CC: Steven Whitehouse 
CC: Bob Peterson 
CC: cluster-de...@redhat.com
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/gfs2/ops_fstype.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index b108e7ba81af..e6b6f97d0fc1 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include <linux/backing-dev.h>
 
 #include "gfs2.h"
 #include "incore.h"
@@ -1222,12 +1223,8 @@ static int set_gfs2_super(struct super_block *s, void *data)
 {
s->s_bdev = data;
s->s_dev = s->s_bdev->bd_dev;
-
-   /*
-* We set the bdi here to the queue backing, file systems can
-* overwrite this in ->fill_super()
-*/
-   s->s_bdi = bdev_get_queue(s->s_bdev)->backing_dev_info;
+   s->s_bdi = bdi_get(s->s_bdev->bd_bdi);
+   s->s_iflags |= SB_I_DYNBDI;
return 0;
 }
 
-- 
2.12.0



[PATCH 17/25] fuse: Get rid of bdi_initialized

2017-04-12 Thread Jan Kara
It is not needed anymore since the bdi is initialized whenever the
superblock exists.

CC: Miklos Szeredi 
CC: linux-fsde...@vger.kernel.org
Suggested-by: Miklos Szeredi 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/fuse/dev.c| 5 ++---
 fs/fuse/fuse_i.h | 3 ---
 fs/fuse/inode.c  | 2 --
 3 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 78887f68ee6a..c2d7f3a92679 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -382,7 +382,7 @@ static void request_end(struct fuse_conn *fc, struct fuse_req *req)
	wake_up(&fc->blocked_waitq);
 
if (fc->num_background == fc->congestion_threshold &&
-   fc->connected && fc->bdi_initialized) {
+   fc->connected && fc->sb) {
clear_bdi_congested(fc->sb->s_bdi, BLK_RW_SYNC);
clear_bdi_congested(fc->sb->s_bdi, BLK_RW_ASYNC);
}
@@ -573,8 +573,7 @@ void fuse_request_send_background_locked(struct fuse_conn *fc,
fc->num_background++;
if (fc->num_background == fc->max_background)
fc->blocked = 1;
-   if (fc->num_background == fc->congestion_threshold &&
-   fc->bdi_initialized) {
+   if (fc->num_background == fc->congestion_threshold && fc->sb) {
set_bdi_congested(fc->sb->s_bdi, BLK_RW_SYNC);
set_bdi_congested(fc->sb->s_bdi, BLK_RW_ASYNC);
}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 0e7c79a390e0..f33341d9501a 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -527,9 +527,6 @@ struct fuse_conn {
/** Filesystem supports NFS exporting.  Only set in INIT */
unsigned export_support:1;
 
-   /** Set if bdi is valid */
-   unsigned bdi_initialized:1;
-
/** write-back cache policy (default is write-through) */
unsigned writeback_cache:1;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 90bacbc87fb3..73cf05135252 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -983,8 +983,6 @@ static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb)
/* fuse does it's own writeback accounting */
sb->s_bdi->capabilities = BDI_CAP_NO_ACCT_WB | BDI_CAP_STRICTLIMIT;
 
-   fc->bdi_initialized = 1;
-
/*
 * For a single fuse filesystem use max 1% of dirty +
 * writeback threshold.
-- 
2.12.0



[PATCH 11/25] ecryptfs: Convert to separately allocated bdi

2017-04-12 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Tyler Hicks 
CC: ecryp...@vger.kernel.org
Acked-by: Tyler Hicks 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/ecryptfs/ecryptfs_kernel.h | 1 -
 fs/ecryptfs/main.c| 4 +---
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/ecryptfs/ecryptfs_kernel.h b/fs/ecryptfs/ecryptfs_kernel.h
index 95c1c8d34539..9c351bf757b2 100644
--- a/fs/ecryptfs/ecryptfs_kernel.h
+++ b/fs/ecryptfs/ecryptfs_kernel.h
@@ -349,7 +349,6 @@ struct ecryptfs_mount_crypt_stat {
 struct ecryptfs_sb_info {
struct super_block *wsi_sb;
struct ecryptfs_mount_crypt_stat mount_crypt_stat;
-   struct backing_dev_info bdi;
 };
 
 /* file private data. */
diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
index 151872dcc1f4..9014479d0160 100644
--- a/fs/ecryptfs/main.c
+++ b/fs/ecryptfs/main.c
@@ -519,12 +519,11 @@ static struct dentry *ecryptfs_mount(struct file_system_type *fs_type, int flags
goto out;
}
 
-   rc = bdi_setup_and_register(&sbi->bdi, "ecryptfs");
+   rc = super_setup_bdi(s);
if (rc)
goto out1;
 
ecryptfs_set_superblock_private(s, sbi);
-   s->s_bdi = &sbi->bdi;
 
/* ->kill_sb() will take care of sbi after that point */
sbi = NULL;
@@ -633,7 +632,6 @@ static void ecryptfs_kill_block_super(struct super_block *sb)
if (!sb_info)
return;
	ecryptfs_destroy_mount_crypt_stat(&sb_info->mount_crypt_stat);
-	bdi_destroy(&sb_info->bdi);
kmem_cache_free(ecryptfs_sb_info_cache, sb_info);
 }
 
-- 
2.12.0



[PATCH 15/25] exofs: Convert to separately allocated bdi

2017-04-12 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Boaz Harrosh 
CC: Benny Halevy 
Acked-by: Boaz Harrosh 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/exofs/exofs.h |  1 -
 fs/exofs/super.c | 17 ++---
 2 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
index 2e86086bc940..5dc392404559 100644
--- a/fs/exofs/exofs.h
+++ b/fs/exofs/exofs.h
@@ -64,7 +64,6 @@ struct exofs_dev {
  * our extension to the in-memory superblock
  */
 struct exofs_sb_info {
-   struct backing_dev_info bdi;/* register our bdi with VFS  */
struct exofs_sb_stats s_ess;/* Written often, pre-allocate*/
int s_timeout;  /* timeout for OSD operations */
uint64_ts_nextid;   /* highest object ID used */
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
index 1076a4233b39..819624cfc8da 100644
--- a/fs/exofs/super.c
+++ b/fs/exofs/super.c
@@ -464,7 +464,6 @@ static void exofs_put_super(struct super_block *sb)
sbi->one_comp.obj.partition);
 
exofs_sysfs_sb_del(sbi);
-   bdi_destroy(&sbi->bdi);
exofs_free_sbi(sbi);
sb->s_fs_info = NULL;
 }
@@ -809,8 +808,12 @@ static int exofs_fill_super(struct super_block *sb, void *data, int silent)
__sbi_read_stats(sbi);
 
/* set up operation vectors */
-   sbi->bdi.ra_pages = __ra_pages(&sbi->layout);
-   sb->s_bdi = &sbi->bdi;
+   ret = super_setup_bdi(sb);
+   if (ret) {
+   EXOFS_DBGMSG("Failed to super_setup_bdi\n");
+   goto free_sbi;
+   }
+   sb->s_bdi->ra_pages = __ra_pages(&sbi->layout);
sb->s_fs_info = sbi;
	sb->s_op = &exofs_sops;
	sb->s_export_op = &exofs_export_ops;
@@ -836,14 +839,6 @@ static int exofs_fill_super(struct super_block *sb, void *data, int silent)
goto free_sbi;
}
 
-   ret = bdi_setup_and_register(&sbi->bdi, "exofs");
-   if (ret) {
-   EXOFS_DBGMSG("Failed to bdi_setup_and_register\n");
-   dput(sb->s_root);
-   sb->s_root = NULL;
-   goto free_sbi;
-   }
-
exofs_sysfs_dbg_print();
_exofs_print_device("Mounting", opts->dev_name,
			    ore_comp_dev(&sbi->oc, 0),
-- 
2.12.0



[PATCH 14/25] coda: Convert to separately allocated bdi

2017-04-12 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Jan Harkes 
CC: c...@cs.cmu.edu
CC: codal...@coda.cs.cmu.edu
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/coda/inode.c| 11 ---
 include/linux/coda_psdev.h |  1 -
 2 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/coda/inode.c b/fs/coda/inode.c
index 2dea594da199..6058df380cc0 100644
--- a/fs/coda/inode.c
+++ b/fs/coda/inode.c
@@ -183,10 +183,6 @@ static int coda_fill_super(struct super_block *sb, void *data, int silent)
goto unlock_out;
}
 
-   error = bdi_setup_and_register(&vc->bdi, "coda");
-   if (error)
-   goto unlock_out;
-
vc->vc_sb = sb;
	mutex_unlock(&vc->vc_mutex);
 
@@ -197,7 +193,10 @@ static int coda_fill_super(struct super_block *sb, void *data, int silent)
	sb->s_magic = CODA_SUPER_MAGIC;
	sb->s_op = &coda_super_operations;
	sb->s_d_op = &coda_dentry_operations;
-	sb->s_bdi = &vc->bdi;
+
+   error = super_setup_bdi(sb);
+   if (error)
+   goto error;
 
/* get root fid from Venus: this needs the root inode */
	error = venus_rootfid(sb, &fid);
@@ -228,7 +227,6 @@ static int coda_fill_super(struct super_block *sb, void *data, int silent)
 
 error:
	mutex_lock(&vc->vc_mutex);
-	bdi_destroy(&vc->bdi);
vc->vc_sb = NULL;
sb->s_fs_info = NULL;
 unlock_out:
@@ -240,7 +238,6 @@ static void coda_put_super(struct super_block *sb)
 {
struct venus_comm *vcp = coda_vcp(sb);
	mutex_lock(&vcp->vc_mutex);
-	bdi_destroy(&vcp->bdi);
vcp->vc_sb = NULL;
sb->s_fs_info = NULL;
	mutex_unlock(&vcp->vc_mutex);
diff --git a/include/linux/coda_psdev.h b/include/linux/coda_psdev.h
index 5b8721efa948..31e4e1f1547c 100644
--- a/include/linux/coda_psdev.h
+++ b/include/linux/coda_psdev.h
@@ -15,7 +15,6 @@ struct venus_comm {
struct list_headvc_processing;
int vc_inuse;
struct super_block *vc_sb;
-   struct backing_dev_info bdi;
struct mutexvc_mutex;
 };
 
-- 
2.12.0



[PATCH 07/25] 9p: Convert to separately allocated bdi

2017-04-12 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside the session. This unifies handling of bdi among users.

CC: Eric Van Hensbergen 
CC: Ron Minnich 
CC: Latchesar Ionkov 
CC: v9fs-develo...@lists.sourceforge.net
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/9p/v9fs.c  | 10 +-
 fs/9p/v9fs.h  |  1 -
 fs/9p/vfs_super.c | 15 ---
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/9p/v9fs.c b/fs/9p/v9fs.c
index a89f3cfe3c7d..c202930086ed 100644
--- a/fs/9p/v9fs.c
+++ b/fs/9p/v9fs.c
@@ -333,10 +333,6 @@ struct p9_fid *v9fs_session_init(struct v9fs_session_info *v9ses,
goto err_names;
	init_rwsem(&v9ses->rename_sem);
 
-   rc = bdi_setup_and_register(&v9ses->bdi, "9p");
-   if (rc)
-   goto err_names;
-
v9ses->uid = INVALID_UID;
v9ses->dfltuid = V9FS_DEFUID;
v9ses->dfltgid = V9FS_DEFGID;
@@ -345,7 +341,7 @@ struct p9_fid *v9fs_session_init(struct v9fs_session_info *v9ses,
if (IS_ERR(v9ses->clnt)) {
rc = PTR_ERR(v9ses->clnt);
p9_debug(P9_DEBUG_ERROR, "problem initializing 9p client\n");
-   goto err_bdi;
+   goto err_names;
}
 
v9ses->flags = V9FS_ACCESS_USER;
@@ -415,8 +411,6 @@ struct p9_fid *v9fs_session_init(struct v9fs_session_info *v9ses,
 
 err_clnt:
p9_client_destroy(v9ses->clnt);
-err_bdi:
-   bdi_destroy(&v9ses->bdi);
 err_names:
kfree(v9ses->uname);
kfree(v9ses->aname);
@@ -445,8 +439,6 @@ void v9fs_session_close(struct v9fs_session_info *v9ses)
kfree(v9ses->uname);
kfree(v9ses->aname);
 
-   bdi_destroy(&v9ses->bdi);
-
	spin_lock(&v9fs_sessionlist_lock);
	list_del(&v9ses->slist);
	spin_unlock(&v9fs_sessionlist_lock);
diff --git a/fs/9p/v9fs.h b/fs/9p/v9fs.h
index 443d12e02043..76eaf49abd3a 100644
--- a/fs/9p/v9fs.h
+++ b/fs/9p/v9fs.h
@@ -114,7 +114,6 @@ struct v9fs_session_info {
kuid_t uid; /* if ACCESS_SINGLE, the uid that has access */
struct p9_client *clnt; /* 9p client */
struct list_head slist; /* list of sessions registered with v9fs */
-   struct backing_dev_info bdi;
struct rw_semaphore rename_sem;
 };
 
diff --git a/fs/9p/vfs_super.c b/fs/9p/vfs_super.c
index de3ed8629196..a0965fb587a5 100644
--- a/fs/9p/vfs_super.c
+++ b/fs/9p/vfs_super.c
@@ -72,10 +72,12 @@ static int v9fs_set_super(struct super_block *s, void *data)
  *
  */
 
-static void
+static int
 v9fs_fill_super(struct super_block *sb, struct v9fs_session_info *v9ses,
int flags, void *data)
 {
+   int ret;
+
sb->s_maxbytes = MAX_LFS_FILESIZE;
sb->s_blocksize_bits = fls(v9ses->maxdata - 1);
sb->s_blocksize = 1 << sb->s_blocksize_bits;
@@ -85,7 +87,11 @@ v9fs_fill_super(struct super_block *sb, struct v9fs_session_info *v9ses,
		sb->s_xattr = v9fs_xattr_handlers;
	} else
		sb->s_op = &v9fs_super_ops;
-	sb->s_bdi = &v9ses->bdi;
+
+   ret = super_setup_bdi(sb);
+   if (ret)
+   return ret;
+
if (v9ses->cache)
sb->s_bdi->ra_pages = (VM_MAX_READAHEAD * 1024)/PAGE_SIZE;
 
@@ -99,6 +105,7 @@ v9fs_fill_super(struct super_block *sb, struct v9fs_session_info *v9ses,
 #endif
 
save_mount_options(sb, data);
+   return 0;
 }
 
 /**
@@ -138,7 +145,9 @@ static struct dentry *v9fs_mount(struct file_system_type *fs_type, int flags,
retval = PTR_ERR(sb);
goto clunk_fid;
}
-   v9fs_fill_super(sb, v9ses, flags, data);
+   retval = v9fs_fill_super(sb, v9ses, flags, data);
+   if (retval)
+   goto release_sb;
 
if (v9ses->cache == CACHE_LOOSE || v9ses->cache == CACHE_FSCACHE)
		sb->s_d_op = &v9fs_cached_dentry_operations;
-- 
2.12.0



[PATCH 20/25] ncpfs: Convert to separately allocated bdi

2017-04-12 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Petr Vandrovec 
Acked-by: Petr Vandrovec 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/ncpfs/inode.c | 8 ++--
 fs/ncpfs/ncp_fs_sb.h | 1 -
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/ncpfs/inode.c b/fs/ncpfs/inode.c
index d5606099712a..6d0f14c86099 100644
--- a/fs/ncpfs/inode.c
+++ b/fs/ncpfs/inode.c
@@ -554,12 +554,11 @@ static int ncp_fill_super(struct super_block *sb, void *raw_data, int silent)
	sb->s_magic = NCP_SUPER_MAGIC;
	sb->s_op = &ncp_sops;
	sb->s_d_op = &ncp_dentry_operations;
-	sb->s_bdi = &server->bdi;
 
server = NCP_SBP(sb);
memset(server, 0, sizeof(*server));
 
-   error = bdi_setup_and_register(&server->bdi, "ncpfs");
+   error = super_setup_bdi(sb);
if (error)
goto out_fput;
 
@@ -568,7 +567,7 @@ static int ncp_fill_super(struct super_block *sb, void *raw_data, int silent)
if (data.info_fd != -1) {
		struct socket *info_sock = sockfd_lookup(data.info_fd, &error);
if (!info_sock)
-   goto out_bdi;
+   goto out_fput;
server->info_sock = info_sock;
error = -EBADFD;
if (info_sock->type != SOCK_STREAM)
@@ -746,8 +745,6 @@ static int ncp_fill_super(struct super_block *sb, void *raw_data, int silent)
 out_fput2:
if (server->info_sock)
sockfd_put(server->info_sock);
-out_bdi:
-   bdi_destroy(&server->bdi);
 out_fput:
sockfd_put(sock);
 out:
@@ -788,7 +785,6 @@ static void ncp_put_super(struct super_block *sb)
kill_pid(server->m.wdog_pid, SIGTERM, 1);
put_pid(server->m.wdog_pid);
 
-   bdi_destroy(&server->bdi);
kfree(server->priv.data);
kfree(server->auth.object_name);
vfree(server->rxbuf);
diff --git a/fs/ncpfs/ncp_fs_sb.h b/fs/ncpfs/ncp_fs_sb.h
index 55e26fd80886..366fd63cc506 100644
--- a/fs/ncpfs/ncp_fs_sb.h
+++ b/fs/ncpfs/ncp_fs_sb.h
@@ -143,7 +143,6 @@ struct ncp_server {
size_t len;
__u8 data[128];
} unexpected_packet;
-   struct backing_dev_info bdi;
 };
 
 extern void ncp_tcp_rcv_proc(struct work_struct *work);
-- 
2.12.0



[PATCH 25/25] bdi: Drop 'parent' argument from bdi_register[_va]()

2017-04-12 Thread Jan Kara
Drop 'parent' argument of bdi_register() and bdi_register_va().  It is
always NULL.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 drivers/mtd/mtdcore.c   |  2 +-
 fs/super.c  |  2 +-
 include/linux/backing-dev.h |  9 -
 mm/backing-dev.c| 13 +
 4 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
index 23e2e56ca54e..1517da3ddd7d 100644
--- a/drivers/mtd/mtdcore.c
+++ b/drivers/mtd/mtdcore.c
@@ -1782,7 +1782,7 @@ static struct backing_dev_info * __init mtd_bdi_init(char *name)
 * We put '-0' suffix to the name to get the same name format as we
 * used to get. Since this is called only once, we get a unique name. 
 */
-   ret = bdi_register(bdi, NULL, "%.28s-0", name);
+   ret = bdi_register(bdi, "%.28s-0", name);
if (ret)
bdi_put(bdi);
 
diff --git a/fs/super.c b/fs/super.c
index 8444d26926ef..adb0c0de428c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1272,7 +1272,7 @@ int super_setup_bdi_name(struct super_block *sb, char *fmt, ...)
bdi->name = sb->s_type->name;
 
va_start(args, fmt);
-   err = bdi_register_va(bdi, NULL, fmt, args);
+   err = bdi_register_va(bdi, fmt, args);
va_end(args);
if (err) {
bdi_put(bdi);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index aaeb2ec5d33c..557d84063934 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -25,11 +25,10 @@ static inline struct backing_dev_info *bdi_get(struct backing_dev_info *bdi)
 
 void bdi_put(struct backing_dev_info *bdi);
 
-__printf(3, 4)
-int bdi_register(struct backing_dev_info *bdi, struct device *parent,
-   const char *fmt, ...);
-int bdi_register_va(struct backing_dev_info *bdi, struct device *parent,
-   const char *fmt, va_list args);
+__printf(2, 3)
+int bdi_register(struct backing_dev_info *bdi, const char *fmt, ...);
+int bdi_register_va(struct backing_dev_info *bdi, const char *fmt,
+   va_list args);
 int bdi_register_owner(struct backing_dev_info *bdi, struct device *owner);
 void bdi_unregister(struct backing_dev_info *bdi);
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 4dcd56947f2a..f028a9a472fd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -856,15 +856,14 @@ struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, int node_id)
 }
 EXPORT_SYMBOL(bdi_alloc_node);
 
-int bdi_register_va(struct backing_dev_info *bdi, struct device *parent,
-   const char *fmt, va_list args)
+int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
 {
struct device *dev;
 
if (bdi->dev)   /* The driver needs to use separate queues per device */
return 0;
 
-   dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, args);
+   dev = device_create_vargs(bdi_class, NULL, MKDEV(0, 0), bdi, fmt, args);
if (IS_ERR(dev))
return PTR_ERR(dev);
 
@@ -883,14 +882,13 @@ int bdi_register_va(struct backing_dev_info *bdi, struct device *parent,
 }
 EXPORT_SYMBOL(bdi_register_va);
 
-int bdi_register(struct backing_dev_info *bdi, struct device *parent,
-   const char *fmt, ...)
+int bdi_register(struct backing_dev_info *bdi, const char *fmt, ...)
 {
va_list args;
int ret;
 
va_start(args, fmt);
-   ret = bdi_register_va(bdi, parent, fmt, args);
+   ret = bdi_register_va(bdi, fmt, args);
va_end(args);
return ret;
 }
@@ -900,8 +898,7 @@ int bdi_register_owner(struct backing_dev_info *bdi, struct device *owner)
 {
int rc;
 
-   rc = bdi_register(bdi, NULL, "%u:%u", MAJOR(owner->devt),
-   MINOR(owner->devt));
+   rc = bdi_register(bdi, "%u:%u", MAJOR(owner->devt), MINOR(owner->devt));
if (rc)
return rc;
/* Leaking owner reference... */
-- 
2.12.0



[PATCH 06/25] lustre: Convert to separately allocated bdi

2017-04-12 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Oleg Drokin 
CC: Andreas Dilger 
CC: James Simmons 
CC: lustre-de...@lists.lustre.org
Reviewed-by: Andreas Dilger 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 .../staging/lustre/lustre/include/lustre_disk.h|  4 
 drivers/staging/lustre/lustre/llite/llite_lib.c| 24 +++---
 2 files changed, 3 insertions(+), 25 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/lustre_disk.h b/drivers/staging/lustre/lustre/include/lustre_disk.h
index 8886458748c1..a676bccabd43 100644
--- a/drivers/staging/lustre/lustre/include/lustre_disk.h
+++ b/drivers/staging/lustre/lustre/include/lustre_disk.h
@@ -133,13 +133,9 @@ struct lustre_sb_info {
struct obd_export*lsi_osd_exp;
char  lsi_osd_type[16];
char  lsi_fstype[16];
-   struct backing_dev_info   lsi_bdi; /* each client mountpoint needs
-   * own backing_dev_info
-   */
 };
 
 #define LSI_UMOUNT_FAILOVER  0x0020
-#define LSI_BDI_INITIALIZED  0x0040
 
 #define s2lsi(sb)  ((struct lustre_sb_info *)((sb)->s_fs_info))
 #define s2lsi_nocast(sb) ((sb)->s_fs_info)
diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
index b229cbc7bb33..d483c44aafe5 100644
--- a/drivers/staging/lustre/lustre/llite/llite_lib.c
+++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
@@ -863,15 +863,6 @@ void ll_lli_init(struct ll_inode_info *lli)
	mutex_init(&lli->lli_layout_mutex);
 }
 
-static inline int ll_bdi_register(struct backing_dev_info *bdi)
-{
-   static atomic_t ll_bdi_num = ATOMIC_INIT(0);
-
-   bdi->name = "lustre";
-   return bdi_register(bdi, NULL, "lustre-%d",
-   atomic_inc_return(&ll_bdi_num));
-}
-
 int ll_fill_super(struct super_block *sb, struct vfsmount *mnt)
 {
struct lustre_profile *lprof = NULL;
@@ -881,6 +872,7 @@ int ll_fill_super(struct super_block *sb, struct vfsmount *mnt)
char  *profilenm = get_profile_name(sb);
struct config_llog_instance *cfg;
interr;
+   static atomic_t ll_bdi_num = ATOMIC_INIT(0);
 
CDEBUG(D_VFSTRACE, "VFS Op: sb %p\n", sb);
 
@@ -903,16 +895,11 @@ int ll_fill_super(struct super_block *sb, struct vfsmount *mnt)
if (err)
goto out_free;
 
-   err = bdi_init(&lsi->lsi_bdi);
-   if (err)
-   goto out_free;
-   lsi->lsi_flags |= LSI_BDI_INITIALIZED;
-   lsi->lsi_bdi.capabilities = 0;
-   err = ll_bdi_register(&lsi->lsi_bdi);
+   err = super_setup_bdi_name(sb, "lustre-%d",
+  atomic_inc_return(&ll_bdi_num));
if (err)
goto out_free;
 
-   sb->s_bdi = &lsi->lsi_bdi;
/* kernel >= 2.6.38 store dentry operations in sb->s_d_op. */
	sb->s_d_op = &ll_d_ops;
 
@@ -1033,11 +1020,6 @@ void ll_put_super(struct super_block *sb)
if (profilenm)
class_del_profile(profilenm);
 
-   if (lsi->lsi_flags & LSI_BDI_INITIALIZED) {
-   bdi_destroy(&lsi->lsi_bdi);
-   lsi->lsi_flags &= ~LSI_BDI_INITIALIZED;
-   }
-
ll_free_sbi(sb);
lsi->lsi_llsbi = NULL;
 
-- 
2.12.0



[PATCH 24/25] block: Remove unused functions

2017-04-12 Thread Jan Kara
Now that all backing_dev_info structures are allocated separately, we can
drop some unused functions.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 include/linux/backing-dev.h |  5 
 mm/backing-dev.c| 56 +
 2 files changed, 6 insertions(+), 55 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 47a98e6e2a65..aaeb2ec5d33c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -17,8 +17,6 @@
 #include 
 #include 
 
-int __must_check bdi_init(struct backing_dev_info *bdi);
-
 static inline struct backing_dev_info *bdi_get(struct backing_dev_info *bdi)
 {
	kref_get(&bdi->refcnt);
@@ -32,12 +30,9 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
const char *fmt, ...);
 int bdi_register_va(struct backing_dev_info *bdi, struct device *parent,
const char *fmt, va_list args);
-int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
 int bdi_register_owner(struct backing_dev_info *bdi, struct device *owner);
 void bdi_unregister(struct backing_dev_info *bdi);
 
-int __must_check bdi_setup_and_register(struct backing_dev_info *, char *);
-void bdi_destroy(struct backing_dev_info *bdi);
 struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, int node_id);
 static inline struct backing_dev_info *bdi_alloc(gfp_t gfp_mask)
 {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 3dd175986390..4dcd56947f2a 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -12,8 +12,6 @@
 #include 
 #include 
 
-static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
-
 struct backing_dev_info noop_backing_dev_info = {
.name   = "noop",
.capabilities   = BDI_CAP_NO_ACCT_AND_WRITEBACK,
@@ -242,6 +240,8 @@ static __init int bdi_class_init(void)
 }
 postcore_initcall(bdi_class_init);
 
+static int bdi_init(struct backing_dev_info *bdi);
+
 static int __init default_bdi_init(void)
 {
int err;
@@ -820,7 +820,7 @@ static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb)
 
 #endif /* CONFIG_CGROUP_WRITEBACK */
 
-int bdi_init(struct backing_dev_info *bdi)
+static int bdi_init(struct backing_dev_info *bdi)
 {
int ret;
 
@@ -838,7 +838,6 @@ int bdi_init(struct backing_dev_info *bdi)
 
return ret;
 }
-EXPORT_SYMBOL(bdi_init);
 
 struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, int node_id)
 {
@@ -897,12 +896,6 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 }
 EXPORT_SYMBOL(bdi_register);
 
-int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
-{
-   return bdi_register(bdi, NULL, "%u:%u", MAJOR(dev), MINOR(dev));
-}
-EXPORT_SYMBOL(bdi_register_dev);
-
 int bdi_register_owner(struct backing_dev_info *bdi, struct device *owner)
 {
int rc;
@@ -950,13 +943,6 @@ void bdi_unregister(struct backing_dev_info *bdi)
}
 }
 
-static void bdi_exit(struct backing_dev_info *bdi)
-{
-   WARN_ON_ONCE(bdi->dev);
-   wb_exit(&bdi->wb);
-   cgwb_bdi_exit(bdi);
-}
-
 static void release_bdi(struct kref *ref)
 {
struct backing_dev_info *bdi =
@@ -964,7 +950,9 @@ static void release_bdi(struct kref *ref)
 
	if (test_bit(WB_registered, &bdi->wb.state))
bdi_unregister(bdi);
-   bdi_exit(bdi);
+   WARN_ON_ONCE(bdi->dev);
+   wb_exit(&bdi->wb);
+   cgwb_bdi_exit(bdi);
kfree(bdi);
 }
 
@@ -974,38 +962,6 @@ void bdi_put(struct backing_dev_info *bdi)
 }
 EXPORT_SYMBOL(bdi_put);
 
-void bdi_destroy(struct backing_dev_info *bdi)
-{
-   bdi_unregister(bdi);
-   bdi_exit(bdi);
-}
-EXPORT_SYMBOL(bdi_destroy);
-
-/*
- * For use from filesystems to quickly init and register a bdi associated
- * with dirty writeback
- */
-int bdi_setup_and_register(struct backing_dev_info *bdi, char *name)
-{
-   int err;
-
-   bdi->name = name;
-   bdi->capabilities = 0;
-   err = bdi_init(bdi);
-   if (err)
-   return err;
-
-   err = bdi_register(bdi, NULL, "%.28s-%ld", name,
-  atomic_long_inc_return(&bdi_seq));
-   if (err) {
-   bdi_destroy(bdi);
-   return err;
-   }
-
-   return 0;
-}
-EXPORT_SYMBOL(bdi_setup_and_register);
-
 static wait_queue_head_t congestion_wqh[2] = {
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
-- 
2.12.0



[PATCH 16/25] fuse: Convert to separately allocated bdi

2017-04-12 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Miklos Szeredi 
CC: linux-fsde...@vger.kernel.org
Acked-by: Miklos Szeredi 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/fuse/dev.c|  8 
 fs/fuse/fuse_i.h |  3 ---
 fs/fuse/inode.c  | 42 +-
 3 files changed, 17 insertions(+), 36 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index b681b43c766e..78887f68ee6a 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -383,8 +383,8 @@ static void request_end(struct fuse_conn *fc, struct fuse_req *req)
 
if (fc->num_background == fc->congestion_threshold &&
fc->connected && fc->bdi_initialized) {
-   clear_bdi_congested(&fc->bdi, BLK_RW_SYNC);
-   clear_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
+   clear_bdi_congested(fc->sb->s_bdi, BLK_RW_SYNC);
+   clear_bdi_congested(fc->sb->s_bdi, BLK_RW_ASYNC);
}
fc->num_background--;
fc->active_background--;
@@ -575,8 +575,8 @@ void fuse_request_send_background_locked(struct fuse_conn *fc,
fc->blocked = 1;
if (fc->num_background == fc->congestion_threshold &&
fc->bdi_initialized) {
-   set_bdi_congested(&fc->bdi, BLK_RW_SYNC);
-   set_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
+   set_bdi_congested(fc->sb->s_bdi, BLK_RW_SYNC);
+   set_bdi_congested(fc->sb->s_bdi, BLK_RW_ASYNC);
}
list_add_tail(>list, >bg_queue);
flush_bg_queue(fc);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 32ac2c9b09c0..0e7c79a390e0 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -631,9 +631,6 @@ struct fuse_conn {
/** Negotiated minor version */
unsigned minor;
 
-   /** Backing dev info */
-   struct backing_dev_info bdi;
-
/** Entry on the fuse_conn_list */
struct list_head entry;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 6fe6a88ecb4a..90bacbc87fb3 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -386,12 +386,6 @@ static void fuse_send_destroy(struct fuse_conn *fc)
}
 }
 
-static void fuse_bdi_destroy(struct fuse_conn *fc)
-{
-   if (fc->bdi_initialized)
-   bdi_destroy(>bdi);
-}
-
 static void fuse_put_super(struct super_block *sb)
 {
struct fuse_conn *fc = get_fuse_conn_super(sb);
@@ -403,7 +397,6 @@ static void fuse_put_super(struct super_block *sb)
	list_del(&fc->entry);
	fuse_ctl_remove_conn(fc);
	mutex_unlock(&fuse_mutex);
-   fuse_bdi_destroy(fc);
 
fuse_conn_put(fc);
 }
@@ -928,7 +921,8 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
fc->no_flock = 1;
}
 
-   fc->bdi.ra_pages = min(fc->bdi.ra_pages, ra_pages);
+   fc->sb->s_bdi->ra_pages =
+   min(fc->sb->s_bdi->ra_pages, ra_pages);
fc->minor = arg->minor;
fc->max_write = arg->minor < 5 ? 4096 : arg->max_write;
fc->max_write = max_t(unsigned, 4096, fc->max_write);
@@ -944,7 +938,7 @@ static void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req)
 
arg->major = FUSE_KERNEL_VERSION;
arg->minor = FUSE_KERNEL_MINOR_VERSION;
-   arg->max_readahead = fc->bdi.ra_pages * PAGE_SIZE;
+   arg->max_readahead = fc->sb->s_bdi->ra_pages * PAGE_SIZE;
arg->flags |= FUSE_ASYNC_READ | FUSE_POSIX_LOCKS | FUSE_ATOMIC_O_TRUNC |
FUSE_EXPORT_SUPPORT | FUSE_BIG_WRITES | FUSE_DONT_MASK |
FUSE_SPLICE_WRITE | FUSE_SPLICE_MOVE | FUSE_SPLICE_READ |
@@ -976,27 +970,20 @@ static void fuse_free_conn(struct fuse_conn *fc)
 static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb)
 {
int err;
+   char *suffix = "";
 
-   fc->bdi.name = "fuse";
-   fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_SIZE;
-   /* fuse does it's own writeback accounting */
-   fc->bdi.capabilities = BDI_CAP_NO_ACCT_WB | BDI_CAP_STRICTLIMIT;
-
-   err = bdi_init(&fc->bdi);
+   if (sb->s_bdev)
+   suffix = "-fuseblk";
+   err = super_setup_bdi_name(sb, "%u:%u%s", MAJOR(fc->dev),
+  MINOR(fc->dev), suffix);
if (err)
return err;
 
-   fc->bdi_initialized = 1;
-
-   if (sb->s_bdev) {
-   err =  bdi_register(&fc->bdi, NULL, "%u:%u-fuseblk",
-   MAJOR(fc->dev), MINOR(fc->dev));
-   } else {
-   err = bdi_register_dev(&fc->bdi, fc->dev);
-   }
+   sb->s_bdi->ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_SIZE;
+   /* fuse does it's own writeback accounting */
+   

[PATCH 23/25] fs: Remove SB_I_DYNBDI flag

2017-04-12 Thread Jan Kara
Now that all bdi structures filesystems use are properly refcounted, we
can remove the SB_I_DYNBDI flag.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 drivers/mtd/mtdsuper.c | 1 -
 fs/gfs2/ops_fstype.c   | 1 -
 fs/nfs/super.c | 1 -
 fs/nilfs2/super.c  | 1 -
 fs/super.c | 5 +
 include/linux/fs.h | 3 ---
 6 files changed, 1 insertion(+), 11 deletions(-)

diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
index e69e7855e31f..e43fea896d1e 100644
--- a/drivers/mtd/mtdsuper.c
+++ b/drivers/mtd/mtdsuper.c
@@ -53,7 +53,6 @@ static int get_sb_mtd_set(struct super_block *sb, void *_mtd)
sb->s_mtd = mtd;
sb->s_dev = MKDEV(MTD_BLOCK_MAJOR, mtd->index);
sb->s_bdi = bdi_get(mtd_bdi);
-   sb->s_iflags |= SB_I_DYNBDI;
 
return 0;
 }
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index e6b6f97d0fc1..ed67548b286c 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1224,7 +1224,6 @@ static int set_gfs2_super(struct super_block *s, void *data)
s->s_bdev = data;
s->s_dev = s->s_bdev->bd_dev;
s->s_bdi = bdi_get(s->s_bdev->bd_bdi);
-   s->s_iflags |= SB_I_DYNBDI;
return 0;
 }
 
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 8d97aa70407e..dc69314d455e 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -2379,7 +2379,6 @@ int nfs_clone_super(struct super_block *sb, struct nfs_mount_info *mount_info)
nfs_initialise_sb(sb);
 
sb->s_bdi = bdi_get(old_sb->s_bdi);
-   sb->s_iflags |= SB_I_DYNBDI;
 
return 0;
 }
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index feb796a38b8d..926682981d61 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -1069,7 +1069,6 @@ nilfs_fill_super(struct super_block *sb, void *data, int silent)
sb->s_max_links = NILFS_LINK_MAX;
 
sb->s_bdi = bdi_get(sb->s_bdev->bd_bdi);
-   sb->s_iflags |= SB_I_DYNBDI;
 
err = load_nilfs(nilfs, sb);
if (err)
diff --git a/fs/super.c b/fs/super.c
index e267d3a00144..8444d26926ef 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -446,10 +446,9 @@ void generic_shutdown_super(struct super_block *sb)
	hlist_del_init(&sb->s_instances);
	spin_unlock(&sb_lock);
	up_write(&sb->s_umount);
-   if (sb->s_iflags & SB_I_DYNBDI) {
+   if (sb->s_bdi != &noop_backing_dev_info) {
bdi_put(sb->s_bdi);
		sb->s_bdi = &noop_backing_dev_info;
-   sb->s_iflags &= ~SB_I_DYNBDI;
}
 }
 
@@ -1055,7 +1054,6 @@ static int set_bdev_super(struct super_block *s, void *data)
s->s_bdev = data;
s->s_dev = s->s_bdev->bd_dev;
s->s_bdi = bdi_get(s->s_bdev->bd_bdi);
-   s->s_iflags |= SB_I_DYNBDI;
 
return 0;
 }
@@ -1282,7 +1280,6 @@ int super_setup_bdi_name(struct super_block *sb, char *fmt, ...)
}
	WARN_ON(sb->s_bdi != &noop_backing_dev_info);
sb->s_bdi = bdi;
-   sb->s_iflags |= SB_I_DYNBDI;
 
return 0;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 98cf14ea78c0..30e5c14bd743 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1272,9 +1272,6 @@ struct mm_struct;
 /* sb->s_iflags to limit user namespace mounts */
 #define SB_I_USERNS_VISIBLE0x0010 /* fstype already mounted */
 
-/* Temporary flag until all filesystems are converted to dynamic bdis */
-#define SB_I_DYNBDI0x0100
-
 /* Possible states of 'frozen' field */
 enum {
SB_UNFROZEN = 0,/* FS is unfrozen */
-- 
2.12.0



[PATCH 13/25] mtd: Convert to dynamically allocated bdi infrastructure

2017-04-12 Thread Jan Kara
MTD already allocates backing_dev_info dynamically. Convert it to use
the generic infrastructure for this, including proper refcounting. We
drop mtd->backing_dev_info, as its only use was to pass the mtd_bdi
pointer from one file into another; if we wanted to keep that in a
clean way, we'd have to make mtd hold and drop the bdi reference as
needed, which seems pointless for passing one global pointer...

CC: David Woodhouse 
CC: Brian Norris 
CC: linux-...@lists.infradead.org
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 drivers/mtd/mtdcore.c   | 23 ---
 drivers/mtd/mtdsuper.c  |  7 ++-
 include/linux/mtd/mtd.h |  5 -
 3 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
index 66a9dedd1062..23e2e56ca54e 100644
--- a/drivers/mtd/mtdcore.c
+++ b/drivers/mtd/mtdcore.c
@@ -46,7 +46,7 @@
 
 #include "mtdcore.h"
 
-static struct backing_dev_info *mtd_bdi;
+struct backing_dev_info *mtd_bdi;
 
 #ifdef CONFIG_PM_SLEEP
 
@@ -496,11 +496,9 @@ int add_mtd_device(struct mtd_info *mtd)
 * mtd_device_parse_register() multiple times on the same master MTD,
 * especially with CONFIG_MTD_PARTITIONED_MASTER=y.
 */
-   if (WARN_ONCE(mtd->backing_dev_info, "MTD already registered\n"))
+   if (WARN_ONCE(mtd->dev.type, "MTD already registered\n"))
return -EEXIST;
 
-   mtd->backing_dev_info = mtd_bdi;
-
BUG_ON(mtd->writesize == 0);
	mutex_lock(&mtd_table_mutex);
 
@@ -1775,13 +1773,18 @@ static struct backing_dev_info * __init mtd_bdi_init(char *name)
struct backing_dev_info *bdi;
int ret;
 
-   bdi = kzalloc(sizeof(*bdi), GFP_KERNEL);
+   bdi = bdi_alloc(GFP_KERNEL);
if (!bdi)
return ERR_PTR(-ENOMEM);
 
-   ret = bdi_setup_and_register(bdi, name);
+   bdi->name = name;
+   /*
+* We put '-0' suffix to the name to get the same name format as we
+* used to get. Since this is called only once, we get a unique name. 
+*/
+   ret = bdi_register(bdi, NULL, "%.28s-0", name);
if (ret)
-   kfree(bdi);
+   bdi_put(bdi);
 
return ret ? ERR_PTR(ret) : bdi;
 }
@@ -1813,8 +1816,7 @@ static int __init init_mtd(void)
 out_procfs:
if (proc_mtd)
remove_proc_entry("mtd", NULL);
-   bdi_destroy(mtd_bdi);
-   kfree(mtd_bdi);
+   bdi_put(mtd_bdi);
 err_bdi:
class_unregister(_class);
 err_reg:
@@ -1828,8 +1830,7 @@ static void __exit cleanup_mtd(void)
if (proc_mtd)
remove_proc_entry("mtd", NULL);
class_unregister(_class);
-   bdi_destroy(mtd_bdi);
-   kfree(mtd_bdi);
+   bdi_put(mtd_bdi);
	idr_destroy(&mtd_idr);
 }
 
diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
index 20c02a3b7417..e69e7855e31f 100644
--- a/drivers/mtd/mtdsuper.c
+++ b/drivers/mtd/mtdsuper.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include <linux/backing-dev.h>
 
 /*
  * compare superblocks to see if they're equivalent
@@ -38,6 +39,8 @@ static int get_sb_mtd_compare(struct super_block *sb, void *_mtd)
return 0;
 }
 
+extern struct backing_dev_info *mtd_bdi;
+
 /*
  * mark the superblock by the MTD device it is using
  * - set the device number to be the correct MTD block device for pesuperstence
@@ -49,7 +52,9 @@ static int get_sb_mtd_set(struct super_block *sb, void *_mtd)
 
sb->s_mtd = mtd;
sb->s_dev = MKDEV(MTD_BLOCK_MAJOR, mtd->index);
-   sb->s_bdi = mtd->backing_dev_info;
+   sb->s_bdi = bdi_get(mtd_bdi);
+   sb->s_iflags |= SB_I_DYNBDI;
+
return 0;
 }
 
diff --git a/include/linux/mtd/mtd.h b/include/linux/mtd/mtd.h
index eebdc63cf6af..79b176eca04a 100644
--- a/include/linux/mtd/mtd.h
+++ b/include/linux/mtd/mtd.h
@@ -334,11 +334,6 @@ struct mtd_info {
int (*_get_device) (struct mtd_info *mtd);
void (*_put_device) (struct mtd_info *mtd);
 
-   /* Backing device capabilities for this device
-* - provides mmap capabilities
-*/
-   struct backing_dev_info *backing_dev_info;
-
struct notifier_block reboot_notifier;  /* default mode before reboot */
 
/* ECC status information */
-- 
2.12.0



clean up a few end_request variants

2017-04-12 Thread Christoph Hellwig
Just some trivial cleanup patches that I'd like to offload from
a big series in the pipe..



[PATCH 2/3] block: remove blk_end_request_cur

2017-04-12 Thread Christoph Hellwig
This function is not used anywhere in the kernel.

Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c   | 18 --
 include/linux/blkdev.h |  1 -
 2 files changed, 19 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8aee417c1e4f..a01af9ca0455 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2838,24 +2838,6 @@ void blk_end_request_all(struct request *rq, int error)
 EXPORT_SYMBOL(blk_end_request_all);
 
 /**
- * blk_end_request_cur - Helper function to finish the current request chunk.
- * @rq: the request to finish the current chunk for
- * @error: %0 for success, < %0 for error
- *
- * Description:
- * Complete the current consecutively mapped chunk from @rq.
- *
- * Return:
- * %false - we are done with this request
- * %true  - still buffers pending for this request
- */
-bool blk_end_request_cur(struct request *rq, int error)
-{
-   return blk_end_request(rq, error, blk_rq_cur_bytes(rq));
-}
-EXPORT_SYMBOL(blk_end_request_cur);
-
-/**
  * __blk_end_request - Helper function for drivers to complete the request.
  * @rq:   the request being processed
  * @error:%0 for success, < %0 for error
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3b6201b1cd0e..f770a17d2fbf 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1127,7 +1127,6 @@ extern void blk_finish_request(struct request *rq, int error);
 extern bool blk_end_request(struct request *rq, int error,
unsigned int nr_bytes);
 extern void blk_end_request_all(struct request *rq, int error);
-extern bool blk_end_request_cur(struct request *rq, int error);
 extern bool __blk_end_request(struct request *rq, int error,
  unsigned int nr_bytes);
 extern void __blk_end_request_all(struct request *rq, int error);
-- 
2.11.0



[PATCH 3/3] block: make __blk_end_bidi_request private

2017-04-12 Thread Christoph Hellwig
blk_insert_flush should be using __blk_end_request to start with.

Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c  | 2 +-
 block/blk-flush.c | 2 +-
 block/blk.h   | 2 --
 3 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a01af9ca0455..518f5d189436 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2785,7 +2785,7 @@ static bool blk_end_bidi_request(struct request *rq, int error,
  * %false - we are done with this request
  * %true  - still buffers pending for this request
  **/
-bool __blk_end_bidi_request(struct request *rq, int error,
+static bool __blk_end_bidi_request(struct request *rq, int error,
				   unsigned int nr_bytes, unsigned int bidi_bytes)
 {
if (blk_update_bidi_request(rq, error, nr_bytes, bidi_bytes))
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 4e951d3bf548..c4e0880b54bb 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -447,7 +447,7 @@ void blk_insert_flush(struct request *rq)
if (q->mq_ops)
blk_mq_end_request(rq, 0);
else
-   __blk_end_bidi_request(rq, 0, 0, 0);
+   __blk_end_request(rq, 0, 0);
return;
}
 
diff --git a/block/blk.h b/block/blk.h
index 07d375183f31..35b3041eec1a 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -67,8 +67,6 @@ void blk_queue_bypass_start(struct request_queue *q);
 void blk_queue_bypass_end(struct request_queue *q);
 void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
-bool __blk_end_bidi_request(struct request *rq, int error,
-   unsigned int nr_bytes, unsigned int bidi_bytes);
 void blk_freeze_queue(struct request_queue *q);
 
 static inline void blk_queue_enter_live(struct request_queue *q)
-- 
2.11.0



[PATCH 1/3] block: remove blk_end_request_err and __blk_end_request_err

2017-04-12 Thread Christoph Hellwig
Both functions are entirely unused.

Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c   | 39 ---
 include/linux/blkdev.h |  2 --
 2 files changed, 41 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4f95bdaa1e16..8aee417c1e4f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2856,25 +2856,6 @@ bool blk_end_request_cur(struct request *rq, int error)
 EXPORT_SYMBOL(blk_end_request_cur);
 
 /**
- * blk_end_request_err - Finish a request till the next failure boundary.
- * @rq: the request to finish till the next failure boundary for
- * @error: must be negative errno
- *
- * Description:
- * Complete @rq till the next failure boundary.
- *
- * Return:
- * %false - we are done with this request
- * %true  - still buffers pending for this request
- */
-bool blk_end_request_err(struct request *rq, int error)
-{
-   WARN_ON(error >= 0);
-   return blk_end_request(rq, error, blk_rq_err_bytes(rq));
-}
-EXPORT_SYMBOL_GPL(blk_end_request_err);
-
-/**
  * __blk_end_request - Helper function for drivers to complete the request.
  * @rq:   the request being processed
  * @error:%0 for success, < %0 for error
@@ -2933,26 +2914,6 @@ bool __blk_end_request_cur(struct request *rq, int error)
 }
 EXPORT_SYMBOL(__blk_end_request_cur);
 
-/**
- * __blk_end_request_err - Finish a request till the next failure boundary.
- * @rq: the request to finish till the next failure boundary for
- * @error: must be negative errno
- *
- * Description:
- * Complete @rq till the next failure boundary.  Must be called
- * with queue lock held.
- *
- * Return:
- * %false - we are done with this request
- * %true  - still buffers pending for this request
- */
-bool __blk_end_request_err(struct request *rq, int error)
-{
-   WARN_ON(error >= 0);
-   return __blk_end_request(rq, error, blk_rq_err_bytes(rq));
-}
-EXPORT_SYMBOL_GPL(__blk_end_request_err);
-
 void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 struct bio *bio)
 {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f912ddc39020..3b6201b1cd0e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1128,12 +1128,10 @@ extern bool blk_end_request(struct request *rq, int error,
unsigned int nr_bytes);
 extern void blk_end_request_all(struct request *rq, int error);
 extern bool blk_end_request_cur(struct request *rq, int error);
-extern bool blk_end_request_err(struct request *rq, int error);
 extern bool __blk_end_request(struct request *rq, int error,
  unsigned int nr_bytes);
 extern void __blk_end_request_all(struct request *rq, int error);
 extern bool __blk_end_request_cur(struct request *rq, int error);
-extern bool __blk_end_request_err(struct request *rq, int error);
 
 extern void blk_complete_request(struct request *);
 extern void __blk_complete_request(struct request *);
-- 
2.11.0



Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically

2017-04-12 Thread Benjamin Block
On Fri, Apr 07, 2017 at 11:16:48AM -0700, Bart Van Assche wrote:
> Hello Jens,
> 
> The six patches in this patch series fix the queue lockup I reported
> recently on the linux-block mailing list. Please consider these patches
> for inclusion in the upstream kernel.
> 

Hey Bart,

just out of curiosity. Is this maybe related to similar stuff happening
when CPUs are hot plugged - at least in that the stack gets stuck? Like
in this thread here:
https://www.mail-archive.com/linux-block@vger.kernel.org/msg06057.html

Would be interesting, because we recently saw similar stuff happening.


Beste Grüße / Best regards,
  - Benjamin Block
-- 
Linux on z Systems Development / IBM Systems & Technology Group
  IBM Deutschland Research & Development GmbH 
Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294



[PATCH 7/8] block: remove bio_no_advance_iter

2017-04-12 Thread Christoph Hellwig
Now that we don't have to support the odd Write Same special case
we can simply increment the iter if the bio has data, else just
manipulate bi_size directly.

Signed-off-by: Christoph Hellwig 
---
 include/linux/bio.h | 13 +++--
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 96a20afb8575..7a24a1a24967 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -83,13 +83,6 @@ static inline bool bio_has_data(struct bio *bio)
return false;
 }
 
-static inline bool bio_no_advance_iter(struct bio *bio)
-{
-   return bio_op(bio) == REQ_OP_DISCARD ||
-  bio_op(bio) == REQ_OP_SECURE_ERASE ||
-  bio_op(bio) == REQ_OP_WRITE_ZEROES;
-}
-
 static inline bool bio_mergeable(struct bio *bio)
 {
if (bio->bi_opf & REQ_NOMERGE_FLAGS)
@@ -165,10 +158,10 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
 {
iter->bi_sector += bytes >> 9;
 
-   if (bio_no_advance_iter(bio))
-   iter->bi_size -= bytes;
-   else
+   if (bio_has_data(bio))
bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+   else
+   iter->bi_size -= bytes;
 }
 
 #define __bio_for_each_segment(bvl, bio, iter, start)  \
-- 
2.11.0



[PATCH 2/8] target/iblock: convert WRITE_SAME to blkdev_issue_zeroout

2017-04-12 Thread Christoph Hellwig
From: Nicholas Bellinger 

The people who are actively using iblock_execute_write_same_direct() are
doing so in the context of ESX VAAI BlockZero, together with
EXTENDED_COPY and COMPARE_AND_WRITE primitives.

In practice though I've not seen any users of IBLOCK WRITE_SAME for
anything other than VAAI BlockZero, so just using blkdev_issue_zeroout()
when available, and falling back to iblock_execute_write_same() if the
WRITE_SAME buffer contains anything other than zeros should be OK.

Signed-off-by: Christoph Hellwig 
---
 drivers/target/target_core_iblock.c | 44 +++--
 1 file changed, 27 insertions(+), 17 deletions(-)

diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index d316ed537d59..5bfde20481d7 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -86,6 +86,7 @@ static int iblock_configure_device(struct se_device *dev)
struct block_device *bd = NULL;
struct blk_integrity *bi;
fmode_t mode;
+   unsigned int max_write_zeroes_sectors;
int ret = -ENOMEM;
 
if (!(ib_dev->ibd_flags & IBDF_HAS_UDEV_PATH)) {
@@ -129,7 +130,11 @@ static int iblock_configure_device(struct se_device *dev)
 * Enable write same emulation for IBLOCK and use 0x as
 * the smaller WRITE_SAME(10) only has a two-byte block count.
 */
-   dev->dev_attrib.max_write_same_len = 0xFFFF;
+   max_write_zeroes_sectors = bdev_write_zeroes_sectors(bd);
+   if (max_write_zeroes_sectors)
+   dev->dev_attrib.max_write_same_len = max_write_zeroes_sectors;
+   else
+   dev->dev_attrib.max_write_same_len = 0xFFFF;
 
if (blk_queue_nonrot(q))
dev->dev_attrib.is_nonrot = 1;
@@ -415,28 +420,31 @@ iblock_execute_unmap(struct se_cmd *cmd, sector_t lba, 
sector_t nolb)
 }
 
 static sense_reason_t
-iblock_execute_write_same_direct(struct block_device *bdev, struct se_cmd *cmd)
+iblock_execute_zero_out(struct block_device *bdev, struct se_cmd *cmd)
 {
struct se_device *dev = cmd->se_dev;
struct scatterlist *sg = &cmd->t_data_sg[0];
-   struct page *page = NULL;
-   int ret;
unsigned char *buf, zero = 0x00, *p = &zero;
+   int rc, ret;
 
-   if (sg->offset) {
-   page = alloc_page(GFP_KERNEL);
-   if (!page)
-   return TCM_OUT_OF_RESOURCES;
-   sg_copy_to_buffer(sg, cmd->t_data_nents, page_address(page),
- dev->dev_attrib.block_size);
-   }
+   buf = kmap(sg_page(sg)) + sg->offset;
+   if (!buf)
+   return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
+   /*
+* Fall back to block_execute_write_same() slow-path if
+* incoming WRITE_SAME payload does not contain zeros.
+*/
+   rc = memcmp(buf, p, cmd->data_length);
+   kunmap(sg_page(sg));
+
+   if (rc)
+   return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
 
-   ret = blkdev_issue_write_same(bdev,
+   ret = blkdev_issue_zeroout(bdev,
target_to_linux_sector(dev, cmd->t_task_lba),
target_to_linux_sector(dev,
sbc_get_write_same_sectors(cmd)),
-   GFP_KERNEL, page ? page : sg_page(sg));
-   if (page)
-   __free_page(page);
+   GFP_KERNEL, false);
if (ret)
return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
 
@@ -472,8 +480,10 @@ iblock_execute_write_same(struct se_cmd *cmd)
return TCM_INVALID_CDB_FIELD;
}
 
-   if (bdev_write_same(bdev))
-   return iblock_execute_write_same_direct(bdev, cmd);
+   if (bdev_write_zeroes_sectors(bdev)) {
+   if (!iblock_execute_zero_out(bdev, cmd))
+   return 0;
+   }
 
ibr = kzalloc(sizeof(struct iblock_req), GFP_KERNEL);
if (!ibr)
-- 
2.11.0



[PATCH 3/8] sd: remove write same support

2017-04-12 Thread Christoph Hellwig
There are no more end-users of REQ_OP_WRITE_SAME left, so we can start
deleting it.

Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/sd.c | 70 ---
 drivers/scsi/sd_zbc.c |  1 -
 2 files changed, 71 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 8cf34a8e3eea..a905802e927e 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -878,77 +878,10 @@ static void sd_config_write_same(struct scsi_disk *sdkp)
sdkp->zeroing_mode = SD_ZERO_WRITE;
 
 out:
-   blk_queue_max_write_same_sectors(q, sdkp->max_ws_blocks *
-(logical_block_size >> 9));
blk_queue_max_write_zeroes_sectors(q, sdkp->max_ws_blocks *
 (logical_block_size >> 9));
 }
 
-/**
- * sd_setup_write_same_cmnd - write the same data to multiple blocks
- * @cmd: command to prepare
- *
- * Will issue either WRITE SAME(10) or WRITE SAME(16) depending on
- * preference indicated by target device.
- **/
-static int sd_setup_write_same_cmnd(struct scsi_cmnd *cmd)
-{
-   struct request *rq = cmd->request;
-   struct scsi_device *sdp = cmd->device;
-   struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
-   struct bio *bio = rq->bio;
-   sector_t sector = blk_rq_pos(rq);
-   unsigned int nr_sectors = blk_rq_sectors(rq);
-   unsigned int nr_bytes = blk_rq_bytes(rq);
-   int ret;
-
-   if (sdkp->device->no_write_same)
-   return BLKPREP_INVALID;
-
-   BUG_ON(bio_offset(bio) || bio_iovec(bio).bv_len != sdp->sector_size);
-
-   if (sd_is_zoned(sdkp)) {
-   ret = sd_zbc_setup_write_cmnd(cmd);
-   if (ret != BLKPREP_OK)
-   return ret;
-   }
-
-   sector >>= ilog2(sdp->sector_size) - 9;
-   nr_sectors >>= ilog2(sdp->sector_size) - 9;
-
-   rq->timeout = SD_WRITE_SAME_TIMEOUT;
-
if (sdkp->ws16 || sector > 0xffffffff || nr_sectors > 0xffff) {
-   cmd->cmd_len = 16;
-   cmd->cmnd[0] = WRITE_SAME_16;
-   put_unaligned_be64(sector, &cmd->cmnd[2]);
-   put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);
-   } else {
-   cmd->cmd_len = 10;
-   cmd->cmnd[0] = WRITE_SAME;
-   put_unaligned_be32(sector, &cmd->cmnd[2]);
-   put_unaligned_be16(nr_sectors, &cmd->cmnd[7]);
-   }
-
-   cmd->transfersize = sdp->sector_size;
-   cmd->allowed = SD_MAX_RETRIES;
-
-   /*
-* For WRITE SAME the data transferred via the DATA OUT buffer is
-* different from the amount of data actually written to the target.
-*
-* We set up __data_len to the amount of data transferred via the
-* DATA OUT buffer so that blk_rq_map_sg sets up the proper S/G list
-* to transfer a single sector of data first, but then reset it to
-* the amount of data to be written right after so that the I/O path
-* knows how much to actually write.
-*/
-   rq->__data_len = sdp->sector_size;
-   ret = scsi_init_io(cmd);
-   rq->__data_len = nr_bytes;
-   return ret;
-}
-
 static int sd_setup_flush_cmnd(struct scsi_cmnd *cmd)
 {
struct request *rq = cmd->request;
@@ -1232,8 +1165,6 @@ static int sd_init_command(struct scsi_cmnd *cmd)
}
case REQ_OP_WRITE_ZEROES:
return sd_setup_write_zeroes_cmnd(cmd);
-   case REQ_OP_WRITE_SAME:
-   return sd_setup_write_same_cmnd(cmd);
case REQ_OP_FLUSH:
return sd_setup_flush_cmnd(cmd);
case REQ_OP_READ:
@@ -1872,7 +1803,6 @@ static int sd_done(struct scsi_cmnd *SCpnt)
switch (req_op(req)) {
case REQ_OP_DISCARD:
case REQ_OP_WRITE_ZEROES:
-   case REQ_OP_WRITE_SAME:
case REQ_OP_ZONE_RESET:
if (!result) {
good_bytes = blk_rq_bytes(req);
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 1994f7799fce..8af6c9cd30ca 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -330,7 +330,6 @@ void sd_zbc_complete(struct scsi_cmnd *cmd,
switch (req_op(rq)) {
case REQ_OP_WRITE:
case REQ_OP_WRITE_ZEROES:
-   case REQ_OP_WRITE_SAME:
case REQ_OP_ZONE_RESET:
 
/* Unlock the zone */
-- 
2.11.0



[PATCH 5/8] dm: remove write same support

2017-04-12 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 drivers/md/dm-core.h  |  1 -
 drivers/md/dm-io.c| 21 +
 drivers/md/dm-linear.c|  1 -
 drivers/md/dm-mpath.c |  1 -
 drivers/md/dm-rq.c|  3 ---
 drivers/md/dm-stripe.c|  4 +---
 drivers/md/dm-table.c | 29 -
 drivers/md/dm.c   | 23 ---
 include/linux/device-mapper.h |  6 --
 9 files changed, 2 insertions(+), 87 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index fea5bd52ada8..d661801d72e7 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -131,7 +131,6 @@ struct mapped_device {
 void dm_init_md_queue(struct mapped_device *md);
 void dm_init_normal_md_queue(struct mapped_device *md);
 int md_in_flight(struct mapped_device *md);
-void disable_write_same(struct mapped_device *md);
 void disable_write_zeroes(struct mapped_device *md);
 
 static inline struct completion *dm_get_completion_from_kobject(struct kobject 
*kobj)
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 3702e502466d..105e68dabd3e 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -303,7 +303,6 @@ static void do_region(int op, int op_flags, unsigned region,
unsigned num_bvecs;
sector_t remaining = where->count;
struct request_queue *q = bdev_get_queue(where->bdev);
-   unsigned short logical_block_size = queue_logical_block_size(q);
sector_t num_sectors;
unsigned int uninitialized_var(special_cmd_max_sectors);
 
@@ -314,10 +313,7 @@ static void do_region(int op, int op_flags, unsigned 
region,
special_cmd_max_sectors = q->limits.max_discard_sectors;
else if (op == REQ_OP_WRITE_ZEROES)
special_cmd_max_sectors = q->limits.max_write_zeroes_sectors;
-   else if (op == REQ_OP_WRITE_SAME)
-   special_cmd_max_sectors = q->limits.max_write_same_sectors;
-   if ((op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES ||
-op == REQ_OP_WRITE_SAME)  &&
+   if ((op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES) &&
special_cmd_max_sectors == 0) {
dec_count(io, region, -EOPNOTSUPP);
return;
@@ -336,9 +332,6 @@ static void do_region(int op, int op_flags, unsigned region,
case REQ_OP_WRITE_ZEROES:
num_bvecs = 0;
break;
-   case REQ_OP_WRITE_SAME:
-   num_bvecs = 1;
-   break;
default:
num_bvecs = min_t(int, BIO_MAX_PAGES,
  dm_sector_div_up(remaining, 
(PAGE_SIZE >> SECTOR_SHIFT)));
@@ -355,18 +348,6 @@ static void do_region(int op, int op_flags, unsigned 
region,
num_sectors = min_t(sector_t, special_cmd_max_sectors, 
remaining);
bio->bi_iter.bi_size = num_sectors << SECTOR_SHIFT;
remaining -= num_sectors;
-   } else if (op == REQ_OP_WRITE_SAME) {
-   /*
-* WRITE SAME only uses a single page.
-*/
-   dp->get_page(dp, &page, &len, &offset);
-   bio_add_page(bio, page, logical_block_size, offset);
-   num_sectors = min_t(sector_t, special_cmd_max_sectors, 
remaining);
-   bio->bi_iter.bi_size = num_sectors << SECTOR_SHIFT;
-
-   offset = 0;
-   remaining -= num_sectors;
-   dp->next_page(dp);
} else while (remaining) {
/*
 * Try and add as many pages as possible.
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index e17fd44ceef5..f928f7e9ee4a 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -58,7 +58,6 @@ static int linear_ctr(struct dm_target *ti, unsigned int 
argc, char **argv)
 
ti->num_flush_bios = 1;
ti->num_discard_bios = 1;
-   ti->num_write_same_bios = 1;
ti->num_write_zeroes_bios = 1;
ti->private = lc;
return 0;
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index ab55955ed704..ece53947b99d 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1102,7 +1102,6 @@ static int multipath_ctr(struct dm_target *ti, unsigned 
argc, char **argv)
 
ti->num_flush_bios = 1;
ti->num_discard_bios = 1;
-   ti->num_write_same_bios = 1;
ti->num_write_zeroes_bios = 1;
if (m->queue_mode == DM_TYPE_BIO_BASED)
ti->per_io_data_size = multipath_per_bio_data_size();
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index e60f1b6845be..6f8dc99685f2 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -299,9 +299,6 @@ static void dm_done(struct request *clone, int error, bool 
mapped)

[PATCH 1/8] drbd: drop REQ_OP_WRITE_SAME support

2017-04-12 Thread Christoph Hellwig
Linux only used it for zeroing, for which we have better methods now.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/drbd/drbd_main.c | 28 ++
 drivers/block/drbd/drbd_nl.c   | 60 --
 drivers/block/drbd/drbd_receiver.c | 38 +++-
 drivers/block/drbd/drbd_req.c  |  1 -
 drivers/block/drbd/drbd_worker.c   |  4 ---
 5 files changed, 7 insertions(+), 124 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 84455c365f57..183468e0b959 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -931,7 +931,7 @@ void assign_p_sizes_qlim(struct drbd_device *device, struct 
p_sizes *p, struct r
p->qlim->io_min = cpu_to_be32(queue_io_min(q));
p->qlim->io_opt = cpu_to_be32(queue_io_opt(q));
p->qlim->discard_enabled = blk_queue_discard(q);
-   p->qlim->write_same_capable = 
!!q->limits.max_write_same_sectors;
+   p->qlim->write_same_capable = 0;
} else {
q = device->rq_queue;
p->qlim->physical_block_size = 
cpu_to_be32(queue_physical_block_size(q));
@@ -1610,9 +1610,6 @@ static int _drbd_send_bio(struct drbd_peer_device 
*peer_device, struct bio *bio)
 ? 0 : MSG_MORE);
if (err)
return err;
-   /* REQ_OP_WRITE_SAME has only one segment */
-   if (bio_op(bio) == REQ_OP_WRITE_SAME)
-   break;
}
return 0;
 }
@@ -1631,9 +1628,6 @@ static int _drbd_send_zc_bio(struct drbd_peer_device 
*peer_device, struct bio *b
  bio_iter_last(bvec, iter) ? 0 : MSG_MORE);
if (err)
return err;
-   /* REQ_OP_WRITE_SAME has only one segment */
-   if (bio_op(bio) == REQ_OP_WRITE_SAME)
-   break;
}
return 0;
 }
@@ -1665,7 +1659,6 @@ static u32 bio_flags_to_wire(struct drbd_connection 
*connection,
return  (bio->bi_opf & REQ_SYNC ? DP_RW_SYNC : 0) |
(bio->bi_opf & REQ_FUA ? DP_FUA : 0) |
(bio->bi_opf & REQ_PREFLUSH ? DP_FLUSH : 0) |
-   (bio_op(bio) == REQ_OP_WRITE_SAME ? DP_WSAME : 0) |
(bio_op(bio) == REQ_OP_DISCARD ? DP_DISCARD : 0) |
(bio_op(bio) == REQ_OP_WRITE_ZEROES ? DP_DISCARD : 0);
else
@@ -1680,7 +1673,6 @@ int drbd_send_dblock(struct drbd_peer_device 
*peer_device, struct drbd_request *
struct drbd_device *device = peer_device->device;
struct drbd_socket *sock;
struct p_data *p;
-   struct p_wsame *wsame = NULL;
void *digest_out;
unsigned int dp_flags = 0;
int digest_size;
@@ -1717,27 +1709,13 @@ int drbd_send_dblock(struct drbd_peer_device 
*peer_device, struct drbd_request *
err = __send_command(peer_device->connection, device->vnr, 
sock, P_TRIM, sizeof(*t), NULL, 0);
goto out;
}
-   if (dp_flags & DP_WSAME) {
-   /* this will only work if DRBD_FF_WSAME is set AND the
-* handshake agreed that all nodes and backend devices are
-* WRITE_SAME capable and agree on logical_block_size */
-   wsame = (struct p_wsame*)p;
-   digest_out = wsame + 1;
-   wsame->size = cpu_to_be32(req->i.size);
-   } else
-   digest_out = p + 1;
+   digest_out = p + 1;
 
/* our digest is still only over the payload.
 * TRIM does not carry any payload. */
if (digest_size)
drbd_csum_bio(peer_device->connection->integrity_tfm, 
req->master_bio, digest_out);
-   if (wsame) {
-   err =
-   __send_command(peer_device->connection, device->vnr, sock, 
P_WSAME,
-  sizeof(*wsame) + digest_size, NULL,
-  bio_iovec(req->master_bio).bv_len);
-   } else
-   err =
+   err =
__send_command(peer_device->connection, device->vnr, sock, 
P_DATA,
   sizeof(*p) + digest_size, NULL, req->i.size);
if (!err) {
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 02255a0d68b9..53aeed040eb4 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1234,65 +1234,6 @@ static void fixup_discard_if_not_supported(struct 
request_queue *q)
}
 }
 
-static void decide_on_write_same_support(struct drbd_device *device,
-   struct request_queue *q,
-   struct request_queue *b, struct o_qlim *o)
-{
-   struct drbd_peer_device *peer_device = first_peer_device(device);
-   struct drbd_connection 

[PATCH 4/8] md: drop WRITE_SAME support

2017-04-12 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 drivers/md/linear.c| 1 -
 drivers/md/md.h| 7 ---
 drivers/md/multipath.c | 1 -
 drivers/md/raid0.c | 2 --
 drivers/md/raid1.c | 4 +---
 drivers/md/raid10.c| 1 -
 drivers/md/raid5.c | 1 -
 7 files changed, 1 insertion(+), 16 deletions(-)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 377a8a3672e3..da363f5d54b0 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -292,7 +292,6 @@ static void linear_make_request(struct mddev *mddev, struct 
bio *bio)

trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
  split, 
disk_devt(mddev->gendisk),
  bio_sector);
-   mddev_check_writesame(mddev, split);
mddev_check_write_zeroes(mddev, split);
generic_make_request(split);
}
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 1e76d64ce180..d82b11b5ae5a 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -703,13 +703,6 @@ static inline void mddev_clear_unsupported_flags(struct 
mddev *mddev,
mddev->flags &= ~unsupported_flags;
 }
 
-static inline void mddev_check_writesame(struct mddev *mddev, struct bio *bio)
-{
-   if (bio_op(bio) == REQ_OP_WRITE_SAME &&
-   !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
-   mddev->queue->limits.max_write_same_sectors = 0;
-}
-
 static inline void mddev_check_write_zeroes(struct mddev *mddev, struct bio 
*bio)
 {
if (bio_op(bio) == REQ_OP_WRITE_ZEROES &&
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index e95d521d93e9..68d67a404aab 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -138,7 +138,6 @@ static void multipath_make_request(struct mddev *mddev, 
struct bio * bio)
mp_bh->bio.bi_opf |= REQ_FAILFAST_TRANSPORT;
mp_bh->bio.bi_end_io = multipath_end_request;
mp_bh->bio.bi_private = mp_bh;
-   mddev_check_writesame(mddev, &mp_bh->bio);
	mddev_check_write_zeroes(mddev, &mp_bh->bio);
	generic_make_request(&mp_bh->bio);
return;
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index ce7a6a56cf73..c094749c11e5 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -382,7 +382,6 @@ static int raid0_run(struct mddev *mddev)
bool discard_supported = false;
 
blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
-   blk_queue_max_write_same_sectors(mddev->queue, 
mddev->chunk_sectors);
blk_queue_max_write_zeroes_sectors(mddev->queue, 
mddev->chunk_sectors);
blk_queue_max_discard_sectors(mddev->queue, 
mddev->chunk_sectors);
 
@@ -504,7 +503,6 @@ static void raid0_make_request(struct mddev *mddev, struct 
bio *bio)

trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
  split, 
disk_devt(mddev->gendisk),
  bio_sector);
-   mddev_check_writesame(mddev, split);
mddev_check_write_zeroes(mddev, split);
generic_make_request(split);
}
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index b59cc100320a..ac9ef686e625 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -3177,10 +3177,8 @@ static int raid1_run(struct mddev *mddev)
if (IS_ERR(conf))
return PTR_ERR(conf);
 
-   if (mddev->queue) {
-   blk_queue_max_write_same_sectors(mddev->queue, 0);
+   if (mddev->queue)
blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
-   }
 
rdev_for_each(rdev, mddev) {
if (!mddev->gendisk)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 28ec3a93acee..79988908f862 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3748,7 +3748,6 @@ static int raid10_run(struct mddev *mddev)
if (mddev->queue) {
blk_queue_max_discard_sectors(mddev->queue,
  mddev->chunk_sectors);
-   blk_queue_max_write_same_sectors(mddev->queue, 0);
blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
blk_queue_io_min(mddev->queue, chunk_size);
if (conf->geo.raid_disks % conf->geo.near_copies)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 2efdb0d67460..04fd6a946825 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7262,7 +7262,6 @@ static int raid5_run(struct mddev *mddev)
blk_queue_max_discard_sectors(mddev->queue,
  0xfffe * STRIPE_SECTORS);
 
-   blk_queue_max_write_same_sectors(mddev->queue, 0);

[PATCH 6/8] block: remove REQ_OP_WRITE_SAME support

2017-04-12 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 block/bio.c |  3 --
 block/blk-core.c| 11 +-
 block/blk-lib.c | 90 -
 block/blk-merge.c   | 32 
 block/blk-settings.c| 16 
 block/blk-sysfs.c   | 12 --
 include/linux/bio.h |  3 --
 include/linux/blk_types.h   |  4 +-
 include/linux/blkdev.h  | 26 -
 include/trace/events/f2fs.h |  1 -
 kernel/trace/blktrace.c |  1 -
 11 files changed, 2 insertions(+), 197 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index f4d207180266..b310e7ef3fbf 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -684,9 +684,6 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, 
gfp_t gfp_mask,
case REQ_OP_SECURE_ERASE:
case REQ_OP_WRITE_ZEROES:
break;
-   case REQ_OP_WRITE_SAME:
-   bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
-   break;
default:
__bio_for_each_segment(bv, bio_src, iter, iter_src)
bio->bi_io_vec[bio->bi_vcnt++] = bv;
diff --git a/block/blk-core.c b/block/blk-core.c
index 8654aa0cef6d..92336bc8495c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1929,10 +1929,6 @@ generic_make_request_checks(struct bio *bio)
if (!blk_queue_secure_erase(q))
goto not_supported;
break;
-   case REQ_OP_WRITE_SAME:
-   if (!bdev_write_same(bio->bi_bdev))
-   goto not_supported;
-   break;
case REQ_OP_ZONE_REPORT:
case REQ_OP_ZONE_RESET:
if (!bdev_is_zoned(bio->bi_bdev))
@@ -2100,12 +2096,7 @@ blk_qc_t submit_bio(struct bio *bio)
 * go through the normal accounting stuff before submission.
 */
if (bio_has_data(bio)) {
-   unsigned int count;
-
-   if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME))
-   count = bdev_logical_block_size(bio->bi_bdev) >> 9;
-   else
-   count = bio_sectors(bio);
+   unsigned int count = bio_sectors(bio);
 
if (op_is_write(bio_op(bio))) {
count_vm_events(PGPGOUT, count);
diff --git a/block/blk-lib.c b/block/blk-lib.c
index e8caecd71688..57c99b9b3b78 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -131,96 +131,6 @@ int blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
 
-/**
- * __blkdev_issue_write_same - generate number of bios with same page
- * @bdev:  target blockdev
- * @sector:start sector
- * @nr_sects:  number of sectors to write
- * @gfp_mask:  memory allocation flags (for bio_alloc)
- * @page:  page containing data to write
- * @biop:  pointer to anchor bio
- *
- * Description:
- *  Generate and issue number of bios(REQ_OP_WRITE_SAME) with same page.
- */
-static int __blkdev_issue_write_same(struct block_device *bdev, sector_t 
sector,
-   sector_t nr_sects, gfp_t gfp_mask, struct page *page,
-   struct bio **biop)
-{
-   struct request_queue *q = bdev_get_queue(bdev);
-   unsigned int max_write_same_sectors;
-   struct bio *bio = *biop;
-   sector_t bs_mask;
-
-   if (!q)
-   return -ENXIO;
-
-   bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
-   if ((sector | nr_sects) & bs_mask)
-   return -EINVAL;
-
-   if (!bdev_write_same(bdev))
-   return -EOPNOTSUPP;
-
-   /* Ensure that max_write_same_sectors doesn't overflow bi_size */
-   max_write_same_sectors = UINT_MAX >> 9;
-
-   while (nr_sects) {
-   bio = next_bio(bio, 1, gfp_mask);
-   bio->bi_iter.bi_sector = sector;
-   bio->bi_bdev = bdev;
-   bio->bi_vcnt = 1;
-   bio->bi_io_vec->bv_page = page;
-   bio->bi_io_vec->bv_offset = 0;
-   bio->bi_io_vec->bv_len = bdev_logical_block_size(bdev);
-   bio_set_op_attrs(bio, REQ_OP_WRITE_SAME, 0);
-
-   if (nr_sects > max_write_same_sectors) {
-   bio->bi_iter.bi_size = max_write_same_sectors << 9;
-   nr_sects -= max_write_same_sectors;
-   sector += max_write_same_sectors;
-   } else {
-   bio->bi_iter.bi_size = nr_sects << 9;
-   nr_sects = 0;
-   }
-   cond_resched();
-   }
-
-   *biop = bio;
-   return 0;
-}
-
-/**
- * blkdev_issue_write_same - queue a write same operation
- * @bdev:  target blockdev
- * @sector:start sector
- * @nr_sects:  number of sectors to write
- * @gfp_mask:  memory allocation flags (for bio_alloc)
- * @page:  page containing data
- *
- * Description:
- *Issue a write same request for the sectors 

Re: [PATCH 18/25] gfs2: Convert to properly refcounting bdi

2017-04-12 Thread Steven Whitehouse

Hi,


On 12/04/17 09:16, Christoph Hellwig wrote:

On Wed, Mar 29, 2017 at 12:56:16PM +0200, Jan Kara wrote:

Similarly to set_bdev_super() GFS2 just used block device reference to
bdi. Convert it to properly getting bdi reference. The reference will
get automatically dropped on superblock destruction.

Hmm, why isn't gfs2 simply using the generic mount_bdev code?

Otherwise looks fine:

Reviewed-by: Christoph Hellwig 


It is, more or less. However, we ended up copying it because we needed a 
slight modification in order to cope with the metafs mounts. There may 
be scope to factor out the common parts, I guess. We cannot select the 
root dentry until after we've parsed the mount command line, so it is 
really just the last part of the function that is different.


Steve.



[PATCH 8/8] block: use bio_has_data to check if a bio has bvecs

2017-04-12 Thread Christoph Hellwig
Now that Write Same is gone and discard bios never have a payload we
can simply use bio_has_data as an indicator that the bio has bvecs
that need to be handled.

Signed-off-by: Christoph Hellwig 
---
 block/bio.c |  8 +---
 block/blk-merge.c   |  9 +
 include/linux/bio.h | 21 +
 3 files changed, 7 insertions(+), 31 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index b310e7ef3fbf..1c9f04c30ba9 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -679,15 +679,9 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, 
gfp_t gfp_mask,
bio->bi_iter.bi_sector  = bio_src->bi_iter.bi_sector;
bio->bi_iter.bi_size= bio_src->bi_iter.bi_size;
 
-   switch (bio_op(bio)) {
-   case REQ_OP_DISCARD:
-   case REQ_OP_SECURE_ERASE:
-   case REQ_OP_WRITE_ZEROES:
-   break;
-   default:
+   if (bio_has_data(bio)) {
__bio_for_each_segment(bv, bio_src, iter, iter_src)
bio->bi_io_vec[bio->bi_vcnt++] = bv;
-   break;
}
 
if (bio_integrity(bio_src)) {
diff --git a/block/blk-merge.c b/block/blk-merge.c
index d6c86bfc5722..549d060097f1 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -232,16 +232,9 @@ static unsigned int __blk_recalc_rq_segments(struct 
request_queue *q,
struct bio *fbio, *bbio;
struct bvec_iter iter;
 
-   if (!bio)
+   if (!bio || !bio_has_data(bio))
return 0;
 
-   switch (bio_op(bio)) {
-   case REQ_OP_DISCARD:
-   case REQ_OP_SECURE_ERASE:
-   case REQ_OP_WRITE_ZEROES:
-   return 0;
-   }
-
fbio = bio;
cluster = blk_queue_cluster(q);
seg_size = 0;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7a24a1a24967..86bf531f97aa 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -178,26 +178,15 @@ static inline void bio_advance_iter(struct bio *bio, 
struct bvec_iter *iter,
 static inline unsigned __bio_segments(struct bio *bio, struct bvec_iter *bvec)
 {
unsigned segs = 0;
-   struct bio_vec bv;
-   struct bvec_iter iter;
 
-   /*
-* We special case discard/write same/write zeroes, because they
-* interpret bi_size differently:
-*/
+   if (bio_has_data(bio)) {
+   struct bio_vec bv;
+   struct bvec_iter iter;
 
-   switch (bio_op(bio)) {
-   case REQ_OP_DISCARD:
-   case REQ_OP_SECURE_ERASE:
-   case REQ_OP_WRITE_ZEROES:
-   return 0;
-   default:
-   break;
+   __bio_for_each_segment(bv, bio, iter, *bvec)
+   segs++;
}
 
-   __bio_for_each_segment(bv, bio, iter, *bvec)
-   segs++;
-
return segs;
 }
 
-- 
2.11.0



remove REQ_OP_WRITE_SAME

2017-04-12 Thread Christoph Hellwig
Now that we are using REQ_OP_WRITE_ZEROES for all zeroing needs in the
kernel there is very little use left for REQ_OP_WRITE_SAME.  We only
have two callers left, and both just export optional protocol features
to remote systems: DRBD and the target code.

For the target code the only real use case was zeroing offload, which
is kept with this series, and for DRBD I suspect the same based on the
usage.

git://git.infradead.org/users/hch/block.git delete-write-same

Gitweb:


http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/delete-write-same

Changes from RFC:
 - add zeroing offload for the SCSI target.
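
As a quick orientation for readers of this series: with WRITE SAME gone,
in-kernel users zero a block range through blkdev_issue_zeroout(), which
internally picks WRITE ZEROES, a zeroing offload, or plain zero-filled
writes. A minimal sketch, assuming the 4.11-era signature with a bool
'discard' argument (the same call shape used in the iblock patch in this
series); illustrative only, not code from the series:

	/*
	 * Zero nr_sects sectors starting at 'sector' on bdev.  The block
	 * layer chooses the mechanism, so callers no longer need to know
	 * whether the device supports any particular zeroing command.
	 */
	static int example_zero_range(struct block_device *bdev,
				      sector_t sector, sector_t nr_sects)
	{
		/* 'false': do not allow discard-based zeroing */
		return blkdev_issue_zeroout(bdev, sector, nr_sects,
					    GFP_KERNEL, false);
	}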



Re: [PATCH 21/25] nfs: Convert to separately allocated bdi

2017-04-12 Thread Jan Kara
On Wed 12-04-17 01:20:34, Christoph Hellwig wrote:
> >  /*
> >   * Finish setting up an NFS2/3 superblock
> >   */
> 
> I was just looking at why you didn't update the v4 variant, but it seems
> like the comment above is simply incorrect..

Yes, it's used for NFS4 as well AFAICT.

> Thus the patch looks fine:
> 
> Reviewed-by: Christoph Hellwig 

Thanks.

Honza
-- 
Jan Kara 
SUSE Labs, CR


[PATCH 02/25] block: Unregister bdi on last reference drop

2017-04-12 Thread Jan Kara
Most users will want to unregister bdi when dropping last reference to a
bdi. Only a few users (like block devices) want to play more complex
tricks with bdi registration and unregistration. So unregister bdi when
the last reference to bdi is dropped and just make sure we don't
unregister the bdi the second time if it is already unregistered.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 mm/backing-dev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e5e0972bdd6f..164ccc93690f 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -961,6 +961,8 @@ static void release_bdi(struct kref *ref)
struct backing_dev_info *bdi =
container_of(ref, struct backing_dev_info, refcnt);
 
+   if (test_bit(WB_registered, &bdi->wb.state))
+   bdi_unregister(bdi);
bdi_exit(bdi);
kfree(bdi);
 }
-- 
2.12.0
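
To make the resulting lifecycle concrete, here is a sketch of a simple bdi
user once this lands (illustrative only, not code from the series;
example_bdi_user is a made-up name):

	static int example_bdi_user(int id)
	{
		struct backing_dev_info *bdi;

		bdi = bdi_alloc(GFP_KERNEL);	/* wrapper added in patch 01 */
		if (!bdi)
			return -ENOMEM;

		if (bdi_register(bdi, NULL, "example-%d", id)) {
			bdi_put(bdi);	/* never registered: just frees */
			return -ENODEV;
		}

		/* ... I/O, possibly with extra bdi_get()/bdi_put() pairs ... */

		bdi_put(bdi);	/* last ref: unregisters, then frees */
		return 0;
	}

The point of the patch is the final bdi_put(): because release_bdi() now
checks WB_registered and unregisters first, most callers no longer need an
explicit bdi_unregister() on their teardown path.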



Re: [PATCH 10/25] loop: zero-fill bio on the submitting cpu

2017-04-12 Thread Ming Lei
On Thu, Apr 06, 2017 at 05:39:29PM +0200, Christoph Hellwig wrote:
> In truth I've just audited which blk-mq drivers don't currently have a
> complete callback, but I think this change is at least borderline useful.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/block/loop.c | 30 ++
>  drivers/block/loop.h |  1 +
>  2 files changed, 15 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index cc981f34e017..6924ec611a49 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -445,32 +445,27 @@ static int lo_req_flush(struct loop_device *lo, struct 
> request *rq)
>   return ret;
>  }
>  
> -static inline void handle_partial_read(struct loop_cmd *cmd, long bytes)
> +static void lo_complete_rq(struct request *rq)
>  {
> - if (bytes < 0 || op_is_write(req_op(cmd->rq)))
> - return;
> + struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
>  
> - if (unlikely(bytes < blk_rq_bytes(cmd->rq))) {
> + if (unlikely(req_op(cmd->rq) == REQ_OP_READ && cmd->use_aio &&
> +  cmd->ret >= 0 && cmd->ret < blk_rq_bytes(cmd->rq))) {
>   struct bio *bio = cmd->rq->bio;
>  
> - bio_advance(bio, bytes);
> + bio_advance(bio, cmd->ret);
>   zero_fill_bio(bio);
>   }
> +
> + blk_mq_end_request(rq, cmd->ret < 0 ? -EIO : 0);
>  }
>  
>  static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
>  {
>   struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
> - struct request *rq = cmd->rq;
> -
> - handle_partial_read(cmd, ret);
>  
> - if (ret > 0)
> - ret = 0;
> - else if (ret < 0)
> - ret = -EIO;
> -
> - blk_mq_complete_request(rq, ret);
> + cmd->ret = ret;
> + blk_mq_complete_request(cmd->rq, 0);
>  }
>  
>  static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
> @@ -1686,8 +1681,10 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
>   ret = do_req_filebacked(lo, cmd->rq);
>   failed:
>   /* complete non-aio request */
> - if (!cmd->use_aio || ret)
> - blk_mq_complete_request(cmd->rq, ret ? -EIO : 0);
> + if (!cmd->use_aio || ret) {
> + cmd->ret = ret ? -EIO : 0;
> + blk_mq_complete_request(cmd->rq, 0);
> + }
>  }
>  
>  static void loop_queue_work(struct kthread_work *work)
> @@ -1713,6 +1710,7 @@ static int loop_init_request(void *data, struct request 
> *rq,
>  static const struct blk_mq_ops loop_mq_ops = {
>   .queue_rq   = loop_queue_rq,
>   .init_request   = loop_init_request,
> + .complete   = lo_complete_rq,
>  };
>  
>  static int loop_add(struct loop_device **l, int i)
> diff --git a/drivers/block/loop.h b/drivers/block/loop.h
> index fb2237c73e61..fecd3f97ef8c 100644
> --- a/drivers/block/loop.h
> +++ b/drivers/block/loop.h
> @@ -70,6 +70,7 @@ struct loop_cmd {
>   struct request *rq;
>   struct list_head list;
>   bool use_aio;   /* use AIO interface to handle I/O */
> + long ret;
>   struct kiocb iocb;
>  };

Reviewed-by: Ming Lei 

Thanks,
Ming


[PATCH 03/25] bdi: Export bdi_alloc_node() and bdi_put()

2017-04-12 Thread Jan Kara
MTD will want to call bdi_alloc_node() and bdi_put() directly. Export
these functions.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 mm/backing-dev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 164ccc93690f..3dd175986390 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -855,6 +855,7 @@ struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, int 
node_id)
}
return bdi;
 }
+EXPORT_SYMBOL(bdi_alloc_node);
 
 int bdi_register_va(struct backing_dev_info *bdi, struct device *parent,
const char *fmt, va_list args)
@@ -971,6 +972,7 @@ void bdi_put(struct backing_dev_info *bdi)
 {
kref_put(&bdi->refcnt, release_bdi);
 }
+EXPORT_SYMBOL(bdi_put);
 
 void bdi_destroy(struct backing_dev_info *bdi)
 {
-- 
2.12.0



Re: [kbuild-all] [PATCH V2 16/16] block, bfq: split bfq-iosched.c into multiple source files

2017-04-12 Thread Ye Xiaolong
On 04/11, Paolo Valente wrote:
>
>> Il giorno 02 apr 2017, alle ore 12:02, kbuild test robot  ha 
>> scritto:
>> 
>> Hi Paolo,
>> 
>> [auto build test ERROR on block/for-next]
>> [also build test ERROR on v4.11-rc4 next-20170331]
>> [if your patch is applied to the wrong git tree, please drop us a note to 
>> help improve the system]
>> 
>
>Hi,
>this seems to be a false positive.  Build is correct with the tested
>tree and the .config.
>

Hmm, this error is reproducible on the 0day side, and your patches were applied
on top of 803e16d "Merge branch 'for-4.12/block' into for-next"; is that the
same base as yours?

Thanks,
Xiaolong

>Thanks,
>Paolo
>
>> url:
>> https://github.com/0day-ci/linux/commits/Paolo-Valente/block-bfq-introduce-the-BFQ-v0-I-O-scheduler-as-an-extra-scheduler/20170402-100622
>> base:   
>> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git 
>> for-next
>> config: i386-allmodconfig (attached as .config)
>> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
>> reproduce:
>># save the attached .config to linux build tree
>>make ARCH=i386 
>> 
>> All errors (new ones prefixed by >>):
>> 
 ERROR: "bfq_mark_bfqq_busy" [block/bfq-wf2q.ko] undefined!
 ERROR: "bfqg_stats_update_dequeue" [block/bfq-wf2q.ko] undefined!
 ERROR: "bfq_clear_bfqq_busy" [block/bfq-wf2q.ko] undefined!
 ERROR: "bfq_clear_bfqq_non_blocking_wait_rq" [block/bfq-wf2q.ko] undefined!
 ERROR: "bfq_bfqq_non_blocking_wait_rq" [block/bfq-wf2q.ko] undefined!
 ERROR: "bfq_clear_bfqq_wait_request" [block/bfq-wf2q.ko] undefined!
 ERROR: "bfq_timeout" [block/bfq-wf2q.ko] undefined!
 ERROR: "bfqg_stats_set_start_empty_time" [block/bfq-wf2q.ko] undefined!
 ERROR: "bfq_weights_tree_add" [block/bfq-wf2q.ko] undefined!
 ERROR: "bfq_put_queue" [block/bfq-wf2q.ko] undefined!
 ERROR: "bfq_bfqq_sync" [block/bfq-wf2q.ko] undefined!
 ERROR: "bfqg_to_blkg" [block/bfq-wf2q.ko] undefined!
 ERROR: "bfqq_group" [block/bfq-wf2q.ko] undefined!
 ERROR: "bfq_weights_tree_remove" [block/bfq-wf2q.ko] undefined!
 ERROR: "bfq_bic_update_cgroup" [block/bfq-iosched.ko] undefined!
 ERROR: "bfqg_stats_set_start_idle_time" [block/bfq-iosched.ko] undefined!
 ERROR: "bfqg_stats_update_completion" [block/bfq-iosched.ko] undefined!
 ERROR: "bfq_bfqq_move" [block/bfq-iosched.ko] undefined!
 ERROR: "bfqg_put" [block/bfq-iosched.ko] undefined!
 ERROR: "next_queue_may_preempt" [block/bfq-iosched.ko] undefined!
>> 
>> ---
>> 0-DAY kernel test infrastructureOpen Source Technology Center
>> https://lists.01.org/pipermail/kbuild-all   Intel Corporation
>> <.config.gz>
>
>___
>kbuild-all mailing list
>kbuild-...@lists.01.org
>https://lists.01.org/mailman/listinfo/kbuild-all

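A note on the failure mode: the undefined references are reported against
block/bfq-wf2q.ko, meaning kbuild is linking bfq-wf2q.o as a module of its
own. That happens when several objects are listed directly on an
obj-$(CONFIG_...) line and the option is set to =m (which i386-allmodconfig
selects); a built-in =y build links all the objects into vmlinux and never
notices. The usual kbuild idiom for a single module built from several
source files is a composite object. A sketch, assuming the module is to be
called bfq.ko (this is the generic idiom, not necessarily the exact rule
that was eventually merged):

	obj-$(CONFIG_IOSCHED_BFQ)	+= bfq.o
	bfq-y				:= bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
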

Re: [PATCH 5/9] nowait aio: return on congested block device

2017-04-12 Thread Christoph Hellwig
As mentioned last time around, this should be a REQ_NOWAIT flag so
that it can be easily passed down to the request layer.

> +static inline void bio_wouldblock_error(struct bio *bio)
> +{
> + bio->bi_error = -EAGAIN;
> + bio_endio(bio);
> +}

Please skip this helper..

> +#define QUEUE_FLAG_NOWAIT  28   /* queue supports BIO_NOWAIT */

Please make the flag name a little more descriptive, this sounds like
it will never wait.


Re: [PATCH 9/9] nowait aio: Return -EOPNOTSUPP if filesystem does not support

2017-04-12 Thread Christoph Hellwig
This should go into the patch that introduces IOCB_NOWAIT.


Re: kill req->errors

2017-04-12 Thread Christoph Hellwig
On Thu, Apr 06, 2017 at 04:00:24PM -0400, Konrad Rzeszutek Wilk wrote:
> You wouldn't have a git tree to easily test it? Thanks.

Did you manage to give it a spin now that I pointed you to the git
tree?


Re: kill req->errors

2017-04-12 Thread Christoph Hellwig
Any more comments on these patches?  I'd like to make some progress
on this work.


[PATCH 04/25] fs: Provide infrastructure for dynamic BDIs in filesystems

2017-04-12 Thread Jan Kara
Provide helper functions for setting up dynamically allocated
backing_dev_info structures for filesystems and cleaning them up on
superblock destruction.

CC: linux-...@lists.infradead.org
CC: linux-...@vger.kernel.org
CC: Petr Vandrovec 
CC: linux-ni...@vger.kernel.org
CC: cluster-de...@redhat.com
CC: osd-...@open-osd.org
CC: codal...@coda.cs.cmu.edu
CC: linux-...@lists.infradead.org
CC: ecryp...@vger.kernel.org
CC: linux-c...@vger.kernel.org
CC: ceph-de...@vger.kernel.org
CC: linux-bt...@vger.kernel.org
CC: v9fs-develo...@lists.sourceforge.net
CC: lustre-de...@lists.lustre.org
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/super.c   | 49 
 include/linux/backing-dev-defs.h |  2 +-
 include/linux/fs.h   |  6 +
 3 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index b8b6a086c03b..0f51a437c269 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -446,6 +446,11 @@ void generic_shutdown_super(struct super_block *sb)
	hlist_del_init(&sb->s_instances);
	spin_unlock(&sb_lock);
	up_write(&sb->s_umount);
+   if (sb->s_iflags & SB_I_DYNBDI) {
+   bdi_put(sb->s_bdi);
+   sb->s_bdi = &noop_backing_dev_info;
+   sb->s_iflags &= ~SB_I_DYNBDI;
+   }
 }
 
 EXPORT_SYMBOL(generic_shutdown_super);
@@ -1256,6 +1261,50 @@ mount_fs(struct file_system_type *type, int flags, const 
char *name, void *data)
 }
 
 /*
+ * Setup private BDI for given superblock. It gets automatically cleaned up
+ * in generic_shutdown_super().
+ */
+int super_setup_bdi_name(struct super_block *sb, char *fmt, ...)
+{
+   struct backing_dev_info *bdi;
+   int err;
+   va_list args;
+
+   bdi = bdi_alloc(GFP_KERNEL);
+   if (!bdi)
+   return -ENOMEM;
+
+   bdi->name = sb->s_type->name;
+
+   va_start(args, fmt);
+   err = bdi_register_va(bdi, NULL, fmt, args);
+   va_end(args);
+   if (err) {
+   bdi_put(bdi);
+   return err;
+   }
+   WARN_ON(sb->s_bdi != &noop_backing_dev_info);
+   sb->s_bdi = bdi;
+   sb->s_iflags |= SB_I_DYNBDI;
+
+   return 0;
+}
+EXPORT_SYMBOL(super_setup_bdi_name);
+
+/*
+ * Setup private BDI for given superblock. It gets automatically cleaned up
+ * in generic_shutdown_super().
+ */
+int super_setup_bdi(struct super_block *sb)
+{
+   static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
+
+   return super_setup_bdi_name(sb, "%.28s-%ld", sb->s_type->name,
+   atomic_long_inc_return(&bdi_seq));
+}
+EXPORT_SYMBOL(super_setup_bdi);
+
+/*
  * This is an internal function, please use sb_end_{write,pagefault,intwrite}
  * instead.
  */
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index e66d4722db8e..866c433e7d32 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -146,7 +146,7 @@ struct backing_dev_info {
congested_fn *congested_fn; /* Function pointer if device is md/dm */
void *congested_data;   /* Pointer to aux data for congested func */
 
-   char *name;
+   const char *name;
 
struct kref refcnt; /* Reference counter for the structure */
unsigned int capabilities; /* Device capabilities */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7251f7bb45e8..98cf14ea78c0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1272,6 +1272,9 @@ struct mm_struct;
 /* sb->s_iflags to limit user namespace mounts */
#define SB_I_USERNS_VISIBLE    0x0010 /* fstype already mounted */
 
+/* Temporary flag until all filesystems are converted to dynamic bdis */
+#define SB_I_DYNBDI    0x0100
+
 /* Possible states of 'frozen' field */
 enum {
SB_UNFROZEN = 0,/* FS is unfrozen */
@@ -2121,6 +2124,9 @@ extern int vfs_ustat(dev_t, struct kstatfs *);
 extern int freeze_super(struct super_block *super);
 extern int thaw_super(struct super_block *super);
 extern bool our_mnt(struct vfsmount *mnt);
+extern __printf(2, 3)
+int super_setup_bdi_name(struct super_block *sb, char *fmt, ...);
+extern int super_setup_bdi(struct super_block *sb);
 
 extern int current_umask(void);
 
-- 
2.12.0



[PATCH 09/25] ceph: Convert to separately allocated bdi

2017-04-12 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside client structure. This unifies handling of bdi among users.

CC: Ilya Dryomov 
CC: "Yan, Zheng" 
CC: Sage Weil 
CC: ceph-de...@vger.kernel.org
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/ceph/addr.c|  6 +++---
 fs/ceph/debugfs.c |  2 +-
 fs/ceph/super.c   | 35 +--
 fs/ceph/super.h   |  2 --
 4 files changed, 17 insertions(+), 28 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 1a3e1b40799a..9ecb2fd348cb 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -578,7 +578,7 @@ static int writepage_nounlock(struct page *page, struct 
writeback_control *wbc)
writeback_stat = atomic_long_inc_return(&fsc->writeback_count);
if (writeback_stat >
CONGESTION_ON_THRESH(fsc->mount_options->congestion_kb))
-   set_bdi_congested(&fsc->backing_dev_info, BLK_RW_ASYNC);
+   set_bdi_congested(inode_to_bdi(inode), BLK_RW_ASYNC);
 
set_page_writeback(page);
err = ceph_osdc_writepages(osdc, ceph_vino(inode),
@@ -700,7 +700,7 @@ static void writepages_finish(struct ceph_osd_request *req)
if (atomic_long_dec_return(&fsc->writeback_count) <
 CONGESTION_OFF_THRESH(
fsc->mount_options->congestion_kb))
-   clear_bdi_congested(&fsc->backing_dev_info,
+   clear_bdi_congested(inode_to_bdi(inode),
BLK_RW_ASYNC);
 
if (rc < 0)
@@ -979,7 +979,7 @@ static int ceph_writepages_start(struct address_space 
*mapping,
if (atomic_long_inc_return(&fsc->writeback_count) >
CONGESTION_ON_THRESH(
fsc->mount_options->congestion_kb)) {
-   set_bdi_congested(&fsc->backing_dev_info,
+   set_bdi_congested(inode_to_bdi(inode),
  BLK_RW_ASYNC);
}
 
diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
index f2ae393e2c31..3ef11bc8d728 100644
--- a/fs/ceph/debugfs.c
+++ b/fs/ceph/debugfs.c
@@ -251,7 +251,7 @@ int ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
goto out;
 
snprintf(name, sizeof(name), "../../bdi/%s",
-dev_name(fsc->backing_dev_info.dev));
+dev_name(fsc->sb->s_bdi->dev));
fsc->debugfs_bdi =
debugfs_create_symlink("bdi",
   fsc->client->debugfs_dir,
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index 0ec8d0114e57..a8c81b2052ca 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -579,10 +579,6 @@ static struct ceph_fs_client *create_fs_client(struct 
ceph_mount_options *fsopt,
 
atomic_long_set(&fsc->writeback_count, 0);
 
-   err = bdi_init(&fsc->backing_dev_info);
-   if (err < 0)
-   goto fail_client;
-
err = -ENOMEM;
/*
 * The number of concurrent works can be high but they don't need
@@ -590,7 +586,7 @@ static struct ceph_fs_client *create_fs_client(struct 
ceph_mount_options *fsopt,
 */
fsc->wb_wq = alloc_workqueue("ceph-writeback", 0, 1);
if (fsc->wb_wq == NULL)
-   goto fail_bdi;
+   goto fail_client;
fsc->pg_inv_wq = alloc_workqueue("ceph-pg-invalid", 0, 1);
if (fsc->pg_inv_wq == NULL)
goto fail_wb_wq;
@@ -624,8 +620,6 @@ static struct ceph_fs_client *create_fs_client(struct 
ceph_mount_options *fsopt,
destroy_workqueue(fsc->pg_inv_wq);
 fail_wb_wq:
destroy_workqueue(fsc->wb_wq);
-fail_bdi:
-   bdi_destroy(&fsc->backing_dev_info);
 fail_client:
ceph_destroy_client(fsc->client);
 fail:
@@ -643,8 +637,6 @@ static void destroy_fs_client(struct ceph_fs_client *fsc)
destroy_workqueue(fsc->pg_inv_wq);
destroy_workqueue(fsc->trunc_wq);
 
-   bdi_destroy(&fsc->backing_dev_info);
-
mempool_destroy(fsc->wb_pagevec_pool);
 
destroy_mount_options(fsc->mount_options);
@@ -937,33 +929,32 @@ static int ceph_compare_super(struct super_block *sb, 
void *data)
  */
 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
 
-static int ceph_register_bdi(struct super_block *sb,
-struct ceph_fs_client *fsc)
+static int ceph_setup_bdi(struct super_block *sb, struct ceph_fs_client *fsc)
 {
int err;
 
+   err = super_setup_bdi_name(sb, "ceph-%ld",
+  atomic_long_inc_return(&bdi_seq));
+   if (err)
+   return err;
+
/* set ra_pages based on rasize mount option? */
if (fsc->mount_options->rasize >= PAGE_SIZE)
-   fsc->backing_dev_info.ra_pages =
+   sb->s_bdi->ra_pages 

[PATCH 12/25] afs: Convert to separately allocated bdi

2017-04-12 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: David Howells 
CC: linux-...@lists.infradead.org
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/afs/internal.h | 1 -
 fs/afs/super.c| 5 -
 fs/afs/volume.c   | 8 
 3 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index a6901360fb81..393672997cc2 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -318,7 +318,6 @@ struct afs_volume {
unsigned short  rjservers;  /* number of servers discarded 
due to -ENOMEDIUM */
struct afs_server   *servers[8];/* servers on which volume 
resides (ordered) */
struct rw_semaphore server_sem; /* lock for accessing current 
server */
-   struct backing_dev_info bdi;
 };
 
 /*
diff --git a/fs/afs/super.c b/fs/afs/super.c
index fbdb022b75a2..c79633e5cfd8 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -319,7 +319,10 @@ static int afs_fill_super(struct super_block *sb,
sb->s_blocksize_bits= PAGE_SHIFT;
sb->s_magic = AFS_FS_MAGIC;
sb->s_op= _super_ops;
-   sb->s_bdi   = &as->volume->bdi;
+   ret = super_setup_bdi(sb);
+   if (ret)
+   return ret;
+   sb->s_bdi->ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_SIZE;
strlcpy(sb->s_id, as->volume->vlocation->vldb.name, sizeof(sb->s_id));
 
/* allocate the root inode and dentry */
diff --git a/fs/afs/volume.c b/fs/afs/volume.c
index 546f9d01710b..db73d6dad02b 100644
--- a/fs/afs/volume.c
+++ b/fs/afs/volume.c
@@ -106,11 +106,6 @@ struct afs_volume *afs_volume_lookup(struct 
afs_mount_params *params)
volume->cell= params->cell;
volume->vid = vlocation->vldb.vid[params->type];
 
-   volume->bdi.ra_pages= VM_MAX_READAHEAD*1024/PAGE_SIZE; 
-   ret = bdi_setup_and_register(>bdi, "afs");
-   if (ret)
-   goto error_bdi;
-
init_rwsem(&volume->server_sem);
 
/* look up all the applicable server records */
@@ -156,8 +151,6 @@ struct afs_volume *afs_volume_lookup(struct 
afs_mount_params *params)
return ERR_PTR(ret);
 
 error_discard:
-   bdi_destroy(&volume->bdi);
-error_bdi:
	up_write(&volume->cell->vl_sem);
 
for (loop = volume->nservers - 1; loop >= 0; loop--)
@@ -207,7 +200,6 @@ void afs_put_volume(struct afs_volume *volume)
for (loop = volume->nservers - 1; loop >= 0; loop--)
afs_put_server(volume->servers[loop]);
 
-   bdi_destroy(&volume->bdi);
kfree(volume);
 
_leave(" [destroyed]");
-- 
2.12.0



[PATCH 0/25 v3] fs: Convert all embedded bdis into separate ones

2017-04-12 Thread Jan Kara
Hello,

this is the third revision of the patch series which converts all embedded
occurrences of struct backing_dev_info to use standalone dynamically allocated
structures. This unifies bdi handling across all bdi users and generally
removes some boilerplate code from filesystems setting up their own bdi. It
also allows us to remove some code from the generic bdi implementation.

The patches were only compile-tested for most filesystems (I've tested
mounting only for NFS & btrfs), so fs maintainers, please have a look at
whether the changes look sound to you.

This series is based on top of bdi fixes that were merged into linux-block
git tree into for-next branch. I have pushed out the result as a branch to

git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git bdi

Since all patches got reviewed by Christoph, can you please pick them up Jens?
Thanks!

Changes since v2:
* Added Reviewed-by tags from Christoph

Changes since v1:
* Added some acks
* Added further FUSE cleanup patch
* Added removal of unused argument to bdi_register()
* Fixed up some compilation failures spotted by 0-day testing

Honza
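
To make the conversion pattern concrete, a condensed sketch of what the
per-filesystem change boils down to, using a hypothetical foofs (the real
conversions follow in the individual patches; the names here are purely
illustrative):

	static int foofs_fill_super(struct super_block *sb, void *data,
				    int silent)
	{
		int err;

		/*
		 * Replaces an embedded struct backing_dev_info plus the
		 * explicit bdi_init()/bdi_register()/bdi_destroy() calls.
		 */
		err = super_setup_bdi(sb);
		if (err)
			return err;
		sb->s_bdi->ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_SIZE;

		/*
		 * ... rest of fill_super as before; no bdi teardown needed,
		 * generic_shutdown_super() drops the reference for us.
		 */
		return 0;
	}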


[PATCH 21/25] nfs: Convert to separately allocated bdi

2017-04-12 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Trond Myklebust 
CC: Anna Schumaker 
CC: linux-...@vger.kernel.org
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/nfs/client.c   | 10 --
 fs/nfs/internal.h |  6 +++---
 fs/nfs/super.c| 34 +++---
 fs/nfs/write.c| 13 ++---
 include/linux/nfs_fs_sb.h |  1 -
 5 files changed, 28 insertions(+), 36 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 390ada8741bc..04d15a0045e3 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -761,9 +761,6 @@ static void nfs_server_set_fsinfo(struct nfs_server *server,
server->rsize = NFS_MAX_FILE_IO_SIZE;
server->rpages = (server->rsize + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
-   server->backing_dev_info.name = "nfs";
-   server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
-
if (server->wsize > max_rpc_payload)
server->wsize = max_rpc_payload;
if (server->wsize > NFS_MAX_FILE_IO_SIZE)
@@ -917,12 +914,6 @@ struct nfs_server *nfs_alloc_server(void)
return NULL;
}
 
-   if (bdi_init(&server->backing_dev_info)) {
-   nfs_free_iostats(server->io_stats);
-   kfree(server);
-   return NULL;
-   }
-
	ida_init(&server->openowner_id);
	ida_init(&server->lockowner_id);
pnfs_init_server(server);
@@ -953,7 +944,6 @@ void nfs_free_server(struct nfs_server *server)
	ida_destroy(&server->lockowner_id);
	ida_destroy(&server->openowner_id);
	nfs_free_iostats(server->io_stats);
-   bdi_destroy(&server->backing_dev_info);
kfree(server);
nfs_release_automount_timer();
dprintk("<-- nfs_free_server()\n");
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 7b38fedb7e03..9dc65d7ae754 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -139,7 +139,7 @@ struct nfs_mount_request {
 };
 
 struct nfs_mount_info {
-   void (*fill_super)(struct super_block *, struct nfs_mount_info *);
+   int (*fill_super)(struct super_block *, struct nfs_mount_info *);
int (*set_security)(struct super_block *, struct dentry *, struct 
nfs_mount_info *);
struct nfs_parsed_mount_data *parsed;
struct nfs_clone_mount *cloned;
@@ -407,7 +407,7 @@ struct dentry *nfs_fs_mount(struct file_system_type *, int, 
const char *, void *
 struct dentry * nfs_xdev_mount_common(struct file_system_type *, int,
const char *, struct nfs_mount_info *);
 void nfs_kill_super(struct super_block *);
-void nfs_fill_super(struct super_block *, struct nfs_mount_info *);
+int nfs_fill_super(struct super_block *, struct nfs_mount_info *);
 
 extern struct rpc_stat nfs_rpcstat;
 
@@ -458,7 +458,7 @@ extern void nfs_read_prepare(struct rpc_task *task, void 
*calldata);
 extern void nfs_pageio_reset_read_mds(struct nfs_pageio_descriptor *pgio);
 
 /* super.c */
-void nfs_clone_super(struct super_block *, struct nfs_mount_info *);
+int nfs_clone_super(struct super_block *, struct nfs_mount_info *);
 void nfs_umount_begin(struct super_block *);
 int  nfs_statfs(struct dentry *, struct kstatfs *);
 int  nfs_show_options(struct seq_file *, struct dentry *);
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 54e0f9f2dd94..8d97aa70407e 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -2315,18 +2315,17 @@ inline void nfs_initialise_sb(struct super_block *sb)
sb->s_blocksize = nfs_block_bits(server->wsize,
 &sb->s_blocksize_bits);
 
-   sb->s_bdi = &server->backing_dev_info;
-
nfs_super_set_maxbytes(sb, server->maxfilesize);
 }
 
 /*
  * Finish setting up an NFS2/3 superblock
  */
-void nfs_fill_super(struct super_block *sb, struct nfs_mount_info *mount_info)
+int nfs_fill_super(struct super_block *sb, struct nfs_mount_info *mount_info)
 {
struct nfs_parsed_mount_data *data = mount_info->parsed;
struct nfs_server *server = NFS_SB(sb);
+   int ret;
 
sb->s_blocksize_bits = 0;
sb->s_blocksize = 0;
@@ -2344,13 +2343,21 @@ void nfs_fill_super(struct super_block *sb, struct 
nfs_mount_info *mount_info)
}
 
nfs_initialise_sb(sb);
+
+   ret = super_setup_bdi_name(sb, "%u:%u", MAJOR(server->s_dev),
+  MINOR(server->s_dev));
+   if (ret)
+   return ret;
+   sb->s_bdi->ra_pages = server->rpages * NFS_MAX_READAHEAD;
+   return 0;
+
 }
 EXPORT_SYMBOL_GPL(nfs_fill_super);
 
 /*
  * Finish setting up a cloned NFS2/3/4 superblock
  */
-void nfs_clone_super(struct super_block *sb, struct nfs_mount_info *mount_info)
+int nfs_clone_super(struct super_block *sb, struct nfs_mount_info *mount_info)
 {
const struct super_block *old_sb = 

[PATCH 19/25] nilfs2: Convert to properly refcounting bdi

2017-04-12 Thread Jan Kara
Similarly to set_bdev_super() NILFS2 just used block device reference to
bdi. Convert it to properly getting bdi reference. The reference will
get automatically dropped on superblock destruction.

CC: Ryusuke Konishi 
CC: linux-ni...@vger.kernel.org
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/nilfs2/super.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index e1872f36147f..feb796a38b8d 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -1068,7 +1068,8 @@ nilfs_fill_super(struct super_block *sb, void *data, int 
silent)
sb->s_time_gran = 1;
sb->s_max_links = NILFS_LINK_MAX;
 
-   sb->s_bdi = bdev_get_queue(sb->s_bdev)->backing_dev_info;
+   sb->s_bdi = bdi_get(sb->s_bdev->bd_bdi);
+   sb->s_iflags |= SB_I_DYNBDI;
 
err = load_nilfs(nilfs, sb);
if (err)
-- 
2.12.0



[PATCH 01/25] bdi: Provide bdi_register_va() and bdi_alloc()

2017-04-12 Thread Jan Kara
Add function that registers bdi and takes va_list instead of variable
number of arguments.

Add bdi_alloc() as simple wrapper for NUMA-unaware users allocating BDI.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 include/linux/backing-dev.h |  6 ++
 mm/backing-dev.c| 20 +++-
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c52a48cb9a66..47a98e6e2a65 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -30,6 +30,8 @@ void bdi_put(struct backing_dev_info *bdi);
 __printf(3, 4)
 int bdi_register(struct backing_dev_info *bdi, struct device *parent,
const char *fmt, ...);
+int bdi_register_va(struct backing_dev_info *bdi, struct device *parent,
+   const char *fmt, va_list args);
 int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
 int bdi_register_owner(struct backing_dev_info *bdi, struct device *owner);
 void bdi_unregister(struct backing_dev_info *bdi);
@@ -37,6 +39,10 @@ void bdi_unregister(struct backing_dev_info *bdi);
 int __must_check bdi_setup_and_register(struct backing_dev_info *, char *);
 void bdi_destroy(struct backing_dev_info *bdi);
 struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, int node_id);
+static inline struct backing_dev_info *bdi_alloc(gfp_t gfp_mask)
+{
+   return bdi_alloc_node(gfp_mask, NUMA_NO_NODE);
+}
 
 void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
bool range_cyclic, enum wb_reason reason);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 3ea3bbd921d6..e5e0972bdd6f 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -856,18 +856,15 @@ struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, 
int node_id)
return bdi;
 }
 
-int bdi_register(struct backing_dev_info *bdi, struct device *parent,
-   const char *fmt, ...)
+int bdi_register_va(struct backing_dev_info *bdi, struct device *parent,
+   const char *fmt, va_list args)
 {
-   va_list args;
struct device *dev;
 
if (bdi->dev)   /* The driver needs to use separate queues per device */
return 0;
 
-   va_start(args, fmt);
dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, 
args);
-   va_end(args);
if (IS_ERR(dev))
return PTR_ERR(dev);
 
@@ -884,6 +881,19 @@ int bdi_register(struct backing_dev_info *bdi, struct 
device *parent,
trace_writeback_bdi_register(bdi);
return 0;
 }
+EXPORT_SYMBOL(bdi_register_va);
+
+int bdi_register(struct backing_dev_info *bdi, struct device *parent,
+   const char *fmt, ...)
+{
+   va_list args;
+   int ret;
+
+   va_start(args, fmt);
+   ret = bdi_register_va(bdi, parent, fmt, args);
+   va_end(args);
+   return ret;
+}
 EXPORT_SYMBOL(bdi_register);
 
 int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
-- 
2.12.0



[PATCH 22/25] ubifs: Convert to separately allocated bdi

2017-04-12 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Richard Weinberger 
CC: Artem Bityutskiy 
CC: Adrian Hunter 
CC: linux-...@lists.infradead.org
Acked-by: Richard Weinberger 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/ubifs/super.c | 25 +
 fs/ubifs/ubifs.h |  3 ---
 2 files changed, 9 insertions(+), 19 deletions(-)

diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index b73811bd7676..cf4cc99b75b5 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -1827,7 +1827,6 @@ static void ubifs_put_super(struct super_block *sb)
}
 
ubifs_umount(c);
-   bdi_destroy(&c->bdi);
ubi_close_volume(c->ubi);
mutex_unlock(>umount_mutex);
 }
@@ -2019,29 +2018,25 @@ static int ubifs_fill_super(struct super_block *sb, 
void *data, int silent)
goto out;
}
 
+   err = ubifs_parse_options(c, data, 0);
+   if (err)
+   goto out_close;
+
/*
 * UBIFS provides 'backing_dev_info' in order to disable read-ahead. For
 * UBIFS, I/O is not deferred, it is done immediately in readpage,
 * which means the user would have to wait not just for their own I/O
 * but the read-ahead I/O as well i.e. completely pointless.
 *
-* Read-ahead will be disabled because @c->bdi.ra_pages is 0.
+* Read-ahead will be disabled because @sb->s_bdi->ra_pages is 0. Also
+* @sb->s_bdi->capabilities are initialized to 0 so there won't be any
+* writeback happening.
 */
-   c->bdi.name = "ubifs",
-   c->bdi.capabilities = 0;
-   err = bdi_init(&c->bdi);
+   err = super_setup_bdi_name(sb, "ubifs_%d_%d", c->vi.ubi_num,
+  c->vi.vol_id);
if (err)
goto out_close;
-   err = bdi_register(&c->bdi, NULL, "ubifs_%d_%d",
-  c->vi.ubi_num, c->vi.vol_id);
-   if (err)
-   goto out_bdi;
-
-   err = ubifs_parse_options(c, data, 0);
-   if (err)
-   goto out_bdi;
 
-   sb->s_bdi = &c->bdi;
sb->s_fs_info = c;
sb->s_magic = UBIFS_SUPER_MAGIC;
sb->s_blocksize = UBIFS_BLOCK_SIZE;
@@ -2080,8 +2075,6 @@ static int ubifs_fill_super(struct super_block *sb, void 
*data, int silent)
ubifs_umount(c);
 out_unlock:
mutex_unlock(&c->umount_mutex);
-out_bdi:
-   bdi_destroy(&c->bdi);
 out_close:
ubi_close_volume(c->ubi);
 out:
diff --git a/fs/ubifs/ubifs.h b/fs/ubifs/ubifs.h
index 4d57e488038e..4da10a6d702a 100644
--- a/fs/ubifs/ubifs.h
+++ b/fs/ubifs/ubifs.h
@@ -972,7 +972,6 @@ struct ubifs_debug_info;
  * struct ubifs_info - UBIFS file-system description data structure
  * (per-superblock).
  * @vfs_sb: VFS @struct super_block object
- * @bdi: backing device info object to make VFS happy and disable read-ahead
  *
  * @highest_inum: highest used inode number
  * @max_sqnum: current global sequence number
@@ -1220,7 +1219,6 @@ struct ubifs_debug_info;
  */
 struct ubifs_info {
struct super_block *vfs_sb;
-   struct backing_dev_info bdi;
 
ino_t highest_inum;
unsigned long long max_sqnum;
@@ -1461,7 +1459,6 @@ extern const struct inode_operations 
ubifs_file_inode_operations;
 extern const struct file_operations ubifs_dir_operations;
 extern const struct inode_operations ubifs_dir_inode_operations;
 extern const struct inode_operations ubifs_symlink_inode_operations;
-extern struct backing_dev_info ubifs_backing_dev_info;
 extern struct ubifs_compressor *ubifs_compressors[UBIFS_COMPR_TYPES_CNT];
 
 /* io.c */
-- 
2.12.0



[PATCH 10/25] cifs: Convert to separately allocated bdi

2017-04-12 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Steve French 
CC: linux-c...@vger.kernel.org
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/cifs/cifs_fs_sb.h |  1 -
 fs/cifs/cifsfs.c |  7 ++-
 fs/cifs/connect.c| 10 --
 3 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/cifs/cifs_fs_sb.h b/fs/cifs/cifs_fs_sb.h
index 07ed81cf1552..cbd216b57239 100644
--- a/fs/cifs/cifs_fs_sb.h
+++ b/fs/cifs/cifs_fs_sb.h
@@ -68,7 +68,6 @@ struct cifs_sb_info {
umode_t mnt_dir_mode;
unsigned int mnt_cifs_flags;
char   *mountdata; /* options received at mount time or via DFS refs */
-   struct backing_dev_info bdi;
struct delayed_work prune_tlinks;
struct rcu_head rcu;
char *prepath;
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 15e1db8738ae..502eab6bdbc4 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -138,7 +138,12 @@ cifs_read_super(struct super_block *sb)
sb->s_magic = CIFS_MAGIC_NUMBER;
sb->s_op = &cifs_super_ops;
sb->s_xattr = cifs_xattr_handlers;
-   sb->s_bdi = &cifs_sb->bdi;
+   rc = super_setup_bdi(sb);
+   if (rc)
+   goto out_no_root;
+   /* tune readahead according to rsize */
+   sb->s_bdi->ra_pages = cifs_sb->rsize / PAGE_SIZE;
+
sb->s_blocksize = CIFS_MAX_MSGSIZE;
sb->s_blocksize_bits = 14;  /* default 2**14 = CIFS_MAX_MSGSIZE */
inode = cifs_root_iget(sb);
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 9ae695ae3ed7..7f50c8949401 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -3683,10 +3683,6 @@ cifs_mount(struct cifs_sb_info *cifs_sb, struct smb_vol 
*volume_info)
int referral_walks_count = 0;
 #endif
 
-   rc = bdi_setup_and_register(&cifs_sb->bdi, "cifs");
-   if (rc)
-   return rc;
-
 #ifdef CONFIG_CIFS_DFS_UPCALL
 try_mount_again:
/* cleanup activities if we're chasing a referral */
@@ -3714,7 +3710,6 @@ cifs_mount(struct cifs_sb_info *cifs_sb, struct smb_vol 
*volume_info)
server = cifs_get_tcp_session(volume_info);
if (IS_ERR(server)) {
rc = PTR_ERR(server);
-   bdi_destroy(&cifs_sb->bdi);
goto out;
}
if ((volume_info->max_credits < 20) ||
@@ -3768,9 +3763,6 @@ cifs_mount(struct cifs_sb_info *cifs_sb, struct smb_vol 
*volume_info)
cifs_sb->wsize = server->ops->negotiate_wsize(tcon, volume_info);
cifs_sb->rsize = server->ops->negotiate_rsize(tcon, volume_info);
 
-   /* tune readahead according to rsize */
-   cifs_sb->bdi.ra_pages = cifs_sb->rsize / PAGE_SIZE;
-
 remote_path_check:
 #ifdef CONFIG_CIFS_DFS_UPCALL
/*
@@ -3887,7 +3879,6 @@ cifs_mount(struct cifs_sb_info *cifs_sb, struct smb_vol 
*volume_info)
cifs_put_smb_ses(ses);
else
cifs_put_tcp_session(server, 0);
-   bdi_destroy(&cifs_sb->bdi);
}
 
 out:
@@ -4090,7 +4081,6 @@ cifs_umount(struct cifs_sb_info *cifs_sb)
}
spin_unlock(&cifs_sb->tlink_tree_lock);
 
-   bdi_destroy(&cifs_sb->bdi);
kfree(cifs_sb->mountdata);
kfree(cifs_sb->prepath);
call_rcu(_sb->rcu, delayed_free);
-- 
2.12.0



[PATCH 05/25] fs: Get proper reference for s_bdi

2017-04-12 Thread Jan Kara
So far we just relied on the block device to hold a bdi reference for
us while the filesystem is mounted. While that works perfectly fine,
it is a bit awkward that we have a pointer to a refcounted structure
in the superblock without holding a proper reference. So make s_bdi
hold a proper reference to the block device's BDI. No filesystem
using mount_bdev()
actually changes s_bdi so this is safe and will make bdev filesystems
work the same way as filesystems needing to set up their private bdi.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/super.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 0f51a437c269..e267d3a00144 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1054,12 +1054,9 @@ static int set_bdev_super(struct super_block *s, void 
*data)
 {
s->s_bdev = data;
s->s_dev = s->s_bdev->bd_dev;
+   s->s_bdi = bdi_get(s->s_bdev->bd_bdi);
+   s->s_iflags |= SB_I_DYNBDI;
 
-   /*
-* We set the bdi here to the queue backing, file systems can
-* overwrite this in ->fill_super()
-*/
-   s->s_bdi = bdev_get_queue(s->s_bdev)->backing_dev_info;
return 0;
 }
 
-- 
2.12.0



Re: [PATCH 19/25] nilfs2: Convert to properly refcounting bdi

2017-04-12 Thread Ryusuke Konishi

On 2017/04/12 19:24, Jan Kara wrote:

Similarly to set_bdev_super(), NILFS2 just used the block device's
reference to the bdi. Convert it to properly getting a bdi reference.
The reference will be dropped automatically on superblock destruction.

CC: Ryusuke Konishi 
CC: linux-ni...@vger.kernel.org
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 


Looks fine, thanks.

Acked-by: Ryusuke Konishi 



---
 fs/nilfs2/super.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index e1872f36147f..feb796a38b8d 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -1068,7 +1068,8 @@ nilfs_fill_super(struct super_block *sb, void *data, int 
silent)
sb->s_time_gran = 1;
sb->s_max_links = NILFS_LINK_MAX;

-   sb->s_bdi = bdev_get_queue(sb->s_bdev)->backing_dev_info;
+   sb->s_bdi = bdi_get(sb->s_bdev->bd_bdi);
+   sb->s_iflags |= SB_I_DYNBDI;

err = load_nilfs(nilfs, sb);
if (err)






Re: [PATCH 08/25] btrfs: Convert to separately allocated bdi

2017-04-12 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 10/25] cifs: Convert to separately allocated bdi

2017-04-12 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 09/25] ceph: Convert to separately allocated bdi

2017-04-12 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 07/25] 9p: Convert to separately allocated bdi

2017-04-12 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


[PATCH v6] lightnvm: pblk

2017-04-12 Thread Javier González
Hi Matias,

A last spin to fix a regression that I introduced yesterday on v4. This
should be the one.

Thanks,
Javier

Changes since v5:
* Fix regression on the erase scheduler introduced on v4

Changes since v4:
* Rebase on top of Matias' for-4.12/core
* Fix implicit type conversions reported by sparse (reported by Bart Van
  Assche)
* Make error and debug statistics long atomic variables.

Changes since v3:
* Apply Bart's feedback [1]
* Implement dynamic L2P optimizations for > 32-bit physical media
  geometry (from Matias Bjørling)
* Fix memory leak on GC (Reported by Simon A. F. Lund)
* 8064 is a perfectly round number of lines :)

[1] https://lkml.org/lkml/2017/4/8/172

Changes since v2:
* Rebase on top of Matias' for-4.12/core
* Implement L2P scan recovery to recover L2P table in case of power
  failure.
* Re-design disk format to be more flexible in future versions (from
  Matias Bjørling)
* Implement per-instance uuid to allow correct recovery without
  forcing line erases (from Matias Bjørling)
* Re-design GC threading to have several GC readers and a single
  writer that places data on the write buffer. This makes it possible to maximize
  the GC write buffer budget without having unnecessary GC writers
  competing for the write buffer lock.
* Simplify sysfs interface.
* Refactoring and several code improvements (together with Matias
  Bjørling)

Changes since v1:
* Rebase on top of Matias' for-4.12/core
* Move from per-LUN block allocation to a line model. This means that a
  whole lines across all LUNs is allocated at a time. Data is still
  stripped in a round-robin fashion at a page granurality.
* Implement new disk format scheme, where metadata is stored per line
  instead of per LUN. This allows for space optimizations.
* Improvements on GC workqueue management and victim selection.
* Implement sysfs interface to query pblk's operation and statistics.
* Implement a user - GC I/O rate-limiter
* Various bug fixes

Javier González (1):
  lightnvm: physical block device (pblk) target

 Documentation/lightnvm/pblk.txt  |   21 +
 drivers/lightnvm/Kconfig |9 +
 drivers/lightnvm/Makefile|5 +
 drivers/lightnvm/pblk-cache.c|  114 +++
 drivers/lightnvm/pblk-core.c | 1655 ++
 drivers/lightnvm/pblk-gc.c   |  555 +
 drivers/lightnvm/pblk-init.c |  949 ++
 drivers/lightnvm/pblk-map.c  |  136 
 drivers/lightnvm/pblk-rb.c   |  852 
 drivers/lightnvm/pblk-read.c |  529 
 drivers/lightnvm/pblk-recovery.c |  998 +++
 drivers/lightnvm/pblk-rl.c   |  182 +
 drivers/lightnvm/pblk-sysfs.c|  507 
 drivers/lightnvm/pblk-write.c|  411 ++
 drivers/lightnvm/pblk.h  | 1121 ++
 15 files changed, 8044 insertions(+)
 create mode 100644 Documentation/lightnvm/pblk.txt
 create mode 100644 drivers/lightnvm/pblk-cache.c
 create mode 100644 drivers/lightnvm/pblk-core.c
 create mode 100644 drivers/lightnvm/pblk-gc.c
 create mode 100644 drivers/lightnvm/pblk-init.c
 create mode 100644 drivers/lightnvm/pblk-map.c
 create mode 100644 drivers/lightnvm/pblk-rb.c
 create mode 100644 drivers/lightnvm/pblk-read.c
 create mode 100644 drivers/lightnvm/pblk-recovery.c
 create mode 100644 drivers/lightnvm/pblk-rl.c
 create mode 100644 drivers/lightnvm/pblk-sysfs.c
 create mode 100644 drivers/lightnvm/pblk-write.c
 create mode 100644 drivers/lightnvm/pblk.h

-- 
2.7.4



RFC: drop the T10 OSD code and its users

2017-04-12 Thread Christoph Hellwig
The only real user of the T10 OSD protocol, the pNFS object layout
driver, never reached the point of having shipping products, and the
other two users (osdblk and exofs) were simple examples of its usage.

The code has been mostly unmaintained for years and is getting in the
way of block / SCSI changes, so I think it's finally time to drop it.

These patches are against Jens' block for-next tree as that already
has various modifications of the SCSI code.


[PATCH V4 07/16] block, bfq: reduce I/O latency for soft real-time applications

2017-04-12 Thread Paolo Valente
To guarantee a low latency also to the I/O requests issued by soft
real-time applications, this patch introduces a further heuristic,
which weight-raises (in the sense explained in the previous patch)
also the queues associated to applications deemed as soft real-time.

To be deemed as soft real-time, an application must meet two
requirements.  First, the application must not require an average
bandwidth higher than the approximate bandwidth required to play back
or record a compressed high-definition video. Second, the request
pattern of the application must be isochronous, i.e., after issuing a
request or a batch of requests, the application must stop issuing new
requests until all its pending requests have been completed. After
that, the application may issue a new batch, and so on.

As for the second requirement, it is critical to require also that,
after all the pending requests of the application have been completed,
an adequate minimum amount of time elapses before the application
starts issuing new requests. This also prevents greedy (i.e.,
I/O-bound) applications from being incorrectly deemed, occasionally,
as soft real-time. In fact, if *any amount of time* is fine, then even
a greedy application may, paradoxically, meet both the above
requirements, if: (1) the application performs random I/O and/or the
device is slow, and (2) the CPU load is high. The reason is the
following.  First, if condition (1) is true, then, during the service
of the application, the throughput may be low enough to let the
application meet the bandwidth requirement.  Second, if condition (2)
is true as well, then the application may occasionally behave in an
apparently isochronous way, because it may simply stop issuing
requests while the CPUs are busy serving other processes.

To address this issue, the heuristic leverages the simple fact that
greedy applications issue *all* their requests as quickly as they can,
whereas soft real-time applications spend some time processing data
after each batch of requests is completed. In particular, the
heuristic works as follows. First, according to the above isochrony
requirement, the heuristic checks whether an application may be soft
real-time, thereby giving to the application the opportunity to be
deemed as such, only when both the following two conditions happen to
hold: 1) the queue associated with the application has expired and is
empty, 2) there is no outstanding request of the application.

Suppose that both conditions hold at time, say, t_c and that the
application issues its next request at time, say, t_i. At time t_c the
heuristic computes the next time instant, called soft_rt_next_start in
the code, such that, only if t_i >= soft_rt_next_start, then both the
next conditions will hold when the application issues its next
request: 1) the application will meet the above bandwidth requirement,
2) a given minimum time interval, say Delta, will have elapsed from
time t_c (so as to filter out greedy applications).

The current value of Delta is a little bit higher than the value that
we have found, experimentally, to be adequate on a real,
general-purpose machine. In particular we had to increase Delta to
make the filter quite precise also in slower, embedded systems, and in
KVM/QEMU virtual machines (details in the comments on the code).

If the application actually issues its next request after time
soft_rt_next_start, then its associated queue will be weight-raised
for a relatively short time interval. If, during this time interval,
the application proves again to meet the bandwidth and isochrony
requirements, then the end of the weight-raising period for the queue
is moved forward, and so on. Note that an application whose associated
queue never happens to be empty when it expires will never have the
opportunity to be deemed as soft real-time.
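
A rough userspace model of the computation just described; the
constants are purely illustrative (the in-kernel values and the
jiffies-based bookkeeping differ):

#include <stdio.h>

#define SOFT_RT_MIN_GAP_MS	20	/* illustrative Delta */
#define SOFT_RT_MAX_RATE	1000	/* illustrative cap, sectors/ms */

/*
 * At time t_c (queue expired and empty, no outstanding requests),
 * compute the earliest arrival time of the next request such that
 * both the bandwidth and the isochrony requirements still hold.
 */
static unsigned long soft_rt_next_start(unsigned long t_c_ms,
					unsigned long sectors_served)
{
	/* Delay needed to keep the average rate below the cap ... */
	unsigned long delay_ms = sectors_served / SOFT_RT_MAX_RATE;

	/* ... but never shorter than the anti-greedy gap Delta. */
	if (delay_ms < SOFT_RT_MIN_GAP_MS)
		delay_ms = SOFT_RT_MIN_GAP_MS;
	return t_c_ms + delay_ms;
}

int main(void)
{
	printf("next start: %lu ms\n", soft_rt_next_start(1000, 80000));
	return 0;
}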

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 block/bfq-iosched.c | 342 +---
 1 file changed, 323 insertions(+), 19 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 1a32c83..7f94ad3 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -119,6 +119,13 @@
 #define BFQ_DEFAULT_GRP_IOPRIO 0
 #define BFQ_DEFAULT_GRP_CLASS  IOPRIO_CLASS_BE
 
+/*
+ * Soft real-time applications are extremely more latency sensitive
+ * than interactive ones. Over-raise the weight of the former to
+ * privilege them against the latter.
+ */
+#define BFQ_SOFTRT_WEIGHT_FACTOR   100
+
 struct bfq_entity;
 
 /**
@@ -343,6 +350,14 @@ struct bfq_queue {
/* current maximum weight-raising time for this queue */
unsigned long wr_cur_max_time;
/*
+* Minimum time instant such that, only if a new request is
+* enqueued after this time instant in an idle @bfq_queue with
+* no outstanding requests, then the task associated with the
+* queue it is deemed as 

[PATCH V4 14/16] block, bfq: handle bursts of queue activations

2017-04-12 Thread Paolo Valente
From: Arianna Avanzini 

Many popular I/O-intensive services or applications spawn or
reactivate many parallel threads/processes during short time
intervals. Examples are systemd during boot or git grep.  These
services or applications benefit mostly from a high throughput: the
quicker the I/O generated by their processes is cumulatively served,
the sooner the target job of these services or applications gets
completed. As a consequence, it is almost always counterproductive to
weight-raise any of the queues associated to the processes of these
services or applications: in most cases it would just lower the
throughput, mainly because weight-raising also implies device idling.

To address this issue, an I/O scheduler needs, first, to detect which
queues are associated with these services or applications. In this
respect, we have that, from the I/O-scheduler standpoint, these
services or applications cause bursts of activations, i.e.,
activations of different queues occurring shortly after each
other. However, a shorter burst of activations may also be caused by
the start of an application that does not consist in a lot of parallel
I/O-bound threads (see the comments on the function bfq_handle_burst
for details).

In view of these facts, this commit introduces:
1) a heuristic to detect (only) bursts of queue activations caused by
   services or applications consisting in many parallel I/O-bound
   threads;
2) the prevention of device idling and weight-raising for the queues
   belonging to these bursts.
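
A simplified userspace model of the detection logic introduced here;
the thresholds are illustrative and the kernel keeps additional
per-entity state:

#include <stdbool.h>
#include <stdio.h>

#define BURST_INTERVAL_MS	200	/* illustrative max activation gap */
#define LARGE_BURST_THRESH	3	/* illustrative "large" threshold */

struct burst_state {
	unsigned long last_ins_ms;	/* last activation in the burst */
	int burst_size;
	bool large_burst;
};

/* Called on every queue activation, at time now_ms. */
static void handle_activation(struct burst_state *s, unsigned long now_ms)
{
	if (now_ms - s->last_ins_ms > BURST_INTERVAL_MS) {
		/* Activations too far apart: start a new burst. */
		s->burst_size = 1;
		s->large_burst = false;
	} else if (++s->burst_size > LARGE_BURST_THRESH) {
		/* Many activations close together: idling and
		 * weight-raising get disabled for these queues. */
		s->large_burst = true;
	}
	s->last_ins_ms = now_ms;
}

int main(void)
{
	struct burst_state s = { 0, 0, false };
	unsigned long t[] = { 0, 50, 100, 150, 200 };

	for (int i = 0; i < 5; i++)
		handle_activation(&s, t[i]);
	printf("large burst: %d\n", s.large_burst);	/* prints 1 */
	return 0;
}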

Signed-off-by: Arianna Avanzini 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 404 ++--
 1 file changed, 389 insertions(+), 15 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 549f030..b7e3c86 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -360,6 +360,10 @@ struct bfq_queue {
 
/* bit vector: a 1 for each seeky requests in history */
u32 seek_history;
+
+   /* node for the device's burst list */
+   struct hlist_node burst_list_node;
+
/* position of the last request enqueued */
sector_t last_request_pos;
 
@@ -443,6 +447,17 @@ struct bfq_io_cq {
bool saved_IO_bound;
 
/*
+* Same purpose as the previous fields for the value of the
+* field keeping the queue's belonging to a large burst
+*/
+   bool saved_in_large_burst;
+   /*
+* True if the queue belonged to a burst list before its merge
+* with another cooperating queue.
+*/
+   bool was_in_burst_list;
+
+   /*
 * Similar to previous fields: save wr information.
 */
unsigned long saved_wr_coeff;
@@ -609,6 +624,36 @@ struct bfq_data {
 */
bool strict_guarantees;
 
+   /*
+* Last time at which a queue entered the current burst of
+* queues being activated shortly after each other; for more
+* details about this and the following parameters related to
+* a burst of activations, see the comments on the function
+* bfq_handle_burst.
+*/
+   unsigned long last_ins_in_burst;
+   /*
+* Reference time interval used to decide whether a queue has
+* been activated shortly after @last_ins_in_burst.
+*/
+   unsigned long bfq_burst_interval;
+   /* number of queues in the current burst of queue activations */
+   int burst_size;
+
+   /* common parent entity for the queues in the burst */
+   struct bfq_entity *burst_parent_entity;
+   /* Maximum burst size above which the current queue-activation
+* burst is deemed as 'large'.
+*/
+   unsigned long bfq_large_burst_thresh;
+   /* true if a large queue-activation burst is in progress */
+   bool large_burst;
+   /*
+* Head of the burst list (as for the above fields, more
+* details in the comments on the function bfq_handle_burst).
+*/
+   struct hlist_head burst_list;
+
/* if set to true, low-latency heuristics are enabled */
bool low_latency;
/*
@@ -671,7 +716,8 @@ struct bfq_data {
 };
 
 enum bfqq_state_flags {
-   BFQQF_busy = 0, /* has requests or is in service */
+   BFQQF_just_created = 0, /* queue just allocated */
+   BFQQF_busy, /* has requests or is in service */
BFQQF_wait_request, /* waiting for a request */
BFQQF_non_blocking_wait_rq, /*
 * waiting for a request
@@ -685,6 +731,10 @@ enum bfqq_state_flags {
 * having consumed at most 2/10 of
 * its budget
 */
+   BFQQF_in_large_burst,   /*
+* bfqq activated in a large burst,
+* see 

[PATCH V4 11/16] block, bfq: reduce idling only in symmetric scenarios

2017-04-12 Thread Paolo Valente
From: Arianna Avanzini 

A seeky queue (i.e., a queue containing random requests) is assigned a
very small device-idling slice, for throughput reasons. Unfortunately,
given the process associated with a seeky queue, this behavior causes
the following problem: if the process, say P, performs sync I/O and
has a higher weight than some other processes doing I/O and associated
with non-seeky queues, then BFQ may fail to guarantee to P its
reserved share of the throughput. The reason is that idling is key
for providing service guarantees to processes doing sync I/O [1].

This commit addresses this issue by allowing the device-idling slice
to be reduced for a seeky queue only if the scenario happens to be
symmetric, i.e., if all the queues are to receive the same share of
the throughput.

[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
Scheduler", Proceedings of the First Workshop on Mobile System
Technologies (MST-2015), May 2015.
http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
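
A stripped-down model of the symmetry test: the kernel derives these
counts from the weight-counter rbtrees added by this patch, so the
structure below only illustrates the condition being checked:

#include <stdbool.h>
#include <stdio.h>

struct symmetry_state {
	int distinct_queue_weights;	/* nodes in queue_weights_tree */
	int distinct_group_weights;	/* nodes in group_weights_tree */
	int groups_with_many_entities;	/* groups with >1 active entity */
};

/*
 * Idling for a seeky queue may be shrunk only when every active
 * queue and group is entitled to the same share of the throughput.
 */
static bool scenario_is_symmetric(const struct symmetry_state *s)
{
	return s->distinct_queue_weights <= 1 &&
	       s->distinct_group_weights <= 1 &&
	       s->groups_with_many_entities == 0;
}

int main(void)
{
	struct symmetry_state s = { 1, 1, 0 };

	printf("symmetric: %d\n", scenario_is_symmetric(&s));	/* prints 1 */
	return 0;
}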

Signed-off-by: Arianna Avanzini 
Signed-off-by: Riccardo Pizzetti 
Signed-off-by: Samuele Zecchini 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 287 ++--
 1 file changed, 280 insertions(+), 7 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 6e7388a..b97801f 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -183,6 +183,20 @@ struct bfq_sched_data {
 };
 
 /**
+ * struct bfq_weight_counter - counter of the number of all active entities
+ * with a given weight.
+ */
+struct bfq_weight_counter {
+   unsigned int weight; /* weight of the entities this counter refers to */
+   unsigned int num_active; /* nr of active entities with this weight */
+   /*
+* Weights tree member (see bfq_data's @queue_weights_tree and
+* @group_weights_tree)
+*/
+   struct rb_node weights_node;
+};
+
+/**
  * struct bfq_entity - schedulable entity.
  *
  * A bfq_entity is used to represent either a bfq_queue (leaf node in the
@@ -212,6 +226,8 @@ struct bfq_sched_data {
 struct bfq_entity {
/* service_tree member */
struct rb_node rb_node;
+   /* pointer to the weight counter associated with this entity */
+   struct bfq_weight_counter *weight_counter;
 
/*
 * Flag, true if the entity is on a tree (either the active or
@@ -456,6 +472,25 @@ struct bfq_data {
struct bfq_group *root_group;
 
/*
+* rbtree of weight counters of @bfq_queues, sorted by
+* weight. Used to keep track of whether all @bfq_queues have
+* the same weight. The tree contains one counter for each
+* distinct weight associated to some active and not
+* weight-raised @bfq_queue (see the comments to the functions
+* bfq_weights_tree_[add|remove] for further details).
+*/
+   struct rb_root queue_weights_tree;
+   /*
+* rbtree of non-queue @bfq_entity weight counters, sorted by
+* weight. Used to keep track of whether all @bfq_groups have
+* the same weight. The tree contains one counter for each
+* distinct weight associated to some active @bfq_group (see
+* the comments to the functions bfq_weights_tree_[add|remove]
+* for further details).
+*/
+   struct rb_root group_weights_tree;
+
+   /*
 * Number of bfq_queues containing requests (including the
 * queue in service, even if it is idling).
 */
@@ -791,6 +826,11 @@ struct bfq_group_data {
  * to avoid too many special cases during group creation/
  * migration.
  * @stats: stats for this bfqg.
+ * @active_entities: number of active entities belonging to the group;
+ *   unused for the root group. Used to know whether there
+ *   are groups with more than one active @bfq_entity
+ *   (see the comments to the function
+ *   bfq_bfqq_may_idle()).
  * @rq_pos_tree: rbtree sorted by next_request position, used when
  *   determining if two or more queues have interleaving
  *   requests (see bfq_find_close_cooperator()).
@@ -818,6 +858,8 @@ struct bfq_group {
 
struct bfq_entity *my_entity;
 
+   int active_entities;
+
struct rb_root rq_pos_tree;
 
struct bfqg_stats stats;
@@ -1254,12 +1296,27 @@ static bool bfq_update_parent_budget(struct bfq_entity 
*next_in_service)
  * a candidate for next service (i.e, a candidate entity to serve
  * after the in-service entity is expired). The function then returns
  * true.
+ *
+ * In contrast, the entity could stil be a candidate for next service
+ * if it is not a queue, and has more than one child. In fact, even if
+ * one 

[PATCH V4 15/16] block, bfq: remove all get and put of I/O contexts

2017-04-12 Thread Paolo Valente
When a bfq queue is set in service and when it is merged, a reference
to the I/O context associated with the queue is taken. This reference
is then released when the queue is deselected from service or
split. More precisely, the release of the reference is postponed to
when the scheduler lock is released, to avoid nesting between the
scheduler and the I/O-context lock. In fact, such nesting would lead
to deadlocks, because of other code paths that take the same locks in
the opposite order. This postponing of I/O-context releases does
complicate the code.

This commit addresses these issues by modifying the involved
operations so that they no longer need to take the above I/O-context
references. It then also removes every get and put of these
references.
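
A userspace model of the lock-ordering problem, with pthread mutexes
standing in for the scheduler lock and the ioc lock; it shows the safe
ordering that the removed code emulated by deferring put_io_context()
(this patch goes further and drops the get/put pairs entirely):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t sched_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t ioc_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Taking ioc_lock while holding sched_lock would invert the order
 * used by other paths and could deadlock; so release the scheduler
 * lock first, then drop the ioc reference.
 */
static void deselect_queue(void)
{
	pthread_mutex_lock(&sched_lock);
	/* ... expire the in-service queue ... */
	pthread_mutex_unlock(&sched_lock);

	pthread_mutex_lock(&ioc_lock);	/* the put_io_context() stand-in */
	/* ... drop the io-context reference ... */
	pthread_mutex_unlock(&ioc_lock);
}

int main(void)
{
	deselect_queue();
	puts("locks never nested, no inversion possible");
	return 0;
}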

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 143 +---
 1 file changed, 23 insertions(+), 120 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index b7e3c86..30bb8f9 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -538,8 +538,6 @@ struct bfq_data {
 
/* bfq_queue in service */
struct bfq_queue *in_service_queue;
-   /* bfq_io_cq (bic) associated with the @in_service_queue */
-   struct bfq_io_cq *in_service_bic;
 
/* on-disk position of the last served request */
sector_t last_position;
@@ -704,15 +702,6 @@ struct bfq_data {
struct bfq_io_cq *bio_bic;
/* bfqq associated with the task issuing current bio for merging */
struct bfq_queue *bio_bfqq;
-
-   /*
-* io context to put right after bfqd->lock is released. This
-* filed is used to perform put_io_context, when needed, to
-* after the scheduler lock has been released, and thus
-* prevent an ioc->lock from being possibly taken while the
-* scheduler lock is being held.
-*/
-   struct io_context *ioc_to_put;
 };
 
 enum bfqq_state_flags {
@@ -1148,34 +1137,6 @@ static void bfq_schedule_dispatch(struct bfq_data *bfqd)
}
 }
 
-/*
- * Next two functions release bfqd->lock and put the io context
- * pointed by bfqd->ioc_to_put. This delayed put is used to not risk
- * to take an ioc->lock while the scheduler lock is being held.
- */
-static void bfq_unlock_put_ioc(struct bfq_data *bfqd)
-{
-   struct io_context *ioc_to_put = bfqd->ioc_to_put;
-
-   bfqd->ioc_to_put = NULL;
-   spin_unlock_irq(&bfqd->lock);
-
-   if (ioc_to_put)
-   put_io_context(ioc_to_put);
-}
-
-static void bfq_unlock_put_ioc_restore(struct bfq_data *bfqd,
-  unsigned long flags)
-{
-   struct io_context *ioc_to_put = bfqd->ioc_to_put;
-
-   bfqd->ioc_to_put = NULL;
-   spin_unlock_irqrestore(&bfqd->lock, flags);
-
-   if (ioc_to_put)
-   put_io_context(ioc_to_put);
-}
-
 /**
  * bfq_gt - compare two timestamps.
  * @a: first ts.
@@ -2684,18 +2645,6 @@ static void __bfq_bfqd_reset_in_service(struct bfq_data 
*bfqd)
struct bfq_entity *in_serv_entity = &in_serv_bfqq->entity;
struct bfq_entity *entity = in_serv_entity;
 
-   if (bfqd->in_service_bic) {
-   /*
-* Schedule the release of a reference to
-* bfqd->in_service_bic->icq.ioc to right after the
-* scheduler lock is released. This ioc is not
-* released immediately, to not risk to possibly take
-* an ioc->lock while holding the scheduler lock.
-*/
-   bfqd->ioc_to_put = bfqd->in_service_bic->icq.ioc;
-   bfqd->in_service_bic = NULL;
-   }
-
bfq_clear_bfqq_wait_request(in_serv_bfqq);
hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
bfqd->in_service_queue = NULL;
@@ -3495,7 +3444,7 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
__bfq_deactivate_entity(entity, false);
bfq_put_async_queues(bfqd, bfqg);
 
-   bfq_unlock_put_ioc_restore(bfqd, flags);
+   spin_unlock_irqrestore(&bfqd->lock, flags);
/*
 * @blkg is going offline and will be ignored by
 * blkg_[rw]stat_recursive_sum().  Transfer stats to the parent so
@@ -5472,20 +5421,18 @@ bfq_setup_merge(struct bfq_queue *bfqq, struct 
bfq_queue *new_bfqq)
 * first time that the requests of some process are redirected to
 * it.
 *
-* We redirect bfqq to new_bfqq and not the opposite, because we
-* are in the context of the process owning bfqq, hence we have
-* the io_cq of this process. So we can immediately configure this
-* io_cq to redirect the requests of the process to new_bfqq.
+* We redirect bfqq to new_bfqq and not the opposite, because
+* we are in the context of the process owning bfqq, thus we
+* have the io_cq of this process. So we can immediately
+* configure this io_cq to redirect the requests of the
+  

Re: [PATCH V3 00/16] Introduce the BFQ I/O scheduler

2017-04-12 Thread Bart Van Assche
On Wed, 2017-04-12 at 08:01 +0200, Paolo Valente wrote:
> Where is my mistake?

I think in the Makefile. How about the patch below? Please note that I'm no
Kbuild expert.

diff --git a/block/Makefile b/block/Makefile
index 546066ee7fa6..b3711af6b637 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -20,7 +20,8 @@ obj-$(CONFIG_IOSCHED_NOOP)    += noop-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)  += cfq-iosched.o
 obj-$(CONFIG_MQ_IOSCHED_DEADLINE)  += mq-deadline.o
-obj-$(CONFIG_IOSCHED_BFQ)  += bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
+bfq-y  := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
+obj-$(CONFIG_IOSCHED_BFQ)  += bfq.o
   obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
 obj-$(CONFIG_BLK_CMDLINE_PARSER)   += cmdline-parser.o
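
For background, assuming standard Kbuild semantics: with
"obj-$(CONFIG_IOSCHED_BFQ) += bfq-iosched.o bfq-wf2q.o bfq-cgroup.o"
and CONFIG_IOSCHED_BFQ=m, each listed object becomes a separate
module, so bfq-wf2q.ko cannot link against the symbols defined in
bfq-iosched.c; with CONFIG_IOSCHED_BFQ=y all three objects are simply
linked into the kernel, which would explain why the built-in and
M=block builds succeeded. The "bfq-y := ..." composite form above
links the three objects into a single bfq.ko.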



[PATCH 3/4] nfs: remove the objlayout driver

2017-04-12 Thread Christoph Hellwig
The objlayout code has been in the tree, but it's been unmaintained and
no server product for it actually ever shipped.

Signed-off-by: Christoph Hellwig 
---
 Documentation/admin-guide/kernel-parameters.txt |   6 -
 Documentation/filesystems/nfs/pnfs.txt  |  37 --
 fs/nfs/Kconfig  |   5 -
 fs/nfs/Makefile |   1 -
 fs/nfs/objlayout/Kbuild |   5 -
 fs/nfs/objlayout/objio_osd.c| 675 --
 fs/nfs/objlayout/objlayout.c| 706 
 fs/nfs/objlayout/objlayout.h| 183 --
 fs/nfs/objlayout/pnfs_osd_xdr_cli.c | 415 --
 9 files changed, 2033 deletions(-)
 delete mode 100644 fs/nfs/objlayout/Kbuild
 delete mode 100644 fs/nfs/objlayout/objio_osd.c
 delete mode 100644 fs/nfs/objlayout/objlayout.c
 delete mode 100644 fs/nfs/objlayout/objlayout.h
 delete mode 100644 fs/nfs/objlayout/pnfs_osd_xdr_cli.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index facc20a3f962..17156d66b124 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2419,12 +2419,6 @@
and gids from such clients.  This is intended to ease
migration from NFSv2/v3.
 
-   objlayoutdriver.osd_login_prog=
-   [NFS] [OBJLAYOUT] sets the pathname to the program which
-   is used to automatically discover and login into new
-   osd-targets. Please see:
-   Documentation/filesystems/pnfs.txt for more explanations
-
nmi_debug=  [KNL,AVR32,SH] Specify one or more actions to take
when a NMI is triggered.
Format: [state][,regs][,debounce][,die]
diff --git a/Documentation/filesystems/nfs/pnfs.txt 
b/Documentation/filesystems/nfs/pnfs.txt
index 8de578a98222..80dc0bdc302a 100644
--- a/Documentation/filesystems/nfs/pnfs.txt
+++ b/Documentation/filesystems/nfs/pnfs.txt
@@ -64,46 +64,9 @@ table which are called by the nfs-client pnfs-core to 
implement the
 different layout types.
 
 Files-layout-driver code is in: fs/nfs/filelayout/.. directory
-Objects-layout-driver code is in: fs/nfs/objlayout/.. directory
 Blocks-layout-driver code is in: fs/nfs/blocklayout/.. directory
 Flexfiles-layout-driver code is in: fs/nfs/flexfilelayout/.. directory
 
-objects-layout setup
-
-
-As part of the full STD implementation the objlayoutdriver.ko needs, at times,
-to automatically login to yet undiscovered iscsi/osd devices. For this the
-driver makes up-calles to a user-mode script called *osd_login*
-
-The path_name of the script to use is by default:
-   /sbin/osd_login.
-This name can be overridden by the Kernel module parameter:
-   objlayoutdriver.osd_login_prog
-
-If Kernel does not find the osd_login_prog path it will zero it out
-and will not attempt farther logins. An admin can then write new value
-to the objlayoutdriver.osd_login_prog Kernel parameter to re-enable it.
-
-The /sbin/osd_login is part of the nfs-utils package, and should usually
-be installed on distributions that support this Kernel version.
-
-The API to the login script is as follows:
-   Usage: $0 -u <URI> -o <OSDNAME> -s <SYSTEMID>
-   Options:
-   -u  target uri e.g. iscsi://<ip>:<port>
-   (always exists)
-   (More protocols can be defined in the future.
-The client does not interpret this string it is
-passed unchanged as received from the Server)
-   -o  osdname of the requested target OSD
-   (Might be empty)
-   (A string which denotes the OSD name, there is a
-limit of 64 chars on this string)
-   -s  systemid of the requested target OSD
-   (Might be empty)
-   (This string, if not empty is always an hex
-representation of the 20 bytes osd_system_id)
-
 blocks-layout setup
 ---
 
diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
index f31fd0dd92c6..69d02cf8cf37 100644
--- a/fs/nfs/Kconfig
+++ b/fs/nfs/Kconfig
@@ -123,11 +123,6 @@ config PNFS_BLOCK
depends on NFS_V4_1 && BLK_DEV_DM
default NFS_V4
 
-config PNFS_OBJLAYOUT
-   tristate
-   depends on NFS_V4_1 && SCSI_OSD_ULD
-   default NFS_V4
-
 config PNFS_FLEXFILE_LAYOUT
tristate
depends on NFS_V4_1 && NFS_V3
diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index 6abdda209642..98f4e5728a67 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -31,6 +31,5 @@ nfsv4-$(CONFIG_NFS_V4_1)  

[PATCH V4 05/16] block, bfq: add more fairness with writes and slow processes

2017-04-12 Thread Paolo Valente
This patch deals with two sources of unfairness, which can also cause
high latencies and throughput loss. The first source is related to
write requests. Write requests tend to starve read requests, basically
because, on the one hand, writes are slower than reads, whereas, on
the other hand, storage devices confuse schedulers by deceptively
signaling the completion of write requests immediately after receiving
them. This patch addresses this issue by just throttling writes. In
particular, after a write request is dispatched for a queue, the
budget of the queue is decremented by the number of sectors to write,
multiplied by an (over)charge coefficient. The value of the
coefficient is the result of our tuning with different devices.

The second source of unfairness has to do with slowness detection:
when the in-service queue is expired, BFQ also controls whether the
queue has been "too slow", i.e., has consumed its last-assigned budget
at such a low rate that it would have been impossible to consume all
of this budget within the maximum time slice T_max (Subsec. 3.5 in
[1]). In this case, the queue is always (over)charged the whole
budget, to reduce its utilization of the device. Both this overcharge
and the slowness-detection criterion may cause unfairness.

First, always charging a full budget to a slow queue is too coarse. It
is much more accurate, and this patch lets BFQ do so, to charge an
amount of service 'equivalent' to the amount of time during which the
queue has been in service. As explained in more detail in the comments
on the code, this enables BFQ to provide time fairness among slow
queues.

Secondly, because of ZBR, a queue may be deemed as slow when its
associated process is performing I/O on the slowest zones of a
disk. However, unless the process is truly too slow, not reducing the
disk utilization of the queue is more profitable in terms of disk
throughput than the opposite. A similar problem is caused by logical
block mapping on non-rotational devices. For this reason, this patch
lets a queue be charged time, and not budget, only if the queue has
consumed less than 2/3 of its assigned budget. As an additional,
important benefit, this tolerance allows BFQ to preserve enough
elasticity to still perform bandwidth, and not time, distribution with
little unlucky or quasi-sequential processes.

Finally, for the same reasons as above, this patch makes slowness
detection itself much less harsh: a queue is deemed slow only if it
has consumed its budget at less than half of the peak rate.

[1] P. Valente and M. Andreolini, "Improving Application
Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
the 5th Annual International Systems and Storage Conference
(SYSTOR '12), June 2012.
Slightly extended version:
http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
results.pdf
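
A toy model of the dispatch-side charging rule described above; the
overcharge factor is the one added by this patch, everything else is
illustrative:

#include <stdio.h>

#define ASYNC_CHARGE_FACTOR	10	/* bfq_async_charge_factor */

/*
 * Service charged to a queue when one of its requests is dispatched:
 * reads are charged their size, writes are overcharged to compensate
 * for the deceptively fast completion signaling described above.
 */
static long charge_for_request(long sectors, int is_write)
{
	return is_write ? sectors * ASYNC_CHARGE_FACTOR : sectors;
}

int main(void)
{
	printf("read  8 sectors -> charge %ld\n", charge_for_request(8, 0));
	printf("write 8 sectors -> charge %ld\n", charge_for_request(8, 1));
	return 0;
}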

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 block/bfq-iosched.c | 120 +---
 1 file changed, 85 insertions(+), 35 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 61d880b..dce273b 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -753,6 +753,13 @@ static const int bfq_stats_min_budgets = 194;
 /* Default maximum budget values, in sectors and number of requests. */
 static const int bfq_default_max_budget = 16 * 1024;
 
+/*
+ * Async to sync throughput distribution is controlled as follows:
+ * when an async request is served, the entity is charged the number
+ * of sectors of the request, multiplied by the factor below
+ */
+static const int bfq_async_charge_factor = 10;
+
 /* Default timeout values, in jiffies, approximating CFQ defaults. */
 static const int bfq_timeout = HZ / 8;
 
@@ -1571,22 +1578,52 @@ static void bfq_bfqq_served(struct bfq_queue *bfqq, int 
served)
 }
 
 /**
- * bfq_bfqq_charge_full_budget - set the service to the entity budget.
+ * bfq_bfqq_charge_time - charge an amount of service equivalent to the length
+ *   of the time interval during which bfqq has been in
+ *   service.
+ * @bfqd: the device
  * @bfqq: the queue that needs a service update.
+ * @time_ms: the amount of time during which the queue has received service
  *
- * When it's not possible to be fair in the service domain, because
- * a queue is not consuming its budget fast enough (the meaning of
- * fast depends on the timeout parameter), we charge it a full
- * budget.  In this way we should obtain a sort of time-domain
- * fairness among all the seeky/slow queues.
+ * If a queue does not consume its budget fast enough, then providing
+ * the queue with service fairness may impair throughput, more or less
+ * severely. For this reason, queues that consume their budget slowly
+ * are provided with time fairness instead of service fairness. This
+ 

[PATCH V4 04/16] block, bfq: modify the peak-rate estimator

2017-04-12 Thread Paolo Valente
Unless the maximum budget B_max that BFQ can assign to a queue is set
explicitly by the user, BFQ automatically updates B_max. In
particular, BFQ dynamically sets B_max to the number of sectors that
can be read, at the current estimated peak rate, during the maximum
time, T_max, allowed before a budget timeout occurs. In formulas, if
we denote as R_est the estimated peak rate, then B_max = T_max ∗
R_est. Hence, the higher R_est is with respect to the actual device
peak rate, the higher the probability that processes incur budget
timeouts unjustly is. Besides, a too high value of B_max unnecessarily
increases the deviation from an ideal, smooth service.

Unfortunately, it is not trivial to estimate the peak rate correctly:
because of the presence of sw and hw queues between the scheduler and
the device components that finally serve I/O requests, it is hard to
say exactly when a given dispatched request is served inside the
device, and for how long. As a consequence, it is hard to know
precisely at what rate a given set of requests is actually served by
the device.

On the opposite end, the dispatch time of any request is trivially
available, and, from this piece of information, the "dispatch rate"
of requests can be immediately computed. So, the idea in the next
function is to use what is known, namely request dispatch times
(plus, when useful, request completion times), to estimate what is
unknown, namely in-device request service rate.

The main issue is that, because of the above facts, the rate at
which a certain set of requests is dispatched over a certain time
interval can vary greatly with respect to the rate at which the
same requests are then served. But, since the size of any
intermediate queue is limited, and the service scheme is lossless
(no request is silently dropped), the following obvious convergence
property holds: the number of requests dispatched MUST become
closer and closer to the number of requests completed as the
observation interval grows. This is the key property used in
this new version of the peak-rate estimator.
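
A userspace model of the dispatch-side rate computation (the
fixed-point convention matches the patch; the shift value, the sample
thresholds and the low-pass filtering of the real estimator are
omitted or illustrative):

#include <stdint.h>
#include <stdio.h>

#define RATE_SHIFT	16	/* cf. BFQ_RATE_SHIFT, illustrative */

/*
 * Rate observed over the current observation interval: total sectors
 * dispatched over the time elapsed since the first dispatch. By the
 * convergence property above, this approaches the true service rate
 * as the interval grows.
 */
static uint32_t observed_rate(uint64_t tot_sectors, uint64_t delta_us)
{
	if (delta_us == 0)
		return 0;
	return (uint32_t)((tot_sectors << RATE_SHIFT) / delta_us);
}

int main(void)
{
	/* 64 MiB dispatched in 0.5 s: 131072 sectors over 500000 us. */
	printf("rate = %u (sectors/us << %d)\n",
	       observed_rate(131072, 500000), RATE_SHIFT);
	return 0;
}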

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 block/bfq-iosched.c | 497 +++-
 1 file changed, 372 insertions(+), 125 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 1edac72..61d880b 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -407,19 +407,37 @@ struct bfq_data {
/* on-disk position of the last served request */
sector_t last_position;
 
+   /* time of last request completion (ns) */
+   u64 last_completion;
+
+   /* time of first rq dispatch in current observation interval (ns) */
+   u64 first_dispatch;
+   /* time of last rq dispatch in current observation interval (ns) */
+   u64 last_dispatch;
+
/* beginning of the last budget */
ktime_t last_budget_start;
/* beginning of the last idle slice */
ktime_t last_idling_start;
-   /* number of samples used to calculate @peak_rate */
+
+   /* number of samples in current observation interval */
int peak_rate_samples;
+   /* num of samples of seq dispatches in current observation interval */
+   u32 sequential_samples;
+   /* total num of sectors transferred in current observation interval */
+   u64 tot_sectors_dispatched;
+   /* max rq size seen during current observation interval (sectors) */
+   u32 last_rq_max_size;
+   /* time elapsed from first dispatch in current observ. interval (us) */
+   u64 delta_from_first;
/*
-* Peak read/write rate, observed during the service of a
-* budget [BFQ_RATE_SHIFT * sectors/usec]. The value is
-* left-shifted by BFQ_RATE_SHIFT to increase precision in
+* Current estimate of the device peak rate, measured in
+* [BFQ_RATE_SHIFT * sectors/usec]. The left-shift by
+* BFQ_RATE_SHIFT is performed to increase precision in
 * fixed-point calculations.
 */
-   u64 peak_rate;
+   u32 peak_rate;
+
/* maximum budget allotted to a bfq_queue before rescheduling */
int bfq_max_budget;
 
@@ -740,7 +758,7 @@ static const int bfq_timeout = HZ / 8;
 
 static struct kmem_cache *bfq_pool;
 
-/* Below this threshold (in ms), we consider thinktime immediate. */
+/* Below this threshold (in ns), we consider thinktime immediate. */
 #define BFQ_MIN_TT (2 * NSEC_PER_MSEC)
 
 /* hw_tag detection: parallel requests threshold and min samples needed. */
@@ -752,8 +770,12 @@ static struct kmem_cache *bfq_pool;
 #define BFQQ_CLOSE_THR (sector_t)(8 * 1024)
 #define BFQQ_SEEKY(bfqq)   (hweight32(bfqq->seek_history) > 32/8)
 
-/* Min samples used for peak rate estimation (for autotuning). */
-#define BFQ_PEAK_RATE_SAMPLES  32
+/* Min number of samples required to perform peak-rate update */
+#define 

[PATCH V4 03/16] block, bfq: improve throughput boosting

2017-04-12 Thread Paolo Valente
The feedback-loop algorithm used by BFQ to compute queue (process)
budgets is basically a set of three update rules, one for each of the
main reasons why a queue may be expired. If many processes suddenly
switch from sporadic I/O to greedy and sequential I/O, then these
rules are quite slow to assign large budgets to these processes, and
hence to achieve a high throughput. On the opposite side, BFQ assigns
the maximum possible budget B_max to a just-created queue. This allows
a high throughput to be achieved immediately if the associated process
is I/O-bound and performs sequential I/O from the beginning. But it
also increases the worst-case latency experienced by the first
requests issued by the process, because the larger the budget of a
queue waiting for service is, the later the queue will be served by
B-WF2Q+ (Subsec 3.3 in [1]). This is detrimental for an interactive or
soft real-time application.

To tackle these throughput and latency problems, on one hand this
patch changes the initial budget value to B_max/2. On the other hand,
it re-tunes the three rules, adopting a more aggressive,
multiplicative increase/linear decrease scheme. This scheme trades
latency for throughput more than before, and tends to assign large
budgets quickly to processes that are or become I/O-bound. For two of
the expiration reasons, the new version of the rules also contains
some more little improvements, briefly described below.

*No more backlog.* In this case, the budget was larger than the number
of sectors actually read/written by the process before it stopped
doing I/O. Hence, to reduce latency for the possible future I/O
requests of the process, the old rule simply set the next budget to
the number of sectors actually consumed by the process. However, if
there are still outstanding requests, then the process may not yet
have issued its next request just because it is still waiting for the
completion of some of the still outstanding ones. If this sub-case
holds true, then the new rule, instead of decreasing the budget,
doubles it, proactively, in the hope that: 1) a larger budget will fit
the actual needs of the process, and 2) the process is sequential and
hence a higher throughput will be achieved by serving the process
longer after granting it access to the device.

*Budget timeout*. The original rule set the new budget to the maximum
value B_max, to maximize throughput and let all processes experiencing
budget timeouts receive the same share of the device time. In our
experiments we verified that this sudden jump to B_max did not provide
sensible benefits; rather it increased the latency of processes
performing sporadic and short I/O. The new rule only doubles the
budget.

[1] P. Valente and M. Andreolini, "Improving Application
Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
the 5th Annual International Systems and Storage Conference
(SYSTOR '12), June 2012.
Slightly extended version:
http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
results.pdf
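
A compact sketch of the new rules; bfq_default_max_budget is quoted
from this patch series, the rest is a model rather than the kernel
code:

#include <stdio.h>

#define B_MAX	(16 * 1024)	/* sectors, bfq_default_max_budget */

static int next_budget(int budget, int served, int timed_out,
		       int outstanding)
{
	int next;

	if (timed_out)
		next = budget * 2;	/* was: jump straight to B_MAX */
	else if (outstanding)
		next = budget * 2;	/* process may just be waiting */
	else
		next = served;		/* fit the budget to actual need */

	return next > B_MAX ? B_MAX : next;
}

int main(void)
{
	int b = B_MAX / 2;		/* new initial value per this patch */

	printf("after a budget timeout: %d\n", next_budget(b, 0, 1, 0));
	return 0;
}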

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 block/bfq-iosched.c | 87 +
 1 file changed, 41 insertions(+), 46 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index af1740a..1edac72 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -752,9 +752,6 @@ static struct kmem_cache *bfq_pool;
 #define BFQQ_CLOSE_THR (sector_t)(8 * 1024)
 #define BFQQ_SEEKY(bfqq)   (hweight32(bfqq->seek_history) > 32/8)
 
-/* Budget feedback step. */
-#define BFQ_BUDGET_STEP 128
-
 /* Min samples used for peak rate estimation (for autotuning). */
 #define BFQ_PEAK_RATE_SAMPLES  32
 
@@ -4074,40 +4071,6 @@ static struct bfq_queue *bfq_set_in_service_queue(struct 
bfq_data *bfqd)
return bfqq;
 }
 
-/*
- * bfq_default_budget - return the default budget for @bfqq on @bfqd.
- * @bfqd: the device descriptor.
- * @bfqq: the queue to consider.
- *
- * We use 3/4 of the @bfqd maximum budget as the default value
- * for the max_budget field of the queues.  This lets the feedback
- * mechanism to start from some middle ground, then the behavior
- * of the process will drive the heuristics towards high values, if
- * it behaves as a greedy sequential reader, or towards small values
- * if it shows a more intermittent behavior.
- */
-static unsigned long bfq_default_budget(struct bfq_data *bfqd,
-   struct bfq_queue *bfqq)
-{
-   unsigned long budget;
-
-   /*
-* When we need an estimate of the peak rate we need to avoid
-* to give budgets that are too short due to previous
-* measurements.  So, in the first 10 assignments use a
-* ``safe'' budget value. For such first assignment the value
-* of bfqd->budgets_assigned happens to be 

[PATCH for-4.9 1/5] blk-mq: Avoid memory reclaim when remapping queues

2017-04-12 Thread Sumit Semwal
From: Gabriel Krisman Bertazi 

[ Upstream commit 36e1f3d107867b25c616c2fd294f5a1c9d4e5d09 ]

While stressing memory and IO at the same time we changed SMT settings,
we were able to consistently trigger deadlocks in the mm system, which
froze the entire machine.

I think that under memory stress conditions, the large allocations
performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
waiting on the block layer remapping completion, thus deadlocking the
system.  The trace below was collected after the machine stalled,
waiting for the hotplug event completion.

The simplest fix for this is to make allocations in this path
non-reclaimable, with GFP_NOIO.  With this patch, we couldn't hit the
issue anymore.
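
To see why the gfp change breaks the cycle, here is a userspace model
of the relevant gfp relationships (the bit values are made up, only
the set relations matter):

#include <stdio.h>

#define __GFP_RECLAIM	0x1	/* may enter direct reclaim */
#define __GFP_IO	0x2	/* reclaim may start physical IO */
#define __GFP_FS	0x4	/* reclaim may call into the FS */

#define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_NOIO	(__GFP_RECLAIM)

int main(void)
{
	/*
	 * Dropping __GFP_IO/__GFP_FS means reclaim triggered by the
	 * blk_mq_init_rq_map() allocations can no longer recurse into
	 * filesystem writeback that is stuck behind the queue remap.
	 */
	printf("GFP_KERNEL=%#x GFP_NOIO=%#x\n", GFP_KERNEL, GFP_NOIO);
	return 0;
}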

This should apply on top of Jens's for-next branch cleanly.

Changes since v1:
  - Use GFP_NOIO instead of GFP_NOWAIT.

 Call Trace:
[c00f0160aaf0] [c00f0160ab50] 0xc00f0160ab50 (unreliable)
[c00f0160acc0] [c0016624] __switch_to+0x2e4/0x430
[c00f0160ad20] [c0b1a880] __schedule+0x310/0x9b0
[c00f0160ae00] [c0b1af68] schedule+0x48/0xc0
[c00f0160ae30] [c0b1b4b0] schedule_preempt_disabled+0x20/0x30
[c00f0160ae50] [c0b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
[c00f0160aed0] [c0b1d678] mutex_lock+0x78/0xa0
[c00f0160af00] [d00019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
[c00f0160b0b0] [d00019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
[c00f0160b0f0] [d000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
[c00f0160b120] [c03172c8] super_cache_scan+0x1f8/0x210
[c00f0160b190] [c026301c] shrink_slab.part.13+0x21c/0x4c0
[c00f0160b2d0] [c0268088] shrink_zone+0x2d8/0x3c0
[c00f0160b380] [c026834c] do_try_to_free_pages+0x1dc/0x520
[c00f0160b450] [c026876c] try_to_free_pages+0xdc/0x250
[c00f0160b4e0] [c0251978] __alloc_pages_nodemask+0x868/0x10d0
[c00f0160b6f0] [c0567030] blk_mq_init_rq_map+0x160/0x380
[c00f0160b7a0] [c056758c] blk_mq_map_swqueue+0x33c/0x360
[c00f0160b820] [c0567904] blk_mq_queue_reinit+0x64/0xb0
[c00f0160b850] [c056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
[c00f0160b8a0] [c00f5d38] notifier_call_chain+0x98/0x100
[c00f0160b8f0] [c00c5fb0] __cpu_notify+0x70/0xe0
[c00f0160b930] [c00c63c4] notify_prepare+0x44/0xb0
[c00f0160b9b0] [c00c52f4] cpuhp_invoke_callback+0x84/0x250
[c00f0160ba10] [c00c570c] cpuhp_up_callbacks+0x5c/0x120
[c00f0160ba60] [c00c7cb8] _cpu_up+0xf8/0x1d0
[c00f0160bac0] [c00c7eb0] do_cpu_up+0x120/0x150
[c00f0160bb40] [c06fe024] cpu_subsys_online+0x64/0xe0
[c00f0160bb90] [c06f5124] device_online+0xb4/0x120
[c00f0160bbd0] [c06f5244] online_store+0xb4/0xc0
[c00f0160bc20] [c06f0a68] dev_attr_store+0x68/0xa0
[c00f0160bc60] [c03ccc30] sysfs_kf_write+0x80/0xb0
[c00f0160bca0] [c03cbabc] kernfs_fop_write+0x17c/0x250
[c00f0160bcf0] [c030fe6c] __vfs_write+0x6c/0x1e0
[c00f0160bd90] [c0311490] vfs_write+0xd0/0x270
[c00f0160bde0] [c03131fc] SyS_write+0x6c/0x110
[c00f0160be30] [c0009204] system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi 
Cc: Brian King 
Cc: Douglas Miller 
Cc: linux-block@vger.kernel.org
Cc: linux-s...@vger.kernel.org
Signed-off-by: Jens Axboe 
Signed-off-by: Sumit Semwal 
---
 block/blk-mq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ee54ad0..7b597ec 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1474,7 +1474,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct 
blk_mq_tag_set *set,
INIT_LIST_HEAD(&tags->page_list);
 
tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
-GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
+GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
 set->numa_node);
if (!tags->rqs) {
blk_mq_free_tags(tags);
@@ -1500,7 +1500,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct 
blk_mq_tag_set *set,
 
do {
page = alloc_pages_node(set->numa_node,
-   GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY | 
__GFP_ZERO,
+   GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY | 
__GFP_ZERO,
this_order);
if (page)
break;
@@ -1521,7 +1521,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct 
blk_mq_tag_set *set,
 * Allow kmemleak to scan these pages as they contain pointers

[v2 PATCH for-4.4 15/16] blk-mq: Avoid memory reclaim when remapping queues

2017-04-12 Thread Sumit Semwal
From: Gabriel Krisman Bertazi 

[ Upstream commit 36e1f3d107867b25c616c2fd294f5a1c9d4e5d09 ]

While stressing memory and IO at the same time we changed SMT settings,
we were able to consistently trigger deadlocks in the mm system, which
froze the entire machine.

I think that under memory stress conditions, the large allocations
performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
waiting on the block layer remapping completion, thus deadlocking the
system.  The trace below was collected after the machine stalled,
waiting for the hotplug event completion.

The simplest fix for this is to make allocations in this path
non-reclaimable, with GFP_NOIO.  With this patch, we couldn't hit the
issue anymore.

This should apply on top of Jens's for-next branch cleanly.

Changes since v1:
  - Use GFP_NOIO instead of GFP_NOWAIT.

 Call Trace:
[c00f0160aaf0] [c00f0160ab50] 0xc00f0160ab50 (unreliable)
[c00f0160acc0] [c0016624] __switch_to+0x2e4/0x430
[c00f0160ad20] [c0b1a880] __schedule+0x310/0x9b0
[c00f0160ae00] [c0b1af68] schedule+0x48/0xc0
[c00f0160ae30] [c0b1b4b0] schedule_preempt_disabled+0x20/0x30
[c00f0160ae50] [c0b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
[c00f0160aed0] [c0b1d678] mutex_lock+0x78/0xa0
[c00f0160af00] [d00019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
[c00f0160b0b0] [d00019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
[c00f0160b0f0] [d000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
[c00f0160b120] [c03172c8] super_cache_scan+0x1f8/0x210
[c00f0160b190] [c026301c] shrink_slab.part.13+0x21c/0x4c0
[c00f0160b2d0] [c0268088] shrink_zone+0x2d8/0x3c0
[c00f0160b380] [c026834c] do_try_to_free_pages+0x1dc/0x520
[c00f0160b450] [c026876c] try_to_free_pages+0xdc/0x250
[c00f0160b4e0] [c0251978] __alloc_pages_nodemask+0x868/0x10d0
[c00f0160b6f0] [c0567030] blk_mq_init_rq_map+0x160/0x380
[c00f0160b7a0] [c056758c] blk_mq_map_swqueue+0x33c/0x360
[c00f0160b820] [c0567904] blk_mq_queue_reinit+0x64/0xb0
[c00f0160b850] [c056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
[c00f0160b8a0] [c00f5d38] notifier_call_chain+0x98/0x100
[c00f0160b8f0] [c00c5fb0] __cpu_notify+0x70/0xe0
[c00f0160b930] [c00c63c4] notify_prepare+0x44/0xb0
[c00f0160b9b0] [c00c52f4] cpuhp_invoke_callback+0x84/0x250
[c00f0160ba10] [c00c570c] cpuhp_up_callbacks+0x5c/0x120
[c00f0160ba60] [c00c7cb8] _cpu_up+0xf8/0x1d0
[c00f0160bac0] [c00c7eb0] do_cpu_up+0x120/0x150
[c00f0160bb40] [c06fe024] cpu_subsys_online+0x64/0xe0
[c00f0160bb90] [c06f5124] device_online+0xb4/0x120
[c00f0160bbd0] [c06f5244] online_store+0xb4/0xc0
[c00f0160bc20] [c06f0a68] dev_attr_store+0x68/0xa0
[c00f0160bc60] [c03ccc30] sysfs_kf_write+0x80/0xb0
[c00f0160bca0] [c03cbabc] kernfs_fop_write+0x17c/0x250
[c00f0160bcf0] [c030fe6c] __vfs_write+0x6c/0x1e0
[c00f0160bd90] [c0311490] vfs_write+0xd0/0x270
[c00f0160bde0] [c03131fc] SyS_write+0x6c/0x110
[c00f0160be30] [c0009204] system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi 
Cc: Brian King 
Cc: Douglas Miller 
Cc: linux-block@vger.kernel.org
Cc: linux-s...@vger.kernel.org
Signed-off-by: Jens Axboe 
Signed-off-by: Sumit Semwal 
---
 block/blk-mq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d8d63c3..0d1af3e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1470,7 +1470,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
INIT_LIST_HEAD(&tags->page_list);
 
tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
-GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
+GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
 set->numa_node);
if (!tags->rqs) {
blk_mq_free_tags(tags);
@@ -1496,7 +1496,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 
do {
page = alloc_pages_node(set->numa_node,
-   GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
+   GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
this_order);
if (page)
break;
@@ -1517,7 +1517,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 * Allow kmemleak to scan these pages as they contain pointers

Re: [PATCH 5/9] nowait aio: return on congested block device

2017-04-12 Thread Goldwyn Rodrigues


On 04/12/2017 03:36 AM, Christoph Hellwig wrote:
> As mentioned last time around, this should be a REQ_NOWAIT flag so
> that it can be easily passed down to the request layer.
> 
>> +static inline void bio_wouldblock_error(struct bio *bio)
>> +{
>> +bio->bi_error = -EAGAIN;
>> +bio_endio(bio);
>> +}
> 
> Please skip this helper..

Why? It is being called three times.
I am incorporating all the rest of the comments, besides this one. Thanks.
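For context, a sketch of how the helper would be used at one of those call
sites. The queue-capability test "blk_queue_might_block()" is hypothetical
and named here only for illustration; it is not part of the patch:

	/* Fail a BIO_NOWAIT bio with -EAGAIN instead of sleeping. */
	if (bio_flagged(bio, BIO_NOWAIT) && blk_queue_might_block(q)) {
		bio_wouldblock_error(bio);	/* bi_error = -EAGAIN + bio_endio() */
		return BLK_QC_T_NONE;
	}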

> 
>> +#define QUEUE_FLAG_NOWAIT  28   /* queue supports BIO_NOWAIT */
> 
> Please make the flag name a little more descriptive, this sounds like
> it will never wait.
> 

-- 
Goldwyn


Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically

2017-04-12 Thread Bart Van Assche
On Wed, 2017-04-12 at 12:55 +0200, Benjamin Block wrote:
> On Fri, Apr 07, 2017 at 11:16:48AM -0700, Bart Van Assche wrote:
> > The six patches in this patch series fix the queue lockup I reported
> > recently on the linux-block mailing list. Please consider these patches
> > for inclusion in the upstream kernel.
> 
> just out of curiosity. Is this maybe related to similar stuff happening
> when CPUs are hot plugged - at least in that the stack gets stuck? Like
> in this thread here:
> https://www.mail-archive.com/linux-block@vger.kernel.org/msg06057.html
> 
> Would be interesting, because we recently saw similar stuff happening.

Hello Benjamin,

My proposal is to repeat that test with Jens' for-next branch. If the issue
still occurs with that tree then please check the contents of
/sys/kernel/debug/block/*/mq/*/{dispatch,*/rq_list}. That will allow you to
determine whether or not any block layer requests are still pending. If
running the command below resolves the deadlock then it means that a
trigger to run a block layer queue is still missing somewhere:

for a in /sys/kernel/debug/block/*/mq/state; do echo run >$a; done

See also git://git.kernel.dk/linux-block.git.

Bart.

Re: [PATCH] blk-mq: Fix blk_execute_rq_nowait() handling of dying queues

2017-04-12 Thread Bart Van Assche
On Wed, 2017-04-12 at 13:01 +0800, Ming Lei wrote:
> On Wed, Apr 12, 2017 at 7:58 AM, Bart Van Assche
>  wrote:
> > 
> > diff --git a/block/blk-exec.c b/block/blk-exec.c
> > index 8cd0e9bc8dc8..f7d9bed2cb15 100644
> > --- a/block/blk-exec.c
> > +++ b/block/blk-exec.c
> > @@ -57,10 +57,13 @@ void blk_execute_rq_nowait(struct request_queue *q, 
> > struct gendisk *bd_disk,
> > rq->end_io = done;
> > 
> > /*
> > -* don't check dying flag for MQ because the request won't
> > -* be reused after dying flag is set
> > +* The blk_freeze_queue() call in blk_set_queue_dying() and the
> > +* test of the "dying" flag in blk_queue_enter() guarantee that
> > +* blk_execute_rq_nowait() won't be called anymore after the "dying"
> > +* flag has been set.
> 
> That can never be guaranteed; see the following case:
> 
> 1) blk_get_request() is called just before queue is set as dying in another 
> path
> 
> 2) the request is allocated successfully and passed to
> blk_execute_rq_nowait() even
> though queue has been set as dying

Hello Ming,

Shouldn't the blk-mq code guarantee that blk_execute_rq_nowait() won't be
called anymore after the "dying" flag has been set? I think changing the
blk_freeze_queue_start() call into blk_freeze_queue() in blk_set_queue_dying()
is sufficient to achieve this.
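
A sketch of that proposal (context approximate; this is not a literal diff
against any tree):

	void blk_set_queue_dying(struct request_queue *q)
	{
		/* ... mark QUEUE_FLAG_DYING as before ... */

		/* was: blk_freeze_queue_start(q); -- only starts the freeze */
		blk_freeze_queue(q);	/* also waits until q_usage_counter drains */

		/* ... wake up waiters as before ... */
	}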

Note: after I had posted this patch I have been able to reproduce the issue
described in the patch description. Although I still think we need the patch
at the start of this e-mail thread, it doesn't fix the issue I described.

Bart.

Re: [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically

2017-04-12 Thread Bart Van Assche
On Wed, 2017-04-12 at 11:42 +0800, Ming Lei wrote:
> On Tue, Apr 11, 2017 at 06:18:36PM +, Bart Van Assche wrote:
> > On Tue, 2017-04-11 at 14:03 -0400, Mike Snitzer wrote:
> > > Rather than working so hard to use DM code against me, your argument
> > > should be: "blk-mq drivers X, Y and Z rerun the hw queue; this is a well
> > > established pattern"
> > > 
> > > I see drivers/nvme/host/fc.c:nvme_fc_start_fcp_op() does.  But that is
> > > only one other driver out of ~20 BLK_MQ_RQ_QUEUE_BUSY returns
> > > tree-wide.
> > > 
> > > Could be there are some others, but hardly a well-established pattern.
> > 
> > Hello Mike,
> > 
> > Several blk-mq drivers that can return BLK_MQ_RQ_QUEUE_BUSY from their
> > .queue_rq() implementation stop the request queue (blk_mq_stop_hw_queue())
> > before returning "busy" and restart the queue after the busy condition has
> > been cleared (blk_mq_start_stopped_hw_queues()). Examples are virtio_blk and
> > xen-blkfront. However, this approach is not appropriate for the dm-mq core
> > nor for the scsi core since both drivers already use the "stopped" state for
> > another purpose than tracking whether or not a hardware queue is busy. Hence
> > the blk_mq_delay_run_hw_queue() and blk_mq_run_hw_queue() calls in these 
> > last
> > two drivers to rerun a hardware queue after the busy state has been cleared.
> 
> But looks this patch just reruns the hw queue after 100ms, which isn't
> that after the busy state has been cleared, right?

Hello Ming,

That patch can be considered as a first step that can be refined further, namely
by modifying the dm-rq code further such that dm-rq queues are only rerun after
the busy condition has been cleared. The patch at the start of this thread is
easier to review and easier to test than any patch that would only rerun dm-rq
queues after the busy condition has been cleared.

> Actually if BLK_MQ_RQ_QUEUE_BUSY is returned from .queue_rq(), blk-mq
> will buffer this request into hctx->dispatch and run the hw queue again,
> so looks blk_mq_delay_run_hw_queue() in this situation shouldn't have been
> needed at my 1st impression.

If the blk-mq core would always rerun a hardware queue if a block driver
returns BLK_MQ_RQ_QUEUE_BUSY then that would cause 100% of a single CPU core
to be busy with polling a hardware queue until the "busy" condition has been
cleared. One can see easily that that's not what the blk-mq core does. From
blk_mq_sched_dispatch_requests():

	if (!list_empty(&rq_list)) {
		blk_mq_sched_mark_restart_hctx(hctx);
		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
	}

From the end of blk_mq_dispatch_rq_list():

	if (!list_empty(list)) {
		[ ... ]
		if (!blk_mq_sched_needs_restart(hctx) &&
		    !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state))
			blk_mq_run_hw_queue(hctx, true);
	}

In other words, the BLK_MQ_S_SCHED_RESTART flag is set before the dispatch
list is examined, and the dispatch list is only rerun after a block driver
returned BLK_MQ_RQ_QUEUE_BUSY if that flag got cleared by a concurrent
blk_mq_sched_restart_hctx() call while blk_mq_dispatch_rq_list() was in
progress.
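
For reference, the restart side of that handshake looks roughly like this in
the 4.11-era blk-mq scheduler code (quoted from memory -- verify against the
tree before relying on it):

	static void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
	{
		if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
			clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
			blk_mq_run_hw_queue(hctx, true);
		}
	}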

Bart.

Re: [PATCH V3 01/16] block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler

2017-04-12 Thread kbuild test robot
Hi Paolo,

[auto build test ERROR on block/for-next]
[also build test ERROR on v4.11-rc6 next-20170412]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Paolo-Valente/Introduce-the-BFQ-I-O-scheduler/20170412-021320
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git 
for-next
config: blackfin-allyesconfig (attached as .config)
compiler: bfin-uclinux-gcc (GCC) 6.2.0
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=blackfin 

Note: the 
linux-review/Paolo-Valente/Introduce-the-BFQ-I-O-scheduler/20170412-021320 HEAD 
36eb6533f8b6705991185201f75e98880cd223f7 builds fine.
  It only hurts bisectibility.

All error/warnings (new ones prefixed by >>):

   block/bfq-iosched.c: In function 'bfq_update_peak_rate':
>> block/bfq-iosched.c:2674:6: error: 'delta_usecs' undeclared (first use in 
>> this function)
 if (delta_usecs < 1000) {
 ^~~
   block/bfq-iosched.c:2674:6: note: each undeclared identifier is reported 
only once for each function it appears in
>> block/bfq-iosched.c:2739:22: error: invalid storage class for function 
>> 'bfq_smallest_from_now'
static unsigned long bfq_smallest_from_now(void)
 ^
>> block/bfq-iosched.c:2739:1: warning: ISO C90 forbids mixed declarations and 
>> code [-Wdeclaration-after-statement]
static unsigned long bfq_smallest_from_now(void)
^~
>> block/bfq-iosched.c:2774:13: error: invalid storage class for function 
>> 'bfq_bfqq_expire'
static void bfq_bfqq_expire(struct bfq_data *bfqd,
^~~
>> block/bfq-iosched.c:2823:13: error: invalid storage class for function 
>> 'bfq_bfqq_budget_timeout'
static bool bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
^~~
>> block/bfq-iosched.c:2839:13: error: invalid storage class for function 
>> 'bfq_may_expire_for_budg_timeout'
static bool bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
^~~
>> block/bfq-iosched.c:2858:13: error: invalid storage class for function 
>> 'bfq_bfqq_may_idle'
static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
^
>> block/bfq-iosched.c:2901:13: error: invalid storage class for function 
>> 'bfq_bfqq_must_idle'
static bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
^~
>> block/bfq-iosched.c:2913:26: error: invalid storage class for function 
>> 'bfq_select_queue'
static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
 ^~~~
>> block/bfq-iosched.c:3012:24: error: invalid storage class for function 
>> 'bfq_dispatch_rq_from_bfqq'
static struct request *bfq_dispatch_rq_from_bfqq(struct bfq_data *bfqd,
   ^
>> block/bfq-iosched.c:3044:13: error: invalid storage class for function 
>> 'bfq_has_work'
static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
^~~~
>> block/bfq-iosched.c:3056:24: error: invalid storage class for function 
>> '__bfq_dispatch_request'
static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
   ^~
>> block/bfq-iosched.c:3141:24: error: invalid storage class for function 
>> 'bfq_dispatch_request'
static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
   ^~~~
>> block/bfq-iosched.c:3160:13: error: invalid storage class for function 
>> 'bfq_put_queue'
static void bfq_put_queue(struct bfq_queue *bfqq)
^
>> block/bfq-iosched.c:3173:13: error: invalid storage class for function 
>> 'bfq_exit_bfqq'
static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
^
>> block/bfq-iosched.c:3185:13: error: invalid storage class for function 
>> 'bfq_exit_icq_bfqq'
static void bfq_exit_icq_bfqq(struct bfq_io_cq *bic, bool is_sync)
^
>> block/bfq-iosched.c:3203:13: error: invalid storage class for function 
>> 'bfq_exit_icq'
static void bfq_exit_icq(struct io_cq *icq)
^~~~
>> block/bfq-iosched.c:3216:1: error: invalid storage class for function 
>> 'bfq_set_next_ioprio_data'
bfq_set_next_ioprio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
^~~~
>> block/bfq-iosched.c:3262:13

Re: [PATCH V3 01/16] block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler

2017-04-12 Thread kbuild test robot
Hi Paolo,

[auto build test ERROR on block/for-next]
[also build test ERROR on v4.11-rc6 next-20170412]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Paolo-Valente/Introduce-the-BFQ-I-O-scheduler/20170412-021320
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git 
for-next
config: sh-allmodconfig (attached as .config)
compiler: sh4-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=sh 

Note: the 
linux-review/Paolo-Valente/Introduce-the-BFQ-I-O-scheduler/20170412-021320 HEAD 
36eb6533f8b6705991185201f75e98880cd223f7 builds fine.
  It only hurts bisectibility.

All error/warnings (new ones prefixed by >>):

 ^~~~
   block/bfq-iosched.c:4095:40: error: initializer element is not constant
 __ATTR(name, 0644, bfq_##name##_show, bfq_##name##_store)
   ^
   include/linux/sysfs.h:104:11: note: in definition of macro '__ATTR'
 .store = _store,  \
  ^~
   block/bfq-iosched.c:4105:2: note: in expansion of macro 'BFQ_ATTR'
 BFQ_ATTR(timeout_sync),
 ^~~~
   block/bfq-iosched.c:4095:40: note: (near initialization for 
'bfq_attrs[7].store')
 __ATTR(name, 0644, bfq_##name##_show, bfq_##name##_store)
   ^
   include/linux/sysfs.h:104:11: note: in definition of macro '__ATTR'
 .store = _store,  \
  ^~
   block/bfq-iosched.c:4105:2: note: in expansion of macro 'BFQ_ATTR'
 BFQ_ATTR(timeout_sync),
 ^~~~
   block/bfq-iosched.c:4095:21: error: initializer element is not constant
 __ATTR(name, 0644, bfq_##name##_show, bfq_##name##_store)
^
   include/linux/sysfs.h:103:10: note: in definition of macro '__ATTR'
 .show = _show,  \
 ^
   block/bfq-iosched.c:4106:2: note: in expansion of macro 'BFQ_ATTR'
 BFQ_ATTR(strict_guarantees),
 ^~~~
   block/bfq-iosched.c:4095:21: note: (near initialization for 
'bfq_attrs[8].show')
 __ATTR(name, 0644, bfq_##name##_show, bfq_##name##_store)
^
   include/linux/sysfs.h:103:10: note: in definition of macro '__ATTR'
 .show = _show,  \
 ^
   block/bfq-iosched.c:4106:2: note: in expansion of macro 'BFQ_ATTR'
 BFQ_ATTR(strict_guarantees),
 ^~~~
   block/bfq-iosched.c:4095:40: error: initializer element is not constant
 __ATTR(name, 0644, bfq_##name##_show, bfq_##name##_store)
   ^
   include/linux/sysfs.h:104:11: note: in definition of macro '__ATTR'
 .store = _store,  \
  ^~
   block/bfq-iosched.c:4106:2: note: in expansion of macro 'BFQ_ATTR'
 BFQ_ATTR(strict_guarantees),
 ^~~~
   block/bfq-iosched.c:4095:40: note: (near initialization for 
'bfq_attrs[8].store')
 __ATTR(name, 0644, bfq_##name##_show, bfq_##name##_store)
   ^
   include/linux/sysfs.h:104:11: note: in definition of macro '__ATTR'
 .store = _store,  \
  ^~
   block/bfq-iosched.c:4106:2: note: in expansion of macro 'BFQ_ATTR'
 BFQ_ATTR(strict_guarantees),
 ^~~~
   block/bfq-iosched.c:4112:19: error: initializer element is not constant
  .get_rq_priv  = bfq_get_rq_private,
  ^~
   block/bfq-iosched.c:4112:19: note: (near initialization for 
'iosched_bfq_mq.ops.mq.get_rq_priv')
   block/bfq-iosched.c:4113:19: error: initializer element is not constant
  .put_rq_priv  = bfq_put_rq_private,
  ^~
   block/bfq-iosched.c:4113:19: note: (near initialization for 
'iosched_bfq_mq.ops.mq.put_rq_priv')
   block/bfq-iosched.c:4114:16: error: initializer element is not constant
  .exit_icq  = bfq_exit_icq,
   ^~~~
   block/bfq-iosched.c:4114:16: note: (near initialization for 
'iosched_bfq_mq.ops.mq.exit_icq')
   block/bfq-iosched.c:4115:22: error: initializer element is not constant
  .insert_requests = bfq_insert_requests,
 ^~~
   block/bfq-iosched.c:4115:22: note: (near initialization for 
'iosched_bfq_mq.ops.mq.insert_requests')
   block/bfq-iosched.c:4116:23: error: initializer element is not constant
  .dispatch_request = bfq_dispatch_request,
  ^~~~
   block/bfq-iosched.c:4116:23: note: (near initialization for 
'iosched_bfq_mq.ops.mq.dispatch_request')
   block/bfq-iosched.c:4124:16: error: initializer element is not constant
  .has_work  = bfq_has_work,
   ^~~~
   block/bfq-iosched.c:4124:16: note:

Re: [PATCH v6] lightnvm: physical block device (pblk) target

2017-04-12 Thread Matias Bjørling

On 04/12/2017 03:19 PM, Javier González wrote:

This patch introduces pblk, a host-side translation layer for
Open-Channel SSDs to expose them like block devices. The translation
layer allows data placement decisions and I/O scheduling to be
managed by the host, enabling users to optimize the SSD for their
specific workloads.

An open-channel SSD has a set of LUNs (parallel units) and a
collection of blocks. Each block can be read in any order, but
writes must be sequential. Writes may also fail, and if a block
requires it, must also be reset before new writes can be
applied.

To manage the constraints, pblk maintains a logical to
physical address (L2P) table, a write cache, garbage
collection logic, a recovery scheme, and logic to rate-limit
user I/Os versus garbage collection I/Os.

The L2P table is fully-associative and manages sectors at a
4KB granularity. Pblk stores the L2P table in two places, in
the out-of-band area of the media and on the last page of a
line. In the case of a power failure, pblk will perform a
scan to recover the L2P table.
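
As a toy illustration of what a fully-associative L2P map at 4KB
granularity implies (again not pblk code; the names are invented):

	/* Toy model: one physical address per 4KB logical sector. */
	struct toy_l2p {
		u64 *map;	/* indexed by logical sector, holds a physical address */
		u64 nr_secs;	/* capacity of the instance in 4KB sectors */
	};

	static u64 toy_l2p_lookup(struct toy_l2p *l2p, u64 lba)
	{
		/* fully associative: any logical sector can map anywhere */
		return l2p->map[lba];
	}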

The user data is organized into lines. A line is data
striped across blocks and LUNs. The lines enable the host to
reduce the amount of metadata to maintain besides the user
data and make it easier to implement RAID or erasure coding
in the future.

pblk implements multi-tenant support and can be instantiated
multiple times on the same drive. Each instance owns a
portion of the SSD - in terms of both I/O bandwidth and
capacity - providing I/O isolation for each instance.

Finally, pblk also exposes a sysfs interface that allows
user-space to peek into the internals of pblk. The interface
is available at /dev/block/*/pblk/, where * is the exposed
block device name.

This work also contains contributions from:
  Matias Bjørling 
  Simon A. F. Lund 
  Young Tack Jin 
  Huaicheng Li 

Signed-off-by: Javier González 
---


Thanks Javier. Applied to 4.12, and replaced the v5 version.


Re: [PATCH V3 02/16] block, bfq: add full hierarchical scheduling and cgroups support

2017-04-12 Thread kbuild test robot
Hi Arianna,

[auto build test ERROR on block/for-next]
[also build test ERROR on v4.11-rc6 next-20170412]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Paolo-Valente/Introduce-the-BFQ-I-O-scheduler/20170412-021320
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git 
for-next
config: m32r-allyesconfig (attached as .config)
compiler: m32r-linux-gcc (GCC) 6.2.0
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=m32r 

Note: the 
linux-review/Paolo-Valente/Introduce-the-BFQ-I-O-scheduler/20170412-021320 HEAD 
36eb6533f8b6705991185201f75e98880cd223f7 builds fine.
  It only hurts bisectibility.

All error/warnings (new ones prefixed by >>):

^~~
   block/bfq-iosched.c:4559:13: error: invalid storage class for function 
'bfq_bfqq_may_idle'
static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
^
   block/bfq-iosched.c:4602:13: error: invalid storage class for function 
'bfq_bfqq_must_idle'
static bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
^~
   block/bfq-iosched.c:4614:26: error: invalid storage class for function 
'bfq_select_queue'
static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
 ^~~~
   block/bfq-iosched.c:4714:24: error: invalid storage class for function 
'bfq_dispatch_rq_from_bfqq'
static struct request *bfq_dispatch_rq_from_bfqq(struct bfq_data *bfqd,
   ^
   block/bfq-iosched.c:4746:13: error: invalid storage class for function 
'bfq_has_work'
static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
^~~~
   block/bfq-iosched.c:4758:24: error: invalid storage class for function 
'__bfq_dispatch_request'
static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
   ^~
   block/bfq-iosched.c:4843:24: error: invalid storage class for function 
'bfq_dispatch_request'
static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
   ^~~~
   block/bfq-iosched.c:4862:13: error: invalid storage class for function 
'bfq_put_queue'
static void bfq_put_queue(struct bfq_queue *bfqq)
^
   block/bfq-iosched.c:4884:13: error: invalid storage class for function 
'bfq_exit_bfqq'
static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
^
   block/bfq-iosched.c:4896:13: error: invalid storage class for function 
'bfq_exit_icq_bfqq'
static void bfq_exit_icq_bfqq(struct bfq_io_cq *bic, bool is_sync)
^
   block/bfq-iosched.c:4914:13: error: invalid storage class for function 
'bfq_exit_icq'
static void bfq_exit_icq(struct io_cq *icq)
^~~~
   block/bfq-iosched.c:4927:1: error: invalid storage class for function 
'bfq_set_next_ioprio_data'
bfq_set_next_ioprio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
^~~~
   block/bfq-iosched.c:4973:13: error: invalid storage class for function 
'bfq_check_ioprio_change'
static void bfq_check_ioprio_change(struct bfq_io_cq *bic, struct bio *bio)
^~~
   block/bfq-iosched.c:5001:13: error: invalid storage class for function 
'bfq_init_bfqq'
static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
^
   block/bfq-iosched.c:5036:27: error: invalid storage class for function 
'bfq_async_queue_prio'
static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
  ^~~~
   block/bfq-iosched.c:5055:26: error: invalid storage class for function 
'bfq_get_queue'
static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
 ^
   block/bfq-iosched.c:5120:13: error: invalid storage class for function 
'bfq_update_io_thinktime'
static void bfq_update_io_thinktime(struct bfq_data *bfqd,
^~~
   block/bfq-iosched.c:5135:1: error: invalid storage class for function 
'bfq_update_io_seektime'
bfq_update_io_seektime(struct bfq_data *bfqd, struct bfq_queue *bfqq,
^~
   block/bfq-iosched.c:5157:13: error: invalid storage class for function 
'bfq_update_idle_window'
static void bfq_update_idle_window(struct bfq_data *bfqd,
^~
   block/bfq-iosched.c:5192:13: error: invalid storage class for function 
'bfq_rq_enqueued'
static void bfq_rq_enqueued(struct bfq_data

Re: [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically

2017-04-12 Thread Ming Lei
On Wed, Apr 12, 2017 at 06:38:07PM +, Bart Van Assche wrote:
> On Wed, 2017-04-12 at 11:42 +0800, Ming Lei wrote:
> > On Tue, Apr 11, 2017 at 06:18:36PM +, Bart Van Assche wrote:
> > > On Tue, 2017-04-11 at 14:03 -0400, Mike Snitzer wrote:
> > > > Rather than working so hard to use DM code against me, your argument
> > > > should be: "blk-mq drivers X, Y and Z rerun the hw queue; this is a well
> > > > established pattern"
> > > > 
> > > > I see drivers/nvme/host/fc.c:nvme_fc_start_fcp_op() does.  But that is
> > > > only one other driver out of ~20 BLK_MQ_RQ_QUEUE_BUSY returns
> > > > tree-wide.
> > > > 
> > > > Could be there are some others, but hardly a well-established pattern.
> > > 
> > > Hello Mike,
> > > 
> > > Several blk-mq drivers that can return BLK_MQ_RQ_QUEUE_BUSY from their
> > > .queue_rq() implementation stop the request queue (blk_mq_stop_hw_queue())
> > > before returning "busy" and restart the queue after the busy condition has
> > > been cleared (blk_mq_start_stopped_hw_queues()). Examples are virtio_blk 
> > > and
> > > xen-blkfront. However, this approach is not appropriate for the dm-mq core
> > > nor for the scsi core since both drivers already use the "stopped" state 
> > > for
> > > another purpose than tracking whether or not a hardware queue is busy. 
> > > Hence
> > > the blk_mq_delay_run_hw_queue() and blk_mq_run_hw_queue() calls in these 
> > > last
> > > two drivers to rerun a hardware queue after the busy state has been 
> > > cleared.
> > 
> > But looks this patch just reruns the hw queue after 100ms, which isn't
> > that after the busy state has been cleared, right?
> 
> Hello Ming,
> 
> That patch can be considered as a first step that can be refined further, 
> namely
> by modifying the dm-rq code further such that dm-rq queues are only rerun 
> after
> the busy condition has been cleared. The patch at the start of this thread is
> easier to review and easier to test than any patch that would only rerun dm-rq
> queues after the busy condition has been cleared.

OK, got it. It would have been better to add a comment about this change,
since rerunning the queue after 100ms is actually a workaround rather than
a final solution.

> 
> > Actually if BLK_MQ_RQ_QUEUE_BUSY is returned from .queue_rq(), blk-mq
> > will buffer this request into hctx->dispatch and run the hw queue again,
> > so looks blk_mq_delay_run_hw_queue() in this situation shouldn't have been
> > needed at my 1st impression.
> 
> If the blk-mq core would always rerun a hardware queue if a block driver
> returns BLK_MQ_RQ_QUEUE_BUSY then that would cause 100% of a single CPU core

It won't cause 100% CPU utilization, since we restart the queue in the
completion path, and at that time at least one tag is available, so
progress can be made.

> to be busy with polling a hardware queue until the "busy" condition has been
> cleared. One can see easily that that's not what the blk-mq core does. From
> blk_mq_sched_dispatch_requests():
> 
>   if (!list_empty(&rq_list)) {
>   blk_mq_sched_mark_restart_hctx(hctx);
>   did_work = blk_mq_dispatch_rq_list(q, &rq_list);
>   }
> 
> From the end of blk_mq_dispatch_rq_list():
> 
>   if (!list_empty(list)) {
>   [ ... ]
>   if (!blk_mq_sched_needs_restart(hctx) &&
>       !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state))
>   blk_mq_run_hw_queue(hctx, true);
>   }

That is exactly what I meant: blk-mq already provides this mechanism
to rerun the queue automatically in case of BLK_MQ_RQ_QUEUE_BUSY. If the
mechanism doesn't work well, we need to fix it there; why bother drivers
with working around it?

> 
> In other words, the BLK_MQ_S_SCHED_RESTART flag is set before the dispatch 
> list
> is examined and only if that flag gets cleared while blk_mq_dispatch_rq_list()
> is in progress by a concurrent blk_mq_sched_restart_hctx() call then the
> dispatch list will be rerun after a block driver returned 
> BLK_MQ_RQ_QUEUE_BUSY.

Yes, the queue is rerun either from the completion path when
BLK_MQ_S_SCHED_RESTART is set, or right after .queue_rq() returns
BLK_MQ_RQ_QUEUE_BUSY when the flag has been cleared concurrently from the
completion path.
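
In call-chain form, the completion-path rerun described above (approximate,
as I read the 4.11-era code; verify against the tree):

	/*
	 * request free/completion
	 *   -> blk_mq_sched_restart_queues(hctx)
	 *        -> blk_mq_sched_restart_hctx(hctx)
	 *             -> clears BLK_MQ_S_SCHED_RESTART and reruns the queue
	 */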

So in theory we can make sure the queue will be run again if
BLK_MQ_RQ_QUEUE_BUSY happens; what then is the root cause that makes us add
blk_mq_delay_run_hw_queue(hctx, 100) in dm's .queue_rq()?

Thanks,
Ming


Re: [PATCH] blk-mq: Fix blk_execute_rq_nowait() handling of dying queues

2017-04-12 Thread Ming Lei
On Thu, Apr 13, 2017 at 2:24 AM, Bart Van Assche
 wrote:
> On Wed, 2017-04-12 at 13:01 +0800, Ming Lei wrote:
>> On Wed, Apr 12, 2017 at 7:58 AM, Bart Van Assche
>>  wrote:
>> >
>> > diff --git a/block/blk-exec.c b/block/blk-exec.c
>> > index 8cd0e9bc8dc8..f7d9bed2cb15 100644
>> > --- a/block/blk-exec.c
>> > +++ b/block/blk-exec.c
>> > @@ -57,10 +57,13 @@ void blk_execute_rq_nowait(struct request_queue *q, 
>> > struct gendisk *bd_disk,
>> > rq->end_io = done;
>> >
>> > /*
>> > -* don't check dying flag for MQ because the request won't
>> > -* be reused after dying flag is set
>> > +* The blk_freeze_queue() call in blk_set_queue_dying() and the
>> > +* test of the "dying" flag in blk_queue_enter() guarantee that
>> > +* blk_execute_rq_nowait() won't be called anymore after the 
>> > "dying"
>> > +* flag has been set.
>>
>> That can never be guaranteed; see the following case:
>>
>> 1) blk_get_request() is called just before queue is set as dying in another 
>> path
>>
>> 2) the request is allocated successfully and passed to
>> blk_execute_rq_nowait() even
>> though queue has been set as dying
>
> Hello Ming,
>
> Shouldn't the blk-mq code guarantee that blk_execute_rq_nowait() won't be
> called anymore after the "dying" flag has been set? I think changing the
> blk_freeze_queue_start() call into blk_freeze_queue() in blk_set_queue_dying()
> is sufficient to achieve this.

I have explained that this change isn't enough.

>
> Note: after I had posted this patch I have been able to reproduce the issue
> described in the patch description. Although I still think we need the patch
> at the start of this e-mail thread, it doesn't fix the issue I described.

Since it fixes nothing, I don't suggest to do that.


Thanks,
Ming Lei