Re: [RESEND PATCH] block: sed-opal: Tone down all the pr_* to debugs

2017-04-07 Thread Jens Axboe
On 04/07/2017 01:58 PM, Scott Bauer wrote:
> Let's not flood the kernel log with messages unless
> the user requests them.

Thanks Scott, applied for 4.12.

-- 
Jens Axboe



[RESEND PATCH] block: sed-opal: Tone down all the pr_* to debugs

2017-04-07 Thread Scott Bauer
Let's not flood the kernel log with messages unless
the user requests them.

Signed-off-by: Scott Bauer 
---
 block/sed-opal.c | 153 +++
 1 file changed, 74 insertions(+), 79 deletions(-)

diff --git a/block/sed-opal.c b/block/sed-opal.c
index 6736c78..9b30ae5 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -275,8 +275,8 @@ static bool check_tper(const void *data)
u8 flags = tper->supported_features;
 
if (!(flags & TPER_SYNC_SUPPORTED)) {
-   pr_err("TPer sync not supported. flags = %d\n",
-  tper->supported_features);
+   pr_debug("TPer sync not supported. flags = %d\n",
+tper->supported_features);
return false;
}
 
@@ -289,7 +289,7 @@ static bool check_sum(const void *data)
u32 nlo = be32_to_cpu(sum->num_locking_objects);
 
if (nlo == 0) {
-   pr_err("Need at least one locking object.\n");
+   pr_debug("Need at least one locking object.\n");
return false;
}
 
@@ -385,9 +385,9 @@ static int next(struct opal_dev *dev)
 
error = step->fn(dev, step->data);
if (error) {
-   pr_err("Error on step function: %d with error %d: %s\n",
-  state, error,
-  opal_error_to_human(error));
+   pr_debug("Error on step function: %d with error %d: %s\n",
+state, error,
+opal_error_to_human(error));
 
/* For each OPAL command we do a discovery0 then we
 * start some sort of session.
@@ -419,8 +419,8 @@ static int opal_discovery0_end(struct opal_dev *dev)
print_buffer(dev->resp, hlen);
 
if (hlen > IO_BUFFER_LENGTH - sizeof(*hdr)) {
-   pr_warn("Discovery length overflows buffer (%zu+%u)/%u\n",
-   sizeof(*hdr), hlen, IO_BUFFER_LENGTH);
+   pr_debug("Discovery length overflows buffer (%zu+%u)/%u\n",
+sizeof(*hdr), hlen, IO_BUFFER_LENGTH);
return -EFAULT;
}
 
@@ -503,7 +503,7 @@ static void add_token_u8(int *err, struct opal_dev *cmd, u8 
tok)
if (*err)
return;
if (cmd->pos >= IO_BUFFER_LENGTH - 1) {
-   pr_err("Error adding u8: end of buffer.\n");
+   pr_debug("Error adding u8: end of buffer.\n");
*err = -ERANGE;
return;
}
@@ -553,7 +553,7 @@ static void add_token_u64(int *err, struct opal_dev *cmd, 
u64 number)
len = DIV_ROUND_UP(msb, 4);
 
if (cmd->pos >= IO_BUFFER_LENGTH - len - 1) {
-   pr_err("Error adding u64: end of buffer.\n");
+   pr_debug("Error adding u64: end of buffer.\n");
*err = -ERANGE;
return;
}
@@ -579,7 +579,7 @@ static void add_token_bytestring(int *err, struct opal_dev 
*cmd,
}
 
if (len >= IO_BUFFER_LENGTH - cmd->pos - header_len) {
-   pr_err("Error adding bytestring: end of buffer.\n");
+   pr_debug("Error adding bytestring: end of buffer.\n");
*err = -ERANGE;
return;
}
@@ -597,7 +597,7 @@ static void add_token_bytestring(int *err, struct opal_dev 
*cmd,
 static int build_locking_range(u8 *buffer, size_t length, u8 lr)
 {
if (length > OPAL_UID_LENGTH) {
-   pr_err("Can't build locking range. Length OOB\n");
+   pr_debug("Can't build locking range. Length OOB\n");
return -ERANGE;
}
 
@@ -614,7 +614,7 @@ static int build_locking_range(u8 *buffer, size_t length, 
u8 lr)
 static int build_locking_user(u8 *buffer, size_t length, u8 lr)
 {
if (length > OPAL_UID_LENGTH) {
-   pr_err("Can't build locking range user, Length OOB\n");
+   pr_debug("Can't build locking range user, Length OOB\n");
return -ERANGE;
}
 
@@ -648,7 +648,7 @@ static int cmd_finalize(struct opal_dev *cmd, u32 hsn, u32 
tsn)
add_token_u8(&err, cmd, OPAL_ENDLIST);
 
if (err) {
-   pr_err("Error finalizing command.\n");
+   pr_debug("Error finalizing command.\n");
return -EFAULT;
}
 
@@ -660,7 +660,7 @@ static int cmd_finalize(struct opal_dev *cmd, u32 hsn, u32 
tsn)
hdr->subpkt.length = cpu_to_be32(cmd->pos - sizeof(*hdr));
while (cmd->pos % 4) {
if (cmd->pos >= IO_BUFFER_LENGTH) {
-   pr_err("Error: Buffer overrun\n");
+   pr_debug("Error: Buffer overrun\n");
return -ERANGE;
}
cmd->cmd[cmd->pos++] = 0;
@@ -679,14 +679,14 @@ static const struct opal_resp_tok *response_get_token(
const struct 

Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically

2017-04-07 Thread Jens Axboe
On 04/07/2017 12:39 PM, Bart Van Assche wrote:
> On Fri, 2017-04-07 at 11:33 -0700, Bart Van Assche wrote:
>> On Fri, 2017-04-07 at 12:23 -0600, Jens Axboe wrote:
>>> On 04/07/2017 12:16 PM, Bart Van Assche wrote:
 Hello Jens,

 The six patches in this patch series fix the queue lockup I reported
 recently on the linux-block mailing list. Please consider these patches
 for inclusion in the upstream kernel.
>>>
>>> Some of this we need in 4.11, but not all of it. I can't be applying patches
>>> that "improve scalability" at this point.
>>>
>>> 4-6 looks like what we want for 4.11, I'll see if those apply directly. Then
>>> we can put 1-3 on top in 4.12, with the others pulled in first.
>>
>> Hello Jens,
>>
>> Please note that patch 2/6 is a bug fix. The current implementation of
>> blk_mq_sched_restart_queues() only considers hardware queues associated with
>> the same request queue as the hardware queue that has been passed as an
>> argument. If a tag set is shared across request queues - as is the case for
>> SCSI - then all request queues that share a tag set with the hctx argument
>> must be considered.
> 
> (replying to my own e-mail)
> 
> Hello Jens,
> 
> If you want I can split that patch into two patches - one that runs all 
> hardware
> queues with which the tag set is shared and one that switches from rerunning
> all hardware queues to one hardware queue.

I already put it in, but this is getting very awkward. We're at -rc5 time;
patches going into mainline should be TINY. And now I'm sitting on this,
which I have to justify:

 15 files changed, 281 insertions(+), 164 deletions(-)

and where one of the patches reads like it's a performance improvement, when
in reality it's fixing a hang. So yes, the patch should have been split in
two, and the series should have been ordered so that the first patches could
go into 4.11, and the rest on top of that in 4.12. Did we really need a
patch clarifying comments in that series? Probably not.

-- 
Jens Axboe



[PATCH v3] lightnvm: pblk

2017-04-07 Thread Javier González
This patch introduces pblk, a new target for LightNVM implementing a
full host-based FTL. Details are in the commit message.

Changes since v2:
* Rebase on top of Matias' for-4.12/core
* Implement L2P scan recovery to recover L2P table in case of power
  failure.
* Re-design disk format to be more flexible in future versions (from
  Matias Bjørling)
* Implement per-instance uuid to allow correct recovery without
  forcing line erases (from Matias Bjørling)
* Re-design GC threading to have several GC readers and a single
  writer that places data on the write buffer. This makes it possible to
  maximize the GC write buffer budget without having unnecessary GC writers
  competing for the write buffer lock.
* Simplify sysfs interface.
* Refactoring and several code improvements (together with Matias
  Bjørling)

Changes since v1:
* Rebase on top of Matias' for-4.12/core
* Move from per-LUN block allocation to a line model. This means that a
  whole line across all LUNs is allocated at a time. Data is still
  striped in a round-robin fashion at page granularity.
* Implement new disk format scheme, where metadata is stored per line
  instead of per LUN. This allows for space optimizations.
* Improvements on GC workqueue management and victim selection.
* Implement sysfs interface to query pblk's operation and statistics.
* Implement a user/GC I/O rate-limiter
* Various bug fixes

Javier González (1):
  lightnvm: physical block device (pblk) target

 Documentation/lightnvm/pblk.txt  |   21 +
 drivers/lightnvm/Kconfig |   19 +
 drivers/lightnvm/Makefile|5 +
 drivers/lightnvm/pblk-cache.c|  112 +++
 drivers/lightnvm/pblk-core.c | 1641 ++
 drivers/lightnvm/pblk-gc.c   |  542 +
 drivers/lightnvm/pblk-init.c |  942 ++
 drivers/lightnvm/pblk-map.c  |  135 
 drivers/lightnvm/pblk-rb.c   |  859 
 drivers/lightnvm/pblk-read.c |  513 
 drivers/lightnvm/pblk-recovery.c | 1007 +++
 drivers/lightnvm/pblk-rl.c   |  182 +
 drivers/lightnvm/pblk-sysfs.c|  500 
 drivers/lightnvm/pblk-write.c|  408 ++
 drivers/lightnvm/pblk.h  | 1127 ++
 include/linux/lightnvm.h |   57 +-
 pblk-sysfs.c |  581 ++
 17 files changed, 8638 insertions(+), 13 deletions(-)
 create mode 100644 Documentation/lightnvm/pblk.txt
 create mode 100644 drivers/lightnvm/pblk-cache.c
 create mode 100644 drivers/lightnvm/pblk-core.c
 create mode 100644 drivers/lightnvm/pblk-gc.c
 create mode 100644 drivers/lightnvm/pblk-init.c
 create mode 100644 drivers/lightnvm/pblk-map.c
 create mode 100644 drivers/lightnvm/pblk-rb.c
 create mode 100644 drivers/lightnvm/pblk-read.c
 create mode 100644 drivers/lightnvm/pblk-recovery.c
 create mode 100644 drivers/lightnvm/pblk-rl.c
 create mode 100644 drivers/lightnvm/pblk-sysfs.c
 create mode 100644 drivers/lightnvm/pblk-write.c
 create mode 100644 drivers/lightnvm/pblk.h
 create mode 100644 pblk-sysfs.c

-- 
2.7.4



Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically

2017-04-07 Thread Bart Van Assche
On Fri, 2017-04-07 at 11:33 -0700, Bart Van Assche wrote:
> On Fri, 2017-04-07 at 12:23 -0600, Jens Axboe wrote:
> > On 04/07/2017 12:16 PM, Bart Van Assche wrote:
> > > Hello Jens,
> > > 
> > > The six patches in this patch series fix the queue lockup I reported
> > > recently on the linux-block mailing list. Please consider these patches
> > > for inclusion in the upstream kernel.
> > 
> > Some of this we need in 4.11, but not all of it. I can't be applying patches
> > that "improve scalability" at this point.
> > 
> > 4-6 looks like what we want for 4.11, I'll see if those apply directly. Then
> > we can put 1-3 on top in 4.12, with the others pulled in first.
> 
> Hello Jens,
> 
> Please note that patch 2/6 is a bug fix. The current implementation of
> blk_mq_sched_restart_queues() only considers hardware queues associated with
> the same request queue as the hardware queue that has been passed as an
> argument. If a tag set is shared across request queues - as is the case for
> SCSI - then all request queues that share a tag set with the hctx argument
> must be considered.

(replying to my own e-mail)

Hello Jens,

If you want I can split that patch into two patches - one that runs all hardware
queues with which the tag set is shared and one that switches from rerunning
all hardware queues to one hardware queue.

Bart.

Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically

2017-04-07 Thread Bart Van Assche
On Fri, 2017-04-07 at 12:23 -0600, Jens Axboe wrote:
> On 04/07/2017 12:16 PM, Bart Van Assche wrote:
> > Hello Jens,
> > 
> > The six patches in this patch series fix the queue lockup I reported
> > recently on the linux-block mailing list. Please consider these patches
> > for inclusion in the upstream kernel.
> 
> Some of this we need in 4.11, but not all of it. I can't be applying patches
> that "improve scalability" at this point.
> 
> 4-6 looks like what we want for 4.11, I'll see if those apply directly. Then
> we can put 1-3 on top in 4.12, with the others pulled in first.

Hello Jens,

Please note that patch 2/6 is a bug fix. The current implementation of
blk_mq_sched_restart_queues() only considers hardware queues associated with
the same request queue as the hardware queue that has been passed as an
argument. If a tag set is shared across request queues - as is the case for
SCSI - then all request queues that share a tag set with the hctx argument
must be considered.
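
As an illustration of the topology (editor's sketch, not code from the series):
the hctx that is starved for a driver tag may belong to a different request
queue than the hctx that just freed a tag, so a restart has to look at every
queue on the tag set's tag_list. Using the bool-returning
blk_mq_sched_restart_hctx() from patch 2/6, the core of the idea is:

static void restart_shared_tag_set(struct blk_mq_hw_ctx *hctx)
{
	struct blk_mq_tag_set *set = hctx->queue->tag_set;
	struct request_queue *q;
	struct blk_mq_hw_ctx *h;
	unsigned int i;

	if (!(set->flags & BLK_MQ_F_TAG_SHARED)) {
		/* only hctx's own queue can be waiting for these tags */
		blk_mq_sched_restart_hctx(hctx);
		return;
	}

	/* any request queue sharing the tag set may hold the starved hctx */
	list_for_each_entry(q, &set->tag_list, tag_set_list)
		queue_for_each_hw_ctx(q, h, i)
			if (h->tags == hctx->tags &&
			    blk_mq_sched_restart_hctx(h))
				return;
}

The real patch additionally iterates in round-robin order and protects the
tag_list walk with RCU.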

Bart.

[PATCH 4/4] lightnvm: allow to init targets on factory mode

2017-04-07 Thread Javier González
Target initialization has two responsibilities: creating the target
partition and instantiating the target. This patch makes it possible to create
a factory partition (i.e., do not trigger recovery on the given target).
This is useful for target development and for being able to restore the
device state at any moment in time without requiring a full-device
erase.
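
For illustration, userspace could request factory mode along these lines
(editor's sketch: the device, target type and target name below are made up;
the control node and NVM_DEV_CREATE ioctl are the existing LightNVM uapi, and
error handling is reduced to the bare minimum):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/lightnvm.h>

static int create_factory_target(void)
{
	struct nvm_ioctl_create c;
	int fd, ret;

	memset(&c, 0, sizeof(c));
	strncpy(c.dev, "nvme0n1", DISK_NAME_LEN - 1);       /* example device */
	strncpy(c.tgttype, "rrpc", NVM_TTYPE_NAME_MAX - 1); /* example target type */
	strncpy(c.tgtname, "tgt0", DISK_NAME_LEN - 1);      /* example target name */
	c.flags = NVM_TARGET_FACTORY;     /* new flag: skip recovery on init */
	c.conf.type = NVM_CONFIG_TYPE_SIMPLE;
	c.conf.s.lun_begin = -1;          /* -1/-1: let the core use all LUNs */
	c.conf.s.lun_end = -1;

	fd = open("/dev/lightnvm/control", O_RDWR);
	if (fd < 0)
		return -1;
	ret = ioctl(fd, NVM_DEV_CREATE, &c);
	close(fd);
	return ret;
}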

Signed-off-by: Javier González 
---
 drivers/lightnvm/core.c   | 14 +++---
 drivers/lightnvm/rrpc.c   |  3 ++-
 include/linux/lightnvm.h  |  3 ++-
 include/uapi/linux/lightnvm.h |  4 
 4 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 28e69a7..e4530c2 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -279,7 +279,7 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct 
nvm_ioctl_create *create)
tdisk->fops = &nvm_fops;
tdisk->queue = tqueue;
 
-   targetdata = tt->init(tgt_dev, tdisk);
+   targetdata = tt->init(tgt_dev, tdisk, create->flags);
if (IS_ERR(targetdata))
goto err_init;
 
@@ -1243,8 +1243,16 @@ static long nvm_ioctl_dev_create(struct file *file, void 
__user *arg)
create.tgtname[DISK_NAME_LEN - 1] = '\0';
 
if (create.flags != 0) {
-   pr_err("nvm: no flags supported\n");
-   return -EINVAL;
+   __u32 flags = create.flags;
+
+   /* Check for valid flags */
+   if (flags & NVM_TARGET_FACTORY)
+   flags &= ~NVM_TARGET_FACTORY;
+
+   if (flags) {
+   pr_err("nvm: flag not supported\n");
+   return -EINVAL;
+   }
}
 
return __nvm_configure_create(&create);
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index 4e4c299..2c04ff3 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -1515,7 +1515,8 @@ static int rrpc_luns_configure(struct rrpc *rrpc)
 
 static struct nvm_tgt_type tt_rrpc;
 
-static void *rrpc_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk)
+static void *rrpc_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk,
+  int flags)
 {
struct request_queue *bqueue = dev->q;
struct request_queue *tqueue = tdisk->queue;
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index bebea80..e88f7ef 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -436,7 +436,8 @@ static inline int ppa_cmp_blk(struct ppa_addr ppa1, struct 
ppa_addr ppa2)
 
 typedef blk_qc_t (nvm_tgt_make_rq_fn)(struct request_queue *, struct bio *);
 typedef sector_t (nvm_tgt_capacity_fn)(void *);
-typedef void *(nvm_tgt_init_fn)(struct nvm_tgt_dev *, struct gendisk *);
+typedef void *(nvm_tgt_init_fn)(struct nvm_tgt_dev *, struct gendisk *,
+   int flags);
 typedef void (nvm_tgt_exit_fn)(void *);
 typedef int (nvm_tgt_sysfs_init_fn)(struct gendisk *);
 typedef void (nvm_tgt_sysfs_exit_fn)(struct gendisk *);
diff --git a/include/uapi/linux/lightnvm.h b/include/uapi/linux/lightnvm.h
index fd19f36..c8aec4b 100644
--- a/include/uapi/linux/lightnvm.h
+++ b/include/uapi/linux/lightnvm.h
@@ -85,6 +85,10 @@ struct nvm_ioctl_create_conf {
};
 };
 
+enum {
+   NVM_TARGET_FACTORY = 1 << 0,/* Init target in factory mode */
+};
+
 struct nvm_ioctl_create {
char dev[DISK_NAME_LEN];/* open-channel SSD device */
char tgttype[NVM_TTYPE_NAME_MAX];   /* target type name */
-- 
2.7.4



[PATCH 2/4] lightnvm: fix cleanup order of disk on init error

2017-04-07 Thread Javier González
Reorder disk allocation such that the disk structure can be put
safely.

Signed-off-by: Javier González 
---
 drivers/lightnvm/core.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index da4c082..28e69a7 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -263,15 +263,15 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct 
nvm_ioctl_create *create)
goto err_t;
}
 
+   tdisk = alloc_disk(0);
+   if (!tdisk)
+   goto err_dev;
+
tqueue = blk_alloc_queue_node(GFP_KERNEL, dev->q->node);
if (!tqueue)
-   goto err_dev;
+   goto err_disk;
blk_queue_make_request(tqueue, tt->make_rq);
 
-   tdisk = alloc_disk(0);
-   if (!tdisk)
-   goto err_queue;
-
sprintf(tdisk->disk_name, "%s", create->tgtname);
tdisk->flags = GENHD_FL_EXT_DEVT;
tdisk->major = 0;
@@ -307,9 +307,9 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct 
nvm_ioctl_create *create)
if (tt->exit)
tt->exit(targetdata);
 err_init:
-   put_disk(tdisk);
-err_queue:
blk_cleanup_queue(tqueue);
+err_disk:
+   put_disk(tdisk);
 err_dev:
nvm_remove_tgt_dev(tgt_dev, 0);
 err_t:
-- 
2.7.4



[PATCH 3/4] lightnvm: bad type conversion for nvme control bits

2017-04-07 Thread Javier González
The NVMe I/O command control field is 16 bits wide, but it is interpreted
as 32 bits in the lightnvm user I/O data path.
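
A short illustration of why the conversion matters (editor's note; ph_rw.control
is a __le16 field, and the example assumes a big-endian host):

/*
 * With vcmd.control == 0x0001 on a big-endian host:
 *
 *   cpu_to_le32(0x0001) == 0x01000000; assigning that to the 16-bit
 *   field keeps only the low half, so 0x0000 is stored and the control
 *   bits are silently dropped.
 *
 *   cpu_to_le16(0x0001) == 0x0100, stored as bytes 01 00, i.e. the
 *   correct little-endian encoding of 1.
 *
 * On little-endian hosts both conversions happen to store the same bytes,
 * which is why the bug only shows up on big-endian machines.
 */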

Signed-off-by: Javier González 
---
 drivers/nvme/host/lightnvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 2c8f933..83e7ea2 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -754,7 +754,7 @@ static int nvme_nvm_user_vcmd(struct nvme_ns *ns, int admin,
c.common.cdw2[1] = cpu_to_le32(vcmd.cdw3);
/* cdw11-12 */
c.ph_rw.length = cpu_to_le16(vcmd.nppas);
-   c.ph_rw.control  = cpu_to_le32(vcmd.control);
+   c.ph_rw.control  = cpu_to_le16(vcmd.control);
c.common.cdw10[3] = cpu_to_le32(vcmd.cdw13);
c.common.cdw10[4] = cpu_to_le32(vcmd.cdw14);
c.common.cdw10[5] = cpu_to_le32(vcmd.cdw15);
-- 
2.7.4



[PATCH 1/4] lightnvm: double-clear of dev->lun_map on target init error

2017-04-07 Thread Javier González
The dev->lun_map bits are cleared twice if a target init error occurs:
first in the target cleanup routine, and then again in the nvm_create_tgt()
error path. Make sure they are only cleared once by extending
nvm_remove_tgt_dev() with a 'clear' argument, such that clearing the bits
can be skipped when cleaning up a successfully initialized target.

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/core.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 2122922..da4c082 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -89,7 +89,7 @@ static void nvm_release_luns_err(struct nvm_dev *dev, int 
lun_begin,
WARN_ON(!test_and_clear_bit(i, dev->lun_map));
 }
 
-static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev)
+static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev, int clear)
 {
struct nvm_dev *dev = tgt_dev->parent;
struct nvm_dev_map *dev_map = tgt_dev->map;
@@ -100,11 +100,13 @@ static void nvm_remove_tgt_dev(struct nvm_tgt_dev 
*tgt_dev)
int *lun_offs = ch_map->lun_offs;
int ch = i + ch_map->ch_off;
 
-   for (j = 0; j < ch_map->nr_luns; j++) {
-   int lun = j + lun_offs[j];
-   int lunid = (ch * dev->geo.luns_per_chnl) + lun;
+   if (clear) {
+   for (j = 0; j < ch_map->nr_luns; j++) {
+   int lun = j + lun_offs[j];
+   int lunid = (ch * dev->geo.luns_per_chnl) + lun;
 
-   WARN_ON(!test_and_clear_bit(lunid, dev->lun_map));
+   WARN_ON(!test_and_clear_bit(lunid, dev->lun_map));
+   }
}
 
kfree(ch_map->lun_offs);
@@ -309,7 +311,7 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct 
nvm_ioctl_create *create)
 err_queue:
blk_cleanup_queue(tqueue);
 err_dev:
-   nvm_remove_tgt_dev(tgt_dev);
+   nvm_remove_tgt_dev(tgt_dev, 0);
 err_t:
kfree(t);
 err_reserve:
@@ -332,7 +334,7 @@ static void __nvm_remove_target(struct nvm_target *t)
if (tt->exit)
tt->exit(tdisk->private_data);
 
-   nvm_remove_tgt_dev(t->dev);
+   nvm_remove_tgt_dev(t->dev, 1);
put_disk(tdisk);
 
list_del(&t->list);
-- 
2.7.4



Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically

2017-04-07 Thread Jens Axboe
On 04/07/2017 12:16 PM, Bart Van Assche wrote:
> Hello Jens,
> 
> The six patches in this patch series fix the queue lockup I reported
> recently on the linux-block mailing list. Please consider these patches
> for inclusion in the upstream kernel.

Some of this we need in 4.11, but not all of it. I can't be applying patches
that "improve scalability" at this point.

4-6 looks like what we want for 4.11, I'll see if those apply directly. Then
we can put 1-3 on top in 4.12, with the others pulled in first.

-- 
Jens Axboe



[PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically

2017-04-07 Thread Bart Van Assche
Hello Jens,

The six patches in this patch series fix the queue lockup I reported
recently on the linux-block mailing list. Please consider these patches
for inclusion in the upstream kernel.

Thanks,

Bart.

Changes between v3 and v4:
- Addressed the review comments on version three of this series about the
  patch that makes it safe to use RCU to iterate over .tag_list and also
  about the runtime performance and use of short variable names in patch 2/5.
- Clarified the description of the patch that fixes the scsi-mq stall.
- Added a patch to fix a dm-mq queue stall.
  
Changes between v2 and v3:
- Removed the blk_mq_ops.restart_hctx function pointer again.
- Modified blk_mq_sched_restart_queues() such that only a single hardware
  queue is restarted instead of multiple if hardware queues are shared.
- Introduced a new function in the block layer, namely
  blk_mq_delay_run_hw_queue().  

Changes between v1 and v2:
- Reworked scsi_restart_queues() such that it no longer takes the SCSI
  host lock.
- Added two patches - one for exporting blk_mq_sched_restart_hctx() and
  another one to make iterating with RCU over blk_mq_tag_set.tag_list safe.

Bart Van Assche (6):
  blk-mq: Make it safe to use RCU to iterate over
blk_mq_tag_set.tag_list
  blk-mq: Restart a single queue if tag sets are shared
  blk-mq: Clarify comments in blk_mq_dispatch_rq_list()
  blk-mq: Introduce blk_mq_delay_run_hw_queue()
  scsi: Avoid that SCSI queues get stuck
  dm rq: Avoid that request processing stalls sporadically

 block/blk-mq-sched.c| 63 +++---
 block/blk-mq-sched.h| 16 +--
 block/blk-mq.c  | 73 +++--
 drivers/md/dm-rq.c  |  1 +
 drivers/scsi/scsi_lib.c |  6 ++--
 include/linux/blk-mq.h  |  2 ++
 include/linux/blkdev.h  |  1 -
 7 files changed, 118 insertions(+), 44 deletions(-)

-- 
2.12.0



[PATCH v4 3/6] blk-mq: Clarify comments in blk_mq_dispatch_rq_list()

2017-04-07 Thread Bart Van Assche
The blk_mq_dispatch_rq_list() implementation got modified several
times but the comments in that function were not updated every
time. Since it is nontrivial what is going on, update the comments
in blk_mq_dispatch_rq_list().

Signed-off-by: Bart Van Assche 
Cc: Omar Sandoval 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
---
 block/blk-mq.c | 28 ++--
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index dba34eb79a08..aff85d41cea3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1063,8 +1063,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, 
struct list_head *list)
 */
if (!list_empty(list)) {
/*
-* If we got a driver tag for the next request already,
-* free it again.
+* If an I/O scheduler has been configured and we got a driver
+* tag for the next request already, free it again.
 */
rq = list_first_entry(list, struct request, queuelist);
blk_mq_put_driver_tag(rq);
@@ -1074,16 +1074,24 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx 
*hctx, struct list_head *list)
spin_unlock(&hctx->lock);
 
/*
-* the queue is expected stopped with BLK_MQ_RQ_QUEUE_BUSY, but
-* it's possible the queue is stopped and restarted again
-* before this. Queue restart will dispatch requests. And since
-* requests in rq_list aren't added into hctx->dispatch yet,
-* the requests in rq_list might get lost.
+* If SCHED_RESTART was set by the caller of this function and
+* it is no longer set that means that it was cleared by another
+* thread and hence that a queue rerun is needed.
 *
-* blk_mq_run_hw_queue() already checks the STOPPED bit
+* If TAG_WAITING is set that means that an I/O scheduler has
+* been configured and another thread is waiting for a driver
+* tag. To guarantee fairness, do not rerun this hardware queue
+* but let the other thread grab the driver tag.
 *
-* If RESTART or TAG_WAITING is set, then let completion restart
-* the queue instead of potentially looping here.
+* If no I/O scheduler has been configured it is possible that
+* the hardware queue got stopped and restarted before requests
+* were pushed back onto the dispatch list. Rerun the queue to
+* avoid starvation. Notes:
+* - blk_mq_run_hw_queue() checks whether or not a queue has
+*   been stopped before rerunning a queue.
+* - Some but not all block drivers stop a queue before
+*   returning BLK_MQ_RQ_QUEUE_BUSY. Two exceptions are scsi-mq
+*   and dm-rq.
 */
if (!blk_mq_sched_needs_restart(hctx) &&
!test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state))
-- 
2.12.0



[PATCH v4 4/6] blk-mq: Introduce blk_mq_delay_run_hw_queue()

2017-04-07 Thread Bart Van Assche
Introduce a function that runs a hardware queue unconditionally
after a delay. Note: there is already a function that stops and
restarts a hardware queue after a delay, namely blk_mq_delay_queue().

This function will be used in the next patch in this series.
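
As a usage sketch (editor's illustration, not part of this series; the
example_* names are hypothetical), a driver's .queue_rq() that runs out of
resources can ask for a delayed rerun without stopping the queue:

static int example_queue_rq(struct blk_mq_hw_ctx *hctx,
			    const struct blk_mq_queue_data *bd)
{
	struct example_dev *dev = hctx->queue->queuedata;  /* hypothetical */

	if (!example_can_queue(dev)) {
		/* rerun this hardware queue in 3 ms; the queue stays started */
		blk_mq_delay_run_hw_queue(hctx, 3);
		return BLK_MQ_RQ_QUEUE_BUSY;
	}

	example_submit(dev, bd->rq);
	return BLK_MQ_RQ_QUEUE_OK;
}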

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Long Li 
Cc: K. Y. Srinivasan 
---
 block/blk-mq.c | 32 ++--
 include/linux/blk-mq.h |  2 ++
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index aff85d41cea3..836e3a17da54 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1146,7 +1146,8 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx 
*hctx)
return hctx->next_cpu;
 }
 
-void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
+static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
+   unsigned long msecs)
 {
if (unlikely(blk_mq_hctx_stopped(hctx) ||
 !blk_mq_hw_queue_mapped(hctx)))
@@ -1163,7 +1164,24 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, 
bool async)
put_cpu();
}
 
-   kblockd_schedule_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work);
+   if (msecs == 0)
+   kblockd_schedule_work_on(blk_mq_hctx_next_cpu(hctx),
+&hctx->run_work);
+   else
+   kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
+&hctx->delayed_run_work,
+msecs_to_jiffies(msecs));
+}
+
+void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
+{
+   __blk_mq_delay_run_hw_queue(hctx, true, msecs);
+}
+EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
+
+void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
+{
+   __blk_mq_delay_run_hw_queue(hctx, async, 0);
 }
 
 void blk_mq_run_hw_queues(struct request_queue *q, bool async)
@@ -1266,6 +1284,15 @@ static void blk_mq_run_work_fn(struct work_struct *work)
__blk_mq_run_hw_queue(hctx);
 }
 
+static void blk_mq_delayed_run_work_fn(struct work_struct *work)
+{
+   struct blk_mq_hw_ctx *hctx;
+
+   hctx = container_of(work, struct blk_mq_hw_ctx, delayed_run_work.work);
+
+   __blk_mq_run_hw_queue(hctx);
+}
+
 static void blk_mq_delay_work_fn(struct work_struct *work)
 {
struct blk_mq_hw_ctx *hctx;
@@ -1866,6 +1893,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
node = hctx->numa_node = set->numa_node;
 
INIT_WORK(&hctx->run_work, blk_mq_run_work_fn);
+   INIT_DELAYED_WORK(&hctx->delayed_run_work, blk_mq_delayed_run_work_fn);
INIT_DELAYED_WORK(&hctx->delay_work, blk_mq_delay_work_fn);
spin_lock_init(&hctx->lock);
INIT_LIST_HEAD(&hctx->dispatch);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index bdea90d75274..b90c3d5766cd 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -51,6 +51,7 @@ struct blk_mq_hw_ctx {
 
atomic_t nr_active;
 
+   struct delayed_work delayed_run_work;
struct delayed_work delay_work;
 
struct hlist_node   cpuhp_dead;
@@ -236,6 +237,7 @@ void blk_mq_stop_hw_queues(struct request_queue *q);
 void blk_mq_start_hw_queues(struct request_queue *q);
 void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
+void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_run_hw_queues(struct request_queue *q, bool async);
 void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
-- 
2.12.0



[PATCH v4 2/6] blk-mq: Restart a single queue if tag sets are shared

2017-04-07 Thread Bart Van Assche
To improve scalability, if hardware queues are shared, restart
a single hardware queue in round-robin fashion. Rename
blk_mq_sched_restart_queues() to reflect the new semantics.
Remove blk_mq_sched_mark_restart_queue() because this function
has no callers. Remove flag QUEUE_FLAG_RESTART because this
patch removes the code that uses this flag.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
---
 block/blk-mq-sched.c   | 63 ++
 block/blk-mq-sched.h   | 16 +
 block/blk-mq.c |  2 +-
 include/linux/blkdev.h |  1 -
 4 files changed, 55 insertions(+), 27 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 09af8ff18719..a5c683a6429c 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -317,25 +317,68 @@ static bool blk_mq_sched_bypass_insert(struct 
blk_mq_hw_ctx *hctx,
return true;
 }
 
-static void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
 {
if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
-   if (blk_mq_hctx_has_pending(hctx))
+   if (blk_mq_hctx_has_pending(hctx)) {
blk_mq_run_hw_queue(hctx, true);
+   return true;
+   }
}
+   return false;
 }
 
-void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx)
-{
-   struct request_queue *q = hctx->queue;
-   unsigned int i;
+/**
+ * list_for_each_entry_rcu_rr - iterate in a round-robin fashion over rcu list
+ * @pos:loop cursor.
+ * @skip:   the list element that will not be examined. Iteration starts at
+ *  @skip->next.
+ * @head:   head of the list to examine. This list must have at least one
+ *  element, namely @skip.
+ * @member: name of the list_head structure within typeof(*pos).
+ */
+#define list_for_each_entry_rcu_rr(pos, skip, head, member)\
+   for ((pos) = (skip);\
+(pos = (pos)->member.next != (head) ? list_entry_rcu(  \
+   (pos)->member.next, typeof(*pos), member) : \
+ list_entry_rcu((pos)->member.next->next, typeof(*pos), member)), \
+(pos) != (skip); )
 
-   if (test_bit(QUEUE_FLAG_RESTART, &q->queue_flags)) {
-   if (test_and_clear_bit(QUEUE_FLAG_RESTART, &q->queue_flags)) {
-   queue_for_each_hw_ctx(q, hctx, i)
-   blk_mq_sched_restart_hctx(hctx);
+/*
+ * Called after a driver tag has been freed to check whether a hctx needs to
+ * be restarted. Restarts @hctx if its tag set is not shared. Restarts hardware
+ * queues in a round-robin fashion if the tag set of @hctx is shared with other
+ * hardware queues.
+ */
+void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
+{
+   struct blk_mq_tags *const tags = hctx->tags;
+   struct blk_mq_tag_set *const set = hctx->queue->tag_set;
+   struct request_queue *const queue = hctx->queue, *q;
+   struct blk_mq_hw_ctx *hctx2;
+   unsigned int i, j;
+
+   if (set->flags & BLK_MQ_F_TAG_SHARED) {
+   rcu_read_lock();
+   list_for_each_entry_rcu_rr(q, queue, &set->tag_list,
+  tag_set_list) {
+   queue_for_each_hw_ctx(q, hctx2, i)
+   if (hctx2->tags == tags &&
+   blk_mq_sched_restart_hctx(hctx2))
+   goto done;
+   }
+   j = hctx->queue_num + 1;
+   for (i = 0; i < queue->nr_hw_queues; i++, j++) {
+   if (j == queue->nr_hw_queues)
+   j = 0;
+   hctx2 = queue->queue_hw_ctx[j];
+   if (hctx2->tags == tags &&
+   blk_mq_sched_restart_hctx(hctx2))
+   break;
}
+done:
+   rcu_read_unlock();
} else {
blk_mq_sched_restart_hctx(hctx);
}
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index a75b16b123f7..4e3fc2a40207 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -19,7 +19,7 @@ bool blk_mq_sched_try_merge(struct request_queue *q, struct 
bio *bio,
struct request **merged_request);
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
 bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request 
*rq);
-void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx);
+void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_sched_insert_request(struct request *rq, bool at_head,
 bool run_queue, bool async, bool 

[PATCH v4 1/6] blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list

2017-04-07 Thread Bart Van Assche
Since the next patch in this series will use RCU to iterate over
tag_list, make this safe. Add lockdep_assert_held() statements
in functions that iterate over tag_list to make clear that using
list_for_each_entry() instead of list_for_each_entry_rcu() is
fine in these functions.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
---
 block/blk-mq.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f7cd3208bcdf..c26464f9649a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2076,6 +2076,8 @@ static void blk_mq_update_tag_set_depth(struct 
blk_mq_tag_set *set, bool shared)
 {
struct request_queue *q;
 
+   lockdep_assert_held(&set->tag_list_lock);
+
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_freeze_queue(q);
queue_set_hctx_shared(q, shared);
@@ -2088,7 +2090,8 @@ static void blk_mq_del_queue_tag_set(struct request_queue 
*q)
struct blk_mq_tag_set *set = q->tag_set;
 
mutex_lock(&set->tag_list_lock);
-   list_del_init(&q->tag_set_list);
+   list_del_rcu(&q->tag_set_list);
+   INIT_LIST_HEAD(&q->tag_set_list);
if (list_is_singular(&set->tag_list)) {
/* just transitioned to unshared */
set->flags &= ~BLK_MQ_F_TAG_SHARED;
@@ -2096,6 +2099,8 @@ static void blk_mq_del_queue_tag_set(struct request_queue 
*q)
blk_mq_update_tag_set_depth(set, false);
}
mutex_unlock(&set->tag_list_lock);
+
+   synchronize_rcu();
 }
 
 static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
@@ -2113,7 +2118,7 @@ static void blk_mq_add_queue_tag_set(struct 
blk_mq_tag_set *set,
}
if (set->flags & BLK_MQ_F_TAG_SHARED)
queue_set_hctx_shared(q, true);
-   list_add_tail(&q->tag_set_list, &set->tag_list);
+   list_add_tail_rcu(&q->tag_set_list, &set->tag_list);
 
mutex_unlock(&set->tag_list_lock);
 }
@@ -2601,6 +2606,8 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set 
*set, int nr_hw_queues)
 {
struct request_queue *q;
 
+   lockdep_assert_held(&set->tag_list_lock);
+
if (nr_hw_queues > nr_cpu_ids)
nr_hw_queues = nr_cpu_ids;
if (nr_hw_queues < 1 || nr_hw_queues == set->nr_hw_queues)
-- 
2.12.0



[PATCH v4 5/6] scsi: Avoid that SCSI queues get stuck

2017-04-07 Thread Bart Van Assche
If a .queue_rq() function returns BLK_MQ_RQ_QUEUE_BUSY then the block
driver that implements that function is responsible for rerunning the
hardware queue once requests can be queued again successfully.

commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped
queues") removed the blk_mq_stop_hw_queue() call from scsi_queue_rq()
for the BLK_MQ_RQ_QUEUE_BUSY case. Hence change all calls to functions
that are intended to rerun a busy queue such that these examine all
hardware queues instead of only stopped queues.

Since no other functions than scsi_internal_device_block() and
scsi_internal_device_unblock() should ever stop or restart a SCSI
queue, change the blk_mq_delay_queue() call into a
blk_mq_delay_run_hw_queue() call.

Fixes: commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped 
queues")
Fixes: commit 7e79dadce222 ("blk-mq: stop hardware queue in 
blk_mq_delay_queue()")
Signed-off-by: Bart Van Assche 
Cc: Martin K. Petersen 
Cc: James Bottomley 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Sagi Grimberg 
Cc: Long Li 
Cc: K. Y. Srinivasan 
---
 drivers/scsi/scsi_lib.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 11972d1075f1..7bc4513bf4e4 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -496,7 +496,7 @@ static void scsi_run_queue(struct request_queue *q)
scsi_starved_list_run(sdev->host);
 
if (q->mq_ops)
-   blk_mq_start_stopped_hw_queues(q, false);
+   blk_mq_run_hw_queues(q, false);
else
blk_run_queue(q);
 }
@@ -667,7 +667,7 @@ static bool scsi_end_request(struct request *req, int error,
!list_empty(&sdev->host->starved_list))
kblockd_schedule_work(&sdev->requeue_work);
else
-   blk_mq_start_stopped_hw_queues(q, true);
+   blk_mq_run_hw_queues(q, true);
} else {
unsigned long flags;
 
@@ -1974,7 +1974,7 @@ static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
case BLK_MQ_RQ_QUEUE_BUSY:
if (atomic_read(&sdev->device_busy) == 0 &&
!scsi_device_blocked(sdev))
-   blk_mq_delay_queue(hctx, SCSI_QUEUE_DELAY);
+   blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
break;
case BLK_MQ_RQ_QUEUE_ERROR:
/*
-- 
2.12.0



[PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically

2017-04-07 Thread Bart Van Assche
While running the srp-test software I noticed that request
processing stalls sporadically at the beginning of a test, namely
when mkfs is run against a dm-mpath device. Every time when that
happened the following command was sufficient to resume request
processing:

echo run >/sys/kernel/debug/block/dm-0/state

This patch avoids that such request processing stalls occur. The
test I ran is as follows:

while srp-test/run_tests -d -r 30 -t 02-mq; do :; done

Signed-off-by: Bart Van Assche 
Cc: Mike Snitzer 
Cc: dm-de...@redhat.com
---
 drivers/md/dm-rq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 6886bf160fb2..d19af1d21f4c 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -755,6 +755,7 @@ static int dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
/* Undo dm_start_request() before requeuing */
rq_end_stats(md, rq);
rq_completed(md, rq_data_dir(rq), false);
+   blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
return BLK_MQ_RQ_QUEUE_BUSY;
}
 
-- 
2.12.0



Re: [PATCH] block: sed-opal: Tone down all the pr_* to debugs

2017-04-07 Thread Jens Axboe
On 04/07/2017 10:06 AM, Scott Bauer wrote:
> Let's not flood the kernel log with messages unless
> the user requests them.

Can you send this against my for-4.12/block or for-next branch? One
hunk is rejected right now; it looks like this is against master.

-- 
Jens Axboe



[PATCH] block: sed-opal: Tone down all the pr_* to debugs

2017-04-07 Thread Scott Bauer
Let's not flood the kernel log with messages unless
the user requests them.

Signed-off-by: Scott Bauer 
---
 block/sed-opal.c | 153 +++
 1 file changed, 74 insertions(+), 79 deletions(-)

diff --git a/block/sed-opal.c b/block/sed-opal.c
index 14035f8..f10227e 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -275,8 +275,8 @@ static bool check_tper(const void *data)
u8 flags = tper->supported_features;
 
if (!(flags & TPER_SYNC_SUPPORTED)) {
-   pr_err("TPer sync not supported. flags = %d\n",
-  tper->supported_features);
+   pr_debug("TPer sync not supported. flags = %d\n",
+tper->supported_features);
return false;
}
 
@@ -289,7 +289,7 @@ static bool check_sum(const void *data)
u32 nlo = be32_to_cpu(sum->num_locking_objects);
 
if (nlo == 0) {
-   pr_err("Need at least one locking object.\n");
+   pr_debug("Need at least one locking object.\n");
return false;
}
 
@@ -385,9 +385,9 @@ static int next(struct opal_dev *dev)
 
error = step->fn(dev, step->data);
if (error) {
-   pr_err("Error on step function: %d with error %d: %s\n",
-  state, error,
-  opal_error_to_human(error));
+   pr_debug("Error on step function: %d with error %d: %s\n",
+state, error,
+opal_error_to_human(error));
 
/* For each OPAL command we do a discovery0 then we
 * start some sort of session.
@@ -419,8 +419,8 @@ static int opal_discovery0_end(struct opal_dev *dev)
print_buffer(dev->resp, hlen);
 
if (hlen > IO_BUFFER_LENGTH - sizeof(*hdr)) {
-   pr_warn("Discovery length overflows buffer (%zu+%u)/%u\n",
-   sizeof(*hdr), hlen, IO_BUFFER_LENGTH);
+   pr_debug("Discovery length overflows buffer (%zu+%u)/%u\n",
+sizeof(*hdr), hlen, IO_BUFFER_LENGTH);
return -EFAULT;
}
 
@@ -503,7 +503,7 @@ static void add_token_u8(int *err, struct opal_dev *cmd, u8 
tok)
if (*err)
return;
if (cmd->pos >= IO_BUFFER_LENGTH - 1) {
-   pr_err("Error adding u8: end of buffer.\n");
+   pr_debug("Error adding u8: end of buffer.\n");
*err = -ERANGE;
return;
}
@@ -553,7 +553,7 @@ static void add_token_u64(int *err, struct opal_dev *cmd, 
u64 number)
len = DIV_ROUND_UP(msb, 4);
 
if (cmd->pos >= IO_BUFFER_LENGTH - len - 1) {
-   pr_err("Error adding u64: end of buffer.\n");
+   pr_debug("Error adding u64: end of buffer.\n");
*err = -ERANGE;
return;
}
@@ -579,7 +579,7 @@ static void add_token_bytestring(int *err, struct opal_dev 
*cmd,
}
 
if (len >= IO_BUFFER_LENGTH - cmd->pos - header_len) {
-   pr_err("Error adding bytestring: end of buffer.\n");
+   pr_debug("Error adding bytestring: end of buffer.\n");
*err = -ERANGE;
return;
}
@@ -597,7 +597,7 @@ static void add_token_bytestring(int *err, struct opal_dev 
*cmd,
 static int build_locking_range(u8 *buffer, size_t length, u8 lr)
 {
if (length > OPAL_UID_LENGTH) {
-   pr_err("Can't build locking range. Length OOB\n");
+   pr_debug("Can't build locking range. Length OOB\n");
return -ERANGE;
}
 
@@ -614,7 +614,7 @@ static int build_locking_range(u8 *buffer, size_t length, 
u8 lr)
 static int build_locking_user(u8 *buffer, size_t length, u8 lr)
 {
if (length > OPAL_UID_LENGTH) {
-   pr_err("Can't build locking range user, Length OOB\n");
+   pr_debug("Can't build locking range user, Length OOB\n");
return -ERANGE;
}
 
@@ -648,7 +648,7 @@ static int cmd_finalize(struct opal_dev *cmd, u32 hsn, u32 
tsn)
add_token_u8(&err, cmd, OPAL_ENDLIST);
 
if (err) {
-   pr_err("Error finalizing command.\n");
+   pr_debug("Error finalizing command.\n");
return -EFAULT;
}
 
@@ -660,7 +660,7 @@ static int cmd_finalize(struct opal_dev *cmd, u32 hsn, u32 
tsn)
hdr->subpkt.length = cpu_to_be32(cmd->pos - sizeof(*hdr));
while (cmd->pos % 4) {
if (cmd->pos >= IO_BUFFER_LENGTH) {
-   pr_err("Error: Buffer overrun\n");
+   pr_debug("Error: Buffer overrun\n");
return -ERANGE;
}
cmd->cmd[cmd->pos++] = 0;
@@ -679,14 +679,14 @@ static const struct opal_resp_tok *response_get_token(
const struct 

Re: [PATCH v3 0/5] Avoid that scsi-mq queue processing stalls

2017-04-07 Thread Bart Van Assche
On Fri, 2017-04-07 at 23:34 +0800, Ming Lei wrote:
> On Fri, Apr 07, 2017 at 03:18:19PM +, Bart Van Assche wrote:
> > On Fri, 2017-04-07 at 17:41 +0800, Ming Lei wrote:
> > > On Thu, Apr 06, 2017 at 11:10:45AM -0700, Bart Van Assche wrote:
> > > > Hello Jens,
> > > > 
> > > > The five patches in this patch series fix the queue lockup I reported
> > > > recently on the linux-block mailing list. Please consider these patches
> > > > for inclusion in the upstream kernel.
> > > 
> > I read the commit logs of the 5 patches but did not find a description
> > of the root cause of the queue lockup, so could you explain a bit about
> > the reasoning behind them?
> > 
> > Hello Ming,
> > 
> > If a .queue_rq() function returns BLK_MQ_RQ_QUEUE_BUSY then the block
> > driver that implements that function is responsible for rerunning the
> > hardware queue once requests can be queued successfully again. That is
> > not the case today for the SCSI core. Patch 5/5 ensures that hardware
> 
> The current .queue_rq() will call blk_mq_delay_queue() if QUEUE_BUSY is
> returned, and once a request is completed, the queue will be restarted
> by blk_mq_start_stopped_hw_queues() in scsi_end_request(). That sounds
> OK in theory. I am just trying to understand the specific reason that
> causes the lockup, but I still do not get it.

Hello Ming,

blk_mq_delay_queue() stops and restarts a hardware queue after a delay has
expired. If the SCSI core calls blk_mq_start_stopped_hw_queues() after that
delay has expired no queues will be restarted. This is why patch 5/5 changes
two blk_mq_start_stopped_hw_queues() calls into two blk_mq_run_hw_queues()
calls.
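
Roughly, the window looks like this (illustrative sketch of the above, not
text from the thread):

/*
 *   scsi_queue_rq()                        completion / delayed work
 *   ---------------                        -------------------------
 *   returns BLK_MQ_RQ_QUEUE_BUSY
 *   blk_mq_delay_queue(hctx, delay)
 *     stops hctx, arms delayed work
 *                                          delay expires: delayed work
 *                                          restarts and runs hctx
 *   (device still busy, requests are
 *    pushed back to the dispatch list)
 *                                          a request completes:
 *                                          blk_mq_start_stopped_hw_queues()
 *                                          finds no stopped queue, so the
 *                                          pushed-back requests never run
 *
 * Running the queues unconditionally (blk_mq_run_hw_queues(), patch 5/5)
 * closes this window.
 */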

Bart.

Re: [PATCH 0/2] Trace completion of all bios

2017-04-07 Thread Jens Axboe
On 04/06/2017 07:10 PM, NeilBrown wrote:
> Hi Jens,
>  I think all objections to this patch have been answered so I'm
>  resending.
>  I've added a small cleanup patch first which makes some small
>  enhancements to the documentation and #defines for the bio->flags
>  field.

Added for 4.12. I hand applied 2/2, since it did not apply directly to
the 4.12 branch.

-- 
Jens Axboe



Re: [PATCH v3 0/5] Avoid that scsi-mq queue processing stalls

2017-04-07 Thread Ming Lei
On Fri, Apr 07, 2017 at 03:18:19PM +, Bart Van Assche wrote:
> On Fri, 2017-04-07 at 17:41 +0800, Ming Lei wrote:
> > On Thu, Apr 06, 2017 at 11:10:45AM -0700, Bart Van Assche wrote:
> > > Hello Jens,
> > > 
> > > The five patches in this patch series fix the queue lockup I reported
> > > recently on the linux-block mailing list. Please consider these patches
> > > for inclusion in the upstream kernel.
> > 
> > I read the commit logs of the 5 patches but did not find a description
> > of the root cause of the queue lockup, so could you explain a bit about
> > the reasoning behind them?
> 
> Hello Ming,
> 
> If a .queue_rq() function returns BLK_MQ_RQ_QUEUE_BUSY then the block
> driver that implements that function is responsible for rerunning the
> hardware queue once requests can be queued successfully again. That is
> not the case today for the SCSI core. Patch 5/5 ensures that hardware

The current .queue_rq() will call blk_mq_delay_queue() if QUEUE_BUSY is
returned, and once a request is completed, the queue will be restarted
by blk_mq_start_stopped_hw_queues() in scsi_end_request(). That sounds
OK in theory. I am just trying to understand the specific reason that
causes the lockup, but I still do not get it.

Thanks,
Ming


Re: [PATCH v3 0/5] Avoid that scsi-mq queue processing stalls

2017-04-07 Thread Bart Van Assche
On Fri, 2017-04-07 at 17:41 +0800, Ming Lei wrote:
> On Thu, Apr 06, 2017 at 11:10:45AM -0700, Bart Van Assche wrote:
> > Hello Jens,
> > 
> > The five patches in this patch series fix the queue lockup I reported
> > recently on the linux-block mailing list. Please consider these patches
> > for inclusion in the upstream kernel.
> 
> I read the commit logs of the 5 patches but did not find a description
> of the root cause of the queue lockup, so could you explain a bit about
> the reasoning behind them?

Hello Ming,

If a .queue_rq() function returns BLK_MQ_RQ_QUEUE_BUSY then the block
driver that implements that function is responsible for rerunning the
hardware queue once requests can be queued successfully again. That is
not the case today for the SCSI core. Patch 5/5 ensures that hardware
queues for which scsi_queue_rq() has returned "busy" are rerun after
the "busy" condition has been cleared. This is why patch 5/5 fixes a
queue lockup.

Bart.

Re: [PATCH v3 1/5] blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list

2017-04-07 Thread Bart Van Assche
On Fri, 2017-04-07 at 17:46 +0800, Ming Lei wrote:
> On Thu, Apr 06, 2017 at 11:10:46AM -0700, Bart Van Assche wrote:
> > Since the next patch in this series will use RCU to iterate over
> > tag_list, make this safe. Add lockdep_assert_held() statements
> > in functions that iterate over tag_list to make clear that using
> > list_for_each_entry() instead of list_for_each_entry_rcu() is
> > fine in these functions.
> > 
> > Signed-off-by: Bart Van Assche 
> > Cc: Christoph Hellwig 
> > Cc: Hannes Reinecke 
> > ---
> >  block/blk-mq.c | 6 ++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index f7cd3208bcdf..b5580b09b4a5 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2076,6 +2076,8 @@ static void blk_mq_update_tag_set_depth(struct 
> > blk_mq_tag_set *set, bool shared)
> >  {
> > struct request_queue *q;
> >  
> > +   lockdep_assert_held(&set->tag_list_lock);
> > +
> > list_for_each_entry(q, &set->tag_list, tag_set_list) {
> > blk_mq_freeze_queue(q);
> > queue_set_hctx_shared(q, shared);
> > @@ -2096,6 +2098,8 @@ static void blk_mq_del_queue_tag_set(struct 
> > request_queue *q)
> > blk_mq_update_tag_set_depth(set, false);
> > }
> > mutex_unlock(&set->tag_list_lock);
> > +
> > +   synchronize_rcu();
> 
> It looks like synchronize_rcu() is only needed in the deletion path, so it
> can be moved to blk_mq_del_queue_tag_set().
> 
> Also, list_del_init()/list_add_tail() need to be replaced with their
> RCU-safe versions in the functions operating on &set->tag_list.

Hello Ming,

I will replace list_del_init() / list_add_tail() by their RCU equivalents.

Regarding synchronize_rcu(): have you noticed that that call has been added to
blk_mq_del_queue_tag_set(), the function you requested to move that call to?

Bart.

Re: [PATCH v3 4/5] blk-mq: Introduce blk_mq_delay_run_hw_queue()

2017-04-07 Thread Bart Van Assche
On Fri, 2017-04-07 at 18:23 +0800, Ming Lei wrote:
> On Thu, Apr 06, 2017 at 11:10:49AM -0700, Bart Van Assche wrote:
> > Introduce a function that runs a hardware queue unconditionally
> 
> I appreciate if some background can be provided for this function,
> like fixing some issues?

Hello Ming,

Have you noticed that patch 5/5 in this series uses this function?

Bart.

Re: [PATCH v3 2/5] blk-mq: Restart a single queue if tag sets are shared

2017-04-07 Thread Bart Van Assche
On Thu, 2017-04-06 at 13:21 -0600, Jens Axboe wrote:
> On 04/06/2017 01:12 PM, Jens Axboe wrote:
> > On 04/06/2017 12:10 PM, Bart Van Assche wrote:
> > > + for (i = 0; i < queue->nr_hw_queues; i++) {
> > > + j = (i + hctx->queue_num + 1) % queue->nr_hw_queues;
> > > + h = queue->queue_hw_ctx[j];
> > > + if (h->tags == tags && blk_mq_sched_restart_hctx(h))
> > > + break;
> > 
> > I'm pretty sure that doing:
> > 
> > j = i + hctx->queue_num + 1;;
> 
> And 'i' too many there of course:
> 
>   j = hctx->queue_num + 1;;

Hello Jens,

Thanks for the feedback. I will implement this change and retest this patch
series.

Bart.

Re: [PATCH 7/8] nowait aio: xfs

2017-04-07 Thread Darrick J. Wong
On Fri, Apr 07, 2017 at 06:34:28AM -0500, Goldwyn Rodrigues wrote:
> 
> 
> On 04/06/2017 05:54 PM, Darrick J. Wong wrote:
> > On Mon, Apr 03, 2017 at 11:52:11PM -0700, Christoph Hellwig wrote:
> >>> + if (unaligned_io) {
> >>> + /* If we are going to wait for other DIO to finish, bail */
> >>> + if ((iocb->ki_flags & IOCB_NOWAIT) &&
> >>> +  atomic_read(&inode->i_dio_count))
> >>> + return -EAGAIN;
> >>>   inode_dio_wait(inode);
> >>
> >> This checks i_dio_count twice in the nowait case, I think it should be:
> >>
> >>if (iocb->ki_flags & IOCB_NOWAIT) {
> >>if (atomic_read(&inode->i_dio_count))
> >>return -EAGAIN;
> >>} else {
> >>inode_dio_wait(inode);
> >>}
> >>
> >>>   if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
> >>>   if (flags & IOMAP_DIRECT) {
> >>> + /* A reflinked inode will result in CoW alloc */
> >>> + if (flags & IOMAP_NOWAIT) {
> >>> + error = -EAGAIN;
> >>> + goto out_unlock;
> >>> + }
> >>
> >> This is a bit pessimistic - just because the inode has any shared
> >> extents we could still write into unshared ones.  For now I think this
> >> pessimistic check is fine, but the comment should be corrected.
> > 
> > Consider what happens in both _reflink_{allocate,reserve}_cow.  If there
> > is already an existing reservation in the CoW fork then we'll have to
> > CoW and therefore can't satisfy the NOWAIT flag.  If there isn't already
> > anything in the CoW fork, then we have to see if there are shared blocks
> > by calling _reflink_trim_around_shared.  That performs a refcountbt
> > lookup, which involves locking the AGF, so we also can't satisfy NOWAIT.
> > 
> > IOWs, I think this hunk has to move outside the IOMAP_DIRECT check to
> > cover both write-to-reflinked-file cases.
> > 
> 
> IOMAP_NOWAIT is set only with IOMAP_DIRECT since the nowait feature is
> for direct-IO only. This is checked early on, when we are checking for

Ah, ok.  Disregard what I said about moving it then.

--D

> user-passed flags, and if not, -EINVAL is returned.
> 
> 
> -- 
> Goldwyn


Re: [PATCHv7 2/2] loop: support 4k physical blocksize

2017-04-07 Thread Ming Lei
On Fri, Apr 07, 2017 at 12:50:57PM +0200, Hannes Reinecke wrote:
> When generating bootable VM images certain systems (most notably
> s390x) require devices with 4k blocksize. This patch implements
> a new flag 'LO_FLAGS_BLOCKSIZE' which will set the physical
> blocksize to that of the underlying device, and allows the logical
> blocksize to be changed, up to the physical blocksize.
> 
> Signed-off-by: Hannes Reinecke 
> ---
>  drivers/block/loop.c  | 40 +++-
>  drivers/block/loop.h  |  1 +
>  include/uapi/linux/loop.h |  3 +++
>  3 files changed, 39 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 81b3900..0909e06 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -221,7 +221,8 @@ static void __loop_update_dio(struct loop_device *lo, 
> bool dio)
>  }
>  
>  static int
> -figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit)
> +figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit,
> +  loff_t logical_blocksize)
>  {
>   loff_t size = get_size(offset, sizelimit, lo->lo_backing_file);
>   sector_t x = (sector_t)size;
> @@ -233,6 +234,12 @@ static void __loop_update_dio(struct loop_device *lo, 
> bool dio)
>   lo->lo_offset = offset;
>   if (lo->lo_sizelimit != sizelimit)
>   lo->lo_sizelimit = sizelimit;
> + if (lo->lo_flags & LO_FLAGS_BLOCKSIZE) {
> + lo->lo_logical_blocksize = logical_blocksize;
> + blk_queue_physical_block_size(lo->lo_queue, lo->lo_blocksize);
> + blk_queue_logical_block_size(lo->lo_queue,
> +  lo->lo_logical_blocksize);
> + }
>   set_capacity(lo->lo_disk, x);
>   bd_set_size(bdev, (loff_t)get_capacity(bdev->bd_disk) << 9);
>   /* let user-space know about the new size */
> @@ -814,6 +821,7 @@ static void loop_config_discard(struct loop_device *lo)
>   struct file *file = lo->lo_backing_file;
>   struct inode *inode = file->f_mapping->host;
>   struct request_queue *q = lo->lo_queue;
> + int lo_bits = 9;
>  
>   /*
>* We use punch hole to reclaim the free space used by the
> @@ -833,7 +841,10 @@ static void loop_config_discard(struct loop_device *lo)
>  
>   q->limits.discard_granularity = inode->i_sb->s_blocksize;
>   q->limits.discard_alignment = 0;
> - blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> + if (lo->lo_flags & LO_FLAGS_BLOCKSIZE)
> + lo_bits = blksize_bits(lo->lo_logical_blocksize);
> +
> + blk_queue_max_discard_sectors(q, UINT_MAX >> lo_bits);
>   q->limits.discard_zeroes_data = 1;
>   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
>  }
> @@ -1087,6 +1098,7 @@ static int loop_clr_fd(struct loop_device *lo)
>   int err;
>   struct loop_func_table *xfer;
>   kuid_t uid = current_uid();
> + int lo_flags = lo->lo_flags;
>  
>   if (lo->lo_encrypt_key_size &&
>   !uid_eq(lo->lo_key_owner, uid) &&
> @@ -1119,9 +1131,26 @@ static int loop_clr_fd(struct loop_device *lo)
>   if (err)
>   goto exit;
>  
> + if (info->lo_flags & LO_FLAGS_BLOCKSIZE) {
> + if (!(lo->lo_flags & LO_FLAGS_BLOCKSIZE))
> + lo->lo_logical_blocksize = 512;
> + lo->lo_flags |= LO_FLAGS_BLOCKSIZE;
> + if (LO_INFO_BLOCKSIZE(info) != 512 &&
> + LO_INFO_BLOCKSIZE(info) != 1024 &&
> + LO_INFO_BLOCKSIZE(info) != 2048 &&
> + LO_INFO_BLOCKSIZE(info) != 4096)
> + return -EINVAL;
> + if (LO_INFO_BLOCKSIZE(info) > lo->lo_blocksize)
> + return -EINVAL;
> + }
> +
>   if (lo->lo_offset != info->lo_offset ||
> - lo->lo_sizelimit != info->lo_sizelimit)
> - if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit)) {
> + lo->lo_sizelimit != info->lo_sizelimit ||
> + lo->lo_flags != lo_flags ||
> + ((lo->lo_flags & LO_FLAGS_BLOCKSIZE) &&
> +  lo->lo_logical_blocksize != LO_INFO_BLOCKSIZE(info))) {
> + if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit,
> +  LO_INFO_BLOCKSIZE(info)))
>   err = -EFBIG;
>   goto exit;
>   }
> @@ -1312,7 +1341,8 @@ static int loop_set_capacity(struct loop_device *lo)
>   if (unlikely(lo->lo_state != Lo_bound))
>   return -ENXIO;
>  
> - return figure_loop_size(lo, lo->lo_offset, lo->lo_sizelimit);
> + return figure_loop_size(lo, lo->lo_offset, lo->lo_sizelimit,
> + lo->lo_logical_blocksize);
>  }
>  
>  static int loop_set_dio(struct loop_device *lo, unsigned long arg)
> diff --git a/drivers/block/loop.h b/drivers/block/loop.h
> index fb2237c..579f2f7 100644
> --- a/drivers/block/loop.h
> +++ 

Re: [PATCH v3 1/8] blk-mq: use the right hctx when getting a driver tag fails

2017-04-07 Thread Jens Axboe
On 04/06/2017 09:23 PM, Ming Lei wrote:
> Hi Jens,
> 
> Thanks for your comment!
> 
> On Thu, Apr 06, 2017 at 01:29:26PM -0600, Jens Axboe wrote:
>> On 04/06/2017 02:23 AM, Ming Lei wrote:
>>> On Thu, Apr 06, 2017 at 12:57:51AM -0700, Omar Sandoval wrote:
 On Thu, Apr 06, 2017 at 12:31:18PM +0800, Ming Lei wrote:
> On Wed, Apr 05, 2017 at 12:01:29PM -0700, Omar Sandoval wrote:
>> From: Omar Sandoval 
>>
>> While dispatching requests, if we fail to get a driver tag, we mark the
>> hardware queue as waiting for a tag and put the requests on a
>> hctx->dispatch list to be run later when a driver tag is freed. However,
>> blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
>> queues if using a single-queue scheduler with a multiqueue device. If
>
> It can't perform well using an SQ scheduler on an MQ device, so I'm just
> curious why someone would want to do it this way :-)

 I don't know why anyone would want to, but it has to work :) The only
 reason we noticed this is because when the NBD device is created, it
 only has a single queue, so we automatically assign mq-deadline to it.
 Later, we update the number of queues, but it's still using mq-deadline.

> I guess you mean that ops.mq.dispatch_request() may dispatch requests
> from other hardware queues in blk_mq_sched_dispatch_requests() instead
> of current hctx.

 Yup, that's right. It's weird, and I talked to Jens about just forcing
 the MQ device into an SQ mode when using an SQ scheduler, but this way
 works fine more or less.
>>>
>>> Or just switch the elevator to the MQ default one when the device becomes
>>> MQ? Or let mq-deadline's .dispatch_request() just return reqs in current
>>> hctx?
>>
>> No, that would be a really bad idea imho. First of all, I don't want
>> kernel driven scheduler changes. Secondly, the framework should work
>> with a non-direct mapping between hardware dispatch queues and
>> scheduling queues.
>>
>> While we could force a single queue usage to make that a 1:1 mapping
>> always, that loses big benefits on eg nbd, which uses multiple hardware
>> queues to up the bandwidth. Similarly on nvme, for example, we still
>> scale better with N submission queues and 1 scheduling queue compared to
>> having just 1 submission queue.
> 
> That doesn't look like what I meant. My 2nd point is to make
> mq-deadline's dispatch_request(hctx) just return requests mapped to
> the hw queue of 'hctx'; then we can avoid messing up blk-mq.c and
> blk-mq-sched.c.

That would indeed be better. But let's assume that we have 4 hardware
queues, we're asked to run queue X. But the FIFO dictates that the first
request that should run is on queue Y, since it's expired. What do we
do? Then we'd have to abort dispatching on queue X, and run queue Y
appropriately.

This shuffle can happen all the time. I think it'd be worthwhile to work
towards the goal of improving mq-deadline to deal with this, and
subsequently cleaning up the interface. It would be a cleaner
implementation, though efficiency might suffer further.

> If that is true, it looks like an issue in how the I/O scheduler is used, since
> the mq-deadline scheduler just queues requests in one per-request_queue
> linked list; for an MQ device, the scheduler queue should have been per-hctx.

 That's an option, but that's a different scheduling policy. Again, I
 agree that it's strange, but it's reasonable behavior.
>>>
>>> IMO, the current mq-deadline isn't good/ready for MQ devices, and it
>>> doesn't make sense to use it for MQ.
>>
>> I don't think that's true at all. I do agree that it's somewhat quirky
>> since it does introduce scheduling dependencies between the hardware
>> queues, and we have to work at making that well understood and explicit,
>> so as not to introduce bugs due to that. But in reality, all multiqueue
>> hardware we are dealing with is mapped to a single resource. As such,
>> it makes a lot of sense to schedule it as such. Hence I don't think that
>> a single queue deadline approach is necessarily a bad idea for even fast
>> storage.
> 
> When we map all MQ hw queues into one single deadline queue, it can't scale well.
> For example, I just ran a simple write test (libaio, dio, bs:4k, 4 jobs) over one
> commodity NVMe device in a dual-socket (four cores) system:
>   
>   IO scheduler|   CPU utilization
>   ---
>   none|   60%
>   ---
>   mq-deadline |   100%
>   ---
> 
> And IO throughput is basically same in two cases.
> 
That's a given; for low blocksize and high iops, we'll be hitting the
deadline lock pretty hard. We dispatch single requests at a time, to
optimize for rotational storage, since it improves merging a lot. If we
modified e->type->ops.mq.dispatch_request() to take a "dump this many
requests" parameter and ramped 

Re: [PATCH v2 2/2] blk-mq: Add a polling specific stats function

2017-04-07 Thread Jens Axboe
On 04/07/2017 06:11 AM, Stephen  Bates wrote:
> On 2017-04-05, 7:14 PM, "Jens Axboe"  wrote:
> 
>> Why not just have 8 buckets, and make it:
>>
>>  bucket = ddir + ilog2(bytes) - 9;
>>
>> and cap it at MAX_BUCKET (8) and put all those above into the top
>> bucket.
> 
> Thanks. However, that equation does not differentiate between
> direction and size. Instead we can use
> 
> bucket = ddir + 2*(ilog2(bytes) - 9);

It would be cleaner to just embed the fact that we have 2 sets of
identical buckets, and return 

bucket = ilog2(bytes) - 9;

and have poll_stat be indexed by:

->poll_stat[ddir][bucket];

instead.
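
A minimal sketch of that two-dimensional layout (the helper name and the
bucket count are illustrative assumptions, not the posted code):

	#define POLL_STAT_BKTS 8	/* illustrative: 512B .. 64K, one bucket per power of two */

	/* struct blk_rq_stat poll_stat[2][POLL_STAT_BKTS], indexed as poll_stat[ddir][bucket] */

	static int poll_stat_bucket(unsigned int bytes)
	{
		int bucket = ilog2(bytes) - 9;	/* 512B -> 0, 1K -> 1, 4K -> 3, ... */

		if (bucket < 0)
			return -1;		/* below 512B: don't account */
		if (bucket >= POLL_STAT_BKTS)
			bucket = POLL_STAT_BKTS - 1;	/* cap: everything larger goes in the top bucket */
		return bucket;
	}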

-- 
Jens Axboe



Re: [Nbd] [PATCH 00/12] nbd: Netlink interface and path failure enhancements

2017-04-07 Thread Wouter Verhelst
Hi Josef,

On Thu, Apr 06, 2017 at 05:05:25PM -0400, Josef Bacik wrote:
> 
> > On Apr 6, 2017, at 5:01 PM, Josef Bacik  wrote:
> > 
> > This patchset adds a new netlink configuration interface to NBD as well as a
> > bunch of enhancements around path failures.  The patches provide the
> > following enhancements to NBD
> > 
> > - Netlink configuration interface that doesn't leave a userspace application
> >   waiting in kernel space for the device to disconnect.
> > - Netlink reconfigure interface for adding re-connected sockets to replace
> >   dead sockets.
> > - A flag to destroy the NBD device on disconnect, much like how mount -o loop
> >   works.
> > - A status interface that currently will only report whether a device is
> >   connected or not, but can be extended to include whatever in the future.
> > - A netlink multicast notification scheme to notify user space when there are
> >   connection issues to allow for seamless reconnects.
> > - Dead link handling.  You can specify a dead link timeout and the NBD device
> >   will pause IO for that timeout waiting to see if the connection can be
> >   re-established.  This is helpful to allow for things like nbd server upgrades
> >   where the whole server disappears for a short period of time.
> > 
> > These patches have been thoroughly and continuously tested for about a month.
> > I've been finding bugs in various places, but this batch has been solid for the
> > last few days of testing, which includes a constant disconnect/reconnect torture
> > test.  Thanks,
> 
> Oops forgot to mention that I have patches for the nbd user space side
> that utilizes all of these new interfaces, you can find the tree here
> 
> https://github.com/josefbacik/nbd

Yeah, found that already :-)

> The user space patches are a bit rough, but they utilize everything, so they're
> good for testing. They will be cleaned up and submitted once the kernel
> part is upstream. Thanks,

No worries, I'll probably clean it up myself. TBH, cleaning up
nbd-client so it becomes somewhat less of a beast has been on my TODO
list for a while, and this will only add to it... maybe that'll finally
convince me that it's time ;-)

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


[PATCH v3 2/2] blk-mq: Add a polling specific stats function

2017-04-07 Thread sbates
From: Stephen Bates 

Rather than bucketing IO statistics based on direction only, we also
bucket based on the IO size. This leads to improved polling
performance. Update the bucket callback function and use it in the
polling latency estimation.

Signed-off-by: Stephen Bates 
---
 block/blk-mq.c | 45 +++--
 1 file changed, 35 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 061fc2c..5fd376b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -42,6 +42,25 @@ static LIST_HEAD(all_q_list);
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
 
+/* Must be consistent with the function below */
+#define BLK_MQ_POLL_STATS_BKTS 16
+static int blk_mq_poll_stats_bkt(const struct request *rq)
+{
+   int ddir, bytes, bucket;
+
+   ddir = blk_stat_rq_ddir(rq);
+   bytes = blk_rq_bytes(rq);
+
+   bucket = ddir + 2*(ilog2(bytes) - 9);
+
+   if (bucket < 0)
+   return -1;
+   else if (bucket >= BLK_MQ_POLL_STATS_BKTS)
+   return ddir + BLK_MQ_POLL_STATS_BKTS - 2;
+
+   return bucket;
+}
+
 /*
  * Check if any of the ctx's have pending work in this hardware queue
  */
@@ -2245,7 +2264,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct 
blk_mq_tag_set *set,
q->mq_ops = set->ops;
 
q->poll_cb = blk_stat_alloc_callback(blk_mq_poll_stats_fn,
-blk_stat_rq_ddir, 2, q);
+blk_mq_poll_stats_bkt,
+BLK_MQ_POLL_STATS_BKTS, q);
if (!q->poll_cb)
goto err_exit;
 
@@ -2663,11 +2683,12 @@ static void blk_mq_poll_stats_start(struct 
request_queue *q)
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb)
 {
struct request_queue *q = cb->data;
+   int bucket;
 
-   if (cb->stat[READ].nr_samples)
-   q->poll_stat[READ] = cb->stat[READ];
-   if (cb->stat[WRITE].nr_samples)
-   q->poll_stat[WRITE] = cb->stat[WRITE];
+   for (bucket = 0; bucket < BLK_MQ_POLL_STATS_BKTS; bucket++) {
+   if (cb->stat[bucket].nr_samples)
+   q->poll_stat[bucket] = cb->stat[bucket];
+   }
 }
 
 static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
@@ -2675,6 +2696,7 @@ static unsigned long blk_mq_poll_nsecs(struct 
request_queue *q,
   struct request *rq)
 {
unsigned long ret = 0;
+   int bucket;
 
/*
 * If stats collection isn't on, don't sleep but turn it on for
@@ -2689,12 +2711,15 @@ static unsigned long blk_mq_poll_nsecs(struct 
request_queue *q,
 * For instance, if the completion latencies are tight, we can
 * get closer than just half the mean. This is especially
 * important on devices where the completion latencies are longer
-* than ~10 usec.
+* than ~10 usec. We do use the stats for the relevant IO size
+* if available which does lead to better estimates.
 */
-   if (req_op(rq) == REQ_OP_READ && q->poll_stat[READ].nr_samples)
-   ret = (q->poll_stat[READ].mean + 1) / 2;
-   else if (req_op(rq) == REQ_OP_WRITE && q->poll_stat[WRITE].nr_samples)
-   ret = (q->poll_stat[WRITE].mean + 1) / 2;
+   bucket = blk_mq_poll_stats_bkt(rq);
+   if (bucket < 0)
+   return ret;
+
+   if (q->poll_stat[bucket].nr_samples)
+   ret = (q->poll_stat[bucket].mean + 1) / 2;
 
return ret;
 }
-- 
2.7.4



[PATCH v3 1/2] blk-stat: convert blk-stat bucket callback to signed

2017-04-07 Thread sbates
From: Stephen Bates 

In order to allow filtering of IO based on some property of the
request other than direction, we allow the bucket function to return
an int.

If the bucket callback returns a negative value, do not count it in the
stats accumulation.

Signed-off-by: Stephen Bates 
---
 block/blk-stat.c | 6 --
 block/blk-stat.h | 9 +
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/block/blk-stat.c b/block/blk-stat.c
index e77ec52..dde9d39 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -19,7 +19,7 @@ struct blk_queue_stats {
bool enable_accounting;
 };
 
-unsigned int blk_stat_rq_ddir(const struct request *rq)
+int blk_stat_rq_ddir(const struct request *rq)
 {
return rq_data_dir(rq);
 }
@@ -104,6 +104,8 @@ void blk_stat_add(struct request *rq)
list_for_each_entry_rcu(cb, &q->stats->callbacks, list) {
if (blk_stat_is_active(cb)) {
bucket = cb->bucket_fn(rq);
+   if (bucket < 0)
+   continue;
stat = &get_cpu_ptr(cb->cpu_stat)[bucket];
__blk_stat_add(stat, value);
}
@@ -135,7 +137,7 @@ static void blk_stat_timer_fn(unsigned long data)
 
 struct blk_stat_callback *
 blk_stat_alloc_callback(void (*timer_fn)(struct blk_stat_callback *),
-   unsigned int (*bucket_fn)(const struct request *),
+   int (*bucket_fn)(const struct request *),
unsigned int buckets, void *data)
 {
struct blk_stat_callback *cb;
diff --git a/block/blk-stat.h b/block/blk-stat.h
index 53f08a6..622a62c 100644
--- a/block/blk-stat.h
+++ b/block/blk-stat.h
@@ -48,9 +48,10 @@ struct blk_stat_callback {
 
/**
 * @bucket_fn: Given a request, returns which statistics bucket it
-* should be accounted under.
+* should be accounted under. Return -1 for no bucket for this
+* request.
 */
-   unsigned int (*bucket_fn)(const struct request *);
+   int (*bucket_fn)(const struct request *);
 
/**
 * @buckets: Number of statistics buckets.
@@ -120,7 +121,7 @@ void blk_stat_enable_accounting(struct request_queue *q);
  *
  * Return: Data direction of the request, either READ or WRITE.
  */
-unsigned int blk_stat_rq_ddir(const struct request *rq);
+int blk_stat_rq_ddir(const struct request *rq);
 
 /**
  * blk_stat_alloc_callback() - Allocate a block statistics callback.
@@ -135,7 +136,7 @@ unsigned int blk_stat_rq_ddir(const struct request *rq);
  */
 struct blk_stat_callback *
 blk_stat_alloc_callback(void (*timer_fn)(struct blk_stat_callback *),
-   unsigned int (*bucket_fn)(const struct request *),
+   int (*bucket_fn)(const struct request *),
unsigned int buckets, void *data);
 
 /**
-- 
2.7.4



[PATCH v3 0/2] blk-stat: Add ability to not bucket IO; improve IO polling.

2017-04-07 Thread sbates
From: Stephen Bates 

Omar recently developed some patches for block layer stats that use
callbacks to determine which bucket an IO should be considered for. At
the same time there was discussion at LSF/MM that we might not want to
consider all IO when generating stats for certain algorithms (e.g. IO
completion polling) or to bucket them in a more optimal fashion.

This set does two things. It makes the bucket callback for stats
signed so we can now ignore IO that causes a negative value to be
returned from the bucket function. It then improves the IO polling
latency estimation by bucketing stats based on IO size and direction.

This patchset applies cleanly on 6809ef67eb7b4b68d (Merge branch
'for-4.12/block' into for-next) in Jens' for-next tree.

I've lightly tested this using QEMU and a real NVMe low-latency
device. I do not have performance numbers yet. Feedback would be
appreciated! I am not *super* happy with how the bucketing by size is
done. Any suggestions on how to improve this would be appreciated!

Cc: damien.lem...@wdc.com
Cc: osan...@osandov.com

Changes since v2:
  Modified the bucket function as per Jens' suggestions.

Changes since v1:
  Dropped the cast in blk_stat_rq_ddir() as per Omar's suggestion.
  Moved to an array of buckets based on IO size rather than a filter
  as suggested by Jens and Damien.
  Rebased on Jens' for-next tree

Stephen Bates (2):
  blk-stat: convert blk-stat bucket callback to signed
  blk-mq: Add a polling specific stats function

 block/blk-mq.c   | 45 +++--
 block/blk-stat.c |  6 --
 block/blk-stat.h |  9 +
 3 files changed, 44 insertions(+), 16 deletions(-)

-- 
2.7.4



Re: [PATCH v2 2/2] blk-mq: Add a polling specific stats function

2017-04-07 Thread Stephen Bates
On 2017-04-05, 7:14 PM, "Jens Axboe"  wrote:

> Why not just have 8 buckets, and make it:
>
>   bucket = ddir + ilog2(bytes) - 9;
>
> and cap it at MAX_BUCKET (8) and put all those above into the top
> bucket.

Thanks. However, that equation does not differentiate between direction and 
size. Instead we can use

bucket = ddir + 2*(ilog2(bytes) - 9);

and then bin any IO over 64K in the largest of the two buckets based on 
direction. I’ll implement this in a v3….
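
As a worked example of that binning (the values below simply follow the
formula, with READ=0 and WRITE=1; the 16-bucket cap matches the array size
used in the v3 patch):

	/* bucket = ddir + 2 * (ilog2(bytes) - 9)          */
	/*  bytes      READ   WRITE                        */
	/*   512         0       1                         */
	/*  1024         2       3                         */
	/*  4096         6       7                         */
	/* 65536        14      15                         */
	/* >64K: capped into bucket 14 (READ) or 15 (WRITE) */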

Cheers

Stephen






[PATCHv7 2/2] loop: support 4k physical blocksize

2017-04-07 Thread Hannes Reinecke
When generating bootable VM images certain systems (most notably
s390x) require devices with 4k blocksize. This patch implements
a new flag 'LO_FLAGS_BLOCKSIZE' which will set the physical
blocksize to that of the underlying device, and allow changing
the logical blocksize up to the physical blocksize.

Signed-off-by: Hannes Reinecke 
---
 drivers/block/loop.c  | 40 +++-
 drivers/block/loop.h  |  1 +
 include/uapi/linux/loop.h |  3 +++
 3 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 81b3900..0909e06 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -221,7 +221,8 @@ static void __loop_update_dio(struct loop_device *lo, bool 
dio)
 }
 
 static int
-figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit)
+figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit,
+loff_t logical_blocksize)
 {
loff_t size = get_size(offset, sizelimit, lo->lo_backing_file);
sector_t x = (sector_t)size;
@@ -233,6 +234,12 @@ static void __loop_update_dio(struct loop_device *lo, bool 
dio)
lo->lo_offset = offset;
if (lo->lo_sizelimit != sizelimit)
lo->lo_sizelimit = sizelimit;
+   if (lo->lo_flags & LO_FLAGS_BLOCKSIZE) {
+   lo->lo_logical_blocksize = logical_blocksize;
+   blk_queue_physical_block_size(lo->lo_queue, lo->lo_blocksize);
+   blk_queue_logical_block_size(lo->lo_queue,
+lo->lo_logical_blocksize);
+   }
set_capacity(lo->lo_disk, x);
bd_set_size(bdev, (loff_t)get_capacity(bdev->bd_disk) << 9);
/* let user-space know about the new size */
@@ -814,6 +821,7 @@ static void loop_config_discard(struct loop_device *lo)
struct file *file = lo->lo_backing_file;
struct inode *inode = file->f_mapping->host;
struct request_queue *q = lo->lo_queue;
+   int lo_bits = 9;
 
/*
 * We use punch hole to reclaim the free space used by the
@@ -833,7 +841,10 @@ static void loop_config_discard(struct loop_device *lo)
 
q->limits.discard_granularity = inode->i_sb->s_blocksize;
q->limits.discard_alignment = 0;
-   blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
+   if (lo->lo_flags & LO_FLAGS_BLOCKSIZE)
+   lo_bits = blksize_bits(lo->lo_logical_blocksize);
+
+   blk_queue_max_discard_sectors(q, UINT_MAX >> lo_bits);
q->limits.discard_zeroes_data = 1;
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
 }
@@ -1087,6 +1098,7 @@ static int loop_clr_fd(struct loop_device *lo)
int err;
struct loop_func_table *xfer;
kuid_t uid = current_uid();
+   int lo_flags = lo->lo_flags;
 
if (lo->lo_encrypt_key_size &&
!uid_eq(lo->lo_key_owner, uid) &&
@@ -1119,9 +1131,26 @@ static int loop_clr_fd(struct loop_device *lo)
if (err)
goto exit;
 
+   if (info->lo_flags & LO_FLAGS_BLOCKSIZE) {
+   if (!(lo->lo_flags & LO_FLAGS_BLOCKSIZE))
+   lo->lo_logical_blocksize = 512;
+   lo->lo_flags |= LO_FLAGS_BLOCKSIZE;
+   if (LO_INFO_BLOCKSIZE(info) != 512 &&
+   LO_INFO_BLOCKSIZE(info) != 1024 &&
+   LO_INFO_BLOCKSIZE(info) != 2048 &&
+   LO_INFO_BLOCKSIZE(info) != 4096)
+   return -EINVAL;
+   if (LO_INFO_BLOCKSIZE(info) > lo->lo_blocksize)
+   return -EINVAL;
+   }
+
if (lo->lo_offset != info->lo_offset ||
-   lo->lo_sizelimit != info->lo_sizelimit)
-   if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit)) {
+   lo->lo_sizelimit != info->lo_sizelimit ||
+   lo->lo_flags != lo_flags ||
+   ((lo->lo_flags & LO_FLAGS_BLOCKSIZE) &&
+lo->lo_logical_blocksize != LO_INFO_BLOCKSIZE(info))) {
+   if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit,
+LO_INFO_BLOCKSIZE(info)))
err = -EFBIG;
goto exit;
}
@@ -1312,7 +1341,8 @@ static int loop_set_capacity(struct loop_device *lo)
if (unlikely(lo->lo_state != Lo_bound))
return -ENXIO;
 
-   return figure_loop_size(lo, lo->lo_offset, lo->lo_sizelimit);
+   return figure_loop_size(lo, lo->lo_offset, lo->lo_sizelimit,
+   lo->lo_logical_blocksize);
 }
 
 static int loop_set_dio(struct loop_device *lo, unsigned long arg)
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index fb2237c..579f2f7 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -49,6 +49,7 @@ struct loop_device {
struct file *   lo_backing_file;
struct block_device *lo_device;
unsigned

[PATCHv7 1/2] loop: Remove unused 'bdev' argument from loop_set_capacity

2017-04-07 Thread Hannes Reinecke
Signed-off-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Ming Lei 
---
 drivers/block/loop.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cc981f3..81b3900 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1307,7 +1307,7 @@ static int loop_clr_fd(struct loop_device *lo)
return err;
 }
 
-static int loop_set_capacity(struct loop_device *lo, struct block_device *bdev)
+static int loop_set_capacity(struct loop_device *lo)
 {
if (unlikely(lo->lo_state != Lo_bound))
return -ENXIO;
@@ -1370,7 +1370,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
case LOOP_SET_CAPACITY:
err = -EPERM;
if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
-   err = loop_set_capacity(lo, bdev);
+   err = loop_set_capacity(lo);
break;
case LOOP_SET_DIRECT_IO:
err = -EPERM;
-- 
1.8.5.6



[PATCHv7 0/2] loop: enable different logical blocksizes

2017-04-07 Thread Hannes Reinecke
Currently the loop driver just simulates 512-byte blocks. When
creating bootable images for virtual machines it might be required
to use a different physical blocksize (eg 4k for S/390 DASD), as
some bootloaders (like lilo or zipl for S/390) need to know
the physical block addresses of the kernel and initrd.

With this patchset the loop driver will export the logical and
physical blocksize and the current LOOP_SET_STATUS64 ioctl is
extended to set the logical blocksize by re-using the existing
'init' fields, which are currently unused.
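
For illustration, switching an already bound loop device to a 4k logical
blocksize from user space with this series applied could look roughly like
the sketch below. LO_FLAGS_BLOCKSIZE and the meaning of lo_init[0] come from
this patchset (they are not in mainline <linux/loop.h>), so the flag value
used here is only a placeholder assumption:

	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/loop.h>

	#ifndef LO_FLAGS_BLOCKSIZE
	#define LO_FLAGS_BLOCKSIZE 32	/* placeholder; the real value comes from this series' loop.h */
	#endif

	/* sketch: set the logical blocksize of a bound loop device */
	static int loop_set_logical_blocksize(const char *loopdev, unsigned long long bsz)
	{
		struct loop_info64 info;
		int fd = open(loopdev, O_RDWR);

		if (fd < 0)
			return -1;
		if (ioctl(fd, LOOP_GET_STATUS64, &info) < 0) {
			close(fd);
			return -1;
		}

		info.lo_flags |= LO_FLAGS_BLOCKSIZE;
		info.lo_init[0] = bsz;	/* LO_INFO_BLOCKSIZE(info) in the kernel patch */

		/* per this patchset the kernel rejects anything other than
		 * 512/1024/2048/4096, or a size above the backing blocksize */
		if (ioctl(fd, LOOP_SET_STATUS64, &info) < 0) {
			close(fd);
			return -1;
		}
		close(fd);
		return 0;
	}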

As usual, comments and reviews are welcome.

Changes to v1:
- Move LO_FLAGS_BLOCKSIZE definition
- Reshuffle patches
Changes to v2:
- Include reviews from Ming Lei
Changes to v3:
- Include reviews from Christoph
- Merge patches
Changes to v4:
- Add LO_INFO_BLOCKSIZE definition
Changes to v5:
- Rediff to latest upstream
Changes to v6:
- Include review from Ming Lei

Hannes Reinecke (2):
  loop: Remove unused 'bdev' argument from loop_set_capacity
  loop: support 4k physical blocksize

 drivers/block/loop.c  | 44 +---
 drivers/block/loop.h  |  1 +
 include/uapi/linux/loop.h |  3 +++
 3 files changed, 41 insertions(+), 7 deletions(-)

-- 
1.8.5.6



Re: [PATCHv6 2/2] loop: support 4k physical blocksize

2017-04-07 Thread Ming Lei
On Fri, Apr 07, 2017 at 11:52:27AM +0200, Hannes Reinecke wrote:
> On 04/07/2017 11:38 AM, Hannes Reinecke wrote:
> > On 04/07/2017 11:34 AM, Ming Lei wrote:
> >> On Fri, Apr 07, 2017 at 08:58:58AM +0200, Hannes Reinecke wrote:
> >>> When generating bootable VM images certain systems (most notably
> >>> s390x) require devices with 4k blocksize. This patch implements
> >>> a new flag 'LO_FLAGS_BLOCKSIZE' which will set the physical
> >>> blocksize to that of the underlying device, and allow to change
> >>> the logical blocksize for up to the physical blocksize.
> >>
> >> Actually this UAPI flag is only for setting logical block size. 
> >>
> > Hmm? No, we're setting the physical blocksize, too.
> > Or am I missing something?
> > 
> >>>
> >>> Signed-off-by: Hannes Reinecke 
> >>> ---
> >>>  drivers/block/loop.c  | 36 +++-
> >>>  drivers/block/loop.h  |  1 +
> >>>  include/uapi/linux/loop.h |  3 +++
> >>>  3 files changed, 35 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> >>> index 81b3900..f098681 100644
> >>> --- a/drivers/block/loop.c
> >>> +++ b/drivers/block/loop.c
> >>> @@ -221,7 +221,8 @@ static void __loop_update_dio(struct loop_device *lo, 
> >>> bool dio)
> >>>  }
> >>>  
> >>>  static int
> >>> -figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit)
> >>> +figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit,
> >>> +  loff_t logical_blocksize)
> >>>  {
> >>>   loff_t size = get_size(offset, sizelimit, lo->lo_backing_file);
> >>>   sector_t x = (sector_t)size;
> >>> @@ -233,6 +234,12 @@ static void __loop_update_dio(struct loop_device 
> >>> *lo, bool dio)
> >>>   lo->lo_offset = offset;
> >>>   if (lo->lo_sizelimit != sizelimit)
> >>>   lo->lo_sizelimit = sizelimit;
> >>> + if (lo->lo_flags & LO_FLAGS_BLOCKSIZE) {
> >>> + lo->lo_logical_blocksize = logical_blocksize;
> >>> + blk_queue_physical_block_size(lo->lo_queue, lo->lo_blocksize);
> >>> + blk_queue_logical_block_size(lo->lo_queue,
> >>> +  lo->lo_logical_blocksize);
> >>> + }
> >>
> >> We can move setting physical block size into loop_set_fd(), and set
> >> 512 bytes as default logical block size in loop_set_fd() too.
> >>
> > Okay.
> > 
> After thinking about it some more I'm not sure if I agree with that
> reasoning.
> 
> One of the goals of this patchset is to keep compatibility with existing
> installations.
> If we move setting the physical blocksize into loop_set_fd() (which
> needs to be called before loop_set_status()) the physical blocksize
> would always be set.
> Which would induce a user-visible change.
> 
> Hence I've formulated my patch to _not_ change the default setup if the
> new flag isn't set.

OK, better to not break current users, :-)

Thanks,
Ming


Re: [PATCH v3 1/5] blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list

2017-04-07 Thread Ming Lei
On Thu, Apr 06, 2017 at 11:10:46AM -0700, Bart Van Assche wrote:
> Since the next patch in this series will use RCU to iterate over
> tag_list, make this safe. Add lockdep_assert_held() statements
> in functions that iterate over tag_list to make clear that using
> list_for_each_entry() instead of list_for_each_entry_rcu() is
> fine in these functions.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> ---
>  block/blk-mq.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f7cd3208bcdf..b5580b09b4a5 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2076,6 +2076,8 @@ static void blk_mq_update_tag_set_depth(struct 
> blk_mq_tag_set *set, bool shared)
>  {
>   struct request_queue *q;
>  
> + lockdep_assert_held(&set->tag_list_lock);
> +
>   list_for_each_entry(q, &set->tag_list, tag_set_list) {
>   blk_mq_freeze_queue(q);
>   queue_set_hctx_shared(q, shared);
> @@ -2096,6 +2098,8 @@ static void blk_mq_del_queue_tag_set(struct 
> request_queue *q)
>   blk_mq_update_tag_set_depth(set, false);
>   }
>   mutex_unlock(&set->tag_list_lock);
> +
> + synchronize_rcu();

It looks like synchronize_rcu() is only needed in the deletion path, so it
can be moved to blk_mq_del_queue_tag_set().

Also list_del_init()/list_add_tail() need to be replaced with the RCU-safe
versions in the functions operating on &set->tag_list.
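
A rough sketch of that suggestion (my reading of the comment, not a posted
patch; the surrounding function body is assumed):

	/* deletion path: unlink RCU-safely and wait for readers here only */
	static void blk_mq_del_queue_tag_set(struct request_queue *q)
	{
		struct blk_mq_tag_set *set = q->tag_set;

		mutex_lock(&set->tag_list_lock);
		list_del_rcu(&q->tag_set_list);	/* instead of list_del_init() */
		/* existing shared-tag depth handling stays as it is */
		mutex_unlock(&set->tag_list_lock);

		/* readers walking set->tag_list under RCU may still be running */
		synchronize_rcu();
	}

	/* the add path would use list_add_tail_rcu() under the same mutex */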

>  }
>  
>  static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
> @@ -2601,6 +2605,8 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set 
> *set, int nr_hw_queues)
>  {
>   struct request_queue *q;
>  
> + lockdep_assert_held(&set->tag_list_lock);
> +
>   if (nr_hw_queues > nr_cpu_ids)
>   nr_hw_queues = nr_cpu_ids;
>   if (nr_hw_queues < 1 || nr_hw_queues == set->nr_hw_queues)
> -- 
> 2.12.0
> 

-- 
Ming


Re: [PATCH v3 0/5] Avoid that scsi-mq queue processing stalls

2017-04-07 Thread Ming Lei
On Thu, Apr 06, 2017 at 11:10:45AM -0700, Bart Van Assche wrote:
> Hello Jens,
> 
> The five patches in this patch series fix the queue lockup I reported
> recently on the linux-block mailing list. Please consider these patches
> for inclusion in the upstream kernel.

I read the commit logs of the 5 patches, but I did not find a description
of the root cause of the queue lockup, so could you explain a bit about
the reason behind it?

Thanks,
Ming


Re: [PATCHv6 2/2] loop: support 4k physical blocksize

2017-04-07 Thread Hannes Reinecke
On 04/07/2017 11:34 AM, Ming Lei wrote:
> On Fri, Apr 07, 2017 at 08:58:58AM +0200, Hannes Reinecke wrote:
>> When generating bootable VM images certain systems (most notably
>> s390x) require devices with 4k blocksize. This patch implements
>> a new flag 'LO_FLAGS_BLOCKSIZE' which will set the physical
>> blocksize to that of the underlying device, and allow to change
>> the logical blocksize for up to the physical blocksize.
> 
> Actually this UAPI flag is only for setting logical block size. 
> 
Hmm? No, we're setting the physical blocksize, too.
Or am I missing something?

>>
>> Signed-off-by: Hannes Reinecke 
>> ---
>>  drivers/block/loop.c  | 36 +++-
>>  drivers/block/loop.h  |  1 +
>>  include/uapi/linux/loop.h |  3 +++
>>  3 files changed, 35 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
>> index 81b3900..f098681 100644
>> --- a/drivers/block/loop.c
>> +++ b/drivers/block/loop.c
>> @@ -221,7 +221,8 @@ static void __loop_update_dio(struct loop_device *lo, 
>> bool dio)
>>  }
>>  
>>  static int
>> -figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit)
>> +figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit,
>> + loff_t logical_blocksize)
>>  {
>>  loff_t size = get_size(offset, sizelimit, lo->lo_backing_file);
>>  sector_t x = (sector_t)size;
>> @@ -233,6 +234,12 @@ static void __loop_update_dio(struct loop_device *lo, 
>> bool dio)
>>  lo->lo_offset = offset;
>>  if (lo->lo_sizelimit != sizelimit)
>>  lo->lo_sizelimit = sizelimit;
>> +if (lo->lo_flags & LO_FLAGS_BLOCKSIZE) {
>> +lo->lo_logical_blocksize = logical_blocksize;
>> +blk_queue_physical_block_size(lo->lo_queue, lo->lo_blocksize);
>> +blk_queue_logical_block_size(lo->lo_queue,
>> + lo->lo_logical_blocksize);
>> +}
> 
> We can move setting physical block size into loop_set_fd(), and set
> 512 bytes as default logical block size in loop_set_fd() too.
> 
Okay.

>>  set_capacity(lo->lo_disk, x);
>>  bd_set_size(bdev, (loff_t)get_capacity(bdev->bd_disk) << 9);
>>  /* let user-space know about the new size */
>> @@ -814,6 +821,7 @@ static void loop_config_discard(struct loop_device *lo)
>>  struct file *file = lo->lo_backing_file;
>>  struct inode *inode = file->f_mapping->host;
>>  struct request_queue *q = lo->lo_queue;
>> +int lo_bits = blksize_bits(lo->lo_logical_blocksize);
>>  
>>  /*
>>   * We use punch hole to reclaim the free space used by the
>> @@ -833,7 +841,7 @@ static void loop_config_discard(struct loop_device *lo)
>>  
>>  q->limits.discard_granularity = inode->i_sb->s_blocksize;
>>  q->limits.discard_alignment = 0;
>> -blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
>> +blk_queue_max_discard_sectors(q, UINT_MAX >> lo_bits);
>>  q->limits.discard_zeroes_data = 1;
>>  queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
>>  }
>> @@ -922,6 +930,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t 
>> mode,
>>  
>>  lo->use_dio = false;
>>  lo->lo_blocksize = lo_blocksize;
>> +lo->lo_logical_blocksize = 512;
> 
> It isn't enough to just set this local variable to 512 here, since the last
> logical block size is still kept in the queue's settings.
> 
Right. Will be updating the queue setting, too.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCHv6 1/2] loop: Remove unused 'bdev' argument from loop_set_capacity

2017-04-07 Thread Ming Lei
On Fri, Apr 07, 2017 at 08:58:57AM +0200, Hannes Reinecke wrote:
> Signed-off-by: Hannes Reinecke 
> Reviewed-by: Christoph Hellwig 
> ---
>  drivers/block/loop.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index cc981f3..81b3900 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1307,7 +1307,7 @@ static int loop_clr_fd(struct loop_device *lo)
>   return err;
>  }
>  
> -static int loop_set_capacity(struct loop_device *lo, struct block_device 
> *bdev)
> +static int loop_set_capacity(struct loop_device *lo)
>  {
>   if (unlikely(lo->lo_state != Lo_bound))
>   return -ENXIO;
> @@ -1370,7 +1370,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
> mode,
>   case LOOP_SET_CAPACITY:
>   err = -EPERM;
>   if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
> - err = loop_set_capacity(lo, bdev);
> + err = loop_set_capacity(lo);
>   break;
>   case LOOP_SET_DIRECT_IO:
>   err = -EPERM;

Reviewed-by: Ming Lei 


Thanks,
Ming


Re: [PATCHv6 2/2] loop: support 4k physical blocksize

2017-04-07 Thread Ming Lei
On Fri, Apr 07, 2017 at 08:58:58AM +0200, Hannes Reinecke wrote:
> When generating bootable VM images certain systems (most notably
> s390x) require devices with 4k blocksize. This patch implements
> a new flag 'LO_FLAGS_BLOCKSIZE' which will set the physical
> blocksize to that of the underlying device, and allow to change
> the logical blocksize for up to the physical blocksize.

Actually this UAPI flag is only for setting logical block size. 

> 
> Signed-off-by: Hannes Reinecke 
> ---
>  drivers/block/loop.c  | 36 +++-
>  drivers/block/loop.h  |  1 +
>  include/uapi/linux/loop.h |  3 +++
>  3 files changed, 35 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 81b3900..f098681 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -221,7 +221,8 @@ static void __loop_update_dio(struct loop_device *lo, 
> bool dio)
>  }
>  
>  static int
> -figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit)
> +figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit,
> +  loff_t logical_blocksize)
>  {
>   loff_t size = get_size(offset, sizelimit, lo->lo_backing_file);
>   sector_t x = (sector_t)size;
> @@ -233,6 +234,12 @@ static void __loop_update_dio(struct loop_device *lo, 
> bool dio)
>   lo->lo_offset = offset;
>   if (lo->lo_sizelimit != sizelimit)
>   lo->lo_sizelimit = sizelimit;
> + if (lo->lo_flags & LO_FLAGS_BLOCKSIZE) {
> + lo->lo_logical_blocksize = logical_blocksize;
> + blk_queue_physical_block_size(lo->lo_queue, lo->lo_blocksize);
> + blk_queue_logical_block_size(lo->lo_queue,
> +  lo->lo_logical_blocksize);
> + }

We can move setting physical block size into loop_set_fd(), and set
512 bytes as default logical block size in loop_set_fd() too.

>   set_capacity(lo->lo_disk, x);
>   bd_set_size(bdev, (loff_t)get_capacity(bdev->bd_disk) << 9);
>   /* let user-space know about the new size */
> @@ -814,6 +821,7 @@ static void loop_config_discard(struct loop_device *lo)
>   struct file *file = lo->lo_backing_file;
>   struct inode *inode = file->f_mapping->host;
>   struct request_queue *q = lo->lo_queue;
> + int lo_bits = blksize_bits(lo->lo_logical_blocksize);
>  
>   /*
>* We use punch hole to reclaim the free space used by the
> @@ -833,7 +841,7 @@ static void loop_config_discard(struct loop_device *lo)
>  
>   q->limits.discard_granularity = inode->i_sb->s_blocksize;
>   q->limits.discard_alignment = 0;
> - blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> + blk_queue_max_discard_sectors(q, UINT_MAX >> lo_bits);
>   q->limits.discard_zeroes_data = 1;
>   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
>  }
> @@ -922,6 +930,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t 
> mode,
>  
>   lo->use_dio = false;
>   lo->lo_blocksize = lo_blocksize;
> + lo->lo_logical_blocksize = 512;

It isn't enough to just set this local variable to 512 here, since the last
logical block size is still kept in the queue's settings.

>   lo->lo_device = bdev;
>   lo->lo_flags = lo_flags;
>   lo->lo_backing_file = file;
> @@ -1087,6 +1096,7 @@ static int loop_clr_fd(struct loop_device *lo)
>   int err;
>   struct loop_func_table *xfer;
>   kuid_t uid = current_uid();
> + int lo_flags = lo->lo_flags;
>  
>   if (lo->lo_encrypt_key_size &&
>   !uid_eq(lo->lo_key_owner, uid) &&
> @@ -1119,9 +1129,24 @@ static int loop_clr_fd(struct loop_device *lo)
>   if (err)
>   goto exit;
>  
> + if (info->lo_flags & LO_FLAGS_BLOCKSIZE) {
> + lo->lo_flags |= LO_FLAGS_BLOCKSIZE;
> + if (LO_INFO_BLOCKSIZE(info) != 512 &&
> + LO_INFO_BLOCKSIZE(info) != 1024 &&
> + LO_INFO_BLOCKSIZE(info) != 2048 &&
> + LO_INFO_BLOCKSIZE(info) != 4096)
> + return -EINVAL;
> + if (LO_INFO_BLOCKSIZE(info) > lo->lo_blocksize)
> + return -EINVAL;
> + }
> +
>   if (lo->lo_offset != info->lo_offset ||
> - lo->lo_sizelimit != info->lo_sizelimit)
> - if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit)) {
> + lo->lo_sizelimit != info->lo_sizelimit ||
> + lo->lo_flags != lo_flags ||
> + ((lo->lo_flags & LO_FLAGS_BLOCKSIZE) &&
> +  lo->lo_logical_blocksize != LO_INFO_BLOCKSIZE(info))) {
> + if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit,
> +  info->lo_init[0]))
>   err = -EFBIG;
>   goto exit;
>   }
> @@ -1312,7 +1337,8 @@ static int loop_set_capacity(struct loop_device *lo)
>   if (unlikely(lo->lo_state != Lo_bound))
>   return -ENXIO;
> 

Re: [PATCH 1/4] mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC

2017-04-07 Thread Vlastimil Babka
On 04/05/2017 01:40 PM, Andrey Ryabinin wrote:
> On 04/05/2017 10:46 AM, Vlastimil Babka wrote:
>> The function __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent
>> deadlock during page migration by lock_page() (see the comment in
>> __unmap_and_move()). Then it unconditionally clears the flag, which can clear a
>> pre-existing PF_MEMALLOC flag and result in recursive reclaim. This was not a
>> problem until commit a8161d1ed609 ("mm, page_alloc: restructure direct
>> compaction handling in slowpath"), because direct compaction was called only
>> after direct reclaim, which was skipped when PF_MEMALLOC flag was set.
>> 
>> Even now it's only a theoretical issue, as the new callsite of
>> __alloc_pages_direct_compact() is reached only for costly orders and when
>> gfp_pfmemalloc_allowed() is true, which means either __GFP_NOMEMALLOC is in
>   (the "gfp_pfmemalloc_allowed() is true" above should read "is false")
> 
>> gfp_flags or in_interrupt() is true. There is no such known context, but 
>> let's
>> play it safe and make __alloc_pages_direct_compact() robust for cases where
>> PF_MEMALLOC is already set.
>> 
>> Fixes: a8161d1ed609 ("mm, page_alloc: restructure direct compaction handling 
>> in slowpath")
>> Reported-by: Andrey Ryabinin 
>> Signed-off-by: Vlastimil Babka 
>> Cc: 
>> ---
>>  mm/page_alloc.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>> 
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3589f8be53be..b84e6ffbe756 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3288,6 +3288,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned 
>> int order,
>>  enum compact_priority prio, enum compact_result *compact_result)
>>  {
>>  struct page *page;
>> +unsigned int noreclaim_flag = current->flags & PF_MEMALLOC;
>>  
>>  if (!order)
>>  return NULL;
>> @@ -3295,7 +3296,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned 
>> int order,
>>  current->flags |= PF_MEMALLOC;
>>  *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
>>  prio);
>> -current->flags &= ~PF_MEMALLOC;
>> +current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag;
> 
> Perhaps this would look better:
> 
>   tsk_restore_flags(current, noreclaim_flag, PF_MEMALLOC);
> 
> ?

Well, I didn't care much considering this is for stable only, and patch 2/4
rewrites this to use the new API.

>>  if (*compact_result <= COMPACT_INACTIVE)
>>  return NULL;
>> 
> 



Re: [PATCH 2/4] mm: introduce memalloc_noreclaim_{save,restore}

2017-04-07 Thread Hillf Danton

On April 05, 2017 3:47 PM Vlastimil Babka wrote: 
> 
> The previous patch has shown that simply setting and clearing PF_MEMALLOC in
> current->flags can result in wrongly clearing a pre-existing PF_MEMALLOC flag
> and potentially lead to recursive reclaim. Let's introduce helpers that support
> proper nesting by saving the previous state of the flag, similar to the existing
> memalloc_noio_* and memalloc_nofs_* helpers. Convert existing setting/clearing
> of PF_MEMALLOC within mm to the new helpers.
> 
> There are no known issues with the converted code, but the change makes it
> more robust.
> 
> Suggested-by: Michal Hocko 
> Signed-off-by: Vlastimil Babka 
> ---
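
For readers without the diff at hand, a minimal sketch of what such
nesting-safe helpers look like, modeled on the existing memalloc_noio_*
pattern (an assumption, not quoted from the patch):

	static inline unsigned int memalloc_noreclaim_save(void)
	{
		unsigned int flags = current->flags & PF_MEMALLOC;

		current->flags |= PF_MEMALLOC;
		return flags;			/* caller keeps the previous state */
	}

	static inline void memalloc_noreclaim_restore(unsigned int flags)
	{
		/* only drop PF_MEMALLOC if it wasn't already set when we saved */
		current->flags = (current->flags & ~PF_MEMALLOC) | flags;
	}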

Acked-by: Hillf Danton 



Re: [PATCH 1/4] mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC

2017-04-07 Thread Hillf Danton
On April 05, 2017 3:47 PM Vlastimil Babka wrote: 
> 
> The function __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent
> deadlock during page migration by lock_page() (see the comment in
> __unmap_and_move()). Then it unconditionally clears the flag, which can clear a
> pre-existing PF_MEMALLOC flag and result in recursive reclaim. This was not a
> problem until commit a8161d1ed609 ("mm, page_alloc: restructure direct
> compaction handling in slowpath"), because direct compaction was called only
> after direct reclaim, which was skipped when PF_MEMALLOC flag was set.
> 
> Even now it's only a theoretical issue, as the new callsite of
> __alloc_pages_direct_compact() is reached only for costly orders and when
> gfp_pfmemalloc_allowed() is true, which means either __GFP_NOMEMALLOC is in
> gfp_flags or in_interrupt() is true. There is no such known context, but let's
> play it safe and make __alloc_pages_direct_compact() robust for cases where
> PF_MEMALLOC is already set.
> 
> Fixes: a8161d1ed609 ("mm, page_alloc: restructure direct compaction handling 
> in slowpath")
> Reported-by: Andrey Ryabinin 
> Signed-off-by: Vlastimil Babka 
> Cc: 
> ---
Acked-by: Hillf Danton 

>  mm/page_alloc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3589f8be53be..b84e6ffbe756 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3288,6 +3288,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned 
> int order,
>   enum compact_priority prio, enum compact_result *compact_result)
>  {
>   struct page *page;
> + unsigned int noreclaim_flag = current->flags & PF_MEMALLOC;
> 
>   if (!order)
>   return NULL;
> @@ -3295,7 +3296,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned 
> int order,
>   current->flags |= PF_MEMALLOC;
>   *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
>   prio);
> - current->flags &= ~PF_MEMALLOC;
> + current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag;
> 
>   if (*compact_result <= COMPACT_INACTIVE)
>   return NULL;
> --
> 2.12.2



Re: kill req->errors

2017-04-07 Thread Christoph Hellwig
On Thu, Apr 06, 2017 at 04:00:24PM -0400, Konrad Rzeszutek Wilk wrote:
> You wouldn't have a git tree to easily test it? Thanks.

git://git.infradead.org/users/hch/block.git request-errors

Gitweb:

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/request-errors


[PATCHv6 2/2] loop: support 4k physical blocksize

2017-04-07 Thread Hannes Reinecke
When generating bootable VM images certain systems (most notably
s390x) require devices with 4k blocksize. This patch implements
a new flag 'LO_FLAGS_BLOCKSIZE' which will set the physical
blocksize to that of the underlying device, and allow changing
the logical blocksize up to the physical blocksize.

Signed-off-by: Hannes Reinecke 
---
 drivers/block/loop.c  | 36 +++-
 drivers/block/loop.h  |  1 +
 include/uapi/linux/loop.h |  3 +++
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 81b3900..f098681 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -221,7 +221,8 @@ static void __loop_update_dio(struct loop_device *lo, bool 
dio)
 }
 
 static int
-figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit)
+figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit,
+loff_t logical_blocksize)
 {
loff_t size = get_size(offset, sizelimit, lo->lo_backing_file);
sector_t x = (sector_t)size;
@@ -233,6 +234,12 @@ static void __loop_update_dio(struct loop_device *lo, bool 
dio)
lo->lo_offset = offset;
if (lo->lo_sizelimit != sizelimit)
lo->lo_sizelimit = sizelimit;
+   if (lo->lo_flags & LO_FLAGS_BLOCKSIZE) {
+   lo->lo_logical_blocksize = logical_blocksize;
+   blk_queue_physical_block_size(lo->lo_queue, lo->lo_blocksize);
+   blk_queue_logical_block_size(lo->lo_queue,
+lo->lo_logical_blocksize);
+   }
set_capacity(lo->lo_disk, x);
bd_set_size(bdev, (loff_t)get_capacity(bdev->bd_disk) << 9);
/* let user-space know about the new size */
@@ -814,6 +821,7 @@ static void loop_config_discard(struct loop_device *lo)
struct file *file = lo->lo_backing_file;
struct inode *inode = file->f_mapping->host;
struct request_queue *q = lo->lo_queue;
+   int lo_bits = blksize_bits(lo->lo_logical_blocksize);
 
/*
 * We use punch hole to reclaim the free space used by the
@@ -833,7 +841,7 @@ static void loop_config_discard(struct loop_device *lo)
 
q->limits.discard_granularity = inode->i_sb->s_blocksize;
q->limits.discard_alignment = 0;
-   blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
+   blk_queue_max_discard_sectors(q, UINT_MAX >> lo_bits);
q->limits.discard_zeroes_data = 1;
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
 }
@@ -922,6 +930,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 
lo->use_dio = false;
lo->lo_blocksize = lo_blocksize;
+   lo->lo_logical_blocksize = 512;
lo->lo_device = bdev;
lo->lo_flags = lo_flags;
lo->lo_backing_file = file;
@@ -1087,6 +1096,7 @@ static int loop_clr_fd(struct loop_device *lo)
int err;
struct loop_func_table *xfer;
kuid_t uid = current_uid();
+   int lo_flags = lo->lo_flags;
 
if (lo->lo_encrypt_key_size &&
!uid_eq(lo->lo_key_owner, uid) &&
@@ -1119,9 +1129,24 @@ static int loop_clr_fd(struct loop_device *lo)
if (err)
goto exit;
 
+   if (info->lo_flags & LO_FLAGS_BLOCKSIZE) {
+   lo->lo_flags |= LO_FLAGS_BLOCKSIZE;
+   if (LO_INFO_BLOCKSIZE(info) != 512 &&
+   LO_INFO_BLOCKSIZE(info) != 1024 &&
+   LO_INFO_BLOCKSIZE(info) != 2048 &&
+   LO_INFO_BLOCKSIZE(info) != 4096)
+   return -EINVAL;
+   if (LO_INFO_BLOCKSIZE(info) > lo->lo_blocksize)
+   return -EINVAL;
+   }
+
if (lo->lo_offset != info->lo_offset ||
-   lo->lo_sizelimit != info->lo_sizelimit)
-   if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit)) {
+   lo->lo_sizelimit != info->lo_sizelimit ||
+   lo->lo_flags != lo_flags ||
+   ((lo->lo_flags & LO_FLAGS_BLOCKSIZE) &&
+lo->lo_logical_blocksize != LO_INFO_BLOCKSIZE(info))) {
+   if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit,
+info->lo_init[0]))
err = -EFBIG;
goto exit;
}
@@ -1312,7 +1337,8 @@ static int loop_set_capacity(struct loop_device *lo)
if (unlikely(lo->lo_state != Lo_bound))
return -ENXIO;
 
-   return figure_loop_size(lo, lo->lo_offset, lo->lo_sizelimit);
+   return figure_loop_size(lo, lo->lo_offset, lo->lo_sizelimit,
+   lo->lo_logical_blocksize);
 }
 
 static int loop_set_dio(struct loop_device *lo, unsigned long arg)
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index fb2237c..579f2f7 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -49,6 +49,7 @@ struct loop_device {
struct file * 

[PATCHv6 1/2] loop: Remove unused 'bdev' argument from loop_set_capacity

2017-04-07 Thread Hannes Reinecke
Signed-off-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
---
 drivers/block/loop.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cc981f3..81b3900 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1307,7 +1307,7 @@ static int loop_clr_fd(struct loop_device *lo)
return err;
 }
 
-static int loop_set_capacity(struct loop_device *lo, struct block_device *bdev)
+static int loop_set_capacity(struct loop_device *lo)
 {
if (unlikely(lo->lo_state != Lo_bound))
return -ENXIO;
@@ -1370,7 +1370,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
case LOOP_SET_CAPACITY:
err = -EPERM;
if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
-   err = loop_set_capacity(lo, bdev);
+   err = loop_set_capacity(lo);
break;
case LOOP_SET_DIRECT_IO:
err = -EPERM;
-- 
1.8.5.6



[PATCHv6 0/2] loop: enable different logical blocksizes

2017-04-07 Thread Hannes Reinecke
Currently the loop driver just simulates 512-byte blocks. When
creating bootable images for virtual machines it might be required
to use a different physical blocksize (eg 4k for S/390 DASD), as
some bootloaders (like lilo or zipl for S/390) need to know
the physical block addresses of the kernel and initrd.

With this patchset the loop driver will export the logical and
physical blocksize and the current LOOP_SET_STATUS64 ioctl is
extended to set the logical blocksize by re-using the existing
'init' fields, which are currently unused.

As usual, comments and reviews are welcome.

Changes to v1:
- Move LO_FLAGS_BLOCKSIZE definition
- Reshuffle patches
Changes to v2:
- Include reviews from Ming Lei
Changes to v3:
- Include reviews from Christoph
- Merge patches
Changes to v4:
- Add LO_INFO_BLOCKSIZE definition
Changes to v5:
- Rediff to latest upstream

Hannes Reinecke (2):
  loop: Remove unused 'bdev' argument from loop_set_capacity
  loop: support 4k physical blocksize

 drivers/block/loop.c  | 40 +---
 drivers/block/loop.h  |  1 +
 include/uapi/linux/loop.h |  3 +++
 3 files changed, 37 insertions(+), 7 deletions(-)

-- 
1.8.5.6



Re: [PATCH 1/2] block: Implement global tagset

2017-04-07 Thread Arun Easi
On Thu, 6 Apr 2017, 1:49am, Hannes Reinecke wrote:

> On 04/06/2017 08:27 AM, Arun Easi wrote:
> > Hi Hannes,
> > 
> > Thanks for taking a crack at the issue. My comments below..
> > 
> > On Tue, 4 Apr 2017, 5:07am, Hannes Reinecke wrote:
> > 
> >> Most legacy HBAs have a tagset per HBA, not per queue. To map
> >> these devices onto block-mq this patch implements a new tagset
> >> flag BLK_MQ_F_GLOBAL_TAGS, which will cause the tag allocator
> >> to use just one tagset for all hardware queues.
> >>
> >> Signed-off-by: Hannes Reinecke 
> >> ---
> >>  block/blk-mq-tag.c | 12 
> >>  block/blk-mq.c | 10 --
> >>  include/linux/blk-mq.h |  1 +
> >>  3 files changed, 17 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> >> index e48bc2c..a14e76c 100644
> >> --- a/block/blk-mq-tag.c
> >> +++ b/block/blk-mq-tag.c
> >> @@ -276,9 +276,11 @@ static void blk_mq_all_tag_busy_iter(struct 
> >> blk_mq_tags *tags,
> >>  void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
> >>busy_tag_iter_fn *fn, void *priv)
> >>  {
> >> -  int i;
> >> +  int i, lim = tagset->nr_hw_queues;
> >>  
> >> -  for (i = 0; i < tagset->nr_hw_queues; i++) {
> >> +  if (tagset->flags & BLK_MQ_F_GLOBAL_TAGS)
> >> +  lim = 1;
> >> +  for (i = 0; i < lim; i++) {
> >>if (tagset->tags && tagset->tags[i])
> >>blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
> >>}
> >> @@ -287,12 +289,14 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set 
> >> *tagset,
> >>  
> >>  int blk_mq_reinit_tagset(struct blk_mq_tag_set *set)
> >>  {
> >> -  int i, j, ret = 0;
> >> +  int i, j, ret = 0, lim = set->nr_hw_queues;
> >>  
> >>if (!set->ops->reinit_request)
> >>goto out;
> >>  
> >> -  for (i = 0; i < set->nr_hw_queues; i++) {
> >> +  if (set->flags & BLK_MQ_F_GLOBAL_TAGS)
> >> +  lim = 1;
> >> +  for (i = 0; i < lim; i++) {
> >>struct blk_mq_tags *tags = set->tags[i];
> >>  
> >>for (j = 0; j < tags->nr_tags; j++) {
> >> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >> index 159187a..db96ed0 100644
> >> --- a/block/blk-mq.c
> >> +++ b/block/blk-mq.c
> >> @@ -2061,6 +2061,10 @@ static bool __blk_mq_alloc_rq_map(struct 
> >> blk_mq_tag_set *set, int hctx_idx)
> >>  {
> >>int ret = 0;
> >>  
> >> +  if ((set->flags & BLK_MQ_F_GLOBAL_TAGS) && hctx_idx != 0) {
> >> +  set->tags[hctx_idx] = set->tags[0];
> >> +  return true;
> >> +  }
> > 
:
> 
> > BTW, if you would like me to try out this patch on my setup, please let me 
> > know.
> > 
> Oh, yes. Please do.
> 

Ran the tests on my setup (Dell R730, 2 nodes). This change did not drop
any IOPS (got ~2M at 512b). The cache miss percentage varied based on whether
the tests were running on one node or both (the latter performed worse). All
interrupts were directed to only 1 node. Interestingly, the cache miss
percentage was lowest when MQ was off.

I hit an fdisk hang (open path), btw; not sure if it has anything to do with
this change, though.

Notes and hang stack attached.

Let me know if you are interested in any specific perf event/command-line.

Regards,
-Arun

perf stat, ran on a short 10-second load.

---1port-1node-new-mq
 Performance counter stats for 'CPU(s) 2':

 188,642,696  LLC-loads(66.66%)
   3,615,142  LLC-load-misses  #1.92% of all LL-cache hits (66.67%)
  86,488,341  LLC-stores   (33.34%)
  10,820,977  LLC-store-misses (33.33%)
 391,370,104  cache-references (49.99%)
  14,498,491  cache-misses #3.705 % of all cache refs  (66.66%)

---1port-1node-mq---
 Performance counter stats for 'CPU(s) 2':

 145,025,999  LLC-loads(66.67%)
   3,793,427  LLC-load-misses  #2.62% of all LL-cache hits (66.67%)
  60,878,939  LLC-stores   (33.33%)
   8,044,714  LLC-store-misses (33.33%)
 294,713,070  cache-references (50.00%)
  11,923,354  cache-misses #4.046 % of all cache refs  (66.66%)

---1port-1node-nomq---
 Performance counter stats for 'CPU(s) 2':

 157,375,709  LLC-loads(66.66%)
 476,117  LLC-load-misses  #0.30% of all LL-cache hits (66.66%)
  76,046,098  LLC-stores   (33.34%)
 840,756  LLC-store-misses (33.34%)
 326,230,969  cache-references (50.00%)
   1,332,398  cache-misses #0.408 % of all cache refs  (66.67%)

==

--2port-allnodes-new-mq--
 Performance counter stats for