Re: [PATCH] fix unbalanced page refcounting in bio_map_user_iov

2017-09-21 Thread Vitaly Mayatskikh
Reproducer (needs SCSI disk):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <malloc.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>
 
#define NR_IOS 1
#define NR_IOVECS 8
#define SG_IO 0x2285
 
int main(int argc, char *argv[])
{
int fd, i, j;
unsigned char *buf, *ptr, cdb[10];
sg_io_hdr_t io_hdr;
sg_iovec_t iovec[NR_IOVECS];
 
if (argc < 2) {
printf("Run: %s \n", argv[0]);
exit(1);
}
 
buf = ptr = memalign(4096, NR_IOS * NR_IOVECS * 512);
if (!buf) {
printf("can't alloc memory\n");
exit(1);
}
 
fd = open(argv[1], 0);
if (fd < 0) {
printf("open %s failed: %d (%s)\n", argv[1], errno, 
strerror(errno));
exit(1);
}
 
memset(&io_hdr, 0, sizeof(io_hdr));	/* avoid garbage in unused fields */
io_hdr.interface_id = 'S';
io_hdr.cmd_len = sizeof(cdb);
io_hdr.cmdp = cdb;
io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
io_hdr.dxfer_len = 512 * NR_IOVECS;
io_hdr.dxferp = iovec;
io_hdr.iovec_count = NR_IOVECS;
 
memset(cdb, 0, sizeof(cdb));
cdb[0] = 0x28;  // READ10
cdb[8] = NR_IOVECS; // sectors
 
for (j = 0; j < NR_IOS; j++, ptr += 512) {
for (i = 0; i < NR_IOVECS; i++) {
iovec[i].iov_base = ptr;
iovec[i].iov_len = 512;
}
if (ioctl(fd, SG_IO, &io_hdr)) {
printf("IOCTL failed: %d (%s)\n", errno, 
strerror(errno));
exit(1);
}
}
 
free(buf);
close(fd);
return 0;
}


# free -m
              total        used        free      shared  buff/cache   available
Mem:           3827          46        3601           0         178        3568
Swap:             0           0           0
# ./sgio-leak /dev/sdd
# free -m
              total        used        free      shared  buff/cache   available
Mem:           3827          85        3562           0         178        3529
Swap:             0           0           0
[root@node-A ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:           3827          85        3628           0         113        3561
Swap:             0           0           0
# ./sgio-leak /dev/sdd
# free -m
              total        used        free      shared  buff/cache   available
Mem:           3827         124        3589           0         113        3523
Swap:             0           0           0
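
The arithmetic behind the leak, as a toy userspace model (not kernel
code): get_user_pages() takes one page reference per mapped iovec
segment, but after bio_add_pc_page() merges segments sharing a page
into a single bvec, unmap drops only one reference per bvec.

#include <stdio.h>

int main(void)
{
	int page_refs = 0;
	int nr_segments = 8;		/* eight 512-byte iovecs in one 4K page */

	page_refs += nr_segments;	/* one get_page() per mapped segment */
	page_refs -= 1;			/* one put_page() for the merged bvec */

	printf("leaked references per IO: %d\n", page_refs);	/* prints 7 */
	return 0;
}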

-- 
wbr, Vitaly


[PATCH] fix unbalanced page refcounting in bio_map_user_iov

2017-09-21 Thread Vitaly Mayatskikh
bio_map_user_iov and bio_unmap_user do unbalanced page refcounting if
the IO vector has small consecutive buffers belonging to the same page.
bio_add_pc_page merges them into one bvec, but the extra page
references are never dropped.

Signed-off-by: Vitaly Mayatskikh 

diff --git a/block/bio.c b/block/bio.c
index b38e962fa83e..10cd3b6bed27 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1383,6 +1383,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
offset = offset_in_page(uaddr);
for (j = cur_page; j < page_limit; j++) {
unsigned int bytes = PAGE_SIZE - offset;
+   unsigned short prev_bi_vcnt = bio->bi_vcnt;
 
if (len <= 0)
break;
@@ -1397,6 +1398,13 @@ struct bio *bio_map_user_iov(struct request_queue *q,
bytes)
break;
 
+   /*
+* check if vector was merged with previous
+* drop page reference if needed
+*/
+   if (bio->bi_vcnt == prev_bi_vcnt)
+   put_page(pages[j]);
+
len -= bytes;
offset = 0;
}

-- 
wbr, Vitaly


Re: [PATCH V4 05/14] blk-mq-sched: improve dispatching from sw queue

2017-09-21 Thread Ming Lei
On Wed, Sep 20, 2017 at 08:20:58PM +0800, Ming Lei wrote:
> On Tue, Sep 19, 2017 at 02:37:50PM -0600, Jens Axboe wrote:
> > On 09/02/2017 09:17 AM, Ming Lei wrote:
> > > @@ -142,18 +178,31 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> > >   if (!list_empty(&rq_list)) {
> > >   blk_mq_sched_mark_restart_hctx(hctx);
> > >   do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
> > > - } else if (!has_sched_dispatch) {
> > > + } else if (!has_sched_dispatch && !q->queue_depth) {
> > > + /*
> > > +  * If there is no per-request_queue depth, we
> > > +  * flush all requests in this hw queue, otherwise
> > > +  * pick up request one by one from sw queue for
> > > +  * avoiding to mess up I/O merge when dispatch
> > > +  * is busy, which can be triggered easily by
> > > +  * per-request_queue queue depth
> > > +  */
> > >   blk_mq_flush_busy_ctxs(hctx, &rq_list);
> > >   blk_mq_dispatch_rq_list(q, &rq_list);
> > >   }
> > 
> > I don't like this part at all. It's a heuristic, and on top of that,
> > it's a heuristic based on a weak assumption that if q->queue_depth == 0,
> > then we never run into BUSY constraints. I think that would be stronger
> > if it was based on "is this using shared tags", but even that is not
> > great at all. It's perfectly possible to have a shared tags setup and
> > never run into resource constraints. The opposite condition is also true
> > - you can run without shared tags, and yet hit resource constraints
> > constantly. Hence this is NOT a good test for deciding whether to flush
> > everything at once or not. In short, I think a much better test case
> > would be "has this device ever returned BLK_STS_RESOURCE. As it so
> > happens, that's easy enough to keep track of and base this test on.
> 
> Hi Jens,
> 
> The attached patch follows your suggestion, and uses EWMA to
> compute the average length of hctx->dispatch, then only flush
> all requests from ctxs if the average length is zero, what do
> you think about this approach? Or other suggestions?

Hi Jens,

I am going to prepare for V5, as suggested by Omar.

Could you suggest which way you prefer? Keeping the check on
q->queue_depth, the approach of using the average length of
hctx->dispatch, or something else?
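
As a sketch of the shift-based EWMA idea (the attachment itself is not
shown here), with an assumed weight of 8 and hypothetical naming, the
update could look like this:

/* Each sample contributes 1/8 of its value; the average decays to
 * zero only after hctx->dispatch has stayed empty for several runs. */
static inline unsigned int dispatch_len_ewma(unsigned int avg,
					     unsigned int sample)
{
	return (avg * 7 + sample) / 8;
}

Flushing all ctxs only when such an average is zero means a single
busy dispatch does not immediately disable the flush path.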


-- 
Ming


[PATCH] dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure

2017-09-21 Thread Ming Lei
blk-mq will rerun the queue via RESTART after one request is completed,
so it is not necessary to wait an arbitrary time before requeuing; we
should trust blk-mq to do it.

More importantly, we need to return BLK_STS_RESOURCE to blk-mq so that
dequeuing from the I/O scheduler can be stopped, which improves I/O
merging.

Signed-off-by: Ming Lei 
---
 drivers/md/dm-mpath.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 11f273d2f018..0902f7762306 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -504,8 +504,20 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
if (queue_dying) {
atomic_inc(&m->pg_init_in_progress);
activate_or_offline_path(pgpath);
+   return DM_MAPIO_DELAY_REQUEUE;
}
-   return DM_MAPIO_DELAY_REQUEUE;
+
+   /*
+* blk-mq's SCHED_RESTART can cover this requeue, so
+* we needn't to deal with it by DELAY_REQUEUE. More
+* importantly, we have to return DM_MAPIO_REQUEUE
+* so that blk-mq can get the queue busy feedback,
+* otherwise I/O merge can be hurt.
+*/
+   if (q->mq_ops)
+   return DM_MAPIO_REQUEUE;
+   else
+   return DM_MAPIO_DELAY_REQUEUE;
}
clone->bio = clone->biotail = NULL;
clone->rq_disk = bdev->bd_disk;
-- 
2.9.5



Re: nvme multipath support V2

2017-09-21 Thread Tony Yang
Excuse me,
   may I ask a junior question: how did you complete the nvme-mpath
clone? I used git clone, and after it completed found no nvme directory.


[root@scst1 soft]#  git clone
git://git.infradead.org/users/hch/block.git nvme-mpath
Cloning into 'nvme-mpath'...
remote: Counting objects: 5623436, done.
remote: Compressing objects: 100% (922468/922468), done.
remote: Total 5623436 (delta 4757577), reused 5524606 (delta 4658819)
Receiving objects: 100% (5623436/5623436), 1.12 GiB | 3.17 MiB/s, done.
Resolving deltas: 100% (4757577/4757577), done.
Checking out files: 100% (50795/50795), done.


[root@scst1 drivers]# pwd
/u01/soft/nvme-mpath/drivers

[root@scst1 drivers]# ls nvme
ls: cannot access nvme: No such file or directory

Thanks

2017-09-21 22:50 GMT+08:00 Christoph Hellwig :
> On Thu, Sep 21, 2017 at 07:23:45AM +0200, Johannes Thumshirn wrote:
>> Ah OK, maybe we should update nvme-cli to recognize it as well then. I just
>> looked at the output of nvme list and obviously didn't find it.
>
> Overloading the new per-subsystem nodes into nvme list would be
> very confusing I think.  We could add a new command to list subsystems
> instead of controllers, if just doing a ls in /dev or sysfs is too hard.
>
> ___
> Linux-nvme mailing list
> linux-n...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme


Re: [PATCH v2 4/4] scsi-mq: Reduce suspend latency

2017-09-21 Thread Ming Lei
On Thu, Sep 21, 2017 at 11:56:33PM +0000, Bart Van Assche wrote:
> On Fri, 2017-09-22 at 07:53 +0800, Ming Lei wrote:
> > Then that is simply not enough, since the issue is that the request pool
> > can easily be used up while the SCSI device is quiesced; then no request
> > is left for RQF_PREEMPT (PM), and an I/O hang is triggered.
> > 
> > It has been discussed for a long time; I am sorry you still don't
> > understand it.
> 
> I'm really sorry that you don't understand the intention of this patch series.
> Anyway, let's close the discussion here and continue once the next version of
> this patch series has been posted.

So you mean you don't want to fix Cathy's issue, which was reported from
a production system of one of our customers?

And it definitely belongs to the same kind of SCSI quiesce issue.

-- 
Ming


Re: [PATCH v2 4/4] scsi-mq: Reduce suspend latency

2017-09-21 Thread Bart Van Assche
On Fri, 2017-09-22 at 07:53 +0800, Ming Lei wrote:
> Then that is simply not enough, since the issue is that the request pool
> can easily be used up while the SCSI device is quiesced; then no request
> is left for RQF_PREEMPT (PM), and an I/O hang is triggered.
> 
> It has been discussed for a long time; I am sorry you still don't
> understand it.

I'm really sorry that you don't understand the intention of this patch series.
Anyway, let's close the discussion here and continue once the next version of
this patch series has been posted.

Bart.

Re: [PATCH v2 4/4] scsi-mq: Reduce suspend latency

2017-09-21 Thread Ming Lei
On Thu, Sep 21, 2017 at 11:32:33PM +0000, Bart Van Assche wrote:
> On Fri, 2017-09-22 at 07:25 +0800, Ming Lei wrote:
> > On Thu, Sep 21, 2017 at 10:43:26PM +0000, Bart Van Assche wrote:
> > > On Fri, 2017-09-22 at 06:06 +0800, Ming Lei wrote:
> > > > On Thu, Sep 21, 2017 at 02:22:55PM -0700, Bart Van Assche wrote:
> > > > > + } else {
> > > > >   scsi_run_queue(q);
> > > > > + while (atomic_read(&sdev->device_busy)) {
> > > > > + msleep_interruptible(200);
> > > > > + scsi_run_queue(q);
> > > > > + }
> > > > 
> > > > Are you sure only blk-mq need to drain queue? We need
> > > > to do that for block legacy too.
> > > 
> > > The code above your comment drains the queue for the legacy block layer.
> > 
> > That only drains the requests already dispatched to the SCSI layer;
> > there might be lots of requests in the block I/O scheduler queue, on
> > the requeue list, or elsewhere.
> 
> There is no requeue list for the legacy block layer. There is only a requeue
> list for blk-mq.
> 
> Waiting for I/O requests that are in scheduler queues is not the purpose of
> scsi_quiesce_device(). The purpose of that function is to wait for requests
> that have already been started. The sdev->device_busy counter represents the
> number of started requests so waiting until that counter has reached zero is
> sufficient.

Then that is simply not enough, since the issue is that the request pool
can easily be used up while the SCSI device is quiesced; then no request
is left for RQF_PREEMPT (PM), and an I/O hang is triggered.

It has been discussed for a long time; I am sorry you still don't
understand it.

-- 
Ming


Re: [PATCH v2 4/4] scsi-mq: Reduce suspend latency

2017-09-21 Thread Bart Van Assche
On Fri, 2017-09-22 at 07:25 +0800, Ming Lei wrote:
> On Thu, Sep 21, 2017 at 10:43:26PM +0000, Bart Van Assche wrote:
> > On Fri, 2017-09-22 at 06:06 +0800, Ming Lei wrote:
> > > On Thu, Sep 21, 2017 at 02:22:55PM -0700, Bart Van Assche wrote:
> > > > +   } else {
> > > > scsi_run_queue(q);
> > > > +   while (atomic_read(&sdev->device_busy)) {
> > > > +   msleep_interruptible(200);
> > > > +   scsi_run_queue(q);
> > > > +   }
> > > 
> > > Are you sure only blk-mq need to drain queue? We need
> > > to do that for block legacy too.
> > 
> > The code above your comment drains the queue for the legacy block layer.
> 
> That only drains the requests already dispatched to the SCSI layer;
> there might be lots of requests in the block I/O scheduler queue, on
> the requeue list, or elsewhere.

There is no requeue list for the legacy block layer. There is only a requeue
list for blk-mq.

Waiting for I/O requests that are in scheduler queues is not the purpose of
scsi_quiesce_device(). The purpose of that function is to wait for requests
that have already been started. The sdev->device_busy counter represents the
number of started requests so waiting until that counter has reached zero is
sufficient.

Bart.

Re: [PATCH v2 4/4] scsi-mq: Reduce suspend latency

2017-09-21 Thread Ming Lei
On Thu, Sep 21, 2017 at 10:43:26PM +0000, Bart Van Assche wrote:
> On Fri, 2017-09-22 at 06:06 +0800, Ming Lei wrote:
> > On Thu, Sep 21, 2017 at 02:22:55PM -0700, Bart Van Assche wrote:
> > > + } else {
> > >   scsi_run_queue(q);
> > > + while (atomic_read(&sdev->device_busy)) {
> > > + msleep_interruptible(200);
> > > + scsi_run_queue(q);
> > > + }
> > 
> > Are you sure only blk-mq need to drain queue? We need
> > to do that for block legacy too.
> 
> The code above your comment drains the queue for the legacy block layer.

That only drains the requests already dispatched to the SCSI layer;
there might be lots of requests in the block I/O scheduler queue, on
the requeue list, or elsewhere.

-- 
Ming


Re: [PATCH 6/9] nvme: track subsystems

2017-09-21 Thread Keith Busch
On Mon, Sep 18, 2017 at 04:14:50PM -0700, Christoph Hellwig wrote:
> +static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
> +{
> + struct nvme_subsystem *subsys, *found;
> +
> + subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
> + if (!subsys)
> + return -ENOMEM;
> + INIT_LIST_HEAD(&subsys->ctrls);
> + kref_init(&subsys->ref);
> + nvme_init_subnqn(subsys, ctrl, id);
> + mutex_init(&subsys->lock);
> +
> + mutex_lock(&nvme_subsystems_lock);
> + found = __nvme_find_get_subsystem(subsys->subnqn);
> + if (found) {
> + /*
> +  * Verify that the subsystem actually supports multiple
> +  * controllers, else bail out.
> +  */
> + kfree(subsys);
> + if (!(id->cmic & (1 << 1))) {
> + dev_err(ctrl->device,
> + "ignoring ctrl due to duplicate subnqn (%s).\n",
> + found->subnqn);
> + mutex_unlock(&nvme_subsystems_lock);
> + return -EINVAL;
> + }
> +
> + subsys = found;
> + } else {
> + list_add_tail(&subsys->entry, &nvme_subsystems);
> + }
> +
> + ctrl->subsys = subsys;
> + mutex_unlock(&nvme_subsystems_lock);
> +
> + mutex_lock(&subsys->lock);
> + list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
> + mutex_unlock(&subsys->lock);

This function is called every time nvme_init_identify is called, which
happens on every controller reset. The controller reset does not remove
itself from the subsystem list of controllers, so its entry is getting
doubly added after a controller reset.


Re: [PATCH v2 3/4] block, scsi: Make SCSI device suspend and resume work reliably

2017-09-21 Thread Bart Van Assche
On Fri, 2017-09-22 at 06:04 +0800, Ming Lei wrote:
> On Thu, Sep 21, 2017 at 02:22:54PM -0700, Bart Van Assche wrote:
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1429,11 +1429,18 @@ struct request *blk_get_request(struct request_queue *q, unsigned int op,
> > gfp_t gfp_mask)
> >  {
> > struct request *req;
> > +   const bool may_sleep = gfp_mask & __GFP_DIRECT_RECLAIM;
> > +
> > +   if (unlikely(blk_queue_preempt_only(q) && !(op & REQ_PREEMPT))) {
> 
> The flag is set under queue_lock but checked without any lock; do you
> think it is safe this way?
> 
> Also, this flag isn't checked in the normal I/O path, but you unfreeze
> the queue during scsi_device_quiesce(), so normal I/O can arrive from
> that point on.

I will address both comments in the next version of this patch series.

Bart.

Re: [PATCH v2 4/4] scsi-mq: Reduce suspend latency

2017-09-21 Thread Bart Van Assche
On Fri, 2017-09-22 at 06:06 +0800, Ming Lei wrote:
> On Thu, Sep 21, 2017 at 02:22:55PM -0700, Bart Van Assche wrote:
> > +   } else {
> > scsi_run_queue(q);
> > +   while (atomic_read(&sdev->device_busy)) {
> > +   msleep_interruptible(200);
> > +   scsi_run_queue(q);
> > +   }
> 
> Are you sure only blk-mq need to drain queue? We need
> to do that for block legacy too.

The code above your comment drains the queue for the legacy block layer.

Bart.

[PATCH v2 0/4] Make SCSI device suspend and resume work reliably

2017-09-21 Thread Bart Van Assche
Hello Jens,

It is known that during the resume following a hibernate sometimes the
system hangs instead of coming up properly. This patch series fixes this
problem. This patch series is an alternative to Ming Lei's "[PATCH V5
0/10] block/scsi: safe SCSI quiescing" patch series. The advantages of
this patch series are:
- No new freeze states and hence no new freeze state variables.
- Easier to review because no new race conditions are introduced between
  queue freezing and blk_cleanup_queue(). As the discussion that followed
  Ming's patch series shows, the correctness of the new code is hard to
  verify.

These patches have been tested on top of a merge of the block layer
for-next branch and Linus' master tree. Linus' master tree includes
patch "KVM: x86: Fix the NULL pointer parameter in check_cr_write()"
but the block layer for-next branch not yet.

Please consider these changes for kernel v4.15.

Thanks,

Bart.

Changes compared to v1 of this patch series:
- Changed the approach and rewrote the patch series.


Bart Van Assche (4):
  block: Convert RQF_PREEMPT into REQ_PREEMPT
  block: Add the QUEUE_PREEMPT_ONLY request queue flag
  block, scsi: Make SCSI device suspend and resume work reliably
  scsi-mq: Reduce suspend latency

 block/blk-core.c  | 26 +++---
 block/blk-mq-debugfs.c|  2 +-
 drivers/ide/ide-atapi.c   |  3 +--
 drivers/ide/ide-io.c  |  2 +-
 drivers/ide/ide-pm.c  |  4 ++--
 drivers/scsi/scsi_lib.c   | 31 +--
 include/linux/blk_types.h |  6 ++
 include/linux/blkdev.h|  8 +---
 8 files changed, 60 insertions(+), 22 deletions(-)

-- 
2.14.1



[PATCH v2 1/4] block: Convert RQF_PREEMPT into REQ_PREEMPT

2017-09-21 Thread Bart Van Assche
This patch does not change any functionality but makes the
REQ_PREEMPT flag available to blk_get_request(). A later patch
will add code to blk_get_request() that checks the REQ_PREEMPT
flag. Note: the IDE sense_rq request is allocated statically so
there is no blk_get_request() call that corresponds to this
request.

Signed-off-by: Bart Van Assche 
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-mq-debugfs.c|  2 +-
 drivers/ide/ide-atapi.c   |  3 +--
 drivers/ide/ide-io.c  |  2 +-
 drivers/ide/ide-pm.c  |  4 ++--
 drivers/scsi/scsi_lib.c   | 11 ++-
 include/linux/blk_types.h |  6 ++
 include/linux/blkdev.h|  3 ---
 7 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 980e73095643..62ac248a4984 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -266,6 +266,7 @@ static const char *const cmd_flag_name[] = {
CMD_FLAG_NAME(BACKGROUND),
CMD_FLAG_NAME(NOUNMAP),
CMD_FLAG_NAME(NOWAIT),
+   CMD_FLAG_NAME(PREEMPT),
 };
 #undef CMD_FLAG_NAME
 
@@ -279,7 +280,6 @@ static const char *const rqf_name[] = {
RQF_NAME(MIXED_MERGE),
RQF_NAME(MQ_INFLIGHT),
RQF_NAME(DONTPREP),
-   RQF_NAME(PREEMPT),
RQF_NAME(COPY_USER),
RQF_NAME(FAILED),
RQF_NAME(QUIET),
diff --git a/drivers/ide/ide-atapi.c b/drivers/ide/ide-atapi.c
index 14d1e7d9a1d6..1258739d5fa1 100644
--- a/drivers/ide/ide-atapi.c
+++ b/drivers/ide/ide-atapi.c
@@ -211,9 +211,8 @@ void ide_prep_sense(ide_drive_t *drive, struct request *rq)
}
 
sense_rq->rq_disk = rq->rq_disk;
-   sense_rq->cmd_flags = REQ_OP_DRV_IN;
+   sense_rq->cmd_flags = REQ_OP_DRV_IN | REQ_PREEMPT;
ide_req(sense_rq)->type = ATA_PRIV_SENSE;
-   sense_rq->rq_flags |= RQF_PREEMPT;
 
req->cmd[0] = GPCMD_REQUEST_SENSE;
req->cmd[4] = cmd_len;
diff --git a/drivers/ide/ide-io.c b/drivers/ide/ide-io.c
index 3a234701d92c..06ffccd0fb10 100644
--- a/drivers/ide/ide-io.c
+++ b/drivers/ide/ide-io.c
@@ -539,7 +539,7 @@ void do_ide_request(struct request_queue *q)
 */
if ((drive->dev_flags & IDE_DFLAG_BLOCKED) &&
ata_pm_request(rq) == 0 &&
-   (rq->rq_flags & RQF_PREEMPT) == 0) {
+   (rq->cmd_flags & REQ_PREEMPT) == 0) {
/* there should be no pending command at this point */
ide_unlock_port(hwif);
goto plug_device;
diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
index 544f02d673ca..f8d2709dcd56 100644
--- a/drivers/ide/ide-pm.c
+++ b/drivers/ide/ide-pm.c
@@ -89,9 +89,9 @@ int generic_ide_resume(struct device *dev)
}
 
memset(&rqpm, 0, sizeof(rqpm));
-   rq = blk_get_request(drive->queue, REQ_OP_DRV_IN, __GFP_RECLAIM);
+   rq = blk_get_request(drive->queue, REQ_OP_DRV_IN | REQ_PREEMPT,
+__GFP_RECLAIM);
ide_req(rq)->type = ATA_PRIV_PM_RESUME;
-   rq->rq_flags |= RQF_PREEMPT;
rq->special = &rqpm;
rqpm.pm_step = IDE_PM_START_RESUME;
rqpm.pm_state = PM_EVENT_ON;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f6097b89d5d3..6db8247577a0 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -245,8 +245,9 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
int ret = DRIVER_ERROR << 24;
 
req = blk_get_request(sdev->request_queue,
-   data_direction == DMA_TO_DEVICE ?
-   REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
+ (data_direction == DMA_TO_DEVICE ?
+  REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN) | REQ_PREEMPT,
+ __GFP_RECLAIM);
if (IS_ERR(req))
return ret;
rq = scsi_req(req);
@@ -260,7 +261,7 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
rq->retries = retries;
req->timeout = timeout;
req->cmd_flags |= flags;
-   req->rq_flags |= rq_flags | RQF_QUIET | RQF_PREEMPT;
+   req->rq_flags |= rq_flags | RQF_QUIET;
 
/*
 * head injection *required* here otherwise quiesce won't work
@@ -1271,7 +1272,7 @@ scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
/*
 * If the devices is blocked we defer normal commands.
 */
-   if (!(req->rq_flags & RQF_PREEMPT))
+   if (!(req->cmd_flags & REQ_PREEMPT))
ret = BLKPREP_DEFER;
break;
default:
@@ -1280,7 +1281,7 @@ scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
 

[PATCH v2 2/4] block: Add the QUEUE_PREEMPT_ONLY request queue flag

2017-09-21 Thread Bart Van Assche
This flag will be used in the next patch to let the block layer
core know whether or not a SCSI request queue has been quiesced.
A quiesced SCSI queue only processes RQF_PREEMPT requests.

Signed-off-by: Bart Van Assche 
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-core.c   | 13 +
 include/linux/blkdev.h |  5 +
 2 files changed, 18 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index d709c0e3a2ac..1ac337712bbd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -346,6 +346,19 @@ void blk_sync_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
+void blk_set_preempt_only(struct request_queue *q, bool preempt_only)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(q->queue_lock, flags);
+   if (preempt_only)
+   queue_flag_set(QUEUE_FLAG_PREEMPT_ONLY, q);
+   else
+   queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
+   spin_unlock_irqrestore(q->queue_lock, flags);
+}
+EXPORT_SYMBOL(blk_set_preempt_only);
+
 /**
  * __blk_run_queue_uncond - run a queue whether or not it has been stopped
  * @q: The queue to run
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cd26901a6e25..5bd87599eed0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -627,6 +627,7 @@ struct request_queue {
 #define QUEUE_FLAG_REGISTERED  26  /* queue has been registered to a disk */
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED    28  /* queue has been quiesced */
+#define QUEUE_FLAG_PREEMPT_ONLY 29 /* only process REQ_PREEMPT requests */
 
 #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) |\
 (1 << QUEUE_FLAG_STACKABLE)|   \
@@ -731,6 +732,10 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
 REQ_FAILFAST_DRIVER))
 #define blk_queue_quiesced(q)  test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
+#define blk_queue_preempt_only(q)  \
+   test_bit(QUEUE_FLAG_PREEMPT_ONLY, &(q)->queue_flags)
+
+extern void blk_set_preempt_only(struct request_queue *q, bool preempt_only);
 
 static inline bool blk_account_rq(struct request *rq)
 {
-- 
2.14.1



[PATCH v2 4/4] scsi-mq: Reduce suspend latency

2017-09-21 Thread Bart Van Assche
Avoid that waiting for ongoing requests to finish takes up to 200 ms
longer than necessary. Note: blk_mq_freeze_queue() uses a wait queue
to wait for ongoing requests to finish.

Signed-off-by: Bart Van Assche 
Cc: Martin K. Petersen 
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 drivers/scsi/scsi_lib.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e76fd6e89a81..34e5f0f95d01 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2901,10 +2901,15 @@ scsi_device_quiesce(struct scsi_device *sdev)
if (err)
return err;
 
-   scsi_run_queue(q);
-   while (atomic_read(&sdev->device_busy)) {
-   msleep_interruptible(200);
+   if (q->mq_ops) {
+   blk_mq_freeze_queue(q);
+   blk_mq_unfreeze_queue(q);
+   } else {
scsi_run_queue(q);
+   while (atomic_read(&sdev->device_busy)) {
+   msleep_interruptible(200);
+   scsi_run_queue(q);
+   }
}
return 0;
 }
-- 
2.14.1



[PATCH v2 3/4] block, scsi: Make SCSI device suspend and resume work reliably

2017-09-21 Thread Bart Van Assche
It is essential during suspend and resume that neither the filesystem
state nor the filesystem metadata in RAM changes. This is why while
the hibernation image is being written or restored that SCSI devices
are quiesced. The SCSI core quiesces devices through scsi_device_quiesce()
and scsi_device_resume(). In the SDEV_QUIESCE state execution of
non-preempt requests is deferred. This is realized by returning
BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI
devices. Avoid that a full queue prevents power management requests
from being submitted by slowing down allocation of non-preempt requests for
devices in the quiesced state. This patch has been tested by running
the following commands and by verifying that after resume the fio job
is still running:

for d in /sys/class/block/sd*[a-z]; do
  hcil=$(readlink "$d/device")
  hcil=${hcil#../../../}
  echo 4 > "$d"
  echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
done
bdev=$(readlink /dev/disk/by-uuid/5217d83f-213e-4b42-b86e-20013325ba6c)
bdev=${bdev#../../}
hcil=$(readlink "/sys/block/$bdev/device")
hcil=${hcil#../../../}
fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 --rw=randread \
  --ioengine=psync --numjobs=4 --iodepth=16 --iodepth_batch=1 --thread \
  --loops=$((2**31)) &
pid=$!
sleep 1
systemctl hibernate
sleep 10
kill $pid

Reported-by: Oleksandr Natalenko 
References: "I/O hangs after resuming from suspend-to-ram"
(https://marc.info/?l=linux-block&m=150340235201348).
Signed-off-by: Bart Van Assche 
Cc: Martin K. Petersen 
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-core.c| 13 ++---
 drivers/scsi/scsi_lib.c | 11 ---
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1ac337712bbd..6a190dd998aa 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1429,11 +1429,18 @@ struct request *blk_get_request(struct request_queue *q, unsigned int op,
gfp_t gfp_mask)
 {
struct request *req;
+   const bool may_sleep = gfp_mask & __GFP_DIRECT_RECLAIM;
+
+   if (unlikely(blk_queue_preempt_only(q) && !(op & REQ_PREEMPT))) {
+   if (may_sleep)
+   msleep(100);
+   else
+   return ERR_PTR(-EBUSY);
+   }
 
if (q->mq_ops) {
-   req = blk_mq_alloc_request(q, op,
-   (gfp_mask & __GFP_DIRECT_RECLAIM) ?
-   0 : BLK_MQ_REQ_NOWAIT);
+   req = blk_mq_alloc_request(q, op, may_sleep ?
+  0 : BLK_MQ_REQ_NOWAIT);
if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
q->mq_ops->initialize_rq_fn(req);
} else {
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6db8247577a0..e76fd6e89a81 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2889,19 +2889,22 @@ static void scsi_wait_for_queuecommand(struct scsi_device *sdev)
 int
 scsi_device_quiesce(struct scsi_device *sdev)
 {
+   struct request_queue *q = sdev->request_queue;
int err;
 
mutex_lock(&sdev->state_mutex);
err = scsi_device_set_state(sdev, SDEV_QUIESCE);
+   if (err == 0)
+   blk_set_preempt_only(q, true);
mutex_unlock(&sdev->state_mutex);
 
if (err)
return err;
 
-   scsi_run_queue(sdev->request_queue);
+   scsi_run_queue(q);
while (atomic_read(&sdev->device_busy)) {
msleep_interruptible(200);
-   scsi_run_queue(sdev->request_queue);
+   scsi_run_queue(q);
}
return 0;
 }
@@ -2924,8 +2927,10 @@ void scsi_device_resume(struct scsi_device *sdev)
 */
mutex_lock(&sdev->state_mutex);
if (sdev->sdev_state == SDEV_QUIESCE &&
-   scsi_device_set_state(sdev, SDEV_RUNNING) == 0)
+   scsi_device_set_state(sdev, SDEV_RUNNING) == 0) {
+   blk_set_preempt_only(sdev->request_queue, false);
scsi_run_queue(sdev->request_queue);
+   }
mutex_unlock(&sdev->state_mutex);
 }
 EXPORT_SYMBOL(scsi_device_resume);
-- 
2.14.1



Re: [PATCH 9/9] nvme: implement multipath access to nvme subsystems

2017-09-21 Thread Keith Busch
On Thu, Sep 21, 2017 at 01:52:50AM +0200, Christoph Hellwig wrote:
> I noticed the odd renaming in sysfs and thought about getting rid
> of the /dev/nvme/ directory.  I just need to come up with a good
> name for the device nodes - the name can't contain /dev/nvme* as
> nvme-cli would break if it sees files that start with nvme in /dev/.

Ah, didn't know that. I'll fix nvme-cli to handle this.

We can still use a different naming convention for the multipath handles.

> /dev/nvmN ?

Looks safe.

BTW, have you considered persistent naming rules to symlink these from
/dev/disk/by-id/? We may need to add an attribute to the multipath
object to assist with that.


[PATCH] bcache: fix a comments typo in bch_alloc_sectors()

2017-09-21 Thread Coly Li
The code comment in alloc.c:bch_alloc_sectors() mentions a function
named find_data_bucket(); the correct function name is
pick_data_bucket(). bch_alloc_sectors() is quite an important
function in the bcache allocation code, so fixing the typo may save
other people some confusion.

Signed-off-by: Coly Li 
---
 drivers/md/bcache/alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index cacbe2dbd5c3..071ff28be912 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -600,7 +600,7 @@ bool bch_alloc_sectors(struct cache_set *c, struct bkey *k, unsigned sectors,
 
/*
 * If we had to allocate, we might race and not need to allocate the
-* second time we call find_data_bucket(). If we allocated a bucket but
+* second time we call pick_data_bucket(). If we allocated a bucket but
 * didn't use it, drop the refcount bch_bucket_alloc_set() took:
 */
if (KEY_PTRS(&alloc.key))
-- 
2.13.5



[PATCHv3] bcache: only permit to recovery read error when cache device is clean

2017-09-21 Thread Coly Li
When bcache does read I/Os, for example in writeback or writethrough mode,
if a read request on the cache device fails, bcache will try to recover
the request by reading from the cached device. If the data on the cached
device is not synced with the cache device, the requester will get stale
data.

For a critical storage system like a database, providing stale data from
recovery may result in application-level data corruption, which is
unacceptable.

With this patch, for a failed read request in writeback or writethrough
mode, recovery of a recoverable read request only happens when the cache
device is clean. That is to say, all data on the cached device is up to
date.

For other cache modes in bcache, read requests will never hit
cached_dev_read_error(), so they don't need this patch.

Please note, because the cache mode can be switched arbitrarily at run
time, a writethrough mode might have been switched from a writeback mode.
Therefore checking dc->has_dirty in writethrough mode still makes sense.

Changelog:
v3: Kent Overstreet responded that he thinks recovering stale data is a
bug to fix, and an option to permit it is unnecessary. So in this
version the sysfs file is removed.
v2: rename sysfs entry from allow_stale_data_on_failure  to
allow_stale_data_on_failure, and fix the confusing commit log.
v1: initial patch posted.

Signed-off-by: Coly Li 
Reported-by: Arne Wolf 
Cc: Kent Overstreet 
Cc: Michael Lyle 
Cc: Nix 
Cc: Kai Krakow 
Cc: Eric Wheeler 
Cc: Junhui Tang 
Cc: sta...@vger.kernel.org
---
 drivers/md/bcache/request.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 681b4f12b05a..e7f769ff7234 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -697,8 +697,16 @@ static void cached_dev_read_error(struct closure *cl)
 {
struct search *s = container_of(cl, struct search, cl);
struct bio *bio = &s->bio.bio;
+   struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
 
-   if (s->recoverable) {
+   /*
+* If cache device is dirty (dc->has_dirty is non-zero), then
+* recovery a failed read request from cached device may get a
+* stale data back. So read failure recovery is only permitted
+* when cache device is clean.
+*/
+   if (s->recoverable &&
+   (dc && !atomic_read(&dc->has_dirty))) {
/* Retry from the backing device: */
trace_bcache_read_retry(s->orig_bio);
 
-- 
2.13.5



Re: [PATCH] block: fix a crash caused by wrong API

2017-09-21 Thread Jens Axboe
On 09/21/2017 01:17 PM, Shaohua Li wrote:
> part_stat_show takes a part device not a disk, so we should use
> part_to_disk.

Oops, thanks Shaohua! Applied.

-- 
Jens Axboe



[PATCH] block: fix a crash caused by wrong API

2017-09-21 Thread Shaohua Li
part_stat_show takes a part device not a disk, so we should use
part_to_disk.

Fixes: d62e26b3ffd2 ("block: pass in queue to inflight accounting")
Cc: Bart Van Assche 
Cc: Omar Sandoval 
Signed-off-by: Shaohua Li 
---
 block/partition-generic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/partition-generic.c b/block/partition-generic.c
index 86e8fe1..88c555d 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -112,7 +112,7 @@ ssize_t part_stat_show(struct device *dev,
   struct device_attribute *attr, char *buf)
 {
struct hd_struct *p = dev_to_part(dev);
-   struct request_queue *q = dev_to_disk(dev)->queue;
+   struct request_queue *q = part_to_disk(p)->queue;
unsigned int inflight[2];
int cpu;
 
-- 
2.9.5



Re: Is the possible cross-talking between unrelated file-descriptors on bsg-device by design?

2017-09-21 Thread Benjamin Block
On Tue, Sep 19, 2017 at 02:16:26PM -0400, Douglas Gilbert wrote:
> On 2017-09-19 10:56 AM, Benjamin Block wrote:
> > Hello linux-block,
> > 
> > I wrote some tests recently to test patches against bsg.c and bsg-lib.c,
> > and while writing those I noticed something strange:
> > 
> > When you use the write() and read() call on multiple file-descriptors
> > for a single bsg-device (FC or SCSI), it is possible that you get
> > cross-talk between those different file-descriptors.
> > 
> > E.g.: You have two independent processes open a FD on /dev/bsg/fc_host0
> > and you send commands via write() in both processes, when they both do
> > read() later on - to read the response for the commands they send before
> > -, it is possible that process A gets the response (Sense-Data,
> > FC-Protocol-Data, ...) for a command that process B wrote, and vice versa.
> > 
> > I noticed this because in my tests I 'tag' each command I send with
> > write() via a value I write into the usr_ptr field of sg_io_v4. When I
> > later use read() to receive responses I check for this tag in a
> > hash-table and so can look up the original command. When I used this and
> > spawned multiple processes with the same target bsg-device, I got
> > responses for commands I don't have tags for in my hash-table, so they
> > were written by another process. This never happened when I only
> > spawned one test-process.
> > 
> > This seems awfully counter-productive... so much so that I don't see any
> > point in allowing users to open more than one FD per bsg-device, because
> > that would inevitably lead to very hard to debug 'bugs'. And I wonder if
> > this is really by design or some regression that happened over time.
> > 
> > I looked at the code in bsg.c and yes, it seems this is what is coded
> > right now. We have a hash-table which manages all 'struct bsg_device's
> > who are indexed by device-minors, so one hash-table entry per
> > device-node.
> > 
> > So eh, long talk short question, is that intended?
> 
> Hi,
> About once a year I point out that major misfeature in the bsg driver
> and no-one seems to care. Most recently I pointed it out in a
> discussion about SCSI SG CVE-2017-0794 6 days ago:
>   " ... Last time I looked at the bsg driver, a SCSI command
> could be issued on one file descriptor and its data-in block
> and SCSI status delivered to another file descriptor to the
> same block device (potentially in another process). IOW chaos"
> 
> It is hard to imagine any sane application relying on this bizarre
> behaviour so fixing it should not break any applications. My guess
> is that it would require a non-trivial patch set to fix. Would you
> like to volunteer?
> 

Interesting. So this is not a regression then.

Well, personally I am intrigued to try to fix this, but professionally
it is not really up to me to choose to work on this. Especially because,
as you say, this looks non-trivial - at least if you want to let
multiple applications open a FD for a single BSG device node. Let's see.

But thank you for elaborating on this!
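
For reference, a minimal sketch of the usr_ptr tagging scheme described
above (illustrative only; cookie bookkeeping and error handling are
left out):

#include <linux/bsg.h>
#include <stdint.h>
#include <unistd.h>

/* Tag a command with a caller-chosen cookie, then match the response.
 * With the current bsg behaviour, the response read back on this fd
 * may carry a usr_ptr that was written through a different fd. */
static int send_tagged(int fd, struct sg_io_v4 *io, void *cookie)
{
	io->guard = 'Q';
	io->protocol = BSG_PROTOCOL_SCSI;
	io->subprotocol = BSG_SUB_PROTOCOL_SCSI_CMD;
	io->usr_ptr = (uint64_t)(uintptr_t)cookie;	/* the tag */
	return write(fd, io, sizeof(*io)) == sizeof(*io) ? 0 : -1;
}

static void *recv_tagged(int fd, struct sg_io_v4 *io)
{
	if (read(fd, io, sizeof(*io)) != sizeof(*io))
		return NULL;
	return (void *)(uintptr_t)io->usr_ptr;	/* may be another FD's tag */
}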


Beste Grüße / Best regards,
  - Benjamin Block
-- 
Linux on z Systems Development / IBM Systems & Technology Group
  IBM Deutschland Research & Development GmbH 
Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294



[PATCH 1/1] bsg-lib: fix use-after-free under memory-pressure

2017-09-21 Thread Benjamin Block
When under memory pressure it is possible that the mempool which backs
the 'struct request_queue' will make use of up to BLKDEV_MIN_RQ
emergency buffers - in case it can't get a regular allocation. These
buffers are preallocated, and once they are also used up, they are
re-supplied with old finished requests from the same request_queue (see
mempool_free()).

The bug is, when re-supplying the emergency pool, the old requests are
not run through the callback mempool_t->alloc() again, and thus also
not through the callback bsg_init_rq(). Thus we skip initialization,
and while the sense-buffer should still be good, scsi_request->cmd
might have become an invalid pointer in the meantime. When the request
is initialized in bsg.c and the user's CDB is larger than BLK_MAX_CDB,
bsg will replace it with a custom allocated buffer, which is freed when
the user's command is finished; thus it dangles afterwards. When the
user next sends a command whose CDB is no larger than BLK_MAX_CDB, bsg
will assume that scsi_request->cmd is backed by scsi_request->__cmd,
will not make a custom allocation, and will write into undefined
memory.

Fix this by splitting bsg_init_rq() into two functions:
 - bsg_init_job() directly replace bsg_init_rq() and only does the
   allocation of the sense-buffer, which is used to back the bsg job's
   reply buffer. This pointer should never change during the lifetime of
   a scsi_request, so it doesn't need re-initialization.
 - bsg_init_rq() is a new function that make use of
   'struct request_queue's initialize_rq_fn callback (which was
   introduced in v4.12). This is always called before the request is
   given out via blk_get_request(). This function does the remaining
   initialization that was previously done in bsg_init_rq(), and will
   also do it when the request is taken from the emergency-pool of the
   backing mempool.

Also rename bsg_exit_rq() into bsg_exit_job(), to make it fit the
name-scheme.

Fixes: 50b4d485528d ("bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer")
Cc:  # 4.11+
Signed-off-by: Benjamin Block 
---

Notes:
I did test this on zFCP with FC CT commands sent via the ioctl() and
write() system-call. That did work fine. But I would very much
appreciate it if anyone could run this against another HBA or even
another implementer of bsg-lib, such as SAS, because I have no access
to such hardware here.

This should make no difference to the normal cases - where each request
is allocated via slab - with or without this patch, if I didn't miss
anything. Only the order is a bit mixed up - the memset is done after
the sense-allocation, so I have to buffer the sense-pointer for that.
But otherwise there is no difference I am aware of, so it should behave
the same (does for me).

I could not reproduce the memory-pressure case here in the lab. I
don't see any reason why it should not work, but I am open to
suggestions :)
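
A toy userspace model of the recycling behaviour described in the
commit message - loosely modeled on mempool semantics, not actual
kernel code - shows why state set up once in the alloc callback can
go stale:

#include <stdio.h>
#include <stdlib.h>

#define POOL_MIN 2			/* stands in for BLKDEV_MIN_RQ */

static void *reserve[POOL_MIN];
static int reserved;

/* Stand-in for mempool_alloc(): the init work (the printf) only runs
 * for fresh allocations, never for recycled reserve elements. */
static void *pool_alloc(void)
{
	if (reserved)
		return reserve[--reserved];	/* recycled, init skipped */
	void *p = malloc(64);
	printf("fresh element %p initialized\n", p);
	return p;
}

/* Stand-in for mempool_free(): finished elements refill the reserve
 * first, carrying whatever state their last user left behind. */
static void pool_free(void *p)
{
	if (reserved < POOL_MIN)
		reserve[reserved++] = p;
	else
		free(p);
}

int main(void)
{
	void *a = pool_alloc();
	pool_free(a);		/* goes back into the reserve */
	void *b = pool_alloc();	/* same element, no re-initialization */
	printf("recycled element %p handed out without init\n", b);
	return 0;
}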

Beste Grüße / Best regards,
  - Benjamin Block

 block/bsg-lib.c | 41 -
 1 file changed, 28 insertions(+), 13 deletions(-)

diff --git a/block/bsg-lib.c b/block/bsg-lib.c
index c82408c7cc3c..634d1557da38 100644
--- a/block/bsg-lib.c
+++ b/block/bsg-lib.c
@@ -203,28 +203,42 @@ static void bsg_request_fn(struct request_queue *q)
spin_lock_irq(q->queue_lock);
 }
 
-static int bsg_init_rq(struct request_queue *q, struct request *req, gfp_t gfp)
+static int bsg_init_job(struct request_queue *q, struct request *req, gfp_t gfp)
 {
struct bsg_job *job = blk_mq_rq_to_pdu(req);
struct scsi_request *sreq = &job->sreq;
 
-   memset(job, 0, sizeof(*job));
+   /* called right after the request is allocated for the request_queue */
 
-   scsi_req_init(sreq);
-   sreq->sense_len = SCSI_SENSE_BUFFERSIZE;
-   sreq->sense = kzalloc(sreq->sense_len, gfp);
+   sreq->sense = kzalloc(SCSI_SENSE_BUFFERSIZE, gfp);
if (!sreq->sense)
return -ENOMEM;
 
-   job->req = req;
-   job->reply = sreq->sense;
-   job->reply_len = sreq->sense_len;
-   job->dd_data = job + 1;
-
return 0;
 }
 
-static void bsg_exit_rq(struct request_queue *q, struct request *req)
+static void bsg_init_rq(struct request *req)
+{
+   struct bsg_job *job = blk_mq_rq_to_pdu(req);
+   struct scsi_request *sreq = &job->sreq;
+   void *sense = sreq->sense;
+
+   /* called right before the request is given to the request_queue user */
+
+   memset(job, 0, sizeof(*job));
+
+   scsi_req_init(sreq);
+
+   sreq->sense = sense;
+   sreq->sense_len = SCSI_SENSE_BUFFERSIZE;
+
+   job->req = req;
+   job->reply = sense;
+   job->reply_len = sreq->sense_len;
+   job->dd_data = job + 1;
+}
+
+static void 

Re: [PATCH v2 1/2] blk-mq: fix nr_requests wrong value when modify it from sysfs

2017-09-21 Thread weiping zhang
On Thu, Sep 21, 2017 at 10:37:48AM -0600, Jens Axboe wrote:
> On 09/21/2017 09:17 AM, weiping zhang wrote:
> > If blk-mq uses the "none" io scheduler, nr_requests gets a wrong value
> > when the input number is > tag_set->queue_depth. blk_mq_tag_update_depth
> > will pick the smaller one, min(nr, set->queue_depth), and then
> > q->nr_requests gets a wrong value.
> > 
> > Reproduce:
> > 
> > echo none > /sys/block/nvme0n1/queue/ioscheduler
> > echo 100 > /sys/block/nvme0n1/queue/nr_requests
> > cat /sys/block/nvme0n1/queue/nr_requests
> > 100
> > 
> > Signed-off-by: weiping zhang 
> > ---
> >  block/blk-mq.c | 8 ++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 98a1860..479c35a 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2642,8 +2642,12 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
> >  * queue depth. This is similar to what the old code would do.
> >  */
> > if (!hctx->sched_tags) {
> > -   ret = blk_mq_tag_update_depth(hctx, &hctx->tags,
> > -   min(nr, set->queue_depth),
> > +   if (nr > set->queue_depth) {
> > +   ret = -EINVAL;
> > +   break;
> > +   }
> > +
> > +   ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
> > false);
> 
> What am I missing here? blk_mq_tag_update_depth() should already return
> -EINVAL for the case where we can't grow the tags. Looks like this patch
> should simply remove the min(nr, set->queue_depth) and just pass in 'nr'.
> Should not need the duplicated check for depth.
> 
Ya, you are right, I will send V3 later.

Thanks


[PATCH v3] blk-mq: fix nr_requests wrong value when modify it from sysfs

2017-09-21 Thread weiping zhang
If blk-mq uses the "none" io scheduler, nr_requests gets a wrong value
when the input number is > tag_set->queue_depth. blk_mq_tag_update_depth
will pick the smaller one, min(nr, set->queue_depth), and then
q->nr_requests gets a wrong value.

Reproduce:

echo none > /sys/block/nvme0n1/queue/ioscheduler
echo 100 > /sys/block/nvme0n1/queue/nr_requests
cat /sys/block/nvme0n1/queue/nr_requests
100

Signed-off-by: weiping zhang 
---
 block/blk-mq.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 98a1860..491e336 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2642,8 +2642,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 * queue depth. This is similar to what the old code would do.
 */
if (!hctx->sched_tags) {
-   ret = blk_mq_tag_update_depth(hctx, &hctx->tags,
-   min(nr, set->queue_depth),
+   ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
false);
} else {
ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
-- 
2.9.4



[PATCH 2/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()

2017-09-21 Thread Ilya Dryomov
sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to
permit trying WRITE SAME on older SCSI devices, unless ->no_write_same
is set.  Because REQ_OP_WRITE_ZEROES is implemented in terms of WRITE
SAME, blkdev_issue_zeroout() may fail with -EREMOTEIO:

  $ fallocate -zn -l 1k /dev/sdg
  fallocate: fallocate failed: Remote I/O error
  $ fallocate -zn -l 1k /dev/sdg  # OK
  $ fallocate -zn -l 1k /dev/sdg  # OK

The following calls succeed because sd_done() sets ->no_write_same in
response to a sense that would become BLK_STS_TARGET/-EREMOTEIO, causing
__blkdev_issue_zeroout() to fall back to generating ZERO_PAGE bios.

This means blkdev_issue_zeroout() must cope with WRITE ZEROES failing
and fall back to manually zeroing, unless BLKDEV_ZERO_NOFALLBACK is
specified.  For the BLKDEV_ZERO_NOFALLBACK case, return -EOPNOTSUPP if
sd_done() has just set ->no_write_same thus indicating lack of offload
support.

Fixes: c20cfc27a473 ("block: stop using blkdev_issue_write_same for zeroing")
Cc: Christoph Hellwig 
Cc: "Martin K. Petersen" 
Cc: Hannes Reinecke 
Signed-off-by: Ilya Dryomov 
---
 block/blk-lib.c | 27 +--
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 6b97feb71065..1cb402beb983 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -316,12 +316,6 @@ static void __blkdev_issue_zero_pages(struct block_device *bdev,
  *  Zero-fill a block range, either using hardware offload or by explicitly
  *  writing zeroes to the device.
  *
- *  Note that this function may fail with -EOPNOTSUPP if the driver signals
- *  zeroing offload support, but the device fails to process the command (for
- *  some devices there is no non-destructive way to verify whether this
- *  operation is actually supported).  In this case the caller should call
- *  retry the call to blkdev_issue_zeroout() and the fallback path will be used.
- *
  *  If a device is using logical block provisioning, the underlying space will
  *  not be released if %flags contains BLKDEV_ZERO_NOUNMAP.
  *
@@ -374,6 +368,27 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
&bio, flags);
if (ret == 0 && bio) {
ret = submit_bio_wait(bio);
+   /*
+* Fall back to a manual zeroout on any error, if allowed.
+*
+* Particularly, WRITE ZEROES may fail with -EREMOTEIO if the
+* driver signals zeroing offload support, but the device
+* fails to process the command (for some devices there is no
+* non-destructive way to verify whether this operation is
+* actually supported).
+*/
+   if (ret && bio_op(bio) == REQ_OP_WRITE_ZEROES) {
+   if (flags & BLKDEV_ZERO_NOFALLBACK) {
+   if (!bdev_write_zeroes_sectors(bdev))
+   ret = -EOPNOTSUPP;
+   } else {
+   bio_put(bio);
+   bio = NULL;
+   __blkdev_issue_zero_pages(bdev, sector,
+   nr_sects, gfp_mask, &bio);
+   ret = submit_bio_wait(bio);
+   }
+   }
bio_put(bio);
}
blk_finish_plug();
-- 
2.4.3



[PATCH 0/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()

2017-09-21 Thread Ilya Dryomov
Hi Christoph, Martin,

blkdev_issue_zeroout() now checks for any error.  This required a minor
refactor, so I dropped the stable tag; Jens can add it back if needed.

Previous patch and discussion at

  https://marc.info/?l=linux-block&m=150471953327942&w=2

Thanks,

Ilya


Ilya Dryomov (2):
  block: factor out __blkdev_issue_zero_pages()
  block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()

 block/blk-lib.c | 85 +++--
 1 file changed, 53 insertions(+), 32 deletions(-)

-- 
2.4.3



[PATCH 1/2] block: factor out __blkdev_issue_zero_pages()

2017-09-21 Thread Ilya Dryomov
blkdev_issue_zeroout() will use this in !BLKDEV_ZERO_NOFALLBACK case.

Signed-off-by: Ilya Dryomov 
---
 block/blk-lib.c | 58 +++--
 1 file changed, 32 insertions(+), 26 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 62240f8832ca..6b97feb71065 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -274,6 +274,35 @@ static unsigned int __blkdev_sectors_to_bio_pages(sector_t nr_sects)
return min(pages, (sector_t)BIO_MAX_PAGES);
 }
 
+static void __blkdev_issue_zero_pages(struct block_device *bdev,
+   sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
+   struct bio **biop)
+{
+   struct bio *bio = *biop;
+   int bi_size = 0;
+   unsigned int sz;
+
+   while (nr_sects != 0) {
+   bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects),
+  gfp_mask);
+   bio->bi_iter.bi_sector = sector;
+   bio_set_dev(bio, bdev);
+   bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+
+   while (nr_sects != 0) {
+   sz = min((sector_t) PAGE_SIZE, nr_sects << 9);
+   bi_size = bio_add_page(bio, ZERO_PAGE(0), sz, 0);
+   nr_sects -= bi_size >> 9;
+   sector += bi_size >> 9;
+   if (bi_size < sz)
+   break;
+   }
+   cond_resched();
+   }
+
+   *biop = bio;
+}
+
 /**
  * __blkdev_issue_zeroout - generate number of zero filed write bios
  * @bdev:  blockdev to issue
@@ -304,9 +333,6 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
unsigned flags)
 {
int ret;
-   int bi_size = 0;
-   struct bio *bio = *biop;
-   unsigned int sz;
sector_t bs_mask;
 
bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
@@ -316,30 +342,10 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask,
biop, flags);
if (ret != -EOPNOTSUPP || (flags & BLKDEV_ZERO_NOFALLBACK))
-   goto out;
-
-   ret = 0;
-   while (nr_sects != 0) {
-   bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects),
-  gfp_mask);
-   bio->bi_iter.bi_sector = sector;
-   bio_set_dev(bio, bdev);
-   bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
-
-   while (nr_sects != 0) {
-   sz = min((sector_t) PAGE_SIZE, nr_sects << 9);
-   bi_size = bio_add_page(bio, ZERO_PAGE(0), sz, 0);
-   nr_sects -= bi_size >> 9;
-   sector += bi_size >> 9;
-   if (bi_size < sz)
-   break;
-   }
-   cond_resched();
-   }
+   return ret;
 
-   *biop = bio;
-out:
-   return ret;
+   __blkdev_issue_zero_pages(bdev, sector, nr_sects, gfp_mask, biop);
+   return 0;
 }
 EXPORT_SYMBOL(__blkdev_issue_zeroout);
 
-- 
2.4.3



Re: [PATCH v2 2/2] blk-sysfs: return EINVAL when user modify nr_request less than BLKDEV_MIN_RQ

2017-09-21 Thread weiping zhang
On Thu, Sep 21, 2017 at 10:38:22AM -0600, Jens Axboe wrote:
> On 09/21/2017 09:17 AM, weiping zhang wrote:
> > Signed-off-by: weiping zhang 
> > ---
> >  block/blk-sysfs.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index b8362c0..03a6e19 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -75,7 +75,7 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
> > return ret;
> >  
> > if (nr < BLKDEV_MIN_RQ)
> > -   nr = BLKDEV_MIN_RQ;
> > +   return -EINVAL;
> 
> This is potentially breaking existing scripts.
> 
I just want this to keep the same behavior as the too-large case. If
this may cause other side effects, I'll drop this patch.

Thanks
weiping


Re: [PATCH v2 2/2] blk-sysfs: return EINVAL when user modify nr_request less than BLKDEV_MIN_RQ

2017-09-21 Thread Jens Axboe
On 09/21/2017 09:17 AM, weiping zhang wrote:
> Signed-off-by: weiping zhang 
> ---
>  block/blk-sysfs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index b8362c0..03a6e19 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -75,7 +75,7 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
>   return ret;
>  
>   if (nr < BLKDEV_MIN_RQ)
> - nr = BLKDEV_MIN_RQ;
> + return -EINVAL;

This is potentially breaking existing scripts.

-- 
Jens Axboe



Re: [PATCH v2 1/2] blk-mq: fix nr_requests wrong value when modify it from sysfs

2017-09-21 Thread Jens Axboe
On 09/21/2017 09:17 AM, weiping zhang wrote:
> If blk-mq uses the "none" io scheduler, nr_requests gets a wrong value
> when the input number is > tag_set->queue_depth. blk_mq_tag_update_depth
> will pick the smaller one, min(nr, set->queue_depth), and then
> q->nr_requests gets a wrong value.
> 
> Reproduce:
> 
> echo none > /sys/block/nvme0n1/queue/ioscheduler
> echo 100 > /sys/block/nvme0n1/queue/nr_requests
> cat /sys/block/nvme0n1/queue/nr_requests
> 100
> 
> Signed-off-by: weiping zhang 
> ---
>  block/blk-mq.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 98a1860..479c35a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2642,8 +2642,12 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>   * queue depth. This is similar to what the old code would do.
>   */
>   if (!hctx->sched_tags) {
> - ret = blk_mq_tag_update_depth(hctx, &hctx->tags,
> - min(nr, set->queue_depth),
> + if (nr > set->queue_depth) {
> + ret = -EINVAL;
> + break;
> + }
> +
> + ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
>   false);

What am I missing here? blk_mq_tag_update_depth() should already return
-EINVAL for the case where we can't grow the tags. Looks like this patch
should simply remove the min(nr, set->queue_depth) and just pass in 'nr'.
Should not need the duplicated check for depth.

-- 
Jens Axboe



[PATCH v2 2/2] blk-sysfs: return EINVAL when user modify nr_request less than BLKDEV_MIN_RQ

2017-09-21 Thread weiping zhang
Signed-off-by: weiping zhang 
---
 block/blk-sysfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index b8362c0..03a6e19 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -75,7 +75,7 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
return ret;
 
if (nr < BLKDEV_MIN_RQ)
-   nr = BLKDEV_MIN_RQ;
+   return -EINVAL;
 
if (q->request_fn)
err = blk_update_nr_requests(q, nr);
-- 
2.9.4



[PATCH v2 1/2] blk-mq: fix nr_requests wrong value when modify it from sysfs

2017-09-21 Thread weiping zhang
If blk-mq uses the "none" io scheduler, nr_requests gets a wrong value
when the input number is > tag_set->queue_depth. blk_mq_tag_update_depth
will pick the smaller one, min(nr, set->queue_depth), and then
q->nr_requests gets a wrong value.

Reproduce:

echo none > /sys/block/nvme0n1/queue/ioscheduler
echo 100 > /sys/block/nvme0n1/queue/nr_requests
cat /sys/block/nvme0n1/queue/nr_requests
100

Signed-off-by: weiping zhang 
---
 block/blk-mq.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 98a1860..479c35a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2642,8 +2642,12 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 * queue depth. This is similar to what the old code would do.
 */
if (!hctx->sched_tags) {
-   ret = blk_mq_tag_update_depth(hctx, &hctx->tags,
-   min(nr, set->queue_depth),
+   if (nr > set->queue_depth) {
+   ret = -EINVAL;
+   break;
+   }
+
+   ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
false);
} else {
ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
-- 
2.9.4



[PATCH v2 0/2] fix wrong value when user modify nr_request by sysfs

2017-09-21 Thread weiping zhang
Hi Jens,

This is v2 of the nr_requests fixes.

v1 -> v2:

blk-mq: fix nr_requests wrong value when modify it from sysfs
this patch returns -EINVAL when the user writes a value that's too large.

blk-sysfs: return EINVAL when user modify nr_request less than
  BLKDEV_MIN_RQ
In order to keep the same behavior as the former patch, also return EINVAL
when the user writes a value less than BLKDEV_MIN_RQ

weiping zhang (2):
  blk-mq: fix nr_requests wrong value when modify it from sysfs
  blk-sysfs: return EINVAL when user modify nr_request less than
BLKDEV_MIN_RQ

 block/blk-mq.c| 8 ++--
 block/blk-sysfs.c | 2 +-
 2 files changed, 7 insertions(+), 3 deletions(-)

-- 
2.9.4



Re: [PATCH] blk-mq: fix nr_requests wrong value when modify it from sysfs

2017-09-21 Thread weiping zhang
On Thu, Sep 21, 2017 at 08:09:47AM -0600, Jens Axboe wrote:
> On 09/21/2017 07:03 AM, weiping zhang wrote:
> > On Tue, Sep 12, 2017 at 09:57:32PM +0800, weiping zhang wrote:
> >> On Wed, Sep 06, 2017 at 01:00:44PM +, Bart Van Assche wrote:
> >>> On Wed, 2017-09-06 at 15:34 +0800, weiping zhang wrote:
>  On Tue, Sep 05, 2017 at 03:42:45PM +, Bart Van Assche wrote:
> > On Sun, 2017-09-03 at 21:46 +0800, weiping zhang wrote:
> >> if blk-mq uses the "none" io scheduler, nr_requests gets a wrong value when
> >> a number > tag_set->queue_depth is input. blk_mq_tag_update_depth will take
> >> the smaller one, min(nr, set->queue_depth), and then q->nr_requests ends up
> >> with a wrong value.
> >>
> >> Reproduce:
> >>
> >> echo none > /sys/block/nvme0n1/queue/ioscheduler
> >> echo 100 > /sys/block/nvme0n1/queue/nr_requests
> >> cat /sys/block/nvme0n1/queue/nr_requests
> >> 100
> >>
> >> Signed-off-by: weiping zhang 
> >> ---
> >>  block/blk-mq.c | 7 +--
> >>  1 file changed, 5 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >> index f84d145..8303e5e 100644
> >> --- a/block/blk-mq.c
> >> +++ b/block/blk-mq.c
> >> @@ -2622,8 +2622,11 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
> >>  		 * queue depth. This is similar to what the old code would do.
> >>  		 */
> >>  		if (!hctx->sched_tags) {
> >> -			ret = blk_mq_tag_update_depth(hctx, &hctx->tags,
> >> -					min(nr, set->queue_depth),
> >> +			if (nr > set->queue_depth) {
> >> +				nr = set->queue_depth;
> >> +				pr_warn("reduce nr_request to %u\n", nr);
> >> +			}
> >> +			ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
> >>  						false);
> >>  		} else {
> >>  			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
> >
> > Shouldn't this code return -EINVAL or -ERANGE if 'nr' is too large? 
> > That will help to
> > keep user space code simple that updates the queue depth.
> 
>  Hi Bart,
> 
>  The reason for not returning -EINVAL is to keep in line with the minimum
>  checking in queue_requests_store: if you insist on -EINVAL/-ERANGE, the
>  minimum checking should keep the same behavior. Both returning an error
>  and quietly clamping are okay with me. Which way do you prefer?
> 
>  static ssize_t
>  queue_requests_store(struct request_queue *q, const char *page, size_t 
>  count)
>  {
>   unsigned long nr;
>   int ret, err;
> 
>   if (!q->request_fn && !q->mq_ops)
>   return -EINVAL;
> 
>   ret = queue_var_store(&nr, page, count);
>   if (ret < 0)
>   return ret;
> 
>   if (nr < BLKDEV_MIN_RQ)
>   nr = BLKDEV_MIN_RQ;
> >>>
> >>> Hello Jens,
> >>>
> >>> Do you perhaps have a preference for one of the approaches that have been 
> >>> discussed
> >>> in this e-mail thread?
> >>>
> >>> Thanks,
> >>>
> >>> Bart.
> >>
> > Hello Jens,
> > 
> > Would you please give some comments about this patch,
> 
> If someone writes a value that's too large, return -EINVAL and
> don't set it. Don't add weird debug printks.
> 
> 
OK, I'll send patch V2 soon.


Re: [PATCH 8/9] nvme: track shared namespaces

2017-09-21 Thread Keith Busch
On Thu, Sep 21, 2017 at 04:37:48PM +0200, Christoph Hellwig wrote:
> On Thu, Sep 21, 2017 at 07:22:17AM +0200, Johannes Thumshirn wrote:
> > > But head also has connotations in the SAN world.  Maybe nvme_ns_chain?
> > 
> > I know that's why I didn't really like it all too much in the first place as
> > well. For nvme_ns_chain, it's not a chain really (the list itself is a 
> > chain,
> > the structure really is the list head...), but I suck at naming things so.
> 
> Well, it _is_ the structure for the namespace, and that's the fundamental
> problem here given that we use that name for something else at the
> moment.
> 
> We could have nvme_namespace and nvme_ns, but I'm not sure that this
> helps clarity...

If there weren't resistance to renaming structs, it would be more
aligned with how the specification names these if we renamed nvme_ns to
nvme_ns_path, and what you're calling nvme_ns_head should just be
the nvme_ns.
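
To make the proposed naming concrete, a rough sketch of the spec-aligned
layout (field lists trimmed to the essentials; the names and members here
are hypothetical, not an actual patch):

	/* one object per NSID in the subsystem -- today's nvme_ns_head */
	struct nvme_ns {
		unsigned		ns_id;
		struct list_head	paths;	/* all paths to this namespace */
	};

	/* one object per (controller, namespace) pairing -- today's nvme_ns */
	struct nvme_ns_path {
		struct nvme_ns		*ns;
		struct nvme_ctrl	*ctrl;
		struct list_head	entry;	/* linked on ns->paths */
	};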


Re: [PATCH] block: drop "sending ioctl to a partition" message

2017-09-21 Thread Paolo Bonzini
On 21/09/2017 16:53, Christoph Hellwig wrote:
> This looks ok to me, but do we even need to keep the special
> cases above?  Is there anything relying on the safe but not very
> useful ioctls?

No idea, I stuck to the usual "don't break userspace" rule.

Honestly I doubt anything is using most of those ioctls _in general_,
not just on a partition.

Paolo

> Condensing the thing down to:
> 
> int scsi_verify_blk_ioctl(struct block_device *bd, unsigned int cmd)
> {
>   if (bd && bd == bd->bd_contains)
>   return 0;
>   if (capable(CAP_SYS_RAWIO))
>   return 0;
>   return -ENOIOCTLCMD;
> }
> 
> would certainly be nice.


Re: [PATCH] block: drop "sending ioctl to a partition" message

2017-09-21 Thread Christoph Hellwig
This looks ok to me, but do we even need to keep the special
cases above?  Is there anything relying on the safe but not very
useful ioctls?

Condensing the thing down to:

int scsi_verify_blk_ioctl(struct block_device *bd, unsigned int cmd)
{
if (bd && bd == bd->bd_contains)
return 0;
if (capable(CAP_SYS_RAWIO))
return 0;
return -ENOIOCTLCMD;
}

would certainly be nice.


Re: nvme multipath support V2

2017-09-21 Thread Christoph Hellwig
On Thu, Sep 21, 2017 at 07:23:45AM +0200, Johannes Thumshirn wrote:
> Ah OK, we maybe should update nvme-cli to recognize it as well then. I just
> looked at the output of nvme list and obviously didn't find it.

Overloading the new per-subsystem nodes into nvme list would be
very confusing I think.  We could add a new command to list subsystems
instead of controllers, if just doing a ls in /dev or sysfs is too hard.


[PATCH] block: drop "sending ioctl to a partition" message

2017-09-21 Thread Paolo Bonzini
After the first few months, the message has not led to many bug reports.
It's been almost five years now, and in practice the main source of
it seems to be MTIOCGET that someone is using to detect tape devices.
While we could whitelist it just like CDROM_GET_CAPABILITY, this patch
just removes the message altogether.

Signed-off-by: Paolo Bonzini 
---
 block/scsi_ioctl.c | 18 ++
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index 7440de44dd85..eafcd67e2480 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -707,24 +707,10 @@ int scsi_verify_blk_ioctl(struct block_device *bd, unsigned int cmd)
case SG_SET_RESERVED_SIZE:
case SG_EMULATED_HOST:
return 0;
-   case CDROM_GET_CAPABILITY:
-   /* Keep this until we remove the printk below.  udev sends it
-* and we do not want to spam dmesg about it.   CD-ROMs do
-* not have partitions, so we get here only for disks.
-*/
-   return -ENOIOCTLCMD;
default:
-   break;
+		/* In particular, rule out all resets and host-specific ioctls. */
+   return capable(CAP_SYS_RAWIO) ? 0 : -ENOIOCTLCMD;
}
-
-   if (capable(CAP_SYS_RAWIO))
-   return 0;
-
-   /* In particular, rule out all resets and host-specific ioctls.  */
-	printk_ratelimited(KERN_WARNING
-		   "%s: sending ioctl %x to a partition!\n", current->comm, cmd);
-
-   return -ENOIOCTLCMD;
 }
 EXPORT_SYMBOL(scsi_verify_blk_ioctl);
 
-- 
1.8.3.1



Re: [PATCH 8/9] nvme: track shared namespaces

2017-09-21 Thread Christoph Hellwig
On Thu, Sep 21, 2017 at 07:22:17AM +0200, Johannes Thumshirn wrote:
> > But head also has connotations in the SAN world.  Maybe nvme_ns_chain?
> 
> I know that's why I didn't really like it all too much in the first place as
> well. For nvme_ns_chain, it's not a chain really (the list itself is a chain,
> the structure really is the list head...), but I suck at naming things so.

Well, it _is_ the structure for the namespace, and that's the fundamental
problem here given that we use that name for something else at the
moment.

We could have nvme_namespace and nvme_ns, but I'm not sure that this
helps clarity...


Re: [PATCH] blk-mq: fix nr_requests wrong value when modify it from sysfs

2017-09-21 Thread Jens Axboe
On 09/21/2017 07:03 AM, weiping zhang wrote:
> On Tue, Sep 12, 2017 at 09:57:32PM +0800, weiping zhang wrote:
>> On Wed, Sep 06, 2017 at 01:00:44PM +, Bart Van Assche wrote:
>>> On Wed, 2017-09-06 at 15:34 +0800, weiping zhang wrote:
 On Tue, Sep 05, 2017 at 03:42:45PM +, Bart Van Assche wrote:
> On Sun, 2017-09-03 at 21:46 +0800, weiping zhang wrote:
>> if blk-mq uses the "none" io scheduler, nr_requests gets a wrong value when
>> a number > tag_set->queue_depth is input. blk_mq_tag_update_depth will take
>> the smaller one, min(nr, set->queue_depth), and then q->nr_requests ends up
>> with a wrong value.
>>
>> Reproduce:
>>
>> echo none > /sys/block/nvme0n1/queue/ioscheduler
>> echo 100 > /sys/block/nvme0n1/queue/nr_requests
>> cat /sys/block/nvme0n1/queue/nr_requests
>> 100
>>
>> Signed-off-by: weiping zhang 
>> ---
>>  block/blk-mq.c | 7 +--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index f84d145..8303e5e 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -2622,8 +2622,11 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>>  		 * queue depth. This is similar to what the old code would do.
>>  		 */
>>  		if (!hctx->sched_tags) {
>> -			ret = blk_mq_tag_update_depth(hctx, &hctx->tags,
>> -					min(nr, set->queue_depth),
>> +			if (nr > set->queue_depth) {
>> +				nr = set->queue_depth;
>> +				pr_warn("reduce nr_request to %u\n", nr);
>> +			}
>> +			ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
>>  						false);
>>  		} else {
>>  			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
>
> Shouldn't this code return -EINVAL or -ERANGE if 'nr' is too large? That 
> will help to
> keep user space code simple that updates the queue depth.

 Hi Bart,

 The reason for not returning -EINVAL is to keep in line with the minimum
 checking in queue_requests_store: if you insist on -EINVAL/-ERANGE, the
 minimum checking should keep the same behavior. Both returning an error
 and quietly clamping are okay with me. Which way do you prefer?

 static ssize_t
 queue_requests_store(struct request_queue *q, const char *page, size_t 
 count)
 {
unsigned long nr;
int ret, err;

if (!q->request_fn && !q->mq_ops)
return -EINVAL;

 	ret = queue_var_store(&nr, page, count);
if (ret < 0)
return ret;

if (nr < BLKDEV_MIN_RQ)
nr = BLKDEV_MIN_RQ;
>>>
>>> Hello Jens,
>>>
>>> Do you perhaps have a preference for one of the approaches that have been 
>>> discussed
>>> in this e-mail thread?
>>>
>>> Thanks,
>>>
>>> Bart.
>>
> Hello Jens,
> 
> Would you please give some comments about this patch,

If someone writes a value that's too large, return -EINVAL and
don't set it. Don't add weird debug printks.


-- 
Jens Axboe



Re: [PATCH V8 00/14] mmc: Add Command Queue support

2017-09-21 Thread Christoph Hellwig
On Wed, Sep 13, 2017 at 02:40:00PM +0300, Adrian Hunter wrote:
> Non-CQE blk-mq showed a 3% decrease in sequential read performance.  This
> seemed to be coming from the inferior latency of running work items compared
> with a dedicated thread.  Hacking blk-mq workqueue to be unbound reduced the
> performance degradation from 3% to 1%.

Can you start a discussion on just this on the linux-block lists?
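
For context, the workqueue "hack" mentioned in the quoted text presumably
amounts to a one-flag change where kblockd_workqueue is allocated in
block/blk-core.c, since blk-mq's hctx->run_work is queued through kblockd
(a sketch, assuming that is the workqueue that was modified):

	/* WQ_UNBOUND lets the scheduler place dispatch work on any CPU
	 * instead of pinning it to the submitting CPU. */
	kblockd_workqueue = alloc_workqueue("kblockd",
			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);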



Re: [PATCH V2] lightnvm: include NVM Express driver if OCSSD is selected for build

2017-09-21 Thread Matias Bjørling



On 09/21/2017 03:41 PM, Rakesh Pandit wrote:

Because NVM needs BLK_DEV_NVME, select it automatically if we mark NVM
in config file before building kernel.  Also append PCI to depends as
select doesn't automatically add dependencies.

Signed-off-by: Rakesh Pandit 
---

V2: appends 'depends' with PCI

  drivers/lightnvm/Kconfig | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
index ead61a9..2a953ef 100644
--- a/drivers/lightnvm/Kconfig
+++ b/drivers/lightnvm/Kconfig
@@ -4,7 +4,8 @@
  
  menuconfig NVM

bool "Open-Channel SSD target support"
-   depends on BLOCK && HAS_DMA
+   depends on BLOCK && HAS_DMA && PCI
+   select BLK_DEV_NVME
help
  Say Y here to get to enable Open-channel SSDs.
  



Thanks Rakesh.


[PATCH V2] lightnvm: include NVM Express driver if OCSSD is selected for build

2017-09-21 Thread Rakesh Pandit
Because NVM needs BLK_DEV_NVME, select it automatically if we mark NVM
in config file before building kernel.  Also append PCI to depends as
select doesn't automatically add dependencies.

Signed-off-by: Rakesh Pandit 
---

V2: appends 'depends' with PCI

 drivers/lightnvm/Kconfig | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
index ead61a9..2a953ef 100644
--- a/drivers/lightnvm/Kconfig
+++ b/drivers/lightnvm/Kconfig
@@ -4,7 +4,8 @@
 
 menuconfig NVM
bool "Open-Channel SSD target support"
-   depends on BLOCK && HAS_DMA
+   depends on BLOCK && HAS_DMA && PCI
+   select BLK_DEV_NVME
help
  Say Y here to get to enable Open-channel SSDs.
 
-- 
2.5.0



Re: [PATCH 6/6] lightnvm: include NVM Express driver if OCSSD is selected for build

2017-09-21 Thread Rakesh Pandit
On Thu, Sep 21, 2017 at 01:32:40PM +0200, Matias Bjørling wrote:
> On 09/21/2017 01:28 PM, Rakesh Pandit wrote:
> > Because NVM needs BLK_DEV_NVME, select it automatically if we mark NVM
> > in config file before building kernel.
> > 
> > Signed-off-by: Rakesh Pandit 
> > ---
> >   drivers/lightnvm/Kconfig | 1 +
> >   1 file changed, 1 insertion(+)
> > 
> > diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
> > index ead61a9..b3c00cb 100644
> > --- a/drivers/lightnvm/Kconfig
> > +++ b/drivers/lightnvm/Kconfig
> > @@ -5,6 +5,7 @@
> >   menuconfig NVM
> > bool "Open-Channel SSD target support"
> > depends on BLOCK && HAS_DMA
> > +   select BLK_DEV_NVME
> > help
> >   Say Y here to get to enable Open-channel SSDs.
> > 
> 
> Thanks Rakesh. I've picked it up for 4.15.

As discussed (IRC) I wasn't very careful first time and missed
appending PCI to depends as select doesn't visit dependencies
automatically.  Will post a version 2 soon.

Will only post version 2 of this very patch.

Doesn't make sense to push anything else from set.  Thank you in
advance,


Re: [PATCH] blk-mq: fix nr_requests wrong value when modify it from sysfs

2017-09-21 Thread weiping zhang
On Tue, Sep 12, 2017 at 09:57:32PM +0800, weiping zhang wrote:
> On Wed, Sep 06, 2017 at 01:00:44PM +, Bart Van Assche wrote:
> > On Wed, 2017-09-06 at 15:34 +0800, weiping zhang wrote:
> > > On Tue, Sep 05, 2017 at 03:42:45PM +, Bart Van Assche wrote:
> > > > On Sun, 2017-09-03 at 21:46 +0800, weiping zhang wrote:
> > > > > if blk-mq uses the "none" io scheduler, nr_requests gets a wrong value
> > > > > when a number > tag_set->queue_depth is input. blk_mq_tag_update_depth
> > > > > will take the smaller one, min(nr, set->queue_depth), and then
> > > > > q->nr_requests ends up with a wrong value.
> > > > > 
> > > > > Reproduce:
> > > > > 
> > > > > echo none > /sys/block/nvme0n1/queue/ioscheduler
> > > > > echo 100 > /sys/block/nvme0n1/queue/nr_requests
> > > > > cat /sys/block/nvme0n1/queue/nr_requests
> > > > > 100
> > > > > 
> > > > > Signed-off-by: weiping zhang 
> > > > > ---
> > > > >  block/blk-mq.c | 7 +--
> > > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > > index f84d145..8303e5e 100644
> > > > > --- a/block/blk-mq.c
> > > > > +++ b/block/blk-mq.c
> > > > > @@ -2622,8 +2622,11 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
> > > > >  		 * queue depth. This is similar to what the old code would do.
> > > > >  		 */
> > > > >  		if (!hctx->sched_tags) {
> > > > > -			ret = blk_mq_tag_update_depth(hctx, &hctx->tags,
> > > > > -					min(nr, set->queue_depth),
> > > > > +			if (nr > set->queue_depth) {
> > > > > +				nr = set->queue_depth;
> > > > > +				pr_warn("reduce nr_request to %u\n", nr);
> > > > > +			}
> > > > > +			ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
> > > > >  						false);
> > > > >  		} else {
> > > > >  			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
> > > > 
> > > > Shouldn't this code return -EINVAL or -ERANGE if 'nr' is too large? 
> > > > That will help to
> > > > keep user space code simple that updates the queue depth.
> > > 
> > > Hi Bart,
> > > 
> > > The reason for not returning -EINVAL is to keep in line with the minimum
> > > checking in queue_requests_store: if you insist on -EINVAL/-ERANGE, the
> > > minimum checking should keep the same behavior. Both returning an error
> > > and quietly clamping are okay with me. Which way do you prefer?
> > > 
> > > static ssize_t
> > > queue_requests_store(struct request_queue *q, const char *page, size_t 
> > > count)
> > > {
> > >   unsigned long nr;
> > >   int ret, err;
> > > 
> > >   if (!q->request_fn && !q->mq_ops)
> > >   return -EINVAL;
> > > 
> > > 	ret = queue_var_store(&nr, page, count);
> > >   if (ret < 0)
> > >   return ret;
> > > 
> > >   if (nr < BLKDEV_MIN_RQ)
> > >   nr = BLKDEV_MIN_RQ;
> > 
> > Hello Jens,
> > 
> > Do you perhaps have a preference for one of the approaches that have been 
> > discussed
> > in this e-mail thread?
> > 
> > Thanks,
> > 
> > Bart.
> 
Hello Jens,

Would you please give some comments about this patch,

Thanks
Weiping.


Re: [PATCH 6/6] lightnvm: include NVM Express driver if OCSSD is selected for build

2017-09-21 Thread Matias Bjørling

On 09/21/2017 01:28 PM, Rakesh Pandit wrote:

Because NVM needs BLK_DEV_NVME, select it automatically if we mark NVM
in config file before building kernel.

Signed-off-by: Rakesh Pandit 
---
  drivers/lightnvm/Kconfig | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
index ead61a9..b3c00cb 100644
--- a/drivers/lightnvm/Kconfig
+++ b/drivers/lightnvm/Kconfig
@@ -5,6 +5,7 @@
  menuconfig NVM
bool "Open-Channel SSD target support"
depends on BLOCK && HAS_DMA
+   select BLK_DEV_NVME
help
  Say Y here to get to enable Open-channel SSDs.
  



Thanks Rakesh. I've picked it up for 4.15.


[PATCH 5/6] lightnvm: pblk: print incompatible line version correctly

2017-09-21 Thread Rakesh Pandit
Correct it by converting little endian to cpu endian and also define a
macro for the line version so that maintenance is easy.

Signed-off-by: Rakesh Pandit 
---
 drivers/lightnvm/pblk-core.c | 2 +-
 drivers/lightnvm/pblk-recovery.c | 4 ++--
 drivers/lightnvm/pblk.h  | 1 +
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 74ddb30..57583a1 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -991,7 +991,7 @@ static int pblk_line_init_metadata(struct pblk *pblk, struct pblk_line *line,
memcpy(smeta_buf->header.uuid, pblk->instance_uuid, 16);
smeta_buf->header.id = cpu_to_le32(line->id);
smeta_buf->header.type = cpu_to_le16(line->type);
-   smeta_buf->header.version = cpu_to_le16(1);
+   smeta_buf->header.version = SMETA_VERSION;
 
/* Start metadata */
smeta_buf->seq_nr = cpu_to_le64(line->seq_nr);
diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 1869eef..686bc17 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -876,9 +876,9 @@ struct pblk_line *pblk_recov_l2p(struct pblk *pblk)
if (le32_to_cpu(smeta_buf->header.identifier) != PBLK_MAGIC)
continue;
 
-   if (le16_to_cpu(smeta_buf->header.version) != 1) {
+   if (smeta_buf->header.version != SMETA_VERSION) {
pr_err("pblk: found incompatible line version %u\n",
-   smeta_buf->header.version);
+   le16_to_cpu(smeta_buf->header.version));
return ERR_PTR(-EINVAL);
}
 
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index eaf5397..87b1d7f 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -318,6 +318,7 @@ enum {
 };
 
 #define PBLK_MAGIC 0x70626c6b /*pblk*/
+#define SMETA_VERSION cpu_to_le16(1)
 
 struct line_header {
__le32 crc;
-- 
2.5.0



[PATCH 6/6] lightnvm: include NVM Express driver if OCSSD is selected for build

2017-09-21 Thread Rakesh Pandit
Because NVM needs BLK_DEV_NVME, select it automatically if we mark NVM
in config file before building kernel.

Signed-off-by: Rakesh Pandit 
---
 drivers/lightnvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
index ead61a9..b3c00cb 100644
--- a/drivers/lightnvm/Kconfig
+++ b/drivers/lightnvm/Kconfig
@@ -5,6 +5,7 @@
 menuconfig NVM
bool "Open-Channel SSD target support"
depends on BLOCK && HAS_DMA
+   select BLK_DEV_NVME
help
  Say Y here to get to enable Open-channel SSDs.
 
-- 
2.5.0



[PATCH 4/6] lightnvm: pblk: improve error message if down_timeout fails

2017-09-21 Thread Rakesh Pandit
The two pr_err messages are useless as they don't even differentiate the
error codes.

Signed-off-by: Rakesh Pandit 
---
 drivers/lightnvm/pblk-core.c | 12 ++--
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index b92eabc..74ddb30 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1734,16 +1734,8 @@ static void __pblk_down_page(struct pblk *pblk, struct ppa_addr *ppa_list,
 #endif
 
 	ret = down_timeout(&rlun->wr_sem, msecs_to_jiffies(30000));
-   if (ret) {
-   switch (ret) {
-   case -ETIME:
-   pr_err("pblk: lun semaphore timed out\n");
-   break;
-   case -EINTR:
-   pr_err("pblk: lun semaphore timed out\n");
-   break;
-   }
-   }
+   if (ret == -ETIME || ret == -EINTR)
+   pr_err("pblk: taking lun semaphore timed out: err %d\n", -ret);
 }
 
 void pblk_down_page(struct pblk *pblk, struct ppa_addr *ppa_list, int nr_ppas)
-- 
2.5.0



[PATCH 3/6] lightnvm: pblk: fix message if L2P MAP is in device

2017-09-21 Thread Rakesh Pandit
This usually happens when developing with qemu and ll2pmode has its
default value.  Even in that case the message is wrong.

Signed-off-by: Rakesh Pandit 
---
 drivers/lightnvm/pblk-init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 470ef04..c5c1337 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -913,7 +913,7 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk,
int ret;
 
if (dev->identity.dom & NVM_RSP_L2P) {
-   pr_err("pblk: device-side L2P table not supported. (%x)\n",
+   pr_err("pblk: device-side L2P table supported. (%x)\n",
dev->identity.dom);
return ERR_PTR(-EINVAL);
}
-- 
2.5.0



[PATCH 2/6] lightnvm: pblk: protect line bitmap while submitting meta io

2017-09-21 Thread Rakesh Pandit
It seems pblk_dealloc_page could race against pblk_alloc_pages on the
line bitmap used for sector allocation.  The chances are very low, but we
might as well protect the bitmap properly.  It's not even in the fast path.

Signed-off-by: Rakesh Pandit 
---
 drivers/lightnvm/pblk-core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index a230125..b92eabc 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -502,12 +502,14 @@ void pblk_dealloc_page(struct pblk *pblk, struct pblk_line *line, int nr_secs)
 	u64 addr;
 	int i;
 
+	spin_lock(&line->lock);
 	addr = find_next_zero_bit(line->map_bitmap,
 					pblk->lm.sec_per_line, line->cur_sec);
 	line->cur_sec = addr - nr_secs;
 
 	for (i = 0; i < nr_secs; i++, line->cur_sec--)
 		WARN_ON(!test_and_clear_bit(line->cur_sec, line->map_bitmap));
+	spin_unlock(&line->lock);
 }
 
 u64 __pblk_alloc_page(struct pblk *pblk, struct pblk_line *line, int nr_secs)
-- 
2.5.0



[PATCH 0/6] misc patches mostly for pblk

2017-09-21 Thread Rakesh Pandit
These are trivial changes up for review.  Most of them I made while
skimming through the code base.  They are mostly cleanups and they are
at random places.

Rakesh Pandit (6):
  lightnvm: pblk: reuse pblk_gc_should_kick
  lightnvm: pblk: protect line bitmap while submitting meta io
  lightnvm: pblk: fix message if L2P MAP is in device
  lightnvm: pblk: improve error message if down_timeout fails
  lightnvm: pblk: print incompatible line version correctly
  lightnvm: include NVM Express driver if OCSSD is selected for build

 drivers/lightnvm/Kconfig |  1 +
 drivers/lightnvm/pblk-core.c | 17 ++---
 drivers/lightnvm/pblk-init.c |  2 +-
 drivers/lightnvm/pblk-recovery.c |  4 ++--
 drivers/lightnvm/pblk-rl.c   |  9 -
 drivers/lightnvm/pblk.h  |  1 +
 6 files changed, 11 insertions(+), 23 deletions(-)

-- 
2.5.0



[PATCH 1/6] lightnvm: pblk: reuse pblk_gc_should_kick

2017-09-21 Thread Rakesh Pandit
This is a trivial change which reuses pblk_gc_should_kick instead of
repeating its logic in pblk_rl_free_lines_inc.

Signed-off-by: Rakesh Pandit 
---
 drivers/lightnvm/pblk-core.c | 1 +
 drivers/lightnvm/pblk-rl.c   | 9 -
 2 files changed, 1 insertion(+), 9 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 64a6a25..a230125 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1478,6 +1478,7 @@ static void __pblk_line_put(struct pblk *pblk, struct pblk_line *line)
 	spin_unlock(&l_mg->free_lock);
 
 	pblk_rl_free_lines_inc(&pblk->rl, line);
+	pblk_gc_should_kick(pblk);
 }
 
 static void pblk_line_put_ws(struct work_struct *work)
diff --git a/drivers/lightnvm/pblk-rl.c b/drivers/lightnvm/pblk-rl.c
index 596bdec..e7c162a 100644
--- a/drivers/lightnvm/pblk-rl.c
+++ b/drivers/lightnvm/pblk-rl.c
@@ -129,18 +129,9 @@ static int pblk_rl_update_rates(struct pblk_rl *rl, unsigned long max)
 
 void pblk_rl_free_lines_inc(struct pblk_rl *rl, struct pblk_line *line)
 {
-	struct pblk *pblk = container_of(rl, struct pblk, rl);
 	int blk_in_line = atomic_read(&line->blk_in_line);
-	int ret;
 
 	atomic_add(blk_in_line, &rl->free_blocks);
-   /* Rates will not change that often - no need to lock update */
-   ret = pblk_rl_update_rates(rl, rl->rb_budget);
-
-   if (ret == (PBLK_RL_MID | PBLK_RL_LOW))
-   pblk_gc_should_start(pblk);
-   else
-   pblk_gc_should_stop(pblk);
 }
 
 void pblk_rl_free_lines_dec(struct pblk_rl *rl, struct pblk_line *line)
-- 
2.5.0



Re: [PATCH V8 12/14] mmc: block: Add CQE and blk-mq support

2017-09-21 Thread Adrian Hunter
On 21/09/17 12:59, Ulf Hansson wrote:
> On 13 September 2017 at 13:40, Adrian Hunter  wrote:
>> Add CQE support to the block driver, including:
>> - optionally using DCMD for flush requests
>> - "manually" issuing discard requests
>> - issuing read / write requests to the CQE
>> - supporting block-layer timeouts
>> - handling recovery
>> - supporting re-tuning
>>
>> CQE offers 25% - 50% better random multi-threaded I/O.  There is a slight
>> (e.g. 2%) drop in sequential read speed but no observable change to 
>> sequential
>> write.
>>
>> CQE automatically sends the commands to complete requests.  However it only
>> supports reads / writes and so-called "direct commands" (DCMD).  Furthermore
>> DCMD is limited to one command at a time, but discards require 3 commands.
>> That makes issuing discards through CQE very awkward, but some CQE's don't
>> support DCMD anyway.  So for discards, the existing non-CQE approach is
>> taken, where the mmc core code issues the 3 commands one at a time i.e.
>> mmc_erase(). Where DCMD is used, is for issuing flushes.
>>
>> For host controllers without CQE support, blk-mq support is extended to
>> synchronous reads/writes or, if the host supports CAP_WAIT_WHILE_BUSY,
>> asynchonous reads/writes.  The advantage of asynchronous reads/writes is
>> that it allows the preparation of the next request while the current
>> request is in progress.
>>
>> Signed-off-by: Adrian Hunter 
>> ---
>>  drivers/mmc/core/block.c | 734 
>> ++-
>>  drivers/mmc/core/block.h |   8 +
>>  drivers/mmc/core/queue.c | 428 +--
>>  drivers/mmc/core/queue.h |  54 +++-
>>  4 files changed, 1192 insertions(+), 32 deletions(-)
> 
> Adrian, this is just too hard for me to review. Please, can you split this up?
> 
> At least enabling mq support should be done in a separate and first
> step, then you could add the CQE bits on top.
> 
> I think that is important, because we want to make sure the mq
> deployment is done correctly first. Otherwise it becomes impossible to
> narrow down problems, because those could then be either CQE related
> or mq related.

I looked at splitting it up, but it makes more sense together.  Things are
the way they are to support all the different issue types.  It reads much
better together.  The CQE functions are named appropriately and practically
all the code is new, so it is not as though there are any intermediate steps.
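
As background to the three-command discard sequence mentioned in the
quoted description, the non-CQE path in mmc_do_erase() issues roughly the
following (a heavily trimmed sketch; error handling and the SD-specific
opcodes/arguments are omitted):

	struct mmc_command cmd = {};
	int err;

	cmd.opcode = MMC_ERASE_GROUP_START;		/* CMD35 */
	cmd.arg = from;
	cmd.flags = MMC_RSP_SPI_R1 | MMC_RSP_R1 | MMC_CMD_AC;
	err = mmc_wait_for_cmd(card->host, &cmd, 0);

	cmd.opcode = MMC_ERASE_GROUP_END;		/* CMD36 */
	cmd.arg = to;
	err = mmc_wait_for_cmd(card->host, &cmd, 0);

	cmd.opcode = MMC_ERASE;				/* CMD38 */
	cmd.arg = arg;
	cmd.busy_timeout = busy_timeout;
	err = mmc_wait_for_cmd(card->host, &cmd, 0);

Only reads/writes and single DCMDs fit the CQE model, which is why the
series keeps discards on this legacy sequence.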


Re: [PATCH] lightnvm: pblk: fix error path in pblk_lines_alloc_metadata

2017-09-21 Thread Matias Bjørling

On 09/21/2017 12:15 PM, Rakesh Pandit wrote:

On Thu, Sep 21, 2017 at 11:56:46AM +0200, Javier González wrote:

On 20 Sep 2017, at 21.50, Rakesh Pandit  wrote:

Use appropriate memory free calls based on allocation type used and
also fix number of times free is called if kmalloc fails.

Signed-off-by: Rakesh Pandit 
---
drivers/lightnvm/pblk-init.c | 8 ++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 7cf4b53..470ef04 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -624,12 +624,16 @@ static int pblk_lines_alloc_metadata(struct pblk *pblk)

fail_free_emeta:
while (--i >= 0) {
-   vfree(l_mg->eline_meta[i]->buf);
+   if (l_mg->emeta_alloc_type == PBLK_VMALLOC_META)
+   vfree(l_mg->eline_meta[i]->buf);
+   else
+   kfree(l_mg->eline_meta[i]->buf);
kfree(l_mg->eline_meta[i]);
}

+   i = PBLK_DATA_LINES;
fail_free_smeta:
-   for (i = 0; i < PBLK_DATA_LINES; i++)
+   while (--i >= 0)
kfree(l_mg->sline_meta[i]);


It is safe to use kfree on NULL pointers. No need to do this. You can
either send a new patch, or we can change it when picking it up.


Yes, that would be great if this is adjusted while picking up.





return -ENOMEM;
--
2.5.0


Rest looks good.


Reviewed-by: Javier González 



Thanks,



Thanks Rakesh. I queued it up.


Re: [PATCH] lightnvm: remove already calculated nr_chnls

2017-09-21 Thread Matias Bjørling

On 09/18/2017 12:56 PM, Matias Bjørling wrote:



On 18 Sep 2017 at 09.56, "Javier González" wrote:


> On 17 Sep 2017, at 23.04, Rakesh Pandit wrote:
 >
 > Remove repeated calculation for number of channels while creating a
 > target device.
 >
> Signed-off-by: Rakesh Pandit
 > ---
 >
 > This is also a trivial change I found while investigating/working on
 > other issue.
 >
 > drivers/lightnvm/core.c | 1 -
 > 1 file changed, 1 deletion(-)
 >
 > diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
 > index 1b8338d..01536b8 100644
 > --- a/drivers/lightnvm/core.c
 > +++ b/drivers/lightnvm/core.c
> @@ -139,7 +139,6 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct nvm_dev *dev,
 >   int prev_nr_luns;
 >   int i, j;
 >
 > - nr_chnls = nr_luns / dev->geo.luns_per_chnl;
 >   nr_chnls = (nr_chnls_mod == 0) ? nr_chnls : nr_chnls + 1;
 >
 >   dev_map = kmalloc(sizeof(struct nvm_dev_map), GFP_KERNEL);
 > --
 > 2.7.4

We wanted to make sure that nr_chnls was really, really set :)

Reviewed-by: Javier González

What the hell... must have been a patch or merge that went wrong.


Thanks Rakesh. I pulled it in for 4.15.


Re: [PATCH V2] lightnvm: protect target type list with correct locks

2017-09-21 Thread Matias Bjørling

On 09/18/2017 09:53 AM, Javier González wrote:

On 16 Sep 2017, at 20.39, Rakesh Pandit  wrote:

The nvm_tgt_types list was protected by the wrong lock for the NVM_INFO
ioctl call and could race with addition or removal of target types.  Also,
unregistering a target type was not protected correctly.

Fixes: 5cd907853 ("lightnvm: remove nested lock conflict with mm")
Signed-off-by: Rakesh Pandit 
---

V2: also add correct lock while unregistering and fix "Fixes" tag at
end.  Note I found these while investigating another issue and
skimming the core code but worth fixing.

drivers/lightnvm/core.c | 8 
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 9f9a137..1b8338d 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -589,9 +589,9 @@ void nvm_unregister_tgt_type(struct nvm_tgt_type *tt)
if (!tt)
return;

-	down_write(&nvm_lock);
+	down_write(&nvm_tgtt_lock);
 	list_del(&tt->list);
-	up_write(&nvm_lock);
+	up_write(&nvm_tgtt_lock);
 }
 EXPORT_SYMBOL(nvm_unregister_tgt_type);
 
@@ -1190,7 +1190,7 @@ static long nvm_ioctl_info(struct file *file, void __user *arg)
 	info->version[1] = NVM_VERSION_MINOR;
 	info->version[2] = NVM_VERSION_PATCH;
 
-	down_write(&nvm_lock);
+	down_write(&nvm_tgtt_lock);
 	list_for_each_entry(tt, &nvm_tgt_types, list) {
 		struct nvm_ioctl_info_tgt *tgt = &info->tgts[tgt_iter];
 
@@ -1203,7 +1203,7 @@ static long nvm_ioctl_info(struct file *file, void __user *arg)
 	}
 
 	info->tgtsize = tgt_iter;
-	up_write(&nvm_lock);
+	up_write(&nvm_tgtt_lock);

if (copy_to_user(arg, info, sizeof(struct nvm_ioctl_info))) {
kfree(info);
--
2.7.4


LGTM.

Reviewed-by: Javier González 



Thanks Rakesh.


Re: [PATCH V2] lightnvm: prevent bd removal if busy

2017-09-21 Thread Matias Bjørling

On 09/12/2017 03:22 PM, Javier González wrote:

On 10 Sep 2017, at 21.07, Rakesh Pandit  wrote:

When a virtual block device is formatted and mounted after creating
with "nvme lnvm create... -t pblk", a removal from "nvm lnvm remove"
would result in this:

[446416.309757] bdi-block not registered
[446416.309773] ------------[ cut here ]------------
[446416.309780] WARNING: CPU: 3 PID: 4319 at fs/fs-writeback.c:2159 __mark_inode_dirty+0x268/0x340

Ideally removal should return -EBUSY as the block device is mounted after
formatting.  This patch addresses this by checking whether the whole device
or any partition of it is already mounted before removal.

The whole device is checked using the "bd_super" member of the block
device.  This member is always set once the block device has been mounted
with a filesystem.  Another member, "bd_part_count", takes care of
checking whether any partitions are in use.  "bd_part_count" is only
updated under locks when partitions are opened or closed (first open and
last release).  This at least takes care of sending -EBUSY if removal is
attempted while the whole block device or any partition is mounted.

Signed-off-by: Rakesh Pandit 
---

V2: Take a different approach. Instead of checking bd_openers use
bd_super and bd_part_count.  This should address the removal of bdevs
which are mounted from removal.

drivers/lightnvm/core.c | 14 ++
1 file changed, 14 insertions(+)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index c39f87d..9f9a137 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -373,6 +373,7 @@ static void __nvm_remove_target(struct nvm_target *t)
static int nvm_remove_tgt(struct nvm_dev *dev, struct nvm_ioctl_remove *remove)
{
struct nvm_target *t;
+   struct block_device *bdev;

 	mutex_lock(&dev->mlock);
 	t = nvm_find_target(dev, remove->tgtname);
@@ -380,6 +381,19 @@ static int nvm_remove_tgt(struct nvm_dev *dev, struct nvm_ioctl_remove *remove)
 		mutex_unlock(&dev->mlock);
 		return 1;
 	}
+	bdev = bdget_disk(t->disk, 0);
+	if (!bdev) {
+		pr_err("nvm: removal failed, allocating bd failed\n");
+		mutex_unlock(&dev->mlock);
+		return -ENOMEM;
+	}
+	if (bdev->bd_super || bdev->bd_part_count) {
+		pr_err("nvm: removal failed, block device busy\n");
+		bdput(bdev);
+		mutex_unlock(&dev->mlock);
+		return -EBUSY;
+	}
+	bdput(bdev);
 	__nvm_remove_target(t);
 	mutex_unlock(&dev->mlock);

--
2.7.4


Looks good.

Reviewed-by: Javier González 



Thanks Rakesh. I pulled it in for 4.15.


Re: [PATCH] lightnvm: pblk: fix error path in pblk_lines_alloc_metadata

2017-09-21 Thread Rakesh Pandit
On Thu, Sep 21, 2017 at 11:56:46AM +0200, Javier González wrote:
> > On 20 Sep 2017, at 21.50, Rakesh Pandit  wrote:
> > 
> > Use appropriate memory free calls based on allocation type used and
> > also fix number of times free is called if kmalloc fails.
> > 
> > Signed-off-by: Rakesh Pandit 
> > ---
> > drivers/lightnvm/pblk-init.c | 8 ++--
> > 1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
> > index 7cf4b53..470ef04 100644
> > --- a/drivers/lightnvm/pblk-init.c
> > +++ b/drivers/lightnvm/pblk-init.c
> > @@ -624,12 +624,16 @@ static int pblk_lines_alloc_metadata(struct pblk *pblk)
> > 
> > fail_free_emeta:
> > while (--i >= 0) {
> > -   vfree(l_mg->eline_meta[i]->buf);
> > +   if (l_mg->emeta_alloc_type == PBLK_VMALLOC_META)
> > +   vfree(l_mg->eline_meta[i]->buf);
> > +   else
> > +   kfree(l_mg->eline_meta[i]->buf);
> > kfree(l_mg->eline_meta[i]);
> > }
> > 
> > +   i = PBLK_DATA_LINES;
> > fail_free_smeta:
> > -   for (i = 0; i < PBLK_DATA_LINES; i++)
> > +   while (--i >= 0)
> > kfree(l_mg->sline_meta[i]);
> 
> It is safe to use kfree on NULL pointers. No need to do this. You can
> either send a new patch, or we can change it when picking it up.

Yes, that would be great if this is adjusted while picking up.

> 
> > 
> > return -ENOMEM;
> > --
> > 2.5.0
> 
> Rest looks good.
> 
> 
> Reviewed-by: Javier González 
> 

Thanks,


Re: [PATCH] lightnvm: pblk: fix error path in pblk_lines_alloc_metadata

2017-09-21 Thread Javier González
> On 20 Sep 2017, at 21.50, Rakesh Pandit  wrote:
> 
> Use appropriate memory free calls based on allocation type used and
> also fix number of times free is called if kmalloc fails.
> 
> Signed-off-by: Rakesh Pandit 
> ---
> drivers/lightnvm/pblk-init.c | 8 ++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
> index 7cf4b53..470ef04 100644
> --- a/drivers/lightnvm/pblk-init.c
> +++ b/drivers/lightnvm/pblk-init.c
> @@ -624,12 +624,16 @@ static int pblk_lines_alloc_metadata(struct pblk *pblk)
> 
> fail_free_emeta:
>   while (--i >= 0) {
> - vfree(l_mg->eline_meta[i]->buf);
> + if (l_mg->emeta_alloc_type == PBLK_VMALLOC_META)
> + vfree(l_mg->eline_meta[i]->buf);
> + else
> + kfree(l_mg->eline_meta[i]->buf);
>   kfree(l_mg->eline_meta[i]);
>   }
> 
> + i = PBLK_DATA_LINES;
> fail_free_smeta:
> - for (i = 0; i < PBLK_DATA_LINES; i++)
> + while (--i >= 0)
>   kfree(l_mg->sline_meta[i]);

It is safe to use kfree on NULL pointers. No need to do this. You can
either send a new patch, or we can change it when picking it up.

> 
>   return -ENOMEM;
> --
> 2.5.0

Rest looks good.


Reviewed-by: Javier González 





Re: [PATCH V8 00/14] mmc: Add Command Queue support

2017-09-21 Thread Adrian Hunter
On 21/09/17 12:01, Ulf Hansson wrote:
> On 13 September 2017 at 13:40, Adrian Hunter  wrote:
>> Hi
>>
>> Here is V8 of the hardware command queue patches without the software
>> command queue patches, now using blk-mq and now with blk-mq support for
>> non-CQE I/O.
>>
>> After the unacceptable debacle of the last release cycle, I expect an
>> immediate response to these patches.
>>
>> HW CMDQ offers 25% - 50% better random multi-threaded I/O.  I see a slight
>> 2% drop in sequential read speed but no change to sequential write.
>>
>> Non-CQE blk-mq showed a 3% decrease in sequential read performance.  This
>> seemed to be coming from the inferior latency of running work items compared
>> with a dedicated thread.  Hacking blk-mq workqueue to be unbound reduced the
>> performance degradation from 3% to 1%.
>>
>> While we should look at changing blk-mq to give better workqueue performance,
>> a bigger gain is likely to be made by adding a new host API to enable the
>> next already-prepared request to be issued directly from within ->done()
>> callback of the current request.
> 
> Adrian, I am reviewing this series, however let me comment on each
> change individually.
> 
> I have also run some test on my ux500 board and enabling the blkmq
> path via the new MMC Kconfig option. My idea was to run some iozone
> comparisons between the legacy path and the new blkmq path, but I just
> couldn't get to that point because of the following errors.
> 
> I am using a Kingston 4GB SDHC card, which is detected and mounted
> nicely. However, when I decide to do some writes to the card I get the
> following errors.
> 
> root@ME:/mnt/sdcard dd if=/dev/zero of=testfile bs=8192 count=5000 conv=fsync
> [  463.714294] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  464.722656] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  466.081481] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  467.111236] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  468.669647] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  469.685699] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  471.043334] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  472.052337] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  473.342651] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  474.323760] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  475.544769] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  476.539031] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  477.748474] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> [  478.724182] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
> 
> I haven't yet got to the point of investigating this any further, and
> unfortunately I have a busy schedule with traveling next week. I will do
> my best to look into this as soon as I can.
> 
> Perhaps you have some ideas?

The behaviour depends on whether you have MMC_CAP_WAIT_WHILE_BUSY. Try
changing that and see if it makes a difference.
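
For experimenting, the capability can be forced off (or on) in the host
driver's probe before mmc_add_host() -- e.g. in mmci_probe() for the
ux500 -- with something like this sketch (a diagnostic hack only, not a
proposed change):

	/* force the non-busy-detect issue path for comparison */
	mmc->caps &= ~MMC_CAP_WAIT_WHILE_BUSY;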


Re: [PATCH V8 08/14] mmc: core: Add parameter use_blk_mq

2017-09-21 Thread Ulf Hansson
On 13 September 2017 at 13:40, Adrian Hunter  wrote:
> Until mmc has blk-mq support fully implemented and tested, add a
> parameter use_blk_mq, default to false unless config option MMC_MQ_DEFAULT
> is selected.
>
> Signed-off-by: Adrian Hunter 
> ---
>  drivers/mmc/Kconfig  | 11 +++
>  drivers/mmc/core/core.c  |  7 +++
>  drivers/mmc/core/core.h  |  2 ++
>  drivers/mmc/core/host.c  |  2 ++
>  drivers/mmc/core/host.h  |  4 
>  include/linux/mmc/host.h |  1 +
>  6 files changed, 27 insertions(+)
>
> diff --git a/drivers/mmc/Kconfig b/drivers/mmc/Kconfig
> index ec21388311db..98202934bd29 100644
> --- a/drivers/mmc/Kconfig
> +++ b/drivers/mmc/Kconfig
> @@ -12,6 +12,17 @@ menuconfig MMC
>   If you want MMC/SD/SDIO support, you should say Y here and
>   also to your specific host controller driver.
>
> +config MMC_MQ_DEFAULT
> +   bool "MMC: use blk-mq I/O path by default"
> +   depends on MMC && BLOCK
> +   ---help---
> + This option enables the new blk-mq based I/O path for MMC block
> + devices by default.  With the option the mmc_core.use_blk_mq
> + module/boot option defaults to Y, without it to N, but it can
> + still be overridden either way.
> +
> + If unsure say N.
> +
>  if MMC

I assume the goal of adding this option is to enable us to move slowly
forward. In general that might be a good idea, however for this
particular case I am not sure.

The main reason is simply that I find it unlikely that people and
distributions will actually go in and change the default value, so in
the end we will just be adding new code, which isn't really going to
be much tested. That's what happened in the scsi case.

As I also stated earlier, I do worry about the maintenance of the mmc
block device code, and this approach makes it worse, at least short
term.

To me, the scsi case is also different, because the mq support was
added a long time ago and at that point one could worry a bit about the
maturity of blkmq in general, which I assume has been sorted out
by now.

>
>  source "drivers/mmc/core/Kconfig"
> diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
> index ef2d8aa1e7d2..3638ed4f0f9e 100644
> --- a/drivers/mmc/core/core.c
> +++ b/drivers/mmc/core/core.c
> @@ -66,6 +66,13 @@
>  bool use_spi_crc = 1;
>  module_param(use_spi_crc, bool, 0);
>
> +#ifdef CONFIG_MMC_MQ_DEFAULT
> +bool mmc_use_blk_mq = true;
> +#else
> +bool mmc_use_blk_mq = false;
> +#endif
> +module_param_named(use_blk_mq, mmc_use_blk_mq, bool, S_IWUSR | S_IRUGO);
> +
>  static int mmc_schedule_delayed_work(struct delayed_work *work,
>  unsigned long delay)
>  {
> diff --git a/drivers/mmc/core/core.h b/drivers/mmc/core/core.h
> index e941342ed450..535539a9e7eb 100644
> --- a/drivers/mmc/core/core.h
> +++ b/drivers/mmc/core/core.h
> @@ -35,6 +35,8 @@ struct mmc_bus_ops {
> int (*reset)(struct mmc_host *);
>  };
>
> +extern bool mmc_use_blk_mq;
> +
>  void mmc_attach_bus(struct mmc_host *host, const struct mmc_bus_ops *ops);
>  void mmc_detach_bus(struct mmc_host *host);
>
> diff --git a/drivers/mmc/core/host.c b/drivers/mmc/core/host.c
> index ad88deb2e8f3..b624dbb6cd15 100644
> --- a/drivers/mmc/core/host.c
> +++ b/drivers/mmc/core/host.c
> @@ -398,6 +398,8 @@ struct mmc_host *mmc_alloc_host(int extra, struct device *dev)
> host->max_blk_size = 512;
> host->max_blk_count = PAGE_SIZE / 512;
>
> +   host->use_blk_mq = mmc_use_blk_mq;
> +
> return host;
>  }
>
> diff --git a/drivers/mmc/core/host.h b/drivers/mmc/core/host.h
> index 77d6f60d1bf9..170fe5947087 100644
> --- a/drivers/mmc/core/host.h
> +++ b/drivers/mmc/core/host.h
> @@ -69,6 +69,10 @@ static inline bool mmc_card_hs400es(struct mmc_card *card)
> return card->host->ios.enhanced_strobe;
>  }
>
> +static inline bool mmc_host_use_blk_mq(struct mmc_host *host)
> +{
> +   return host->use_blk_mq;
> +}
>
>  #endif
>
> diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h
> index 54b0463443bd..5d1bd10991f7 100644
> --- a/include/linux/mmc/host.h
> +++ b/include/linux/mmc/host.h
> @@ -378,6 +378,7 @@ struct mmc_host {
>  	unsigned int		doing_retune:1;	/* re-tuning in progress */
>  	unsigned int		retune_now:1;	/* do re-tuning at next req */
>  	unsigned int		retune_paused:1; /* re-tuning is temporarily disabled */
> +	unsigned int		use_blk_mq:1;	/* use blk-mq */
> 
>  	int			rescan_disable;	/* disable card detection */
>  	int			rescan_entered;	/* used with nonremovable devices */
> --
> 1.9.1
>

Kind regards
Uffe


[PATCH BUGFIX/IMPROVEMENT 4/4] block, bfq: decrease burst size when queues in burst exit

2017-09-21 Thread Paolo Valente
If many queues belonging to the same group happen to be created
shortly after each other, then the concurrent processes associated
with these queues have typically a common goal, and they get it done
as soon as possible if not hampered by device idling.  Examples are
processes spawned by git grep, or by systemd during boot. As for
device idling, this mechanism is currently necessary for weight
raising to succeed in its goal: privileging I/O.  In view of these
facts, BFQ does not provide the above queues with either weight
raising or device idling.

On the other hand, a burst of queue creations may be caused also by
the start-up of a complex application. In this case, these queues need
usually to be served one after the other, and as quickly as possible,
to maximise responsiveness. Therefore, in this case the best strategy
is to weight-raise all the queues created during the burst, i.e., the
exact opposite of the strategy for the above case.

To distinguish between the two cases, BFQ uses an empirical burst-size
threshold, found through extensive tests and monitoring of daily
usage. Only large bursts, i.e., burst with a size above this
threshold, are considered as generated by a high number of parallel
processes. In this respect, upstart-based boot proved to be rather
hard to detect as generating a large burst of queue creations, because
with upstart most of the queues created in a burst exit *before* the
next queues in the same burst are created. To address this issue, I
changed the burst-detection mechanism so as to not decrease the size
of the current burst even if one of the queues in the burst is
eliminated.

Unfortunately, this missing decrease causes false positives on very
fast systems: on the start-up of a complex application, such as
libreoffice writer, so many queues are created, served and exited
shortly after each other, that a large burst of queue creations is
wrongly detected as occurring. These false positives just disappear if
the size of a burst is decreased when one of the queues in the burst
exits. This commit restores the missing burst-size decrease, relying
of the fact that upstart is apparently unlikely to be used on systems
running this and future versions of the kernel.

Signed-off-by: Paolo Valente 
Signed-off-by: Mauro Andreolini 
Signed-off-by: Angelo Ruocco 
Tested-by: Mirko Montanari 
Tested-by: Oleksandr Natalenko 
---
 block/bfq-iosched.c | 12 +++-
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 115747f..70f9177 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3725,16 +3725,10 @@ void bfq_put_queue(struct bfq_queue *bfqq)
if (bfqq->ref)
return;
 
-   if (bfq_bfqq_sync(bfqq))
-   /*
-* The fact that this queue is being destroyed does not
-* invalidate the fact that this queue may have been
-* activated during the current burst. As a consequence,
-* although the queue does not exist anymore, and hence
-* needs to be removed from the burst list if there,
-* the burst size has not to be decremented.
-*/
+	if (bfq_bfqq_sync(bfqq) && !hlist_unhashed(&bfqq->burst_list_node)) {
hlist_del_init(>burst_list_node);
+   bfqq->bfqd->burst_size--;
+   }
 
kmem_cache_free(bfq_pool, bfqq);
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
-- 
2.10.0



[PATCH BUGFIX/IMPROVEMENT 3/4] block, bfq: let early-merged queues be weight-raised on split too

2017-09-21 Thread Paolo Valente
A just-created bfq_queue, say Q, may happen to be merged with another
bfq_queue on the very first invocation of the function
__bfq_insert_request. In such a case, even if Q would clearly deserve
interactive weight raising (as it has just been created), the function
bfq_add_request does not make it to be invoked for Q, and thus to
activate weight raising for Q. As a consequence, when the state of Q
is saved for a possible future restore, after a split of Q from the
other bfq_queue(s), such a state happens to be (unjustly)
non-weight-raised. Then the bfq_queue will not enjoy any weight
raising on the split, even if should still be in an interactive
weight-raising period when the split occurs.

This commit solves this problem as follows, for a just-created
bfq_queue that is being early-merged: it stores directly, in the saved
state of the bfq_queue, the weight-raising state that would have been
assigned to the bfq_queue if not early-merged.

Signed-off-by: Paolo Valente 
Tested-by: Angelo Ruocco 
Tested-by: Mirko Montanari 
Tested-by: Oleksandr Natalenko 
---
 block/bfq-iosched.c | 28 +++-
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 33b63bc..115747f 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2061,10 +2061,27 @@ static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
bic->saved_in_large_burst = bfq_bfqq_in_large_burst(bfqq);
 	bic->was_in_burst_list = !hlist_unhashed(&bfqq->burst_list_node);
-   bic->saved_wr_coeff = bfqq->wr_coeff;
-   bic->saved_wr_start_at_switch_to_srt = bfqq->wr_start_at_switch_to_srt;
-   bic->saved_last_wr_start_finish = bfqq->last_wr_start_finish;
-   bic->saved_wr_cur_max_time = bfqq->wr_cur_max_time;
+   if (unlikely(bfq_bfqq_just_created(bfqq) &&
+!bfq_bfqq_in_large_burst(bfqq))) {
+   /*
+* bfqq being merged right after being created: bfqq
+* would have deserved interactive weight raising, but
+* did not make it to be set in a weight-raised state,
+* because of this early merge. Store directly the
+* weight-raising state that would have been assigned
+* to bfqq, so that to avoid that bfqq unjustly fails
+* to enjoy weight raising if split soon.
+*/
+   bic->saved_wr_coeff = bfqq->bfqd->bfq_wr_coeff;
+   bic->saved_wr_cur_max_time = bfq_wr_duration(bfqq->bfqd);
+   bic->saved_last_wr_start_finish = jiffies;
+   } else {
+   bic->saved_wr_coeff = bfqq->wr_coeff;
+   bic->saved_wr_start_at_switch_to_srt =
+   bfqq->wr_start_at_switch_to_srt;
+   bic->saved_last_wr_start_finish = bfqq->last_wr_start_finish;
+   bic->saved_wr_cur_max_time = bfqq->wr_cur_max_time;
+   }
 }
 
 static void
@@ -4150,7 +4167,6 @@ static void __bfq_insert_request(struct bfq_data *bfqd, struct request *rq)
new_bfqq->allocated++;
bfqq->allocated--;
new_bfqq->ref++;
-   bfq_clear_bfqq_just_created(bfqq);
/*
 * If the bic associated with the process
 * issuing this request still points to bfqq
@@ -4162,6 +4178,8 @@ static void __bfq_insert_request(struct bfq_data *bfqd, struct request *rq)
if (bic_to_bfqq(RQ_BIC(rq), 1) == bfqq)
bfq_merge_bfqqs(bfqd, RQ_BIC(rq),
bfqq, new_bfqq);
+
+   bfq_clear_bfqq_just_created(bfqq);
/*
 * rq is about to be enqueued into new_bfqq,
 * release rq reference on bfqq
-- 
2.10.0



[PATCH BUGFIX/IMPROVEMENT 1/4] block, bfq: fix wrong init of saved start time for weight raising

2017-09-21 Thread Paolo Valente
This commit fixes a bug that causes bfq to fail to guarantee a high
responsiveness on some drives, if there is heavy random read+write I/O
in the background. More precisely, such a failure allowed this bug to
be found [1], but the bug may well cause other yet unreported
anomalies.

BFQ raises the weight of the bfq_queues associated with soft real-time
applications, to privilege the I/O, and thus reduce latency, for these
applications. This mechanism is named soft-real-time weight raising in
BFQ. A soft real-time period may happen to be nested into an
interactive weight raising period, i.e., it may happen that, when a
bfq_queue switches to a soft real-time weight-raised state, the
bfq_queue is already being weight-raised because deemed interactive
too. In this case, BFQ saves in a special variable
wr_start_at_switch_to_srt, the time instant when the interactive
weight-raising period started for the bfq_queue, i.e., the time
instant when BFQ started to deem the bfq_queue interactive. This value
is then used to check whether the interactive weight-raising period
would still be in progress when the soft real-time weight-raising
period ends.  If so, interactive weight raising is restored for the
bfq_queue. This restore is useful, in particular, because it prevents
bfq_queues from losing their interactive weight raising prematurely,
as a consequence of spurious, short-lived soft real-time
weight-raising periods caused by wrong detections as soft real-time.

If, instead, a bfq_queue switches to soft-real-time weight raising
while it *is not* already in an interactive weight-raising period,
then the variable wr_start_at_switch_to_srt has no meaning during the
following soft real-time weight-raising period. Unfortunately the
handling of this case is wrong in BFQ: not only the variable is not
flagged somehow as meaningless, but it is also set to the time when
the switch to soft real-time weight-raising occurs. This may cause an
interactive weight-raising period to be considered mistakenly as still
in progress, and thus a spurious interactive weight-raising period to
start for the bfq_queue, at the end of the soft-real-time
weight-raising period. In particular the spurious interactive
weight-raising period will be considered as still in progress, if the
soft-real-time weight-raising period does not last very long. The
bfq_queue will then be wrongly privileged and, if I/O bound, will
unjustly steal bandwidth to truly interactive or soft real-time
bfq_queues, harming responsiveness and low latency.

This commit fixes this issue by just setting wr_start_at_switch_to_srt
to minus infinity (farthest past time instant according to jiffies
macros): when the soft-real-time weight-raising period ends, certainly
no interactive weight-raising period will be considered as still in
progress.

[1] Background I/O Type: Random - Background I/O mix: Reads and writes
- Application to start: LibreOffice Writer in
http://www.phoronix.com/scan.php?page=news_item&px=Linux-4.13-IO-Laptop

Signed-off-by: Paolo Valente 
Signed-off-by: Angelo Ruocco 
Tested-by: Oleksandr Natalenko 
Tested-by: Lee Tibbert 
Tested-by: Mirko Montanari 
---
 block/bfq-iosched.c | 50 +++---
 1 file changed, 31 insertions(+), 19 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a4783da..c25955c 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1202,6 +1202,24 @@ static unsigned int bfq_wr_duration(struct bfq_data *bfqd)
return dur;
 }
 
+/*
+ * Return the farthest future time instant according to jiffies
+ * macros.
+ */
+static unsigned long bfq_greatest_from_now(void)
+{
+   return jiffies + MAX_JIFFY_OFFSET;
+}
+
+/*
+ * Return the farthest past time instant according to jiffies
+ * macros.
+ */
+static unsigned long bfq_smallest_from_now(void)
+{
+   return jiffies - MAX_JIFFY_OFFSET;
+}
+
 static void bfq_update_bfqq_wr_on_rq_arrival(struct bfq_data *bfqd,
 struct bfq_queue *bfqq,
 unsigned int old_wr_coeff,
@@ -1216,7 +1234,19 @@ static void bfq_update_bfqq_wr_on_rq_arrival(struct bfq_data *bfqd,
bfqq->wr_coeff = bfqd->bfq_wr_coeff;
bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
} else {
-   bfqq->wr_start_at_switch_to_srt = jiffies;
+   /*
+* No interactive weight raising in progress
+* here: assign minus infinity to
+* wr_start_at_switch_to_srt, to make sure
+* that, at the end of the soft-real-time
+* weight-raising period that is starting
+* now, no interactive weight-raising period
+   

[PATCH BUGFIX/IMPROVEMENT 2/4] block, bfq: check and switch back to interactive wr also on queue split

2017-09-21 Thread Paolo Valente
As already explained in the message of commit "block, bfq: fix
wrong init of saved start time for weight raising", if a soft
real-time weight-raising period happens to be nested in a larger
interactive weight-raising period, then BFQ restores the interactive
weight raising at the end of the soft real-time weight raising. In
particular, BFQ checks whether the latter has ended only on request
dispatches.

Unfortunately, the above scheme fails to restore interactive weight
raising in the following corner case: if a bfq_queue, say Q,
1) Is merged with another bfq_queue while it is in a nested soft
real-time weight-raising period. The weight-raising state of Q is
then saved, and not considered any longer until a split occurs.
2) Is split from the other bfq_queue(s) at a time instant when its
soft real-time weight raising is already finished.
On the split, while resuming the previous, soft real-time
weight-raised state of the bfq_queue Q, BFQ checks whether the
current soft real-time weight-raising period is actually over. If so,
BFQ switches weight raising off for Q, *without* checking whether the
soft real-time period was actually nested in a not-yet-finished
interactive weight-raising period.

This commit addresses this issue by adding the above missing check in
bfq_queue splits, and restoring interactive weight raising if needed.
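
To make the corner case concrete, here is an illustrative timeline
(the time instants are made up; the function names are the ones
touched by this patch):

/*
 *   t0      Q becomes interactive weight-raised, for a duration of
 *           bfq_wr_duration() (at most 13 s).
 *   t1 > t0 Q switches to soft real-time weight raising; t0 is saved
 *           in wr_start_at_switch_to_srt.
 *   t2 > t1 Q is merged; its weight-raising state is saved.
 *   t3 > t2 Q is split, with the soft real-time period already over,
 *           but with t3 < t0 + bfq_wr_duration().
 *
 * Before this patch, bfq_bfqq_resume_state() switched weight raising
 * off at t3; with this patch, Q falls back to interactive weight
 * raising through switch_back_to_interactive_wr().
 */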

Signed-off-by: Paolo Valente 
Tested-by: Angelo Ruocco 
Tested-by: Mirko Montanari 
Tested-by: Oleksandr Natalenko 
---
 block/bfq-iosched.c | 87 ++---
 1 file changed, 49 insertions(+), 38 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index c25955c..33b63bc 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -724,6 +724,44 @@ static void bfq_updated_next_req(struct bfq_data *bfqd,
}
 }
 
+static unsigned int bfq_wr_duration(struct bfq_data *bfqd)
+{
+   u64 dur;
+
+   if (bfqd->bfq_wr_max_time > 0)
+   return bfqd->bfq_wr_max_time;
+
+   dur = bfqd->RT_prod;
+   do_div(dur, bfqd->peak_rate);
+
+   /*
+* Limit duration between 3 and 13 seconds. Tests show that
+* higher values than 13 seconds often yield the opposite of
+* the desired result, i.e., worsen responsiveness by letting
+* non-interactive and non-soft-real-time applications
+* preserve weight raising for a too long time interval.
+*
+* On the other hand, lower values than 3 seconds make it
+* difficult for most interactive tasks to complete their jobs
+* before weight-raising finishes.
+*/
+   if (dur > msecs_to_jiffies(13000))
+   dur = msecs_to_jiffies(13000);
+   else if (dur < msecs_to_jiffies(3000))
+   dur = msecs_to_jiffies(3000);
+
+   return dur;
+}
+
+/* switch back from soft real-time to interactive weight raising */
+static void switch_back_to_interactive_wr(struct bfq_queue *bfqq,
+ struct bfq_data *bfqd)
+{
+   bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+   bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+   bfqq->last_wr_start_finish = bfqq->wr_start_at_switch_to_srt;
+}
+
 static void
 bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd,
  struct bfq_io_cq *bic, bool bfq_already_existing)
@@ -750,10 +788,16 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd,
if (bfqq->wr_coeff > 1 && (bfq_bfqq_in_large_burst(bfqq) ||
time_is_before_jiffies(bfqq->last_wr_start_finish +
   bfqq->wr_cur_max_time))) {
-   bfq_log_bfqq(bfqq->bfqd, bfqq,
-   "resume state: switching off wr");
-
-   bfqq->wr_coeff = 1;
+   if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&
+   !bfq_bfqq_in_large_burst(bfqq) &&
+   time_is_after_eq_jiffies(bfqq->wr_start_at_switch_to_srt +
+bfq_wr_duration(bfqd))) {
+   switch_back_to_interactive_wr(bfqq, bfqd);
+   } else {
+   bfqq->wr_coeff = 1;
+   bfq_log_bfqq(bfqq->bfqd, bfqq,
+"resume state: switching off wr");
+   }
}
 
/* make sure weight will be updated, however we got here */
@@ -1173,35 +1217,6 @@ static bool bfq_bfqq_update_budg_for_activation(struct bfq_data *bfqd,
return wr_or_deserves_wr;
 }
 
-static unsigned int bfq_wr_duration(struct bfq_data *bfqd)
-{
-   u64 dur;
-
-   if (bfqd->bfq_wr_max_time > 0)
-   return bfqd->bfq_wr_max_time;
-
-   dur = bfqd->RT_prod;
-   do_div(dur, bfqd->peak_rate);
-
-   /*
-* Limit duration between 3 and 13 seconds. Tests show that
-* 

[PATCH BUGFIX/IMPROVEMENT 0/4] block, bfq: series of fixes of bugs affecting service guarantees

2017-09-21 Thread Paolo Valente
Hi,
the first patch in this series fixes a bug that causes bfq to fail to
guarantee high responsiveness on some drives if there is heavy
random read+write I/O in the background. More precisely, such a
failure allowed this bug to be found [1], but the bug may well cause
other yet unreported anomalies.

This fix uncovered other bugs that were concealed by the fixed bug,
for rather subtle reasons. These further bugs caused similar
responsiveness failures, but with sequential read+write workloads in
the background. The remaining three patches fix these further bugs.

The sum of these fixes makes responsiveness much stabler with BFQ. In
the presence of write hogs, it is however still impossible for an I/O
scheduler to guarantee perfect responsiveness in any circumstance,
because of throttling issues in the virtual-memory management
subsystem, and in other higher-level components.

Thanks,
Paolo

[1] Background I/O Type: Random - Background I/O mix: Reads and writes
- Application to start: LibreOffice Writer in
http://www.phoronix.com/scan.php?page=news_item&px=Linux-4.13-IO-Laptop


Paolo Valente (4):
  block, bfq: fix wrong init of saved start time for weight raising
  block, bfq: check and switch back to interactive wr also on queue
split
  block, bfq: let early-merged queues be weight-raised on split too
  block, bfq: decrease burst size when queues in burst exit

 block/bfq-iosched.c | 169 +++-
 1 file changed, 102 insertions(+), 67 deletions(-)

--
2.10.0


Re: [PATCH V8 00/14] mmc: Add Command Queue support

2017-09-21 Thread Ulf Hansson
On 13 September 2017 at 13:40, Adrian Hunter  wrote:
> Hi
>
> Here is V8 of the hardware command queue patches without the software
> command queue patches, now using blk-mq and now with blk-mq support for
> non-CQE I/O.
>
> After the unacceptable debacle of the last release cycle, I expect an
> immediate response to these patches.
>
> HW CMDQ offers 25% - 50% better random multi-threaded I/O.  I see a slight
> 2% drop in sequential read speed but no change to sequential write.
>
> Non-CQE blk-mq showed a 3% decrease in sequential read performance.  This
> seemed to be coming from the inferior latency of running work items compared
> with a dedicated thread.  Hacking blk-mq workqueue to be unbound reduced the
> performance degradation from 3% to 1%.
>
> While we should look at changing blk-mq to give better workqueue performance,
> a bigger gain is likely to be made by adding a new host API to enable the
> next already-prepared request to be issued directly from within ->done()
> callback of the current request.

Adrian, I am reviewing this series, however let me comment on each
change individually.

I have also run some tests on my ux500 board, enabling the blkmq
path via the new MMC Kconfig option. My idea was to run some iozone
comparisons between the legacy path and the new blkmq path, but I just
couldn't get to that point because of the following errors.

I am using a Kingston 4GB SDHC card, which is detected and mounted
nicely. However, when I try to do some writes to the card I get the
following errors.

root@ME:/mnt/sdcard dd if=/dev/zero of=testfile bs=8192 count=5000 conv=fsync
[  463.714294] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
[  464.722656] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
[  466.081481] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
[  467.111236] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
[  468.669647] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
[  469.685699] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
[  471.043334] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
[  472.052337] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
[  473.342651] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
[  474.323760] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
[  475.544769] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
[  476.539031] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
[  477.748474] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!
[  478.724182] mmci-pl18x 80126000.sdi0_per1: error during DMA transfer!

I haven't yet gotten to the point of investigating this any further,
and unfortunately I have a busy schedule with traveling next week. I
will do my best to look into this as soon as I can.

Perhaps you have some ideas?

Kind regards
Uffe

>
>
> Changes since V7:
> Re-based
>   mmc: core: Introduce host claiming by context
> Slightly simplified
>   mmc: core: Add parameter use_blk_mq
> New patch.
>   mmc: core: Remove unnecessary host claim
> New patch.
>   mmc: core: Export mmc_start_bkops()
> New patch.
>   mmc: core: Export mmc_start_request()
> New patch.
>   mmc: block: Add CQE and blk-mq support
> Add blk-mq support for non-CQE requests
>
> Changes since V6:
>   mmc: core: Introduce host claiming by context
> New patch.
>   mmc: core: Move mmc_start_areq() declaration
> Dropped because it has been applied
>   mmc: block: Fix block status codes
> Dropped because it has been applied
>   mmc: host: Add CQE interface
> Dropped because it has been applied
>   mmc: core: Turn off CQE before sending commands
> Dropped because it has been applied
>   mmc: block: Factor out mmc_setup_queue()
> New patch.
>   mmc: block: Add CQE support
> Drop legacy support and add blk-mq support
>
> Changes since V5:
> Re-based
>   mmc: core: Add mmc_retune_hold_now()
> Dropped because it has been applied
>   mmc: core: Add members to mmc_request and mmc_data for CQE's
> Dropped because it has been applied
>   mmc: core: Move mmc_start_areq() declaration
> New patch at Ulf's request
>   mmc: block: Fix block status codes
> Another un-related patch
>   mmc: host: Add CQE interface
> Move recovery_notifier() callback to struct mmc_request
>   mmc: core: Add support for handling CQE requests
> Roll __mmc_cqe_request_done() into mmc_cqe_request_done()
> Move function declarations requested by Ulf
>   mmc: core: Remove unused MMC_CAP2_PACKED_CMD
> Dropped because it has been applied
>   mmc: block: Add CQE support
> Add explanation to commit message
> Adjustment for changed recovery_notifier() callback
>   mmc: cqhci: support for command 

Re: I/O hangs after resuming from suspend-to-ram

2017-09-21 Thread Martin Steigerwald
Martin Steigerwald - 21.09.17, 09:30:
> Ming Lei - 21.09.17, 06:20:
> > On Mon, Aug 28, 2017 at 03:10:35PM +0200, Martin Steigerwald wrote:
> > > Ming Lei - 28.08.17, 20:58:
> > > > On Sun, Aug 27, 2017 at 09:43:52AM +0200, Oleksandr Natalenko wrote:
> > > > > Hi.
> > > > > 
> > > > > Here is disk setup for QEMU VM:
> > > > > 
> > > > > ===
> > > > > [root@archmq ~]# smartctl -i /dev/sda
> > > > > …
> > > > > Device Model:     QEMU HARDDISK
> > > > > Serial Number:    QM1
> > > > > Firmware Version: 2.5+
> > > > > User Capacity:    4,294,967,296 bytes [4.29 GB]
> > > > > Sector Size:      512 bytes logical/physical
> > > > > Device is:        Not in smartctl database [for details use: -P showall]
> > > > > ATA Version is:   ATA/ATAPI-7, ATA/ATAPI-5 published, ANSI NCITS 340-2000
> > > > > Local Time is:    Sun Aug 27 09:31:54 2017 CEST
> > > > > SMART support is: Available - device has SMART capability.
> > > > > SMART support is: Enabled
> > > > > 
> > > > > [root@archmq ~]# lsblk
> > > > > NAME                MAJ:MIN RM  SIZE RO TYPE   MOUNTPOINT
> > > > > sda                   8:0    0    4G  0 disk
> > > > > `-sda1                8:1    0    4G  0 part
> > > > >   `-md0               9:0    0    4G  0 raid10
> > > > >     `-system        253:0    0    4G  0 crypt
> > > > >       |-system-boot 253:1    0  512M  0 lvm    /boot
> > > > >       |-system-swap 253:2    0  512M  0 lvm    [SWAP]
> > > > >       `-system-root 253:3    0    3G  0 lvm    /
> > > > > sdb                   8:16   0    4G  0 disk
> > > > > `-sdb1                8:17   0    4G  0 part
> > > > >   `-md0               9:0    0    4G  0 raid10
> > > > >     `-system        253:0    0    4G  0 crypt
> > > > >       |-system-boot 253:1    0  512M  0 lvm    /boot
> > > > >       |-system-swap 253:2    0  512M  0 lvm    [SWAP]
> > > > >       `-system-root 253:3    0    3G  0 lvm    /
> > > > > sr0                  11:0    1 1024M  0 rom
> > > > > 
> > > > > [root@archmq ~]# mdadm --misc --detail /dev/md0
> > > > > 
> > > > > /dev/md0:
> > > > >         Version : 1.2
> > > > >   Creation Time : Sat Jul 29 16:37:05 2017
> > > > >      Raid Level : raid10
> > > > >      Array Size : 4191232 (4.00 GiB 4.29 GB)
> > > > >   Used Dev Size : 4191232 (4.00 GiB 4.29 GB)
> > > > >    Raid Devices : 2
> > > > >   Total Devices : 2
> > > > >     Persistence : Superblock is persistent
> > > > > 
> > > > >     Update Time : Sun Aug 27 09:30:33 2017
> > > > >           State : clean
> > > > >  Active Devices : 2
> > > > > Working Devices : 2
> > > > >  Failed Devices : 0
> > > > >   Spare Devices : 0
> > > > > 
> > > > >          Layout : far=2
> > > > >      Chunk Size : 512K
> > > > > 
> > > > >            Name : archiso:0
> > > > >            UUID : 43f4be59:c8d2fa0a:a94acdff:1c7f2f4e
> > > > >          Events : 485
> > > > > 
> > > > >     Number   Major   Minor   RaidDevice State
> > > > >        0       8        1        0      active sync   /dev/sda1
> > > > >        1       8       17       1      active sync   /dev/sdb1
> > > > > 
> > > > > ===
> > > > > 
> > > > > In words: 2 virtual disks, RAID10 setup with far-2 layout, LUKS on it,
> > > > > then LVM, then ext4 for boot, swap and btrfs for /.
> > > > > 
> > > > > I couldn't reproduce the issue with single disk without RAID.
> > > > 
> > > > Could you verify if the following patch fixes your issue?
> > 
> > Yes, the patch should address this kind of issue, which is not
> > specific to RAID, and the latest version can be found in the
> > 
> > following link:
> > https://marc.info/?l=linux-block&m=150579298505484&w=2
> 
> Thank you.
> 
> So if I understand correctly, I can just add
> 
> https://github.com/ming1/linux/tree/my_v4.13-safe-scsi-quiesce_V5_for_test
> 
> as a remote and go from there.

https://github.com/ming1/linux/tree/my_v4.13-safe-scsi-quiesce_V5_for_test

and check out the my_v4.13-safe-scsi-quiesce_V5_for_test branch, of course.
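
For reference, something like the following should do it (assuming an
existing kernel git clone; the remote name and the local branch name
are arbitrary, the repository and remote branch are taken from the
link above):

$ git remote add ming https://github.com/ming1/linux.git
$ git fetch ming
$ git checkout -b quiesce-test ming/my_v4.13-safe-scsi-quiesce_V5_for_test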

-- 
Martin


Re: I/O hangs after resuming from suspend-to-ram

2017-09-21 Thread Martin Steigerwald
Ming Lei - 21.09.17, 06:20:
> On Mon, Aug 28, 2017 at 03:10:35PM +0200, Martin Steigerwald wrote:
> > Ming Lei - 28.08.17, 20:58:
> > > On Sun, Aug 27, 2017 at 09:43:52AM +0200, Oleksandr Natalenko wrote:
> > > > Hi.
> > > > 
> > > > Here is disk setup for QEMU VM:
> > > > 
> > > > ===
> > > > [root@archmq ~]# smartctl -i /dev/sda
> > > > …
> > > > Device Model:     QEMU HARDDISK
> > > > Serial Number:    QM1
> > > > Firmware Version: 2.5+
> > > > User Capacity:    4,294,967,296 bytes [4.29 GB]
> > > > Sector Size:      512 bytes logical/physical
> > > > Device is:        Not in smartctl database [for details use: -P showall]
> > > > ATA Version is:   ATA/ATAPI-7, ATA/ATAPI-5 published, ANSI NCITS 340-2000
> > > > Local Time is:    Sun Aug 27 09:31:54 2017 CEST
> > > > SMART support is: Available - device has SMART capability.
> > > > SMART support is: Enabled
> > > > 
> > > > [root@archmq ~]# lsblk
> > > > NAME                MAJ:MIN RM  SIZE RO TYPE   MOUNTPOINT
> > > > sda                   8:0    0    4G  0 disk
> > > > `-sda1                8:1    0    4G  0 part
> > > >   `-md0               9:0    0    4G  0 raid10
> > > >     `-system        253:0    0    4G  0 crypt
> > > >       |-system-boot 253:1    0  512M  0 lvm    /boot
> > > >       |-system-swap 253:2    0  512M  0 lvm    [SWAP]
> > > >       `-system-root 253:3    0    3G  0 lvm    /
> > > > sdb                   8:16   0    4G  0 disk
> > > > `-sdb1                8:17   0    4G  0 part
> > > >   `-md0               9:0    0    4G  0 raid10
> > > >     `-system        253:0    0    4G  0 crypt
> > > >       |-system-boot 253:1    0  512M  0 lvm    /boot
> > > >       |-system-swap 253:2    0  512M  0 lvm    [SWAP]
> > > >       `-system-root 253:3    0    3G  0 lvm    /
> > > > sr0                  11:0    1 1024M  0 rom
> > > > 
> > > > [root@archmq ~]# mdadm --misc --detail /dev/md0
> > > > 
> > > > /dev/md0:
> > > >         Version : 1.2
> > > >   Creation Time : Sat Jul 29 16:37:05 2017
> > > >      Raid Level : raid10
> > > >      Array Size : 4191232 (4.00 GiB 4.29 GB)
> > > >   Used Dev Size : 4191232 (4.00 GiB 4.29 GB)
> > > >    Raid Devices : 2
> > > >   Total Devices : 2
> > > >     Persistence : Superblock is persistent
> > > > 
> > > >     Update Time : Sun Aug 27 09:30:33 2017
> > > >           State : clean
> > > >  Active Devices : 2
> > > > Working Devices : 2
> > > >  Failed Devices : 0
> > > >   Spare Devices : 0
> > > > 
> > > >          Layout : far=2
> > > >      Chunk Size : 512K
> > > > 
> > > >            Name : archiso:0
> > > >            UUID : 43f4be59:c8d2fa0a:a94acdff:1c7f2f4e
> > > >          Events : 485
> > > > 
> > > >     Number   Major   Minor   RaidDevice State
> > > >        0       8        1        0      active sync   /dev/sda1
> > > >        1       8       17       1      active sync   /dev/sdb1
> > > > 
> > > > ===
> > > > 
> > > > In words: 2 virtual disks, RAID10 setup with far-2 layout, LUKS on it,
> > > > then LVM, then ext4 for boot, swap and btrfs for /.
> > > > 
> > > > I couldn't reproduce the issue with single disk without RAID.
> > > 
> > > Could you verify if the following patch fixes your issue?
> 
> Yes, the patch should address this kind of issue, which is not
> specific to RAID, and the latest version can be found in the
> following link:
> 
>   https://marc.info/?l=linux-block&m=150579298505484&w=2

Thank you.

So if I understand correctly, I can just add

https://github.com/ming1/linux/tree/my_v4.13-safe-scsi-quiesce_V5_for_test

as a remote and go from there.

Thanks,
-- 
Martin