Re: [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.

2017-05-10 Thread NeilBrown
On Tue, May 02 2017, NeilBrown wrote:

> This is a revision of my series of patches working
> towards removing the bioset work queues.

Hi Jens,
 could I get some feedback about your thoughts on this series?
 Will you apply it?  When?  Do I need to resend anything?
 Would you like a git-pull request?  If so, what should I base it on?
 There is a minor conflict with drivers/block/zram/zram_drv.c
 as it dropped the call to blk_queue_split() recently, but otherwise it
 still applies.

Thanks,
NeilBrown


>
> This set is based on Linus' tree as of today (2nd May) plus
> the for-linus branch from Shaohua's md/raid tree.
>
> This series adds a fix for the new lightnvm/pblk-read code
> and discards bioset_create_nobvec() in favor of a flag arg to
> bioset_create().  There are also minor fixes and a little
> code clean-up.
>
> I hope to eventually get rid of the new BIOSET_NEED_RESCUER flag,
> but that needs work in dm and probably bcache first.
>
> Thanks,
> NeilBrown
>
>
> ---
>
> NeilBrown (13):
>   blk: remove bio_set arg from blk_queue_split()
>   blk: replace bioset_create_nobvec() with a flags arg to bioset_create()
>   blk: make the bioset rescue_workqueue optional.
>   blk: use non-rescuing bioset for q->bio_split.
>   block: Improvements to bounce-buffer handling
>   rbd: use bio_clone_fast() instead of bio_clone()
>   drbd: use bio_clone_fast() instead of bio_clone()
>   pktcdvd: use bio_clone_fast() instead of bio_clone()
>   lightnvm/pblk-read: use bio_clone_fast()
>   xen-blkfront: remove bio splitting.
>   bcache: use kmalloc to allocate bio in bch_data_verify()
>   block: remove bio_clone() and all references.
>   block: don't check for BIO_MAX_PAGES in blk_bio_segment_split()
>
>
>  Documentation/block/biodoc.txt  |2 -
>  block/bio.c |   72 ---
>  block/blk-core.c|4 +-
>  block/blk-merge.c   |   31 ++-
>  block/blk-mq.c  |2 -
>  block/bounce.c  |   32 +---
>  drivers/block/drbd/drbd_int.h   |3 +
>  drivers/block/drbd/drbd_main.c  |   11 +
>  drivers/block/drbd/drbd_req.c   |2 -
>  drivers/block/drbd/drbd_req.h   |2 -
>  drivers/block/pktcdvd.c |   14 +--
>  drivers/block/ps3vram.c |2 -
>  drivers/block/rbd.c |   16 +++-
>  drivers/block/rsxx/dev.c|2 -
>  drivers/block/umem.c|2 -
>  drivers/block/xen-blkfront.c|   54 +-
>  drivers/block/zram/zram_drv.c   |2 -
>  drivers/lightnvm/pblk-init.c|   16 ++--
>  drivers/lightnvm/pblk-read.c|2 -
>  drivers/lightnvm/pblk.h |1 
>  drivers/lightnvm/rrpc.c |2 -
>  drivers/md/bcache/debug.c   |2 -
>  drivers/md/bcache/super.c   |6 ++-
>  drivers/md/dm-crypt.c   |2 -
>  drivers/md/dm-io.c  |2 -
>  drivers/md/dm.c |5 +-
>  drivers/md/md.c |6 +--
>  drivers/md/raid1.c  |2 -
>  drivers/md/raid10.c |2 -
>  drivers/md/raid5-cache.c|2 -
>  drivers/md/raid5-ppl.c  |2 -
>  drivers/md/raid5.c  |2 -
>  drivers/s390/block/dcssblk.c|2 -
>  drivers/s390/block/xpram.c  |2 -
>  drivers/target/target_core_iblock.c |2 -
>  fs/block_dev.c  |2 -
>  fs/btrfs/extent_io.c|3 +
>  fs/xfs/xfs_super.c  |3 +
>  include/linux/bio.h |   12 ++
>  include/linux/blkdev.h  |3 -
>  40 files changed, 162 insertions(+), 174 deletions(-)
>
> --
> Signature




Re: [PATCH 6/9] T10: Move opencoded constants to common header

2017-05-10 Thread Martin K. Petersen

Dmitry,

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 5/9] bio-integrity: fold bio_integrity_enabled to bio_integrity_prep

2017-05-10 Thread Martin K. Petersen

Dmitry,

> Currently all integrity prep hooks are open-coded, and if prepare
> fails we ignore its code and fail the bio with EIO. Let's return the
> real error to the upper layer, so the caller may react accordingly.
>
> In fact no one wants to use bio_integrity_prep() w/o
> bio_integrity_enabled, so it is reasonable to fold them into one
> function.

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 3/9] bio-integrity: bio_integrity_advance must update integrity seed

2017-05-10 Thread Martin K. Petersen

Dmitry,

> SCSI drivers do care about bip_seed so we must update it accordingly.

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [linux-next][block] [bisected c20cfc27a] WARNING: CPU: 22 PID: 0 at block/blk-core.c:2655 .blk_update_request+0x4f8/0x500

2017-05-10 Thread Christoph Hellwig
Hi Abdul,

can you test the patch below?  I'll try to create a way to inject
short WRITE SAME commands using qemu next, but I thought I'd give
you a chance to try it as well.

---
diff --git a/block/blk-core.c b/block/blk-core.c
index c580b0138a7f..c7068520794b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2644,8 +2644,6 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
return false;
}
 
-   WARN_ON_ONCE(req->rq_flags & RQF_SPECIAL_PAYLOAD);
-
req->__data_len -= total_bytes;
 
/* update sector only for requests with clear definition of sector */
@@ -2658,17 +2656,19 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
req->cmd_flags |= req->bio->bi_opf & REQ_FAILFAST_MASK;
}
 
-   /*
-* If total number of sectors is less than the first segment
-* size, something has gone terribly wrong.
-*/
-   if (blk_rq_bytes(req) < blk_rq_cur_bytes(req)) {
-   blk_dump_rq_flags(req, "request botched");
-   req->__data_len = blk_rq_cur_bytes(req);
-   }
+   if (!(req->rq_flags & RQF_SPECIAL_PAYLOAD)) {
+   /*
+* If total number of sectors is less than the first segment
+* size, something has gone terribly wrong.
+*/
+   if (blk_rq_bytes(req) < blk_rq_cur_bytes(req)) {
+   blk_dump_rq_flags(req, "request botched");
+   req->__data_len = blk_rq_cur_bytes(req);
+   }
 
-   /* recalculate the number of segments */
-   blk_recalc_rq_segments(req);
+   /* recalculate the number of segments */
+   blk_recalc_rq_segments(req);
+   }
 
return true;
 }


Re: [PATCH v4 23/27] gfs2: clean up some filemap_* calls

2017-05-10 Thread Bob Peterson
- Original Message -
| In some places, it's trying to reset the mapping error after calling
| filemap_fdatawait. That's no longer required. Also, turn several
| filemap_fdatawrite+filemap_fdatawait calls into filemap_write_and_wait.
| That will at least return writeback errors that occur during the write
| phase.
| 
| Signed-off-by: Jeff Layton 
| ---

Hi Jeff,

This version looks better. ACK.

Regards,

Bob Peterson
Red Hat File Systems


[PATCH v3 4/4] blk-mq: allow to use hw tag for shared tags

2017-05-10 Thread Ming Lei
In the case of shared tags, hctx_may_queue() limits the maximum
number of requests allocated to one hw queue to
.queue_depth / active_queues.

So we try to allow the use of hw tags for this case
if .queue_depth / shared_queues is not less than
q->nr_requests.

This can cover some SCSI devices too, such as virtio-scsi
in its default configuration.
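
For illustration, a stand-alone user-space sketch of this check (the
function and numbers below are hypothetical; only the inequality mirrors
blk_mq_sched_may_use_hw_tag() from this series):

#include <stdbool.h>
#include <stdio.h>

/* hw tags are usable when each of the nr_shared queues can still
 * get nr_requests tags out of the hardware queue depth */
static bool may_use_hw_tag(unsigned int act_hw_queue_depth,
			   unsigned int nr_shared,
			   unsigned int nr_requests)
{
	return act_hw_queue_depth / nr_shared >= nr_requests;
}

int main(void)
{
	printf("%d\n", may_use_hw_tag(256, 2, 128));	/* 1: deep enough */
	printf("%d\n", may_use_hw_tag(256, 4, 128));	/* 0: too shallow */
	return 0;
}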

Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c | 16 
 block/blk-mq-sched.h |  1 +
 block/blk-mq.c   | 21 ++---
 block/blk-mq.h   | 23 +++
 4 files changed, 50 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index a7e125a40e0a..f2114eb3eebb 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -459,8 +459,7 @@ static int blk_mq_set_queue_depth(struct blk_mq_hw_ctx *hctx,
	return blk_mq_tag_update_depth(hctx, &hctx->tags, nr, false);
 }
 
-static int blk_mq_set_queues_depth(struct request_queue *q,
-  unsigned int nr)
+int blk_mq_set_queues_depth(struct request_queue *q, unsigned int nr)
 {
struct blk_mq_hw_ctx *hctx;
int i, j, ret;
@@ -543,15 +542,14 @@ void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 }
 
 /*
- * If this queue has enough hardware tags and doesn't share tags with
- * other queues, just use hw tag directly for scheduling.
+ * If this queue has enough hardware tags, just use hw tag directly
+ * for scheduling.
  */
 bool blk_mq_sched_may_use_hw_tag(struct request_queue *q)
 {
-   if (q->tag_set->flags & BLK_MQ_F_TAG_SHARED)
-   return false;
+   int nr_shared = blk_mq_get_shared_queues(q);
 
-   if (q->act_hw_queue_depth < q->nr_requests)
+   if ((q->act_hw_queue_depth / nr_shared) < q->nr_requests)
return false;
 
return true;
@@ -578,8 +576,10 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 
auto_hw_tag = blk_mq_sched_may_use_hw_tag(q);
if (auto_hw_tag) {
+   unsigned int nr_shared = blk_mq_get_shared_queues(q);
+
q->act_hw_queue_depth = blk_mq_get_queue_depth(q);
-   if (blk_mq_set_queues_depth(q, q->nr_requests))
+   if (blk_mq_set_queues_depth(q, q->nr_requests * nr_shared))
auto_hw_tag = false;
}
 
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index bbfc1ea5fafa..6deca4f9e656 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -26,6 +26,7 @@ void blk_mq_sched_insert_requests(struct request_queue *q,
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
 
 bool blk_mq_sched_may_use_hw_tag(struct request_queue *q);
+int blk_mq_set_queues_depth(struct request_queue *q, unsigned int nr);
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
 void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e02fa8d078e6..401a04388ac9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2150,15 +2150,17 @@ int blk_mq_get_queue_depth(struct request_queue *q)
return tags->bitmap_tags.sb.depth + tags->breserved_tags.sb.depth;
 }
 
-static void blk_mq_update_sched_flag(struct request_queue *q)
+static bool blk_mq_update_sched_flag(struct request_queue *q)
 {
struct blk_mq_hw_ctx *hctx;
int i;
+   bool use_hw_tag;
 
if (!q->elevator)
-   return;
+   return false;
 
-   if (!blk_mq_sched_may_use_hw_tag(q))
+   use_hw_tag = blk_mq_sched_may_use_hw_tag(q);
+   if (!use_hw_tag)
queue_for_each_hw_ctx(q, hctx, i) {
if (hctx->flags & BLK_MQ_F_SCHED_USE_HW_TAG) {
			blk_mq_set_queue_depth(hctx, q->act_hw_queue_depth);
@@ -2176,6 +2178,16 @@ static void blk_mq_update_sched_flag(struct request_queue *q)
if (hctx->sched_tags)
blk_mq_sched_free_tags(q->tag_set, hctx, i);
}
+   return use_hw_tag;
+}
+
+static void blk_mq_update_for_sched(struct request_queue *q)
+{
+   if (!blk_mq_update_sched_flag(q))
+   return;
+
+   blk_mq_set_queues_depth(q, q->nr_requests *
+   __blk_mq_get_shared_queues(q));
 }
 
 static void queue_set_hctx_shared(struct request_queue *q, bool shared)
@@ -2217,6 +2229,8 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
/* update existing queue */
blk_mq_update_tag_set_depth(set, false);
}
+
+   blk_mq_update_for_sched(q);
	mutex_unlock(&set->tag_list_lock);
 
synchronize_rcu();
@@ -2239,6 +2253,7 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
queue_set_hctx_shared(q, true);
	list_add_tail_rcu(&q->tag_set_list, &set->tag_list);
 
+   blk_mq_update_for_sched(q);

[PATCH v3 3/4] blk-mq: use hw tag for scheduling if hw tag space is big enough

2017-05-10 Thread Ming Lei
When the tag space of a device is big enough, we use hw tags
directly for I/O scheduling.

For now, the decision is made if the hw queue depth is not less than
q->nr_requests and the tag set isn't shared.

Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c   | 80 +-
 block/blk-mq-sched.h   |  8 +
 block/blk-mq.c | 35 --
 include/linux/blkdev.h |  8 +
 4 files changed, 122 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 2c5981ff9e04..a7e125a40e0a 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -417,9 +417,9 @@ void blk_mq_sched_insert_requests(struct request_queue *q,
blk_mq_run_hw_queue(hctx, run_queue_async);
 }
 
-static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
-  struct blk_mq_hw_ctx *hctx,
-  unsigned int hctx_idx)
+void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
+   struct blk_mq_hw_ctx *hctx,
+   unsigned int hctx_idx)
 {
if (hctx->sched_tags) {
blk_mq_free_rqs(set, hctx->sched_tags, hctx_idx);
@@ -428,9 +428,9 @@ static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
}
 }
 
-static int blk_mq_sched_alloc_tags(struct request_queue *q,
-  struct blk_mq_hw_ctx *hctx,
-  unsigned int hctx_idx)
+int blk_mq_sched_alloc_tags(struct request_queue *q,
+   struct blk_mq_hw_ctx *hctx,
+   unsigned int hctx_idx)
 {
struct blk_mq_tag_set *set = q->tag_set;
int ret;
@@ -450,14 +450,52 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
return ret;
 }
 
+static int blk_mq_set_queue_depth(struct blk_mq_hw_ctx *hctx,
+ unsigned int nr)
+{
+   if (!hctx->tags)
+   return -EINVAL;
+
+	return blk_mq_tag_update_depth(hctx, &hctx->tags, nr, false);
+}
+
+static int blk_mq_set_queues_depth(struct request_queue *q,
+  unsigned int nr)
+{
+   struct blk_mq_hw_ctx *hctx;
+   int i, j, ret;
+
+   queue_for_each_hw_ctx(q, hctx, i) {
+   ret = blk_mq_set_queue_depth(hctx, nr);
+   if (ret)
+   goto recovery;
+   }
+   return 0;
+
+ recovery:
+   queue_for_each_hw_ctx(q, hctx, j) {
+   if (j >= i)
+   break;
+		blk_mq_tag_update_depth(hctx, &hctx->tags,
+   q->act_hw_queue_depth,
+   false);
+   }
+   return ret;
+}
+
 static void blk_mq_sched_tags_teardown(struct request_queue *q)
 {
struct blk_mq_tag_set *set = q->tag_set;
struct blk_mq_hw_ctx *hctx;
int i;
 
-   queue_for_each_hw_ctx(q, hctx, i)
+   queue_for_each_hw_ctx(q, hctx, i) {
+   if (hctx->flags & BLK_MQ_F_SCHED_USE_HW_TAG) {
+   blk_mq_set_queue_depth(hctx, q->act_hw_queue_depth);
+   hctx->flags &= ~BLK_MQ_F_SCHED_USE_HW_TAG;
+   }
blk_mq_sched_free_tags(set, hctx, i);
+   }
 }
 
 int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
@@ -504,12 +542,28 @@ void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
 }
 
+/*
+ * If this queue has enough hardware tags and doesn't share tags with
+ * other queues, just use hw tag directly for scheduling.
+ */
+bool blk_mq_sched_may_use_hw_tag(struct request_queue *q)
+{
+   if (q->tag_set->flags & BLK_MQ_F_TAG_SHARED)
+   return false;
+
+   if (q->act_hw_queue_depth < q->nr_requests)
+   return false;
+
+   return true;
+}
+
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 {
struct blk_mq_hw_ctx *hctx;
struct elevator_queue *eq;
unsigned int i;
int ret;
+   bool auto_hw_tag;
 
if (!e) {
q->elevator = NULL;
@@ -522,7 +576,19 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 */
q->nr_requests = 2 * BLKDEV_MAX_RQ;
 
+   auto_hw_tag = blk_mq_sched_may_use_hw_tag(q);
+   if (auto_hw_tag) {
+   q->act_hw_queue_depth = blk_mq_get_queue_depth(q);
+   if (blk_mq_set_queues_depth(q, q->nr_requests))
+   auto_hw_tag = false;
+   }
+
queue_for_each_hw_ctx(q, hctx, i) {
+   if (auto_hw_tag)
+   hctx->flags |= BLK_MQ_F_SCHED_USE_HW_TAG;
+   else
+   hctx->flags &= ~BLK_MQ_F_SCHED_USE_HW_TAG;
+
ret = blk_mq_sched_alloc_tags(q, hctx, i);
if (ret)
  

[PATCH v3 2/4] blk-mq: introduce blk_mq_get_queue_depth()

2017-05-10 Thread Ming Lei
The hardware queue depth can be resized via blk_mq_update_nr_requests(),
so introduce this helper for retrieving the queue's depth easily.

Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 15 +++
 block/blk-mq.h |  1 +
 2 files changed, 16 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 50b968fa4922..1a61ca611fae 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2135,6 +2135,21 @@ static void blk_mq_map_swqueue(struct request_queue *q,
}
 }
 
+/*
+ * Queue depth can be changed via blk_mq_update_nr_requests(),
+ * so use this helper to retrieve queue's depth.
+ */
+int blk_mq_get_queue_depth(struct request_queue *q)
+{
+   /*
+* All queues have same queue depth, need to revisit
+* if per hw-queue depth is supported.
+*/
+   struct blk_mq_tags  *tags = q->tag_set->tags[0];
+
+   return tags->bitmap_tags.sb.depth + tags->breserved_tags.sb.depth;
+}
+
 static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 {
struct blk_mq_hw_ctx *hctx;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index cc67b48e3551..d49d46de2923 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -138,6 +138,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 void blk_mq_finish_request(struct request *rq);
 struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
unsigned int op);
+int blk_mq_get_queue_depth(struct request_queue *q);
 
 static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
 {
-- 
2.9.3



[PATCH v3 0/4] blk-mq: support to use hw tag for scheduling

2017-05-10 Thread Ming Lei
Hi,

This patchset introduces the flag BLK_MQ_F_SCHED_USE_HW_TAG and
allows hardware tags to be used directly for I/O scheduling if the queue's
depth is big enough. In this way, we can avoid allocating extra tags
and a request pool for I/O scheduling, and the scheduler tag
allocation/release can be saved in the I/O submit path.

V3:
- respect q->nr_requests by resizing hw tags, as suggested by Omar

V2:
- fix oops when kyber is used
- move dumping the new flag into patch 1
- support to use hw tag for shared tags
- update hctx->sched_tag when BLK_MQ_F_SCHED_USE_HW_TAG is changed
- clear the flag in patch of blk_mq_exit_sched()
- don't update q->nr_requests when updating hw queue's depth
- fix blk_mq_get_queue_depth()

Ming Lei (4):
  blk-mq: introduce BLK_MQ_F_SCHED_USE_HW_TAG
  blk-mq: introduce blk_mq_get_queue_depth()
  blk-mq: use hw tag for scheduling if hw tag space is big enough
  blk-mq: allow to use hw tag for shared tags

 block/blk-mq-debugfs.c |   1 +
 block/blk-mq-sched.c   |  90 
 block/blk-mq-sched.h   |   9 +
 block/blk-mq.c | 100 +
 block/blk-mq.h |  24 
 block/kyber-iosched.c  |   7 +++-
 include/linux/blk-mq.h |   1 +
 include/linux/blkdev.h |   8 
 8 files changed, 223 insertions(+), 17 deletions(-)

-- 
2.9.3



Re: [PATCH] elevator: eliminate unused result build warning

2017-05-10 Thread Jens Axboe
On 05/10/2017 09:29 AM, Firo Yang wrote:
> Gcc complains about ignoring return value of ‘strstrip’; Fix it by
> just using the strstrip() as the function parameter.

This has already been fixed up.

-- 
Jens Axboe



[PATCH] elevator: eliminate unused result build warning

2017-05-10 Thread Firo Yang
Gcc complains about ignoring the return value of ‘strstrip’; fix it by
using strstrip() directly as the function argument.

Signed-off-by: Firo Yang 
---
 block/elevator.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index fda6be9..dd0ed19 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -1099,8 +1099,7 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
return count;
 
strlcpy(elevator_name, skip_spaces(name), sizeof(elevator_name));
-   strstrip(elevator_name);
-   ret = __elevator_change(q, elevator_name);
+   ret = __elevator_change(q, strstrip(elevator_name));
if (!ret)
return count;
 
-- 
2.9.3



Re: [PATCH 0/9] block: T10/DIF Fixes and cleanups v3

2017-05-10 Thread Dmitry Monakhov

Christoph Hellwig  writes:

> Hi Dmitry,
>
> can you resend this series? 
Sorry for the very long delay, I'm in the middle of my honeymoon and this
is not a good time for work :)
> I really think we should get this into 4.12 at least.
Please see the updated version on the LKML list.


[PATCH 4/9] bio-integrity: fix interface for bio_integrity_trim

2017-05-10 Thread Dmitry Monakhov
bio_integrity_trim inherited its interface from bio_trim and accepts
an offset and size, but this API is error prone because the data offset
must always be in sync with the bio's data offset. That is why we have
an integrity update hook in bio_advance().

So the only meaningful values are offset == 0 and sectors == bio_sectors(bio).
Let's just remove these arguments completely.

Reviewed-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Signed-off-by: Dmitry Monakhov 
---
 block/bio-integrity.c | 11 ++-
 block/bio.c   |  4 ++--
 drivers/md/dm.c   |  2 +-
 include/linux/bio.h   |  5 ++---
 4 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 82a6ffb..6170dad 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -433,22 +433,15 @@ EXPORT_SYMBOL(bio_integrity_advance);
 /**
  * bio_integrity_trim - Trim integrity vector
  * @bio:   bio whose integrity vector to update
- * @offset:offset to first data sector
- * @sectors:   number of data sectors
  *
  * Description: Used to trim the integrity vector in a cloned bio.
- * The ivec will be advanced corresponding to 'offset' data sectors
- * and the length will be truncated corresponding to 'len' data
- * sectors.
  */
-void bio_integrity_trim(struct bio *bio, unsigned int offset,
-   unsigned int sectors)
+void bio_integrity_trim(struct bio *bio)
 {
struct bio_integrity_payload *bip = bio_integrity(bio);
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
 
-   bio_integrity_advance(bio, offset << 9);
-   bip->bip_iter.bi_size = bio_integrity_bytes(bi, sectors);
+   bip->bip_iter.bi_size = bio_integrity_bytes(bi, bio_sectors(bio));
 }
 EXPORT_SYMBOL(bio_integrity_trim);
 
diff --git a/block/bio.c b/block/bio.c
index 2f01f1b..d42e602 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1857,7 +1857,7 @@ struct bio *bio_split(struct bio *bio, int sectors,
split->bi_iter.bi_size = sectors << 9;
 
if (bio_integrity(split))
-   bio_integrity_trim(split, 0, sectors);
+   bio_integrity_trim(split);
 
bio_advance(bio, split->bi_iter.bi_size);
 
@@ -1891,7 +1891,7 @@ void bio_trim(struct bio *bio, int offset, int size)
bio->bi_iter.bi_size = size;
 
if (bio_integrity(bio))
-   bio_integrity_trim(bio, 0, size);
+   bio_integrity_trim(bio);
 
 }
 EXPORT_SYMBOL_GPL(bio_trim);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 6ef9500..35e5f57 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1146,7 +1146,7 @@ static int clone_bio(struct dm_target_io *tio, struct bio *bio,
clone->bi_iter.bi_size = to_bytes(len);
 
if (unlikely(bio_integrity(bio) != NULL))
-   bio_integrity_trim(clone, 0, len);
+   bio_integrity_trim(clone);
 
return 0;
 }
diff --git a/include/linux/bio.h b/include/linux/bio.h
index d1b04b0..5834186 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -724,7 +724,7 @@ extern bool bio_integrity_enabled(struct bio *bio);
 extern int bio_integrity_prep(struct bio *);
 extern void bio_integrity_endio(struct bio *);
 extern void bio_integrity_advance(struct bio *, unsigned int);
-extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
+extern void bio_integrity_trim(struct bio *);
 extern int bio_integrity_clone(struct bio *, struct bio *, gfp_t);
 extern int bioset_integrity_create(struct bio_set *, int);
 extern void bioset_integrity_free(struct bio_set *);
@@ -774,8 +774,7 @@ static inline void bio_integrity_advance(struct bio *bio,
return;
 }
 
-static inline void bio_integrity_trim(struct bio *bio, unsigned int offset,
- unsigned int sectors)
+static inline void bio_integrity_trim(struct bio *bio)
 {
return;
 }
-- 
2.9.3



[PATCH 1/9] bio-integrity: Do not allocate integrity context for bio w/o data

2017-05-10 Thread Dmitry Monakhov
If a bio has no data, such as those from blkdev_issue_flush(),
then we have nothing to protect.

This patch prevents a BUG_ON like the following:

kfree_debugcheck: out of range ptr ac1fa1d106742a5ah
kernel BUG at mm/slab.c:2773!
invalid opcode:  [#1] SMP
Modules linked in: bcache
CPU: 0 PID: 4428 Comm: xfs_io Tainted: GW   4.11.0-rc4-ext4-00041-g2ef0043-dirty #43
Hardware name: Virtuozzo KVM, BIOS seabios-1.7.5-11.vz7.4 04/01/2014
task: 880137786440 task.stack: c9ba8000
RIP: 0010:kfree_debugcheck+0x25/0x2a
RSP: 0018:c9babde0 EFLAGS: 00010082
RAX: 0034 RBX: ac1fa1d106742a5a RCX: 0007
RDX:  RSI:  RDI: 88013f3ccb40
RBP: c9babde8 R08:  R09: 
R10: fcb76420 R11: 725172ed R12: 0282
R13: 8150e766 R14: 88013a145e00 R15: 0001
FS:  7fb09384bf40() GS:88013f20() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fd0172f9e40 CR3: 000137fa9000 CR4: 06f0
Call Trace:
 kfree+0xc8/0x1b3
 bio_integrity_free+0xc3/0x16b
 bio_free+0x25/0x66
 bio_put+0x14/0x26
 blkdev_issue_flush+0x7a/0x85
 blkdev_fsync+0x35/0x42
 vfs_fsync_range+0x8e/0x9f
 vfs_fsync+0x1c/0x1e
 do_fsync+0x31/0x4a
 SyS_fsync+0x10/0x14
 entry_SYSCALL_64_fastpath+0x1f/0xc2

Reviewed-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Martin K. Petersen 
Signed-off-by: Dmitry Monakhov 
---
 block/bio-integrity.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 5384713..b5009a8 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -175,6 +175,9 @@ bool bio_integrity_enabled(struct bio *bio)
if (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE)
return false;
 
+   if (!bio_sectors(bio))
+   return false;
+
/* Already protected? */
if (bio_integrity(bio))
return false;
-- 
2.9.3



[PATCH 7/9] Guard bvec iteration logic

2017-05-10 Thread Dmitry Monakhov
Currently, if someone tries to advance a bvec beyond its size, we simply
emit a WARN_ONCE and continue to iterate beyond the bvec array boundaries.
This means that we end up dereferencing/corrupting a random memory region.

The sane reaction would be to propagate the error back to the calling
context, but bvec_iter_advance's calling context is not always suited to
error handling. For safety's sake, let's truncate the iterator size to
zero; this breaks the external iteration loop and prevents unpredictable
memory range corruption. And even if the caller ignores the error, it
will corrupt its own bvecs, not others.

This patch does:
- Return an error back to the caller, in the hope that it will react to it
- Truncate the iterator size

The code was added long ago in 4550dd6c; luckily no one has hit it
in real life :)

Changes since V1:
 - Replace BUG_ON with error logic.

Signed-off-by: Dmitry Monakhov 
---
 drivers/nvdimm/blk.c |  4 +++-
 drivers/nvdimm/btt.c |  4 +++-
 include/linux/bio.h  |  8 ++--
 include/linux/bvec.h | 11 ---
 4 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 0b49336..c82331b 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -106,7 +106,9 @@ static int nd_blk_rw_integrity(struct nd_namespace_blk *nsblk,
 
len -= cur_len;
dev_offset += cur_len;
-		bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+		err = bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+   if (err)
+   return err;
}
 
return err;
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 3d7a9fe..5a68681 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -942,7 +942,9 @@ static int btt_rw_integrity(struct btt *btt, struct bio_integrity_payload *bip,
 
len -= cur_len;
meta_nsoff += cur_len;
-		bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+		ret = bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+   if (ret)
+   return ret;
}
 
return ret;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 1b4ebb4..643ecba 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -168,8 +168,12 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
 
if (bio_no_advance_iter(bio))
iter->bi_size -= bytes;
-   else
-   bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+   else {
+   int err;
+
+   err = bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+   /* TODO: It is reasonable to complete bio with error here. */
+   }
 }
 
 #define __bio_for_each_segment(bvl, bio, iter, start)  \
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 89b65b8..984a7a8 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -22,6 +22,7 @@
 
 #include <linux/kernel.h>
 #include <linux/bug.h>
+#include <linux/errno.h>
 
 /*
  * was unsigned short, but we might as well be ready for > 64kB I/O pages
@@ -66,12 +67,15 @@ struct bvec_iter {
.bv_offset  = bvec_iter_offset((bvec), (iter)), \
 })
 
-static inline void bvec_iter_advance(const struct bio_vec *bv,
+static inline int bvec_iter_advance(const struct bio_vec *bv,
 struct bvec_iter *iter,
 unsigned bytes)
 {
-   WARN_ONCE(bytes > iter->bi_size,
- "Attempted to advance past end of bvec iter\n");
+   if (WARN_ONCE(bytes > iter->bi_size,
+"Attempted to advance past end of bvec iter\n")) {
+   iter->bi_size = 0;
+   return -EINVAL;
+   }
 
while (bytes) {
unsigned iter_len = bvec_iter_len(bv, *iter);
@@ -86,6 +90,7 @@ static inline void bvec_iter_advance(const struct bio_vec *bv,
iter->bi_idx++;
}
}
+   return 0;
 }
 
 #define for_each_bvec(bvl, bio_vec, iter, start)   \
-- 
2.9.3



[PATCH 2/9] bio-integrity: bio_trim should truncate integrity vector accordingly

2017-05-10 Thread Dmitry Monakhov
Reviewed-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Martin K. Petersen 
Signed-off-by: Dmitry Monakhov 
---
 block/bio.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 888e780..2f01f1b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1889,6 +1889,10 @@ void bio_trim(struct bio *bio, int offset, int size)
bio_advance(bio, offset << 9);
 
bio->bi_iter.bi_size = size;
+
+   if (bio_integrity(bio))
+   bio_integrity_trim(bio, 0, size);
+
 }
 EXPORT_SYMBOL_GPL(bio_trim);
 
-- 
2.9.3



[PATCH 3/9] bio-integrity: bio_integrity_advance must update integrity seed

2017-05-10 Thread Dmitry Monakhov
SCSI drivers do care about bip_seed so we must update it accordingly.

Reviewed-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Dmitry Monakhov 
---
 block/bio-integrity.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index b5009a8..82a6ffb 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -425,6 +425,7 @@ void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
unsigned bytes = bio_integrity_bytes(bi, bytes_done >> 9);
 
+   bip->bip_iter.bi_sector += bytes_done >> 9;
 	bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes);
 }
 EXPORT_SYMBOL(bio_integrity_advance);
-- 
2.9.3



[PATCH 6/9] T10: Move opencoded constants to common header

2017-05-10 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov 
---
 block/t10-pi.c   | 9 +++--
 drivers/scsi/lpfc/lpfc_scsi.c| 5 +++--
 drivers/scsi/qla2xxx/qla_isr.c   | 8 
 drivers/target/target_core_sbc.c | 2 +-
 include/linux/t10-pi.h   | 2 ++
 5 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/block/t10-pi.c b/block/t10-pi.c
index 680c6d6..bb099f2 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -28,9 +28,6 @@
 
 typedef __be16 (csum_fn) (void *, unsigned int);
 
-static const __be16 APP_ESCAPE = (__force __be16) 0x;
-static const __be32 REF_ESCAPE = (__force __be32) 0x;
-
 static __be16 t10_pi_crc_fn(void *data, unsigned int len)
 {
return cpu_to_be16(crc_t10dif(data, len));
@@ -82,7 +79,7 @@ static int t10_pi_verify(struct blk_integrity_iter *iter, csum_fn *fn,
switch (type) {
case 1:
case 2:
-   if (pi->app_tag == APP_ESCAPE)
+   if (pi->app_tag == T10_PI_APP_ESCAPE)
goto next;
 
if (be32_to_cpu(pi->ref_tag) !=
@@ -95,8 +92,8 @@ static int t10_pi_verify(struct blk_integrity_iter *iter, csum_fn *fn,
}
break;
case 3:
-   if (pi->app_tag == APP_ESCAPE &&
-   pi->ref_tag == REF_ESCAPE)
+   if (pi->app_tag == T10_PI_APP_ESCAPE &&
+   pi->ref_tag == T10_PI_REF_ESCAPE)
goto next;
break;
}
diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c
index 54fd0c8..99d2e99 100644
--- a/drivers/scsi/lpfc/lpfc_scsi.c
+++ b/drivers/scsi/lpfc/lpfc_scsi.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include <linux/t10-pi.h>
 #include 
 #include 
 
@@ -2934,8 +2935,8 @@ lpfc_calc_bg_err(struct lpfc_hba *phba, struct lpfc_scsi_buf *lpfc_cmd)
 * First check to see if a protection data
 * check is valid
 */
-   if ((src->ref_tag == 0x) ||
-   (src->app_tag == 0x)) {
+   if ((src->ref_tag == T10_PI_REF_ESCAPE) ||
+   (src->app_tag == T10_PI_APP_ESCAPE)) {
start_ref_tag++;
goto skipit;
}
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index aac0350..7fa1712 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -1950,9 +1950,9 @@ qla2x00_handle_dif_error(srb_t *sp, struct sts_entry_24xx *sts24)
 * For type 3: ref & app tag is all 'f's
 * For type 0,1,2: app tag is all 'f's
 */
-   if ((a_app_tag == 0x) &&
+   if ((a_app_tag == T10_PI_APP_ESCAPE) &&
((scsi_get_prot_type(cmd) != SCSI_PROT_DIF_TYPE3) ||
-(a_ref_tag == 0x))) {
+(a_ref_tag == T10_PI_REF_ESCAPE))) {
uint32_t blocks_done, resid;
sector_t lba_s = scsi_get_lba(cmd);
 
@@ -1994,9 +1994,9 @@ qla2x00_handle_dif_error(srb_t *sp, struct sts_entry_24xx *sts24)
spt = page_address(sg_page(sg)) + sg->offset;
spt += j;
 
-   spt->app_tag = 0x;
+   spt->app_tag = T10_PI_APP_ESCAPE;
if (scsi_get_prot_type(cmd) == SCSI_PROT_DIF_TYPE3)
-   spt->ref_tag = 0x;
+   spt->ref_tag = T10_PI_REF_ESCAPE;
}
 
return 0;
diff --git a/drivers/target/target_core_sbc.c b/drivers/target/target_core_sbc.c
index a0ad618..d542a12 100644
--- a/drivers/target/target_core_sbc.c
+++ b/drivers/target/target_core_sbc.c
@@ -1504,7 +1504,7 @@ sbc_dif_verify(struct se_cmd *cmd, sector_t start, unsigned int sectors,
 (unsigned long long)sector, sdt->guard_tag,
 sdt->app_tag, be32_to_cpu(sdt->ref_tag));
 
-   if (sdt->app_tag == cpu_to_be16(0x)) {
+   if (sdt->app_tag == T10_PI_APP_ESCAPE) {
dsg_off += block_size;
goto next;
}
diff --git a/include/linux/t10-pi.h b/include/linux/t10-pi.h
index 9375d23..635a3c5 100644
--- a/include/linux/t10-pi.h
+++ b/include/linux/t10-pi.h
@@ -33,6 +33,8 @@ struct t10_pi_tuple {
__be32 ref_tag; /* Target LBA or indirect LBA */
 };
 
+#define T10_PI_APP_ESCAPE cpu_to_be16(0x)
+#define T10_PI_REF_ESCAPE cpu_to_be32(0x)
 
 extern const struct blk_integrity_profile t10_pi_type1_crc;
 extern const 

[PATCH 8/9] bio: add bvec_iter rewind API

2017-05-10 Thread Dmitry Monakhov
Some ->bi_end_io handlers (for example: pi_verify or decrypt handlers)
need to know the original data vector, but after a bio traverses the
io-stack it may be advanced, split and relocated many times, so it is
hard to guess the original iterator. Let's add a 'bi_done' counter which
accounts for the number of bytes the iterator was advanced during its
evolution. Later, an end_io handler may easily restore the original
iterator by rewinding it by iter->bi_done.

Note: this change makes sizeof(struct bvec_iter) a multiple of 8.

Reviewed-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Dmitry Monakhov 
---
 include/linux/bio.h  | 21 +++--
 include/linux/bvec.h | 27 +++
 2 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 643ecba..ab812a0 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -166,9 +166,10 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
 {
iter->bi_sector += bytes >> 9;
 
-   if (bio_no_advance_iter(bio))
+   if (bio_no_advance_iter(bio)) {
iter->bi_size -= bytes;
-   else {
+   iter->bi_done += bytes;
+   } else {
int err;
 
err = bvec_iter_advance(bio->bi_io_vec, iter, bytes);
@@ -176,6 +177,22 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
}
 }
 
+static inline int bio_rewind_iter(struct bio *bio, struct bvec_iter *iter,
+   unsigned int bytes)
+{
+   int err = 0;
+
+   iter->bi_sector -= bytes >> 9;
+
+   if (bio_no_advance_iter(bio)) {
+   iter->bi_size += bytes;
+   iter->bi_done -= bytes;
+   } else {
+   err = bvec_iter_rewind(bio->bi_io_vec, iter, bytes);
+   }
+   return err;
+}
+
 #define __bio_for_each_segment(bvl, bio, iter, start)  \
for (iter = (start);\
 (iter).bi_size &&  \
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 984a7a8..8cfe4d6 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -40,6 +40,8 @@ struct bvec_iter {
 
unsigned intbi_idx; /* current index into bvl_vec */
 
+   unsigned intbi_done;/* number of bytes completed */
+
unsigned intbi_bvec_done;   /* number of bytes completed in
   current bvec */
 };
@@ -84,6 +86,7 @@ static inline int bvec_iter_advance(const struct bio_vec *bv,
bytes -= len;
iter->bi_size -= len;
iter->bi_bvec_done += len;
+   iter->bi_done += len;
 
if (iter->bi_bvec_done == __bvec_iter_bvec(bv, *iter)->bv_len) {
iter->bi_bvec_done = 0;
@@ -93,6 +96,30 @@ static inline int bvec_iter_advance(const struct bio_vec *bv,
return 0;
 }
 
+static inline int bvec_iter_rewind(const struct bio_vec *bv,
+struct bvec_iter *iter,
+unsigned int bytes)
+{
+   while (bytes) {
+   unsigned len = min(bytes, iter->bi_bvec_done);
+
+   if (iter->bi_bvec_done == 0) {
+   if (WARN_ONCE(iter->bi_idx == 0,
+ "Attempted to rewind iter beyond "
+ "bvec's boundaries\n")) {
+   return -EINVAL;
+   }
+   iter->bi_idx--;
			iter->bi_bvec_done = __bvec_iter_bvec(bv, *iter)->bv_len;
+   continue;
+   }
+   bytes -= len;
+   iter->bi_size += len;
+   iter->bi_bvec_done -= len;
+   }
+   return 0;
+}
+
 #define for_each_bvec(bvl, bio_vec, iter, start)   \
for (iter = (start);\
 (iter).bi_size &&  \
-- 
2.9.3



[PATCH 9/9] bio-integrity: Restore original iterator on verify stage

2017-05-10 Thread Dmitry Monakhov
Currently ->verify_fn does not work at all because, at the moment it is
called, bio->bi_iter.bi_size == 0, so we do not iterate over integrity
bvecs at all.

In order to perform verification we need to know the original data
vector; with the new bvec rewind API this is trivial.

testcase: 
https://github.com/dmonakhov/xfstests/commit/3c6509eaa83b9c17cd0bc95d73fcdd76e1c54a85

Reviewed-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Dmitry Monakhov 
---
 block/bio-integrity.c | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 9b9ace3..4b03128 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -184,9 +184,10 @@ static inline unsigned int bio_integrity_bytes(struct blk_integrity *bi,
 /**
  * bio_integrity_process - Process integrity metadata for a bio
  * @bio:   bio to generate/verify integrity metadata for
+ * @proc_iter:  iterator to process
  * @proc_fn:   Pointer to the relevant processing function
  */
-static int bio_integrity_process(struct bio *bio,
+static int bio_integrity_process(struct bio *bio, struct bvec_iter *proc_iter,
 integrity_processing_fn *proc_fn)
 {
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
@@ -200,10 +201,10 @@ static int bio_integrity_process(struct bio *bio,
 
iter.disk_name = bio->bi_bdev->bd_disk->disk_name;
iter.interval = 1 << bi->interval_exp;
-   iter.seed = bip_get_seed(bip);
+   iter.seed = proc_iter->bi_sector;
iter.prot_buf = prot_buf;
 
-   bio_for_each_segment(bv, bio, bviter) {
+   __bio_for_each_segment(bv, bio, bviter, *proc_iter) {
void *kaddr = kmap_atomic(bv.bv_page);
 
iter.data_buf = kaddr + bv.bv_offset;
@@ -333,7 +334,8 @@ int bio_integrity_prep(struct bio *bio)
 
/* Auto-generate integrity metadata if this is a write */
if (bio_data_dir(bio) == WRITE)
-   bio_integrity_process(bio, bi->profile->generate_fn);
+		bio_integrity_process(bio, &bio->bi_iter,
+ bi->profile->generate_fn);
return 0;
 err_end_io:
bio->bi_error = err;
@@ -357,9 +359,15 @@ static void bio_integrity_verify_fn(struct work_struct *work)
container_of(work, struct bio_integrity_payload, bip_work);
struct bio *bio = bip->bip_bio;
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-
-   bio->bi_error = bio_integrity_process(bio, bi->profile->verify_fn);
-
+   struct bvec_iter iter = bio->bi_iter;
+
+	/* By the time verify is called, the bio's iterator has been advanced
+	 * during split and completion; we need to rewind the iterator to
+	 * its original position. */
+	bio->bi_error = bio_rewind_iter(bio, &iter, iter.bi_done);
+	if (!bio->bi_error)
+		bio->bi_error = bio_integrity_process(bio, &iter,
+ bi->profile->verify_fn);
/* Restore original bio completion handler */
bio->bi_end_io = bip->bip_end_io;
bio_endio(bio);
-- 
2.9.3



[PATCH 0/9] block: T10/DIF Fixes and cleanups v4

2017-05-10 Thread Dmitry Monakhov
TOC:
1-bio-integrity-Do-not-allocate-integrity-context-for-fsync
2-bio-integrity-bio_trim-should-truncate-integrity-vector
3-bio-integrity-bio_integrity_advance-must-update-interator
4-bio-integrity-fix-interface-for-bio_integrity_trim
5-bio-integrity-fold-bio_integrity_enabled-to-bio_interator
6-T10-Move-opencoded-contants-to-common-header
7-Guard-bvec-iteration-logic
8-bio-add-bvec_iter-rewind-API
9-bio-integrity-Restore-original-iterator-on-verify-state

Changes since V3:
 - Rebase to next-20170505
 
 - 5th patch: fold two functions into one, in response to hch@'s comments
 - 7th patch: replace the unreliable error setting with a TODO note, in response to hch@'s comments

Changes since V2
 - Cleanups in response to comments from axboe@ and hch@
 - Use the rewind bvec API to restore the original vector on verify
 - Fix one more bug with bip_seed update on bio split.

Changes since V1
 - Fix issues spotted by the kbuild bot
 - Replace BUG_ON with error logic for the 7th patch

Testcase: xfstest blockdev/003
https://github.com/dmonakhov/xfstests/commit/3c6509eaa83b9c17cd0bc95d73fcdd76e1c54a85


Re: [PATCH v4 13/27] lib: add errseq_t type and infrastructure for handling it

2017-05-10 Thread Jeff Layton
On Wed, 2017-05-10 at 07:18 -0700, Matthew Wilcox wrote:
> On Tue, May 09, 2017 at 11:49:16AM -0400, Jeff Layton wrote:
> > +++ b/lib/errseq.c
> > @@ -0,0 +1,199 @@
> > +#include <linux/err.h>
> > +#include <linux/bug.h>
> > +#include <linux/atomic.h>
> > +#include <linux/errseq.h>
> > +
> > +/*
> > + * An errseq_t is a way of recording errors in one place, and allowing any
> > + * number of "subscribers" to tell whether it has changed since an 
> > arbitrary
> > + * time of their choosing.
> 
> You use the word "time" in several places in the documentation, but I think
> it's clearer to say "sampling point" or "sample", since you're not using 
> jiffies
> or nanoseconds.  For example, I'd phrase this paragraph this way:
> 
>  * An errseq_t is a way of recording errors in one place, and allowing any
>  * number of "subscribers" to tell whether it has changed since they last
>  * sampled it.
> 
> > + * The general idea is for consumers to sample an errseq_t value at a
> > + * particular point in time. Later, that value can be used to tell whether 
> > any
> > + * new errors have occurred since that time.
> 
>  * The general idea is for consumers to sample an errseq_t value.  That
>  * value can be used to tell whether any new errors have occurred since
>  * the last time it was sampled.
> 
> > +/* The "ones" bit for the counter */
> 
> Maybe "The lowest bit of the counter"?
> 
> > +/**
> > + * errseq_check - has an error occurred since a particular point in time?
> 
> "has an error occurred since the last time it was sampled"
> 
> > +/**
> > + * errseq_check_and_advance - check an errseq_t and advance it to the 
> > current value
> > + * @eseq: pointer to value being checked reported
> 
> "value being checked reported"?
> 

Thanks. I'm cleaning up the comments like you suggest.

> > +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
> > +{
> > +   int err = 0;
> > +   errseq_t old, new;
> > +
> > +   /*
> > +* Most callers will want to use the inline wrapper to check this,
> > +* so that the common case of no error is handled without needing
> > +* to lock.
> > +*/
> > +   old = READ_ONCE(*eseq);
> > +   if (old != *since) {
> > +   /*
> > +* Set the flag and try to swap it into place if it has
> > +* changed.
> > +*
> > +* We don't care about the outcome of the swap here. If the
> > +* swap doesn't occur, then it has either been updated by a
> > +* writer who is bumping the seq count anyway, or another
> > +* reader who is just setting the "seen" flag. Either outcome
> > +* is OK, and we can advance "since" and return an error based
> > +* on what we have.
> > +*/
> > +   new = old | ERRSEQ_SEEN;
> > +   if (new != old)
> > +   cmpxchg(eseq, old, new);
> > +   *since = new;
> > +   err = -(new & MAX_ERRNO);
> > +   }
> 
> I probably need to read through the patchset some more to understand this.
> Naively, surely "since" should be updated to the current value of 'eseq'
> if we failed the cmpxchg()?

I don't think so. If we want to do that, then we'll need to redrive the
cmpxchg to set the SEEN flag if it's now clear. Storing the value in
"since" is effectively sampling it, so you do have to mark it seen.

The good news is that I think that "new" is just as valid a value to
store here as *eseq would be. It ends up representing an errseq_t value
that never actually got stored in eseq, but that's OK with the way this
works.
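
To make that concrete, here is a toy user-space model of the
check-and-advance flow. The constants are hypothetical stand-ins for the
layout the patch describes (error code in the low bits, a SEEN flag, the
counter above), and plain assignment stands in for READ_ONCE/cmpxchg:

#include <stdio.h>

#define MAX_ERRNO	4095
#define ERRSEQ_SHIFT	12			/* bits needed for MAX_ERRNO */
#define ERRSEQ_SEEN	(1u << ERRSEQ_SHIFT)
#define ERRSEQ_CTR_INC	(1u << (ERRSEQ_SHIFT + 1))

int main(void)
{
	unsigned int eseq, since = 0, old, new;

	eseq = ERRSEQ_CTR_INC | 5;	/* a writer recorded -EIO */

	/* errseq_check_and_advance(): */
	old = eseq;
	if (old != since) {
		new = old | ERRSEQ_SEEN;
		if (new != old)
			eseq = new;	/* stands in for a successful cmpxchg */
		since = new;		/* "new" is as valid here as *eseq */
		printf("first check: err = -%u\n", new & MAX_ERRNO);
	}

	/* nothing new was recorded since, so the values compare equal */
	if (eseq == since)
		printf("second check: no new error\n");
	return 0;
}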

-- 
Jeff Layton 


Re: [PATCH] blk-mq: NVMe 512B/4K+T10 DIF/DIX format returns I/O error on dd with split op

2017-05-10 Thread Martin K. Petersen

wenxi...@linux.vnet.ibm.com,

> When formatting NVMe to 512B/4K + T10 DIF/DIX, dd with a split op
> returns "Input/output error". It looks like the block layer splits the
> bio after calling bio_integrity_prep(bio). This patch fixes the issue.

Looks good.

Acked-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] scsi: sanity check for timeout in sg_io()

2017-05-10 Thread James Bottomley
On Wed, 2017-05-10 at 15:24 +0200, Hannes Reinecke wrote:
> sg_io() is using msecs_to_jiffies() to convert a passed in timeout
> value (in milliseconds) to a jiffies value. However, if the value
> is too large msecs_to_jiffies() will return MAX_JIFFY_OFFSET, which
> will be truncated to -2 and cause the timeout to be set to 1.3
> _years_. Which is probably too long for most applications.

What makes you think that?  If a user specified timeout in ms is over
MAX_JIFFY_OFFSET on, say a 250HZ machine, then they're already asking
for something around this length of time in ms, which is probably their
attempt at machine infinity, meaning they want effectively no timeout. 
 Your patch would now give them the default timeout, which looks way
too short.

James

> Signed-off-by: Hannes Reinecke 
> ---
>  block/scsi_ioctl.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
> index 4a294a5..53b95ea 100644
> --- a/block/scsi_ioctl.c
> +++ b/block/scsi_ioctl.c
> @@ -231,6 +231,7 @@ static int blk_fill_sghdr_rq(struct request_queue *q, struct request *rq,
>struct sg_io_hdr *hdr, fmode_t mode)
>  {
>   struct scsi_request *req = scsi_req(rq);
> + unsigned long timeout;
>  
>   if (copy_from_user(req->cmd, hdr->cmdp, hdr->cmd_len))
>   return -EFAULT;
> @@ -242,7 +243,11 @@ static int blk_fill_sghdr_rq(struct request_queue *q, struct request *rq,
>*/
>   req->cmd_len = hdr->cmd_len;
>  
> - rq->timeout = msecs_to_jiffies(hdr->timeout);
> + timeout = msecs_to_jiffies(hdr->timeout);
> + if (timeout == MAX_JIFFY_OFFSET)
> + rq->timeout = 0;
> + else
> + rq->timeout = timeout;
>   if (!rq->timeout)
>   rq->timeout = q->sg_timeout;
>   if (!rq->timeout)



[PATCH] scsi: sanity check for timeout in sg_io()

2017-05-10 Thread Hannes Reinecke
sg_io() is using msecs_to_jiffies() to convert a passed in timeout
value (in milliseconds) to a jiffies value. However, if the value
is too large msecs_to_jiffies() will return MAX_JIFFY_OFFSET, which
will be truncated to -2 and cause the timeout to be set to 1.3
_years_. Which is probably too long for most applications.
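
As a rough user-space illustration of that truncation (the jiffies model
below is a simplified stand-in for the kernel's msecs_to_jiffies(), not
the real implementation):

#include <limits.h>
#include <stdio.h>

#define HZ 100
#define MAX_JIFFY_OFFSET ((LONG_MAX >> 1) - 1)

/* simplified model of msecs_to_jiffies(): "negative" input saturates */
static unsigned long msecs_to_jiffies_model(unsigned int m)
{
	if ((int)m < 0)
		return MAX_JIFFY_OFFSET;
	return (unsigned long)m * HZ / 1000;
}

int main(void)
{
	/* rq->timeout is 32 bits, so MAX_JIFFY_OFFSET truncates to -2 */
	unsigned int rq_timeout = msecs_to_jiffies_model(0xffffffffU);

	printf("rq->timeout = %u jiffies, about %.2f years at HZ=%d\n",
	       rq_timeout, rq_timeout / (double)HZ / 86400 / 365, HZ);
	return 0;
}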

Signed-off-by: Hannes Reinecke 
---
 block/scsi_ioctl.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index 4a294a5..53b95ea 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -231,6 +231,7 @@ static int blk_fill_sghdr_rq(struct request_queue *q, struct request *rq,
 struct sg_io_hdr *hdr, fmode_t mode)
 {
struct scsi_request *req = scsi_req(rq);
+   unsigned long timeout;
 
if (copy_from_user(req->cmd, hdr->cmdp, hdr->cmd_len))
return -EFAULT;
@@ -242,7 +243,11 @@ static int blk_fill_sghdr_rq(struct request_queue *q, struct request *rq,
 */
req->cmd_len = hdr->cmd_len;
 
-   rq->timeout = msecs_to_jiffies(hdr->timeout);
+   timeout = msecs_to_jiffies(hdr->timeout);
+   if (timeout == MAX_JIFFY_OFFSET)
+   rq->timeout = 0;
+   else
+   rq->timeout = timeout;
if (!rq->timeout)
rq->timeout = q->sg_timeout;
if (!rq->timeout)
-- 
1.8.5.6



Re: [PATCH v4 06/27] fs: check for writeback errors after syncing out buffers in generic_file_fsync

2017-05-10 Thread Matthew Wilcox
On Tue, May 09, 2017 at 11:49:09AM -0400, Jeff Layton wrote:
> ext2 currently does a test+clear of the AS_EIO flag, which is
> problematic for some coming changes.
> 
> What we really need to do instead is call filemap_check_errors
> in __generic_file_fsync after syncing out the buffers. That
> will be sufficient for this case, and help other callers detect
> these errors properly as well.
> 
> With that, we don't need to twiddle it in ext2.
> 
> Suggested-by: Jan Kara 
> Signed-off-by: Jeff Layton 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Jan Kara 

Reviewed-by: Matthew Wilcox 


Re: [PATCH v4 14/27] fs: new infrastructure for writeback error handling and reporting

2017-05-10 Thread Jeff Layton
On Wed, 2017-05-10 at 13:48 +0200, Jan Kara wrote:
> On Tue 09-05-17 11:49:17, Jeff Layton wrote:
> > Most filesystems currently use mapping_set_error and
> > filemap_check_errors for setting and reporting/clearing writeback errors
> > at the mapping level. filemap_check_errors is indirectly called from
> > most of the filemap_fdatawait_* functions and from
> > filemap_write_and_wait*. These functions are called from all sorts of
> > contexts to wait on writeback to finish -- e.g. mostly in fsync, but
> > also in truncate calls, getattr, etc.
> > 
> > The non-fsync callers are problematic. We should be reporting writeback
> > errors during fsync, but many places spread over the tree clear out
> > errors before they can be properly reported, or report errors at
> > nonsensical times.
> > 
> > If I get -EIO on a stat() call, there is no reason for me to assume that
> > it is because some previous writeback failed. The fact that it also
> > clears out the error such that a subsequent fsync returns 0 is a bug,
> > and a nasty one since that's potentially silent data corruption.
> > 
> > This patch adds a small bit of new infrastructure for setting and
> > reporting errors during address_space writeback. While the above was my
> > original impetus for adding this, I think it's also the case that
> > current fsync semantics are just problematic for userland. Most
> > applications that call fsync do so to ensure that the data they wrote
> > has hit the backing store.
> > 
> > In the case where there are multiple writers to the file at the same
> > time, this is really hard to determine. The first one to call fsync will
> > see any stored error, and the rest get back 0. The processes with open
> > fds may not be associated with one another in any way. They could even
> > be in different containers, so ensuring coordination between all fsync
> > callers is not really an option.
> > 
> > One way to remedy this would be to track what file descriptor was used
> > to dirty the file, but that's rather cumbersome and would likely be
> > slow. However, there is a simpler way to improve the semantics here
> > without incurring too much overhead.
> > 
> > This set adds an errseq_t to struct address_space, and a corresponding
> > one is added to struct file. Writeback errors are recorded in the
> > mapping's errseq_t, and the one in struct file is used as the "since"
> > value.
> > 
> > This changes the semantics of the Linux fsync implementation such that
> > applications can now use it to determine whether there were any
> > writeback errors since fsync(fd) was last called (or since the file was
> > opened in the case of fsync having never been called).
> > 
> > Note that those writeback errors may have occurred when writing data
> > that was dirtied via an entirely different fd, but that's the case now
> > with the current mapping_set_error/filemap_check_error infrastructure.
> > This will at least prevent you from getting a false report of success.
> > 
> > The new behavior is still consistent with the POSIX spec, and is more
> > reliable for application developers. This patch just adds some basic
> > infrastructure for doing this. Later patches will change the existing
> > code to use this new infrastructure.
> > 
> > Signed-off-by: Jeff Layton 
> 
> Just one nit below. Otherwise the patch looks good to me. You can add:
> 
> Reviewed-by: Jan Kara 
> 
> > diff --git a/fs/file_table.c b/fs/file_table.c
> > index 954d510b765a..d6138b6411ff 100644
> > --- a/fs/file_table.c
> > +++ b/fs/file_table.c
> > @@ -168,6 +168,7 @@ struct file *alloc_file(const struct path *path, fmode_t mode,
> > file->f_path = *path;
> > file->f_inode = path->dentry->d_inode;
> > file->f_mapping = path->dentry->d_inode->i_mapping;
> > +   file->f_wb_err = filemap_sample_wb_error(file->f_mapping);
> 
> Why do you sample here when you also sample in do_dentry_open()? I didn't
> find any alloc_file() callers that would possibly care about writeback
> errors... 
> 
>   Honza

I basically used the setting of f_mapping as a guideline as to where to
sample it for initialization. My thinking was that if f_mapping ever
ended up different then you'd probably also want f_wb_err to be
resampled anyway.

I can drop this hunk if you think we don't need it.

-- 
Jeff Layton 


Re: [PATCH v4 14/27] fs: new infrastructure for writeback error handling and reporting

2017-05-10 Thread Jan Kara
On Tue 09-05-17 11:49:17, Jeff Layton wrote:
> Most filesystems currently use mapping_set_error and
> filemap_check_errors for setting and reporting/clearing writeback errors
> at the mapping level. filemap_check_errors is indirectly called from
> most of the filemap_fdatawait_* functions and from
> filemap_write_and_wait*. These functions are called from all sorts of
> contexts to wait on writeback to finish -- e.g. mostly in fsync, but
> also in truncate calls, getattr, etc.
> 
> The non-fsync callers are problematic. We should be reporting writeback
> errors during fsync, but many places spread over the tree clear out
> errors before they can be properly reported, or report errors at
> nonsensical times.
> 
> If I get -EIO on a stat() call, there is no reason for me to assume that
> it is because some previous writeback failed. The fact that it also
> clears out the error such that a subsequent fsync returns 0 is a bug,
> and a nasty one since that's potentially silent data corruption.
> 
> This patch adds a small bit of new infrastructure for setting and
> reporting errors during address_space writeback. While the above was my
> original impetus for adding this, I think it's also the case that
> current fsync semantics are just problematic for userland. Most
> applications that call fsync do so to ensure that the data they wrote
> has hit the backing store.
> 
> In the case where there are multiple writers to the file at the same
> time, this is really hard to determine. The first one to call fsync will
> see any stored error, and the rest get back 0. The processes with open
> fds may not be associated with one another in any way. They could even
> be in different containers, so ensuring coordination between all fsync
> callers is not really an option.
> 
> One way to remedy this would be to track what file descriptor was used
> to dirty the file, but that's rather cumbersome and would likely be
> slow. However, there is a simpler way to improve the semantics here
> without incurring too much overhead.
> 
> This set adds an errseq_t to struct address_space, and a corresponding
> one is added to struct file. Writeback errors are recorded in the
> mapping's errseq_t, and the one in struct file is used as the "since"
> value.
> 
> This changes the semantics of the Linux fsync implementation such that
> applications can now use it to determine whether there were any
> writeback errors since fsync(fd) was last called (or since the file was
> opened in the case of fsync having never been called).
> 
> Note that those writeback errors may have occurred when writing data
> that was dirtied via an entirely different fd, but that's the case now
> with the current mapping_set_error/filemap_check_error infrastructure.
> This will at least prevent you from getting a false report of success.
> 
> The new behavior is still consistent with the POSIX spec, and is more
> reliable for application developers. This patch just adds some basic
> infrastructure for doing this. Later patches will change the existing
> code to use this new infrastructure.
> 
> Signed-off-by: Jeff Layton 

Just one nit below. Otherwise the patch looks good to me. You can add:

Reviewed-by: Jan Kara 

> diff --git a/fs/file_table.c b/fs/file_table.c
> index 954d510b765a..d6138b6411ff 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -168,6 +168,7 @@ struct file *alloc_file(const struct path *path, fmode_t mode,
>   file->f_path = *path;
>   file->f_inode = path->dentry->d_inode;
>   file->f_mapping = path->dentry->d_inode->i_mapping;
> + file->f_wb_err = filemap_sample_wb_error(file->f_mapping);

Why do you sample here when you also sample in do_dentry_open()? I didn't
find any alloc_file() callers that would possibly care about writeback
errors... 
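
Also, to check my understanding of the new semantics: the fsync-side
check then boils down to something like the following (a sketch; the
helper name is mine, and the mapping-side field name is how I read the
series):

	static int file_check_wb_err(struct file *file)
	{
		/* report errors recorded in the mapping since this
		 * file last sampled/advanced its f_wb_err cursor */
		return errseq_check_and_advance(&file->f_mapping->wb_err,
						&file->f_wb_err);
	}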

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v4 13/27] lib: add errseq_t type and infrastructure for handling it

2017-05-10 Thread Jan Kara
On Tue 09-05-17 11:49:16, Jeff Layton wrote:
> An errseq_t is a way of recording errors in one place, and allowing any
> number of "subscribers" to tell whether an error has been set again
> since a previous time.
> 
> It's implemented as an unsigned 32-bit value that is managed with atomic
> operations. The low order bits are designated to hold an error code
> (max size of MAX_ERRNO). The upper bits are used as a counter.
> 
> The API works with consumers sampling an errseq_t value at a particular
> point in time. Later, that value can be used to tell whether new errors
> have been set since that time.
> 
> Note that there is a 1 in 512k risk of collisions here if new errors
> are being recorded frequently, since we have so few bits to use as a
> counter. To mitigate this, one bit is used as a flag to tell whether the
> value has been sampled since a new value was recorded. That allows
> us to avoid bumping the counter if no one has sampled it since it
> was last bumped.
> 
> Later patches will build on this infrastructure to change how writeback
> errors are tracked in the kernel.
> 
> Signed-off-by: Jeff Layton 

The patch looks good to me. Feel free to add:

Reviewed-by: Jan Kara 

Just two nits below:
...
> +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
> +{
> + int err = 0;
> + errseq_t old, new;
> +
> + /*
> +  * Most callers will want to use the inline wrapper to check this,
> +  * so that the common case of no error is handled without needing
> +  * to lock.
> +  */

I'm not sure which locking you are speaking about here. Is the comment
stale?

> + old = READ_ONCE(*eseq);
> + if (old != *since) {
> + /*
> +  * Set the flag and try to swap it into place if it has
> +  * changed.
> +  *
> +  * We don't care about the outcome of the swap here. If the
> +  * swap doesn't occur, then it has either been updated by a
> +  * writer who is bumping the seq count anyway, or another

"bumping the seq count anyway" part is not quite true. Writer may see
ERRSEQ_SEEN not set and so just update the error code and leave seq count
as is. But since you compare full errseq_t for equality, this works out as
designed...

> +  * reader who is just setting the "seen" flag. Either outcome
> +  * is OK, and we can advance "since" and return an error based
> +  * on what we have.
> +  */
> + new = old | ERRSEQ_SEEN;
> + if (new != old)
> + cmpxchg(eseq, old, new);
> + *since = new;
> + err = -(new & MAX_ERRNO);
> + }
> + return err;
> +}
> +EXPORT_SYMBOL(errseq_check_and_advance);
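
BTW, for reference, the intended calling pattern as I understand it
from this patch (a sketch using the API added here):

	errseq_t eseq = 0;

	/* subscriber: remember the current value */
	errseq_t since = errseq_sample(&eseq);

	/* writer: record an error; the low bits hold -errno, the
	 * upper bits carry the counter and the "seen" flag */
	errseq_set(&eseq, -EIO);

	/* subscriber: sees -EIO once, then 0 until a new error comes */
	int err = errseq_check_and_advance(&eseq, &since);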

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v4 13/27] lib: add errseq_t type and infrastructure for handling it

2017-05-10 Thread Jeff Layton
On Wed, 2017-05-10 at 08:03 +1000, NeilBrown wrote:
> On Tue, May 09 2017, Jeff Layton wrote:
> 
> > An errseq_t is a way of recording errors in one place, and allowing any
> > number of "subscribers" to tell whether an error has been set again
> > since a previous time.
> > 
> > It's implemented as an unsigned 32-bit value that is managed with atomic
> > operations. The low order bits are designated to hold an error code
> > (max size of MAX_ERRNO). The upper bits are used as a counter.
> > 
> > The API works with consumers sampling an errseq_t value at a particular
> > point in time. Later, that value can be used to tell whether new errors
> > have been set since that time.
> > 
> > Note that there is a 1 in 512k risk of collisions here if new errors
> > are being recorded frequently, since we have so few bits to use as a
> > counter. To mitigate this, one bit is used as a flag to tell whether the
> > value has been sampled since a new value was recorded. That allows
> > us to avoid bumping the counter if no one has sampled it since it
> > was last bumped.
> > 
> > Later patches will build on this infrastructure to change how writeback
> > errors are tracked in the kernel.
> > 
> > Signed-off-by: Jeff Layton 
> 
> I like that this is a separate lib/*.c - nicely structured too.
> 
> Reviewed-by: NeilBrown 
> 
> 

Thanks, yeah...it occurred to me that this scheme is not really specific
to writeback errors. While I can't think of another use-case for
errseq_t's right offhand, I think this makes for cleaner layering and
should make it easy to use in other ways should they arise.

-- 
Jeff Layton 


Re: [PATCH v4 12/27] cifs: set mapping error when page writeback fails in writepage or launder_pages

2017-05-10 Thread Jan Kara
On Tue 09-05-17 11:49:15, Jeff Layton wrote:
> Signed-off-by: Jeff Layton 
> Reviewed-by: Christoph Hellwig 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/cifs/file.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 21d404535739..0bee7f8d91ad 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -2234,14 +2234,16 @@ cifs_writepage_locked(struct page *page, struct writeback_control *wbc)
>   set_page_writeback(page);
>  retry_write:
>   rc = cifs_partialpagewrite(page, 0, PAGE_SIZE);
> - if (rc == -EAGAIN && wbc->sync_mode == WB_SYNC_ALL)
> - goto retry_write;
> - else if (rc == -EAGAIN)
> + if (rc == -EAGAIN) {
> + if (wbc->sync_mode == WB_SYNC_ALL)
> + goto retry_write;
>   redirty_page_for_writepage(wbc, page);
> - else if (rc != 0)
> + } else if (rc != 0) {
>   SetPageError(page);
> - else
> + mapping_set_error(page->mapping, rc);
> + } else {
>   SetPageUptodate(page);
> + }
>   end_page_writeback(page);
>   put_page(page);
>   free_xid(xid);
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v4 11/27] fuse: set mapping error in writepage_locked when it fails

2017-05-10 Thread Jan Kara
On Tue 09-05-17 11:49:14, Jeff Layton wrote:
> This ensures that we see errors on fsync when writeback fails.
> 
> Signed-off-by: Jeff Layton 
> Reviewed-by: Christoph Hellwig 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/fuse/file.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index ec238fb5a584..07d0efcb050c 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1669,6 +1669,7 @@ static int fuse_writepage_locked(struct page *page)
>  err_free:
>   fuse_request_free(req);
>  err:
> + mapping_set_error(page->mapping, error);
>   end_page_writeback(page);
>   return error;
>  }
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v4 01/27] fs: remove unneeded forward definition of mm_struct from fs.h

2017-05-10 Thread Jan Kara
On Tue 09-05-17 11:49:04, Jeff Layton wrote:
> Signed-off-by: Jeff Layton 

Looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  include/linux/fs.h | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7251f7bb45e8..38adefd8e2a0 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1252,8 +1252,6 @@ extern void f_delown(struct file *filp);
>  extern pid_t f_getown(struct file *filp);
>  extern int send_sigurg(struct fown_struct *fown);
>  
> -struct mm_struct;
> -
>  /*
>   *   Umount options
>   */
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


[PATCH 2/5] mmc: core: Allocate per-request data using the block layer core

2017-05-10 Thread Linus Walleij
The mmc_queue_req is a per-request state container the MMC core uses
to carry bounce buffers, pointers to asynchronous requests and so on.
It is currently allocated as a static array of objects; as a request
comes in, a mmc_queue_req is assigned to it and used during the
lifetime of the request.

This is backwards compared to how other block layer drivers work:
they usually let the block core provide a per-request struct that gets
allocated right behind the struct request, and which can be obtained
using the blk_mq_rq_to_pdu() helper. (The _mq_ infix in this function
name is misleading: it is used by both the old and the MQ block
layer.)

The per-request struct is allocated to the size stored in the queue
variable .cmd_size, initialized using the .init_rq_fn() callback and
cleaned up using .exit_rq_fn().

This patch makes the MMC core rely on that block layer mechanism to
allocate the per-request mmc_queue_req state container.

Doing this makes a lot of complicated queue handling go away. We only
need to keep the .qcnt counter that tracks how many requests are
currently being processed by the MMC layer. The MQ block layer will
replace this too once we transition to it.

Doing this refactoring is necessary to move the ioctl() operations
into custom block layer requests tagged with REQ_OP_DRV_[IN|OUT]
instead of the custom code using the BigMMCHostLock that we have
today: those require that per-request data be obtainable easily from
a request after creating a custom request with e.g.:

struct request *rq = blk_get_request(q, REQ_OP_DRV_IN, __GFP_RECLAIM);
struct mmc_queue_req *mq_rq = req_to_mq_rq(rq);

And this is not possible with the current construction, as the request
is not immediately assigned the per-request state container, but
instead it gets assigned when the request finally enters the MMC
queue, which is way too late for custom requests.
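
For reference, the queue setup that enables this boils down to
something like the following (a sketch based on the description
above; helper names follow this series, error handling trimmed):

	/* let the block core allocate a struct mmc_queue_req
	 * right behind every struct request on this queue */
	mq->queue = blk_alloc_queue(GFP_KERNEL);
	if (!mq->queue)
		return -ENOMEM;
	mq->queue->cmd_size = sizeof(struct mmc_queue_req);
	mq->queue->init_rq_fn = mmc_init_request;
	mq->queue->exit_rq_fn = mmc_exit_request;
	blk_init_allocated_queue(mq->queue);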

Signed-off-by: Linus Walleij 
---
 drivers/mmc/core/block.c |  38 ++--
 drivers/mmc/core/queue.c | 222 +--
 drivers/mmc/core/queue.h |  22 ++---
 include/linux/mmc/card.h |   2 -
 4 files changed, 80 insertions(+), 204 deletions(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 8273b078686d..be782b8d4a0d 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -129,13 +129,6 @@ static inline int mmc_blk_part_switch(struct mmc_card *card,
  struct mmc_blk_data *md);
 static int get_card_status(struct mmc_card *card, u32 *status, int retries);
 
-static void mmc_blk_requeue(struct request_queue *q, struct request *req)
-{
-   spin_lock_irq(q->queue_lock);
-   blk_requeue_request(q, req);
-   spin_unlock_irq(q->queue_lock);
-}
-
 static struct mmc_blk_data *mmc_blk_get(struct gendisk *disk)
 {
struct mmc_blk_data *md;
@@ -1642,7 +1635,7 @@ static void mmc_blk_rw_cmd_abort(struct mmc_queue *mq, struct mmc_card *card,
if (mmc_card_removed(card))
req->rq_flags |= RQF_QUIET;
while (blk_end_request(req, -EIO, blk_rq_cur_bytes(req)));
-   mmc_queue_req_free(mq, mqrq);
+   mq->qcnt--;
 }
 
 /**
@@ -1662,7 +1655,7 @@ static void mmc_blk_rw_try_restart(struct mmc_queue *mq, struct request *req,
if (mmc_card_removed(mq->card)) {
req->rq_flags |= RQF_QUIET;
blk_end_request_all(req, -EIO);
-   mmc_queue_req_free(mq, mqrq);
+   mq->qcnt--; /* FIXME: just set to 0? */
return;
}
/* Else proceed and try to restart the current async request */
@@ -1685,12 +1678,8 @@ static void mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *new_req)
bool req_pending = true;
 
if (new_req) {
-   mqrq_cur = mmc_queue_req_find(mq, new_req);
-   if (!mqrq_cur) {
-   WARN_ON(1);
-   mmc_blk_requeue(mq->queue, new_req);
-   new_req = NULL;
-   }
+   mqrq_cur = req_to_mq_rq(new_req);
+   mq->qcnt++;
}
 
if (!mq->qcnt)
@@ -1764,12 +1753,12 @@ static void mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *new_req)
if (req_pending)
mmc_blk_rw_cmd_abort(mq, card, old_req, mq_rq);
else
-   mmc_queue_req_free(mq, mq_rq);
+   mq->qcnt--;
mmc_blk_rw_try_restart(mq, new_req, mqrq_cur);
return;
}
if (!req_pending) {
-   mmc_queue_req_free(mq, mq_rq);
+   mq->qcnt--;
mmc_blk_rw_try_restart(mq, new_req, mqrq_cur);
return;
}
@@ 

[PATCH 1/5] mmc: core: Delete bounce buffer Kconfig option

2017-05-10 Thread Linus Walleij
This option is activated by all multiplatform configs and whatnot,
so we almost always have it turned on, and the memory it saves is
negligible, even more so moving forward. The actual bounce buffer
only gets allocated when used; the only thing the ifdefs save is a
little bit of code.

It is highly improper to have this as a compile-time Kconfig option;
make it a pure runtime thing and let the host decide whether we use
bounce buffers. We add a new property "disable_bounce" to the host
struct.

Notice that mmc_queue_calc_bouncesz() already disables the
bounce buffers if host->max_segs != 1, so any arch that has a
maximum number of segments higher than 1 will have bounce
buffers disabled.
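
With the new flag the runtime check becomes something like this (a
sketch; "disable_bounce" is the new host field added by this patch):

	/* in mmc_queue_calc_bouncesz(): bouncing only ever helps
	 * single-segment hosts, and hosts can now opt out at runtime */
	if (host->max_segs != 1 || host->disable_bounce)
		return 0;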

The option CONFIG_MMC_BLOCK_BOUNCE is default y so the
majority of platforms in the kernel already have it on, and
it then gets turned off at runtime since most of these have
a host->max_segs > 1. The few exceptions that have
host->max_segs == 1 and still turn off the bounce buffering
are those that disable it in their defconfig.

Those are the following:

arch/arm/configs/colibri_pxa300_defconfig
arch/arm/configs/zeus_defconfig
- Uses MMC_PXA, drivers/mmc/host/pxamci.c
- Sets host->max_segs = NR_SG, which is 1
- This needs its bounce buffer deactivated so we set
  host->disable_bounce to true in the host driver

arch/arm/configs/davinci_all_defconfig
- Uses MMC_DAVINCI, drivers/mmc/host/davinci_mmc.c
- This driver sets host->max_segs to MAX_NR_SG, which is 16
- That means this driver anyways disabled bounce buffers
- No special action needed for this platform

arch/arm/configs/lpc32xx_defconfig
arch/arm/configs/nhk8815_defconfig
arch/arm/configs/u300_defconfig
- Uses MMC_ARMMMCI, drivers/mmc/host/mmci.[c|h]
- This driver by default sets host->max_segs to NR_SG,
  which is 128, unless a DMA engine is used, and in that case
  the number of segments are also > 1
- That means this driver already disables bounce buffers
- No special action needed for these platforms

arch/arm/configs/sama5_defconfig
- Uses MMC_SDHCI, MMC_SDHCI_PLTFM, MMC_SDHCI_OF_AT91, MMC_ATMELMCI
- Uses drivers/mmc/host/sdhci.c
- Normally sets host->max_segs to SDHCI_MAX_SEGS which is 128 and
  thus disables bounce buffers
- Sets host->max_segs to 1 if SDHCI_USE_SDMA is set
- SDHCI_USE_SDMA is only set by SDHCI on PCI adapters
- That means that for this platform bounce buffers are already
  disabled at runtime
- No special action needed for this platform

arch/blackfin/configs/CM-BF533_defconfig
arch/blackfin/configs/CM-BF537E_defconfig
- Uses MMC_SPI (a simple MMC card connected on SPI pins)
- Uses drivers/mmc/host/mmc_spi.c
- Sets host->max_segs to MMC_SPI_BLOCKSATONCE which is 128
- That means this platform already disables bounce buffers at
  runtime
- No special action needed for these platforms

arch/mips/configs/cavium_octeon_defconfig
- Uses MMC_CAVIUM_OCTEON, drivers/mmc/host/cavium.c
- Sets host->max_segs to 16 or 1
- Setting host->disable_bounce to be sure for the 1 case

arch/mips/configs/qi_lb60_defconfig
- Uses MMC_JZ4740, drivers/mmc/host/jz4740_mmc.c
- This sets host->max_segs to 128 so bounce buffers are
  already runtime disabled
- No action needed for this platform

It would be interesting to come up with a list of the platforms
that actually end up using bounce buffers. I have not been
able to infer such a list, but it occurs when
host->max_segs == 1 and the bounce buffering is not explicitly
disabled.

Signed-off-by: Linus Walleij 
---
 drivers/mmc/core/Kconfig  | 18 --
 drivers/mmc/core/queue.c  | 15 +--
 drivers/mmc/host/cavium.c |  3 +++
 drivers/mmc/host/pxamci.c |  6 ++
 include/linux/mmc/host.h  |  1 +
 5 files changed, 11 insertions(+), 32 deletions(-)

diff --git a/drivers/mmc/core/Kconfig b/drivers/mmc/core/Kconfig
index fc1ecdaaa9ca..42e89060cd41 100644
--- a/drivers/mmc/core/Kconfig
+++ b/drivers/mmc/core/Kconfig
@@ -61,24 +61,6 @@ config MMC_BLOCK_MINORS
 
  If unsure, say 8 here.
 
-config MMC_BLOCK_BOUNCE
-   bool "Use bounce buffer for simple hosts"
-   depends on MMC_BLOCK
-   default y
-   help
- SD/MMC is a high latency protocol where it is crucial to
- send large requests in order to get high performance. Many
- controllers, however, are restricted to continuous memory
- (i.e. they can't do scatter-gather), something the kernel
- rarely can provide.
-
- Say Y here to help these restricted hosts by bouncing
- requests back and forth from a large buffer. You will get
- a big performance gain at the cost of up to 64 KiB of
- physical memory.
-
- If unsure, say Y here.
-
 config SDIO_UART
tristate "SDIO UART/GPS class support"
depends on TTY
diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 5c37b6be3e7b..545466342fb1 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -219,7 

Re: [PATCH 1/4] blk-mq: introduce BLK_MQ_F_SCHED_USE_HW_TAG

2017-05-10 Thread Ming Lei
Hi Jens,

On Thu, May 04, 2017 at 08:06:15AM -0600, Jens Axboe wrote:
...
> 
> No we do not. 256 is a LOT. I realize most of the devices expose 64K *
> num_hw_queues of depth. Expecting to utilize all that is insane.
> Internally, these devices have nowhere near that amount of parallelism.
> Hence we'd go well beyond the latency knee in the curve if we just allow
> tons of writeback to queue up, for example. Reaching peak performance on
> these devices do not require more than 256 requests, in fact it can be
> done much sooner. For a default setting, I'd actually argue that 256 is
> too much, and that we should set it lower.

After studying SSD and NVMe a bit, I think your point of '256 is a LOT'
is correct:

1) inside an SSD, the channel count isn't big (often about 10 in
high-end SSDs), so 256 should be big enough to maximize the
utilization of each channel, even when multi-bank, multi-die or
multi-plane layouts are considered.

2) For NVMe, the IO queue depth (size) is at most 64K according to
the spec; the driver currently limits it to at most 1024. The queue
itself is allocated entirely from system memory, so a big queue size
won't increase NVMe chip cost.

So I think we can respect .nr_requests by resizing the hw tags in
blk_mq_init_sched(), as suggested by Omar. If you don't object to
that, I will send out V3 soon.
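
Roughly something like this (a sketch only, the actual V3 may differ):

	/* cap the usable scheduler depth at q->nr_requests instead
	 * of exposing the whole hardware tag space */
	if (q->nr_requests < q->tag_set->queue_depth)
		blk_mq_update_nr_requests(q, q->nr_requests);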

Thanks,
Ming