Re: [PATCH] block: don't warn when calling fsync on read-only block devices

2018-09-03 Thread Mikulas Patocka



On Wed, 22 Aug 2018, Jens Axboe wrote:

> On 8/22/18 10:23 AM, Mikulas Patocka wrote:
> > It is possible to call fsync on a read-only handle (for example, fsck.ext2 
> > does it when doing read-only check), and this call results in kernel 
> > warning. This bug was introduced by the commit 721c7fc701c7 "block: fail 
> > op_is_write() requests to read-only partitions".
> 
> Similar patch is already queued up, it'll go into Linus's tree before
> -rc1.
> 
> -- 
> Jens Axboe

I think you mean the patch b089cfd95d32638335c551651a8e00fd2c4edb0b, but 
it doesn't work.

The first problem is that bio_op returns just the opcode, but op_is_flush 
expects the full flags word as its parameter. Because bio_op strips off the 
flags, the op_is_flush condition is never true and the warning is not 
suppressed.

Another problem is that the flags REQ_FUA and REQ_PREFLUSH may be 
superimposed on a regular write bio, and in that case the warning should 
still be triggered. We want to suppress the warning only if bio_sectors is 
zero (i.e. if the flush request is not superimposed on a regular write).

Mikulas



From: Mikulas Patocka 
Subject: [PATCH] block: don't warn when doing fsync on read-only devices

It is possible to call fsync on a read-only handle (for example, fsck.ext2
does it when doing a read-only check), and this call results in a kernel
warning.

The patch b089cfd95d32 ("block: don't warn for flush on read-only device")
attempted to disable the warning, but it is buggy and does not work
(op_is_flush tests the flags, but bio_op strips them off).

Signed-off-by: Mikulas Patocka 
Fixes: 721c7fc701c7 ("block: fail op_is_write() requests to read-only partitions")
Cc: sta...@vger.kernel.org  # 4.18

---
 block/blk-core.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Index: linux-2.6/block/blk-core.c
===
--- linux-2.6.orig/block/blk-core.c 2018-09-03 23:32:38.55000 +0200
+++ linux-2.6/block/blk-core.c  2018-09-03 23:33:11.96000 +0200
@@ -2163,9 +2163,12 @@ static inline bool bio_check_ro(struct b
 {
const int op = bio_op(bio);
 
-   if (part->policy && (op_is_write(op) && !op_is_flush(op))) {
+   if (part->policy && op_is_write(op)) {
char b[BDEVNAME_SIZE];
 
+   if (op_is_flush(bio->bi_opf) && !bio_sectors(bio))
+   return false;
+
WARN_ONCE(1,
   "generic_make_request: Trying to write "
"to read-only block-device %s (partno %d)\n",


[PATCH] block: don't warn when calling fsync on read-only block devices

2018-08-22 Thread Mikulas Patocka
It is possible to call fsync on a read-only handle (for example, fsck.ext2 
does it when doing a read-only check), and this call results in a kernel 
warning. This bug was introduced by the commit 721c7fc701c7 "block: fail 
op_is_write() requests to read-only partitions".

Signed-off-by: Mikulas Patocka 
Fixes: 721c7fc701c7 ("block: fail op_is_write() requests to read-only partitions")
Cc: sta...@vger.kernel.org  # 4.18

---
 block/blk-core.c  |3 +++
 drivers/md/dm-verity-target.c |   33 +
 2 files changed, 20 insertions(+), 16 deletions(-)

Index: linux-2.6/block/blk-core.c
===
--- linux-2.6.orig/block/blk-core.c 2018-08-15 16:47:18.93000 +0200
+++ linux-2.6/block/blk-core.c  2018-08-22 16:37:19.68000 +0200
@@ -2155,6 +2155,9 @@ static inline bool bio_check_ro(struct b
if (part->policy && op_is_write(bio_op(bio))) {
char b[BDEVNAME_SIZE];
 
+   if (op_is_flush(bio->bi_opf) && !bio_sectors(bio))
+   return false;
+
WARN_ONCE(1,
   "generic_make_request: Trying to write "
"to read-only block-device %s (partno %d)\n",


[PATCH 2/2] loop: enable compat ioctl LOOP_SET_BLOCK_SIZE

2018-05-03 Thread Mikulas Patocka
Enable compat ioctl LOOP_SET_BLOCK_SIZE.

Signed-off-by: Mikulas Patocka <mpato...@redhat.com>
Fixes: 89e4fdecb51c ("loop: add ioctl for changing logical block size")
Cc: sta...@vger.kernel.org  # v4.14+

---
 drivers/block/loop.c |1 +
 1 file changed, 1 insertion(+)

Index: linux-2.6/drivers/block/loop.c
===
--- linux-2.6.orig/drivers/block/loop.c 2018-05-03 19:48:04.0 +0200
+++ linux-2.6/drivers/block/loop.c  2018-05-03 19:48:32.0 +0200
@@ -1600,6 +1600,7 @@ static int lo_compat_ioctl(struct block_
case LOOP_SET_FD:
case LOOP_CHANGE_FD:
case LOOP_SET_DIRECT_IO:
+   case LOOP_SET_BLOCK_SIZE:
err = lo_ioctl(bdev, mode, cmd, arg);
break;
default:


[PATCH 1/2] loop: enable compat ioctl LOOP_SET_DIRECT_IO

2018-05-03 Thread Mikulas Patocka
Enable compat ioctl LOOP_SET_DIRECT_IO.

Signed-off-by: Mikulas Patocka <mpato...@redhat.com>
Fixes: ab1cb278bc70 ("block: loop: introduce ioctl command of LOOP_SET_DIRECT_IO")
Cc: sta...@vger.kernel.org  # v4.4+

---
 drivers/block/loop.c |2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/drivers/block/loop.c
===
--- linux-2.6.orig/drivers/block/loop.c 2018-04-30 00:30:38.0 +0200
+++ linux-2.6/drivers/block/loop.c  2018-05-03 18:49:44.0 +0200
@@ -1599,6 +1599,7 @@ static int lo_compat_ioctl(struct block_
arg = (unsigned long) compat_ptr(arg);
case LOOP_SET_FD:
case LOOP_CHANGE_FD:
+   case LOOP_SET_DIRECT_IO:
err = lo_ioctl(bdev, mode, cmd, arg);
break;
default:


Re: [PATCH] block: use 32-bit blk_status_t on Alpha

2018-03-21 Thread Mikulas Patocka


On Wed, 21 Mar 2018, Jens Axboe wrote:

> On 3/21/18 10:42 AM, Mikulas Patocka wrote:
> > Early alpha processors cannot write a single byte or word; they read 8
> > bytes, modify the value in registers and write back 8 bytes.
> > 
> > The type blk_status_t is defined as one byte, it is often written
> > asynchronously by I/O completion routines, this asynchronous modification
> > can corrupt content of nearby bytes if these nearby bytes can be written
> > simultaneously by another CPU.
> > 
> > - one example of such corruption is the structure dm_io where
> >   "blk_status_t status" is written by an asynchronous completion routine
> >   and "atomic_t io_count" is modified synchronously
> > - another example is the structure dm_buffer where "unsigned hold_count"
> >   is modified synchronously from process context and "blk_status_t
> >   write_error" is modified asynchronously from bio completion routine
> > 
> > This patch fixes the bug by changing the type blk_status_t to 32 bits if
> > we are on Alpha and if we are compiling for a processor that doesn't have
> > the byte-word-extension.
> 
> That's nasty. Is alpha the only problematic arch here?

Yes. All the other architectures supported by Linux have byte writes.

> As to the patch in question, normally I'd just say we should make it
> unconditionally u32. But we pack so nicely in the bio, and I don't think
> the bio itself has this issue as the rest of the members that share this
> word are all set before the bio is submitted. But callers embedding
> the status var in other structures don't necessarily have that
> guarantee, as your dm examples show.
> 
> -- 
> Jens Axboe

Keeping blk_status_t 8-bit for most architectures saves a few bytes in 
some of the device mapper structures.

Mikulas


[PATCH] Fix slab name "biovec-(1<<(21-12))"

2018-03-21 Thread Mikulas Patocka
I'm getting a slab named "biovec-(1<<(21-12))". It is caused by unintended
expansion of the macro BIO_MAX_PAGES. This patch renames it to biovec-max.

Signed-off-by: Mikulas Patocka <mpato...@redhat.com>
Cc: sta...@vger.kernel.org  # v4.14+

---
 block/bio.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/block/bio.c
===
--- linux-2.6.orig/block/bio.c  2018-02-14 20:23:35.648255000 +0100
+++ linux-2.6/block/bio.c   2018-03-14 14:51:35.53000 +0100
@@ -43,9 +43,9 @@
  * break badly! cannot be bigger than what you can fit into an
  * unsigned short
  */
-#define BV(x) { .nr_vecs = x, .name = "biovec-"__stringify(x) }
+#define BV(x, n) { .nr_vecs = x, .name = "biovec-"#n }
 static struct biovec_slab bvec_slabs[BVEC_POOL_NR] __read_mostly = {
-   BV(1), BV(4), BV(16), BV(64), BV(128), BV(BIO_MAX_PAGES),
+   BV(1, 1), BV(4, 4), BV(16, 16), BV(64, 64), BV(128, 128), BV(BIO_MAX_PAGES, max),
 };
 #undef BV
 


[PATCH] block: use 32-bit blk_status_t on Alpha

2018-03-21 Thread Mikulas Patocka
Early alpha processors cannot write a single byte or word; they read 8
bytes, modify the value in registers and write back 8 bytes.

The type blk_status_t is defined as one byte, it is often written
asynchronously by I/O completion routines, this asynchronous modification
can corrupt content of nearby bytes if these nearby bytes can be written
simultaneously by another CPU.

- one example of such corruption is the structure dm_io where
  "blk_status_t status" is written by an asynchronous completion routine
  and "atomic_t io_count" is modified synchronously
- another example is the structure dm_buffer where "unsigned hold_count"
  is modified synchronously from process context and "blk_status_t
  write_error" is modified asynchronously from bio completion routine

This patch fixes the bug by changing the type blk_status_t to 32 bits if
we are on Alpha and if we are compiling for a processor that doesn't have
the byte-word-extension.

Signed-off-by: Mikulas Patocka <mpato...@redhat.com>
Cc: sta...@vger.kernel.org  # 4.13+

---
 include/linux/blk_types.h | 5 +++++
 1 file changed, 5 insertions(+)

Index: linux-2.6/include/linux/blk_types.h
===
--- linux-2.6.orig/include/linux/blk_types.h	2018-02-14 20:24:42.038255000 +0100
+++ linux-2.6/include/linux/blk_types.h 2018-03-21 15:04:54.96000 +0100
@@ -20,8 +20,13 @@ typedef void (bio_end_io_t) (struct bio
 
 /*
  * Block error status values.  See block/blk-core:blk_errors for the details.
+ * Alpha cannot write a byte atomically, so we need to use a 32-bit value.
  */
+#if defined(CONFIG_ALPHA) && !defined(__alpha_bwx__)
+typedef u32 __bitwise blk_status_t;
+#else
 typedef u8 __bitwise blk_status_t;
+#endif
#define BLK_STS_OK 0
#define BLK_STS_NOTSUPP ((__force blk_status_t)1)
#define BLK_STS_TIMEOUT ((__force blk_status_t)2)


Re: [PATCH] wbt: fix incorrect throttling due to flush latency

2018-02-05 Thread Mikulas Patocka


On Mon, 5 Feb 2018, Jens Axboe wrote:

> On 2/5/18 12:22 PM, Jens Axboe wrote:
> > On 2/5/18 12:11 PM, Mikulas Patocka wrote:
> >> I have a workload where one process sends many asynchronous write bios
> >> (without waiting for them) and another process sends synchronous flush
> >> bios. During this workload, writeback throttling throttles down to one
> >> outstanding bio, and this incorrect throttling causes performance
> >> degradation (all write bios sleep in __wbt_wait and they couldn't be sent
> >> in parallel).
> >>
> >> The reason for this throttling is that wbt_data_dir counts flush requests
> >> in the read bucket. The flush requests (that take quite a long time)
> >> trigger this condition repeatedly:
> >>if (stat[READ].min > rwb->min_lat_nsec)
> >> and that condition causes scale down to one outstanding request, despite
> >> the fact that there are no read bios at all.
> >>
> >> A similar problem could also show up with REQ_OP_ZONE_REPORT and
> >> REQ_OP_ZONE_RESET - they are also counted in the read bucket.
> >>
> >> This patch fixes the function wbt_data_dir, so that only REQ_OP_READ 
> >> requests are counted in the read bucket. The patch improves SATA 
> >> write+flush throughput from 130MB/s to 350MB/s.
> > 
> > Good catch. But I do wonder if we should account them at all. It
> > might make sense to account flushes as writes, but we really
> > should not account anything that isn't regular read/write IO 
> > related.
> > 
> > Something like the below? Potentially including REQ_OP_FLUSH in the
> > write latencies as well.
> 
> Thinking about it, flush should just be counted as writes, and the
> rest disregarded. Ala below.
> 
> 
> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
> index ae8de9780085..f92fc84b5e2c 100644
> --- a/block/blk-wbt.c
> +++ b/block/blk-wbt.c
> @@ -697,7 +697,15 @@ u64 wbt_default_latency_nsec(struct request_queue *q)
>  
>  static int wbt_data_dir(const struct request *rq)
>  {
> - return rq_data_dir(rq);
> + const int op = req_op(rq);
> +
> + if (op == REQ_OP_READ)
> + return READ;
> + else if (op == REQ_OP_WRITE || op == REQ_OP_FLUSH)
> + return WRITE;
> +
> + /* don't account */
> + return -1;
>  }
>  
>  int wbt_init(struct request_queue *q)
> 
> -- 
> Jens Axboe

Yes, this works.

BTW. write throttling still causes some performance degradation (350MB/s 
vs. 500MB/s without throttling) - should there be some logic that disables 
write throttling if no sync requests are received?

Something like - if there's sync read, enable write throttling - if there 
were no sync reads in the last 5 seconds, disable write throttling?

Mikulas


[PATCH] wbt: fix incorrect throttling due to flush latency

2018-02-05 Thread Mikulas Patocka
I have a workload where one process sends many asynchronous write bios
(without waiting for them) and another process sends synchronous flush
bios. During this workload, writeback throttling throttles down to one
outstanding bio, and this incorrect throttling causes performance
degradation (all write bios sleep in __wbt_wait and they couldn't be sent
in parallel).

The reason for this throttling is that wbt_data_dir counts flush requests
in the read bucket. The flush requests (that take quite a long time)
trigger this condition repeatedly:
if (stat[READ].min > rwb->min_lat_nsec)
and that condition causes scale down to one outstanding request, despite
the fact that there are no read bios at all.

A similar problem could also show up with REQ_OP_ZONE_REPORT and
REQ_OP_ZONE_RESET - they are also counted in the read bucket.

This patch fixes the function wbt_data_dir, so that only REQ_OP_READ 
requests are counted in the read bucket. The patch improves SATA 
write+flush throughput from 130MB/s to 350MB/s.

Signed-off-by: Mikulas Patocka <mpato...@redhat.com>
Cc: sta...@vger.kernel.org  # v4.12+

---
 block/blk-wbt.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Index: linux-stable/block/blk-wbt.c
===
--- linux-stable.orig/block/blk-wbt.c   2018-02-05 12:30:09.606063908 -0500
+++ linux-stable/block/blk-wbt.c2018-02-05 13:50:02.784213501 -0500
@@ -697,7 +697,11 @@ u64 wbt_default_latency_nsec(struct requ
 
 static int wbt_data_dir(const struct request *rq)
 {
-   return rq_data_dir(rq);
+   /*
+* Flushes must be counted in the write bucket, so that high flush
+* latencies don't cause scale down.
+*/
+   return req_op(rq) == REQ_OP_READ ? READ : WRITE;
 }
 
 int wbt_init(struct request_queue *q)


Re: [PATCH] schedule: use unlikely()

2017-11-27 Thread Mikulas Patocka


On Sat, 25 Nov 2017, Ingo Molnar wrote:

> 
> * Mikulas Patocka <mpato...@redhat.com> wrote:
> 
> > On Fri, 24 Nov 2017, Ingo Molnar wrote:
> > 
> > > > +   return unlikely(plug != NULL) &&
> > > > (!list_empty(&plug->list) ||
> > > >  !list_empty(&plug->mq_list) ||
> > > >  !list_empty(&plug->cb_list));
> > > 
> > > That's an unrelated change.
> > 
> > It is related, because this function gets inlined into schedule().
> 
> That needs to be in the changelog and the patch needs to be Cc:-ed to the 
> affected maintainer as well.
> 
> > The static branch prediction in gcc assumes that pointers are usually not 
> > NULL, but in this case tsk->pi_blocked_on and tsk->plug are usually NULL.
> 
> That too should be in the changelog.
> 
> Thanks,
> 
>   Ingo
> 

OK - here I resend it with more verbose message:

Mikulas



From: Mikulas Patocka <mpato...@redhat.com>
Subject: [PATCH] schedule: add unlikely()

A small patch for schedule() so that the code goes straight through in the 
common case. The patch saves two jumps in the assembler listing. The static 
branch prediction in GCC assumes that pointers are usually non-NULL when 
they are compared against NULL, but in this case, tsk->plug is usually 
NULL (the function blk_needs_flush_plug gets inlined into schedule()) and 
tsk->pi_blocked_on is also usually NULL (it is checked by 
tsk_is_pi_blocked). This patch adds the macro unlikely() to override the 
static prediction.

Signed-off-by: Mikulas Patocka <mpato...@redhat.com>

---
 include/linux/blkdev.h |2 +-
 kernel/sched/core.c|2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/blkdev.h
===
--- linux-2.6.orig/include/linux/blkdev.h
+++ linux-2.6/include/linux/blkdev.h
@@ -1308,7 +1308,7 @@ static inline bool blk_needs_flush_plug(
 {
struct blk_plug *plug = tsk->plug;
 
-   return plug &&
+   return unlikely(plug != NULL) &&
(!list_empty(&plug->list) ||
 !list_empty(&plug->mq_list) ||
 !list_empty(&plug->cb_list));
Index: linux-2.6/kernel/sched/core.c
===
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -3405,7 +3405,7 @@ void __noreturn do_task_dead(void)
 
 static inline void sched_submit_work(struct task_struct *tsk)
 {
-   if (!tsk->state || tsk_is_pi_blocked(tsk))
+   if (!tsk->state || unlikely(tsk_is_pi_blocked(tsk)))
return;
/*
 * If we are going to sleep and we have plugged IO queued,


Re: [dm-devel] new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-22 Thread Mikulas Patocka


On Wed, 22 Nov 2017, NeilBrown wrote:

> On Tue, Nov 21 2017, Mikulas Patocka wrote:
> 
> > On Tue, 21 Nov 2017, Mike Snitzer wrote:
> >
> >> On Tue, Nov 21 2017 at  4:23pm -0500,
> >> Mikulas Patocka <mpato...@redhat.com> wrote:
> >> 
> >> > This is not correct:
> >> > 
> >> >2206 static void dm_wq_work(struct work_struct *work)
> >> >2207 {
> >> >2208 struct mapped_device *md = container_of(work, struct mapped_device, work);
> >> >2209 struct bio *bio;
> >> >2210 int srcu_idx;
> >> >2211 struct dm_table *map;
> >> >2212
> >> >2213 if (!bio_list_empty(&md->rescued)) {
> >> >2214 struct bio_list list;
> >> >2215 spin_lock_irq(&md->deferred_lock);
> >> >2216 list = md->rescued;
> >> >2217 bio_list_init(&md->rescued);
> >> >2218 spin_unlock_irq(&md->deferred_lock);
> >> >2219 while ((bio = bio_list_pop(&list)))
> >> >2220 generic_make_request(bio);
> >> >2221 }
> >> >
> >> >2223 map = dm_get_live_table(md, &srcu_idx);
> >> >2224
> >> >2225 while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> >> >2226 spin_lock_irq(&md->deferred_lock);
> >> >2227 bio = bio_list_pop(&md->deferred);
> >> >2228 spin_unlock_irq(&md->deferred_lock);
> >> >2229
> >> >2230 if (!bio)
> >> >2231 break;
> >> >2232
> >> >2233 if (dm_request_based(md))
> >> >2234 generic_make_request(bio);
> >> >2235 else
> >> >2236 __split_and_process_bio(md, map, bio);
> >> >2237 }
> >> >2238
> >> >2239 dm_put_live_table(md, srcu_idx);
> >> >2240 }
> >> > 
> >> > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
> >> > will not process md->rescued list.
> >> 
> >> Can you elaborate further?  We cannot be "in dm_wq_work in
> >> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
> >> of scheduling away from __split_and_process_bio?
> >> 
> >> The more detail you can share the better.
> >
> > Suppose this scenario:
> >
> > * dm_wq_work calls __split_and_process_bio
> > * __split_and_process_bio eventually reaches the function snapshot_map
> > * snapshot_map attempts to take the snapshot lock
> >
> > * the snapshot lock could be released only if some bios submitted by the 
> > snapshot driver to the underlying device complete
> > * the bios submitted to the underlying device were already offloaded by 
> > some other task and they are waiting on the list md->rescued
> > * the bios waiting on md->rescued are not processed, because dm_wq_work is 
> > blocked in snapshot_map (called from __split_and_process_bio)
> 
> Yes, I think you are right. 
> 
> I think the solution is to get rid of the dm_offload() infrastructure
> and make it not necessary.
> i.e. discard my patches
> dm: prepare to discontinue use of BIOSET_NEED_RESCUER
> and
> dm: revise 'rescue' strategy for bio-based bioset allocations
> 
> And build on "dm: ensure bio submission follows a depth-first tree walk"
> which was written after those and already makes dm_offload() less
> important.
> 
> Since that "depth-first" patch, every request to the dm device, after
> the initial splitting, allocates just one dm_target_io structure, and
> makes just one __map_bio() call, and so will behave exactly the way
> generic_make_request() expects and copes with - thus avoiding awkward
> dependencies and deadlocks.  Except
> 
> a/ If any target defines ->num_write_bios() to return >1,
>__clone_and_map_data_bio() will make multiple calls to alloc_tio()
>and __map_bio(), which might need rescuing.
>But no target defines num_write_bios, and none have since it was
>removed from dm-cache 4.5 years ago.
>Can we discard num_write_bios??
> 
> b/ If any target sets any of num_{flush,discard,write_same,write_zeroes}_bios
>to a value > 1, t

[PATCH] block: remove useless assignment in bio_split

2017-11-22 Thread Mikulas Patocka
Remove useless assignment to the variable "split" because the variable is
unconditionally assigned later.

Signed-off-by: Mikulas Patocka <mpato...@redhat.com>

---
 block/bio.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/block/bio.c
===
--- linux-2.6.orig/block/bio.c
+++ linux-2.6/block/bio.c
@@ -1873,7 +1873,7 @@ EXPORT_SYMBOL(bio_endio);
 struct bio *bio_split(struct bio *bio, int sectors,
  gfp_t gfp, struct bio_set *bs)
 {
-   struct bio *split = NULL;
+   struct bio *split;
 
BUG_ON(sectors <= 0);
BUG_ON(sectors >= bio_sectors(bio));


Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-21 Thread Mikulas Patocka


On Tue, 21 Nov 2017, Mike Snitzer wrote:

> On Tue, Nov 21 2017 at  4:23pm -0500,
> Mikulas Patocka <mpato...@redhat.com> wrote:
> 
> > This is not correct:
> > 
> >2206 static void dm_wq_work(struct work_struct *work)
> >2207 {
> >2208 struct mapped_device *md = container_of(work, struct mapped_device, work);
> >2209 struct bio *bio;
> >2210 int srcu_idx;
> >2211 struct dm_table *map;
> >2212
> >2213 if (!bio_list_empty(&md->rescued)) {
> >2214 struct bio_list list;
> >2215 spin_lock_irq(&md->deferred_lock);
> >2216 list = md->rescued;
> >2217 bio_list_init(&md->rescued);
> >2218 spin_unlock_irq(&md->deferred_lock);
> >2219 while ((bio = bio_list_pop(&list)))
> >2220 generic_make_request(bio);
> >2221 }
> >
> >2223 map = dm_get_live_table(md, &srcu_idx);
> >2224
> >2225 while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> >2226 spin_lock_irq(&md->deferred_lock);
> >2227 bio = bio_list_pop(&md->deferred);
> >2228 spin_unlock_irq(&md->deferred_lock);
> >2229
> >2230 if (!bio)
> >2231 break;
> >2232
> >2233 if (dm_request_based(md))
> >2234 generic_make_request(bio);
> >2235 else
> >2236 __split_and_process_bio(md, map, bio);
> >2237 }
> >2238
> >2239 dm_put_live_table(md, srcu_idx);
> >2240 }
> > 
> > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
> > will not process md->rescued list.
> 
> Can you elaborate further?  We cannot be "in dm_wq_work in
> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
> of scheduling away from __split_and_process_bio?
> 
> The more detail you can share the better.

Suppose this scenario:

* dm_wq_work calls __split_and_process_bio
* __split_and_process_bio eventually reaches the function snapshot_map
* snapshot_map attempts to take the snapshot lock

* the snapshot lock could be released only if some bios submitted by the 
snapshot driver to the underlying device complete
* the bios submitted to the underlying device were already offloaded by 
some other task and they are waiting on the list md->rescued
* the bios waiting on md->rescued are not processed, because dm_wq_work is 
blocked in snapshot_map (called from __split_and_process_bio)

> > The processing of md->rescued is also wrong - bios for different devices 
> > must be offloaded to different helper threads, so that processing a bio 
> > for a lower device doesn't depend on processing a bio for a higher device. 
> > If you offload all the bios on current->bio_list to the same thread, the 
> > bios still depend on each other and the deadlock will still happen.
> 
> Commit 325738403 ("dm: revise 'rescue' strategy for bio-based bioset
> allocations") speaks to this with:
> 
> "Note that only current->bio_list[0] is offloaded.  current->bio_list[1]
> contains bios that were scheduled *before* the current one started, so
> they must have been submitted from higher up the stack, and we cannot be
> waiting for them here (thanks to the "dm: ensure bio submission follows
> a depth-first tree walk" commit).  Also, we now rescue *all* bios on the
> list as there is nothing to be gained by being more selective."

I think you are right - if we only offload current->bio_list[0], then 
mixing of dependent bios on the offloaded list won't happen.

> And again: this patchset passes your dm-snapshot deadlock test.  Is
> that test somehow lacking?

With your patchset, the deadlock would happen only if bios are queued on 
md->deferred - and that happens only in case of resume or if we are 
processing REQ_PREFLUSH with non-zero data size.

So, the simple test that I wrote doesn't trigger it, but a more complex 
test involving REQ_PREFLUSH could.

> Or do you see a hypothetical case where a deadlock is still possible?
> That is of less concern.  I'd prefer that we tackle problems for
> targets, and associated scenarios, that we currently support.
> 
> Either way, happy to review this with you further.  Any fixes are
> welcomed too.  But I'd like us to head in a direction that this patchset
> is taking us.  Specifically: away from DM relying on BIOSET_NEED_RESCUER.
> 
> Thanks,
> Mike

Mikulas


Re: [dm-devel] new patchset to eliminate DM's use of BIOSET_NEED_RESCUER [was: Re: [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.]

2017-11-21 Thread Mikulas Patocka


On Tue, 21 Nov 2017, Mike Snitzer wrote:

> On Tue, Nov 21 2017 at  7:43am -0500,
> Mike Snitzer  wrote:
>  
> > Decided it a better use of my time to review and then hopefully use the
> > block-core's bio splitting infrastructure in DM.  Been meaning to do
> > that for quite a while anyway.  This mail from you just made it all the
> > more clear that needs doing:
> > https://www.redhat.com/archives/dm-devel/2017-September/msg00098.html
> > 
> > So I will start here on this patch you proposed:
> > https://www.redhat.com/archives/dm-devel/2017-September/msg00091.html
> > (of note, this patch slipped through the cracks because I was recovering
> > from injury when it originally came through).
> > 
> > Once DM is using q->bio_split I'll come back to this patch (aka
> > "[1]") as a starting point for the follow-on work to remove DM's use of
> > BIOSET_NEED_RESCUER:
> > https://www.redhat.com/archives/dm-devel/2017-August/msg00315.html
> 
> Hey Neil,
> 
> Good news!  All your code works ;)
> 
> (well after 1 fixup due to a cut-n-paste bug.. the code you added to
> dm_wq_work() to process the md->rescued bio_list was operating on
> the md->deferred bio_list due to cut-n-paste from code you copied from
> just below it)
> 
> I split your code out some to make it more reviewable.  I also tweaked
> headers accordingly.
> 
> Please see this branch (which _will_ get rebased between now and the
> 4.16 merge window):
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16
> 
> I successfully tested these changes using Mikulas' test program that
> reproduces the snapshot deadlock:
> https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
> 
> I'll throw various other DM testsuites at it to verify they all look
> good (e.g. thinp, cache, multipath).
> 
> I'm open to all suggestions about changes you'd like to see (either to
> these patches or anything you'd like to layer ontop of them).
> 
> Thanks for all your work, much appreciated!
> Mike

This is not correct:

   2206 static void dm_wq_work(struct work_struct *work)
   2207 {
   2208 struct mapped_device *md = container_of(work, struct mapped_device, work);
   2209 struct bio *bio;
   2210 int srcu_idx;
   2211 struct dm_table *map;
   2212
   2213 if (!bio_list_empty(&md->rescued)) {
   2214 struct bio_list list;
   2215 spin_lock_irq(&md->deferred_lock);
   2216 list = md->rescued;
   2217 bio_list_init(&md->rescued);
   2218 spin_unlock_irq(&md->deferred_lock);
   2219 while ((bio = bio_list_pop(&list)))
   2220 generic_make_request(bio);
   2221 }
   
   2223 map = dm_get_live_table(md, &srcu_idx);
   2224
   2225 while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
   2226 spin_lock_irq(&md->deferred_lock);
   2227 bio = bio_list_pop(&md->deferred);
   2228 spin_unlock_irq(&md->deferred_lock);
   2229
   2230 if (!bio)
   2231 break;
   2232
   2233 if (dm_request_based(md))
   2234 generic_make_request(bio);
   2235 else
   2236 __split_and_process_bio(md, map, bio);
   2237 }
   2238
   2239 dm_put_live_table(md, srcu_idx);
   2240 }

You can see that if we are in dm_wq_work in __split_and_process_bio, we 
will not process md->rescued list.

The processing of md->rescued is also wrong - bios for different devices 
must be offloaded to different helper threads, so that processing a bio 
for a lower device doesn't depend on processing a bio for a higher device. 
If you offload all the bios on current->bio_list to the same thread, the 
bios still depend on each other and the deadlock will still happen.

Mikulas


[PATCH] brd: remove unused brd_mutex

2017-11-10 Thread Mikulas Patocka
Remove the unused mutex brd_mutex. It has been unused since commit
ff26956875c2 ("brd: remove support for BLKFLSBUF").

Signed-off-by: Mikulas Patocka <mpato...@redhat.com>

---
 drivers/block/brd.c |1 -
 1 file changed, 1 deletion(-)

Index: linux-2.6/drivers/block/brd.c
===
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -60,7 +60,6 @@ struct brd_device {
 /*
  * Look up and return a brd's page for a given sector.
  */
-static DEFINE_MUTEX(brd_mutex);
 static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
 {
pgoff_t idx;


Re: [PATCH] bio: have bio_kmap_irq return the size of mapped data (fwd)

2017-11-08 Thread Mikulas Patocka


On Wed, 8 Nov 2017, Christoph Hellwig wrote:

> On Tue, Nov 07, 2017 at 04:45:17PM -0500, Mikulas Patocka wrote:
> > Hi
> > 
> > I need the function bio_kmap_irq in the driver that I am developing, but 
> > it doesn't return the size of the mapped data. I've made this patch to fix 
> > it.
> 
> To be honest I think we should just remove bio_kmap_irq.  It is currently
> unused and assumes there is only a single bvec to map.

It could be removed from include/linux/bio.h and moved to my driver. But 
if we leave it in bio.h, it could be used by others as well.

bio_kmap_irq can iterate over the whole bio if we use bio_advance on the 
bio.

> I think you're much better off using a proper bvec iterator in your
> caller and then call bvec_kmap_irq/bvec_kunmap_irq manually.

Mikulas


[PATCH] bio: have bio_kmap_irq return the size of mapped data (fwd)

2017-11-07 Thread Mikulas Patocka
Hi

I need the function bio_kmap_irq in the driver that I am developing, but 
it doesn't return the size of the mapped data. I've made this patch to fix 
it.

Mikulas


From: Mikulas Patocka <mpato...@redhat.com>

The function bio_kmap_irq is not usable because it does not return the
size of the mapped data.

Fix bio_kmap_irq and __bio_kmap_irq so that they return the size of
mapped data.

Signed-off-by: Mikulas Patocka <mpato...@redhat.com>
Signed-off-by: Mike Snitzer <snit...@redhat.com>
---
 include/linux/bio.h |   10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/bio.h
===
--- linux-2.6.orig/include/linux/bio.h
+++ linux-2.6/include/linux/bio.h
@@ -576,14 +576,16 @@ static inline void bvec_kunmap_irq(char
 #endif
 
 static inline char *__bio_kmap_irq(struct bio *bio, struct bvec_iter iter,
-  unsigned long *flags)
+  unsigned long *flags, unsigned *size)
 {
-   return bvec_kmap_irq(&bio_iter_iovec(bio, iter), flags);
+   struct bio_vec bv = bio_iter_iovec(bio, iter);
+   *size = bv.bv_len;
+   return bvec_kmap_irq(&bv, flags);
 }
 #define __bio_kunmap_irq(buf, flags)   bvec_kunmap_irq(buf, flags)
 
-#define bio_kmap_irq(bio, flags) \
-   __bio_kmap_irq((bio), (bio)->bi_iter, (flags))
+#define bio_kmap_irq(bio, flags, size) \
+   __bio_kmap_irq((bio), (bio)->bi_iter, (flags), (size))
 #define bio_kunmap_irq(buf,flags)  __bio_kunmap_irq(buf, flags)
 
 /*



[PATCH] brd: fix overflow in __brd_direct_access

2017-09-13 Thread Mikulas Patocka
The code in __brd_direct_access multiplies the pgoff variable by page size
and divides it by 512. It can cause overflow on 32-bit architectures. The
overflow happens if we create ramdisk larger than 4G and use it as a
sparse device.

This patch replaces multiplication and division with multiplication by the
number of sectors per page.

Signed-off-by: Mikulas Patocka <mpato...@redhat.com>
Fixes: 1647b9b959c7 ("brd: add dax_operations support")
Cc: sta...@vger.kernel.org  # 4.12+

---
 drivers/block/brd.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-4.13/drivers/block/brd.c
===
--- linux-4.13.orig/drivers/block/brd.c
+++ linux-4.13/drivers/block/brd.c
@@ -339,7 +339,7 @@ static long __brd_direct_access(struct b
 
if (!brd)
return -ENODEV;
-   page = brd_insert_page(brd, PFN_PHYS(pgoff) / 512);
+   page = brd_insert_page(brd, (sector_t)pgoff << PAGE_SECTORS_SHIFT);
if (!page)
return -ENOSPC;
*kaddr = page_address(page);


[PATCH] fix an integer overflow in __blkdev_sectors_to_bio_pages

2017-08-14 Thread Mikulas Patocka


On Mon, 14 Aug 2017, Damien Le Moal wrote:

> On Sun, 2017-08-13 at 22:47 -0400, Mikulas Patocka wrote:
> > 
> > On Wed, 9 Aug 2017, h...@lst.de wrote:
> > 
> > > Does commit 615d22a51c04856efe62af6e1d5b450aaf5cc2c0
> > > "block: Fix __blkdev_issue_zeroout loop" fix the issue for you?
> > > 
> > > --
> > > dm-devel mailing list
> > > dm-de...@redhat.com
> > > https://www.redhat.com/mailman/listinfo/dm-devel
> > 
> > I think that patch is incorrect. sector_t may be a 32-bit type and 
> > nr_sects << 9 may overflow.
> > 
> > static unsigned int __blkdev_sectors_to_bio_pages(sector_t nr_sects)
> > {
> >sector_t bytes = (nr_sects << 9) + PAGE_SIZE - 1;
> > 
> >return min(bytes >> PAGE_SHIFT, (sector_t)BIO_MAX_PAGES);
> > }
> > 
> > Mikulas
> 
> Mikulas,
> 
> Does the following patch fix the problem?
> 
> From 947b3cf41e759b2b23f684e215e651d0c8037f88 Mon Sep 17 00:00:00 2001
> From: Damien Le Moal <damien.lem...@wdc.com>
> Date: Mon, 14 Aug 2017 13:01:16 +0900
> Subject: [PATCH] block: Fix __blkdev_sectors_to_bio_pages()
> 
> On 32bit systems where sector_t is a 32bits type, the calculation of
> bytes may overflow. Use the u64 type for the local calculation to avoid
> overflows.
> 
> Signed-off-by: Damien Le Moal <damien.lem...@wdc.com>
> ---
>  block/blk-lib.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 3fe0aec90597..ccf22dba21f0 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -269,9 +269,9 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
>   */
>  static unsigned int __blkdev_sectors_to_bio_pages(sector_t nr_sects)
>  {
> - sector_t bytes = (nr_sects << 9) + PAGE_SIZE - 1;
> + u64 bytes = ((u64)nr_sects << 9) + PAGE_SIZE - 1;
>  
> - return min(bytes >> PAGE_SHIFT, (sector_t)BIO_MAX_PAGES);
> + return min(bytes >> PAGE_SHIFT, (u64)BIO_MAX_PAGES);
>  }
>  

It's OK, but 64-bit arithmetic is not needed here if all we need is to 
shift the value right. Here I submit a simplified patch, using the macro 
DIV_ROUND_UP_SECTOR_T (the macro gets optimized to just an addition and a 
right shift).



From: Mikulas Patocka <mpato...@redhat.com>

Fix possible integer overflow in __blkdev_sectors_to_bio_pages if sector_t 
is 32-bit.

Signed-off-by: Mikulas Patocka <mpato...@redhat.com>
Fixes: 615d22a51c04 ("block: Fix __blkdev_issue_zeroout loop")

---
 block/blk-lib.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/block/blk-lib.c
===
--- linux-2.6.orig/block/blk-lib.c
+++ linux-2.6/block/blk-lib.c
@@ -269,9 +269,9 @@ static int __blkdev_issue_write_zeroes(s
  */
 static unsigned int __blkdev_sectors_to_bio_pages(sector_t nr_sects)
 {
-   sector_t bytes = (nr_sects << 9) + PAGE_SIZE - 1;
+   sector_t pages = DIV_ROUND_UP_SECTOR_T(nr_sects, PAGE_SIZE / 512);
 
-   return min(bytes >> PAGE_SHIFT, (sector_t)BIO_MAX_PAGES);
+   return min(pages, (sector_t)BIO_MAX_PAGES);
 }
 
 /**


Re: [dm-devel] [BUG] BLKZEROOUT on dm-crypt container cause OOM / kernel panic

2017-08-13 Thread Mikulas Patocka


On Wed, 9 Aug 2017, h...@lst.de wrote:

> Does commit 615d22a51c04856efe62af6e1d5b450aaf5cc2c0
> "block: Fix __blkdev_issue_zeroout loop" fix the issue for you?
> 

I think that patch is incorrect. sector_t may be a 32-bit type and 
nr_sects << 9 may overflow.

static unsigned int __blkdev_sectors_to_bio_pages(sector_t nr_sects)
{
   sector_t bytes = (nr_sects << 9) + PAGE_SIZE - 1;

   return min(bytes >> PAGE_SHIFT, (sector_t)BIO_MAX_PAGES);
}

Mikulas


Re: [PATCH] bio-integrity: revert "stop abusing bi_end_io"

2017-08-06 Thread Mikulas Patocka


On Sat, 5 Aug 2017, Martin K. Petersen wrote:

> Mikulas,
> 
> > The sector number in the integrity tag must match the physical sector 
> > number. So, it must be verified at the bottom.
> 
> The ref tag seed matches the submitter block number (typically block
> layer sector for the top device) and is remapped to and from the LBA by
> the SCSI disk driver or HBA firmware.
> 
> So the verification is designed to be done by the top level entity that
> attaches the protection information to the bio. In this case
> bio_integrity_prep().

bio_integrity_prep is called by blk_queue_bio - that is the lowest level 
physical disk, not the top level. It is not called by device mapper.

Documentation/block/data-integrity.txt says that bio_integrity_prep() can 
be called by an integrity-aware filesystem, but it seems that no such 
filesystem exists.

If you create the integrity tag at or above device mapper level, you will 
run into problems because the same device can be accessed using device 
mapper and using physical volume /dev/sd*. If you create integrity tags at 
device mapper level, they will contain device mapper's logical sector 
number and the sector number won't match if you access the device directly 
using /dev/sd*.

> -- 
> Martin K. PetersenOracle Linux Engineering

Mikulas


Re: [PATCH] bio-integrity: revert "stop abusing bi_end_io"

2017-08-05 Thread Mikulas Patocka


On Sat, 5 Aug 2017, Christoph Hellwig wrote:

> 
> On Thu, Aug 03, 2017 at 10:10:55AM -0400, Mikulas Patocka wrote:
> > That dm-crypt commit that uses bio integrity payload came 3 months before 
> > 7c20f11680a441df09de7235206f70115fbf6290 and it was already present in 
> > 4.12.
> 
> And on its own that isn't an argument if your usage is indeed wrong,
> and that's why we are having this discussion.
> 
> > The patch 7c20f11680a441df09de7235206f70115fbf6290 changes it so that 
> > bio->bi_end_io is not changed and bio_integrity_endio is called directly 
> > from bio_endio if the bio has integrity payload. The unintended 
> > consequence of this change is that when a request with integrity payload 
> > is finished and bio_endio is called for each cloned bio, the verification 
> > routine bio_integrity_verify_fn is called for every level in the stack (it 
> > used to be called only for the lowest level in 4.12). At different levels 
> > in the stack, the bio may have different bi_sector value, so the repeated 
> > verification can't succeed.
> 
> We can simply add another bio flag to get back to the previous
> behavior.  That being said, the thing to do in the end is to verify it
> at the top of the stack, and not the bottom, eventually.  I can cook
> up a patch for that.

The sector number in the integrity tag must match the physical sector 
number. So, it must be verified at the bottom.

If the sector number in the integrity tag matched the logical sector 
number in the logical volume, it would create a lot of problems - for 
example, it would be impossible to move the logical volume and it would be 
impossible to copy the full physical volume.

> > I think the patch 7c20f11680a441df09de7235206f70115fbf6290 should be 
> > reverted and it should be reworked for the next merge window, so that it 
> > calls bio_integrity_verify_fn only for the lowest level.
> 
> And with this you will just reintroduce the memory leak for
> on-stack bios we've fixed.
> 
> I'll take a look at not calling the verify function for every level,
> which is wrong, and in the meantime we can discuss the uses and abuses
> of the bio integrity code by dm in the separate subthread.

It would be useful if dm-crypt could put the authenticated tag directly into 
the integrity data. But for that usage, the lowest-level physical driver 
must not generate or verify the integrity data - it must just read and 
write it.

Mikulas


[PATCH] bio-integrity: revert "stop abusing bi_end_io"

2017-08-03 Thread Mikulas Patocka


On Wed, 2 Aug 2017, Christoph Hellwig wrote:

> On Wed, Aug 02, 2017 at 02:27:50PM +0200, Milan Broz wrote:
> > In dm-integrity target we register integrity profile that have
> > both generate_fn and verify_fn callbacks set to NULL.
> > 
> > This is used if dm-integrity is stacked under a dm-crypt device
> > for authenticated encryption (integrity payload contains authentication
> > tag and IV seed).
> > 
> > In this case the verification is done through own crypto API
> > processing inside dm-crypt; integrity profile is only holder
> > of these data. (And memory is owned by dm-crypt as well.)
> 
> Maybe that's where the problem lies?  You're abusing the integrity
> payload for something that is not end to end data integrity at all
> and then wonder why it breaks?  Also the commit that introduced your
> code had absolutely no review by Martin or any of the core block
> folks.
> 
> The right fix is to revert the dm-crypt commit.

That dm-crypt commit that uses bio integrity payload came 3 months before 
7c20f11680a441df09de7235206f70115fbf6290 and it was already present in 
4.12.

The patch 7c20f11680a441df09de7235206f70115fbf6290 not only breaks 
dm-crypt and dm-integrity. It also breaks regular integrity payload usage 
when stacking drivers are used.

Suppose that a bio goes the through the device mapper stack. At each level 
a clone bio is allocated and forwarded to a lower level. It eventually 
reaches the request-based physical disk driver.

In the kernel 4.12, when the bio reaches the request-based driver, the 
function blk_queue_bio is called, it calls bio_integrity_prep, 
bio_integrity_prep saves the bio->bi_end_io in the structure 
bio_integrity_payload and sets bio->bi_end_io = bio_integrity_endio. When 
the bio is finished, bio_integrity_endio is called, it performs the 
verification (a call to bio_integrity_verify_fn), restores bio->bi_end_io 
and calls it.

The patch 7c20f11680a441df09de7235206f70115fbf6290 changes it so that 
bio->bi_end_io is not changed and bio_integrity_endio is called directly 
from bio_endio if the bio has integrity payload. The unintended 
consequence of this change is that when a request with integrity payload 
is finished and bio_endio is called for each cloned bio, the verification 
routine bio_integrity_verify_fn is called for every level in the stack (it 
used to be called only for the lowest level in 4.12). At different levels 
in the stack, the bio may have different bi_sector value, so the repeated 
verification can't succeed.

I think the patch 7c20f11680a441df09de7235206f70115fbf6290 should be 
reverted and it should be reworked for the next merge window, so that it 
calls bio_integrity_verify_fn only for the lowest level.

Mikulas



From: Mikulas Patocka <mpato...@redhat.co>

The patch 7c20f11680a441df09de7235206f70115fbf6290 ("bio-integrity: stop 
abusing bi_end_io") changes the code so that it doesn't replace 
bio_end_io with bio_integrity_endio and calls bio_integrity_endio directly 
from bio_endio.

The unintended consequence of this change is that when a bio with 
integrity payload is passed through the device stack, bio_integrity_endio 
is called for each level of the stack (it used to be called only for the 
lowest level before).

This breaks dm-integrity and dm-crypt and it also causes verification 
errors when a bio with regular integrity payload is passed through the 
device mapper stack.

Signed-off-by: Mikulas Patocka <mpato...@redhat.com>

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index f733beab6ca2..8df4eb103ba9 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -102,7 +102,7 @@ EXPORT_SYMBOL(bio_integrity_alloc);
  * Description: Used to free the integrity portion of a bio. Usually
  * called from bio_free().
  */
-static void bio_integrity_free(struct bio *bio)
+void bio_integrity_free(struct bio *bio)
 {
struct bio_integrity_payload *bip = bio_integrity(bio);
struct bio_set *bs = bio->bi_pool;
@@ -120,8 +120,8 @@ static void bio_integrity_free(struct bio *bio)
}
 
bio->bi_integrity = NULL;
-   bio->bi_opf &= ~REQ_INTEGRITY;
 }
+EXPORT_SYMBOL(bio_integrity_free);
 
 /**
  * bio_integrity_add_page - Attach integrity metadata
@@ -326,6 +326,12 @@ bool bio_integrity_prep(struct bio *bio)
offset = 0;
}
 
+   /* Install custom I/O completion handler if read verify is enabled */
+   if (bio_data_dir(bio) == READ) {
+   bip->bip_end_io = bio->bi_end_io;
+   bio->bi_end_io = bio_integrity_endio;
+   }
+
/* Auto-generate integrity metadata if this is a write */
if (bio_data_dir(bio) == WRITE) {
bio_integrity_process(bio, &bio->bi_iter,
@@ -369,12 +375,13 @@ static void bio_integrity_verify_fn(struct work_struct *work)
bio->bi_st

Re: [dm-devel] v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone

2017-02-14 Thread Mikulas Patocka


On Thu, 9 Feb 2017, Kent Overstreet wrote:

> On Wed, Feb 08, 2017 at 11:34:07AM -0500, Mike Snitzer wrote:
> > On Tue, Feb 07 2017 at 11:58pm -0500,
> > Kent Overstreet <kent.overstr...@gmail.com> wrote:
> > 
> > > On Tue, Feb 07, 2017 at 09:39:11PM +0100, Pavel Machek wrote:
> > > > On Mon 2017-02-06 17:49:06, Kent Overstreet wrote:
> > > > > On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote:
> > > > > > On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
> > > > > > > Still there on v4.9, 36 threads on nokia n900 cellphone.
> > > > > > > 
> > > > > > > So.. what needs to be done there?
> > > > > 
> > > > > > But, I just got an idea for how to handle this that might be 
> > > > > > halfway sane, maybe
> > > > > > I'll try and come up with a patch...
> > > > > 
> > > > > Ok, here's such a patch, only lightly tested:
> > > > 
> > > > I guess it would be nice for me to test it... but what is it against?
> > > > I tried after v4.10-rc5 and linux-next, but got rejects in both cases.
> > > 
> > > Sorry, I forgot I had a few other patches in my branch that touch
> > > mempool/biosets code.
> > > 
> > > Also, after thinking about it more and looking at the relevant code, I'm 
> > > pretty
> > > sure we don't need rescuer threads for block devices that just split bios 
> > > - i.e.
> > > most of them, so I changed my patch to do that.
> > > 
> > > Tested it by ripping out the current->bio_list checks/workarounds from the
> > > bcache code, appears to work:
> > 
> > Feedback on this patch below, but first:
> > 
> > There are deeper issues with the current->bio_list and rescue workqueues
> > than thread counts.
> > 
> > I cannot help but feel like you (and Jens) are repeatedly ignoring the
> > issue that has been raised numerous times, most recently:
> > https://www.redhat.com/archives/dm-devel/2017-February/msg00059.html
> > 
> > FYI, this test (albeit ugly) can be used to check if the dm-snapshot
> > deadlock is fixed:
> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
> > 
> > This situation is the unfortunate pathological worst case for what
> > happens when changes are merged and nobody wants to own fixing the
> > unforseen implications/regressions.   Like everyone else in a position
> > of Linux maintenance I've tried to stay away from owning the
> > responsibility of a fix -- it isn't working.  Ok, I'll stop bitching
> > now.. I do bear responsibility for not digging in myself.  We're all
> > busy and this issue is "hard".
> 
> Mike, it's not my job to debug DM code for you or sift through your bug 
> reports.
> I don't read dm-devel, and I don't know why you think that's my job.
> 
> If there's something you think the block layer should be doing differently, 
> post
> patches - or at the very least, explain what you'd like to be done, with 
> words.
> Don't get pissy because I'm not sifting through your bug reports.

So I post this patch for that bug.

Will any of the block device maintainers respond to it?



From: Mikulas Patocka <mpato...@redhat.com>
Date: Tue, 27 May 2014 11:03:36 -0400
Subject: block: flush queued bios when process blocks to avoid deadlock

The block layer uses per-process bio list to avoid recursion in
generic_make_request.  When generic_make_request is called recursively,
the bio is added to current->bio_list and generic_make_request returns
immediately.  The top-level instance of generic_make_request takes bios
from current->bio_list and processes them.

The problem is that this bio queuing on current->bio_list creates an 
artificial locking dependency - a bio further on current->bio_list depends 
on any locks that preceding bios could take.  This could result in a 
deadlock.

Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
stacking drivers") created a workqueue for every bio set and code
in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by
redirecting bios queued on current->bio_list to the workqueue if the
system is low on memory.  However another deadlock (see below **) may
happen, without any low memory condition, because generic_make_request
is queuing bios to current->bio_list (rather than submitting them).

Fix this deadlock by redirecting any bios on current->bio_list to the
bio_set's rescue workqueue on every schedule call.  Consequently, when
the process blocks on a mutex, the bios que

Re: REQ_OP for zeroing, was Re: [dm-devel] [PATCH 1/4] brd: handle misaligned discard

2016-10-31 Thread Mikulas Patocka


On Fri, 28 Oct 2016, Christoph Hellwig wrote:

> [adding Chaitanya to Cc]
> 
> On Fri, Oct 28, 2016 at 07:43:41AM -0400, Mikulas Patocka wrote:
> > We could detect if the REQ_OP_WRITE_SAME command contains all zeroes and 
> > if it does, turn it into "Write Zeroes" or TRIM command (if the device 
> > guarantees zeroing on trim). If it doesn't contain all zeroes and the 
> > device doesn't support non-zero WRITE SAME, then reject it.
> 
> I don't like this because it's very inefficient - we have to allocate
> a payload first and then compare the whole payload for very operation.
> 
> > Or maybe we could add a new command REQ_OP_WRITE_ZEROES - I'm not sure 
> > which of these two possibilities is better.
> 
> Chaitanya actually did an initial prototype implementation of this for
> NVMe that he shared with me.  I liked it a lot and I think he'll be
> ready to post it in a few days.  Now that we have the REQ_OP* values
> instead of mapping different command types to flags it's actually
> surprisingly easy to add new block layer operations.

OK - when it is in the kernel, let me know, so that I can write device 
mapper support for that.

We should remove the flag "discard_zeroes_data" afterwards, because it is 
unreliable and impossible to support correctly in the device mapper.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [dm-devel] [PATCH 1/4] brd: handle misaligned discard

2016-10-31 Thread Mikulas Patocka


On Fri, 28 Oct 2016, Bart Van Assche wrote:

> On 10/28/2016 04:39 AM, Mikulas Patocka wrote:
> > On Wed, 26 Oct 2016, Bart Van Assche wrote:
> > > On 10/26/2016 02:46 PM, Mikulas Patocka wrote:
> > > > I think the proper thing would be to move "discard_zeroes_data" flag
> > > > into
> > > > the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO -
> > > > and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the
> > > > bio
> > > > and the caller is supposed to do zeroing manually.
> > > 
> > > Sorry but I don't like this proposal. I think that a much better solution
> > > would be to pause I/O before making any changes to the discard_zeroes_data
> > > flag.
> > 
> > The device mapper pauses all bios when it switches the table - but the bio
> > was submitted with assumption that it goes to a device withe
> > "discard_zeroes_data" set, then the bio is paused, device mapper table is
> > switched, the bio is unpaused, and now it can go to a device without
> > "discard_zeroes_data".
> 
> Hello Mikulas,
> 
> Sorry if I wasn't clear enough. What I meant is to wait until all outstanding
> requests have finished

It is possible that the process sends a never-ending stream of bios (for 
example when reading linear data and using readahead), so waiting until 
there are no outstanding bios never finishes.

> before modifying the discard_zeroes_data flag - the
> kind of operation that is performed by e.g. blk_mq_freeze_queue().

blk_mq_freeze_queue() works on request-based drivers, device mapper works 
with bios, so that function has no effect on device mapper device. Anyway 
- blk_mq_freeze_queue() won't stop the process that issues the I/O 
requests - it will just hold the requests in the queue and not forward 
them to the device.

There is no way to stop the process that issues the bios. We can't stop 
the process that is looping in __blkdev_issue_discard, issuing discard 
requests. All that we can do is to hold the bios that the process issued.

Device mapper can freeze the filesystem with "freeze_bdev", but...
- some filesystems don't support freeze
- if the filesystem is not directly mounted on the frozen device, but 
there is a stack of intermediate block devices between the filesystem and 
the frozen device, then the filesystem will not be frozen
- the user can open the block device directly and he won't be affected by 
the freeze

> Modifying the discard_zeroes_data flag after a bio has been submitted 
> and before it has completed could lead to several classes of subtle 
> bugs. Functions like __blkdev_issue_discard() assume that the value of 
> the discard_zeroes_data flag does not change after this function has 
> been called and before the submitted requests completed.
>
> Bart.

I agree. That's the topic of the discussion - that the discard_zeroes_data 
flag is unreliable and the flag should be moved to the bio.

Mikulas


Re: REQ_OP for zeroing, was Re: [dm-devel] [PATCH 1/4] brd: handle misaligned discard

2016-10-28 Thread Mikulas Patocka


On Wed, 26 Oct 2016, Christoph Hellwig wrote:

> On Wed, Oct 26, 2016 at 05:46:11PM -0400, Mikulas Patocka wrote:
> > I think the proper thing would be to move "discard_zeroes_data" flag into 
> > the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO - 
> > and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio 
> > and the caller is supposed to do zeroing manually.
> 
> Yes, Martin and I have come to a similar conclusion recently.  An
> additional aspect is that NVMe has a Write Zeroes command which is more
> limited than what REQ_OP_WRITE_SAME does.
> 
> So I think the right way is to add a REQ_OP_WRITE_ZEROES (or
> REQ_OP_ZERO) and have modifiers for whether it may discard or not.

We could detect if the REQ_OP_WRITE_SAME command contains all zeroes and 
if it does, turn it into "Write Zeroes" or TRIM command (if the device 
guarantees zeroing on trim). If it doesn't contain all zeroes and the 
device doesn't support non-zero WRITE SAME, then reject it.

Or maybe we could add a new command REQ_OP_WRITE_ZEROES - I'm not sure 
which of these two possibilities is better.

Mikulas


Re: [dm-devel] [PATCH 1/4] brd: handle misaligned discard

2016-10-28 Thread Mikulas Patocka


On Wed, 26 Oct 2016, Bart Van Assche wrote:

> On 10/26/2016 02:46 PM, Mikulas Patocka wrote:
> > I don't like the idea of complicating the code by turning discards into
> > writes.
> 
> That's not what my patch series does. The only writes added by my patch series
> are those for the non-aligned head and tail of the range passed to
> blkdev_issue_zeroout().

The purpose of discard is to improve SSD performance and reduce wear.

Generating more write requests for discard does quite the opposite - it 
reduces performance (discard + two small writes is slower than just 
discard) and it also causes more wear.

> > The flag "discard_zeroes_data" is actually misdesigned, because the
> > storage stack can change dynamically while bios are in progress. You can
> > send a discard bio to a device mapper device that has
> > "discard_zeroes_data" - and while the bio is in progress, the device
> > mapper stack can be reconfigured to redirect the bio to another device
> > that doesn't have "discard_zeroes_data" - and the bio won't zero data and
> > the caller that issued it has no way to find it out.
> > 
> > I think the proper thing would be to move "discard_zeroes_data" flag into
> > the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO -
> > and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio
> > and the caller is supposed to do zeroing manually.
> 
> Sorry but I don't like this proposal. I think that a much better solution
> would be to pause I/O before making any changes to the discard_zeroes_data
> flag.

The device mapper pauses all bios when it switches the table - but the bio 
was submitted with the assumption that it goes to a device with 
"discard_zeroes_data" set, then the bio is paused, device mapper table is 
switched, the bio is unpaused, and now it can go to a device without 
"discard_zeroes_data".

> Bart.

Mikulas


Re: [dm-devel] [PATCH 1/4] brd: handle misaligned discard

2016-10-26 Thread Mikulas Patocka


On Wed, 26 Oct 2016, Bart Van Assche wrote:

> On 10/26/2016 01:26 PM, Mikulas Patocka wrote:
> > The brd driver refuses misaligned discard requests with an error. However,
> > this is suboptimal, misaligned requests could be handled by discarding a
> > part of the request that is aligned on a page boundary. This patch changes
> > the code so that it handles misaligned requests.
> 
> Hello Mikulas,
> 
> We do not want this kind of discard request processing in every block driver.
> This is why I think that this kind of processing should be added to the block
> layer. See also "[PATCH v3 0/5] Make blkdev_issue_discard() submit aligned
> discard requests" (http://www.spinics.net/lists/linux-block/msg02360.html).
> 
> Bart.

I don't like the idea of complicating the code by turning discards into 
writes. You can just turn off the flag "discard_zeroes_data" and drop all 
the splitting code.

The flag "discard_zeroes_data" is actually misdesigned, because the 
storage stack can change dynamically while bios are in progress. You can 
send a discard bio to a device mapper device that has 
"discard_zeroes_data" - and while the bio is in progress, the device 
mapper stack can be reconfigured to redirect the bio to another device 
that doesn't have "discard_zeroes_data" - and the bio won't zero data and 
the caller that issued it has no way to find it out.

I think the proper thing would be to move "discard_zeroes_data" flag into 
the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO - 
and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio 
and the caller is supposed to do zeroing manually.

Mikulas


[PATCH 0/4] brd: support discard

2016-10-26 Thread Mikulas Patocka


On Tue, 25 Oct 2016, Jens Axboe wrote:

> On 10/25/2016 08:37 AM, Mike Snitzer wrote:
> > On Tue, Oct 25 2016 at  9:07P -0400,
> > Christoph Hellwig  wrote:
> > 
> > > I think the right fix is to kill off the BLKFLSBUF special case in
> > > brd.  Yes, it break compatibility - but in this case the compatibility
> > > breaks more than it helps.
> > 
> > Jens, please pick up this patch:
> > 
> > From: Mike Snitzer 
> > Date: Tue, 25 Oct 2016 10:25:07 -0400
> > Subject: [PATCH] brd: remove support for BLKFLSBUF
> 
> Added, thanks.
> 
> -- 
> Jens Axboe

Hi

With the removal of BLKFLSBUF, there is no way to free memory allocated by 
ramdisk, so I created these 4 patches that add DISCARD support for 
ramdisk.

Mikulas


[PATCH 2/4] brd: extend rcu read sections

2016-10-26 Thread Mikulas Patocka
This patch extends rcu read sections, so that all manipulations of the
page and its data are within read sections.

This patch is a prerequisite for discarding pages using rcu.

Note that the page pointer escapes the rcu section in the function
brd_direct_access, however, direct access is not supposed to race with
discard.

Signed-off-by: Mikulas Patocka <mpato...@redhat.com>

---
 drivers/block/brd.c |   19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

Index: linux-2.6/drivers/block/brd.c
===
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -71,10 +71,8 @@ static struct page *brd_lookup_page(stru
 * safe here (we don't have total exclusion from radix tree updates
 * here, only deletes).
 */
-   rcu_read_lock();
idx = sector >> PAGE_SECTORS_SHIFT; /* sector to page index */
page = radix_tree_lookup(&brd->brd_pages, idx);
-   rcu_read_unlock();
 
BUG_ON(page && page->index != idx);
 
@@ -92,7 +90,12 @@ static struct page *brd_insert_page(stru
struct page *page;
gfp_t gfp_flags;
 
+   rcu_read_lock();
+
page = brd_lookup_page(brd, sector);
+
+   rcu_read_unlock();
+
if (page)
return page;
 
@@ -151,9 +154,13 @@ static void brd_zero_page(struct brd_dev
 {
struct page *page;
 
+   rcu_read_lock();
+
page = brd_lookup_page(brd, sector);
if (page)
clear_highpage(page);
+
+   rcu_read_unlock();
 }
 
 /*
@@ -246,6 +253,8 @@ static void copy_to_brd(struct brd_devic
unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
size_t copy;
 
+   rcu_read_lock();
+
copy = min_t(size_t, n, PAGE_SIZE - offset);
page = brd_lookup_page(brd, sector);
BUG_ON(!page);
@@ -265,6 +274,8 @@ static void copy_to_brd(struct brd_devic
memcpy(dst, src, copy);
kunmap_atomic(dst);
}
+
+   rcu_read_unlock();
 }
 
 /*
@@ -278,6 +289,8 @@ static void copy_from_brd(void *dst, str
unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
size_t copy;
 
+   rcu_read_lock();
+
copy = min_t(size_t, n, PAGE_SIZE - offset);
page = brd_lookup_page(brd, sector);
if (page) {
@@ -299,6 +312,8 @@ static void copy_from_brd(void *dst, str
} else
memset(dst, 0, copy);
}
+
+   rcu_read_unlock();
 }
 
 /*



[PATCH 1/4] brd: handle misaligned discard

2016-10-26 Thread Mikulas Patocka
The brd driver refuses misaligned discard requests with an error. However,
this is suboptimal: a misaligned request could be handled by discarding the
part of it that is aligned on page boundaries. This patch changes the code
so that misaligned requests are handled.

Signed-off-by: Mikulas Patocka <mpato...@redhat.com>

---
 drivers/block/brd.c |   16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

Index: linux-2.6/drivers/block/brd.c
===
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -213,9 +213,14 @@ static int copy_to_brd_setup(struct brd_
 }
 
 static void discard_from_brd(struct brd_device *brd,
-   sector_t sector, size_t n)
+   sector_t sector, unsigned n_sectors)
 {
-   while (n >= PAGE_SIZE) {
+   unsigned boundary = -sector & ((PAGE_SIZE >> SECTOR_SHIFT) - 1);
+   if (unlikely(boundary >= n_sectors))
+   return;
+   sector += boundary;
+   n_sectors -= boundary;
+   while (n_sectors >= PAGE_SIZE >> SECTOR_SHIFT) {
/*
 * Don't want to actually discard pages here because
 * re-allocating the pages can result in writeback
@@ -226,7 +231,7 @@ static void discard_from_brd(struct brd_
else
brd_zero_page(brd, sector);
sector += PAGE_SIZE >> SECTOR_SHIFT;
-   n -= PAGE_SIZE;
+   n_sectors -= PAGE_SIZE >> SECTOR_SHIFT;
}
 }
 
@@ -339,10 +344,7 @@ static blk_qc_t brd_make_request(struct
goto io_error;
 
if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
-   if (sector & ((PAGE_SIZE >> SECTOR_SHIFT) - 1) ||
-   bio->bi_iter.bi_size & ~PAGE_MASK)
-   goto io_error;
-   discard_from_brd(brd, sector, bio->bi_iter.bi_size);
+   discard_from_brd(brd, sector, bio->bi_iter.bi_size >> SECTOR_SHIFT);
goto out;
}
 



[PATCH 3/4] brd: implement discard

2016-10-26 Thread Mikulas Patocka
Implement page discard using RCU. Each page has a built-in rcu_head entry
that can be used to free the page via RCU.

Regarding the comment that "re-allocating the pages can result in
writeback deadlocks under heavy load" - if the user is at risk of such
deadlocks, the filesystem should be mounted without "-o discard".

Signed-off-by: Mikulas Patocka <mpato...@redhat.com>

---
 drivers/block/brd.c |   30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

Index: linux-2.6/drivers/block/brd.c
===
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -137,6 +137,12 @@ static struct page *brd_insert_page(stru
return page;
 }
 
+static void brd_free_page_rcu(struct rcu_head *head)
+{
+   struct page *page = container_of(head, struct page, rcu_head);
+   __free_page(page);
+}
+
 static void brd_free_page(struct brd_device *brd, sector_t sector)
 {
struct page *page;
@@ -147,7 +153,7 @@ static void brd_free_page(struct brd_dev
	page = radix_tree_delete(&brd->brd_pages, idx);
	spin_unlock(&brd->brd_lock);
if (page)
-   __free_page(page);
+   call_rcu(&page->rcu_head, brd_free_page_rcu);
 }
 
 static void brd_zero_page(struct brd_device *brd, sector_t sector)
@@ -233,7 +239,7 @@ static void discard_from_brd(struct brd_
 * re-allocating the pages can result in writeback
 * deadlocks under heavy load.
 */
-   if (0)
+   if (1)
brd_free_page(brd, sector);
else
brd_zero_page(brd, sector);
@@ -257,22 +263,22 @@ static void copy_to_brd(struct brd_devic
 
copy = min_t(size_t, n, PAGE_SIZE - offset);
page = brd_lookup_page(brd, sector);
-   BUG_ON(!page);
-
-   dst = kmap_atomic(page);
-   memcpy(dst + offset, src, copy);
-   kunmap_atomic(dst);
+   if (page) {
+   dst = kmap_atomic(page);
+   memcpy(dst + offset, src, copy);
+   kunmap_atomic(dst);
+   }
 
if (copy < n) {
src += copy;
sector += copy >> SECTOR_SHIFT;
copy = n - copy;
page = brd_lookup_page(brd, sector);
-   BUG_ON(!page);
-
-   dst = kmap_atomic(page);
-   memcpy(dst, src, copy);
-   kunmap_atomic(dst);
+   if (page) {
+   dst = kmap_atomic(page);
+   memcpy(dst, src, copy);
+   kunmap_atomic(dst);
+   }
}
 
rcu_read_unlock();



[PATCH 4/4] brd: remove unused brd_zero_page

2016-10-26 Thread Mikulas Patocka
Remove the function brd_zero_page. This function was used to zero a page
when a discard request came in.

A discard request is a performance or space optimization; it makes no
sense to zero pages on discard, as that neither improves performance nor
saves memory.

Signed-off-by: Mikulas Patocka <mpato...@redhat.com>

---
 drivers/block/brd.c |   23 +----------------------
 1 file changed, 1 insertion(+), 22 deletions(-)

Index: linux-2.6/drivers/block/brd.c
===
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -156,19 +156,6 @@ static void brd_free_page(struct brd_dev
	call_rcu(&page->rcu_head, brd_free_page_rcu);
 }
 
-static void brd_zero_page(struct brd_device *brd, sector_t sector)
-{
-   struct page *page;
-
-   rcu_read_lock();
-
-   page = brd_lookup_page(brd, sector);
-   if (page)
-   clear_highpage(page);
-
-   rcu_read_unlock();
-}
-
 /*
  * Free all backing store pages and radix tree. This must only be called when
  * there are no other users of the device.
@@ -234,15 +221,7 @@ static void discard_from_brd(struct brd_
sector += boundary;
n_sectors -= boundary;
while (n_sectors >= PAGE_SIZE >> SECTOR_SHIFT) {
-   /*
-* Don't want to actually discard pages here because
-* re-allocating the pages can result in writeback
-* deadlocks under heavy load.
-*/
-   if (1)
-   brd_free_page(brd, sector);
-   else
-   brd_zero_page(brd, sector);
+   brd_free_page(brd, sector);
sector += PAGE_SIZE >> SECTOR_SHIFT;
n_sectors -= PAGE_SIZE >> SECTOR_SHIFT;
}



Re: dm-crypt: Fix error with too large bios

2016-08-30 Thread Mikulas Patocka


On Tue, 30 Aug 2016, Mike Snitzer wrote:

> On Tue, Aug 30 2016 at  8:19P -0400,
> Mikulas Patocka <mpato...@redhat.com> wrote:
> 
> > 
> > 
> > On Tue, 30 Aug 2016, Ming Lei wrote:
> > 
> > > On Tue, Aug 30, 2016 at 5:57 AM, Mikulas Patocka <mpato...@redhat.com> 
> > > wrote:
> > > >
> > > >
> > > > On Mon, 29 Aug 2016, Ming Lei wrote:
> > > >
> > > >> On Sat, Aug 27, 2016 at 11:09 PM, Mikulas Patocka 
> > > >> <mpato...@redhat.com> wrote:
> > > >> >
> > > >> > But this patch won't work for device mapper, blk_bio_segment_split is
> > > >> > called from blk_queue_split and device mapper doesn't use 
> > > >> > blk_queue_split
> > > >> > (it can't because it frees queue->bio_split).
> > > >> >
> > > >> > We still need these two patches:
> > > >> > https://www.redhat.com/archives/dm-devel/2016-May/msg00211.html
> > > >> > https://www.redhat.com/archives/dm-devel/2016-May/msg00210.html
> > > >>
> > > >> About the 2nd patch, it might not be good enough because in theory
> > > >> a small size bio still may include big bvecs, such as, each bvec points
> > > >> to 512byte buffer, so strictly speaking the bvec number should
> > > >> be checked instead of bio size.
> > > >>
> > > >> Ming Lei
> > > >
> > > > This is not a problem.
> > > 
> > > I meant the following code in your 2nd patch:
> > > 
> > > + if (unlikely(bio->bi_iter.bi_size > BIO_MAX_SIZE) &&
> > > +(bio->bi_rw & (REQ_FLUSH | REQ_DISCARD | REQ_WRITE)) == REQ_WRITE)
> > > + dm_accept_partial_bio(bio, BIO_MAX_SIZE >> SECTOR_SHIFT);
> > > 
> > > And the check on .bi_size may not work.
> > 
> > kcryptd_crypt_write_convert calls:
> > crypt_alloc_buffer(io, io->base_bio->bi_iter.bi_size)
> > 
> > crypt_alloc_buffer does:
> > unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, cc->bs);
> > 
> > So, if io->base_bio->bi_iter.bi_size <= BIO_MAX_SIZE, then nr_iovecs will 
> > be less than or equal to BIO_MAX_PAGES and the function bio_alloc_bioset will 
> > succeed.
> > 
> > (BTW. BIO_MAX_SIZE was removed in the current kernel, we should use 
> > (BIO_MAX_PAGES << PAGE_SHIFT) instead).
> 
> Is this revised patch OK with you?

Drop that "#ifdef CONFIG_BCACHE". Anyone should be allowed to create a big 
bio, not just bcache.

That one test has no performance impact, there is no need to hide it 
behind #ifdef.

Mikulas

> From: Mikulas Patocka <mpato...@redhat.com>
> Date: Tue, 30 Aug 2016 16:38:42 -0400
> Subject: [PATCH] dm crypt: fix error with too large bcache bios
> 
> When dm-crypt processes writes, it allocates a new bio in
> crypt_alloc_buffer().  The bio is allocated from a bio set and it can
> have at most BIO_MAX_PAGES vector entries, however the incoming bio can be
> larger if it was allocated by bcache.  If the incoming bio is larger,
> bio_alloc_bioset() fails and an error is returned.
> 
> To avoid the error, we test for a too large bio in the function
> crypt_map() and use dm_accept_partial_bio() to split the bio.
> dm_accept_partial_bio() trims the current bio to the desired size and
> asks DM core to send another bio with the rest of the data.
> 
> This fix is wrapped with a check for CONFIG_BCACHE because there
> currently isn't any other code that generates too large bios.  So unless
> bcache is configured there is no point wasting time making this check.
> 
> Signed-off-by: Mikulas Patocka 
> Cc: sta...@vger.kernel.org # v3.16+
> ---
>  drivers/md/dm-crypt.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index eedba67..743f548 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -1924,6 +1924,12 @@ static int crypt_map(struct dm_target *ti, struct bio *bio)
>   return DM_MAPIO_REMAPPED;
>   }
>  
> +#ifdef CONFIG_BCACHE
> + if (unlikely(bio->bi_iter.bi_size > (BIO_MAX_PAGES << PAGE_SHIFT)) &&
> + bio_data_dir(bio) == WRITE)
> + dm_accept_partial_bio(bio, ((BIO_MAX_PAGES << PAGE_SHIFT) >> SECTOR_SHIFT));
> +#endif
> +
>   io = dm_per_bio_data(bio, cc->per_bio_data_size);
>   crypt_io_init(io, cc, bio, dm_target_offset(ti, bio->bi_iter.bi_sector));
>   io->ctx.req = (struct skcipher_request *)(io + 1);
> -- 
> 2.7.4 (Apple Git-66)
> 


Re: dm-crypt: Fix error with too large bios

2016-08-29 Thread Mikulas Patocka


On Mon, 29 Aug 2016, Mike Snitzer wrote:

> On Sat, Aug 27 2016 at 11:09am -0400,
> Mikulas Patocka <mpato...@redhat.com> wrote:
> 
> > 
> > 
> > On Fri, 26 Aug 2016, Mike Snitzer wrote:
> > 
> > > On Thu, Aug 25 2016 at  4:13pm -0400,
> > > Jens Axboe <ax...@kernel.dk> wrote:
> > > 
> > > > On 08/25/2016 12:34 PM, Mikulas Patocka wrote:
> > > > >
> > > > >Device mapper can't split the bio in generic_make_request - it frees 
> > > > >the
> > > > >md->queue->bio_split bioset, to save one kernel thread per device. 
> > > > >Device
> > > > >mapper uses its own splitting mechanism.
> > > > >
> > > > >So what is the final decision? - should device mapper split the big 
> > > > >bio or
> > > > >should bcache not submit big bios?
> > > > >
> > > > >I think splitting big bios in the device mapper is better - simply 
> > > > >because
> > > > >it is much less code than reworking bcache to split bios internally.
> > > > >
> > > > >BTW. In the device mapper, we have a layer dm-io, that was created to 
> > > > >work
> > > > >around bio size limitations - it accepts unlimited I/O request and 
> > > > >splits
> > > > >it to several bios. When bio size limitations are gone, we could 
> > > > >simplify
> > > > >dm-io too.
> > > > 
> > > > The patch from Ming Lei was applied for 4.8 the other day.
> > > 
> > > See linux-block commit:
> > > 4d70dca4eadf2f block: make sure a big bio is split into at most 256 bvecs
> > > http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus=4d70dca4eadf2f95abe389116ac02b8439c2d16c
> > 
> > But this patch won't work for device mapper, blk_bio_segment_split is 
> > called from blk_queue_split and device mapper doesn't use blk_queue_split 
> > (it can't because it frees queue->bio_split).
> > 
> > We still need these two patches:
> > https://www.redhat.com/archives/dm-devel/2016-May/msg00211.html
> > https://www.redhat.com/archives/dm-devel/2016-May/msg00210.html
> 
> I'm left wondering whether it'd be better to revert commit dbba42d8a9e
> ("dm: eliminate unused "bioset" process for each bio-based DM device")

That would create one more process per dm device.

There is no need to create another process if the bio can be split in the 
device mapper.

> And also look for other areas of DM that could leverage the block core's
> relatively new splitting capabilities.
> 
> Otherwise, we'll just be sprinking more BIO_MAX_PAGES-based splitting
> code in DM targets because DM is so "special" that it doesn't need the
> block core's splitting (even though it clearly would now benefit).

It would be possible to split bios on BIO_MAX_PAGES boundary in the dm 
core. But I checked all the dm targets and I found problem only in 
dm-crypt and dm-log-writes. So there is no need to split big bios for the 
other targets.

> Mike

Mikulas


Re: [dm-devel] dm-crypt: Fix error with too large bios

2016-08-25 Thread Mikulas Patocka


On Thu, 18 Aug 2016, Eric Wheeler wrote:

> > On Wed, Jun 01 2016 at  9:44am -0400, Christoph Hellwig 
> >  wrote:
> > 
> > > > > be dm-crypt.c.  Maybe you've identified some indirect use of
> > > > > BIO_MAX_SIZE?
> > > >
> > > > I mean the recently introduced BIO_MAX_SIZE in -next tree:
> > > >
> > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/drivers/md/dm-crypt.c?id=4ed89c97b0706477b822ea2182827640c0cec486
> > >
> > > The crazy bcache bios striking back once again.  I really think it's
> > > harmful having a _MAX value and then having a minor driver
> > > reinterpreting it and sending larger ones.  Until we can lift the
> > > maximum limit in general and have common code exercise it we really need
> > > to stop bcache from sending these instead of littering the tree with
> > > workarounds.
> > 
> > The bio_kmalloc function allocates bios with up to 1024 vector entries (as 
> > opposed to bio_alloc and bio_alloc_bioset that has a limit of 256 vector 
> > entries).
> > 
> > Device mapper is using bio_alloc_bioset with a bio set, so it is limited 
> > to 256 vector entries, but other kernel users may use bio_kmalloc and 
> > create larger bios.
> > 
> > So, if you don't want bios with more than 256 vector entries to exist, you 
> > should impose this limit in bio_kmalloc (and fix all the callers that use 
> > it).
> 
> FYI, Kent Overstreet notes this about bcache from the other thread here:
>   https://lkml.org/lkml/2016/8/15/620
> 
> [paste]
> >> bcache originally had workaround code to split too-large bios when it 
> >> first went upstream - that was dropped only after the patches to make 
> >> generic_make_request() handle arbitrary size bios went in. So to do what 
> >> you're suggesting would mean reverting that bcache patch and bringing that 
> >> code back, which from my perspective would be a step in the wrong 
> >> direction. I just want to get this over and done with.
> >> 
> >> re: interactions with other drivers - bio_clone() has already been changed 
> >> to only clone biovecs that are live for current bi_iter, so there 
> >> shouldn't be any safety issues. A driver would have to be intentionally 
> >> doing its own open coded bio cloning that clones all of bi_io_vec, not 
> >> just the active ones - but if they're doing that, they're already broken 
> >> because a driver isn't allowed to look at bi_vcnt if it isn't a bio that 
> >> it owns - bi_vcnt is 0 on bios that don't own their biovec (i.e. that were 
> >> created by bio_clone_fast).
> >> 
> >> And the cloning and bi_vcnt usage stuff I audited very thoroughly back 
> >> when I was working on immutable biovecs and such back in the day, and I 
> >> had to do a fair amount of cleanup/refactoring before that stuff could go 
> >> in. 
> [/paste]
> 
> They are making progress in the patch-v3 thread, so perhaps this can be 
> fixed for now in generic_make_request().
> 
> --
> Eric Wheeler

Device mapper can't split the bio in generic_make_request - it frees the 
md->queue->bio_split bioset, to save one kernel thread per device. Device 
mapper uses its own splitting mechanism.

So what is the final decision? - should device mapper split the big bio or 
should bcache not submit big bios?

I think splitting big bios in the device mapper is better - simply because 
it is much less code than reworking bcache to split bios internally.

BTW. In the device mapper, we have a layer dm-io, that was created to work 
around bio size limitations - it accepts unlimited I/O request and splits 
it to several bios. When bio size limitations are gone, we could simplify 
dm-io too.

Mikulas