Re: csum failed root -9

2017-06-12 Thread Kai Krakow
Am Mon, 12 Jun 2017 11:00:31 +0200
schrieb Henk Slager :

> Hi all,
> 
> there is 1-block corruption a 8TB filesystem that showed up several
> months ago. The fs is almost exclusively a btrfs receive target and
> receives monthly sequential snapshots from two hosts but 1 received
> uuid. I do not know exactly when the corruption has happened but it
> must have been roughly 3 to 6 months ago. with monthly updated
> kernel+progs on that host.
> 
> Some more history:
> - fs was created in november 2015 on top of luks
> - initially bcache between the 2048-sector aligned partition and luks.
> Some months ago I removed 'the bcache layer' by making sure that cache
> was clean and then zeroing 8K bytes at start of partition in an
> isolated situation. Then setting partion offset to 2064 by
> delete-recreate in gdisk.
> - in december 2016 there were more scrub errors, but related to the
> monthly snapshot of december2016. I have removed that snapshot this
> year and now only this 1-block csum error is the only issue.
> - brand/type is seagate 8TB SMR. At least since kernel 4.4+ that
> includes some SMR related changes in the blocklayer this disk works
> fine with btrfs.
> - the smartctl values show no error so far but I will run an extended
> test this week after another btrfs check which did not show any error
> earlier with the csum fail being there
> - I have noticed that the board that has the disk attached has been
> rebooted due to power-failures many times (unreliable power switch and
> power dips from energy company) and the 150W powersupply is broken and
> replaced since then. Also due to this, I decided to remove bcache
> (which has been in write-through and write-around only).
> 
> Some btrfs inpect-internal exercise shows that the problem is in a
> directory in the root that contains most of the data and snapshots.
> But an  rsync -c  with an identical other clone snapshot shows no
> difference (no writes to an rw snapshot of that clone). So the fs is
> still OK as file-level backup, but btrfs replace/balance will fatal
> error on just this 1 csum error. It looks like that this is not a
> media/disk error but some HW induced error or SW/kernel issue.
> Relevant btrfs commands + dmesg info, see below.
> 
> Any comments on how to fix or handle this without incrementally
> sending all snapshots to a new fs (6+ TiB of data, assuming this won't
> fail)?
> 
> 
> # uname -r
> 4.11.3-1-default
> # btrfs --version
> btrfs-progs v4.10.2+20170406

There's btrfs-progs v4.11 available...

> fs profile is dup for system+meta, single for data
> 
> # btrfs scrub start /local/smr

What looks strange to me is that the parameters of the error reports
seem to be rotated by one... See below:

> [27609.626555] BTRFS error (device dm-0): parent transid verify failed
> on 6350718500864 wanted 23170 found 23076
> [27609.685416] BTRFS info (device dm-0): read error corrected: ino 1
> off 6350718500864 (dev /dev/mapper/smr sector 11681212672)
> [27609.685928] BTRFS info (device dm-0): read error corrected: ino 1
> off 6350718504960 (dev /dev/mapper/smr sector 11681212680)
> [27609.686160] BTRFS info (device dm-0): read error corrected: ino 1
> off 6350718509056 (dev /dev/mapper/smr sector 11681212688)
> [27609.687136] BTRFS info (device dm-0): read error corrected: ino 1
> off 6350718513152 (dev /dev/mapper/smr sector 11681212696)
> [37663.606455] BTRFS error (device dm-0): parent transid verify failed
> on 6350453751808 wanted 23170 found 23075
> [37663.685158] BTRFS info (device dm-0): read error corrected: ino 1
> off 6350453751808 (dev /dev/mapper/smr sector 11679647008)
> [37663.685386] BTRFS info (device dm-0): read error corrected: ino 1
> off 6350453755904 (dev /dev/mapper/smr sector 11679647016)
> [37663.685587] BTRFS info (device dm-0): read error corrected: ino 1
> off 635045376 (dev /dev/mapper/smr sector 11679647024)
> [37663.685798] BTRFS info (device dm-0): read error corrected: ino 1
> off 6350453764096 (dev /dev/mapper/smr sector 11679647032)

Why does it say "ino 1"? Does it mean devid 1?

> [43497.234598] BTRFS error (device dm-0): bdev /dev/mapper/smr errs:
> wr 0, rd 0, flush 0, corrupt 1, gen 0
> [43497.234605] BTRFS error (device dm-0): unable to fixup (regular)
> error at logical 7175413624832 on dev /dev/mapper/smr
> 
> # < figure out which chunk with help of btrfs py lib >
> 
> chunk vaddr 7174898057216 type 1 stripe 0 devid 1 offset 6696948727808
> length 1073741824 used 1073741824 used_pct 100
> chunk vaddr 7175971799040 type 1 stripe 0 devid 1 offset 6698022469632
> length 1073741824 used 1073741824 used_pct 100
> 
> # btrfs balance start -v
> -dvrange=7174898057216..7174898057217 /local/smr
> 
> [74250.913273] BTRFS info (device dm-0): relocating block group
> 7174898057216 flags data
> [74255.941105] BTRFS warning (device dm-0): csum failed root -9 ino
> 257 off 515567616 csum 0x589cb236 expected csum 0xee19bf74 mirror 1
> [74255.965804] BTRFS warning (device dm-0): csum 

Re: [PATCH v6 19/20] xfs: minimal conversion to errseq_t writeback error reporting

2017-06-12 Thread Darrick J. Wong
On Mon, Jun 12, 2017 at 08:23:15AM -0400, Jeff Layton wrote:
> Just set the FS_WB_ERRSEQ flag to indicate that we want to use errseq_t
> based error reporting. Internal filemap_* calls are left as-is for now.
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/xfs/xfs_super.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 455a575f101d..28d3be187025 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1758,7 +1758,7 @@ static struct file_system_type xfs_fs_type = {
>   .name   = "xfs",
>   .mount  = xfs_fs_mount,
>   .kill_sb= kill_block_super,
> - .fs_flags   = FS_REQUIRES_DEV,
> + .fs_flags   = FS_REQUIRES_DEV | FS_WB_ERRSEQ,

Huh?  Why are there two patches with the same subject line?  And this
same bit of code too?  Or ... 11/13, 11/20?  What's going on here?



--D

>  };
>  MODULE_ALIAS_FS("xfs");
>  
> -- 
> 2.13.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] btrfs: use ino_t for inode in tracer

2017-06-12 Thread Anand Jain
Signed-off-by: Anand Jain 
---
 v2: fix another location where ino_t was required.

 include/trace/events/btrfs.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index e37973526153..692207e3f5d5 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1418,7 +1418,7 @@ DECLARE_EVENT_CLASS(btrfs__qgroup_data_map,
 
TP_STRUCT__entry_btrfs(
__field(u64,rootid  )
-   __field(unsigned long,  ino )
+   __field(ino_t,  ino )
__field(u64,free_reserved   )
),
 
@@ -1459,7 +1459,7 @@ DECLARE_EVENT_CLASS(btrfs__qgroup_rsv_data,
 
TP_STRUCT__entry_btrfs(
__field(u64,rootid  )
-   __field(unsigned long,  ino )
+   __field(ino_t,  ino )
__field(u64,start   )
__field(u64,len )
__field(u64,reserved)
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/10 v11] No wait AIO

2017-06-12 Thread Goldwyn Rodrigues


On 06/10/2017 12:34 AM, Al Viro wrote:
> On Thu, Jun 08, 2017 at 12:39:10AM -0700, Christoph Hellwig wrote:
>> As already indicated this whole series looks fine to me.
>>
>> Al: are you going to pick this up?  Or Andrew?
> 
> The main issue here is "let's bail out from ->write_iter() instances"
> patch.  It very obviously has holes in coverage.
> 
> Could we have FMODE_AIO_NOWAIT and make those who claim to support it
> set that in ->open()?  And make aio check that and bail out if asked
> for nowait on a file without that flag...
> 

Yes, I would agree.

We had FS_NOWAIT in filesystem type flags (in v3), but retracted it
later in v4.

Another option could be to keep the feature against FS_REQUIRES_DEV to
rule out filesystems which are not local, but it again has the problem
of holes in coverage.

I will work on adding FMODE_AIO_NOWAIT in the meantime.

Thanks,

-- 
Goldwyn
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3] Btrfs: change how we iterate bios in endio

2017-06-12 Thread Liu Bo
Since dio submit has used bio_clone_fast, the submitted bio may not have a
reliable bi_vcnt, for the bio vector iterations in checksum related
functions, bio->bi_iter is not modified yet and it's safe to use
bio_for_each_segment, while for those bio vector iterations in dio read's
endio, we now save a copy of bvec_iter in struct btrfs_io_bio when cloning
bios and use the helper __bio_for_each_segment with the saved bvec_iter to
access each bvec.

Also for dio reads which don't get split, we also need to save a copy of
bio iterator in btrfs_bio_clone to let __bio_for_each_segments to access
each bvec in dio read's endio.  Note that it doesn't affect other calls of
btrfs_bio_clone() because they don't need to use this iterator.

Cc: David Sterba 
Signed-off-by: Liu Bo 
---
v2: Fix null pointer crash in the case of non-split dio reads.

 fs/btrfs/extent_io.c |  2 ++
 fs/btrfs/file-item.c | 31 +++
 fs/btrfs/inode.c | 35 +++
 fs/btrfs/volumes.h   |  1 +
 4 files changed, 37 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 89b824d..79be8c2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2719,6 +2719,7 @@ struct bio *btrfs_bio_clone(struct bio *bio, gfp_t 
gfp_mask)
btrfs_bio->csum = NULL;
btrfs_bio->csum_allocated = NULL;
btrfs_bio->end_io = NULL;
+   btrfs_bio->iter = bio->bi_iter;
}
return new;
 }
@@ -2755,6 +2756,7 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, 
gfp_t gfp_mask,
btrfs_bio->end_io = NULL;
 
bio_trim(bio, offset >> 9, size >> 9);
+   btrfs_bio->iter = bio->bi_iter;
return bio;
 }
 
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 64fcb31..9f6062c 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -164,7 +164,8 @@ static int __btrfs_lookup_bio_sums(struct inode *inode, 
struct bio *bio,
   u64 logical_offset, u32 *dst, int dio)
 {
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-   struct bio_vec *bvec;
+   struct bio_vec bvec;
+   struct bvec_iter iter;
struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
struct btrfs_csum_item *item = NULL;
struct extent_io_tree *io_tree = _I(inode)->io_tree;
@@ -177,7 +178,7 @@ static int __btrfs_lookup_bio_sums(struct inode *inode, 
struct bio *bio,
u64 page_bytes_left;
u32 diff;
int nblocks;
-   int count = 0, i;
+   int count = 0;
u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);
 
path = btrfs_alloc_path();
@@ -206,8 +207,6 @@ static int __btrfs_lookup_bio_sums(struct inode *inode, 
struct bio *bio,
if (bio->bi_iter.bi_size > PAGE_SIZE * 8)
path->reada = READA_FORWARD;
 
-   WARN_ON(bio->bi_vcnt <= 0);
-
/*
 * the free space stuff is only read when it hasn't been
 * updated in the current transaction.  So, we can safely
@@ -223,13 +222,13 @@ static int __btrfs_lookup_bio_sums(struct inode *inode, 
struct bio *bio,
if (dio)
offset = logical_offset;
 
-   bio_for_each_segment_all(bvec, bio, i) {
-   page_bytes_left = bvec->bv_len;
+   bio_for_each_segment(bvec, bio, iter) {
+   page_bytes_left = bvec.bv_len;
if (count)
goto next;
 
if (!dio)
-   offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+   offset = page_offset(bvec.bv_page) + bvec.bv_offset;
count = btrfs_find_ordered_sum(inode, offset, disk_bytenr,
   (u32 *)csum, nblocks);
if (count)
@@ -440,15 +439,15 @@ int btrfs_csum_one_bio(struct inode *inode, struct bio 
*bio,
struct btrfs_ordered_sum *sums;
struct btrfs_ordered_extent *ordered = NULL;
char *data;
-   struct bio_vec *bvec;
+   struct bvec_iter iter;
+   struct bio_vec bvec;
int index;
int nr_sectors;
-   int i, j;
unsigned long total_bytes = 0;
unsigned long this_sum_bytes = 0;
+   int i;
u64 offset;
 
-   WARN_ON(bio->bi_vcnt <= 0);
sums = kzalloc(btrfs_ordered_sum_size(fs_info, bio->bi_iter.bi_size),
   GFP_NOFS);
if (!sums)
@@ -465,19 +464,19 @@ int btrfs_csum_one_bio(struct inode *inode, struct bio 
*bio,
sums->bytenr = (u64)bio->bi_iter.bi_sector << 9;
index = 0;
 
-   bio_for_each_segment_all(bvec, bio, j) {
+   bio_for_each_segment(bvec, bio, iter) {
if (!contig)
-   offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+   offset = page_offset(bvec.bv_page) + bvec.bv_offset;
 
if (!ordered) {
  

Re: [PATCH 3/3] btrfs: sink gfp parameter to btrfs_io_bio_alloc

2017-06-12 Thread Liu Bo
On Mon, Jun 12, 2017 at 05:29:41PM +0200, David Sterba wrote:
> We can hardcode GFP_NOFS to btrfs_io_bio_alloc, although it means we
> change it back from GFP_KERNEL in scrub. I'd rather save a few stack
> bytes from not passing the gfp flags in the remaining, more imporatant,
> contexts and the bio allocating API now looks more consistent.
> 

Reviewed-by: Liu Bo 

-liubo
> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/check-integrity.c |  2 +-
>  fs/btrfs/disk-io.c |  2 +-
>  fs/btrfs/extent_io.c   |  8 
>  fs/btrfs/extent_io.h   |  2 +-
>  fs/btrfs/raid56.c  |  2 +-
>  fs/btrfs/scrub.c   | 16 +++-
>  6 files changed, 15 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
> index 160879c802d0..e3b1d08dd03c 100644
> --- a/fs/btrfs/check-integrity.c
> +++ b/fs/btrfs/check-integrity.c
> @@ -1638,7 +1638,7 @@ static int btrfsic_read_block(struct btrfsic_state 
> *state,
>   struct bio *bio;
>   unsigned int j;
>  
> - bio = btrfs_io_bio_alloc(GFP_NOFS, num_pages - i);
> + bio = btrfs_io_bio_alloc(num_pages - i);
>   bio->bi_bdev = block_ctx->dev->bdev;
>   bio->bi_iter.bi_sector = dev_bytenr >> 9;
>   bio_set_op_attrs(bio, REQ_OP_READ, 0);
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 9f2ffe2c6afb..8b57c280e5cd 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3532,7 +3532,7 @@ static int write_dev_flush(struct btrfs_device *device, 
> int wait)
>* caller
>*/
>   device->flush_bio = NULL;
> - bio = btrfs_io_bio_alloc(GFP_NOFS, 0);
> + bio = btrfs_io_bio_alloc(0);
>   bio->bi_end_io = btrfs_end_empty_barrier;
>   bio->bi_bdev = device->bdev;
>   bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH;
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index cbd0a9a1daa5..29a6111a68d2 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1987,7 +1987,7 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, 
> u64 ino, u64 start,
>   ASSERT(!(fs_info->sb->s_flags & MS_RDONLY));
>   BUG_ON(!mirror_num);
>  
> - bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
> + bio = btrfs_io_bio_alloc(1);
>   bio->bi_iter.bi_size = 0;
>   map_length = length;
>  
> @@ -2331,7 +2331,7 @@ struct bio *btrfs_create_repair_bio(struct inode 
> *inode, struct bio *failed_bio,
>   struct btrfs_io_bio *btrfs_failed_bio;
>   struct btrfs_io_bio *btrfs_bio;
>  
> - bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
> + bio = btrfs_io_bio_alloc(1);
>   bio->bi_end_io = endio_func;
>   bio->bi_iter.bi_sector = failrec->logical >> 9;
>   bio->bi_bdev = fs_info->fs_devices->latest_bdev;
> @@ -2692,12 +2692,12 @@ struct bio *btrfs_bio_clone(struct bio *bio)
>   return new;
>  }
>  
> -struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
> +struct bio *btrfs_io_bio_alloc(unsigned int nr_iovecs)
>  {
>   struct bio *bio;
>  
>   /* Bio allocation backed by a bioset does not fail */
> - bio = bio_alloc_bioset(gfp_mask, nr_iovecs, btrfs_bioset);
> + bio = bio_alloc_bioset(GFP_NOFS, nr_iovecs, btrfs_bioset);
>   btrfs_io_bio_init(btrfs_io_bio(bio));
>   return bio;
>  }
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index 8071e3977614..1e508a8f876e 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -463,7 +463,7 @@ void extent_clear_unlock_delalloc(struct inode *inode, 
> u64 start, u64 end,
>unsigned bits_to_clear,
>unsigned long page_ops);
>  struct bio *btrfs_bio_alloc(struct block_device *bdev, u64 first_byte);
> -struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs);
> +struct bio *btrfs_io_bio_alloc(unsigned int nr_iovecs);
>  struct bio *btrfs_bio_clone(struct bio *bio);
>  struct bio *btrfs_bio_clone_partial(struct bio *orig, int offset, int size);
>  
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index 7dd55448ac68..b9abb0b01021 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -1098,7 +1098,7 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
>   }
>  
>   /* put a new bio on the list */
> - bio = btrfs_io_bio_alloc(GFP_NOFS, bio_max_len >> PAGE_SHIFT?:1);
> + bio = btrfs_io_bio_alloc(bio_max_len >> PAGE_SHIFT ?: 1);
>   bio->bi_iter.bi_size = 0;
>   bio->bi_bdev = stripe->dev->bdev;
>   bio->bi_iter.bi_sector = disk_start >> 9;
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 1e2dfea00b2f..58a249cd5adc 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -1737,7 +1737,7 @@ static void scrub_recheck_block(struct btrfs_fs_info 
> *fs_info,
>   }
>  
>   WARN_ON(!page->page);
> - bio = btrfs_io_bio_alloc(GFP_NOFS, 

Re: [PATCH 3/7] Btrfs: update total_bytes_pinned when pinning down extents

2017-06-12 Thread Liu Bo
On Tue, Jun 06, 2017 at 04:45:28PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> The extents marked in pin_down_extent() will be unpinned later in
> unpin_extent_range(), which decrements total_bytes_pinned.
> pin_down_extent() must increment the counter to avoid underflowing it.
> Also adjust btrfs_free_tree_block() to avoid accounting for the same
> extent twice.
> 

Reviewed-by: Liu Bo 

-liubo
> Signed-off-by: Omar Sandoval 
> ---
>  fs/btrfs/extent-tree.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 6032e9a635f2..9cff937415cb 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -6343,6 +6343,7 @@ static int pin_down_extent(struct btrfs_fs_info 
> *fs_info,
>  
>   trace_btrfs_space_reservation(fs_info, "pinned",
> cache->space_info->flags, num_bytes, 1);
> + percpu_counter_add(>space_info->total_bytes_pinned, num_bytes);
>   set_extent_dirty(fs_info->pinned_extents, bytenr,
>bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
>   return 0;
> @@ -7189,6 +7190,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle 
> *trans,
>   goto out;
>   }
>  
> + pin = 0;
>   cache = btrfs_lookup_block_group(fs_info, buf->start);
>  
>   if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
> @@ -7204,7 +7206,6 @@ void btrfs_free_tree_block(struct btrfs_trans_handle 
> *trans,
>   btrfs_free_reserved_bytes(cache, buf->len, 0);
>   btrfs_put_block_group(cache);
>   trace_btrfs_reserved_extent_free(fs_info, buf->start, buf->len);
> - pin = 0;
>   }
>  out:
>   if (pin)
> -- 
> 2.13.0
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/7] Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64

2017-06-12 Thread Liu Bo
On Tue, Jun 06, 2017 at 04:45:26PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> There are a few places where we pass in a negative num_bytes, so make it
> signed for clarity. Also move it up in the file since later patches will
> need it there.
> 

Reviewed-by: Liu Bo 

-liubo
> Signed-off-by: Omar Sandoval 
> ---
>  fs/btrfs/extent-tree.c | 41 -
>  1 file changed, 20 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index e390451c72e6..7c01b4e9e3b6 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -766,6 +766,26 @@ static struct btrfs_space_info *__find_space_info(struct 
> btrfs_fs_info *info,
>   return NULL;
>  }
>  
> +static void add_pinned_bytes(struct btrfs_fs_info *fs_info, s64 num_bytes,
> +  u64 owner, u64 root_objectid)
> +{
> + struct btrfs_space_info *space_info;
> + u64 flags;
> +
> + if (owner < BTRFS_FIRST_FREE_OBJECTID) {
> + if (root_objectid == BTRFS_CHUNK_TREE_OBJECTID)
> + flags = BTRFS_BLOCK_GROUP_SYSTEM;
> + else
> + flags = BTRFS_BLOCK_GROUP_METADATA;
> + } else {
> + flags = BTRFS_BLOCK_GROUP_DATA;
> + }
> +
> + space_info = __find_space_info(fs_info, flags);
> + BUG_ON(!space_info); /* Logic bug */
> + percpu_counter_add(_info->total_bytes_pinned, num_bytes);
> +}
> +
>  /*
>   * after adding space to the filesystem, we need to clear the full flags
>   * on all the space infos.
> @@ -6793,27 +6813,6 @@ int btrfs_finish_extent_commit(struct 
> btrfs_trans_handle *trans,
>   return 0;
>  }
>  
> -static void add_pinned_bytes(struct btrfs_fs_info *fs_info, u64 num_bytes,
> -  u64 owner, u64 root_objectid)
> -{
> - struct btrfs_space_info *space_info;
> - u64 flags;
> -
> - if (owner < BTRFS_FIRST_FREE_OBJECTID) {
> - if (root_objectid == BTRFS_CHUNK_TREE_OBJECTID)
> - flags = BTRFS_BLOCK_GROUP_SYSTEM;
> - else
> - flags = BTRFS_BLOCK_GROUP_METADATA;
> - } else {
> - flags = BTRFS_BLOCK_GROUP_DATA;
> - }
> -
> - space_info = __find_space_info(fs_info, flags);
> - BUG_ON(!space_info); /* Logic bug */
> - percpu_counter_add(_info->total_bytes_pinned, num_bytes);
> -}
> -
> -
>  static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
>   struct btrfs_fs_info *info,
>   struct btrfs_delayed_ref_node *node, u64 parent,
> -- 
> 2.13.0
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] btrfs: add helper to initialize the non-bio part of btrfs_io_bio

2017-06-12 Thread Liu Bo
On Mon, Jun 12, 2017 at 05:29:39PM +0200, David Sterba wrote:
> We use btrfs_bioset for bios and ask to allocate the entire size of
> btrfs_io_bio from btrfs bio_alloc_bioset. The member 'bio' is
> initialized but the bytes from 0 to offset of 'bio' are left
> uninitialized. Although we initialize some of the members in our
> helpers, we should initialize the whole structures.
> 

Reviewed-by: Liu Bo 

-liubo
> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/extent_io.c | 30 ++
>  1 file changed, 14 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 5037fd918f43..cbd0a9a1daa5 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2654,22 +2654,28 @@ static void end_bio_extent_readpage(struct bio *bio)
>  }
>  
>  /*
> + * Initialize the members up to but not including 'bio'. Use after 
> allocating a
> + * new bio by bio_alloc_bioset as it does not initialize the bytes outside of
> + * 'bio' because use of __GFP_ZERO is not supported.
> + */
> +static inline void btrfs_io_bio_init(struct btrfs_io_bio *btrfs_bio)
> +{
> + memset(btrfs_bio, 0, offsetof(struct btrfs_io_bio, bio));
> +}
> +
> +/*
>   * The following helpers allocate a bio. As it's backed by a bioset, it'll
>   * never fail.  We're returning a bio right now but you can call btrfs_io_bio
>   * for the appropriate container_of magic
>   */
>  struct bio *btrfs_bio_alloc(struct block_device *bdev, u64 first_byte)
>  {
> - struct btrfs_io_bio *btrfs_bio;
>   struct bio *bio;
>  
>   bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_PAGES, btrfs_bioset);
>   bio->bi_bdev = bdev;
>   bio->bi_iter.bi_sector = first_byte >> 9;
> - btrfs_bio = btrfs_io_bio(bio);
> - btrfs_bio->csum = NULL;
> - btrfs_bio->csum_allocated = NULL;
> - btrfs_bio->end_io = NULL;
> + btrfs_io_bio_init(btrfs_io_bio(bio));
>   return bio;
>  }
>  
> @@ -2681,24 +2687,18 @@ struct bio *btrfs_bio_clone(struct bio *bio)
>   /* Bio allocation backed by a bioset does not fail */
>   new = bio_clone_fast(bio, GFP_NOFS, btrfs_bioset);
>   btrfs_bio = btrfs_io_bio(new);
> - btrfs_bio->csum = NULL;
> - btrfs_bio->csum_allocated = NULL;
> - btrfs_bio->end_io = NULL;
> + btrfs_io_bio_init(btrfs_bio);
>   btrfs_bio->iter = bio->bi_iter;
>   return new;
>  }
>  
>  struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
>  {
> - struct btrfs_io_bio *btrfs_bio;
>   struct bio *bio;
>  
>   /* Bio allocation backed by a bioset does not fail */
>   bio = bio_alloc_bioset(gfp_mask, nr_iovecs, btrfs_bioset);
> - btrfs_bio = btrfs_io_bio(bio);
> - btrfs_bio->csum = NULL;
> - btrfs_bio->csum_allocated = NULL;
> - btrfs_bio->end_io = NULL;
> + btrfs_io_bio_init(btrfs_io_bio(bio));
>   return bio;
>  }
>  
> @@ -2712,9 +2712,7 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, 
> int offset, int size)
>   ASSERT(bio);
>  
>   btrfs_bio = btrfs_io_bio(bio);
> - btrfs_bio->csum = NULL;
> - btrfs_bio->csum_allocated = NULL;
> - btrfs_bio->end_io = NULL;
> + btrfs_io_bio_init(btrfs_bio);
>  
>   bio_trim(bio, offset >> 9, size >> 9);
>   btrfs_bio->iter = bio->bi_iter;
> -- 
> 2.13.0
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] More bio fixes and cleanups

2017-06-12 Thread David Sterba
Main part is the initialization of btrfs_io_bio, the rest is doc & cleanup.

David Sterba (3):
  btrfs: document mandatory order of bio in btrfs_io_bio
  btrfs: add helper to initialize the non-bio part of btrfs_io_bio
  btrfs: sink gfp parameter to btrfs_io_bio_alloc

 fs/btrfs/check-integrity.c |  2 +-
 fs/btrfs/disk-io.c |  2 +-
 fs/btrfs/extent_io.c   | 38 ++
 fs/btrfs/extent_io.h   |  2 +-
 fs/btrfs/raid56.c  |  2 +-
 fs/btrfs/scrub.c   | 16 +++-
 fs/btrfs/volumes.h |  4 
 7 files changed, 33 insertions(+), 33 deletions(-)

-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] btrfs: sink gfp parameter to btrfs_io_bio_alloc

2017-06-12 Thread David Sterba
We can hardcode GFP_NOFS to btrfs_io_bio_alloc, although it means we
change it back from GFP_KERNEL in scrub. I'd rather save a few stack
bytes from not passing the gfp flags in the remaining, more imporatant,
contexts and the bio allocating API now looks more consistent.

Signed-off-by: David Sterba 
---
 fs/btrfs/check-integrity.c |  2 +-
 fs/btrfs/disk-io.c |  2 +-
 fs/btrfs/extent_io.c   |  8 
 fs/btrfs/extent_io.h   |  2 +-
 fs/btrfs/raid56.c  |  2 +-
 fs/btrfs/scrub.c   | 16 +++-
 6 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 160879c802d0..e3b1d08dd03c 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -1638,7 +1638,7 @@ static int btrfsic_read_block(struct btrfsic_state *state,
struct bio *bio;
unsigned int j;
 
-   bio = btrfs_io_bio_alloc(GFP_NOFS, num_pages - i);
+   bio = btrfs_io_bio_alloc(num_pages - i);
bio->bi_bdev = block_ctx->dev->bdev;
bio->bi_iter.bi_sector = dev_bytenr >> 9;
bio_set_op_attrs(bio, REQ_OP_READ, 0);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9f2ffe2c6afb..8b57c280e5cd 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3532,7 +3532,7 @@ static int write_dev_flush(struct btrfs_device *device, 
int wait)
 * caller
 */
device->flush_bio = NULL;
-   bio = btrfs_io_bio_alloc(GFP_NOFS, 0);
+   bio = btrfs_io_bio_alloc(0);
bio->bi_end_io = btrfs_end_empty_barrier;
bio->bi_bdev = device->bdev;
bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cbd0a9a1daa5..29a6111a68d2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1987,7 +1987,7 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 
ino, u64 start,
ASSERT(!(fs_info->sb->s_flags & MS_RDONLY));
BUG_ON(!mirror_num);
 
-   bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
+   bio = btrfs_io_bio_alloc(1);
bio->bi_iter.bi_size = 0;
map_length = length;
 
@@ -2331,7 +2331,7 @@ struct bio *btrfs_create_repair_bio(struct inode *inode, 
struct bio *failed_bio,
struct btrfs_io_bio *btrfs_failed_bio;
struct btrfs_io_bio *btrfs_bio;
 
-   bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
+   bio = btrfs_io_bio_alloc(1);
bio->bi_end_io = endio_func;
bio->bi_iter.bi_sector = failrec->logical >> 9;
bio->bi_bdev = fs_info->fs_devices->latest_bdev;
@@ -2692,12 +2692,12 @@ struct bio *btrfs_bio_clone(struct bio *bio)
return new;
 }
 
-struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
+struct bio *btrfs_io_bio_alloc(unsigned int nr_iovecs)
 {
struct bio *bio;
 
/* Bio allocation backed by a bioset does not fail */
-   bio = bio_alloc_bioset(gfp_mask, nr_iovecs, btrfs_bioset);
+   bio = bio_alloc_bioset(GFP_NOFS, nr_iovecs, btrfs_bioset);
btrfs_io_bio_init(btrfs_io_bio(bio));
return bio;
 }
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 8071e3977614..1e508a8f876e 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -463,7 +463,7 @@ void extent_clear_unlock_delalloc(struct inode *inode, u64 
start, u64 end,
 unsigned bits_to_clear,
 unsigned long page_ops);
 struct bio *btrfs_bio_alloc(struct block_device *bdev, u64 first_byte);
-struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs);
+struct bio *btrfs_io_bio_alloc(unsigned int nr_iovecs);
 struct bio *btrfs_bio_clone(struct bio *bio);
 struct bio *btrfs_bio_clone_partial(struct bio *orig, int offset, int size);
 
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 7dd55448ac68..b9abb0b01021 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1098,7 +1098,7 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
}
 
/* put a new bio on the list */
-   bio = btrfs_io_bio_alloc(GFP_NOFS, bio_max_len >> PAGE_SHIFT?:1);
+   bio = btrfs_io_bio_alloc(bio_max_len >> PAGE_SHIFT ?: 1);
bio->bi_iter.bi_size = 0;
bio->bi_bdev = stripe->dev->bdev;
bio->bi_iter.bi_sector = disk_start >> 9;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 1e2dfea00b2f..58a249cd5adc 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1737,7 +1737,7 @@ static void scrub_recheck_block(struct btrfs_fs_info 
*fs_info,
}
 
WARN_ON(!page->page);
-   bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
+   bio = btrfs_io_bio_alloc(1);
bio->bi_bdev = page->dev->bdev;
 
bio_add_page(bio, page->page, PAGE_SIZE, 0);
@@ -1825,7 +1825,7 @@ static int scrub_repair_page_from_good_copy(struct 
scrub_block *sblock_bad,
  

[PATCH 2/3] btrfs: add helper to initialize the non-bio part of btrfs_io_bio

2017-06-12 Thread David Sterba
We use btrfs_bioset for bios and ask to allocate the entire size of
btrfs_io_bio from btrfs bio_alloc_bioset. The member 'bio' is
initialized but the bytes from 0 to offset of 'bio' are left
uninitialized. Although we initialize some of the members in our
helpers, we should initialize the whole structures.

Signed-off-by: David Sterba 
---
 fs/btrfs/extent_io.c | 30 ++
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 5037fd918f43..cbd0a9a1daa5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2654,22 +2654,28 @@ static void end_bio_extent_readpage(struct bio *bio)
 }
 
 /*
+ * Initialize the members up to but not including 'bio'. Use after allocating a
+ * new bio by bio_alloc_bioset as it does not initialize the bytes outside of
+ * 'bio' because use of __GFP_ZERO is not supported.
+ */
+static inline void btrfs_io_bio_init(struct btrfs_io_bio *btrfs_bio)
+{
+   memset(btrfs_bio, 0, offsetof(struct btrfs_io_bio, bio));
+}
+
+/*
  * The following helpers allocate a bio. As it's backed by a bioset, it'll
  * never fail.  We're returning a bio right now but you can call btrfs_io_bio
  * for the appropriate container_of magic
  */
 struct bio *btrfs_bio_alloc(struct block_device *bdev, u64 first_byte)
 {
-   struct btrfs_io_bio *btrfs_bio;
struct bio *bio;
 
bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_PAGES, btrfs_bioset);
bio->bi_bdev = bdev;
bio->bi_iter.bi_sector = first_byte >> 9;
-   btrfs_bio = btrfs_io_bio(bio);
-   btrfs_bio->csum = NULL;
-   btrfs_bio->csum_allocated = NULL;
-   btrfs_bio->end_io = NULL;
+   btrfs_io_bio_init(btrfs_io_bio(bio));
return bio;
 }
 
@@ -2681,24 +2687,18 @@ struct bio *btrfs_bio_clone(struct bio *bio)
/* Bio allocation backed by a bioset does not fail */
new = bio_clone_fast(bio, GFP_NOFS, btrfs_bioset);
btrfs_bio = btrfs_io_bio(new);
-   btrfs_bio->csum = NULL;
-   btrfs_bio->csum_allocated = NULL;
-   btrfs_bio->end_io = NULL;
+   btrfs_io_bio_init(btrfs_bio);
btrfs_bio->iter = bio->bi_iter;
return new;
 }
 
 struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
 {
-   struct btrfs_io_bio *btrfs_bio;
struct bio *bio;
 
/* Bio allocation backed by a bioset does not fail */
bio = bio_alloc_bioset(gfp_mask, nr_iovecs, btrfs_bioset);
-   btrfs_bio = btrfs_io_bio(bio);
-   btrfs_bio->csum = NULL;
-   btrfs_bio->csum_allocated = NULL;
-   btrfs_bio->end_io = NULL;
+   btrfs_io_bio_init(btrfs_io_bio(bio));
return bio;
 }
 
@@ -2712,9 +2712,7 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, int 
offset, int size)
ASSERT(bio);
 
btrfs_bio = btrfs_io_bio(bio);
-   btrfs_bio->csum = NULL;
-   btrfs_bio->csum_allocated = NULL;
-   btrfs_bio->end_io = NULL;
+   btrfs_io_bio_init(btrfs_bio);
 
bio_trim(bio, offset >> 9, size >> 9);
btrfs_bio->iter = bio->bi_iter;
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] btrfs: document mandatory order of bio in btrfs_io_bio

2017-06-12 Thread David Sterba
Signed-off-by: David Sterba 
---
 fs/btrfs/volumes.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 58b97b6f5f02..35327efecdbb 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -281,6 +281,10 @@ struct btrfs_io_bio {
u8 *csum_allocated;
btrfs_io_bio_end_io_t *end_io;
struct bvec_iter iter;
+   /*
+* This member must come last, bio_alloc_bioset will allocate enough
+* bytes for entire btrfs_io_bio but relies on bio being last.
+*/
struct bio bio;
 };
 
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: btrfs_ioctl_search_key documentation

2017-06-12 Thread David Sterba
On Mon, Jun 12, 2017 at 06:03:15PM +0200, Hans van Kranenburg wrote:
> >> Most interesting changes since v1:
> >>  - mention the special tree_id input value 0
> >>  - rewrite the part about min_key and max_key, trying to be more concise
> > 
> > I find the description instructive enough so the expanded expression to
> > describe the whole range is not IMHO needed.
> 
> You mean drop the extra line "All metadata..." ? Yeah, it's a bit
> redudant, stressing the fact, yes.

Ah, sorry I was not clear. I was referring to Goffredo's proposal with
the expression how the min_key and max_key are calculated. Your text in
v2 is fine. We know how to calculate one key and we know the where are
the limits.

> The main purpose is to stop users from thinking that setting min_type
> and max_type will filter the returned objects (like, only getting
> BLOCK_GROUP_ITEM_KEY or so). So as long as you think that's clear
> enough, I'm ok with anything.

The text looks good to me and I've added the patch to the queue. Further
refinements are welcome.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: flush error also accounts for its send error

2017-06-12 Thread David Sterba
On Fri, Jun 09, 2017 at 02:52:29PM +0800, Anand Jain wrote:
> So remove the static check of send error

Please improve the changelog text.

> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/disk-io.c | 19 +--
>  1 file changed, 9 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index d8151839bb3d..682c494dbc7f 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3534,14 +3534,11 @@ static int wait_dev_flush(struct btrfs_device *device)
>  
>  static int check_barrier_error(struct btrfs_fs_devices *fsdevs)
>  {
> - int submit_flush_error = 0;
>   int dev_flush_error = 0;
>   struct btrfs_device *dev;
> - int tolerance;
>  
>   list_for_each_entry_rcu(dev, >devices, dev_list) {
>   if (!dev->bdev) {
> - submit_flush_error++;
>   dev_flush_error++;
>   continue;
>   }
> @@ -3549,8 +3546,8 @@ static int check_barrier_error(struct btrfs_fs_devices 
> *fsdevs)
>   dev_flush_error++;
>   }
>  
> - tolerance = fsdevs->fs_info->num_tolerated_disk_barrier_failures;
> - if (submit_flush_error > tolerance || dev_flush_error > tolerance)
> + if (dev_flush_error >
> + fsdevs->fs_info->num_tolerated_disk_barrier_failures)
>   return -EIO;
>  
>   return 0;
> @@ -3564,7 +3561,6 @@ static int barrier_all_devices(struct btrfs_fs_info 
> *info)
>  {
>   struct list_head *head;
>   struct btrfs_device *dev;
> - int errors_send = 0;
>   int errors_wait = 0;
>   int ret;
>  
> @@ -3573,10 +3569,9 @@ static int barrier_all_devices(struct btrfs_fs_info 
> *info)
>   list_for_each_entry_rcu(dev, head, dev_list) {
>   if (dev->missing)
>   continue;
> - if (!dev->bdev) {
> - errors_send++;
> + if (!dev->bdev)
>   continue;
> - }
> +
>   if (!dev->in_fs_metadata || !dev->writeable)
>   continue;
>  
> @@ -3602,7 +3597,11 @@ static int barrier_all_devices(struct btrfs_fs_info 
> *info)
>   }
>   }
>  
> - if (errors_send || errors_wait) {
> + /*
> +  * By checking errors_wait here it avoids traverse through
> +  * the device loop for normal healthy case
> +  */

I think the code makes it obvious that we skip the barrier check, the
comment does not bring any new information, please drop it.

> + if (errors_wait) {
>   /*
>* At some point we need the status of all disks
>* to arrive at the volume status. So error checking
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] btrfs: write_dev_flush does not return ENOMEM anymore

2017-06-12 Thread David Sterba
On Fri, Jun 09, 2017 at 02:52:28PM +0800, Anand Jain wrote:
> Commit
>  9035b5dbc576 btrfs: btrfs_io_bio_alloc never fails, skip error handling
> 
> removed the -ENOMEM return from write_dev_flush() so no need to
> check for the -ENOMEM during send.
> 
> This patch also peals write_dev_flush's wait part of the code,
> and creates a new function wait_dev_flush().

"This patch also" usually means that the patch should be split.

>  /*
> - * trigger flushes for one the devices.  If you pass wait == 0, the flushes 
> are
> - * sent down.  With wait == 1, it waits for the previous flush.
> - *
> - * any device where the flush fails with eopnotsupp are flagged as 
> not-barrier
> - * capable
> + * trigger flushes for one the devices.

In the patch that splits write_dev_flush, please use the following
comment wording:

 * Submit a flush request to the device if it supports it. Error handling is
 * done in the waiting counterpart.

>   */
> -static int write_dev_flush(struct btrfs_device *device, int wait)
> +static void write_dev_flush(struct btrfs_device *device)

and

/*
 * If the flush bio has been submitted by write_dev_flush, wait for it.
 */

> +static int wait_dev_flush(struct btrfs_device *device)
> +{

But otherwise the cleanup is good.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: btrfs_ioctl_search_key documentation

2017-06-12 Thread David Sterba
On Tue, Jun 06, 2017 at 12:20:32AM +0200, Hans van Kranenburg wrote:
> A programmer who is trying to implement calling the btrfs SEARCH
> or SEARCH_V2 ioctl will probably soon end up reading this struct
> definition.
> 
> Properly document the input fields to prevent common misconceptions:
>  1. The search space is linear, not 3 dimensional. The invidual min/max
>  values for objectid, type and offset cannot be used to filter the
>  result, they only define the endpoints of an interval.
>  2. The transaction id (a.k.a. generation) filter applies only on
>  transaction id of the last COW operation on a whole metadata page, not
>  on individual items.
> 
> Ad 1. The first misunderstanding was helped by the previous misleading
> comments on min/max type and offset:
>   "keys returned will be >= min and <= max".
> 
> Ad 2. For example, running btrfs balance will happily cause rewriting of
> metadata pages that contain a filesystem tree of a read only subvolume,
> causing transids to be increased.
> 
> Also, improve descriptions of tree_id and nr_items and add in/out
> annotations.
> 
> Signed-off-by: Hans van Kranenburg 

Looks good to me, thanks. I've realigned the comments so they don't
overflow 80 columns and aligned the /* in */ hints.

> Most interesting changes since v1:
>  - mention the special tree_id input value 0
>  - rewrite the part about min_key and max_key, trying to be more concise

I find the description instructive enough so the expanded expression to
describe the whole range is not IMHO needed.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: btrfs_ioctl_search_key documentation

2017-06-12 Thread Hans van Kranenburg
On 06/12/2017 05:38 PM, David Sterba wrote:
> On Tue, Jun 06, 2017 at 12:20:32AM +0200, Hans van Kranenburg wrote:
>> A programmer who is trying to implement calling the btrfs SEARCH
>> or SEARCH_V2 ioctl will probably soon end up reading this struct
>> definition.
>>
>> Properly document the input fields to prevent common misconceptions:
>>  1. The search space is linear, not 3 dimensional. The invidual min/max
>>  values for objectid, type and offset cannot be used to filter the
>>  result, they only define the endpoints of an interval.
>>  2. The transaction id (a.k.a. generation) filter applies only on
>>  transaction id of the last COW operation on a whole metadata page, not
>>  on individual items.
>>
>> Ad 1. The first misunderstanding was helped by the previous misleading
>> comments on min/max type and offset:
>>   "keys returned will be >= min and <= max".
>>
>> Ad 2. For example, running btrfs balance will happily cause rewriting of
>> metadata pages that contain a filesystem tree of a read only subvolume,
>> causing transids to be increased.
>>
>> Also, improve descriptions of tree_id and nr_items and add in/out
>> annotations.
>>
>> Signed-off-by: Hans van Kranenburg 
> 
> Looks good to me, thanks. I've realigned the comments so they don't
> overflow 80 columns and aligned the /* in */ hints.

Ah, I see, thanks. My vim takes 4 chars for a tab, that's the problem.
I'll get the vim C settings in order for the next time. :-)

>> Most interesting changes since v1:
>>  - mention the special tree_id input value 0
>>  - rewrite the part about min_key and max_key, trying to be more concise
> 
> I find the description instructive enough so the expanded expression to
> describe the whole range is not IMHO needed.

You mean drop the extra line "All metadata..." ? Yeah, it's a bit
redudant, stressing the fact, yes.

The main purpose is to stop users from thinking that setting min_type
and max_type will filter the returned objects (like, only getting
BLOCK_GROUP_ITEM_KEY or so). So as long as you think that's clear
enough, I'm ok with anything.

-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Convert btrfs_fs_info's free_chunk_space to an atomic64_t

2017-06-12 Thread David Sterba
On Fri, Jun 09, 2017 at 09:44:31AM +0300, Nikolay Borisov wrote:
> 
> 
> On  7.06.2017 21:09, Sargun Dhillon wrote:
> > This patch is a small performance optimization to get rid of a spin
> > lock, where instead an atomic64_t can be used.
> > 
> > Signed-off-by: Sargun Dhillon 
> 
> I've already sent similar patch 1 month ago:
> 
> https://patchwork.kernel.org/patch/9720999/

Another hint is to look to the linux-next or to my kernel.org for-next
branch.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix Null pointer dereference in dio read endio

2017-06-12 Thread David Sterba
On Tue, Jun 06, 2017 at 01:52:52PM -0600, Liu Bo wrote:
> With switching to use btrfs_bio_clone_partial() to split bio in
> directIO path, read endio is also adapted to that by recording a
> iterator in btrfs_bio, however, it breaks those bios which are less
> than stripe length thus no need to be split and results in NULL
> pointer dereference.
> 
> This fixes the issue by recording the required bio iterator in
> btrfs_bio_clone() which is used to clone non-split bio in directIO
> path.  It doesn't affect other calls of btrfs_bio_clone() because they
> don't need to use this iterator.
> 
> This bug was caught by fstests/generic/091.
> 
> Cc: David Sterba 
> Signed-off-by: Liu Bo 
> ---
> Based on David's for-next.
> Fixes: commit "Btrfs: change how we iterate bios in endio"
> 
> Have run through fstests without introducing new problems.
> 
>  fs/btrfs/extent_io.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 806e8d6..a91c3a1 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2719,6 +2719,7 @@ struct bio *btrfs_bio_clone(struct bio *bio, gfp_t 
> gfp_mask)
>   btrfs_bio->csum = NULL;
>   btrfs_bio->csum_allocated = NULL;
>   btrfs_bio->end_io = NULL;
> + btrfs_bio->iter = bio->bi_iter;

Actually this points to a problem with initialization of the
btrfs_io_bio in general. They have embedded struct bio and the bioset
allocation will increase the size to cover btrfs_io_bio (although only
bio is handled). But, there's no kzalloc or memset anywhere, so it's
completely up to the caller to sanitize the fresh structure. Which we
only do partially (csum, csum_allocated, end_io). In that case 'iter'
could contain garbage and who knows what could happen.

So my suggestion: fold this patch to the one you refrence (ie. that
the behaviour is same before and after), and we'll fix the
initialization of btrfs_io_bio after the various bio clone/alloc calls.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix Null pointer dereference in dio read endio

2017-06-12 Thread David Sterba
On Tue, Jun 06, 2017 at 01:52:52PM -0600, Liu Bo wrote:
> With switching to use btrfs_bio_clone_partial() to split bio in
> directIO path, read endio is also adapted to that by recording a
> iterator in btrfs_bio, however, it breaks those bios which are less
> than stripe length thus no need to be split and results in NULL
> pointer dereference.
> 
> This fixes the issue by recording the required bio iterator in
> btrfs_bio_clone() which is used to clone non-split bio in directIO
> path.  It doesn't affect other calls of btrfs_bio_clone() because they
> don't need to use this iterator.
> 
> This bug was caught by fstests/generic/091.
> 
> Cc: David Sterba 
> Signed-off-by: Liu Bo 
> ---
> Based on David's for-next.
> Fixes: commit "Btrfs: change how we iterate bios in endio"

I'd rather fold this change to the original patch than to have a
separate fixup. The changelog can be updated with description of the
special case.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs-progs: send operates on ro snapshots only

2017-06-12 Thread David Sterba
On Tue, Jun 06, 2017 at 03:00:58PM +0200, Hans van Kranenburg wrote:
> While talking to another btrfs user on IRC today, it became clear that a
> major point of confusion in the btrfs send manual is that it's not
> telling the user soon enough that send/receive solely operates on
> subvolume snapshots instead of the original (read/write) subvolumes.
> 
> So, change the first few lines to explicitly mention snapshots instead.
> Technically, snapshots are also just subvolumes, but requiring this
> level of technical detailed knowledge doesn't help the user who is just
> trying out things.

Agreed.

> Signed-off-by: Hans van Kranenburg 
> ---

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/7] Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64

2017-06-12 Thread David Sterba
On Tue, Jun 06, 2017 at 04:45:26PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> There are a few places where we pass in a negative num_bytes, so make it
> signed for clarity. Also move it up in the file since later patches will
> need it there.
> 
> Signed-off-by: Omar Sandoval 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/7] Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT()

2017-06-12 Thread David Sterba
On Tue, Jun 06, 2017 at 04:45:27PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Signed-off-by: Omar Sandoval 
> ---
>  fs/btrfs/extent-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 7c01b4e9e3b6..6032e9a635f2 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -782,7 +782,7 @@ static void add_pinned_bytes(struct btrfs_fs_info 
> *fs_info, s64 num_bytes,
>   }
>  
>   space_info = __find_space_info(fs_info, flags);
> - BUG_ON(!space_info); /* Logic bug */
> + ASSERT(space_info);

Why?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/8] btrfs: btrfs_bio_clone never fails, skip error handling

2017-06-12 Thread David Sterba
On Wed, Jun 07, 2017 at 11:19:37AM -0700, Omar Sandoval wrote:
> On Fri, Jun 02, 2017 at 06:58:36PM +0200, David Sterba wrote:
> > Update direct callers of btrfs_bio_clone that do error handling, that we
> > can now remove.
> > 
> > Signed-off-by: David Sterba 
> > ---
> >  fs/btrfs/inode.c   | 4 
> >  fs/btrfs/volumes.c | 1 -
> >  2 files changed, 5 deletions(-)
> > 
> >  
> > if (dev_nr < total_devs - 1) {
> > bio = btrfs_bio_clone(first_bio, GFP_NOFS);
> > -   BUG_ON(!bio); /* -ENOMEM */
> > } else
> > bio = first_bio;
> 
> Could you please get rid of the extra curly braces now, too?

Sure, patch updated.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/8] btrfs: remove redundant parameters from btrfs_bio_alloc

2017-06-12 Thread David Sterba
On Wed, Jun 07, 2017 at 03:40:16PM +0800, Anand Jain wrote:
> 
> 
> On 06/03/17 00:58, David Sterba wrote:
> > All callers pass gfp_flags=GFP_NOFS and nr_vecs=BIO_MAX_PAGES.
> 
>   The line (in the other thread) mentioning the reason to remove
>__GFP_HIGH can go into the commit log here.

Makes sense, patch updated.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: use ino_t for inode in tracer

2017-06-12 Thread David Sterba
On Mon, Jun 12, 2017 at 06:10:28PM +0800, Anand Jain wrote:
> Signed-off-by: Anand Jain 

There's one more in btrfs__qgroup_data_map, please fix it as well.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: add compression trace points

2017-06-12 Thread kbuild test robot
Hi Anand,

[auto build test WARNING on kdave/for-next]
[also build test WARNING on next-20170609]
[cannot apply to btrfs/next tip/perf/core v4.12-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Anand-Jain/btrfs-add-compression-trace-points/20170612-184615
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: s390-allmodconfig (attached as .config)
compiler: s390x-linux-gnu-gcc (Debian 6.3.0-18) 6.3.0 20170516
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=s390 

All warnings (new ones prefixed by >>):

   In file included from include/trace/define_trace.h:95:0,
from include/trace/events/btrfs.h:1672,
from fs/btrfs/super.c:65:
   include/trace/events/btrfs.h: In function 
'trace_raw_output_btrfs_data_transformer':
>> include/trace/events/btrfs.h:91:12: warning: format '%lu' expects argument 
>> of type 'long unsigned int', but argument 6 has type 'ino_t {aka unsigned 
>> int}' [-Wformat=]
 TP_printk("%pU: " fmt, __entry->fsid, args)
   ^
   include/trace/trace_events.h:343:22: note: in definition of macro 
'DECLARE_EVENT_CLASS'
 trace_seq_printf(s, print); \
 ^
   include/trace/trace_events.h:65:9: note: in expansion of macro 'PARAMS'
PARAMS(print)); \
^~
>> include/trace/events/btrfs.h:1630:1: note: in expansion of macro 
>> 'TRACE_EVENT'
TRACE_EVENT(btrfs_data_transformer,
^~~
>> include/trace/events/btrfs.h:91:2: note: in expansion of macro 'TP_printk'
 TP_printk("%pU: " fmt, __entry->fsid, args)
 ^
>> include/trace/events/btrfs.h:1661:2: note: in expansion of macro 
>> 'TP_printk_btrfs'
 TP_printk_btrfs("%s %s ino=%lu type=%s len_before=%lu len_after=%lu "
 ^~~

vim +91 include/trace/events/btrfs.h

3f7de037 Josef Bacik  2011-11-10  75  
8c2a3ca2 Josef Bacik  2012-01-10  76  #define BTRFS_UUID_SIZE 16
bc074524 Jeff Mahoney 2016-06-09  77  #define TP_STRUCT__entry_fsid __array(u8, 
fsid, BTRFS_UUID_SIZE)
bc074524 Jeff Mahoney 2016-06-09  78  
bc074524 Jeff Mahoney 2016-06-09  79  #define TP_fast_assign_fsid(fs_info)  
\
bc074524 Jeff Mahoney 2016-06-09  80memcpy(__entry->fsid, fs_info->fsid, 
BTRFS_UUID_SIZE)
bc074524 Jeff Mahoney 2016-06-09  81  
bc074524 Jeff Mahoney 2016-06-09  82  #define TP_STRUCT__entry_btrfs(args...)   
\
bc074524 Jeff Mahoney 2016-06-09  83TP_STRUCT__entry(   
\
bc074524 Jeff Mahoney 2016-06-09  84TP_STRUCT__entry_fsid   
\
bc074524 Jeff Mahoney 2016-06-09  85args)
bc074524 Jeff Mahoney 2016-06-09  86  #define TP_fast_assign_btrfs(fs_info, 
args...)\
bc074524 Jeff Mahoney 2016-06-09  87TP_fast_assign( 
\
bc074524 Jeff Mahoney 2016-06-09  88TP_fast_assign_fsid(fs_info);   
\
bc074524 Jeff Mahoney 2016-06-09  89args)
bc074524 Jeff Mahoney 2016-06-09  90  #define TP_printk_btrfs(fmt, args...) \
bc074524 Jeff Mahoney 2016-06-09 @91TP_printk("%pU: " fmt, __entry->fsid, 
args)
8c2a3ca2 Josef Bacik  2012-01-10  92  
1abe9b8a liubo2011-03-24  93  TRACE_EVENT(btrfs_transaction_commit,
1abe9b8a liubo2011-03-24  94  
1abe9b8a liubo2011-03-24  95TP_PROTO(struct btrfs_root *root),
1abe9b8a liubo2011-03-24  96  
1abe9b8a liubo2011-03-24  97TP_ARGS(root),
1abe9b8a liubo2011-03-24  98  
bc074524 Jeff Mahoney 2016-06-09  99TP_STRUCT__entry_btrfs(

:: The code at line 91 was first introduced by commit
:: bc074524e123ded281cde25ebc5661910f9679e3 btrfs: prefix fsid to all trace 
events

:: TO: Jeff Mahoney <je...@suse.com>
:: CC: David Sterba <dste...@suse.com>

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: btrfs native encryption

2017-06-12 Thread David Sterba
On Mon, Jun 12, 2017 at 02:40:38PM +0200, David Sterba wrote:
> On Fri, Jun 09, 2017 at 08:50:12AM -0700, Filip Bystricky wrote:
> > Dear btrfs maintainers,
> > Google is evaluating btrfs for its potential use in android, but
> > currently the lack of native file-based encryption unfortunately makes
> > it a nonstarter.
> 
> The file-based encryption is covered by the fscrypt API, that's
> implemented in ext4/f2fs, so implementing that in btrfs could allow you
> to start evaluating btrfs.

For reference I'll add it here,
https://github.com/asj/linux-btrfs-fscryptv1 seems to implement it, I've
only scrolled through it so I don't know how usable is it in this state.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 14/20] dax: set errors in mapping when writeback fails

2017-06-12 Thread Christoph Hellwig
On Mon, Jun 12, 2017 at 08:23:10AM -0400, Jeff Layton wrote:
> For now, only do this when the FS_WB_ERRSEQ flag is set. The
> AS_EIO/AS_ENOSPC flags are not currently cleared in the older code when
> writeback initiation fails, only when we discover an error after waiting
> on writeback to complete, so we only want to do this with errseq_t based
> error handling to prevent seeing duplicate errors on fsync.

Please make sure this doens't stay conditional by the end of the series.
We only have three file systems using dax, and a series should be able
to make them agree on a single interface.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 12/20] fs: add a new fstype flag to indicate how writeback errors are tracked

2017-06-12 Thread Christoph Hellwig
On Mon, Jun 12, 2017 at 08:23:06AM -0400, Jeff Layton wrote:
> Add a new FS_WB_ERRSEQ flag to the fstype. Later patches will set and
> key off of that to decide what behavior should be used.

Please invert this so that only file systems that keep the old semantics
need a flag.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[xfstests PATCH v4 3/5] generic: test writeback error handling on dmerror devices

2017-06-12 Thread Jeff Layton
Ensure that we get an error back on all fds when a block device is
open by multiple writers and writeback fails.

Signed-off-by: Jeff Layton 
---
 tests/generic/998 | 64 +++
 tests/generic/998.out |  2 ++
 tests/generic/group   |  1 +
 3 files changed, 67 insertions(+)
 create mode 100755 tests/generic/998
 create mode 100644 tests/generic/998.out

diff --git a/tests/generic/998 b/tests/generic/998
new file mode 100755
index ..4e8379988252
--- /dev/null
+++ b/tests/generic/998
@@ -0,0 +1,64 @@
+#! /bin/bash
+# FS QA Test No. 998
+#
+# Test writeback error handling when writing to block devices via pagecache.
+# See src/fsync-err.c for details of what test actually does.
+#
+#---
+# Copyright (c) 2017, Jeff Layton 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+cd /
+rm -rf $tmp.* $testdir
+_dmerror_cleanup
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmerror
+
+# real QA test starts here
+_supported_os Linux
+_require_scratch
+_require_logdev
+_require_dm_target error
+_require_test_program fsync-err
+_require_test_program dmerror
+
+rm -f $seqres.full
+
+_dmerror_init
+
+$here/src/fsync-err -d $here/src/dmerror $DMERROR_DEV
+
+# success, all done
+_dmerror_load_working_table
+_dmerror_cleanup
+_scratch_mkfs > $seqres.full 2>&1
+status=0
+exit
diff --git a/tests/generic/998.out b/tests/generic/998.out
new file mode 100644
index ..658c438820e2
--- /dev/null
+++ b/tests/generic/998.out
@@ -0,0 +1,2 @@
+QA output created by 998
+Test passed!
diff --git a/tests/generic/group b/tests/generic/group
index b56bae8f04f0..9c62ab13ad36 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -442,4 +442,5 @@
 437 auto quick
 438 auto
 439 auto quick punch
+998 blockdev
 999 auto quick
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 15/20] fs: have call_fsync call filemap_report_wb_err if FS_WB_ERRSEQ is set

2017-06-12 Thread Christoph Hellwig
On Mon, Jun 12, 2017 at 08:23:11AM -0400, Jeff Layton wrote:
> Allow filesystems to opt-in to a final check of wb_err if FS_WB_ERRSEQ
> is set. Technically, we could just plumb these calls into all of the
> fsync operations, but I think this means less code, changes and churn.

Please add it to every fs, that is a consistent with how we handle
everything else related to writeback.

Oh, and please kill this idiotic call_fsync helper while you're at it.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[xfstests PATCH v4 5/5] btrfs: make a btrfs version of writeback error reporting test

2017-06-12 Thread Jeff Layton
Make a new btrfs/999 test that works the way Chris Mason suggested:

Build a filesystem with 2 devices that stripes the data across
both devices, but mirrors metadata across both. Then, make one
of the devices fail and see how fsync is handled.

Signed-off-by: Jeff Layton 
---
 tests/btrfs/999   | 93 +++
 tests/btrfs/group |  1 +
 2 files changed, 94 insertions(+)
 create mode 100755 tests/btrfs/999

diff --git a/tests/btrfs/999 b/tests/btrfs/999
new file mode 100755
index ..84031cc0d913
--- /dev/null
+++ b/tests/btrfs/999
@@ -0,0 +1,93 @@
+#! /bin/bash
+# FS QA Test No. 999
+#
+# Open a file several times, write to it, fsync on all fds and make sure that
+# they all return 0. Change the device to start throwing errors. Write again
+# on all fds and fsync on all fds. Ensure that we get errors on all of them.
+# Then fsync on all one last time and verify that all return 0.
+#
+#---
+# Copyright (c) 2017, Jeff Layton 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+cd /
+rm -rf $tmp.* $testdir
+_dmerror_cleanup
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmerror
+
+# real QA test starts here
+_supported_os Linux
+_require_dm_target error
+_require_test_program fsync-err
+_require_test_program dmerror
+
+# bring up dmerror device
+_scratch_unmount
+_dmerror_init
+
+# Replace first device with error-test device
+old_SCRATCH_DEV=$SCRATCH_DEV
+SCRATCH_DEV_POOL=`echo $SCRATCH_DEV_POOL | perl -pe 
"s#$SCRATCH_DEV#$DMERROR_DEV#"`
+SCRATCH_DEV=$DMERROR_DEV
+
+_require_scratch
+_require_scratch_dev_pool
+
+rm -f $seqres.full
+
+echo "Format and mount"
+
+_scratch_pool_mkfs "-d raid0 -m raid1" > $seqres.full 2>&1
+_scratch_mount
+
+# How much do we need to write? We need to hit all of the stripes. btrfs uses
+# a fixed 64k stripesize, so write enough to hit each one
+number_of_devices=`echo $SCRATCH_DEV_POOL | wc -w`
+write_kb=$(($number_of_devices * 64))
+_require_fs_space $SCRATCH_MNT $write_kb
+
+testfile=$SCRATCH_MNT/fsync-err-test
+
+SCRATCH_DEV=$old_SCRATCH_DEV
+$here/src/fsync-err -b $(($write_kb * 1024)) -d $here/src/dmerror $testfile
+
+# success, all done
+_dmerror_load_working_table
+
+# fs may be corrupt after this -- attempt to repair it
+_repair_scratch_fs >> $seqres.full
+
+# remove dmerror device
+_dmerror_cleanup
+
+status=0
+exit
diff --git a/tests/btrfs/group b/tests/btrfs/group
index 6f19619e877c..8dbdfbfe29fd 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -145,3 +145,4 @@
 141 auto quick
 142 auto quick
 143 auto quick
+999 auto quick
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[xfstests PATCH v4 1/5] generic: add a writeback error handling test

2017-06-12 Thread Jeff Layton
I'm working on a set of kernel patches to change how writeback errors
are handled and reported in the kernel. Instead of reporting a
writeback error to only the first fsync caller on the file, I aim
to make the kernel report them once on every file description.

This patch adds a test for the new behavior. Basically, open many fds
to the same file, turn on dm_error, write to each of the fds, and then
fsync them all to ensure that they all get an error back.

To do that, I'm adding a new tools/dmerror script that the C program
can use to load the error table. For now, that's all it can do, but
we can fill it out with other commands as necessary.

Signed-off-by: Jeff Layton 
---
 .gitignore |   1 +
 common/dmerror |  13 ++-
 doc/auxiliary-programs.txt |  16 
 src/Makefile   |   2 +-
 src/dmerror|  44 +
 src/fsync-err.c| 223 +
 tests/generic/999  |  77 
 tests/generic/999.out  |   3 +
 tests/generic/group|   1 +
 9 files changed, 374 insertions(+), 6 deletions(-)
 create mode 100755 src/dmerror
 create mode 100644 src/fsync-err.c
 create mode 100755 tests/generic/999
 create mode 100644 tests/generic/999.out

diff --git a/.gitignore b/.gitignore
index 39664b0a7f53..56e863b2c8dc 100644
--- a/.gitignore
+++ b/.gitignore
@@ -72,6 +72,7 @@
 /src/fs_perms
 /src/fssum
 /src/fstest
+/src/fsync-err
 /src/fsync-tester
 /src/ftrunc
 /src/genhashnames
diff --git a/common/dmerror b/common/dmerror
index d46c5d0b7266..238baa213b1f 100644
--- a/common/dmerror
+++ b/common/dmerror
@@ -23,22 +23,25 @@ if [ $? -eq 0 ]; then
_notrun "Cannot run tests with DAX on dmerror devices"
 fi
 
-_dmerror_init()
+_dmerror_setup()
 {
local dm_backing_dev=$SCRATCH_DEV
 
-   $DMSETUP_PROG remove error-test > /dev/null 2>&1
-
local blk_dev_size=`blockdev --getsz $dm_backing_dev`
 
DMERROR_DEV='/dev/mapper/error-test'
 
DMLINEAR_TABLE="0 $blk_dev_size linear $dm_backing_dev 0"
 
+   DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0"
+}
+
+_dmerror_init()
+{
+   _dmerror_setup
+   $DMSETUP_PROG remove error-test > /dev/null 2>&1
$DMSETUP_PROG create error-test --table "$DMLINEAR_TABLE" || \
_fatal "failed to create dm linear device"
-
-   DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0"
 }
 
 _dmerror_mount()
diff --git a/doc/auxiliary-programs.txt b/doc/auxiliary-programs.txt
index 21ef118596b6..bcab453c4335 100644
--- a/doc/auxiliary-programs.txt
+++ b/doc/auxiliary-programs.txt
@@ -16,6 +16,8 @@ note the dependency with:
 Contents:
 
  - af_unix -- Create an AF_UNIX socket
+ - dmerror -- fault injection block device control
+ - fsync-err   -- tests fsync error reporting after failed writeback
  - open_by_handle  -- open_by_handle_at syscall exercise
  - stat_test   -- statx syscall exercise
  - t_dir_type  -- print directory entries and their file type
@@ -30,6 +32,20 @@ af_unix
 
The af_unix program creates an AF_UNIX socket at the given location.
 
+dmerror
+
+   dmerror is a program for creating, destroying and controlling a
+   fault injection device. The device can be set up as initially
+   working and then flip to throwing errors for testing purposes.
+
+fsync-err
+
+   Specialized program for testing how the kernel reports errors that
+   occur during writeback. Works in conjunction with the dmerror script
+   in tools/ to write data to a device, and then force it to fail
+   writeback and test that errors are reported during fsync and cleared
+   afterward.
+
 open_by_handle
 
The open_by_handle program exercises the open_by_handle_at() system
diff --git a/src/Makefile b/src/Makefile
index 6b0e4b022485..abd0fff34a64 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -13,7 +13,7 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
multi_open_unlink dmiperf unwritten_sync genhashnames t_holes \
t_mmap_writev t_truncate_cmtime dirhash_collide t_rename_overwrite \
holetest t_truncate_self t_mmap_dio af_unix t_mmap_stale_pmd \
-   t_mmap_cow_race t_mmap_fallocate
+   t_mmap_cow_race t_mmap_fallocate fsync-err
 
 LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
diff --git a/src/dmerror b/src/dmerror
new file mode 100755
index ..4aaf682ee5f9
--- /dev/null
+++ b/src/dmerror
@@ -0,0 +1,44 @@
+#!/bin/bash
+#---
+# Copyright (c) 2017, Jeff Layton 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This 

[xfstests PATCH v4 2/5] ext4: allow ext4 to use $SCRATCH_LOGDEV

2017-06-12 Thread Jeff Layton
The writeback error handling test requires that you put the journal on a
separate device. This allows us to use dmerror to simulate data
writeback failure, without affecting the journal.

xfs already has infrastructure for this (a'la $SCRATCH_LOGDEV), so wire
up the ext4 code so that it can do the same thing when _scratch_mkfs is
called.

Signed-off-by: Jeff Layton 
Reviewed-by: Darrick J. Wong 
---
 common/rc | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/common/rc b/common/rc
index 87e6ff08b18d..08807ac7c22a 100644
--- a/common/rc
+++ b/common/rc
@@ -676,6 +676,9 @@ _scratch_mkfs_ext4()
local tmp=`mktemp`
local mkfs_status
 
+   [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
+   $mkfs_cmd -O journal_dev $SCRATCH_LOGDEV && \
+   mkfs_cmd="$mkfs_cmd $MKFS_OPTIONS -J device=$SCRATCH_LOGDEV"
 
_scratch_do_mkfs "$mkfs_cmd" "$mkfs_filter" $* 2>$tmp.mkfserr 
1>$tmp.mkfsstd
mkfs_status=$?
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[xfstests PATCH v4 4/5] ext3: allow it to put journal on a separate device when doing scratch_mkfs

2017-06-12 Thread Jeff Layton
Signed-off-by: Jeff Layton 
---
 common/rc | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/common/rc b/common/rc
index 08807ac7c22a..46b890cbff6a 100644
--- a/common/rc
+++ b/common/rc
@@ -832,7 +832,16 @@ _scratch_mkfs()
mkfs_cmd="$MKFS_BTRFS_PROG"
mkfs_filter="cat"
;;
-   ext2|ext3)
+   ext3)
+   mkfs_cmd="$MKFS_PROG -t $FSTYP -- -F"
+   mkfs_filter="grep -v -e ^Warning: -e \"^mke2fs \""
+
+   # put journal on separate device?
+   [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
+   $mkfs_cmd -O journal_dev $SCRATCH_LOGDEV && \
+   mkfs_cmd="$mkfs_cmd $MKFS_OPTIONS -J device=$SCRATCH_LOGDEV"
+   ;;
+   ext2)
mkfs_cmd="$MKFS_PROG -t $FSTYP -- -F"
mkfs_filter="grep -v -e ^Warning: -e \"^mke2fs \""
;;
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[xfstests PATCH v4 0/5] new tests for writeback error reporting behavior

2017-06-12 Thread Jeff Layton
v4: respin set based on Eryu's comments

These tests are companion tests to the patchset I recently posted with
the cover letter:

[PATCH v6 00/20] fs: enhanced writeback error reporting with errseq_t (pile 
#1)

This set just adds 3 new xfstests to test writeback behavior. One generic
filesystem test, one test for raw block devices, and one test for btrfs.
The tests work with dmerror to ensure that writeback fails, and then
tests how the kernel reports errors afterward.

xfs, ext2/3/4 and btrfs all pass on a kernel with the patchset above.

The one comment I couldn't really address from earlier review is that
we don't have a great way for xfstests to tell what sort of error
reporting behavior it should expect from the running kernel. That
makes it difficult to tell whether failure is expected during a given
run.

Maybe that's OK though and we should just let unconverted filesystems
fail this test?

Jeff Layton (5):
  generic: add a writeback error handling test
  ext4: allow ext4 to use $SCRATCH_LOGDEV
  generic: test writeback error handling on dmerror devices
  ext3: allow it to put journal on a separate device when doing
scratch_mkfs
  btrfs: make a btrfs version of writeback error reporting test

 .gitignore |   1 +
 common/dmerror |  13 ++-
 common/rc  |  14 ++-
 doc/auxiliary-programs.txt |  16 
 src/Makefile   |   2 +-
 src/dmerror|  44 +
 src/fsync-err.c| 223 +
 tests/btrfs/999|  93 +++
 tests/btrfs/group  |   1 +
 tests/generic/998  |  64 +
 tests/generic/998.out  |   2 +
 tests/generic/999  |  77 
 tests/generic/999.out  |   3 +
 tests/generic/group|   2 +
 14 files changed, 548 insertions(+), 7 deletions(-)
 create mode 100755 src/dmerror
 create mode 100644 src/fsync-err.c
 create mode 100755 tests/btrfs/999
 create mode 100755 tests/generic/998
 create mode 100644 tests/generic/998.out
 create mode 100755 tests/generic/999
 create mode 100644 tests/generic/999.out

-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs native encryption

2017-06-12 Thread David Sterba
On Fri, Jun 09, 2017 at 08:50:12AM -0700, Filip Bystricky wrote:
> Dear btrfs maintainers,
> Google is evaluating btrfs for its potential use in android, but
> currently the lack of native file-based encryption unfortunately makes
> it a nonstarter.

The file-based encryption is covered by the fscrypt API, that's
implemented in ext4/f2fs, so implementing that in btrfs could allow you
to start evaluating btrfs. As other pointed, the usecases with snapshots
and reflinks need to be reviewed.

> According to the FAQ (specifically the answer to
> "Does btrfs support encryption"), nobody is currently working on this.
> How up-to-date is that answer, and are there any new plans to add
> native FBE in the future?

Wiki is a snapshot of status, knowledge and best practices from mixed
times, so not everything is up to date or accurate. That no one is
working on encryption is partially true, as we've seen some proposed
patches that received objections and from my perspective are not close
to being considered for merging.

The fscrypt functionality has been designed and the API defined, so it's
a matter of implementation on the btrfs side. Anything else is in the
phase of design and prototyping.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6 00/20] fs: enhanced writeback error reporting with errseq_t (pile #1)

2017-06-12 Thread Jeff Layton
v6:
===
This is the sixth posting of the patchset to revamp the way writeback
errors are tracked and reported.

This is a smaller set than the last one. The main difference from the
last set is that this one just adds errseq_t based error reporting for
the purposes of fsync, while leaving the internal callers of filemap_*
functions and the like largely untouched.

Some of these patches have been posted separately, but I'm re-posting
them here to make it clear that they're prerequisites to the later
patches in the series.

Background:
===
The basic problem is that we have (for a very long time) tracked and
reported writeback errors based on two flags in the address_space:
AS_EIO and AS_ENOSPC. Those flags are cleared when they are checked,
so only the first caller to check them is able to consume them.

That model is quite unreliable though, for several related reasons:

* only the first fsync caller on the inode will see the error. In a
  world of containerized setups, that's no longer viable. Applications
  need to know that their writes are safely stored, and they can
  currently miss seeing errors that they should be aware of when
  they're not.

* there are a lot of internal callers to filemap_fdatawait* and
  filemap_write_and_wait* that clear the error flags but then never
  report them to userland in any fashion.

* Some internal callers report writeback errors, but can do so at
  non-sensical times. For instance, we might want to truncate a file,
  which triggers a pagecache flush. If that writeback fails, we might
  report that error to the truncate caller, but a subsequent fsync
  will likely not see it.

* Some internal callers try to reset the error flags after clearing
  them, but that's racy. Another task could check the flags between
  those two events.

Solution:
=
This patchset adds a new datatype called an errseq_t that represents a
sequence of errors. It's a u32, with a field for a POSIX-flavor error
and a counter, managed with atomics. We can sample that value at a
particular point in time, and can later tell whether there have been any
errors since that point.

That allows us to provide traditional check-and-clear fsync semantics
on every open file description in a lightweight fashion. fsync callers
no longer need to coordinate between one another in order to ensure
that errors at fsync time are handled correctly.

Strategy:
=
The aim with this set is to do the minimum possible to support for
reliable reporting of errors on fsync, without substantially changing
the internals of the filesystems themselves.

Most of the internal calls to filemap_fdatawait are left alone, so all
of the internal error error checking is done using the traditional flag
based checks. The only real difference here is more reliable reporting
of errors at fsync.

I think that we probably will want to eventually convert all of the
internal callers to use errseq_t based reporting too, but that can be
done in an incremental fashion in follow-on patchsets.

Testing:

I've primarily been testing this with a couple of new xfstests that I
will post separately. These tests use dm-error fault injection to flip
the underlying block device to start throwing I/O errors, and then test
the behavior of the filesystem layer on top of that.

Jeff Layton (20):
  mm: fix mapping_set_error call in me_pagecache_dirty
  buffer: use mapping_set_error instead of setting the flag
  fs: check for writeback errors after syncing out buffers in
generic_file_fsync
  buffer: set errors in mapping at the time that the error occurs
  mm: don't TestClearPageError in __filemap_fdatawait_range
  mm: drop "wait" parameter from write_one_page
  mm: clean up error handling in write_one_page
  lib: add errseq_t type and infrastructure for handling it
  fs: new infrastructure for writeback error handling and reporting
  mm: tracepoints for writeback error events
  mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
  fs: add a new fstype flag to indicate how writeback errors are tracked
  Documentation: flesh out the section in vfs.txt on storing and
reporting writeback errors
  dax: set errors in mapping when writeback fails
  fs: have call_fsync call filemap_report_wb_err if FS_WB_ERRSEQ is set
  block: convert to errseq_t based writeback error tracking
  fs: add f_md_wb_err field to struct file for tracking metadata errors
  ext4: use errseq_t based error handling for reporting data writeback
errors
  xfs: minimal conversion to errseq_t writeback error reporting
  btrfs: minimal conversion to errseq_t writeback error reporting on
fsync

 Documentation/filesystems/vfs.txt |  48 -
 drivers/dax/device.c  |   1 +
 fs/block_dev.c|   2 +
 fs/btrfs/super.c  |   2 +-
 fs/buffer.c   |  20 ++--
 fs/dax.c  |  18 +++-
 fs/exofs/dir.c|   2 +-
 fs/ext2/dir.c |   2 

[PATCH v6 06/20] mm: drop "wait" parameter from write_one_page

2017-06-12 Thread Jeff Layton
The callers all set it to 1.

Also, make it clear that this function will not set any sort of AS_*
error, and that the caller must do so if necessary. No existing caller
uses this on normal files, so none of them need it.

Also, add __must_check here since, in general, the callers need to
handle an error here in some fashion.

Signed-off-by: Jeff Layton 
Reviewed-by: Ross Zwisler 
Reviewed-by: Jan Kara 
Reviewed-by: Matthew Wilcox 
Reviewed-by: Christoph Hellwig 
---
 fs/exofs/dir.c|  2 +-
 fs/ext2/dir.c |  2 +-
 fs/jfs/jfs_metapage.c |  4 ++--
 fs/minix/dir.c|  2 +-
 fs/sysv/dir.c |  2 +-
 fs/ufs/dir.c  |  2 +-
 include/linux/mm.h|  2 +-
 mm/page-writeback.c   | 14 +++---
 8 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
index 8eeb694332fe..98233a97b7b8 100644
--- a/fs/exofs/dir.c
+++ b/fs/exofs/dir.c
@@ -72,7 +72,7 @@ static int exofs_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
set_page_dirty(page);
 
if (IS_DIRSYNC(dir))
-   err = write_one_page(page, 1);
+   err = write_one_page(page);
else
unlock_page(page);
 
diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index d9650c9508e4..e2709695b177 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -100,7 +100,7 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
}
 
if (IS_DIRSYNC(dir)) {
-   err = write_one_page(page, 1);
+   err = write_one_page(page);
if (!err)
err = sync_inode_metadata(dir, 1);
} else {
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index 489aaa1403e5..744fa3c079e6 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -711,7 +711,7 @@ void force_metapage(struct metapage *mp)
get_page(page);
lock_page(page);
set_page_dirty(page);
-   write_one_page(page, 1);
+   write_one_page(page);
clear_bit(META_forcewrite, >flag);
put_page(page);
 }
@@ -756,7 +756,7 @@ void release_metapage(struct metapage * mp)
set_page_dirty(page);
if (test_bit(META_sync, >flag)) {
clear_bit(META_sync, >flag);
-   write_one_page(page, 1);
+   write_one_page(page);
lock_page(page); /* write_one_page unlocks the page */
}
} else if (mp->lsn) /* discard_metapage doesn't remove it */
diff --git a/fs/minix/dir.c b/fs/minix/dir.c
index 7edc9b395700..baa9721f1299 100644
--- a/fs/minix/dir.c
+++ b/fs/minix/dir.c
@@ -57,7 +57,7 @@ static int dir_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
mark_inode_dirty(dir);
}
if (IS_DIRSYNC(dir))
-   err = write_one_page(page, 1);
+   err = write_one_page(page);
else
unlock_page(page);
return err;
diff --git a/fs/sysv/dir.c b/fs/sysv/dir.c
index 5bdae85ceef7..f5191cb2c947 100644
--- a/fs/sysv/dir.c
+++ b/fs/sysv/dir.c
@@ -45,7 +45,7 @@ static int dir_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
mark_inode_dirty(dir);
}
if (IS_DIRSYNC(dir))
-   err = write_one_page(page, 1);
+   err = write_one_page(page);
else
unlock_page(page);
return err;
diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c
index de01b8f2aa78..48609f1d9580 100644
--- a/fs/ufs/dir.c
+++ b/fs/ufs/dir.c
@@ -53,7 +53,7 @@ static int ufs_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
mark_inode_dirty(dir);
}
if (IS_DIRSYNC(dir))
-   err = write_one_page(page, 1);
+   err = write_one_page(page);
else
unlock_page(page);
return err;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b892e95d4929..c0b1759304ec 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2199,7 +2199,7 @@ extern void filemap_map_pages(struct vm_fault *vmf,
 extern int filemap_page_mkwrite(struct vm_fault *vmf);
 
 /* mm/page-writeback.c */
-int write_one_page(struct page *page, int wait);
+int __must_check write_one_page(struct page *page);
 void task_dirty_inc(struct task_struct *tsk);
 
 /* readahead.c */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 143c1c25d680..b901fe52b153 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2366,15 +2366,16 @@ int do_writepages(struct address_space *mapping, struct 
writeback_control *wbc)
 }
 
 /**
- * write_one_page - write out a single page and optionally wait on I/O
+ * write_one_page - write out a single page and wait on I/O
  * @page: the page to write
- * @wait: if true, wait on writeout
  *
  * The page must be locked by the caller 

[PATCH v6 05/20] mm: don't TestClearPageError in __filemap_fdatawait_range

2017-06-12 Thread Jeff Layton
The -EIO returned here can end up overriding whatever error is marked in
the address space, and be returned at fsync time, even when there is a
more appropriate error stored in the mapping.

Read errors are also sometimes tracked on a per-page level using
PG_error. Suppose we have a read error on a page, and then that page is
subsequently dirtied by overwriting the whole page. Writeback doesn't
clear PG_error, so we can then end up successfully writing back that
page and still return -EIO on fsync.

Worse yet, PG_error is cleared during a sync() syscall, but the -EIO
return from that is silently discarded. Any subsystem that is relying on
PG_error to report errors during fsync can easily lose writeback errors
due to this. All you need is a stray sync() call to wait for writeback
to complete and you've lost the error.

Since the handling of the PG_error flag is somewhat inconsistent across
subsystems, let's just rely on marking the address space when there are
writeback errors. Change the TestClearPageError call to ClearPageError,
and make __filemap_fdatawait_range a void return function.

Signed-off-by: Jeff Layton 
---
 mm/filemap.c | 20 +---
 1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 6f1be573a5e6..1de71753de28 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -376,17 +376,16 @@ int filemap_flush(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_flush);
 
-static int __filemap_fdatawait_range(struct address_space *mapping,
+static void __filemap_fdatawait_range(struct address_space *mapping,
 loff_t start_byte, loff_t end_byte)
 {
pgoff_t index = start_byte >> PAGE_SHIFT;
pgoff_t end = end_byte >> PAGE_SHIFT;
struct pagevec pvec;
int nr_pages;
-   int ret = 0;
 
if (end_byte < start_byte)
-   goto out;
+   return;
 
pagevec_init(, 0);
while ((index <= end) &&
@@ -403,14 +402,11 @@ static int __filemap_fdatawait_range(struct address_space 
*mapping,
continue;
 
wait_on_page_writeback(page);
-   if (TestClearPageError(page))
-   ret = -EIO;
+   ClearPageError(page);
}
pagevec_release();
cond_resched();
}
-out:
-   return ret;
 }
 
 /**
@@ -430,14 +426,8 @@ static int __filemap_fdatawait_range(struct address_space 
*mapping,
 int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
loff_t end_byte)
 {
-   int ret, ret2;
-
-   ret = __filemap_fdatawait_range(mapping, start_byte, end_byte);
-   ret2 = filemap_check_errors(mapping);
-   if (!ret)
-   ret = ret2;
-
-   return ret;
+   __filemap_fdatawait_range(mapping, start_byte, end_byte);
+   return filemap_check_errors(mapping);
 }
 EXPORT_SYMBOL(filemap_fdatawait_range);
 
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6 02/20] buffer: use mapping_set_error instead of setting the flag

2017-06-12 Thread Jeff Layton
Signed-off-by: Jeff Layton 
Reviewed-by: Jan Kara 
Reviewed-by: Matthew Wilcox 
Reviewed-by: Christoph Hellwig 
---
 fs/buffer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 161be58c5cb0..4be8b914a222 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -482,7 +482,7 @@ static void __remove_assoc_queue(struct buffer_head *bh)
list_del_init(>b_assoc_buffers);
WARN_ON(!bh->b_assoc_map);
if (buffer_write_io_error(bh))
-   set_bit(AS_EIO, >b_assoc_map->flags);
+   mapping_set_error(bh->b_assoc_map, -EIO);
bh->b_assoc_map = NULL;
 }
 
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6 10/20] mm: tracepoints for writeback error events

2017-06-12 Thread Jeff Layton
To enable that, make __errseq_set return the value that it was set to
when we exit the loop. Take heed that that value is not suitable as a
later "since" value, as it will not have been marked seen.

Signed-off-by: Jeff Layton 
---
 include/linux/errseq.h |  2 +-
 include/linux/fs.h |  5 +++-
 include/trace/events/filemap.h | 55 ++
 lib/errseq.c   | 20 ++-
 mm/filemap.c   | 13 +-
 5 files changed, 86 insertions(+), 9 deletions(-)

diff --git a/include/linux/errseq.h b/include/linux/errseq.h
index 0d2555f310cd..9e0d444ac88d 100644
--- a/include/linux/errseq.h
+++ b/include/linux/errseq.h
@@ -5,7 +5,7 @@
 
 typedef u32errseq_t;
 
-void __errseq_set(errseq_t *eseq, int err);
+errseq_t __errseq_set(errseq_t *eseq, int err);
 static inline void errseq_set(errseq_t *eseq, int err)
 {
/* Optimize for the common case of no error */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e57321b7ee2a..6cd87887430b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2530,6 +2530,7 @@ extern int filemap_fdatawrite_range(struct address_space 
*mapping,
 extern int filemap_check_errors(struct address_space *mapping);
 
 extern int __must_check filemap_report_wb_err(struct file *file);
+extern void __filemap_set_wb_err(struct address_space *mapping, int err);
 
 /**
  * filemap_set_wb_err - set a writeback error on an address_space
@@ -2549,7 +2550,9 @@ extern int __must_check filemap_report_wb_err(struct file 
*file);
  */
 static inline void filemap_set_wb_err(struct address_space *mapping, int err)
 {
-   errseq_set(>wb_err, err);
+   /* Fastpath for common case of no error */
+   if (unlikely(err))
+   __filemap_set_wb_err(mapping, err);
 }
 
 /**
diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 42febb6bc1d5..2af66920f267 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
 
@@ -52,6 +53,60 @@ DEFINE_EVENT(mm_filemap_op_page_cache, 
mm_filemap_add_to_page_cache,
TP_ARGS(page)
);
 
+TRACE_EVENT(filemap_set_wb_err,
+   TP_PROTO(struct address_space *mapping, errseq_t eseq),
+
+   TP_ARGS(mapping, eseq),
+
+   TP_STRUCT__entry(
+   __field(unsigned long, i_ino)
+   __field(dev_t, s_dev)
+   __field(errseq_t, errseq)
+   ),
+
+   TP_fast_assign(
+   __entry->i_ino = mapping->host->i_ino;
+   __entry->errseq = eseq;
+   if (mapping->host->i_sb)
+   __entry->s_dev = mapping->host->i_sb->s_dev;
+   else
+   __entry->s_dev = mapping->host->i_rdev;
+   ),
+
+   TP_printk("dev=%d:%d ino=0x%lx errseq=0x%x",
+   MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
+   __entry->i_ino, __entry->errseq)
+);
+
+TRACE_EVENT(filemap_report_wb_err,
+   TP_PROTO(struct file *file, errseq_t old),
+
+   TP_ARGS(file, old),
+
+   TP_STRUCT__entry(
+   __field(struct file *, file);
+   __field(unsigned long, i_ino)
+   __field(dev_t, s_dev)
+   __field(errseq_t, old)
+   __field(errseq_t, new)
+   ),
+
+   TP_fast_assign(
+   __entry->file = file;
+   __entry->i_ino = file->f_mapping->host->i_ino;
+   if (file->f_mapping->host->i_sb)
+   __entry->s_dev = 
file->f_mapping->host->i_sb->s_dev;
+   else
+   __entry->s_dev = file->f_mapping->host->i_rdev;
+   __entry->old = old;
+   __entry->new = file->f_wb_err;
+   ),
+
+   TP_printk("file=%p dev=%d:%d ino=0x%lx old=0x%x new=0x%x",
+   __entry->file, MAJOR(__entry->s_dev),
+   MINOR(__entry->s_dev), __entry->i_ino, __entry->old,
+   __entry->new)
+);
 #endif /* _TRACE_FILEMAP_H */
 
 /* This part must be outside protection */
diff --git a/lib/errseq.c b/lib/errseq.c
index d129c0611c1f..009972d3000c 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -52,10 +52,14 @@
  *
  * Most callers will want to use the errseq_set inline wrapper to efficiently
  * handle the common case where err is 0.
+ *
+ * We do return an errseq_t here, primarily for debugging purposes. The return
+ * value should not be used as a previously sampled value in later calls as it
+ * will not have the SEEN flag set.
  */
-void 

[PATCH v6 04/20] buffer: set errors in mapping at the time that the error occurs

2017-06-12 Thread Jeff Layton
I noticed on xfs that I could still sometimes get back an error on fsync
on a fd that was opened after the error condition had been cleared.

The problem is that the buffer code sets the write_io_error flag and
then later checks that flag to set the error in the mapping. That flag
perisists for quite a while however. If the file is later opened with
O_TRUNC, the buffers will then be invalidated and the mapping's error
set such that a subsequent fsync will return error. I think this is
incorrect, as there was no writeback between the open and fsync.

Add a new mark_buffer_write_io_error operation that sets the flag and
the error in the mapping at the same time. Replace all calls to
set_buffer_write_io_error with mark_buffer_write_io_error, and remove
the places that check this flag in order to set the error in the
mapping.

This sets the error in the mapping earlier, at the time that it's first
detected.

Signed-off-by: Jeff Layton 
Reviewed-by: Jan Kara 
---
 fs/buffer.c | 20 +---
 fs/gfs2/lops.c  |  2 +-
 include/linux/buffer_head.h |  1 +
 3 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 4be8b914a222..b946149e8214 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -178,7 +178,7 @@ void end_buffer_write_sync(struct buffer_head *bh, int 
uptodate)
set_buffer_uptodate(bh);
} else {
buffer_io_error(bh, ", lost sync page write");
-   set_buffer_write_io_error(bh);
+   mark_buffer_write_io_error(bh);
clear_buffer_uptodate(bh);
}
unlock_buffer(bh);
@@ -352,8 +352,7 @@ void end_buffer_async_write(struct buffer_head *bh, int 
uptodate)
set_buffer_uptodate(bh);
} else {
buffer_io_error(bh, ", lost async page write");
-   mapping_set_error(page->mapping, -EIO);
-   set_buffer_write_io_error(bh);
+   mark_buffer_write_io_error(bh);
clear_buffer_uptodate(bh);
SetPageError(page);
}
@@ -481,8 +480,6 @@ static void __remove_assoc_queue(struct buffer_head *bh)
 {
list_del_init(>b_assoc_buffers);
WARN_ON(!bh->b_assoc_map);
-   if (buffer_write_io_error(bh))
-   mapping_set_error(bh->b_assoc_map, -EIO);
bh->b_assoc_map = NULL;
 }
 
@@ -1181,6 +1178,17 @@ void mark_buffer_dirty(struct buffer_head *bh)
 }
 EXPORT_SYMBOL(mark_buffer_dirty);
 
+void mark_buffer_write_io_error(struct buffer_head *bh)
+{
+   set_buffer_write_io_error(bh);
+   /* FIXME: do we need to set this in both places? */
+   if (bh->b_page && bh->b_page->mapping)
+   mapping_set_error(bh->b_page->mapping, -EIO);
+   if (bh->b_assoc_map)
+   mapping_set_error(bh->b_assoc_map, -EIO);
+}
+EXPORT_SYMBOL(mark_buffer_write_io_error);
+
 /*
  * Decrement a buffer_head's reference count.  If all buffers against a page
  * have zero reference count, are clean and unlocked, and if the page is clean
@@ -3279,8 +3287,6 @@ drop_buffers(struct page *page, struct buffer_head 
**buffers_to_free)
 
bh = head;
do {
-   if (buffer_write_io_error(bh) && page->mapping)
-   mapping_set_error(page->mapping, -EIO);
if (buffer_busy(bh))
goto failed;
bh = bh->b_this_page;
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index b1f9144b42c7..cd7857ab1a6a 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -182,7 +182,7 @@ static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp, 
struct bio_vec *bvec,
bh = bh->b_this_page;
do {
if (error)
-   set_buffer_write_io_error(bh);
+   mark_buffer_write_io_error(bh);
unlock_buffer(bh);
next = bh->b_this_page;
size -= bh->b_size;
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index bd029e52ef5e..e0abeba3ced7 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -149,6 +149,7 @@ void buffer_check_dirty_writeback(struct page *page,
  */
 
 void mark_buffer_dirty(struct buffer_head *bh);
+void mark_buffer_write_io_error(struct buffer_head *bh);
 void init_buffer(struct buffer_head *, bh_end_io_t *, void *);
 void touch_buffer(struct buffer_head *bh);
 void set_bh_page(struct buffer_head *bh,
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6 03/20] fs: check for writeback errors after syncing out buffers in generic_file_fsync

2017-06-12 Thread Jeff Layton
ext2 currently does a test+clear of the AS_EIO flag, which is
is problematic for some coming changes.

What we really need to do instead is call filemap_check_errors
in __generic_file_fsync after syncing out the buffers. That
will be sufficient for this case, and help other callers detect
these errors properly as well.

With that, we don't need to twiddle it in ext2.

Suggested-by: Jan Kara 
Signed-off-by: Jeff Layton 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Reviewed-by: Matthew Wilcox 
---
 fs/ext2/file.c | 2 +-
 fs/libfs.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index b21891a6bfca..ed00e7ae0ef3 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -177,7 +177,7 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, 
int datasync)
struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
 
ret = generic_file_fsync(file, start, end, datasync);
-   if (ret == -EIO || test_and_clear_bit(AS_EIO, >flags)) {
+   if (ret == -EIO) {
/* We don't really know where the IO error happened... */
ext2_error(sb, __func__,
   "detected IO error when writing metadata buffers");
diff --git a/fs/libfs.c b/fs/libfs.c
index a04395334bb1..1dec90819366 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -991,7 +991,8 @@ int __generic_file_fsync(struct file *file, loff_t start, 
loff_t end,
 
 out:
inode_unlock(inode);
-   return ret;
+   err = filemap_check_errors(inode->i_mapping);
+   return ret ? ret : err;
 }
 EXPORT_SYMBOL(__generic_file_fsync);
 
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6 10/13] ext4: add more robust reporting of metadata writeback errors

2017-06-12 Thread Jeff Layton
ext4 uses the blockdev mapping for tracking metadata stored in the
pagecache. Sample its wb_err when opening a file and store that in
the f_md_wb_err field.

Change ext4_sync_file to check for data errors first, and then check the
blockdev mapping for metadata errors afterward.

Note that because metadata writeback errors are only tracked on a
per-device level, this does mean that we'll end up reporting an error on
all open file descriptors when there is a metadata writeback failure.
That's not ideal, but we can possibly improve upon it in the future.

Signed-off-by: Jeff Layton 
---
 fs/ext4/dir.c   |  8 ++--
 fs/ext4/file.c  |  5 -
 fs/ext4/fsync.c | 13 +
 3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index e8b365000d73..6bbb19510f74 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -611,9 +611,13 @@ static int ext4_dx_readdir(struct file *file, struct 
dir_context *ctx)
 
 static int ext4_dir_open(struct inode * inode, struct file * filp)
 {
+   int ret = 0;
+
if (ext4_encrypted_inode(inode))
-   return fscrypt_get_encryption_info(inode) ? -EACCES : 0;
-   return 0;
+   ret = fscrypt_get_encryption_info(inode) ? -EACCES : 0;
+   if (!ret)
+   filp->f_md_wb_err = 
filemap_sample_wb_err(inode->i_sb->s_bdev->bd_inode->i_mapping);
+   return ret;
 }
 
 static int ext4_release_dir(struct inode *inode, struct file *filp)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 831fd6beebf0..fe0d6e01c4b7 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -435,7 +435,10 @@ static int ext4_file_open(struct inode * inode, struct 
file * filp)
if (ret < 0)
return ret;
}
-   return dquot_file_open(inode, filp);
+   ret = dquot_file_open(inode, filp);
+   if (!ret)
+   filp->f_md_wb_err = 
filemap_sample_wb_err(sb->s_bdev->bd_inode->i_mapping);
+   return ret;
 }
 
 /*
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 03d6259d8662..36363a6730d7 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -165,6 +165,19 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t 
end, int datasync)
err = filemap_report_wb_err(file);
if (!ret)
ret = err;
+
+   /*
+* Was there a metadata writeback error since last fsync?
+*
+* FIXME: ext4 tracks metadata with a whole-block device mapping. So,
+* if there is any sort of metadata writeback error, we'll report an
+* error on all open fds, even ones not associated with the inode
+*/
+   err = filemap_report_md_wb_err(file,
+   inode->i_sb->s_bdev->bd_inode->i_mapping);
+   if (!ret)
+   ret = err;
+
trace_ext4_sync_file_exit(inode, ret);
return ret;
 }
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6 12/20] fs: add a new fstype flag to indicate how writeback errors are tracked

2017-06-12 Thread Jeff Layton
Now that we have new infrastructure for handling writeback errors using
errseq_t, we need to convert the existing code to use it. We could
attempt to retrofit the old interfaces on top of the new, but there is
a conceptual disconnect here in the case of internal callers that
invoke filemap_fdatawait and the like.

When reporting writeback errors, we will always report errors that have
occurred since a particular point in time. With the old writeback error
reporting, the time we used was "since it was last tested/cleared" which
is entirely arbitrary and potentially racy. Now, we can report the
latest error that has occurred since an arbitrary point in time
(represented as a sampled errseq_t value).

This means that we need to touch each filesystem that calls
filemap_check_errors in some fashion and ensure that we establish sane
"since" values for those callers. But...some code is shared between
filesystems and needs to be able to handle both error tracking schemes.

Add a new FS_WB_ERRSEQ flag to the fstype. Later patches will set and
key off of that to decide what behavior should be used.

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6cd87887430b..17ba6284ab14 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2023,6 +2023,7 @@ struct file_system_type {
 #define FS_BINARY_MOUNTDATA2
 #define FS_HAS_SUBTYPE 4
 #define FS_USERNS_MOUNT8   /* Can be mounted by userns 
root */
+#define FS_WB_ERRSEQ   16  /* errseq_t writeback err tracking */
 #define FS_RENAME_DOES_D_MOVE  32768   /* FS will handle d_move() during 
rename() internally. */
struct dentry *(*mount) (struct file_system_type *, int,
   const char *, void *);
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6 17/20] fs: add f_md_wb_err field to struct file for tracking metadata errors

2017-06-12 Thread Jeff Layton
Some filesystems keep a different mapping for metadata writeback. Add a
second errseq_t to struct file for tracking metadata writeback errors.
Also add a new function for checking a mapping of the caller's choosing
vs. the f_md_wb_err value.

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h |  3 +++
 include/trace/events/filemap.h | 23 ++-
 mm/filemap.c   | 40 +++-
 3 files changed, 44 insertions(+), 22 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index ef3feeec80b2..e366835c93b3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -871,6 +871,7 @@ struct file {
struct list_headf_tfile_llink;
 #endif /* #ifdef CONFIG_EPOLL */
struct address_space*f_mapping;
+   errseq_tf_md_wb_err; /* optional metadata wb error 
tracking */
 } __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */
 
 struct file_handle {
@@ -2525,6 +2526,8 @@ extern int filemap_fdatawrite_range(struct address_space 
*mapping,
 extern int filemap_check_errors(struct address_space *mapping);
 
 extern int __must_check filemap_report_wb_err(struct file *file);
+extern int __must_check filemap_report_md_wb_err(struct file *file,
+   struct address_space *mapping);
 extern void __filemap_set_wb_err(struct address_space *mapping, int err);
 
 /**
diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 2af66920f267..6e0d78c01a2e 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -79,12 +79,11 @@ TRACE_EVENT(filemap_set_wb_err,
 );
 
 TRACE_EVENT(filemap_report_wb_err,
-   TP_PROTO(struct file *file, errseq_t old),
+   TP_PROTO(struct address_space *mapping, errseq_t old, errseq_t 
new),
 
-   TP_ARGS(file, old),
+   TP_ARGS(mapping, old, new),
 
TP_STRUCT__entry(
-   __field(struct file *, file);
__field(unsigned long, i_ino)
__field(dev_t, s_dev)
__field(errseq_t, old)
@@ -92,20 +91,18 @@ TRACE_EVENT(filemap_report_wb_err,
),
 
TP_fast_assign(
-   __entry->file = file;
-   __entry->i_ino = file->f_mapping->host->i_ino;
-   if (file->f_mapping->host->i_sb)
-   __entry->s_dev = 
file->f_mapping->host->i_sb->s_dev;
+   __entry->i_ino = mapping->host->i_ino;
+   if (mapping->host->i_sb)
+   __entry->s_dev = mapping->host->i_sb->s_dev;
else
-   __entry->s_dev = file->f_mapping->host->i_rdev;
+   __entry->s_dev = mapping->host->i_rdev;
__entry->old = old;
-   __entry->new = file->f_wb_err;
+   __entry->new = new;
),
 
-   TP_printk("file=%p dev=%d:%d ino=0x%lx old=0x%x new=0x%x",
-   __entry->file, MAJOR(__entry->s_dev),
-   MINOR(__entry->s_dev), __entry->i_ino, __entry->old,
-   __entry->new)
+   TP_printk("dev=%d:%d ino=0x%lx old=0x%x new=0x%x",
+   MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
+   __entry->i_ino, __entry->old, __entry->new)
 );
 #endif /* _TRACE_FILEMAP_H */
 
diff --git a/mm/filemap.c b/mm/filemap.c
index c5e19ea0bf12..ef0ff6b87759 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -564,27 +564,49 @@ EXPORT_SYMBOL(__filemap_set_wb_err);
  * value is protected by the f_lock since we must ensure that it reflects
  * the latest value swapped in for this file descriptor.
  */
-int filemap_report_wb_err(struct file *file)
+static int __filemap_report_wb_err(errseq_t *cursor, spinlock_t *lock,
+   struct address_space *mapping)
 {
int err = 0;
-   errseq_t old = READ_ONCE(file->f_wb_err);
-   struct address_space *mapping = file->f_mapping;
+   errseq_t old = READ_ONCE(*cursor);
 
/* Locklessly handle the common case where nothing has changed */
if (errseq_check(>wb_err, old)) {
/* Something changed, must use slow path */
-   spin_lock(>f_lock);
-   old = file->f_wb_err;
-   err = errseq_check_and_advance(>wb_err,
-   >f_wb_err);
-   trace_filemap_report_wb_err(file, old);
-   spin_unlock(>f_lock);
+   spin_lock(lock);
+   old = *cursor;
+   err = errseq_check_and_advance(>wb_err, cursor);
+   trace_filemap_report_wb_err(mapping, old, *cursor);
+   spin_unlock(lock);
}

[PATCH v6 14/20] dax: set errors in mapping when writeback fails

2017-06-12 Thread Jeff Layton
Jan Kara's description for this patch is much better than mine, so I'm
quoting it verbatim here:

-8<-
DAX currently doesn't set errors in the mapping when cache flushing
fails in dax_writeback_mapping_range(). Since this function can get
called only from fsync(2) or sync(2), this is actually as good as it can
currently get since we correctly propagate the error up from
dax_writeback_mapping_range() to filemap_fdatawrite()

However, in the future better writeback error handling will enable us to
properly report these errors on fsync(2) even if there are multiple file
descriptors open against the file or if sync(2) gets called before
fsync(2). So convert DAX to using standard error reporting through the
mapping.
-8<-

For now, only do this when the FS_WB_ERRSEQ flag is set. The
AS_EIO/AS_ENOSPC flags are not currently cleared in the older code when
writeback initiation fails, only when we discover an error after waiting
on writeback to complete, so we only want to do this with errseq_t based
error handling to prevent seeing duplicate errors on fsync.

Signed-off-by: Jeff Layton 
Reviewed-by: Jan Kara 
Reviewed-by: Christoph Hellwig 
Reviewed-and-Tested-by: Ross Zwisler 
---
 fs/dax.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 2a6889b3585f..ba3b17eefcfc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -856,8 +856,24 @@ int dax_writeback_mapping_range(struct address_space 
*mapping,
 
ret = dax_writeback_one(bdev, dax_dev, mapping,
indices[i], pvec.pages[i]);
-   if (ret < 0)
+   if (ret < 0) {
+   /*
+* For fs' that use errseq_t based error
+* tracking, we must call mapping_set_error
+* here to ensure that fsync on all open fds
+* get back an error. Doing this with the old
+* wb error tracking infrastructure is
+* problematic though, as DAX writeback is
+* synchronous, and the error flags are not
+* cleared when initiation fails, only when
+* it fails after the write has been submitted
+* to the backing store.
+*/
+   if (mapping->host->i_sb->s_type->fs_flags &
+   FS_WB_ERRSEQ)
+   mapping_set_error(mapping, ret);
goto out;
+   }
}
}
 out:
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6 12/13] xfs: minimal conversion to errseq_t writeback error reporting

2017-06-12 Thread Jeff Layton
Just check and advance the data errseq_t in struct file before
before returning from fsync on normal files. Internal filemap_*
callers are left as-is.

We also set the FS_WB_ERRSEQ flag just for completeness sake.
Not much is really using it at this point.

Signed-off-by: Jeff Layton 
---
 fs/xfs/xfs_file.c  | 15 +++
 fs/xfs/xfs_super.c |  2 +-
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5fb5a0958a14..bc3b1575e8db 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -134,7 +134,7 @@ xfs_file_fsync(
struct inode*inode = file->f_mapping->host;
struct xfs_inode*ip = XFS_I(inode);
struct xfs_mount*mp = ip->i_mount;
-   int error = 0;
+   int error = 0, err2;
int log_flushed = 0;
xfs_lsn_t   lsn = 0;
 
@@ -142,10 +142,12 @@ xfs_file_fsync(
 
error = filemap_write_and_wait_range(inode->i_mapping, start, end);
if (error)
-   return error;
+   goto out;
 
-   if (XFS_FORCED_SHUTDOWN(mp))
-   return -EIO;
+   if (XFS_FORCED_SHUTDOWN(mp)) {
+   error = -EIO;
+   goto out;
+   }
 
xfs_iflags_clear(ip, XFS_ITRUNCATED);
 
@@ -197,6 +199,11 @@ xfs_file_fsync(
mp->m_logdev_targp == mp->m_ddev_targp)
xfs_blkdev_issue_flush(mp->m_ddev_targp);
 
+out:
+   err2 = filemap_report_wb_err(file);
+   if (!error)
+   error = err2;
+
return error;
 }
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 455a575f101d..28d3be187025 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1758,7 +1758,7 @@ static struct file_system_type xfs_fs_type = {
.name   = "xfs",
.mount  = xfs_fs_mount,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV,
+   .fs_flags   = FS_REQUIRES_DEV | FS_WB_ERRSEQ,
 };
 MODULE_ALIAS_FS("xfs");
 
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6 16/20] block: convert to errseq_t based writeback error tracking

2017-06-12 Thread Jeff Layton
This is a very minimal conversion to errseq_t based error tracking
for raw block device access.

Only real change that is strictly required is that we must ensure that
filemap_report_wb_err is unconditionally called after fsync, which is
now done if FS_WB_ERRSEQ is set in fs_flags. That ensures that the
errseq_t in the file is advanced to the latest value in the mapping.

Note that there are internal callers that call sync_blockdev and the
like that are not affected by this. They'll continue to use the
AS_EIO/AS_ENOSPC flags for error reporting like they always have for
now.

Signed-off-by: Jeff Layton 
---
 fs/block_dev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4d62fe771587..4bf865cc99de 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -801,6 +801,7 @@ static struct file_system_type bd_type = {
.name   = "bdev",
.mount  = bd_mount,
.kill_sb= kill_anon_super,
+   .fs_flags   = FS_WB_ERRSEQ,
 };
 
 struct super_block *blockdev_superblock __read_mostly;
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6 11/20] mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error

2017-06-12 Thread Jeff Layton
When a writeback error occurs, we want later callers to be able to pick
up that fact when they go to wait on that writeback to complete.
Traditionally, we've used AS_EIO/AS_ENOSPC flags to track that, but
that's problematic since only one "checker" will be informed when an
error occurs.

In later patches, we're going to want to convert many of these callers
to check for errors since a well-defined point in time. For now, ensure
that we can handle both sorts of checks by both setting errors in both
places when there is a writeback failure.

Signed-off-by: Jeff Layton 
---
 include/linux/pagemap.h | 31 +--
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 316a19f6b635..28acc94e0f81 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -28,14 +28,33 @@ enum mapping_flags {
AS_NO_WRITEBACK_TAGS = 5,
 };
 
+/**
+ * mapping_set_error - record a writeback error in the address_space
+ * @mapping - the mapping in which an error should be set
+ * @error - the error to set in the mapping
+ *
+ * When writeback fails in some way, we must record that error so that
+ * userspace can be informed when fsync and the like are called.  We endeavor
+ * to report errors on any file that was open at the time of the error.  Some
+ * internal callers also need to know when writeback errors have occurred.
+ *
+ * When a writeback error occurs, most filesystems will want to call
+ * mapping_set_error to record the error in the mapping so that it can be
+ * reported when the application calls fsync(2).
+ */
 static inline void mapping_set_error(struct address_space *mapping, int error)
 {
-   if (unlikely(error)) {
-   if (error == -ENOSPC)
-   set_bit(AS_ENOSPC, >flags);
-   else
-   set_bit(AS_EIO, >flags);
-   }
+   if (likely(!error))
+   return;
+
+   /* Record in wb_err for checkers using errseq_t based tracking */
+   filemap_set_wb_err(mapping, error);
+
+   /* Record it in flags for now, for legacy callers */
+   if (error == -ENOSPC)
+   set_bit(AS_ENOSPC, >flags);
+   else
+   set_bit(AS_EIO, >flags);
 }
 
 static inline void mapping_set_unevictable(struct address_space *mapping)
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6 08/20] lib: add errseq_t type and infrastructure for handling it

2017-06-12 Thread Jeff Layton
An errseq_t is a way of recording errors in one place, and allowing any
number of "subscribers" to tell whether an error has been set again
since a previous time.

It's implemented as an unsigned 32-bit value that is managed with atomic
operations. The low order bits are designated to hold an error code
(max size of MAX_ERRNO). The upper bits are used as a counter.

The API works with consumers sampling an errseq_t value at a particular
point in time. Later, that value can be used to tell whether new errors
have been set since that time.

Note that there is a 1 in 512k risk of collisions here if new errors
are being recorded frequently, since we have so few bits to use as a
counter. To mitigate this, one bit is used as a flag to tell whether the
value has been sampled since a new value was recorded. That allows
us to avoid bumping the counter if no one has sampled it since it
was last bumped.

Later patches will build on this infrastructure to change how writeback
errors are tracked in the kernel.

Signed-off-by: Jeff Layton 
Reviewed-by: NeilBrown 
Reviewed-by: Jan Kara 
---
 include/linux/errseq.h |  19 +
 lib/Makefile   |   2 +-
 lib/errseq.c   | 200 +
 3 files changed, 220 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/errseq.h
 create mode 100644 lib/errseq.c

diff --git a/include/linux/errseq.h b/include/linux/errseq.h
new file mode 100644
index ..0d2555f310cd
--- /dev/null
+++ b/include/linux/errseq.h
@@ -0,0 +1,19 @@
+#ifndef _LINUX_ERRSEQ_H
+#define _LINUX_ERRSEQ_H
+
+/* See lib/errseq.c for more info */
+
+typedef u32errseq_t;
+
+void __errseq_set(errseq_t *eseq, int err);
+static inline void errseq_set(errseq_t *eseq, int err)
+{
+   /* Optimize for the common case of no error */
+   if (unlikely(err))
+   __errseq_set(eseq, err);
+}
+
+errseq_t errseq_sample(errseq_t *eseq);
+int errseq_check(errseq_t *eseq, errseq_t since);
+int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 0166fbc0fa81..519782d9ca3f 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -41,7 +41,7 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o 
random32.o \
 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-once.o refcount.o usercopy.o
+once.o refcount.o usercopy.o errseq.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/errseq.c b/lib/errseq.c
new file mode 100644
index ..d129c0611c1f
--- /dev/null
+++ b/lib/errseq.c
@@ -0,0 +1,200 @@
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * An errseq_t is a way of recording errors in one place, and allowing any
+ * number of "subscribers" to tell whether it has changed since a previous
+ * point where it was sampled.
+ *
+ * It's implemented as an unsigned 32-bit value. The low order bits are
+ * designated to hold an error code (between 0 and -MAX_ERRNO). The upper bits
+ * are used as a counter. This is done with atomics instead of locking so that
+ * these functions can be called from any context.
+ *
+ * The general idea is for consumers to sample an errseq_t value. That value
+ * can later be used to tell whether any new errors have occurred since that
+ * sampling was done.
+ *
+ * Note that there is a risk of collisions if new errors are being recorded
+ * frequently, since we have so few bits to use as a counter.
+ *
+ * To mitigate this, one bit is used as a flag to tell whether the value has
+ * been sampled since a new value was recorded. That allows us to avoid bumping
+ * the counter if no one has sampled it since the last time an error was
+ * recorded.
+ *
+ * A new errseq_t should always be zeroed out.  A errseq_t value of all zeroes
+ * is the special (but common) case where there has never been an error. An all
+ * zero value thus serves as the "epoch" if one wishes to know whether there
+ * has ever been an error set since it was first initialized.
+ */
+
+/* The low bits are designated for error code (max of MAX_ERRNO) */
+#define ERRSEQ_SHIFT   ilog2(MAX_ERRNO + 1)
+
+/* This bit is used as a flag to indicate whether the value has been seen */
+#define ERRSEQ_SEEN(1 << ERRSEQ_SHIFT)
+
+/* The lowest bit of the counter */
+#define ERRSEQ_CTR_INC (1 << (ERRSEQ_SHIFT + 1))
+
+/**
+ * __errseq_set - set a errseq_t for later reporting
+ * @eseq: errseq_t field that should be set
+ * @err: error to set
+ *
+ * This function sets the error in *eseq, and increments the sequence counter
+ * if the last sequence was sampled at some point in the past.
+ *
+ * Any error set will always overwrite an existing error.
+ *
+ * Most callers will want to 

[PATCH v6 18/20] ext4: use errseq_t based error handling for reporting data writeback errors

2017-06-12 Thread Jeff Layton
Add the FS_WB_ERRSEQ flag to indicate to other subsystems that errseq_t
based error reporting for data writeback is in effect, and to opt-in to
reporting those errors in call_fsync.

ext4 uses the blockdev mapping for tracking metadata stored in the
pagecache. Sample its wb_err when opening a file and store that in
the f_md_wb_err field. Ensure that we check and advance the metadata
before returning in fsync.

Note that because metadata writeback errors are only tracked on a
per-device level, this does mean that we'll end up reporting an error on
all open file descriptors on this fs when there is a metadata writeback
failure.  That's not ideal, but we can possibly improve upon it in the
future.

ext4 also has several calls to filemap_fdatawait and
filemap_write_and_wait. Those will also be changed in a later patch to
versions that use errseq_t based reporting.

Signed-off-by: Jeff Layton 
---
 fs/ext4/dir.c   |  8 ++--
 fs/ext4/file.c  |  5 -
 fs/ext4/fsync.c | 23 +++
 fs/ext4/super.c |  6 +++---
 4 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index e8b365000d73..6bbb19510f74 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -611,9 +611,13 @@ static int ext4_dx_readdir(struct file *file, struct 
dir_context *ctx)
 
 static int ext4_dir_open(struct inode * inode, struct file * filp)
 {
+   int ret = 0;
+
if (ext4_encrypted_inode(inode))
-   return fscrypt_get_encryption_info(inode) ? -EACCES : 0;
-   return 0;
+   ret = fscrypt_get_encryption_info(inode) ? -EACCES : 0;
+   if (!ret)
+   filp->f_md_wb_err = 
filemap_sample_wb_err(inode->i_sb->s_bdev->bd_inode->i_mapping);
+   return ret;
 }
 
 static int ext4_release_dir(struct inode *inode, struct file *filp)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 02ce7e7bbdf5..6e505269132c 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -435,7 +435,10 @@ static int ext4_file_open(struct inode * inode, struct 
file * filp)
if (ret < 0)
return ret;
}
-   return dquot_file_open(inode, filp);
+   ret = dquot_file_open(inode, filp);
+   if (!ret)
+   filp->f_md_wb_err = 
filemap_sample_wb_err(sb->s_bdev->bd_inode->i_mapping);
+   return ret;
 }
 
 /*
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 9d549608fd30..09e28b7fd3f6 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -100,8 +100,10 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t 
end, int datasync)
tid_t commit_tid;
bool needs_barrier = false;
 
-   if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb
-   return -EIO;
+   if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb {
+   ret = -EIO;
+   goto out;
+   }
 
J_ASSERT(ext4_journal_current_handle() == NULL);
 
@@ -126,7 +128,8 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t 
end, int datasync)
 
ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
if (ret)
-   return ret;
+   goto out;
+
/*
 * data=writeback,ordered:
 *  The caller's filemap_fdatawrite()/wait will sync the data.
@@ -152,12 +155,24 @@ int ext4_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
needs_barrier = true;
ret = jbd2_complete_transaction(journal, commit_tid);
if (needs_barrier) {
-   issue_flush:
+issue_flush:
err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
if (!ret)
ret = err;
}
 out:
+   /*
+* Was there a metadata writeback error since last fsync?
+*
+* FIXME: ext4 tracks metadata with a whole-block device mapping. So,
+* if there is any sort of metadata writeback error, we'll report an
+* error on all open fds, even ones not associated with the inode
+*/
+   err = filemap_report_md_wb_err(file,
+   inode->i_sb->s_bdev->bd_inode->i_mapping);
+   if (!ret)
+   ret = err;
+
trace_ext4_sync_file_exit(inode, ret);
return ret;
 }
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index d37c81f327e7..e98791ebee6f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -119,7 +119,7 @@ static struct file_system_type ext2_fs_type = {
.name   = "ext2",
.mount  = ext4_mount,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV,
+   .fs_flags   = FS_REQUIRES_DEV|FS_WB_ERRSEQ,
 };
 MODULE_ALIAS_FS("ext2");
 MODULE_ALIAS("ext2");
@@ -134,7 +134,7 @@ static struct file_system_type ext3_fs_type = {
.name   = "ext3",
.mount  = ext4_mount,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV,
+   

[PATCH v6 20/20] btrfs: minimal conversion to errseq_t writeback error reporting on fsync

2017-06-12 Thread Jeff Layton
Set the FS_WB_ERRSEQ flag to opt-in to errseq_t based reporting.
Internal call to filemap_* functions are left as-is.

Signed-off-by: Jeff Layton 
---
 fs/btrfs/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 4f1cdd5058f1..c99af09cd3e7 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2184,7 +2184,7 @@ static struct file_system_type btrfs_fs_type = {
.name   = "btrfs",
.mount  = btrfs_mount,
.kill_sb= btrfs_kill_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA,
+   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA | FS_WB_ERRSEQ,
 };
 MODULE_ALIAS_FS("btrfs");
 
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6 15/20] fs: have call_fsync call filemap_report_wb_err if FS_WB_ERRSEQ is set

2017-06-12 Thread Jeff Layton
Allow filesystems to opt-in to a final check of wb_err if FS_WB_ERRSEQ
is set. Technically, we could just plumb these calls into all of the
fsync operations, but I think this means less code, changes and churn.

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 17ba6284ab14..ef3feeec80b2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1742,12 +1742,6 @@ static inline int call_mmap(struct file *file, struct 
vm_area_struct *vma)
return file->f_op->mmap(file, vma);
 }
 
-static inline int call_fsync(struct file *file, loff_t start, loff_t end,
-int datasync)
-{
-   return file->f_op->fsync(file, start, end, datasync);
-}
-
 ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
  unsigned long nr_segs, unsigned long fast_segs,
  struct iovec *fast_pointer,
@@ -2583,6 +2577,20 @@ static inline errseq_t filemap_sample_wb_err(struct 
address_space *mapping)
return errseq_sample(>wb_err);
 }
 
+static inline int call_fsync(struct file *file, loff_t start, loff_t end,
+int datasync)
+{
+   int ret;
+
+   ret = file->f_op->fsync(file, start, end, datasync);
+   if (file->f_mapping->host->i_sb->s_type->fs_flags & FS_WB_ERRSEQ) {
+   int ret2 = filemap_report_wb_err(file);
+   if (!ret)
+   ret = ret2;
+   }
+   return ret;
+}
+
 extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
   int datasync);
 extern int vfs_fsync(struct file *file, int datasync);
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6 13/13] btrfs: minimal conversion to errseq_t writeback error reporting on fsync

2017-06-12 Thread Jeff Layton
Internal callers of filemap_* functions are left as-is.

Signed-off-by: Jeff Layton 
---
 fs/btrfs/file.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index da1096eb1a40..4632f16bc49c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2011,7 +2011,7 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_trans_handle *trans;
struct btrfs_log_ctx ctx;
-   int ret = 0;
+   int ret = 0, err;
bool full_sync = 0;
u64 len;
 
@@ -2030,7 +2030,7 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
 */
ret = start_ordered_ops(inode, start, end);
if (ret)
-   return ret;
+   goto out;
 
inode_lock(inode);
atomic_inc(>log_batch);
@@ -2227,6 +2227,9 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
ret = btrfs_end_transaction(trans);
}
 out:
+   err = filemap_report_wb_err(file);
+   if (!ret)
+   ret = err;
return ret > 0 ? -EIO : ret;
 }
 
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6 11/13] Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors

2017-06-12 Thread Jeff Layton
I waxed a little loquacious here, but I figured that more detail was
better, and writeback error handling is so hard to get right.

Although I think we'll eventually remove it once the transition is
complete, I've gone ahead and documented the FS_WB_ERRSEQ flag as well.

Cc: Jan Kara 
Signed-off-by: Jeff Layton 
---
 Documentation/filesystems/vfs.txt | 50 ---
 1 file changed, 47 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt 
b/Documentation/filesystems/vfs.txt
index f42b90687d40..c3efdd833a3d 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -576,7 +576,49 @@ should clear PG_Dirty and set PG_Writeback.  It can be 
actually
 written at any point after PG_Dirty is clear.  Once it is known to be
 safe, PG_Writeback is cleared.
 
-Writeback makes use of a writeback_control structure...
+Writeback makes use of a writeback_control structure to direct the
+operations.  This gives the the writepage and writepages operations some
+information about the nature of and reason for the writeback request,
+and the constraints under which it is being done.  It is also used to
+return information back to the caller about the result of a writepage or
+writepages request.
+
+Handling errors during writeback
+
+Most applications that utilize the pagecache will periodically call
+fsync to ensure that data written has made it to the backing store.
+When there is an error during writeback, expect that error to be
+reported when fsync is called.  After an error has been reported to
+fsync, subsequent fsync calls on the same file descriptor should return
+0, unless further writeback errors have occurred since the previous
+fsync.
+
+Ideally, the kernel would report an error only on file descriptions on
+which writes were done that subsequently failed to be written back.  The
+generic pagecache infrastructure does not track the file descriptions
+that have dirtied each individual page however, so determining which
+file descriptors should get back an error is not possible.
+
+Instead, the generic writeback error tracking infrastructure in the
+kernel settles for reporting errors to fsync on all file descriptions
+that were open at the time that the error occurred.  In a situation with
+multiple writers, all of them will get back an error on a subsequent fsync,
+even if all of the writes done through that particular file descriptor
+succeeded (or even if there were no writes on that file descriptor at all).
+
+Filesystems that wish to use this infrastructure should call
+filemap_set_wb_err to record the error in the address_space when it
+occurs.  Then, at the end of their fsync operation, they should call
+filemap_report_wb_err to ensure that the struct file's error cursor
+has advanced to the correct point in the stream of errors emitted by
+the backing device(s).
+
+Older kernels used a different method for tracking errors, based on flags
+in the address_space. We're currently switching everything over to use
+the infrastructure based on errseq_t values. During the transition,
+filesystem authors will want to also ensure their file_system_type has
+FS_WB_ERRSEQ set in fs_flags to ensure that shared infrastructure is
+aware of the model in use.
 
 struct address_space_operations
 ---
@@ -804,7 +846,8 @@ struct address_space_operations {
 The File Object
 ===
 
-A file object represents a file opened by a process.
+A file object represents a file opened by a process. This is also known
+as an "open file description" in POSIX parlance.
 
 
 struct file_operations
@@ -887,7 +930,8 @@ otherwise noted.
 
   release: called when the last reference to an open file is closed
 
-  fsync: called by the fsync(2) system call
+  fsync: called by the fsync(2) system call. Also see the section above
+entitled "Handling errors during writeback".
 
   fasync: called by the fcntl(2) system call when asynchronous
(non-blocking) mode is enabled for a file
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6 13/20] Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors

2017-06-12 Thread Jeff Layton
Let's try to make this extra clear for fs authors.

Also, although I think we'll eventually remove it once the transition is
complete, I've gone ahead and documented the FS_WB_ERRSEQ flag as well.

Cc: Jan Kara 
Signed-off-by: Jeff Layton 
---
 Documentation/filesystems/vfs.txt | 48 ---
 1 file changed, 45 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt 
b/Documentation/filesystems/vfs.txt
index f42b90687d40..0f6415c26385 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -576,7 +576,47 @@ should clear PG_Dirty and set PG_Writeback.  It can be 
actually
 written at any point after PG_Dirty is clear.  Once it is known to be
 safe, PG_Writeback is cleared.
 
-Writeback makes use of a writeback_control structure...
+Writeback makes use of a writeback_control structure to direct the
+operations.  This gives the the writepage and writepages operations some
+information about the nature of and reason for the writeback request,
+and the constraints under which it is being done.  It is also used to
+return information back to the caller about the result of a writepage or
+writepages request.
+
+Handling errors during writeback
+
+Most applications that utilize the pagecache will periodically call
+fsync to ensure that data written has made it to the backing store.
+When there is an error during writeback, expect that error to be
+reported when fsync is called.  After an error has been reported to
+fsync, subsequent fsync calls on the same file descriptor should return
+0, unless further writeback errors have occurred since the previous
+fsync.
+
+Ideally, the kernel would report an error only on file descriptions on
+which writes were done that subsequently failed to be written back.  The
+generic pagecache infrastructure does not track the file descriptions
+that have dirtied each individual page however, so determining which
+file descriptors should get back an error is not possible.
+
+Instead, the generic writeback error tracking infrastructure in the
+kernel settles for reporting errors to fsync on all file descriptions
+that were open at the time that the error occurred.  In a situation with
+multiple writers, all of them will get back an error on a subsequent fsync,
+even if all of the writes done through that particular file descriptor
+succeeded (or even if there were no writes on that file descriptor at all).
+
+Filesystems that wish to use this infrastructure need to do two things:
+
+1) call mapping_set_error to record the error in the address_space when
+one occurs.
+
+2) set FS_WB_ERRSEQ in the fs_flags field in the file_system_type to
+indicate to other subsystems that the filesystem wants to use errseq_t
+based error reporting for writeback.
+
+The flag may go away in the future or moved to an opt-out flag once
+the majority of filesystems are converted to use errseq_t based reporting.
 
 struct address_space_operations
 ---
@@ -804,7 +844,8 @@ struct address_space_operations {
 The File Object
 ===
 
-A file object represents a file opened by a process.
+A file object represents a file opened by a process. This is also known
+as an "open file description" in POSIX parlance.
 
 
 struct file_operations
@@ -887,7 +928,8 @@ otherwise noted.
 
   release: called when the last reference to an open file is closed
 
-  fsync: called by the fsync(2) system call
+  fsync: called by the fsync(2) system call. Also see the section above
+entitled "Handling errors during writeback".
 
   fasync: called by the fcntl(2) system call when asynchronous
(non-blocking) mode is enabled for a file
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6 19/20] xfs: minimal conversion to errseq_t writeback error reporting

2017-06-12 Thread Jeff Layton
Just set the FS_WB_ERRSEQ flag to indicate that we want to use errseq_t
based error reporting. Internal filemap_* calls are left as-is for now.

Signed-off-by: Jeff Layton 
---
 fs/xfs/xfs_super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 455a575f101d..28d3be187025 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1758,7 +1758,7 @@ static struct file_system_type xfs_fs_type = {
.name   = "xfs",
.mount  = xfs_fs_mount,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV,
+   .fs_flags   = FS_REQUIRES_DEV | FS_WB_ERRSEQ,
 };
 MODULE_ALIAS_FS("xfs");
 
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6 09/20] fs: new infrastructure for writeback error handling and reporting

2017-06-12 Thread Jeff Layton
Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.

If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.

This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.

In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.

One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.

This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.

This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).

Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.

The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure.

Signed-off-by: Jeff Layton 
Reviewed-by: Jan Kara 
---
 drivers/dax/device.c |  1 +
 fs/block_dev.c   |  1 +
 fs/file_table.c  |  1 +
 fs/open.c|  3 +++
 include/linux/fs.h   | 53 
 mm/filemap.c | 38 +
 6 files changed, 97 insertions(+)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 006e657dfcb9..12943d19bfc4 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -499,6 +499,7 @@ static int dax_open(struct inode *inode, struct file *filp)
inode->i_mapping = __dax_inode->i_mapping;
inode->i_mapping->host = __dax_inode;
filp->f_mapping = inode->i_mapping;
+   filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
filp->private_data = dev_dax;
inode->i_flags = S_DAX;
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 51959936..4d62fe771587 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1743,6 +1743,7 @@ static int blkdev_open(struct inode * inode, struct file 
* filp)
return -ENOMEM;
 
filp->f_mapping = bdev->bd_inode->i_mapping;
+   filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
 
return blkdev_get(bdev, filp->f_mode, filp);
 }
diff --git a/fs/file_table.c b/fs/file_table.c
index 954d510b765a..72e861a35a7f 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -168,6 +168,7 @@ struct file *alloc_file(const struct path *path, fmode_t 
mode,
file->f_path = *path;
file->f_inode = path->dentry->d_inode;
file->f_mapping = path->dentry->d_inode->i_mapping;
+   file->f_wb_err = filemap_sample_wb_err(file->f_mapping);
if ((mode & FMODE_READ) &&
 likely(fop->read || fop->read_iter))
mode |= FMODE_CAN_READ;
diff --git a/fs/open.c b/fs/open.c
index cd0c5be8d012..280d4a963791 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -707,6 

[PATCH v6 01/20] mm: fix mapping_set_error call in me_pagecache_dirty

2017-06-12 Thread Jeff Layton
The error code should be negative. Since this ends up in the default
case anyway, this is harmless, but it's less confusing to negate it.
Also, later patches will require a negative error code here.

Signed-off-by: Jeff Layton 
Reviewed-by: Ross Zwisler 
Reviewed-by: Jan Kara 
Reviewed-by: Matthew Wilcox 
Reviewed-by: Christoph Hellwig 
---
 mm/memory-failure.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 342fac9ba89b..55bc61791fe1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -684,7 +684,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long 
pfn)
 * the first EIO, but we're not worse than other parts
 * of the kernel.
 */
-   mapping_set_error(mapping, EIO);
+   mapping_set_error(mapping, -EIO);
}
 
return me_pagecache_clean(p, pfn);
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6 07/20] mm: clean up error handling in write_one_page

2017-06-12 Thread Jeff Layton
Don't try to check PageError since that's potentially racy and not
necessarily going to be set after writepage errors out.

Instead, check the mapping for an error after writepage returns.

Signed-off-by: Jeff Layton 
Reviewed-by: Jan Kara 
---
 mm/page-writeback.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b901fe52b153..e369e8ea2a29 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2371,14 +2371,13 @@ int do_writepages(struct address_space *mapping, struct 
writeback_control *wbc)
  *
  * The page must be locked by the caller and will be unlocked upon return.
  *
- * write_one_page() returns a negative error code if I/O failed. Note that
- * the address_space is not marked for error. The caller must do this if
- * needed.
+ * Note that the mapping's AS_EIO/AS_ENOSPC flags will be cleared when this
+ * function returns.
  */
 int write_one_page(struct page *page)
 {
struct address_space *mapping = page->mapping;
-   int ret = 0;
+   int ret = 0, ret2;
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 1,
@@ -2391,15 +2390,15 @@ int write_one_page(struct page *page)
if (clear_page_dirty_for_io(page)) {
get_page(page);
ret = mapping->a_ops->writepage(page, );
-   if (ret == 0) {
+   if (ret == 0)
wait_on_page_writeback(page);
-   if (PageError(page))
-   ret = -EIO;
-   }
put_page(page);
} else {
unlock_page(page);
}
+
+   if (!ret)
+   ret = filemap_check_errors(mapping);
return ret;
 }
 EXPORT_SYMBOL(write_one_page);
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: add compression trace points

2017-06-12 Thread Nikolay Borisov


On 12.06.2017 12:44, Anand Jain wrote:
> 
> Thanks for the review.
> 
>> I haven't read previous submissions and any discussions that might have
>> occurred there but why not just stick to
>> btrfs_data_compression/btrfs_data_compressor. I know there is certain
>> semantic overload since we call it a compressor yet it also does
>> decompression but let's focus on making the code/intention clear for the
>> code reader and not bogging down too much on actual word semantics. To
>> me "compressor" is a synonym to something which compresses AND
>> decompresses. It's very well possible that this is just me in which case
>> my argument is flawed and you can ignore it :)
> 
>  The same tracer will also trace encryption at some point in future.

Right, in that case why don't you introduce the concept of
stages/pipeline. We can have a compression stage with either
compress/decompress ops. We can have an encryption stage with
encrypt/decrypt? How does that abstraction sound?


> 
>>> +#define show_transformer_type(type)\
>>
>> Why not show_compression_type ?
> 
>  as above.
> 
>>> +TP_printk_btrfs("%s %s ino=%lu type=%s len_before=%lu
>>> len_after=%lu "
>>> +"start=%lu ret=%d",
>>> +__entry->uncompress ? "untransform":"transform",
>>
>> decompress/compress. Transform/untransform are as cryptic as they can
>> be. It's a lot easier to put those terms in context when reading the
>> len_before/len_after values.
> 
>> Otherwise one might ask themselves "What
>> kind of transformation are we talking about?"
>>
>> Even if you don't do a v3 you can add:
> 
>  ha. Thanks.
> 
> current trace an example:
>  untransform bio ino=258 type=lzo len_before=4096 len_after=16384
> start=393216 ret=0

Now that I"m seeing this, why not prepent something in front of bio/page
.e.g granularity/unit:

unit=bio
granularity=page

I'm more inclined towards the latter.

> 
> before after size may be same in encryption, so we need extra specifier
> about the operation.
> 
> How about: (for example)
>  de-lzo bio ino=258 len_before=4096 len_after=16384 start=393216 ret=0
>  lzo ino=258 ..

de-lzo still seems a bit botched. what about

stage: compress op=lzo-decompress/zlib-compress granularity=bio/page
stage: encryption op=XTS-decrypt/encrypt ... ?


> 
> Thanks, Anand
> 
> 
>> Reviewed-by: Nikolay Borisov 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: use ino_t for inode in tracer

2017-06-12 Thread Anand Jain
Signed-off-by: Anand Jain 
---
 include/trace/events/btrfs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index e37973526153..6494e34b6df9 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1459,7 +1459,7 @@ DECLARE_EVENT_CLASS(btrfs__qgroup_rsv_data,
 
TP_STRUCT__entry_btrfs(
__field(u64,rootid  )
-   __field(unsigned long,  ino )
+   __field(ino_t,  ino )
__field(u64,start   )
__field(u64,len )
__field(u64,reserved)
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: add compression trace points

2017-06-12 Thread Anand Jain


Thanks for the review.


I haven't read previous submissions and any discussions that might have
occurred there but why not just stick to
btrfs_data_compression/btrfs_data_compressor. I know there is certain
semantic overload since we call it a compressor yet it also does
decompression but let's focus on making the code/intention clear for the
code reader and not bogging down too much on actual word semantics. To
me "compressor" is a synonym to something which compresses AND
decompresses. It's very well possible that this is just me in which case
my argument is flawed and you can ignore it :)


 The same tracer will also trace encryption at some point in future.


+#define show_transformer_type(type)\


Why not show_compression_type ?


 as above.


+   TP_printk_btrfs("%s %s ino=%lu type=%s len_before=%lu len_after=%lu "
+   "start=%lu ret=%d",
+   __entry->uncompress ? "untransform":"transform",


decompress/compress. Transform/untransform are as cryptic as they can
be. It's a lot easier to put those terms in context when reading the
len_before/len_after values.



Otherwise one might ask themselves "What
kind of transformation are we talking about?"

Even if you don't do a v3 you can add:


 ha. Thanks.

current trace an example:
 untransform bio ino=258 type=lzo len_before=4096 len_after=16384 
start=393216 ret=0


before after size may be same in encryption, so we need extra specifier 
about the operation.


How about: (for example)
 de-lzo bio ino=258 len_before=4096 len_after=16384 start=393216 ret=0
 lzo ino=258 ..

Thanks, Anand



Reviewed-by: Nikolay Borisov 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


csum failed root -9

2017-06-12 Thread Henk Slager
Hi all,

there is 1-block corruption a 8TB filesystem that showed up several
months ago. The fs is almost exclusively a btrfs receive target and
receives monthly sequential snapshots from two hosts but 1 received
uuid. I do not know exactly when the corruption has happened but it
must have been roughly 3 to 6 months ago. with monthly updated
kernel+progs on that host.

Some more history:
- fs was created in november 2015 on top of luks
- initially bcache between the 2048-sector aligned partition and luks.
Some months ago I removed 'the bcache layer' by making sure that cache
was clean and then zeroing 8K bytes at start of partition in an
isolated situation. Then setting partion offset to 2064 by
delete-recreate in gdisk.
- in december 2016 there were more scrub errors, but related to the
monthly snapshot of december2016. I have removed that snapshot this
year and now only this 1-block csum error is the only issue.
- brand/type is seagate 8TB SMR. At least since kernel 4.4+ that
includes some SMR related changes in the blocklayer this disk works
fine with btrfs.
- the smartctl values show no error so far but I will run an extended
test this week after another btrfs check which did not show any error
earlier with the csum fail being there
- I have noticed that the board that has the disk attached has been
rebooted due to power-failures many times (unreliable power switch and
power dips from energy company) and the 150W powersupply is broken and
replaced since then. Also due to this, I decided to remove bcache
(which has been in write-through and write-around only).

Some btrfs inpect-internal exercise shows that the problem is in a
directory in the root that contains most of the data and snapshots.
But an  rsync -c  with an identical other clone snapshot shows no
difference (no writes to an rw snapshot of that clone). So the fs is
still OK as file-level backup, but btrfs replace/balance will fatal
error on just this 1 csum error. It looks like that this is not a
media/disk error but some HW induced error or SW/kernel issue.
Relevant btrfs commands + dmesg info, see below.

Any comments on how to fix or handle this without incrementally
sending all snapshots to a new fs (6+ TiB of data, assuming this won't
fail)?


# uname -r
4.11.3-1-default
# btrfs --version
btrfs-progs v4.10.2+20170406

fs profile is dup for system+meta, single for data

# btrfs scrub start /local/smr

[27609.626555] BTRFS error (device dm-0): parent transid verify failed
on 6350718500864 wanted 23170 found 23076
[27609.685416] BTRFS info (device dm-0): read error corrected: ino 1
off 6350718500864 (dev /dev/mapper/smr sector 11681212672)
[27609.685928] BTRFS info (device dm-0): read error corrected: ino 1
off 6350718504960 (dev /dev/mapper/smr sector 11681212680)
[27609.686160] BTRFS info (device dm-0): read error corrected: ino 1
off 6350718509056 (dev /dev/mapper/smr sector 11681212688)
[27609.687136] BTRFS info (device dm-0): read error corrected: ino 1
off 6350718513152 (dev /dev/mapper/smr sector 11681212696)
[37663.606455] BTRFS error (device dm-0): parent transid verify failed
on 6350453751808 wanted 23170 found 23075
[37663.685158] BTRFS info (device dm-0): read error corrected: ino 1
off 6350453751808 (dev /dev/mapper/smr sector 11679647008)
[37663.685386] BTRFS info (device dm-0): read error corrected: ino 1
off 6350453755904 (dev /dev/mapper/smr sector 11679647016)
[37663.685587] BTRFS info (device dm-0): read error corrected: ino 1
off 635045376 (dev /dev/mapper/smr sector 11679647024)
[37663.685798] BTRFS info (device dm-0): read error corrected: ino 1
off 6350453764096 (dev /dev/mapper/smr sector 11679647032)
[43497.234598] BTRFS error (device dm-0): bdev /dev/mapper/smr errs:
wr 0, rd 0, flush 0, corrupt 1, gen 0
[43497.234605] BTRFS error (device dm-0): unable to fixup (regular)
error at logical 7175413624832 on dev /dev/mapper/smr

# < figure out which chunk with help of btrfs py lib >

chunk vaddr 7174898057216 type 1 stripe 0 devid 1 offset 6696948727808
length 1073741824 used 1073741824 used_pct 100
chunk vaddr 7175971799040 type 1 stripe 0 devid 1 offset 6698022469632
length 1073741824 used 1073741824 used_pct 100

# btrfs balance start -v -dvrange=7174898057216..7174898057217 /local/smr

[74250.913273] BTRFS info (device dm-0): relocating block group
7174898057216 flags data
[74255.941105] BTRFS warning (device dm-0): csum failed root -9 ino
257 off 515567616 csum 0x589cb236 expected csum 0xee19bf74 mirror 1
[74255.965804] BTRFS warning (device dm-0): csum failed root -9 ino
257 off 515567616 csum 0x589cb236 expected csum 0xee19bf74 mirror 1
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: add compression trace points

2017-06-12 Thread Nikolay Borisov


On 12.06.2017 11:32, Anand Jain wrote:
> This patch adds compression and decompression trace points for the
> purpose of debugging.
> 
> Signed-off-by: Anand Jain 
> ---
> v2:
>  . Use better naming.
>(If transform is not good enough I have run out of ideas, pls suggest).
>  . To be applied on top of
>git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
>(tested without namelen check patch set).

I haven't read previous submissions and any discussions that might have
occurred there but why not just stick to
btrfs_data_compression/btrfs_data_compressor. I know there is certain
semantic overload since we call it a compressor yet it also does
decompression but let's focus on making the code/intention clear for the
code reader and not bogging down too much on actual word semantics. To
me "compressor" is a synonym to something which compresses AND
decompresses. It's very well possible that this is just me in which case
my argument is flawed and you can ignore it :)

> 
>  fs/btrfs/compression.c   | 11 +++
>  include/trace/events/btrfs.h | 44 
> 
>  2 files changed, 55 insertions(+)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index fd6508bcff77..53908722d80e 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -884,6 +884,10 @@ int btrfs_compress_pages(int type, struct address_space 
> *mapping,
> start, pages,
> out_pages,
> total_in, total_out);
> +
> + trace_btrfs_data_transformer(0, 0, mapping->host, type, *total_in,
> + *total_out, start, ret);
> +
>   free_workspace(type, workspace);
>   return ret;
>  }
> @@ -910,6 +914,10 @@ static int btrfs_decompress_bio(struct compressed_bio 
> *cb)
>  
>   workspace = find_workspace(type);
>   ret = btrfs_compress_op[type - 1]->decompress_bio(workspace, cb);
> +
> + trace_btrfs_data_transformer(1, 1, cb->inode, type,
> + cb->compressed_len, cb->len, cb->start, ret);
> +
>   free_workspace(type, workspace);
>  
>   return ret;
> @@ -932,6 +940,9 @@ int btrfs_decompress(int type, unsigned char *data_in, 
> struct page *dest_page,
> dest_page, start_byte,
> srclen, destlen);
>  
> + trace_btrfs_data_transformer(1, 0, dest_page->mapping->host,
> + type, srclen, destlen, start_byte, ret);
> +
>   free_workspace(type, workspace);
>   return ret;
>  }
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index cd99a3658156..7c2442dab6a0 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -1622,6 +1622,50 @@ TRACE_EVENT(qgroup_meta_reserve,
>   show_root_type(__entry->refroot), __entry->diff)
>  );
>  
> +#define show_transformer_type(type)  \

Why not show_compression_type ?

> + __print_symbolic(type,  \
> + { BTRFS_COMPRESS_ZLIB,  "zlib" },   \
> + { BTRFS_COMPRESS_LZO,   "lzo" })
> +
> +TRACE_EVENT(btrfs_data_transformer,
> +
> + TP_PROTO(int uncompress, int its_bio, struct inode *inode, int type,
> + unsigned long len_before, unsigned long len_after,
> + unsigned long start, int ret)> +
> + TP_ARGS(uncompress, its_bio, inode, type, len_before,
> + len_after, start, ret),
> +
> + TP_STRUCT__entry_btrfs(
> + __field(int,uncompress)
> + __field(int,its_bio)
> + __field(ino_t,  i_ino)
> + __field(int,type)
> + __field(unsigned long,  len_before)
> + __field(unsigned long,  len_after)
> + __field(unsigned long,  start)
> + __field(int,ret)
> + ),
> +
> + TP_fast_assign_btrfs(btrfs_sb(inode->i_sb),
> + __entry->uncompress = uncompress;
> + __entry->its_bio= its_bio;
> + __entry->i_ino  = inode->i_ino;
> + __entry->type   = type;
> + __entry->len_before = len_before;
> + __entry->len_after  = len_after;
> + __entry->start  = start;
> + __entry->ret= ret;
> + ),
> +
> + TP_printk_btrfs("%s %s ino=%lu type=%s len_before=%lu len_after=%lu "
> + "start=%lu ret=%d",
> + __entry->uncompress ? "untransform":"transform",

decompress/compress. Transform/untransform are as cryptic as they can
be. It's a lot 

[PATCH v2] btrfs: add compression trace points

2017-06-12 Thread Anand Jain
This patch adds compression and decompression trace points for the
purpose of debugging.

Signed-off-by: Anand Jain 
---
v2:
 . Use better naming.
   (If transform is not good enough I have run out of ideas, pls suggest).
 . To be applied on top of
   git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
   (tested without namelen check patch set).

 fs/btrfs/compression.c   | 11 +++
 include/trace/events/btrfs.h | 44 
 2 files changed, 55 insertions(+)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index fd6508bcff77..53908722d80e 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -884,6 +884,10 @@ int btrfs_compress_pages(int type, struct address_space 
*mapping,
  start, pages,
  out_pages,
  total_in, total_out);
+
+   trace_btrfs_data_transformer(0, 0, mapping->host, type, *total_in,
+   *total_out, start, ret);
+
free_workspace(type, workspace);
return ret;
 }
@@ -910,6 +914,10 @@ static int btrfs_decompress_bio(struct compressed_bio *cb)
 
workspace = find_workspace(type);
ret = btrfs_compress_op[type - 1]->decompress_bio(workspace, cb);
+
+   trace_btrfs_data_transformer(1, 1, cb->inode, type,
+   cb->compressed_len, cb->len, cb->start, ret);
+
free_workspace(type, workspace);
 
return ret;
@@ -932,6 +940,9 @@ int btrfs_decompress(int type, unsigned char *data_in, 
struct page *dest_page,
  dest_page, start_byte,
  srclen, destlen);
 
+   trace_btrfs_data_transformer(1, 0, dest_page->mapping->host,
+   type, srclen, destlen, start_byte, ret);
+
free_workspace(type, workspace);
return ret;
 }
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index cd99a3658156..7c2442dab6a0 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1622,6 +1622,50 @@ TRACE_EVENT(qgroup_meta_reserve,
show_root_type(__entry->refroot), __entry->diff)
 );
 
+#define show_transformer_type(type)\
+   __print_symbolic(type,  \
+   { BTRFS_COMPRESS_ZLIB,  "zlib" },   \
+   { BTRFS_COMPRESS_LZO,   "lzo" })
+
+TRACE_EVENT(btrfs_data_transformer,
+
+   TP_PROTO(int uncompress, int its_bio, struct inode *inode, int type,
+   unsigned long len_before, unsigned long len_after,
+   unsigned long start, int ret),
+
+   TP_ARGS(uncompress, its_bio, inode, type, len_before,
+   len_after, start, ret),
+
+   TP_STRUCT__entry_btrfs(
+   __field(int,uncompress)
+   __field(int,its_bio)
+   __field(ino_t,  i_ino)
+   __field(int,type)
+   __field(unsigned long,  len_before)
+   __field(unsigned long,  len_after)
+   __field(unsigned long,  start)
+   __field(int,ret)
+   ),
+
+   TP_fast_assign_btrfs(btrfs_sb(inode->i_sb),
+   __entry->uncompress = uncompress;
+   __entry->its_bio= its_bio;
+   __entry->i_ino  = inode->i_ino;
+   __entry->type   = type;
+   __entry->len_before = len_before;
+   __entry->len_after  = len_after;
+   __entry->start  = start;
+   __entry->ret= ret;
+   ),
+
+   TP_printk_btrfs("%s %s ino=%lu type=%s len_before=%lu len_after=%lu "
+   "start=%lu ret=%d",
+   __entry->uncompress ? "untransform":"transform",
+   __entry->its_bio ? "bio":"page", __entry->i_ino,
+   show_transformer_type(__entry->type), __entry->len_before,
+   __entry->len_after, __entry->start, __entry->ret)
+
+);
 #endif /* _TRACE_BTRFS_H */
 
 /* This part must be outside protection */
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Filesystem won't mount (open_ctree failed) or repair (BUG_ON)

2017-06-12 Thread Koen Kooi
Op 12-06-17 om 01:00 schreef Chris Murphy:
> On Sun, Jun 11, 2017 at 4:13 AM, Koen Kooi  wrote:
>>
>>> Op 11 jun. 2017, om 12:05 heeft Koen Kooi  het 
>>> volgende geschreven:
>>>

 Op 11 jun. 2017, om 06:20 heeft Chris Murphy  het 
 volgende geschreven:

 On Fri, Jun 9, 2017 at 1:57 PM, Hugo Mills  wrote:
>>
>> [..]
>>
 I'd say take a btrfs-image and put it up somewhere and also file a
 bug. The fsck should not crash.
>>>
>>> I’ll create a bugzilla account and file a bug for that.
>>
>> Done: https://bugzilla.kernel.org/show_bug.cgi?id=196031
>>
>> Btrfs-image is still running, will put it online when it has finished. It’s 
>> already 1.2G:
>>
>> koen@beast:~$ du -ms /data/backup/btrfs.img
>> 1205/data/backup/btrfs.img
> 
> Hopefully you're using -s -t4 -c9 but if not you can at least compress
> it after the fact but that takes even longer.

I wasn't using '-s', after I did it and ran it again it shrunk from 14GiB to 
733MiB:

https://dominion.thruhere.net/btrfs.img

xz -9 -e wouldn't only compressed that 14GiB file to 13.99GiB, so I wonder why 
'-s' seems to make such a difference.

regards,

Koen


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Filesystem won't mount (open_ctree failed) or repair (BUG_ON)

2017-06-12 Thread Koen Kooi
Op 12-06-17 om 00:58 schreef Chris Murphy:

[..]

> Also worth trying btrfs check --mode=lowmem. This doesn't repair but
> is a whole new implementation so it might find the source of the
> problem better than the current fsck.

I ran it under 'catchsegv' to give more data where it segfaults, here's the log:

https://dominion.thruhere.net/btrfsck-lowmem.txt.gz

It's 688K compressed and 16MiB uncompressed.

regards,

Koen


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html