Re: csum failed root -9
Am Mon, 12 Jun 2017 11:00:31 +0200 schrieb Henk Slager: > Hi all, > > there is 1-block corruption a 8TB filesystem that showed up several > months ago. The fs is almost exclusively a btrfs receive target and > receives monthly sequential snapshots from two hosts but 1 received > uuid. I do not know exactly when the corruption has happened but it > must have been roughly 3 to 6 months ago. with monthly updated > kernel+progs on that host. > > Some more history: > - fs was created in november 2015 on top of luks > - initially bcache between the 2048-sector aligned partition and luks. > Some months ago I removed 'the bcache layer' by making sure that cache > was clean and then zeroing 8K bytes at start of partition in an > isolated situation. Then setting partion offset to 2064 by > delete-recreate in gdisk. > - in december 2016 there were more scrub errors, but related to the > monthly snapshot of december2016. I have removed that snapshot this > year and now only this 1-block csum error is the only issue. > - brand/type is seagate 8TB SMR. At least since kernel 4.4+ that > includes some SMR related changes in the blocklayer this disk works > fine with btrfs. > - the smartctl values show no error so far but I will run an extended > test this week after another btrfs check which did not show any error > earlier with the csum fail being there > - I have noticed that the board that has the disk attached has been > rebooted due to power-failures many times (unreliable power switch and > power dips from energy company) and the 150W powersupply is broken and > replaced since then. Also due to this, I decided to remove bcache > (which has been in write-through and write-around only). > > Some btrfs inpect-internal exercise shows that the problem is in a > directory in the root that contains most of the data and snapshots. > But an rsync -c with an identical other clone snapshot shows no > difference (no writes to an rw snapshot of that clone). So the fs is > still OK as file-level backup, but btrfs replace/balance will fatal > error on just this 1 csum error. It looks like that this is not a > media/disk error but some HW induced error or SW/kernel issue. > Relevant btrfs commands + dmesg info, see below. > > Any comments on how to fix or handle this without incrementally > sending all snapshots to a new fs (6+ TiB of data, assuming this won't > fail)? > > > # uname -r > 4.11.3-1-default > # btrfs --version > btrfs-progs v4.10.2+20170406 There's btrfs-progs v4.11 available... > fs profile is dup for system+meta, single for data > > # btrfs scrub start /local/smr What looks strange to me is that the parameters of the error reports seem to be rotated by one... See below: > [27609.626555] BTRFS error (device dm-0): parent transid verify failed > on 6350718500864 wanted 23170 found 23076 > [27609.685416] BTRFS info (device dm-0): read error corrected: ino 1 > off 6350718500864 (dev /dev/mapper/smr sector 11681212672) > [27609.685928] BTRFS info (device dm-0): read error corrected: ino 1 > off 6350718504960 (dev /dev/mapper/smr sector 11681212680) > [27609.686160] BTRFS info (device dm-0): read error corrected: ino 1 > off 6350718509056 (dev /dev/mapper/smr sector 11681212688) > [27609.687136] BTRFS info (device dm-0): read error corrected: ino 1 > off 6350718513152 (dev /dev/mapper/smr sector 11681212696) > [37663.606455] BTRFS error (device dm-0): parent transid verify failed > on 6350453751808 wanted 23170 found 23075 > [37663.685158] BTRFS info (device dm-0): read error corrected: ino 1 > off 6350453751808 (dev /dev/mapper/smr sector 11679647008) > [37663.685386] BTRFS info (device dm-0): read error corrected: ino 1 > off 6350453755904 (dev /dev/mapper/smr sector 11679647016) > [37663.685587] BTRFS info (device dm-0): read error corrected: ino 1 > off 635045376 (dev /dev/mapper/smr sector 11679647024) > [37663.685798] BTRFS info (device dm-0): read error corrected: ino 1 > off 6350453764096 (dev /dev/mapper/smr sector 11679647032) Why does it say "ino 1"? Does it mean devid 1? > [43497.234598] BTRFS error (device dm-0): bdev /dev/mapper/smr errs: > wr 0, rd 0, flush 0, corrupt 1, gen 0 > [43497.234605] BTRFS error (device dm-0): unable to fixup (regular) > error at logical 7175413624832 on dev /dev/mapper/smr > > # < figure out which chunk with help of btrfs py lib > > > chunk vaddr 7174898057216 type 1 stripe 0 devid 1 offset 6696948727808 > length 1073741824 used 1073741824 used_pct 100 > chunk vaddr 7175971799040 type 1 stripe 0 devid 1 offset 6698022469632 > length 1073741824 used 1073741824 used_pct 100 > > # btrfs balance start -v > -dvrange=7174898057216..7174898057217 /local/smr > > [74250.913273] BTRFS info (device dm-0): relocating block group > 7174898057216 flags data > [74255.941105] BTRFS warning (device dm-0): csum failed root -9 ino > 257 off 515567616 csum 0x589cb236 expected csum 0xee19bf74 mirror 1 > [74255.965804] BTRFS warning (device dm-0): csum
Re: [PATCH v6 19/20] xfs: minimal conversion to errseq_t writeback error reporting
On Mon, Jun 12, 2017 at 08:23:15AM -0400, Jeff Layton wrote: > Just set the FS_WB_ERRSEQ flag to indicate that we want to use errseq_t > based error reporting. Internal filemap_* calls are left as-is for now. > > Signed-off-by: Jeff Layton> --- > fs/xfs/xfs_super.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > index 455a575f101d..28d3be187025 100644 > --- a/fs/xfs/xfs_super.c > +++ b/fs/xfs/xfs_super.c > @@ -1758,7 +1758,7 @@ static struct file_system_type xfs_fs_type = { > .name = "xfs", > .mount = xfs_fs_mount, > .kill_sb= kill_block_super, > - .fs_flags = FS_REQUIRES_DEV, > + .fs_flags = FS_REQUIRES_DEV | FS_WB_ERRSEQ, Huh? Why are there two patches with the same subject line? And this same bit of code too? Or ... 11/13, 11/20? What's going on here? --D > }; > MODULE_ALIAS_FS("xfs"); > > -- > 2.13.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] btrfs: use ino_t for inode in tracer
Signed-off-by: Anand Jain--- v2: fix another location where ino_t was required. include/trace/events/btrfs.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index e37973526153..692207e3f5d5 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -1418,7 +1418,7 @@ DECLARE_EVENT_CLASS(btrfs__qgroup_data_map, TP_STRUCT__entry_btrfs( __field(u64,rootid ) - __field(unsigned long, ino ) + __field(ino_t, ino ) __field(u64,free_reserved ) ), @@ -1459,7 +1459,7 @@ DECLARE_EVENT_CLASS(btrfs__qgroup_rsv_data, TP_STRUCT__entry_btrfs( __field(u64,rootid ) - __field(unsigned long, ino ) + __field(ino_t, ino ) __field(u64,start ) __field(u64,len ) __field(u64,reserved) -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/10 v11] No wait AIO
On 06/10/2017 12:34 AM, Al Viro wrote: > On Thu, Jun 08, 2017 at 12:39:10AM -0700, Christoph Hellwig wrote: >> As already indicated this whole series looks fine to me. >> >> Al: are you going to pick this up? Or Andrew? > > The main issue here is "let's bail out from ->write_iter() instances" > patch. It very obviously has holes in coverage. > > Could we have FMODE_AIO_NOWAIT and make those who claim to support it > set that in ->open()? And make aio check that and bail out if asked > for nowait on a file without that flag... > Yes, I would agree. We had FS_NOWAIT in filesystem type flags (in v3), but retracted it later in v4. Another option could be to keep the feature against FS_REQUIRES_DEV to rule out filesystems which are not local, but it again has the problem of holes in coverage. I will work on adding FMODE_AIO_NOWAIT in the meantime. Thanks, -- Goldwyn -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3] Btrfs: change how we iterate bios in endio
Since dio submit has used bio_clone_fast, the submitted bio may not have a reliable bi_vcnt, for the bio vector iterations in checksum related functions, bio->bi_iter is not modified yet and it's safe to use bio_for_each_segment, while for those bio vector iterations in dio read's endio, we now save a copy of bvec_iter in struct btrfs_io_bio when cloning bios and use the helper __bio_for_each_segment with the saved bvec_iter to access each bvec. Also for dio reads which don't get split, we also need to save a copy of bio iterator in btrfs_bio_clone to let __bio_for_each_segments to access each bvec in dio read's endio. Note that it doesn't affect other calls of btrfs_bio_clone() because they don't need to use this iterator. Cc: David SterbaSigned-off-by: Liu Bo --- v2: Fix null pointer crash in the case of non-split dio reads. fs/btrfs/extent_io.c | 2 ++ fs/btrfs/file-item.c | 31 +++ fs/btrfs/inode.c | 35 +++ fs/btrfs/volumes.h | 1 + 4 files changed, 37 insertions(+), 32 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 89b824d..79be8c2 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2719,6 +2719,7 @@ struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask) btrfs_bio->csum = NULL; btrfs_bio->csum_allocated = NULL; btrfs_bio->end_io = NULL; + btrfs_bio->iter = bio->bi_iter; } return new; } @@ -2755,6 +2756,7 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, gfp_t gfp_mask, btrfs_bio->end_io = NULL; bio_trim(bio, offset >> 9, size >> 9); + btrfs_bio->iter = bio->bi_iter; return bio; } diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c index 64fcb31..9f6062c 100644 --- a/fs/btrfs/file-item.c +++ b/fs/btrfs/file-item.c @@ -164,7 +164,8 @@ static int __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u64 logical_offset, u32 *dst, int dio) { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); - struct bio_vec *bvec; + struct bio_vec bvec; + struct bvec_iter iter; struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio); struct btrfs_csum_item *item = NULL; struct extent_io_tree *io_tree = _I(inode)->io_tree; @@ -177,7 +178,7 @@ static int __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u64 page_bytes_left; u32 diff; int nblocks; - int count = 0, i; + int count = 0; u16 csum_size = btrfs_super_csum_size(fs_info->super_copy); path = btrfs_alloc_path(); @@ -206,8 +207,6 @@ static int __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, if (bio->bi_iter.bi_size > PAGE_SIZE * 8) path->reada = READA_FORWARD; - WARN_ON(bio->bi_vcnt <= 0); - /* * the free space stuff is only read when it hasn't been * updated in the current transaction. So, we can safely @@ -223,13 +222,13 @@ static int __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, if (dio) offset = logical_offset; - bio_for_each_segment_all(bvec, bio, i) { - page_bytes_left = bvec->bv_len; + bio_for_each_segment(bvec, bio, iter) { + page_bytes_left = bvec.bv_len; if (count) goto next; if (!dio) - offset = page_offset(bvec->bv_page) + bvec->bv_offset; + offset = page_offset(bvec.bv_page) + bvec.bv_offset; count = btrfs_find_ordered_sum(inode, offset, disk_bytenr, (u32 *)csum, nblocks); if (count) @@ -440,15 +439,15 @@ int btrfs_csum_one_bio(struct inode *inode, struct bio *bio, struct btrfs_ordered_sum *sums; struct btrfs_ordered_extent *ordered = NULL; char *data; - struct bio_vec *bvec; + struct bvec_iter iter; + struct bio_vec bvec; int index; int nr_sectors; - int i, j; unsigned long total_bytes = 0; unsigned long this_sum_bytes = 0; + int i; u64 offset; - WARN_ON(bio->bi_vcnt <= 0); sums = kzalloc(btrfs_ordered_sum_size(fs_info, bio->bi_iter.bi_size), GFP_NOFS); if (!sums) @@ -465,19 +464,19 @@ int btrfs_csum_one_bio(struct inode *inode, struct bio *bio, sums->bytenr = (u64)bio->bi_iter.bi_sector << 9; index = 0; - bio_for_each_segment_all(bvec, bio, j) { + bio_for_each_segment(bvec, bio, iter) { if (!contig) - offset = page_offset(bvec->bv_page) + bvec->bv_offset; + offset = page_offset(bvec.bv_page) + bvec.bv_offset; if (!ordered) {
Re: [PATCH 3/3] btrfs: sink gfp parameter to btrfs_io_bio_alloc
On Mon, Jun 12, 2017 at 05:29:41PM +0200, David Sterba wrote: > We can hardcode GFP_NOFS to btrfs_io_bio_alloc, although it means we > change it back from GFP_KERNEL in scrub. I'd rather save a few stack > bytes from not passing the gfp flags in the remaining, more imporatant, > contexts and the bio allocating API now looks more consistent. > Reviewed-by: Liu Bo-liubo > Signed-off-by: David Sterba > --- > fs/btrfs/check-integrity.c | 2 +- > fs/btrfs/disk-io.c | 2 +- > fs/btrfs/extent_io.c | 8 > fs/btrfs/extent_io.h | 2 +- > fs/btrfs/raid56.c | 2 +- > fs/btrfs/scrub.c | 16 +++- > 6 files changed, 15 insertions(+), 17 deletions(-) > > diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c > index 160879c802d0..e3b1d08dd03c 100644 > --- a/fs/btrfs/check-integrity.c > +++ b/fs/btrfs/check-integrity.c > @@ -1638,7 +1638,7 @@ static int btrfsic_read_block(struct btrfsic_state > *state, > struct bio *bio; > unsigned int j; > > - bio = btrfs_io_bio_alloc(GFP_NOFS, num_pages - i); > + bio = btrfs_io_bio_alloc(num_pages - i); > bio->bi_bdev = block_ctx->dev->bdev; > bio->bi_iter.bi_sector = dev_bytenr >> 9; > bio_set_op_attrs(bio, REQ_OP_READ, 0); > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index 9f2ffe2c6afb..8b57c280e5cd 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -3532,7 +3532,7 @@ static int write_dev_flush(struct btrfs_device *device, > int wait) >* caller >*/ > device->flush_bio = NULL; > - bio = btrfs_io_bio_alloc(GFP_NOFS, 0); > + bio = btrfs_io_bio_alloc(0); > bio->bi_end_io = btrfs_end_empty_barrier; > bio->bi_bdev = device->bdev; > bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH; > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index cbd0a9a1daa5..29a6111a68d2 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -1987,7 +1987,7 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, > u64 ino, u64 start, > ASSERT(!(fs_info->sb->s_flags & MS_RDONLY)); > BUG_ON(!mirror_num); > > - bio = btrfs_io_bio_alloc(GFP_NOFS, 1); > + bio = btrfs_io_bio_alloc(1); > bio->bi_iter.bi_size = 0; > map_length = length; > > @@ -2331,7 +2331,7 @@ struct bio *btrfs_create_repair_bio(struct inode > *inode, struct bio *failed_bio, > struct btrfs_io_bio *btrfs_failed_bio; > struct btrfs_io_bio *btrfs_bio; > > - bio = btrfs_io_bio_alloc(GFP_NOFS, 1); > + bio = btrfs_io_bio_alloc(1); > bio->bi_end_io = endio_func; > bio->bi_iter.bi_sector = failrec->logical >> 9; > bio->bi_bdev = fs_info->fs_devices->latest_bdev; > @@ -2692,12 +2692,12 @@ struct bio *btrfs_bio_clone(struct bio *bio) > return new; > } > > -struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs) > +struct bio *btrfs_io_bio_alloc(unsigned int nr_iovecs) > { > struct bio *bio; > > /* Bio allocation backed by a bioset does not fail */ > - bio = bio_alloc_bioset(gfp_mask, nr_iovecs, btrfs_bioset); > + bio = bio_alloc_bioset(GFP_NOFS, nr_iovecs, btrfs_bioset); > btrfs_io_bio_init(btrfs_io_bio(bio)); > return bio; > } > diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h > index 8071e3977614..1e508a8f876e 100644 > --- a/fs/btrfs/extent_io.h > +++ b/fs/btrfs/extent_io.h > @@ -463,7 +463,7 @@ void extent_clear_unlock_delalloc(struct inode *inode, > u64 start, u64 end, >unsigned bits_to_clear, >unsigned long page_ops); > struct bio *btrfs_bio_alloc(struct block_device *bdev, u64 first_byte); > -struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs); > +struct bio *btrfs_io_bio_alloc(unsigned int nr_iovecs); > struct bio *btrfs_bio_clone(struct bio *bio); > struct bio *btrfs_bio_clone_partial(struct bio *orig, int offset, int size); > > diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c > index 7dd55448ac68..b9abb0b01021 100644 > --- a/fs/btrfs/raid56.c > +++ b/fs/btrfs/raid56.c > @@ -1098,7 +1098,7 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio, > } > > /* put a new bio on the list */ > - bio = btrfs_io_bio_alloc(GFP_NOFS, bio_max_len >> PAGE_SHIFT?:1); > + bio = btrfs_io_bio_alloc(bio_max_len >> PAGE_SHIFT ?: 1); > bio->bi_iter.bi_size = 0; > bio->bi_bdev = stripe->dev->bdev; > bio->bi_iter.bi_sector = disk_start >> 9; > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c > index 1e2dfea00b2f..58a249cd5adc 100644 > --- a/fs/btrfs/scrub.c > +++ b/fs/btrfs/scrub.c > @@ -1737,7 +1737,7 @@ static void scrub_recheck_block(struct btrfs_fs_info > *fs_info, > } > > WARN_ON(!page->page); > - bio = btrfs_io_bio_alloc(GFP_NOFS,
Re: [PATCH 3/7] Btrfs: update total_bytes_pinned when pinning down extents
On Tue, Jun 06, 2017 at 04:45:28PM -0700, Omar Sandoval wrote: > From: Omar Sandoval> > The extents marked in pin_down_extent() will be unpinned later in > unpin_extent_range(), which decrements total_bytes_pinned. > pin_down_extent() must increment the counter to avoid underflowing it. > Also adjust btrfs_free_tree_block() to avoid accounting for the same > extent twice. > Reviewed-by: Liu Bo -liubo > Signed-off-by: Omar Sandoval > --- > fs/btrfs/extent-tree.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 6032e9a635f2..9cff937415cb 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -6343,6 +6343,7 @@ static int pin_down_extent(struct btrfs_fs_info > *fs_info, > > trace_btrfs_space_reservation(fs_info, "pinned", > cache->space_info->flags, num_bytes, 1); > + percpu_counter_add(>space_info->total_bytes_pinned, num_bytes); > set_extent_dirty(fs_info->pinned_extents, bytenr, >bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL); > return 0; > @@ -7189,6 +7190,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle > *trans, > goto out; > } > > + pin = 0; > cache = btrfs_lookup_block_group(fs_info, buf->start); > > if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) { > @@ -7204,7 +7206,6 @@ void btrfs_free_tree_block(struct btrfs_trans_handle > *trans, > btrfs_free_reserved_bytes(cache, buf->len, 0); > btrfs_put_block_group(cache); > trace_btrfs_reserved_extent_free(fs_info, buf->start, buf->len); > - pin = 0; > } > out: > if (pin) > -- > 2.13.0 > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/7] Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64
On Tue, Jun 06, 2017 at 04:45:26PM -0700, Omar Sandoval wrote: > From: Omar Sandoval> > There are a few places where we pass in a negative num_bytes, so make it > signed for clarity. Also move it up in the file since later patches will > need it there. > Reviewed-by: Liu Bo -liubo > Signed-off-by: Omar Sandoval > --- > fs/btrfs/extent-tree.c | 41 - > 1 file changed, 20 insertions(+), 21 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index e390451c72e6..7c01b4e9e3b6 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -766,6 +766,26 @@ static struct btrfs_space_info *__find_space_info(struct > btrfs_fs_info *info, > return NULL; > } > > +static void add_pinned_bytes(struct btrfs_fs_info *fs_info, s64 num_bytes, > + u64 owner, u64 root_objectid) > +{ > + struct btrfs_space_info *space_info; > + u64 flags; > + > + if (owner < BTRFS_FIRST_FREE_OBJECTID) { > + if (root_objectid == BTRFS_CHUNK_TREE_OBJECTID) > + flags = BTRFS_BLOCK_GROUP_SYSTEM; > + else > + flags = BTRFS_BLOCK_GROUP_METADATA; > + } else { > + flags = BTRFS_BLOCK_GROUP_DATA; > + } > + > + space_info = __find_space_info(fs_info, flags); > + BUG_ON(!space_info); /* Logic bug */ > + percpu_counter_add(_info->total_bytes_pinned, num_bytes); > +} > + > /* > * after adding space to the filesystem, we need to clear the full flags > * on all the space infos. > @@ -6793,27 +6813,6 @@ int btrfs_finish_extent_commit(struct > btrfs_trans_handle *trans, > return 0; > } > > -static void add_pinned_bytes(struct btrfs_fs_info *fs_info, u64 num_bytes, > - u64 owner, u64 root_objectid) > -{ > - struct btrfs_space_info *space_info; > - u64 flags; > - > - if (owner < BTRFS_FIRST_FREE_OBJECTID) { > - if (root_objectid == BTRFS_CHUNK_TREE_OBJECTID) > - flags = BTRFS_BLOCK_GROUP_SYSTEM; > - else > - flags = BTRFS_BLOCK_GROUP_METADATA; > - } else { > - flags = BTRFS_BLOCK_GROUP_DATA; > - } > - > - space_info = __find_space_info(fs_info, flags); > - BUG_ON(!space_info); /* Logic bug */ > - percpu_counter_add(_info->total_bytes_pinned, num_bytes); > -} > - > - > static int __btrfs_free_extent(struct btrfs_trans_handle *trans, > struct btrfs_fs_info *info, > struct btrfs_delayed_ref_node *node, u64 parent, > -- > 2.13.0 > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] btrfs: add helper to initialize the non-bio part of btrfs_io_bio
On Mon, Jun 12, 2017 at 05:29:39PM +0200, David Sterba wrote: > We use btrfs_bioset for bios and ask to allocate the entire size of > btrfs_io_bio from btrfs bio_alloc_bioset. The member 'bio' is > initialized but the bytes from 0 to offset of 'bio' are left > uninitialized. Although we initialize some of the members in our > helpers, we should initialize the whole structures. > Reviewed-by: Liu Bo-liubo > Signed-off-by: David Sterba > --- > fs/btrfs/extent_io.c | 30 ++ > 1 file changed, 14 insertions(+), 16 deletions(-) > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index 5037fd918f43..cbd0a9a1daa5 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -2654,22 +2654,28 @@ static void end_bio_extent_readpage(struct bio *bio) > } > > /* > + * Initialize the members up to but not including 'bio'. Use after > allocating a > + * new bio by bio_alloc_bioset as it does not initialize the bytes outside of > + * 'bio' because use of __GFP_ZERO is not supported. > + */ > +static inline void btrfs_io_bio_init(struct btrfs_io_bio *btrfs_bio) > +{ > + memset(btrfs_bio, 0, offsetof(struct btrfs_io_bio, bio)); > +} > + > +/* > * The following helpers allocate a bio. As it's backed by a bioset, it'll > * never fail. We're returning a bio right now but you can call btrfs_io_bio > * for the appropriate container_of magic > */ > struct bio *btrfs_bio_alloc(struct block_device *bdev, u64 first_byte) > { > - struct btrfs_io_bio *btrfs_bio; > struct bio *bio; > > bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_PAGES, btrfs_bioset); > bio->bi_bdev = bdev; > bio->bi_iter.bi_sector = first_byte >> 9; > - btrfs_bio = btrfs_io_bio(bio); > - btrfs_bio->csum = NULL; > - btrfs_bio->csum_allocated = NULL; > - btrfs_bio->end_io = NULL; > + btrfs_io_bio_init(btrfs_io_bio(bio)); > return bio; > } > > @@ -2681,24 +2687,18 @@ struct bio *btrfs_bio_clone(struct bio *bio) > /* Bio allocation backed by a bioset does not fail */ > new = bio_clone_fast(bio, GFP_NOFS, btrfs_bioset); > btrfs_bio = btrfs_io_bio(new); > - btrfs_bio->csum = NULL; > - btrfs_bio->csum_allocated = NULL; > - btrfs_bio->end_io = NULL; > + btrfs_io_bio_init(btrfs_bio); > btrfs_bio->iter = bio->bi_iter; > return new; > } > > struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs) > { > - struct btrfs_io_bio *btrfs_bio; > struct bio *bio; > > /* Bio allocation backed by a bioset does not fail */ > bio = bio_alloc_bioset(gfp_mask, nr_iovecs, btrfs_bioset); > - btrfs_bio = btrfs_io_bio(bio); > - btrfs_bio->csum = NULL; > - btrfs_bio->csum_allocated = NULL; > - btrfs_bio->end_io = NULL; > + btrfs_io_bio_init(btrfs_io_bio(bio)); > return bio; > } > > @@ -2712,9 +2712,7 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, > int offset, int size) > ASSERT(bio); > > btrfs_bio = btrfs_io_bio(bio); > - btrfs_bio->csum = NULL; > - btrfs_bio->csum_allocated = NULL; > - btrfs_bio->end_io = NULL; > + btrfs_io_bio_init(btrfs_bio); > > bio_trim(bio, offset >> 9, size >> 9); > btrfs_bio->iter = bio->bi_iter; > -- > 2.13.0 > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] More bio fixes and cleanups
Main part is the initialization of btrfs_io_bio, the rest is doc & cleanup. David Sterba (3): btrfs: document mandatory order of bio in btrfs_io_bio btrfs: add helper to initialize the non-bio part of btrfs_io_bio btrfs: sink gfp parameter to btrfs_io_bio_alloc fs/btrfs/check-integrity.c | 2 +- fs/btrfs/disk-io.c | 2 +- fs/btrfs/extent_io.c | 38 ++ fs/btrfs/extent_io.h | 2 +- fs/btrfs/raid56.c | 2 +- fs/btrfs/scrub.c | 16 +++- fs/btrfs/volumes.h | 4 7 files changed, 33 insertions(+), 33 deletions(-) -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] btrfs: sink gfp parameter to btrfs_io_bio_alloc
We can hardcode GFP_NOFS to btrfs_io_bio_alloc, although it means we change it back from GFP_KERNEL in scrub. I'd rather save a few stack bytes from not passing the gfp flags in the remaining, more imporatant, contexts and the bio allocating API now looks more consistent. Signed-off-by: David Sterba--- fs/btrfs/check-integrity.c | 2 +- fs/btrfs/disk-io.c | 2 +- fs/btrfs/extent_io.c | 8 fs/btrfs/extent_io.h | 2 +- fs/btrfs/raid56.c | 2 +- fs/btrfs/scrub.c | 16 +++- 6 files changed, 15 insertions(+), 17 deletions(-) diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c index 160879c802d0..e3b1d08dd03c 100644 --- a/fs/btrfs/check-integrity.c +++ b/fs/btrfs/check-integrity.c @@ -1638,7 +1638,7 @@ static int btrfsic_read_block(struct btrfsic_state *state, struct bio *bio; unsigned int j; - bio = btrfs_io_bio_alloc(GFP_NOFS, num_pages - i); + bio = btrfs_io_bio_alloc(num_pages - i); bio->bi_bdev = block_ctx->dev->bdev; bio->bi_iter.bi_sector = dev_bytenr >> 9; bio_set_op_attrs(bio, REQ_OP_READ, 0); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 9f2ffe2c6afb..8b57c280e5cd 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3532,7 +3532,7 @@ static int write_dev_flush(struct btrfs_device *device, int wait) * caller */ device->flush_bio = NULL; - bio = btrfs_io_bio_alloc(GFP_NOFS, 0); + bio = btrfs_io_bio_alloc(0); bio->bi_end_io = btrfs_end_empty_barrier; bio->bi_bdev = device->bdev; bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH; diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index cbd0a9a1daa5..29a6111a68d2 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1987,7 +1987,7 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start, ASSERT(!(fs_info->sb->s_flags & MS_RDONLY)); BUG_ON(!mirror_num); - bio = btrfs_io_bio_alloc(GFP_NOFS, 1); + bio = btrfs_io_bio_alloc(1); bio->bi_iter.bi_size = 0; map_length = length; @@ -2331,7 +2331,7 @@ struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio, struct btrfs_io_bio *btrfs_failed_bio; struct btrfs_io_bio *btrfs_bio; - bio = btrfs_io_bio_alloc(GFP_NOFS, 1); + bio = btrfs_io_bio_alloc(1); bio->bi_end_io = endio_func; bio->bi_iter.bi_sector = failrec->logical >> 9; bio->bi_bdev = fs_info->fs_devices->latest_bdev; @@ -2692,12 +2692,12 @@ struct bio *btrfs_bio_clone(struct bio *bio) return new; } -struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs) +struct bio *btrfs_io_bio_alloc(unsigned int nr_iovecs) { struct bio *bio; /* Bio allocation backed by a bioset does not fail */ - bio = bio_alloc_bioset(gfp_mask, nr_iovecs, btrfs_bioset); + bio = bio_alloc_bioset(GFP_NOFS, nr_iovecs, btrfs_bioset); btrfs_io_bio_init(btrfs_io_bio(bio)); return bio; } diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index 8071e3977614..1e508a8f876e 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -463,7 +463,7 @@ void extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end, unsigned bits_to_clear, unsigned long page_ops); struct bio *btrfs_bio_alloc(struct block_device *bdev, u64 first_byte); -struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs); +struct bio *btrfs_io_bio_alloc(unsigned int nr_iovecs); struct bio *btrfs_bio_clone(struct bio *bio); struct bio *btrfs_bio_clone_partial(struct bio *orig, int offset, int size); diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 7dd55448ac68..b9abb0b01021 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -1098,7 +1098,7 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio, } /* put a new bio on the list */ - bio = btrfs_io_bio_alloc(GFP_NOFS, bio_max_len >> PAGE_SHIFT?:1); + bio = btrfs_io_bio_alloc(bio_max_len >> PAGE_SHIFT ?: 1); bio->bi_iter.bi_size = 0; bio->bi_bdev = stripe->dev->bdev; bio->bi_iter.bi_sector = disk_start >> 9; diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 1e2dfea00b2f..58a249cd5adc 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -1737,7 +1737,7 @@ static void scrub_recheck_block(struct btrfs_fs_info *fs_info, } WARN_ON(!page->page); - bio = btrfs_io_bio_alloc(GFP_NOFS, 1); + bio = btrfs_io_bio_alloc(1); bio->bi_bdev = page->dev->bdev; bio_add_page(bio, page->page, PAGE_SIZE, 0); @@ -1825,7 +1825,7 @@ static int scrub_repair_page_from_good_copy(struct scrub_block *sblock_bad,
[PATCH 2/3] btrfs: add helper to initialize the non-bio part of btrfs_io_bio
We use btrfs_bioset for bios and ask to allocate the entire size of btrfs_io_bio from btrfs bio_alloc_bioset. The member 'bio' is initialized but the bytes from 0 to offset of 'bio' are left uninitialized. Although we initialize some of the members in our helpers, we should initialize the whole structures. Signed-off-by: David Sterba--- fs/btrfs/extent_io.c | 30 ++ 1 file changed, 14 insertions(+), 16 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 5037fd918f43..cbd0a9a1daa5 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2654,22 +2654,28 @@ static void end_bio_extent_readpage(struct bio *bio) } /* + * Initialize the members up to but not including 'bio'. Use after allocating a + * new bio by bio_alloc_bioset as it does not initialize the bytes outside of + * 'bio' because use of __GFP_ZERO is not supported. + */ +static inline void btrfs_io_bio_init(struct btrfs_io_bio *btrfs_bio) +{ + memset(btrfs_bio, 0, offsetof(struct btrfs_io_bio, bio)); +} + +/* * The following helpers allocate a bio. As it's backed by a bioset, it'll * never fail. We're returning a bio right now but you can call btrfs_io_bio * for the appropriate container_of magic */ struct bio *btrfs_bio_alloc(struct block_device *bdev, u64 first_byte) { - struct btrfs_io_bio *btrfs_bio; struct bio *bio; bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_PAGES, btrfs_bioset); bio->bi_bdev = bdev; bio->bi_iter.bi_sector = first_byte >> 9; - btrfs_bio = btrfs_io_bio(bio); - btrfs_bio->csum = NULL; - btrfs_bio->csum_allocated = NULL; - btrfs_bio->end_io = NULL; + btrfs_io_bio_init(btrfs_io_bio(bio)); return bio; } @@ -2681,24 +2687,18 @@ struct bio *btrfs_bio_clone(struct bio *bio) /* Bio allocation backed by a bioset does not fail */ new = bio_clone_fast(bio, GFP_NOFS, btrfs_bioset); btrfs_bio = btrfs_io_bio(new); - btrfs_bio->csum = NULL; - btrfs_bio->csum_allocated = NULL; - btrfs_bio->end_io = NULL; + btrfs_io_bio_init(btrfs_bio); btrfs_bio->iter = bio->bi_iter; return new; } struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs) { - struct btrfs_io_bio *btrfs_bio; struct bio *bio; /* Bio allocation backed by a bioset does not fail */ bio = bio_alloc_bioset(gfp_mask, nr_iovecs, btrfs_bioset); - btrfs_bio = btrfs_io_bio(bio); - btrfs_bio->csum = NULL; - btrfs_bio->csum_allocated = NULL; - btrfs_bio->end_io = NULL; + btrfs_io_bio_init(btrfs_io_bio(bio)); return bio; } @@ -2712,9 +2712,7 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, int offset, int size) ASSERT(bio); btrfs_bio = btrfs_io_bio(bio); - btrfs_bio->csum = NULL; - btrfs_bio->csum_allocated = NULL; - btrfs_bio->end_io = NULL; + btrfs_io_bio_init(btrfs_bio); bio_trim(bio, offset >> 9, size >> 9); btrfs_bio->iter = bio->bi_iter; -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] btrfs: document mandatory order of bio in btrfs_io_bio
Signed-off-by: David Sterba--- fs/btrfs/volumes.h | 4 1 file changed, 4 insertions(+) diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 58b97b6f5f02..35327efecdbb 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -281,6 +281,10 @@ struct btrfs_io_bio { u8 *csum_allocated; btrfs_io_bio_end_io_t *end_io; struct bvec_iter iter; + /* +* This member must come last, bio_alloc_bioset will allocate enough +* bytes for entire btrfs_io_bio but relies on bio being last. +*/ struct bio bio; }; -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Btrfs: btrfs_ioctl_search_key documentation
On Mon, Jun 12, 2017 at 06:03:15PM +0200, Hans van Kranenburg wrote: > >> Most interesting changes since v1: > >> - mention the special tree_id input value 0 > >> - rewrite the part about min_key and max_key, trying to be more concise > > > > I find the description instructive enough so the expanded expression to > > describe the whole range is not IMHO needed. > > You mean drop the extra line "All metadata..." ? Yeah, it's a bit > redudant, stressing the fact, yes. Ah, sorry I was not clear. I was referring to Goffredo's proposal with the expression how the min_key and max_key are calculated. Your text in v2 is fine. We know how to calculate one key and we know the where are the limits. > The main purpose is to stop users from thinking that setting min_type > and max_type will filter the returned objects (like, only getting > BLOCK_GROUP_ITEM_KEY or so). So as long as you think that's clear > enough, I'm ok with anything. The text looks good to me and I've added the patch to the queue. Further refinements are welcome. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] btrfs: flush error also accounts for its send error
On Fri, Jun 09, 2017 at 02:52:29PM +0800, Anand Jain wrote: > So remove the static check of send error Please improve the changelog text. > Signed-off-by: Anand Jain> --- > fs/btrfs/disk-io.c | 19 +-- > 1 file changed, 9 insertions(+), 10 deletions(-) > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index d8151839bb3d..682c494dbc7f 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -3534,14 +3534,11 @@ static int wait_dev_flush(struct btrfs_device *device) > > static int check_barrier_error(struct btrfs_fs_devices *fsdevs) > { > - int submit_flush_error = 0; > int dev_flush_error = 0; > struct btrfs_device *dev; > - int tolerance; > > list_for_each_entry_rcu(dev, >devices, dev_list) { > if (!dev->bdev) { > - submit_flush_error++; > dev_flush_error++; > continue; > } > @@ -3549,8 +3546,8 @@ static int check_barrier_error(struct btrfs_fs_devices > *fsdevs) > dev_flush_error++; > } > > - tolerance = fsdevs->fs_info->num_tolerated_disk_barrier_failures; > - if (submit_flush_error > tolerance || dev_flush_error > tolerance) > + if (dev_flush_error > > + fsdevs->fs_info->num_tolerated_disk_barrier_failures) > return -EIO; > > return 0; > @@ -3564,7 +3561,6 @@ static int barrier_all_devices(struct btrfs_fs_info > *info) > { > struct list_head *head; > struct btrfs_device *dev; > - int errors_send = 0; > int errors_wait = 0; > int ret; > > @@ -3573,10 +3569,9 @@ static int barrier_all_devices(struct btrfs_fs_info > *info) > list_for_each_entry_rcu(dev, head, dev_list) { > if (dev->missing) > continue; > - if (!dev->bdev) { > - errors_send++; > + if (!dev->bdev) > continue; > - } > + > if (!dev->in_fs_metadata || !dev->writeable) > continue; > > @@ -3602,7 +3597,11 @@ static int barrier_all_devices(struct btrfs_fs_info > *info) > } > } > > - if (errors_send || errors_wait) { > + /* > + * By checking errors_wait here it avoids traverse through > + * the device loop for normal healthy case > + */ I think the code makes it obvious that we skip the barrier check, the comment does not bring any new information, please drop it. > + if (errors_wait) { > /* >* At some point we need the status of all disks >* to arrive at the volume status. So error checking -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] btrfs: write_dev_flush does not return ENOMEM anymore
On Fri, Jun 09, 2017 at 02:52:28PM +0800, Anand Jain wrote: > Commit > 9035b5dbc576 btrfs: btrfs_io_bio_alloc never fails, skip error handling > > removed the -ENOMEM return from write_dev_flush() so no need to > check for the -ENOMEM during send. > > This patch also peals write_dev_flush's wait part of the code, > and creates a new function wait_dev_flush(). "This patch also" usually means that the patch should be split. > /* > - * trigger flushes for one the devices. If you pass wait == 0, the flushes > are > - * sent down. With wait == 1, it waits for the previous flush. > - * > - * any device where the flush fails with eopnotsupp are flagged as > not-barrier > - * capable > + * trigger flushes for one the devices. In the patch that splits write_dev_flush, please use the following comment wording: * Submit a flush request to the device if it supports it. Error handling is * done in the waiting counterpart. > */ > -static int write_dev_flush(struct btrfs_device *device, int wait) > +static void write_dev_flush(struct btrfs_device *device) and /* * If the flush bio has been submitted by write_dev_flush, wait for it. */ > +static int wait_dev_flush(struct btrfs_device *device) > +{ But otherwise the cleanup is good. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Btrfs: btrfs_ioctl_search_key documentation
On Tue, Jun 06, 2017 at 12:20:32AM +0200, Hans van Kranenburg wrote: > A programmer who is trying to implement calling the btrfs SEARCH > or SEARCH_V2 ioctl will probably soon end up reading this struct > definition. > > Properly document the input fields to prevent common misconceptions: > 1. The search space is linear, not 3 dimensional. The invidual min/max > values for objectid, type and offset cannot be used to filter the > result, they only define the endpoints of an interval. > 2. The transaction id (a.k.a. generation) filter applies only on > transaction id of the last COW operation on a whole metadata page, not > on individual items. > > Ad 1. The first misunderstanding was helped by the previous misleading > comments on min/max type and offset: > "keys returned will be >= min and <= max". > > Ad 2. For example, running btrfs balance will happily cause rewriting of > metadata pages that contain a filesystem tree of a read only subvolume, > causing transids to be increased. > > Also, improve descriptions of tree_id and nr_items and add in/out > annotations. > > Signed-off-by: Hans van KranenburgLooks good to me, thanks. I've realigned the comments so they don't overflow 80 columns and aligned the /* in */ hints. > Most interesting changes since v1: > - mention the special tree_id input value 0 > - rewrite the part about min_key and max_key, trying to be more concise I find the description instructive enough so the expanded expression to describe the whole range is not IMHO needed. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Btrfs: btrfs_ioctl_search_key documentation
On 06/12/2017 05:38 PM, David Sterba wrote: > On Tue, Jun 06, 2017 at 12:20:32AM +0200, Hans van Kranenburg wrote: >> A programmer who is trying to implement calling the btrfs SEARCH >> or SEARCH_V2 ioctl will probably soon end up reading this struct >> definition. >> >> Properly document the input fields to prevent common misconceptions: >> 1. The search space is linear, not 3 dimensional. The invidual min/max >> values for objectid, type and offset cannot be used to filter the >> result, they only define the endpoints of an interval. >> 2. The transaction id (a.k.a. generation) filter applies only on >> transaction id of the last COW operation on a whole metadata page, not >> on individual items. >> >> Ad 1. The first misunderstanding was helped by the previous misleading >> comments on min/max type and offset: >> "keys returned will be >= min and <= max". >> >> Ad 2. For example, running btrfs balance will happily cause rewriting of >> metadata pages that contain a filesystem tree of a read only subvolume, >> causing transids to be increased. >> >> Also, improve descriptions of tree_id and nr_items and add in/out >> annotations. >> >> Signed-off-by: Hans van Kranenburg> > Looks good to me, thanks. I've realigned the comments so they don't > overflow 80 columns and aligned the /* in */ hints. Ah, I see, thanks. My vim takes 4 chars for a tab, that's the problem. I'll get the vim C settings in order for the next time. :-) >> Most interesting changes since v1: >> - mention the special tree_id input value 0 >> - rewrite the part about min_key and max_key, trying to be more concise > > I find the description instructive enough so the expanded expression to > describe the whole range is not IMHO needed. You mean drop the extra line "All metadata..." ? Yeah, it's a bit redudant, stressing the fact, yes. The main purpose is to stop users from thinking that setting min_type and max_type will filter the returned objects (like, only getting BLOCK_GROUP_ITEM_KEY or so). So as long as you think that's clear enough, I'm ok with anything. -- Hans van Kranenburg -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: Convert btrfs_fs_info's free_chunk_space to an atomic64_t
On Fri, Jun 09, 2017 at 09:44:31AM +0300, Nikolay Borisov wrote: > > > On 7.06.2017 21:09, Sargun Dhillon wrote: > > This patch is a small performance optimization to get rid of a spin > > lock, where instead an atomic64_t can be used. > > > > Signed-off-by: Sargun Dhillon> > I've already sent similar patch 1 month ago: > > https://patchwork.kernel.org/patch/9720999/ Another hint is to look to the linux-next or to my kernel.org for-next branch. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: fix Null pointer dereference in dio read endio
On Tue, Jun 06, 2017 at 01:52:52PM -0600, Liu Bo wrote: > With switching to use btrfs_bio_clone_partial() to split bio in > directIO path, read endio is also adapted to that by recording a > iterator in btrfs_bio, however, it breaks those bios which are less > than stripe length thus no need to be split and results in NULL > pointer dereference. > > This fixes the issue by recording the required bio iterator in > btrfs_bio_clone() which is used to clone non-split bio in directIO > path. It doesn't affect other calls of btrfs_bio_clone() because they > don't need to use this iterator. > > This bug was caught by fstests/generic/091. > > Cc: David Sterba> Signed-off-by: Liu Bo > --- > Based on David's for-next. > Fixes: commit "Btrfs: change how we iterate bios in endio" > > Have run through fstests without introducing new problems. > > fs/btrfs/extent_io.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index 806e8d6..a91c3a1 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -2719,6 +2719,7 @@ struct bio *btrfs_bio_clone(struct bio *bio, gfp_t > gfp_mask) > btrfs_bio->csum = NULL; > btrfs_bio->csum_allocated = NULL; > btrfs_bio->end_io = NULL; > + btrfs_bio->iter = bio->bi_iter; Actually this points to a problem with initialization of the btrfs_io_bio in general. They have embedded struct bio and the bioset allocation will increase the size to cover btrfs_io_bio (although only bio is handled). But, there's no kzalloc or memset anywhere, so it's completely up to the caller to sanitize the fresh structure. Which we only do partially (csum, csum_allocated, end_io). In that case 'iter' could contain garbage and who knows what could happen. So my suggestion: fold this patch to the one you refrence (ie. that the behaviour is same before and after), and we'll fix the initialization of btrfs_io_bio after the various bio clone/alloc calls. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: fix Null pointer dereference in dio read endio
On Tue, Jun 06, 2017 at 01:52:52PM -0600, Liu Bo wrote: > With switching to use btrfs_bio_clone_partial() to split bio in > directIO path, read endio is also adapted to that by recording a > iterator in btrfs_bio, however, it breaks those bios which are less > than stripe length thus no need to be split and results in NULL > pointer dereference. > > This fixes the issue by recording the required bio iterator in > btrfs_bio_clone() which is used to clone non-split bio in directIO > path. It doesn't affect other calls of btrfs_bio_clone() because they > don't need to use this iterator. > > This bug was caught by fstests/generic/091. > > Cc: David Sterba> Signed-off-by: Liu Bo > --- > Based on David's for-next. > Fixes: commit "Btrfs: change how we iterate bios in endio" I'd rather fold this change to the original patch than to have a separate fixup. The changelog can be updated with description of the special case. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] btrfs-progs: send operates on ro snapshots only
On Tue, Jun 06, 2017 at 03:00:58PM +0200, Hans van Kranenburg wrote: > While talking to another btrfs user on IRC today, it became clear that a > major point of confusion in the btrfs send manual is that it's not > telling the user soon enough that send/receive solely operates on > subvolume snapshots instead of the original (read/write) subvolumes. > > So, change the first few lines to explicitly mention snapshots instead. > Technically, snapshots are also just subvolumes, but requiring this > level of technical detailed knowledge doesn't help the user who is just > trying out things. Agreed. > Signed-off-by: Hans van Kranenburg> --- Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/7] Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64
On Tue, Jun 06, 2017 at 04:45:26PM -0700, Omar Sandoval wrote: > From: Omar Sandoval> > There are a few places where we pass in a negative num_bytes, so make it > signed for clarity. Also move it up in the file since later patches will > need it there. > > Signed-off-by: Omar Sandoval Reviewed-by: David Sterba -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/7] Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT()
On Tue, Jun 06, 2017 at 04:45:27PM -0700, Omar Sandoval wrote: > From: Omar Sandoval> > Signed-off-by: Omar Sandoval > --- > fs/btrfs/extent-tree.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 7c01b4e9e3b6..6032e9a635f2 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -782,7 +782,7 @@ static void add_pinned_bytes(struct btrfs_fs_info > *fs_info, s64 num_bytes, > } > > space_info = __find_space_info(fs_info, flags); > - BUG_ON(!space_info); /* Logic bug */ > + ASSERT(space_info); Why? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/8] btrfs: btrfs_bio_clone never fails, skip error handling
On Wed, Jun 07, 2017 at 11:19:37AM -0700, Omar Sandoval wrote: > On Fri, Jun 02, 2017 at 06:58:36PM +0200, David Sterba wrote: > > Update direct callers of btrfs_bio_clone that do error handling, that we > > can now remove. > > > > Signed-off-by: David Sterba> > --- > > fs/btrfs/inode.c | 4 > > fs/btrfs/volumes.c | 1 - > > 2 files changed, 5 deletions(-) > > > > > > if (dev_nr < total_devs - 1) { > > bio = btrfs_bio_clone(first_bio, GFP_NOFS); > > - BUG_ON(!bio); /* -ENOMEM */ > > } else > > bio = first_bio; > > Could you please get rid of the extra curly braces now, too? Sure, patch updated. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/8] btrfs: remove redundant parameters from btrfs_bio_alloc
On Wed, Jun 07, 2017 at 03:40:16PM +0800, Anand Jain wrote: > > > On 06/03/17 00:58, David Sterba wrote: > > All callers pass gfp_flags=GFP_NOFS and nr_vecs=BIO_MAX_PAGES. > > The line (in the other thread) mentioning the reason to remove >__GFP_HIGH can go into the commit log here. Makes sense, patch updated. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: use ino_t for inode in tracer
On Mon, Jun 12, 2017 at 06:10:28PM +0800, Anand Jain wrote: > Signed-off-by: Anand JainThere's one more in btrfs__qgroup_data_map, please fix it as well. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] btrfs: add compression trace points
Hi Anand, [auto build test WARNING on kdave/for-next] [also build test WARNING on next-20170609] [cannot apply to btrfs/next tip/perf/core v4.12-rc5] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Anand-Jain/btrfs-add-compression-trace-points/20170612-184615 base: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next config: s390-allmodconfig (attached as .config) compiler: s390x-linux-gnu-gcc (Debian 6.3.0-18) 6.3.0 20170516 reproduce: wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=s390 All warnings (new ones prefixed by >>): In file included from include/trace/define_trace.h:95:0, from include/trace/events/btrfs.h:1672, from fs/btrfs/super.c:65: include/trace/events/btrfs.h: In function 'trace_raw_output_btrfs_data_transformer': >> include/trace/events/btrfs.h:91:12: warning: format '%lu' expects argument >> of type 'long unsigned int', but argument 6 has type 'ino_t {aka unsigned >> int}' [-Wformat=] TP_printk("%pU: " fmt, __entry->fsid, args) ^ include/trace/trace_events.h:343:22: note: in definition of macro 'DECLARE_EVENT_CLASS' trace_seq_printf(s, print); \ ^ include/trace/trace_events.h:65:9: note: in expansion of macro 'PARAMS' PARAMS(print)); \ ^~ >> include/trace/events/btrfs.h:1630:1: note: in expansion of macro >> 'TRACE_EVENT' TRACE_EVENT(btrfs_data_transformer, ^~~ >> include/trace/events/btrfs.h:91:2: note: in expansion of macro 'TP_printk' TP_printk("%pU: " fmt, __entry->fsid, args) ^ >> include/trace/events/btrfs.h:1661:2: note: in expansion of macro >> 'TP_printk_btrfs' TP_printk_btrfs("%s %s ino=%lu type=%s len_before=%lu len_after=%lu " ^~~ vim +91 include/trace/events/btrfs.h 3f7de037 Josef Bacik 2011-11-10 75 8c2a3ca2 Josef Bacik 2012-01-10 76 #define BTRFS_UUID_SIZE 16 bc074524 Jeff Mahoney 2016-06-09 77 #define TP_STRUCT__entry_fsid __array(u8, fsid, BTRFS_UUID_SIZE) bc074524 Jeff Mahoney 2016-06-09 78 bc074524 Jeff Mahoney 2016-06-09 79 #define TP_fast_assign_fsid(fs_info) \ bc074524 Jeff Mahoney 2016-06-09 80memcpy(__entry->fsid, fs_info->fsid, BTRFS_UUID_SIZE) bc074524 Jeff Mahoney 2016-06-09 81 bc074524 Jeff Mahoney 2016-06-09 82 #define TP_STRUCT__entry_btrfs(args...) \ bc074524 Jeff Mahoney 2016-06-09 83TP_STRUCT__entry( \ bc074524 Jeff Mahoney 2016-06-09 84TP_STRUCT__entry_fsid \ bc074524 Jeff Mahoney 2016-06-09 85args) bc074524 Jeff Mahoney 2016-06-09 86 #define TP_fast_assign_btrfs(fs_info, args...)\ bc074524 Jeff Mahoney 2016-06-09 87TP_fast_assign( \ bc074524 Jeff Mahoney 2016-06-09 88TP_fast_assign_fsid(fs_info); \ bc074524 Jeff Mahoney 2016-06-09 89args) bc074524 Jeff Mahoney 2016-06-09 90 #define TP_printk_btrfs(fmt, args...) \ bc074524 Jeff Mahoney 2016-06-09 @91TP_printk("%pU: " fmt, __entry->fsid, args) 8c2a3ca2 Josef Bacik 2012-01-10 92 1abe9b8a liubo2011-03-24 93 TRACE_EVENT(btrfs_transaction_commit, 1abe9b8a liubo2011-03-24 94 1abe9b8a liubo2011-03-24 95TP_PROTO(struct btrfs_root *root), 1abe9b8a liubo2011-03-24 96 1abe9b8a liubo2011-03-24 97TP_ARGS(root), 1abe9b8a liubo2011-03-24 98 bc074524 Jeff Mahoney 2016-06-09 99TP_STRUCT__entry_btrfs( :: The code at line 91 was first introduced by commit :: bc074524e123ded281cde25ebc5661910f9679e3 btrfs: prefix fsid to all trace events :: TO: Jeff Mahoney <je...@suse.com> :: CC: David Sterba <dste...@suse.com> --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
Re: btrfs native encryption
On Mon, Jun 12, 2017 at 02:40:38PM +0200, David Sterba wrote: > On Fri, Jun 09, 2017 at 08:50:12AM -0700, Filip Bystricky wrote: > > Dear btrfs maintainers, > > Google is evaluating btrfs for its potential use in android, but > > currently the lack of native file-based encryption unfortunately makes > > it a nonstarter. > > The file-based encryption is covered by the fscrypt API, that's > implemented in ext4/f2fs, so implementing that in btrfs could allow you > to start evaluating btrfs. For reference I'll add it here, https://github.com/asj/linux-btrfs-fscryptv1 seems to implement it, I've only scrolled through it so I don't know how usable is it in this state. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 14/20] dax: set errors in mapping when writeback fails
On Mon, Jun 12, 2017 at 08:23:10AM -0400, Jeff Layton wrote: > For now, only do this when the FS_WB_ERRSEQ flag is set. The > AS_EIO/AS_ENOSPC flags are not currently cleared in the older code when > writeback initiation fails, only when we discover an error after waiting > on writeback to complete, so we only want to do this with errseq_t based > error handling to prevent seeing duplicate errors on fsync. Please make sure this doens't stay conditional by the end of the series. We only have three file systems using dax, and a series should be able to make them agree on a single interface. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 12/20] fs: add a new fstype flag to indicate how writeback errors are tracked
On Mon, Jun 12, 2017 at 08:23:06AM -0400, Jeff Layton wrote: > Add a new FS_WB_ERRSEQ flag to the fstype. Later patches will set and > key off of that to decide what behavior should be used. Please invert this so that only file systems that keep the old semantics need a flag. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[xfstests PATCH v4 3/5] generic: test writeback error handling on dmerror devices
Ensure that we get an error back on all fds when a block device is open by multiple writers and writeback fails. Signed-off-by: Jeff Layton--- tests/generic/998 | 64 +++ tests/generic/998.out | 2 ++ tests/generic/group | 1 + 3 files changed, 67 insertions(+) create mode 100755 tests/generic/998 create mode 100644 tests/generic/998.out diff --git a/tests/generic/998 b/tests/generic/998 new file mode 100755 index ..4e8379988252 --- /dev/null +++ b/tests/generic/998 @@ -0,0 +1,64 @@ +#! /bin/bash +# FS QA Test No. 998 +# +# Test writeback error handling when writing to block devices via pagecache. +# See src/fsync-err.c for details of what test actually does. +# +#--- +# Copyright (c) 2017, Jeff Layton +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#--- + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1# failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ +cd / +rm -rf $tmp.* $testdir +_dmerror_cleanup +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter +. ./common/dmerror + +# real QA test starts here +_supported_os Linux +_require_scratch +_require_logdev +_require_dm_target error +_require_test_program fsync-err +_require_test_program dmerror + +rm -f $seqres.full + +_dmerror_init + +$here/src/fsync-err -d $here/src/dmerror $DMERROR_DEV + +# success, all done +_dmerror_load_working_table +_dmerror_cleanup +_scratch_mkfs > $seqres.full 2>&1 +status=0 +exit diff --git a/tests/generic/998.out b/tests/generic/998.out new file mode 100644 index ..658c438820e2 --- /dev/null +++ b/tests/generic/998.out @@ -0,0 +1,2 @@ +QA output created by 998 +Test passed! diff --git a/tests/generic/group b/tests/generic/group index b56bae8f04f0..9c62ab13ad36 100644 --- a/tests/generic/group +++ b/tests/generic/group @@ -442,4 +442,5 @@ 437 auto quick 438 auto 439 auto quick punch +998 blockdev 999 auto quick -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 15/20] fs: have call_fsync call filemap_report_wb_err if FS_WB_ERRSEQ is set
On Mon, Jun 12, 2017 at 08:23:11AM -0400, Jeff Layton wrote: > Allow filesystems to opt-in to a final check of wb_err if FS_WB_ERRSEQ > is set. Technically, we could just plumb these calls into all of the > fsync operations, but I think this means less code, changes and churn. Please add it to every fs, that is a consistent with how we handle everything else related to writeback. Oh, and please kill this idiotic call_fsync helper while you're at it. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[xfstests PATCH v4 5/5] btrfs: make a btrfs version of writeback error reporting test
Make a new btrfs/999 test that works the way Chris Mason suggested: Build a filesystem with 2 devices that stripes the data across both devices, but mirrors metadata across both. Then, make one of the devices fail and see how fsync is handled. Signed-off-by: Jeff Layton--- tests/btrfs/999 | 93 +++ tests/btrfs/group | 1 + 2 files changed, 94 insertions(+) create mode 100755 tests/btrfs/999 diff --git a/tests/btrfs/999 b/tests/btrfs/999 new file mode 100755 index ..84031cc0d913 --- /dev/null +++ b/tests/btrfs/999 @@ -0,0 +1,93 @@ +#! /bin/bash +# FS QA Test No. 999 +# +# Open a file several times, write to it, fsync on all fds and make sure that +# they all return 0. Change the device to start throwing errors. Write again +# on all fds and fsync on all fds. Ensure that we get errors on all of them. +# Then fsync on all one last time and verify that all return 0. +# +#--- +# Copyright (c) 2017, Jeff Layton +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#--- + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1# failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ +cd / +rm -rf $tmp.* $testdir +_dmerror_cleanup +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter +. ./common/dmerror + +# real QA test starts here +_supported_os Linux +_require_dm_target error +_require_test_program fsync-err +_require_test_program dmerror + +# bring up dmerror device +_scratch_unmount +_dmerror_init + +# Replace first device with error-test device +old_SCRATCH_DEV=$SCRATCH_DEV +SCRATCH_DEV_POOL=`echo $SCRATCH_DEV_POOL | perl -pe "s#$SCRATCH_DEV#$DMERROR_DEV#"` +SCRATCH_DEV=$DMERROR_DEV + +_require_scratch +_require_scratch_dev_pool + +rm -f $seqres.full + +echo "Format and mount" + +_scratch_pool_mkfs "-d raid0 -m raid1" > $seqres.full 2>&1 +_scratch_mount + +# How much do we need to write? We need to hit all of the stripes. btrfs uses +# a fixed 64k stripesize, so write enough to hit each one +number_of_devices=`echo $SCRATCH_DEV_POOL | wc -w` +write_kb=$(($number_of_devices * 64)) +_require_fs_space $SCRATCH_MNT $write_kb + +testfile=$SCRATCH_MNT/fsync-err-test + +SCRATCH_DEV=$old_SCRATCH_DEV +$here/src/fsync-err -b $(($write_kb * 1024)) -d $here/src/dmerror $testfile + +# success, all done +_dmerror_load_working_table + +# fs may be corrupt after this -- attempt to repair it +_repair_scratch_fs >> $seqres.full + +# remove dmerror device +_dmerror_cleanup + +status=0 +exit diff --git a/tests/btrfs/group b/tests/btrfs/group index 6f19619e877c..8dbdfbfe29fd 100644 --- a/tests/btrfs/group +++ b/tests/btrfs/group @@ -145,3 +145,4 @@ 141 auto quick 142 auto quick 143 auto quick +999 auto quick -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[xfstests PATCH v4 1/5] generic: add a writeback error handling test
I'm working on a set of kernel patches to change how writeback errors are handled and reported in the kernel. Instead of reporting a writeback error to only the first fsync caller on the file, I aim to make the kernel report them once on every file description. This patch adds a test for the new behavior. Basically, open many fds to the same file, turn on dm_error, write to each of the fds, and then fsync them all to ensure that they all get an error back. To do that, I'm adding a new tools/dmerror script that the C program can use to load the error table. For now, that's all it can do, but we can fill it out with other commands as necessary. Signed-off-by: Jeff Layton--- .gitignore | 1 + common/dmerror | 13 ++- doc/auxiliary-programs.txt | 16 src/Makefile | 2 +- src/dmerror| 44 + src/fsync-err.c| 223 + tests/generic/999 | 77 tests/generic/999.out | 3 + tests/generic/group| 1 + 9 files changed, 374 insertions(+), 6 deletions(-) create mode 100755 src/dmerror create mode 100644 src/fsync-err.c create mode 100755 tests/generic/999 create mode 100644 tests/generic/999.out diff --git a/.gitignore b/.gitignore index 39664b0a7f53..56e863b2c8dc 100644 --- a/.gitignore +++ b/.gitignore @@ -72,6 +72,7 @@ /src/fs_perms /src/fssum /src/fstest +/src/fsync-err /src/fsync-tester /src/ftrunc /src/genhashnames diff --git a/common/dmerror b/common/dmerror index d46c5d0b7266..238baa213b1f 100644 --- a/common/dmerror +++ b/common/dmerror @@ -23,22 +23,25 @@ if [ $? -eq 0 ]; then _notrun "Cannot run tests with DAX on dmerror devices" fi -_dmerror_init() +_dmerror_setup() { local dm_backing_dev=$SCRATCH_DEV - $DMSETUP_PROG remove error-test > /dev/null 2>&1 - local blk_dev_size=`blockdev --getsz $dm_backing_dev` DMERROR_DEV='/dev/mapper/error-test' DMLINEAR_TABLE="0 $blk_dev_size linear $dm_backing_dev 0" + DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0" +} + +_dmerror_init() +{ + _dmerror_setup + $DMSETUP_PROG remove error-test > /dev/null 2>&1 $DMSETUP_PROG create error-test --table "$DMLINEAR_TABLE" || \ _fatal "failed to create dm linear device" - - DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0" } _dmerror_mount() diff --git a/doc/auxiliary-programs.txt b/doc/auxiliary-programs.txt index 21ef118596b6..bcab453c4335 100644 --- a/doc/auxiliary-programs.txt +++ b/doc/auxiliary-programs.txt @@ -16,6 +16,8 @@ note the dependency with: Contents: - af_unix -- Create an AF_UNIX socket + - dmerror -- fault injection block device control + - fsync-err -- tests fsync error reporting after failed writeback - open_by_handle -- open_by_handle_at syscall exercise - stat_test -- statx syscall exercise - t_dir_type -- print directory entries and their file type @@ -30,6 +32,20 @@ af_unix The af_unix program creates an AF_UNIX socket at the given location. +dmerror + + dmerror is a program for creating, destroying and controlling a + fault injection device. The device can be set up as initially + working and then flip to throwing errors for testing purposes. + +fsync-err + + Specialized program for testing how the kernel reports errors that + occur during writeback. Works in conjunction with the dmerror script + in tools/ to write data to a device, and then force it to fail + writeback and test that errors are reported during fsync and cleared + afterward. + open_by_handle The open_by_handle program exercises the open_by_handle_at() system diff --git a/src/Makefile b/src/Makefile index 6b0e4b022485..abd0fff34a64 100644 --- a/src/Makefile +++ b/src/Makefile @@ -13,7 +13,7 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \ multi_open_unlink dmiperf unwritten_sync genhashnames t_holes \ t_mmap_writev t_truncate_cmtime dirhash_collide t_rename_overwrite \ holetest t_truncate_self t_mmap_dio af_unix t_mmap_stale_pmd \ - t_mmap_cow_race t_mmap_fallocate + t_mmap_cow_race t_mmap_fallocate fsync-err LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \ preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \ diff --git a/src/dmerror b/src/dmerror new file mode 100755 index ..4aaf682ee5f9 --- /dev/null +++ b/src/dmerror @@ -0,0 +1,44 @@ +#!/bin/bash +#--- +# Copyright (c) 2017, Jeff Layton +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This
[xfstests PATCH v4 2/5] ext4: allow ext4 to use $SCRATCH_LOGDEV
The writeback error handling test requires that you put the journal on a separate device. This allows us to use dmerror to simulate data writeback failure, without affecting the journal. xfs already has infrastructure for this (a'la $SCRATCH_LOGDEV), so wire up the ext4 code so that it can do the same thing when _scratch_mkfs is called. Signed-off-by: Jeff LaytonReviewed-by: Darrick J. Wong --- common/rc | 3 +++ 1 file changed, 3 insertions(+) diff --git a/common/rc b/common/rc index 87e6ff08b18d..08807ac7c22a 100644 --- a/common/rc +++ b/common/rc @@ -676,6 +676,9 @@ _scratch_mkfs_ext4() local tmp=`mktemp` local mkfs_status + [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \ + $mkfs_cmd -O journal_dev $SCRATCH_LOGDEV && \ + mkfs_cmd="$mkfs_cmd $MKFS_OPTIONS -J device=$SCRATCH_LOGDEV" _scratch_do_mkfs "$mkfs_cmd" "$mkfs_filter" $* 2>$tmp.mkfserr 1>$tmp.mkfsstd mkfs_status=$? -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[xfstests PATCH v4 4/5] ext3: allow it to put journal on a separate device when doing scratch_mkfs
Signed-off-by: Jeff Layton--- common/rc | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/common/rc b/common/rc index 08807ac7c22a..46b890cbff6a 100644 --- a/common/rc +++ b/common/rc @@ -832,7 +832,16 @@ _scratch_mkfs() mkfs_cmd="$MKFS_BTRFS_PROG" mkfs_filter="cat" ;; - ext2|ext3) + ext3) + mkfs_cmd="$MKFS_PROG -t $FSTYP -- -F" + mkfs_filter="grep -v -e ^Warning: -e \"^mke2fs \"" + + # put journal on separate device? + [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \ + $mkfs_cmd -O journal_dev $SCRATCH_LOGDEV && \ + mkfs_cmd="$mkfs_cmd $MKFS_OPTIONS -J device=$SCRATCH_LOGDEV" + ;; + ext2) mkfs_cmd="$MKFS_PROG -t $FSTYP -- -F" mkfs_filter="grep -v -e ^Warning: -e \"^mke2fs \"" ;; -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[xfstests PATCH v4 0/5] new tests for writeback error reporting behavior
v4: respin set based on Eryu's comments These tests are companion tests to the patchset I recently posted with the cover letter: [PATCH v6 00/20] fs: enhanced writeback error reporting with errseq_t (pile #1) This set just adds 3 new xfstests to test writeback behavior. One generic filesystem test, one test for raw block devices, and one test for btrfs. The tests work with dmerror to ensure that writeback fails, and then tests how the kernel reports errors afterward. xfs, ext2/3/4 and btrfs all pass on a kernel with the patchset above. The one comment I couldn't really address from earlier review is that we don't have a great way for xfstests to tell what sort of error reporting behavior it should expect from the running kernel. That makes it difficult to tell whether failure is expected during a given run. Maybe that's OK though and we should just let unconverted filesystems fail this test? Jeff Layton (5): generic: add a writeback error handling test ext4: allow ext4 to use $SCRATCH_LOGDEV generic: test writeback error handling on dmerror devices ext3: allow it to put journal on a separate device when doing scratch_mkfs btrfs: make a btrfs version of writeback error reporting test .gitignore | 1 + common/dmerror | 13 ++- common/rc | 14 ++- doc/auxiliary-programs.txt | 16 src/Makefile | 2 +- src/dmerror| 44 + src/fsync-err.c| 223 + tests/btrfs/999| 93 +++ tests/btrfs/group | 1 + tests/generic/998 | 64 + tests/generic/998.out | 2 + tests/generic/999 | 77 tests/generic/999.out | 3 + tests/generic/group| 2 + 14 files changed, 548 insertions(+), 7 deletions(-) create mode 100755 src/dmerror create mode 100644 src/fsync-err.c create mode 100755 tests/btrfs/999 create mode 100755 tests/generic/998 create mode 100644 tests/generic/998.out create mode 100755 tests/generic/999 create mode 100644 tests/generic/999.out -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs native encryption
On Fri, Jun 09, 2017 at 08:50:12AM -0700, Filip Bystricky wrote: > Dear btrfs maintainers, > Google is evaluating btrfs for its potential use in android, but > currently the lack of native file-based encryption unfortunately makes > it a nonstarter. The file-based encryption is covered by the fscrypt API, that's implemented in ext4/f2fs, so implementing that in btrfs could allow you to start evaluating btrfs. As other pointed, the usecases with snapshots and reflinks need to be reviewed. > According to the FAQ (specifically the answer to > "Does btrfs support encryption"), nobody is currently working on this. > How up-to-date is that answer, and are there any new plans to add > native FBE in the future? Wiki is a snapshot of status, knowledge and best practices from mixed times, so not everything is up to date or accurate. That no one is working on encryption is partially true, as we've seen some proposed patches that received objections and from my perspective are not close to being considered for merging. The fscrypt functionality has been designed and the API defined, so it's a matter of implementation on the btrfs side. Anything else is in the phase of design and prototyping. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 00/20] fs: enhanced writeback error reporting with errseq_t (pile #1)
v6: === This is the sixth posting of the patchset to revamp the way writeback errors are tracked and reported. This is a smaller set than the last one. The main difference from the last set is that this one just adds errseq_t based error reporting for the purposes of fsync, while leaving the internal callers of filemap_* functions and the like largely untouched. Some of these patches have been posted separately, but I'm re-posting them here to make it clear that they're prerequisites to the later patches in the series. Background: === The basic problem is that we have (for a very long time) tracked and reported writeback errors based on two flags in the address_space: AS_EIO and AS_ENOSPC. Those flags are cleared when they are checked, so only the first caller to check them is able to consume them. That model is quite unreliable though, for several related reasons: * only the first fsync caller on the inode will see the error. In a world of containerized setups, that's no longer viable. Applications need to know that their writes are safely stored, and they can currently miss seeing errors that they should be aware of when they're not. * there are a lot of internal callers to filemap_fdatawait* and filemap_write_and_wait* that clear the error flags but then never report them to userland in any fashion. * Some internal callers report writeback errors, but can do so at non-sensical times. For instance, we might want to truncate a file, which triggers a pagecache flush. If that writeback fails, we might report that error to the truncate caller, but a subsequent fsync will likely not see it. * Some internal callers try to reset the error flags after clearing them, but that's racy. Another task could check the flags between those two events. Solution: = This patchset adds a new datatype called an errseq_t that represents a sequence of errors. It's a u32, with a field for a POSIX-flavor error and a counter, managed with atomics. We can sample that value at a particular point in time, and can later tell whether there have been any errors since that point. That allows us to provide traditional check-and-clear fsync semantics on every open file description in a lightweight fashion. fsync callers no longer need to coordinate between one another in order to ensure that errors at fsync time are handled correctly. Strategy: = The aim with this set is to do the minimum possible to support for reliable reporting of errors on fsync, without substantially changing the internals of the filesystems themselves. Most of the internal calls to filemap_fdatawait are left alone, so all of the internal error error checking is done using the traditional flag based checks. The only real difference here is more reliable reporting of errors at fsync. I think that we probably will want to eventually convert all of the internal callers to use errseq_t based reporting too, but that can be done in an incremental fashion in follow-on patchsets. Testing: I've primarily been testing this with a couple of new xfstests that I will post separately. These tests use dm-error fault injection to flip the underlying block device to start throwing I/O errors, and then test the behavior of the filesystem layer on top of that. Jeff Layton (20): mm: fix mapping_set_error call in me_pagecache_dirty buffer: use mapping_set_error instead of setting the flag fs: check for writeback errors after syncing out buffers in generic_file_fsync buffer: set errors in mapping at the time that the error occurs mm: don't TestClearPageError in __filemap_fdatawait_range mm: drop "wait" parameter from write_one_page mm: clean up error handling in write_one_page lib: add errseq_t type and infrastructure for handling it fs: new infrastructure for writeback error handling and reporting mm: tracepoints for writeback error events mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error fs: add a new fstype flag to indicate how writeback errors are tracked Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors dax: set errors in mapping when writeback fails fs: have call_fsync call filemap_report_wb_err if FS_WB_ERRSEQ is set block: convert to errseq_t based writeback error tracking fs: add f_md_wb_err field to struct file for tracking metadata errors ext4: use errseq_t based error handling for reporting data writeback errors xfs: minimal conversion to errseq_t writeback error reporting btrfs: minimal conversion to errseq_t writeback error reporting on fsync Documentation/filesystems/vfs.txt | 48 - drivers/dax/device.c | 1 + fs/block_dev.c| 2 + fs/btrfs/super.c | 2 +- fs/buffer.c | 20 ++-- fs/dax.c | 18 +++- fs/exofs/dir.c| 2 +- fs/ext2/dir.c | 2
[PATCH v6 06/20] mm: drop "wait" parameter from write_one_page
The callers all set it to 1. Also, make it clear that this function will not set any sort of AS_* error, and that the caller must do so if necessary. No existing caller uses this on normal files, so none of them need it. Also, add __must_check here since, in general, the callers need to handle an error here in some fashion. Signed-off-by: Jeff LaytonReviewed-by: Ross Zwisler Reviewed-by: Jan Kara Reviewed-by: Matthew Wilcox Reviewed-by: Christoph Hellwig --- fs/exofs/dir.c| 2 +- fs/ext2/dir.c | 2 +- fs/jfs/jfs_metapage.c | 4 ++-- fs/minix/dir.c| 2 +- fs/sysv/dir.c | 2 +- fs/ufs/dir.c | 2 +- include/linux/mm.h| 2 +- mm/page-writeback.c | 14 +++--- 8 files changed, 15 insertions(+), 15 deletions(-) diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c index 8eeb694332fe..98233a97b7b8 100644 --- a/fs/exofs/dir.c +++ b/fs/exofs/dir.c @@ -72,7 +72,7 @@ static int exofs_commit_chunk(struct page *page, loff_t pos, unsigned len) set_page_dirty(page); if (IS_DIRSYNC(dir)) - err = write_one_page(page, 1); + err = write_one_page(page); else unlock_page(page); diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c index d9650c9508e4..e2709695b177 100644 --- a/fs/ext2/dir.c +++ b/fs/ext2/dir.c @@ -100,7 +100,7 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len) } if (IS_DIRSYNC(dir)) { - err = write_one_page(page, 1); + err = write_one_page(page); if (!err) err = sync_inode_metadata(dir, 1); } else { diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c index 489aaa1403e5..744fa3c079e6 100644 --- a/fs/jfs/jfs_metapage.c +++ b/fs/jfs/jfs_metapage.c @@ -711,7 +711,7 @@ void force_metapage(struct metapage *mp) get_page(page); lock_page(page); set_page_dirty(page); - write_one_page(page, 1); + write_one_page(page); clear_bit(META_forcewrite, >flag); put_page(page); } @@ -756,7 +756,7 @@ void release_metapage(struct metapage * mp) set_page_dirty(page); if (test_bit(META_sync, >flag)) { clear_bit(META_sync, >flag); - write_one_page(page, 1); + write_one_page(page); lock_page(page); /* write_one_page unlocks the page */ } } else if (mp->lsn) /* discard_metapage doesn't remove it */ diff --git a/fs/minix/dir.c b/fs/minix/dir.c index 7edc9b395700..baa9721f1299 100644 --- a/fs/minix/dir.c +++ b/fs/minix/dir.c @@ -57,7 +57,7 @@ static int dir_commit_chunk(struct page *page, loff_t pos, unsigned len) mark_inode_dirty(dir); } if (IS_DIRSYNC(dir)) - err = write_one_page(page, 1); + err = write_one_page(page); else unlock_page(page); return err; diff --git a/fs/sysv/dir.c b/fs/sysv/dir.c index 5bdae85ceef7..f5191cb2c947 100644 --- a/fs/sysv/dir.c +++ b/fs/sysv/dir.c @@ -45,7 +45,7 @@ static int dir_commit_chunk(struct page *page, loff_t pos, unsigned len) mark_inode_dirty(dir); } if (IS_DIRSYNC(dir)) - err = write_one_page(page, 1); + err = write_one_page(page); else unlock_page(page); return err; diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c index de01b8f2aa78..48609f1d9580 100644 --- a/fs/ufs/dir.c +++ b/fs/ufs/dir.c @@ -53,7 +53,7 @@ static int ufs_commit_chunk(struct page *page, loff_t pos, unsigned len) mark_inode_dirty(dir); } if (IS_DIRSYNC(dir)) - err = write_one_page(page, 1); + err = write_one_page(page); else unlock_page(page); return err; diff --git a/include/linux/mm.h b/include/linux/mm.h index b892e95d4929..c0b1759304ec 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2199,7 +2199,7 @@ extern void filemap_map_pages(struct vm_fault *vmf, extern int filemap_page_mkwrite(struct vm_fault *vmf); /* mm/page-writeback.c */ -int write_one_page(struct page *page, int wait); +int __must_check write_one_page(struct page *page); void task_dirty_inc(struct task_struct *tsk); /* readahead.c */ diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 143c1c25d680..b901fe52b153 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2366,15 +2366,16 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc) } /** - * write_one_page - write out a single page and optionally wait on I/O + * write_one_page - write out a single page and wait on I/O * @page: the page to write - * @wait: if true, wait on writeout * * The page must be locked by the caller
[PATCH v6 05/20] mm: don't TestClearPageError in __filemap_fdatawait_range
The -EIO returned here can end up overriding whatever error is marked in the address space, and be returned at fsync time, even when there is a more appropriate error stored in the mapping. Read errors are also sometimes tracked on a per-page level using PG_error. Suppose we have a read error on a page, and then that page is subsequently dirtied by overwriting the whole page. Writeback doesn't clear PG_error, so we can then end up successfully writing back that page and still return -EIO on fsync. Worse yet, PG_error is cleared during a sync() syscall, but the -EIO return from that is silently discarded. Any subsystem that is relying on PG_error to report errors during fsync can easily lose writeback errors due to this. All you need is a stray sync() call to wait for writeback to complete and you've lost the error. Since the handling of the PG_error flag is somewhat inconsistent across subsystems, let's just rely on marking the address space when there are writeback errors. Change the TestClearPageError call to ClearPageError, and make __filemap_fdatawait_range a void return function. Signed-off-by: Jeff Layton--- mm/filemap.c | 20 +--- 1 file changed, 5 insertions(+), 15 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 6f1be573a5e6..1de71753de28 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -376,17 +376,16 @@ int filemap_flush(struct address_space *mapping) } EXPORT_SYMBOL(filemap_flush); -static int __filemap_fdatawait_range(struct address_space *mapping, +static void __filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte, loff_t end_byte) { pgoff_t index = start_byte >> PAGE_SHIFT; pgoff_t end = end_byte >> PAGE_SHIFT; struct pagevec pvec; int nr_pages; - int ret = 0; if (end_byte < start_byte) - goto out; + return; pagevec_init(, 0); while ((index <= end) && @@ -403,14 +402,11 @@ static int __filemap_fdatawait_range(struct address_space *mapping, continue; wait_on_page_writeback(page); - if (TestClearPageError(page)) - ret = -EIO; + ClearPageError(page); } pagevec_release(); cond_resched(); } -out: - return ret; } /** @@ -430,14 +426,8 @@ static int __filemap_fdatawait_range(struct address_space *mapping, int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte, loff_t end_byte) { - int ret, ret2; - - ret = __filemap_fdatawait_range(mapping, start_byte, end_byte); - ret2 = filemap_check_errors(mapping); - if (!ret) - ret = ret2; - - return ret; + __filemap_fdatawait_range(mapping, start_byte, end_byte); + return filemap_check_errors(mapping); } EXPORT_SYMBOL(filemap_fdatawait_range); -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 02/20] buffer: use mapping_set_error instead of setting the flag
Signed-off-by: Jeff LaytonReviewed-by: Jan Kara Reviewed-by: Matthew Wilcox Reviewed-by: Christoph Hellwig --- fs/buffer.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/buffer.c b/fs/buffer.c index 161be58c5cb0..4be8b914a222 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -482,7 +482,7 @@ static void __remove_assoc_queue(struct buffer_head *bh) list_del_init(>b_assoc_buffers); WARN_ON(!bh->b_assoc_map); if (buffer_write_io_error(bh)) - set_bit(AS_EIO, >b_assoc_map->flags); + mapping_set_error(bh->b_assoc_map, -EIO); bh->b_assoc_map = NULL; } -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 10/20] mm: tracepoints for writeback error events
To enable that, make __errseq_set return the value that it was set to when we exit the loop. Take heed that that value is not suitable as a later "since" value, as it will not have been marked seen. Signed-off-by: Jeff Layton--- include/linux/errseq.h | 2 +- include/linux/fs.h | 5 +++- include/trace/events/filemap.h | 55 ++ lib/errseq.c | 20 ++- mm/filemap.c | 13 +- 5 files changed, 86 insertions(+), 9 deletions(-) diff --git a/include/linux/errseq.h b/include/linux/errseq.h index 0d2555f310cd..9e0d444ac88d 100644 --- a/include/linux/errseq.h +++ b/include/linux/errseq.h @@ -5,7 +5,7 @@ typedef u32errseq_t; -void __errseq_set(errseq_t *eseq, int err); +errseq_t __errseq_set(errseq_t *eseq, int err); static inline void errseq_set(errseq_t *eseq, int err) { /* Optimize for the common case of no error */ diff --git a/include/linux/fs.h b/include/linux/fs.h index e57321b7ee2a..6cd87887430b 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2530,6 +2530,7 @@ extern int filemap_fdatawrite_range(struct address_space *mapping, extern int filemap_check_errors(struct address_space *mapping); extern int __must_check filemap_report_wb_err(struct file *file); +extern void __filemap_set_wb_err(struct address_space *mapping, int err); /** * filemap_set_wb_err - set a writeback error on an address_space @@ -2549,7 +2550,9 @@ extern int __must_check filemap_report_wb_err(struct file *file); */ static inline void filemap_set_wb_err(struct address_space *mapping, int err) { - errseq_set(>wb_err, err); + /* Fastpath for common case of no error */ + if (unlikely(err)) + __filemap_set_wb_err(mapping, err); } /** diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h index 42febb6bc1d5..2af66920f267 100644 --- a/include/trace/events/filemap.h +++ b/include/trace/events/filemap.h @@ -10,6 +10,7 @@ #include #include #include +#include DECLARE_EVENT_CLASS(mm_filemap_op_page_cache, @@ -52,6 +53,60 @@ DEFINE_EVENT(mm_filemap_op_page_cache, mm_filemap_add_to_page_cache, TP_ARGS(page) ); +TRACE_EVENT(filemap_set_wb_err, + TP_PROTO(struct address_space *mapping, errseq_t eseq), + + TP_ARGS(mapping, eseq), + + TP_STRUCT__entry( + __field(unsigned long, i_ino) + __field(dev_t, s_dev) + __field(errseq_t, errseq) + ), + + TP_fast_assign( + __entry->i_ino = mapping->host->i_ino; + __entry->errseq = eseq; + if (mapping->host->i_sb) + __entry->s_dev = mapping->host->i_sb->s_dev; + else + __entry->s_dev = mapping->host->i_rdev; + ), + + TP_printk("dev=%d:%d ino=0x%lx errseq=0x%x", + MAJOR(__entry->s_dev), MINOR(__entry->s_dev), + __entry->i_ino, __entry->errseq) +); + +TRACE_EVENT(filemap_report_wb_err, + TP_PROTO(struct file *file, errseq_t old), + + TP_ARGS(file, old), + + TP_STRUCT__entry( + __field(struct file *, file); + __field(unsigned long, i_ino) + __field(dev_t, s_dev) + __field(errseq_t, old) + __field(errseq_t, new) + ), + + TP_fast_assign( + __entry->file = file; + __entry->i_ino = file->f_mapping->host->i_ino; + if (file->f_mapping->host->i_sb) + __entry->s_dev = file->f_mapping->host->i_sb->s_dev; + else + __entry->s_dev = file->f_mapping->host->i_rdev; + __entry->old = old; + __entry->new = file->f_wb_err; + ), + + TP_printk("file=%p dev=%d:%d ino=0x%lx old=0x%x new=0x%x", + __entry->file, MAJOR(__entry->s_dev), + MINOR(__entry->s_dev), __entry->i_ino, __entry->old, + __entry->new) +); #endif /* _TRACE_FILEMAP_H */ /* This part must be outside protection */ diff --git a/lib/errseq.c b/lib/errseq.c index d129c0611c1f..009972d3000c 100644 --- a/lib/errseq.c +++ b/lib/errseq.c @@ -52,10 +52,14 @@ * * Most callers will want to use the errseq_set inline wrapper to efficiently * handle the common case where err is 0. + * + * We do return an errseq_t here, primarily for debugging purposes. The return + * value should not be used as a previously sampled value in later calls as it + * will not have the SEEN flag set. */ -void
[PATCH v6 04/20] buffer: set errors in mapping at the time that the error occurs
I noticed on xfs that I could still sometimes get back an error on fsync on a fd that was opened after the error condition had been cleared. The problem is that the buffer code sets the write_io_error flag and then later checks that flag to set the error in the mapping. That flag perisists for quite a while however. If the file is later opened with O_TRUNC, the buffers will then be invalidated and the mapping's error set such that a subsequent fsync will return error. I think this is incorrect, as there was no writeback between the open and fsync. Add a new mark_buffer_write_io_error operation that sets the flag and the error in the mapping at the same time. Replace all calls to set_buffer_write_io_error with mark_buffer_write_io_error, and remove the places that check this flag in order to set the error in the mapping. This sets the error in the mapping earlier, at the time that it's first detected. Signed-off-by: Jeff LaytonReviewed-by: Jan Kara --- fs/buffer.c | 20 +--- fs/gfs2/lops.c | 2 +- include/linux/buffer_head.h | 1 + 3 files changed, 15 insertions(+), 8 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index 4be8b914a222..b946149e8214 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -178,7 +178,7 @@ void end_buffer_write_sync(struct buffer_head *bh, int uptodate) set_buffer_uptodate(bh); } else { buffer_io_error(bh, ", lost sync page write"); - set_buffer_write_io_error(bh); + mark_buffer_write_io_error(bh); clear_buffer_uptodate(bh); } unlock_buffer(bh); @@ -352,8 +352,7 @@ void end_buffer_async_write(struct buffer_head *bh, int uptodate) set_buffer_uptodate(bh); } else { buffer_io_error(bh, ", lost async page write"); - mapping_set_error(page->mapping, -EIO); - set_buffer_write_io_error(bh); + mark_buffer_write_io_error(bh); clear_buffer_uptodate(bh); SetPageError(page); } @@ -481,8 +480,6 @@ static void __remove_assoc_queue(struct buffer_head *bh) { list_del_init(>b_assoc_buffers); WARN_ON(!bh->b_assoc_map); - if (buffer_write_io_error(bh)) - mapping_set_error(bh->b_assoc_map, -EIO); bh->b_assoc_map = NULL; } @@ -1181,6 +1178,17 @@ void mark_buffer_dirty(struct buffer_head *bh) } EXPORT_SYMBOL(mark_buffer_dirty); +void mark_buffer_write_io_error(struct buffer_head *bh) +{ + set_buffer_write_io_error(bh); + /* FIXME: do we need to set this in both places? */ + if (bh->b_page && bh->b_page->mapping) + mapping_set_error(bh->b_page->mapping, -EIO); + if (bh->b_assoc_map) + mapping_set_error(bh->b_assoc_map, -EIO); +} +EXPORT_SYMBOL(mark_buffer_write_io_error); + /* * Decrement a buffer_head's reference count. If all buffers against a page * have zero reference count, are clean and unlocked, and if the page is clean @@ -3279,8 +3287,6 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free) bh = head; do { - if (buffer_write_io_error(bh) && page->mapping) - mapping_set_error(page->mapping, -EIO); if (buffer_busy(bh)) goto failed; bh = bh->b_this_page; diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c index b1f9144b42c7..cd7857ab1a6a 100644 --- a/fs/gfs2/lops.c +++ b/fs/gfs2/lops.c @@ -182,7 +182,7 @@ static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp, struct bio_vec *bvec, bh = bh->b_this_page; do { if (error) - set_buffer_write_io_error(bh); + mark_buffer_write_io_error(bh); unlock_buffer(bh); next = bh->b_this_page; size -= bh->b_size; diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h index bd029e52ef5e..e0abeba3ced7 100644 --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -149,6 +149,7 @@ void buffer_check_dirty_writeback(struct page *page, */ void mark_buffer_dirty(struct buffer_head *bh); +void mark_buffer_write_io_error(struct buffer_head *bh); void init_buffer(struct buffer_head *, bh_end_io_t *, void *); void touch_buffer(struct buffer_head *bh); void set_bh_page(struct buffer_head *bh, -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 03/20] fs: check for writeback errors after syncing out buffers in generic_file_fsync
ext2 currently does a test+clear of the AS_EIO flag, which is is problematic for some coming changes. What we really need to do instead is call filemap_check_errors in __generic_file_fsync after syncing out the buffers. That will be sufficient for this case, and help other callers detect these errors properly as well. With that, we don't need to twiddle it in ext2. Suggested-by: Jan KaraSigned-off-by: Jeff Layton Reviewed-by: Christoph Hellwig Reviewed-by: Jan Kara Reviewed-by: Matthew Wilcox --- fs/ext2/file.c | 2 +- fs/libfs.c | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/ext2/file.c b/fs/ext2/file.c index b21891a6bfca..ed00e7ae0ef3 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -177,7 +177,7 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync) struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping; ret = generic_file_fsync(file, start, end, datasync); - if (ret == -EIO || test_and_clear_bit(AS_EIO, >flags)) { + if (ret == -EIO) { /* We don't really know where the IO error happened... */ ext2_error(sb, __func__, "detected IO error when writing metadata buffers"); diff --git a/fs/libfs.c b/fs/libfs.c index a04395334bb1..1dec90819366 100644 --- a/fs/libfs.c +++ b/fs/libfs.c @@ -991,7 +991,8 @@ int __generic_file_fsync(struct file *file, loff_t start, loff_t end, out: inode_unlock(inode); - return ret; + err = filemap_check_errors(inode->i_mapping); + return ret ? ret : err; } EXPORT_SYMBOL(__generic_file_fsync); -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 10/13] ext4: add more robust reporting of metadata writeback errors
ext4 uses the blockdev mapping for tracking metadata stored in the pagecache. Sample its wb_err when opening a file and store that in the f_md_wb_err field. Change ext4_sync_file to check for data errors first, and then check the blockdev mapping for metadata errors afterward. Note that because metadata writeback errors are only tracked on a per-device level, this does mean that we'll end up reporting an error on all open file descriptors when there is a metadata writeback failure. That's not ideal, but we can possibly improve upon it in the future. Signed-off-by: Jeff Layton--- fs/ext4/dir.c | 8 ++-- fs/ext4/file.c | 5 - fs/ext4/fsync.c | 13 + 3 files changed, 23 insertions(+), 3 deletions(-) diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c index e8b365000d73..6bbb19510f74 100644 --- a/fs/ext4/dir.c +++ b/fs/ext4/dir.c @@ -611,9 +611,13 @@ static int ext4_dx_readdir(struct file *file, struct dir_context *ctx) static int ext4_dir_open(struct inode * inode, struct file * filp) { + int ret = 0; + if (ext4_encrypted_inode(inode)) - return fscrypt_get_encryption_info(inode) ? -EACCES : 0; - return 0; + ret = fscrypt_get_encryption_info(inode) ? -EACCES : 0; + if (!ret) + filp->f_md_wb_err = filemap_sample_wb_err(inode->i_sb->s_bdev->bd_inode->i_mapping); + return ret; } static int ext4_release_dir(struct inode *inode, struct file *filp) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 831fd6beebf0..fe0d6e01c4b7 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -435,7 +435,10 @@ static int ext4_file_open(struct inode * inode, struct file * filp) if (ret < 0) return ret; } - return dquot_file_open(inode, filp); + ret = dquot_file_open(inode, filp); + if (!ret) + filp->f_md_wb_err = filemap_sample_wb_err(sb->s_bdev->bd_inode->i_mapping); + return ret; } /* diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index 03d6259d8662..36363a6730d7 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -165,6 +165,19 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) err = filemap_report_wb_err(file); if (!ret) ret = err; + + /* +* Was there a metadata writeback error since last fsync? +* +* FIXME: ext4 tracks metadata with a whole-block device mapping. So, +* if there is any sort of metadata writeback error, we'll report an +* error on all open fds, even ones not associated with the inode +*/ + err = filemap_report_md_wb_err(file, + inode->i_sb->s_bdev->bd_inode->i_mapping); + if (!ret) + ret = err; + trace_ext4_sync_file_exit(inode, ret); return ret; } -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 12/20] fs: add a new fstype flag to indicate how writeback errors are tracked
Now that we have new infrastructure for handling writeback errors using errseq_t, we need to convert the existing code to use it. We could attempt to retrofit the old interfaces on top of the new, but there is a conceptual disconnect here in the case of internal callers that invoke filemap_fdatawait and the like. When reporting writeback errors, we will always report errors that have occurred since a particular point in time. With the old writeback error reporting, the time we used was "since it was last tested/cleared" which is entirely arbitrary and potentially racy. Now, we can report the latest error that has occurred since an arbitrary point in time (represented as a sampled errseq_t value). This means that we need to touch each filesystem that calls filemap_check_errors in some fashion and ensure that we establish sane "since" values for those callers. But...some code is shared between filesystems and needs to be able to handle both error tracking schemes. Add a new FS_WB_ERRSEQ flag to the fstype. Later patches will set and key off of that to decide what behavior should be used. Signed-off-by: Jeff Layton--- include/linux/fs.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/linux/fs.h b/include/linux/fs.h index 6cd87887430b..17ba6284ab14 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2023,6 +2023,7 @@ struct file_system_type { #define FS_BINARY_MOUNTDATA2 #define FS_HAS_SUBTYPE 4 #define FS_USERNS_MOUNT8 /* Can be mounted by userns root */ +#define FS_WB_ERRSEQ 16 /* errseq_t writeback err tracking */ #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */ struct dentry *(*mount) (struct file_system_type *, int, const char *, void *); -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 17/20] fs: add f_md_wb_err field to struct file for tracking metadata errors
Some filesystems keep a different mapping for metadata writeback. Add a second errseq_t to struct file for tracking metadata writeback errors. Also add a new function for checking a mapping of the caller's choosing vs. the f_md_wb_err value. Signed-off-by: Jeff Layton--- include/linux/fs.h | 3 +++ include/trace/events/filemap.h | 23 ++- mm/filemap.c | 40 +++- 3 files changed, 44 insertions(+), 22 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index ef3feeec80b2..e366835c93b3 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -871,6 +871,7 @@ struct file { struct list_headf_tfile_llink; #endif /* #ifdef CONFIG_EPOLL */ struct address_space*f_mapping; + errseq_tf_md_wb_err; /* optional metadata wb error tracking */ } __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */ struct file_handle { @@ -2525,6 +2526,8 @@ extern int filemap_fdatawrite_range(struct address_space *mapping, extern int filemap_check_errors(struct address_space *mapping); extern int __must_check filemap_report_wb_err(struct file *file); +extern int __must_check filemap_report_md_wb_err(struct file *file, + struct address_space *mapping); extern void __filemap_set_wb_err(struct address_space *mapping, int err); /** diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h index 2af66920f267..6e0d78c01a2e 100644 --- a/include/trace/events/filemap.h +++ b/include/trace/events/filemap.h @@ -79,12 +79,11 @@ TRACE_EVENT(filemap_set_wb_err, ); TRACE_EVENT(filemap_report_wb_err, - TP_PROTO(struct file *file, errseq_t old), + TP_PROTO(struct address_space *mapping, errseq_t old, errseq_t new), - TP_ARGS(file, old), + TP_ARGS(mapping, old, new), TP_STRUCT__entry( - __field(struct file *, file); __field(unsigned long, i_ino) __field(dev_t, s_dev) __field(errseq_t, old) @@ -92,20 +91,18 @@ TRACE_EVENT(filemap_report_wb_err, ), TP_fast_assign( - __entry->file = file; - __entry->i_ino = file->f_mapping->host->i_ino; - if (file->f_mapping->host->i_sb) - __entry->s_dev = file->f_mapping->host->i_sb->s_dev; + __entry->i_ino = mapping->host->i_ino; + if (mapping->host->i_sb) + __entry->s_dev = mapping->host->i_sb->s_dev; else - __entry->s_dev = file->f_mapping->host->i_rdev; + __entry->s_dev = mapping->host->i_rdev; __entry->old = old; - __entry->new = file->f_wb_err; + __entry->new = new; ), - TP_printk("file=%p dev=%d:%d ino=0x%lx old=0x%x new=0x%x", - __entry->file, MAJOR(__entry->s_dev), - MINOR(__entry->s_dev), __entry->i_ino, __entry->old, - __entry->new) + TP_printk("dev=%d:%d ino=0x%lx old=0x%x new=0x%x", + MAJOR(__entry->s_dev), MINOR(__entry->s_dev), + __entry->i_ino, __entry->old, __entry->new) ); #endif /* _TRACE_FILEMAP_H */ diff --git a/mm/filemap.c b/mm/filemap.c index c5e19ea0bf12..ef0ff6b87759 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -564,27 +564,49 @@ EXPORT_SYMBOL(__filemap_set_wb_err); * value is protected by the f_lock since we must ensure that it reflects * the latest value swapped in for this file descriptor. */ -int filemap_report_wb_err(struct file *file) +static int __filemap_report_wb_err(errseq_t *cursor, spinlock_t *lock, + struct address_space *mapping) { int err = 0; - errseq_t old = READ_ONCE(file->f_wb_err); - struct address_space *mapping = file->f_mapping; + errseq_t old = READ_ONCE(*cursor); /* Locklessly handle the common case where nothing has changed */ if (errseq_check(>wb_err, old)) { /* Something changed, must use slow path */ - spin_lock(>f_lock); - old = file->f_wb_err; - err = errseq_check_and_advance(>wb_err, - >f_wb_err); - trace_filemap_report_wb_err(file, old); - spin_unlock(>f_lock); + spin_lock(lock); + old = *cursor; + err = errseq_check_and_advance(>wb_err, cursor); + trace_filemap_report_wb_err(mapping, old, *cursor); + spin_unlock(lock); }
[PATCH v6 14/20] dax: set errors in mapping when writeback fails
Jan Kara's description for this patch is much better than mine, so I'm quoting it verbatim here: -8<- DAX currently doesn't set errors in the mapping when cache flushing fails in dax_writeback_mapping_range(). Since this function can get called only from fsync(2) or sync(2), this is actually as good as it can currently get since we correctly propagate the error up from dax_writeback_mapping_range() to filemap_fdatawrite() However, in the future better writeback error handling will enable us to properly report these errors on fsync(2) even if there are multiple file descriptors open against the file or if sync(2) gets called before fsync(2). So convert DAX to using standard error reporting through the mapping. -8<- For now, only do this when the FS_WB_ERRSEQ flag is set. The AS_EIO/AS_ENOSPC flags are not currently cleared in the older code when writeback initiation fails, only when we discover an error after waiting on writeback to complete, so we only want to do this with errseq_t based error handling to prevent seeing duplicate errors on fsync. Signed-off-by: Jeff LaytonReviewed-by: Jan Kara Reviewed-by: Christoph Hellwig Reviewed-and-Tested-by: Ross Zwisler --- fs/dax.c | 18 +- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/fs/dax.c b/fs/dax.c index 2a6889b3585f..ba3b17eefcfc 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -856,8 +856,24 @@ int dax_writeback_mapping_range(struct address_space *mapping, ret = dax_writeback_one(bdev, dax_dev, mapping, indices[i], pvec.pages[i]); - if (ret < 0) + if (ret < 0) { + /* +* For fs' that use errseq_t based error +* tracking, we must call mapping_set_error +* here to ensure that fsync on all open fds +* get back an error. Doing this with the old +* wb error tracking infrastructure is +* problematic though, as DAX writeback is +* synchronous, and the error flags are not +* cleared when initiation fails, only when +* it fails after the write has been submitted +* to the backing store. +*/ + if (mapping->host->i_sb->s_type->fs_flags & + FS_WB_ERRSEQ) + mapping_set_error(mapping, ret); goto out; + } } } out: -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 12/13] xfs: minimal conversion to errseq_t writeback error reporting
Just check and advance the data errseq_t in struct file before before returning from fsync on normal files. Internal filemap_* callers are left as-is. We also set the FS_WB_ERRSEQ flag just for completeness sake. Not much is really using it at this point. Signed-off-by: Jeff Layton--- fs/xfs/xfs_file.c | 15 +++ fs/xfs/xfs_super.c | 2 +- 2 files changed, 12 insertions(+), 5 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 5fb5a0958a14..bc3b1575e8db 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -134,7 +134,7 @@ xfs_file_fsync( struct inode*inode = file->f_mapping->host; struct xfs_inode*ip = XFS_I(inode); struct xfs_mount*mp = ip->i_mount; - int error = 0; + int error = 0, err2; int log_flushed = 0; xfs_lsn_t lsn = 0; @@ -142,10 +142,12 @@ xfs_file_fsync( error = filemap_write_and_wait_range(inode->i_mapping, start, end); if (error) - return error; + goto out; - if (XFS_FORCED_SHUTDOWN(mp)) - return -EIO; + if (XFS_FORCED_SHUTDOWN(mp)) { + error = -EIO; + goto out; + } xfs_iflags_clear(ip, XFS_ITRUNCATED); @@ -197,6 +199,11 @@ xfs_file_fsync( mp->m_logdev_targp == mp->m_ddev_targp) xfs_blkdev_issue_flush(mp->m_ddev_targp); +out: + err2 = filemap_report_wb_err(file); + if (!error) + error = err2; + return error; } diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 455a575f101d..28d3be187025 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1758,7 +1758,7 @@ static struct file_system_type xfs_fs_type = { .name = "xfs", .mount = xfs_fs_mount, .kill_sb= kill_block_super, - .fs_flags = FS_REQUIRES_DEV, + .fs_flags = FS_REQUIRES_DEV | FS_WB_ERRSEQ, }; MODULE_ALIAS_FS("xfs"); -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 16/20] block: convert to errseq_t based writeback error tracking
This is a very minimal conversion to errseq_t based error tracking for raw block device access. Only real change that is strictly required is that we must ensure that filemap_report_wb_err is unconditionally called after fsync, which is now done if FS_WB_ERRSEQ is set in fs_flags. That ensures that the errseq_t in the file is advanced to the latest value in the mapping. Note that there are internal callers that call sync_blockdev and the like that are not affected by this. They'll continue to use the AS_EIO/AS_ENOSPC flags for error reporting like they always have for now. Signed-off-by: Jeff Layton--- fs/block_dev.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/block_dev.c b/fs/block_dev.c index 4d62fe771587..4bf865cc99de 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -801,6 +801,7 @@ static struct file_system_type bd_type = { .name = "bdev", .mount = bd_mount, .kill_sb= kill_anon_super, + .fs_flags = FS_WB_ERRSEQ, }; struct super_block *blockdev_superblock __read_mostly; -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 11/20] mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
When a writeback error occurs, we want later callers to be able to pick up that fact when they go to wait on that writeback to complete. Traditionally, we've used AS_EIO/AS_ENOSPC flags to track that, but that's problematic since only one "checker" will be informed when an error occurs. In later patches, we're going to want to convert many of these callers to check for errors since a well-defined point in time. For now, ensure that we can handle both sorts of checks by both setting errors in both places when there is a writeback failure. Signed-off-by: Jeff Layton--- include/linux/pagemap.h | 31 +-- 1 file changed, 25 insertions(+), 6 deletions(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 316a19f6b635..28acc94e0f81 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -28,14 +28,33 @@ enum mapping_flags { AS_NO_WRITEBACK_TAGS = 5, }; +/** + * mapping_set_error - record a writeback error in the address_space + * @mapping - the mapping in which an error should be set + * @error - the error to set in the mapping + * + * When writeback fails in some way, we must record that error so that + * userspace can be informed when fsync and the like are called. We endeavor + * to report errors on any file that was open at the time of the error. Some + * internal callers also need to know when writeback errors have occurred. + * + * When a writeback error occurs, most filesystems will want to call + * mapping_set_error to record the error in the mapping so that it can be + * reported when the application calls fsync(2). + */ static inline void mapping_set_error(struct address_space *mapping, int error) { - if (unlikely(error)) { - if (error == -ENOSPC) - set_bit(AS_ENOSPC, >flags); - else - set_bit(AS_EIO, >flags); - } + if (likely(!error)) + return; + + /* Record in wb_err for checkers using errseq_t based tracking */ + filemap_set_wb_err(mapping, error); + + /* Record it in flags for now, for legacy callers */ + if (error == -ENOSPC) + set_bit(AS_ENOSPC, >flags); + else + set_bit(AS_EIO, >flags); } static inline void mapping_set_unevictable(struct address_space *mapping) -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 08/20] lib: add errseq_t type and infrastructure for handling it
An errseq_t is a way of recording errors in one place, and allowing any number of "subscribers" to tell whether an error has been set again since a previous time. It's implemented as an unsigned 32-bit value that is managed with atomic operations. The low order bits are designated to hold an error code (max size of MAX_ERRNO). The upper bits are used as a counter. The API works with consumers sampling an errseq_t value at a particular point in time. Later, that value can be used to tell whether new errors have been set since that time. Note that there is a 1 in 512k risk of collisions here if new errors are being recorded frequently, since we have so few bits to use as a counter. To mitigate this, one bit is used as a flag to tell whether the value has been sampled since a new value was recorded. That allows us to avoid bumping the counter if no one has sampled it since it was last bumped. Later patches will build on this infrastructure to change how writeback errors are tracked in the kernel. Signed-off-by: Jeff LaytonReviewed-by: NeilBrown Reviewed-by: Jan Kara --- include/linux/errseq.h | 19 + lib/Makefile | 2 +- lib/errseq.c | 200 + 3 files changed, 220 insertions(+), 1 deletion(-) create mode 100644 include/linux/errseq.h create mode 100644 lib/errseq.c diff --git a/include/linux/errseq.h b/include/linux/errseq.h new file mode 100644 index ..0d2555f310cd --- /dev/null +++ b/include/linux/errseq.h @@ -0,0 +1,19 @@ +#ifndef _LINUX_ERRSEQ_H +#define _LINUX_ERRSEQ_H + +/* See lib/errseq.c for more info */ + +typedef u32errseq_t; + +void __errseq_set(errseq_t *eseq, int err); +static inline void errseq_set(errseq_t *eseq, int err) +{ + /* Optimize for the common case of no error */ + if (unlikely(err)) + __errseq_set(eseq, err); +} + +errseq_t errseq_sample(errseq_t *eseq); +int errseq_check(errseq_t *eseq, errseq_t since); +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since); +#endif diff --git a/lib/Makefile b/lib/Makefile index 0166fbc0fa81..519782d9ca3f 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -41,7 +41,7 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o random32.o \ gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \ bsearch.o find_bit.o llist.o memweight.o kfifo.o \ percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \ -once.o refcount.o usercopy.o +once.o refcount.o usercopy.o errseq.o obj-y += string_helpers.o obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o obj-y += hexdump.o diff --git a/lib/errseq.c b/lib/errseq.c new file mode 100644 index ..d129c0611c1f --- /dev/null +++ b/lib/errseq.c @@ -0,0 +1,200 @@ +#include +#include +#include +#include + +/* + * An errseq_t is a way of recording errors in one place, and allowing any + * number of "subscribers" to tell whether it has changed since a previous + * point where it was sampled. + * + * It's implemented as an unsigned 32-bit value. The low order bits are + * designated to hold an error code (between 0 and -MAX_ERRNO). The upper bits + * are used as a counter. This is done with atomics instead of locking so that + * these functions can be called from any context. + * + * The general idea is for consumers to sample an errseq_t value. That value + * can later be used to tell whether any new errors have occurred since that + * sampling was done. + * + * Note that there is a risk of collisions if new errors are being recorded + * frequently, since we have so few bits to use as a counter. + * + * To mitigate this, one bit is used as a flag to tell whether the value has + * been sampled since a new value was recorded. That allows us to avoid bumping + * the counter if no one has sampled it since the last time an error was + * recorded. + * + * A new errseq_t should always be zeroed out. A errseq_t value of all zeroes + * is the special (but common) case where there has never been an error. An all + * zero value thus serves as the "epoch" if one wishes to know whether there + * has ever been an error set since it was first initialized. + */ + +/* The low bits are designated for error code (max of MAX_ERRNO) */ +#define ERRSEQ_SHIFT ilog2(MAX_ERRNO + 1) + +/* This bit is used as a flag to indicate whether the value has been seen */ +#define ERRSEQ_SEEN(1 << ERRSEQ_SHIFT) + +/* The lowest bit of the counter */ +#define ERRSEQ_CTR_INC (1 << (ERRSEQ_SHIFT + 1)) + +/** + * __errseq_set - set a errseq_t for later reporting + * @eseq: errseq_t field that should be set + * @err: error to set + * + * This function sets the error in *eseq, and increments the sequence counter + * if the last sequence was sampled at some point in the past. + * + * Any error set will always overwrite an existing error. + * + * Most callers will want to
[PATCH v6 18/20] ext4: use errseq_t based error handling for reporting data writeback errors
Add the FS_WB_ERRSEQ flag to indicate to other subsystems that errseq_t based error reporting for data writeback is in effect, and to opt-in to reporting those errors in call_fsync. ext4 uses the blockdev mapping for tracking metadata stored in the pagecache. Sample its wb_err when opening a file and store that in the f_md_wb_err field. Ensure that we check and advance the metadata before returning in fsync. Note that because metadata writeback errors are only tracked on a per-device level, this does mean that we'll end up reporting an error on all open file descriptors on this fs when there is a metadata writeback failure. That's not ideal, but we can possibly improve upon it in the future. ext4 also has several calls to filemap_fdatawait and filemap_write_and_wait. Those will also be changed in a later patch to versions that use errseq_t based reporting. Signed-off-by: Jeff Layton--- fs/ext4/dir.c | 8 ++-- fs/ext4/file.c | 5 - fs/ext4/fsync.c | 23 +++ fs/ext4/super.c | 6 +++--- 4 files changed, 32 insertions(+), 10 deletions(-) diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c index e8b365000d73..6bbb19510f74 100644 --- a/fs/ext4/dir.c +++ b/fs/ext4/dir.c @@ -611,9 +611,13 @@ static int ext4_dx_readdir(struct file *file, struct dir_context *ctx) static int ext4_dir_open(struct inode * inode, struct file * filp) { + int ret = 0; + if (ext4_encrypted_inode(inode)) - return fscrypt_get_encryption_info(inode) ? -EACCES : 0; - return 0; + ret = fscrypt_get_encryption_info(inode) ? -EACCES : 0; + if (!ret) + filp->f_md_wb_err = filemap_sample_wb_err(inode->i_sb->s_bdev->bd_inode->i_mapping); + return ret; } static int ext4_release_dir(struct inode *inode, struct file *filp) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 02ce7e7bbdf5..6e505269132c 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -435,7 +435,10 @@ static int ext4_file_open(struct inode * inode, struct file * filp) if (ret < 0) return ret; } - return dquot_file_open(inode, filp); + ret = dquot_file_open(inode, filp); + if (!ret) + filp->f_md_wb_err = filemap_sample_wb_err(sb->s_bdev->bd_inode->i_mapping); + return ret; } /* diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index 9d549608fd30..09e28b7fd3f6 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -100,8 +100,10 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) tid_t commit_tid; bool needs_barrier = false; - if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb - return -EIO; + if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb { + ret = -EIO; + goto out; + } J_ASSERT(ext4_journal_current_handle() == NULL); @@ -126,7 +128,8 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) ret = filemap_write_and_wait_range(inode->i_mapping, start, end); if (ret) - return ret; + goto out; + /* * data=writeback,ordered: * The caller's filemap_fdatawrite()/wait will sync the data. @@ -152,12 +155,24 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) needs_barrier = true; ret = jbd2_complete_transaction(journal, commit_tid); if (needs_barrier) { - issue_flush: +issue_flush: err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL); if (!ret) ret = err; } out: + /* +* Was there a metadata writeback error since last fsync? +* +* FIXME: ext4 tracks metadata with a whole-block device mapping. So, +* if there is any sort of metadata writeback error, we'll report an +* error on all open fds, even ones not associated with the inode +*/ + err = filemap_report_md_wb_err(file, + inode->i_sb->s_bdev->bd_inode->i_mapping); + if (!ret) + ret = err; + trace_ext4_sync_file_exit(inode, ret); return ret; } diff --git a/fs/ext4/super.c b/fs/ext4/super.c index d37c81f327e7..e98791ebee6f 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -119,7 +119,7 @@ static struct file_system_type ext2_fs_type = { .name = "ext2", .mount = ext4_mount, .kill_sb= kill_block_super, - .fs_flags = FS_REQUIRES_DEV, + .fs_flags = FS_REQUIRES_DEV|FS_WB_ERRSEQ, }; MODULE_ALIAS_FS("ext2"); MODULE_ALIAS("ext2"); @@ -134,7 +134,7 @@ static struct file_system_type ext3_fs_type = { .name = "ext3", .mount = ext4_mount, .kill_sb= kill_block_super, - .fs_flags = FS_REQUIRES_DEV, +
[PATCH v6 20/20] btrfs: minimal conversion to errseq_t writeback error reporting on fsync
Set the FS_WB_ERRSEQ flag to opt-in to errseq_t based reporting. Internal call to filemap_* functions are left as-is. Signed-off-by: Jeff Layton--- fs/btrfs/super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 4f1cdd5058f1..c99af09cd3e7 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -2184,7 +2184,7 @@ static struct file_system_type btrfs_fs_type = { .name = "btrfs", .mount = btrfs_mount, .kill_sb= btrfs_kill_super, - .fs_flags = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA, + .fs_flags = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA | FS_WB_ERRSEQ, }; MODULE_ALIAS_FS("btrfs"); -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 15/20] fs: have call_fsync call filemap_report_wb_err if FS_WB_ERRSEQ is set
Allow filesystems to opt-in to a final check of wb_err if FS_WB_ERRSEQ is set. Technically, we could just plumb these calls into all of the fsync operations, but I think this means less code, changes and churn. Signed-off-by: Jeff Layton--- include/linux/fs.h | 20 ++-- 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 17ba6284ab14..ef3feeec80b2 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1742,12 +1742,6 @@ static inline int call_mmap(struct file *file, struct vm_area_struct *vma) return file->f_op->mmap(file, vma); } -static inline int call_fsync(struct file *file, loff_t start, loff_t end, -int datasync) -{ - return file->f_op->fsync(file, start, end, datasync); -} - ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector, unsigned long nr_segs, unsigned long fast_segs, struct iovec *fast_pointer, @@ -2583,6 +2577,20 @@ static inline errseq_t filemap_sample_wb_err(struct address_space *mapping) return errseq_sample(>wb_err); } +static inline int call_fsync(struct file *file, loff_t start, loff_t end, +int datasync) +{ + int ret; + + ret = file->f_op->fsync(file, start, end, datasync); + if (file->f_mapping->host->i_sb->s_type->fs_flags & FS_WB_ERRSEQ) { + int ret2 = filemap_report_wb_err(file); + if (!ret) + ret = ret2; + } + return ret; +} + extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync); extern int vfs_fsync(struct file *file, int datasync); -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 13/13] btrfs: minimal conversion to errseq_t writeback error reporting on fsync
Internal callers of filemap_* functions are left as-is. Signed-off-by: Jeff Layton--- fs/btrfs/file.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index da1096eb1a40..4632f16bc49c 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2011,7 +2011,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) struct btrfs_root *root = BTRFS_I(inode)->root; struct btrfs_trans_handle *trans; struct btrfs_log_ctx ctx; - int ret = 0; + int ret = 0, err; bool full_sync = 0; u64 len; @@ -2030,7 +2030,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) */ ret = start_ordered_ops(inode, start, end); if (ret) - return ret; + goto out; inode_lock(inode); atomic_inc(>log_batch); @@ -2227,6 +2227,9 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) ret = btrfs_end_transaction(trans); } out: + err = filemap_report_wb_err(file); + if (!ret) + ret = err; return ret > 0 ? -EIO : ret; } -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 11/13] Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
I waxed a little loquacious here, but I figured that more detail was better, and writeback error handling is so hard to get right. Although I think we'll eventually remove it once the transition is complete, I've gone ahead and documented the FS_WB_ERRSEQ flag as well. Cc: Jan KaraSigned-off-by: Jeff Layton --- Documentation/filesystems/vfs.txt | 50 --- 1 file changed, 47 insertions(+), 3 deletions(-) diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index f42b90687d40..c3efdd833a3d 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -576,7 +576,49 @@ should clear PG_Dirty and set PG_Writeback. It can be actually written at any point after PG_Dirty is clear. Once it is known to be safe, PG_Writeback is cleared. -Writeback makes use of a writeback_control structure... +Writeback makes use of a writeback_control structure to direct the +operations. This gives the the writepage and writepages operations some +information about the nature of and reason for the writeback request, +and the constraints under which it is being done. It is also used to +return information back to the caller about the result of a writepage or +writepages request. + +Handling errors during writeback + +Most applications that utilize the pagecache will periodically call +fsync to ensure that data written has made it to the backing store. +When there is an error during writeback, expect that error to be +reported when fsync is called. After an error has been reported to +fsync, subsequent fsync calls on the same file descriptor should return +0, unless further writeback errors have occurred since the previous +fsync. + +Ideally, the kernel would report an error only on file descriptions on +which writes were done that subsequently failed to be written back. The +generic pagecache infrastructure does not track the file descriptions +that have dirtied each individual page however, so determining which +file descriptors should get back an error is not possible. + +Instead, the generic writeback error tracking infrastructure in the +kernel settles for reporting errors to fsync on all file descriptions +that were open at the time that the error occurred. In a situation with +multiple writers, all of them will get back an error on a subsequent fsync, +even if all of the writes done through that particular file descriptor +succeeded (or even if there were no writes on that file descriptor at all). + +Filesystems that wish to use this infrastructure should call +filemap_set_wb_err to record the error in the address_space when it +occurs. Then, at the end of their fsync operation, they should call +filemap_report_wb_err to ensure that the struct file's error cursor +has advanced to the correct point in the stream of errors emitted by +the backing device(s). + +Older kernels used a different method for tracking errors, based on flags +in the address_space. We're currently switching everything over to use +the infrastructure based on errseq_t values. During the transition, +filesystem authors will want to also ensure their file_system_type has +FS_WB_ERRSEQ set in fs_flags to ensure that shared infrastructure is +aware of the model in use. struct address_space_operations --- @@ -804,7 +846,8 @@ struct address_space_operations { The File Object === -A file object represents a file opened by a process. +A file object represents a file opened by a process. This is also known +as an "open file description" in POSIX parlance. struct file_operations @@ -887,7 +930,8 @@ otherwise noted. release: called when the last reference to an open file is closed - fsync: called by the fsync(2) system call + fsync: called by the fsync(2) system call. Also see the section above +entitled "Handling errors during writeback". fasync: called by the fcntl(2) system call when asynchronous (non-blocking) mode is enabled for a file -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 13/20] Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
Let's try to make this extra clear for fs authors. Also, although I think we'll eventually remove it once the transition is complete, I've gone ahead and documented the FS_WB_ERRSEQ flag as well. Cc: Jan KaraSigned-off-by: Jeff Layton --- Documentation/filesystems/vfs.txt | 48 --- 1 file changed, 45 insertions(+), 3 deletions(-) diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index f42b90687d40..0f6415c26385 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -576,7 +576,47 @@ should clear PG_Dirty and set PG_Writeback. It can be actually written at any point after PG_Dirty is clear. Once it is known to be safe, PG_Writeback is cleared. -Writeback makes use of a writeback_control structure... +Writeback makes use of a writeback_control structure to direct the +operations. This gives the the writepage and writepages operations some +information about the nature of and reason for the writeback request, +and the constraints under which it is being done. It is also used to +return information back to the caller about the result of a writepage or +writepages request. + +Handling errors during writeback + +Most applications that utilize the pagecache will periodically call +fsync to ensure that data written has made it to the backing store. +When there is an error during writeback, expect that error to be +reported when fsync is called. After an error has been reported to +fsync, subsequent fsync calls on the same file descriptor should return +0, unless further writeback errors have occurred since the previous +fsync. + +Ideally, the kernel would report an error only on file descriptions on +which writes were done that subsequently failed to be written back. The +generic pagecache infrastructure does not track the file descriptions +that have dirtied each individual page however, so determining which +file descriptors should get back an error is not possible. + +Instead, the generic writeback error tracking infrastructure in the +kernel settles for reporting errors to fsync on all file descriptions +that were open at the time that the error occurred. In a situation with +multiple writers, all of them will get back an error on a subsequent fsync, +even if all of the writes done through that particular file descriptor +succeeded (or even if there were no writes on that file descriptor at all). + +Filesystems that wish to use this infrastructure need to do two things: + +1) call mapping_set_error to record the error in the address_space when +one occurs. + +2) set FS_WB_ERRSEQ in the fs_flags field in the file_system_type to +indicate to other subsystems that the filesystem wants to use errseq_t +based error reporting for writeback. + +The flag may go away in the future or moved to an opt-out flag once +the majority of filesystems are converted to use errseq_t based reporting. struct address_space_operations --- @@ -804,7 +844,8 @@ struct address_space_operations { The File Object === -A file object represents a file opened by a process. +A file object represents a file opened by a process. This is also known +as an "open file description" in POSIX parlance. struct file_operations @@ -887,7 +928,8 @@ otherwise noted. release: called when the last reference to an open file is closed - fsync: called by the fsync(2) system call + fsync: called by the fsync(2) system call. Also see the section above +entitled "Handling errors during writeback". fasync: called by the fcntl(2) system call when asynchronous (non-blocking) mode is enabled for a file -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 19/20] xfs: minimal conversion to errseq_t writeback error reporting
Just set the FS_WB_ERRSEQ flag to indicate that we want to use errseq_t based error reporting. Internal filemap_* calls are left as-is for now. Signed-off-by: Jeff Layton--- fs/xfs/xfs_super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 455a575f101d..28d3be187025 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1758,7 +1758,7 @@ static struct file_system_type xfs_fs_type = { .name = "xfs", .mount = xfs_fs_mount, .kill_sb= kill_block_super, - .fs_flags = FS_REQUIRES_DEV, + .fs_flags = FS_REQUIRES_DEV | FS_WB_ERRSEQ, }; MODULE_ALIAS_FS("xfs"); -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 09/20] fs: new infrastructure for writeback error handling and reporting
Most filesystems currently use mapping_set_error and filemap_check_errors for setting and reporting/clearing writeback errors at the mapping level. filemap_check_errors is indirectly called from most of the filemap_fdatawait_* functions and from filemap_write_and_wait*. These functions are called from all sorts of contexts to wait on writeback to finish -- e.g. mostly in fsync, but also in truncate calls, getattr, etc. The non-fsync callers are problematic. We should be reporting writeback errors during fsync, but many places spread over the tree clear out errors before they can be properly reported, or report errors at nonsensical times. If I get -EIO on a stat() call, there is no reason for me to assume that it is because some previous writeback failed. The fact that it also clears out the error such that a subsequent fsync returns 0 is a bug, and a nasty one since that's potentially silent data corruption. This patch adds a small bit of new infrastructure for setting and reporting errors during address_space writeback. While the above was my original impetus for adding this, I think it's also the case that current fsync semantics are just problematic for userland. Most applications that call fsync do so to ensure that the data they wrote has hit the backing store. In the case where there are multiple writers to the file at the same time, this is really hard to determine. The first one to call fsync will see any stored error, and the rest get back 0. The processes with open fds may not be associated with one another in any way. They could even be in different containers, so ensuring coordination between all fsync callers is not really an option. One way to remedy this would be to track what file descriptor was used to dirty the file, but that's rather cumbersome and would likely be slow. However, there is a simpler way to improve the semantics here without incurring too much overhead. This set adds an errseq_t to struct address_space, and a corresponding one is added to struct file. Writeback errors are recorded in the mapping's errseq_t, and the one in struct file is used as the "since" value. This changes the semantics of the Linux fsync implementation such that applications can now use it to determine whether there were any writeback errors since fsync(fd) was last called (or since the file was opened in the case of fsync having never been called). Note that those writeback errors may have occurred when writing data that was dirtied via an entirely different fd, but that's the case now with the current mapping_set_error/filemap_check_error infrastructure. This will at least prevent you from getting a false report of success. The new behavior is still consistent with the POSIX spec, and is more reliable for application developers. This patch just adds some basic infrastructure for doing this, and ensures that the f_wb_err "cursor" is properly set when a file is opened. Later patches will change the existing code to use this new infrastructure. Signed-off-by: Jeff LaytonReviewed-by: Jan Kara --- drivers/dax/device.c | 1 + fs/block_dev.c | 1 + fs/file_table.c | 1 + fs/open.c| 3 +++ include/linux/fs.h | 53 mm/filemap.c | 38 + 6 files changed, 97 insertions(+) diff --git a/drivers/dax/device.c b/drivers/dax/device.c index 006e657dfcb9..12943d19bfc4 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -499,6 +499,7 @@ static int dax_open(struct inode *inode, struct file *filp) inode->i_mapping = __dax_inode->i_mapping; inode->i_mapping->host = __dax_inode; filp->f_mapping = inode->i_mapping; + filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping); filp->private_data = dev_dax; inode->i_flags = S_DAX; diff --git a/fs/block_dev.c b/fs/block_dev.c index 51959936..4d62fe771587 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1743,6 +1743,7 @@ static int blkdev_open(struct inode * inode, struct file * filp) return -ENOMEM; filp->f_mapping = bdev->bd_inode->i_mapping; + filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping); return blkdev_get(bdev, filp->f_mode, filp); } diff --git a/fs/file_table.c b/fs/file_table.c index 954d510b765a..72e861a35a7f 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -168,6 +168,7 @@ struct file *alloc_file(const struct path *path, fmode_t mode, file->f_path = *path; file->f_inode = path->dentry->d_inode; file->f_mapping = path->dentry->d_inode->i_mapping; + file->f_wb_err = filemap_sample_wb_err(file->f_mapping); if ((mode & FMODE_READ) && likely(fop->read || fop->read_iter)) mode |= FMODE_CAN_READ; diff --git a/fs/open.c b/fs/open.c index cd0c5be8d012..280d4a963791 100644 --- a/fs/open.c +++ b/fs/open.c @@ -707,6
[PATCH v6 01/20] mm: fix mapping_set_error call in me_pagecache_dirty
The error code should be negative. Since this ends up in the default case anyway, this is harmless, but it's less confusing to negate it. Also, later patches will require a negative error code here. Signed-off-by: Jeff LaytonReviewed-by: Ross Zwisler Reviewed-by: Jan Kara Reviewed-by: Matthew Wilcox Reviewed-by: Christoph Hellwig --- mm/memory-failure.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 342fac9ba89b..55bc61791fe1 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -684,7 +684,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn) * the first EIO, but we're not worse than other parts * of the kernel. */ - mapping_set_error(mapping, EIO); + mapping_set_error(mapping, -EIO); } return me_pagecache_clean(p, pfn); -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 07/20] mm: clean up error handling in write_one_page
Don't try to check PageError since that's potentially racy and not necessarily going to be set after writepage errors out. Instead, check the mapping for an error after writepage returns. Signed-off-by: Jeff LaytonReviewed-by: Jan Kara --- mm/page-writeback.c | 15 +++ 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index b901fe52b153..e369e8ea2a29 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2371,14 +2371,13 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc) * * The page must be locked by the caller and will be unlocked upon return. * - * write_one_page() returns a negative error code if I/O failed. Note that - * the address_space is not marked for error. The caller must do this if - * needed. + * Note that the mapping's AS_EIO/AS_ENOSPC flags will be cleared when this + * function returns. */ int write_one_page(struct page *page) { struct address_space *mapping = page->mapping; - int ret = 0; + int ret = 0, ret2; struct writeback_control wbc = { .sync_mode = WB_SYNC_ALL, .nr_to_write = 1, @@ -2391,15 +2390,15 @@ int write_one_page(struct page *page) if (clear_page_dirty_for_io(page)) { get_page(page); ret = mapping->a_ops->writepage(page, ); - if (ret == 0) { + if (ret == 0) wait_on_page_writeback(page); - if (PageError(page)) - ret = -EIO; - } put_page(page); } else { unlock_page(page); } + + if (!ret) + ret = filemap_check_errors(mapping); return ret; } EXPORT_SYMBOL(write_one_page); -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] btrfs: add compression trace points
On 12.06.2017 12:44, Anand Jain wrote: > > Thanks for the review. > >> I haven't read previous submissions and any discussions that might have >> occurred there but why not just stick to >> btrfs_data_compression/btrfs_data_compressor. I know there is certain >> semantic overload since we call it a compressor yet it also does >> decompression but let's focus on making the code/intention clear for the >> code reader and not bogging down too much on actual word semantics. To >> me "compressor" is a synonym to something which compresses AND >> decompresses. It's very well possible that this is just me in which case >> my argument is flawed and you can ignore it :) > > The same tracer will also trace encryption at some point in future. Right, in that case why don't you introduce the concept of stages/pipeline. We can have a compression stage with either compress/decompress ops. We can have an encryption stage with encrypt/decrypt? How does that abstraction sound? > >>> +#define show_transformer_type(type)\ >> >> Why not show_compression_type ? > > as above. > >>> +TP_printk_btrfs("%s %s ino=%lu type=%s len_before=%lu >>> len_after=%lu " >>> +"start=%lu ret=%d", >>> +__entry->uncompress ? "untransform":"transform", >> >> decompress/compress. Transform/untransform are as cryptic as they can >> be. It's a lot easier to put those terms in context when reading the >> len_before/len_after values. > >> Otherwise one might ask themselves "What >> kind of transformation are we talking about?" >> >> Even if you don't do a v3 you can add: > > ha. Thanks. > > current trace an example: > untransform bio ino=258 type=lzo len_before=4096 len_after=16384 > start=393216 ret=0 Now that I"m seeing this, why not prepent something in front of bio/page .e.g granularity/unit: unit=bio granularity=page I'm more inclined towards the latter. > > before after size may be same in encryption, so we need extra specifier > about the operation. > > How about: (for example) > de-lzo bio ino=258 len_before=4096 len_after=16384 start=393216 ret=0 > lzo ino=258 .. de-lzo still seems a bit botched. what about stage: compress op=lzo-decompress/zlib-compress granularity=bio/page stage: encryption op=XTS-decrypt/encrypt ... ? > > Thanks, Anand > > >> Reviewed-by: Nikolay Borisov> -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs: use ino_t for inode in tracer
Signed-off-by: Anand Jain--- include/trace/events/btrfs.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index e37973526153..6494e34b6df9 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -1459,7 +1459,7 @@ DECLARE_EVENT_CLASS(btrfs__qgroup_rsv_data, TP_STRUCT__entry_btrfs( __field(u64,rootid ) - __field(unsigned long, ino ) + __field(ino_t, ino ) __field(u64,start ) __field(u64,len ) __field(u64,reserved) -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] btrfs: add compression trace points
Thanks for the review. I haven't read previous submissions and any discussions that might have occurred there but why not just stick to btrfs_data_compression/btrfs_data_compressor. I know there is certain semantic overload since we call it a compressor yet it also does decompression but let's focus on making the code/intention clear for the code reader and not bogging down too much on actual word semantics. To me "compressor" is a synonym to something which compresses AND decompresses. It's very well possible that this is just me in which case my argument is flawed and you can ignore it :) The same tracer will also trace encryption at some point in future. +#define show_transformer_type(type)\ Why not show_compression_type ? as above. + TP_printk_btrfs("%s %s ino=%lu type=%s len_before=%lu len_after=%lu " + "start=%lu ret=%d", + __entry->uncompress ? "untransform":"transform", decompress/compress. Transform/untransform are as cryptic as they can be. It's a lot easier to put those terms in context when reading the len_before/len_after values. Otherwise one might ask themselves "What kind of transformation are we talking about?" Even if you don't do a v3 you can add: ha. Thanks. current trace an example: untransform bio ino=258 type=lzo len_before=4096 len_after=16384 start=393216 ret=0 before after size may be same in encryption, so we need extra specifier about the operation. How about: (for example) de-lzo bio ino=258 len_before=4096 len_after=16384 start=393216 ret=0 lzo ino=258 .. Thanks, Anand Reviewed-by: Nikolay Borisov-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
csum failed root -9
Hi all, there is 1-block corruption a 8TB filesystem that showed up several months ago. The fs is almost exclusively a btrfs receive target and receives monthly sequential snapshots from two hosts but 1 received uuid. I do not know exactly when the corruption has happened but it must have been roughly 3 to 6 months ago. with monthly updated kernel+progs on that host. Some more history: - fs was created in november 2015 on top of luks - initially bcache between the 2048-sector aligned partition and luks. Some months ago I removed 'the bcache layer' by making sure that cache was clean and then zeroing 8K bytes at start of partition in an isolated situation. Then setting partion offset to 2064 by delete-recreate in gdisk. - in december 2016 there were more scrub errors, but related to the monthly snapshot of december2016. I have removed that snapshot this year and now only this 1-block csum error is the only issue. - brand/type is seagate 8TB SMR. At least since kernel 4.4+ that includes some SMR related changes in the blocklayer this disk works fine with btrfs. - the smartctl values show no error so far but I will run an extended test this week after another btrfs check which did not show any error earlier with the csum fail being there - I have noticed that the board that has the disk attached has been rebooted due to power-failures many times (unreliable power switch and power dips from energy company) and the 150W powersupply is broken and replaced since then. Also due to this, I decided to remove bcache (which has been in write-through and write-around only). Some btrfs inpect-internal exercise shows that the problem is in a directory in the root that contains most of the data and snapshots. But an rsync -c with an identical other clone snapshot shows no difference (no writes to an rw snapshot of that clone). So the fs is still OK as file-level backup, but btrfs replace/balance will fatal error on just this 1 csum error. It looks like that this is not a media/disk error but some HW induced error or SW/kernel issue. Relevant btrfs commands + dmesg info, see below. Any comments on how to fix or handle this without incrementally sending all snapshots to a new fs (6+ TiB of data, assuming this won't fail)? # uname -r 4.11.3-1-default # btrfs --version btrfs-progs v4.10.2+20170406 fs profile is dup for system+meta, single for data # btrfs scrub start /local/smr [27609.626555] BTRFS error (device dm-0): parent transid verify failed on 6350718500864 wanted 23170 found 23076 [27609.685416] BTRFS info (device dm-0): read error corrected: ino 1 off 6350718500864 (dev /dev/mapper/smr sector 11681212672) [27609.685928] BTRFS info (device dm-0): read error corrected: ino 1 off 6350718504960 (dev /dev/mapper/smr sector 11681212680) [27609.686160] BTRFS info (device dm-0): read error corrected: ino 1 off 6350718509056 (dev /dev/mapper/smr sector 11681212688) [27609.687136] BTRFS info (device dm-0): read error corrected: ino 1 off 6350718513152 (dev /dev/mapper/smr sector 11681212696) [37663.606455] BTRFS error (device dm-0): parent transid verify failed on 6350453751808 wanted 23170 found 23075 [37663.685158] BTRFS info (device dm-0): read error corrected: ino 1 off 6350453751808 (dev /dev/mapper/smr sector 11679647008) [37663.685386] BTRFS info (device dm-0): read error corrected: ino 1 off 6350453755904 (dev /dev/mapper/smr sector 11679647016) [37663.685587] BTRFS info (device dm-0): read error corrected: ino 1 off 635045376 (dev /dev/mapper/smr sector 11679647024) [37663.685798] BTRFS info (device dm-0): read error corrected: ino 1 off 6350453764096 (dev /dev/mapper/smr sector 11679647032) [43497.234598] BTRFS error (device dm-0): bdev /dev/mapper/smr errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 [43497.234605] BTRFS error (device dm-0): unable to fixup (regular) error at logical 7175413624832 on dev /dev/mapper/smr # < figure out which chunk with help of btrfs py lib > chunk vaddr 7174898057216 type 1 stripe 0 devid 1 offset 6696948727808 length 1073741824 used 1073741824 used_pct 100 chunk vaddr 7175971799040 type 1 stripe 0 devid 1 offset 6698022469632 length 1073741824 used 1073741824 used_pct 100 # btrfs balance start -v -dvrange=7174898057216..7174898057217 /local/smr [74250.913273] BTRFS info (device dm-0): relocating block group 7174898057216 flags data [74255.941105] BTRFS warning (device dm-0): csum failed root -9 ino 257 off 515567616 csum 0x589cb236 expected csum 0xee19bf74 mirror 1 [74255.965804] BTRFS warning (device dm-0): csum failed root -9 ino 257 off 515567616 csum 0x589cb236 expected csum 0xee19bf74 mirror 1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] btrfs: add compression trace points
On 12.06.2017 11:32, Anand Jain wrote: > This patch adds compression and decompression trace points for the > purpose of debugging. > > Signed-off-by: Anand Jain> --- > v2: > . Use better naming. >(If transform is not good enough I have run out of ideas, pls suggest). > . To be applied on top of >git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next >(tested without namelen check patch set). I haven't read previous submissions and any discussions that might have occurred there but why not just stick to btrfs_data_compression/btrfs_data_compressor. I know there is certain semantic overload since we call it a compressor yet it also does decompression but let's focus on making the code/intention clear for the code reader and not bogging down too much on actual word semantics. To me "compressor" is a synonym to something which compresses AND decompresses. It's very well possible that this is just me in which case my argument is flawed and you can ignore it :) > > fs/btrfs/compression.c | 11 +++ > include/trace/events/btrfs.h | 44 > > 2 files changed, 55 insertions(+) > > diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c > index fd6508bcff77..53908722d80e 100644 > --- a/fs/btrfs/compression.c > +++ b/fs/btrfs/compression.c > @@ -884,6 +884,10 @@ int btrfs_compress_pages(int type, struct address_space > *mapping, > start, pages, > out_pages, > total_in, total_out); > + > + trace_btrfs_data_transformer(0, 0, mapping->host, type, *total_in, > + *total_out, start, ret); > + > free_workspace(type, workspace); > return ret; > } > @@ -910,6 +914,10 @@ static int btrfs_decompress_bio(struct compressed_bio > *cb) > > workspace = find_workspace(type); > ret = btrfs_compress_op[type - 1]->decompress_bio(workspace, cb); > + > + trace_btrfs_data_transformer(1, 1, cb->inode, type, > + cb->compressed_len, cb->len, cb->start, ret); > + > free_workspace(type, workspace); > > return ret; > @@ -932,6 +940,9 @@ int btrfs_decompress(int type, unsigned char *data_in, > struct page *dest_page, > dest_page, start_byte, > srclen, destlen); > > + trace_btrfs_data_transformer(1, 0, dest_page->mapping->host, > + type, srclen, destlen, start_byte, ret); > + > free_workspace(type, workspace); > return ret; > } > diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h > index cd99a3658156..7c2442dab6a0 100644 > --- a/include/trace/events/btrfs.h > +++ b/include/trace/events/btrfs.h > @@ -1622,6 +1622,50 @@ TRACE_EVENT(qgroup_meta_reserve, > show_root_type(__entry->refroot), __entry->diff) > ); > > +#define show_transformer_type(type) \ Why not show_compression_type ? > + __print_symbolic(type, \ > + { BTRFS_COMPRESS_ZLIB, "zlib" }, \ > + { BTRFS_COMPRESS_LZO, "lzo" }) > + > +TRACE_EVENT(btrfs_data_transformer, > + > + TP_PROTO(int uncompress, int its_bio, struct inode *inode, int type, > + unsigned long len_before, unsigned long len_after, > + unsigned long start, int ret)> + > + TP_ARGS(uncompress, its_bio, inode, type, len_before, > + len_after, start, ret), > + > + TP_STRUCT__entry_btrfs( > + __field(int,uncompress) > + __field(int,its_bio) > + __field(ino_t, i_ino) > + __field(int,type) > + __field(unsigned long, len_before) > + __field(unsigned long, len_after) > + __field(unsigned long, start) > + __field(int,ret) > + ), > + > + TP_fast_assign_btrfs(btrfs_sb(inode->i_sb), > + __entry->uncompress = uncompress; > + __entry->its_bio= its_bio; > + __entry->i_ino = inode->i_ino; > + __entry->type = type; > + __entry->len_before = len_before; > + __entry->len_after = len_after; > + __entry->start = start; > + __entry->ret= ret; > + ), > + > + TP_printk_btrfs("%s %s ino=%lu type=%s len_before=%lu len_after=%lu " > + "start=%lu ret=%d", > + __entry->uncompress ? "untransform":"transform", decompress/compress. Transform/untransform are as cryptic as they can be. It's a lot
[PATCH v2] btrfs: add compression trace points
This patch adds compression and decompression trace points for the purpose of debugging. Signed-off-by: Anand Jain--- v2: . Use better naming. (If transform is not good enough I have run out of ideas, pls suggest). . To be applied on top of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next (tested without namelen check patch set). fs/btrfs/compression.c | 11 +++ include/trace/events/btrfs.h | 44 2 files changed, 55 insertions(+) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index fd6508bcff77..53908722d80e 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -884,6 +884,10 @@ int btrfs_compress_pages(int type, struct address_space *mapping, start, pages, out_pages, total_in, total_out); + + trace_btrfs_data_transformer(0, 0, mapping->host, type, *total_in, + *total_out, start, ret); + free_workspace(type, workspace); return ret; } @@ -910,6 +914,10 @@ static int btrfs_decompress_bio(struct compressed_bio *cb) workspace = find_workspace(type); ret = btrfs_compress_op[type - 1]->decompress_bio(workspace, cb); + + trace_btrfs_data_transformer(1, 1, cb->inode, type, + cb->compressed_len, cb->len, cb->start, ret); + free_workspace(type, workspace); return ret; @@ -932,6 +940,9 @@ int btrfs_decompress(int type, unsigned char *data_in, struct page *dest_page, dest_page, start_byte, srclen, destlen); + trace_btrfs_data_transformer(1, 0, dest_page->mapping->host, + type, srclen, destlen, start_byte, ret); + free_workspace(type, workspace); return ret; } diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index cd99a3658156..7c2442dab6a0 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -1622,6 +1622,50 @@ TRACE_EVENT(qgroup_meta_reserve, show_root_type(__entry->refroot), __entry->diff) ); +#define show_transformer_type(type)\ + __print_symbolic(type, \ + { BTRFS_COMPRESS_ZLIB, "zlib" }, \ + { BTRFS_COMPRESS_LZO, "lzo" }) + +TRACE_EVENT(btrfs_data_transformer, + + TP_PROTO(int uncompress, int its_bio, struct inode *inode, int type, + unsigned long len_before, unsigned long len_after, + unsigned long start, int ret), + + TP_ARGS(uncompress, its_bio, inode, type, len_before, + len_after, start, ret), + + TP_STRUCT__entry_btrfs( + __field(int,uncompress) + __field(int,its_bio) + __field(ino_t, i_ino) + __field(int,type) + __field(unsigned long, len_before) + __field(unsigned long, len_after) + __field(unsigned long, start) + __field(int,ret) + ), + + TP_fast_assign_btrfs(btrfs_sb(inode->i_sb), + __entry->uncompress = uncompress; + __entry->its_bio= its_bio; + __entry->i_ino = inode->i_ino; + __entry->type = type; + __entry->len_before = len_before; + __entry->len_after = len_after; + __entry->start = start; + __entry->ret= ret; + ), + + TP_printk_btrfs("%s %s ino=%lu type=%s len_before=%lu len_after=%lu " + "start=%lu ret=%d", + __entry->uncompress ? "untransform":"transform", + __entry->its_bio ? "bio":"page", __entry->i_ino, + show_transformer_type(__entry->type), __entry->len_before, + __entry->len_after, __entry->start, __entry->ret) + +); #endif /* _TRACE_BTRFS_H */ /* This part must be outside protection */ -- 2.13.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Filesystem won't mount (open_ctree failed) or repair (BUG_ON)
Op 12-06-17 om 01:00 schreef Chris Murphy: > On Sun, Jun 11, 2017 at 4:13 AM, Koen Kooiwrote: >> >>> Op 11 jun. 2017, om 12:05 heeft Koen Kooi het >>> volgende geschreven: >>> Op 11 jun. 2017, om 06:20 heeft Chris Murphy het volgende geschreven: On Fri, Jun 9, 2017 at 1:57 PM, Hugo Mills wrote: >> >> [..] >> I'd say take a btrfs-image and put it up somewhere and also file a bug. The fsck should not crash. >>> >>> I’ll create a bugzilla account and file a bug for that. >> >> Done: https://bugzilla.kernel.org/show_bug.cgi?id=196031 >> >> Btrfs-image is still running, will put it online when it has finished. It’s >> already 1.2G: >> >> koen@beast:~$ du -ms /data/backup/btrfs.img >> 1205/data/backup/btrfs.img > > Hopefully you're using -s -t4 -c9 but if not you can at least compress > it after the fact but that takes even longer. I wasn't using '-s', after I did it and ran it again it shrunk from 14GiB to 733MiB: https://dominion.thruhere.net/btrfs.img xz -9 -e wouldn't only compressed that 14GiB file to 13.99GiB, so I wonder why '-s' seems to make such a difference. regards, Koen -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Filesystem won't mount (open_ctree failed) or repair (BUG_ON)
Op 12-06-17 om 00:58 schreef Chris Murphy: [..] > Also worth trying btrfs check --mode=lowmem. This doesn't repair but > is a whole new implementation so it might find the source of the > problem better than the current fsck. I ran it under 'catchsegv' to give more data where it segfaults, here's the log: https://dominion.thruhere.net/btrfsck-lowmem.txt.gz It's 688K compressed and 16MiB uncompressed. regards, Koen -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html