Re:
On Thu, Nov 22, 2018 at 11:41 PM Andy Leadbetter wrote:
>
> I have a failing 2TB disk that is part of a 4 disk RAID 6 system. I
> have added a new 2TB disk to the computer, and started a BTRFS replace
> for the old and new disk. The process starts correctly, however some
> hours into the job there is an error and kernel oops; relevant log
> below.

The relevant log is the entire dmesg, not a snippet. It's decently likely there's more than one thing going on here.

We also need full output of 'smartctl -x' for all four drives, 'smartctl -l scterc' for all four drives, and 'cat /sys/block/sda/device/timeout' for all four drives. And which bcache mode you're using.

The call trace provided is from kernel 4.15, which is sufficiently long ago that I think any dev working on raid56 would want to see where it's getting tripped up on something a lot newer, and this is why:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/raid56.c?id=v4.19.3&id2=v4.15.1

That's a lot of changes in just the raid56 code between 4.15 and 4.19. And then in your call trace, btrfs_dev_replace_start is found in dev-replace.c, which likewise has a lot of changes.

But then also, I think 4.15 might still be in the era where it was not recommended to use 'btrfs dev replace' for raid56, only non-raid56. I'm not sure if the problems with device replace were fixed, and if they were, whether the fix was kernel or progs side. Anyway, the latest I recall, the recommendation on raid56 was 'btrfs dev add' followed by 'btrfs dev remove'.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/dev-replace.c?id=v4.19.3&id2=v4.15.1

And that's only a few hundred changes for each. Check out inode.c - there are over 2000 changes.

> The disks are configured on top of bcache, in 5 arrays with a small
> 128GB SSD cache shared. The system in this configuration has worked
> perfectly for 3 years, until 2 weeks ago csum errors started
> appearing. I have a crashplan backup of all files on the disk, so I
> am not concerned about data loss, but I would like to avoid rebuilding
> the system.

btrfs-progs 4.17 still considers raid56 experimental, not for production use. And three years ago the current upstream kernel was 4.3, so I'm going to guess the kernel history of this file system goes back even older than that, very close to the birth of the raid56 code. And then adding bcache to this mix just makes it all the more complicated.

> btrfs dev stats shows
> [/dev/bcache0].write_io_errs    0
> [/dev/bcache0].read_io_errs     0
> [/dev/bcache0].flush_io_errs    0
> [/dev/bcache0].corruption_errs  0
> [/dev/bcache0].generation_errs  0
> [/dev/bcache1].write_io_errs    0
> [/dev/bcache1].read_io_errs     20
> [/dev/bcache1].flush_io_errs    0
> [/dev/bcache1].corruption_errs  0
> [/dev/bcache1].generation_errs  14
> [/dev/bcache3].write_io_errs    0
> [/dev/bcache3].read_io_errs     0
> [/dev/bcache3].flush_io_errs    0
> [/dev/bcache3].corruption_errs  0
> [/dev/bcache3].generation_errs  19
> [/dev/bcache2].write_io_errs    0
> [/dev/bcache2].read_io_errs     0
> [/dev/bcache2].flush_io_errs    0
> [/dev/bcache2].corruption_errs  0
> [/dev/bcache2].generation_errs  2

3 of 4 drives have at least one generation error. While there are no corruptions reported, generation errors can be really tricky to recover from at all. If only one device had only read errors, this would be a lot less difficult.

> I've tried the latest kernel, and the latest tools, but nothing will
> allow me to replace, or delete, the failed disk.

If the file system is mounted, I would try to make a local backup ASAP before you lose the whole volume. Whether it's an LVM pool of two drives (linear/concat) with XFS, or Btrfs -dsingle -mraid1 (also basically a concat), doesn't really matter; I'd get whatever you can off the drive. I expect avoiding a rebuild in some form or another is very wishful thinking and not very likely. Every change made to the file system, whether repair attempts or other writes, decreases the chance of recovery.

-- 
Chris Murphy
Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3
On Thu, Nov 22, 2018 at 6:07 AM Tomasz Chmielewski wrote:
>
> On 2018-11-22 21:46, Nikolay Borisov wrote:
>
> >> # echo w > /proc/sysrq-trigger
> >>
> >> # dmesg -c
> >> [  931.585611] sysrq: SysRq : Show Blocked State
> >> [  931.585715] task            PC stack   pid father
> >> [  931.590168] btrfs-cleaner   D 0  1340  2 0x8000
> >> [  931.590175] Call Trace:
> >> [  931.590190]  __schedule+0x29e/0x840
> >> [  931.590195]  schedule+0x2c/0x80
> >> [  931.590199]  schedule_timeout+0x258/0x360
> >> [  931.590204]  io_schedule_timeout+0x1e/0x50
> >> [  931.590208]  wait_for_completion_io+0xb7/0x140
> >> [  931.590214]  ? wake_up_q+0x80/0x80
> >> [  931.590219]  submit_bio_wait+0x61/0x90
> >> [  931.590225]  blkdev_issue_discard+0x7a/0xd0
> >> [  931.590266]  btrfs_issue_discard+0x123/0x160 [btrfs]
> >> [  931.590299]  btrfs_discard_extent+0xd8/0x160 [btrfs]
> >> [  931.590335]  btrfs_finish_extent_commit+0xe2/0x240 [btrfs]
> >> [  931.590382]  btrfs_commit_transaction+0x573/0x840 [btrfs]
> >> [  931.590415]  ? btrfs_block_rsv_check+0x25/0x70 [btrfs]
> >> [  931.590456]  __btrfs_end_transaction+0x2be/0x2d0 [btrfs]
> >> [  931.590493]  btrfs_end_transaction_throttle+0x13/0x20 [btrfs]
> >> [  931.590530]  btrfs_drop_snapshot+0x489/0x800 [btrfs]
> >> [  931.590567]  btrfs_clean_one_deleted_snapshot+0xbb/0xf0 [btrfs]
> >> [  931.590607]  cleaner_kthread+0x136/0x160 [btrfs]
> >> [  931.590612]  kthread+0x120/0x140
> >> [  931.590646]  ? btree_submit_bio_start+0x20/0x20 [btrfs]
> >> [  931.590658]  ? kthread_bind+0x40/0x40
> >> [  931.590661]  ret_from_fork+0x22/0x40
> >>
> >
> > It seems your filesystem is mounted with the DISCARD option, meaning
> > every delete will result in a discard; this is highly suboptimal for
> > SSDs. Try remounting the fs without the discard option and see if it helps.
> > Generally for discard you want to submit it in big batches (what fstrim
> > does) so that the FTL on the SSD can apply any optimisations it might
> > have up its sleeve.
>
> Spot on!
>
> Removed "discard" from fstab and added "ssd", rebooted - no more
> btrfs-cleaner running.
>
> Do you know if the issue you described ("discard is highly
> suboptimal for SSDs") affects other filesystems as well to a similar
> extent? I.e. if using ext4 on an SSD?

Quite a lot of the activity on ext4 and XFS is overwrites, so discard isn't needed as much. And there discard might be subject to delays. On Btrfs it's almost immediate, to the degree that on a couple of SSDs I've tested, stale trees referenced exclusively by the most recent backup tree entries in the superblock are already zeros. That functionally means no automatic recoveries at mount time if there's a problem with any of the current trees. I was using it for about a year to no ill effect, BUT without a lot of file deletions either. I wouldn't recommend it, and instead suggest enabling fstrim.timer, which by default runs fstrim.service once a week (which in turn issues fstrim, I think on all mounted volumes).

I am a bit more concerned about the read errors you had that were being corrected automatically. The corruption suggests a firmware bug related to trim. I'd check the affected SSD's firmware revision and consider updating it (only after a backup; it's plausible the firmware update is not guaranteed to be data safe). Does the volume use DUP or raid1 metadata? I'm not sure how it's correcting these problems otherwise.

-- 
Chris Murphy
[no subject]
I have a failing 2TB disk that is part of a 4 disk RAID 6 system. I have added a new 2TB disk to the computer, and started a BTRFS replace for the old and new disk. The process starts correctly, however some hours into the job there is an error and kernel oops; relevant log below.

The disks are configured on top of bcache, in 5 arrays with a small 128GB SSD cache shared. The system in this configuration has worked perfectly for 3 years, until 2 weeks ago when csum errors started appearing. I have a crashplan backup of all files on the disk, so I am not concerned about data loss, but I would like to avoid rebuilding the system.

btrfs dev stats shows

[/dev/bcache0].write_io_errs    0
[/dev/bcache0].read_io_errs     0
[/dev/bcache0].flush_io_errs    0
[/dev/bcache0].corruption_errs  0
[/dev/bcache0].generation_errs  0
[/dev/bcache1].write_io_errs    0
[/dev/bcache1].read_io_errs     20
[/dev/bcache1].flush_io_errs    0
[/dev/bcache1].corruption_errs  0
[/dev/bcache1].generation_errs  14
[/dev/bcache3].write_io_errs    0
[/dev/bcache3].read_io_errs     0
[/dev/bcache3].flush_io_errs    0
[/dev/bcache3].corruption_errs  0
[/dev/bcache3].generation_errs  19
[/dev/bcache2].write_io_errs    0
[/dev/bcache2].read_io_errs     0
[/dev/bcache2].flush_io_errs    0
[/dev/bcache2].corruption_errs  0
[/dev/bcache2].generation_errs  2

and a SMART test of the backing disk for /dev/bcache1 shows a high read error rate and a lot of reallocated sectors. The disk is 10 years old, and has clearly started to fail.

I've tried the latest kernel and the latest tools, but nothing will allow me to replace, or delete, the failed disk.
[  884.171025] BTRFS info (device bcache0): dev_replace from /dev/bcache1 (devid 2) to /dev/bcache4 started
[ 3301.101958] BTRFS error (device bcache0): parent transid verify failed on 8251260944384 wanted 640926 found 640907
[ 3301.241214] BTRFS error (device bcache0): parent transid verify failed on 8251260944384 wanted 640926 found 640907
[ 3301.241398] BTRFS error (device bcache0): parent transid verify failed on 8251260944384 wanted 640926 found 640907
[ 3301.241513] BTRFS error (device bcache0): parent transid verify failed on 8251260944384 wanted 640926 found 640907
[ 3302.381094] BTRFS error (device bcache0): btrfs_scrub_dev(/dev/bcache1, 2, /dev/bcache4) failed -5
[ 3302.394612] WARNING: CPU: 0 PID: 5936 at /build/linux-5s7Xkn/linux-4.15.0/fs/btrfs/dev-replace.c:413 btrfs_dev_replace_start+0x281/0x320 [btrfs]
[ 3302.394613] Modules linked in: btrfs zstd_compress xor raid6_pq bcache intel_rapl x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi coretemp kvm_intel snd_hda_codec_realtek kvm snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hwdep snd_pcm pcbc snd_seq_midi aesni_intel snd_seq_midi_event joydev input_leds aes_x86_64 snd_rawmidi crypto_simd glue_helper snd_seq eeepc_wmi cryptd asus_wmi snd_seq_device snd_timer wmi_bmof sparse_keymap snd intel_cstate intel_rapl_perf soundcore mei_me mei shpchp mac_hid sch_fq_codel acpi_pad parport_pc ppdev lp parport ip_tables x_tables autofs4 overlay nls_iso8859_1 dm_mirror dm_region_hash dm_log hid_generic usbhid hid uas usb_storage i915 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect
[ 3302.394640]  sysimgblt fb_sys_fops r8169 mxm_wmi mii drm ahci libahci wmi video
[ 3302.394646] CPU: 0 PID: 5936 Comm: btrfs Not tainted 4.15.0-20-generic #21-Ubuntu
[ 3302.394646] Hardware name: System manufacturer System Product Name/H110M-R, BIOS 3404 10/10/2017
[ 3302.394658] RIP: 0010:btrfs_dev_replace_start+0x281/0x320 [btrfs]
[ 3302.394659] RSP: 0018:a8b582b5fd18 EFLAGS: 00010282
[ 3302.394660] RAX: fffb RBX: 927d3afe RCX:
[ 3302.394660] RDX: 0001 RSI: 0296 RDI: 927d3afece90
[ 3302.394661] RBP: a8b582b5fd68 R08: R09: a8b582b5fc18
[ 3302.394662] R10: a8b582b5fd10 R11: R12: 927d3afece20
[ 3302.394662] R13: 927d34b59421 R14: 927d34b59020 R15: 0001
[ 3302.394663] FS: 7fba4831c8c0() GS:927df6c0() knlGS:
[ 3302.394664] CS: 0010 DS: ES: CR0: 80050033
[ 3302.394664] CR2: 2b9b83db85b8 CR3: 000164d3a002 CR4: 003606f0
[ 3302.394665] Call Trace:
[ 3302.394676]  btrfs_dev_replace_by_ioctl+0x39/0x60 [btrfs]
[ 3302.394686]  btrfs_ioctl+0x1988/0x2080 [btrfs]
[ 3302.394689]  ? iput+0x8d/0x220
[ 3302.394690]  ? __blkdev_put+0x199/0x1f0
[ 3302.394692]  do_vfs_ioctl+0xa8/0x630
[ 3302.394701]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[ 3302.394703]  ? do_vfs_ioctl+0xa8/0x630
[ 3302.394704]  ? do_sigaction+0xb4/0x1e0
[ 3302.394706]  SyS_ioctl+0x79/0x90
[ 3302.394708]  do_syscall_64+0x73/0x130
[ 3302.394710]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 3302.394711] RIP: 0033:0x7fba471085d7
[ 3302.394712] RSP: 002b:7ffe5af753b8 EFLAGS: 0246 ORIG_RAX: 0010
[
[PATCH] btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable
[BUG]

A completely valid btrfs will refuse to mount, with an error message like:

  BTRFS critical (device sdb2): corrupt leaf: root=2 block=239681536 slot=172 \
    bg_start=12018974720 bg_len=10888413184, invalid block group size, \
    have 10888413184 expect (0, 10737418240]

Btrfs check returns no error, and all kernels used on this fs are from later than 2011, so they should all have the 10G chunk size limit commit.

[CAUSE]

For a 12 device btrfs, we could allocate a chunk larger than 10G due to the stripe size bump-up:

__btrfs_alloc_chunk()
|- max_stripe_size = 1G
|- max_chunk_size = 10G
|- data_stripes = 11
|- if (1G * 11 > 10G) {
       stripe_size = 976128930;
       stripe_size = round_up(976128930, SZ_16M) = 989855744

However the final stripe_size (989855744) * 11 = 10888413184, which is still larger than 10G.

[FIX]

For a comprehensive check, we would need to do the full check at chunk read time and rely on the bg <-> chunk mapping to do the check. We can just skip the length check for now.

Fixes: fce466eab7ac ("btrfs: tree-checker: Verify block_group_item")
Cc: sta...@vger.kernel.org # v4.19+
Reported-by: Wang Yugui
Signed-off-by: Qu Wenruo
---
 fs/btrfs/tree-checker.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index cab0b1f1f741..d8bd5340fbbc 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -389,13 +389,11 @@ static int check_block_group_item(struct btrfs_fs_info *fs_info,
 	/*
 	 * Here we don't really care about alignment since extent allocator can
-	 * handle it. We care more about the size, as if one block group is
-	 * larger than maximum size, it's must be some obvious corruption.
+	 * handle it. We care more about the size.
 	 */
-	if (key->offset > BTRFS_MAX_DATA_CHUNK_SIZE || key->offset == 0) {
+	if (key->offset == 0) {
 		block_group_err(fs_info, leaf, slot,
-			"invalid block group size, have %llu expect (0, %llu]",
-			key->offset, BTRFS_MAX_DATA_CHUNK_SIZE);
+			"invalid block group size 0");
 		return -EUCLEAN;
 	}
-- 
2.19.1
Re: [PATCH 4/4] btrfs: replace btrfs_io_bio::end_io with a simple helper
On 22.11.18 at 18:16, David Sterba wrote:
> The end_io callback implemented as btrfs_io_bio_endio_readpage only
> calls kfree. Also the callback is set only in case the csum buffer is
> allocated and not pointing to the inline buffer. We can use that
> information to drop the indirection and call a helper that will free the
> csums only in the right case.
>
> This shrinks struct btrfs_io_bio by 8 bytes.
>
> Signed-off-by: David Sterba

Reviewed-by: Nikolay Borisov

> ---
>  fs/btrfs/extent_io.c |  3 +--
>  fs/btrfs/file-item.c |  9 -
>  fs/btrfs/inode.c     |  7 ++-
>  fs/btrfs/volumes.h   | 10 --
>  4 files changed, 11 insertions(+), 18 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 4ea808d6cfbc..aef3c9866ff0 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2623,8 +2623,7 @@ static void end_bio_extent_readpage(struct bio *bio)
>  	if (extent_len)
>  		endio_readpage_release_extent(tree, extent_start, extent_len,
>  					      uptodate);
> -	if (io_bio->end_io)
> -		io_bio->end_io(io_bio, blk_status_to_errno(bio->bi_status));
> +	btrfs_io_bio_free_csum(io_bio);
>  	bio_put(bio);
>  }
>
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index 1f2d0a6ab634..920bf3b4b0ef 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -142,14 +142,6 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
>  	return ret;
>  }
>
> -static void btrfs_io_bio_endio_readpage(struct btrfs_io_bio *bio, int err)
> -{
> -	if (bio->csum != bio->csum_inline) {
> -		kfree(bio->csum);
> -		bio->csum = NULL;
> -	}
> -}
> -
>  static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
>  				   u64 logical_offset, u32 *dst, int dio)
>  {
> @@ -184,7 +176,6 @@ static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio
>  				btrfs_free_path(path);
>  				return BLK_STS_RESOURCE;
>  			}
> -			btrfs_bio->end_io = btrfs_io_bio_endio_readpage;
>  		} else {
>  			btrfs_bio->csum = btrfs_bio->csum_inline;
>  		}
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 26b8bec7c2dc..6bfd37e58924 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8017,9 +8017,7 @@ static void btrfs_endio_direct_read(struct bio *bio)
>
>  	dio_bio->bi_status = err;
>  	dio_end_io(dio_bio);
> -
> -	if (io_bio->end_io)
> -		io_bio->end_io(io_bio, blk_status_to_errno(err));
> +	btrfs_io_bio_free_csum(io_bio);
>  	bio_put(bio);
>  }
>
> @@ -8372,8 +8370,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
>  	if (!ret)
>  		return;
>
> -	if (io_bio->end_io)
> -		io_bio->end_io(io_bio, ret);
> +	btrfs_io_bio_free_csum(io_bio);
>
>  free_ordered:
>  	/*
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 9a764f2d462e..a13045fcfc45 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -267,14 +267,12 @@ struct btrfs_fs_devices {
>   * we allocate are actually btrfs_io_bios. We'll cram as much of
>   * struct btrfs_bio as we can into this over time.
>   */
> -typedef void (btrfs_io_bio_end_io_t) (struct btrfs_io_bio *bio, int err);
>  struct btrfs_io_bio {
>  	unsigned int mirror_num;
>  	unsigned int stripe_index;
>  	u64 logical;
>  	u8 *csum;
>  	u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
> -	btrfs_io_bio_end_io_t *end_io;
>  	struct bvec_iter iter;
>  	/*
>  	 * This member must come last, bio_alloc_bioset will allocate enough
> @@ -288,6 +286,14 @@ static inline struct btrfs_io_bio *btrfs_io_bio(struct bio *bio)
>  	return container_of(bio, struct btrfs_io_bio, bio);
>  }
>
> +static inline void btrfs_io_bio_free_csum(struct btrfs_io_bio *io_bio)
> +{
> +	if (io_bio->csum != io_bio->csum_inline) {
> +		kfree(io_bio->csum);
> +		io_bio->csum = NULL;
> +	}
> +}
> +
>  struct btrfs_bio_stripe {
>  	struct btrfs_device *dev;
>  	u64 physical;
>
Re: [PATCH 3/4] btrfs: remove redundant csum buffer in btrfs_io_bio
On 22.11.18 at 18:16, David Sterba wrote:
> The io_bio tracks checksums and has an inline buffer or an allocated
> one. And there's a third member that points to the right one, but we
> don't need to use an extra pointer for that. Let btrfs_io_bio::csum
> point to the right buffer and check that the inline buffer is not
> accidentally freed.
>
> This shrinks struct btrfs_io_bio by 8 bytes.
>
> Signed-off-by: David Sterba

Reviewed-by: Nikolay Borisov

> ---
>  fs/btrfs/file-item.c | 12 +++-
>  fs/btrfs/volumes.h   |  1 -
>  2 files changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index ba74827beb32..1f2d0a6ab634 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -144,7 +144,10 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
>
>  static void btrfs_io_bio_endio_readpage(struct btrfs_io_bio *bio, int err)
>  {
> -	kfree(bio->csum_allocated);
> +	if (bio->csum != bio->csum_inline) {
> +		kfree(bio->csum);
> +		bio->csum = NULL;
> +	}
>  }
>
>  static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
> @@ -175,13 +178,12 @@ static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio
>  	nblocks = bio->bi_iter.bi_size >> inode->i_sb->s_blocksize_bits;
>  	if (!dst) {
>  		if (nblocks * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
> -			btrfs_bio->csum_allocated = kmalloc_array(nblocks,
> -					csum_size, GFP_NOFS);
> -			if (!btrfs_bio->csum_allocated) {
> +			btrfs_bio->csum = kmalloc_array(nblocks, csum_size,
> +							GFP_NOFS);
> +			if (!btrfs_bio->csum) {
>  				btrfs_free_path(path);
>  				return BLK_STS_RESOURCE;
>  			}
> -			btrfs_bio->csum = btrfs_bio->csum_allocated;
>  			btrfs_bio->end_io = btrfs_io_bio_endio_readpage;
>  		} else {
>  			btrfs_bio->csum = btrfs_bio->csum_inline;
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 1d936ce282c3..9a764f2d462e 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -274,7 +274,6 @@ struct btrfs_io_bio {
>  	u64 logical;
>  	u8 *csum;
>  	u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
> -	u8 *csum_allocated;
>  	btrfs_io_bio_end_io_t *end_io;
>  	struct bvec_iter iter;
>  	/*
>
Re: [PATCH 2/4] btrfs: replace async_cow::root with fs_info
On 22.11.18 at 18:16, David Sterba wrote:
> The async_cow::root is used to propagate fs_info to async_cow_submit.
> We can't use inode to reach it because it could become NULL after
> write without compression in async_cow_start.
>
> Signed-off-by: David Sterba

Reviewed-by: Nikolay Borisov

> ---
>  fs/btrfs/inode.c | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index a88122b89f50..26b8bec7c2dc 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -358,7 +358,7 @@ struct async_extent {
>
>  struct async_cow {
>  	struct inode *inode;
> -	struct btrfs_root *root;
> +	struct btrfs_fs_info *fs_info;
>  	struct page *locked_page;
>  	u64 start;
>  	u64 end;
> @@ -1144,13 +1144,11 @@ static noinline void async_cow_submit(struct btrfs_work *work)
>  {
>  	struct btrfs_fs_info *fs_info;
>  	struct async_cow *async_cow;
> -	struct btrfs_root *root;
>  	unsigned long nr_pages;
>
>  	async_cow = container_of(work, struct async_cow, work);
>
> -	root = async_cow->root;
> -	fs_info = root->fs_info;
> +	fs_info = async_cow->fs_info;
>  	nr_pages = (async_cow->end - async_cow->start + PAGE_SIZE) >>
>  		PAGE_SHIFT;
>
> @@ -1179,7 +1177,6 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
>  {
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	struct async_cow *async_cow;
> -	struct btrfs_root *root = BTRFS_I(inode)->root;
>  	unsigned long nr_pages;
>  	u64 cur_end;
>
> @@ -1189,7 +1186,7 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
>  	async_cow = kmalloc(sizeof(*async_cow), GFP_NOFS);
>  	BUG_ON(!async_cow); /* -ENOMEM */
>  	async_cow->inode = igrab(inode);
> -	async_cow->root = root;
> +	async_cow->fs_info = fs_info;
>  	async_cow->locked_page = locked_page;
>  	async_cow->start = start;
>  	async_cow->write_flags = write_flags;
>
Re: [PATCH 1/4] btrfs: merge btrfs_submit_bio_done to its caller
On 22.11.18 at 18:16, David Sterba wrote:
> There's one caller and its code is simple, we can open code it in
> run_one_async_done. The errors are passed through bio.
>
> Signed-off-by: David Sterba

Reviewed-by: Nikolay Borisov

> ---
>  fs/btrfs/disk-io.c | 18 +-
>  fs/btrfs/inode.c   | 23 ---
>  2 files changed, 17 insertions(+), 24 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index feb67dfd663d..2f5c5442cf00 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -764,11 +764,22 @@ static void run_one_async_start(struct btrfs_work *work)
>  	async->status = ret;
>  }
>
> +/*
> + * In order to insert checksums into the metadata in large chunks, we wait
> + * until bio submission time. All the pages in the bio are checksummed and
> + * sums are attached onto the ordered extent record.
> + *
> + * At IO completion time the csums attached on the ordered extent record are
> + * inserted into the tree.
> + */
>  static void run_one_async_done(struct btrfs_work *work)
>  {
>  	struct async_submit_bio *async;
> +	struct inode *inode;
> +	blk_status_t ret;
>
>  	async = container_of(work, struct async_submit_bio, work);
> +	inode = async->private_data;
>
>  	/* If an error occurred we just want to clean up the bio and move on */
>  	if (async->status) {
> @@ -777,7 +788,12 @@ static void run_one_async_done(struct btrfs_work *work)
>  		return;
>  	}
>
> -	btrfs_submit_bio_done(async->private_data, async->bio, async->mirror_num);
> +	ret = btrfs_map_bio(btrfs_sb(inode->i_sb), async->bio,
> +			    async->mirror_num, 1);
> +	if (ret) {
> +		async->bio->bi_status = ret;
> +		bio_endio(async->bio);
> +	}
>  }
>
>  static void run_one_async_free(struct btrfs_work *work)
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9becf8543489..a88122b89f50 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1924,29 +1924,6 @@ static blk_status_t btrfs_submit_bio_start(void *private_data, struct bio *bio,
>  	return 0;
>  }
>
> -/*
> - * in order to insert checksums into the metadata in large chunks,
> - * we wait until bio submission time. All the pages in the bio are
> - * checksummed and sums are attached onto the ordered extent record.
> - *
> - * At IO completion time the cums attached on the ordered extent record
> - * are inserted into the btree
> - */
> -blk_status_t btrfs_submit_bio_done(void *private_data, struct bio *bio,
> -		int mirror_num)
> -{
> -	struct inode *inode = private_data;
> -	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> -	blk_status_t ret;
> -
> -	ret = btrfs_map_bio(fs_info, bio, mirror_num, 1);
> -	if (ret) {
> -		bio->bi_status = ret;
> -		bio_endio(bio);
> -	}
> -	return ret;
> -}
> -
>  /*
>   * extent_io.c submission hook. This does the right thing for csum calculation
>   * on write, or reading the csums from the tree before a read.
>
[PATCH 2/4] btrfs: replace async_cow::root with fs_info
The async_cow::root is used to propagate fs_info to async_cow_submit. We can't use inode to reach it because it could become NULL after write without compression in async_cow_start.

Signed-off-by: David Sterba
---
 fs/btrfs/inode.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a88122b89f50..26b8bec7c2dc 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -358,7 +358,7 @@ struct async_extent {
 
 struct async_cow {
 	struct inode *inode;
-	struct btrfs_root *root;
+	struct btrfs_fs_info *fs_info;
 	struct page *locked_page;
 	u64 start;
 	u64 end;
@@ -1144,13 +1144,11 @@ static noinline void async_cow_submit(struct btrfs_work *work)
 {
 	struct btrfs_fs_info *fs_info;
 	struct async_cow *async_cow;
-	struct btrfs_root *root;
 	unsigned long nr_pages;
 
 	async_cow = container_of(work, struct async_cow, work);
 
-	root = async_cow->root;
-	fs_info = root->fs_info;
+	fs_info = async_cow->fs_info;
 	nr_pages = (async_cow->end - async_cow->start + PAGE_SIZE) >>
 		PAGE_SHIFT;
 
@@ -1179,7 +1177,6 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct async_cow *async_cow;
-	struct btrfs_root *root = BTRFS_I(inode)->root;
 	unsigned long nr_pages;
 	u64 cur_end;
 
@@ -1189,7 +1186,7 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 	async_cow = kmalloc(sizeof(*async_cow), GFP_NOFS);
 	BUG_ON(!async_cow); /* -ENOMEM */
 	async_cow->inode = igrab(inode);
-	async_cow->root = root;
+	async_cow->fs_info = fs_info;
 	async_cow->locked_page = locked_page;
 	async_cow->start = start;
 	async_cow->write_flags = write_flags;
-- 
2.19.1
[PATCH 3/4] btrfs: remove redundant csum buffer in btrfs_io_bio
The io_bio tracks checksums and has an inline buffer or an allocated one. And there's a third member that points to the right one, but we don't need to use an extra pointer for that. Let btrfs_io_bio::csum point to the right buffer and check that the inline buffer is not accidentally freed.

This shrinks struct btrfs_io_bio by 8 bytes.

Signed-off-by: David Sterba
---
 fs/btrfs/file-item.c | 12 +++-
 fs/btrfs/volumes.h   |  1 -
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index ba74827beb32..1f2d0a6ab634 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -144,7 +144,10 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
 
 static void btrfs_io_bio_endio_readpage(struct btrfs_io_bio *bio, int err)
 {
-	kfree(bio->csum_allocated);
+	if (bio->csum != bio->csum_inline) {
+		kfree(bio->csum);
+		bio->csum = NULL;
+	}
 }
 
 static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
@@ -175,13 +178,12 @@ static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio
 	nblocks = bio->bi_iter.bi_size >> inode->i_sb->s_blocksize_bits;
 	if (!dst) {
 		if (nblocks * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
-			btrfs_bio->csum_allocated = kmalloc_array(nblocks,
-					csum_size, GFP_NOFS);
-			if (!btrfs_bio->csum_allocated) {
+			btrfs_bio->csum = kmalloc_array(nblocks, csum_size,
+							GFP_NOFS);
+			if (!btrfs_bio->csum) {
 				btrfs_free_path(path);
 				return BLK_STS_RESOURCE;
 			}
-			btrfs_bio->csum = btrfs_bio->csum_allocated;
 			btrfs_bio->end_io = btrfs_io_bio_endio_readpage;
 		} else {
 			btrfs_bio->csum = btrfs_bio->csum_inline;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 1d936ce282c3..9a764f2d462e 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -274,7 +274,6 @@ struct btrfs_io_bio {
 	u64 logical;
 	u8 *csum;
 	u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
-	u8 *csum_allocated;
 	btrfs_io_bio_end_io_t *end_io;
 	struct bvec_iter iter;
 	/*
-- 
2.19.1
[PATCH 4/4] btrfs: replace btrfs_io_bio::end_io with a simple helper
The end_io callback implemented as btrfs_io_bio_endio_readpage only calls kfree. Also the callback is set only in case the csum buffer is allocated and not pointing to the inline buffer. We can use that information to drop the indirection and call a helper that will free the csums only in the right case.

This shrinks struct btrfs_io_bio by 8 bytes.

Signed-off-by: David Sterba
---
 fs/btrfs/extent_io.c |  3 +--
 fs/btrfs/file-item.c |  9 -
 fs/btrfs/inode.c     |  7 ++-
 fs/btrfs/volumes.h   | 10 --
 4 files changed, 11 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4ea808d6cfbc..aef3c9866ff0 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2623,8 +2623,7 @@ static void end_bio_extent_readpage(struct bio *bio)
 	if (extent_len)
 		endio_readpage_release_extent(tree, extent_start, extent_len,
 					      uptodate);
-	if (io_bio->end_io)
-		io_bio->end_io(io_bio, blk_status_to_errno(bio->bi_status));
+	btrfs_io_bio_free_csum(io_bio);
 	bio_put(bio);
 }
 
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 1f2d0a6ab634..920bf3b4b0ef 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -142,14 +142,6 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-static void btrfs_io_bio_endio_readpage(struct btrfs_io_bio *bio, int err)
-{
-	if (bio->csum != bio->csum_inline) {
-		kfree(bio->csum);
-		bio->csum = NULL;
-	}
-}
-
 static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
 				   u64 logical_offset, u32 *dst, int dio)
 {
@@ -184,7 +176,6 @@ static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio
 				btrfs_free_path(path);
 				return BLK_STS_RESOURCE;
 			}
-			btrfs_bio->end_io = btrfs_io_bio_endio_readpage;
 		} else {
 			btrfs_bio->csum = btrfs_bio->csum_inline;
 		}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 26b8bec7c2dc..6bfd37e58924 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8017,9 +8017,7 @@ static void btrfs_endio_direct_read(struct bio *bio)
 
 	dio_bio->bi_status = err;
 	dio_end_io(dio_bio);
-
-	if (io_bio->end_io)
-		io_bio->end_io(io_bio, blk_status_to_errno(err));
+	btrfs_io_bio_free_csum(io_bio);
 	bio_put(bio);
 }
 
@@ -8372,8 +8370,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 	if (!ret)
 		return;
 
-	if (io_bio->end_io)
-		io_bio->end_io(io_bio, ret);
+	btrfs_io_bio_free_csum(io_bio);
 
 free_ordered:
 	/*
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 9a764f2d462e..a13045fcfc45 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -267,14 +267,12 @@ struct btrfs_fs_devices {
  * we allocate are actually btrfs_io_bios. We'll cram as much of
  * struct btrfs_bio as we can into this over time.
  */
-typedef void (btrfs_io_bio_end_io_t) (struct btrfs_io_bio *bio, int err);
 struct btrfs_io_bio {
 	unsigned int mirror_num;
 	unsigned int stripe_index;
 	u64 logical;
 	u8 *csum;
 	u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
-	btrfs_io_bio_end_io_t *end_io;
 	struct bvec_iter iter;
 	/*
 	 * This member must come last, bio_alloc_bioset will allocate enough
@@ -288,6 +286,14 @@ static inline struct btrfs_io_bio *btrfs_io_bio(struct bio *bio)
 	return container_of(bio, struct btrfs_io_bio, bio);
 }
 
+static inline void btrfs_io_bio_free_csum(struct btrfs_io_bio *io_bio)
+{
+	if (io_bio->csum != io_bio->csum_inline) {
+		kfree(io_bio->csum);
+		io_bio->csum = NULL;
+	}
+}
+
 struct btrfs_bio_stripe {
 	struct btrfs_device *dev;
 	u64 physical;
-- 
2.19.1
[PATCH 0/4] Minor helper and struct member cleanups
Assorted updates to code vaguely related to the bio callbacks and structures. Remove call indirection, remove redundant struct members, etc. David Sterba (4): btrfs: merge btrfs_submit_bio_done to its caller btrfs: replace async_cow::root with fs_info btrfs: remove redundant csum buffer in btrfs_io_bio btrfs: replace btrfs_io_bio::end_io with a simple helper fs/btrfs/disk-io.c | 18 +- fs/btrfs/extent_io.c | 3 +-- fs/btrfs/file-item.c | 13 +++-- fs/btrfs/inode.c | 39 +-- fs/btrfs/volumes.h | 11 --- 5 files changed, 34 insertions(+), 50 deletions(-) -- 2.19.1
[PATCH 1/4] btrfs: merge btrfs_submit_bio_done to its caller
There's one caller and its code is simple, we can open code it in run_one_async_done. The errors are passed through bio. Signed-off-by: David Sterba --- fs/btrfs/disk-io.c | 18 +- fs/btrfs/inode.c | 23 --- 2 files changed, 17 insertions(+), 24 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index feb67dfd663d..2f5c5442cf00 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -764,11 +764,22 @@ static void run_one_async_start(struct btrfs_work *work) async->status = ret; } +/* + * In order to insert checksums into the metadata in large chunks, we wait + * until bio submission time. All the pages in the bio are checksummed and + * sums are attached onto the ordered extent record. + * + * At IO completion time the csums attached on the ordered extent record are + * inserted into the tree. + */ static void run_one_async_done(struct btrfs_work *work) { struct async_submit_bio *async; + struct inode *inode; + blk_status_t ret; async = container_of(work, struct async_submit_bio, work); + inode = async->private_data; /* If an error occurred we just want to clean up the bio and move on */ if (async->status) { @@ -777,7 +788,12 @@ static void run_one_async_done(struct btrfs_work *work) return; } - btrfs_submit_bio_done(async->private_data, async->bio, async->mirror_num); + ret = btrfs_map_bio(btrfs_sb(inode->i_sb), async->bio, + async->mirror_num, 1); + if (ret) { + async->bio->bi_status = ret; + bio_endio(async->bio); + } } static void run_one_async_free(struct btrfs_work *work) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 9becf8543489..a88122b89f50 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1924,29 +1924,6 @@ static blk_status_t btrfs_submit_bio_start(void *private_data, struct bio *bio, return 0; } -/* - * in order to insert checksums into the metadata in large chunks, - * we wait until bio submission time. All the pages in the bio are - * checksummed and sums are attached onto the ordered extent record. 
- * - * At IO completion time the cums attached on the ordered extent record - * are inserted into the btree - */ -blk_status_t btrfs_submit_bio_done(void *private_data, struct bio *bio, - int mirror_num) -{ - struct inode *inode = private_data; - struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); - blk_status_t ret; - - ret = btrfs_map_bio(fs_info, bio, mirror_num, 1); - if (ret) { - bio->bi_status = ret; - bio_endio(bio); - } - return ret; -} - /* * extent_io.c submission hook. This does the right thing for csum calculation * on write, or reading the csums from the tree before a read. -- 2.19.1
Re: [PATCH v2] Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation
On Thu, Nov 22, 2018 at 09:46:44PM +0800, Qu Wenruo wrote: > >>> it and after > >>> we started (or joined) a transaction, a lot could of modifications may > >>> have happened. > >>> Nevertheless I don't think it's unexpected for anyone to have the > >>> accounting happening > >>> only after the quota enable ioctl returns success. > >> > >> The problem is not accounting, the qgroup number won't cause problem. > >> > >> It's the reserved space. Like some dirty pages are dirtied before quota > >> enabled, but at run_dealloc() time quota is enabled. > >> > >> For such case we have io_tree based method to avoid underflow so it > >> should be OK. > >> > >> So v2 patch looks OK. > > > > Does that mean reviewed-by? In case there's a evolved discussion under a > > patch, a clear yes/no is appreciated and an explicit Reviewed-by even > > more. I'm about to add this patch to rc4 pull, thre's still some time to > > add the tag. Thanks. > > I'd like to add reviewed-by tab, but I'm still not 100% if this will > cause extra qgroup reserved space related problem. > > At least from my side, I can't directly see a case where it will cause > problem. > > Does such case mean a reviewed-by tag? Or something LGTM-but-uncertain? It means that we can keep the patch in testing branch for a bit longer, there's still time to put it to a later rc once there's enough certainty.
Re: [PATCH] btrfs: improve error handling of btrfs_add_link()
On Thu, Nov 22, 2018 at 02:44:35PM +0100, Johannes Thumshirn wrote: > On 22/11/2018 14:35, David Sterba wrote: > > On Thu, Nov 22, 2018 at 02:15:28PM +0100, Johannes Thumshirn wrote: > >> err holds the return value of either btrfs_del_root_ref() or > >> btrfs_del_inode_ref() but it hasn't been checked since it's > >> introduction with commit fe66a05a0679 (Btrfs: improve error handling > >> for btrfs_insert_dir_item callers) in 2012. > >> > >> The first attempt in removing the variable was rejected, so instead of > >> removing 'err', start handling the errors of > >> btrfs_del_root_ref()/btrfs_del_inode_ref() by aborting the transaction in > >> btrfs_add_link() directly instead of the call-sites. > > > > ... which is an anti-pattern. > > > > https://btrfs.wiki.kernel.org/index.php/Development_notes#Error_handling_and_transaction_abort > > > > In case there are several conditons that can fail we want to see the > > first one for later analysis and debugging. > > OK, but then handling the error of either btrfs_del_root_ref() or > btrfs_del_inode_ref() will essentially only be printing an error message > as anything else would hide the error condition which has led us into > this path (i.e. btrfs_insert_dir_item() returning EEXISTS or EOVERFLOW), > wouldn't it? So this seems to point out there are more things to resolve here:

* how to recover from errors of btrfs_del_root_ref or btrfs_del_inode_ref
* what to report back to the user

When insert_dir_item fails, there are the inode references to be cleaned up, and the error reported as EEXIST or EOVERFLOW. When btrfs_del_root_ref or btrfs_del_inode_ref fail, the cleanup is not possible and there's no other option than a transaction abort. As the original reason was the overflow, this should be reported back to the user. The reason for the del_*_ref failures can be misleading, in case it's eg. ENOMEM.
This would mean that the add-link operation is possible but there was not enough memory, and restarting it later could succeed. So the missing part of the error handling is to add two "if (err) abort" and then still return 'ret'. As it's not at all obvious why, a comment would be good there.
Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3
On 2018/11/22 10:03 PM, Roman Mamedov wrote: > On Thu, 22 Nov 2018 22:07:25 +0900 > Tomasz Chmielewski wrote: > >> Spot on! >> >> Removed "discard" from fstab and added "ssd", rebooted - no more >> btrfs-cleaner running. > > Recently there has been a bugfix for TRIM in Btrfs: > > btrfs: Ensure btrfs_trim_fs can trim the whole fs > https://patchwork.kernel.org/patch/10579539/ > > Perhaps your upgraded kernel is the first one to contain it, and for the first > time you're seeing TRIM to actually *work*, with the actual performance impact > of it on a large fragmented FS, instead of a few contiguous unallocated areas. > That only affects btrfs_trim_fs(), and you can see it's only called from the ioctl interface, so it's definitely not the case. Thanks, Qu
Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3
On Thu, 22 Nov 2018 22:07:25 +0900 Tomasz Chmielewski wrote: > Spot on! > > Removed "discard" from fstab and added "ssd", rebooted - no more > btrfs-cleaner running. Recently there has been a bugfix for TRIM in Btrfs: btrfs: Ensure btrfs_trim_fs can trim the whole fs https://patchwork.kernel.org/patch/10579539/ Perhaps your upgraded kernel is the first one to contain it, and for the first time you're seeing TRIM to actually *work*, with the actual performance impact of it on a large fragmented FS, instead of a few contiguous unallocated areas. -- With respect, Roman
Re: [PATCH v2] Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation
On 2018/11/22 下午9:12, David Sterba wrote: > On Tue, Nov 20, 2018 at 08:32:42AM +0800, Qu Wenruo wrote: > @@ -1013,16 +1013,22 @@ int btrfs_quota_enable(struct btrfs_fs_info > *fs_info) > btrfs_abort_transaction(trans, ret); > goto out_free_path; > } > - spin_lock(_info->qgroup_lock); > - fs_info->quota_root = quota_root; > - set_bit(BTRFS_FS_QUOTA_ENABLED, _info->flags); > - spin_unlock(_info->qgroup_lock); > > ret = btrfs_commit_transaction(trans); > trans = NULL; > if (ret) > goto out_free_path; The main concern here is, if we don't set qgroup enabled bit before we commit transaction, there will be a window where new tree modification could happen before QGROUP_ENABLED bit set. >>> >>> That doesn't seem to make much sense to me, if I understood correctly. >>> Essentially you're saying stuff done to any tree in the the >>> transaction we use to >>> enable quotas must be accounted for. In that case the quota enabled bit >>> should >>> be done as soon as the transaction is started, because before we set >>> it and after >>> we started (or joined) a transaction, a lot could of modifications may >>> have happened. >>> Nevertheless I don't think it's unexpected for anyone to have the >>> accounting happening >>> only after the quota enable ioctl returns success. >> >> The problem is not accounting, the qgroup number won't cause problem. >> >> It's the reserved space. Like some dirty pages are dirtied before quota >> enabled, but at run_dealloc() time quota is enabled. >> >> For such case we have io_tree based method to avoid underflow so it >> should be OK. >> >> So v2 patch looks OK. > > Does that mean reviewed-by? In case there's a evolved discussion under a > patch, a clear yes/no is appreciated and an explicit Reviewed-by even > more. I'm about to add this patch to rc4 pull, thre's still some time to > add the tag. Thanks. > I'd like to add reviewed-by tab, but I'm still not 100% if this will cause extra qgroup reserved space related problem. 
At least from my side, I can't directly see a case where it will cause problem. Does such case mean a reviewed-by tag? Or something LGTM-but-uncertain? Thanks, Qu
Re: [PATCH] btrfs: improve error handling of btrfs_add_link()
On 22/11/2018 14:35, David Sterba wrote: > On Thu, Nov 22, 2018 at 02:15:28PM +0100, Johannes Thumshirn wrote: >> err holds the return value of either btrfs_del_root_ref() or >> btrfs_del_inode_ref() but it hasn't been checked since it's >> introduction with commit fe66a05a0679 (Btrfs: improve error handling >> for btrfs_insert_dir_item callers) in 2012. >> >> The first attempt in removing the variable was rejected, so instead of >> removing 'err', start handling the errors of >> btrfs_del_root_ref()/btrfs_del_inode_ref() by aborting the transaction in >> btrfs_add_link() directly instead of the call-sites. > > ... which is an anti-pattern. > > https://btrfs.wiki.kernel.org/index.php/Development_notes#Error_handling_and_transaction_abort > > In case there are several conditons that can fail we want to see the > first one for later analysis and debugging. OK, but then handling the error of either btrfs_del_root_ref() or btrfs_del_inode_ref() will essentially only be printing an error message as anything else would hide the error condition which has led us into this path (i.e. btrfs_insert_dir_item() returning EEXISTS or EOVERFLOW), wouldn't it? Byte, Johannes -- Johannes ThumshirnSUSE Labs jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH] btrfs: improve error handling of btrfs_add_link()
On Thu, Nov 22, 2018 at 02:15:28PM +0100, Johannes Thumshirn wrote: > err holds the return value of either btrfs_del_root_ref() or > btrfs_del_inode_ref() but it hasn't been checked since it's > introduction with commit fe66a05a0679 (Btrfs: improve error handling > for btrfs_insert_dir_item callers) in 2012. > > The first attempt in removing the variable was rejected, so instead of > removing 'err', start handling the errors of > btrfs_del_root_ref()/btrfs_del_inode_ref() by aborting the transaction in > btrfs_add_link() directly instead of the call-sites. ... which is an anti-pattern. https://btrfs.wiki.kernel.org/index.php/Development_notes#Error_handling_and_transaction_abort In case there are several conditions that can fail we want to see the first one for later analysis and debugging.
[PATCH] btrfs: improve error handling of btrfs_add_link()
err holds the return value of either btrfs_del_root_ref() or btrfs_del_inode_ref() but it hasn't been checked since it's introduction with commit fe66a05a0679 (Btrfs: improve error handling for btrfs_insert_dir_item callers) in 2012. The first attempt in removing the variable was rejected, so instead of removing 'err', start handling the errors of btrfs_del_root_ref()/btrfs_del_inode_ref() by aborting the transaction in btrfs_add_link() directly instead of the call-sites. Link: https://lore.kernel.org/linux-btrfs/20181119141323.gc24...@twin.jikos.cz/ Signed-off-by: Johannes Thumshirn --- fs/btrfs/inode.c | 19 --- 1 file changed, 8 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 9becf8543489..314142ea9d80 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -6351,6 +6351,7 @@ int btrfs_add_link(struct btrfs_trans_handle *trans, struct btrfs_root *root = parent_inode->root; u64 ino = btrfs_ino(inode); u64 parent_ino = btrfs_ino(parent_inode); + int err = 0; if (unlikely(ino == BTRFS_FIRST_FREE_OBJECTID)) { memcpy(, >root->root_key, sizeof(key)); @@ -6395,18 +6396,20 @@ int btrfs_add_link(struct btrfs_trans_handle *trans, fail_dir_item: if (unlikely(ino == BTRFS_FIRST_FREE_OBJECTID)) { u64 local_index; - int err; + err = btrfs_del_root_ref(trans, key.objectid, root->root_key.objectid, parent_ino, _index, name, name_len); } else if (add_backref) { u64 local_index; - int err; err = btrfs_del_inode_ref(trans, root, name, name_len, ino, parent_ino, _index); } + + btrfs_abort_transaction(trans, err ? 
err : ret); + return ret; } @@ -9502,18 +9505,14 @@ static int btrfs_rename_exchange(struct inode *old_dir, ret = btrfs_add_link(trans, BTRFS_I(new_dir), BTRFS_I(old_inode), new_dentry->d_name.name, new_dentry->d_name.len, 0, old_idx); - if (ret) { - btrfs_abort_transaction(trans, ret); + if (ret) goto out_fail; - } ret = btrfs_add_link(trans, BTRFS_I(old_dir), BTRFS_I(new_inode), old_dentry->d_name.name, old_dentry->d_name.len, 0, new_idx); - if (ret) { - btrfs_abort_transaction(trans, ret); + if (ret) goto out_fail; - } if (old_inode->i_nlink == 1) BTRFS_I(old_inode)->dir_index = old_idx; @@ -9822,10 +9821,8 @@ static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry, ret = btrfs_add_link(trans, BTRFS_I(new_dir), BTRFS_I(old_inode), new_dentry->d_name.name, new_dentry->d_name.len, 0, index); - if (ret) { - btrfs_abort_transaction(trans, ret); + if (ret) goto out_fail; - } if (old_inode->i_nlink == 1) BTRFS_I(old_inode)->dir_index = index; -- 2.16.4
Re: [PATCH v2] Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation
On Tue, Nov 20, 2018 at 08:32:42AM +0800, Qu Wenruo wrote: > >>> @@ -1013,16 +1013,22 @@ int btrfs_quota_enable(struct btrfs_fs_info > >>> *fs_info) > >>> btrfs_abort_transaction(trans, ret); > >>> goto out_free_path; > >>> } > >>> - spin_lock(_info->qgroup_lock); > >>> - fs_info->quota_root = quota_root; > >>> - set_bit(BTRFS_FS_QUOTA_ENABLED, _info->flags); > >>> - spin_unlock(_info->qgroup_lock); > >>> > >>> ret = btrfs_commit_transaction(trans); > >>> trans = NULL; > >>> if (ret) > >>> goto out_free_path; > >> > >> The main concern here is, if we don't set qgroup enabled bit before we > >> commit transaction, there will be a window where new tree modification > >> could happen before QGROUP_ENABLED bit set. > > > > That doesn't seem to make much sense to me, if I understood correctly. > > Essentially you're saying stuff done to any tree in the the > > transaction we use to > > enable quotas must be accounted for. In that case the quota enabled bit > > should > > be done as soon as the transaction is started, because before we set > > it and after > > we started (or joined) a transaction, a lot could of modifications may > > have happened. > > Nevertheless I don't think it's unexpected for anyone to have the > > accounting happening > > only after the quota enable ioctl returns success. > > The problem is not accounting, the qgroup number won't cause problem. > > It's the reserved space. Like some dirty pages are dirtied before quota > enabled, but at run_dealloc() time quota is enabled. > > For such case we have io_tree based method to avoid underflow so it > should be OK. > > So v2 patch looks OK. Does that mean reviewed-by? In case there's an evolved discussion under a patch, a clear yes/no is appreciated and an explicit Reviewed-by even more. I'm about to add this patch to the rc4 pull, there's still some time to add the tag. Thanks.
Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3
On 2018-11-22 21:46, Nikolay Borisov wrote: # echo w > /proc/sysrq-trigger # dmesg -c [ 931.585611] sysrq: SysRq : Show Blocked State [ 931.585715] task PC stack pid father [ 931.590168] btrfs-cleaner D 0 1340 2 0x8000 [ 931.590175] Call Trace: [ 931.590190] __schedule+0x29e/0x840 [ 931.590195] schedule+0x2c/0x80 [ 931.590199] schedule_timeout+0x258/0x360 [ 931.590204] io_schedule_timeout+0x1e/0x50 [ 931.590208] wait_for_completion_io+0xb7/0x140 [ 931.590214] ? wake_up_q+0x80/0x80 [ 931.590219] submit_bio_wait+0x61/0x90 [ 931.590225] blkdev_issue_discard+0x7a/0xd0 [ 931.590266] btrfs_issue_discard+0x123/0x160 [btrfs] [ 931.590299] btrfs_discard_extent+0xd8/0x160 [btrfs] [ 931.590335] btrfs_finish_extent_commit+0xe2/0x240 [btrfs] [ 931.590382] btrfs_commit_transaction+0x573/0x840 [btrfs] [ 931.590415] ? btrfs_block_rsv_check+0x25/0x70 [btrfs] [ 931.590456] __btrfs_end_transaction+0x2be/0x2d0 [btrfs] [ 931.590493] btrfs_end_transaction_throttle+0x13/0x20 [btrfs] [ 931.590530] btrfs_drop_snapshot+0x489/0x800 [btrfs] [ 931.590567] btrfs_clean_one_deleted_snapshot+0xbb/0xf0 [btrfs] [ 931.590607] cleaner_kthread+0x136/0x160 [btrfs] [ 931.590612] kthread+0x120/0x140 [ 931.590646] ? btree_submit_bio_start+0x20/0x20 [btrfs] [ 931.590658] ? kthread_bind+0x40/0x40 [ 931.590661] ret_from_fork+0x22/0x40 It seems your filesystem is mounted with the DSICARD option meaning every delete will result in discard this is highly suboptimal for ssd's. Try remounting the fs without the discard option see if it helps. Generally for discard you want to submit it in big batches (what fstrim does) so that the ftl on the ssd could apply any optimisations it might have up its sleeve. Spot on! Removed "discard" from fstab and added "ssd", rebooted - no more btrfs-cleaner running. Do you know if the issue you described ("discard this is highly suboptimal for ssd") affects other filesystems as well to a similar extent? I.e. if using ext4 on ssd? 
Would you finally care to share the smart data + the model and make of the ssd? 2x these: Model Family: Samsung based SSDs Device Model: SAMSUNG MZ7LM1T9HCJM-5 Firmware Version: GXT1103Q User Capacity:1,920,383,410,176 bytes [1.92 TB] Sector Size: 512 bytes logical/physical Rotation Rate:Solid State Device 1x this: Device Model: Micron_5200_MTFDDAK1T9TDC Firmware Version: D1MU004 User Capacity:1,920,383,410,176 bytes [1.92 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Rotation Rate:Solid State Device Form Factor: 2.5 inches But - seems the issue was unneeded discard option, so not pasting unnecessary SMART data, thanks for finding this out. Tomasz Chmielewski
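The batching Nikolay describes — issuing discards in large periodic sweeps rather than per delete — is what distributions typically ship as a scheduled fstrim job. A minimal systemd timer sketch follows; the unit contents are illustrative (util-linux ships its own fstrim.service/fstrim.timer pair), not taken from this thread:

```ini
# fstrim.timer — trim all mounted filesystems weekly, instead of
# paying the per-delete cost of mounting with -o discard
[Unit]
Description=Discard unused blocks once a week

[Timer]
OnCalendar=weekly
AccuracySec=1h
Persistent=true

[Install]
WantedBy=timers.target
```

The matching fstrim.service would run `fstrim -av` on each activation, letting the SSD's FTL process one large batch of ranges instead of a stream of small per-extent discards.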
Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3
On 22.11.18 г. 14:31 ч., Tomasz Chmielewski wrote: > Yet another system upgraded to 4.19 and showing strange issues. > > btrfs-cleaner is showing as ~90-100% busy in iotop: > > TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND > 1340 be/4 root 0.00 B/s 0.00 B/s 0.00 % 92.88 % > [btrfs-cleaner] > > > Note disk read, disk write are 0.00 B/s. > > > iostat -mx 1 shows all disks are ~100% busy, yet there are 0 reads and 0 > writes to them: > > Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s > %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util > sda 0.00 0.00 0.00 0.00 0.00 0.00 > 0.00 0.00 0.00 0.00 0.91 0.00 0.00 0.00 90.80 > sdb 0.00 0.00 0.00 0.00 0.00 0.00 > 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 100.00 > sdc 0.00 0.00 0.00 0.00 0.00 0.00 > 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 100.00 > > > > btrfs-cleaner persists 100% busy after reboots. The system is generally > not very responsive. > > > > # echo w > /proc/sysrq-trigger > > # dmesg -c > [ 931.585611] sysrq: SysRq : Show Blocked State > [ 931.585715] task PC stack pid father > [ 931.590168] btrfs-cleaner D 0 1340 2 0x8000 > [ 931.590175] Call Trace: > [ 931.590190] __schedule+0x29e/0x840 > [ 931.590195] schedule+0x2c/0x80 > [ 931.590199] schedule_timeout+0x258/0x360 > [ 931.590204] io_schedule_timeout+0x1e/0x50 > [ 931.590208] wait_for_completion_io+0xb7/0x140 > [ 931.590214] ? wake_up_q+0x80/0x80 > [ 931.590219] submit_bio_wait+0x61/0x90 > [ 931.590225] blkdev_issue_discard+0x7a/0xd0 > [ 931.590266] btrfs_issue_discard+0x123/0x160 [btrfs] > [ 931.590299] btrfs_discard_extent+0xd8/0x160 [btrfs] > [ 931.590335] btrfs_finish_extent_commit+0xe2/0x240 [btrfs] > [ 931.590382] btrfs_commit_transaction+0x573/0x840 [btrfs] > [ 931.590415] ? 
btrfs_block_rsv_check+0x25/0x70 [btrfs] > [ 931.590456] __btrfs_end_transaction+0x2be/0x2d0 [btrfs] > [ 931.590493] btrfs_end_transaction_throttle+0x13/0x20 [btrfs] > [ 931.590530] btrfs_drop_snapshot+0x489/0x800 [btrfs] > [ 931.590567] btrfs_clean_one_deleted_snapshot+0xbb/0xf0 [btrfs] > [ 931.590607] cleaner_kthread+0x136/0x160 [btrfs] > [ 931.590612] kthread+0x120/0x140 > [ 931.590646] ? btree_submit_bio_start+0x20/0x20 [btrfs] > [ 931.590658] ? kthread_bind+0x40/0x40 > [ 931.590661] ret_from_fork+0x22/0x40 > It seems your filesystem is mounted with the DISCARD option, meaning every delete will result in a discard; this is highly suboptimal for SSDs. Try remounting the fs without the discard option and see if it helps. Generally for discard you want to submit it in big batches (what fstrim does) so that the FTL on the SSD can apply any optimisations it might have up its sleeve. Would you finally care to share the SMART data plus the model and make of the SSD? > > > > After rebooting to 4.19.3, I've started seeing read errors. There were > no errors with a previous kernel (4.17.14); disks are healthy according > to SMART; no errors reported when we read the whole surface with dd.
> > # grep -i btrfs.*corrected /var/log/syslog|wc -l > 156 > > Things like: > > Nov 22 12:17:43 lxd05 kernel: [ 711.538836] BTRFS info (device sda2): > read error corrected: ino 0 off 3739083145216 (dev /dev/sdc2 sector > 101088384) > Nov 22 12:17:43 lxd05 kernel: [ 711.538905] BTRFS info (device sda2): > read error corrected: ino 0 off 3739083149312 (dev /dev/sdc2 sector > 101088392) > Nov 22 12:17:43 lxd05 kernel: [ 711.538958] BTRFS info (device sda2): > read error corrected: ino 0 off 3739083153408 (dev /dev/sdc2 sector > 101088400) > Nov 22 12:17:43 lxd05 kernel: [ 711.539006] BTRFS info (device sda2): > read error corrected: ino 0 off 3739083157504 (dev /dev/sdc2 sector > 101088408) > > > Yet - 0 errors, according to stats, not sure if it's expected or not: Since btrfs was able to fix the read failure on the fly it doesn't record the error. IMO you have problems with your storage below the filesystem level. > > # btrfs device stats /data/lxd > [/dev/sda2].write_io_errs 0 > [/dev/sda2].read_io_errs 0 > [/dev/sda2].flush_io_errs 0 > [/dev/sda2].corruption_errs 0 > [/dev/sda2].generation_errs 0 > [/dev/sdb2].write_io_errs 0 > [/dev/sdb2].read_io_errs 0 > [/dev/sdb2].flush_io_errs 0 > [/dev/sdb2].corruption_errs 0 > [/dev/sdb2].generation_errs 0 > [/dev/sdc2].write_io_errs 0 > [/dev/sdc2].read_io_errs 0 > [/dev/sdc2].flush_io_errs 0 > [/dev/sdc2].corruption_errs 0 > [/dev/sdc2].generation_errs 0 > >
btrfs-cleaner 100% busy on an idle filesystem with 4.19.3
Yet another system upgraded to 4.19 and showing strange issues. btrfs-cleaner is showing as ~90-100% busy in iotop: TID PRIO USER DISK READ DISK WRITE SWAPIN IO>COMMAND 1340 be/4 root0.00 B/s0.00 B/s 0.00 % 92.88 % [btrfs-cleaner] Note disk read, disk write are 0.00 B/s. iostat -mx 1 shows all disks are ~100% busy, yet there are 0 reads and 0 writes to them: Devicer/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util sda 0.000.00 0.00 0.00 0.00 0.00 0.00 0.000.000.00 0.91 0.00 0.00 0.00 90.80 sdb 0.000.00 0.00 0.00 0.00 0.00 0.00 0.000.000.00 1.00 0.00 0.00 0.00 100.00 sdc 0.000.00 0.00 0.00 0.00 0.00 0.00 0.000.000.00 1.00 0.00 0.00 0.00 100.00 btrfs-cleaner persists 100% busy after reboots. The system is generally not very responsive. # echo w > /proc/sysrq-trigger # dmesg -c [ 931.585611] sysrq: SysRq : Show Blocked State [ 931.585715] taskPC stack pid father [ 931.590168] btrfs-cleaner D0 1340 2 0x8000 [ 931.590175] Call Trace: [ 931.590190] __schedule+0x29e/0x840 [ 931.590195] schedule+0x2c/0x80 [ 931.590199] schedule_timeout+0x258/0x360 [ 931.590204] io_schedule_timeout+0x1e/0x50 [ 931.590208] wait_for_completion_io+0xb7/0x140 [ 931.590214] ? wake_up_q+0x80/0x80 [ 931.590219] submit_bio_wait+0x61/0x90 [ 931.590225] blkdev_issue_discard+0x7a/0xd0 [ 931.590266] btrfs_issue_discard+0x123/0x160 [btrfs] [ 931.590299] btrfs_discard_extent+0xd8/0x160 [btrfs] [ 931.590335] btrfs_finish_extent_commit+0xe2/0x240 [btrfs] [ 931.590382] btrfs_commit_transaction+0x573/0x840 [btrfs] [ 931.590415] ? btrfs_block_rsv_check+0x25/0x70 [btrfs] [ 931.590456] __btrfs_end_transaction+0x2be/0x2d0 [btrfs] [ 931.590493] btrfs_end_transaction_throttle+0x13/0x20 [btrfs] [ 931.590530] btrfs_drop_snapshot+0x489/0x800 [btrfs] [ 931.590567] btrfs_clean_one_deleted_snapshot+0xbb/0xf0 [btrfs] [ 931.590607] cleaner_kthread+0x136/0x160 [btrfs] [ 931.590612] kthread+0x120/0x140 [ 931.590646] ? btree_submit_bio_start+0x20/0x20 [btrfs] [ 931.590658] ? 
kthread_bind+0x40/0x40 [ 931.590661] ret_from_fork+0x22/0x40 After rebooting to 4.19.3, I've started seeing read errors. There were no errors with a previous kernel (4.17.14); disks are healthy according to SMART; no errors reported when we read the whole surface with dd. # grep -i btrfs.*corrected /var/log/syslog|wc -l 156 Things like: Nov 22 12:17:43 lxd05 kernel: [ 711.538836] BTRFS info (device sda2): read error corrected: ino 0 off 3739083145216 (dev /dev/sdc2 sector 101088384) Nov 22 12:17:43 lxd05 kernel: [ 711.538905] BTRFS info (device sda2): read error corrected: ino 0 off 3739083149312 (dev /dev/sdc2 sector 101088392) Nov 22 12:17:43 lxd05 kernel: [ 711.538958] BTRFS info (device sda2): read error corrected: ino 0 off 3739083153408 (dev /dev/sdc2 sector 101088400) Nov 22 12:17:43 lxd05 kernel: [ 711.539006] BTRFS info (device sda2): read error corrected: ino 0 off 3739083157504 (dev /dev/sdc2 sector 101088408) Yet - 0 errors, according to stats, not sure if it's expected or not: # btrfs device stats /data/lxd [/dev/sda2].write_io_errs0 [/dev/sda2].read_io_errs 0 [/dev/sda2].flush_io_errs0 [/dev/sda2].corruption_errs 0 [/dev/sda2].generation_errs 0 [/dev/sdb2].write_io_errs0 [/dev/sdb2].read_io_errs 0 [/dev/sdb2].flush_io_errs0 [/dev/sdb2].corruption_errs 0 [/dev/sdb2].generation_errs 0 [/dev/sdc2].write_io_errs0 [/dev/sdc2].read_io_errs 0 [/dev/sdc2].flush_io_errs0 [/dev/sdc2].corruption_errs 0 [/dev/sdc2].generation_errs 0 # btrfs fi usage /data/lxd Overall: Device size: 5.18TiB Device allocated: 2.36TiB Device unallocated:2.83TiB Device missing: 0.00B Used: 1.93TiB Free (estimated): 1.62TiB (min: 1.62TiB) Data ratio: 2.00 Metadata ratio: 2.00 Global reserve: 512.00MiB (used: 0.00B) Data,RAID1: Size:1.17TiB, Used:983.87GiB /dev/sda2 796.00GiB /dev/sdb2 799.00GiB /dev/sdc2 797.00GiB Metadata,RAID1: Size:10.00GiB, Used:4.09GiB /dev/sda2 8.00GiB /dev/sdb2 5.00GiB /dev/sdc2 7.00GiB System,RAID1: Size:64.00MiB, Used:224.00KiB /dev/sda2 32.00MiB /dev/sdb2 
64.00MiB /dev/sdc2 32.00MiB Unallocated: /dev/sda2 964.46GiB /dev/sdb2 964.43GiB /dev/sdc2 964.46GiB # btrfs fi df /data/lxd Data, RAID1: total=1.17TiB, used=983.04GiB System, RAID1: total=64.00MiB, used=224.00KiB Metadata, RAID1: total=10.00GiB, used=4.08GiB GlobalReserve,
Re: [RFC][PATCH v4 09/09] btrfs: use common file type conversion
On Wed 21-11-18 19:07:06, Phillip Potter wrote: > Deduplicate the btrfs file type conversion implementation - file systems > that use the same file types as defined by POSIX do not need to define > their own versions and can use the common helper functions decared in > fs_types.h and implemented in fs_types.c > > Acked-by: David Sterba > Signed-off-by: Amir Goldstein > Signed-off-by: Phillip Potter The patch looks good. You can add: Reviewed-by: Jan Kara Honza > --- > fs/btrfs/btrfs_inode.h | 2 -- > fs/btrfs/delayed-inode.c| 2 +- > fs/btrfs/inode.c| 32 +++- > include/uapi/linux/btrfs_tree.h | 2 ++ > 4 files changed, 18 insertions(+), 20 deletions(-) > > diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h > index 97d91e55b70a..bb01c804485f 100644 > --- a/fs/btrfs/btrfs_inode.h > +++ b/fs/btrfs/btrfs_inode.h > @@ -196,8 +196,6 @@ struct btrfs_inode { > struct inode vfs_inode; > }; > > -extern unsigned char btrfs_filetype_table[]; > - > static inline struct btrfs_inode *BTRFS_I(const struct inode *inode) > { > return container_of(inode, struct btrfs_inode, vfs_inode); > diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c > index c669f250d4a0..e61947f5eb76 100644 > --- a/fs/btrfs/delayed-inode.c > +++ b/fs/btrfs/delayed-inode.c > @@ -1692,7 +1692,7 @@ int btrfs_readdir_delayed_dir_index(struct dir_context > *ctx, > name = (char *)(di + 1); > name_len = btrfs_stack_dir_name_len(di); > > - d_type = btrfs_filetype_table[di->type]; > + d_type = fs_ftype_to_dtype(di->type); > btrfs_disk_key_to_cpu(, >location); > > over = !dir_emit(ctx, name, name_len, > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 9ea4c6f0352f..8b7b1b29e2ad 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -72,17 +72,6 @@ struct kmem_cache *btrfs_trans_handle_cachep; > struct kmem_cache *btrfs_path_cachep; > struct kmem_cache *btrfs_free_space_cachep; > > -#define S_SHIFT 12 > -static const unsigned char btrfs_type_by_mode[S_IFMT >> S_SHIFT] = { > - [S_IFREG >> 
S_SHIFT]= BTRFS_FT_REG_FILE, > - [S_IFDIR >> S_SHIFT]= BTRFS_FT_DIR, > - [S_IFCHR >> S_SHIFT]= BTRFS_FT_CHRDEV, > - [S_IFBLK >> S_SHIFT]= BTRFS_FT_BLKDEV, > - [S_IFIFO >> S_SHIFT]= BTRFS_FT_FIFO, > - [S_IFSOCK >> S_SHIFT] = BTRFS_FT_SOCK, > - [S_IFLNK >> S_SHIFT]= BTRFS_FT_SYMLINK, > -}; > - > static int btrfs_setsize(struct inode *inode, struct iattr *attr); > static int btrfs_truncate(struct inode *inode, bool skip_writeback); > static int btrfs_finish_ordered_io(struct btrfs_ordered_extent > *ordered_extent); > @@ -5793,10 +5782,6 @@ static struct dentry *btrfs_lookup(struct inode *dir, > struct dentry *dentry, > return d_splice_alias(inode, dentry); > } > > -unsigned char btrfs_filetype_table[] = { > - DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK > -}; > - > /* > * All this infrastructure exists because dir_emit can fault, and we are > holding > * the tree lock when doing readdir. For now just allocate a buffer and copy > @@ -5935,7 +5920,7 @@ static int btrfs_real_readdir(struct file *file, struct > dir_context *ctx) > name_ptr = (char *)(entry + 1); > read_extent_buffer(leaf, name_ptr, (unsigned long)(di + 1), > name_len); > - put_unaligned(btrfs_filetype_table[btrfs_dir_type(leaf, di)], > + put_unaligned(fs_ftype_to_dtype(btrfs_dir_type(leaf, di)), > >type); > btrfs_dir_item_key_to_cpu(leaf, di, ); > put_unaligned(location.objectid, >ino); > @@ -6340,7 +6325,20 @@ static struct inode *btrfs_new_inode(struct > btrfs_trans_handle *trans, > > static inline u8 btrfs_inode_type(struct inode *inode) > { > - return btrfs_type_by_mode[(inode->i_mode & S_IFMT) >> S_SHIFT]; > + /* > + * compile-time asserts that generic FT_x types still match > + * BTRFS_FT_x types > + */ > + BUILD_BUG_ON(BTRFS_FT_UNKNOWN != FT_UNKNOWN); > + BUILD_BUG_ON(BTRFS_FT_REG_FILE != FT_REG_FILE); > + BUILD_BUG_ON(BTRFS_FT_DIR != FT_DIR); > + BUILD_BUG_ON(BTRFS_FT_CHRDEV != FT_CHRDEV); > + BUILD_BUG_ON(BTRFS_FT_BLKDEV != FT_BLKDEV); > + BUILD_BUG_ON(BTRFS_FT_FIFO != 
FT_FIFO); > + BUILD_BUG_ON(BTRFS_FT_SOCK != FT_SOCK); > + BUILD_BUG_ON(BTRFS_FT_SYMLINK != FT_SYMLINK); > + > + return fs_umode_to_ftype(inode->i_mode); > } > > /* > diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h > index aff1356c2bb8..4126ed7ee89a 100644 > --- a/include/uapi/linux/btrfs_tree.h > +++ b/include/uapi/linux/btrfs_tree.h > @@ -307,6 +307,8 @@ > * > * Used by: > * struct btrfs_dir_item.type > + * > + * Values
Re: [PATCH 4/6] btrfs: only track ref_heads in delayed_ref_updates
On 21.11.18 г. 20:59 ч., Josef Bacik wrote:
> From: Josef Bacik
>
> We use this number to figure out how many delayed refs to run, but
> __btrfs_run_delayed_refs really only checks every time we need a new
> delayed ref head, so we always run at least one ref head completely no
> matter what the number of items on it. Fix the accounting to only be
> adjusted when we add/remove a ref head.

LGTM:

Reviewed-by: Nikolay Borisov

However, what if we kill delayed_ref_updates, since the name is a bit
ambiguous, and instead migrate num_heads_ready from delayed_refs to
trans and use that? Otherwise, as stated previously, num_heads_ready is
currently unused and could be removed.

>
> Signed-off-by: Josef Bacik
> ---
>  fs/btrfs/delayed-ref.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index b3e4c9fcb664..48725fa757a3 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -251,8 +251,6 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
>  	ref->in_tree = 0;
>  	btrfs_put_delayed_ref(ref);
>  	atomic_dec(&delayed_refs->num_entries);
> -	if (trans->delayed_ref_updates)
> -		trans->delayed_ref_updates--;
>  }
>
>  static bool merge_ref(struct btrfs_trans_handle *trans,
> @@ -467,7 +465,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
>  	if (ref->action == BTRFS_ADD_DELAYED_REF)
>  		list_add_tail(&ref->add_list, &href->ref_add_list);
>  	atomic_inc(&root->num_entries);
> -	trans->delayed_ref_updates++;
>  	spin_unlock(&href->lock);
>  	return ret;
> }
>
Re: [PATCH 3/6] btrfs: cleanup extent_op handling
On 21.11.18 г. 20:59 ч., Josef Bacik wrote: > From: Josef Bacik > > The cleanup_extent_op function actually would run the extent_op if it > needed running, which made the name sort of a misnomer. Change it to > run_and_cleanup_extent_op, and move the actual cleanup work to > cleanup_extent_op so it can be used by check_ref_cleanup() in order to > unify the extent op handling. The whole name extent_op is actually a misnomer since AFAIR this is some sort of modification of the references of metadata nodes. I don't see why it can't be made as yet another type of reference which is run for a given node. > > Signed-off-by: Josef Bacik > --- > fs/btrfs/extent-tree.c | 36 +++- > 1 file changed, 23 insertions(+), 13 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index e3ed3507018d..8a776dc9cb38 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -2424,19 +2424,33 @@ static void unselect_delayed_ref_head(struct > btrfs_delayed_ref_root *delayed_ref > btrfs_delayed_ref_unlock(head); > } > > -static int cleanup_extent_op(struct btrfs_trans_handle *trans, > - struct btrfs_delayed_ref_head *head) > +static struct btrfs_delayed_extent_op * > +cleanup_extent_op(struct btrfs_trans_handle *trans, > + struct btrfs_delayed_ref_head *head) > { > struct btrfs_delayed_extent_op *extent_op = head->extent_op; > - int ret; > > if (!extent_op) > - return 0; > - head->extent_op = NULL; > + return NULL; > + > if (head->must_insert_reserved) { > + head->extent_op = NULL; > btrfs_free_delayed_extent_op(extent_op); > - return 0; > + return NULL; > } > + return extent_op; > +} > + > +static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans, > + struct btrfs_delayed_ref_head *head) > +{ > + struct btrfs_delayed_extent_op *extent_op = > + cleanup_extent_op(trans, head); > + int ret; > + > + if (!extent_op) > + return 0; > + head->extent_op = NULL; > spin_unlock(>lock); > ret = run_delayed_extent_op(trans, head, extent_op); > 
btrfs_free_delayed_extent_op(extent_op); > @@ -2488,7 +2502,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle > *trans, > > delayed_refs = >transaction->delayed_refs; > > - ret = cleanup_extent_op(trans, head); > + ret = run_and_cleanup_extent_op(trans, head); > if (ret < 0) { > unselect_delayed_ref_head(delayed_refs, head); > btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret); > @@ -6977,12 +6991,8 @@ static noinline int check_ref_cleanup(struct > btrfs_trans_handle *trans, > if (!RB_EMPTY_ROOT(>ref_tree.rb_root)) > goto out; > > - if (head->extent_op) { > - if (!head->must_insert_reserved) > - goto out; > - btrfs_free_delayed_extent_op(head->extent_op); > - head->extent_op = NULL; > - } > + if (cleanup_extent_op(trans, head) != NULL) > + goto out; > > /* >* waiting for the lock here would deadlock. If someone else has it >
Re: [PATCH 1/6] btrfs: add btrfs_delete_ref_head helper
On 22.11.18 г. 11:12 ч., Nikolay Borisov wrote: > > > On 21.11.18 г. 20:59 ч., Josef Bacik wrote: >> From: Josef Bacik >> >> We do this dance in cleanup_ref_head and check_ref_cleanup, unify it >> into a helper and cleanup the calling functions. >> >> Signed-off-by: Josef Bacik >> Reviewed-by: Omar Sandoval >> --- >> fs/btrfs/delayed-ref.c | 14 ++ >> fs/btrfs/delayed-ref.h | 3 ++- >> fs/btrfs/extent-tree.c | 22 +++--- >> 3 files changed, 19 insertions(+), 20 deletions(-) >> >> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c >> index 9301b3ad9217..b3e4c9fcb664 100644 >> --- a/fs/btrfs/delayed-ref.c >> +++ b/fs/btrfs/delayed-ref.c >> @@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head( >> return head; >> } >> >> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs, >> + struct btrfs_delayed_ref_head *head) >> +{ >> +lockdep_assert_held(_refs->lock); >> +lockdep_assert_held(>lock); >> + >> +rb_erase_cached(>href_node, _refs->href_root); >> +RB_CLEAR_NODE(>href_node); >> +atomic_dec(_refs->num_entries); >> +delayed_refs->num_heads--; >> +if (head->processing == 0) >> +delayed_refs->num_heads_ready--; > > num_heads_ready will never execute in cleanup_ref_head, since > processing == 0 only when the ref head is unselected. Perhaps those 2 > lines shouldn't be in this function? I find it a bit confusing that if > processing is 0 we decrement num_heads_ready in check_ref_cleanup, but > in unselect_delayed_ref_head we set it to 0 and increment it. > >> +} >> + >> /* >> * Helper to insert the ref_node to the tail or merge with tail. 
>> * >> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h >> index 8e20c5cb5404..d2af974f68a1 100644 >> --- a/fs/btrfs/delayed-ref.h >> +++ b/fs/btrfs/delayed-ref.h >> @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct >> btrfs_delayed_ref_head *head) >> { >> mutex_unlock(>mutex); >> } >> - >> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs, >> + struct btrfs_delayed_ref_head *head); >> >> struct btrfs_delayed_ref_head *btrfs_select_ref_head( >> struct btrfs_delayed_ref_root *delayed_refs); >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index d242a1174e50..c36b3a42f2bb 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle >> *trans, >> spin_unlock(_refs->lock); >> return 1; >> } >> -delayed_refs->num_heads--; >> -rb_erase_cached(>href_node, _refs->href_root); >> -RB_CLEAR_NODE(>href_node); >> +btrfs_delete_ref_head(delayed_refs, head); >> spin_unlock(>lock); >> spin_unlock(_refs->lock); >> -atomic_dec(_refs->num_entries); >> >> trace_run_delayed_ref_head(fs_info, head, 0); >> >> @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct >> btrfs_trans_handle *trans, >> if (!mutex_trylock(>mutex)) >> goto out; >> >> -/* >> - * at this point we have a head with no other entries. Go >> - * ahead and process it. >> - */ >> -rb_erase_cached(>href_node, _refs->href_root); >> -RB_CLEAR_NODE(>href_node); >> -atomic_dec(_refs->num_entries); >> - >> -/* >> - * we don't take a ref on the node because we're removing it from the >> - * tree, so we just steal the ref the tree was holding. 
>> -	 */
>> -	delayed_refs->num_heads--;
>> -	if (head->processing == 0)
>> -		delayed_refs->num_heads_ready--;
>> +	btrfs_delete_ref_head(delayed_refs, head);
>>  	head->processing = 0;

On closer inspection I think here we can do:

ASSERT(head->processing == 0)

because at that point we've taken the head->lock spinlock, which is held
during ordinary delayed refs processing (in __btrfs_run_delayed_refs)
when the head is selected (and processing is 1). So head->processing == 0
here I think is a hard invariant of the code. The decrement here should
pair with the increment made when the head was initially added to the
tree. In cleanup_ref_head we don't ever need to worry about
num_heads_ready, since it has already been decremented by
btrfs_select_ref_head. As a matter of fact this counter is not used
anywhere, so we might as well just remove it.

>
> Something is fishy here, before the code checked for processing == 0 and
> then also set it to 0 ?
>
>> +
>>  	spin_unlock(&head->lock);
>>  	spin_unlock(&delayed_refs->lock);
>>
>>
>
Re: [PATCH 1/6] btrfs: add btrfs_delete_ref_head helper
On 21.11.18 г. 20:59 ч., Josef Bacik wrote: > From: Josef Bacik > > We do this dance in cleanup_ref_head and check_ref_cleanup, unify it > into a helper and cleanup the calling functions. > > Signed-off-by: Josef Bacik > Reviewed-by: Omar Sandoval > --- > fs/btrfs/delayed-ref.c | 14 ++ > fs/btrfs/delayed-ref.h | 3 ++- > fs/btrfs/extent-tree.c | 22 +++--- > 3 files changed, 19 insertions(+), 20 deletions(-) > > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c > index 9301b3ad9217..b3e4c9fcb664 100644 > --- a/fs/btrfs/delayed-ref.c > +++ b/fs/btrfs/delayed-ref.c > @@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head( > return head; > } > > +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs, > +struct btrfs_delayed_ref_head *head) > +{ > + lockdep_assert_held(_refs->lock); > + lockdep_assert_held(>lock); > + > + rb_erase_cached(>href_node, _refs->href_root); > + RB_CLEAR_NODE(>href_node); > + atomic_dec(_refs->num_entries); > + delayed_refs->num_heads--; > + if (head->processing == 0) > + delayed_refs->num_heads_ready--; num_heads_ready will never execute in cleanup_ref_head, since processing == 0 only when the ref head is unselected. Perhaps those 2 lines shouldn't be in this function? I find it a bit confusing that if processing is 0 we decrement num_heads_ready in check_ref_cleanup, but in unselect_delayed_ref_head we set it to 0 and increment it. > +} > + > /* > * Helper to insert the ref_node to the tail or merge with tail. 
> * > diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h > index 8e20c5cb5404..d2af974f68a1 100644 > --- a/fs/btrfs/delayed-ref.h > +++ b/fs/btrfs/delayed-ref.h > @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct > btrfs_delayed_ref_head *head) > { > mutex_unlock(>mutex); > } > - > +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs, > +struct btrfs_delayed_ref_head *head); > > struct btrfs_delayed_ref_head *btrfs_select_ref_head( > struct btrfs_delayed_ref_root *delayed_refs); > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index d242a1174e50..c36b3a42f2bb 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle > *trans, > spin_unlock(_refs->lock); > return 1; > } > - delayed_refs->num_heads--; > - rb_erase_cached(>href_node, _refs->href_root); > - RB_CLEAR_NODE(>href_node); > + btrfs_delete_ref_head(delayed_refs, head); > spin_unlock(>lock); > spin_unlock(_refs->lock); > - atomic_dec(_refs->num_entries); > > trace_run_delayed_ref_head(fs_info, head, 0); > > @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct > btrfs_trans_handle *trans, > if (!mutex_trylock(>mutex)) > goto out; > > - /* > - * at this point we have a head with no other entries. Go > - * ahead and process it. > - */ > - rb_erase_cached(>href_node, _refs->href_root); > - RB_CLEAR_NODE(>href_node); > - atomic_dec(_refs->num_entries); > - > - /* > - * we don't take a ref on the node because we're removing it from the > - * tree, so we just steal the ref the tree was holding. > - */ > - delayed_refs->num_heads--; > - if (head->processing == 0) > - delayed_refs->num_heads_ready--; > + btrfs_delete_ref_head(delayed_refs, head); > head->processing = 0; Something is fishy here, before the code checked for processing == 0 and then also set it to 0 ? > + > spin_unlock(>lock); > spin_unlock(_refs->lock); > >
Re: [PATCH 3/6] btrfs: cleanup extent_op handling
On Wed, Nov 21, 2018 at 01:59:09PM -0500, Josef Bacik wrote: >From: Josef Bacik > >The cleanup_extent_op function actually would run the extent_op if it >needed running, which made the name sort of a misnomer. Change it to >run_and_cleanup_extent_op, and move the actual cleanup work to >cleanup_extent_op so it can be used by check_ref_cleanup() in order to >unify the extent op handling. > >Signed-off-by: Josef Bacik One nitpick below. Reviewed-by: Lu Fengqi >--- > fs/btrfs/extent-tree.c | 36 +++- > 1 file changed, 23 insertions(+), 13 deletions(-) > >diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >index e3ed3507018d..8a776dc9cb38 100644 >--- a/fs/btrfs/extent-tree.c >+++ b/fs/btrfs/extent-tree.c >@@ -2424,19 +2424,33 @@ static void unselect_delayed_ref_head(struct >btrfs_delayed_ref_root *delayed_ref > btrfs_delayed_ref_unlock(head); > } > >-static int cleanup_extent_op(struct btrfs_trans_handle *trans, >- struct btrfs_delayed_ref_head *head) >+static struct btrfs_delayed_extent_op * >+cleanup_extent_op(struct btrfs_trans_handle *trans, The trans parameter seems useless. 
-- Thanks, Lu >+struct btrfs_delayed_ref_head *head) > { > struct btrfs_delayed_extent_op *extent_op = head->extent_op; >- int ret; > > if (!extent_op) >- return 0; >- head->extent_op = NULL; >+ return NULL; >+ > if (head->must_insert_reserved) { >+ head->extent_op = NULL; > btrfs_free_delayed_extent_op(extent_op); >- return 0; >+ return NULL; > } >+ return extent_op; >+} >+ >+static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans, >+ struct btrfs_delayed_ref_head *head) >+{ >+ struct btrfs_delayed_extent_op *extent_op = >+ cleanup_extent_op(trans, head); >+ int ret; >+ >+ if (!extent_op) >+ return 0; >+ head->extent_op = NULL; > spin_unlock(>lock); > ret = run_delayed_extent_op(trans, head, extent_op); > btrfs_free_delayed_extent_op(extent_op); >@@ -2488,7 +2502,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle >*trans, > > delayed_refs = >transaction->delayed_refs; > >- ret = cleanup_extent_op(trans, head); >+ ret = run_and_cleanup_extent_op(trans, head); > if (ret < 0) { > unselect_delayed_ref_head(delayed_refs, head); > btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret); >@@ -6977,12 +6991,8 @@ static noinline int check_ref_cleanup(struct >btrfs_trans_handle *trans, > if (!RB_EMPTY_ROOT(>ref_tree.rb_root)) > goto out; > >- if (head->extent_op) { >- if (!head->must_insert_reserved) >- goto out; >- btrfs_free_delayed_extent_op(head->extent_op); >- head->extent_op = NULL; >- } >+ if (cleanup_extent_op(trans, head) != NULL) >+ goto out; > > /* >* waiting for the lock here would deadlock. If someone else has it >-- >2.14.3 > > >
[PATCH] btrfs: Remove extent_io_ops::readpage_io_failed_hook
For data inodes this hook does nothing but to return -EAGAIN which is used to signal to the endio routines that this bio belongs to a data inode. If this is the case the actual retrying is handled by bio_readpage_error. Alternatively, if this bio belongs to the btree inode then btree_io_failed_hook just does some cleanup and doesn't retry anything. This patch simplifies the code flow by eliminating readpage_io_failed_hook and instead open-coding btree_io_failed_hook in end_bio_extent_readpage. Also eliminate some needless checks since IO is always performed on either data inode or btree inode, both of which are guaranteed to have their extent_io_tree::ops set. Signed-off-by: Nikolay Borisov --- fs/btrfs/disk-io.c | 16 -- fs/btrfs/extent_io.c | 69 ++-- fs/btrfs/extent_io.h | 1 - fs/btrfs/inode.c | 7 - 4 files changed, 34 insertions(+), 59 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index feb67dfd663d..5024d9374163 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -673,19 +673,6 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio, return ret; } -static int btree_io_failed_hook(struct page *page, int failed_mirror) -{ - struct extent_buffer *eb; - - eb = (struct extent_buffer *)page->private; - set_bit(EXTENT_BUFFER_READ_ERR, >bflags); - eb->read_mirror = failed_mirror; - atomic_dec(>io_pages); - if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, >bflags)) - btree_readahead_hook(eb, -EIO); - return -EIO;/* we fixed nothing */ -} - static void end_workqueue_bio(struct bio *bio) { struct btrfs_end_io_wq *end_io_wq = bio->bi_private; @@ -4541,7 +4528,4 @@ static const struct extent_io_ops btree_extent_io_ops = { /* mandatory callbacks */ .submit_bio_hook = btree_submit_bio_hook, .readpage_end_io_hook = btree_readpage_end_io_hook, - .readpage_io_failed_hook = btree_io_failed_hook, - - /* optional callbacks */ }; diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 4ea808d6cfbc..cca3bccd142e 100644 --- 
a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2333,11 +2333,10 @@ struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio, } /* - * this is a generic handler for readpage errors (default - * readpage_io_failed_hook). if other copies exist, read those and write back - * good data to the failed position. does not investigate in remapping the - * failed extent elsewhere, hoping the device will be smart enough to do this as - * needed + * This is a generic handler for readpage errors. If other copies exist, read + * those and write back good data to the failed position. Does not investigate + * in remapping the failed extent elsewhere, hoping the device will be smart + * enough to do this as needed */ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset, @@ -2501,6 +2500,8 @@ static void end_bio_extent_readpage(struct bio *bio) struct page *page = bvec->bv_page; struct inode *inode = page->mapping->host; struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + bool data_inode = btrfs_ino(BTRFS_I(inode)) + != BTRFS_BTREE_INODE_OBJECTID; btrfs_debug(fs_info, "end_bio_extent_readpage: bi_sector=%llu, err=%d, mirror=%u", @@ -2530,7 +2531,7 @@ static void end_bio_extent_readpage(struct bio *bio) len = bvec->bv_len; mirror = io_bio->mirror_num; - if (likely(uptodate && tree->ops)) { + if (likely(uptodate)) { ret = tree->ops->readpage_end_io_hook(io_bio, offset, page, start, end, mirror); @@ -2546,38 +2547,36 @@ static void end_bio_extent_readpage(struct bio *bio) if (likely(uptodate)) goto readpage_ok; - if (tree->ops) { - ret = tree->ops->readpage_io_failed_hook(page, mirror); - if (ret == -EAGAIN) { - /* -* Data inode's readpage_io_failed_hook() always -* returns -EAGAIN. -* -* The generic bio_readpage_error handles errors -* the following way: If possible, new read -* requests are created and submitted and will -* end up in end_bio_extent_readpage as well (if -* we're lucky, not in the !uptodate case). 
In -* that case it returns 0 and we just go on with -
[PATCH] btrfs: Remove unnecessary code from __btrfs_rebalance
The first step of the rebalance process is to make sure there is 1MB
free on each device. This number seems rather small. And in fact, when
talking to the original authors, their opinions were:

"man that's a little bonkers"
"i don't think we even need that code anymore"
"I think it was there to make sure we had room for the blank 1M at the
beginning. I bet it goes all the way back to v0"
"we just don't need any of that tho, i say we just delete it"

Clearly, this piece of code has lost its original intent throughout the
years. It doesn't bring any real practical benefit to the relocation
process. No functional changes.

Signed-off-by: Nikolay Borisov
Suggested-by: Josef Bacik
---
 fs/btrfs/volumes.c | 53 --
 1 file changed, 53 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8e36cbb355df..eb9fa8a6429c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3645,17 +3645,11 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_balance_control *bctl = fs_info->balance_ctl;
 	struct btrfs_root *chunk_root = fs_info->chunk_root;
-	struct btrfs_root *dev_root = fs_info->dev_root;
-	struct list_head *devices;
-	struct btrfs_device *device;
-	u64 old_size;
-	u64 size_to_free;
 	u64 chunk_type;
 	struct btrfs_chunk *chunk;
 	struct btrfs_path *path = NULL;
 	struct btrfs_key key;
 	struct btrfs_key found_key;
-	struct btrfs_trans_handle *trans;
 	struct extent_buffer *leaf;
 	int slot;
 	int ret;
@@ -3670,53 +3664,6 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 	u32 count_sys = 0;
 	int chunk_reserved = 0;
 
-	/* step one make some room on all the devices */
-	devices = &fs_info->fs_devices->devices;
-	list_for_each_entry(device, devices, dev_list) {
-		old_size = btrfs_device_get_total_bytes(device);
-		size_to_free = div_factor(old_size, 1);
-		size_to_free = min_t(u64, size_to_free, SZ_1M);
-		if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) ||
-		    btrfs_device_get_total_bytes(device) -
-		    btrfs_device_get_bytes_used(device) > size_to_free ||
-		    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
-			continue;
-
-		ret = btrfs_shrink_device(device, old_size - size_to_free);
-		if (ret == -ENOSPC)
-			break;
-		if (ret) {
-			/* btrfs_shrink_device never returns ret > 0 */
-			WARN_ON(ret > 0);
-			goto error;
-		}
-
-		trans = btrfs_start_transaction(dev_root, 0);
-		if (IS_ERR(trans)) {
-			ret = PTR_ERR(trans);
-			btrfs_info_in_rcu(fs_info,
-	 "resize: unable to start transaction after shrinking device %s (error %d), old size %llu, new size %llu",
-					  rcu_str_deref(device->name), ret,
-					  old_size, old_size - size_to_free);
-			goto error;
-		}
-
-		ret = btrfs_grow_device(trans, device, old_size);
-		if (ret) {
-			btrfs_end_transaction(trans);
-			/* btrfs_grow_device never returns ret > 0 */
-			WARN_ON(ret > 0);
-			btrfs_info_in_rcu(fs_info,
-	 "resize: unable to grow device after shrinking device %s (error %d), old size %llu, new size %llu",
-					  rcu_str_deref(device->name), ret,
-					  old_size, old_size - size_to_free);
-			goto error;
-		}
-
-		btrfs_end_transaction(trans);
-	}
-
-	/* step two, relocate all the chunks */
 	path = btrfs_alloc_path();
 	if (!path) {
 		ret = -ENOMEM;
-- 
2.17.1