Re:

2018-11-22 Thread Chris Murphy
On Thu, Nov 22, 2018 at 11:41 PM Andy Leadbetter
 wrote:
>
> I have a failing 2TB disk that is part of a 4 disk RAID 6 system.  I
> have added a new 2TB disk to the computer, and started a BTRFS replace
> for the old and new disk.  The process starts correctly, however some
> hours into the job there is an error and a kernel oops; the relevant
> log is below.

The relevant log is the entire dmesg, not a snippet. It's decently
likely there's more than one thing going on here. We also need full
output of 'smartctl -x' for all four drives, and also 'smartctl -l
scterc' for all four drives, and also 'cat
/sys/block/sda/device/timeout' for all four drives. And which bcache
mode you're using.
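
A loop like the following collects all of that in one file (the device
names sda..sdd are an assumption — substitute the four members of the
array):

```shell
# Gather SMART data (-x), SCT ERC settings (-l scterc), and the kernel's
# SCSI command timeout for each drive; '|| true' keeps the loop going
# even if one query fails or smartctl is missing.
for d in sda sdb sdc sdd; do
    echo "=== /dev/$d ==="
    smartctl -x "/dev/$d" || true
    smartctl -l scterc "/dev/$d" || true
    cat "/sys/block/$d/device/timeout" || true
done > drive-report.txt 2>&1
echo "wrote drive-report.txt"
```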

The call trace provided is from kernel 4.15, which is sufficiently long
ago that I think any dev working on raid56 would want to see where it's
getting tripped up on something a lot newer, and this is why:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/raid56.c?id=v4.19.3&id2=v4.15.1

That's a lot of changes in just the raid56 code between 4.15 and 4.19.
And then in your call trace, btrfs_dev_replace_start is found in
dev-replace.c, which likewise has a lot of changes. But then also, I
think 4.15 might still be in the era where it was not recommended to
use 'btrfs dev replace' for raid56, only non-raid56. I'm not sure if
the problems with device replace were fixed, and if they were fixed
kernel or progs side. Anyway, the latest I recall, it was recommended
on raid56 to 'btrfs dev add' then 'btrfs dev remove'.
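
A sketch of that add-then-remove path (the mount point /mnt and the
device names are placeholders, and the commands are printed as a dry
run rather than executed — drop the 'echo' to actually run them):

```shell
# Replace a failing raid56 member without 'btrfs replace': add the new
# device first, then remove the old one so its data is migrated off.
mnt=/mnt
newdev=/dev/bcache4
olddev=/dev/bcache1
echo "btrfs device add $newdev $mnt"
echo "btrfs device remove $olddev $mnt"
```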

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/dev-replace.c?id=v4.19.3&id2=v4.15.1

And that's only a few hundred changes for each. Check out inode.c -
there are over 2000 changes.


> The disks are configured on top of bcache, in 5 arrays with a small
> 128GB SSD cache shared.  The system in this configuration has worked
> perfectly for 3 years, until 2 weeks ago csum errors started
> appearing.  I have a CrashPlan backup of all files on the disk, so I
> am not concerned about data loss, but I would like to avoid rebuilding
> the system.

btrfs-progs 4.17 still considers raid56 experimental, not for
production use. And three years ago, the current upstream kernel
release was 4.3, so I'm gonna guess the kernel history of this file
system goes back even further than that, very close to the birth of
the raid56 code. And then adding bcache to the mix just makes it all
the more complicated.



>
> btrfs dev stats shows
> [/dev/bcache0].write_io_errs    0
> [/dev/bcache0].read_io_errs     0
> [/dev/bcache0].flush_io_errs    0
> [/dev/bcache0].corruption_errs  0
> [/dev/bcache0].generation_errs  0
> [/dev/bcache1].write_io_errs    0
> [/dev/bcache1].read_io_errs     20
> [/dev/bcache1].flush_io_errs    0
> [/dev/bcache1].corruption_errs  0
> [/dev/bcache1].generation_errs  14
> [/dev/bcache3].write_io_errs    0
> [/dev/bcache3].read_io_errs     0
> [/dev/bcache3].flush_io_errs    0
> [/dev/bcache3].corruption_errs  0
> [/dev/bcache3].generation_errs  19
> [/dev/bcache2].write_io_errs    0
> [/dev/bcache2].read_io_errs     0
> [/dev/bcache2].flush_io_errs    0
> [/dev/bcache2].corruption_errs  0
> [/dev/bcache2].generation_errs  2


3 of 4 drives have at least one generation error. While there are no
corruptions reported, generation errors can be really tricky to
recover from at all. If only one device had only read errors, this
would be a lot less difficult.


> I've tried the latest kernel, and the latest tools, but nothing will
> allow me to replace, or delete the failed disk.

If the file system is mounted, I would try to make a local backup ASAP,
before you lose the whole volume. Whether it's an LVM pool of two
drives (linear/concat) with XFS, or you go with Btrfs -dsingle -mraid1
(also basically a concat), doesn't really matter; I'd get whatever you
can off the drive. I expect that avoiding a rebuild in some form or
another is wishful thinking and not very likely.

The more changes you make to the file system, whether repair attempts
or other writes, the lower the chance of recovery.
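
A minimal sketch of such a scratch target, assuming two spare drives at
hypothetical paths /dev/sdx and /dev/sdy (printed as a dry run rather
than executed):

```shell
# Build the mkfs invocation for a Btrfs -dsingle -mraid1 concat of two
# spare drives to receive the emergency backup. Dry run: the commands
# are printed, not executed.
devs="/dev/sdx /dev/sdy"
cmd="mkfs.btrfs -d single -m raid1 $devs"
echo "$cmd"
echo "mount /dev/sdx /mnt/recovery"
```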

-- 
Chris Murphy


Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3

2018-11-22 Thread Chris Murphy
On Thu, Nov 22, 2018 at 6:07 AM Tomasz Chmielewski  wrote:
>
> On 2018-11-22 21:46, Nikolay Borisov wrote:
>
> >> # echo w > /proc/sysrq-trigger
> >>
> >> # dmesg -c
> >> [  931.585611] sysrq: SysRq : Show Blocked State
> >> [  931.585715]   taskPC stack   pid father
> >> [  931.590168] btrfs-cleaner   D0  1340  2 0x8000
> >> [  931.590175] Call Trace:
> >> [  931.590190]  __schedule+0x29e/0x840
> >> [  931.590195]  schedule+0x2c/0x80
> >> [  931.590199]  schedule_timeout+0x258/0x360
> >> [  931.590204]  io_schedule_timeout+0x1e/0x50
> >> [  931.590208]  wait_for_completion_io+0xb7/0x140
> >> [  931.590214]  ? wake_up_q+0x80/0x80
> >> [  931.590219]  submit_bio_wait+0x61/0x90
> >> [  931.590225]  blkdev_issue_discard+0x7a/0xd0
> >> [  931.590266]  btrfs_issue_discard+0x123/0x160 [btrfs]
> >> [  931.590299]  btrfs_discard_extent+0xd8/0x160 [btrfs]
> >> [  931.590335]  btrfs_finish_extent_commit+0xe2/0x240 [btrfs]
> >> [  931.590382]  btrfs_commit_transaction+0x573/0x840 [btrfs]
> >> [  931.590415]  ? btrfs_block_rsv_check+0x25/0x70 [btrfs]
> >> [  931.590456]  __btrfs_end_transaction+0x2be/0x2d0 [btrfs]
> >> [  931.590493]  btrfs_end_transaction_throttle+0x13/0x20 [btrfs]
> >> [  931.590530]  btrfs_drop_snapshot+0x489/0x800 [btrfs]
> >> [  931.590567]  btrfs_clean_one_deleted_snapshot+0xbb/0xf0 [btrfs]
> >> [  931.590607]  cleaner_kthread+0x136/0x160 [btrfs]
> >> [  931.590612]  kthread+0x120/0x140
> >> [  931.590646]  ? btree_submit_bio_start+0x20/0x20 [btrfs]
> >> [  931.590658]  ? kthread_bind+0x40/0x40
> >> [  931.590661]  ret_from_fork+0x22/0x40
> >>
> >
> > It seems your filesystem is mounted with the DISCARD option, meaning
> > every delete will result in a discard; this is highly suboptimal for
> > SSDs.
> > Try remounting the fs without the discard option and see if it helps.
> > Generally for discard you want to submit it in big batches (what fstrim
> > does) so that the FTL on the SSD can apply any optimisations it might
> > have up its sleeve.
>
> Spot on!
>
> Removed "discard" from fstab and added "ssd", rebooted - no more
> btrfs-cleaner running.
>
> Do you know if the issue you described ("discard this is highly
> suboptimal for ssd") affects other filesystems as well to a similar
> extent? I.e. if using ext4 on ssd?

Quite a lot of the activity on ext4 and XFS consists of overwrites, so
discard isn't needed as often. And it might be that discard is subject
to delays there. On Btrfs, it's almost immediate, to the degree that on
a couple of SSDs I've tested, stale trees referenced exclusively by the
most recent backup tree entries in the superblock are already zeroed.
That functionally means no automatic recoveries at mount time if
there's a problem with any of the current trees.

I was using it for about a year with no ill effect, BUT not a lot of
file deletions either. I wouldn't recommend it, and instead suggest
enabling the fstrim.timer, which by default runs fstrim.service once a
week (which in turn issues fstrim, I think on all mounted volumes).
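
In fstab terms, the switch Tomasz described looks something like this
(the UUID and mount point are placeholders):

```
# before: continuous discard on every delete
# UUID=0123-PLACEHOLDER  /data  btrfs  defaults,discard  0  0
# after: drop 'discard', add 'ssd'; batching is left to fstrim.timer
UUID=0123-PLACEHOLDER    /data  btrfs  defaults,ssd      0  0
```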

I'm a bit more concerned about the read errors you mentioned that were
being corrected automatically. The corruption suggests a firmware bug
related to trim. I'd check the affected SSD's firmware revision and
consider updating it (but only after a backup; it's plausible the
firmware update is not guaranteed to be data safe). Does the volume use
DUP or raid1 metadata? I'm not sure how it's correcting these problems
otherwise.


-- 
Chris Murphy


[no subject]

2018-11-22 Thread Andy Leadbetter
I have a failing 2TB disk that is part of a 4 disk RAID 6 system.  I
have added a new 2TB disk to the computer, and started a BTRFS replace
for the old and new disk.  The process starts correctly, however some
hours into the job there is an error and a kernel oops; the relevant
log is below.

The disks are configured on top of bcache, in 5 arrays with a small
128GB SSD cache shared.  The system in this configuration has worked
perfectly for 3 years, until 2 weeks ago csum errors started
appearing.  I have a CrashPlan backup of all files on the disk, so I
am not concerned about data loss, but I would like to avoid rebuilding
the system.

btrfs dev stats shows
[/dev/bcache0].write_io_errs    0
[/dev/bcache0].read_io_errs     0
[/dev/bcache0].flush_io_errs    0
[/dev/bcache0].corruption_errs  0
[/dev/bcache0].generation_errs  0
[/dev/bcache1].write_io_errs    0
[/dev/bcache1].read_io_errs     20
[/dev/bcache1].flush_io_errs    0
[/dev/bcache1].corruption_errs  0
[/dev/bcache1].generation_errs  14
[/dev/bcache3].write_io_errs    0
[/dev/bcache3].read_io_errs     0
[/dev/bcache3].flush_io_errs    0
[/dev/bcache3].corruption_errs  0
[/dev/bcache3].generation_errs  19
[/dev/bcache2].write_io_errs    0
[/dev/bcache2].read_io_errs     0
[/dev/bcache2].flush_io_errs    0
[/dev/bcache2].corruption_errs  0
[/dev/bcache2].generation_errs  2

and a SMART test of the backing disk behind /dev/bcache1 shows a high
read error rate and a lot of reallocated sectors.  The disk is 10 years
old and has clearly started to fail.

I've tried the latest kernel, and the latest tools, but nothing will
allow me to replace, or delete the failed disk.

[  884.171025] BTRFS info (device bcache0): dev_replace from
/dev/bcache1 (devid 2) to /dev/bcache4 started
[ 3301.101958] BTRFS error (device bcache0): parent transid verify
failed on 8251260944384 wanted 640926 found 640907
[ 3301.241214] BTRFS error (device bcache0): parent transid verify
failed on 8251260944384 wanted 640926 found 640907
[ 3301.241398] BTRFS error (device bcache0): parent transid verify
failed on 8251260944384 wanted 640926 found 640907
[ 3301.241513] BTRFS error (device bcache0): parent transid verify
failed on 8251260944384 wanted 640926 found 640907
[ 3302.381094] BTRFS error (device bcache0):
btrfs_scrub_dev(/dev/bcache1, 2, /dev/bcache4) failed -5
[ 3302.394612] WARNING: CPU: 0 PID: 5936 at
/build/linux-5s7Xkn/linux-4.15.0/fs/btrfs/dev-replace.c:413
btrfs_dev_replace_start+0x281/0x320 [btrfs]
[ 3302.394613] Modules linked in: btrfs zstd_compress xor raid6_pq
bcache intel_rapl x86_pkg_temp_thermal intel_powerclamp
snd_hda_codec_hdmi coretemp kvm_intel snd_hda_codec_realtek kvm
snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core
irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hwdep
snd_pcm pcbc snd_seq_midi aesni_intel snd_seq_midi_event joydev
input_leds aes_x86_64 snd_rawmidi crypto_simd glue_helper snd_seq
eeepc_wmi cryptd asus_wmi snd_seq_device snd_timer wmi_bmof
sparse_keymap snd intel_cstate intel_rapl_perf soundcore mei_me mei
shpchp mac_hid sch_fq_codel acpi_pad parport_pc ppdev lp parport
ip_tables x_tables autofs4 overlay nls_iso8859_1 dm_mirror
dm_region_hash dm_log hid_generic usbhid hid uas usb_storage i915
i2c_algo_bit drm_kms_helper syscopyarea sysfillrect
[ 3302.394640]  sysimgblt fb_sys_fops r8169 mxm_wmi mii drm ahci
libahci wmi video
[ 3302.394646] CPU: 0 PID: 5936 Comm: btrfs Not tainted
4.15.0-20-generic #21-Ubuntu
[ 3302.394646] Hardware name: System manufacturer System Product
Name/H110M-R, BIOS 3404 10/10/2017
[ 3302.394658] RIP: 0010:btrfs_dev_replace_start+0x281/0x320 [btrfs]
[ 3302.394659] RSP: 0018:a8b582b5fd18 EFLAGS: 00010282
[ 3302.394660] RAX: fffb RBX: 927d3afe RCX: 
[ 3302.394660] RDX: 0001 RSI: 0296 RDI: 927d3afece90
[ 3302.394661] RBP: a8b582b5fd68 R08:  R09: a8b582b5fc18
[ 3302.394662] R10: a8b582b5fd10 R11:  R12: 927d3afece20
[ 3302.394662] R13: 927d34b59421 R14: 927d34b59020 R15: 0001
[ 3302.394663] FS:  7fba4831c8c0() GS:927df6c0()
knlGS:
[ 3302.394664] CS:  0010 DS:  ES:  CR0: 80050033
[ 3302.394664] CR2: 2b9b83db85b8 CR3: 000164d3a002 CR4: 003606f0
[ 3302.394665] Call Trace:
[ 3302.394676]  btrfs_dev_replace_by_ioctl+0x39/0x60 [btrfs]
[ 3302.394686]  btrfs_ioctl+0x1988/0x2080 [btrfs]
[ 3302.394689]  ? iput+0x8d/0x220
[ 3302.394690]  ? __blkdev_put+0x199/0x1f0
[ 3302.394692]  do_vfs_ioctl+0xa8/0x630
[ 3302.394701]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[ 3302.394703]  ? do_vfs_ioctl+0xa8/0x630
[ 3302.394704]  ? do_sigaction+0xb4/0x1e0
[ 3302.394706]  SyS_ioctl+0x79/0x90
[ 3302.394708]  do_syscall_64+0x73/0x130
[ 3302.394710]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 3302.394711] RIP: 0033:0x7fba471085d7
[ 3302.394712] RSP: 002b:7ffe5af753b8 EFLAGS: 0246 ORIG_RAX:
0010
[ 

[PATCH] btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable

2018-11-22 Thread Qu Wenruo
[BUG]
A completely valid btrfs will refuse to mount, with error message like:
  BTRFS critical (device sdb2): corrupt leaf: root=2 block=239681536 slot=172 \
bg_start=12018974720 bg_len=10888413184, invalid block group size, \
have 10888413184 expect (0, 10737418240]

Btrfs check returns no error, and all kernels used on this fs are later
than 2011, so they should all have the 10G chunk size limit commit.

[CAUSE]
For a 12-device btrfs, we could allocate a chunk larger than 10G due to
the stripe size being bumped up:

__btrfs_alloc_chunk()
|- max_stripe_size = 1G
|- max_chunk_size = 10G
|- data_stripes = 11
|- if (1G * 11 > 10G) {
       stripe_size = 10G / 11 = 976128930;
       stripe_size = round_up(976128930, SZ_16M) = 989855744;
   }

However the final stripe_size (989855744) * 11 = 10888413184, which is
still larger than 10G.
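
The arithmetic above can be reproduced directly in shell (a sketch; the
SZ_16M and 10G values mirror the kernel constants named in the commit
message):

```shell
# Reproduce the overshoot: 10G split across 11 data stripes, with the
# per-stripe size rounded up to a 16MiB boundary.
SZ_16M=$((16 * 1024 * 1024))
max_chunk_size=$((10 * 1024 * 1024 * 1024))   # 10G
data_stripes=11
stripe_size=$(( max_chunk_size / data_stripes ))                  # 976128930
stripe_size=$(( (stripe_size + SZ_16M - 1) / SZ_16M * SZ_16M ))   # round_up -> 989855744
echo "chunk length: $(( stripe_size * data_stripes ))"            # 10888413184, larger than 10G
```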

[FIX]
The comprehensive check would need to be done at chunk read time,
relying on the bg <-> chunk mapping.

For now, just skip the upper length check and only reject a zero size.

Fixes: fce466eab7ac ("btrfs: tree-checker: Verify block_group_item")
Cc: sta...@vger.kernel.org # v4.19+
Reported-by: Wang Yugui 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/tree-checker.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index cab0b1f1f741..d8bd5340fbbc 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -389,13 +389,11 @@ static int check_block_group_item(struct btrfs_fs_info 
*fs_info,
 
/*
 * Here we don't really care about alignment since extent allocator can
-* handle it.  We care more about the size, as if one block group is
-* larger than maximum size, it's must be some obvious corruption.
+* handle it.  We care more about the size.
 */
-   if (key->offset > BTRFS_MAX_DATA_CHUNK_SIZE || key->offset == 0) {
+   if (key->offset == 0) {
block_group_err(fs_info, leaf, slot,
-   "invalid block group size, have %llu expect (0, %llu]",
-   key->offset, BTRFS_MAX_DATA_CHUNK_SIZE);
+   "invalid block group size 0");
return -EUCLEAN;
}
 
-- 
2.19.1



Re: [PATCH 4/4] btrfs: replace btrfs_io_bio::end_io with a simple helper

2018-11-22 Thread Nikolay Borisov



On 22.11.18 at 18:16, David Sterba wrote:
> The end_io callback implemented as btrfs_io_bio_endio_readpage only
> calls kfree. Also the callback is set only in case the csum buffer is
> allocated and not pointing to the inline buffer. We can use that
> information to drop the indirection and call a helper that will free the
> csums only in the right case.
> 
> This shrinks struct btrfs_io_bio by 8 bytes.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/extent_io.c |  3 +--
>  fs/btrfs/file-item.c |  9 -
>  fs/btrfs/inode.c |  7 ++-
>  fs/btrfs/volumes.h   | 10 --
>  4 files changed, 11 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 4ea808d6cfbc..aef3c9866ff0 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2623,8 +2623,7 @@ static void end_bio_extent_readpage(struct bio *bio)
>   if (extent_len)
>   endio_readpage_release_extent(tree, extent_start, extent_len,
> uptodate);
> - if (io_bio->end_io)
> - io_bio->end_io(io_bio, blk_status_to_errno(bio->bi_status));
> + btrfs_io_bio_free_csum(io_bio);
>   bio_put(bio);
>  }
>  
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index 1f2d0a6ab634..920bf3b4b0ef 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -142,14 +142,6 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle 
> *trans,
>   return ret;
>  }
>  
> -static void btrfs_io_bio_endio_readpage(struct btrfs_io_bio *bio, int err)
> -{
> - if (bio->csum != bio->csum_inline) {
> - kfree(bio->csum);
> - bio->csum = NULL;
> - }
> -}
> -
>  static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio 
> *bio,
>  u64 logical_offset, u32 *dst, int dio)
>  {
> @@ -184,7 +176,6 @@ static blk_status_t __btrfs_lookup_bio_sums(struct inode 
> *inode, struct bio *bio
>   btrfs_free_path(path);
>   return BLK_STS_RESOURCE;
>   }
> - btrfs_bio->end_io = btrfs_io_bio_endio_readpage;
>   } else {
>   btrfs_bio->csum = btrfs_bio->csum_inline;
>   }
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 26b8bec7c2dc..6bfd37e58924 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8017,9 +8017,7 @@ static void btrfs_endio_direct_read(struct bio *bio)
>  
>   dio_bio->bi_status = err;
>   dio_end_io(dio_bio);
> -
> - if (io_bio->end_io)
> - io_bio->end_io(io_bio, blk_status_to_errno(err));
> + btrfs_io_bio_free_csum(io_bio);
>   bio_put(bio);
>  }
>  
> @@ -8372,8 +8370,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, 
> struct inode *inode,
>   if (!ret)
>   return;
>  
> - if (io_bio->end_io)
> - io_bio->end_io(io_bio, ret);
> + btrfs_io_bio_free_csum(io_bio);
>  
>  free_ordered:
>   /*
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 9a764f2d462e..a13045fcfc45 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -267,14 +267,12 @@ struct btrfs_fs_devices {
>   * we allocate are actually btrfs_io_bios.  We'll cram as much of
>   * struct btrfs_bio as we can into this over time.
>   */
> -typedef void (btrfs_io_bio_end_io_t) (struct btrfs_io_bio *bio, int err);
>  struct btrfs_io_bio {
>   unsigned int mirror_num;
>   unsigned int stripe_index;
>   u64 logical;
>   u8 *csum;
>   u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
> - btrfs_io_bio_end_io_t *end_io;
>   struct bvec_iter iter;
>   /*
>* This member must come last, bio_alloc_bioset will allocate enough
> @@ -288,6 +286,14 @@ static inline struct btrfs_io_bio *btrfs_io_bio(struct 
> bio *bio)
>   return container_of(bio, struct btrfs_io_bio, bio);
>  }
>  
> +static inline void btrfs_io_bio_free_csum(struct btrfs_io_bio *io_bio)
> +{
> + if (io_bio->csum != io_bio->csum_inline) {
> + kfree(io_bio->csum);
> + io_bio->csum = NULL;
> + }
> +}
> +
>  struct btrfs_bio_stripe {
>   struct btrfs_device *dev;
>   u64 physical;
> 


Re: [PATCH 3/4] btrfs: remove redundant csum buffer in btrfs_io_bio

2018-11-22 Thread Nikolay Borisov



On 22.11.18 at 18:16, David Sterba wrote:
> The io_bio tracks checksums and has an inline buffer or an allocated
> one. And there's a third member that points to the right one, but we
> don't need to use an extra pointer for that. Let btrfs_io_bio::csum
> point to the right buffer and check that the inline buffer is not
> accidentally freed.
> 
> This shrinks struct btrfs_io_bio by 8 bytes.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/file-item.c | 12 +++-
>  fs/btrfs/volumes.h   |  1 -
>  2 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index ba74827beb32..1f2d0a6ab634 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -144,7 +144,10 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle 
> *trans,
>  
>  static void btrfs_io_bio_endio_readpage(struct btrfs_io_bio *bio, int err)
>  {
> - kfree(bio->csum_allocated);
> + if (bio->csum != bio->csum_inline) {
> + kfree(bio->csum);
> + bio->csum = NULL;
> + }
>  }
>  
>  static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio 
> *bio,
> @@ -175,13 +178,12 @@ static blk_status_t __btrfs_lookup_bio_sums(struct 
> inode *inode, struct bio *bio
>   nblocks = bio->bi_iter.bi_size >> inode->i_sb->s_blocksize_bits;
>   if (!dst) {
>   if (nblocks * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
> - btrfs_bio->csum_allocated = kmalloc_array(nblocks,
> - csum_size, GFP_NOFS);
> - if (!btrfs_bio->csum_allocated) {
> + btrfs_bio->csum = kmalloc_array(nblocks, csum_size,
> + GFP_NOFS);
> + if (!btrfs_bio->csum) {
>   btrfs_free_path(path);
>   return BLK_STS_RESOURCE;
>   }
> - btrfs_bio->csum = btrfs_bio->csum_allocated;
>   btrfs_bio->end_io = btrfs_io_bio_endio_readpage;
>   } else {
>   btrfs_bio->csum = btrfs_bio->csum_inline;
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 1d936ce282c3..9a764f2d462e 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -274,7 +274,6 @@ struct btrfs_io_bio {
>   u64 logical;
>   u8 *csum;
>   u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
> - u8 *csum_allocated;
>   btrfs_io_bio_end_io_t *end_io;
>   struct bvec_iter iter;
>   /*
> 


Re: [PATCH 2/4] btrfs: replace async_cow::root with fs_info

2018-11-22 Thread Nikolay Borisov



On 22.11.18 at 18:16, David Sterba wrote:
> The async_cow::root is used to propagate fs_info to async_cow_submit.
> We can't use inode to reach it because it could become NULL after
> write without compression in async_cow_start.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/inode.c | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index a88122b89f50..26b8bec7c2dc 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -358,7 +358,7 @@ struct async_extent {
>  
>  struct async_cow {
>   struct inode *inode;
> - struct btrfs_root *root;
> + struct btrfs_fs_info *fs_info;
>   struct page *locked_page;
>   u64 start;
>   u64 end;
> @@ -1144,13 +1144,11 @@ static noinline void async_cow_submit(struct 
> btrfs_work *work)
>  {
>   struct btrfs_fs_info *fs_info;
>   struct async_cow *async_cow;
> - struct btrfs_root *root;
>   unsigned long nr_pages;
>  
>   async_cow = container_of(work, struct async_cow, work);
>  
> - root = async_cow->root;
> - fs_info = root->fs_info;
> + fs_info = async_cow->fs_info;
>   nr_pages = (async_cow->end - async_cow->start + PAGE_SIZE) >>
>   PAGE_SHIFT;
>  
> @@ -1179,7 +1177,6 @@ static int cow_file_range_async(struct inode *inode, 
> struct page *locked_page,
>  {
>   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   struct async_cow *async_cow;
> - struct btrfs_root *root = BTRFS_I(inode)->root;
>   unsigned long nr_pages;
>   u64 cur_end;
>  
> @@ -1189,7 +1186,7 @@ static int cow_file_range_async(struct inode *inode, 
> struct page *locked_page,
>   async_cow = kmalloc(sizeof(*async_cow), GFP_NOFS);
>   BUG_ON(!async_cow); /* -ENOMEM */
>   async_cow->inode = igrab(inode);
> - async_cow->root = root;
> + async_cow->fs_info = fs_info;
>   async_cow->locked_page = locked_page;
>   async_cow->start = start;
>   async_cow->write_flags = write_flags;
> 


Re: [PATCH 1/4] btrfs: merge btrfs_submit_bio_done to its caller

2018-11-22 Thread Nikolay Borisov



On 22.11.18 at 18:16, David Sterba wrote:
> There's one caller and its code is simple, we can open code it in
> run_one_async_done. The errors are passed through bio.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Nikolay Borisov 
> ---
>  fs/btrfs/disk-io.c | 18 +-
>  fs/btrfs/inode.c   | 23 ---
>  2 files changed, 17 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index feb67dfd663d..2f5c5442cf00 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -764,11 +764,22 @@ static void run_one_async_start(struct btrfs_work *work)
>   async->status = ret;
>  }
>  
> +/*
> + * In order to insert checksums into the metadata in large chunks, we wait
> + * until bio submission time.   All the pages in the bio are checksummed and
> + * sums are attached onto the ordered extent record.
> + *
> + * At IO completion time the csums attached on the ordered extent record are
> + * inserted into the tree.
> + */
>  static void run_one_async_done(struct btrfs_work *work)
>  {
>   struct async_submit_bio *async;
> + struct inode *inode;
> + blk_status_t ret;
>  
>   async = container_of(work, struct  async_submit_bio, work);
> + inode = async->private_data;
>  
>   /* If an error occurred we just want to clean up the bio and move on */
>   if (async->status) {
> @@ -777,7 +788,12 @@ static void run_one_async_done(struct btrfs_work *work)
>   return;
>   }
>  
> - btrfs_submit_bio_done(async->private_data, async->bio, 
> async->mirror_num);
> + ret = btrfs_map_bio(btrfs_sb(inode->i_sb), async->bio,
> + async->mirror_num, 1);
> + if (ret) {
> + async->bio->bi_status = ret;
> + bio_endio(async->bio);
> + }
>  }
>  
>  static void run_one_async_free(struct btrfs_work *work)
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9becf8543489..a88122b89f50 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1924,29 +1924,6 @@ static blk_status_t btrfs_submit_bio_start(void 
> *private_data, struct bio *bio,
>   return 0;
>  }
>  
> -/*
> - * in order to insert checksums into the metadata in large chunks,
> - * we wait until bio submission time.   All the pages in the bio are
> - * checksummed and sums are attached onto the ordered extent record.
> - *
> - * At IO completion time the cums attached on the ordered extent record
> - * are inserted into the btree
> - */
> -blk_status_t btrfs_submit_bio_done(void *private_data, struct bio *bio,
> -   int mirror_num)
> -{
> - struct inode *inode = private_data;
> - struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> - blk_status_t ret;
> -
> - ret = btrfs_map_bio(fs_info, bio, mirror_num, 1);
> - if (ret) {
> - bio->bi_status = ret;
> - bio_endio(bio);
> - }
> - return ret;
> -}
> -
>  /*
>   * extent_io.c submission hook. This does the right thing for csum 
> calculation
>   * on write, or reading the csums from the tree before a read.
> 


[PATCH 2/4] btrfs: replace async_cow::root with fs_info

2018-11-22 Thread David Sterba
The async_cow::root is used to propagate fs_info to async_cow_submit.
We can't use inode to reach it because it could become NULL after
write without compression in async_cow_start.

Signed-off-by: David Sterba 
---
 fs/btrfs/inode.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a88122b89f50..26b8bec7c2dc 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -358,7 +358,7 @@ struct async_extent {
 
 struct async_cow {
struct inode *inode;
-   struct btrfs_root *root;
+   struct btrfs_fs_info *fs_info;
struct page *locked_page;
u64 start;
u64 end;
@@ -1144,13 +1144,11 @@ static noinline void async_cow_submit(struct btrfs_work 
*work)
 {
struct btrfs_fs_info *fs_info;
struct async_cow *async_cow;
-   struct btrfs_root *root;
unsigned long nr_pages;
 
async_cow = container_of(work, struct async_cow, work);
 
-   root = async_cow->root;
-   fs_info = root->fs_info;
+   fs_info = async_cow->fs_info;
nr_pages = (async_cow->end - async_cow->start + PAGE_SIZE) >>
PAGE_SHIFT;
 
@@ -1179,7 +1177,6 @@ static int cow_file_range_async(struct inode *inode, 
struct page *locked_page,
 {
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct async_cow *async_cow;
-   struct btrfs_root *root = BTRFS_I(inode)->root;
unsigned long nr_pages;
u64 cur_end;
 
@@ -1189,7 +1186,7 @@ static int cow_file_range_async(struct inode *inode, 
struct page *locked_page,
async_cow = kmalloc(sizeof(*async_cow), GFP_NOFS);
BUG_ON(!async_cow); /* -ENOMEM */
async_cow->inode = igrab(inode);
-   async_cow->root = root;
+   async_cow->fs_info = fs_info;
async_cow->locked_page = locked_page;
async_cow->start = start;
async_cow->write_flags = write_flags;
-- 
2.19.1



[PATCH 3/4] btrfs: remove redundant csum buffer in btrfs_io_bio

2018-11-22 Thread David Sterba
The io_bio tracks checksums and has an inline buffer or an allocated
one. And there's a third member that points to the right one, but we
don't need to use an extra pointer for that. Let btrfs_io_bio::csum
point to the right buffer and check that the inline buffer is not
accidentally freed.

This shrinks struct btrfs_io_bio by 8 bytes.

Signed-off-by: David Sterba 
---
 fs/btrfs/file-item.c | 12 +++-
 fs/btrfs/volumes.h   |  1 -
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index ba74827beb32..1f2d0a6ab634 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -144,7 +144,10 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle 
*trans,
 
 static void btrfs_io_bio_endio_readpage(struct btrfs_io_bio *bio, int err)
 {
-   kfree(bio->csum_allocated);
+   if (bio->csum != bio->csum_inline) {
+   kfree(bio->csum);
+   bio->csum = NULL;
+   }
 }
 
 static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio 
*bio,
@@ -175,13 +178,12 @@ static blk_status_t __btrfs_lookup_bio_sums(struct inode 
*inode, struct bio *bio
nblocks = bio->bi_iter.bi_size >> inode->i_sb->s_blocksize_bits;
if (!dst) {
if (nblocks * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
-   btrfs_bio->csum_allocated = kmalloc_array(nblocks,
-   csum_size, GFP_NOFS);
-   if (!btrfs_bio->csum_allocated) {
+   btrfs_bio->csum = kmalloc_array(nblocks, csum_size,
+   GFP_NOFS);
+   if (!btrfs_bio->csum) {
btrfs_free_path(path);
return BLK_STS_RESOURCE;
}
-   btrfs_bio->csum = btrfs_bio->csum_allocated;
btrfs_bio->end_io = btrfs_io_bio_endio_readpage;
} else {
btrfs_bio->csum = btrfs_bio->csum_inline;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 1d936ce282c3..9a764f2d462e 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -274,7 +274,6 @@ struct btrfs_io_bio {
u64 logical;
u8 *csum;
u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
-   u8 *csum_allocated;
btrfs_io_bio_end_io_t *end_io;
struct bvec_iter iter;
/*
-- 
2.19.1



[PATCH 4/4] btrfs: replace btrfs_io_bio::end_io with a simple helper

2018-11-22 Thread David Sterba
The end_io callback implemented as btrfs_io_bio_endio_readpage only
calls kfree. Also the callback is set only in case the csum buffer is
allocated and not pointing to the inline buffer. We can use that
information to drop the indirection and call a helper that will free the
csums only in the right case.

This shrinks struct btrfs_io_bio by 8 bytes.

Signed-off-by: David Sterba 
---
 fs/btrfs/extent_io.c |  3 +--
 fs/btrfs/file-item.c |  9 -
 fs/btrfs/inode.c |  7 ++-
 fs/btrfs/volumes.h   | 10 --
 4 files changed, 11 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4ea808d6cfbc..aef3c9866ff0 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2623,8 +2623,7 @@ static void end_bio_extent_readpage(struct bio *bio)
if (extent_len)
endio_readpage_release_extent(tree, extent_start, extent_len,
  uptodate);
-   if (io_bio->end_io)
-   io_bio->end_io(io_bio, blk_status_to_errno(bio->bi_status));
+   btrfs_io_bio_free_csum(io_bio);
bio_put(bio);
 }
 
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 1f2d0a6ab634..920bf3b4b0ef 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -142,14 +142,6 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle 
*trans,
return ret;
 }
 
-static void btrfs_io_bio_endio_readpage(struct btrfs_io_bio *bio, int err)
-{
-   if (bio->csum != bio->csum_inline) {
-   kfree(bio->csum);
-   bio->csum = NULL;
-   }
-}
-
 static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio 
*bio,
   u64 logical_offset, u32 *dst, int dio)
 {
@@ -184,7 +176,6 @@ static blk_status_t __btrfs_lookup_bio_sums(struct inode 
*inode, struct bio *bio
btrfs_free_path(path);
return BLK_STS_RESOURCE;
}
-   btrfs_bio->end_io = btrfs_io_bio_endio_readpage;
} else {
btrfs_bio->csum = btrfs_bio->csum_inline;
}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 26b8bec7c2dc..6bfd37e58924 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8017,9 +8017,7 @@ static void btrfs_endio_direct_read(struct bio *bio)
 
dio_bio->bi_status = err;
dio_end_io(dio_bio);
-
-   if (io_bio->end_io)
-   io_bio->end_io(io_bio, blk_status_to_errno(err));
+   btrfs_io_bio_free_csum(io_bio);
bio_put(bio);
 }
 
@@ -8372,8 +8370,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, 
struct inode *inode,
if (!ret)
return;
 
-   if (io_bio->end_io)
-   io_bio->end_io(io_bio, ret);
+   btrfs_io_bio_free_csum(io_bio);
 
 free_ordered:
/*
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 9a764f2d462e..a13045fcfc45 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -267,14 +267,12 @@ struct btrfs_fs_devices {
  * we allocate are actually btrfs_io_bios.  We'll cram as much of
  * struct btrfs_bio as we can into this over time.
  */
-typedef void (btrfs_io_bio_end_io_t) (struct btrfs_io_bio *bio, int err);
 struct btrfs_io_bio {
unsigned int mirror_num;
unsigned int stripe_index;
u64 logical;
u8 *csum;
u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
-   btrfs_io_bio_end_io_t *end_io;
struct bvec_iter iter;
/*
 * This member must come last, bio_alloc_bioset will allocate enough
@@ -288,6 +286,14 @@ static inline struct btrfs_io_bio *btrfs_io_bio(struct bio 
*bio)
return container_of(bio, struct btrfs_io_bio, bio);
 }
 
+static inline void btrfs_io_bio_free_csum(struct btrfs_io_bio *io_bio)
+{
+   if (io_bio->csum != io_bio->csum_inline) {
+   kfree(io_bio->csum);
+   io_bio->csum = NULL;
+   }
+}
+
 struct btrfs_bio_stripe {
struct btrfs_device *dev;
u64 physical;
-- 
2.19.1



[PATCH 0/4] Minor helper and struct member cleanups

2018-11-22 Thread David Sterba
Assorted updates to code vaguely related to the bio callbacks and
structures. Remove call indirection, remove redundant struct members,
etc.

David Sterba (4):
  btrfs: merge btrfs_submit_bio_done to its caller
  btrfs: replace async_cow::root with fs_info
  btrfs: remove redundant csum buffer in btrfs_io_bio
  btrfs: replace btrfs_io_bio::end_io with a simple helper

 fs/btrfs/disk-io.c   | 18 +-
 fs/btrfs/extent_io.c |  3 +--
 fs/btrfs/file-item.c | 13 +++--
 fs/btrfs/inode.c | 39 +--
 fs/btrfs/volumes.h   | 11 ---
 5 files changed, 34 insertions(+), 50 deletions(-)

-- 
2.19.1



[PATCH 1/4] btrfs: merge btrfs_submit_bio_done to its caller

2018-11-22 Thread David Sterba
There's only one caller and its code is simple, so we can open code it in
run_one_async_done. Errors are passed through the bio.

Signed-off-by: David Sterba 
---
 fs/btrfs/disk-io.c | 18 +-
 fs/btrfs/inode.c   | 23 ---
 2 files changed, 17 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index feb67dfd663d..2f5c5442cf00 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -764,11 +764,22 @@ static void run_one_async_start(struct btrfs_work *work)
async->status = ret;
 }
 
+/*
+ * In order to insert checksums into the metadata in large chunks, we wait
+ * until bio submission time.   All the pages in the bio are checksummed and
+ * sums are attached onto the ordered extent record.
+ *
+ * At IO completion time the csums attached on the ordered extent record are
+ * inserted into the tree.
+ */
 static void run_one_async_done(struct btrfs_work *work)
 {
struct async_submit_bio *async;
+   struct inode *inode;
+   blk_status_t ret;
 
async = container_of(work, struct  async_submit_bio, work);
+   inode = async->private_data;
 
/* If an error occurred we just want to clean up the bio and move on */
if (async->status) {
@@ -777,7 +788,12 @@ static void run_one_async_done(struct btrfs_work *work)
return;
}
 
-   btrfs_submit_bio_done(async->private_data, async->bio, 
async->mirror_num);
+   ret = btrfs_map_bio(btrfs_sb(inode->i_sb), async->bio,
+   async->mirror_num, 1);
+   if (ret) {
+   async->bio->bi_status = ret;
+   bio_endio(async->bio);
+   }
 }
 
 static void run_one_async_free(struct btrfs_work *work)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9becf8543489..a88122b89f50 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1924,29 +1924,6 @@ static blk_status_t btrfs_submit_bio_start(void 
*private_data, struct bio *bio,
return 0;
 }
 
-/*
- * in order to insert checksums into the metadata in large chunks,
- * we wait until bio submission time.   All the pages in the bio are
- * checksummed and sums are attached onto the ordered extent record.
- *
- * At IO completion time the cums attached on the ordered extent record
- * are inserted into the btree
- */
-blk_status_t btrfs_submit_bio_done(void *private_data, struct bio *bio,
- int mirror_num)
-{
-   struct inode *inode = private_data;
-   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-   blk_status_t ret;
-
-   ret = btrfs_map_bio(fs_info, bio, mirror_num, 1);
-   if (ret) {
-   bio->bi_status = ret;
-   bio_endio(bio);
-   }
-   return ret;
-}
-
 /*
  * extent_io.c submission hook. This does the right thing for csum calculation
  * on write, or reading the csums from the tree before a read.
-- 
2.19.1



Re: [PATCH v2] Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation

2018-11-22 Thread David Sterba
On Thu, Nov 22, 2018 at 09:46:44PM +0800, Qu Wenruo wrote:
> >>> it and after
> >>> we started (or joined) a transaction, a lot of modifications may
> >>> have happened.
> >>> Nevertheless I don't think it's unexpected for anyone to have the
> >>> accounting happening
> >>> only after the quota enable ioctl returns success.
> >>
> >> The problem is not accounting, the qgroup number won't cause problem.
> >>
> >> It's the reserved space. Like some dirty pages are dirtied before quota
> >> enabled, but at run_delalloc() time quota is enabled.
> >>
> >> For such case we have io_tree based method to avoid underflow so it
> >> should be OK.
> >>
> >> So v2 patch looks OK.
> > 
> > Does that mean reviewed-by? In case there's an evolved discussion under a
> > patch, a clear yes/no is appreciated and an explicit Reviewed-by even
> > more. I'm about to add this patch to rc4 pull, there's still some time to
> > add the tag. Thanks.
> 
> I'd like to add a reviewed-by tag, but I'm still not 100% sure if this will
> cause extra qgroup reserved space related problems.
> 
> At least from my side, I can't directly see a case where it will cause
> a problem.
> 
> Does such case mean a reviewed-by tag? Or something LGTM-but-uncertain?

It means that we can keep the patch in the testing branch for a bit longer,
there's still time to put it into a later rc once there's enough
certainty.


Re: [PATCH] btrfs: improve error handling of btrfs_add_link()

2018-11-22 Thread David Sterba
On Thu, Nov 22, 2018 at 02:44:35PM +0100, Johannes Thumshirn wrote:
> On 22/11/2018 14:35, David Sterba wrote:
> > On Thu, Nov 22, 2018 at 02:15:28PM +0100, Johannes Thumshirn wrote:
> >> err holds the return value of either btrfs_del_root_ref() or
> >> btrfs_del_inode_ref() but it hasn't been checked since its
> >> introduction with commit fe66a05a0679 (Btrfs: improve error handling
> >> for btrfs_insert_dir_item callers) in 2012.
> >>
> >> The first attempt at removing the variable was rejected, so instead of
> >> removing 'err', start handling the errors of
> >> btrfs_del_root_ref()/btrfs_del_inode_ref() by aborting the transaction in
> >> btrfs_add_link() directly instead of the call-sites.
> > 
> > ... which is an anti-pattern.
> > 
> > https://btrfs.wiki.kernel.org/index.php/Development_notes#Error_handling_and_transaction_abort
> > 
> > In case there are several conditions that can fail we want to see the
> > first one for later analysis and debugging.
> 
> OK, but then handling the error of either btrfs_del_root_ref() or
> btrfs_del_inode_ref() will essentially only be printing an error message
> as anything else would hide the error condition which has led us into
> this path (i.e. btrfs_insert_dir_item() returning EEXIST or EOVERFLOW),
> wouldn't it?

So this seems to point out there are more things to resolve here:

* how to recover from errors of btrfs_del_root_ref or btrfs_del_inode_ref
* what to report back to the user

When insert_dir_item fails, there are the inode references to be cleaned
up, and error reported as EEXIST or EOVERFLOW.

When btrfs_del_root_ref or btrfs_del_inode_ref fail, the cleanup is not
possible and there's no other option than the transaction abort. As the
original reason was the overflow, this should be reported back to the
user.

The reason for the del_*_ref failures can be misleading, in case it's e.g.
ENOMEM. This would mean that the add-link operation is possible but
there was not enough memory and restarting later would let it possibly
succeed.

So the missing part of error handling is to add two "if (err) abort" and
then still return 'ret'. As it is not at all obvious why, a comment would
be good there.


Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3

2018-11-22 Thread Qu Wenruo



On 2018/11/22 10:03 PM, Roman Mamedov wrote:
> On Thu, 22 Nov 2018 22:07:25 +0900
> Tomasz Chmielewski  wrote:
> 
>> Spot on!
>>
>> Removed "discard" from fstab and added "ssd", rebooted - no more 
>> btrfs-cleaner running.
> 
> Recently there has been a bugfix for TRIM in Btrfs:
>   
>   btrfs: Ensure btrfs_trim_fs can trim the whole fs
>   https://patchwork.kernel.org/patch/10579539/
> 
> Perhaps your upgraded kernel is the first one to contain it, and for the first
> time you're seeing TRIM to actually *work*, with the actual performance impact
> of it on a large fragmented FS, instead of a few contiguous unallocated areas.
> 
That only affects btrfs_trim_fs(), and you can see it's only called from
the ioctl interface, so it's definitely not the case.

Thanks,
Qu


Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3

2018-11-22 Thread Roman Mamedov
On Thu, 22 Nov 2018 22:07:25 +0900
Tomasz Chmielewski  wrote:

> Spot on!
> 
> Removed "discard" from fstab and added "ssd", rebooted - no more 
> btrfs-cleaner running.

Recently there has been a bugfix for TRIM in Btrfs:
  
  btrfs: Ensure btrfs_trim_fs can trim the whole fs
  https://patchwork.kernel.org/patch/10579539/

Perhaps your upgraded kernel is the first one to contain it, and for the first
time you're seeing TRIM to actually *work*, with the actual performance impact
of it on a large fragmented FS, instead of a few contiguous unallocated areas.

-- 
With respect,
Roman


Re: [PATCH v2] Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation

2018-11-22 Thread Qu Wenruo


On 2018/11/22 9:12 PM, David Sterba wrote:
> On Tue, Nov 20, 2018 at 08:32:42AM +0800, Qu Wenruo wrote:
> @@ -1013,16 +1013,22 @@ int btrfs_quota_enable(struct btrfs_fs_info 
> *fs_info)
>   btrfs_abort_transaction(trans, ret);
>   goto out_free_path;
>   }
> - spin_lock(&fs_info->qgroup_lock);
> - fs_info->quota_root = quota_root;
> - set_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
> - spin_unlock(&fs_info->qgroup_lock);
>
>   ret = btrfs_commit_transaction(trans);
>   trans = NULL;
>   if (ret)
>   goto out_free_path;

 The main concern here is, if we don't set qgroup enabled bit before we
 commit transaction, there will be a window where new tree modification
 could happen before QGROUP_ENABLED bit set.
>>>
>>> That doesn't seem to make much sense to me, if I understood correctly.
> >>> Essentially you're saying stuff done to any tree in the
>>> transaction we use to
>>> enable quotas must be accounted for. In that case the quota enabled bit 
>>> should
>>> be done as soon as the transaction is started, because before we set
>>> it and after
> >>> we started (or joined) a transaction, a lot of modifications may
>>> have happened.
>>> Nevertheless I don't think it's unexpected for anyone to have the
>>> accounting happening
>>> only after the quota enable ioctl returns success.
>>
>> The problem is not accounting, the qgroup number won't cause problem.
>>
>> It's the reserved space. Like some dirty pages are dirtied before quota
>> enabled, but at run_delalloc() time quota is enabled.
>>
>> For such case we have io_tree based method to avoid underflow so it
>> should be OK.
>>
>> So v2 patch looks OK.
> 
> Does that mean reviewed-by? In case there's an evolved discussion under a
> patch, a clear yes/no is appreciated and an explicit Reviewed-by even
> more. I'm about to add this patch to rc4 pull, there's still some time to
> add the tag. Thanks.
> 

I'd like to add a reviewed-by tag, but I'm still not 100% sure if this will
cause extra qgroup reserved space related problems.

At least from my side, I can't directly see a case where it will cause
a problem.

Does such case mean a reviewed-by tag? Or something LGTM-but-uncertain?

Thanks,
Qu





Re: [PATCH] btrfs: improve error handling of btrfs_add_link()

2018-11-22 Thread Johannes Thumshirn
On 22/11/2018 14:35, David Sterba wrote:
> On Thu, Nov 22, 2018 at 02:15:28PM +0100, Johannes Thumshirn wrote:
>> err holds the return value of either btrfs_del_root_ref() or
>> btrfs_del_inode_ref() but it hasn't been checked since its
>> introduction with commit fe66a05a0679 (Btrfs: improve error handling
>> for btrfs_insert_dir_item callers) in 2012.
>>
>> The first attempt at removing the variable was rejected, so instead of
>> removing 'err', start handling the errors of
>> btrfs_del_root_ref()/btrfs_del_inode_ref() by aborting the transaction in
>> btrfs_add_link() directly instead of the call-sites.
> 
> ... which is an anti-pattern.
> 
> https://btrfs.wiki.kernel.org/index.php/Development_notes#Error_handling_and_transaction_abort
> 
> In case there are several conditions that can fail we want to see the
> first one for later analysis and debugging.

OK, but then handling the error of either btrfs_del_root_ref() or
btrfs_del_inode_ref() will essentially only be printing an error message
as anything else would hide the error condition which has led us into
this path (i.e. btrfs_insert_dir_item() returning EEXIST or EOVERFLOW),
wouldn't it?

Byte,
Johannes
-- 
Johannes ThumshirnSUSE Labs
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH] btrfs: improve error handling of btrfs_add_link()

2018-11-22 Thread David Sterba
On Thu, Nov 22, 2018 at 02:15:28PM +0100, Johannes Thumshirn wrote:
> err holds the return value of either btrfs_del_root_ref() or
> btrfs_del_inode_ref() but it hasn't been checked since its
> introduction with commit fe66a05a0679 (Btrfs: improve error handling
> for btrfs_insert_dir_item callers) in 2012.
> 
> The first attempt at removing the variable was rejected, so instead of
> removing 'err', start handling the errors of
> btrfs_del_root_ref()/btrfs_del_inode_ref() by aborting the transaction in
> btrfs_add_link() directly instead of the call-sites.

... which is an anti-pattern.

https://btrfs.wiki.kernel.org/index.php/Development_notes#Error_handling_and_transaction_abort

In case there are several conditions that can fail we want to see the
first one for later analysis and debugging.


[PATCH] btrfs: improve error handling of btrfs_add_link()

2018-11-22 Thread Johannes Thumshirn
err holds the return value of either btrfs_del_root_ref() or
btrfs_del_inode_ref() but it hasn't been checked since its
introduction with commit fe66a05a0679 (Btrfs: improve error handling
for btrfs_insert_dir_item callers) in 2012.

The first attempt at removing the variable was rejected, so instead of
removing 'err', start handling the errors of
btrfs_del_root_ref()/btrfs_del_inode_ref() by aborting the transaction in
btrfs_add_link() directly instead of the call-sites.

Link: https://lore.kernel.org/linux-btrfs/20181119141323.gc24...@twin.jikos.cz/
Signed-off-by: Johannes Thumshirn 
---
 fs/btrfs/inode.c | 19 ---
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9becf8543489..314142ea9d80 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6351,6 +6351,7 @@ int btrfs_add_link(struct btrfs_trans_handle *trans,
struct btrfs_root *root = parent_inode->root;
u64 ino = btrfs_ino(inode);
u64 parent_ino = btrfs_ino(parent_inode);
+   int err = 0;
 
if (unlikely(ino == BTRFS_FIRST_FREE_OBJECTID)) {
	memcpy(&key, &inode->root->root_key, sizeof(key));
@@ -6395,18 +6396,20 @@ int btrfs_add_link(struct btrfs_trans_handle *trans,
 fail_dir_item:
if (unlikely(ino == BTRFS_FIRST_FREE_OBJECTID)) {
u64 local_index;
-   int err;
+
err = btrfs_del_root_ref(trans, key.objectid,
 root->root_key.objectid, parent_ino,
	&local_index, name, name_len);
 
} else if (add_backref) {
u64 local_index;
-   int err;
 
err = btrfs_del_inode_ref(trans, root, name, name_len,
	  ino, parent_ino, &local_index);
}
+
+   btrfs_abort_transaction(trans, err ? err : ret);
+
return ret;
 }
 
@@ -9502,18 +9505,14 @@ static int btrfs_rename_exchange(struct inode *old_dir,
ret = btrfs_add_link(trans, BTRFS_I(new_dir), BTRFS_I(old_inode),
 new_dentry->d_name.name,
 new_dentry->d_name.len, 0, old_idx);
-   if (ret) {
-   btrfs_abort_transaction(trans, ret);
+   if (ret)
goto out_fail;
-   }
 
ret = btrfs_add_link(trans, BTRFS_I(old_dir), BTRFS_I(new_inode),
 old_dentry->d_name.name,
 old_dentry->d_name.len, 0, new_idx);
-   if (ret) {
-   btrfs_abort_transaction(trans, ret);
+   if (ret)
goto out_fail;
-   }
 
if (old_inode->i_nlink == 1)
BTRFS_I(old_inode)->dir_index = old_idx;
@@ -9822,10 +9821,8 @@ static int btrfs_rename(struct inode *old_dir, struct 
dentry *old_dentry,
ret = btrfs_add_link(trans, BTRFS_I(new_dir), BTRFS_I(old_inode),
 new_dentry->d_name.name,
 new_dentry->d_name.len, 0, index);
-   if (ret) {
-   btrfs_abort_transaction(trans, ret);
+   if (ret)
goto out_fail;
-   }
 
if (old_inode->i_nlink == 1)
BTRFS_I(old_inode)->dir_index = index;
-- 
2.16.4



Re: [PATCH v2] Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation

2018-11-22 Thread David Sterba
On Tue, Nov 20, 2018 at 08:32:42AM +0800, Qu Wenruo wrote:
> >>> @@ -1013,16 +1013,22 @@ int btrfs_quota_enable(struct btrfs_fs_info 
> >>> *fs_info)
> >>>   btrfs_abort_transaction(trans, ret);
> >>>   goto out_free_path;
> >>>   }
> >>> - spin_lock(&fs_info->qgroup_lock);
> >>> - fs_info->quota_root = quota_root;
> >>> - set_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
> >>> - spin_unlock(&fs_info->qgroup_lock);
> >>>
> >>>   ret = btrfs_commit_transaction(trans);
> >>>   trans = NULL;
> >>>   if (ret)
> >>>   goto out_free_path;
> >>
> >> The main concern here is, if we don't set qgroup enabled bit before we
> >> commit transaction, there will be a window where new tree modification
> >> could happen before QGROUP_ENABLED bit set.
> > 
> > That doesn't seem to make much sense to me, if I understood correctly.
> > Essentially you're saying stuff done to any tree in the
> > transaction we use to
> > enable quotas must be accounted for. In that case the quota enabled bit 
> > should
> > be done as soon as the transaction is started, because before we set
> > it and after
> > we started (or joined) a transaction, a lot of modifications may
> > have happened.
> > Nevertheless I don't think it's unexpected for anyone to have the
> > accounting happening
> > only after the quota enable ioctl returns success.
> 
> The problem is not accounting, the qgroup number won't cause problem.
> 
> It's the reserved space. Like some dirty pages are dirtied before quota
> enabled, but at run_delalloc() time quota is enabled.
> 
> For such case we have io_tree based method to avoid underflow so it
> should be OK.
> 
> So v2 patch looks OK.

Does that mean reviewed-by? In case there's an evolved discussion under a
patch, a clear yes/no is appreciated and an explicit Reviewed-by even
more. I'm about to add this patch to rc4 pull, there's still some time to
add the tag. Thanks.


Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3

2018-11-22 Thread Tomasz Chmielewski

On 2018-11-22 21:46, Nikolay Borisov wrote:


# echo w > /proc/sysrq-trigger

# dmesg -c
[  931.585611] sysrq: SysRq : Show Blocked State
[  931.585715]   task    PC stack   pid father
[  931.590168] btrfs-cleaner   D    0  1340  2 0x8000
[  931.590175] Call Trace:
[  931.590190]  __schedule+0x29e/0x840
[  931.590195]  schedule+0x2c/0x80
[  931.590199]  schedule_timeout+0x258/0x360
[  931.590204]  io_schedule_timeout+0x1e/0x50
[  931.590208]  wait_for_completion_io+0xb7/0x140
[  931.590214]  ? wake_up_q+0x80/0x80
[  931.590219]  submit_bio_wait+0x61/0x90
[  931.590225]  blkdev_issue_discard+0x7a/0xd0
[  931.590266]  btrfs_issue_discard+0x123/0x160 [btrfs]
[  931.590299]  btrfs_discard_extent+0xd8/0x160 [btrfs]
[  931.590335]  btrfs_finish_extent_commit+0xe2/0x240 [btrfs]
[  931.590382]  btrfs_commit_transaction+0x573/0x840 [btrfs]
[  931.590415]  ? btrfs_block_rsv_check+0x25/0x70 [btrfs]
[  931.590456]  __btrfs_end_transaction+0x2be/0x2d0 [btrfs]
[  931.590493]  btrfs_end_transaction_throttle+0x13/0x20 [btrfs]
[  931.590530]  btrfs_drop_snapshot+0x489/0x800 [btrfs]
[  931.590567]  btrfs_clean_one_deleted_snapshot+0xbb/0xf0 [btrfs]
[  931.590607]  cleaner_kthread+0x136/0x160 [btrfs]
[  931.590612]  kthread+0x120/0x140
[  931.590646]  ? btree_submit_bio_start+0x20/0x20 [btrfs]
[  931.590658]  ? kthread_bind+0x40/0x40
[  931.590661]  ret_from_fork+0x22/0x40



It seems your filesystem is mounted with the DISCARD option, meaning
every delete will result in a discard; this is highly suboptimal for
SSDs.

Try remounting the fs without the discard option and see if it helps.
Generally for discard you want to submit it in big batches (what fstrim
does) so that the FTL on the SSD can apply any optimisations it might
have up its sleeve.


Spot on!

Removed "discard" from fstab and added "ssd", rebooted - no more 
btrfs-cleaner running.


Do you know if the issue you described ("discard this is highly 
suboptimal for ssd") affects other filesystems as well to a similar 
extent? I.e. if using ext4 on ssd?




Would you finally care to share the smart data + the model and make of
the ssd?


2x these:

Model Family: Samsung based SSDs
Device Model: SAMSUNG MZ7LM1T9HCJM-5
Firmware Version: GXT1103Q
User Capacity:1,920,383,410,176 bytes [1.92 TB]
Sector Size:  512 bytes logical/physical
Rotation Rate:Solid State Device

1x this:

Device Model: Micron_5200_MTFDDAK1T9TDC
Firmware Version: D1MU004
User Capacity:1,920,383,410,176 bytes [1.92 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate:Solid State Device
Form Factor:  2.5 inches


But - seems the issue was unneeded discard option, so not pasting 
unnecessary SMART data, thanks for finding this out.



Tomasz Chmielewski


Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3

2018-11-22 Thread Nikolay Borisov



On 22.11.18 at 14:31, Tomasz Chmielewski wrote:
> Yet another system upgraded to 4.19 and showing strange issues.
> 
> btrfs-cleaner is showing as ~90-100% busy in iotop:
> 
>    TID  PRIO  USER DISK READ  DISK WRITE  SWAPIN IO>    COMMAND
>   1340 be/4 root    0.00 B/s    0.00 B/s  0.00 % 92.88 %
> [btrfs-cleaner]
> 
> 
> Note disk read, disk write are 0.00 B/s.
> 
> 
> iostat -mx 1 shows all disks are ~100% busy, yet there are 0 reads and 0
> writes to them:
> 
> Device    r/s w/s rMB/s wMB/s   rrqm/s   wrqm/s 
> %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
> sda  0.00    0.00  0.00  0.00 0.00 0.00  
> 0.00   0.00    0.00    0.00   0.91 0.00 0.00   0.00  90.80
> sdb  0.00    0.00  0.00  0.00 0.00 0.00  
> 0.00   0.00    0.00    0.00   1.00 0.00 0.00   0.00 100.00
> sdc  0.00    0.00  0.00  0.00 0.00 0.00  
> 0.00   0.00    0.00    0.00   1.00 0.00 0.00   0.00 100.00
> 
> 
> 
> btrfs-cleaner persists 100% busy after reboots. The system is generally
> not very responsive.
> 
> 
> 
> # echo w > /proc/sysrq-trigger
> 
> # dmesg -c
> [  931.585611] sysrq: SysRq : Show Blocked State
> [  931.585715]   task    PC stack   pid father
> [  931.590168] btrfs-cleaner   D    0  1340  2 0x8000
> [  931.590175] Call Trace:
> [  931.590190]  __schedule+0x29e/0x840
> [  931.590195]  schedule+0x2c/0x80
> [  931.590199]  schedule_timeout+0x258/0x360
> [  931.590204]  io_schedule_timeout+0x1e/0x50
> [  931.590208]  wait_for_completion_io+0xb7/0x140
> [  931.590214]  ? wake_up_q+0x80/0x80
> [  931.590219]  submit_bio_wait+0x61/0x90
> [  931.590225]  blkdev_issue_discard+0x7a/0xd0
> [  931.590266]  btrfs_issue_discard+0x123/0x160 [btrfs]
> [  931.590299]  btrfs_discard_extent+0xd8/0x160 [btrfs]
> [  931.590335]  btrfs_finish_extent_commit+0xe2/0x240 [btrfs]
> [  931.590382]  btrfs_commit_transaction+0x573/0x840 [btrfs]
> [  931.590415]  ? btrfs_block_rsv_check+0x25/0x70 [btrfs]
> [  931.590456]  __btrfs_end_transaction+0x2be/0x2d0 [btrfs]
> [  931.590493]  btrfs_end_transaction_throttle+0x13/0x20 [btrfs]
> [  931.590530]  btrfs_drop_snapshot+0x489/0x800 [btrfs]
> [  931.590567]  btrfs_clean_one_deleted_snapshot+0xbb/0xf0 [btrfs]
> [  931.590607]  cleaner_kthread+0x136/0x160 [btrfs]
> [  931.590612]  kthread+0x120/0x140
> [  931.590646]  ? btree_submit_bio_start+0x20/0x20 [btrfs]
> [  931.590658]  ? kthread_bind+0x40/0x40
> [  931.590661]  ret_from_fork+0x22/0x40
> 



It seems your filesystem is mounted with the DISCARD option, meaning
every delete will result in a discard; this is highly suboptimal for SSDs.
Try remounting the fs without the discard option and see if it helps.
Generally for discard you want to submit it in big batches (what fstrim
does) so that the FTL on the SSD can apply any optimisations it might
have up its sleeve.


Would you finally care to share the smart data + the model and make of
the ssd?

> 
> 
> 
> After rebooting to 4.19.3, I've started seeing read errors. There were
> no errors with a previous kernel (4.17.14); disks are healthy according
> to SMART; no errors reported when we read the whole surface with dd.
> 
> # grep -i btrfs.*corrected /var/log/syslog|wc -l
> 156
> 
> Things like:
> 
> Nov 22 12:17:43 lxd05 kernel: [  711.538836] BTRFS info (device sda2):
> read error corrected: ino 0 off 3739083145216 (dev /dev/sdc2 sector
> 101088384)
> Nov 22 12:17:43 lxd05 kernel: [  711.538905] BTRFS info (device sda2):
> read error corrected: ino 0 off 3739083149312 (dev /dev/sdc2 sector
> 101088392)
> Nov 22 12:17:43 lxd05 kernel: [  711.538958] BTRFS info (device sda2):
> read error corrected: ino 0 off 3739083153408 (dev /dev/sdc2 sector
> 101088400)
> Nov 22 12:17:43 lxd05 kernel: [  711.539006] BTRFS info (device sda2):
> read error corrected: ino 0 off 3739083157504 (dev /dev/sdc2 sector
> 101088408)
> 
> 
> Yet - 0 errors, according to stats, not sure if it's expected or not:

Since btrfs was able to fix the read failure on the fly it doesn't
record the error.  IMO you have problems with your storage below the
filesystem level.

> 
> # btrfs device stats /data/lxd
> [/dev/sda2].write_io_errs    0
> [/dev/sda2].read_io_errs 0
> [/dev/sda2].flush_io_errs    0
> [/dev/sda2].corruption_errs  0
> [/dev/sda2].generation_errs  0
> [/dev/sdb2].write_io_errs    0
> [/dev/sdb2].read_io_errs 0
> [/dev/sdb2].flush_io_errs    0
> [/dev/sdb2].corruption_errs  0
> [/dev/sdb2].generation_errs  0
> [/dev/sdc2].write_io_errs    0
> [/dev/sdc2].read_io_errs 0
> [/dev/sdc2].flush_io_errs    0
> [/dev/sdc2].corruption_errs  0
> [/dev/sdc2].generation_errs  0
> 
> 




btrfs-cleaner 100% busy on an idle filesystem with 4.19.3

2018-11-22 Thread Tomasz Chmielewski

Yet another system upgraded to 4.19 and showing strange issues.

btrfs-cleaner is showing as ~90-100% busy in iotop:

   TID  PRIO  USER DISK READ  DISK WRITE  SWAPIN IO>COMMAND
  1340 be/4 root0.00 B/s0.00 B/s  0.00 % 92.88 % 
[btrfs-cleaner]



Note disk read, disk write are 0.00 B/s.


iostat -mx 1 shows all disks are ~100% busy, yet there are 0 reads and 0 
writes to them:


Devicer/s w/s rMB/s wMB/s   rrqm/s   wrqm/s  
%rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda  0.000.00  0.00  0.00 0.00 0.00   
0.00   0.000.000.00   0.91 0.00 0.00   0.00  90.80
sdb  0.000.00  0.00  0.00 0.00 0.00   
0.00   0.000.000.00   1.00 0.00 0.00   0.00 100.00
sdc  0.000.00  0.00  0.00 0.00 0.00   
0.00   0.000.000.00   1.00 0.00 0.00   0.00 100.00




btrfs-cleaner persists 100% busy after reboots. The system is generally 
not very responsive.




# echo w > /proc/sysrq-trigger

# dmesg -c
[  931.585611] sysrq: SysRq : Show Blocked State
[  931.585715]   taskPC stack   pid father
[  931.590168] btrfs-cleaner   D0  1340  2 0x8000
[  931.590175] Call Trace:
[  931.590190]  __schedule+0x29e/0x840
[  931.590195]  schedule+0x2c/0x80
[  931.590199]  schedule_timeout+0x258/0x360
[  931.590204]  io_schedule_timeout+0x1e/0x50
[  931.590208]  wait_for_completion_io+0xb7/0x140
[  931.590214]  ? wake_up_q+0x80/0x80
[  931.590219]  submit_bio_wait+0x61/0x90
[  931.590225]  blkdev_issue_discard+0x7a/0xd0
[  931.590266]  btrfs_issue_discard+0x123/0x160 [btrfs]
[  931.590299]  btrfs_discard_extent+0xd8/0x160 [btrfs]
[  931.590335]  btrfs_finish_extent_commit+0xe2/0x240 [btrfs]
[  931.590382]  btrfs_commit_transaction+0x573/0x840 [btrfs]
[  931.590415]  ? btrfs_block_rsv_check+0x25/0x70 [btrfs]
[  931.590456]  __btrfs_end_transaction+0x2be/0x2d0 [btrfs]
[  931.590493]  btrfs_end_transaction_throttle+0x13/0x20 [btrfs]
[  931.590530]  btrfs_drop_snapshot+0x489/0x800 [btrfs]
[  931.590567]  btrfs_clean_one_deleted_snapshot+0xbb/0xf0 [btrfs]
[  931.590607]  cleaner_kthread+0x136/0x160 [btrfs]
[  931.590612]  kthread+0x120/0x140
[  931.590646]  ? btree_submit_bio_start+0x20/0x20 [btrfs]
[  931.590658]  ? kthread_bind+0x40/0x40
[  931.590661]  ret_from_fork+0x22/0x40




After rebooting to 4.19.3, I've started seeing read errors. There were 
no errors with a previous kernel (4.17.14); disks are healthy according 
to SMART; no errors reported when we read the whole surface with dd.


# grep -i btrfs.*corrected /var/log/syslog|wc -l
156

Things like:

Nov 22 12:17:43 lxd05 kernel: [  711.538836] BTRFS info (device sda2): 
read error corrected: ino 0 off 3739083145216 (dev /dev/sdc2 sector 
101088384)
Nov 22 12:17:43 lxd05 kernel: [  711.538905] BTRFS info (device sda2): 
read error corrected: ino 0 off 3739083149312 (dev /dev/sdc2 sector 
101088392)
Nov 22 12:17:43 lxd05 kernel: [  711.538958] BTRFS info (device sda2): 
read error corrected: ino 0 off 3739083153408 (dev /dev/sdc2 sector 
101088400)
Nov 22 12:17:43 lxd05 kernel: [  711.539006] BTRFS info (device sda2): 
read error corrected: ino 0 off 3739083157504 (dev /dev/sdc2 sector 
101088408)



Yet - 0 errors, according to stats, not sure if it's expected or not:

# btrfs device stats /data/lxd
[/dev/sda2].write_io_errs0
[/dev/sda2].read_io_errs 0
[/dev/sda2].flush_io_errs0
[/dev/sda2].corruption_errs  0
[/dev/sda2].generation_errs  0
[/dev/sdb2].write_io_errs0
[/dev/sdb2].read_io_errs 0
[/dev/sdb2].flush_io_errs0
[/dev/sdb2].corruption_errs  0
[/dev/sdb2].generation_errs  0
[/dev/sdc2].write_io_errs0
[/dev/sdc2].read_io_errs 0
[/dev/sdc2].flush_io_errs0
[/dev/sdc2].corruption_errs  0
[/dev/sdc2].generation_errs  0


# btrfs fi usage /data/lxd
Overall:
Device size:   5.18TiB
Device allocated:  2.36TiB
Device unallocated:2.83TiB
Device missing:  0.00B
Used:  1.93TiB
Free (estimated):  1.62TiB  (min: 1.62TiB)
Data ratio:   2.00
Metadata ratio:   2.00
Global reserve:  512.00MiB  (used: 0.00B)

Data,RAID1: Size:1.17TiB, Used:983.87GiB
   /dev/sda2 796.00GiB
   /dev/sdb2 799.00GiB
   /dev/sdc2 797.00GiB

Metadata,RAID1: Size:10.00GiB, Used:4.09GiB
   /dev/sda2   8.00GiB
   /dev/sdb2   5.00GiB
   /dev/sdc2   7.00GiB

System,RAID1: Size:64.00MiB, Used:224.00KiB
   /dev/sda2  32.00MiB
   /dev/sdb2  64.00MiB
   /dev/sdc2  32.00MiB

Unallocated:
   /dev/sda2 964.46GiB
   /dev/sdb2 964.43GiB
   /dev/sdc2 964.46GiB


# btrfs fi df /data/lxd
Data, RAID1: total=1.17TiB, used=983.04GiB
System, RAID1: total=64.00MiB, used=224.00KiB
Metadata, RAID1: total=10.00GiB, used=4.08GiB
GlobalReserve, 

Re: [RFC][PATCH v4 09/09] btrfs: use common file type conversion

2018-11-22 Thread Jan Kara
On Wed 21-11-18 19:07:06, Phillip Potter wrote:
> Deduplicate the btrfs file type conversion implementation - file systems
> that use the same file types as defined by POSIX do not need to define
> their own versions and can use the common helper functions declared in
> fs_types.h and implemented in fs_types.c
> 
> Acked-by: David Sterba 
> Signed-off-by: Amir Goldstein 
> Signed-off-by: Phillip Potter 

The patch looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/btrfs/btrfs_inode.h  |  2 --
>  fs/btrfs/delayed-inode.c|  2 +-
>  fs/btrfs/inode.c| 32 +++-
>  include/uapi/linux/btrfs_tree.h |  2 ++
>  4 files changed, 18 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 97d91e55b70a..bb01c804485f 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -196,8 +196,6 @@ struct btrfs_inode {
>   struct inode vfs_inode;
>  };
>  
> -extern unsigned char btrfs_filetype_table[];
> -
>  static inline struct btrfs_inode *BTRFS_I(const struct inode *inode)
>  {
>   return container_of(inode, struct btrfs_inode, vfs_inode);
> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> index c669f250d4a0..e61947f5eb76 100644
> --- a/fs/btrfs/delayed-inode.c
> +++ b/fs/btrfs/delayed-inode.c
> @@ -1692,7 +1692,7 @@ int btrfs_readdir_delayed_dir_index(struct dir_context 
> *ctx,
>   name = (char *)(di + 1);
>   name_len = btrfs_stack_dir_name_len(di);
>  
> - d_type = btrfs_filetype_table[di->type];
> + d_type = fs_ftype_to_dtype(di->type);
>   btrfs_disk_key_to_cpu(&location, &di->location);
>  
>   over = !dir_emit(ctx, name, name_len,
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9ea4c6f0352f..8b7b1b29e2ad 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -72,17 +72,6 @@ struct kmem_cache *btrfs_trans_handle_cachep;
>  struct kmem_cache *btrfs_path_cachep;
>  struct kmem_cache *btrfs_free_space_cachep;
>  
> -#define S_SHIFT 12
> -static const unsigned char btrfs_type_by_mode[S_IFMT >> S_SHIFT] = {
> - [S_IFREG >> S_SHIFT]= BTRFS_FT_REG_FILE,
> - [S_IFDIR >> S_SHIFT]= BTRFS_FT_DIR,
> - [S_IFCHR >> S_SHIFT]= BTRFS_FT_CHRDEV,
> - [S_IFBLK >> S_SHIFT]= BTRFS_FT_BLKDEV,
> - [S_IFIFO >> S_SHIFT]= BTRFS_FT_FIFO,
> - [S_IFSOCK >> S_SHIFT]   = BTRFS_FT_SOCK,
> - [S_IFLNK >> S_SHIFT]= BTRFS_FT_SYMLINK,
> -};
> -
>  static int btrfs_setsize(struct inode *inode, struct iattr *attr);
>  static int btrfs_truncate(struct inode *inode, bool skip_writeback);
>  static int btrfs_finish_ordered_io(struct btrfs_ordered_extent 
> *ordered_extent);
> @@ -5793,10 +5782,6 @@ static struct dentry *btrfs_lookup(struct inode *dir, 
> struct dentry *dentry,
>   return d_splice_alias(inode, dentry);
>  }
>  
> -unsigned char btrfs_filetype_table[] = {
> - DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
> -};
> -
>  /*
>   * All this infrastructure exists because dir_emit can fault, and we are 
> holding
>   * the tree lock when doing readdir.  For now just allocate a buffer and copy
> @@ -5935,7 +5920,7 @@ static int btrfs_real_readdir(struct file *file, struct 
> dir_context *ctx)
>   name_ptr = (char *)(entry + 1);
>   read_extent_buffer(leaf, name_ptr, (unsigned long)(di + 1),
>  name_len);
> - put_unaligned(btrfs_filetype_table[btrfs_dir_type(leaf, di)],
> + put_unaligned(fs_ftype_to_dtype(btrfs_dir_type(leaf, di)),
>   &entry->type);
>   btrfs_dir_item_key_to_cpu(leaf, di, );
> put_unaligned(location.objectid, &entry->ino);
> @@ -6340,7 +6325,20 @@ static struct inode *btrfs_new_inode(struct 
> btrfs_trans_handle *trans,
>  
>  static inline u8 btrfs_inode_type(struct inode *inode)
>  {
> - return btrfs_type_by_mode[(inode->i_mode & S_IFMT) >> S_SHIFT];
> + /*
> +  * compile-time asserts that generic FT_x types still match
> +  * BTRFS_FT_x types
> +  */
> + BUILD_BUG_ON(BTRFS_FT_UNKNOWN != FT_UNKNOWN);
> + BUILD_BUG_ON(BTRFS_FT_REG_FILE != FT_REG_FILE);
> + BUILD_BUG_ON(BTRFS_FT_DIR != FT_DIR);
> + BUILD_BUG_ON(BTRFS_FT_CHRDEV != FT_CHRDEV);
> + BUILD_BUG_ON(BTRFS_FT_BLKDEV != FT_BLKDEV);
> + BUILD_BUG_ON(BTRFS_FT_FIFO != FT_FIFO);
> + BUILD_BUG_ON(BTRFS_FT_SOCK != FT_SOCK);
> + BUILD_BUG_ON(BTRFS_FT_SYMLINK != FT_SYMLINK);
> +
> + return fs_umode_to_ftype(inode->i_mode);
>  }
>  
>  /*
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index aff1356c2bb8..4126ed7ee89a 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -307,6 +307,8 @@
>   *
>   * Used by:
>   * struct btrfs_dir_item.type
> + *
> + * Values 

Re: [PATCH 4/6] btrfs: only track ref_heads in delayed_ref_updates

2018-11-22 Thread Nikolay Borisov



On 21.11.18 at 20:59, Josef Bacik wrote:
> From: Josef Bacik 
> 
> We use this number to figure out how many delayed refs to run, but
> __btrfs_run_delayed_refs really only checks every time we need a new
> delayed ref head, so we always run at least one ref head completely no
> matter what the number of items on it.  Fix the accounting to only be
> adjusted when we add/remove a ref head.

LGTM:

Reviewed-by: Nikolay Borisov 

However, what if we kill delayed_ref_updates, since the name is a bit
ambiguous, and instead migrate num_heads_ready from delayed_refs to trans
and use that? Otherwise, as stated previously, num_heads_ready is
currently unused and could be removed.



> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/delayed-ref.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index b3e4c9fcb664..48725fa757a3 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -251,8 +251,6 @@ static inline void drop_delayed_ref(struct 
> btrfs_trans_handle *trans,
>   ref->in_tree = 0;
>   btrfs_put_delayed_ref(ref);
>   atomic_dec(&delayed_refs->num_entries);
> - if (trans->delayed_ref_updates)
> - trans->delayed_ref_updates--;
>  }
>  
>  static bool merge_ref(struct btrfs_trans_handle *trans,
> @@ -467,7 +465,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle 
> *trans,
>   if (ref->action == BTRFS_ADD_DELAYED_REF)
>   list_add_tail(&ref->add_list, &href->ref_add_list);
>   atomic_inc(&root->num_entries);
> - trans->delayed_ref_updates++;
>   spin_unlock(&href->lock);
>   return ret;
>  }
> 


Re: [PATCH 3/6] btrfs: cleanup extent_op handling

2018-11-22 Thread Nikolay Borisov



On 21.11.18 at 20:59, Josef Bacik wrote:
> From: Josef Bacik 
> 
> The cleanup_extent_op function actually would run the extent_op if it
> needed running, which made the name sort of a misnomer.  Change it to
> run_and_cleanup_extent_op, and move the actual cleanup work to
> cleanup_extent_op so it can be used by check_ref_cleanup() in order to
> unify the extent op handling.

The whole name extent_op is actually a misnomer, since AFAIR this is some
sort of modification of the references of metadata nodes. I don't see
why it can't be made yet another type of reference which is run for a
given node.

> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 36 +++-
>  1 file changed, 23 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index e3ed3507018d..8a776dc9cb38 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2424,19 +2424,33 @@ static void unselect_delayed_ref_head(struct 
> btrfs_delayed_ref_root *delayed_ref
>   btrfs_delayed_ref_unlock(head);
>  }
>  
> -static int cleanup_extent_op(struct btrfs_trans_handle *trans,
> -  struct btrfs_delayed_ref_head *head)
> +static struct btrfs_delayed_extent_op *
> +cleanup_extent_op(struct btrfs_trans_handle *trans,
> +   struct btrfs_delayed_ref_head *head)
>  {
>   struct btrfs_delayed_extent_op *extent_op = head->extent_op;
> - int ret;
>  
>   if (!extent_op)
> - return 0;
> - head->extent_op = NULL;
> + return NULL;
> +
>   if (head->must_insert_reserved) {
> + head->extent_op = NULL;
>   btrfs_free_delayed_extent_op(extent_op);
> - return 0;
> + return NULL;
>   }
> + return extent_op;
> +}
> +
> +static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans,
> +  struct btrfs_delayed_ref_head *head)
> +{
> + struct btrfs_delayed_extent_op *extent_op =
> + cleanup_extent_op(trans, head);
> + int ret;
> +
> + if (!extent_op)
> + return 0;
> + head->extent_op = NULL;
>   spin_unlock(&head->lock);
>   ret = run_delayed_extent_op(trans, head, extent_op);
>   btrfs_free_delayed_extent_op(extent_op);
> @@ -2488,7 +2502,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>  
>   delayed_refs = &trans->transaction->delayed_refs;
>  
> - ret = cleanup_extent_op(trans, head);
> + ret = run_and_cleanup_extent_op(trans, head);
>   if (ret < 0) {
>   unselect_delayed_ref_head(delayed_refs, head);
>   btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret);
> @@ -6977,12 +6991,8 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (!RB_EMPTY_ROOT(&head->ref_tree.rb_root))
>   goto out;
>  
> - if (head->extent_op) {
> - if (!head->must_insert_reserved)
> - goto out;
> - btrfs_free_delayed_extent_op(head->extent_op);
> - head->extent_op = NULL;
> - }
> + if (cleanup_extent_op(trans, head) != NULL)
> + goto out;
>  
>   /*
>* waiting for the lock here would deadlock.  If someone else has it
> 


Re: [PATCH 1/6] btrfs: add btrfs_delete_ref_head helper

2018-11-22 Thread Nikolay Borisov



On 22.11.18 at 11:12, Nikolay Borisov wrote:
> 
> 
> On 21.11.18 at 20:59, Josef Bacik wrote:
>> From: Josef Bacik 
>>
>> We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
>> into a helper and cleanup the calling functions.
>>
>> Signed-off-by: Josef Bacik 
>> Reviewed-by: Omar Sandoval 
>> ---
>>  fs/btrfs/delayed-ref.c | 14 ++
>>  fs/btrfs/delayed-ref.h |  3 ++-
>>  fs/btrfs/extent-tree.c | 22 +++---
>>  3 files changed, 19 insertions(+), 20 deletions(-)
>>
>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>> index 9301b3ad9217..b3e4c9fcb664 100644
>> --- a/fs/btrfs/delayed-ref.c
>> +++ b/fs/btrfs/delayed-ref.c
>> @@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>>  return head;
>>  }
>>  
>> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
>> +   struct btrfs_delayed_ref_head *head)
>> +{
>> +lockdep_assert_held(&delayed_refs->lock);
>> +lockdep_assert_held(&head->lock);
>> +
>> +rb_erase_cached(&head->href_node, &delayed_refs->href_root);
>> +RB_CLEAR_NODE(&head->href_node);
>> +atomic_dec(&delayed_refs->num_entries);
>> +delayed_refs->num_heads--;
>> +if (head->processing == 0)
>> +delayed_refs->num_heads_ready--;
> 
> the num_heads_ready decrement will never execute in cleanup_ref_head,
> since processing == 0 only when the ref head is unselected. Perhaps those
> 2 lines shouldn't be in this function? I find it a bit confusing that if
> processing is 0 we decrement num_heads_ready in check_ref_cleanup, but
> in unselect_delayed_ref_head we set it to 0 and increment it.
> 
>> +}
>> +
>>  /*
>>   * Helper to insert the ref_node to the tail or merge with tail.
>>   *
>> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
>> index 8e20c5cb5404..d2af974f68a1 100644
>> --- a/fs/btrfs/delayed-ref.h
>> +++ b/fs/btrfs/delayed-ref.h
>> @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct 
>> btrfs_delayed_ref_head *head)
>>  {
>>  mutex_unlock(&head->mutex);
>>  }
>> -
>> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
>> +   struct btrfs_delayed_ref_head *head);
>>  
>>  struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>>  struct btrfs_delayed_ref_root *delayed_refs);
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index d242a1174e50..c36b3a42f2bb 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
>> *trans,
>>  spin_unlock(&delayed_refs->lock);
>>  return 1;
>>  }
>> -delayed_refs->num_heads--;
>> -rb_erase_cached(&head->href_node, &delayed_refs->href_root);
>> -RB_CLEAR_NODE(&head->href_node);
>> +btrfs_delete_ref_head(delayed_refs, head);
>>  spin_unlock(&head->lock);
>>  spin_unlock(&delayed_refs->lock);
>> -atomic_dec(&delayed_refs->num_entries);
>>  
>>  trace_run_delayed_ref_head(fs_info, head, 0);
>>  
>> @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct 
>> btrfs_trans_handle *trans,
>> -if (!mutex_trylock(&head->mutex))
>>  goto out;
>>  
>> -/*
>> - * at this point we have a head with no other entries.  Go
>> - * ahead and process it.
>> - */
>> -rb_erase_cached(&head->href_node, &delayed_refs->href_root);
>> -RB_CLEAR_NODE(&head->href_node);
>> -atomic_dec(&delayed_refs->num_entries);
>> -
>> -/*
>> - * we don't take a ref on the node because we're removing it from the
>> - * tree, so we just steal the ref the tree was holding.
>> - */
>> -delayed_refs->num_heads--;
>> -if (head->processing == 0)
>> -delayed_refs->num_heads_ready--;
>> +btrfs_delete_ref_head(delayed_refs, head);
>>  head->processing = 0;

On a closer inspection I think here we can do:

ASSERT(head->processing == 0) because at that point we've taken the
head->lock spinlock which is held during ordinary delayed refs
processing (in __btrfs_run_delayed_refs) when the head is selected (and
processing is 1). So head->processing == 0 here I think is a hard
invariant of the code. The decrement here should pair with the increment
when the head was initially added to the tree.

In cleanup_ref_head we don't need to ever worry about num_heads_ready
since it has already been decremented by btrfs_select_ref_head.

As a matter of fact this counter is not used anywhere, so we might as well
just remove it.

> 
> Something is fishy here: before, the code checked for processing == 0 and
> then also set it to 0?
> 
>> +
>>  spin_unlock(&head->lock);
>>  spin_unlock(&delayed_refs->lock);
>>  
>>
> 


Re: [PATCH 1/6] btrfs: add btrfs_delete_ref_head helper

2018-11-22 Thread Nikolay Borisov



On 21.11.18 at 20:59, Josef Bacik wrote:
> From: Josef Bacik 
> 
> We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
> into a helper and cleanup the calling functions.
> 
> Signed-off-by: Josef Bacik 
> Reviewed-by: Omar Sandoval 
> ---
>  fs/btrfs/delayed-ref.c | 14 ++
>  fs/btrfs/delayed-ref.h |  3 ++-
>  fs/btrfs/extent-tree.c | 22 +++---
>  3 files changed, 19 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 9301b3ad9217..b3e4c9fcb664 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>   return head;
>  }
>  
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +struct btrfs_delayed_ref_head *head)
> +{
> + lockdep_assert_held(&delayed_refs->lock);
> + lockdep_assert_held(&head->lock);
> +
> + rb_erase_cached(&head->href_node, &delayed_refs->href_root);
> + RB_CLEAR_NODE(&head->href_node);
> + atomic_dec(&delayed_refs->num_entries);
> + delayed_refs->num_heads--;
> + if (head->processing == 0)
> + delayed_refs->num_heads_ready--;

the num_heads_ready decrement will never execute in cleanup_ref_head,
since processing == 0 only when the ref head is unselected. Perhaps those
2 lines shouldn't be in this function? I find it a bit confusing that if
processing is 0 we decrement num_heads_ready in check_ref_cleanup, but
in unselect_delayed_ref_head we set it to 0 and increment it.

> +}
> +
>  /*
>   * Helper to insert the ref_node to the tail or merge with tail.
>   *
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index 8e20c5cb5404..d2af974f68a1 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct 
> btrfs_delayed_ref_head *head)
>  {
>   mutex_unlock(&head->mutex);
>  }
> -
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +struct btrfs_delayed_ref_head *head);
>  
>  struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>   struct btrfs_delayed_ref_root *delayed_refs);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index d242a1174e50..c36b3a42f2bb 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   spin_unlock(&delayed_refs->lock);
>   return 1;
>   }
> - delayed_refs->num_heads--;
> - rb_erase_cached(&head->href_node, &delayed_refs->href_root);
> - RB_CLEAR_NODE(&head->href_node);
> + btrfs_delete_ref_head(delayed_refs, head);
>   spin_unlock(&head->lock);
>   spin_unlock(&delayed_refs->lock);
> - atomic_dec(&delayed_refs->num_entries);
>  
>   trace_run_delayed_ref_head(fs_info, head, 0);
>  
> @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (!mutex_trylock(&head->mutex))
>   goto out;
>  
> - /*
> -  * at this point we have a head with no other entries.  Go
> -  * ahead and process it.
> -  */
> - rb_erase_cached(&head->href_node, &delayed_refs->href_root);
> - RB_CLEAR_NODE(&head->href_node);
> - atomic_dec(&delayed_refs->num_entries);
> -
> - /*
> -  * we don't take a ref on the node because we're removing it from the
> -  * tree, so we just steal the ref the tree was holding.
> -  */
> - delayed_refs->num_heads--;
> - if (head->processing == 0)
> - delayed_refs->num_heads_ready--;
> + btrfs_delete_ref_head(delayed_refs, head);
>   head->processing = 0;

Something is fishy here: before, the code checked for processing == 0 and
then also set it to 0?

> +
>   spin_unlock(&head->lock);
>   spin_unlock(&delayed_refs->lock);
>  
> 


Re: [PATCH 3/6] btrfs: cleanup extent_op handling

2018-11-22 Thread Lu Fengqi
On Wed, Nov 21, 2018 at 01:59:09PM -0500, Josef Bacik wrote:
>From: Josef Bacik 
>
>The cleanup_extent_op function actually would run the extent_op if it
>needed running, which made the name sort of a misnomer.  Change it to
>run_and_cleanup_extent_op, and move the actual cleanup work to
>cleanup_extent_op so it can be used by check_ref_cleanup() in order to
>unify the extent op handling.
>
>Signed-off-by: Josef Bacik 

One nitpick below.

Reviewed-by: Lu Fengqi 

>---
> fs/btrfs/extent-tree.c | 36 +++-
> 1 file changed, 23 insertions(+), 13 deletions(-)
>
>diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>index e3ed3507018d..8a776dc9cb38 100644
>--- a/fs/btrfs/extent-tree.c
>+++ b/fs/btrfs/extent-tree.c
>@@ -2424,19 +2424,33 @@ static void unselect_delayed_ref_head(struct 
>btrfs_delayed_ref_root *delayed_ref
>   btrfs_delayed_ref_unlock(head);
> }
> 
>-static int cleanup_extent_op(struct btrfs_trans_handle *trans,
>-   struct btrfs_delayed_ref_head *head)
>+static struct btrfs_delayed_extent_op *
>+cleanup_extent_op(struct btrfs_trans_handle *trans,

The trans parameter seems useless.

-- 
Thanks,
Lu

>+struct btrfs_delayed_ref_head *head)
> {
>   struct btrfs_delayed_extent_op *extent_op = head->extent_op;
>-  int ret;
> 
>   if (!extent_op)
>-  return 0;
>-  head->extent_op = NULL;
>+  return NULL;
>+
>   if (head->must_insert_reserved) {
>+  head->extent_op = NULL;
>   btrfs_free_delayed_extent_op(extent_op);
>-  return 0;
>+  return NULL;
>   }
>+  return extent_op;
>+}
>+
>+static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans,
>+   struct btrfs_delayed_ref_head *head)
>+{
>+  struct btrfs_delayed_extent_op *extent_op =
>+  cleanup_extent_op(trans, head);
>+  int ret;
>+
>+  if (!extent_op)
>+  return 0;
>+  head->extent_op = NULL;
>   spin_unlock(&head->lock);
>   ret = run_delayed_extent_op(trans, head, extent_op);
>   btrfs_free_delayed_extent_op(extent_op);
>@@ -2488,7 +2502,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
>*trans,
> 
>   delayed_refs = &trans->transaction->delayed_refs;
> 
>-  ret = cleanup_extent_op(trans, head);
>+  ret = run_and_cleanup_extent_op(trans, head);
>   if (ret < 0) {
>   unselect_delayed_ref_head(delayed_refs, head);
>   btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret);
>@@ -6977,12 +6991,8 @@ static noinline int check_ref_cleanup(struct 
>btrfs_trans_handle *trans,
>   if (!RB_EMPTY_ROOT(&head->ref_tree.rb_root))
>   goto out;
> 
>-  if (head->extent_op) {
>-  if (!head->must_insert_reserved)
>-  goto out;
>-  btrfs_free_delayed_extent_op(head->extent_op);
>-  head->extent_op = NULL;
>-  }
>+  if (cleanup_extent_op(trans, head) != NULL)
>+  goto out;
> 
>   /*
>* waiting for the lock here would deadlock.  If someone else has it
>-- 
>2.14.3
>
>
>




[PATCH] btrfs: Remove extent_io_ops::readpage_io_failed_hook

2018-11-22 Thread Nikolay Borisov
For data inodes this hook does nothing but to return -EAGAIN which is
used to signal to the endio routines that this bio belongs to a data
inode. If this is the case the actual retrying is handled by
bio_readpage_error. Alternatively, if this bio belongs to the btree
inode then btree_io_failed_hook just does some cleanup and doesn't
retry anything.

This patch simplifies the code flow by eliminating
readpage_io_failed_hook and instead open-coding btree_io_failed_hook in
end_bio_extent_readpage. Also eliminate some needless checks, since IO is
always performed on either a data inode or the btree inode, both of which
are guaranteed to have their extent_io_tree::ops set.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/disk-io.c   | 16 --
 fs/btrfs/extent_io.c | 69 ++--
 fs/btrfs/extent_io.h |  1 -
 fs/btrfs/inode.c |  7 -
 4 files changed, 34 insertions(+), 59 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index feb67dfd663d..5024d9374163 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -673,19 +673,6 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio 
*io_bio,
return ret;
 }
 
-static int btree_io_failed_hook(struct page *page, int failed_mirror)
-{
-   struct extent_buffer *eb;
-
-   eb = (struct extent_buffer *)page->private;
-   set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
-   eb->read_mirror = failed_mirror;
-   atomic_dec(&eb->io_pages);
-   if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
-   btree_readahead_hook(eb, -EIO);
-   return -EIO;/* we fixed nothing */
-}
-
 static void end_workqueue_bio(struct bio *bio)
 {
struct btrfs_end_io_wq *end_io_wq = bio->bi_private;
@@ -4541,7 +4528,4 @@ static const struct extent_io_ops btree_extent_io_ops = {
/* mandatory callbacks */
.submit_bio_hook = btree_submit_bio_hook,
.readpage_end_io_hook = btree_readpage_end_io_hook,
-   .readpage_io_failed_hook = btree_io_failed_hook,
-
-   /* optional callbacks */
 };
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4ea808d6cfbc..cca3bccd142e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2333,11 +2333,10 @@ struct bio *btrfs_create_repair_bio(struct inode 
*inode, struct bio *failed_bio,
 }
 
 /*
- * this is a generic handler for readpage errors (default
- * readpage_io_failed_hook). if other copies exist, read those and write back
- * good data to the failed position. does not investigate in remapping the
- * failed extent elsewhere, hoping the device will be smart enough to do this 
as
- * needed
+ * This is a generic handler for readpage errors. If other copies exist, read
+ * those and write back good data to the failed position. Does not investigate
+ * in remapping the failed extent elsewhere, hoping the device will be smart
+ * enough to do this as needed
  */
 
 static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
@@ -2501,6 +2500,8 @@ static void end_bio_extent_readpage(struct bio *bio)
struct page *page = bvec->bv_page;
struct inode *inode = page->mapping->host;
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+   bool data_inode = btrfs_ino(BTRFS_I(inode))
+   != BTRFS_BTREE_INODE_OBJECTID;
 
btrfs_debug(fs_info,
"end_bio_extent_readpage: bi_sector=%llu, err=%d, 
mirror=%u",
@@ -2530,7 +2531,7 @@ static void end_bio_extent_readpage(struct bio *bio)
len = bvec->bv_len;
 
mirror = io_bio->mirror_num;
-   if (likely(uptodate && tree->ops)) {
+   if (likely(uptodate)) {
ret = tree->ops->readpage_end_io_hook(io_bio, offset,
  page, start, end,
  mirror);
@@ -2546,38 +2547,36 @@ static void end_bio_extent_readpage(struct bio *bio)
if (likely(uptodate))
goto readpage_ok;
 
-   if (tree->ops) {
-   ret = tree->ops->readpage_io_failed_hook(page, mirror);
-   if (ret == -EAGAIN) {
-   /*
-* Data inode's readpage_io_failed_hook() always
-* returns -EAGAIN.
-*
-* The generic bio_readpage_error handles errors
-* the following way: If possible, new read
-* requests are created and submitted and will
-* end up in end_bio_extent_readpage as well (if
-* we're lucky, not in the !uptodate case). In
-* that case it returns 0 and we just go on with
- 

[PATCH] btrfs: Remove unnecessary code from __btrfs_balance

2018-11-22 Thread Nikolay Borisov
The first step of the rebalance process is to ensure there is 1MiB free
on each device. This number seems rather small. And in fact, when talking
to the original authors, their opinions were:

"man that's a little bonkers"
"i don't think we even need that code anymore"
"I think it was there to make sure we had room for the blank 1M at the
beginning. I bet it goes all the way back to v0"
"we just don't need any of that tho, i say we just delete it"

Clearly, this piece of code has lost its original intent throughout
the years. It doesn't bring any real practical benefit to the
relocation process. No functional changes.

Signed-off-by: Nikolay Borisov 
Suggested-by: Josef Bacik 
---
 fs/btrfs/volumes.c | 53 --
 1 file changed, 53 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8e36cbb355df..eb9fa8a6429c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3645,17 +3645,11 @@ static int __btrfs_balance(struct btrfs_fs_info 
*fs_info)
 {
struct btrfs_balance_control *bctl = fs_info->balance_ctl;
struct btrfs_root *chunk_root = fs_info->chunk_root;
-   struct btrfs_root *dev_root = fs_info->dev_root;
-   struct list_head *devices;
-   struct btrfs_device *device;
-   u64 old_size;
-   u64 size_to_free;
u64 chunk_type;
struct btrfs_chunk *chunk;
struct btrfs_path *path = NULL;
struct btrfs_key key;
struct btrfs_key found_key;
-   struct btrfs_trans_handle *trans;
struct extent_buffer *leaf;
int slot;
int ret;
@@ -3670,53 +3664,6 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
u32 count_sys = 0;
int chunk_reserved = 0;
 
-   /* step one make some room on all the devices */
-   devices = &fs_info->fs_devices->devices;
-   list_for_each_entry(device, devices, dev_list) {
-   old_size = btrfs_device_get_total_bytes(device);
-   size_to_free = div_factor(old_size, 1);
-   size_to_free = min_t(u64, size_to_free, SZ_1M);
-   if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) ||
-   btrfs_device_get_total_bytes(device) -
-   btrfs_device_get_bytes_used(device) > size_to_free ||
-   test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
-   continue;
-
-   ret = btrfs_shrink_device(device, old_size - size_to_free);
-   if (ret == -ENOSPC)
-   break;
-   if (ret) {
-   /* btrfs_shrink_device never returns ret > 0 */
-   WARN_ON(ret > 0);
-   goto error;
-   }
-
-   trans = btrfs_start_transaction(dev_root, 0);
-   if (IS_ERR(trans)) {
-   ret = PTR_ERR(trans);
-   btrfs_info_in_rcu(fs_info,
-"resize: unable to start transaction after shrinking device %s 
(error %d), old size %llu, new size %llu",
- rcu_str_deref(device->name), ret,
- old_size, old_size - size_to_free);
-   goto error;
-   }
-
-   ret = btrfs_grow_device(trans, device, old_size);
-   if (ret) {
-   btrfs_end_transaction(trans);
-   /* btrfs_grow_device never returns ret > 0 */
-   WARN_ON(ret > 0);
-   btrfs_info_in_rcu(fs_info,
-"resize: unable to grow device after shrinking device %s 
(error %d), old size %llu, new size %llu",
- rcu_str_deref(device->name), ret,
- old_size, old_size - size_to_free);
-   goto error;
-   }
-
-   btrfs_end_transaction(trans);
-   }
-
-   /* step two, relocate all the chunks */
path = btrfs_alloc_path();
if (!path) {
ret = -ENOMEM;
-- 
2.17.1