Re: Ongoing Btrfs stability issues

2018-02-16 Thread Shehbaz Jaffer
>It's hosted on an EBS volume; we don't use ephemeral storage at all. The EBS 
>volumes are all SSD

I have recently done some SSD corruption experiments on a small set of
workloads, so I thought I would share my experience.

When creating btrfs with mkfs.btrfs on an SSD, the metadata duplication
(DUP) option is disabled by default. This renders btrfs scrubbing
ineffective for metadata, as there is no redundant copy to restore
corrupted metadata from.
So if a read on an SSD hits corrupted metadata, the corruption cannot be
repaired on the fly the way it would be on an HDD, where scrub fixes the
block from the duplicate copy after detecting the checksum error; on the
SSD the read simply fails as an uncorrectable error.
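For reference, a quick way to check whether a filesystem is hitting such
uncorrectable errors (the mount point below is only illustrative):

    $ btrfs scrub start -Bd /mnt   # run scrub in the foreground, print per-device stats
    $ btrfs device stats /mnt      # cumulative write/read/flush/corruption/generation error counters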

Could you confirm whether metadata DUP is enabled on your system by
running the following command:

$ btrfs fi df /mnt # /mnt is the mount point
Data, single: total=8.00MiB, used=64.00KiB
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=168.00MiB, used=112.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

If metadata is single in your case as well (and not DUP), that may be
why btrfs scrub is not correcting bit rot on the fly, which in turn
causes reliability issues. A couple of bugs observed specifically on
SSDs are reported here:

https://bugzilla.kernel.org/show_bug.cgi?id=198463
https://bugzilla.kernel.org/show_bug.cgi?id=198807

These do not occur on HDDs, and I believe they should not occur when
the filesystem is mounted with the nossd option.
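For completeness, in case it helps: DUP metadata can be requested
explicitly at mkfs time, or an existing filesystem can be converted with
a metadata balance. The device and mount point below are placeholders:

    $ mkfs.btrfs -m dup /dev/sdX               # force DUP metadata even on an SSD
    $ btrfs balance start -mconvert=dup /mnt   # convert the metadata profile of an existing filesystem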

On Fri, Feb 16, 2018 at 10:03 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Austin S. Hemmelgarn posted on Fri, 16 Feb 2018 14:44:07 -0500 as
> excerpted:
>
>> This will probably sound like an odd question, but does BTRFS think your
>> storage devices are SSD's or not?  Based on what you're saying, it
>> sounds like you're running into issues resulting from the
>> over-aggressive SSD 'optimizations' that were done by BTRFS until very
>> recently.
>>
>> You can verify if this is what's causing your problems or not by either
>> upgrading to a recent mainline kernel version (I know the changes are in
>> 4.15, I don't remember for certain if they're in 4.14 or not, but I
>> think they are), or by adding 'nossd' to your mount options, and then
>> seeing if you still have the problems or not (I suspect this is only
>> part of it, and thus changing this will reduce the issues, but not
>> completely eliminate them).  Make sure and run a full balance after
>> changing either item, as the aforementioned 'optimizations' have an
>> impact on how data is organized on-disk (which is ultimately what causes
>> the issues), so they will have a lingering effect if you don't balance
>> everything.
>
> According to the wiki, 4.14 does indeed have the ssd changes.
>
> According to the bug, he's running 4.13.x on one server and 4.14.x on
> two.  So upgrading the one to 4.14.x should mean all will have that fix.
>
> However, without a full balance it /will/ take some time to settle down
> (again, assuming btrfs was using ssd mode), so the lingering effect could
> still be creating problems on the 4.14 kernel servers for the moment.
>
> --
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
>



-- 
Shehbaz Jaffer


Re: Ongoing Btrfs stability issues

2018-02-16 Thread Duncan
Austin S. Hemmelgarn posted on Fri, 16 Feb 2018 14:44:07 -0500 as
excerpted:

> This will probably sound like an odd question, but does BTRFS think your
> storage devices are SSD's or not?  Based on what you're saying, it
> sounds like you're running into issues resulting from the
> over-aggressive SSD 'optimizations' that were done by BTRFS until very
> recently.
> 
> You can verify if this is what's causing your problems or not by either
> upgrading to a recent mainline kernel version (I know the changes are in
> 4.15, I don't remember for certain if they're in 4.14 or not, but I
> think they are), or by adding 'nossd' to your mount options, and then
> seeing if you still have the problems or not (I suspect this is only
> part of it, and thus changing this will reduce the issues, but not
> completely eliminate them).  Make sure and run a full balance after
> changing either item, as the aforementioned 'optimizations' have an
> impact on how data is organized on-disk (which is ultimately what causes
> the issues), so they will have a lingering effect if you don't balance
> everything.

According to the wiki, 4.14 does indeed have the ssd changes.

According to the bug, he's running 4.13.x on one server and 4.14.x on 
two.  So upgrading the one to 4.14.x should mean all will have that fix.

However, without a full balance it /will/ take some time to settle down 
(again, assuming btrfs was using ssd mode), so the lingering effect could 
still be creating problems on the 4.14 kernel servers for the moment.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Status of FST and mount times

2018-02-16 Thread Qu Wenruo


On 2018-02-16 22:12, Ellis H. Wilson III wrote:
> On 02/15/2018 08:55 PM, Qu Wenruo wrote:
>> On 2018-02-16 00:30, Ellis H. Wilson III wrote:
>>> Very helpful information.  Thank you Qu and Hans!
>>>
>>> I have about 1.7TB of homedir data newly rsync'd data on a single
>>> enterprise 7200rpm HDD and the following output for btrfs-debug:
>>>
>>> extent tree key (EXTENT_TREE ROOT_ITEM 0) 543384862720 level 2
>>> total bytes 6001175126016
>>> bytes used 1832557875200
>>>
>>> Hans' (very cool) tool reports:
>>> ROOT_TREE 624.00KiB 0(    38) 1( 1)
>>> EXTENT_TREE   327.31MiB 0( 20881) 1(    66) 2( 1)
>>
>> Extent tree is not so large, a little unexpected to see such slow mount.
>>
>> BTW, how many chunks do you have?
>>
>> It could be checked by:
>>
>> # btrfs-debug-tree -t chunk  | grep CHUNK_ITEM | wc -l
> 
> Since yesterday I've doubled the size by copying the homedir dataset in
> again.  Here are new stats:
> 
> extent tree key (EXTENT_TREE ROOT_ITEM 0) 385990656 level 2
> total bytes 6001175126016
> bytes used 3663525969920
> 
> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
> 3454

OK, this explains everything.

There are too many chunks.
It means that at mount time you need to search for a block group item
3454 times.

Even if each search only needs to iterate over 3 tree blocks, multiplied
by 3454 that is still a lot of work.
Although some tree blocks, like the root node and level 1 nodes, can be
cached, we still need to read about 3500 tree blocks.

If the fs was created with the 16K nodesize, this means roughly 54M of
random reads done in 16K blocks.

No wonder it takes some time.

Normally I would expect each data and metadata chunk to be about 1G.

If there is nothing special going on, that chunk count means your
filesystem is already larger than 3T.
If your used space is much smaller than 3.5T (less than 30% of it), then
your chunk usage is pretty low, and in that case a balance to reduce the
number of chunks (block groups) would reduce mount time.

My personal estimate is that mount time scales as O(n log n) in the
number of chunks, so if you can cut the chunk count in half, you could
reduce mount time by roughly 60%.
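If you want to try that, something along these lines should work (the
usage thresholds are only examples) to compact mostly-empty block groups
and then re-check the chunk count:

    $ btrfs balance start -dusage=30 -musage=30 /mnt/btrfs
    $ btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l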

> 
> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
> ROOT_TREE   1.14MiB 0(    72) 1( 1)
> EXTENT_TREE   644.27MiB 0( 41101) 1(   131) 2( 1)
> CHUNK_TREE    384.00KiB 0(    23) 1( 1)
> DEV_TREE  272.00KiB 0(    16) 1( 1)
> FS_TREE    11.55GiB 0(754442) 1(  2179) 2( 5) 3( 2)
> CSUM_TREE   3.50GiB 0(228593) 1(   791) 2( 2) 3( 1)
> QUOTA_TREE    0.00B
> UUID_TREE  16.00KiB 0( 1)
> FREE_SPACE_TREE   0.00B
> DATA_RELOC_TREE    16.00KiB 0( 1)
> 
> The old mean mount time was 4.319s.  It now takes 11.537s for the
> doubled dataset.  Again please realize this is on an old version of
> BTRFS (4.5.5), so perhaps newer ones will perform better, but I'd still
> like to understand this delay more.  Should I expect this to scale in
> this way all the way up to my proposed 60-80TB filesystem so long as the
> file size distribution stays roughly similar?  That would definitely be
> in terms of multiple minutes at that point.
> 
>>> Taking 100 snapshots (no changes between snapshots however) of the above
>>> subvolume doesn't appear to impact mount/umount time.
>>
>> 100 unmodified snapshots won't affect mount time.
>>
>> It needs new extents, which can be created by overwriting extents in
>> snapshots.
>> So it won't really cause much difference if all these snapshots are all
>> unmodified.
> 
> Good to know, thanks!
> 
>>> Snapshot creation
>>> and deletion both operate at between 0.25s to 0.5s.
>>
>> IIRC snapshot deletion is delayed, so the real work doesn't happen when
>> "btrfs sub del" returns.
> 
> I was using btrfs sub del -C for the deletions, so I believe (if that
> command truly waits for the subvolume to be utterly gone) it captures
> the entirety of the snapshot.

No, snapshot deletion is always deferred to the background.

-C only ensures that even if a power loss happens after the command
returns, you won't see the snapshot anywhere; the actual deletion still
happens in the background.
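As an illustration (the snapshot path is hypothetical), waiting for the
cleanup to actually finish takes an extra step:

    $ btrfs subvolume delete -C /mnt/snaps/snap-001   # returns once the deletion is committed
    $ btrfs subvolume sync /mnt                       # blocks until queued deletions are fully cleaned up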

Thanks,
Qu

> 
> Best,
> 
> ellis





Re: [PATCH] Fix NULL pointer exception in find_bio_stripe()

2018-02-16 Thread Greg KH
On Fri, Feb 16, 2018 at 07:51:38PM +, Dmitriy Gorokh wrote:
> On detaching of a disk which is a part of a RAID6 filesystem, the following 
> kernel OOPS may happen:
> 
> [63122.680461] BTRFS error (device sdo): bdev /dev/sdo errs: wr 0, rd 0, 
> flush 1, corrupt 0, gen 0 
> [63122.719584] BTRFS warning (device sdo): lost page write due to IO error on 
> /dev/sdo 
> [63122.719587] BTRFS error (device sdo): bdev /dev/sdo errs: wr 1, rd 0, 
> flush 1, corrupt 0, gen 0 
> [63122.803516] BTRFS warning (device sdo): lost page write due to IO error on 
> /dev/sdo 
> [63122.803519] BTRFS error (device sdo): bdev /dev/sdo errs: wr 2, rd 0, 
> flush 1, corrupt 0, gen 0 
> [63122.863902] BTRFS critical (device sdo): fatal error on device /dev/sdo 
> [63122.935338] BUG: unable to handle kernel NULL pointer dereference at 
> 0080 
> [63122.946554] IP: fail_bio_stripe+0x58/0xa0 [btrfs] 
> [63122.958185] PGD 9ecda067 P4D 9ecda067 PUD b2b37067 PMD 0 
> [63122.971202] Oops:  [#1] SMP 
> [63122.990786] Modules linked in: libcrc32c dlm configfs cpufreq_userspace 
> cpufreq_powersave cpufreq_conservative softdog nfsd auth_rpcgss nfs_acl nfs 
> lockd grace fscache sunrpc bonding ipmi_devintf ipmi_msghandler joydev 
> snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd psmouse evdev parport_pc 
> soundcore serio_raw battery pcspkr video ac97_bus ac parport ohci_pci 
> ohci_hcd i2c_piix4 button crc32c_generic crc32c_intel btrfs xor 
> zstd_decompress zstd_compress xxhash raid6_pq dm_mod dax raid1 md_mod 
> hid_generic usbhid hid xhci_pci xhci_hcd ehci_pci ehci_hcd usbcore sg sd_mod 
> sr_mod cdrom ata_generic ahci libahci ata_piix libata e1000 scsi_mod [last 
> unloaded: scst] 
> [63123.006760] CPU: 0 PID: 3979 Comm: kworker/u8:9 Tainted: G W 
> 4.14.2-16-scst34x+ #8 
> [63123.007091] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS 
> VirtualBox 12/01/2006 
> [63123.007402] Workqueue: btrfs-worker btrfs_worker_helper [btrfs] 
> [63123.007595] task: 880036ea4040 task.stack: c90006384000 
> [63123.007796] RIP: 0010:fail_bio_stripe+0x58/0xa0 [btrfs] 
> [63123.007968] RSP: 0018:c90006387ad8 EFLAGS: 00010287 
> [63123.008140] RAX: 0002 RBX: 88004beaa0b8 RCX: 
> 8800b2bd5690 
> [63123.008359] RDX:  RSI: 88007bb43500 RDI: 
> 88004beaa000 
> [63123.008621] RBP: c90006387ae8 R08: 9910 R09: 
> 8800b2bd5600 
> [63123.008840] R10: 0004 R11: 0001 R12: 
> 88007bb43500 
> [63123.009059] R13: fffb R14: 880036fc5180 R15: 
> 0004 
> [63123.009278] FS: () GS:8800b700() 
> knlGS: 
> [63123.009564] CS: 0010 DS:  ES:  CR0: 80050033 
> [63123.009748] CR2: 0080 CR3: b0866000 CR4: 
> 000406f0 
> [63123.009969] Call Trace: 
> [63123.010085] raid_write_end_io+0x7e/0x80 [btrfs] 
> [63123.010251] bio_endio+0xa1/0x120 
> [63123.010378] generic_make_request+0x218/0x270 
> [63123.010921] submit_bio+0x66/0x130 
> [63123.011073] finish_rmw+0x3fc/0x5b0 [btrfs] 
> [63123.011245] full_stripe_write+0x96/0xc0 [btrfs] 
> [63123.011428] raid56_parity_write+0x117/0x170 [btrfs] 
> [63123.011604] btrfs_map_bio+0x2ec/0x320 [btrfs] 
> [63123.011759] ? ___cache_free+0x1c5/0x300 
> [63123.011909] __btrfs_submit_bio_done+0x26/0x50 [btrfs] 
> [63123.012087] run_one_async_done+0x9c/0xc0 [btrfs] 
> [63123.012257] normal_work_helper+0x19e/0x300 [btrfs] 
> [63123.012429] btrfs_worker_helper+0x12/0x20 [btrfs] 
> [63123.012656] process_one_work+0x14d/0x350 
> [63123.012888] worker_thread+0x4d/0x3a0 
> [63123.013026] ? _raw_spin_unlock_irqrestore+0x15/0x20 
> [63123.013192] kthread+0x109/0x140 
> [63123.013315] ? process_scheduled_works+0x40/0x40 
> [63123.013472] ? kthread_stop+0x110/0x110 
> [63123.013610] ret_from_fork+0x25/0x30 
> [63123.013741] Code: 7e 43 31 c0 48 63 d0 48 8d 14 52 49 8d 4c d1 60 48 8b 51 
> 08 49 39 d0 72 1f 4c 63 1b 4c 01 da 49 39 d0 73 14 48 8b 11 48 8b 52 68 <48> 
> 8b 8a 80 00 00 00 48 39 4e 08 74 14 83 c0 01 44 39 d0 75 c4 
> [63123.014469] RIP: fail_bio_stripe+0x58/0xa0 [btrfs] RSP: c90006387ad8 
> [63123.014678] CR2: 0080 
> [63123.016590] ---[ end trace a295ea7259c17880 ]--- 
> 
> This is reproducible in a loop where a series of writes is followed by a SCSI 
> device delete command. The test may take up to a few minutes.
> 
> Fixes: commit 74d46992e0d9dee7f1f376de0d56d31614c8a17a ("block: replace 
> bi_bdev with a gendisk pointer and partitions index")
> ---
>  fs/btrfs/raid56.c | 1 +
>  1 file changed, 1 insertion(+)



This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.
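For reference, the usual minimal form described in that document is to
tag the patch for stable in the sign-off area; the version range and
name below are only examples:

    Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
    Cc: <stable@vger.kernel.org> # 4.14+
    Signed-off-by: Your Name <your.name@example.com>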



[PATCH] Fix NULL pointer exception in find_bio_stripe()

2018-02-16 Thread Dmitriy Gorokh
On detaching of a disk which is a part of a RAID6 filesystem, the following 
kernel OOPS may happen:

[63122.680461] BTRFS error (device sdo): bdev /dev/sdo errs: wr 0, rd 0, flush 
1, corrupt 0, gen 0 
[63122.719584] BTRFS warning (device sdo): lost page write due to IO error on 
/dev/sdo 
[63122.719587] BTRFS error (device sdo): bdev /dev/sdo errs: wr 1, rd 0, flush 
1, corrupt 0, gen 0 
[63122.803516] BTRFS warning (device sdo): lost page write due to IO error on 
/dev/sdo 
[63122.803519] BTRFS error (device sdo): bdev /dev/sdo errs: wr 2, rd 0, flush 
1, corrupt 0, gen 0 
[63122.863902] BTRFS critical (device sdo): fatal error on device /dev/sdo 
[63122.935338] BUG: unable to handle kernel NULL pointer dereference at 
0080 
[63122.946554] IP: fail_bio_stripe+0x58/0xa0 [btrfs] 
[63122.958185] PGD 9ecda067 P4D 9ecda067 PUD b2b37067 PMD 0 
[63122.971202] Oops:  [#1] SMP 
[63122.990786] Modules linked in: libcrc32c dlm configfs cpufreq_userspace 
cpufreq_powersave cpufreq_conservative softdog nfsd auth_rpcgss nfs_acl nfs 
lockd grace fscache sunrpc bonding ipmi_devintf ipmi_msghandler joydev 
snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd psmouse evdev parport_pc 
soundcore serio_raw battery pcspkr video ac97_bus ac parport ohci_pci ohci_hcd 
i2c_piix4 button crc32c_generic crc32c_intel btrfs xor zstd_decompress 
zstd_compress xxhash raid6_pq dm_mod dax raid1 md_mod hid_generic usbhid hid 
xhci_pci xhci_hcd ehci_pci ehci_hcd usbcore sg sd_mod sr_mod cdrom ata_generic 
ahci libahci ata_piix libata e1000 scsi_mod [last unloaded: scst] 
[63123.006760] CPU: 0 PID: 3979 Comm: kworker/u8:9 Tainted: G W 
4.14.2-16-scst34x+ #8 
[63123.007091] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS 
VirtualBox 12/01/2006 
[63123.007402] Workqueue: btrfs-worker btrfs_worker_helper [btrfs] 
[63123.007595] task: 880036ea4040 task.stack: c90006384000 
[63123.007796] RIP: 0010:fail_bio_stripe+0x58/0xa0 [btrfs] 
[63123.007968] RSP: 0018:c90006387ad8 EFLAGS: 00010287 
[63123.008140] RAX: 0002 RBX: 88004beaa0b8 RCX: 
8800b2bd5690 
[63123.008359] RDX:  RSI: 88007bb43500 RDI: 
88004beaa000 
[63123.008621] RBP: c90006387ae8 R08: 9910 R09: 
8800b2bd5600 
[63123.008840] R10: 0004 R11: 0001 R12: 
88007bb43500 
[63123.009059] R13: fffb R14: 880036fc5180 R15: 
0004 
[63123.009278] FS: () GS:8800b700() 
knlGS: 
[63123.009564] CS: 0010 DS:  ES:  CR0: 80050033 
[63123.009748] CR2: 0080 CR3: b0866000 CR4: 
000406f0 
[63123.009969] Call Trace: 
[63123.010085] raid_write_end_io+0x7e/0x80 [btrfs] 
[63123.010251] bio_endio+0xa1/0x120 
[63123.010378] generic_make_request+0x218/0x270 
[63123.010921] submit_bio+0x66/0x130 
[63123.011073] finish_rmw+0x3fc/0x5b0 [btrfs] 
[63123.011245] full_stripe_write+0x96/0xc0 [btrfs] 
[63123.011428] raid56_parity_write+0x117/0x170 [btrfs] 
[63123.011604] btrfs_map_bio+0x2ec/0x320 [btrfs] 
[63123.011759] ? ___cache_free+0x1c5/0x300 
[63123.011909] __btrfs_submit_bio_done+0x26/0x50 [btrfs] 
[63123.012087] run_one_async_done+0x9c/0xc0 [btrfs] 
[63123.012257] normal_work_helper+0x19e/0x300 [btrfs] 
[63123.012429] btrfs_worker_helper+0x12/0x20 [btrfs] 
[63123.012656] process_one_work+0x14d/0x350 
[63123.012888] worker_thread+0x4d/0x3a0 
[63123.013026] ? _raw_spin_unlock_irqrestore+0x15/0x20 
[63123.013192] kthread+0x109/0x140 
[63123.013315] ? process_scheduled_works+0x40/0x40 
[63123.013472] ? kthread_stop+0x110/0x110 
[63123.013610] ret_from_fork+0x25/0x30 
[63123.013741] Code: 7e 43 31 c0 48 63 d0 48 8d 14 52 49 8d 4c d1 60 48 8b 51 
08 49 39 d0 72 1f 4c 63 1b 4c 01 da 49 39 d0 73 14 48 8b 11 48 8b 52 68 <48> 8b 
8a 80 00 00 00 48 39 4e 08 74 14 83 c0 01 44 39 d0 75 c4 
[63123.014469] RIP: fail_bio_stripe+0x58/0xa0 [btrfs] RSP: c90006387ad8 
[63123.014678] CR2: 0080 
[63123.016590] ---[ end trace a295ea7259c17880 ]--- 

This is reproducible in a loop where a series of writes is followed by a SCSI 
device delete command. The test may take up to a few minutes.

Fixes: commit 74d46992e0d9dee7f1f376de0d56d31614c8a17a ("block: replace bi_bdev 
with a gendisk pointer and partitions index")
---
 fs/btrfs/raid56.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index dec0907dfb8a..fcfc20de2df3 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1370,6 +1370,7 @@ static int find_bio_stripe(struct btrfs_raid_bio *rbio,
 		stripe_start = stripe->physical;
 		if (physical >= stripe_start &&
 		    physical < stripe_start + rbio->stripe_len &&
+		    stripe->dev->bdev &&
 		    bio->bi_disk == stripe->dev->bdev->bd_disk &&
 		    bio->bi_partno == stripe->dev->bdev->bd_partno) {
 			return i;
-- 
2.14.2



Re: Ongoing Btrfs stability issues

2018-02-16 Thread Austin S. Hemmelgarn

On 2018-02-15 11:18, Alex Adriaanse wrote:

We've been using Btrfs in production on AWS EC2 with EBS devices for over 2 
years. There is so much I love about Btrfs: CoW snapshots, compression, 
subvolumes, flexibility, the tools, etc. However, lack of stability has been a 
serious ongoing issue for us, and we're getting to the point that it's becoming 
hard to justify continuing to use it unless we make some changes that will get 
it stable. The instability manifests itself mostly in the form of the VM 
completely crashing, I/O operations freezing, or the filesystem going into 
readonly mode. We've spent an enormous amount of time trying to recover 
corrupted filesystems, and the time that servers were down as a result of Btrfs 
instability has accumulated to many days.

We've made many changes to try to improve Btrfs stability: upgrading to newer 
kernels, setting up nightly balances, setting up monitoring to ensure our 
filesystems stay under 70% utilization, etc. This has definitely helped quite a 
bit, but even with these things in place it's still unstable. Take 
https://bugzilla.kernel.org/show_bug.cgi?id=198787 for example, which I created 
yesterday: we've had 4 VMs (out of 20) go down over the past week alone because 
of Btrfs errors. Thankfully, no data was lost, but I did have to copy 
everything over to a new filesystem.

Many of our VMs that run Btrfs have a high rate of I/O (both read/write; I/O 
utilization is often pegged at 100%). The filesystems that get little I/O seem 
pretty stable, but the ones that undergo a lot of I/O activity are the ones 
that suffer from the most instability problems. We run the following balances 
on every filesystem every night:

 btrfs balance start -dusage=10 
 btrfs balance start -dusage=20 
 btrfs balance start -dusage=40,limit=100 

I would suggest changing this to eliminate the balance with '-dusage=10' 
(it's redundant with the '-dusage=20' one unless your filesystem is in 
pathologically bad shape), and adding equivalent filters for balancing 
metadata (which generally goes pretty fast).
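In concrete terms, one possible reading of that suggestion as a nightly
sequence (mount point and thresholds are only examples, not tuned for
your workload) would be:

    btrfs balance start -dusage=20 /mnt
    btrfs balance start -dusage=40,limit=100 /mnt
    btrfs balance start -musage=20 /mnt
    btrfs balance start -musage=40 /mnt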


Unless you've got a huge filesystem, you can also cut down on that limit 
filter.  100 data chunks that are 40% full is up to 40GB of data to move 
on a normally sized filesystem, or potentially up to 200GB if you've got 
a really big filesystem (I forget what point BTRFS starts scaling up 
chunk sizes at, but I'm pretty sure it's in the TB range).


We also use the following btrfs-snap cronjobs to implement rotating snapshots, 
with short-term snapshots taking place every 15 minutes and less frequent ones 
being retained for up to 3 days:

 0 1-23 * * * /opt/btrfs-snap/btrfs-snap -r  23
 15,30,45 * * * * /opt/btrfs-snap/btrfs-snap -r  15m 3
 0 0 * * * /opt/btrfs-snap/btrfs-snap -r  daily 3

Our filesystems are mounted with the "compress=lzo" option.

Are we doing something wrong? Are there things we should change to improve 
stability? I wouldn't be surprised if eliminating snapshots would stabilize 
things, but if we do that we might as well be using a filesystem like XFS. Are 
there fixes queued up that will solve the problems listed in the Bugzilla 
ticket referenced above? Or is our I/O-intensive workload just not a good fit 
for Btrfs?


This will probably sound like an odd question, but does BTRFS think your 
storage devices are SSD's or not?  Based on what you're saying, it 
sounds like you're running into issues resulting from the 
over-aggressive SSD 'optimizations' that were done by BTRFS until very 
recently.


You can verify if this is what's causing your problems or not by either 
upgrading to a recent mainline kernel version (I know the changes are in 
4.15, I don't remember for certain if they're in 4.14 or not, but I 
think they are), or by adding 'nossd' to your mount options, and then 
seeing if you still have the problems or not (I suspect this is only 
part of it, and thus changing this will reduce the issues, but not 
completely eliminate them).  Make sure and run a full balance after 
changing either item, as the aforementioned 'optimizations' have an 
impact on how data is organized on-disk (which is ultimately what causes 
the issues), so they will have a lingering effect if you don't balance 
everything.
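As a sketch of the above (mount point illustrative), toggling the option
and rebalancing would look roughly like:

    $ mount -o remount,nossd /mnt
    $ btrfs balance start --full-balance /mnt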


'autodefrag' is the other mount option that I would try toggling (turn 
it off if you've got it on, or on if you've got it off).  I doubt it 
will have much impact, but it does change how things end up on disk.


Additionally to all that, make sure your monitoring isn't just looking 
at the regular `df` command's output, it's woefully insufficient for 
monitoring space usage on BTRFS.  If you want to check things properly, 
you want to be looking at the data in /sys/fs/btrfs//allocation, 
more specifically checking the following percentages:


1. The sum of the values in /sys/fs/btrfs/relative to the sum total of the size of the block devices for the 
filesystem.
2. The ratio of /sys/fs/btrfs//allocation/data/bytes_u
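The message is cut off here in the archive; the following is only a
hedged sketch of the kind of sysfs-based check being described, with
paths assumed from the usual /sys/fs/btrfs layout rather than taken from
the truncated text:

    FSID=$(btrfs filesystem show /mnt | sed -n 's/.*uuid: //p')
    for type in data metadata system; do
        used=$(cat /sys/fs/btrfs/$FSID/allocation/$type/bytes_used)
        total=$(cat /sys/fs/btrfs/$FSID/allocation/$type/total_bytes)
        echo "$type: $used used of $total allocated"
    done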

Btrfs progs release 4.15.1

2018-02-16 Thread David Sterba
Hi,

btrfs-progs version 4.15.1 have been released. This is a minor update with
build fixes, cleanups and test enhancements.


Changes:
* build
  * fix build on musl
  * support asciidoctor for doc generation
* cleanups
  * sync some code with kernel
  * check: move code to own directory, split to more files
* tests
  * more build tests in travis
  * tests now pass with asan and ubsan
  * testsuite can be exported and used separately

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

Anand Jain (1):
  btrfs-progs: print-tree: fix INODE_ITEM sequence and flags

David Sterba (20):
  btrfs-progs: check: rename files after moving code
  btrfs-progs: docs: fix manual page title format
  btrfs-progs: build: add support for asciidoctor doc generator
  btrfs-progs: convert: fix build on musl
  btrfs-progs: mkfs: fix build on musl
  btrfs-progs: ci: change clone depth to 1
  btrfs-progs: ci: add support scripts for docker build
  btrfs-progs: ci: add dockerfile for a musl build test
  btrfs-progs: ci: enable musl build tests in docker
  btrfs-progs: ci: use helper script for default build commands
  btrfs-progs: ci: replace inline shell commands with scripts
  btrfs-progs: rework testsuite export
  btrfs-progs: tests: update README.md
  btrfs-progs: tests: unify test drivers, make ready for extenral testsuite
  btrfs-progs: test: update clean-test.sh after the TEST_TOP update
  btrfs-progs: tests: document exported testsuite
  btrfs-progs: reorder tests in make target
  btrfs-progs: let callers of btrfs_show_qgroups free the buffers
  btrfs-progs: update CHANGES for v4.15.1
  Btrfs progs v4.15.1

Gu Jinxiang (12):
  btrfs-progs: Use fs_info instead of root for BTRFS_LEAF_DATA_SIZE
  btrfs-progs: Use fs_info instead of root for BTRFS_NODEPTRS_PER_BLOCK
  btrfs-progs: Sync code with kernel for BTRFS_MAX_INLINE_DATA_SIZE
  btrfs-progs: Use fs_info instead of root for BTRFS_MAX_XATTR_SIZE
  btrfs-progs: do clean up for redundancy value assignment
  btrfs-progs: remove no longer used btrfs_alloc_extent
  btrfs-progs: Cleanup use of root in leaf_data_end
  btrfs-progs: add prerequisite mkfs.btrfs for test-cli
  btrfs-progs: add prerequisite btrfs-image for test-fuzz
  btrfs-progs: add prerequisite btrfs-convert for test-misc
  btrfs-progs: Add make testsuite command for export tests
  btrfs-progs: introduce TEST_TOP and INTERNAL_BIN for tests

Qu Wenruo (22):
  btrfs-progs: tests: chang tree-reloc-tree test number from 027 to 015
  btrfs-progs: Move cmds-check.c to check/main.c
  btrfs-progs: check: Move original mode definitions to check/original.h
  btrfs-progs: check: Move definitions of lowmem mode to check/lowmem.h
  btrfs-progs: check: Move node_refs structure to check/common.h
  btrfs-progs: check: Export check global variables to check/common.h
  btrfs-progs: check: Move imode_to_type function to check/common.h
  btrfs-progs: check: Move fs_root_objectid function to check/common.h
  btrfs-progs: check: Move count_csum_range function to check/common.c
  btrfs-progs: check: Move __create_inode_item function to check/common.c
  btrfs-progs: check: Move link_inode_to_lostfound function to common.c
  btrfs-progs: check: Move check_dev_size_alignment to check/common.c
  btrfs-progs: check: move reada_walk_down to check/common.c
  btrfs-progs: check: Move check_child_node to check/common.c
  btrfs-progs: check: Move reset_cached_block_groups to check/common.c
  btrfs-progs: check: Move lowmem check code to its own check/lowmem.[ch]
  btrfs-progs: check/lowmem: Cleanup unnecessary _v2 suffixes
  btrfs-progs: check: Cleanup all checkpatch error and warning
  btrfs-progs: fsck-tests: Cleanup the restored image for 028
  btrfs-progs: btrfs-progs: Fix read beyond boundary bug in 
build_roots_info_cache()
  btrfs-progs: mkfs/rootdir: Fix memory leak in traverse_directory()
  btrfs-progs: convert/ext2: Fix memory leak caused by handled ext2_filsys

Su Yue (1):
  btrfs-progs: tests common: remove meaningless colon in extract_image()



[GIT PULL] Btrfs fixes for 4.16-rc1

2018-02-16 Thread David Sterba
Hi,

we have a few assorted fixes; some of them show up during fstests, so
I gave them more testing. Please pull, thanks.


The following changes since commit 3acbcbfc8f06d4ade2aab2ebba0a2542a05ce90c:

  btrfs: drop devid as device_list_add() arg (2018-01-29 19:31:16 +0100)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-4.16-rc1-tag

for you to fetch changes up to fd649f10c3d21ee9d7542c609f29978bdf73ab94:

  btrfs: Fix use-after-free when cleaning up fs_devs with a single stale device 
(2018-02-05 17:15:14 +0100)


Filipe Manana (1):
  Btrfs: fix null pointer dereference when replacing missing device

Liu Bo (6):
  Btrfs: fix deadlock in run_delalloc_nocow
  Btrfs: fix crash due to not cleaning up tree log block's dirty bits
  Btrfs: fix extent state leak from tree log
  Btrfs: fix btrfs_evict_inode to handle abnormal inodes correctly
  Btrfs: fix use-after-free on root->orphan_block_rsv
  Btrfs: fix unexpected -EEXIST when creating new inode

Nikolay Borisov (2):
  btrfs: Ignore errors from btrfs_qgroup_trace_extent_post
  btrfs: Fix use-after-free when cleaning up fs_devs with a single stale 
device

Zygo Blaxell (1):
  btrfs: remove spurious WARN_ON(ref->count < 0) in find_parent_nodes

 fs/btrfs/backref.c | 11 ++-
 fs/btrfs/delayed-ref.c |  3 ++-
 fs/btrfs/extent-tree.c |  4 
 fs/btrfs/inode.c   | 41 ++---
 fs/btrfs/qgroup.c  |  9 +++--
 fs/btrfs/tree-log.c| 32 ++--
 fs/btrfs/volumes.c |  1 +
 7 files changed, 80 insertions(+), 21 deletions(-)


Re: Status of FST and mount times

2018-02-16 Thread Ellis H. Wilson III

On 02/16/2018 09:42 AM, Ellis H. Wilson III wrote:

On 02/16/2018 09:20 AM, Hans van Kranenburg wrote:

Well, imagine you have a big tree (an actual real life tree outside) and
you need to pick things (e.g. apples) which are hanging everywhere.

So, what you need to do is climb the tree, climb on a branch all the way
to the end where the first apple is... climb back, climb up a bit, go
onto the next branch to the end for the next apple... etc etc

The bigger the tree is, the longer it keeps you busy, because the apples
will be semi-evenly distributed around the full tree, and they're always
hanging at the end of the branch. The speed with which you can climb
around (random read disk access IO speed for btrfs, because your disk
cache is empty when first mounting) determines how quickly you're done.

So, yes.


Thanks Hans.  I will say multiple minutes (by the looks of things, I'll 
end up near to an hour for 60TB if this non-linear scaling continues) to 
mount a filesystem is undesirable, but I won't offer that criticism 
without thinking constructively for a moment:


Help me out by referencing the tree in question if you don't mind, so I 
can better understand the point of picking all these "apples" (I would 
guess for capacity reporting via df, but maybe there's more).


Typical disclaimer that I haven't yet grokked the various inner-workings 
of BTRFS, so this is quite possibly a terrible or unapproachable idea:


On umount, you must already have whatever metadata you were doing the 
tree walk on mount for in-memory (otherwise you would have been able to 
lazily do the treewalk after a quick mount).  Therefore, could we not 
stash this metadata at or associated with, say, the root of the 
subvolumes?  This way you can always determine on mount quickly if the 
cache is still valid (i.e., no situation like: remount with old btrfs, 
change stuff, umount with old btrfs, remount with new btrfs, pain).  I 
would guess generation would be sufficient to determine if the cached 
metadata is valid for the given root block.


This would scale with number of subvolumes (but not snapshots), and 
would be reasonably quick I think.


I see on 02/13 Qu commented regarding a similar idea, except proposed 
perhaps a richer version of my above suggestion (making block group into 
its own tree).  The concern was that it would be a lot of work since it 
modifies the on-disk format.  That's a reasonable worry.


I will get a new kernel, expand my array to around 36TB, and will 
generate a plot of mount times against extents going up to at least 30TB 
in increments of 0.5TB.  If this proves to reach absurd mount time 
delays (to be specific, anything above around 60s is untenable for our 
use), we may very well be sufficiently motivated to implement the above 
improvement and submit it for consideration.  Accordingly, if anybody 
has additional and/or more specific thoughts on the optimization, I am 
all ears.


Best,

ellis


Re: Status of FST and mount times

2018-02-16 Thread Ellis H. Wilson III

On 02/16/2018 09:20 AM, Hans van Kranenburg wrote:

Well, imagine you have a big tree (an actual real life tree outside) and
you need to pick things (e.g. apples) which are hanging everywhere.

So, what you need to do is climb the tree, climb on a branch all the way
to the end where the first apple is... climb back, climb up a bit, go
onto the next branch to the end for the next apple... etc etc

The bigger the tree is, the longer it keeps you busy, because the apples
will be semi-evenly distributed around the full tree, and they're always
hanging at the end of the branch. The speed with which you can climb
around (random read disk access IO speed for btrfs, because your disk
cache is empty when first mounting) determines how quickly you're done.

So, yes.


Thanks Hans.  I will say multiple minutes (by the looks of things, I'll 
end up near to an hour for 60TB if this non-linear scaling continues) to 
mount a filesystem is undesirable, but I won't offer that criticism 
without thinking constructively for a moment:


Help me out by referencing the tree in question if you don't mind, so I 
can better understand the point of picking all these "apples" (I would 
guess for capacity reporting via df, but maybe there's more).


Typical disclaimer that I haven't yet grokked the various inner-workings 
of BTRFS, so this is quite possibly a terrible or unapproachable idea:


On umount, you must already have whatever metadata you were doing the 
tree walk on mount for in-memory (otherwise you would have been able to 
lazily do the treewalk after a quick mount).  Therefore, could we not 
stash this metadata at or associated with, say, the root of the 
subvolumes?  This way you can always determine on mount quickly if the 
cache is still valid (i.e., no situation like: remount with old btrfs, 
change stuff, umount with old btrfs, remount with new btrfs, pain).  I 
would guess generation would be sufficient to determine if the cached 
metadata is valid for the given root block.


This would scale with number of subvolumes (but not snapshots), and 
would be reasonably quick I think.


Thoughts?

ellis


Re: Status of FST and mount times

2018-02-16 Thread Hans van Kranenburg
On 02/16/2018 03:12 PM, Ellis H. Wilson III wrote:
> On 02/15/2018 08:55 PM, Qu Wenruo wrote:
>> On 2018-02-16 00:30, Ellis H. Wilson III wrote:
>>> Very helpful information.  Thank you Qu and Hans!
>>>
>>> I have about 1.7TB of homedir data newly rsync'd data on a single
>>> enterprise 7200rpm HDD and the following output for btrfs-debug:
>>>
>>> extent tree key (EXTENT_TREE ROOT_ITEM 0) 543384862720 level 2
>>> total bytes 6001175126016
>>> bytes used 1832557875200
>>>
>>> Hans' (very cool) tool reports:
>>> ROOT_TREE 624.00KiB 0(    38) 1( 1)
>>> EXTENT_TREE   327.31MiB 0( 20881) 1(    66) 2( 1)
>>
>> Extent tree is not so large, a little unexpected to see such slow mount.
>>
>> BTW, how many chunks do you have?
>>
>> It could be checked by:
>>
>> # btrfs-debug-tree -t chunk  | grep CHUNK_ITEM | wc -l
> 
> Since yesterday I've doubled the size by copying the homedir dataset in
> again.  Here are new stats:
> 
> extent tree key (EXTENT_TREE ROOT_ITEM 0) 385990656 level 2
> total bytes 6001175126016
> bytes used 3663525969920
> 
> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
> 3454
> 
> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
> ROOT_TREE   1.14MiB 0(    72) 1( 1)
> EXTENT_TREE   644.27MiB 0( 41101) 1(   131) 2( 1)
> CHUNK_TREE    384.00KiB 0(    23) 1( 1)
> DEV_TREE  272.00KiB 0(    16) 1( 1)
> FS_TREE    11.55GiB 0(754442) 1(  2179) 2( 5) 3( 2)
> CSUM_TREE   3.50GiB 0(228593) 1(   791) 2( 2) 3( 1)
> QUOTA_TREE    0.00B
> UUID_TREE  16.00KiB 0( 1)
> FREE_SPACE_TREE   0.00B
> DATA_RELOC_TREE    16.00KiB 0( 1)
> 
> The old mean mount time was 4.319s.  It now takes 11.537s for the
> doubled dataset.  Again please realize this is on an old version of
> BTRFS (4.5.5), so perhaps newer ones will perform better, but I'd still
> like to understand this delay more.  Should I expect this to scale in
> this way all the way up to my proposed 60-80TB filesystem so long as the
> file size distribution stays roughly similar?  That would definitely be
> in terms of multiple minutes at that point.

Well, imagine you have a big tree (an actual real life tree outside) and
you need to pick things (e.g. apples) which are hanging everywhere.

So, what you need to do is climb the tree, climb on a branch all the way
to the end where the first apple is... climb back, climb up a bit, go
onto the next branch to the end for the next apple... etc etc

The bigger the tree is, the longer it keeps you busy, because the apples
will be semi-evenly distributed around the full tree, and they're always
hanging at the end of the branch. The speed with which you can climb
around (random read disk access IO speed for btrfs, because your disk
cache is empty when first mounting) determines how quickly you're done.

So, yes.

>>> Taking 100 snapshots (no changes between snapshots however) of the above
>>> subvolume doesn't appear to impact mount/umount time.
>>
>> 100 unmodified snapshots won't affect mount time.
>>
>> It needs new extents, which can be created by overwriting extents in
>> snapshots.
>> So it won't really cause much difference if all these snapshots are all
>> unmodified.
> 
> Good to know, thanks!
> 
>>> Snapshot creation
>>> and deletion both operate at between 0.25s to 0.5s.
>>
>> IIRC snapshot deletion is delayed, so the real work doesn't happen when
>> "btrfs sub del" returns.
> 
> I was using btrfs sub del -C for the deletions, so I believe (if that
> command truly waits for the subvolume to be utterly gone) it captures
> the entirety of the snapshot.
> 
> Best,
> 
> ellis


-- 
Hans van Kranenburg


Re: Status of FST and mount times

2018-02-16 Thread Ellis H. Wilson III

On 02/15/2018 08:55 PM, Qu Wenruo wrote:

On 2018-02-16 00:30, Ellis H. Wilson III wrote:

Very helpful information.  Thank you Qu and Hans!

I have about 1.7TB of homedir data newly rsync'd data on a single
enterprise 7200rpm HDD and the following output for btrfs-debug:

extent tree key (EXTENT_TREE ROOT_ITEM 0) 543384862720 level 2
total bytes 6001175126016
bytes used 1832557875200

Hans' (very cool) tool reports:
ROOT_TREE 624.00KiB 0(    38) 1( 1)
EXTENT_TREE   327.31MiB 0( 20881) 1(    66) 2( 1)


Extent tree is not so large, a little unexpected to see such slow mount.

BTW, how many chunks do you have?

It could be checked by:

# btrfs-debug-tree -t chunk  | grep CHUNK_ITEM | wc -l


Since yesterday I've doubled the size by copying the homedir dataset in 
again.  Here are new stats:


extent tree key (EXTENT_TREE ROOT_ITEM 0) 385990656 level 2
total bytes 6001175126016
bytes used 3663525969920

$ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
3454

$ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
ROOT_TREE   1.14MiB 0(72) 1( 1)
EXTENT_TREE   644.27MiB 0( 41101) 1(   131) 2( 1)
CHUNK_TREE384.00KiB 0(23) 1( 1)
DEV_TREE  272.00KiB 0(16) 1( 1)
FS_TREE11.55GiB 0(754442) 1(  2179) 2( 5) 3( 2)
CSUM_TREE   3.50GiB 0(228593) 1(   791) 2( 2) 3( 1)
QUOTA_TREE0.00B
UUID_TREE  16.00KiB 0( 1)
FREE_SPACE_TREE   0.00B
DATA_RELOC_TREE16.00KiB 0( 1)

The old mean mount time was 4.319s.  It now takes 11.537s for the 
doubled dataset.  Again please realize this is on an old version of 
BTRFS (4.5.5), so perhaps newer ones will perform better, but I'd still 
like to understand this delay more.  Should I expect this to scale in 
this way all the way up to my proposed 60-80TB filesystem so long as the 
file size distribution stays roughly similar?  That would definitely be 
in terms of multiple minutes at that point.



Taking 100 snapshots (no changes between snapshots however) of the above
subvolume doesn't appear to impact mount/umount time.


100 unmodified snapshots won't affect mount time.

It needs new extents, which can be created by overwriting extents in
snapshots.
So it won't really cause much difference if all these snapshots are all
unmodified.


Good to know, thanks!


Snapshot creation
and deletion both operate at between 0.25s to 0.5s.


IIRC snapshot deletion is delayed, so the real work doesn't happen when
"btrfs sub del" returns.


I was using btrfs sub del -C for the deletions, so I believe (if that 
command truly waits for the subvolume to be utterly gone) it captures 
the entirety of the snapshot.


Best,

ellis


Re: btrfs send/receive in reverse possible?

2018-02-16 Thread Hugo Mills
On Fri, Feb 16, 2018 at 10:43:54AM +0800, Sampson Fung wrote:
> I have snapshot A on Drive_A.
> I send snapshot A to an empty Drive_B.  Then keep Drive_A as backup.
> I use Drive_B as active.
> I create new snapshot B on Drive_B.
> 
> Can I use btrfs send/receive to send incremental differences back to Drive_A?
> What is the correct way of doing this?

   You can't do it with the existing tools -- it needs a change to the
send stream format. Here's a write-up of what's going on behind the
scenes, and what needs to change:

https://www.spinics.net/lists/linux-btrfs/msg44089.html

   Hugo.

-- 
Hugo Mills | I can't foretell the future, I just work there.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |The Doctor

