Re: slow btrfs with a single kworker process using 100% CPU

2017-08-16 Thread Stefan Priebe - Profihost AG
I've backported the free space cache tree to my kernel, and hopefully all
fixes related to it.

The first mount with clear_cache,space_cache=v2 took around 5 hours.
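
For reference, the conversion only needs clear_cache on the first mount; a
rough sketch (device and mount point are placeholders):

# one-time conversion to the v2 free space cache
mount -o clear_cache,space_cache=v2 /dev/sdX /mnt
# subsequent mounts only need
mount -o space_cache=v2 /dev/sdX /mnt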

Currently I do not see any kworker at 100% CPU, but I don't see much load
at all.

btrfs-transaction takes around 2-4% CPU, together with a kworker process
and some mdadm processes at 2-3%. I/O wait is at 3%.

That's it. It does not do much more. Writing a file does not work.

Greets,
Stefan

Am 16.08.2017 um 14:29 schrieb Konstantin V. Gavrilenko:
> Roman, initially I had a single process occupying 100% CPU; sysrq indicated
> it was in "btrfs_find_space_for_alloc", but that's when I used the
> autodefrag, compress, compress-force and commit=10 mount flags and
> space_cache was v1 by default.
> When I switched to "relatime,compress-force=zlib,space_cache=v2" the 100% CPU
> disappeared, but the shite performance remained.
> 
> 
> As to the chunk size, there is no information in the article about the type
> of data that was used, while in our case we are pretty certain about the
> compressed block size (32-128k). I am currently leaning towards 32k, as it
> might be ideal in a situation where we have a 5-disk raid5 array.
> 
> In theory
> 1. The minimum compressed write (32k) would fill the chunk on a single disk, 
> thus the IO cost of the operation would be 2 reads (original chunk + original 
> parity)  and 2 writes (new chunk + new parity)
> 
> 2. The maximum compressed write (128k) would require the update of 1 chunk on 
> each of the 4 data disks + 1 parity  write 
> 
> 
> 
> Stefan what mount flags do you use?
> 
> kos
> 
> 
> 
> - Original Message -
> From: "Roman Mamedov" 
> To: "Konstantin V. Gavrilenko" 
> Cc: "Stefan Priebe - Profihost AG" , "Marat Khalili" 
> , linux-btrfs@vger.kernel.org, "Peter Grandi" 
> 
> Sent: Wednesday, 16 August, 2017 2:00:03 PM
> Subject: Re: slow btrfs with a single kworker process using 100% CPU
> 
> On Wed, 16 Aug 2017 12:48:42 +0100 (BST)
> "Konstantin V. Gavrilenko"  wrote:
> 
>> I believe the chunk size of 512kb is even worse for performance than the 
>> default settings on my HW RAID of  256kb.
> 
> It might be, but that does not explain the original problem reported at all.
> If mdraid performance would be the bottleneck, you would see high iowait,
> possibly some CPU load from the mdX_raidY threads. But not a single Btrfs
> thread pegging into 100% CPU.
> 
>> So now I am moving the data from the array and will be rebuilding it with 64
>> or 32 chunk size and checking the performance.
> 
> 64K is the sweet spot for RAID5/6:
> http://louwrentius.com/linux-raid-level-and-chunk-size-the-benchmarks.html
> 
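
For reference, rebuilding the md array with a smaller chunk, as discussed in
the quoted thread above, would look roughly like this (a sketch; device names
and the chunk value are placeholders, and --create destroys the existing
contents):

# stop the old array once the data has been copied off
mdadm --stop /dev/md0
# recreate with a 32 KiB chunk (mdadm --chunk takes KiB)
mdadm --create /dev/md0 --level=5 --raid-devices=5 --chunk=32 /dev/sd[bcdef]1
mkfs.btrfs -f /dev/md0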


Re: Raid0 rescue

2017-08-16 Thread Chris Murphy
OK, this time also -mraid1 -draid0, and I filled it with some more
metadata, but then formatted NTFS, then ext4, then xfs, and then wiped
those signatures. Brutal, especially ext4, which writes a lot more
stuff and zeros a bunch of areas too.



# btrfs rescue super -v /dev/mapper/vg-2
All Devices:
Device: id = 1, name = /dev/mapper/vg-1
Device: id = 2, name = /dev/mapper/vg-2

Before Recovering:
[All good supers]:
device name = /dev/mapper/vg-1
superblock bytenr = 65536

device name = /dev/mapper/vg-1
superblock bytenr = 67108864

device name = /dev/mapper/vg-2
superblock bytenr = 67108864

[All bad supers]:

All supers are valid, no need to recover


Obviously vg-2 is missing its first superblock and this tool is not
complaining about it at all. Normal mount does not work (generic open
ctree error).
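
A quick way to confirm that only the first copy is gone is to dump both
copies directly (a sketch; dump-super takes the copy number with -s):

# first copy at bytenr 65536 -- expected to be damaged here
btrfs inspect-internal dump-super -s 0 /dev/mapper/vg-2
# second copy at bytenr 67108864 -- expected to be intact
btrfs inspect-internal dump-super -s 1 /dev/mapper/vg-2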

# btrfs check /dev/mapper/vg-1
warning, device 2 is missing


Umm, no. But yeah, because the first super is missing, the kernel isn't
considering it a Btrfs volume at all. There are also other errors with
the check, due to metadata being stepped on, I'm guessing. But we need
a way to fix an obviously stepped-on first super, and I don't like the
idea of using btrfs check for that anyway. All I need is the first
copy fixed up, and then just do a normal mount. But let's see how
messy this gets, pointing check at the damaged device and the known
good 2nd super (-s 0 is the first super):

# btrfs check -s 1 /dev/mapper/vg-2
using SB copy 1, bytenr 67108864
...skipping checksum errors etc

OK so I guess I have to try --repair.

# btrfs check --repair -s1 /dev/mapper/vg-2
enabling repair mode
using SB copy 1, bytenr 67108864
...skipping checksum errors etc.

]# btrfs rescue super -v /dev/mapper/vg-1
All Devices:
Device: id = 1, name = /dev/mapper/vg-1

Before Recovering:
[All good supers]:
device name = /dev/mapper/vg-1
superblock bytenr = 67108864

[All bad supers]:
device name = /dev/mapper/vg-1
superblock bytenr = 65536


That is fucked. It broke the previously good super on vg-1?

[root@f26wnuc ~]# btrfs rescue super -v /dev/mapper/vg-2
All Devices:
Device: id = 1, name = /dev/mapper/vg-1
Device: id = 2, name = /dev/mapper/vg-2

Before Recovering:
[All good supers]:
device name = /dev/mapper/vg-1
superblock bytenr = 67108864

device name = /dev/mapper/vg-2
superblock bytenr = 67108864

[All bad supers]:
device name = /dev/mapper/vg-1
superblock bytenr = 65536


Worse, it did not actually fix the bad/missing superblock on vg-2
either. Let's answer Y to its questions...

[root@f26wnuc ~]# btrfs rescue super -v /dev/mapper/vg-2
All Devices:
Device: id = 1, name = /dev/mapper/vg-1
Device: id = 2, name = /dev/mapper/vg-2

Before Recovering:
[All good supers]:
device name = /dev/mapper/vg-1
superblock bytenr = 67108864

device name = /dev/mapper/vg-2
superblock bytenr = 67108864

[All bad supers]:
device name = /dev/mapper/vg-1
superblock bytenr = 65536


Make sure this is a btrfs disk otherwise the tool will destroy other
fs, Are you sure? [y/N]: y
checksum verify failed on 20971520 found 348F13AD wanted 8100
checksum verify failed on 20971520 found 348F13AD wanted 8100
Recovered bad superblocks successful
[root@f26wnuc ~]# btrfs rescue super -v /dev/mapper/vg-2
All Devices:
Device: id = 1, name = /dev/mapper/vg-1
Device: id = 2, name = /dev/mapper/vg-2

Before Recovering:
[All good supers]:
device name = /dev/mapper/vg-1
superblock bytenr = 65536

device name = /dev/mapper/vg-1
superblock bytenr = 67108864

device name = /dev/mapper/vg-2
superblock bytenr = 65536

device name = /dev/mapper/vg-2
superblock bytenr = 67108864

[All bad supers]:

All supers are valid, no need to recover



OK! That's better! Mount it.

dmesg
https://pastebin.com/6kVzYLfZ

Pretty boring, bad tree block, and then some read errors corrected. I
get more similarly formatted errors, different numbers... but no
failures. Scrub it...

# btrfs scrub status /mnt/yo
scrub status for b2ee5125-cf56-493a-b094-81fe8330115a
scrub started at Wed Aug 16 23:08:54 2017, running for 00:00:30
total bytes scrubbed: 1.19GiB with 5 errors
error details: csum=5
corrected errors: 5, uncorrectable errors: 0, unverified errors: 0
#

There's almost no data on this file system; it's mostly metadata, which
is raid1, so that's why the data survives. But even in the previous example
where some data is clobbered, the data loss is limited. The file
system itself survives, and can continue to be used. The 'btrfs rescue
super' function could be better, and it looks like there's a bug in
btrfs check's superblock repair.



Chris Murphy

Re: Raid0 rescue

2017-08-16 Thread Chris Murphy
I'm testing explicitly for this case:


# lvs
  LV         VG Attr       LSize   Pool       Origin Data%  Meta%  Move Log Cpy%Sync Convert
  1          vg Vwi-a-tz--  10.00g thintastic        0.00
  2          vg Vwi-a-tz--  10.00g thintastic        0.00
  thintastic vg twi-aotz-- 100.00g                   0.00   0.38
# mkfs.btrfs -f -mraid1 -draid0 /dev/mapper/vg-1 /dev/mapper/vg-2

...
Mount and copy some varied data to the volume; most files are less
than 64KiB, and some are even less than 2KiB. So there will be a mix
of files that will definitely get nerfed by damaged stripes, many that
will survive because they live on the drive that was not accidentally
formatted, and some that are stored inline. But for sure the file
system *ought* to survive.
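
For anyone reproducing this, the thin LVs in the lvs output above could be
created roughly like this (a sketch; sizes and names just mirror the listing):

lvcreate --size 100g --thinpool thintastic vg
lvcreate --virtualsize 10g --thin --name 1 vg/thintastic
lvcreate --virtualsize 10g --thin --name 2 vg/thintastic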

umount and then format NTFS

# mkfs.ntfs -f /dev/mapper/vg-2

Now get this bit of curiousness:

# wipefs /dev/mapper/vg-2
offset   type

0x1fe    dos   [partition table]

0x10040  btrfs   [filesystem]
 UUID:  bebaedc5-96a1-4163-9527-8254ecae817e

0x3  ntfs   [filesystem]
 UUID:  67AD98CF36096C70


So the two supers can coexist. That is invariably going to cause
kernel code confusion. blkid will consider it neither NTFS nor Btrfs.
So it's sort of in a zombie situation. Get this:

# btrfs rescue super -v /dev/mapper/vg-1
All Devices:
Device: id = 1, name = /dev/mapper/vg-1

Before Recovering:
[All good supers]:
device name = /dev/mapper/vg-1
superblock bytenr = 65536

device name = /dev/mapper/vg-1
superblock bytenr = 67108864

[All bad supers]:

All supers are valid, no need to recover

# btrfs rescue super -v /dev/mapper/vg-2
All Devices:
Device: id = 1, name = /dev/mapper/vg-1
Device: id = 2, name = /dev/mapper/vg-2

Before Recovering:
[All good supers]:
device name = /dev/mapper/vg-1
superblock bytenr = 65536

device name = /dev/mapper/vg-1
superblock bytenr = 67108864

device name = /dev/mapper/vg-2
superblock bytenr = 65536

device name = /dev/mapper/vg-2
superblock bytenr = 67108864

[All bad supers]:

All supers are valid, no need to recover
#


So the first command sees the supers only on vg-1; it doesn't go
looking at vg-2 at all, presumably because kernel code is ignoring that
device due to the two different file system supers (?). But the second
command forces it to look at vg-2, and it says the Btrfs supers are
fine, and then also auto-discovers the vg-1 device.

OK so I'm just going to cheat at this point and wipefs just the NTFS
magic so this device is now seen as Btrfs.

# wipefs -n -o 0x3 /dev/mapper/vg-2
/dev/mapper/vg-2: 8 bytes were erased at offset 0x0003 (ntfs): 4e
54 46 53 20 20 20 20
# wipefs -o 0x3 /dev/mapper/vg-2
/dev/mapper/vg-2: 8 bytes were erased at offset 0x0003 (ntfs): 4e
54 46 53 20 20 20 20
# partprobe
# blkid
...
/dev/mapper/vg-1: UUID="bebaedc5-96a1-4163-9527-8254ecae817e"
UUID_SUB="ef9dbcf0-bb0b-4faf-a7b4-02f1c92631e4" TYPE="btrfs"
/dev/mapper/vg-2: UUID="bebaedc5-96a1-4163-9527-8254ecae817e"
UUID_SUB="490504ea-4ee4-47ad-91a7-58b6ccf4be8e" TYPE="btrfs"
PTTYPE="dos"
...

OK good. Except, what is PTTYPE? Ohh, that's the first entry in the
wipefs output way at the top, I bet.


[root@f26wnuc ~]# wipefs -o 0x1fe /dev/mapper/vg-2
/dev/mapper/vg-2: 2 bytes were erased at offset 0x01fe (dos): 55 aa
# blkid
...
/dev/mapper/vg-1: UUID="bebaedc5-96a1-4163-9527-8254ecae817e"
UUID_SUB="ef9dbcf0-bb0b-4faf-a7b4-02f1c92631e4" TYPE="btrfs"
/dev/mapper/vg-2: UUID="bebaedc5-96a1-4163-9527-8254ecae817e"
UUID_SUB="490504ea-4ee4-47ad-91a7-58b6ccf4be8e" TYPE="btrfs"
...

Yep!

OK let's just try a normal mount.

It mounts! No errors at all. List all the files on the file system
(about 700). No errors.

Let's cat a few to /dev/null manually; no errors. OK, I'm bored.
Let's just scrub it.


[root@f26wnuc yo]# btrfs scrub status /mnt/yo/
scrub status for bebaedc5-96a1-4163-9527-8254ecae817e
scrub started at Wed Aug 16 19:40:26 2017, running for 00:00:10
total bytes scrubbed: 529.62MiB with 181 errors
error details: csum=181
corrected errors: 0, uncorrectable errors: 181, unverified errors: 0

One file is affected, the large ~1+GiB file.

[77898.116429] BTRFS warning (device dm-6): checksum error at logical
1621229568 on dev /dev/mapper/vg-2, sector 2621568, root 5, inode 257,
offset 517341184, length 4096, links 1 (path:
Fedora-Workstation-Live-x86_64-Rawhide-20170814.n.0.iso)
[77898.116463] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs:
wr 0, rd 0, flush 0, corrupt 1, gen 0
[77898.116478] BTRFS error (device dm-6): unable to fixup (regular)
error at logical 1621229568 on dev /dev/mapper/vg-2

There are about 9 more of those kinds of messages. Anyway, that looks to
me like the file itself is nerfed by the NTFS format, but the file
system itself wasn't hit. There are no fixups 

Re: btrfs fi du -s gives Inappropriate ioctl for device

2017-08-16 Thread Chris Murphy
On Wed, Aug 16, 2017 at 3:27 AM, Piotr Szymaniak  wrote:
> On Mon, Aug 14, 2017 at 05:40:30PM -0600, Chris Murphy wrote:
>> On Mon, Aug 14, 2017 at 4:57 PM, Piotr Szymaniak  wrote:
>>
>> >
>> > and... some issues:
>> > ~ # btrfs fi du -s /mnt/red/\@backup/
>> >  Total   Exclusive  Set shared  Filename
>> > ERROR: cannot check space of '/mnt/red/@backup/': Inappropriate ioctl for 
>> > device
>>
>>
>> It's a bug, but I don't know if any devs are working on a fix yet.
>>
>> The problem is that the subvolume being snapshot, contains subvolumes.
>> The resulting snapshot, contains an empty directory in place of the
>> nested subvolume(s), and that is the cause for the error.
>
> Ok, but why, on the same btrfs, does it work on some subvols with subvols and
> not work on other subvols with subvols? If it does not work - OK, if it works -
> OK, but that seems a bit... random?
>
> ~ # btrfs fi du -s /mnt/red/\@backup/ 
> /mnt/red/\@backup/.snapshot/monthly_2017-08-01_05\:30\:01/ /mnt/red/\@svn/ 
> /mnt/red/\@svn/.snapshot/weekly_2017-08-05_04\:20\:02/
>  Total   Exclusive  Set shared  Filename
> ERROR: cannot check space of '/mnt/red/@backup/': Inappropriate ioctl for 
> device
> ERROR: cannot check space of 
> '/mnt/red/@backup/.snapshot/monthly_2017-08-01_05:30:01/': Inappropriate 
> ioctl for device
>   52.23GiB10.57MiB 4.13GiB  /mnt/red/@svn/
>4.35GiB 1.03MiB 4.12GiB  
> /mnt/red/@svn/.snapshot/weekly_2017-08-05_04:20:02/

I don't know.

 It might be that there's something inconsistent about the inode for
the missing/ghost subvolume placeholder directory at snapshot creation
time?
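
One way to check that theory (a sketch, assuming the placeholders are the
usual empty stand-in directories that btrfs creates with inode number 2):

# list empty stand-in directories left in place of nested subvolumes
find /mnt/red/@backup/ -type d -inum 2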

-- 
Chris Murphy


Re: qcow2 images make scrub believe the filesystem is corrupted.

2017-08-16 Thread Chris Murphy
>>>
On Tue, Aug 15, 2017 at 7:12 PM, Paulo Dias  wrote:
Device Model: Samsung SSD 850 EVO M.2 500GB
Serial Number:S33DNX0H812686V
LU WWN Device Id: 5 002538 d4130d027
Firmware Version: EMT21B6Q
>>>

Unfortunately there are no firmware updates listed with Samsung for this model.
It's worth filing a bug report with them, and then trying not to use
either fstrim or discard for a while to see if the problem reoccurs.
If not, that suggests a trim bug in the firmware. If it does still
occur, it could just be defective hardware.

Does smartctl -x reveal any issues?
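
A rough sketch of what that test would look like (assuming the periodic trim
comes from the systemd fstrim.timer, and with the device name as a
placeholder; adjust if a cron job is used instead):

# see whether periodic trim is enabled, and stop it for the test period
systemctl status fstrim.timer
systemctl disable --now fstrim.timer
# also make sure 'discard' is not in the mount options for the filesystem
grep discard /proc/mounts
# full extended SMART report
smartctl -x /dev/sdX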



-- 
Chris Murphy


Re: slow btrfs with a single kworker process using 100% CPU

2017-08-16 Thread Peter Grandi
>>> I've one system where a single kworker process is using 100%
>>> CPU sometimes a second process comes up with 100% CPU
>>> [btrfs-transacti]. [ ... ]

>> [ ... ]1413 Snapshots. I'm deleting 50 of them every night. But
>> btrfs-cleaner process isn't running / consuming CPU currently.

Reminder that:

https://btrfs.wiki.kernel.org/index.php/Gotchas#Having_many_subvolumes_can_be_very_slow

"The cost of several operations, including currently balance, device
delete and fs resize, is proportional to the number of subvolumes,
including snapshots, and (slightly super-linearly) the number of
extents in the subvolumes."

>> [ ... ] btrfs is mounted with compress-force=zlib

> Could be similar issue as what I had recently, with the RAID5 and
> 256kb chunk size. please provide more information about your RAID
> setup.

It is similar, but updating compressed files in place can create
this situation even without RAID5 RMW:

https://btrfs.wiki.kernel.org/index.php/Gotchas#Fragmentation

"Files with a lot of random writes can become heavily fragmented
(1+ extents) causing thrashing on HDDs and excessive multi-second
spikes of CPU load on systems with an SSD or large amount of RAM. ...
Symptoms include btrfs-transacti and btrfs-endio-wri taking up a lot
of CPU time (in spikes, possibly triggered by syncs)."
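
Two quick checks along those lines (a sketch; paths are placeholders): count
the snapshots/subvolumes, and look at the extent count of a suspect
randomly-rewritten file:

btrfs subvolume list /srv/btrfs | wc -l
# filefrag prints the extent count; compressed extents inflate it somewhat
filefrag /srv/btrfs/vm-images/disk.qcow2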


Re: BTRFS error (device sda4): failed to read chunk tree: -5

2017-08-16 Thread Chris Murphy
On Wed, Aug 16, 2017 at 4:25 PM, Zirconium Hacker  wrote:
> Hi,
> This is my first time using a mailing list, and I hope I'm doing this right.
>
> $ uname -a
> Linux thinkpad 4.12.6-1-ARCH #1 SMP PREEMPT Sat Aug 12 09:16:22 CEST
> 2017 x86_64 GNU/Linux
> $ btrfs --version
> btrfs-progs v4.12
> $ sudo mount -o ro,recovery /dev/sda4 /mnt
> mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sda4,
> missing codepage or helper program, or other error.
> $ dmesg | tail
> 
> [ 1289.087439] BTRFS warning (device sda4): 'recovery' is deprecated,
> use 'usebackuproot' instead
> [ 1289.087440] BTRFS info (device sda4): trying to use backup root at mount 
> time
> [ 1289.087442] BTRFS info (device sda4): disk space caching is enabled
> [ 1289.097757] BTRFS error (device sda4): failed to read chunk tree: -5
> [ 1289.135222] BTRFS error (device sda4): open_ctree failed
> 
> $ sudo btrfs check /dev/sda4
> bytenr mismatch, want=61809344512, have=0
> Couldn't read tree root
> ERROR: cannot open file system
> $ sudo btrfs restore - -D /dev/sda4 .
> bytenr mismatch, want=61809344512, have=0
> Couldn't read tree root
> Could not open root, trying backup super
> bytenr mismatch, want=61809344512, have=0
> Couldn't read tree root
> Could not open root, trying backup super
> ERROR: superblock bytenr 274877906944 is larger than device size 58056507392
> Could not open root, trying backup super
>

What happened before this?

What do you get for:

btrfs rescue super -v /dev/sda4




-- 
Chris Murphy


BTRFS error (device sda4): failed to read chunk tree: -5

2017-08-16 Thread Zirconium Hacker
Hi,
This is my first time using a mailing list, and I hope I'm doing this right.

$ uname -a
Linux thinkpad 4.12.6-1-ARCH #1 SMP PREEMPT Sat Aug 12 09:16:22 CEST
2017 x86_64 GNU/Linux
$ btrfs --version
btrfs-progs v4.12
$ sudo mount -o ro,recovery /dev/sda4 /mnt
mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sda4,
missing codepage or helper program, or other error.
$ dmesg | tail

[ 1289.087439] BTRFS warning (device sda4): 'recovery' is deprecated,
use 'usebackuproot' instead
[ 1289.087440] BTRFS info (device sda4): trying to use backup root at mount time
[ 1289.087442] BTRFS info (device sda4): disk space caching is enabled
[ 1289.097757] BTRFS error (device sda4): failed to read chunk tree: -5
[ 1289.135222] BTRFS error (device sda4): open_ctree failed

$ sudo btrfs check /dev/sda4
bytenr mismatch, want=61809344512, have=0
Couldn't read tree root
ERROR: cannot open file system
$ sudo btrfs restore - -D /dev/sda4 .
bytenr mismatch, want=61809344512, have=0
Couldn't read tree root
Could not open root, trying backup super
bytenr mismatch, want=61809344512, have=0
Couldn't read tree root
Could not open root, trying backup super
ERROR: superblock bytenr 274877906944 is larger than device size 58056507392
Could not open root, trying backup super

A script called btrfs-undelete
(https://gist.github.com/Changaco/45f8d171027ea2655d74) also fails
with similar errors.

I'd like to recover at least one folder, my desktop -- everything else
was backed up.
I'm using PhotoRec to try and recover some files, but I'd like a
better solution that keeps filenames and at least some folder
structure.
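
Since btrfs restore already almost works here, one more thing worth trying is
pointing it at an alternate tree root and restricting it to the one folder
you care about (a sketch; USER and the output directory are placeholders, and
the regex follows the form documented for --path-regex):

# list candidate tree root locations
btrfs-find-root /dev/sda4
# dry run (-D) first, restoring only the Desktop folder
btrfs restore -D --path-regex '^/(|home(|/USER(|/Desktop(|/.*))))$' /dev/sda4 /mnt/recovery
# if the default root is unreadable, retry with a bytenr printed by btrfs-find-root
btrfs restore -t <bytenr> --path-regex '^/(|home(|/USER(|/Desktop(|/.*))))$' /dev/sda4 /mnt/recovery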

Thanks in advance!


Re: [PATCH] btrfs-progs: fix cross-compile build

2017-08-16 Thread Eric Sandeen
On 8/15/17 7:17 PM, Qu Wenruo wrote:
> 
> 
> On 2017年08月16日 02:11, Eric Sandeen wrote:
>> The mktables binary needs to be built with the host
>> compiler at build time, not the target compiler, because
>> it runs at build time to generate the raid tables.
>>
>> Copy auto-fu from xfsprogs and modify the Makefile to
>> accommodate this.
>>
>> Reported-by: Hallo32 
>> Signed-off-by: Eric Sandeen 
> 
> Looks better than my previous patch.
> With @BUILD_CFLAGS support and better BUILD_CC/CFLAGS detection for the native 
> build environment.
> 
> Reviewed-by: Qu Wenruo 

Thanks - and sorry for missing your earlier patch, I didn't mean to
ignore it.  :)  I just missed it.

-Eric


[PATCH 4/4 v4] btrfs: add compression trace points

2017-08-16 Thread Anand Jain
From: Anand Jain 

This patch adds compression and decompression trace points for the
purpose of debugging.

Signed-off-by: Anand Jain 
Reviewed-by: Nikolay Borisov 
---
v4:
 Accepts David's review comments
 . changes from unsigned long to u64.
 . format changes
v3:
 . Rename to simple names, without worrying about being
   compatible with future naming.
 . The type was not working; fixed it.
v2:
 . Use better naming.
   (If transform is not good enough I have run out of ideas, pls suggest).
 . To be applied on top of
   git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
   (tested without namelen check patch set)

 fs/btrfs/compression.c   | 11 +++
 include/trace/events/btrfs.h | 36 
 2 files changed, 47 insertions(+)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index d2ef9ac2a630..4a652f67ee87 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -895,6 +895,10 @@ int btrfs_compress_pages(int type, struct address_space *mapping,
  start, pages,
  out_pages,
  total_in, total_out);
+
+   trace_btrfs_compress(1, 1, mapping->host, type, *total_in,
+   *total_out, start, ret);
+
free_workspace(type, workspace);
return ret;
 }
@@ -921,6 +925,10 @@ static int btrfs_decompress_bio(struct compressed_bio *cb)
 
workspace = find_workspace(type);
ret = btrfs_compress_op[type - 1]->decompress_bio(workspace, cb);
+
+   trace_btrfs_compress(0, 0, cb->inode, type,
+   cb->compressed_len, cb->len, cb->start, ret);
+
free_workspace(type, workspace);
 
return ret;
@@ -943,6 +951,9 @@ int btrfs_decompress(int type, unsigned char *data_in, struct page *dest_page,
  dest_page, start_byte,
  srclen, destlen);
 
+   trace_btrfs_compress(0, 1, dest_page->mapping->host,
+   type, srclen, destlen, start_byte, ret);
+
free_workspace(type, workspace);
return ret;
 }
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index d412c49f5a6a..d0c0bd4fe3c2 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1629,6 +1629,42 @@ TRACE_EVENT(qgroup_meta_reserve,
show_root_type(__entry->refroot), __entry->diff)
 );
 
+TRACE_EVENT(btrfs_compress,
+
+   TP_PROTO(int compress, int page, struct inode *inode, unsigned int type,
+   u64 len_before, u64 len_after, u64 start, int ret),
+
+   TP_ARGS(compress, page, inode, type, len_before, len_after, start, ret),
+
+   TP_STRUCT__entry_btrfs(
+   __field(int,compress)
+   __field(int,page)
+   __field(u64,i_ino)
+   __field(unsigned int,   type)
+   __field(u64,len_before)
+   __field(u64,len_after)
+   __field(u64,start)
+   __field(int,ret)
+   ),
+
+   TP_fast_assign_btrfs(btrfs_sb(inode->i_sb),
+   __entry->compress   = compress;
+   __entry->page   = page;
+   __entry->i_ino  = inode->i_ino;
+   __entry->type   = type;
+   __entry->len_before = len_before;
+   __entry->len_after  = len_after;
+   __entry->start  = start;
+   __entry->ret= ret;
+   ),
+
+   TP_printk_btrfs("%s %s ino=%llu type=%s len_before=%llu len_after=%llu "\
+   "start=%llu ret=%d",
+   __entry->compress ? "compress" : "decompress",
+   __entry->page ? "page" : "bio", __entry->i_ino,
+   show_compress_type(__entry->type), __entry->len_before,
+   __entry->len_after, __entry->start, __entry->ret)
+);
 #endif /* _TRACE_BTRFS_H */
 
 /* This part must be outside protection */
-- 
2.7.0
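
For reference, once applied, the new event can be captured through ftrace
like any other btrfs trace point (a sketch; the event name follows the
TRACE_EVENT(btrfs_compress) definition above):

cd /sys/kernel/debug/tracing
echo 1 > events/btrfs/btrfs_compress/enable
cat trace_pipe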



Re: CrashMonkey: A Framework to Systematically Test File-System Crash Consistency

2017-08-16 Thread Vijay Chidambaram
Amir,

That's a fair response. I certainly did not mean to add more work on
your end :) Using dm-log-writes for now is a reasonable approach.

Like I mentioned before, I think there is further work involved in
getting CrashMonkey to a useful point (where it finds at least known
bugs). Once this is done, I'd be happy to rework the device_wrapper as
a DM target (or perhaps as a modification of log-writes) for upstream.
I'm not sure how feasible it will be to keep the in-kernel functionality
simple, but we will try our best.

We will keep this goal in mind as we continue development, so that we
don't make any decisions that will prevent us from going the DM target
route later.
Thanks,
Vijay


On Wed, Aug 16, 2017 at 3:27 PM, Amir Goldstein  wrote:
> On Wed, Aug 16, 2017 at 10:06 PM, Vijay Chidambaram  
> wrote:
>> Hi Josef,
>>
>> Thank you for the detailed reply -- I think it provides several
>> pointers for our future work. It sounds like we have a similar vision
>> for where we want this to go, though we may disagree about how to
>> implement this :) This is exciting!
>>
>> I agree that we should be building off existing work if it is a good
>> option. We might end up using log-writes, but for now we see several
>> problems:
>>
>> - The log-writes code is not documented well. As you have mentioned,
>> at this point, only you know how it works, and we are not seeing a lot
>> of adoption by other developers of log-writes as well.
>>
>> - I don't think our requirements exactly match what log-writes
>> provides. For example, at some point we want to introduce checkpoints
>> so that we can co-relate a crash state with file-system state at the
>> time of crash. We also want to add functionality to guide creation of
>> random crash states (see below). This might require changing
>> log-writes significantly. I don't know if that would be a good idea.
>>
>> Regarding random crashes, there is a lot of complexity there that
>> log-writes couldn't handle without significant changes. For example,
>> just randomly generating crash states and testing each state is
>> unlikely to catch bugs. We need a more nuanced way of doing this. We
>> plan to add a lot of functionality to CrashMonkey to (a) let the user
>> guide crash-state generation (b) focus on "interesting" states (by
>> re-ordering or dropping metadata). All of this will likely require
>> adding more sophistication to the kernel module. I don't think we want
>> to take log-writes and add a lot of extra functionality.
>>
>> Regarding logging writes, I think there is a difference in approach
>> between log-writes and CrashMonkey. We don't really care about the
>> completion order since the device may anyway re-order the writes after
>> that point. Thus, the set of crash states generated by CrashMonkey is
>> bound only by FUA and FLUSH flags. It sounds as if log-writes focuses
>> on a more restricted set of crash states.
>>
>> CrashMonkey works with the 4.4 kernel, and we will try and keep up
>> with changes to the kernel that breaks CrashMonkey. CrashMonkey is
>> useless without the user-space component, so users will be needing to
>> compile some code anyway. I do not believe it will matter much whether
>> it is in-tree or not, as long as it compiles with the latest kernel.
>>
>> Regarding discard, multi-device support, and application-level crash
>> consistency, this is on our road-map too! Our current priority is to
>> build enough scaffolding to reproduce a known crash-consistency bug
>> (such as the delayed allocation bug of ext4), and then go on and try
>> to find new bugs in newer file systems like btrfs.
>>
>> Adding CrashMonkey into the kernel is not a priority at this point (I
>> don't think CrashMonkey is useful enough at this point to do so). When
>> CrashMonkey becomes useful enough to do so, we will perhaps add the
>> device_wrapper as a DM target to enable adoption.
>>
>> Our hope currently is that developers like Ari will try out
>> CrashMonkey in its current form, which will guide us as to what
>> functionality to add to CrashMonkey to find bugs more effectively.
>>
>
> Vijay,
>
> I can only speak for myself, but I think I represent other filesystem
> developers with this response:
> - Often with competing projects the end
> results is always for the best when project members cooperate to combine
> the best of both projects.
> - Some of your project goals (e.g. user guided crash states) sound very
> intriguing
> - IMO you are severely underestimating the pros in mainlined
> kernel code for other developers. If you find the dm-log-writes target
> is lacking functionality it would be MUCH better if you work to improve it.
> Even more - it would be far better if you make sure that your userspace
> tools can work also with the reduced functionality in mainline kernel.
> - If you choose to complete your academic research before crossing over
> to existing code base, that is a reasonable choice for you to make, but
> 

Re: CrashMonkey: A Framework to Systematically Test File-System Crash Consistency

2017-08-16 Thread Amir Goldstein
On Wed, Aug 16, 2017 at 10:06 PM, Vijay Chidambaram  wrote:
> Hi Josef,
>
> Thank you for the detailed reply -- I think it provides several
> pointers for our future work. It sounds like we have a similar vision
> for where we want this to go, though we may disagree about how to
> implement this :) This is exciting!
>
> I agree that we should be building off existing work if it is a good
> option. We might end up using log-writes, but for now we see several
> problems:
>
> - The log-writes code is not documented well. As you have mentioned,
> at this point, only you know how it works, and we are not seeing a lot
> of adoption by other developers of log-writes as well.
>
> - I don't think our requirements exactly match what log-writes
> provides. For example, at some point we want to introduce checkpoints
> so that we can co-relate a crash state with file-system state at the
> time of crash. We also want to add functionality to guide creation of
> random crash states (see below). This might require changing
> log-writes significantly. I don't know if that would be a good idea.
>
> Regarding random crashes, there is a lot of complexity there that
> log-writes couldn't handle without significant changes. For example,
> just randomly generating crash states and testing each state is
> unlikely to catch bugs. We need a more nuanced way of doing this. We
> plan to add a lot of functionality to CrashMonkey to (a) let the user
> guide crash-state generation (b) focus on "interesting" states (by
> re-ordering or dropping metadata). All of this will likely require
> adding more sophistication to the kernel module. I don't think we want
> to take log-writes and add a lot of extra functionality.
>
> Regarding logging writes, I think there is a difference in approach
> between log-writes and CrashMonkey. We don't really care about the
> completion order since the device may anyway re-order the writes after
> that point. Thus, the set of crash states generated by CrashMonkey is
> bound only by FUA and FLUSH flags. It sounds as if log-writes focuses
> on a more restricted set of crash states.
>
> CrashMonkey works with the 4.4 kernel, and we will try and keep up
> with changes to the kernel that breaks CrashMonkey. CrashMonkey is
> useless without the user-space component, so users will be needing to
> compile some code anyway. I do not believe it will matter much whether
> it is in-tree or not, as long as it compiles with the latest kernel.
>
> Regarding discard, multi-device support, and application-level crash
> consistency, this is on our road-map too! Our current priority is to
> build enough scaffolding to reproduce a known crash-consistency bug
> (such as the delayed allocation bug of ext4), and then go on and try
> to find new bugs in newer file systems like btrfs.
>
> Adding CrashMonkey into the kernel is not a priority at this point (I
> don't think CrashMonkey is useful enough at this point to do so). When
> CrashMonkey becomes useful enough to do so, we will perhaps add the
> device_wrapper as a DM target to enable adoption.
>
> Our hope currently is that developers like Ari will try out
> CrashMonkey in its current form, which will guide us as to what
> functionality to add to CrashMonkey to find bugs more effectively.
>

Vijay,

I can only speak for myself, but I think I represent other filesystem
developers with this response:
- Often with competing projects the end
result is always for the best when project members cooperate to combine
the best of both projects.
- Some of your project goals (e.g. user guided crash states) sound very
intriguing
- IMO you are severely underestimating the pros in mainlined
kernel code for other developers. If you find the dm-log-writes target
is lacking functionality it would be MUCH better if you work to improve it.
Even more - it would be far better if you make sure that your userspace
tools can work also with the reduced functionality in mainline kernel.
- If you choose to complete your academic research before crossing over
to existing code base, that is a reasonable choice for you to make, but
the reasonable choice for me to make is to try Joseph's tools from his
repo (even if not documented) and *only* if it doesn't meet my needs
I would make the extra effort to try out  CrashMonkey.
- AFAIK the state of filesystem crash consistency testing tools is not so bright
(maybe except in Facebook ;), so my priority is to get *some* automated
testing tools in motion

In any case, I'm glad this discussion started and I hope it would expedite
the adoption of crash testing tools.
I wish you all the best with your project.

Amir.


Re: [PATCH 2/4] btrfs: convert enum btrfs_compression_type to define

2017-08-16 Thread Anand Jain



On 08/16/2017 09:59 PM, David Sterba wrote:

On Sun, Aug 13, 2017 at 12:02:42PM +0800, Anand Jain wrote:

There isn't a huge list of types to manage; they can be managed
with defines. It helps to easily print the types in tracing as well.


We use enums in a lot of places, I'd rather keep it as it is.


 This patch converts all of them, and it was in only one place.
 I hope I didn't miss any. Further, the next patch 3/4 needs them
 to be defines instead of enums; handling enums in tracing
 isn't as easy as defines.

Thanks, Anand



Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-16 Thread Chris Murphy
On Wed, Aug 16, 2017 at 8:01 AM, Qu Wenruo  wrote:

> BTW, when Fujitsu tested the postgresql workload on btrfs, the result is
> quite interesting.
>
> For HDD, when number of clients is low, btrfs shows obvious performance
> drop.
> And the problem seems to be mandatory metadata COW, which leads to
> superblock FUA updates.
> And when number of clients grow, difference between btrfs and other fses
> gets much smaller, the bottleneck is the HDD itself.
>
> While for SSD, when number of clients is low, btrfs is almost the same
> performance as other fses, nodatacow/nodatasum only provides marginal
> difference.
> But when number of clients grows, btrfs falls far behind other fses.
> The reason seems to be related to how postgresql commit its transaction,
> which always fsync its journal sequentially without concurrency.


I wonder to what degree fsync is used as a hammer for a problem that
needs more granular indicators to solve, like fadvise() and even
extending it?

But I'm also curious how the behaviors you report above change when
combining SSD and HDD via either dm-cache or bcache. Do the worst
aspects of SSD and HDD get muted in that case? Or do the worst aspects
become even worse across the board?
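
As an aside, the usual way to opt a database directory out of data COW, as in
the nodatacow/nodatasum comparison quoted above, is to set the attribute on an
empty directory so new files inherit it (a sketch; the path is a placeholder):

mkdir /var/lib/pgsql/data-nocow
chattr +C /var/lib/pgsql/data-nocow
lsattr -d /var/lib/pgsql/data-nocow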


-- 
Chris Murphy


Re: CrashMonkey: A Framework to Systematically Test File-System Crash Consistency

2017-08-16 Thread Vijay Chidambaram
Hi Josef,

Thank you for the detailed reply -- I think it provides several
pointers for our future work. It sounds like we have a similar vision
for where we want this to go, though we may disagree about how to
implement this :) This is exciting!

I agree that we should be building off existing work if it is a good
option. We might end up using log-writes, but for now we see several
problems:

- The log-writes code is not documented well. As you have mentioned,
at this point, only you know how it works, and we are not seeing a lot
of adoption by other developers of log-writes as well.

- I don't think our requirements exactly match what log-writes
provides. For example, at some point we want to introduce checkpoints
so that we can co-relate a crash state with file-system state at the
time of crash. We also want to add functionality to guide creation of
random crash states (see below). This might require changing
log-writes significantly. I don't know if that would be a good idea.

Regarding random crashes, there is a lot of complexity there that
log-writes couldn't handle without significant changes. For example,
just randomly generating crash states and testing each state is
unlikely to catch bugs. We need a more nuanced way of doing this. We
plan to add a lot of functionality to CrashMonkey to (a) let the user
guide crash-state generation (b) focus on "interesting" states (by
re-ordering or dropping metadata). All of this will likely require
adding more sophistication to the kernel module. I don't think we want
to take log-writes and add a lot of extra functionality.

Regarding logging writes, I think there is a difference in approach
between log-writes and CrashMonkey. We don't really care about the
completion order since the device may anyway re-order the writes after
that point. Thus, the set of crash states generated by CrashMonkey is
bound only by FUA and FLUSH flags. It sounds as if log-writes focuses
on a more restricted set of crash states.

CrashMonkey works with the 4.4 kernel, and we will try and keep up
with changes to the kernel that breaks CrashMonkey. CrashMonkey is
useless without the user-space component, so users will be needing to
compile some code anyway. I do not believe it will matter much whether
it is in-tree or not, as long as it compiles with the latest kernel.

Regarding discard, multi-device support, and application-level crash
consistency, this is on our road-map too! Our current priority is to
build enough scaffolding to reproduce a known crash-consistency bug
(such as the delayed allocation bug of ext4), and then go on and try
to find new bugs in newer file systems like btrfs.

Adding CrashMonkey into the kernel is not a priority at this point (I
don't think CrashMonkey is useful enough at this point to do so). When
CrashMonkey becomes useful enough to do so, we will perhaps add the
device_wrapper as a DM target to enable adoption.

Our hope currently is that developers like Ari will try out
CrashMonkey in its current form, which will guide us as to what
functionality to add to CrashMonkey to find bugs more effectively.

Thanks,
Vijay

On Wed, Aug 16, 2017 at 8:06 AM, Josef Bacik  wrote:
> On Tue, Aug 15, 2017 at 08:44:16PM -0500, Vijay Chidambaram wrote:
>> Hi Amir,
>>
>> I neglected to mention this earlier: CrashMonkey does not require
>> recompiling the kernel (it is a stand-alone kernel module), and has
>> been tested with the kernel 4.4. It should work with future kernel
>> versions as long as there are no changes to the bio structure.
>>
>> As it is, I believe CrashMonkey is compatible with the current kernel.
>> It certainly provides functionality beyond log-writes (the ability to
>> replay a subset of writes between FLUSH/FUA), and we intend to add
>> more functionality in the future.
>>
>> Right now, CrashMonkey does not do random sampling among possible
>> crash states -- it will simply test a given number of unique states.
>> Thus, right now I don't think it is very effective in finding
>> crash-consistency bugs. But the entire infrastructure to profile a
>> workload, construct crash states, and test them with fsck is present.
>>
>> I'd be grateful if you could try it and give us feedback on what make
>> testing easier/more useful for you. As I mentioned before, this is a
>> work-in-progress, so we are happy to incorporate feedback.
>>
>
> Sorry I was travelling yesterday so I couldn't give this my full attention.
> Everything you guys do is already accomplished with dm-log-writes.  If you 
> look
> at the example scripts I've provided
>
> https://github.com/josefbacik/log-writes/blob/master/replay-individual-faster.sh
> https://github.com/josefbacik/log-writes/blob/master/replay-fsck-wrapper.sh
>
> The first initiates the replay, and points at the second script to run after
> each entry is replayed.  The whole point of this stuff was to make it as
> flexible as possible.  The way we use it is to replay, create a snapshot of 
> the
> 

Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-16 Thread David Sterba
On Wed, Aug 16, 2017 at 09:53:57AM -0400, Austin S. Hemmelgarn wrote:
> > So apart from some central DBs for the storage management system
> > itself, CoW is mostly no issue for us.
> > But I've talked to some friend at the local super computing centre and
> > they have rather general issues with CoW at their virtualisation
> > cluster.
> > Like SUSE's snapper making many snapshots leading the storage images of
> > VMs apparently to explode (in terms of space usage).
> SUSE is pathological case of brain-dead defaults.  Snapper needs to 
> either die or have some serious sense beat into it.  When you turn off 
> the automatic snapshot generation for everything but updates and set the 
> retention policy to not keep almost everything, it's actually not bad at 
> all.

The defaults for the timeline are really bad; the partition is almost never
big enough to hold 10 months' worth of data updates, not to say 10 years.
A rolling distro can fill the space even with the daily or weekly
settings set to low numbers. But certain people had a different opinion
and I was not successful in changing that. The least I could do was document
some of the use cases and the hints that could give one a bit
more understanding of the effects.

https://github.com/kdave/btrfsmaintenance#tuning-periodic-snapshotting
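
For completeness, the timeline limits can be reined in per config without
patching the defaults, roughly like this (a sketch; the config name and the
numbers are just examples):

snapper -c root set-config TIMELINE_LIMIT_HOURLY=5 TIMELINE_LIMIT_DAILY=7 \
    TIMELINE_LIMIT_WEEKLY=0 TIMELINE_LIMIT_MONTHLY=1 TIMELINE_LIMIT_YEARLY=0
snapper -c root get-config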

> > For some of their storage backends there simply seems to be no
> > deduplication available (or other reasons that prevent its usage).
> If the snapshots are being CoW'ed, then dedupe won't save them any 
> space.  Also, nodatacow is inherently at odds with reflinks used for dedupe.
> > 
> >  From that I'd guess there would be still people who want the nice
> > features of btrfs (snapshots, checksumming, etc.), while still being
> > able to nodatacow in specific cases.
> Snapshots work fine with nodatacow, each block gets CoW'ed once when 
> it's first written to, and then goes back to being NOCOW.  The only 
> caveat is that you probably want to defrag either once everything has 
> been rewritten, or right after the snapshot.
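
A minimal sketch of the defrag step described above (the path is a
placeholder; note that on current kernels defragmenting snapshotted data
breaks the reflink sharing and costs space):

btrfs filesystem defragment -r /srv/vm-images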


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-16 Thread David Sterba
On Thu, Aug 03, 2017 at 08:08:59PM +0200, waxhead wrote:
> BTRFS's biggest problem is not that there are some bits and pieces that
> are thoroughly screwed up (raid5/6, which just got some fixes by the
> way), but the fact that the documentation is rather dated.
> 
> There is a simple status page here 
> https://btrfs.wiki.kernel.org/index.php/Status
> 
> As others have pointed out already the explanations on the status page 
> is not exactly good. For example compression (that was also mentioned) 
> is as of writing this marked as 'Mostly ok'  '(needs verification and 
> source) - auto repair and compression may crash'
> 
> Now, I am aware that many use compression without trouble. I am not sure
> how many have compression with disk issues and don't have trouble,
> but I would at least expect to see more people yelling on the mailing
> list if that were the case. The problem here is that this message is
> rather scary and certainly does NOT sound like 'mostly ok' to most people.
> 
> What exactly needs verification and source? the mostly ok statement or 
> something else?! A more detailed explanation would be required here to 
> avoid scaring people away.
> 
> Same thing with the trim feature that is marked OK . It clearly says 
> that is has performance implications. It is marked OK so one would 
> expect it to not cause the filesystem to fail, but if the performance 
> becomes so slow that the filesystem gets practically unusable it is of 
> course not "OK". The relevant information is missing for people to make 
> a decent choice and I certainly don't know how serious these performance 
> implications are, if they are at all relevant...

I'll try to restructure the page so it reflects the status of the features
from more aspects, like overall/performance/"known bad scenarios". The
in-row notes are probably a bad idea as they are short on details; the
section under the table will be better for that.

> Most people interested in BTRFS are probably a bit more paranoid and 
> concerned about their data than the average computer user. What people 
> tend to forget is that other filesystems either have NO redundancy, 
> auto-repair and other fancy features that BTRFS have. So for the 
> compression example above... if you run compressed files on ext4 and 
> your disk gets some corruption you are in a no better state than what 
> you would be with btrfs either (in fact probably worse). Also nothing is 
> stopping you from putting btrfs DUP on a mdadm raid5 or 6 which mean you 
> should be VERY safe.
> 
> Simple documentation is the key so HERE ARE MY DEMANDS!!!. ehhh 
> so here is what I think should be done:
> 
> 1. The documentation needs to either be improved (or old non-relevant 
> stuff simply removed / archived somewhere)

Agreed, this happens from time to time.

> 2. The status page MUST always be up to date for the latest kernel 
> release (It's ok so far , let's hope nobody sleeps here)

I'm watching over the page. It's been locked from edits so there's a
mandatory review of the new contents, the update process is documented
on the page.

> 3. Proper explanations must be given so the layman and reasonably 
> technical people understand the risks / issues for non-ok stuff.

This can be hard; the audience is both technical and non-technical
users. The page is supposed to give a quick overview, and the more detailed
information is either in the notes or on separate pages linked from
there. I believe this structure should be able to cover what you need,
but the actual content hasn't been written and there are not enough
people willing/capable of writing it.

> 4. There should be links to roadmaps for each feature on the status page 
> that clearly states what is being worked on for the NEXT kernel release

We've tried something like that in the past, the page got out of sync
with reality over time and was deleted.


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-16 Thread Peter Grandi
[ ... ]

>>> Snapshots work fine with nodatacow, each block gets CoW'ed
>>> once when it's first written to, and then goes back to being
>>> NOCOW.
>>> The only caveat is that you probably want to defrag either
>>> once everything has been rewritten, or right after the
>>> snapshot.

>> I thought defrag would unshare the reflinks?
 
> Which is exactly why you might want to do it. It will get rid
> of the overhead of the single CoW operation, and it will make
> sure there is minimal fragmentation.
> IOW, when mixing NOCOW and snapshots, you either have to use
> extra space, or you deal with performance issues. Aside from
> that though, it works just fine and has no special issues as
> compared to snapshots without NOCOW.

The above illustrates my guess as to why RHEL 7.4 dropped Btrfs
support, which is:

  * RHEL is sold to managers who want to minimize the cost of
upgrades and sysadm skills.
  * Every time a customer creates a ticket, RH profits fall.
  * RH had adopted 'ext3' because it was an in-place upgrade
from 'ext2' and "just worked", 'ext4' because it was an
in-place upgrade from 'ext3' and was supposed to "just
work", and then was looking at Btrfs as an in-place upgrade
from 'ext4', and presumably also a replacement for MD RAID,
that would "just work".
  * 'ext4' (and XFS before that) already created a few years ago
trouble because of the 'O_PONIES' controversy.
  * Not only Btrfs still has "challenges" as to multi-device
functionality, and in-place upgrades from 'ext4' have
"challenges" too, it has many "special cases" that need
skill and discretion to handle, because it tries to cover so
many different cases, and the first thing many a RH customer
would do is to create a ticket to ask what to do, or how to
fix a choice already made.

Try to imagine the impact on the RH ticketing system of a switch
from 'ext4' to Btrfs, with explanations like the above, about
NOCOW, defrag, snapshots, balance, reflinks, and the exact order
in which they have to be performed for best results.


Re: [PATCH v2 0/7] add sanity check for extent inline ref type

2017-08-16 Thread Liu Bo
On Wed, Aug 16, 2017 at 04:53:15PM +0200, David Sterba wrote:
> On Mon, Aug 07, 2017 at 03:55:24PM -0600, Liu Bo wrote:
> > An invalid extent inline ref type could be read from a btrfs image and
> > it ends up with a panic[1], this set is to deal with the insane value
> > gracefully in patch 1-2 and clean up BUG() in the code in patch 3-6.
> > 
> > Patch 7 adds one more check to see if the ref is a valid shared one.
> > 
> > I'm not sure in the real world what may result in this corruption, but
> > I've seen several reports on the ML about __btrfs_free_extent saying
> > something was missing (or simply wrong), while testing this set with
> > btrfs-corrupt-block, I found that switching ref type could end up that
> > situation as well, eg. a data extent's ref type
> > (BTRFS_EXTENT_DATA_REF_KEY) is switched to (BTRFS_TREE_BLOCK_REF_KEY).
> > Hopefully this can give people more sights next time when that
> > happens.
> > 
> > [1]:https://www.spinics.net/lists/linux-btrfs/msg65646.html
> 
> The series looks good to me overall, there are some minor comments. The
> use of WARN(1, ...) will lack the common message prefix identifying the
> filesystem, so I suggest to use the btrfs_err helper and consider if the
> WARN_ON(1) is really useful in the place. Most of them look like that.
> 
> in patch btrfs_inline_ref_types, rename it to btrfs_inline_ref_type, so
> it's in line with other similar definitions.

Sounds good, I'll update them then.

Thanks,

-liubo


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-16 Thread Peter Grandi
[ ... ]

> But I've talked to some friend at the local super computing
> centre and they have rather general issues with CoW at their
> virtualisation cluster.

Amazing news! :-)

> Like SUSE's snapper making many snapshots leading the storage
> images of VMs apparently to explode (in terms of space usage).

Well, this could be an argument that some of your friends are being
"challenged" by running the storage systems of a "super computing
centre" and that they could become "more prepared" about system
administration, for example as to the principle "know which tool to
use for which workload". Or else it could be an argument that they
expect Btrfs to do their job while they watch cat videos from the
intertubes. :-)


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-16 Thread Peter Grandi
> We use the crcs to catch storage gone wrong, [ ... ]

And that's an opportunistically feasible idea given that current
CPUs can do that in real-time.

> [ ... ] It's possible to protect against all three without COW,
> but all solutions have their own tradeoffs and this is the setup
> we chose. It's easy to trust and easy to debug and at scale that
> really helps.

Indeed all filesystem designs have pathological workloads, and
system administrators and applications developers who are "more
prepared" know which one is best for which workload, or try to
figure it out.

> Some databases also crc, and all drives have correction bits of
> of some kind. There's nothing wrong with crcs happening at lots
> of layers.

Well, there is: in theory checksumming should be end-to-end, that
is entirely application level, so applications that don't need it
don't pay the price, but having it done at other layers can help
the very many applications that don't do it and should do it, and
it is cheap, and can help when troubleshooting exactly where the
problem is. It is an opportunistic thing to do.

> [ ... ] My real goal is to make COW fast enough that we can
> leave it on for the database applications too.  Obviously I
> haven't quite finished that one yet ;) [ ... ]

And this worries me because it portends the usual "marketing" goal
of making Btrfs all things to all workloads, the "OpenStack of
filesystems", with little consideration for complexity,
maintainability, or even sometimes reality.

The reality is that all known storage media have hugely
anisotropic performance envelopes, both as to functionality, cost,
speed, reliability, and there is no way to have an automagic
filesystem that "just works" in all cases, despite the constant
demands for one by "less prepared" storage administrators and
application developers. The reality is also that if one such
filesystem could automagically adapt to cover optimally the
performance envelopes of every possible device and workload, it
would be so complex as to be unmaintainable in practice.

So Btrfs, in its base "Rodeh" functionality, with COW, checksums,
subvolumes, shapshots, *on a single device*, works pretty well and
reliably and it is already very useful, for most workloads. Some
people also like some of its exotic complexities like in-place
compression and defragmentation, but they come at a high cost.

For workloads that inflict lots of small random in-place updates
on storage, like tablespaces for DBMSes etc, perhaps simpler less
featureful storage abstraction layers are more appropriate, from
OCFS2 to simple DM/LVM2 LVs, and Btrfs NOCOW approximates them
well.

BTW as to the specifics of DBMSes and filesystems, there is a
classic paper making eminently reasonable, practical, suggestions
that have been ignored for only 35 years and some:

  %A M. R. Stonebraker
  %T Operating system support for database management
  %J CACM
  %V 24
  %D JUL 1981
  %P 412-418


Re: [PATCH] btrfs: copy fsid to super_block s_uuid

2017-08-16 Thread David Sterba
On Tue, Aug 01, 2017 at 06:35:08PM +0800, Anand Jain wrote:
> We didn't copy fsid to struct super_block.s_uuid so Overlay disables
> index feature with btrfs as the lower FS.
> 
> kernel: overlayfs: fs on '/lower' does not support file handles, falling back 
> to index=off.
> 
> Fix this by publishing the fsid through struct super_block.s_uuid.
> 
> Signed-off-by: Anand Jain 
> ---
> I tried to know if in case did we deliberately missed this for some reason,
> however there is no information on that. If we mount a non-default subvol in
> the next mount/remount, its still the same FS, so publishing the FSID
> instead of subvol uuid is correct, OR I can't think any other reason for
> not using s_uuid for btrfs.

I think that setting s_uuid is the last missing bit. Overlay needs the
file handle encoding support from the lower filesystem, which is
supported. Filling the whole filesystem id is correct, the subvolume id
is encoded in the file handle buffer from inside btrfs_encode_fh.

From that point I think the patch is ok, but haven't tested it.


[PATCH] btrfs: Remove unused sectorsize variable from struct map_lookup

2017-08-16 Thread Nikolay Borisov
This variable was added in 1abe9b8a138c ("Btrfs: add initial tracepoint
support for btrfs"), yet it never really got used, only assigned to. So let's
remove it.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/volumes.c | 2 --
 fs/btrfs/volumes.h | 1 -
 2 files changed, 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f93ac3d7e997..47a0cb1dcc5e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4836,7 +4836,6 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle 
*trans,
   j * stripe_size;
}
}
-   map->sector_size = info->sectorsize;
map->stripe_len = raid_stripe_len;
map->io_align = raid_stripe_len;
map->io_width = raid_stripe_len;
@@ -6491,7 +6490,6 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, 
struct btrfs_key *key,
map->num_stripes = num_stripes;
map->io_width = btrfs_chunk_io_width(leaf, chunk);
map->io_align = btrfs_chunk_io_align(leaf, chunk);
-   map->sector_size = btrfs_chunk_sector_size(leaf, chunk);
map->stripe_len = btrfs_chunk_stripe_len(leaf, chunk);
map->type = btrfs_chunk_type(leaf, chunk);
map->sub_stripes = btrfs_chunk_sub_stripes(leaf, chunk);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 6f45fd60d15a..d0193e795dc2 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -353,7 +353,6 @@ struct map_lookup {
int io_align;
int io_width;
u64 stripe_len;
-   int sector_size;
int num_stripes;
int sub_stripes;
struct btrfs_bio_stripe stripes[];
-- 
2.7.4



[PATCH] btrfs: expose internal free space tree routine only if sanity tests are enabled

2017-08-16 Thread Nikolay Borisov
The internal free space tree management routines are always exposed for testing
purposes. Make them dependent on SANITY_TESTS being on so that they are exposed
only when they really have to be.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/free-space-tree.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/free-space-tree.h b/fs/btrfs/free-space-tree.h
index 54ffced3bce8..ba3787df43c3 100644
--- a/fs/btrfs/free-space-tree.h
+++ b/fs/btrfs/free-space-tree.h
@@ -44,7 +44,7 @@ int remove_from_free_space_tree(struct btrfs_trans_handle 
*trans,
struct btrfs_fs_info *fs_info,
u64 start, u64 size);
 
-/* Exposed for testing. */
+#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 struct btrfs_free_space_info *
 search_free_space_info(struct btrfs_trans_handle *trans,
   struct btrfs_fs_info *fs_info,
@@ -68,5 +68,6 @@ int convert_free_space_to_extents(struct btrfs_trans_handle 
*trans,
  struct btrfs_path *path);
 int free_space_test_bit(struct btrfs_block_group_cache *block_group,
struct btrfs_path *path, u64 offset);
+#endif
 
 #endif
-- 
2.7.4



Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-16 Thread Austin S. Hemmelgarn

On 2017-08-16 10:11, Christoph Anton Mitterer wrote:

On Wed, 2017-08-16 at 09:53 -0400, Austin S. Hemmelgarn wrote:

Go try BTRFS on top of dm-integrity, or on a
system with T10-DIF or T13-EPP support


When dm-integrity is used... would that be enough for btrfs to do a
proper repair in the RAID+nodatacow case? I assume it can't do repairs
now there, because how should it know which copy is valid.
dm-integrity is functionally a 1:1 mapping target (it uses a secondary 
device for storing the integrity info, but it requires one table per 
target).  It takes one backing device, and gives one mapped device.  The 
setup I'm suggesting would involve putting that on each device that you 
have BTRFS configured to use.  When the checksum there fails, you get a 
read error (AFAIK at least), which will trigger the regular BTRFS 
recovery code just like a failed checksum.  So in this case, it should 
recover just fine if one copy is bogus (assuming it's a media issue and 
not something between the block device and the filesystem).


In all honesty, putting BTRFS on dm-integrity is going to be slow.  If 
you can find some T10 DIF or T13 EPP hardware, that will almost 
certainly be faster.
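
For reference, a minimal sketch of that kind of stack, assuming a 4.12+ kernel
with dm-integrity and a cryptsetup new enough to ship integritysetup; device
names and the mountpoint are placeholders:

  # standalone dm-integrity (checksums only, no encryption) on each member
  integritysetup format /dev/sdX
  integritysetup open /dev/sdX int-a
  integritysetup format /dev/sdY
  integritysetup open /dev/sdY int-b

  # btrfs raid1 on top of the two mapped devices; a dm-integrity checksum
  # mismatch then surfaces as a read error and the normal raid1 repair path
  # takes over (a btrfs device scan may be needed so both members are known)
  mkfs.btrfs -m raid1 -d raid1 /dev/mapper/int-a /dev/mapper/int-b
  mount /dev/mapper/int-a /mnt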




  (which you should have access to
given the amount of funding CERN gets)

Hehe, CERN may get that funding (I don't know),... but the universities
rather don't ;-)
Point taken, I often forget that funding isn't exactly distributed in 
the most obvious ways.




Except it isn't clear with nodatacow, because it might be a false
positive.


Sure, never claimed the opposite... just that I'd expect this to be
less likely than the other way round, and less of a problem in
practise.
Any number of hardware failures or errors can cause the same net effect 
as an unclean shutdown, and even some much more complicated issues (a 
loose data cable to a storage device is probably one of the best 
examples, as it's trivial to explain and not as rare as most people think).





SUSE is pathological case of brain-dead defaults.  Snapper needs to
either die or have some serious sense beat into it.  When you turn
off
the automatic snapshot generation for everything but updates and set
the
retention policy to not keep almost everything, it's actually not bad
at
all.


Well, still, with CoW (unless you have some form of deduplication,
which in e.g. their use case would have to be on the layers below
btrfs), your storage usage will grow probably more significantly than
without.
Yes, and for most VM use cases I would advocate not using BTRFS 
snapshots inside the VM and instead using snapshot functionality in the 
VM software itself.  That still has performance issues in some cases, 
but at least it's easier to see where the data is actually being used.


And as you've mentioned yourself in the other mail, there's still the
issue with fragmentation.



Snapshots work fine with nodatacow, each block gets CoW'ed once when
it's first written to, and then goes back to being NOCOW.  The only
caveat is that you probably want to defrag either once everything
has
been rewritten, or right after the snapshot.


I thought defrag would unshare the reflinks?
Which is exactly why you might want to do it.  It will get rid of the 
overhead of the single CoW operation, and it will make sure there is 
minimal fragmentation.  IOW, when mixing NOCOW and snapshots, you either 
have to use extra space, or you deal with performance issues.  Aside 
from that though, it works just fine and has no special issues as 
compared to snapshots without NOCOW.
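
For anyone who wants to see that trade-off, it shows up directly in btrfs fi du
(paths here are made up):

  # before the defrag most of the data is reported as "Set shared"
  btrfs filesystem du -s /pool/vm /pool/vm-snap

  # rewriting the extents breaks the reflinks to the snapshot
  btrfs filesystem defragment -r /pool/vm

  # afterwards "Exclusive" grows and "Set shared" shrinks
  btrfs filesystem du -s /pool/vm /pool/vm-snap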



Re: [PATCH v2 0/7] add sanity check for extent inline ref type

2017-08-16 Thread David Sterba
On Mon, Aug 07, 2017 at 03:55:24PM -0600, Liu Bo wrote:
> An invalid extent inline ref type could be read from a btrfs image and
> it ends up with a panic[1], this set is to deal with the insane value
> gracefully in patch 1-2 and clean up BUG() in the code in patch 3-6.
> 
> Patch 7 adds one more check to see if the ref is a valid shared one.
> 
> I'm not sure in the real world what may result in this corruption, but
> I've seen several reports on the ML about __btrfs_free_extent saying
> something was missing (or simply wrong), while testing this set with
> btrfs-corrupt-block, I found that switching ref type could end up that
> situation as well, eg. a data extent's ref type
> (BTRFS_EXTENT_DATA_REF_KEY) is switched to (BTRFS_TREE_BLOCK_REF_KEY).
> Hopefully this can give people more sights next time when that
> happens.
> 
> [1]:https://www.spinics.net/lists/linux-btrfs/msg65646.html

The series looks good to me overall, there are some minor comments. The
use of WARN(1, ...) will lack the common message prefix identifying the
filesystem, so I suggest to use the btrfs_err helper and consider if the
WARN_ON(1) is really useful in the place. Most of them look like that.

in patch btrfs_inline_ref_types, rename it to btrfs_inline_ref_type, so
it's in line with other similar definitions.


Re: [PATCH 4/4 v3] btrfs: add compression trace points

2017-08-16 Thread David Sterba
On Sun, Aug 13, 2017 at 12:02:44PM +0800, Anand Jain wrote:
> From: Anand Jain 
> 
> This patch adds compression and decompression trace points for the
> purpose of debugging.
> 
> Signed-off-by: Anand Jain 
> Reviewed-by: Nikolay Borisov 
> ---
> v3:
>  . Rename to a simple names, without worrying about being
>compatible with the future naming.
>  . The type was not working fixed it.
> v2:
>  . Use better naming.
>(If transform is not good enough I have run out of ideas, pls suggest).
>  . To be applied on top of
>git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
>(tested without namelen check patch set)
>  fs/btrfs/compression.c   | 11 +++
>  include/trace/events/btrfs.h | 39 +++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index d2ef9ac2a630..4a652f67ee87 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -895,6 +895,10 @@ int btrfs_compress_pages(int type, struct address_space 
> *mapping,
> start, pages,
> out_pages,
> total_in, total_out);
> +
> + trace_btrfs_compress(1, 1, mapping->host, type, *total_in,
> + *total_out, start, ret);
> +
>   free_workspace(type, workspace);
>   return ret;
>  }
> @@ -921,6 +925,10 @@ static int btrfs_decompress_bio(struct compressed_bio 
> *cb)
>  
>   workspace = find_workspace(type);
>   ret = btrfs_compress_op[type - 1]->decompress_bio(workspace, cb);
> +
> + trace_btrfs_compress(0, 0, cb->inode, type,
> + cb->compressed_len, cb->len, cb->start, ret);
> +
>   free_workspace(type, workspace);
>  
>   return ret;
> @@ -943,6 +951,9 @@ int btrfs_decompress(int type, unsigned char *data_in, 
> struct page *dest_page,
> dest_page, start_byte,
> srclen, destlen);
>  
> + trace_btrfs_compress(0, 1, dest_page->mapping->host,
> + type, srclen, destlen, start_byte, ret);
> +
>   free_workspace(type, workspace);
>   return ret;
>  }
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index d412c49f5a6a..db33d6649d12 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -1629,6 +1629,45 @@ TRACE_EVENT(qgroup_meta_reserve,
>   show_root_type(__entry->refroot), __entry->diff)
>  );
>  
> +TRACE_EVENT(btrfs_compress,
> +
> + TP_PROTO(int compress, int page, struct inode *inode,
> + unsigned int type,
> + unsigned long len_before, unsigned long len_after,
> + unsigned long start, int ret),
> +
> + TP_ARGS(compress, page, inode, type, len_before,
> + len_after, start, ret),
> +
> + TP_STRUCT__entry_btrfs(
> + __field(int,compress)
> + __field(int,page)
> + __field(ino_t,  i_ino)

u64 for the inode number

> + __field(unsigned int,   type)
> + __field(unsigned long,  len_before)
> + __field(unsigned long,  len_after)
> + __field(unsigned long,  start)

and u64 here

> + __field(int,ret)
> + ),
> +
> + TP_fast_assign_btrfs(btrfs_sb(inode->i_sb),
> + __entry->compress   = compress;
> + __entry->page   = page;
> + __entry->i_ino  = inode->i_ino;
> + __entry->type   = type;
> + __entry->len_before = len_before;
> + __entry->len_after  = len_after;
> + __entry->start  = start;
> + __entry->ret= ret;
> + ),
> +
> + TP_printk_btrfs("%s %s ino=%lu type=%s len_before=%lu len_after=%lu 
> start=%lu ret=%d",

The format looks good, although I'm not sure we need to make the
distinction between page and bio compression. This also needs the extra
argument for the tracepoint.

> + __entry->compress ? "compress":"uncompress",

decompress

> + __entry->page ? "page":"bio", __entry->i_ino,

add spaces around :

> + show_compress_type(__entry->type),
> + __entry->len_before, __entry->len_after, __entry->start,
> + __entry->ret)
> +);
>  #endif /* _TRACE_BTRFS_H */
>  
>  /* This part must be outside protection */
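
For testing, assuming the event keeps the btrfs_compress name used in this
patch and tracefs is in the usual debugfs location, it can be exercised with:

  # enable the event and watch compression/decompression activity live
  echo 1 > /sys/kernel/debug/tracing/events/btrfs/btrfs_compress/enable
  cat /sys/kernel/debug/tracing/trace_pipe
  # disable it again when done
  echo 0 > /sys/kernel/debug/tracing/events/btrfs/btrfs_compress/enable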

Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-16 Thread Christoph Anton Mitterer
On Wed, 2017-08-16 at 09:53 -0400, Austin S. Hemmelgarn wrote:
> Go try BTRFS on top of dm-integrity, or on a 
> system with T10-DIF or T13-EPP support

When dm-integrity is used... would that be enough for btrfs to do a
proper repair in the RAID+nodatacow case? I assume it can't do repairs
now there, because how should it know which copy is valid.


>  (which you should have access to 
> given the amount of funding CERN gets)
Hehe, CERN may get that funding (I don't know),... but the universities
rather don't ;-)


> Except it isn't clear with nodatacow, because it might be a false
> positive.

Sure, never claimed the opposite... just that I'd expect this to be
less likely than the other way round, and less of a problem in
practise.



> SUSE is pathological case of brain-dead defaults.  Snapper needs to 
> either die or have some serious sense beat into it.  When you turn
> off 
> the automatic snapshot generation for everything but updates and set
> the 
> retention policy to not keep almost everything, it's actually not bad
> at 
> all.

Well, still, with CoW (unless you have some form of deduplication,
which in e.g. their use case would have to be on the layers below
btrfs), your storage usage will grow probably more significantly than
without.

And as you've mentioned yourself in the other mail, there's still the
issue with fragmentation.


> Snapshots work fine with nodatacow, each block gets CoW'ed once when 
> it's first written to, and then goes back to being NOCOW.  The only 
> caveat is that you probably want to defrag either once everything
> has 
> been rewritten, or right after the snapshot.

I thought defrag would unshare the reflinks?

Cheers,
Chris.



Re: [PATCH v2] btrfs: use appropriate define for the fsid

2017-08-16 Thread David Sterba
On Sat, Jul 29, 2017 at 05:50:09PM +0800, Anand Jain wrote:
> Though BTRFS_FSID_SIZE and BTRFS_UUID_SIZE or of same size,
> for the purpose of doing it correctly use BTRFS_FSID_SIZE instead.
> 
> Signed-off-by: Anand Jain 

Reviewed-by: David Sterba 


Re: [PATCH 3/4] btrfs: decode compress type for tracing

2017-08-16 Thread David Sterba
On Sun, Aug 13, 2017 at 12:02:43PM +0800, Anand Jain wrote:
> So with this now we see the compression type in string.
> 
> Signed-off-by: Anand Jain 

Reviewed-by: David Sterba 


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-16 Thread Qu Wenruo



On 2017年08月16日 21:12, Chris Mason wrote:

On Mon, Aug 14, 2017 at 09:54:48PM +0200, Christoph Anton Mitterer wrote:

On Mon, 2017-08-14 at 11:53 -0400, Austin S. Hemmelgarn wrote:

Quite a few applications actually _do_ have some degree of secondary
verification or protection from a crash.  Go look at almost any
database
software.

Then please give proper references for this!

This is from 2015, where you claimed this already and I looked up all
the bigger DBs and they either couldn't do it at all, didn't to it per
default, or it required application support (i.e. from the programs
using the DB)
https://www.spinics.net/lists/linux-btrfs/msg50258.html



It usually will not have checksumming, but it will almost
always have support for a journal, which is enough to cover the
particular data loss scenario we're talking about (unexpected
unclean
shutdown).


I don't think we talk about this:
We talk about people wanting checksuming to notice e.g. silent data
corruption.

The crash case is only the corner case about what happens then if data
is written correctly but csums not.


We use the crcs to catch storage gone wrong, both in terms of simple 
things like cabling, bus errors, drives gone crazy or exotic problems 
like every time I reboot the box a handful of sectors return EFI 
partition table headers instead of the data I wrote.  You don't need 
data center scale for this to happen, but it does help...


So, we do catch crc errors in prod and they do keep us from replicating 
bad data over good data.  Some databases also crc, and all drives have 
correction bits of some kind.  There's nothing wrong with crcs 
happening at lots of layers.


Btrfs couples the crcs with COW because it's the least complicated way 
to protect against:


* bits flipping
* IO getting lost on the way to the drive, leaving stale but valid data 
in place
* IO from sector A going to sector B instead, overwriting valid data 
with other valid data.


It's possible to protect against all three without COW, but all 
solutions have their own tradeoffs and this is the setup we chose.  It's 
easy to trust and easy to debug and at scale that really helps.


In general, production storage environments prefer clearly defined 
errors when the storage has the wrong data.  EIOs happen often, and you 
want to be able to quickly pitch the bad data and replicate in good data.


Btrfs csum is really good, especially for cases like RAID1/5/6 where the csum 
can provide extra info about which mirror/stripe/parity can be trusted, 
with minimal space wasted.
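
A rough operational sketch of how that extra info gets used, with the
mountpoint as a placeholder:

  # read everything, verify checksums, and rewrite bad copies from a good
  # mirror/parity where one exists
  btrfs scrub start -Bd /mnt
  # per-device counters for csum/read/write/generation errors
  btrfs device stats /mnt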


The DM layer should really have the ability to verify its data at that 
point, like btrfs does.




My real goal is to make COW fast enough that we can leave it on for the 
database applications too.


Yes, most of the complexity of nodatasum/nodatacow comes from those 
special workloads.


BTW, when Fujitsu tested the postgresql workload on btrfs, the result is 
quite interesting.


For HDD, when the number of clients is low, btrfs shows an obvious 
performance drop.
The problem seems to be the mandatory metadata COW, which leads to 
superblock FUA updates.
When the number of clients grows, the difference between btrfs and other 
fses gets much smaller; the bottleneck is the HDD itself.


While for SSD, when the number of clients is low, btrfs has almost the 
same performance as other fses; nodatacow/nodatasum only makes a marginal 
difference.

But when the number of clients grows, btrfs falls far behind other fses.
The reason seems to be related to how postgresql commits its transactions, 
always fsyncing its journal sequentially without concurrency.
Btrfs needs to wait for its data writes before updating its log tree, so 
most of its time is wasted waiting on data IO.
In that case, nodatacow does improve the performance, by allowing btrfs 
to update its log tree without waiting for data IO.


But in both cases, CoW itself, like allocating new extents or calculating 
csums, is not the main cause of the slowdown in btrfs.

That is to say, nodatacow is not as important as we used to think.

If we can get rid of nodatacow/nodatasum, there will be much less for us 
developers to consider, and fewer related bugs.


Thanks,
Qu

 Obviously I haven't quite finished that one 
yet ;) But I'd rather keep the building block of all the other btrfs 
features in place than try to do crcs differently.


-chris



Re: [PATCH 2/4] btrfs: convert enum btrfs_compression_type to define

2017-08-16 Thread David Sterba
On Sun, Aug 13, 2017 at 12:02:42PM +0800, Anand Jain wrote:
> There isn't a huge list to manage the types, which can be managed
> with defines. It helps to easily print the types in tracing as well.

We use enums in a lot of places, I'd rather keep it as it is.


Re: [PATCH 1/4] btrfs: remove unused BTRFS_COMPRESS_LAST

2017-08-16 Thread David Sterba
On Sun, Aug 13, 2017 at 12:02:41PM +0800, Anand Jain wrote:
> We aren't using this define, so removing it.
> 
> Signed-off-by: Anand Jain 

Reviewed-by: David Sterba 


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-16 Thread Austin S. Hemmelgarn

On 2017-08-16 09:12, Chris Mason wrote:
My real goal is to make COW fast enough that we can leave it on for the 
database applications too.  Obviously I haven't quite finished that one 
yet ;) But I'd rather keep the building block of all the other btrfs 
features in place than try to do crcs differently.
In general, the performance issue isn't because of the time it takes to 
CoW the blocks, it's because of the fragmentation it introduces.  That 
fragmentation could in theory be mitigated by making CoW happen at a 
larger chunk size, but that would push the issue more towards being one 
of CoW performance, not fragmentation.




Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-16 Thread Austin S. Hemmelgarn

On 2017-08-16 09:31, Christoph Anton Mitterer wrote:

Just out of curiosity:


On Wed, 2017-08-16 at 09:12 -0400, Chris Mason wrote:

Btrfs couples the crcs with COW because


this (which sounds like you want it to stay coupled that way)...

plus



It's possible to protect against all three without COW, but all
solutions have their own tradeoffs and this is the setup we
chose.  It's
easy to trust and easy to debug and at scale that really helps.


... this (which sounds more you think the checksumming is so helpful,
that it would be nice in the nodatacow as well).

What does that mean now? Things will stay as they are... or it may
become a goal to get checksumming for nodatacow (while of course still
retaining the possibility to disable both, datacow AND checksumming)?
It means that you have other options if you want this so badly that you 
need to keep pestering the developers about it but can't be arsed to try 
to code it yourself.  Go try BTRFS on top of dm-integrity, or on a 
system with T10-DIF or T13-EPP support (which you should have access to 
given the amount of funding CERN gets), or even on a ZFS zvol if you're 
crazy enough.  It works wonderfully in the first two cases, and reliably 
(but not efficiently) in the third, and all of them provide exactly what 
you want, plus the bonus that they do a slightly better job of 
differentiating between media and memory errors.




In general, production storage environments prefer clearly defined
errors when the storage has the wrong data.  EIOs happen often, and
you
want to be able to quickly pitch the bad data and replicate in good
data.


Which would also rather point towards getting clear EIOs (and thus
checksumming) in the nodatacow case.

Except it isn't clear with nodatacow, because it might be a false positive.





My real goal is to make COW fast enough that we can leave it on for
the
database applications too.  Obviously I haven't quite finished that
one
yet ;)


Well the question is, even if you manage that sooner or later, will
everyone be fully satisfied by this?!
I've mentioned earlier on the list that I manage one of the many big
data/computing centres for LHC.
Our use case is typically big plain storage servers connected via some
higher level storage management system (http://dcache.org/)... with
mostly write once/read many.

So apart from some central DBs for the storage management system
itself, CoW is mostly no issue for us.
But I've talked to some friend at the local super computing centre and
they have rather general issues with CoW at their virtualisation
cluster.
Like SUSE's snapper making many snapshots leading the storage images of
VMs apparently to explode (in terms of space usage).
SUSE is pathological case of brain-dead defaults.  Snapper needs to 
either die or have some serious sense beat into it.  When you turn off 
the automatic snapshot generation for everything but updates and set the 
retention policy to not keep almost everything, it's actually not bad at 
all.

For some of their storage backends there simply seem to be no de-
duplication available (or other reasons that prevent it's usage).
If the snapshots are being CoW'ed, then dedupe won't save them any 
space.  Also, nodatacow is inherently at odds with reflinks used for dedupe.


 From that I'd guess there would be still people who want the nice
features of btrfs (snapshots, checksumming, etc.), while still being
able to nodatacow in specific cases.
Snapshots work fine with nodatacow, each block gets CoW'ed once when 
it's first written to, and then goes back to being NOCOW.  The only 
caveat is that you probably want to defrag either once everything has 
been rewritten, or right after the snapshot.
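
As a rough sketch of that workflow, assuming /pool/vm is a subvolume holding
the images (paths are made up):

  # mark the directory NOCOW; +C only takes effect for files created
  # afterwards (or for empty files), so do it before the images exist
  chattr +C /pool/vm

  # read-only snapshot; the first write to each shared block in /pool/vm
  # is CoW'ed once, then the files go back to being NOCOW
  btrfs subvolume snapshot -r /pool/vm /pool/vm-snap

  # optionally defragment once most of the data has been rewritten,
  # trading the shared space for less fragmentation
  btrfs filesystem defragment -r /pool/vm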



Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-16 Thread Christoph Anton Mitterer
Just out of curiosity:


On Wed, 2017-08-16 at 09:12 -0400, Chris Mason wrote:
> Btrfs couples the crcs with COW because

this (which sounds like you want it to stay coupled that way)...

plus


> It's possible to protect against all three without COW, but all 
> solutions have their own tradeoffs and this is the setup we
> chose.  It's 
> easy to trust and easy to debug and at scale that really helps.

... this (which sounds more you think the checksumming is so helpful,
that it would be nice in the nodatacow as well).

What does that mean now? Things will stay as they are... or it may
become a goal to get checksumming for nodatacow (while of course still
retaining the possibility to disable both, datacow AND checksumming)?


> In general, production storage environments prefer clearly defined 
> errors when the storage has the wrong data.  EIOs happen often, and
> you 
> want to be able to quickly pitch the bad data and replicate in good 
> data.

Which would also rather point towards getting clear EIOs (and thus
checksumming) in the nodatacow case.



> My real goal is to make COW fast enough that we can leave it on for
> the 
> database applications too.  Obviously I haven't quite finished that
> one 
> yet ;)

Well the question is, even if you manage that sooner or later, will
everyone be fully satisfied by this?!
I've mentioned earlier on the list that I manage one of the many big
data/computing centres for LHC.
Our use case is typically big plain storage servers connected via some
higher level storage management system (http://dcache.org/)... with
mostly write once/read many.

So apart from some central DBs for the storage management system
itself, CoW is mostly no issue for us.
But I've talked to some friend at the local super computing centre and
they have rather general issues with CoW at their virtualisation
cluster.
Like SUSE's snapper making many snapshots leading the storage images of
VMs apparently to explode (in terms of space usage).
For some of their storage backends there simply seem to be no de-
duplication available (or other reasons that prevent it's usage).

From that I'd guess there would be still people who want the nice
features of btrfs (snapshots, checksumming, etc.), while still being
able to nodatacow in specific cases.


> But I'd rather keep the building block of all the other btrfs 
> features in place than try to do crcs differently.

Mhh I see, what a pity.


Cheers,
Chris.



Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-16 Thread Chris Mason

On Mon, Aug 14, 2017 at 09:54:48PM +0200, Christoph Anton Mitterer wrote:

On Mon, 2017-08-14 at 11:53 -0400, Austin S. Hemmelgarn wrote:

Quite a few applications actually _do_ have some degree of secondary 
verification or protection from a crash.  Go look at almost any
database 
software.

Then please give proper references for this!

This is from 2015, where you claimed this already and I looked up all
the bigger DBs and they either couldn't do it at all, didn't to it per
default, or it required application support (i.e. from the programs
using the DB)
https://www.spinics.net/lists/linux-btrfs/msg50258.html



It usually will not have checksumming, but it will almost 
always have support for a journal, which is enough to cover the 
particular data loss scenario we're talking about (unexpected
unclean 
shutdown).


I don't think we talk about this:
We talk about people wanting checksuming to notice e.g. silent data
corruption.

The crash case is only the corner case about what happens then if data
is written correctly but csums not.


We use the crcs to catch storage gone wrong, both in terms of simple 
things like cabling, bus errors, drives gone crazy or exotic problems 
like every time I reboot the box a handful of sectors return EFI 
partition table headers instead of the data I wrote.  You don't need 
data center scale for this to happen, but it does help...


So, we do catch crc errors in prod and they do keep us from replicating 
bad data over good data.  Some databases also crc, and all drives have 
correction bits of some kind.  There's nothing wrong with crcs 
happening at lots of layers.


Btrfs couples the crcs with COW because it's the least complicated way 
to protect against:


* bits flipping
* IO getting lost on the way to the drive, leaving stale but valid data 
in place
* IO from sector A going to sector B instead, overwriting valid data 
with other valid data.


It's possible to protect against all three without COW, but all 
solutions have their own tradeoffs and this is the setup we chose.  It's 
easy to trust and easy to debug and at scale that really helps.


In general, production storage environments prefer clearly defined 
errors when the storage has the wrong data.  EIOs happen often, and you 
want to be able to quickly pitch the bad data and replicate in good 
data.


My real goal is to make COW fast enough that we can leave it on for the 
database applications too.  Obviously I haven't quite finished that one 
yet ;) But I'd rather keep the building block of all the other btrfs 
features in place than try to do crcs differently.


-chris


Re: [PATCH] btrfs: Fix -EOVERFLOW handling in btrfs_ioctl_tree_search_v2

2017-08-16 Thread David Sterba
On Fri, Aug 04, 2017 at 02:41:18PM +0300, Nikolay Borisov wrote:
> The buffer passed to btrfs_ioctl_tree_search* functions have to be at least
> sizeof(struct btrfs_ioctl_search_header). If this is not the case then the
> ioctl should return -EOVERFLOW and set the uarg->buf_size to the minimum
> required size. Currently btrfs_ioctl_tree_search_v2 would return an -EOVERFLOW
> error with ->buf_size being set to the value passed by user space. Fix this by
> removing the size check and relying on search_ioctl, which already includes it
> and correctly sets buf_size.
> 
> Signed-off-by: Nikolay Borisov 

Reviewed-by: David Sterba 


Re: CrashMonkey: A Framework to Systematically Test File-System Crash Consistency

2017-08-16 Thread Josef Bacik
On Tue, Aug 15, 2017 at 08:44:16PM -0500, Vijay Chidambaram wrote:
> Hi Amir,
> 
> I neglected to mention this earlier: CrashMonkey does not require
> recompiling the kernel (it is a stand-alone kernel module), and has
> been tested with the kernel 4.4. It should work with future kernel
> versions as long as there are no changes to the bio structure.
> 
> As it is, I believe CrashMonkey is compatible with the current kernel.
> It certainly provides functionality beyond log-writes (the ability to
> replay a subset of writes between FLUSH/FUA), and we intend to add
> more functionality in the future.
> 
> Right now, CrashMonkey does not do random sampling among possible
> crash states -- it will simply test a given number of unique states.
> Thus, right now I don't think it is very effective in finding
> crash-consistency bugs. But the entire infrastructure to profile a
> workload, construct crash states, and test them with fsck is present.
> 
> I'd be grateful if you could try it and give us feedback on what make
> testing easier/more useful for you. As I mentioned before, this is a
> work-in-progress, so we are happy to incorporate feedback.
> 

Sorry I was travelling yesterday so I couldn't give this my full attention.
Everything you guys do is already accomplished with dm-log-writes.  If you look
at the example scripts I've provided

https://github.com/josefbacik/log-writes/blob/master/replay-individual-faster.sh
https://github.com/josefbacik/log-writes/blob/master/replay-fsck-wrapper.sh

The first initiates the replay, and points at the second script to run after
each entry is replayed.  The whole point of this stuff was to make it as
flexible as possible.  The way we use it is to replay, create a snapshot of the
replay, mount, unmount, fsck, delete the snapshot and carry on to the next
position in the log.

There is nothing keeping us from generating random crash points, this has been
something on my list of things to do forever.  All that would be required would
be to hold the entries between flush/fua events in memory, and then replay them
in whatever order you deemed fit.  That's the only functionality missing from my
replay-log stuff that CrashMonkey has.

The other part of this is getting user space applications to do more thorough
checking of consistency that it expects, which I implemented here

https://github.com/josefbacik/fstests/commit/70d41e17164b2afc9a3f2ae532f084bf64cb4a07

fsx will randomly do operations to a file, and every time it fsync()'s it saves
its state and marks the log.  Then we can go back and replay the log to the
mark and md5sum the file to make sure it matches the saved state.  This
infrastructure was meant to be as simple as possible so the possibilities for
crash consistency testing were endless.  One of the next areas we plan to use
this in Facebook is just for application consistency, so we can replay the fs
and verify the application works in whatever state the fs is at any given point.
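
For anyone unfamiliar with the target, the setup those scripts drive is roughly
the following (device paths and the mark name are placeholders):

  DEV=/dev/sdb        # device under test
  LOGDEV=/dev/sdc     # device that records the write log

  # stack the log-writes target on top of the device under test
  dmsetup create log --table "0 $(blockdev --getsz $DEV) log-writes $DEV $LOGDEV"
  mkfs.btrfs /dev/mapper/log
  mount /dev/mapper/log /mnt

  # run the workload; after an fsync the test can drop a named mark
  dmsetup message log 0 mark after_first_fsync

  umount /mnt
  dmsetup remove log
  # the log on $LOGDEV is then replayed entry by entry (or up to a mark)
  # with the replay-log tool from the repository linked above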

I looked at your code and you are logging entries at submit time, not completion
time.  The reason I do those crazy acrobatics is because we have had bugs in
previous kernels where we were not waiting for io completion of important
metadata before writing out the super block, so logging only at completion
allows us to catch that class of problems.

The other thing CrashMonkey is missing is DISCARD support.  We fuck up discard
support constantly, and being able to replay discards to make sure we're not
discarding important data is very important.

I'm not trying to shit on your project, obviously it's a good idea, that's why I
did it years ago ;).  The community is going to use what is easiest to use, and
modprobe dm-log-writes is a lot easier than compiling and insmod'ing an out of
tree driver.  Also your driver won't work on upstream kernels because of the way
the bio flags were changed recently, which is why we prefer using upstream
solutions.

If you guys want to get this stuff used then it would be better at this point to
build on top of what we already have.  Just off the top of my head we need

1) Random replay support for replay-log.  This is probably a day or two worth of
work for a student.

2) Documentation, because right now I'm the only one who knows how this works.

3) My patches need to actually be pushed into upstream fstests.  This would be
the largest win because then all the fs developers would be running the tests
by default.

4) Multi-device support.  One thing that would be good to have and is a dream of
mine is to connect multiple devices to one log, so we can do things like make
sure mdraid or btrfs's raid consistency.  We could do super evil things like
only replay one device, or replay alternating writes on each device.  This would
be a larger project but would be super helpful.

Thanks,

Josef

Re: slow btrfs with a single kworker process using 100% CPU

2017-08-16 Thread Stefan Priebe - Profihost AG

On 16.08.2017 at 14:29, Konstantin V. Gavrilenko wrote:
> Roman, initially I had a single process occupying 100% CPU, when sysrq it was 
> indicating as "btrfs_find_space_for_alloc"
> but that's when I used the autodefrag, compress, forcecompress and commit=10 
> mount flags and space_cache was v1 by default.
> when I switched to "relatime,compress-force=zlib,space_cache=v2" the 100% cpu 
> has dissapeared, but the shite performance remained.

space_cache=v2 is not supported by the opensuse kernel - but I compile
the kernel myself anyway. Is there a patchset to add support for
space_cache=v2?
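
For what it's worth, a quick way to check whether a given kernel build has the
free space tree at all, assuming the usual sysfs feature directory (device and
mountpoint are placeholders):

  # the feature file only exists if the kernel was built with support
  ls /sys/fs/btrfs/features/ | grep free_space_tree
  # if it is there, the tree is built once on the first mount with:
  mount -o space_cache=v2 /dev/md0 /vmbackup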

Greets,
Stefan

> 
> As to the chunk size, there is no information in the article about the type 
> of data that was used. While in our case we are pretty certain about the 
> compressed block size (32-128). I am currently inclining towards 32k as it 
> might be ideal in a situation when we have a 5 disk raid5 array.
> 
> In theory
> 1. The minimum compressed write (32k) would fill the chunk on a single disk, 
> thus the IO cost of the operation would be 2 reads (original chunk + original 
> parity)  and 2 writes (new chunk + new parity)
> 
> 2. The maximum compressed write (128k) would require the update of 1 chunk on 
> each of the 4 data disks + 1 parity  write 
> 
> 
> 
> Stefan what mount flags do you use?
> 
> kos
> 
> 
> 
> - Original Message -
> From: "Roman Mamedov" 
> To: "Konstantin V. Gavrilenko" 
> Cc: "Stefan Priebe - Profihost AG" , "Marat Khalili" 
> , linux-btrfs@vger.kernel.org, "Peter Grandi" 
> 
> Sent: Wednesday, 16 August, 2017 2:00:03 PM
> Subject: Re: slow btrfs with a single kworker process using 100% CPU
> 
> On Wed, 16 Aug 2017 12:48:42 +0100 (BST)
> "Konstantin V. Gavrilenko"  wrote:
> 
>> I believe the chunk size of 512kb is even worth for performance then the 
>> default settings on my HW RAID of  256kb.
> 
> It might be, but that does not explain the original problem reported at all.
> If mdraid performance would be the bottleneck, you would see high iowait,
> possibly some CPU load from the mdX_raidY threads. But not a single Btrfs
> thread pegging into 100% CPU.
> 
>> So now I am moving the data from the array and will be rebuilding it with 64
>> or 32 chunk size and checking the performance.
> 
> 64K is the sweet spot for RAID5/6:
> http://louwrentius.com/linux-raid-level-and-chunk-size-the-benchmarks.html
> 


Re: slow btrfs with a single kworker process using 100% CPU

2017-08-16 Thread Stefan Priebe - Profihost AG

On 16.08.2017 at 14:29, Konstantin V. Gavrilenko wrote:
> Roman, initially I had a single process occupying 100% CPU, when sysrq it was 
> indicating as "btrfs_find_space_for_alloc"
> but that's when I used the autodefrag, compress, forcecompress and commit=10 
> mount flags and space_cache was v1 by default.
> when I switched to "relatime,compress-force=zlib,space_cache=v2" the 100% cpu 
> has dissapeared, but the shite performance remained.
> 
> 
> As to the chunk size, there is no information in the article about the type 
> of data that was used. While in our case we are pretty certain about the 
> compressed block size (32-128). I am currently inclining towards 32k as it 
> might be ideal in a situation when we have a 5 disk raid5 array.
> 
> In theory
> 1. The minimum compressed write (32k) would fill the chunk on a single disk, 
> thus the IO cost of the operation would be 2 reads (original chunk + original 
> parity)  and 2 writes (new chunk + new parity)
> 
> 2. The maximum compressed write (128k) would require the update of 1 chunk on 
> each of the 4 data disks + 1 parity  write 
> 
> 
> 
> Stefan what mount flags do you use?

noatime,compress-force=zlib,noacl,space_cache,skip_balance,subvolid=5,subvol=/

Greets,
Stefan


> kos
> 
> 
> 
> - Original Message -
> From: "Roman Mamedov" 
> To: "Konstantin V. Gavrilenko" 
> Cc: "Stefan Priebe - Profihost AG" , "Marat Khalili" 
> , linux-btrfs@vger.kernel.org, "Peter Grandi" 
> 
> Sent: Wednesday, 16 August, 2017 2:00:03 PM
> Subject: Re: slow btrfs with a single kworker process using 100% CPU
> 
> On Wed, 16 Aug 2017 12:48:42 +0100 (BST)
> "Konstantin V. Gavrilenko"  wrote:
> 
>> I believe the chunk size of 512kb is even worth for performance then the 
>> default settings on my HW RAID of  256kb.
> 
> It might be, but that does not explain the original problem reported at all.
> If mdraid performance would be the bottleneck, you would see high iowait,
> possibly some CPU load from the mdX_raidY threads. But not a single Btrfs
> thread pegging into 100% CPU.
> 
>> So now I am moving the data from the array and will be rebuilding it with 64
>> or 32 chunk size and checking the performance.
> 
> 64K is the sweet spot for RAID5/6:
> http://louwrentius.com/linux-raid-level-and-chunk-size-the-benchmarks.html
> 


Re: slow btrfs with a single kworker process using 100% CPU

2017-08-16 Thread Konstantin V. Gavrilenko
Roman, initially I had a single process occupying 100% CPU, when sysrq it was 
indicating as "btrfs_find_space_for_alloc"
but that's when I used the autodefrag, compress, forcecompress and commit=10 
mount flags and space_cache was v1 by default.
when I switched to "relatime,compress-force=zlib,space_cache=v2" the 100% cpu 
has disappeared, but the shite performance remained.


As to the chunk size, there is no information in the article about the type of 
data that was used. While in our case we are pretty certain about the 
compressed block size (32-128). I am currently inclining towards 32k as it 
might be ideal in a situation when we have a 5 disk raid5 array.

In theory
1. The minimum compressed write (32k) would fill the chunk on a single disk, 
thus the IO cost of the operation would be 2 reads (original chunk + original 
parity)  and 2 writes (new chunk + new parity)

2. The maximum compressed write (128k) would require the update of 1 chunk on 
each of the 4 data disks + 1 parity  write 



Stefan what mount flags do you use?

kos



- Original Message -
From: "Roman Mamedov" 
To: "Konstantin V. Gavrilenko" 
Cc: "Stefan Priebe - Profihost AG" , "Marat Khalili" 
, linux-btrfs@vger.kernel.org, "Peter Grandi" 

Sent: Wednesday, 16 August, 2017 2:00:03 PM
Subject: Re: slow btrfs with a single kworker process using 100% CPU

On Wed, 16 Aug 2017 12:48:42 +0100 (BST)
"Konstantin V. Gavrilenko"  wrote:

> I believe the chunk size of 512kb is even worth for performance then the 
> default settings on my HW RAID of  256kb.

It might be, but that does not explain the original problem reported at all.
If mdraid performance would be the bottleneck, you would see high iowait,
possibly some CPU load from the mdX_raidY threads. But not a single Btrfs
thread pegging into 100% CPU.

> So now I am moving the data from the array and will be rebuilding it with 64
> or 32 chunk size and checking the performance.

64K is the sweet spot for RAID5/6:
http://louwrentius.com/linux-raid-level-and-chunk-size-the-benchmarks.html

-- 
With respect,
Roman


Re: slow btrfs with a single kworker process using 100% CPU

2017-08-16 Thread Roman Mamedov
On Wed, 16 Aug 2017 12:48:42 +0100 (BST)
"Konstantin V. Gavrilenko"  wrote:

> I believe the chunk size of 512kb is even worth for performance then the 
> default settings on my HW RAID of  256kb.

It might be, but that does not explain the original problem reported at all.
If mdraid performance would be the bottleneck, you would see high iowait,
possibly some CPU load from the mdX_raidY threads. But not a single Btrfs
thread pegging into 100% CPU.

> So now I am moving the data from the array and will be rebuilding it with 64
> or 32 chunk size and checking the performance.

64K is the sweet spot for RAID5/6:
http://louwrentius.com/linux-raid-level-and-chunk-size-the-benchmarks.html

-- 
With respect,
Roman


Re: slow btrfs with a single kworker process using 100% CPU

2017-08-16 Thread Konstantin V. Gavrilenko


I believe the chunk size of 512kb is even worse for performance than the 
default setting of 256kb on my HW RAID.

Peter Grandi explained it earlier on in one of his posts.

QTE
++
That runs counter to this simple story: suppose a program is
doing 64KiB IO:

* For *reads*, there are 4 data drives and the strip size is
  16KiB: the 64KiB will be read in parallel on 4 drives. If the
  strip size is 256KiB then the 64KiB will be read sequentially
  from just one disk, and 4 successive reads will be read
  sequentially from the same drive.

* For *writes* on a parity RAID like RAID5 things are much, much
  more extreme: the 64KiB will be written with 16KiB strips on a
  5-wide RAID5 set in parallel to 5 drives, with 4 stripes being
  updated with RMW. But with 256KiB strips it will partially
  update 5 drives, because the stripe is 1024+256KiB, and it
  needs to do RMW, and four successive 64KiB drives will need to
  do that too, even if only one drive is updated. Usually for
  RAID5 there is an optimization that means that only the
  specific target drive and the parity drives(s) need RMW, but
  it is still very expensive.

This is the "storage for beginners" version, what happens in
practice however depends a lot on specific workload profile
(typical read/write size and latencies and rates), caching and
queueing algorithms in both Linux and the HA firmware.
++
UNQTE


I've also found another explanation of the same problem with the right chunk 
size and how it works here
http://holyhandgrenade.org/blog/2011/08/disk-performance-part-2-raid-layouts-and-stripe-sizing/#more-1212



So in my understanding, when working with compressed data, the compressed extents 
will vary between 128kb (urandom) and 32kb (zeroes), and that is what gets passed 
to the FS to take care of.

and in our setup of large chunk sizes, if we need to write 32kb-128kb of 
compressed data, the RAID5 would need to perform  3 read operations and 2 write 
operations.

As updating a parity chunk requires either
- The original chunk, the new chunk, and the old parity block
- Or, all chunks (except for the parity chunk) in the stripe

disk        disk1   disk2   disk3   disk4
chunk size  512kb   512kb   512kb   512kb   P

So in the worst-case scenario, in order to write 32kb, RAID5 would need to read 
(480 + 512 + P512) and then write (32 + P512).

That's my current understanding of the situation.
I was planning to write an update to my story later on, once I hopefully solve 
the problem. But an interim update is that I have performed a full defrag 
with full compression (2 days), then a balance of all the data (10 days), and it 
didn't help the performance.

So now I am moving the data from the array and will be rebuilding it with 64 or 
32 chunk size and checking the performance.

VG,
kos



- Original Message -
From: "Stefan Priebe - Profihost AG" 
To: "Konstantin V. Gavrilenko" 
Cc: "Marat Khalili" , linux-btrfs@vger.kernel.org
Sent: Wednesday, 16 August, 2017 11:26:38 AM
Subject: Re: slow btrfs with a single kworker process using 100% CPU

On 16.08.2017 at 11:02, Konstantin V. Gavrilenko wrote:
> Could be similar issue as what I had recently, with the RAID5 and 256kb chunk 
> size.
> please provide more information about your RAID setup.

Hope this helps:

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath]
[raid0] [raid10]
md0 : active raid5 sdd1[1] sdf1[4] sdc1[0] sde1[2]
  11717406720 blocks super 1.2 level 5, 512k chunk, algorithm 2
[4/4] []
  bitmap: 6/30 pages [24KB], 65536KB chunk

md2 : active raid5 sdm1[2] sdl1[1] sdk1[0] sdn1[4]
  11717406720 blocks super 1.2 level 5, 512k chunk, algorithm 2
[4/4] []
  bitmap: 7/30 pages [28KB], 65536KB chunk

md1 : active raid5 sdi1[2] sdg1[0] sdj1[4] sdh1[1]
  11717406720 blocks super 1.2 level 5, 512k chunk, algorithm 2
[4/4] []
  bitmap: 7/30 pages [28KB], 65536KB chunk

md3 : active raid5 sdp1[1] sdo1[0] sdq1[2] sdr1[4]
  11717406720 blocks super 1.2 level 5, 512k chunk, algorithm 2
[4/4] []
  bitmap: 6/30 pages [24KB], 65536KB chunk

# btrfs fi usage /vmbackup/
Overall:
Device size:  43.65TiB
Device allocated: 31.98TiB
Device unallocated:   11.67TiB
Device missing:  0.00B
Used: 30.80TiB
Free (estimated): 12.84TiB  (min: 12.84TiB)
Data ratio:   1.00
Metadata ratio:   1.00
Global reserve:  512.00MiB  (used: 0.00B)

Data,RAID0: Size:31.83TiB, Used:30.66TiB
   /dev/md0    7.96TiB
   /dev/md1    7.96TiB
   /dev/md2    7.96TiB
   /dev/md3    7.96TiB

Metadata,RAID0: Size:153.00GiB, Used:141.34GiB
   /dev/md0   38.25GiB
   /dev/md1   38.25GiB
   /dev/md2   38.25GiB
   /dev/md3   38.25GiB

System,RAID0: Size:128.00MiB, Used:2.28MiB
   /dev/md0   

Re: btrfs fi du -s gives Inappropriate ioctl for device

2017-08-16 Thread Piotr Szymaniak
On Mon, Aug 14, 2017 at 05:40:30PM -0600, Chris Murphy wrote:
> On Mon, Aug 14, 2017 at 4:57 PM, Piotr Szymaniak  wrote:
> 
> >
> > and... some issues:
> > ~ # btrfs fi du -s /mnt/red/\@backup/
> >  Total   Exclusive  Set shared  Filename
> > ERROR: cannot check space of '/mnt/red/@backup/': Inappropriate ioctl for 
> > device
> 
> 
> It's a bug, but I don't know if any devs are working on a fix yet.
> 
> The problem is that the subvolume being snapshot, contains subvolumes.
> The resulting snapshot, contains an empty directory in place of the
> nested subvolume(s), and that is the cause for the error.

Ok, but why, on the same btrfs, does it work on some subvols with subvols and
not on other subvols with subvols? If it does not work - OK, if it works -
OK, but that seems a bit... random?

~ # btrfs fi du -s /mnt/red/\@backup/ 
/mnt/red/\@backup/.snapshot/monthly_2017-08-01_05\:30\:01/ /mnt/red/\@svn/ 
/mnt/red/\@svn/.snapshot/weekly_2017-08-05_04\:20\:02/
 Total   Exclusive  Set shared  Filename
ERROR: cannot check space of '/mnt/red/@backup/': Inappropriate ioctl for device
ERROR: cannot check space of 
'/mnt/red/@backup/.snapshot/monthly_2017-08-01_05:30:01/': Inappropriate ioctl 
for device
  52.23GiB   10.57MiB     4.13GiB  /mnt/red/@svn/
   4.35GiB 1.03MiB 4.12GiB  
/mnt/red/@svn/.snapshot/weekly_2017-08-05_04:20:02/
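One way to check which of those top-level subvolumes actually contain nested
subvolumes (and therefore get the placeholder empty directories in their
snapshots) could be something like this, reusing the paths above purely as an
illustration:

~ # btrfs subvolume list -o /mnt/red/@backup
~ # btrfs subvolume list -o /mnt/red/@svn

-o limits the listing to subvolumes directly below the given path.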


Best regards,
Piotr Szymaniak.




Re: slow btrfs with a single kworker process using 100% CPU

2017-08-16 Thread Stefan Priebe - Profihost AG
On 16.08.2017 at 11:02, Konstantin V. Gavrilenko wrote:
> Could be a similar issue to what I had recently, with the RAID5 and 256kb chunk 
> size.
> please provide more information about your RAID setup.

Hope this helps:

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath]
[raid0] [raid10]
md0 : active raid5 sdd1[1] sdf1[4] sdc1[0] sde1[2]
  11717406720 blocks super 1.2 level 5, 512k chunk, algorithm 2
[4/4] []
  bitmap: 6/30 pages [24KB], 65536KB chunk

md2 : active raid5 sdm1[2] sdl1[1] sdk1[0] sdn1[4]
  11717406720 blocks super 1.2 level 5, 512k chunk, algorithm 2
[4/4] []
  bitmap: 7/30 pages [28KB], 65536KB chunk

md1 : active raid5 sdi1[2] sdg1[0] sdj1[4] sdh1[1]
  11717406720 blocks super 1.2 level 5, 512k chunk, algorithm 2
[4/4] []
  bitmap: 7/30 pages [28KB], 65536KB chunk

md3 : active raid5 sdp1[1] sdo1[0] sdq1[2] sdr1[4]
  11717406720 blocks super 1.2 level 5, 512k chunk, algorithm 2
[4/4] []
  bitmap: 6/30 pages [24KB], 65536KB chunk

# btrfs fi usage /vmbackup/
Overall:
Device size:  43.65TiB
Device allocated: 31.98TiB
Device unallocated:   11.67TiB
Device missing:  0.00B
Used: 30.80TiB
Free (estimated): 12.84TiB  (min: 12.84TiB)
Data ratio:   1.00
Metadata ratio:   1.00
Global reserve:  512.00MiB  (used: 0.00B)

Data,RAID0: Size:31.83TiB, Used:30.66TiB
   /dev/md0    7.96TiB
   /dev/md1    7.96TiB
   /dev/md2    7.96TiB
   /dev/md3    7.96TiB

Metadata,RAID0: Size:153.00GiB, Used:141.34GiB
   /dev/md0   38.25GiB
   /dev/md1   38.25GiB
   /dev/md2   38.25GiB
   /dev/md3   38.25GiB

System,RAID0: Size:128.00MiB, Used:2.28MiB
   /dev/md0   32.00MiB
   /dev/md1   32.00MiB
   /dev/md2   32.00MiB
   /dev/md3   32.00MiB

Unallocated:
   /dev/md02.92TiB
   /dev/md12.92TiB
   /dev/md22.92TiB
   /dev/md32.92TiB


Stefan

> 
> p.s.
> you can also check the thread "Btrfs + compression = slow performance and high 
> cpu usage"
> 
> - Original Message -
> From: "Stefan Priebe - Profihost AG" 
> To: "Marat Khalili" , linux-btrfs@vger.kernel.org
> Sent: Wednesday, 16 August, 2017 10:37:43 AM
> Subject: Re: slow btrfs with a single kworker process using 100% CPU
> 
> On 16.08.2017 at 08:53, Marat Khalili wrote:
>>> I've one system where a single kworker process is using 100% CPU;
>>> sometimes a second process comes up with 100% CPU [btrfs-transacti]. Is
>>> there anything I can do to get the old speed again or find the culprit?
>>
>> 1. Do you use quotas (qgroups)?
> 
> No qgroups and no quota.
> 
>> 2. Do you have a lot of snapshots? Have you deleted some recently?
> 
> 1413 Snapshots. I'm deleting 50 of them every night. But btrfs-cleaner
> process isn't running / consuming CPU currently.
> 
>> More info about your system would help too.
> Kernel is OpenSuSE Leap 42.3.
> 
> btrfs is mounted with
> compress-force=zlib
> 
> btrfs is running as a raid0 on top of 4 md raid 5 devices.
> 
> Greets,
> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: slow btrfs with a single kworker process using 100% CPU

2017-08-16 Thread Konstantin V. Gavrilenko
Could be a similar issue to what I had recently, with the RAID5 and 256kb chunk 
size.

please provide more information about your RAID setup.

p.s.
you can also check the thread "Btrfs + compression = slow performance and high 
cpu usage"

- Original Message -
From: "Stefan Priebe - Profihost AG" 
To: "Marat Khalili" , linux-btrfs@vger.kernel.org
Sent: Wednesday, 16 August, 2017 10:37:43 AM
Subject: Re: slow btrfs with a single kworker process using 100% CPU

On 16.08.2017 at 08:53, Marat Khalili wrote:
>> I've one system where a single kworker process is using 100% CPU;
>> sometimes a second process comes up with 100% CPU [btrfs-transacti]. Is
>> there anything I can do to get the old speed again or find the culprit?
> 
> 1. Do you use quotas (qgroups)?

No qgroups and no quota.

> 2. Do you have a lot of snapshots? Have you deleted some recently?

1413 Snapshots. I'm deleting 50 of them every night. But btrfs-cleaner
process isn't running / consuming CPU currently.

> More info about your system would help too.
Kernel is OpenSuSE Leap 42.3.

btrfs is mounted with
compress-force=zlib

btrfs is running as a raid0 on top of 4 md raid 5 devices.

Greets,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: slow btrfs with a single kworker process using 100% CPU

2017-08-16 Thread Stefan Priebe - Profihost AG
On 16.08.2017 at 08:53, Marat Khalili wrote:
>> I've one system where a single kworker process is using 100% CPU;
>> sometimes a second process comes up with 100% CPU [btrfs-transacti]. Is
>> there anything I can do to get the old speed again or find the culprit?
> 
> 1. Do you use quotas (qgroups)?

No qgroups and no quota.

> 2. Do you have a lot of snapshots? Have you deleted some recently?

1413 Snapshots. I'm deleting 50 of them every night. But btrfs-cleaner
process isn't running / consuming CPU currently.

> More info about your system would help too.
Kernel is OpenSuSE Leap 42.3.

btrfs is mounted with
compress-force=zlib

btrfs is running as a raid0 on top of 4 md raid 5 devices.
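For reference, that layout corresponds to roughly the following (the mount
point /vmbackup is taken from the fi usage output elsewhere in the thread; the
rest is an assumption, not the exact commands used):

# mkfs.btrfs -d raid0 -m raid0 /dev/md0 /dev/md1 /dev/md2 /dev/md3
# mount -o compress-force=zlib /dev/md0 /vmbackup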

Greets,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qcow2 images make scrub believe the filesystem is corrupted.

2017-08-16 Thread Qu Wenruo
BTW, to determine whether it's really data corruption, you could check the data 
checksums by executing "btrfs check --check-data-csum".


--check-data-csum has a limitation: it skips the remaining mirrors if 
the first mirror is correct, but since your data is single, such a 
limitation is not a problem at all.
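For example (with the filesystem unmounted; /dev/sda3 being the device reported
in the csum warnings):

# btrfs check --check-data-csum /dev/sda3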


Or, you could also try the out-of-tree btrfs-progs with offline scrub 
support:

https://github.com/adam900710/btrfs-progs/tree/offline_scrub

It should be much like a kernel scrub equivalent in btrfs-progs.
Using "btrfs scrub start --offline <device>" should be able to 
verify all checksums for data and metadata.


If btrfs-progs reports a csum error (for data), then it's really 
corrupted, and quite possibly caused by the discard mount option.


Thanks,
Qu

On 2017-08-16 10:28, Qu Wenruo wrote:



On 2017-08-16 09:51, Paulo Dias wrote:

Hi, thanks for the quick answer.

So, since i wrote this i tested this even further.

First, and as you predicted, if i try to cp the file to another
location i get read errors:

root@kerberos:/home/groo# cp Fedora/Fedora.qcow2 /
cp: error reading 'Fedora/Fedora.qcow2': Input/output error


Scrub is less likely to be at fault now.
Since the normal read routine also reports such an error, it may be a real 
corruption of the file.




so i used this trick:

# modprobe nbd
# qemu-nbd --connect=/dev/nbd0 Fedora2.qcow2
# ddrescue /dev/nbd0 new_file.raw
# qemu-nbd --disconnect /dev/nbd0
# qemu-img convert -O qcow2 new_file.raw new_file.qcow2
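
(qemu-nbd --connect exposes the qcow2 image as the block device /dev/nbd0,
ddrescue then copies whatever is still readable into a raw image while skipping
the unreadable sectors instead of aborting the way cp does, and qemu-img finally
converts the rescued raw image back to qcow2.)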

and sure enough i was able to recreate the qcow2 but with these errors:

ago 15 22:19:49 kerberos kernel: block nbd0: Other side returned error 
(5)

ago 15 22:19:49 kerberos kernel: print_req_error: I/O error, dev nbd0,
sector 22159872
ago 15 22:19:49 kerberos kernel: BTRFS warning (device sda3): csum
failed root 258 ino 968837 off 17455849472 csum 0xcc028588 expected
csum 0xe3338de1 mirror 1


Still csum error.
Furthermore, neither the expected nor the on-disk csum is a special value 
like the crc32 of an all-zero page.

So it may mean that it's a real corruption.

ago 15 22:19:49 kerberos kernel: block nbd0: Other side returned error 
(5)

ago 15 22:19:49 kerberos kernel: print_req_error: I/O error, dev nbd0,
sector 22160016
ago 15 22:19:49 kerberos kernel: Buffer I/O error on dev nbd0, logical
block 2770002, async page read
ago 15 22:19:49 kerberos kernel: BTRFS warning (device sda3): csum
failed root 258 ino 968837 off 17455849472 csum 0xcc028588 expected
csum 0xe3338de1 mirror 1


At least, we now know which inode (968837 of root 258) and file offset 
(17455849472, length 4K) are corrupted.
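
If you want to map that back to a path, something along these lines should do
it (assuming the subvolume with id 258 is the one mounted at /home; adjust the
path otherwise):

# btrfs inspect-internal inode-resolve 968837 /home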


ago 15 22:19:49 kerberos kernel: block nbd0: Other side returned error 
(5)

ago 15 22:19:49 kerberos kernel: print_req_error: I/O error, dev nbd0,
sector 22160016
ago 15 22:19:49 kerberos kernel: Buffer I/O error on dev nbd0, logical
block 2770002, async page read
ago 15 22:20:47 kerberos kernel: BTRFS warning (device sda3): csum
failed root 258 ino 968837 off 17455849472 csum 0xcc028588 expected
csum 0xe3338de1 mirror 1
ago 15 22:20:47 kerberos kernel: BTRFS warning (device sda3): csum
failed root 258 ino 968837 off 17455849472 csum 0xcc028588 expected
csum 0xe3338de1 mirror 1



block 2770002, async page read
ago 15 22:21:32 kerberos kernel: block nbd0: NBD_DISCONNECT
ago 15 22:21:32 kerberos kernel: block nbd0: shutting down sockets

i deleted the original Fedora.qcow2 and again scrub said i didn't have
any errors, so i wondered: could it be the raid1 code (long shot)? So
i moved the metadata back to DUP.

btrfs fi balance start -dconvert=single -mconvert=dup /home/


OK, data is not touched.
Single to single, so data chunks are not touched.
And your metadata is always good, so no problem should happen during 
balance.


BTW, if you balance the data (no need to do a convert, just balance all 
data chunks), it should also report an error if my assumption is correct:

Some data is *really* corrupted.
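
For instance, a plain data balance with no filters, which relocates every data
chunk (mount point as in your usage output):

# btrfs balance start -d /home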



root@kerberos:/home/groo# btrfs filesystem usage -T /home/
Overall:
 Device size: 333.50GiB
 Device allocated: 18.06GiB
 Device unallocated:  315.44GiB
 Device missing:  0.00B
 Used: 16.25GiB
 Free (estimated):315.83GiB  (min: 158.11GiB)
 Data ratio:   1.00
 Metadata ratio:   2.00
 Global reserve:   39.45MiB  (used: 0.00B)

  Data Metadata  System
Id Path  single   DUP   DUP  Unallocated
-- -  -  ---
  1 /dev/sda3 16.00GiB   2.00GiB 64.00MiB   181.94GiB
  2 /dev/sdb7- --   133.03GiB
  3 /dev/sdb8- --   488.13MiB
-- -  -  ---
Total 16.00GiB   1.00GiB 32.00MiB   315.44GiB
Used  15.61GiB 329.27MiB 16.00KiB

and once again copied the NEW Fedora.qcow2 back to home and reran scrub,
and once again i got errors:

root@kerberos:/home/groo# btrfs scrub start -B 

Re: slow btrfs with a single kworker process using 100% CPU

2017-08-16 Thread Marat Khalili

I've one system where a single kworker process is using 100% CPU;
sometimes a second process comes up with 100% CPU [btrfs-transacti]. Is
there anything I can do to get the old speed again or find the culprit?


1. Do you use quotas (qgroups)?

2. Do you have a lot of snapshots? Have you deleted some recently?

More info about your system would help too.
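
A couple of quick ways to answer those (the path is a placeholder for the
filesystem in question):

# btrfs qgroup show /path/to/fs   (fails if quotas are not enabled)
# btrfs subvolume list -s /path/to/fs | wc -l   (rough snapshot count)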

--

With Best Regards,
Marat Khalili

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


slow btrfs with a single kworker process using 100% CPU

2017-08-16 Thread Stefan Priebe - Profihost AG
Hello,

I've one system where a single kworker process is using 100% CPU;
sometimes a second process comes up with 100% CPU [btrfs-transacti]. Is
there anything I can do to get the old speed again or find the culprit?
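
For reference, one way to see where such a kworker is spending its time (the
PID is a placeholder for the busy thread found via top -H):

# cat /proc/<PID>/stack

or trigger a task dump with "echo t > /proc/sysrq-trigger" and look for the
btrfs call chains in dmesg.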

Greets,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html