Re: BTRFS did its job nicely (thanks!)

2018-11-05 Thread Chris Murphy
On Mon, Nov 5, 2018 at 6:27 AM, Austin S. Hemmelgarn wrote:
> On 11/4/2018 11:44 AM, waxhead wrote:
>>
>> Sterling Windmill wrote:
>>>
>>> Out of curiosity, what led to you choosing RAID1 for data but RAID10
>>> for metadata?
>>>
>>> I've flip-flopped between these two modes myself after finding out
>>> that BTRFS RAID10 doesn't work the way I would've expected.
>>>
>>> Wondering what made you choose your configuration.
>>>
>>> Thanks!
>>> Sure,
>>
>>
>> The "RAID"1 profile for data was chosen to maximize disk space utilization
>> since I got a lot of mixed size devices.
>>
>> The "RAID"10 profile for metadata was chosen simply because it *feels* a
>> bit faster for some of my (previous) workload which was reading a lot of
>> small files (which I guess was embedded in the metadata). While I never
>> remembered that I got any measurable performance increase the system simply
>> felt smoother (which is strange since "RAID"10 should hog more disks at
>> once).
>>
>> I would love to try "RAID"10 for both data and metadata, but I have to
>> delete some files first (or add yet another drive).
>>
>> Would you like to elaborate a bit more yourself about how BTRFS "RAID"10
>> does not work as you expected?
>>
>> As far as I know, BTRFS' version of "RAID"10 ensures that 2 copies (1
>> replica) are striped over as many disks as it can (as long as there is
>> free space).
>>
>> So if I am not terribly mistaken, a "RAID"10 with 20 devices will stripe
>> over (20/2) x 2, and if you run out of space on 10 of the devices it will
>> continue to stripe over (5/2) x 2. So your stripe width essentially varies
>> with the available space... I may be terribly wrong about this (until
>> someone corrects me, that is...)
>
> He's probably referring to the fact that instead of there being a roughly
> 50% chance of it surviving the failure of at least 2 devices like classical
> RAID10 is technically able to do, it's currently functionally 100% certain
> it won't survive more than one device failing.

Right. Classic RAID10 is *two block device* copies, where you have
mirror1 drives and mirror2 drives; each mirror pair becomes a single
virtual block device, and those virtual devices are then striped
across. If you lose a single mirror1 drive, its mirror2 data is still
available and statistically unlikely to also go away.

Whereas with Btrfs raid10, it's *two block group* copies, and it is
the block group that's striped. That means block group copy 1 is
striped across half the available drives (at the time the bg is
allocated), and block group copy 2 is striped across the other half.
When a drive dies, there is no single remaining drive that contains
all the missing copies; they're distributed. Which means that in a
two-drive failure you've got a very good chance of losing both copies
of some metadata, some data, or both. While I'm not certain it's 100%
unsurvivable, the real gotcha is that it's possible, maybe even
likely, that the filesystem will mount and seem to work fine, but as
soon as it hits a block group with both copies missing, it'll
face-plant.
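
If you want to see that layout for yourself, something along these
lines works (a rough sketch only; the device path is just an example,
and the exact output format depends on your btrfs-progs version):

# Dump the chunk tree (read-only) and look at the CHUNK_ITEM records:
# each one lists num_stripes plus the devid every stripe lives on, so
# you can see the two copies are spread over many devices rather than
# being pinned to fixed mirror pairs.
btrfs inspect-internal dump-tree -t chunk /dev/sde1 | less

# Or just eyeball how wide each chunk is:
btrfs inspect-internal dump-tree -t chunk /dev/sde1 | grep num_stripes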


-- 
Chris Murphy


Re: BTRFS did its job nicely (thanks!)

2018-11-05 Thread Austin S. Hemmelgarn

On 11/4/2018 11:44 AM, waxhead wrote:

Sterling Windmill wrote:

Out of curiosity, what led to you choosing RAID1 for data but RAID10
for metadata?

I've flip-flopped between these two modes myself after finding out
that BTRFS RAID10 doesn't work the way I would've expected.

Wondering what made you choose your configuration.

Thanks!
Sure,


The "RAID"1 profile for data was chosen to maximize disk space 
utilization since I got a lot of mixed size devices.


The "RAID"10 profile for metadata was chosen simply because it *feels* a 
bit faster for some of my (previous) workload which was reading a lot of 
small files (which I guess was embedded in the metadata). While I never 
remembered that I got any measurable performance increase the system 
simply felt smoother (which is strange since "RAID"10 should hog more 
disks at once).


I would love to try "RAID"10 for both data and metadata, but I have to 
delete some files first (or add yet another drive).


Would you like to elaborate a bit more yourself about how BTRFS "RAID"10 
does not work as you expected?


As far as I know, BTRFS' version of "RAID"10 ensures that 2 copies (1
replica) are striped over as many disks as it can (as long as there is
free space).


So if I am not terribly mistaken, a "RAID"10 with 20 devices will stripe
over (20/2) x 2, and if you run out of space on 10 of the devices it will
continue to stripe over (5/2) x 2. So your stripe width essentially varies
with the available space... I may be terribly wrong about this (until
someone corrects me, that is...)

He's probably referring to the fact that instead of there being a 
roughly 50% chance of it surviving the failure of at least 2 devices 
like classical RAID10 is technically able to do, it's currently 
functionally 100% certain it won't survive more than one device failing.




Re: BTRFS did its job nicely (thanks!)

2018-11-04 Thread waxhead

Sterling Windmill wrote:

Out of curiosity, what led to you choosing RAID1 for data but RAID10
for metadata?

I've flip-flopped between these two modes myself after finding out
that BTRFS RAID10 doesn't work the way I would've expected.

Wondering what made you choose your configuration.

Thanks!
Sure,


The "RAID"1 profile for data was chosen to maximize disk space 
utilization since I got a lot of mixed size devices.


The "RAID"10 profile for metadata was chosen simply because it *feels* a 
bit faster for some of my (previous) workload which was reading a lot of 
small files (which I guess was embedded in the metadata). While I never 
remembered that I got any measurable performance increase the system 
simply felt smoother (which is strange since "RAID"10 should hog more 
disks at once).


I would love to try "RAID"10 for both data and metadata, but I have to 
delete some files first (or add yet another drive).
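
When I do get around to it, my understanding is that the switch is just
an in-place balance with convert filters, roughly like the untested
sketch below (flags per the btrfs-balance manpage; it needs enough
unallocated space on enough devices to allocate new raid10 chunks):

# Rewrite existing data and metadata block groups as raid10.
# This touches everything, so it can take a long time.
btrfs balance start -dconvert=raid10 -mconvert=raid10 /

# Check progress and the resulting profiles afterwards.
btrfs balance status /
btrfs fi us -T /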


Would you like to elaborate a bit more yourself about how BTRFS "RAID"10 
does not work as you expected?


As far as I know, BTRFS' version of "RAID"10 ensures that 2 copies (1
replica) are striped over as many disks as it can (as long as there is
free space).


So if I am not terribly mistaken, a "RAID"10 with 20 devices will stripe
over (20/2) x 2, and if you run out of space on 10 of the devices it will
continue to stripe over (5/2) x 2. So your stripe width essentially varies
with the available space... I may be terribly wrong about this (until
someone corrects me, that is...)




Re: BTRFS did its job nicely (thanks!)

2018-11-04 Thread Sterling Windmill
Out of curiosity, what led to you choosing RAID1 for data but RAID10
for metadata?

I've flip-flopped between these two modes myself after finding out
that BTRFS RAID10 doesn't work the way I would've expected.

Wondering what made you choose your configuration.

Thanks!

On Fri, Nov 2, 2018 at 3:55 PM waxhead wrote:
>
> Hi,
>
> my main computer runs on a 7x SSD BTRFS as rootfs with
> data:RAID1 and metadata:RAID10.
>
> One SSD is probably about to fail, and it seems that BTRFS fixed it
> nicely (thanks everyone!)
>
> I decided to just post the ugly details in case someone just wants to
> have a look. Note that I tend to interpret the btrfs de st / output as
> if the error was NOT fixed even though it seems clear that it was, so I
> think the output is a bit misleading... just saying...
>
>
>
> -- below are the details for those curious (just for fun) ---
>
> scrub status for [YOINK!]
>  scrub started at Fri Nov  2 17:49:45 2018 and finished after
> 00:29:26
>  total bytes scrubbed: 1.15TiB with 1 errors
>  error details: csum=1
>  corrected errors: 1, uncorrectable errors: 0, unverified errors: 0
>
>   btrfs fi us -T /
> Overall:
>  Device size:   1.18TiB
>  Device allocated:  1.17TiB
>  Device unallocated:9.69GiB
>  Device missing:  0.00B
>  Used:  1.17TiB
>  Free (estimated):  6.30GiB  (min: 6.30GiB)
>  Data ratio:   2.00
>  Metadata ratio:   2.00
>  Global reserve:  512.00MiB  (used: 0.00B)
>
>   Data  Metadata  System
> Id Path  RAID1 RAID10RAID10Unallocated
> -- - - - - ---
>   6 /dev/sda1 236.28GiB 704.00MiB  32.00MiB   485.00MiB
>   7 /dev/sdb1 233.72GiB   1.03GiB  32.00MiB 2.69GiB
>   2 /dev/sdc1 110.56GiB 352.00MiB -   904.00MiB
>   8 /dev/sdd1 234.96GiB   1.03GiB  32.00MiB 1.45GiB
>   1 /dev/sde1 164.90GiB   1.03GiB  32.00MiB 1.72GiB
>   9 /dev/sdf1 109.00GiB   1.03GiB  32.00MiB   744.00MiB
> 10 /dev/sdg1 107.98GiB   1.03GiB  32.00MiB 1.74GiB
> -- - - - - ---
> Total 598.70GiB   3.09GiB  96.00MiB 9.69GiB
> Used  597.25GiB   1.57GiB 128.00KiB
>
>
>
> uname -a
> Linux main 4.18.0-2-amd64 #1 SMP Debian 4.18.10-2 (2018-10-07) x86_64
> GNU/Linux
>
> btrfs --version
> btrfs-progs v4.17
>
>
> dmesg | grep -i btrfs
> [7.801817] Btrfs loaded, crc32c=crc32c-generic
> [8.163288] BTRFS: device label btrfsroot devid 10 transid 669961
> /dev/sdg1
> [8.163433] BTRFS: device label btrfsroot devid 9 transid 669961
> /dev/sdf1
> [8.163591] BTRFS: device label btrfsroot devid 1 transid 669961
> /dev/sde1
> [8.163734] BTRFS: device label btrfsroot devid 8 transid 669961
> /dev/sdd1
> [8.163974] BTRFS: device label btrfsroot devid 2 transid 669961
> /dev/sdc1
> [8.164117] BTRFS: device label btrfsroot devid 7 transid 669961
> /dev/sdb1
> [8.164262] BTRFS: device label btrfsroot devid 6 transid 669961
> /dev/sda1
> [8.206174] BTRFS info (device sde1): disk space caching is enabled
> [8.206236] BTRFS info (device sde1): has skinny extents
> [8.348610] BTRFS info (device sde1): enabling ssd optimizations
> [8.854412] BTRFS info (device sde1): enabling free space tree
> [8.854471] BTRFS info (device sde1): using free space tree
> [   68.170580] BTRFS warning (device sde1): csum failed root 3760 ino
> 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
> [   68.185973] BTRFS warning (device sde1): csum failed root 3760 ino
> 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
> [   68.185991] BTRFS warning (device sde1): csum failed root 3760 ino
> 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
> [   68.186003] BTRFS warning (device sde1): csum failed root 3760 ino
> 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
> [   68.186015] BTRFS warning (device sde1): csum failed root 3760 ino
> 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
> [   68.186028] BTRFS warning (device sde1): csum failed root 3760 ino
> 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
> [   68.186041] BTRFS warning (device sde1): csum failed root 3760 ino
> 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
> [   68.186052] BTRFS warning (device sde1): csum failed root 3760 ino
> 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
> [   68.186063] BTRFS warning (device sde1): csum failed root 3760 ino
> 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
> [   68.186075] BTRFS warning (device sde1): csum failed root 3760 ino
> 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
> [   68.199237] 

Re: BTRFS did its job nicely (thanks!)

2018-11-04 Thread waxhead

Duncan wrote:

waxhead posted on Fri, 02 Nov 2018 20:54:40 +0100 as excerpted:


Note that I tend to interpret the btrfs de st / output as if the error
was NOT fixed even though it seems clear that it was, so I think the
output is a bit misleading... just saying...


See the btrfs-device manpage, stats subcommand, -z|--reset option, and
device stats section:

-z|--reset
Print the stats and reset the values to zero afterwards.

DEVICE STATS
The device stats keep persistent record of several error classes related
to doing IO. The current values are printed at mount time and
updated during filesystem lifetime or from a scrub run.


So stats keeps a count of historic errors and is only reset when you
specifically reset it, *NOT* when the error is fixed.

Yes, I am perfectly aware of all that. The issue I have is that the
manpage describes corruption errors as "A block checksum mismatched or
corrupted metadata header was found". This does not tell me whether it
was a permanent corruption or whether it was fixed. That is why I think
the output is a bit misleading (and I should have said that more clearly).


My point is that btrfs device stats /mnt would be a lot easier to read
and understand if it distinguished between permanent corruption (i.e.
unfixable errors) and fixed errors.
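
As it stands, the closest I get is cross-checking the two outputs by
hand, roughly like this (paths only as examples):

# scrub does make the distinction: corrected vs uncorrectable errors.
btrfs scrub status /

# device stats only gives cumulative per-device counters
# (write_io_errs, read_io_errs, flush_io_errs, corruption_errs,
# generation_errs) with no fixed/unfixed split.
btrfs device stats /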



(There's actually a recent patch, I believe in the current dev kernel
4.20/5.0, that will reset a device's stats automatically for the btrfs
replace case when it's actually a different device afterward anyway.
Apparently, it doesn't even do /that/ automatically yet.  Keep that in
mind if you replace that device.)

Oh, thanks for the heads-up. I was under the impression that the device
stats were tracked by btrfs devid, but apparently they are (were) not.
Good to know!


Re: BTRFS did its job nicely (thanks!)

2018-11-03 Thread Duncan
waxhead posted on Fri, 02 Nov 2018 20:54:40 +0100 as excerpted:

> Note that I tend to interpret the btrfs de st / output as if the error
> was NOT fixed even though it seems clear that it was, so I think the
> output is a bit misleading... just saying...

See the btrfs-device manpage, stats subcommand, -z|--reset option, and 
device stats section:

-z|--reset
Print the stats and reset the values to zero afterwards.

DEVICE STATS
The device stats keep persistent record of several error classes related 
to doing IO. The current values are printed at mount time and
updated during filesystem lifetime or from a scrub run.


So stats keeps a count of historic errors and is only reset when you 
specifically reset it, *NOT* when the error is fixed.
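
So once a scrub comes back clean, you can zero the baseline yourself,
along these lines (an untested sketch; see the manpage excerpt above):

# Print the current counters and reset them to zero in one go, so the
# next non-zero value means a *new* error.
btrfs device stats -z /

# Works per device too, e.g. for the suspect SSD only.
btrfs device stats -z /dev/sda1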

(There's actually a recent patch, I believe in the current dev kernel 
4.20/5.0, that will reset a device's stats automatically for the btrfs 
replace case when it's actually a different device afterward anyway.  
Apparently, it doesn't even do /that/ automatically yet.  Keep that in 
mind if you replace that device.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



BTRFS did its job nicely (thanks!)

2018-11-02 Thread waxhead

Hi,

my main computer runs on a 7x SSD BTRFS as rootfs with
data:RAID1 and metadata:RAID10.

One SSD is probably about to fail, and it seems that BTRFS fixed it 
nicely (thanks everyone!)


I decided to just post the ugly details in case someone just wants to
have a look. Note that I tend to interpret the btrfs de st / output as
if the error was NOT fixed even though it seems clear that it was, so I
think the output is a bit misleading... just saying...
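
(For reference, "btrfs de st /" is short for btrfs device stats /;
below is roughly the shape of its output -- the field names are real,
the numbers are only illustrative, not a paste of my actual counters:)

btrfs device stats /
# [/dev/sda1].write_io_errs    0
# [/dev/sda1].read_io_errs     0
# [/dev/sda1].flush_io_errs    0
# [/dev/sda1].corruption_errs  1
# [/dev/sda1].generation_errs  0
# ...and likewise for the other six devices.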




-- below are the details for those curious (just for fun) ---

scrub status for [YOINK!]
scrub started at Fri Nov  2 17:49:45 2018 and finished after 
00:29:26

total bytes scrubbed: 1.15TiB with 1 errors
error details: csum=1
corrected errors: 1, uncorrectable errors: 0, unverified errors: 0

 btrfs fi us -T /
Overall:
Device size:   1.18TiB
Device allocated:  1.17TiB
Device unallocated:9.69GiB
Device missing:  0.00B
Used:  1.17TiB
Free (estimated):  6.30GiB  (min: 6.30GiB)
Data ratio:   2.00
Metadata ratio:   2.00
Global reserve:  512.00MiB  (used: 0.00B)

 Data  Metadata  System
Id Path  RAID1 RAID10RAID10Unallocated
-- - - - - ---
 6 /dev/sda1 236.28GiB 704.00MiB  32.00MiB   485.00MiB
 7 /dev/sdb1 233.72GiB   1.03GiB  32.00MiB 2.69GiB
 2 /dev/sdc1 110.56GiB 352.00MiB -   904.00MiB
 8 /dev/sdd1 234.96GiB   1.03GiB  32.00MiB 1.45GiB
 1 /dev/sde1 164.90GiB   1.03GiB  32.00MiB 1.72GiB
 9 /dev/sdf1 109.00GiB   1.03GiB  32.00MiB   744.00MiB
10 /dev/sdg1 107.98GiB   1.03GiB  32.00MiB 1.74GiB
-- - - - - ---
   Total 598.70GiB   3.09GiB  96.00MiB 9.69GiB
   Used  597.25GiB   1.57GiB 128.00KiB



uname -a
Linux main 4.18.0-2-amd64 #1 SMP Debian 4.18.10-2 (2018-10-07) x86_64 
GNU/Linux


btrfs --version
btrfs-progs v4.17


dmesg | grep -i btrfs
[7.801817] Btrfs loaded, crc32c=crc32c-generic
[8.163288] BTRFS: device label btrfsroot devid 10 transid 669961 
/dev/sdg1
[8.163433] BTRFS: device label btrfsroot devid 9 transid 669961 
/dev/sdf1
[8.163591] BTRFS: device label btrfsroot devid 1 transid 669961 
/dev/sde1
[8.163734] BTRFS: device label btrfsroot devid 8 transid 669961 
/dev/sdd1
[8.163974] BTRFS: device label btrfsroot devid 2 transid 669961 
/dev/sdc1
[8.164117] BTRFS: device label btrfsroot devid 7 transid 669961 
/dev/sdb1
[8.164262] BTRFS: device label btrfsroot devid 6 transid 669961 
/dev/sda1

[8.206174] BTRFS info (device sde1): disk space caching is enabled
[8.206236] BTRFS info (device sde1): has skinny extents
[8.348610] BTRFS info (device sde1): enabling ssd optimizations
[8.854412] BTRFS info (device sde1): enabling free space tree
[8.854471] BTRFS info (device sde1): using free space tree
[   68.170580] BTRFS warning (device sde1): csum failed root 3760 ino 
3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[   68.185973] BTRFS warning (device sde1): csum failed root 3760 ino 
3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[   68.185991] BTRFS warning (device sde1): csum failed root 3760 ino 
3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[   68.186003] BTRFS warning (device sde1): csum failed root 3760 ino 
3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[   68.186015] BTRFS warning (device sde1): csum failed root 3760 ino 
3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[   68.186028] BTRFS warning (device sde1): csum failed root 3760 ino 
3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[   68.186041] BTRFS warning (device sde1): csum failed root 3760 ino 
3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[   68.186052] BTRFS warning (device sde1): csum failed root 3760 ino 
3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[   68.186063] BTRFS warning (device sde1): csum failed root 3760 ino 
3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[   68.186075] BTRFS warning (device sde1): csum failed root 3760 ino 
3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[   68.199237] BTRFS info (device sde1): read error corrected: ino 
3247424 off 36700160 (dev /dev/sda1 sector 244987192)
[   68.202602] BTRFS info (device sde1): read error corrected: ino 
3247424 off 36704256 (dev /dev/sda1 sector 244987192)
[   68.203176] BTRFS info (device sde1): read error corrected: ino 
3247424 off 36712448 (dev /dev/sda1 sector 244987192)
[   68.206762] BTRFS info (device sde1): read error corrected: ino 
3247424 off 36708352 (dev /dev/sda1 sector 244987192)
[   68.212071] BTRFS info