Re: BTRFS did it's job nicely (thanks!)

2018-11-04 Thread waxhead

Sterling Windmill wrote:

Out of curiosity, what led to you choosing RAID1 for data but RAID10
for metadata?

I've flip-flopped between these two modes myself after finding out
that BTRFS RAID10 doesn't work how I would've expected.

Wondering what made you choose your configuration.

Thanks!
Sure,


The "RAID"1 profile for data was chosen to maximize disk space 
utilization, since I have a lot of mixed-size devices.


The "RAID"10 profile for metadata was chosen simply because it *feels* a 
bit faster for some of my (previous) workload, which involved reading a lot 
of small files (which I guess were embedded in the metadata). While I don't 
recall any measurable performance increase, the system simply felt smoother 
(which is strange since "RAID"10 should hog more disks at once).


I would love to try "RAID"10 for both data and metadata, but I have to 
delete some files first (or add yet another drive).
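If I ever get around to it, the conversion would presumably just be a
balance with convert filters. A minimal sketch, assuming the filesystem
is mounted at / and has enough unallocated space left for the conversion:

  # convert both data and metadata to the "RAID"10 profile
  btrfs balance start -dconvert=raid10 -mconvert=raid10 /
  # watch progress from another terminal
  btrfs balance status /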


Would you like to elaborate a bit more yourself about how BTRFS "RAID"10 
does not work as you expected?


As far as I know, BTRFS' version of "RAID"10 ensures that 2 copies (1 
replica) are striped over as many disks as it can (as long as there is free 
space).


So if I am not terribly mistaken, a "RAID"10 with 20 devices will stripe 
over (20/2) x 2, and if you run out of space on 10 of the devices it will 
continue to stripe over (10/2) x 2. So your stripe width essentially varies 
with the available space... I may be terribly wrong about this (until 
someone corrects me, that is...)
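If someone wants to check what the allocator actually did, the chunk
tree can be dumped and the stripe count per chunk inspected. A minimal
sketch, assuming the filesystem lives on /dev/sda1 (any member device
should do) and a btrfs-progs with inspect-internal support:

  # dump the chunk tree and peek at the RAID10 chunks
  btrfs inspect-internal dump-tree -t chunk /dev/sda1 | grep -A2 'RAID10'

The num_stripes value printed a couple of lines below each chunk type
should reveal whether the stripe width really shrinks as devices fill up.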




Re: BTRFS did it's job nicely (thanks!)

2018-11-04 Thread waxhead

Duncan wrote:

waxhead posted on Fri, 02 Nov 2018 20:54:40 +0100 as excerpted:


Note that I tend to interpret the btrfs de st / output as if the error
was NOT fixed even if (it seems clear that) it was, so I think the output
is a bit misleading... just saying...


See the btrfs-device manpage, stats subcommand, -z|--reset option, and
device stats section:

-z|--reset
Print the stats and reset the values to zero afterwards.

DEVICE STATS
The device stats keep persistent record of several error classes related
to doing IO. The current values are printed at mount time and
updated during filesystem lifetime or from a scrub run.


So stats keeps a count of historic errors and is only reset when you
specifically reset it, *NOT* when the error is fixed.

Yes, I am perfectly aware of all that. The issue I have is that the 
manpage describes corruption errors as "A block checksum mismatched or 
corrupted metadata header was found". This does not tell me if this was 
a permanent corruption or if it was fixed. That is why I think the 
output is a bit misleading (and I should have said that more clearly).


My point being that btrfs device stats /mnt would have been a lot easier 
to read and understand if it distinguished between permanent corruption, 
i.e. unfixable errors, and fixed errors.
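For what it is worth, the way I cross-check this today is to compare the
scrub summary (which does distinguish corrected from uncorrectable
errors) with the raw counters, and only then clear them. A minimal
sketch, assuming the filesystem is mounted at /:

  # run a scrub in the foreground with per-device statistics
  btrfs scrub start -B -d /
  # the summary lists 'corrected' vs 'uncorrectable' errors explicitly
  btrfs scrub status /
  # historic counters; these stay non-zero even after a successful repair
  btrfs device stats /
  # once satisfied that the errors were corrected, reset the counters
  btrfs device stats -z /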



(There's actually a recent patch, I believe in the current dev kernel
4.20/5.0, that will reset a device's stats automatically for the btrfs
replace case when it's actually a different device afterward anyway.
Apparently, it doesn't even do /that/ automatically yet.  Keep that in
mind if you replace that device.)

Oh, thanks for the heads up. I was under the impression that the device 
stats were tracked by btrfs devid, but apparently they are (were) not. Good 
to know!


BTRFS did it's job nicely (thanks!)

2018-11-02 Thread waxhead

Hi,

my main computer runs on a 7x SSD BTRFS as rootfs with
data:RAID1 and metadata:RAID10.

One SSD is probably about to fail, and it seems that BTRFS fixed it 
nicely (thanks everyone!)


I decided to just post the ugly details in case someone wants to 
have a look. Note that I tend to interpret the btrfs de st / output as 
if the error was NOT fixed even if (it seems clear that) it was, so I 
think the output is a bit misleading... just saying...




-- below are the details for those curious (just for fun) ---

scrub status for [YOINK!]
scrub started at Fri Nov  2 17:49:45 2018 and finished after 
00:29:26

total bytes scrubbed: 1.15TiB with 1 errors
error details: csum=1
corrected errors: 1, uncorrectable errors: 0, unverified errors: 0

 btrfs fi us -T /
Overall:
Device size:   1.18TiB
Device allocated:  1.17TiB
Device unallocated:9.69GiB
Device missing:  0.00B
Used:  1.17TiB
Free (estimated):  6.30GiB  (min: 6.30GiB)
Data ratio:   2.00
Metadata ratio:   2.00
Global reserve:  512.00MiB  (used: 0.00B)

 Data  Metadata  System
Id Path  RAID1 RAID10RAID10Unallocated
-- - - - - ---
 6 /dev/sda1 236.28GiB 704.00MiB  32.00MiB   485.00MiB
 7 /dev/sdb1 233.72GiB   1.03GiB  32.00MiB 2.69GiB
 2 /dev/sdc1 110.56GiB 352.00MiB -   904.00MiB
 8 /dev/sdd1 234.96GiB   1.03GiB  32.00MiB 1.45GiB
 1 /dev/sde1 164.90GiB   1.03GiB  32.00MiB 1.72GiB
 9 /dev/sdf1 109.00GiB   1.03GiB  32.00MiB   744.00MiB
10 /dev/sdg1 107.98GiB   1.03GiB  32.00MiB 1.74GiB
-- - - - - ---
   Total 598.70GiB   3.09GiB  96.00MiB 9.69GiB
   Used  597.25GiB   1.57GiB 128.00KiB



uname -a
Linux main 4.18.0-2-amd64 #1 SMP Debian 4.18.10-2 (2018-10-07) x86_64 
GNU/Linux


btrfs --version
btrfs-progs v4.17


dmesg | grep -i btrfs
[7.801817] Btrfs loaded, crc32c=crc32c-generic
[8.163288] BTRFS: device label btrfsroot devid 10 transid 669961 
/dev/sdg1
[8.163433] BTRFS: device label btrfsroot devid 9 transid 669961 
/dev/sdf1
[8.163591] BTRFS: device label btrfsroot devid 1 transid 669961 
/dev/sde1
[8.163734] BTRFS: device label btrfsroot devid 8 transid 669961 
/dev/sdd1
[8.163974] BTRFS: device label btrfsroot devid 2 transid 669961 
/dev/sdc1
[8.164117] BTRFS: device label btrfsroot devid 7 transid 669961 
/dev/sdb1
[8.164262] BTRFS: device label btrfsroot devid 6 transid 669961 
/dev/sda1

[8.206174] BTRFS info (device sde1): disk space caching is enabled
[8.206236] BTRFS info (device sde1): has skinny extents
[8.348610] BTRFS info (device sde1): enabling ssd optimizations
[8.854412] BTRFS info (device sde1): enabling free space tree
[8.854471] BTRFS info (device sde1): using free space tree
[   68.170580] BTRFS warning (device sde1): csum failed root 3760 ino 
3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[   68.185973] BTRFS warning (device sde1): csum failed root 3760 ino 
3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[   68.185991] BTRFS warning (device sde1): csum failed root 3760 ino 
3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[   68.186003] BTRFS warning (device sde1): csum failed root 3760 ino 
3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[   68.186015] BTRFS warning (device sde1): csum failed root 3760 ino 
3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[   68.186028] BTRFS warning (device sde1): csum failed root 3760 ino 
3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[   68.186041] BTRFS warning (device sde1): csum failed root 3760 ino 
3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[   68.186052] BTRFS warning (device sde1): csum failed root 3760 ino 
3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[   68.186063] BTRFS warning (device sde1): csum failed root 3760 ino 
3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[   68.186075] BTRFS warning (device sde1): csum failed root 3760 ino 
3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[   68.199237] BTRFS info (device sde1): read error corrected: ino 
3247424 off 36700160 (dev /dev/sda1 sector 244987192)
[   68.202602] BTRFS info (device sde1): read error corrected: ino 
3247424 off 36704256 (dev /dev/sda1 sector 244987192)
[   68.203176] BTRFS info (device sde1): read error corrected: ino 
3247424 off 36712448 (dev /dev/sda1 sector 244987192)
[   68.206762] BTRFS info (device sde1): read error corrected: ino 
3247424 off 36708352 (dev /dev/sda1 sector 244987192)
[   68.212071] BTRFS info 

BTRFS bad block management. Does it exist?

2018-10-14 Thread waxhead

In case BTRFS fails to WRITE to a disk, what happens?
Does the bad area get mapped out somehow? Does it try again until it 
succeeds, until it "times out", or until it reaches a threshold counter?
Does it eventually try to write to a different disk (in case of using 
the raid1/10 profile)?


Re: lazytime mount option—no support in Btrfs

2018-08-18 Thread waxhead

Adam Hunt wrote:

Back in 2014 Ted Tso introduced the lazytime mount option for ext4 and
shortly thereafter a more generic VFS implementation which was then
merged into mainline. His early patches included support for Btrfs but
those changes were removed prior to the feature being merged. His
changelog includes the following note about the removal:

   - Per Christoph's suggestion, drop support for btrfs and xfs for now,
 issues with how btrfs and xfs handle dirty inode tracking.  We can add
 btrfs and xfs support back later or at the end of this series if we
 want to revisit this decision.

My reading of the current mainline shows that Btrfs still lacks any
support for lazytime. Has any thought been given to adding support for
lazytime to Btrfs?

Thanks,

Adam



Is there any news regarding this?


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-19 Thread waxhead




Hugo Mills wrote:

On Wed, Jul 18, 2018 at 08:39:48AM +, Duncan wrote:

Duncan posted on Wed, 18 Jul 2018 07:20:09 + as excerpted:

Perhaps it's a case of coder's view (no code doing it that way, it's just
a coincidental oddity conditional on equal sizes), vs. sysadmin's view
(code or not, accidental or not, it's a reasonably accurate high-level
description of how it ends up working most of the time with equivalent
sized devices).)


Well, it's an *accurate* observation. It's just not a particularly
*useful* one. :)

Hugo.


A bit off topic perhaps - but I've got to give it a go:
Pretty please with sugar, nuts, a cherry and chocolate sprinkles dipped 
in syrup and coated with ice cream on top, would it not be about time to 
update your online btrfs-usage calculator (which is insanely useful in 
so many ways) to support the new modes!?


In fact it would have been great as - or even better as - a CLI tool.
And yes, a while ago I toyed with porting it to C, mostly for my own use, 
but never got that far.




Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-16 Thread waxhead

waxhead wrote:

David Sterba wrote:

An interesting question is the naming of the extended profiles. I picked
something that can be easily understood but it's not a final proposal.
Years ago, Hugo proposed a naming scheme that described the
non-standard raid varieties of the btrfs flavor:

https://marc.info/?l=linux-btrfs=136286324417767

Switching to this naming would be a good addition to the extended raid.

As just a humble BTRFS user I agree and really think it is about time to 
move far away from the RAID terminology. However adding some more 
descriptive profile names (or at least some aliases) would be much 
better for the commoners (such as myself). 
...snip... > Which would make the above table look like so:


Old format / My Format / My suggested alias
SINGLE  / R0.S0.P0 / SINGLE
DUP / R1.S1.P0 / DUP (or even MIRRORLOCAL1)
RAID0   / R0.Sm.P0 / STRIPE
RAID1   / R1.S0.P0 / MIRROR1
RAID1c3 / R2.S0.P0 / MIRROR2
RAID1c4 / R3.S0.P0 / MIRROR3
RAID10  / R1.Sm.P0 / STRIPE.MIRROR1
RAID5   / R1.Sm.P1 / STRIPE.PARITY1
RAID6   / R1.Sm.P2 / STRIPE.PARITY2

And I think this is much more readable, but others may disagree. And as 
a side note... from a (hobby) coder's perspective this is probably 
simpler to parse as well. 
...snap...


...and before someone else points out that my suggestion has an 
ugly flaw, I got a bit copy/paste happy and messed up the RAID 5- and 
6-like profiles. The table below is corrected and hopefully it makes the 
point why using the word 'replicas' is easier to understand than 
'copies', even if I messed it up :)


Old format / My Format / My suggested alias
SINGLE  / R0.S0.P0 / SINGLE
DUP / R1.S1.P0 / DUP (or even MIRRORLOCAL1)
RAID0   / R0.Sm.P0 / STRIPE
RAID1   / R1.S0.P0 / MIRROR1
RAID1c3 / R2.S0.P0 / MIRROR2
RAID1c4 / R3.S0.P0 / MIRROR3
RAID10  / R1.Sm.P0 / STRIPE.MIRROR1
RAID5   / R0.Sm.P1 / STRIPE.PARITY1
RAID6   / R0.Sm.P2 / STRIPE.PARITY2


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-15 Thread waxhead

David Sterba wrote:

An interesting question is the naming of the extended profiles. I picked
something that can be easily understood but it's not a final proposal.
Years ago, Hugo proposed a naming scheme that described the
non-standard raid varieties of the btrfs flavor:

https://marc.info/?l=linux-btrfs=136286324417767

Switching to this naming would be a good addition to the extended raid.

As just a humble BTRFS user I agree and really think it is about time to 
move far away from the RAID terminology. However adding some more 
descriptive profile names (or at least some aliases) would be much 
better for the commoners (such as myself).


For example:

Old format / New Format / My suggested alias
SINGLE  / 1C / SINGLE
DUP / 2CD/ DUP (or even MIRRORLOCAL1)
RAID0   / 1CmS   / STRIPE
RAID1   / 2C / MIRROR1
RAID1c3 / 3C / MIRROR2
RAID1c4 / 4C / MIRROR3
RAID10  / 2CmS   / STRIPE.MIRROR1
RAID5   / 1CmS1P / STRIPE.PARITY1
RAID6   / 1CmS2P / STRIPE.PARITY2

I find that writing something like "btrfs balance start 
-dconvert=stripe5.parity2 /mnt" is far less confusing and therefore less 
error prone than writing "-dconvert=1C5S2P".
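For comparison, and purely as an illustration, here is the same
conversion spelled with the proposed alias (hypothetical, not
implemented) next to what the balance filters accept today:

  # proposed/aliased syntax from this thread - hypothetical, does not exist
  # btrfs balance start -dconvert=stripe5.parity2 /mnt
  # current, real syntax for the closest existing profile
  btrfs balance start -dconvert=raid6 /mnt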


While Hugo's suggestion is compact and to the point I would call for 
expanding that so it is a bit more descriptive and human readable.


So for example: STRIPE<num>, where <num> obviously is the same as Hugo 
proposed - the number of storage devices for the stripe - and no <num> 
would be best to mean 'use max devices'.

For PARITY, <num> is obviously required.

Keep in mind that most people (...and I am willing to bet even Duncan, 
who probably HAS backups ;) ) get a bit stressed when their storage 
system is degraded. With that in mind I hope for more elaborate, 
descriptive and human readable profile names to be used, to avoid making 
mistakes with the "compact" layout.


...and yes, of course this could go both ways. A more compact (and dare 
I say cryptic) variant can cause people to stop and think before doing 
something and thus avoid errors.


Now that I have made my point I can't help being a bit extra harsh, obnoxious 
and possibly difficult, so I would also suggest that Hugo's format could 
have been changed (dare I say improved?) from


numCOPIESnumSTRIPESnumPARITY

to:

REPLICASnum.STRIPESnum.PARITYnum

Which would make the above table look like so:

Old format / My Format / My suggested alias
SINGLE  / R0.S0.P0 / SINGLE
DUP / R1.S1.P0 / DUP (or even MIRRORLOCAL1)
RAID0   / R0.Sm.P0 / STRIPE
RAID1   / R1.S0.P0 / MIRROR1
RAID1c3 / R2.S0.P0 / MIRROR2
RAID1c4 / R3.S0.P0 / MIRROR3
RAID10  / R1.Sm.P0 / STRIPE.MIRROR1
RAID5   / R1.Sm.P1 / STRIPE.PARITY1
RAID6   / R1.Sm.P2 / STRIPE.PARITY2

And I think this is much more readable, but others may disagree. And as 
a side note... from a (hobby) coder's perspective this is probably 
simpler to parse as well.
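To back up the parsing claim, a minimal sketch in bash, assuming the
hypothetical R<num>.S<num>.P<num> strings from the table above
(parse_profile is just an invented helper name):

  parse_profile() {
      # split a string like "R1.Sm.P0" into its three fields
      local p=$1
      if [[ $p =~ ^R([0-9]+|m)\.S([0-9]+|m)\.P([0-9]+)$ ]]; then
          echo "replicas=${BASH_REMATCH[1]} stripes=${BASH_REMATCH[2]} parity=${BASH_REMATCH[3]}"
      else
          echo "unrecognized profile: $p" >&2
          return 1
      fi
  }
  parse_profile R1.Sm.P0   # prints: replicas=1 stripes=m parity=0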



Re: unsolvable technical issues?

2018-06-27 Thread waxhead




Chris Murphy wrote:

On Thu, Jun 21, 2018 at 5:13 PM, waxhead  wrote:

According to this:

https://stratis-storage.github.io/StratisSoftwareDesign.pdf
Page 4 , section 1.2

It claims that BTRFS still has significant technical issues that may never
be resolved.
Could someone shed some light on exactly what these technical issues might
be?! What are BTRFS' biggest technical problems?



I think it's appropriate to file an issue and ask what they're
referring to. It very well might be use case specific to Red Hat.
https://github.com/stratis-storage/stratis-storage.github.io/issues

I also think it's appropriate to crosslink: include URL for the start
of this thread in the issue, and the issue URL to this thread.




https://github.com/stratis-storage/stratis-storage.github.io/issues/1

Apparently the author has toned down the wording a bit; this confirms 
that the claim was without basis and probably based on "popular myth".

The document the PDF links to is not yet updated.


Re: unsolvable technical issues?

2018-06-25 Thread waxhead




David Sterba wrote:

On Fri, Jun 22, 2018 at 01:13:31AM +0200, waxhead wrote:

According to this:

https://stratis-storage.github.io/StratisSoftwareDesign.pdf
Page 4 , section 1.2

It claims that BTRFS still has significant technical issues that may
never be resolved.
Could someone shed some light on exactly what these technical issues
might be?! What are BTRFS' biggest technical problems?


The subject you write is 'unsolvable', which I read as 'impossible to
solve', eg. on the design level. I'm not aware of such issues.

Alright, so I interpret this as: there are no showstoppers regarding 
the implementation of existing and planned features...



If this is about issues that are difficult either to implement or
getting right, there are a few known ones.

Alright again, and I interpret this as: there might be some code that is 
not flexible enough, and changing it might affect working/stable parts of 
the code, so other solutions are being looked at - which is not that 
uncommon for software. Apart from not listing the known issues, I think I 
got my questions answered :) and now it is perhaps finally appropriate to 
file a request at the Stratis bugtracker to ask what specifically they 
are referring to.



If you forget about the "RAID"5/6-like features then the only annoyances
that I have with BTRFS so far are...

1. Lack of per-subvolume "RAID" levels
2. Not using the deviceid to re-discover and re-add dropped devices

And that's about it really...


This could quickly turn into 'my faviourite bug/feature' list that can
be very long. The most asked for are raid56, and performance of qgroups.

Qu Wenruo improved some of the core problems and Jeff is working on the
performance problem. So there are people working on that.

On the raid56 front, there were some recent updates that fixed some
bugs, but the fix for write hole is still missing so we can't raise the
status yet.  I have some good news but nobody should get too
excited until the code lands.

I have prototype for the N-copy raid (where N is 3 or 4).  This will
provide the underlying infrastructure for the raid5/6 logging mechanism,
the rest can be taken from Liu Bo's patchset sent some time ago.  In the
end the N-copy can be used for data and metadata too, independently and
flexibly switched via the balance filters. This will cost one
incompatibility bit.


I hope I am not asking for too much (but I know I probably am), but I 
suggest that having a small snippet of information on the status page, 
showing a little bit about what is currently the development focus or 
what people are known to be working on, would be very valuable for 
users. It may of course work both ways, such as exciting people or 
calming them down. ;)


For example something simple like a "development focus" list...
2018-Q4: (planned) Renaming the grotesque "RAID" terminology
2018-Q3: (planned) Magical feature X
2018-Q2: N-Way mirroring
2018-Q1: Feature work "RAID"5/6

I think it would be good for people living their lives outside, as it 
would perhaps spark some attention from developers and perhaps even the 
media as well.



Re: unsolvable technical issues?

2018-06-24 Thread waxhead

Jukka Larja wrote:

waxhead wrote 24.6.2018 klo 1.01:

Nikolay Borisov wrote:


On 22.06.2018 02:13, waxhead wrote:

According to this:

https://stratis-storage.github.io/StratisSoftwareDesign.pdf
Page 4 , section 1.2

It claims that BTRFS still has significant technical issues that may
never be resolved.
Could someone shed some light on exactly what these technical issues
might be?! What are BTRFS' biggest technical problems?


That's a question that needs to be directed at the author of the 
statement.


I think not, and here's why: I am asking the BTRFS developers a 
general question, with some basis as to why I became curious. The 
question is simply what (if any) are the biggest technical issues in 
BTRFS, because one must expect that if anyone is going to give me a 
credible answer it must be the people that hack on BTRFS and 
understand what they are working on, and not the Stratis guys. It would 
surprise me if they knew better than the BTRFS devs.


I think the problem with that question is that it is too general. 
Duncan's post already highlights several things that could be a 
significant problem for some users while being a non-issue for most. 
Without a more specific problem description, the best you can hope for is 
speculation on things that Btrfs currently does badly.


-Jukka Larja


Well, I still don't agree (apparently I am starting to become 
difficult). There is a "roadmap" on the BTRFS wiki that describes 
features implemented and features planned, for example. Naturally people 
are working on improvements to existing features and prep-work for new 
features. If some of this work is not moving ahead due to design issues, 
it sounds likely that someone would know about it by now.





Re: unsolvable technical issues?

2018-06-23 Thread waxhead

Nikolay Borisov wrote:



On 22.06.2018 02:13, waxhead wrote:

According to this:

https://stratis-storage.github.io/StratisSoftwareDesign.pdf
Page 4 , section 1.2

It claims that BTRFS still has significant technical issues that may
never be resolved.
Could someone shed some light on exactly what these technical issues
might be?! What are BTRFS' biggest technical problems?


That's a question that needs to be directed at the author of the statement.

I think not, and here's why: I am asking the BTRFS developers a general 
question, with some basis as to why I became curious. The question is 
simply what (if any) are the biggest technical issues in BTRFS, because 
one must expect that if anyone is going to give me a credible answer it 
must be the people that hack on BTRFS and understand what they are 
working on, and not the Stratis guys. It would surprise me if they knew 
better than the BTRFS devs.


And yes absolutely, I do understand why one would want to direct that to 
the author of the statement as this claim is as far as I can tell 
completely without basis at all, and we all know that extraordinary 
claims require extraordinary evidence right? I do however feel that I 
should educate myself a bit on BTRFS to have some sort of basis to work 
on before confronting the stratis guys and risk ending up as the middle 
man in a potential email flame war.


So again, does BTRFS have any *known* major technical obstacles which 
the devs are having a hard time solving? (Duncan already gave the best 
answer so far).


PS! I have a tendency to sound a bit aggressive / harsh. I assure you 
all that it is not my intent. I am simply trying to get some knowledge 
of a filesystem (that interests me a lot) while trying to validate a 
"third party" claim.








unsolvable technical issues?

2018-06-21 Thread waxhead

According to this:

https://stratis-storage.github.io/StratisSoftwareDesign.pdf
Page 4 , section 1.2

It claims that BTRFS still has significant technical issues that may 
never be resolved.
Could someone shed some light on exactly what these technical issues 
might be?! What are BTRFS' biggest technical problems?


If you forget about the "RAID"5/6-like features then the only annoyances 
that I have with BTRFS so far are...


1. Lack of per-subvolume "RAID" levels
2. Not using the deviceid to re-discover and re-add dropped devices

And that's about it really...


Re: RAID56

2018-06-19 Thread waxhead

Gandalf Corvotempesta wrote:

Another kernel release was made.
Any improvements in RAID56?

I didn't see any changes in that sector; is something still being
worked on, or is it stuck waiting for something?

Based on the official BTRFS status page, RAID56 is the only "unstable"
item marked in red.
No interest from Suse in fixing that?

I think it's the real missing part for a feature-complete filesystem.
Nowadays parity raid is mandatory, we can't only rely on mirroring.


First of all: I am not a BTRFS developer, but I follow the mailing list 
closely and I too have a particular interest in the "RAID"5/6 feature 
which realistically is probably about 3-4 years (if not more) in the future.


From what I am able to understand, the pesky write hole is one of the 
major obstacles to having BTRFS "RAID"5/6 work reliably. There were 
patches to fix this a while ago, but whether these patches are to be 
classified as a workaround or actually as "the darn thing done right" is 
perhaps up for discussion.


In general there seems to be a lot more momentum on the "RAID"5/6 
feature now compared to earlier. There also seems to be a lot of focus on 
fixing bugs and running tests as well. This is why I am guessing that 
3-4 years ahead is an absolute minimum until "RAID"5/6 might be somewhat 
reliable and usable.


There are a few other basics missing that may be acceptable for you as 
long as you know about them. For example, as far as I know BTRFS still 
does not use the "device-id" or "BTRFS internal number" to keep track of 
storage devices.


This means that if you have a multi-device filesystem with for 
example /dev/sda /dev/sdb /dev/sdc etc... and /dev/sdc disappears and 
shows up again as /dev/sdx, then BTRFS would not recognize this and would 
happily try to continue to write to /dev/sdc even if it does not exist.


...and perhaps even worse - I can imagine that if you swap device 
ordering and a different device takes /dev/sdc's place then BTRFS 
*could* overwrite data on this device - possibly making a real mess of 
things. I am not sure if this holds true, but if it does it's for sure a 
real nugget of basic functionality missing right there.


BTRFS also so far has no automatic "drop device" function, i.e. it will 
not automatically kick out a storage device that is throwing lots of 
errors and causing delays etc. There may be benefits to keeping this 
design of course, but for some, dropping the device might be desirable.


And no hot-spare "or hot-(reserved-)space" (which would be more accurate 
in BTRFS terms) is implemented either, and that is one good reason to 
keep an eye on your storage pool.


What you *might* consider is to have your metadata in "RAID"1 or 
"RAID"10 and your data in "RAID"5 or even "RAID"6, so that if you run 
into problems you might in the worst case lose some data, but since 
"RAID"1/10 is beginning to be rather mature it is likely that your 
filesystem might survive a disk failure.
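A minimal sketch of what such a mixed-profile setup could look like,
assuming four scratch devices /dev/sdb through /dev/sde (placeholder
names, adjust before trying anything):

  # create a filesystem with "RAID"5 data and "RAID"1 metadata
  mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd /dev/sde
  # or convert an existing, mounted filesystem in place
  btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt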


So if you are prepared to perhaps lose a file or two, but want to feel 
confident that your filesystem will survive and will give you a report 
about which file(s) are toast, then this may be acceptable for you, as you 
can always restore from backups (because you do have backups, right? If 
not, read 'any' of Duncan's posts - he explains better than most people 
why you need and should have backups!)


Now keep in mind that this is just a humble user's analysis of the 
situation, based on whatever I have picked up from the mailing list, which 
may or may not be entirely accurate - so take it for what it is!



Re: replace drive with write_io_errs?

2018-05-12 Thread waxhead

Adam Bahe wrote:

Hello all,


'All' includes me as well, but keep in mind I am not a BTRFS dev.


I have a drive that has been in my btrfs array for about 6 months now.
It was purchased new. Its an IBM-ESXS SAS drive rebranded from an HGST
HUH721010AL4200. Here is the following stats, it passed a long
smartctl test. But I'm not sure what to make of it.

[/dev/sdi].write_io_errs1823
[/dev/sdi].read_io_errs 0
[/dev/sdi].flush_io_errs0
[/dev/sdi].corruption_errs  0
[/dev/sdi].generation_errs  0


Just a few observations.

You are more likely to get (faster) help from the friendly devs here if 
you provide the output of...


btrfs --version
uname -a
btrfs filesystem show

Have you gone through the "regular" stuff?! E.g. things like bad cables, 
rerouting cables, checking your power supply (noise, correct voltages), 
temperature (your drive is not *that* far off the trip temperature, if 
it is 52 C I imagine it could easily hit 65 C with a bit of load), 
trying to eliminate other hardware, sound cards, graphics cards etc...
If you run your array on a USB enclosure weird things may/will/(has to) 
happen.



=== START OF INFORMATION SECTION ===
Vendor:   IBM-ESXS
Product:  HUH721010AL4200
Revision: J6R2
User Capacity:9,931,038,130,176 bytes [9.93 TB]
Logical block size:   4096 bytes
Formatted with type 2 protection
Logical block provisioning type unreported, LBPME=0, LBPRZ=0
Rotation Rate:7200 rpm
Form Factor:  3.5 inches
Logical Unit id:  0x5000cca266a405e4
Serial number:*YOINK*
Device type:  disk
Transport protocol:   SAS
Local Time is:Sat May 12 03:06:35 2018 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature: 52 C
Drive Trip Temperature:65 C

Manufactured in week 33 of year 2017
Specified cycle count over device lifetime:  5
Accumulated start-stop cycles:  28
Specified load-unload count over device lifetime:  60
Accumulated load-unload cycles:  170
Elements in grown defect list: 0

Vendor (Seagate) cache information
   Blocks sent to initiator = 1848304782540800

Error counter log:
Errors Corrected by   Total   Correction
GigabytesTotal
ECC  rereads/errors   algorithm
processeduncorrected
fast | delayed   rewrites  corrected  invocations   [10^9
bytes]  errors
read:  00 0 01096283
14317.360   0
write: 00 0 0   2906
27801.489   0
verify:00 0 0  13027
0.000   0

Non-medium error count:0

SMART Self-test log
Num  Test  Status segment  LifeTime
LBA_first_err [SK ASC ASQ]
  Description  number   (hours)
# 1  Background long   Completed   -2466
   - [-   --]
# 2  Background short  Completed   -2448
   - [-   --]
Long (extended) Self Test duration: 65535 seconds [1092.2 minutes]

I have not seen the 'correction algorithm invocations' before, but I 
expect that such a large drive probably does some of this as part of 
regular use. If the number is significantly higher than on your other 
drives (if they have the same load) I would suspect something is fishy 
with your drive. But then again, it's better to ask someone else.


I can't RMA the drive as I have no idea how or where to RMA an IBM
branded HGST drive. So if on the off chance someone here is reading
this who can also point me in the right direction, let me know where
to RMA an IBM standalone drive with no FRU.

Uhm... can't you just return the drive where you purchased it?



But is this drive healthy or should I have it replaced? What is the
extent of a write_io_err? Are they somewhat common or a sign of a bad
drive? A scrub returned no errors.


The manual is a bit hard to understand
https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-device

It does not say clearly what happens if you have a redundant storage 
profile for your (meta)data. Would a write be redirected to another 
copy? If yes, would it retry the original write? I *assume* that as long 
as you don't get any write errors in your application it works. But 
perhaps someone else cares to explain this better (preferably by updating 
the manual/wiki).



Also what about the correction algorithm invocations? All of my IBM
drives seem to have those. Whereas all of my other drives do not. I
was curious about that too, if anyone knows. Thanks!



Re: RAID56 - 6 parity raid

2018-05-02 Thread waxhead



Andrei Borzenkov wrote:

02.05.2018 21:17, waxhead wrote:

Goffredo Baroncelli wrote:

On 05/02/2018 06:55 PM, waxhead wrote:


So again, which problem would having the parity checksummed solve?
To the best of my knowledge, nothing. In any case the data is
checksummed, so it is impossible to return corrupted data (modulo bugs
:-) ).


I am not a BTRFS dev, but this should be quite easy to answer.
Unless you checksum the parity there is no way to verify that the
data (parity) you use to reconstruct other data is correct.


In any case you could catch that the computed data is wrong, because
the data is always checksummed. And in any case you must check the
data against its checksum.


What if you lost an entire disk?


How does it matter exactly? RAID is per chunk anyway.

It does not matter. I was wrong, got bitten by thinking about BTRFS 
"RAID5" as normal RAID5. Again a good reason to change the naming for it 
I think...



or had corruption for both data AND checksum?


By the same logic you may have corrupted parity and its checksum.


Yup. Indeed


How do you plan to safely reconstruct that without checksummed
parity?



Define "safely". The main problem of current RAID56 implementation is
that stripe is not updated atomically (at least, that is what I
understood from the past discussions) and this is not solved by having
extra parity checksum. So how exactly "safety" is improved here? You
still need overall checksum to verify result of reconstruction, what
exactly extra parity checksum buys you?


> [...]




Again - please describe when having parity checksum will be beneficial
over current implementation. You do not reconstruct anything as long as
all data strips are there, so parity checksum will not be used. If one
data strip fails (including checksum) it will be reconstructed and
verified. If parity itself is corrupted, checksum verification fails
(hopefully). How is it different from verifying parity checksum before
reconstructing? In both cases data cannot be reconstructed, end of story.


Ok, before attempting an answer I have to admit that I do not know 
enough about how RAID56 is laid out on disk in BTRFS terms. Is data 
checksummed per stripe or per disk? Is parity calculated on the data 
only, or is it calculated on the data+checksum?!



Re: RAID56 - 6 parity raid

2018-05-02 Thread waxhead

Goffredo Baroncelli wrote:

On 05/02/2018 06:55 PM, waxhead wrote:


So again, which problem would having the parity checksummed solve? To the best 
of my knowledge, nothing. In any case the data is checksummed, so it is 
impossible to return corrupted data (modulo bugs :-) ).


I am not a BTRFS dev, but this should be quite easy to answer. Unless you 
checksum the parity there is no way to verify that the data (parity) you 
use to reconstruct other data is correct.


In any case you could catch that the computed data is wrong, because the data is 
always checksummed. And in any case you must check the data against its 
checksum.

What if you lost an entire disk? or had corruption for both data AND 
checksum? How do you plan to safely reconstruct that without checksummed 
parity?



My point is that storing the checksum is a cost that you pay *every time*. 
Every time you update a part of a stripe you need to update the parity, and 
then in turn the parity checksum. It is not a problem of space occupied nor a 
computational problem. It is a problem of write amplification...
How much of a problem is this? No benchmarks have been run, since the 
feature is not there yet, I suppose.




The only gain is to avoid to try to use the parity when
a) you need it (i.e. when the data is missing and/or corrupted)
I'm not sure I can make out your argument here, but with RAID5/6 you 
don't have another copy to restore from. You *have* to use the parity to 
reconstruct data, and it is a good thing if this data is trusted.



and b) it is corrupted.
But the likelihood of this case is very low. And you can catch it during the 
data checksum check (which has to be performed in any case !).

So on one side you have a *cost every time* (the write amplification); on the 
other side you have a gain (cpu-time) *only in the case* that the parity is 
corrupted and you need it (e.g. scrub or corrupted data).

IMHO the costs are much higher than the gain, and the likelihood of the gain is 
much lower compared to the likelihood (=100%, or always) of the cost.

Then run benchmarks and consider making parity checksums optional 
(but pretty please dipped in syrup with sugar on top - keep it on by 
default).




Re: RAID56 - 6 parity raid

2018-05-02 Thread waxhead

Goffredo Baroncelli wrote:

Hi
On 05/02/2018 03:47 AM, Duncan wrote:

Gandalf Corvotempesta posted on Tue, 01 May 2018 21:57:59 + as
excerpted:


Hi to all, I've found some patches from Andrea Mazzoleni that add
support for up to 6-parity raid.
Why weren't these merged?
With modern disk sizes, having something greater than 2 parities would be
great.

1) [...] the parity isn't checksummed, 


Why is the fact that the parity is not checksummed a problem?
I have read several times that this is a problem. However each time the thread 
reached the conclusion that... it is not a problem.

So again, which problem would having the parity checksummed solve? To the best 
of my knowledge, nothing. In any case the data is checksummed, so it is 
impossible to return corrupted data (modulo bugs :-) ).

I am not a BTRFS dev, but this should be quite easy to answer. Unless 
you checksum the parity there is no way to verify that the data 
(parity) you use to reconstruct other data is correct.



On the other side, having the parity checksummed would increase both the code 
complexity and the write amplification, because every time a part of the stripe 
is touched, not only the parity has to be updated, but also the checksum.
Which is a good thing. BTRFS' main selling point is that you can feel 
pretty confident that whatever you put in is exactly what you get out.



libbrtfsutil questions

2018-04-23 Thread waxhead

Howdy!

I am pondering writing a little C program that uses libmicrohttpd and 
libbtrfsutil to display some very basic (overview) details about BTRFS.


I was hoping to display the same information that 'btrfs fi sh /mnt' and 
'btrfs fi us -T /mnt' do, but somewhat combined. Since I recently 
figured out how easy it is to do SVG graphics, I was hoping to try to 
visualize things a bit.


What I was hoping to achieve is:
- show all filesystems
- ..show all devices in a filesystem (and mark missing devices clearly)
- show usage and/or allocation for each device
- possibly display chunks as blocks (like old defrag programs) where 
the brightness indicates how utilized a (meta)data chunk is.

- possibly mark devices with errors ( 'btrfs de st /mnt' ).

The problem is ... I looked at libbtrfsutil and it appears that there is 
mostly sync + subvolume/snapshot stuff in there.


So my question is: Is libbtrfsutil the right choice and intended to at 
some point (in the future?) supply me with the data I need for these 
things or should I look elsewhere?


PS! This is a completely private project for my own egoistic reasons. 
However, if it turns out to be useful and the code is not too 
embarrassing, I am happy to put the code into the public domain ... if it 
ever gets written :S










Re: Status of RAID5/6

2018-03-22 Thread waxhead

Liu Bo wrote:

On Wed, Mar 21, 2018 at 9:50 AM, Menion  wrote:

Hi all
I am trying to understand the status of RAID5/6 in BTRFS
I know that there are some discussion ongoing on the RFC patch
proposed by Liu bo
But it seems that everything stopped last summary. Also it mentioned
about a "separate disk for journal", does it mean that the final
implementation of RAID5/6 will require a dedicated HDD for the
journaling?


Thanks for the interest on btrfs and raid56.

The patch set is to plug write hole, which is very rare in practice, tbh.
The feedback is to use existing space instead of another dedicate
"fast device" as the journal in order to get some extent of raid
protection.  I'd need some time to pick it up.

With that being said, we have several data reconstruction fixes for
raid56 (esp. raid6) in 4.15, I'd say please deploy btrfs with the
upstream kernel or some distros which do kernel updates frequently,
the most important one is

8810f7517a3b Btrfs: make raid6 rebuild retry more
https://patchwork.kernel.org/patch/10091755/

AFAIK, no other data corruptions showed up.

I am very interested in the "raid"5/6-like behavior myself. Actually, 
calling it RAID in the past may have had its benefits, but these days 
continuing to use the RAID term is not helping. Even technically minded 
people seem to get confused.


For example: It was suggested that "raid"5/6 should have hot-spare 
support. In BTRFS terms a hot-spare device sounds wrong to me, but 
reserving extra space as "hot-space" so any "raid"5/6-like system can 
(auto?) rebalance the missing blocks to the rest of the pool sounds 
sensible enough (as long as the number of devices allows separating the 
different bits and pieces).


Anyway, I got carried away a bit there. Sorry about that.
What I really wanted to comment on is the usability of "raid"5/6:
how would a metadata "raid"1 + data "raid"5 or 6 setup compare to, say, 
mdraid 5 or 6 from a reliability point of view?


Sure, mdraid has the advantage, but even with the write hole and the risk 
of corruption of data (not the filesystem), would not BTRFS in "theory" 
be safer than at least mdraid 5, even if run with metadata "raid"5?!
You have to run scrub on both mdraid as well as BTRFS to ensure data is 
not corrupted.
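To illustrate that last point, a minimal sketch of the periodic checking
on each side, assuming a btrfs filesystem mounted at /mnt and an md array
named md0 (both names are just placeholders):

  # btrfs: scrub verifies checksums and repairs from redundancy where possible
  btrfs scrub start -B /mnt
  # mdraid: trigger a consistency check of the whole array
  echo check > /sys/block/md0/md/sync_action
  cat /proc/mdstat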


PS! It might be worth mentioning that I am slightly affected by a 
Glenfarclas 105 whisky while writing this, so please bear with me in case 
something is too far off :)



Re: Crashes running btrfs scrub

2018-03-18 Thread waxhead

Liu Bo wrote:

On Sat, Mar 17, 2018 at 5:26 PM, Liu Bo  wrote:

On Fri, Mar 16, 2018 at 2:46 PM, Mike Stevens  wrote:

Could you please paste the whole dmesg, it looks like it hit
btrfs_abort_transaction(),
which should give us more information about where goes wrong.


The whole thing is here https://pastebin.com/4ENq2saQ


Given this,

[  299.410998] BTRFS: error (device sdag) in
btrfs_create_pending_block_groups:10192: errno=-27 unknown

it refers to -EFBIG, so I think the warning comes from

btrfs_add_system_chunk()
{
	...
	if (array_size + item_size + sizeof(disk_key)
			> BTRFS_SYSTEM_CHUNK_ARRAY_SIZE) {
		mutex_unlock(&fs_info->chunk_mutex);
		return -EFBIG;
	}

If that's the case, we need to check this earlier during mount.



I didn't realize this until now,  we do have a limitation on up to how
many disks btrfs could handle, in order to make balance/scrub work
properly (where system chunks may be set readonly),

((BTRFS_SYSTEM_CHUNK_ARRAY_SIZE / 2) - sizeof(struct btrfs_chunk)) /
sizeof(struct btrfs_stripe) + 1

will be the number of disks btrfs can handle at most.


Am I understanding this correctly: BTRFS has a limit to the number of 
physical devices it can handle?! (max 30 devices?!)


Or is this referring to the number of devices BTRFS can utilize in a 
stripe (in which case 30 actually sounds like a high number)?
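For what it is worth, plugging in the on-disk constants as I understand
them (BTRFS_SYSTEM_CHUNK_ARRAY_SIZE = 2048, sizeof(struct btrfs_chunk) =
80 including the one embedded stripe, sizeof(struct btrfs_stripe) = 32 -
treat these values as assumptions) does land at roughly 30:

  # ((2048 / 2) - 80) / 32 + 1  =  944 / 32 + 1  =  29 + 1  =  30  (integer division)
  echo $(( (2048 / 2 - 80) / 32 + 1 ))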


30 devices is really not that much; heck, you can get 90-disk top-loading 
JBOD storage chassis these days, and BTRFS does sound like an attractive 
choice for things like that.



Re: Crashes running btrfs scrub

2018-03-15 Thread waxhead

Mike Stevens wrote:

First, the required information
  
~ $ uname -a

Linux auswscs9903 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 
x86_64 x86_64 x86_64 GNU/Linux
  ~ $ btrfs --version
btrfs-progs v4.9.1
  ~ $ sudo btrfs fi show
Label: none  uuid: 77afc2bb-f7a8-4ce9-9047-c031f7571150
 Total devices 34 FS bytes used 89.06TiB
 devid1 size 5.46TiB used 4.72TiB path /dev/sdb
 devid2 size 5.46TiB used 4.72TiB path /dev/sda
 devid3 size 5.46TiB used 4.72TiB path /dev/sdx
 devid4 size 5.46TiB used 4.72TiB path /dev/sdt
 devid5 size 5.46TiB used 4.72TiB path /dev/sdz
 devid6 size 5.46TiB used 4.72TiB path /dev/sdv
 devid7 size 5.46TiB used 4.72TiB path /dev/sdab
 devid8 size 5.46TiB used 4.72TiB path /dev/sdw
 devid9 size 5.46TiB used 4.72TiB path /dev/sdad
 devid   10 size 5.46TiB used 4.72TiB path /dev/sdaa
 devid   11 size 5.46TiB used 4.72TiB path /dev/sdr
 devid   12 size 5.46TiB used 4.72TiB path /dev/sdy
 devid   13 size 5.46TiB used 4.72TiB path /dev/sdj
 devid   14 size 5.46TiB used 4.72TiB path /dev/sdaf
 devid   15 size 5.46TiB used 4.72TiB path /dev/sdag
 devid   16 size 5.46TiB used 4.72TiB path /dev/sdh
 devid   17 size 5.46TiB used 4.72TiB path /dev/sdu
 devid   18 size 5.46TiB used 4.72TiB path /dev/sdac
 devid   19 size 5.46TiB used 4.72TiB path /dev/sdk
 devid   20 size 5.46TiB used 4.72TiB path /dev/sdah
 devid   21 size 5.46TiB used 4.72TiB path /dev/sdp
 devid   22 size 5.46TiB used 4.72TiB path /dev/sdae
 devid   23 size 5.46TiB used 4.72TiB path /dev/sdc
 devid   24 size 5.46TiB used 4.72TiB path /dev/sdl
 devid   25 size 5.46TiB used 4.72TiB path /dev/sdo
 devid   26 size 5.46TiB used 4.72TiB path /dev/sdd
 devid   27 size 5.46TiB used 4.72TiB path /dev/sdi
 devid   28 size 5.46TiB used 4.72TiB path /dev/sdn
 devid   29 size 5.46TiB used 4.72TiB path /dev/sds
 devid   30 size 5.46TiB used 4.72TiB path /dev/sdm
 devid   31 size 5.46TiB used 4.72TiB path /dev/sdf
 devid   32 size 5.46TiB used 4.72TiB path /dev/sdq
 devid   33 size 5.46TiB used 4.72TiB path /dev/sdg
 devid   34 size 5.46TiB used 4.72TiB path /dev/sde

  ~ $ sudo btrfs fi df /gpfs_backups
Data, RAID6: total=150.82TiB, used=88.88TiB
System, RAID6: total=512.00MiB, used=19.08MiB
Metadata, RAID6: total=191.00GiB, used=187.38GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

That's a hell of a filesystem. RAID5 and RAID6 are unstable and should 
not be used for anything but throwaway data. You will be happy that you 
value your data enough to have backups, because all sensible sysadmins 
do have backups, correct?! (Do read just about any of Duncan's replies - 
he describes this better than me).


Also, if you are running kernel ***3.10***, that is nearly antique in 
btrfs terms. As a word of advice, try a more recent kernel (there have 
been lots of patches to raid5/6 since kernel 4.9) and if you ever get 
the filesystem running again then *at least* rebalance the metadata to 
raid1 as quickly as possible, as the raid1 profile is (unlike raid5 or 
raid6) working really well.
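A minimal sketch of that metadata rebalance, assuming the filesystem is
mounted at /gpfs_backups as shown above and there is enough unallocated
space for the conversion:

  # convert only the metadata chunks to raid1, leave the data profile alone
  btrfs balance start -mconvert=raid1 /gpfs_backups
  # check how far it has come
  btrfs balance status /gpfs_backups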


PS! I'm not a BTRFS dev, so don't run away just yet. Someone else may 
magically help you recover. Best of luck!


- Waxhead


Re: How to replace a failed drive in btrfs RAID 1 filesystem

2018-03-10 Thread waxhead

Austin S. Hemmelgarn wrote:

On 2018-03-09 11:02, Paul Richards wrote:

Hello there,

I have a 3 disk btrfs RAID 1 filesystem, with a single failed drive.
Before I attempt any recovery I’d like to ask what is the recommended
approach?  (The wiki docs suggest consulting here before attempting
recovery[1].)

The system is powered down currently and a replacement drive is being
delivered soon.

Should I use “replace”, or “add” and “delete”?

Once replaced should I rebalance and/or scrub?

I believe that the recovery may involve mounting in degraded mode.  If
I do this, how do I later get out of degraded mode, or if it’s
automatic how do i determine when I’m out of degraded mode?

It won't automatically mount degraded, you either have to explicitly ask 
it to, or you have to have an option to do so in your default mount 
options for the volume in /etc/fstab (which is dangerous for multiple 
reasons).


Now, as to what the best way to go about this is, there are three things 
to consider:


1. Is the failed disk still usable enough that you can get good data off 
of it in a reasonable amount of time?  If you're replacing the disk 
because of a lot of failed sectors, you can still probably get data off 
of it, while something like a head crash isn't worth trying to get data 
back.
2. Do you have enough room in the system itself to add another disk 
without removing one?

3. Is the replacement disk at least as big as the failed disk?

If the answer to all three is yes, then just put in the new disk, mount 
the volume normally (you don't need to mount it degraded if the failed 
disk is working this well), and use `btrfs replace` to move the data. 
This is the most efficient option in terms of both time and is also 
generally the safest (and I personally always over-spec drive-bays in 
systems we build where I work specifically so that this approach can be 
used).


If the answer to the third question is no, put in the new disk (removing 
the failed one first if the answer to the second question is no), mount 
the volume (mount it degraded if one of the first two questions is no, 
normally otherwise), then add the new disk to the volume with `btrfs 
device add` and remove the old one with `btrfs device delete` (using the 
'missing' option if you had to remove the failed disk).  This is needed 
because the replace operation requires the new device to be at least as 
big as the old one.


If the answer to either one or two is no but the answer to three is yes, 
pull out the failed disk, put in a new one, mount the volume degraded, 
and use `btrfs replace` as well (you will need to specify the device ID 
for the now missing failed disk, which you can find by calling `btrfs 
filesystem show` on the volume).  In the event that the replace 
operation refuses to run in this case, instead add the new disk to the 
volume with `btrfs device add` and then run `btrfs device delete 
missing` on the volume.


If you follow any of the above procedures, you don't need to balance 
(the replace operation is equivalent to a block level copy and will 
result in data being distributed exactly the same as it was before, 
while the delete operation is a special type of balance), and you 
generally don't need to scrub the volume either (though it may still be 
a good idea).  As far as getting back from degraded mode, you can just 
remount the volume to do so, though I would generally suggest rebooting.


Note that there are three other possible approaches to consider as well:

1. If you can't immediately get a new disk _and_ all the data will fit 
on the other two disks, use `btrfs device delete` to remove the failed 
disk anyway, and run with just the two until you can get a new disk. 
This is exponentially safer than running the volume degraded until you 
get a new disk, and is the only case you realistically should delete a 
device before adding the new one.  Make sure to balance the volume after 
adding the new device.
2. Depending on the situation, it may be faster to just recreate the 
whole volume from scratch using a backup than it is to try to repair it. 
  This is actually the absolute safest method of handling this 
situation, as it makes sure that nothing from the old volume with the 
failed disk causes problems in the future.
3. If you don't have a backup, but have some temporary storage space 
that will fit all the data from the volume, you could also use `btrfs 
restore` to extract files from the old volume to temporary storage, 
recreate the volume, and copy the data back in from the temporary storage.



I did a quick scan of the wiki just to see, but I did not find any good 
info about how to recover a "RAID"-like set if degraded. Information 
about how to recover, and which profiles can be recovered from, would be 
good to have.
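For my own notes, a minimal sketch of the flow Austin describes for the
'failed disk already pulled, replacement at least as big' case, assuming
the surviving member is /dev/sdb, the new disk is /dev/sdd and the
missing disk had devid 3 (all placeholders):

  # mount degraded using one of the surviving members
  mount -o degraded /dev/sdb /mnt
  # find the devid of the missing device
  btrfs filesystem show /mnt
  # replace the missing devid with the new disk, then watch progress
  btrfs replace start 3 /dev/sdd /mnt
  btrfs replace status /mnt
  # once finished, remount normally (or reboot) to leave degraded mode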

Per subvolume "RAID" level?!

2018-03-08 Thread waxhead
Just out of curiosity, is there any work going on for enabling 
different "RAID" levels per subvolume?!


And out of even more curiosity, how is this planned to be handled with 
btrfs balance?! When per-subvolume "RAID" levels are good to go, how 
would you then run the balance filters to convert / leave alone certain 
parts of the filesystem?!



Please update the BTRFS status page

2018-02-23 Thread waxhead

The latest released kernel is 4.15


Re: degraded permanent mount option

2018-01-29 Thread waxhead



Austin S. Hemmelgarn wrote:

On 2018-01-29 12:58, Andrei Borzenkov wrote:

29.01.2018 14:24, Adam Borowski wrote:
...


So any event (the user's request) has already happened.  A rc system, of
which systemd is one, knows whether we reached the "want root filesystem" or
"want secondary filesystems" stage.  Once you're there, you can issue the
mount() call and let the kernel do the work.

It is a btrfs choice to not expose the compound device as a separate one
(like every other device manager does)

Btrfs is not a device manager, it's a filesystem.

it is a btrfs drawback that it doesn't provide anything else except for this
IOCTL with its logic

How can it provide you with something it doesn't yet have?  If you want the
information, call mount().  And as others in this thread have mentioned,
what, pray tell, would you want to know "would a mount succeed?" for if you
don't want to mount?

it is a btrfs drawback that there is nothing to push assembling into "OK,
going degraded" state

The way to do so is to timeout, then retry with -o degraded.



That's a possible way to solve it. This likely requires support from 
mount.btrfs (or btrfs.ko) to return a proper indication that the filesystem 
is incomplete so the caller can decide whether to retry or to try a 
degraded mount.
We already do so in the accepted standard manner.  If the mount fails 
because of a missing device, you get a very specific message in the 
kernel log about it, as is the case for most other common errors (for 
uncommon ones you usually just get a generic open_ctree error).  This is 
really the only option too, as the mount() syscall (which the mount 
command calls) returns only 0 on success or -1 and an appropriate errno 
value on failure, and we can't exactly go about creating a half dozen 
new error numbers just for this (well, technically we could, but I very 
much doubt that they would be accepted upstream, which defeats the 
purpose).


Or maybe mount.btrfs should implement this logic internally. This would 
really be the simplest way to make it acceptable to the other side by 
not needing to accept anything :)
And would also be another layering violation which would require a 
proliferation of extra mount options to control the mount command itself 
and adjust the timeout handling.


This has been done before with mount.nfs, but for slightly different 
reasons (primarily to allow nested NFS mounts, since the local directory 
that the filesystem is being mounted on not being present is treated 
like a mount timeout), and it had near zero control.  It works there 
because they push the complicated policy decisions to userspace (namely, 
there is no support for retrying with different options or trying a 
different server).
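
(Just to illustrate the kind of retry policy being discussed here, and not 
something btrfs or mount ships today, a hypothetical wrapper could boil 
down to roughly this:

# try a normal mount first; if devices are still missing after a grace
# period, fall back to a degraded mount
if ! mount /dev/sda1 /mnt; then
    sleep 30
    mount -o degraded /dev/sda1 /mnt
fi

The hard part is exactly the policy: how long to wait, and whether going 
degraded automatically is ever the right default.)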


I just felt like commenting a bit on this from a regular user's point of 
view.


Remember that at some point BTRFS will probably be the default 
filesystem for the average penguin.
BTRFS's big selling point is redundancy and a guarantee that whatever you 
write is the same as what you will read sometime later.


Many users will probably build their BTRFS system on a redundant array 
of storage devices. As long as sufficient (not necessarily all) storage 
devices are present they expect their system to come up and work. If the 
system is not able to come up in a fully operative state it must at least 
be able to limp along until the issue is fixed.


Starting an argument about which init system is the most sane or most 
shiny is not helping. The truth is that systemd is not going away 
anytime soon, and one might as well try to become friends with it, if 
nothing else for the sake of having things work, which should be a common 
goal regardless of religion.


I personally think the degraded mount option is a mistake, as it assumes 
that a lightly degraded system is not able to work, which is false.
If the system can mount to some working state then it should mount, 
regardless of whether it is fully operative or not. If the array is in a 
bad state you need to learn about it by issuing a command or something. 
The same goes for an MD array (and yes, I am aware of the block layer vs 
filesystem thing here).



Re: Superblock update: Is there really any benefits of updating synchronously?

2018-01-24 Thread waxhead

Hans van Kranenburg wrote:

On 01/23/2018 08:51 PM, waxhead wrote:

Nikolay Borisov wrote:

On 23.01.2018 16:20, Hans van Kranenburg wrote:


[...]



We also had a discussion about the "backup roots" that are stored
besides the superblock, and that they are "better than nothing" to help
maybe recover something from a borken fs, but never ever guarantee you
will get a working filesystem back.

The same holds for superblocks from a previous generation. As soon as
the transaction for generation X succesfully hits the disk, all space
that was occupied in generation X-1 but no longer in X is available to
be overwritten immediately.


Ok so this means that superblocks with an older generation are utterly
useless and will lead to corruption (effectively making my argument
above useless as that would in fact assist corruption then).


Mostly, yes.


Does this means that if disk space was allocated in X-1 and is freed in
X it will unallocated if you roll back to X-1 e.g. writing to
unallocated storage.


Can you reword that? I can't follow that sentence.

Sure why not. I'll give it a go:

Does this mean that if...
* Superblock generation N-1 has range 1234-2345 allocated and used.

and

* Superblock generation N-0 (the current) has range 1234-2345 free 
because someone deleted a file or something


Then

There is no point in rolling back to generation N-1 because that refers to 
what is now essentially free "memory", which may or may not have been 
written over by generation N-0. Therefore N-1, which still thinks 
range 1234-2345 is allocated, may point to the wrong data.


I hope that was easier to follow - if not, don't hold back on the 
expletives! :)





I was under the impression that a superblock was like a "snapshot" of
the entire filesystem and that rollbacks via previous-generation
superblocks were possible. Am I mistaken?


Yes. The first fundamental thing in Btrfs is COW which makes sure that
everything referenced from transaction X, from the superblock all the
way down to metadata trees and actual data space is never overwritten by
changes done in transaction X+1.

Perhaps a tad off topic, but assuming the (hopefully) better explanation 
above clears things up a bit: what happens if a block is freed in X+1? 
That must mean that it can be overwritten in transaction X+1 (which 
I assume means a new superblock generation). After all, without freeing 
and overwriting data there is no way to re-use space.



For metadata trees that are NOT filesystem trees a.k.a. subvolumes, the
way this is done is actually quite simple. If a block is cowed, the old
location is added to a 'pinned extents' list (in memory), which is used
as a blacklist for choosing space to put new writes in. After a
transaction is completed on disk, that list with pinned extents is
emptied and all that space is available for immediate reuse. This way we
make sure that if the transaction that is ongoing is aborted, the
previous one (latest one that is completely on disk) is always still
there. If the computer crashes and the in memory list is lost, no big
deal, we just continue from the latest completed transaction again after
a reboot. (ignoring extra log things for simplicity)

So, the only situation in which you can fully use an X-1 superblock is
when none of that previously pinned space has actually been overwritten
yet afterwards.

And if any of the space was overwritten already, you can go play around
with using an older superblock and your filesystem mounts and everything
might look fine, until you hit that distant corner and BOOM!
Got it, this takes care of my questions above, but I'll leave them in 
just for completeness' sake.

Thanks for the good explanation.



 >8  Extra!! Moar!!  >8 

But, doing so does not give you snapshot functionality yet! It's more
like a poor man's snapshot that can only prevent you from messing up the
current version.

Snapshot functionality is implemented only for filesystem trees
(subvolumes) by adding reference counting (which does end up on disk) to
the metadata blocks, and then COW trees as a whole.

If you make a snapshot of a filesystem tree, the snapshot gets a whole
new tree ID! It's not a previous version of the same subvolume you're
looking at, it's a clone!

This is a big difference. The extent tree is always tree 2. The chunk
tree is always tree 3. But your subvolume snapshot gets a new tree number.

Technically, it would maybe be possible to implement reference counting
and snapshots to all of the metadata trees, but it would probably mean
that the whole filesystem would get stuck in rewriting itself all day
instead of doing any useful work. The current extent tree already has
such amount of rumination problems that the added work of keeping track
of reference counts would make it completely unusable.

In the wiki, it's here:
https://btrfs.wiki.kernel.org/index.php/Btrfs_design#Copy_on_Write_Logging

Actually, I just 

Re: Superblock update: Is there really any benefits of updating synchronously?

2018-01-23 Thread waxhead



Nikolay Borisov wrote:



On 23.01.2018 16:20, Hans van Kranenburg wrote:

On 01/23/2018 10:03 AM, Nikolay Borisov wrote:


On 23.01.2018 09:03, waxhead wrote:

Note: This has been mentioned before, but since I see some issues
related to superblocks I think it would be good to bring up the question
again.

[...]
https://btrfs.wiki.kernel.org/index.php/On-disk_Format#Superblock

The superblocks are updated synchronously on HDD's and one after each
other on SSD's.


There is currently no distinction in the code whether we are writing to
SSD or HDD.


So what does that line in the wiki mean, and why is it there? "btrfs
normally updates all superblocks, but in SSD mode it will update only
one at a time."


It means the wiki is outdated.


Ok and now the wiki is updated. Great :)




Also, what do you mean by synchronously? If you inspect the
code in write_all_supers you will see that for every device we issue
writes for every available copy of the superblock and then wait for all
of them to be finished via 'wait_dev_supers'. In that regard sb
writeout is asynchronous.

I meant basically what you have explained. You write the same memory to 
all superblocks "step by step" but in one operation.



Superblocks are also (to my knowledge) not protected by copy-on-write
and are read-modify-update.

On a storage device with >256GB there will be three superblocks.

BTRFS will always prefer the superblock with the highest generation
number providing that the checksum is good.


Wrong. On mount btrfs will only ever read the first superblock at 64k.
If that one is corrupted it will refuse to mount, then it's expected the
user will initiate recovery procedure with btrfs-progs which reads all
supers and replaces them with the "newest" one (as decided by the
generation number)
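
(For anyone following along at home, the individual superblock copies and 
their generation numbers can be inspected with something along these 
lines, the device name being just an example:

# dump all superblock mirrors on the device, including their generation fields
btrfs inspect-internal dump-super -a /dev/sda1

That makes it easy to see whether the copies have drifted apart.)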


So again, the line "The superblock with the highest generation is used
when reading." in the wiki needs to go away then?


Yep, for background information you can read the discussion here:
https://www.spinics.net/lists/linux-btrfs/msg71878.html


And the wiki is also updated... Great!




On the list there seem to be a few incidents where the superblocks have
gone toast and I am pondering what (if any) benefits there is by
updating the superblocks synchronously.

The superblock is checkpointed every 30 seconds by default, and if
someone pulls the plug (power outage) on HDDs then a synchronous write,
depending on (the quality of) your hardware, may perhaps ruin all the
superblock copies in one go. E.g. copies A, B and C will all be updated at
30s.

On SSD's, since one superblock is updated after other it would mean that
using the default 30 second checkpoint Copy A=30s, Copy B=1m, Copy C=1m30s


As explained previously there is no notion of "SSD vs HDD" modes.
Ok, thanks for clearing things up. But the main thing here is that all 
superblocks are updated at the same time on both SSDs and HDDs. I think 
the question is still valid: what is there to gain by updating all of 
them every 30s instead of updating them one by one?! Would that not be 
safer, perhaps a tiny bit quicker, and perhaps better in terms of recovery?!




We also had a discussion about the "backup roots" that are stored
besides the superblock, and that they are "better than nothing" to help
maybe recover something from a borken fs, but never ever guarantee you
will get a working filesystem back.

The same holds for superblocks from a previous generation. As soon as
the transaction for generation X succesfully hits the disk, all space
that was occupied in generation X-1 but no longer in X is available to
be overwritten immediately.

Ok so this means that superblocks with an older generation are utterly 
useless and will lead to corruption (effectively making my argument 
above useless as that would in fact assist corruption then).


Does this means that if disk space was allocated in X-1 and is freed in 
X it will unallocated if you roll back to X-1 e.g. writing to 
unallocated storage.


I was under the impression that a superblock was like a "snapshot" of 
the entire filesystem and that rollbacks via previous-generation 
superblocks were possible. Am I mistaken?






Superblock update: Is there really any benefits of updating synchronously?

2018-01-22 Thread waxhead
Note: This has been mentioned before, but since I see some issues 
related to superblocks I think it would be good to bring up the question 
again.


According to the information found in the wiki: 
https://btrfs.wiki.kernel.org/index.php/On-disk_Format#Superblock


The superblocks are updated synchronously on HDD's and one after each 
other on SSD's.


Superblocks are also (to my knowledge) not protected by copy-on-write 
and are read-modify-update.


On a storage device with >256GB there will be three superblocks.

BTRFS will always prefer the superblock with the highest generation 
number providing that the checksum is good.


On the list there seem to be a few incidents where the superblocks have 
gone toast and I am pondering what (if any) benefits there is by 
updating the superblocks synchronously.


The superblock is checkpointed every 30 seconds by default, and if 
someone pulls the plug (power outage) on HDDs then a synchronous write, 
depending on (the quality of) your hardware, may perhaps ruin all the 
superblock copies in one go. E.g. copies A, B and C will all be updated at 30s.


On SSD's, since one superblock is updated after other it would mean that 
using the default 30 second checkpoint Copy A=30s, Copy B=1m, Copy C=1m30s


Why is the SSD method not used on hard drives also?! If two superblocks 
are toast you would at most lose 1m30s by default, and if this is 
considered a problem then you can always adjust the commit time 
downwards. If it is set to 15 seconds you would still only lose 30 seconds 
of "action time" and would in my opinion be far better off from a 
reliability point of view than having to update multiple superblocks at 
the same time. I can't see why on earth updating all superblocks at the 
same time would have any benefits.


So this all boils down to the questions three (ere the other side will 
see. :P )


1. What are the benefits of updating all superblocks at the same time? 
(Just imagine if your memory is bad - you could risk updating all 
superblocks simultaneously with kebab'ed data).


2. What would the negative consequences be by using the SSD scheme also 
for harddisks? Especially if the commit time is set to 15s instead of 30s


3. In a RAID1 / 10 / 5 / 6 like setup. Would a set of corrupt 
superblocks on a single drive be recoverable from other disks or do the 
superblocks need to be intact on the (possibly) damaged drive?
(If the superblocks are needed then why would not SSD mode be better 
especially if the drive is partly working)




Re: Recommendations for balancing as part of regular maintenance?

2018-01-10 Thread waxhead

Austin S. Hemmelgarn wrote:

So, for a while now I've been recommending small filtered balances to
people as part of regular maintenance for BTRFS filesystems under the
logic that it does help in some cases and can't really hurt (and if done
right, is really inexpensive in terms of resources).  This ended up
integrated partially in the info text next to the BTRFS charts on
netdata's dashboard, and someone has now pointed out (correctly I might
add) that this is at odds with the BTRFS FAQ entry on balances.

For reference, here's the bit about it in netdata:

You can keep your volume healthy by running the `btrfs balance` command
on it regularly (check `man btrfs-balance` for more info).


And here's the FAQ entry:

Q: Do I need to run a balance regularly?

A: In general usage, no. A full unfiltered balance typically takes a
long time, and will rewrite huge amounts of data unnecessarily. You may
wish to run a balance on metadata only (see Balance_Filters) if you find
you have very large amounts of metadata space allocated but unused, but
this should be a last resort.


I've commented in the issue in netdata's issue tracker that I feel that
the FAQ entry could be better worded (strictly speaking, you don't
_need_ to run balances regularly, but it's usually a good idea). Looking
at both though, I think they could probably both be improved, but I
would like to get some input here on what people actually think the best
current practices are regarding this (and ideally why they feel that
way) before I go and change anything.

So, on that note, how does anybody else out there feel about this?  Is
balancing regularly with filters restricting things to small numbers of
mostly empty chunks a good thing for regular maintenance or not?
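
(For the record, the kind of small filtered balance being discussed is 
typically something along these lines, with the percentages picked more 
or less arbitrarily:

# rewrite only data/metadata chunks that are at most 10% used,
# which is cheap and frees up mostly-empty chunk allocations
btrfs balance start -dusage=10 -musage=10 /mnt

A full, unfiltered `btrfs balance start /mnt` is the expensive operation 
the FAQ warns about.)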
--
As just a regular user I would think that the first thing you would need 
is an analysis step that can tell you whether it is a good idea to balance 
at all in the first place.


Scrub seems like a great place to start - e.g. scrub could auto-analyze 
and report back whether a balance is needed. I also think that scrub should 
optionally auto-balance if needed.


Balance may not be needed, but if one can determine that balancing would 
speed things up a bit I don't see why it can't be offered as an option 
that is scheduled automatically. Ideally there should be a "scrub and polish" 
option that would scrub, balance and perhaps even defragment in one go.


In fact, the way I see it btrfs should ideally keep track of each 
data/metadata chunk by itself: it should know when each chunk was last 
affected by a scrub, balance, defrag etc. and perform the required 
operations on its own based on a configuration or similar. Some may 
disagree for good reasons, but this is my wishlist for a 
filesystem :) e.g. a pool that just works and only annoys you with the 
need to replace a bad disk every now and then :)





Re: A Big Thank You, and some Notes on Current Recovery Tools.

2018-01-01 Thread waxhead

Qu Wenruo wrote:



On 2018-01-01 08:48, Stirling Westrup wrote:

Okay, I want to start this post with a HUGE THANK YOU THANK YOU THANK
YOU to Nikolay Borisov and most especially to Qu Wenruo!

Thanks to their tireless help in answering all my dumb questions I
have managed to get my BTRFS working again! As I speak I have the
full, non-degraded, quad of drives mounted and am updating my latest
backup of their contents.

I had a 4-drive setup with 2x4T and 2x2T drives and one of the 2T
drives failed, and with help I was able to make a 100% recovery of the
lost data. I do have some observations on what I went through though.
Take this as constructive criticism, or as a point for discussing
additions to the recovery tools:

1) I had a 2T drive die with exactly 3 hard-sector errors and those 3
errors exactly coincided with the 3 super-blocks on the drive.


WTF, why does all this corruption happen at the btrfs superblocks?!

What a coincidence.


The
odds against this happening as random independent events is so
unlikely as to be mind-boggling. (Something like odds of 1 in 10^26)


Yep, that's also why I was thinking the corruption is much heavier than
we expected.

But if this turns out to be superblocks only, then as long as the superblocks
can be recovered, you're OK to go.


So, I'm going to guess this wasn't random chance. Its possible that
something inside the drive's layers of firmware is to blame, but it
seems more likely to me that there must be some BTRFS process that
can, under some conditions, try to update all superblocks as quickly
as possible.


Btrfs only tries to update its superblock when committing transaction.
And it's only done after all devices are flushed.

AFAIK there is nothing strange.


I think it must be that a drive failure during this
window managed to corrupt all three superblocks.


Maybe, but at least the first (primary) superblock is written with the FUA
flag. Unless you have enabled libata FUA support (which is disabled by
default) AND your drive supports native FUA (not all HDDs support it; I
only have one Seagate 3.5" HDD that does), the FUA write will be converted
to write & flush, which should be quite safe.

The only timing I can think of is, between the superblock write request
submit and the wait for them.

But anyway, btrfs superblocks are the ONLY metadata not protected by
CoW, so it is possible something may go wrong at a certain timing.



So from what I can piece together, SSD mode is safer even for regular 
hard disks, correct?


According to this...
https://btrfs.wiki.kernel.org/index.php/On-disk_Format#Superblock

- There are 3x superblocks for every device.
- The superblocks are updated every 30 seconds if there are any changes...
- SSD mode will not try to update all superblocks in one go, but updates 
them one by one every 30 seconds.


So if SSD mode is enabled even for hard disks then only 60 seconds of 
filesystem history / activity will potentially be lost... this sounds 
like a reasonable trade-off compared to having your entire filesystem 
hampered if your hardware is perhaps not optimal (which is sort of the 
point of BTRFS' checksumming anyway).


So would it make sense to enable the SSD behavior by default for HDDs?!


It may be better to
perform an update-readback-compare on each superblock before moving
onto the next, so as to avoid this particular failure in the future. I
doubt this would slow things down much as the superblocks must be
cached in memory anyway.


That should be done by block layer, where things like dm-integrity could
help.



2) The recovery tools seem too dumb while thinking they are smarter
than they are. There should be some way to tell the various tools to
consider some subset of the drives in a system as worth considering.


My fault, in fact there is a -F option for dump-super, to force it to
recognize the bad superblock and output whatever it has.

In that case at least we could be able to see if it was really corrupted
or just some bitflip in magic numbers.


Not knowing that a superblock was a single 4096-byte sector, I had
primed my recovery by copying a valid superblock from one drive to the
clone of my broken drive before starting the ddrescue of the failing
drive. I had hoped that I could piece together a valid superblock from
a good drive, and whatever I could recover from the failing one. In
the end this turned out to be a useful strategy, but meanwhile I had
two drives that both claimed to be drive 2 of 4, and no drive claiming
to be drive 1 of 4. The tools completely failed to deal with this case
and were consistently preferring to read the bogus drive 2 instead of
the real drive 2, and it wasn't until I deliberately patched over the
magic in the cloned drive that I could use the various recovery tools
without bizarre and spurious errors. I understand how this was never
an anticipated scenario for the recovery process, but if its happened
once, it could happen again. Just dealing with a failing drive and its
clone both available in one 

Re: [PATCH] Btrfs: enchanse raid1/10 balance heuristic for non rotating devices

2017-12-28 Thread waxhead



Timofey Titovets wrote:

Currently the btrfs raid1/10 balancer balances requests to mirrors
based on pid % num of mirrors.

Update the logic and make it understand whether the underlying devices are non-rotational.

If one of the mirrors is non-rotational, then all read requests will be moved to
the non-rotational device.

And this would make reads, regardless of the PID, always end up on the 
fastest device, which sounds sane enough, but scrubbing will be even more 
important since there is less chance that a "random PID" will check 
the other copy every now and then.



If both mirrors are non-rotational, calculate the sum of
pending and in-flight requests for the queue on each bdev and use the
device with the shortest queue length.

I think this should be tried out on rotational disks as well. I am happy 
to test this out for you on a 7x disk server if you want.
Note: I have no experience with compiling kernels and applying patches 
(but I do code a bit in C every now and then) so a pre-compiled kernel 
would be required (I believe you are on Debian as well).
For rotational disks, perhaps it would not be wise to use another mirror 
unless the queue length is significantly higher than the other's. Again I 
am happy to test if tunables are provided.



P.S.
Inspired by md-raid1 read balancing

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/volumes.c | 59 ++
 1 file changed, 59 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9a04245003ab..98bc2433a920 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5216,13 +5216,30 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
*fs_info, u64 logical, u64 len)
return ret;
 }

+static inline int bdev_get_queue_len(struct block_device *bdev)
+{
+   int sum = 0;
+   struct request_queue *rq = bdev_get_queue(bdev);
+
+   sum += rq->nr_rqs[BLK_RW_SYNC] + rq->nr_rqs[BLK_RW_ASYNC];
+   sum += rq->in_flight[BLK_RW_SYNC] + rq->in_flight[BLK_RW_ASYNC];
+
+   /*
+* Try to prevent switching on every sneeze
+* by rounding the output up to a multiple of 2
+*/
+   return ALIGN(sum, 2);
+}
+
 static int find_live_mirror(struct btrfs_fs_info *fs_info,
struct map_lookup *map, int first, int num,
int optimal, int dev_replace_is_ongoing)
 {
int i;
int tolerance;
+   struct block_device *bdev;
struct btrfs_device *srcdev;
+   bool all_bdev_nonrot = true;

if (dev_replace_is_ongoing &&
fs_info->dev_replace.cont_reading_from_srcdev_mode ==
@@ -5231,6 +5248,48 @@ static int find_live_mirror(struct btrfs_fs_info 
*fs_info,
else
srcdev = NULL;

+   /*
+* Optimal is expected to be pid % num
+* That's generally ok for spinning rust drives
+* But if one of the mirrors is non-rotating,
+* that bdev can show better performance
+*
+* if one of the disks is non-rotating:
+*  - set optimal to the non-rotating device
+* if both disks are non-rotating:
+*  - set optimal to the bdev with the shortest queue
+* If both disks are spinning rust:
+*  - leave the old pid % num
+*/
+   for (i = 0; i < num; i++) {
+   bdev = map->stripes[i].dev->bdev;
+   if (!bdev)
+   continue;
+   if (blk_queue_nonrot(bdev_get_queue(bdev)))
+   optimal = i;
+   else
+   all_bdev_nonrot = false;
+   }
+
+   if (all_bdev_nonrot) {
+   int qlen;
+   /* Force the following logic's choice by initializing with some big number */
+   int optimal_dev_rq_count = 1 << 24;
+
+   for (i = 0; i < num; i++) {
+   bdev = map->stripes[i].dev->bdev;
+   if (!bdev)
+   continue;
+
+   qlen = bdev_get_queue_len(bdev);
+
+   if (qlen < optimal_dev_rq_count) {
+   optimal = i;
+   optimal_dev_rq_count = qlen;
+   }
+   }
+   }
+
/*
 * try to avoid the drive that is the source drive for a
 * dev-replace procedure, only choose it if no other non-missing




Re: Tiered storage?

2017-11-14 Thread waxhead
As a regular BTRFS user I can tell you that there is no such thing as 
hot data tracking yet. Some people seem to use bcache together with 
btrfs and come asking for help on the mailing list.


Raid5/6 has received a few fixes recently, and it *may* soon be worth 
trying out raid5/6 for data while keeping metadata in raid1/10 (I would 
rather lose a file or two than the entire filesystem).

I had plans to run some tests on this a while ago, but forgot about it.
As all good citizens do, remember to have good backups. Last time I tested 
raid5/6 I ran into issues easily. For what it's worth, raid1/10 
seems pretty rock solid as long as you have sufficient disks (hint: you 
need more than two for raid1 if you want to stay safe).


As for dedupe there is (to my knowledge) nothing fully automatic yet. 
You have to run a program to scan your filesystem, but all the 
deduplication itself is done in the kernel.
duperemove apparently works quite well; it did when I tested it, but there 
may be some performance implications.
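
If you want to try it, a typical invocation looks roughly like this (the 
path is just an example, check the duperemove man page for the current 
options):

# scan the tree for duplicate extents and ask the kernel to deduplicate them
duperemove -rd /mnt/data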


Roy Sigurd Karlsbakk wrote:

Hi all

I've been following this project on and off for quite a few years, and I wonder 
if anyone has looked into tiered storage on it. With tiered storage, I mean hot 
data lying on fast storage and cold data on slow storage. I'm not talking about 
caching (where you just keep a copy of the hot data on the fast storage).

And btw, how far is raid[56] and block-level dedup from something useful in 
production?

Vennlig hilsen

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita. (Carve the good in stone, write the bad in snow.)


Re: Several questions regarding btrfs

2017-11-06 Thread waxhead

ST wrote:

Hello,

I've recently learned about btrfs and consider to utilize for my needs.
I have several questions in this regard:

I manage a dedicated server remotely and have some sort of script that
installs an OS from several images. There I can define partitions and
their FSs.

1. By default the script provides a small separate partition for /boot
with ext3. Does it have any advantages or can I simply have /boot
within / all on btrfs? (Note: the OS is Debian9)

I am on Debian as well and run /boot on btrfs on multiple systems without 
any issues. Remember to run grub-install on all your disks and update-grub 
if you run it in a redundant setup. That way you can lose a disk and 
still be happy about it.
If you run a redundant setup like raid1 / raid10, make sure you have 
sufficient disks to avoid the filesystem entering read-only mode. See 
the status page for details.
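
On my systems that boils down to something like the following after adding 
or replacing a disk (device names are examples, adjust to your setup):

# put the bootloader on every disk in the array so any of them can boot the box
for dev in /dev/sda /dev/sdb /dev/sdc; do grub-install "$dev"; done
update-grub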



2. as for the / I get ca. following written to /etc/fstab:
UUID=blah_blah /dev/sda3 / btrfs ...
So top-level volume is populated after initial installation with the
main filesystem dir-structure (/bin /usr /home, etc..). As per btrfs
wiki I would like top-level volume to have only subvolumes (at least,
the one mounted as /) and snapshots. I can make a snapshot of the
top-level volume with the / structure, but how can I get rid of all the
directories within the top-level volume and keep only the subvolume
containing / (and later snapshots), unmount it and then mount the
snapshot that I took? rm -rf / - is not a good idea...

There are some tutorials floating around the web for this stuff. Just be 
careful, after a system update you might run into boot issues.

(I suggest you try playing with this in a VM first to see what happens)


3. in my current ext4-based setup I have two servers while one syncs
files of certain dir to the other using lsyncd (which launches rsync on
inotify events). As far as I have understood it is more efficient to use
btrfs send/receive (over ssh) than rsync (over ssh) to sync two boxes.
Do you think it would be possible to make lsyncd to use btrfs for
syncing instead of rsync? I.e. can btrfs work with inotify events? Did
somebody try it already?
Otherwise I can sync using btrfs send/receive from within cron every
10-15 minutes, but it seems less elegant.
I have no idea, but since Debian uses systemd you might be able to cook up 
something with systemd.path 
(https://www.freedesktop.org/software/systemd/man/systemd.path.html).
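
As a very rough sketch of a send/receive based sync (host names, paths and 
snapshot names are all made up, and send needs read-only snapshots):

# take a read-only snapshot and ship it to the other box
btrfs subvolume snapshot -r /data /data/snap-new
btrfs send /data/snap-new | ssh otherbox btrfs receive /backup
# subsequent runs can send only the difference against an older snapshot
btrfs send -p /data/snap-old /data/snap-new | ssh otherbox btrfs receive /backup

Whether that can be glued to inotify events the way lsyncd does it is 
another matter.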




4. In a case when compression is used - what quota is based on - (a)
amount of GBs the data actually consumes on the hard drive while in
compressed state or (b) amount of GBs the data naturally is in
uncompressed form. I need to set quotas as in (b). Is it possible? If
not - should I file a feature request?


No, you should not need to file a feature request it seems.
Look what Google and I found for you :)
https://btrfs.wiki.kernel.org/index.php/Quota_support
(hint: read the "using limits" section)
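
In short it comes down to something like this (sizes and paths are only 
examples, and the path must be a subvolume; see that wiki section for what 
the limit actually counts):

# quotas have to be enabled on the filesystem first
btrfs quota enable /mnt
# then cap how much the subvolume may use
btrfs qgroup limit 50G /mnt/home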


Thank you in advance!

No worries, good luck!





Re: Parity-based redundancy (RAID5/6/triple parity and beyond) on BTRFS and MDADM (Dec 2014) – Ronny Egners Blog

2017-11-02 Thread waxhead

Dave wrote:

Has this been discussed here? Has anything changed since it was written?

I have (more or less) been following the mailing list since this feature 
was suggested. I have been drooling over it since, but not much have 
happened.



Parity-based redundancy (RAID5/6/triple parity and beyond) on BTRFS
and MDADM (Dec 2014) – Ronny Egners Blog
http://blog.ronnyegner-consulting.de/2014/12/10/parity-based-redundancy-raid56triple-parity-and-beyond-on-btrfs-and-mdadm-dec-2014/comment-page-1/

TL;DR: There are patches to extend the linux kernel to support up to 6
parity disks but BTRFS does not want them because it does not fit
their “business case” and MDADM would want them but somebody needs to
develop patches for the MDADM component. The kernel raid
implementation is ready and usable. If someone volunteers to do this
kind of work I would support with equipment and myself as a test
resource.
--
I am just a list "stalker" and no BTRFS developer, but as others have 
indirectly said already, it is not so much that BTRFS doesn't want the 
patches as it is that BTRFS does not want to / can't focus on this right 
now due to other priorities.


There were some updates to raid5/6 in kernel 4.12 that should fix (or at 
least improve) scrub/auto-repair. The write hole does still exist.


That being said, there might be configurations where btrfs raid5/6 could 
be of some use. I think I read somewhere that you can set data to 
raid5/6 and METADATA to raid1 or 10, and you would then only risk losing 
some data (but not the filesystem) in the event of a system crash / power failure.
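
If anyone wants to try that mix on an existing filesystem, it should be 
something along these lines (the mount point is just an example, and do 
have backups first):

# put data chunks on raid5 while keeping metadata on raid1
btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt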


This sounds tempting since, in theory, it would not make btrfs raid5/6 
significantly less reliable than other RAIDs, which will corrupt your 
data if the disk happens to spit out bad bits without complaining (one 
possible exception that might catch this is md raid6, which I use). That 
being said, there is no way I would personally use btrfs raid5/6, even 
with metadata on raid1/10, without properly tested backups on standby at 
this point.


Anyway - I would worry more about getting raid5/6 to work properly 
before even thinking about multi-parity at all :)





8 disk metadata raid10 + data raid1

2017-08-17 Thread waxhead

Hi,

On one of my machines I run a BTRFS filesystem with the following 
configuration


Kernel: 4.11.0-1-amd64 #1 SMP Debian 4.11.6-1 (2017-06-19) x86_64 GNU/Linux
Disks: 8
Metadata: Raid 10
Data: Raid1

One of the disks is going bad, and while the system still runs fine I 
ran some md5sums on a few files and hit this bug.

Currently the only non-zero output from btrfs de st / is about 27 
write_io_errs and 278000 read_io_errs


I do have backups (hah! you did not expect that, Duncan, did you!)
and the data is not important on this filesystem.
Even though I got the below stuff thrown into dmesg, the system keeps 
running fine.


...the failed device disappeared as of writing this... so now the 
filesystem has one missing device.


---snip--- ( https://pastebin.ca/3856147 )
[120678.569637]  ? worker_thread+0x4d/0x490
[120678.570931]  ? kthread+0xfc/0x130
[120678.572206]  ? process_one_work+0x430/0x430
[120678.573480]  ? kthread_create_on_node+0x70/0x70
[120678.574756]  ? do_group_exit+0x3a/0xa0
[120678.576024]  ? ret_from_fork+0x26/0x40
[120678.577287] Code: 00 00 c7 43 28 00 00 00 00 b9 01 00 00 00 31 c0 eb 
d8 8d 48 02 eb da 41 89 e8 48 c7 c6 d8 67 4a c0 4c 89 e7 e8 c0 b9 fa ff 
eb 80 <0f> 0b 66 90 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57
[120678.580003] RIP: btrfs_check_repairable+0xe2/0xf0 [btrfs] RSP: 
ac3a419ffd60

[120678.581377] [ cut here ]
[120678.581526] ---[ end trace 1f2b98046a799b47 ]---
[120678.584109] kernel BUG at 
/build/linux-C5oXKu/linux-4.11.6/fs/btrfs/extent_io.c:2315!

[120678.585477] invalid opcode:  [#5] SMP
[120678.586821] Modules linked in: ebtable_filter ebtables 
ip6table_filter ip6_tables iptable_filter cpufreq_userspace 
cpufreq_powersave cpufreq_conservative intel_powerclamp iTCO_wdt 
iTCO_vendor_support coretemp kvm_intel cdc_ether usbnet mii kvm joydev 
mgag200 ttm irqbypass intel_cstate evdev intel_uncore drm_kms_helper drm 
pcspkr i2c_algo_bit lpc_ich mfd_core ioatdma sg ipmi_si ipmi_devintf dca 
ipmi_msghandler i7core_edac button edac_core i5500_temp shpchp 
acpi_cpufreq binfmt_misc ip_tables x_tables autofs4 btrfs crc32c_generic 
xor raid6_pq sd_mod hid_generic usbhid hid sr_mod cdrom ata_generic 
crc32c_intel ata_piix i2c_i801 libata mptsas ehci_pci scsi_transport_sas 
uhci_hcd mptscsih mptbase ehci_hcd scsi_mod usbcore usb_common bnx2
[120678.595485] CPU: 0 PID: 15305 Comm: kworker/u32:3 Tainted: G  D 
 I 4.11.0-1-amd64 #1 Debian 4.11.6-1
[120678.596960] Hardware name: IBM Lenovo ThinkServer RD220 
-[379811G]-/59Y3827 , BIOS -[D6EL28AUS-1.03]- 08/20/2009

[120678.598476] Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
[120678.599943] task: 9a4dc7bd6d40 task.stack: ac3a44214000
[120678.601431] RIP: 0010:btrfs_check_repairable+0xe2/0xf0 [btrfs]
[120678.602867] RSP: :ac3a44217d60 EFLAGS: 00010297
[120678.604271] RAX: 0001 RBX: 9a4daa28e080 RCX: 

[120678.605660] RDX: 0002 RSI:  RDI: 
9a4de67d6e18
[120678.607017] RBP: 0001 R08: 0dbbb224 R09: 
0dbbf224
[120678.608344] R10:  R11: fffb R12: 
9a4de67d6000
[120678.609641] R13: 9a4cfea6dda8 R14: 9a4cfea6dda8 R15: 

[120678.610921] FS:  () GS:9a4def20() 
knlGS:

[120678.612190] CS:  0010 DS:  ES:  CR0: 80050033
[120678.613435] CR2: 5627a2d96ff0 CR3: 00039310f000 CR4: 
06f0

[120678.614687] Call Trace:
[120678.615965]  ? end_bio_extent_readpage+0x42e/0x580 [btrfs]
[120678.617244]  ? btrfs_scrubparity_helper+0xcf/0x300 [btrfs]
[120678.618488]  ? process_one_work+0x197/0x430
[120678.619728]  ? worker_thread+0x4d/0x490
[120678.620960]  ? kthread+0xfc/0x130
[120678.622185]  ? process_one_work+0x430/0x430
[120678.623408]  ? kthread_create_on_node+0x70/0x70
[120678.624620]  ? do_group_exit+0x3a/0xa0
[120678.625817]  ? ret_from_fork+0x26/0x40
[120678.626994] Code: 00 00 c7 43 28 00 00 00 00 b9 01 00 00 00 31 c0 eb 
d8 8d 48 02 eb da 41 89 e8 48 c7 c6 d8 67 4a c0 4c 89 e7 e8 c0 b9 fa ff 
eb 80 <0f> 0b 66 90 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57
[120678.629525] RIP: btrfs_check_repairable+0xe2/0xf0 [btrfs] RSP: 
ac3a44217d60

[120678.630795] [ cut here ]
[120678.630839] ---[ end trace 1f2b98046a799b48 ]---
[120678.633382] kernel BUG at 
/build/linux-C5oXKu/linux-4.11.6/fs/btrfs/extent_io.c:2315!

[120678.634625] invalid opcode:  [#6] SMP
[120678.635820] Modules linked in: ebtable_filter ebtables 
ip6table_filter ip6_tables iptable_filter cpufreq_userspace 
cpufreq_powersave cpufreq_conservative intel_powerclamp iTCO_wdt 
iTCO_vendor_support coretemp kvm_intel cdc_ether usbnet mii kvm joydev 
mgag200 ttm irqbypass intel_cstate evdev intel_uncore drm_kms_helper drm 
pcspkr i2c_algo_bit lpc_ich mfd_core ioatdma sg ipmi_si ipmi_devintf dca 
ipmi_msghandler i7core_edac button edac_core i5500_temp 

Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-03 Thread waxhead

Brendan Hide wrote:
The title seems alarmist to me - and I suspect it is going to be 
misconstrued. :-/


From the release notes at 
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html


"Btrfs has been deprecated

The Btrfs file system has been in Technology Preview state since the 
initial release of Red Hat Enterprise Linux 6. Red Hat will not be 
moving Btrfs to a fully supported feature and it will be removed in a 
future major release of Red Hat Enterprise Linux.


The Btrfs file system did receive numerous updates from the upstream 
in Red Hat Enterprise Linux 7.4 and will remain available in the Red 
Hat Enterprise Linux 7 series. However, this is the last planned 
update to this feature.


Red Hat will continue to invest in future technologies to address the 
use cases of our customers, specifically those related to snapshots, 
compression, NVRAM, and ease of use. We encourage feedback through 
your Red Hat representative on features and requirements you have for 
file systems and storage technology."



First of all I am not a BTRFS dev, but I use it for various projects and 
have high hopes for what it can become.


Now, the fact that Red Hat deprecates BTRFS does not mean that BTRFS is 
deprecated. It is not removed from the kernel, and so far BTRFS offers 
features that other filesystems don't have. ZFS is something that people 
brag about all the time as a viable alternative, but to me it seems to 
be a pain to manage properly. E.g. grow, add/remove devices, shrink 
etc... good luck doing that right!


BTRFS's biggest problem is not that there are some bits and pieces that 
are thoroughly screwed up (raid5/6, which just got some fixes by the 
way), but the fact that the documentation is rather dated.


There is a simple status page here 
https://btrfs.wiki.kernel.org/index.php/Status


As others have pointed out already, the explanations on the status page 
are not exactly good. For example compression (which was also mentioned) 
is, as of writing this, marked as 'Mostly ok' '(needs verification and 
source) - auto repair and compression may crash'.


Now, I am aware that many use compression without trouble. I am not sure 
how many run compression on disks with issues and don't have trouble, 
but I would at least expect to see more people yelling on the mailing 
list if that were the case. The problem here is that this message is 
rather scary and certainly does NOT sound like 'mostly ok' to most people.


What exactly needs verification and a source? The 'mostly ok' statement or 
something else?! A more detailed explanation would be required here to 
avoid scaring people away.


The same goes for the trim feature that is marked OK. It clearly says 
that it has performance implications. It is marked OK, so one would 
expect it not to cause the filesystem to fail, but if the performance 
becomes so bad that the filesystem gets practically unusable it is of 
course not "OK". The relevant information is missing for people to make 
a decent choice, and I certainly don't know how serious these performance 
implications are, if they are at all relevant...


Most people interested in BTRFS are probably a bit more paranoid and 
concerned about their data than the average computer user. What people 
tend to forget is that other filesystems have NONE of the redundancy, 
auto-repair and other fancy features that BTRFS has. So for the 
compression example above... if you run compressed files on ext4 and 
your disk gets some corruption, you are in no better a state than 
you would be with btrfs (in fact probably worse). Also, nothing is 
stopping you from putting btrfs DUP on an mdadm raid5 or 6, which means you 
should be VERY safe.


Simple documentation is the key, so HERE ARE MY DEMANDS!!!... ehhh, 
so here is what I think should be done:


1. The documentation needs to be improved (or old, non-relevant 
stuff simply removed / archived somewhere)
2. The status page MUST always be up to date for the latest kernel 
release (it's ok so far, let's hope nobody sleeps here)
3. Proper explanations must be given so the layman and reasonably 
technical people understand the risks / issues for non-ok stuff.
4. There should be links to roadmaps for each feature on the status page 
that clearly state what is being worked on for the NEXT kernel release








Re: btrfs raid assurance

2017-07-25 Thread waxhead



Hugo Mills wrote:



You can see about the disk usage in different scenarios with the
online tool at:

http://carfax.org.uk/btrfs-usage/

Hugo.

As a side note, have you ever considered making this online tool (which 
should never go away, just for the record) part of btrfs-progs, e.g. a 
proper tool? I use it quite often (at least several times per month) 
and I would love for this to be a visual tool; 'btrfs-space-calculator' 
would be a great name for it I think.


Imagine how nice it would be to run

btrfs-space-calculator -mraid1 -draid10 /dev/sda1 /dev/sdb1 /dev/sdc2 
/dev/sdd2 /dev/sde3 for example and instantly get something similar to 
my example below (no accuracy intended)


d=data
m=metadata
.=unusable

{  500mb} [|d|] /dev/sda1
{ 3000mb} [|d|m|m|m|m|mm...|] /dev/sdb1
{ 3000mb} [|d|m|m|m|m|mmm..|] /dev/sdc2
{ 5000mb} 
[|d|m|m|m|m|m|m|m|m|m|] /dev/sdb1


{11500mb} Total space

usable for data (raid10): 1000mb / 2000mb
usable for metadata (raid1): 4500mb / 9000mb
unusable: 500mb

Of course this would have to change once (if ever) subvolumes can have 
different raid levels etc., but I would have loved using something like 
this instead of jumping around Carfax Abbey (!) at night.



Re: Best Practice: Add new device to RAID1 pool

2017-07-24 Thread waxhead



Chris Murphy wrote:

On Mon, Jul 24, 2017 at 5:27 AM, Cloud Admin  wrote:


I am a little bit confused because the balance command is running since
12 hours and only 3GB of data are touched.

That's incredibly slow. Something isn't right.

Using btrfs-debug -b from btrfs-progs, I've selected a few 100% full chunks.

[156777.077378] f26s.localdomain sudo[13757]:chris : TTY=pts/2 ;
PWD=/home/chris ; USER=root ; COMMAND=/sbin/btrfs balance start
-dvrange=157970071552..159043813376 /
[156773.328606] f26s.localdomain kernel: BTRFS info (device sda1):
relocating block group 157970071552 flags data
[156800.408918] f26s.localdomain kernel: BTRFS info (device sda1):
found 38952 extents
[156861.343067] f26s.localdomain kernel: BTRFS info (device sda1):
found 38951 extents

That 1GiB chunk with quite a few fragments took 88s. That's 11MB/s.
Even for a hard drive, that's slow. I
This may be a stupid question, but is your pool of butter (or BTRFS 
pool) by any chance hooked up via USB? If this is USB 2.0 at 480 Mbit/s 
then that is about 57 MB/s / 4 drives = roughly 14.25 MB/s, or about 11 MB/s 
if you shave off some overhead.




Exactly what is wrong with RAID5/6

2017-06-20 Thread waxhead

I am trying to piece together the actual status of the RAID5/6 bit of BTRFS.
The wiki refers to kernel 3.19, which was released in February 2015, so I 
assume that the information there is a tad outdated (the last update on 
the wiki page was July 2016).

https://btrfs.wiki.kernel.org/index.php/RAID56

Now there are four problems listed

1. Parity may be inconsistent after a crash (the "write hole")
Is this still true? If yes, would this not apply to RAID1 / RAID10 as 
well? How was it solved there, and why can't that be done for RAID5/6?


2. Parity data is not checksummed
Why is this a problem? Does it have to do with the design of BTRFS somehow?
Parity is after all just data, BTRFS does checksum data so what is the 
reason this is a problem?


3. No support for discard? (possibly -- needs confirmation with cmason)
Does this matter that much really?, is there an update on this?

4. The algorithm uses as many devices as are available: No support for a 
fixed-width stripe.
What is the plan for this one? There were patches on the mailing list by 
the SnapRAID author to support up to 6 parity devices. Will the 
(re)design of btrfs raid5/6 support a scheme that allows for multiple parity 
devices?


I do have a few other questions as well...

5. BTRFS still (as of kernel 4.9) does not seem to use the device ID to 
communicate with devices.


If, on a multi-device filesystem, you yank out a device, for example 
/dev/sdg, and it reappears as, say, /dev/sdx, btrfs will still 
happily try to write to /dev/sdg even if btrfs fi sh /mnt shows the 
correct device ID. What is the status of getting BTRFS to properly 
understand that a device is missing?


6. RAID1 needs to be able to make two copies always. E.g. if you have 
three disks you can lose one and it should still work. What about 
RAID10? If you have, for example, a 6-disk RAID10 array, lose one disk and 
reboot (due to #5 above), will RAID10 recognize that the array is now a 
5-disk array and stripe+mirror over 2 disks (or possibly 2.5 disks?) 
instead of 3? In other words, will it work as long as it can create a 
RAID10 profile, which requires a minimum of four disks?



Re: Home storage with btrfs

2017-03-13 Thread waxhead

Same here, I have been using BTRFS for a 'scratch' disk since about 2014.
The disk has had quite some abuse and no issues yet.
I don't use compression, snapshots or any fancy features.
I have recently moved all of the root filesystem to BTRFS with 5x SSD 
disks set up in RAID1 and everything is (still) working fine, and I have 
been shuffling large amounts of data on this volume. I bet the SSD's 
will break before BTRFS does, so the real test is yet to come I guess...
I am on Debian GNU/Linux with kernel 4.9.0-2-amd64 (Debian 4.9.13-1) - 
btrfs-progs 4.7.3


However, keep in mind that backups win the fight against binary-related 
traumas :)


Peter Becker wrote:

I can confirm this. I have also had no general issues over the past 2
years with BTRFS in RAID1 with 6 disks of different sizes, and also no
issues with the DUP profile on a single disk.
Only some performance issues with deduplication and very large files.
But I also recommend using a newer kernel (4.4 or higher), or better
the newest, and building a newer version of btrfs-progs from source.
I use Ubuntu 16.04 and kernel 4.9 + btrfs-progs 4.9 currently.

2017-03-13 13:02 GMT+01:00 Austin S. Hemmelgarn :

On 2017-03-13 07:52, Juan Orti Alcaine wrote:

2017-03-13 12:29 GMT+01:00 Hérikz Nawarro :

Hello everyone,

Today is safe to use btrfs for home storage? No raid, just secure
storage for some files and create snapshots from it.


In my humble opinion, yes. I'm running a RAID1 btrfs at home for 5
years and I feel the most serious bugs have been fixed, because in the
last two years I have not experienced any issue.

In general, I'd agree.  I've not seen any issues resulting from BTRFS itself
for the past 2.5 years (although it's helped me find quite a lot of marginal
or failing hardware over that time), but I've also not used many of the less
stable features (raid56, qgroups, and a handful of other things).

One piece of advice I will give though, try to keep the total number of
snapshots to a reasonably small three digit number (ideally less than 200,
absolutely less than 300), otherwise performance is going to be horrible.
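
(A quick way to check where you stand, assuming the filesystem is mounted 
at /mnt:

# count the snapshots known to the filesystem
btrfs subvolume list -s /mnt | wc -l

The -s switch lists only snapshot subvolumes.)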


Anyway, keeping your kernel and btrfs-progs updated is a must, and of
course, having good backups. I'm using Fedora and it's fine.

Also agreed, Fedora is one of the best options for a traditional distro
(they're very good about staying up to date and back-porting bug-fixes from
the upstream kernel).  The other two I'd recommend are Arch (they actually
use an almost upstream kernel and are generally the first distro to have new
versions of any arbitrary software) and Gentoo (similar to Arch, but more
maintenance intensive (although also more efficient (usually))).



Why does BTRFS (still) forget which device to write to?

2017-03-05 Thread waxhead

I am doing some test on BTRFS with both data and metadata in raid1.

uname -a
Linux daffy 4.9.0-1-amd64 #1 SMP Debian 4.9.6-3 (2017-01-28) x86_64 
GNU/Linux


btrfs--version
btrfs-progs v4.7.3


01. mkfs.btrfs /dev/sd[fgh]1
02. mount /dev/sdf1 /btrfs_test/
03. btrfs balance start -dconvert=raid1 /btrfs_test/
04. copied a lots of 3-4MB files to it (about 40GB)...
05. Started to compress some of the files to create one larger file...
06. Pulled the (sata) plug on one of the drives... (sdf1)
07. dmesg shows that the kernel is rejecting I/O to offline device + 
[sdf] killing request]
08. BTRFS error (device sdf1) bdev /dev/sdf1 errs: wr 0, rd 1, flush 0, 
corrupt 0, gen 0

09. the previous line repeats - increasing rd count
10. Reconnecting the sdf1 drive again makes it show up as sdi1
11. btrfs fi sh /btrfs_test shows sd1 as the correct device id (1).
12. Yet dmesg shows tons of errors like this: BTRFS error (device sdf1) 
: bdev /dev/sdi1 errs wr 37182, rd 39851, flush 1, corrupt 0, gen 0

13. and the above line repeats increasing wr, and rd errors.
14. BTRFS never seems to "get in tune again" while the filesystem is 
mounted.


The conclusion appears to be that the device ID is back again in the 
btrfs pool so why does btrfs still try to write to the wrong device (or 
does it?!).


The good thing here is that BTRFS does still work fine after an unmount 
and mount again. Running a scrub on the filesystem cleans up tons of 
errors, but finds no uncorrectable errors.


However it says total bytes scrubbed 94.21GB with 75 errors ... and 
further down it says corrected errors: 72, uncorrectable errors: 0 , 
unverified errors: 0


Why 75 vs 72 errors?! Did it correct all of them or not?

I have recently lost 1x 5-device BTRFS filesystem as well as 2x 3-device 
BTRFS filesystems set up in RAID1 (both data and metadata) by toying 
around with them. The 2x filesystems I lost were using all bad disks (all 
3 of them), but the one mentioned here uses good (but old) 400GB drives, 
just for the record.


By lost I mean that mount does not recognize the filesystem, but BTRFS 
fi sh does show that all devices are present. I did not make notes for 
those filesystems , but it appears that RAID1 is a bit fragile.


I don't need to recover anything. This is just a "toy system" for 
playing around with btrfs and doing some tests.



Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-04 Thread waxhead

Chris Murphy wrote:

On Thu, Mar 2, 2017 at 6:48 PM, Chris Murphy  wrote:


Again, my data is fine. The problem I'm having is this:
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/Documentation/filesystems/btrfs.txt?id=refs/tags/v4.10.1

Which says in the first line, in part, "focusing on fault tolerance,
repair and easy administration" and quite frankly this sort of
enduring bug in this file system that's nearly 10 years old now, is
rendered misleading, and possibly dishonest. How do we describe this
file system as focusing on fault tolerance when, in the identical
scenario using mdadm or LVM raid, the user's data is not mishandled
like it is on Btrfs with multiple devices?


I think until these problems are fixed, the Btrfs status page should
describe RAID 1 and 10 as mostly OK, with this problem as the reason
for it not being OK.


I took the liberty of changing the status page...


Re: RAID56 status?

2017-01-22 Thread Waxhead

Hugo Mills wrote:


On Sun, Jan 22, 2017 at 11:35:49PM +0100, Christoph Anton Mitterer wrote:

On Sun, 2017-01-22 at 22:22 +0100, Jan Vales wrote:

Therefore my question: what's the status of raid5/6 in btrfs?
Is it somehow "production"-ready by now?

AFAIK, what's on the - apparently already no longer updated -
https://btrfs.wiki.kernel.org/index.php/Status still applies, and
RAID56 is not yet usable for anything near production.

It's still all valid. Nothing's changed.

How would you like it to be updated? "Nope, still broken"?

Hugo.

I risked updating the wiki to show kernel version 4.9 instead of 4.7 
then...



Re: Is stability a joke? (wiki updated)

2016-09-12 Thread Waxhead

Pasi Kärkkäinen wrote:

On Mon, Sep 12, 2016 at 09:57:17PM +0200, Martin Steigerwald wrote:


Great.

I made two minor adaptations. I added a link to the Status page to my warning
before the changelog-by-feature page. And I also mentioned that at the time
the page was last updated the latest kernel version was 4.7. Yes, that's some
extra work to update the kernel version, but I think it's beneficial to
explicitly mention the kernel version the page talks about. Everyone who
updates the page can update the version within a second.


Hmm.. that will still leave people wondering "but I'm running Linux 4.4, not 4.7, I 
wonder what the status of feature X is.."

Should we also add a column for kernel version, so we can add "feature X is known to 
be OK on Linux 3.18 and later"..  ?
Or add those to "notes" field, where applicable?


-- Pasi

I think a separate column would be the best solution. Archiving the 
status page per kernel version (as I suggested) will lead to issues too. 
For example, if something that appears to be just fine in 4.6 is found to 
be horribly broken in, say, 4.10, the archive would still indicate that it 
WAS ok at that time even if it perhaps was not. Then you have regressions 
- something that worked in 4.4 may not work in 4.9. I still think the best 
idea is to simply label the status as ok / broken since 4.x, as those who 
really want to use a broken feature would probably do the research to see 
if it used to work. Besides, if something that used to work goes haywire 
it should be fixed quickly :)



Re: Is stability a joke?

2016-09-12 Thread Waxhead

Zoiled wrote:

Chris Mason wrote:



On 09/11/2016 04:55 AM, Waxhead wrote:
I have been following BTRFS for years and have recently been 
starting to

use BTRFS more and more and as always BTRFS' stability is a hot topic.
Some says that BTRFS is a dead end research project while others claim
the opposite.

Taking a quick glance at the wiki does not say much about what is safe
to use or not and it also points to some who are using BTRFS in 
production.

While BTRFS can apparently work well in production it does have some
caveats, and finding out what features is safe or not can be 
problematic

and I especially think that new users of BTRFS can easily be bitten if
they do not do a lot of research on it first.

The Debian wiki for BTRFS (which is recent by the way) contains a bunch
of warnings and recommendations and is for me a bit better than the
official BTRFS wiki when it comes to how to decide what features to 
use.


The Nouveau graphics driver has a nice feature matrix on its webpage
and I think that BTRFS perhaps should consider doing something like that
on its official wiki as well

For example something along the lines of the following (the statuses are taken
out of thin air just for demonstration purposes)



The out of thin air part is a little confusing, I'm not sure if 
you're basing this on reports you've read?


Well, to be honest I used "whatever I felt was right" more or less in 
that table, and as I wrote it was for demonstration purposes only, to 
show how such a table could look.
I'm in favor of flagging device replace with raid5/6 as not supported 
yet. That seems to be where most of the problems are coming in.


The compression framework shouldn't allow one to work well with the 
other being unusable.
Ok, good to know. However, in the Debian wiki, as well as the link to 
the mailing list, only LZO compression is mentioned (as far as I 
remember), and I have no idea myself how much difference there is 
between the LZO and the ZLIB code.


There were  problems with autodefrag related to snapshot-aware 
defrag, so Josef disabled the snapshot aware part.


In general, we put btrfs through heavy use at facebook.  The crcs 
have found serious hardware problems the other filesystems missed.


We've also uncovered performance problems and a some serious bugs, 
both in btrfs and the other filesystems.  With the other filesystems 
the fixes were usually upstream (doubly true for the most serious 
problems), and with btrfs we usually had to make the fixes ourselves.


-chris

I'll just pop this in here since I assume most people will read the 
response from your comment:


I think I made my point. The wiki lacks some good documentation on 
what's safe to use and what's not. Yesterday I (Svein Engelsgjerd) did 
put a table on the main wiki, and someone has moved that to a status 
page and also improved the layout a bit. It is a tad more complex than 
my version, but also a lot better for the slightly more advanced users, 
and it actually made my view on things a bit clearer as well.


I am glad that by bringing this up I (hopefully) contributed to 
improving the documentation a tiny bit! :)




Just for the record - sorry for using my "crap mail" - I sometimes 
forget to change to the correct sender. I am therefore Svein Engelsgjerd 
a.k.a. Waxhead a.k.a. "Zoiled" :)

...sorry for the confusion



Re: Is stability a joke?

2016-09-11 Thread Waxhead

Martin Steigerwald wrote:

On Sunday, 11 September 2016, 13:43:59 CEST, Martin Steigerwald wrote:

The Nouveau graphics driver has a nice feature matrix on its webpage
and I think that BTRFS perhaps should consider doing something like
that
on its official wiki as well

BTRFS also has a feature matrix. The links to it are in the "News"
section
however:

https://btrfs.wiki.kernel.org/index.php/Changelog#By_feature

I disagree, this is not a feature / stability matrix. It is a clearly a
changelog by kernel version.

It is a *feature* matrix. I fully said its not about stability, but about
implementation – I just wrote this a sentence after this one. There is no
need  whatsoever to further discuss this as I never claimed that it is a
feature / stability matrix in the first place.


Thing is: this just seems to be a "when was a feature implemented"
matrix.
Not when it is considered to be stable. I think this could be done with
colors or so. Like red for not supported, yellow for implemented and
green for production ready.

Exactly, just like the Nouveau matrix. It clearly shows what you can
expect from it.

I mentioned this matrix as a good *starting* point. And I think it would be
easy to extend it:

Just add another column called "Production ready". Then research / ask about
production stability of each feature. The only challenge is: Who is
authoritative on that? I´d certainly ask the developer of a feature, but I´d
also consider user reports to some extent.

Maybe thats the real challenge.

If you wish, I´d go through each feature there and give my own estimation. But
I think there are others who are deeper into this.
That is exactly why I don't edit the wiki myself. I could of course get 
it started, and hopefully someone would correct what I write, but I feel 
I don't have deep enough knowledge to do a proper start. Perhaps I will 
change my mind about this.


I do think for example that scrubbing and auto raid repair are stable, except
for RAID 5/6. Also device statistics and RAID 0 and 1 I consider to be stable.
I think RAID 10 is also stable, but as I do not run it, I don´t know. For me
also skinny-metadata is stable. For me so far even compress=lzo seems to be
stable, but well for others it may not.

Since what kernel version? Now, there you go. I have no idea. All I know I
started BTRFS with Kernel 2.6.38 or 2.6.39 on my laptop, but not as RAID 1 at
that time.

See, the implementation time of a feature is much easier to assess. Maybe
that's part of the reason why there is no stability matrix: maybe no one
*exactly* knows *for sure*. How could you? So I would even put a footnote on
that "production ready" column explaining "Considered to be stable by developer
and user opinions".

Of course additionally it would be good to read about experiences of corporate
usage of BTRFS. I know at least Fujitsu, SUSE, Facebook, Oracle are using it.
But I don´t know in what configurations and with what experiences. One Oracle
developer invests a lot of time to bring BTRFS like features to XFS and RedHat
still favors XFS over BTRFS, even SLES defaults to XFS for /home and other non
/-filesystems. That also tells a story.

Some ideas you can get from SUSE releasenotes. Even if you do not want to use
it, it tells something and I bet is one of the better sources of information
regarding your question you can get at this time. Cause I believe SUSE
developers invested some time to assess the stability of features. Cause they
would carefully assess what they can support in enterprise environments. There
is also someone from Fujitsu who shared experiences in a talk, I can search
the URL to the slides again.
By all means, SUSE's wiki is very valuable. I just said that I *prefer* 
to have that stuff on the BTRFS wiki and feel that is the right place 
for it.


I bet Chris Mason and other BTRFS developers at Facebook have some idea on
what they use within Facebook as well. To what extent they are allowed to talk
about it… I don´t know. My personal impression is that as soon as Chris went
to Facebook he became quite quiet. Maybe just due to being busy. Maybe due to
Facebook being concerned much more about the privacy of itself than of its
users.

Thanks,




Re: Is stability a joke?

2016-09-11 Thread Waxhead

Martin Steigerwald wrote:

On Sunday, 11 September 2016, 13:21:30 CEST, Zoiled wrote:

Martin Steigerwald wrote:

On Sunday, 11 September 2016, 10:55:21 CEST, Waxhead wrote:

I have been following BTRFS for years and have recently been starting to
use BTRFS more and more and as always BTRFS' stability is a hot topic.
Some says that BTRFS is a dead end research project while others claim
the opposite.

First off: On my systems BTRFS definitely runs too stable for a research
project. Actually: I have zero issues with stability of BTRFS on *any* of
my systems at the moment and in the last half year.

The only issue I had till about half a year ago was BTRFS getting stuck
seeking free space on a highly fragmented RAID 1 + compress=lzo /home.
This went away with either kernel 4.4 or 4.5.

Additionally I never ever lost even a single byte of data on my own BTRFS
filesystems. I had a checksum failure on one of the SSDs, but BTRFS RAID 1
repaired it.


Where do I use BTRFS?

1) On this ThinkPad T520 with two SSDs. /home and / in RAID 1, another
data
volume as single. In case you can read german, search blog.teamix.de for
BTRFS.

2) On my music box ThinkPad T42 for /home. I did not bother to change / so
far and may never do so for this laptop. It has a slow 2.5 inch harddisk.

3) I used it on Workstation at work as well for a data volume in RAID 1.
But workstation is no more (not due to a filesystem failure).

4) On a server VM for /home with Maildirs and Owncloud data. /var is still
on Ext4, but I want to migrate it as well. Whether I ever change /, I
don´t know.

5) On another server VM, a backup VM which I currently use with
borgbackup.
With borgbackup I actually wouldn´t really need BTRFS, but well…

6) On *all* of my external eSATA based backup harddisks for snapshotting
older states of the backups.

In other words, you are one of those who claim the opposite :) I have
also run btrfs myself on a "toy" filesystem since 2013 without any
issues, but this is more or less irrelevant, since some people have
experienced data loss thanks to unstable features that are not clearly
marked as such.
And making the claim that you have not lost a single byte of data does not
make sense. How did you test this? SHA256 against a backup? :)

Do you have any proof like that with *any* other filesystem on Linux?

No, my claim is a bit weaker: BTRFS own scrubbing feature and well no I/O
errors on rsyncing my data over to the backup drive - BTRFS checks checksum on
read as well –, and yes I know BTRFS uses a weaker hashing algorithm, I think
crc32c. Yet this is still more than what I can say about *any* other
filesystem I used so far. Up to my current knowledge neither XFS nor Ext4/3
provide data checksumming. They do have metadata checksumming and I found
contradicting information on whether XFS may support data checksumming in the
future, but up to now, no *proof* *whatsoever* from side of the filesystem
that the data is, what it was when I saved it initially. There may be bit
errors rotting on any of your Ext4 and XFS filesystem without you even
noticing for *years*. I think thats still unlikely, but it can happen, I have
seen this years ago after restoring a backup with bit errors from a hardware
RAID controller.

Of course, I rely on the checksumming feature within BTRFS – which may have
errors. But even that is more than with any other filesystem I had before.

And I do not scrub daily, especially not the backup disks, but for any scrubs
up to now, no issues. So, granted, my claim has been a bit bold. Right now I
have no up-to-this-day scrubs so all I can say is that I am not aware of any
data losses up to the point in time where I last scrubbed my devices. Just
redoing the scrubbing now on my laptop.
The way I see it, BTRFS is the best filesystem we have got so far. It is 
also the first (to my knowledge) that provides checksums of both data and 
metadata. My point was simply that such an extraordinary claim requires 
some evidence. I am not saying it is unlikely that you have never lost a 
byte, I am just saying that it is a fantastic thing to claim.
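
For what it is worth, a rough sketch of the kind of verification hinted at 
above - SHA256 manifests compared between the live filesystem and a backup 
(the paths are made up purely for illustration):

cd /mnt/data   && find . -type f -print0 | xargs -0 sha256sum | sort -k2 > /tmp/live.sha256
cd /mnt/backup && find . -type f -print0 | xargs -0 sha256sum | sort -k2 > /tmp/backup.sha256
diff /tmp/live.sha256 /tmp/backup.sha256    # any output means the two copies differ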

The Debian wiki for BTRFS (which is recent by the way) contains a bunch
of warnings and recommendations and is for me a bit better than the
official BTRFS wiki when it comes to how to decide what features to use.

Nice page. I wasn´t aware of this one.

If you use BTRFS with Debian, I suggest to usually use the recent backport
kernel, currently 4.6.

Hmmm, maybe I had better remove that compress=lzo mount option. Never saw any
issue with it, though. Will research what they say about it.

My point exactly: You did not know about this and hence the risk of your
data being gnawed on.

Well, I do follow the BTRFS mailing list to some extent, and I recommend anyone
who uses BTRFS in production to do the same. And: so far I see no data loss from
using that option, and for me personally that is exactly what counts. :)
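
As an aside, a small sketch of how that option is typically checked and set 
on a mounted filesystem (the mount point here is just an example):

# show the mount options currently in effect for /home
findmnt -no OPTIONS /home | tr ',' '\n' | grep compress
# enable lzo compression for new writes without unmounting
mount -o remount,compress=lzo /home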

Still: An information on what features are stable with what version

Is stability a joke?

2016-09-11 Thread Waxhead
I have been following BTRFS for years and have recently been starting to 
use BTRFS more and more, and as always BTRFS' stability is a hot topic.
Some say that BTRFS is a dead-end research project while others claim 
the opposite.


Taking a quick glance at the wiki does not say much about what is safe 
to use or not, and it also points to some who are using BTRFS in production.
While BTRFS can apparently work well in production it does have some 
caveats, and finding out which features are safe or not can be problematic, 
and I especially think that new users of BTRFS can easily be bitten if 
they do not do a lot of research on it first.


The Debian wiki for BTRFS (which is recent by the way) contains a bunch 
of warnings and recommendations and is for me a bit better than the 
official BTRFS wiki when it comes to how to decide what features to use.


The Nouveau graphics driver has a nice feature matrix on its webpage 
and I think that BTRFS perhaps should consider doing something like that 
on its official wiki as well.


For example something along the lines of the following (the statuses are taken 
out of thin air just for demonstration purposes):


Kernel version 4.7
+----------------------------+--------+-----+--------+--------+--------+-------+--------+
| Feature / Redundancy level | Single | Dup | Raid0  | Raid1  | Raid10 | Raid5 | Raid 6 |
+----------------------------+--------+-----+--------+--------+--------+-------+--------+
| Subvolumes                 | Ok     | Ok  | Ok     | Ok     | Ok     | Bad   | Bad    |
+----------------------------+--------+-----+--------+--------+--------+-------+--------+
| Snapshots                  | Ok     | Ok  | Ok     | Ok     | Ok     | Bad   | Bad    |
+----------------------------+--------+-----+--------+--------+--------+-------+--------+
| LZO Compression            | Bad(1) | Bad | Bad    | Bad(2) | Bad    | Bad   | Bad    |
+----------------------------+--------+-----+--------+--------+--------+-------+--------+
| ZLIB Compression           | Ok     | Ok  | Ok     | Ok     | Ok     | Bad   | Bad    |
+----------------------------+--------+-----+--------+--------+--------+-------+--------+
| Autodefrag                 | Ok     | Bad | Bad(3) | Ok     | Ok     | Bad   | Bad    |
+----------------------------+--------+-----+--------+--------+--------+-------+--------+

(1) Some explanation here...
(2) Some explanation there
(3) And some explanation elsewhere...

...etc...etc...

I therefore would like to propose that some sort of feature / stability 
matrix for the latest kernel is added to the wiki, preferably somewhere 
where it is easy to find. It would be nice to archive old matrices as 
well in case someone runs a somewhat older kernel (we who use Debian tend 
to like older kernels). In my opinion it would make things a bit easier 
and perhaps a bit less scary too. Remember, if you get bitten badly once 
you tend to stay away from it all just in case; if you on the other 
hand know what bites, you can safely pet the fluffy end instead :)



Re: Btrfs scrub failure for raid 6 kernel 4.3

2015-12-30 Thread Waxhead

Chris Murphy wrote:

Well all the generations on all devices are now the same, and so are
the chunk trees. I haven't looked at them in detail to see if there
are any discrepancies among them.

If you don't care much for this file system, then you could try btrfs
check --repair, using btrfs-progs 4.3.1 or integration branch. I have
no idea where btrfsck repair is at with raid56.

On the one hand, corruption should be fixed by scrub. But scrub fails
with a kernel trace. Maybe btrfs check --repair can fix the tree block
corruption since scrub can't, and then if that corruption is fixed,
possibly scrub will work.

I could not care less about this particular filesystem, as I wrote in the 
original post. It's just for having some fun with btrfs. What I find 
troublesome is that corrupting one (or even two) drives in a Raid6 
config fails. Granted, the filesystem "works", e.g. I can mount it and 
access files, but I get an input/output error on a file on this 
filesystem, and btrfs only shows warnings (not errors) on device sdg1 
where the csum failed.
A raid6 setup should work fine even with two disks (or, in this 
case, chunks of data) missing, and even if I don't care about this 
filesystem I care about btrfs getting stable ;) so if I can help I'll 
keep this filesystem around for a little longer!




Re: Btrfs scrub failure for raid 6 kernel 4.3

2015-12-30 Thread Waxhead

Waxhead wrote:

Chris Murphy wrote:

Well all the generations on all devices are now the same, and so are
the chunk trees. I haven't looked at them in detail to see if there
are any discrepancies among them.

If you don't care much for this file system, then you could try btrfs
check --repair, using btrfs-progs 4.3.1 or integration branch. I have
no idea where btrfsck repair is at with raid56.

On the one hand, corruption should be fixed by scrub. But scrub fails
with a kernel trace. Maybe btrfs check --repair can fix the tree block
corruption since scrub can't, and then if that corruption is fixed,
possibly scrub will work.

I could not care less about this particular filesystem, as I wrote in 
the original post. It's just for having some fun with btrfs. What I 
find troublesome is that corrupting one (or even two) drives in a 
Raid6 config fails. Granted, the filesystem "works", e.g. I can mount it 
and access files, but I get an input/output error on a file on this 
filesystem, and btrfs only shows warnings (not errors) on device sdg1 
where the csum failed.
A raid6 setup should work fine even with two disks (or, in this 
case, chunks of data) missing, and even if I don't care about this 
filesystem I care about btrfs getting stable ;) so if I can help I'll 
keep this filesystem around for a little longer!


For your information, I tried a balance on the filesystem - a new stack 
trace is below (the system is still working).
Sorry for flooding the mailing list with the stack trace - this is what I 
got from dmesg; hope it is of some use... / gets used... :)


[  243.603661] CPU: 0 PID: 1182 Comm: btrfs Tainted: G W   
4.3.0-1-686-pae #1 Debian 4.3.3-2

[  243.603664] Hardware name: Acer AOA150/, BIOS v0.3310 10/06/2008
[  243.603676]   09f7a8eb eef57990 c12ae3c5  c106685d 
c1614e20 
[  243.603687]  049e f86df010 190a f86350ff 0009 f86350ff 
f1dd8b18 
[  243.603697]  0078 eef579a0 c1066962 0009  eef57a6c 
f86350ff 

[  243.603699] Call Trace:
[  243.603716]  [] ? dump_stack+0x3e/0x59
[  243.603724]  [] ? warn_slowpath_common+0x8d/0xc0
[  243.603763]  [] ? __btrfs_free_extent+0xbbf/0xec0 [btrfs]
[  243.603798]  [] ? __btrfs_free_extent+0xbbf/0xec0 [btrfs]
[  243.603806]  [] ? warn_slowpath_null+0x22/0x30
[  243.603837]  [] ? __btrfs_free_extent+0xbbf/0xec0 [btrfs]
[  243.603877]  [] ? __btrfs_run_delayed_refs+0x96e/0x11a0 [btrfs]
[  243.603889]  [] ? __percpu_counter_add+0x8e/0xb0
[  243.603930]  [] ? btrfs_run_delayed_refs+0x6d/0x250 [btrfs]
[  243.603969]  [] ? btrfs_should_end_transaction+0x3c/0x60 
[btrfs]

[  243.604003]  [] ? btrfs_drop_snapshot+0x426/0x850 [btrfs]
[  243.604110]  [] ? merge_reloc_roots+0xee/0x260 [btrfs]
[  243.604152]  [] ? remove_backref_node+0x67/0xe0 [btrfs]
[  243.604198]  [] ? relocate_block_group+0x28f/0x750 [btrfs]
[  243.604242]  [] ? btrfs_relocate_block_group+0x1d8/0x2e0 
[btrfs]
[  243.604282]  [] ? btrfs_relocate_chunk.isra.29+0x3d/0xf0 
[btrfs]

[  243.604326]  [] ? btrfs_balance+0x97c/0x12e0 [btrfs]
[  243.604338]  [] ? __alloc_pages_nodemask+0x13b/0x850
[  243.604345]  [] ? get_page_from_freelist+0x3dd/0x5c0
[  243.604391]  [] ? btrfs_ioctl_balance+0x385/0x390 [btrfs]
[  243.604430]  [] ? btrfs_ioctl+0x793/0x2c50 [btrfs]
[  243.604437]  [] ? __alloc_pages_nodemask+0x13b/0x850
[  243.604443]  [] ? terminate_walk+0x69/0xc0
[  243.604453]  [] ? anon_vma_prepare+0xdf/0x130
[  243.604460]  [] ? page_add_new_anon_rmap+0x6c/0x90
[  243.604468]  [] ? handle_mm_fault+0xa63/0x14f0
[  243.604476]  [] ? __rb_insert_augmented+0xf3/0x1c0
[  243.604520]  [] ? update_ioctl_balance_args+0x1c0/0x1c0 [btrfs]
[  243.604527]  [] ? do_vfs_ioctl+0x2e2/0x500
[  243.604534]  [] ? do_brk+0x113/0x2b0
[  243.604542]  [] ? __do_page_fault+0x1a0/0x460
[  243.604549]  [] ? SyS_ioctl+0x68/0x80
[  243.604557]  [] ? sysenter_do_call+0x12/0x12
[  243.604563] ---[ end trace eb3e6200cba2a564 ]---
[  243.604654] [ cut here ]
[  243.604695] WARNING: CPU: 0 PID: 1182 at 
/build/linux-P8Ifgy/linux-4.3.3/fs/btrfs/extent-tree.c:6410 
__btrfs_free_extent+0xbbf/0xec0 [btrfs]()
[  243.604813] Modules linked in: cpufreq_stats cpufreq_conservative 
cpufreq_userspace bnep cpufreq_powersave zram zsmalloc lz4_compress nfsd 
auth_rpcgss oid_registry nfs_acl lockd grace sunrpc joydev iTCO_wdt 
iTCO_vendor_support sparse_keymap arc4 acerhdf coretemp pcspkr evdev 
psmouse serio_raw i2c_i801 uvcvideo videobuf2_vmalloc videobuf2_memops 
videobuf2_core v4l2_common videodev media lpc_ich mfd_core btusb btrtl 
btbcm btintel rng_core bluetooth ath5k ath snd_hda_codec_realtek 
snd_hda_codec_generic mac80211 jmb38x_ms snd_hda_intel i915 cfg80211 
memstick snd_hda_codec rfkill snd_hda_core snd_hwdep drm_kms_helper 
snd_pcm snd_timer shpchp snd soundcore drm i2c_algo_bit wmi battery 
video ac button acpi_cpufreq processor sg loop autofs4 uas usb_storage 
ext4 crc16 mbcache jbd2 crc32c_generic btrfs xor
[  243.604837]

Re: Btrfs scrub failure for raid 6 kernel 4.3

2015-12-29 Thread Waxhead

Chris Murphy wrote:

On Mon, Dec 28, 2015 at 3:55 PM, Waxhead <waxh...@online.no> wrote:


I tried the following

  btrfs-image -t4 -c9 /dev/sdb1 /btrfs_raid6.img
checksum verify failed on 28734324736 found C3E98F3B wanted EB2392C6
checksum verify failed on 28734324736 found C3E98F3B wanted EB2392C6
checksum verify failed on 28734324736 found 5F516E2A wanted BBB2D39C
checksum verify failed on 28734324736 found C4AA0B8D wanted 41745FB5
checksum verify failed on 28734324736 found C3E98F3B wanted EB2392C6
bytenr mismatch, want=28734324736, have=16273726433708437499
Error reading metadata block
Error adding block -5
checksum verify failed on 28734324736 found C3E98F3B wanted EB2392C6
checksum verify failed on 28734324736 found C3E98F3B wanted EB2392C6
checksum verify failed on 28734324736 found 5F516E2A wanted BBB2D39C
checksum verify failed on 28734324736 found C4AA0B8D wanted 41745FB5
checksum verify failed on 28734324736 found C3E98F3B wanted EB2392C6
bytenr mismatch, want=28734324736, have=16273726433708437499
Error reading metadata block
Error flushing pending -5
create failed (Success)

Well, I can't make out what this is supposed to mean, but no output file...

Dunno. Maybe btrfs-show-super -fa for each device, along with
btrfs-debug-tree  output to a file might have some useful info
for a dev. It could be a while before we hear from one though,
considering the season.

The btrfs-debug-tree output is about 56.3 megabytes of text for each 
device. Only a few checksums (marked [match]) and dev_item.uuid differ 
between the files for the different partitions, so I will leave that out of this post.


The output of btrfs-show-super -fa for each device follows:

--snip--

superblock: bytenr=65536, device=/dev/sdb1
-
csum			0xdedfa466 [match]
bytenr			65536
flags			0x1
			( WRITTEN )
magic			_BHRfS_M [match]
fsid			2832346e-0720-499f-8239-355534e5721b
label
generation		15529
root			28539699200
sys_array_size		257
chunk_root_generation	15524
root_level		0
chunk_root		28416425984
chunk_root_level	0
log_root		0
log_root_transid	0
log_root_level		0
total_bytes		49453105152
bytes_used		9239322624
sectorsize		4096
nodesize		16384
leafsize		16384
stripesize		4096
root_dir		6
num_devices		6
compat_flags		0x0
compat_ro_flags		0x0
incompat_flags		0xe1
			( MIXED_BACKREF |
			  BIG_METADATA |
			  EXTENDED_IREF |
			  RAID56 )
csum_type		0
csum_size		4
cache_generation	15529
uuid_tree_generation	15529
dev_item.uuid		c14fb599-f515-4feb-a458-227af0af683b
dev_item.fsid		2832346e-0720-499f-8239-355534e5721b [match]
dev_item.type		0
dev_item.total_bytes	8242184192
dev_item.bytes_used	3305111552
dev_item.io_align	4096
dev_item.io_width	4096
dev_item.sector_size	4096
dev_item.devid		1
dev_item.dev_group	0
dev_item.seek_speed	0
dev_item.bandwidth	0
dev_item.generation	0
sys_chunk_array[2048]:
item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 28416409600)
chunk length 67108864 owner 2 stripe_len 65536
type SYSTEM|RAID6 num_stripes 6
stripe 0 devid 5 offset 303038464
dev uuid: c292e6f4-d113-47af-96bf-1b262ef28c77
stripe 1 devid 4 offset 1048576
dev uuid: 3458b150-02a7-44a5-81e5-61a734e30439
stripe 2 devid 3 offset 1074790400
dev uuid: a1ae2083-2730-4611-b019-ba9954c8fa13
stripe 3 devid 2 offset 1074790400
dev uuid: 9a1af441-88fc-43b1-b1f8-e1d1163195ef
stripe 4 devid 6 offset 1074790400
dev uuid: 094d2a3c-f538-4a4e-850a-e611f2517e7f
stripe 5 devid 1 offset 1074790400
dev uuid: c14fb599-f515-4feb-a458-227af0af683b
backup_roots[4]:
backup 0:
backup_tree_root:	28541386752	gen: 15526	level: 0
backup_chunk_root:	28416425984	gen: 15524	level: 0
backup_extent_root:	28541403136	gen: 15526	level: 1
backup_fs_root:		28496560128	gen: 15514	level: 2
backup_dev_root:	28540829696	gen: 15524	level: 0
backup_csum_root:	28541452288	gen: 15526	level: 2
backup_total_bytes:	49453105152
backup_bytes_used:	9164218368
backup_num_devices:	6

backup 1:
backup_tree_root:	28502654976	gen: 15527	level: 0
backup_chunk_root:	28416425984	gen: 15524	level: 0
backup_extent_root:	28513714176	gen: 15528	level: 1
backup_fs_root:		28513566720	gen: 15528	level: 2
backup_dev_root:	28513550336	gen: 15527	level: 0
backup_csum_root:	28513615872	gen: 15528	level: 2
backup_total_bytes:	49453105152
backup_bytes_used:	9206

Re: Btrfs scrub failure for raid 6 kernel 4.3

2015-12-28 Thread Waxhead

Chris Murphy wrote:

On Sun, Dec 27, 2015 at 7:04 PM, Waxhead <waxh...@online.no> wrote:


Since all drives register and since I can even mount the filesystem.

OK so you've umounted the file system, reconnected all devices,
mounted the file system normally, and there are no problems reported
in dmesg?

If so, yes I agree that a scrub should probably work, it should fix
any problems with the simulated corrupt device, and also not crash.

What if you umount, and run btrfs check without --repair, what are the
results? This is btrfs-progs 4.3.1?


The output from dmesg after mounting
[  546.857533] BTRFS info (device sdg1): disk space caching is enabled
[  546.872126] BTRFS: bdev /dev/sde1 errs: wr 10, rd 0, flush 0, corrupt 
29094, gen 4
[  546.872165] BTRFS: bdev /dev/sdb1 errs: wr 16, rd 7, flush 0, corrupt 
0, gen 0



This is the output I get from btrfs check /dev/sdb1 > somefile.output 
(note the filesystem was checked in unmounted state)

Checking filesystem on /dev/sdb1
UUID: 2832346e-0720-499f-8239-355534e5721b
The following tree block(s) is corrupted in tree 5:
tree block bytenr: 28488941568, level: 1, node key: (52273, 1, 0)
The following data extent is lost in tree 5:
inode: 104721, offset:0, disk_bytenr: 37828165632, disk_len: 524288
found 9161007108 bytes used err is 1
total csum bytes: 8859672
total tree bytes: 80969728
total fs tree bytes: 66633728
total extent tree bytes: 3211264
btree space waste bytes: 10638420
file data blocks allocated: 7929208832
 referenced 7929208832
btrfs-progs v4.3

I also get a pretty fantastic amount of errors that are not redirected 
to the file (they appear to be printed to stderr).
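
If those messages do indeed go to stderr, redirecting both streams should 
capture everything in one file:

btrfs check /dev/sdb1 > somefile.output 2>&1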


---snip---
parent transid verify failed on 28597895168 wanted 371 found 339
parent transid verify failed on 28597895168 wanted 371 found 339
checksum verify failed on 28597895168 found 5D16DA87 wanted B9F56731
checksum verify failed on 28597895168 found 1183EB4E wanted C18D87AC
checksum verify failed on 28597895168 found 1183EB4E wanted C18D87AC
bytenr mismatch, want=28597895168, have=147474999040
Incorrect local backref count on 37826895872 root 5 owner 104850 offset 
0 found 0 wanted 1 back 0x94ac688
Backref disk bytenr does not match extent record, bytenr=37826895872, 
ref bytenr=0

backpointer mismatch on [37826895872 131072]
owner ref check failed [37826895872 131072]
ref mismatch on [37827117056 475136] extent item 1, found 0
parent transid verify failed on 28597714944 wanted 371 found 339
parent transid verify failed on 28597714944 wanted 371 found 339
checksum verify failed on 28597714944 found 49CB81B9 wanted AD283C0F
checksum verify failed on 28597714944 found D9F20AF8 wanted 1A5EE553
checksum verify failed on 28597714944 found D9F20AF8 wanted 1A5EE553
bytenr mismatch, want=28597714944, have=147480498944
Incorrect local backref count on 37827117056 root 5 owner 104719 offset 
0 found 0 wanted 1 back 0x93688b0
Backref disk bytenr does not match extent record, bytenr=37827117056, 
ref bytenr=37827026944

backpointer mismatch on [37827117056 475136]
owner ref check failed [37827117056 475136]
ref mismatch on [37827641344 487424] extent item 1, found 0
parent transid verify failed on 28597714944 wanted 371 found 339
parent transid verify failed on 28597714944 wanted 371 found 339
checksum verify failed on 28597714944 found 49CB81B9 wanted AD283C0F
checksum verify failed on 28597714944 found D9F20AF8 wanted 1A5EE553
checksum verify failed on 28597714944 found D9F20AF8 wanted 1A5EE553
bytenr mismatch, want=28597714944, have=147480498944
Incorrect local backref count on 37827641344 root 5 owner 104720 offset 
0 found 0 wanted 1 back 0x94ac778
Backref disk bytenr does not match extent record, bytenr=37827641344, 
ref bytenr=0

backpointer mismatch on [37827641344 487424]
owner ref check failed [37827641344 487424]
ref mismatch on [37828165632 524288] extent item 1, found 0
parent transid verify failed on 28597714944 wanted 371 found 339
parent transid verify failed on 28597714944 wanted 371 found 339
checksum verify failed on 28597714944 found 49CB81B9 wanted AD283C0F
checksum verify failed on 28597714944 found D9F20AF8 wanted 1A5EE553
checksum verify failed on 28597714944 found D9F20AF8 wanted 1A5EE553
bytenr mismatch, want=28597714944, have=147480498944
Incorrect local backref count on 37828165632 root 5 owner 104721 offset 
0 found 0 wanted 1 back 0x94ac868
Backref disk bytenr does not match extent record, bytenr=37828165632, 
ref bytenr=0

backpointer mismatch on [37828165632 524288]
owner ref check failed [37828165632 524288]
checking free space cache
checking fs roots
---snip end---

---snippety snip---
root 5 inode 53325 errors 2001, no inode item, link count wrong
unresolved ref dir 52207 index 0 namelen 14 name pthreadtypes.h 
filetype 1 errors 6, no dir index, no inode ref

root 5 inode 53328 errors 2001, no inode item, link count wrong
unresolved ref dir 52207 index 0 namelen 8 name select.h 
filetype 1 errors 6, no dir index, no inode ref

root 5 inode 533

Re: Btrfs scrub failure for raid 6 kernel 4.3

2015-12-28 Thread Waxhead

Duncan wrote:

Waxhead posted on Mon, 28 Dec 2015 03:04:33 +0100 as excerpted:


Duncan wrote:

Waxhead posted on Mon, 28 Dec 2015 00:06:46 +0100 as excerpted:


btrfs scrub status /mnt
scrub status for 2832346e-0720-499f-8239-355534e5721b
   scrub started at Sun Mar 29 23:21:04 2015
Now here is the first worrying part... it says that scrub started at
Sun Mar 29.

Hmm...  The status is stored in readable plain-text files in /var/lib/
btrfs/scrub.status.*, where the * is the UUID.  If you check there, the
start time (t_start) seems to be in POSIX time.

Is it possible you were or are running the scrub from, for instance, a
rescue image that might not set the system time correctly and that
falls back to, say, the date the rescue image was created, if it can't
get network connectivity or some such?


No I don't think so

# ls -la
/var/lib/btrfs/scrub.status.2832346e-0720-499f-8239-355534e5721b
-rw--- 1 root root 2315 Mar 29  2015
/var/lib/btrfs/scrub.status.2832346e-0720-499f-8239-355534e5721b

# cat /var/lib/btrfs/scrub.status.2832346e-0720-499f-8239-355534e5721b
scrub status:1
2832346e-0720-499f-8239-355534e5721b:1|[...]|t_start:1427664064|[...]

# date Mon Dec 28 02:54:11 CET 2015

Just to clear up any possible misunderstandings. I run this from a
simple netbook, and I have no idea why the date is off by so much.

Well, both the file time and the unix time in the file say back in March,
so whatever time syncing mechanism you use on that netbook, it evidently
failed the boot you did that scrub.

The netbook is set up with NTP with pfSense as a host server. The 
pfSense is itself synched with multiple pools.

Note:  I have used the same USB drives (memory sticks really) to create
various configs of btrfs filesystems earlier. Could it be old metadata
in the filesystem that mess up things? Is not metadata stamped with the
UUID of the filesystem to prevent such things?

Yes, metadata is stamped with UUID.  But one other possible explanation
for the scrub time back in March might be if you were already playing
with it back then, and somehow you have a USB stick with a filesystem
from back then that... somehow... has the same UUID as the one you're
experimenting on today.
Yes, I have played around with these USB sticks for a long time, 
probably also before March 29.


Don't ask me how it could get the same UUID.  I don't understand it
either.  But if it did somehow happen, btrfs would be /very/ confused,
and crashing scrubs and further data corruption could certainly result.
What if my use of dd accidentally trashed some important part of the new 
filesystem and btrfs therefore thinks an older version of the filesystem 
is the current one? If UUIDs are in every metadata block I find that 
pretty hard to believe. What if the UUID == 0? Is this accounted for?

Of course if you weren't experimenting with btrfs on these devices back
at the end of March and there's absolutely no way they could have gotten
btrfs on them until say October or whenever, then we're back to the date
somehow being wrong for that scrub, and having to look elsewhere for why
scrub is crashing.

No, by all means - I tried a lot of weird stuff on those USB sticks way 
before March, so they definitely had a (multi-disk) btrfs filesystem on 
them before.



Btrfs scrub failure for raid 6 kernel 4.3

2015-12-27 Thread Waxhead

Hi,

I have a "toy-array" of 6x USB drives hooked up to a hub where I made a 
btrfs raid 6 data+metadata filesystem.


I copied some files to the filesystem, ripped out one USB drive and 
ruined it by writing dd if=/dev/random output to various locations on the 
drive. I put the USB drive back and the filesystem mounts ok.
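
Roughly, the damage was done with something like this (the device name, 
offsets and sizes here are purely hypothetical; I did not note the real ones):

# purely illustrative - the real offsets/sizes were not recorded
dd if=/dev/random of=/dev/sdX1 bs=1M seek=100  count=4 conv=fsync
dd if=/dev/random of=/dev/sdX1 bs=1M seek=2000 count=4 conv=fsync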


If I start a scrub, after a few seconds I get the following:

 kernel:[   50.844026] CPU: 1 PID: 91 Comm: kworker/u4:2 Not tainted 
4.3.0-1-686-pae #1 Debian 4.3.3-2
 kernel:[   50.844026] Hardware name: Acer AOA150/, BIOS v0.3310 
10/06/2008
 kernel:[   50.844026] Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper 
[btrfs]
 kernel:[   50.844026] task: f642c040 ti: f664c000 task.ti: f664c000
 kernel:[   50.844026] Stack:
 kernel:[   50.844026]  0005 f0d20800 f664ded0 f86d0262  f664deac 
c109a0fc 0001
 kernel:[   50.844026]  f79eac40 edb4a000 edb7a000 edb8a000 edbba000 eccc1000 
ecca1000 
 kernel:[   50.844026]   f664de68 0003 f664de74 ecb23000 f664de5c 
f5cda6a4 f0d20800
 kernel:[   50.844026] Call Trace:
 kernel:[   50.844026]  [] ? finish_parity_scrub+0x272/0x560 [btrfs]
 kernel:[   50.844026]  [] ? set_next_entity+0x8c/0xba0
 kernel:[   50.844026]  [] ? bio_endio+0x40/0x70
 kernel:[   50.844026]  [] ? btrfs_scrubparity_helper+0xce/0x270 
[btrfs]
 kernel:[   50.844026]  [] ? process_one_work+0x14d/0x360
 kernel:[   50.844026]  [] ? worker_thread+0x39/0x440
 kernel:[   50.844026]  [] ? process_one_work+0x360/0x360
 kernel:[   50.844026]  [] ? kthread+0xa6/0xc0
 kernel:[   50.844026]  [] ? ret_from_kernel_thread+0x21/0x30
 kernel:[   50.844026]  [] ? kthread_create_on_node+0x130/0x130
 kernel:[   50.844026] Code: 6e c1 e8 ac dd f2 ff 83 c4 04 5b 5d c3 8d b6 00 00 
00 00 31 c9 81 3d 84 f0 6e c1 84 f0 6e c1 0f 95 c1 eb b9 8d b4 200 00 00 00 0f 
0b 8d b4 26 00 00 00 00 8d bc 27 00
 kernel:[   50.844026] EIP: [] kunmap_high+0xa8/0xc0 SS:ESP 
0068:f664de40

This is only a test setup and I will keep this filesystem for a while if 
it can be of any use...



Re: Btrfs scrub failure for raid 6 kernel 4.3

2015-12-27 Thread Waxhead

Duncan wrote:

Waxhead posted on Mon, 28 Dec 2015 00:06:46 +0100 as excerpted:


btrfs scrub status /mnt scrub status for
2832346e-0720-499f-8239-355534e5721b
  scrub started at Sun Mar 29 23:21:04 2015 and finished after
00:01:04
  total bytes scrubbed: 1.97GiB with 14549 errors error details:
  super=2 csum=14547 corrected errors: 0, uncorrectable errors:
  14547, unverified
errors: 0

Now here is the first worrying part... it says that scrub started at Sun
Mar 29. That is NOT true, the first scrub I did on this filesystem was a
few days ago and it claims it is a lot of uncorrectable errors. Why?
This is after all a raid6 filesystem correct?!

Hmm...  The status is stored in readable plain-text files in /var/lib/
btrfs/scrub.status.*, where the * is the UUID.  If you check there, the
start time (t_start) seems to be in POSIX time.

Is it possible you were or are running the scrub from, for instance, a
rescue image that might not set the system time correctly and that falls
back to, say, the date the rescue image was created, if it can't get
network connectivity or some such?


No I don't think so

# ls -la /var/lib/btrfs/scrub.status.2832346e-0720-499f-8239-355534e5721b
-rw--- 1 root root 2315 Mar 29  2015 
/var/lib/btrfs/scrub.status.2832346e-0720-499f-8239-355534e5721b


# cat /var/lib/btrfs/scrub.status.2832346e-0720-499f-8239-355534e5721b
scrub status:1
2832346e-0720-499f-8239-355534e5721b:1|data_extents_scrubbed:5391|tree_extents_scrubbed:21|data_bytes_scrubbed:352542720|tree_bytes_scrubbed:344064|read_errors:0|csum_errors:0|verify_errors:0|no_csum:32|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:3306160128|t_start:1427664064|t_resumed:0|duration:51|canceled:0|finished:1
2832346e-0720-499f-8239-355534e5721b:2|data_extents_scrubbed:5404|tree_extents_scrubbed:26|data_bytes_scrubbed:353517568|tree_bytes_scrubbed:425984|read_errors:0|csum_errors:0|verify_errors:0|no_csum:64|csum_discards:2|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:3306160128|t_start:1427664064|t_resumed:0|duration:51|canceled:0|finished:1
2832346e-0720-499f-8239-355534e5721b:3|data_extents_scrubbed:5396|tree_extents_scrubbed:19|data_bytes_scrubbed:352718848|tree_bytes_scrubbed:311296|read_errors:0|csum_errors:0|verify_errors:0|no_csum:48|csum_discards:2|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:3306160128|t_start:1427664064|t_resumed:0|duration:51|canceled:0|finished:1
2832346e-0720-499f-8239-355534e5721b:4|data_extents_scrubbed:5391|tree_extents_scrubbed:31|data_bytes_scrubbed:352739328|tree_bytes_scrubbed:507904|read_errors:0|csum_errors:14547|verify_errors:0|no_csum:32|csum_discards:0|super_errors:2|malloc_errors:0|uncorrectable_errors:14547|corrected_errors:0|last_physical:2282749952|t_start:1427664064|t_resumed:0|duration:64|canceled:0|finished:1
2832346e-0720-499f-8239-355534e5721b:5|data_extents_scrubbed:5393|tree_extents_scrubbed:23|data_bytes_scrubbed:352665600|tree_bytes_scrubbed:376832|read_errors:0|csum_errors:0|verify_errors:0|no_csum:48|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:2534408192|t_start:1427664064|t_resumed:0|duration:51|canceled:0|finished:1
2832346e-0720-499f-8239-355534e5721b:6|data_extents_scrubbed:5407|tree_extents_scrubbed:33|data_bytes_scrubbed:353361920|tree_bytes_scrubbed:540672|read_errors:0|csum_errors:0|verify_errors:0|no_csum:48|csum_discards:2|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:3306160128|t_start:1427664064|t_resumed:0|duration:51|canceled:0|finished:1

# date
Mon Dec 28 02:54:11 CET 2015

Just to clear up any possible misunderstandings. I run this from a 
simple netbook, and I have no idea why the date is off by so much.
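
As a sanity check, the t_start value in the status file above is a plain 
Unix timestamp and can be decoded directly (on a machine in the same 
CET/CEST timezone):

date -d @1427664064
# -> Sun Mar 29 23:21:04 CEST 2015, i.e. exactly the start time that scrub status reports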


All drives register, and I can even mount the filesystem. 
Since I can reproduce this every time I try to start a scrub, I have not 
tried to run balance, defrag or just md5sum all the files on the 
filesystem to see if that fixes things up a bit. In a raid6 config you 
should be able to lose up to two drives, and honestly so far only one 
drive is hampered, and even if another one for any bizarre reason should 
contain damaged data, things should "just work", right?


Note: I have used the same USB drives (memory sticks really) to create 
various configs of btrfs filesystems earlier. Could it be old metadata 
in the filesystem that messes things up? Isn't metadata stamped with the 
UUID of the filesystem to prevent such things?




Re: Btrfs scrub failure for raid 6 kernel 4.3

2015-12-27 Thread Waxhead

Chris Murphy wrote:

On Sun, Dec 27, 2015 at 6:59 AM, Waxhead <waxh...@online.no> wrote:

Hi,

I have a "toy-array" of 6x USB drives hooked up to a hub where I made a
btrfs raid 6 data+metadata filesystem.

I copied some files to the filesystem, ripped out one USB drive and ruined
it dd if=/dev/random to various locations on the drive. Put the USB drive
back and the filesystem mounts ok.

If i start scrub I after seconds get the following

  kernel:[   50.844026] CPU: 1 PID: 91 Comm: kworker/u4:2 Not tainted
4.3.0-1-686-pae #1 Debian 4.3.3-2
  kernel:[   50.844026] Hardware name: Acer AOA150/, BIOS v0.3310
10/06/2008
  kernel:[   50.844026] Workqueue: btrfs-endio-raid56
btrfs_endio_raid56_helper [btrfs]
  kernel:[   50.844026] task: f642c040 ti: f664c000 task.ti: f664c000
  kernel:[   50.844026] Stack:
  kernel:[   50.844026]  0005 f0d20800 f664ded0 f86d0262 
f664deac c109a0fc 0001
  kernel:[   50.844026]  f79eac40 edb4a000 edb7a000 edb8a000 edbba000
eccc1000 ecca1000 
  kernel:[   50.844026]   f664de68 0003 f664de74 ecb23000
f664de5c f5cda6a4 f0d20800
  kernel:[   50.844026] Call Trace:
  kernel:[   50.844026]  [] ? finish_parity_scrub+0x272/0x560
[btrfs]
  kernel:[   50.844026]  [] ? set_next_entity+0x8c/0xba0
  kernel:[   50.844026]  [] ? bio_endio+0x40/0x70
  kernel:[   50.844026]  [] ? btrfs_scrubparity_helper+0xce/0x270
[btrfs]
  kernel:[   50.844026]  [] ? process_one_work+0x14d/0x360
  kernel:[   50.844026]  [] ? worker_thread+0x39/0x440
  kernel:[   50.844026]  [] ? process_one_work+0x360/0x360
  kernel:[   50.844026]  [] ? kthread+0xa6/0xc0
  kernel:[   50.844026]  [] ? ret_from_kernel_thread+0x21/0x30
  kernel:[   50.844026]  [] ? kthread_create_on_node+0x130/0x130
  kernel:[   50.844026] Code: 6e c1 e8 ac dd f2 ff 83 c4 04 5b 5d c3 8d b6 00
00 00 00 31 c9 81 3d 84 f0 6e c1 84 f0 6e c1 0f 95 c1 eb b9 8d b4 200 00 00
00 0f 0b 8d b4 26 00 00 00 00 8d bc 27 00
  kernel:[   50.844026] EIP: [] kunmap_high+0xa8/0xc0 SS:ESP
0068:f664de40

This is only a test setup and I will keep this filesystem for a while if it
can be of any use...

Sounds like a bug, but also might be missing functionality still. If
you can include the reproduce steps, including the exact
locations+lengths of the random writes, that's probably useful.

More than one thing could be going on. First, I don't know that Btrfs
even understands the device went missing because it doesn't yet have a
concept of faulty devices, and then I've seen it get confused when
drives reappear with new drive designations (not uncommon), and from
your call trace we don't know if that happened because there's not
enough information posted. Second, if the damage is too much on a
device, it almost certainly isn't recognized when reattached. But this
depends on what locations were damaged. If Btrfs doesn't recognize the
drive as part of the array, then the scrub request is effectively a
scrub for a volume with a missing drive which you probably wouldn't
ever do, you'd first replace the missing device. Scrubs happen on
normally operating arrays not degraded ones. So it's uncertain either
Btrfs, or the user, had any idea what state the volume was actually in
at the time.

Conversely on mdadm, it knows in such a case to mark a device as
faulty, the array automatically goes degraded, but when the drive is
reattached it is not automatically re-added. When the user re-adds,
typically a complete rebuild happens unless there's a write-intent
bitmap, which isn't a default at create time.


I am afraid I can't include the exact steps to reproduce.
I do however have the filesystem in a "bad state", so if there is 
anything I can do - let me know.


First of all ... a "btrfs filesystem show" does list all drives
Label: none  uuid: 2832346e-0720-499f-8239-355534e5721b
Total devices 6 FS bytes used 8.53GiB
devid1 size 7.68GiB used 3.08GiB path /dev/sdb1
devid2 size 7.68GiB used 3.08GiB path /dev/sdc1
devid3 size 7.68GiB used 3.08GiB path /dev/sdd1
devid4 size 7.68GiB used 3.08GiB path /dev/sde1
devid5 size 7.68GiB used 3.08GiB path /dev/sdf1
devid6 size 7.68GiB used 3.08GiB path /dev/sdg1

mount /dev/sdb1 /mnt/
btrfs filesystem df /mnt

Data, RAID6: total=12.00GiB, used=8.45GiB
System, RAID6: total=64.00MiB, used=16.00KiB
Metadata, RAID6: total=256.00MiB, used=84.58MiB
GlobalReserve, single: total=32.00MiB, used=0.00B

btrfs scrub status /mnt
scrub status for 2832346e-0720-499f-8239-355534e5721b
scrub started at Sun Mar 29 23:21:04 2015 and finished after 
00:01:04

total bytes scrubbed: 1.97GiB with 14549 errors
error details: super=2 csum=14547
corrected errors: 0, uncorrectable errors: 14547, unverified 
errors: 0


Now here is the first worrying part... it says that scrub started at Sun 
Mar 29. That is NOT true, the first scrub I did on this filesystem was a 
fe

Re: Hot data Tracking

2012-05-03 Thread Waxhead

David Sterba wrote:

On Sat, Feb 11, 2012 at 05:49:41AM +0100, Timo Witte wrote:

What happened to the hot data tracking feature in btrfs? There are a lot
of old patches from aug 2010, but it looks like the feature has been
completly removed from the current version of btrfs. Is this feature
still on the roadmap?

Removed? AFAIK it hasn't been ever merged, though it's be a nice
feature. There were suggestions to turn it into a generic API for any
filesystem to use, but this hasn't happened.

The patches are quite independent and it was easy to refresh them on top
of current for-linus branch. A test run did not survive a random
xfstest, 013 this time, so I probably mismerged some bits. The patchset
lives in branch foreign/ibm/hotdatatrack in my git repo.


david

Someone recently mentioned bcache in another post, which seems to cover 
this subject fairly well. However, would it not make sense if btrfs 
actually was able to automatically take advantage of whatever disks are 
added to the pool? For example, if you have 10 disks of different sizes 
and performance in a raid5/6-like configuration, would it not be feasible 
if btrfs could (optionally) manage its own cache automagically? For 
example, it could reserve a chunk of free space as cache (based on how 
much space is free) and stripe data over all disks (the cache). When the 
filesystem becomes idle, or at set intervals, it could empty the cache and 
move/rebalance pending writes over to the original raid5/6-like setup.
As far as I remember, hot data tracking was all about moving the data 
over to the fastest disk. Why not utilize all disks and benefit from 
disks working together?
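
For completeness, a very rough sketch of the bcache approach mentioned 
above, assuming bcache-tools and a bcache-capable kernel; the device names 
are examples only:

# /dev/sdb = slow backing disk, /dev/sdc = fast SSD used as cache (examples)
make-bcache -B /dev/sdb                               # backing device, shows up as /dev/bcache0
make-bcache -C /dev/sdc                               # cache set, prints its cset UUID
echo <cset-uuid> > /sys/block/bcache0/bcache/attach   # attach the cache to the backing device
mkfs.btrfs /dev/bcache0                               # btrfs then lives on the cached device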


Svein Engelsgjerd



Is btrfsck really required?

2012-03-25 Thread Waxhead
After playing around with btrfs for a while, reading about it and also 
watching Avi Miller's presentation on youtube I am starting to wonder 
why one would need btrfsck at all. I am no expert in filesystems so I 
apologize if any of these questions may sound a bit stupid.


1. How self-healing is btrfs really? According to Miller's talk, btrfs 
makes a (circular?) backup of the root tree every 30 seconds.
If I remember correctly the root tree is also mirrored in several places 
on disk, and on rotational media all of those are updated in tandem. This 
leads me to believe that there should be no problem in recovering from 
a corruption.


2. Also, in addition to question 1: is there some sanity checking when 
writing the root tree? E.g. if you write garbage to the root tree by 
accident, will there be some recovery mechanism to protect you as well?


3. What is the point of mount -o recovery? If there already is a 
corruption, is there any reason btrfs should not recover automatically 
by itself? (A small usage sketch follows after these questions.)


4. If a disk responds slowly, will btrfs throw it out of a raid 
configuration, and if so, will btrfsck be less strict about timeouts and 
will it automatically rebalance the data from the bad disk over to other 
good disks?
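
Regarding question 3, a minimal usage sketch (the device name is just an 
example; -o recovery asks btrfs to fall back to an older tree root if the 
newest one is damaged):

mount /dev/sdb1 /mnt || mount -o recovery /dev/sdb1 /mnt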




How well does BTRFS manage different sized disks?

2012-01-25 Thread Waxhead

Hi,

Can someone shed some light on how BTRFS will manage a bunch of disks of 
varying size for the planned raid5/6, e.g. 3x 2TB disks and 1x 250GB 
disk? If using a raid5 setup, will 750GB of usable space automatically 
be laid out as a 4-disk raid5 while the rest is used as a 3-disk raid5? 
If so, how do you control which files end up on the speedy section of the volume?



Will BTRFS repair or restore data if corrupted?

2012-01-25 Thread Waxhead

Hi,

From what I have read, BTRFS replaces a bad copy of data with a 
known good copy (if it has one). Will BTRFS try to repair the corrupt 
data, or will it simply silently restore the data without the user 
knowing that a file has been fixed?
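
For what it is worth, with a reasonably current kernel and btrfs-progs 
there are a few places where such repairs do become visible (the mount 
point is an example):

dmesg | grep -i btrfs          # checksum failures and corrected read errors are logged here
btrfs device stats /mnt        # persistent per-device error counters
btrfs scrub start -B /mnt      # a scrub reports how many errors it corrected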

