Re: Recover btrfs volume which can only be mounted in read-only mode

2015-10-26 Thread Hugo Mills
On Mon, Oct 26, 2015 at 09:14:00AM +, Duncan wrote:
> Dmitry Katsubo posted on Sun, 18 Oct 2015 11:44:08 +0200 as excerpted:
> 
> >> Meanwhile, the present btrfs raid1 read-scheduler is both pretty simple
> >> to code up and pretty simple to arrange tests for that run either one
> >> side or the other, but not both, or that are well balanced to both.
> >> However, it's pretty poor in terms of ensuring optimized real-world
> >> deployment read-scheduling.
> >> 
> >> What it does is simply this.  Remember, btrfs raid1 is specifically two
> >> copies.  It chooses which copy of the two will be read very simply,
> >> based on the PID making the request.  Odd PIDs get assigned one copy,
> >> even PIDs the other.  As I said, simple to code, great for ensuring
> >> testing of one copy or the other or both, but not really optimized at
> >> all for real-world usage.
> >> 
> >> If your workload happens to be a bunch of all odd or all even PIDs,
> >> well, enjoy your testing-grade read-scheduler, bottlenecking everything
> >> reading one copy, while the other sits entirely idle.
> > 
> > I think a PID-based solution is not the best one. Why not simply take a
> > random device? Then at least all drives in the volume are equally loaded
> > (on average).
> 
> Nobody argues that the even/odd-PID-based read-scheduling solution is 
> /optimal/, in a production sense at least.  But at the time and for the 
> purpose it was written it was pretty good, arguably reasonably close to 
> "best", because the implementation is at once simple and transparent for 
> debugging purposes, and real easy to test either one side or the other, 
> or both, and equally important, to duplicate the results of those tests, 
> by simply arranging for the testing to have either all even or all odd 
> PIDs, or both.  And for ordinary use, it's good /enough/, as ordinarily, 
> PIDs will be evenly distributed even/odd.
> 
> In that context, your random device read-scheduling algorithm would be 
> far worse, because while being reasonably simple, it's anything *but* 
> easy to ensure reads go to only one side or equally to both, or for that 
> matter, to duplicate the tests, because randomization, by definition 
> does /not/ lend itself to duplication.

   For what it's worth, David tried implementing round-robin (IIRC)
some time ago, and found that it performed *worse* than the pid-based
system. (It may have been random, but memory says it was round-robin).

   Hugo.

-- 
Hugo Mills | Great films about cricket: The Umpire Strikes Back
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Recover btrfs volume which can only be mounted in read-only mode

2015-10-26 Thread Duncan
Dmitry Katsubo posted on Sun, 18 Oct 2015 11:44:08 +0200 as excerpted:

>> Meanwhile, the present btrfs raid1 read-scheduler is both pretty simple
>> to code up and pretty simple to arrange tests for that run either one
>> side or the other, but not both, or that are well balanced to both.
>> However, it's pretty poor in terms of ensuring optimized real-world
>> deployment read-scheduling.
>> 
>> What it does is simply this.  Remember, btrfs raid1 is specifically two
>> copies.  It chooses which copy of the two will be read very simply,
>> based on the PID making the request.  Odd PIDs get assigned one copy,
>> even PIDs the other.  As I said, simple to code, great for ensuring
>> testing of one copy or the other or both, but not really optimized at
>> all for real-world usage.
>> 
>> If your workload happens to be a bunch of all odd or all even PIDs,
>> well, enjoy your testing-grade read-scheduler, bottlenecking everything
>> reading one copy, while the other sits entirely idle.
> 
> I think a PID-based solution is not the best one. Why not simply take a
> random device? Then at least all drives in the volume are equally loaded
> (on average).

Nobody argues that the even/odd-PID-based read-scheduling solution is 
/optimal/, in a production sense at least.  But at the time and for the 
purpose it was written it was pretty good, arguably reasonably close to 
"best", because the implementation is at once simple and transparent for 
debugging purposes, and real easy to test either one side or the other, 
or both, and equally important, to duplicate the results of those tests, 
by simply arranging for the testing to have either all even or all odd 
PIDs, or both.  And for ordinary use, it's good /enough/, as ordinarily, 
PIDs will be evenly distributed even/odd.
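
To make that concrete, the even/odd-PID choice amounts to roughly the 
following minimal user-space sketch (an illustration only, not the actual 
kernel code; the pick_mirror() helper name is made up for the example):

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: choose which of num_copies mirrors to read from.
 * For btrfs raid1 there are exactly two copies, so this reduces to an
 * even/odd test on the requesting process's PID. */
static int pick_mirror(pid_t pid, int num_copies)
{
        return (int)(pid % num_copies);
}

int main(void)
{
        pid_t pid = getpid();

        printf("PID %ld reads mirror %d\n", (long)pid, pick_mirror(pid, 2));
        return 0;
}

Every even PID reads one copy and every odd PID the other, which is the 
whole of the raid1 read-scheduling behaviour described above.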

In that context, your random device read-scheduling algorithm would be 
far worse, because while being reasonably simple, it's anything *but* 
easy to ensure reads go to only one side or equally to both, or for that 
matter, to duplicate the tests, because randomization, by definition 
does /not/ lend itself to duplication.

And with both simplicity/transparency/debuggability and duplicatability 
of testing being primary factors when the code went in...

And again, the fact that it hasn't been optimized since then, in the 
context of "premature optimization", really says quite a bit about what 
the btrfs devs themselves consider btrfs' status to be -- obviously *not* 
production-grade stable and mature, or optimizations like this would have 
already been done.

Like it or not, that's btrfs' status at the moment.

Actually, the coming N-way-mirroring may very well be why they've not yet 
optimized the even/odd-PID mechanism already, because doing an optimized 
two-way would obviously be premature-optimization given the coming N-way, 
and doing an N-way clearly couldn't be properly tested at present, 
because only two-way is possible.  Introducing an optimized N-way 
scheduler together with the N-way-mirroring code necessary to properly 
test it thus becomes a no-brainer.

> From what you said I believe that certain servers will not benefit from
> btrfs, e.g. a dedicated server that runs only one "fat" Java process, or
> one "huge" MySQL database.

Indeed.  But with btrfs still "stabilizing, but not entirely stable and 
mature", and indeed, various features still set to drop, and various 
optimizations still yet to do including this one, nobody, leastwise not 
the btrfs devs and knowledgeable regulars on this list, is /claiming/ 
that btrfs is at this time the be-all and end-all optimal solution for 
every single use-case.  Rather far from it!

As for the claims of salespeople... should any of them be making wild 
claims about btrfs, who in their sane mind takes salespeople's claims at 
face value in any case?

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Recover btrfs volume which can only be mounted in read-only mode

2015-10-26 Thread Duncan
Dmitry Katsubo posted on Sun, 18 Oct 2015 11:44:08 +0200 as excerpted:

[Regarding the btrfs raid1 "device-with-the-most-space" chunk-allocation 
strategy.]

> I think the mentioned strategy (fill the device with the most free space)
> is not the most effective. If the data were spread equally, the read
> performance would be higher (reading from 3 disks instead of 2). In my
> case this is even crucial, because the smallest drive is an SSD (and it is
> not loaded at all).
> 
> Maybe I don't see the benefit from the strategy which is currently
> implemented (besides that it is robust and well-tested)?

Two comments:

1) As Hugo alluded to, in striped mode (raid0/5/6 and I believe 10), the 
chunk allocator goes wide, allocating a chunk from each device with free 
space, then striping at something smaller (64 KiB maybe?).  When the 
smallest device is full, it reduces the width by one and continues 
allocating, down to the minimum stripe width for the raid type.  However, 
raid1 and single do device-with-the-most-space first, thus, particularly 
for raid1, ensuring maximum usage of available space.

Were raid1 to do width-first, capacity would be far lower and much more 
of the largest device would remain unusable.  Some chunk pairs would be 
allocated entirely on the smaller devices, meaning less of the largest 
device would be used before the smaller devices filled up; at that point 
no more raid1 chunks could be allocated, since only the single largest 
device would have free space left and raid1 requires allocation on two 
separate devices.

In the three-device raid1 case, the difference in usable capacity would 
be 1/3 the capacity of the smallest device, since until it is full, 1/3 
of all allocations would be to the two smaller devices, leaving that much 
more space unusable on the largest device.

So you see there's a reason for most-space-first: it forces one chunk 
from each pair-allocation onto the largest device, thereby distributing 
space most efficiently, leaving as little space as possible unusable 
because only one device still has room when pair-allocation is required.
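
To put numbers on that, here's a toy simulation (not btrfs code; the 
1 GiB chunk size and the 500/250/120 GiB device sizes are made-up 
examples) of most-space-first raid1 chunk allocation, which stops as 
soon as fewer than two devices still have room:

#include <stdio.h>

#define NDEV  3
#define CHUNK 1 /* GiB per chunk copy */

int main(void)
{
        long free_gib[NDEV] = { 500, 250, 120 };  /* example device sizes */
        long usable = 0;

        for (;;) {
                int a = -1, b = -1, i;

                /* Pick the two devices with the most free space. */
                for (i = 0; i < NDEV; i++) {
                        if (free_gib[i] < CHUNK)
                                continue;
                        if (a < 0 || free_gib[i] > free_gib[a]) {
                                b = a;
                                a = i;
                        } else if (b < 0 || free_gib[i] > free_gib[b]) {
                                b = i;
                        }
                }
                if (a < 0 || b < 0)
                        break;          /* fewer than two devices left */

                free_gib[a] -= CHUNK;   /* one copy on each of the two */
                free_gib[b] -= CHUNK;
                usable += CHUNK;        /* net usable data space */
        }

        printf("usable raid1 capacity: %ld GiB\n", usable);
        printf("unusable free space left: %ld/%ld/%ld GiB\n",
               free_gib[0], free_gib[1], free_gib[2]);
        return 0;
}

With those example sizes it reports 370 GiB usable: everything except the 
130 GiB by which the largest device exceeds the other two combined, which 
is the best any two-copy scheme can do.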

2) There has been talk of a more flexible chunk allocator with an admin-
specified strategy allowing smart use of hybrid ssd/disk filesystems.  
Perhaps put the metadata on the ssds, for instance, since btrfs metadata 
is relatively hot: in addition to the traditional metadata, it contains 
the checksums which btrfs of course checks on read.
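
As an aside, those checksums are crc32c.  Purely to illustrate the 
verify-on-read idea, here's a toy sketch (not btrfs code) that recomputes 
a block's crc32c on read; a mismatch is what would send btrfs to the 
other raid1 copy:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bitwise crc32c (Castagnoli polynomial), the checksum family btrfs uses. */
static uint32_t crc32c(const void *buf, size_t len)
{
        const uint8_t *p = buf;
        uint32_t crc = 0xffffffffu;

        while (len--) {
                crc ^= *p++;
                for (int i = 0; i < 8; i++)
                        crc = (crc >> 1) ^ (0x82f63b78u & -(crc & 1));
        }
        return crc ^ 0xffffffffu;
}

int main(void)
{
        const char block[] = "example data block";
        uint32_t stored = crc32c(block, strlen(block)); /* saved at write time */

        /* On read, recompute and compare; a mismatch would mean trying
         * the other mirror and reporting a corruption error. */
        if (crc32c(block, strlen(block)) == stored)
                puts("checksum ok, use this copy");
        else
                puts("checksum mismatch, try the other mirror");
        return 0;
}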

However, this sort of thing is likely to be some time off, as it's 
relatively lower priority than various other possible features.  
Unfortunately, given the rate of btrfs development, "some time off" is in 
practice likely to be at least five years out.

In the meantime, there are technologies such as bcache that allow hybrid 
caching of "hot" data; they are designed to present themselves as virtual 
block devices so that btrfs, as well as other filesystems, can layer on top.

And in fact, we have some regular users with btrfs on top of bcache 
actually deployed, and from reports, it now works quite well.  (There 
were some problems a while back, but they're several years in the past 
now, well before the last couple of LTS kernel series, the oldest 
recommended for btrfs deployment.)

If you're interested, start a new thread with btrfs on bcache in the 
subject line, and you'll likely get some very useful replies. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Recover btrfs volume which can only be mounted in read-only mode

2015-10-26 Thread Duncan
Hugo Mills posted on Mon, 26 Oct 2015 09:24:57 + as excerpted:

> On Mon, Oct 26, 2015 at 09:14:00AM +, Duncan wrote:
>> Dmitry Katsubo posted on Sun, 18 Oct 2015 11:44:08 +0200 as excerpted:
>> 
>>> I think a PID-based solution is not the best one. Why not simply take a
>>> random device? Then at least all drives in the volume are equally
>>> loaded (on average).
>> 
>> Nobody argues that the even/odd-PID-based read-scheduling solution is
>> /optimal/, in a production sense at least.  But [it's near ideal for
>> testing, and "good enough" for the most general case].
> 
> For what it's worth, David tried implementing round-robin (IIRC)
> some time ago, and found that it performed *worse* than the pid-based
> system. (It may have been random, but memory says it was round-robin).

What I'd like to know is what mdraid1 uses, and if btrfs can get that.  
Because some upgrades ago, after trying mdraid6 for the main system 
and mdraid0 for some parts (with mdraid1 for boot, since grub1 could deal 
with that but not the others), I eventually settled on 4-way mdraid1 for 
everything, using the same disks I had used for the raid6 and raid0.

And I was rather blown away by the mdraid1 speed, in comparison, 
especially compared to raid0, which I thought would be better than 
raid1.  I guess my use-case is multi-thread read-heavy enough that, with 
whatever mdraid1 uses, I was getting up to four separate reads (one per 
spindle) going at once, while writes still happened at single-spindle 
speed.  With SATA (as opposed to the older IDE; this was when SATA was 
still new), each spindle had its own channel and they could write in 
parallel, the bottleneck being the speed at which the slowest of the four 
completed its write.  So writes were single-spindle-speed, still far 
faster than the raid6 read-modify-write cycle, while reads... it really 
did appear to multitask one per spindle.

Also, the mdraid1 may have actually taken into account spindle head 
location as well, and scheduled reads to the spindle with the head 
already positioned closest to the target, tho I'm not sure on that.
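
The idea looks roughly like the following sketch -- not the actual md 
raid1 read_balance() code, just an illustration of distance-based mirror 
selection, with made-up device names and head positions (md tracks 
something like a last-serviced position per device, IIRC, which is what 
head_position stands in for here):

#include <stdint.h>
#include <stdio.h>

struct mirror {
        const char *name;
        int in_sync;             /* 1 if this copy is usable */
        uint64_t head_position;  /* last sector this disk serviced */
};

/* Pick the in-sync mirror whose head is closest to the wanted sector. */
static int pick_read_mirror(const struct mirror *m, int n, uint64_t sector)
{
        int best = -1;
        uint64_t best_dist = UINT64_MAX;

        for (int i = 0; i < n; i++) {
                uint64_t dist;

                if (!m[i].in_sync)
                        continue;
                dist = m[i].head_position > sector ?
                       m[i].head_position - sector :
                       sector - m[i].head_position;
                if (dist < best_dist) {
                        best_dist = dist;
                        best = i;
                }
        }
        return best;    /* -1 means no readable copy at all */
}

int main(void)
{
        struct mirror disks[] = {
                { "sda", 1,  1000 },
                { "sdb", 1, 90000 },
                { "sdc", 1, 50000 },
                { "sdd", 0,     0 },    /* e.g. failed/missing */
        };
        uint64_t want = 48000;
        int idx = pick_read_mirror(disks, 4, want);

        if (idx >= 0)
                printf("read sector %llu from %s\n",
                       (unsigned long long)want, disks[idx].name);
        return 0;
}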

But whatever mdraid1 scheduling does, I was totally astonished at how 
efficient it was, and it really did turn my thinking on most efficient 
raid choices upside down.  So if btrfs could simply take that scheduler 
and modify it as necessary for btrfs specifics, provided the 
modifications weren't /too/ heavy (and the fact that btrfs does read-time 
checksum verification could very well mean the algorithm as directly 
adapted as possible may not reach anything like the same efficiency), I 
really do think that'd be the ideal.  And of course it's freedomware code 
in the same kernel, so reusing the mdraid read-scheduler shouldn't be the 
problem it might be in other circumstances, tho the possible caveat of 
btrfs specific implementation issues does remain.

And of course someone would have to take the time to adapt it to work 
with btrfs, which gets us back onto the practical side of things, the 
"opportunity rich, developer-time poor" situation that is btrfs coding 
reality, premature optimization, possibly doing it at the same time as N-
way-mirroring, etc.

But anyway, mdraid's raid1 read-scheduler really does seem to be 
impressively efficient, the benchmark to try to match, if possible.  If 
that can be done by reusing some of the same code, so much the better. 
=:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Recover btrfs volume which can only be mounted in read-only mode

2015-10-18 Thread Dmitry Katsubo
On 16/10/2015 10:18, Duncan wrote:
> Dmitry Katsubo posted on Thu, 15 Oct 2015 16:10:13 +0200 as excerpted:
> 
>> On 15 October 2015 at 02:48, Duncan <1i5t5.dun...@cox.net> wrote:
>>
>>> [snipped] 
>>
> >> Thanks for this information. As far as I can see, btrfs-tools v4.1.2 is
> >> now in the experimental Debian repo (but you anyway suggest at least 4.2.2,
> >> which was released in master git just 10 days ago). Kernel image 3.18 is
> >> still not there, perhaps because Debian jessie was frozen before it was
> >> released (2014-12-07).
> 
> For userspace, as long as it's supporting the features you need at 
> runtime (where it generally simply has to know how to make the call to 
> the kernel, to do the actual work), and you're not running into anything 
> really hairy that you're trying to offline-recover, which is where the 
> latest userspace code becomes critical...
> 
> Running a userspace series behind, or even more (as long as it's not 
> /too/ far), isn't all /that/ critical a problem.
> 
> It generally becomes a problem in one of three ways: 1) You have a bad 
> filesystem and want the best chance at fixing it, in which case you 
> really want the latest code, including the absolute latest fixups for the 
> most recently discovered possible problems. 2) You want/need a new 
> feature that's simply not supported in your old userspace.  3) The 
> userspace gets so old that the output from its diagnostics commands no 
> longer easily compares with that of current tools, giving people on-list 
> difficulties when trying to compare the output in your posts to the 
> output they get.
> 
> As a very general rule, at least try to keep the userspace version 
> comparable to the kernel version you are running.  Since the userspace 
> version numbering syncs to kernelspace version numbering, and userspace 
> of a particular version is normally released shortly after the similarly 
> numbered kernel series is released, with a couple minor updates before 
> the next kernel-series-synced release, keeping userspace to at least the 
> kernel space version, means you're at least running the userspace release 
> that was made with that kernel series release in mind.
> 
> Then, as long as you don't get too far behind on kernel version, you 
> should remain at least /somewhat/ current on userspace as well, since 
> you'll be upgrading to near the same userspace (at least), when you 
> upgrade the kernel.
> 
> Using that loose guideline, since you're aiming for the 3.18 stable 
> kernel, you should be running at least a 3.18 btrfs-progs as well.
> 
> In that context, btrfs-progs 4.1.2 should be fine, as long as you're not 
> trying to fix any problems that a newer version fixed.  And, my 
> recommendation of the latest 4.2.2 was in the "fixing problems" context, 
> in which case, yes, getting your hands on 4.2.2, even if it means 
> building from sources to do so, could be critical, depending of course on 
> the problem you're trying to fix.  But otherwise, 4.1.2, or even back to 
> the last 3.18.whatever release since that's the kernel version you're 
> targeting, should be fine.
> 
> Just be sure that whenever you do upgrade to later, you avoid the known-
> bad-mkfs.btrfs in 4.2.0 and/or 4.2.1 -- be sure if you're doing the btrfs-
> progs-4.2 series, that you get 4.2.2 or later.
> 
> As for finding a current 3.18 series kernel released for Debian, I'm not 
> a Debian user so my knowledge of the ecosystem around it is limited, 
> but I've been very much under the impression that there are various 
> optional repos available that you can choose to include and update from 
> as well, and I'm quite sure based on previous discussions with others 
> that there's a well recognized and fairly commonly enabled repo that 
> includes debian kernel updates thru current release, or close to it.
> 
> Of course you could also simply run a mainstream Linus kernel and build 
> it yourself, and it's not too horribly hard to do either, as there's all 
> sorts of places with instructions for doing so out there, and back when I 
> switched from MS to freedomware Linux in late 2001, I learned the skill, 
> at at least the reasonably basic level of mostly taking a working config 
> from my distro's kernel and using it as a basis for my mainstream kernel 
> config as well, within about two months of switching.
> 
> Tho of course just because you can doesn't mean you want to, and for 
> many, finding their distro's experimental/current kernel repos and simply 
> installing the packages from it, will be far simpler.
> 
> But regardless of the method used, finding or building and keeping 
> current with your own copy of at least the latest couple of LTS 
> releases, shouldn't be /horribly/ difficult.  While I've not used them as 
> actual package resources in years, I do still know a couple rpm-based 
> package resources from my time back on Mandrake (and do still check them 
> in contexts like this for others, or to quickly see what files a package 
> 

Re: Recover btrfs volume which can only be mounted in read-only mode

2015-10-16 Thread Duncan
Dmitry Katsubo posted on Thu, 15 Oct 2015 16:10:13 +0200 as excerpted:

> On 15 October 2015 at 02:48, Duncan <1i5t5.dun...@cox.net> wrote:
> 
>> [snipped] 
> 
> Thanks for this information. As far as I can see, btrfs-tools v4.1.2 is
> now in the experimental Debian repo (but you anyway suggest at least 4.2.2,
> which was released in master git just 10 days ago). Kernel image 3.18 is
> still not there, perhaps because Debian jessie was frozen before it was
> released (2014-12-07).

For userspace, as long as it's supporting the features you need at 
runtime (where it generally simply has to know how to make the call to 
the kernel, to do the actual work), and you're not running into anything 
really hairy that you're trying to offline-recover, which is where the 
latest userspace code becomes critical...

Running a userspace series behind, or even more (as long as it's not 
/too/ far), isn't all /that/ critical a problem.

It generally becomes a problem in one of three ways: 1) You have a bad 
filesystem and want the best chance at fixing it, in which case you 
really want the latest code, including the absolute latest fixups for the 
most recently discovered possible problems. 2) You want/need a new 
feature that's simply not supported in your old userspace.  3) The 
userspace gets so old that the output from its diagnostics commands no 
longer easily compares with that of current tools, giving people on-list 
difficulties when trying to compare the output in your posts to the 
output they get.

As a very general rule, at least try to keep the userspace version 
comparable to the kernel version you are running.  Since the userspace 
version numbering syncs to kernelspace version numbering, and userspace 
of a particular version is normally released shortly after the similarly 
numbered kernel series is released, with a couple minor updates before 
the next kernel-series-synced release, keeping userspace to at least the 
kernel space version, means you're at least running the userspace release 
that was made with that kernel series release in mind.

Then, as long as you don't get too far behind on kernel version, you 
should remain at least /somewhat/ current on userspace as well, since 
you'll be upgrading to near the same userspace (at least), when you 
upgrade the kernel.

Using that loose guideline, since you're aiming for the 3.18 stable 
kernel, you should be running at least a 3.18 btrfs-progs as well.

In that context, btrfs-progs 4.1.2 should be fine, as long as you're not 
trying to fix any problems that a newer version fixed.  And, my 
recommendation of the latest 4.2.2 was in the "fixing problems" context, 
in which case, yes, getting your hands on 4.2.2, even if it means 
building from sources to do so, could be critical, depending of course on 
the problem you're trying to fix.  But otherwise, 4.1.2, or even back to 
the last 3.18.whatever release since that's the kernel version you're 
targeting, should be fine.

Just be sure that whenever you do upgrade to later, you avoid the known-
bad-mkfs.btrfs in 4.2.0 and/or 4.2.1 -- be sure if you're doing the btrfs-
progs-4.2 series, that you get 4.2.2 or later.

As for finding a current 3.18 series kernel released for Debian, I'm not 
a Debian user so my knowledge of the ecosystem around it is limited, 
but I've been very much under the impression that there are various 
optional repos available that you can choose to include and update from 
as well, and I'm quite sure based on previous discussions with others 
that there's a well recognized and fairly commonly enabled repo that 
includes debian kernel updates thru current release, or close to it.

Of course you could also simply run a mainstream Linus kernel and build 
it yourself, and it's not too horribly hard to do either, as there's all 
sorts of places with instructions for doing so out there, and back when I 
switched from MS to freedomware Linux in late 2001, I learned the skill, 
at at least the reasonably basic level of mostly taking a working config 
from my distro's kernel and using it as a basis for my mainstream kernel 
config as well, within about two months of switching.

Tho of course just because you can doesn't mean you want to, and for 
many, finding their distro's experimental/current kernel repos and simply 
installing the packages from it, will be far simpler.

But regardless of the method used, finding or building and keeping 
current with your own copy of at least the latest couple of LTS 
releases, shouldn't be /horribly/ difficult.  While I've not used them as 
actual package resources in years, I do still know a couple rpm-based 
package resources from my time back on Mandrake (and do still check them 
in contexts like this for others, or to quickly see what files a package 
I don't have installed on gentoo might include, etc), and would point you 
at them if Debian was an rpm-based distro, but of course it's not, so 
they won't do any good.  But I'd guess a google 

Re: Recover btrfs volume which can only be mounted in read-only mode

2015-10-15 Thread Dmitry Katsubo
On 15 October 2015 at 02:48, Duncan <1i5t5.dun...@cox.net> wrote:
> Dmitry Katsubo posted on Wed, 14 Oct 2015 22:27:29 +0200 as excerpted:
>
>> On 14/10/2015 16:40, Anand Jain wrote:
 # mount -o degraded /var Oct 11 18:20:15 kernel: BTRFS: too many
 missing devices, writeable mount is not allowed

 # mount -o degraded,ro /var # btrfs device add /dev/sdd1 /var ERROR:
 error adding the device '/dev/sdd1' - Read-only file system

 Now I am stuck: I cannot add device to the volume to satisfy raid
 pre-requisite.
>>>
>>>  This is a known issue. Would you be able to test below set of patches
>>>  and update us..
>>>
>>>[PATCH 0/5] Btrfs: Per-chunk degradable check
>>
>> Many thanks for the reply. Unfortunately I have no environment to
>> recompile the kernel, and setting it up will perhaps take a day. Can the
>> latest kernel be pushed to Debian sid?

Duncan, many thanks for verbose answer. I appreciate a lot.

> In the way of general information...
>
> While btrfs is no longer entirely unstable (since 3.12 when the
> experimental tag was removed) and kernel patch backports are generally
> done where stability is a factor, it's not yet fully stable and mature,
> either.  As such, an expectation of true stability, such as wishing to
> remain on kernels more than one LTS series behind the latest LTS kernel
> series (4.1, with 3.18 the one-LTS-series-back version), can be considered
> incompatible with wishing to run the still-under-heavy-development and
> not yet fully stable and mature btrfs, at least as soon as problems are
> reported.  A request to upgrade to current and/or to try various not yet
> mainline integrated patches is thus to be expected on report of problems.
>
> As for userspace, the division between btrfs kernel and userspace works
> like this:  Under normal operating conditions, userspace simply makes
> requests of the kernel, which does the actual work.  Thus, under normal
> conditions, updated kernel code is most important.  However, once a
> problem occurs and repair/recovery is attempted, it's generally userspace
> code itself directly operating on the unmounted filesystem, so having the
> latest userspace code fixes becomes most important once something has
> gone wrong and you're trying to fix it.
>
> So upgrading to a 3.18 series kernel, at minimum, is very strongly
> recommended for those running btrfs, with an expectation that an upgrade
> to 4.1 should be being planned and tested, for deployment as soon as it's
> passing on-site pre-deployment testing.  And an upgrade to current or
> close to current btrfs-progs 4.2.2 userspace is recommended as soon as
> you need its features, which include the latest patches for repair and
> recovery, so as soon as you have a filesystem that's not working as
> expected, if not before.  (Note that earlier btrfs-progs 4.2 releases,
> before 4.2.2, had a buggy mkfs.btrfs, so they should be skipped if you
> will be doing mkfs.btrfs with them, and any btrfs created with those
> versions should have what's on them backed up if it's not already, and
> the filesystems recreated with 4.2.2, as they'll be unstable and are
> subject to failure.)

Thanks for this information. As far as I can see, btrfs-tools v4.1.2
is now in the experimental Debian repo (but you anyway suggest at least
4.2.2, which was released in master git just 10 days ago). Kernel image
3.18 is still not there, perhaps because Debian jessie was frozen
before it was released (2014-12-07).

>> 1. Is there any way to recover btrfs at the moment? Or the easiest I can
>> do is to mount ro, copy all data to another drive, re-create btrfs
>> volume and copy back?
>
> Sysadmin's rule of backups:  If data isn't backed up, by definition you
> value the data less than the cost of time/hassle/resources to do the
> backup, so loss of a filesystem is never a big problem, because if the
> data was of any value, it was backed up and can be restored from that
> backup, and if it wasn't backed up, then by definition you have already
> saved the more important to you commodity, the hassle/time/resources you
> would have spent doing the backup.  Therefore, loss of a filesystem is
> loss of throw-away data in any case, either because it was backed up (and
> a would-be backup that hasn't been tested restorable isn't yet a
> completed backup, so doesn't count), or because the data really was throw-
> away data, not worth the hassle of backing up in the first place, even at
> risk of loss should the un-backed-up data be lost.
>
> No exceptions.  Any after-the-fact protests to the contrary simply put
> the lie to claims that the data was considered valuable, since actions
> spoke louder than words and actions defined the data as throw-away.
>
> Therefore, no worries.  Worst-case, you either recover the data from
> backup, or if it wasn't backed up, by definition, it wasn't valuable data
> in the first place.  Either way, no valuable data was, or can be, lost.
>
> (It's worth noting 

Re: Recover btrfs volume which can only be mounted in read-only mode

2015-10-15 Thread Hugo Mills
On Thu, Oct 15, 2015 at 04:10:13PM +0200, Dmitry Katsubo wrote:
[snip]
> If I may ask:
> 
> Provided that btrfs allowed a volume to be mounted in read-only mode – does
> it mean that all data blocks are present (e.g. has it assured that all
> files / directories can be read)?
> 
> Do you have any ideas why "btrfs balance" has pulled all data to two
> drives (and not balanced between three)?

   If you're using a non-striped RAID level (single, 1), btrfs will
start by filling up the largest devices first: balance attempts to
make the free space equal across the devices, not to make the used
space equal.

   If you're using a striped RAID level (0, 5, 6), then the FS will
fill up the devices equally, until one is full, and then switch to
using the remaining devices (until one is full, etc).
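
   As a toy illustration of those two fill orders (not btrfs code; the
40/20/10 GiB free-space figures are made-up examples), the non-striped
case repeatedly puts a chunk on the device with the most free space,
while the striped case takes a stripe member from every device that
still has room:

#include <stdio.h>

#define NDEV 3

/* single profile: 1 GiB chunk on the device with the most free space
 * (raid1 would pick the top two devices instead of just one) */
static void allocate_single(long avail[NDEV])
{
        int best = 0;

        for (int i = 1; i < NDEV; i++)
                if (avail[i] > avail[best])
                        best = i;
        if (avail[best] > 0)
                avail[best]--;
}

/* raid0/5/6 style: one stripe member on every device that still has room */
static void allocate_striped(long avail[NDEV])
{
        for (int i = 0; i < NDEV; i++)
                if (avail[i] > 0)
                        avail[i]--;
}

int main(void)
{
        long single_fs[NDEV]  = { 40, 20, 10 };
        long striped_fs[NDEV] = { 40, 20, 10 };

        for (int step = 0; step < 15; step++) {
                allocate_single(single_fs);
                allocate_striped(striped_fs);
        }
        printf("single  after 15 chunks: %ld/%ld/%ld GiB free\n",
               single_fs[0], single_fs[1], single_fs[2]);
        printf("striped after 15 steps : %ld/%ld/%ld GiB free\n",
               striped_fs[0], striped_fs[1], striped_fs[2]);
        return 0;
}

   Starting from 40/20/10 GiB free, the single case ends at 25/20/10
(free space converging towards equal), while the striped case ends at
25/5/0 (the smallest device exhausted first).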

> Does btrfs have the following optimization for mirrored data: if a drive
> is non-rotational, then prefer reads from it? Or does it simply schedule
> the read to the drive that performs faster (irrespective of rotational
> status)?

   No, it'll read arbitrarily from the available devices at the moment.

   Hugo.

-- 
Hugo Mills | People are too unreliable to be replaced by
hugo@... carfax.org.uk | machines.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |  Nathan Spring, Star Cops




Recover btrfs volume which can only be mounted in read-only mode

2015-10-14 Thread Dmitry Katsubo
Dear btrfs community,

I am facing several problems regarding btrfs, and I will be very
thankful if someone can help me with them. Also, while playing with btrfs
I have a few suggestions – it would be nice if someone could comment on
those.

While starting the system, /var (which is a btrfs volume) failed to be
mounted. That btrfs volume was created with the following options:

# mkfs.btrfs -d raid1 -m raid1 /dev/sdc2 /dev/sda /dev/sdd1

Here is what was recorded in the systemd journal during startup:

[2.931097] BTRFS: device fsid 57b828ee-5984-4f50-89ff-4c9be0fd3084
devid 2 transid 394288 /dev/sda
[9.810439] BTRFS: device fsid 57b828ee-5984-4f50-89ff-4c9be0fd3084
devid 1 transid 394288 /dev/sdc2
Oct 11 13:00:22 systemd[1]: Job
dev-disk-by\x2duuid-57b828ee\x2d5984\x2d4f50\x2d89ff\x2d4c9be0fd3084.device/start
timed out.
Oct 11 13:00:22 systemd[1]: Timed out waiting for device
dev-disk-by\x2duuid-57b828ee\x2d5984\x2d4f50\x2d89ff\x2d4c9be0fd3084.device.

After the system started on runlevel 1, I attempted to mount the filesystem:

# mount /var
Oct 11 13:53:55 kernel: BTRFS info (device sdc2): disk space caching is enabled
Oct 11 13:53:55 kernel: BTRFS: failed to read chunk tree on sdc2
Oct 11 13:53:55 kernel: BTRFS: open_ctree failed

When I googled for "failed to read chunk tree", the feedback was that
something really bad is happening and it's time to restore the data /
give up on btrfs. In fact, this message is misleading because it
refers to /dev/sdc2, which is the mount device in fstab, but this is an
SSD drive, so it is very unlikely to cause a "read" error. Literally I read
the message as "BTRFS: tried to read something from sdc2 and failed".
Maybe it would be better to re-phrase the message as "failed to construct
chunk tree on /var (sdc2,sda,sdd1)"?

Next I did a check:

# btrfs check /dev/sdc2
warning devid 3 not found already
checking extents
checking free space cache
Error reading 36818145280, -1
checking fs roots
checking csums
checking root refs
Checking filesystem on /dev/sdc2
UUID: 57b828ee-5984-4f50-89ff-4c9be0fd3084
failed to load free space cache for block group 36536582144
found 29602081783 bytes used err is 0
total csum bytes: 57681304
total tree bytes: 1047363584
total fs tree bytes: 843694080
total extent tree bytes: 121159680
btree space waste bytes: 207443742
file data blocks allocated: 4524416
 referenced 60893913088

The message "devid 3 not found already" does not tell much to me. If I
understand correctly, btrfs does not store the list of devices in the
metadata, but maybe it would be a good idea to save the last seen
information about devices so that I would not need to guess what
"devid 3" means?

Next I tried to list all devices in my btrfs volume. I found this is
not possible (unless the volume is mounted). It would be nice if "btrfs
device scan" printed the detected volumes / devices to stdout (e.g.
with a "-v" option), or if there were another way to do that.

Then I mounted the volume in degraded mode, and only after that could I
understand what the error message means:

# mount /var -o degraded
# btrfs device stats /var
[/dev/sdc2].write_io_errs   0
[/dev/sdc2].read_io_errs0
[/dev/sdc2].flush_io_errs   0
[/dev/sdc2].corruption_errs 0
[/dev/sdc2].generation_errs 0
[/dev/sda].write_io_errs   0
[/dev/sda].read_io_errs0
[/dev/sda].flush_io_errs   0
[/dev/sda].corruption_errs 0
[/dev/sda].generation_errs 0
[].write_io_errs   3160958
[].read_io_errs0
[].flush_io_errs   0
[].corruption_errs 0
[].generation_errs 0

Now I can see that the device with devid 3 is actually /dev/sdd1,
which btrfs found not ready. Is it possible to improve btrfs output
and to list "last seen device" in that output, e.g.

[/dev/sdd1*].write_io_errs   3160958
[/dev/sdd1*].read_io_errs0
...

where "*" means that device is missing.

I have listed all partitions and /dev/sdd1 was among them. I have also run

# badblocks /dev/sdd

and it found no bad blocks. Why btrfs considers the device "not ready"
remains a question.

Afterwards I decided to run a scrub:

# btrfs scrub start /var
# btrfs scrub status /var
scrub status for 57b828ee-5984-4f50-89ff-4c9be0fd3084
scrub started at Sun Oct 11 14:55:45 2015 and was aborted after 1365 seconds
total bytes scrubbed: 89.52GiB with 0 errors

I have noticed that btrfs always reports "was aborted after X
seconds", even while the scrub is still running (I checked that X and the
number of bytes scrubbed keep increasing). That is confusing. After the
scrub finished, I have no idea whether it scrubbed everything or was
really aborted. And if it was aborted, what was the reason? Also, it would
be nice if the status displayed the number of data bytes scrubbed (without
replicas), because the number 89.52GiB includes all replicas (of raid1
in my case):

total bytes scrubbed: 89.52GiB (data 55.03GiB, system 16.00KiB,
metadata 998.83MiB) with 0 errors

Then I can compare this number with "filesystem df" output to answer
the question: was all data successfully scrubbed?

# btrfs 

Re: Recover btrfs volume which can only be mounted in read-only mode

2015-10-14 Thread Dmitry Katsubo
On 14/10/2015 16:40, Anand Jain wrote:
>> # mount -o degraded /var
>> Oct 11 18:20:15 kernel: BTRFS: too many missing devices, writeable
>> mount is not allowed
>>
>> # mount -o degraded,ro /var
>> # btrfs device add /dev/sdd1 /var
>> ERROR: error adding the device '/dev/sdd1' - Read-only file system
>>
>> Now I am stuck: I cannot add device to the volume to satisfy raid
>> pre-requisite.
> 
>  This is a known issue. Would you be able to test below set of patches
>  and update us..
> 
>[PATCH 0/5] Btrfs: Per-chunk degradable check

Many thanks for the reply. Unfortunately I have no environment to
recompile the kernel, and setting it up will perhaps take a day. Can the
latest kernel be pushed to Debian sid?

1. Is there any way to recover btrfs at the moment? Or the easiest I can
do is to mount ro, copy all data to another drive, re-create btrfs
volume and copy back?

2. How to avoid such a trap in the future?

3. How can I know what version of kernel the patch "Per-chunk degradable
check" is targeting?

4. What is the best way to express/vote for new features or suggestions
(wikipage "Project_ideas" / bugzilla)?

Thanks!

-- 
With best regards,
Dmitry


Re: Recover btrfs volume which can only be mounted in read-only mode

2015-10-14 Thread Duncan
Dmitry Katsubo posted on Wed, 14 Oct 2015 22:27:29 +0200 as excerpted:

> On 14/10/2015 16:40, Anand Jain wrote:
>>> # mount -o degraded /var Oct 11 18:20:15 kernel: BTRFS: too many
>>> missing devices, writeable mount is not allowed
>>>
>>> # mount -o degraded,ro /var # btrfs device add /dev/sdd1 /var ERROR:
>>> error adding the device '/dev/sdd1' - Read-only file system
>>>
>>> Now I am stuck: I cannot add device to the volume to satisfy raid
>>> pre-requisite.
>> 
>>  This is a known issue. Would you be able to test below set of patches
>>  and update us..
>> 
>>[PATCH 0/5] Btrfs: Per-chunk degradable check
> 
> Many thanks for the reply. Unfortunately I have no environment to
> recompile the kernel, and setting it up will perhaps take a day. Can the
> latest kernel be pushed to Debian sid?

In the way of general information...

While btrfs is no longer entirely unstable (since 3.12 when the 
experimental tag was removed) and kernel patch backports are generally 
done where stability is a factor, it's not yet fully stable and mature, 
either.  As such, an expectation of true stability, such as wishing to 
remain on kernels more than one LTS series behind the latest LTS kernel 
series (4.1, with 3.18 the one-LTS-series-back version), can be considered 
incompatible with wishing to run the still-under-heavy-development and 
not yet fully stable and mature btrfs, at least as soon as problems are 
reported.  A request to upgrade to current and/or to try various not yet 
mainline integrated patches is thus to be expected on report of problems.

As for userspace, the division between btrfs kernel and userspace works 
like this:  Under normal operating conditions, userspace simply makes 
requests of the kernel, which does the actual work.  Thus, under normal 
conditions, updated kernel code is most important.  However, once a 
problem occurs and repair/recovery is attempted, it's generally userspace 
code itself directly operating on the unmounted filesystem, so having the 
latest userspace code fixes becomes most important once something has 
gone wrong and you're trying to fix it.

So upgrading to a 3.18 series kernel, at minimum, is very strongly 
recommended for those running btrfs, with an expectation that an upgrade 
to 4.1 should be being planned and tested, for deployment as soon as it's 
passing on-site pre-deployment testing.  And an upgrade to current or 
close to current btrfs-progs 4.2.2 userspace is recommended as soon as 
you need its features, which include the latest patches for repair and 
recovery, so as soon as you have a filesystem that's not working as 
expected, if not before.  (Note that earlier btrfs-progs 4.2 releases, 
before 4.2.2, had a buggy mkfs.btrfs, so they should be skipped if you 
will be doing mkfs.btrfs with them, and any btrfs created with those 
versions should have what's on them backed up if it's not already, and 
the filesystems recreated with 4.2.2, as they'll be unstable and are 
subject to failure.)

> 1. Is there any way to recover btrfs at the moment? Or the easiest I can
> do is to mount ro, copy all data to another drive, re-create btrfs
> volume and copy back?

Sysadmin's rule of backups:  If data isn't backed up, by definition you 
value the data less than the cost of time/hassle/resources to do the 
backup, so loss of a filesystem is never a big problem, because if the 
data was of any value, it was backed up and can be restored from that 
backup, and if it wasn't backed up, then by definition you have already 
saved the more important to you commodity, the hassle/time/resources you 
would have spent doing the backup.  Therefore, loss of a filesystem is 
loss of throw-away data in any case, either because it was backed up (and 
a would-be backup that hasn't been tested restorable isn't yet a 
completed backup, so doesn't count), or because the data really was throw-
away data, not worth the hassle of backing up in the first place, even at 
risk of loss should the un-backed-up data be lost.

No exceptions.  Any after-the-fact protests to the contrary simply put 
the lie to claims that the data was considered valuable, since actions 
spoke louder than words and actions defined the data as throw-away.

Therefore, no worries.  Worst-case, you either recover the data from 
backup, or if it wasn't backed up, by definition, it wasn't valuable data 
in the first place.  Either way, no valuable data was, or can be, lost.

(It's worth noting that this rule nicely takes care of the loss of both 
the working copy and N'th backup case, as well, since again, either it 
was worth the cost of N+1 levels of backup, or that N+1 backup wasn't 
made, which automatically defines the data as not worth the cost of 
the N+1 backup, at least relative to the risk factor that it might 
actually be needed.  That remains the case, regardless of whether N=0 or 
N=10^1000, since even at N=10^1000, backup to level N+1 is either worth 
the cost vs. risk -- the data really is THAT