Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-09 Thread Austin S. Hemmelgarn

On 2017-03-09 04:49, Peter Grandi wrote:

Consider the common case of a 3-member volume with a 'raid1'
target profile: if the sysadm thinks that a drive should be
replaced, the goal is to take it out *without* converting every
chunk to 'single', because with 2-out-of-3 devices half of the
chunks will still be fully mirrored.



Also, removing the device to be replaced should really not be
the same thing as balancing the chunks, if there is space, to be
'raid1' across remaining drives, because that's a completely
different operation.



There is a command specifically for replacing devices.  It
operates very differently from the add+delete or delete+add
sequences. [ ... ]


Perhaps it was not clear that I was talking about removing a
device, as distinct from replacing it, and that I used "removed"
instead of "deleted" deliberately, to avoid the confusion with
the 'delete' command.

Ah, sorry I misunderstood what you were saying.


In the everyday practice of system administration it often
happens that a device should be removed first, and replaced
later, for example when it is suspected to be faulty, or is
intermittently faulty. The replacement can be done with
'replace' or 'add+delete' or 'delete+add', but that's a
different matter.

Perhaps I should not have used the generic verb "remove",
but written "make unavailable".

This brings about again the topic of some "confusion" in the
design of the Btrfs multidevice handling logic, where at least
initially one could only expand the storage space of a
multidevice by 'add' of a new device or shrink the storage space
by 'delete' of an existing one, but I think it was not conceived
at Btrfs design time of storage space being nominally constant
but for a device (and the chunks on it) having a state of
"available" ("present", "online", "enabled") or "unavailable"
("absent", "offline", "disabled"), either because of events or
because of system administrator action.

The 'missing' pseudo-device designator was added later, and
'replace' also later to avoid having to first expand then shrink
(or vice versa) the storage space and the related copying.

My impression is that it would be less "confused" if the Btrfs
device handling logic were changed to allow for the state of
"member of the multidevice set but not actually available" and
the related consequent state for chunks that ought to be on it;
that probably would be essential to fixing the confusing current
aspects of recovery in a multidevice set. That would be very
useful even if it may require a change in the on-disk format to
distinguish the distinct states of membership and availability
for devices and mark chunks as available or not (chunks of course
being only possible on member devices).

That is, it would also be nice to have the opposite state of "not
member of the multidevice set but actually available to it", that
is a spare device, and related logic.
OK, so expanding on this a bit, there are currently three functional 
device states in BTRFS (note that the terms I use here aren't 
official, they're just what I use to describe them):
1. Active/Online.  This is the normal state for a device: you can both 
read from it and write to it.
2. Inactive/Replacing/Deleting.  This is the state a device is in when 
it's either being deleted or replaced.  Inactive devices don't count 
towards total volume size, and can't be written to, but can be read from 
if they weren't missing prior to becoming inactive.
3. Missing/Offline.  This is pretty self-explanatory.  A device in this 
state can't be read from or written to, but it does count towards volume 
size.


Currently, the only transitions available to a sysadmin through BTRFS 
itself are temporary transitions from Active to Inactive (replace and 
delete).


In an ideal situation, there would be two other states:
4. Local hot-spare/Nearline.  Won't be read from and doesn't count 
towards total volume size, but may be written to (depending on how the 
FS is configured), and will be automatically used to replace a failed 
device in the filesystem it's associated with.
5. Global hot-spare.  Similar to local hot-spare, but can be used for 
any filesystem on the system, and won't be touched until it's needed.


The following manually initiated transitions would be possible for 
regular operation:

1. Active -> Inactive (persistently)
2. Inactive -> Active
3. Active -> Local hot-spare
4. Inactive -> Local hot-spare
5. Local hot-spare -> Active
6. Local hot-spare -> Inactive
7. Global hot-spare -> Active
8. Global hot-spare -> Inactive
9. Local hot-spare -> Global hot-spare
10. Global hot-spare -> Local hot-spare

And the following automatic transitions would be possible:
1. Local hot-spare -> Active
2. Global hot-spare -> Active
3. Any state -> Missing
4. Missing -> the previous state

And there would be the option of manually triggering the automatic 
transitions for debugging purposes.
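
(Purely as an illustration, the manual-transition table above can be written
down as a tiny bash sketch; none of these states exist in btrfs today, and the
state names are just the informal ones used in this message:)

  #!/bin/bash
  # Hypothetical sketch of the proposed manual transitions; nothing here
  # corresponds to an existing btrfs interface.
  declare -A allowed=(
    [active]="inactive local-spare"
    [inactive]="active local-spare"
    [local-spare]="active inactive global-spare"
    [global-spare]="active inactive local-spare"
  )
  can_transition() {                # usage: can_transition <from> <to>
    local from=$1 to=$2
    [[ " ${allowed[$from]} " == *" $to "* ]]
  }
  can_transition active local-spare && echo "active -> local-spare: allowed"
  can_transition missing active     || echo "missing -> active: automatic only"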


Note: simply setting '/sys/block/$DEV/device/delete' is not a
good option, because that makes the device unavailable not just
to Btrfs, but also to the whole system. [ ... ]

Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-09 Thread Peter Grandi
>> Consider the common case of a 3-member volume with a 'raid1'
>> target profile: if the sysadm thinks that a drive should be
>> replaced, the goal is to take it out *without* converting every
>> chunk to 'single', because with 2-out-of-3 devices half of the
>> chunks will still be fully mirrored.

>> Also, removing the device to be replaced should really not be
>> the same thing as balancing the chunks, if there is space, to be
>> 'raid1' across remaining drives, because that's a completely
>> different operation.

> There is a command specifically for replacing devices.  It
> operates very differently from the add+delete or delete+add
> sequences. [ ... ]

Perhaps it was not clear that I was talking about removing a
device, as distinct from replacing it, and that I used "removed"
instead of "deleted" deliberately, to avoid the confusion with
the 'delete' command.

In the everyday practice of system administration it often
happens that a device should be removed first, and replaced
later, for example when it is suspected to be faulty, or is
intermittently faulty. The replacement can be done with
'replace' or 'add+delete' or 'delete+add', but that's a
different matter.

Perhaps I should not have used the generic verb "remove",
but written "make unavailable".

This brings about again the topic of some "confusion" in the
design of the Btrfs multidevice handling logic, where at least
initially one could only expand the storage space of a
multidevice by 'add' of a new device or shrink the storage space
by 'delete' of an existing one, but I think it was not conceived
at Btrfs design time of storage space being nominally constant
but for a device (and the chunks on it) having a state of
"available" ("present", "online", "enabled") or "unavailable"
("absent", "offline", "disabled"), either because of events or
because of system administrator action.

The 'missing' pseudo-device designator was added later, and
'replace' also later to avoid having to first expand then shrink
(or vice versa) the storage space and the related copying.

My impression is that it would be less "confused" if the Btrfs
device handling logic were changed to allow for the state of
"member of the multidevice set but not actually available" and
the related consequent state for chunks that ought to be on it;
that probably would be essential to fixing the confusing current
aspects of recovery in a multidevice set. That would be very
useful even if it may require a change in the on-disk format to
distinguish the distinct states of membership and availability
for devices and mark chunks as available or not (chunks of course
being only possible on member devices).

That is, it would also be nice to have the opposite state of "not
member of the multidevice set but actually available to it", that
is a spare device, and related logic.

Note: simply setting '/sys/block/$DEV/device/delete' is not a
good option, because that makes the device unavailable not just
to Btrfs, but also to the whole system. In the ordinary practice
of system administration it may well be useful to make a device
unavailable to Btrfs but still available to the system, for
example for testing, and anyhow they are logically distinct
states. That also means a member device might well be available
to the system, but marked as "not available" to Btrfs.
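
(For reference, the sysfs interface mentioned above looks like the following,
with sdb as a placeholder. It detaches the disk from the SCSI layer for the
whole system, which is exactly the objection being made; there is currently no
btrfs-only equivalent that marks a member device "unavailable" while leaving
it visible to the rest of the system.)

  # Detach a SCSI/SATA disk from the *entire* system, not just from Btrfs:
  echo 1 | sudo tee /sys/block/sdb/device/delete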


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-06 Thread Austin S. Hemmelgarn

On 2017-03-05 14:13, Peter Grandi wrote:

What makes me think that "unmirrored" 'raid1' profile chunks
are "not a thing" is that it is impossible to remove
explicitly a member device from a 'raid1' profile volume:
first one has to 'convert' to 'single', and then the 'remove'
copies back to the remaining devices the 'single' chunks that
are on the explicitly 'remove'd device. Which to me seems
absurd.



It is, there should be a way to do this as a single operation.
[ ... ] The reason this is currently the case though is a
simple one, 'btrfs device delete' is just a special instance
of balance [ ... ]  does no profile conversion, but having
that as an option would actually be _very_ useful from a data
safety perspective.


That seems to me an even more "confused" opinion: because
removing a device to make it "missing" and removing it
permanently should be very different operations.

Consider the common case of a 3-member volume with a 'raid1'
target profile: if the sysadm thinks that a drive should be
replaced, the goal is to take it out *without* converting every
chunk to 'single', because with 2-out-of-3 devices half of the
chunks will still be fully mirrored.

Also, removing the device to be replaced should really not be
the same thing as balancing the chunks, if there is space, to be
'raid1' across remaining drives, because that's a completely
different operation.
There is a command specifically for replacing devices.  It operates very 
differently from the add+delete or delete+add sequences.  Instead of 
balancing, it's more similar to LVM's pvmove command.  It redirects all 
new writes that would go to the old device to the new one, then copies 
all the data from the old to the new (while properly recreating damaged 
chunks).  It uses way less bandwidth than add+delete, runs faster, and 
is in general much safer because it moves less data around.  If you're 
just replacing devices, you should be using this, not the add and delete 
commands, which are more for reshaping arrays than repairing them.


Additionally, if you _have_ to use add and remove to replace a device, 
if possible, you should add the new device then delete the old one, not 
the other way around, as that avoids most of the issues other than the 
high load on the filesystem from the balance operation.
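
(For concreteness, a minimal sketch of the two approaches described above,
with /dev/sdb as the old device, /dev/sdc as the new one and /mnt as the mount
point, all placeholders:)

  # Preferred: replace in one step (similar to LVM's pvmove, as noted above)
  sudo btrfs replace start /dev/sdb /dev/sdc /mnt
  sudo btrfs replace status /mnt

  # Fallback: reshape with add+delete, adding the new device *first*
  sudo btrfs device add /dev/sdc /mnt
  sudo btrfs device delete /dev/sdb /mnt   # runs the balance-like relocation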



Going further in my speculation, I suspect that at the core of
the Btrfs multidevice design there is a persistent "confusion"
(to use a euphemism) between volumes having a profile, and
merely chunks having a profile.



There generally is.  The profile is entirely a property of the
chunks (each chunk literally has a bit of metadata that says
what profile it is), not the volume.  There's some metadata in
the volume somewhere that says what profile to use for new
chunks of each type (I think),


That's the "target" profile for the volume.


but that doesn't dictate what chunk profiles there are on the
volume. [ ... ]


But as that's the case then the current Btrfs logic for
determining whether a volume is degraded or not is quite
"confused" indeed.
Entirely agreed.  Currently, it checks the target profile, when it 
should be checking per-chunk.


Because suppose there is again the simple case of a 3-device
volume, where all existing chunks have 'raid1' profile and the
volume's target profile is also 'raid1' and one device has gone
offline: the volume cannot be said to be "degraded", unless a
full examination of all chunks is made. Because it can well
happen that in fact *none* of the chunks was mirrored to that
device, for example, however unlikely. And vice versa. Even with
3 devices some chunks may be temporarily "unmirrored" (even if
for brief times hopefully).

The average case is that half of the chunks will be fully
mirrored across the two remaining devices and half will be
"unmirrored".

Now consider re-adding the third device: at that point the
volume has got back all 3 devices, so it is not "degraded", but
50% of the chunks in the volume will still be "unmirrored", even
if eventually they will be mirrored on the newly added device.

Note: possibilities get even more interesting with a 4-device
volume with 'raid1' profile chunks, and similar case involving
other profiles than 'raid1'.

Therefore the current Btrfs logic for deciding whether a volume
is "degraded" seems simply "confused" to me, because whether
there are missing devices and some chunks are "unmirrored" is
not quite the same thing.

The same applies to the current logic that in a 2-device volume
with a device missing new chunks are created as "single" profile
instead of as "unmirrored" 'raid1' profile: another example of
"confusion" between number of devices and chunk profile.

Note: the best that can be said is that a volume has both a
"target chunk profile" (one per data, metadata, system chunks)
and a target number of member devices, and that a volume with a
number of devices below the target *might* be degraded, and that
whether a volume is in fact degraded is not either/or, but given
by the percentage of chunks or stripes that are degraded. [ ... ]

Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-06 Thread Austin S. Hemmelgarn

On 2017-03-03 15:10, Kai Krakow wrote:

On Fri, 3 Mar 2017 07:19:06 -0500, "Austin S. Hemmelgarn" wrote:


On 2017-03-03 00:56, Kai Krakow wrote:

On Thu, 2 Mar 2017 11:37:53 +0100, Adam Borowski wrote:


On Wed, Mar 01, 2017 at 05:30:37PM -0700, Chris Murphy wrote:

 [...]


Well, there's Qu's patch at:
https://www.spinics.net/lists/linux-btrfs/msg47283.html
but it doesn't apply cleanly nor is easy to rebase to current
kernels.

 [...]


Well, yeah.  The current check is naive and wrong.  It does have a
purpose, just fails in this, very common, case.


I guess the reasoning behind this is: Creating any more chunks on
this drive will make raid1 chunks with only one copy. Adding
another drive later will not replay the copies without user
interaction. Is that true?

If yes, this may leave you with a mixed case of having a raid1 drive
with some chunks not mirrored and some mirrored. When the other
drive goes missing later, you are losing data or even the whole
filesystem although you were left with the (wrong) impression of
having a mirrored drive setup...

Is this how it works?

If yes, a real patch would also need to replay the missing copies
after adding a new drive.


The problem is that that would use some serious disk bandwidth
without user intervention.  The way from userspace to fix this is to
scrub the FS.  It would essentially be the same from kernel space,
which means that if you had a multi-TB FS and this happened, you'd be
running at below capacity in terms of bandwidth for quite some time.
If this were to be implemented, it would have to be keyed off of the
per-chunk degraded check (so that _only_ the chunks that need it get
touched), and there would need to be a switch to disable it.


Well, I'd expect that a replaced drive would involve reduced bandwidth
for a while. Every traditional RAID does this. The key solution there
is that you can limit bandwidth and/or define priorities (BG rebuild
rate).

Btrfs OTOH could be a lot smarter, only rebuilding chunks that are
affected. The kernel can already do IO priorities and some sort of
bandwidth limiting should also be possible. I think IO throttling is
already implemented in the kernel somewhere (at least with 4.10) and
also in btrfs. So the basics are there.
I/O prioritization in Linux is crap right now.  Only one scheduler 
properly supports it, and that scheduler is deprecated, not to mention 
that it didn't work reliably to begin with.  There is a bandwidth 
limiting mechanism in place, but that's for userspace stuff, not kernel 
stuff (which is why scrub is such an issue, the actual I/O is done by 
the kernel, not userspace).


In a RAID setup, performance should never have priority over redundancy
by default.

If performance is an important factor, I suggest working with SSD
writeback caches. This is already possible with different kernel
techniques like mdcache or bcache. Proper hardware controllers also
support this in hardware. It's cheap to have a mirrored SSD
writeback cache of 1TB or so if your setup already contains a multiple
terabytes array. Such a setup has huge performance benefits in setups
we deploy (tho, not btrfs related).

Also, adding/replacing a drive is usually not a totally unplanned
event. Except for hot spares, a missing drive will be replaced at the
time you arrive on-site. If performance is a factor, this can be done
the same time as manually starting the process. So why should it not be
done automatically?
You're already going to be involved because you can't (from a practical 
perspective) automate the physical device replacement, so all that 
making it automatic does is make things more convenient.  In general, if 
you're concerned enough to be using a RAID array, you probably shouldn't 
be trading convenience for data safety, and as of right now, BTRFS isn't 
mature enough that it could be said to be consistently safe to automate 
almost anything.


There are plenty of other reasons for it to not be automatic though, the 
biggest being that it will waste bandwidth (and therefore time) if you 
plan to convert profiles after adding the device.  That said, it would 
be nice to have a switch for the add command to automatically re-balance 
the array.



Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-05 Thread Peter Grandi
[ ... on the difference between number of devices and length of
a chunk-stripe ... ]

> Note: possibilities get even more interesting with a 4-device
> volume with 'raid1' profile chunks, and similar case involving
> other profiles than 'raid1'.

Consider for example a 4-device volume with 2 devices abruptly
missing: if 2-length 'raid1' chunk-stripes have been uniformly
laid across devices, then some chunk-stripes will be completely
missing (where both chunks in the stripe were on the 2 missing
devices), some will be 1-length, and some will be 2-length.

What to do when devices are missing?

One possibility is to simply require mount with the 'degraded'
option, by default read-only, but allowing read-write, simply as
a way to ensure the sysadm knows that some metadata/data *may* not
be redundant or *may* even be unavailable (if the chunk-stripe
length is less than the minimum to reconstruct the data).

Then attempts to read unavailable metadata or data would return
an error like a checksum violation without redundancy,
dynamically (when the application or 'balance' or 'scrub'
attempt to read the unavailable data).
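
(For concreteness, the current behaviour being discussed looks like this,
using the same commands shown elsewhere in the thread; /dev/sdb and /mnt are
placeholders:)

  sudo mount -o degraded,ro /dev/sdb /mnt  # read-only degraded mount is allowed
  sudo btrfs fi df /mnt                    # shows which chunk profiles exist
  sudo umount /mnt
  sudo mount -o degraded /dev/sdb /mnt     # read-write is refused once single
                                           # chunks exist ("missing devices (1)
                                           # exceeds the limit (0)" in dmesg)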


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-05 Thread Peter Grandi
>> What makes me think that "unmirrored" 'raid1' profile chunks
>> are "not a thing" is that it is impossible to remove
>> explicitly a member device from a 'raid1' profile volume:
>> first one has to 'convert' to 'single', and then the 'remove'
>> copies back to the remaining devices the 'single' chunks that
>> are on the explicitly 'remove'd device. Which to me seems
>> absurd.

> It is, there should be a way to do this as a single operation.
> [ ... ] The reason this is currently the case though is a
> simple one, 'btrfs device delete' is just a special instance
> of balance [ ... ]  does no profile conversion, but having
> that as an option would actually be _very_ useful from a data
> safety perspective.

That seems to me an even more "confused" opinion: because
removing a device to make it "missing" and removing it
permanently should be very different operations.

Consider the common case of a 3-member volume with a 'raid1'
target profile: if the sysadm thinks that a drive should be
replaced, the goal is to take it out *without* converting every
chunk to 'single', because with 2-out-of-3 devices half of the
chunks will still be fully mirrored.

Also, removing the device to be replaced should really not be
the same thing as balancing the chunks, if there is space, to be
'raid1' across remaining drives, because that's a completely
different operation.

>> Going further in my speculation, I suspect that at the core of
>> the Btrfs multidevice design there is a persistent "confusion"
>> (to use a euphemism) between volumes having a profile, and
>> merely chunks having a profile.

> There generally is.  The profile is entirely a property of the
> chunks (each chunk literally has a bit of metadata that says
> what profile it is), not the volume.  There's some metadata in
> the volume somewhere that says what profile to use for new
> chunks of each type (I think),

That's the "target" profile for the volume.

> but that doesn't dictate what chunk profiles there are on the
> volume. [ ... ]

But as that's the case then the current Btrfs logic for
determining whether a volume is degraded or not is quite
"confused" indeed.

Because suppose there is again the simple case of a 3-device
volume, where all existing chunks have 'raid1' profile and the
volume's target profile is also 'raid1' and one device has gone
offline: the volume cannot be said to be "degraded", unless a
full examination of all chunks is made. Because it can well
happen that in fact *none* of the chunks was mirrored to that
device, for example, however unlikely. And vice versa. Even with
3 devices some chunks may be temporarily "unmirrored" (even if
for brief times hopefully).

The average case is that half of the chunks will be fully
mirrored across the two remaining devices and half will be
"unmirrored".

Now consider re-adding the third device: at that point the
volume has got back all 3 devices, so it is not "degraded", but
50% of the chunks in the volume will still be "unmirrored", even
if eventually they will be mirrored on the newly added device.

Note: possibilities get even more interesting with a 4-device
volume with 'raid1' profile chunks, and similar case involving
other profiles than 'raid1'.

Therefore the current Btrfs logic for deciding whether a volume
is "degraded" seems simply "confused" to me, because whether
there are missing devices and some chunks are "unmirrored" is
not quite the same thing.

The same applies to the current logic that in a 2-device volume
with a device missing new chunks are created as "single" profile
instead of as "unmirrored" 'raid1' profile: another example of
"confusion" between number of devices and chunk profile.

Note: the best that can be said is that a volume has both a
"target chunk profile" (one per data, metadata, system chunks)
and a target number of member devices, and that a volume with a
number of devices below the target *might* be degraded, and that
whether a volume is in fact degraded is not either/or, but given
by the percentage of chunks or stripes that are degraded. This
is especially made clear by the 'raid1' case where the chunk
stripe length is always 2, but the number of target devices can
be greater than 2. Management of devices and management of
stripes are in Btrfs, unlike conventional RAID like Linux MD,
rather different operations needing rather different, if
related, logic.

My impression is that because of "confusion" between number of
devices in a volume and status of chunk profile there are some
"surprising" behaviors in Btrfs, and that will take quite a bit
to fix, most importantly for the Btrfs developer team to clear
among themselves the semantics attaching to both. After 10 years
of development that seems the right thing to do :-).
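
(As a practical aside, the per-profile allocation that makes this mixed state
visible can already be inspected with the existing tools; /mnt is a
placeholder:)

  sudo btrfs filesystem df /mnt     # totals per chunk type and profile
  sudo btrfs filesystem usage /mnt  # the same, broken down per member device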


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-04 Thread waxhead

Chris Murphy wrote:

On Thu, Mar 2, 2017 at 6:48 PM, Chris Murphy  wrote:


Again, my data is fine. The problem I'm having is this:
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/Documentation/filesystems/btrfs.txt?id=refs/tags/v4.10.1

Which says in the first line, in part, "focusing on fault tolerance,
repair and easy administration" and quite frankly this sort of
enduring bug in this file system that's nearly 10 years old now
renders that description misleading, and possibly dishonest. How do we describe this
file system as focusing on fault tolerance when, in the identical
scenario using mdadm or LVM raid, the user's data is not mishandled
like it is on Btrfs with multiple devices?


I think until these problems are fixed, the Btrfs status page should
describe RAID 1 and 10 as mostly OK, with this problem as the reason
for it not being OK.


I took the liberty of changing the status page...


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-03 Thread Chris Murphy
On Thu, Mar 2, 2017 at 6:48 PM, Chris Murphy  wrote:

>
> Again, my data is fine. The problem I'm having is this:
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/Documentation/filesystems/btrfs.txt?id=refs/tags/v4.10.1
>
> Which says in the first line, in part, "focusing on fault tolerance,
> repair and easy administration" and quite frankly this sort of
> enduring bug in this file system that's nearly 10 years old now
> renders that description misleading, and possibly dishonest. How do we describe this
> file system as focusing on fault tolerance when, in the identical
> scenario using mdadm or LVM raid, the user's data is not mishandled
> like it is on Btrfs with multiple devices?


I think until these problems are fixed, the Btrfs status page should
describe RAID 1 and 10 as mostly OK, with this problem as the reason
for it not being OK.

-- 
Chris Murphy


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-03 Thread Kai Krakow
On Fri, 3 Mar 2017 07:19:06 -0500, "Austin S. Hemmelgarn" wrote:

> On 2017-03-03 00:56, Kai Krakow wrote:
> > On Thu, 2 Mar 2017 11:37:53 +0100, Adam Borowski wrote:
> >  
> >> On Wed, Mar 01, 2017 at 05:30:37PM -0700, Chris Murphy wrote:  
>  [...]  
> >>
> >> Well, there's Qu's patch at:
> >> https://www.spinics.net/lists/linux-btrfs/msg47283.html
> >> but it doesn't apply cleanly nor is easy to rebase to current
> >> kernels. 
>  [...]  
> >>
> >> Well, yeah.  The current check is naive and wrong.  It does have a
> >> purpose, just fails in this, very common, case.  
> >
> > I guess the reasoning behind this is: Creating any more chunks on
> > this drive will make raid1 chunks with only one copy. Adding
> > another drive later will not replay the copies without user
> > interaction. Is that true?
> >
> > If yes, this may leave you with a mixed case of having a raid1 drive
> > with some chunks not mirrored and some mirrored. When the other
> > drive goes missing later, you are losing data or even the whole
> > filesystem although you were left with the (wrong) impression of
> > having a mirrored drive setup...
> >
> > Is this how it works?
> >
> > If yes, a real patch would also need to replay the missing copies
> > after adding a new drive.
> >  
> The problem is that that would use some serious disk bandwidth
> without user intervention.  The way from userspace to fix this is to
> scrub the FS.  It would essentially be the same from kernel space,
> which means that if you had a multi-TB FS and this happened, you'd be
> running at below capacity in terms of bandwidth for quite some time.
> If this were to be implemented, it would have to be keyed off of the
> per-chunk degraded check (so that _only_ the chunks that need it get
> touched), and there would need to be a switch to disable it.

Well, I'd expect that a replaced drive would involve reduced bandwidth
for a while. Every traditional RAID does this. The key solution there
is that you can limit bandwidth and/or define priorities (BG rebuild
rate).

Btrfs OTOH could be a lot smarter, only rebuilding chunks that are
affected. The kernel can already do IO priorities and some sort of
bandwidth limiting should also be possible. I think IO throttling is
already implemented in the kernel somewhere (at least with 4.10) and
also in btrfs. So the basics are there.

In a RAID setup, performance should never have priority over redundancy
by default.

If performance is an important factor, I suggest working with SSD
writeback caches. This is already possible with different kernel
techniques like mdcache or bcache. Proper hardware controllers also
support this in hardware. It's cheap to have a mirrored SSD
writeback cache of 1TB or so if your setup already contains a multiple
terabytes array. Such a setup has huge performance benefits in setups
we deploy (tho, not btrfs related).

Also, adding/replacing a drive is usually not a totally unplanned
event. Except for hot spares, a missing drive will be replaced at the
time you arrive on-site. If performance is a factor, this can be done
the same time as manually starting the process. So why should it not be
done automatically?

-- 
Regards,
Kai

Replies to list-only preferred.



Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-03 Thread Austin S. Hemmelgarn

On 2017-03-03 00:56, Kai Krakow wrote:

On Thu, 2 Mar 2017 11:37:53 +0100, Adam Borowski wrote:


On Wed, Mar 01, 2017 at 05:30:37PM -0700, Chris Murphy wrote:

[1717713.408675] BTRFS warning (device dm-8): missing devices (1)
exceeds the limit (0), writeable mount is not allowed
[1717713.446453] BTRFS error (device dm-8): open_ctree failed

[chris@f25s ~]$ uname -r
4.9.8-200.fc25.x86_64

I thought this was fixed. I'm still getting a one time degraded rw
mount, after that it's no longer allowed, which really doesn't make
any sense because those single chunks are on the drive I'm trying to
mount.


Well, there's Qu's patch at:
https://www.spinics.net/lists/linux-btrfs/msg47283.html
but it doesn't apply cleanly nor is easy to rebase to current kernels.


I don't understand what problem this proscription is trying to
avoid. If it's OK to mount rw,degraded once, then it's OK to allow
it twice. If it's not OK twice, it's not OK once.


Well, yeah.  The current check is naive and wrong.  It does have a
purpose, just fails in this, very common, case.


I guess the reasoning behind this is: Creating any more chunks on this
drive will make raid1 chunks with only one copy. Adding another drive
later will not replay the copies without user interaction. Is that true?

If yes, this may leave you with a mixed case of having a raid1 drive
with some chunks not mirrored and some mirrored. When the other drive
goes missing later, you are losing data or even the whole filesystem
although you were left with the (wrong) impression of having a
mirrored drive setup...

Is this how it works?

If yes, a real patch would also need to replay the missing copies after
adding a new drive.

The problem is that that would use some serious disk bandwidth without 
user intervention.  The way from userspace to fix this is to scrub the 
FS.  It would essentially be the same from kernel space, which means 
that if you had a multi-TB FS and this happened, you'd be running at 
below capacity in terms of bandwidth for quite some time.  If this were 
to be implemented, it would have to be keyed off of the per-chunk 
degraded check (so that _only_ the chunks that need it get touched), and 
there would need to be a switch to disable it.
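
(As commands, the userspace fix mentioned above would be roughly the
following; /mnt is a placeholder, and how much the priority hint actually
helps depends on the I/O scheduler in use:)

  # Re-mirror by scrubbing once the device set is whole again; -c 3 requests
  # the idle I/O priority class to limit the impact on other workloads.
  sudo btrfs scrub start -c 3 /mnt
  sudo btrfs scrub status /mnt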



Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-03 Thread Austin S. Hemmelgarn

On 2017-03-02 19:47, Peter Grandi wrote:

[ ... ] Meanwhile, the problem as I understand it is that at
the first raid1 degraded writable mount, no single-mode chunks
exist, but without the second device, they are created.  [
... ]


That does not make any sense, unless there is a fundamental
mistake in the design of the 'raid1' profile, which this and
other situations make me think is a possibility: that the
category of "mirrored" 'raid1' chunk does not exist in the Btrfs
chunk manager. That is, a chunk is either 'raid1' if it has a
mirror, or if has no mirror it must be 'single'.

If a member device of a 'raid1' profile multidevice volume
disappears there will be "unmirrored" 'raid1' profile chunks and
some code path must recognize them as such, but the logic of the
code does not allow their creation. Question: how does the code
know that a specific 'raid1' chunk is mirrored or not? The chunk
must have a link (member, offset) to its mirror, do they?

What makes me think that "unmirrored" 'raid1' profile chunks are
"not a thing" is that it is impossible to remove explicitly a
member device from a 'raid1' profile volume: first one has to
'convert' to 'single', and then  the 'remove' copies back to the
remaining devices the 'single' chunks that are on the explicitly
'remove'd device. Which to me seems absurd.
It is, there should be a way to do this as a single operation.  The 
reason this is currently the case though is a simple one, 'btrfs device 
delete' is just a special instance of balance that prevents new chunks 
being allocated on the device being removed and balances all the chunks 
on that device so they end up on other devices.  It currently does no 
profile conversion, but having that as an option would actually be 
_very_ useful from a data safety perspective.


Going further in my speculation, I suspect that at the core of
the Btrfs multidevice design there is a persistent "confusion"
(to use a euphemism) between volumes having a profile, and
merely chunks having a profile.
There generally is.  The profile is entirely a property of the chunks 
(each chunk literally has a bit of metadata that says what profile it 
is), not the volume.  There's some metadata in the volume somewhere that 
says what profile to use for new chunks of each type (I think), but that 
doesn't dictate what chunk profiles there are on the volume.  This whole 
arrangement is actually pretty important for fault tolerance in general, 
since during a conversion you have _both_ profiles for that chunk type 
at the same time on the same filesystem (new chunks will get allocated 
with the new type though), and the kernel has to be able to handle a 
partially converted FS.


My additional guess that the original design concept had
multidevice volumes to be merely containers for chunks of
whichever mixed profiles, so a subvolume could have 'raid1'
profile metadata and 'raid0' profile data, and another could
have 'raid10' profile metadata and data, but since handling this
turned out to be too hard, this was compromised into volumes
having all metadata chunks to have the same profile and all data
of the same profile, which requires special-case handling of
corner cases, like volumes being converted or missing member
devices.
Actually, the only bits missing that would be needed to do this are 
stuff to segregate the data of given subvolumes completely form each 
other (ie, make sure they can't be in the same chunks at all).  Doing 
that is hard, so we don't have per-subvolume profiles yet.  It's fully 
possible to have a mix of profiles on a given volume though.  Some old 
versions of mkfs actually did this (you'd end up with a small single 
profile chunk of each type on a FS that used different profiles), and 
the filesystem is in exactly that state when converting between profiles 
for a given chunk type.  New chunks will only be generated with one 
profile, but you can have whatever other mix you want essentially (in 
fact, one of the handful of regression tests I run when I'm checking 
patches explicitly creates a filesystem with one data and one system 
chunk of every profile and makes sure the kernel can still access it 
correctly).
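
(A minimal sketch of such a conversion using the existing balance filters;
while it runs, the filesystem is in exactly the mixed-profile state described
above. /mnt is a placeholder:)

  sudo btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
  sudo btrfs balance status /mnt
  sudo btrfs fi df /mnt   # mid-conversion this lists chunks of both profiles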



Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-03 Thread Adam Borowski
On Fri, Mar 03, 2017 at 06:56:22AM +0100, Kai Krakow wrote:
> > > I don't understand what problem this proscription is trying to
> > > avoid. If it's OK to mount rw,degraded once, then it's OK to allow
> > > it twice. If it's not OK twice, it's not OK once.  
> > 
> > Well, yeah.  The current check is naive and wrong.  It does have a
> > purpose, just fails in this, very common, case.
> 
> I guess the reasoning behind this is: Creating any more chunks on this
> drive will make raid1 chunks with only one copy. Adding another drive
> later will not replay the copies without user interaction. Is that true?
> 
> If yes, this may leave you with a mixed case of having a raid1 drive
> with some chunks not mirrored and some mirrored. When the other drive
> goes missing later, you are losing data or even the whole filesystem
> although you were left with the (wrong) impression of having a
> mirrored drive setup...

Ie, you want a degraded mount to create degraded raid1 chunks rather than
single ones?  Good idea, it would solve the most common case with least
surprise to the user.

But there are other scenarios where Qu's patch[-set] is needed.  For
example, if you try to convert a single-disk filesystem to raid1, yet the
new shiny disk you just added decides to remind you of words "infant
mortality" halfway during conversion.

Or, if you have degraded raid1 chunks and something bad happens during
recovery.  Having the required number of devices, despite passing the
current bogus check, doesn't mean you can continue.  Qu's patch checks
whether at least one copy of every chunk is present.

-- 
⢀⣴⠾⠻⢶⣦⠀ Meow!
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second
⠈⠳⣄ preimage for double rot13!


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-03 Thread Anand Jain



AFAIK, no, it hasn't been fixed, at least not in mainline, because the
patches to fix it got stuck in some long-running project patch queue
(IIRC, the one for on-degraded auto-device-replace), with no timeline
known to me on mainline merge.

Meanwhile, the problem as I understand it is that at the first raid1
degraded writable mount, no single-mode chunks exist, but without the
second device, they are created.



It might be an accidental feature introduced in the patch [1].
RFC [2] (only lightly tested) tried to correct it. But if the accidental
feature works better than the traditional RAID1 approach, then the
workaround fix [3] will help; however, for the accidental feature
I am not sure whether it would support all the failure-recovery/
FS-is-full cases.

[1]
  commit 95669976bd7d30ae265db938ecb46a6b7f8cb893
  Btrfs: don't consider the missing device when allocating new chunks

[2]
  [PATCH 0/2] [RFC] btrfs: create degraded-RAID1 chunks

[3]
  Patches 01/13 to 05/13 of the below patch set (which were needed
  to test rest of the patches in the set).
  [PATCH v6 00/13] Introduce device state 'failed', spare device and 
auto replace.



Hope this sheds some light on the long standing issue.

Thanks, Anand


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-03 Thread Kai Krakow
On Thu, 2 Mar 2017 11:37:53 +0100, Adam Borowski wrote:

> On Wed, Mar 01, 2017 at 05:30:37PM -0700, Chris Murphy wrote:
> > [1717713.408675] BTRFS warning (device dm-8): missing devices (1)
> > exceeds the limit (0), writeable mount is not allowed
> > [1717713.446453] BTRFS error (device dm-8): open_ctree failed
> > 
> > [chris@f25s ~]$ uname -r
> > 4.9.8-200.fc25.x86_64
> > 
> > I thought this was fixed. I'm still getting a one time degraded rw
> > mount, after that it's no longer allowed, which really doesn't make
> > any sense because those single chunks are on the drive I'm trying to
> > mount.  
> 
> Well, there's Qu's patch at:
> https://www.spinics.net/lists/linux-btrfs/msg47283.html
> but it doesn't apply cleanly nor is easy to rebase to current kernels.
> 
> > I don't understand what problem this proscription is trying to
> > avoid. If it's OK to mount rw,degraded once, then it's OK to allow
> > it twice. If it's not OK twice, it's not OK once.  
> 
> Well, yeah.  The current check is naive and wrong.  It does have a
> purpose, just fails in this, very common, case.

I guess the reasoning behind this is: Creating any more chunks on this
drive will make raid1 chunks with only one copy. Adding another drive
later will not replay the copies without user interaction. Is that true?

If yes, this may leave you with a mixed case of having a raid1 drive
with some chunks not mirrored and some mirrored. When the other drive
goes missing later, you are losing data or even the whole filesystem
although you were left with the (wrong) impression of having a
mirrored drive setup...

Is this how it works?

If yes, a real patch would also need to replay the missing copies after
adding a new drive.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-02 Thread Duncan
Peter Grandi posted on Fri, 03 Mar 2017 00:47:46 +0000 as excerpted:

>> [ ... ] Meanwhile, the problem as I understand it is that at the first
>> raid1 degraded writable mount, no single-mode chunks exist, but without
>> the second device, they are created.  [ ... ]
> 
> That does not make any sense, unless there is a fundamental mistake in
> the design of the 'raid1' profile, which this and other situations make
> me think is a possibility: that the category of "mirrored" 'raid1' chunk
> does not exist in the Btrfs chunk manager. That is, a chunk is either
> 'raid1' if it has a mirror, or if has no mirror it must be 'single'.
> 
> If a member device of a 'raid1' profile multidevice volume disappears
> there will be "unmirrored" 'raid1' profile chunks and some code path
> must recognize them as such, but the logic of the code does not allow
> their creation. Question: how does the code know that a specific 'raid1'
> chunk is mirrored or not? The chunk must have a link (member, offset) to
> its mirror, do they?

The problem at the surface level is, raid1 chunks MUST be created with 
two copies, one each on two different devices.  It is (currently) not 
allowed to create only a single copy of a raid1 chunk, and the two copies 
must be on different devices, so once you have only a single device, 
raid1 chunks cannot be created.

Which presents a problem when you're trying to recover, needing writable 
in order to be able to do a device replace or add/remove (with the 
remove triggering a balance), because btrfs is COW, so any changes get 
written to new locations, which requires chunked space that might not be 
available in the currently allocated chunks.

To work around that, they allowed the chunk allocator to fallback to 
single mode when it couldn't create raid1.

Which is fine as long as the recovery is completed in the same mount.  
But if you unmount or crash and try to remount to complete the job after 
those single-mode chunks have been created, oops!  Single mode chunks on 
a multi-device filesystem with a device missing, and the logic currently 
isn't sophisticated enough to realize that all the chunks are actually 
accounted for, so it forces read-only mounting to prevent further damage.

Which means you can copy off the files to a different filesystem as 
they're still all available, including any written in single-mode, but 
you can't fix the degraded filesystem any longer, as that requires a 
writable mount you're not going to be able to get, at least not with 
mainline.
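
(For reference, the recovery sequence being described, which only works while
that first read-write degraded mount is still obtainable, looks roughly like
this; /dev/sdb is the surviving device, /dev/sdc the replacement and /mnt the
mount point, all placeholders:)

  sudo mount -o degraded /dev/sdb /mnt
  sudo btrfs device add /dev/sdc /mnt
  sudo btrfs device delete missing /mnt
  # Convert back any chunks that fell back to single while degraded; the
  # 'soft' filter only touches chunks not already in the target profile.
  sudo btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt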


At a lower level, the problem is that for raid1 (and I think raid10 as 
well tho I'm not sure on it), they made a mistake in the implementation.

For raid56, the minimum allowed writable devices is lower than the 
minimum number of devices for undegraded write, by the number of parity 
devices (so raid5 will allow two devices for undegraded write, 1 parity, 
one data, but one device for degraded write, raid6 will allow three 
devices for undegraded write, one data, two parity, or again, one device 
for degraded write).

But for raid1, both the degraded write minimum and the undegraded write 
minimum are set to *two* devices, an implementation error since the 
degraded write minimum should arguably be one device, without a mirror.

So the degrade to single-mode is a workaround for the real problem, not 
allowing degraded raid1 write (that is, chunk creation).

And all this is known and has been discussed right here on this list by 
the devs, but nobody has actually bothered to properly fix it, either by 
correctly setting the degraded raid1 write minimum to a single device, or 
even by working around the single-mode workaround, by correctly checking 
each chunk and allowing writable mount if all are accounted for, even if 
there's a missing device.

Or rather, the workaround for the incomplete workaround has had a patch 
submitted, but it got stuck in that long-running project and has been in 
limbo every since, and now I guess the patch has gone stale and doesn't 
even properly apply any longer.


All of which is yet more demonstration of the fact that is stated time 
and again on this list, that btrfs should be considered stabilizing, but 
still under heavy development and not yet fully stable, and backups 
should be kept updated and at-hand for any data you value higher than the 
bother and resources necessary to make those backups.

Because if there's backups updated and at hand, then what happens to the 
working copy doesn't matter, and in this particular case, even if the 
backups aren't fully current, the fact that they're available means 
there's space available to update them from the working copy should it go 
into readonly mode as well, which means recovery from the read-only 
formerly working copy is no big deal.

Either that, or by definition, the data wasn't of enough value to have 
backups when storing it on a widely known to be still stabilizing and 
under heavy development filesystem, where those backups 

Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-02 Thread Chris Murphy
On Thu, Mar 2, 2017 at 6:18 PM, Qu Wenruo  wrote:
>
>
> At 03/03/2017 09:15 AM, Chris Murphy wrote:
>>
>> [1805985.267438] BTRFS info (device dm-6): allowing degraded mounts
>> [1805985.267566] BTRFS info (device dm-6): disk space caching is enabled
>> [1805985.267676] BTRFS info (device dm-6): has skinny extents
>> [1805987.187857] BTRFS warning (device dm-6): missing devices (1)
>> exceeds the limit (0), writeable mount is not allowed
>> [1805987.228990] BTRFS error (device dm-6): open_ctree failed
>> [chris@f25s ~]$ sudo mount -o noatime,degraded,ro /dev/mapper/sdb /mnt
>> [chris@f25s ~]$ sudo btrfs fi df /mnt
>> Data, RAID1: total=434.00GiB, used=432.46GiB
>> Data, single: total=1.00GiB, used=1.66MiB
>> System, RAID1: total=8.00MiB, used=48.00KiB
>> System, single: total=32.00MiB, used=32.00KiB
>> Metadata, RAID1: total=2.00GiB, used=729.17MiB
>> Metadata, single: total=1.00GiB, used=0.00B
>> GlobalReserve, single: total=495.02MiB, used=0.00B
>> [chris@f25s ~]$
>>
>>
>>
>> So the sequence is:
>> 1. mkfs.btrfs -d raid1 -m raid1 with two devices
>> 2. fill it with a bunch of data over a few months, always mounted
>> normally with default options
>> 3. physically remove 1 of 2 devices, and do a degraded mount. This
>> mounts without error, and more stuff is added. Volume is umounted.
>> 4. Try to mount the same 1 of 2 devices, with degraded mount option,
>> and I get the first error, "writeable mount is not allowed".
>> 5. Try to mount the same 1 of 2 devices, with degraded,ro option, and
>> it mounts, and then I captured the 'btrfs fi df' above.
>>
>> So very clearly there are single chunks added during the degraded rw
>> mount.
>>
>> But does 1.66MiB of data in that single data chunk make sense? And
>> does 0.00 MiB of metadata in that single metadata chunk make sense?
>> I'm not sure, seems unlikely. Most of what happened in that subvolume
>> since the previous snapshot was moving things around, reorganizing,
>> not adding files. So, maybe 1.66MiB data added is possible? But
>> definitely the metadata changes must be in the raid1 chunks, while the
>> newly created single profile metadata chunk is left unused.
>>
>> So I think there's more than one bug going on here, separate problems
>> for data and metadata.
>
>
> IIRC I submitted a patch a long time ago to check each chunk to see if it's OK
> to mount in degraded mode.
>
> And in your case, it will allow RW degraded mount since the stripe of that
> single chunk is not missing.
>
> That patch was later merged into the hot-spare patchset, but AFAIK it will be a
> long, long time before that hot-spare work gets merged.
>
> So I'll update that patch and hope it can solve the problem.
>

OK thanks. Yeah I should have said that this is not a critical
situation for me. It's just a confusing situation.

In particular, people could do a btrfs replace, or do btrfs dev
add then btrfs dev delete missing, and what happens? There's some data
that's not replicated on the replacement drive because it's single
profile, and if that happens to be metadata it's possibly
unpredictable what happens when the drive with single chunks dies. At
the very least there is going to be some data loss. It's entirely
possible the drive that's missing these single chunks can't be mounted
degraded. And for sure it's possible that it can't be used for
replication, when doing a device replace for the 1st device with the
only copy of these single chunks.

Again, my data is fine. The problem I'm having is this:
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/Documentation/filesystems/btrfs.txt?id=refs/tags/v4.10.1

Which says in the first line, in part, "focusing on fault tolerance,
repair and easy administration" and quite frankly this sort of
enduring bug in this file system that's nearly 10 years old now
renders that description misleading, and possibly dishonest. How do we describe this
file system as focusing on fault tolerance when, in the identical
scenario using mdadm or LVM raid, the user's data is not mishandled
like it is on Btrfs with multiple devices?



-- 
Chris Murphy


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-02 Thread Qu Wenruo



At 03/03/2017 09:15 AM, Chris Murphy wrote:

[1805985.267438] BTRFS info (device dm-6): allowing degraded mounts
[1805985.267566] BTRFS info (device dm-6): disk space caching is enabled
[1805985.267676] BTRFS info (device dm-6): has skinny extents
[1805987.187857] BTRFS warning (device dm-6): missing devices (1)
exceeds the limit (0), writeable mount is not allowed
[1805987.228990] BTRFS error (device dm-6): open_ctree failed
[chris@f25s ~]$ sudo mount -o noatime,degraded,ro /dev/mapper/sdb /mnt
[chris@f25s ~]$ sudo btrfs fi df /mnt
Data, RAID1: total=434.00GiB, used=432.46GiB
Data, single: total=1.00GiB, used=1.66MiB
System, RAID1: total=8.00MiB, used=48.00KiB
System, single: total=32.00MiB, used=32.00KiB
Metadata, RAID1: total=2.00GiB, used=729.17MiB
Metadata, single: total=1.00GiB, used=0.00B
GlobalReserve, single: total=495.02MiB, used=0.00B
[chris@f25s ~]$



So the sequence is:
1. mkfs.btrfs -d raid1 -m raid1 [ ... ]

IIRC I submitted a patch a long time ago to check each chunk to see if 
it's OK to mount in degraded mode.


And in your case, it will allow RW degraded mount since the stripe of 
that single chunk is not missing.


That patch was later merged into the hot-spare patchset, but AFAIK it will be 
a long, long time before that hot-spare work gets merged.


So I'll update that patch and hope it can solve the problem.

Thanks,
Qu



Chris Murphy







Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-02 Thread Chris Murphy
[1805985.267438] BTRFS info (device dm-6): allowing degraded mounts
[1805985.267566] BTRFS info (device dm-6): disk space caching is enabled
[1805985.267676] BTRFS info (device dm-6): has skinny extents
[1805987.187857] BTRFS warning (device dm-6): missing devices (1)
exceeds the limit (0), writeable mount is not allowed
[1805987.228990] BTRFS error (device dm-6): open_ctree failed
[chris@f25s ~]$ sudo mount -o noatime,degraded,ro /dev/mapper/sdb /mnt
[chris@f25s ~]$ sudo btrfs fi df /mnt
Data, RAID1: total=434.00GiB, used=432.46GiB
Data, single: total=1.00GiB, used=1.66MiB
System, RAID1: total=8.00MiB, used=48.00KiB
System, single: total=32.00MiB, used=32.00KiB
Metadata, RAID1: total=2.00GiB, used=729.17MiB
Metadata, single: total=1.00GiB, used=0.00B
GlobalReserve, single: total=495.02MiB, used=0.00B
[chris@f25s ~]$



So the sequence is:
1. mkfs.btrfs -d raid1 -m raid1 with two devices
2. fill it with a bunch of data over a few months, always mounted
normally with default options
3. physically remove 1 of 2 devices, and do a degraded mount. This
mounts without error, and more stuff is added. Volume is umounted.
4. Try to mount the same 1 of 2 devices, with degraded mount option,
and I get the first error, "writeable mount is not allowed".
5. Try to mount the same 1 of 2 devices, with degraded,ro option, and
it mounts, and then I captured the 'btrfs fi df' above.

So very clearly there are single chunks added during the degraded rw
mount. [ ... ]


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-02 Thread Peter Grandi
> [ ... ] Meanwhile, the problem as I understand it is that at
> the first raid1 degraded writable mount, no single-mode chunks
> exist, but without the second device, they are created.  [
> ... ]

That does not make any sense, unless there is a fundamental
mistake in the design of the 'raid1' profile, which this and
other situations make me think is a possibility: that the
category of "mirrored" 'raid1' chunk does not exist in the Btrfs
chunk manager. That is, a chunk is either 'raid1' if it has a
mirror, or if has no mirror it must be 'single'.

If a member device of a 'raid1' profile multidevice volume
disappears, there will be "unmirrored" 'raid1' profile chunks,
and some code path must recognize them as such, but the logic of
the code does not allow their creation. Question: how does the
code know whether a specific 'raid1' chunk is mirrored or not?
The chunk must have a link (member device, offset) to its
mirror, doesn't it?
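
(One way to check that by hand, rather than reading the code, would be
to dump the chunk tree, which should show whether each chunk records a
set of stripes, each naming a devid and an offset on that device. A
sketch only: /dev/sdb is an example device, it assumes a btrfs-progs
new enough to provide "inspect-internal dump-tree" (older versions ship
much the same thing as btrfs-debug-tree), and the exact output format
varies between versions, so the grep pattern may need adjusting.)

  # Dump the chunk tree of an unmounted (or read-only mounted) filesystem
  # and show each chunk's profile plus the devid of every stripe it has.
  sudo btrfs inspect-internal dump-tree -t chunk /dev/sdb \
      | grep -E 'CHUNK_ITEM|type |stripe [0-9]+ devid'

(If so, a 'raid1' chunk should list two stripes on different devids,
while a 'single' chunk lists only one.)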

What makes me think that "unmirrored" 'raid1' profile chunks are
"not a thing" is that it is impossible to explicitly remove a
member device from a 'raid1' profile volume: first one has to
'convert' to 'single', and then the 'remove' copies back to the
remaining devices the 'single' chunks that are on the explicitly
'remove'd device. Which to me seems absurd.
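
(To make that concrete, in the two-device case the dance being
described looks roughly like the following; the device names and mount
point are just examples, and btrfs-progs may insist on its -f/--force
flag before it will reduce metadata redundancy in the first step:)

  # Two-device 'raid1' volume mounted at /mnt; the goal is to drop /dev/sdc.
  # Step 1: rewrite every chunk as a single-copy chunk.
  sudo btrfs balance start -dconvert=single -mconvert=single /mnt
  # Step 2: only now can the member be removed, which copies the 'single'
  # chunks that happen to sit on /dev/sdc back onto the remaining device.
  sudo btrfs device delete /dev/sdc /mnt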

Going further in my speculation, I suspect that at the core of
the Btrfs multidevice design there is a persistent "confusion"
(to use a euphemism) between volumes having a profile and merely
chunks having a profile.

My additional guess is that the original design concept had
multidevice volumes be merely containers for chunks of whichever
mixed profiles, so a subvolume could have 'raid1' profile
metadata and 'raid0' profile data, and another could have
'raid10' profile metadata and data; but since handling this
turned out to be too hard, this was compromised into volumes
having all metadata chunks in the same profile and all data
chunks in the same profile, which requires special-case handling
of corner cases, like volumes being converted or missing member
devices.

So in the case of 'raid1', a volume with, say, a 'raid1' data
profile should have all-'raid1' and fully mirrored chunks, and
the lack of a member device fails that aim in two ways.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-02 Thread Austin S. Hemmelgarn

On 2017-03-02 12:26, Andrei Borzenkov wrote:

02.03.2017 16:41, Duncan wrote:

Chris Murphy posted on Wed, 01 Mar 2017 17:30:37 -0700 as excerpted:


[1717713.408675] BTRFS warning (device dm-8): missing devices (1)
exceeds the limit (0), writeable mount is not allowed
[1717713.446453] BTRFS error (device dm-8): open_ctree failed

[chris@f25s ~]$ uname -r
4.9.8-200.fc25.x86_64

I thought this was fixed. I'm still getting a one time degraded rw
mount, after that it's no longer allowed, which really doesn't make any
sense because those single chunks are on the drive I'm trying to mount.
I don't understand what problem this proscription is trying to avoid. If
it's OK to mount rw,degraded once, then it's OK to allow it twice. If
it's not OK twice, it's not OK once.


AFAIK, no, it hasn't been fixed, at least not in mainline, because the
patches to fix it got stuck in some long-running project patch queue
(IIRC, the one for on-degraded auto-device-replace), with no timeline
known to me on mainline merge.

Meanwhile, the problem as I understand it is that at the first raid1
degraded writable mount, no single-mode chunks exist, but without the
second device, they are created.


Is that not the root cause? I would expect it to create degraded mirrored
chunks that will be synchronized when the second device is added back.
That's exactly what it should be doing, and AFAIK what the correct fix 
for this should be, but in the interim just relaxing the degraded check 
to be per-chunk makes things usable, and is arguably how it should have 
been to begin with.


 (It's not clear to me whether they are

created with the first write, that is, ignoring any space in existing
degraded raid1 chunks, or if that's used up first and the single-mode
chunks only created later, when a new chunk must be allocated to continue
writing as the old ones are full.)

So the first degraded-writable mount is allowed, because no single-mode
chunks yet exist, while after such single-mode chunks are created, the
existing dumb algorithm won't allow further writable mounts, because it
sees single-mode chunks on a multi-device filesystem, and never mind that
all the single mode chunks are there, it simply doesn't check that and
won't allow writable mount because some /might/ be on the missing device.

The patches stuck in queue would make btrfs more intelligent about that,
having it check each chunk as listed in the chunk tree, and if at least
one copy is available (as would be the case for single-mode chunks
created after the degraded mount), writable mount would still be
allowed.  But... that's stuck in a long running project queue with no
known timetable for merging... ... so the only way to
get it is to go find and merge them yourself, in your own build.



Will it replicate single-mode chunks when the second device is added?
Not automatically; you would need to convert them to raid1 (or whatever 
other profile).  Even with the patch, this would still be needed, but at 
least it would (technically) work sanely.  On that note, on most of my 
systems, I have a startup script that calls balance with the appropriate 
convert flags and the soft flag for every fixed (non-removable) BTRFS 
volume on the system to clean up after this.  The actual balance call 
takes no time at all unless there are actually chunks to convert, so it 
normally has very little impact on boot times.
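
(For reference, a minimal sketch of such a cleanup step; the mount point
and the raid1 target profile are examples, not my actual script:)

  # Convert any stray 'single' chunks back to raid1.  The 'soft' filter
  # skips chunks that already have the target profile, so this is close
  # to a no-op when there is nothing to convert.
  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt

(Run once per fixed BTRFS mount point from a boot-time script or unit,
it costs almost nothing unless a degraded mount actually left single
chunks behind.)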

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-02 Thread Andrei Borzenkov
02.03.2017 16:41, Duncan wrote:
> Chris Murphy posted on Wed, 01 Mar 2017 17:30:37 -0700 as excerpted:
> 
>> [1717713.408675] BTRFS warning (device dm-8): missing devices (1)
>> exceeds the limit (0), writeable mount is not allowed
>> [1717713.446453] BTRFS error (device dm-8): open_ctree failed
>>
>> [chris@f25s ~]$ uname -r
>> 4.9.8-200.fc25.x86_64
>>
>> I thought this was fixed. I'm still getting a one time degraded rw
>> mount, after that it's no longer allowed, which really doesn't make any
>> sense because those single chunks are on the drive I'm trying to mount.
>> I don't understand what problem this proscription is trying to avoid. If
>> it's OK to mount rw,degraded once, then it's OK to allow it twice. If
>> it's not OK twice, it's not OK once.
> 
> AFAIK, no, it hasn't been fixed, at least not in mainline, because the 
> patches to fix it got stuck in some long-running project patch queue 
> (IIRC, the one for on-degraded auto-device-replace), with no timeline 
> known to me on mainline merge.
> 
> Meanwhile, the problem as I understand it is that at the first raid1 
> degraded writable mount, no single-mode chunks exist, but without the 
> second device, they are created. 

Is that not the root cause? I would expect it to create degraded mirrored
chunks that will be synchronized when the second device is added back.

> (It's not clear to me whether they are
> created with the first write, that is, ignoring any space in existing 
> degraded raid1 chunks, or if that's used up first and the single-mode 
> chunks only created later, when a new chunk must be allocated to continue 
> writing as the old ones are full.)
> 
> So the first degraded-writable mount is allowed, because no single-mode 
> chunks yet exist, while after such single-mode chunks are created, the 
> existing dumb algorithm won't allow further writable mounts, because it 
> sees single-mode chunks on a multi-device filesystem, and never mind that 
> all the single mode chunks are there, it simply doesn't check that and 
> won't allow writable mount because some /might/ be on the missing device.
> 
> The patches stuck in queue would make btrfs more intelligent about that, 
> having it check each chunk as listed in the chunk tree, and if at least 
> one copy is available (as would be the case for single-mode chunks 
> created after the degraded mount), writable mount would still be 
> allowed.  But... that's stuck in a long running project queue with no 
> known timetable for merging... ... so the only way to 
> get it is to go find and merge them yourself, in your own build.
> 

Will it replicate single-mode chunks when the second device is added?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-02 Thread Duncan
Chris Murphy posted on Wed, 01 Mar 2017 17:30:37 -0700 as excerpted:

> [1717713.408675] BTRFS warning (device dm-8): missing devices (1)
> exceeds the limit (0), writeable mount is not allowed
> [1717713.446453] BTRFS error (device dm-8): open_ctree failed
> 
> [chris@f25s ~]$ uname -r
> 4.9.8-200.fc25.x86_64
> 
> I thought this was fixed. I'm still getting a one time degraded rw
> mount, after that it's no longer allowed, which really doesn't make any
> sense because those single chunks are on the drive I'm trying to mount.
> I don't understand what problem this proscription is trying to avoid. If
> it's OK to mount rw,degraded once, then it's OK to allow it twice. If
> it's not OK twice, it's not OK once.

AFAIK, no, it hasn't been fixed, at least not in mainline, because the 
patches to fix it got stuck in some long-running project patch queue 
(IIRC, the one for on-degraded auto-device-replace), with no timeline 
known to me on mainline merge.

Meanwhile, the problem as I understand it is that at the first raid1 
degraded writable mount, no single-mode chunks exist, but without the 
second device, they are created.  (It's not clear to me whether they are 
created with the first write, that is, ignoring any space in existing 
degraded raid1 chunks, or if that's used up first and the single-mode 
chunks only created later, when a new chunk must be allocated to continue 
writing as the old ones are full.)

So the first degraded-writable mount is allowed, because no single-mode 
chunks yet exist, while after such single-mode chunks are created, the 
existing dumb algorithm won't allow further writable mounts, because it 
sees single-mode chunks on a multi-device filesystem, and never mind that 
all the single mode chunks are there, it simply doesn't check that and 
won't allow writable mount because some /might/ be on the missing device.

The patches stuck in queue would make btrfs more intelligent about that, 
having it check each chunk as listed in the chunk tree, and if at least 
one copy is available (as would be the case for single-mode chunks 
created after the degraded mount), writable mount would still be 
allowed.  But... that's stuck in a long running project queue with no 
known timetable for merging... ... so the only way to 
get it is to go find and merge them yourself, in your own build.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-02 Thread Adam Borowski
On Wed, Mar 01, 2017 at 05:30:37PM -0700, Chris Murphy wrote:
> [1717713.408675] BTRFS warning (device dm-8): missing devices (1)
> exceeds the limit (0), writeable mount is not allowed
> [1717713.446453] BTRFS error (device dm-8): open_ctree failed
> 
> [chris@f25s ~]$ uname -r
> 4.9.8-200.fc25.x86_64
> 
> I thought this was fixed. I'm still getting a one time degraded rw
> mount, after that it's no longer allowed, which really doesn't make
> any sense because those single chunks are on the drive I'm trying to
> mount.

Well, there's Qu's patch at:
https://www.spinics.net/lists/linux-btrfs/msg47283.html
but it doesn't apply cleanly, nor is it easy to rebase onto current kernels.

> I don't understand what problem this proscription is trying to
> avoid. If it's OK to mount rw,degraded once, then it's OK to allow it
> twice. If it's not OK twice, it's not OK once.

Well, yeah.  The current check is naive and wrong.  It does have a purpose,
just fails in this, very common, case.

For people needing to recover their filesystem at this moment there's
https://www.spinics.net/lists/linux-btrfs/msg62473.html
but it removes the protection you still want for other cases.

This problem pops up way too often, thus I guess that if not the devs, then
at least us in the peanut gallery should do the work reviving the real
solution.  Obviously, I for one am shortish on tuits at the moment...

-- 
⢀⣴⠾⠻⢶⣦⠀ Meow!
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second
⠈⠳⣄ preimage for double rot13!
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-01 Thread Chris Murphy
[1717713.408675] BTRFS warning (device dm-8): missing devices (1)
exceeds the limit (0), writeable mount is not allowed
[1717713.446453] BTRFS error (device dm-8): open_ctree failed

[chris@f25s ~]$ uname -r
4.9.8-200.fc25.x86_64

I thought this was fixed. I'm still getting a one time degraded rw
mount, after that it's no longer allowed, which really doesn't make
any sense because those single chunks are on the drive I'm trying to
mount. I don't understand what problem this proscription is trying to
avoid. If it's OK to mount rw,degraded once, then it's OK to allow it
twice. If it's not OK twice, it's not OK once.


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html