Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
On 2017-03-09 04:49, Peter Grandi wrote: Consider the common case of a 3-member volume with a 'raid1' target profile: if the sysadm thinks that a drive should be replaced, the goal is to take it out *without* converting every chunk to 'single', because with 2-out-of-3 devices half of the chunks will still be fully mirrored. Also, removing the device to be replaced should really not be the same thing as balancing the chunks, if there is space, to be 'raid1' across remaining drives, because that's a completely different operation. There is a command specifically for replacing devices. It operates very differently from the add+delete or delete+add sequences. [ ... ] Perhaps it was not clear that I was talking about removing a device, as distinct from replacing it, and that I used "removed" instead of "deleted" deliberately, to avoid the confusion with the 'delete' command.

Ah, sorry, I misunderstood what you were saying.

In the everyday practice of system administration it often happens that a device should be removed first, and replaced later, for example when it is suspected to be faulty, or is intermittently faulty. The replacement can be done with 'replace' or 'add+delete' or 'delete+add', but that's a different matter. Perhaps I should not have used the generic verb "remove", but written "make unavailable". This brings up again the topic of some "confusion" in the design of the Btrfs multidevice handling logic, where at least initially one could only expand the storage space of a multidevice by 'add' of a new device or shrink the storage space by 'delete' of an existing one; I think it was not conceived at Btrfs design time that the storage space could stay nominally constant while a device (and the chunks on it) has a state of "available" ("present", "online", "enabled") or "unavailable" ("absent", "offline", "disabled"), either because of events or because of system administrator action. The 'missing' pseudo-device designator was added later, and 'replace' also later, to avoid having to first expand then shrink (or vice versa) the storage space and the related copying. My impression is that it would be less "confused" if the Btrfs device handling logic were changed to allow for the state of "member of the multidevice set but not actually available" and the related consequent state for chunks that ought to be on it; that probably would be essential to fixing the confusing current aspects of recovery in a multidevice set. That would be very useful even if it may require a change in the on-disk format to distinguish the distinct states of membership and availability for devices and mark chunks as available or not (chunks of course being only possible on member devices). That is, it would also be nice to have the opposite state of "not member of the multidevice set but actually available to it", that is a spare device, and related logic.

OK, so expanding on this a bit, there are currently three functional device states in BTRFS (note that the terms I use here aren't official, they're just what I use to describe them):
1. Active/Online. This is the normal state for a device, you can both read from it and write to it.
2. Inactive/Replacing/Deleting. This is the state a device is in when it's either being deleted or replaced. Inactive devices don't count towards total volume size, and can't be written to, but can be read from if they weren't missing prior to becoming inactive.
3. Missing/Offline. This is pretty self-explanatory. A device in this state can't be read from or written to, but it does count towards volume size.

Currently, the only transitions available to a sysadmin through BTRFS itself are temporary transitions from Active to Inactive (replace and delete). In an ideal situation, there would be two other states:
4. Local hot-spare/Nearline. Won't be read from and doesn't count towards total volume size, but may be written to (depending on how the FS is configured), and will be automatically used to replace a failed device in the filesystem it's associated with.
5. Global hot-spare. Similar to local hot-spare, but can be used for any filesystem on the system, and won't be touched until it's needed.

The following manually initiated transitions would be possible for regular operation:
1. Active -> Inactive (persistently)
2. Inactive -> Active
3. Active -> Local hot-spare
4. Inactive -> Local hot-spare
5. Local hot-spare -> Active
6. Local hot-spare -> Inactive
7. Global hot-spare -> Active
8. Global hot-spare -> Inactive
9. Local hot-spare -> Global hot-spare
10. Global hot-spare -> Local hot-spare

And the following automatic transitions would be possible:
1. Local hot-spare -> Active
2. Global hot-spare -> Active
3. -> Missing
4. Missing ->

And there would be the option of manually triggering the automatic transitions for debugging purposes.

Note: simply setting '/sys/block/$DEV/device/delete' is not a good option, because that makes the device unavailable not just to Btrfs, but also to the whole system.
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
>> Consider the common case of a 3-member volume with a 'raid1' >> target profile: if the sysadm thinks that a drive should be >> replaced, the goal is to take it out *without* converting every >> chunk to 'single', because with 2-out-of-3 devices half of the >> chunks will still be fully mirrored. >> Also, removing the device to be replaced should really not be >> the same thing as balancing the chunks, if there is space, to be >> 'raid1' across remaining drives, because that's a completely >> different operation.

> There is a command specifically for replacing devices. It > operates very differently from the add+delete or delete+add > sequences. [ ... ]

Perhaps it was not clear that I was talking about removing a device, as distinct from replacing it, and that I used "removed" instead of "deleted" deliberately, to avoid the confusion with the 'delete' command. In the everyday practice of system administration it often happens that a device should be removed first, and replaced later, for example when it is suspected to be faulty, or is intermittently faulty. The replacement can be done with 'replace' or 'add+delete' or 'delete+add', but that's a different matter. Perhaps I should not have used the generic verb "remove", but written "make unavailable".

This brings up again the topic of some "confusion" in the design of the Btrfs multidevice handling logic, where at least initially one could only expand the storage space of a multidevice by 'add' of a new device or shrink the storage space by 'delete' of an existing one; I think it was not conceived at Btrfs design time that the storage space could stay nominally constant while a device (and the chunks on it) has a state of "available" ("present", "online", "enabled") or "unavailable" ("absent", "offline", "disabled"), either because of events or because of system administrator action. The 'missing' pseudo-device designator was added later, and 'replace' also later, to avoid having to first expand then shrink (or vice versa) the storage space and the related copying.

My impression is that it would be less "confused" if the Btrfs device handling logic were changed to allow for the state of "member of the multidevice set but not actually available" and the related consequent state for chunks that ought to be on it; that probably would be essential to fixing the confusing current aspects of recovery in a multidevice set. That would be very useful even if it may require a change in the on-disk format to distinguish the distinct states of membership and availability for devices and mark chunks as available or not (chunks of course being only possible on member devices). That is, it would also be nice to have the opposite state of "not member of the multidevice set but actually available to it", that is a spare device, and related logic.

Note: simply setting '/sys/block/$DEV/device/delete' is not a good option, because that makes the device unavailable not just to Btrfs, but also to the whole system. In the ordinary practice of system administration it may well be useful to make a device unavailable to Btrfs but still available to the system, for example for testing, and anyhow they are logically distinct states. That also means a member device might well be available to the system, but marked as "not available" to Btrfs.
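For context, the '/sys/block/$DEV/device/delete' knob mentioned above detaches the disk at the SCSI/block layer, which is exactly why it hides the device from the whole system rather than just from Btrfs. A minimal sketch; the device name and SCSI host number are assumptions, not values from the thread:

  # Detach sdb from the system entirely -- not just from Btrfs:
  echo 1 > /sys/block/sdb/device/delete
  # Bring it back later by rescanning the SCSI host it sits on
  # (check /sys/class/scsi_host/ for the right hostN):
  echo "- - -" > /sys/class/scsi_host/host2/scan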
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
On 2017-03-05 14:13, Peter Grandi wrote: What makes me think that "unmirrored" 'raid1' profile chunks are "not a thing" is that it is impossible to remove explicitly a member device from a 'raid1' profile volume: first one has to 'convert' to 'single', and then the 'remove' copies back to the remaining devices the 'single' chunks that are on the explicitly 'remove'd device. Which to me seems absurd. It is, there should be a way to do this as a single operation. [ ... ] The reason this is currently the case though is a simple one, 'btrfs device delete' is just a special instance of balance [ ... ] does no profile conversion, but having that as an option would actually be _very_ useful from a data safety perspective.

That seems to me an even more "confused" opinion: because removing a device to make it "missing" and removing it permanently should be very different operations. Consider the common case of a 3-member volume with a 'raid1' target profile: if the sysadm thinks that a drive should be replaced, the goal is to take it out *without* converting every chunk to 'single', because with 2-out-of-3 devices half of the chunks will still be fully mirrored. Also, removing the device to be replaced should really not be the same thing as balancing the chunks, if there is space, to be 'raid1' across remaining drives, because that's a completely different operation.

There is a command specifically for replacing devices. It operates very differently from the add+delete or delete+add sequences. Instead of balancing, it's more similar to LVM's pvmove command. It redirects all new writes that would go to the old device to the new one, then copies all the data from the old to the new (while properly recreating damaged chunks). It uses way less bandwidth than add+delete, runs faster, and is in general much safer because it moves less data around. If you're just replacing devices, you should be using this, not the add and delete commands, which are more for reshaping arrays than repairing them. Additionally, if you _have_ to use add and remove to replace a device, if possible, you should add the new device then delete the old one, not the other way around, as that avoids most of the issues other than the high load on the filesystem from the balance operation.

Going further in my speculation, I suspect that at the core of the Btrfs multidevice design there is a persistent "confusion" (to use a euphemism) between volumes having a profile, and merely chunks having a profile. There generally is. The profile is entirely a property of the chunks (each chunk literally has a bit of metadata that says what profile it is), not the volume. There's some metadata in the volume somewhere that says what profile to use for new chunks of each type (I think), That's the "target" profile for the volume. but that doesn't dictate what chunk profiles there are on the volume. [ ... ]

But as that's the case then the current Btrfs logic for determining whether a volume is degraded or not is quite "confused" indeed. Entirely agreed. Currently, it checks the target profile, when it should be checking per-chunk. Because suppose there is again the simple case of a 3-device volume, where all existing chunks have 'raid1' profile and the volume's target profile is also 'raid1' and one device has gone offline: the volume cannot be said to be "degraded", unless a full examination of all chunks is made. Because it can well happen that in fact *none* of the chunks was mirrored to that device, for example, however unlikely. And vice versa. Even with 3 devices some chunks may be temporarily "unmirrored" (even if for brief times hopefully). The average case is that half of the chunks will be fully mirrored across the two remaining devices and half will be "unmirrored". Now consider re-adding the third device: at that point the volume has got back all 3 devices, so it is not "degraded", but 50% of the chunks in the volume will still be "unmirrored", even if eventually they will be mirrored on the newly added device. Note: possibilities get even more interesting with a 4-device volume with 'raid1' profile chunks, and similar cases involving other profiles than 'raid1'.

Therefore the current Btrfs logic for deciding whether a volume is "degraded" seems simply "confused" to me, because whether there are missing devices and some chunks are "unmirrored" is not quite the same thing. The same applies to the current logic that in a 2-device volume with a device missing new chunks are created as "single" profile instead of as "unmirrored" 'raid1' profile: another example of "confusion" between number of devices and chunk profile. Note: the best that can be said is that a volume has both a "target chunk profile" (one per data, metadata, system chunks) and a target number of member devices, and that a volume with a number of devices below the target *might* be degraded, and that whether a volume is in fact degraded is not either/or, but given by the percentage of chunks or stripes that are degraded.
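To make the 'replace' versus add+delete distinction discussed above concrete, here is a hedged command sketch; the device names and mount point are placeholders, not values from the thread:

  # One-step replacement, conceptually similar to LVM's pvmove:
  # new writes are redirected to the new device while old data is copied over,
  # rebuilding damaged chunks along the way.
  btrfs replace start /dev/sdb /dev/sdd /mnt
  btrfs replace status /mnt

  # Reshaping alternative: add the new device first, then delete the old one,
  # so redundancy is never reduced while the implicit balance runs.
  btrfs device add /dev/sdd /mnt
  btrfs device delete /dev/sdb /mnt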
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
On 2017-03-03 15:10, Kai Krakow wrote: On Fri, 3 Mar 2017 07:19:06 -0500, "Austin S. Hemmelgarn" wrote: On 2017-03-03 00:56, Kai Krakow wrote: On Thu, 2 Mar 2017 11:37:53 +0100, Adam Borowski wrote: On Wed, Mar 01, 2017 at 05:30:37PM -0700, Chris Murphy wrote: [...] Well, there's Qu's patch at: https://www.spinics.net/lists/linux-btrfs/msg47283.html but it doesn't apply cleanly nor is easy to rebase to current kernels. [...] Well, yeah. The current check is naive and wrong. It does have a purpose, just fails in this, very common, case. I guess the reasoning behind this is: Creating any more chunks on this drive will make raid1 chunks with only one copy. Adding another drive later will not replay the copies without user interaction. Is that true? If yes, this may leave you with a mixed case of having a raid1 drive with some chunks not mirrored and some mirrored. When the other drive goes missing later, you are losing data or even the whole filesystem although you were left with the (wrong) impression of having a mirrored drive setup... Is this how it works? If yes, a real patch would also need to replay the missing copies after adding a new drive.

The problem is that that would use some serious disk bandwidth without user intervention. The way from userspace to fix this is to scrub the FS. It would essentially be the same from kernel space, which means that if you had a multi-TB FS and this happened, you'd be running at below capacity in terms of bandwidth for quite some time. If this were to be implemented, it would have to be keyed off of the per-chunk degraded check (so that _only_ the chunks that need it get touched), and there would need to be a switch to disable it.

Well, I'd expect that a replaced drive would involve reduced bandwidth for a while. Every traditional RAID does this. The key solution there is that you can limit bandwidth and/or define priorities (BG rebuild rate). Btrfs OTOH could be a lot smarter, only rebuilding chunks that are affected. The kernel can already do IO priorities and some sort of bandwidth limiting should also be possible. I think IO throttling is already implemented in the kernel somewhere (at least with 4.10) and also in btrfs. So the basics are there.

I/O prioritization in Linux is crap right now. Only one scheduler properly supports it, and that scheduler is deprecated, not to mention that it didn't work reliably to begin with. There is a bandwidth limiting mechanism in place, but that's for userspace stuff, not kernel stuff (which is why scrub is such an issue, the actual I/O is done by the kernel, not userspace).

In a RAID setup, performance should never have priority over redundancy by default. If performance is an important factor, I suggest working with SSD writeback caches. This is already possible with different kernel techniques like mdcache or bcache. Proper hardware controllers also support this in hardware. It's cheap to have a mirrored SSD writeback cache of 1TB or so if your setup already contains a multiple terabytes array. Such a setup has huge performance benefits in setups we deploy (tho, not btrfs related). Also, adding/replacing a drive is usually not a totally unplanned event. Except for hot spares, a missing drive will be replaced at the time you arrive on-site. If performance is a factor, this can be done the same time as manually starting the process. So why should it not be done automatically?

You're already going to be involved because you can't (from a practical perspective) automate the physical device replacement, so all that making it automatic does is make things more convenient. In general, if you're concerned enough to be using a RAID array, you probably shouldn't be trading convenience for data safety, and as of right now, BTRFS isn't mature enough that it could be said to be consistently safe to automate almost anything. There are plenty of other reasons for it to not be automatic though, the biggest being that it will waste bandwidth (and therefore time) if you plan to convert profiles after adding the device. That said, it would be nice to have a switch for the add command to automatically re-balance the array.
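Until such a switch exists, the manual equivalent of "add and then re-balance" looks roughly like this; the device name and mount point are assumptions, and '--full-balance' merely skips the safety prompt in newer btrfs-progs:

  btrfs device add /dev/sdc /mnt
  # Spread existing chunks across the enlarged array:
  btrfs balance start --full-balance /mnt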
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
[ ... on the difference between number of devices and length of a chunk-stripe ... ] > Note: possibilities get even more interesting with a 4-device > volume with 'raid1' profile chunks, and similar cases involving > other profiles than 'raid1'. Consider for example a 4-device volume with 2 devices abruptly missing: if 2-length 'raid1' chunk-stripes have been uniformly laid across devices, then some chunk-stripes will be completely missing (where both chunks in the stripe were on the 2 missing devices), some will be 1-length, and some will be 2-length. What to do when devices are missing? One possibility is to simply require mount with the 'degraded' option, by default read-only, but allowing read-write, simply as a way to ensure the sysadm knows that some metadata/data *may* not be redundant or *may* even be unavailable (if the chunk-stripe length is less than the minimum to reconstruct the data). Then attempts to read unavailable metadata or data would return an error like a checksum violation without redundancy, dynamically (when the application or 'balance' or 'scrub' attempt to read the unavailable data).
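Something close to the behaviour sketched above can already be exercised by hand: mount with the missing devices explicitly acknowledged, then let a read-only scrub walk everything and report what cannot be read or verified. The device name and mount point are placeholders:

  mount -o degraded,ro /dev/sda /mnt
  # -B: stay in the foreground, -d: per-device stats, -r: read-only scrub
  btrfs scrub start -Bdr /mnt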
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
>> What makes me think that "unmirrored" 'raid1' profile chunks >> are "not a thing" is that it is impossible to remove >> explicitly a member device from a 'raid1' profile volume: >> first one has to 'convert' to 'single', and then the 'remove' >> copies back to the remaining devices the 'single' chunks that >> are on the explicitly 'remove'd device. Which to me seems >> absurd. > It is, there should be a way to do this as a single operation. > [ ... ] The reason this is currently the case though is a > simple one, 'btrfs device delete' is just a special instance > of balance [ ... ] does no profile conversion, but having > that as an option would actually be _very_ useful from a data > safety perspective.

That seems to me an even more "confused" opinion: because removing a device to make it "missing" and removing it permanently should be very different operations. Consider the common case of a 3-member volume with a 'raid1' target profile: if the sysadm thinks that a drive should be replaced, the goal is to take it out *without* converting every chunk to 'single', because with 2-out-of-3 devices half of the chunks will still be fully mirrored. Also, removing the device to be replaced should really not be the same thing as balancing the chunks, if there is space, to be 'raid1' across remaining drives, because that's a completely different operation.

>> Going further in my speculation, I suspect that at the core of >> the Btrfs multidevice design there is a persistent "confusion" >> (to use a euphemism) between volumes having a profile, and >> merely chunks having a profile. > There generally is. The profile is entirely a property of the > chunks (each chunk literally has a bit of metadata that says > what profile it is), not the volume. There's some metadata in > the volume somewhere that says what profile to use for new > chunks of each type (I think), That's the "target" profile for the volume. > but that doesn't dictate what chunk profiles there are on the > volume. [ ... ]

But as that's the case then the current Btrfs logic for determining whether a volume is degraded or not is quite "confused" indeed. Because suppose there is again the simple case of a 3-device volume, where all existing chunks have 'raid1' profile and the volume's target profile is also 'raid1' and one device has gone offline: the volume cannot be said to be "degraded", unless a full examination of all chunks is made. Because it can well happen that in fact *none* of the chunks was mirrored to that device, for example, however unlikely. And vice versa. Even with 3 devices some chunks may be temporarily "unmirrored" (even if for brief times hopefully). The average case is that half of the chunks will be fully mirrored across the two remaining devices and half will be "unmirrored". Now consider re-adding the third device: at that point the volume has got back all 3 devices, so it is not "degraded", but 50% of the chunks in the volume will still be "unmirrored", even if eventually they will be mirrored on the newly added device. Note: possibilities get even more interesting with a 4-device volume with 'raid1' profile chunks, and similar cases involving other profiles than 'raid1'.

Therefore the current Btrfs logic for deciding whether a volume is "degraded" seems simply "confused" to me, because whether there are missing devices and some chunks are "unmirrored" is not quite the same thing. The same applies to the current logic that in a 2-device volume with a device missing new chunks are created as "single" profile instead of as "unmirrored" 'raid1' profile: another example of "confusion" between number of devices and chunk profile. Note: the best that can be said is that a volume has both a "target chunk profile" (one per data, metadata, system chunks) and a target number of member devices, and that a volume with a number of devices below the target *might* be degraded, and that whether a volume is in fact degraded is not either/or, but given by the percentage of chunks or stripes that are degraded. This is especially made clear by the 'raid1' case where the chunk stripe length is always 2, but the number of target devices can be greater than 2. Management of devices and management of stripes are, in Btrfs, unlike conventional RAID such as Linux MD, rather different operations needing rather different, if related, logic. My impression is that because of "confusion" between number of devices in a volume and status of chunk profile there are some "surprising" behaviors in Btrfs, and that will take quite a bit to fix, most importantly for the Btrfs developer team to clear among themselves the semantics attaching to both. After 10 years of development that seems the right thing to do :-).
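As a practical aside, the gap between the per-volume "target" and the per-chunk reality described above can be inspected with reasonably recent btrfs-progs; the mount point is a placeholder:

  btrfs filesystem df /mnt       # totals per chunk type and profile
  btrfs filesystem usage /mnt    # the same, broken down per device
  btrfs device usage /mnt        # per-device view of which profiles have chunks where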
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
Chris Murphy wrote: On Thu, Mar 2, 2017 at 6:48 PM, Chris Murphy wrote: Again, my data is fine. The problem I'm having is this: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/Documentation/filesystems/btrfs.txt?id=refs/tags/v4.10.1 Which says in the first line, in part, "focusing on fault tolerance, repair and easy administration", and quite frankly this sort of enduring bug, in a file system that's nearly 10 years old now, renders that claim misleading, and possibly dishonest. How do we describe this file system as focusing on fault tolerance when, in the identical scenario using mdadm or LVM raid, the user's data is not mishandled like it is on Btrfs with multiple devices? I think until these problems are fixed, the Btrfs status page should describe RAID 1 and 10 as mostly OK, with this problem as the reason for it not being OK. I took the liberty of changing the status page...
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
On Thu, Mar 2, 2017 at 6:48 PM, Chris Murphy wrote: > > Again, my data is fine. The problem I'm having is this: > https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/Documentation/filesystems/btrfs.txt?id=refs/tags/v4.10.1 > > Which says in the first line, in part, "focusing on fault tolerance, > repair and easy administration", and quite frankly this sort of > enduring bug, in a file system that's nearly 10 years old now, renders > that claim misleading, and possibly dishonest. How do we describe this > file system as focusing on fault tolerance when, in the identical > scenario using mdadm or LVM raid, the user's data is not mishandled > like it is on Btrfs with multiple devices? I think until these problems are fixed, the Btrfs status page should describe RAID 1 and 10 as mostly OK, with this problem as the reason for it not being OK. -- Chris Murphy
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
On Fri, 3 Mar 2017 07:19:06 -0500, "Austin S. Hemmelgarn" wrote: > On 2017-03-03 00:56, Kai Krakow wrote: > > Am Thu, 2 Mar 2017 11:37:53 +0100 > > schrieb Adam Borowski : > > > >> On Wed, Mar 01, 2017 at 05:30:37PM -0700, Chris Murphy wrote: > [...] > >> > >> Well, there's Qu's patch at: > >> https://www.spinics.net/lists/linux-btrfs/msg47283.html > >> but it doesn't apply cleanly nor is easy to rebase to current > >> kernels. > [...] > >> > >> Well, yeah. The current check is naive and wrong. It does have a > >> purpose, just fails in this, very common, case. > > > > I guess the reasoning behind this is: Creating any more chunks on > > this drive will make raid1 chunks with only one copy. Adding > > another drive later will not replay the copies without user > > interaction. Is that true? > > > > If yes, this may leave you with a mixed case of having a raid1 drive > > with some chunks not mirrored and some mirrored. When the other > > drive goes missing later, you are losing data or even the whole > > filesystem although you were left with the (wrong) impression of > > having a mirrored drive setup... > > > > Is this how it works? > > > > If yes, a real patch would also need to replay the missing copies > > after adding a new drive. > > > The problem is that that would use some serious disk bandwidth > without user intervention. The way from userspace to fix this is to > scrub the FS. It would essentially be the same from kernel space, > which means that if you had a multi-TB FS and this happened, you'd be > running at below capacity in terms of bandwidth for quite some time. > If this were to be implemented, it would have to be keyed off of the > per-chunk degraded check (so that _only_ the chunks that need it get > touched), and there would need to be a switch to disable it.

Well, I'd expect that a replaced drive would involve reduced bandwidth for a while. Every traditional RAID does this. The key solution there is that you can limit bandwidth and/or define priorities (BG rebuild rate). Btrfs OTOH could be a lot smarter, only rebuilding chunks that are affected. The kernel can already do IO priorities and some sort of bandwidth limiting should also be possible. I think IO throttling is already implemented in the kernel somewhere (at least with 4.10) and also in btrfs. So the basics are there.

In a RAID setup, performance should never have priority over redundancy by default. If performance is an important factor, I suggest working with SSD writeback caches. This is already possible with different kernel techniques like mdcache or bcache. Proper hardware controllers also support this in hardware. It's cheap to have a mirrored SSD writeback cache of 1TB or so if your setup already contains a multiple terabytes array. Such a setup has huge performance benefits in setups we deploy (tho, not btrfs related).

Also, adding/replacing a drive is usually not a totally unplanned event. Except for hot spares, a missing drive will be replaced at the time you arrive on-site. If performance is a factor, this can be done the same time as manually starting the process. So why should it not be done automatically? -- Regards, Kai Replies to list-only preferred.
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
On 2017-03-03 00:56, Kai Krakow wrote: On Thu, 2 Mar 2017 11:37:53 +0100, Adam Borowski wrote: On Wed, Mar 01, 2017 at 05:30:37PM -0700, Chris Murphy wrote: [1717713.408675] BTRFS warning (device dm-8): missing devices (1) exceeds the limit (0), writeable mount is not allowed [1717713.446453] BTRFS error (device dm-8): open_ctree failed [chris@f25s ~]$ uname -r 4.9.8-200.fc25.x86_64 I thought this was fixed. I'm still getting a one time degraded rw mount, after that it's no longer allowed, which really doesn't make any sense because those single chunks are on the drive I'm trying to mount. Well, there's Qu's patch at: https://www.spinics.net/lists/linux-btrfs/msg47283.html but it doesn't apply cleanly nor is easy to rebase to current kernels. I don't understand what problem this proscription is trying to avoid. If it's OK to mount rw,degraded once, then it's OK to allow it twice. If it's not OK twice, it's not OK once. Well, yeah. The current check is naive and wrong. It does have a purpose, just fails in this, very common, case. I guess the reasoning behind this is: Creating any more chunks on this drive will make raid1 chunks with only one copy. Adding another drive later will not replay the copies without user interaction. Is that true? If yes, this may leave you with a mixed case of having a raid1 drive with some chunks not mirrored and some mirrored. When the other drive goes missing later, you are losing data or even the whole filesystem although you were left with the (wrong) impression of having a mirrored drive setup... Is this how it works? If yes, a real patch would also need to replay the missing copies after adding a new drive.

The problem is that that would use some serious disk bandwidth without user intervention. The way from userspace to fix this is to scrub the FS. It would essentially be the same from kernel space, which means that if you had a multi-TB FS and this happened, you'd be running at below capacity in terms of bandwidth for quite some time. If this were to be implemented, it would have to be keyed off of the per-chunk degraded check (so that _only_ the chunks that need it get touched), and there would need to be a switch to disable it.
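For what it's worth, scrub already exposes a couple of knobs to soften the bandwidth impact described above, although, as noted elsewhere in the thread, they only help where the I/O scheduler actually honours priorities. The mount point is a placeholder:

  # Run the scrub in the idle I/O class (-c 3) so the normal workload keeps priority:
  btrfs scrub start -Bd -c 3 /mnt
  # Check progress, or resume an interrupted scrub later:
  btrfs scrub status /mnt
  btrfs scrub resume /mnt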
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
On 2017-03-02 19:47, Peter Grandi wrote: [ ... ] Meanwhile, the problem as I understand it is that at the first raid1 degraded writable mount, no single-mode chunks exist, but without the second device, they are created. [ ... ] That does not make any sense, unless there is a fundamental mistake in the design of the 'raid1' profile, which this and other situations make me think is a possibility: that the category of "mirrored" 'raid1' chunk does not exist in the Btrfs chunk manager. That is, a chunk is either 'raid1' if it has a mirror, or if it has no mirror it must be 'single'. If a member device of a 'raid1' profile multidevice volume disappears there will be "unmirrored" 'raid1' profile chunks and some code path must recognize them as such, but the logic of the code does not allow their creation. Question: how does the code know that a specific 'raid1' chunk is mirrored or not? The chunk must have a link (member, offset) to its mirror, do they? What makes me think that "unmirrored" 'raid1' profile chunks are "not a thing" is that it is impossible to remove explicitly a member device from a 'raid1' profile volume: first one has to 'convert' to 'single', and then the 'remove' copies back to the remaining devices the 'single' chunks that are on the explicitly 'remove'd device. Which to me seems absurd.

It is, there should be a way to do this as a single operation. The reason this is currently the case though is a simple one, 'btrfs device delete' is just a special instance of balance that prevents new chunks being allocated on the device being removed and balances all the chunks on that device so they end up on other devices. It currently does no profile conversion, but having that as an option would actually be _very_ useful from a data safety perspective.

Going further in my speculation, I suspect that at the core of the Btrfs multidevice design there is a persistent "confusion" (to use a euphemism) between volumes having a profile, and merely chunks having a profile.

There generally is. The profile is entirely a property of the chunks (each chunk literally has a bit of metadata that says what profile it is), not the volume. There's some metadata in the volume somewhere that says what profile to use for new chunks of each type (I think), but that doesn't dictate what chunk profiles there are on the volume. This whole arrangement is actually pretty important for fault tolerance in general, since during a conversion you have _both_ profiles for that chunk type at the same time on the same filesystem (new chunks will get allocated with the new type though), and the kernel has to be able to handle a partially converted FS.

My additional guess is that the original design concept had multidevice volumes to be merely containers for chunks of whichever mixed profiles, so a subvolume could have 'raid1' profile metadata and 'raid0' profile data, and another could have 'raid10' profile metadata and data, but since handling this turned out to be too hard, this was compromised into volumes having all metadata chunks with the same profile and all data chunks with the same profile, which requires special-case handling of corner cases, like volumes being converted or missing member devices.

Actually, the only bits missing that would be needed to do this are stuff to segregate the data of given subvolumes completely from each other (ie, make sure they can't be in the same chunks at all). Doing that is hard, so we don't have per-subvolume profiles yet. It's fully possible to have a mix of profiles on a given volume though. Some old versions of mkfs actually did this (you'd end up with a small single profile chunk of each type on a FS that used different profiles), and the filesystem is in exactly that state when converting between profiles for a given chunk type. New chunks will only be generated with one profile, but you can have whatever other mix you want essentially (in fact, one of the handful of regression tests I run when I'm checking patches explicitly creates a filesystem with one data and one system chunk of every profile and makes sure the kernel can still access it correctly).
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
On Fri, Mar 03, 2017 at 06:56:22AM +0100, Kai Krakow wrote: > > > I don't understand what problem this proscription is trying to > > > avoid. If it's OK to mount rw,degraded once, then it's OK to allow > > > it twice. If it's not OK twice, it's not OK once. > > > > Well, yeah. The current check is naive and wrong. It does have a > > purpose, just fails in this, very common, case. > > I guess the reasoning behind this is: Creating any more chunks on this > drive will make raid1 chunks with only one copy. Adding another drive > later will not replay the copies without user interaction. Is that true? > > If yes, this may leave you with a mixed case of having a raid1 drive > with some chunks not mirrored and some mirrored. When the other drives > goes missing later, you are loosing data or even the whole filesystem > although you were left with the (wrong) imagination of having a > mirrored drive setup... Ie, you want a degraded mount to create degraded raid1 chunks rather than single ones? Good idea, it would solve the most common case with least surprise to the user. But there are other scenarios where Qu's patch[-set] is needed. For example, if you try to convert a single-disk filesystem to raid1, yet the new shiny disk you just added decides to remind you of the words "infant mortality" halfway during conversion. Or, if you have degraded raid1 chunks and something bad happens during recovery. Having the required number of devices, despite passing the current bogus check, doesn't mean you can continue. Qu's patch checks whether at least one copy of every chunk is present. -- ⢀⣴⠾⠻⢶⣦⠀ Meow! ⣾⠁⢠⠒⠀⣿⡁ ⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second ⠈⠳⣄ preimage for double rot13!
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
AFAIK, no, it hasn't been fixed, at least not in mainline, because the patches to fix it got stuck in some long-running project patch queue (IIRC, the one for on-degraded auto-device-replace), with no timeline known to me on mainline merge. Meanwhile, the problem as I understand it is that at the first raid1 degraded writable mount, no single-mode chunks exist, but without the second device, they are created. It might be an accidental feature introduced in the patch [1]. RFC [2] (limited testing) tried to correct it. But, if the accidental feature works better than the traditional RAID1 approach, then workaround fix [3] will help; however, for the accidental feature I am not sure whether it would support all the failure-recovery / FS-is-full cases. [1] commit 95669976bd7d30ae265db938ecb46a6b7f8cb893 Btrfs: don't consider the missing device when allocating new chunks [2] [PATCH 0/2] [RFC] btrfs: create degraded-RAID1 chunks [3] Patches 01/13 to 05/13 of the below patch set (which were needed to test the rest of the patches in the set). [PATCH v6 00/13] Introduce device state 'failed', spare device and auto replace. Hope this sheds some light on the long standing issue. Thanks, Anand
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
On Thu, 2 Mar 2017 11:37:53 +0100, Adam Borowski wrote: > On Wed, Mar 01, 2017 at 05:30:37PM -0700, Chris Murphy wrote: > > [1717713.408675] BTRFS warning (device dm-8): missing devices (1) > > exceeds the limit (0), writeable mount is not allowed > > [1717713.446453] BTRFS error (device dm-8): open_ctree failed > > > > [chris@f25s ~]$ uname -r > > 4.9.8-200.fc25.x86_64 > > > > I thought this was fixed. I'm still getting a one time degraded rw > > mount, after that it's no longer allowed, which really doesn't make > > any sense because those single chunks are on the drive I'm trying to > > mount. > > Well, there's Qu's patch at: > https://www.spinics.net/lists/linux-btrfs/msg47283.html > but it doesn't apply cleanly nor is easy to rebase to current kernels. > > > I don't understand what problem this proscription is trying to > > avoid. If it's OK to mount rw,degraded once, then it's OK to allow > > it twice. If it's not OK twice, it's not OK once. > > Well, yeah. The current check is naive and wrong. It does have a > purpose, just fails in this, very common, case.

I guess the reasoning behind this is: Creating any more chunks on this drive will make raid1 chunks with only one copy. Adding another drive later will not replay the copies without user interaction. Is that true? If yes, this may leave you with a mixed case of having a raid1 drive with some chunks not mirrored and some mirrored. When the other drive goes missing later, you are losing data or even the whole filesystem although you were left with the (wrong) impression of having a mirrored drive setup... Is this how it works? If yes, a real patch would also need to replay the missing copies after adding a new drive. -- Regards, Kai Replies to list-only preferred.
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
Peter Grandi posted on Fri, 03 Mar 2017 00:47:46 + as excerpted: >> [ ... ] Meanwhile, the problem as I understand it is that at the first >> raid1 degraded writable mount, no single-mode chunks exist, but without >> the second device, they are created. [ ... ] > > That does not make any sense, unless there is a fundamental mistake in > the design of the 'raid1' profile, which this and other situations make > me think is a possibility: that the category of "mirrored" 'raid1' chunk > does not exist in the Btrfs chunk manager. That is, a chunk is either > 'raid1' if it has a mirror, or if it has no mirror it must be 'single'. > > If a member device of a 'raid1' profile multidevice volume disappears > there will be "unmirrored" 'raid1' profile chunks and some code path > must recognize them as such, but the logic of the code does not allow > their creation. Question: how does the code know that a specific 'raid1' > chunk is mirrored or not? The chunk must have a link (member, offset) to > its mirror, do they?

The problem at the surface level is, raid1 chunks MUST be created with two copies, one each on two different devices. It is (currently) not allowed to create only a single copy of a raid1 chunk, and the two copies must be on different devices, so once you have only a single device, raid1 chunks cannot be created. Which presents a problem when you're trying to recover, needing writable in order to be able to do a device replace or add/remove (with the remove triggering a balance), because btrfs is COW, so any changes get written to new locations, which requires chunked space that might not be available in the currently allocated chunks. To work around that, they allowed the chunk allocator to fall back to single mode when it couldn't create raid1. Which is fine as long as the recovery is completed in the same mount. But if you unmount or crash and try to remount to complete the job after those single-mode chunks have been created, oops! Single mode chunks on a multi-device filesystem with a device missing, and the logic currently isn't sophisticated enough to realize that all the chunks are actually accounted for, so it forces read-only mounting to prevent further damage. Which means you can copy off the files to a different filesystem as they're still all available, including any written in single-mode, but you can't fix the degraded filesystem any longer, as that requires a writable mount you're not going to be able to get, at least not with mainline.

At a lower level, the problem is that for raid1 (and I think raid10 as well tho I'm not sure on it), they made a mistake in the implementation. For raid56, the minimum allowed writable devices is lower than the minimum number of devices for undegraded write, by the number of parity devices (so raid5 will allow two devices for undegraded write, 1 parity, one data, but one device for degraded write, raid6 will allow three devices for undegraded write, one data, two parity, or again, one device for degraded write). But for raid1, both the degraded write minimum and the undegraded write minimum are set to *two* devices, an implementation error since the degraded write minimum should arguably be one device, without a mirror. So the degrade to single-mode is a workaround for the real problem, not allowing degraded raid1 write (that is, chunk creation).

And all this is known and has been discussed right here on this list by the devs, but nobody has actually bothered to properly fix it, either by correctly setting the degraded raid1 write minimum to a single device, or even by working around the single-mode workaround, by correctly checking each chunk and allowing writable mount if all are accounted for, even if there's a missing device. Or rather, the workaround for the incomplete workaround has had a patch submitted, but it got stuck in that long-running project and has been in limbo ever since, and now I guess the patch has gone stale and doesn't even properly apply any longer.

All of which is yet more demonstration of the fact that is stated time and again on this list, that btrfs should be considered stabilizing, but still under heavy development and not yet fully stable, and backups should be kept updated and at-hand for any data you value higher than the bother and resources necessary to make those backups. Because if there's backups updated and at hand, then what happens to the working copy doesn't matter, and in this particular case, even if the backups aren't fully current, the fact that they're available means there's space available to update them from the working copy should it go into readonly mode as well, which means recovery from the read-only formerly working copy is no big deal. Either that, or by definition, the data wasn't of enough value to have backups when storing it on a widely known to be still stabilizing and under heavy development filesystem, where those backups
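A minimal sketch of the recovery path described above, for the case where only a read-only degraded mount is still possible; the device, mount point, and backup destination are assumptions:

  mount -o degraded,ro /dev/sdb /mnt
  # Copy everything off while it is still readable:
  rsync -aHAX --info=progress2 /mnt/ /backup/working-copy/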
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
On Thu, Mar 2, 2017 at 6:18 PM, Qu Wenruo wrote: > > > At 03/03/2017 09:15 AM, Chris Murphy wrote: >> >> [1805985.267438] BTRFS info (device dm-6): allowing degraded mounts >> [1805985.267566] BTRFS info (device dm-6): disk space caching is enabled >> [1805985.267676] BTRFS info (device dm-6): has skinny extents >> [1805987.187857] BTRFS warning (device dm-6): missing devices (1) >> exceeds the limit (0), writeable mount is not allowed >> [1805987.228990] BTRFS error (device dm-6): open_ctree failed >> [chris@f25s ~]$ sudo mount -o noatime,degraded,ro /dev/mapper/sdb /mnt >> [chris@f25s ~]$ sudo btrfs fi df /mnt >> Data, RAID1: total=434.00GiB, used=432.46GiB >> Data, single: total=1.00GiB, used=1.66MiB >> System, RAID1: total=8.00MiB, used=48.00KiB >> System, single: total=32.00MiB, used=32.00KiB >> Metadata, RAID1: total=2.00GiB, used=729.17MiB >> Metadata, single: total=1.00GiB, used=0.00B >> GlobalReserve, single: total=495.02MiB, used=0.00B >> [chris@f25s ~]$ >> >> >> >> So the sequence is: >> 1. mkfs.btrfs -d raid1 -m raid1 >> 2. fill it with a bunch of data over a few months, always mounted >> normally with default options >> 3. physically remove 1 of 2 devices, and do a degraded mount. This >> mounts without error, and more stuff is added. Volume is umounted. >> 4. Try to mount the same 1 of 2 devices, with degraded mount option, >> and I get the first error, "writeable mount is not allowed". >> 5. Try to mount the same 1 of 2 devices, with degraded,ro option, and >> it mounts, and then I captured the 'btrfs fi df' above. >> >> So very clearly there are single chunks added during the degraded rw >> mount. >> >> But does 1.66MiB of data in that single data chunk make sense? And >> does 0.00 MiB of metadata in that single metadata chunk make sense? >> I'm not sure, seems unlikely. Most of what happened in that subvolume >> since the previous snapshot was moving things around, reorganizing, >> not adding files. So, maybe 1.66MiB data added is possible? But >> definitely the metadata changes must be in the raid1 chunks, while the >> newly created single profile metadata chunk is left unused. >> >> So I think there's more than one bug going on here, separate problems >> for data and metadata. > > > IIRC I submitted a patch long time ago to check each chunk to see if it's OK > to mount in degraded mode. > > And in your case, it will allow RW degraded mount since the stripe of that > single chunk is not missing. > > That patch is later merged into hot-spare patchset, but AFAIK it will be a > long long time before such hot-spare get merged. > > So I'll update that patch and hope it can solve the problem. > OK thanks. Yeah I should have said that this is not a critical situation for me. It's just a confusing situation. In particular that people could do a btrfs replace; or do btrfs dev add, then btrfs dev delete missing, and what happens? There's some data that's not replicated on the replacement drive because it's single profile, and if that happens to be metadata it's possibly unpredictable what happens when the drive with single chunks dies. At the very least there is going to be some data loss. It's entirely possible the drive that's missing these single chunks can't be mounted degraded. And for sure it's possible that it can't be used for replication, when doing a device replace for the 1st device with the only copy of these single chunks. Again, my data is fine.

The problem I'm having is this: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/Documentation/filesystems/btrfs.txt?id=refs/tags/v4.10.1 Which says in the first line, in part, "focusing on fault tolerance, repair and easy administration", and quite frankly this sort of enduring bug, in a file system that's nearly 10 years old now, renders that claim misleading, and possibly dishonest. How do we describe this file system as focusing on fault tolerance when, in the identical scenario using mdadm or LVM raid, the user's data is not mishandled like it is on Btrfs with multiple devices? -- Chris Murphy
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
At 03/03/2017 09:15 AM, Chris Murphy wrote: [1805985.267438] BTRFS info (device dm-6): allowing degraded mounts [1805985.267566] BTRFS info (device dm-6): disk space caching is enabled [1805985.267676] BTRFS info (device dm-6): has skinny extents [1805987.187857] BTRFS warning (device dm-6): missing devices (1) exceeds the limit (0), writeable mount is not allowed [1805987.228990] BTRFS error (device dm-6): open_ctree failed [chris@f25s ~]$ sudo mount -o noatime,degraded,ro /dev/mapper/sdb /mnt [chris@f25s ~]$ sudo btrfs fi df /mnt Data, RAID1: total=434.00GiB, used=432.46GiB Data, single: total=1.00GiB, used=1.66MiB System, RAID1: total=8.00MiB, used=48.00KiB System, single: total=32.00MiB, used=32.00KiB Metadata, RAID1: total=2.00GiB, used=729.17MiB Metadata, single: total=1.00GiB, used=0.00B GlobalReserve, single: total=495.02MiB, used=0.00B [chris@f25s ~]$ So the sequence is: 1. mkfs.btrfs -d raid1 -m raid1 [...]

IIRC I submitted a patch a long time ago to check each chunk to see if it's OK to mount in degraded mode. And in your case, it will allow RW degraded mount since the stripe of that single chunk is not missing. That patch was later merged into the hot-spare patchset, but AFAIK it will be a long long time before the hot-spare work gets merged. So I'll update that patch and hope it can solve the problem. Thanks, Qu
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
[1805985.267438] BTRFS info (device dm-6): allowing degraded mounts [1805985.267566] BTRFS info (device dm-6): disk space caching is enabled [1805985.267676] BTRFS info (device dm-6): has skinny extents [1805987.187857] BTRFS warning (device dm-6): missing devices (1) exceeds the limit (0), writeable mount is not allowed [1805987.228990] BTRFS error (device dm-6): open_ctree failed [chris@f25s ~]$ sudo mount -o noatime,degraded,ro /dev/mapper/sdb /mnt [chris@f25s ~]$ sudo btrfs fi df /mnt Data, RAID1: total=434.00GiB, used=432.46GiB Data, single: total=1.00GiB, used=1.66MiB System, RAID1: total=8.00MiB, used=48.00KiB System, single: total=32.00MiB, used=32.00KiB Metadata, RAID1: total=2.00GiB, used=729.17MiB Metadata, single: total=1.00GiB, used=0.00B GlobalReserve, single: total=495.02MiB, used=0.00B [chris@f25s ~]$

So the sequence is: 1. mkfs.btrfs -d raid1 -m raid1 2. fill it with a bunch of data over a few months, always mounted normally with default options 3. physically remove 1 of 2 devices, and do a degraded mount. This mounts without error, and more stuff is added. Volume is umounted. 4. Try to mount the same 1 of 2 devices, with degraded mount option, and I get the first error, "writeable mount is not allowed". 5. Try to mount the same 1 of 2 devices, with degraded,ro option, and it mounts, and then I captured the 'btrfs fi df' above.

So very clearly there are single chunks added during the degraded rw mount. But does 1.66MiB of data in that single data chunk make sense? And does 0.00 MiB of metadata in that single metadata chunk make sense? I'm not sure, seems unlikely. Most of what happened in that subvolume since the previous snapshot was moving things around, reorganizing, not adding files. So, maybe 1.66MiB data added is possible? But definitely the metadata changes must be in the raid1 chunks, while the newly created single profile metadata chunk is left unused. So I think there's more than one bug going on here, separate problems for data and metadata.
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
> [ ... ] Meanwhile, the problem as I understand it is that at > the first raid1 degraded writable mount, no single-mode chunks > exist, but without the second device, they are created. [ ... ]

That does not make any sense, unless there is a fundamental mistake in the design of the 'raid1' profile, which this and other situations make me think is a possibility: that the category of "mirrored" 'raid1' chunk does not exist in the Btrfs chunk manager. That is, a chunk is either 'raid1' if it has a mirror, or if it has no mirror it must be 'single'. If a member device of a 'raid1' profile multidevice volume disappears there will be "unmirrored" 'raid1' profile chunks and some code path must recognize them as such, but the logic of the code does not allow their creation. Question: how does the code know that a specific 'raid1' chunk is mirrored or not? The chunk must have a link (member, offset) to its mirror, do they?

What makes me think that "unmirrored" 'raid1' profile chunks are "not a thing" is that it is impossible to remove explicitly a member device from a 'raid1' profile volume: first one has to 'convert' to 'single', and then the 'remove' copies back to the remaining devices the 'single' chunks that are on the explicitly 'remove'd device. Which to me seems absurd.

Going further in my speculation, I suspect that at the core of the Btrfs multidevice design there is a persistent "confusion" (to use a euphemism) between volumes having a profile, and merely chunks having a profile. My additional guess is that the original design concept had multidevice volumes to be merely containers for chunks of whichever mixed profiles, so a subvolume could have 'raid1' profile metadata and 'raid0' profile data, and another could have 'raid10' profile metadata and data, but since handling this turned out to be too hard, this was compromised into volumes having all metadata chunks with the same profile and all data chunks with the same profile, which requires special-case handling of corner cases, like volumes being converted or missing member devices. So in the case of 'raid1', a volume with say a 'raid1' data profile should have all-'raid1' and fully mirrored profile chunks, and the lack of a member device fails that aim in two ways.
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
On 2017-03-02 12:26, Andrei Borzenkov wrote:
> On 02.03.2017 16:41, Duncan wrote:
>> Chris Murphy posted on Wed, 01 Mar 2017 17:30:37 -0700 as excerpted:
>>>
>>> [1717713.408675] BTRFS warning (device dm-8): missing devices (1)
>>> exceeds the limit (0), writeable mount is not allowed
>>> [1717713.446453] BTRFS error (device dm-8): open_ctree failed
>>>
>>> [chris@f25s ~]$ uname -r
>>> 4.9.8-200.fc25.x86_64
>>>
>>> I thought this was fixed. I'm still getting a one time degraded rw
>>> mount, after that it's no longer allowed, which really doesn't make any
>>> sense because those single chunks are on the drive I'm trying to mount.
>>> I don't understand what problem this proscription is trying to avoid. If
>>> it's OK to mount rw,degraded once, then it's OK to allow it twice. If
>>> it's not OK twice, it's not OK once.
>>
>> AFAIK, no, it hasn't been fixed, at least not in mainline, because the
>> patches to fix it got stuck in some long-running project patch queue
>> (IIRC, the one for on-degraded auto-device-replace), with no timeline
>> known to me on mainline merge.
>>
>> Meanwhile, the problem as I understand it is that at the first raid1
>> degraded writable mount, no single-mode chunks exist, but without the
>> second device, they are created.
>
> Isn't that the root cause? I would expect it to create degraded mirrored
> chunks that will be synchronized when the second device is added back.

That's exactly what it should be doing, and AFAIK what the correct fix for this should be, but in the interim just relaxing the degraded check to be per-chunk makes things usable, and is arguably how it should have been to begin with.

>> (It's not clear to me whether they are created with the first write, that
>> is, ignoring any space in existing degraded raid1 chunks, or if that's
>> used up first and the single-mode chunks only created later, when a new
>> chunk must be allocated to continue writing as the old ones are full.)
>>
>> So the first degraded-writable mount is allowed, because no single-mode
>> chunks yet exist, while after such single-mode chunks are created, the
>> existing dumb algorithm won't allow further writable mounts, because it
>> sees single-mode chunks on a multi-device filesystem, and never mind that
>> all the single mode chunks are there, it simply doesn't check that and
>> won't allow writable mount because some /might/ be on the missing device.
>>
>> The patches stuck in queue would make btrfs more intelligent about that,
>> having it check each chunk as listed in the chunk tree, and if at least
>> one copy is available (as would be the case for single-mode chunks
>> created after the degraded mount), writable mount would still be
>> allowed. But... that's stuck in a long running project queue with no
>> known timetable for merging... so the only way to get it is to go find
>> and merge them yourself, in your own build.
>
> Will it replicate single mode chunks when the second device is added?

Not automatically; you would need to convert them to raid1 (or whatever other profile). Even with the patch, this would still be needed, but at least it would (technically) work sanely.

On that note, on most of my systems I have a startup script that calls balance with the appropriate convert flags and the soft flag for every fixed (non-removable) BTRFS volume on the system, to clean up after this. The actual balance call takes no time at all unless there are actually chunks to convert, so it normally has very little impact on boot times.
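A minimal sketch of the kind of startup script described above, assuming the target profile is raid1 for both data and metadata; it does not attempt the removable-device filtering mentioned, and it discovers mount points with findmnt. With the ',soft' filter, balance skips chunks that already have the target profile, so on a healthy volume it is essentially a no-op:

#!/bin/sh
# Convert any stray 'single' chunks left over from a degraded mount back to
# raid1 on every mounted btrfs filesystem.  The ',soft' filter skips chunks
# that already match the target profile, so this is cheap when there is
# nothing to fix (and harmless if the same filesystem is mounted at several
# places via subvolumes, since later runs find nothing left to convert).
findmnt -t btrfs -n -o TARGET --list | sort -u | while read -r mnt; do
    btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft "$mnt"
done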
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
On 02.03.2017 16:41, Duncan wrote:
> Chris Murphy posted on Wed, 01 Mar 2017 17:30:37 -0700 as excerpted:
>
>> [1717713.408675] BTRFS warning (device dm-8): missing devices (1)
>> exceeds the limit (0), writeable mount is not allowed
>> [1717713.446453] BTRFS error (device dm-8): open_ctree failed
>>
>> [chris@f25s ~]$ uname -r
>> 4.9.8-200.fc25.x86_64
>>
>> I thought this was fixed. I'm still getting a one time degraded rw
>> mount, after that it's no longer allowed, which really doesn't make any
>> sense because those single chunks are on the drive I'm trying to mount.
>> I don't understand what problem this proscription is trying to avoid. If
>> it's OK to mount rw,degraded once, then it's OK to allow it twice. If
>> it's not OK twice, it's not OK once.
>
> AFAIK, no, it hasn't been fixed, at least not in mainline, because the
> patches to fix it got stuck in some long-running project patch queue
> (IIRC, the one for on-degraded auto-device-replace), with no timeline
> known to me on mainline merge.
>
> Meanwhile, the problem as I understand it is that at the first raid1
> degraded writable mount, no single-mode chunks exist, but without the
> second device, they are created.

Isn't that the root cause? I would expect it to create degraded mirrored chunks that will be synchronized when the second device is added back.

> (It's not clear to me whether they are created with the first write, that
> is, ignoring any space in existing degraded raid1 chunks, or if that's
> used up first and the single-mode chunks only created later, when a new
> chunk must be allocated to continue writing as the old ones are full.)
>
> So the first degraded-writable mount is allowed, because no single-mode
> chunks yet exist, while after such single-mode chunks are created, the
> existing dumb algorithm won't allow further writable mounts, because it
> sees single-mode chunks on a multi-device filesystem, and never mind that
> all the single mode chunks are there, it simply doesn't check that and
> won't allow writable mount because some /might/ be on the missing device.
>
> The patches stuck in queue would make btrfs more intelligent about that,
> having it check each chunk as listed in the chunk tree, and if at least
> one copy is available (as would be the case for single-mode chunks
> created after the degraded mount), writable mount would still be
> allowed. But... that's stuck in a long running project queue with no
> known timetable for merging... so the only way to get it is to go find
> and merge them yourself, in your own build.

Will it replicate single mode chunks when the second device is added?
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
Chris Murphy posted on Wed, 01 Mar 2017 17:30:37 -0700 as excerpted:

> [1717713.408675] BTRFS warning (device dm-8): missing devices (1)
> exceeds the limit (0), writeable mount is not allowed
> [1717713.446453] BTRFS error (device dm-8): open_ctree failed
>
> [chris@f25s ~]$ uname -r
> 4.9.8-200.fc25.x86_64
>
> I thought this was fixed. I'm still getting a one time degraded rw
> mount, after that it's no longer allowed, which really doesn't make any
> sense because those single chunks are on the drive I'm trying to mount.
> I don't understand what problem this proscription is trying to avoid. If
> it's OK to mount rw,degraded once, then it's OK to allow it twice. If
> it's not OK twice, it's not OK once.

AFAIK, no, it hasn't been fixed, at least not in mainline, because the patches to fix it got stuck in some long-running project patch queue (IIRC, the one for on-degraded auto-device-replace), with no timeline known to me on mainline merge.

Meanwhile, the problem as I understand it is that at the first raid1 degraded writable mount, no single-mode chunks exist, but without the second device, they are created. (It's not clear to me whether they are created with the first write, that is, ignoring any space in existing degraded raid1 chunks, or if that's used up first and the single-mode chunks only created later, when a new chunk must be allocated to continue writing as the old ones are full.)

So the first degraded-writable mount is allowed, because no single-mode chunks yet exist, while after such single-mode chunks are created, the existing dumb algorithm won't allow further writable mounts, because it sees single-mode chunks on a multi-device filesystem, and never mind that all the single mode chunks are there, it simply doesn't check that and won't allow writable mount because some /might/ be on the missing device.

The patches stuck in queue would make btrfs more intelligent about that, having it check each chunk as listed in the chunk tree, and if at least one copy is available (as would be the case for single-mode chunks created after the degraded mount), writable mount would still be allowed. But... that's stuck in a long running project queue with no known timetable for merging... so the only way to get it is to go find and merge them yourself, in your own build.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
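For anyone actually going the "merge it yourself" route mentioned above, the mechanics are the ordinary out-of-tree patch workflow sketched below; the patch file name and the exact tag are placeholders (the real patch would have to be fetched from the list archive), and the build/install steps depend on the distribution:

# Generic out-of-tree patch workflow (a sketch; per-chunk-degraded.mbox is a
# placeholder name, not a real file).
git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
cd linux-stable
git checkout v4.9.8                       # roughly match the running kernel
git am ../per-chunk-degraded.mbox         # or: patch -p1 < ../patch.diff
cp /boot/config-"$(uname -r)" .config     # start from the distro config
make olddefconfig
make -j"$(nproc)"
sudo make modules_install install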
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
On Wed, Mar 01, 2017 at 05:30:37PM -0700, Chris Murphy wrote:
> [1717713.408675] BTRFS warning (device dm-8): missing devices (1)
> exceeds the limit (0), writeable mount is not allowed
> [1717713.446453] BTRFS error (device dm-8): open_ctree failed
>
> [chris@f25s ~]$ uname -r
> 4.9.8-200.fc25.x86_64
>
> I thought this was fixed. I'm still getting a one time degraded rw
> mount, after that it's no longer allowed, which really doesn't make
> any sense because those single chunks are on the drive I'm trying to
> mount.

Well, there's Qu's patch at:
https://www.spinics.net/lists/linux-btrfs/msg47283.html
but it doesn't apply cleanly, nor is it easy to rebase onto current kernels.

> I don't understand what problem this proscription is trying to
> avoid. If it's OK to mount rw,degraded once, then it's OK to allow it
> twice. If it's not OK twice, it's not OK once.

Well, yeah. The current check is naive and wrong. It does have a purpose; it just fails in this very common case.

For people needing to recover their filesystem at this moment there's
https://www.spinics.net/lists/linux-btrfs/msg62473.html
but it removes the protection you still want for other cases.

This problem pops up way too often, so I guess that if not the devs, then at least we in the peanut gallery should do the work of reviving the real solution. Obviously, I for one am shortish on tuits at the moment...

--
⢀⣴⠾⠻⢶⣦⠀ Meow!
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second
⠈⠳⣄ preimage for double rot13!