Re: Metadata balance fails ENOSPC
On 01.12.2016 at 00:02, Chris Murphy wrote:
> On Wed, Nov 30, 2016 at 2:03 PM, Stefan Priebe - Profihost AG wrote:
>> Hello,
>>
>> # btrfs balance start -v -dusage=0 -musage=1 /ssddisk/
>> Dumping filters: flags 0x7, state 0x0, force is off
>> DATA (flags 0x2): balancing, usage=0
>> METADATA (flags 0x2): balancing, usage=1
>> SYSTEM (flags 0x2): balancing, usage=1
>> ERROR: error during balancing '/ssddisk/': No space left on device
>> There may be more info in syslog - try dmesg | tail
>
> You haven't provided kernel messages at the time of the error.

Kernel message:
[  429.107723] BTRFS info (device vdb1): 1 enospc errors during balance

> Also useful is the kernel version.

Custom 4.4 kernel with patches up to 4.10. But I already tried 4.9-rc7, which does the same.

>> # btrfs filesystem show /ssddisk/
>> Label: none  uuid: a69d2e90-c2ca-4589-9876-234446868adc
>> Total devices 1  FS bytes used 305.67GiB
>> devid 1 size 500.00GiB used 500.00GiB path /dev/vdb1
>>
>> # btrfs filesystem usage /ssddisk/
>> Overall:
>>     Device size:        500.00GiB
>>     Device allocated:   500.00GiB
>>     Device unallocated:   1.05MiB
>
> Drive is actually fully allocated so if Btrfs needs to create a new
> chunk right now, it can't. However,

Yes, but there's a lot of free space:

Free (estimated): 193.46GiB (min: 193.46GiB)

How does this match?

> All three chunk types have quite a bit of unused space in them, so
> it's unclear why there's a no space left error.
>
> Try remounting with enospc_debug, and then trigger the problem again,
> and post the resulting kernel messages.

With enospc_debug it says:

[39193.425682] BTRFS warning (device vdb1): no space to allocate a new chunk for block group 839941881856
[39193.426033] BTRFS info (device vdb1): 1 enospc errors during balance

Greets,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
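Stefan's question above, how there can be nearly 200 GiB "free" when the device is 100% allocated, comes down to what `btrfs filesystem usage` counts: "Free (estimated)" includes unused space *inside* already-allocated data chunks, while balance needs *unallocated* device space to create the new chunk it relocates data into. A rough sketch of the arithmetic, using the figures from the original report in this thread (the formula is a simplification of what btrfs-progs computes, and the minimum-chunk-size figure is an assumption):

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# Figures from the "btrfs filesystem usage /ssddisk/" output in this thread:
unallocated = 1.05 * MIB     # raw device space not assigned to any chunk
data_size   = 483.97 * GIB   # space allocated to data chunks
data_used   = 298.18 * GIB   # data actually stored in those chunks

# "Free (estimated)" counts unused space inside data chunks plus unallocated space:
free_estimated = (data_size - data_used) + unallocated
print(round(free_estimated / GIB, 2))  # ~185.79, matching "Free (estimated): 185.78GiB"

# Balance, however, must allocate a brand-new chunk to relocate data into,
# and even the smallest chunk is far larger than 1.05 MiB (assumption: >= 16 MiB):
print(unallocated >= 16 * MIB)  # False, hence "no space to allocate a new chunk"
```

So the two numbers measure different things: plenty of room to *write data* into existing chunks, but no room to *allocate a chunk*, which is exactly what balance tries to do first.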
Re: Convert from RAID 5 to 10
Austin S. Hemmelgarn posted on Wed, 30 Nov 2016 11:48:57 -0500 as excerpted:

> On 2016-11-30 10:49, Wilson Meier wrote:
>> Do you also have all home users in mind, which go to vacation (sometimes
>> > 3 weeks) and don't have a 24/7 support team to replace monitored disks
>> which do report SMART errors?
>
> Better than 90% of people I know either shut down their systems when
> they're going to be away for a long period of time, or like me have ways
> to log in remotely and tell the FS to not use that disk anymore.

https://btrfs.wiki.kernel.org/index.php/Getting_started ...

... has two warnings offset in red right in the first section:

* If you have btrfs filesystems, run the latest kernel.
* You should keep and test backups of your data, and be prepared to use them.

It also says:

The status of btrfs was experimental for a long time, but the core functionality is considered good enough for daily use. [...] While many people use it reliably, there are still problems being found.

Were I editing, that or something very similar would be on the main landing page and as a general status announcement on the feature and profile status page. However, it IS on the wiki.

As to the three weeks vacation thing... "daily use" != "three weeks without physical access to something you're going to actually be relying on for parts of those three weeks". And "keep and test backups [and] be prepared to use them" != "go away for three weeks and leave yourself unable to restore from those backups, for something you're relying on over those three weeks", either.

As Austin says, many home users actually shut down their systems when they're going to be away, because they are /not/ going to be using them in that period, and *certainly* *don't* actually /rely/ on them.
And most of those that /do/ actually rely on them have learned, or will learn, possibly the hard way, that "things happen", and that they need either someone who can be called to poke the systems if necessary, or alternative plans in case what they can't access ATM fails.

Meanwhile, arguably those who /are/ relying on their filesystems to be up and running for extended periods while they can't actually poke (or have someone else poke) the hardware if necessary, shouldn't be running btrfs as yet in the first place, as it's simply not stable and mature enough for that. And people who really care about it will have done the research to know the stability status. And people who don't... well, by not doing that research they've effectively defined it as not that important in their life; other things have taken priority. So if btrfs fails on them and they didn't know its stability status, it can only be because it wasn't that important to them to know, so no big deal.

(I know for certain that before /I/ switched to btrfs, I scoured both the wiki and the manpages, as well as reading a number of articles on btrfs, and then still posted to this list a number of questions I had remaining after doing all that, and got answers I read as well, before I actually did my switch. That's because it was my data at risk, data I place a high enough value on to want to know the risk at which I was placing it and the best way to deal with various issues I could anticipate possibly happening, before they actually happened. And I actually did some of my own testing before final deployment as well, satisfying myself that I /could/ reasonably deal with various hardware and software disaster scenarios, before putting any real data at risk. Of course I don't expect everyone to do all that, but then I don't expect everyone to place the value in their data that I do in mine.
Which is fine, as long as they're willing to live with the consequences of the priority they placed on appreciating and dealing appropriately with the risk factor on their data, based on the value their actions placed on it. If they're willing to risk the data because it's of no particular value to them anyway, well then, no such preliminary research and testing is required. Indeed, it would be stupid, because they surely have more important and higher priority things to deal with.)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
[PATCH 2/2] btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quota enable
[BUG]
Under the following case, we can underflow qgroup reserved space:

Task A (buffered write)             | Task B (quota)
------------------------------------+---------------------------
Quota disabled                      |
Buffered write                      |
|- btrfs_check_data_free_space()    |
|  *NO* qgroup space is reserved    |
|  since quota is *DISABLED*        |
|- All pages are copied to page     |
|  cache                            |
                                    | Enable quota
                                    | Quota scan finished
                                    |
Sync_fs                             |
|- run_delalloc_range               |
   |- Write pages                   |
      |- btrfs_finish_ordered_io    |
         |- insert_reserved_file_extent
            |- btrfs_qgroup_release_data()

Since no qgroup space was reserved in Task A, we underflow qgroup reserved space.
This can be detected by fstest btrfs/104.

[CAUSE]
In insert_reserved_file_extent() we inform qgroup to release the @ram_bytes size of qgroup reserved_space in all cases, and btrfs_qgroup_release_data() will check if qgroup is enabled. However, in the above case the buffered write happens before quota is enabled, so we don't have reserved space for that range.

[FIX]
In insert_reserved_file_extent(), inform qgroup to release the actual byte number it released. In the above case, since we have no reserved space, we inform qgroup to release 0 bytes, so the problem is fixed.

Thanks to the @reserved parameter introduced by the qgroup rework, and the previous patch to return the released bytes, the fix can be as small as fewer than 10 lines.

Signed-off-by: Qu Wenruo
---
To David:
These 2 patches, with the updated extra WARN_ON() patch (V5), make btrfs on x86_64 good for the current qgroup test group.

But the bug reported by Chandan still exists, and the fix for that will be delayed for a while, as the fix involves a large interface change (adding struct extent_changeset for every qgroup reserve caller).

I'll run extra tests for that fix to ensure they are OK and won't cause extra bugs.
Thanks,
Qu
---
 fs/btrfs/inode.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 242dc7e..3db58d9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2256,6 +2256,7 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_path *path;
 	struct extent_buffer *leaf;
 	struct btrfs_key ins;
+	u64 qg_released;
 	int extent_inserted = 0;
 	int ret;
@@ -2311,15 +2312,19 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	ins.objectid = disk_bytenr;
 	ins.offset = disk_num_bytes;
 	ins.type = BTRFS_EXTENT_ITEM_KEY;
-	ret = btrfs_alloc_reserved_file_extent(trans, root,
-					root->root_key.objectid,
-					btrfs_ino(inode), file_pos,
-					ram_bytes, );
+
 	/*
 	 * Release the reserved range from inode dirty range map, as it is
 	 * already moved into delayed_ref_head
 	 */
-	btrfs_qgroup_release_data(inode, file_pos, ram_bytes);
+	ret = btrfs_qgroup_release_data(inode, file_pos, ram_bytes);
+	if (ret < 0)
+		goto out;
+	qg_released = ret;
+	ret = btrfs_alloc_reserved_file_extent(trans, root,
+					root->root_key.objectid,
+					btrfs_ino(inode), file_pos,
+					qg_released, );
 out:
 	btrfs_free_path(path);
--
2.10.2
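The reserve/release mismatch described in the [BUG] section above is easy to model in miniature: the reserve side is a no-op while quota is disabled, but the old release side always subtracted @ram_bytes, wrapping the unsigned counter. A toy model of that (pure illustration; the class and method names are hypothetical and not the kernel's):

```python
U64_MAX = 2**64 - 1

class QgroupToy:
    """Toy model of a per-qgroup reserved-space counter (a u64 in the kernel)."""
    def __init__(self):
        self.quota_enabled = False
        self.reserved = 0

    def reserve(self, nbytes):
        if self.quota_enabled:      # quota disabled -> nothing gets reserved
            self.reserved += nbytes

    def release_buggy(self, ram_bytes):
        # Old behaviour: always release the full range size, wrapping like a u64.
        self.reserved = (self.reserved - ram_bytes) & U64_MAX

    def release_fixed(self, actually_released):
        # Patched behaviour: release only what was actually recorded as reserved.
        self.reserved -= actually_released

qg = QgroupToy()
qg.reserve(4096)         # buffered write while quota is *disabled*: no-op
qg.quota_enabled = True  # quota enabled afterwards
qg.release_buggy(4096)   # ordered-io completion releases ram_bytes anyway
print(qg.reserved)       # 18446744073709547520, i.e. the u64 underflow

qg2 = QgroupToy()
qg2.reserve(4096)
qg2.quota_enabled = True
qg2.release_fixed(0)     # nothing was reserved, so release 0 bytes
print(qg2.reserved)      # 0
```

The patch implements the second variant: btrfs_qgroup_release_data() reports how many bytes it actually found reserved, and that number (possibly 0) is what gets propagated onward.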
[PATCH 1/2] btrfs: qgroup: Return actually freed bytes for qgroup release or free data
btrfs_qgroup_release/free_data() only returns 0 or a negative error number (ENOMEM is the only possible error). This is normally good enough, but sometimes we need the accurate byte number it freed/released.

Change it to return the actually released/freed byte number instead of 0 on success. And slightly modify the related extent_changeset structure, since in btrfs one non-hole data extent won't be larger than 128M, so "unsigned int" is large enough for the use case.

Signed-off-by: Qu Wenruo
---
 fs/btrfs/extent-tree.c | 2 +-
 fs/btrfs/extent_io.h   | 2 +-
 fs/btrfs/qgroup.c      | 1 +
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ac3ae27..dae287d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4318,7 +4318,7 @@ int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len)
 	/* Use new btrfs_qgroup_reserve_data to reserve precious data space. */
 	ret = btrfs_qgroup_reserve_data(inode, start, len);
-	if (ret)
+	if (ret < 0)
 		btrfs_free_reserved_data_space_noquota(inode, start, len);
 	return ret;
 }
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 8df24c6..13edb86 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -190,7 +190,7 @@ struct extent_buffer {
  */
 struct extent_changeset {
 	/* How many bytes are set/cleared in this operation */
-	u64 bytes_changed;
+	unsigned int bytes_changed;

 	/* Changed ranges */
 	struct ulist *range_changed;
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 1ad3be8..7263065 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2873,6 +2873,7 @@ static int __btrfs_qgroup_release_data(struct inode *inode, u64 start, u64 len,
 	}
 	trace_btrfs_qgroup_release_data(inode, start, len,
 					changeset.bytes_changed, trace_op);
+	ret = changeset.bytes_changed;
 out:
 	ulist_free(changeset.range_changed);
 	return ret;
--
2.10.2
Re: [PATCH] Btrfs: fix infinite loop when tree log recovery
Hi Filipe,

Thank you for your review. I have seen your modified change logs below:

Btrfs: fix tree search logic when replaying directory entry deletes
Btrfs: fix deadlock caused by fsync when logging directory entries
Btrfs: fix enospc in hole punching

So what's the next step? Modify the patch change log and then send again?

Thanks.
robbieko

Filipe Manana wrote on 2016-12-01 00:53:
> On Fri, Oct 7, 2016 at 10:30 AM, robbieko wrote:
>> From: Robbie Ko
>>
>> If the log tree looks like below:
>>
>> leaf N:
>>   ...
>>   item 240 key (282 DIR_LOG_ITEM 0) itemoff 8189 itemsize 8
>>   dir log end 1275809046
>> leaf N+1:
>>   item 0 key (282 DIR_LOG_ITEM 3936149215) itemoff 16275 itemsize 8
>>   dir log end 18446744073709551615
>>   ...
>>
>> then when start_ret > 1275809046, slot[0] is never >= nritems, so we
>> never go to the next leaf.
>
> This doesn't explain how the infinite loop happens, nor exactly how any
> problem happens. It's important to have detailed information in the
> change logs. I understand that English isn't your native tongue (it's
> not mine either, and I'm far from mastering it), but that's not an
> excuse to not express all the important information in detail (we can
> all live with grammar errors and typos, and we all make such errors
> frequently).
>
> I've added this patch to my branch at
> https://git.kernel.org/cgit/linux/kernel/git/fdmanana/linux.git/log/?h=for-chris-4.10
> but with a modified changelog and subject. The results of the wrong
> logic that decides when to move to the next leaf are unpredictable, and
> it won't always result in an infinite loop. We are accessing a slot
> that doesn't point to an item, a memory location containing garbage or
> something unexpected, and in the worst case that location is beyond the
> last page of the extent buffer.
>
> Thanks.
>> Signed-off-by: Robbie Ko
>> ---
>>  fs/btrfs/tree-log.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
>> index ef9c55b..e63dd99 100644
>> --- a/fs/btrfs/tree-log.c
>> +++ b/fs/btrfs/tree-log.c
>> @@ -1940,12 +1940,11 @@ static noinline int find_dir_range(struct btrfs_root *root,
>>  next:
>>  	/* check the next slot in the tree to see if it is a valid item */
>>  	nritems = btrfs_header_nritems(path->nodes[0]);
>> +	path->slots[0]++;
>>  	if (path->slots[0] >= nritems) {
>>  		ret = btrfs_next_leaf(root, path);
>>  		if (ret)
>>  			goto out;
>> -	} else {
>> -		path->slots[0]++;
>>  	}
>>
>>  	btrfs_item_key_to_cpu(path->nodes[0], , path->slots[0]);
>> --
>> 1.9.1
Re: [PATCH v2 02/14] btrfs-progs: check: introduce function to find dir_item
At 12/01/2016 12:34 AM, David Sterba wrote:
> On Wed, Nov 16, 2016 at 10:27:59AM +0800, Qu Wenruo wrote:
>>> Yes please. Third namespace for existing error bits is not a good
>>> option. Move the I_ERR bits to start from 32 and use them in the
>>> low-mem code that's been merged to devel.
>>
>> I didn't see such a fix in the devel branch.
>
> Well, that's because nobody implemented it, and I was not intending to
> do it myself as it's a followup to your lowmem patchset in devel.

OK, I'm going to fix it soon.

Thanks,
Qu
Re: 4.8.8, bcache deadlock and hard lockup
On Wed, Nov 30, 2016 at 4:57 PM, Eric Wheeler wrote:
> On Wed, 30 Nov 2016, Marc MERLIN wrote:
>> +btrfs mailing list, see below why
>>
>> On Tue, Nov 29, 2016 at 12:59:44PM -0800, Eric Wheeler wrote:
>> > On Mon, 27 Nov 2016, Coly Li wrote:
>> > >
>> > > Yes, too many work queues... I guess the locking might be caused by some
>> > > very obscure reference of closure code. I cannot have any clue if I
>> > > cannot find a stable procedure to reproduce this issue.
>> > >
>> > > Hmm, if there is a tool to clone all the metadata of the back end cache
>> > > and whole cached device, there might be a method to replay the oops much
>> > > easier.
>> > >
>> > > Eric, do you have any hint?
>> >
>> > Note that the backing device doesn't have any metadata, just a superblock.
>> > You can easily dd that off onto some other volume without transferring the
>> > data. By default, data starts at 8k, or whatever you used in `make-bcache -w`.
>>
>> Ok, Linus helped me find a workaround for this problem:
>> https://lkml.org/lkml/2016/11/29/667
>> namely:
>>   echo 2 > /proc/sys/vm/dirty_ratio
>>   echo 1 > /proc/sys/vm/dirty_background_ratio
>> (it's a 24GB system, so the defaults of 20 and 10 were creating too many
>> requests in the buffers)
>>
>> Note that this is only a workaround, not a fix.
>>
>> When I did this and retried my big copy, I still got 100+ kernel
>> work queues, but apparently the underlying swraid5 was able to unblock
>> and satisfy the write requests before too many accumulated and crashed
>> the kernel.
>>
>> I'm not a kernel coder, but it seems to me that bcache needs a way to
>> throttle incoming requests if there are too many, so that it does not end
>> up in a state where things blow up due to too many piled up requests.
>>
>> You should be able to reproduce this by taking 5 spinning rust drives,
>> putting raid5 on top, then dmcrypt, bcache and hopefully any filesystem
>> (although I used btrfs), and sending lots of requests.
>> Actually to be honest, the problems have mostly been happening when I do
>> btrfs scrub and btrfs send/receive, which both generate I/O from within
>> the kernel instead of user space.
>> So here, btrfs may be a contributor to the problem too, but while btrfs
>> still trashes my system if I remove the caching device on bcache (and
>> with the default dirty ratio values), it doesn't crash the kernel.
>>
>> I'll start another separate thread with the btrfs folks on how much
>> pressure is put on the system, but on your side it would be good to help
>> ensure that bcache doesn't crash the system altogether if too many
>> requests are allowed to pile up.
>
> Try BFQ. It is AWESOME and helps reduce the congestion problem with bulk
> writes at the request queue on its way to the spinning disk or SSD:
> http://algo.ing.unimo.it/people/paolo/disk_sched/
>
> Use the latest BFQ git here, and merge it into v4.8.y:
> https://github.com/linusw/linux-bfq/commits/bfq-v8
>
> This doesn't completely fix the dirty_ratio problem, but it is far better
> than CFQ or deadline in my opinion (and experience).

There are several threads over the past year with users having problems no one else had previously reported, and they were using BFQ. But there's no evidence whether BFQ was the cause, or was exposing some existing bug that another scheduler doesn't. Anyway, I'd say using an out-of-tree scheduler means a higher burden of testing and skepticism.

--
Chris Murphy
Re: 4.8.8, bcache deadlock and hard lockup
On Wed, Nov 30, 2016 at 03:57:28PM -0800, Eric Wheeler wrote:
> > I'll start another separate thread with the btrfs folks on how much
> > pressure is put on the system, but on your side it would be good to help
> > ensure that bcache doesn't crash the system altogether if too many
> > requests are allowed to pile up.
>
> Try BFQ. It is AWESOME and helps reduce the congestion problem with bulk
> writes at the request queue on its way to the spinning disk or SSD:
> http://algo.ing.unimo.it/people/paolo/disk_sched/
>
> use the latest BFQ git here, merge it into v4.8.y:
> https://github.com/linusw/linux-bfq/commits/bfq-v8
>
> This doesn't completely fix the dirty_ratio problem, but it is far better
> than CFQ or deadline in my opinion (and experience).

That's good to know, thanks. But in my uninformed opinion: is there anything bcache can do to throttle incoming requests if they are piling up, or are they coming from producers upstream, so that bcache has no choice but to try and process them as quickly as possible, without a way to block the sender if too many are coming?

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
Re: 4.8.8, bcache deadlock and hard lockup
On Wed, 30 Nov 2016, Marc MERLIN wrote:
> +btrfs mailing list, see below why
>
> On Tue, Nov 29, 2016 at 12:59:44PM -0800, Eric Wheeler wrote:
> > On Mon, 27 Nov 2016, Coly Li wrote:
> > >
> > > Yes, too many work queues... I guess the locking might be caused by some
> > > very obscure reference of closure code. I cannot have any clue if I
> > > cannot find a stable procedure to reproduce this issue.
> > >
> > > Hmm, if there is a tool to clone all the metadata of the back end cache
> > > and whole cached device, there might be a method to replay the oops much
> > > easier.
> > >
> > > Eric, do you have any hint?
> >
> > Note that the backing device doesn't have any metadata, just a superblock.
> > You can easily dd that off onto some other volume without transferring the
> > data. By default, data starts at 8k, or whatever you used in `make-bcache -w`.
>
> Ok, Linus helped me find a workaround for this problem:
> https://lkml.org/lkml/2016/11/29/667
> namely:
>   echo 2 > /proc/sys/vm/dirty_ratio
>   echo 1 > /proc/sys/vm/dirty_background_ratio
> (it's a 24GB system, so the defaults of 20 and 10 were creating too many
> requests in the buffers)
>
> Note that this is only a workaround, not a fix.
>
> When I did this and retried my big copy, I still got 100+ kernel
> work queues, but apparently the underlying swraid5 was able to unblock
> and satisfy the write requests before too many accumulated and crashed
> the kernel.
>
> I'm not a kernel coder, but it seems to me that bcache needs a way to
> throttle incoming requests if there are too many, so that it does not end
> up in a state where things blow up due to too many piled up requests.
>
> You should be able to reproduce this by taking 5 spinning rust drives,
> putting raid5 on top, then dmcrypt, bcache and hopefully any filesystem
> (although I used btrfs), and sending lots of requests.
> Actually to be honest, the problems have mostly been happening when I do
> btrfs scrub and btrfs send/receive, which both generate I/O from within
> the kernel instead of user space.
> So here, btrfs may be a contributor to the problem too, but while btrfs
> still trashes my system if I remove the caching device on bcache (and
> with the default dirty ratio values), it doesn't crash the kernel.
>
> I'll start another separate thread with the btrfs folks on how much
> pressure is put on the system, but on your side it would be good to help
> ensure that bcache doesn't crash the system altogether if too many
> requests are allowed to pile up.

Try BFQ. It is AWESOME and helps reduce the congestion problem with bulk writes at the request queue on its way to the spinning disk or SSD:
http://algo.ing.unimo.it/people/paolo/disk_sched/

Use the latest BFQ git here, and merge it into v4.8.y:
https://github.com/linusw/linux-bfq/commits/bfq-v8

This doesn't completely fix the dirty_ratio problem, but it is far better than CFQ or deadline in my opinion (and experience).

-Eric

--
Eric Wheeler

> Thanks,
> Marc
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
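The dirty_ratio workaround quoted above shrinks how much dirty page cache the kernel will accumulate before throttling writers; on Marc's 24 GB machine the difference is substantial. A back-of-the-envelope sketch (assuming the ratio applies to total RAM, which ignores the kernel's more precise "dirtyable memory" accounting):

```python
GIB = 1024 ** 3
ram = 24 * GIB  # the 24GB system from the thread

def dirty_threshold(ratio_percent, total=ram):
    """Rough upper bound on dirty page cache before writers get throttled."""
    return total * ratio_percent // 100

default_ratio = dirty_threshold(20)  # vm.dirty_ratio default
tuned_ratio   = dirty_threshold(2)   # the workaround from the thread

print(default_ratio / GIB)  # ~4.8 GiB of dirty data allowed by default
print(tuned_ratio / GIB)    # ~0.48 GiB with the workaround
```

Nearly 5 GiB of buffered writes queued up behind a slow dmcrypt-over-raid5 stack is consistent with the pile-up of work queues described above; capping it at roughly half a gigabyte gives the backing array a chance to drain before the backlog becomes fatal.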
Re: Metadata balance fails ENOSPC
On Wed, Nov 30, 2016 at 2:03 PM, Stefan Priebe - Profihost AG wrote:
> Hello,
>
> # btrfs balance start -v -dusage=0 -musage=1 /ssddisk/
> Dumping filters: flags 0x7, state 0x0, force is off
> DATA (flags 0x2): balancing, usage=0
> METADATA (flags 0x2): balancing, usage=1
> SYSTEM (flags 0x2): balancing, usage=1
> ERROR: error during balancing '/ssddisk/': No space left on device
> There may be more info in syslog - try dmesg | tail

You haven't provided kernel messages at the time of the error. Also useful is the kernel version.

> # btrfs filesystem show /ssddisk/
> Label: none  uuid: a69d2e90-c2ca-4589-9876-234446868adc
> Total devices 1  FS bytes used 305.67GiB
> devid 1 size 500.00GiB used 500.00GiB path /dev/vdb1
>
> # btrfs filesystem usage /ssddisk/
> Overall:
>     Device size:        500.00GiB
>     Device allocated:   500.00GiB
>     Device unallocated:   1.05MiB

Drive is actually fully allocated, so if Btrfs needs to create a new chunk right now, it can't. However,

> Data,single: Size:483.97GiB, Used:298.18GiB
>    /dev/vdb1  483.97GiB
>
> Metadata,single: Size:16.00GiB, Used:7.51GiB
>    /dev/vdb1  16.00GiB
>
> System,single: Size:32.00MiB, Used:144.00KiB
>    /dev/vdb1  32.00MiB

All three chunk types have quite a bit of unused space in them, so it's unclear why there's a no space left error.

Try remounting with enospc_debug, then trigger the problem again, and post the resulting kernel messages.

--
Chris Murphy
Metadata balance fails ENOSPC
Hello,

# btrfs balance start -v -dusage=0 -musage=1 /ssddisk/
Dumping filters: flags 0x7, state 0x0, force is off
DATA (flags 0x2): balancing, usage=0
METADATA (flags 0x2): balancing, usage=1
SYSTEM (flags 0x2): balancing, usage=1
ERROR: error during balancing '/ssddisk/': No space left on device
There may be more info in syslog - try dmesg | tail

# btrfs filesystem show /ssddisk/
Label: none  uuid: a69d2e90-c2ca-4589-9876-234446868adc
	Total devices 1  FS bytes used 305.67GiB
	devid 1 size 500.00GiB used 500.00GiB path /dev/vdb1

# btrfs filesystem usage /ssddisk/
Overall:
    Device size:        500.00GiB
    Device allocated:   500.00GiB
    Device unallocated:   1.05MiB
    Device missing:         0.00B
    Used:               305.69GiB
    Free (estimated):   185.78GiB  (min: 185.78GiB)
    Data ratio:              1.00
    Metadata ratio:          1.00
    Global reserve:     512.00MiB  (used: 608.00KiB)

Data,single: Size:483.97GiB, Used:298.18GiB
   /dev/vdb1  483.97GiB

Metadata,single: Size:16.00GiB, Used:7.51GiB
   /dev/vdb1   16.00GiB

System,single: Size:32.00MiB, Used:144.00KiB
   /dev/vdb1   32.00MiB

Unallocated:
   /dev/vdb1    1.05MiB

How can I make it balance again?

Greets,
Stefan
Re: Convert from RAID 5 to 10
On 30 November 2016 at 19:09, Chris Murphy wrote:
> On Wed, Nov 30, 2016 at 7:37 AM, Austin S. Hemmelgarn wrote:
>
>> The stability info could be improved, but _absolutely none_ of the things
>> mentioned as issues with raid1 are specific to raid1. And in general, in
>> the context of a feature stability matrix, 'OK' generally means that there
>> are no significant issues with that specific feature, and since none of the
>> issues outlined are specific to raid1, it does meet that description of
>> 'OK'.
>
> Maybe the gotchas page needs a one- or two-liner for each profile's
> gotchas compared to what the profile leads the user into believing.
> The overriding gotcha with all Btrfs multiple device support is the
> lack of monitoring and notification other than kernel messages; and
> raid10 actually being more like raid0+1 is, I think, certainly a
> gotcha. However, 'man mkfs.btrfs' contains a grid that very clearly
> states raid10 can only safely lose 1 device.
>
>> Looking at this another way, I've been using BTRFS on all my systems since
>> kernel 3.16 (I forget what exact vintage that is in regular years). I've
>> not had any data integrity or data loss issues as a result of BTRFS itself
>> since 3.19, and in just the past year I've had multiple raid1 profile
>> filesystems survive multiple hardware issues with near zero issues (with the
>> caveat that I had to re-balance after replacing devices to convert a few
>> single chunks to raid1), and that includes multiple disk failures and 2 bad
>> PSU's plus about a dozen (not BTRFS related) kernel panics and 4 unexpected
>> power loss events. I also have exhaustive monitoring, so I'm replacing bad
>> hardware early instead of waiting for it to actually fail.
>
> Possibly nothing aids predictably reliable storage stacks more than
> healthy doses of skepticism and awareness of all limitations.
> :-D
>
> --
> Chris Murphy

Please, I beg you, add another column to the man page and wiki stating clearly how many devices each profile can withstand losing. I frequently have to explain how btrfs profiles work and show quotes from this mailing list, because "Dunning-Kruger effect victims" keep popping up with statements like "in btrfs raid10 with 8 drives you can lose 4 drives"... I seriously beg you guys, my beating stick is half broken by now.
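The "only 1 device" figure from `man mkfs.btrfs` that keeps surprising people follows from btrfs raid10 keeping exactly two copies of each stripe element, with no guarantee about *which* two devices hold the pair for any given chunk. A small sketch of that model (an illustrative assumption, not btrfs's actual chunk allocator: it assumes that over the filesystem's lifetime every pair of devices ends up mirroring some chunk):

```python
from itertools import combinations

def survives(failed, mirror_pairs):
    """Data survives if at least one copy of every mirrored pair remains."""
    return all(not (a in failed and b in failed) for a, b in mirror_pairs)

# Model assumption: chunk allocation rotates mirror pairs over time, so
# sooner or later every pair of devices holds both copies of *some* data.
devices = range(8)
pairs = list(combinations(devices, 2))

# Any single-device failure is always survivable:
print(all(survives({d}, pairs) for d in devices))  # True

# But no two-device failure is guaranteed safe: some chunk loses both copies.
print(any(survives(set(f), pairs) for f in combinations(devices, 2)))  # False
```

Under this model losing any one device is always fine, yet even with 8 drives no specific pair of failures can be ruled safe, which is why the grid says 1 rather than 4. Classic raid10 with fixed mirror pairs can get lucky and survive up to half its drives; btrfs raid10's per-chunk pairing removes that luck.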
Re: Convert from RAID 5 to 10
On Wednesday, 30 November 2016, 12:09:23 CET, Chris Murphy wrote:
> On Wed, Nov 30, 2016 at 7:37 AM, Austin S. Hemmelgarn wrote:
>> The stability info could be improved, but _absolutely none_ of the things
>> mentioned as issues with raid1 are specific to raid1. And in general, in
>> the context of a feature stability matrix, 'OK' generally means that there
>> are no significant issues with that specific feature, and since none of
>> the issues outlined are specific to raid1, it does meet that description
>> of 'OK'.
>
> Maybe the gotchas page needs a one- or two-liner for each profile's
> gotchas compared to what the profile leads the user into believing.
> The overriding gotcha with all Btrfs multiple device support is the
> lack of monitoring and notification other than kernel messages; and
> raid10 actually being more like raid0+1 is, I think, certainly a
> gotcha. However, 'man mkfs.btrfs' contains a grid that very clearly
> states raid10 can only safely lose 1 device.

Wow, that manpage is quite a resource. Developers and documentation people have definitely improved the official BTRFS documentation.

Thanks,
--
Martin
Re: Convert from RAID 5 to 10
On Wed, Nov 30, 2016 at 7:37 AM, Austin S. Hemmelgarn wrote:
> The stability info could be improved, but _absolutely none_ of the things
> mentioned as issues with raid1 are specific to raid1. And in general, in
> the context of a feature stability matrix, 'OK' generally means that there
> are no significant issues with that specific feature, and since none of the
> issues outlined are specific to raid1, it does meet that description of
> 'OK'.

Maybe the gotchas page needs a one- or two-liner for each profile's gotchas compared to what the profile leads the user into believing. The overriding gotcha with all Btrfs multiple device support is the lack of monitoring and notification other than kernel messages; and raid10 actually being more like raid0+1 is, I think, certainly a gotcha. However, 'man mkfs.btrfs' contains a grid that very clearly states raid10 can only safely lose 1 device.

> Looking at this another way, I've been using BTRFS on all my systems since
> kernel 3.16 (I forget what exact vintage that is in regular years). I've
> not had any data integrity or data loss issues as a result of BTRFS itself
> since 3.19, and in just the past year I've had multiple raid1 profile
> filesystems survive multiple hardware issues with near zero issues (with the
> caveat that I had to re-balance after replacing devices to convert a few
> single chunks to raid1), and that includes multiple disk failures and 2 bad
> PSU's plus about a dozen (not BTRFS related) kernel panics and 4 unexpected
> power loss events. I also have exhaustive monitoring, so I'm replacing bad
> hardware early instead of waiting for it to actually fail.

Possibly nothing aids predictably reliable storage stacks more than healthy doses of skepticism and awareness of all limitations.

:-D

--
Chris Murphy
Re: Convert from RAID 5 to 10
On Wed, Nov 30, 2016 at 7:04 AM, Roman Mamedov wrote: > On Wed, 30 Nov 2016 07:50:17 -0500 > Also I don't know what is particularly insane about copying a 4-8 GB file onto > a storage array. I'd expect both disks to write at the same time (like they > do in pretty much any other RAID1 system), not one-after-another, effectively > slowing down the entire operation by as much as 2x in extreme cases. I don't experience this behavior. Writes take the same amount of time to a single profile volume as to a two device raid1 profile volume. iotop reports 2x the write bandwidth when writing to the raid1 volume, which corresponds to simultaneous writes to both drives in the volume. It's also not an elaborate setup by any means: two laptop drives, each in cheap USB 3.0 cases using bus power only, connected to a USB 3.0 hub, in turn connected to an Intel NUC. > > Comparing to Ext4, that one appears to have the "errors=continue" behavior by > default, the user has to explicitly request "errors=remount-ro", and I have > never seen anyone use or recommend the third option of "errors=panic", which > is basically the equivalent of the current Btrfs practice. I think in the context of degradedness, it may be appropriate to mount degraded,ro by default rather than fail. But changing the default isn't enough for the root fs use case, because the mount command isn't even issued when udev's btrfs 'dev scan' fails to report back all devices available. In this case there is a sort of "pre check" before even mounting is attempted, and that is what fails. Also, Btrfs has fatal_errors=panic and it's not the default. Rather, we just get mount failure. There really isn't anything quite like this in the mdadm/lvm + other file system world, where the array is active degraded and the file system mounts anyway; if it doesn't mount it's because the array isn't active, and doesn't even exist yet. 
> Unplugging and replugging a SATA cable of a RAID1 member should never put your > system under the risk of a massive filesystem corruption; you cannot say it > absolutely doesn't with the current implementation. I can't say it absolutely doesn't even with md. Of course it shouldn't, but users do report corruptions on all of the other fs lists (ext4, XFS, linux-raid) from time to time that are not the result of user error. -- Chris Murphy
Re: btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off
+folks from the linux-mm thread, for your suggestion On Wed, Nov 30, 2016 at 01:00:45PM -0500, Austin S. Hemmelgarn wrote: > > swraid5 < bcache < dmcrypt < btrfs > > > > Copying with btrfs send/receive causes massive hangs on the system. > > Please see this explanation from Linus on why the workaround was > > suggested: > > https://lkml.org/lkml/2016/11/29/667 > And Linus' assessment is absolutely correct (at least, the general > assessment is, I have no idea about btrfs_start_shared_extent, but I'm more > than willing to bet he's correct that that's the culprit). > > All of this mostly went away with Linus' suggestion: > > echo 2 > /proc/sys/vm/dirty_ratio > > echo 1 > /proc/sys/vm/dirty_background_ratio > > > > But that's hiding the symptom, which I think is that btrfs is piling up too > > many I/O > > requests during btrfs send/receive and btrfs scrub (probably balance too) > > and not > > looking at the resulting impact to system health. > I see pretty much identical behavior using any number of other storage > configurations on a USB 2.0 flash drive connected to a system with 16GB of > RAM with the default dirty ratios because it's trying to cache up to 3.2GB > of data for writeback. While BTRFS is doing highly sub-optimal things here, > the ancient default writeback ratios are just as much a culprit. I would > suggest that be changed to 200MB or 20% of RAM, whichever is smaller, which > would give overall almost identical behavior to x86-32, which in turn works > reasonably well for most cases. I sadly don't have the time, patience, or > expertise to write up such a patch myself though. Dear linux-mm folks, is that something you could consider (changing the dirty_ratio defaults) given that it affects at least bcache and btrfs (with or without bcache)? By the way, on the 200MB max suggestion, when I had 2 and 1% (or 480MB and 240MB on my 24GB system), this was enough to make btrfs behave sanely, but only if I had bcache turned off. 
With bcache enabled, those values were just enough so that bcache didn't crash my system, but not enough to prevent undesirable behaviour (things hanging, 100+ bcache kworkers piled up, and more). However, the copy did succeed, despite the relative impact on the system, so it's better than nothing :) But the impact from bcache probably goes beyond what btrfs is responsible for, so I have a separate thread on the bcache list: http://marc.info/?l=linux-bcache=148052441423532=2 http://marc.info/?l=linux-bcache=148052620524162=2 On the plus side, btrfs did ok with 0 visible impact to my system with those 480 and 240MB dirty ratio values. Thanks for your reply, Austin. Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off
On 2016-11-30 12:18, Marc MERLIN wrote: On Wed, Nov 30, 2016 at 08:46:46AM -0800, Marc MERLIN wrote: +btrfs mailing list, see below why Ok, Linus helped me find a workaround for this problem: https://lkml.org/lkml/2016/11/29/667 namely: echo 2 > /proc/sys/vm/dirty_ratio echo 1 > /proc/sys/vm/dirty_background_ratio (it's a 24GB system, so the defaults of 20 and 10 were creating too many requests in the buffers) I'll remove the bcache list on this followup since I want to concentrate here on the fact that btrfs does behave badly with the default dirty_ratio values. I will comment that on big systems, almost everything behaves badly with the default dirty ratios; they're leftovers from when 1GB was a huge amount of RAM. As usual though, BTRFS has pathological behavior compared to other options. As a reminder, it's a btrfs send/receive copy between 2 swraid5 arrays on spinning rust. swraid5 < bcache < dmcrypt < btrfs Copying with btrfs send/receive causes massive hangs on the system. Please see this explanation from Linus on why the workaround was suggested: https://lkml.org/lkml/2016/11/29/667 And Linus' assessment is absolutely correct (at least, the general assessment is, I have no idea about btrfs_start_shared_extent, but I'm more than willing to bet he's correct that that's the culprit). The hangs that I'm getting with bcache cache turned off (i.e. passthrough) are now very likely only due to btrfs and mess up anything doing file IO that ends up timing out, break USB even as reads time out in the middle of USB requests, interrupts lost, and so forth. All of this mostly went away with Linus' suggestion: echo 2 > /proc/sys/vm/dirty_ratio echo 1 > /proc/sys/vm/dirty_background_ratio But that's hiding the symptom, which I think is that btrfs is piling up too many I/O requests during btrfs send/receive and btrfs scrub (probably balance too) and not looking at the resulting impact to system health. 
I see pretty much identical behavior using any number of other storage configurations on a USB 2.0 flash drive connected to a system with 16GB of RAM with the default dirty ratios because it's trying to cache up to 3.2GB of data for writeback. While BTRFS is doing highly sub-optimal things here, the ancient default writeback ratios are just as much a culprit. I would suggest that be changed to 200MB or 20% of RAM, whichever is smaller, which would give overall almost identical behavior to x86-32, which in turn works reasonably well for most cases. I sadly don't have the time, patience, or expertise to write up such a patch myself though. Is there a way to stop flooding the entire system with I/O and causing so much strain on it? (I realize that if there is a caching layer underneath that just takes requests and says thank you without giving other clues that underneath bad things are happening, it may be hard, but I'm asking anyway :) [10338.968912] perf: interrupt took too long (3927 > 3917), lowering kernel.perf_event_max_sample_rate to 50750 [12971.047705] ftdi_sio ttyUSB15: usb_serial_generic_read_bulk_callback - urb stopped: -32 [17761.122238] usb 4-1.4: USB disconnect, device number 39 [17761.141063] usb 4-1.4: usbfs: USBDEVFS_CONTROL failed cmd hub-ctrl rqt 160 rq 6 len 1024 ret -108 [17761.263252] usb 4-1: reset SuperSpeed USB device number 2 using xhci_hcd [17761.938575] usb 4-1.4: new SuperSpeed USB device number 40 using xhci_hcd [24130.574425] hpet1: lost 2306 rtc interrupts [24156.034950] hpet1: lost 1628 rtc interrupts [24173.314738] hpet1: lost 1104 rtc interrupts [24180.129950] hpet1: lost 436 rtc interrupts [24257.557955] hpet1: lost 4954 rtc interrupts [24267.522656] hpet1: lost 637 rtc interrupts [28034.954435] INFO: task btrfs:5618 blocked for more than 120 seconds. [28034.975471] Tainted: G U 4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12 [28035.000964] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[28035.025429] btrfs D 91154d33fc70 0 5618 5372 0x0080 [28035.047717] 91154d33fc70 00200246 911842f880c0 9115a4cf01c0 [28035.071020] 91154d33fc58 91154d34 91165493bca0 9115623773f0 [28035.094252] 1000 0001 91154d33fc88 b86cf1a6 [28035.117538] Call Trace: [28035.125791] [] schedule+0x8b/0xa3 [28035.141550] [] btrfs_start_ordered_extent+0xce/0x122 [28035.162457] [] ? wake_up_atomic_t+0x2c/0x2c [28035.180891] [] btrfs_wait_ordered_range+0xa9/0x10d [28035.201723] [] btrfs_truncate+0x40/0x24b [28035.219269] [] btrfs_setattr+0x1da/0x2d7 [28035.237032] [] notify_change+0x252/0x39c [28035.254566] [] do_truncate+0x81/0xb4 [28035.271057] [] vfs_truncate+0xd9/0xf9 [28035.287782] [] do_sys_truncate+0x63/0xa7 [28155.781987] INFO: task btrfs:5618 blocked for more than 120 seconds. [28155.802229] Tainted: G U 4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12 [28155.827894] "echo 0 >
[GIT PULL] Btrfs fixes for 4.10
From: Filipe Manana

Hi Chris, Here follows a small list of fixes and a couple of cleanups for the 4.10 merge window. It contains all the patches from the previous pull request (which apparently went unanswered, nor were the changes pulled yet). The most important change is still the fix for the extent tree corruption that happens due to balance when qgroups are enabled (a regression introduced in 4.7 by a fix for a regression from the last qgroups rework). This has been hitting SLE and openSUSE users and QA very badly, where transactions keep getting aborted when running delayed references, leaving the root filesystem in RO mode and nearly unusable. There are fixes here that allow us to run xfstests again with the integrity checker enabled, which has been impossible since 4.8 (apparently I'm the only one running xfstests with the integrity checker enabled, which is useful to validate dirtied leaves, like checking if there are keys out of order, etc). The rest are just some trivial fixes, most of them tagged for stable, and two cleanups. Thanks. 
The following changes since commit e3597e6090ddf40904dce6d0a5a404e2c490cac6: Merge branch 'for-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.9 (2016-11-01 12:54:45 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git for-chris-4.10 for you to fetch changes up to 2a7bf53f577e49c43de4ffa7776056de26db65d9: Btrfs: fix tree search logic when replaying directory entry deletes (2016-11-30 16:56:12 +) Filipe Manana (5): Btrfs: fix relocation incorrectly dropping data references Btrfs: remove unused code when creating and merging reloc trees Btrfs: remove rb_node field from the delayed ref node structure Btrfs: fix emptiness check for dirtied extent buffers at check_leaf() Btrfs: fix qgroup rescan worker initialization Liu Bo (1): Btrfs: fix BUG_ON in btrfs_mark_buffer_dirty Robbie Ko (3): Btrfs: fix enospc in hole punching Btrfs: fix deadlock caused by fsync when logging directory entries Btrfs: fix tree search logic when replaying directory entry deletes fs/btrfs/delayed-ref.h | 6 -- fs/btrfs/disk-io.c | 23 +++ fs/btrfs/file.c| 4 ++-- fs/btrfs/qgroup.c | 5 + fs/btrfs/relocation.c | 34 -- fs/btrfs/tree-log.c| 7 +++ 6 files changed, 37 insertions(+), 42 deletions(-) -- 2.7.0.rc3
Re: 4.8.8, bcache deadlock and hard lockup
On Wed, Nov 30, 2016 at 08:46:46AM -0800, Marc MERLIN wrote: > +btrfs mailing list, see below why > > On Tue, Nov 29, 2016 at 12:59:44PM -0800, Eric Wheeler wrote: > > On Mon, 27 Nov 2016, Coly Li wrote: > > > > > > Yes, too many work queues... I guess the locking might be caused by some > > > very obscure reference of closure code. I cannot have any clue if I > > > cannot find a stable procedure to reproduce this issue. > > > > > > Hmm, if there is a tool to clone all the meta data of the back end cache > > > and whole cached device, there might be a method to replay the oops much > > > easier. > > > > > > Eric, do you have any hint ? > > > > Note that the backing device doesn't have any metadata, just a superblock. > > You can easily dd that off onto some other volume without transferring the > > data. By default, data starts at 8k, or whatever you used in `make-bcache > > -w`. > > Ok, Linus helped me find a workaround for this problem: > https://lkml.org/lkml/2016/11/29/667 > namely: >echo 2 > /proc/sys/vm/dirty_ratio >echo 1 > /proc/sys/vm/dirty_background_ratio > (it's a 24GB system, so the defaults of 20 and 10 were creating too many > requests in the buffers) > > Note that this is only a workaround, not a fix. Actually, I'm even more worried about the general bcache situation when caching is enabled. In the message above, Linus wrote: "One situation where I've seen something like this happen is (a) lots and lots of dirty data queued up (b) horribly slow storage (c) filesystem that ends up serializing on writeback under certain circumstances The usual case for (b) in the modern world is big SSD's that have bad worst-case behavior (ie they may do gbps speeds when doing well, and then they come to a screeching halt when their buffers fill up and they have to do rewrites, and their gbps throughput drops to mbps or lower). Generally you only find that kind of really nasty SSD in the USB stick world these days." 
Well, come to think of it, this is _exactly_ what bcache will create, by design. It'll swallow up a lot of IO cached to the SSD, until the SSD buffers fill up and then things will hang while bcache struggles to write it all to slower spinning rust storage. Looks to me like bcache and dirty_ratio need to be synced somehow, or things will fall over reliably. What do you think? Thanks, Marc > When I did this and re-tried my big copy again, I still got 100+ kernel > work queues, but apparently the underlying swraid5 was able to unblock > and satisfy the write requests before too many accumulated and crashed > the kernel. > > I'm not a kernel coder, but it seems to me that bcache needs a way to > throttle incoming requests if there are too many so that it does not end > up in a state where things blow up due to too many piled up requests. > > You should be able to reproduce this by taking 5 spinning rust drives, > put raid5 on top, dmcrypt, bcache and hopefully any filesystem (although > I used btrfs) and send lots of requests. > Actually to be honest, the problems have mostly been happening when I do > btrfs scrub and btrfs send/receive which both generate I/O from within > the kernel instead of user space. > So here, btrfs may be a contributor to the problem too, but while btrfs > still trashes my system if I remove the caching device on bcache (and > with the default dirty ratio values), it doesn't crash the kernel. > > I'll start another separate thread with the btrfs folks on how much > pressure is put on the system, but on your side it would be good to help > ensure that bcache doesn't crash the system altogether if too many > requests are allowed to pile up. > > Thanks, > Marc > -- > "A mouse is a device used to point at the xterm you want to type in" - A.S.R. 
> Microsoft is to operating systems > what McDonalds is to gourmet > cooking > Home page: http://marc.merlins.org/ | PGP > 1024R/763BE901 -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off
On Wed, Nov 30, 2016 at 08:46:46AM -0800, Marc MERLIN wrote: > +btrfs mailing list, see below why > > Ok, Linus helped me find a workaround for this problem: > https://lkml.org/lkml/2016/11/29/667 > namely: >echo 2 > /proc/sys/vm/dirty_ratio >echo 1 > /proc/sys/vm/dirty_background_ratio > (it's a 24GB system, so the defaults of 20 and 10 were creating too many > requests in the buffers) I'll remove the bcache list on this followup since I want to concentrate here on the fact that btrfs does behave badly with the default dirty_ratio values. As a reminder, it's a btrfs send/receive copy between 2 swraid5 arrays on spinning rust. swraid5 < bcache < dmcrypt < btrfs Copying with btrfs send/receive causes massive hangs on the system. Please see this explanation from Linus on why the workaround was suggested: https://lkml.org/lkml/2016/11/29/667 The hangs that I'm getting with bcache cache turned off (i.e. passthrough) are now very likely only due to btrfs and mess up anything doing file IO that ends up timing out, break USB even as reads time out in the middle of USB requests, interrupts lost, and so forth. All of this mostly went away with Linus' suggestion: echo 2 > /proc/sys/vm/dirty_ratio echo 1 > /proc/sys/vm/dirty_background_ratio But that's hiding the symptom, which I think is that btrfs is piling up too many I/O requests during btrfs send/receive and btrfs scrub (probably balance too) and not looking at the resulting impact to system health. Is there a way to stop flooding the entire system with I/O and causing so much strain on it? 
(I realize that if there is a caching layer underneath that just takes requests and says thank you without giving other clues that underneath bad things are happening, it may be hard, but I'm asking anyway :) [10338.968912] perf: interrupt took too long (3927 > 3917), lowering kernel.perf_event_max_sample_rate to 50750 [12971.047705] ftdi_sio ttyUSB15: usb_serial_generic_read_bulk_callback - urb stopped: -32 [17761.122238] usb 4-1.4: USB disconnect, device number 39 [17761.141063] usb 4-1.4: usbfs: USBDEVFS_CONTROL failed cmd hub-ctrl rqt 160 rq 6 len 1024 ret -108 [17761.263252] usb 4-1: reset SuperSpeed USB device number 2 using xhci_hcd [17761.938575] usb 4-1.4: new SuperSpeed USB device number 40 using xhci_hcd [24130.574425] hpet1: lost 2306 rtc interrupts [24156.034950] hpet1: lost 1628 rtc interrupts [24173.314738] hpet1: lost 1104 rtc interrupts [24180.129950] hpet1: lost 436 rtc interrupts [24257.557955] hpet1: lost 4954 rtc interrupts [24267.522656] hpet1: lost 637 rtc interrupts [28034.954435] INFO: task btrfs:5618 blocked for more than 120 seconds. [28034.975471] Tainted: G U 4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12 [28035.000964] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [28035.025429] btrfs D 91154d33fc70 0 5618 5372 0x0080 [28035.047717] 91154d33fc70 00200246 911842f880c0 9115a4cf01c0 [28035.071020] 91154d33fc58 91154d34 91165493bca0 9115623773f0 [28035.094252] 1000 0001 91154d33fc88 b86cf1a6 [28035.117538] Call Trace: [28035.125791] [] schedule+0x8b/0xa3 [28035.141550] [] btrfs_start_ordered_extent+0xce/0x122 [28035.162457] [] ? 
wake_up_atomic_t+0x2c/0x2c [28035.180891] [] btrfs_wait_ordered_range+0xa9/0x10d [28035.201723] [] btrfs_truncate+0x40/0x24b [28035.219269] [] btrfs_setattr+0x1da/0x2d7 [28035.237032] [] notify_change+0x252/0x39c [28035.254566] [] do_truncate+0x81/0xb4 [28035.271057] [] vfs_truncate+0xd9/0xf9 [28035.287782] [] do_sys_truncate+0x63/0xa7 [28155.781987] INFO: task btrfs:5618 blocked for more than 120 seconds. [28155.802229] Tainted: G U 4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12 [28155.827894] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [28155.852479] btrfs D 91154d33fc70 0 5618 5372 0x0080 [28155.874761] 91154d33fc70 00200246 911842f880c0 9115a4cf01c0 [28155.898059] 91154d33fc58 91154d34 91165493bca0 9115623773f0 [28155.921464] 1000 0001 91154d33fc88 b86cf1a6 [28155.944720] Call Trace: [28155.953176] [] schedule+0x8b/0xa3 [28155.968945] [] btrfs_start_ordered_extent+0xce/0x122 [28155.989811] [] ? wake_up_atomic_t+0x2c/0x2c [28156.008195] [] btrfs_wait_ordered_range+0xa9/0x10d [28156.028498] [] btrfs_truncate+0x40/0x24b [28156.046081] [] btrfs_setattr+0x1da/0x2d7 [28156.063621] [] notify_change+0x252/0x39c [28156.081667] [] do_truncate+0x81/0xb4 [28156.098732] [] vfs_truncate+0xd9/0xf9 [28156.115489] [] do_sys_truncate+0x63/0xa7 [28156.133389] [] SyS_truncate+0xe/0x10 [28156.149831] [] do_syscall_64+0x61/0x72 [28156.167179] [] entry_SYSCALL64_slow_path+0x25/0x25 [28397.436986] INFO: task btrfs:5618 blocked for more than 120 seconds. [28397.456798] Tainted: G U
Re: [PATCH] Btrfs: fix infinite loop when tree log recovery
On Fri, Oct 7, 2016 at 10:30 AM, robbieko wrote: > From: Robbie Ko > > if log tree like below: > leaf N: > ... > item 240 key (282 DIR_LOG_ITEM 0) itemoff 8189 itemsize 8 > dir log end 1275809046 > leaf N+1: > item 0 key (282 DIR_LOG_ITEM 3936149215) itemoff 16275 itemsize 8 > dir log end 18446744073709551615 > ... > > when start_ret > 1275809046, but slot[0] never >= nritems, > so never go to next leaf. This doesn't explain how the infinite loop happens. Nor exactly how any problem happens. It's important to have detailed information in the change logs. I understand that English isn't your native tongue (it's not mine either, and I'm far from mastering it), but that's not an excuse to not express all the important information in detail (we can all live with grammar errors and typos, and we all make such errors frequently). I've added this patch to my branch at https://git.kernel.org/cgit/linux/kernel/git/fdmanana/linux.git/log/?h=for-chris-4.10 but with a modified changelog and subject. The results of the wrong logic that decides when to move to the next leaf are unpredictable, and it won't always result in an infinite loop. We are accessing a slot that doesn't point to an item, i.e. a memory location containing garbage or something unexpected, and in the worst case that location is beyond the last page of the extent buffer. Thanks. 
> > Signed-off-by: Robbie Ko > --- > fs/btrfs/tree-log.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c > index ef9c55b..e63dd99 100644 > --- a/fs/btrfs/tree-log.c > +++ b/fs/btrfs/tree-log.c > @@ -1940,12 +1940,11 @@ static noinline int find_dir_range(struct btrfs_root > *root, > next: > /* check the next slot in the tree to see if it is a valid item */ > nritems = btrfs_header_nritems(path->nodes[0]); > + path->slots[0]++; > if (path->slots[0] >= nritems) { > ret = btrfs_next_leaf(root, path); > if (ret) > goto out; > - } else { > - path->slots[0]++; > } > > btrfs_item_key_to_cpu(path->nodes[0], , path->slots[0]); > -- > 1.9.1 -- Filipe David Manana, "People will forget what you said, people will forget what you did, but people will never forget how you made them feel."
Re: Convert from RAID 5 to 10
On 2016-11-30 10:49, Wilson Meier wrote: On 2016-11-30 at 15:37, Austin S. Hemmelgarn wrote: On 2016-11-30 08:12, Wilson Meier wrote: On 2016-11-30 at 11:41, Duncan wrote: Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted: On 2016-11-30 at 09:06, Martin Steigerwald wrote: On Wednesday, 30 November 2016, 10:38:08 CET, Roman Mamedov wrote: [snip] So the stability matrix would need to be updated not to recommend any kind of BTRFS RAID 1 at the moment? Actually I faced the BTRFS RAID 1 read only after first attempt of mounting it "degraded" just a short time ago. BTRFS still needs way more stability work it seems to me. I would say the matrix should be updated to not recommend any RAID level, as from the discussion it seems all of them have flaws. To me RAID is broken if one cannot expect to recover from a device failure in a solid way, as this is why RAID is used. Correct me if i'm wrong. Right now i'm making my thoughts about migrating to another FS and/or Hardware RAID. It should be noted that no list regular, that I'm aware of anyway, would make any claims about btrfs being stable and mature either now or in the near-term future in any case. Rather to the contrary, as I generally put it, btrfs is still stabilizing and maturing, with backups one is willing to use (and as any admin of any worth would say, a backup that hasn't been tested usable isn't yet a backup; the job of creating the backup isn't done until that backup has been tested actually usable for recovery) still extremely strongly recommended. Similarly, keeping up with the list is recommended, as is staying relatively current on both the kernel and userspace (generally considered to be within the latest two kernel series of either current or LTS series kernels, and with a similarly versioned btrfs userspace). 
In that context, btrfs single-device and raid1 (and raid0 of course) are quite usable and as stable as btrfs in general is, that being stabilizing but not yet fully stable and mature, with raid10 being slightly less so and raid56 being much more experimental/unstable at this point. But that context never claims full stability even for the relatively stable raid1 and single device modes, and in fact anticipates that there may be times when recovery from the existing filesystem may not be practical, thus the recommendation to keep tested usable backups at the ready. Meanwhile, it remains relatively common on this list for those wondering about their btrfs on long-term-stale (not a typo) "enterprise" distros, or even debian-stale, to be actively steered away from btrfs, especially if they're not willing to update to something far more current than those distros often provide, because in general, the current stability status of btrfs is in conflict with the reason people generally choose to use that level of old and stale software in the first place -- they prioritize tried and tested to work, stable and mature, over the latest generally newer and flashier featured but sometimes not entirely stable, and btrfs at this point simply doesn't meet that sort of stability/maturity expectations, nor is it likely to for some time (measured in years), due to all the reasons enumerated so well in the above thread. In that context, the stability status matrix on the wiki is already reasonably accurate, certainly so IMO, because "OK" in context means as OK as btrfs is in general, and btrfs itself remains still stabilizing, not fully stable and mature. 
If there IS an argument as to the accuracy of the raid0/1/10 OK status, I'd argue it's purely due to people not understanding the status of btrfs in general, and that if there's a general deficiency at all, it's in the lack of a general stability status paragraph on that page itself explaining all this, despite the fact that the main https://btrfs.wiki.kernel.org landing page states quite plainly under stability status that btrfs remains under heavy development and that current kernels are strongly recommended. (Tho were I editing it, there'd certainly be a more prominent mention of keeping backups at the ready as well.) Hi Duncan, i understand your arguments but cannot fully agree. First of all, i'm not sticking with old stale versions of whatever as i try to keep my system up2date. My kernel is 4.8.4 (Gentoo) and btrfs-progs is 4.8.4. That being said, i'm quite aware of the heavy development status of btrfs but pointing the finger on the users, saying that they don't fully understand the status of btrfs, without giving the information on the wiki is in my opinion not the right way. Heavy development doesn't mean that features marked as ok are "not" or "mostly" ok in the context of overall btrfs stability. There is no indication on the wiki that raid1 or any other raid (except for raid5/6) suffers from the problems stated in this thread. The performance issues are inherent to BTRFS right now, and none of the other issues are likely to impact most regular
Re: 4.8.8, bcache deadlock and hard lockup
+btrfs mailing list, see below why On Tue, Nov 29, 2016 at 12:59:44PM -0800, Eric Wheeler wrote: > On Mon, 27 Nov 2016, Coly Li wrote: > > > > Yes, too many work queues... I guess the locking might be caused by some > > very obscure reference of closure code. I cannot have any clue if I > > cannot find a stable procedure to reproduce this issue. > > > > Hmm, if there is a tool to clone all the meta data of the back end cache > > and whole cached device, there might be a method to replay the oops much > > easier. > > > > Eric, do you have any hint ? > > Note that the backing device doesn't have any metadata, just a superblock. > You can easily dd that off onto some other volume without transferring the > data. By default, data starts at 8k, or whatever you used in `make-bcache > -w`. Ok, Linus helped me find a workaround for this problem: https://lkml.org/lkml/2016/11/29/667 namely: echo 2 > /proc/sys/vm/dirty_ratio echo 1 > /proc/sys/vm/dirty_background_ratio (it's a 24GB system, so the defaults of 20 and 10 were creating too many requests in the buffers) Note that this is only a workaround, not a fix. When I did this and re-tried my big copy again, I still got 100+ kernel work queues, but apparently the underlying swraid5 was able to unblock and satisfy the write requests before too many accumulated and crashed the kernel. I'm not a kernel coder, but it seems to me that bcache needs a way to throttle incoming requests if there are too many, so that it does not end up in a state where things blow up due to too many piled up requests. You should be able to reproduce this by taking 5 spinning rust drives, putting raid5 on top, then dmcrypt, then bcache and hopefully any filesystem (although I used btrfs), and sending lots of requests. Actually to be honest, the problems have mostly been happening when I do btrfs scrub and btrfs send/receive, which both generate I/O from within the kernel instead of user space. 
So here, btrfs may be a contributor to the problem too, but while btrfs still trashes my system if I remove the caching device on bcache (and with the default dirty ratio values), it doesn't crash the kernel. I'll start another separate thread with the btrfs folks on how much pressure is put on the system, but on your side it would be good to help ensure that bcache doesn't crash the system altogether if too many requests are allowed to pile up. Thanks, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: [PATCH v2 02/14] btrfs-progs: check: introduce function to find dir_item
On Wed, Nov 16, 2016 at 10:27:59AM +0800, Qu Wenruo wrote:
> > Yes please. Third namespace for existing error bits is not a good
> > option. Move the I_ERR bits to start from 32 and use them in the low-mem
> > code that's been merged to devel.
>
> I didn't see such a fix in devel branch.

Well, that's because nobody implemented it, and I was not intending to do it myself, as it's a followup to your lowmem patchset in devel.
Re: Convert from RAID 5 to 10
On Wednesday, 30 November 2016 at 16:49:59 CET, Wilson Meier wrote:
> On 30/11/16 at 15:37, Austin S. Hemmelgarn wrote:
> > On 2016-11-30 08:12, Wilson Meier wrote:
> >> On 30/11/16 at 11:41, Duncan wrote:
> >>> Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted:
> On 30/11/16 at 09:06, Martin Steigerwald wrote:
> > On Wednesday, 30 November 2016 at 10:38:08 CET, Roman Mamedov wrote:
[…]
> >> It is really disappointing to not have this information in the wiki
> >> itself. This would have saved me, and I'm quite sure others too, a lot
> >> of time.
> >> Sorry for being a bit frustrated.
> I'm not angry or anything like that :) .
> I would just like to be able to read such information, about the storage
> I put my personal data (> 3 TB) on, in the official wiki.

Anyone can get an account on the wiki and add notes there, so feel free. You can even use footnotes or something like that. Maybe it would be good to add a paragraph there explaining that features are related to one another, so while BTRFS RAID 1, for example, might be quite okay, it depends on features that are still flaky.

I myself rely quite heavily on BTRFS RAID 1 with lzo compression, and it seems to work okay for me.

-- 
Martin
Re: [PATCH] Btrfs: fix fsync deadlock in log_new_dir_dentries
On Fri, Oct 28, 2016 at 3:48 AM, robbieko wrote:
> From: Robbie Ko
>
> We found a fsync deadlock in log_new_dir_dentries, because
> btrfs_search_forward get path lock, then call btrfs_iget will
> get another extent_buffer lock, maybe occur deadlock.

This still doesn't explain how the deadlock happens. For it to happen, it's necessary that, before btrfs_iget() does a tree search, some other task gets write locks on nodes and blocks waiting for the leaf locked by btrfs_search_forward() to be unlocked, and that btrfs_iget() tries to read lock those same nodes write locked by that other task.

It's important to have detailed information in the change logs. I understand that English isn't your native tongue (it's not mine either, and I'm far from mastering it), but that's not an excuse to not express all the important information in detail (we can all live with grammar errors and typos).

>
> Fix this by release path before call btrfs_iget, avoid deadlock occur.
>
> Example:
> Pid waiting: 32021->32020->32028->14431->14436->32021
>
> The following are their extent_buffer locked/waiting respectively:
> extent_buffer: start:207060992, len:16384
> locker pid: 32020 read lock
> wait pid: 32021 write lock
> extent_buffer: start:14730821632, len:16384
> locker pid: 32028 read lock
> wait pid: 32020 write lock
> extent_buffer: start:446503813120, len:16384
> locker pid: 14431 write lock
> wait pid: 32028 read lock
> extent_buffer: start:446503845888, len: 16384
> locker pid: 14436 write lock
> wait pid: 14431 write lock
> extent_buffer: start: 446504386560, len: 16384
> locker pid: 32021 write lock
> wait pid: 14436 write lock
>
> The following are their call trace respectively.
> [ 4077.478852] kworker/u24:10 D 88107fc90640 0 14431 2 > 0x > [ 4077.486752] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] > [ 4077.494346] 880ffa56bad0 0046 9000 > 880ffa56bfd8 > [ 4077.502629] 880ffa56bfd8 881016ce21c0 a06ecb26 > 88101a5d6138 > [ 4077.510915] 880ebb5173b0 880ffa56baf8 880ebb517410 > 881016ce21c0 > [ 4077.519202] Call Trace: > [ 4077.528752] [] ? btrfs_tree_lock+0xdd/0x2f0 [btrfs] > [ 4077.536049] [] ? wake_up_atomic_t+0x30/0x30 > [ 4077.542574] [] ? btrfs_search_slot+0x79f/0xb10 [btrfs] > [ 4077.550171] [] ? btrfs_lookup_file_extent+0x33/0x40 > [btrfs] > [ 4077.558252] [] ? __btrfs_drop_extents+0x13b/0xdf0 > [btrfs] > [ 4077.566140] [] ? add_delayed_data_ref+0xe2/0x150 [btrfs] > [ 4077.573928] [] ? btrfs_add_delayed_data_ref+0x149/0x1d0 > [btrfs] > [ 4077.582399] [] ? __set_extent_bit+0x4c0/0x5c0 [btrfs] > [ 4077.589896] [] ? > insert_reserved_file_extent.constprop.75+0xa4/0x320 [btrfs] > [ 4077.599632] [] ? start_transaction+0x8d/0x470 [btrfs] > [ 4077.607134] [] ? btrfs_finish_ordered_io+0x2e7/0x600 > [btrfs] > [ 4077.615329] [] ? process_one_work+0x142/0x3d0 > [ 4077.622043] [] ? worker_thread+0x109/0x3b0 > [ 4077.628459] [] ? manage_workers.isra.26+0x270/0x270 > [ 4077.635759] [] ? kthread+0xaf/0xc0 > [ 4077.641404] [] ? kthread_create_on_node+0x110/0x110 > [ 4077.648696] [] ? ret_from_fork+0x58/0x90 > [ 4077.654926] [] ? kthread_create_on_node+0x110/0x110 > > [ 4078.358087] kworker/u24:15 D 88107fcd0640 0 14436 2 > 0x > [ 4078.365981] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] > [ 4078.373574] 880ffa57fad0 0046 9000 > 880ffa57ffd8 > [ 4078.381864] 880ffa57ffd8 88103004d0a0 a06ecb26 > 88101a5d6138 > [ 4078.390163] 880fbeffc298 880ffa57faf8 880fbeffc2f8 > 88103004d0a0 > [ 4078.398466] Call Trace: > [ 4078.408019] [] ? btrfs_tree_lock+0xdd/0x2f0 [btrfs] > [ 4078.415322] [] ? wake_up_atomic_t+0x30/0x30 > [ 4078.421844] [] ? btrfs_search_slot+0x79f/0xb10 [btrfs] > [ 4078.429438] [] ? 
btrfs_lookup_file_extent+0x33/0x40 > [btrfs] > [ 4078.437518] [] ? __btrfs_drop_extents+0x13b/0xdf0 > [btrfs] > [ 4078.445404] [] ? add_delayed_data_ref+0xe2/0x150 [btrfs] > [ 4078.453194] [] ? btrfs_add_delayed_data_ref+0x149/0x1d0 > [btrfs] > [ 4078.461663] [] ? __set_extent_bit+0x4c0/0x5c0 [btrfs] > [ 4078.469161] [] ? > insert_reserved_file_extent.constprop.75+0xa4/0x320 [btrfs] > [ 4078.478893] [] ? start_transaction+0x8d/0x470 [btrfs] > [ 4078.486388] [] ? btrfs_finish_ordered_io+0x2e7/0x600 > [btrfs] > [ 4078.494561] [] ? process_one_work+0x142/0x3d0 > [ 4078.501278] [] ? pwq_activate_delayed_work+0x27/0x40 > [ 4078.508673] [] ? worker_thread+0x109/0x3b0 > [ 4078.515098] [] ? manage_workers.isra.26+0x270/0x270 > [ 4078.522396] [] ? kthread+0xaf/0xc0 > [ 4078.528032] [] ? kthread_create_on_node+0x110/0x110 > [ 4078.535325] [] ? ret_from_fork+0x58/0x90 > [ 4078.541552] [] ? kthread_create_on_node+0x110/0x110 > > [ 4079.355824] user-space-program D 88107fd30640 0 32020
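The pid chain in the report above (32021 -> 32020 -> 32028 -> 14431 -> 14436 -> 32021) is a textbook wait-for cycle. A minimal, hypothetical sketch of checking such a chain for a cycle — the pids and edges are taken from the report; this is not kernel lockdep code, and all names are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Wait-for edges from the report: waiter[i] is blocked on a lock
 * held by holder[i]. */
static const int waiter[] = { 32021, 32020, 32028, 14431, 14436 };
static const int holder[] = { 32020, 32028, 14431, 14436, 32021 };
#define NEDGES (sizeof(waiter) / sizeof(waiter[0]))

/* Which pid is this one waiting for? -1 if it is not blocked. */
static int waits_on(int pid)
{
	for (size_t i = 0; i < NEDGES; i++)
		if (waiter[i] == pid)
			return holder[i];
	return -1;
}

/* Follow the chain from start; return the cycle length if we come
 * back to start (a deadlock), or 0 if the chain ends at a task that
 * is not blocked. */
int cycle_length(int start)
{
	int cur = start;

	for (size_t hops = 1; hops <= NEDGES; hops++) {
		cur = waits_on(cur);
		if (cur < 0)
			return 0;	/* chain terminates, no deadlock */
		if (cur == start)
			return (int)hops;
	}
	return 0;
}
```

Starting at any pid in the report walks all five edges and lands back where it began, which is exactly why none of the five tasks can make progress.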
Re: [PATCH v2 02/14] btrfs-progs: check: introduce function to find dir_item
On Tue, Nov 08, 2016 at 09:45:54AM +0800, Qu Wenruo wrote:
> > Yes please. Third namespace for existing error bits is not a good
> > option. Move the I_ERR bits to start from 32 and use them in the low-mem
> > code that's been merged to devel.
>
> Should I submit a separate fix or replace the patchset?

Separate patches please. The check patches are at the beginning of devel and there are several cleanup patches on top of them so that would probably cause too many merge conflicts.
Re: [PATCH v2] Btrfs: fix enospc in hole punching
On Fri, Oct 28, 2016 at 3:32 AM, robbiekowrote: > From: Robbie Ko > > The hole punching can result in adding new leafs (and as a consequence > new nodes) to the tree because when we find file extent items that span > beyond the hole range we may end up not deleting them (just adjusting them) > and add new file extent items representing holes. > > That after splitting a leaf (therefore creating a new one), a new node > might be added to each level of the tree (since there's a new key and > every parent node was full). > > Fix this by use btrfs_calc_trans_metadata_size instead of > btrfs_calc_trunc_metadata_size. > > v2: > * Improve the change log Version information does not belong in the changelog but after the --- below (it wouldn't make sense to have it in the git changelogs...). See https://btrfs.wiki.kernel.org/index.php/Developer's_FAQ#Repeated_submissions and examples from others that submit patches to this list. > > Signed-off-by: Robbie Ko I've reworded the changelog for clarity and added it to my branch at: https://git.kernel.org/cgit/linux/kernel/git/fdmanana/linux.git/log/?h=for-chris-4.10 Thanks. 
> ---
>  fs/btrfs/file.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index fea31a4..809ca85 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2322,7 +2322,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>  	u64 tail_len;
>  	u64 orig_start = offset;
>  	u64 cur_offset;
> -	u64 min_size = btrfs_calc_trunc_metadata_size(root, 1);
> +	u64 min_size = btrfs_calc_trans_metadata_size(root, 1);
>  	u64 drop_end;
>  	int ret = 0;
>  	int err = 0;
> @@ -2469,7 +2469,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>  		ret = -ENOMEM;
>  		goto out_free;
>  	}
> -	rsv->size = btrfs_calc_trunc_metadata_size(root, 1);
> +	rsv->size = btrfs_calc_trans_metadata_size(root, 1);
>  	rsv->failfast = 1;
>
>  	/*
> --
> 1.9.1

-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."
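For readers wondering what the one-word change buys: as I read the 4.x helpers (check your own tree — this is a sketch, not the kernel code verbatim), the truncate-sized reservation covers CoW of one node per tree level, while the transaction-sized one reserves twice that, covering the case where a leaf split adds a new node at every level — which is exactly what hole punching can trigger:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LEVEL 8	/* btrfs b-tree height limit (BTRFS_MAX_LEVEL) */

/* Reservation for operations that only CoW existing nodes along
 * one root-to-leaf path. */
uint64_t calc_trunc_metadata_size(uint32_t nodesize, unsigned num_items)
{
	return (uint64_t)nodesize * MAX_LEVEL * num_items;
}

/* Reservation for operations that may also split nodes, i.e. touch
 * two paths; hole punching can insert new file extent items, so it
 * needs this larger one. */
uint64_t calc_trans_metadata_size(uint32_t nodesize, unsigned num_items)
{
	return (uint64_t)nodesize * MAX_LEVEL * 2 * num_items;
}
```

With the default 16KiB nodesize, the fix doubles the per-item reservation from 128KiB to 256KiB, which is what avoids the premature ENOSPC.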
Re: Convert from RAID 5 to 10
I completely agree, the whole wiki status is simply *FRUSTRATING*.

Niccolò Belli

On Wednesday, 30 November 2016 at 14:12:36 CET, Wilson Meier wrote:

Am 30/11/16 um 11:41 schrieb Duncan:

Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted:

...

Hi Duncan,

I understand your arguments but cannot fully agree. First of all, I'm not sticking with old, stale versions of whatever, as I try to keep my system up to date. My kernel is 4.8.4 (Gentoo) and btrfs-progs is 4.8.4. That being said, I'm quite aware of the heavy development status of btrfs, but pointing the finger at the users, saying that they don't fully understand the status of btrfs, without giving the information on the wiki, is in my opinion not the right way. Heavy development doesn't mean that features marked as ok are "not" or "mostly" ok in the context of overall btrfs stability. There is no indication on the wiki that raid1 or any other raid level (except for raid5/6) suffers from the problems stated in this thread. If there are known problems, then the stability matrix should point them out or link to a corresponding wiki entry; otherwise one has to assume that the features marked as "ok" are in fact "ok". And yes, the overall btrfs stability should be put on the wiki.

Just to give you a quick overview of my history with btrfs: I migrated away from MD RAID and ext4 to btrfs raid6 because of its CoW and checksum features, at a time when raid6 was not considered fully stable but also not as badly broken. After a few months I had a disk failure and the raid could not recover. I looked at the wiki and the mailing list and noticed that raid6 has been marked as badly broken :( I was quite happy to have a backup. So I asked on the btrfs IRC channel (the wiki had no relevant information) if raid10 is usable or suffers from the same problems. The summary was "Yes, it is usable and has no known problems". So I migrated to raid10.
Now I know that raid10 (marked as ok) also has problems with 2 disk failures in different stripes and can in fact lead to data loss. I thought, hmm ok, I'll split my data and use raid1 (marked as ok). And again, the mailing list states that raid1 also has problems in case of recovery.

It is really disappointing to not have this information in the wiki itself. This would have saved me, and I'm quite sure others too, a lot of time.

Sorry for being a bit frustrated.
Re: Convert from RAID 5 to 10
Am 30/11/16 um 15:37 schrieb Austin S. Hemmelgarn: > On 2016-11-30 08:12, Wilson Meier wrote: >> Am 30/11/16 um 11:41 schrieb Duncan: >>> Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted: >>> Am 30/11/16 um 09:06 schrieb Martin Steigerwald: > Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: >> [snip] > So the stability matrix would need to be updated not to recommend any > kind of BTRFS RAID 1 at the moment? > > Actually I faced the BTRFS RAID 1 read only after first attempt of > mounting it "degraded" just a short time ago. > > BTRFS still needs way more stability work it seems to me. > I would say the matrix should be updated to not recommend any RAID Level as from the discussion it seems they all of them have flaws. To me RAID is broken if one cannot expect to recover from a device failure in a solid way as this is why RAID is used. Correct me if i'm wrong. Right now i'm making my thoughts about migrating to another FS and/or Hardware RAID. >>> It should be noted that no list regular that I'm aware of anyway, would >>> make any claims about btrfs being stable and mature either now or in >>> the >>> near-term future in any case. Rather to the contrary, as I >>> generally put >>> it, btrfs is still stabilizing and maturing, with backups one is >>> willing >>> to use (and as any admin of any worth would say, a backup that hasn't >>> been tested usable isn't yet a backup; the job of creating the backup >>> isn't done until that backup has been tested actually usable for >>> recovery) still extremely strongly recommended. Similarly, keeping up >>> with the list is recommended, as is staying relatively current on both >>> the kernel and userspace (generally considered to be within the latest >>> two kernel series of either current or LTS series kernels, and with a >>> similarly versioned btrfs userspace). 
>>> >>> In that context, btrfs single-device and raid1 (and raid0 of course) >>> are >>> quite usable and as stable as btrfs in general is, that being >>> stabilizing >>> but not yet fully stable and mature, with raid10 being slightly less so >>> and raid56 being much more experimental/unstable at this point. >>> >>> But that context never claims full stability even for the relatively >>> stable raid1 and single device modes, and in fact anticipates that >>> there >>> may be times when recovery from the existing filesystem may not be >>> practical, thus the recommendation to keep tested usable backups at the >>> ready. >>> >>> Meanwhile, it remains relatively common on this list for those >>> wondering >>> about their btrfs on long-term-stale (not a typo) "enterprise" distros, >>> or even debian-stale, to be actively steered away from btrfs, >>> especially >>> if they're not willing to update to something far more current than >>> those >>> distros often provide, because in general, the current stability status >>> of btrfs is in conflict with the reason people generally choose to use >>> that level of old and stale software in the first place -- they >>> prioritize tried and tested to work, stable and mature, over the latest >>> generally newer and flashier featured but sometimes not entirely >>> stable, >>> and btrfs at this point simply doesn't meet that sort of stability/ >>> maturity expectations, nor is it likely to for some time (measured in >>> years), due to all the reasons enumerated so well in the above thread. >>> >>> >>> In that context, the stability status matrix on the wiki is already >>> reasonably accurate, certainly so IMO, because "OK" in context means as >>> OK as btrfs is in general, and btrfs itself remains still stabilizing, >>> not fully stable and mature. 
>>> >>> If there IS an argument as to the accuracy of the raid0/1/10 OK status, >>> I'd argue it's purely due to people not understanding the status of >>> btrfs >>> in general, and that if there's a general deficiency at all, it's in >>> the >>> lack of a general stability status paragraph on that page itself >>> explaining all this, despite the fact that the main https:// >>> btrfs.wiki.kernel.org landing page states quite plainly under stability >>> status that btrfs remains under heavy development and that current >>> kernels are strongly recommended. (Tho were I editing it, there'd >>> certainly be a more prominent mention of keeping backups at the >>> ready as >>> well.) >>> >> Hi Duncan, >> >> i understand your arguments but cannot fully agree. >> First of all, i'm not sticking with old stale versions of whatever as i >> try to keep my system up2date. >> My kernel is 4.8.4 (Gentoo) and btrfs-progs is 4.8.4. >> That being said, i'm quite aware of the heavy development status of >> btrfs but pointing the finger on the users saying that they don't fully >> understand the status of btrfs without giving the information on the >> wiki is in my opinion not the right way. Heavy development doesn't mean >> that
Re: Convert from RAID 5 to 10
On 2016-11-30 09:04, Roman Mamedov wrote:

On Wed, 30 Nov 2016 07:50:17 -0500 "Austin S. Hemmelgarn" wrote:

*) Read performance is not optimized: all metadata is always read from the first device unless it has failed, data reads are supposedly balanced between devices per PID of the process reading. Better implementations dispatch reads per request to devices that are currently idle.

Based on what I've seen, the metadata reads get balanced too.

https://github.com/torvalds/linux/blob/v4.8/fs/btrfs/disk-io.c#L451

This starts from mirror number 0 and tries the others in incrementing order until it succeeds. It appears that as long as the mirror with copy #0 is up and not corrupted, all reads will simply get satisfied from it.

That's actually how all reads work; it's just that the PID selects what constitutes the 'first' copy. IIRC, that selection is done by a lower layer.

*) Write performance is not optimized: during long full-bandwidth sequential writes it is common to see devices writing not in parallel, but with long periods of just one device writing, then another. (Admittedly, it has been some time since I tested that.)

I've never seen this be an issue in practice, especially if you're using transparent compression (which caps extent size, and therefore I/O size to a given device, at 128k). I'm also sane enough that I'm not doing bulk streaming writes to traditional HDDs or fully saturating the bandwidth on my SSDs (you should be over-provisioning whenever possible). For a desktop user, unless you're doing real-time video recording at higher than HD resolution with high-quality surround sound, this probably isn't going to hit you (and even then you should be recording to a temporary location with much faster write speeds (tmpfs or ext4 without a journal, for example) because you'll likely get hit with fragmentation).
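The per-PID mirror selection discussed above boils down to a simple arithmetic choice, roughly (simplified from what the RAID1 read path does in kernels of that era; not the verbatim kernel code):

```c
#include <assert.h>

/* Pick which RAID1 copy a reader uses: the PID, modulo the number of
 * copies, offsets from the first candidate mirror. A single process
 * therefore always lands on the same copy; balance only emerges
 * across many processes with different pids. */
int pick_mirror(int pid, int num_copies, int first)
{
	return first + pid % num_copies;
}
```

This is why a single bulk reader never sees striped-read speedups: a process with pid 1000 reads copy 0 for every request, one with pid 1001 reads copy 1, and so on.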
I did not use compression while observing this.

Compression doesn't make things parallel, but it does cause BTRFS to distribute the writes more evenly, because it writes first one extent then the other, which in turn makes things much more efficient because you're not stalling as much waiting for the I/O queue to finish. It also means you have to write less overall to the disk, so on systems which can do LZO compression significantly faster than they can write to or read from the disk, it will generally improve performance all around.

Also, I don't know what is particularly insane about copying a 4-8 GB file onto a storage array. I'd expect both disks to write at the same time (like they do in pretty much any other RAID1 system), not one after another, effectively slowing down the entire operation by as much as 2x in extreme cases.

I'm not talking 4-8GB files, I'm talking really big stuff at least an order of magnitude larger than that, stuff like filesystem images and big databases. On the only system I have with traditional hard disks (7200RPM consumer SATA3 drives connected to an LSI MPT2SAS HBA, about 80-100MB/s bulk write speed to a single disk), an 8GB copy from tmpfs is in practice only about 20% slower to BTRFS raid1 mode than to XFS on top of a DM-RAID RAID1 volume, and about 30% slower than the same with ext4. In both cases, this is actually about 50% faster than ZFS (which does parallelize reads and writes) in an equivalent configuration on the same hardware. Comparing all of that to single-disk versions on the same hardware, I see roughly the same performance ratios between filesystems, and the same goes for running on the motherboard's SATA controller instead of the LSI HBA. In this case, I am using compression (and the data gets reasonable compression ratios), and I see both disks running at just below peak bandwidth; based on tracing, most of the difference is in the metadata updates required to change the extents.
I would love to see BTRFS properly parallelize writes and stripe reads sanely, but I seriously doubt it's going to have as much impact as you think, especially on systems with fast storage. As far as not mounting degraded by default, that's a conscious design choice that isn't going to change. There's a switch (adding 'degraded' to the mount options) to enable this behavior per-mount, so we're still on-par in that respect with LVM and MD, we just picked a different default. In this case, I actually feel it's a better default for most cases, because most regular users aren't doing exhaustive monitoring, and thus are not likely to notice the filesystem being mounted degraded until it's far too late. If the filesystem is degraded, then _something_ has happened that the user needs to know about, and until some sane monitoring solution is implemented, the easiest way to ensure this is to refuse to mount. The easiest is to write to dmesg and syslog, if a user doesn't monitor those either, it's their own fault; and the more user friendly one would be to still auto mount degraded, but
Re: [PATCH] btrfs-progs: Fix extents after finding all errors
On Thu, Nov 10, 2016 at 09:01:47AM -0600, Goldwyn Rodrigues wrote:
> Simplifying the logic of fixing.
>
> Calling fixup_extent_ref() after encountering every error causes
> more error messages after the extent is fixed. In case of multiple errors,
> this is confusing because the error message is displayed after the fix
> message and it works on stale data. It is best to show all errors and
> then fix the extents.
>
> Set a variable and call fixup_extent_ref() if it is set. err is not used,
> so cleared it.

Sounds ok, more comments below.

> Signed-off-by: Goldwyn Rodrigues
> ---
>  cmds-check.c | 75 +++-
>  1 file changed, 24 insertions(+), 51 deletions(-)
>
> diff --git a/cmds-check.c b/cmds-check.c
> index 779870a..8fa0b38 100644
> --- a/cmds-check.c
> +++ b/cmds-check.c
> @@ -8994,6 +8994,9 @@ out:
>  		ret = err;
>  	}
>
> +	if (!ret)
> +		fprintf(stderr, "Repaired extent references for %llu\n", (unsigned long long)rec->start);

Line too long, please stick to ~80 chars, here it's easy to break the line after the string.
> +
>  	btrfs_release_path();
>  	return ret;
>  }
> @@ -9051,7 +9054,11 @@ static int fixup_extent_flags(struct btrfs_fs_info *fs_info,
>  	btrfs_set_extent_flags(path.nodes[0], ei, flags);
>  	btrfs_mark_buffer_dirty(path.nodes[0]);
>  	btrfs_release_path();
> -	return btrfs_commit_transaction(trans, root);
> +	ret = btrfs_commit_transaction(trans, root);
> +	if (!ret)
> +		fprintf(stderr, "Repaired extent flags for %llu\n", (unsigned long long)rec->start);
> +
> +	return ret;
>  }
>
>  /* right now we only prune from the extent allocation tree */
> @@ -9178,11 +9185,8 @@ static int check_extent_refs(struct btrfs_root *root,
>  {
>  	struct extent_record *rec;
>  	struct cache_extent *cache;
> -	int err = 0;
>  	int ret = 0;
> -	int fixed = 0;
>  	int had_dups = 0;
> -	int recorded = 0;
>
>  	if (repair) {
>  		/*
> @@ -9251,9 +9255,8 @@ static int check_extent_refs(struct btrfs_root *root,
>
>  	while(1) {
>  		int cur_err = 0;
> +		int fix = 0;
>
> -		fixed = 0;
> -		recorded = 0;
>  		cache = search_cache_extent(extent_cache, 0);
>  		if (!cache)
>  			break;
> @@ -9261,7 +9264,6 @@ static int check_extent_refs(struct btrfs_root *root,
>  		if (rec->num_duplicates) {
>  			fprintf(stderr, "extent item %llu has multiple extent "
>  				"items\n", (unsigned long long)rec->start);
> -			err = 1;
>  			cur_err = 1;
>  		}
>
> @@ -9272,57 +9274,33 @@ static int check_extent_refs(struct btrfs_root *root,
>  			fprintf(stderr, "extent item %llu, found %llu\n",
>  				(unsigned long long)rec->extent_item_refs,
>  				(unsigned long long)rec->refs);
> -			ret = record_orphan_data_extents(root->fs_info, rec);
> -			if (ret < 0)
> +			fix = record_orphan_data_extents(root->fs_info, rec);
> +			if (fix < 0)
>  				goto repair_abort

I think ret has to be set to fix here as well (in some way, e.g. not using fix for a return value), otherwise the repair_abort label will not take the same code path as before.

> -			if (ret == 0) {
> -				recorded = 1;
> -			} else {
> -				/*
> -				 * we can't use the extent to repair file
> -				 * extent, let the fallback method handle it.
> -				 */
> -				if (!fixed && repair) {
> -					ret = fixup_extent_refs(
> -							root->fs_info,
> -							extent_cache, rec);
> -					if (ret)
> -						goto repair_abort;
> -					fixed = 1;
> -				}
> -			}
> -			err = 1;
Re: [PATCH] Minor coverity defect fix - CID 1125928 In set_file_xattrs: Dereference of an explicit null value
Hi,

this patch lacks basic formatting requirements; this has been extensively documented, e.g. here: https://btrfs.wiki.kernel.org/index.php/Developer's_FAQ .

Besides the formalities, I'm missing the rationale for the change. It deals with a strange case where the xattr name length is 0, which is unexpected and should not be handled silently. Next, I'm not sure if bailing out of the function is right; there are more items to process. It would be best if we could skip the damaged ones but still continue.
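The "skip the damaged ones but still continue" suggestion could look roughly like this — a hypothetical sketch, not the actual set_file_xattrs() code; the struct and names are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for an xattr item as walked by the checker. */
struct xattr_item {
	unsigned name_len;
	const char *name;
};

/* Process every item we can; warn about and skip zero-length names
 * instead of bailing out of the whole function on the first bad item. */
int process_xattrs(const struct xattr_item *items, size_t n)
{
	int processed = 0;

	for (size_t i = 0; i < n; i++) {
		if (items[i].name_len == 0) {
			fprintf(stderr, "warning: skipping xattr with empty name\n");
			continue;	/* damaged entry: keep going */
		}
		processed++;
	}
	return processed;
}
```

A damaged entry then produces a visible warning rather than being handled silently, and the remaining items are still processed.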
Re: [PATCH] btrfs-progs: mkfs, balance convert: warn about RAID5/6 in fiery letters
On Mon, Nov 28, 2016 at 07:51:53PM +0100, Adam Borowski wrote:
> People who don't frequent IRC nor the mailing list tend to believe RAID 5/6
> are stable; this leads to data loss. Thus, let's do warn them.
>
> At this point, I think fiery letters that won't be missed are warranted.
>
> Kernel 4.9 and its -progs will be a part of LTS of multiple distributions,
> so leaving experimental features without a warning is inappropriate.

I'm ok with adding the warning about the raid56 feature, but I have some comments about how it's implemented.

A special-case warning for raid56 is ok, as it corresponds to the 'mkfs_features' table, where the missing value for 'safe' should lead to a similar warning. This is planned to be more generic, so I just want to make sure we can adjust it later without problems.

The warning should go last, after the final summary (and respect the verbosity level). If the message were not colored, I'd completely miss the warning. This also means the warning should not be printed from a helper function and not during the option parsing phase.

The colors seem a bit too much to me; red text, or just emphasizing 'warning', would IMHO suffice.
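One way to "just emphasize 'warning'" while keeping plain output for non-terminals — a hypothetical sketch, not the actual btrfs-progs patch, with a made-up helper name:

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* Format a warning, wrapping only the word WARNING in red when the
 * output is a terminal; pipes and log files get plain text. */
int format_warning(char *buf, size_t len, int is_tty, const char *msg)
{
	const char *on  = is_tty ? "\033[1;31m" : "";
	const char *off = is_tty ? "\033[0m" : "";

	return snprintf(buf, len, "%sWARNING%s: %s", on, off, msg);
}
```

A caller would pass the result of isatty(fileno(stderr)) for is_tty and print the message last, after the final mkfs summary, so it is the last thing the user sees.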
Re: [PATCH 2/2] btrfs-progs: Remove duplicate printfs in warning_trace()/assert_trace()
On Tue, Nov 29, 2016 at 10:25:14AM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues
>
> Code reduction. Call warning_trace from assert_trace in order to
> reduce the printf's used. Also, trace variable in warning_trace()
> is not required because it is already handled by BTRFS_DISABLE_BACKTRACE.

This drops the distinction between BUG_ON and WARN_ON, but I'm not sure we need it.

Patch applied, thanks.
Re: [PATCH 1/2] btrfs-progs: Correct value printed by assertions/BUG_ON/WARN_ON
On Tue, Nov 29, 2016 at 10:24:52AM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues
>
> The values passed to BUG_ON/WARN_ON are negated (!) and printed, which
> results in printing the value zero for each bug/warning. For example:
> volumes.c:988: btrfs_alloc_chunk: Assertion `ret` failed, value 0
>
> This is not useful. Instead, changed to print the value of the parameter
> passed to BUG_ON()/WARN_ON(). The value needed to be changed to long
> to accommodate pointers being passed.
>
> Also, consolidated assert() and BUG() into ifndef.
>
> Signed-off-by: Goldwyn Rodrigues

Applied, thanks.
Re: Convert from RAID 5 to 10
Re: Convert from RAID 5 to 10
On 2016-11-30 08:12, Wilson Meier wrote:
> Am 30/11/16 um 11:41 schrieb Duncan:
>> Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted:
>>> Am 30/11/16 um 09:06 schrieb Martin Steigerwald:
>>>> Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: [snip]
>>>> So the stability matrix would need to be updated not to recommend any kind of BTRFS RAID 1 at the moment? Actually I faced the BTRFS RAID 1 read only after first attempt of mounting it "degraded" just a short time ago. BTRFS still needs way more stability work it seems to me.
>>> I would say the matrix should be updated to not recommend any RAID Level as from the discussion it seems they all of them have flaws. To me RAID is broken if one cannot expect to recover from a device failure in a solid way as this is why RAID is used. Correct me if i'm wrong. Right now i'm making my thoughts about migrating to another FS and/or Hardware RAID.
>> It should be noted that no list regular that I'm aware of anyway, would make any claims about btrfs being stable and mature either now or in the near-term future in any case. Rather to the contrary, as I generally put it, btrfs is still stabilizing and maturing, with backups one is willing to use (and as any admin of any worth would say, a backup that hasn't been tested usable isn't yet a backup; the job of creating the backup isn't done until that backup has been tested actually usable for recovery) still extremely strongly recommended. Similarly, keeping up with the list is recommended, as is staying relatively current on both the kernel and userspace (generally considered to be within the latest two kernel series of either current or LTS series kernels, and with a similarly versioned btrfs userspace).
>>
>> In that context, btrfs single-device and raid1 (and raid0 of course) are quite usable and as stable as btrfs in general is, that being stabilizing but not yet fully stable and mature, with raid10 being slightly less so and raid56 being much more experimental/unstable at this point. But that context never claims full stability even for the relatively stable raid1 and single device modes, and in fact anticipates that there may be times when recovery from the existing filesystem may not be practical, thus the recommendation to keep tested usable backups at the ready.
>>
>> Meanwhile, it remains relatively common on this list for those wondering about their btrfs on long-term-stale (not a typo) "enterprise" distros, or even debian-stale, to be actively steered away from btrfs, especially if they're not willing to update to something far more current than those distros often provide, because in general, the current stability status of btrfs is in conflict with the reason people generally choose to use that level of old and stale software in the first place -- they prioritize tried and tested to work, stable and mature, over the latest generally newer and flashier featured but sometimes not entirely stable, and btrfs at this point simply doesn't meet that sort of stability/maturity expectations, nor is it likely to for some time (measured in years), due to all the reasons enumerated so well in the above thread.
>>
>> In that context, the stability status matrix on the wiki is already reasonably accurate, certainly so IMO, because "OK" in context means as OK as btrfs is in general, and btrfs itself remains still stabilizing, not fully stable and mature.
>>
>> If there IS an argument as to the accuracy of the raid0/1/10 OK status, I'd argue it's purely due to people not understanding the status of btrfs in general, and that if there's a general deficiency at all, it's in the lack of a general stability status paragraph on that page itself explaining all this, despite the fact that the main https://btrfs.wiki.kernel.org landing page states quite plainly under stability status that btrfs remains under heavy development and that current kernels are strongly recommended. (Tho were I editing it, there'd certainly be a more prominent mention of keeping backups at the ready as well.)
> Hi Duncan, i understand your arguments but cannot fully agree. First of all, i'm not sticking with old stale versions of whatever as i try to keep my system up2date. My kernel is 4.8.4 (Gentoo) and btrfs-progs is 4.8.4. That being said, i'm quite aware of the heavy development status of btrfs but pointing the finger on the users saying that they don't fully understand the status of btrfs without giving the information on the wiki is in my opinion not the right way. Heavy development doesn't mean that features marked as ok are "not" or "mostly" ok in the context of overall btrfs stability. There is no indication on the wiki that raid1 or every other raid (except for raid5/6) suffers from the problems stated in this thread.
The performance issues are inherent to BTRFS right now, and none of the other issues are likely to impact most regular users. Most of the people who would be interested in the features of BTRFS also have existing
Re: [PATCH] btrfs-progs: Use helper functions to access btrfs_super_block->sys_chunk_array_size
On Tue, Nov 29, 2016 at 08:29:02PM +0530, Chandan Rajendra wrote:
> btrfs_super_block->sys_chunk_array_size is stored as le32 data on
> disk. However insert_temp_chunk_item() writes sys_chunk_array_size in
> host cpu order. This commit fixes this by using super block access
> helper functions to read and write
> btrfs_super_block->sys_chunk_array_size field.
>
> Signed-off-by: Chandan Rajendra
> ---
>  utils.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/utils.c b/utils.c
> index d0189ad..7b17b20 100644
> --- a/utils.c
> +++ b/utils.c
> @@ -562,14 +562,17 @@ static int insert_temp_chunk_item(int fd, struct extent_buffer *buf,
>  	 */
>  	if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
>  		char *cur;
> +		u32 array_size;
>
>  		cur = (char *)sb->sys_chunk_array + sb->sys_chunk_array_size;

This should also use the accessor, 'sb' is directly mapped to the buffer read from disk.

>  		memcpy(cur, &disk_key, sizeof(disk_key));
>  		cur += sizeof(disk_key);
>  		read_extent_buffer(buf, cur, (unsigned long int)chunk,
>  				   btrfs_chunk_item_size(1));
> -		sb->sys_chunk_array_size += btrfs_chunk_item_size(1) +
> +		array_size = btrfs_super_sys_array_size(sb);
> +		array_size += btrfs_chunk_item_size(1) +
>  			sizeof(disk_key);
> +		btrfs_set_super_sys_array_size(sb, array_size);
>
>  		ret = write_temp_super(fd, sb, cfg->super_bytenr);
>  	}
> --
> 2.5.5
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Convert from RAID 5 to 10
On Wed, 30 Nov 2016 07:50:17 -0500 "Austin S. Hemmelgarn"wrote: > > *) Read performance is not optimized: all metadata is always read from the > > first device unless it has failed, data reads are supposedly balanced > > between > > devices per PID of the process reading. Better implementations dispatch > > reads > > per request to devices that are currently idle. > Based on what I've seen, the metadata reads get balanced too. https://github.com/torvalds/linux/blob/v4.8/fs/btrfs/disk-io.c#L451 This starts from the mirror number 0 and tries others in an incrementing order, until succeeds. It appears that as long as the mirror with copy #0 is up and not corrupted, all reads will simply get satisfied from it. > > *) Write performance is not optimized, during long full bandwidth sequential > > writes it is common to see devices writing not in parallel, but with a long > > periods of just one device writing, then another. (Admittedly have been some > > time since I tested that). > I've never seen this be an issue in practice, especially if you're using > transparent compression (which caps extent size, and therefore I/O size > to a given device, at 128k). I'm also sane enough that I'm not doing > bulk streaming writes to traditional HDD's or fully saturating the > bandwidth on my SSD's (you should be over-provisioning whenever > possible). For a desktop user, unless you're doing real-time video > recording at higher than HD resolution with high quality surround sound, > this probably isn't going to hit you (and even then you should be > recording to a temporary location with much faster write speeds (tmpfs > or ext4 without a journal for example) because you'll likely get hit > with fragmentation). I did not use compression while observing this; Also I don't know what is particularly insane about copying a 4-8 GB file onto a storage array. 
I'd expect both disks to write at the same time (like they do in pretty much any other RAID1 system), not one-after-another, effectively slowing down the entire operation by as much as 2x in extreme cases. > As far as not mounting degraded by default, that's a conscious design > choice that isn't going to change. There's a switch (adding 'degraded' > to the mount options) to enable this behavior per-mount, so we're still > on-par in that respect with LVM and MD, we just picked a different > default. In this case, I actually feel it's a better default for most > cases, because most regular users aren't doing exhaustive monitoring, > and thus are not likely to notice the filesystem being mounted degraded > until it's far too late. If the filesystem is degraded, then > _something_ has happened that the user needs to know about, and until > some sane monitoring solution is implemented, the easiest way to ensure > this is to refuse to mount. The easiest would be to write to dmesg and syslog; if a user doesn't monitor those either, it's their own fault. A more user-friendly option would be to still auto-mount degraded, but read-only. Compared to that, ext4 appears to have the "errors=continue" behavior by default, the user has to explicitly request "errors=remount-ro", and I have never seen anyone use or recommend the third option of "errors=panic", which is basically the equivalent of the current Btrfs practice. > > *) It does not properly handle a device disappearing during operation. > > (There > > is a patchset to add that). > > > > *) It does not properly handle said device returning (under a > > different /dev/sdX name, for bonus points). > These are not an easy problem to fix completely, especially considering > that the device is currently guaranteed to reappear under a different > name because BTRFS will still have an open reference on the original > device name. 
> > On top of that, if you've got hardware that's doing this without manual > intervention, you've got much bigger issues than how BTRFS reacts to it. > No correctly working hardware should be doing this. Unplugging and replugging a SATA cable of a RAID1 member should never put your system under the risk of a massive filesystem corruption; you cannot say it absolutely doesn't with the current implementation. -- With respect, Roman
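The error-handling defaults being compared can be put side by side in an fstab sketch. The UUIDs and mount points below are placeholders, not taken from this thread:

```
# btrfs has no errors= mount option; a degraded raid1 simply refuses to
# mount unless 'degraded' is added by hand (rootflags=degraded for /).
UUID=aaaaaaaa-...  /      btrfs  defaults                     0 1

# ext4 defaults to errors=continue; remount-ro must be requested
# explicitly, and errors=panic (closest to the btrfs refuse-to-mount
# stance) is rarely seen in practice.
UUID=bbbbbbbb-...  /data  ext4   defaults,errors=remount-ro   0 2
```

Note that for ext4 the effective default can also be baked into the superblock with `tune2fs -e`, so distributions differ in what an unadorned `defaults` actually does.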
Re: Convert from RAID 5 to 10
Am 30/11/16 um 11:41 schrieb Duncan: > Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted: > >> Am 30/11/16 um 09:06 schrieb Martin Steigerwald: >>> Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: [snip] >>> So the stability matrix would need to be updated not to recommend any >>> kind of BTRFS RAID 1 at the moment? >>> >>> Actually I faced the BTRFS RAID 1 read only after first attempt of >>> mounting it "degraded" just a short time ago. >>> >>> BTRFS still needs way more stability work it seems to me. >>> >> I would say the matrix should be updated to not recommend any RAID Level >> as from the discussion it seems they all of them have flaws. >> To me RAID is broken if one cannot expect to recover from a device >> failure in a solid way as this is why RAID is used. >> Correct me if i'm wrong. Right now i'm making my thoughts about >> migrating to another FS and/or Hardware RAID. > It should be noted that no list regular that I'm aware of anyway, would > make any claims about btrfs being stable and mature either now or in the > near-term future in any case. Rather to the contrary, as I generally put > it, btrfs is still stabilizing and maturing, with backups one is willing > to use (and as any admin of any worth would say, a backup that hasn't > been tested usable isn't yet a backup; the job of creating the backup > isn't done until that backup has been tested actually usable for > recovery) still extremely strongly recommended. Similarly, keeping up > with the list is recommended, as is staying relatively current on both > the kernel and userspace (generally considered to be within the latest > two kernel series of either current or LTS series kernels, and with a > similarly versioned btrfs userspace). 
> > In that context, btrfs single-device and raid1 (and raid0 of course) are > quite usable and as stable as btrfs in general is, that being stabilizing > but not yet fully stable and mature, with raid10 being slightly less so > and raid56 being much more experimental/unstable at this point. > > But that context never claims full stability even for the relatively > stable raid1 and single device modes, and in fact anticipates that there > may be times when recovery from the existing filesystem may not be > practical, thus the recommendation to keep tested usable backups at the > ready. > > Meanwhile, it remains relatively common on this list for those wondering > about their btrfs on long-term-stale (not a typo) "enterprise" distros, > or even debian-stale, to be actively steered away from btrfs, especially > if they're not willing to update to something far more current than those > distros often provide, because in general, the current stability status > of btrfs is in conflict with the reason people generally choose to use > that level of old and stale software in the first place -- they > prioritize tried and tested to work, stable and mature, over the latest > generally newer and flashier featured but sometimes not entirely stable, > and btrfs at this point simply doesn't meet that sort of stability/ > maturity expectations, nor is it likely to for some time (measured in > years), due to all the reasons enumerated so well in the above thread. > > > In that context, the stability status matrix on the wiki is already > reasonably accurate, certainly so IMO, because "OK" in context means as > OK as btrfs is in general, and btrfs itself remains still stabilizing, > not fully stable and mature. 
> > If there IS an argument as to the accuracy of the raid0/1/10 OK status, > I'd argue it's purely due to people not understanding the status of btrfs > in general, and that if there's a general deficiency at all, it's in the > lack of a general stability status paragraph on that page itself > explaining all this, despite the fact that the main https:// > btrfs.wiki.kernel.org landing page states quite plainly under stability > status that btrfs remains under heavy development and that current > kernels are strongly recommended. (Tho were I editing it, there'd > certainly be a more prominent mention of keeping backups at the ready as > well.) > Hi Duncan, i understand your arguments but cannot fully agree. First of all, i'm not sticking with old stale versions of whatever as i try to keep my system up2date. My kernel is 4.8.4 (Gentoo) and btrfs-progs is 4.8.4. That being said, i'm quite aware of the heavy development status of btrfs but pointing the finger on the users saying that they don't fully understand the status of btrfs without giving the information on the wiki is in my opinion not the right way. Heavy development doesn't mean that features marked as ok are "not" or "mostly" ok in the context of overall btrfs stability. There is no indication on the wiki that raid1 or every other raid (except for raid5/6) suffers from the problems stated in this thread. If there are known problems then the stability matrix should point
[PULL] Btrfs updates for 4.10
Hi, here's my first pull request for 4.10. Assorted patches that have been in for-next, mostly fixes and some cleanups. I'm expecting to send one more before the rc1, I don't see much reason to hold the current queue back for any longer. The following changes since commit e5517c2a5a49ed5e99047008629f1cd60246ea0e: Linux 4.9-rc7 (2016-11-27 13:08:04 -0800) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-chris-4.10 for you to fetch changes up to 515bdc479097ec9d5f389202842345af3162f71c: Merge branch 'misc-4.10' into for-chris-4.10-20161130 (2016-11-30 14:02:20 +0100) Adam Borowski (1): btrfs: make block group flags in balance printks human-readable Christoph Hellwig (9): btrfs: don't abuse REQ_OP_* flags for btrfs_map_block btrfs: use bio iterators for the decompression handlers btrfs: don't access the bio directly in the raid5/6 code btrfs: don't access the bio directly in the direct I/O code btrfs: don't access the bio directly in btrfs_csum_one_bio btrfs: use bi_size btrfs: calculate end of bio offset properly btrfs: refactor __btrfs_lookup_bio_sums to use bio_for_each_segment_all btrfs: use bio_for_each_segment_all in __btrfsic_submit_bio Christophe JAILLET (1): btrfs: remove redundant check of btrfs_iget return value David Sterba (17): btrfs: remove unused headers, statfs.h btrfs: remove stale comment from btrfs_statfs btrfs: rename helper macros for qgroup and aux data casts btrfs: reada, cleanup remove unneeded variable in __readahead_hook btrfs: reada, remove unused parameter from __readahead_hook btrfs: reada, sink start parameter to btree_readahead_hook btrfs: reada, remove pointless BUG_ON in reada_find_extent btrfs: reada, remove pointless BUG_ON check for fs_info btrfs: remove trivial helper btrfs_find_tree_block btrfs: delete unused member from superblock btrfs: introduce helpers for updating eb uuids btrfs: use new helpers to set uuids in eb btrfs: use specialized page copying helpers in 
btrfs_clone_extent_buffer btrfs: remove constant parameter to memset_extent_buffer and rename it btrfs: add optimized version of eb to eb copy btrfs: store and load values of stripes_min/stripes_max in balance status item Merge branch 'misc-4.10' into for-chris-4.10-20161130 Domagoj Tršan (1): btrfs: change btrfs_csum_final result param type to u8 Jeff Mahoney (3): btrfs: remove old tree_root dirent processing in btrfs_real_readdir() btrfs: increment ctx->pos for every emitted or skipped dirent in readdir btrfs: Ensure proper sector alignment for btrfs_free_reserved_data_space Josef Bacik (2): Btrfs: fix file extent corruption Btrfs: abort transaction if fill_holes() fails Liu Bo (1): Btrfs: adjust len of writes if following a preallocated extent Nick Terrell (1): btrfs: Call kunmap if zlib_inflateInit2 fails Omar Sandoval (1): Btrfs: deal with existing encompassing extent map in btrfs_get_extent() Qu Wenruo (4): btrfs: qgroup: Add comments explaining how btrfs qgroup works btrfs: qgroup: Rename functions to make it follow reserve,trace,account steps btrfs: Export and move leaf/subtree qgroup helpers to qgroup.c btrfs: qgroup: Fix qgroup data leaking by using subtree tracing Shailendra Verma (1): btrfs: return early from failed memory allocations in ioctl handlers Wang Xiaoguang (3): btrfs: cleanup: use already calculated value in btrfs_should_throttle_delayed_refs() btrfs: add necessary comments about tickets_id btrfs: improve delayed refs iterations Xiaoguang Wang (1): btrfs: remove useless comments fs/btrfs/check-integrity.c | 32 ++--- fs/btrfs/compression.c | 142 - fs/btrfs/compression.h | 12 +- fs/btrfs/ctree.c | 49 +++- fs/btrfs/ctree.h | 14 ++- fs/btrfs/delayed-inode.c | 3 +- fs/btrfs/delayed-inode.h | 2 +- fs/btrfs/delayed-ref.c | 20 ++- fs/btrfs/delayed-ref.h | 8 ++ fs/btrfs/disk-io.c | 30 ++--- fs/btrfs/disk-io.h | 4 +- fs/btrfs/extent-tree.c | 263 --- fs/btrfs/extent_io.c | 49 ++-- fs/btrfs/extent_io.h | 9 +- fs/btrfs/file-item.c | 55 fs/btrfs/file.c | 
35 +- fs/btrfs/free-space-cache.c | 10 +- fs/btrfs/inode.c | 163 fs/btrfs/ioctl.c | 32 ++--- fs/btrfs/lzo.c | 17 +-- fs/btrfs/qgroup.c|
Re: [PATCH v4 1/3] btrfs: Add WARN_ON for qgroup reserved underflow
On Wed, Nov 30, 2016 at 08:24:32AM +0800, Qu Wenruo wrote: > > > At 11/30/2016 12:10 AM, David Sterba wrote: > > On Mon, Nov 28, 2016 at 09:40:07AM +0800, Qu Wenruo wrote: > >> Goldwyn Rodrigues has exposed and fixed a bug which underflows btrfs > >> qgroup reserved space, and leads to non-writable fs. > >> > >> This reminds us that we don't have enough underflow check for qgroup > >> reserved space. > >> > >> For underflow case, we should not really underflow the numbers but warn > >> and keeps qgroup still work. > >> > >> So add more check on qgroup reserved space and add WARN_ON() and > >> btrfs_warn() for any underflow case. > >> > >> Signed-off-by: Qu Wenruo> >> Reviewed-by: David Sterba > > > > One of the warnings is visible during xfstests > > (btrfs_qgroup_free_refroot), is there a fix? Either a patch in > > mailinglist or work in progress. If not, I'm a bit reluctant to add it > > to 4.10 as we'd get that reported from users for sure. > > Fix WIP, ETA would be in this week. Good, thanks. > At least, this warning is working and helped us to find bugs. No doubt about that, I'll keep the patch in 4.10 queue but will not add it to the first pull that I'm about to send today.
Re: Convert from RAID 5 to 10
On 2016-11-30 00:38, Roman Mamedov wrote: On Wed, 30 Nov 2016 00:16:48 +0100 Wilson Meierwrote: That said, btrfs shouldn't be used for other then raid1 as every other raid level has serious problems or at least doesn't work as the expected raid level (in terms of failure recovery). RAID1 shouldn't be used either: *) Read performance is not optimized: all metadata is always read from the first device unless it has failed, data reads are supposedly balanced between devices per PID of the process reading. Better implementations dispatch reads per request to devices that are currently idle. Based on what I've seen, the metadata reads get balanced too. As far as the read balancing in general, while it doesn't work very well for single processes, but if you have a large number of processes started sequentially (for example, a thread-pool based server), it actually works out to being near optimal with a lot less logic than DM and MD have. Aggregated over an entire system it's usually near optimal as well. *) Write performance is not optimized, during long full bandwidth sequential writes it is common to see devices writing not in parallel, but with a long periods of just one device writing, then another. (Admittedly have been some time since I tested that). I've never seen this be an issue in practice, especially if you're using transparent compression (which caps extent size, and therefore I/O size to a given device, at 128k). I'm also sane enough that I'm not doing bulk streaming writes to traditional HDD's or fully saturating the bandwidth on my SSD's (you should be over-provisioning whenever possible). For a desktop user, unless you're doing real-time video recording at higher than HD resolution with high quality surround sound, this probably isn't going to hit you (and even then you should be recording to a temporary location with much faster write speeds (tmpfs or ext4 without a journal for example) because you'll likely get hit with fragmentation). 
This also has overall pretty low impact compared to a number of other things that BTRFS does (BTRFS on a single disk with single profile for everything versus 2 of the same disks with raid1 profile for everything gets less than a 20% performance difference in all the testing I've done). *) A degraded RAID1 won't mount by default. If this was the root filesystem, the machine won't boot. To mount it, you need to add the "degraded" mount option. However you have exactly a single chance at that, you MUST restore the RAID to non-degraded state while it's mounted during that session, since it won't ever mount again in the r/w+degraded mode, and in r/o mode you can't perform any operations on the filesystem, including adding/removing devices. There is a fix pending for the single chance to mount degraded thing, and even then, it only applies to a 2 device raid1 array (with more devices, new chunks are still raid1 if you're missing 1 device, so the checks don't trigger and refuse the mount). As far as not mounting degraded by default, that's a conscious design choice that isn't going to change. There's a switch (adding 'degraded' to the mount options) to enable this behavior per-mount, so we're still on-par in that respect with LVM and MD, we just picked a different default. In this case, I actually feel it's a better default for most cases, because most regular users aren't doing exhaustive monitoring, and thus are not likely to notice the filesystem being mounted degraded until it's far too late. If the filesystem is degraded, then _something_ has happened that the user needs to know about, and until some sane monitoring solution is implemented, the easiest way to ensure this is to refuse to mount. *) It does not properly handle a device disappearing during operation. (There is a patchset to add that). *) It does not properly handle said device returning (under a different /dev/sdX name, for bonus points). 
These are not an easy problem to fix completely, especially considering that the device is currently guaranteed to reappear under a different name because BTRFS will still have an open reference on the original device name. On top of that, if you've got hardware that's doing this without manual intervention, you've got much bigger issues than how BTRFS reacts to it. No correctly working hardware should be doing this. Most of these also apply to all other RAID levels.
Btrfs progs release 4.8.5
Hi, btrfs-progs version 4.8.5 has been released; it contains an urgent bugfix for receive, which mistakenly reported an error on valid streams. The bug was introduced in 4.8.4 by me, my apologies.

Changes:
* receive: fix detection of end of stream (error reported even for valid streams)
* other:
  * added test for the receive bug
  * fix linking of library-test

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:
David Sterba (7):
  btrfs-progs: docs: fix typo in btrfs-man5
  btrfs-progs: receive: properly detect end of stream conditions
  btrfs-progs: tests: end of stream conditions
  btrfs-progs: tests: add correct rpath to library-test
  btrfs-progs: test: fix static build of library-test
  btrfs-progs: update CHANGES for v4.8.5
  Btrfs progs v4.8.5
Re: Convert from RAID 5 to 10
Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted: > Am 30/11/16 um 09:06 schrieb Martin Steigerwald: >> Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: >>> On Wed, 30 Nov 2016 00:16:48 +0100 >>> >>> Wilson Meierwrote: That said, btrfs shouldn't be used for other then raid1 as every other raid level has serious problems or at least doesn't work as the expected raid level (in terms of failure recovery). >>> RAID1 shouldn't be used either: >>> >>> *) Read performance is not optimized: all metadata is always read from >>> the first device unless it has failed, data reads are supposedly >>> balanced between devices per PID of the process reading. Better >>> implementations dispatch reads per request to devices that are >>> currently idle. >>> >>> *) Write performance is not optimized, during long full bandwidth >>> sequential writes it is common to see devices writing not in parallel, >>> but with a long periods of just one device writing, then another. >>> (Admittedly have been some time since I tested that). >>> >>> *) A degraded RAID1 won't mount by default. >>> >>> If this was the root filesystem, the machine won't boot. >>> >>> To mount it, you need to add the "degraded" mount option. >>> However you have exactly a single chance at that, you MUST restore the >>> RAID to non-degraded state while it's mounted during that session, >>> since it won't ever mount again in the r/w+degraded mode, and in r/o >>> mode you can't perform any operations on the filesystem, including >>> adding/removing devices. >>> >>> *) It does not properly handle a device disappearing during operation. >>> (There is a patchset to add that). >>> >>> *) It does not properly handle said device returning (under a >>> different /dev/sdX name, for bonus points). >>> >>> Most of these also apply to all other RAID levels. >> So the stability matrix would need to be updated not to recommend any >> kind of BTRFS RAID 1 at the moment? 
>> >> Actually I faced the BTRFS RAID 1 read only after first attempt of >> mounting it "degraded" just a short time ago. >> >> BTRFS still needs way more stability work it seems to me. >> > I would say the matrix should be updated to not recommend any RAID Level > as from the discussion it seems they all of them have flaws. > To me RAID is broken if one cannot expect to recover from a device > failure in a solid way as this is why RAID is used. > Correct me if i'm wrong. Right now i'm making my thoughts about > migrating to another FS and/or Hardware RAID. It should be noted that no list regular that I'm aware of anyway, would make any claims about btrfs being stable and mature either now or in the near-term future in any case. Rather to the contrary, as I generally put it, btrfs is still stabilizing and maturing, with backups one is willing to use (and as any admin of any worth would say, a backup that hasn't been tested usable isn't yet a backup; the job of creating the backup isn't done until that backup has been tested actually usable for recovery) still extremely strongly recommended. Similarly, keeping up with the list is recommended, as is staying relatively current on both the kernel and userspace (generally considered to be within the latest two kernel series of either current or LTS series kernels, and with a similarly versioned btrfs userspace). In that context, btrfs single-device and raid1 (and raid0 of course) are quite usable and as stable as btrfs in general is, that being stabilizing but not yet fully stable and mature, with raid10 being slightly less so and raid56 being much more experimental/unstable at this point. But that context never claims full stability even for the relatively stable raid1 and single device modes, and in fact anticipates that there may be times when recovery from the existing filesystem may not be practical, thus the recommendation to keep tested usable backups at the ready. 
Meanwhile, it remains relatively common on this list for those wondering about their btrfs on long-term-stale (not a typo) "enterprise" distros, or even debian-stale, to be actively steered away from btrfs, especially if they're not willing to update to something far more current than those distros often provide, because in general, the current stability status of btrfs is in conflict with the reason people generally choose to use that level of old and stale software in the first place -- they prioritize tried and tested to work, stable and mature, over the latest generally newer and flashier featured but sometimes not entirely stable, and btrfs at this point simply doesn't meet that sort of stability/ maturity expectations, nor is it likely to for some time (measured in years), due to all the reasons enumerated so well in the above thread. In that context, the stability status matrix on the wiki is already reasonably accurate, certainly so IMO, because "OK" in context means as OK as
Re: Convert from RAID 5 to 10
Am 30/11/16 um 09:06 schrieb Martin Steigerwald: > Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: >> On Wed, 30 Nov 2016 00:16:48 +0100 >> >> Wilson Meierwrote: >>> That said, btrfs shouldn't be used for other then raid1 as every other >>> raid level has serious problems or at least doesn't work as the expected >>> raid level (in terms of failure recovery). >> RAID1 shouldn't be used either: >> >> *) Read performance is not optimized: all metadata is always read from the >> first device unless it has failed, data reads are supposedly balanced >> between devices per PID of the process reading. Better implementations >> dispatch reads per request to devices that are currently idle. >> >> *) Write performance is not optimized, during long full bandwidth sequential >> writes it is common to see devices writing not in parallel, but with a long >> periods of just one device writing, then another. (Admittedly have been >> some time since I tested that). >> >> *) A degraded RAID1 won't mount by default. >> >> If this was the root filesystem, the machine won't boot. >> >> To mount it, you need to add the "degraded" mount option. >> However you have exactly a single chance at that, you MUST restore the RAID >> to non-degraded state while it's mounted during that session, since it >> won't ever mount again in the r/w+degraded mode, and in r/o mode you can't >> perform any operations on the filesystem, including adding/removing >> devices. >> >> *) It does not properly handle a device disappearing during operation. >> (There is a patchset to add that). >> >> *) It does not properly handle said device returning (under a >> different /dev/sdX name, for bonus points). >> >> Most of these also apply to all other RAID levels. > So the stability matrix would need to be updated not to recommend any kind of > BTRFS RAID 1 at the moment? > > Actually I faced the BTRFS RAID 1 read only after first attempt of mounting > it > "degraded" just a short time ago. 
> > BTRFS still needs way more stability work it seems to me. > I would say the matrix should be updated to not recommend any RAID Level as from the discussion it seems they all of them have flaws. To me RAID is broken if one cannot expect to recover from a device failure in a solid way as this is why RAID is used. Correct me if i'm wrong. Right now i'm making my thoughts about migrating to another FS and/or Hardware RAID.
Re: Convert from RAID 5 to 10
Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: > On Wed, 30 Nov 2016 00:16:48 +0100 > > Wilson Meierwrote: > > That said, btrfs shouldn't be used for other then raid1 as every other > > raid level has serious problems or at least doesn't work as the expected > > raid level (in terms of failure recovery). > > RAID1 shouldn't be used either: > > *) Read performance is not optimized: all metadata is always read from the > first device unless it has failed, data reads are supposedly balanced > between devices per PID of the process reading. Better implementations > dispatch reads per request to devices that are currently idle. > > *) Write performance is not optimized, during long full bandwidth sequential > writes it is common to see devices writing not in parallel, but with a long > periods of just one device writing, then another. (Admittedly have been > some time since I tested that). > > *) A degraded RAID1 won't mount by default. > > If this was the root filesystem, the machine won't boot. > > To mount it, you need to add the "degraded" mount option. > However you have exactly a single chance at that, you MUST restore the RAID > to non-degraded state while it's mounted during that session, since it > won't ever mount again in the r/w+degraded mode, and in r/o mode you can't > perform any operations on the filesystem, including adding/removing > devices. > > *) It does not properly handle a device disappearing during operation. > (There is a patchset to add that). > > *) It does not properly handle said device returning (under a > different /dev/sdX name, for bonus points). > > Most of these also apply to all other RAID levels. So the stability matrix would need to be updated not to recommend any kind of BTRFS RAID 1 at the moment? Actually I faced the BTRFS RAID 1 read only after first attempt of mounting it "degraded" just a short time ago. BTRFS still needs way more stability work it seems to me. 
-- Martin