Re: Power down tests...
On Fri, Aug 04, 2017 at 11:21:15AM +0530, Shyam Prasad N wrote:
> We're running a couple of experiments on our servers with btrfs
> (kernel version 4.4), and we're running some abrupt power-off tests
> for a couple of scenarios:
>
> 1. We have a filesystem on top of two different btrfs filesystems
> (distributed across N disks), i.e. our filesystem lays out data and
> metadata on top of these two filesystems. The test workload generates
> a good number of 16MB files on top of the system. On abrupt power-off
> and the following reboot, what are the recommended steps to run? We're
> attempting a btrfs mount, which sometimes fails. If it fails, we run
> fsck and then mount the btrfs. The issue we're facing is that a few
> files end up zero-sized. As a result, there is either data loss or an
> inconsistency in the stacked filesystem's metadata.

Sounds like you want to mount with -o flushoncommit.

> We're mounting the btrfs with a commit period of 5s. However, I do
> expect btrfs to journal the I/Os that are still dirty. Why then are we
> seeing the above behaviour?

By default, btrfs guarantees only metadata consistency, like most
filesystems. This improves performance at the cost of failing use cases
like yours.

-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
⠈⠳⣄   • use glitches to walk on water
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
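The -o flushoncommit suggestion, combined with the 5s commit interval from the quoted message, could be wired up as an fstab entry like the sketch below. The device and mount point are placeholders, not from the thread; the option semantics are as documented in btrfs(5):

```shell
# /etc/fstab fragment (sketch; /dev/sdb1 and /srv/data are placeholders)
# flushoncommit: flush dirty data at every transaction commit, so after a
#                power loss, file contents are consistent as of the last commit
# commit=5:      commit a transaction at most every 5 seconds
/dev/sdb1  /srv/data  btrfs  flushoncommit,commit=5  0  0
```

With this, a crash should lose at most roughly the last 5 seconds of writes rather than producing files whose data never reached disk at all.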
Power down tests...
Hi all,

We're running a couple of experiments on our servers with btrfs (kernel
version 4.4), and we're running some abrupt power-off tests for a
couple of scenarios:

1. We have a filesystem on top of two different btrfs filesystems
(distributed across N disks), i.e. our filesystem lays out data and
metadata on top of these two filesystems. The test workload generates a
good number of 16MB files on top of the system. On abrupt power-off and
the following reboot, what are the recommended steps to run? We're
attempting a btrfs mount, which sometimes fails. If it fails, we run
fsck and then mount the btrfs. The issue we're facing is that a few
files end up zero-sized. As a result, there is either data loss or an
inconsistency in the stacked filesystem's metadata. We're mounting the
btrfs with a commit period of 5s. However, I do expect btrfs to journal
the I/Os that are still dirty. Why then are we seeing the above
behaviour?

2. Another test we're running is to create one virtual disk on each of
multiple NFS mounts (soft mounts with a timeout of 1 min), and use
these virtual disks as individual devices of one single btrfs. What is
the expected btrfs behaviour when one of the virtual disks becomes
unresponsive for a period of time (say 5 min)? Does the expectation
change if the NFS mounts are mounted with the sync option?

Thanks in advance for any help you can offer.

-- 
-Shyam
Re: Massive loss of disk space
In 30 seconds I should be able to fill about 200MB/s * 30s = 6GB.
Requiring that the parity not grow larger than 6GB of additional space
is possible to live with on a 10TB disk.

It seems that for SnapRAID to have any chance of working correctly with
parity on a BTRFS partition, it would need a min-free configuration
parameter to make sure there is always enough free space for one parity
file update. But as it is right now, requiring that the disk isn't
filled past 50% because fallocate() wants enough free space for 100% of
the original file data to be rewritten is obviously not a working
solution.

Right now, it sounds like I should change all parity disks to a
different file system to avoid the CoW issue. There doesn't seem to be
any way to turn off CoW for an already existing file, and the parity
data is already way past 50% so I can't make a copy.

/Per W

On Thu, 3 Aug 2017, Goffredo Baroncelli wrote:

On 2017-08-03 13:44, Marat Khalili wrote:

On 02/08/17 20:52, Goffredo Baroncelli wrote:

consider the following scenario:
a) create a 2GB file
b) fallocate -o 1GB -l 2GB
c) write from 1GB to 3GB

After b), the expectation is that c) always succeeds [1]: i.e. there is
enough space on the filesystem. Due to the CoW nature of BTRFS, you
cannot rely on the already allocated space, because there could be a
small time window where both the old and the new data exist on the
disk.

Just curious. With the current implementation, in the following case:

a) create a 2GB file1 && create a 2GB file2
b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2

At this step you are trying to allocate 3GB+3GB = 6GB, so you have
exhausted the filesystem space.

c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2

Will (c) always succeed? I.e. does fallocate really allocate 2GB per
file, or does it only allocate an additional 1GB and check free space
for another 1GB? If it's only the latter, it is useless.
The file is physically extended:

ghigo@venice:/tmp$ fallocate -l 1000 foo.txt
ghigo@venice:/tmp$ ls -l foo.txt
-rw-r--r-- 1 ghigo ghigo 1000 Aug  3 18:00 foo.txt
ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
ghigo@venice:/tmp$ ls -l foo.txt
-rw-r--r-- 1 ghigo ghigo 1500 Aug  3 18:00 foo.txt
ghigo@venice:/tmp$

-- 
With Best Regards,
Marat Khalili

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
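The transcript above is easy to re-run as a script. Note that this part of the behaviour (fallocate with an offset/length range reaching past EOF grows the file to offset+length) is generic Linux, not btrfs-specific:

```shell
# Re-run the experiment in a scratch directory.
tmp=$(mktemp -d)
cd "$tmp"
fallocate -l 1000 foo.txt          # allocate 1000 bytes from offset 0
stat -c %s foo.txt                 # 1000
fallocate -o 500 -l 1000 foo.txt   # allocate 1000 bytes from offset 500
stat -c %s foo.txt                 # 1500 = offset 500 + length 1000
```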
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On Thu, Aug 3, 2017 at 2:45 PM, Brendan Hide wrote:
>
> To counter, I think this is a big problem with btrfs, especially in
> terms of user attrition. We don't need "GUI" tools. At all. But we do
> need btrfs to be self-sufficient enough that regular users don't get
> burnt by what they would view as unexpected behaviour. We currently
> have a situation where btrfs is too demanding on inexperienced users.

I think the top complaint is the manual nature of balancing to avoid
enospc when there's free space, followed by the balancing needed to
avoid/reduce free space fragmentation and thus maintain consistent
performance.

Obviously the kernel needs more intelligent code to free up partially
full block groups, or more correctly to free up contiguous space for it
to write to. That solves both problems. But in the meantime,
btrfs-progs should ship a policy to do some minimal balance to totally
obviate this and surface the edge cases. Maybe it's dusage=3 every day,
dusage=10 once a week, and dusage=20 musage=20 once a month. I don't
know, but some iteration in this area is better than saying "punt!" and
putting it in the user's lap.

Better would be a trigger that is statistics-based rather than
time-based. The metric might be some combination of workload (i.e.
idle) and ratios found in sysfs.

Anyway, the first step is for people on this list to stop micromanaging
their own volumes and try to center on a sane one-size-fits-all
solution, and then iterate better and better solutions as we determine
the edge cases where one size doesn't fit all. We're throwing hammers
at the problem by default because it's a learned behavior. We all need
to just stop balancing and act like regular users, and then figure out
how to automatically optimize.

> I feel we need better worst-case behaviours. For example, if *I* have
> a btrfs on its second-to-last-available chunk, it means I'm not
> micro-managing properly. But users shouldn't have to micro-manage in
> the first place.
> Btrfs (or a management tool) should just know to balance the
> least-used chunk and/or delete the lowest-priority snapshot, etc. It
> shouldn't cause my services/apps to give diskspace errors when,
> clearly, there is free space available.

Ideally the kernel code needs to do a better job of freeing up partial
block groups. But in the meantime, this can be set as an optimization
policy in user space. And it should be in btrfs-progs so it's
consistent across distros. SUSE has a distro-specific balancer, on a
systemd timer, but I don't think it's enabled by default, and I also
think it's weirdly too aggressive. If it could be made smarter, with a
trigger other than a timer, that'd be even better. But the manual
balancing needed to maintain performance has been one of the most
consistently negative user responses about Btrfs, so doing nothing is
not an option.

> The other "high-level" aspect would be along the lines of better
> guidance and standardisation for distros on how best to configure
> btrfs. This would include guidance/best practices for things like
> appropriate subvolume mountpoints and snapshot paths, and sensible
> schedules or logic (or perhaps even example tools/scripts) for
> balancing and scrubbing the filesystem.

Would they listen? My experience with openSUSE is: nope.

> I don't have all the answers. But I also don't want to have to tell
> people they can't adopt it because a) they don't (or never will)
> understand it; and b) they're going to resent me for their
> irresponsibly losing their own data.

Sure. You can read on the linux-raid@ list where there are still
constant problems with users doing crazy things like mdadm --create to
fix a raid assembly problem, obliterating their data by doing this, and
then getting pissed. It's like, where the hell do people keep getting
the idea of doing that?

There are six-ways-to-Sunday ways of fixing a Btrfs volume. It reads
like a choose-your-own-adventure book.
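Chris's strawman schedule above could ship as something as simple as a cron fragment. This is a sketch only: the dusage/musage thresholds are the guesses from the message, not a tested policy, and /srv/data is a placeholder mount point:

```shell
# /etc/cron.d/btrfs-balance -- sketch of the strawman schedule above.
# Reclaim nearly-empty data chunks daily, and be progressively more
# aggressive weekly (data) and monthly (data + metadata).
30 3 * * *  root  btrfs balance start -dusage=3  /srv/data
40 3 * * 0  root  btrfs balance start -dusage=10 /srv/data
50 3 1 * *  root  btrfs balance start -dusage=20 -musage=20 /srv/data
```

The usage filters only touch block groups below the given fill percentage, which keeps each run cheap compared to a full balance.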
No, actually it's worse, because at least the choose-your-own-adventure
book tells you what page to go to next, and Btrfs gives you zero advice
on what order to try things in.

-- 
Chris Murphy
[PATCH v2] Btrfs: search parity device wisely
After mapping the block with BTRFS_MAP_WRITE, parities have been sorted
to the end position, so this search can start from the first parity
stripe.

Signed-off-by: Liu Bo
---
v2: fix typo (data_stripes -> nr_data).

 fs/btrfs/raid56.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 2086383..88a5b7b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2229,12 +2229,13 @@ raid56_parity_alloc_scrub_rbio(struct btrfs_fs_info *fs_info, struct bio *bio,
 	ASSERT(!bio->bi_iter.bi_size);
 	rbio->operation = BTRFS_RBIO_PARITY_SCRUB;
 
-	for (i = 0; i < rbio->real_stripes; i++) {
+	for (i = rbio->nr_data; i < rbio->real_stripes; i++) {
 		if (bbio->stripes[i].dev == scrub_dev) {
 			rbio->scrubp = i;
 			break;
 		}
 	}
+	ASSERT(i < rbio->real_stripes);
 
 	/* Now we just support the sectorsize equals to page size */
 	ASSERT(fs_info->sectorsize == PAGE_SIZE);
-- 
2.9.4
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 08/03/2017 09:22 PM, Austin S. Hemmelgarn wrote:

On 2017-08-03 14:29, Christoph Anton Mitterer wrote:

On Thu, 2017-08-03 at 20:08 +0200, waxhead wrote:

There are no higher-level management tools (e.g. RAID
management/monitoring, etc.)...

[snip]

As far as 'higher-level' management tools, you're using your system
wrong if you _need_ them. There is no need for there to be a GUI, or a
web interface, or a DBus interface, or any other such bloat in the main
management tools; they work just fine as is and are mostly on par with
the interfaces provided by LVM, MD, and ZFS (other than the lack of
machine-parseable output). I'd also argue that if you can't reassemble
your storage stack by hand without using 'higher-level' tools, you
should not be using that storage stack, as you don't properly
understand it.

On the subject of monitoring specifically, part of the issue there is
kernel-side: any monitoring system currently needs to be polling-based,
not event-based, and as a result monitoring tends to be a very
system-specific affair based on how much overhead you're willing to
tolerate. The limited stuff that does exist is also trivial to
integrate with many pieces of existing monitoring infrastructure (like
Nagios or monit), and therefore the people who care about it a lot
(like me) are either monitoring by hand or are just using the tools
with their existing infrastructure (for example, I use monit already on
all my systems, so I just make sure to have entries in the config to
check error counters and scrub results), so there's not much in the way
of incentive for the concerned parties to reinvent the wheel.

To counter, I think this is a big problem with btrfs, especially in
terms of user attrition. We don't need "GUI" tools. At all. But we do
need btrfs to be self-sufficient enough that regular users don't get
burnt by what they would view as unexpected behaviour. We currently
have a situation where btrfs is too demanding on inexperienced users.
I feel we need better worst-case behaviours. For example, if *I* have a
btrfs on its second-to-last-available chunk, it means I'm not
micro-managing properly. But users shouldn't have to micro-manage in
the first place. Btrfs (or a management tool) should just know to
balance the least-used chunk and/or delete the lowest-priority
snapshot, etc. It shouldn't cause my services/apps to give diskspace
errors when, clearly, there is free space available.

The other "high-level" aspect would be along the lines of better
guidance and standardisation for distros on how best to configure
btrfs. This would include guidance/best practices for things like
appropriate subvolume mountpoints and snapshot paths, and sensible
schedules or logic (or perhaps even example tools/scripts) for
balancing and scrubbing the filesystem.

I don't have all the answers. But I also don't want to have to tell
people they can't adopt it because a) they don't (or never will)
understand it; and b) they're going to resent me for their
irresponsibly losing their own data.

-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 2017-08-03 14:29, Christoph Anton Mitterer wrote:

On Thu, 2017-08-03 at 20:08 +0200, waxhead wrote:

Brendan Hide wrote:

The title seems alarmist to me - and I suspect it is going to be
misconstrued. :-/

From the release notes at
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html

"Btrfs has been deprecated"

Wow... not that this would have any direct effect... it's still quite
alarming, isn't it?

This is not meant as criticism, but I often wonder myself where btrfs
is going!? :-/ It's been in the kernel since when? 2009? And while the
extremely basic things (snapshots, etc.) seem to work quite stably...
other things seem to be rather stuck (RAID?)... not to talk about many
things that have been kinda "promised" (fancy different compression
algos, n-parity raid).

I assume you mean the erasure coding the devs and docs call raid56 when
you're talking about stuck features, and you're right, it has been
stuck, but it arguably should have been better tested and verified
before being merged at all. As far as other 'raid' profiles, raid1 and
raid0 work fine, and raid10 is mostly fine once you wrap your head
around the implications of the inconsistent component device ordering.

There are no higher-level management tools (e.g. RAID
management/monitoring, etc.)... there are still some kinda serious
issues (the attacks/corruptions likely possible via UUID collisions)...

The UUID collision issue is present in almost all volume managers and
filesystems; it just does more damage in BTRFS, and is exacerbated by
the brain-dead 'scan everything for BTRFS' policy in udev. As far as
'higher-level' management tools, you're using your system wrong if you
_need_ them.
There is no need for there to be a GUI, or a web interface, or a DBus
interface, or any other such bloat in the main management tools; they
work just fine as is and are mostly on par with the interfaces provided
by LVM, MD, and ZFS (other than the lack of machine-parseable output).
I'd also argue that if you can't reassemble your storage stack by hand
without using 'higher-level' tools, you should not be using that
storage stack, as you don't properly understand it.

On the subject of monitoring specifically, part of the issue there is
kernel-side: any monitoring system currently needs to be polling-based,
not event-based, and as a result monitoring tends to be a very
system-specific affair based on how much overhead you're willing to
tolerate. The limited stuff that does exist is also trivial to
integrate with many pieces of existing monitoring infrastructure (like
Nagios or monit), and therefore the people who care about it a lot
(like me) are either monitoring by hand or are just using the tools
with their existing infrastructure (for example, I use monit already on
all my systems, so I just make sure to have entries in the config to
check error counters and scrub results), so there's not much in the way
of incentive for the concerned parties to reinvent the wheel.

One thing that I have long missed would be checksumming with
nodatacow.

It has been stated multiple times on the list that this is not possible
without making nodatacow prone to data loss.

Also it has always been said that the actual performance tuning would
still lie ahead?!

While there hasn't been anything touted specifically as performance
tuning, performance has improved slightly since I started using BTRFS.

I really like btrfs and use it on all my personal systems... and I
haven't had any data loss since then (only a number of serious-looking
false positives due to bugs in btrfs check ;-) )...
but one still reads every now and then from people here on the list who
seem to suffer more serious losses.

And this brings up part of the issue with uptake. People are quick to
post about issues, but not successes. I've been running BTRFS on almost
everything (I don't use it in VMs because of the performance
implications of having multiple CoW layers) since around kernel 3.9,
have had no critical issues (ones resulting in data loss) since about
3.16, and have actually survived quite a few pieces of marginal or
failed hardware as a result of BTRFS.

So is there any concrete roadmap? Or priority tasks? Is there a lack of
developers?

In order: no, in theory yes but not in practice, and somewhat. As a
general rule, all FOSS projects are short on developers. Most of the
work that is occurring on BTRFS is being sponsored by SUSE, Facebook,
or Fujitsu (at least, I'm pretty sure those are the primary sponsors),
and their priorities will not necessarily coincide with normal end-user
priorities. I'd say, though, that testing and review are just as short
on manpower as development.
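The "check error counters" monitoring Austin describes can be sketched as a small shell filter over `btrfs device stats` output. Everything here is hypothetical glue, not part of btrfs-progs; only the stats output format is real:

```shell
# parse_stats: read `btrfs device stats <mnt>` output on stdin and exit 0
# if any error counter (second column, e.g. write_io_errs) is nonzero.
parse_stats() {
    awk '$2 != 0 { bad = 1 } END { exit (bad ? 0 : 1) }'
}

# Intended wiring (needs a real btrfs mount, so shown as a comment):
#   if btrfs device stats /srv/data | parse_stats; then
#       echo "btrfs error counters nonzero on /srv/data" | mail -s "btrfs alert" root
#   fi
```

A line like this dropped into cron or a monit `check program` entry is roughly the kind of polling-based monitoring the message describes.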
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On Wed, Aug 2, 2017 at 3:11 AM, Wang Shilong wrote:
> I haven't seen active btrfs developers from Red Hat for some time;
> they look to have put most of their efforts into XFS. It is time to
> switch to SLES/openSUSE!

I disagree. We need one or more Btrfs developers involved in Fedora.
Fedora runs fairly unmodified upstream kernels, which are kept up to
date. By default, Fedora 24, 25, and 26 users today are on kernel
4.11.11 or 4.11.12. Fedora 25 and 26 will soon be rebased to probably
4.12.5. That's the stable repo; you can optionally get newer non-rc
kernels from the testing repo, and nightly Rawhide kernels are built as
well with the latest patchset in between rc's.

Both Btrfs and Fedora are heavily invested in containerized
deployments, so it seems like a good fit for both camps. The problem is
that the Fedora kernel team has no one sufficiently familiar with
Btrfs, nor anyone at Red Hat to fall back on. But they do have this
with ext4, XFS, device-mapper, and LVM developers. So they're not going
to take on a burden like Btrfs by default without a knowledgeable pair
of eyes to triage issues as they come up. Instead they're moving to XFS
+ overlayfs.

There's more opportunity for Btrfs than just as a default file system.
I like the idea of using Btrfs on install media to eliminate the
monolithic isomd5sum check most users skip when testing their USB
install media; to eliminate the device-mapper-based persistent overlay
for the install media and use Btrfs seed/sprout instead (which would
help the Sugar on a Stick project as well); and, at least for nightly
composes, to eliminate squashfs-xz-based images in favor of Btrfs
compression (faster compression and decompression, bigger file sizes,
but these are daily throwaways, so I think time is more important).

Anyway - the point is that converging on SUSE doesn't help. If
anything, I think it'll shrink the market for Btrfs as a general
purpose file system rather than grow it.
-- 
Chris Murphy
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 2017-08-03 14:08, waxhead wrote:

Brendan Hide wrote:

The title seems alarmist to me - and I suspect it is going to be
misconstrued. :-/

From the release notes at
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html

"Btrfs has been deprecated

The Btrfs file system has been in Technology Preview state since the
initial release of Red Hat Enterprise Linux 6. Red Hat will not be
moving Btrfs to a fully supported feature and it will be removed in a
future major release of Red Hat Enterprise Linux.

The Btrfs file system did receive numerous updates from the upstream in
Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat
Enterprise Linux 7 series. However, this is the last planned update to
this feature.

Red Hat will continue to invest in future technologies to address the
use cases of our customers, specifically those related to snapshots,
compression, NVRAM, and ease of use. We encourage feedback through your
Red Hat representative on features and requirements you have for file
systems and storage technology."

First of all, I am not a BTRFS dev, but I use it for various projects
and have high hopes for what it can become. Now, the fact that Red Hat
deprecates BTRFS does not mean that BTRFS is deprecated. It is not
removed from the kernel, and so far BTRFS offers features that other
filesystems don't have. ZFS is something that people brag about all the
time as a viable alternative, but to me it seems to be a pain to manage
properly. E.g. grow, add/remove devices, shrink, etc... good luck doing
that right!
BTRFS's biggest problem is not that there are some bits and pieces that
are thoroughly screwed up (raid5/6 (which just got some fixes, by the
way)), but the fact that the documentation is rather dated. There is a
simple status page here:
https://btrfs.wiki.kernel.org/index.php/Status

As others have pointed out already, the explanations on the status page
are not exactly good. For example, compression (which was also
mentioned) is, as of writing this, marked as 'Mostly OK': '(needs
verification and source) - auto repair and compression may crash'.

Now, I am aware that many use compression without trouble. I am not
sure how many use compression on disks with issues and don't have
trouble, but I would at least expect to see more people yelling on the
mailing list if that were the case. The problem here is that this
message is rather scary and certainly does NOT sound like 'mostly ok'
to most people. What exactly needs verification and a source? The
'mostly ok' statement, or something else?! A more detailed explanation
would be required here to avoid scaring people away.

Not certain what was meant here, but there were (a while back) some
known issues with compressed extents; I thought those had been fixed,
though.

Same thing with the trim feature that is marked OK. It clearly says
that it has performance implications. It is marked OK, so one would
expect it not to cause the filesystem to fail, but if the performance
becomes so slow that the filesystem gets practically unusable, it is of
course not "OK". The relevant information is missing for people to make
a decent choice, and I certainly don't know how serious these
performance implications are, or whether they are relevant at all...

The performance implications bit shouldn't be listed; that's a given
for any filesystem with discard support (TRIM is the ATA and eMMC
command, UNMAP is the SCSI one, and ERASE is the name on SD cards;
discard is the generic kernel term).
The issue arises from devices that don't have support for queuing such
commands, which is quite rare for SSDs these days.

Most people interested in BTRFS are probably a bit more paranoid and
concerned about their data than the average computer user. What people
tend to forget is that other filesystems have none of the redundancy,
auto-repair and other fancy features that BTRFS has. So for the
compression example above... if you run compressed files on ext4 and
your disk gets some corruption, you are in no better state than you
would be with btrfs (in fact probably worse). Also, nothing is stopping
you from putting btrfs DUP on an mdadm raid5 or 6, which means you
should be VERY safe.

Simple documentation is the key, so HERE ARE MY DEMANDS!!!... ehhh, so
here is what I think should be done:

1. The documentation needs to either be improved (or old non-relevant
stuff simply removed / archived somewhere)
2. The status page MUST always be up to date for the latest kernel
release (It's ok so far, let's hope nobody sleeps here)
3. Proper explanations must be given so the layman and reasonably
technical people understand the risks / issues for non-ok stuff.
4. There should be links to roadmaps for each feature on the status
page that clearly state what is being worked on for the NEXT kernel
release
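Given the queued-discard caveat discussed above, the common alternative to the discard mount option is periodic batched TRIM via fstrim (part of util-linux, which also ships an fstrim.timer unit for systemd). A sketch of both approaches:

```shell
# Periodic TRIM instead of mount-time discard (sketch).
# With systemd, just enable the shipped timer:
#   systemctl enable --now fstrim.timer
# Or run the equivalent weekly from cron, trimming every mounted
# filesystem that supports discard:
0 4 * * 0  root  fstrim --all
```

Batching the discards this way keeps the commands off the hot I/O path, which is the performance concern the status page alludes to.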
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On Thu, 2017-08-03 at 20:08 +0200, waxhead wrote:
> Brendan Hide wrote:
> > The title seems alarmist to me - and I suspect it is going to be
> > misconstrued. :-/
> >
> > From the release notes at
> > https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html
> > "Btrfs has been deprecated

Wow... not that this would have any direct effect... it's still quite
alarming, isn't it?

This is not meant as criticism, but I often wonder myself where btrfs
is going!? :-/ It's been in the kernel now since when? 2009? And while
the extremely basic things (snapshots, etc.) seem to work quite
stably... other things seem to be rather stuck (RAID?)... not to talk
about many things that have been kinda "promised" (fancy different
compression algos, n-parity raid).

There are no higher-level management tools (e.g. RAID
management/monitoring, etc.)... there are still some kinda serious
issues (the attacks/corruptions likely possible via UUID collisions)...

One thing that I have long missed would be checksumming with
nodatacow.

Also it has always been said that the actual performance tuning would
still lie ahead?!

I really like btrfs and use it on all my personal systems... and I
haven't had any data loss since then (only a number of serious-looking
false positives due to bugs in btrfs check ;-) )... but one still reads
every now and then from people here on the list who seem to suffer more
serious losses.

So is there any concrete roadmap? Or priority tasks? Is there a lack of
developers?

Cheers,
Chris.
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
Brendan Hide wrote:

The title seems alarmist to me - and I suspect it is going to be
misconstrued. :-/

From the release notes at
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html

"Btrfs has been deprecated

The Btrfs file system has been in Technology Preview state since the
initial release of Red Hat Enterprise Linux 6. Red Hat will not be
moving Btrfs to a fully supported feature and it will be removed in a
future major release of Red Hat Enterprise Linux.

The Btrfs file system did receive numerous updates from the upstream in
Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat
Enterprise Linux 7 series. However, this is the last planned update to
this feature.

Red Hat will continue to invest in future technologies to address the
use cases of our customers, specifically those related to snapshots,
compression, NVRAM, and ease of use. We encourage feedback through your
Red Hat representative on features and requirements you have for file
systems and storage technology."

First of all, I am not a BTRFS dev, but I use it for various projects
and have high hopes for what it can become. Now, the fact that Red Hat
deprecates BTRFS does not mean that BTRFS is deprecated. It is not
removed from the kernel, and so far BTRFS offers features that other
filesystems don't have. ZFS is something that people brag about all the
time as a viable alternative, but to me it seems to be a pain to manage
properly. E.g. grow, add/remove devices, shrink, etc... good luck doing
that right!
BTRFS's biggest problem is not that there are some bits and pieces that
are thoroughly screwed up (raid5/6 (which just got some fixes, by the
way)), but the fact that the documentation is rather dated. There is a
simple status page here:
https://btrfs.wiki.kernel.org/index.php/Status

As others have pointed out already, the explanations on the status page
are not exactly good. For example, compression (which was also
mentioned) is, as of writing this, marked as 'Mostly OK': '(needs
verification and source) - auto repair and compression may crash'.

Now, I am aware that many use compression without trouble. I am not
sure how many use compression on disks with issues and don't have
trouble, but I would at least expect to see more people yelling on the
mailing list if that were the case. The problem here is that this
message is rather scary and certainly does NOT sound like 'mostly ok'
to most people. What exactly needs verification and a source? The
'mostly ok' statement, or something else?! A more detailed explanation
would be required here to avoid scaring people away.

Same thing with the trim feature that is marked OK. It clearly says
that it has performance implications. It is marked OK, so one would
expect it not to cause the filesystem to fail, but if the performance
becomes so slow that the filesystem gets practically unusable, it is of
course not "OK". The relevant information is missing for people to make
a decent choice, and I certainly don't know how serious these
performance implications are, or whether they are relevant at all...

Most people interested in BTRFS are probably a bit more paranoid and
concerned about their data than the average computer user. What people
tend to forget is that other filesystems have none of the redundancy,
auto-repair and other fancy features that BTRFS has. So for the
compression example above...
if you run compressed files on ext4 and your disk gets some corruption, you are in no better state than you would be with btrfs (in fact probably worse). Also, nothing is stopping you from putting btrfs DUP on an mdadm raid5 or 6, which means you should be VERY safe. Simple documentation is the key, so HERE ARE MY DEMANDS!!!. ehhh... so here is what I think should be done:

1. The documentation needs to either be improved or have old, non-relevant stuff simply removed / archived somewhere.
2. The status page MUST always be up to date for the latest kernel release (it's ok so far, let's hope nobody sleeps here).
3. Proper explanations must be given so that laymen and reasonably technical people understand the risks / issues for non-ok stuff.
4. There should be links to roadmaps for each feature on the status page that clearly state what is being worked on for the NEXT kernel release.
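For reference, the compression and trim features discussed above are controlled by mount options. A hypothetical example only (device path and mount point are placeholders, and which options are sensible depends on kernel version and hardware):

```shell
# Mount a btrfs volume with transparent compression enabled.
# zlib was the long-standing default algorithm; lzo is the faster option.
mount -o compress=zlib,autodefrag /dev/sdb1 /mnt/data

# The trim feature flagged on the status page corresponds to `-o discard`
# (continuous trim). A common alternative is periodic batched trim:
fstrim -v /mnt/data
```

The performance implications mentioned on the status page apply to continuous discard; batched `fstrim` runs the cost at a time of your choosing instead.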
Re: Massive loss of disk space
On 2017-08-03 13:15, Marat Khalili wrote: On August 3, 2017 7:01:06 PM GMT+03:00, Goffredo Baroncelli wrote:

The file is physically extended:

ghigo@venice:/tmp$ fallocate -l 1000 foo.txt

For clarity let's replace the fallocate above with: $ head -c 1000 foo.txt

ghigo@venice:/tmp$ ls -l foo.txt
-rw-r--r-- 1 ghigo ghigo 1000 Aug 3 18:00 foo.txt
ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
ghigo@venice:/tmp$ ls -l foo.txt
-rw-r--r-- 1 ghigo ghigo 1500 Aug 3 18:00 foo.txt
ghigo@venice:/tmp$

According to the explanation by Austin, foo.txt at this point somehow occupies 2000 bytes of space, because I can reflink it and then write another 1000 bytes of data into it without losing the 1000 bytes I already have or running out of drive space. (Or is it only true while there are open file handles?)

OK, I think there may be some misunderstanding here. By 'CoW unwritten extents', I mean that when we write to the extent, a CoW operation happens instead of the data being written directly into the extent. In this case it has nothing to do with reflinking, and Goffredo is correct that if your filesystem is small enough, the second fallocate will fail there.
Re: Massive loss of disk space
On 2017-08-03 12:37, Goffredo Baroncelli wrote: On 2017-08-03 13:39, Austin S. Hemmelgarn wrote: On 2017-08-02 17:05, Goffredo Baroncelli wrote: On 2017-08-02 21:10, Austin S. Hemmelgarn wrote: On 2017-08-02 13:52, Goffredo Baroncelli wrote: Hi, [...] consider the following scenario: a) create a 2GB file b) fallocate -o 1GB -l 2GB c) write from 1GB to 3GB after b), the expectation is that c) always succeeds [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exist on the disk. There is also an expectation, based on pretty much every other FS in existence, that calling fallocate() on a range that is already in use is a (possibly expensive) no-op, and by extension using fallocate() with an offset of 0 like a ftruncate() call will succeed as long as the new size will fit. The man page of fallocate doesn't guarantee that. Unfortunately, in a COW filesystem the assumption that an allocated area may simply be overwritten is not true. Let me say it in other words: as a general rule, if you want to _write_ something in a COW filesystem, you need space. It doesn't matter whether you are *over-writing* existing data or *appending* to a file. Yes, you need space, but you don't need _all_ the space. For a file that already has data in it, you only _need_ as much space as the largest chunk of data that can be written at once at a low level, because the moment that first write finishes, the space that was used in the file for that region is freed, and the next write can go there. Put a bit differently, you only need to allocate what isn't allocated in the region, and then a bit more to handle the initial write to the file. Also, as I said below, _THIS WORKS ON ZFS_. That immediately means that a CoW filesystem _does not_ need to behave the way BTRFS does. 
It seems that ZFS on Linux doesn't support fallocate, see https://github.com/zfsonlinux/zfs/issues/326 So I think that you are referring to posix_fallocate and ZFS on Solaris, which I can't test, so I can't comment. Both Solaris and FreeBSD (I've got a FreeNAS system at work I checked on). That said, I'm starting to wonder if just failing fallocate() calls to allocate space is actually the right thing to do here after all. Aside from this, we don't reserve metadata space for checksums and similar things for the eventual writes (so it's possible to get -ENOSPC on a write to an fallocate'd region anyway because of metadata exhaustion), and splitting extents can also cause it to fail, so it's perfectly possible for the fallocate assumption to not hold on BTRFS. The irony of this is that if you're in a situation where you actually need to reserve space, you're more likely to fail (because if you actually _need_ to reserve the space, your filesystem may already be mostly full, and therefore any of the above issues may occur). On the specific note of splitting extents, the following will probably fail on BTRFS as well when done with a large enough FS (the turn-over point ends up being the point at which 256MiB isn't enough space to account for all the extents), but will succeed with ZFS:

1. Create the filesystem and mount it. On BTRFS, make sure autodefrag is off (this makes it fail more reliably, but is not essential for it to fail).
2. Use fallocate to allocate as large a file as possible (in the BTRFS case, try for the size of the filesystem minus 544MiB: 512MiB for the metadata chunk, 32MiB for the system chunk).
3. Write half the file using 1MB blocks, skipping 1MB of space between each block (so every other 1MB of space is actually written to).
4. Write the other half of the file by filling in the holes.
The net effect of this is to split the single large fallocate'd extent into a very large number of 1MB extents, which in turn eats up lots of metadata space and will eventually exhaust it. While this specific exercise requires a large filesystem, more generic real-world situations exist where this can happen (and I have had this happen before). [...] In terms of a COW filesystem, you need the space of a) + the space of b) No, that is only required if the entire file needs to be written atomically. There is some maximal-size atomic write that BTRFS can perform as a single operation at a low level (I'm not sure if this is equal to the block size, or larger, but it doesn't matter much; either way, I'm talking about the largest chunk of data it will write to a disk in a single operation before updating metadata to point to that new data). To the best of my knowledge there is only a time limit: IIRC, every 30 seconds a transaction is closed. If you are able to fill the filesystem in this time window you are in trouble. Even with that, it's still possible to implement the method I outlined by defining such a limit and forcing a transaction commit when th
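The alternating write pattern from steps 3-4 above can be sketched at a much smaller scale (a 16 MiB file standing in for a whole filesystem). This assumes a Linux system with util-linux `fallocate` and GNU `dd`; it only demonstrates the I/O pattern, not the resulting metadata exhaustion:

```shell
# Scaled-down sketch of the extent-splitting pattern: preallocate one
# large extent, dirty every other 1 MiB block, then fill in the holes.
set -e
cd "$(mktemp -d)"
fallocate -l $((16 * 1024 * 1024)) big.img

# Pass 1: write blocks at even MiB offsets (0, 2, 4, ...).
for i in $(seq 0 2 15); do
    dd if=/dev/urandom of=big.img bs=1M count=1 seek="$i" conv=notrunc status=none
done
# Pass 2: fill in the odd-offset holes (1, 3, 5, ...).
for i in $(seq 1 2 15); do
    dd if=/dev/urandom of=big.img bs=1M count=1 seek="$i" conv=notrunc status=none
done

# On btrfs (with autodefrag off) this tends to turn the one preallocated
# extent into many small ones; `filefrag big.img` shows the result.
stat -c %s big.img    # size is unchanged: 16777216
```

On a full-size filesystem each such split consumes extent metadata, which is what eventually exhausts the 256MiB figure mentioned above.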
Re: Massive loss of disk space
On August 3, 2017 7:01:06 PM GMT+03:00, Goffredo Baroncelli wrote:
>The file is physically extended
>
>ghigo@venice:/tmp$ fallocate -l 1000 foo.txt

For clarity let's replace the fallocate above with: $ head -c 1000 foo.txt

>ghigo@venice:/tmp$ ls -l foo.txt
>-rw-r--r-- 1 ghigo ghigo 1000 Aug 3 18:00 foo.txt
>ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
>ghigo@venice:/tmp$ ls -l foo.txt
>-rw-r--r-- 1 ghigo ghigo 1500 Aug 3 18:00 foo.txt
>ghigo@venice:/tmp$

According to the explanation by Austin, foo.txt at this point somehow occupies 2000 bytes of space, because I can reflink it and then write another 1000 bytes of data into it without losing the 1000 bytes I already have or running out of drive space. (Or is it only true while there are open file handles?)
-- 
With Best Regards,
Marat Khalili
Re: Massive loss of disk space
On 2017-08-03 13:39, Austin S. Hemmelgarn wrote: > On 2017-08-02 17:05, Goffredo Baroncelli wrote: >> On 2017-08-02 21:10, Austin S. Hemmelgarn wrote: >>> On 2017-08-02 13:52, Goffredo Baroncelli wrote: Hi, >> [...] >> consider the following scenario: a) create a 2GB file b) fallocate -o 1GB -l 2GB c) write from 1GB to 3GB after b), the expectation is that c) always succeeds [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exist on the disk. >> >>> There is also an expectation, based on pretty much every other FS in >>> existence, that calling fallocate() on a range that is already in use is a >>> (possibly expensive) no-op, and by extension using fallocate() with an >>> offset of 0 like a ftruncate() call will succeed as long as the new size >>> will fit. >> >> The man page of fallocate doesn't guarantee that. >> >> Unfortunately, in a COW filesystem the assumption that an allocated area may >> simply be overwritten is not true. >> >> Let me say it in other words: as a general rule, if you want to _write_ >> something in a COW filesystem, you need space. It doesn't matter whether you are >> *over-writing* existing data or *appending* to a file. > Yes, you need space, but you don't need _all_ the space. For a file that > already has data in it, you only _need_ as much space as the largest chunk of > data that can be written at once at a low level, because the moment that > first write finishes, the space that was used in the file for that region is > freed, and the next write can go there. Put a bit differently, you only need > to allocate what isn't allocated in the region, and then a bit more to handle > the initial write to the file. > > Also, as I said below, _THIS WORKS ON ZFS_. That immediately means that a > CoW filesystem _does not_ need to behave the way BTRFS does. 
It seems that ZFS on Linux doesn't support fallocate, see https://github.com/zfsonlinux/zfs/issues/326 So I think that you are referring to posix_fallocate and ZFS on Solaris, which I can't test, so I can't comment. [...] >> In terms of a COW filesystem, you need the space of a) + the space of b) > No, that is only required if the entire file needs to be written atomically. > There is some maximal-size atomic write that BTRFS can perform as a single > operation at a low level (I'm not sure if this is equal to the block size, or > larger, but it doesn't matter much; either way, I'm talking about the largest chunk > of data it will write to a disk in a single operation before updating > metadata to point to that new data). To the best of my knowledge there is only a time limit: IIRC, every 30 seconds a transaction is closed. If you are able to fill the filesystem in this time window you are in trouble. [...]
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: Massive loss of disk space
On 2017-08-03 13:44, Marat Khalili wrote: > On 02/08/17 20:52, Goffredo Baroncelli wrote: >> consider the following scenario: >> >> a) create a 2GB file >> b) fallocate -o 1GB -l 2GB >> c) write from 1GB to 3GB >> >> after b), the expectation is that c) always succeeds [1]: i.e. there is >> enough space on the filesystem. Due to the COW nature of BTRFS, you cannot >> rely on the already allocated space because there could be a small time >> window where both the old and the new data exist on the disk. > Just curious. With current implementation, in the following case: > a) create a 2GB file1 && create a 2GB file2 > b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2 At this step you are trying to allocate 3GB+3GB = 6GB, so you have exhausted the filesystem space. > c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2 > will (c) always succeed? I.e. does fallocate really allocate 2GB per file, or > does it only allocate an additional 1GB and check free space for another 1GB? If > it's only the latter, it is useless.

The file is physically extended:

ghigo@venice:/tmp$ fallocate -l 1000 foo.txt
ghigo@venice:/tmp$ ls -l foo.txt
-rw-r--r-- 1 ghigo ghigo 1000 Aug 3 18:00 foo.txt
ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
ghigo@venice:/tmp$ ls -l foo.txt
-rw-r--r-- 1 ghigo ghigo 1500 Aug 3 18:00 foo.txt
ghigo@venice:/tmp$

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
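Goffredo's session above is easy to reproduce. A minimal sketch, assuming util-linux `fallocate` and GNU `stat` on any local filesystem:

```shell
# fallocate with an offset reaching past EOF extends the file: the
# allocated range [500, 1500) ends 500 bytes beyond the old size.
set -e
cd "$(mktemp -d)"

head -c 1000 /dev/zero > foo.txt
stat -c %s foo.txt                # 1000

fallocate -o 500 -l 1000 foo.txt
stat -c %s foo.txt                # 1500
```

Note that the visible *size* says nothing about how much space is actually reserved; on a CoW filesystem that gap is exactly what this thread is arguing about.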
Re: [PATCH] Btrfs: avoid unnecessarily locking inode when clearing a range
On 08/03/2017 11:25 AM, Wang Shilong wrote: On Thu, Aug 3, 2017 at 11:00 PM, Chris Mason wrote: On 07/27/2017 02:52 PM, fdman...@kernel.org wrote: From: Filipe Manana If the range being cleared was not marked for defrag and we are not about to clear the range from the defrag status, we don't need to lock and unlock the inode. Signed-off-by: Filipe Manana Thanks Filipe, looks like it goes all the way back to: commit 47059d930f0e002ff851beea87d738146804726d Author: Wang Shilong Date: Thu Jul 3 18:22:07 2014 +0800 Btrfs: make defragment work with nodatacow option I can't see how the inode lock is required here. This blames to me, thanks for fixing it. No blame ;) I'll take code that works any day. -chris
Re: [PATCH] Btrfs: avoid unnecessarily locking inode when clearing a range
On Thu, Aug 3, 2017 at 11:00 PM, Chris Mason wrote: > > > On 07/27/2017 02:52 PM, fdman...@kernel.org wrote: >> >> From: Filipe Manana >> >> If the range being cleared was not marked for defrag and we are not >> about to clear the range from the defrag status, we don't need to >> lock and unlock the inode. >> >> Signed-off-by: Filipe Manana > > > Thanks Filipe, looks like it goes all the way back to: > > commit 47059d930f0e002ff851beea87d738146804726d > Author: Wang Shilong > Date: Thu Jul 3 18:22:07 2014 +0800 > > Btrfs: make defragment work with nodatacow option > > I can't see how the inode lock is required here. This blames to me, thanks for fixing it. Reviewed-by: Wang Shilong > > Reviewed-by: Chris Mason > > -chris > >> --- >> fs/btrfs/inode.c | 7 --- >> 1 file changed, 4 insertions(+), 3 deletions(-) >> >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >> index eb495e956d53..51c45c0a8553 100644 >> --- a/fs/btrfs/inode.c >> +++ b/fs/btrfs/inode.c >> @@ -1797,10 +1797,11 @@ static void btrfs_clear_bit_hook(void >> *private_data, >> u64 len = state->end + 1 - state->start; >> u32 num_extents = count_max_extents(len); >> - spin_lock(&inode->lock); >> - if ((state->state & EXTENT_DEFRAG) && (*bits & EXTENT_DEFRAG)) >> + if ((state->state & EXTENT_DEFRAG) && (*bits & EXTENT_DEFRAG)) { >> + spin_lock(&inode->lock); >> inode->defrag_bytes -= len; >> - spin_unlock(&inode->lock); >> + spin_unlock(&inode->lock); >> + } >> /* >> * set_bit and clear bit hooks normally require _irqsave/restore
Re: [PATCH] btrfs: Move skip checksum check from btrfs_submit_direct to __btrfs_submit_dio_bio
On 08/03/2017 08:44 AM, Nikolay Borisov wrote: Currently the code checks whether we should do data checksumming in btrfs_submit_direct and the boolean result of this check is passed to btrfs_submit_direct_hook, in turn passing it to __btrfs_submit_dio_bio which actually consumes it. The last function actually has all the necessary context to figure out whether to skip the check or not, so let's move the check closer to where it's being consumed. No functional changes. I like it, thanks. Reviewed-by: Chris Mason -chris
Re: [PATCH] Btrfs: avoid unnecessarily locking inode when clearing a range
On 07/27/2017 02:52 PM, fdman...@kernel.org wrote: From: Filipe Manana If the range being cleared was not marked for defrag and we are not about to clear the range from the defrag status, we don't need to lock and unlock the inode. Signed-off-by: Filipe Manana

Thanks Filipe, looks like it goes all the way back to:

commit 47059d930f0e002ff851beea87d738146804726d
Author: Wang Shilong
Date:   Thu Jul 3 18:22:07 2014 +0800

    Btrfs: make defragment work with nodatacow option

I can't see how the inode lock is required here.

Reviewed-by: Chris Mason

-chris

---
 fs/btrfs/inode.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index eb495e956d53..51c45c0a8553 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1797,10 +1797,11 @@ static void btrfs_clear_bit_hook(void *private_data,
 	u64 len = state->end + 1 - state->start;
 	u32 num_extents = count_max_extents(len);
 
-	spin_lock(&inode->lock);
-	if ((state->state & EXTENT_DEFRAG) && (*bits & EXTENT_DEFRAG))
+	if ((state->state & EXTENT_DEFRAG) && (*bits & EXTENT_DEFRAG)) {
+		spin_lock(&inode->lock);
 		inode->defrag_bytes -= len;
-	spin_unlock(&inode->lock);
+		spin_unlock(&inode->lock);
+	}
 
 	/*
 	 * set_bit and clear bit hooks normally require _irqsave/restore
[PATCH] btrfs: Move skip checksum check from btrfs_submit_direct to __btrfs_submit_dio_bio
Currently the code checks whether we should do data checksumming in btrfs_submit_direct and the boolean result of this check is passed to btrfs_submit_direct_hook, in turn passing it to __btrfs_submit_dio_bio which actually consumes it. The last function actually has all the necessary context to figure out whether to skip the check or not, so let's move the check closer to where it's being consumed. No functional changes.

Signed-off-by: Nikolay Borisov
---
 fs/btrfs/inode.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5e48d2c10152..a8bd0f951454 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8424,8 +8424,7 @@ static inline blk_status_t btrfs_lookup_and_bind_dio_csum(struct inode *inode,
 }
 
 static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
-					 u64 file_offset, int skip_sum,
-					 int async_submit)
+					 u64 file_offset, int async_submit)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_dio_private *dip = bio->bi_private;
@@ -8443,7 +8442,7 @@ static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
 		goto err;
 	}
 
-	if (skip_sum)
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
 		goto map;
 
 	if (write && async_submit) {
@@ -8473,8 +8472,7 @@ static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
 	return ret;
 }
 
-static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip,
-				    int skip_sum)
+static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
 {
 	struct inode *inode = dip->inode;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -8537,7 +8535,7 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip,
 		 */
 		atomic_inc(&dip->pending_bios);
 
-		ret = __btrfs_submit_dio_bio(bio, inode, file_offset, skip_sum,
+		ret = __btrfs_submit_dio_bio(bio, inode, file_offset,
 					     async_submit);
 		if (ret) {
 			bio_put(bio);
@@ -8557,8 +8555,7 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip,
 	} while (submit_len > 0);
 
 submit:
-	ret = __btrfs_submit_dio_bio(bio, inode, file_offset, skip_sum,
-				     async_submit);
+	ret = __btrfs_submit_dio_bio(bio, inode, file_offset, async_submit);
 	if (!ret)
 		return 0;
 
@@ -8583,12 +8580,9 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 	struct btrfs_dio_private *dip = NULL;
 	struct bio *bio = NULL;
 	struct btrfs_io_bio *io_bio;
-	int skip_sum;
 	bool write = (bio_op(dio_bio) == REQ_OP_WRITE);
 	int ret = 0;
 
-	skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
-
 	bio = btrfs_bio_clone(dio_bio);
 
 	dip = kzalloc(sizeof(*dip), GFP_NOFS);
@@ -8631,7 +8625,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 		dio_data->unsubmitted_oe_range_end;
 	}
 
-	ret = btrfs_submit_direct_hook(dip, skip_sum);
+	ret = btrfs_submit_direct_hook(dip);
 	if (!ret)
 		return;
-- 
2.7.4
Re: [PATCH] Btrfs: search parity device wisely
Hi Liu,

[auto build test ERROR on v4.13-rc3]
[also build test ERROR on next-20170803]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Liu-Bo/Btrfs-search-parity-device-wisely/20170803-193103
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=xtensa

All errors (new ones prefixed by >>):

   fs/btrfs/raid56.c: In function 'raid56_parity_alloc_scrub_rbio':
>> fs/btrfs/raid56.c:2232:15: error: 'struct btrfs_raid_bio' has no member named 'data_stripes'
     for (i = rbio->data_stripes; i < rbio->real_stripes; i++) {
                  ^

vim +2232 fs/btrfs/raid56.c

  2201	
  2202	/*
  2203	 * The following code is used to scrub/replace the parity stripe
  2204	 *
  2205	 * Caller must have already increased bio_counter for getting @bbio.
  2206	 *
  2207	 * Note: We need make sure all the pages that add into the scrub/replace
  2208	 * raid bio are correct and not be changed during the scrub/replace. That
  2209	 * is those pages just hold metadata or file data with checksum.
  2210	 */
  2211	
  2212	struct btrfs_raid_bio *
  2213	raid56_parity_alloc_scrub_rbio(struct btrfs_fs_info *fs_info, struct bio *bio,
  2214	                               struct btrfs_bio *bbio, u64 stripe_len,
  2215	                               struct btrfs_device *scrub_dev,
  2216	                               unsigned long *dbitmap, int stripe_nsectors)
  2217	{
  2218	        struct btrfs_raid_bio *rbio;
  2219	        int i;
  2220	
  2221	        rbio = alloc_rbio(fs_info, bbio, stripe_len);
  2222	        if (IS_ERR(rbio))
  2223	                return NULL;
  2224	        bio_list_add(&rbio->bio_list, bio);
  2225	        /*
  2226	         * This is a special bio which is used to hold the completion handler
  2227	         * and make the scrub rbio is similar to the other types
  2228	         */
  2229	        ASSERT(!bio->bi_iter.bi_size);
  2230	        rbio->operation = BTRFS_RBIO_PARITY_SCRUB;
  2231	
> 2232	        for (i = rbio->data_stripes; i < rbio->real_stripes; i++) {
  2233	                if (bbio->stripes[i].dev == scrub_dev) {
  2234	                        rbio->scrubp = i;
  2235	                        break;
  2236	                }
  2237	        }
  2238	        ASSERT(i < rbio->real_stripes);
  2239	
  2240	        /* Now we just support the sectorsize equals to page size */
  2241	        ASSERT(fs_info->sectorsize == PAGE_SIZE);
  2242	        ASSERT(rbio->stripe_npages == stripe_nsectors);
  2243	        bitmap_copy(rbio->dbitmap, dbitmap, stripe_nsectors);
  2244	
  2245	        /*
  2246	         * We have already increased bio_counter when getting bbio, record it
  2247	         * so we can free it at rbio_orig_end_io().
  2248	         */
  2249	        rbio->generic_bio_cnt = 1;
  2250	
  2251	        return rbio;
  2252	}
  2253	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Re: [PATCH] Btrfs: search parity device wisely
Hi Liu,

[auto build test ERROR on v4.13-rc3]
[also build test ERROR on next-20170803]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Liu-Bo/Btrfs-search-parity-device-wisely/20170803-193103
config: x86_64-randconfig-x007-201731 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64

All errors (new ones prefixed by >>):

   fs/btrfs/raid56.c: In function 'raid56_parity_alloc_scrub_rbio':
>> fs/btrfs/raid56.c:2232:15: error: 'struct btrfs_raid_bio' has no member named 'data_stripes'; did you mean 'real_stripes'?
     for (i = rbio->data_stripes; i < rbio->real_stripes; i++) {
                  ^~

vim +2232 fs/btrfs/raid56.c

  2201	
  2202	/*
  2203	 * The following code is used to scrub/replace the parity stripe
  2204	 *
  2205	 * Caller must have already increased bio_counter for getting @bbio.
  2206	 *
  2207	 * Note: We need make sure all the pages that add into the scrub/replace
  2208	 * raid bio are correct and not be changed during the scrub/replace. That
  2209	 * is those pages just hold metadata or file data with checksum.
  2210	 */
  2211	
  2212	struct btrfs_raid_bio *
  2213	raid56_parity_alloc_scrub_rbio(struct btrfs_fs_info *fs_info, struct bio *bio,
  2214	                               struct btrfs_bio *bbio, u64 stripe_len,
  2215	                               struct btrfs_device *scrub_dev,
  2216	                               unsigned long *dbitmap, int stripe_nsectors)
  2217	{
  2218	        struct btrfs_raid_bio *rbio;
  2219	        int i;
  2220	
  2221	        rbio = alloc_rbio(fs_info, bbio, stripe_len);
  2222	        if (IS_ERR(rbio))
  2223	                return NULL;
  2224	        bio_list_add(&rbio->bio_list, bio);
  2225	        /*
  2226	         * This is a special bio which is used to hold the completion handler
  2227	         * and make the scrub rbio is similar to the other types
  2228	         */
  2229	        ASSERT(!bio->bi_iter.bi_size);
  2230	        rbio->operation = BTRFS_RBIO_PARITY_SCRUB;
  2231	
> 2232	        for (i = rbio->data_stripes; i < rbio->real_stripes; i++) {
  2233	                if (bbio->stripes[i].dev == scrub_dev) {
  2234	                        rbio->scrubp = i;
  2235	                        break;
  2236	                }
  2237	        }
  2238	        ASSERT(i < rbio->real_stripes);
  2239	
  2240	        /* Now we just support the sectorsize equals to page size */
  2241	        ASSERT(fs_info->sectorsize == PAGE_SIZE);
  2242	        ASSERT(rbio->stripe_npages == stripe_nsectors);
  2243	        bitmap_copy(rbio->dbitmap, dbitmap, stripe_nsectors);
  2244	
  2245	        /*
  2246	         * We have already increased bio_counter when getting bbio, record it
  2247	         * so we can free it at rbio_orig_end_io().
  2248	         */
  2249	        rbio->generic_bio_cnt = 1;
  2250	
  2251	        return rbio;
  2252	}
  2253	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Re: Massive loss of disk space
On 2017-08-03 07:44, Marat Khalili wrote: On 02/08/17 20:52, Goffredo Baroncelli wrote: consider the following scenario: a) create a 2GB file b) fallocate -o 1GB -l 2GB c) write from 1GB to 3GB after b), the expectation is that c) always succeed [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exists on the disk. Just curious. With current implementation, in the following case: a) create a 2GB file1 && create a 2GB file2 b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2 c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2 will (c) always succeed? I.e. does fallocate really allocate 2GB per file, or does it only allocate additional 1GB and check free space for another 1GB? If it's only the latter, it is useless. It will currently allocate 4GB total in this case (2 for each file), and _should_ succeed. I think there are corner cases where it can fail though because of metadata exhaustion, and I'm still not certain we don't CoW unwritten extents (if we do CoW unwritten extents, then this, and all fallocate allocation for that matter, becomes non-deterministic as to whether or not it succeeds).
Re: Massive loss of disk space
On 02/08/17 20:52, Goffredo Baroncelli wrote: consider the following scenario: a) create a 2GB file b) fallocate -o 1GB -l 2GB c) write from 1GB to 3GB after b), the expectation is that c) always succeed [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exists on the disk. Just curious. With current implementation, in the following case: a) create a 2GB file1 && create a 2GB file2 b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2 c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2 will (c) always succeed? I.e. does fallocate really allocate 2GB per file, or does it only allocate additional 1GB and check free space for another 1GB? If it's only the latter, it is useless. -- With Best Regards, Marat Khalili
Re: Massive loss of disk space
On 2017-08-02 17:05, Goffredo Baroncelli wrote: On 2017-08-02 21:10, Austin S. Hemmelgarn wrote: On 2017-08-02 13:52, Goffredo Baroncelli wrote: Hi, [...] consider the following scenario: a) create a 2GB file b) fallocate -o 1GB -l 2GB c) write from 1GB to 3GB after b), the expectation is that c) always succeeds [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exist on the disk. There is also an expectation, based on pretty much every other FS in existence, that calling fallocate() on a range that is already in use is a (possibly expensive) no-op, and by extension using fallocate() with an offset of 0 like a ftruncate() call will succeed as long as the new size will fit. The man page of fallocate doesn't guarantee that. Unfortunately, in a COW filesystem the assumption that an allocated area may simply be overwritten is not true. Let me say it in other words: as a general rule, if you want to _write_ something in a COW filesystem, you need space. It doesn't matter whether you are *over-writing* existing data or *appending* to a file. Yes, you need space, but you don't need _all_ the space. For a file that already has data in it, you only _need_ as much space as the largest chunk of data that can be written at once at a low level, because the moment that first write finishes, the space that was used in the file for that region is freed, and the next write can go there. Put a bit differently, you only need to allocate what isn't allocated in the region, and then a bit more to handle the initial write to the file. Also, as I said below, _THIS WORKS ON ZFS_. That immediately means that a CoW filesystem _does not_ need to behave the way BTRFS does. 
>> I've checked JFS, XFS, ext4, vfat, NTFS (via NTFS-3G, not the kernel
>> driver), NILFS2, OCFS2 (local mode only), F2FS, UFS, and HFS+ on
>> Linux, UFS and HFS+ on OS X, UFS and ZFS on FreeBSD, FFS (UFS with a
>> different name) and LFS (log structured) on NetBSD, UFS and ZFS on
>> Solaris, and VxFS on HP-UX, and _all_ of them behave correctly here
>> and succeed with the test I listed, while BTRFS does not. This isn't
>> codified in POSIX, but it's also not something that is listed as
>> implementation defined, which in turn means that we should be trying
>> to match the other implementations.
> [...]
> My opinion is that in general this behavior is correct due to the COW
> nature of BTRFS. The only exception that I can find is the "nocow"
> file. For these cases, taking into account the already allocated
> space would be better.
>> There are other, saner ways to make that expectation hold though,
>> and I'm not even certain that it does as things are implemented (I
>> believe we still CoW unwritten extents when data is written to them,
>> because I _have_ had writes to fallocate'ed files fail on BTRFS
>> before with -ENOSPC). The ideal situation IMO is as follows:
>>
>> 1. This particular case (using fallocate() with an offset of 0 to
>> extend a file that is already larger than half the remaining free
>> space on the FS) _should_ succeed.
> This description is not accurate. What happened is the following:
> 1) you have a file *with valid data*
> 2) you want to prepare an update of this file and want to be sure to
> have enough space
Except this is not the common case. Most filesystems aren't CoW, so
calling fallocate() like this is generally not 'ensuring you have
enough space', it's 'ensuring the file isn't sparse, and we can write
to the extra area beyond the end we care about'.
> at this point fallocate has to guarantee:
> a) you still have your old data available
> b) you have allocated the space for the update
>
> In terms of a COW filesystem, you need the space of a) + the space
> of b)
No, that is only required if the entire file needs to be written
atomically. There is some maximal-size atomic write that BTRFS can
perform as a single operation at a low level (I'm not sure whether
this is equal to the block size or larger, but it doesn't matter much;
either way, I mean the largest chunk of data it will write to a disk
in a single operation before updating metadata to point to that new
data). If your total size (original data plus the new space) is less
than this maximal atomic write size, then the above is true; but if it
is larger, you only need to allocate space for the regions of the
fallocate() range that aren't already allocated, plus space to
accommodate at least one write of this maximal atomic write size. Any
space beyond that just ends up minimizing the degree of fragmentation
introduced by allocation.

The methodology that allows this is really simple. When you start to
write data to the file, the first part of the write goes into the
newly allocated space, and the original region covered by that write
gets freed. You can then write into the space that was just freed, and
repeat.
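[The incremental-write argument above can be modeled abstractly. The toy simulation below is my sketch, not anything from btrfs: it counts the peak scratch space needed when an N-block region is rewritten one "maximal atomic write" at a time, freeing each old block as soon as its replacement commits. The headroom is bounded by the atomic write size, not the file size.]

```python
def peak_scratch_blocks(file_blocks, atomic_write_blocks=1):
    """Peak extra blocks needed to CoW-rewrite a file of file_blocks
    blocks, committing atomic_write_blocks blocks per low-level write."""
    peak = scratch = 0
    remaining = file_blocks
    while remaining > 0:
        chunk = min(atomic_write_blocks, remaining)
        scratch += chunk   # new copies go into freshly allocated space
        peak = max(peak, scratch)
        scratch -= chunk   # old copies freed once the commit lands
        remaining -= chunk
    return peak

# Rewriting a 1000-block file never needs more than one atomic write's
# worth of headroom, regardless of the file size:
print(peak_scratch_blocks(1000, 1))   # 1
print(peak_scratch_blocks(1000, 16))  # 16
```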
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 08/03/2017 12:22 AM, Chris Murphy wrote:
> Also more interesting is this Stratis project that started up a few
> months ago:
> https://github.com/stratis-storage/stratisd
>
> Which also includes this design document:
> https://stratis-storage.github.io/StratisSoftwareDesign.pdf

This concept, if successfully implemented, does not seem to achieve
anything beyond "hide the complexity of its implementation from the
user". No actual new functionality, no reason to assume any additional
robustness or stability, and certainly not a new filesystem: just yet
another wrapper. Keeping users from understanding the complexity of a
storage system they use is a benefit only in the most trivial use
cases. And I find it symptomatic that the section "D-Bus Access
Control" in StratisSoftwareDesign.pdf is empty.

> So it's going to use existing device mapper, md, some LVM stuff, XFS

That is the only part of the Stratis concept that looks reasonable to
me.