Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Fri, Sep 21, 2018 at 12:59:31PM +1000, Dave Chinner wrote:
> On Wed, Sep 19, 2018 at 12:12:03AM -0400, Zygo Blaxell wrote:
[...]
> With no DMAPI in the future, people with custom HSM-like interfaces
> based on dmapi are starting to turn to fanotify and friends to
> provide them with the change notifications they require

I had a fanotify-based scanner once, before I noticed btrfs effectively
had timestamps all over its metadata. fanotify won't tell me which parts
of a file were modified (unless it got that feature in the last few
years?). fanotify was pretty useless when the only file on the system
that was being modified was a 13TB VM image. Or even a little 16GB one.
It has to scan the whole file to find the one new byte. Even on desktops
the poor thing spends most of its time looping over /var/log/messages.
It was sad.

If fanotify gave me (inode, offset, length) tuples of dirty pages in
cache, I could look them up and use a dedupe_file_range call to replace
the dirty pages with a reference to an existing disk block. If my
listener can do that fast enough, it's in-band dedupe; if it can't, the
data gets flushed to disk as normal, and I fall back to a scan of the
filesystem to clean it up later.

> > > e.g. a soft requirement is that we need to scan the entire fs at
> > > least once a month.
> >
> > I have to scan and dedupe multiple times per hour. OK, the first-ever
> > scan of a non-empty filesystem is allowed to take much longer, but after
> > that, if you have enough spare iops for continuous autodefrag you should
> > also have spare iops for continuous dedupe.
>
> Yup, but using notifications avoids the need for even these scans - you'd
> know exactly what data has changed, when it changed, and know
> exactly what you needed to read to calculate the new hashes.

...if the scanner can keep up with the notifications; otherwise, the
notification receiver has to log them somewhere for the scanner to
catch up.
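As an aside for readers following along: the hash-and-match step such a scanner performs can be sketched in a few lines of shell. This is a toy model only, not how bees actually works (bees keeps its hashes in a persistent table and follows each match with a dedupe_file_range/FIDEDUPERANGE call); the 4096-byte block size is an arbitrary assumption for illustration.

```shell
# Hash a file in fixed-size blocks and report the offset of any block
# whose contents duplicate an earlier block. A real dedupe scanner
# would store these hashes persistently and then ask the kernel to
# share the matching extents; this only finds the candidates.
find_dupe_blocks() {
    file=$1
    bs=4096
    size=$(stat -c %s "$file")
    blocks=$(( (size + bs - 1) / bs ))
    i=0
    while [ "$i" -lt "$blocks" ]; do
        # Hash block i (dd's skip is counted in bs-sized units).
        hash=$(dd if="$file" bs="$bs" skip="$i" count=1 2>/dev/null \
               | md5sum | cut -d' ' -f1)
        printf '%s %s\n' "$hash" $((i * bs))
        i=$((i + 1))
    done | sort | awk 'h == $1 { print "dup at offset " $2 } { h = $1 }'
}
```

Sorting the hash list is the bounded-memory trick: duplicates become adjacent, so the match pass is a streaming comparison rather than an in-RAM table.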
If there are missed or dropped notifications--or 23 hours a day we're
not listening for notifications because we only have an hour a day
maintenance window--some kind of filesystem scan has to be done after
the fact anyway.

> > > A simple piece-wise per-AG scanning algorithm (like we use in
> > > xfs_repair) could easily work within a 3GB RAM per AG constraint and
> > > would scale very well. We'd only need to scan 30-40 AGs in the hour,
> > > and a single AG at 1GB/s will only take 2 minutes to scan. We can
> > > then do the processing while the next AG gets scanned. If we've got
> > > 10-20GB RAM to use (and who doesn't when they have 1PB of storage?)
> > > then we can scan 5-10 AGs at once to keep the IO rate up, and process
> > > them in bulk as we scan more.
> >
> > How do you match dupe blocks from different AGs if you only keep RAM for
> > the duration of one AG scan? Do you not dedupe across AG boundaries?
>
> We could, but do we need to? There's a heap of runtime considerations
> at the filesystem level we need to take into account here, and
> there's every chance that too much consolidation creates
> unpredictable bottlenecks in overwrite workloads that need to break
> the sharing (i.e. COW operations).

I'm well aware of that. I have a bunch of hacks in bees to not be too
efficient lest it push the btrfs reflink bottlenecks too far.

> e.g. An AG contains up to 1TB of data which is more than enough to
> get decent AG-internal dedupe rates. If we've got 1PB of data spread
> across 1000 AGs, deduping a million copies of a common data pattern
> spread across the entire filesystem down to one per AG (i.e. 10^6
> copies down to 10^3) still gives a massive space saving.

That's true for 1000+ AG filesystems, but it's a bigger problem for
filesystems of 2-5 AGs, where each AG holds one copy of 20-50% of the
duplicates on the filesystem. OTOH, a filesystem that small could just
be done in one pass with a larger but still reasonable amount of RAM.
> > What you've described so far means the scope isn't limited anyway. If the > > call is used to dedupe two heavily-reflinked extents together (e.g. > > both duplicate copies are each shared by thousands of snapshots that > > have been created during the month-long period between dedupe runs), > > it could always be stuck doing a lot of work updating dst owners. > > Was there an omitted detail there? > > As I said early in the discussion - if both copies of identical data > are already shared hundreds or thousands of times each, then it > makes no sense to dedupe them again. All that does is create huge > amounts of work updating metadata for very little additional gain. I've had a user complain about the existing 2560-reflink limit in bees, because they were starting with 3000 snapshots of their data before they ran dedupe for the first time, so almost all their data started above the reflink limit before dedupe, and no dedupes occurred because of that. Build servers end up with a 3-4 digit number of reflinks to every
Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Wed, Sep 19, 2018 at 12:12:03AM -0400, Zygo Blaxell wrote:
> On Mon, Sep 10, 2018 at 07:06:46PM +1000, Dave Chinner wrote:
> > On Thu, Sep 06, 2018 at 11:53:06PM -0400, Zygo Blaxell wrote:
> > > On Thu, Sep 06, 2018 at 06:38:09PM +1000, Dave Chinner wrote:
> > > > On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> > > > > On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > > > > > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > > > For future development I've abandoned the entire dedupe_file_range
> > > > > approach. I need to be able to read and dedupe the data blocks of
> > > > > the filesystem directly without having to deal with details like which
> > > > > files those blocks belong to, especially on filesystems with lots of
> > > > > existing deduped blocks and snapshots.
> > > >
> > > > IOWs, your desired OOB dedupe algorithm is:
> > > >
> > > > a) ask the filesystem where all its file data is
> > >
> > > Actually, it's "ask the filesystem where all the *new* file data is"
> > > since we don't want to read any unique data twice on subsequent runs.
> >
> > Sorry, how do you read "unique data" twice? By definition, unique
> > data only occurs once
>
> ...but once it has been read, we don't want to read it again. Ever.
> Even better would be to read unique data less than 1.0 times on average.
>
> > Oh, and you still need to hash the old data so you can find
> > collisions with the new data that got written. Unless, of course,
> > you are keeping your hash tree in a persistent database
>
> I do that.

OK, so you're slowly re-inventing the typical HSM implementation, which
keeps a userspace database to track what is happening in the filesystem.
They do this to make better decisions when moving less frequently
accessed data out to slower tiers of storage, to keep space in the
primary storage available as it fills up. You're approaching dedupe in
essentially the same way, for very similar reasons.
> One big difference I am noticing in our approaches is latency. ZFS (and
> in-kernel btrfs dedupe) provides minimal dedupe latency (duplicate
> data occupies disk space for zero time as it is never written to disk
> at all) but it requires more RAM for a given dedupe hit rate than any
> other dedupe implementation I've seen. What you've written tells me
> XFS saves RAM by partitioning the data and relying on an existing but
> very large source of iops (sharing scrub reads with dedupe), but then
> the dedupe latency is the same as the scrub interval (the worst so far).

That's just the background maintenance implementation. xfs_fsr can run
this way, but it can also be run as a complete immediate scan, or it can
be pointed at a defined set of files to run on. We can (and probably
will) do exactly the same thing with dedupe.

> bees aims to have latency of a few minutes (ideally scanning data while
> it's still dirty in cache, but there's no good userspace API for that)
> though it's obviously not there yet.

Yup, this is pretty much the "I need immediate notification that a
file changed" problem that HSMs face. That's one of the things DMAPI
was used for - XFS has (now unused) dmapi event notification fields in
its inode structure that were used by DMAPI to configure the
notifications the filesystem was supposed to send to userspace.

With no DMAPI in the future, people with custom HSM-like interfaces
based on dmapi are starting to turn to fanotify and friends to
provide them with the change notifications they require.

> > e.g. a soft requirement is that we need to scan the entire fs at
> > least once a month.
>
> I have to scan and dedupe multiple times per hour. OK, the first-ever
> scan of a non-empty filesystem is allowed to take much longer, but after
> that, if you have enough spare iops for continuous autodefrag you should
> also have spare iops for continuous dedupe.
Yup, but using notifications avoids the need for even these scans - you'd
know exactly what data has changed, when it changed, and know
exactly what you needed to read to calculate the new hashes.

> > A simple piece-wise per-AG scanning algorithm (like we use in
> > xfs_repair) could easily work within a 3GB RAM per AG constraint and
> > would scale very well. We'd only need to scan 30-40 AGs in the hour,
> > and a single AG at 1GB/s will only take 2 minutes to scan. We can
> > then do the processing while the next AG gets scanned. If we've got
> > 10-20GB RAM to use (and who doesn't when they have 1PB of storage?)
> > then we can scan 5-10 AGs at once to keep the IO rate up, and process
> > them in bulk as we scan more.
>
> How do you match dupe blocks from different AGs if you only keep RAM for
> the duration of one AG scan? Do you not dedupe across AG boundaries?

We could, but do we need to? There's a heap of runtime considerations
at the filesystem level we need to take into account here, and
there's every chance that too much consolidation
Re: btrfs problems
On 2018-09-20 05:35 PM, Adrian Bastholm wrote:
> Thanks a lot for the detailed explanation.
> About "stable hardware/no lying hardware": I'm not running any raid
> hardware, was planning on just software raid. Three drives glued
> together with "mkfs.btrfs -d raid5 /dev/sdb /dev/sdc /dev/sdd". Would
> this be a safer bet, or would you recommend running the sausage method
> instead, with "-d single" for safety? I'm guessing that if one of the
> drives dies the data is completely lost.
> Another variant I was considering is running a raid1 mirror on two of
> the drives and maybe a subvolume on the third, for less important
> stuff.

In case you were not aware, it's perfectly acceptable with BTRFS to use
RAID 1 over 3 devices. Even more amazing, regardless of how many devices
you start with - 2, 3, 4, whatever - you can add a single drive to the
array to increase capacity (at 50%, of course, i.e., adding a 4TB drive
will give you 2TB usable space, assuming the other drives add up to at
least 4TB to match it).
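[Editor's sketch: the three-device RAID 1 variant discussed above might look like the following. Device names follow the poster's example; treat this as an untested outline, not a recipe.]

```shell
# Three-device btrfs RAID 1: every block group is stored twice, on any
# two of the member devices, so usable space is roughly half the total.
mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd
mount /dev/sdb /mnt

# Growing the array later by a single drive; a balance redistributes
# existing block groups so the new device actually gets used.
btrfs device add /dev/sde /mnt
btrfs balance start /mnt

# See how raw capacity maps to usable space under the raid1 profile.
btrfs filesystem usage /mnt
```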
Re: btrfs problems
On Thu, Sep 20, 2018 at 3:36 PM Adrian Bastholm wrote: > > Thanks a lot for the detailed explanation. > Aabout "stable hardware/no lying hardware". I'm not running any raid > hardware, was planning on just software raid. Yep. I'm referring to the drives, their firmware, cables, logic board, its firmware, the power supply, power, etc. Btrfs is by nature intolerant of corruption. Other file systems are more tolerant because they don't know about it (although recent versions of XFS and ext4 are now defaulting to checksummed metadata and journals). >three drives glued > together with "mkfs.btrfs -d raid5 /dev/sdb /dev/sdc /dev/sdd". Would > this be a safer bet, or would You recommend running the sausage method > instead, with "-d single" for safety ? I'm guessing that if one of the > drives dies the data is completely lost > Another variant I was considering is running a raid1 mirror on two of > the drives and maybe a subvolume on the third, for less important > stuff RAID does not substantially reduce the chances of data loss. It's not anything like a backup. It's an uptime enhancer. If you have backups, and your primary storage dies, of course you can restore from backup no problem, but it takes time and while the restore is happening, you're not online - uptime is killed. If that's a negative, might want to run RAID so you can keep working during the degraded period, and instead of a restore you're doing a rebuild. But of course there is a chance of failure during the degraded period. So you have to have a backup anyway. At least with Btrfs/ZFS, there is another reason to run with some replication like raid1 or raid5 and that's so that if there's corruption or a bad sector, Btrfs doesn't just detect it, it can fix it up with the good copy. For what it's worth, make sure the drives have lower SCT ERC time than the SCSI command timer. This is the same for Btrfs as it is for md and LVM RAID. 
The command timer default is 30 seconds, and most drives have SCT ERC
disabled, with very high recovery times well over 30 seconds. So either
set SCT ERC to something like 70 deciseconds, or increase the command
timer to something like 120 or 180 (either one is absurdly high, but
what you want is for the drive to eventually give up and report a
discrete error, which Btrfs can do something about, rather than a SATA
link reset, in which case Btrfs can't do anything about it).

--
Chris Murphy
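[Editor's sketch of the two options above. /dev/sda stands in for each array member; on most drives SCT ERC settings do not survive a power cycle, so they are typically reapplied from a boot script or udev rule. Untested outline.]

```shell
# Query the drive's current error recovery control setting; consumer
# drives often report that SCT ERC is unsupported or disabled.
smartctl -l scterc /dev/sda

# Preferred: cap read/write recovery at 7.0 seconds (the unit is
# deciseconds), comfortably under the kernel's 30s command timer.
smartctl -l scterc,70,70 /dev/sda

# Fallback for drives that can't set SCT ERC: raise the kernel's SCSI
# command timer instead (seconds, per device, not persistent).
echo 180 > /sys/block/sda/device/timeout
```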
Re: btrfs problems
Thanks a lot for the detailed explanation. About "stable hardware/no
lying hardware": I'm not running any raid hardware, was planning on just
software raid. Three drives glued together with "mkfs.btrfs -d raid5
/dev/sdb /dev/sdc /dev/sdd". Would this be a safer bet, or would you
recommend running the sausage method instead, with "-d single" for
safety? I'm guessing that if one of the drives dies the data is
completely lost.

Another variant I was considering is running a raid1 mirror on two of
the drives and maybe a subvolume on the third, for less important stuff.

BR Adrian

On Thu, Sep 20, 2018 at 9:39 PM Chris Murphy wrote:
>
> On Thu, Sep 20, 2018 at 11:23 AM, Adrian Bastholm wrote:
> > On Mon, Sep 17, 2018 at 2:44 PM Qu Wenruo wrote:
> >
> >> Then I strongly recommend to use the latest upstream kernel and progs
> >> for btrfs. (thus using Debian Testing)
> >>
> >> And if anything went wrong, please report asap to the mail list.
> >>
> >> Especially for fs corruption, that's the ghost I'm always chasing.
> >> So if any corruption happens again (although I hope it won't happen), I
> >> may have a chance to catch it.
> >
> > You got it
> >
> >> >> Anyway, enjoy your stable fs even it's not btrfs
> >> > My new stable fs is too rigid. Can't grow it, can't shrink it, can't
> >> > remove vdevs from it, so I'm planning a comeback to BTRFS. I guess
> >> > after the dust settled I realize I like the flexibility of BTRFS.
> >
> > I'm back to btrfs.
> >
> >> From the code aspect, the biggest difference is the chunk layout.
> >> Due to the ext* block group usage, each block group header (except some
> >> sparse bg) is always used, thus btrfs can't use them.
> >>
> >> This leads to a highly fragmented chunk layout.
> >
> > The only thing I really understood is "highly fragmented" == not good.
> > I might need to google these "chunk" thingies
>
> Chunks are synonymous with block groups. They're like a super extent,
> or extent of extents.
> The block group is how Btrfs abstracts the logical address used most
> everywhere in Btrfs land, and the device + physical location of extents.
> It's how a file is referenced by only one logical address, and doesn't
> need to know either where the extent is located, or how many copies
> there are. The block group allocation profile is what determines if
> there's one copy, duplicate copies, raid1, 10, 5, 6 copies of a chunk
> and where the copies are located. It's also fundamental to how device
> add, remove, replace, file system resize, and balance all interrelate.
>
> >> If your primary concern is to make the fs as stable as possible, then
> >> keep snapshots to a minimal amount, avoid any functionality you won't
> >> use, like qgroup, routinely balance, RAID5/6.
> >
> > So, is RAID5 stable enough? Reading the wiki there's a big fat
> > warning about some parity issues, I read an article about silent
> > corruption (written a while back), and chris says he can't recommend
> > raid56 to mere mortals.
>
> Depends on how you define stable. In recent kernels it's stable on
> stable hardware, i.e. no lying hardware (it actually flushes when it
> claims it has), no power failures, and no failed devices. Of course
> it's designed to help protect against a clear loss of a device, but
> there's tons of stuff here that's just not finished, including ejecting
> bad devices from the array like md and lvm raids will do. Btrfs will
> just keep trying, through all the failures. There are some patches to
> moderate this but I don't think they're merged yet.
>
> You'd also want to be really familiar with how to handle degraded
> operation, if you're going to depend on it, and how to replace a bad
> device. Last I refreshed my memory on it, it's advised to use "btrfs
> device add" followed by "btrfs device remove" for raid56; whereas
> "btrfs replace" is preferred for all other profiles. I'm not sure if
> the "btrfs replace" issues with parity raid were fixed.
> Metadata as raid56 shows a lot more problem reports than metadata
> raid1, so there's something goofy going on in those cases. I'm not
> sure how well understood they are. But other people don't have
> problems with it.
>
> It's worth looking through the archives about some things. Btrfs
> raid56 isn't exactly perfectly COW; there is read-modify-write code,
> which means there can be overwrites. I vaguely recall that it's COW in
> the logical layer, but the physical writes can end up being RMW rather
> than strictly COW.
>
> --
> Chris Murphy

--
Vänliga hälsningar / Kind regards,
Adrian Bastholm
``I would change the world, but they won't give me the sourcecode``
Re: btrfs problems
On Thu, Sep 20, 2018 at 11:23 AM, Adrian Bastholm wrote:
> On Mon, Sep 17, 2018 at 2:44 PM Qu Wenruo wrote:
>
>> Then I strongly recommend to use the latest upstream kernel and progs
>> for btrfs. (thus using Debian Testing)
>>
>> And if anything went wrong, please report asap to the mail list.
>>
>> Especially for fs corruption, that's the ghost I'm always chasing.
>> So if any corruption happens again (although I hope it won't happen), I
>> may have a chance to catch it.
>
> You got it
>
>> >> Anyway, enjoy your stable fs even it's not btrfs
>> > My new stable fs is too rigid. Can't grow it, can't shrink it, can't
>> > remove vdevs from it, so I'm planning a comeback to BTRFS. I guess
>> > after the dust settled I realize I like the flexibility of BTRFS.
>
> I'm back to btrfs.
>
>> From the code aspect, the biggest difference is the chunk layout.
>> Due to the ext* block group usage, each block group header (except some
>> sparse bg) is always used, thus btrfs can't use them.
>>
>> This leads to a highly fragmented chunk layout.
>
> The only thing I really understood is "highly fragmented" == not good.
> I might need to google these "chunk" thingies

Chunks are synonymous with block groups. They're like a super extent, or
extent of extents.

The block group is how Btrfs abstracts the logical address used most
everywhere in Btrfs land, and the device + physical location of extents.
It's how a file is referenced by only one logical address, and doesn't
need to know either where the extent is located, or how many copies
there are. The block group allocation profile is what determines if
there's one copy, duplicate copies, raid1, 10, 5, 6 copies of a chunk
and where the copies are located. It's also fundamental to how device
add, remove, replace, file system resize, and balance all interrelate.
>> If your primary concern is to make the fs as stable as possible, then
>> keep snapshots to a minimal amount, avoid any functionality you won't
>> use, like qgroup, routinely balance, RAID5/6.
>
> So, is RAID5 stable enough? Reading the wiki there's a big fat
> warning about some parity issues, I read an article about silent
> corruption (written a while back), and chris says he can't recommend
> raid56 to mere mortals.

Depends on how you define stable. In recent kernels it's stable on
stable hardware, i.e. no lying hardware (it actually flushes when it
claims it has), no power failures, and no failed devices. Of course
it's designed to help protect against a clear loss of a device, but
there's tons of stuff here that's just not finished, including ejecting
bad devices from the array like md and lvm raids will do. Btrfs will
just keep trying, through all the failures. There are some patches to
moderate this but I don't think they're merged yet.

You'd also want to be really familiar with how to handle degraded
operation, if you're going to depend on it, and how to replace a bad
device. Last I refreshed my memory on it, it's advised to use "btrfs
device add" followed by "btrfs device remove" for raid56; whereas
"btrfs replace" is preferred for all other profiles. I'm not sure if
the "btrfs replace" issues with parity raid were fixed.

Metadata as raid56 shows a lot more problem reports than metadata
raid1, so there's something goofy going on in those cases. I'm not
sure how well understood they are. But other people don't have
problems with it.

It's worth looking through the archives about some things. Btrfs
raid56 isn't exactly perfectly COW; there is read-modify-write code,
which means there can be overwrites. I vaguely recall that it's COW in
the logical layer, but the physical writes can end up being RMW rather
than strictly COW.

--
Chris Murphy
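[Editor's sketch of the two replacement flows mentioned above. Device names, devid, and mount point are hypothetical; this is an untested outline, not advice beyond what the mail itself says.]

```shell
# A missing member usually means mounting degraded first.
mount -o degraded /dev/sdb /mnt

# Preferred one-step path for non-raid56 profiles: replace the dead
# device (named by devid when it can no longer be opened) with a new one.
btrfs replace start -B 2 /dev/sde /mnt

# The add-then-remove sequence suggested for raid56: add the new device
# first so full redundancy is restorable, then remove the missing one,
# which triggers the rebuild onto the new disk.
btrfs device add /dev/sde /mnt
btrfs device remove missing /mnt
```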
Re: inline extents
On Wed, Sep 19, 2018 at 09:18:16PM -0600, Chris Murphy wrote:
> > ext4 has inline data, too, so there's every chance grub will corrupt
> > ext4 filesystems with its wonderful new feature. I'm not sure if
> > the ext4 metadata cksums cover the entire inode and inline data, but
> > if they do it's the same problem as btrfs.
>
> I don't see inline used with a default mkfs, but I do see metadata_csum

The grub environment block is 1024 bytes, so unless you are using a 4k
inode size (the default is 256 bytes), the grub environment block won't
be an inline data file. So in practice this won't be a problem for ext4
today.

There are ways that grub can query the file system (using FIEMAP or
FS_IOC_GETFLAGS) to see whether or not the file is an inline file. For
xfs, grub would also need to make sure the file isn't a reflinked file
(which can also be queried using FIEMAP). That's fine for inline, since
a file won't magically change from being stored in a block to being
stored inline, so grub can check this at grub setup time. But the
problem with reflink is that a file that was originally using a block
exclusively could later have that block shared by another file, at
which point it's only safe to write the block using copy-on-write.

> I know some of the file systems have reserve areas for bootloader
> stuff. I'm not sure if that's preferred over bootloaders just getting
> their own partition and controlling it stem to stern however they
> want.

Ext4 had reserved the bootloader inode, and we have defined a way for a
bootloader to access it. The original intent was that it not just be for
the environment block, but also for the grub stage 1.5 or stage 2
loader. This would allow grub to just use a fixed list of blocks or
extents, and not need to understand the ext4 extent structures. We added
it this way because we anticipated this would be easier for grub than
having it learn how to read ext4 extent structures.
Instead, the bootloader would at setup time use the EXT4_IOC_SWAP_BOOT ioctl and then use a fixed list of blocks or block extents like LILO used to do. Of course, grub exceeded expectations and actually learned how to read ext4 file systems, and probably we (the ext4 developers) should have reached out to GRUB folks. Cheers, - Ted
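[Editor's note: from userspace, the FIEMAP query described above can be approximated with filefrag(8), whose verbose output includes per-extent flags such as "inline" and "shared" - the two conditions that make an in-place overwrite unsafe. The grubenv path below is an assumption and varies by distro.]

```shell
# Show the extent map of the grub environment block; check the flags
# column for "inline" (data lives in the inode) or "shared" (reflinked),
# in which case writing it in place would corrupt fs metadata or siblings.
filefrag -v /boot/grub/grubenv
```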
Re: [PATCH v8 0/6] Btrfs: implement swap file support
On Thu, Sep 20, 2018 at 07:22:55PM +0200, David Sterba wrote: > On Wed, Sep 19, 2018 at 10:02:11PM -0700, Omar Sandoval wrote: > > From: Omar Sandoval > > Changes from v7 [1]: > > > > - Expanded a few commit messages > > - Added Johannes' acked-by on patches 1 and 2 > > - Rebased on v4.19-rc4 > > I've sent my comments, it's mostly about the usability or small > enhancements. As you've got acks from MM people, I hope it would be ok > if I add this series to for-next so we can give it some testing. That'd be great. Feel free to grab it from my git tree (https://github.com/osandov/linux/tree/btrfs-swap) if you want the version with your comments addressed. > The MM patches would better go separately to 4.20 via the mm tree. I > did only build tests so 4.20 target is still feasible but given that > it's rc4 it's a bit too close. There are some limitations posed by the > swapfiles so I'd like to have a chance to do some actual tests myself > and check the usability status. 4.20 would be nice, but I could live with 4.21. I'll just be backporting it to our internal kernel here anyways ;) Let me know how the tests go and which way you want to go. Thanks! It's nice to finally have the end in sight for this series, it's almost 4 years old, although it's changed quite a bit since https://lkml.org/lkml/2014/11/17/141.
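[Editor's sketch: assuming the activation checks land as patch 6/6 describes (fully allocated, nocow, uncompressed, single device), creating a usable swap file on btrfs would presumably follow the classic nocow procedure. Path and size are arbitrary; untested outline.]

```shell
# chattr +C (nocow, which also rules out compression) only takes effect
# on an empty file, hence the truncate-to-zero first.
truncate -s 0 /swapfile
chattr +C /swapfile

# Fully allocate the file so activation never hits a hole.
fallocate -l 4G /swapfile
chmod 600 /swapfile

mkswap /swapfile
swapon /swapfile
```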
Re: [PATCH 0/5] rb_first to rb_first_cached conversion
On Thu, Sep 20, 2018 at 06:49:59PM +0200, David Sterba wrote: > On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote: > > Several structs in btrfs are using rb_first() in a while loop, it'd be > > more efficient to do this with rb_first_cached() which has the O(1) > > complexity. > > > > This patch set updates five structs which may have a large rb tree in > > practice > > > > Liu Bo (5): > > Btrfs: href_root: use rb_first_cached > > Btrfs: href->ref_tree: use rb_first_cached > > Btrfs: delayed_items: use rb_first_cached > > Btrfs: extent_map: use rb_first_cached > > Btrfs: preftree: use rb_first_cached > > For the record, patches have been merged to misc-next. I've changed the > subject titles to > > Btrfs: delayed-refs: use rb_first_cached for href_root > Btrfs: delayed-refs: use rb_first_cached for ref_tree > Btrfs: delayed-inode: use rb_first_cached for ins_root and del_root > Btrfs: extent_map: use rb_first_cached > Btrfs: preftree: use rb_first_cached > > added Holger's tested-by and my reviewed-by. Thanks so much for the efforts. thanks, -liubo
Re: btrfs send hangs after partial transfer and blocks all IO
On Wed, Sep 19, 2018 at 1:41 PM, Jürgen Herrmann wrote:
> Am 13.9.2018 14:35, schrieb Nikolay Borisov:
>>
>> On 13.09.2018 15:30, Jürgen Herrmann wrote:
>>>
>>> OK, I will install kdump later and perform a dump after the hang.
>>>
>>> One more noob question beforehand: does this dump contain sensitive
>>> information, for example the luks encryption key for the disk etc? A
>>> Google search only brings up one relevant search result which can only
>>> be viewed with a redhat subscription...
>>
>> So a kdump will dump the kernel memory, so it's possible that the LUKS
>> encryption keys could be extracted from that image. Bummer, it's
>> understandable why you wouldn't want to upload it :). In this case you'd
>> also have to install the 'crash' utility to open the crashdump and
>> extract the calltrace of the btrfs process. The rough process should be:
>>
>> crash 'path to vmlinux' 'path to vmcore file', then once inside the
>> crash utility:
>>
>> set <pid> - you can acquire the pid by issuing 'ps', which will give
>> you a ps-like output of all running processes at the time of crash.
>> After the context has been set you can run 'bt', which will give you a
>> backtrace of the send process.
>>
>>> Best regards,
>>> Jürgen
>>>
>>> Am 13. September 2018 14:02:11 schrieb Nikolay Borisov:
>>> On 13.09.2018 14:50, Jürgen Herrmann wrote:
>
> I was echoing "w" to /proc/sysrq-trigger every 0.5s, which did work also
> after the hang because I started the loop before the hang. The dmesg
> output should show the hanging tasks from second 346 on or so. Still not
> useful?

So from 346 it's evident that the transaction commit is waiting for
commit_root_sem to be acquired. So something else is holding it and not
giving the transaction a chance to finish committing. Now the only place
where send acquires this lock is in find_extent_clone, around the call
to extent_from_logical. The latter basically does an extent tree search
and doesn't loop, so it can't possibly deadlock.
Furthermore, I don't see any userspace processes being hung in kernel
space. Additionally, looking at the userspace processes, they indicate
that find_extent_clone has finished and they are blocked in
send_write_or_clone, which does the write. But I guess this actually
happens before the hang. So at this point, without looking at the
stacktrace of the btrfs send process after the hang has occurred, I
don't think much can be done.

> I know this is probably not the correct list to ask this question but
> maybe someone of the devs can point me to the right list?
>
> I cannot get kdump to work. The crashkernel is loaded and everything is
> set up for it afaict. I asked a question on this over at stackexchange
> but no answer yet.
> https://unix.stackexchange.com/questions/469838/linux-kdump-does-not-boot-second-kernel-when-kernel-is-crashing
>
> So I did a little digging and added some debug printk() statements to
> see what's going on, and it seems that panic() is never called. Maybe
> the second stack trace is the reason?
> Screenshot is here: https://t-5.eu/owncloud/index.php/s/OegsikXo4VFLTJN
>
> Could someone please tell me where I can report this problem and get
> some help on this topic?

Try the kexec mailing list. They handle kdump.
http://lists.infradead.org/mailman/listinfo/kexec

--
Chris Murphy
Re: btrfs problems
On Mon, Sep 17, 2018 at 2:44 PM Qu Wenruo wrote:
>
> Then I strongly recommend to use the latest upstream kernel and progs
> for btrfs. (thus using Debian Testing)
>
> And if anything went wrong, please report asap to the mail list.
>
> Especially for fs corruption, that's the ghost I'm always chasing.
> So if any corruption happens again (although I hope it won't happen), I
> may have a chance to catch it.

You got it

> >> Anyway, enjoy your stable fs even it's not btrfs
> > My new stable fs is too rigid. Can't grow it, can't shrink it, can't
> > remove vdevs from it, so I'm planning a comeback to BTRFS. I guess
> > after the dust settled I realize I like the flexibility of BTRFS.

I'm back to btrfs.

> From the code aspect, the biggest difference is the chunk layout.
> Due to the ext* block group usage, each block group header (except some
> sparse bg) is always used, thus btrfs can't use them.
>
> This leads to a highly fragmented chunk layout.

The only thing I really understood is "highly fragmented" == not good.
I might need to google these "chunk" thingies

> We don't have error reports about such a layout yet, but if you want
> everything to be as stable as possible, I still recommend using a newly
> created fs.

I guess I'll stick with ext4 on the rootfs

> > Another thing is I'd like to see a "first steps after getting started"
> > section in the wiki. Something like take your first snapshot, back
> > up, how to think when running it - can I just set some cron jobs and
> > forget about it, or does it need constant attention, and stuff like
> > that.
>
> There are projects that do such things automatically, like snapper.
>
> If your primary concern is to make the fs as stable as possible, then
> keep snapshots to a minimal amount, avoid any functionality you won't
> use, like qgroup, routinely balance, RAID5/6.

So, is RAID5 stable enough?
reading the wiki there's a big fat warning about some parity issues, I read an article about silent corruption (written a while back), and chris says he can't recommend raid56 to mere mortals. > And keep the necessary btrfs specific operations to minimal, like > subvolume/snapshot (and don't keep too many snapshots, say over 20), > shrink, send/receive. > > Thanks, > Qu > > > > > BR Adrian > > > > > -- Vänliga hälsningar / Kind regards, Adrian Bastholm ``I would change the world, but they won't give me the sourcecode``
Re: [PATCH v8 6/6] Btrfs: support swap files
On Thu, Sep 20, 2018 at 07:15:41PM +0200, David Sterba wrote: > On Wed, Sep 19, 2018 at 10:02:17PM -0700, Omar Sandoval wrote: > > From: Omar Sandoval > > > > Btrfs has not allowed swap files since commit 35054394c4b3 ("Btrfs: stop > > providing a bmap operation to avoid swapfile corruptions"). However, now > > that the proper restrictions are in place, Btrfs can support swap files > > through the swap file a_ops, similar to iomap in commit 67482129cdab > > ("iomap: add a swapfile activation function"). > > > > For Btrfs, activation needs to make sure that the file can be used as a > > swap file, which currently means that it must be fully allocated as > > nocow with no compression on one device. It must also do the proper > > tracking so that ioctls will not interfere with the swap file. > > Deactivation clears this tracking. > > > > Signed-off-by: Omar Sandoval > > --- > > fs/btrfs/inode.c | 317 +++ > > 1 file changed, 317 insertions(+) > > > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > > index 3ea5339603cf..0586285b1d9f 100644 > > --- a/fs/btrfs/inode.c > > +++ b/fs/btrfs/inode.c [snip] > > +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file > > *file, > > + sector_t *span) > > +{ > > + struct inode *inode = file_inode(file); > > + struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info; > > + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; > > + struct extent_state *cached_state = NULL; > > + struct extent_map *em = NULL; > > + struct btrfs_device *device = NULL; > > + struct btrfs_swap_info bsi = { > > + .lowest_ppage = (sector_t)-1ULL, > > + }; > > + int ret = 0; > > + u64 isize = inode->i_size; > > + u64 start; > > + > > + /* > > + * If the swap file was just created, make sure delalloc is done. If the > > + * file changes again after this, the user is doing something stupid and > > + * we don't really care.
> > + */ > > + ret = btrfs_wait_ordered_range(inode, 0, (u64)-1); > > + if (ret) > > + return ret; > > + > > + /* > > + * The inode is locked, so these flags won't change after we check them. > > + */ > > + if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) { > > + btrfs_info(fs_info, "swapfile must not be compressed"); > > + return -EINVAL; > > + } > > + if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) { > > + btrfs_info(fs_info, "swapfile must not be copy-on-write"); > > + return -EINVAL; > > + } > > I wonder if we should also explicitly check for the checksums flag, ie. > that NODATASUM is present. Right now it's bound to NODATACOW, but as > with other sanity checks, it does not hurt to have it here. > > > + > > + /* > > + * Balance or device remove/replace/resize can move stuff around from > > + * under us. The EXCL_OP flag makes sure they aren't running/won't run > > + * concurrently while we are mapping the swap extents, and > > + * fs_info->swapfile_pins prevents them from running while the swap file > > + * is active and moving the extents. Note that this also prevents a > > + * concurrent device add which isn't actually necessary, but it's not > > + * really worth the trouble to allow it. > > + */ > > + if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags)) > > + return -EBUSY; > > This could also be accompanied by a message, "why doesn't my swapfile > activate?" -> "there's an exclusive operation running". I've checked if > there are similar messages for the other exclusive ops. There are. Sounds good. I addressed all of your comments and pushed to https://github.com/osandov/linux/tree/btrfs-swap. The only thing I didn't change was the btrfs_info about not being able to relocate an active swapfile. I think it makes sense as btrfs_info since we already log every block group we are relocating as info (see describe_relocation()).
Re: [PATCH v8 0/6] Btrfs: implement swap file support
On Wed, Sep 19, 2018 at 10:02:11PM -0700, Omar Sandoval wrote: > From: Omar Sandoval > Changes from v7 [1]: > > - Expanded a few commit messages > - Added Johannes' acked-by on patches 1 and 2 > - Rebased on v4.19-rc4 I've sent my comments, it's mostly about the usability or small enhancements. As you've got acks from MM people, I hope it would be ok if I add this series to for-next so we can give it some testing. The MM patches would better go separately to 4.20 via the mm tree. I did only build tests so 4.20 target is still feasible but given that it's rc4 it's a bit too close. There are some limitations posed by the swapfiles so I'd like to have a chance to do some actual tests myself and check the usability status.
Re: [PATCH v8 6/6] Btrfs: support swap files
On Wed, Sep 19, 2018 at 10:02:17PM -0700, Omar Sandoval wrote: > From: Omar Sandoval > > Btrfs has not allowed swap files since commit 35054394c4b3 ("Btrfs: stop > providing a bmap operation to avoid swapfile corruptions"). However, now > that the proper restrictions are in place, Btrfs can support swap files > through the swap file a_ops, similar to iomap in commit 67482129cdab > ("iomap: add a swapfile activation function"). > > For Btrfs, activation needs to make sure that the file can be used as a > swap file, which currently means that it must be fully allocated as > nocow with no compression on one device. It must also do the proper > tracking so that ioctls will not interfere with the swap file. > Deactivation clears this tracking. > > Signed-off-by: Omar Sandoval > --- > fs/btrfs/inode.c | 317 +++ > 1 file changed, 317 insertions(+) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 3ea5339603cf..0586285b1d9f 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -27,6 +27,7 @@ > #include > #include > #include > +#include > #include > #include "ctree.h" > #include "disk-io.h" > @@ -10488,6 +10489,320 @@ void btrfs_set_range_writeback(struct > extent_io_tree *tree, u64 start, u64 end) > } > } > > +/* > + * Add an entry indicating a block group or device which is pinned by a > + * swapfile. Returns 0 on success, 1 if there is already an entry for it, or > a > + * negative errno on failure. 
> + */ > +static int btrfs_add_swapfile_pin(struct inode *inode, void *ptr, > + bool is_block_group) > +{ > + struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info; > + struct btrfs_swapfile_pin *sp, *entry; > + struct rb_node **p; > + struct rb_node *parent = NULL; > + > + sp = kmalloc(sizeof(*sp), GFP_NOFS); > + if (!sp) > + return -ENOMEM; > + sp->ptr = ptr; > + sp->inode = inode; > + sp->is_block_group = is_block_group; > + > + spin_lock(&fs_info->swapfile_pins_lock); > + p = &fs_info->swapfile_pins.rb_node; > + while (*p) { > + parent = *p; > + entry = rb_entry(parent, struct btrfs_swapfile_pin, node); > + if (sp->ptr < entry->ptr || > + (sp->ptr == entry->ptr && sp->inode < entry->inode)) { > + p = &(*p)->rb_left; > + } else if (sp->ptr > entry->ptr || > + (sp->ptr == entry->ptr && sp->inode > entry->inode)) { > + p = &(*p)->rb_right; > + } else { > + spin_unlock(&fs_info->swapfile_pins_lock); > + kfree(sp); > + return 1; > + } > + } > + rb_link_node(&sp->node, parent, p); > + rb_insert_color(&sp->node, &fs_info->swapfile_pins); > + spin_unlock(&fs_info->swapfile_pins_lock); > + return 0; > +} > + > +/* Free all of the entries pinned by this swapfile.
*/ > +static void btrfs_free_swapfile_pins(struct inode *inode) > +{ > + struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info; > + struct btrfs_swapfile_pin *sp; > + struct rb_node *node, *next; > + > + spin_lock(&fs_info->swapfile_pins_lock); > + node = rb_first(&fs_info->swapfile_pins); > + while (node) { > + next = rb_next(node); > + sp = rb_entry(node, struct btrfs_swapfile_pin, node); > + if (sp->inode == inode) { > + rb_erase(&sp->node, &fs_info->swapfile_pins); > + if (sp->is_block_group) > + btrfs_put_block_group(sp->ptr); > + kfree(sp); > + } > + node = next; > + } > + spin_unlock(&fs_info->swapfile_pins_lock); > +} > + > +struct btrfs_swap_info { > + u64 start; > + u64 block_start; > + u64 block_len; > + u64 lowest_ppage; > + u64 highest_ppage; > + unsigned long nr_pages; > + int nr_extents; > +}; > + > +static int btrfs_add_swap_extent(struct swap_info_struct *sis, > + struct btrfs_swap_info *bsi) > +{ > + unsigned long nr_pages; > + u64 first_ppage, first_ppage_reported, next_ppage; > + int ret; > + > + first_ppage = ALIGN(bsi->block_start, PAGE_SIZE) >> PAGE_SHIFT; > + next_ppage = ALIGN_DOWN(bsi->block_start + bsi->block_len, > + PAGE_SIZE) >> PAGE_SHIFT; > + > + if (first_ppage >= next_ppage) > + return 0; > + nr_pages = next_ppage - first_ppage; > + > + first_ppage_reported = first_ppage; > + if (bsi->start == 0) > + first_ppage_reported++; > + if (bsi->lowest_ppage > first_ppage_reported) > + bsi->lowest_ppage = first_ppage_reported; > + if (bsi->highest_ppage < (next_ppage - 1)) > + bsi->highest_ppage = next_ppage - 1; > + > + ret = add_swap_extent(sis, bsi->nr_pages, nr_pages, first_ppage); > + if
Re: [PATCH v8 5/6] Btrfs: rename get_chunk_map() and make it non-static
On Wed, Sep 19, 2018 at 10:02:16PM -0700, Omar Sandoval wrote: > From: Omar Sandoval > > The Btrfs swap code is going to need it, so give it a btrfs_ prefix and > make it non-static. > > Reviewed-by: Nikolay Borisov > Signed-off-by: Omar Sandoval > --- > fs/btrfs/volumes.c | 29 ++--- > fs/btrfs/volumes.h | 2 ++ > 2 files changed, 20 insertions(+), 11 deletions(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index a2761395ed22..fe66b635c023 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -2714,8 +2714,15 @@ static int btrfs_del_sys_chunk(struct btrfs_fs_info > *fs_info, u64 chunk_offset) > return ret; > } > > -static struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info, > - u64 logical, u64 length) > +/** Minor nit, I'll fix that eventually: the /** is for kernel doc but as the function is internal to btrfs and there's no documentation generated from that, it's better to use plain /* . The parameter formatting according to the kernel doc documentation is fine but does not strictly need to conform as long as it's human readable. Otherwise patch is ok.
Re: [PATCH v8 4/6] Btrfs: prevent ioctls from interfering with a swap file
On Wed, Sep 19, 2018 at 10:02:15PM -0700, Omar Sandoval wrote: > --- a/fs/btrfs/dev-replace.c > +++ b/fs/btrfs/dev-replace.c > @@ -414,6 +414,14 @@ int btrfs_dev_replace_start(struct btrfs_fs_info > *fs_info, > if (ret) > return ret; > > + if (btrfs_pinned_by_swapfile(fs_info, src_device)) { > + btrfs_info_in_rcu(fs_info, > + "cannot replace device %s (devid %llu) due to > active swapfile", Please un-indent the messages so they fit under 80 columns; the arguments can stay aligned as before. > + btrfs_dev_name(src_device), > + src_device->devid); > + return -ETXTBSY; I think all the error messages before returning ETXTBSY or EINVAL should be btrfs_error or btrfs_warning. The info level can be filtered out and easily lost in the logs. If there's user intention to activate the swap it needs to be resolved. > + } > + > ret = btrfs_init_dev_replace_tgtdev(fs_info, tgtdev_name, > src_device, &tgt_device); > if (ret) > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -3626,10 +3634,15 @@ static int __btrfs_balance(struct btrfs_fs_info > *fs_info) > > ret = btrfs_relocate_chunk(fs_info, found_key.offset); > mutex_unlock(&fs_info->delete_unused_bgs_mutex); > - if (ret && ret != -ENOSPC) > - goto error; > if (ret == -ENOSPC) { > enospc_errors++; > + } else if (ret == -ETXTBSY) { > + btrfs_info(fs_info, > +"skipping relocation of block group %llu due > to active swapfile", I wonder if this should be visible even on the info level, ie. whether it could be debug. The swap file can be long-lived, and the balance could be run several times during that period. > +found_key.offset); > + ret = 0; > + } else if (ret) { > + goto error; > } else { > spin_lock(&fs_info->balance_lock); > bctl->stat.completed++;
Re: [PATCH 0/5] rb_first to rb_first_cached conversion
On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote: > Several structs in btrfs are using rb_first() in a while loop, it'd be > more efficient to do this with rb_first_cached() which has the O(1) > complexity. > > This patch set updates five structs which may have a large rb tree in > practice > > Liu Bo (5): > Btrfs: href_root: use rb_first_cached > Btrfs: href->ref_tree: use rb_first_cached > Btrfs: delayed_items: use rb_first_cached > Btrfs: extent_map: use rb_first_cached > Btrfs: preftree: use rb_first_cached For the record, patches have been merged to misc-next. I've changed the subject titles to Btrfs: delayed-refs: use rb_first_cached for href_root Btrfs: delayed-refs: use rb_first_cached for ref_tree Btrfs: delayed-inode: use rb_first_cached for ins_root and del_root Btrfs: extent_map: use rb_first_cached Btrfs: preftree: use rb_first_cached added Holger's tested-by and my reviewed-by.
Re: cannot mount btrfs as root
On 20 September 2018 at 08:20, Zbigniew 'zibi' Jarosik wrote: > On 19 September 2018 at 15:54, Qu Wenruo wrote: >> >> Does the initramfs do a "btrfs dev scan" to populate the btrfs device list? > > This solved the problem. Looks like btrfs scans disks too early, before > bcache initializes. I need to debug the initramfs scripts. There is a race condition: bcache initializes slower than expected, so btrfs cannot find the devices because they are not ready yet. rootdelay=10 in grub solved the problem temporarily; I need to write a boot script that handles this and waits until the bcache devices are ready to use. -- Zbigniew 'zibi' Jarosik cell: +48 667956686 jid: tap...@gmail.com gg: 2830 http://zibi.nora.pl/
Re: [PATCH 0/5] rb_first to rb_first_cached conversion
On Fri, Sep 14, 2018 at 12:14:57PM -0700, Liu Bo wrote: > On Fri, Sep 14, 2018 at 05:11:02PM +0200, David Sterba wrote: > > On Tue, Sep 11, 2018 at 11:31:49AM -0700, Liu Bo wrote: > > > On Tue, Sep 11, 2018 at 05:34:03PM +0200, David Sterba wrote: > > > > On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote: > > > > > Several structs in btrfs are using rb_first() in a while loop, it'd be > > > > > more efficient to do this with rb_first_cached() which has the O(1) > > > > > complexity. > > > > > > > > We'd like to see some numbers, though just by reading the code I think > > > > there's going to be a noticeable improvement for some workloads. > > > > > > > > There's a common pattern: > > > > > > > > while(node = rb_first) { > > > > entry = rb_entry(node) > > > > next = rb_next(node) > > > > rb_erase(node) > > > > cleanup(entry) > > > > } > > > > > > > > rb_first needs to traverse the tree up to logN depth, rb_erase can > > > > completely reshuffle the tree. With the caching we'll skip the traversal > > > > in rb_first. That's a cached memory access vs looped pointer > > > > dereference trade-off that IMHO has a clear winner. > > > > > > > > As the pattern in many cases traverses the whole tree and removes all > > > > entries, ideally we could get rid of the rebalancing too. The entries > > > > would be cleaned up in a postorder way, ie. reset the data and kfree in > > > > the end. This requires that the order of the freeing does not matter, which > > > > might not apply to some trees. > > > > > > The order of freeing does not matter, but we need the tree to return > > > the correct next node to free, IOW, we have to maintain the rbtree > > > until the last two nodes. > > > > > > > > > > > I did not find suitable rbtree functions or helpers for that, > > > > rbtree_postorder_for_each_entry_safe does not allow rb_erase and > > > > rb_erase itself forces the rebalancing internally. But I think we can > > > > implement that.
> > > > > > > > We could possibly use rb_next_postorder that makes sure that all > > > > children are traversed before the parent, so this is at least something > > > > that could be considered. > > > > > > > > > > Qu was inquiring about the same thing, I haven't got time to dig it > > > further. > > > > > > The question we need to answer is whether we care about how fast > > > destruction can be, as some are in critical paths while some are > > > not. > > > > Yeah, I take __btrfs_return_cluster_to_free_space as an example where > > the more efficient tree destruction would shorten the time where a lock > > is held. > > > > In other cases it might complicate the code too much, though the > > performance is not critical, eg. at umount time. Although, it'd be good > > to optimize that too now that we know how. > > > > And in the remaining cases it's not possible to avoid rb_erase. I had a > > closer look at btrfs_cleanup_defrag_inodes and btrfs_put_tree_mod_seq. > > Based on that I'd like to add this series to for-next, the first node > > caching is clear enough and safe. We can do the other optimizations > > later. > > This needs more fine-grained debugging to see where the cost > lies. > > About the perf. number of rb_cached_node, I measured the time spent in > the extent map loop in btrfs_evict_inode(), > > evict_inode_truncate_pages() > while (!RB_EMPTY_ROOT(&em_tree->map)) { > rb_first(); > rb_erase(); > } > > > for an em tree containing 10,000 ems. The maximum depth of the rbtree for this number of nodes is roughly 2 * lg(10000) ≈ 2 * 14 = 28. This is not an average case so it's going to be better in practice, but still 10+ pointer dereferences can be exchanged with a pointer update. > - with rb_first_cached, the loop takes 4880454 ns, > - with rb_first, the loop takes 4584148 ns, > > looks like the difference is not that much, ~6%. After all it's about > scalability: the bigger the rb tree gets, the more rb_first_cached() wins.
Yeah, there's some difference but I think the cache effects will make the measurements hard to estimate, so I take the numbers as a sort of confirmation that there's not going to be a big difference. The delayed refs and delayed inode can get some speedup from that as there's usually only rb_first without rb_erase. I'm going to update the first patch with the analysis we did and reference the patch from the others so we'll have a starting point for further optimizations and can verify if the described expected speedups happen.
Re: Punch hole on full fs
On 2018/9/20 下午5:04, Anand Jain wrote: > > > On 09/20/2018 04:45 PM, Qu Wenruo wrote: >> >> >> On 2018/9/20 下午1:26, anand.j...@oracle.com wrote: >>> >>> >>> Test script [1] tries to punch hole on a full FS and it works fine as >>> long as the hole size and the offset is aligned with the sectorsize and >>> the extent, so that it could just drop the relevant extent to create the >>> hole. >>> >>> The reason why this test fails for non aligned hole size and offset is >>> because, as it rightfully tries to truncate the non aligned extent at >>> the front and back of the hole, it tries to create new extent which ends >>> up with ENOSPC. >> >> This is what btrfs should do, I don't see any problem though. >> >>> >>> Any idea, how do we solve this? >> >> I don't really think this is a problem though. >> >> As long as we need to do data COW, such ENOSPC would happen. > > It happens even with '-o max_inline=0 -o nodatacow' so certainly there > is some problem, but looking at btrfs_punch_hole() it apparently looks > like there is no way to fix this as sub-sectorsize hole-size depends on > the truncate which needs to alloc space. Forgot to mention: in your test script, it only fills data space, so we should still have enough free metadata space to fulfill fs/extent/root tree operations. Although the problem is, btrfs_truncate_block() will call btrfs_delalloc_reserve_space() even if the inode has the NODATACOW flag. (The buffered write path does the extra check, although not a completely perfect one.) So in short, you could still fix this false ENOSPC problem by enhancing btrfs_truncate_block() to follow the NODATACOW check. Thanks, Qu > >>> >>> xfs is fine. >> >> Of course xfs is fine, XFS defaults to do NODATACOW, in contrast to >> btrfs' default DATACOW. >> >> Have you tried to do the same when the file is reflinked? > > No but I presume it will rightfully fail.
> > Thanks, -Anand > >> Thanks, >> Qu >> >>> >>> [1] >>> cat ./punch-hole-on-full-fs >>> >>> cleanup() >>> { >>> umount /dev/sdb > /dev/null 2>&1 >>> btrfs_reload >>> } >>> >>> full_fs_setup() >>> { >>> btrfs-dyndbg disable >>> mkfs.$fs -b 200M -fq $mkfs_options /dev/sdb || exit >>> mount $mount_options /dev/sdb /mnt/scratch || exit >>> dd status=none if=/dev/zero of=/mnt/scratch/filler bs=512 > >>> /dev/null 2>&1 >>> } >>> >>> test() >>> { >>> cleanup >>> full_fs_setup >>> >>> btrfs-dyndbg enable >>> echo " fallocate -p -o 0 -l $punch_hole_size >>> /mnt/scratch/filler >>> " >>> fallocate -p -o 0 -l $punch_hole_size /mnt/scratch/filler >>> } >>> >>> fs=btrfs; mkfs_options=""; mount_options=""; >>> [[ $1 ]] || { echo "usage: $0 "; exit; } >>> punch_hole_size=$1 && test >>> >>> >>> ./punch-hole-on-full-fs 512 >>> -- fallocate -p -o 0 -l 512 /mnt/scratch/filler -- >>> fallocate: /mnt/scratch/filler: fallocate failed: No space left on >>> device >>> >>> ./punch-hole-on-full-fs 8000 >>> -- fallocate -p -o 0 -l 8000 /mnt/scratch/filler -- >>> fallocate: /mnt/scratch/filler: fallocate failed: No space left on >>> device >>> >>> >>> Thanks, Anand >> signature.asc Description: OpenPGP digital signature
Re: [PATCH V5 RESEND] Btrfs: enchanse raid1/10 balance heuristic
Thu, 20 Sep 2018 at 12:05, Peter Becker: > > i like the idea. > do you have any benchmarks for this change? > > the general logic looks good for me. https://patchwork.kernel.org/patch/10137909/ > > Tested-by: Dmitrii Tcvetkov > > Benchmark summary (arithmetic mean of 3 runs): > Mainline Patch > -- > RAID1 | 18.9 MiB/s | 26.5 MiB/s > RAID10 | 30.7 MiB/s | 30.7 MiB/s > fio configuration: > [global] > ioengine=libaio > buffered=0 > direct=1 > bssplit=32k/100 > size=8G > directory=/mnt/ > iodepth=16 > time_based > runtime=900 > > [test-fio] > rw=randread > > All tests were run on 4 HDD btrfs filesystem in a VM with 4 GB > of ram on an idle host. Full results attached to the email. Also: https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg71758.html - - - So, IIRC, it works at least.
Re: Punch hole on full fs
On 2018/9/20 下午5:04, Anand Jain wrote: > > > On 09/20/2018 04:45 PM, Qu Wenruo wrote: >> >> >> On 2018/9/20 下午1:26, anand.j...@oracle.com wrote: >>> >>> >>> Test script [1] tries to punch hole on a full FS and it works fine as >>> long as the hole size and the offset is aligned with the sectorsize and >>> the extent, so that it could just drop the relevant extent to create the >>> hole. >>> >>> The reason why this test fails for non aligned hole size and offset is >>> because, as it rightfully tries to truncate the non aligned extent at >>> the front and back of the hole, it tries to create new extent which ends >>> up with ENOSPC. >> >> This is what btrfs should do, I don't see any problem though. >> >>> >>> Any idea, how do we solve this? >> >> I don't really think this is a problem though. >> >> As long as we need to do data COW, such ENOSPC would happen. > > It happens even with '-o max_inline=0 -o nodatacow' so certainly there > is some problem, but looking at btrfs_punch_hole() it apparently looks > like there is no way to fix this as sub-sectorsize hole-size depends on > the truncate which needs to alloc space. Well, at least in my test, that's not the case. I did a pretty simple test, and for a NODATACOW inode, the sub-sectorsize part didn't get COWed, just as we expect: # mkfs.btrfs -f $dev # mount -o nodatacow $dev $mnt # xfs_io -f -c "pwrite 0 16K" -c sync $mnt/file # btrfs ins dump-tree -t 5 $dev # xfs_io -c "fpunch 2K 6K" $mnt/file # btrfs ins dump-tree -t 5 $dev The last punch hole did create a 4K hole as expected, but the 0~4K extent doesn't get COWed at all.
item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
		generation 6 type 1 (regular)
		extent data disk byte 13631488 nr 16384 <<<
		extent data offset 0 nr 4096 ram 16384
		extent compression 0 (none)
item 7 key (257 EXTENT_DATA 4096) itemoff 15760 itemsize 53
		generation 8 type 1 (regular)
		extent data disk byte 0 nr 0
		extent data offset 0 nr 4096 ram 4096
		extent compression 0 (none)
item 8 key (257 EXTENT_DATA 8192) itemoff 15707 itemsize 53
		generation 6 type 1 (regular)
		extent data disk byte 13631488 nr 16384 <<<
		extent data offset 8192 nr 8192 ram 16384
		extent compression 0 (none)
So the failure even with nodatacow mount option looks a little suspicious now. Thanks, Qu > >>> >>> xfs is fine. >> >> Of course xfs is fine, XFS defaults to do NODATACOW, in contrast to >> btrfs' default DATACOW. >> >> Have you tried to do the same when the file is reflinked? > > No but I presume it will rightfully fail. > > Thanks, -Anand > >> Thanks, >> Qu >> >>> >>> [1] >>> cat ./punch-hole-on-full-fs >>> >>> cleanup() >>> { >>> umount /dev/sdb > /dev/null 2>&1 >>> btrfs_reload >>> } >>> >>> full_fs_setup() >>> { >>> btrfs-dyndbg disable >>> mkfs.$fs -b 200M -fq $mkfs_options /dev/sdb || exit >>> mount $mount_options /dev/sdb /mnt/scratch || exit >>> dd status=none if=/dev/zero of=/mnt/scratch/filler bs=512 > >>> /dev/null 2>&1 >>> } >>> >>> test() >>> { >>> cleanup >>> full_fs_setup >>> >>> btrfs-dyndbg enable >>> echo " fallocate -p -o 0 -l $punch_hole_size >>> /mnt/scratch/filler >>> " >>> fallocate -p -o 0 -l $punch_hole_size /mnt/scratch/filler >>> } >>> >>> fs=btrfs; mkfs_options=""; mount_options=""; >>> [[ $1 ]] || { echo "usage: $0 "; exit; } >>> punch_hole_size=$1 && test >>> >>> >>> ./punch-hole-on-full-fs 512 >>> -- fallocate -p -o 0 -l 512 /mnt/scratch/filler -- >>> fallocate: /mnt/scratch/filler: fallocate failed: No space left on >>> device >>> >>> ./punch-hole-on-full-fs 8000 >>> -- fallocate -p -o 0 -l 8000 /mnt/scratch/filler -- >>> fallocate: /mnt/scratch/filler:
fallocate failed: No space left on >>> device >>> >>> >>> Thanks, Anand >> signature.asc Description: OpenPGP digital signature
Re: [PATCH V5 RESEND] Btrfs: enchanse raid1/10 balance heuristic
i like the idea. do you have any benchmarks for this change? the general logic looks good for me.
Re: Punch hole on full fs
On 09/20/2018 04:45 PM, Qu Wenruo wrote: On 2018/9/20 下午1:26, anand.j...@oracle.com wrote: Test script [1] tries to punch hole on a full FS and it works fine as long as the hole size and the offset is aligned with the sectorsize and the extent, so that it could just drop the relevant extent to create the hole. The reason why this test fails for non aligned hole size and offset is because, as it rightfully tries to truncate the non aligned extent at the front and back of the hole, it tries to create new extent which ends up with ENOSPC. This is what btrfs should do, I don't see any problem though. Any idea, how do we solve this? I don't really think this is a problem though. As long as we need to do data COW, such ENOSPC would happen. It happens even with '-o max_inline=0 -o nodatacow' so certainly there is some problem, but looking at btrfs_punch_hole() it apparently looks like there is no way to fix this as sub-sectorsize hole-size depends on the truncate which needs to alloc space. xfs is fine. Of course xfs is fine, XFS defaults to do NODATACOW, in contrast to btrfs' default DATACOW. Have you tried to do the same when the file is reflinked? No but I presume it will rightfully fail. 
Thanks, -Anand Thanks, Qu [1] cat ./punch-hole-on-full-fs cleanup() { umount /dev/sdb > /dev/null 2>&1 btrfs_reload } full_fs_setup() { btrfs-dyndbg disable mkfs.$fs -b 200M -fq $mkfs_options /dev/sdb || exit mount $mount_options /dev/sdb /mnt/scratch || exit dd status=none if=/dev/zero of=/mnt/scratch/filler bs=512 > /dev/null 2>&1 } test() { cleanup full_fs_setup btrfs-dyndbg enable echo " fallocate -p -o 0 -l $punch_hole_size /mnt/scratch/filler " fallocate -p -o 0 -l $punch_hole_size /mnt/scratch/filler } fs=btrfs; mkfs_options=""; mount_options=""; [[ $1 ]] || { echo "usage: $0 "; exit; } punch_hole_size=$1 && test ./punch-hole-on-full-fs 512 -- fallocate -p -o 0 -l 512 /mnt/scratch/filler -- fallocate: /mnt/scratch/filler: fallocate failed: No space left on device ./punch-hole-on-full-fs 8000 -- fallocate -p -o 0 -l 8000 /mnt/scratch/filler -- fallocate: /mnt/scratch/filler: fallocate failed: No space left on device Thanks, Anand
Re: Punch hole on full fs
On 2018/9/20 下午1:26, anand.j...@oracle.com wrote: > > > Test script [1] tries to punch hole on a full FS and it works fine as > long as the hole size and the offset is aligned with the sectorsize and > the extent, so that it could just drop the relevant extent to create the > hole. > > The reason why this test fails for non aligned hole size and offset is > because, as it rightfully tries to truncate the non aligned extent at > the front and back of the hole, it tries to create new extent which ends > up with ENOSPC. This is what btrfs should do, I don't see any problem though. > > Any idea, how do we solve this? I don't really think this is a problem though. As long as we need to do data COW, such ENOSPC would happen. > > xfs is fine. Of course xfs is fine, XFS defaults to do NODATACOW, in contrast to btrfs' default DATACOW. Have you tried to do the same when the file is reflinked? Thanks, Qu > > [1] > cat ./punch-hole-on-full-fs > > cleanup() > { > umount /dev/sdb > /dev/null 2>&1 > btrfs_reload > } > > full_fs_setup() > { > btrfs-dyndbg disable > mkfs.$fs -b 200M -fq $mkfs_options /dev/sdb || exit > mount $mount_options /dev/sdb /mnt/scratch || exit > dd status=none if=/dev/zero of=/mnt/scratch/filler bs=512 > > /dev/null 2>&1 > } > > test() > { > cleanup > full_fs_setup > > btrfs-dyndbg enable > echo " fallocate -p -o 0 -l $punch_hole_size /mnt/scratch/filler > " > fallocate -p -o 0 -l $punch_hole_size /mnt/scratch/filler > } > > fs=btrfs; mkfs_options=""; mount_options=""; > [[ $1 ]] || { echo "usage: $0 "; exit; } > punch_hole_size=$1 && test > > > ./punch-hole-on-full-fs 512 > -- fallocate -p -o 0 -l 512 /mnt/scratch/filler -- > fallocate: /mnt/scratch/filler: fallocate failed: No space left on device > > ./punch-hole-on-full-fs 8000 > -- fallocate -p -o 0 -l 8000 /mnt/scratch/filler -- > fallocate: /mnt/scratch/filler: fallocate failed: No space left on device > > > Thanks, Anand signature.asc Description: OpenPGP digital signature
Re: [RFC PATCH v2 0/4] btrfs-progs: build distinct binaries for specific btrfs subcommands
Axel Burri posted on Thu, 20 Sep 2018 00:02:22 +0200 as excerpted: > Now not everybody wants to install these with fscaps or setuid, but it > might also make sense to provide "/usr/bin/btrfs-subvolume-{show,list}", > as they now work for a regular user. Having both root/user binaries > concurrently is not an issue (e.g. in gentoo the full-featured btrfs > command is in "/sbin/"). That's going to be a problem for distros (or users like me with advanced layouts, on gentoo too FWIW) that have the bin/sbin merge, where one is a symlink to the other. FWIW I have both the /usr merge (tho reversed for me, so /usr -> . instead of having to have /bin and /sbin symlinks to /usr/bin) and the bin/sbin merge, along with, since I'm on amd64-nomultilib, the lib/lib64 merge. So: $$ dir -gGd /bin /sbin /usr /lib /lib64 drwxr-xr-x 1 35688 Sep 18 22:56 /bin lrwxrwxrwx 1 5 Aug 7 00:29 /lib -> lib64 drwxr-xr-x 1 78560 Sep 18 22:56 /lib64 lrwxrwxrwx 1 3 Mar 11 2018 /sbin -> bin lrwxrwxrwx 1 1 Mar 11 2018 /usr -> . Of course that last one (/usr -> .) leads to /share and /include hanging directly off of / as well, but it works. But in that scheme /bin, /sbin, /usr/bin and /usr/sbin, are all the same dir, so only one executable of a particularly name can exist therein. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)
Tomasz Chmielewski posted on Wed, 19 Sep 2018 10:43:18 +0200 as excerpted:

> I have a mysql slave which writes to a RAID-1 btrfs filesystem (with
> 4.17.14 kernel) on 3 x ~1.9 TB SSD disks; filesystem is around 40% full.
>
> The slave receives around 0.5-1 MB/s of data from the master over the
> network, which is then saved to MySQL's relay log and executed.  In
> ideal conditions (i.e. no filesystem overhead) we should expect some
> 1-3 MB/s of data written to disk.
>
> MySQL directory and files in it are chattr +C (since the directory was
> created, so all files are really +C); there are no snapshots.
>
> Now, an interesting thing.
>
> When the filesystem is mounted with these options in fstab:
>
> defaults,noatime,discard
>
> We can see a *constant* write of 25-100 MB/s to each disk.  The system
> is generally unresponsive and it sometimes takes long seconds for a
> simple command executed in bash to return.
>
> However, as soon as we remount the filesystem with space_cache=v2 -
> writes drop to just around 3-10 MB/s to each disk.  If we remount to
> space_cache - lots of writes, system unresponsive.  Again remount to
> space_cache=v2 - low writes, system responsive.
>
> That's a huuge, 10x overhead!  Is it expected?  Especially that
> space_cache=v1 is still the default mount option?

The other replies are good but I've not seen this pointed out yet...

Perhaps you are accounting for this already, but you don't /say/ you are,
while you do mention repeatedly toggling the space-cache options, which
would trigger it, so you /need/ to account for it...

I'm not sure about space_cache=v2 (it's probably more efficient even if
it does have to do it), but I'm quite sure that space_cache=v1 takes some
time after the initial mount with it to scan the filesystem and actually
create the map of available free space that is the space_cache.
Now you said ssds, which should be reasonably fast, but you also say
3-device btrfs raid1, with each device ~2TB, and the filesystem ~40% full,
which should be ~2 TB of data, which is likely somewhat fragmented, so
it's likely rather more than 2 TB of data chunks to scan for free space,
and that's going to take /some/ time even on SSDs!

So if you're toggling settings like that in your tests, be sure to let the
filesystem rebuild the cache you just toggled, and give it time to
complete that and quiesce, before you start trying to measure write
amplification.  Otherwise it's not write amplification you're measuring,
but the churn from the filesystem still trying to rebuild its cache after
you toggled it!

Also, while 4.17 is well after the ssd mount option fixes that went in in
4.14 (the option is usually auto-detected; check /proc/mounts, mount
output, or dmesg to see if it is being added), if the filesystem has been
in use for several kernel cycles, and in particular before 4.14, with the
ssd mount option active, and you've not rebalanced since then, you may
well still have serious space fragmentation from that, which could
increase the amount of data in the space_cache map rather drastically,
thus increasing the time it takes to update the space_cache, particularly
v1, after toggling it on.

A balance can help correct that, but it might well be easier, and should
result in a better layout, to simply blow the filesystem away with a
mkfs.btrfs and start over.

Meanwhile, as Remi already mentioned, you might want to reconsider nocow
on btrfs raid1, since nocow defeats checksumming and thus scrub, which
verifies checksums, simply skips it, and if the two copies get out of sync
for some reason...

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
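Checking for the ssd option and switching the cache can be scripted. A
small sketch; `has_opt` is a hypothetical helper of mine and the
mountpoint is an example, not from the thread:

```shell
# has_opt MOUNTPOINT OPTION [FILE]: true if OPTION appears in the mount
# option list for MOUNTPOINT; FILE defaults to /proc/mounts (the third
# argument exists mainly so the helper can be tested on canned data).
has_opt() {
    awk -v mp="$1" -v opt="$2" '
        $2 == mp {
            n = split($4, a, ",")
            for (i = 1; i <= n; i++) if (a[i] == opt) found = 1
        }
        END { exit !found }' "${3:-/proc/mounts}"
}

# Example: is the ssd heuristic active on the MySQL volume?
if has_opt /var/lib/mysql ssd; then
    echo "ssd mount option detected"
fi

# One-time switch to the v2 free-space cache; the state persists in the
# filesystem, so later mounts pick it up automatically:
#   mount -o remount,space_cache=v2 /var/lib/mysql
```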
Re: btrfs filesystem show takes a long time
Hi,

> You may try to run the show command under strace to see where it blocks.

any recommendations for strace options?

On Friday, September 14, 2018 1:25:30 PM CEST David Sterba wrote:
> Hi,
>
> thanks for the report, I've forwarded it to the issue tracker
> https://github.com/kdave/btrfs-progs/issues/148
>
> The show command uses the information provided by blkid, which
> presumably caches it. The default behaviour of 'fi show' is to skip
> mount checks, so the delays are likely caused by blkid, but that's not
> the only possible reason.
>
> You may try to run the show command under strace to see where it blocks.
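Not from the thread, but one reasonable set of strace options for
"where does it block" questions, wrapped in a throwaway helper (the
function name and paths are mine):

```shell
# trace_cmd OUTFILE CMD [ARGS...]
#   -f  follow child processes (blkid helpers may run as children)
#   -tt wall-clock timestamps with microseconds
#   -T  time spent inside each syscall, printed as <seconds>
#   -o  keep the trace out of the command's own output
trace_cmd() {
    out=$1; shift
    strace -f -tt -T -o "$out" -- "$@"
}

# Usage (example):
#   trace_cmd /tmp/fi-show.trace btrfs filesystem show
#   # rough way to surface the slowest syscalls first:
#   sort -t'<' -k2 -rn /tmp/fi-show.trace | head
```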
Re: cannot mount btrfs as root
On 19 September 2018 at 15:54, Qu Wenruo wrote:
>
> Does the initramfs do a "btrfs dev scan" to populate the btrfs device
> list?

This solved the problem. Looks like btrfs scans disks too early, before
bcache initializes. I need to debug the initramfs scripts.

Thanks a lot for the hint!

-- 
Zbigniew 'zibi' Jarosik
cell: +48 667956686
jid: tap...@gmail.com
gg: 2830
http://zibi.nora.pl/
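For the archives: the usual shape of such a fix is an initramfs hook that
waits for the bcache devices to appear and then re-runs the scan before
root is mounted. A rough sketch only; the wait loop, timeout, and device
name are illustrative, not the actual distro script:

```shell
# wait_for_dev DEV [TRIES]: poll up to TRIES seconds for DEV to appear.
wait_for_dev() {
    dev=$1; tries=${2:-10}; i=0
    while [ "$i" -lt "$tries" ]; do
        [ -e "$dev" ] && return 0
        sleep 1
        i=$((i + 1))
    done
    return 1
}

if wait_for_dev /dev/bcache0 5; then
    # Re-run the scan so the kernel's btrfs device list includes the
    # members sitting on top of bcache.
    btrfs device scan --all-devices
fi
```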