Re: [PATCH 0/7] Let user specify the kernel version for features
Hope we are in sync on:

1. The term "auto" that you are using here refers to 'Progs default-features being updated at the _run time_'.

2. In the long run, mostly it would be: progs-version > LTS-kernel-version (for the reason that users would need fsck, tools, etc.)

> With the new -O comp= option, the concern on users who want to make a
> btrfs for a newer kernel is hugely reduced.

NO! Actually the new option -O comp= addresses no concern for users who want to create _a btrfs disk layout which is compatible with more than one kernel_. Above there are two examples of it.

> Why can't you give a higher kernel version than the current kernel?

Mount fails. Pls try!

> But that's what the user wants to do. He/she knows what he is doing.
> Maybe he is testing the btrfs-progs self tests without the need to
> mount it (at least some of the tests don't require mount).

Right. It will continue to fail even with this patch set.

> Now we need to auto align features with the kernel; who knows, one day
> we will need to auto align our libs to upstream packages?

Align libs to upstream packages? Is there any example you could provide?

> Keeping a matrix with different packages like libuuid/acl/attr with
> different Makefiles? At least this is not a good idea for me, and
> that's the work of autoconf IIRC. And if I'm a packager and face such
> a problem, I'll choose the simplest solution, just add a line in the
> PKGBUILD (package system of Archlinux) of btrfs:
>
>   depends=('linux>=3.14')
>
> (Yeah, such a simple and slick packaging solution is the reason I like
> Arch over other rolling distributions.) Not everything really needs to
> be done at the code level.

As we are handling default features at run time, how is this relevant in this context?

Thanks, Anand
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
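For readers following the thread: independent of the `-O comp=` option proposed in this patch set (which was not in mainline btrfs-progs at the time), the stock way to inspect and pin feature flags at mkfs time is the existing `-O` switch. A sketch only; `/dev/sdX` is a placeholder device:

```shell
# List every feature flag this btrfs-progs build knows about, along
# with the minimum kernel each needs (output varies by progs version):
mkfs.btrfs -O list-all

# Pin the on-disk format for an older kernel by explicitly disabling
# a default feature, e.g. skinny-metadata:
mkfs.btrfs -O ^skinny-metadata /dev/sdX
```

This is the manual equivalent of what the thread debates automating: the user, not the tool, decides which kernel versions the resulting filesystem must be mountable on.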
Re: [PATCH 0/7] Let user specify the kernel version for features
Anand Jain wrote on 2015/11/27 06:17 +0800:
> Hope we are in sync on:
> 1. The term "auto" that you are using here refers to 'Progs default-features being updated at the _run time_'.

Yes.

> 2. In the long run, mostly it would be: progs-version > LTS-kernel-version (for the reason that users would need fsck, tools, etc.)

Also true. But mkfs default features won't change during one or two LTS kernels.

>> With the new -O comp= option, the concern on users who want to make a
>> btrfs for a newer kernel is hugely reduced.
>
> NO! Actually the new option -O comp= addresses no concern for users who
> want to create _a btrfs disk layout which is compatible with more than
> one kernel_. Above there are two examples of it.
>
>> Why can't you give a higher kernel version than the current kernel?
>
> Mount fails. Pls try!
>
>> But that's what the user wants to do. He/she knows what he is doing.
>> Maybe he is testing the btrfs-progs self tests without the need to
>> mount it (at least some of the tests don't require mount).
>
> Right. It will continue to fail even with this patch set.
>
>> Now we need to auto align features with the kernel; who knows, one day
>> we will need to auto align our libs to upstream packages?
>
> Align libs to upstream packages? Is there any example you could provide?

IIRC, in ancient times libblkid was still included in e2fsprogs and its API was different from nowadays. Will we need to support that one with different blkid calls?

>> Keeping a matrix with different packages like libuuid/acl/attr with
>> different Makefiles? At least this is not a good idea for me, and
>> that's the work of autoconf IIRC. And if I'm a packager and face such
>> a problem, I'll choose the simplest solution, just add a line in the
>> PKGBUILD (package system of Archlinux) of btrfs:
>>
>>   depends=('linux>=3.14')
>>
>> (Yeah, such a simple and slick packaging solution is the reason I like
>> Arch over other rolling distributions.) Not everything really needs to
>> be done at the code level.
>
> As we are handling default features at run time, how is this relevant
> in this context?

I meant, it can be done at the packaging level and it's much easier to do. One dependency line vs nearly 200 lines of code. And it's much more predictable than version-based detection.

Thanks, Qu
Re: subvols and parents - how?
Christoph Anton Mitterer posted on Tue, 24 Nov 2015 22:25:50 +0100 as excerpted:

>> Suppose you only want to rollback /, because some update screwed you
>> up, but not /home, which is fine. If /home is a nested subvolume,
>> then you're now mounting the nested home subvolume from some other
>> nesting tree entirely,
>
> That's a bit unclear to me,... I thought when I make a snapshot, any
> nested subvols wouldn't be snapshotted and thus be empty dirs.
> So I'd have rather that if I would simply have no /home (if I didn't
> move it to the rolled-back subvol manually)

What I was intending to convey but apparently failed to be quite clear enough, suppose:

5
+-+ subvols (dir)
|   |
|   +-+ root (subvol)
|       |
|       + home (nested subvol)
|
+-+ snaps-2015.0901 (dir)
    |
    +-+ root-2015.0901 (subvol)

As long as you're on the working /, then /home is a nested subvol, and you don't have to mount it to access it, tho you can if you want.

But now, you roll back to snaps-2015.0901/root-2015.0901. It won't have /home nested underneath, as you correctly pointed out, but in order to access it, you now MUST mount /home, which...

#1 could be a pain to set up if you weren't actually mounting it previously, just relying on the nested tree, AND...

#2 the point I was trying to make: now, to mount it, you'll mount not a native nested subvol, and not a directly available sibling 5/subvols/home, but you'll actually be reaching into an entirely different nesting structure to grab something down inside, mounting the 5/subvols/root/home subvolume nested down inside the direct 5/subvols/root sibling subvol.

With just one level of nesting and one additional mount, it's not too hard to keep track of, but if you're dealing with four or five levels of subvol nesting and some of them you're mounting the working head copy while others you're rolling back, it could get difficult to keep straight in your head what's going on.
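To make that rollback concrete, a sketch (device name and mount points are placeholders; `subvol=` paths are relative to the top-level subvolume, id 5):

```shell
# Working setup: / is the subvol at subvols/root, with home nested in it,
# so home needs no mount of its own.
mount -o subvol=subvols/root /dev/sdX /

# After rolling back, / comes from the snaps tree instead...
mount -o subvol=snaps-2015.0901/root-2015.0901 /dev/sdX /

# ...but home was never snapshotted, so to reach it you must now
# explicitly mount a subvol that lives inside a *different* nesting tree:
mount -o subvol=subvols/root/home /dev/sdX /home
```

The last line is exactly the "reaching into an entirely different nesting structure" case described above.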
Consider a layout like this:

5
+-+ subvols (dir)
|   |
|   +-+ root (subvol)
|       |
|       +-+ home (subvol)
|       |   |
|       |   +-+ henry (dir, no subvol for henry)
|       |   |
|       |   +-+ fred (subvol)
|       |   |   |
|       |   |   +-+ vms (subvol)
|       |   |
|       |   +-+ betty (subvol)
|       |
|       +-+ svr (subvol)
|           |
|           +-+ vms (subvol)
|
+-+ snaps-2015.0901 (dir)
    |
    +-+ root-2015.0901 (subvol here and below)
    +-+ home-2015.0901
    +-+ fred-2015.0901
    +-+ fred-vms-2015.0901
    +-+ betty-2015.0901
    +-+ svr-2015.0901
    +-+ svr-vms-2015.0901

Now, you were hacked and they encrypted a bunch of stuff, but you were lucky and caught them before they got everything. You need to roll back root but not home; fred is fine, but his vms subvol needs to be rolled back; betty needs to be rolled back; svr needs to be rolled back, but svr's vms are fine.

Try to sort THAT out along with the nesting, and keep it all straight while under the severe pressure of trying to recover from a hack in time for those svr things to go live for Black Friday in a few hours, where in a single day you expect to make as much as you normally do in a month the rest of the year! The pressure is on! Oh, and you weren't actually doing the mounts, as you were depending on the nested tree, so you have to actually set up the mounts as well, not just switch them to point to the appropriate location.

OK, so that's a bit contrived, but encrypt-for-ransom server hacks are in the news, Black Friday starts in a few hours here, and I think the point should be obvious! =:^)

(Some years ago, before btrfs, I had something similar set up, but with partitions. Disaster struck and I ended up with / from one backup, /usr from another, and /var, with the package database of what was installed on the other two, from current, or something like that.

Needless to say I learned quite some lessons from that, one of which was that everything the package manager installs should be on the same partition as the installed-package database, so if it has to be restored from backup, at least if it's all old, it's all equally old, and the package database actually matches what's on the system because it's in the same backup! That partition and btrfs, along with each of its various backups, are now 8 GiB each, so it's not like I'll run out of room with several levels of backup.

I went mdraid after that, but after an initial experiment with lvm on top of the raid, I decided that was too complex to deal with in the pressure of a disaster and redid it as multiple raids on parallel partitioned hardware. In a disaster the raid would be bad enough to deal with but tolerable, but I did NOT need the complexity of lvm on top of raid, and after dealing with the parts-of-three-different-installs mess, I had the hard-earned wisdom to realize it!

The same idea applies here. Once you start reaching into nested subvols to get the deeper nested subvols you're trying to mount, it's too much and you're just begging
implications of mixed mode
Dear list,

if a larger RAID file system (say disk space of 8 TB in total) is created in mixed mode, what are the implications?

From reading the mailing list and the Wiki, I can think of the following:

+ less hassle with "false positive" ENOSPC
- data and metadata have to have the same replication level forever (e.g. RAID 1)
- higher fragmentation (does this reduce with no(dir)atime?)
  -> more work for autodefrag

Is that roughly what is to be expected? Any implications on recovery etc.?

In the specific case, the file system usage is as follows:
* data spread over ~20 subvolumes
* snapshotted with various frequencies
* compression is used
* mostly archive storage
* write once
* read infrequently
* ~500GB of daily rsync'ed system backup

Thanks in advance, Lukas
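For reference, a sketch of how such a filesystem would be created (device names are placeholders). Mixed block groups are selected at mkfs time with `--mixed`; because data and metadata then share chunks, the `-d` and `-m` profiles must match, which is exactly the "same replication level forever" point in the list above:

```shell
# Two-device RAID1 filesystem with mixed data+metadata block groups.
# With --mixed, the data (-d) and metadata (-m) profiles must agree:
mkfs.btrfs --mixed -d raid1 -m raid1 /dev/sdX /dev/sdY

# Mixed mode also ties the nodesize to the sectorsize (4K on x86),
# so an explicit larger nodesize would be rejected:
#   mkfs.btrfs --mixed -n 16k /dev/sdX   # error: mixed requires n == sectorsize
```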
Re: [PATCH v2] btrfs-progs: fsck: Fix a false alert where extent record has wrong metadata flag
On Fri, 2015-11-27 at 08:40 +0800, Qu Wenruo wrote:
> But since there is no real error

sure...

> feel free to keep using it or just re
> format it with skinny-metadata.

That's just onging =)

Thanks for all your efforts in that issue =)

Chris.
Re: implications of mixed mode
Lukas Pirl wrote on 2015/11/27 12:54 +1300:
> if a larger RAID file system (say disk space of 8 TB in total) is
> created in mixed mode, what are the implications?
>
> From reading the mailing list and the Wiki, I can think of the
> following:
>
> + less hassle with "false positive" ENOSPC

If your "false positive" means unbalanced DATA/METADATA chunk allocation, then yes.

> - data and metadata have to have the same replication level forever
>   (e.g. RAID 1)
> - higher fragmentation (does this reduce with no(dir)atime?)
>   -> more work for autodefrag

They are also true.

And there are some extra pros and cons due to the fixed (4K), small (compared to the 16K default) nodesize:

+ A little higher performance.
  The node/leaf size is restricted to the sectorsize, and a smaller node/leaf means a smaller range to lock. In our SSD tests, for operations with high concurrency, performance is overall 10% better than with 16K nodesize. And in extreme metadata-operation cases, like highly concurrent sequential writes into small files, it can be 8 times the performance of the default 16K nodesize.

- Smaller maximum subvolume size.
  Since the tree blocks are smaller but the number of tree levels stays the same (levels 0-7), the upper limit on a subvolume's size is hugely reduced by the smaller node/leaf size. Although it's quite hard to hit that upper limit anyway.

- (Possibly) less developer interest.
  Other developers are trying to remove mixed-bg as a default, so I'd consider the trend to be fewer developers focused on mixed-bg, and hidden bugs become harder and harder to hit and fix.

> Is that roughly what is to be expected? Any implications on recovery
> etc.?

As long as your chunk tree and extent tree are OK, it shouldn't be much different from a normal fs, at least for now.

Thanks, Qu

> In the specific case, the file system usage is as follows:
> * data spread over ~20 subvolumes
> * snapshotted with various frequencies
> * compression is used
> * mostly archive storage
> * write once
> * read infrequently
> * ~500GB of daily rsync'ed system backup
>
> Thanks in advance, Lukas
Re: implications of mixed mode
Lukas Pirl posted on Fri, 27 Nov 2015 12:54:57 +1300 as excerpted:

> Dear list,
>
> if a larger RAID file system (say disk space of 8 TB in total) is
> created in mixed mode, what are the implications?
>
> From reading the mailing list and the Wiki, I can think of the
> following:
>
> + less hassle with "false positive" ENOSPC
> - data and metadata have to have the same replication level
>   forever (e.g. RAID 1)
> - higher fragmentation
>   (does this reduce with no(dir)atime?)
>   -> more work for autodefrag
>
> Is that roughly what is to be expected? Any implications on recovery
> etc.?

To the best of my knowledge that looks reasonably accurate.

My big hesitancy would be over the fact that very few will run or test mixed-mode at TB-scale filesystem level, and where they do, it's likely to be in order to work around the current (but set to soon be eliminated) metadata-only (no data) dup-mode limit on single devices, since in that regard mixed-mode is treated as metadata and dup mode is allowed.

So you're relatively more likely to run into rarely seen scaling issues and perhaps bugs that nobody else has ever run into, as (relatively) nobody else runs mixed-mode on multi-terabyte-scale btrfs. If you want to be the guinea pig and make it easier for others to try later on, after you've flushed out the worst bugs, that's definitely one way to do it. =:^]

> In the specific case, the file system usage is as follows:
> * data spread over ~20 subvolumes
> * snapshotted with various frequencies
> * compression is used
> * mostly archive storage
> * write once
> * read infrequently
> * ~500GB of daily rsync'ed system backup

It's worth noting that rsync... seems to stress btrfs more than pretty much any other common single application. Its extremely heavy access pattern just seems to trigger bugs that nothing else does, and while they do tend to get fixed, it really does seem to push btrfs to its limits, and there have been a /lot/ of rsync-triggered btrfs bugs reported over the years.

Between the stresses of rsyncing half a TiB daily and the relatively untested quantity that is mixed-mode btrfs at multi-terabyte scales on multiple devices, there's a reasonably high chance that you /will/ be working with the devs on various bugs for awhile. If you're willing to do it, great: somebody putting the filesystem thru those kinds of mixed-mode paces at that scale is just the sort of thing we need to get coverage on that particular not yet well tested corner case. But don't expect it to be particularly stable for a couple kernel cycles anyway, and after that, you'll still be running a particularly rare corner-case that's likely to put new code thru its paces as well, so just be aware of the relatively stony path you're signing up to navigate, should you choose to go that route.

Meanwhile, assuming you're /not/ deliberately setting out to test a rarely tested corner-case with stress tests known to rather too frequently get the best of btrfs... Why are you considering mixed-mode here? At that size, the ENOSPC hassles of unmixed-mode btrfs on, say, single-digit-GiB filesystems and below really should be dwarfed into insignificance, particularly since btrfs since 3.17 or so deletes empty chunks instead of letting them build up to the point where they're a problem. So what possible reason, other than simply to test it and cover that corner-case, could justify mixed-mode at that sort of scale?

Unless of course, given that you didn't mention number of devices or individual device size, only the 8 TB total, you have in mind a raid of something like 1000 8-GB USB sticks, or the like, in which case mixed-mode on the individual sticks might make some sense (well, to the extent that a 1000-device raid of /anything/ makes sense! =:^), given their 8-GB-each size.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: [PATCH v2] btrfs-progs: fsck: Fix a false alert where extent record has wrong metadata flag
Christoph Anton Mitterer wrote on 2015/11/26 16:20 +0100:
> Hey.
>
> I can confirm that the new patch fixes the issue on both test
> filesystems. Thanks for working that out.
>
> I guess there's no longer a need to keep that old filesystems now?!

Of course no need to keep. But since there is no real error, feel free to keep using it or just reformat it with skinny-metadata.

Thanks, Qu

> Cheers, Chris.
>
> On Thu, 2015-11-26 at 15:27 +0100, David Sterba wrote:
>> On Wed, Nov 25, 2015 at 02:19:06PM +0800, Qu Wenruo wrote:
>>> In process_extent_item(), it gives 'metadata' initial value 0, but
>>> for non-skinny-metadata case, metadata extent can't be judged just
>>> from key type and it forgot that case.
>>>
>>> This causes a lot of false alert in non-skinny-metadata filesystem.
>>>
>>> Fix it by set correct metadata value before calling add_extent_rec().
>>>
>>> Reported-by: Christoph Anton Mitterer
>>> Signed-off-by: Qu Wenruo
>>
>> Patch replaced, thanks. The test image is pushed as well.
Re: subvols and parents - how?
Christoph Anton Mitterer posted on Tue, 24 Nov 2015 22:25:50 +0100 as excerpted:

>> Then there's the security angle to consider. With the (basically,
>> possibly modified as I suggested) flat layout, mounting something
>> doesn't automatically give people in-tree access to nested subvolumes
>> (subject to normal file permissions, of course), like nested layout
>> does. And with (possibly modified) flat layout, the whole subvolume
>> tree doesn't need to be mounted all the time either, only when you're
>> actually working with subvolumes.
>
> Uhm, I don't get the big security advantage here... whether nested or
> manually mounted to a subdir,... if the permissions are insecure I'll
> have a problem... if they're secure, than not.

Consider a setuid-root binary with a recently publicized vuln that has been patched on your system. If you have root snapshots from before the patch and those snapshots are nested below root, then they're always accessible. If the path to the vulnerable setuid binary is as user-accessible as it likely was in its original location, then anyone with login access to the system is likely to be able to run it from the snapshot... and will be able to get root due to the vuln.

On a flat layout, a snapshot with the vuln would have to be mounted before it could be accessed, as otherwise it'd be outside the mounted tree.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: btrfs: poor performance on deleting many large files
Christoph Anton Mitterer posted on Fri, 27 Nov 2015 01:06:45 +0100 as excerpted:

> And additionally, allow people to mount subvols with different
> noatime/relatime/atime settings (unless that's already working)... that
> way, they could enable it for things where they want/need it,... and
> disable it where not.

AFAIK, per-subvolume *atime mounts should already be working. The *atime mount options are filesystem-generic (aka Linux vfs level), and while my own use-case doesn't involve subvolumes, the wiki says they should be working:

https://btrfs.wiki.kernel.org/index.php/FAQ#Can_I_mount_subvolumes_with_different_mount_options.3F

So while personally untested, per-subvolume *atime mount options /should/ "just work".

Meanwhile, I've simply grown to hate atime as an inefficient and mostly useless drain on resources, so I pretty much just noatime everything, the reason I decided to bother patching my kernel to make that the default, instead of having yet another option I use everywhere anyway clogging up the options field in my fstab.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
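A sketch of what per-subvolume *atime mounts look like in practice (the UUID and mount points are placeholders); since the *atime flags are per-mount at the vfs level, each `subvol=` mount can carry its own:

```shell
# Same filesystem, two subvolumes, different atime behaviour per mount:
mount -o subvol=root,noatime  UUID=1234-abcd /mnt/root
mount -o subvol=home,relatime UUID=1234-abcd /mnt/home

# The /etc/fstab equivalent:
# UUID=1234-abcd  /      btrfs  subvol=root,noatime   0 0
# UUID=1234-abcd  /home  btrfs  subvol=home,relatime  0 0
```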
Re: [auto-]defrag, nodatacow - general suggestions? (was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Thu, 26 Nov 2015 01:23:59 +0100 as excerpted:

> Hey.
>
> I've worried before about the topics Mitch has raised.
> Some questions.
>
> 1) AFAIU, the fragmentation problem exists especially for those files
> that see many random writes, especially, but not limited to, big files.
> Now that databases and VMs are affected by this, is probably broadly
> known in the meantime (well at least by people on that list).
> But I'd guess there are n other cases where such IO patterns can happen
> which one simply never notices, while the btrfs continues to degrade.

The two other known cases are:

1) Bittorrent download files, where the full file size is preallocated (and I think fsynced), then the torrent client downloads into it a chunk at a time.

The more general case would be any time a file of some size is preallocated and then written into more or less randomly, the problem being the preallocation, which on traditional rewrite-in-place filesystems helps avoid fragmentation (as well as ensuring space to save the full file), but on COW-based filesystems like btrfs, triggers exactly the fragmentation it was trying to avoid.

At least some torrent clients (ktorrent at least) have an option to turn off that preallocation, however, and that would be recommended where possible. Where disabling the preallocation isn't possible, arranging to have the client write into a dir with the nocow attribute set, so newly created torrent files inherit it and do rewrite-in-place, is highly recommended.

It's also worth noting that once the download is complete, the files aren't going to be rewritten any further, and thus can be moved out of the nocow-set download dir and treated normally. For those who will continue to seed the files for some time, this could be done, provided the client can seed from a directory different than the download dir.
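Setting up such a nocow download dir is a one-time operation. A sketch (paths are examples; note the `C` attribute only reliably takes effect on files created after it is set on the directory, not on existing non-empty files):

```shell
# New files created in this directory inherit NOCOW (the 'C' attribute):
mkdir -p ~/torrents/incoming
chattr +C ~/torrents/incoming
lsattr -d ~/torrents/incoming    # the flags column should include 'C'

# Once a download completes, copy it out to get a normal COW,
# checksummed file again (a reflink would keep the old extents):
cp --reflink=never ~/torrents/incoming/file.iso ~/archive/
```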
2) As a subcase of the database file case that people may not think about, systemd journal files are known to have had the internal-rewrite-pattern problem in the past. Apparently, while they're mostly append-only in general, they do have an index at the beginning of the file that gets rewritten quite a bit.

The problem is much reduced in newer systemd, which is btrfs-aware and in fact uses btrfs-specific features such as subvolumes in a number of cases (creating subvolumes rather than directories where it makes sense in some shipped tmpfiles.d config files, for instance), if it's running on btrfs. For the journal, I /think/ (see the next paragraph) that it now sets the journal files nocow, and puts them in a dedicated subvolume so snapshots of the parent won't snapshot the journals, thereby helping to avoid the snapshot-triggered cow issue.

On my own systems, however, I've configured journald to only use the volatile tmpfs journals in /run, not the permanent /var location, tweaking the size of the tmpfs mounted on /run and the journald config so it normally stores a full boot session, but of course doesn't store journals from previous sessions as they're wiped along with the tmpfs at reboot. I run syslog-ng as well, configured to work with journald, and thus have its more traditional append-only plain-text syslogs for previous boot sessions.

For my usage that actually seems the best of both worlds, as I get journald benefits such as service status reports showing the last 10 log entries for that service, etc, with those benefits mostly applying to the current session only, while I still have the traditional plain-text greppable, etc, syslogs, from both the current and previous sessions, back as far as my log rotation policy keeps them. It also keeps the journals entirely off of btrfs, so that's one particular problem I don't have to worry about at all, the reason I'm a bit fuzzy on the exact details of systemd's solution to the journal-on-btrfs issue.

> So is there any general approach towards this?

The general case is that for normal desktop users, it doesn't tend to be a problem, as they don't do either large VMs or large databases, and small ones such as the sqlite files generated by firefox and various email clients are handled quite well by autodefrag, with that general desktop usage being its primary target.

For server usage and the more technically inclined workstation users who are running VMs and larger databases, the general feeling seems to be that those adminning such systems are, or should be, technically inclined enough to do their research and know when measures such as nocow and limited snapshotting, along with manual defrags where necessary, are called for. And if they don't originally, they find out when they start researching why performance isn't what they expected and what to do about it. =:^)

> And what are the actual possible consequences? Is it just that fs gets
> slower (due to the fragmentation) or may I even run into other issues to
> the point the space is
Re: btrfs: poor performance on deleting many large files
Christoph Anton Mitterer posted on Thu, 26 Nov 2015 19:25:47 +0100 as excerpted:

> On Thu, 2015-11-26 at 16:52 +, Duncan wrote:
>> For people doing snapshotting in particular, atime updates can be a big
>> part of the differences between snapshots, so it's particularly
>> important to set noatime if you're snapshotting.
>
> What everything happens when that is left at relatime?
>
> I'd guess that obviously everytime the atime is updated there will be
> some CoW, but only on meta-data blocks, right?

Yes.

> Does this then lead to fragmentation problems in the meta-data block
> groups?

I don't believe so. I think individual metadata elements tend to be small enough that several fit in a metadata node (16 KiB by default these days, IIRC), so there's no "metadata fragmentation" to speak of.

> And how serious are the effects on space that is eaten up... say I have
> n snapshots and access all of their files... then I'd probably get n
> times the metadata, right? Which would sound quite dramatic...
>
> Or is just parts of the metadate copied with new atimes?

I think it's whole 4 KiB blocks and possibly whole metadata nodes (16 KiB), copy-on-write, and these would be relatively small changes triggering cow of the entire block/node, aka write amplification. While not too large in themselves, it's the number of them that becomes a problem.

IIRC relatime updates at most once a day on access. If you're doing daily snapshots, that means updating metadata blocks for all files accessed in the last 24 hours...

Again, individual snapshots aren't so much of a problem, and if you're thinning to the 250 snapshots per subvolume or less as I recommend, the problem will remain controlled. But at 250, starting at daily snapshots so they all have atime changes for at least all files accessed during that 24 hours, that's still a sizable set of unnecessarily modified and thus space-taking snapshotted metadata.

But I wouldn't worry about it too much if you're doing say monthly snapshots and only keeping a year's worth or less, 12-13 snapshots per subvolume total.

In my case, I'm on SSD with their limited write cycles, so while the snapshot thing doesn't affect me since my use-case doesn't involve snapshots, the SSD write-cycle count thing certainly does, and noatime is worth it to me for that alone.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: btrfs: poor performance on deleting many large files
On Thu, 2015-11-26 at 23:29 +, Duncan wrote:
>> but only on meta-data blocks, right?
> Yes.

Okay... so it'll at most get the whole meta-data for a snapshot separately and not shared anymore... And when these are chained as in ZFS,.. it probably amplifies... i.e. a change deep down in the tree changes all the upper elements as well? Which shouldn't be a too big problem unless I have a lot of snapshots or extremely many files.

> I think it's whole 4 KiB blocks and possibly whole metadata nodes (16
> KiB), copy-on-write, and these would be relatively small changes
> triggering cow of the entire block/node, aka write amplification. While
> not too large in themselves, it's the number of them that becomes a
> problem.

Ah... there you say it already =)
But still it's always only meta-data that is copied, never the data, right?!

> IIRC relatime updates once a day on access. If you're doing daily
> snapshots, updating metadata blocks for all files accessed in the last
> 24 hours...

Yes...

Wouldn't it be a way to handle that problem if btrfs allowed creating snapshots for which the atime never gets updated, regardless of any mount option?

And additionally, allow people to mount subvols with different noatime/relatime/atime settings (unless that's already working)... that way, they could enable it for things where they want/need it,... and disable it where not.

> In my case, I'm on SSD with their limited write cycles, so while the
> snapshot thing doesn't affect me since my use-case doesn't involve
> snapshots, the SSD write cycle count thing certainly does, and noatime
> is worth it to me for that alone.

I'm always a bit unsure about that... I used to do it as well, for the wear... but is that really necessary? With relatime, atime updates happen at most once a day... so at worst you rewrite... what... some 100 MB (at least in the ext234 case)... and SSDs seem to bear many more write cycles than advertised.

Cheers, Chris.
kernel call trace during send/receive
Hey. Just got the following during send/receiving a big snapshot from one btrfs to another fresh one. Both under kernel 4.2.6, tools 4.3. The send/receive seems to continue, however... Any ideas what that means? Cheers, Chris.

Nov 27 01:52:36 heisenberg kernel: [ cut here ]
Nov 27 01:52:36 heisenberg kernel: WARNING: CPU: 7 PID: 18086 at /build/linux-CrHvZ_/linux-4.2.6/fs/btrfs/send.c:5794 btrfs_ioctl_send+0x661/0x1120 [btrfs]()
Nov 27 01:52:36 heisenberg kernel: Modules linked in: ext4 mbcache jbd2 nls_utf8 nls_cp437 vfat fat uas vhost_net vhost macvtap macvlan xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat xt_tcpudp tun bridge stp llc fuse ccm ebtable_filter ebtables seqiv ecb drbg ansi_cprng algif_skcipher md4 algif_hash af_alg binfmt_misc xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 cpufreq_userspace cpufreq_powersave cpufreq_stats cpufreq_conservative ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_policy ipt_REJECT nf_reject_ipv4 xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport xt_conntrack nf_conntrack iptable_filter ip_tables x_tables joydev rtsx_pci_ms rtsx_pci_sdmmc mmc_core memstick iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal
Nov 27 01:52:36 heisenberg kernel: intel_powerclamp intel_rapl iosf_mbi coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul evdev deflate ctr psmouse serio_raw twofish_generic pcspkr btusb btrtl btbcm btintel bluetooth crc16 uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core v4l2_common videodev media twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common sg arc4 camellia_generic iwldvm mac80211 iwlwifi cfg80211 rtsx_pci rfkill camellia_aesni_avx_x86_64 snd_hda_codec_hdmi tpm_tis tpm 8250_fintek camellia_x86_64 snd_hda_codec_realtek snd_hda_codec_generic processor battery fujitsu_laptop i2c_i801 ac lpc_ich serpent_avx_x86_64 mfd_core snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm shpchp snd_timer e1000e snd soundcore i915 ptp pps_core video button drm_kms_helper drm thermal_sys mei_me
Nov 27 01:52:36 heisenberg kernel: i2c_algo_bit mei serpent_sse2_x86_64 xts serpent_generic blowfish_generic blowfish_x86_64 blowfish_common cast5_avx_x86_64 cast5_generic cast_common des_generic cbc cmac xcbc rmd160 sha512_ssse3 sha512_generic sha256_ssse3 sha256_generic hmac crypto_null af_key xfrm_algo loop parport_pc ppdev lp parport autofs4 dm_crypt dm_mod md_mod btrfs xor raid6_pq uhci_hcd usb_storage sd_mod crc32c_intel aesni_intel aes_x86_64 glue_helper ahci lrw gf128mul ablk_helper libahci cryptd libata ehci_pci xhci_pci ehci_hcd scsi_mod xhci_hcd usbcore usb_common
Nov 27 01:52:36 heisenberg kernel: CPU: 7 PID: 18086 Comm: btrfs Not tainted 4.2.0-1-amd64 #1 Debian 4.2.6-1
Nov 27 01:52:36 heisenberg kernel: Hardware name: FUJITSU LIFEBOOK E782/FJNB23E, BIOS Version 1.11 05/24/2012
Nov 27 01:52:36 heisenberg kernel: a02e6260 8154e2f6
Nov 27 01:52:36 heisenberg kernel: 8106e5b1 880235a3c42c 7ffd3d3796c0 8802f0e5c000
Nov 27 01:52:36 heisenberg kernel: 0004 88010543c500 a02d2d81 88041e5ebb00
Nov 27 01:52:36 heisenberg kernel: Call Trace:
Nov 27 01:52:36 heisenberg kernel: [] ? dump_stack+0x40/0x50
Nov 27 01:52:36 heisenberg kernel: [] ? warn_slowpath_common+0x81/0xb0
Nov 27 01:52:36 heisenberg kernel: [] ? btrfs_ioctl_send+0x661/0x1120 [btrfs]
Nov 27 01:52:36 heisenberg kernel: [] ? __alloc_pages_nodemask+0x194/0x9e0
Nov 27 01:52:36 heisenberg kernel: [] ? btrfs_ioctl+0x26c/0x2a10 [btrfs]
Nov 27 01:52:36 heisenberg kernel: [] ? sched_move_task+0xca/0x1d0
Nov 27 01:52:36 heisenberg kernel: [] ? cpumask_next_and+0x2e/0x50
Nov 27 01:52:36 heisenberg kernel: [] ? select_task_rq_fair+0x23f/0x5c0
Nov 27 01:52:36 heisenberg kernel: [] ? enqueue_task_fair+0x387/0x1120
Nov 27 01:52:36 heisenberg kernel: [] ? native_sched_clock+0x24/0x80
Nov 27 01:52:36 heisenberg kernel: [] ? sched_clock+0x5/0x10
Nov 27 01:52:36 heisenberg kernel: [] ? do_vfs_ioctl+0x2c3/0x4a0
Nov 27 01:52:36 heisenberg kernel: [] ? _do_fork+0x146/0x3a0
Nov 27 01:52:36 heisenberg kernel: [] ? SyS_ioctl+0x76/0x90
Nov 27 01:52:36 heisenberg kernel: [] ? system_call_fast_compare_end+0xc/0x6b
Nov 27 01:52:36 heisenberg kernel: ---[ end trace f5fa91e2672eead0 ]---
Re: btrfs: poor performance on deleting many large files
Mitchell Fossen wrote on 2015/11/25 15:49 -0600:

On Mon, 2015-11-23 at 06:29 +, Duncan wrote:

Using subvolumes was the first recommendation I was going to make, too, so you're on the right track. =:^) Also, in case you are using it (you didn't say, but this has been demonstrated to solve similar issues for others so it's worth mentioning), try turning btrfs quota functionality off. While the devs are working very hard on that feature for btrfs, the fact is that it's simply still buggy and doesn't work reliably anyway, in addition to triggering scaling issues before they'd otherwise occur. So my recommendation has been, and remains: unless you're working directly with the devs to fix quota issues (in which case, thanks!), if you actually NEED quota functionality, use a filesystem where it works reliably, while if you don't, just turn it off and avoid the scaling and other issues that currently still come with it.

I did indeed have quotas turned on for the home directories! Since they were mostly to calculate space used by everyone (since du -hs is so slow) and not actually needed to limit people, I disabled them.

[[About quota]] Personally speaking, I'd like to have some comparison between quota enabled and disabled, to help locate whether it's quota causing the problem. If you can find a good and reliable reproducer, it would be very helpful for developers to improve btrfs. BTW, it's also a good idea to use ps to locate what process is running at the time your btrfs hangs. If it's a kernel thread named btrfs-transaction, then it may be related to quota.

As for defrag, that's quite a topic of its own, with complications related to snapshots and the nocow file attribute. Very briefly, if you haven't been running it regularly or using the autodefrag mount option by default, chances are your available free space is rather fragmented as well, and while defrag may help, it may not reduce fragmentation to the degree you'd like.
(I'd suggest using filefrag to check fragmentation, but it doesn't know how to deal with btrfs compression, and will report heavy fragmentation for compressed files even if they're fine. Since you use compression, that kind of eliminates using filefrag to actually see what your fragmentation is.) Additionally, defrag isn't snapshot aware (they tried it for a few kernels a couple years ago but it simply didn't scale), so if you're using snapshots (as I believe Ubuntu does by default on btrfs, at least taking snapshots for upgrade-in-place), using defrag on files that exist in the snapshots as well can dramatically increase space usage, since defrag will break the reflinks to the snapshotted extents and create new extents for defragged files. Meanwhile, the absolute worst-case fragmentation on btrfs occurs with random-internal-rewrite-pattern files (as opposed to never-changed or append-only files). Common examples are database files and VM images. For /relatively/ small files, up to say 256 MiB, the autodefrag mount option is a reasonably effective solution, but it tends to have scaling issues with files over half a GiB, so you can call this a negative recommendation for trying that option with half-gig-plus internal-random-rewrite-pattern files. There are other mitigation strategies that can be used, but here the subject gets complex so I'll not detail them. Suffice it to say that if the filesystem in question is used with large VM images or database files and you haven't taken specific fragmentation avoidance measures, that's very likely a good part of your problem right there, and you can call this a hint that further research is called for. If your half-gig-plus files are mostly write-once, for example most media files unless you're doing heavy media editing, however, then autodefrag could be a good option in general, as it deals well with such files and with random-internal-rewrite-pattern files under a quarter gig or so.
Be aware, however, that if it's enabled on an already heavily fragmented filesystem (as yours likely is), it's likely to actually make performance worse until it gets things under control. Your best bet in that case, if you have spare devices available to do so, is probably to create a fresh btrfs and consistently use autodefrag as you populate it from the existing heavily fragmented btrfs. That way, it'll never have a chance for the fragmentation to build up in the first place, and autodefrag used as a routine mount option should keep it from getting bad in normal use.

Thanks for explaining that! Most of these files are written once and then read from for the rest of their "lifetime" until the simulations are done and they get archived/deleted. I'll try leaving autodefrag on and defragging directories over the holiday weekend when no one is using the server. There is some database usage, but I turned off COW for its folder and it only gets used sporadically and shouldn't be a huge factor in day-to-day usage. Also, is there a
Re: btrfs check help
> On Nov 25, 2015, at 8:44 PM, Qu Wenruo wrote:
>
> Vincent Olivier wrote on 2015/11/25 11:51 -0500:
>> I should probably point out that there is 64GB of RAM on this machine and it’s a dual Xeon processor (LGA2011-3) system. Also, there is only Btrfs served via Samba, and the kernel panic was caused by Btrfs (as per what I remember from the log on the screen just before I rebooted) and happened in the middle of the night when zero (0) clients were connected.
>>
>> You will find below the full “btrfs check” log for each device in the order it is listed by “btrfs fi show”.
>
> There is really no need to do such a thing, as btrfs is able to manage multiple devices; calling btrfsck on any of them is enough as long as it's not hugely damaged.
>
>> Can I get a strong confirmation that I should run with the “—repair” option on each device? Thanks.
>
> YES.
>
> Inode nbytes fix is *VERY* safe as long as it's the only error.
>
> Although it's not that convincing, since the inode nbytes fix code is written by myself and authors always tend to believe their code is good. But at least, some other users with more complicated problems (with inode nbytes errors) fixed it.
>
> The last decision is still on you anyway.

I will do it on the first device from the “fi show” output and report. Thanks, Vincent
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
How to detect / notify when a raid drive fails?
I'd like to run "mail" when a btrfs raid drive fails, but I don't know how to detect that a drive has failed. I don't see it in any docs. Otherwise I assume I would never know until enough drives fail that the filesystem stops working, and I'd like to know before that. - Ian Kelling
Re: How to detect / notify when a raid drive fails?
Ian Kelling posted on Thu, 26 Nov 2015 21:14:57 -0800 as excerpted:

> I'd like to run "mail" when a btrfs raid drive fails, but I don't know
> how to detect that a drive has failed. I don't see it in any docs.
> Otherwise I assume I would never know until enough drives fail that the
> filesystem stops working, and I'd like to know before that.

Btrfs isn't yet mature enough to have a device failure notifier daemon, like for instance mdadm does. There's a patch set going around that adds global spares, so btrfs can detect the problem and grab a spare, but it's only a rather simplistic initial implementation designed to provide the framework for more fancy stuff later, and that's about it in terms of anything close, so far. What generally happens now, however, is that btrfs will note failures attempting to write the device and start queuing up writes. If the device reappears fast enough, btrfs will flush the queue and be back to normal. Otherwise, you pretty much need to reboot and mount degraded, then add a device and rebalance. (btrfs device delete missing broke some versions ago and just got fixed by the latest btrfs-progs 4.3.1, IIRC.) As for alerts, you'd see the pile of accumulating write errors in the kernel log. Presumably you can write up a script that alerts on that and mails you the log or whatever, but I don't believe there's anything official or close to it, yet.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
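[Editor's note] Duncan's suggestion of scripting an alert on accumulating write errors could be sketched roughly as below. This is a hypothetical helper, not an official tool: the error patterns are guesses at typical btrfs kernel messages, and in practice the input would come from `dmesg` or `journalctl -k`, with a non-empty result piped to mail(1).

```python
import re

# Patterns that look like btrfs device/write errors in the kernel log.
# These are assumptions based on commonly seen messages, not a stable API.
ERROR_PATTERNS = [
    re.compile(r"BTRFS.*(lost page write|write error|i/o error)", re.IGNORECASE),
]

def find_btrfs_errors(log_lines):
    """Return the log lines that look like btrfs device errors."""
    return [line for line in log_lines
            if any(p.search(line) for p in ERROR_PATTERNS)]

# Example with canned input; a cron job would feed real kernel log lines
# and mail the result whenever the list is non-empty.
sample = [
    "BTRFS: lost page write due to I/O error on /dev/sdb",
    "usb 1-1: new high-speed USB device number 4",
]
print(find_btrfs_errors(sample))
```

A cron entry running this every few minutes over `journalctl -k --since -5min` output would approximate the mdadm-style notifier the thread says is missing.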
Re: implications of mixed mode
On Fri, 27 Nov 2015 10:21:31 +0800 Qu Wenruo wrote:

> And some extra pros and cons due to the fixed (4K) small (compared to the 16K default) nodesize:
>
> + A little higher performance
>   node/leaf size is restricted to sectorsize; smaller node/leaf, smaller range to lock.
>   In our SSD test, for operations with high concurrency, the performance is overall 10% better than 16K nodesize.
>   And in an extreme metadata operation case, like high concurrency on sequential writes into small files, it can be 8 times the performance of the default 16K nodesize.

This is surprising to read, as I thought 16K is generally faster and that's why the default value was changed to it from 4K.

https://oss.oracle.com/~mason/blocksizes/
https://git.kernel.org/cgit/linux/kernel/git/mason/btrfs-progs.git/commit/?id=c652e4efb8e2dd76ef1627d8cd649c6af5905902

Seems like the 16K size prevents fragmentation, but since your SSDs do not care much about fragmentation, that's not adding a benefit for them.
--
With respect, Roman
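[Editor's note] The locking tradeoff Qu describes can be made concrete with back-of-the-envelope arithmetic. This is a sketch under stated assumptions: the header and per-item overhead constants below are rough approximations for illustration, not the exact on-disk sizes.

```python
# Rough model of how many items fit in one btrfs leaf for a given nodesize.
# NODE_HEADER and ITEM_OVERHEAD approximate the on-disk struct sizes and
# are assumptions for illustration only.
NODE_HEADER = 101      # approx. size of the leaf header
ITEM_OVERHEAD = 25     # approx. per-item bookkeeping

def items_per_leaf(nodesize, item_data=100):
    """How many items of ~item_data bytes fit in one leaf."""
    return (nodesize - NODE_HEADER) // (ITEM_OVERHEAD + item_data)

small, big = items_per_leaf(4096), items_per_leaf(16384)
print(small, big)
# A 16K leaf packs roughly 4x the items (fewer splits, better locality),
# while with 4K leaves a single leaf lock covers far fewer items, which
# is where the concurrency win in the SSD numbers above would come from.
```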
[PATCH] btrfs-progs: mkfs: Fix a wrong extent buffer size causing wrong superblock csum
For make_btrfs(), it's setting the wrong buf size for the last super block write-out. The superblock size is always BTRFS_SUPER_INFO_SIZE, not cfg->sectorsize. And this makes mkfs.btrfs -f -s 8K fail.

Fix it to BTRFS_SUPER_INFO_SIZE.

Signed-off-by: Qu Wenruo
---
Thank goodness, this time it's not my super block checksum patches causing bugs.
---
 utils.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/utils.c b/utils.c
index 60235d8..00355a2 100644
--- a/utils.c
+++ b/utils.c
@@ -554,7 +554,7 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 	BUG_ON(sizeof(super) > cfg->sectorsize);
 	memset(buf->data, 0, cfg->sectorsize);
 	memcpy(buf->data, &super, sizeof(super));
-	buf->len = cfg->sectorsize;
+	buf->len = BTRFS_SUPER_INFO_SIZE;
 	csum_tree_block_size(buf, BTRFS_CRC32_SIZE, 0);
 	ret = pwrite(fd, buf->data, cfg->sectorsize, cfg->blocks[0]);
 	if (ret != cfg->sectorsize) {
--
2.6.2
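[Editor's note] The effect of the bug is easy to see in miniature: the checksum covers buf->len bytes, so checksumming sectorsize bytes (8K here) instead of the fixed 4K superblock size produces a value the kernel will not reproduce when it verifies the 4K superblock on mount. A hypothetical illustration, using zlib's CRC32 as a stand-in for the btrfs crc32c:

```python
import zlib

BTRFS_SUPER_INFO_SIZE = 4096   # the superblock is always 4K on disk
sectorsize = 8192              # e.g. mkfs.btrfs -f -s 8K

# Stand-in superblock buffer: 4K of data padded out to sectorsize.
buf = bytes(range(256)) * 16 + b"\x00" * (sectorsize - 4096)

csum_correct = zlib.crc32(buf[:BTRFS_SUPER_INFO_SIZE])  # what the kernel checks
csum_buggy = zlib.crc32(buf[:sectorsize])               # what buf->len = sectorsize gave

# The two values differ, so the stored checksum fails verification.
print(csum_correct != csum_buggy)
```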
Re: [PATCH 00/25] Btrfs-convert rework to support native separate
On Thu, Nov 26, 2015 at 08:38:23AM +0800, Qu Wenruo wrote:
> > As far as the conversion support stays, it's not a problem of course. I
> > don't have a complete picture of all the actual merging conflicts, but
> > the idea is to provide the callback abstraction v2 to allow ext2 and
> > reiser plus allow all the changes of this patchset.
>
> Glad to hear that.
>
> BTW, which reiserfs progs headers are you using?

Sorry, I forgot to mention it: it's the latest git version,
https://git.kernel.org/cgit/linux/kernel/git/jeffm/reiserfsprogs.git/

Jeff hasn't released v3.6.25 yet. We have the git version in SUSE distros so it works for me here.
Re: [PATCH 00/25] Btrfs-convert rework to support native separate
On 11/26/2015 05:30 PM, David Sterba wrote:
> On Thu, Nov 26, 2015 at 08:38:23AM +0800, Qu Wenruo wrote:
>>> As far as the conversion support stays, it's not a problem of course. I don't have a complete picture of all the actual merging conflicts, but the idea is to provide the callback abstraction v2 to allow ext2 and reiser plus allow all the changes of this patchset.
>> Glad to hear that.
>>
>> BTW, which reiserfs progs headers are you using?
>
> Sorry I forgot to mention it, it's the latest git version,
> https://git.kernel.org/cgit/linux/kernel/git/jeffm/reiserfsprogs.git/
>
> Jeff hasn't released v3.6.25 yet. We have the git version in SUSE distros so it works for me here.

Thanks, now it should be OK to continue the rebase.

But I'm a little concerned about the unstable headers: unlike ext2, whose headers are almost stable, reiserfs's seem not to be.

What about rebasing my patch onto your abstraction patch (btrfs-progs: convert: add context and operations struct to allow different file systems) first, and adding back your reiserfs patch after that?

Your abstraction patch is quite nice, although it needs some modification to work with the new convert. I hope to add stable things first and don't want another reiserfs change breaking the build.

Thanks, Qu
Re: [PATCH 0/7] Let user specify the kernel version for features
With the new -O comp= option, the concern for users who want to make a btrfs for a newer kernel is hugely reduced.

NO! Actually the new option -O comp= is no help for users who want to create _a btrfs disk layout which is compatible with more than one kernel_; above there are two examples of it. Why can't you give a higher kernel version than the current kernel? mount fails. Pls try !!

But I still prefer such feature align to be done only when specified by the user, instead of automatically. (yeah, already told several times though) A warning should be enough for the user; sometimes too automatic is not good.

As said before, we need the latest btrfs-progs on older kernels, for the obvious reason of btrfs-progs bug fixes. We don't have to back-port fixes even on btrfs-progs, as we already do it in the btrfs kernel. A btrfs-progs should work on any kernel with the "default features as prescribed for that kernel". Let's say we don't do this automatically: then latest btrfs-progs with default mkfs.btrfs && mount fails. But a user upgrading btrfs-progs for fsck bug fixes shouldn't find 'default mkfs.btrfs && mount' failing. Nor should they have to use a "new" set of mkfs options to create an all-default FS for an LTS kernel. Default features based on the btrfs-progs version instead of the kernel version makes NO sense.

Kernel version never makes sense, especially for non-vanilla. And unfortunately, most of the kernels used in stable distributions are not vanilla. And that's *POINT1*. That's why I stand against kernel-version-based detection. You can use the stable /sys/fs/btrfs/features/, but kernel version? Not an option, even as a fallback.

yep. that's the reason someone invented sysfs/features, but unfortunately only from 3.14. If not version, pls suggest the best suitable, _without_ transferring the problem to solve to the user end.

And adding a warning for not using the latest features which are not in their running kernel is pointless. In the context of default features, you didn't get the point of what to WARN.
Not warning users that they are not using the latest features, but warning that some features may prevent the fs from being mounted by the current kernel.

It was there before; some patch in the past removed it. Hope you remember "Turning on incompatible...".

That's _not_ a backward-kernel-compatible tool. btrfs-progs should work "for the kernel". We should avoid adding too much intelligence into btrfs-progs. I have fixed too many issues and redesigned progs in this area. Too many bugs were mainly because of the idea of copying and maintaining the same code in btrfs-progs and the btrfs kernel (ref wiki and my email before). That's a wrong approach.

Totally agree with this point. Too much nonsense in btrfs-progs code was copied from the kernel, and due to lack of updates, it's very buggy now. Just check volume.c for allocating data chunks. But I didn't see the point related to the feature auto-align here.

The whole point is: don't add more intelligence into progs than what is required. Here it's about default features. And the questions are: Default against what? The btrfs-progs version itself? Why do you want to add another attribute of the btrfs-progs version being relevant at the user end? For users that's like progs not being in line with the kernel, a very strong problem statement. (bit vague as of now) Potentially there are chances that someday we would move the mkfs part into the kernel itself, making progs as slick as possible.

I don't understand: if the purpose of both of these isn't the same, what is the point in maintaining the same code? It won't save effort, mainly because it's like developing a distributed FS where two parties have to communicate to stay in sync. Which is like using a cannon to shoo a crow. But if the reason was a fuse-like kernel-free FS (no one said that though) then it's better done as a separate project.

especially for tests. It depends what's being tested: kernel OR progs? It's kernel, not progs.

No, both kernel and progs.
Just from Dave, even with his typo: "xfstests is not jsut for testing kernel changes - it tests all of the filesystem utilities for regressions, too. And so when inadvertant changes in default behaviour occur, it detects those regressions too."

Now in this context, if you are testing the latest btrfs-progs (without these patches) on an old LTS kernel, and using default mkfs options, all tests fail. That's something to fix, without transferring implementation difficulties to the user end, and without changing xfstests mkfs_options, because we claim progs is backward kernel compatible. Automatic will keep the default features constant for a given kernel version. Further, for testing, using a known set of options is even better.

Yeah, a known set of options becomes unknown on different kernels, thanks to the hidden feature align, unless you specify it by -O options. That's *POINT2*: default auto feature align makes mkfs.btrfs behavior *unpredictable*.

> Before auto feature
Re: [PATCH 0/7] Let user specify the kernel version for features
On 11/26/2015 07:18 PM, Anand Jain wrote:

With the new -O comp= option, the concern for users who want to make a btrfs for a newer kernel is hugely reduced.

NO! Actually the new option -O comp= is no help for users who want to create _a btrfs disk layout which is compatible with more than one kernel_; above there are two examples of it. Why can't you give a higher kernel version than the current kernel? mount fails. Pls try !!

But that's what the user wants to do. He/she knows what he is doing. Maybe he is testing the btrfs-progs self-tests without the need to mount it (at least some of the tests don't require a mount).

But I still prefer such feature align to be done only when specified by the user, instead of automatically. (yeah, already told several times though) A warning should be enough for the user; sometimes too automatic is not good.

As said before, we need the latest btrfs-progs on older kernels, for the obvious reason of btrfs-progs bug fixes. We don't have to back-port fixes even on btrfs-progs, as we already do it in the btrfs kernel. A btrfs-progs should work on any kernel with the "default features as prescribed for that kernel". Let's say we don't do this automatically: then latest btrfs-progs with default mkfs.btrfs && mount fails. But a user upgrading btrfs-progs for fsck bug fixes shouldn't find 'default mkfs.btrfs && mount' failing. Nor should they have to use a "new" set of mkfs options to create an all-default FS for an LTS kernel. Default features based on the btrfs-progs version instead of the kernel version makes NO sense.

Kernel version never makes sense, especially for non-vanilla. And unfortunately, most of the kernels used in stable distributions are not vanilla. And that's *POINT1*. That's why I stand against kernel-version-based detection. You can use the stable /sys/fs/btrfs/features/, but kernel version? Not an option, even as a fallback.

yep. that's the reason someone invented sysfs/features, but unfortunately only from 3.14.
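[Editor's note] The /sys/fs/btrfs/features interface mentioned here is the one version-free probe the thread agrees on: a kernel from 3.14 onward lists its supported feature names as files in that directory. A sketch of how alignment could work; the helper names and the fallback policy are illustrative, not part of any existing tool:

```python
import os

SYSFS_FEATURES = "/sys/fs/btrfs/features"

def kernel_features(path=SYSFS_FEATURES):
    """Feature names the running kernel advertises, or None when the
    sysfs interface is absent (kernels before 3.14, i.e. exactly the
    fallback gap debated in this thread)."""
    try:
        return set(os.listdir(path))
    except OSError:
        return None

def safe_to_enable(wanted, advertised):
    """Keep only the mkfs features the kernel says it supports; with no
    sysfs information there is nothing reliable to align against."""
    if advertised is None:
        return set()
    return set(wanted) & set(advertised)

# Illustrative use with a faked sysfs listing:
advertised = {"mixed_backref", "extended_iref", "skinny_metadata"}
print(sorted(safe_to_enable({"skinny_metadata", "no_holes"}, advertised)))
# -> ['skinny_metadata']
```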
If not version, pls suggest the best suitable, _without_ transferring the problem to solve to the user end.

A solution which may produce a wrong result is never a solution, no matter whether it is better than any other one. Doing it wrong (even if it's sometimes OK) is never better than doing nothing.

And adding a warning for not using the latest features which are not in their running kernel is pointless. In the context of default features, you didn't get the point of what to WARN. Not warning users that they are not using the latest features, but warning that some features may prevent the fs from being mounted by the current kernel.

It was there before; some patch in the past removed it. Hope you remember "Turning on incompatible...".

Then, add it back, and make it not just informative, but a warning. The old output is too easy to ignore. Or you can just stop mkfs if you detect such an incompatible feature with the current kernel, and only continue if "-f" is given.

That's _not_ a backward-kernel-compatible tool. btrfs-progs should work "for the kernel". We should avoid adding too much intelligence into btrfs-progs. I have fixed too many issues and redesigned progs in this area. Too many bugs were mainly because of the idea of copying and maintaining the same code in btrfs-progs and the btrfs kernel (ref wiki and my email before). That's a wrong approach.

Totally agree with this point. Too much nonsense in btrfs-progs code was copied from the kernel, and due to lack of updates, it's very buggy now. Just check volume.c for allocating data chunks. But I didn't see the point related to the feature auto-align here.

The whole point is: don't add more intelligence into progs than what is required. Here it's about default features. And the questions are: Default against what? The btrfs-progs version itself? Why do you want to add another attribute of the btrfs-progs version being relevant at the user end?

End user of what? A single package? No, end user of the *whole distribution*.
It's the packager/backport guys who are responsible for not combining mismatched kernel(-LTS) and progs.

Now we need to auto-align features with the kernel; who knows, one day we will need to auto-align our libs to upstream packages? Keeping a matrix of different packages like libuuid/acl/attr with different Makefiles? At least this is not a good idea for me, and that's the work of autoconfig IIRC. And if I'm a packager and face such a problem, I'll choose the simplest solution: just add a line in the PKGBUILD (package system of Arch Linux) of btrfs:
--
depends=('linux>=3.14')
--
(Yeah, such a simple and slick packaging solution is the reason I like Arch over other rolling distributions.) Not everything really needs to be done at the code level.

For users that's like progs not being in line with the kernel, a very strong problem statement. (bit vague as of now) Potentially there are chances that someday we would move the mkfs part into the kernel itself, making progs as slick as possible.

I don't understand- if the
Re: [PATCH] btrfs-progs: mkfs: Fix a wrong extent buffer size causing wrong superblock csum
On Thu, Nov 26, 2015 at 04:15:55PM +0800, Qu Wenruo wrote:
> For make_btrfs(), it's setting the wrong buf size for the last super block write-out.
> The superblock size is always BTRFS_SUPER_INFO_SIZE, not cfg->sectorsize.
>
> And this makes mkfs.btrfs -f -s 8K fail.
>
> Fix it to BTRFS_SUPER_INFO_SIZE.
>
> Signed-off-by: Qu Wenruo

Already fixed in current devel (bf1ac8305ab3f191d9).

> --- a/utils.c
> +++ b/utils.c
> @@ -554,7 +554,7 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
> 	BUG_ON(sizeof(super) > cfg->sectorsize);
> 	memset(buf->data, 0, cfg->sectorsize);
> 	memcpy(buf->data, &super, sizeof(super));
> -	buf->len = cfg->sectorsize;
> +	buf->len = BTRFS_SUPER_INFO_SIZE;
> 	csum_tree_block_size(buf, BTRFS_CRC32_SIZE, 0);
> 	ret = pwrite(fd, buf->data, cfg->sectorsize, cfg->blocks[0]);

Also, this overwrites more bytes than necessary, so my fix uses BTRFS_SUPER_INFO_SIZE everywhere.

> 	if (ret != cfg->sectorsize) {
Re: [PATCH v2] btrfs-progs: fsck: Fix a false alert where extent record has wrong metadata flag
On Wed, Nov 25, 2015 at 02:19:06PM +0800, Qu Wenruo wrote:
> In process_extent_item(), it gives 'metadata' an initial value of 0, but for the non-skinny-metadata case, metadata extents can't be judged just from the key type, and it forgot that case.
>
> This causes a lot of false alerts on non-skinny-metadata filesystems.
>
> Fix it by setting the correct metadata value before calling add_extent_rec().
>
> Reported-by: Christoph Anton Mitterer
> Signed-off-by: Qu Wenruo

Patch replaced, thanks. The test image is pushed as well.
Re: [PATCH 00/25] Btrfs-convert rework to support native separate
On Thu, Nov 26, 2015 at 06:12:57PM +0800, Qu Wenruo wrote:
> But I'm a little concerned about the unstable headers: unlike ext2, whose headers are almost stable, reiserfs's seem not to be.

Well, reiserfs is not developed nowadays and I think Jeff implemented the bits required for btrfs-convert. The configure script will detect whether the reiser library provides the needed functions; compiling reiser in will be optional anyway.

> What about rebasing my patch onto your abstraction patch (btrfs-progs: convert: add context and operations struct to allow different file systems) first, and adding back your reiserfs patch after that?

Oh right, the patch is independent, I'll add it to devel.

> Your abstraction patch is quite nice, although it needs some modification to work with the new convert.

Yes, that's expected.

> I hope to add stable things first and don't want another reiserfs change breaking the build.

Ok.
Re: [PATCH] btrfs: Support convert to -d dup for btrfs-convert
On Thu, Nov 19, 2015 at 05:26:22PM +0800, Zhao Lei wrote:
> Since we will add support for -d dup for non-mixed filesystems, the kernel needs to support converting to this raid type.
>
> This patch removes the limitation for the above case.
>
> Signed-off-by: Zhao Lei

Reviewed-by: David Sterba
Re: [PATCH v3] btrfs-progs: mkfs: Enable -d dup for single device
On Thu, Nov 19, 2015 at 04:14:16PM -0500, Austin S Hemmelgarn wrote:
> On 2015-11-19 04:36, Zhao Lei wrote:
> > Signed-off-by: Zhao Lei
>
> Seeing as I forgot to reply to the previous version after testing it, I'll just reply here now that I've run this version through the same tests I did on the last one.
>
> I threw everything I could think of at it, and nothing broke, so you can add:
> Tested-by: Austin S. Hemmelgarn

Thanks. Patch added to devel as it's not code-intrusive, will probably be released within 4.4.
Re: [PATCH 00/25] Btrfs-convert rework to support native separate
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On 11/26/15 5:12 AM, Qu Wenruo wrote:
> On 11/26/2015 05:30 PM, David Sterba wrote:
>> On Thu, Nov 26, 2015 at 08:38:23AM +0800, Qu Wenruo wrote:
>>>> As far as the conversion support stays, it's not a problem of course. I don't have a complete picture of all the actual merging conflicts, but the idea is to provide the callback abstraction v2 to allow ext2 and reiser plus allow all the changes of this patchset.
>>> Glad to hear that.
>>>
>>> BTW, which reiserfs progs headers are you using?
>>
>> Sorry I forgot to mention it, it's the latest git version,
>> https://git.kernel.org/cgit/linux/kernel/git/jeffm/reiserfsprogs.git/
>>
>> Jeff hasn't released v3.6.25 yet. We have the git version in SUSE
>> distros so it works for me here.
>
> Thanks, now it should be OK to continue the rebase.
>
> But I'm a little concerned about the unstable headers, unlike ext2
> its headers is almost stable but reiserfs seems not.

This is entirely due to the fact that splitting out library functionality for reiserfsprogs was only done to support btrfs-convert. Unless there's some pressing need to revise the API, the headers are pretty much static at this point. I should just go ahead and release the current snapshot as 3.6.25. Today's a US holiday and I won't be able to get to it, but I'll do that in the next few days.

- -Jeff

> What about rebasing my patch to your abstract patch (btrfs-progs:
> convert: add context and operations struct to allow different file
> systems) first and add back your reiserfs patch?
>
> Your abstract patch is quite nice, although need some modification
> to work with new convert. I hope to add stable things first and
> don't want another reiserfs change breaks the compile.
>
> Thanks, Qu

-- 
Jeff Mahoney
SUSE Labs
Re: [PATCH v2] btrfs-progs: fsck: Fix a false alert where extent record has wrong metadata flag
Hey.

I can confirm that the new patch fixes the issue on both test
filesystems. Thanks for working that out.

I guess there's no longer a need to keep those old filesystems now?!

Cheers,
Chris.

On Thu, 2015-11-26 at 15:27 +0100, David Sterba wrote:
> On Wed, Nov 25, 2015 at 02:19:06PM +0800, Qu Wenruo wrote:
> > In process_extent_item(), it gives 'metadata' initial value 0, but
> > for the non-skinny-metadata case, a metadata extent can't be judged
> > just from the key type, and it forgot that case.
> >
> > This causes a lot of false alerts on non-skinny-metadata filesystems.
> >
> > Fix it by setting the correct metadata value before calling
> > add_extent_rec().
> >
> > Reported-by: Christoph Anton Mitterer
> > Signed-off-by: Qu Wenruo
>
> Patch replaced, thanks. The test image is pushed as well.
Re: btrfs send reproducibly fails for a specific subvolume after sending 15 GiB, scrub reports no errors
Hugo Mills posted on Tue, 24 Nov 2015 21:27:46 + as excerpted:

[In the context of btrfs send...]

>    -p only sends the file metadata for the changes from the reference
> snapshot to the sent snapshot. -c sends all the file metadata, but will
> preserve the reflinks between the sent snapshot and the (one or more)
> reference snapshots. You can only use one -p (because there's only one
> difference you can compute at any one time), but you can use as many -c
> as you like (because you can share extents with any number of subvols).
>
>    In both cases, the reference snapshot(s) must exist on the
> receiving side.
>
>    In implementation terms, on the receiver, -p takes a (writable)
> snapshot of the reference subvol, and modifies it according to the
> stream data. -c makes a new empty subvol, and populates it from
> scratch, using the reflink ioctl to use data which is known to exist
> in the reference subvols.

Thanks, Hugo.  I had a vague idea that the above was the difference in
general, but as CAM says, the manpage (and wiki) isn't particularly
detailed on the differences, so I didn't know whether my vague idea was
correct or not.  Your explanation makes perfect sense and clears things
up dramatically. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
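[Editor's sketch: Hugo's description maps onto command lines like the following. Subvolume paths and mount points are made up for illustration, and both forms assume the reference snapshot @snap1 already exists on the receiving side, as he notes above.]

```sh
# Incremental send: only the delta from @snap1 to @snap2 is emitted;
# the receiver snapshots its copy of @snap1 and applies the changes.
btrfs send -p /mnt/pool/@snap1 /mnt/pool/@snap2 | \
    btrfs receive /backup/pool

# Clone-source send: full metadata for @snap2 is emitted, but extents
# already present in the receiver's copy of @snap1 are reflinked
# instead of being retransmitted. Several -c options may be given.
btrfs send -c /mnt/pool/@snap1 /mnt/pool/@snap2 | \
    btrfs receive /backup/pool
```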
Re: Using Btrfs on single drives
Russell Coker posted on Wed, 25 Nov 2015 18:20:25 +1100 as excerpted:

> On Sun, 15 Nov 2015 03:01:57 PM Duncan wrote:
>> That looks to me like native drive limitations.
>>
>> Due to the fact that a modern hard drive spins at the same speed no
>> matter where the read/write head is located, when it's reading/writing
>> to the first part of the drive -- the outside -- much more linear
>> drive distance will pass under the read/write heads in say a tenth of
>> a second than will be the case as the last part of the drive is
>> filled -- the inside -- and throughput will be much higher at the
>> first of the drive.
>
> http://www.coker.com.au/bonnie++/zcav/results.html
>
> The above page has the results of my ZCAV benchmark (part of the
> Bonnie++ suite) which shows this. You can safely run ZCAV in read mode
> on a device that's got a filesystem on it, so it's not too late to test
> these things.

Thanks.  Those graphs are pretty clear.  As you, I'd have thought
there'd be far fewer zones (3-4) than it turns out there are (8ish).

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
Re: [4.3-rc4] scrubbing aborts before finishing
Martin Steigerwald posted on Wed, 25 Nov 2015 16:35:39 +0100 as
excerpted:

> I'd file a bug report, in case anyone would be interested. But if the
> interest is like in this mailing list post, I can spare myself the
> time for reporting via bugzilla.
>
> So does anyone at all care about this issue?

FWIW I'm interested, but haven't as a user seen the same thing here...
scrubs have worked fine here unless I forget to sudo, in which case
they don't work at all as they can't get the necessary privs.  And I'm
stumped as to what else it might be.  I don't even have an idea where
to start looking for clues.

But not being a dev and having not the foggiest what the problem may
be, it's not like my interest is going to help much, which is why I
didn't reply earlier.  Just posting now to say you're not alone in your
interest.  Things like this that don't seem to have any logic bother
me, tho, so I really would like to at least learn why it's happening.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
Re: btrfs: poor performance on deleting many large files
Mitchell Fossen posted on Wed, 25 Nov 2015 15:49:58 -0600 as excerpted:

> Also, is there a recommendation for relatime vs noatime mount options?
> I don't believe anything that runs on the server needs to use file
> access times, so if it can help with performance/disk usage I'm fine
> with setting it to noatime.

FWIW I finally got tired enough of always setting noatime (for over a
decade, since kernel 2.4 and my standardizing on then-reiserfs) that I
finally found the spot in the kernel where the relatime default is set,
and patched it to be noatime by default.  My kernel scripts apply that
on top of my git kernel pulls, now.

For people doing snapshotting in particular, atime updates can be a big
part of the differences between snapshots, so it's particularly
important to set noatime if you're snapshotting.

If you're not doing snapshots, it's somewhat less important, but IIRC
it was still somewhat more of a performance issue than with ext*, tho I
don't remember the details; I'd guess it's to do with COWing the
metadata triggering metadata fragmentation.

Bottom line, use noatime unless you have something that needs atime.
It's not going to hurt for sure, and should improve performance at
least somewhat even on ext*.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
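[Editor's sketch: in practice the advice above comes down to one mount option; the UUID, mount point, and device here are placeholders.]

```sh
# /etc/fstab -- mount the btrfs filesystem with atime updates disabled:
#   UUID=<fs-uuid>  /data  btrfs  defaults,noatime  0  0

# Or switch an already-mounted filesystem without unmounting it:
mount -o remount,noatime /data

# Verify the active options:
grep ' /data ' /proc/mounts
```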
[PATCH] btrfs-progs: docs: mkfs, implications of DUP on devices
We offer DUP but still depend on the hardware to do the right thing.

Signed-off-by: David Sterba
---
To wider audience: feel free to suggest improvements to the manual page
text if you think it's not clear or too technical etc.

 Documentation/mkfs.btrfs.asciidoc | 32
 utils.c                           |  5 +++--
 2 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/Documentation/mkfs.btrfs.asciidoc b/Documentation/mkfs.btrfs.asciidoc
index c9ba314c2220..0b145c7a01c3 100644
--- a/Documentation/mkfs.btrfs.asciidoc
+++ b/Documentation/mkfs.btrfs.asciidoc
@@ -50,7 +50,9 @@ mkfs.btrfs uses the entire device space for the filesystem.
 *-d|--data *::
 Specify the profile for the data block groups. Valid values are 'raid0',
-'raid1', 'raid5', 'raid6', 'raid10' or 'single', (case does not matter).
+'raid1', 'raid5', 'raid6', 'raid10' or 'single' or dup (case does not matter).
++
+See 'DUP PROFILES ON A SINGLE DEVICE' for more.

 *-m|--metadata *::
 Specify the profile for the metadata block groups.
@@ -60,13 +62,12 @@ Valid values are 'raid0', 'raid1', 'raid5', 'raid6', 'raid10', 'single' or
 A single device filesystem will default to 'DUP', unless a SSD is detected. Then
 it will default to 'single'. The detection is based on the value of
 `/sys/block/DEV/queue/rotational`, where 'DEV' is the short name of the device.
-This is because SSDs can remap the blocks internally to a single copy thus
-deduplicating them which negates the purpose of increased metadata redunancy
-and just wastes space.
+
 Note that the rotational status can be arbitrarily set by the underlying block
 device driver and may not reflect the true status (network block device,
 memory-backed SCSI devices etc). Use the options '--data/--metadata' to avoid
 confusion.
++
+See 'DUP PROFILES ON A SINGLE DEVICE' for more details.

 *-M|--mixed*::
 Normally the data and metadata block groups are isolated. The 'mixed' mode
@@ -265,6 +266,29 @@ PROFILES
 another one is added, but *mkfs.btrfs* will not let you create DUP on
 multiple devices.
+DUP PROFILES ON A SINGLE DEVICE
+---
+
+The mkfs utility will let the user create a filesystem with profiles that write
+the logical blocks to 2 physical locations. Whether there are really 2
+physical copies highly depends on the underlying device type.
+
+For example, a SSD drive can remap the blocks internally to a single copy thus
+deduplicating them. This negates the purpose of increased redunancy and just
+wastes space.
+
+The duplicated data/metadata may still be useful to statistically improve the
+chances on a device that might perform some internal optimizations. The actual
+details are not usually disclosed by vendors. As another example, the widely
+used USB flash or SD cards use a translation layer. The data lifetime may
+be affected by frequent plugging. The memory cells could get damaged, hopefully
+not destroying both copies of particular data.
+
+The traditional rotational hard drives usually fail at the sector level.
+
+In any case, a device that starts to misbehave and repairs from the DUP copy
+should be replaced! *DUP is not backup*.
+
 KNOWN ISSUES

diff --git a/utils.c b/utils.c
index c20966c19768..d5f60a420135 100644
--- a/utils.c
+++ b/utils.c
@@ -2504,8 +2504,9 @@ int test_num_disk_vs_raid(u64 metadata_profile, u64 data_profile,
 		return 1;
 	}

-	warning_on(!mixed && (data_profile & BTRFS_BLOCK_GROUP_DUP) && ssd,
-		"DUP have no effect if your SSD have deduplication function");
+	/* warning_on(!mixed && (data_profile & BTRFS_BLOCK_GROUP_DUP) && ssd, */
+	warning_on(!mixed && (data_profile & BTRFS_BLOCK_GROUP_DUP),
+		"DUP may not actually lead to 2 copies on the device, see manual page");

 	return 0;
 }
-- 
2.6.2
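[Editor's sketch: invoking the behavior the patch documents. The device name is a placeholder, and the command destroys any data on it.]

```sh
# Explicitly request DUP for both data and metadata on a single device,
# overriding the SSD-based autodetection the manual page describes.
# With this patch applied, mkfs prints the new warning that DUP may not
# actually yield two physical copies on the device.
mkfs.btrfs -d dup -m dup /dev/sdX
```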
Re: [PATCH v2 1/5] btrfs-progs: introduce framework to check kernel supported features
On Tue, Nov 24, 2015 at 03:21:19PM -0500, Austin S Hemmelgarn wrote:
> > I think you mean 2.6.37 here.
> > 67377734fd24c3 "Btrfs: add support for mixed data+metadata block groups"
>
> This brings up a rather important question:
> Should compat-X.Y mean features that were considered usable in that
> version, or everything that version offered?  I understand wanting
> consistency with the kernel versions, but we shouldn't be creating
> filesystems that we know will break on the specified kernel even if it
> is mountable on it.

IMO compat refers to the compatibility feature bits, so it's whether
the filesystem is mountable on a given version. Usability can be
subjective.

I assume the kernel versions in wide use match some of the long-term
branches. If it's k.org, we can submit the fixes and distros update
their long-term branches.

A table of "is the feature usable" would be interesting, but I think
it's for the wiki.
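[Editor's note: for context, btrfs-progs can already enumerate the mkfs-time feature bits such a compat-X.Y mapping would be built from, along with the kernel version each first appeared in.]

```sh
# List all filesystem features known to this btrfs-progs build,
# e.g. mixed-bg, extref, raid56, skinny-metadata, no-holes,
# together with the minimum kernel version supporting each.
mkfs.btrfs -O list-all
```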
Re: btrfs: poor performance on deleting many large files
On Thu, 2015-11-26 at 16:52 +, Duncan wrote:
> For people doing snapshotting in particular, atime updates can be a
> big part of the differences between snapshots, so it's particularly
> important to set noatime if you're snapshotting.

What exactly happens when that is left at relatime?

I'd guess that obviously every time the atime is updated there will be
some CoW, but only on metadata blocks, right?

Does this then lead to fragmentation problems in the metadata block
groups?

And how serious are the effects on space that is eaten up... say I have
n snapshots and access all of their files... then I'd probably get n
times the metadata, right? Which would sound quite dramatic... Or are
just parts of the metadata copied with new atimes?

Thanks,
Chris.
vfs: move btrfs clone ioctls to common code
This patch set moves the existing btrfs clone ioctls that other file
systems have started to implement to common code, and allows the NFS
server to export this functionality to remote systems.

This work is based originally on my NFS CLONE prototype, which reused
code from Anna Schumaker's NFS COPY prototype, as well as various
updates from Peng Tao to this code.

The patches are also available as a git branch and on gitweb:

git://git.infradead.org/users/hch/pnfs.git clone-for-viro
http://git.infradead.org/users/hch/pnfs.git/shortlog/refs/heads/clone-for-viro
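[Editor's sketch: from userspace, the operation being lifted into the VFS here is the one behind reflink copies on btrfs. The paths are placeholders and assume a mounted btrfs filesystem.]

```sh
# cp issues the btrfs clone ioctl (BTRFS_IOC_CLONE) rather than copying
# data; the new file shares extents with the original until either side
# is modified, so the "copy" completes as a fast metadata operation.
cp --reflink=always /mnt/btrfs/bigfile /mnt/btrfs/bigfile.clone

# Fails with an error instead of falling back to a byte copy if the
# underlying filesystem does not support cloning.
```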
[PATCH 3/5] vfs: pull btrfs clone API to vfs layer
The btrfs ioctl clones are now adopted by other file systems: NFS since
4.3 and XFS a few kernels in the future, as well as the previous
(incorrect) usage by CIFS. To avoid growth of various slightly
incompatible implementations, add one to the core VFS code.

Note that clones are different from file copies in various ways:

 - they are atomic vs other writers
 - they support whole file clones
 - they support 64-bit length clones
 - they do not allow partial success (aka short writes)
 - clones are expected to be a fast metadata operation

Because of that it would be rather cumbersome to try to piggyback them
on top of the recent clone_file_range infrastructure.

Based on earlier work from Peng Tao.

Signed-off-by: Christoph Hellwig
---
 fs/btrfs/ctree.h         |   3 +-
 fs/btrfs/file.c          |   1 +
 fs/btrfs/ioctl.c         |  49 ++
 fs/ioctl.c               |  29 +
 fs/nfs/nfs42proc.c       |   1 +
 fs/nfs/nfs4file.c        | 107 ---
 fs/read_write.c          |  71 +++
 include/linux/fs.h       |   7 +++-
 include/uapi/linux/fs.h  |   9
 include/uapi/linux/nfs.h |  11 -
 10 files changed, 140 insertions(+), 148 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index dd7d888..adc997f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4021,7 +4021,6 @@ void btrfs_get_block_group_info(struct list_head *groups_list,
 void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock,
 			       struct btrfs_ioctl_balance_args *bargs);
-
 /* file.c */
 int btrfs_auto_defrag_init(void);
 void btrfs_auto_defrag_exit(void);
@@ -4054,6 +4053,8 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
 ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
 			      struct file *file_out, loff_t pos_out,
 			      size_t len, unsigned int flags);
+int btrfs_clone_file_range(struct file *file_in, loff_t pos_in,
+			   struct file *file_out, loff_t pos_out, u64 len);

 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 1c0ee74..3b61b0a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2921,6 +2921,7 @@ const struct file_operations btrfs_file_operations = { .compat_ioctl = btrfs_ioctl, #endif .copy_file_range = btrfs_copy_file_range, + .clone_file_range = btrfs_clone_file_range, }; void btrfs_auto_defrag_exit(void) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 0f92735..85b1cae 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -3906,49 +3906,10 @@ ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in, return ret; } -static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, - u64 off, u64 olen, u64 destoff) +int btrfs_clone_file_range(struct file *src_file, loff_t off, + struct file *dst_file, loff_t destoff, u64 len) { - struct fd src_file; - int ret; - - /* the destination must be opened for writing */ - if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND)) - return -EINVAL; - - ret = mnt_want_write_file(file); - if (ret) - return ret; - - src_file = fdget(srcfd); - if (!src_file.file) { - ret = -EBADF; - goto out_drop_write; - } - - /* the src must be open for reading */ - if (!(src_file.file->f_mode & FMODE_READ)) { - ret = -EINVAL; - goto out_fput; - } - - ret = btrfs_clone_files(file, src_file.file, off, olen, destoff); - -out_fput: - fdput(src_file); -out_drop_write: - mnt_drop_write_file(file); - return ret; -} - -static long btrfs_ioctl_clone_range(struct file *file, void __user *argp) -{ - struct btrfs_ioctl_clone_range_args args; - - if (copy_from_user(, argp, sizeof(args))) - return -EFAULT; - return btrfs_ioctl_clone(file, args.src_fd, args.src_offset, -args.src_length, args.dest_offset); + return btrfs_clone_files(dst_file, src_file, off, len, destoff); } /* @@ -5498,10 +5459,6 @@ long btrfs_ioctl(struct file *file, unsigned int return btrfs_ioctl_dev_info(root, argp); case BTRFS_IOC_BALANCE: return btrfs_ioctl_balance(file, NULL); - case BTRFS_IOC_CLONE: - return btrfs_ioctl_clone(file, arg, 0, 0, 0); - case BTRFS_IOC_CLONE_RANGE: - return btrfs_ioctl_clone_range(file, 
argp); case BTRFS_IOC_TRANS_START: return btrfs_ioctl_trans_start(file); case BTRFS_IOC_TRANS_END: diff --git a/fs/ioctl.c b/fs/ioctl.c index 5d01d26..84c6e79 100644 ---
[PATCH 4/5] nfsd: Pass filehandle to nfs4_preprocess_stateid_op()
From: Anna SchumakerThis will be needed so COPY can look up the saved_fh in addition to the current_fh. Signed-off-by: Anna Schumaker --- fs/nfsd/nfs4proc.c | 16 +--- fs/nfsd/nfs4state.c | 5 ++--- fs/nfsd/state.h | 4 ++-- 3 files changed, 13 insertions(+), 12 deletions(-) diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c index a9f096c..3ba10a3 100644 --- a/fs/nfsd/nfs4proc.c +++ b/fs/nfsd/nfs4proc.c @@ -774,8 +774,9 @@ nfsd4_read(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, clear_bit(RQ_SPLICE_OK, >rq_flags); /* check stateid */ - status = nfs4_preprocess_stateid_op(rqstp, cstate, >rd_stateid, - RD_STATE, >rd_filp, >rd_tmp_file); + status = nfs4_preprocess_stateid_op(rqstp, cstate, >current_fh, + >rd_stateid, RD_STATE, + >rd_filp, >rd_tmp_file); if (status) { dprintk("NFSD: nfsd4_read: couldn't process stateid!\n"); goto out; @@ -921,7 +922,8 @@ nfsd4_setattr(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, if (setattr->sa_iattr.ia_valid & ATTR_SIZE) { status = nfs4_preprocess_stateid_op(rqstp, cstate, - >sa_stateid, WR_STATE, NULL, NULL); + >current_fh, >sa_stateid, + WR_STATE, NULL, NULL); if (status) { dprintk("NFSD: nfsd4_setattr: couldn't process stateid!\n"); return status; @@ -985,8 +987,8 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, if (write->wr_offset >= OFFSET_MAX) return nfserr_inval; - status = nfs4_preprocess_stateid_op(rqstp, cstate, stateid, WR_STATE, - , NULL); + status = nfs4_preprocess_stateid_op(rqstp, cstate, >current_fh, + stateid, WR_STATE, , NULL); if (status) { dprintk("NFSD: nfsd4_write: couldn't process stateid!\n"); return status; @@ -1016,7 +1018,7 @@ nfsd4_fallocate(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, __be32 status = nfserr_notsupp; struct file *file; - status = nfs4_preprocess_stateid_op(rqstp, cstate, + status = nfs4_preprocess_stateid_op(rqstp, cstate, >current_fh, >falloc_stateid, WR_STATE, , NULL); if (status != nfs_ok) { @@ -1055,7 +1057,7 @@ 
nfsd4_seek(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, __be32 status; struct file *file; - status = nfs4_preprocess_stateid_op(rqstp, cstate, + status = nfs4_preprocess_stateid_op(rqstp, cstate, >current_fh, >seek_stateid, RD_STATE, , NULL); if (status) { diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c index 6b800b5..df5dba6 100644 --- a/fs/nfsd/nfs4state.c +++ b/fs/nfsd/nfs4state.c @@ -4797,10 +4797,9 @@ nfs4_check_file(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfs4_stid *s, */ __be32 nfs4_preprocess_stateid_op(struct svc_rqst *rqstp, - struct nfsd4_compound_state *cstate, stateid_t *stateid, - int flags, struct file **filpp, bool *tmp_file) + struct nfsd4_compound_state *cstate, struct svc_fh *fhp, + stateid_t *stateid, int flags, struct file **filpp, bool *tmp_file) { - struct svc_fh *fhp = >current_fh; struct inode *ino = d_inode(fhp->fh_dentry); struct net *net = SVC_NET(rqstp); struct nfsd_net *nn = net_generic(net, nfsd_net_id); diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h index 77fdf4d..99432b7 100644 --- a/fs/nfsd/state.h +++ b/fs/nfsd/state.h @@ -578,8 +578,8 @@ struct nfsd4_compound_state; struct nfsd_net; extern __be32 nfs4_preprocess_stateid_op(struct svc_rqst *rqstp, - struct nfsd4_compound_state *cstate, stateid_t *stateid, - int flags, struct file **filp, bool *tmp_file); + struct nfsd4_compound_state *cstate, struct svc_fh *fhp, + stateid_t *stateid, int flags, struct file **filp, bool *tmp_file); __be32 nfsd4_lookup_stateid(struct nfsd4_compound_state *cstate, stateid_t *stateid, unsigned char typemask, struct nfs4_stid **s, struct nfsd_net *nn); -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/5] cifs: implement clone_file_range operation
And drop the fake support for the btrfs CLONE ioctl - SMB2 copies are chunked and do not actually implement clone semantics! Heavily based on a previous patch from Peng Tao. Signed-off-by: Christoph Hellwig--- fs/cifs/cifsfs.c | 25 ++ fs/cifs/cifsfs.h | 4 ++- fs/cifs/ioctl.c | 103 +++ 3 files changed, 86 insertions(+), 46 deletions(-) diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c index cbc0f4b..ad7117a 100644 --- a/fs/cifs/cifsfs.c +++ b/fs/cifs/cifsfs.c @@ -914,6 +914,23 @@ const struct inode_operations cifs_symlink_inode_ops = { #endif }; +ssize_t cifs_file_copy_range(struct file *file_in, loff_t pos_in, +struct file *file_out, loff_t pos_out, +size_t len, unsigned int flags) +{ + unsigned int xid; + int rc; + + if (flags) + return -EOPNOTSUPP; + + xid = get_xid(); + rc = cifs_file_clone_range(xid, file_in, file_out, pos_in, + len, pos_out, true); + free_xid(xid); + return rc < 0 ? rc : len; +} + const struct file_operations cifs_file_ops = { .read_iter = cifs_loose_read_iter, .write_iter = cifs_file_write_iter, @@ -926,6 +943,7 @@ const struct file_operations cifs_file_ops = { .splice_read = generic_file_splice_read, .llseek = cifs_llseek, .unlocked_ioctl = cifs_ioctl, + .copy_file_range = cifs_file_copy_range, .setlease = cifs_setlease, .fallocate = cifs_fallocate, }; @@ -942,6 +960,8 @@ const struct file_operations cifs_file_strict_ops = { .splice_read = generic_file_splice_read, .llseek = cifs_llseek, .unlocked_ioctl = cifs_ioctl, + .copy_file_range = cifs_file_copy_range, + .copy_file_range = cifs_file_copy_range, .setlease = cifs_setlease, .fallocate = cifs_fallocate, }; @@ -958,6 +978,7 @@ const struct file_operations cifs_file_direct_ops = { .mmap = cifs_file_mmap, .splice_read = generic_file_splice_read, .unlocked_ioctl = cifs_ioctl, + .copy_file_range = cifs_file_copy_range, .llseek = cifs_llseek, .setlease = cifs_setlease, .fallocate = cifs_fallocate, @@ -974,6 +995,7 @@ const struct file_operations cifs_file_nobrl_ops = { .splice_read = 
generic_file_splice_read, .llseek = cifs_llseek, .unlocked_ioctl = cifs_ioctl, + .copy_file_range = cifs_file_copy_range, .setlease = cifs_setlease, .fallocate = cifs_fallocate, }; @@ -989,6 +1011,7 @@ const struct file_operations cifs_file_strict_nobrl_ops = { .splice_read = generic_file_splice_read, .llseek = cifs_llseek, .unlocked_ioctl = cifs_ioctl, + .copy_file_range = cifs_file_copy_range, .setlease = cifs_setlease, .fallocate = cifs_fallocate, }; @@ -1004,6 +1027,7 @@ const struct file_operations cifs_file_direct_nobrl_ops = { .mmap = cifs_file_mmap, .splice_read = generic_file_splice_read, .unlocked_ioctl = cifs_ioctl, + .copy_file_range = cifs_file_copy_range, .llseek = cifs_llseek, .setlease = cifs_setlease, .fallocate = cifs_fallocate, @@ -1014,6 +1038,7 @@ const struct file_operations cifs_dir_ops = { .release = cifs_closedir, .read= generic_read_dir, .unlocked_ioctl = cifs_ioctl, + .copy_file_range = cifs_file_copy_range, .llseek = generic_file_llseek, }; diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h index c3cc160..797439b 100644 --- a/fs/cifs/cifsfs.h +++ b/fs/cifs/cifsfs.h @@ -131,7 +131,9 @@ extern int cifs_setxattr(struct dentry *, const char *, const void *, extern ssize_t cifs_getxattr(struct dentry *, const char *, void *, size_t); extern ssize_t cifs_listxattr(struct dentry *, char *, size_t); extern long cifs_ioctl(struct file *filep, unsigned int cmd, unsigned long arg); - +extern int cifs_file_clone_range(unsigned int xid, struct file *src_file, +struct file *dst_file, u64 off, u64 len, +u64 destoff, bool dup_extents); #ifdef CONFIG_CIFS_NFSD_EXPORT extern const struct export_operations cifs_export_ops; #endif /* CONFIG_CIFS_NFSD_EXPORT */ diff --git a/fs/cifs/ioctl.c b/fs/cifs/ioctl.c index 35cf990..4f92f5c 100644 --- a/fs/cifs/ioctl.c +++ b/fs/cifs/ioctl.c @@ -34,73 +34,43 @@ #include "cifs_ioctl.h" #include -static long cifs_ioctl_clone(unsigned int xid, struct file *dst_file, - unsigned long srcfd, u64 off, u64 len, u64 destoff, - 
bool dup_extents) +int cifs_file_clone_range(unsigned int xid, struct file *src_file, + struct file *dst_file, u64 off, u64 len, + u64 destoff, bool dup_extents) { - int rc; - struct cifsFileInfo *smb_file_target =
[PATCH 2/5] locks: new locks_mandatory_area calling convention
Pass a loff_t end for the last byte instead of the 32-bit count parameter to allow full file clones even on 32-bit architectures. While we're at it also drop the pointless inode argument and simplify the read/write selection. Signed-off-by: Christoph Hellwig--- fs/locks.c | 22 +- fs/read_write.c| 5 ++--- include/linux/fs.h | 28 +--- 3 files changed, 24 insertions(+), 31 deletions(-) diff --git a/fs/locks.c b/fs/locks.c index 0d2b326..d503669 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -1227,21 +1227,17 @@ int locks_mandatory_locked(struct file *file) /** * locks_mandatory_area - Check for a conflicting lock - * @read_write: %FLOCK_VERIFY_WRITE for exclusive access, %FLOCK_VERIFY_READ - * for shared - * @inode: the file to check * @filp: how the file was opened (if it was) - * @offset: start of area to check - * @count: length of area to check + * @start: first byte in the file to check + * @end: lastbyte in the file to check + * @write: %true if checking for write access * * Searches the inode's list of locks to find any POSIX locks which conflict. - * This function is called from rw_verify_area() and - * locks_verify_truncate(). */ -int locks_mandatory_area(int read_write, struct inode *inode, -struct file *filp, loff_t offset, -size_t count) +int locks_mandatory_area(struct file *filp, loff_t start, loff_t end, + bool write) { + struct inode *inode = file_inode(filp); struct file_lock fl; int error; bool sleep = false; @@ -1252,9 +1248,9 @@ int locks_mandatory_area(int read_write, struct inode *inode, fl.fl_flags = FL_POSIX | FL_ACCESS; if (filp && !(filp->f_flags & O_NONBLOCK)) sleep = true; - fl.fl_type = (read_write == FLOCK_VERIFY_WRITE) ? F_WRLCK : F_RDLCK; - fl.fl_start = offset; - fl.fl_end = offset + count - 1; + fl.fl_type = write ? 
F_WRLCK : F_RDLCK; + fl.fl_start = start; + fl.fl_end = end; for (;;) { if (filp) { diff --git a/fs/read_write.c b/fs/read_write.c index c81ef39..48157dd 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -396,9 +396,8 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t } if (unlikely(inode->i_flctx && mandatory_lock(inode))) { - retval = locks_mandatory_area( - read_write == READ ? FLOCK_VERIFY_READ : FLOCK_VERIFY_WRITE, - inode, file, pos, count); + retval = locks_mandatory_area(file, pos, pos + count - 1, + read_write == READ ? false : true); if (retval < 0) return retval; } diff --git a/include/linux/fs.h b/include/linux/fs.h index 870a76e..e640f791 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2030,12 +2030,9 @@ extern struct kobject *fs_kobj; #define MAX_RW_COUNT (INT_MAX & PAGE_CACHE_MASK) -#define FLOCK_VERIFY_READ 1 -#define FLOCK_VERIFY_WRITE 2 - #ifdef CONFIG_FILE_LOCKING extern int locks_mandatory_locked(struct file *); -extern int locks_mandatory_area(int, struct inode *, struct file *, loff_t, size_t); +extern int locks_mandatory_area(struct file *, loff_t, loff_t, bool); /* * Candidates for mandatory locking have the setgid bit set @@ -2068,14 +2065,16 @@ static inline int locks_verify_truncate(struct inode *inode, struct file *filp, loff_t size) { - if (inode->i_flctx && mandatory_lock(inode)) - return locks_mandatory_area( - FLOCK_VERIFY_WRITE, inode, filp, - size < inode->i_size ? size : inode->i_size, - (size < inode->i_size ? 
inode->i_size - size -: size - inode->i_size) - ); - return 0; + if (!inode->i_flctx || !mandatory_lock(inode)) + return 0; + + if (size < inode->i_size) { + return locks_mandatory_area(filp, size, inode->i_size - 1, + true); + } else { + return locks_mandatory_area(filp, inode->i_size, size - 1, + true); + } } static inline int break_lease(struct inode *inode, unsigned int mode) @@ -2144,9 +2143,8 @@ static inline int locks_mandatory_locked(struct file *file) return 0; } -static inline int locks_mandatory_area(int rw, struct inode *inode, - struct file *filp, loff_t offset, - size_t count) +static inline int locks_mandatory_area(struct file *filp, loff_t start, + loff_t end, bool write) { return 0; } -- 1.9.1 -- To unsubscribe from this
[PATCH 5/5] nfsd: implement the NFSv4.2 CLONE operation
This is basically a remote version of the btrfs CLONE operation, so the
implementation is fairly trivial.  Made even more trivial by stealing the
XDR code and general framework from Anna Schumaker's COPY prototype.

Signed-off-by: Christoph Hellwig
---
 fs/nfsd/nfs4proc.c   | 47 +++
 fs/nfsd/nfs4xdr.c    | 21 +
 fs/nfsd/vfs.c        |  8
 fs/nfsd/vfs.h        |  2 ++
 fs/nfsd/xdr4.h       | 10 ++
 include/linux/nfs4.h |  4 ++--
 6 files changed, 90 insertions(+), 2 deletions(-)

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 3ba10a3..819ad81 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1012,6 +1012,47 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 }
 
 static __be32
+nfsd4_clone(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
+		struct nfsd4_clone *clone)
+{
+	struct file *src, *dst;
+	__be32 status;
+
+	status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->save_fh,
+					    &clone->cl_src_stateid, RD_STATE,
+					    &src, NULL);
+	if (status) {
+		dprintk("NFSD: %s: couldn't process src stateid!\n", __func__);
+		goto out;
+	}
+
+	status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
+					    &clone->cl_dst_stateid, WR_STATE,
+					    &dst, NULL);
+	if (status) {
+		dprintk("NFSD: %s: couldn't process dst stateid!\n", __func__);
+		goto out_put_src;
+	}
+
+	/* fix up for NFS-specific error code */
+	if (!S_ISREG(file_inode(src)->i_mode) ||
+	    !S_ISREG(file_inode(dst)->i_mode)) {
+		status = nfserr_wrong_type;
+		goto out_put_dst;
+	}
+
+	status = nfsd4_clone_file_range(src, clone->cl_src_pos,
+			dst, clone->cl_dst_pos, clone->cl_count);
+
+out_put_dst:
+	fput(dst);
+out_put_src:
+	fput(src);
+out:
+	return status;
+}
+
+static __be32
 nfsd4_fallocate(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 		struct nfsd4_fallocate *fallocate, int flags)
 {
@@ -2281,6 +2322,12 @@ static struct nfsd4_operation nfsd4_ops[] = {
 		.op_name = "OP_DEALLOCATE",
 		.op_rsize_bop = (nfsd4op_rsize)nfsd4_only_status_rsize,
 	},
+	[OP_CLONE] = {
+		.op_func = (nfsd4op_func)nfsd4_clone,
+		.op_flags = OP_MODIFIES_SOMETHING | OP_CACHEME,
+		.op_name = "OP_CLONE",
+		.op_rsize_bop = (nfsd4op_rsize)nfsd4_only_status_rsize,
+	},
 	[OP_SEEK] = {
 		.op_func = (nfsd4op_func)nfsd4_seek,
 		.op_name = "OP_SEEK",
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 51c9e9c..924416f 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1675,6 +1675,25 @@ nfsd4_decode_fallocate(struct nfsd4_compoundargs *argp,
 }
 
 static __be32
+nfsd4_decode_clone(struct nfsd4_compoundargs *argp, struct nfsd4_clone *clone)
+{
+	DECODE_HEAD;
+
+	status = nfsd4_decode_stateid(argp, &clone->cl_src_stateid);
+	if (status)
+		return status;
+	status = nfsd4_decode_stateid(argp, &clone->cl_dst_stateid);
+	if (status)
+		return status;
+
+	READ_BUF(8 + 8 + 8);
+	p = xdr_decode_hyper(p, &clone->cl_src_pos);
+	p = xdr_decode_hyper(p, &clone->cl_dst_pos);
+	p = xdr_decode_hyper(p, &clone->cl_count);
+	DECODE_TAIL;
+}
+
+static __be32
 nfsd4_decode_seek(struct nfsd4_compoundargs *argp, struct nfsd4_seek *seek)
 {
 	DECODE_HEAD;
@@ -1785,6 +1804,7 @@ static nfsd4_dec nfsd4_dec_ops[] = {
 	[OP_READ_PLUS]	= (nfsd4_dec)nfsd4_decode_notsupp,
 	[OP_SEEK]	= (nfsd4_dec)nfsd4_decode_seek,
 	[OP_WRITE_SAME]	= (nfsd4_dec)nfsd4_decode_notsupp,
+	[OP_CLONE]	= (nfsd4_dec)nfsd4_decode_clone,
 };
 
 static inline bool
@@ -4292,6 +4312,7 @@ static nfsd4_enc nfsd4_enc_ops[] = {
 	[OP_READ_PLUS]	= (nfsd4_enc)nfsd4_encode_noop,
 	[OP_SEEK]	= (nfsd4_enc)nfsd4_encode_seek,
 	[OP_WRITE_SAME]	= (nfsd4_enc)nfsd4_encode_noop,
+	[OP_CLONE]	= (nfsd4_enc)nfsd4_encode_noop,
 };
 
 /*
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 994d66f..5411bf0 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -36,6 +36,7 @@
 #endif /* CONFIG_NFSD_V3 */
 
 #ifdef CONFIG_NFSD_V4
+#include "../internal.h"
 #include "acl.h"
 #include "idmap.h"
 #endif /* CONFIG_NFSD_V4 */
@@ -498,6 +499,13 @@ __be32 nfsd4_set_nfs4_label(struct svc_rqst *rqstp, struct svc_fh *fhp,
 }
 #endif
 
+__be32 nfsd4_clone_file_range(struct file *src, u64 src_pos, struct file *dst,
+		u64 dst_pos, u64 count)
+{
+	return nfserrno(vfs_clone_file_range(src, src_pos, dst,
+			dst_pos, count));
+}
Re: How to detect / notify when a raid drive fails?
On Thu, Nov 26, 2015, at 09:30 PM, Duncan wrote:
> What generally happens now, however, is that the btrfs will note failures
> attempting to write the device and start queuing up writes.  If the
> device reappears fast enough, btrfs will flush the queue and be back to
> normal.  Otherwise, you pretty much need to reboot and mount degraded,
> then add a device and rebalance.  (btrfs device delete missing broke some
> versions ago and just got fixed by the latest btrfs-progs-4.3.1, IIRC.)
>
> As for alerts, you'd see the pile of accumulating write errors in the
> kernel log.  Presumably you can write up a script that can alert on that
> and mail you the log or whatever, but I don't believe there's anything
> official or close to it, yet.

Great info, thanks.

Just writing a file, syncing, and reading it back sounds like the easiest
test for now, especially since I don't know what the write-failure log
entries will look like.  And setting up SMART notifications.

- Ian