Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Thu, Jun 22, 2017 at 5:52 PM, Dave Chinner wrote: > On Wed, Jun 21, 2017 at 09:07:57PM -0700, Andy Lutomirski wrote: >> On Wed, Jun 21, 2017 at 5:02 PM, Dave Chinner wrote: >> > >> > You seem to be calling the "fdatasync on every page fault" the >> >> It's the opposite of fdatasync(). It needs to sync whatever metadata >> is needed to find the data. The data doesn't need to be synced. > > So much wrong with that statement. > > Andy, what does fdatasync() do when you have a data-clean, > metadata-dirty file (e.g. you just punched a hole or preallocated > more space via fallocate())? Hint: it doesn't sync any data > because the mapping tree is clean, but it still syncs the dirty > metadata needed to access the data. > > Now, what does a file where we do direct IO writes look like? Yup, > the mapping tree always remains clean and so it's only ever going to > appear to the kernel as a *data-clean, metadata-dirty* file. So, > after a direct IO write is done, what operation do we need to run to > ensure that we can always access the data? > > Yup, it's fdatasync(). Fair enough. Except that fdatasync() goes through dax_writeback_one() (I think), which deals with cache flushes (via wb_cache_pmem()). This special type of sync shouldn't need to do that, so it's not really quite fdatasync(). >> > "lightweight" option. That's the brute-force-with-big-hammer >> > solution - it's most definitely not lightweight as every page fault >> > has extra overhead to call ->fsync(). Sure, the API is simple, but >> > the runtime overhead is significant. >> >> It's lightweight in terms of its impact on the filesystem. It doesn't >> need any persistent setup -- you can just use it. > > Well, no, that's wrong, because we have to co-ordinate multiple > concurrent accesses to the data in the kernel. What happens when > some other process writes to the file *at the same time* but does > not use userspace sync? We aren't tracking dirty regions on the > inode mapping because we've been told not to do that, so fsync() > from that other process *won't sync the data it wrote*. IOws, the > kernel has failed to provide the guarantee that userspace wants it > to provide. ... > What I'd like to avoid is creating another kernel bypass mechanism > where we allow coherency and/or integrity to be fucked up in a way that > we can't fix without giving up all the performance that the kernel > bypass provides userspace apps. Constrain the cases where kernel > bypass is allowed, and we avoid all the crappy corner cases where > our only answer to users with corrupt data is "the man page advises > application developers not to do that". Ah, I see, a DAX file makes regular write() flush out the cache automatically. But I think the situation may be fucked up integrity-wise anyway. If you make an immutable-extent DAX file and a DAX-unaware process mmaps() it and writes to the mapping, what flushes the CPU cache? Isn't part of the point of the magic immutable-extent mode that it wouldn't have to track dirty extents? Can it keep track of which mappings are DAX-aware (via an mmap() flag, I assume)? Would all mappings of a DAX immutable-extent file be forced to be uncached (or writethrough or WC or some type that allows fsync to be fast)? Can you send a link to your fallocate email? I'm having trouble finding it.
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Wed, Jun 21, 2017 at 09:07:57PM -0700, Andy Lutomirski wrote: > On Wed, Jun 21, 2017 at 5:02 PM, Dave Chinner wrote: > > > > You seem to be calling the "fdatasync on every page fault" the > > It's the opposite of fdatasync(). It needs to sync whatever metadata > is needed to find the data. The data doesn't need to be synced. So much wrong with that statement. Andy, what does fdatasync() do when you have a data-clean, metadata-dirty file (e.g. you just punched a hole or preallocated more space via fallocate())? Hint: it doesn't sync any data because the mapping tree is clean, but it still syncs the dirty metadata needed to access the data. Now, what does a file where we do direct IO writes look like? Yup, the mapping tree always remains clean and so it's only ever going to appear to the kernel as a *data-clean, metadata-dirty* file. So, after a direct IO write is done, what operation do we need to run to ensure that we can always access the data? Yup, it's fdatasync(). So, what does a DAX file that does userspace data flushes look like to the kernel? Yup, again the mapping tree always remains clean and so it's only ever going to be a *data-clean, metadata-dirty* file. It should be clear now why I said "fdatasync on every page fault" because that's exactly the mechanism we'd use to implement this functionality It should also be clear that DAX is not introducing any new data integrity problems to the filesystems that direct IO hasn't already introduced. Both DAX with userspace data sync and Direct IO writes are completely untracked by the kernel. IOWs, direct IO is a form of "kernel bypass", just like DAX+userspace data sync is. All that is different is the method by which data is written to the storage media from userspace, which in the case of DAX is via mmap rather than read/write. > > "lightweight" option. That's the brute-force-with-big-hammer > > solution - it's most definitely not lightweight as every page fault > > has extra overhead to call ->fsync(). Sure, the API is simple, but > > the runtime overhead is significant. > > It's lightweight in terms of its impact on the filesystem. It doesn't > need any persistent setup -- you can just use it. Well, no, that's wrong, because we have to co-ordinate multiple concurrent accesses to the data in the kernel. What happens when some other process writes to the file *at the same time* but does not use userspace sync? We aren't tracking dirty regions on the inode mapping because we've been told not to do that, so fsync() from that other process *won't sync the data it wrote*. IOws, the kernel has failed to provide the guarantee that userspace wants it to provide. The single mapping tree is central to the problem here - we can't mix modes of dirty tracking across different processes. Either everything uses userspace sync, or everything uses kernel controlled dirty tracking so fsync() works correctly in all cases. Put simply - dirty tracking is a per-inode function, not a per-file or per-vma function. As the direct IO kernel-bypass model demonstrates, as soon as you start considering multi-process data coherency and durability with mixed kernel+kernel bypass methods in play, lots of potential problems and issues crop up that can't easily be solved by the kernel or filesystems. We try to minimise the problems, but we don't guarantee mixed mode coherency (and hence integrity) as we've delegated data coherency and integrity responsibility to the app bypassing the kernel data coherency and integrity mechanisms. What I'd like to avoid is creating another kernel bypass mechanism where we allow coherency and/or integrity to be fucked up in a way that we can't fix without giving up all the performance that the kernel bypass provides userspace apps. Constrain the cases where kernel bypass is allowed, and we avoid all the crappy corner cases where our only answer to users with corrupt data is "the man page advises application developers not to do that". If in future we work out how to implement everything without needing immutable extents in the inode, we can relax the restrictions we've placed on userspace DAX data sync > > Even if you are considering the complexity of the APIs, it's hardly > > a "heavyweight" when it only requires a single call to fallocate() > > before mmap() to set up the immutable extents on the file... > > So what would the exact semantics be? In particular, how can it fail? > If I do the fallocate(), is it absolutely promised that the extent > map won't get out of sync between what mmap sees and what's on disk? That's precisely the guarantee I documented would be given by immutable extents in my very first proposal. > Do user programs need to worry about colliding with each other when > one does fallocate() to DAXify a file and the other does fallocate() > to unDAXify a file? Yes, it can. This was one of the reasons for putting it under privilege - so only the app has full control of the extent map changes a
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Thu, Jun 22, 2017 at 09:37:14AM +1000, Dave Chinner wrote: > On Mon, Jun 19, 2017 at 10:22:14PM -0700, Darrick J. Wong wrote: > > [add linux-xfs to the fray] > > > > On Fri, Jun 16, 2017 at 06:15:35PM -0700, Dan Williams wrote: > > > + spin_lock(&dax_lock); > > > + list_add(&d->list, &daxfiles); > > > + spin_unlock(&dax_lock); > > > + > > > + /* > > > + * We set S_SWAPFILE to gain "no truncate" / static block > > > + * allocation semantics, and S_DAXFILE so we can differentiate > > > + * traditional swapfiles and assume static block mappings in the > > > + * dax mmap path. > > > + */ > > > + inode->i_flags |= S_SWAPFILE | S_DAXFILE; > > > > Yikes. You know, I hadn't even thought about considering swap files as > > a subcase of files with immutable block maps, but here we are. Both > > swap files and DAX require absolutely stable block mappings, they are > > both (probably) intolerant of inode metadata changes (size, mtime, etc.) > > Swap files are intolerant of any metadata changes because once the > mapping has been sucked into the swapfile code, the inode is never > looked at again. DAX file data access always goes through the inode, > so they is much more tolerant of metadata changes given certain > constraints. Fair enough. > > > > Honestly, I realize we've gone back, forth, and around all over the > > place on this. I still prefer something similar to a permanent flag, > > similar to what Dave suggested, though I hate the name PMEM_IMMUTABLE > > and some of the semantics. > > > > First, a new inode flag S_IOMAP_FROZEN that means the file's block map > > cannot change. > > I've been calling it "immutable extents" - freezing has implications > that it's only temporary (i.e. freezing filesystems) and will be > followed shortly by a thaw. That isn't the case here - we truly want > the extent/block map to be immutable S_IOMAP_IMMUTABLE it is, then. > > Second, some kind of function to toggle the S_IOMAP_FROZEN flag. > > Turning it on will lock the inode, check the extent map for holes, > > shared, or unwritten bits, and bail out if it finds any, or set the > > flag. > > Hmmm, I disagree on the unwritten state here. We want swap files to > be able to use unwritten extents - it means we can preallocate the > swap file and hand it straight to swapon without having to zero it > (i.e. no I/O needed to demand allocate more swap space when memory > is very low). Also, anyone who tries to read the swap file from > userspace will be reading unwritten extents, which will always > return zeros rather than whatever is in the swap file... Now I've twisted all the way around to thinking that swap files should be /totally/ unwritten, except for the file header. :) > > Not sure if we should require CAP_LINUX_IMMUTABLE -- probably > > yes, at least at first. I don't currently have any objection to writing > > non-iomap inode metadata out to disk. > > > > Third, the flag can only be cleared if the file isn't mapped. > > How do we check this from the fs without racing? AFAICT we can't > prevent a concurrent map operation from occurring while we are > changing the state of the inode - we can only block page faults > after then inode is mapped I'd thought we could coordinate that via xfs_file_mmap, but tbh my brain paged that out a while ago. > > Fourth, the VFS entry points for things like read, write, truncate, > > utimes, fallocate, etc. all just bail out if S_IOMAP_FROZEN is set on a > > file, so that the block map cannot be modified. > > mmap is still allowed, > > as we've discussed. /Maybe/ we can allow fallocate to extend a file > > with zeroed extents (it will be slow) as I've heard murmurs about > > wanting to be able to extend a file, maybe not. > > read is fine, write should be fine as long as the iomap call can > error out operations that would require extent map modifications. Ok. > fallocate should be allowed to modify the extent map, too, because > it should be the mechanism used be applications to set up file > extents in the correct form for applications to use as immutable > (i.e. lock out page faults, allocate, zero, extend and fsync in > one atomic operation) > > Fifth, swapfiles now require the S_IOMAP_FROZEN flag since they want > > stable iomap but probably don't care about things like mtime. Maybe > > they can call iomap too. > > > > Sixth, XFS can record the S_IOMAP_FROZEN state in di_flags2 and set it > > whenever the in-core inode gets constructed. This enables us to > > prohibit reflinking and other such undesirable activity. > > *nod* > > > If we actually /do/ come up with a reference implementation for XFS, I'd > > be ok with tacking it on the end of my dev branch, which will give us a > > lng runway to try it out. The end of the dev branch is beyond > > online XFS fsck and repair and the "root metadata btrees in inodes" > > rework; since that's ~90 patches with my name on it that I cannot also > > review, it won't go in for a long tim
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Tue, Jun 20, 2017 at 09:42:55AM -0600, Ross Zwisler wrote: > On Mon, Jun 19, 2017 at 10:22:14PM -0700, Darrick J. Wong wrote: > <> > > Fourth, the VFS entry points for things like read, write, truncate, > > utimes, fallocate, etc. all just bail out if S_IOMAP_FROZEN is set on a > > file, so that the block map cannot be modified. mmap is still allowed, > > as we've discussed. /Maybe/ we can allow fallocate to extend a file > > with zeroed extents (it will be slow) as I've heard murmurs about > > wanting to be able to extend a file, maybe not. > > Read and write should still be allowed, right? I had thought the usage model was pretty slanted towards mmap, but it's not a big deal to turn read/writes into glorified memcpy, provided we reject the io request if it goes past EOF. --D > -- > To unsubscribe from this list: send the line "unsubscribe linux-api" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Wed, Jun 21, 2017 at 5:02 PM, Dave Chinner wrote: > > You seem to be calling the "fdatasync on every page fault" the It's the opposite of fdatasync(). It needs to sync whatever metadata is needed to find the data. The data doesn't need to be synced. > "lightweight" option. That's the brute-force-with-big-hammer > solution - it's most definitely not lightweight as every page fault > has extra overhead to call ->fsync(). Sure, the API is simple, but > the runtime overhead is significant. It's lightweight in terms of its impact on the filesystem. It doesn't need any persistent setup -- you can just use it. > Even if you are considering the complexity of the APIs, it's hardly > a "heavyweight" when it only requires a single call to fallocate() > before mmap() to set up the immutable extents on the file... So what would the exact semantics be? In particular, how can it fail? If I do the fallocate(), is it absolutely promised that the extent map won't get out of sync between what mmap sees and what's on disk? Do user programs need to worry about colliding with each other when one does fallocate() to DAXify a file and the other does fallocate() to unDAXify a file? Does this particular fallocate() call still keep its effect after a reboot? These issues are why I think it would be nicer to have an API that makes a particular mapping or fd be unconditionally *correct* and then to provide something else that makes it avoid latency spikes. Is there an actual concrete proposal that's reviewable?
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Tue, Jun 20, 2017 at 10:18:24PM -0700, Andy Lutomirski wrote: > On Tue, Jun 20, 2017 at 6:40 PM, Dave Chinner wrote: > >> A per-inode > >> count of the number of live DAX mappings or of the number of struct > >> file instances that have requested DAX would work here. > > > > For what purpose does this serve? The reflink invalidates all the > > existing mappings, so the next write access causes a fault and then > > page_mkwrite is called and the shared extent will get COWed > > The same purpose as XFS's FS_XFLAG_DAX (assuming I'm understanding it > right), except that IMO an API that doesn't involve making a change to > an inode that sticks around would be nice. The inode flag has the > unfortunate property that, if two different programs each try to set > the flag, mmap, write, and clear the flag, they'll stomp on each other > and risk data corruption. > > I admit I'm now thoroughly confused as to exactly what XFS does here > -- does FS_XFLAG_DAX persist across unmount/mount? Yes, it is. i.e. DAX on XFS does not rely on a naive fs-wide mount option. You can have applications on pmem filesystems use either DAX or normal IO based on directory/inode flags. Something doesn't work with DAX, so just remove the DAX flags from the directories/inodes, and it will safely and transparently switch to page-cache based IO. > Here's the overall point I'm trying to make: unprivileged programs > that want to write to DAX files with userspace commit mechanisms > (CLFLUSHOPT;SFENCE, etc) should be able to do so reliably, without > privilege, and with reasonably clean APIs. Ideally they could do this > to any file they have write access to. The privilege argument is irrelevant now - it was /suggested/ initially as a way of preventing people from shooting themselves in the foot based on the immutable file model. It's clear that's not desired, and it's not a show stopper. > Programs that want to write to > mmapped files, DAX or otherwise, without latency spikes due to > .page_mkwrite should be able to opt in to a heavier weight mechanism. > But these two issues are someone independent, and I think they should > be solved separately. You seem to be calling the "fdatasync on every page fault" the "lightweight" option. That's the brute-force-with-big-hammer solution - it's most definitely not lightweight as every page fault has extra overhead to call ->fsync(). Sure, the API is simple, but the runtime overhead is significant. The lightweight *runtime* option is to set up the file in such a way that there is never any extra overhead at page fault time. This is what immutable extent maps provide. Indeed, because the mappings never change, you could use hardware dirty tracking if you wanted, as there's no need to look up the filesystem to do writeback as everything needed for writeback was mapped at page fault time. This "map first and then just write when you need to" is *exactly how swap files work*. Even if you are considering the complexity of the APIs, it's hardly a "heavyweight" when it only requires a single call to fallocate() before mmap() to set up the immutable extents on the file... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Mon, Jun 19, 2017 at 10:22:14PM -0700, Darrick J. Wong wrote: > [add linux-xfs to the fray] > > On Fri, Jun 16, 2017 at 06:15:35PM -0700, Dan Williams wrote: > > + spin_lock(&dax_lock); > > + list_add(&d->list, &daxfiles); > > + spin_unlock(&dax_lock); > > + > > + /* > > +* We set S_SWAPFILE to gain "no truncate" / static block > > +* allocation semantics, and S_DAXFILE so we can differentiate > > +* traditional swapfiles and assume static block mappings in the > > +* dax mmap path. > > +*/ > > + inode->i_flags |= S_SWAPFILE | S_DAXFILE; > > Yikes. You know, I hadn't even thought about considering swap files as > a subcase of files with immutable block maps, but here we are. Both > swap files and DAX require absolutely stable block mappings, they are > both (probably) intolerant of inode metadata changes (size, mtime, etc.) Swap files are intolerant of any metadata changes because once the mapping has been sucked into the swapfile code, the inode is never looked at again. DAX file data access always goes through the inode, so they is much more tolerant of metadata changes given certain constraints. > Honestly, I realize we've gone back, forth, and around all over the > place on this. I still prefer something similar to a permanent flag, > similar to what Dave suggested, though I hate the name PMEM_IMMUTABLE > and some of the semantics. > > First, a new inode flag S_IOMAP_FROZEN that means the file's block map > cannot change. I've been calling it "immutable extents" - freezing has implications that it's only temporary (i.e. freezing filesystems) and will be followed shortly by a thaw. That isn't the case here - we truly want the extent/block map to be immutable > Second, some kind of function to toggle the S_IOMAP_FROZEN flag. > Turning it on will lock the inode, check the extent map for holes, > shared, or unwritten bits, and bail out if it finds any, or set the > flag. Hmmm, I disagree on the unwritten state here. We want swap files to be able to use unwritten extents - it means we can preallocate the swap file and hand it straight to swapon without having to zero it (i.e. no I/O needed to demand allocate more swap space when memory is very low). Also, anyone who tries to read the swap file from userspace will be reading unwritten extents, which will always return zeros rather than whatever is in the swap file... > Not sure if we should require CAP_LINUX_IMMUTABLE -- probably > yes, at least at first. I don't currently have any objection to writing > non-iomap inode metadata out to disk. > > Third, the flag can only be cleared if the file isn't mapped. How do we check this from the fs without racing? AFAICT we can't prevent a concurrent map operation from occurring while we are changing the state of the inode - we can only block page faults after then inode is mapped > Fourth, the VFS entry points for things like read, write, truncate, > utimes, fallocate, etc. all just bail out if S_IOMAP_FROZEN is set on a > file, so that the block map cannot be modified. > mmap is still allowed, > as we've discussed. /Maybe/ we can allow fallocate to extend a file > with zeroed extents (it will be slow) as I've heard murmurs about > wanting to be able to extend a file, maybe not. read is fine, write should be fine as long as the iomap call can error out operations that would require extent map modifications. fallocate should be allowed to modify the extent map, too, because it should be the mechanism used be applications to set up file extents in the correct form for applications to use as immutable (i.e. lock out page faults, allocate, zero, extend and fsync in one atomic operation) > Fifth, swapfiles now require the S_IOMAP_FROZEN flag since they want > stable iomap but probably don't care about things like mtime. Maybe > they can call iomap too. > > Sixth, XFS can record the S_IOMAP_FROZEN state in di_flags2 and set it > whenever the in-core inode gets constructed. This enables us to > prohibit reflinking and other such undesirable activity. *nod* > If we actually /do/ come up with a reference implementation for XFS, I'd > be ok with tacking it on the end of my dev branch, which will give us a > lng runway to try it out. The end of the dev branch is beyond > online XFS fsck and repair and the "root metadata btrees in inodes" > rework; since that's ~90 patches with my name on it that I cannot also > review, it won't go in for a long time indeed! I don't think it's so complex to need such a long dev time - all the infrastructure we need is pretty much there already... > (Yes, that was also sort of a plea for someone to go review the XFS > scrub patches.) > > > + return 0; > > +} > > + > > +SYSCALL_DEFINE3(daxctl, const char __user *, path, int, flags, int, align) > > I was /about/ to grouse about this syscall, then realized that maybe it > /is/ useful to be able to check a specific alignment. Maybe not, since > I had something more
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Tue, Jun 20, 2017 at 6:40 PM, Dave Chinner wrote: > Your mangling terminology here. We don't "break COW" - we *use* > copy-on-write to break *extent sharing*. We can break extent sharing > in page_mkwrite - that's exactly what we do for normal pagecache > based mmap writes, and it's done in page_mkwrite. Right, my bad. > > It hasn't been enabled it for DAX yet because it simply hasn't been > robustly tested yet. > >> A per-inode >> count of the number of live DAX mappings or of the number of struct >> file instances that have requested DAX would work here. > > For what purpose does this serve? The reflink invalidates all the > existing mappings, so the next write access causes a fault and then > page_mkwrite is called and the shared extent will get COWed The same purpose as XFS's FS_XFLAG_DAX (assuming I'm understanding it right), except that IMO an API that doesn't involve making a change to an inode that sticks around would be nice. The inode flag has the unfortunate property that, if two different programs each try to set the flag, mmap, write, and clear the flag, they'll stomp on each other and risk data corruption. I admit I'm now thoroughly confused as to exactly what XFS does here -- does FS_XFLAG_DAX persist across unmount/mount? > >> - Trying to use DAX on a file that is already reflinked. The order >> of operations doesn't matter hugely, except that breaking COW for the >> entire range in question all at once would be faster and result in >> better allocation. > > We have COW extent size hints for that. i.e. if you want to COW a > huge page at a time, set the COW extent size hint to the huge page > size... Nifty. > Apparently it is. There are people telling us that mtime > updates in page faults introduce too much unpredictable latency and > that screws over their low latency real time applications. I was one of those, and I even wrote patches. I should try to dust them off. > > Those same people are telling use that dirty tracking in page faults > for msync/fsync on DAX is too heavyweight and calling msync is too > onerous and has unpredictable latencies because it might result in > having to sync tens of thousands of unrelated dirty objects. Hence > they want to use userspace data sync primitives to avoid this > overhead and so filesystems need to make it possible to provide this > userspace idata sync capability. If I were using DAX in production, I'd have exactly this issue. Let me quote myself: On Tue, Jun 20, 2017 at 9:14 AM, Andy Lutomirski wrote: > 3. (Not strictly related to DAX.) A way to tell the kernel "I have > this file mmapped for write. Please go out of your way to avoid > future page faults." I've wanted this for ordinary files on ext4. > The kernel could, but presently does not, use hardware dirty tracking > instead of software dirty tracking to decide when to write the page > back. The kernel could also, in principle, write back dirty pages > without ever write protecting them. For DAX, this might change > behavior to prevent any operation that would relocate blocks or to > have the kernel go out of its way to only do such operations when > absolutely necessary and to immediately update and unwriteprotect the > relevant pages. I agree that this is a real issue, but it's not limited to DAX. I've wanted a mode where I tell the kernel "I'm a high-performance application mmapping this file and I'm going to write to it a lot. Do your best to avoid any page faults, even if it adversely affects the performance of the system." This mode could do lots of things. It could cause the system to leave the page writable even after writeback and, if possible, to use hardware dirty tracking. It could cause the system to make a copy of the page and write back from the copy if there is anything in play that could need stable pages during writeback. And, for DAX, it could tell the system to keep the page pinned and disallow moving it and reflinking it. (Of course, the above requires that we either deal with mtime like my patches do or that this heavyweight mechanism disable mtime updates. I prefer the former.) Here's the overall point I'm trying to make: unprivileged programs that want to write to DAX files with userspace commit mechanisms (CLFLUSHOPT;SFENCE, etc) should be able to do so reliably, without privilege, and with reasonably clean APIs. Ideally they could do this to any file they have write access to. Programs that want to write to mmapped files, DAX or otherwise, without latency spikes due to .page_mkwrite should be able to opt in to a heavier weight mechanism. But these two issues are someone independent, and I think they should be solved separately.
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Tue, Jun 20, 2017 at 06:24:03PM -0700, Darrick J. Wong wrote: > On Wed, Jun 21, 2017 at 09:53:46AM +1000, Dave Chinner wrote: > > On Tue, Jun 20, 2017 at 09:17:36AM -0700, Dan Williams wrote: > > > An immutable-extent DAX-file and a reflink-capable DAX-file are not > > > mutually exclusive, > > > > Actually, they are mutually exclusive: when the immutable extent DAX > > inode is breaking the extent sharing done during the reflink > > operation, the copy-on-write operation requires allocating and > > freeing extents on the inode that has immutable extents. Which, if > > the inode really has immutable extents, cannot be done. > > > > That said, if the extent sharing is broken on the other side of the > > reflink (i.e. the non-immutable inode created by the reflink) then > > the extent map of the inode with immutable extents will remain > > unchanged. i.e. there are two sides to this, and if you only see one > > side you might come to the wrong conclusion. > > > > However, we cannot guarantee that no writes occur to the inode with > > immutable extent maps (especially as the whole point is to allow > > userspace writes and commits without the kernel being involved), so > > extent sharing on immutable extent maps cannot be allowed... > > Just to play devil's advocate... > > /If/ you have rmap and /if/ you discover that there's only one > IOMAP_IMMUTABLE file owning this same block and /if/ you're willing to > relocate every other mapping on the whole filesystem, /then/ you could > /in theory/ support shared daxfiles. I figured that nobody apart from experienced filesystem developers would understand the complexities of rmap and refcounts and how they could be abused to do this. I also assumed that that people like you would understand this is possible but completely impractical > However, that's so many on-disk metadata lookups to shove into a > pagefault handler that I don't think anyone in XFSland would entertain > such an ugly fantasy. You'd be making a lot of metadata requests, and > you'd have to lock the rmapbt while grabbing inodes, which is insane. Exactly. But while I understand this, consider the amount of assumed filesystem and XFS knowledge in that one simple paragraph. Most non-experts would have stopped *understanding* at "/If/ you have rmap" and go away with the wrong ideas in their heads. Hence I now tend to omit mentioning "possible but impractical" things in mixed expertise discussions > Much easier to have a per-inode flag that says "the block map of this > file does not change" and put up with the restricted semantics. In a nutshell. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Tue, Jun 20, 2017 at 09:14:24AM -0700, Andy Lutomirski wrote: > On Tue, Jun 20, 2017 at 3:11 AM, Dave Chinner wrote: > > On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote: > >> On Mon, Jun 19, 2017 at 5:46 PM, Dave Chinner wrote: > >> > On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote: > >> >> Second: syncing extents. Here's a straw man. Forget the mmap() flag. > >> >> Instead add a new msync() operation: > >> >> > >> >> msync(start, length, MSYNC_PMEM_PREPARE_WRITE); > >> > > >> > How's this any different from the fallocate command I proposed to do > >> > this (apart from the fact that msync() is not intended to be abused > >> > as a file preallocation/zeroing interface)? > >> > >> I must have missed that suggestion. > >> > >> But it's different in a major way. fallocate() takes an fd parameter, > >> which means that, if some flag gets set, it's set on the struct file. > > > > DAX is a property of the inode, not the VMA or struct file as it > > needs to be consistent across all VMAs and struct files that > > reference that inode. Also, fallocate() manipulates state and > > metadata hidden behind the struct inode, not the struct file, so it > > seems to me like the right API to use. > > I'm not sure I see why. I can think of a few different scenarios: > > - Reflink while a DAX-using program is running. It would be nice for > performance, but not critical for functionality, if trying to reflink > a file that's mapped for DAX would copy instead of COWing. But > breaking COW on the next page_mkwrite would work, too. Your mangling terminology here. We don't "break COW" - we *use* copy-on-write to break *extent sharing*. We can break extent sharing in page_mkwrite - that's exactly what we do for normal pagecache based mmap writes, and it's done in page_mkwrite. It hasn't been enabled it for DAX yet because it simply hasn't been robustly tested yet. > A per-inode > count of the number of live DAX mappings or of the number of struct > file instances that have requested DAX would work here. For what purpose does this serve? The reflink invalidates all the existing mappings, so the next write access causes a fault and then page_mkwrite is called and the shared extent will get COWed > - Trying to use DAX on a file that is already reflinked. The order > of operations doesn't matter hugely, except that breaking COW for the > entire range in question all at once would be faster and result in > better allocation. We have COW extent size hints for that. i.e. if you want to COW a huge page at a time, set the COW extent size hint to the huge page size... > - Checksumming filesystems. I think it's basically impossible to do > DAX writes to a file like this. No shit, sherlock. See my previous comments about compression and encryption. DAX cannot be used where in-place data manipulations would be required by the IO path. > A filesystem could skip checksumming > on extents that are actively mapped for DAX, but something like chattr > to tell, say, btrfs that a given file is intended for DAX is probably > better. (But, if so, it should presumably be persistent like chattr > and not something that's purely in memory like daxctl().) You can already do this on btrfs regardless of DAX. There's an inode flag: #define BTRFS_INODE_NODATASUM (1 << 0) And you set it by turning off copy-on-write when the file is empty. i.e. just after creation, run: ioctl(FS_IOC_GETFLAGS, &flags). flags |= FS_NOCOW_FL; ioctl(FS_IOC_SETFLAGS, &flags). You're talking about stuff that already exists - stop trying to tell me about basic filesystem functionality and DAX requirements that I understood years ago. Please start with the assumption that I know a lot more about this than you do, and if there's something you write that I don't understand I'll ask you to explain > - Defrag and such on an otherwise simple filesystem. Defragging a > file while it's not actively mapped for DAX should be allowed. > Defragging a file while it is actively mapped for DAX may be slow, but > it a filesystem could certainly make it work. FYI, XFS already does this - it skips mapped files altogether, so DAX state simply doesn't matter here. We also have the FS_NODEFRAG ioctl flag, so admins can choose to mark files that they want defrag to skip... > > And, as mmap() requires a fd to set up the mapping and fallocate() > > would have to be run *before* mmap() is used to access the data > > directly, I don't see why using fallocate would be a problem here... > > Fair enough. But ISTM fallocate() has no business setting up > important state in the struct file. It's an operation on inodes. Yup, the state in question is kept in inode->i_flags rather than duplicated into the struct file. Which leads to code like this in the page fault handlers. e.g. in xfs_filemap_page_mkwrite(): struct inode*inode = file_inode(vma->vm_file); if (I
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Wed, Jun 21, 2017 at 09:53:46AM +1000, Dave Chinner wrote: > On Tue, Jun 20, 2017 at 09:17:36AM -0700, Dan Williams wrote: > > On Tue, Jun 20, 2017 at 1:49 AM, Christoph Hellwig wrote: > > > [stripped giant fullquotes] > > > > > > On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote: > > >> But that's my whole point. The kernel doesn't really need to prevent > > >> all these background maintenance operations -- it just needs to block > > >> .page_mkwrite until they are synced. I think that whatever new > > >> mechanism we add for this should be sticky, but I see no reason why > > >> the filesystem should have to block reflink on a DAX file entirely. > > > > > > Agreed - IFF we want to support write through semantics this is the > > > only somewhat feasible way. It still has massive downsides of forcing > > > the full sync machinery to run from the page fauly handler, which > > > I'm rather scared off, but that's still better than creating a magic > > > special case that isn't managable at all. > > > > An immutable-extent DAX-file and a reflink-capable DAX-file are not > > mutually exclusive, > > Actually, they are mutually exclusive: when the immutable extent DAX > inode is breaking the extent sharing done during the reflink > operation, the copy-on-write operation requires allocating and > freeing extents on the inode that has immutable extents. Which, if > the inode really has immutable extents, cannot be done. > > That said, if the extent sharing is broken on the other side of the > reflink (i.e. the non-immutable inode created by the reflink) then > the extent map of the inode with immutable extents will remain > unchanged. i.e. there are two sides to this, and if you only see one > side you might come to the wrong conclusion. > > However, we cannot guarantee that no writes occur to the inode with > immutable extent maps (especially as the whole point is to allow > userspace writes and commits without the kernel being involved), so > extent sharing on immutable extent maps cannot be allowed... Just to play devil's advocate... /If/ you have rmap and /if/ you discover that there's only one IOMAP_IMMUTABLE file owning this same block and /if/ you're willing to relocate every other mapping on the whole filesystem, /then/ you could /in theory/ support shared daxfiles. However, that's so many on-disk metadata lookups to shove into a pagefault handler that I don't think anyone in XFSland would entertain such an ugly fantasy. You'd be making a lot of metadata requests, and you'd have to lock the rmapbt while grabbing inodes, which is insane. Much easier to have a per-inode flag that says "the block map of this file does not change" and put up with the restricted semantics. --D > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Tue, Jun 20, 2017 at 09:17:36AM -0700, Dan Williams wrote: > On Tue, Jun 20, 2017 at 1:49 AM, Christoph Hellwig wrote: > > [stripped giant fullquotes] > > > > On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote: > >> But that's my whole point. The kernel doesn't really need to prevent > >> all these background maintenance operations -- it just needs to block > >> .page_mkwrite until they are synced. I think that whatever new > >> mechanism we add for this should be sticky, but I see no reason why > >> the filesystem should have to block reflink on a DAX file entirely. > > > > Agreed - IFF we want to support write through semantics this is the > > only somewhat feasible way. It still has massive downsides of forcing > > the full sync machinery to run from the page fauly handler, which > > I'm rather scared off, but that's still better than creating a magic > > special case that isn't managable at all. > > An immutable-extent DAX-file and a reflink-capable DAX-file are not > mutually exclusive, Actually, they are mutually exclusive: when the immutable extent DAX inode is breaking the extent sharing done during the reflink operation, the copy-on-write operation requires allocating and freeing extents on the inode that has immutable extents. Which, if the inode really has immutable extents, cannot be done. That said, if the extent sharing is broken on the other side of the reflink (i.e. the non-immutable inode created by the reflink) then the extent map of the inode with immutable extents will remain unchanged. i.e. there are two sides to this, and if you only see one side you might come to the wrong conclusion. However, we cannot guarantee that no writes occur to the inode with immutable extent maps (especially as the whole point is to allow userspace writes and commits without the kernel being involved), so extent sharing on immutable extent maps cannot be allowed... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Tue, Jun 20, 2017 at 9:17 AM, Dan Williams wrote: > On Tue, Jun 20, 2017 at 1:49 AM, Christoph Hellwig wrote: >> [stripped giant fullquotes] >> >> On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote: >>> But that's my whole point. The kernel doesn't really need to prevent >>> all these background maintenance operations -- it just needs to block >>> .page_mkwrite until they are synced. I think that whatever new >>> mechanism we add for this should be sticky, but I see no reason why >>> the filesystem should have to block reflink on a DAX file entirely. >> >> Agreed - IFF we want to support write through semantics this is the >> only somewhat feasible way. It still has massive downsides of forcing >> the full sync machinery to run from the page fauly handler, which >> I'm rather scared off, but that's still better than creating a magic >> special case that isn't managable at all. > > An immutable-extent DAX-file and a reflink-capable DAX-file are not > mutually exclusive, and I have yet to hear a need for reflink support > without fsync/msync. Instead I have heard the need for an immutable > file for RDMA purposes, especially for hardware that can't trigger an > mmu fault. The special management of an immutable file is acceptable > to get these capabilities. I guess this applies to any user of get_user_pages() on a DAX-mapped file. Hmm.
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Tue, Jun 20, 2017 at 1:49 AM, Christoph Hellwig wrote: > [stripped giant fullquotes] > > On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote: >> But that's my whole point. The kernel doesn't really need to prevent >> all these background maintenance operations -- it just needs to block >> .page_mkwrite until they are synced. I think that whatever new >> mechanism we add for this should be sticky, but I see no reason why >> the filesystem should have to block reflink on a DAX file entirely. > > Agreed - IFF we want to support write through semantics this is the > only somewhat feasible way. It still has massive downsides of forcing > the full sync machinery to run from the page fauly handler, which > I'm rather scared off, but that's still better than creating a magic > special case that isn't managable at all. An immutable-extent DAX-file and a reflink-capable DAX-file are not mutually exclusive, and I have yet to hear a need for reflink support without fsync/msync. Instead I have heard the need for an immutable file for RDMA purposes, especially for hardware that can't trigger an mmu fault. The special management of an immutable file is acceptable to get these capabilities.
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Tue, Jun 20, 2017 at 3:11 AM, Dave Chinner wrote: > On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote: >> On Mon, Jun 19, 2017 at 5:46 PM, Dave Chinner wrote: >> > On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote: >> >> Second: syncing extents. Here's a straw man. Forget the mmap() flag. >> >> Instead add a new msync() operation: >> >> >> >> msync(start, length, MSYNC_PMEM_PREPARE_WRITE); >> > >> > How's this any different from the fallocate command I proposed to do >> > this (apart from the fact that msync() is not intended to be abused >> > as a file preallocation/zeroing interface)? >> >> I must have missed that suggestion. >> >> But it's different in a major way. fallocate() takes an fd parameter, >> which means that, if some flag gets set, it's set on the struct file. > > DAX is a property of the inode, not the VMA or struct file as it > needs to be consistent across all VMAs and struct files that > reference that inode. Also, fallocate() manipulates state and > metadata hidden behind the struct inode, not the struct file, so it > seems to me like the right API to use. I'm not sure I see why. I can think of a few different scenarios: - Reflink while a DAX-using program is running. It would be nice for performance, but not critical for functionality, if trying to reflink a file that's mapped for DAX would copy instead of COWing. But breaking COW on the next page_mkwrite would work, too. A per-inode count of the number of live DAX mappings or of the number of struct file instances that have requested DAX would work here. - Trying to use DAX on a file that is already reflinked. The order of operations doesn't matter hugely, except that breaking COW for the entire range in question all at once would be faster and result in better allocation. - Checksumming filesystems. I think it's basically impossible to do DAX writes to a file like this. A filesystem could skip checksumming on extents that are actively mapped for DAX, but something like chattr to tell, say, btrfs that a given file is intended for DAX is probably better. (But, if so, it should presumably be persistent like chattr and not something that's purely in memory like daxctl().) - Defrag and such on an otherwise simple filesystem. Defragging a file while it's not actively mapped for DAX should be allowed. Defragging a file while it is actively mapped for DAX may be slow, but it a filesystem could certainly make it work. - RAID. Not going to work. > > And, as mmap() requires a fd to set up the mapping and fallocate() > would have to be run *before* mmap() is used to access the data > directly, I don't see why using fallocate would be a problem here... Fair enough. But ISTM fallocate() has no business setting up important state in the struct file. It's an operation on inodes. Looking at the above scenarios, it seems to may that two or three separate mechanisms may be ideal here. 1. The most important: some way to tell the kernel that an open file description or an mmap or some subrange thereof is going to be used for DAX writes and for the kernel to respond by saying "yes, okay" or "no, not okay". This should be 100% reliable, which means that all the corner cases have to work. This means that, if one task says "make this file work for DAX" and another task extends the file using truncate (without calling fallocate), then whatever the kernel promised to the first task should remain true. 2. Some way to tell filesystems like btrfs to make a file that will be DAX-able. chattr +C might already fit the bill. Without this, I'd expect the normal incantation to DAX-map a file on btrfs to return an error. 3. (Not strictly related to DAX.) A way to tell the kernel "I have this file mmapped for write. Please go out of your way to avoid future page faults." I've wanted this for ordinary files on ext4. The kernel could, but presently does not, use hardware dirty tracking instead of software dirty tracking to decide when to write the page back. The kernel could also, in principle, write back dirty pages without ever write protecting them. For DAX, this might change behavior to prevent any operation that would relocate blocks or to have the kernel go out of its way to only do such operations when absolutely necessary and to immediately update and unwriteprotect the relevant pages. (3) is optional and could be delayed to the distant future. It's not needed for correctness. I really don't want to see a scenario where DAX works if you use some fancy special-purpose library exactly as intended but occasionally eats your data if you don't use the library exactly as intended. Getting a valid DAX mapping, writing to it, and doing CLFLUSHOPT;SFENCE must make that write durable no matter what (unless the underlying hardware actually fails). > The MAP_SYNC proposal is effectively "run the metadata side of > fdatasync() on every page fault". If the inode is not metadata > dirty, then it will do n
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Mon, Jun 19, 2017 at 10:22:14PM -0700, Darrick J. Wong wrote: <> > Fourth, the VFS entry points for things like read, write, truncate, > utimes, fallocate, etc. all just bail out if S_IOMAP_FROZEN is set on a > file, so that the block map cannot be modified. mmap is still allowed, > as we've discussed. /Maybe/ we can allow fallocate to extend a file > with zeroed extents (it will be slow) as I've heard murmurs about > wanting to be able to extend a file, maybe not. Read and write should still be allowed, right?
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote: > On Mon, Jun 19, 2017 at 5:46 PM, Dave Chinner wrote: > > On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote: > >> Second: syncing extents. Here's a straw man. Forget the mmap() flag. > >> Instead add a new msync() operation: > >> > >> msync(start, length, MSYNC_PMEM_PREPARE_WRITE); > > > > How's this any different from the fallocate command I proposed to do > > this (apart from the fact that msync() is not intended to be abused > > as a file preallocation/zeroing interface)? > > I must have missed that suggestion. > > But it's different in a major way. fallocate() takes an fd parameter, > which means that, if some flag gets set, it's set on the struct file. DAX is a property of the inode, not the VMA or struct file as it needs to be consistent across all VMAs and struct files that reference that inode. Also, fallocate() manipulates state and metadata hidden behind the struct inode, not the struct file, so it seems to me like the right API to use. And, as mmap() requires a fd to set up the mapping and fallocate() would have to be run *before* mmap() is used to access the data directly, I don't see why using fallocate would be a problem here... > >> If this operation succeeds, it guarantees that all future writes > >> through this mapping on this range will hit actual storage and that > >> all the metadata operations needed to make this write persistent will > >> hit storage such that they are ordered before the user's writes. > >> As an implementation detail, this will flush out the extents if > >> needed. In addition, if the FS has any mechanism that would cause > >> problems asyncronously later on (dedupe? deallocated extents full > >> of zeros? defrag?), > > > > Hole punch, truncate, reflink, dedupe, snapshots, scrubbing and > > other background filesystem maintenance operations, etc can all > > change the extent layout asynchronously if there's no mechanism to > > tell the fs not to modify the inode extent layout. > > But that's my whole point. The kernel doesn't really need to prevent > all these background maintenance operations -- it just needs to block > .page_mkwrite until they are synced. I think that whatever new > mechanism we add for this should be sticky, but I see no reason why > the filesystem should have to block reflink on a DAX file entirely. I understand the problem quite well, thank you very much. Yes, COW operations (and other things) can be handled by invalidating DAX mappings and blocking new page faults. I see little difference between this and running the sync path after page-mkwrite has triggered filesystem metadata changes (e.g. block allocation). i.e. If MAP_SYNC is going to be used, then all the things you are talking about comes along for the ride via invalidations. The MAP_SYNC proposal is effectively "run the metadata side of fdatasync() on every page fault". If the inode is not metadata dirty, then it will do nothing, otherwise it will do what it needs to stabilise the inode for userspace to be able to sync the data and it will block until it is done. Prediction for the MAP_SYNC future: frequent bug reports about huge, unpredictable page fault latencies on DAX files because every so often a page fault is required to sync tens of thousands of unrelated dirty objects because of filesystem journal ordering constraints Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
[stripped giant fullquotes] On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote: > But that's my whole point. The kernel doesn't really need to prevent > all these background maintenance operations -- it just needs to block > .page_mkwrite until they are synced. I think that whatever new > mechanism we add for this should be sticky, but I see no reason why > the filesystem should have to block reflink on a DAX file entirely. Agreed - IFF we want to support write through semantics this is the only somewhat feasible way. It still has massive downsides of forcing the full sync machinery to run from the page fauly handler, which I'm rather scared off, but that's still better than creating a magic special case that isn't managable at all. > If, instead, we had a nice unprivileged per-vma or per-fd mechanism to > tell the filesystem that I want DAX durability, I could just use it > without any fuss. If it worked on ext4 before it worked on xfs, then > I'd use ext4. If it ended up being heavier weight on XFS than it was > on ext4 because XFS needed to lock down the extent map for the inode > whereas ext4 could manage it through .page_mkwrite(), then I'd > benchmark it and see which fs would win. (For my particular use case, > I doubt it would matter, since I aggressively offload fs metadata > operations to a thread whose performance I don't really care about.) ext4 and XFS have the same fundamental issue: both have a file system wide log of modified data that needs to be flushed to stable storage to ensure everything is safe. So if you solve the issue for one of them you've solved it for the other one as well modulo implementation details.
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Mon, Jun 19, 2017 at 5:46 PM, Dave Chinner wrote: > On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote: >> On Mon, Jun 19, 2017 at 6:21 AM, Dave Chinner wrote: >> > On Sat, Jun 17, 2017 at 10:05:45PM -0700, Andy Lutomirski wrote: >> >> On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams >> >> wrote: >> >> > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski >> >> > wrote: >> >> >> My other objection is that the syscall intentionally leaks a reference >> >> >> to the file. This means it needs overflow protection and it probably >> >> >> shouldn't ever be allowed to use it without privilege. >> >> > >> >> > We only hold the one reference while S_DAXFILE is set, so I think the >> >> > protection is there, and per Dave's original proposal this requires >> >> > CAP_LINUX_IMMUTABLE. >> >> > >> >> >> Why can't the underlying issue be easily fixed, though? Could >> >> >> .page_mkwrite just make sure that metadata is synced when the FS uses >> >> >> DAX? >> >> > >> >> > Yes, it most definitely could and that idea has been floated. >> >> > >> >> >> On a DAX fs, syncing metadata should be extremely fast. >> > >> > >> > >> > This again >> > >> > Persistent memory means the *I/O* is fast. It does not mean that >> > *complex filesystem operations* are fast. >> > >> > Don't forget that there's an shitload of CPU that gets burnt to make >> > sure that the metadata is synced correctly. Do that /synchronously/ >> > on *every* write page fault (which, BTW, modify mtime, so will >> > always have dirty metadata to sync) and now you have a serious >> > performance problem with your "fast" DAX access method. >> >> I think the mtime issue can and should be solved separately. But it' >> s a fair point that there would be workloads for which this could be >> excessively expensive. In particular, simply creating a file, >> mmapping a large range, and touching the pages one by one -- delalloc >> would be completely defeated. >> >> But here's a strawman for solving both issues. First, mtime. I >> consider it to be either a bug or a misfeature that .page_mkwrite >> *ever* dirties an inode just to update mtime. I have old patches to >> fix this, and those patches could be updated and merged. With them >> applied, there's just a set_bit() in .page_mkwrite() to handle mtime. >> >> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=mmap_mtime/patch_v4 > > Yup, I remember that - it delays the update to data writeback time, > IOWs the proposed MAP_SYNC page fault semantics result in the same > (poor) behaviour because the sync operation will trigger mtime > updates instead of the page fault. > > Unless, of course, you are implying that MAP_SYNC should not > actually sync all known dirty metadata on an inode. > > > >> Second: syncing extents. Here's a straw man. Forget the mmap() flag. >> Instead add a new msync() operation: >> >> msync(start, length, MSYNC_PMEM_PREPARE_WRITE); > > How's this any different from the fallocate command I proposed to do > this (apart from the fact that msync() is not intended to be abused > as a file preallocation/zeroing interface)? I must have missed that suggestion. But it's different in a major way. fallocate() takes an fd parameter, which means that, if some flag gets set, it's set on the struct file. The persistence property seems to me like it belongs on the vma, not the file. But it doesn't have to be msync() -- it could be madvise or even a new mallocate(). (Although mallocate() is possible the worst name ever.) > >> If this operation succeeds, it guarantees that all future writes >> through this mapping on this range will hit actual storage and that >> all the metadata operations needed to make this write persistent will >> hit storage such that they are ordered before the user's writes. >> As an implementation detail, this will flush out the extents if >> needed. In addition, if the FS has any mechanism that would cause >> problems asyncronously later on (dedupe? deallocated extents full >> of zeros? defrag?), > > Hole punch, truncate, reflink, dedupe, snapshots, scrubbing and > other background filesystem maintenance operations, etc can all > change the extent layout asynchronously if there's no mechanism to > tell the fs not to modify the inode extent layout. But that's my whole point. The kernel doesn't really need to prevent all these background maintenance operations -- it just needs to block .page_mkwrite until they are synced. I think that whatever new mechanism we add for this should be sticky, but I see no reason why the filesystem should have to block reflink on a DAX file entirely. In fact, the daxctl() proposal seems actively problematic for some usecases. I think I should be able to mmap() a DAX file and then, while it's still mapped, extend the file, mmap the new part (with the appropriate flag, madvise(), msync(), fallocate(), whatever), and write directly through that mapping and through the original mapping, concurrently, w
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
[add linux-xfs to the fray] On Fri, Jun 16, 2017 at 06:15:35PM -0700, Dan Williams wrote: > To date, the full promise of byte-addressable access to persistent > memory has only been half realized via the filesystem-dax interface. The > current filesystem-dax mechanism allows an application to consume (read) > data from persistent storage at byte-size granularity, bypassing the > full page reads required by traditional storage devices. > > Now, for writes, applications still need to contend with > page-granularity dirtying and flushing semantics as well as filesystem > coordination for metadata updates after any mmap write. The current > situation precludes use cases that leverage byte-granularity / in-place > updates to persistent media. > > To get around this limitation there are some specialized applications > that are using the device-dax interface to bypass the overhead and > data-safety problems of the current filesystem-dax mmap-write path. > QEMU-KVM is forced to use device-dax to safely pass through persistent > memory to a guest [1]. Some specialized databases are using device-dax > for byte-granularity writes. Outside of those cases, device-dax is > difficult for general purpose persistent memory applications to consume. > There is demand for access to pmem without needing to contend with > special device configuration and other device-dax limitations. > > The 'daxfile' interface satisfies this demand and realizes one of Dave > Chinner's ideas for allowing pmem applications to safely bypass > fsync/msync requirements. The idea is to make the file immutable with > respect to the offset-to-block mappings for every extent in the file > [2]. It turns out that filesystems already need to make this guarantee > today. This property is needed for files marked as swap files. > > The new daxctl() syscall manages setting a file into 'static-dax' mode > whereby it arranges for the file to be treated as a swapfile as far as > the filesystem is concerned, but not registered with the core-mm as > swapfile space. A file in this mode is then safe to be mapped and > written without the requirement to fsync/msync the writes. The cpu > cache management for flushing data to persistence can be handled > completely in userspace. > > [1]: https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg01207.html > [2]: https://lkml.org/lkml/2016/9/11/159 > > Cc: Jan Kara > Cc: Jeff Moyer > Cc: Christoph Hellwig > Cc: Dave Chinner > Cc: Andrew Morton > Cc: Ross Zwisler > Signed-off-by: Dan Williams > --- > arch/x86/entry/syscalls/syscall_64.tbl |1 > include/linux/dax.h|9 ++ > include/linux/fs.h |3 + > include/linux/syscalls.h |1 > include/uapi/linux/dax.h |8 + > mm/Kconfig |5 + > mm/Makefile|1 > mm/daxfile.c | 186 > > mm/page_io.c | 31 + > 9 files changed, 245 insertions(+) > create mode 100644 include/uapi/linux/dax.h > create mode 100644 mm/daxfile.c > > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl > b/arch/x86/entry/syscalls/syscall_64.tbl > index 5aef183e2f85..795eb93d6beb 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -339,6 +339,7 @@ > 330 common pkey_alloc sys_pkey_alloc > 331 common pkey_free sys_pkey_free > 332 common statx sys_statx > +333 64 daxctl sys_daxctl > > # > # x32-specific system call numbers start at 512 to avoid cache impact > diff --git a/include/linux/dax.h b/include/linux/dax.h > index 5ec1f6c47716..5f1d0e0ed30f 100644 > --- a/include/linux/dax.h > +++ b/include/linux/dax.h > @@ -4,8 +4,17 @@ > #include > #include > #include > +#include > #include > > +/* > + * TODO: make sys_daxctl() be the generic interface for toggling S_DAX > + * across filesystems. For now, mark DAXCTL_F_DAX as an invalid flag > + */ > +#define DAXCTL_VALID_FLAGS (DAXCTL_F_GET | DAXCTL_F_STATIC) > + > +int daxfile_activate(struct file *daxfile, unsigned align); > + > struct iomap_ops; > struct dax_device; > struct dax_operations { > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 3e68cabb8457..3af649fb669f 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1824,8 +1824,10 @@ struct super_operations { > #define S_NOSEC 4096/* no suid or xattr security attributes > */ > #ifdef CONFIG_FS_DAX > #define S_DAX8192/* Direct Access, avoiding the page > cache */ > +#define S_DAXFILE16384 /* no truncate (swapfile) semantics + dax */ > #else > #define S_DAX0 /* Make all the DAX code disappear */ > +#define S_DAXFILE0 > #endif > > /* > @@ -1865,6 +1867,7 @@ struct super_operations { > #define IS_AUTOMOUNT(
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote: > On Mon, Jun 19, 2017 at 6:21 AM, Dave Chinner wrote: > > On Sat, Jun 17, 2017 at 10:05:45PM -0700, Andy Lutomirski wrote: > >> On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams > >> wrote: > >> > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski wrote: > >> >> My other objection is that the syscall intentionally leaks a reference > >> >> to the file. This means it needs overflow protection and it probably > >> >> shouldn't ever be allowed to use it without privilege. > >> > > >> > We only hold the one reference while S_DAXFILE is set, so I think the > >> > protection is there, and per Dave's original proposal this requires > >> > CAP_LINUX_IMMUTABLE. > >> > > >> >> Why can't the underlying issue be easily fixed, though? Could > >> >> .page_mkwrite just make sure that metadata is synced when the FS uses > >> >> DAX? > >> > > >> > Yes, it most definitely could and that idea has been floated. > >> > > >> >> On a DAX fs, syncing metadata should be extremely fast. > > > > > > > > This again > > > > Persistent memory means the *I/O* is fast. It does not mean that > > *complex filesystem operations* are fast. > > > > Don't forget that there's an shitload of CPU that gets burnt to make > > sure that the metadata is synced correctly. Do that /synchronously/ > > on *every* write page fault (which, BTW, modify mtime, so will > > always have dirty metadata to sync) and now you have a serious > > performance problem with your "fast" DAX access method. > > I think the mtime issue can and should be solved separately. But it' > s a fair point that there would be workloads for which this could be > excessively expensive. In particular, simply creating a file, > mmapping a large range, and touching the pages one by one -- delalloc > would be completely defeated. > > But here's a strawman for solving both issues. First, mtime. I > consider it to be either a bug or a misfeature that .page_mkwrite > *ever* dirties an inode just to update mtime. I have old patches to > fix this, and those patches could be updated and merged. With them > applied, there's just a set_bit() in .page_mkwrite() to handle mtime. > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=mmap_mtime/patch_v4 Yup, I remember that - it delays the update to data writeback time, IOWs the proposed MAP_SYNC page fault semantics result in the same (poor) behaviour because the sync operation will trigger mtime updates instead of the page fault. Unless, of course, you are implying that MAP_SYNC should not actually sync all known dirty metadata on an inode. > Second: syncing extents. Here's a straw man. Forget the mmap() flag. > Instead add a new msync() operation: > > msync(start, length, MSYNC_PMEM_PREPARE_WRITE); How's this any different from the fallocate command I proposed to do this (apart from the fact that msync() is not intended to be abused as a file preallocation/zeroing interface)? > If this operation succeeds, it guarantees that all future writes > through this mapping on this range will hit actual storage and that > all the metadata operations needed to make this write persistent will > hit storage such that they are ordered before the user's writes. > As an implementation detail, this will flush out the extents if > needed. In addition, if the FS has any mechanism that would cause > problems asyncronously later on (dedupe? deallocated extents full > of zeros? defrag?), Hole punch, truncate, reflink, dedupe, snapshots, scrubbing and other background filesystem maintenance operations, etc can all change the extent layout asynchronously if there's no mechanism to tell the fs not to modify the inode extent layout. > it may also need to set a flag on the VMA > that changes the behavior of future .page_mkwrite operations. > > (On x86, for example, this would permit the FS to do WC/streaming > writes without SFENCE if the FS were structured in a way that this > worked.) > > Now we have an API that should work going forward without > introducing baggage. And XFS is free to implement this API by > making the entire file act like a swap file if XFS wants to do so, > but this doesn't force other filesystems (ext4? NOVA?) to do the > same thing. Sure, you are providing a simple programmatic API, but this does not provide a viable feature management strategy. i.e. the API you are now proposing requires the filesystem to ensure an inode's extent map cannot be modified ever again in the future (that "guarantees all future writes" bit). This requires, at minimum, a persistent flag to be set on the inode so the VFS and filesystem implementations can use it to prevent anything that, for example, relies on copy-on-write semantics being done on those files. That means the proposed msync operation will need to check the filesystem can support this feature and *fail* if it can't. Further, administrators need to be aware of this application requirement s
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Mon, Jun 19, 2017 at 6:21 AM, Dave Chinner wrote: > On Sat, Jun 17, 2017 at 10:05:45PM -0700, Andy Lutomirski wrote: >> On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams >> wrote: >> > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski wrote: >> >> My other objection is that the syscall intentionally leaks a reference >> >> to the file. This means it needs overflow protection and it probably >> >> shouldn't ever be allowed to use it without privilege. >> > >> > We only hold the one reference while S_DAXFILE is set, so I think the >> > protection is there, and per Dave's original proposal this requires >> > CAP_LINUX_IMMUTABLE. >> > >> >> Why can't the underlying issue be easily fixed, though? Could >> >> .page_mkwrite just make sure that metadata is synced when the FS uses >> >> DAX? >> > >> > Yes, it most definitely could and that idea has been floated. >> > >> >> On a DAX fs, syncing metadata should be extremely fast. > > > > This again > > Persistent memory means the *I/O* is fast. It does not mean that > *complex filesystem operations* are fast. > > Don't forget that there's an shitload of CPU that gets burnt to make > sure that the metadata is synced correctly. Do that /synchronously/ > on *every* write page fault (which, BTW, modify mtime, so will > always have dirty metadata to sync) and now you have a serious > performance problem with your "fast" DAX access method. I think the mtime issue can and should be solved separately. But it' s a fair point that there would be workloads for which this could be excessively expensive. In particular, simply creating a file, mmapping a large range, and touching the pages one by one -- delalloc would be completely defeated. But here's a strawman for solving both issues. First, mtime. I consider it to be either a bug or a misfeature that .page_mkwrite *ever* dirties an inode just to update mtime. I have old patches to fix this, and those patches could be updated and merged. With them applied, there's just a set_bit() in .page_mkwrite() to handle mtime. https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=mmap_mtime/patch_v4 Second: syncing extents. Here's a straw man. Forget the mmap() flag. Instead add a new msync() operation: msync(start, length, MSYNC_PMEM_PREPARE_WRITE); If this operation succeeds, it guarantees that all future writes through this mapping on this range will hit actual storage and that all the metadata operations needed to make this write persistent will hit storage such that they are ordered before the user's writes. As an implementation detail, this will flush out the extents if needed. In addition, if the FS has any mechanism that would cause problems asyncronously later on (dedupe? deallocated extents full of zeros? defrag?), it may also need to set a flag on the VMA that changes the behavior of future .page_mkwrite operations. (On x86, for example, this would permit the FS to do WC/streaming writes without SFENCE if the FS were structured in a way that this worked.) Now we have an API that should work going forward without introducing baggage. And XFS is free to implement this API by making the entire file act like a swap file if XFS wants to do so, but this doesn't force other filesystems (ext4? NOVA?) to do the same thing. > > And that's before we even consider all the problems with running > sync operations in page fault context > >> >> This >> >> could be conditioned on an madvise or mmap flag if performance might >> >> be an issue. As far as I know, this change alone should be >> >> sufficient. >> > >> > The hang up is that it requires per-fs enabling as it needs to be >> > careful to manage mmap_sem vs fs journal locks for example. I know the >> > in-development NOVA [1] filesystem is planning to support this out of >> > the gate. ext4 would be open to implementing it, but I think xfs is >> > cold on the idea. Christoph originally proposed it here [2], before >> > Dave went on to propose immutable semantics. >> >> Hmm. Given a choice between a very clean API that works without >> privilege but is awkward to implement on XFS and an awkward-to-use >> API, I'd personally choose the former. > > Yup, you have the choice of a clean kernel API that will be > substantially slower than the existing "dirty page" tracking and > having the app run fsync() when necessary, or having to do a little > more work in a library routine that preallocates a file and sets a > flag on it? > > The apps will use the library API, not the kernel API, so who really > cares if there's a few steps to setting up the file state > appropriately? > >> Dave, even with the lock ordering issue, couldn't XFS implement >> MAP_PMEM_AWARE by having .page_mkwrite work roughly like this: >> >> if (metadata is dirty) { >> up_write(&mmap_sem); >> sync the metadata; >> down_write(&mmap_sem); >> return 0; /* retry the fault */ >> } else { >> return whatever success code; >> } > > How do you know that there is depende
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Sat, Jun 17, 2017 at 10:05:45PM -0700, Andy Lutomirski wrote: > On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams > wrote: > > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski wrote: > >> My other objection is that the syscall intentionally leaks a reference > >> to the file. This means it needs overflow protection and it probably > >> shouldn't ever be allowed to use it without privilege. > > > > We only hold the one reference while S_DAXFILE is set, so I think the > > protection is there, and per Dave's original proposal this requires > > CAP_LINUX_IMMUTABLE. > > > >> Why can't the underlying issue be easily fixed, though? Could > >> .page_mkwrite just make sure that metadata is synced when the FS uses > >> DAX? > > > > Yes, it most definitely could and that idea has been floated. > > > >> On a DAX fs, syncing metadata should be extremely fast. This again Persistent memory means the *I/O* is fast. It does not mean that *complex filesystem operations* are fast. Don't forget that there's an shitload of CPU that gets burnt to make sure that the metadata is synced correctly. Do that /synchronously/ on *every* write page fault (which, BTW, modify mtime, so will always have dirty metadata to sync) and now you have a serious performance problem with your "fast" DAX access method. And that's before we even consider all the problems with running sync operations in page fault context > >> This > >> could be conditioned on an madvise or mmap flag if performance might > >> be an issue. As far as I know, this change alone should be > >> sufficient. > > > > The hang up is that it requires per-fs enabling as it needs to be > > careful to manage mmap_sem vs fs journal locks for example. I know the > > in-development NOVA [1] filesystem is planning to support this out of > > the gate. ext4 would be open to implementing it, but I think xfs is > > cold on the idea. Christoph originally proposed it here [2], before > > Dave went on to propose immutable semantics. > > Hmm. Given a choice between a very clean API that works without > privilege but is awkward to implement on XFS and an awkward-to-use > API, I'd personally choose the former. Yup, you have the choice of a clean kernel API that will be substantially slower than the existing "dirty page" tracking and having the app run fsync() when necessary, or having to do a little more work in a library routine that preallocates a file and sets a flag on it? The apps will use the library API, not the kernel API, so who really cares if there's a few steps to setting up the file state appropriately? > Dave, even with the lock ordering issue, couldn't XFS implement > MAP_PMEM_AWARE by having .page_mkwrite work roughly like this: > > if (metadata is dirty) { > up_write(&mmap_sem); > sync the metadata; > down_write(&mmap_sem); > return 0; /* retry the fault */ > } else { > return whatever success code; > } How do you know that there is dependent filesystem metadata that needs syncing at a level that you can safely manipulate the mmap_sem? And how, exactly, do you do this without races? It'd be trivial to DOS such retryable DAX faults simply by touching the file in a tight loop in a separate process... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Sun, Jun 18, 2017 at 1:18 AM, Christoph Hellwig wrote: > On Sat, Jun 17, 2017 at 08:15:05PM -0700, Dan Williams wrote: >> The hang up is that it requires per-fs enabling as it needs to be >> careful to manage mmap_sem vs fs journal locks for example. I know the >> in-development NOVA [1] filesystem is planning to support this out of >> the gate. ext4 would be open to implementing it, but I think xfs is >> cold on the idea. Christoph originally proposed it here [2], before >> Dave went on to propose immutable semantics. >> >> [1]: https://github.com/NVSL/NOVA >> [2]: https://lists.01.org/pipermail/linux-nvdimm/2016-February/004609.html > > And I stand to that statement. Let's get DAX stable first, and > properly cleaned up (e.g. follow on work with separating it entirely > from the block device). Then think hard about how most of the > persistent memory technologies actually work, including the point that > for a lot of workloads page cache will be required at least on the > write side. And then come up with actual real use cases and we can > look into it. I see it differently. We're already at a good point in time to start iterating on a fix for this issue. Ross and Jan have done a lot of good work on the dax stability front, and the block-device separation of dax is well underway. > And stop trying to shoe-horn crap like this in. The kernel shoe-horning all pmem+filesystem-dax applications into abiding page-cache semantics is a problem, and this RFC has already helped move the needle on a couple fronts. 1/ Swapfiles are subtly broken which is something worth fixing, and if it gets us a synchronous-dax mode without major filesystem surgery then that's all for the better. 2/ There's an appetite for just fixing this incrementally in each filesystem's fault handler, so if ext4 was able to prove out an interface / implementation for synchronous faults we could go with that instead of a pre-allocated + immutable interface and let other filesystems set their own timelines.
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Sat, Jun 17, 2017 at 08:15:05PM -0700, Dan Williams wrote: > The hang up is that it requires per-fs enabling as it needs to be > careful to manage mmap_sem vs fs journal locks for example. I know the > in-development NOVA [1] filesystem is planning to support this out of > the gate. ext4 would be open to implementing it, but I think xfs is > cold on the idea. Christoph originally proposed it here [2], before > Dave went on to propose immutable semantics. > > [1]: https://github.com/NVSL/NOVA > [2]: https://lists.01.org/pipermail/linux-nvdimm/2016-February/004609.html And I stand to that statement. Let's get DAX stable first, and properly cleaned up (e.g. follow on work with separating it entirely from the block device). Then think hard about how most of the persistent memory technologies actually work, including the point that for a lot of workloads page cache will be required at least on the write side. And then come up with actual real use cases and we can look into it. And stop trying to shoe-horn crap like this in.
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams wrote: > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski wrote: >> My other objection is that the syscall intentionally leaks a reference >> to the file. This means it needs overflow protection and it probably >> shouldn't ever be allowed to use it without privilege. > > We only hold the one reference while S_DAXFILE is set, so I think the > protection is there, and per Dave's original proposal this requires > CAP_LINUX_IMMUTABLE. > >> Why can't the underlying issue be easily fixed, though? Could >> .page_mkwrite just make sure that metadata is synced when the FS uses >> DAX? > > Yes, it most definitely could and that idea has been floated. > >> On a DAX fs, syncing metadata should be extremely fast. This >> could be conditioned on an madvise or mmap flag if performance might >> be an issue. As far as I know, this change alone should be >> sufficient. > > The hang up is that it requires per-fs enabling as it needs to be > careful to manage mmap_sem vs fs journal locks for example. I know the > in-development NOVA [1] filesystem is planning to support this out of > the gate. ext4 would be open to implementing it, but I think xfs is > cold on the idea. Christoph originally proposed it here [2], before > Dave went on to propose immutable semantics. Hmm. Given a choice between a very clean API that works without privilege but is awkward to implement on XFS and an awkward-to-use API, I'd personally choose the former. Dave, even with the lock ordering issue, couldn't XFS implement MAP_PMEM_AWARE by having .page_mkwrite work roughly like this: if (metadata is dirty) { up_write(&mmap_sem); sync the metadata; down_write(&mmap_sem); return 0; /* retry the fault */ } else { return whatever success code; } This might require returning VM_FAULT_RETRY instead of 0 and it might require auditing the core mm code to make sure that it can handle mmap_sem being dropped like this. I don't see why it couldn't work in principle, though.
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski wrote: > On Sat, Jun 17, 2017 at 2:52 PM, Dan Williams > wrote: >> On Sat, Jun 17, 2017 at 9:25 AM, Andy Lutomirski wrote: >>> >>> Can you remind those of us who haven't played with DAX in a while what >>> the problem is with mmapping a DAX file without this patchset? If >>> there's some bookkkeeping needed to make sure that the filesystem will >>> invalidate all the mappings if it decides to move the file, maybe that >>> should be the default rather than needing a new syscall. >> >> The bookkeeping to invalidate mappings when the filesystem moves a >> block is already there. >> >> Without this patchset an application needs to call fsync/msync after >> any write to a DAX mapping otherwise there is no guarantee the >> filesystem has written the metadata to find the updated block after a >> crash or power loss event. Even if the sync operation is reduced to a >> minimal cmpxchg in userspace to check if the filesystem-metadata is >> dirty, that mechanism doesn't translate to a virtualized environment, >> as requiring guests to trigger host fsync()s is not feasible. It's a >> half-step solution when you can instead just ask the filesystem to >> never move blocks, as Dave proposed many months back. >> >> We stepped back from that proposal when it looked like a significant >> amount of per-filesystem work to introduce the capability and it was >> not clear that application developers would tolerate the side effects >> of this 'immutable' semantic. However, the implementation is dead >> simple since ext4 and xfs already need to make >> block-allocation-immutable semantics available for swapfiles. We also >> have application developers telling us they are ok with the semantics, >> especially because it catches Linux up to other operating environments >> that are already on board with allowing this type of access to pmem >> through a filesystem. This patchset gives pmem application developers >> what they want without any additional burden on filesystem >> implementations. > > I see. > > I have a couple of minor-ish issues with the current proposal, then. > One is that I think the terminology should be changed to still make > sense if filesystems or VFS improves to make this approach > unnecessary. Rather than saying "this file is now static", perhaps > users should set a flag with the explicit semantics that "mmaps of > this file are guaranteed not to lose data due to the kernel's > activities", IOW that mmaps will be at least as durable as a direct > mapping of DAX memory. Then the kernel has the flexibility to add a > future implementation in which, instead of pinning the file, the > filesystem just knows to keep metadata synced before allowing > page_mkwrite to re-enable writes to an mmapped DAX file. Yes, sounds good to me. Rename the flag to DAXCTL_F_SYNC to indicate updates via mmap to this file are synchronous as far as block allocation metadata is concerned. Future filesystems are then free to always support this synchronous mode without using the swapfile hack. > My other objection is that the syscall intentionally leaks a reference > to the file. This means it needs overflow protection and it probably > shouldn't ever be allowed to use it without privilege. We only hold the one reference while S_DAXFILE is set, so I think the protection is there, and per Dave's original proposal this requires CAP_LINUX_IMMUTABLE. > Why can't the underlying issue be easily fixed, though? Could > .page_mkwrite just make sure that metadata is synced when the FS uses > DAX? Yes, it most definitely could and that idea has been floated. > On a DAX fs, syncing metadata should be extremely fast. This > could be conditioned on an madvise or mmap flag if performance might > be an issue. As far as I know, this change alone should be > sufficient. The hang up is that it requires per-fs enabling as it needs to be careful to manage mmap_sem vs fs journal locks for example. I know the in-development NOVA [1] filesystem is planning to support this out of the gate. ext4 would be open to implementing it, but I think xfs is cold on the idea. Christoph originally proposed it here [2], before Dave went on to propose immutable semantics. [1]: https://github.com/NVSL/NOVA [2]: https://lists.01.org/pipermail/linux-nvdimm/2016-February/004609.html
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Sat, Jun 17, 2017 at 2:52 PM, Dan Williams wrote: > On Sat, Jun 17, 2017 at 9:25 AM, Andy Lutomirski wrote: >> >> Can you remind those of us who haven't played with DAX in a while what >> the problem is with mmapping a DAX file without this patchset? If >> there's some bookkkeeping needed to make sure that the filesystem will >> invalidate all the mappings if it decides to move the file, maybe that >> should be the default rather than needing a new syscall. > > The bookkeeping to invalidate mappings when the filesystem moves a > block is already there. > > Without this patchset an application needs to call fsync/msync after > any write to a DAX mapping otherwise there is no guarantee the > filesystem has written the metadata to find the updated block after a > crash or power loss event. Even if the sync operation is reduced to a > minimal cmpxchg in userspace to check if the filesystem-metadata is > dirty, that mechanism doesn't translate to a virtualized environment, > as requiring guests to trigger host fsync()s is not feasible. It's a > half-step solution when you can instead just ask the filesystem to > never move blocks, as Dave proposed many months back. > > We stepped back from that proposal when it looked like a significant > amount of per-filesystem work to introduce the capability and it was > not clear that application developers would tolerate the side effects > of this 'immutable' semantic. However, the implementation is dead > simple since ext4 and xfs already need to make > block-allocation-immutable semantics available for swapfiles. We also > have application developers telling us they are ok with the semantics, > especially because it catches Linux up to other operating environments > that are already on board with allowing this type of access to pmem > through a filesystem. This patchset gives pmem application developers > what they want without any additional burden on filesystem > implementations. I see. I have a couple of minor-ish issues with the current proposal, then. One is that I think the terminology should be changed to still make sense if filesystems or VFS improves to make this approach unnecessary. Rather than saying "this file is now static", perhaps users should set a flag with the explicit semantics that "mmaps of this file are guaranteed not to lose data due to the kernel's activities", IOW that mmaps will be at least as durable as a direct mapping of DAX memory. Then the kernel has the flexibility to add a future implementation in which, instead of pinning the file, the filesystem just knows to keep metadata synced before allowing page_mkwrite to re-enable writes to an mmapped DAX file. My other objection is that the syscall intentionally leaks a reference to the file. This means it needs overflow protection and it probably shouldn't ever be allowed to use it without privilege. Why can't the underlying issue be easily fixed, though? Could .page_mkwrite just make sure that metadata is synced when the FS uses DAX? On a DAX fs, syncing metadata should be extremely fast. This could be conditioned on an madvise or mmap flag if performance might be an issue. As far as I know, this change alone should be sufficient.
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Sat, Jun 17, 2017 at 9:25 AM, Andy Lutomirski wrote: > On Fri, Jun 16, 2017 at 6:15 PM, Dan Williams > wrote: >> To date, the full promise of byte-addressable access to persistent >> memory has only been half realized via the filesystem-dax interface. The >> current filesystem-dax mechanism allows an application to consume (read) >> data from persistent storage at byte-size granularity, bypassing the >> full page reads required by traditional storage devices. >> >> Now, for writes, applications still need to contend with >> page-granularity dirtying and flushing semantics as well as filesystem >> coordination for metadata updates after any mmap write. The current >> situation precludes use cases that leverage byte-granularity / in-place >> updates to persistent media. >> >> To get around this limitation there are some specialized applications >> that are using the device-dax interface to bypass the overhead and >> data-safety problems of the current filesystem-dax mmap-write path. >> QEMU-KVM is forced to use device-dax to safely pass through persistent >> memory to a guest [1]. Some specialized databases are using device-dax >> for byte-granularity writes. Outside of those cases, device-dax is >> difficult for general purpose persistent memory applications to consume. >> There is demand for access to pmem without needing to contend with >> special device configuration and other device-dax limitations. >> >> The 'daxfile' interface satisfies this demand and realizes one of Dave >> Chinner's ideas for allowing pmem applications to safely bypass >> fsync/msync requirements. The idea is to make the file immutable with >> respect to the offset-to-block mappings for every extent in the file >> [2]. It turns out that filesystems already need to make this guarantee >> today. This property is needed for files marked as swap files. >> >> The new daxctl() syscall manages setting a file into 'static-dax' mode >> whereby it arranges for the file to be treated as a swapfile as far as >> the filesystem is concerned, but not registered with the core-mm as >> swapfile space. A file in this mode is then safe to be mapped and >> written without the requirement to fsync/msync the writes. The cpu >> cache management for flushing data to persistence can be handled >> completely in userspace. > > Can you remind those of us who haven't played with DAX in a while what > the problem is with mmapping a DAX file without this patchset? If > there's some bookkkeeping needed to make sure that the filesystem will > invalidate all the mappings if it decides to move the file, maybe that > should be the default rather than needing a new syscall. The bookkeeping to invalidate mappings when the filesystem moves a block is already there. Without this patchset an application needs to call fsync/msync after any write to a DAX mapping otherwise there is no guarantee the filesystem has written the metadata to find the updated block after a crash or power loss event. Even if the sync operation is reduced to a minimal cmpxchg in userspace to check if the filesystem-metadata is dirty, that mechanism doesn't translate to a virtualized environment, as requiring guests to trigger host fsync()s is not feasible. It's a half-step solution when you can instead just ask the filesystem to never move blocks, as Dave proposed many months back. We stepped back from that proposal when it looked like a significant amount of per-filesystem work to introduce the capability and it was not clear that application developers would tolerate the side effects of this 'immutable' semantic. However, the implementation is dead simple since ext4 and xfs already need to make block-allocation-immutable semantics available for swapfiles. We also have application developers telling us they are ok with the semantics, especially because it catches Linux up to other operating environments that are already on board with allowing this type of access to pmem through a filesystem. This patchset gives pmem application developers what they want without any additional burden on filesystem implementations.
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Fri, Jun 16, 2017 at 6:15 PM, Dan Williams wrote: > To date, the full promise of byte-addressable access to persistent > memory has only been half realized via the filesystem-dax interface. The > current filesystem-dax mechanism allows an application to consume (read) > data from persistent storage at byte-size granularity, bypassing the > full page reads required by traditional storage devices. > > Now, for writes, applications still need to contend with > page-granularity dirtying and flushing semantics as well as filesystem > coordination for metadata updates after any mmap write. The current > situation precludes use cases that leverage byte-granularity / in-place > updates to persistent media. > > To get around this limitation there are some specialized applications > that are using the device-dax interface to bypass the overhead and > data-safety problems of the current filesystem-dax mmap-write path. > QEMU-KVM is forced to use device-dax to safely pass through persistent > memory to a guest [1]. Some specialized databases are using device-dax > for byte-granularity writes. Outside of those cases, device-dax is > difficult for general purpose persistent memory applications to consume. > There is demand for access to pmem without needing to contend with > special device configuration and other device-dax limitations. > > The 'daxfile' interface satisfies this demand and realizes one of Dave > Chinner's ideas for allowing pmem applications to safely bypass > fsync/msync requirements. The idea is to make the file immutable with > respect to the offset-to-block mappings for every extent in the file > [2]. It turns out that filesystems already need to make this guarantee > today. This property is needed for files marked as swap files. > > The new daxctl() syscall manages setting a file into 'static-dax' mode > whereby it arranges for the file to be treated as a swapfile as far as > the filesystem is concerned, but not registered with the core-mm as > swapfile space. A file in this mode is then safe to be mapped and > written without the requirement to fsync/msync the writes. The cpu > cache management for flushing data to persistence can be handled > completely in userspace. Can you remind those of us who haven't played with DAX in a while what the problem is with mmapping a DAX file without this patchset? If there's some bookkkeeping needed to make sure that the filesystem will invalidate all the mappings if it decides to move the file, maybe that should be the default rather than needing a new syscall. --Andy
[RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
To date, the full promise of byte-addressable access to persistent memory has only been half realized via the filesystem-dax interface. The current filesystem-dax mechanism allows an application to consume (read) data from persistent storage at byte-size granularity, bypassing the full page reads required by traditional storage devices. Now, for writes, applications still need to contend with page-granularity dirtying and flushing semantics as well as filesystem coordination for metadata updates after any mmap write. The current situation precludes use cases that leverage byte-granularity / in-place updates to persistent media. To get around this limitation there are some specialized applications that are using the device-dax interface to bypass the overhead and data-safety problems of the current filesystem-dax mmap-write path. QEMU-KVM is forced to use device-dax to safely pass through persistent memory to a guest [1]. Some specialized databases are using device-dax for byte-granularity writes. Outside of those cases, device-dax is difficult for general purpose persistent memory applications to consume. There is demand for access to pmem without needing to contend with special device configuration and other device-dax limitations. The 'daxfile' interface satisfies this demand and realizes one of Dave Chinner's ideas for allowing pmem applications to safely bypass fsync/msync requirements. The idea is to make the file immutable with respect to the offset-to-block mappings for every extent in the file [2]. It turns out that filesystems already need to make this guarantee today. This property is needed for files marked as swap files. The new daxctl() syscall manages setting a file into 'static-dax' mode whereby it arranges for the file to be treated as a swapfile as far as the filesystem is concerned, but not registered with the core-mm as swapfile space. A file in this mode is then safe to be mapped and written without the requirement to fsync/msync the writes. The cpu cache management for flushing data to persistence can be handled completely in userspace. [1]: https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg01207.html [2]: https://lkml.org/lkml/2016/9/11/159 Cc: Jan Kara Cc: Jeff Moyer Cc: Christoph Hellwig Cc: Dave Chinner Cc: Andrew Morton Cc: Ross Zwisler Signed-off-by: Dan Williams --- arch/x86/entry/syscalls/syscall_64.tbl |1 include/linux/dax.h|9 ++ include/linux/fs.h |3 + include/linux/syscalls.h |1 include/uapi/linux/dax.h |8 + mm/Kconfig |5 + mm/Makefile|1 mm/daxfile.c | 186 mm/page_io.c | 31 + 9 files changed, 245 insertions(+) create mode 100644 include/uapi/linux/dax.h create mode 100644 mm/daxfile.c diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 5aef183e2f85..795eb93d6beb 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -339,6 +339,7 @@ 330common pkey_alloc sys_pkey_alloc 331common pkey_free sys_pkey_free 332common statx sys_statx +33364 daxctl sys_daxctl # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/include/linux/dax.h b/include/linux/dax.h index 5ec1f6c47716..5f1d0e0ed30f 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -4,8 +4,17 @@ #include #include #include +#include #include +/* + * TODO: make sys_daxctl() be the generic interface for toggling S_DAX + * across filesystems. For now, mark DAXCTL_F_DAX as an invalid flag + */ +#define DAXCTL_VALID_FLAGS (DAXCTL_F_GET | DAXCTL_F_STATIC) + +int daxfile_activate(struct file *daxfile, unsigned align); + struct iomap_ops; struct dax_device; struct dax_operations { diff --git a/include/linux/fs.h b/include/linux/fs.h index 3e68cabb8457..3af649fb669f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1824,8 +1824,10 @@ struct super_operations { #define S_NOSEC4096/* no suid or xattr security attributes */ #ifdef CONFIG_FS_DAX #define S_DAX 8192/* Direct Access, avoiding the page cache */ +#define S_DAXFILE 16384 /* no truncate (swapfile) semantics + dax */ #else #define S_DAX 0 /* Make all the DAX code disappear */ +#define S_DAXFILE 0 #endif /* @@ -1865,6 +1867,7 @@ struct super_operations { #define IS_AUTOMOUNT(inode)((inode)->i_flags & S_AUTOMOUNT) #define IS_NOSEC(inode)((inode)->i_flags & S_NOSEC) #define IS_DAX(inode) ((inode)->i_flags & S_DAX) +#define IS_DAXFILE(inode) ((inode)->i_flags & S_DAXFILE) #define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \