Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Tue, Mar 02, 2021 at 09:49:30AM -0800, Dan Williams wrote:
> On Mon, Mar 1, 2021 at 11:57 PM Dave Chinner wrote:
> > On Mon, Mar 01, 2021 at 09:41:02PM -0800, Dan Williams wrote:
> > > On Mon, Mar 1, 2021 at 7:28 PM Darrick J. Wong wrote:
> > > > > > I really don't see it; you seem to be telling us that invalidation is an either/or choice. There are more ways to convert physical block address -> inode file offset and mapping index than brute force inode cache walks.
> > > > > Yes, but I was trying to map it to an existing mechanism and the internals of drop_pagecache_sb() are, in coarse terms, close to what needs to happen here.
> > > > Yes. XFS (with rmap enabled) can do all the iteration and walking in that function except for the invalidate_mapping_* call itself. The goal of this series is first to wire up a callback within both the block and pmem subsystems so that they can take notifications and reverse-map them through the storage stack until they reach an fs superblock.
> > > I'm chuckling because this "reverse map all the way up the block layer" is the opposite of what Dave said at the first reaction to my proposal, "can't the mm map pfns to fs inode address_spaces?".
> > Ah, no, I never said that the filesystem can't do reverse maps. I was asking if the mm could directly (brute-force) invalidate PTEs pointing at physical pmem ranges without needing to walk the inode mappings. That would be far more efficient if it could be done.

So, uh, /can/ the kernel brute-force invalidate PTEs when the pmem driver says that something died? Part of what's keeping me from putting together a coherent vision for how this would work is my relative unfamiliarity with all things mm/.
> > > Today whenever the pmem driver receives new corrupted range notification from the lower level nvdimm infrastructure (nd_pmem_notify) it updates the 'badblocks' instance associated with the pmem gendisk and then notifies userspace that there are new badblocks. This seems like a perfect place to signal an upper level stacked block device that may also be watching disk->bb. Then each gendisk in a stacked topology is responsible for watching the badblock notifications of the next level and storing a remapped instance of those blocks until ultimately the filesystem mounted on the top-level block device is responsible for registering for those top-level disk->bb events.
> > > The device gone notification does not map cleanly onto 'struct badblocks'.
> > Filesystems are not allowed to interact with the gendisk infrastructure - that's for supporting the device side of a block device. It's a layering violation, and many a filesystem developer has been shouted at for trying to do this. At most we can peek through it to query functionality support from the request queue, but otherwise filesystems do not interact with anything under bdev->bd_disk.
> So let's add an API that allows the querying of badblocks by bdev and let the block core handle the bd_disk interaction. I see other block functionality like blk-integrity reaching through gendisk. The fs need not interact with the gendisk directly.

(I thought it was ok for block code to fiddle with other block internals, and it's filesystems messing with block internals that was prohibited?)

> > As it is, badblocks are used by devices to manage internal state. e.g. md for recording stripes that need recovery if the system crashes while they are being written out.
> I know, I was there when it was invented which is why it was top-of-mind when pmem had a need to communicate badblocks.
> Other block drivers have threatened to use it for badblocks tracking, but none of those have carried through on that initial interest.

I hadn't realized that badblocks was bolted onto gendisk nowadays, I mistakenly thought it was still something internal to md. Looking over badblocks, I see a major drawback in that it can only remember a single page's worth of badblocks records.

> > > If an upper level agent really cared about knowing about ->remove() events before they happened it could maybe do something like:
> > >
> > >   dev = disk_to_dev(bdev->bd_disk)->parent;
> > >   bus_register_notifier(dev->bus, &host_device_notifier_block);
> > Yeah, that's exactly the sort of thing that filesystems have been aggressively discouraged from doing for years.
> Yup, it's a layering violation.
> > Part of the reason for this is that gendisk based mechanisms are not very good for stacked device error reporting. Part of the problem here is that every layer of the stacked device has to hook the notifier of the block devices underneath it, then translate the event to match the upper block device map, then regenerate the notification for the next layer up.
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Mon, Mar 1, 2021 at 11:57 PM Dave Chinner wrote:
> On Mon, Mar 01, 2021 at 09:41:02PM -0800, Dan Williams wrote:
> > On Mon, Mar 1, 2021 at 7:28 PM Darrick J. Wong wrote:
> > > > > I really don't see it; you seem to be telling us that invalidation is an either/or choice. There are more ways to convert physical block address -> inode file offset and mapping index than brute force inode cache walks.
> > > > Yes, but I was trying to map it to an existing mechanism and the internals of drop_pagecache_sb() are, in coarse terms, close to what needs to happen here.
> > > Yes. XFS (with rmap enabled) can do all the iteration and walking in that function except for the invalidate_mapping_* call itself. The goal of this series is first to wire up a callback within both the block and pmem subsystems so that they can take notifications and reverse-map them through the storage stack until they reach an fs superblock.
> > I'm chuckling because this "reverse map all the way up the block layer" is the opposite of what Dave said at the first reaction to my proposal, "can't the mm map pfns to fs inode address_spaces?".
> Ah, no, I never said that the filesystem can't do reverse maps. I was asking if the mm could directly (brute-force) invalidate PTEs pointing at physical pmem ranges without needing to walk the inode mappings. That would be far more efficient if it could be done.
> > Today whenever the pmem driver receives new corrupted range notification from the lower level nvdimm infrastructure (nd_pmem_notify) it updates the 'badblocks' instance associated with the pmem gendisk and then notifies userspace that there are new badblocks. This seems like a perfect place to signal an upper level stacked block device that may also be watching disk->bb.
> > Then each gendisk in a stacked topology is responsible for watching the badblock notifications of the next level and storing a remapped instance of those blocks until ultimately the filesystem mounted on the top-level block device is responsible for registering for those top-level disk->bb events.
> > The device gone notification does not map cleanly onto 'struct badblocks'.
> Filesystems are not allowed to interact with the gendisk infrastructure - that's for supporting the device side of a block device. It's a layering violation, and many a filesystem developer has been shouted at for trying to do this. At most we can peek through it to query functionality support from the request queue, but otherwise filesystems do not interact with anything under bdev->bd_disk.

So let's add an API that allows the querying of badblocks by bdev and let the block core handle the bd_disk interaction. I see other block functionality like blk-integrity reaching through gendisk. The fs need not interact with the gendisk directly.

> As it is, badblocks are used by devices to manage internal state. e.g. md for recording stripes that need recovery if the system crashes while they are being written out.

I know, I was there when it was invented which is why it was top-of-mind when pmem had a need to communicate badblocks. Other block drivers have threatened to use it for badblocks tracking, but none of those have carried through on that initial interest.

> > If an upper level agent really cared about knowing about ->remove() events before they happened it could maybe do something like:
> >
> >   dev = disk_to_dev(bdev->bd_disk)->parent;
> >   bus_register_notifier(dev->bus, &host_device_notifier_block);
> Yeah, that's exactly the sort of thing that filesystems have been aggressively discouraged from doing for years.

Yup, it's a layering violation.
> Part of the reason for this is that gendisk based mechanisms are not very good for stacked device error reporting. Part of the problem here is that every layer of the stacked device has to hook the notifier of the block devices underneath it, then translate the event to match the upper block device map, then regenerate the notification for the next layer up. This isn't an efficient way to pass a notification through a series of stacked devices and it is messy and cumbersome to maintain.

It's been messy and cumbersome to route new infrastructure through DM every time a new dax_operation arrives. The corrupted_range() routing has the same burden. The advantage of badblocks over corrupted_range() is that it solves the "what if I miss a notification" problem. Each layer of the stack maintains its sector translation of the next level's errors.

> It can be effective for getting notifications to userspace about something that happens to a specific block device.

No, it's not block device specific, it's stuck at the disk level. The user notification aspect was added for pmem at the disk layer because IIRC it was NAKd to add it to the block_device itself.

> But the userspace still ends up having to solve the "what does this error resolve to" problem.
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Mon, Mar 01, 2021 at 09:41:02PM -0800, Dan Williams wrote:
> On Mon, Mar 1, 2021 at 7:28 PM Darrick J. Wong wrote:
> > > > I really don't see it; you seem to be telling us that invalidation is an either/or choice. There are more ways to convert physical block address -> inode file offset and mapping index than brute force inode cache walks.
> > > Yes, but I was trying to map it to an existing mechanism and the internals of drop_pagecache_sb() are, in coarse terms, close to what needs to happen here.
> > Yes. XFS (with rmap enabled) can do all the iteration and walking in that function except for the invalidate_mapping_* call itself. The goal of this series is first to wire up a callback within both the block and pmem subsystems so that they can take notifications and reverse-map them through the storage stack until they reach an fs superblock.
> I'm chuckling because this "reverse map all the way up the block layer" is the opposite of what Dave said at the first reaction to my proposal, "can't the mm map pfns to fs inode address_spaces?".

Ah, no, I never said that the filesystem can't do reverse maps. I was asking if the mm could directly (brute-force) invalidate PTEs pointing at physical pmem ranges without needing to walk the inode mappings. That would be far more efficient if it could be done.

> Today whenever the pmem driver receives new corrupted range notification from the lower level nvdimm infrastructure (nd_pmem_notify) it updates the 'badblocks' instance associated with the pmem gendisk and then notifies userspace that there are new badblocks. This seems like a perfect place to signal an upper level stacked block device that may also be watching disk->bb.
> Then each gendisk in a stacked topology is responsible for watching the badblock notifications of the next level and storing a remapped instance of those blocks until ultimately the filesystem mounted on the top-level block device is responsible for registering for those top-level disk->bb events.
> The device gone notification does not map cleanly onto 'struct badblocks'.

Filesystems are not allowed to interact with the gendisk infrastructure - that's for supporting the device side of a block device. It's a layering violation, and many a filesystem developer has been shouted at for trying to do this. At most we can peek through it to query functionality support from the request queue, but otherwise filesystems do not interact with anything under bdev->bd_disk.

As it is, badblocks are used by devices to manage internal state. e.g. md for recording stripes that need recovery if the system crashes while they are being written out.

> If an upper level agent really cared about knowing about ->remove() events before they happened it could maybe do something like:
>
>   dev = disk_to_dev(bdev->bd_disk)->parent;
>   bus_register_notifier(dev->bus, &host_device_notifier_block);

Yeah, that's exactly the sort of thing that filesystems have been aggressively discouraged from doing for years.

Part of the reason for this is that gendisk based mechanisms are not very good for stacked device error reporting. Part of the problem here is that every layer of the stacked device has to hook the notifier of the block devices underneath it, then translate the event to match the upper block device map, then regenerate the notification for the next layer up. This isn't an efficient way to pass a notification through a series of stacked devices and it is messy and cumbersome to maintain.

It can be effective for getting notifications to userspace about something that happens to a specific block device.
But the userspace still ends up having to solve the "what does this error resolve to" problem. i.e. userspace still needs to map that notification to a filesystem, and for data loss events map it to objects within the filesystem, which can be extremely expensive to do from userspace.

This is exactly the sort of userspace error reporting mess that various projects have asked us to try to fix. Plumbing errors internally through the kernel up to the filesystem where the filesystem can point directly to the user data that is affected is a simple, effective solution to the problem. Especially if we then have a generic error notification mechanism for filesystems to emit errors to registered userspace watchers...

> I still don't think that solves the need for a separate mechanism for global dax_device pte invalidation.

It's just another type of media error.

> I think that global dax_device invalidation needs new kernel infrastructure to allow internal users, like dm-writecache and future filesystems using dax for metadata, to take a fault when pmem is offlined.

If userspace has directly mapped into the cache, and the cache storage goes away, the userspace app has to be killed because we have no idea if the device going away has caused data loss or not.
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Mon, Mar 1, 2021 at 9:38 PM Dave Chinner wrote:
> On Mon, Mar 01, 2021 at 07:33:28PM -0800, Dan Williams wrote:
> > On Mon, Mar 1, 2021 at 6:42 PM Dave Chinner wrote:
> > [..]
> > > We do not need a DAX specific mechanism to tell us "DAX device gone", we need a generic block device interface that tells us "range of block device is gone".
> > This is the crux of the disagreement. The block_device is going away *and* the dax_device is going away.
> No, that is not the disagreement I have with what you are saying. You still haven't understood that it's even more basic and generic than devices going away. At the simplest form, all the filesystem wants is to be notified when *unrecoverable media errors* occur in the persistent storage that underlies the filesystem.
> The filesystem does not care what that media is built from - PMEM, flash, corroded spinning disks, MRAM, or any other persistent media you can think of. It just doesn't matter.
> What we care about is that the contents of a *specific LBA range* no longer contain *valid data*. IOWs, the data in that range of the block device has been lost, cannot be retrieved and/or cannot be written to any more.
> PMEM taking a MCE because ECC tripped is a media error because data is lost and inaccessible until recovery actions are taken.
> MD RAID failing a scrub is a media error and data is lost and unrecoverable at that layer.
> A device disappearing is a media error because the storage media is now permanently inaccessible to the higher layers.
> This "media error" categorisation is a fundamental property of persistent storage and, as such, is a property of the block devices used to access said persistent storage.
> That's the disagreement here - that you and Christoph are saying ->corrupted_range is not a block device property because only a pmem/DAX device currently generates it.
> You both seem to be NACKing a generic interface because it's only implemented for the first subsystem that needs it. AFAICT, you either don't understand or are completely ignoring the architectural need for it to be provided across the rest of the storage stack that *block device based filesystems depend on*.

No I'm NAKing it because it's the wrong interface. See my 'struct badblocks' argument in the reply to Darrick. That 'struct badblocks' infrastructure arose from MD and is shared with PMEM.

> Sure, there might be dax device based filesystems around the corner. They just require a different pmem device ->corrupted_range callout to implement the notification - one that directs to the dax device rather than the block device. That's simple and trivial to implement, but such functionality for DAX devices does not replace the need for the same generic functionality to be provided across a *range of different block devices* as required by *block device based filesystems*.
> And that's fundamentally the problem. XFS is block device based, not DAX device based. We require errors to be reported through block device mechanisms. fs-dax does not change this - it is based on pmem being presented primarily as a block device to the block device based filesystems and only secondarily as a dax device. Hence if it can be trivially implemented as a block device interface, that's where it should go, because then all the other block devices that the filesystem runs on can provide the same functionality for similar media error events.

Sure, use 'struct badblocks' not struct block_device and block_device_operations.

> > The dax_device removal implies one set of actions (direct accessed pfns invalid), the block device removal implies another (block layer sector access offline).
> There you go again, saying DAX requires an action, while the block device notification is a -state change- (i.e. goes offline).
There you go reacting to the least generous interpretation of what I said. s/pfns invalid/pfns offline/

> This is exactly what I said was wrong in my last email.
> > corrupted_range is blurring the notification for 2 different failure domains. Look at the nascent idea to mount a filesystem on dax sans a block device. Look at the existing plumbing for DM to map dax_operations through a device stack.
> Ummm, it just maps the direct_access call to the underlying device and calls its ->direct_access method. All it's doing is LBA mapping. That's all it needs to do for ->corrupted_range, too. I have no clue why you think this is a problem for error notification...
> > Look at the pushback Ruan got for adding a new block_device operation for corrupted_range().
> one person said "no". That's hardly pushback. Especially as I think Christoph's objection about this being dax specific functionality is simply wrong, as per above.

It's not wrong when we have a perfectly suitable object for sector based error notification
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Mon, Mar 1, 2021 at 7:28 PM Darrick J. Wong wrote:
> On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote:
> > On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner wrote:
> > > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner wrote:
> > > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner wrote:
> > > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > > > it points to, check if it points to the PMEM that is being removed, grab the page it points to, map that to the relevant struct page, run collect_procs() on that page, then kill the user processes that map that page.
> > > > > So why can't we walk the ptes, check the physical pages that they map to, and if they map to a pmem page we go poison that page and that kills any user process that maps it.
> > > > > i.e. I can't see how unexpected pmem device unplug is any different to an MCE delivering a hwpoison event to a DAX mapped page.
> > > > I guess the tradeoff is walking a long list of inodes vs walking a large array of pages.
> > > Not really. You're assuming all a filesystem has to do is invalidate everything if a device goes away, and that's not true. Finding if an inode has a mapping that spans a specific device in a multi-device filesystem can be a lot more complex than that. Just walking inodes is easy - determining which inodes need invalidation is the hard part.
> > That inode-to-device level of specificity is not needed for the same reason that drop_caches does not need to be specific. If the wrong page is unmapped a re-fault will bring it back, and re-fault will fail for the pages that are successfully removed.
> > > That's where ->corrupt_range() comes in - the filesystem is already set up to do reverse mapping from physical range to inode(s) offsets...
> > Sure, but what is the need to get to that level of specificity with the filesystem for something that should rarely happen in the course of normal operation outside of a mistake?
> I can't tell if we're conflating the "a bunch of your pmem went bad" case with the "all your dimms fell out of the machine" case.

From the pmem driver perspective it has the media scanning to find some small handful of cachelines that have gone bad, and it has the driver ->remove() callback to tell it a bunch of pmem is now offline. The NVDIMM device "range has gone bad" mechanism has no way to communicate multiple terabytes have gone bad at once. In fact I think the distinction is important that ->remove() is not treated as ->corrupted_range() because I expect the level of freakout is much worse for a "your storage is offline" notification vs "your storage is corrupted" notification.

> If, say, a single cacheline's worth of pmem goes bad on a node with 2TB of pmem, I certainly want that level of specificity. Just notify the users of the dead piece, don't flush the whole machine down the drain.

Right, something like corrupted_range() is there to say, "keep going upper layers, but note that this handful of sectors now has indeterminate data and will return -EIO on access until repaired". The repair for device-offline is device-online.

> > > > There's likely always more pages than inodes, but perhaps it's more efficient to walk the 'struct page' array than sb->s_inodes?
> > > I really don't see it; you seem to be telling us that invalidation is an either/or choice.
> > > There are more ways to convert physical block address -> inode file offset and mapping index than brute force inode cache walks.
> > Yes, but I was trying to map it to an existing mechanism and the internals of drop_pagecache_sb() are, in coarse terms, close to what needs to happen here.
> Yes. XFS (with rmap enabled) can do all the iteration and walking in that function except for the invalidate_mapping_* call itself. The goal of this series is first to wire up a callback within both the block and pmem subsystems so that they can take notifications and reverse-map them through the storage stack until they reach an fs superblock.

I'm chuckling because this "reverse map all the way up the block layer" is the opposite of what Dave said at the first reaction to my proposal, "can't the mm map pfns to fs inode address_spaces?".

I think dax unmap is distinct from corrupted_range() precisely because they are events happening in two different domains, block device sectors vs dax device pfns.

Let's step back. I think a chain of ->corrupted_range() callbacks up the block stack terminating in the filesystem with dax implications tacked on is the wrong abstraction. Why not use the existing generic object for communicating bad
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Mon, Mar 01, 2021 at 07:33:28PM -0800, Dan Williams wrote:
> On Mon, Mar 1, 2021 at 6:42 PM Dave Chinner wrote:
> [..]
> > We do not need a DAX specific mechanism to tell us "DAX device gone", we need a generic block device interface that tells us "range of block device is gone".
> This is the crux of the disagreement. The block_device is going away *and* the dax_device is going away.

No, that is not the disagreement I have with what you are saying. You still haven't understood that it's even more basic and generic than devices going away. At the simplest form, all the filesystem wants is to be notified when *unrecoverable media errors* occur in the persistent storage that underlies the filesystem.

The filesystem does not care what that media is built from - PMEM, flash, corroded spinning disks, MRAM, or any other persistent media you can think of. It just doesn't matter.

What we care about is that the contents of a *specific LBA range* no longer contain *valid data*. IOWs, the data in that range of the block device has been lost, cannot be retrieved and/or cannot be written to any more.

PMEM taking a MCE because ECC tripped is a media error because data is lost and inaccessible until recovery actions are taken.

MD RAID failing a scrub is a media error and data is lost and unrecoverable at that layer.

A device disappearing is a media error because the storage media is now permanently inaccessible to the higher layers.

This "media error" categorisation is a fundamental property of persistent storage and, as such, is a property of the block devices used to access said persistent storage.

That's the disagreement here - that you and Christoph are saying ->corrupted_range is not a block device property because only a pmem/DAX device currently generates it. You both seem to be NACKing a generic interface because it's only implemented for the first subsystem that needs it.
AFAICT, you either don't understand or are completely ignoring the architectural need for it to be provided across the rest of the storage stack that *block device based filesystems depend on*.

Sure, there might be dax device based filesystems around the corner. They just require a different pmem device ->corrupted_range callout to implement the notification - one that directs to the dax device rather than the block device. That's simple and trivial to implement, but such functionality for DAX devices does not replace the need for the same generic functionality to be provided across a *range of different block devices* as required by *block device based filesystems*.

And that's fundamentally the problem. XFS is block device based, not DAX device based. We require errors to be reported through block device mechanisms. fs-dax does not change this - it is based on pmem being presented primarily as a block device to the block device based filesystems and only secondarily as a dax device. Hence if it can be trivially implemented as a block device interface, that's where it should go, because then all the other block devices that the filesystem runs on can provide the same functionality for similar media error events.

> The dax_device removal implies one set of actions (direct accessed pfns invalid), the block device removal implies another (block layer sector access offline).

There you go again, saying DAX requires an action, while the block device notification is a -state change- (i.e. goes offline). This is exactly what I said was wrong in my last email.

> corrupted_range is blurring the notification for 2 different failure domains. Look at the nascent idea to mount a filesystem on dax sans a block device. Look at the existing plumbing for DM to map dax_operations through a device stack.

Ummm, it just maps the direct_access call to the underlying device and calls its ->direct_access method. All it's doing is LBA mapping.
That's all it needs to do for ->corrupted_range, too. I have no clue why you think this is a problem for error notification...

> Look at the pushback Ruan got for adding a new block_device operation for corrupted_range().

One person said "no". That's hardly pushback. Especially as I think Christoph's objection about this being dax specific functionality is simply wrong, as per above.

> > This is why we need to communicate what error occurred, not what action a device driver thinks needs to be taken.
> The driver is only an event producer in this model, whatever the consumer does at the other end is not its concern. There may be a generic consumer and a filesystem specific consumer.

That's why these are all ops functions that can provide multiple implementations to different device types. So that when we get a new use case, the ops function structure can be replaced with one that directs the notification to the new user instead of to the existing one. It's a design pattern we use all over the kernel code.

Cheers,
Dave.
-- 
Dave Chinner
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Mon, Mar 1, 2021 at 6:42 PM Dave Chinner wrote:
[..]
> We do not need a DAX specific mechanism to tell us "DAX device gone", we need a generic block device interface that tells us "range of block device is gone".

This is the crux of the disagreement. The block_device is going away *and* the dax_device is going away. The dax_device removal implies one set of actions (direct accessed pfns invalid), the block device removal implies another (block layer sector access offline). corrupted_range is blurring the notification for 2 different failure domains. Look at the nascent idea to mount a filesystem on dax sans a block device. Look at the existing plumbing for DM to map dax_operations through a device stack. Look at the pushback Ruan got for adding a new block_device operation for corrupted_range().

> The reason that the block device is gone is irrelevant to the filesystem. The type of block device is also irrelevant. If the filesystem isn't using DAX (e.g. dax=never mount option) even when it is on a DAX capable device, then it just doesn't care about invalidating DAX mappings because it has none. But it still may care about shutting down the filesystem because the block device is gone.

Sure, let's have a discussion about a block_device gone notification, and a dax_device gone notification.

> This is why we need to communicate what error occurred, not what action a device driver thinks needs to be taken.

The driver is only an event producer in this model, whatever the consumer does at the other end is not its concern. There may be a generic consumer and a filesystem specific consumer.

> The error is important to the filesystem, the action might be completely irrelevant. And, as we know now, shutdown on DAX enabled filesystems needs to imply DAX mapping invalidation in all cases, not just when the device disappears from under the filesystem.

Sure.
___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote: > On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner wrote: > > > > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote: > > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner wrote: > > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote: > > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner > > > > > wrote: > > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote: > > > > it points to, check if it points to the PMEM that is being removed, > > > > grab the page it points to, map that to the relevant struct page, > > > > run collect_procs() on that page, then kill the user processes that > > > > map that page. > > > > > > > > So why can't we walk the ptes, check the physical pages that they > > > > map to, and if they map to a pmem page we poison that > > > > page and that kills any user process that maps it. > > > > > > > > i.e. I can't see how unexpected pmem device unplug is any different > > > > to an MCE delivering a hwpoison event to a DAX mapped page. > > > > > > I guess the tradeoff is walking a long list of inodes vs walking a > > > large array of pages. > > > > Not really. You're assuming all a filesystem has to do is invalidate > > everything if a device goes away, and that's not true. Finding if an > > inode has a mapping that spans a specific device in a multi-device > > filesystem can be a lot more complex than that. Just walking inodes > > is easy - determining which inodes need invalidation is the hard > > part. > > That inode-to-device level of specificity is not needed for the same > reason that drop_caches does not need to be specific. If the wrong > page is unmapped a re-fault will bring it back, and re-fault will fail > for the pages that are successfully removed. > > > That's where ->corrupt_range() comes in - the filesystem is already > > set up to do reverse mapping from physical range to inode(s) > > offsets... 
> > Sure, but what is the need to get to that level of specificity with > the filesystem for something that should rarely happen in the course > of normal operation outside of a mistake? I can't tell if we're conflating the "a bunch of your pmem went bad" case with the "all your dimms fell out of the machine" case. If, say, a single cacheline's worth of pmem goes bad on a node with 2TB of pmem, I certainly want that level of specificity. Just notify the users of the dead piece, don't flush the whole machine down the drain. > > > There's likely always more pages than inodes, but perhaps it's more > > > efficient to walk the 'struct page' array than sb->s_inodes? > > > > I really don't see it. You seem to be telling us that invalidation is an > > either/or choice. There's more ways to convert physical block > > address -> inode file offset and mapping index than brute force > > inode cache walks > > Yes, but I was trying to map it to an existing mechanism and the > internals of drop_pagecache_sb() are, in coarse terms, close to what > needs to happen here. Yes. XFS (with rmap enabled) can do all the iteration and walking in that function except for the invalidate_mapping_* call itself. The goal of this series is first to wire up a callback within both the block and pmem subsystems so that they can take notifications and reverse-map them through the storage stack until they reach an fs superblock. Once the information has reached XFS, it can use its own reverse mappings to figure out which pages of which inodes are now targeted. The future of DAX hw error handling can be that you throw the spitwad at us, and it's our problem to distill that into mm invalidation calls. XFS' reverse mapping data is indexed by storage location and isn't sharded by address_space, so (except for the DIMMs falling out), we don't need to walk the entire inode list or scan the entire mapping. 
Between XFS and DAX and mm, the mm already has the invalidation calls, xfs already has the distiller, and so all we need is that first bit. The current mm code doesn't fully solve the problem, nor does it need to, since it handles DRAM errors acceptably* already. * Actually, the hwpoison code should _also_ be calling ->corrupted_range when DRAM goes bad so that we can detect metadata failures and either reload the buffer or (if it was dirty) shut down. > > > > . > > > > > > IOWs, what needs to happen at this point is very filesystem > > > > specific. Assuming that "device unplug == filesystem dead" is not > > > > correct, nor is specifying a generic action that assumes the > > > > filesystem is dead because a device it is using went away. > > > > > > Ok, I think I set this discussion in the wrong direction implying any > > > mapping of this action to a "filesystem dead" event. It's just a "zap > > > all ptes" event and upper layers recover from there. > > > > Yes, that's exactly what ->corrupt_range() is intended for. It > > allows the filesystem to lock out access to the bad range > > and then
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Mon, Mar 01, 2021 at 04:32:36PM -0800, Dan Williams wrote: > On Mon, Mar 1, 2021 at 2:47 PM Dave Chinner wrote: > > Now we have the filesystem people providing a mechanism for the pmem > > devices to tell the filesystems about physical device failures so > > they can handle such failures correctly themselves. Having the > > device go away unexpectedly from underneath a mounted and active > > filesystem is a *device failure*, not an "unplug event". > > It's the same difference to the physical page, all mappings to that > page need to be torn down. I'm happy to call an fs callback and let > each filesystem do what it wants with a "every pfn in this dax device > needs to be unmapped". You keep talking like this is something specific to a DAX device. It isn't - the filesystem needs to take specific actions if any type of block device reports that it has a corrupted range, not just DAX. A DAX device simply adds "and invalidate direct mappings" to the list of stuff that needs to be done. And as far as a filesystem is concerned, there is no difference between "this 4kB range is bad" and "the range of this entire device is bad". We have to do the same things in both situations. > I'm looking at the ->corrupted_range() patches trying to map it to > this use case and I don't see how, for example a realtime-xfs over DM > over multiple PMEM gets the notification to the right place. > bd_corrupted_range() uses get_super() which gets the wrong answer for > both realtime-xfs and DM. I'm not sure I follow your logic. What is generating the wrong answer? We already have infrastructure for the block device to look up the superblock mounted on top of it, and DM already uses that for things like "dmsetup suspend" to freeze the filesystem before it does something. This "superblock lookup" only occurs for the top level DM device, not for the component pmem devices that make up the DM device. 
IOWs, if there's a DM device that maps multiple pmem devices, then it should be stacking the bd_corrupted_range() callbacks to map the physical device range to the range in the higher level DM device that it belongs to. This mapping of ranges is what DM exists to do - the filesystem has no clue about what devices make up a DM device, so the DM device *must* translate ranges on its component devices into the LBA range it exposes to the filesystem. > I'd flip that arrangement around and have the FS tell the block device > "if something happens to you, here is the super_block to notify". We already have a mechanism for this that the block device calls: get_active_super(bdev). There can be only one superblock per block device - the superblock has exclusive ownership of the block device while the filesystem is mounted. get_active_super() returns the superblock that sits on top of the bdev with an active reference, allowing the caller to safely access and operate on the superblock without having to worry about the superblock going away in the middle of whatever operation the block device needs to perform. If this isn't working, then existing storage stack functionality doesn't work as it should and this needs fixing independently of the PMEM/DAX stuff we are talking about here. > So > to me this looks like a fs_dax_register_super() helper that plumbs the > superblock through an arbitrary stack of block devices to the leaf > block-device that might want to send a notification up when a global > unmap operation needs to be performed. No, this is just wrong. The filesystem has no clue what block device is at the leaf level of a block device stack, nor what LBA block range represents that device within the address space the stacked block devices present to the filesystem. 
> > Please listen when we say "that is > > not sufficient" because we don't want to be backed into a corner > > that we have to fix ourselves again before we can enable some basic > > filesystem functionality that we should have been able to support on > > DAX from the start... > > That's some revisionist interpretation of how the discovery of the > reflink+dax+memory-error-handling collision went down. > > The whole point of this discussion is to determine if > ->corrupted_range() is suitable for this notification, and looking at > the code as is, it isn't. Perhaps you have a different implementation > of ->corrupted_range() in mind that allows this to be plumbed > correctly? So rather than try to make the generic ->corrupted_range infrastructure be able to report "DAX range is invalid" (which is the very definition of a corrupted block device range!), you want to introduce a DAX specific notification to tell us that a range in the block device contains invalid/corrupt data? We're talking about a patchset that is in development. The proposed notification path is supposed to be generic and *not PMEM specific*, and is intended to handle exactly the use case that you raised. The implementation may not be perfect yet, so rather
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Mon, Mar 1, 2021 at 2:47 PM Dave Chinner wrote: > > On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote: > > On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner wrote: > > > > > > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote: > > > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner > > > > wrote: > > > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote: > > > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner > > > > > > wrote: > > > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote: > > > > > it points to, check if it points to the PMEM that is being removed, > > > > > grab the page it points to, map that to the relevant struct page, > > > > > run collect_procs() on that page, then kill the user processes that > > > > > map that page. > > > > > > > > > > So why can't we walk the ptes, check the physical pages that they > > > > > map to, and if they map to a pmem page we poison that > > > > > page and that kills any user process that maps it. > > > > > > > > > > i.e. I can't see how unexpected pmem device unplug is any different > > > > > to an MCE delivering a hwpoison event to a DAX mapped page. > > > > > > > > I guess the tradeoff is walking a long list of inodes vs walking a > > > > large array of pages. > > > > > > Not really. You're assuming all a filesystem has to do is invalidate > > > everything if a device goes away, and that's not true. Finding if an > > > inode has a mapping that spans a specific device in a multi-device > > > filesystem can be a lot more complex than that. Just walking inodes > > > is easy - determining which inodes need invalidation is the hard > > > part. > > > > That inode-to-device level of specificity is not needed for the same > > reason that drop_caches does not need to be specific. If the wrong > > page is unmapped a re-fault will bring it back, and re-fault will fail > > for the pages that are successfully removed. 
> > > > > That's where ->corrupt_range() comes in - the filesystem is already > > > set up to do reverse mapping from physical range to inode(s) > > > offsets... > > > > Sure, but what is the need to get to that level of specificity with > > the filesystem for something that should rarely happen in the course > > of normal operation outside of a mistake? > > Dan, you made this mistake with the hwpoisoning code that we're > trying to fix here. Hard coding a 1:1 physical address to > inode/offset into the DAX mapping was a bad mistake. It's also one > that should never have occurred because it's *obviously wrong* to > filesystem developers and has been for a long time. I admit that mistake. The traditional memory error handling model assumptions around page->mapping were broken by DAX, I'm not trying to repeat that mistake. I feel we're talking past each other on the discussion of the proposals. > Now we have the filesystem people providing a mechanism for the pmem > devices to tell the filesystems about physical device failures so > they can handle such failures correctly themselves. Having the > device go away unexpectedly from underneath a mounted and active > filesystem is a *device failure*, not an "unplug event". It's the same difference to the physical page, all mappings to that page need to be torn down. I'm happy to call an fs callback and let each filesystem do what it wants with a "every pfn in this dax device needs to be unmapped". I'm looking at the ->corrupted_range() patches trying to map it to this use case and I don't see how, for example a realtime-xfs over DM over multiple PMEM gets the notification to the right place. bd_corrupted_range() uses get_super() which gets the wrong answer for both realtime-xfs and DM. I'd flip that arrangement around and have the FS tell the block device "if something happens to you, here is the super_block to notify". 
So to me this looks like a fs_dax_register_super() helper that plumbs the superblock through an arbitrary stack of block devices to the leaf block-device that might want to send a notification up when a global unmap operation needs to be performed. I naively think that "for_each_inode() unmap_mapping_range(&inode->i_mapping)" is sufficient as a generic implementation, and that does not preclude XFS from overriding that generic implementation and handling it directly if it so chooses. > The mistake you made was not understanding how filesystems work, > nor actually asking filesystem developers what they actually needed. You're going too far here, but that's off topic. > You're doing the same thing here - you're telling us what you think > the solution filesystems need is. No, I'm not, I'm trying to understand tradeoffs. I apologize if this is coming across as not listening. > Please listen when we say "that is > not sufficient" because we don't want to be backed into a corner > that we have to fix ourselves again before we can enable some basic > filesystem functionality that we should have been able to support on > DAX from the start... That's
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote: > On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner wrote: > > > > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote: > > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner wrote: > > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote: > > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner > > > > > wrote: > > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote: > > > > it points to, check if it points to the PMEM that is being removed, > > > > grab the page it points to, map that to the relevant struct page, > > > > run collect_procs() on that page, then kill the user processes that > > > > map that page. > > > > > > > > So why can't we walk the ptes, check the physical pages that they > > > > map to, and if they map to a pmem page we poison that > > > > page and that kills any user process that maps it. > > > > > > > > i.e. I can't see how unexpected pmem device unplug is any different > > > > to an MCE delivering a hwpoison event to a DAX mapped page. > > > > > > I guess the tradeoff is walking a long list of inodes vs walking a > > > large array of pages. > > > > Not really. You're assuming all a filesystem has to do is invalidate > > everything if a device goes away, and that's not true. Finding if an > > inode has a mapping that spans a specific device in a multi-device > > filesystem can be a lot more complex than that. Just walking inodes > > is easy - determining which inodes need invalidation is the hard > > part. > > That inode-to-device level of specificity is not needed for the same > reason that drop_caches does not need to be specific. If the wrong > page is unmapped a re-fault will bring it back, and re-fault will fail > for the pages that are successfully removed. > > > That's where ->corrupt_range() comes in - the filesystem is already > > set up to do reverse mapping from physical range to inode(s) > > offsets... 
> > Sure, but what is the need to get to that level of specificity with > the filesystem for something that should rarely happen in the course > of normal operation outside of a mistake? Dan, you made this mistake with the hwpoisoning code that we're trying to fix here. Hard coding a 1:1 physical address to inode/offset into the DAX mapping was a bad mistake. It's also one that should never have occurred because it's *obviously wrong* to filesystem developers and has been for a long time. Now we have the filesystem people providing a mechanism for the pmem devices to tell the filesystems about physical device failures so they can handle such failures correctly themselves. Having the device go away unexpectedly from underneath a mounted and active filesystem is a *device failure*, not an "unplug event". The mistake you made was not understanding how filesystems work, nor actually asking filesystem developers what they actually needed. You're doing the same thing here - you're telling us what you think the solution filesystems need is. Please listen when we say "that is not sufficient" because we don't want to be backed into a corner that we have to fix ourselves again before we can enable some basic filesystem functionality that we should have been able to support on DAX from the start... > > > There's likely always more pages than inodes, but perhaps it's more > > > efficient to walk the 'struct page' array than sb->s_inodes? > > > > I really don't see it. You seem to be telling us that invalidation is an > > either/or choice. There's more ways to convert physical block > > address -> inode file offset and mapping index than brute force > > inode cache walks > > Yes, but I was trying to map it to an existing mechanism and the > internals of drop_pagecache_sb() are, in coarse terms, close to what > needs to happen here. No. 
drop_pagecache_sb() is not a relevant model for telling a filesystem that the block device underneath has gone away, nor for a device to ensure that access protections that *are managed by the filesystem* are enforced/revoked sanely. drop_pagecache_sb() is a brute-force model for invalidating user data mappings that the filesystem performs in response to such a notification. It only needs this brute-force approach if it has no other way to find active DAX mappings across the range of the device that has gone away. But this model doesn't work for direct mapped metadata, journals or any other internal direct filesystem mappings that aren't referenced by inodes that the filesystem might be using. The filesystem still needs to invalidate all those mappings and prevent further access to them, even from within the kernel itself. Filesystems are way more complex than pure DAX devices, and hence handle errors and failure events differently. Unlike DAX devices, we have both internal and external references to the DAX device, and we can have both external and internal direct maps. Invalidating user data mappings is all dax devices need to do on unplug,
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Sun, Feb 28, 2021 at 11:27 PM Yasunori Goto wrote: > > Hello, Dan-san, > > On 2021/02/27 4:24, Dan Williams wrote: > > On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong wrote: > >> > >> On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com wrote: > >>> Hi, guys > >>> > >>> Beside this patchset, I'd like to confirm something about the > >>> "EXPERIMENTAL" tag for dax in XFS. > >>> > >>> In XFS, the "EXPERIMENTAL" tag, which is reported in a warning message > >>> when we mount a pmem device with the dax option, has existed for a > >>> while. It's a bit annoying when using the fsdax feature. So, my initial > >>> intention was to remove this tag. And I started to find out and solve > >>> the problems which prevent it from being removed. > >>> > >>> As discussed before, there are 3 main problems. The first one is "dax > >>> semantics", which has been resolved. The remaining two are "RMAP for > >>> fsdax" and "support dax reflink for filesystem", which I have been > >>> working on. > >> > >> > >> > >>> So, what I want to confirm is: does it mean that we can remove the > >>> "EXPERIMENTAL" tag when the remaining two problems are solved? > >> > >> Yes. I'd keep the experimental tag for a cycle or two to make sure that > >> nothing new pops up, but otherwise the two patchsets you've sent close > >> those two big remaining gaps. Thank you for working on this! > >> > >>> Or maybe there are other important problems that need to be fixed before > >>> removing it? If there are, could you please show me that? > >> > >> That remains to be seen through QA/validation, but I think that's it. > >> > >> Granted, I still have to read through the two patchsets... > > > > I've been meaning to circle back here as well. > > > > My immediate concern is the issue Jason recently highlighted [1] with > > respect to invalidating all dax mappings when / if the device is > > ripped out from underneath the fs. 
I don't think that will collide > > with Ruan's implementation, but it does need new communication from > > driver to fs about removal events. > > > > [1]: > > http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com > > > > I'm not sure why there is a race condition between the unbinding operation > and accessing an mmaped file on filesystem dax yet. > > Maybe a silly question, but could you tell me why the "unbinding" > operation of the namespace which is mounted by filesystem dax must be > allowed? The unbind operation is used to switch the mode of a namespace between fsdax and devdax. There is no way to fail unbind. At most it can be delayed for a short while to perform cleanup, but it can't be blocked indefinitely. There is the option to specify 'suppress_bind_attrs' to the driver to preclude software triggered device removal, but that would disable mode changes for the device. > If "unbinding" is rejected when the filesystem is mounted with dax > enabled, what is the inconvenience? It would be interesting (read difficult) to introduce the concept of dynamic 'suppress_bind_attrs'. Today the decision is static at driver registration time, not in response to how the device is being used. I think global invalidation of all inodes that might be affected by a dax-capable device being ripped away from the filesystem is sufficient and avoids per-fs enabling, but I'm willing to be convinced that ->corrupted_range() is the proper vehicle for this. > > I can imagine if a device like usb memory stick is removed surprisingly, > kernel/filesystem need to reject writeback at the time, and discard page > cache. Then, I can understand that unbinding operation is essential for > such case. For usb the system is protected by the fact that all future block-i/o submissions to the old block-device will fail, and a new usb-device being plugged in will get a new block-device. I.e. 
the old security model is invalidated / all holes are closed by blk_cleanup_queue(). > But I don't know why PMEM device/namespace allows unbinding operation > like surprising removal event. DAX hands out direct mappings to physical pages; if the security model fronting those physical pages changes, the mappings attained via the old security model need to be invalidated. blk_cleanup_queue() does not invalidate DAX mappings. The practical value of fixing that hole is small given that physical unplug is not implemented for NVDIMMs today, but the get_user_pages() path can be optimized if this invalidation is implemented, and future hotplug-capable NVDIMM buses like CXL will need this.
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner wrote: > > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote: > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner wrote: > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote: > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner > > > > wrote: > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote: > > > it points to, check if it points to the PMEM that is being removed, > > > grab the page it points to, map that to the relevant struct page, > > > run collect_procs() on that page, then kill the user processes that > > > map that page. > > > > > > So why can't we walk the ptes, check the physical pages that they > > > map to, and if they map to a pmem page we poison that > > > page and that kills any user process that maps it. > > > > > > i.e. I can't see how unexpected pmem device unplug is any different > > > to an MCE delivering a hwpoison event to a DAX mapped page. > > > > I guess the tradeoff is walking a long list of inodes vs walking a > > large array of pages. > > Not really. You're assuming all a filesystem has to do is invalidate > everything if a device goes away, and that's not true. Finding if an > inode has a mapping that spans a specific device in a multi-device > filesystem can be a lot more complex than that. Just walking inodes > is easy - determining which inodes need invalidation is the hard > part. That inode-to-device level of specificity is not needed for the same reason that drop_caches does not need to be specific. If the wrong page is unmapped a re-fault will bring it back, and re-fault will fail for the pages that are successfully removed. > That's where ->corrupt_range() comes in - the filesystem is already > set up to do reverse mapping from physical range to inode(s) > offsets... Sure, but what is the need to get to that level of specificity with the filesystem for something that should rarely happen in the course of normal operation outside of a mistake? 
> > > There's likely always more pages than inodes, but perhaps it's more > > efficient to walk the 'struct page' array than sb->s_inodes? > > I really don't see it. You seem to be telling us that invalidation is an > either/or choice. There's more ways to convert physical block > address -> inode file offset and mapping index than brute force > inode cache walks Yes, but I was trying to map it to an existing mechanism and the internals of drop_pagecache_sb() are, in coarse terms, close to what needs to happen here. > > . > > > > IOWs, what needs to happen at this point is very filesystem > > > specific. Assuming that "device unplug == filesystem dead" is not > > > correct, nor is specifying a generic action that assumes the > > > filesystem is dead because a device it is using went away. > > > > Ok, I think I set this discussion in the wrong direction implying any > > mapping of this action to a "filesystem dead" event. It's just a "zap > > all ptes" event and upper layers recover from there. > > Yes, that's exactly what ->corrupt_range() is intended for. It > allows the filesystem to lock out access to the bad range > and then recover the data. Or metadata, if that's where the bad > range lands. If that recovery fails, it can then report a data > loss/filesystem shutdown event to userspace and kill user procs that > span the bad range... > > FWIW, is this notification going to occur before or after the device > has been physically unplugged? Before. This will be operations that happen in the pmem driver ->remove() callback. > i.e. what do we do about the > time-of-unplug-to-time-of-invalidation window where userspace can > still attempt to access the missing pmem through the > not-yet-invalidated ptes? It may not be likely that people just yank > pmem nvdimms out of machines, but with NVMe persistent memory > spaces, there's every chance that someone pulls the wrong device... The physical removal aspect is only theoretical today. 
The pmem driver has a ->remove() path, but it's purely a software unbind operation. That said the vulnerability window today is if a process acquires a dax mapping, the pmem device hosting that filesystem goes through an unbind / bind cycle, and then a new filesystem is created / mounted. That old pte may be able to access data that is outside its intended protection domain. Going forward, for buses like CXL, there will be a managed physical remove operation via PCIE native hotplug. The flow there is that the PCIE hotplug driver will notify the OS of a pending removal, trigger ->remove() on the pmem driver, and then notify the technician (slot status LED) that the card is safe to pull.
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
Hello, Dan-san, On 2021/02/27 4:24, Dan Williams wrote: On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong wrote: On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com wrote: Hi, guys Beside this patchset, I'd like to confirm something about the "EXPERIMENTAL" tag for dax in XFS. In XFS, the "EXPERIMENTAL" tag, which is reported in a warning message when we mount a pmem device with the dax option, has existed for a while. It's a bit annoying when using the fsdax feature. So, my initial intention was to remove this tag. And I started to find out and solve the problems which prevent it from being removed. As discussed before, there are 3 main problems. The first one is "dax semantics", which has been resolved. The remaining two are "RMAP for fsdax" and "support dax reflink for filesystem", which I have been working on. So, what I want to confirm is: does it mean that we can remove the "EXPERIMENTAL" tag when the remaining two problems are solved? Yes. I'd keep the experimental tag for a cycle or two to make sure that nothing new pops up, but otherwise the two patchsets you've sent close those two big remaining gaps. Thank you for working on this! Or maybe there are other important problems that need to be fixed before removing it? If there are, could you please show me that? That remains to be seen through QA/validation, but I think that's it. Granted, I still have to read through the two patchsets... I've been meaning to circle back here as well. My immediate concern is the issue Jason recently highlighted [1] with respect to invalidating all dax mappings when / if the device is ripped out from underneath the fs. I don't think that will collide with Ruan's implementation, but it does need new communication from driver to fs about removal events. [1]: http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com I'm not sure why there is a race condition between the unbinding operation and accessing an mmaped file on filesystem dax yet. 
Maybe a silly question, but could you tell me why the "unbinding" operation of a namespace which is mounted by filesystem dax must be allowed? If "unbinding" is rejected when the filesystem is mounted with dax enabled, what is the inconvenience? I can imagine that if a device like a usb memory stick is removed surprisingly, the kernel/filesystem needs to reject writeback at the time and discard the page cache. Then, I can understand that the unbinding operation is essential for such a case. But I don't know why a PMEM device/namespace allows an unbinding operation like a surprising removal event. Thanks, -- Yasunori Goto
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote: > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner wrote: > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote: > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner wrote: > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote: > > it points to, check if it points to the PMEM that is being removed, > > grab the page it points to, map that to the relevant struct page, > > run collect_procs() on that page, then kill the user processes that > > map that page. > > > > So why can't we walk the ptes, check the physical pages that they > > map to and if they map to a pmem page we go poison that > > page and that kills any user process that maps it. > > > > i.e. I can't see how unexpected pmem device unplug is any different > > to an MCE delivering a hwpoison event to a DAX mapped page. > > I guess the tradeoff is walking a long list of inodes vs walking a > large array of pages. Not really. You're assuming all a filesystem has to do is invalidate everything if a device goes away, and that's not true. Finding if an inode has a mapping that spans a specific device in a multi-device filesystem can be a lot more complex than that. Just walking inodes is easy - determining which inodes need invalidation is the hard part. That's where ->corrupt_range() comes in - the filesystem is already set up to do reverse mapping from physical range to inode(s) offsets... > There's likely always more pages than inodes, but perhaps it's more > efficient to walk the 'struct page' array than sb->s_inodes? I really don't see why you seem to be telling us that invalidation is an either/or choice. There are more ways to convert physical block address -> inode file offset and mapping index than brute force inode cache walks. > > IOWs, what needs to happen at this point is very filesystem > > specific.
Assuming that "device unplug == filesystem dead" is not > > correct, nor is specifying a generic action that assumes the > > filesystem is dead because a device it is using went away. > > Ok, I think I set this discussion in the wrong direction implying any > mapping of this action to a "filesystem dead" event. It's just a "zap > all ptes" event and upper layers recover from there. Yes, that's exactly what ->corrupt_range() is intended for. It allows the filesystem to lock out access to the bad range and then recover the data. Or metadata, if that's where the bad range lands. If that recovery fails, it can then report a data loss/filesystem shutdown event to userspace and kill user procs that span the bad range... FWIW, is this notification going to occur before or after the device has been physically unplugged? i.e. what do we do about the time-of-unplug-to-time-of-invalidation window where userspace can still attempt to access the missing pmem though the not-yet-invalidated ptes? It may not be likely that people just yank pmem nvdimms out of machines, but with NVMe persistent memory spaces, there's every chance that someone pulls the wrong device... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner wrote: > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote: > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner wrote: > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote: > > > > On Fri, Feb 26, 2021 at 12:51 PM Dave Chinner > > > > wrote: > > > > > > My immediate concern is the issue Jason recently highlighted [1] > > > > > > with > > > > > > respect to invalidating all dax mappings when / if the device is > > > > > > ripped out from underneath the fs. I don't think that will collide > > > > > > with Ruan's implementation, but it does need new communication from > > > > > > driver to fs about removal events. > > > > > > > > > > > > [1]: > > > > > > http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com > > > > > > > > > > Oh, yay. > > > > > > > > > > The XFS shutdown code is centred around preventing new IO from being > > > > > issued - we don't actually do anything about DAX mappings because, > > > > > well, I don't think anyone on the filesystem side thought they had > > > > > to do anything special if pmem went away from under it. > > > > > > > > > > My understanding -was- that the pmem removal invalidates > > > > > all the ptes currently mapped into CPU page tables that point at > > > > > the dax device across the system. The vmas that manage these > > > > > mappings are not really something the filesystem manages, > > > > > but a function of the mm subsystem. What the filesystem cares about > > > > > is that it gets page faults triggered when a change of state occurs > > > > > so that it can remap the page to its backing store correctly. > > > > > > > > > > IOWs, all the mm subsystem needs to do when pmem goes away is clear the > > > > > CPU ptes, because then when userspace tries to access the > > > > > mapped DAX pages we get a new page fault.
In processing the fault, the > > > > > filesystem will try to get direct access to the pmem from the block > > > > > device. This will get an ENODEV error from the block device because > > > > > the backing store (pmem) has been unplugged and is no longer > > > > > there... > > > > > > > > > > AFAICT, as long as pmem removal invalidates all the active ptes that > > > > > point at the pmem being removed, the filesystem doesn't need to > > > > > care about device removal at all, DAX or no DAX... > > > > How would the pmem removal do that without walking all the active > > > > inodes in the fs at the time of shutdown and call > > > > unmap_mapping_range(inode->i_mapping, 0, 0, 1)? > > > > > > Which then immediately ends up back at the vmas that manage the ptes > > > to unmap them. > > > > > > Isn't finding the vma(s) that map a specific memory range exactly > > > what the rmap code in the mm subsystem is supposed to address? > > > > rmap can lookup only vmas from a virt address relative to a given > > mm_struct. The driver has neither the list of mm_struct objects nor > > virt addresses to do a lookup. All it knows is that someone might have > > mapped pages through the fsdax interface. > > So there's no physical addr to vma translation in the mm subsystem > at all? > > That doesn't make sense. We do exactly this for hwpoison for DAX > mappings. While we don't look at ptes, we get a pfn, True hwpoison does get a known failing pfn and then uses page->mapping to get the 'struct address_space' to do the unmap. I discounted that approach from the outset because it would mean walking every pfn in a multi-terabyte device just in case one is mapped at device shutdown. > it points to, check if it points to the PMEM that is being removed, > grab the page it points to, map that to the relevant struct page, > run collect_procs() on that page, then kill the user processes that > map that page.
> > So why can't we walk the ptes, check the physical pages that they > map to and if they map to a pmem page we go poison that > page and that kills any user process that maps it. > > i.e. I can't see how unexpected pmem device unplug is any different > to an MCE delivering a hwpoison event to a DAX mapped page. I guess the tradeoff is walking a long list of inodes vs walking a large array of pages. There's likely always more pages than inodes, but perhaps it's more efficient to walk the 'struct page' array than sb->s_inodes? > Both > indicate a physical address range now contains invalid data and the > filesystem has to take the same action... > > IOWs, we could just call ->corrupted_range(0, EOD) here to tell the > filesystem the entire device went away. Then the filesystem can deal > with this however it needs to. However, it would be more efficient > from an invalidation POV to just call it on the pages that have > currently active ptes because once the block device is dead > new page faults on DAX mappings will get a SIGBUS naturally. There is no efficient way to lookup "currently active ptes" relative to a physical pfn range.
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote: > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner wrote: > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote: > > > On Fri, Feb 26, 2021 at 12:51 PM Dave Chinner wrote: > > > > > My immediate concern is the issue Jason recently highlighted [1] with > > > > > respect to invalidating all dax mappings when / if the device is > > > > > ripped out from underneath the fs. I don't think that will collide > > > > > with Ruan's implementation, but it does need new communication from > > > > > driver to fs about removal events. > > > > > > > > > > [1]: > > > > > http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com > > > > > > > > Oh, yay. > > > > > > > > The XFS shutdown code is centred around preventing new IO from being > > > > issued - we don't actually do anything about DAX mappings because, > > > > well, I don't think anyone on the filesystem side thought they had > > > > to do anything special if pmem went away from under it. > > > > > > > > My understanding -was- that the pmem removal invalidates > > > > all the ptes currently mapped into CPU page tables that point at > > > > the dax device across the system. The vmas that manage these > > > > mappings are not really something the filesystem manages, > > > > but a function of the mm subsystem. What the filesystem cares about > > > > is that it gets page faults triggered when a change of state occurs > > > > so that it can remap the page to its backing store correctly. > > > > > > > > IOWs, all the mm subsystem needs to do when pmem goes away is clear the > > > > CPU ptes, because then when userspace tries to access the > > > > mapped DAX pages we get a new page fault.
In processing the fault, the > > > > filesystem will try to get direct access to the pmem from the block > > > > device. This will get an ENODEV error from the block device because > > > > the backing store (pmem) has been unplugged and is no longer > > > > there... > > > > > > > > AFAICT, as long as pmem removal invalidates all the active ptes that > > > > point at the pmem being removed, the filesystem doesn't need to > > > > care about device removal at all, DAX or no DAX... > > > How would the pmem removal do that without walking all the active > > > inodes in the fs at the time of shutdown and call > > > unmap_mapping_range(inode->i_mapping, 0, 0, 1)? > > > > Which then immediately ends up back at the vmas that manage the ptes > > to unmap them. > > > > Isn't finding the vma(s) that map a specific memory range exactly > > what the rmap code in the mm subsystem is supposed to address? > > rmap can lookup only vmas from a virt address relative to a given > mm_struct. The driver has neither the list of mm_struct objects nor > virt addresses to do a lookup. All it knows is that someone might have > mapped pages through the fsdax interface. So there's no physical addr to vma translation in the mm subsystem at all? That doesn't make sense. We do exactly this for hwpoison for DAX mappings. While we don't look at ptes, we get a pfn, grab the page it points to, check if it points to the PMEM that is being removed, map that to the relevant struct page, run collect_procs() on that page, then kill the user processes that map that page. So why can't we walk the ptes, check the physical pages that they map to and if they map to a pmem page we go poison that page and that kills any user process that maps it. i.e. I can't see how unexpected pmem device unplug is any different to an MCE delivering a hwpoison event to a DAX mapped page. Both indicate a physical address range now contains invalid data and the filesystem has to take the same action... IOWs, we could just call ->corrupted_range(0, EOD) here to tell the filesystem the entire device went away.
Then the filesystem can deal with this however it needs to. However, it would be more efficient from an invalidation POV to just call it on the pages that have currently active ptes because once the block device is dead new page faults on DAX mappings will get a SIGBUS naturally. > To me this looks like a notifier that fires from memunmap_pages() > after dev_pagemap_kill() to notify any block_device associated with > that dev_pagemap() to say that any dax mappings arranged through this > block_device are now invalid. The reason to do this after > dev_pagemap_kill() is so that any new mapping attempts that are racing > the removal will be blocked. I don't see why this needs a unique notifier. At the filesystem level, we want a single interface that tells us "something bad happened to the block device", not a proliferation of similar but subtly different "bad thing X happened to block device" interfaces that are unique to specific physical device drivers... > The receiver of that notification needs to go from a block_device to a > superblock that has mapped inodes and walk ->sb_inodes
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner wrote: > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote: > > On Fri, Feb 26, 2021 at 12:51 PM Dave Chinner wrote: > > > > > > On Fri, Feb 26, 2021 at 11:24:53AM -0800, Dan Williams wrote: > > > > On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong > > > > wrote: > > > > > > > > > > On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com > > > > > wrote: > > > > > > Hi, guys > > > > > > > > > > > > Beside this patchset, I'd like to confirm something about the > > > > > > "EXPERIMENTAL" tag for dax in XFS. > > > > > > > > > > > > In XFS, the "EXPERIMENTAL" tag, which is reported in a warning message > > > > > > when we mount a pmem device with the dax option, has existed for a > > > > > > while. It's a bit annoying when using the fsdax feature. So, my > > > > > > initial > > > > > > intention was to remove this tag. And I started to find out and > > > > > > solve > > > > > > the problems which prevent it from being removed. > > > > > > > > > > > > As discussed before, there are 3 main problems. The first one is > > > > > > "dax > > > > > > semantics", which has been resolved. The remaining two are "RMAP for > > > > > > fsdax" and "support dax reflink for filesystem", which I have been > > > > > > working on. > > > > > > > > > > > > So, what I want to confirm is: does it mean that we can remove the > > > > > > "EXPERIMENTAL" tag once the remaining two problems are solved? > > > > > > > > > > Yes. I'd keep the experimental tag for a cycle or two to make sure > > > > > that > > > > > nothing new pops up, but otherwise the two patchsets you've sent close > > > > > those two big remaining gaps. Thank you for working on this! > > > > > > > > > > > Or maybe there are other important problems that need to be fixed before > > > > > > removing it? If there are, could you please show me that? > > > > > > > > > > That remains to be seen through QA/validation, but I think that's it.
> > > > > > > > > > Granted, I still have to read through the two patchsets... > > > > > > > > I've been meaning to circle back here as well. > > > > > > > > My immediate concern is the issue Jason recently highlighted [1] with > > > > respect to invalidating all dax mappings when / if the device is > > > > ripped out from underneath the fs. I don't think that will collide > > > > with Ruan's implementation, but it does need new communication from > > > > driver to fs about removal events. > > > > > > > > [1]: > > > > http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com > > > > > > Oh, yay. > > > > > > The XFS shutdown code is centred around preventing new IO from being > > > issued - we don't actually do anything about DAX mappings because, > > > well, I don't think anyone on the filesystem side thought they had > > > to do anything special if pmem went away from under it. > > > > > > My understanding -was- that the pmem removal invalidates > > > all the ptes currently mapped into CPU page tables that point at > > > the dax device across the system. The vmas that manage these > > > mappings are not really something the filesystem manages, > > > but a function of the mm subsystem. What the filesystem cares about > > > is that it gets page faults triggered when a change of state occurs > > > so that it can remap the page to its backing store correctly. > > > > > > IOWs, all the mm subsystem needs to do when pmem goes away is clear the > > > CPU ptes, because then when userspace tries to access the > > > mapped DAX pages we get a new page fault. In processing the fault, the > > > filesystem will try to get direct access to the pmem from the block > > > device. This will get an ENODEV error from the block device because > > > the backing store (pmem) has been unplugged and is no longer > > > there...
> > > > > > AFAICT, as long as pmem removal invalidates all the active ptes that > > > point at the pmem being removed, the filesystem doesn't need to > > > care about device removal at all, DAX or no DAX... > > > > How would the pmem removal do that without walking all the active > > inodes in the fs at the time of shutdown and call > > unmap_mapping_range(inode->i_mapping, 0, 0, 1)? > > Which then immediately ends up back at the vmas that manage the ptes > to unmap them. > > Isn't finding the vma(s) that map a specific memory range exactly > what the rmap code in the mm subsystem is supposed to address? rmap can lookup only vmas from a virt address relative to a given mm_struct. The driver has neither the list of mm_struct objects nor virt addresses to do a lookup. All it knows is that someone might have mapped pages through the fsdax interface. To me this looks like a notifier that fires from memunmap_pages() after dev_pagemap_kill() to notify any block_device associated with that dev_pagemap() to say that any dax mappings arranged through this block_device are now invalid. The reason to do this
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote: > On Fri, Feb 26, 2021 at 12:51 PM Dave Chinner wrote: > > > > On Fri, Feb 26, 2021 at 11:24:53AM -0800, Dan Williams wrote: > > > On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong > > > wrote: > > > > > > > > On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com wrote: > > > > > Hi, guys > > > > > > > > > > Beside this patchset, I'd like to confirm something about the > > > > > "EXPERIMENTAL" tag for dax in XFS. > > > > > > > > > > In XFS, the "EXPERIMENTAL" tag, which is reported in a warning message > > > > > when we mount a pmem device with the dax option, has existed for a > > > > > while. It's a bit annoying when using the fsdax feature. So, my initial > > > > > intention was to remove this tag. And I started to find out and solve > > > > > the problems which prevent it from being removed. > > > > > > > > > > As discussed before, there are 3 main problems. The first one is "dax > > > > > semantics", which has been resolved. The remaining two are "RMAP for > > > > > fsdax" and "support dax reflink for filesystem", which I have been > > > > > working on. > > > > > > > > > > So, what I want to confirm is: does it mean that we can remove the > > > > > "EXPERIMENTAL" tag once the remaining two problems are solved? > > > > > > > > Yes. I'd keep the experimental tag for a cycle or two to make sure that > > > > nothing new pops up, but otherwise the two patchsets you've sent close > > > > those two big remaining gaps. Thank you for working on this! > > > > > > > > > Or maybe there are other important problems that need to be fixed before > > > > > removing it? If there are, could you please show me that? > > > > > > > > That remains to be seen through QA/validation, but I think that's it. > > > > > > > > Granted, I still have to read through the two patchsets... > > > > > > I've been meaning to circle back here as well.
> > > > > > My immediate concern is the issue Jason recently highlighted [1] with > > > respect to invalidating all dax mappings when / if the device is > > > ripped out from underneath the fs. I don't think that will collide > > > with Ruan's implementation, but it does need new communication from > > > driver to fs about removal events. > > > > > > [1]: > > > http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com > > > > Oh, yay. > > > > The XFS shutdown code is centred around preventing new IO from being > > issued - we don't actually do anything about DAX mappings because, > > well, I don't think anyone on the filesystem side thought they had > > to do anything special if pmem went away from under it. > > > > My understanding -was- that the pmem removal invalidates > > all the ptes currently mapped into CPU page tables that point at > > the dax device across the system. The vmas that manage these > > mappings are not really something the filesystem manages, > > but a function of the mm subsystem. What the filesystem cares about > > is that it gets page faults triggered when a change of state occurs > > so that it can remap the page to its backing store correctly. > > > > IOWs, all the mm subsystem needs to do when pmem goes away is clear the > > CPU ptes, because then when userspace tries to access the > > mapped DAX pages we get a new page fault. In processing the fault, the > > filesystem will try to get direct access to the pmem from the block > > device. This will get an ENODEV error from the block device because > > the backing store (pmem) has been unplugged and is no longer > > there... > > > > AFAICT, as long as pmem removal invalidates all the active ptes that > > point at the pmem being removed, the filesystem doesn't need to > > care about device removal at all, DAX or no DAX...
> > How would the pmem removal do that without walking all the active > inodes in the fs at the time of shutdown and call > unmap_mapping_range(inode->i_mapping, 0, 0, 1)? Which then immediately ends up back at the vmas that manage the ptes to unmap them. Isn't finding the vma(s) that map a specific memory range exactly what the rmap code in the mm subsystem is supposed to address? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Fri, Feb 26, 2021 at 12:51 PM Dave Chinner wrote: > > On Fri, Feb 26, 2021 at 11:24:53AM -0800, Dan Williams wrote: > > On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong wrote: > > > > > > On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com wrote: > > > > Hi, guys > > > > > > > > Beside this patchset, I'd like to confirm something about the > > > > "EXPERIMENTAL" tag for dax in XFS. > > > > > > > > In XFS, the "EXPERIMENTAL" tag, which is reported in a warning message > > > > when we mount a pmem device with the dax option, has existed for a > > > > while. It's a bit annoying when using the fsdax feature. So, my initial > > > > intention was to remove this tag. And I started to find out and solve > > > > the problems which prevent it from being removed. > > > > > > > > As discussed before, there are 3 main problems. The first one is "dax > > > > semantics", which has been resolved. The remaining two are "RMAP for > > > > fsdax" and "support dax reflink for filesystem", which I have been > > > > working on. > > > > > > > > So, what I want to confirm is: does it mean that we can remove the > > > > "EXPERIMENTAL" tag once the remaining two problems are solved? > > > > > > Yes. I'd keep the experimental tag for a cycle or two to make sure that > > > nothing new pops up, but otherwise the two patchsets you've sent close > > > those two big remaining gaps. Thank you for working on this! > > > > > > > Or maybe there are other important problems that need to be fixed before > > > > removing it? If there are, could you please show me that? > > > > > > That remains to be seen through QA/validation, but I think that's it. > > > > > > Granted, I still have to read through the two patchsets... > > > > I've been meaning to circle back here as well. > > > > My immediate concern is the issue Jason recently highlighted [1] with > > respect to invalidating all dax mappings when / if the device is > > ripped out from underneath the fs.
I don't think that will collide > > with Ruan's implementation, but it does need new communication from > > driver to fs about removal events. > > > > [1]: > > http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com > > Oh, yay. > > The XFS shutdown code is centred around preventing new IO from being > issued - we don't actually do anything about DAX mappings because, > well, I don't think anyone on the filesystem side thought they had > to do anything special if pmem went away from under it. > > My understanding -was- that the pmem removal invalidates > all the ptes currently mapped into CPU page tables that point at > the dax device across the system. The vmas that manage these > mappings are not really something the filesystem manages, > but a function of the mm subsystem. What the filesystem cares about > is that it gets page faults triggered when a change of state occurs > so that it can remap the page to its backing store correctly. > > IOWs, all the mm subsystem needs to do when pmem goes away is clear the > CPU ptes, because then when userspace tries to access the > mapped DAX pages we get a new page fault. In processing the fault, the > filesystem will try to get direct access to the pmem from the block > device. This will get an ENODEV error from the block device because > the backing store (pmem) has been unplugged and is no longer > there... > > AFAICT, as long as pmem removal invalidates all the active ptes that > point at the pmem being removed, the filesystem doesn't need to > care about device removal at all, DAX or no DAX... How would the pmem removal do that without walking all the active inodes in the fs at the time of shutdown and call unmap_mapping_range(inode->i_mapping, 0, 0, 1)? The core-mm does tear down the ptes in the direct map, but user mappings to pmem are not torn down, afaics, in xfs_do_force_shutdown().
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Fri, Feb 26, 2021 at 11:24:53AM -0800, Dan Williams wrote: > On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong wrote: > > > > On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com wrote: > > > Hi, guys > > > > > > Beside this patchset, I'd like to confirm something about the > > > "EXPERIMENTAL" tag for dax in XFS. > > > > > > In XFS, the "EXPERIMENTAL" tag, which is reported in a warning message > > > when we mount a pmem device with the dax option, has existed for a > > > while. It's a bit annoying when using the fsdax feature. So, my initial > > > intention was to remove this tag. And I started to find out and solve > > > the problems which prevent it from being removed. > > > > > > As discussed before, there are 3 main problems. The first one is "dax > > > semantics", which has been resolved. The remaining two are "RMAP for > > > fsdax" and "support dax reflink for filesystem", which I have been > > > working on. > > > > > > So, what I want to confirm is: does it mean that we can remove the > > > "EXPERIMENTAL" tag once the remaining two problems are solved? > > > > Yes. I'd keep the experimental tag for a cycle or two to make sure that > > nothing new pops up, but otherwise the two patchsets you've sent close > > those two big remaining gaps. Thank you for working on this! > > > > > Or maybe there are other important problems that need to be fixed before > > > removing it? If there are, could you please show me that? > > > > That remains to be seen through QA/validation, but I think that's it. > > > > Granted, I still have to read through the two patchsets... > > I've been meaning to circle back here as well. > > My immediate concern is the issue Jason recently highlighted [1] with > respect to invalidating all dax mappings when / if the device is > ripped out from underneath the fs. I don't think that will collide > with Ruan's implementation, but it does need new communication from > driver to fs about removal events.
> > [1]: > http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com Oh, yay. The XFS shutdown code is centred around preventing new IO from being issued - we don't actually do anything about DAX mappings because, well, I don't think anyone on the filesystem side thought they had to do anything special if pmem went away from under it. My understanding -was- that the pmem removal invalidates all the ptes currently mapped into CPU page tables that point at the dax device across the system. The vmas that manage these mappings are not really something the filesystem manages, but a function of the mm subsystem. What the filesystem cares about is that it gets page faults triggered when a change of state occurs so that it can remap the page to its backing store correctly. IOWs, all the mm subsystem needs to do when pmem goes away is clear the CPU ptes, because then when userspace tries to access the mapped DAX pages we get a new page fault. In processing the fault, the filesystem will try to get direct access to the pmem from the block device. This will get an ENODEV error from the block device because the backing store (pmem) has been unplugged and is no longer there... AFAICT, as long as pmem removal invalidates all the active ptes that point at the pmem being removed, the filesystem doesn't need to care about device removal at all, DAX or no DAX... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong wrote: > > On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com wrote: > > Hi, guys > > > > Beside this patchset, I'd like to confirm something about the > > "EXPERIMENTAL" tag for dax in XFS. > > > > In XFS, the "EXPERIMENTAL" tag, which is reported in a warning message > > when we mount a pmem device with the dax option, has existed for a > > while. It's a bit annoying when using the fsdax feature. So, my initial > > intention was to remove this tag. And I started to find out and solve > > the problems which prevent it from being removed. > > > > As discussed before, there are 3 main problems. The first one is "dax > > semantics", which has been resolved. The remaining two are "RMAP for > > fsdax" and "support dax reflink for filesystem", which I have been > > working on. > > > > So, what I want to confirm is: does it mean that we can remove the > > "EXPERIMENTAL" tag once the remaining two problems are solved? > > Yes. I'd keep the experimental tag for a cycle or two to make sure that > nothing new pops up, but otherwise the two patchsets you've sent close > those two big remaining gaps. Thank you for working on this! > > > Or maybe there are other important problems that need to be fixed before > > removing it? If there are, could you please show me that? > > That remains to be seen through QA/validation, but I think that's it. > > Granted, I still have to read through the two patchsets... I've been meaning to circle back here as well. My immediate concern is the issue Jason recently highlighted [1] with respect to invalidating all dax mappings when / if the device is ripped out from underneath the fs. I don't think that will collide with Ruan's implementation, but it does need new communication from driver to fs about removal events.
[1]: http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com wrote: > Hi, guys > > Beside this patchset, I'd like to confirm something about the > "EXPERIMENTAL" tag for dax in XFS. > > In XFS, the "EXPERIMENTAL" tag, which is reported in a warning message > when we mount a pmem device with the dax option, has existed for a > while. It's a bit annoying when using the fsdax feature. So, my initial > intention was to remove this tag. And I started to find out and solve > the problems which prevent it from being removed. > > As discussed before, there are 3 main problems. The first one is "dax > semantics", which has been resolved. The remaining two are "RMAP for > fsdax" and "support dax reflink for filesystem", which I have been > working on. > So, what I want to confirm is: does it mean that we can remove the > "EXPERIMENTAL" tag once the remaining two problems are solved? Yes. I'd keep the experimental tag for a cycle or two to make sure that nothing new pops up, but otherwise the two patchsets you've sent close those two big remaining gaps. Thank you for working on this! > Or maybe there are other important problems that need to be fixed before > removing it? If there are, could you please show me that? That remains to be seen through QA/validation, but I think that's it. Granted, I still have to read through the two patchsets... --D > > Thank you. > > > -- > Ruan Shiyang.