Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On the 28th of August 2014 at 09:17, Dave Chinner wrote:
> On Wed, Aug 27, 2014 at 02:30:55PM -0700, Andrew Morton wrote:
> > On Wed, 27 Aug 2014 16:22:20 -0500 (CDT) Christoph Lameter wrote:
> > > > Some explanation of why one would use ext4 instead of, say,
> > > > suitably-modified ramfs/tmpfs/rd/etc?
> > >
> > > The NVDIMM contents survive reboot and therefore ramfs and friends
> > > won't work with it.
> >
> > See "suitably modified". Presumably this type of memory would need to
> > come from a particular page allocator zone. ramfs would be unwieldy
> > due to its use of dentry/inode caches, but rd/etc should be feasible.

Hello Dave and the others,

Thank you very much for your patience and for the summarisation that follows.

> That's where we started about two years ago with that horrible pramfs trainwreck. To start with: brd is a block device, not a filesystem. We still need the filesystem on top of a persistent ram disk to make it useful to applications. We can do this with ext4/XFS right now, and that is the fundamental basis on which DAX is built.
>
> For the sake of the discussion, however, let's walk through what is required to make an "existing" ramfs persistent. Persistence means we can't just wipe it and start again if it gets corrupted, and rebooting is not a fix for problems. Hence we need to be able to identify it, check it, repair it, ensure metadata operations are persistent across machine crashes, etc, so there are all sorts of management tools required by a persistent ramfs. But most important of all: the persistent storage format needs to be forwards and backwards compatible across kernel versions. Hence we can't encode any structure the kernel uses internally into the persistent storage because they aren't stable structures. That means we need to marshall objects between the persistence domain and the volatile domain in an orderly fashion.

Two little questions:

1. If we were to forgo compatibility across kernel versions, purely for the sake of argument, would it make sense at all to encode a structure that the kernel uses internally, and what advantages could be gained that way?

2. Have the said structures used by the kernel really changed that many times?

> We can avoid using the dentry/inode *caches* by freeing those volatile objects the moment reference counts drop to zero rather than putting them on LRUs. However, we can't store them in persistent storage and we can't avoid using them to interface with the VFS, so it makes little sense to burn CPU continually marshalling such structures in and out of volatile memory if we have free RAM to do so. So even with a "persistent ramfs", caching the working set of volatile VFS objects makes sense from a performance point of view.

I am sorry to say so, but I am confused again and do not understand this argument, because we are already talking about NVDIMMs here. So, if we have those volatile VFS objects already in NVDIMMs, so to say, then we have them in persistent storage and in DRAM at the same time.

> Then you've got crash recovery management: NVDIMMs are not synchronous: they can still lose data while it is being written on power loss. And we can't update persistent memory piecemeal as the VFS code modifies metadata - there need to be synchronisation points, otherwise we will always have inconsistent metadata state in persistent memory. Persistent memory also can't do atomic writes across multiple, disjoint CPU cachelines or NVDIMMs, and this is what is needed for synchronisation points that make multi-object metadata modification operations consistent after a crash. There is some work in the NVMe working groups to define this, but so far there hasn't been any useful outcome, and then we will have to wait for CPUs to implement those interfaces. Hence the metadata that indexes the persistent RAM needs to use COW techniques, use a log structure or use WAL (journalling).
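The marshalling Dave describes — never storing kernel-internal structures, only a stable on-media format — can be sketched roughly as follows. This is a purely hypothetical 32-byte record invented for illustration, not any real filesystem's layout: the on-media side uses fixed-width little-endian fields so it means the same thing on every kernel version and architecture, while the in-memory side is free to change between releases.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stable on-media record: fixed size, fixed byte order. */
struct ondisk_inode {
	uint8_t raw[32];
};

/* Volatile in-memory form: layout may change freely between kernels. */
struct mem_inode {
	uint64_t size;
	uint32_t uid, gid;
	uint32_t mode;
};

static void put_le64(uint8_t *p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		p[i] = v >> (8 * i);
}

static void put_le32(uint8_t *p, uint32_t v)
{
	for (int i = 0; i < 4; i++)
		p[i] = v >> (8 * i);
}

static uint64_t get_le64(const uint8_t *p)
{
	uint64_t v = 0;
	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

static uint32_t get_le32(const uint8_t *p)
{
	uint32_t v = 0;
	for (int i = 3; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Marshal: volatile domain -> persistence domain. */
static void inode_to_disk(struct ondisk_inode *d, const struct mem_inode *m)
{
	memset(d->raw, 0, sizeof(d->raw));
	put_le64(d->raw + 0,  m->size);
	put_le32(d->raw + 8,  m->uid);
	put_le32(d->raw + 12, m->gid);
	put_le32(d->raw + 16, m->mode);
}

/* Unmarshal: persistence domain -> volatile domain. */
static void inode_from_disk(struct mem_inode *m, const struct ondisk_inode *d)
{
	m->size = get_le64(d->raw + 0);
	m->uid  = get_le32(d->raw + 8);
	m->gid  = get_le32(d->raw + 12);
	m->mode = get_le32(d->raw + 16);
}
```

The CPU cost of this round-trip on every metadata access is exactly why caching the unmarshalled objects in DRAM still pays off.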
> Hence that "persistent ramfs" is now looking much more like a database or traditional filesystem. Further, it's going to need to scale to very large amounts of storage. We're talking about machines with *tens of TB* of NVDIMM capacity in the immediate future, and so free space management and concurrency of allocation and freeing of used space are going to be fundamental to the performance of the persistent NVRAM filesystem. So, you end up with block/allocation groups to subdivide the space. Looking a lot like ext4 or XFS at this point. And now you have to scale to indexing tens of millions of everything. At least tens of millions - hundreds of millions to billions is more likely, because storing tens of terabytes of small files is going to require indexing billions of files. And because there is no performance penalty for doing this, people will use the filesystem as a great big database. So now you have to have scalable POSIX-compatible directory structures, scalable freespace indexation, and dynamic, scalable inode allocation, freeing, etc. Oh, and it also needs to be highly concurrent to handle machines with hundreds of CPU cores.
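The block/allocation-group idea above can be reduced to a toy sketch: give each group its own allocation cursor, so concurrent allocators on different CPUs mostly touch different state instead of serialising on one global free-space structure. This is an invented bump allocator for illustration only, not ext4 or XFS code; `alloc_block()` and the group geometry are assumptions of the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NGROUPS          16	/* invented geometry for the sketch */
#define BLOCKS_PER_GROUP 8192

struct alloc_group {
	atomic_uint next_free;	/* per-group cursor; contention stays local */
};

static struct alloc_group groups[NGROUPS];	/* zero-initialised */

/*
 * Pick a group keyed by the caller's CPU (or thread) id, then take the
 * next free block in that group.  Returns a global block number, or -1
 * if the chosen group is exhausted (a real filesystem would fall back
 * to scanning other groups).
 */
static int64_t alloc_block(unsigned int cpu)
{
	unsigned int gi = cpu % NGROUPS;
	unsigned int slot = atomic_fetch_add(&groups[gi].next_free, 1);

	if (slot >= BLOCKS_PER_GROUP)
		return -1;
	return (int64_t)gi * BLOCKS_PER_GROUP + slot;
}
```

Two callers on different CPUs hit different `next_free` counters, which is the whole point of subdividing the space.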
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On Thu, 2014-08-28 at 11:08 +0300, Boaz Harrosh wrote: > On 08/27/2014 06:45 AM, Matthew Wilcox wrote: > > One of the primary uses for NV-DIMMs is to expose them as a block device > > and use a filesystem to store files on the NV-DIMM. While that works, > > it currently wastes memory and CPU time buffering the files in the page > > cache. We have support in ext2 for bypassing the page cache, but it > > has some races which are unfixable in the current design. This series > > of patches rewrite the underlying support, and add support for direct > > access to ext4. > > > > Note that patch 6/21 has been included in > > https://git.kernel.org/cgit/linux/kernel/git/viro/vfs.git/log/?h=for-next-candidate > > > > Matthew hi > > Could you please push this to the regular or a new public tree? > > (Old versions are at: https://github.com/01org/prd) > > Thanks > Boaz Hi Boaz, I've pushed the updated tree to https://github.com/01org/prd in the master branch. All the older versions of the code that we've had while rebasing are still available in their own branches. Thanks, - Ross
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On Wed, Aug 27, 2014 at 06:30:27PM -0700, Andy Lutomirski wrote:
> 4) No page faults ever once a page is writable (I hope -- I'm not sure
> whether this series actually achieves that goal).

I can't think of a circumstance in which you'd end up taking a page fault after a writable mapping is established. The next part to this series (that I'm working on now) is PMD support.
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On Wed, Aug 27, 2014 at 02:46:22PM -0700, Andrew Morton wrote: > > > Sat down to read all this but I'm finding it rather unwieldy - it's > > > just a great blob of code. Is there some overall > > > what-it-does-and-how-it-does-it roadmap? > > > > The overall goal is to map persistent memory / NV-DIMMs directly to > > userspace. We have that functionality in the XIP code, but the way > > it's structured is unsuitable for filesystems like ext4 & XFS, and > > it has some pretty ugly races. > > When thinking about looking at the patchset I wonder things like how > does mmap work, in what situations does a page get COWed, how do we > handle partial pages at EOF, etc. I guess that's all part of the > filemap_xip legacy, the details of which I've totally forgotten. mmap works by installing a PTE that points to the storage. This implies that the NV-DIMM has to be the kind that always has everything mapped (there are other types that require commands to be sent to move windows around that point into the storage ... DAX is not for these types of DIMMs). We use a VM_MIXEDMAP vma. The PTEs pointing to PFNs will just get copied across on fork. Read-faults on holes are covered by a read-only page cache page. On a write to a hole, any page cache page covering it will be unmapped and evicted from the page cache. The mapping for the faulting task will be replaced with a mapping to the newly established block, but other mappings will take a fresh fault on their next reference. Partial pages are mmapable, just as they are with page-cache based files. You can even store beyond EOF, just as with page-cache files. Those stores are, of course, going to end up on persistence, but they might well end up being zeroed if the file is extended ... again, this is no different to page-cache based files. > > > Performance testing results? > > > > I haven't been running any performance tests. What sort of performance > > tests would be interesting for you to see? > > fs benchmarks? 
`dd' would be a good start ;) > > I assume (because I wasn't told!) that there are two objectives here: > > 1) reduce memory consumption by not maintaining pagecache and > 2) reduce CPU cost by avoiding the double-copies. > > These things are pretty easily quantified. And really they must be > quantified as part of the developer testing, because if you find > they've worsened then holy cow, what went wrong. It's really a functionality argument; the users we anticipate for NV-DIMMs really want to directly map them into memory and do a lot of work through loads and stores with the kernel not being involved at all, so we don't actually have any performance targets for things like read/write. That said, when running xfstests and comparing results between ext4 with and without DAX, I do see many of the tests completing quicker with DAX than without (others "run for thirty seconds" so there's no time difference between with/without). > None of the patch titles identify the subsystem(s) which they're > hitting. eg, "Introduce IS_DAX(inode)" is an ext2 patch, but nobody > would know that from browsing the titles. I actually see that one as being a VFS patch ... ext2 changing is just a side-effect. I can re-split that patch if desired.
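For reference, the "directly map them into memory and do the work through loads and stores" model looks like this from userspace. The code is plain POSIX and runs against a file on any filesystem; on a DAX-capable mount the mapping would point straight at the NV-DIMM with no page-cache copy in between. The path is just a stand-in for the example.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map a file and modify it with plain stores: no read()/write()
 * syscalls on the data path.  Returns 0 on success, -1 on error.
 */
int demo(const char *path)
{
	const size_t len = 4096;
	int fd = open(path, O_CREAT | O_RDWR, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, len) < 0) {
		close(fd);
		return -1;
	}

	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}

	strcpy(p, "hello, pmem");	/* a plain store, no write() syscall */
	msync(p, len, MS_SYNC);		/* force the data out to the medium */

	munmap(p, len);
	close(fd);
	return 0;
}
```

This is why the interesting benchmarks here are mmap-heavy workloads rather than buffered read/write throughput.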
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On 08/27/2014 06:45 AM, Matthew Wilcox wrote:
> One of the primary uses for NV-DIMMs is to expose them as a block device
> and use a filesystem to store files on the NV-DIMM. While that works,
> it currently wastes memory and CPU time buffering the files in the page
> cache. We have support in ext2 for bypassing the page cache, but it
> has some races which are unfixable in the current design. This series
> of patches rewrites the underlying support, and adds support for direct
> access to ext4.
>
> Note that patch 6/21 has been included in
> https://git.kernel.org/cgit/linux/kernel/git/viro/vfs.git/log/?h=for-next-candidate

Matthew hi,

Could you please push this to the regular or a new public tree? (Old versions are at: https://github.com/01org/prd)

Thanks
Boaz

> This iteration of the patchset rebases to 3.17-rc2, changes the page fault
> locking, fixes a couple of bugs and makes a few other minor changes.
>
> - Move the calculation of the maximum size available at the requested
>   location from the ->direct_access implementations to bdev_direct_access()
> - Fix a comment typo (Ross Zwisler)
> - Check that the requested length is positive in bdev_direct_access(). If
>   it is not, assume that it's an errno, and just return it.
> - Fix some whitespace issues flagged by checkpatch
> - Added the Acked-by responses from Kirill that I forgot in the last round
> - Added myself to MAINTAINERS for DAX
> - Fixed compilation with !CONFIG_DAX (Vishal Verma)
> - Revert the locking in the page fault handler back to an earlier version.
>   If we hit the race that we were trying to protect against, we will leave
>   blocks allocated past the end of the file. They will be removed on file
>   removal, the next truncate, or fsck.
> Matthew Wilcox (20):
>   axonram: Fix bug in direct_access
>   Change direct_access calling convention
>   Fix XIP fault vs truncate race
>   Allow page fault handlers to perform the COW
>   Introduce IS_DAX(inode)
>   Add copy_to_iter(), copy_from_iter() and iov_iter_zero()
>   Replace XIP read and write with DAX I/O
>   Replace ext2_clear_xip_target with dax_clear_blocks
>   Replace the XIP page fault handler with the DAX page fault handler
>   Replace xip_truncate_page with dax_truncate_page
>   Replace XIP documentation with DAX documentation
>   Remove get_xip_mem
>   ext2: Remove ext2_xip_verify_sb()
>   ext2: Remove ext2_use_xip
>   ext2: Remove xip.c and xip.h
>   Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX
>   ext2: Remove ext2_aops_xip
>   Get rid of most mentions of XIP in ext2
>   xip: Add xip_zero_page_range
>   brd: Rename XIP to DAX
>
> Ross Zwisler (1):
>   ext4: Add DAX functionality
>
>  Documentation/filesystems/Locking  |   3 -
>  Documentation/filesystems/dax.txt  |  91 +++
>  Documentation/filesystems/ext4.txt |   2 +
>  Documentation/filesystems/xip.txt  |  68 -
>  MAINTAINERS                        |   6 +
>  arch/powerpc/sysdev/axonram.c      |  19 +-
>  drivers/block/Kconfig              |  13 +-
>  drivers/block/brd.c                |  26 +-
>  drivers/s390/block/dcssblk.c       |  21 +-
>  fs/Kconfig                         |  21 +-
>  fs/Makefile                        |   1 +
>  fs/block_dev.c                     |  40 +++
>  fs/dax.c                           | 497 +
>  fs/exofs/inode.c                   |   1 -
>  fs/ext2/Kconfig                    |  11 -
>  fs/ext2/Makefile                   |   1 -
>  fs/ext2/ext2.h                     |  10 +-
>  fs/ext2/file.c                     |  45 +++-
>  fs/ext2/inode.c                    |  38 +--
>  fs/ext2/namei.c                    |  13 +-
>  fs/ext2/super.c                    |  53 ++--
>  fs/ext2/xip.c                      |  91 ---
>  fs/ext2/xip.h                      |  26 --
>  fs/ext4/ext4.h                     |   6 +
>  fs/ext4/file.c                     |  49 +++-
>  fs/ext4/indirect.c                 |  18 +-
>  fs/ext4/inode.c                    |  51 ++--
>  fs/ext4/namei.c                    |  10 +-
>  fs/ext4/super.c                    |  39 ++-
>  fs/open.c                          |   5 +-
>  include/linux/blkdev.h             |   6 +-
>  include/linux/fs.h                 |  49 +++-
>  include/linux/mm.h                 |   1 +
>  include/linux/uio.h                |   3 +
>  mm/Makefile                        |   1 -
>  mm/fadvise.c                       |   6 +-
>  mm/filemap.c                       |   6 +-
>  mm/filemap_xip.c                   | 483 ---
>  mm/iov_iter.c                      | 237 --
>  mm/madvise.c                       |   2 +-
>  mm/memory.c                        |  33 ++-
>  41 files changed, 1229 insertions(+), 873 deletions(-)
>  create mode 100644 Documentation/filesystems/dax.txt
>  delete mode 100644 Documentation/filesystems/xip.txt
>  create mode 100644 fs/dax.c
>  delete mode 100644 fs/ext2/xip.c
>  delete mode 100644 fs/ext2/xip.h
>  delete mode 100644 mm/filemap_xip.c
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On Wed, Aug 27, 2014 at 02:30:55PM -0700, Andrew Morton wrote:
> On Wed, 27 Aug 2014 16:22:20 -0500 (CDT) Christoph Lameter
> wrote:
> > > Some explanation of why one would use ext4 instead of, say,
> > > suitably-modified ramfs/tmpfs/rd/etc?
> >
> > The NVDIMM contents survive reboot and therefore ramfs and friends won't
> > work with it.
>
> See "suitably modified". Presumably this type of memory would need to
> come from a particular page allocator zone. ramfs would be unwieldy
> due to its use of dentry/inode caches, but rd/etc should be feasible.

That's where we started about two years ago with that horrible pramfs trainwreck.

To start with: brd is a block device, not a filesystem. We still need the filesystem on top of a persistent ram disk to make it useful to applications. We can do this with ext4/XFS right now, and that is the fundamental basis on which DAX is built.

For the sake of the discussion, however, let's walk through what is required to make an "existing" ramfs persistent. Persistence means we can't just wipe it and start again if it gets corrupted, and rebooting is not a fix for problems. Hence we need to be able to identify it, check it, repair it, ensure metadata operations are persistent across machine crashes, etc, so there are all sorts of management tools required by a persistent ramfs.

But most important of all: the persistent storage format needs to be forwards and backwards compatible across kernel versions. Hence we can't encode any structure the kernel uses internally into the persistent storage because they aren't stable structures. That means we need to marshall objects between the persistence domain and the volatile domain in an orderly fashion.

We can avoid using the dentry/inode *caches* by freeing those volatile objects the moment reference counts drop to zero rather than putting them on LRUs.
However, we can't store them in persistent storage and we can't avoid using them to interface with the VFS, so it makes little sense to burn CPU continually marshalling such structures in and out of volatile memory if we have free RAM to do so. So even with a "persistent ramfs", caching the working set of volatile VFS objects makes sense from a performance point of view.

Then you've got crash recovery management: NVDIMMs are not synchronous: they can still lose data while it is being written on power loss. And we can't update persistent memory piecemeal as the VFS code modifies metadata - there need to be synchronisation points, otherwise we will always have inconsistent metadata state in persistent memory. Persistent memory also can't do atomic writes across multiple, disjoint CPU cachelines or NVDIMMs, and this is what is needed for synchronisation points that make multi-object metadata modification operations consistent after a crash. There is some work in the NVMe working groups to define this, but so far there hasn't been any useful outcome, and then we will have to wait for CPUs to implement those interfaces. Hence the metadata that indexes the persistent RAM needs to use COW techniques, use a log structure or use WAL (journalling).

Hence that "persistent ramfs" is now looking much more like a database or traditional filesystem. Further, it's going to need to scale to very large amounts of storage. We're talking about machines with *tens of TB* of NVDIMM capacity in the immediate future, and so free space management and concurrency of allocation and freeing of used space are going to be fundamental to the performance of the persistent NVRAM filesystem. So, you end up with block/allocation groups to subdivide the space. Looking a lot like ext4 or XFS at this point. And now you have to scale to indexing tens of millions of everything.
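The synchronisation-point requirement described above (WAL/journalling) boils down to an ordering discipline that can be sketched in a few lines. In this sketch, `persist_barrier()` is a stand-in no-op for the cache-line writeback plus fence (e.g. clwb + sfence) a real persistent-memory implementation would need, and the single-record "journal" is invented for the example so the logic can run anywhere.

```c
#include <assert.h>
#include <stdint.h>

#define NSLOTS 8

static uint64_t table[NSLOTS];		/* the "metadata" being updated */

static struct log_rec {
	int committed;			/* record valid and not yet retired */
	int slot;
	uint64_t newval;
} wal;

/* On real pmem this would be cache-line writeback + fence. */
static void persist_barrier(void) { }

static void update(int slot, uint64_t val)
{
	/* 1. make the intent durable before touching the metadata */
	wal.slot = slot;
	wal.newval = val;
	persist_barrier();
	wal.committed = 1;
	persist_barrier();

	/* 2. apply in place; a crash here is repaired by replay() */
	table[slot] = val;
	persist_barrier();

	/* 3. retire the record only after the update is durable */
	wal.committed = 0;
	persist_barrier();
}

/* Crash recovery: re-apply any committed-but-unretired record. */
static void replay(void)
{
	if (wal.committed) {
		table[wal.slot] = wal.newval;
		persist_barrier();
		wal.committed = 0;
		persist_barrier();
	}
}
```

The ordering is the whole game: swap any two of the barriered steps and a power loss at the wrong moment leaves the metadata inconsistent with no record to repair it from.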
At least tens of millions - hundreds of millions to billions is more likely, because storing tens of terabytes of small files is going to require indexing billions of files. And because there is no performance penalty for doing this, people will use the filesystem as a great big database. So now you have to have scalable POSIX-compatible directory structures, scalable freespace indexation, and dynamic, scalable inode allocation, freeing, etc. Oh, and it also needs to be highly concurrent to handle machines with hundreds of CPU cores.

Funnily enough, we already have a couple of persistent storage implementations that solve these problems to varying degrees. ext4 is one of them, if you ignore the scalability and concurrency requirements. XFS is the other. And both will run unmodified on a persistent ram block device, which we *already have*.

And so back to DAX. What users actually want from their high speed persistent RAM storage is direct, CPU-addressable access to that persistent storage. They don't want to have to care about how to find an object in the persistent storage - that's what filesystems are for - they just want to be able to read and write to it directly. That's what DAX does - it provides existing filesystems a method
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On Wed, Aug 27, 2014 at 02:46:22PM -0700, Andrew Morton wrote:

> > > Sat down to read all this but I'm finding it rather unwieldy - it's just a great blob of code.  Is there some overall what-it-does-and-how-it-does-it roadmap?
> >
> > The overall goal is to map persistent memory / NV-DIMMs directly to userspace.  We have that functionality in the XIP code, but the way it's structured is unsuitable for filesystems like ext4 & XFS, and it has some pretty ugly races.
>
> When thinking about looking at the patchset I wonder things like how does mmap work, in what situations does a page get COWed, how do we handle partial pages at EOF, etc.  I guess that's all part of the filemap_xip legacy, the details of which I've totally forgotten.

mmap works by installing a PTE that points to the storage.  This implies that the NV-DIMM has to be the kind that always has everything mapped (there are other types that require commands to be sent to move windows around that point into the storage ... DAX is not for these types of DIMMs).  We use a VM_MIXEDMAP vma.  The PTEs pointing to PFNs will just get copied across on fork.

Read-faults on holes are covered by a read-only page cache page.  On a write to a hole, any page cache page covering it will be unmapped and evicted from the page cache.  The mapping for the faulting task will be replaced with a mapping to the newly established block, but other mappings will take a fresh fault on their next reference.

Partial pages are mmapable, just as they are with page-cache based files.  You can even store beyond EOF, just as with page-cache files.  Those stores are, of course, going to end up in persistent memory, but they might well end up being zeroed if the file is extended ... again, this is no different to page-cache based files.

> > > Performance testing results?
> >
> > I haven't been running any performance tests.  What sort of performance tests would be interesting for you to see?
>
> fs benchmarks?  `dd' would be a good start ;)  I assume (because I wasn't told!) that there are two objectives here: 1) reduce memory consumption by not maintaining pagecache and 2) reduce CPU cost by avoiding the double-copies.  These things are pretty easily quantified.  And really they must be quantified as part of the developer testing, because if you find they've worsened then holy cow, what went wrong.

It's really a functionality argument; the users we anticipate for NV-DIMMs really want to directly map them into memory and do a lot of work through loads and stores with the kernel not being involved at all, so we don't actually have any performance targets for things like read/write.  That said, when running xfstests and comparing results between ext4 with and without DAX, I do see many of the tests completing quicker with DAX than without (others run for thirty seconds, so there's no time difference between with/without).

> None of the patch titles identify the subsystem(s) which they're hitting.  eg, "Introduce IS_DAX(inode)" is an ext2 patch, but nobody would know that from browsing the titles.

I actually see that one as being a VFS patch ... ext2 changing is just a side-effect.  I can re-split that patch if desired.
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On Wed, Aug 27, 2014 at 06:30:27PM -0700, Andy Lutomirski wrote:

> 4) No page faults ever once a page is writable (I hope -- I'm not sure whether this series actually achieves that goal).

I can't think of a circumstance in which you'd end up taking a page fault after a writable mapping is established.  The next part to this series (that I'm working on now) is PMD support.
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On Thu, 2014-08-28 at 11:08 +0300, Boaz Harrosh wrote:

> On 08/27/2014 06:45 AM, Matthew Wilcox wrote:
> > One of the primary uses for NV-DIMMs is to expose them as a block device and use a filesystem to store files on the NV-DIMM.  While that works, it currently wastes memory and CPU time buffering the files in the page cache.  We have support in ext2 for bypassing the page cache, but it has some races which are unfixable in the current design.  This series of patches rewrites the underlying support, and adds support for direct access to ext4.
> >
> > Note that patch 6/21 has been included in https://git.kernel.org/cgit/linux/kernel/git/viro/vfs.git/log/?h=for-next-candidate
>
> Matthew hi
>
> Could you please push this to the regular or a new public tree?  (Old versions are at: https://github.com/01org/prd)
>
> Thanks
> Boaz

Hi Boaz,

I've pushed the updated tree to https://github.com/01org/prd in the master branch.  All the older versions of the code that we've had while rebasing are still available in their own branches.

Thanks,
- Ross
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On 08/27/2014 02:46 PM, Andrew Morton wrote:

> I assume (because I wasn't told!) that there are two objectives here:
>
> 1) reduce memory consumption by not maintaining pagecache and
> 2) reduce CPU cost by avoiding the double-copies.
>
> These things are pretty easily quantified.  And really they must be quantified as part of the developer testing, because if you find they've worsened then holy cow, what went wrong.

There are two more huge ones:

3) Writes via mmap are immediately durable (or at least they're durable after a *very* lightweight flush).

4) No page faults ever once a page is writable (I hope -- I'm not sure whether this series actually achieves that goal).

A note on #3: there is ongoing work to enable write-through memory for things like this.  Once that's done, then writes via mmap might actually be synchronously durable, depending on chipset details.

--Andy
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On Wed, 27 Aug 2014 14:30:55 -0700 Andrew Morton wrote:

> On Wed, 27 Aug 2014 16:22:20 -0500 (CDT) Christoph Lameter wrote:
> > > Some explanation of why one would use ext4 instead of, say, suitably-modified ramfs/tmpfs/rd/etc?
> >
> > The NVDIMM contents survive reboot and therefore ramfs and friends won't work with it.
>
> See "suitably modified".  Presumably this type of memory would need to come from a particular page allocator zone.  ramfs would be unwieldy due to its use of dentry/inode caches, but rd/etc should be feasible.

If you took one of the existing ramfs types you would then need to

- make it persistent in its storage, and put all the objects in the store
- add journalling for failures mid transaction.  Your dimm may retain its bits but if your CPU reset mid fs operation it's got to be recovered
- write an fsck tool for it
- validate it

at which point it's probably turned into ext4 8)

It's persistent but that doesn't solve the 'my box crashed' problem.

Alan
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On Wed, 27 Aug 2014 17:12:50 -0400 Matthew Wilcox wrote:

> On Wed, Aug 27, 2014 at 01:06:13PM -0700, Andrew Morton wrote:
> > On Tue, 26 Aug 2014 23:45:20 -0400 Matthew Wilcox wrote:
> > > One of the primary uses for NV-DIMMs is to expose them as a block device and use a filesystem to store files on the NV-DIMM.  While that works, it currently wastes memory and CPU time buffering the files in the page cache.  We have support in ext2 for bypassing the page cache, but it has some races which are unfixable in the current design.  This series of patches rewrites the underlying support, and adds support for direct access to ext4.
> >
> > Sat down to read all this but I'm finding it rather unwieldy - it's just a great blob of code.  Is there some overall what-it-does-and-how-it-does-it roadmap?
>
> The overall goal is to map persistent memory / NV-DIMMs directly to userspace.  We have that functionality in the XIP code, but the way it's structured is unsuitable for filesystems like ext4 & XFS, and it has some pretty ugly races.

When thinking about looking at the patchset I wonder things like how does mmap work, in what situations does a page get COWed, how do we handle partial pages at EOF, etc.  I guess that's all part of the filemap_xip legacy, the details of which I've totally forgotten.

> Patches 1 & 3 are simply bug-fixes.  They should go in regardless of the merits of anything else in this series.
>
> Patch 2 changes the API for the direct_access block_device_operation so it can report more than a single page at a time.  As the series evolved, this work also included moving support for partitioning into the VFS where it belongs, handling various error cases in the VFS and so on.
>
> Patch 4 is an optimisation.  It's poor form to make userspace take two faults for the same dereference.
>
> Patch 5 gives us a VFS flag for the DAX property, which lets us get rid of the get_xip_mem() method later on.
>
> Patch 6 is also prep work; Al Viro liked it enough that it's now in his tree.
>
> The new DAX code is then dribbled in over patches 7-11, split up by functional area.  At each stage, the ext2-xip code is converted over to the new DAX code.
>
> Patches 12-18 delete the remnants of the old XIP code, and fix the things in ext2 that Jan didn't like when he reviewed them for ext4 :-)
>
> Patches 19 & 20 are the work to make ext4 use DAX.
>
> Patch 21 is some final cleanup of references to the old XIP code, renaming it all to DAX.

hrm.

> > Some explanation of why one would use ext4 instead of, say, suitably-modified ramfs/tmpfs/rd/etc?
>
> ramfs and tmpfs really rely on the page cache.  They're not exactly built for permanence either.  brd also relies on the page cache, and there's a clear desire to use a filesystem instead of a block device for all the usual reasons of access permissions, grow/shrink, etc.
>
> Some people might want to use XFS instead of ext4.  We're starting with ext4, but we've been keeping an eye on what other filesystems might want to use.  btrfs isn't going to use the DAX code, but some of the other pieces will probably come in handy.
>
> There are also at least three people working on their own filesystems specially designed for persistent memory.  I wish them all the best ... but I'd like to get this infrastructure into place.

This is the sort of thing which first-timers (this one at least) like to see in [0/n].

> > Performance testing results?
>
> I haven't been running any performance tests.  What sort of performance tests would be interesting for you to see?

fs benchmarks?  `dd' would be a good start ;)

I assume (because I wasn't told!) that there are two objectives here:

1) reduce memory consumption by not maintaining pagecache and
2) reduce CPU cost by avoiding the double-copies.

These things are pretty easily quantified.  And really they must be quantified as part of the developer testing, because if you find they've worsened then holy cow, what went wrong.

> > Carsten Otte wrote filemap_xip.c and may be a useful reviewer of this work.
>
> I cc'd him on some earlier versions and didn't hear anything back.  It felt rude to keep plying him with 20+ patches every month.

OK.

> > All the patch subjects violate Documentation/SubmittingPatches section 15 ;)
>
> errr ... which bit?  I used git format-patch to create them.

None of the patch titles identify the subsystem(s) which they're hitting.  eg, "Introduce IS_DAX(inode)" is an ext2 patch, but nobody would know that from browsing the titles.
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On Wed, 27 Aug 2014 16:22:20 -0500 (CDT) Christoph Lameter wrote:

> > Some explanation of why one would use ext4 instead of, say, suitably-modified ramfs/tmpfs/rd/etc?
>
> The NVDIMM contents survive reboot and therefore ramfs and friends won't work with it.

See "suitably modified".  Presumably this type of memory would need to come from a particular page allocator zone.  ramfs would be unwieldy due to its use of dentry/inode caches, but rd/etc should be feasible.

I dunno, I'm not proposing implementations - I'm asking obvious questions.  Stuff which should have been addressed in the changelogs before one even starts to read the code...
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On Wed, 27 Aug 2014, Andrew Morton wrote:

> Sat down to read all this but I'm finding it rather unwieldy - it's just a great blob of code.  Is there some overall what-it-does-and-how-it-does-it roadmap?

Matthew gave a talk about DAX at the kernel summit.  It's a great feature because this is another piece of the bare metal hardware technology that is being improved by him.

> Some explanation of why one would use ext4 instead of, say, suitably-modified ramfs/tmpfs/rd/etc?

The NVDIMM contents survive reboot and therefore ramfs and friends won't work with it.

> Performance testing results?

This is obviously avoiding kernel buffering and therefore decreasing kernel overhead for non-volatile memory.  It avoids useless duplication of data from the non-volatile memory into regular RAM and allows direct access to non-volatile memory from user space in a controlled fashion.  I think this should be a priority item.
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On Wed, Aug 27, 2014 at 01:06:13PM -0700, Andrew Morton wrote:

> On Tue, 26 Aug 2014 23:45:20 -0400 Matthew Wilcox wrote:
> > One of the primary uses for NV-DIMMs is to expose them as a block device and use a filesystem to store files on the NV-DIMM.  While that works, it currently wastes memory and CPU time buffering the files in the page cache.  We have support in ext2 for bypassing the page cache, but it has some races which are unfixable in the current design.  This series of patches rewrites the underlying support, and adds support for direct access to ext4.
>
> Sat down to read all this but I'm finding it rather unwieldy - it's just a great blob of code.  Is there some overall what-it-does-and-how-it-does-it roadmap?

The overall goal is to map persistent memory / NV-DIMMs directly to userspace.  We have that functionality in the XIP code, but the way it's structured is unsuitable for filesystems like ext4 & XFS, and it has some pretty ugly races.

Patches 1 & 3 are simply bug-fixes.  They should go in regardless of the merits of anything else in this series.

Patch 2 changes the API for the direct_access block_device_operation so it can report more than a single page at a time.  As the series evolved, this work also included moving support for partitioning into the VFS where it belongs, handling various error cases in the VFS and so on.

Patch 4 is an optimisation.  It's poor form to make userspace take two faults for the same dereference.

Patch 5 gives us a VFS flag for the DAX property, which lets us get rid of the get_xip_mem() method later on.

Patch 6 is also prep work; Al Viro liked it enough that it's now in his tree.

The new DAX code is then dribbled in over patches 7-11, split up by functional area.  At each stage, the ext2-xip code is converted over to the new DAX code.

Patches 12-18 delete the remnants of the old XIP code, and fix the things in ext2 that Jan didn't like when he reviewed them for ext4 :-)

Patches 19 & 20 are the work to make ext4 use DAX.

Patch 21 is some final cleanup of references to the old XIP code, renaming it all to DAX.

> Some explanation of why one would use ext4 instead of, say, suitably-modified ramfs/tmpfs/rd/etc?

ramfs and tmpfs really rely on the page cache.  They're not exactly built for permanence either.  brd also relies on the page cache, and there's a clear desire to use a filesystem instead of a block device for all the usual reasons of access permissions, grow/shrink, etc.

Some people might want to use XFS instead of ext4.  We're starting with ext4, but we've been keeping an eye on what other filesystems might want to use.  btrfs isn't going to use the DAX code, but some of the other pieces will probably come in handy.

There are also at least three people working on their own filesystems specially designed for persistent memory.  I wish them all the best ... but I'd like to get this infrastructure into place.

> Performance testing results?

I haven't been running any performance tests.  What sort of performance tests would be interesting for you to see?

> Carsten Otte wrote filemap_xip.c and may be a useful reviewer of this work.

I cc'd him on some earlier versions and didn't hear anything back.  It felt rude to keep plying him with 20+ patches every month.

> All the patch subjects violate Documentation/SubmittingPatches section 15 ;)

errr ... which bit?  I used git format-patch to create them.
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On Tue, 26 Aug 2014 23:45:20 -0400 Matthew Wilcox wrote:

> One of the primary uses for NV-DIMMs is to expose them as a block device and use a filesystem to store files on the NV-DIMM.  While that works, it currently wastes memory and CPU time buffering the files in the page cache.  We have support in ext2 for bypassing the page cache, but it has some races which are unfixable in the current design.  This series of patches rewrites the underlying support, and adds support for direct access to ext4.

Sat down to read all this but I'm finding it rather unwieldy - it's just a great blob of code.  Is there some overall what-it-does-and-how-it-does-it roadmap?

Some explanation of why one would use ext4 instead of, say, suitably-modified ramfs/tmpfs/rd/etc?

Performance testing results?

Carsten Otte wrote filemap_xip.c and may be a useful reviewer of this work.

All the patch subjects violate Documentation/SubmittingPatches section 15 ;)
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On Tue, 26 Aug 2014 23:45:20 -0400 Matthew Wilcox matthew.r.wil...@intel.com wrote: One of the primary uses for NV-DIMMs is to expose them as a block device and use a filesystem to store files on the NV-DIMM. While that works, it currently wastes memory and CPU time buffering the files in the page cache. We have support in ext2 for bypassing the page cache, but it has some races which are unfixable in the current design. This series of patches rewrite the underlying support, and add support for direct access to ext4. Sat down to read all this but I'm finding it rather unwieldy - it's just a great blob of code. Is there some overall what-it-does-and-how-it-does-it roadmap? Some explanation of why one would use ext4 instead of, say, suitably-modified ramfs/tmpfs/rd/etc? Performance testing results? Carsten Otte wrote filemap_xip.c and may be a useful reviewer of this work. All the patch subjects violate Documentation/SubmittingPatches section 15 ;) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On Wed, Aug 27, 2014 at 01:06:13PM -0700, Andrew Morton wrote: On Tue, 26 Aug 2014 23:45:20 -0400 Matthew Wilcox matthew.r.wil...@intel.com wrote: One of the primary uses for NV-DIMMs is to expose them as a block device and use a filesystem to store files on the NV-DIMM. While that works, it currently wastes memory and CPU time buffering the files in the page cache. We have support in ext2 for bypassing the page cache, but it has some races which are unfixable in the current design. This series of patches rewrite the underlying support, and add support for direct access to ext4. Sat down to read all this but I'm finding it rather unwieldy - it's just a great blob of code. Is there some overall what-it-does-and-how-it-does-it roadmap? The overall goal is to map persistent memory / NV-DIMMs directly to userspace. We have that functionality in the XIP code, but the way it's structured is unsuitable for filesystems like ext4 XFS, and it has some pretty ugly races. Patches 1 3 are simply bug-fixes. They should go in regardless of the merits of anything else in this series. Patch 2 changes the API for the direct_access block_device_operation so it can report more than a single page at a time. As the series evolved, this work also included moving support for partitioning into the VFS where it belongs, handling various error cases in the VFS and so on. Patch 4 is an optimisation. It's poor form to make userspace take two faults for the same dereference. Patch 5 gives us a VFS flag for the DAX property, which lets us get rid of the get_xip_mem() method later on. Patch 6 is also prep work; Al Viro liked it enough that it's now in his tree. The new DAX code is then dribbled in over patches 7-11, split up by functional area. At each stage, the ext2-xip code is converted over to the new DAX code. 
Patches 12-18 delete the remnants of the old XIP code, and fix the things in ext2 that Jan didn't like when he reviewed them for ext4 :-) Patches 19 20 are the work to make ext4 use DAX. Patch 21 is some final cleanup of references to the old XIP code, renaming it all to DAX. Some explanation of why one would use ext4 instead of, say, suitably-modified ramfs/tmpfs/rd/etc? ramfs and tmpfs really rely on the page cache. They're not exactly built for permanence either. brd also relies on the page cache, and there's a clear desire to use a filesystem instead of a block device for all the usual reasons of access permissions, grow/shrink, etc. Some people might want to use XFS instead of ext4. We're starting with ext4, but we've been keeping an eye on what other filesystems might want to use. btrfs isn't going to use the DAX code, but some of the other pieces will probably come in handy. There are also at least three people working on their own filesystems specially designed for persistent memory. I wish them all the best ... but I'd like to get this infrastructure into place. Performance testing results? I haven't been running any performance tests. What sort of performance tests would be interesting for you to see? Carsten Otte wrote filemap_xip.c and may be a useful reviewer of this work. I cc'd him on some earlier versions and didn't hear anything back. It felt rude to keep plying him with 20+ patches every month. All the patch subjects violate Documentation/SubmittingPatches section 15 ;) errr ... which bit? I used git format-patch to create them. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On Wed, 27 Aug 2014, Andrew Morton wrote: Sat down to read all this but I'm finding it rather unwieldy - it's just a great blob of code. Is there some overall what-it-does-and-how-it-does-it roadmap? Matthew gave a talk about DAX at the kernel summit. Its a great feature because this is another piece of the bare metal hardware technology that is being improved by him. Some explanation of why one would use ext4 instead of, say, suitably-modified ramfs/tmpfs/rd/etc? The NVDIMM contents survive reboot and therefore ramfs and friends wont work with it. Performance testing results? This is obviously avoiding kernel buffering and therefore decreasing kernel overhead for non volatile memory. Avoids useless duplication of data from the non volatile memory into regular ram and allows direct access to non volatile memory from user space in a controlled fashion. I think this should be a priority item. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On Wed, 27 Aug 2014 16:22:20 -0500 (CDT) Christoph Lameter c...@linux.com wrote: Some explanation of why one would use ext4 instead of, say, suitably-modified ramfs/tmpfs/rd/etc? The NVDIMM contents survive reboot and therefore ramfs and friends wont work with it. See suitably modified. Presumably this type of memory would need to come from a particular page allocator zone. ramfs would be unweildy due to its use to dentry/inode caches, but rd/etc should be feasible. I dunno, I'm not proposing implementations - I'm asking obvious questions. Stuff which should have been addressed in the changelogs before one even starts to read the code... -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On Wed, 27 Aug 2014 17:12:50 -0400 Matthew Wilcox wi...@linux.intel.com wrote: On Wed, Aug 27, 2014 at 01:06:13PM -0700, Andrew Morton wrote: On Tue, 26 Aug 2014 23:45:20 -0400 Matthew Wilcox matthew.r.wil...@intel.com wrote: One of the primary uses for NV-DIMMs is to expose them as a block device and use a filesystem to store files on the NV-DIMM. While that works, it currently wastes memory and CPU time buffering the files in the page cache. We have support in ext2 for bypassing the page cache, but it has some races which are unfixable in the current design. This series of patches rewrite the underlying support, and add support for direct access to ext4. Sat down to read all this but I'm finding it rather unwieldy - it's just a great blob of code. Is there some overall what-it-does-and-how-it-does-it roadmap? The overall goal is to map persistent memory / NV-DIMMs directly to userspace. We have that functionality in the XIP code, but the way it's structured is unsuitable for filesystems like ext4 XFS, and it has some pretty ugly races. When thinking about looking at the patchset I wonder things like how does mmap work, in what situations does a page get COWed, how do we handle partial pages at EOF, etc. I guess that's all part of the filemap_xip legacy, the details of which I've totally forgotten. Patches 1 3 are simply bug-fixes. They should go in regardless of the merits of anything else in this series. Patch 2 changes the API for the direct_access block_device_operation so it can report more than a single page at a time. As the series evolved, this work also included moving support for partitioning into the VFS where it belongs, handling various error cases in the VFS and so on. Patch 4 is an optimisation. It's poor form to make userspace take two faults for the same dereference. Patch 5 gives us a VFS flag for the DAX property, which lets us get rid of the get_xip_mem() method later on. 
Patch 6 is also prep work; Al Viro liked it enough that it's now in his tree. The new DAX code is then dribbled in over patches 7-11, split up by functional area. At each stage, the ext2-xip code is converted over to the new DAX code. Patches 12-18 delete the remnants of the old XIP code, and fix the things in ext2 that Jan didn't like when he reviewed them for ext4 :-) Patches 19 20 are the work to make ext4 use DAX. Patch 21 is some final cleanup of references to the old XIP code, renaming it all to DAX. hrm. Some explanation of why one would use ext4 instead of, say, suitably-modified ramfs/tmpfs/rd/etc? ramfs and tmpfs really rely on the page cache. They're not exactly built for permanence either. brd also relies on the page cache, and there's a clear desire to use a filesystem instead of a block device for all the usual reasons of access permissions, grow/shrink, etc. Some people might want to use XFS instead of ext4. We're starting with ext4, but we've been keeping an eye on what other filesystems might want to use. btrfs isn't going to use the DAX code, but some of the other pieces will probably come in handy. There are also at least three people working on their own filesystems specially designed for persistent memory. I wish them all the best ... but I'd like to get this infrastructure into place. This is the sort of thing which first-timers (this one at least) like to see in [0/n]. Performance testing results? I haven't been running any performance tests. What sort of performance tests would be interesting for you to see? fs benchmarks? `dd' would be a good start ;) I assume (because I wasn't told!) that there are two objectives here: 1) reduce memory consumption by not maintaining pagecache and 2) reduce CPU cost by avoiding the double-copies. These things are pretty easily quantified. And really they must be quantified as part of the developer testing, because if you find they've worsened then holy cow, what went wrong. 
Carsten Otte wrote filemap_xip.c and may be a useful reviewer of this work. I cc'd him on some earlier versions and didn't hear anything back. It felt rude to keep plying him with 20+ patches every month. OK. All the patch subjects violate Documentation/SubmittingPatches section 15 ;) errr ... which bit? I used git format-patch to create them. None of the patch titles identify the subsystem(s) which they're hitting. eg, Introduce IS_DAX(inode) is an ext2 patch, but nobody would know that from browsing the titles. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On Wed, 27 Aug 2014 14:30:55 -0700 Andrew Morton a...@linux-foundation.org wrote: On Wed, 27 Aug 2014 16:22:20 -0500 (CDT) Christoph Lameter c...@linux.com wrote: Some explanation of why one would use ext4 instead of, say, suitably-modified ramfs/tmpfs/rd/etc? The NVDIMM contents survive reboot and therefore ramfs and friends wont work with it. See suitably modified. Presumably this type of memory would need to come from a particular page allocator zone. ramfs would be unweildy due to its use to dentry/inode caches, but rd/etc should be feasible. If you took one of the existing ramfs types you would then need to - make it persistent in its storage, and put all the objects in the store - add journalling for failures mid transaction. Your dimm may retain its bits but if your CPU reset mid fs operation its got to be recovered - write an fsck tool for it - validate it at which point it's probably turned into ext4 8) It's persistent but that doesn't solve the 'my box crashed' problem. Alan -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
On 08/27/2014 02:46 PM, Andrew Morton wrote:
> I assume (because I wasn't told!) that there are two objectives here: 1) reduce memory consumption by not maintaining pagecache and 2) reduce CPU cost by avoiding the double-copies. These things are pretty easily quantified. And really they must be quantified as part of the developer testing, because if you find they've worsened then holy cow, what went wrong.

There are two more huge ones:

3) Writes via mmap are immediately durable (or at least they're durable after a *very* lightweight flush).

4) No page faults ever once a page is writable (I hope -- I'm not sure whether this series actually achieves that goal).

A note on #3: there is ongoing work to enable write-through memory for things like this. Once that's done, then writes via mmap might actually be synchronously durable, depending on chipset details.

--Andy
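The durability point in #3 can be sketched with the portable analogue of that "lightweight flush": on a conventional filesystem, msync() is what pushes stores made through an mmap'ed region to storage. Under DAX the mapping goes straight at the NV-DIMM, so the same flush reduces to CPU cache-flush instructions plus a fence rather than a page-cache writeback. The file path and function name below are illustrative only.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write 'len' bytes to 'path' through a shared mapping, then flush.
 * With DAX, the msync() step is the *very* lightweight part: the data
 * is already in the NV-DIMM, only CPU caches need flushing. */
static int write_durable(const char *path, const char *msg, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(p, msg, len);               /* store through the mapping */
    int rc = msync(p, len, MS_SYNC);   /* the flush; cheap under DAX */
    munmap(p, len);
    close(fd);
    return rc;
}
```

The write-through memory work Andy mentions would make even the msync() unnecessary for durability, since the stores would not linger in the CPU cache at all.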
[PATCH v10 00/21] Support ext4 on NV-DIMMs
One of the primary uses for NV-DIMMs is to expose them as a block device and use a filesystem to store files on the NV-DIMM. While that works, it currently wastes memory and CPU time buffering the files in the page cache. We have support in ext2 for bypassing the page cache, but it has some races which are unfixable in the current design. This series of patches rewrites the underlying support and adds support for direct access to ext4. Note that patch 6/21 has been included in https://git.kernel.org/cgit/linux/kernel/git/viro/vfs.git/log/?h=for-next-candidate

This iteration of the patchset rebases to 3.17-rc2, changes the page fault locking, fixes a couple of bugs and makes a few other minor changes:

- Move the calculation of the maximum size available at the requested location from the ->direct_access implementations to bdev_direct_access()
- Fix a comment typo (Ross Zwisler)
- Check that the requested length is positive in bdev_direct_access(). If it is not, assume that it's an errno, and just return it.
- Fix some whitespace issues flagged by checkpatch
- Added the Acked-by responses from Kirill that I forgot in the last round
- Added myself to MAINTAINERS for DAX
- Fixed compilation with !CONFIG_DAX (Vishal Verma)
- Reverted the locking in the page fault handler back to an earlier version. If we hit the race that we were trying to protect against, we will leave blocks allocated past the end of the file. They will be removed on file removal, the next truncate, or by fsck.
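The first and third changelog items can be sketched together: centralise the bounds check and size clamp in one helper so the individual ->direct_access implementations no longer each compute the maximum size available at the requested offset. The names and types below are simplified stand-ins, not the real kernel signature.

```c
/* Illustrative sketch of the validation described in the changelog.
 * "sector_t_demo" and "struct bdev_demo" are hypothetical simplifications
 * of the kernel's sector_t and struct block_device. */

typedef long long sector_t_demo;

struct bdev_demo {
    sector_t_demo nr_sectors;   /* device size in 512-byte sectors */
};

/* Returns the number of bytes usable at 'sector', or a negative errno.
 * A non-positive 'size' is assumed to already be an errno from an
 * earlier step and is simply passed through. */
static long demo_direct_access(struct bdev_demo *bdev,
                               sector_t_demo sector, long size)
{
    if (size < 0)
        return size;            /* propagate the caller's errno */
    if (sector >= bdev->nr_sectors)
        return -34;             /* -ERANGE: past end of device */

    /* clamp to what remains between 'sector' and end of device,
     * so drivers never have to do this calculation themselves */
    long avail = (long)(bdev->nr_sectors - sector) * 512;
    return size < avail ? size : avail;
}
```

Moving this into the block layer means a buggy driver can no longer hand back a mapping that extends past the end of the device.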
Matthew Wilcox (20):
  axonram: Fix bug in direct_access
  Change direct_access calling convention
  Fix XIP fault vs truncate race
  Allow page fault handlers to perform the COW
  Introduce IS_DAX(inode)
  Add copy_to_iter(), copy_from_iter() and iov_iter_zero()
  Replace XIP read and write with DAX I/O
  Replace ext2_clear_xip_target with dax_clear_blocks
  Replace the XIP page fault handler with the DAX page fault handler
  Replace xip_truncate_page with dax_truncate_page
  Replace XIP documentation with DAX documentation
  Remove get_xip_mem
  ext2: Remove ext2_xip_verify_sb()
  ext2: Remove ext2_use_xip
  ext2: Remove xip.c and xip.h
  Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX
  ext2: Remove ext2_aops_xip
  Get rid of most mentions of XIP in ext2
  xip: Add xip_zero_page_range
  brd: Rename XIP to DAX

Ross Zwisler (1):
  ext4: Add DAX functionality

 Documentation/filesystems/Locking  |   3 -
 Documentation/filesystems/dax.txt  |  91 +++
 Documentation/filesystems/ext4.txt |   2 +
 Documentation/filesystems/xip.txt  |  68 -
 MAINTAINERS                        |   6 +
 arch/powerpc/sysdev/axonram.c      |  19 +-
 drivers/block/Kconfig              |  13 +-
 drivers/block/brd.c                |  26 +-
 drivers/s390/block/dcssblk.c       |  21 +-
 fs/Kconfig                         |  21 +-
 fs/Makefile                        |   1 +
 fs/block_dev.c                     |  40 +++
 fs/dax.c                           | 497 +
 fs/exofs/inode.c                   |   1 -
 fs/ext2/Kconfig                    |  11 -
 fs/ext2/Makefile                   |   1 -
 fs/ext2/ext2.h                     |  10 +-
 fs/ext2/file.c                     |  45 +++-
 fs/ext2/inode.c                    |  38 +--
 fs/ext2/namei.c                    |  13 +-
 fs/ext2/super.c                    |  53 ++--
 fs/ext2/xip.c                      |  91 ---
 fs/ext2/xip.h                      |  26 --
 fs/ext4/ext4.h                     |   6 +
 fs/ext4/file.c                     |  49 +++-
 fs/ext4/indirect.c                 |  18 +-
 fs/ext4/inode.c                    |  51 ++--
 fs/ext4/namei.c                    |  10 +-
 fs/ext4/super.c                    |  39 ++-
 fs/open.c                          |   5 +-
 include/linux/blkdev.h             |   6 +-
 include/linux/fs.h                 |  49 +++-
 include/linux/mm.h                 |   1 +
 include/linux/uio.h                |   3 +
 mm/Makefile                        |   1 -
 mm/fadvise.c                       |   6 +-
 mm/filemap.c                       |   6 +-
 mm/filemap_xip.c                   | 483 ---
 mm/iov_iter.c                      | 237 --
 mm/madvise.c                       |   2 +-
 mm/memory.c                        |  33 ++-
 41 files changed, 1229 insertions(+), 873 deletions(-)
 create mode 100644 Documentation/filesystems/dax.txt
 delete mode 100644 Documentation/filesystems/xip.txt
 create mode 100644 fs/dax.c
 delete mode 100644 fs/ext2/xip.c
 delete mode 100644 fs/ext2/xip.h
 delete mode 100644 mm/filemap_xip.c

--
2.0.0