Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-22 Thread Ira Weiny
On Mon, Sep 21, 2020 at 12:19:07PM -0400, Mikulas Patocka wrote:
> 
> 
> On Tue, 15 Sep 2020, Dan Williams wrote:
> 
> > > TODO:
> > >
> > > - programs run approximately 4% slower when running from Optane-based
> > > persistent memory. Therefore, programs and libraries should use page cache
> > > and not DAX mapping.
> > 
> > This needs to be based on platform firmware data (ACPI HMAT) for the
> > relative performance of a PMEM range vs DRAM. For example, this
> > tradeoff should not exist with battery backed DRAM, or virtio-pmem.
> 
> Hi
> 
> I have implemented this functionality - if we mmap a file with 
> (vma->vm_flags & VM_DENYWRITE), then it is assumed that this is an 
> executable file mapping - the S_DAX flag on the inode is cleared and the 
> inode will use the normal page cache.
> 
> Is there some way to test whether we are using an Optane-based module 
> (where this optimization should be applied) or battery-backed DRAM (where 
> it should not)?
> 
> I've added mount options dax=never, dax=auto, dax=always, so that the user 
> can override the automatic behavior.
> 
> Mikulas

dax=inode?

'inode' is the option used by ext4/xfs.

Ira


Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-22 Thread Ritesh Harjani




On 9/15/20 6:30 PM, Matthew Wilcox wrote:

> On Tue, Sep 15, 2020 at 08:34:41AM -0400, Mikulas Patocka wrote:
> > - when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses
> > buffer cache for the mapping. The buffer cache slows down fsck by a factor
> > of 5 to 10. Would it be possible to change the kernel so that it maps DAX
> > based block devices directly?
> 
> Oh, because fs/block_dev.c has:
> .mmap   = generic_file_mmap,
> 
> I don't see why we shouldn't have a blkdev_mmap modelled after
> ext2_file_mmap() with the corresponding blkdev_dax_vm_ops.



Could you please help with the two queries below:

1. Can't we use ->direct_IO here to avoid the performance problem mentioned 
above?
2. Are there any other existing use cases where blkdev_dax_vm_ops would be 
useful?


-ritesh


Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-21 Thread Dan Williams
On Mon, Sep 21, 2020 at 9:19 AM Mikulas Patocka  wrote:
>
>
>
> On Tue, 15 Sep 2020, Dan Williams wrote:
>
> > > TODO:
> > >
> > > - programs run approximately 4% slower when running from Optane-based
> > > persistent memory. Therefore, programs and libraries should use page cache
> > > and not DAX mapping.
> >
> > This needs to be based on platform firmware data (ACPI HMAT) for the
> > relative performance of a PMEM range vs DRAM. For example, this
> > tradeoff should not exist with battery backed DRAM, or virtio-pmem.
>
> Hi
>
> I have implemented this functionality - if we mmap a file with
> (vma->vm_flags & VM_DENYWRITE), then it is assumed that this is an
> executable file mapping - the S_DAX flag on the inode is cleared and the
> inode will use the normal page cache.
>
> Is there some way to test whether we are using an Optane-based module
> (where this optimization should be applied) or battery-backed DRAM (where
> it should not)?

No, there's no direct, reliable type information. Instead, on ACPI
platforms the firmware provides the HMAT table, which describes the
performance of system-memory ranges.


Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-21 Thread Mikulas Patocka



On Tue, 15 Sep 2020, Dan Williams wrote:

> > TODO:
> >
> > - programs run approximately 4% slower when running from Optane-based
> > persistent memory. Therefore, programs and libraries should use page cache
> > and not DAX mapping.
> 
> This needs to be based on platform firmware data (ACPI HMAT) for the
> relative performance of a PMEM range vs DRAM. For example, this
> tradeoff should not exist with battery backed DRAM, or virtio-pmem.

Hi

I have implemented this functionality - if we mmap a file with 
(vma->vm_flags & VM_DENYWRITE), then it is assumed that this is an 
executable file mapping - the S_DAX flag on the inode is cleared and the 
inode will use the normal page cache.
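
A minimal sketch of how this check could look in the mmap handler (the 
names nvfs_file_mmap and nvfs_dax_vm_ops are illustrative, and toggling 
S_DAX on a live inode needs mapping invalidation that this sketch omits):

static int nvfs_file_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct inode *inode = file_inode(file);

	if (vma->vm_flags & VM_DENYWRITE) {
		/* Executable mapping: drop S_DAX, go through the page cache. */
		inode->i_flags &= ~S_DAX;
		return generic_file_mmap(file, vma);
	}

	/* Data mapping: keep the direct DAX path. */
	file_accessed(file);
	vma->vm_ops = &nvfs_dax_vm_ops;
	return 0;
}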

Is there some way to test whether we are using an Optane-based module 
(where this optimization should be applied) or battery-backed DRAM (where 
it should not)?

I've added mount options dax=never, dax=auto, dax=always, so that the user 
can override the automatic behavior.
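
For illustration, the option parsing might use the kernel's match_token() 
helper roughly like this (the enum and table names are made up for the 
sketch; dax=auto would then apply the VM_DENYWRITE heuristic above, while 
never/always force the page cache or DAX unconditionally):

enum nvfs_dax_mode { NVFS_DAX_NEVER, NVFS_DAX_AUTO, NVFS_DAX_ALWAYS };

static const match_table_t nvfs_dax_tokens = {
	{ NVFS_DAX_NEVER,  "dax=never" },
	{ NVFS_DAX_AUTO,   "dax=auto" },
	{ NVFS_DAX_ALWAYS, "dax=always" },
	{ -1, NULL }
};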

Mikulas



Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-15 Thread Mikulas Patocka



On Tue, 15 Sep 2020, Matthew Wilcox wrote:

> On Tue, Sep 15, 2020 at 08:34:41AM -0400, Mikulas Patocka wrote:
> > - when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses 
> > buffer cache for the mapping. The buffer cache slows down fsck by a factor 
> > of 5 to 10. Would it be possible to change the kernel so that it maps DAX 
> > based block devices directly?
> 
> Oh, because fs/block_dev.c has:
> .mmap   = generic_file_mmap,
> 
> I don't see why we shouldn't have a blkdev_mmap modelled after
> ext2_file_mmap() with the corresponding blkdev_dax_vm_ops.

Yes, that's possible - and we would also have to rewrite the read_iter and 
write_iter methods on DAX block devices, so that they are coherent with 
mmap.
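
A sketch of what the read side might look like, modelled on ext2's DAX 
read path (blkdev_iomap_ops is hypothetical here):

static ssize_t blkdev_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	struct inode *inode = iocb->ki_filp->f_mapping->host;
	ssize_t ret;

	if (!iov_iter_count(to))
		return 0;

	inode_lock_shared(inode);
	/* dax_iomap_rw() reads through the direct mapping, so it is
	 * coherent with DAX mmap by construction. */
	ret = dax_iomap_rw(iocb, to, &blkdev_iomap_ops);
	inode_unlock_shared(inode);
	return ret;
}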

Mikulas



Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-15 Thread Mikulas Patocka



On Tue, 15 Sep 2020, Dan Williams wrote:

> > - when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses
> > buffer cache for the mapping. The buffer cache slows down fsck by a factor
> > of 5 to 10. Would it be possible to change the kernel so that it maps DAX
> > based block devices directly?
> 
> We've been down this path before.
> 
> 5a023cdba50c block: enable dax for raw block devices
> 9f4736fe7ca8 block: revert runtime dax control of the raw block device
> acc93d30d7d4 Revert "block: enable dax for raw block devices"

It says "The functionality is superseded by the new 'Device DAX' 
facility". But the fsck tool can't change a fsdax device into a devdax 
device just for checking. Or can it?

> EXT2/4 metadata buffer management depends on the page cache and we
> eliminated a class of bugs by removing that support. The problems are
> likely tractable, but there was not a straightforward fix visible at
> the time.

Thinking about it - it isn't as easy as it looks...

Suppose that the user mounts an ext2 filesystem and then uses the tune2fs 
tool on the mounted block device. The tune2fs tool reads and writes the 
mounted superblock directly.

So, read/write must be coherent with the buffer cache (otherwise the 
kernel would not see the changes written by tune2fs). And mmap must be 
coherent with read/write.

So, if we want to map the pmem device directly, we could add a new flag 
MAP_DAX. Or we could test if the fd has the O_DIRECT flag and map it 
directly in that case. But the default must be to map it coherently, in 
order not to break existing programs.
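
The O_DIRECT variant could be gated roughly like this (a sketch only; 
blkdev_dax_vm_ops is hypothetical):

static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* Default: a mapping coherent with read/write and the buffer
	 * cache, so existing programs keep working. */
	if (!(file->f_flags & O_DIRECT) || !IS_DAX(file_inode(file)))
		return generic_file_mmap(file, vma);

	/* O_DIRECT on a DAX-capable device: map the pmem directly. */
	file_accessed(file);
	vma->vm_ops = &blkdev_dax_vm_ops;
	return 0;
}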

> > - __copy_from_user_inatomic_nocache doesn't flush cache for leading and
> > trailing bytes.
> 
> You want copy_user_flushcache(). See how fs/dax.c arranges for
> dax_copy_from_iter() to route to pmem_copy_from_iter().

Is it something new in kernel 5.10? I see only __copy_user_flushcache, 
which is implemented just for x86 and arm64.

There is __copy_from_user_flushcache implemented for x86, arm64 and power. 
It is used in lib/iov_iter.c under
#ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE - so should I use this?
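
The pattern in lib/iov_iter.c suggests something like this (a sketch of 
the two cases only; dst/src/len stand for the copy arguments):

#ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
	/* Flushes everything it copies, including partial cache lines
	 * at the leading and trailing edges of the range. */
	rem = __copy_from_user_flushcache(dst, src, len);
#else
	/* Does not flush partial cache lines at the edges; the caller
	 * must write those lines back to pmem itself. */
	rem = __copy_from_user_inatomic_nocache(dst, src, len);
#endif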

Mikulas



Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-15 Thread Mikulas Patocka



On Tue, 15 Sep 2020, Mikulas Patocka wrote:

> > > - __copy_from_user_inatomic_nocache doesn't flush cache for leading and
> > > trailing bytes.
> > 
> > You want copy_user_flushcache(). See how fs/dax.c arranges for
> > dax_copy_from_iter() to route to pmem_copy_from_iter().
> 
> Is it something new in kernel 5.10? I see only __copy_user_flushcache, 
> which is implemented just for x86 and arm64.
> 
> There is __copy_from_user_flushcache implemented for x86, arm64 and power. 
> It is used in lib/iov_iter.c under
> #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE - so should I use this?
> 
> Mikulas

... and __copy_user_flushcache is not exported for modules. So, I am stuck 
with __copy_from_user_inatomic_nocache.

Mikulas



Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-15 Thread Dan Williams
On Tue, Sep 15, 2020 at 5:35 AM Mikulas Patocka  wrote:
>
> Hi
>
> I am developing a new filesystem suitable for persistent memory - nvfs.

Nice!

> The goal is to have a small and fast filesystem that can be used on
> DAX-based devices. Nvfs maps the whole device into a linear address space
> and it completely bypasses the overhead of the block layer and buffer
> cache.

So does device-dax, but device-dax lacks read(2)/write(2).

> In the past, there was the NOVA filesystem for pmem, but it was abandoned a
> year ago (the last version is for kernel 5.1 -
> https://github.com/NVSL/linux-nova ). Nvfs is smaller and performs better.
>
> The design of nvfs is similar to ext2/ext4, so that it fits into the VFS
> layer naturally, without too much glue code.
>
> I'd like to ask you to review it.
>
>
> tarballs:
> http://people.redhat.com/~mpatocka/nvfs/
> git:
> git://leontynka.twibright.com/nvfs.git
> the description of filesystem internals:
> http://people.redhat.com/~mpatocka/nvfs/INTERNALS
> benchmarks:
> http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS
>
>
> TODO:
>
> - programs run approximately 4% slower when running from Optane-based
> persistent memory. Therefore, programs and libraries should use page cache
> and not DAX mapping.

This needs to be based on platform firmware data (ACPI HMAT) for the
relative performance of a PMEM range vs DRAM. For example, this
tradeoff should not exist with battery backed DRAM, or virtio-pmem.

>
> - when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses
> buffer cache for the mapping. The buffer cache slows down fsck by a factor
> of 5 to 10. Would it be possible to change the kernel so that it maps DAX
> based block devices directly?

We've been down this path before.

5a023cdba50c block: enable dax for raw block devices
9f4736fe7ca8 block: revert runtime dax control of the raw block device
acc93d30d7d4 Revert "block: enable dax for raw block devices"

EXT2/4 metadata buffer management depends on the page cache and we
eliminated a class of bugs by removing that support. The problems are
likely tractable, but there was not a straightforward fix visible at
the time.

> - __copy_from_user_inatomic_nocache doesn't flush cache for leading and
> trailing bytes.

You want copy_user_flushcache(). See how fs/dax.c arranges for
dax_copy_from_iter() to route to pmem_copy_from_iter().
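
For reference, that routing ends at the pmem driver, which in the 5.9-era 
tree looks roughly like this (drivers/nvdimm/pmem.c):

static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
		void *addr, size_t bytes, struct iov_iter *i)
{
	/* Copy from userspace and flush the destination cache lines. */
	return _copy_from_iter_flushcache(addr, bytes, i);
}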


Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-15 Thread Matthew Wilcox
On Tue, Sep 15, 2020 at 08:34:41AM -0400, Mikulas Patocka wrote:
> - when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses 
> buffer cache for the mapping. The buffer cache slows down fsck by a factor 
> of 5 to 10. Would it be possible to change the kernel so that it maps DAX 
> based block devices directly?

Oh, because fs/block_dev.c has:
.mmap   = generic_file_mmap,

I don't see why we shouldn't have a blkdev_mmap modelled after
ext2_file_mmap() with the corresponding blkdev_dax_vm_ops.
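
Something along these lines, presumably (a sketch; blkdev_dax_vm_ops does 
not exist today and is assumed here):

static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* Non-DAX block devices keep the coherent page-cache mapping. */
	if (!IS_DAX(file_inode(file)))
		return generic_file_mmap(file, vma);

	file_accessed(file);
	vma->vm_ops = &blkdev_dax_vm_ops;
	return 0;
}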


[RFC] nvfs: a filesystem for persistent memory

2020-09-15 Thread Mikulas Patocka
Hi

I am developing a new filesystem suitable for persistent memory - nvfs. 
The goal is to have a small and fast filesystem that can be used on 
DAX-based devices. Nvfs maps the whole device into a linear address space 
and it completely bypasses the overhead of the block layer and buffer 
cache.
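
The linear mapping itself would come from the kernel's DAX API; a hedged 
sketch of the mount-time setup (nvfs_sb_info and its linear_map field are 
illustrative names, error handling elided):

static int nvfs_map_device(struct super_block *sb, struct dax_device *dax_dev,
			   long nr_pages)
{
	struct nvfs_sb_info *sbi = sb->s_fs_info;
	void *kaddr;
	pfn_t pfn;
	long mapped;

	/* Ask the pmem driver for a kernel virtual address covering the
	 * whole device - no block layer or buffer cache involved. */
	mapped = dax_direct_access(dax_dev, 0, nr_pages, &kaddr, &pfn);
	if (mapped < 0)
		return mapped;

	sbi->linear_map = kaddr;
	return 0;
}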

In the past, there was the NOVA filesystem for pmem, but it was abandoned a 
year ago (the last version is for kernel 5.1 - 
https://github.com/NVSL/linux-nova ). Nvfs is smaller and performs better.

The design of nvfs is similar to ext2/ext4, so that it fits into the VFS 
layer naturally, without too much glue code.

I'd like to ask you to review it.


tarballs:
http://people.redhat.com/~mpatocka/nvfs/
git:
git://leontynka.twibright.com/nvfs.git
the description of filesystem internals:
http://people.redhat.com/~mpatocka/nvfs/INTERNALS
benchmarks:
http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS


TODO:

- programs run approximately 4% slower when running from Optane-based 
persistent memory. Therefore, programs and libraries should use page cache 
and not DAX mapping.

- when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses 
buffer cache for the mapping. The buffer cache slows down fsck by a factor 
of 5 to 10. Would it be possible to change the kernel so that it maps DAX 
based block devices directly?

- __copy_from_user_inatomic_nocache doesn't flush cache for leading and 
trailing bytes.

Mikulas