Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
On Mon, Aug 14, 2017 at 02:40:59PM +0200, Jan Kara wrote:
> Hum, this proposal (and the problems you are trying to deal with) seem very similar to Peter Zijlstra's mpin() proposal from 2014 [1], just moved to the DAX area (and so additionally complicated by the fact that filesystems now have to care). The patch set was not merged due to lack of interest I think but it looked sensible and the proposed API would make sense for more stuff than just DAX so maybe it would be better than a MAP_DIRECT flag?
>
> [1] https://lwn.net/Articles/600502/

Thanks for thinking of that. The main sticking point was that I never got it working for RDMA; I got hopelessly lost in that code. Also I felt (and still do) that mpin() would be very useful for CMA: mpin() would be a good moment to migrate/compact the pages and get out of the way.
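[Editor's note: the "migrate/compact at pin time" ordering above can be modeled in a few lines. This is a hypothetical sketch of the never-merged mpin() semantics, not kernel code; the placement enum and helper names are invented for illustration. The invariant is that a page must be moved out of CMA/movable memory *before* it is marked pinned, because a pinned page can no longer be migrated.]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the proposed mpin() ordering; all names are invented. */
enum placement { PLACE_CMA, PLACE_MOVABLE, PLACE_UNMOVABLE };

struct page_state {
    enum placement where;
    bool pinned;
};

/*
 * mpin() is the last moment a page can be migrated: once pinned it is
 * stuck, so move it out of CMA/movable regions first ("get out of the
 * way"), then pin.
 */
static void model_mpin(struct page_state *pg)
{
    if (pg->where != PLACE_UNMOVABLE)
        pg->where = PLACE_UNMOVABLE;   /* stand-in for page migration */
    pg->pinned = true;                 /* migration is now forbidden */
}
```

[A pin that skipped the migration step would leave an immovable page stranded in a CMA region, defeating contiguous allocation; doing the move at pin time is the whole point of the proposal.]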
Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
On Tue 15-08-17 16:50:55, Dan Williams wrote:
> On Tue, Aug 15, 2017 at 1:37 AM, Jan Kara wrote:
> > On Mon 14-08-17 09:14:42, Dan Williams wrote:
> >> On Mon, Aug 14, 2017 at 5:40 AM, Jan Kara wrote:
> >> > On Sun 13-08-17 13:31:45, Dan Williams wrote:
> >> >> On Sun, Aug 13, 2017 at 2:24 AM, Christoph Hellwig wrote:
> >> >> > That being said I think we absolutely should support RDMA memory registrations for DAX mappings. I'm just not sure how S_IOMAP_IMMUTABLE helps with that. We'll want a MAP_SYNC | MAP_POPULATE to make sure all the blocks are populated and all ptes are set up. Second we need to make sure get_user_pages works, which for now means we'll need a struct page mapping for the region (which will be really annoying for PCIe mappings, like the upcoming NVMe persistent memory region), and we need to guarantee that the extent mapping won't change while the get_user_pages holds the pages inside it. I think that is true due to side effects even with the current DAX code, but we'll need to make it explicit. And maybe that's where we need to converge - "sealing" the extent map makes sense as such a temporary measure that is not persisted on disk, which automatically gets released when the holding process exits, because we sort of already do this implicitly. It might also make sense to have explicitly breakable seals similar to what I do for the pNFS blocks kernel server, as any userspace RDMA file server would also need those semantics.
> >> >>
> >> >> Ok, how about a MAP_DIRECT flag that arranges for faults to that range to:
> >> >>
> >> >> 1/ only succeed if the fault can be satisfied without page cache
> >> >>
> >> >> 2/ only install a pte for the fault if it can do so without triggering block map updates
> >> >>
> >> >> So, I think it would still end up setting an inode flag to make xfs_bmapi_write() fail while any process has a MAP_DIRECT mapping active. However, it would not record that state in the on-disk metadata and it would automatically clear at munmap time. That should be enough to support the host-persistent-memory, and NVMe-persistent-memory use cases (provided we have struct page for NVMe). Although, we need more safety infrastructure in the NVMe case where we would need to software manage I/O coherence.
> >> >
> >> > Hum, this proposal (and the problems you are trying to deal with) seem very similar to Peter Zijlstra's mpin() proposal from 2014 [1], just moved to the DAX area (and so additionally complicated by the fact that filesystems now have to care). The patch set was not merged due to lack of interest I think but it looked sensible and the proposed API would make sense for more stuff than just DAX so maybe it would be better than a MAP_DIRECT flag?
> >>
> >> Interesting, but I'm not sure I see the correlation. mm_mpin() makes a "no-fault" guarantee and fixes the accounting of locked System RAM. MAP_DIRECT still allows faults, and DAX mappings don't consume System RAM so the accounting problem is not there for DAX. mm_mpin() also does not appear to have a relationship to file-backed memory like mmap allows.
> >
> > So the accounting part is probably non-interesting for DAX purposes and I agree there are other differences as well. But mm_mpin() prevented page migrations, which is parallel to your requirement of "offset->block mapping is permanent". Furthermore the mm_mpin() work was there for RDMA so that it has a saner interface to pin pages than get_user_pages(), and you mention RDMA and similar technologies as a usecase for your work for similar reasons. So my thought was that possibly we should have the same API for pinning "storage" for RDMA transfers regardless of whether the backing is page cache or pmem, and the API should be usable for in-kernel users as well? An mmap flag seems a bit clumsy in this regard so maybe a form of a separate syscall - be it mpin(start, len) or some other name - might be more suitable?
>
> Can you say more about why an mmap flag for this feels awkward to you? I think there's symmetry between O_SYNC / O_DIRECT setting up synchronous / page-cache-bypass file descriptors and MAP_SYNC / MAP_DIRECT setting up synchronous and page-cache bypass mappings.

So my thinking was that for in-kernel users it might be a bit more difficult to use an mmap flag directly as they generally won't need to set up the mapping. But that can certainly be dealt with by proper helpers for in-kernel users.

> "Pinning" also feels like the wrong mechanism when you consider hardware is moving toward eliminating the pinning requirement over time. SVM "Shared Virtual Memory" hardware will just operate on cpu virtual addresses directly and generate typical faults. On such hardware MAP_DIRECT would be a nop relative to MAP_SYNC, so you wouldn't want your application to be stuck with the legacy concept that pages need to be explicitly "pinned".
Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
On Tue, Aug 15, 2017 at 1:37 AM, Jan Kara wrote:
> On Mon 14-08-17 09:14:42, Dan Williams wrote:
>> On Mon, Aug 14, 2017 at 5:40 AM, Jan Kara wrote:
>> > On Sun 13-08-17 13:31:45, Dan Williams wrote:
>> >> On Sun, Aug 13, 2017 at 2:24 AM, Christoph Hellwig wrote:
>> >> > That being said I think we absolutely should support RDMA memory registrations for DAX mappings. I'm just not sure how S_IOMAP_IMMUTABLE helps with that. We'll want a MAP_SYNC | MAP_POPULATE to make sure all the blocks are populated and all ptes are set up. Second we need to make sure get_user_pages works, which for now means we'll need a struct page mapping for the region (which will be really annoying for PCIe mappings, like the upcoming NVMe persistent memory region), and we need to guarantee that the extent mapping won't change while the get_user_pages holds the pages inside it. I think that is true due to side effects even with the current DAX code, but we'll need to make it explicit. And maybe that's where we need to converge - "sealing" the extent map makes sense as such a temporary measure that is not persisted on disk, which automatically gets released when the holding process exits, because we sort of already do this implicitly. It might also make sense to have explicitly breakable seals similar to what I do for the pNFS blocks kernel server, as any userspace RDMA file server would also need those semantics.
>> >>
>> >> Ok, how about a MAP_DIRECT flag that arranges for faults to that range to:
>> >>
>> >> 1/ only succeed if the fault can be satisfied without page cache
>> >>
>> >> 2/ only install a pte for the fault if it can do so without triggering block map updates
>> >>
>> >> So, I think it would still end up setting an inode flag to make xfs_bmapi_write() fail while any process has a MAP_DIRECT mapping active. However, it would not record that state in the on-disk metadata and it would automatically clear at munmap time. That should be enough to support the host-persistent-memory, and NVMe-persistent-memory use cases (provided we have struct page for NVMe). Although, we need more safety infrastructure in the NVMe case where we would need to software manage I/O coherence.
>> >
>> > Hum, this proposal (and the problems you are trying to deal with) seem very similar to Peter Zijlstra's mpin() proposal from 2014 [1], just moved to the DAX area (and so additionally complicated by the fact that filesystems now have to care). The patch set was not merged due to lack of interest I think but it looked sensible and the proposed API would make sense for more stuff than just DAX so maybe it would be better than a MAP_DIRECT flag?
>>
>> Interesting, but I'm not sure I see the correlation. mm_mpin() makes a "no-fault" guarantee and fixes the accounting of locked System RAM. MAP_DIRECT still allows faults, and DAX mappings don't consume System RAM so the accounting problem is not there for DAX. mm_mpin() also does not appear to have a relationship to file-backed memory like mmap allows.
>
> So the accounting part is probably non-interesting for DAX purposes and I agree there are other differences as well. But mm_mpin() prevented page migrations, which is parallel to your requirement of "offset->block mapping is permanent". Furthermore the mm_mpin() work was there for RDMA so that it has a saner interface to pin pages than get_user_pages(), and you mention RDMA and similar technologies as a usecase for your work for similar reasons. So my thought was that possibly we should have the same API for pinning "storage" for RDMA transfers regardless of whether the backing is page cache or pmem, and the API should be usable for in-kernel users as well? An mmap flag seems a bit clumsy in this regard so maybe a form of a separate syscall - be it mpin(start, len) or some other name - might be more suitable?

Can you say more about why an mmap flag for this feels awkward to you? I think there's symmetry between O_SYNC / O_DIRECT setting up synchronous / page-cache-bypass file descriptors and MAP_SYNC / MAP_DIRECT setting up synchronous and page-cache bypass mappings.

"Pinning" also feels like the wrong mechanism when you consider hardware is moving toward eliminating the pinning requirement over time. SVM "Shared Virtual Memory" hardware will just operate on cpu virtual addresses directly and generate typical faults. On such hardware MAP_DIRECT would be a nop relative to MAP_SYNC, so you wouldn't want your application to be stuck with the legacy concept that pages need to be explicitly "pinned".
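[Editor's note: the mmap-flag approach defended above has one practical wrinkle: the kernel historically ignored unknown mmap flags, which is exactly why MAP_SHARED_VALIDATE was later added alongside MAP_SYNC. Below is a sketch of how an application could probe for stronger mapping semantics and fall back; MAP_DIRECT itself was never merged, so MAP_SYNC stands in, and the numeric fallback defines mirror linux/mman.h on x86.]

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03     /* from linux/mman.h, kernel >= 4.15 */
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000             /* from linux/mman.h, kernel >= 4.15 */
#endif

/*
 * Ask for a synchronous DAX mapping.  MAP_SHARED_VALIDATE makes the
 * kernel reject flags it does not understand instead of silently
 * ignoring them, so the caller learns whether the stronger semantics
 * were actually granted.
 */
static void *map_sync_or_fallback(int fd, size_t len, int *got_sync)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p != MAP_FAILED) {
        *got_sync = 1;
        return p;
    }
    *got_sync = 0;                    /* non-DAX file or old kernel */
    return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

static int demo(void)
{
    char path[] = "/tmp/mapsyncXXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, 4096) != 0) {
        close(fd);
        return -1;
    }
    int got_sync = -1;
    void *p = map_sync_or_fallback(fd, 4096, &got_sync);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    munmap(p, 4096);
    close(fd);
    return got_sync;                  /* 1 only on a DAX-capable fs */
}
```

[On a DAX-capable filesystem `got_sync` would be 1; everywhere else the fallback path runs, which is the property the thread wants: an application can never be silently granted weaker semantics than it asked for.]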
Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
On Mon 14-08-17 09:14:42, Dan Williams wrote:
> On Mon, Aug 14, 2017 at 5:40 AM, Jan Kara wrote:
> > On Sun 13-08-17 13:31:45, Dan Williams wrote:
> >> On Sun, Aug 13, 2017 at 2:24 AM, Christoph Hellwig wrote:
> >> > That being said I think we absolutely should support RDMA memory registrations for DAX mappings. I'm just not sure how S_IOMAP_IMMUTABLE helps with that. We'll want a MAP_SYNC | MAP_POPULATE to make sure all the blocks are populated and all ptes are set up. Second we need to make sure get_user_pages works, which for now means we'll need a struct page mapping for the region (which will be really annoying for PCIe mappings, like the upcoming NVMe persistent memory region), and we need to guarantee that the extent mapping won't change while the get_user_pages holds the pages inside it. I think that is true due to side effects even with the current DAX code, but we'll need to make it explicit. And maybe that's where we need to converge - "sealing" the extent map makes sense as such a temporary measure that is not persisted on disk, which automatically gets released when the holding process exits, because we sort of already do this implicitly. It might also make sense to have explicitly breakable seals similar to what I do for the pNFS blocks kernel server, as any userspace RDMA file server would also need those semantics.
> >>
> >> Ok, how about a MAP_DIRECT flag that arranges for faults to that range to:
> >>
> >> 1/ only succeed if the fault can be satisfied without page cache
> >>
> >> 2/ only install a pte for the fault if it can do so without triggering block map updates
> >>
> >> So, I think it would still end up setting an inode flag to make xfs_bmapi_write() fail while any process has a MAP_DIRECT mapping active. However, it would not record that state in the on-disk metadata and it would automatically clear at munmap time. That should be enough to support the host-persistent-memory, and NVMe-persistent-memory use cases (provided we have struct page for NVMe). Although, we need more safety infrastructure in the NVMe case where we would need to software manage I/O coherence.
> >
> > Hum, this proposal (and the problems you are trying to deal with) seem very similar to Peter Zijlstra's mpin() proposal from 2014 [1], just moved to the DAX area (and so additionally complicated by the fact that filesystems now have to care). The patch set was not merged due to lack of interest I think but it looked sensible and the proposed API would make sense for more stuff than just DAX so maybe it would be better than a MAP_DIRECT flag?
>
> Interesting, but I'm not sure I see the correlation. mm_mpin() makes a "no-fault" guarantee and fixes the accounting of locked System RAM. MAP_DIRECT still allows faults, and DAX mappings don't consume System RAM so the accounting problem is not there for DAX. mm_mpin() also does not appear to have a relationship to file-backed memory like mmap allows.

So the accounting part is probably non-interesting for DAX purposes and I agree there are other differences as well. But mm_mpin() prevented page migrations, which is parallel to your requirement of "offset->block mapping is permanent". Furthermore the mm_mpin() work was there for RDMA so that it has a saner interface to pin pages than get_user_pages(), and you mention RDMA and similar technologies as a usecase for your work for similar reasons. So my thought was that possibly we should have the same API for pinning "storage" for RDMA transfers regardless of whether the backing is page cache or pmem, and the API should be usable for in-kernel users as well? An mmap flag seems a bit clumsy in this regard so maybe a form of a separate syscall - be it mpin(start, len) or some other name - might be more suitable?

Honza
--
Jan Kara
SUSE Labs, CR
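[Editor's note: the mpin(start, len) shape suggested above would enter the kernel the same way its closest merged relative, mlock2(2), does. As a sketch, this is how such a pinning syscall is typically exercised from userspace before libc grows a wrapper; SYS_mlock2 is real, whereas an mpin syscall number never existed.]

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Raw-syscall wrapper, the way early adopters would call a new mpin(). */
static int raw_mlock2(const void *start, size_t len, int flags)
{
    return (int)syscall(SYS_mlock2, start, len, flags);
}

/*
 * Returns 1 if the syscall exists (whether or not the lock itself
 * succeeds, e.g. under a tight RLIMIT_MEMLOCK), 0 if the kernel
 * lacks it, -1 on setup failure.
 */
static int probe_mlock2(void)
{
    long page = sysconf(_SC_PAGESIZE);
    void *buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    int rc = raw_mlock2(buf, (size_t)page, 0);
    int exists = (rc == 0 || errno != ENOSYS);
    if (rc == 0)
        munlock(buf, (size_t)page);
    munmap(buf, (size_t)page);
    return exists;
}
```

[The design question in the thread is exactly this one: a syscall like the above works on any already-mapped range and is easy for in-kernel helpers to mirror, while an mmap flag ties the pin to mapping creation.]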
Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
On Sun, Aug 13, 2017 at 01:31:45PM -0700, Dan Williams wrote:
> On Sun, Aug 13, 2017 at 2:24 AM, Christoph Hellwig wrote:
> > On Sat, Aug 12, 2017 at 12:19:50PM -0700, Dan Williams wrote:
> >> The application does not need to know the storage address, it needs to know that the storage address to file offset is fixed. With this information it can make assumptions about the permanence of results it gets from the kernel.
> >
> > Only if we clearly document that fact - and documenting the permanence is different from saying the block map won't change.
>
> I can get on board with that.
>
> >> For example get_user_pages() today makes no guarantees outside of "page will not be freed",
> >
> > It also makes the extremely important guarantee that the page won't _move_ - e.g. that we won't do a memory migration for compaction or other reasons. That's why for example RDMA can use it to register memory and then we can later set up memory windows that point to this registration from userspace and implement userspace RDMA.
> >
> >> but with immutable files and dax you now have a mechanism for userspace to coordinate direct access to storage addresses. Those raw storage addresses need not be exposed to the application, as you say it doesn't need to know that detail. MAP_SYNC does not fully satisfy this case because it requires agents that can generate MMU faults to coordinate with the filesystem.
> >
> > The file system is always in the fault path, can you explain what other agents you are talking about?
>
> Exactly the ones you mention below. SVM hardware can just use a MAP_SYNC mapping and be sure that its metadata dirtying writes are synchronized with the filesystem through the fault path. Hardware that does not have SVM, or hypervisors like Xen that want to attach their own static metadata about the file offset to physical block mapping, need a mechanism to make sure the block map is sealed while they have it mapped.
>
> >> All I know is that SMB Direct for persistent memory seems like a potential consumer. I know they're not going to use a userspace filesystem or put an SMB server in the kernel.
> >
> > Last I talked to the Samba folks they didn't expect a userspace SMB direct implementation to work anyway due to the fact that libibverbs memory registrations interact badly with their fork()ing daemon model. That being said during the recent submission of the RDMA client code some comments were made about userspace versions of it, so I'm not sure if that opinion has changed in one way or another.
>
> Ok.
>
> > That being said I think we absolutely should support RDMA memory registrations for DAX mappings. I'm just not sure how S_IOMAP_IMMUTABLE helps with that. We'll want a MAP_SYNC | MAP_POPULATE to make sure all the blocks are populated and all ptes are set up. Second we need to make sure get_user_pages works, which for now means we'll need a struct page mapping for the region (which will be really annoying for PCIe mappings, like the upcoming NVMe persistent memory region), and we need to guarantee that the extent mapping won't change while the get_user_pages holds the pages inside it. I think that is true due to side effects even with the current DAX code, but we'll need to make it explicit. And maybe that's where we need to converge - "sealing" the extent map makes sense as such a temporary measure that is not persisted on disk, which automatically gets released when the holding process exits, because we sort of already do this implicitly. It might also make sense to have explicitly breakable seals similar to what I do for the pNFS blocks kernel server, as any userspace RDMA file server would also need those semantics.
>
> Ok, how about a MAP_DIRECT flag that arranges for faults to that range to:
>
> 1/ only succeed if the fault can be satisfied without page cache
>
> 2/ only install a pte for the fault if it can do so without triggering block map updates
>
> So, I think it would still end up setting an inode flag to make xfs_bmapi_write() fail while any process has a MAP_DIRECT mapping active. However, it would not record that state in the on-disk metadata and it would automatically clear at munmap time. That should

TBH even after the last round of 'do we need this on-disk flag?' I still wasn't 100% convinced that we really needed a permanent flag vs. requiring apps to ask for a sealed iomap mmap like what you just described, so I'm glad this conversation has continued. :)

--D

> be enough to support the host-persistent-memory, and NVMe-persistent-memory use cases (provided we have struct page for NVMe). Although, we need more safety infrastructure in the NVMe case where we would need to software manage I/O coherence.
>
> > Last but not least we have an interesting additional case for modern Mellanox hardware - On Demand Paging where we don't actually do a get_user_pages but the hardware implements SVM and thus gets fed virtual addresses directly.
Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
On Mon, Aug 14, 2017 at 5:40 AM, Jan Kara wrote:
> On Sun 13-08-17 13:31:45, Dan Williams wrote:
>> On Sun, Aug 13, 2017 at 2:24 AM, Christoph Hellwig wrote:
>> > That being said I think we absolutely should support RDMA memory registrations for DAX mappings. I'm just not sure how S_IOMAP_IMMUTABLE helps with that. We'll want a MAP_SYNC | MAP_POPULATE to make sure all the blocks are populated and all ptes are set up. Second we need to make sure get_user_pages works, which for now means we'll need a struct page mapping for the region (which will be really annoying for PCIe mappings, like the upcoming NVMe persistent memory region), and we need to guarantee that the extent mapping won't change while the get_user_pages holds the pages inside it. I think that is true due to side effects even with the current DAX code, but we'll need to make it explicit. And maybe that's where we need to converge - "sealing" the extent map makes sense as such a temporary measure that is not persisted on disk, which automatically gets released when the holding process exits, because we sort of already do this implicitly. It might also make sense to have explicitly breakable seals similar to what I do for the pNFS blocks kernel server, as any userspace RDMA file server would also need those semantics.
>>
>> Ok, how about a MAP_DIRECT flag that arranges for faults to that range to:
>>
>> 1/ only succeed if the fault can be satisfied without page cache
>>
>> 2/ only install a pte for the fault if it can do so without triggering block map updates
>>
>> So, I think it would still end up setting an inode flag to make xfs_bmapi_write() fail while any process has a MAP_DIRECT mapping active. However, it would not record that state in the on-disk metadata and it would automatically clear at munmap time. That should be enough to support the host-persistent-memory, and NVMe-persistent-memory use cases (provided we have struct page for NVMe). Although, we need more safety infrastructure in the NVMe case where we would need to software manage I/O coherence.
>
> Hum, this proposal (and the problems you are trying to deal with) seem very similar to Peter Zijlstra's mpin() proposal from 2014 [1], just moved to the DAX area (and so additionally complicated by the fact that filesystems now have to care). The patch set was not merged due to lack of interest I think but it looked sensible and the proposed API would make sense for more stuff than just DAX so maybe it would be better than a MAP_DIRECT flag?

Interesting, but I'm not sure I see the correlation. mm_mpin() makes a "no-fault" guarantee and fixes the accounting of locked System RAM. MAP_DIRECT still allows faults, and DAX mappings don't consume System RAM so the accounting problem is not there for DAX. mm_mpin() also does not appear to have a relationship to file-backed memory like mmap allows.
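[Editor's note: the "accounting of locked System RAM" mentioned above is the per-process RLIMIT_MEMLOCK budget that mlock()/pinned pages are charged against. A small sketch of inspecting that budget; the point of the argument is that DAX mappings are device memory, so MAP_DIRECT would have nothing to charge here.]

```c
#include <sys/resource.h>

/*
 * Locked/pinned System RAM is charged against RLIMIT_MEMLOCK.  DAX
 * mappings point at device memory rather than System RAM, which is why
 * the mm_mpin() accounting fixes have no equivalent problem to solve
 * for MAP_DIRECT.
 */
static long memlock_limit_bytes(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;                    /* getrlimit failed */
    if (rl.rlim_cur == RLIM_INFINITY)
        return -2;                    /* no limit configured */
    return (long)rl.rlim_cur;
}
```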
Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
On Sun 13-08-17 13:31:45, Dan Williams wrote:
> On Sun, Aug 13, 2017 at 2:24 AM, Christoph Hellwig wrote:
> > That being said I think we absolutely should support RDMA memory registrations for DAX mappings. I'm just not sure how S_IOMAP_IMMUTABLE helps with that. We'll want a MAP_SYNC | MAP_POPULATE to make sure all the blocks are populated and all ptes are set up. Second we need to make sure get_user_pages works, which for now means we'll need a struct page mapping for the region (which will be really annoying for PCIe mappings, like the upcoming NVMe persistent memory region), and we need to guarantee that the extent mapping won't change while the get_user_pages holds the pages inside it. I think that is true due to side effects even with the current DAX code, but we'll need to make it explicit. And maybe that's where we need to converge - "sealing" the extent map makes sense as such a temporary measure that is not persisted on disk, which automatically gets released when the holding process exits, because we sort of already do this implicitly. It might also make sense to have explicitly breakable seals similar to what I do for the pNFS blocks kernel server, as any userspace RDMA file server would also need those semantics.
>
> Ok, how about a MAP_DIRECT flag that arranges for faults to that range to:
>
> 1/ only succeed if the fault can be satisfied without page cache
>
> 2/ only install a pte for the fault if it can do so without triggering block map updates
>
> So, I think it would still end up setting an inode flag to make xfs_bmapi_write() fail while any process has a MAP_DIRECT mapping active. However, it would not record that state in the on-disk metadata and it would automatically clear at munmap time. That should be enough to support the host-persistent-memory, and NVMe-persistent-memory use cases (provided we have struct page for NVMe). Although, we need more safety infrastructure in the NVMe case where we would need to software manage I/O coherence.

Hum, this proposal (and the problems you are trying to deal with) seem very similar to Peter Zijlstra's mpin() proposal from 2014 [1], just moved to the DAX area (and so additionally complicated by the fact that filesystems now have to care). The patch set was not merged due to lack of interest I think but it looked sensible and the proposed API would make sense for more stuff than just DAX so maybe it would be better than a MAP_DIRECT flag?

[1] https://lwn.net/Articles/600502/

Honza
--
Jan Kara
SUSE Labs, CR
Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
On Sun, Aug 13, 2017 at 11:24:36AM +0200, Christoph Hellwig wrote:
> And maybe that's where we need to converge - "sealing" the extent map makes sense as such a temporary measure that is not persisted on disk, which automatically gets released when the holding process exits, because we sort of already do this implicitly.

That seems reasonable to me. Personally I don't need persistent state, and I'd only intended persistence to be so that we didn't get arbitrary processes whacking holes in the file when the DAX app wasn't running, which would then cause problems for userspace data sync. Seeing as the interface is morphing away from a "fill holes and persist" interface to just a "seal the existing map" interface, it'll be up to the app/library to prep/check the file layout for sanity every time it is sealed.

> It might also make sense to have explicitly breakable seals similar to what I do for the pNFS blocks kernel server, as any userspace RDMA file server would also need those semantics.

How would that work? IIUC, we'd need userspace to take out a file lease so that it gets notified when the seal is going to be broken by the filesystem via the break_layouts() interface, and the break then blocks until the app releases the lease? So the seal lifetime is bounded by the lease?

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
On Sun, Aug 13, 2017 at 2:24 AM, Christoph Hellwig wrote:
> On Sat, Aug 12, 2017 at 12:19:50PM -0700, Dan Williams wrote:
>> The application does not need to know the storage address, it needs to know that the storage address to file offset is fixed. With this information it can make assumptions about the permanence of results it gets from the kernel.
>
> Only if we clearly document that fact - and documenting the permanence is different from saying the block map won't change.

I can get on board with that.

>> For example get_user_pages() today makes no guarantees outside of "page will not be freed",
>
> It also makes the extremely important guarantee that the page won't _move_ - e.g. that we won't do a memory migration for compaction or other reasons. That's why for example RDMA can use it to register memory and then we can later set up memory windows that point to this registration from userspace and implement userspace RDMA.
>
>> but with immutable files and dax you now have a mechanism for userspace to coordinate direct access to storage addresses. Those raw storage addresses need not be exposed to the application, as you say it doesn't need to know that detail. MAP_SYNC does not fully satisfy this case because it requires agents that can generate MMU faults to coordinate with the filesystem.
>
> The file system is always in the fault path, can you explain what other agents you are talking about?

Exactly the ones you mention below. SVM hardware can just use a MAP_SYNC mapping and be sure that its metadata dirtying writes are synchronized with the filesystem through the fault path. Hardware that does not have SVM, or hypervisors like Xen that want to attach their own static metadata about the file offset to physical block mapping, need a mechanism to make sure the block map is sealed while they have it mapped.

>> All I know is that SMB Direct for persistent memory seems like a potential consumer. I know they're not going to use a userspace filesystem or put an SMB server in the kernel.
>
> Last I talked to the Samba folks they didn't expect a userspace SMB direct implementation to work anyway due to the fact that libibverbs memory registrations interact badly with their fork()ing daemon model. That being said during the recent submission of the RDMA client code some comments were made about userspace versions of it, so I'm not sure if that opinion has changed in one way or another.

Ok.

> That being said I think we absolutely should support RDMA memory registrations for DAX mappings. I'm just not sure how S_IOMAP_IMMUTABLE helps with that. We'll want a MAP_SYNC | MAP_POPULATE to make sure all the blocks are populated and all ptes are set up. Second we need to make sure get_user_pages works, which for now means we'll need a struct page mapping for the region (which will be really annoying for PCIe mappings, like the upcoming NVMe persistent memory region), and we need to guarantee that the extent mapping won't change while the get_user_pages holds the pages inside it. I think that is true due to side effects even with the current DAX code, but we'll need to make it explicit. And maybe that's where we need to converge - "sealing" the extent map makes sense as such a temporary measure that is not persisted on disk, which automatically gets released when the holding process exits, because we sort of already do this implicitly. It might also make sense to have explicitly breakable seals similar to what I do for the pNFS blocks kernel server, as any userspace RDMA file server would also need those semantics.

Ok, how about a MAP_DIRECT flag that arranges for faults to that range to:

1/ only succeed if the fault can be satisfied without page cache

2/ only install a pte for the fault if it can do so without triggering block map updates

So, I think it would still end up setting an inode flag to make xfs_bmapi_write() fail while any process has a MAP_DIRECT mapping active. However, it would not record that state in the on-disk metadata and it would automatically clear at munmap time. That should be enough to support the host-persistent-memory, and NVMe-persistent-memory use cases (provided we have struct page for NVMe). Although, we need more safety infrastructure in the NVMe case where we would need to software manage I/O coherence.

> Last but not least we have an interesting additional case for modern Mellanox hardware - On Demand Paging where we don't actually do a get_user_pages but the hardware implements SVM and thus gets fed virtual addresses directly. My head spins when talking about the implications for DAX mappings on that, so I'm just throwing that in for now instead of trying to come up with a solution.

Yeah, DAX + SVM needs more thought.
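[Editor's note: the two MAP_DIRECT fault rules above can be captured as a toy decision function. The flag is hypothetical (it was never merged in this form) and the booleans are invented stand-ins: "needs a block-map update" covers faulting into a hole or converting an unwritten extent, either of which would change the offset-to-block mapping the flag is supposed to freeze.]

```c
#include <assert.h>
#include <stdbool.h>

enum fault_result { FAULT_OK, FAULT_SIGBUS };

/*
 * Toy model of the proposed MAP_DIRECT fault path: a fault may only be
 * satisfied when no page cache is involved (rule 1) and when installing
 * the pte requires no block-map update, i.e. the extent is already
 * allocated and written (rule 2).  Anything else fails the fault.
 */
static enum fault_result map_direct_fault(bool backed_by_page_cache,
                                          bool needs_block_map_update)
{
    if (backed_by_page_cache)         /* rule 1: DAX only, no buffered path */
        return FAULT_SIGBUS;
    if (needs_block_map_update)       /* rule 2: no allocation/conversion */
        return FAULT_SIGBUS;
    return FAULT_OK;                  /* pte can be installed directly */
}
```

[The interesting property is that the seal is enforced lazily per fault rather than persisted: drop the mapping and both rules vanish, matching "automatically clear at munmap time".]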
Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
On Sat, Aug 12, 2017 at 12:19:50PM -0700, Dan Williams wrote: > The application does not need to know the storage address, it needs to > know that the storage address to file offset is fixed. With this > information it can make assumptions about the permanence of results it > gets from the kernel. Only if we clearly document that fact - and documenting the permanence is different from saying the block map won't change. > For example get_user_pages() today makes no guarantees outside of > "page will not be freed", It also makes the extremely important guarantee that the page won't _move_ - e.g. that we won't do a memory migration for compaction or other reasons. That's why for example RDMA can use it to register memory and then we can later set up memory windows that point to this registration from userspace and implement userspace RDMA. > but with immutable files and dax you now > have a mechanism for userspace to coordinate direct access to storage > addresses. Those raw storage addresses need not be exposed to the > application, as you say it doesn't need to know that detail. MAP_SYNC > does not fully satisfy this case because it requires agents that can > generate MMU faults to coordinate with the filesystem. The file system is always in the fault path, can you explain what other agents you are talking about? > All I know is that SMB Direct for persistent memory seems like a > potential consumer. I know they're not going to use a userspace > filesystem or put an SMB server in the kernel. Last I talked to the Samba folks they didn't expect a userspace SMB direct implementation to work anyway due to the fact that libibverbs memory registrations interact badly with their fork()ing daemon model. That being said, during the recent submission of the RDMA client code some comments were made about userspace versions of it, so I'm not sure if that opinion has changed in one way or another. 
That being said I think we absolutely should support RDMA memory registrations for DAX mappings. I'm just not sure how S_IOMAP_IMMUTABLE helps with that. We'll want a MAP_SYNC | MAP_POPULATE to make sure all the blocks are populated and all ptes are set up. Second we need to make sure get_user_pages works, which for now means we'll need a struct page mapping for the region (which will be really annoying for PCIe mappings, like the upcoming NVMe persistent memory region), and we need to guarantee that the extent mapping won't change while the get_user_pages holds the pages inside it. I think that is true due to side effects even with the current DAX code, but we'll need to make it explicit. And maybe that's where we need to converge - "sealing" the extent map makes sense as such a temporary measure that is not persisted on disk, which automatically gets released when the holding process exits, because we sort of already do this implicitly. It might also make sense to have explicitly breakable seals similar to what I do for the pNFS blocks kernel server, as any userspace RDMA file server would also need those semantics. Last but not least we have an interesting additional case for modern Mellanox hardware - On Demand Paging where we don't actually do a get_user_pages but the hardware implements SVM and thus gets fed virtual addresses directly. My head spins when talking about the implications for DAX mappings on that, so I'm just throwing that in for now instead of trying to come up with a solution.
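As a footnote, the MAP_SYNC | MAP_POPULATE combination suggested above was eventually merged (Linux 4.15), gated behind MAP_SHARED_VALIDATE precisely so that older kernels reject the request instead of silently ignoring the flag. A sketch with the values as merged (defined locally in case libc headers predate them); on a non-DAX filesystem the call fails with EOPNOTSUPP:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Values as merged in Linux 4.15, in case the libc headers lack them. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif
#ifndef MAP_POPULATE
#define MAP_POPULATE 0x8000
#endif

/*
 * Map a file with synchronous faults: a write fault does not complete
 * until any block-map change it caused is durable, so CPU stores
 * through the pte are guaranteed to reach persistent media.
 * MAP_POPULATE pre-faults the range so all ptes are set up front.
 * Fails with EOPNOTSUPP unless the file is on a DAX-capable mount.
 */
void *map_sync_populate(int fd, size_t len)
{
        return mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_SHARED_VALIDATE | MAP_SYNC | MAP_POPULATE, fd, 0);
}
```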
Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
On Sat, Aug 12, 2017 at 12:33 AM, Christoph Hellwig wrote: > On Fri, Aug 11, 2017 at 03:26:05PM -0700, Dan Williams wrote: >> Right, but they let userspace make inferences about the state of >> metadata relative to I/O to a given storage address. In this regard >> S_IOMAP_IMMUTABLE is no different than MAP_SYNC, but 'immutable' goes >> a step further to let an application infer that the storage address is >> stable. This enables applications that MAP_SYNC does not, see below. > > But the application must not know (and cannot know) the storage address, > so it doesn't matter. > >> > What is the observable behavior of an extent map change? How can you >> > describe your immutable extent map behavior so that when I violate >> > them by e.g. moving one extent to a different place on disk you can >> > observe that in userspace? >> >> The violation is blocked, it's immutable. Using this feature means the >> application is taking away some of the kernel's freedom. That is a >> valid / safe tradeoff for the set of applications that would otherwise >> resort to raw device access. > > What can the application do with it safely that it can't otherwise do? > Short answer: nothing. The application does not need to know the storage address, it needs to know that the storage address to file offset is fixed. With this information it can make assumptions about the permanence of results it gets from the kernel. For example get_user_pages() today makes no guarantees outside of "page will not be freed", but with immutable files and dax you now have a mechanism for userspace to coordinate direct access to storage addresses. Those raw storage addresses need not be exposed to the application, as you say it doesn't need to know that detail. MAP_SYNC does not fully satisfy this case because it requires agents that can generate MMU faults to coordinate with the filesystem. >> > >> > Please explain how this interface allows for any sort of safe userspace >> > DMA. 
>> >> So this is where I continue to see S_IOMAP_IMMUTABLE being able to >> support applications that MAP_SYNC does not. Dave mentioned userspace >> pNFS4 servers, but there's also Samba and other protocols that want to >> negotiate a direct path to pmem outside the kernel. > > Userspace pNFS servers must use a userspace file system. Everything > else is just braindead stupid due to the amount of communication they > need to do. Also note that the only pNFS layouts that would even cause > direct block access are pNFS block/scsi and for those the > S_IOMAP_IMMUTABLE semantics are not very useful (background: I wrote > the Linux implementation for those, and authored the scsi layout spec) > Understood. All I know is that SMB Direct for persistent memory seems like a potential consumer. I know they're not going to use a userspace filesystem or put an SMB server in the kernel. > >> Applications that just want flush from userspace can use MAP_SYNC, >> those that need to temporarily pin the block for RDMA can use the >> in-kernel pNFS server, and those that need to coordinate both from >> userspace can use S_IOMAP_IMMUTABLE. It's a continuum, not a >> competition. > > Again - how does your application even know that I moved your block > around with your S_IOMAP_IMMUTABLE? We should never add interfaces > that mandate implementations - we should base interfaces on > user-observable behavior - and debug tools like fiemap don't count. I'm still not grokking this "I moved your block" example. What agent is moving blocks while the file is immutable? > Before going any further please write a man page that describes your > intended semantics in a way that an application programmer understands. Sure, I'll try to write this up in terms of the use cases I know about that can immediately consume it and switch away from device-dax.
Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
On Fri, Aug 11, 2017 at 08:57:18PM -0700, Andy Lutomirski wrote: > One thing that makes me quite nervous about S_IOMAP_IMMUTABLE is the > degree to which things go badly if one program relies on it while > another program clears the flag: you risk corrupting unrelated > filesystem metadata. I think a userspace interface to pin the extent > mapping of a file really wants a way to reliably keep it pinned (or to > reliably zap the userspace application if it gets unpinned). The nice thing is that no application can rely on it anyway..
Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
On Fri, Aug 11, 2017 at 03:26:05PM -0700, Dan Williams wrote: > Right, but they let userspace make inferences about the state of > metadata relative to I/O to a given storage address. In this regard > S_IOMAP_IMMUTABLE is no different than MAP_SYNC, but 'immutable' goes > a step further to let an application infer that the storage address is > stable. This enables applications that MAP_SYNC does not, see below. But the application must not know (and cannot know) the storage address, so it doesn't matter. > > What is the observable behavior of an extent map change? How can you > > describe your immutable extent map behavior so that when I violate > > them by e.g. moving one extent to a different place on disk you can > > observe that in userspace? > > The violation is blocked, it's immutable. Using this feature means the > application is taking away some of the kernel's freedom. That is a > valid / safe tradeoff for the set of applications that would otherwise > resort to raw device access. What can the application do with it safely that it can't otherwise do? Short answer: nothing. > > > > Please explain how this interface allows for any sort of safe userspace > > DMA. > > So this is where I continue to see S_IOMAP_IMMUTABLE being able to > support applications that MAP_SYNC does not. Dave mentioned userspace > pNFS4 servers, but there's also Samba and other protocols that want to > negotiate a direct path to pmem outside the kernel. Userspace pNFS servers must use a userspace file system. Everything else is just braindead stupid due to the amount of communication they need to do. 
Also note that the only pNFS layouts that would even cause direct block access are pNFS block/scsi and for those the S_IOMAP_IMMUTABLE semantics are not very useful (background: I wrote the Linux implementation for those, and authored the scsi layout spec) > Applications that just want flush from userspace can use MAP_SYNC, > those that need to temporarily pin the block for RDMA can use the > in-kernel pNFS server, and those that need to coordinate both from > userspace can use S_IOMAP_IMMUTABLE. It's a continuum, not a > competition. Again - how does your application even know that I moved your block around with your S_IOMAP_IMMUTABLE? We should never add interfaces that mandate implementations - we should base interfaces on user-observable behavior - and debug tools like fiemap don't count. Before going any further please write a man page that describes your intended semantics in a way that an application programmer understands.
Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
On Fri, Aug 11, 2017 at 8:57 PM, Andy Lutomirski wrote: > On Fri, Aug 11, 2017 at 3:26 PM, Dan Williams > wrote: >> On Fri, Aug 11, 2017 at 3:44 AM, Christoph Hellwig wrote: >>> Please explain how this interface allows for any sort of safe userspace >>> DMA. >> >> So this is where I continue to see S_IOMAP_IMMUTABLE being able to >> support applications that MAP_SYNC does not. Dave mentioned userspace >> pNFS4 servers, but there's also Samba and other protocols that want to >> negotiate a direct path to pmem outside the kernel. Xen support has >> thus far not been able to follow in the footsteps of KVM enabling due >> to a dependence on static M2P tables that assume a static >> guest-physical to host-physical relationship [1]. Immutable files >> would allow Xen to follow the same "mmap a file" semantic as KVM. > > One thing that makes me quite nervous about S_IOMAP_IMMUTABLE is the > degree to which things go badly if one program relies on it while > another program clears the flag: you risk corrupting unrelated > filesystem metadata. I think a userspace interface to pin the extent > mapping of a file really wants a way to reliably keep it pinned (or to > reliably zap the userspace application if it gets unpinned). In the current patches, mapping_mapped() pins the immutable state.
Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
On Fri, Aug 11, 2017 at 3:26 PM, Dan Williams wrote: > On Fri, Aug 11, 2017 at 3:44 AM, Christoph Hellwig wrote: >> Please explain how this interface allows for any sort of safe userspace >> DMA. > > So this is where I continue to see S_IOMAP_IMMUTABLE being able to > support applications that MAP_SYNC does not. Dave mentioned userspace > pNFS4 servers, but there's also Samba and other protocols that want to > negotiate a direct path to pmem outside the kernel. Xen support has > thus far not been able to follow in the footsteps of KVM enabling due > to a dependence on static M2P tables that assume a static > guest-physical to host-physical relationship [1]. Immutable files > would allow Xen to follow the same "mmap a file" semantic as KVM. One thing that makes me quite nervous about S_IOMAP_IMMUTABLE is the degree to which things go badly if one program relies on it while another program clears the flag: you risk corrupting unrelated filesystem metadata. I think a userspace interface to pin the extent mapping of a file really wants a way to reliably keep it pinned (or to reliably zap the userspace application if it gets unpinned).
Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
On Fri, Aug 11, 2017 at 3:44 AM, Christoph Hellwig wrote: > On Sun, Aug 06, 2017 at 11:51:50AM -0700, Dan Williams wrote: >> Of course it's a useful API. An application already needs to worry >> about the block map, that's why we have fallocate, msync, fiemap >> and... > > Fallocate and msync do not expose the block map in any way. Proof: > they work just fine over say nfs. Right, but they let userspace make inferences about the state of metadata relative to I/O to a given storage address. In this regard S_IOMAP_IMMUTABLE is no different than MAP_SYNC, but 'immutable' goes a step further to let an application infer that the storage address is stable. This enables applications that MAP_SYNC does not, see below. > fiemap does indeed expose the block map, which is the whole point. > But it's a debug tool that we don't even have a man page for. And > it's not usable for anything else, if only for the fact that it doesn't > tell you what device your returned extents are relative to. True, one couldn't just use immutable + fiemap and expect to have the right storage device. > >> > We've been through this a few times but let me repeat it: The only >> > sensible API guarantee is one that is observable and usable. >> >> I'm missing how block-map immutable files violate this observable and >> usable constraint? > > What is the observable behavior of an extent map change? How can you > describe your immutable extent map behavior so that when I violate > them by e.g. moving one extent to a different place on disk you can > observe that in userspace? The violation is blocked, it's immutable. Using this feature means the application is taking away some of the kernel's freedom. That is a valid / safe tradeoff for the set of applications that would otherwise resort to raw device access. 
> >> This immutable approach should also go in, it solves the same problem >> without the latency drawback, > > How is your latency going to be any different from MAP_SYNC on > a fully allocated and pre-zeroed file? So, I went back and read Jan's patches, and in the pre-allocated case I don't think we can get stuck behind a backlog of dirty metadata flushing since the implementation only seems to take the synchronous fault path if the fault dirtied the block map. >> Beyond flush from userspace it also >> can be used to solve the swapfile problems you highlighted > > Which swapfile problem? The TOCTOU problem of enabling swap vs reflink that you mentioned in your criticism of the daxctl syscall, but now that I look your comments were based on the *general* case use of bmap(). However, xfs in particular as of commits: eb5e248d502b xfs: don't allow bmap on rt files db1327b16c2b xfs: report shared extent mappings to userspace correctly ...doesn't appear to have this problem. That said, Dave's idea to use immutable + unwritten extents for swap makes sense to me. That's a feature, not a bug fix, but I went ahead and appended a proof-of-concept implementation to the v3 posting. >> and it >> allows safe ongoing dma to a filesystem-dax mapping beyond what we can >> already do with direct-I/O. > > Please explain how this interface allows for any sort of safe userspace > DMA. So this is where I continue to see S_IOMAP_IMMUTABLE being able to support applications that MAP_SYNC does not. Dave mentioned userspace pNFS4 servers, but there's also Samba and other protocols that want to negotiate a direct path to pmem outside the kernel. Xen support has thus far not been able to follow in the footsteps of KVM enabling due to a dependence on static M2P tables that assume a static guest-physical to host-physical relationship [1]. Immutable files would allow Xen to follow the same "mmap a file" semantic as KVM. 
Applications that just want flush from userspace can use MAP_SYNC, those that need to temporarily pin the block for RDMA can use the in-kernel pNFS server, and those that need to coordinate both from userspace can use S_IOMAP_IMMUTABLE. It's a continuum, not a competition. [1]: https://lists.xen.org/archives/html/xen-devel/2017-04/msg00427.html
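As an aside, the "fully allocated and pre-zeroed file" case that the MAP_SYNC latency argument hinges on is set up from userspace with a plain fallocate(). A minimal sketch (the trailing fsync to make the allocation durable is my addition):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Allocate every block of [0, len) up front and make the allocation
 * durable.  After this, write faults on a mapping of the file can no
 * longer dirty the block map, so even a synchronous-fault scheme
 * (MAP_SYNC) has no allocation metadata to flush on the fault path.
 */
int preallocate_all(int fd, off_t len)
{
        /* mode 0: allocate zeroed blocks and extend i_size to len */
        if (fallocate(fd, 0, 0, len) < 0)
                return -1;
        return fsync(fd);       /* commit the allocation transaction */
}
```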
Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
On Sun, Aug 06, 2017 at 11:51:50AM -0700, Dan Williams wrote: > Of course it's a useful API. An application already needs to worry > about the block map, that's why we have fallocate, msync, fiemap > and... Fallocate and msync do not expose the block map in any way. Proof: they work just fine over say nfs. fiemap does indeed expose the block map, which is the whole point. But it's a debug tool that we don't even have a man page for. And it's not usable for anything else, if only for the fact that it doesn't tell you what device your returned extents are relative to. > > We've been through this a few times but let me repeat it: The only > > sensible API guarantee is one that is observable and usable. > > I'm missing how block-map immutable files violate this observable and > usable constraint? What is the observable behavior of an extent map change? How can you describe your immutable extent map behavior so that when I violate them by e.g. moving one extent to a different place on disk you can observe that in userspace? > This immutable approach should also go in, it solves the same problem > without the latency drawback, How is your latency going to be any different from MAP_SYNC on a fully allocated and pre-zeroed file? > Beyond flush from userspace it also > can be used to solve the swapfile problems you highlighted Which swapfile problem? > and it > allows safe ongoing dma to a filesystem-dax mapping beyond what we can > already do with direct-I/O. Please explain how this interface allows for any sort of safe userspace DMA.
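For concreteness, the fiemap interface under discussion is the FS_IOC_FIEMAP ioctl. A minimal extent-count query looks like this; note the returned physical offsets (when an extent array is fetched) carry no device identification, exactly the objection raised above:

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Count the extents backing a file with FS_IOC_FIEMAP.  Passing
 * fm_extent_count == 0 asks the kernel for the count only, without
 * filling an extent array.  When extents are fetched, fe_physical is
 * a byte offset on an *unnamed* device -- the ioctl never says which.
 * Returns the extent count, or -1 on error (e.g. EOPNOTSUPP on
 * filesystems without fiemap support).
 */
long count_extents(int fd)
{
        struct fiemap fm;

        memset(&fm, 0, sizeof(fm));
        fm.fm_length = FIEMAP_MAX_OFFSET;       /* whole file */
        fm.fm_flags = FIEMAP_FLAG_SYNC;         /* flush before mapping */
        fm.fm_extent_count = 0;                 /* count only */
        if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0)
                return -1;
        return fm.fm_mapped_extents;
}
```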
Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
On Sat, Aug 5, 2017 at 2:50 AM, Christoph Hellwig wrote: > On Thu, Aug 03, 2017 at 07:38:11PM -0700, Dan Williams wrote: >> [ adding linux-api to the cover letter for notification, will send the >> full set to linux-api for v3 ] > > Just don't send this crap ever again. All the so called use cases in the > earlier thread were incorrect and highly dangerous. I usually end up coming around to your position on these types of debates because you almost always put forward unassailable technical arguments. So far, you have not in this case. > Promising that the block map is stable is not a useful userspace API, > as the block map is a complete internal implementation detail. Of course it's a useful API. An application already needs to worry about the block map, that's why we have fallocate, msync, fiemap and... > We've been through this a few times but let me repeat it: The only > sensible API guarantee is one that is observable and usable. I'm missing how block-map immutable files violate this observable and usable constraint? > So Jan's synchronous page fault flag in one form or another makes > perfect sense as it is a clear recipe for the user: you don't > have to call msync to persist your mmap writes. This API is not, > it guarantees that the block map does not change, but the application > has absolutely no business even knowing about the block map. Jan's approach is great, it should go in, it solves a long standing problem with dax with the only drawback being potentially unpredictable latency spikes. This immutable approach should also go in, it solves the same problem without the latency drawback, but yes, with the administrative overhead of CAP_LINUX_IMMUTABLE. Beyond flush from userspace it also can be used to solve the swapfile problems you highlighted and it allows safe ongoing dma to a filesystem-dax mapping beyond what we can already do with direct-I/O. 
There is demand for these capabilities that cannot be satisfied by just hand waving them away as invalid. The magnitude of opposition to this approach is out of step with the actual risk.
Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
On Thu, Aug 03, 2017 at 07:38:11PM -0700, Dan Williams wrote: > [ adding linux-api to the cover letter for notification, will send the > full set to linux-api for v3 ] Just don't send this crap ever again. All the so called use cases in the earlier thread were incorrect and highly dangerous. Promising that the block map is stable is not a useful userspace API, as the block map is a complete internal implementation detail. We've been through this a few times but let me repeat it: The only sensible API guarantee is one that is observable and usable. So Jan's synchronous page fault flag in one form or another makes perfect sense as it is a clear recipe for the user: you don't have to call msync to persist your mmap writes. This API is not, it guarantees that the block map does not change, but the application has absolutely no business even knowing about the block map.
Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
[ adding linux-api to the cover letter for notification, will send the full set to linux-api for v3 ]

On Thu, Aug 3, 2017 at 7:28 PM, Dan Williams wrote:
> Changes since v1 [1]:
> * Add IS_IOMAP_IMMUTABLE() checks to xfs ioctl paths that perform block
>   map changes (xfs_alloc_file_space and xfs_free_file_space) (Darrick)
>
> * Rather than complete a partial write, fail all writes that would
>   attempt to extend the file size (Darrick)
>
> * Introduce FALLOC_FL_UNSEAL_BLOCK_MAP as an explicit operation type for
>   clearing S_IOMAP_IMMUTABLE (Dave)
>
> * Rework xfs_seal_file_space() to first complete hole-fill and unshare
>   operations and then check the file for suitability under
>   XFS_ILOCK_EXCL. (Darrick)
>
> * Add an FS_XFLAG_IOMAP_IMMUTABLE flag so the immutable state can be
>   seen by xfs_io. (Dave)
>
> * Move the setting of S_IOMAP_IMMUTABLE to be atomic with respect to the
>   successful transaction that records XFS_DIFLAG2_IOMAP_IMMUTABLE.
>   (Darrick, Dave)
>
> * Switch to a 'goto out_unlock' style in xfs_seal_file_space() to
>   cleanup 'if / else' tree, and use the mapping_mapped() helper. (Dave)
>
> * Rely on XFS_MMAPLOCK_EXCL for reading a stable state of
>   mapping->i_mmap. (Dave)
>
> [1]: http://marc.info/?l=linux-fsdevel&m=150135785712967&w=2
>
> ---
>
> The daxfile proposal a few weeks back [2] sought to piggy back on the
> swapfile implementation to approximate a block map immutable file. This
> is an idea Dave originated last year to solve the dax "flush from
> userspace" problem [3].
>
> The discussion yielded several results. First, Christoph pointed out
> that swapfiles are subtly broken [4]. Second, Darrick [5] and Dave [6]
> proposed how to properly implement a block map immutable file. Finally,
> Dave identified some improvements to swapfiles that can be built on the
> block-map-immutable mechanism. These patches seek to implement the first
> part of the proposal and save the swapfile work to build on top once the
> base mechanism is complete.
>
> While the initial motivation for this feature is support for
> byte-addressable updates of persistent memory and managing cache
> maintenance from userspace, the applications of the feature are broader.
> In addition to being the start of a better swapfile mechanism it can
> also support a DMA-to-storage use case. This use case enables
> data-acquisition hardware to DMA directly to a storage device address
> while being safe in the knowledge that storage mappings will not change.
>
> [2]: https://lkml.org/lkml/2017/6/16/790
> [3]: https://lkml.org/lkml/2016/9/11/159
> [4]: https://lkml.org/lkml/2017/6/18/31
> [5]: https://lkml.org/lkml/2017/6/20/49
> [6]: https://www.spinics.net/lists/linux-xfs/msg07871.html
>
> ---
>
> Dan Williams (5):
>       fs, xfs: introduce S_IOMAP_IMMUTABLE
>       fs, xfs: introduce FALLOC_FL_SEAL_BLOCK_MAP
>       fs, xfs: introduce FALLOC_FL_UNSEAL_BLOCK_MAP
>       xfs: introduce XFS_DIFLAG2_IOMAP_IMMUTABLE
>       xfs: toggle XFS_DIFLAG2_IOMAP_IMMUTABLE in response to fallocate
>
>  fs/attr.c                   |   10 ++
>  fs/open.c                   |   22 +
>  fs/read_write.c             |    3 +
>  fs/xfs/libxfs/xfs_format.h  |    5 +
>  fs/xfs/xfs_bmap_util.c      |  181 +++
>  fs/xfs/xfs_bmap_util.h      |    5 +
>  fs/xfs/xfs_file.c           |   16 +++-
>  fs/xfs/xfs_inode.c          |    2
>  fs/xfs/xfs_ioctl.c          |    7 ++
>  fs/xfs/xfs_iops.c           |    8 +-
>  include/linux/falloc.h      |    4 +
>  include/linux/fs.h          |    2
>  include/uapi/linux/falloc.h |   20 +
>  include/uapi/linux/fs.h     |    1
>  mm/filemap.c                |    5 +
>  15 files changed, 282 insertions(+), 9 deletions(-)
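For illustration, the seal/unseal interface introduced by patches 2 and 3 would be driven from userspace roughly as below. The flag names come from the series; the numeric values are placeholders (the series was never merged, so no ABI values exist), and on a mainline kernel the calls simply fail with EOPNOTSUPP:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Flag names from this series; the values below are placeholders only
 * -- no mainline kernel defines them. */
#ifndef FALLOC_FL_SEAL_BLOCK_MAP
#define FALLOC_FL_SEAL_BLOCK_MAP   0x10000000
#endif
#ifndef FALLOC_FL_UNSEAL_BLOCK_MAP
#define FALLOC_FL_UNSEAL_BLOCK_MAP 0x20000000
#endif

/*
 * Seal [0, len): holes are filled, shared extents unshared, and
 * S_IOMAP_IMMUTABLE set, after which block-map-changing operations on
 * the file fail until an unseal.  Per the series this requires
 * CAP_LINUX_IMMUTABLE.  Kernels without the patches reject the
 * unknown mode bit with EOPNOTSUPP.
 */
int seal_block_map(int fd, off_t len)
{
        return fallocate(fd, FALLOC_FL_SEAL_BLOCK_MAP, 0, len);
}

int unseal_block_map(int fd, off_t len)
{
        return fallocate(fd, FALLOC_FL_UNSEAL_BLOCK_MAP, 0, len);
}
```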