Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-12-21 Thread Dan Williams
On Wed, Dec 21, 2016 at 1:24 PM, Dave Chinner  wrote:
> On Wed, Dec 21, 2016 at 08:53:46AM -0800, Dan Williams wrote:
>> On Tue, Dec 20, 2016 at 4:40 PM, Darrick J. Wong
>>  wrote:
>> > On Mon, Dec 19, 2016 at 05:18:40PM -0800, Dan Williams wrote:
>> >> On Mon, Dec 19, 2016 at 5:09 PM, Darrick J. Wong
>> >>  wrote:
>> >> > On Mon, Dec 19, 2016 at 02:11:49PM -0700, Ross Zwisler wrote:
>> >> >> On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote:
>> >> >> <>
>> >> >> > Definitely the first step would be your simple preallocated per
>> >> >> > inode approach until it is shown to be insufficient.
>> >> >>
>> >> >> Reviving this thread a few months later...
>> >> >>
>> >> >> Dave, we're interested in taking a serious look at what it would take 
>> >> >> to get
>> >> >> PMEM_IMMUTABLE working.  Do you still hold the opinion that this is 
>> >> >> (or could
>> >> >> become, with some amount of work) a workable solution?
>> >> >>
>> >> >> We're happy to do the grunt work for this feature, but we will 
>> >> >> probably need
>> >> >> guidance from someone with more XFS experience.  With you out on 
>> >> >> extended leave
>> >> >> the first half of 2017, who would be the best person to ask for this 
>> >> >> guidance?
>> >> >> Darrick?
>> >> >
>> >> > Yes, probably. :)
>> >> >
>> >> > I think where we left off with this (on the XFS side) is some sort of
>> >> > fallocate mode that would allocate blocks, zero them, and then set the
>> >> > DAX and PMEM_IMMUTABLE on-disk inode flags.  After that, you'd mmap the
>> >> > file and thereby gain the ability to control write persistence behavior
>> >> > without having to worry about fs metadata updates.  As an added plus, I
>> >> > think zeroing the pmem also clears media errors, or something like that.
>> >> >
>> >> >  Is that a reasonable starting point?  My memory is a little 
>> >> > foggy.
>> >> >
>> >> > Hmm, I see Dan just posted something about blockdev fallocate.  I'll go
>> >> > read that.
>> >>
>> >> That's for device-dax, which is basically a poor man's PMEM_IMMUTABLE
>> >> via a character device interface. It's useful for cases where you want
>> >> an entire nvdimm namespace/volume in "no fs-metadata to worry about"
>> >> mode.  But, for sub-allocations of a namespace and support for
>> >> existing tooling, PMEM_IMMUTABLE is much more usable.
>> >
>> > Well sure... but otoh I was thinking that it'd be pretty neat if we
>> > could use the same code regardless of whether the target file was a
>> > dax-device or an xfs file:
>> >
>> > fd = open("", O_RDWR);
>> > fstat(fd, ):
>> > fallocate(fd, FALLOC_FL_PMEM_IMMUTABLE, 0, statbuf.st_size);
>> > p = mmap(NULL, statbuf.st_size, PROT_READ | PROT_WRITE, fd, 0);
>> >
>> > *(p + 42) = 0xDEADBEEF;
>> > asm { clflush; } /* or whatever */
>> >
>> > ...so perhaps it would be a good idea to design the fallocate primitive
>> > around "prepare this fd for mmap-only pmem semantics" and let it the
>> > backend do zeroing and inode flag changes as necessary to make it
>> > happen.  We'd need to do some bikeshedding about what the other falloc
>> > flags mean when we're dealing with pmem files and devices, but I think
>> > we should try to keep the userland presentation the same unless there's
>> > a really good reason not to.
>>
>> It would be interesting to use fallocate to size device-dax files...
>
> No. device-dax needs to die, not poison a bunch of existing file and
> block device APIs and behaviours with special snowflakes.  Get
> DAX-enabled filesystems to do what you need, and get rid of this
> ugly, nasty hack.
>

Right, Christoph already killed fallocate for device-dax.

What we're looking for now is the next level of detail on how to get
started on PMEM_IMMUTABLE, as Ross asked a few messages back in this
thread, so we have a reasonable replacement for device-dax.


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-12-21 Thread Dave Chinner
On Wed, Dec 21, 2016 at 08:53:46AM -0800, Dan Williams wrote:
> On Tue, Dec 20, 2016 at 4:40 PM, Darrick J. Wong
>  wrote:
> > On Mon, Dec 19, 2016 at 05:18:40PM -0800, Dan Williams wrote:
> >> On Mon, Dec 19, 2016 at 5:09 PM, Darrick J. Wong
> >>  wrote:
> >> > On Mon, Dec 19, 2016 at 02:11:49PM -0700, Ross Zwisler wrote:
> >> >> On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote:
> >> >> <>
> >> >> > Definitely the first step would be your simple preallocated per
> >> >> > inode approach until it is shown to be insufficient.
> >> >>
> >> >> Reviving this thread a few months later...
> >> >>
> >> >> Dave, we're interested in taking a serious look at what it would take 
> >> >> to get
> >> >> PMEM_IMMUTABLE working.  Do you still hold the opinion that this is (or 
> >> >> could
> >> >> become, with some amount of work) a workable solution?
> >> >>
> >> >> We're happy to do the grunt work for this feature, but we will probably 
> >> >> need
> >> >> guidance from someone with more XFS experience.  With you out on 
> >> >> extended leave
> >> >> the first half of 2017, who would be the best person to ask for this 
> >> >> guidance?
> >> >> Darrick?
> >> >
> >> > Yes, probably. :)
> >> >
> >> > I think where we left off with this (on the XFS side) is some sort of
> >> > fallocate mode that would allocate blocks, zero them, and then set the
> >> > DAX and PMEM_IMMUTABLE on-disk inode flags.  After that, you'd mmap the
> >> > file and thereby gain the ability to control write persistence behavior
> >> > without having to worry about fs metadata updates.  As an added plus, I
> >> > think zeroing the pmem also clears media errors, or something like that.
> >> >
> >> >  Is that a reasonable starting point?  My memory is a little 
> >> > foggy.
> >> >
> >> > Hmm, I see Dan just posted something about blockdev fallocate.  I'll go
> >> > read that.
> >>
> >> That's for device-dax, which is basically a poor man's PMEM_IMMUTABLE
> >> via a character device interface. It's useful for cases where you want
> >> an entire nvdimm namespace/volume in "no fs-metadata to worry about"
> >> mode.  But, for sub-allocations of a namespace and support for
> >> existing tooling, PMEM_IMMUTABLE is much more usable.
> >
> > Well sure... but otoh I was thinking that it'd be pretty neat if we
> > could use the same code regardless of whether the target file was a
> > dax-device or an xfs file:
> >
> > fd = open("", O_RDWR);
> > fstat(fd, ):
> > fallocate(fd, FALLOC_FL_PMEM_IMMUTABLE, 0, statbuf.st_size);
> > p = mmap(NULL, statbuf.st_size, PROT_READ | PROT_WRITE, fd, 0);
> >
> > *(p + 42) = 0xDEADBEEF;
> > asm { clflush; } /* or whatever */
> >
> > ...so perhaps it would be a good idea to design the fallocate primitive
> > around "prepare this fd for mmap-only pmem semantics" and let it the
> > backend do zeroing and inode flag changes as necessary to make it
> > happen.  We'd need to do some bikeshedding about what the other falloc
> > flags mean when we're dealing with pmem files and devices, but I think
> > we should try to keep the userland presentation the same unless there's
> > a really good reason not to.
> 
> It would be interesting to use fallocate to size device-dax files...

No. device-dax needs to die, not poison a bunch of existing file and
block device APIs and behaviours with special snowflakes.  Get
DAX-enabled filesystems to do what you need, and get rid of this
ugly, nasty hack.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-12-21 Thread Dan Williams
On Tue, Dec 20, 2016 at 4:40 PM, Darrick J. Wong
 wrote:
> On Mon, Dec 19, 2016 at 05:18:40PM -0800, Dan Williams wrote:
>> On Mon, Dec 19, 2016 at 5:09 PM, Darrick J. Wong
>>  wrote:
>> > On Mon, Dec 19, 2016 at 02:11:49PM -0700, Ross Zwisler wrote:
>> >> On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote:
>> >> <>
>> >> > Definitely the first step would be your simple preallocated per
>> >> > inode approach until it is shown to be insufficient.
>> >>
>> >> Reviving this thread a few months later...
>> >>
>> >> Dave, we're interested in taking a serious look at what it would take to 
>> >> get
>> >> PMEM_IMMUTABLE working.  Do you still hold the opinion that this is (or 
>> >> could
>> >> become, with some amount of work) a workable solution?
>> >>
>> >> We're happy to do the grunt work for this feature, but we will probably 
>> >> need
>> >> guidance from someone with more XFS experience.  With you out on extended 
>> >> leave
>> >> the first half of 2017, who would be the best person to ask for this 
>> >> guidance?
>> >> Darrick?
>> >
>> > Yes, probably. :)
>> >
>> > I think where we left off with this (on the XFS side) is some sort of
>> > fallocate mode that would allocate blocks, zero them, and then set the
>> > DAX and PMEM_IMMUTABLE on-disk inode flags.  After that, you'd mmap the
>> > file and thereby gain the ability to control write persistence behavior
>> > without having to worry about fs metadata updates.  As an added plus, I
>> > think zeroing the pmem also clears media errors, or something like that.
>> >
>> >  Is that a reasonable starting point?  My memory is a little foggy.
>> >
>> > Hmm, I see Dan just posted something about blockdev fallocate.  I'll go
>> > read that.
>>
>> That's for device-dax, which is basically a poor man's PMEM_IMMUTABLE
>> via a character device interface. It's useful for cases where you want
>> an entire nvdimm namespace/volume in "no fs-metadata to worry about"
>> mode.  But, for sub-allocations of a namespace and support for
>> existing tooling, PMEM_IMMUTABLE is much more usable.
>
> Well sure... but otoh I was thinking that it'd be pretty neat if we
> could use the same code regardless of whether the target file was a
> dax-device or an xfs file:
>
> fd = open("", O_RDWR);
> fstat(fd, ):
> fallocate(fd, FALLOC_FL_PMEM_IMMUTABLE, 0, statbuf.st_size);
> p = mmap(NULL, statbuf.st_size, PROT_READ | PROT_WRITE, fd, 0);
>
> *(p + 42) = 0xDEADBEEF;
> asm { clflush; } /* or whatever */
>
> ...so perhaps it would be a good idea to design the fallocate primitive
> around "prepare this fd for mmap-only pmem semantics" and let it the
> backend do zeroing and inode flag changes as necessary to make it
> happen.  We'd need to do some bikeshedding about what the other falloc
> flags mean when we're dealing with pmem files and devices, but I think
> we should try to keep the userland presentation the same unless there's
> a really good reason not to.

It would be interesting to use fallocate to size device-dax files...

However, the fallocate userland presentation I was looking to unify
was the semantic for how to clear errors. The current method for
creating a device-dax instance is 'echo "" > ""'. Christoph put the kibosh on carrying the
fallocate-error-clearing semantic over, so we'll need to fall back to
using the existing ND_IOCTL_CLEAR_ERROR. All that's missing for error
clearing is a secondary mechanism to report the media error list for a
physical address range.
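
A hedged aside on that last point: one existing per-device view of the media
error list is the block device's badblocks sysfs attribute, which reports
"first_bad_sector length" pairs in 512-byte sectors. The sketch below only
assumes a device name (pmem0) for illustration.

#include <stdio.h>

static void dump_pmem_badblocks(void)
{
        unsigned long long sector, count;
        FILE *f = fopen("/sys/block/pmem0/badblocks", "r");    /* name assumed */

        if (!f)
                return;
        /* each line: first bad sector, then length in 512-byte sectors */
        while (fscanf(f, "%llu %llu", &sector, &count) == 2)
                printf("media error: byte offset %llu, length %llu\n",
                       sector * 512ULL, count * 512ULL);
        fclose(f);
}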


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-12-20 Thread Darrick J. Wong
On Mon, Dec 19, 2016 at 05:18:40PM -0800, Dan Williams wrote:
> On Mon, Dec 19, 2016 at 5:09 PM, Darrick J. Wong
>  wrote:
> > On Mon, Dec 19, 2016 at 02:11:49PM -0700, Ross Zwisler wrote:
> >> On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote:
> >> <>
> >> > Definitely the first step would be your simple preallocated per
> >> > inode approach until it is shown to be insufficient.
> >>
> >> Reviving this thread a few months later...
> >>
> >> Dave, we're interested in taking a serious look at what it would take to 
> >> get
> >> PMEM_IMMUTABLE working.  Do you still hold the opinion that this is (or 
> >> could
> >> become, with some amount of work) a workable solution?
> >>
> >> We're happy to do the grunt work for this feature, but we will probably 
> >> need
> >> guidance from someone with more XFS experience.  With you out on extended 
> >> leave
> >> the first half of 2017, who would be the best person to ask for this 
> >> guidance?
> >> Darrick?
> >
> > Yes, probably. :)
> >
> > I think where we left off with this (on the XFS side) is some sort of
> > fallocate mode that would allocate blocks, zero them, and then set the
> > DAX and PMEM_IMMUTABLE on-disk inode flags.  After that, you'd mmap the
> > file and thereby gain the ability to control write persistence behavior
> > without having to worry about fs metadata updates.  As an added plus, I
> > think zeroing the pmem also clears media errors, or something like that.
> >
> >  Is that a reasonable starting point?  My memory is a little foggy.
> >
> > Hmm, I see Dan just posted something about blockdev fallocate.  I'll go
> > read that.
> 
> That's for device-dax, which is basically a poor man's PMEM_IMMUTABLE
> via a character device interface. It's useful for cases where you want
> an entire nvdimm namespace/volume in "no fs-metadata to worry about"
> mode.  But, for sub-allocations of a namespace and support for
> existing tooling, PMEM_IMMUTABLE is much more usable.

Well sure... but otoh I was thinking that it'd be pretty neat if we
could use the same code regardless of whether the target file was a
dax-device or an xfs file:

fd = open("", O_RDWR);
fstat(fd, ):
fallocate(fd, FALLOC_FL_PMEM_IMMUTABLE, 0, statbuf.st_size);
p = mmap(NULL, statbuf.st_size, PROT_READ | PROT_WRITE, fd, 0);

*(p + 42) = 0xDEADBEEF;
asm { clflush; } /* or whatever */

...so perhaps it would be a good idea to design the fallocate primitive
around "prepare this fd for mmap-only pmem semantics" and let it the
backend do zeroing and inode flag changes as necessary to make it
happen.  We'd need to do some bikeshedding about what the other falloc
flags mean when we're dealing with pmem files and devices, but I think
we should try to keep the userland presentation the same unless there's
a really good reason not to.
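
A hedged, self-contained version of the sketch above, with the gaps filled in
by assumption: FALLOC_FL_PMEM_IMMUTABLE is the proposed flag (it does not exist
in any shipping kernel, and the value below is invented), and CLFLUSH/SFENCE is
just one illustrative way to make the store durable.

#define _GNU_SOURCE
#include <emmintrin.h>          /* _mm_clflush(), _mm_sfence() */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef FALLOC_FL_PMEM_IMMUTABLE
#define FALLOC_FL_PMEM_IMMUTABLE 0x100  /* proposed flag, value invented here */
#endif

static int pmem_store_word(const char *path, size_t idx, uint32_t val)
{
        struct stat st;
        uint32_t *p;
        int fd = open(path, O_RDWR);

        if (fd < 0)
                return -1;
        if (fstat(fd, &st) < 0)
                goto fail;
        /* allocate + zero + switch the file to mmap-only pmem semantics */
        if (fallocate(fd, FALLOC_FL_PMEM_IMMUTABLE, 0, st.st_size) < 0)
                goto fail;
        p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                goto fail;

        p[idx] = val;                   /* store straight to the mapping */
        _mm_clflush(&p[idx]);           /* write the cache line back */
        _mm_sfence();                   /* order the flush */

        munmap(p, st.st_size);
        close(fd);
        return 0;
fail:
        close(fd);
        return -1;
}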

--D


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-12-19 Thread Dan Williams
On Mon, Dec 19, 2016 at 5:09 PM, Darrick J. Wong
 wrote:
> On Mon, Dec 19, 2016 at 02:11:49PM -0700, Ross Zwisler wrote:
>> On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote:
>> <>
>> > Definitely the first step would be your simple preallocated per
>> > inode approach until it is shown to be insufficient.
>>
>> Reviving this thread a few months later...
>>
>> Dave, we're interested in taking a serious look at what it would take to get
>> PMEM_IMMUTABLE working.  Do you still hold the opinion that this is (or could
>> become, with some amount of work) a workable solution?
>>
>> We're happy to do the grunt work for this feature, but we will probably need
>> guidance from someone with more XFS experience.  With you out on extended 
>> leave
>> the first half of 2017, who would be the best person to ask for this 
>> guidance?
>> Darrick?
>
> Yes, probably. :)
>
> I think where we left off with this (on the XFS side) is some sort of
> fallocate mode that would allocate blocks, zero them, and then set the
> DAX and PMEM_IMMUTABLE on-disk inode flags.  After that, you'd mmap the
> file and thereby gain the ability to control write persistence behavior
> without having to worry about fs metadata updates.  As an added plus, I
> think zeroing the pmem also clears media errors, or something like that.
>
>  Is that a reasonable starting point?  My memory is a little foggy.
>
> Hmm, I see Dan just posted something about blockdev fallocate.  I'll go
> read that.

That's for device-dax, which is basically a poor man's PMEM_IMMUTABLE
via a character device interface. It's useful for cases where you want
an entire nvdimm namespace/volume in "no fs-metadata to worry about"
mode.  But, for sub-allocations of a namespace and support for
existing tooling, PMEM_IMMUTABLE is much more usable.


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-12-19 Thread Darrick J. Wong
On Mon, Dec 19, 2016 at 02:11:49PM -0700, Ross Zwisler wrote:
> On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote:
> <>
> > Definitely the first step would be your simple preallocated per
> > inode approach until it is shown to be insufficient.
> 
> Reviving this thread a few months later...
> 
> Dave, we're interested in taking a serious look at what it would take to get
> PMEM_IMMUTABLE working.  Do you still hold the opinion that this is (or could
> become, with some amount of work) a workable solution?
> 
> We're happy to do the grunt work for this feature, but we will probably need
> guidance from someone with more XFS experience.  With you out on extended 
> leave
> the first half of 2017, who would be the best person to ask for this guidance?
> Darrick?

Yes, probably. :)

I think where we left off with this (on the XFS side) is some sort of
fallocate mode that would allocate blocks, zero them, and then set the
DAX and PMEM_IMMUTABLE on-disk inode flags.  After that, you'd mmap the
file and thereby gain the ability to control write persistence behavior
without having to worry about fs metadata updates.  As an added plus, I
think zeroing the pmem also clears media errors, or something like that.
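
As an aside on detection, a hedged sketch of what an application can already
ask today: the FS_IOC_FSGETXATTR ioctl reports the on-disk DAX inode flag
(FS_XFLAG_DAX). A PMEM_IMMUTABLE xflag would be a new bit; none is defined yet.

#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FSGETXATTR, FS_XFLAG_DAX */
#include <sys/ioctl.h>

static int inode_is_dax(int fd)
{
        struct fsxattr fsx;

        if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
                return -1;      /* filesystem does not support the ioctl */
        /* FS_XFLAG_DAX exists today; a PMEM_IMMUTABLE bit would be new */
        return !!(fsx.fsx_xflags & FS_XFLAG_DAX);
}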

 Is that a reasonable starting point?  My memory is a little foggy.

Hmm, I see Dan just posted something about blockdev fallocate.  I'll go
read that.

--D

> 
> Thanks,
> - Ross


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-12-19 Thread Ross Zwisler
On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote:
<>
> Definitely the first step would be your simple preallocated per
> inode approach until it is shown to be insufficient.

Reviving this thread a few months later...

Dave, we're interested in taking a serious look at what it would take to get
PMEM_IMMUTABLE working.  Do you still hold the opinion that this is (or could
become, with some amount of work) a workable solution?

We're happy to do the grunt work for this feature, but we will probably need
guidance from someone with more XFS experience.  With you out on extended leave
the first half of 2017, who would be the best person to ask for this guidance?
Darrick?

Thanks,
- Ross


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-15 Thread Nicholas Piggin
On Fri, 16 Sep 2016 08:33:50 +1000
Dave Chinner  wrote:

> On Thu, Sep 15, 2016 at 09:42:22PM +1000, Nicholas Piggin wrote:
> > On Thu, 15 Sep 2016 20:32:10 +1000
> > Dave Chinner  wrote:  
> > > 
> > > You still haven't described anything about what a per-block flag
> > > design is supposed to look like :/  
> > 
> > For the API, or implementation? I'm not quite sure what you mean
> > here. For implementation it's possible to carefully ensure metadata
> > is persistent when allocating blocks in page fault but before
> > mapping pages. Truncate or hole punch or such things can be made to
> > work by invalidating all such mappings and holding them off until
> > you can cope with them again. Not necessarily for a filesystem with
> > *all* capabilities of XFS -- I don't know -- but for a complete basic
> > one.  
> 
> So, essentially, it comes down to synchronous metadata updates again.

Yes. I guess fundamentally you can't get away from either that or
preloading at some level.

(Also I don't know that there's a sane way to handle [cm]time properly,
so some things like that -- this is just about block allocation /
avoiding fdatasync).

> but synchronous updates would be conditional on whether an extent
> metadata with the "nofsync" flag asserted was updated? Where's the
> nofsync flag kept? in memory at a generic layer, or in the
> filesystem, potentially in an on-disk structure? How would the
> application set it for a given range?

I guess that comes back to the API. Whether you want it to be persistent,
request based, etc. It could be derived from the type of storage blocks that
are mapped there, stored per-inode, in-memory, or on extents on disk. I'm not
advocating for a particular API, and of course less complexity is better.

> 
> > > > > > Filesystem will
> > > > > > invalidate all such mappings before it does buffered IOs or hole 
> > > > > > punch,
> > > > > > and will sync metadata after allocating a new block but before 
> > > > > > returning
> > > > > > from a fault.  
> > > > > 
> > > > > ... requires synchronous metadata updates from page fault context,
> > > > > which we already know is not a good solution.  I'll quote one of
> > > > > Christoph's previous replies to save me the trouble:
> > > > > 
> > > > >   "You could write all metadata synchronously from the page
> > > > >   fault handler, but that's basically asking for all kinds of
> > > > >   deadlocks."
> > > > > So, let's redirect back to the "no sync" flag you were talking about
> > > > > - can you answer the questions I asked above? It would be especially
> > > > > important to highlight how the proposed feature would avoid requiring
> > > > > synchronous metadata updates in page fault contexts
> > > > 
> > > > Right. So what deadlocks are you concerned about?
> > > 
> > > It basically puts the entire journal checkpoint path under a page
> > > fault context. i.e. a whole new global locking context problem is  
> > 
> > Yes there are potentially some new lock orderings created if you
> > do that, depending on what locks the filesystem does.  
> 
> Well, that's the whole issue.

For filesystem implementations, but perhaps not the mm/vfs implementation,
AFAIKS.

> 
> > > created as this path can now be run both inside and outside the
> > > mmap_sem. Nothing ever good comes from running filesystem locking
> > > code both inside and outside the mmap_sem.  
> > 
> > You mean that some cases journal checkpoint runs with mmap_sem
> > held, and others without mmap_sem held? Not that mmap_sem is taken
> > inside journal checkpoint.  
> 
> Maybe not, but now we open up the potential for locks held inside
> or outside mmap sem to interact with the journal locks that are now
> held inside and outside mmap_sem. See below
> 
> > Then I don't really see why that's a
> > problem. I mean performance could suffer a bit, but with fault
> > retry you can almost always do the syncing outside mmap_sem in
> > practice.
> > 
> > Yes, I'll preemptively agree with you -- We don't want to add any
> > such burden if it is not needed and well justified.
> >   
> > > FWIW, We've never executed synchronous transactions inside page
> > > faults in XFS, and I think ext4 is in the same boat - it may be even
> > > worse because of the way it does ordered data dispatch through the
> > > journal. I don't really even want to think about the level of hurt
> > > this might put btrfs or other COW/log structured filesystems under.
> > > I'm sure Christoph can reel off a bunch more issues off the top of
> > > his head  
> > 
> > I asked him, we'll see what he thinks. I don't believe there is
> > anything fundamental about mm or fs core layers that cause deadlocks
> > though.  
> 
> Spent 5 minutes looking at ext4 for an example: filesystems are
> allowed to take page locks during transaction commit. e.g ext4
> journal commit when using the default ordered data mode:
> 
> jbd2_journal_commit_transaction
>   

Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-15 Thread Dave Chinner
On Thu, Sep 15, 2016 at 09:42:22PM +1000, Nicholas Piggin wrote:
> On Thu, 15 Sep 2016 20:32:10 +1000
> Dave Chinner  wrote:
> > 
> > You still haven't described anything about what a per-block flag
> > design is supposed to look like :/
> 
> For the API, or implementation? I'm not quite sure what you mean
> here. For implementation it's possible to carefully ensure metadata
> is persistent when allocating blocks in page fault but before
> mapping pages. Truncate or hole punch or such things can be made to
> work by invalidating all such mappings and holding them off until
> you can cope with them again. Not necessarily for a filesystem with
> *all* capabilities of XFS -- I don't know -- but for a complete basic
> one.

So, essentially, it comes down to synchronous metadata updates again,
but synchronous updates would be conditional on whether an extent
metadata with the "nofsync" flag asserted was updated? Where's the
nofsync flag kept? in memory at a generic layer, or in the
filesystem, potentially in an on-disk structure? How would the
application set it for a given range?

> > > > > Filesystem will
> > > > > invalidate all such mappings before it does buffered IOs or hole 
> > > > > punch,
> > > > > and will sync metadata after allocating a new block but before 
> > > > > returning
> > > > > from a fault.
> > > > 
> > > > ... requires synchronous metadata updates from page fault context,
> > > > which we already know is not a good solution.  I'll quote one of
> > > > Christoph's previous replies to save me the trouble:
> > > > 
> > > > "You could write all metadata synchronously from the page
> > > > fault handler, but that's basically asking for all kinds of
> > > > deadlocks."
> > > > So, let's redirect back to the "no sync" flag you were talking about
> > > > - can you answer the questions I asked above? It would be especially
> > > > important to highlight how the proposed feature would avoid requiring
> > > > synchronous metadata updates in page fault contexts  
> > > 
> > > Right. So what deadlocks are you concerned about?  
> > 
> > It basically puts the entire journal checkpoint path under a page
> > fault context. i.e. a whole new global locking context problem is
> 
> Yes there are potentially some new lock orderings created if you
> do that, depending on what locks the filesystem does.

Well, that's the whole issue.

> > created as this path can now be run both inside and outside the
> > mmap_sem. Nothing ever good comes from running filesystem locking
> > code both inside and outside the mmap_sem.
> 
> You mean that some cases journal checkpoint runs with mmap_sem
> held, and others without mmap_sem held? Not that mmap_sem is taken
> inside journal checkpoint.

Maybe not, but now we open up the potential for locks held inside
or outside mmap sem to interact with the journal locks that are now
held inside and outside mmap_sem. See below

> Then I don't really see why that's a
> problem. I mean performance could suffer a bit, but with fault
> retry you can almost always do the syncing outside mmap_sem in
> practice.
> 
> Yes, I'll preemptively agree with you -- We don't want to add any
> such burden if it is not needed and well justified.
> 
> > FWIW, We've never executed synchronous transactions inside page
> > faults in XFS, and I think ext4 is in the same boat - it may be even
> > worse because of the way it does ordered data dispatch through the
> > journal. I don't really even want to think about the level of hurt
> > this might put btrfs or other COW/log structured filesystems under.
> > I'm sure Christoph can reel off a bunch more issues off the top of
> > his head
> 
> I asked him, we'll see what he thinks. I don't believe there is
> anything fundamental about mm or fs core layers that cause deadlocks
> though.

Spent 5 minutes looking at ext4 for an example: filesystems are
allowed to take page locks during transaction commit. e.g ext4
journal commit when using the default ordered data mode:

jbd2_journal_commit_transaction
  journal_submit_data_buffers()
    journal_submit_inode_data_buffers
      generic_writepages()
        ext4_writepages()
          mpage_prepare_extent_to_map()
            lock_page()

i.e. if we fault on the user buffer during a write() operation and
that user buffer is a mmapped DAX file that needs to be allocated
and we have synchronous metadata updates during page faults, we
deadlock on the page lock held above the page fault context...
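
For illustration only, a hedged userspace sketch of the shape of that scenario
(file layout and descriptors are assumptions); it shows how the two paths meet,
not the deadlock itself:

#include <sys/mman.h>
#include <unistd.h>

/* dax_fd: file on a DAX filesystem with an unallocated block at offset 0 */
static void write_from_dax_mapping(int dax_fd, int log_fd, size_t len)
{
        char *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, dax_fd, 0);
        ssize_t n;

        if (src == MAP_FAILED)
                return;
        /*
         * The copy-in below faults on 'src'; if that fault did synchronous
         * metadata updates, it could meet the journal commit path (which
         * may hold page locks, as in the ext4 chain above) the wrong way.
         */
        n = write(log_fd, src, len);
        (void)n;
        munmap(src, len);
}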

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-15 Thread Nicholas Piggin
On Thu, 15 Sep 2016 20:32:10 +1000
Dave Chinner  wrote:

> On Thu, Sep 15, 2016 at 01:49:45PM +1000, Nicholas Piggin wrote:
> > On Thu, 15 Sep 2016 12:31:33 +1000
> > Dave Chinner  wrote:
> >   
> > > On Wed, Sep 14, 2016 at 08:19:36PM +1000, Nicholas Piggin wrote:  
> > > > On Wed, 14 Sep 2016 17:39:02 +1000  
> > > Sure, but one first has to describe the feature desired before all  
> > 
> > The DAX people have been.  
> 
> Hmmm. the only "DAX people" I know of are kernel developers who
> have been working on implementing DAX in the kernel - Willy, Ross,
> Dan, Jan, Christoph, Kirill, myself and a few others around the
> fringes.
> 
> > They want to be able to get mappings
> > that can be synced without doing fsync.  
> 
> Oh, ok, the Intel Userspace PMEM library requirement. I thought you
> had something more than this - whatever extra problem the per-block
> no fsync flag would solve?

Only the PMEM really. I don't want to add more complexity than
required.

> > The *exact* extent of
> > those capabilities and what the API exactly looks like is up for
> > discussion.  
> 
> Yup.
> 
> > Well you said it was impossible already and Christoph told them
> > they were smoking crack :)  
> 
> I have not said that. I have said bad things about bad
> proposals and called the PMEM library model broken, but I most
> definitely have not said that solving the problem is impossible.
>
> > > That's not an answer to the questions I asked about the "no
> > > sync" flag you were proposing. You've redirected to a different
> > > solution, one that   
> > 
> > No sync flag would do the same thing exactly in terms of consistency.
> > It would just do the no-sync sequence by default rather than being
> > asked for it. More of an API detail than implementation.  
> 
> You still haven't described anything about what a per-block flag
> design is supposed to look like :/

For the API, or implementation? I'm not quite sure what you mean
here. For implementation it's possible to carefully ensure metadata
is persistent when allocating blocks in page fault but before
mapping pages. Truncate or hole punch or such things can be made to
work by invalidating all such mappings and holding them off until
you can cope with them again. Not necessarily for a filesystem with
*all* capabilities of XFS -- I don't know -- but for a complete basic
one.

> > > > Filesystem will
> > > > invalidate all such mappings before it does buffered IOs or hole punch,
> > > > and will sync metadata after allocating a new block but before returning
> > > > from a fault.
> > > 
> > > ... requires synchronous metadata updates from page fault context,
> > > which we already know is not a good solution.  I'll quote one of
> > > Christoph's previous replies to save me the trouble:
> > > 
> > >   "You could write all metadata synchronously from the page
> > >   fault handler, but that's basically asking for all kinds of
> > >   deadlocks."
> > > So, let's redirect back to the "no sync" flag you were talking about
> > > - can you answer the questions I asked above? It would be especially
> > > important to highlight how the proposed feature would avoid requiring
> > > synchronous metadata updates in page fault contexts  
> > 
> > Right. So what deadlocks are you concerned about?  
> 
> It basically puts the entire journal checkpoint path under a page
> fault context. i.e. a whole new global locking context problem is

Yes there are potentially some new lock orderings created if you
do that, depending on what locks the filesystem does.

> created as this path can now be run both inside and outside the
> mmap_sem. Nothing ever good comes from running filesystem locking
> code both inside and outside the mmap_sem.

You mean that some cases journal checkpoint runs with mmap_sem
held, and others without mmap_sem held? Not that mmap_sem is taken
inside journal checkpoint. Then I don't really see why that's a
problem. I mean performance could suffer a bit, but with fault
retry you can almost always do the syncing outside mmap_sem in
practice.

Yes, I'll preemptively agree with you -- We don't want to add any
such burden if it is not needed and well justified.

> FWIW, We've never executed synchronous transactions inside page
> faults in XFS, and I think ext4 is in the same boat - it may be even
> worse because of the way it does ordered data dispatch through the
> journal. I don't really even want to think about the level of hurt
> this might put btrfs or other COW/log structured filesystems under.
> I'm sure Christoph can reel off a bunch more issues off the top of
> his head

I asked him, we'll see what he thinks. I don't believe there is
anything fundamental about mm or fs core layers that cause deadlocks
though.

> 
> > There could be a scale of capabilities here, for different filesystems
> > that do things differently.   
> 
> Why do we need such complexity to be defined?
> 
> I'm tending towards 

Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-15 Thread Dave Chinner
On Thu, Sep 15, 2016 at 01:49:45PM +1000, Nicholas Piggin wrote:
> On Thu, 15 Sep 2016 12:31:33 +1000
> Dave Chinner  wrote:
> 
> > On Wed, Sep 14, 2016 at 08:19:36PM +1000, Nicholas Piggin wrote:
> > > On Wed, 14 Sep 2016 17:39:02 +1000
> > Sure, but one first has to describe the feature desired before all
> 
> The DAX people have been.

Hmmm. the only "DAX people" I know of are kernel developers who
have been working on implementing DAX in the kernel - Willy, Ross,
Dan, Jan, Christoph, Kirill, myself and a few others around the
fringes.

> They want to be able to get mappings
> that can be synced without doing fsync.

Oh, ok, the Intel Userspace PMEM library requirement. I thought you
had something more than this - whatever extra problem the per-block
no fsync flag would solve?

> The *exact* extent of
> those capabilities and what the API exactly looks like is up for
> discussion.

Yup.

> Well you said it was impossible already and Christoph told them
> they were smoking crack :)

I have not said that. I have said bad things about bad
proposals and called the PMEM library model broken, but I most
definitely have not said that solving the problem is impossible.

> > That's not an answer to the questions I asked about the "no
> > sync" flag you were proposing. You've redirected to a different
> > solution, one that 
> 
> No sync flag would do the same thing exactly in terms of consistency.
> It would just do the no-sync sequence by default rather than being
> asked for it. More of an API detail than implementation.

You still haven't described anything about what a per-block flag
design is supposed to look like :/

> > > Filesystem will
> > > invalidate all such mappings before it does buffered IOs or hole punch,
> > > and will sync metadata after allocating a new block but before returning
> > > from a fault.  
> > 
> > ... requires synchronous metadata updates from page fault context,
> > which we already know is not a good solution.  I'll quote one of
> > Christoph's previous replies to save me the trouble:
> > 
> > "You could write all metadata synchronously from the page
> > fault handler, but that's basically asking for all kinds of
> > deadlocks."
> > So, let's redirect back to the "no sync" flag you were talking about
> > - can you answer the questions I asked above? It would be especially
> > important to highlight how the proposed feature would avoid requiring
> > synchronous metadata updates in page fault contexts
> 
> Right. So what deadlocks are you concerned about?

It basically puts the entire journal checkpoint path under a page
fault context. i.e. a whole new global locking context problem is
created as this path can now be run both inside and outside the
mmap_sem. Nothing ever good comes from running filesystem locking
code both inside and outside the mmap_sem.

FWIW, We've never executed synchronous transactions inside page
faults in XFS, and I think ext4 is in the same boat - it may be even
worse because of the way it does ordered data dispatch through the
journal. I don't really even want to think about the level of hurt
this might put btrfs or other COW/log structured filesystems under.
I'm sure Christoph can reel off a bunch more issues off the top of
his head

> There could be a scale of capabilities here, for different filesystems
> that do things differently. 

Why do we need such complexity to be defined?

I'm tending towards just adding a new fallocate() operation that sets
up a fully allocated and zeroed file of fixed length that has
immutable metadata once created. Single syscall, with well defined
semantics, and it doesn't dictate the implementation any filesystem
must use. All it dictates is that the data region can be written
safely on dax-enabled storage without needing fsync() to be issued.

i.e. the implementation can be filesystem specific, and it is simple
to implement the basic functionality and constraints in both XFS and
ext4 right away, and as other filesystems come along they can
implement it in the way that best suits them. e.g. btrfs would need
to add the "no COW" flag to the file as well.

If someone wants to implement a per-block no-fsync flag, and do
synchronous metadata updates in the page fault path, then they are
welcome to do so. But we don't /need/ such complexity to implement
the functionality that pmem programming model requires.

> Some filesystems could require fsync for metadata, but allow fdatasync
> to be skipped. Users would need to have some knowledge of block size
> or do preallocation and sync.

Not sure what you mean here -  avoiding the need for using fsync()
by using fsync() seems a little circular to me.  :/

> That might put more burden on libraries/applications if there are
> concurrent operations, but that might be something they can deal with
> -- fdatasync already requires some knowledge of concurrent operations
> (or lack thereof).

Additional userspace complexity is something 

Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-15 Thread Dave Chinner
On Wed, Sep 14, 2016 at 10:55:03PM -0700, Darrick J. Wong wrote:
> On Mon, Sep 12, 2016 at 11:40:35AM +1000, Dave Chinner wrote:
> > On Thu, Sep 08, 2016 at 04:56:36PM -0600, Ross Zwisler wrote:
> > > On Wed, Sep 07, 2016 at 09:32:36PM -0700, Dan Williams wrote:
> > > > My understanding is that it is looking for the VM_MIXEDMAP flag which
> > > > is already ambiguous for determining if DAX is enabled even if this
> > > > dynamic listing issue is fixed.  XFS has arranged for DAX to be a
> > > > per-inode capability and has an XFS-specific inode flag.  We can make
> > > > that a common inode flag, but it seems we should have a way to
> > > > interrogate the mapping itself in the case where the inode is unknown
> > > > or unavailable.  I'm thinking extensions to mincore to have flags for
> > > > DAX and possibly whether the page is part of a pte, pmd, or pud
> > > > mapping.  Just floating that idea before starting to look into the
> > > > implementation, comments or other ideas welcome...
> > > 
> > > I think this goes back to our previous discussion about support for the 
> > > PMEM
> > > programming model.  Really I think what NVML needs isn't a way to tell if 
> > > it
> > > is getting a DAX mapping, but whether it is getting a DAX mapping on a
> > > filesystem that fully supports the PMEM programming model.  This of 
> > > course is
> > > defined to be a filesystem where it can do all of its flushes from 
> > > userspace
> > > safely and never call fsync/msync, and that allocations that happen in 
> > > page
> > > faults will be synchronized to media before the page fault completes.
> > > 
> > > IIUC this is what NVML needs - a way to decide "do I use fsync/msync for
> > > everything or can I rely fully on flushes from userspace?" 
> > 
> > "need fsync/msync" is a dynamic state of an inode, not a static
> > property. i.e. users can do things that change an inode behind the
> > back of a mapping, even if they are not aware that this might
> > happen. As such, a filesystem can invalidate an existing mapping
> > at any time and userspace won't notice because it will simply fault
> > in a new mapping on the next access...
> > 
> > > For all existing implementations, I think the answer is "you need to use
> > > fsync/msync" because we don't yet have proper support for the PMEM 
> > > programming
> > > model.
> > 
> > Yes, that is correct.
> > 
> > FWIW, I don't think it will ever be possible to support this 
> > wonderful "PMEM programming model" from any current or future kernel
> > filesystem without a very specific set of restrictions on what can
> > be done to a file.  e.g.
> > 
> > 1. the file has to be fully allocated and zeroed before
> >    use. Preallocation/zeroing via unwritten extents is not
> >    allowed. Sparse files are not allowed. Shared extents are
> >    not allowed.
> > 2. set the "PMEM_IMMUTABLE" inode flag - filesystem must
> >    check the file is fully allocated before allowing it to
> >    be set, and caller must have CAP_LINUX_IMMUTABLE.
> > 3. Inode metadata is now immutable, and file data can only
> >    be accessed and/or modified via mmap().
> > 4. All non-mmap methods of inode data modification
> >    will now fail with EPERM.
> > 5. all methods of inode metadata modification will now fail
> >    with EPERM, timestamp updates will be ignored.
> > 6. PMEM_IMMUTABLE flag can only be removed if the file is
> >    not currently mapped and caller has CAP_LINUX_IMMUTABLE.
> >
> > A flag like this /should/ make it possible to avoid fsync/msync() on
> > a file for existing filesystems, but it also means that such files
> > have significant management issues (hence the need for
> > CAP_LINUX_IMMUTABLE to cover its use).
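
A hedged sketch of the application-side setup implied by constraints 1 and 2
above: zero the file with real writes (so nothing is left as unwritten
extents), sync once, then set the flag. Everything here is an existing
interface except the FS_XFLAG_PMEM_IMMUTABLE name, which is invented purely
for illustration.

#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR */
#include <sys/ioctl.h>
#include <unistd.h>

#define FS_XFLAG_PMEM_IMMUTABLE 0x00020000      /* hypothetical new xflag bit */

static int make_pmem_immutable(int fd, off_t size)
{
        static const char zeros[1 << 20];       /* 1MiB of zeroes */
        struct fsxattr fsx;
        off_t done;

        /* 1. fully allocate by writing real zeroes, not unwritten extents */
        for (done = 0; done < size; done += sizeof(zeros)) {
                size_t n = (size_t)(size - done) < sizeof(zeros) ?
                           (size_t)(size - done) : sizeof(zeros);
                if (pwrite(fd, zeros, n, done) != (ssize_t)n)
                        return -1;
        }
        if (fsync(fd) < 0)
                return -1;

        /* 2. ask for PMEM_IMMUTABLE (needs CAP_LINUX_IMMUTABLE per above) */
        if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
                return -1;
        fsx.fsx_xflags |= FS_XFLAG_PMEM_IMMUTABLE;
        return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}
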
> 
> Hmmm... I started to ponder such a flag, but ran into some questions.
> If it's PMEM_IMMUTABLE, does this mean that none of 1-6 apply if the
> filesystem discovers it isn't on pmem?

Would only be meaningful if the FS_XFLAG_DAX/S_DAX flag is also set
on the inode and the backing store is dax capable. Hence the 'PMEM'
part of the name.

> I thought about just having a 'immutable metadata' flag where any
> timestamp, xattr, or block mapping update just returns EPERM.

And all the rest - no hard links, no perm/owner changes, no security
context changes(!), and so on. And it's even more complex with
filesystems that have COW metadata and pack multiple unrelated
metadata objects into single blocks - they can do all sorts of
interesting things on unrelated metadata updates... :P

You'd also have to turn off background internal filesystem mod
vectors, too, like EOF scanning, or defrag, balance, dedupe,
auto-repair, etc.  And, now that I think about it, snapshots are out
of the question too.

This gets more hairy the more I think about what our filesystems can
do these days

> There
> wouldn't be any checks as in (1); if you left a hole in the file prior
> to setting the flag then you won't be filling 

Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-14 Thread Nicholas Piggin
On Thu, 15 Sep 2016 12:31:33 +1000
Dave Chinner  wrote:

> On Wed, Sep 14, 2016 at 08:19:36PM +1000, Nicholas Piggin wrote:
> > On Wed, 14 Sep 2016 17:39:02 +1000
> > Dave Chinner  wrote:  
> > > Ok, looking back over your example, you seem to be suggesting a new
> > > page fault behaviour is required from filesystems that has not been
> > > described or explained, and that behaviour is triggered
> > > (persistently) somehow from userspace. You've also suggested
> > > filesystems store a persistent per-block "no fsync" flag
> > > in their extent map as part of the implementation. Right?  
> > 
> > This is what we're talking about. Of course a filesystem can't just
> > start supporting the feature without any changes.  
> 
> Sure, but one first has to describe the feature desired before all

The DAX people have been. They want to be able to get mappings
that can be synced without doing fsync. The *exact* extent of
those capabilities and what the API exactly looks like is up for
discussion.


> parties can discuss it. We need more than vague references and
> allusions from you to define the solution you are proposing.
> 
> Once everyone understands what is being described, we might be able
> to work out how it can be implemented in a simple, generic manner
> rather than require every filesystem to change their on-disk
> formats. IOWs, we need you to describe /details/ of semantics,
> behaviour and data integrity constraints that are required, not
> describe an implementation of something we have no knowledge about.

Well you said it was impossible already and Christoph told them
they were smoking crack :)

Anyway, there's a few questions. Implementation and API. Some
filesystems may never cope with it. Of course it should be as
generic as possible though.


> > > Reading between the lines, I'm guessing that the "no fsync" flag has
> > > very specific update semantics, constraints and requirements.  Can
> > > you outline how you expect this flag to be set and updated, how it's
> > > used consistently between different applications (e.g. cp of a file
> > > vs the app using the file), behavioural constraints it implies for
> > > page faults vs non-mmap access to the data in the block, how
> > > you'd expect filesystems to deal with things like a hole punch
> > > landing in the middle of an extent marked with "no fsync", etc?  
> > 
> > Well that's what's being discussed.  An approach close to what I did is
> > to allow the app to request a "no sync" type of mmap.
> 
> That's not an answer to the questions I asked about the "no
> sync" flag you were proposing. You've redirected to a different
> solution, one that 

No sync flag would do the same thing exactly in terms of consistency.
It would just do the no-sync sequence by default rather than being
asked for it. More of an API detail than implementation.

> 
> > Filesystem will
> > invalidate all such mappings before it does buffered IOs or hole punch,
> > and will sync metadata after allocating a new block but before returning
> > from a fault.  
> 
> ... requires synchronous metadata updates from page fault context,
> which we already know is not a good solution.  I'll quote one of
> Christoph's previous replies to save me the trouble:
> 
>   "You could write all metadata synchronously from the page
>   fault handler, but that's basically asking for all kinds of
>   deadlocks."
> So, let's redirect back to the "no sync" flag you were talking about
> - can you answer the questions I asked above? It would be especially
> important to highlight how the proposed feature would avoid requiring
> synchronous metadata updates in page fault contexts

Right. So what deadlocks are you concerned about?

There could be a scale of capabilities here, for different filesystems
that do things differently. 

Some filesystems could require fsync for metadata, but allow fdatasync
to be skipped. Users would need to have some knowledge of block size
or do preallocation and sync.
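
A hedged sketch of that "preallocation and sync" setup step. Note this is only
the rough shape of the idea: on current kernels it does not remove the need
for fsync/msync, and XFS/ext4 mode-0 preallocation still produces unwritten
extents that get converted on first write.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static int prealloc_and_sync(int fd, off_t size)
{
        /* mode 0: allocate blocks for [0, size) and update i_size */
        if (fallocate(fd, 0, 0, size) < 0)
                return -1;
        /* one up-front sync of the new block mapping */
        return fsync(fd);
}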

That might put more burden on libraries/applications if there are
concurrent operations, but that might be something they can deal with
-- fdatasync already requires some knowledge of concurrent operations
(or lack thereof).


> > > [snip]
> > >   
> > > > If there is any huge complexity or unsolved problem, it is in XFS.
> > > > Conceptual problem is simple.
> > > 
> > > Play nice and be constructive, please?  
> > 
> > So you agree that the persistent memory people who have come with some
> > requirements and ideas for an API should not be immediately shut down
> > with bogus handwaving.  
> 
> Pull your head in, Nick.
> 
> You've been absent from the community for the last 5 years. You
> suddenly barge in with a massive chip on your shoulder and try to

I'm trying to give some constructive input to the nvdimm guys.

You and Christoph know a huge amount about vfs and filesystems.
But sometimes you shut people down prematurely. 

Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-14 Thread Dave Chinner
On Wed, Sep 14, 2016 at 08:19:36PM +1000, Nicholas Piggin wrote:
> On Wed, 14 Sep 2016 17:39:02 +1000
> Dave Chinner  wrote:
> > Ok, looking back over your example, you seem to be suggesting a new
> > page fault behaviour is required from filesystems that has not been
> > described or explained, and that behaviour is triggered
> > (persistently) somehow from userspace. You've also suggested
> > filesystems store a persistent per-block "no fsync" flag
> > in their extent map as part of the implementation. Right?
> 
> This is what we're talking about. Of course a filesystem can't just
> start supporting the feature without any changes.

Sure, but one first has to describe the feature desired before all
parties can discuss it. We need more than vague references and
allusions from you to define the solution you are proposing.

Once everyone understands what is being described, we might be able
to work out how it can be implemented in a simple, generic manner
rather than require every filesystem to change their on-disk
formats. IOWs, we need you to describe /details/ of semantics,
behaviour and data integrity constraints that are required, not
describe an implementation of something we have no knowledge about.

> > Reading between the lines, I'm guessing that the "no fsync" flag has
> > very specific update semantics, constraints and requirements.  Can
> > you outline how you expect this flag to be set and updated, how it's
> > used consistently between different applications (e.g. cp of a file
> > vs the app using the file), behavioural constraints it implies for
> > page faults vs non-mmap access to the data in the block, how
> > you'd expect filesystems to deal with things like a hole punch
> > landing in the middle of an extent marked with "no fsync", etc?
> 
> Well that's what's being discussed.  An approach close to what I did is
> to allow the app to request a "no sync" type of mmap.

That's not an answer to the questions I asked about the "no
sync" flag you were proposing. You've redirected to a different
solution, one that 

> Filesystem will
> invalidate all such mappings before it does buffered IOs or hole punch,
> and will sync metadata after allocating a new block but before returning
> from a fault.

... requires synchronous metadata updates from page fault context,
which we already know is not a good solution.  I'll quote one of
Christoph's previous replies to save me the trouble:

"You could write all metadata synchronously from the page
fault handler, but that's basically asking for all kinds of
deadlocks."

So, let's redirect back to the "no sync" flag you were talking about
- can you answer the questions I asked above? It would be especially
important to highlight how the proposed feature would avoid requiring
synchronous metadata updates in page fault contexts

> > [snip]
> > 
> > > If there is any huge complexity or unsolved problem, it is in XFS.
> > > Conceptual problem is simple.  
> > 
> > Play nice and be constructive, please?
> 
> So you agree that the persistent memory people who have come with some
> requirements and ideas for an API should not be immediately shut down
> with bogus handwaving.

Pull your head in, Nick.

You've been absent from the community for the last 5 years. You
suddenly barge in with a massive chip on your shoulder and try to
throw your weight around. You're being arrogant, obnoxious, evasive
and petty. You're belittling anyone who dares to question your
proclamations. You're not listening to the replies you are getting.
You're baiting people to try to get an adverse reaction from them
and when someone gives you the adverse reaction you were fishing
for, you play the victim card.

That's textbook bullying behaviour.

Nick, this behaviour does not help progress the discussion in any
way. It only serves to annoy the other people who are sincerely
trying to understand and determine if/how we can solve the problem
in some way.

So, again, play nice and be constructive, please?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-14 Thread Nicholas Piggin
On Wed, 14 Sep 2016 17:39:02 +1000
Dave Chinner  wrote:

> On Tue, Sep 13, 2016 at 11:53:11AM +1000, Nicholas Piggin wrote:
> > On Tue, 13 Sep 2016 07:34:36 +1000
> > Dave Chinner  wrote:
> > But let me understand your example in the absence of that.
> > 
> > - Application mmaps a file, faults in block 0
> > - FS allocates block, creates mappings, syncs metadata, sets "no fsync"
> >   flag for that block, and completes the fault.
> > - Application writes some data to block 0, completes userspace flushes
> > 
> > * At this point, a crash must return with above data (or newer).
> > 
> > - Application starts writing more stuff into block 0
> > - Concurrently, fault in block 1
> > - FS starts to allocate, splits trees including mappings to block 0
> > 
> > * Crash
> > 
> > Is that right?  
> 
> No.
> 
> - app write faults block 0, fs allocates
> < time passes while app does stuff to block 0 mapping >
> - fs syncs journal, block 0 metadata now persistent
> < time passes while app does stuff to block 0 mapping >
> - app structure grows, faults block 1, fs allocates
> - app adds pointers to data in block 1 from block 0, does
>   userspace pmem data sync.
> 
> *crash*
> 
> > How does your filesystem lose data before the sync
> > point?  
> 
> After recovery, the file has data in block 0, but no block 1 because
> the allocation transaction for block 1 was not flushed to the
> journal. Data in block 0 points to data in block 1, but block 1 does
> not exist. IOWs, the application has corrupt data because it never
> issued a data synchronisation request to the filesystem
> 
> 
> 
> Ok, looking back over your example, you seem to be suggesting a new
> page fault behaviour is required from filesystems that has not been
> described or explained, and that behaviour is triggered
> (persistently) somehow from userspace. You've also suggested
> filesystems store a persistent per-block "no fsync" flag
> in their extent map as part of the implementation. Right?

This is what we're talking about. Of course a filesystem can't just
start supporting the feature without any changes.


> Reading between the lines, I'm guessing that the "no fsync" flag has
> very specific update semantics, constraints and requirements.  Can
> you outline how you expect this flag to be set and updated, how it's
> used consistently between different applications (e.g. cp of a file
> vs the app using the file), behavioural constraints it implies for
> page faults vs non-mmap access to the data in the block, how
> you'd expect filesystems to deal with things like a hole punch
> landing in the middle of an extent marked with "no fsync", etc?

Well that's what's being discussed. An approach close to what I did is
to allow the app to request a "no sync" type of mmap. Filesystem will
invalidate all such mappings before it does buffered IOs or hole punch,
and will sync metadata after allocating a new block but before returning
from a fault.
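
A hedged sketch of how such a request might look to the application;
MAP_PMEM_NOSYNC is an invented name and value, standing in for whatever flag
an interface like this would actually define:

#include <sys/mman.h>

#define MAP_PMEM_NOSYNC 0x800000        /* hypothetical flag, invented here */

static void *map_nosync(int fd, size_t len)
{
        /*
         * Under the semantics described above, the kernel would sync any
         * block-allocation metadata inside the fault, before it returns.
         */
        return mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_PMEM_NOSYNC, fd, 0);
}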

The app could query rather than request, but I found request seemed to
work better. The filesystem might be working with apps that don't use
the feature for example, and doesn't want to flush just in case anyone
ever queried in the past.


> [snip]
> 
> > If there is any huge complexity or unsolved problem, it is in XFS.
> > Conceptual problem is simple.  
> 
> Play nice and be constructive, please?

So you agree that the persistent memory people who have come forward with
some requirements and ideas for an API should not be immediately shut down
with bogus handwaving.

Thanks,
Nick


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-14 Thread Dave Chinner
On Tue, Sep 13, 2016 at 11:53:11AM +1000, Nicholas Piggin wrote:
> On Tue, 13 Sep 2016 07:34:36 +1000
> Dave Chinner  wrote:
> But let me understand your example in the absence of that.
> 
> - Application mmaps a file, faults in block 0
> - FS allocates block, creates mappings, syncs metadata, sets "no fsync"
>   flag for that block, and completes the fault.
> - Application writes some data to block 0, completes userspace flushes
> 
> * At this point, a crash must return with above data (or newer).
> 
> - Application starts writing more stuff into block 0
> - Concurrently, fault in block 1
> - FS starts to allocate, splits trees including mappings to block 0
> 
> * Crash
> 
> Is that right?

No.

- app write faults block 0, fs allocates
< time passes while app does stuff to block 0 mapping >
- fs syncs journal, block 0 metadata now persistent
< time passes while app does stuff to block 0 mapping >
- app structure grows, faults block 1, fs allocates
- app adds pointers to data in block 1 from block 0, does
  userspace pmem data sync.

*crash*

> How does your filesystem lose data before the sync
> point?

After recovery, the file has data in block 0, but no block 1, because
the allocation transaction for block 1 was not flushed to the
journal. Data in block 0 points to data in block 1, but block 1 does
not exist. IOWs, the application has corrupt data because it never
issued a data synchronisation request to the filesystem.
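
To illustrate that failure with a concrete (made-up) sketch: the code below
persists its data and its block-0-to-block-1 pointer purely with CPU cache
flushes and never asks the filesystem to commit the block 1 allocation.
The layout, BLOCK_SIZE and the helper names are assumptions for
illustration only:

    #include <immintrin.h>   /* _mm_clflush(), _mm_sfence() */
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 4096  /* assumed filesystem block size */

    /* Flush a range from the CPU caches; note: no fsync/msync anywhere. */
    static void flush_range(const void *addr, size_t len)
    {
        const char *p = addr;
        for (size_t off = 0; off < len; off += 64)
            _mm_clflush(p + off);
        _mm_sfence();
    }

    void grow_structure(char *pmem_base, const char *payload, size_t len)
    {
        char *block0 = pmem_base;               /* already allocated and journalled */
        char *block1 = pmem_base + BLOCK_SIZE;  /* this store faults; fs allocates */

        memcpy(block1, payload, len);
        flush_range(block1, len);

        /* Pointer from block 0 to the new data, flushed from userspace... */
        *(uint64_t *)block0 = BLOCK_SIZE;
        flush_range(block0, sizeof(uint64_t));

        /*
         * ...but the allocation of block 1 is filesystem metadata.  Without
         * a data synchronisation request it may never reach the journal, so
         * after a crash block 0's pointer can refer to a block that the
         * file does not have.
         */
    }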



Ok, looking back over your example, you seem to be suggesting a new
page fault behaviour is required from filesystems that has not been
described or explained, and that behaviour is triggered
(persistently) somehow from userspace. You've also suggested
filesystems store a persistent per-block "no fsync" flag
in their extent map as part of the implementation. Right?

Reading between the lines, I'm guessing that the "no fsync" flag has
very specific update semantics, constraints and requirements.  Can
you outline how you expect this flag to be set and updated, how it's
used consistently between different applications (e.g. cp of a file
vs the app using the file), behavioural constraints it implies for
page faults vs non-mmap access to the data in the block, how
you'd expect filesystems to deal with things like a hole punch
landing in the middle of an extent marked with "no fsync", etc?

[snip]

> If there is any huge complexity or unsolved problem, it is in XFS.
> Conceptual problem is simple.

Play nice and be constructive, please?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-12 Thread Christoph Hellwig
On Mon, Sep 12, 2016 at 06:05:07PM +1000, Nicholas Piggin wrote:
> It's not fundamentally broken, it just doesn't fit well with existing
> filesystems.

Or the existing file system architecture for that matter.  Which makes
it a fundamentally broken model.

> Dave's post of requirements is also wrong. A filesystem does not have
> to guarantee all that; it only has to guarantee that this is the case for
> a given block after it has a mapping and the page fault returns. Other
> operations can be supported by invalidating mappings, etc.

Which doesn't really matter if your use case is manipulating
fully mapped files.

But back to the point: if you want to use a full-blown Linux or Unix
filesystem you will always have to fsync (or variants of it like msync),
period.
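
For reference, the pattern that does work today looks roughly like the
sketch below (paths, offsets and error handling kept minimal; it assumes
the file already covers the mapped range):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int update_record(const char *path, off_t off, const void *buf, size_t len)
    {
        int fd = open(path, O_RDWR);
        if (fd < 0)
            return -1;

        char *p = mmap(NULL, off + len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }

        memcpy(p + off, buf, len);

        /*
         * Even on a DAX mapping, only msync()/fsync() makes the filesystem
         * metadata behind the range (allocations, unwritten extent
         * conversions, ...) stable.
         */
        int ret = msync(p, off + len, MS_SYNC);

        munmap(p, off + len);
        close(fd);
        return ret;
    }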

If you want a volume manager on steroids that hands out large chunks
of storage memory that can't ever be moved, truncated, shared, allocated
on demand, etc., then implement it in your library on top of a device file.


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-12 Thread Nicholas Piggin
On Mon, 12 Sep 2016 00:51:28 -0700
Christoph Hellwig  wrote:

> On Mon, Sep 12, 2016 at 05:25:15PM +1000, Oliver O'Halloran wrote:
> > What are the problems here? Is this a matter of existing filesystems
> > being unable/unwilling to support this or is it just fundamentally
> > broken?  
> 
> It's a fundamentally broken model.  See Dave's post that actually was
> sent slightly earlier than mine for the list of required items, which
> is fairly unrealistic.  You could probably try to architect a file
> system for it, but I doubt it would gain much traction.

It's not fundamentally broken, it just doesn't fit well with existing
filesystems.

Dave's post of requirements is also wrong. A filesystem does not have
to guarantee all that; it only has to guarantee that this is the case for
a given block after it has a mapping and the page fault returns. Other
operations can be supported by invalidating mappings, etc.



Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-12 Thread Christoph Hellwig
On Mon, Sep 12, 2016 at 05:25:15PM +1000, Oliver O'Halloran wrote:
> What are the problems here? Is this a matter of existing filesystems
> being unable/unwilling to support this or is it just fundamentally
> broken?

It's a fundamentally broken model.  See Dave's post that actually was
sent slightly earlier than mine for the list of required items, which
is fairly unrealistic.  You could probably try to architect a file
system for it, but I doubt it would gain much traction.

> The end goal is to let applications manage the persistence of
> their own data without having to involve the kernel in every IOP, but
> if we can't do that then what would a 90% solution look like? I think
> most people would be OK with having to do an fsync() occasionally, but
> not after every write to pmem.

You need an fsync for each write that you want to persist.  This sounds
painful for now.  But I have an implementation that will allow the
atomic commit of more or less arbitrary amounts of previous writes for
XFS that I plan to land once the reflink work is in.

That way you create almost arbitrarily complex data structures in your
programs and commit them atomically.  It's not going to fit the nvml
model, but that whole thing has been complete bullshit since the
beginning anyway.


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-12 Thread Oliver O'Halloran
On Mon, Sep 12, 2016 at 3:27 PM, Christoph Hellwig  wrote:
> On Thu, Sep 08, 2016 at 04:56:36PM -0600, Ross Zwisler wrote:
>> I think this goes back to our previous discussion about support for the PMEM
>> programming model.  Really I think what NVML needs isn't a way to tell if it
>> is getting a DAX mapping, but whether it is getting a DAX mapping on a
>> filesystem that fully supports the PMEM programming model.  This of course is
>> defined to be a filesystem where it can do all of its flushes from userspace
>> safely and never call fsync/msync, and that allocations that happen in page
>> faults will be synchronized to media before the page fault completes.
>
> That's an easy way to flag:  you will never get that from a Linux
> filesystem, period.
>
> NVML folks really need to stop taking crack and dreaming this could
> happen.

Well, that's a bummer.

What are the problems here? Is this a matter of existing filesystems
being unable/unwilling to support this or is it just fundamentally
broken? The end goal is to let applications manage the persistence of
their own data without having to involve the kernel in every IOP, but
if we can't do that then what would a 90% solution look like? I think
most people would be OK with having to do an fsync() occasionally, but
not after every write to pmem.


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-12 Thread Xiao Guangrong



On 09/12/2016 11:44 AM, Rudoff, Andy wrote:

>> Whether msync/fsync can make data persistent depends on ADR feature on
>> memory controller, if it exists everything works well, otherwise, we need
>> to have another interface that is why 'Flush hint table' in ACPI comes
>> in. 'Flush hint table' is particularly useful for nvdimm virtualization if
>> we use normal memory to emulate nvdimm with data persistent characteristic
>> (the data will be flushed to a persistent storage, e.g, disk).
>>
>> Does current PMEM programming model fully support 'Flush hint table'? Is
>> userspace allowed to use these addresses?


> The Flush hint table is NOT a replacement for ADR.  To support pmem on
> the x86 architecture, the platform is required to ensure that a pmem
> store flushed from the CPU caches is in the persistent domain so that the
> application need not take any additional steps to make it persistent.
> The most common way to do this is the ADR feature.
>
> If the above is not true, then your x86 platform does not support pmem.


Understood.

However, virtualization is a special case, as we can use normal memory
to emulate an NVDIMM for the VM so that the VM can bypass the local file
cache, shorten the IO path, reduce memory usage, etc. Currently, this is
useful for lightweight virtualization, such as clean containers.

In this case, ADR is available on the physical platform, but it cannot
help us make data persistent for the VM. So virtualizing the
'flush hint table' is a good way to handle it, based on the ACPI spec:
| software can write to any one of these Flush Hint Addresses to
| cause any preceding writes to the NVDIMM region to be flushed
| out of the intervening platform buffers to the targeted NVDIMM
| (to achieve durability)
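
A simplified, kernel-context sketch of that write pattern (this is not the
actual drivers/nvdimm code; hint_addr stands for one of the ioremapped
flush hint addresses published in the (virtual) NFIT):

    #include <linux/io.h>

    static void flush_hint_kick(void __iomem *hint_addr)
    {
            /* Order all earlier pmem stores before the hint write... */
            wmb();
            /* ...then poke the hint address; the value written is irrelevant. */
            writeq(1, hint_addr);
    }

In the virtualized case, that hint write is the point a hypervisor can trap
and turn into a flush of whatever backs the emulated NVDIMM.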



> Flush hints are for use by the BIOS and drivers and are not intended to
> be used in user space.  Flush hints provide two things:
>
> First, if a driver needs to write to command registers or movable windows
> on a DIMM, the Flush hint (if provided in the NFIT) is required to flush
> the command to the DIMM or ensure stores done through the movable window
> are complete before moving it somewhere else.
>
> Second, for the rare case where the kernel wants to flush stores to the
> smallest possible failure domain (i.e. to the DIMM even though ADR will
> handle flushing it from a larger domain), the flush hints provide a way
> to do this.  This might be useful for things like file system journals to
> help ensure the file system is consistent even in the face of ADR failure.


We are assuming ADR can fail; however, do we have a way to know whether
ADR is working correctly? Maybe an MCE could report it?

This is necessary to support making data persistent without fsync/msync
in userspace. Or do we need to unconditionally use the 'flush hint address'
if it is available, as the current nvdimm driver does?

Thanks!


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-11 Thread Christoph Hellwig
On Thu, Sep 08, 2016 at 04:56:36PM -0600, Ross Zwisler wrote:
> I think this goes back to our previous discussion about support for the PMEM
> programming model.  Really I think what NVML needs isn't a way to tell if it
> is getting a DAX mapping, but whether it is getting a DAX mapping on a
> filesystem that fully supports the PMEM programming model.  This of course is
> defined to be a filesystem where it can do all of its flushes from userspace
> safely and never call fsync/msync, and that allocations that happen in page
> faults will be synchronized to media before the page fault completes.

That's an easy way to flag:  you will never get that from a Linux
filesystem, period.

NVML folks really need to stop taking crack and dreaming this could
happen.


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-11 Thread Rudoff, Andy
>Whether msync/fsync can make data persistent depends on ADR feature on
>memory controller, if it exists everything works well, otherwise, we need
>to have another interface that is why 'Flush hint table' in ACPI comes
>in. 'Flush hint table' is particularly useful for nvdimm virtualization if
>we use normal memory to emulate nvdimm with data persistent characteristic
>(the data will be flushed to a persistent storage, e.g, disk).
>
>Does current PMEM programming model fully support 'Flush hint table'? Is
>userspace allowed to use these addresses?

The Flush hint table is NOT a replacement for ADR.  To support pmem on
the x86 architecture, the platform is required to ensure that a pmem
store flushed from the CPU caches is in the persistent domain so that the
application need not take any additional steps to make it persistent.
The most common way to do this is the ADR feature.

If the above is not true, then your x86 platform does not support pmem.

Flush hints are for use by the BIOS and drivers and are not intended to
be used in user space.  Flush hints provide two things:

First, if a driver needs to write to command registers or movable windows
on a DIMM, the Flush hint (if provided in the NFIT) is required to flush
the command to the DIMM or ensure stores done through the movable window
are complete before moving it somewhere else.

Second, for the rare case where the kernel wants to flush stores to the
smallest possible failure domain (i.e. to the DIMM even though ADR will
handle flushing it from a larger domain), the flush hints provide a way
to do this.  This might be useful for things like file system journals to
help ensure the file system is consistent even in the face of ADR failure.

-andy




Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-11 Thread Dave Chinner
On Thu, Sep 08, 2016 at 04:56:36PM -0600, Ross Zwisler wrote:
> On Wed, Sep 07, 2016 at 09:32:36PM -0700, Dan Williams wrote:
> > My understanding is that it is looking for the VM_MIXEDMAP flag which
> > is already ambiguous for determining if DAX is enabled even if this
> > dynamic listing issue is fixed.  XFS has arranged for DAX to be a
> > per-inode capability and has an XFS-specific inode flag.  We can make
> > that a common inode flag, but it seems we should have a way to
> > interrogate the mapping itself in the case where the inode is unknown
> > or unavailable.  I'm thinking extensions to mincore to have flags for
> > DAX and possibly whether the page is part of a pte, pmd, or pud
> > mapping.  Just floating that idea before starting to look into the
> > implementation, comments or other ideas welcome...
> 
> I think this goes back to our previous discussion about support for the PMEM
> programming model.  Really I think what NVML needs isn't a way to tell if it
> is getting a DAX mapping, but whether it is getting a DAX mapping on a
> filesystem that fully supports the PMEM programming model.  This of course is
> defined to be a filesystem where it can do all of its flushes from userspace
> safely and never call fsync/msync, and that allocations that happen in page
> faults will be synchronized to media before the page fault completes.
> 
> IIUC this is what NVML needs - a way to decide "do I use fsync/msync for
> everything or can I rely fully on flushes from userspace?" 

"need fsync/msync" is a dynamic state of an inode, not a static
property. i.e. users can do things that change an inode behind the
back of a mapping, even if they are not aware that this might
happen. As such, a filesystem can invalidate an existing mapping
at any time and userspace won't notice because it will simply fault
in a new mapping on the next access...

> For all existing implementations, I think the answer is "you need to use
> fsync/msync" because we don't yet have proper support for the PMEM programming
> model.

Yes, that is correct.

FWIW, I don't think it will ever be possible to support this 
wonderful "PMEM programming model" from any current or future kernel
filesystem without a very specific set of restrictions on what can
be done to a file.  e.g.

1. the file has to be fully allocated and zeroed before
   use. Preallocation/zeroing via unwritten extents is not
   allowed. Sparse files are not allowed. Shared extents are
   not allowed.
2. set the "PMEM_IMMUTABLE" inode flag - filesystem must
   check the file is fully allocated before allowing it to
   be set, and caller must have CAP_LINUX_IMMUTABLE.
3. Inode metadata is now immutable, and file data can only
   be accessed and/or modified via mmap().
4. All non-mmap methods of inode data modification
   will now fail with EPERM.
5. all methods of inode metadata modification will now fail
   with EPERM; timestamp updates will be ignored.
6. PMEM_IMMUTABLE flag can only be removed if the file is
   not currently mapped and caller has CAP_LINUX_IMMUTABLE.

A flag like this /should/ make it possible to avoid fsync/msync() on
a file for existing filesystems, but it also means that such files
have significant management issues (hence the need for
CAP_LINUX_IMMUTABLE to cover its use).
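
As a rough sketch of how an application might use such a flag (the
FS_PMEM_IMMUTABLE_FL bit and setting it via FS_IOC_SETFLAGS are purely
hypothetical; only the surrounding calls exist today):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/fs.h>     /* FS_IOC_GETFLAGS / FS_IOC_SETFLAGS */
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    #define FS_PMEM_IMMUTABLE_FL 0x40000000   /* hypothetical flag bit */

    void *prepare_pmem_file(int fd, size_t len)
    {
        int flags;

        /*
         * 1. Fully allocate the file.  A real implementation would also
         *    need the blocks physically zeroed/written, since plain
         *    preallocation leaves unwritten extents, which rule 1 above
         *    forbids.
         */
        if (fallocate(fd, 0, 0, len) < 0)
            return MAP_FAILED;

        /* 2. Set the flag; the caller needs CAP_LINUX_IMMUTABLE. */
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
            return MAP_FAILED;
        flags |= FS_PMEM_IMMUTABLE_FL;
        if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
            return MAP_FAILED;

        /* 3. From here on, data is only reached through the mapping. */
        return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }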

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-09 Thread Dan Williams
On Fri, Sep 9, 2016 at 1:55 AM, Xiao Guangrong
 wrote:
[..]
>>
>> Whether a persistent memory mapping requires an msync/fsync is a
>> filesystem specific question.  This mincore proposal is separate from
>> that.  Consider device-DAX for volatile memory or mincore() called on
>> an anonymous memory range.  In those cases persistence and filesystem
>> metadata are not in the picture, but it would still be useful for
>> userspace to know "is there page cache backing this mapping?" or "what
>> is the TLB geometry of this mapping?".
>
>
> I got a question about msync/fsync which is beyond the topic of this thread
> :)
>
> Whether msync/fsync can make data persistent depends on ADR feature on
> memory
> controller, if it exists everything works well, otherwise, we need to have
> another
> interface that is why 'Flush hint table' in ACPI comes in. 'Flush hint
> table' is
> particularly useful for nvdimm virtualization if we use normal memory to
> emulate
> nvdimm with data persistent characteristic (the data will be flushed to a
> persistent storage, e.g, disk).
>
> Does current PMEM programming model fully support 'Flush hint table'? Is
> userspace allowed to use these addresses?

If you publish flush hint addresses in the virtual NFIT the guest VM
will write to them whenever a REQ_FLUSH or REQ_FUA request is sent to
the virtual /dev/pmemX device.  Yes, seems straightforward to take a
VM exit on those events and flush simulated pmem to persistent
storage.
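
A host-side sketch of that path, with entirely made-up names (this is not a
QEMU/KVM API, just the shape of the idea): when the guest driver writes a
flush hint advertised in the virtual NFIT, the VMM traps the access and
pushes the emulated pmem out to durable storage.

    #include <unistd.h>

    /*
     * Hypothetical VMM callback, invoked on a trapped guest write to one
     * of the flush hint addresses advertised in the virtual NFIT.
     */
    int vpmem_flush_hint_write(int backing_fd)
    {
        /* Make the memory-backed "NVDIMM" contents durable on the host. */
        return fsync(backing_fd);
    }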