date:20190606

On Thu, Jun 06, 2019 at 03:03:30PM -0700, 'Ira Weiny' wrote:
> On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> > On Wed 05-06-19 18:45:33, ira.we...@intel.com wrote:
> > > From: Ira Weiny 
> > > 
> > > ... V1,000,000   ;-)
> > > 
> > > Pre-requisites:
> > >   John Hubbard's put_user_pages() patch series.[1]
> > >   Jan Kara's ext4_break_layouts() fixes[2]
> > > 
> > > Based on the feedback from LSFmm and the LWN article which resulted.  I've
> > > decided to take a slightly different tack on this problem.
> > > 
> > > The real issue is that there is no use case for a user to have RDMA 
> > > pinn'ed
> > > memory which is then truncated.  So really any solution we present which:
> > > 
> > > A) Prevents file system corruption or data leaks
> > > ...and...
> > > B) Informs the user that they did something wrong
> > > 
> > > Should be an acceptable solution.
> > > 
> > > Because this is slightly new behavior.  And because this is gonig to be
> > > specific to DAX (because of the lack of a page cache) we have made the 
> > > user
> > > "opt in" to this behavior.
> > > 
> > > The following patches implement the following solution.
> > > 
> > > 1) The user has to opt in to allowing GUP pins on a file with a layout 
> > > lease
> > >(now made visible).
> > > 2) GUP will fail (EPERM) if a layout lease is not taken
> > > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail.
> > > 4) The user has the option of holding the layout lease to receive a SIGIO 
> > > for
> > >notification to the original thread that another thread has tried to 
> > > delete
> > >their data.  Furthermore this indicates that if the user needs to GUP 
> > > the
> > >file again they will need to retake the Layout lease before doing so.
> > > 
> > > 
> > > NOTE: If the user releases the layout lease or if it has been broken by
> > > another operation further GUP operations on the file will fail without
> > > re-taking the lease.  This means that if a user would like to register
> > > pieces of a file and continue to register other pieces later they would
> > > be advised to keep the layout lease, get a SIGIO notification, and retake
> > > the lease.
> > > 
> > > NOTE2: Truncation of pages which are not actively pinned will succeed.
> > > Similar to accessing an mmap to this area GUP pins of that memory may
> > > fail.
> > 
> > So after some through I'm willing accept the fact that pinned DAX pages
> > will just make truncate / hole punch fail and shove it into a same bucket
> > of situations like "user can open a file and unlink won't delete it" or
> > "ETXTBUSY when user is executing a file being truncated".  The problem I
> > have with this proposal is a lack of visibility from sysadmin POV. For
> > ETXTBUSY or "unlinked but open file" sysadmin can just do lsof, find the
> > problematic process and kill it. There's nothing like that with your
> > proposal since currently once you hold page reference, you can unmap the
> > file, drop layout lease, close the file, and there's no trace that you're
> > responsible for the pinned page anymore.
> 
> Agreed.  For some "GUP interfaces" one may be able to figure this out but I'm
> not familiar with any.  For RDMA there has been some additions for tracking
> resources but I don't think any of that is useful here.  Regardless from a FS
> POV this is awkward to have to understand all the independent interfaces, so I
> agree.
> 
> > 
> > So I'd like to actually mandate that you *must* hold the file lease until
> > you unpin all pages in the given range (not just that you have an option to
> > hold a lease). And I believe the kernel should actually enforce this. That
> > way we maintain a sane state that if someone uses a physical location of
> > logical file offset on disk, he has a layout lease. Also once this is done,
> > sysadmin has a reasonably easy way to discover run-away RDMA application
> > and kill it if he wishes so.
> 
> Fair enough.
> 
> I was kind of heading that direction but had not thought this far forward.  I
> was exploring how to have a lease remain on the file even after a "lease
> break".  But that is incompatible with the current semantics of a "layout"
> lease (as currently defined in the kernel).  [In the end I wanted to get an 
> RFC
> out to see what people think of this idea so I did not look at keeping the
> lease.]
> 
> Also hitch is that currently a lease is forcefully broken after
> /lease-break-time.  To do what you suggest I think we would need a new
> lease type with the semantics you describe.
> 
> Previously I had thought this would be a good idea (for other reasons).  But
> what does everyone think about using a "longterm lease" similar to [1] which
> has the semantics you proppose?  In [1] I was not sure "longterm" was a good
> name but with your proposal I think it makes more sense.
> 
> > 
> > The question is on how to exactly enforce that lease is taken until all
> > pages are unpinned. I belive it could be done by

Persistent Programming in Real Life (PIRL) Call for Presentations

2019-06-06 Thread Steven Swanson

You are invited to propose a presentation at the first annual Persistent
Programming in Real Life (PIRL) (https://pirl.nvsl.io/). PIRL brings
together software development leaders interested in learning about
programming methodologies for persistent memories (e.g. NVDIMMs, Optane DC)
and sharing their experiences with others.

Tell us about what you have done (and want to do) with persistent memory,
what worked, what didn’t, what was hard, what was easy, what was
surprising, and what did you learn? How can others learn from your
experience?

PIRL is small. We are limiting attendance this year to under 100 people,
including speakers. There will be lots of time for informal discussion and
networking.

We have an exciting slate of keynote speakers lined up:

Pratap Subrahmanyam (VMware)
Jia Shi (Oracle)
Dan Williams (Intel Corporation)
Scott Miller (Dreamworks)
Stephen Bates (Eideticom)

If you would like to present at PIRL, fill out this form (
https://docs.google.com/forms/d/e/1FAIpQLSfKwKznaOUkKhwvbSlhTSWMCyO-RXRy2QVheUwATzA7nvzQzQ/viewform).
For instance, a proposal might include:
1. What you did or are trying to do.
2. Your experience, and the lessons others can learn from it.
3. We expect to see code!

We welcome talks that describe new tools, but we are not interested in
sales pitches. Attendees should come away from every session with
actionable information that can be utilized to use persistent memory more
effectively.
Presentation Format

We are open to many different kinds of talks. Possibilities include:
1. Talks about experiences on a particular project.
2. Potentially develop or write code live or provide samples during the
presentation.
3. Do you have a fun challenge for persistent coding? We’re accepting code
challenges, and we’ll be offering two or three of them so you can show
others your skills.

PIRL will be hosted by the Non-Volatile Systems Laboratory at the
University of California, San Diego. It will be held at Scripps Forum (
https://scripps.ucsd.edu/about/venues/seaside-forum) on July 22nd to 23rd,
2019, with a social event the evening of July 21st.

Pre-registration is open (
https://www.eventbrite.com/e/pirl-registration-62219498194) and will be
$400. Space is limited.

If you have any questions, please contact Steven Swanson (
swan...@cs.ucsd.edu).
Organizing Committee

Steven Swanson (UCSD) Jim Fister (SNIA) Andy Rudoff (Intel)

Jishen Zhao (UCSD) Joe Izraelevitz (UCSD)
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Thu, Jun 06, 2019 at 04:51:15PM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> 
> > So I'd like to actually mandate that you *must* hold the file lease until
> > you unpin all pages in the given range (not just that you have an option to
> > hold a lease). And I believe the kernel should actually enforce this. That
> > way we maintain a sane state that if someone uses a physical location of
> > logical file offset on disk, he has a layout lease. Also once this is done,
> > sysadmin has a reasonably easy way to discover run-away RDMA application
> > and kill it if he wishes so.
> > 
> > The question is on how to exactly enforce that lease is taken until all
> > pages are unpinned. I belive it could be done by tracking number of
> > long-term pinned pages within a lease. Gup_longterm could easily increment
> > the count when verifying the lease exists, gup_longterm users will somehow
> > need to propagate corresponding 'filp' (struct file pointer) to
> > put_user_pages_longterm() callsites so that they can look up appropriate
> > lease to drop reference - probably I'd just transition all gup_longterm()
> > users to a saner API similar to the one we have in mm/frame_vector.c where
> > we don't hand out page pointers but an encapsulating structure that does
> > all the necessary tracking. Removing a lease would need to block until all
> > pins are released - this is probably the most hairy part since we need to
> > handle a case if application just closes the file descriptor which
> > would
> 
> I think if you are going to do this then the 'struct filp' that
> represents the lease should be held in the kernel (ie inside the RDMA
> umem) until the kernel is done with it.

Yea there seems merit to this.  I'm still not resolving how this helps track
who has the pin across a fork.

> 
> Actually does someone have a pointer to this userspace lease API, I'm
> not at all familiar with it, thanks

man fcntl
search for SETLEASE

But I had to add the F_LAYOUT lease type.  (Personally I'm for calling it
F_LONGTERM at this point.  I don't think LAYOUT is compatible with what we are
proposing here.)

Anyway, yea would be a libc change at lease for man page etc...  But again I
want to get some buy in before going through all that.

> 
> And yes, a better output format from GUP would be great..
> 
> > Maybe we could block only on explicit lease unlock and just drop the layout
> > lease on file close and if there are still pinned pages, send SIGKILL to an
> > application as a reminder it did something stupid...
> 
> Which process would you SIGKILL? At least for the rdma case a FD is
> holding the GUP, so to do the put_user_pages() the kernel needs to
> close the FD. I guess it would have to kill every process that has the
> FD open? Seems complicated...

Tending to agree...  But I'm still not opposed to killing bad actors...  ;-)

NOTE: Jason I think you need to be more clear about the FD you are speaking of.
I believe you mean the FD which refers to the RMDA context.  That is what I
called it in my other email.

Ira

> 
> Regards,
> Jason
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> On Wed 05-06-19 18:45:33, ira.we...@intel.com wrote:
> > From: Ira Weiny 
> > 
> > ... V1,000,000   ;-)
> > 
> > Pre-requisites:
> > John Hubbard's put_user_pages() patch series.[1]
> > Jan Kara's ext4_break_layouts() fixes[2]
> > 
> > Based on the feedback from LSFmm and the LWN article which resulted.  I've
> > decided to take a slightly different tack on this problem.
> > 
> > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > memory which is then truncated.  So really any solution we present which:
> > 
> > A) Prevents file system corruption or data leaks
> > ...and...
> > B) Informs the user that they did something wrong
> > 
> > Should be an acceptable solution.
> > 
> > Because this is slightly new behavior.  And because this is gonig to be
> > specific to DAX (because of the lack of a page cache) we have made the user
> > "opt in" to this behavior.
> > 
> > The following patches implement the following solution.
> > 
> > 1) The user has to opt in to allowing GUP pins on a file with a layout lease
> >(now made visible).
> > 2) GUP will fail (EPERM) if a layout lease is not taken
> > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail.
> > 4) The user has the option of holding the layout lease to receive a SIGIO 
> > for
> >notification to the original thread that another thread has tried to 
> > delete
> >their data.  Furthermore this indicates that if the user needs to GUP the
> >file again they will need to retake the Layout lease before doing so.
> > 
> > 
> > NOTE: If the user releases the layout lease or if it has been broken by
> > another operation further GUP operations on the file will fail without
> > re-taking the lease.  This means that if a user would like to register
> > pieces of a file and continue to register other pieces later they would
> > be advised to keep the layout lease, get a SIGIO notification, and retake
> > the lease.
> > 
> > NOTE2: Truncation of pages which are not actively pinned will succeed.
> > Similar to accessing an mmap to this area GUP pins of that memory may
> > fail.
> 
> So after some through I'm willing accept the fact that pinned DAX pages
> will just make truncate / hole punch fail and shove it into a same bucket
> of situations like "user can open a file and unlink won't delete it" or
> "ETXTBUSY when user is executing a file being truncated".  The problem I
> have with this proposal is a lack of visibility from sysadmin POV. For
> ETXTBUSY or "unlinked but open file" sysadmin can just do lsof, find the
> problematic process and kill it. There's nothing like that with your
> proposal since currently once you hold page reference, you can unmap the
> file, drop layout lease, close the file, and there's no trace that you're
> responsible for the pinned page anymore.

Agreed.  For some "GUP interfaces" one may be able to figure this out but I'm
not familiar with any.  For RDMA there has been some additions for tracking
resources but I don't think any of that is useful here.  Regardless from a FS
POV this is awkward to have to understand all the independent interfaces, so I
agree.

> 
> So I'd like to actually mandate that you *must* hold the file lease until
> you unpin all pages in the given range (not just that you have an option to
> hold a lease). And I believe the kernel should actually enforce this. That
> way we maintain a sane state that if someone uses a physical location of
> logical file offset on disk, he has a layout lease. Also once this is done,
> sysadmin has a reasonably easy way to discover run-away RDMA application
> and kill it if he wishes so.

Fair enough.

I was kind of heading that direction but had not thought this far forward.  I
was exploring how to have a lease remain on the file even after a "lease
break".  But that is incompatible with the current semantics of a "layout"
lease (as currently defined in the kernel).  [In the end I wanted to get an RFC
out to see what people think of this idea so I did not look at keeping the
lease.]

Also hitch is that currently a lease is forcefully broken after
/lease-break-time.  To do what you suggest I think we would need a new
lease type with the semantics you describe.

Previously I had thought this would be a good idea (for other reasons).  But
what does everyone think about using a "longterm lease" similar to [1] which
has the semantics you proppose?  In [1] I was not sure "longterm" was a good
name but with your proposal I think it makes more sense.

> 
> The question is on how to exactly enforce that lease is taken until all
> pages are unpinned. I belive it could be done by tracking number of
> long-term pinned pages within a lease. Gup_longterm could easily increment
> the count when verifying the lease exists, gup_longterm users will somehow
> need to propagate corresponding 'filp' (struct file pointer) to
> put_user_pages_longterm() callsites so that

Re: [PATCH v9 11/12] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields

2019-06-06 Thread Andrew Morton

On Wed, 05 Jun 2019 14:58:58 -0700 Dan Williams  
wrote:

> At namespace creation time there is the potential for the "expected to
> be zero" fields of a 'pfn' info-block to be filled with indeterminate
> data. While the kernel buffer is zeroed on allocation it is immediately
> overwritten by nd_pfn_validate() filling it with the current contents of
> the on-media info-block location. For fields like, 'flags' and the
> 'padding' it potentially means that future implementations can not rely
> on those fields being zero.
> 
> In preparation to stop using the 'start_pad' and 'end_trunc' fields for
> section alignment, arrange for fields that are not explicitly
> initialized to be guaranteed zero. Bump the minor version to indicate it
> is safe to assume the 'padding' and 'flags' are zero. Otherwise, this
> corruption is expected to benign since all other critical fields are
> explicitly initialized.
> 
> Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem")
> Cc: 
> Signed-off-by: Dan Williams 

The cc:stable in [11/12] seems odd.  Is this independent of the other
patches?  If so, shouldn't it be a standalone thing which can be
prioritized?

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

Re: [PATCH] dax: Fix xarray entry association for mixed mappings

On Thu 06-06-19 10:00:01, Dan Williams wrote:
> On Thu, Jun 6, 2019 at 2:10 AM Jan Kara  wrote:
> >
> > When inserting entry into xarray, we store mapping and index in
> > corresponding struct pages for memory error handling. When it happened
> > that one process was mapping file at PMD granularity while another
> > process at PTE granularity, we could wrongly deassociate PMD range and
> > then reassociate PTE range leaving the rest of struct pages in PMD range
> > without mapping information which could later cause missed notifications
> > about memory errors. Fix the problem by calling the association /
> > deassociation code if and only if we are really going to update the
> > xarray (deassociating and associating zero or empty entries is just
> > no-op so there's no reason to complicate the code with trying to avoid
> > the calls for these cases).
> 
> Looks good to me, I assume this also needs:
> 
> Cc: 
> Fixes: d2c997c0f145 ("fs, dax: use page->mapping to warn if truncate
> collides with a busy page")

Yes, thanks for that.

Honza

> 
> >
> > Signed-off-by: Jan Kara 
> > ---
> >  fs/dax.c | 9 -
> >  1 file changed, 4 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index f74386293632..9fd908f3df32 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -728,12 +728,11 @@ static void *dax_insert_entry(struct xa_state *xas,
> >
> > xas_reset(xas);
> > xas_lock_irq(xas);
> > -   if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
> > +   if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> > +   void *old;
> > +
> > dax_disassociate_entry(entry, mapping, false);
> > dax_associate_entry(new_entry, mapping, vmf->vma, 
> > vmf->address);
> > -   }
> > -
> > -   if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> > /*
> >  * Only swap our new entry into the page cache if the 
> > current
> >  * entry is a zero page or an empty entry.  If a normal PTE 
> > or
> > @@ -742,7 +741,7 @@ static void *dax_insert_entry(struct xa_state *xas,
> >  * existing entry is a PMD, we will just leave the PMD in 
> > the
> >  * tree and dirty it if necessary.
> >  */
> > -   void *old = dax_lock_entry(xas, new_entry);
> > +   old = dax_lock_entry(xas, new_entry);
> > WARN_ON_ONCE(old != xa_mk_value(xa_to_value(entry) |
> > DAX_LOCKED));
> > entry = new_entry;
> > --
> > 2.16.4
> >
-- 
Jan Kara 
SUSE Labs, CR
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

Re: [PATCH v9 07/12] mm/sparsemem: Prepare for sub-section ranges

2019-06-06 Thread Dan Williams

On Thu, Jun 6, 2019 at 10:21 AM Oscar Salvador  wrote:
>
> On Wed, Jun 05, 2019 at 02:58:37PM -0700, Dan Williams wrote:
> > Prepare the memory hot-{add,remove} paths for handling sub-section
> > ranges by plumbing the starting page frame and number of pages being
> > handled through arch_{add,remove}_memory() to
> > sparse_{add,remove}_one_section().
> >
> > This is simply plumbing, small cleanups, and some identifier renames. No
> > intended functional changes.
> >
> > Cc: Michal Hocko 
> > Cc: Vlastimil Babka 
> > Cc: Logan Gunthorpe 
> > Cc: Oscar Salvador 
> > Reviewed-by: Pavel Tatashin 
> > Signed-off-by: Dan Williams 
> > ---
> >  include/linux/memory_hotplug.h |5 +-
> >  mm/memory_hotplug.c|  114 
> > +---
> >  mm/sparse.c|   15 ++---
> >  3 files changed, 81 insertions(+), 53 deletions(-)
> >
> > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > index 79e0add6a597..3ab0282b4fe5 100644
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -348,9 +348,10 @@ extern int add_memory_resource(int nid, struct 
> > resource *resource);
> >  extern void move_pfn_range_to_zone(struct zone *zone, unsigned long 
> > start_pfn,
> >   unsigned long nr_pages, struct vmem_altmap *altmap);
> >  extern bool is_memblock_offlined(struct memory_block *mem);
> > -extern int sparse_add_one_section(int nid, unsigned long start_pfn,
> > -   struct vmem_altmap *altmap);
> > +extern int sparse_add_section(int nid, unsigned long pfn,
> > + unsigned long nr_pages, struct vmem_altmap *altmap);
> >  extern void sparse_remove_one_section(struct mem_section *ms,
> > + unsigned long pfn, unsigned long nr_pages,
> >   unsigned long map_offset, struct vmem_altmap *altmap);
> >  extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
> > unsigned long pnum);
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 4b882c57781a..399bf78bccc5 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -252,51 +252,84 @@ void __init register_page_bootmem_info_node(struct 
> > pglist_data *pgdat)
> >  }
> >  #endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
> >
> > -static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
> > -struct vmem_altmap *altmap)
> > +static int __meminit __add_section(int nid, unsigned long pfn,
> > + unsigned long nr_pages, struct vmem_altmap *altmap)
> >  {
> >   int ret;
> >
> > - if (pfn_valid(phys_start_pfn))
> > + if (pfn_valid(pfn))
> >   return -EEXIST;
> >
> > - ret = sparse_add_one_section(nid, phys_start_pfn, altmap);
> > + ret = sparse_add_section(nid, pfn, nr_pages, altmap);
> >   return ret < 0 ? ret : 0;
> >  }
> >
> > +static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
> > + const char *reason)
> > +{
> > + /*
> > +  * Disallow all operations smaller than a sub-section and only
> > +  * allow operations smaller than a section for
> > +  * SPARSEMEM_VMEMMAP. Note that check_hotplug_memory_range()
> > +  * enforces a larger memory_block_size_bytes() granularity for
> > +  * memory that will be marked online, so this check should only
> > +  * fire for direct arch_{add,remove}_memory() users outside of
> > +  * add_memory_resource().
> > +  */
> > + unsigned long min_align;
> > +
> > + if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP))
> > + min_align = PAGES_PER_SUBSECTION;
> > + else
> > + min_align = PAGES_PER_SECTION;
> > + if (!IS_ALIGNED(pfn, min_align)
> > + || !IS_ALIGNED(nr_pages, min_align)) {
> > + WARN(1, "Misaligned __%s_pages start: %#lx end: #%lx\n",
> > + reason, pfn, pfn + nr_pages - 1);
> > + return -EINVAL;
> > + }
> > + return 0;
> > +}
>
>
> This caught my eye.
> Back in patch#4 "Convert kmalloc_section_memmap() to 
> populate_section_memmap()",
> you placed a mis-usage check for !CONFIG_SPARSEMEM_VMEMMAP in
> populate_section_memmap().
>
> populate_section_memmap() gets called from sparse_add_one_section(), which 
> means
> that we should have passed this check, otherwise we cannot go further and call
> __add_section().
>
> So, unless I am missing something it seems to me that the check from patch#4 
> could go?
> And I think the same applies to depopulate_section_memmap()?

Yes, good catch, I can kill those extra checks in favor of this one.

> Besides that, it looks good to me:

Thanks Oscar!

>
> Reviewed-by: Oscar Salvador 
>
> --
> Oscar Salvador
> SUSE L3
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

Re: [PATCH v9 01/12] mm/sparsemem: Introduce struct mem_section_usage

On Wed, Jun 05, 2019 at 02:57:54PM -0700, Dan Williams wrote:
> Towards enabling memory hotplug to track partial population of a
> section, introduce 'struct mem_section_usage'.
> 
> A pointer to a 'struct mem_section_usage' instance replaces the existing
> pointer to a 'pageblock_flags' bitmap. Effectively it adds one more
> 'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to
> house a new 'subsection_map' bitmap.  The new bitmap enables the memory
> hot{plug,remove} implementation to act on incremental sub-divisions of a
> section.
> 
> The default SUBSECTION_SHIFT is chosen to keep the 'subsection_map' no
> larger than a single 'unsigned long' on the major architectures.
> Alternatively an architecture can define ARCH_SUBSECTION_SHIFT to
> override the default PMD_SHIFT. Note that PowerPC needs to use
> ARCH_SUBSECTION_SHIFT to workaround PMD_SHIFT being a non-constant
> expression on PowerPC.
> 
> The primary motivation for this functionality is to support platforms
> that mix "System RAM" and "Persistent Memory" within a single section,
> or multiple PMEM ranges with different mapping lifetimes within a single
> section. The section restriction for hotplug has caused an ongoing saga
> of hacks and bugs for devm_memremap_pages() users.
> 
> Beyond the fixups to teach existing paths how to retrieve the 'usemap'
> from a section, and updates to usemap allocation path, there are no
> expected behavior changes.
> 
> Cc: Michal Hocko 
> Cc: Vlastimil Babka 
> Cc: Logan Gunthorpe 
> Cc: Oscar Salvador 
> Cc: Pavel Tatashin 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Signed-off-by: Dan Williams 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

Re: [PATCH v9 07/12] mm/sparsemem: Prepare for sub-section ranges

On Wed, Jun 05, 2019 at 02:58:37PM -0700, Dan Williams wrote:
> Prepare the memory hot-{add,remove} paths for handling sub-section
> ranges by plumbing the starting page frame and number of pages being
> handled through arch_{add,remove}_memory() to
> sparse_{add,remove}_one_section().
> 
> This is simply plumbing, small cleanups, and some identifier renames. No
> intended functional changes.
> 
> Cc: Michal Hocko 
> Cc: Vlastimil Babka 
> Cc: Logan Gunthorpe 
> Cc: Oscar Salvador 
> Reviewed-by: Pavel Tatashin 
> Signed-off-by: Dan Williams 
> ---
>  include/linux/memory_hotplug.h |5 +-
>  mm/memory_hotplug.c|  114 
> +---
>  mm/sparse.c|   15 ++---
>  3 files changed, 81 insertions(+), 53 deletions(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 79e0add6a597..3ab0282b4fe5 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -348,9 +348,10 @@ extern int add_memory_resource(int nid, struct resource 
> *resource);
>  extern void move_pfn_range_to_zone(struct zone *zone, unsigned long 
> start_pfn,
>   unsigned long nr_pages, struct vmem_altmap *altmap);
>  extern bool is_memblock_offlined(struct memory_block *mem);
> -extern int sparse_add_one_section(int nid, unsigned long start_pfn,
> -   struct vmem_altmap *altmap);
> +extern int sparse_add_section(int nid, unsigned long pfn,
> + unsigned long nr_pages, struct vmem_altmap *altmap);
>  extern void sparse_remove_one_section(struct mem_section *ms,
> + unsigned long pfn, unsigned long nr_pages,
>   unsigned long map_offset, struct vmem_altmap *altmap);
>  extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
> unsigned long pnum);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 4b882c57781a..399bf78bccc5 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -252,51 +252,84 @@ void __init register_page_bootmem_info_node(struct 
> pglist_data *pgdat)
>  }
>  #endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
>  
> -static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
> -struct vmem_altmap *altmap)
> +static int __meminit __add_section(int nid, unsigned long pfn,
> + unsigned long nr_pages, struct vmem_altmap *altmap)
>  {
>   int ret;
>  
> - if (pfn_valid(phys_start_pfn))
> + if (pfn_valid(pfn))
>   return -EEXIST;
>  
> - ret = sparse_add_one_section(nid, phys_start_pfn, altmap);
> + ret = sparse_add_section(nid, pfn, nr_pages, altmap);
>   return ret < 0 ? ret : 0;
>  }
>  
> +static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
> + const char *reason)
> +{
> + /*
> +  * Disallow all operations smaller than a sub-section and only
> +  * allow operations smaller than a section for
> +  * SPARSEMEM_VMEMMAP. Note that check_hotplug_memory_range()
> +  * enforces a larger memory_block_size_bytes() granularity for
> +  * memory that will be marked online, so this check should only
> +  * fire for direct arch_{add,remove}_memory() users outside of
> +  * add_memory_resource().
> +  */
> + unsigned long min_align;
> +
> + if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP))
> + min_align = PAGES_PER_SUBSECTION;
> + else
> + min_align = PAGES_PER_SECTION;
> + if (!IS_ALIGNED(pfn, min_align)
> + || !IS_ALIGNED(nr_pages, min_align)) {
> + WARN(1, "Misaligned __%s_pages start: %#lx end: #%lx\n",
> + reason, pfn, pfn + nr_pages - 1);
> + return -EINVAL;
> + }
> + return 0;
> +}


This caught my eye.
Back in patch#4 "Convert kmalloc_section_memmap() to populate_section_memmap()",
you placed a mis-usage check for !CONFIG_SPARSEMEM_VMEMMAP in
populate_section_memmap().

populate_section_memmap() gets called from sparse_add_one_section(), which means
that we should have passed this check, otherwise we cannot go further and call
__add_section().

So, unless I am missing something it seems to me that the check from patch#4 
could go?
And I think the same applies to depopulate_section_memmap()?

Besides that, it looks good to me:

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Wed, Jun 05, 2019 at 10:52:12PM -0700, John Hubbard wrote:
> On 6/5/19 6:45 PM, ira.we...@intel.com wrote:
> > From: Ira Weiny 
> > 
> > ... V1,000,000   ;-)
> > 
> > Pre-requisites:
> > John Hubbard's put_user_pages() patch series.[1]
> > Jan Kara's ext4_break_layouts() fixes[2]
> > 
> > Based on the feedback from LSFmm and the LWN article which resulted.  I've
> > decided to take a slightly different tack on this problem.
> > 
> > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > memory which is then truncated.  So really any solution we present which:
> > 
> > A) Prevents file system corruption or data leaks
> > ...and...
> > B) Informs the user that they did something wrong
> > 
> > Should be an acceptable solution.
> > 
> > Because this is slightly new behavior.  And because this is gonig to be
> > specific to DAX (because of the lack of a page cache) we have made the user
> > "opt in" to this behavior.
> > 
> > The following patches implement the following solution.
> > 
> > 1) The user has to opt in to allowing GUP pins on a file with a layout lease
> >(now made visible).
> > 2) GUP will fail (EPERM) if a layout lease is not taken
> > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail.
> > 4) The user has the option of holding the layout lease to receive a SIGIO 
> > for
> >notification to the original thread that another thread has tried to 
> > delete
> >their data.  Furthermore this indicates that if the user needs to GUP the
> >file again they will need to retake the Layout lease before doing so.
> > 
> > 
> > NOTE: If the user releases the layout lease or if it has been broken by 
> > another
> > operation further GUP operations on the file will fail without re-taking the
> > lease.  This means that if a user would like to register pieces of a file 
> > and
> > continue to register other pieces later they would be advised to keep the
> > layout lease, get a SIGIO notification, and retake the lease.
> > 
> > NOTE2: Truncation of pages which are not actively pinned will succeed.  
> > Similar
> > to accessing an mmap to this area GUP pins of that memory may fail.
> > 
> 
> Hi Ira,
> 
> Wow, great to see this. This looks like basically the right behavior, IMHO.
> 
> 1. We'll need man page additions, to explain it. In fact, even after a quick 
> first
> pass through, I'm vague on two points:

Of course.  But I was not going to go through and attempt to write man pages
and other docs without some agreement on the final mechanisms.  This works
which was the basic requirement I had to send an RFC.  :-D  But yes man pages
and updates to headers etc all have to be done.

> 
> a) I'm not sure how this actually provides "opt-in to new behavior", because 
> I 
> don't see any CONFIG_* or boot time choices, and it looks like the new 
> behavior 
> just is there. That is, if user space doesn't set F_LAYOUT on a range, 
> GUP FOLL_LONGTERM will now fail, which is new behavior. (Did I get that 
> right?)

The opt in is at run time.  Currently GUP FOLL_LONGTERM is _not_ _allowed_ on
the FS DAX pages at all.  So the default behavior is the same, GUP fails.  (Or
specifically ibv_reg_mr() fails.  This fails as before, not change there.

The Opt in is that if a user knows what is involved they can take the lease and
the GUP will not fail.  This comes with the price of knowing that other
processes can't truncate those pages in use.

> 
> b) Truncate and hole punch behavior, with and without user space having a 
> SIGIO
> handler. (I'm sure this is obvious after another look through, but it might go
> nicely in a man page.)

Sorry this was not clear.  There are 2 points for this patch set which requires
the use of catching SIGIO.

1) If an application _actually_ does (somehow, somewhere, in some unforseen use
   case) want to allow a truncate to happen.  They can catch the SIGIO, finish
   their use of the pages, and release them.  As long as they can do this within
   the /lease-time-break time they are ok and the truncate can proceed.

2) This is a bit more subtle and something I almost delayed sending these out
   for.  Currently the implementation of a lease break actually removes the
   lease from the file.  I did not want this to happen and I was thinking of
   delaying this patch set to implement something which keeps the lease around
   but I figured I should get something out for comments.  Jan has proposed
   something along these lines and I agree with him so I'm going to ask you to
   read my response to him about the details.

   Anyway so the key here is that currently an app needs the SIGIO to retake
   the lease if they want to map the file again or in parts based on usage.
   For example, they may only want to map some of the file for when they are
   using it and then map another part later.  Without the SIGIO they would lose
   their lease or would have to just take the lease for each GUP pin (which
   adds overhead).

Re: [PATCH v9 04/12] mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap()

On Wed, Jun 05, 2019 at 02:58:21PM -0700, Dan Williams wrote:
> Allow sub-section sized ranges to be added to the memmap.
> populate_section_memmap() takes an explict pfn range rather than
> assuming a full section, and those parameters are plumbed all the way
> through to vmmemap_populate(). There should be no sub-section usage in
> current deployments. New warnings are added to clarify which memmap
> allocation paths are sub-section capable.
> 
> Cc: Michal Hocko 
> Cc: David Hildenbrand 
> Cc: Logan Gunthorpe 
> Cc: Oscar Salvador 
> Reviewed-by: Pavel Tatashin 
> Signed-off-by: Dan Williams 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

Re: [PATCH] dax: Fix xarray entry association for mixed mappings

2019-06-06 Thread Dan Williams

On Thu, Jun 6, 2019 at 2:10 AM Jan Kara  wrote:
>
> When inserting entry into xarray, we store mapping and index in
> corresponding struct pages for memory error handling. When it happened
> that one process was mapping file at PMD granularity while another
> process at PTE granularity, we could wrongly deassociate PMD range and
> then reassociate PTE range leaving the rest of struct pages in PMD range
> without mapping information which could later cause missed notifications
> about memory errors. Fix the problem by calling the association /
> deassociation code if and only if we are really going to update the
> xarray (deassociating and associating zero or empty entries is just
> no-op so there's no reason to complicate the code with trying to avoid
> the calls for these cases).

Looks good to me, I assume this also needs:

Cc: 
Fixes: d2c997c0f145 ("fs, dax: use page->mapping to warn if truncate
collides with a busy page")

>
> Signed-off-by: Jan Kara 
> ---
>  fs/dax.c | 9 -
>  1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index f74386293632..9fd908f3df32 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -728,12 +728,11 @@ static void *dax_insert_entry(struct xa_state *xas,
>
> xas_reset(xas);
> xas_lock_irq(xas);
> -   if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
> +   if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> +   void *old;
> +
> dax_disassociate_entry(entry, mapping, false);
> dax_associate_entry(new_entry, mapping, vmf->vma, 
> vmf->address);
> -   }
> -
> -   if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> /*
>  * Only swap our new entry into the page cache if the current
>  * entry is a zero page or an empty entry.  If a normal PTE or
> @@ -742,7 +741,7 @@ static void *dax_insert_entry(struct xa_state *xas,
>  * existing entry is a PMD, we will just leave the PMD in the
>  * tree and dirty it if necessary.
>  */
> -   void *old = dax_lock_entry(xas, new_entry);
> +   old = dax_lock_entry(xas, new_entry);
> WARN_ON_ONCE(old != xa_mk_value(xa_to_value(entry) |
> DAX_LOCKED));
> entry = new_entry;
> --
> 2.16.4
>
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

Re: [PATCH v9 02/12] mm/sparsemem: Add helpers track active portions of a section at boot

On Wed, Jun 05, 2019 at 02:57:59PM -0700, Dan Williams wrote:
> Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
> sub-section active bitmask, each bit representing a PMD_SIZE span of the
> architecture's memory hotplug section size.
> 
> The implications of a partially populated section is that pfn_valid()
> needs to go beyond a valid_section() check and read the sub-section
> active ranges from the bitmask. The expectation is that the bitmask
> (subsection_map) fits in the same cacheline as the valid_section() data,
> so the incremental performance overhead to pfn_valid() should be
> negligible.
> 
> Cc: Michal Hocko 
> Cc: Vlastimil Babka 
> Cc: Logan Gunthorpe 
> Cc: Oscar Salvador 
> Cc: Pavel Tatashin 
> Tested-by: Jane Chu 
> Signed-off-by: Dan Williams 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

Re: [PATCH RFC 07/10] fs/ext4: Fail truncate if pages are GUP pinned

On Thu, Jun 06, 2019 at 12:58:55PM +0200, Jan Kara wrote:
> On Wed 05-06-19 18:45:40, ira.we...@intel.com wrote:
> > From: Ira Weiny 
> > 
> > If pages are actively gup pinned fail the truncate operation.
> > 
> > Signed-off-by: Ira Weiny 
> > ---
> >  fs/ext4/inode.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 75f543f384e4..1ded83ec08c0 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -4250,6 +4250,9 @@ int ext4_break_layouts(struct inode *inode, loff_t 
> > offset, loff_t len)
> > if (!page)
> > return 0;
> >  
> > +   if (page_gup_pinned(page))
> > +   return -ETXTBSY;
> > +
> > error = ___wait_var_event(>_refcount,
> > atomic_read(>_refcount) == 1,
> > TASK_INTERRUPTIBLE, 0, 0,
> 
> This caught my eye. Does this mean that now truncate for a file which has
> temporary gup users (such buffers for DIO) can fail with ETXTBUSY?

I thought about that before and I _thought_ I had accounted for it.  But I
think you are right...

>
> That
> doesn't look desirable.

No not desirable at all...  Ah it just dawned on my why I thought it was ok...
I was wrong.  :-/

> If we would mandate layout lease while pages are
> pinned as I suggested, this could be dealt with by checking for leases with
> pins (breaking such lease would return error and not break it) and if
> breaking leases succeeds (i.e., there are no long-term pinned pages), we'd
> just wait for the remaining references as we do now.

Agreed.

But I'm going to respond with some of the challenges of this (and ideas I had)
when replying to your other email.

Ira

> 
>   Honza
> -- 
> Jan Kara 
> SUSE Labs, CR
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

Re: [PATCH RFC 03/10] mm/gup: Pass flags down to __gup_device_huge* calls

On Wed, Jun 05, 2019 at 11:18:19PM -0700, Christoph Hellwig wrote:
> On Wed, Jun 05, 2019 at 06:45:36PM -0700, ira.we...@intel.com wrote:
> > From: Ira Weiny 
> > 
> > In order to support checking for a layout lease on a FS DAX inode these
> > calls need to know if FOLL_LONGTERM was specified.
> > 
> > Prepare for this with this patch.
> 
> The GUP fast argument passing is a mess.  That is why I've come up
> with this as part of the (not ready) get_user_pages_fast_bvec
> implementation:
> 
> http://git.infradead.org/users/hch/misc.git/commitdiff/c3d019802dbde5a4cc4160e7ec8ccba479b19f97

Agreed that looks better.

And I'm sure I will have to re-roll this to deal with conflicts with this set.
But for now I needed this for the follow ons and having a nice separate little
patch like this means I can just drop it after I get your clean up!  :-D

Ira

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-06 Thread Dan Williams

On Thu, Jun 6, 2019 at 3:42 AM Jan Kara  wrote:
>
> On Wed 05-06-19 18:45:33, ira.we...@intel.com wrote:
> > From: Ira Weiny 
> >
> > ... V1,000,000   ;-)
> >
> > Pre-requisites:
> >   John Hubbard's put_user_pages() patch series.[1]
> >   Jan Kara's ext4_break_layouts() fixes[2]
> >
> > Based on the feedback from LSFmm and the LWN article which resulted.  I've
> > decided to take a slightly different tack on this problem.
> >
> > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > memory which is then truncated.  So really any solution we present which:
> >
> > A) Prevents file system corruption or data leaks
> > ...and...
> > B) Informs the user that they did something wrong
> >
> > Should be an acceptable solution.
> >
> > Because this is slightly new behavior.  And because this is gonig to be
> > specific to DAX (because of the lack of a page cache) we have made the user
> > "opt in" to this behavior.
> >
> > The following patches implement the following solution.
> >
> > 1) The user has to opt in to allowing GUP pins on a file with a layout lease
> >(now made visible).
> > 2) GUP will fail (EPERM) if a layout lease is not taken
> > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail.
> > 4) The user has the option of holding the layout lease to receive a SIGIO 
> > for
> >notification to the original thread that another thread has tried to 
> > delete
> >their data.  Furthermore this indicates that if the user needs to GUP the
> >file again they will need to retake the Layout lease before doing so.
> >
> >
> > NOTE: If the user releases the layout lease or if it has been broken by
> > another operation further GUP operations on the file will fail without
> > re-taking the lease.  This means that if a user would like to register
> > pieces of a file and continue to register other pieces later they would
> > be advised to keep the layout lease, get a SIGIO notification, and retake
> > the lease.
> >
> > NOTE2: Truncation of pages which are not actively pinned will succeed.
> > Similar to accessing an mmap to this area GUP pins of that memory may
> > fail.
>
> So after some through I'm willing accept the fact that pinned DAX pages
> will just make truncate / hole punch fail and shove it into a same bucket
> of situations like "user can open a file and unlink won't delete it" or
> "ETXTBUSY when user is executing a file being truncated".  The problem I
> have with this proposal is a lack of visibility from sysadmin POV. For
> ETXTBUSY or "unlinked but open file" sysadmin can just do lsof, find the
> problematic process and kill it. There's nothing like that with your
> proposal since currently once you hold page reference, you can unmap the
> file, drop layout lease, close the file, and there's no trace that you're
> responsible for the pinned page anymore.
>
> So I'd like to actually mandate that you *must* hold the file lease until
> you unpin all pages in the given range (not just that you have an option to
> hold a lease). And I believe the kernel should actually enforce this. That
> way we maintain a sane state that if someone uses a physical location of
> logical file offset on disk, he has a layout lease. Also once this is done,
> sysadmin has a reasonably easy way to discover run-away RDMA application
> and kill it if he wishes so.

Yes, this satisfies the primary concern that made me oppose failing
truncate. If the administrator determines that reclaiming capacity is
more important than maintaining active RDMA mappings "lsof + kill" is
a reasonable way to recover. I'd go so far as to say that anything
less is an abdication of the kernel's responsibility as an arbiter of
platform resources.

> The question is on how to exactly enforce that lease is taken until all
> pages are unpinned. I belive it could be done by tracking number of
> long-term pinned pages within a lease. Gup_longterm could easily increment
> the count when verifying the lease exists, gup_longterm users will somehow
> need to propagate corresponding 'filp' (struct file pointer) to
> put_user_pages_longterm() callsites so that they can look up appropriate
> lease to drop reference - probably I'd just transition all gup_longterm()
> users to a saner API similar to the one we have in mm/frame_vector.c where
> we don't hand out page pointers but an encapsulating structure that does
> all the necessary tracking. Removing a lease would need to block until all
> pins are released - this is probably the most hairy part since we need to
> handle a case if application just closes the file descriptor which would
> release the lease but OTOH we need to make sure task exit does not deadlock.
> Maybe we could block only on explicit lease unlock and just drop the layout
> lease on file close and if there are still pinned pages, send SIGKILL to an
> application as a reminder it did something stupid...
>
> What do people think about this?

SIGKILL on close() without explicit

Re: [PATCH RFC 07/10] fs/ext4: Fail truncate if pages are GUP pinned

On Wed 05-06-19 18:45:40, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> If pages are actively gup pinned fail the truncate operation.
> 
> Signed-off-by: Ira Weiny 
> ---
>  fs/ext4/inode.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 75f543f384e4..1ded83ec08c0 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4250,6 +4250,9 @@ int ext4_break_layouts(struct inode *inode, loff_t 
> offset, loff_t len)
>   if (!page)
>   return 0;
>  
> + if (page_gup_pinned(page))
> + return -ETXTBSY;
> +
>   error = ___wait_var_event(>_refcount,
>   atomic_read(>_refcount) == 1,
>   TASK_INTERRUPTIBLE, 0, 0,

This caught my eye. Does this mean that now truncate for a file which has
temporary gup users (such buffers for DIO) can fail with ETXTBUSY? That
doesn't look desirable. If we would mandate layout lease while pages are
pinned as I suggested, this could be dealt with by checking for leases with
pins (breaking such lease would return error and not break it) and if
breaking leases succeeds (i.e., there are no long-term pinned pages), we'd
just wait for the remaining references as we do now.

Honza
-- 
Jan Kara 
SUSE Labs, CR
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Wed 05-06-19 18:45:33, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> ... V1,000,000   ;-)
> 
> Pre-requisites:
>   John Hubbard's put_user_pages() patch series.[1]
>   Jan Kara's ext4_break_layouts() fixes[2]
> 
> Based on the feedback from LSFmm and the LWN article which resulted.  I've
> decided to take a slightly different tack on this problem.
> 
> The real issue is that there is no use case for a user to have RDMA pinn'ed
> memory which is then truncated.  So really any solution we present which:
> 
> A) Prevents file system corruption or data leaks
> ...and...
> B) Informs the user that they did something wrong
> 
> Should be an acceptable solution.
> 
> Because this is slightly new behavior.  And because this is gonig to be
> specific to DAX (because of the lack of a page cache) we have made the user
> "opt in" to this behavior.
> 
> The following patches implement the following solution.
> 
> 1) The user has to opt in to allowing GUP pins on a file with a layout lease
>(now made visible).
> 2) GUP will fail (EPERM) if a layout lease is not taken
> 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail.
> 4) The user has the option of holding the layout lease to receive a SIGIO for
>notification to the original thread that another thread has tried to delete
>their data.  Furthermore this indicates that if the user needs to GUP the
>file again they will need to retake the Layout lease before doing so.
> 
> 
> NOTE: If the user releases the layout lease or if it has been broken by
> another operation further GUP operations on the file will fail without
> re-taking the lease.  This means that if a user would like to register
> pieces of a file and continue to register other pieces later they would
> be advised to keep the layout lease, get a SIGIO notification, and retake
> the lease.
> 
> NOTE2: Truncation of pages which are not actively pinned will succeed.
> Similar to accessing an mmap to this area GUP pins of that memory may
> fail.

So after some through I'm willing accept the fact that pinned DAX pages
will just make truncate / hole punch fail and shove it into a same bucket
of situations like "user can open a file and unlink won't delete it" or
"ETXTBUSY when user is executing a file being truncated".  The problem I
have with this proposal is a lack of visibility from sysadmin POV. For
ETXTBUSY or "unlinked but open file" sysadmin can just do lsof, find the
problematic process and kill it. There's nothing like that with your
proposal since currently once you hold page reference, you can unmap the
file, drop layout lease, close the file, and there's no trace that you're
responsible for the pinned page anymore.

So I'd like to actually mandate that you *must* hold the file lease until
you unpin all pages in the given range (not just that you have an option to
hold a lease). And I believe the kernel should actually enforce this. That
way we maintain a sane state that if someone uses a physical location of
logical file offset on disk, he has a layout lease. Also once this is done,
sysadmin has a reasonably easy way to discover run-away RDMA application
and kill it if he wishes so.

The question is on how to exactly enforce that lease is taken until all
pages are unpinned. I belive it could be done by tracking number of
long-term pinned pages within a lease. Gup_longterm could easily increment
the count when verifying the lease exists, gup_longterm users will somehow
need to propagate corresponding 'filp' (struct file pointer) to
put_user_pages_longterm() callsites so that they can look up appropriate
lease to drop reference - probably I'd just transition all gup_longterm()
users to a saner API similar to the one we have in mm/frame_vector.c where
we don't hand out page pointers but an encapsulating structure that does
all the necessary tracking. Removing a lease would need to block until all
pins are released - this is probably the most hairy part since we need to
handle a case if application just closes the file descriptor which would
release the lease but OTOH we need to make sure task exit does not deadlock.
Maybe we could block only on explicit lease unlock and just drop the layout
lease on file close and if there are still pinned pages, send SIGKILL to an
application as a reminder it did something stupid...

What do people think about this?

Honza
> 
> 
> A general overview follows for background.
> 
> It should be noted that one solution for this problem is to use RDMA's On
> Demand Paging (ODP).  There are 2 big reasons this may not work.
> 
>   1) The hardware being used for RDMA may not support ODP
>   2) ODP may be detrimental to the over all network (cluster or cloud)
>  performance
> 
> Therefore, in order to support RDMA to File system pages without On Demand
> Paging (ODP) a number of things need to be done.
> 
> 1) GUP "longterm" users need to inform the

[PATCH] dax: Fix xarray entry association for mixed mappings