Re: error: redefinition of ‘dax_supported’

2020-09-21 Thread Dan Williams
On Mon, Sep 21, 2020 at 11:35 AM Nick Desaulniers
 wrote:
>
> Hello DAX maintainers,
> I noticed our PPC64LE builds failing last night:
> https://travis-ci.com/github/ClangBuiltLinux/continuous-integration/jobs/388047043
> https://travis-ci.com/github/ClangBuiltLinux/continuous-integration/jobs/388047056
> https://travis-ci.com/github/ClangBuiltLinux/continuous-integration/jobs/388047099
> and looking on lore, I see a fresh report from KernelCI against arm:
> https://lore.kernel.org/linux-next/?q=dax_supported
>
> Can you all please take a look?  More concerning is that I see this
> failure on mainline.  It may be interesting to consider how this was
> not spotted on -next.

The failure is fixed with commit 88b67edd7247 ("dax: Fix compilation
for CONFIG_DAX && !CONFIG_FS_DAX"). I rushed the fixes that led to
this regression out with insufficient exposure because the original bug
was crashing all users. I thought the two kbuild-robot reports I squashed
covered all the config combinations, but there was a straggling report
after I sent my -rc6 pull request.

The baseline process escape behind all of this was allowing a
unit-test-triggerable insta-crash to land upstream in the first place,
which then necessitated an urgent fix.


Re: [PATCH 12/20] Documentation: maintainer-entry-profile: eliminate duplicated word

2020-07-07 Thread Dan Williams
On Tue, Jul 7, 2020 at 11:07 AM Randy Dunlap  wrote:
>
> Drop the doubled word "have".
>
> Signed-off-by: Randy Dunlap 
> Cc: Jonathan Corbet 
> Cc: linux-...@vger.kernel.org
> Cc: Dan Williams 
> ---
>  Documentation/maintainer/maintainer-entry-profile.rst |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- 
> linux-next-20200701.orig/Documentation/maintainer/maintainer-entry-profile.rst
> +++ linux-next-20200701/Documentation/maintainer/maintainer-entry-profile.rst
> @@ -31,7 +31,7 @@ Example questions to consider:
>  - What branch should contributors submit against?
>  - Links to any other Maintainer Entry Profiles? For example a
>device-driver may point to an entry for its parent subsystem. This makes
> -  the contributor aware of obligations a maintainer may have have for
> +  the contributor aware of obligations a maintainer may have for
>other maintainers in the submission chain.

Acked-by: Dan Williams 


Re: [PATCH 16/20] block: move ->make_request_fn to struct block_device_operations

2020-07-01 Thread Dan Williams
On Wed, Jul 1, 2020 at 2:01 AM Christoph Hellwig  wrote:
>
> The make_request_fn is a little weird in that it sits directly in
> struct request_queue instead of an operation vector.  Replace it with
> a block_device_operations method called submit_bio (which describes much
> better what it does).  Also remove the request_queue argument to it, as
> the queue can be derived pretty trivially from the bio.
>
> Signed-off-by: Christoph Hellwig 
> ---
[..]
>  drivers/nvdimm/blk.c  |  5 +-
>  drivers/nvdimm/btt.c  |  5 +-
>  drivers/nvdimm/pmem.c |  5 +-

For drivers/nvdimm

Acked-by: Dan Williams 


Re: [PATCH v6 6/8] powerpc/pmem: Avoid the barrier in flush routines

2020-06-30 Thread Dan Williams
On Tue, Jun 30, 2020 at 8:09 PM Aneesh Kumar K.V
 wrote:
>
> On 7/1/20 1:15 AM, Dan Williams wrote:
> > On Tue, Jun 30, 2020 at 2:21 AM Aneesh Kumar K.V
> >  wrote:
> > [..]
> >>>> The bio argument isn't for range based flushing, it is for flush
> >>>> operations that need to complete asynchronously.
> >>> How does the block layer determine that the pmem device needs
> >>> asynchronous flushing?
> >>>
> >>
> >>  set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> >>
> >> and dax_synchronous(dev)
> >
> > Yes, but I think it is overkill to have an indirect function call just
> > for a single instruction.
> >
> > How about something like this instead, to share a common pmem_wmb()
> > across x86 and powerpc.
> >
> > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> > index 20ff30c2ab93..b14009060c83 100644
> > --- a/drivers/nvdimm/region_devs.c
> > +++ b/drivers/nvdimm/region_devs.c
> > @@ -1180,6 +1180,13 @@ int nvdimm_flush(struct nd_region *nd_region,
> > struct bio *bio)
> >   {
> >  int rc = 0;
> >
> > +   /*
> > +* pmem_wmb() is needed to 'sfence' all previous writes such
> > +* that they are architecturally visible for the platform buffer
> > +* flush.
> > +*/
> > +   pmem_wmb();
> > +
> >  if (!nd_region->flush)
> >  rc = generic_nvdimm_flush(nd_region);
> >  else {
> > @@ -1206,17 +1213,14 @@ int generic_nvdimm_flush(struct nd_region 
> > *nd_region)
> >  idx = this_cpu_add_return(flush_idx, hash_32(current->pid + idx, 
> > 8));
> >
> >  /*
> > -* The first wmb() is needed to 'sfence' all previous writes
> > -* such that they are architecturally visible for the platform
> > -* buffer flush.  Note that we've already arranged for pmem
> > -* writes to avoid the cache via memcpy_flushcache().  The final
> > -* wmb() ensures ordering for the NVDIMM flush write.
> > +* Note that we've already arranged for pmem writes to avoid the
> > +* cache via memcpy_flushcache().  The final wmb() ensures
> > +* ordering for the NVDIMM flush write.
> >   */
> > -   wmb();
>
>
> The series already convert this to pmem_wmb().
>
> >  for (i = 0; i < nd_region->ndr_mappings; i++)
> >  if (ndrd_get_flush_wpq(ndrd, i, 0))
> >  writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> > -   wmb();
> > +   pmem_wmb();
>
>
> Should this be pmem_wmb()? This is ordering the above writeq() right?

Correct, this can just be wmb().

>
> >
> >  return 0;
> >   }
> >
>
> This still results in two pmem_wmb() on platforms that don't have
> flush_wpq. I was trying to avoid that by adding an nd_region->flush
> callback.

How about skipping, or exiting early out of, generic_nvdimm_flush() if
ndrd->flush_wpq is NULL? That still saves an indirect branch at the
cost of another conditional, but it should still be worth it.
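
A minimal sketch of that early exit (illustrating the suggestion, not a
tested patch; the elided body is the existing flush-hint loop):

int generic_nvdimm_flush(struct nd_region *nd_region)
{
	struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);

	/*
	 * No flush hints mapped for this region; nvdimm_flush() already
	 * issued pmem_wmb(), so there is nothing left to do.
	 */
	if (!ndrd || !ndrd_get_flush_wpq(ndrd, 0, 0))
		return 0;

	/* ... existing flush_idx hashing, writeq() loop, trailing wmb() ... */
	return 0;
}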


Re: [PATCH v6 6/8] powerpc/pmem: Avoid the barrier in flush routines

2020-06-30 Thread Dan Williams
On Tue, Jun 30, 2020 at 2:21 AM Aneesh Kumar K.V
 wrote:
[..]
> >> The bio argument isn't for range based flushing, it is for flush
> >> operations that need to complete asynchronously.
> > How does the block layer determine that the pmem device needs
> > asynchronous flushing?
> >
>
> set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
>
> and dax_synchronous(dev)

Yes, but I think it is overkill to have an indirect function call just
for a single instruction.

How about something like this instead, to share a common pmem_wmb()
across x86 and powerpc.

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 20ff30c2ab93..b14009060c83 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1180,6 +1180,13 @@ int nvdimm_flush(struct nd_region *nd_region,
struct bio *bio)
 {
int rc = 0;

+   /*
+* pmem_wmb() is needed to 'sfence' all previous writes such
+* that they are architecturally visible for the platform buffer
+* flush.
+*/
+   pmem_wmb();
+
if (!nd_region->flush)
rc = generic_nvdimm_flush(nd_region);
else {
@@ -1206,17 +1213,14 @@ int generic_nvdimm_flush(struct nd_region *nd_region)
idx = this_cpu_add_return(flush_idx, hash_32(current->pid + idx, 8));

/*
-* The first wmb() is needed to 'sfence' all previous writes
-* such that they are architecturally visible for the platform
-* buffer flush.  Note that we've already arranged for pmem
-* writes to avoid the cache via memcpy_flushcache().  The final
-* wmb() ensures ordering for the NVDIMM flush write.
+* Note that we've already arranged for pmem writes to avoid the
+* cache via memcpy_flushcache().  The final wmb() ensures
+* ordering for the NVDIMM flush write.
 */
-   wmb();
for (i = 0; i < nd_region->ndr_mappings; i++)
if (ndrd_get_flush_wpq(ndrd, i, 0))
writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
-   wmb();
+   pmem_wmb();

return 0;
 }


Re: [PATCH updated] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-06-30 Thread Dan Williams
On Tue, Jun 30, 2020 at 5:48 AM Aneesh Kumar K.V
 wrote:
>
>
> Update patch.
>
> From 1e6aa6c4182e14ec5d6bf878ae44c3f69ebff745 Mon Sep 17 00:00:00 2001
> From: "Aneesh Kumar K.V" 
> Date: Tue, 12 May 2020 20:58:33 +0530
> Subject: [PATCH] libnvdimm/nvdimm/flush: Allow architecture to override the
>  flush barrier
>
> Architectures like ppc64 provide persistent memory specific barriers
> that will ensure that all stores for which the modifications are
> written to persistent storage by preceding dcbfps and dcbstps
> instructions have updated persistent storage before any data
> access or data transfer caused by subsequent instructions is initiated.
> This is in addition to the ordering done by wmb()
>
> Update nvdimm core such that architecture can use barriers other than
> wmb to ensure all previous writes are architecturally visible for
> the platform buffer flush.

Looks good, after a few minor fixups below you can add:

Reviewed-by: Dan Williams 

I'm expecting that these will be merged through the powerpc tree since
they mostly impact powerpc with only minor touches to libnvdimm.

> Signed-off-by: Aneesh Kumar K.V 
> ---
>  Documentation/memory-barriers.txt | 14 ++
>  drivers/md/dm-writecache.c|  2 +-
>  drivers/nvdimm/region_devs.c  |  8 
>  include/asm-generic/barrier.h | 10 ++
>  4 files changed, 29 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/memory-barriers.txt 
> b/Documentation/memory-barriers.txt
> index eaabc3134294..340273a6b18e 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1935,6 +1935,20 @@ There are some more advanced barrier functions:
>   relaxed I/O accessors and the Documentation/DMA-API.txt file for more
>   information on consistent memory.
>
> + (*) pmem_wmb();
> +
> + This is for use with persistent memory to ensure that stores for which
> + modifications are written to persistent storage have updated the 
> persistent
> + storage.

I think this should be:

s/updated the persistent storage/reached a platform durability domain/

> +
> + For example, after a non-temporal write to pmem region, we use 
> pmem_wmb()
> + to ensures that stores have updated the persistent storage. This ensures

s/ensures/ensure/

...and the same comment about "persistent storage", because pmem_wmb()
as implemented on x86 does not guarantee that the writes have reached
storage; it ensures that writes have reached buffers / queues that are
within the ADR (platform persistence / durability) domain.

> + that stores have updated persistent storage before any data access or
> + data transfer caused by subsequent instructions is initiated. This is
> + in addition to the ordering done by wmb().
> +
> + For load from persistent memory, existing read memory barriers are 
> sufficient
> + to ensure read ordering.
>
>  ===
>  IMPLICIT KERNEL MEMORY BARRIERS
> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> index 74f3c506f084..00534fa4a384 100644
> --- a/drivers/md/dm-writecache.c
> +++ b/drivers/md/dm-writecache.c
> @@ -536,7 +536,7 @@ static void ssd_commit_superblock(struct dm_writecache 
> *wc)
>  static void writecache_commit_flushed(struct dm_writecache *wc, bool 
> wait_for_ios)
>  {
> if (WC_MODE_PMEM(wc))
> -   wmb();
> +   pmem_wmb();
> else
> ssd_commit_flushed(wc, wait_for_ios);
>  }
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index 4502f9c4708d..2333b290bdcf 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -1206,13 +1206,13 @@ int generic_nvdimm_flush(struct nd_region *nd_region)
> idx = this_cpu_add_return(flush_idx, hash_32(current->pid + idx, 8));
>
> /*
> -* The first wmb() is needed to 'sfence' all previous writes
> -* such that they are architecturally visible for the platform
> -* buffer flush.  Note that we've already arranged for pmem
> +* The first arch_pmem_flush_barrier() is needed to 'sfence' all

One missed arch_pmem_flush_barrier() rename.

> +* previous writes such that they are architecturally visible for
> +* the platform buffer flush. Note that we've already arranged for 
> pmem
>  * writes to avoid the cache via memcpy_flushcache().  The final
>  * wmb() ensures ordering for the NVDIMM flush write.
>  */
> -   wmb();
> +   pmem_wmb();
> for (i = 0; i < nd_region->ndr_mappings; i++)
> if (ndrd_get_flu

Re: [PATCH v6 5/8] powerpc/pmem/of_pmem: Update of_pmem to use the new barrier instruction.

2020-06-30 Thread Dan Williams
On Mon, Jun 29, 2020 at 10:05 PM Aneesh Kumar K.V
 wrote:
>
> Dan Williams  writes:
>
> > On Mon, Jun 29, 2020 at 6:58 AM Aneesh Kumar K.V
> >  wrote:
> >>
> >> of_pmem on POWER10 can now use phwsync instead of hwsync to ensure
> >> all previous writes are architecturally visible for the platform
> >> buffer flush.
> >>
> >> Signed-off-by: Aneesh Kumar K.V 
> >> ---
> >>  arch/powerpc/include/asm/cacheflush.h | 7 +++
> >>  1 file changed, 7 insertions(+)
> >>
> >> diff --git a/arch/powerpc/include/asm/cacheflush.h 
> >> b/arch/powerpc/include/asm/cacheflush.h
> >> index 54764c6e922d..95782f77d768 100644
> >> --- a/arch/powerpc/include/asm/cacheflush.h
> >> +++ b/arch/powerpc/include/asm/cacheflush.h
> >> @@ -98,6 +98,13 @@ static inline void invalidate_dcache_range(unsigned 
> >> long start,
> >> mb();   /* sync */
> >>  }
> >>
> >> +#define arch_pmem_flush_barrier arch_pmem_flush_barrier
> >> +static inline void  arch_pmem_flush_barrier(void)
> >> +{
> >> +   if (cpu_has_feature(CPU_FTR_ARCH_207S))
> >> +   asm volatile(PPC_PHWSYNC ::: "memory");
> >
> > Shouldn't this fall back to a compatible store-fence in an else statement?
>
> The idea was to avoid calling this on anything else. We ensure that by
> making sure that pmem devices are not initialized on systems without that
> cpu feature. Patch 1 does that. Also, the last patch adds a WARN_ON() to
> catch the usage of this outside pmem devices and on systems without that
> cpu feature.

If patch 1 handles this, why re-check the cpu feature in this helper? If
the intent is for these routines to be generic, why not have them fall
back to the P8 barrier instructions, much like x86 clwb()? Any kernel
code can call clwb(), and it falls back to a compatible clflush() on
older cpus. I otherwise don't get the point of patch 7.
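
For illustration, the shape of fallback being suggested would be roughly
the following (a sketch of the idea, not the posted patch; the plain
"sync" stands in for whatever compatible store fence is deemed
appropriate):

#define arch_pmem_flush_barrier arch_pmem_flush_barrier
static inline void arch_pmem_flush_barrier(void)
{
	if (cpu_has_feature(CPU_FTR_ARCH_207S))
		asm volatile(PPC_PHWSYNC ::: "memory");
	else
		asm volatile("sync" ::: "memory"); /* compatible heavyweight fence */
}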


Re: [PATCH updated] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-06-30 Thread Dan Williams
On Mon, Jun 29, 2020 at 10:02 PM Aneesh Kumar K.V
 wrote:
>
> Dan Williams  writes:
>
> > On Mon, Jun 29, 2020 at 1:29 PM Aneesh Kumar K.V
> >  wrote:
> >>
> >> Architectures like ppc64 provide persistent memory specific barriers
> >> that will ensure that all stores for which the modifications are
> >> written to persistent storage by preceding dcbfps and dcbstps
> >> instructions have updated persistent storage before any data
> >> access or data transfer caused by subsequent instructions is initiated.
> >> This is in addition to the ordering done by wmb()
> >>
> >> Update nvdimm core such that architecture can use barriers other than
> >> wmb to ensure all previous writes are architecturally visible for
> >> the platform buffer flush.
> >>
> >> Signed-off-by: Aneesh Kumar K.V 
> >> ---
> >>  drivers/md/dm-writecache.c   | 2 +-
> >>  drivers/nvdimm/region_devs.c | 8 
> >>  include/linux/libnvdimm.h| 4 
> >>  3 files changed, 9 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> >> index 74f3c506f084..8c6b6dce64e2 100644
> >> --- a/drivers/md/dm-writecache.c
> >> +++ b/drivers/md/dm-writecache.c
> >> @@ -536,7 +536,7 @@ static void ssd_commit_superblock(struct dm_writecache 
> >> *wc)
> >>  static void writecache_commit_flushed(struct dm_writecache *wc, bool 
> >> wait_for_ios)
> >>  {
> >> if (WC_MODE_PMEM(wc))
> >> -   wmb();
> >> +   arch_pmem_flush_barrier();
> >> else
> >> ssd_commit_flushed(wc, wait_for_ios);
> >>  }
> >> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> >> index 4502f9c4708d..b308ad09b63d 100644
> >> --- a/drivers/nvdimm/region_devs.c
> >> +++ b/drivers/nvdimm/region_devs.c
> >> @@ -1206,13 +1206,13 @@ int generic_nvdimm_flush(struct nd_region 
> >> *nd_region)
> >> idx = this_cpu_add_return(flush_idx, hash_32(current->pid + idx, 
> >> 8));
> >>
> >> /*
> >> -* The first wmb() is needed to 'sfence' all previous writes
> >> -* such that they are architecturally visible for the platform
> >> -* buffer flush.  Note that we've already arranged for pmem
> >> +* The first arch_pmem_flush_barrier() is needed to 'sfence' all
> >> +* previous writes such that they are architecturally visible for
> >> +* the platform buffer flush. Note that we've already arranged for 
> >> pmem
> >>  * writes to avoid the cache via memcpy_flushcache().  The final
> >>  * wmb() ensures ordering for the NVDIMM flush write.
> >>  */
> >> -   wmb();
> >> +   arch_pmem_flush_barrier();
> >> for (i = 0; i < nd_region->ndr_mappings; i++)
> >> if (ndrd_get_flush_wpq(ndrd, i, 0))
> >> writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> >> diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
> >> index 18da4059be09..66f6c65bd789 100644
> >> --- a/include/linux/libnvdimm.h
> >> +++ b/include/linux/libnvdimm.h
> >> @@ -286,4 +286,8 @@ static inline void arch_invalidate_pmem(void *addr, 
> >> size_t size)
> >>  }
> >>  #endif
> >>
> >> +#ifndef arch_pmem_flush_barrier
> >> +#define arch_pmem_flush_barrier() wmb()
> >> +#endif
> >
> > I think it is out of place to define this in libnvdimm.h, and it is odd
> > to give it such a long name. The other pmem api helpers like
> > arch_wb_cache_pmem() and arch_invalidate_pmem() are function calls for
> > libnvdimm driver operations; this barrier is just an instruction and
> > is closer to wmb() than to the pmem api routines.
> >
> > Since it is a store fence for pmem, let's just call it pmem_wmb()
> > and define the generic version in include/linux/compiler.h. It should
> > probably also be documented alongside dma_wmb() in
> > Documentation/memory-barriers.txt, explaining why code would use it over
> > wmb() and why a symmetric pmem_rmb() is not needed.
>
> How about the below? I used pmem_barrier() instead of pmem_wmb().

Why? A barrier() is a bi-directional ordering mechanism for reads and
writes, and the proposed mechanism only orders writes +
persistence. Otherwise the default fallback to wmb() on archs that
don't override it does not make sense.

Re: [PATCH v6 7/8] powerpc/pmem: Add WARN_ONCE to catch the wrong usage of pmem flush functions.

2020-06-29 Thread Dan Williams
On Mon, Jun 29, 2020 at 6:58 AM Aneesh Kumar K.V
 wrote:
>
> We only support persistent memory on P8 and above. This is enforced by the
> firmware and further checked on virtualized platforms during platform init.
> Add WARN_ONCE in pmem flush routines to catch the wrong usage of these.
>
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/cacheflush.h | 2 ++
>  arch/powerpc/lib/pmem.c   | 2 ++
>  2 files changed, 4 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/cacheflush.h 
> b/arch/powerpc/include/asm/cacheflush.h
> index 95782f77d768..1ab0fa660497 100644
> --- a/arch/powerpc/include/asm/cacheflush.h
> +++ b/arch/powerpc/include/asm/cacheflush.h
> @@ -103,6 +103,8 @@ static inline void  arch_pmem_flush_barrier(void)
>  {
> if (cpu_has_feature(CPU_FTR_ARCH_207S))
> asm volatile(PPC_PHWSYNC ::: "memory");
> +   else
> +   WARN_ONCE(1, "Using pmem flush on older hardware.");

This seems too late to be making this determination. I'd expect the
driver to fail to bind by default if this constraint is not met.


Re: [PATCH v6 6/8] powerpc/pmem: Avoid the barrier in flush routines

2020-06-29 Thread Dan Williams
On Mon, Jun 29, 2020 at 1:41 PM Aneesh Kumar K.V
 wrote:
>
> Michal Suchánek  writes:
>
> > Hello,
> >
> > On Mon, Jun 29, 2020 at 07:27:20PM +0530, Aneesh Kumar K.V wrote:
> >> nvdimm expects the flush routines to just mark the cache clean. The barrier
> >> that marks the store globally visible is done in nvdimm_flush().
> >>
> >> Update the papr_scm driver to a simplified nvdimm_flush callback that does
> >> only the required barrier.
> >>
> >> Signed-off-by: Aneesh Kumar K.V 
> >> ---
> >>  arch/powerpc/lib/pmem.c   |  6 --
> >>  arch/powerpc/platforms/pseries/papr_scm.c | 13 +
> >>  2 files changed, 13 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c
> >> index 5a61aaeb6930..21210fa676e5 100644
> >> --- a/arch/powerpc/lib/pmem.c
> >> +++ b/arch/powerpc/lib/pmem.c
> >> @@ -19,9 +19,6 @@ static inline void __clean_pmem_range(unsigned long 
> >> start, unsigned long stop)
> >>
> >>  for (i = 0; i < size >> shift; i++, addr += bytes)
> >>  asm volatile(PPC_DCBSTPS(%0, %1): :"i"(0), "r"(addr): 
> >> "memory");
> >> -
> >> -
> >> -asm volatile(PPC_PHWSYNC ::: "memory");
> >>  }
> >>
> >>  static inline void __flush_pmem_range(unsigned long start, unsigned long 
> >> stop)
> >> @@ -34,9 +31,6 @@ static inline void __flush_pmem_range(unsigned long 
> >> start, unsigned long stop)
> >>
> >>  for (i = 0; i < size >> shift; i++, addr += bytes)
> >>  asm volatile(PPC_DCBFPS(%0, %1): :"i"(0), "r"(addr): 
> >> "memory");
> >> -
> >> -
> >> -asm volatile(PPC_PHWSYNC ::: "memory");
> >>  }
> >>
> >>  static inline void clean_pmem_range(unsigned long start, unsigned long 
> >> stop)
> >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> >> b/arch/powerpc/platforms/pseries/papr_scm.c
> >> index 9c569078a09f..9a9a0766f8b6 100644
> >> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> >> @@ -630,6 +630,18 @@ static int papr_scm_ndctl(struct 
> >> nvdimm_bus_descriptor *nd_desc,
> >>
> >>  return 0;
> >>  }
> >> +/*
> >> + * We have made sure the pmem writes are done such that before calling 
> >> this
> >> + * all the caches are flushed/clean. We use dcbf/dcbfps to ensure this. 
> >> Here
> >> + * we just need to add the necessary barrier to make sure the above 
> >> flushes
> >> + * are have updated persistent storage before any data access or data 
> >> transfer
> >> + * caused by subsequent instructions is initiated.
> >> + */
> >> +static int papr_scm_flush_sync(struct nd_region *nd_region, struct bio 
> >> *bio)
> >> +{
> >> +arch_pmem_flush_barrier();
> >> +return 0;
> >> +}
> >>
> >>  static ssize_t flags_show(struct device *dev,
> >>struct device_attribute *attr, char *buf)
> >> @@ -743,6 +755,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv 
> >> *p)
> >>  ndr_desc.mapping = &mapping;
> >>  ndr_desc.num_mappings = 1;
> >>  ndr_desc.nd_set = &p->nd_set;
> >> +ndr_desc.flush = papr_scm_flush_sync;
> >
> > AFAICT currently the only device that implements flush is virtio_pmem.
> > How does the nfit driver get away without implementing flush?
>
> generic_nvdimm_flush does the required barrier for nfit. The reason for
> adding the ndr_desc.flush callback for papr_scm was to avoid the usage
> of iomem based deep flushing (ndr_region_data.flush_wpq) which is not
> supported by papr_scm.
>
> BTW we do return NULL for ndrd_get_flush_wpq() on power. So the upstream
> code also does the same thing, but in a different way.
>
>
> > Also the flush takes arguments that are completely unused but a user of
> > the pmem region must assume they are used, and call flush() on the
> > region rather than arch_pmem_flush_barrier() directly.
>
> The bio argument can help a pmem driver to do range based flushing in
> case of pmem_make_request. If bio is null then we must assume a full
> device flush.

The bio argument isn't for range based flushing, it is for flush
operations that need to complete asynchronously.

There's no mechanism for the block layer to communicate range based
cache flushing, block-device flushing is assumed to be the device's
entire cache. For pmem that would be the entirety of the cpu cache.
Instead of modeling the cpu cache as a storage device cache it is
modeled as page-cache. Once the fs-layer writes back page-cache /
cpu-cache the storage device is only responsible for flushing those
cache-writes into the persistence domain.

Additionally there is a concept of deep-flush that relegates some
power-fail scenarios to a smaller failure domain. For example consider
the difference between a write arriving at the head of a device-queue
and successfully traversing a device-queue to media. The expectation
of pmem applications is that data is persisted once they reach the
equivalent of the x86 ADR domain, deep-flush is past ADR.


Re: [PATCH v6 5/8] powerpc/pmem/of_pmem: Update of_pmem to use the new barrier instruction.

2020-06-29 Thread Dan Williams
On Mon, Jun 29, 2020 at 6:58 AM Aneesh Kumar K.V
 wrote:
>
> of_pmem on POWER10 can now use phwsync instead of hwsync to ensure
> all previous writes are architecturally visible for the platform
> buffer flush.
>
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/cacheflush.h | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/cacheflush.h 
> b/arch/powerpc/include/asm/cacheflush.h
> index 54764c6e922d..95782f77d768 100644
> --- a/arch/powerpc/include/asm/cacheflush.h
> +++ b/arch/powerpc/include/asm/cacheflush.h
> @@ -98,6 +98,13 @@ static inline void invalidate_dcache_range(unsigned long 
> start,
> mb();   /* sync */
>  }
>
> +#define arch_pmem_flush_barrier arch_pmem_flush_barrier
> +static inline void  arch_pmem_flush_barrier(void)
> +{
> +   if (cpu_has_feature(CPU_FTR_ARCH_207S))
> +   asm volatile(PPC_PHWSYNC ::: "memory");

Shouldn't this fall back to a compatible store-fence in an else statement?


Re: [PATCH updated] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-06-29 Thread Dan Williams
On Mon, Jun 29, 2020 at 1:29 PM Aneesh Kumar K.V
 wrote:
>
> Architectures like ppc64 provide persistent memory specific barriers
> that will ensure that all stores for which the modifications are
> written to persistent storage by preceding dcbfps and dcbstps
> instructions have updated persistent storage before any data
> access or data transfer caused by subsequent instructions is initiated.
> This is in addition to the ordering done by wmb()
>
> Update nvdimm core such that architecture can use barriers other than
> wmb to ensure all previous writes are architecturally visible for
> the platform buffer flush.
>
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  drivers/md/dm-writecache.c   | 2 +-
>  drivers/nvdimm/region_devs.c | 8 
>  include/linux/libnvdimm.h| 4 
>  3 files changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> index 74f3c506f084..8c6b6dce64e2 100644
> --- a/drivers/md/dm-writecache.c
> +++ b/drivers/md/dm-writecache.c
> @@ -536,7 +536,7 @@ static void ssd_commit_superblock(struct dm_writecache 
> *wc)
>  static void writecache_commit_flushed(struct dm_writecache *wc, bool 
> wait_for_ios)
>  {
> if (WC_MODE_PMEM(wc))
> -   wmb();
> +   arch_pmem_flush_barrier();
> else
> ssd_commit_flushed(wc, wait_for_ios);
>  }
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index 4502f9c4708d..b308ad09b63d 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -1206,13 +1206,13 @@ int generic_nvdimm_flush(struct nd_region *nd_region)
> idx = this_cpu_add_return(flush_idx, hash_32(current->pid + idx, 8));
>
> /*
> -* The first wmb() is needed to 'sfence' all previous writes
> -* such that they are architecturally visible for the platform
> -* buffer flush.  Note that we've already arranged for pmem
> +* The first arch_pmem_flush_barrier() is needed to 'sfence' all
> +* previous writes such that they are architecturally visible for
> +* the platform buffer flush. Note that we've already arranged for 
> pmem
>  * writes to avoid the cache via memcpy_flushcache().  The final
>  * wmb() ensures ordering for the NVDIMM flush write.
>  */
> -   wmb();
> +   arch_pmem_flush_barrier();
> for (i = 0; i < nd_region->ndr_mappings; i++)
> if (ndrd_get_flush_wpq(ndrd, i, 0))
> writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
> index 18da4059be09..66f6c65bd789 100644
> --- a/include/linux/libnvdimm.h
> +++ b/include/linux/libnvdimm.h
> @@ -286,4 +286,8 @@ static inline void arch_invalidate_pmem(void *addr, 
> size_t size)
>  }
>  #endif
>
> +#ifndef arch_pmem_flush_barrier
> +#define arch_pmem_flush_barrier() wmb()
> +#endif

I think it is out of place to define this in libnvdimm.h, and it is odd
to give it such a long name. The other pmem api helpers like
arch_wb_cache_pmem() and arch_invalidate_pmem() are function calls for
libnvdimm driver operations; this barrier is just an instruction and
is closer to wmb() than to the pmem api routines.

Since it is a store fence for pmem, let's just call it pmem_wmb()
and define the generic version in include/linux/compiler.h. It should
probably also be documented alongside dma_wmb() in
Documentation/memory-barriers.txt, explaining why code would use it over
wmb() and why a symmetric pmem_rmb() is not needed.
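
A minimal sketch of that generic fallback (the header placement is the
suggestion under discussion, not a settled location):

/* Architectures that have a pmem-specific store fence override pmem_wmb();
 * everyone else falls back to a regular write barrier. */
#ifndef pmem_wmb
#define pmem_wmb()	wmb()
#endif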


Re: [PATCH v13 2/6] seq_buf: Export seq_buf_printf

2020-06-15 Thread Dan Williams
On Mon, Jun 15, 2020 at 5:56 AM Borislav Petkov  wrote:
>
> On Mon, Jun 15, 2020 at 06:14:03PM +0530, Vaibhav Jain wrote:
> > 'seq_buf' provides a very useful abstraction for writing to a string
> > buffer without needing to worry about it over-flowing. However even
> > though the API has been stable for a couple of years now, it's still not
> > exported to kernel loadable modules, limiting its usage.
> >
> > Hence this patch proposes update to 'seq_buf.c' to mark
> > seq_buf_printf() which is part of the seq_buf API to be exported to
> > kernel loadable GPL modules. This symbol will be used in later parts
> > of this patch-set to simplify content creation for a sysfs attribute.
> >
> > Cc: Piotr Maziarz 
> > Cc: Cezary Rojewski 
> > Cc: Christoph Hellwig 
> > Cc: Steven Rostedt 
> > Cc: Borislav Petkov 
> > Acked-by: Steven Rostedt (VMware) 
> > Signed-off-by: Vaibhav Jain 
> > ---
> > Changelog:
> >
> > v12..v13:
> > * None
> >
> > v11..v12:
> > * None
>
> Can you please resend your patchset once a week like everyone else and
> not flood inboxes with it?

Hi Boris,

I gave Vaibhav some long-shot hope that his series could be included
in my libnvdimm pull request for -rc1. Save for a last-minute clang
report that I misread as a gcc warning, I likely would have included it.
This spin looks to address the last of the comments I had and is
something I would consider for -rc2. So, in this case the resends were
requested by me, and I'll take the grumbles on Vaibhav's behalf.


Re: [PATCH v11 5/6] ndctl/papr_scm, uapi: Add support for PAPR nvdimm specific methods

2020-06-10 Thread Dan Williams
On Wed, Jun 10, 2020 at 5:10 AM Vaibhav Jain  wrote:
>
> Dan Williams  writes:
>
> > On Tue, Jun 9, 2020 at 10:54 AM Vaibhav Jain  wrote:
> >>
> >> Thanks Dan for the consideration and taking time to look into this.
> >>
> >> My responses below:
> >>
> >> Dan Williams  writes:
> >>
> >> > On Mon, Jun 8, 2020 at 5:16 PM kernel test robot  wrote:
> >> >>
> >> >> Hi Vaibhav,
> >> >>
> >> >> Thank you for the patch! Perhaps something to improve:
> >> >>
> >> >> [auto build test WARNING on powerpc/next]
> >> >> [also build test WARNING on linus/master v5.7 next-20200605]
> >> >> [cannot apply to linux-nvdimm/libnvdimm-for-next scottwood/next]
> >> >> [if your patch is applied to the wrong git tree, please drop us a note 
> >> >> to help
> >> >> improve the system. BTW, we also suggest to use '--base' option to 
> >> >> specify the
> >> >> base tree in git format-patch, please see 
> >> >> https://stackoverflow.com/a/37406982]
> >> >>
> >> >> url:
> >> >> https://github.com/0day-ci/linux/commits/Vaibhav-Jain/powerpc-papr_scm-Add-support-for-reporting-nvdimm-health/20200607-211653
> >> >> base:   
> >> >> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
> >> >> config: powerpc-randconfig-r016-20200607 (attached as .config)
> >> >> compiler: clang version 11.0.0 (https://github.com/llvm/llvm-project 
> >> >> e429cffd4f228f70c1d9df0e5d77c08590dd9766)
> >> >> reproduce (this is a W=1 build):
> >> >> wget 
> >> >> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross
> >> >>  -O ~/bin/make.cross
> >> >> chmod +x ~/bin/make.cross
> >> >> # install powerpc cross compiling tool for clang build
> >> >> # apt-get install binutils-powerpc-linux-gnu
> >> >> # save the attached .config to linux build tree
> >> >> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross 
> >> >> ARCH=powerpc
> >> >>
> >> >> If you fix the issue, kindly add following tag as appropriate
> >> >> Reported-by: kernel test robot 
> >> >>
> >> >> All warnings (new ones prefixed by >>, old ones prefixed by <<):
> >> >>
> >> >> In file included from :1:
> >> >> >> ./usr/include/asm/papr_pdsm.h:69:20: warning: field 'hdr' with 
> >> >> >> variable sized type 'struct nd_cmd_pkg' not at the end of a struct 
> >> >> >> or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
> >> >> struct nd_cmd_pkg hdr;  /* Package header containing sub-cmd */
> >> >
> >> > Hi Vaibhav,
> >> >
> >> [.]
> >> > This looks like it's going to need another round to get this fixed. I
> >> > don't think 'struct nd_pdsm_cmd_pkg' should embed a definition of
> >> > 'struct nd_cmd_pkg'. An instance of 'struct nd_cmd_pkg' carries a
> >> > payload that is the 'pdsm' specifics. As the code has it now it's
> >> > defined as a superset of 'struct nd_cmd_pkg' and the compiler warning
> >> > is pointing out a real 'struct' organization problem.
> >> >
> >> > Given the soak time needed in -next after the code is finalized,
> >> > there's no time to do another round of updates and still make the v5.8
> >> > merge window.
> >>
> >> Agreed that this looks bad, a solution will probably need some more
> >> review cycles resulting in this series missing the merge window.
> >>
> >> I am investigating into the possible solutions for this reported issue
> >> and made few observations:
> >>
> >> I see command pkg for Intel, Hpe, Msft and Hyperv families using a
> >> similar layout of embedding nd_cmd_pkg at the head of the
> >> command-pkg. struct nd_pdsm_cmd_pkg is following the same pattern.
> >>
> >> struct nd_pdsm_cmd_pkg {
> >> struct nd_cmd_pkg hdr;
> >> /* other members */
> >> };
> >>
> >> struct ndn_pkg_msft {
> >> struct nd_cmd_pkg gen;
> >> /* other members */
> >> };
> >> struct nd_pkg_intel {
> >> struct nd_cmd_pkg gen;
> >> /

Re: [PATCH v11 5/6] ndctl/papr_scm, uapi: Add support for PAPR nvdimm specific methods

2020-06-09 Thread Dan Williams
On Tue, Jun 9, 2020 at 10:54 AM Vaibhav Jain  wrote:
>
> Thanks Dan for the consideration and taking time to look into this.
>
> My responses below:
>
> Dan Williams  writes:
>
> > On Mon, Jun 8, 2020 at 5:16 PM kernel test robot  wrote:
> >>
> >> Hi Vaibhav,
> >>
> >> Thank you for the patch! Perhaps something to improve:
> >>
> >> [auto build test WARNING on powerpc/next]
> >> [also build test WARNING on linus/master v5.7 next-20200605]
> >> [cannot apply to linux-nvdimm/libnvdimm-for-next scottwood/next]
> >> [if your patch is applied to the wrong git tree, please drop us a note to 
> >> help
> >> improve the system. BTW, we also suggest to use '--base' option to specify 
> >> the
> >> base tree in git format-patch, please see 
> >> https://stackoverflow.com/a/37406982]
> >>
> >> url:
> >> https://github.com/0day-ci/linux/commits/Vaibhav-Jain/powerpc-papr_scm-Add-support-for-reporting-nvdimm-health/20200607-211653
> >> base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
> >> next
> >> config: powerpc-randconfig-r016-20200607 (attached as .config)
> >> compiler: clang version 11.0.0 (https://github.com/llvm/llvm-project 
> >> e429cffd4f228f70c1d9df0e5d77c08590dd9766)
> >> reproduce (this is a W=1 build):
> >> wget 
> >> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross 
> >> -O ~/bin/make.cross
> >> chmod +x ~/bin/make.cross
> >> # install powerpc cross compiling tool for clang build
> >> # apt-get install binutils-powerpc-linux-gnu
> >> # save the attached .config to linux build tree
> >> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross 
> >> ARCH=powerpc
> >>
> >> If you fix the issue, kindly add following tag as appropriate
> >> Reported-by: kernel test robot 
> >>
> >> All warnings (new ones prefixed by >>, old ones prefixed by <<):
> >>
> >> In file included from :1:
> >> >> ./usr/include/asm/papr_pdsm.h:69:20: warning: field 'hdr' with variable 
> >> >> sized type 'struct nd_cmd_pkg' not at the end of a struct or class is a 
> >> >> GNU extension [-Wgnu-variable-sized-type-not-at-end]
> >> struct nd_cmd_pkg hdr;  /* Package header containing sub-cmd */
> >
> > Hi Vaibhav,
> >
> [.]
> > This looks like it's going to need another round to get this fixed. I
> > don't think 'struct nd_pdsm_cmd_pkg' should embed a definition of
> > 'struct nd_cmd_pkg'. An instance of 'struct nd_cmd_pkg' carries a
> > payload that is the 'pdsm' specifics. As the code has it now it's
> > defined as a superset of 'struct nd_cmd_pkg' and the compiler warning
> > is pointing out a real 'struct' organization problem.
> >
> > Given the soak time needed in -next after the code is finalized,
> > there's no time to do another round of updates and still make the v5.8
> > merge window.
>
> Agreed that this looks bad, a solution will probably need some more
> review cycles resulting in this series missing the merge window.
>
> I am investigating into the possible solutions for this reported issue
> and made few observations:
>
> I see command pkg for Intel, Hpe, Msft and Hyperv families using a
> similar layout of embedding nd_cmd_pkg at the head of the
> command-pkg. struct nd_pdsm_cmd_pkg is following the same pattern.
>
> struct nd_pdsm_cmd_pkg {
> struct nd_cmd_pkg hdr;
> /* other members */
> };
>
> struct ndn_pkg_msft {
> struct nd_cmd_pkg gen;
> /* other members */
> };
> struct nd_pkg_intel {
> struct nd_cmd_pkg gen;
> /* other members */
> };
> struct ndn_pkg_hpe1 {
> struct nd_cmd_pkg gen;
> /* other members */

In those cases the other members are a union and there is no second
variable length array. Perhaps that is why those definitions are not
getting flagged? I'm not seeing anything in ndctl build options that
would explicitly disable this warning, but I'm not sure if the ndctl
build environment is missing this build warning by accident.
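
For reference, the contrast being drawn is roughly the following (a
sketch; the member names besides the embedded header are placeholders,
not the real uapi definitions):

/* Layout that triggers the warning: the variable-sized nd_cmd_pkg header
 * is embedded and followed by more (variable-length) payload. */
struct pdsm_style_pkg {
	struct nd_cmd_pkg hdr;		/* ends in a flexible array member */
	__u8 more_payload[];		/* second variable-length region */
};

/* Pattern used by the existing vendor families: a fixed-size union follows
 * the header, so there is no second variable-length array. */
struct vendor_family_pkg {
	struct nd_cmd_pkg gen;
	union {
		__u8 buf[128];		/* command-specific views of the payload */
	};
};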

Those variable size payloads are also not being used in any code paths
that would look at the size of the command payload, like the kernel
ioctl() path. The payload validation code needs static sizes and the
payload parsing code wants to cast the payload to a known type. I
don't think you can use the same struct definition for both those
cases which is why the ndctl parsing code uses the union layout, but
the kernel command marsha

Re: [PATCH v11 5/6] ndctl/papr_scm, uapi: Add support for PAPR nvdimm specific methods

2020-06-08 Thread Dan Williams
On Mon, Jun 8, 2020 at 5:16 PM kernel test robot  wrote:
>
> Hi Vaibhav,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on powerpc/next]
> [also build test WARNING on linus/master v5.7 next-20200605]
> [cannot apply to linux-nvdimm/libnvdimm-for-next scottwood/next]
> [if your patch is applied to the wrong git tree, please drop us a note to help
> improve the system. BTW, we also suggest to use '--base' option to specify the
> base tree in git format-patch, please see 
> https://stackoverflow.com/a/37406982]
>
> url:
> https://github.com/0day-ci/linux/commits/Vaibhav-Jain/powerpc-papr_scm-Add-support-for-reporting-nvdimm-health/20200607-211653
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
> config: powerpc-randconfig-r016-20200607 (attached as .config)
> compiler: clang version 11.0.0 (https://github.com/llvm/llvm-project 
> e429cffd4f228f70c1d9df0e5d77c08590dd9766)
> reproduce (this is a W=1 build):
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # install powerpc cross compiling tool for clang build
> # apt-get install binutils-powerpc-linux-gnu
> # save the attached .config to linux build tree
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross 
> ARCH=powerpc
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
>
> All warnings (new ones prefixed by >>, old ones prefixed by <<):
>
> In file included from :1:
> >> ./usr/include/asm/papr_pdsm.h:69:20: warning: field 'hdr' with variable 
> >> sized type 'struct nd_cmd_pkg' not at the end of a struct or class is a 
> >> GNU extension [-Wgnu-variable-sized-type-not-at-end]
> struct nd_cmd_pkg hdr;  /* Package header containing sub-cmd */

Hi Vaibhav,

This looks like it's going to need another round to get this fixed. I
don't think 'struct nd_pdsm_cmd_pkg' should embed a definition of
'struct nd_cmd_pkg'. An instance of 'struct nd_cmd_pkg' carries a
payload that is the 'pdsm' specifics. As the code has it now it's
defined as a superset of 'struct nd_cmd_pkg' and the compiler warning
is pointing out a real 'struct' organization problem.

Given the soak time needed in -next after the code is finalized,
there's no time to do another round of updates and still make the v5.8
merge window.


Re: [PATCH v10 5/6] ndctl/papr_scm, uapi: Add support for PAPR nvdimm specific methods

2020-06-05 Thread Dan Williams
On Fri, Jun 5, 2020 at 12:50 PM Ira Weiny  wrote:
>
> On Fri, Jun 05, 2020 at 05:11:35AM +0530, Vaibhav Jain wrote:
> > Introduce support for PAPR NVDIMM Specific Methods (PDSM) in papr_scm
> > module and add the command family NVDIMM_FAMILY_PAPR to the white list
> > of NVDIMM command sets. Also advertise support for ND_CMD_CALL for the
> > nvdimm command mask and implement necessary scaffolding in the module
> > to handle ND_CMD_CALL ioctl and PDSM requests that we receive.
> >
> > The layout of the PDSM request as we expect from libnvdimm/libndctl is
> > described in newly introduced uapi header 'papr_pdsm.h' which
> > defines a new 'struct nd_pdsm_cmd_pkg' header. This header is used
> > to communicate the PDSM request via member
> > 'nd_cmd_pkg.nd_command' and size of payload that need to be
> > sent/received for servicing the PDSM.
> >
> > A new function is_cmd_valid() is implemented that reads the args to
> > papr_scm_ndctl() and performs sanity tests on them. A new function
> > papr_scm_service_pdsm() is introduced and is called from
> > papr_scm_ndctl() in case of a PDSM request is received via ND_CMD_CALL
> > command from libnvdimm.
> >
> > Cc: "Aneesh Kumar K . V" 
> > Cc: Dan Williams 
> > Cc: Michael Ellerman 
> > Cc: Ira Weiny 
> > Signed-off-by: Vaibhav Jain 
> > ---
> > Changelog:
> >
> > v9..v10:
> > * Simplified 'struct nd_pdsm_cmd_pkg' by removing the
> >   'payload_version' field.
> > * Removed the corrosponding documentation on versioning and backward
> >   compatibility from 'papr_pdsm.h'
> > * Reduced the size of reserved fields to 4-bytes making 'struct
> >   nd_pdsm_cmd_pkg' 64 + 8 bytes long.
> > * Updated is_cmd_valid() to enforce validation checks on pdsm
> >   commands. [ Dan Williams ]
> > * Added check for reserved fields being set to '0' in is_cmd_valid()
> >   [ Ira ]
> > * Moved changes for checking cmd_rc == NULL and logging improvements
> >   to a separate prelim patch [ Ira ].
> > * Moved  pdsm package validation checks from papr_scm_service_pdsm()
> >   to is_cmd_valid().
> > * Marked papr_scm_service_pdsm() return type as 'void' since errors
> >   are reported in nd_pdsm_cmd_pkg.cmd_status field.
> >
> > Resend:
> > * Added ack from Aneesh.
> >
> > v8..v9:
> > * Reduced the usage of term SCM replacing it with appropriate
> >   replacement [ Dan Williams, Aneesh ]
> > * Renamed 'papr_scm_pdsm.h' to 'papr_pdsm.h'
> > * s/PAPR_SCM_PDSM_*/PAPR_PDSM_*/g
> > * s/NVDIMM_FAMILY_PAPR_SCM/NVDIMM_FAMILY_PAPR/g
> > * Minor updates to 'papr_psdm.h' to replace usage of term 'SCM'.
> > * Minor update to patch description.
> >
> > v7..v8:
> > * Removed the 'payload_offset' field from 'struct
> >   nd_pdsm_cmd_pkg'. Instead command payload is always assumed to start
> >   at 'nd_pdsm_cmd_pkg.payload'. [ Aneesh ]
> > * To enable introducing new fields to 'struct nd_pdsm_cmd_pkg',
> >   'reserved' field of 10-bytes is introduced. [ Aneesh ]
> > * Fixed a typo in "Backward Compatibility" section of papr_scm_pdsm.h
> >   [ Ira ]
> >
> > Resend:
> > * None
> >
> > v6..v7 :
> > * Removed the re-definitions of __packed macro from papr_scm_pdsm.h
> >   [Mpe].
> > * Removed the usage of __KERNEL__ macros in papr_scm_pdsm.h [Mpe].
> > * Removed macros that were unused in papr_scm.c from papr_scm_pdsm.h
> >   [Mpe].
> > * Made functions defined in papr_scm_pdsm.h as static inline. [Mpe]
> >
> > v5..v6 :
> > * Changed the usage of the term DSM to PDSM to distinguish it from the
> >   ACPI term [ Dan Williams ]
> > * Renamed papr_scm_dsm.h to papr_scm_pdsm.h and updated various struct
> >   to reflect the new terminology.
> > * Updated the patch description and title to reflect the new terminology.
> > * Squashed patch to introduce new command family in 'ndctl.h' with
> >   this patch [ Dan Williams ]
> > * Updated the papr_scm_pdsm method starting index from 0x1 to 0x0
> >   [ Dan Williams ]
> > * Removed redundant license text from the papr_scm_psdm.h file.
> >   [ Dan Williams ]
> > * s/envelop/envelope/ at various places [ Dan Williams ]
> > * Added '__packed' attribute to command package header to guard
> >   against different compiler adding paddings between the fields.
> >   [ Dan Williams]
> > * Converted various pr_debug to dev_debug [ Dan Williams ]
> >
> > v4..v5 :
> > * None
> >
> > v3..v4 :
> > * None
> >
> > v2..v3 :
> > * Updated the patch prefi

Re: [PATCH v10 4/6] powerpc/papr_scm: Improve error logging and handling papr_scm_ndctl()

2020-06-05 Thread Dan Williams
On Fri, Jun 5, 2020 at 10:13 AM Ira Weiny  wrote:
>
> On Fri, Jun 05, 2020 at 05:11:34AM +0530, Vaibhav Jain wrote:
> > Since papr_scm_ndctl() can be called from outside papr_scm, its
> > exposed to the possibility of receiving NULL as value of 'cmd_rc'
> > argument. This patch updates papr_scm_ndctl() to protect against such
> > possibility by assigning it pointer to a local variable in case cmd_rc
> > == NULL.
> >
> > Finally the patch also updates the 'default' clause of the switch-case
> > block removing a 'return' statement thereby ensuring that value of
> > 'cmd_rc' is always logged when papr_scm_ndctl() returns.
> >
> > Cc: "Aneesh Kumar K . V" 
> > Cc: Dan Williams 
> > Cc: Michael Ellerman 
> > Cc: Ira Weiny 
> > Signed-off-by: Vaibhav Jain 
> > ---
> > Changelog:
> >
> > v9..v10
> > * New patch in the series
>
> Thanks for making this a separate patch it is easier to see what is going on
> here.
>
> > ---
> >  arch/powerpc/platforms/pseries/papr_scm.c | 10 --
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> > b/arch/powerpc/platforms/pseries/papr_scm.c
> > index 0c091622b15e..6512fe6a2874 100644
> > --- a/arch/powerpc/platforms/pseries/papr_scm.c
> > +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> > @@ -355,11 +355,16 @@ static int papr_scm_ndctl(struct 
> > nvdimm_bus_descriptor *nd_desc,
> >  {
> >   struct nd_cmd_get_config_size *get_size_hdr;
> >   struct papr_scm_priv *p;
> > + int rc;
> >
> >   /* Only dimm-specific calls are supported atm */
> >   if (!nvdimm)
> >   return -EINVAL;
> >
> > + /* Use a local variable in case cmd_rc pointer is NULL */
> > + if (!cmd_rc)
> > + cmd_rc = &rc;
> > +
>
> This protects you from the NULL.  However...
>
> >   p = nvdimm_provider_data(nvdimm);
> >
> >   switch (cmd) {
> > @@ -381,12 +386,13 @@ static int papr_scm_ndctl(struct 
> > nvdimm_bus_descriptor *nd_desc,
> >   break;
> >
> >   default:
> > - return -EINVAL;
> > + dev_dbg(&p->pdev->dev, "Unknown command = %d\n", cmd);
> > + *cmd_rc = -EINVAL;
>
> ... I think you are conflating rc and cmd_rc...
>
> >   }
> >
> >   dev_dbg(&p->pdev->dev, "returned with cmd_rc = %d\n", *cmd_rc);
> >
> > - return 0;
> > + return *cmd_rc;
>
> ... this changes the behavior of the current commands.  Now if the underlying
> papr_scm_meta_[get|set]() fails you return that failure as rc rather than 0.
>
> Is that ok?

The expectation is that rc is "did the command get sent to the device,
or did it fail for 'transport' reasons". The role of cmd_rc is to
translate the specific status response of the command into a common
error code. The expectations are:

rc < 0: Error code, Linux terminated the ioctl before talking to hardware

rc == 0: Linux successfully submitted the command to hardware, cmd_rc
is valid for command specific response

rc > 0: Linux successfully submitted the command, but detected that
only a subset of the data was accepted for "write"-style commands, or
that only a subset of the data was returned for "read"-style commands. I.e.
short-write / short-read semantics. cmd_rc is valid in this case and
it's up to userspace to determine if a short transfer is an error or
not.

> Also 'logging cmd_rc' in the invalid cmd case does not seem quite right unless
> you really want rc to be cmd_rc.
>
> The architecture is designed to separate errors which occur in the kernel vs
> errors in the firmware/dimm.  Are they always the same?  The current code
> differentiates them.

Yeah, they're distinct, transport vs end-point / command-specific
status returns.
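
As a hypothetical caller-side sketch of that convention (the ->ndctl()
callback signature is from libnvdimm; the wrapper itself is illustrative):

int submit_dimm_cmd(struct nvdimm_bus_descriptor *nd_desc, struct nvdimm *nvdimm,
		unsigned int cmd, void *buf, unsigned int buf_len)
{
	int cmd_rc, rc;

	rc = nd_desc->ndctl(nd_desc, nvdimm, cmd, buf, buf_len, &cmd_rc);
	if (rc < 0)
		return rc;	/* transport error, never reached the device */
	if (rc > 0)
		pr_debug("short transfer: %d bytes\n", rc);
	return cmd_rc;		/* command-specific status from the endpoint */
}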


Re: [RESEND PATCH v9 4/5] ndctl/papr_scm,uapi: Add support for PAPR nvdimm specific methods

2020-06-05 Thread Dan Williams
On Fri, Jun 5, 2020 at 8:22 AM Vaibhav Jain  wrote:
[..]
> > Oh, why not define a maximal health payload with all the attributes
> > you know about today, leave some room for future expansion, and then
> > report a validity flag for each attribute? This is how the "intel"
> > smart-health payload works. If they ever needed to extend the payload
> > they would increase the size and add more validity flags. Old
> > userspace never groks the new fields, new userspace knows to ask for
> > and parse the larger payload.
> >
> > See the flags field in 'struct nd_intel_smart' (in ndctl) and the
> > translation of those flags to ndctl generic attribute flags
> > intel_cmd_smart_get_flags().
> >
> > In general I'd like ndctl to understand the superset of all health
> > attributes across all vendors. For the truly vendor specific ones it
> > would mean that the health flags with a specific "papr_scm" back-end
> > just would never be set on an "intel" device. I.e. look at the "hpe"
> > and "msft" health backends. They only set a subset of the valid flags
> > that could be reported.
>
> Thanks, this sounds good. In fact the papr_scm implementation in ndctl
> advertises support for only a subset of ND_SMART_* flags right now.
>
> Using 'flags' instead of 'version' was indeed discussed during
> v7..v9. However re-looking at the 'msft' and 'hpe' implementations the
> approach of maximal health payload tagged with a flags field looks more
> intuitive and I would prefer implementing this scheme in this patch-set.
>
> The current set of health data exchanged between libndctl and
> papr_scm via 'struct nd_papr_pdsm_health' (e.g. various health status
> bits, nvdimm arming status, etc.) is guaranteed to always be available,
> hence associating its availability with a flag won't be very useful as
> the flag will always be set.
>
> However as you suggested, extending the 'struct nd_papr_pdsm_health' in
> future to accommodate new attributes like 'life-remaining' can be done
> via adding them to the end of the struct and setting a flag field to
> indicate its presence.
>
> So I have the following proposal:
> * Add a new '__u32 extension_flags' field at beginning of 'struct
>   nd_papr_pdsm_health'
> * Set the size of the struct to 184-bytes which is the maximum possible
>   size for a pdsm payload.
> * 'papr_scm' kernel driver will currently set 'extension_flag' to 0
>   indicating no extension fields.
>
> * Future patch that adds support for 'life-remaining' add the new-field
>   at the end of known fields in 'struct nd_papr_pdsm_health'.
> * When provided to  papr_scm kernel module, if 'life-remaining' data is
>   available its populated and corresponding flag set in
>   'extension_flags' field indicating its presence.
> * When received by libndctl papr_scm implementation its tests if the
>   extension_flags have associated 'life-remaining' flag set and if yes
>   then return ND_SMART_USED_VALID flag back from
>   ndctl_cmd_smart_get_flags().
>
> Implementing first 3 items above in the current patchset should be
> fairly trivial.
>
> Does that sound reasonable?

This sounds good to me.
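
For illustration, the proposal above amounts to a payload shaped roughly
like this (a sketch; everything besides extension_flags and the 184-byte
envelope is a placeholder, not the final uapi):

struct nd_papr_pdsm_health {
	union {
		struct {
			__u32 extension_flags;	/* 0 == no extension fields present */
			__u8 dimm_unarmed;
			__u8 dimm_bad_shutdown;
			/* ... other always-present health fields ... */
			/* future fields (e.g. life-remaining) are appended here,
			 * each announced by a bit in extension_flags */
		};
		__u8 buf[184];			/* maximum pdsm payload size */
	};
};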


Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.

2020-05-30 Thread Dan Williams
On Sat, May 30, 2020 at 12:18 AM Aneesh Kumar K.V
 wrote:
>
> On 5/30/20 12:52 AM, Dan Williams wrote:
> > On Fri, May 29, 2020 at 3:55 AM Aneesh Kumar K.V
> >  wrote:
> >>
> >> On 5/29/20 3:22 PM, Jan Kara wrote:
> >>> Hi!
> >>>
> >>> On Fri 29-05-20 15:07:31, Aneesh Kumar K.V wrote:
> >>>> Thanks Michal. I also missed Jeff in this email thread.
> >>>
> >>> And I think you'll also need some of the sched maintainers for the prctl
> >>> bits...
> >>>
> >>>> On 5/29/20 3:03 PM, Michal Suchánek wrote:
> >>>>> Adding Jan
> >>>>>
> >>>>> On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
> >>>>>> With POWER10, architecture is adding new pmem flush and sync 
> >>>>>> instructions.
> >>>>>> The kernel should prevent the usage of MAP_SYNC if applications are 
> >>>>>> not using
> >>>>>> the new instructions on newer hardware.
> >>>>>>
> >>>>>> This patch adds a prctl option MAP_SYNC_ENABLE that can be used to 
> >>>>>> enable
> >>>>>> the usage of MAP_SYNC. The kernel config option is added to allow the 
> >>>>>> user
> >>>>>> to control whether MAP_SYNC should be enabled by default or not.
> >>>>>>
> >>>>>> Signed-off-by: Aneesh Kumar K.V 
> >>> ...
> >>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
> >>>>>> index 8c700f881d92..d5a9a363e81e 100644
> >>>>>> --- a/kernel/fork.c
> >>>>>> +++ b/kernel/fork.c
> >>>>>> @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp 
> >>>>>> DEFINE_SPINLOCK(mmlist_lock);
> >>>>>> static unsigned long default_dump_filter = MMF_DUMP_FILTER_DEFAULT;
> >>>>>> +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
> >>>>>> +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
> >>>>>> +#else
> >>>>>> +unsigned long default_map_sync_mask = 0;
> >>>>>> +#endif
> >>>>>> +
> >>>
> >>> I'm not sure CONFIG is really the right approach here. For a distro that 
> >>> would
> >>> basically mean to disable MAP_SYNC for all PPC kernels unless application
> >>> explicitly uses the right prctl. Shouldn't we rather initialize
> >>> default_map_sync_mask on boot based on whether the CPU we run on requires
> >>> new flush instructions or not? Otherwise the patch looks sensible.
> >>>
> >>
> >> yes that is correct. We ideally want to deny MAP_SYNC only w.r.t
> >> POWER10. But on a virtualized platform there is no easy way to detect
> >> that. We could ideally hook this into the nvdimm driver where we look at
> >> the new compat string ibm,persistent-memory-v2 and then disable MAP_SYNC
> >> if we find a device with the specific value.
> >>
> >> BTW with the recent changes I posted for the nvdimm driver, older kernel
> >> won't initialize persistent memory device on newer hardware. Newer
> >> hardware will present the device to OS with a different device tree
> >> compat string.
> >>
> >> My expectation  w.r.t this patch was, Distro would want to  mark
> >> CONFIG_ARCH_MAP_SYNC_DISABLE=n based on the different application
> >> certification.  Otherwise application will have to end up calling the
> >> prctl(MMF_DISABLE_MAP_SYNC, 0) any way. If that is the case, should this
> >> be dependent on P10?
> >>
> >> With that I am wondering should we even have this patch? Can we expect
> >> userspace get updated to use new instruction?.
> >>
> >> With ppc64 we never had a real persistent memory device available for
> >> end user to try. The available persistent memory stack was using vPMEM
> >> which was presented as a volatile memory region for which there is no
> >> need to use any of the flush instructions. We could safely assume that
> >> as we get applications certified/verified for working with pmem device
> >> on ppc64, they would all be using the new instructions?
> >
> > I think prctl is the wrong interface for this. I was thinking a sysfs
> > interface along the same lines as /sys/block/pmemX/dax/write_cache.
> > That attribute is toggling DAXDEV_WRITE_CACHE for the determination of
> > whether the platform or the kernel needs to handle cache flushing
> > relative to power loss. A similar attribute can be established for
> > DAXDEV_SYNC, it would simply default to off based on a configuration
> > time policy, but be dynamically changeable at runtime via sysfs.
> >
> > These flags are device properties that affect the kernel and
> > userspace's handling of persistence.
> >
>
> That will not handle the scenario with multiple applications using the
> same fsdax mount point where one is updated to use the new instruction
> and the other is not.

Right, it needs to be a global setting / flag day to switch from one
regime to another. Per-process control is a recipe for disaster.


Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.

2020-05-29 Thread Dan Williams
On Fri, May 29, 2020 at 3:55 AM Aneesh Kumar K.V
 wrote:
>
> On 5/29/20 3:22 PM, Jan Kara wrote:
> > Hi!
> >
> > On Fri 29-05-20 15:07:31, Aneesh Kumar K.V wrote:
> >> Thanks Michal. I also missed Jeff in this email thread.
> >
> > And I think you'll also need some of the sched maintainers for the prctl
> > bits...
> >
> >> On 5/29/20 3:03 PM, Michal Suchánek wrote:
> >>> Adding Jan
> >>>
> >>> On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
>  With POWER10, architecture is adding new pmem flush and sync 
>  instructions.
>  The kernel should prevent the usage of MAP_SYNC if applications are not 
>  using
>  the new instructions on newer hardware.
> 
>  This patch adds a prctl option MAP_SYNC_ENABLE that can be used to enable
>  the usage of MAP_SYNC. The kernel config option is added to allow the 
>  user
>  to control whether MAP_SYNC should be enabled by default or not.
> 
>  Signed-off-by: Aneesh Kumar K.V 
> > ...
>  diff --git a/kernel/fork.c b/kernel/fork.c
>  index 8c700f881d92..d5a9a363e81e 100644
>  --- a/kernel/fork.c
>  +++ b/kernel/fork.c
>  @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp 
>  DEFINE_SPINLOCK(mmlist_lock);
> static unsigned long default_dump_filter = MMF_DUMP_FILTER_DEFAULT;
>  +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
>  +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
>  +#else
>  +unsigned long default_map_sync_mask = 0;
>  +#endif
>  +
> >
> > I'm not sure CONFIG is really the right approach here. For a distro that 
> > would
> > basically mean to disable MAP_SYNC for all PPC kernels unless application
> > explicitly uses the right prctl. Shouldn't we rather initialize
> > default_map_sync_mask on boot based on whether the CPU we run on requires
> > new flush instructions or not? Otherwise the patch looks sensible.
> >
>
> yes that is correct. We ideally want to deny MAP_SYNC only w.r.t
> POWER10. But on a virtualized platform there is no easy way to detect
> that. We could ideally hook this into the nvdimm driver where we look at
> the new compat string ibm,persistent-memory-v2 and then disable MAP_SYNC
> if we find a device with the specific value.
>
> BTW with the recent changes I posted for the nvdimm driver, older kernel
> won't initialize persistent memory device on newer hardware. Newer
> hardware will present the device to OS with a different device tree
> compat string.
>
> My expectation  w.r.t this patch was, Distro would want to  mark
> CONFIG_ARCH_MAP_SYNC_DISABLE=n based on the different application
> certification.  Otherwise application will have to end up calling the
> prctl(MMF_DISABLE_MAP_SYNC, 0) any way. If that is the case, should this
> be dependent on P10?
>
> With that I am wondering should we even have this patch? Can we expect
> userspace get updated to use new instruction?.
>
> With ppc64 we never had a real persistent memory device available for
> end user to try. The available persistent memory stack was using vPMEM
> which was presented as a volatile memory region for which there is no
> need to use any of the flush instructions. We could safely assume that
> as we get applications certified/verified for working with pmem device
> on ppc64, they would all be using the new instructions?

I think prctl is the wrong interface for this. I was thinking a sysfs
interface along the same lines as /sys/block/pmemX/dax/write_cache.
That attribute is toggling DAXDEV_WRITE_CACHE for the determination of
whether the platform or the kernel needs to handle cache flushing
relative to power loss. A similar attribute can be established for
DAXDEV_SYNC, it would simply default to off based on a configuration
time policy, but be dynamically changeable at runtime via sysfs.

These flags are device properties that affect the kernel and
userspace's handling of persistence.
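
For concreteness, a minimal sketch of the kind of sysfs toggle described
above, modeled loosely on the write_cache attribute pattern; the structure,
flag bit, and drvdata layout are stand-ins for illustration, not the real
drivers/dax internals:

	#include <linux/device.h>
	#include <linux/kernel.h>
	#include <linux/bitops.h>

	/* stand-in for the real dax_device, which is private to drivers/dax */
	struct example_dax_device {
		unsigned long flags;
	};

	#define EXAMPLE_DAXDEV_SYNC	0	/* hypothetical flag bit */

	static ssize_t sync_show(struct device *dev,
				 struct device_attribute *attr, char *buf)
	{
		struct example_dax_device *dax_dev = dev_get_drvdata(dev);

		return sprintf(buf, "%d\n",
			       !!test_bit(EXAMPLE_DAXDEV_SYNC, &dax_dev->flags));
	}

	static ssize_t sync_store(struct device *dev,
				  struct device_attribute *attr,
				  const char *buf, size_t len)
	{
		struct example_dax_device *dax_dev = dev_get_drvdata(dev);
		bool enable;
		int rc;

		rc = kstrtobool(buf, &enable);
		if (rc)
			return rc;

		/* compile-time policy sets the boot default; this is the
		 * dynamic runtime override */
		if (enable)
			set_bit(EXAMPLE_DAXDEV_SYNC, &dax_dev->flags);
		else
			clear_bit(EXAMPLE_DAXDEV_SYNC, &dax_dev->flags);
		return len;
	}
	static DEVICE_ATTR_RW(sync);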


Re: [PATCH v8 1/5] powerpc: Document details on H_SCM_HEALTH hcall

2020-05-27 Thread Dan Williams
On Tue, May 26, 2020 at 9:13 PM Vaibhav Jain  wrote:
>
> Add documentation to 'papr_hcalls.rst' describing the bitmap flags
> that are returned from H_SCM_HEALTH hcall as per the PAPR-SCM
> specification.
>

Please do a global s/SCM/PMEM/ or s/SCM/NVDIMM/. It's unfortunate that
we already have 2 ways to describe persistent memory devices; let's
not perpetuate a third, so that "grep" has a chance to find
interrelated code across architectures. Other than that this looks
good to me.

> Cc: "Aneesh Kumar K . V" 
> Cc: Dan Williams 
> Cc: Michael Ellerman 
> Cc: Ira Weiny 
> Signed-off-by: Vaibhav Jain 
> ---
> Changelog:
> v7..v8:
> * Added a clarification on bit-ordering of Health Bitmap
>
> Resend:
> * None
>
> v6..v7:
> * None
>
> v5..v6:
> * New patch in the series
> ---
>  Documentation/powerpc/papr_hcalls.rst | 45 ---
>  1 file changed, 41 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/powerpc/papr_hcalls.rst 
> b/Documentation/powerpc/papr_hcalls.rst
> index 3493631a60f8..45063f305813 100644
> --- a/Documentation/powerpc/papr_hcalls.rst
> +++ b/Documentation/powerpc/papr_hcalls.rst
> @@ -220,13 +220,50 @@ from the LPAR memory.
>  **H_SCM_HEALTH**
>
>  | Input: drcIndex
> -| Out: *health-bitmap, health-bit-valid-bitmap*
> +| Out: *health-bitmap (r4), health-bit-valid-bitmap (r5)*
>  | Return Value: *H_Success, H_Parameter, H_Hardware*
>
>  Given a DRC Index return the info on predictive failure and overall health of
> -the NVDIMM. The asserted bits in the health-bitmap indicate a single predictive
> -failure and health-bit-valid-bitmap indicate which bits in health-bitmap are
> -valid.
> +the NVDIMM. The asserted bits in the health-bitmap indicate one or more states
> +(described in table below) of the NVDIMM and health-bit-valid-bitmap indicate
> +which bits in health-bitmap are valid. The bits are reported in
> +reverse bit ordering for example a value of 0xC400
> +indicates bits 0, 1, and 5 are valid.
> +
> +Health Bitmap Flags:
> +
> ++------+------------------------------------------------------------------------+
> +|  Bit |   Definition                                                           |
> ++======+========================================================================+
> +|  00  | SCM device is unable to persist memory contents.                      |
> +|      | If the system is powered down, nothing will be saved.                 |
> ++------+------------------------------------------------------------------------+
> +|  01  | SCM device failed to persist memory contents. Either contents were not|
> +|      | saved successfully on power down or were not restored properly on     |
> +|      | power up.                                                              |
> ++------+------------------------------------------------------------------------+
> +|  02  | SCM device contents are persisted from previous IPL. The data from    |
> +|      | the last boot were successfully restored.                              |
> ++------+------------------------------------------------------------------------+
> +|  03  | SCM device contents are not persisted from previous IPL. There was no |
> +|      | data to restore from the last boot.                                    |
> ++------+------------------------------------------------------------------------+
> +|  04  | SCM device memory life remaining is critically low                    |
> ++------+------------------------------------------------------------------------+
> +|  05  | SCM device will be garded off next IPL due to failure                 |
> ++------+------------------------------------------------------------------------+
> +|  06  | SCM contents cannot persist due to current platform health status. A  |
> +|      | hardware failure may prevent data from being saved or restored.       |
> ++------+------------------------------------------------------------------------+
> +|  07  | SCM device is unable to persist memory contents in certain conditions |
> ++------+------------------------------------------------------------------------+
> +|  08  | SCM device is encrypted                                                |
> ++------+------------------------------------------------------------------------+
> +|  09  | SCM device has successfully completed a requested erase or secure     |
> +|      | erase procedure.                                                       |
> ++------+------------------------------------------------------------------------+
> +|10:63 | Reserved / Unused                                                      |
> ++------+------------------------------------------------------------------------+
>
>  **H_SCM_PERFORMANCE_STATS**
>
> --
> 2.26.2
>
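
As an aside, a small userspace sketch of decoding the "reverse bit ordering"
described in the hunk above, where bit 0 is the most significant bit of the
64-bit value; the macro and helper names are illustrative, not PAPR or
kernel definitions:

	#include <stdio.h>
	#include <inttypes.h>

	/* bit 0 is the most significant bit of the 64-bit register */
	#define PAPR_PMEM_HEALTH_BIT(n)	(1ULL << (63 - (n)))

	static int health_bit_set(uint64_t bitmap, uint64_t valid, unsigned int bit)
	{
		uint64_t mask = PAPR_PMEM_HEALTH_BIT(bit);

		return (valid & mask) && (bitmap & mask);
	}

	int main(void)
	{
		/* example from the table above: bits 0, 1, and 5 marked valid */
		uint64_t valid = PAPR_PMEM_HEALTH_BIT(0) | PAPR_PMEM_HEALTH_BIT(1) |
				 PAPR_PMEM_HEALTH_BIT(5);
		uint64_t bitmap = PAPR_PMEM_HEALTH_BIT(1);	/* bit 1: save failed */

		printf("valid mask: 0x%016" PRIx64 "\n", valid);
		printf("bit 1 (failed to persist): %d\n",
		       health_bit_set(bitmap, valid, 1));
		printf("bit 4 (life remaining low): %d\n",
		       health_bit_set(bitmap, valid, 4));
		return 0;
	}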


Re: [PATCH v2 3/5] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-05-21 Thread Dan Williams
On Thu, May 21, 2020 at 7:39 AM Jeff Moyer  wrote:
>
> Dan Williams  writes:
>
> >> But I agree with your concern that if we have older kernel/applications
> >> that continue to use `dcbf` on future hardware we will end up
> >> having issues w.r.t powerfail consistency. The plan is what you outlined
> >> above as tighter ecosystem control. Considering we don't have a pmem
> >> device generally available, we get both kernel and userspace upgraded
> >> to use these new instructions before such a device is made available.
>
> I thought power already supported NVDIMM-N, no?  So are you saying that
> those devices will continue to work with the existing flushing and
> fencing mechanisms?
>
> > Ok, I think a compile time kernel option with a runtime override
> > satisfies my concern. Does that work for you?
>
> The compile time option only helps when running newer kernels.  I'm not
> sure how you would even begin to audit userspace applications (keep in
> mind, not every application is open source, and not every application
> uses pmdk).  I also question the merits of forcing the administrator to
> make the determination of whether all applications on the system will
> work properly.  Really, you have to rely on the vendor to tell you the
> platform is supported, and at that point, why put further hurdles in the
> way?

I'm thoroughly confused by this. I thought this was exactly the role
of a Linux distribution vendor. ISVs qualify their application on a
hardware-platform + distribution combination and the distribution owns
picking ABI defaults like CONFIG_SYSFS_DEPRECATED regardless of
whether they can guarantee that all apps are updated to the new
semantics.

The administrator is not forced; the administrator is afforded an
override in the extreme case that they find an exception to what was
qualified and need to override the distribution's compile-time choice.

>
> The decision to require different instructions on ppc is unfortunate,
> but one I'm sure we have no control over.  I don't see any merit in the
> kernel disallowing MAP_SYNC access on these platforms.  Ideally, we'd
> have some way of ensuring older kernels don't work with these new
> platforms, but I don't think that's possible.

I see disabling MAP_SYNC as the more targeted form of "ensuring older
kernels don't work".

So I guess we agree that something should break when baseline
assumptions change, we just don't yet agree on where that break should
happen?


Re: [PATCH v2 3/5] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-05-21 Thread Dan Williams
On Thu, May 21, 2020 at 10:03 AM Aneesh Kumar K.V
 wrote:
>
> On 5/21/20 8:08 PM, Jeff Moyer wrote:
> > Dan Williams  writes:
> >
> >>> But I agree with your concern that if we have older kernel/applications
> >>> that continue to use `dcbf` on future hardware we will end up
> >>> having issues w.r.t powerfail consistency. The plan is what you outlined
> >>> above as tighter ecosystem control. Considering we don't have a pmem
> >>> device generally available, we get both kernel and userspace upgraded
> >>> to use these new instructions before such a device is made available.
> >
> > I thought power already supported NVDIMM-N, no?  So are you saying that
> > those devices will continue to work with the existing flushing and
> > fencing mechanisms?
> >
>
> yes. these devices can continue to use 'dcbf + hwsync' as long as we are
> running them on P9.
>
>
> >> Ok, I think a compile time kernel option with a runtime override
> >> satisfies my concern. Does that work for you?
> >
> > The compile time option only helps when running newer kernels.  I'm not
> > sure how you would even begin to audit userspace applications (keep in
> > mind, not every application is open source, and not every application
> > uses pmdk).  I also question the merits of forcing the administrator to
> > make the determination of whether all applications on the system will
> > work properly.  Really, you have to rely on the vendor to tell you the
> > platform is supported, and at that point, why put further hurdles in the
> > way?
> >
> > The decision to require different instructions on ppc is unfortunate,
> > but one I'm sure we have no control over.  I don't see any merit in the
> > kernel disallowing MAP_SYNC access on these platforms.  Ideally, we'd
> > have some way of ensuring older kernels don't work with these new
> > platforms, but I don't think that's possible.
> >
>
>
> I am currently looking at the possibility of firmware present these
> devices with different device-tree compat values. So that older
> /existing kernel won't initialize the device on newer systems. Is that a
> good compromise? We still can end up with older userspace and newer
> kernel. One of the option suggested by Jan Kara is to use a prctl flag
> to control that? (intead of kernel parameter option I posted before)
>
>
> > Moving on to the patch itself--Aneesh, have you audited other persistent
> > memory users in the kernel?  For example, drivers/md/dm-writecache.c does
> > this:
> >
> > static void writecache_commit_flushed(struct dm_writecache *wc, bool 
> > wait_for_ios)
> > {
> >   if (WC_MODE_PMEM(wc))
> >   wmb(); <==
> >  else
> >  ssd_commit_flushed(wc, wait_for_ios);
> > }
> >
> > I believe you'll need to make modifications there.
> >
>
> Correct. Thanks for catching that.
>
>
> I don't understand dm much, wondering how this will work with
> non-synchronous DAX device?

That's a good point. DM-writecache needs to be cognizant of things
like virtio-pmem that violate the rule that persistent memory writes
can be flushed by CPU functions rather than calling back into the
driver. It seems we need to always make the flush case a dax_operation
callback to account for this.
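
A rough sketch of that shape, reusing the names from the dm-writecache
snippet quoted above: the pmem commit goes through a per-device callback
instead of a bare wmb(). The ->flush_commit() op and the commit_ops member
are hypothetical, not an existing dax_operations entry:

	/* hypothetical per-device commit op, not an existing interface */
	struct pmem_commit_ops {
		void (*flush_commit)(struct dm_writecache *wc);
	};

	/* default for CPU-flushable persistent memory: order the prior flushes */
	static void pmem_commit_default(struct dm_writecache *wc)
	{
		wmb();
	}

	static void writecache_commit_flushed(struct dm_writecache *wc, bool wait_for_ios)
	{
		if (WC_MODE_PMEM(wc))
			wc->commit_ops->flush_commit(wc);	/* was: wmb() */
		else
			ssd_commit_flushed(wc, wait_for_ios);
	}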


Re: [PATCH v2 3/5] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-05-19 Thread Dan Williams
On Tue, May 19, 2020 at 6:53 AM Aneesh Kumar K.V
 wrote:
>
> Dan Williams  writes:
>
> > On Mon, May 18, 2020 at 10:30 PM Aneesh Kumar K.V
> >  wrote:
>
> ...
>
> >> Applications using new instructions will behave as expected when running
> >> on P8 and P9. Only future hardware will differentiate between 'dcbf' and
> >> 'dcbfps'
> >
> > Right, this is the problem. Applications using new instructions behave
> > as expected, the kernel has been shipping of_pmem and papr_scm for
> > several cycles now, you're saying that the DAX applications written
> > against those platforms are going to be broken on P8 and P9?
>
> The expecation is that both kernel and userspace would get upgraded to
> use the new instruction before actual persistent memory devices are
> made available.
>
> >
> >> > I'm thinking the kernel
> >> > should go as far as to disable DAX operation by default on new
> >> > hardware until userspace asserts that it is prepared to switch to the
> >> > new implementation. Is there any other way to ensure the forward
> >> > compatibility of deployed ppc64 DAX applications?
> >>
> >> AFAIU there is no released persistent memory hardware on ppc64 platform
> >> and we need to make sure before applications get enabled to use these
> >> persistent memory devices, they should switch to use the new
> >> instruction?
> >
> > Right, I want the kernel to offer some level of safety here because
> > everything you are describing sounds like a flag day conversion. Am I
> > misreading? Is there some other gate that prevents existing users of
> > of_pmem and papr_scm from having their expectations violated when
> > running on P8 / P9 hardware? Maybe there's tighter ecosystem control
> > that I'm just not familiar with, I'm only going off the fact that the
> > kernel has shipped a non-zero number of NVDIMM drivers that build with
> > ARCH=ppc64 for several cycles.
>
> If we are looking at adding changes to kernel that will prevent a kernel
> from running on newer hardware in a specific case, we could as well take
> the changes to get the kernel use the newer instructions right?

Oh, no, I'm not talking about stopping the kernel from running. I'm
simply recommending that support for MAP_SYNC mappings (userspace
managed flushing) be disabled by default on PPC with either a
compile-time or run-time default to assert that userspace has been
audited for legacy applications or that the platform owner is
otherwise willing to take the risk.

> But I agree with your concern that if we have older kernel/applications
> that continue to use `dcbf` on future hardware we will end up
> having issues w.r.t powerfail consistency. The plan is what you outlined
> above as tighter ecosystem control. Considering we don't have a pmem
> device generally available, we get both kernel and userspace upgraded
> to use these new instructions before such a device is made available.

Ok, I think a compile time kernel option with a runtime override
satisfies my concern. Does that work for you?
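
A minimal sketch of that compile-time-default-plus-runtime-override shape,
reusing the CONFIG_ARCH_MAP_SYNC_DISABLE symbol from the prctl RFC earlier
in this digest; the parameter name and helper are illustrative, not an
existing interface:

	#include <linux/module.h>
	#include <linux/moduleparam.h>

	/* boot default comes from the Kconfig policy, overridable at runtime */
	static bool map_sync_enable = !IS_ENABLED(CONFIG_ARCH_MAP_SYNC_DISABLE);
	module_param(map_sync_enable, bool, 0644);
	MODULE_PARM_DESC(map_sync_enable,
			 "Allow MAP_SYNC mappings (compile-time default, runtime override)");

	/* hypothetical helper a MAP_SYNC gate could consult */
	bool platform_map_sync_allowed(void)
	{
		return map_sync_enable;
	}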


Re: [PATCH v2 3/5] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-05-19 Thread Dan Williams
On Mon, May 18, 2020 at 10:30 PM Aneesh Kumar K.V
 wrote:
>
>
> Hi Dan,
>
> Apologies for the delay in response. I was waiting for feedback from
> hardware team before responding to this email.
>
>
> Dan Williams  writes:
>
> > On Tue, May 12, 2020 at 8:47 PM Aneesh Kumar K.V
> >  wrote:
> >>
> >> Architectures like ppc64 provide persistent memory specific barriers
> >> that will ensure that all stores for which the modifications are
> >> written to persistent storage by preceding dcbfps and dcbstps
> >> instructions have updated persistent storage before any data
> >> access or data transfer caused by subsequent instructions is initiated.
> >> This is in addition to the ordering done by wmb()
> >>
> >> Update nvdimm core such that architecture can use barriers other than
> >> wmb to ensure all previous writes are architecturally visible for
> >> the platform buffer flush.
> >
> > This seems like an exceedingly bad idea, maybe I'm missing something.
> > This implies that the deployed base of DAX applications using the old
> > instruction sequence are going to regress on new hardware that
> > requires the new instructions to be deployed.
>
>
> pmdk support for ppc64 is still work in progress and there is pull
> request to switch pmdk to use new instruction.

Ok.

>
> https://github.com/tuliom/pmdk/commit/fix-flush
>
> All userspace applications will be switched to use the new
> instructions. The new instructions are designed such that when running on P8
> and P9 they behave as 'dcbf' and 'hwsync'.

Sure, makes sense.

> Applications using new instructions will behave as expected when running
> on P8 and P9. Only future hardware will differentiate between 'dcbf' and
> 'dcbfps'

Right, this is the problem. Applications using new instructions behave
as expected, the kernel has been shipping of_pmem and papr_scm for
several cycles now, you're saying that the DAX applications written
against those platforms are going to be broken on P8 and P9?

> > I'm thinking the kernel
> > should go as far as to disable DAX operation by default on new
> > hardware until userspace asserts that it is prepared to switch to the
> > new implementation. Is there any other way to ensure the forward
> > compatibility of deployed ppc64 DAX applications?
>
> AFAIU there is no released persistent memory hardware on ppc64 platform
> and we need to make sure before applications get enabled to use these
> persistent memory devices, they should switch to use the new
> instruction?

Right, I want the kernel to offer some level of safety here because
everything you are describing sounds like a flag day conversion. Am I
misreading? Is there some other gate that prevents existing users of
of_pmem and papr_scm from having their expectations violated when
running on P8 / P9 hardware? Maybe there's tighter ecosystem control
that I'm just not familiar with, I'm only going off the fact that the
kernel has shipped a non-zero number of NVDIMM drivers that build with
ARCH=ppc64 for several cycles.


Re: remove a few uses of ->queuedata

2020-05-13 Thread Dan Williams
On Tue, May 12, 2020 at 1:08 AM Christoph Hellwig  wrote:
>
> On Sat, May 09, 2020 at 08:07:14AM -0700, Dan Williams wrote:
> > > which are all used in the I/O submission path (generic_make_request /
> > > generic_make_request_checks).  This is mostly a prep cleanup patch to
> > > also remove the pointless queue argument from ->make_request - then
> > > ->queue is an extra dereference and extra churn.
> >
> > Ah ok. If the changelogs had been filled in with something like "In
> > preparation for removing @q from make_request_fn, stop using
> > ->queuedata", I probably wouldn't have looked twice.
> >
> > For the nvdimm/ driver updates you can add:
> >
> > Reviewed-by: Dan Williams 
> >
> > ...or just let me know if you want me to pick those up through the nvdimm 
> > tree.
>
> I'd love you to pick them up through the nvdimm tree.  Do you want
> to fix up the commit message yourself?

Will do, thanks.


Re: [PATCH v2 3/5] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-05-13 Thread Dan Williams
On Tue, May 12, 2020 at 8:47 PM Aneesh Kumar K.V
 wrote:
>
> Architectures like ppc64 provide persistent memory specific barriers
> that will ensure that all stores for which the modifications are
> written to persistent storage by preceding dcbfps and dcbstps
> instructions have updated persistent storage before any data
> access or data transfer caused by subsequent instructions is initiated.
> This is in addition to the ordering done by wmb()
>
> Update nvdimm core such that architecture can use barriers other than
> wmb to ensure all previous writes are architecturally visible for
> the platform buffer flush.

This seems like an exceedingly bad idea, maybe I'm missing something.
This implies that the deployed base of DAX applications using the old
instruction sequence are going to regress on new hardware that
requires the new instructions to be deployed. I'm thinking the kernel
should go as far as to disable DAX operation by default on new
hardware until userspace asserts that it is prepared to switch to the
new implementation. Is there any other way to ensure the forward
compatibility of deployed ppc64 DAX applications?

>
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  drivers/nvdimm/region_devs.c | 8 
>  include/linux/libnvdimm.h| 4 
>  2 files changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index ccbb5b43b8b2..88ea34a9c7fd 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -1216,13 +1216,13 @@ int generic_nvdimm_flush(struct nd_region *nd_region)
> idx = this_cpu_add_return(flush_idx, hash_32(current->pid + idx, 8));
>
> /*
> -* The first wmb() is needed to 'sfence' all previous writes
> -* such that they are architecturally visible for the platform
> -* buffer flush.  Note that we've already arranged for pmem
> +* The first arch_pmem_flush_barrier() is needed to 'sfence' all
> +* previous writes such that they are architecturally visible for
> +* the platform buffer flush. Note that we've already arranged for 
> pmem
>  * writes to avoid the cache via memcpy_flushcache().  The final
>  * wmb() ensures ordering for the NVDIMM flush write.
>  */
> -   wmb();
> +   arch_pmem_flush_barrier();
> for (i = 0; i < nd_region->ndr_mappings; i++)
> if (ndrd_get_flush_wpq(ndrd, i, 0))
> writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
> index 18da4059be09..66f6c65bd789 100644
> --- a/include/linux/libnvdimm.h
> +++ b/include/linux/libnvdimm.h
> @@ -286,4 +286,8 @@ static inline void arch_invalidate_pmem(void *addr, 
> size_t size)
>  }
>  #endif
>
> +#ifndef arch_pmem_flush_barrier
> +#define arch_pmem_flush_barrier() wmb()
> +#endif
> +
>  #endif /* __LIBNVDIMM_H__ */
> --
> 2.26.2
>


Re: remove a few uses of ->queuedata

2020-05-09 Thread Dan Williams
On Sat, May 9, 2020 at 1:24 AM Christoph Hellwig  wrote:
>
> On Fri, May 08, 2020 at 11:04:45AM -0700, Dan Williams wrote:
> > On Fri, May 8, 2020 at 9:16 AM Christoph Hellwig  wrote:
> > >
> > > Hi all,
> > >
> > > various bio based drivers use queue->queuedata despite already having
> > > set up disk->private_data, which can be used just as easily.  This
> > > series cleans them up to only use a single private data pointer.
> >
> > ...but isn't the queue pretty much guaranteed to be cache hot and the
> > gendisk cache cold? I'm not immediately seeing what else needs the
> > gendisk in the I/O path. Is there another motivation I'm missing?
>
> ->private_data is right next to the ->queue pointer, pat0 and part_tbl
> which are all used in the I/O submission path (generic_make_request /
> generic_make_request_checks).  This is mostly a prep cleanup patch to
> also remove the pointless queue argument from ->make_request - then
> ->queue is an extra dereference and extra churn.

Ah ok. If the changelogs had been filled in with something like "In
preparation for removing @q from make_request_fn, stop using
->queuedata", I probably wouldn't have looked twice.

For the nvdimm/ driver updates you can add:

Reviewed-by: Dan Williams 

...or just let me know if you want me to pick those up through the nvdimm tree.


Re: remove a few uses of ->queuedata

2020-05-08 Thread Dan Williams
On Fri, May 8, 2020 at 9:16 AM Christoph Hellwig  wrote:
>
> Hi all,
>
> various bio based drivers use queue->queuedata despite already having
> set up disk->private_data, which can be used just as easily.  This
> series cleans them up to only use a single private data pointer.

...but isn't the queue pretty much guaranteed to be cache hot and the
gendisk cache cold? I'm not immediately seeing what else needs the
gendisk in the I/O path. Is there another motivation I'm missing?


Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-05-02 Thread Dan Williams
On Sat, May 2, 2020 at 2:27 AM David Hildenbrand  wrote:
>
> >> Now, let's clarify what I want regarding virtio-mem:
> >>
> >> 1. kexec should not add virtio-mem memory to the initial firmware
> >>memmap. The driver has to be in charge as discussed.
> >> 2. kexec should not place kexec images onto virtio-mem memory. That
> >>would end badly.
> >> 3. kexec should still dump virtio-mem memory via kdump.
> >
> > Ok, but then seems to say to me that dax/kmem is a different type of
> > (driver managed) than virtio-mem and it's confusing to try to apply
> > the same meaning. Why not just call your type for the distinct type it
> > is "System RAM (virtio-mem)" and let any other driver managed memory
> > follow the same "System RAM ($driver)" format if it wants?
>
> I had the same idea but discarded it because it seemed to uglify the
> add_memory() interface (passing yet another parameter only relevant for
> driver managed memory). Maybe we really want a new one, because I like
> that idea:
>
> /*
>  * Add special, driver-managed memory to the system as system ram.
>  * The resource_name is expected to have the name format "System RAM
>  * ($DRIVER)", so user space (esp. kexec-tools)" can special-case it.
>  *
>  * For this memory, no entries in /sys/firmware/memmap are created,
>  * as this memory won't be part of the raw firmware-provided memory map
>  * e.g., after a reboot. Also, the created memory resource is flagged
>  * with IORESOURCE_MEM_DRIVER_MANAGED, so in-kernel users can special-
>  * case this memory (e.g., not place kexec images onto it).
>  */
> int add_memory_driver_managed(int nid, u64 start, u64 size,
>   const char *resource_name);
>
>
> If we'd ever have to special case it even more in the kernel, we could
> allow to specify further resource flags. While passing the driver name
> instead of the resource_name would be an option, this way we don't have
> to hand craft new resource strings for added memory resources.
>
> Thoughts?

Looks useful to me and simplifies walking /proc/iomem. I personally
like the safety of the string just being the $driver component of the
name, but I won't lose sleep if the interface stays freeform like you
propose.
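
A hypothetical caller of the add_memory_driver_managed() interface proposed
above; only the call shape and the "System RAM ($DRIVER)" naming convention
follow the proposal, the surrounding driver context is made up:

	#include <linux/memory_hotplug.h>
	#include <linux/printk.h>

	static int example_add_driver_managed_range(int nid, u64 start, u64 size)
	{
		int rc;

		rc = add_memory_driver_managed(nid, start, size,
					       "System RAM (virtio-mem)");
		if (rc) {
			pr_err("adding range [%llx-%llx) failed: %d\n",
			       (unsigned long long)start,
			       (unsigned long long)(start + size), rc);
			return rc;
		}

		/* no /sys/firmware/memmap entry is created, and the resource is
		 * flagged IORESOURCE_MEM_DRIVER_MANAGED, so kexec can skip it
		 * when placing segments while kdump still dumps it */
		return 0;
	}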


Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-05-01 Thread Dan Williams
On Fri, May 1, 2020 at 2:11 PM David Hildenbrand  wrote:
>
> On 01.05.20 22:12, Dan Williams wrote:
[..]
> >>> Consider the case of EFI Special Purpose (SP) Memory that is
> >>> marked EFI Conventional Memory with the SP attribute. In that case the
> >>> firmware memory map marked it as conventional RAM, but the kernel
> >>> optionally marks it as System RAM vs Soft Reserved. The 2008 patch
> >>> simply does not consider that case. I'm not sure strict textualism
> >>> works for coding decisions.
> >>
> >> I am no expert on that matter (esp EFI). But looking at the users of
> >> firmware_map_add_early(), the single user is in arch/x86/kernel/e820.c
> >> . So the single source of /sys/firmware/memmap is (besides hotplug) e820.
> >>
> >> "'e820_table_firmware': the original firmware version passed to us by
> >> the bootloader - not modified by the kernel. ... inform the user about
> >> the firmware's notion of memory layout via /sys/firmware/memmap"
> >> (arch/x86/kernel/e820.c)
> >>
> >> How is the EFI Special Purpose (SP) Memory represented in e820?
> >> /sys/firmware/memmap is really simple: just dump in e820. No policies IIUC.
> >
> > e820 now has a Soft Reserved translation for this which means "try to
> > reserve, but treat as System RAM is ok too". It seems generically
> > useful to me that the toggle for determining whether Soft Reserved or
> > System RAM shows up /sys/firmware/memmap is a determination that
> > policy can make. The kernel need not preemptively block it.
>
> So, I think I have to clarify something here. We do have two ways to kexec
>
> 1. kexec_load(): User space (kexec-tools) crafts the memmap (e.g., using
> /sys/firmware/memmap on x86-64) and selects memory where to place the
> kexec images (e.g., using /proc/iomem)
>
> 2. kexec_file_load(): The kernel reuses the (basically) raw firmware
> memmap and selects memory where to place kexec images.
>
> We are talking about changing 1, to behave like 2 in regards to
> dax/kmem. 2. does currently not add any hotplugged memory to the
> fixed-up e820, and it should be fixed regarding hotplugged DIMMs that
> would appear in e820 after a reboot.
>
> Now, all these policy discussions are nice and fun, but I don't really
> see a good reason to (ab)use /sys/firmware/memmap for that (e.g., parent
> properties). If you want to be able to make this configurable, then
> e.g., add a way to configure this in the kernel (for example along with
> kmem) to make 1. and 2. behave the same way. Otherwise, you really only
> can change 1.

That's clearer.

>
>
> Now, let's clarify what I want regarding virtio-mem:
>
> 1. kexec should not add virtio-mem memory to the initial firmware
>memmap. The driver has to be in charge as discussed.
> 2. kexec should not place kexec images onto virtio-mem memory. That
>would end badly.
> 3. kexec should still dump virtio-mem memory via kdump.

Ok, but that seems to say to me that dax/kmem is a different type of
(driver managed) memory than virtio-mem, and it's confusing to try to
apply the same meaning to both. Why not just name your type for the
distinct type it is, "System RAM (virtio-mem)", and let any other
driver-managed memory follow the same "System RAM ($driver)" format if
it wants?


Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-05-01 Thread Dan Williams
On Fri, May 1, 2020 at 12:18 PM David Hildenbrand  wrote:
>
> On 01.05.20 20:43, Dan Williams wrote:
> > On Fri, May 1, 2020 at 11:14 AM David Hildenbrand  wrote:
> >>
> >> On 01.05.20 20:03, Dan Williams wrote:
> >>> On Fri, May 1, 2020 at 10:51 AM David Hildenbrand  
> >>> wrote:
> >>>>
> >>>> On 01.05.20 19:45, David Hildenbrand wrote:
> >>>>> On 01.05.20 19:39, Dan Williams wrote:
> >>>>>> On Fri, May 1, 2020 at 10:21 AM David Hildenbrand  
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> On 01.05.20 18:56, Dan Williams wrote:
> >>>>>>>> On Fri, May 1, 2020 at 2:34 AM David Hildenbrand  
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> On 01.05.20 00:24, Andrew Morton wrote:
> >>>>>>>>>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand 
> >>>>>>>>>>  wrote:
> >>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Why does the firmware map support hotplug entries?
> >>>>>>>>>>>
> >>>>>>>>>>> I assume:
> >>>>>>>>>>>
> >>>>>>>>>>> The firmware memmap was added primarily for x86-64 kexec (and 
> >>>>>>>>>>> still, is
> >>>>>>>>>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. 
> >>>>>>>>>>> When DIMMs
> >>>>>>>>>>> get hotplugged on real HW, they get added to e820. Same applies to
> >>>>>>>>>>> memory added via HyperV balloon (unless memory is unplugged via
> >>>>>>>>>>> ballooning and you reboot ... the the e820 is changed as well). I 
> >>>>>>>>>>> assume
> >>>>>>>>>>> we wanted to be able to reflect that, to make kexec look like a 
> >>>>>>>>>>> real reboot.
> >>>>>>>>>>>
> >>>>>>>>>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> But I assume only Andrew can enlighten us.
> >>>>>>>>>>>
> >>>>>>>>>>> @Andrew, any guidance here? Should we really add all memory to the
> >>>>>>>>>>> firmware memmap, even if this contradicts with the existing
> >>>>>>>>>>> documentation? (especially, if the actual firmware memmap will 
> >>>>>>>>>>> *not*
> >>>>>>>>>>> contain that memory after a reboot)
> >>>>>>>>>>
> >>>>>>>>>> For some reason that patch is misattributed - it was authored by
> >>>>>>>>>> Shaohui Zheng , who hasn't been heard 
> >>>>>>>>>> from in
> >>>>>>>>>> a decade.  I looked through the email discussion from that time 
> >>>>>>>>>> and I'm
> >>>>>>>>>> not seeing anything useful.  But I wasn't able to locate Dave 
> >>>>>>>>>> Hansen's
> >>>>>>>>>> review comments.
> >>>>>>>>>
> >>>>>>>>> Okay, thanks for checking. I think the documentation from 2008 is 
> >>>>>>>>> pretty
> >>>>>>>>> clear what has to be done here. I will add some of these details to 
> >>>>>>>>> the
> >>>>>>>>> patch description.
> >>>>>>>>>
> >>>>>>>>> Also, now that I know that esp. kexec-tools already don't consider
> >>>>>>>>> dax/kmem memory properly (memory will not get dumped via kdump) and
> >>>>>>>>> won't really suffer from a name change in /proc/iomem, I will go 
> >>>>>>>>> back to
> >>>>>>>>> the MHP_DRIVER_MANAGED approach and
> >>>>>>>>> 1. Don't create firmware memma

Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-05-01 Thread Dan Williams
On Fri, May 1, 2020 at 11:14 AM David Hildenbrand  wrote:
>
> On 01.05.20 20:03, Dan Williams wrote:
> > On Fri, May 1, 2020 at 10:51 AM David Hildenbrand  wrote:
> >>
> >> On 01.05.20 19:45, David Hildenbrand wrote:
> >>> On 01.05.20 19:39, Dan Williams wrote:
> >>>> On Fri, May 1, 2020 at 10:21 AM David Hildenbrand  
> >>>> wrote:
> >>>>>
> >>>>> On 01.05.20 18:56, Dan Williams wrote:
> >>>>>> On Fri, May 1, 2020 at 2:34 AM David Hildenbrand  
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> On 01.05.20 00:24, Andrew Morton wrote:
> >>>>>>>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand 
> >>>>>>>>  wrote:
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Why does the firmware map support hotplug entries?
> >>>>>>>>>
> >>>>>>>>> I assume:
> >>>>>>>>>
> >>>>>>>>> The firmware memmap was added primarily for x86-64 kexec (and 
> >>>>>>>>> still, is
> >>>>>>>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When 
> >>>>>>>>> DIMMs
> >>>>>>>>> get hotplugged on real HW, they get added to e820. Same applies to
> >>>>>>>>> memory added via HyperV balloon (unless memory is unplugged via
> >>>>>>>>> ballooning and you reboot ... the the e820 is changed as well). I 
> >>>>>>>>> assume
> >>>>>>>>> we wanted to be able to reflect that, to make kexec look like a 
> >>>>>>>>> real reboot.
> >>>>>>>>>
> >>>>>>>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> But I assume only Andrew can enlighten us.
> >>>>>>>>>
> >>>>>>>>> @Andrew, any guidance here? Should we really add all memory to the
> >>>>>>>>> firmware memmap, even if this contradicts with the existing
> >>>>>>>>> documentation? (especially, if the actual firmware memmap will *not*
> >>>>>>>>> contain that memory after a reboot)
> >>>>>>>>
> >>>>>>>> For some reason that patch is misattributed - it was authored by
> >>>>>>>> Shaohui Zheng , who hasn't been heard from 
> >>>>>>>> in
> >>>>>>>> a decade.  I looked through the email discussion from that time and 
> >>>>>>>> I'm
> >>>>>>>> not seeing anything useful.  But I wasn't able to locate Dave 
> >>>>>>>> Hansen's
> >>>>>>>> review comments.
> >>>>>>>
> >>>>>>> Okay, thanks for checking. I think the documentation from 2008 is 
> >>>>>>> pretty
> >>>>>>> clear what has to be done here. I will add some of these details to 
> >>>>>>> the
> >>>>>>> patch description.
> >>>>>>>
> >>>>>>> Also, now that I know that esp. kexec-tools already don't consider
> >>>>>>> dax/kmem memory properly (memory will not get dumped via kdump) and
> >>>>>>> won't really suffer from a name change in /proc/iomem, I will go back 
> >>>>>>> to
> >>>>>>> the MHP_DRIVER_MANAGED approach and
> >>>>>>> 1. Don't create firmware memmap entries
> >>>>>>> 2. Name the resource "System RAM (driver managed)"
> >>>>>>> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
> >>>>>>>
> >>>>>>> This way, kernel users and user space can figure out that this memory
> >>>>>>> has different semantics and handle it accordingly - I think that was
> >>>>>>> what Eric was asking for.
> >>>>>>>
> >>>>>>> Of course, open for suggestions.
> >>>>>>
> >>>>>> I'm still more of a fan of this bei

Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-05-01 Thread Dan Williams
On Fri, May 1, 2020 at 10:51 AM David Hildenbrand  wrote:
>
> On 01.05.20 19:45, David Hildenbrand wrote:
> > On 01.05.20 19:39, Dan Williams wrote:
> >> On Fri, May 1, 2020 at 10:21 AM David Hildenbrand  wrote:
> >>>
> >>> On 01.05.20 18:56, Dan Williams wrote:
> >>>> On Fri, May 1, 2020 at 2:34 AM David Hildenbrand  
> >>>> wrote:
> >>>>>
> >>>>> On 01.05.20 00:24, Andrew Morton wrote:
> >>>>>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand 
> >>>>>>  wrote:
> >>>>>>
> >>>>>>>>
> >>>>>>>> Why does the firmware map support hotplug entries?
> >>>>>>>
> >>>>>>> I assume:
> >>>>>>>
> >>>>>>> The firmware memmap was added primarily for x86-64 kexec (and still, 
> >>>>>>> is
> >>>>>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When 
> >>>>>>> DIMMs
> >>>>>>> get hotplugged on real HW, they get added to e820. Same applies to
> >>>>>>> memory added via HyperV balloon (unless memory is unplugged via
> >>>>>>> ballooning and you reboot ... the the e820 is changed as well). I 
> >>>>>>> assume
> >>>>>>> we wanted to be able to reflect that, to make kexec look like a real 
> >>>>>>> reboot.
> >>>>>>>
> >>>>>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> >>>>>>>
> >>>>>>>
> >>>>>>> But I assume only Andrew can enlighten us.
> >>>>>>>
> >>>>>>> @Andrew, any guidance here? Should we really add all memory to the
> >>>>>>> firmware memmap, even if this contradicts with the existing
> >>>>>>> documentation? (especially, if the actual firmware memmap will *not*
> >>>>>>> contain that memory after a reboot)
> >>>>>>
> >>>>>> For some reason that patch is misattributed - it was authored by
> >>>>>> Shaohui Zheng , who hasn't been heard from in
> >>>>>> a decade.  I looked through the email discussion from that time and I'm
> >>>>>> not seeing anything useful.  But I wasn't able to locate Dave Hansen's
> >>>>>> review comments.
> >>>>>
> >>>>> Okay, thanks for checking. I think the documentation from 2008 is pretty
> >>>>> clear what has to be done here. I will add some of these details to the
> >>>>> patch description.
> >>>>>
> >>>>> Also, now that I know that esp. kexec-tools already don't consider
> >>>>> dax/kmem memory properly (memory will not get dumped via kdump) and
> >>>>> won't really suffer from a name change in /proc/iomem, I will go back to
> >>>>> the MHP_DRIVER_MANAGED approach and
> >>>>> 1. Don't create firmware memmap entries
> >>>>> 2. Name the resource "System RAM (driver managed)"
> >>>>> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
> >>>>>
> >>>>> This way, kernel users and user space can figure out that this memory
> >>>>> has different semantics and handle it accordingly - I think that was
> >>>>> what Eric was asking for.
> >>>>>
> >>>>> Of course, open for suggestions.
> >>>>
> >>>> I'm still more of a fan of this being communicated by "System RAM"
> >>>
> >>> I was mentioning somewhere in this thread that "System RAM" inside a
> >>> hierarchy (like dax/kmem) will already be basically ignored by
> >>> kexec-tools. So, placing it inside a hierarchy already makes it look
> >>> special already.
> >>>
> >>> But after all, as we have to change kexec-tools either way, we can
> >>> directly go ahead and flag it properly as special (in case there will
> >>> ever be other cases where we could no longer distinguish it).
> >>>
> >>>> being parented especially because that tells you something about how
> >>>> the memory is driver-managed and which mechanism might be in play.
> >>>
> >>> The 

Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-05-01 Thread Dan Williams
On Fri, May 1, 2020 at 10:21 AM David Hildenbrand  wrote:
>
> On 01.05.20 18:56, Dan Williams wrote:
> > On Fri, May 1, 2020 at 2:34 AM David Hildenbrand  wrote:
> >>
> >> On 01.05.20 00:24, Andrew Morton wrote:
> >>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand  
> >>> wrote:
> >>>
> >>>>>
> >>>>> Why does the firmware map support hotplug entries?
> >>>>
> >>>> I assume:
> >>>>
> >>>> The firmware memmap was added primarily for x86-64 kexec (and still, is
> >>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
> >>>> get hotplugged on real HW, they get added to e820. Same applies to
> >>>> memory added via HyperV balloon (unless memory is unplugged via
> >>>> ballooning and you reboot ... the the e820 is changed as well). I assume
> >>>> we wanted to be able to reflect that, to make kexec look like a real 
> >>>> reboot.
> >>>>
> >>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> >>>>
> >>>>
> >>>> But I assume only Andrew can enlighten us.
> >>>>
> >>>> @Andrew, any guidance here? Should we really add all memory to the
> >>>> firmware memmap, even if this contradicts with the existing
> >>>> documentation? (especially, if the actual firmware memmap will *not*
> >>>> contain that memory after a reboot)
> >>>
> >>> For some reason that patch is misattributed - it was authored by
> >>> Shaohui Zheng , who hasn't been heard from in
> >>> a decade.  I looked through the email discussion from that time and I'm
> >>> not seeing anything useful.  But I wasn't able to locate Dave Hansen's
> >>> review comments.
> >>
> >> Okay, thanks for checking. I think the documentation from 2008 is pretty
> >> clear what has to be done here. I will add some of these details to the
> >> patch description.
> >>
> >> Also, now that I know that esp. kexec-tools already don't consider
> >> dax/kmem memory properly (memory will not get dumped via kdump) and
> >> won't really suffer from a name change in /proc/iomem, I will go back to
> >> the MHP_DRIVER_MANAGED approach and
> >> 1. Don't create firmware memmap entries
> >> 2. Name the resource "System RAM (driver managed)"
> >> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
> >>
> >> This way, kernel users and user space can figure out that this memory
> >> has different semantics and handle it accordingly - I think that was
> >> what Eric was asking for.
> >>
> >> Of course, open for suggestions.
> >
> > I'm still more of a fan of this being communicated by "System RAM"
>
> I was mentioning somewhere in this thread that "System RAM" inside a
> hierarchy (like dax/kmem) will already be basically ignored by
> kexec-tools. So, placing it inside a hierarchy already makes it look
> special already.
>
> But after all, as we have to change kexec-tools either way, we can
> directly go ahead and flag it properly as special (in case there will
> ever be other cases where we could no longer distinguish it).
>
> > being parented especially because that tells you something about how
> > the memory is driver-managed and which mechanism might be in play.
>
> The could be communicated to some degree via the resource hierarchy.
>
> E.g.,
>
> [root@localhost ~]# cat /proc/iomem
> ...
> 14000-33fff : Persistent Memory
>   14000-1481f : namespace0.0
>   15000-33fff : dax0.0
> 15000-33fff : System RAM (driver managed)
>
> vs.
>
>:/# cat /proc/iomem
> [...]
> 14000-333ff : virtio-mem (virtio0)
>   14000-147ff : System RAM (driver managed)
>   14800-14fff : System RAM (driver managed)
>   15000-157ff : System RAM (driver managed)
>
> Good enough for my taste.
>
> > What about adding an optional /sys/firmware/memmap/X/parent attribute.
>
> I really don't want any firmware memmap entries for something that is
> not part of the firmware provided memmap. In addition,
> /sys/firmware/memmap/ is still a fairly x86_64 specific thing. Only mips
> and two arm configs enable it at all.
>
> So, IMHO, /sys/firmware/memmap/ is definitely not the way to go.

I think that's a policy decision and policy decisions do not belong in
the kernel. Give the tooling the opportunity to decide whether System
RAM stays that way over a kexec. The parenthetical reference otherwise
looks out of place to me in the /proc/iomem output. What makes it
"driver managed" is how the kernel handles it, not how the kernel
names it.


Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-05-01 Thread Dan Williams
On Fri, May 1, 2020 at 2:34 AM David Hildenbrand  wrote:
>
> On 01.05.20 00:24, Andrew Morton wrote:
> > On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand  
> > wrote:
> >
> >>>
> >>> Why does the firmware map support hotplug entries?
> >>
> >> I assume:
> >>
> >> The firmware memmap was added primarily for x86-64 kexec (and still, is
> >> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
> >> get hotplugged on real HW, they get added to e820. Same applies to
> >> memory added via HyperV balloon (unless memory is unplugged via
> >> ballooning and you reboot ... the the e820 is changed as well). I assume
> >> we wanted to be able to reflect that, to make kexec look like a real 
> >> reboot.
> >>
> >> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> >>
> >>
> >> But I assume only Andrew can enlighten us.
> >>
> >> @Andrew, any guidance here? Should we really add all memory to the
> >> firmware memmap, even if this contradicts with the existing
> >> documentation? (especially, if the actual firmware memmap will *not*
> >> contain that memory after a reboot)
> >
> > For some reason that patch is misattributed - it was authored by
> > Shaohui Zheng , who hasn't been heard from in
> > a decade.  I looked through the email discussion from that time and I'm
> > not seeing anything useful.  But I wasn't able to locate Dave Hansen's
> > review comments.
>
> Okay, thanks for checking. I think the documentation from 2008 is pretty
> clear what has to be done here. I will add some of these details to the
> patch description.
>
> Also, now that I know that esp. kexec-tools already don't consider
> dax/kmem memory properly (memory will not get dumped via kdump) and
> won't really suffer from a name change in /proc/iomem, I will go back to
> the MHP_DRIVER_MANAGED approach and
> 1. Don't create firmware memmap entries
> 2. Name the resource "System RAM (driver managed)"
> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
>
> This way, kernel users and user space can figure out that this memory
> has different semantics and handle it accordingly - I think that was
> what Eric was asking for.
>
> Of course, open for suggestions.

I'm still more of a fan of this being communicated by "System RAM"
being parented especially because that tells you something about how
the memory is driver-managed and which mechanism might be in play.
What about adding an optional /sys/firmware/memmap/X/parent attribute.
This lets tooling check if it cares via that interface and lets it
lookup the related infrastructure to interact with if it would do
something different for virtio-mem vs dax/kmem?


Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-04-30 Thread Dan Williams
On Thu, Apr 30, 2020 at 11:44 AM David Hildenbrand  wrote:
>
>  >>> If the class of memory is different then please by all means let's mark
> >>> it differently in struct resource so everyone knows it is different.
> >>> But that difference needs to be more than hotplug.
> >>>
> >>> That difference needs to be the hypervisor loaned us memory and might
> >>> take it back at any time, or this memory is persistent and so it has
> >>> these different characteristics so don't use it as ordinary ram.
> >>
> >> Yes, and I think kmem took an excellent approach of explicitly putting
> >> that "System RAM" into a resource hierarchy. That "System RAM" won't
> >> show up as a root node under /proc/iomem (see patch #3), which already
> >> results in kexec-tools to treat it in a special way. I am thinking about
> >> doing the same for virtio-mem.
> >
> > Reading this and your patch cover letters again my concern is that
> > the justification seems to be letting the tail wag the dog.
> >
> > You want kexec-tools to behave in a certain way so you are changing the
> > kernel.
> >
> > Rather it should be change the kernel to clearly reflect reality and if
> > you can get away without a change to kexec-tools that is a bonus.
> >
>
> Right, because user space has to have a way to figure out what to do.
>
> But talking about the firmware memmap, indicating something via a "raw
> firmware-provided memory map", that is not actually in the "raw
> firmware-provided memory map" feels wrong to me. (below)
>
>
> >>> That information is also useful to other people looking at the system
> >>> and seeing what is going on.
> >>>
> >>> Just please don't muddle the concepts, or assume that whatever subset of
> >>> hotplug memory you are dealing with is the only subset.
> >>
> >> I can certainly rephrase the subject/description/comment, stating that
> >> this is not to be used for ordinary hotplugged DIMMs - only when the
> >> device driver is under control to decide what to do with that memory -
> >> especially when kexec'ing.
> >>
> >> (previously, I called this flag MHP_DRIVER_MANAGED, but I think
> >> MHP_NO_FIRMWARE_MEMMAP is clearer, we just need a better description)
> >>
> >> Would that make it clearer?
> >
> > I am not certain, but Andrew Morton deliberately added that
> > firmware_map_add_hotplug call.  Which means that there is a reason
> > for putting hotplugged memory in the firmware map.
> >
> > So the justification needs to take that reason into account.  The
> > justification can not be it is hotplugged therefore it should not belong
> > in the firmware memory map.  Unless you can show that
> > firmware_map_add_hotplug that was actually a bug and should be removed.
> > But as it has been that way since 2010 that seems like a long shot.
> >
> > So my question is what is right for the firmware map?
>
> We have documentation for that since 2008. Andrews patch is from 2010.
>
> Documentation/ABI/testing/sysfs-firmware-memmap
>
> It clearly talks about "raw firmware-provided memory map" and why the
> interface was introduced at all ("on most architectures that
> firmware-provided memory map is modified afterwards by the kernel itself").
>
> >
> > Why does the firmware map support hotplug entries?
>
> I assume:
>
> The firmware memmap was added primarily for x86-64 kexec (and still, is
> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
> get hotplugged on real HW, they get added to e820. Same applies to
> memory added via HyperV balloon (unless memory is unplugged via
> ballooning and you reboot ... the the e820 is changed as well). I assume
> we wanted to be able to reflect that, to make kexec look like a real reboot.

I can at least say that this breakdown makes sense to me. Traditional
memory hotplug results in permanent change to the raw firmware memory
map reported by the host at next reboot. These device-driver-owned
memory regions really want a hotplug policy per-kernel boot instance
and should fall back to the default reserved state at reboot (kexec or
otherwise). When I say hotplug-policy I mean whether the current
kernel wants to treat the device range as System RAM or leave it as
device-managed. The intent is that the follow-on kernel needs to
re-decide the device policy.

>
> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
>


Re: [PATCH v1 2/3] mm/memory_hotplug: Introduce MHP_DRIVER_MANAGED

2020-04-30 Thread Dan Williams
On Thu, Apr 30, 2020 at 1:21 AM David Hildenbrand  wrote:
> >> Just because we decided to use some DAX memory in the current kernel as
> >> system ram, doesn't mean we should make that decision for the kexec
> >> kernel (e.g., using it as initial memory, placing kexec binaries onto
> >> it, etc.). This is also not what we would observe during a real reboot.
> >
> > Agree.
> >
> >> I can see that the "System RAM" resource will show up as child resource
> >> under the device e.g., in /proc/iomem.
> >>
> >> However, entries in /sys/firmware/memmap/ are created as "System RAM".
> >
> > True. Do you think this rename should just be limited to what type
> > /sys/firmware/memmap/ emits? I have the concern, but no proof
>
> We could split this patch into
>
> MHP_NO_FIRMWARE_MEMMAP (create firmware memmap entries)
>
> and
>
> MHP_DRIVER_MANAGED (name of the resource)
>
> See below, the latter might not be needed.
>
> > currently, that there are /proc/iomem walkers that explicitly look for
> > "System RAM", but might be thrown off by "System RAM (driver
> > managed)". I was not aware of /sys/firmware/memmap until about 5
> > minutes ago.
>
> The only two users of /proc/iomem I am aware of are kexec-tools and some
> s390x tools.
>
> kexec-tools on x86-64 uses /sys/firmware/memmap to craft the initial
> memmap, but uses /proc/iomem to
> a) Find places for kexec images
> b) Detect memory regions to dump via kdump
>
> I am not yet sure if we really need the "System RAM (driver managed)"
> part. If we can teach kexec-tools to
> a) Don't place kexec images on "System RAM" that has a parent resource
> (most likely requires kexec-tools changes)
> b) Consider for kdump "System RAM" that has a parent resource
> we might be able to avoid renaming that. (I assume that's already done)
>
> E.g., regarding virtio-mem (patch #3) I am currently also looking into
> creating a parent resource instead, like dax/kmem to avoid the rename:
>
> :/# cat /proc/iomem
> -0fff : Reserved
> [...]
> 1-13fff : System RAM
> 14000-33fff : virtio0
>   14000-147ff : System RAM
>   14800-14fff : System RAM
>   15000-157ff : System RAM
> 34000-303fff : virtio1
>   34000-347ff : System RAM
> 328000-32 : PCI Bus :00

Looks good to me if it flies with kexec-tools.
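
For illustration, a simplified userspace sketch of the /proc/iomem walk
being discussed: top-level "System RAM" is a candidate for placing kexec
segments, while indented (child) "System RAM" under a device parent is only
collected for the kdump list. Real kexec-tools parsing is more involved,
and /proc/iomem reports zero addresses to non-root users:

	#include <stdio.h>
	#include <string.h>
	#include <inttypes.h>

	int main(void)
	{
		FILE *f = fopen("/proc/iomem", "r");
		char line[256];

		if (!f) {
			perror("/proc/iomem");
			return 1;
		}

		while (fgets(line, sizeof(line), f)) {
			uint64_t start, end;
			int is_child = (line[0] == ' ');	/* indented => has a parent */

			if (!strstr(line, "System RAM"))
				continue;
			if (sscanf(line, " %" SCNx64 "-%" SCNx64, &start, &end) != 2)
				continue;

			printf("%s: %#" PRIx64 "-%#" PRIx64 "\n",
			       is_child ? "kdump only  " : "kexec usable", start, end);
		}
		fclose(f);
		return 0;
	}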


Re: [PATCH v1 2/3] mm/memory_hotplug: Introduce MHP_DRIVER_MANAGED

2020-04-30 Thread Dan Williams
On Thu, Apr 30, 2020 at 12:20 AM David Hildenbrand  wrote:
>
> On 29.04.20 18:08, David Hildenbrand wrote:
> > Some paravirtualized devices that add memory via add_memory() and
> > friends (esp. virtio-mem) don't want to create entries in
> > /sys/firmware/memmap/ - primarily to hinder kexec from adding this
> > memory to the boot memmap of the kexec kernel.
> >
> > In fact, such memory is never exposed via the firmware (e.g., e820), but
> > only via the device, so exposing this memory via /sys/firmware/memmap/ is
> > wrong:
> >  "kexec needs the raw firmware-provided memory map to setup the
> >   parameter segment of the kernel that should be booted with
> >   kexec. Also, the raw memory map is useful for debugging. For
> >   that reason, /sys/firmware/memmap is an interface that provides
> >   the raw memory map to userspace." [1]
> >
> > We want to let user space know that memory which is always detected,
> > added, and managed via a (device) driver - like memory managed by
> > virtio-mem - is special. It cannot be used for placing kexec segments
> > and the (device) driver is responsible for re-adding memory that
> > (eventually shrunk/grown/defragmented) memory after a reboot/kexec. It
> > should e.g., not be added to a fixed up firmware memmap. However, it should
> > be dumped by kdump.
> >
> > Also, such memory could behave differently than an ordinary DIMM - e.g.,
> > memory managed by virtio-mem can have holes inside added memory resource,
> > which should not be touched, especially for writing.
> >
> > Let's expose that memory as "System RAM (driver managed)" e.g., via
> > /pro/iomem.
> >
> > We don't have to worry about firmware_map_remove() on the removal path.
> > If there is no entry, it will simply return with -EINVAL.
> >
> > [1] 
> > https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmap
> >
> > Cc: Andrew Morton 
> > Cc: Michal Hocko 
> > Cc: Pankaj Gupta 
> > Cc: Wei Yang 
> > Cc: Baoquan He 
> > Cc: Eric Biederman 
> > Signed-off-by: David Hildenbrand 
> > ---
> >  include/linux/memory_hotplug.h |  8 
> >  mm/memory_hotplug.c| 20 
> >  2 files changed, 24 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > index bf0e3edb8688..cc538584b39e 100644
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -68,6 +68,14 @@ struct mhp_params {
> >   pgprot_t pgprot;
> >  };
> >
> > +/* Flags used for add_memory() and friends. */
> > +
> > +/*
> > + * Don't create entries in /sys/firmware/memmap/ and expose memory as
> > + * "System RAM (driver managed)" in e.g., /proc/iomem
> > + */
> > +#define MHP_DRIVER_MANAGED   1
> > +
> >  /*
> >   * Zone resizing functions
> >   *
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index ebdf6541d074..cfa0721280aa 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -98,11 +98,11 @@ void mem_hotplug_done(void)
> >  u64 max_mem_size = U64_MAX;
> >
> >  /* add this memory to iomem resource */
> > -static struct resource *register_memory_resource(u64 start, u64 size)
> > +static struct resource *register_memory_resource(u64 start, u64 size,
> > +  const char *resource_name)
> >  {
> >   struct resource *res;
> >   unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> > - char *resource_name = "System RAM";
> >
> >   /*
> >* Make sure value parsed from 'mem=' only restricts memory adding
> > @@ -1058,7 +1058,8 @@ int __ref add_memory_resource(int nid, struct 
> > resource *res,
> >   BUG_ON(ret);
> >
> >   /* create new memmap entry */
> > - firmware_map_add_hotplug(start, start + size, "System RAM");
> > + if (!(flags & MHP_DRIVER_MANAGED))
> > + firmware_map_add_hotplug(start, start + size, "System RAM");
> >
> >   /* device_online() will take the lock when calling online_pages() */
> >   mem_hotplug_done();
> > @@ -1081,10 +1082,21 @@ int __ref add_memory_resource(int nid, struct 
> > resource *res,
> >  /* requires device_hotplug_lock, see add_memory_resource() */
> >  int __ref __add_memory(int nid, u64 start, u64 size, unsigned long flags)
> >  {
> > + const char *resource_name = "System RAM";
> >   struct resource *res;
> >   int ret;
> >
> > - res = register_memory_resource(start, size);
> > + /*
> > +  * Indicate that memory managed by a driver is special. It's always
> > +  * detected and added via a driver, should not be given to the kexec
> > +  * kernel for booting when manually crafting the firmware memmap, and
> > +  * no kexec segments should be placed on it. However, kdump should
> > +  * dump this memory.
> > +  */
> > + if (flags & MHP_DRIVER_MANAGED)
> > + resource_name = "System RAM (driver managed)";
> > +
> > + res = register_memory_resource(start, size, 

Re: [PATCH v5 4/4] powerpc/papr_scm: Implement support for DSM_PAPR_SCM_HEALTH

2020-04-03 Thread Dan Williams
On Tue, Mar 31, 2020 at 7:33 AM Vaibhav Jain  wrote:
>
> This patch implements support for papr_scm command
> 'DSM_PAPR_SCM_HEALTH' that returns a newly introduced 'struct
> nd_papr_scm_dimm_health_stat' instance containing dimm health
> information back to user space in response to ND_CMD_CALL. This
> functionality is implemented in newly introduced papr_scm_get_health()
> that queries the scm-dimm health information and then copies these bitmaps
> to the package payload whose layout is defined by 'struct
> papr_scm_ndctl_health'.
>
> The patch also introduces a new member a new member 'struct
> papr_scm_priv.health' thats an instance of 'struct
> nd_papr_scm_dimm_health_stat' to cache the health information of a
> scm-dimm. As a result functions drc_pmem_query_health() and
> papr_flags_show() are updated to populate and use this new struct
> instead of two be64 integers that we earlier used.

Link to HCALL specification?

>
> Signed-off-by: Vaibhav Jain 
> ---
> Changelog:
>
> v4..v5: None
>
> v3..v4: Call the DSM_PAPR_SCM_HEALTH service function from
> papr_scm_service_dsm() instead of papr_scm_ndctl(). [Aneesh]
>
> v2..v3: Updated struct nd_papr_scm_dimm_health_stat_v1 to use '__xx'
> types as its exported to the userspace [Aneesh]
> Changed the constants DSM_PAPR_SCM_DIMM_XX indicating dimm
> health from enum to #defines [Aneesh]
>
> v1..v2: New patch in the series
> ---
>  arch/powerpc/include/uapi/asm/papr_scm_dsm.h |  40 +++
>  arch/powerpc/platforms/pseries/papr_scm.c| 109 ---
>  2 files changed, 132 insertions(+), 17 deletions(-)
>
> diff --git a/arch/powerpc/include/uapi/asm/papr_scm_dsm.h 
> b/arch/powerpc/include/uapi/asm/papr_scm_dsm.h
> index c039a49b41b4..8265125304ca 100644
> --- a/arch/powerpc/include/uapi/asm/papr_scm_dsm.h
> +++ b/arch/powerpc/include/uapi/asm/papr_scm_dsm.h
> @@ -132,6 +132,7 @@ struct nd_papr_scm_cmd_pkg {
>   */
>  enum dsm_papr_scm {
> DSM_PAPR_SCM_MIN =  0x1,
> +   DSM_PAPR_SCM_HEALTH,
> DSM_PAPR_SCM_MAX,
>  };
>
> @@ -158,4 +159,43 @@ static void *papr_scm_pcmd_to_payload(struct 
> nd_papr_scm_cmd_pkg *pcmd)
> else
> return (void *)((__u8 *) pcmd + pcmd->payload_offset);
>  }
> +
> +/* Various scm-dimm health indicators */
> +#define DSM_PAPR_SCM_DIMM_HEALTHY   0
> +#define DSM_PAPR_SCM_DIMM_UNHEALTHY 1
> +#define DSM_PAPR_SCM_DIMM_CRITICAL  2
> +#define DSM_PAPR_SCM_DIMM_FATAL 3
> +
> +/*
> + * Struct exchanged between kernel & ndctl in for PAPR_DSM_PAPR_SMART_HEALTH
> + * Various bitflags indicate the health status of the dimm.
> + *
> + * dimm_unarmed: Dimm not armed. So contents wont persist.
> + * dimm_bad_shutdown   : Previous shutdown did not persist contents.
> + * dimm_bad_restore: Contents from previous shutdown werent restored.
> + * dimm_scrubbed   : Contents of the dimm have been scrubbed.
> + * dimm_locked : Contents of the dimm cant be modified until CEC 
> reboot
> + * dimm_encrypted  : Contents of dimm are encrypted.
> + * dimm_health : Dimm health indicator.
> + */
> +struct nd_papr_scm_dimm_health_stat_v1 {
> +   __u8 dimm_unarmed;
> +   __u8 dimm_bad_shutdown;
> +   __u8 dimm_bad_restore;
> +   __u8 dimm_scrubbed;
> +   __u8 dimm_locked;
> +   __u8 dimm_encrypted;
> +   __u16 dimm_health;
> +};

Does the structure pack the same across different compilers and configurations?
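
For example, a build-time check along these lines (untested sketch; the
size/offset values simply restate the layout above) would flag any
compiler or config that lays the struct out differently:

#include <linux/build_bug.h>
#include <linux/stddef.h>

/* six __u8 members at offsets 0..5, __u16 at offset 6, 8 bytes total */
static_assert(offsetof(struct nd_papr_scm_dimm_health_stat_v1, dimm_health) == 6);
static_assert(sizeof(struct nd_papr_scm_dimm_health_stat_v1) == 8);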

> +
> +/*
> + * Typedef the current struct for dimm_health so that any application
> + * or kernel recompiled after introducing a new version automatically
> + * supports the new version.
> + */
> +#define nd_papr_scm_dimm_health_stat nd_papr_scm_dimm_health_stat_v1
> +
> +/* Current version number for the dimm health struct */
> +#define ND_PAPR_SCM_DIMM_HEALTH_VERSION 1
> +
>  #endif /* _UAPI_ASM_POWERPC_PAPR_SCM_DSM_H_ */
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index e8ce96d2249e..ce94762954e0 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -47,8 +47,7 @@ struct papr_scm_priv {
> struct mutex dimm_mutex;
>
> /* Health information for the dimm */
> -   __be64 health_bitmap;
> -   __be64 health_bitmap_valid;
> +   struct nd_papr_scm_dimm_health_stat health;
>  };
>
>  static int drc_pmem_bind(struct papr_scm_priv *p)
> @@ -158,6 +157,7 @@ static int drc_pmem_query_health(struct papr_scm_priv *p)
>  {
> unsigned long ret[PLPAR_HCALL_BUFSIZE];
> int64_t rc;
> +   __be64 health;
>
> rc = plpar_hcall(H_SCM_HEALTH, ret, p->drc_index);
> if (rc != H_SUCCESS) {
> @@ -172,13 +172,41 @@ static int drc_pmem_query_health(struct papr_scm_priv 
> *p)
> return rc;
>
> /* Store the retrieved health information in dimm platform data */
> -   p->health_bitmap = ret[0];
> -   

Re: [PATCH v5 3/4] powerpc/papr_scm,uapi: Add support for handling PAPR DSM commands

2020-04-03 Thread Dan Williams
On Tue, Mar 31, 2020 at 7:33 AM Vaibhav Jain  wrote:
>
> Implement support for handling PAPR DSM commands in papr_scm
> module. We advertise support for ND_CMD_CALL for the dimm command mask
> and implement necessary scaffolding in the module to handle ND_CMD_CALL
> ioctl and DSM commands that we receive.

They aren't ACPI Device Specific Methods in the papr_scm case, right?
I'd call them what the papr_scm specification calls them and replace
"DSM" throughout.

> The layout of the DSM commands as we expect from libnvdimm/libndctl is
> described in newly introduced uapi header 'papr_scm_dsm.h' which
> defines a new 'struct nd_papr_scm_cmd_pkg' header. This header is used
> to communicate the DSM command via 'nd_pkg_papr_scm->nd_command' and
> size of payload that need to be sent/received for servicing the DSM.
>
> The PAPR DSM commands are assigned indexes started from 0x1 to
> prevent them from overlapping ND_CMD_* values and also makes handling
> dimm commands in papr_scm_ndctl().

You don't necessarily need to have command number separation like
that. The function number spaces are unique per family.

> A new function cmd_to_func() is
> implemented that reads the args to papr_scm_ndctl() and performs
> sanity tests on them. In case of a DSM command being sent via
> ND_CMD_CALL a newly introduced function papr_scm_service_dsm() is
> called to handle the request.
>
> Signed-off-by: Vaibhav Jain 
>
> ---
> Changelog:
>
> v4..v5: Fixed a bug in new implementation of papr_scm_ndctl().
>
> v3..v4: Updated papr_scm_ndctl() to delegate DSM command handling to a
> different function papr_scm_service_dsm(). [Aneesh]
>
> v2..v3: Updated the nd_papr_scm_cmd_pkg to use __xx types as its
> exported to the userspace [Aneesh]
>
> v1..v2: New patch in the series.
> ---
>  arch/powerpc/include/uapi/asm/papr_scm_dsm.h | 161 +++
>  arch/powerpc/platforms/pseries/papr_scm.c|  97 ++-
>  2 files changed, 252 insertions(+), 6 deletions(-)
>  create mode 100644 arch/powerpc/include/uapi/asm/papr_scm_dsm.h
>
> diff --git a/arch/powerpc/include/uapi/asm/papr_scm_dsm.h 
> b/arch/powerpc/include/uapi/asm/papr_scm_dsm.h
> new file mode 100644
> index ..c039a49b41b4
> --- /dev/null
> +++ b/arch/powerpc/include/uapi/asm/papr_scm_dsm.h
> @@ -0,0 +1,161 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +/*
> + * PAPR SCM Device specific methods and struct for libndctl and ndctl
> + *
> + * (C) Copyright IBM 2020
> + *
> + * Author: Vaibhav Jain 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2, or (at your option)
> + * any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.

These 2 paragraphs of redundant license text can be dropped. The SPDX
line is sufficient.

> + */
> +
> +#ifndef _UAPI_ASM_POWERPC_PAPR_SCM_DSM_H_
> +#define _UAPI_ASM_POWERPC_PAPR_SCM_DSM_H_
> +
> +#include 
> +
> +#ifdef __KERNEL__
> +#include 
> +#else
> +#include 
> +#endif
> +
> +/*
> + * DSM Envelope:
> + *
> + * The ioctl ND_CMD_CALL transfers data between user-space and kernel via
> + * 'envelopes' which consists of a header and user-defined payload sections.
> + * The header is described by 'struct nd_papr_scm_cmd_pkg' which expects a
> + * payload following it and offset of which relative to the struct is 
> provided
> + * by 'nd_papr_scm_cmd_pkg.payload_offset'. *
> + *
> + *  +-+-+---+
> + *  |   64-Bytes  |   8-Bytes   |   Max 184-Bytes   |
> + *  +-+-+---+
> + *  |   nd_papr_scm_cmd_pkg |   |
> + *  |-+ |   |
> + *  |  nd_cmd_pkg | |   |
> + *  +-+-+---+
> + *  | nd_family   ||   |
> + *  | nd_size_out | cmd_status  |  |
> + *  | nd_size_in  | payload_version |  PAYLOAD |
> + *  | nd_command  | payload_offset ->  |
> + *  | nd_fw_size  | |  |
> + *  +-+-+---+
> + *
> + * DSM Header:
> + *
> + * The header is defined as 'struct nd_papr_scm_cmd_pkg' which embeds a
> + * 'struct nd_cmd_pkg' instance. The DSM command is assigned to member
> + * 'nd_cmd_pkg.nd_command'. Apart from size information of the envelop which 
> is

 s/envelop/envelope/

There's a 

Re: [PATCH v5 2/4] ndctl/uapi: Introduce NVDIMM_FAMILY_PAPR_SCM as a new NVDIMM DSM family

2020-04-03 Thread Dan Williams
On Tue, Mar 31, 2020 at 7:33 AM Vaibhav Jain  wrote:
>
> Add PAPR-scm family of DSM command-set to the white list of NVDIMM
> command sets.
>
> Signed-off-by: Vaibhav Jain 
> ---
> Changelog:
>
> v4..v5 : None
>
> v3..v4 : None
>
> v2..v3 : Updated the patch prefix to 'ndctl/uapi' [Aneesh]
>
> v1..v2 : None
> ---
>  include/uapi/linux/ndctl.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
> index de5d90212409..99fb60600ef8 100644
> --- a/include/uapi/linux/ndctl.h
> +++ b/include/uapi/linux/ndctl.h
> @@ -244,6 +244,7 @@ struct nd_cmd_pkg {
>  #define NVDIMM_FAMILY_HPE2 2
>  #define NVDIMM_FAMILY_MSFT 3
>  #define NVDIMM_FAMILY_HYPERV 4
> +#define NVDIMM_FAMILY_PAPR_SCM 5

Looks good, but please squash it with patch 3.


Re: [PATCH v5 1/4] powerpc/papr_scm: Fetch nvdimm health information from PHYP

2020-04-02 Thread Dan Williams
On Wed, Apr 1, 2020 at 8:08 PM Dan Williams  wrote:
[..]
> >  * "locked" : Indicating that nvdimm contents cant be modified
> >until next power cycle.
>
> There is the generic NDD_LOCKED flag, can you use that? ...and in
> general I wonder if we should try to unify all the common papr_scm and
> nfit health flags in a generic location. It will already be the case
> the ndctl needs to look somewhere papr specific for this data maybe it
> all should have been generic from the beginning.

The more I think about this the more I think this would be a good time to
introduce a common "health/" attribute group under the generic nmemX
sysfs, and then have one flag per-file / attribute. Not only does that
match the recommended sysfs ABI better, but it allows ndctl to
enumerate which flags are supported in addition to their state.
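
Roughly what I have in mind (untested sketch, attribute and field names
invented) is a nested attribute group, so each flag becomes its own file
under nmemX/health/:

static ssize_t flush_fail_show(struct device *dev,
		struct device_attribute *attr, char *buf)
{
	struct my_dimm *dimm = nvdimm_provider_data(to_nvdimm(dev));	/* hypothetical private struct */

	return sprintf(buf, "%d\n", dimm->flush_fail);
}
static DEVICE_ATTR_RO(flush_fail);

static struct attribute *dimm_health_attrs[] = {
	&dev_attr_flush_fail.attr,
	/* ...one attribute per supported health flag... */
	NULL,
};

static const struct attribute_group dimm_health_group = {
	.name = "health",	/* creates the health/ subdirectory */
	.attrs = dimm_health_attrs,
};

The group gets passed in via nvdimm_create()'s groups argument, and ndctl
can then enumerate the supported flags by listing the directory.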

> In any event, can you also add this content to a new
> Documentation/ABI/testing/sysfs-bus-papr? See sysfs-bus-nfit for
> comparison.


Re: [PATCH v4 16/25] nvdimm/ocxl: Implement the Read Error Log command

2020-04-02 Thread Dan Williams
On Tue, Mar 31, 2020 at 1:59 AM Alastair D'Silva  wrote:
>
> The read error log command extracts information from the controller's
> internal error log.
>
> This patch exposes this information in 2 ways:
> - During probe, if an error occurs & a log is available, print it to the
>   console
> - After probe, make the error log available to userspace via an IOCTL.
>   Userspace is notified of pending error logs in a later patch
>   ("powerpc/powernv/pmem: Forward events to userspace")

So, have a look at the recent papr_scm patches to add health flags and
smart data retrieval. I'd prefer to extend existing nvdimm device
retrieval mechanisms rather than invent new ones.


>
> Signed-off-by: Alastair D'Silva 
> ---
>  .../userspace-api/ioctl/ioctl-number.rst  |   1 +
>  drivers/nvdimm/ocxl/main.c| 240 ++
>  include/uapi/nvdimm/ocxlpmem.h|  46 
>  3 files changed, 287 insertions(+)
>  create mode 100644 include/uapi/nvdimm/ocxlpmem.h
>
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst 
> b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 9425377615ce..ba0ce7dca643 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -340,6 +340,7 @@ Code  Seq#Include File
>Comments
>  0xC0  00-0F  linux/usb/iowarrior.h
>  0xCA  00-0F  uapi/misc/cxl.h
>  0xCA  10-2F  uapi/misc/ocxl.h
> +0xCA  30-3F  uapi/nvdimm/ocxlpmem.h  
> OpenCAPI Persistent Memory
>  0xCA  80-BF  uapi/scsi/cxlflash_ioctl.h
>  0xCB  00-1F  CBM 
> serial IEC bus in development:
>   
> 
> diff --git a/drivers/nvdimm/ocxl/main.c b/drivers/nvdimm/ocxl/main.c
> index 9b85fcd3f1c9..e6be0029f658 100644
> --- a/drivers/nvdimm/ocxl/main.c
> +++ b/drivers/nvdimm/ocxl/main.c
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "ocxlpmem.h"
>
>  static const struct pci_device_id pci_tbl[] = {
> @@ -401,10 +402,190 @@ static int file_release(struct inode *inode, struct 
> file *file)
> return 0;
>  }
>
> +/**
> + * error_log_header_parse() - Parse the first 64 bits of the error log 
> command response
> + * @ocxlpmem: the device metadata
> + * @length: out, returns the number of bytes in the response (excluding the 
> 64 bit header)
> + */
> +static int error_log_header_parse(struct ocxlpmem *ocxlpmem, u16 *length)
> +{
> +   int rc;
> +   u64 val;
> +   u16 data_identifier;
> +   u32 data_length;
> +
> +   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
> +ocxlpmem->admin_command.data_offset,
> +OCXL_LITTLE_ENDIAN, &val);
> +   if (rc)
> +   return rc;
> +
> +   data_identifier = val >> 48;
> +   data_length = val & 0x;
> +
> +   if (data_identifier != 0x454C) { // 'EL'
> +   dev_err(&ocxlpmem->dev,
> +   "Bad data identifier for error log data, expected 
> 'EL', got '%2s' (%#x), data_length=%u\n",
> +   (char *)&data_identifier,
> +   (unsigned int)data_identifier, data_length);
> +   return -EINVAL;
> +   }
> +
> +   *length = data_length;
> +   return 0;
> +}
> +
> +static int read_error_log(struct ocxlpmem *ocxlpmem,
> + struct ioctl_ocxlpmem_error_log *log,
> + bool buf_is_user)
> +{
> +   u64 val;
> +   u16 user_buf_length;
> +   u16 buf_length;
> +   u64 *buf = (u64 *)log->buf_ptr;
> +   u16 i;
> +   int rc;
> +
> +   if (log->buf_size % 8)
> +   return -EINVAL;
> +
> +   rc = ocxlpmem_chi(ocxlpmem, &val);
> +   if (rc)
> +   return rc;
> +
> +   if (!(val & GLOBAL_MMIO_CHI_ELA))
> +   return -EAGAIN;
> +
> +   user_buf_length = log->buf_size;
> +
> +   mutex_lock(&ocxlpmem->admin_command.lock);
> +
> +   rc = admin_command_execute(ocxlpmem, ADMIN_COMMAND_ERRLOG);
> +   if (rc != STATUS_SUCCESS) {
> +   warn_status(ocxlpmem,
> +   "Unexpected status from retrieve error log", rc);
> +   goto out;
> +   }
> +
> +   rc = error_log_header_parse(ocxlpmem, &log->buf_size);
> +   if (rc)
> +   goto out;
> +   // log->buf_size now contains the returned buffer size, not the user 
> size
> +
> +   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
> +ocxlpmem->admin_command.data_offset + 
> 0x08,
> +OCXL_LITTLE_ENDIAN, &val);
> +   if (rc)
> +   goto out;
> +
> +   log->log_identifier = val >> 32;
> +   log->program_reference_code = val & 0x;
> +
> 

Re: [PATCH v4 14/25] nvdimm/ocxl: Add support for Admin commands

2020-04-02 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> Admin commands for these devices are the primary means of interacting
> with the device controller to provide functionality beyond the load/store
> capabilities offered via the NPU.
>
> For example, SMART data, firmware update, and device error logs are
> implemented via admin commands.
>
> This patch requests the metadata required to issue admin commands, as well
> as some helper functions to construct and check the completion of the
> commands.
>
> Signed-off-by: Alastair D'Silva 
> ---
>  drivers/nvdimm/ocxl/main.c  |  65 ++
>  drivers/nvdimm/ocxl/ocxlpmem.h  |  50 -
>  drivers/nvdimm/ocxl/ocxlpmem_internal.c | 261 
>  3 files changed, 375 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvdimm/ocxl/main.c b/drivers/nvdimm/ocxl/main.c
> index be76acd33d74..8db573036423 100644
> --- a/drivers/nvdimm/ocxl/main.c
> +++ b/drivers/nvdimm/ocxl/main.c
> @@ -217,6 +217,58 @@ static int register_lpc_mem(struct ocxlpmem *ocxlpmem)
> return 0;
>  }
>
> +/**
> + * extract_command_metadata() - Extract command data from MMIO & save it for 
> further use
> + * @ocxlpmem: the device metadata
> + * @offset: The base address of the command data structures (address of 
> CREQO)
> + * @command_metadata: A pointer to the command metadata to populate
> + * Return: 0 on success, negative on failure
> + */
> +static int extract_command_metadata(struct ocxlpmem *ocxlpmem, u32 offset,
> +   struct command_metadata *command_metadata)

How about "struct ocxlpmem *ocp" throughout all these patches? The
full duplication of the type name as the local variable name makes
this look like non-idiomatic Linux code to me. It had not quite hit me
until I saw "struct command_metadata *command_metadata" that just
strikes me as too literal and the person that gets to maintain this
code later will appreciate a smaller amount of typing.

Also, is it really the case that the layout of the admin command
metadata needs to be programmatically determined at runtime? I would
expect it to be a static command definition in the spec.
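
i.e. something like the below (offsets are placeholders I made up, not
from the spec) instead of reading CREQO/CRESPO at probe time:

/* hypothetical: admin command area layout fixed by the spec revision */
#define ADMIN_COMMAND_REQUEST_OFFSET	0x010	/* placeholder */
#define ADMIN_COMMAND_RESPONSE_OFFSET	0x020	/* placeholder */
#define ADMIN_COMMAND_DATA_OFFSET	0x030	/* placeholder */
#define ADMIN_COMMAND_DATA_SIZE		0x1000	/* placeholder */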


> +{
> +   int rc;
> +   u64 tmp;
> +
> +   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu, offset,
> +OCXL_LITTLE_ENDIAN, &tmp);
> +   if (rc)
> +   return rc;
> +
> +   command_metadata->request_offset = tmp >> 32;
> +   command_metadata->response_offset = tmp & 0x;
> +
> +   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu, offset + 8,
> +OCXL_LITTLE_ENDIAN, &tmp);
> +   if (rc)
> +   return rc;
> +
> +   command_metadata->data_offset = tmp >> 32;
> +   command_metadata->data_size = tmp & 0x;
> +
> +   command_metadata->id = 0;
> +
> +   return 0;
> +}
> +
> +/**
> + * setup_command_metadata() - Set up the command metadata
> + * @ocxlpmem: the device metadata
> + */
> +static int setup_command_metadata(struct ocxlpmem *ocxlpmem)
> +{
> +   int rc;
> +
> +   mutex_init(&ocxlpmem->admin_command.lock);
> +
> +   rc = extract_command_metadata(ocxlpmem, GLOBAL_MMIO_ACMA_CREQO,
> + &ocxlpmem->admin_command);
> +   if (rc)
> +   return rc;
> +
> +   return 0;
> +}
> +
>  /**
>   * allocate_minor() - Allocate a minor number to use for an OpenCAPI pmem 
> device
>   * @ocxlpmem: the device metadata
> @@ -421,6 +473,14 @@ static int probe(struct pci_dev *pdev, const struct 
> pci_device_id *ent)
>
> ocxlpmem->pdev = pci_dev_get(pdev);
>
> +   ocxlpmem->timeouts[ADMIN_COMMAND_ERRLOG] = 2000; // ms
> +   ocxlpmem->timeouts[ADMIN_COMMAND_HEARTBEAT] = 100; // ms
> +   ocxlpmem->timeouts[ADMIN_COMMAND_SMART] = 100; // ms
> +   ocxlpmem->timeouts[ADMIN_COMMAND_CONTROLLER_DUMP] = 1000; // ms
> +   ocxlpmem->timeouts[ADMIN_COMMAND_CONTROLLER_STATS] = 100; // ms
> +   ocxlpmem->timeouts[ADMIN_COMMAND_SHUTDOWN] = 1000; // ms
> +   ocxlpmem->timeouts[ADMIN_COMMAND_FW_UPDATE] = 16000; // ms
> +
> pci_set_drvdata(pdev, ocxlpmem);
>
> ocxlpmem->ocxl_fn = ocxl_function_open(pdev);
> @@ -467,6 +527,11 @@ static int probe(struct pci_dev *pdev, const struct 
> pci_device_id *ent)
> goto err;
> }
>
> +   if (setup_command_metadata(ocxlpmem)) {
> +   dev_err(>dev, "Could not read command metadata\n");
> +   goto err;
> +   }
> +
> elapsed = 0;
> timeout = ocxlpmem->readiness_timeout +
>   ocxlpmem->memory_available_timeout;
> diff --git a/drivers/nvdimm/ocxl/ocxlpmem.h b/drivers/nvdimm/ocxl/ocxlpmem.h
> index 3eadbe19f6d0..b72b3f909fc3 100644
> --- a/drivers/nvdimm/ocxl/ocxlpmem.h
> +++ b/drivers/nvdimm/ocxl/ocxlpmem.h
> @@ -7,6 +7,7 @@
>  #include 
>
>  #define LABEL_AREA_SIZEBIT_ULL(PA_SECTION_SHIFT)
> +#define DEFAULT_TIMEOUT 100
>
>  #define 

Re: [PATCH v5 1/4] powerpc/papr_scm: Fetch nvdimm health information from PHYP

2020-04-01 Thread Dan Williams
On Tue, Mar 31, 2020 at 7:33 AM Vaibhav Jain  wrote:
>
> Implement support for fetching nvdimm health information via
> H_SCM_HEALTH hcall as documented in Ref[1]. The hcall returns a pair
> of 64-bit big-endian integers which are then stored in 'struct
> papr_scm_priv' and subsequently partially exposed to user-space via
> newly introduced dimm specific attribute 'papr_flags'. Also a new asm
> header named 'papr-scm.h' is added that describes the interface
> between PHYP and guest kernel.
>
> Following flags are reported via 'papr_flags' sysfs attribute contents
> of which are space separated string flags indicating various nvdimm
> states:
>
>  * "not_armed"  : Indicating that nvdimm contents wont survive a power
>cycle.

s/wont/will not/

>  * "save_fail"  : Indicating that nvdimm contents couldn't be flushed
>during last shutdown event.

In the nfit definition this description is "flush_fail". The
"save_fail" flag was specific to hybrid devices that don't have
persistent media and instead scuttle away data from DRAM to flash on
power-failure.

>  * "restore_fail": Indicating that nvdimm contents couldn't be restored
>during dimm initialization.
>  * "encrypted"  : Dimm contents are encrypted.

This does not seem like a health flag to me; have you considered the
libnvdimm security interface for this indicator?

>  * "smart_notify": There is health event for the nvdimm.

Are you also going to signal the sysfs attribute when this event happens?

>  * "scrubbed"   : Indicating that contents of the nvdimm have been
>scrubbed.

This one seems odd to me: what does it mean if it is not set? What does
it mean if a new scrub has been launched? Basically, is there value in
exposing this state?

>  * "locked" : Indicating that nvdimm contents cant be modified
>until next power cycle.

There is the generic NDD_LOCKED flag, can you use that? ...and in
general I wonder if we should try to unify all the common papr_scm and
nfit health flags in a generic location. It will already be the case
that ndctl needs to look somewhere papr-specific for this data; maybe it
all should have been generic from the beginning.
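
Untested sketch of what I mean (argument names approximate), using the
health struct this series already caches:

	unsigned long dimm_flags = 0;

	if (p->health.dimm_locked)
		set_bit(NDD_LOCKED, &dimm_flags);

	p->nvdimm = nvdimm_create(p->bus, p, papr_scm_dimm_groups, dimm_flags,
				  PAPR_SCM_DIMM_CMD_MASK, 0, NULL);

Then ndctl picks the locked state up through the existing generic flags
attribute.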


In any event, can you also add this content to a new
Documentation/ABI/testing/sysfs-bus-papr? See sysfs-bus-nfit for
comparison.

>
> [1]: commit 58b278f568f0 ("powerpc: Provide initial documentation for
> PAPR hcalls")
>
> Signed-off-by: Vaibhav Jain 
> ---
> Changelog:
>
> v4..v5 : None
>
> v3..v4 : None
>
> v2..v3 : Removed PAPR_SCM_DIMM_HEALTH_NON_CRITICAL as a condition for
>  NVDIMM unarmed [Aneesh]
>
> v1..v2 : New patch in the series.
> ---
>  arch/powerpc/include/asm/papr_scm.h   |  48 ++
>  arch/powerpc/platforms/pseries/papr_scm.c | 105 +-
>  2 files changed, 151 insertions(+), 2 deletions(-)
>  create mode 100644 arch/powerpc/include/asm/papr_scm.h
>
> diff --git a/arch/powerpc/include/asm/papr_scm.h 
> b/arch/powerpc/include/asm/papr_scm.h
> new file mode 100644
> index ..868d3360f56a
> --- /dev/null
> +++ b/arch/powerpc/include/asm/papr_scm.h
> @@ -0,0 +1,48 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Structures and defines needed to manage nvdimms for spapr guests.
> + */
> +#ifndef _ASM_POWERPC_PAPR_SCM_H_
> +#define _ASM_POWERPC_PAPR_SCM_H_
> +
> +#include 
> +#include 
> +
> +/* DIMM health bitmap bitmap indicators */
> +/* SCM device is unable to persist memory contents */
> +#define PAPR_SCM_DIMM_UNARMED  PPC_BIT(0)
> +/* SCM device failed to persist memory contents */
> +#define PAPR_SCM_DIMM_SHUTDOWN_DIRTY   PPC_BIT(1)
> +/* SCM device contents are persisted from previous IPL */
> +#define PAPR_SCM_DIMM_SHUTDOWN_CLEAN   PPC_BIT(2)
> +/* SCM device contents are not persisted from previous IPL */
> +#define PAPR_SCM_DIMM_EMPTYPPC_BIT(3)
> +/* SCM device memory life remaining is critically low */
> +#define PAPR_SCM_DIMM_HEALTH_CRITICAL  PPC_BIT(4)
> +/* SCM device will be garded off next IPL due to failure */
> +#define PAPR_SCM_DIMM_HEALTH_FATAL PPC_BIT(5)
> +/* SCM contents cannot persist due to current platform health status */
> +#define PAPR_SCM_DIMM_HEALTH_UNHEALTHY PPC_BIT(6)
> +/* SCM device is unable to persist memory contents in certain conditions */
> +#define PAPR_SCM_DIMM_HEALTH_NON_CRITICAL  PPC_BIT(7)
> +/* SCM device is encrypted */
> +#define PAPR_SCM_DIMM_ENCRYPTEDPPC_BIT(8)
> +/* SCM device has been scrubbed and locked */
> +#define PAPR_SCM_DIMM_SCRUBBED_AND_LOCKED  PPC_BIT(9)
> +
> +/* Bits status indicators for health bitmap indicating unarmed dimm */
> +#define PAPR_SCM_DIMM_UNARMED_MASK (PAPR_SCM_DIMM_UNARMED |\
> +   PAPR_SCM_DIMM_HEALTH_UNHEALTHY)
> +
> +/* Bits status indicators for health bitmap indicating unflushed dimm */
> +#define 

Re: [PATCH v4 19/25] nvdimm/ocxl: Forward events to userspace

2020-04-01 Thread Dan Williams
On Tue, Mar 31, 2020 at 1:59 AM Alastair D'Silva  wrote:
>
> Some of the interrupts that the card generates are better handled
> by the userspace daemon, in particular:
> Controller Hardware/Firmware Fatal
> Controller Dump Available
> Error Log available
>
> This patch allows a userspace application to register an eventfd with
> the driver via SCM_IOCTL_EVENTFD to receive notifications of these
> interrupts.
>
> Userspace can then identify what events have occurred by calling
> SCM_IOCTL_EVENT_CHECK and checking against the SCM_IOCTL_EVENT_FOO
> masks.

The number of new ioctls in this driver is too high; it seems much of
this data can be exported via sysfs attributes, which are more
maintainable than ioctls. sysfs also has the ability to signal
events on sysfs attributes, see sysfs_notify_dirent().
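
E.g. (untested sketch, attribute name invented) the interrupt handler
just latches the state and pokes the attribute, and userspace can
poll() on it instead of registering an eventfd:

	static ssize_t error_log_available_show(struct device *dev,
			struct device_attribute *attr, char *buf)
	{
		struct ocxlpmem *ocxlpmem = container_of(dev, struct ocxlpmem, dev);

		return sprintf(buf, "%d\n", ocxlpmem->elog_pending);	/* hypothetical flag */
	}
	static DEVICE_ATTR_RO(error_log_available);

	/* in the interrupt handler, after noting the event: */
	sysfs_notify(&ocxlpmem->dev.kobj, NULL, "error_log_available");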

Can you step back and review the ABI exposure of the driver and what
can be moved to sysfs? If you need to have bus specific attributes
ordered underneath the libnvdimm generic attributes you can create a
sysfs attribute subdirectory.

In general a roadmap document of all the proposed ABI is needed to
make sure it is both sufficient and necessary. See the libnvdimm
document that introduced the initial libnvdimm ABI:

https://www.kernel.org/doc/Documentation/nvdimm/nvdimm.txt

>
> Signed-off-by: Alastair D'Silva 
> ---
>  drivers/nvdimm/ocxl/main.c | 220 +
>  drivers/nvdimm/ocxl/ocxlpmem.h |   4 +
>  include/uapi/nvdimm/ocxlpmem.h |  12 ++
>  3 files changed, 236 insertions(+)
>
> diff --git a/drivers/nvdimm/ocxl/main.c b/drivers/nvdimm/ocxl/main.c
> index 0040fc09cceb..cb6cdc9eb899 100644
> --- a/drivers/nvdimm/ocxl/main.c
> +++ b/drivers/nvdimm/ocxl/main.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -301,8 +302,19 @@ static void free_ocxlpmem(struct ocxlpmem *ocxlpmem)
>  {
> int rc;
>
> +   // Disable doorbells
> +   (void)ocxl_global_mmio_set64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_CHIEC,
> +OCXL_LITTLE_ENDIAN,
> +GLOBAL_MMIO_CHI_ALL);
> +
> free_minor(ocxlpmem);
>
> +   if (ocxlpmem->irq_addr[1])
> +   iounmap(ocxlpmem->irq_addr[1]);
> +
> +   if (ocxlpmem->irq_addr[0])
> +   iounmap(ocxlpmem->irq_addr[0]);
> +
> if (ocxlpmem->ocxl_context) {
> rc = ocxl_context_detach(ocxlpmem->ocxl_context);
> if (rc == -EBUSY)
> @@ -398,6 +410,11 @@ static int file_release(struct inode *inode, struct file 
> *file)
>  {
> struct ocxlpmem *ocxlpmem = file->private_data;
>
> +   if (ocxlpmem->ev_ctx) {
> +   eventfd_ctx_put(ocxlpmem->ev_ctx);
> +   ocxlpmem->ev_ctx = NULL;
> +   }
> +
> ocxlpmem_put(ocxlpmem);
> return 0;
>  }
> @@ -928,6 +945,52 @@ static int ioctl_controller_stats(struct ocxlpmem 
> *ocxlpmem,
> return rc;
>  }
>
> +static int ioctl_eventfd(struct ocxlpmem *ocxlpmem,
> +struct ioctl_ocxlpmem_eventfd __user *uarg)
> +{
> +   struct ioctl_ocxlpmem_eventfd args;
> +
> +   if (copy_from_user(&args, uarg, sizeof(args)))
> +   return -EFAULT;
> +
> +   if (ocxlpmem->ev_ctx)
> +   return -EBUSY;
> +
> +   ocxlpmem->ev_ctx = eventfd_ctx_fdget(args.eventfd);
> +   if (IS_ERR(ocxlpmem->ev_ctx))
> +   return PTR_ERR(ocxlpmem->ev_ctx);
> +
> +   return 0;
> +}
> +
> +static int ioctl_event_check(struct ocxlpmem *ocxlpmem, u64 __user *uarg)
> +{
> +   u64 val = 0;
> +   int rc;
> +   u64 chi = 0;
> +
> +   rc = ocxlpmem_chi(ocxlpmem, &chi);
> +   if (rc < 0)
> +   return rc;
> +
> +   if (chi & GLOBAL_MMIO_CHI_ELA)
> +   val |= IOCTL_OCXLPMEM_EVENT_ERROR_LOG_AVAILABLE;
> +
> +   if (chi & GLOBAL_MMIO_CHI_CDA)
> +   val |= IOCTL_OCXLPMEM_EVENT_CONTROLLER_DUMP_AVAILABLE;
> +
> +   if (chi & GLOBAL_MMIO_CHI_CFFS)
> +   val |= IOCTL_OCXLPMEM_EVENT_FIRMWARE_FATAL;
> +
> +   if (chi & GLOBAL_MMIO_CHI_CHFS)
> +   val |= IOCTL_OCXLPMEM_EVENT_HARDWARE_FATAL;
> +
> +   if (copy_to_user((u64 __user *)uarg, &val, sizeof(val)))
> +   return -EFAULT;
> +
> +   return rc;
> +}
> +
>  static long file_ioctl(struct file *file, unsigned int cmd, unsigned long 
> args)
>  {
> struct ocxlpmem *ocxlpmem = file->private_data;
> @@ -956,6 +1019,15 @@ static long file_ioctl(struct file *file, unsigned int 
> cmd, unsigned long args)
> rc = ioctl_controller_stats(ocxlpmem,
> (struct 
> ioctl_ocxlpmem_controller_stats __user *)args);
> break;
> +
> +   case IOCTL_OCXLPMEM_EVENTFD:
> +   rc = ioctl_eventfd(ocxlpmem,
> +  (struct ioctl_ocxlpmem_eventfd __user 
> *)args);
> +   break;
> +
> + 

Re: [PATCH v4 15/25] nvdimm/ocxl: Register a character device for userspace to interact with

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:53 PM Alastair D'Silva  wrote:
>
> This patch introduces a character device (/dev/ocxlpmemX) which further
> patches will use to interact with userspace, such as error logs,
> controller stats and card debug functionality.

This was asked earlier, but I'll reiterate, I do not see what
justifies an ocxlpmemX private device ABI vs routing through the
existing generic ndbusX and nmemX character devices.

>
> Signed-off-by: Alastair D'Silva 
> ---
>  drivers/nvdimm/ocxl/main.c | 117 -
>  drivers/nvdimm/ocxl/ocxlpmem.h |   2 +
>  2 files changed, 117 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvdimm/ocxl/main.c b/drivers/nvdimm/ocxl/main.c
> index 8db573036423..9b85fcd3f1c9 100644
> --- a/drivers/nvdimm/ocxl/main.c
> +++ b/drivers/nvdimm/ocxl/main.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include "ocxlpmem.h"
> @@ -356,6 +357,67 @@ static int ocxlpmem_register(struct ocxlpmem *ocxlpmem)
> return device_register(&ocxlpmem->dev);
>  }
>
> +static void ocxlpmem_put(struct ocxlpmem *ocxlpmem)
> +{
> +   put_device(&ocxlpmem->dev);
> +}
> +
> +static struct ocxlpmem *ocxlpmem_get(struct ocxlpmem *ocxlpmem)
> +{
> +   return (!get_device(&ocxlpmem->dev)) ? NULL : ocxlpmem;
> +}
> +
> +static struct ocxlpmem *find_and_get_ocxlpmem(dev_t devno)
> +{
> +   struct ocxlpmem *ocxlpmem;
> +   int minor = MINOR(devno);
> +
> +   mutex_lock(&minors_idr_lock);
> +   ocxlpmem = idr_find(&minors_idr, minor);
> +   if (ocxlpmem)
> +   ocxlpmem_get(ocxlpmem);
> +   mutex_unlock(&minors_idr_lock);
> +
> +   return ocxlpmem;
> +}
> +
> +static int file_open(struct inode *inode, struct file *file)
> +{
> +   struct ocxlpmem *ocxlpmem;
> +
> +   ocxlpmem = find_and_get_ocxlpmem(inode->i_rdev);
> +   if (!ocxlpmem)
> +   return -ENODEV;
> +
> +   file->private_data = ocxlpmem;
> +   return 0;
> +}
> +
> +static int file_release(struct inode *inode, struct file *file)
> +{
> +   struct ocxlpmem *ocxlpmem = file->private_data;
> +
> +   ocxlpmem_put(ocxlpmem);
> +   return 0;
> +}
> +
> +static const struct file_operations fops = {
> +   .owner  = THIS_MODULE,
> +   .open   = file_open,
> +   .release= file_release,
> +};
> +
> +/**
> + * create_cdev() - Create the chardev in /dev for the device
> + * @ocxlpmem: the SCM metadata
> + * Return: 0 on success, negative on failure
> + */
> +static int create_cdev(struct ocxlpmem *ocxlpmem)
> +{
> +   cdev_init(&ocxlpmem->cdev, &fops);
> +   return cdev_add(&ocxlpmem->cdev, ocxlpmem->dev.devt, 1);
> +}
> +
>  /**
>   * ocxlpmem_remove() - Free an OpenCAPI persistent memory device
>   * @pdev: the PCI device information struct
> @@ -376,6 +438,13 @@ static void remove(struct pci_dev *pdev)
> if (ocxlpmem->nvdimm_bus)
> nvdimm_bus_unregister(ocxlpmem->nvdimm_bus);
>
> +   /*
> +* Remove the cdev early to prevent a race against userspace
> +* via the char dev
> +*/
> +   if (ocxlpmem->cdev.owner)
> +   cdev_del(&ocxlpmem->cdev);
> +
> device_unregister(&ocxlpmem->dev);
> }
>  }
> @@ -527,11 +596,18 @@ static int probe(struct pci_dev *pdev, const struct 
> pci_device_id *ent)
> goto err;
> }
>
> -   if (setup_command_metadata(ocxlpmem)) {
> +   rc = setup_command_metadata(ocxlpmem);
> +   if (rc) {
> dev_err(>dev, "Could not read command metadata\n");
> goto err;
> }
>
> +   rc = create_cdev(ocxlpmem);
> +   if (rc) {
> +   dev_err(>dev, "Could not create character device\n");
> +   goto err;
> +   }
> +
> elapsed = 0;
> timeout = ocxlpmem->readiness_timeout +
>   ocxlpmem->memory_available_timeout;
> @@ -599,6 +675,36 @@ static struct pci_driver pci_driver = {
> .shutdown = remove,
>  };
>
> +static int file_init(void)
> +{
> +   int rc;
> +
> +   rc = alloc_chrdev_region(&ocxlpmem_dev, 0, NUM_MINORS, "ocxlpmem");
> +   if (rc) {
> +   idr_destroy(&minors_idr);
> +   pr_err("Unable to allocate OpenCAPI persistent memory major 
> number: %d\n",
> +  rc);
> +   return rc;
> +   }
> +
> +   ocxlpmem_class = class_create(THIS_MODULE, "ocxlpmem");
> +   if (IS_ERR(ocxlpmem_class)) {
> +   idr_destroy(&minors_idr);
> +   pr_err("Unable to create ocxlpmem class\n");
> +   unregister_chrdev_region(ocxlpmem_dev, NUM_MINORS);
> +   return PTR_ERR(ocxlpmem_class);
> +   }
> +
> +   return 0;
> +}
> +
> +static void file_exit(void)
> +{
> +   class_destroy(ocxlpmem_class);
> +   unregister_chrdev_region(ocxlpmem_dev, NUM_MINORS);
> +   idr_destroy(&minors_idr);
> +}
> +
>  static int __init 

Re: [PATCH v4 13/25] nvdimm/ocxl: Read the capability registers & wait for device ready

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> This patch reads timeouts & firmware version from the controller, and
> uses those timeouts to wait for the controller to report that it is ready
> before handing the memory over to libnvdimm.
>
> Signed-off-by: Alastair D'Silva 
> ---
>  drivers/nvdimm/ocxl/Makefile|  2 +-
>  drivers/nvdimm/ocxl/main.c  | 85 +
>  drivers/nvdimm/ocxl/ocxlpmem.h  | 29 +
>  drivers/nvdimm/ocxl/ocxlpmem_internal.c | 19 ++
>  4 files changed, 134 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/nvdimm/ocxl/ocxlpmem_internal.c
>
> diff --git a/drivers/nvdimm/ocxl/Makefile b/drivers/nvdimm/ocxl/Makefile
> index e0e8ade1987a..bab97082e062 100644
> --- a/drivers/nvdimm/ocxl/Makefile
> +++ b/drivers/nvdimm/ocxl/Makefile
> @@ -4,4 +4,4 @@ ccflags-$(CONFIG_PPC_WERROR)+= -Werror
>
>  obj-$(CONFIG_OCXL_PMEM) += ocxlpmem.o
>
> -ocxlpmem-y := main.o
> \ No newline at end of file
> +ocxlpmem-y := main.o ocxlpmem_internal.o
> diff --git a/drivers/nvdimm/ocxl/main.c b/drivers/nvdimm/ocxl/main.c
> index c0066fedf9cc..be76acd33d74 100644
> --- a/drivers/nvdimm/ocxl/main.c
> +++ b/drivers/nvdimm/ocxl/main.c
> @@ -8,6 +8,7 @@
>
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -327,6 +328,50 @@ static void remove(struct pci_dev *pdev)
> }
>  }
>
> +/**
> + * read_device_metadata() - Retrieve config information from the AFU and 
> save it for future use
> + * @ocxlpmem: the device metadata
> + * Return: 0 on success, negative on failure
> + */
> +static int read_device_metadata(struct ocxlpmem *ocxlpmem)
> +{
> +   u64 val;
> +   int rc;
> +
> +   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_CCAP0,
> +OCXL_LITTLE_ENDIAN, &val);

This calling convention would seem to defeat the ability of sparse to
validate endian correctness. That's independent of this series, but I
wonder how someone reviews why this argument is sometimes
OCXL_LITTLE_ENDIAN and sometimes OCXL_HOST_ENDIAN?
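
For comparison, a minimal sketch of how sparse usually catches this:
keep the raw register value in a __le64 and convert explicitly (the
helper name here is made up):

	__le64 raw;
	u64 val;

	rc = mmio_read64_raw(afu, GLOBAL_MMIO_CCAP0, &raw);	/* hypothetical */
	if (rc)
		return rc;

	val = le64_to_cpu(raw);	/* sparse warns wherever raw is used unconverted */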

> +   if (rc)
> +   return rc;
> +
> +   ocxlpmem->scm_revision = val & 0x;
> +   ocxlpmem->read_latency = (val >> 32) & 0x;
> +   ocxlpmem->readiness_timeout = (val >> 48) & 0x0F;
> +   ocxlpmem->memory_available_timeout = val >> 52;

Maybe some macros to parse out these register fields?
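
i.e. something like this (untested; the exact field widths come from the
spec, the masks below just mirror the open-coded shifts above):

	#include <linux/bitfield.h>

	#define CCAP0_SCM_REVISION		GENMASK_ULL(31, 0)
	#define CCAP0_READ_LATENCY		GENMASK_ULL(47, 32)
	#define CCAP0_READINESS_TIMEOUT		GENMASK_ULL(51, 48)
	#define CCAP0_MEM_AVAILABLE_TIMEOUT	GENMASK_ULL(63, 52)

	ocxlpmem->scm_revision = FIELD_GET(CCAP0_SCM_REVISION, val);
	ocxlpmem->read_latency = FIELD_GET(CCAP0_READ_LATENCY, val);
	ocxlpmem->readiness_timeout = FIELD_GET(CCAP0_READINESS_TIMEOUT, val);
	ocxlpmem->memory_available_timeout = FIELD_GET(CCAP0_MEM_AVAILABLE_TIMEOUT, val);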

> +
> +   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_CCAP1,
> +OCXL_LITTLE_ENDIAN, &val);
> +   if (rc)
> +   return rc;
> +
> +   ocxlpmem->max_controller_dump_size = val & 0x;
> +
> +   // Extract firmware version text
> +   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_FWVER,
> +OCXL_HOST_ENDIAN,
> +(u64 *)ocxlpmem->fw_version);
> +   if (rc)
> +   return rc;
> +
> +   ocxlpmem->fw_version[8] = '\0';
> +
> +   dev_info(&ocxlpmem->dev,
> +"Firmware version '%s' SCM revision %d:%d\n",
> +ocxlpmem->fw_version, ocxlpmem->scm_revision >> 4,
> +ocxlpmem->scm_revision & 0x0F);

Does the driver need to be chatty here? If this data is relevant,
should it appear in sysfs by default?

> +
> +   return 0;
> +}
> +
>  /**
>   * probe_function0() - Set up function 0 for an OpenCAPI persistent memory 
> device
>   * This is important as it enables templates higher than 0 across all other
> @@ -359,6 +404,9 @@ static int probe(struct pci_dev *pdev, const struct 
> pci_device_id *ent)
>  {
> struct ocxlpmem *ocxlpmem;
> int rc;
> +   u64 chi;
> +   u16 elapsed, timeout;
> +   bool ready = false;
>
> if (PCI_FUNC(pdev->devfn) == 0)
> return probe_function0(pdev);
> @@ -413,6 +461,43 @@ static int probe(struct pci_dev *pdev, const struct 
> pci_device_id *ent)
> goto err;
> }
>
> +   rc = read_device_metadata(ocxlpmem);
> +   if (rc) {
> +   dev_err(>dev, "Could not read metadata\n");
> +   goto err;
> +   }
> +
> +   elapsed = 0;
> +   timeout = ocxlpmem->readiness_timeout +
> + ocxlpmem->memory_available_timeout;
> +
> +   while (true) {
> +   rc = ocxlpmem_chi(ocxlpmem, &chi);
> +   ready = (chi & (GLOBAL_MMIO_CHI_CRDY | GLOBAL_MMIO_CHI_MA)) ==
> +   (GLOBAL_MMIO_CHI_CRDY | GLOBAL_MMIO_CHI_MA);
> +
> +   if (ready)
> +   break;
> +
> +   if (elapsed++ > timeout) {
> +   dev_err(>dev,
> +   "OpenCAPI Persistent Memory ready 
> timeout.\n");
> +
> +   if (!(chi & GLOBAL_MMIO_CHI_CRDY))
> +   dev_err(>dev,
> +   "controller is not 

Re: [PATCH v4 12/25] nvdimm/ocxl: Add register addresses & status values to the header

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:53 PM Alastair D'Silva  wrote:
>
> These values have been taken from the device specifications.

Link to specification?


Re: [PATCH v4 11/25] powerpc: Enable the OpenCAPI Persistent Memory driver for powernv_defconfig

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> This patch enables the OpenCAPI Persistent Memory driver, as well
> as DAX support, for the 'powernv' defconfig.
>
> DAX is not a strict requirement for the functioning of the driver, but it
> is likely that a user will want to create a DAX device on top of their
> persistent memory device.
>
> Signed-off-by: Alastair D'Silva 
> Reviewed-by: Andrew Donnellan 
> ---
>  arch/powerpc/configs/powernv_defconfig | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/arch/powerpc/configs/powernv_defconfig 
> b/arch/powerpc/configs/powernv_defconfig
> index 71749377d164..921d77bbd3d2 100644
> --- a/arch/powerpc/configs/powernv_defconfig
> +++ b/arch/powerpc/configs/powernv_defconfig
> @@ -348,3 +348,8 @@ CONFIG_KVM_BOOK3S_64=m
>  CONFIG_KVM_BOOK3S_64_HV=m
>  CONFIG_VHOST_NET=m
>  CONFIG_PRINTK_TIME=y
> +CONFIG_ZONE_DEVICE=y
> +CONFIG_OCXL_PMEM=m
> +CONFIG_DEV_DAX=m
> +CONFIG_DEV_DAX_PMEM=m
> +CONFIG_FS_DAX=y

These options have dependencies. I think it would be better to implement
a top-level configuration question called something like
PERSISTENT_MEMORY_ALL that goes and selects all the bus providers and
infrastructure and lets other defaults follow along. For example,
CONFIG_DEV_DAX could grow a "default LIBNVDIMM" and then
CONFIG_DEV_DAX_PMEM would default on as well. If
CONFIG_PERSISTENT_MEMORY_ALL selected all the bus providers and
ZONE_DEVICE then the Kconfig system could prompt you to where the
dependencies are not satisfied.
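
Something along these lines (untested Kconfig sketch, option name
invented) is what I'm getting at:

	config PERSISTENT_MEMORY_ALL
		bool "Enable persistent memory infrastructure and bus providers"
		select LIBNVDIMM
		# plus the platform's bus providers, e.g. ACPI_NFIT or OCXL_PMEM

	# ...and the follow-on defaults:
	config DEV_DAX
		tristate "DAX: direct access mapping device"
		default LIBNVDIMM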


Re: [PATCH v4 10/25] nvdimm: Add driver for OpenCAPI Persistent Memory

2020-04-01 Thread Dan Williams
On Wed, Apr 1, 2020 at 1:49 AM Dan Williams  wrote:
>
> On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  
> wrote:
> >
> > This driver exposes LPC memory on OpenCAPI pmem cards
> > as an NVDIMM, allowing the existing nvram infrastructure
> > to be used.
> >
> > Namespace metadata is stored on the media itself, so
> > scm_reserve_metadata() maps 1 section's worth of PMEM storage
> > at the start to hold this. The rest of the PMEM range is registered
> > with libnvdimm as an nvdimm. ndctl_config_read/write/size() provide
> > callbacks to libnvdimm to access the metadata.
> >
> > Signed-off-by: Alastair D'Silva 
> > ---
> >  drivers/nvdimm/Kconfig |   2 +
> >  drivers/nvdimm/Makefile|   1 +
> >  drivers/nvdimm/ocxl/Kconfig|  15 ++
> >  drivers/nvdimm/ocxl/Makefile   |   7 +
> >  drivers/nvdimm/ocxl/main.c | 476 +
> >  drivers/nvdimm/ocxl/ocxlpmem.h |  23 ++
> >  6 files changed, 524 insertions(+)
> >  create mode 100644 drivers/nvdimm/ocxl/Kconfig
> >  create mode 100644 drivers/nvdimm/ocxl/Makefile
> >  create mode 100644 drivers/nvdimm/ocxl/main.c
> >  create mode 100644 drivers/nvdimm/ocxl/ocxlpmem.h
> >
> > diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
> > index b7d1eb38b27d..368328637182 100644
> > --- a/drivers/nvdimm/Kconfig
> > +++ b/drivers/nvdimm/Kconfig
> > @@ -131,4 +131,6 @@ config NVDIMM_TEST_BUILD
> >   core devm_memremap_pages() implementation and other
> >   infrastructure.
> >
> > +source "drivers/nvdimm/ocxl/Kconfig"
> > +
> >  endif
> > diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
> > index 29203f3d3069..bc02be11c794 100644
> > --- a/drivers/nvdimm/Makefile
> > +++ b/drivers/nvdimm/Makefile
> > @@ -33,3 +33,4 @@ libnvdimm-$(CONFIG_NVDIMM_KEYS) += security.o
> >  TOOLS := ../../tools
> >  TEST_SRC := $(TOOLS)/testing/nvdimm/test
> >  obj-$(CONFIG_NVDIMM_TEST_BUILD) += $(TEST_SRC)/iomap.o
> > +obj-$(CONFIG_LIBNVDIMM) += ocxl/
> > diff --git a/drivers/nvdimm/ocxl/Kconfig b/drivers/nvdimm/ocxl/Kconfig
> > new file mode 100644
> > index ..c5d927520920
> > --- /dev/null
> > +++ b/drivers/nvdimm/ocxl/Kconfig
> > @@ -0,0 +1,15 @@
> > +# SPDX-License-Identifier: GPL-2.0-only
> > +if LIBNVDIMM
> > +
> > +config OCXL_PMEM
> > +   tristate "OpenCAPI Persistent Memory"
> > +   depends on LIBNVDIMM && PPC_POWERNV && PCI && EEH && ZONE_DEVICE && 
> > OCXL
>
> Does OCXL_PMEM itself have any CONFIG_ZONE_DEVICE dependencies? That's
> more a function of CONFIG_DEV_DAX and CONFIG_FS_DAX. Doesn't OCXL
> already depend on CONFIG_PCI?
>
>
> > +   help
> > + Exposes devices that implement the OpenCAPI Storage Class Memory
> > + specification as persistent memory regions. You may also want
> > + DEV_DAX, DEV_DAX_PMEM & FS_DAX if you plan on using DAX devices
> > + stacked on top of this driver.
> > +
> > + Select N if unsure.
> > +
> > +endif
> > diff --git a/drivers/nvdimm/ocxl/Makefile b/drivers/nvdimm/ocxl/Makefile
> > new file mode 100644
> > index ..e0e8ade1987a
> > --- /dev/null
> > +++ b/drivers/nvdimm/ocxl/Makefile
> > @@ -0,0 +1,7 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +
> > +ccflags-$(CONFIG_PPC_WERROR)   += -Werror
> > +
> > +obj-$(CONFIG_OCXL_PMEM) += ocxlpmem.o
> > +
> > +ocxlpmem-y := main.o
> > \ No newline at end of file
> > diff --git a/drivers/nvdimm/ocxl/main.c b/drivers/nvdimm/ocxl/main.c
> > new file mode 100644
> > index ..c0066fedf9cc
> > --- /dev/null
> > +++ b/drivers/nvdimm/ocxl/main.c
> > @@ -0,0 +1,476 @@
> > +// SPDX-License-Identifier: GPL-2.0+
> > +// Copyright 2020 IBM Corp.
> > +
> > +/*
> > + * A driver for OpenCAPI devices that implement the Storage Class
> > + * Memory specification.
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include "ocxlpmem.h"
> > +
> > +static const struct pci_device_id pci_tbl[] = {
> > +   { PCI_DEVICE(PCI_VENDOR_ID_IBM, 0x0625), },
> > +   { }
> > +};
> > +
> > +MODULE_DEVICE_TABLE(pci, pci_tbl);
> > +
> > +#define NUM_MINORS 256 // Total to reserve
> > +
> > +static dev_t ocxlpmem_dev;
> > +static st

Re: [PATCH v4 10/25] nvdimm: Add driver for OpenCAPI Persistent Memory

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> This driver exposes LPC memory on OpenCAPI pmem cards
> as an NVDIMM, allowing the existing nvram infrastructure
> to be used.
>
> Namespace metadata is stored on the media itself, so
> scm_reserve_metadata() maps 1 section's worth of PMEM storage
> at the start to hold this. The rest of the PMEM range is registered
> with libnvdimm as an nvdimm. ndctl_config_read/write/size() provide
> callbacks to libnvdimm to access the metadata.
>
> Signed-off-by: Alastair D'Silva 
> ---
>  drivers/nvdimm/Kconfig |   2 +
>  drivers/nvdimm/Makefile|   1 +
>  drivers/nvdimm/ocxl/Kconfig|  15 ++
>  drivers/nvdimm/ocxl/Makefile   |   7 +
>  drivers/nvdimm/ocxl/main.c | 476 +
>  drivers/nvdimm/ocxl/ocxlpmem.h |  23 ++
>  6 files changed, 524 insertions(+)
>  create mode 100644 drivers/nvdimm/ocxl/Kconfig
>  create mode 100644 drivers/nvdimm/ocxl/Makefile
>  create mode 100644 drivers/nvdimm/ocxl/main.c
>  create mode 100644 drivers/nvdimm/ocxl/ocxlpmem.h
>
> diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
> index b7d1eb38b27d..368328637182 100644
> --- a/drivers/nvdimm/Kconfig
> +++ b/drivers/nvdimm/Kconfig
> @@ -131,4 +131,6 @@ config NVDIMM_TEST_BUILD
>   core devm_memremap_pages() implementation and other
>   infrastructure.
>
> +source "drivers/nvdimm/ocxl/Kconfig"
> +
>  endif
> diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
> index 29203f3d3069..bc02be11c794 100644
> --- a/drivers/nvdimm/Makefile
> +++ b/drivers/nvdimm/Makefile
> @@ -33,3 +33,4 @@ libnvdimm-$(CONFIG_NVDIMM_KEYS) += security.o
>  TOOLS := ../../tools
>  TEST_SRC := $(TOOLS)/testing/nvdimm/test
>  obj-$(CONFIG_NVDIMM_TEST_BUILD) += $(TEST_SRC)/iomap.o
> +obj-$(CONFIG_LIBNVDIMM) += ocxl/
> diff --git a/drivers/nvdimm/ocxl/Kconfig b/drivers/nvdimm/ocxl/Kconfig
> new file mode 100644
> index ..c5d927520920
> --- /dev/null
> +++ b/drivers/nvdimm/ocxl/Kconfig
> @@ -0,0 +1,15 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +if LIBNVDIMM
> +
> +config OCXL_PMEM
> +   tristate "OpenCAPI Persistent Memory"
> +   depends on LIBNVDIMM && PPC_POWERNV && PCI && EEH && ZONE_DEVICE && 
> OCXL

Does OCXL_PMEM itself have any CONFIG_ZONE_DEVICE dependencies? That's
more a function of CONFIG_DEV_DAX and CONFIG_FS_DAX. Doesn't OCXL
already depend on CONFIG_PCI?


> +   help
> + Exposes devices that implement the OpenCAPI Storage Class Memory
> + specification as persistent memory regions. You may also want
> + DEV_DAX, DEV_DAX_PMEM & FS_DAX if you plan on using DAX devices
> + stacked on top of this driver.
> +
> + Select N if unsure.
> +
> +endif
> diff --git a/drivers/nvdimm/ocxl/Makefile b/drivers/nvdimm/ocxl/Makefile
> new file mode 100644
> index ..e0e8ade1987a
> --- /dev/null
> +++ b/drivers/nvdimm/ocxl/Makefile
> @@ -0,0 +1,7 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +ccflags-$(CONFIG_PPC_WERROR)   += -Werror
> +
> +obj-$(CONFIG_OCXL_PMEM) += ocxlpmem.o
> +
> +ocxlpmem-y := main.o
> \ No newline at end of file
> diff --git a/drivers/nvdimm/ocxl/main.c b/drivers/nvdimm/ocxl/main.c
> new file mode 100644
> index ..c0066fedf9cc
> --- /dev/null
> +++ b/drivers/nvdimm/ocxl/main.c
> @@ -0,0 +1,476 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +// Copyright 2020 IBM Corp.
> +
> +/*
> + * A driver for OpenCAPI devices that implement the Storage Class
> + * Memory specification.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include "ocxlpmem.h"
> +
> +static const struct pci_device_id pci_tbl[] = {
> +   { PCI_DEVICE(PCI_VENDOR_ID_IBM, 0x0625), },
> +   { }
> +};
> +
> +MODULE_DEVICE_TABLE(pci, pci_tbl);
> +
> +#define NUM_MINORS 256 // Total to reserve
> +
> +static dev_t ocxlpmem_dev;
> +static struct class *ocxlpmem_class;
> +static struct mutex minors_idr_lock;
> +static struct idr minors_idr;
> +
> +/**
> + * ndctl_config_write() - Handle a ND_CMD_SET_CONFIG_DATA command from ndctl
> + * @ocxlpmem: the device metadata
> + * @command: the incoming data to write
> + * Return: 0 on success, negative on failure
> + */
> +static int ndctl_config_write(struct ocxlpmem *ocxlpmem,
> + struct nd_cmd_set_config_hdr *command)
> +{
> +   if (command->in_offset + command->in_length > LABEL_AREA_SIZE)
> +   return -EINVAL;
> +
> +   memcpy_flushcache(ocxlpmem->metadata_addr + command->in_offset,
> + command->in_buf, command->in_length);
> +
> +   return 0;
> +}
> +
> +/**
> + * ndctl_config_read() - Handle a ND_CMD_GET_CONFIG_DATA command from ndctl
> + * @ocxlpmem: the device metadata
> + * @command: the read request
> + * Return: 0 on success, negative on failure
> + */
> +static int ndctl_config_read(struct ocxlpmem *ocxlpmem,
> +struct nd_cmd_get_config_data_hdr *command)
> +{
> + 

Re: [PATCH v4 08/25] ocxl: Emit a log message showing how much LPC memory was detected

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> This patch emits a message showing how much LPC memory & special purpose
> memory was detected on an OCXL device.
>
> Signed-off-by: Alastair D'Silva 
> Acked-by: Frederic Barrat 
> Acked-by: Andrew Donnellan 
> ---
>  drivers/misc/ocxl/config.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
> index a62e3d7db2bf..69cca341d446 100644
> --- a/drivers/misc/ocxl/config.c
> +++ b/drivers/misc/ocxl/config.c
> @@ -568,6 +568,10 @@ static int read_afu_lpc_memory_info(struct pci_dev *dev,
> afu->special_purpose_mem_size =
> total_mem_size - lpc_mem_size;
> }
> +
> +   dev_info(>dev, "Probed LPC memory of %#llx bytes and special 
> purpose memory of %#llx bytes\n",
> +afu->lpc_mem_size, afu->special_purpose_mem_size);

A patch for a single log message is too fine-grained for my taste;
let's squash this into another patch in the series.

> +
> return 0;
>  }
>
> --
> 2.24.1
>


Re: [PATCH v4 07/25] ocxl: Add functions to map/unmap LPC memory

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> Add functions to map/unmap LPC memory
>

"map memory" is an overloaded term. I'm guessing this patch has
nothing to do with mapping memory in the MMU. Is it updating hardware
resource decoders to start claiming address space that was allocated
previously?

> Signed-off-by: Alastair D'Silva 
> Acked-by: Frederic Barrat 
> ---
>  drivers/misc/ocxl/core.c  | 51 +++
>  drivers/misc/ocxl/ocxl_internal.h |  3 ++
>  include/misc/ocxl.h   | 21 +
>  3 files changed, 75 insertions(+)
>
> diff --git a/drivers/misc/ocxl/core.c b/drivers/misc/ocxl/core.c
> index 2531c6cf19a0..75ff14e3882a 100644
> --- a/drivers/misc/ocxl/core.c
> +++ b/drivers/misc/ocxl/core.c
> @@ -210,6 +210,56 @@ static void unmap_mmio_areas(struct ocxl_afu *afu)
> release_fn_bar(afu->fn, afu->config.global_mmio_bar);
>  }
>
> +int ocxl_afu_map_lpc_mem(struct ocxl_afu *afu)
> +{
> +   struct pci_dev *dev = to_pci_dev(afu->fn->dev.parent);
> +
> +   if ((afu->config.lpc_mem_size + afu->config.special_purpose_mem_size) 
> == 0)
> +   return 0;
> +
> +   afu->lpc_base_addr = ocxl_link_lpc_map(afu->fn->link, dev);
> +   if (afu->lpc_base_addr == 0)
> +   return -EINVAL;
> +
> +   if (afu->config.lpc_mem_size > 0) {
> +   afu->lpc_res.start = afu->lpc_base_addr + 
> afu->config.lpc_mem_offset;
> +   afu->lpc_res.end = afu->lpc_res.start + 
> afu->config.lpc_mem_size - 1;
> +   }
> +
> +   if (afu->config.special_purpose_mem_size > 0) {
> +   afu->special_purpose_res.start = afu->lpc_base_addr +
> +
> afu->config.special_purpose_mem_offset;
> +   afu->special_purpose_res.end = afu->special_purpose_res.start 
> +
> +  
> afu->config.special_purpose_mem_size - 1;
> +   }
> +
> +   return 0;
> +}
> +EXPORT_SYMBOL_GPL(ocxl_afu_map_lpc_mem);
> +
> +struct resource *ocxl_afu_lpc_mem(struct ocxl_afu *afu)
> +{
> +   return >lpc_res;
> +}
> +EXPORT_SYMBOL_GPL(ocxl_afu_lpc_mem);
> +
> +static void unmap_lpc_mem(struct ocxl_afu *afu)
> +{
> +   struct pci_dev *dev = to_pci_dev(afu->fn->dev.parent);
> +
> +   if (afu->lpc_res.start || afu->special_purpose_res.start) {
> +   void *link = afu->fn->link;
> +
> +   // only release the link when the the last consumer calls 
> release
> +   ocxl_link_lpc_release(link, dev);
> +
> +   afu->lpc_res.start = 0;
> +   afu->lpc_res.end = 0;
> +   afu->special_purpose_res.start = 0;
> +   afu->special_purpose_res.end = 0;
> +   }
> +}
> +
>  static int configure_afu(struct ocxl_afu *afu, u8 afu_idx, struct pci_dev 
> *dev)
>  {
> int rc;
> @@ -251,6 +301,7 @@ static int configure_afu(struct ocxl_afu *afu, u8 
> afu_idx, struct pci_dev *dev)
>
>  static void deconfigure_afu(struct ocxl_afu *afu)
>  {
> +   unmap_lpc_mem(afu);
> unmap_mmio_areas(afu);
> reclaim_afu_pasid(afu);
> reclaim_afu_actag(afu);
> diff --git a/drivers/misc/ocxl/ocxl_internal.h 
> b/drivers/misc/ocxl/ocxl_internal.h
> index 2d7575225bd7..7b975a89db7b 100644
> --- a/drivers/misc/ocxl/ocxl_internal.h
> +++ b/drivers/misc/ocxl/ocxl_internal.h
> @@ -52,6 +52,9 @@ struct ocxl_afu {
> void __iomem *global_mmio_ptr;
> u64 pp_mmio_start;
> void *private;
> +   u64 lpc_base_addr; /* Covers both LPC & special purpose memory */
> +   struct resource lpc_res;
> +   struct resource special_purpose_res;
>  };
>
>  enum ocxl_context_status {
> diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
> index 357ef1aadbc0..d8b0b4d46bfb 100644
> --- a/include/misc/ocxl.h
> +++ b/include/misc/ocxl.h
> @@ -203,6 +203,27 @@ int ocxl_irq_set_handler(struct ocxl_context *ctx, int 
> irq_id,
>
>  // AFU Metadata
>
> +/**
> + * ocxl_afu_map_lpc_mem() - Map the LPC system & special purpose memory for 
> an AFU
> + * Do not call this during device discovery, as there may me multiple

s/me/be/


> + * devices on a link, and the memory is mapped for the whole link, not
> + * just one device. It should only be called after all devices have
> + * registered their memory on the link.
> + *
> + * @afu: The AFU that has the LPC memory to map
> + *
> + * Returns 0 on success, negative on failure
> + */
> +int ocxl_afu_map_lpc_mem(struct ocxl_afu *afu);
> +
> +/**
> + * ocxl_afu_lpc_mem() - Get the physical address range of LPC memory for an 
> AFU
> + * @afu: The AFU associated with the LPC memory
> + *
> + * Returns a pointer to the resource struct for the physical address range
> + */
> +struct resource *ocxl_afu_lpc_mem(struct ocxl_afu *afu);
> +
>  /**
>   * ocxl_afu_config() - Get a pointer to the config for an AFU
>   * @afu: a pointer to the AFU to get the config for
> --
> 2.24.1
>


Re: [PATCH v4 05/25] ocxl: Address kernel doc errors & warnings

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> This patch addresses warnings and errors from the kernel doc scripts for
> the OpenCAPI driver.
>
> It also makes minor tweaks to make the docs more consistent.
>
> Signed-off-by: Alastair D'Silva 
> Acked-by: Andrew Donnellan 
> ---
>  drivers/misc/ocxl/config.c| 24 
>  drivers/misc/ocxl/ocxl_internal.h |  9 +--
>  include/misc/ocxl.h   | 96 ---
>  3 files changed, 55 insertions(+), 74 deletions(-)

Looks good.


>
> diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
> index c8e19bfb5ef9..a62e3d7db2bf 100644
> --- a/drivers/misc/ocxl/config.c
> +++ b/drivers/misc/ocxl/config.c
> @@ -273,16 +273,16 @@ static int read_afu_info(struct pci_dev *dev, struct 
> ocxl_fn_config *fn,
>  }
>
>  /**
> - * Read the template version from the AFU
> - * dev: the device for the AFU
> - * fn: the AFU offsets
> - * len: outputs the template length
> - * version: outputs the major<<8,minor version
> + * read_template_version() - Read the template version from the AFU
> + * @dev: the device for the AFU
> + * @fn: the AFU offsets
> + * @len: outputs the template length
> + * @version: outputs the major<<8,minor version
>   *
>   * Returns 0 on success, negative on failure
>   */
>  static int read_template_version(struct pci_dev *dev, struct ocxl_fn_config 
> *fn,
> -   u16 *len, u16 *version)
> +u16 *len, u16 *version)
>  {
> u32 val32;
> u8 major, minor;
> @@ -476,16 +476,16 @@ static int validate_afu(struct pci_dev *dev, struct 
> ocxl_afu_config *afu)
>  }
>
>  /**
> - * Populate AFU metadata regarding LPC memory
> - * dev: the device for the AFU
> - * fn: the AFU offsets
> - * afu: the AFU struct to populate the LPC metadata into
> + * read_afu_lpc_memory_info() - Populate AFU metadata regarding LPC memory
> + * @dev: the device for the AFU
> + * @fn: the AFU offsets
> + * @afu: the AFU struct to populate the LPC metadata into
>   *
>   * Returns 0 on success, negative on failure
>   */
>  static int read_afu_lpc_memory_info(struct pci_dev *dev,
> -   struct ocxl_fn_config *fn,
> -   struct ocxl_afu_config *afu)
> +   struct ocxl_fn_config *fn,
> +   struct ocxl_afu_config *afu)
>  {
> int rc;
> u32 val32;
> diff --git a/drivers/misc/ocxl/ocxl_internal.h 
> b/drivers/misc/ocxl/ocxl_internal.h
> index 345bf843a38e..198e4e4bc51d 100644
> --- a/drivers/misc/ocxl/ocxl_internal.h
> +++ b/drivers/misc/ocxl/ocxl_internal.h
> @@ -122,11 +122,12 @@ int ocxl_config_check_afu_index(struct pci_dev *dev,
> struct ocxl_fn_config *fn, int afu_idx);
>
>  /**
> - * Update values within a Process Element
> + * ocxl_link_update_pe() - Update values within a Process Element
> + * @link_handle: the link handle associated with the process element
> + * @pasid: the PASID for the AFU context
> + * @tid: the new thread id for the process element
>   *
> - * link_handle: the link handle associated with the process element
> - * pasid: the PASID for the AFU context
> - * tid: the new thread id for the process element
> + * Returns 0 on success
>   */
>  int ocxl_link_update_pe(void *link_handle, int pasid, __u16 tid);
>
> diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
> index 0a762e387418..357ef1aadbc0 100644
> --- a/include/misc/ocxl.h
> +++ b/include/misc/ocxl.h
> @@ -62,8 +62,7 @@ struct ocxl_context;
>  // Device detection & initialisation
>
>  /**
> - * Open an OpenCAPI function on an OpenCAPI device
> - *
> + * ocxl_function_open() - Open an OpenCAPI function on an OpenCAPI device
>   * @dev: The PCI device that contains the function
>   *
>   * Returns an opaque pointer to the function, or an error pointer (check 
> with IS_ERR)
> @@ -71,8 +70,7 @@ struct ocxl_context;
>  struct ocxl_fn *ocxl_function_open(struct pci_dev *dev);
>
>  /**
> - * Get the list of AFUs associated with a PCI function device
> - *
> + * ocxl_function_afu_list() - Get the list of AFUs associated with a PCI 
> function device
>   * Returns a list of struct ocxl_afu *
>   *
>   * @fn: The OpenCAPI function containing the AFUs
> @@ -80,8 +78,7 @@ struct ocxl_fn *ocxl_function_open(struct pci_dev *dev);
>  struct list_head *ocxl_function_afu_list(struct ocxl_fn *fn);
>
>  /**
> - * Fetch an AFU instance from an OpenCAPI function
> - *
> + * ocxl_function_fetch_afu() - Fetch an AFU instance from an OpenCAPI 
> function
>   * @fn: The OpenCAPI function to get the AFU from
>   * @afu_idx: The index of the AFU to get
>   *
> @@ -92,23 +89,20 @@ struct list_head *ocxl_function_afu_list(struct ocxl_fn 
> *fn);
>  struct ocxl_afu *ocxl_function_fetch_afu(struct ocxl_fn *fn, u8 afu_idx);
>
>  /**
> - * Take a reference to an AFU
> - *
> + * ocxl_afu_get() - Take a reference to an AFU
>   * @afu: The AFU to 

Re: [PATCH v4 06/25] ocxl: Tally up the LPC memory on a link & allow it to be mapped

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:53 PM Alastair D'Silva  wrote:
>
> OpenCAPI LPC memory is allocated per link, but each link supports
> multiple AFUs, and each AFU can have LPC memory assigned to it.

Is there an OpenCAPI primer to decode these objects and their
associations that I can reference?


>
> This patch tallys the memory for all AFUs on a link, allowing it
> to be mapped in a single operation after the AFUs have been
> enumerated.
>
> Signed-off-by: Alastair D'Silva 
> ---
>  drivers/misc/ocxl/core.c  | 10 ++
>  drivers/misc/ocxl/link.c  | 60 +++
>  drivers/misc/ocxl/ocxl_internal.h | 33 +
>  3 files changed, 103 insertions(+)
>
> diff --git a/drivers/misc/ocxl/core.c b/drivers/misc/ocxl/core.c
> index b7a09b21ab36..2531c6cf19a0 100644
> --- a/drivers/misc/ocxl/core.c
> +++ b/drivers/misc/ocxl/core.c
> @@ -230,8 +230,18 @@ static int configure_afu(struct ocxl_afu *afu, u8 
> afu_idx, struct pci_dev *dev)
> if (rc)
> goto err_free_pasid;
>
> +   if (afu->config.lpc_mem_size || afu->config.special_purpose_mem_size) 
> {
> +   rc = ocxl_link_add_lpc_mem(afu->fn->link, 
> afu->config.lpc_mem_offset,
> +  afu->config.lpc_mem_size +
> +  
> afu->config.special_purpose_mem_size);
> +   if (rc)
> +   goto err_free_mmio;
> +   }
> +
> return 0;
>
> +err_free_mmio:
> +   unmap_mmio_areas(afu);
>  err_free_pasid:
> reclaim_afu_pasid(afu);
>  err_free_actag:
> diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
> index 58d111afd9f6..af119d3ef79a 100644
> --- a/drivers/misc/ocxl/link.c
> +++ b/drivers/misc/ocxl/link.c
> @@ -84,6 +84,11 @@ struct ocxl_link {
> int dev;
> atomic_t irq_available;
> struct spa *spa;
> +   struct mutex lpc_mem_lock; /* protects lpc_mem & lpc_mem_sz */
> +   u64 lpc_mem_sz; /* Total amount of LPC memory presented on the link */
> +   u64 lpc_mem;
> +   int lpc_consumers;
> +
> void *platform_data;
>  };
>  static struct list_head links_list = LIST_HEAD_INIT(links_list);
> @@ -396,6 +401,8 @@ static int alloc_link(struct pci_dev *dev, int PE_mask, 
> struct ocxl_link **out_l
> if (rc)
> goto err_spa;
>
> +   mutex_init(>lpc_mem_lock);
> +
> /* platform specific hook */
> rc = pnv_ocxl_spa_setup(dev, link->spa->spa_mem, PE_mask,
> >platform_data);
> @@ -711,3 +718,56 @@ void ocxl_link_free_irq(void *link_handle, int hw_irq)
> atomic_inc(>irq_available);
>  }
>  EXPORT_SYMBOL_GPL(ocxl_link_free_irq);
> +
> +int ocxl_link_add_lpc_mem(void *link_handle, u64 offset, u64 size)
> +{
> +   struct ocxl_link *link = (struct ocxl_link *)link_handle;
> +
> +   // Check for overflow
> +   if (offset > (offset + size))
> +   return -EINVAL;
> +
> +   mutex_lock(>lpc_mem_lock);
> +   link->lpc_mem_sz = max(link->lpc_mem_sz, offset + size);
> +
> +   mutex_unlock(>lpc_mem_lock);
> +
> +   return 0;
> +}
> +
> +u64 ocxl_link_lpc_map(void *link_handle, struct pci_dev *pdev)
> +{
> +   struct ocxl_link *link = (struct ocxl_link *)link_handle;
> +
> +   mutex_lock(>lpc_mem_lock);
> +
> +   if (!link->lpc_mem)
> +   link->lpc_mem = pnv_ocxl_platform_lpc_setup(pdev, 
> link->lpc_mem_sz);
> +
> +   if (link->lpc_mem)
> +   link->lpc_consumers++;
> +   mutex_unlock(>lpc_mem_lock);
> +
> +   return link->lpc_mem;
> +}
> +
> +void ocxl_link_lpc_release(void *link_handle, struct pci_dev *pdev)
> +{
> +   struct ocxl_link *link = (struct ocxl_link *)link_handle;
> +
> +   mutex_lock(>lpc_mem_lock);
> +
> +   if (!link->lpc_mem) {
> +   mutex_unlock(>lpc_mem_lock);
> +   return;
> +   }
> +
> +   WARN_ON(--link->lpc_consumers < 0);
> +
> +   if (link->lpc_consumers == 0) {
> +   pnv_ocxl_platform_lpc_release(pdev);
> +   link->lpc_mem = 0;
> +   }
> +
> +   mutex_unlock(>lpc_mem_lock);
> +}
> diff --git a/drivers/misc/ocxl/ocxl_internal.h 
> b/drivers/misc/ocxl/ocxl_internal.h
> index 198e4e4bc51d..2d7575225bd7 100644
> --- a/drivers/misc/ocxl/ocxl_internal.h
> +++ b/drivers/misc/ocxl/ocxl_internal.h
> @@ -142,4 +142,37 @@ int ocxl_irq_offset_to_id(struct ocxl_context *ctx, u64 
> offset);
>  u64 ocxl_irq_id_to_offset(struct ocxl_context *ctx, int irq_id);
>  void ocxl_afu_irq_free_all(struct ocxl_context *ctx);
>
> +/**
> + * ocxl_link_add_lpc_mem() - Increment the amount of memory required by an 
> OpenCAPI link
> + *
> + * @link_handle: The OpenCAPI link handle
> + * @offset: The offset of the memory to add
> + * @size: The number of bytes to increment memory on the link by
> + *
> + * Returns 0 on success, -EINVAL on overflow
> + */
> +int ocxl_link_add_lpc_mem(void 

Re: [PATCH v4 04/25] ocxl: Remove unnecessary externs

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> Function declarations don't need externs, remove the existing ones
> so they are consistent with newer code
>
> Signed-off-by: Alastair D'Silva 
> Acked-by: Andrew Donnellan 
> Acked-by: Frederic Barrat 

Looks good.


> ---
>  arch/powerpc/include/asm/pnv-ocxl.h | 40 ++---
>  include/misc/ocxl.h |  6 ++---
>  2 files changed, 22 insertions(+), 24 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/pnv-ocxl.h 
> b/arch/powerpc/include/asm/pnv-ocxl.h
> index 560a19bb71b7..205efc41a33c 100644
> --- a/arch/powerpc/include/asm/pnv-ocxl.h
> +++ b/arch/powerpc/include/asm/pnv-ocxl.h
> @@ -9,29 +9,27 @@
>  #define PNV_OCXL_TL_BITS_PER_RATE   4
>  #define PNV_OCXL_TL_RATE_BUF_SIZE   ((PNV_OCXL_TL_MAX_TEMPLATE+1) * 
> PNV_OCXL_TL_BITS_PER_RATE / 8)
>
> -extern int pnv_ocxl_get_actag(struct pci_dev *dev, u16 *base, u16 *enabled,
> -   u16 *supported);
> -extern int pnv_ocxl_get_pasid_count(struct pci_dev *dev, int *count);
> +int pnv_ocxl_get_actag(struct pci_dev *dev, u16 *base, u16 *enabled, u16 
> *supported);
> +int pnv_ocxl_get_pasid_count(struct pci_dev *dev, int *count);
>
> -extern int pnv_ocxl_get_tl_cap(struct pci_dev *dev, long *cap,
> +int pnv_ocxl_get_tl_cap(struct pci_dev *dev, long *cap,
> char *rate_buf, int rate_buf_size);
> -extern int pnv_ocxl_set_tl_conf(struct pci_dev *dev, long cap,
> -   uint64_t rate_buf_phys, int rate_buf_size);
> -
> -extern int pnv_ocxl_get_xsl_irq(struct pci_dev *dev, int *hwirq);
> -extern void pnv_ocxl_unmap_xsl_regs(void __iomem *dsisr, void __iomem *dar,
> -   void __iomem *tfc, void __iomem *pe_handle);
> -extern int pnv_ocxl_map_xsl_regs(struct pci_dev *dev, void __iomem **dsisr,
> -   void __iomem **dar, void __iomem **tfc,
> -   void __iomem **pe_handle);
> -
> -extern int pnv_ocxl_spa_setup(struct pci_dev *dev, void *spa_mem, int 
> PE_mask,
> -   void **platform_data);
> -extern void pnv_ocxl_spa_release(void *platform_data);
> -extern int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int 
> pe_handle);
> -
> -extern int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr);
> -extern void pnv_ocxl_free_xive_irq(u32 irq);
> +int pnv_ocxl_set_tl_conf(struct pci_dev *dev, long cap,
> +uint64_t rate_buf_phys, int rate_buf_size);
> +
> +int pnv_ocxl_get_xsl_irq(struct pci_dev *dev, int *hwirq);
> +void pnv_ocxl_unmap_xsl_regs(void __iomem *dsisr, void __iomem *dar,
> +void __iomem *tfc, void __iomem *pe_handle);
> +int pnv_ocxl_map_xsl_regs(struct pci_dev *dev, void __iomem **dsisr,
> + void __iomem **dar, void __iomem **tfc,
> + void __iomem **pe_handle);
> +
> +int pnv_ocxl_spa_setup(struct pci_dev *dev, void *spa_mem, int PE_mask, void 
> **platform_data);
> +void pnv_ocxl_spa_release(void *platform_data);
> +int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int pe_handle);
> +
> +int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr);
> +void pnv_ocxl_free_xive_irq(u32 irq);
>  u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size);
>  void pnv_ocxl_platform_lpc_release(struct pci_dev *pdev);
>
> diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
> index 06dd5839e438..0a762e387418 100644
> --- a/include/misc/ocxl.h
> +++ b/include/misc/ocxl.h
> @@ -173,7 +173,7 @@ int ocxl_context_detach(struct ocxl_context *ctx);
>   *
>   * Returns 0 on success, negative on failure
>   */
> -extern int ocxl_afu_irq_alloc(struct ocxl_context *ctx, int *irq_id);
> +int ocxl_afu_irq_alloc(struct ocxl_context *ctx, int *irq_id);
>
>  /**
>   * Frees an IRQ associated with an AFU context
> @@ -182,7 +182,7 @@ extern int ocxl_afu_irq_alloc(struct ocxl_context *ctx, 
> int *irq_id);
>   *
>   * Returns 0 on success, negative on failure
>   */
> -extern int ocxl_afu_irq_free(struct ocxl_context *ctx, int irq_id);
> +int ocxl_afu_irq_free(struct ocxl_context *ctx, int irq_id);
>
>  /**
>   * Gets the address of the trigger page for an IRQ
> @@ -193,7 +193,7 @@ extern int ocxl_afu_irq_free(struct ocxl_context *ctx, 
> int irq_id);
>   *
>   * returns the trigger page address, or 0 if the IRQ is not valid
>   */
> -extern u64 ocxl_afu_irq_get_addr(struct ocxl_context *ctx, int irq_id);
> +u64 ocxl_afu_irq_get_addr(struct ocxl_context *ctx, int irq_id);
>
>  /**
>   * Provide a callback to be called when an IRQ is triggered
> --
> 2.24.1
>


Re: [PATCH v4 03/25] powerpc/powernv: Map & release OpenCAPI LPC memory

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> This patch adds OPAL calls to powernv so that the OpenCAPI
> driver can map & release LPC (Lowest Point of Coherency)  memory.
>
> Signed-off-by: Alastair D'Silva 
> Reviewed-by: Andrew Donnellan 
> ---
>  arch/powerpc/include/asm/pnv-ocxl.h   |  2 ++
>  arch/powerpc/platforms/powernv/ocxl.c | 43 +++
>  2 files changed, 45 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/pnv-ocxl.h 
> b/arch/powerpc/include/asm/pnv-ocxl.h
> index 7de82647e761..560a19bb71b7 100644
> --- a/arch/powerpc/include/asm/pnv-ocxl.h
> +++ b/arch/powerpc/include/asm/pnv-ocxl.h
> @@ -32,5 +32,7 @@ extern int pnv_ocxl_spa_remove_pe_from_cache(void 
> *platform_data, int pe_handle)
>
>  extern int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr);
>  extern void pnv_ocxl_free_xive_irq(u32 irq);
> +u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size);
> +void pnv_ocxl_platform_lpc_release(struct pci_dev *pdev);
>
>  #endif /* _ASM_PNV_OCXL_H */
> diff --git a/arch/powerpc/platforms/powernv/ocxl.c 
> b/arch/powerpc/platforms/powernv/ocxl.c
> index 8c65aacda9c8..f13119a7c026 100644
> --- a/arch/powerpc/platforms/powernv/ocxl.c
> +++ b/arch/powerpc/platforms/powernv/ocxl.c
> @@ -475,6 +475,49 @@ void pnv_ocxl_spa_release(void *platform_data)
>  }
>  EXPORT_SYMBOL_GPL(pnv_ocxl_spa_release);
>
> +u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size)
> +{
> +   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> +   struct pnv_phb *phb = hose->private_data;

Is calling the local variable 'hose' instead of 'host' on purpose?

> +   u32 bdfn = pci_dev_id(pdev);
> +   __be64 base_addr_be64;
> +   u64 base_addr;
> +   int rc;
> +
> +   rc = opal_npu_mem_alloc(phb->opal_id, bdfn, size, _addr_be64);
> +   if (rc) {
> +   dev_warn(>dev,
> +"OPAL could not allocate LPC memory, rc=%d\n", rc);
> +   return 0;
> +   }
> +
> +   base_addr = be64_to_cpu(base_addr_be64);
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE

With the proposed cleanup in patch2 the ifdef can be elided here.

> +   rc = check_hotplug_memory_addressable(base_addr >> PAGE_SHIFT,
> + size >> PAGE_SHIFT);
> +   if (rc)
> +   return 0;

Is this an error worth logging if someone is wondering why their
device is not showing up?
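
Something along these lines, perhaps (untested sketch, message wording
just a suggestion):

	rc = check_hotplug_memory_addressable(base_addr >> PAGE_SHIFT,
					      size >> PAGE_SHIFT);
	if (rc) {
		dev_warn(&pdev->dev,
			 "LPC memory range is not addressable, rc=%d\n", rc);
		return 0;
	}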


> +#endif
> +
> +   return base_addr;
> +}
> +EXPORT_SYMBOL_GPL(pnv_ocxl_platform_lpc_setup);
> +
> +void pnv_ocxl_platform_lpc_release(struct pci_dev *pdev)
> +{
> +   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> +   struct pnv_phb *phb = hose->private_data;
> +   u32 bdfn = pci_dev_id(pdev);
> +   int rc;
> +
> +   rc = opal_npu_mem_release(phb->opal_id, bdfn);
> +   if (rc)
> +   dev_warn(>dev,
> +"OPAL reported rc=%d when releasing LPC memory\n", 
> rc);
> +}
> +EXPORT_SYMBOL_GPL(pnv_ocxl_platform_lpc_release);
> +
>  int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int pe_handle)
>  {
> struct spa_data *data = (struct spa_data *) platform_data;
> --
> 2.24.1
>


Re: [PATCH v4 02/25] mm/memory_hotplug: Allow check_hotplug_memory_addressable to be called from drivers

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> When setting up OpenCAPI connected persistent memory, the range check may
> not be performed until quite late (or perhaps not at all, if the user does
> not establish a DAX device).
>
> This patch makes the range check callable so we can perform the check while
> probing the OpenCAPI Persistent Memory device.
>
> Signed-off-by: Alastair D'Silva 
> Reviewed-by: Andrew Donnellan 
> ---
>  include/linux/memory_hotplug.h | 5 +
>  mm/memory_hotplug.c| 4 ++--
>  2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index f4d59155f3d4..9a19ae0d7e31 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -337,6 +337,11 @@ static inline void __remove_memory(int nid, u64 start, 
> u64 size) {}
>  extern void set_zone_contiguous(struct zone *zone);
>  extern void clear_zone_contiguous(struct zone *zone);
>
> +#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
> +int check_hotplug_memory_addressable(unsigned long pfn,
> +unsigned long nr_pages);
> +#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */

Let's move this to include/linux/memory.h with the other
CONFIG_MEMORY_HOTPLUG_SPARSE declarations, and add a dummy
implementation for the CONFIG_MEMORY_HOTPLUG_SPARSE=n case.
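
I.e. something like the below in include/linux/memory.h (sketch only,
exact placement to taste):

#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
int check_hotplug_memory_addressable(unsigned long pfn,
				     unsigned long nr_pages);
#else
static inline int check_hotplug_memory_addressable(unsigned long pfn,
						   unsigned long nr_pages)
{
	return 0;
}
#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */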

Also, this patch can be squashed with the next one, no need for it to
be stand alone.


> +
>  extern void __ref free_area_init_core_hotplug(int nid);
>  extern int __add_memory(int nid, u64 start, u64 size);
>  extern int add_memory(int nid, u64 start, u64 size);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0a54ffac8c68..14945f033594 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -276,8 +276,8 @@ static int check_pfn_span(unsigned long pfn, unsigned 
> long nr_pages,
> return 0;
>  }
>
> -static int check_hotplug_memory_addressable(unsigned long pfn,
> -   unsigned long nr_pages)
> +int check_hotplug_memory_addressable(unsigned long pfn,
> +unsigned long nr_pages)
>  {
> const u64 max_addr = PFN_PHYS(pfn + nr_pages) - 1;
>
> --
> 2.24.1
>


Re: [PATCH v4 01/25] powerpc/powernv: Add OPAL calls for LPC memory alloc/release

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> Add OPAL calls for LPC memory alloc/release
>

This seems to be referencing an existing api definition, can you
include a pointer to the spec in case someone wanted to understand
what these routines do? I suspect this is not allocating memory in the
traditional sense as much as it's allocating physical address space
for a device to be mapped?


> Signed-off-by: Alastair D'Silva 
> Acked-by: Andrew Donnellan 
> Acked-by: Frederic Barrat 
> ---
>  arch/powerpc/include/asm/opal-api.h| 2 ++
>  arch/powerpc/include/asm/opal.h| 2 ++
>  arch/powerpc/platforms/powernv/opal-call.c | 2 ++
>  3 files changed, 6 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/opal-api.h 
> b/arch/powerpc/include/asm/opal-api.h
> index c1f25a760eb1..9298e603001b 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -208,6 +208,8 @@
>  #define OPAL_HANDLE_HMI2   166
>  #defineOPAL_NX_COPROC_INIT 167
>  #define OPAL_XIVE_GET_VP_STATE 170
> +#define OPAL_NPU_MEM_ALLOC 171
> +#define OPAL_NPU_MEM_RELEASE   172
>  #define OPAL_MPIPL_UPDATE  173
>  #define OPAL_MPIPL_REGISTER_TAG174
>  #define OPAL_MPIPL_QUERY_TAG   175
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index 9986ac34b8e2..301fea46c7ca 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -39,6 +39,8 @@ int64_t opal_npu_spa_clear_cache(uint64_t phb_id, uint32_t 
> bdfn,
> uint64_t PE_handle);
>  int64_t opal_npu_tl_set(uint64_t phb_id, uint32_t bdfn, long cap,
> uint64_t rate_phys, uint32_t size);
> +int64_t opal_npu_mem_alloc(u64 phb_id, u32 bdfn, u64 size, __be64 *bar);
> +int64_t opal_npu_mem_release(u64 phb_id, u32 bdfn);
>
>  int64_t opal_console_write(int64_t term_number, __be64 *length,
>const uint8_t *buffer);
> diff --git a/arch/powerpc/platforms/powernv/opal-call.c 
> b/arch/powerpc/platforms/powernv/opal-call.c
> index 5cd0f52d258f..f26e58b72c04 100644
> --- a/arch/powerpc/platforms/powernv/opal-call.c
> +++ b/arch/powerpc/platforms/powernv/opal-call.c
> @@ -287,6 +287,8 @@ OPAL_CALL(opal_pci_set_pbcq_tunnel_bar, 
> OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
>  OPAL_CALL(opal_sensor_read_u64,OPAL_SENSOR_READ_U64);
>  OPAL_CALL(opal_sensor_group_enable,OPAL_SENSOR_GROUP_ENABLE);
>  OPAL_CALL(opal_nx_coproc_init, OPAL_NX_COPROC_INIT);
> +OPAL_CALL(opal_npu_mem_alloc,  OPAL_NPU_MEM_ALLOC);
> +OPAL_CALL(opal_npu_mem_release,OPAL_NPU_MEM_RELEASE);
>  OPAL_CALL(opal_mpipl_update,   OPAL_MPIPL_UPDATE);
>  OPAL_CALL(opal_mpipl_register_tag, OPAL_MPIPL_REGISTER_TAG);
>  OPAL_CALL(opal_mpipl_query_tag,OPAL_MPIPL_QUERY_TAG);
> --
> 2.24.1
>


Re: [PATCH v4 00/25] Add support for OpenCAPI Persistent Memory devices

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> This series adds support for OpenCAPI Persistent Memory devices on bare metal 
> (arch/powernv), exposing them as nvdimms so that we can make use of the 
> existing infrastructure. There already exists a driver for the same devices 
> abstracted through PowerVM (arch/pseries): 
> arch/powerpc/platforms/pseries/papr_scm.c
>
> These devices are connected via OpenCAPI, and present as LPC (lowest 
> coherence point) memory to the system; practically, that means that memory on 
> these cards could be treated as conventional, cache-coherent memory.
>
> Since the devices are connected via OpenCAPI, they are not enumerated via 
> ACPI. Instead, OpenCAPI links present as pseudo-PCI bridges, with devices 
> below them.
>
> This series introduces a driver that exposes the memory on these cards as 
> nvdimms, with each card getting its own bus. This is somewhat complicated by 
> the fact that the cards do not have out-of-band persistent storage for 
> metadata, so one SECTION_SIZE's worth of storage (see SPARSEMEM) is carved out 
> of the top of the card storage to implement the ndctl_config_* calls.

Is it really tied to section-size? Can't that change based on the
configured page-size? It's not clear to me why that would be the
choice, but I'll dig into the implementation.

> The driver is not responsible for configuring the NPU (NVLink Processing 
> Unit) BARs to map the LPC memory from the card into the system's physical 
> address space; instead, it requests this to be done via OPAL calls (typically 
> implemented by Skiboot).

Are OPAL calls similar to ACPI DSMs? I.e. methods for the OS to invoke
platform firmware services? What's Skiboot?

>
> The series is structured as follows:
>  - Required infrastructure changes & cleanup
>  - A minimal driver implementation
>  - Implementing additional features within the driver

Thanks for the intro and the changelog!

>
> Changelog:
> V4:
>   - Rebase on next-20200320

Do you have dependencies on other material that's in -next? Otherwise
-next is only a viable development baseline if you are going to merge
through Andrew's tree.

>   - Bump copyright to 2020
>   - Ensure all uapi headers use C89 compatible comments (missed ocxlpmem.h)
>   - Move the driver back to drivers/nvdimm/ocxl, after confirmation
> that this location is desirable
>   - Rename ocxl.c to ocxlpmem.c (+ support files)
>   - Rename all ocxl_pmem to ocxlpmem
>   - Address checkpatch --strict issues
>   - "powerpc/powernv: Add OPAL calls for LPC memory alloc/release"
> - Pass base address as __be64
>   - "ocxl: Tally up the LPC memory on a link & allow it to be mapped"
> - Address checkpatch spacing warnings
> - Reword blurb
> - Reword size description for ocxl_link_add_lpc_mem()
> - Add an early exit in ocxl_link_lpc_release() to avoid triggering
>   bogus warnings if called after ocxl_link_lpc_map() fails
>   - "powerpc/powernv: Add OPAL calls for LPC memory alloc/release"
> - Reword blurb
>   - "powerpc/powernv: Map & release OpenCAPI LPC memory"
> - Reword blurb
>   - Move minor_idr init from file_init() to ocxlpmem_init() (fixes runtime 
> error
> in "nvdimm: Add driver for OpenCAPI Persistent Memory")
>   - Wrap long lines
>   - "nvdimm: Add driver for OpenCAPI Storage Class Memory"
> - Remove '+ 1' workaround from serial number->cookie assignment
> - Drop out of memory message for ocxlpmem in probe()
> - Fix leaks of ocxlpmem & ocxlpmem->ocxl_fn in probe()
> - remove struct ocxlpmem_function0, it didn't value add
> - factor out err_unregistered label in probe
> - Address more checkpatch warnings
> - get/put the pci dev on probe/free
> - Drop ocxlpmem_ prefix from static functions
> - Propagate errors up from called functions in probe()
> - Set MODULE_LICENSE to GPLv2
> - Add myself as module author
> - Call nvdimm_bus_unregister() in remove() to release references
> - Don't call devm_memunmap on metadata_address, the release handler on
>  the device already deals with this
>   - "nvdimm/ocxl: Read the capability registers & wait for device ready"
> - Fix mask for read_latency
> - Fold in is_usable logic into timeout to remove error message race
> - propagate bad rc from read_device_metadata
>   - "nvdimm/ocxl: Add register addresses & status values to the header"
> - Add comments for register abbreviations where names have been
>   expanded
> - Add missing status for blocked on background task
> - Alias defines for firmware update status to show that the 
> duplication
>   of values is intentional
>   - "nvdimm/ocxl: Register a character device for userspace to interact with"
> - Add lock around minors IDR, delete the cdev before device_unregister
> - Propagate errors up from 

Re: [PATCH v2] libnvdimm: Update persistence domain value for of_pmem and papr_scm device

2020-03-23 Thread Dan Williams
On Fri, Mar 20, 2020 at 2:25 AM Aneesh Kumar K.V
 wrote:
>
>
> Hi Dan,
>
>
> Dan Williams  writes:
>
> ...
>
>
> >
> >>
> >> Or are you suggesting that application should not infer any of those
> >> details looking at persistence_domain value? If so what is the purpose
> >> of exporting that attribute?
> >
> > The way the patch was worded I thought it was referring to an explicit
> > mechanism outside cpu cache flushes, i.e. a mechanism that required a
> > driver call.
> >
>
> This patch is blocked because I am not expressing the details correctly.
> I updated this as below. Can you suggest if this is OK? If not, what
> alternate wording do you suggest to document "memory controller"?
>
>
> commit 329b46e88f8cd30eee4776b0de7913ab4d496bd8
> Author: Aneesh Kumar K.V 
> Date:   Wed Dec 18 13:53:16 2019 +0530
>
> libnvdimm: Update persistence domain value for of_pmem and papr_scm device
>
> Currently, kernel shows the below values
> "persistence_domain":"cpu_cache"
> "persistence_domain":"memory_controller"
> "persistence_domain":"unknown"
>
> "cpu_cache" indicates no extra instructions is needed to ensure the 
> persistence
> of data in the pmem media on power failure.
>
> "memory_controller" indicates cpu cache flush instructions is required to 
> flush
> the data. Platform provides mechanisms to automatically flush outstanding
> write data from memory controler to pmem on system power loss.
>
> Based on the above use memory_controller for non volatile regions on 
> ppc64.

Looks good to me, want to resend via git-format-patch?
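
For reference, if I'm reading the proposal right, the papr_scm side
boils down to something like the below (sketch from memory, not the
literal hunk; of_pmem gets the equivalent treatment):

	if (p->is_volatile) {
		p->region = nvdimm_volatile_region_create(p->bus, &ndr_desc);
	} else {
		set_bit(ND_REGION_PERSIST_MEMCTRL, &ndr_desc.flags);
		p->region = nvdimm_pmem_region_create(p->bus, &ndr_desc);
	}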


[PATCH v4 5/5] libnvdimm/region: Introduce an 'align' attribute

2020-03-03 Thread Dan Williams
The align attribute applies an alignment constraint for namespace
creation in a region. Whereas the 'align' attribute of a namespace
applied alignment padding via an info block, the region 'align' attribute
applies alignment constraints to the free space allocation.

The default for 'align' is the maximum known memremap_compat_align()
across all archs (16MiB from PowerPC at time of writing) multiplied by
the number of interleave ways if there is blk-aliasing. The minimum is
PAGE_SIZE and allows for the creation of cross-arch incompatible
namespaces, just as previous kernels allowed, but the expectation is
cross-arch and mode-independent compatibility by default.
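
In code terms the intent for the default is roughly the below
(illustrative sketch only, not necessarily the literal implementation;
MEMREMAP_COMPAT_ALIGN_MAX stands in for the largest known
memremap_compat_align() across archs):

	static unsigned long default_align(struct nd_region *nd_region)
	{
		unsigned long align = MEMREMAP_COMPAT_ALIGN_MAX;
		int i;

		for (i = 0; i < nd_region->ndr_mappings; i++) {
			struct nvdimm *nvdimm = nd_region->mapping[i].nvdimm;

			/* blk-aliasing multiplies by the interleave ways */
			if (test_bit(NDD_ALIASING, &nvdimm->flags))
				return align * nd_region->ndr_mappings;
		}

		return align;
	}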

The regression risk with this change is limited to cases that were
dependent on the ability to create unaligned namespaces, *and* for some
reason are unable to opt-out of aligned namespaces by writing to
'regionX/align'. If such a scenario arises the default can be flipped
from opt-out to opt-in of compat-aligned namespace creation, but that is
a last resort. The kernel will otherwise continue to support existing
defined misaligned namespaces.

Unfortunately this change needs to touch several parts of the
implementation at once:

- region/available_size: expand busy extents to current align
- region/max_available_extent: expand busy extents to current align
- namespace/size: trim free space to current align

...to keep the free space accounting conforming to the dynamic align
setting.

Reported-by: Aneesh Kumar K.V 
Reported-by: Jeff Moyer 
Reviewed-by: Aneesh Kumar K.V 
Reviewed-by: Jeff Moyer 
Link: 
https://lore.kernel.org/r/158041478371.3889308.14542630147672668068.st...@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/dimm_devs.c  |   86 +++
 drivers/nvdimm/namespace_devs.c |9 ++-
 drivers/nvdimm/nd.h |1 
 drivers/nvdimm/region_devs.c|  122 ---
 4 files changed, 192 insertions(+), 26 deletions(-)

diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 39a61a514746..b7b77e8d9027 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -563,6 +563,21 @@ int nvdimm_security_freeze(struct nvdimm *nvdimm)
return rc;
 }
 
+static unsigned long dpa_align(struct nd_region *nd_region)
+{
+   struct device *dev = _region->dev;
+
+   if (dev_WARN_ONCE(dev, !is_nvdimm_bus_locked(dev),
+   "bus lock required for capacity provision\n"))
+   return 0;
+   if (dev_WARN_ONCE(dev, !nd_region->ndr_mappings || nd_region->align
+   % nd_region->ndr_mappings,
+   "invalid region align %#lx mappings: %d\n",
+   nd_region->align, nd_region->ndr_mappings))
+   return 0;
+   return nd_region->align / nd_region->ndr_mappings;
+}
+
 int alias_dpa_busy(struct device *dev, void *data)
 {
resource_size_t map_end, blk_start, new;
@@ -571,6 +586,7 @@ int alias_dpa_busy(struct device *dev, void *data)
struct nd_region *nd_region;
struct nvdimm_drvdata *ndd;
struct resource *res;
+   unsigned long align;
int i;
 
if (!is_memory(dev))
@@ -608,13 +624,21 @@ int alias_dpa_busy(struct device *dev, void *data)
 * Find the free dpa from the end of the last pmem allocation to
 * the end of the interleave-set mapping.
 */
+   align = dpa_align(nd_region);
+   if (!align)
+   return 0;
+
for_each_dpa_resource(ndd, res) {
+   resource_size_t start, end;
+
if (strncmp(res->name, "pmem", 4) != 0)
continue;
-   if ((res->start >= blk_start && res->start < map_end)
-   || (res->end >= blk_start
-   && res->end <= map_end)) {
-   new = max(blk_start, min(map_end + 1, res->end + 1));
+
+   start = ALIGN_DOWN(res->start, align);
+   end = ALIGN(res->end + 1, align) - 1;
+   if ((start >= blk_start && start < map_end)
+   || (end >= blk_start && end <= map_end)) {
+   new = max(blk_start, min(map_end, end) + 1);
if (new != blk_start) {
blk_start = new;
goto retry;
@@ -654,6 +678,7 @@ resource_size_t nd_blk_available_dpa(struct nd_region 
*nd_region)
.res = NULL,
};
struct resource *res;
+   unsigned long align;
 
if (!ndd)
return 0;
@@ -661,10 +686,20 @@ resource_size_t nd_blk_available_dpa(struct nd_region 
*nd_region)
device_for_each_child(_bus->dev, , alias_

[PATCH v4 4/5] libnvdimm/region: Introduce NDD_LABELING

2020-03-03 Thread Dan Williams
The NDD_ALIASING flag is used to indicate where pmem capacity might
alias with blk capacity and require labeling. It is also used to
indicate whether the DIMM supports labeling. Separate this latter
capability into its own flag so that the NDD_ALIASING flag is scoped to
true aliased configurations.

To my knowledge aliased configurations only exist in the ACPI spec,
there are no known platforms that ship this support in production.

This clarity allows namespace-capacity alignment constraints around
interleave-ways to be relaxed.

Cc: Vishal Verma 
Cc: Oliver O'Halloran 
Reviewed-by: Jeff Moyer 
Reviewed-by: Aneesh Kumar K.V 
Link: 
https://lore.kernel.org/r/158041477856.3889308.4212605617834097674.st...@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams 
---
 arch/powerpc/platforms/pseries/papr_scm.c |2 +-
 drivers/acpi/nfit/core.c  |4 +++-
 drivers/nvdimm/dimm.c |2 +-
 drivers/nvdimm/dimm_devs.c|9 +
 drivers/nvdimm/namespace_devs.c   |2 +-
 drivers/nvdimm/nd.h   |2 +-
 drivers/nvdimm/region_devs.c  |   10 +-
 include/linux/libnvdimm.h |2 ++
 8 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 0b4467e378e5..589858cb3203 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -328,7 +328,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
}
 
dimm_flags = 0;
-   set_bit(NDD_ALIASING, _flags);
+   set_bit(NDD_LABELING, _flags);
 
p->nvdimm = nvdimm_create(p->bus, p, NULL, dimm_flags,
  PAPR_SCM_DIMM_CMD_MASK, 0, NULL);
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index a3320f93616d..71d7f2aa1b12 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2026,8 +2026,10 @@ static int acpi_nfit_register_dimms(struct 
acpi_nfit_desc *acpi_desc)
continue;
}
 
-   if (nfit_mem->bdw && nfit_mem->memdev_pmem)
+   if (nfit_mem->bdw && nfit_mem->memdev_pmem) {
set_bit(NDD_ALIASING, );
+   set_bit(NDD_LABELING, );
+   }
 
/* collate flags across all memdevs for this dimm */
list_for_each_entry(nfit_memdev, _desc->memdevs, list) {
diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c
index 64776ed15bb3..7d4ddc4d9322 100644
--- a/drivers/nvdimm/dimm.c
+++ b/drivers/nvdimm/dimm.c
@@ -99,7 +99,7 @@ static int nvdimm_probe(struct device *dev)
if (ndd->ns_current >= 0) {
rc = nd_label_reserve_dpa(ndd);
if (rc == 0)
-   nvdimm_set_aliasing(dev);
+   nvdimm_set_labeling(dev);
}
nvdimm_bus_unlock(dev);
 
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 94ea6dba6b4f..39a61a514746 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -32,7 +32,7 @@ int nvdimm_check_config_data(struct device *dev)
 
if (!nvdimm->cmd_mask ||
!test_bit(ND_CMD_GET_CONFIG_DATA, >cmd_mask)) {
-   if (test_bit(NDD_ALIASING, >flags))
+   if (test_bit(NDD_LABELING, >flags))
return -ENXIO;
else
return -ENOTTY;
@@ -173,11 +173,11 @@ int nvdimm_set_config_data(struct nvdimm_drvdata *ndd, 
size_t offset,
return rc;
 }
 
-void nvdimm_set_aliasing(struct device *dev)
+void nvdimm_set_labeling(struct device *dev)
 {
struct nvdimm *nvdimm = to_nvdimm(dev);
 
-   set_bit(NDD_ALIASING, >flags);
+   set_bit(NDD_LABELING, >flags);
 }
 
 void nvdimm_set_locked(struct device *dev)
@@ -312,8 +312,9 @@ static ssize_t flags_show(struct device *dev,
 {
struct nvdimm *nvdimm = to_nvdimm(dev);
 
-   return sprintf(buf, "%s%s\n",
+   return sprintf(buf, "%s%s%s\n",
test_bit(NDD_ALIASING, >flags) ? "alias " : "",
+   test_bit(NDD_LABELING, >flags) ? "label " : "",
test_bit(NDD_LOCKED, >flags) ? "lock " : "");
 }
 static DEVICE_ATTR_RO(flags);
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 77e211c7d94d..01f6c22f0d1a 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -2538,7 +2538,7 @@ static int init_active_labels(struct nd_region *nd_region)
if (!ndd) {
if (test_bit(NDD_LOCKED, >flags))
/* fail, label data may be unreadable */;
-   

[PATCH v4 3/5] libnvdimm/namespace: Enforce memremap_compat_align()

2020-03-03 Thread Dan Williams
The pmem driver on PowerPC crashes with the following signature when
instantiating misaligned namespaces that map their capacity via
memremap_pages().

BUG: Unable to handle kernel data access at 0xc00100040600
Faulting instruction address: 0xc0090790
NIP [c0090790] arch_add_memory+0xc0/0x130
LR [c0090744] arch_add_memory+0x74/0x130
Call Trace:
 arch_add_memory+0x74/0x130 (unreliable)
 memremap_pages+0x74c/0xa30
 devm_memremap_pages+0x3c/0xa0
 pmem_attach_disk+0x188/0x770
 nvdimm_bus_probe+0xd8/0x470

With the assumption that only memremap_pages() has alignment
constraints, enforce memremap_compat_align() for
pmem_should_map_pages(), nd_pfn, and nd_dax cases. This includes
preventing the creation of namespaces where the base address is
misaligned and cases there infoblock padding parameters are invalid.

Reported-by: Aneesh Kumar K.V 
Cc: Jeff Moyer 
Fixes: a3619190d62e ("libnvdimm/pfn: stop padding pmem namespaces to section 
alignment")
Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/namespace_devs.c |   17 +
 drivers/nvdimm/pfn.h|   12 
 drivers/nvdimm/pfn_devs.c   |   32 +---
 3 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 032dc61725ff..77e211c7d94d 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -10,6 +10,7 @@
 #include 
 #include "nd-core.h"
 #include "pmem.h"
+#include "pfn.h"
 #include "nd.h"
 
 static void namespace_io_release(struct device *dev)
@@ -1739,6 +1740,22 @@ struct nd_namespace_common 
*nvdimm_namespace_common_probe(struct device *dev)
return ERR_PTR(-ENODEV);
}
 
+   /*
+* Note, alignment validation for fsdax and devdax mode
+* namespaces happens in nd_pfn_validate() where infoblock
+* padding parameters can be applied.
+*/
+   if (pmem_should_map_pages(dev)) {
+   struct nd_namespace_io *nsio = to_nd_namespace_io(>dev);
+   struct resource *res = >res;
+
+   if (!IS_ALIGNED(res->start | (res->end + 1),
+   memremap_compat_align())) {
+   dev_err(>dev, "%pr misaligned, unable to map\n", 
res);
+   return ERR_PTR(-EOPNOTSUPP);
+   }
+   }
+
if (is_namespace_pmem(>dev)) {
struct nd_namespace_pmem *nspm;
 
diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h
index acb19517f678..37cb1b8a2a39 100644
--- a/drivers/nvdimm/pfn.h
+++ b/drivers/nvdimm/pfn.h
@@ -24,6 +24,18 @@ struct nd_pfn_sb {
__le64 npfns;
__le32 mode;
/* minor-version-1 additions for section alignment */
+   /**
+* @start_pad: Deprecated attribute to pad start-misaligned namespaces
+*
+* start_pad is deprecated because the original definition did
+* not comprehend that dataoff is relative to the base address
+* of the namespace not the start_pad adjusted base. The result
+* is that the dax path is broken, but the block-I/O path is
+* not. The kernel will no longer create namespaces using start
+* padding, but it still supports block-I/O for legacy
+* configurations mainly to allow a backup, reconfigure the
+* namespace, and restore flow to repair dax operation.
+*/
__le32 start_pad;
__le32 end_trunc;
/* minor-version-2 record the base alignment of the mapping */
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 79fe02d6f657..34db557dbad1 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -446,6 +446,7 @@ static bool nd_supported_alignment(unsigned long align)
 int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
 {
u64 checksum, offset;
+   struct resource *res;
enum nd_pfn_mode mode;
struct nd_namespace_io *nsio;
unsigned long align, start_pad;
@@ -578,13 +579,14 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char 
*sig)
 * established.
 */
nsio = to_nd_namespace_io(>dev);
-   if (offset >= resource_size(>res)) {
+   res = >res;
+   if (offset >= resource_size(res)) {
dev_err(_pfn->dev, "pfn array size exceeds capacity of %s\n",
dev_name(>dev));
return -EOPNOTSUPP;
}
 
-   if ((align && !IS_ALIGNED(nsio->res.start + offset + start_pad, align))
+   if ((align && !IS_ALIGNED(res->start + offset + start_pad, align))
|| !IS_ALIGNED(offset, PAGE_SIZE)) {
dev_err(_pfn->dev,
 

[PATCH v4 2/5] libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid

2020-03-03 Thread Dan Williams
The EOPNOTSUPP return code from the pmem driver indicates that the
namespace has a configuration that may be valid, but the current kernel
does not support it. Expand this to all of the nd_pfn_validate() error
conditions after the infoblock has been verified as self consistent.

This prevents exposing the namespace to I/O when the infoblock needs to
be corrected, or the system needs to be put into a different
configuration (like changing the page size on PowerPC).

Cc: Aneesh Kumar K.V 
Cc: Jeff Moyer 
Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/pfn_devs.c |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index a5c25cb87116..79fe02d6f657 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -561,14 +561,14 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char 
*sig)
dev_dbg(_pfn->dev, "align: %lx:%lx mode: %d:%d\n",
nd_pfn->align, align, nd_pfn->mode,
mode);
-   return -EINVAL;
+   return -EOPNOTSUPP;
}
}
 
if (align > nvdimm_namespace_capacity(ndns)) {
dev_err(_pfn->dev, "alignment: %lx exceeds capacity %llx\n",
align, nvdimm_namespace_capacity(ndns));
-   return -EINVAL;
+   return -EOPNOTSUPP;
}
 
/*
@@ -581,7 +581,7 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
if (offset >= resource_size(>res)) {
dev_err(_pfn->dev, "pfn array size exceeds capacity of %s\n",
dev_name(>dev));
-   return -EBUSY;
+   return -EOPNOTSUPP;
}
 
if ((align && !IS_ALIGNED(nsio->res.start + offset + start_pad, align))
@@ -589,7 +589,7 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
dev_err(_pfn->dev,
"bad offset: %#llx dax disabled align: %#lx\n",
offset, align);
-   return -ENXIO;
+   return -EOPNOTSUPP;
}
 
return 0;



[PATCH v4 1/5] mm/memremap_pages: Introduce memremap_compat_align()

2020-03-03 Thread Dan Williams
The "sub-section memory hotplug" facility allows memremap_pages() users
like libnvdimm to compensate for hardware platforms like x86 that have a
section size larger than their hardware memory mapping granularity.  The
compensation that sub-section support affords is being tolerant of
physical memory resources shifting by units smaller (64MiB on x86) than
the memory-hotplug section size (128 MiB), where the platform
physical-memory mapping granularity is limited by the number and
capability of address-decode-registers in the memory controller.

While the sub-section support allows memremap_pages() to operate on
sub-section (2MiB) granularity, the Power architecture may still
require 16MiB alignment on "!radix_enabled()" platforms.

In order for libnvdimm to be able to detect and manage this per-arch
limitation, introduce memremap_compat_align() as a common minimum
alignment across all driver-facing memory-mapping interfaces, and let
Power override it to 16MiB in the "!radix_enabled()" case.

The assumption / requirement for 16MiB to be a viable
memremap_compat_align() value is that Power does not have platforms
where its equivalent of address-decode-registers ever hardware remaps a
persistent memory resource on smaller than 16MiB boundaries. Note that I
tried my best to not add a new Kconfig symbol, but header include
entanglements defeated the #ifndef memremap_compat_align design pattern
and the need to export it defeats the __weak design pattern for arch
overrides.
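
The net effect in common code is roughly the following (sketch of the
shape of the mm/memremap.c change):

#ifndef CONFIG_ARCH_HAS_MEMREMAP_COMPAT_ALIGN
unsigned long memremap_compat_align(void)
{
	return SUBSECTION_SIZE;
}
EXPORT_SYMBOL_GPL(memremap_compat_align);
#endif

...with PowerPC selecting ARCH_HAS_MEMREMAP_COMPAT_ALIGN and providing
its own exported implementation in arch/powerpc/mm/ioremap.c.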

Based on an initial patch by Aneesh.

Link: 
http://lore.kernel.org/r/capcyv4gbgnp95apyabcsocea50tqj9b5h__83vgngjq3oug...@mail.gmail.com
Reported-by: Aneesh Kumar K.V 
Reported-by: Jeff Moyer 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Dan Williams 
---
 arch/powerpc/Kconfig  |1 +
 arch/powerpc/mm/ioremap.c |   21 +
 drivers/nvdimm/pfn_devs.c |2 +-
 include/linux/memremap.h  |8 
 include/linux/mmzone.h|1 +
 lib/Kconfig   |3 +++
 mm/memremap.c |   23 +++
 7 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 497b7d0b2d7e..e6ffe905e2b9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -122,6 +122,7 @@ config PPC
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_KCOV
select ARCH_HAS_HUGEPD  if HUGETLB_PAGE
+   select ARCH_HAS_MEMREMAP_COMPAT_ALIGN
select ARCH_HAS_MMIOWB  if PPC64
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_PMEM_API
diff --git a/arch/powerpc/mm/ioremap.c b/arch/powerpc/mm/ioremap.c
index fc669643ce6a..b1a0aebe8c48 100644
--- a/arch/powerpc/mm/ioremap.c
+++ b/arch/powerpc/mm/ioremap.c
@@ -2,6 +2,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -97,3 +98,23 @@ void __iomem *do_ioremap(phys_addr_t pa, phys_addr_t offset, 
unsigned long size,
 
return NULL;
 }
+
+#ifdef CONFIG_ZONE_DEVICE
+/*
+ * Override the generic version in mm/memremap.c.
+ *
+ * With hash translation, the direct-map range is mapped with just one
+ * page size selected by htab_init_page_sizes(). Consult
+ * mmu_psize_defs[] to determine the minimum page size alignment.
+*/
+unsigned long memremap_compat_align(void)
+{
+   unsigned int shift = mmu_psize_defs[mmu_linear_psize].shift;
+
+   if (radix_enabled())
+   return SUBSECTION_SIZE;
+   return max(SUBSECTION_SIZE, 1UL << shift);
+
+}
+EXPORT_SYMBOL_GPL(memremap_compat_align);
+#endif
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index b94f7a7e94b8..a5c25cb87116 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -750,7 +750,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
start = nsio->res.start;
size = resource_size(>res);
npfns = PHYS_PFN(size - SZ_8K);
-   align = max(nd_pfn->align, (1UL << SUBSECTION_SHIFT));
+   align = max(nd_pfn->align, SUBSECTION_SIZE);
end_trunc = start + size - ALIGN_DOWN(start + size, align);
if (nd_pfn->mode == PFN_MODE_PMEM) {
/*
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 6fefb09af7c3..8af1cbd8f293 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -132,6 +132,7 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
 void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
+unsigned long memremap_compat_align(void);
 #else
 static inline void *devm_memremap_pages(struct device *dev,
struct dev_pagemap *pgmap)
@@ -165,6 +166,12 @@ static inline void vmem_altmap_free(struct vmem_altmap 
*altmap,
unsigned long nr_pfns)
 {
 }
+
+/* when memremap_pages() is disabled all archs can remap a single page *

[PATCH v4 0/5] libnvdimm: Cross-arch compatible namespace alignment

2020-03-03 Thread Dan Williams
Changes since v3 [1]:
- Collect Aneesh's Reviewed-by for Patch1-Patch3

- Add commentary to "libnvdimm/namespace: Enforce
  memremap_compat_align()" to clarify alignment validation and why
  start_pad is deprecated (Aneesh)

[1]: 
http://lore.kernel.org/r/158291746615.1609624.7591692546429050845.st...@dwillia2-desk3.amr.corp.intel.com

---

Patch "mm/memremap_pages: Introduce memremap_compat_align()" still
needs a PowerPC maintainer ack for the touches to
arch/powerpc/mm/ioremap.c.

---

Aneesh reports that PowerPC requires 16MiB alignment for the address
range passed to devm_memremap_pages(), and Jeff reports that it is
possible to create a misaligned namespace which blocks future namespace
creation in that region. Both of these issues require namespace
alignment to be managed at the region level rather than padding at the
namespace level which has been a broken approach to date.

Introduce memremap_compat_align() to indicate the hard requirements of
an arch's memremap_pages() implementation. Use the maximum known
memremap_compat_align() to set the default namespace alignment for
libnvdimm. Consult that alignment when allocating free space. Finally,
allow the default region alignment to be overridden to maintain the same
namespace creation capability as previous kernels (modulo dax operation
not being supported with a non-zero start_pad).

The ndctl unit tests, which have some misaligned namespace assumptions,
are updated to use the alignment override where necessary.

Thanks to Aneesh for early feedback and testing on this change to
alignment handling.

---

Dan Williams (5):
  mm/memremap_pages: Introduce memremap_compat_align()
  libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid
  libnvdimm/namespace: Enforce memremap_compat_align()
  libnvdimm/region: Introduce NDD_LABELING
  libnvdimm/region: Introduce an 'align' attribute


 arch/powerpc/Kconfig  |1 
 arch/powerpc/mm/ioremap.c |   21 +
 arch/powerpc/platforms/pseries/papr_scm.c |2 
 drivers/acpi/nfit/core.c  |4 +
 drivers/nvdimm/dimm.c |2 
 drivers/nvdimm/dimm_devs.c|   95 +
 drivers/nvdimm/namespace_devs.c   |   28 +-
 drivers/nvdimm/nd.h   |3 -
 drivers/nvdimm/pfn.h  |   12 +++
 drivers/nvdimm/pfn_devs.c |   40 +++--
 drivers/nvdimm/region_devs.c  |  132 ++---
 include/linux/libnvdimm.h |2 
 include/linux/memremap.h  |8 ++
 include/linux/mmzone.h|1 
 lib/Kconfig   |3 +
 mm/memremap.c |   23 +
 16 files changed, 330 insertions(+), 47 deletions(-)

base-commit: 1d0827b75ee7df497f611a2ac412a88135fb0ef5


Re: [PATCH v3 6/7] mm/memory_hotplug: Add pgprot_t to mhp_params

2020-03-02 Thread Dan Williams
On Mon, Mar 2, 2020 at 10:55 AM Logan Gunthorpe  wrote:
>
>
>
> On 2020-02-29 3:44 p.m., Dan Williams wrote:
> > On Fri, Feb 21, 2020 at 10:25 AM Logan Gunthorpe  
> > wrote:
> >>
> >> devm_memremap_pages() is currently used by the PCI P2PDMA code to create
> >> struct page mappings for IO memory. At present, these mappings are created
> >> with PAGE_KERNEL which implies setting the PAT bits to be WB. However, on
> >> x86, an mtrr register will typically override this and force the cache
> >> type to be UC-. In the case firmware doesn't set this register it is
> >> effectively WB and will typically result in a machine check exception
> >> when it's accessed.
> >>
> >> Other arches are not currently likely to function correctly seeing they
> >> don't have any MTRR registers to fall back on.
> >>
> >> To solve this, provide a way to specify the pgprot value explicitly to
> >> arch_add_memory().
> >>
> >> Of the arches that support MEMORY_HOTPLUG: x86_64, and arm64 need a simple
> >> change to pass the pgprot_t down to their respective functions which set
> >> up the page tables. For x86_32, set the page tables explicitly using
> >> _set_memory_prot() (seeing they are already mapped). For ia64, s390 and
> >> sh, reject anything but PAGE_KERNEL settings -- this should be fine,
> >> for now, seeing these architectures don't support ZONE_DEVICE.
> >>
> >> A check in __add_pages() is also added to ensure the pgprot parameter was
> >> set for all arches.
> >>
> >> Cc: Dan Williams 
> >> Signed-off-by: Logan Gunthorpe 
> >> Acked-by: David Hildenbrand 
> >> Acked-by: Michal Hocko 
> >> ---
> >>  arch/arm64/mm/mmu.c| 3 ++-
> >>  arch/ia64/mm/init.c| 3 +++
> >>  arch/powerpc/mm/mem.c  | 3 ++-
> >>  arch/s390/mm/init.c| 3 +++
> >>  arch/sh/mm/init.c  | 3 +++
> >>  arch/x86/mm/init_32.c  | 5 +
> >>  arch/x86/mm/init_64.c  | 2 +-
> >>  include/linux/memory_hotplug.h | 2 ++
> >>  mm/memory_hotplug.c| 5 -
> >>  mm/memremap.c  | 6 +++---
> >>  10 files changed, 28 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> >> index ee37bca8aba8..ea3fa844a8a2 100644
> >> --- a/arch/arm64/mm/mmu.c
> >> +++ b/arch/arm64/mm/mmu.c
> >> @@ -1058,7 +1058,8 @@ int arch_add_memory(int nid, u64 start, u64 size,
> >> flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> >>
> >> __create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
> >> -size, PAGE_KERNEL, __pgd_pgtable_alloc, 
> >> flags);
> >> +size, params->pgprot, __pgd_pgtable_alloc,
> >> +flags);
> >>
> >> memblock_clear_nomap(start, size);
> >>
> >> diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> >> index 97bbc23ea1e3..d637b4ea3147 100644
> >> --- a/arch/ia64/mm/init.c
> >> +++ b/arch/ia64/mm/init.c
> >> @@ -676,6 +676,9 @@ int arch_add_memory(int nid, u64 start, u64 size,
> >> unsigned long nr_pages = size >> PAGE_SHIFT;
> >> int ret;
> >>
> >> +   if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot))
> >> +   return -EINVAL;
> >> +
> >> ret = __add_pages(nid, start_pfn, nr_pages, params);
> >> if (ret)
> >> printk("%s: Problem encountered in __add_pages() as 
> >> ret=%d\n",
> >> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> >> index 19b1da5d7eca..832412bc7fad 100644
> >> --- a/arch/powerpc/mm/mem.c
> >> +++ b/arch/powerpc/mm/mem.c
> >> @@ -138,7 +138,8 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
> >> resize_hpt_for_hotplug(memblock_phys_mem_size());
> >>
> >> start = (unsigned long)__va(start);
> >> -   rc = create_section_mapping(start, start + size, nid, PAGE_KERNEL);
> >> +   rc = create_section_mapping(start, start + size, nid,
> >> +   params->pgprot);
> >> if (rc) {
> >> pr_warn("Unable to create mapping for hot added memory 
> >> 0x%llx..0x%llx: %d\n",
> >> start,

Re: [PATCH v3 3/5] libnvdimm/namespace: Enforce memremap_compat_align()

2020-03-02 Thread Dan Williams
On Mon, Mar 2, 2020 at 4:09 AM Aneesh Kumar K.V
 wrote:
>
> Dan Williams  writes:
>
> > The pmem driver on PowerPC crashes with the following signature when
> > instantiating misaligned namespaces that map their capacity via
> > memremap_pages().
> >
> > BUG: Unable to handle kernel data access at 0xc00100040600
> > Faulting instruction address: 0xc0090790
> > NIP [c0090790] arch_add_memory+0xc0/0x130
> > LR [c0090744] arch_add_memory+0x74/0x130
> > Call Trace:
> >  arch_add_memory+0x74/0x130 (unreliable)
> >  memremap_pages+0x74c/0xa30
> >  devm_memremap_pages+0x3c/0xa0
> >  pmem_attach_disk+0x188/0x770
> >  nvdimm_bus_probe+0xd8/0x470
> >
> > With the assumption that only memremap_pages() has alignment
> > constraints, enforce memremap_compat_align() for
> > pmem_should_map_pages(), nd_pfn, and nd_dax cases. This includes
> > preventing the creation of namespaces where the base address is
> > misaligned and cases where infoblock padding parameters are invalid.
> >
>
> Reviewed-by: Aneesh Kumar K.V 
>
> > Reported-by: Aneesh Kumar K.V 
> > Cc: Jeff Moyer 
> > Fixes: a3619190d62e ("libnvdimm/pfn: stop padding pmem namespaces to 
> > section alignment")
> > Signed-off-by: Dan Williams 
> > ---
> >  drivers/nvdimm/namespace_devs.c |   12 
> >  drivers/nvdimm/pfn_devs.c   |   26 +++---
> >  2 files changed, 35 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/nvdimm/namespace_devs.c 
> > b/drivers/nvdimm/namespace_devs.c
> > index 032dc61725ff..68e89855f779 100644
> > --- a/drivers/nvdimm/namespace_devs.c
> > +++ b/drivers/nvdimm/namespace_devs.c
> > @@ -10,6 +10,7 @@
> >  #include 
> >  #include "nd-core.h"
> >  #include "pmem.h"
> > +#include "pfn.h"
> >  #include "nd.h"
> >
> >  static void namespace_io_release(struct device *dev)
> > @@ -1739,6 +1740,17 @@ struct nd_namespace_common 
> > *nvdimm_namespace_common_probe(struct device *dev)
> >   return ERR_PTR(-ENODEV);
> >   }
>
> May be add a comment here that both dax/fsdax namespace details are
> checked in nd_pfn_validate() so that we look at start_pad and end_trunc
> while validating the namespace?
>
> >
> > + if (pmem_should_map_pages(dev)) {
> > + struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
> > + struct resource *res = &nsio->res;
> > +
> > + if (!IS_ALIGNED(res->start | (res->end + 1),
> > + memremap_compat_align())) {
> > + dev_err(&ndns->dev, "%pr misaligned, unable to 
> > map\n", res);
> > + return ERR_PTR(-EOPNOTSUPP);
> > + }
> > + }
> > +
> > + if (is_namespace_pmem(&ndns->dev)) {
> >   struct nd_namespace_pmem *nspm;
> >
> > diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> > index 79fe02d6f657..3bdd4b883d05 100644
> > --- a/drivers/nvdimm/pfn_devs.c
> > +++ b/drivers/nvdimm/pfn_devs.c
> > @@ -446,6 +446,7 @@ static bool nd_supported_alignment(unsigned long align)
> >  int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
> >  {
> >   u64 checksum, offset;
> > + struct resource *res;
> >   enum nd_pfn_mode mode;
> >   struct nd_namespace_io *nsio;
> >   unsigned long align, start_pad;
> > @@ -578,13 +579,14 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char 
> > *sig)
> >* established.
> >*/
> >   nsio = to_nd_namespace_io(&ndns->dev);
> > - if (offset >= resource_size(&nsio->res)) {
> > + res = &nsio->res;
> > + if (offset >= resource_size(res)) {
> >   dev_err(&nd_pfn->dev, "pfn array size exceeds capacity of 
> > %s\n",
> >   dev_name(&ndns->dev));
> >   return -EOPNOTSUPP;
> >   }
> >
> > - if ((align && !IS_ALIGNED(nsio->res.start + offset + start_pad, 
> > align))
> > + if ((align && !IS_ALIGNED(res->start + offset + start_pad, align))
> >   || !IS_ALIGNED(offset, PAGE_SIZE)) {
> >   dev_err(&nd_pfn->dev,
> >   "bad offset: %#llx dax disabled align: 
> > %#lx\n",
> > @@ -592,6 +594,18 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char 

Re: [PATCH v3 15/27] powerpc/powernv/pmem: Add support for near storage commands

2020-03-02 Thread Dan Williams
On Mon, Mar 2, 2020 at 9:59 AM Frederic Barrat  wrote:
>
>
>
> Le 21/02/2020 à 04:27, Alastair D'Silva a écrit :
> > From: Alastair D'Silva 
> >
> > Similar to the previous patch, this adds support for near storage commands.
> >
> > Signed-off-by: Alastair D'Silva 
> > ---
>
>
> Is any of these new functions ever called?

This is my concern as well. The libnvdimm command support is limited
to the commands that Linux will use. Other passthrough commands are
supported through a passthrough interface. However, that passthrough
interface is explicitly limited to publicly documented command sets so
that the kernel has an opportunity to constrain and consolidate
command implementations across vendors.
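
For readers not familiar with that passthrough path: vendor commands are
tunneled through ND_CMD_CALL using a small envelope structure, and the bus
provider only forwards command families it has explicitly recognized. A
minimal sketch of that envelope follows (field layout recalled from
include/uapi/linux/ndctl.h; treat it as illustrative rather than
authoritative):

struct nd_cmd_pkg {
	__u64 nd_family;	/* vendor command family; must be a known, documented set */
	__u64 nd_command;	/* family-relative function number */
	__u32 nd_size_in;	/* INPUT: size of the input payload */
	__u32 nd_size_out;	/* INPUT: size reserved for the output payload */
	__u32 nd_reserved2[9];	/* reserved, must be zero */
	__u32 nd_fw_size;	/* OUTPUT: bytes the firmware actually returned */
	unsigned char nd_payload[];	/* opaque, family-defined contents */
};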


Re: [PATCH v3 7/7] mm/memremap: Set caching mode for PCI P2PDMA memory to WC

2020-02-29 Thread Dan Williams
On Fri, Feb 21, 2020 at 10:25 AM Logan Gunthorpe  wrote:
>
> PCI BAR IO memory should never be mapped as WB, however prior to this
> the PAT bits were set WB and it was typically overridden by MTRR
> registers set by the firmware.
>
> Set PCI P2PDMA memory to be WC (writecombining) as the only current
> user (the NVMe CMB) was originally mapped WC before the P2PDMA code
> replaced the mapping with devm_memremap_pages().

Will the change to UC regress this existing use case?

>
> Future use-cases may need to generalize this by adding flags to
> select the caching type, as some P2PDMA cases will not want WC.
> However, those use-cases are not upstream yet and this can be changed
> when they arrive.
>
> Cc: Dan Williams 
> Cc: Christoph Hellwig 
> Cc: Jason Gunthorpe 
> Signed-off-by: Logan Gunthorpe 
> ---
>  mm/memremap.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 06742372a203..8d141c3e3364 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -190,7 +190,10 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
> }
> break;
> case MEMORY_DEVICE_DEVDAX:
> +   need_devmap_managed = false;
> +   break;
> case MEMORY_DEVICE_PCI_P2PDMA:
> +   params.pgprot = pgprot_writecombine(params.pgprot);

Approach looks good to me, modulo Jason's comment that this should be
UC. Upcoming DAX changes will want to pass this via pgmap, but as you
say this can wait for those changes to arrive.

After change to UC:

Reviewed-by: Dan Williams 
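
For reference, the "change to UC" being asked for would presumably be a
one-line swap in the hunk quoted above, roughly (a sketch only, not the
posted patch):

	case MEMORY_DEVICE_PCI_P2PDMA:
		/* map P2PDMA BARs uncached rather than write-combining */
		params.pgprot = pgprot_noncached(params.pgprot);
		break;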


Re: [PATCH v3 6/7] mm/memory_hotplug: Add pgprot_t to mhp_params

2020-02-29 Thread Dan Williams
On Fri, Feb 21, 2020 at 10:25 AM Logan Gunthorpe  wrote:
>
> devm_memremap_pages() is currently used by the PCI P2PDMA code to create
> struct page mappings for IO memory. At present, these mappings are created
> with PAGE_KERNEL which implies setting the PAT bits to be WB. However, on
> x86, an mtrr register will typically override this and force the cache
> type to be UC-. In the case firmware doesn't set this register it is
> effectively WB and will typically result in a machine check exception
> when it's accessed.
>
> Other arches are not currently likely to function correctly seeing they
> don't have any MTRR registers to fall back on.
>
> To solve this, provide a way to specify the pgprot value explicitly to
> arch_add_memory().
>
> Of the arches that support MEMORY_HOTPLUG: x86_64, and arm64 need a simple
> change to pass the pgprot_t down to their respective functions which set
> up the page tables. For x86_32, set the page tables explicitly using
> _set_memory_prot() (seeing they are already mapped). For ia64, s390 and
> sh, reject anything but PAGE_KERNEL settings -- this should be fine,
> for now, seeing these architectures don't support ZONE_DEVICE.
>
> A check in __add_pages() is also added to ensure the pgprot parameter was
> set for all arches.
>
> Cc: Dan Williams 
> Signed-off-by: Logan Gunthorpe 
> Acked-by: David Hildenbrand 
> Acked-by: Michal Hocko 
> ---
>  arch/arm64/mm/mmu.c| 3 ++-
>  arch/ia64/mm/init.c| 3 +++
>  arch/powerpc/mm/mem.c  | 3 ++-
>  arch/s390/mm/init.c| 3 +++
>  arch/sh/mm/init.c  | 3 +++
>  arch/x86/mm/init_32.c  | 5 +
>  arch/x86/mm/init_64.c  | 2 +-
>  include/linux/memory_hotplug.h | 2 ++
>  mm/memory_hotplug.c| 5 -
>  mm/memremap.c  | 6 +++---
>  10 files changed, 28 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index ee37bca8aba8..ea3fa844a8a2 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1058,7 +1058,8 @@ int arch_add_memory(int nid, u64 start, u64 size,
> flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>
> __create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
> -size, PAGE_KERNEL, __pgd_pgtable_alloc, flags);
> +size, params->pgprot, __pgd_pgtable_alloc,
> +flags);
>
> memblock_clear_nomap(start, size);
>
> diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> index 97bbc23ea1e3..d637b4ea3147 100644
> --- a/arch/ia64/mm/init.c
> +++ b/arch/ia64/mm/init.c
> @@ -676,6 +676,9 @@ int arch_add_memory(int nid, u64 start, u64 size,
> unsigned long nr_pages = size >> PAGE_SHIFT;
> int ret;
>
> +   if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot))
> +   return -EINVAL;
> +
> ret = __add_pages(nid, start_pfn, nr_pages, params);
> if (ret)
> printk("%s: Problem encountered in __add_pages() as ret=%d\n",
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index 19b1da5d7eca..832412bc7fad 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -138,7 +138,8 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
> resize_hpt_for_hotplug(memblock_phys_mem_size());
>
> start = (unsigned long)__va(start);
> -   rc = create_section_mapping(start, start + size, nid, PAGE_KERNEL);
> +   rc = create_section_mapping(start, start + size, nid,
> +   params->pgprot);
> if (rc) {
> pr_warn("Unable to create mapping for hot added memory 
> 0x%llx..0x%llx: %d\n",
> start, start + size, rc);
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index e9e4a7abd0cc..87b2d024e75a 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -277,6 +277,9 @@ int arch_add_memory(int nid, u64 start, u64 size,
> if (WARN_ON_ONCE(params->altmap))
> return -EINVAL;
>
> +   if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot))
> +   return -EINVAL;
> +
> rc = vmem_add_mapping(start, size);
> if (rc)
> return rc;
> diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
> index e5114c053364..b9de2d4fa57e 100644
> --- a/arch/sh/mm/init.c
> +++ b/arch/sh/mm/init.c
> @@ -412,6 +412,9 @@ int arch_add_memory(int nid, u64 start, u64 size,
> unsigned long nr_pages = size >> PAGE_SHIFT;
> int ret;
>
> +   if (WARN_ON_ONCE(para

Re: [PATCH v3 3/7] x86/mm: Thread pgprot_t through init_memory_mapping()

2020-02-29 Thread Dan Williams
On Fri, Feb 21, 2020 at 10:25 AM Logan Gunthorpe  wrote:
>
> In prepartion to support a pgprot_t argument for arch_add_memory().
>
> It's required to move the prototype of init_memory_mapping() seeing
> the original location came before the definition of pgprot_t.
>
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: "H. Peter Anvin" 
> Cc: x...@kernel.org
> Cc: Dave Hansen 
> Cc: Andy Lutomirski 
> Cc: Peter Zijlstra 
> Signed-off-by: Logan Gunthorpe 

Looks good, checked for argument confusion, passes the nvdimm unit tests.

Reviewed-by: Dan Williams 


Re: [PATCH v3 4/7] x86/mm: Introduce _set_memory_prot()

2020-02-29 Thread Dan Williams
On Fri, Feb 21, 2020 at 10:25 AM Logan Gunthorpe  wrote:
>
> For use in the 32bit arch_add_memory() to set the pgprot type of the
> memory to add.
>
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: "H. Peter Anvin" 
> Cc: x...@kernel.org
> Cc: Dave Hansen 
> Cc: Andy Lutomirski 
> Cc: Peter Zijlstra 
> Signed-off-by: Logan Gunthorpe 
> ---
>  arch/x86/include/asm/set_memory.h | 1 +
>  arch/x86/mm/pat/set_memory.c  | 7 +++
>  2 files changed, 8 insertions(+)
>
> diff --git a/arch/x86/include/asm/set_memory.h 
> b/arch/x86/include/asm/set_memory.h
> index 64c3dce374e5..0aca959cf9a4 100644
> --- a/arch/x86/include/asm/set_memory.h
> +++ b/arch/x86/include/asm/set_memory.h
> @@ -34,6 +34,7 @@
>   * The caller is required to take care of these.
>   */
>
> +int _set_memory_prot(unsigned long addr, int numpages, pgprot_t prot);

I wonder if this should be separated from the naming convention of the
other routines because this is only an internal helper for code paths
where the prot was established by an upper layer. For example, I
expect that the kernel does not want new usages to make the mistake of
calling:

   _set_memory_prot(..., pgprot_writecombine(pgprot))

...instead of

_set_memory_wc()

I'm thinking just a double underscore rename (__set_memory_prot) and a
kerneldoc comment for that  pointing people to use the direct
_set_memory_ helpers.
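
Roughly along these lines (a sketch of the suggestion only; the final form is
whatever lands in the patch):

/**
 * __set_memory_prot - set the page attributes of a range to an explicit pgprot
 * @addr: first virtual address of the range
 * @numpages: number of pages to update
 * @prot: protection value already established by an upper layer
 *
 * Internal helper for callers like arch_add_memory() that receive a fully
 * formed pgprot_t. New users should prefer the direct _set_memory_uc(),
 * _set_memory_wc(), and _set_memory_wb() helpers.
 */
int __set_memory_prot(unsigned long addr, int numpages, pgprot_t prot);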

With that you can add:

Reviewed-by: Dan Williams 


Re: [PATCH v3 2/7] mm/memory_hotplug: Rename mhp_restrictions to mhp_params

2020-02-29 Thread Dan Williams
On Fri, Feb 21, 2020 at 10:25 AM Logan Gunthorpe  wrote:
>
> The mhp_restrictions struct really doesn't specify anything resembling
> a restriction anymore so rename it to be mhp_params as it is a list
> of extended parameters.
>
> Signed-off-by: Logan Gunthorpe 

Tests ok, and looks good to me:

Reviewed-by: Dan Williams 


Re: [PATCH] libnvdimm/bus: return the outvar 'cmd_rc' error code in __nd_ioctl()

2020-02-28 Thread Dan Williams
On Tue, Feb 18, 2020 at 1:03 PM Dan Williams  wrote:
>
> On Tue, Feb 18, 2020 at 1:00 PM Jeff Moyer  wrote:
> >
> > Vaibhav Jain  writes:
> >
> > > Presently the error code returned via out variable 'cmd_rc' from the
> > > nvdimm-bus controller function is ignored when called from
> > > __nd_ioctl() and never communicated back to user-space code that called
> > > an ioctl on dimm/bus.
> > >
> > > This minor patch updates __nd_ioctl() to propagate the value of out
> > > variable 'cmd_rc' back to user-space in case it reports an error.
> > >
> > > Signed-off-by: Vaibhav Jain 
> > > ---
> > >  drivers/nvdimm/bus.c | 5 +
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
> > > index a8b515968569..5b687a27fdf2 100644
> > > --- a/drivers/nvdimm/bus.c
> > > +++ b/drivers/nvdimm/bus.c
> > > @@ -1153,6 +1153,11 @@ static int __nd_ioctl(struct nvdimm_bus 
> > > *nvdimm_bus, struct nvdimm *nvdimm,
> > >   if (rc < 0)
> > >   goto out_unlock;
> > >
> > > + if (cmd_rc < 0) {
> > > + rc = cmd_rc;
> > > + goto out_unlock;
> > > + }
> > > +
> > >   if (!nvdimm && cmd == ND_CMD_CLEAR_ERROR && cmd_rc >= 0) {
> > >   struct nd_cmd_clear_error *clear_err = buf;
> >
> > Looks good to me.
> >
> > Reviewed-by: Jeff Moyer 
>
> Applied.

Unapplied. This breaks the NVDIMM unit test, and now that I look
closer you are likely overlooking the fact that cmd_rc is a
translation of the firmware status, while the ioctl rc is whether the
command was successfully submitted. If you want the equivalent of
cmd_rc in userspace you need to translate the firmware status. See
ndctl_cmd_submit_xlat() in libndctl as an example of how the
equivalent of cmd_rc is generated from the firmware status.


Re: [PATCH v3 1/7] mm/memory_hotplug: Drop the flags field from struct mhp_restrictions

2020-02-28 Thread Dan Williams
On Fri, Feb 21, 2020 at 10:25 AM Logan Gunthorpe  wrote:
>
> This variable is not used anywhere and should therefore be removed
> from the structure.
>
> Signed-off-by: Logan Gunthorpe 
> Reviewed-by: David Hildenbrand 

Reviewed-by: Dan Williams 


[PATCH v3 5/5] libnvdimm/region: Introduce an 'align' attribute

2020-02-28 Thread Dan Williams
The align attribute applies an alignment constraint for namespace
creation in a region. Whereas the 'align' attribute of a namespace
applied alignment padding via an info block, the 'align' attribute
applies alignment constraints to the free space allocation.

The default for 'align' is the maximum known memremap_compat_align()
across all archs (16MiB from PowerPC at time of writing) multiplied by
the number of interleave ways if there is blk-aliasing. The minimum is
PAGE_SIZE and allows for the creation of cross-arch incompatible
namespaces, just as previous kernels allowed, but the expectation is
cross-arch and mode-independent compatibility by default.

The regression risk with this change is limited to cases that were
dependent on the ability to create unaligned namespaces, *and* for some
reason are unable to opt-out of aligned namespaces by writing to
'regionX/align'. If such a scenario arises the default can be flipped
from opt-out to opt-in of compat-aligned namespace creation, but that is
a last resort. The kernel will otherwise continue to support existing
defined misaligned namespaces.

Unfortunately this change needs to touch several parts of the
implementation at once:

- region/available_size: expand busy extents to current align
- region/max_available_extent: expand busy extents to current align
- namespace/size: trim free space to current align

...to keep the free space accounting conforming to the dynamic align
setting.

Reported-by: Aneesh Kumar K.V 
Reported-by: Jeff Moyer 
Reviewed-by: Aneesh Kumar K.V 
Reviewed-by: Jeff Moyer 
Link: 
https://lore.kernel.org/r/158041478371.3889308.14542630147672668068.st...@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/dimm_devs.c  |   86 +++
 drivers/nvdimm/namespace_devs.c |9 ++-
 drivers/nvdimm/nd.h |1 
 drivers/nvdimm/region_devs.c|  122 ---
 4 files changed, 192 insertions(+), 26 deletions(-)

diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 39a61a514746..b7b77e8d9027 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -563,6 +563,21 @@ int nvdimm_security_freeze(struct nvdimm *nvdimm)
return rc;
 }
 
+static unsigned long dpa_align(struct nd_region *nd_region)
+{
+   struct device *dev = &nd_region->dev;
+
+   if (dev_WARN_ONCE(dev, !is_nvdimm_bus_locked(dev),
+   "bus lock required for capacity provision\n"))
+   return 0;
+   if (dev_WARN_ONCE(dev, !nd_region->ndr_mappings || nd_region->align
+   % nd_region->ndr_mappings,
+   "invalid region align %#lx mappings: %d\n",
+   nd_region->align, nd_region->ndr_mappings))
+   return 0;
+   return nd_region->align / nd_region->ndr_mappings;
+}
+
 int alias_dpa_busy(struct device *dev, void *data)
 {
resource_size_t map_end, blk_start, new;
@@ -571,6 +586,7 @@ int alias_dpa_busy(struct device *dev, void *data)
struct nd_region *nd_region;
struct nvdimm_drvdata *ndd;
struct resource *res;
+   unsigned long align;
int i;
 
if (!is_memory(dev))
@@ -608,13 +624,21 @@ int alias_dpa_busy(struct device *dev, void *data)
 * Find the free dpa from the end of the last pmem allocation to
 * the end of the interleave-set mapping.
 */
+   align = dpa_align(nd_region);
+   if (!align)
+   return 0;
+
for_each_dpa_resource(ndd, res) {
+   resource_size_t start, end;
+
if (strncmp(res->name, "pmem", 4) != 0)
continue;
-   if ((res->start >= blk_start && res->start < map_end)
-   || (res->end >= blk_start
-   && res->end <= map_end)) {
-   new = max(blk_start, min(map_end + 1, res->end + 1));
+
+   start = ALIGN_DOWN(res->start, align);
+   end = ALIGN(res->end + 1, align) - 1;
+   if ((start >= blk_start && start < map_end)
+   || (end >= blk_start && end <= map_end)) {
+   new = max(blk_start, min(map_end, end) + 1);
if (new != blk_start) {
blk_start = new;
goto retry;
@@ -654,6 +678,7 @@ resource_size_t nd_blk_available_dpa(struct nd_region 
*nd_region)
.res = NULL,
};
struct resource *res;
+   unsigned long align;
 
if (!ndd)
return 0;
@@ -661,10 +686,20 @@ resource_size_t nd_blk_available_dpa(struct nd_region 
*nd_region)
device_for_each_child(&nvdimm_bus->dev, &info, alias_

[PATCH v3 4/5] libnvdimm/region: Introduce NDD_LABELING

2020-02-28 Thread Dan Williams
The NDD_ALIASING flag is used to indicate where pmem capacity might
alias with blk capacity and require labeling. It is also used to
indicate whether the DIMM supports labeling. Separate this latter
capability into its own flag so that the NDD_ALIASING flag is scoped to
true aliased configurations.

To my knowledge aliased configurations only exist in the ACPI spec, and
there are no known platforms that ship this support in production.

This clarity allows namespace-capacity alignment constraints around
interleave-ways to be relaxed.

Cc: Vishal Verma 
Cc: Oliver O'Halloran 
Reviewed-by: Jeff Moyer 
Reviewed-by: Aneesh Kumar K.V 
Link: 
https://lore.kernel.org/r/158041477856.3889308.4212605617834097674.st...@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams 
---
 arch/powerpc/platforms/pseries/papr_scm.c |2 +-
 drivers/acpi/nfit/core.c  |4 +++-
 drivers/nvdimm/dimm.c |2 +-
 drivers/nvdimm/dimm_devs.c|9 +
 drivers/nvdimm/namespace_devs.c   |2 +-
 drivers/nvdimm/nd.h   |2 +-
 drivers/nvdimm/region_devs.c  |   10 +-
 include/linux/libnvdimm.h |2 ++
 8 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 0b4467e378e5..589858cb3203 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -328,7 +328,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
}
 
dimm_flags = 0;
-   set_bit(NDD_ALIASING, &dimm_flags);
+   set_bit(NDD_LABELING, &dimm_flags);
 
p->nvdimm = nvdimm_create(p->bus, p, NULL, dimm_flags,
  PAPR_SCM_DIMM_CMD_MASK, 0, NULL);
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index a3320f93616d..71d7f2aa1b12 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2026,8 +2026,10 @@ static int acpi_nfit_register_dimms(struct 
acpi_nfit_desc *acpi_desc)
continue;
}
 
-   if (nfit_mem->bdw && nfit_mem->memdev_pmem)
+   if (nfit_mem->bdw && nfit_mem->memdev_pmem) {
set_bit(NDD_ALIASING, &flags);
+   set_bit(NDD_LABELING, &flags);
+   }
 
/* collate flags across all memdevs for this dimm */
list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c
index 64776ed15bb3..7d4ddc4d9322 100644
--- a/drivers/nvdimm/dimm.c
+++ b/drivers/nvdimm/dimm.c
@@ -99,7 +99,7 @@ static int nvdimm_probe(struct device *dev)
if (ndd->ns_current >= 0) {
rc = nd_label_reserve_dpa(ndd);
if (rc == 0)
-   nvdimm_set_aliasing(dev);
+   nvdimm_set_labeling(dev);
}
nvdimm_bus_unlock(dev);
 
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 94ea6dba6b4f..39a61a514746 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -32,7 +32,7 @@ int nvdimm_check_config_data(struct device *dev)
 
if (!nvdimm->cmd_mask ||
!test_bit(ND_CMD_GET_CONFIG_DATA, &nvdimm->cmd_mask)) {
-   if (test_bit(NDD_ALIASING, &nvdimm->flags))
+   if (test_bit(NDD_LABELING, &nvdimm->flags))
return -ENXIO;
else
return -ENOTTY;
@@ -173,11 +173,11 @@ int nvdimm_set_config_data(struct nvdimm_drvdata *ndd, 
size_t offset,
return rc;
 }
 
-void nvdimm_set_aliasing(struct device *dev)
+void nvdimm_set_labeling(struct device *dev)
 {
struct nvdimm *nvdimm = to_nvdimm(dev);
 
-   set_bit(NDD_ALIASING, &nvdimm->flags);
+   set_bit(NDD_LABELING, &nvdimm->flags);
 }
 
 void nvdimm_set_locked(struct device *dev)
@@ -312,8 +312,9 @@ static ssize_t flags_show(struct device *dev,
 {
struct nvdimm *nvdimm = to_nvdimm(dev);
 
-   return sprintf(buf, "%s%s\n",
+   return sprintf(buf, "%s%s%s\n",
test_bit(NDD_ALIASING, &nvdimm->flags) ? "alias " : "",
+   test_bit(NDD_LABELING, &nvdimm->flags) ? "label " : "",
test_bit(NDD_LOCKED, &nvdimm->flags) ? "lock " : "");
 }
 static DEVICE_ATTR_RO(flags);
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 68e89855f779..2388598ce1a2 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -2533,7 +2533,7 @@ static int init_active_labels(struct nd_region *nd_region)
if (!ndd) {
if (test_bit(NDD_LOCKED, &nvdimm->flags))
/* fail, label data may be unreadable */;
-   

[PATCH v3 3/5] libnvdimm/namespace: Enforce memremap_compat_align()

2020-02-28 Thread Dan Williams
The pmem driver on PowerPC crashes with the following signature when
instantiating misaligned namespaces that map their capacity via
memremap_pages().

BUG: Unable to handle kernel data access at 0xc00100040600
Faulting instruction address: 0xc0090790
NIP [c0090790] arch_add_memory+0xc0/0x130
LR [c0090744] arch_add_memory+0x74/0x130
Call Trace:
 arch_add_memory+0x74/0x130 (unreliable)
 memremap_pages+0x74c/0xa30
 devm_memremap_pages+0x3c/0xa0
 pmem_attach_disk+0x188/0x770
 nvdimm_bus_probe+0xd8/0x470

With the assumption that only memremap_pages() has alignment
constraints, enforce memremap_compat_align() for
pmem_should_map_pages(), nd_pfn, and nd_dax cases. This includes
preventing the creation of namespaces where the base address is
misaligned and cases where infoblock padding parameters are invalid.

Reported-by: Aneesh Kumar K.V 
Cc: Jeff Moyer 
Fixes: a3619190d62e ("libnvdimm/pfn: stop padding pmem namespaces to section 
alignment")
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/namespace_devs.c |   12 
 drivers/nvdimm/pfn_devs.c   |   26 +++---
 2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 032dc61725ff..68e89855f779 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -10,6 +10,7 @@
 #include 
 #include "nd-core.h"
 #include "pmem.h"
+#include "pfn.h"
 #include "nd.h"
 
 static void namespace_io_release(struct device *dev)
@@ -1739,6 +1740,17 @@ struct nd_namespace_common 
*nvdimm_namespace_common_probe(struct device *dev)
return ERR_PTR(-ENODEV);
}
 
+   if (pmem_should_map_pages(dev)) {
+   struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
+   struct resource *res = &nsio->res;
+
+   if (!IS_ALIGNED(res->start | (res->end + 1),
+   memremap_compat_align())) {
+   dev_err(&ndns->dev, "%pr misaligned, unable to map\n", 
res);
+   return ERR_PTR(-EOPNOTSUPP);
+   }
+   }
+
if (is_namespace_pmem(&ndns->dev)) {
struct nd_namespace_pmem *nspm;
 
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 79fe02d6f657..3bdd4b883d05 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -446,6 +446,7 @@ static bool nd_supported_alignment(unsigned long align)
 int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
 {
u64 checksum, offset;
+   struct resource *res;
enum nd_pfn_mode mode;
struct nd_namespace_io *nsio;
unsigned long align, start_pad;
@@ -578,13 +579,14 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char 
*sig)
 * established.
 */
nsio = to_nd_namespace_io(&ndns->dev);
-   if (offset >= resource_size(&nsio->res)) {
+   res = &nsio->res;
+   if (offset >= resource_size(res)) {
dev_err(_pfn->dev, "pfn array size exceeds capacity of %s\n",
dev_name(>dev));
return -EOPNOTSUPP;
}
 
-   if ((align && !IS_ALIGNED(nsio->res.start + offset + start_pad, align))
+   if ((align && !IS_ALIGNED(res->start + offset + start_pad, align))
|| !IS_ALIGNED(offset, PAGE_SIZE)) {
dev_err(&nd_pfn->dev,
"bad offset: %#llx dax disabled align: %#lx\n",
@@ -592,6 +594,18 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
return -EOPNOTSUPP;
}
 
+   if (!IS_ALIGNED(res->start + le32_to_cpu(pfn_sb->start_pad),
+   memremap_compat_align())) {
+   dev_err(_pfn->dev, "resource start misaligned\n");
+   return -EOPNOTSUPP;
+   }
+
+   if (!IS_ALIGNED(res->end + 1 - le32_to_cpu(pfn_sb->end_trunc),
+   memremap_compat_align())) {
+   dev_err(_pfn->dev, "resource end misaligned\n");
+   return -EOPNOTSUPP;
+   }
+
return 0;
 }
 EXPORT_SYMBOL(nd_pfn_validate);
@@ -750,7 +764,13 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
start = nsio->res.start;
size = resource_size(&nsio->res);
npfns = PHYS_PFN(size - SZ_8K);
-   align = max(nd_pfn->align, SUBSECTION_SIZE);
+   align = max(nd_pfn->align, memremap_compat_align());
+   if (!IS_ALIGNED(start, memremap_compat_align())) {
+   dev_err(_pfn->dev, "%s: start %pa misaligned to %#lx\n",
+   dev_name(>dev), ,
+   memremap_compat_align());
+   return -EINVAL;
+   }
end_trunc = start + size - ALIGN_DOWN(start + size, align);
if (nd_pfn->mode == PFN_MODE_PMEM) {
/*



[PATCH v3 2/5] libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid

2020-02-28 Thread Dan Williams
The EOPNOTSUPP return code from the pmem driver indicates that the
namespace has a configuration that may be valid, but the current kernel
does not support it. Expand this to all of the nd_pfn_validate() error
conditions after the infoblock has been verified as self consistent.

This prevents exposing the namespace to I/O when the infoblock needs to
be corrected, or the system needs to be put into a different
configuration (like changing the page size on PowerPC).

Cc: Aneesh Kumar K.V 
Cc: Jeff Moyer 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/pfn_devs.c |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index a5c25cb87116..79fe02d6f657 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -561,14 +561,14 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char 
*sig)
dev_dbg(&nd_pfn->dev, "align: %lx:%lx mode: %d:%d\n",
nd_pfn->align, align, nd_pfn->mode,
mode);
-   return -EINVAL;
+   return -EOPNOTSUPP;
}
}
 
if (align > nvdimm_namespace_capacity(ndns)) {
dev_err(_pfn->dev, "alignment: %lx exceeds capacity %llx\n",
align, nvdimm_namespace_capacity(ndns));
-   return -EINVAL;
+   return -EOPNOTSUPP;
}
 
/*
@@ -581,7 +581,7 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
if (offset >= resource_size(&nsio->res)) {
dev_err(&nd_pfn->dev, "pfn array size exceeds capacity of %s\n",
dev_name(&ndns->dev));
-   return -EBUSY;
+   return -EOPNOTSUPP;
}
 
if ((align && !IS_ALIGNED(nsio->res.start + offset + start_pad, align))
@@ -589,7 +589,7 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
dev_err(&nd_pfn->dev,
"bad offset: %#llx dax disabled align: %#lx\n",
offset, align);
-   return -ENXIO;
+   return -EOPNOTSUPP;
}
 
return 0;



[PATCH v3 1/5] mm/memremap_pages: Introduce memremap_compat_align()

2020-02-28 Thread Dan Williams
The "sub-section memory hotplug" facility allows memremap_pages() users
like libnvdimm to compensate for hardware platforms like x86 that have a
section size larger than their hardware memory mapping granularity.  The
compensation that sub-section support affords is being tolerant of
physical memory resources shifting by units smaller (64MiB on x86) than
the memory-hotplug section size (128 MiB). Where the platform
physical-memory mapping granularity is limited by the number and
capability of address-decode-registers in the memory controller.

While the sub-section support allows memremap_pages() to operate on
sub-section (2MiB) granularity, the Power architecture may still
require 16MiB alignment on "!radix_enabled()" platforms.

In order for libnvdimm to be able to detect and manage this per-arch
limitation, introduce memremap_compat_align() as a common minimum
alignment across all driver-facing memory-mapping interfaces, and let
Power override it to 16MiB in the "!radix_enabled()" case.

The assumption / requirement for 16MiB to be a viable
memremap_compat_align() value is that Power does not have platforms
where its equivalent of address-decode-registers hardware-remaps a
persistent memory resource on smaller than 16MiB boundaries. Note that I
tried my best to not add a new Kconfig symbol, but header include
entanglements defeated the #ifndef memremap_compat_align design pattern
and the need to export it defeats the __weak design pattern for arch
overrides.

Based on an initial patch by Aneesh.
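
For readers unfamiliar with the two arch-override idioms mentioned above,
these are their generic shapes (illustration only, not code from this
series):

/* 1) "#ifndef" override: an arch header defines the function (and a macro of
 *    the same name) so a generic header can supply a fallback when absent. */
#ifndef memremap_compat_align
static inline unsigned long memremap_compat_align(void)
{
	return SUBSECTION_SIZE;
}
#endif

/* 2) "__weak" override: common code supplies a weak default that an arch may
 *    replace with a strong definition; per the changelog, the need to export
 *    the symbol is what rules this variant out here. */
unsigned long __weak memremap_compat_align(void)
{
	return SUBSECTION_SIZE;
}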

Link: 
http://lore.kernel.org/r/capcyv4gbgnp95apyabcsocea50tqj9b5h__83vgngjq3oug...@mail.gmail.com
Reported-by: Aneesh Kumar K.V 
Reported-by: Jeff Moyer 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Signed-off-by: Dan Williams 
---
 arch/powerpc/Kconfig  |1 +
 arch/powerpc/mm/ioremap.c |   21 +
 drivers/nvdimm/pfn_devs.c |2 +-
 include/linux/memremap.h  |8 
 include/linux/mmzone.h|1 +
 lib/Kconfig   |3 +++
 mm/memremap.c |   23 +++
 7 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 497b7d0b2d7e..e6ffe905e2b9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -122,6 +122,7 @@ config PPC
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_KCOV
select ARCH_HAS_HUGEPD  if HUGETLB_PAGE
+   select ARCH_HAS_MEMREMAP_COMPAT_ALIGN
select ARCH_HAS_MMIOWB  if PPC64
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_PMEM_API
diff --git a/arch/powerpc/mm/ioremap.c b/arch/powerpc/mm/ioremap.c
index fc669643ce6a..b1a0aebe8c48 100644
--- a/arch/powerpc/mm/ioremap.c
+++ b/arch/powerpc/mm/ioremap.c
@@ -2,6 +2,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -97,3 +98,23 @@ void __iomem *do_ioremap(phys_addr_t pa, phys_addr_t offset, 
unsigned long size,
 
return NULL;
 }
+
+#ifdef CONFIG_ZONE_DEVICE
+/*
+ * Override the generic version in mm/memremap.c.
+ *
+ * With hash translation, the direct-map range is mapped with just one
+ * page size selected by htab_init_page_sizes(). Consult
+ * mmu_psize_defs[] to determine the minimum page size alignment.
+*/
+unsigned long memremap_compat_align(void)
+{
+   unsigned int shift = mmu_psize_defs[mmu_linear_psize].shift;
+
+   if (radix_enabled())
+   return SUBSECTION_SIZE;
+   return max(SUBSECTION_SIZE, 1UL << shift);
+
+}
+EXPORT_SYMBOL_GPL(memremap_compat_align);
+#endif
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index b94f7a7e94b8..a5c25cb87116 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -750,7 +750,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
start = nsio->res.start;
size = resource_size(>res);
npfns = PHYS_PFN(size - SZ_8K);
-   align = max(nd_pfn->align, (1UL << SUBSECTION_SHIFT));
+   align = max(nd_pfn->align, SUBSECTION_SIZE);
end_trunc = start + size - ALIGN_DOWN(start + size, align);
if (nd_pfn->mode == PFN_MODE_PMEM) {
/*
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 6fefb09af7c3..8af1cbd8f293 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -132,6 +132,7 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
 void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
+unsigned long memremap_compat_align(void);
 #else
 static inline void *devm_memremap_pages(struct device *dev,
struct dev_pagemap *pgmap)
@@ -165,6 +166,12 @@ static inline void vmem_altmap_free(struct vmem_altmap 
*altmap,
unsigned long nr_pfns)
 {
 }
+
+/* when memremap_pages() is disabled all archs can remap a single page */
+static inline unsigned lon

[PATCH v3 0/5] libnvdimm: Cross-arch compatible namespace alignment

2020-02-28 Thread Dan Williams
Changes since v2 [1]:
- Fix up a missing space in flags_show() (Jeff)

- Prompted by Jeff saying that v2 only worked for him if
  memremap_compat_align() returned PAGE_SIZE (which defeats the purpose)
  I developed a new ndctl unit test that runs through the possible
  legacy configurations that the kernel needs to support. Several changes
  fell out as a result:

  - Update nd_pfn_validate() to add more -EOPNOTSUPP cases. That error
code indicates "Stop, the pfn looks coherent, but invalid. Do not
proceed with exposing a raw namespace, require the user to
investigate whether the infoblock needs to be rewritten, or the
kernel configuration (like PAGE_SIZE) needs to change."

  - Move the validation of fsdax and devdax infoblocks to
nd_pfn_validate() so that the presence of non-zero 'start_pad' and
'end_trunc' can be considered in the alignment validation.

  - Fail namespace creation when the base address is misaligned. A
non-zero start_pad prevents dax operation due to an original bug of
->data_offset being base address relative when it should have been
->start_pad relative. So, reject all base address misaligned
namespaces in nd_pfn_init().

[1]: 
http://lore.kernel.org/r/158155489850.3343782.2687127373754434980.st...@dwillia2-desk3.amr.corp.intel.com

---

Review / merge logistics notes:

Patch "libnvdimm/namespace: Enforce memremap_compat_align()" has
changed enough that it needs to be reviewed again.

Patch "mm/memremap_pages: Introduce memremap_compat_align()" still
needs a PowerPC maintainer ack for the touches to
arch/powerpc/mm/ioremap.c.

---

Aneesh reports that PowerPC requires 16MiB alignment for the address
range passed to devm_memremap_pages(), and Jeff reports that it is
possible to create a misaligned namespace which blocks future namespace
creation in that region. Both of these issues require namespace
alignment to be managed at the region level rather than padding at the
namespace level which has been a broken approach to date.

Introduce memremap_compat_align() to indicate the hard requirements of
an arch's memremap_pages() implementation. Use the maximum known
memremap_compat_align() to set the default namespace alignment for
libnvdimm. Consult that alignment when allocating free space. Finally,
allow the default region alignment to be overridden to maintain the same
namespace creation capability as previous kernels.

The ndctl unit tests, which have some misaligned namespace assumptions,
are updated to use the alignment override where necessary.

Thanks to Aneesh for early feedback and testing on this improved
alignment handling.

---

Dan Williams (5):
  mm/memremap_pages: Introduce memremap_compat_align()
  libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid
  libnvdimm/namespace: Enforce memremap_compat_align()
  libnvdimm/region: Introduce NDD_LABELING
  libnvdimm/region: Introduce an 'align' attribute


 arch/powerpc/Kconfig  |1 
 arch/powerpc/mm/ioremap.c |   21 +
 arch/powerpc/platforms/pseries/papr_scm.c |2 
 drivers/acpi/nfit/core.c  |4 +
 drivers/nvdimm/dimm.c |2 
 drivers/nvdimm/dimm_devs.c|   95 +
 drivers/nvdimm/namespace_devs.c   |   23 -
 drivers/nvdimm/nd.h   |3 -
 drivers/nvdimm/pfn_devs.c |   34 ++-
 drivers/nvdimm/region_devs.c  |  132 ++---
 include/linux/libnvdimm.h |2 
 include/linux/memremap.h  |8 ++
 include/linux/mmzone.h|1 
 lib/Kconfig   |3 +
 mm/memremap.c |   23 +
 15 files changed, 307 insertions(+), 47 deletions(-)

base-commit: 11a48a5a18c63fd7621bb050228cebf13566e4d8


Re: [PATCH v3 0/7] Allow setting caching mode in arch_add_memory() for P2PDMA

2020-02-27 Thread Dan Williams
On Thu, Feb 27, 2020 at 10:03 AM Jason Gunthorpe  wrote:
>
> On Thu, Feb 27, 2020 at 09:55:04AM -0800, Dan Williams wrote:
> > On Thu, Feb 27, 2020 at 9:43 AM Jason Gunthorpe  wrote:
> > >
> > > On Thu, Feb 27, 2020 at 10:21:50AM -0700, Logan Gunthorpe wrote:
> > > >
> > > >
> > > > On 2020-02-27 10:17 a.m., Jason Gunthorpe wrote:
> > > > >> Instead of this, this series proposes a change to arch_add_memory()
> > > > >> to take the pgprot required by the mapping which allows us to
> > > > >> explicitly set pagetable entries for P2PDMA memory to WC.
> > > > >
> > > > > Is there a particular reason why WC was selected here? I thought for
> > > > > the p2pdma cases there was no kernel user that touched the memory?
> > > >
> > > > Yes, that's correct. I choose WC here because the existing users are
> > > > registering memory blocks without side effects which fit the WC
> > > > semantics well.
> > >
> > > Hm, AFAIK WC memory is not compatible with the spinlocks/mutexs/etc in
> > > Linux, so while it is true the memory has no side effects, there would
> > > be surprising concurrency risks if anything in the kernel tried to
> > > write to it.
> > >
> > > Not compatible means the locks don't contain stores to WC memory the
> > > way you would expect. AFAIK on many CPUs extra barriers are required
> > > to keep WC stores ordered, the same way ARM already has extra barriers
> > > to keep UC stores ordered with locking..
> > >
> > > The spinlocks are defined to contain UC stores though.
> >
> > How are spinlocks and mutexes getting into p2pdma ranges in the first
> > instance? Even with UC, the system has bigger problems if it's trying
> > to send bus locks targeting PCI, see the flurry of activity of trying
> > to trigger faults on split locks [1].
>
> This is not what I was trying to explain.
>
> Consider
>
>  static spinlock lock; // CPU DRAM
>  static idx = 0;
>  u64 *wc_memory = [..];
>
>  spin_lock();
>  wc_memory[0] = idx++;
>  spin_unlock();
>
> You'd expect that the PCI device will observe stores where idx is
> strictly increasing, but this is not guarenteed. idx may decrease, idx
> may skip. It just won't duplicate.
>
> Or perhaps
>
>  wc_memory[0] = foo;
>  writel(doorbell)
>
> foo is not guarenteed observable by the device before doorbell reaches
> the device.
>
> All of these are things that do not happen with UC or NC memory, and
> are surprising violations of our programming model.
>
> Generic kernel code should never touch WC memory unless the code is
> specifically designed to handle it.

Ah, yes, agree.


Re: [PATCH v3 0/7] Allow setting caching mode in arch_add_memory() for P2PDMA

2020-02-27 Thread Dan Williams
On Thu, Feb 27, 2020 at 9:43 AM Jason Gunthorpe  wrote:
>
> On Thu, Feb 27, 2020 at 10:21:50AM -0700, Logan Gunthorpe wrote:
> >
> >
> > On 2020-02-27 10:17 a.m., Jason Gunthorpe wrote:
> > >> Instead of this, this series proposes a change to arch_add_memory()
> > >> to take the pgprot required by the mapping which allows us to
> > >> explicitly set pagetable entries for P2PDMA memory to WC.
> > >
> > > Is there a particular reason why WC was selected here? I thought for
> > > the p2pdma cases there was no kernel user that touched the memory?
> >
> > Yes, that's correct. I choose WC here because the existing users are
> > registering memory blocks without side effects which fit the WC
> > semantics well.
>
> Hm, AFAIK WC memory is not compatible with the spinlocks/mutexs/etc in
> Linux, so while it is true the memory has no side effects, there would
> be surprising concurrency risks if anything in the kernel tried to
> write to it.
>
> Not compatible means the locks don't contain stores to WC memory the
> way you would expect. AFAIK on many CPUs extra barriers are required
> to keep WC stores ordered, the same way ARM already has extra barriers
> to keep UC stores ordered with locking..
>
> The spinlocks are defined to contain UC stores though.

How are spinlocks and mutexes getting into p2pdma ranges in the first
instance? Even with UC, the system has bigger problems if it's trying
to send bus locks targeting PCI; see the flurry of activity of trying
to trigger faults on split locks [1].

This does raise a question about separating the cacheability of the
'struct page' memmap from the BAR range. You get this for free if the
memmap is dynamically allocated from "System RAM", but perhaps
memremap_pages() should explicitly prevent altmap configurations that
try to place the map in PCI space?

> If there is no actual need today for WC I would suggest using UC as
> the default.

That's reasonable, but it still seems to be making a broken
configuration marginally less broken. I'd be more interested in
safeguards that prevent p2pdma mappings from being used for any cpu
atomic cycles.

[1]: https://lwn.net/Articles/784864/


Re: [PATCH v3 15/27] powerpc/powernv/pmem: Add support for near storage commands

2020-02-27 Thread Dan Williams
On Thu, Feb 20, 2020 at 7:28 PM Alastair D'Silva  wrote:
>
> From: Alastair D'Silva 
>
> Similar to the previous patch, this adds support for near storage commands.

Similar comment as the last patch. This changelog does not give the
reviewer any frame of reference to review the patch.


Re: [PATCH v3 14/27] powerpc/powernv/pmem: Add support for Admin commands

2020-02-27 Thread Dan Williams
On Thu, Feb 20, 2020 at 7:28 PM Alastair D'Silva  wrote:
>
> From: Alastair D'Silva 
>
> This patch requests the metadata required to issue admin commands, as well
> as some helper functions to construct and check the completion of the
> commands.

What are the admin commands? Any pointer to a spec? Why does Linux
need to support these commands?


Re: [PATCH v3 00/27] Add support for OpenCAPI Persistent Memory devices

2020-02-25 Thread Dan Williams
On Tue, Feb 25, 2020 at 4:14 PM Alastair D'Silva  wrote:
>
> On Mon, 2020-02-24 at 17:51 +1100, Oliver O'Halloran wrote:
> > On Mon, Feb 24, 2020 at 3:43 PM Alastair D'Silva <
> > alast...@au1.ibm.com> wrote:
> > > On Sun, 2020-02-23 at 20:37 -0800, Matthew Wilcox wrote:
> > > > On Mon, Feb 24, 2020 at 03:34:07PM +1100, Alastair D'Silva wrote:
> > > > > V3:
> > > > >   - Rebase against next/next-20200220
> > > > >   - Move driver to arch/powerpc/platforms/powernv, we now
> > > > > expect
> > > > > this
> > > > > driver to go upstream via the powerpc tree
> > > >
> > > > That's rather the opposite direction of normal; mostly drivers
> > > > live
> > > > under
> > > > drivers/ and not in arch/.  It's easier for drivers to get
> > > > overlooked
> > > > when doing tree-wide changes if they're hiding.
> > >
> > > This is true, however, given that it was not all that desirable to
> > > have
> > > it under drivers/nvdimm, it's sister driver (for the same hardware)
> > > is
> > > also under arch, and that we don't expect this driver to be used on
> > > any
> > > platform other than powernv, we think this was the most reasonable
> > > place to put it.
> >
> > Historically powernv specific platform drivers go in their respective
> > subsystem trees rather than in arch/ and I'd prefer we kept it that
> > way. When I added the papr_scm driver I put it in the pseries
> > platform
> > directory because most of the pseries paravirt code lives there for
> > some reason; I don't know why. Luckily for me that followed the same
> > model that Dan used when he put the NFIT driver in drivers/acpi/ and
> > the libnvdimm core in drivers/nvdimm/ so we didn't have anything to
> > argue about. However, as Matthew pointed out, it is at odds with how
> > most subsystems operate. Is there any particular reason we're doing
> > things this way or should we think about moving libnvdimm users to
> > drivers/nvdimm/?
> >
> > Oliver
>
>
> I'm not too fussed where it ends up, as long as it ends up somewhere :)
>
> From what I can tell, the issue is that we have both "infrastructure"
> drivers, and end-device drivers. To me, it feels like drivers/nvdimm
> should contain both, and I think this feels like the right approach.
>
> I could move it back to drivers/nvdimm/ocxl, but I felt that it was
> only tolerated there, not desired. This could be cleared up with a
> response from Dan Williams, and if it is indeed desired, this is my
> preferred location.

Apologies if I gave the impression it was only tolerated. I'm ok with
drivers/nvdimm/ocxl/, and to the larger point I'd also be ok with a
drivers/{acpi => nvdimm}/nfit and {arch/powerpc/platforms/pseries =>
drivers/nvdimm}/papr_scm.c move as well to keep all the consumers of
the nvdimm related code together with the core.


Re: [PATCH v3 00/27] Add support for OpenCAPI Persistent Memory devices

2020-02-21 Thread Dan Williams
On Fri, Feb 21, 2020 at 8:21 AM Dan Williams  wrote:
>
> On Thu, Feb 20, 2020 at 7:28 PM Alastair D'Silva  wrote:
> >
> > From: Alastair D'Silva 
> >
> > This series adds support for OpenCAPI Persistent Memory devices, exposing
> > them as nvdimms so that we can make use of the existing infrastructure.
>
> A single sentence to introduce:
>
> 24 files changed, 3029 insertions(+), 97 deletions(-)
>
> ...is inadequate. What are OpenCAPI Persistent Memory devices? How do
> they compare, in terms relevant to libnvdimm, to other persistent
> memory devices? What challenges do they pose to the existing enabling?
> What is the overall approach taken with this 27 patch break down? What
> are the changes since v2, v1? If you incorporated someone's review
> feedback note it in the cover letter changelog, if you didn't

Assumptions and tradeoffs the implementation considered are also
critical for reviewing the approach.


Re: [PATCH v3 00/27] Add support for OpenCAPI Persistent Memory devices

2020-02-21 Thread Dan Williams
On Thu, Feb 20, 2020 at 7:28 PM Alastair D'Silva  wrote:
>
> From: Alastair D'Silva 
>
> This series adds support for OpenCAPI Persistent Memory devices, exposing
> them as nvdimms so that we can make use of the existing infrastructure.

A single sentence to introduce:

24 files changed, 3029 insertions(+), 97 deletions(-)

...is inadequate. What are OpenCAPI Persistent Memory devices? How do
they compare, in terms relevant to libnvdimm, to other persistent
memory devices? What challenges do they pose to the existing enabling?
What is the overall approach taken with this 27 patch break down? What
are the changes since v2, v1? If you incorporated someone's review
feedback note it in the cover letter changelog, if you didn't
incorporate someone's feedback note that too with an explanation.

In short, provide a bridge document for someone familiar with the
upstream infrastructure, but not necessarily steeped in powernv /
OpenCAPI platform details, to get started with this code.

For now, no need to resend the whole series, just reply to this
message with a fleshed out cover letter and then incorporate it going
forward for v4+.


Re: [PATCH] libnvdimm/bus: return the outvar 'cmd_rc' error code in __nd_ioctl()

2020-02-18 Thread Dan Williams
On Tue, Feb 18, 2020 at 1:00 PM Jeff Moyer  wrote:
>
> Vaibhav Jain  writes:
>
> > Presently the error code returned via out variable 'cmd_rc' from the
> > nvdimm-bus controller function is ignored when called from
> > __nd_ioctl() and never communicated back to user-space code that called
> > an ioctl on dimm/bus.
> >
> > This minor patch updates __nd_ioctl() to propagate the value of out
> > variable 'cmd_rc' back to user-space in case it reports an error.
> >
> > Signed-off-by: Vaibhav Jain 
> > ---
> >  drivers/nvdimm/bus.c | 5 +
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
> > index a8b515968569..5b687a27fdf2 100644
> > --- a/drivers/nvdimm/bus.c
> > +++ b/drivers/nvdimm/bus.c
> > @@ -1153,6 +1153,11 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, 
> > struct nvdimm *nvdimm,
> >   if (rc < 0)
> >   goto out_unlock;
> >
> > + if (cmd_rc < 0) {
> > + rc = cmd_rc;
> > + goto out_unlock;
> > + }
> > +
> >   if (!nvdimm && cmd == ND_CMD_CLEAR_ERROR && cmd_rc >= 0) {
> >   struct nd_cmd_clear_error *clear_err = buf;
>
> Looks good to me.
>
> Reviewed-by: Jeff Moyer 

Applied.


Re: [PATCH v2 1/4] mm/memremap_pages: Introduce memremap_compat_align()

2020-02-14 Thread Dan Williams
On Fri, Feb 14, 2020 at 12:59 PM Jeff Moyer  wrote:
>
> Dan Williams  writes:
>
> > On Thu, Feb 13, 2020 at 8:58 AM Jeff Moyer  wrote:
>
> >> I have just a couple of questions.
> >>
> >> First, can you please add a comment above the generic implementation of
> >> memremap_compat_align describing its purpose, and why a platform might
> >> want to override it?
> >
> > Sure, how about:
> >
> > /*
> >  * The memremap() and memremap_pages() interfaces are alternately used
> >  * to map persistent memory namespaces. These interfaces place different
> >  * constraints on the alignment and size of the mapping (namespace).
> >  * memremap() can map individual PAGE_SIZE pages. memremap_pages() can
> >  * only map subsections (2MB), and at least one architecture (PowerPC)
> >  * the minimum mapping granularity of memremap_pages() is 16MB.
> >  *
> >  * The role of memremap_compat_align() is to communicate the minimum
> >  * arch supported alignment of a namespace such that it can freely
> >  * switch modes without violating the arch constraint. Namely, do not
> >  * allow a namespace to be PAGE_SIZE aligned since that namespace may be
> >  * reconfigured into a mode that requires SUBSECTION_SIZE alignment.
> >  */
>
> Well, if we modify the x86 variant to be PAGE_SIZE, I think that text
> won't work.  How about:

...but I'm not looking to change it to PAGE_SIZE, I'm going to fix the
alignment check to skip if the namespace has "inner" alignment
padding, i.e. "start_pad" and/or "end_trunc" are non-zero.
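
In code terms, that roughly means validating the infoblock-padded boundaries
instead of the raw namespace boundaries, along these lines (a sketch; the v3
posting of the series elsewhere in this archive implements this shape in
nd_pfn_validate()):

	if (!IS_ALIGNED(res->start + le32_to_cpu(pfn_sb->start_pad),
			memremap_compat_align()) ||
	    !IS_ALIGNED(res->end + 1 - le32_to_cpu(pfn_sb->end_trunc),
			memremap_compat_align())) {
		dev_err(&nd_pfn->dev, "padded range misaligned\n");
		return -EOPNOTSUPP;
	}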


Re: [PATCH v2 2/4] libnvdimm/namespace: Enforce memremap_compat_align()

2020-02-13 Thread Dan Williams
On Thu, Feb 13, 2020 at 1:55 PM Jeff Moyer  wrote:
>
> Dan Williams  writes:
>
> > The pmem driver on PowerPC crashes with the following signature when
> > instantiating misaligned namespaces that map their capacity via
> > memremap_pages().
> >
> > BUG: Unable to handle kernel data access at 0xc00100040600
> > Faulting instruction address: 0xc0090790
> > NIP [c0090790] arch_add_memory+0xc0/0x130
> > LR [c0090744] arch_add_memory+0x74/0x130
> > Call Trace:
> >  arch_add_memory+0x74/0x130 (unreliable)
> >  memremap_pages+0x74c/0xa30
> >  devm_memremap_pages+0x3c/0xa0
> >  pmem_attach_disk+0x188/0x770
> >  nvdimm_bus_probe+0xd8/0x470
> >
> > With the assumption that only memremap_pages() has alignment
> > constraints, enforce memremap_compat_align() for
> > pmem_should_map_pages(), nd_pfn, or nd_dax cases.
> >
> > Reported-by: Aneesh Kumar K.V 
> > Cc: Jeff Moyer 
> > Reviewed-by: Aneesh Kumar K.V 
> > Link: 
> > https://lore.kernel.org/r/158041477336.3889308.4581652885008605170.st...@dwillia2-desk3.amr.corp.intel.com
> > Signed-off-by: Dan Williams 
> > ---
> >  drivers/nvdimm/namespace_devs.c |   10 ++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/drivers/nvdimm/namespace_devs.c 
> > b/drivers/nvdimm/namespace_devs.c
> > index 032dc61725ff..aff1f32fdb4f 100644
> > --- a/drivers/nvdimm/namespace_devs.c
> > +++ b/drivers/nvdimm/namespace_devs.c
> > @@ -1739,6 +1739,16 @@ struct nd_namespace_common 
> > *nvdimm_namespace_common_probe(struct device *dev)
> >   return ERR_PTR(-ENODEV);
> >   }
> >
> > + if (pmem_should_map_pages(dev) || nd_pfn || nd_dax) {
> > + struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
> > + resource_size_t start = nsio->res.start;
> > +
> > + if (!IS_ALIGNED(start | size, memremap_compat_align())) {
> > + dev_dbg(&ndns->dev, "misaligned, unable to map\n");
> > + return ERR_PTR(-EOPNOTSUPP);
> > + }
> > + }
> > +
> > + if (is_namespace_pmem(&ndns->dev)) {
> >   struct nd_namespace_pmem *nspm;
> >
>
> Actually, I take back my ack.  :) This prevents a previously working
> namespace from being successfully probed/setup.

Do you have a test case handy? I can see a potential gap with a
namespace that used internal padding to fix up the alignment. The goal
of this check is to catch cases that are just going to fail
devm_memremap_pages(), and the expectation is that it could not have
worked before unless it was ported from another platform, or someone
flipped the page-size switch on PowerPC.

> I thought we were only
> going to enforce the alignment for a newly created namespace?  This should
> only check whether the alignment works for the current platform.

The model is a new default 16MB alignment is enforced at creation
time, but if you need to support previously created namespaces then
you can manually trim that alignment requirement to no less than
memremap_compat_align() because that's the point at which
devm_memremap_pages() will start failing or crashing.


Re: [PATCH v2 1/4] mm/memremap_pages: Introduce memremap_compat_align()

2020-02-13 Thread Dan Williams
On Thu, Feb 13, 2020 at 8:58 AM Jeff Moyer  wrote:
>
> Dan Williams  writes:
>
> > The "sub-section memory hotplug" facility allows memremap_pages() users
> > like libnvdimm to compensate for hardware platforms like x86 that have a
> > section size larger than their hardware memory mapping granularity.  The
> > compensation that sub-section support affords is being tolerant of
> > physical memory resources shifting by units smaller (64MiB on x86) than
> > the memory-hotplug section size (128 MiB). Where the platform
> > physical-memory mapping granularity is limited by the number and
> > capability of address-decode-registers in the memory controller.
> >
> > While the sub-section support allows memremap_pages() to operate on
> > sub-section (2MiB) granularity, the Power architecture may still
> > require 16MiB alignment on "!radix_enabled()" platforms.
> >
> > In order for libnvdimm to be able to detect and manage this per-arch
> > limitation, introduce memremap_compat_align() as a common minimum
> > alignment across all driver-facing memory-mapping interfaces, and let
> > Power override it to 16MiB in the "!radix_enabled()" case.
> >
> > The assumption / requirement for 16MiB to be a viable
> > memremap_compat_align() value is that Power does not have platforms
> > where its equivalent of address-decode-registers never hardware remaps a
> > persistent memory resource on smaller than 16MiB boundaries. Note that I
> > tried my best to not add a new Kconfig symbol, but header include
> > entanglements defeated the #ifndef memremap_compat_align design pattern
> > and the need to export it defeats the __weak design pattern for arch
> > overrides.
> >
> > Based on an initial patch by Aneesh.
>
> I have just a couple of questions.
>
> First, can you please add a comment above the generic implementation of
> memremap_compat_align describing its purpose, and why a platform might
> want to override it?

Sure, how about:

/*
 * The memremap() and memremap_pages() interfaces are alternately used
 * to map persistent memory namespaces. These interfaces place different
 * constraints on the alignment and size of the mapping (namespace).
 * memremap() can map individual PAGE_SIZE pages. memremap_pages() can
 * only map subsections (2MB), and at least one architecture (PowerPC)
 * the minimum mapping granularity of memremap_pages() is 16MB.
 *
 * The role of memremap_compat_align() is to communicate the minimum
 * arch supported alignment of a namespace such that it can freely
 * switch modes without violating the arch constraint. Namely, do not
 * allow a namespace to be PAGE_SIZE aligned since that namespace may be
 * reconfigured into a mode that requires SUBSECTION_SIZE alignment.
 */
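
For context, the generic fallback that this comment would sit above just
returns the subsection size; roughly (a sketch, modulo the exact Kconfig
guard):

#ifndef CONFIG_ARCH_HAS_MEMREMAP_COMPAT_ALIGN
unsigned long memremap_compat_align(void)
{
	/* no arch constraint beyond the 2MB subsection granularity */
	return SUBSECTION_SIZE;
}
EXPORT_SYMBOL_GPL(memremap_compat_align);
#endif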

> Second, I will take it at face value that the power architecture
> requires a 16MB alignment, but it's not clear to me why mmu_linear_psize
> was chosen to represent that.  What's the relationship, there, and can
> we please have a comment explaining it?

Aneesh, can you help here?
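
My rough understanding, to be confirmed: mmu_linear_psize is the page size
the hash MMU uses for the kernel linear mapping, so the override ends up
looking something like the sketch below (per the PowerPC patch in this
series; details may differ):

/* sketch: !radix_enabled() is limited to the linear-map page size (16MB) */
unsigned long memremap_compat_align(void)
{
	unsigned int shift = mmu_psize_defs[mmu_linear_psize].shift;

	if (radix_enabled())
		return SUBSECTION_SIZE;
	return max(SUBSECTION_SIZE, 1UL << shift);
}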


[PATCH v2 4/4] libnvdimm/region: Introduce an 'align' attribute

2020-02-12 Thread Dan Williams
The align attribute applies an alignment constraint for namespace
creation in a region. Whereas the 'align' attribute of a namespace
applies alignment padding via an info block, the region 'align'
attribute applies alignment constraints to the free space allocation.

The default for 'align' is the maximum known memremap_compat_align()
across all archs (16MiB from PowerPC at time of writing) multiplied by
the number of interleave ways if there is blk-aliasing. The minimum is
PAGE_SIZE and allows for the creation of cross-arch incompatible
namespaces, just as previous kernels allowed, but the expectation is
cross-arch and mode-independent compatibility by default.
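
Spelled out, the default described above amounts to roughly the following
(a sketch that just restates the rule; region_default_align() and
region_has_blk_aliasing() are illustrative names, not the helpers used in
the patch, and MEMREMAP_COMPAT_ALIGN_MAX is the 16MiB constant introduced
alongside memremap_compat_align()):

/* sketch: region default = max compat-align, scaled by interleave ways */
static unsigned long region_default_align(struct nd_region *nd_region)
{
	unsigned long align = MEMREMAP_COMPAT_ALIGN_MAX;

	if (region_has_blk_aliasing(nd_region))	/* hypothetical helper */
		align *= nd_region->ndr_mappings;
	return align;
}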

The regression risk with this change is limited to cases that were
dependent on the ability to create unaligned namespaces, *and* for some
reason are unable to opt-out of aligned namespaces by writing to
'regionX/align'. If such a scenario arises the default can be flipped
from opt-out to opt-in of compat-aligned namespace creation, but that is
a last resort. The kernel will otherwise continue to support existing
defined misaligned namespaces.

Unfortunately this change needs to touch several parts of the
implementation at once:

- region/available_size: expand busy extents to current align
- region/max_available_extent: expand busy extents to current align
- namespace/size: trim free space to current align

...to keep the free space accounting conforming to the dynamic align
setting.

Reported-by: Aneesh Kumar K.V 
Reported-by: Jeff Moyer 
Signed-off-by: Dan Williams 
Reviewed-by: Aneesh Kumar K.V 
Link: 
https://lore.kernel.org/r/158041478371.3889308.14542630147672668068.st...@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/dimm_devs.c  |   86 +++
 drivers/nvdimm/namespace_devs.c |9 ++-
 drivers/nvdimm/nd.h |1 
 drivers/nvdimm/region_devs.c|  122 ---
 4 files changed, 192 insertions(+), 26 deletions(-)

diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 64159d4d4b8f..b4994abb655f 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -563,6 +563,21 @@ int nvdimm_security_freeze(struct nvdimm *nvdimm)
return rc;
 }
 
+static unsigned long dpa_align(struct nd_region *nd_region)
+{
+   struct device *dev = &nd_region->dev;
+
+   if (dev_WARN_ONCE(dev, !is_nvdimm_bus_locked(dev),
+   "bus lock required for capacity provision\n"))
+   return 0;
+   if (dev_WARN_ONCE(dev, !nd_region->ndr_mappings || nd_region->align
+   % nd_region->ndr_mappings,
+   "invalid region align %#lx mappings: %d\n",
+   nd_region->align, nd_region->ndr_mappings))
+   return 0;
+   return nd_region->align / nd_region->ndr_mappings;
+}
+
 int alias_dpa_busy(struct device *dev, void *data)
 {
resource_size_t map_end, blk_start, new;
@@ -571,6 +586,7 @@ int alias_dpa_busy(struct device *dev, void *data)
struct nd_region *nd_region;
struct nvdimm_drvdata *ndd;
struct resource *res;
+   unsigned long align;
int i;
 
if (!is_memory(dev))
@@ -608,13 +624,21 @@ int alias_dpa_busy(struct device *dev, void *data)
 * Find the free dpa from the end of the last pmem allocation to
 * the end of the interleave-set mapping.
 */
+   align = dpa_align(nd_region);
+   if (!align)
+   return 0;
+
for_each_dpa_resource(ndd, res) {
+   resource_size_t start, end;
+
if (strncmp(res->name, "pmem", 4) != 0)
continue;
-   if ((res->start >= blk_start && res->start < map_end)
-   || (res->end >= blk_start
-   && res->end <= map_end)) {
-   new = max(blk_start, min(map_end + 1, res->end + 1));
+
+   start = ALIGN_DOWN(res->start, align);
+   end = ALIGN(res->end + 1, align) - 1;
+   if ((start >= blk_start && start < map_end)
+   || (end >= blk_start && end <= map_end)) {
+   new = max(blk_start, min(map_end, end) + 1);
if (new != blk_start) {
blk_start = new;
goto retry;
@@ -654,6 +678,7 @@ resource_size_t nd_blk_available_dpa(struct nd_region 
*nd_region)
.res = NULL,
};
struct resource *res;
+   unsigned long align;
 
if (!ndd)
return 0;
@@ -661,10 +686,20 @@ resource_size_t nd_blk_available_dpa(struct nd_region 
*nd_region)
device_for_each_child(&nvdimm_bus->dev, &info, alias_dpa_busy);

[PATCH v2 3/4] libnvdimm/region: Introduce NDD_LABELING

2020-02-12 Thread Dan Williams
The NDD_ALIASING flag is used to indicate where pmem capacity might
alias with blk capacity and require labeling. It is also used to
indicate whether the DIMM supports labeling. Separate this latter
capability into its own flag so that the NDD_ALIASING flag is scoped to
true aliased configurations.

To my knowledge aliased configurations only exist in the ACPI spec;
there are no known platforms that ship this support in production.

This clarity allows namespace-capacity alignment constraints around
interleave-ways to be relaxed.

Cc: Vishal Verma 
Cc: Aneesh Kumar K.V 
Cc: Oliver O'Halloran 
Reviewed-by: Aneesh Kumar K.V 
Link: 
https://lore.kernel.org/r/158041477856.3889308.4212605617834097674.st...@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams 
---
 arch/powerpc/platforms/pseries/papr_scm.c |2 +-
 drivers/acpi/nfit/core.c  |4 +++-
 drivers/nvdimm/dimm.c |2 +-
 drivers/nvdimm/dimm_devs.c|9 +
 drivers/nvdimm/namespace_devs.c   |2 +-
 drivers/nvdimm/nd.h   |2 +-
 drivers/nvdimm/region_devs.c  |   10 +-
 include/linux/libnvdimm.h |2 ++
 8 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 0b4467e378e5..589858cb3203 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -328,7 +328,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
}
 
dimm_flags = 0;
-   set_bit(NDD_ALIASING, &dimm_flags);
+   set_bit(NDD_LABELING, &dimm_flags);
 
p->nvdimm = nvdimm_create(p->bus, p, NULL, dimm_flags,
  PAPR_SCM_DIMM_CMD_MASK, 0, NULL);
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index a3320f93616d..71d7f2aa1b12 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2026,8 +2026,10 @@ static int acpi_nfit_register_dimms(struct 
acpi_nfit_desc *acpi_desc)
continue;
}
 
-   if (nfit_mem->bdw && nfit_mem->memdev_pmem)
+   if (nfit_mem->bdw && nfit_mem->memdev_pmem) {
set_bit(NDD_ALIASING, &flags);
+   set_bit(NDD_LABELING, &flags);
+   }
 
/* collate flags across all memdevs for this dimm */
list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c
index 64776ed15bb3..7d4ddc4d9322 100644
--- a/drivers/nvdimm/dimm.c
+++ b/drivers/nvdimm/dimm.c
@@ -99,7 +99,7 @@ static int nvdimm_probe(struct device *dev)
if (ndd->ns_current >= 0) {
rc = nd_label_reserve_dpa(ndd);
if (rc == 0)
-   nvdimm_set_aliasing(dev);
+   nvdimm_set_labeling(dev);
}
nvdimm_bus_unlock(dev);
 
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 94ea6dba6b4f..64159d4d4b8f 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -32,7 +32,7 @@ int nvdimm_check_config_data(struct device *dev)
 
if (!nvdimm->cmd_mask ||
!test_bit(ND_CMD_GET_CONFIG_DATA, &nvdimm->cmd_mask)) {
-   if (test_bit(NDD_ALIASING, &nvdimm->flags))
+   if (test_bit(NDD_LABELING, &nvdimm->flags))
return -ENXIO;
else
return -ENOTTY;
@@ -173,11 +173,11 @@ int nvdimm_set_config_data(struct nvdimm_drvdata *ndd, 
size_t offset,
return rc;
 }
 
-void nvdimm_set_aliasing(struct device *dev)
+void nvdimm_set_labeling(struct device *dev)
 {
struct nvdimm *nvdimm = to_nvdimm(dev);
 
-   set_bit(NDD_ALIASING, &nvdimm->flags);
+   set_bit(NDD_LABELING, &nvdimm->flags);
 }
 
 void nvdimm_set_locked(struct device *dev)
@@ -312,8 +312,9 @@ static ssize_t flags_show(struct device *dev,
 {
struct nvdimm *nvdimm = to_nvdimm(dev);
 
-   return sprintf(buf, "%s%s\n",
+   return sprintf(buf, "%s%s%s\n",
test_bit(NDD_ALIASING, &nvdimm->flags) ? "alias " : "",
+   test_bit(NDD_LABELING, &nvdimm->flags) ? "label" : "",
test_bit(NDD_LOCKED, &nvdimm->flags) ? "lock " : "");
 }
 static DEVICE_ATTR_RO(flags);
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index aff1f32fdb4f..30cda9f235de 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -2531,7 +2531,7 @@ static int init_active_labels(struct nd_region *nd_region)
if (!ndd) {
if (test_bit(NDD_LOCKED, &nvdimm->flags))
/* fail, label data may be unreadable */;
-   else if

[PATCH v2 2/4] libnvdimm/namespace: Enforce memremap_compat_align()

2020-02-12 Thread Dan Williams
The pmem driver on PowerPC crashes with the following signature when
instantiating misaligned namespaces that map their capacity via
memremap_pages().

BUG: Unable to handle kernel data access at 0xc00100040600
Faulting instruction address: 0xc0090790
NIP [c0090790] arch_add_memory+0xc0/0x130
LR [c0090744] arch_add_memory+0x74/0x130
Call Trace:
 arch_add_memory+0x74/0x130 (unreliable)
 memremap_pages+0x74c/0xa30
 devm_memremap_pages+0x3c/0xa0
 pmem_attach_disk+0x188/0x770
 nvdimm_bus_probe+0xd8/0x470

With the assumption that only memremap_pages() has alignment
constraints, enforce memremap_compat_align() for
pmem_should_map_pages(), nd_pfn, or nd_dax cases.

Reported-by: Aneesh Kumar K.V 
Cc: Jeff Moyer 
Reviewed-by: Aneesh Kumar K.V 
Link: 
https://lore.kernel.org/r/158041477336.3889308.4581652885008605170.st...@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/namespace_devs.c |   10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 032dc61725ff..aff1f32fdb4f 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -1739,6 +1739,16 @@ struct nd_namespace_common 
*nvdimm_namespace_common_probe(struct device *dev)
return ERR_PTR(-ENODEV);
}
 
+   if (pmem_should_map_pages(dev) || nd_pfn || nd_dax) {
+   struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
+   resource_size_t start = nsio->res.start;
+
+   if (!IS_ALIGNED(start | size, memremap_compat_align())) {
+   dev_dbg(&ndns->dev, "misaligned, unable to map\n");
+   return ERR_PTR(-EOPNOTSUPP);
+   }
+   }
+
if (is_namespace_pmem(&ndns->dev)) {
struct nd_namespace_pmem *nspm;
 


