Re: [PATCH 06/13] nvdimm/blk: avoid calling del_gendisk() on early failures

2021-10-15 Thread Dan Williams
On Fri, Oct 15, 2021 at 4:53 PM Luis Chamberlain  wrote:
>
> If nd_integrity_init() fails we'd get del_gendisk() called,
> but that's not correct as we should only call that if we're
> done with device_add_disk(). Fix this by providing unwinding
> prior to the devm call being registered and moving the devm
> registration to the very end.
>
> This should fix calling del_gendisk() if nd_integrity_init()
> fails. I only spotted this issue through code inspection. It
> does not fix any real world bug.
>

Just fyi, I'm preparing patches to delete this driver completely as it
is unused by any shipping platform. I hope to get that removal into
v5.16.


[PATCH v3 10/10] ocxl: Use pci core's DVSEC functionality

2021-10-09 Thread Dan Williams
From: Ben Widawsky 

Reduce maintenance burden of DVSEC query implementation by using the
centralized PCI core implementation.

There are two obvious places to simply drop in the new core
implementation. There remains find_dvsec_from_pos() which would benefit
from using a core implementation. As that change is less trivial it is
reserved for later.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Andrew Donnellan 
Acked-by: Frederic Barrat  (v1)
Signed-off-by: Ben Widawsky 
Reviewed-by: Andrew Donnellan 
Signed-off-by: Dan Williams 
---
 arch/powerpc/platforms/powernv/ocxl.c |3 ++-
 drivers/misc/ocxl/config.c|   13 +
 2 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/ocxl.c 
b/arch/powerpc/platforms/powernv/ocxl.c
index 9105efcf242a..28b009b46464 100644
--- a/arch/powerpc/platforms/powernv/ocxl.c
+++ b/arch/powerpc/platforms/powernv/ocxl.c
@@ -107,7 +107,8 @@ static int get_max_afu_index(struct pci_dev *dev, int 
*afu_idx)
int pos;
u32 val;
 
-   pos = find_dvsec_from_pos(dev, OCXL_DVSEC_FUNC_ID, 0);
+   pos = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_IBM,
+   OCXL_DVSEC_FUNC_ID);
if (!pos)
return -ESRCH;
 
diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
index a68738f38252..e401a51596b9 100644
--- a/drivers/misc/ocxl/config.c
+++ b/drivers/misc/ocxl/config.c
@@ -33,18 +33,7 @@
 
 static int find_dvsec(struct pci_dev *dev, int dvsec_id)
 {
-   int vsec = 0;
-   u16 vendor, id;
-
-   while ((vsec = pci_find_next_ext_capability(dev, vsec,
-   OCXL_EXT_CAP_ID_DVSEC))) {
-   pci_read_config_word(dev, vsec + OCXL_DVSEC_VENDOR_OFFSET,
-   );
-   pci_read_config_word(dev, vsec + OCXL_DVSEC_ID_OFFSET, );
-   if (vendor == PCI_VENDOR_ID_IBM && id == dvsec_id)
-   return vsec;
-   }
-   return 0;
+   return pci_find_dvsec_capability(dev, PCI_VENDOR_ID_IBM, dvsec_id);
 }
 
 static int find_dvsec_afu_ctrl(struct pci_dev *dev, u8 afu_idx)



[PATCH v3 08/10] PCI: Add pci_find_dvsec_capability to find designated VSEC

2021-10-09 Thread Dan Williams
From: Ben Widawsky 

Add pci_find_dvsec_capability to locate a Designated Vendor-Specific
Extended Capability with the specified Vendor ID and Capability ID.

The Designated Vendor-Specific Extended Capability (DVSEC) allows one or
more "vendor" specific capabilities that are not tied to the Vendor ID
of the PCI component. Where the DVSEC Vendor may be a standards body
like CXL.

Cc: David E. Box 
Cc: Jonathan Cameron 
Cc: Bjorn Helgaas 
Cc: Dan Williams 
Cc: linux-...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Andrew Donnellan 
Cc: Lu Baolu 
Reviewed-by: Frederic Barrat 
Signed-off-by: Ben Widawsky 
Reviewed-by: Andrew Donnellan 
Acked-by: Bjorn Helgaas 
Tested-by: Kan Liang 
Signed-off-by: Kan Liang 
Reviewed-by: Jonathan Cameron 
Signed-off-by: Dan Williams 
---
 drivers/pci/pci.c   |   32 
 include/linux/pci.h |1 +
 2 files changed, 33 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index ce2ab62b64cf..94ac86ff28b0 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -732,6 +732,38 @@ u16 pci_find_vsec_capability(struct pci_dev *dev, u16 
vendor, int cap)
 }
 EXPORT_SYMBOL_GPL(pci_find_vsec_capability);
 
+/**
+ * pci_find_dvsec_capability - Find DVSEC for vendor
+ * @dev: PCI device to query
+ * @vendor: Vendor ID to match for the DVSEC
+ * @dvsec: Designated Vendor-specific capability ID
+ *
+ * If DVSEC has Vendor ID @vendor and DVSEC ID @dvsec return the capability
+ * offset in config space; otherwise return 0.
+ */
+u16 pci_find_dvsec_capability(struct pci_dev *dev, u16 vendor, u16 dvsec)
+{
+   int pos;
+
+   pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DVSEC);
+   if (!pos)
+   return 0;
+
+   while (pos) {
+   u16 v, id;
+
+   pci_read_config_word(dev, pos + PCI_DVSEC_HEADER1, );
+   pci_read_config_word(dev, pos + PCI_DVSEC_HEADER2, );
+   if (vendor == v && dvsec == id)
+   return pos;
+
+   pos = pci_find_next_ext_capability(dev, pos, 
PCI_EXT_CAP_ID_DVSEC);
+   }
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(pci_find_dvsec_capability);
+
 /**
  * pci_find_parent_resource - return resource region of parent bus of given
  *   region
diff --git a/include/linux/pci.h b/include/linux/pci.h
index cd8aa6fce204..c93ccfa4571b 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1130,6 +1130,7 @@ u16 pci_find_ext_capability(struct pci_dev *dev, int cap);
 u16 pci_find_next_ext_capability(struct pci_dev *dev, u16 pos, int cap);
 struct pci_bus *pci_find_next_bus(const struct pci_bus *from);
 u16 pci_find_vsec_capability(struct pci_dev *dev, u16 vendor, int cap);
+u16 pci_find_dvsec_capability(struct pci_dev *dev, u16 vendor, u16 dvsec);
 
 u64 pci_get_dsn(struct pci_dev *dev);
 



[PATCH v3 00/10] cxl_pci refactor for reusability

2021-10-09 Thread Dan Williams
Changes since v2 [1]:
- Rework some of the changelogs per feedback (Bjorn, and I)
- Move the cxl_register_map refactor earlier in the series to make the
  cxl_setup_pci_regs() refactor easier to read.
- Fix a bug added in v5.14 for handling the error return case
  cxl_pci_map_regblock()
- Split the addition of @base to cxl_register_map to its own patch
- Drop the cxl_pci_dvsec() wrapper (Christoph)
- Drop the SIOV conversion patch given Baolu's feedback about it being
  dead code

[1]: https://lore.kernel.org/r/20210923172647.72738-1-ben.widaw...@intel.com

---
I am helping out with the review feedback on this set while Ben is
focusing on region provisioning. It appears this rework will be suitable
to just carry in cxl/next, no need to make a cross-tree dependency for
"PCI: Add pci_find_dvsec_capability to find designated VSEC" at this
time.

Ben's original cover:

Provide the ability to obtain CXL register blocks as discrete functionality.
This functionality will become useful for other CXL drivers that need access to
CXL register blocks. It is also in line with other additions to core which moves
register mapping functionality.

At the introduction of the CXL driver the only user of CXL MMIO was cxl_pci
(then known as cxl_mem). As the driver has evolved it is clear that cxl_pci will
not be the only entity that needs access to CXL MMIO. This series stops short of
moving the generalized functionality into cxl_core for the sake of getting eyes
on the important foundational bits sooner rather than later. The ultimate plan
is to move much of the code into cxl_core.

Via review of two previous patches [1] & [2] it has been suggested that the bits
which are being used for DVSEC enumeration move into PCI core. As CXL core is
soon going to require these, let's try to get the ball rolling now on making
that happen.

---

[1]: 
https://lore.kernel.org/linux-pci/20210913190131.xiiszmno46qie...@intel.com/
[2]: 
https://lore.kernel.org/linux-cxl/20210920225638.1729482-1-ben.widaw...@intel.com/
[3]: 
https://lore.kernel.org/linux-cxl/20210921220459.2437386-1-ben.widaw...@intel.com/

---

Ben Widawsky (8):
  cxl/pci: Convert register block identifiers to an enum
  cxl/pci: Remove dev_dbg for unknown register blocks
  cxl/pci: Remove pci request/release regions
  cxl/pci: Make more use of cxl_register_map
  cxl/pci: Split cxl_pci_setup_regs()
  PCI: Add pci_find_dvsec_capability to find designated VSEC
  cxl/pci: Use pci core's DVSEC functionality
  ocxl: Use pci core's DVSEC functionality

Dan Williams (2):
  cxl/pci: Fix NULL vs ERR_PTR confusion
  cxl/pci: Add @base to cxl_register_map


 arch/powerpc/platforms/powernv/ocxl.c |3 -
 drivers/cxl/cxl.h |1 
 drivers/cxl/pci.c |  157 +
 drivers/cxl/pci.h |   14 ++-
 drivers/misc/ocxl/config.c|   13 ---
 drivers/pci/pci.c |   32 +++
 include/linux/pci.h   |1 
 7 files changed, 105 insertions(+), 116 deletions(-)

base-commit: ed97afb53365cd03dde266c9644334a558fe5a16


Re: [PATCH v2 9/9] iommu/vt-d: Use pci core's DVSEC functionality

2021-09-28 Thread Dan Williams
On Thu, Sep 23, 2021 at 10:27 AM Ben Widawsky  wrote:
>
> Reduce maintenance burden of DVSEC query implementation by using the
> centralized PCI core implementation.
>
> Cc: io...@lists.linux-foundation.org
> Cc: David Woodhouse 
> Cc: Lu Baolu 
> Signed-off-by: Ben Widawsky 
> ---
>  drivers/iommu/intel/iommu.c | 15 +--
>  1 file changed, 1 insertion(+), 14 deletions(-)
>
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index d75f59ae28e6..30c97181f0ae 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -5398,20 +5398,7 @@ static int intel_iommu_disable_sva(struct device *dev)
>   */
>  static int siov_find_pci_dvsec(struct pci_dev *pdev)
>  {
> -   int pos;
> -   u16 vendor, id;
> -
> -   pos = pci_find_next_ext_capability(pdev, 0, 0x23);
> -   while (pos) {
> -   pci_read_config_word(pdev, pos + 4, );
> -   pci_read_config_word(pdev, pos + 8, );
> -   if (vendor == PCI_VENDOR_ID_INTEL && id == 5)
> -   return pos;
> -
> -   pos = pci_find_next_ext_capability(pdev, pos, 0x23);
> -   }
> -
> -   return 0;
> +   return pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_INTEL, 5);
>  }

Same comments as the CXL patch, siov_find_pci_dvsec() doesn't seem to
have a reason to exist anymore. What is 5?


Re: [PATCH v2 7/9] cxl/pci: Use pci core's DVSEC functionality

2021-09-28 Thread Dan Williams
On Thu, Sep 23, 2021 at 10:27 AM Ben Widawsky  wrote:
>
> Reduce maintenance burden of DVSEC query implementation by using the
> centralized PCI core implementation.
>
> Signed-off-by: Ben Widawsky 
> ---
>  drivers/cxl/pci.c | 20 +---
>  1 file changed, 1 insertion(+), 19 deletions(-)
>
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 5eaf2736f779..79d4d9b16d83 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -340,25 +340,7 @@ static void cxl_pci_unmap_regblock(struct cxl_mem *cxlm, 
> struct cxl_register_map
>
>  static int cxl_pci_dvsec(struct pci_dev *pdev, int dvsec)

cxl_pci_dvsec() has no reason to exist anymore. Let's just have the
caller use pci_find_dvsec_capability() directly.


Re: [PATCH v2 5/9] cxl/pci: Make more use of cxl_register_map

2021-09-28 Thread Dan Williams
On Thu, Sep 23, 2021 at 10:27 AM Ben Widawsky  wrote:
>
> The structure exists to pass around information about register mapping.
> Using it more extensively cleans up many existing functions.

I would have liked to have seen "add @base to cxl_register_map" and
"use @map for @bar and @offset arguments" somewhere in this changelog
to set expectations for what changes are included. That would have
also highlighted that adding a @base to cxl_register_map deserves its
own patch vs the conversion of @bar and @offset to instead use
@map->bar and @map->offset. Can you resend with that split and those
mentions?


Re: [PATCH v2 4/9] cxl/pci: Refactor cxl_pci_setup_regs

2021-09-28 Thread Dan Williams
On Thu, Sep 23, 2021 at 10:27 AM Ben Widawsky  wrote:
>
> In preparation for moving parts of register mapping to cxl_core, the
> cxl_pci driver is refactored to utilize a new helper to find register
> blocks by type.
>
> cxl_pci scanned through all register blocks and mapping the ones that
> the driver will use. This logic is inverted so that the driver
> specifically requests the register blocks from a new helper. Under the
> hood, the same implementation of scanning through all register locator
> DVSEC entries exists.
>
> There are 2 behavioral changes (#2 is arguable):
> 1. A dev_err is introduced if cxl_map_regs fails.
> 2. The previous logic would try to map component registers and device
>registers multiple times if there were present and keep the mapping
>of the last one found (furthest offset in the register locator).
>While this is disallowed in the spec, CXL 2.0 8.1.9: "Each register
>block identifier shall only occur once in the Register Locator DVSEC
>structure" it was how the driver would respond to the spec violation.
>The new logic will take the first found register block by type and
>move on.
>
> Signed-off-by: Ben Widawsky 
>
> ---
> Changes since v1:

No changes? Luckily git am strips this section...

Overall I think this refactor can be broken down further for
readability and cleanup the long standing problem that the driver maps
component registers for no reason. The main contributing factor to
readability is that cxl_setup_pci_regs() still exists after the
refactor, which also contributes to the component register problem. If
the register mapping is up leveled to the caller of
cxl_setup_pci_regs() (and drops mapping component registers) then a
follow-on patch to rename cxl_setup_pci_regs to find_register_block
becomes easier to read. Moving the cxl_register_map array out of
cxl_setup_pci_regs() also makes a later patch to pass component
register enumeration details to the endpoint-port that much cleaner.


Re: [PATCH v2 3/9] cxl/pci: Remove pci request/release regions

2021-09-28 Thread Dan Williams
On Thu, Sep 23, 2021 at 10:26 AM Ben Widawsky  wrote:
>
> Quoting Dan, "... the request + release regions should probably just be
> dropped. It's not like any of the register enumeration would collide
> with someone else who already has the registers mapped. The collision
> only comes when the registers are mapped for their final usage, and that
> will have more precision in the request."

Looks good to me:

Reviewed-by: Dan Williams 

>
> Recommended-by: Dan Williams 

This isn't one of the canonical tags:
Documentation/process/submitting-patches.rst

I'll change this to Suggested-by:

> Signed-off-by: Ben Widawsky 
> ---
>  drivers/cxl/pci.c | 5 -
>  1 file changed, 5 deletions(-)
>
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index ccc7c2573ddc..7256c236fdb3 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -453,9 +453,6 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm)
> return -ENXIO;
> }
>
> -   if (pci_request_mem_regions(pdev, pci_name(pdev)))
> -   return -ENODEV;
> -
> /* Get the size of the Register Locator DVSEC */
> pci_read_config_dword(pdev, regloc + PCI_DVSEC_HEADER1, _size);
> regloc_size = FIELD_GET(PCI_DVSEC_HEADER1_LENGTH_MASK, regloc_size);
> @@ -499,8 +496,6 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm)
> n_maps++;
> }
>
> -   pci_release_mem_regions(pdev);
> -
> for (i = 0; i < n_maps; i++) {
> ret = cxl_map_regs(cxlm, [i]);
> if (ret)
> --
> 2.33.0
>


Re: [PATCH v2 2/9] cxl/pci: Remove dev_dbg for unknown register blocks

2021-09-28 Thread Dan Williams
On Thu, Sep 23, 2021 at 10:27 AM Ben Widawsky  wrote:
>
> While interesting to driver developers, the dev_dbg message doesn't do
> much except clutter up logs. This information should be attainable
> through sysfs, and someday lspci like utilities. This change
> additionally helps reduce the LOC in a subsequent patch to refactor some
> of cxl_pci register mapping.

Looks good to me:

Reviewed-by: Dan Williams 


Re: [PATCH v2 1/9] cxl: Convert "RBI" to enum

2021-09-27 Thread Dan Williams
Please spell out "register block indicator" in the subject so that the
shortlog remains somewhat readable.

On Thu, Sep 23, 2021 at 10:27 AM Ben Widawsky  wrote:
>
> In preparation for passing around the Register Block Indicator (RBI) as
> a parameter, it is desirable to convert the type to an enum so that the
> interface can use a well defined type checked parameter.

C wouldn't type check this unless it failed an integer conversion,
right? It would need to be a struct to get useful type checking.

I don't mind this for the self documenting properties it has for the
functions that will take this as a parameter, but maybe clarify what
you mean by type checked parameter?

>
> As a result of this change, should future versions of the spec add
> sparsely defined identifiers, it could become a problem if checking for
> invalid identifiers since the code currently checks for the max
> identifier. This is not an issue with current spec, and the algorithm to
> obtain the register blocks will change before those possible additions
> are made.

In general let's not spend changelog space trying to guess what future
specs may or may not do. I.e. I think this text can be dropped,
especially because enums can support sparse number spaces.


Re: [PATCH 7/7] ocxl: Use pci core's DVSEC functionality

2021-09-21 Thread Dan Williams
On Tue, Sep 21, 2021 at 3:05 PM Ben Widawsky  wrote:
>
> Reduce maintenance burden of DVSEC query implementation by using the
> centralized PCI core implementation.
>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Frederic Barrat 
> Cc: Andrew Donnellan 
> Signed-off-by: Ben Widawsky 
> ---
>  drivers/misc/ocxl/config.c | 13 +
>  1 file changed, 1 insertion(+), 12 deletions(-)
>
> diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
> index a68738f38252..e401a51596b9 100644
> --- a/drivers/misc/ocxl/config.c
> +++ b/drivers/misc/ocxl/config.c
> @@ -33,18 +33,7 @@
>
>  static int find_dvsec(struct pci_dev *dev, int dvsec_id)
>  {
> -   int vsec = 0;
> -   u16 vendor, id;
> -
> -   while ((vsec = pci_find_next_ext_capability(dev, vsec,
> -   OCXL_EXT_CAP_ID_DVSEC))) {
> -   pci_read_config_word(dev, vsec + OCXL_DVSEC_VENDOR_OFFSET,
> -   );
> -   pci_read_config_word(dev, vsec + OCXL_DVSEC_ID_OFFSET, );
> -   if (vendor == PCI_VENDOR_ID_IBM && id == dvsec_id)
> -   return vsec;
> -   }
> -   return 0;
> +   return pci_find_dvsec_capability(dev, PCI_VENDOR_ID_IBM, dvsec_id);
>  }

What about:

arch/powerpc/platforms/powernv/ocxl.c::find_dvsec_from_pos()

...?  With that converted the redundant definitions below:

OCXL_EXT_CAP_ID_DVSEC
OCXL_DVSEC_VENDOR_OFFSET
OCXL_DVSEC_ID_OFFSET

...can be cleaned up in favor of the core definitions.


Re: [RESEND PATCH v4 1/4] drivers/nvdimm: Add nvdimm pmu structure

2021-09-14 Thread Dan Williams
On Tue, Sep 14, 2021 at 9:08 PM Dan Williams  wrote:
>
> On Thu, Sep 9, 2021 at 12:56 AM kajoljain  wrote:
> >
> >
> >
> > On 9/8/21 3:29 AM, Dan Williams wrote:
> > > Hi Kajol,
> > >
> > > Apologies for the delay in responding to this series, some comments below:
> >
> > Hi Dan,
> > No issues, thanks for reviewing the patches.
> >
> > >
> > > On Thu, Sep 2, 2021 at 10:10 PM Kajol Jain  wrote:
> > >>
> > >> A structure is added, called nvdimm_pmu, for performance
> > >> stats reporting support of nvdimm devices. It can be used to add
> > >> nvdimm pmu data such as supported events and pmu event functions
> > >> like event_init/add/read/del with cpu hotplug support.
> > >>
> > >> Acked-by: Peter Zijlstra (Intel) 
> > >> Reviewed-by: Madhavan Srinivasan 
> > >> Tested-by: Nageswara R Sastry 
> > >> Signed-off-by: Kajol Jain 
> > >> ---
> > >>  include/linux/nd.h | 43 +++
> > >>  1 file changed, 43 insertions(+)
> > >>
> > >> diff --git a/include/linux/nd.h b/include/linux/nd.h
> > >> index ee9ad76afbba..712499cf7335 100644
> > >> --- a/include/linux/nd.h
> > >> +++ b/include/linux/nd.h
> > >> @@ -8,6 +8,8 @@
> > >>  #include 
> > >>  #include 
> > >>  #include 
> > >> +#include 
> > >> +#include 
> > >>
> > >>  enum nvdimm_event {
> > >> NVDIMM_REVALIDATE_POISON,
> > >> @@ -23,6 +25,47 @@ enum nvdimm_claim_class {
> > >> NVDIMM_CCLASS_UNKNOWN,
> > >>  };
> > >>
> > >> +/* Event attribute array index */
> > >> +#define NVDIMM_PMU_FORMAT_ATTR 0
> > >> +#define NVDIMM_PMU_EVENT_ATTR  1
> > >> +#define NVDIMM_PMU_CPUMASK_ATTR2
> > >> +#define NVDIMM_PMU_NULL_ATTR   3
> > >> +
> > >> +/**
> > >> + * struct nvdimm_pmu - data structure for nvdimm perf driver
> > >> + *
> > >> + * @name: name of the nvdimm pmu device.
> > >> + * @pmu: pmu data structure for nvdimm performance stats.
> > >> + * @dev: nvdimm device pointer.
> > >> + * @functions(event_init/add/del/read): platform specific pmu functions.
> > >
> > > This is not valid kernel-doc:
> > >
> > > include/linux/nd.h:67: warning: Function parameter or member
> > > 'event_init' not described in 'nvdimm_pmu'
> > > include/linux/nd.h:67: warning: Function parameter or member 'add' not
> > > described in 'nvdimm_pmu'
> > > include/linux/nd.h:67: warning: Function parameter or member 'del' not
> > > described in 'nvdimm_pmu'
> > > include/linux/nd.h:67: warning: Function parameter or member 'read'
> > > not described in 'nvdimm_pmu'
> > >
> > > ...but I think rather than fixing those up 'struct nvdimm_pmu' should be 
> > > pruned.
> > >
> > > It's not clear to me that it is worth the effort to describe these
> > > details to the nvdimm core which is just going to turn around and call
> > > the pmu core. I'd just as soon have the driver call the pmu core
> > > directly, optionally passing in attributes and callbacks that come
> > > from the nvdimm core and/or the nvdimm provider.
> >
> > The intend for adding these callbacks(event_init/add/del/read) is to give
> > flexibility to the nvdimm core to add some common checks/routines if 
> > required
> > in the future. Those checks can be common for all architecture with still 
> > having the
> > ability to call arch/platform specific driver code to use its own routines.
> >
> > But as you said, currently we don't have any common checks and it directly
> > calling platform specific code, so we can get rid of it.
> > Should we remove this part for now?
>
> Yes, lets go direct to the perf api for now and await the need for a
> common core wrapper to present itself.
>
> >
> >
> > >
> > > Otherwise it's also not clear which of these structure members are
> > > used at runtime vs purely used as temporary storage to pass parameters
> > > to the pmu core.
> > >
> > >> + * @attr_groups: data structure for events, formats and cpumask
> > >> + * @cpu: designated cpu for counter access.
> > >> + * @node: node for cpu hotplug notifier link.
> > >> + * @cp

Re: [RESEND PATCH v4 1/4] drivers/nvdimm: Add nvdimm pmu structure

2021-09-14 Thread Dan Williams
On Thu, Sep 9, 2021 at 12:56 AM kajoljain  wrote:
>
>
>
> On 9/8/21 3:29 AM, Dan Williams wrote:
> > Hi Kajol,
> >
> > Apologies for the delay in responding to this series, some comments below:
>
> Hi Dan,
> No issues, thanks for reviewing the patches.
>
> >
> > On Thu, Sep 2, 2021 at 10:10 PM Kajol Jain  wrote:
> >>
> >> A structure is added, called nvdimm_pmu, for performance
> >> stats reporting support of nvdimm devices. It can be used to add
> >> nvdimm pmu data such as supported events and pmu event functions
> >> like event_init/add/read/del with cpu hotplug support.
> >>
> >> Acked-by: Peter Zijlstra (Intel) 
> >> Reviewed-by: Madhavan Srinivasan 
> >> Tested-by: Nageswara R Sastry 
> >> Signed-off-by: Kajol Jain 
> >> ---
> >>  include/linux/nd.h | 43 +++
> >>  1 file changed, 43 insertions(+)
> >>
> >> diff --git a/include/linux/nd.h b/include/linux/nd.h
> >> index ee9ad76afbba..712499cf7335 100644
> >> --- a/include/linux/nd.h
> >> +++ b/include/linux/nd.h
> >> @@ -8,6 +8,8 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> +#include 
> >> +#include 
> >>
> >>  enum nvdimm_event {
> >> NVDIMM_REVALIDATE_POISON,
> >> @@ -23,6 +25,47 @@ enum nvdimm_claim_class {
> >> NVDIMM_CCLASS_UNKNOWN,
> >>  };
> >>
> >> +/* Event attribute array index */
> >> +#define NVDIMM_PMU_FORMAT_ATTR 0
> >> +#define NVDIMM_PMU_EVENT_ATTR  1
> >> +#define NVDIMM_PMU_CPUMASK_ATTR2
> >> +#define NVDIMM_PMU_NULL_ATTR   3
> >> +
> >> +/**
> >> + * struct nvdimm_pmu - data structure for nvdimm perf driver
> >> + *
> >> + * @name: name of the nvdimm pmu device.
> >> + * @pmu: pmu data structure for nvdimm performance stats.
> >> + * @dev: nvdimm device pointer.
> >> + * @functions(event_init/add/del/read): platform specific pmu functions.
> >
> > This is not valid kernel-doc:
> >
> > include/linux/nd.h:67: warning: Function parameter or member
> > 'event_init' not described in 'nvdimm_pmu'
> > include/linux/nd.h:67: warning: Function parameter or member 'add' not
> > described in 'nvdimm_pmu'
> > include/linux/nd.h:67: warning: Function parameter or member 'del' not
> > described in 'nvdimm_pmu'
> > include/linux/nd.h:67: warning: Function parameter or member 'read'
> > not described in 'nvdimm_pmu'
> >
> > ...but I think rather than fixing those up 'struct nvdimm_pmu' should be 
> > pruned.
> >
> > It's not clear to me that it is worth the effort to describe these
> > details to the nvdimm core which is just going to turn around and call
> > the pmu core. I'd just as soon have the driver call the pmu core
> > directly, optionally passing in attributes and callbacks that come
> > from the nvdimm core and/or the nvdimm provider.
>
> The intend for adding these callbacks(event_init/add/del/read) is to give
> flexibility to the nvdimm core to add some common checks/routines if required
> in the future. Those checks can be common for all architecture with still 
> having the
> ability to call arch/platform specific driver code to use its own routines.
>
> But as you said, currently we don't have any common checks and it directly
> calling platform specific code, so we can get rid of it.
> Should we remove this part for now?

Yes, lets go direct to the perf api for now and await the need for a
common core wrapper to present itself.

>
>
> >
> > Otherwise it's also not clear which of these structure members are
> > used at runtime vs purely used as temporary storage to pass parameters
> > to the pmu core.
> >
> >> + * @attr_groups: data structure for events, formats and cpumask
> >> + * @cpu: designated cpu for counter access.
> >> + * @node: node for cpu hotplug notifier link.
> >> + * @cpuhp_state: state for cpu hotplug notification.
> >> + * @arch_cpumask: cpumask to get designated cpu for counter access.
> >> + */
> >> +struct nvdimm_pmu {
> >> +   const char *name;
> >> +   struct pmu pmu;
> >> +   struct device *dev;
> >> +   int (*event_init)(struct perf_event *event);
> >> +   int  (*add)(struct perf_event *event, int flags);
> >> +   void (*del)(struct perf_event *event, int flags);
> >> +   void (*read)(struct perf_event *event);
> &g

Re: [RESEND PATCH v4 4/4] powerpc/papr_scm: Document papr_scm sysfs event format entries

2021-09-07 Thread Dan Williams
On Thu, Sep 2, 2021 at 10:11 PM Kajol Jain  wrote:
>
> Details is added for the event, cpumask and format attributes
> in the ABI documentation.
>
> Acked-by: Peter Zijlstra (Intel) 
> Reviewed-by: Madhavan Srinivasan 
> Tested-by: Nageswara R Sastry 
> Signed-off-by: Kajol Jain 
> ---
>  Documentation/ABI/testing/sysfs-bus-papr-pmem | 31 +++
>  1 file changed, 31 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem 
> b/Documentation/ABI/testing/sysfs-bus-papr-pmem
> index 95254cec92bf..4d86252448f8 100644
> --- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
> +++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
> @@ -61,3 +61,34 @@ Description:
> * "CchRHCnt" : Cache Read Hit Count
> * "CchWHCnt" : Cache Write Hit Count
> * "FastWCnt" : Fast Write Count
> +
> +What:  /sys/devices/nmemX/format
> +Date:  June 2021
> +Contact:   linuxppc-dev , 
> nvd...@lists.linux.dev,
> +Description:   (RO) Attribute group to describe the magic bits
> +that go into perf_event_attr.config for a particular pmu.
> +(See ABI/testing/sysfs-bus-event_source-devices-format).
> +
> +Each attribute under this group defines a bit range of the
> +perf_event_attr.config. Supported attribute is listed
> +below::
> +
> +   event  = "config:0-4"  - event ID
> +
> +   For example::
> +   noopstat = "event=0x1"
> +
> +What:  /sys/devices/nmemX/events

That's not a valid sysfs path. Did you mean /sys/bus/nd/devices/nmemX?

> +Date:  June 2021
> +Contact:   linuxppc-dev , 
> nvd...@lists.linux.dev,
> +Description:(RO) Attribute group to describe performance monitoring
> +events specific to papr-scm. Each attribute in this group 
> describes
> +a single performance monitoring event supported by this 
> nvdimm pmu.
> +The name of the file is the name of the event.
> +(See ABI/testing/sysfs-bus-event_source-devices-events).

Given these events are in the generic namespace the ABI documentation
should be generic as well. So I think move these entries to
Documentation/ABI/testing/sysfs-bus-nvdimm directly.

You can still mention papr-scm, but I would expect something like:

What:   /sys/bus/nd/devices/nmemX/events
Date:   September 2021
KernelVersion:  5.16
Contact:Kajol Jain 
Description:
(RO) Attribute group to describe performance monitoring events
for the nvdimm memory device. Each attribute in this group
describes a single performance monitoring event supported by
this nvdimm pmu.  The name of the file is the name of the event.
(See ABI/testing/sysfs-bus-event_source-devices-events). A
listing of the events supported by a given nvdimm provider type
can be found in Documentation/driver-api/nvdimm/$provider, for
example: Documentation/driver-api/nvdimm/papr-scm.



> +
> +What:  /sys/devices/nmemX/cpumask
> +Date:  June 2021
> +Contact:   linuxppc-dev , 
> nvd...@lists.linux.dev,
> +Description:   (RO) This sysfs file exposes the cpumask which is designated 
> to make
> +HCALLs to retrieve nvdimm pmu event counter data.

Seems this one would be provider generic, so no need to refer to PPC
specific concepts like HCALLs.


Re: [RESEND PATCH v4 1/4] drivers/nvdimm: Add nvdimm pmu structure

2021-09-07 Thread Dan Williams
Hi Kajol,

Apologies for the delay in responding to this series, some comments below:

On Thu, Sep 2, 2021 at 10:10 PM Kajol Jain  wrote:
>
> A structure is added, called nvdimm_pmu, for performance
> stats reporting support of nvdimm devices. It can be used to add
> nvdimm pmu data such as supported events and pmu event functions
> like event_init/add/read/del with cpu hotplug support.
>
> Acked-by: Peter Zijlstra (Intel) 
> Reviewed-by: Madhavan Srinivasan 
> Tested-by: Nageswara R Sastry 
> Signed-off-by: Kajol Jain 
> ---
>  include/linux/nd.h | 43 +++
>  1 file changed, 43 insertions(+)
>
> diff --git a/include/linux/nd.h b/include/linux/nd.h
> index ee9ad76afbba..712499cf7335 100644
> --- a/include/linux/nd.h
> +++ b/include/linux/nd.h
> @@ -8,6 +8,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>
>  enum nvdimm_event {
> NVDIMM_REVALIDATE_POISON,
> @@ -23,6 +25,47 @@ enum nvdimm_claim_class {
> NVDIMM_CCLASS_UNKNOWN,
>  };
>
> +/* Event attribute array index */
> +#define NVDIMM_PMU_FORMAT_ATTR 0
> +#define NVDIMM_PMU_EVENT_ATTR  1
> +#define NVDIMM_PMU_CPUMASK_ATTR2
> +#define NVDIMM_PMU_NULL_ATTR   3
> +
> +/**
> + * struct nvdimm_pmu - data structure for nvdimm perf driver
> + *
> + * @name: name of the nvdimm pmu device.
> + * @pmu: pmu data structure for nvdimm performance stats.
> + * @dev: nvdimm device pointer.
> + * @functions(event_init/add/del/read): platform specific pmu functions.

This is not valid kernel-doc:

include/linux/nd.h:67: warning: Function parameter or member
'event_init' not described in 'nvdimm_pmu'
include/linux/nd.h:67: warning: Function parameter or member 'add' not
described in 'nvdimm_pmu'
include/linux/nd.h:67: warning: Function parameter or member 'del' not
described in 'nvdimm_pmu'
include/linux/nd.h:67: warning: Function parameter or member 'read'
not described in 'nvdimm_pmu'

...but I think rather than fixing those up 'struct nvdimm_pmu' should be pruned.

It's not clear to me that it is worth the effort to describe these
details to the nvdimm core which is just going to turn around and call
the pmu core. I'd just as soon have the driver call the pmu core
directly, optionally passing in attributes and callbacks that come
from the nvdimm core and/or the nvdimm provider.

Otherwise it's also not clear which of these structure members are
used at runtime vs purely used as temporary storage to pass parameters
to the pmu core.

> + * @attr_groups: data structure for events, formats and cpumask
> + * @cpu: designated cpu for counter access.
> + * @node: node for cpu hotplug notifier link.
> + * @cpuhp_state: state for cpu hotplug notification.
> + * @arch_cpumask: cpumask to get designated cpu for counter access.
> + */
> +struct nvdimm_pmu {
> +   const char *name;
> +   struct pmu pmu;
> +   struct device *dev;
> +   int (*event_init)(struct perf_event *event);
> +   int  (*add)(struct perf_event *event, int flags);
> +   void (*del)(struct perf_event *event, int flags);
> +   void (*read)(struct perf_event *event);
> +   /*
> +* Attribute groups for the nvdimm pmu. Index 0 used for
> +* format attribute, index 1 used for event attribute,
> +* index 2 used for cpusmask attribute and index 3 kept as NULL.
> +*/
> +   const struct attribute_group *attr_groups[4];

Following from above, I'd rather this was organized as static
attributes with an is_visible() helper for the groups for any dynamic
aspects. That mirrors the behavior of nvdimm_create() and allows for
device drivers to compose the attribute groups from a core set and /
or a provider specific set.

> +   int cpu;
> +   struct hlist_node node;
> +   enum cpuhp_state cpuhp_state;
> +
> +   /* cpumask provided by arch/platform specific code */
> +   struct cpumask arch_cpumask;
> +};
> +
>  struct nd_device_driver {
> struct device_driver drv;
> unsigned long type;
> --
> 2.26.2
>


Re: [PATCH v3] lockdown,selinux: fix wrong subject in some SELinux lockdown checks

2021-08-31 Thread Dan Williams
On Tue, Aug 31, 2021 at 6:53 AM Paul Moore  wrote:
>
> On Tue, Aug 31, 2021 at 5:09 AM Ondrej Mosnacek  wrote:
> > On Sat, Jun 19, 2021 at 12:18 AM Dan Williams  
> > wrote:
> > > On Wed, Jun 16, 2021 at 1:51 AM Ondrej Mosnacek  
> > > wrote:
>
> ...
>
> > > > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > > > index 2acc6173da36..c1747b6555c7 100644
> > > > --- a/drivers/cxl/mem.c
> > > > +++ b/drivers/cxl/mem.c
> > > > @@ -568,7 +568,7 @@ static bool cxl_mem_raw_command_allowed(u16 opcode)
> > > > if (!IS_ENABLED(CONFIG_CXL_MEM_RAW_COMMANDS))
> > > > return false;
> > > >
> > > > -   if (security_locked_down(LOCKDOWN_NONE))
> > > > +   if (security_locked_down(current_cred(), LOCKDOWN_NONE))
> > >
> > > Acked-by: Dan Williams 
> > >
> > > ...however that usage looks wrong. The expectation is that if kernel
> > > integrity protections are enabled then raw command access should be
> > > disabled. So I think that should be equivalent to LOCKDOWN_PCI_ACCESS
> > > in terms of the command capabilities to filter.
> >
> > Yes, the LOCKDOWN_NONE seems wrong here... but it's a pre-existing bug
> > and I didn't want to go down yet another rabbit hole trying to fix it.
> > I'll look at this again once this patch is settled - it may indeed be
> > as simple as replacing LOCKDOWN_NONE with LOCKDOWN_PCI_ACCESS.
>
> At this point you should be well aware of my distaste for merging
> patches that have known bugs in them.  Yes, this is a pre-existing
> condition, but it seems well within the scope of this work to address
> it as well.
>
> This isn't something that is going to get merged while the merge
> window is open, so at the very least you've got almost two weeks to
> sort this out - please do that.

Yes, apologies, I should have sent the fix shortly after noticing the
problem. I'll get the CXL bug fix out of the way so Ondrej can move
this along.


Re: [PATCH v2 4/4] bus: Make remove callback return void

2021-07-06 Thread Dan Williams
On Tue, Jul 6, 2021 at 8:51 AM Uwe Kleine-König
 wrote:
>
> The driver core ignores the return value of this callback because there
> is only little it can do when a device disappears.
>
> This is the final bit of a long lasting cleanup quest where several
> buses were converted to also return void from their remove callback.
> Additionally some resource leaks were fixed that were caused by drivers
> returning an error code in the expectation that the driver won't go
> away.
>
> With struct bus_type::remove returning void it's prevented that newly
> implemented buses return an ignored error code and so don't anticipate
> wrong expectations for driver authors.
>

>  drivers/cxl/core.c| 3 +--
>  drivers/dax/bus.c | 4 +---
>  drivers/nvdimm/bus.c      | 3 +--

For CXL, DAX, and NVDIMM

Acked-by: Dan Williams 


Re: [RESEND-PATCH v2] powerpc/papr_scm: Add support for reporting dirty-shutdown-count

2021-06-26 Thread Dan Williams
On Thu, Jun 24, 2021 at 1:07 AM Vaibhav Jain  wrote:
>
> Persistent memory devices like NVDIMMs can loose cached writes in case
> something prevents flush on power-fail. Such situations are termed as
> dirty shutdown and are exposed to applications as
> last-shutdown-state (LSS) flag and a dirty-shutdown-counter(DSC) as
> described at [1]. The latter being useful in conditions where multiple
> applications want to detect a dirty shutdown event without racing with
> one another.
>
> PAPR-NVDIMMs have so far only exposed LSS style flags to indicate a
> dirty-shutdown-state. This patch further adds support for DSC via the
> "ibm,persistence-failed-count" device tree property of an NVDIMM. This
> property is a monotonic increasing 64-bit counter thats an indication
> of number of times an NVDIMM has encountered a dirty-shutdown event
> causing persistence loss.
>
> Since this value is not expected to change after system-boot hence
> papr_scm reads & caches its value during NVDIMM probe and exposes it
> as a PAPR sysfs attributed named 'dirty_shutdown' to match the name of
> similarly named NFIT sysfs attribute. Also this value is available to
> libnvdimm via PAPR_PDSM_HEALTH payload. 'struct nd_papr_pdsm_health'
> has been extended to add a new member called 'dimm_dsc' presence of
> which is indicated by the newly introduced PDSM_DIMM_DSC_VALID flag.
>
> References:
> [1] https://pmem.io/documents/Dirty_Shutdown_Handling-V1.0.pdf
>
> Signed-off-by: Vaibhav Jain 
> Reviewed-by: Aneesh Kumar K.V 

Belated:

Acked-by: Dan Williams 

It's looking like CXL will add one of these as well. Might be time to
add a unified location when that happens and deprecate these
bus-specific locations.


Re: [PATCH v3] lockdown,selinux: fix wrong subject in some SELinux lockdown checks

2021-06-18 Thread Dan Williams
On Wed, Jun 16, 2021 at 1:51 AM Ondrej Mosnacek  wrote:
>
> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux
> lockdown") added an implementation of the locked_down LSM hook to
> SELinux, with the aim to restrict which domains are allowed to perform
> operations that would breach lockdown.
>
> However, in several places the security_locked_down() hook is called in
> situations where the current task isn't doing any action that would
> directly breach lockdown, leading to SELinux checks that are basically
> bogus.
>
> To fix this, add an explicit struct cred pointer argument to
> security_lockdown() and define NULL as a special value to pass instead
> of current_cred() in such situations. LSMs that take the subject
> credentials into account can then fall back to some default or ignore
> such calls altogether. In the SELinux lockdown hook implementation, use
> SECINITSID_KERNEL in case the cred argument is NULL.
>
> Most of the callers are updated to pass current_cred() as the cred
> pointer, thus maintaining the same behavior. The following callers are
> modified to pass NULL as the cred pointer instead:
> 1. arch/powerpc/xmon/xmon.c
>  Seems to be some interactive debugging facility. It appears that
>  the lockdown hook is called from interrupt context here, so it
>  should be more appropriate to request a global lockdown decision.
> 2. fs/tracefs/inode.c:tracefs_create_file()
>  Here the call is used to prevent creating new tracefs entries when
>  the kernel is locked down. Assumes that locking down is one-way -
>  i.e. if the hook returns non-zero once, it will never return zero
>  again, thus no point in creating these files. Also, the hook is
>  often called by a module's init function when it is loaded by
>  userspace, where it doesn't make much sense to do a check against
>  the current task's creds, since the task itself doesn't actually
>  use the tracing functionality (i.e. doesn't breach lockdown), just
>  indirectly makes some new tracepoints available to whoever is
>  authorized to use them.
> 3. net/xfrm/xfrm_user.c:copy_to_user_*()
>  Here a cryptographic secret is redacted based on the value returned
>  from the hook. There are two possible actions that may lead here:
>  a) A netlink message XFRM_MSG_GETSA with NLM_F_DUMP set - here the
> task context is relevant, since the dumped data is sent back to
> the current task.
>  b) When adding/deleting/updating an SA via XFRM_MSG_xxxSA, the
> dumped SA is broadcasted to tasks subscribed to XFRM events -
> here the current task context is not relevant as it doesn't
> represent the tasks that could potentially see the secret.
>  It doesn't seem worth it to try to keep using the current task's
>  context in the a) case, since the eventual data leak can be
>  circumvented anyway via b), plus there is no way for the task to
>  indicate that it doesn't care about the actual key value, so the
>  check could generate a lot of "false alert" denials with SELinux.
>  Thus, let's pass NULL instead of current_cred() here faute de
>  mieux.
>
> Improvements-suggested-by: Casey Schaufler 
> Improvements-suggested-by: Paul Moore 
> Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
> Signed-off-by: Ondrej Mosnacek 
[..]
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 2acc6173da36..c1747b6555c7 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -568,7 +568,7 @@ static bool cxl_mem_raw_command_allowed(u16 opcode)
> if (!IS_ENABLED(CONFIG_CXL_MEM_RAW_COMMANDS))
> return false;
>
> -   if (security_locked_down(LOCKDOWN_NONE))
> +   if (security_locked_down(current_cred(), LOCKDOWN_NONE))

Acked-by: Dan Williams 

...however that usage looks wrong. The expectation is that if kernel
integrity protections are enabled then raw command access should be
disabled. So I think that should be equivalent to LOCKDOWN_PCI_ACCESS
in terms of the command capabilities to filter.


[PATCH v2] libnvdimm/pmem: Fix blk_cleanup_disk() usage

2021-06-07 Thread Dan Williams
The queue_to_disk() helper can not be used after del_gendisk()
communicate @disk via the pgmap->owner.

Otherwise, queue_to_disk() returns NULL resulting in the splat below.

 Kernel attempted to read user page (330) - exploit attempt? (uid: 0)
 BUG: Kernel NULL pointer dereference on read at 0x0330
 Faulting instruction address: 0xc0906344
 Oops: Kernel access of bad area, sig: 11 [#1]
 [..]
 NIP [c0906344] pmem_pagemap_cleanup+0x24/0x40
 LR [c04701d4] memunmap_pages+0x1b4/0x4b0
 Call Trace:
 [c00022cbb9c0] [c09063c8] pmem_pagemap_kill+0x28/0x40 (unreliable)
 [c00022cbb9e0] [c04701d4] memunmap_pages+0x1b4/0x4b0
 [c00022cbba90] [c08b28a0] devm_action_release+0x30/0x50
 [c00022cbbab0] [c08b39c8] release_nodes+0x2f8/0x3e0
 [c00022cbbb60] [c08ac440] 
device_release_driver_internal+0x190/0x2b0
 [c00022cbbba0] [c08a8450] unbind_store+0x130/0x170

Reported-by: Sachin Sant 
Fixes: 87eb73b2ca7c ("nvdimm-pmem: convert to blk_alloc_disk/blk_cleanup_disk")
Link: 
http://lore.kernel.org/r/dfb75ba8-603f-4a35-880b-c5b23ef8f...@linux.vnet.ibm.com
Cc: Christoph Hellwig 
Cc: Ulf Hansson 
Cc: Jens Axboe 
Signed-off-by: Dan Williams 
---
Changes in v2 Improve the changelog.

 drivers/nvdimm/pmem.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 31f3c4bd6f72..fc6b78dd2d24 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -337,8 +337,9 @@ static void pmem_pagemap_cleanup(struct dev_pagemap *pgmap)
 {
struct request_queue *q =
container_of(pgmap->ref, struct request_queue, q_usage_counter);
+   struct pmem_device *pmem = pgmap->owner;
 
-   blk_cleanup_disk(queue_to_disk(q));
+   blk_cleanup_disk(pmem->disk);
 }
 
 static void pmem_release_queue(void *pgmap)
@@ -427,6 +428,7 @@ static int pmem_attach_disk(struct device *dev,
q = disk->queue;
 
pmem->disk = disk;
+   pmem->pgmap.owner = pmem;
pmem->pfn_flags = PFN_DEV;
pmem->pgmap.ref = >q_usage_counter;
if (is_nd_pfn(dev)) {



[PATCH] libnvdimm/pmem: Fix blk_cleanup_disk() usage

2021-06-07 Thread Dan Williams
The queue_to_disk() helper can not be used after del_gendisk()
communicate @disk via the pgmap->owner.

Reported-by: Sachin Sant 
Fixes: 87eb73b2ca7c ("nvdimm-pmem: convert to blk_alloc_disk/blk_cleanup_disk")
Cc: Christoph Hellwig 
Cc: Ulf Hansson 
Cc: Jens Axboe 
Signed-off-by: Dan Williams 
---
Jens,

Please take or fold this into your tree after Sachin has a chance
to test it out. It's passing my tests.

 drivers/nvdimm/pmem.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 31f3c4bd6f72..fc6b78dd2d24 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -337,8 +337,9 @@ static void pmem_pagemap_cleanup(struct dev_pagemap *pgmap)
 {
struct request_queue *q =
container_of(pgmap->ref, struct request_queue, q_usage_counter);
+   struct pmem_device *pmem = pgmap->owner;
 
-   blk_cleanup_disk(queue_to_disk(q));
+   blk_cleanup_disk(pmem->disk);
 }
 
 static void pmem_release_queue(void *pgmap)
@@ -427,6 +428,7 @@ static int pmem_attach_disk(struct device *dev,
q = disk->queue;
 
pmem->disk = disk;
+   pmem->pgmap.owner = pmem;
pmem->pfn_flags = PFN_DEV;
pmem->pgmap.ref = >q_usage_counter;
if (is_nd_pfn(dev)) {



Re: [PATCH 17/26] nvdimm-pmem: convert to blk_alloc_disk/blk_cleanup_disk

2021-06-06 Thread Dan Williams
[ add Sachin who reported this commit in -next ]

On Thu, May 20, 2021 at 10:52 PM Christoph Hellwig  wrote:
>
> Convert the nvdimm-pmem driver to use the blk_alloc_disk and
> blk_cleanup_disk helpers to simplify gendisk and request_queue
> allocation.
>
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/nvdimm/pmem.c | 15 +--
>  1 file changed, 5 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 968b8483c763..9fcd05084564 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -338,7 +338,7 @@ static void pmem_pagemap_cleanup(struct dev_pagemap 
> *pgmap)
> struct request_queue *q =
> container_of(pgmap->ref, struct request_queue, 
> q_usage_counter);
>
> -   blk_cleanup_queue(q);
> +   blk_cleanup_disk(queue_to_disk(q));

This is broken. This comes after del_gendisk() which means the queue
device is no longer associated with its disk parent. Perhaps @pmem
could be stashed in pgmap->owner and then this can use pmem->disk? Not
see any other readily available ways to get back to the disk from here
after del_gendisk().


Re: [PATCH v2] powerpc/papr_scm: Make 'perf_stats' invisible if perf-stats unavailable

2021-05-12 Thread Dan Williams
On Fri, May 7, 2021 at 4:40 AM Vaibhav Jain  wrote:
>
> In case performance stats for an nvdimm are not available, reading the
> 'perf_stats' sysfs file returns an -ENOENT error. A better approach is
> to make the 'perf_stats' file entirely invisible to indicate that
> performance stats for an nvdimm are unavailable.
>
> So this patch updates 'papr_nd_attribute_group' to add a 'is_visible'
> callback implemented as newly introduced 'papr_nd_attribute_visible()'
> that returns an appropriate mode in case performance stats aren't
> supported in a given nvdimm.
>
> Also the initialization of 'papr_scm_priv.stat_buffer_len' is moved
> from papr_scm_nvdimm_init() to papr_scm_probe() so that it value is
> available when 'papr_nd_attribute_visible()' is called during nvdimm
> initialization.
>
> Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from 
> PHYP')

Since this has been the exposed ABI since v5.9 perhaps a note /
analysis is needed here that the disappearance of this file will not
break any existing scripts/tools.

> Signed-off-by: Vaibhav Jain 
> ---
> Changelog:
>
> v2:
> * Removed a redundant dev_info() from pap_scm_nvdimm_init() [ Aneesh ]
> * Marked papr_nd_attribute_visible() as static which also fixed the
>   build warning reported by kernel build robot
> ---
>  arch/powerpc/platforms/pseries/papr_scm.c | 35 ---
>  1 file changed, 24 insertions(+), 11 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index e2b69cc3beaf..11e7b90a3360 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -907,6 +907,20 @@ static ssize_t flags_show(struct device *dev,
>  }
>  DEVICE_ATTR_RO(flags);
>
> +static umode_t papr_nd_attribute_visible(struct kobject *kobj,
> +struct attribute *attr, int n)
> +{
> +   struct device *dev = container_of(kobj, typeof(*dev), kobj);

This can use the kobj_to_dev() helper.

> +   struct nvdimm *nvdimm = to_nvdimm(dev);
> +   struct papr_scm_priv *p = nvdimm_provider_data(nvdimm);
> +
> +   /* For if perf-stats not available remove perf_stats sysfs */
> +   if (attr == _attr_perf_stats.attr && p->stat_buffer_len == 0)
> +   return 0;
> +
> +   return attr->mode;
> +}
> +
>  /* papr_scm specific dimm attributes */
>  static struct attribute *papr_nd_attributes[] = {
> _attr_flags.attr,
> @@ -916,6 +930,7 @@ static struct attribute *papr_nd_attributes[] = {
>
>  static struct attribute_group papr_nd_attribute_group = {
> .name = "papr",
> +   .is_visible = papr_nd_attribute_visible,
> .attrs = papr_nd_attributes,
>  };
>
> @@ -931,7 +946,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
> struct nd_region_desc ndr_desc;
> unsigned long dimm_flags;
> int target_nid, online_nid;
> -   ssize_t stat_size;
>
> p->bus_desc.ndctl = papr_scm_ndctl;
> p->bus_desc.module = THIS_MODULE;
> @@ -1016,16 +1030,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv 
> *p)
> list_add_tail(>region_list, _nd_regions);
> mutex_unlock(_ndr_lock);
>
> -   /* Try retriving the stat buffer and see if its supported */
> -   stat_size = drc_pmem_query_stats(p, NULL, 0);
> -   if (stat_size > 0) {
> -   p->stat_buffer_len = stat_size;
> -   dev_dbg(>pdev->dev, "Max perf-stat size %lu-bytes\n",
> -   p->stat_buffer_len);
> -   } else {
> -   dev_info(>pdev->dev, "Dimm performance stats 
> unavailable\n");
> -   }
> -
> return 0;
>
>  err:   nvdimm_bus_unregister(p->bus);
> @@ -1102,6 +1106,7 @@ static int papr_scm_probe(struct platform_device *pdev)
> u64 blocks, block_size;
> struct papr_scm_priv *p;
> const char *uuid_str;
> +   ssize_t stat_size;
> u64 uuid[2];
> int rc;
>
> @@ -1179,6 +1184,14 @@ static int papr_scm_probe(struct platform_device *pdev)
> p->res.name  = pdev->name;
> p->res.flags = IORESOURCE_MEM;
>
> +   /* Try retriving the stat buffer and see if its supported */

s/retriving/retrieving/

> +   stat_size = drc_pmem_query_stats(p, NULL, 0);
> +   if (stat_size > 0) {
> +   p->stat_buffer_len = stat_size;
> +   dev_dbg(>pdev->dev, "Max perf-stat size %lu-bytes\n",
> +   p->stat_buffer_len);
> +   }
> +
> rc = papr_scm_nvdimm_init(p);
> if (rc)
> goto err2;
> --
> 2.31.1
>

After the minor fixups above you can add:

Reviewed-by: Dan Williams 

...I assume this will go through the PPC tree.


Re: [PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible

2021-04-15 Thread Dan Williams
On Thu, Apr 15, 2021 at 4:44 AM Vaibhav Jain  wrote:
>
> Thanks for looking into this Dan,
>
> Dan Williams  writes:
>
> > On Wed, Apr 14, 2021 at 5:40 AM Vaibhav Jain  wrote:
> >>
> >> Currently drc_pmem_qeury_stats() generates a dev_err in case
> >> "Enable Performance Information Collection" feature is disabled from
> >> HMC. The error is of the form below:
> >>
> >> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
> >>  performance stats, Err:-10
> >>
> >> This error message confuses users as it implies a possible problem
> >> with the nvdimm even though its due to a disabled feature.
> >>
> >> So we fix this by explicitly handling the H_AUTHORITY error from the
> >> H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an
> >> error, saying that "Performance stats in-accessible".
> >>
> >> Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from 
> >> PHYP')
> >> Signed-off-by: Vaibhav Jain 
> >> ---
> >>  arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
> >>  1 file changed, 3 insertions(+)
> >>
> >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> >> b/arch/powerpc/platforms/pseries/papr_scm.c
> >> index 835163f54244..9216424f8be3 100644
> >> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> >> @@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct 
> >> papr_scm_priv *p,
> >> dev_err(>pdev->dev,
> >> "Unknown performance stats, Err:0x%016lX\n", 
> >> ret[0]);
> >> return -ENOENT;
> >> +   } else if (rc == H_AUTHORITY) {
> >> +   dev_warn(>pdev->dev, "Performance stats in-accessible");
> >> +   return -EPERM;
> >
> > So userspace can spam the kernel log? Why is kernel log message needed
> > at all? EPERM told the caller what happened.
> Currently this error message is only reported during probe of the
> nvdimm. So userspace cannot directly spam kernel log.

Oh, ok, I saw things like papr_pdsm_fuel_gauge() in the call stack and
thought this was reachable through an ioctl. Sorry for the noise.


Re: [PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible

2021-04-14 Thread Dan Williams
On Wed, Apr 14, 2021 at 5:40 AM Vaibhav Jain  wrote:
>
> Currently drc_pmem_qeury_stats() generates a dev_err in case
> "Enable Performance Information Collection" feature is disabled from
> HMC. The error is of the form below:
>
> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
>  performance stats, Err:-10
>
> This error message confuses users as it implies a possible problem
> with the nvdimm even though its due to a disabled feature.
>
> So we fix this by explicitly handling the H_AUTHORITY error from the
> H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an
> error, saying that "Performance stats in-accessible".
>
> Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from 
> PHYP')
> Signed-off-by: Vaibhav Jain 
> ---
>  arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index 835163f54244..9216424f8be3 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv 
> *p,
> dev_err(>pdev->dev,
> "Unknown performance stats, Err:0x%016lX\n", ret[0]);
> return -ENOENT;
> +   } else if (rc == H_AUTHORITY) {
> +   dev_warn(>pdev->dev, "Performance stats in-accessible");
> +   return -EPERM;

So userspace can spam the kernel log? Why is kernel log message needed
at all? EPERM told the caller what happened.


Re: [PATCH] mm/pmem: Avoid inserting hugepage PTE entry with fsdax if hugepage support is disabled

2021-02-05 Thread Dan Williams
[ add Andrew ]

On Thu, Feb 4, 2021 at 6:40 PM Aneesh Kumar K.V
 wrote:
>
> Differentiate between hardware not supporting hugepages and user disabling THP
> via 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
>
> For the devdax namespace, the kernel handles the above via the
> supported_alignment attribute and failing to initialize the namespace
> if the namespace align value is not supported on the platform.
>
> For the fsdax namespace, the kernel will continue to initialize
> the namespace. This can result in the kernel creating a huge pte
> entry even though the hardware don't support the same.
>
> We do want hugepage support with pmem even if the end-user disabled THP
> via sysfs file (/sys/kernel/mm/transparent_hugepage/enabled). Hence
> differentiate between hardware/firmware lacking support vs user-controlled
> disable of THP and prevent a huge fault if the hardware lacks hugepage
> support.

Looks good to me.

Reviewed-by: Dan Williams 

I assume this will go through Andrew.


Re: [RFC][PATCH 1/2] libnvdimm: Introduce ND_CMD_GET_STAT to retrieve nvdimm statistics

2020-12-07 Thread Dan Williams
[ add perf maintainers ]

On Sun, Nov 8, 2020 at 1:16 PM Vaibhav Jain  wrote:
>
> Implement support for exposing generic nvdimm statistics via newly
> introduced dimm-command ND_CMD_GET_STAT that can be handled by nvdimm
> command handler function and provide values for these statistics back
> to libnvdimm. Following generic nvdimm statistics are defined as an
> enumeration in 'uapi/ndctl.h':
>
> * "media_reads" : Number of media reads that have occurred since reboot.
> * "media_writes" : Number of media writes that have occurred since reboot.
> * "read_requests" : Number of read requests that have occurred since reboot.
> * "write_requests" : Number of write requests that have occurred since reboot.

Perhaps document these as "since device reset"? As I can imagine some
devices might have a mechanism to reset the count outside of "reboot"
which is a bit ambiguous.

> * "total_media_reads" : Total number of media reads that have occurred.
> * "total_media_writes" : Total number of media writes that have occurred.
> * "total_read_requests" : Total number of read requests that have occurred.
> * "total_write_requests" : Total number of write requests that have occurred.
>
> Apart from ND_CMD_GET_STAT ioctl these nvdimm statistics are also
> exposed via sysfs '/stats' directory for easy user-space
> access like below:
>
> /sys/class/nd/ndctl0/device/nmem0/stats # tail -n +1 *
> ==> media_reads <==
> 252197707602
> ==> media_writes <==
> 20684685172
> ==> read_requests <==
> 658810924962
> ==> write_requests <==
> 404464081574

Hmm, I haven't looked but how hard would it be to plumb these to be
perf counter-events. So someone could combine these with other perf
counters?

> In case a specific nvdimm-statistic is not supported than nvdimm
> command handler function can simply return an error (e.g -ENOENT) for
> request to read that nvdimm-statistic.

Makes sense, but I expect the perf route also has a way to enumerate
which statistics / counters are supported. I'm not opposed to also
having them in sysfs, but I think perf support should be a first class
citizen.

>
> The value for a specific nvdimm-stat is exchanged via newly introduced
> 'struct nd_cmd_get_dimm_stat' that hold a single statistics and a
> union of possible values types. Presently only '__s64' type of generic
> attributes are supported. These attributes are defined in
> 'ndvimm/dimm_devs.c' via a helper macro 'NVDIMM_STAT_ATTR'.
>
> Signed-off-by: Vaibhav Jain 
> ---
>  drivers/nvdimm/bus.c   |   6 ++
>  drivers/nvdimm/dimm_devs.c | 109 +
>  drivers/nvdimm/nd.h|   5 ++
>  include/uapi/linux/ndctl.h |  27 +
>  4 files changed, 147 insertions(+)
>
> diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
> index 2304c6183822..d53564477437 100644
> --- a/drivers/nvdimm/bus.c
> +++ b/drivers/nvdimm/bus.c
> @@ -794,6 +794,12 @@ static const struct nd_cmd_desc __nd_cmd_dimm_descs[] = {
> .out_num = 1,
> .out_sizes = { UINT_MAX, },
> },
> +   [ND_CMD_GET_STAT] = {
> +   .in_num = 1,
> +   .in_sizes = { sizeof(struct nd_cmd_get_dimm_stat), },
> +   .out_num = 1,
> +   .out_sizes = { sizeof(struct nd_cmd_get_dimm_stat), },
> +   },
>  };
>
>  const struct nd_cmd_desc *nd_cmd_dimm_desc(int cmd)
> diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
> index b59032e0859b..68aaa294def7 100644
> --- a/drivers/nvdimm/dimm_devs.c
> +++ b/drivers/nvdimm/dimm_devs.c
> @@ -555,6 +555,114 @@ static umode_t nvdimm_firmware_visible(struct kobject 
> *kobj, struct attribute *a
> return a->mode;
>  }
>
> +/* Request a dimm stat from the bus driver */
> +static int __request_dimm_stat(struct nvdimm_bus *nvdimm_bus,
> +  struct nvdimm *dimm, u64 stat_id,
> +  s64 *stat_val)
> +{
> +   struct nvdimm_bus_descriptor *nd_desc = nvdimm_bus->nd_desc;
> +   struct nd_cmd_get_dimm_stat stat = { .stat_id = stat_id };
> +   int rc, cmd_rc;
> +
> +   if (!test_bit(ND_CMD_GET_STAT, >cmd_mask)) {
> +   pr_debug("CMD_GET_STAT not set for bus driver 0x%lx\n",
> +nd_desc->cmd_mask);
> +   return -ENOENT;
> +   }
> +
> +   /* Is stat requested is known & bus driver supports fetching stats */
> +   if (stat_id <= ND_DIMM_STAT_INVALID || stat_id > ND_DIMM_STAT_MAX) {
> +   WARN(1, "Unknown stat-id(%llu) requested", stat_id);
> +   return -ENOENT;
> +   }
> +
> +   /* Ask bus driver for its stat value */
> +   rc = nd_desc->ndctl(nd_desc, dimm, ND_CMD_GET_STAT,
> +   , sizeof(stat), _rc);
> +   if (rc || cmd_rc) {
> +   pr_debug("Unable to request stat %lld. Error (%d,%d)\n",
> +stat_id, rc, cmd_rc);
> +   return rc ? rc : cmd_rc;
> +   }
> +
> +   /* Indicate error in 

[PATCH] powerpc: fix create_section_mapping compile warning

2020-11-16 Thread Dan Williams
0day robot reports that a recent rework of how
memory_add_physaddr_to_nid() and phys_to_target_node() are declared
resulted in the following new compilation warning:

arch/powerpc/mm/mem.c:91:12: warning: no previous prototype for 
'create_section_mapping' [-Wmissing-prototypes]
   91 | int __weak create_section_mapping(unsigned long start, unsigned long 
end,
  |^~

...fix this by moving the declaration of create_section_mapping()
outside of the CONFIG_NEED_MULTIPLE_NODES ifdef guard, and include an
explicit include of asm/mmzone.h in mem.c. An include of linux/mmzone.h
is not sufficient.

Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Andrew Morton 
Reported-by: kernel test robot 
Signed-off-by: Dan Williams 
---
 arch/powerpc/include/asm/mmzone.h |7 +--
 arch/powerpc/mm/mem.c |1 +
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/mmzone.h 
b/arch/powerpc/include/asm/mmzone.h
index 177fd18caf83..6cda76b57c5d 100644
--- a/arch/powerpc/include/asm/mmzone.h
+++ b/arch/powerpc/include/asm/mmzone.h
@@ -33,8 +33,6 @@ extern struct pglist_data *node_data[];
 extern int numa_cpu_lookup_table[];
 extern cpumask_var_t node_to_cpumask_map[];
 #ifdef CONFIG_MEMORY_HOTPLUG
-extern int create_section_mapping(unsigned long start, unsigned long end,
- int nid, pgprot_t prot);
 extern unsigned long max_pfn;
 u64 memory_hotplug_max(void);
 #else
@@ -48,5 +46,10 @@ u64 memory_hotplug_max(void);
 #define __HAVE_ARCH_RESERVED_KERNEL_PAGES
 #endif
 
+#ifdef CONFIG_MEMORY_HOTPLUG
+extern int create_section_mapping(unsigned long start, unsigned long end,
+ int nid, pgprot_t prot);
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_MMZONE_H_ */
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 01ec2a252f09..3fc325bebe4d 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 



Re: error: redefinition of ‘dax_supported’

2020-09-21 Thread Dan Williams
On Mon, Sep 21, 2020 at 11:35 AM Nick Desaulniers
 wrote:
>
> Hello DAX maintainers,
> I noticed our PPC64LE builds failing last night:
> https://travis-ci.com/github/ClangBuiltLinux/continuous-integration/jobs/388047043
> https://travis-ci.com/github/ClangBuiltLinux/continuous-integration/jobs/388047056
> https://travis-ci.com/github/ClangBuiltLinux/continuous-integration/jobs/388047099
> and looking on lore, I see a fresh report from KernelCI against arm:
> https://lore.kernel.org/linux-next/?q=dax_supported
>
> Can you all please take a look?  More concerning is that I see this
> failure on mainline.  It may be interesting to consider how this was
> not spotted on -next.

The failure is fixed with commit 88b67edd7247 ("dax: Fix compilation
for CONFIG_DAX && !CONFIG_FS_DAX"). I rushed the fixes that led to
this regression with insufficient exposure because it was crashing all
users. I thought the 2 kbuild-robot reports I squashed covered all the
config combinations, but there was a straggling report after I sent my
-rc6 pull request.

The baseline process escape for all of this was allowing a unit test
triggerable insta-crash upstream in the first instance necessitating
an urgent fix.


Re: [PATCH 12/20] Documentation: maintainer-entry-profile: eliminate duplicated word

2020-07-07 Thread Dan Williams
On Tue, Jul 7, 2020 at 11:07 AM Randy Dunlap  wrote:
>
> Drop the doubled word "have".
>
> Signed-off-by: Randy Dunlap 
> Cc: Jonathan Corbet 
> Cc: linux-...@vger.kernel.org
> Cc: Dan Williams 
> ---
>  Documentation/maintainer/maintainer-entry-profile.rst |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- 
> linux-next-20200701.orig/Documentation/maintainer/maintainer-entry-profile.rst
> +++ linux-next-20200701/Documentation/maintainer/maintainer-entry-profile.rst
> @@ -31,7 +31,7 @@ Example questions to consider:
>  - What branch should contributors submit against?
>  - Links to any other Maintainer Entry Profiles? For example a
>device-driver may point to an entry for its parent subsystem. This makes
> -  the contributor aware of obligations a maintainer may have have for
> +  the contributor aware of obligations a maintainer may have for
>other maintainers in the submission chain.

Acked-by Dan Williams 


Re: [PATCH 16/20] block: move ->make_request_fn to struct block_device_operations

2020-07-01 Thread Dan Williams
On Wed, Jul 1, 2020 at 2:01 AM Christoph Hellwig  wrote:
>
> The make_request_fn is a little weird in that it sits directly in
> struct request_queue instead of an operation vector.  Replace it with
> a block_device_operations method called submit_bio (which describes much
> better what it does).  Also remove the request_queue argument to it, as
> the queue can be derived pretty trivially from the bio.
>
> Signed-off-by: Christoph Hellwig 
> ---
[..]
>  drivers/nvdimm/blk.c  |  5 +-
>  drivers/nvdimm/btt.c  |  5 +-
>  drivers/nvdimm/pmem.c |  5 +-

For drivers/nvdimm

Acked-by: Dan Williams 


Re: [PATCH v6 6/8] powerpc/pmem: Avoid the barrier in flush routines

2020-06-30 Thread Dan Williams
On Tue, Jun 30, 2020 at 8:09 PM Aneesh Kumar K.V
 wrote:
>
> On 7/1/20 1:15 AM, Dan Williams wrote:
> > On Tue, Jun 30, 2020 at 2:21 AM Aneesh Kumar K.V
> >  wrote:
> > [..]
> >>>> The bio argument isn't for range based flushing, it is for flush
> >>>> operations that need to complete asynchronously.
> >>> How does the block layer determine that the pmem device needs
> >>> asynchronous fushing?
> >>>
> >>
> >>  set_bit(ND_REGION_ASYNC, _desc.flags);
> >>
> >> and dax_synchronous(dev)
> >
> > Yes, but I think it is overkill to have an indirect function call just
> > for a single instruction.
> >
> > How about something like this instead, to share a common pmem_wmb()
> > across x86 and powerpc.
> >
> > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> > index 20ff30c2ab93..b14009060c83 100644
> > --- a/drivers/nvdimm/region_devs.c
> > +++ b/drivers/nvdimm/region_devs.c
> > @@ -1180,6 +1180,13 @@ int nvdimm_flush(struct nd_region *nd_region,
> > struct bio *bio)
> >   {
> >  int rc = 0;
> >
> > +   /*
> > +* pmem_wmb() is needed to 'sfence' all previous writes such
> > +* that they are architecturally visible for the platform buffer
> > +* flush.
> > +*/
> > +   pmem_wmb();
> > +
> >  if (!nd_region->flush)
> >  rc = generic_nvdimm_flush(nd_region);
> >  else {
> > @@ -1206,17 +1213,14 @@ int generic_nvdimm_flush(struct nd_region 
> > *nd_region)
> >  idx = this_cpu_add_return(flush_idx, hash_32(current->pid + idx, 
> > 8));
> >
> >  /*
> > -* The first wmb() is needed to 'sfence' all previous writes
> > -* such that they are architecturally visible for the platform
> > -* buffer flush.  Note that we've already arranged for pmem
> > -* writes to avoid the cache via memcpy_flushcache().  The final
> > -* wmb() ensures ordering for the NVDIMM flush write.
> > +* Note that we've already arranged for pmem writes to avoid the
> > +* cache via memcpy_flushcache().  The final wmb() ensures
> > +* ordering for the NVDIMM flush write.
> >   */
> > -   wmb();
>
>
> The series already convert this to pmem_wmb().
>
> >  for (i = 0; i < nd_region->ndr_mappings; i++)
> >  if (ndrd_get_flush_wpq(ndrd, i, 0))
> >  writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> > -   wmb();
> > +   pmem_wmb();
>
>
> Should this be pmem_wmb()? This is ordering the above writeq() right?

Correct, this can just be wmb().

>
> >
> >  return 0;
> >   }
> >
>
> This still results in two pmem_wmb() on platforms that doesn't have
> flush_wpq. I was trying to avoid that by adding a nd_region->flush call
> back.

How about skip or exit early out of generic_nvdimm_flush if
ndrd->flush_wpq is NULL? That still saves an indirect branch at the
cost of another conditional, but that should still be worth it.


Re: [PATCH v6 6/8] powerpc/pmem: Avoid the barrier in flush routines

2020-06-30 Thread Dan Williams
On Tue, Jun 30, 2020 at 2:21 AM Aneesh Kumar K.V
 wrote:
[..]
> >> The bio argument isn't for range based flushing, it is for flush
> >> operations that need to complete asynchronously.
> > How does the block layer determine that the pmem device needs
> > asynchronous fushing?
> >
>
> set_bit(ND_REGION_ASYNC, _desc.flags);
>
> and dax_synchronous(dev)

Yes, but I think it is overkill to have an indirect function call just
for a single instruction.

How about something like this instead, to share a common pmem_wmb()
across x86 and powerpc.

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 20ff30c2ab93..b14009060c83 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1180,6 +1180,13 @@ int nvdimm_flush(struct nd_region *nd_region,
struct bio *bio)
 {
int rc = 0;

+   /*
+* pmem_wmb() is needed to 'sfence' all previous writes such
+* that they are architecturally visible for the platform buffer
+* flush.
+*/
+   pmem_wmb();
+
if (!nd_region->flush)
rc = generic_nvdimm_flush(nd_region);
else {
@@ -1206,17 +1213,14 @@ int generic_nvdimm_flush(struct nd_region *nd_region)
idx = this_cpu_add_return(flush_idx, hash_32(current->pid + idx, 8));

/*
-* The first wmb() is needed to 'sfence' all previous writes
-* such that they are architecturally visible for the platform
-* buffer flush.  Note that we've already arranged for pmem
-* writes to avoid the cache via memcpy_flushcache().  The final
-* wmb() ensures ordering for the NVDIMM flush write.
+* Note that we've already arranged for pmem writes to avoid the
+* cache via memcpy_flushcache().  The final wmb() ensures
+* ordering for the NVDIMM flush write.
 */
-   wmb();
for (i = 0; i < nd_region->ndr_mappings; i++)
if (ndrd_get_flush_wpq(ndrd, i, 0))
writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
-   wmb();
+   pmem_wmb();

return 0;
 }


Re: [PATCH updated] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-06-30 Thread Dan Williams
On Tue, Jun 30, 2020 at 5:48 AM Aneesh Kumar K.V
 wrote:
>
>
> Update patch.
>
> From 1e6aa6c4182e14ec5d6bf878ae44c3f69ebff745 Mon Sep 17 00:00:00 2001
> From: "Aneesh Kumar K.V" 
> Date: Tue, 12 May 2020 20:58:33 +0530
> Subject: [PATCH] libnvdimm/nvdimm/flush: Allow architecture to override the
>  flush barrier
>
> Architectures like ppc64 provide persistent memory specific barriers
> that will ensure that all stores for which the modifications are
> written to persistent storage by preceding dcbfps and dcbstps
> instructions have updated persistent storage before any data
> access or data transfer caused by subsequent instructions is initiated.
> This is in addition to the ordering done by wmb()
>
> Update nvdimm core such that architecture can use barriers other than
> wmb to ensure all previous writes are architecturally visible for
> the platform buffer flush.

Looks good, after a few minor fixups below you can add:

Reviewed-by: Dan Williams 

I'm expecting that these will be merged through the powerpc tree since
they mostly impact powerpc with only minor touches to libnvdimm.

> Signed-off-by: Aneesh Kumar K.V 
> ---
>  Documentation/memory-barriers.txt | 14 ++
>  drivers/md/dm-writecache.c|  2 +-
>  drivers/nvdimm/region_devs.c  |  8 
>  include/asm-generic/barrier.h | 10 ++
>  4 files changed, 29 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/memory-barriers.txt 
> b/Documentation/memory-barriers.txt
> index eaabc3134294..340273a6b18e 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1935,6 +1935,20 @@ There are some more advanced barrier functions:
>   relaxed I/O accessors and the Documentation/DMA-API.txt file for more
>   information on consistent memory.
>
> + (*) pmem_wmb();
> +
> + This is for use with persistent memory to ensure that stores for which
> + modifications are written to persistent storage have updated the 
> persistent
> + storage.

I think this should be:

s/updated the persistent storage/reached a platform durability domain/

> +
> + For example, after a non-temporal write to pmem region, we use 
> pmem_wmb()
> + to ensures that stores have updated the persistent storage. This ensures

s/ensures/ensure/

...and the same comment about "persistent storage" because pmem_wmb()
as implemented on x86 does not guarantee that the writes have reached
storage it ensures that writes have reached buffers / queues that are
within the ADR (platform persistence / durability) domain.

> + that stores have updated persistent storage before any data access or
> + data transfer caused by subsequent instructions is initiated. This is
> + in addition to the ordering done by wmb().
> +
> + For load from persistent memory, existing read memory barriers are 
> sufficient
> + to ensure read ordering.
>
>  ===
>  IMPLICIT KERNEL MEMORY BARRIERS
> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> index 74f3c506f084..00534fa4a384 100644
> --- a/drivers/md/dm-writecache.c
> +++ b/drivers/md/dm-writecache.c
> @@ -536,7 +536,7 @@ static void ssd_commit_superblock(struct dm_writecache 
> *wc)
>  static void writecache_commit_flushed(struct dm_writecache *wc, bool 
> wait_for_ios)
>  {
> if (WC_MODE_PMEM(wc))
> -   wmb();
> +   pmem_wmb();
> else
> ssd_commit_flushed(wc, wait_for_ios);
>  }
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index 4502f9c4708d..2333b290bdcf 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -1206,13 +1206,13 @@ int generic_nvdimm_flush(struct nd_region *nd_region)
> idx = this_cpu_add_return(flush_idx, hash_32(current->pid + idx, 8));
>
> /*
> -* The first wmb() is needed to 'sfence' all previous writes
> -* such that they are architecturally visible for the platform
> -* buffer flush.  Note that we've already arranged for pmem
> +* The first arch_pmem_flush_barrier() is needed to 'sfence' all

One missed arch_pmem_flush_barrier() rename.

> +* previous writes such that they are architecturally visible for
> +* the platform buffer flush. Note that we've already arranged for 
> pmem
>  * writes to avoid the cache via memcpy_flushcache().  The final
>  * wmb() ensures ordering for the NVDIMM flush write.
>  */
> -   wmb();
> +   pmem_wmb();
> for (i = 0; i < nd_region->ndr_mappings; i++)
> if (ndrd_get_flu

Re: [PATCH v6 5/8] powerpc/pmem/of_pmem: Update of_pmem to use the new barrier instruction.

2020-06-30 Thread Dan Williams
On Mon, Jun 29, 2020 at 10:05 PM Aneesh Kumar K.V
 wrote:
>
> Dan Williams  writes:
>
> > On Mon, Jun 29, 2020 at 6:58 AM Aneesh Kumar K.V
> >  wrote:
> >>
> >> of_pmem on POWER10 can now use phwsync instead of hwsync to ensure
> >> all previous writes are architecturally visible for the platform
> >> buffer flush.
> >>
> >> Signed-off-by: Aneesh Kumar K.V 
> >> ---
> >>  arch/powerpc/include/asm/cacheflush.h | 7 +++
> >>  1 file changed, 7 insertions(+)
> >>
> >> diff --git a/arch/powerpc/include/asm/cacheflush.h 
> >> b/arch/powerpc/include/asm/cacheflush.h
> >> index 54764c6e922d..95782f77d768 100644
> >> --- a/arch/powerpc/include/asm/cacheflush.h
> >> +++ b/arch/powerpc/include/asm/cacheflush.h
> >> @@ -98,6 +98,13 @@ static inline void invalidate_dcache_range(unsigned 
> >> long start,
> >> mb();   /* sync */
> >>  }
> >>
> >> +#define arch_pmem_flush_barrier arch_pmem_flush_barrier
> >> +static inline void  arch_pmem_flush_barrier(void)
> >> +{
> >> +   if (cpu_has_feature(CPU_FTR_ARCH_207S))
> >> +   asm volatile(PPC_PHWSYNC ::: "memory");
> >
> > Shouldn't this fallback to a compatible store-fence in an else statement?
>
> The idea was to avoid calling this on anything else. We ensure that by
> making sure that pmem devices are not initialized on systems without that
> cpu feature. Patch 1 does that. Also, the last patch adds a WARN_ON() to
> catch the usage of this outside pmem devices and on systems without that
> cpu feature.

If patch1 handles this why re-check the cpu-feature in this helper? If
the intent is for these routines to be generic why not have them fall
back to the P8 barrier instructions for example like x86 clwb(). Any
kernel code can call it, and it falls back to a compatible clflush()
call on older cpus. I otherwise don't get the point of patch7.


Re: [PATCH updated] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-06-30 Thread Dan Williams
On Mon, Jun 29, 2020 at 10:02 PM Aneesh Kumar K.V
 wrote:
>
> Dan Williams  writes:
>
> > On Mon, Jun 29, 2020 at 1:29 PM Aneesh Kumar K.V
> >  wrote:
> >>
> >> Architectures like ppc64 provide persistent memory specific barriers
> >> that will ensure that all stores for which the modifications are
> >> written to persistent storage by preceding dcbfps and dcbstps
> >> instructions have updated persistent storage before any data
> >> access or data transfer caused by subsequent instructions is initiated.
> >> This is in addition to the ordering done by wmb()
> >>
> >> Update nvdimm core such that architecture can use barriers other than
> >> wmb to ensure all previous writes are architecturally visible for
> >> the platform buffer flush.
> >>
> >> Signed-off-by: Aneesh Kumar K.V 
> >> ---
> >>  drivers/md/dm-writecache.c   | 2 +-
> >>  drivers/nvdimm/region_devs.c | 8 
> >>  include/linux/libnvdimm.h| 4 
> >>  3 files changed, 9 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> >> index 74f3c506f084..8c6b6dce64e2 100644
> >> --- a/drivers/md/dm-writecache.c
> >> +++ b/drivers/md/dm-writecache.c
> >> @@ -536,7 +536,7 @@ static void ssd_commit_superblock(struct dm_writecache 
> >> *wc)
> >>  static void writecache_commit_flushed(struct dm_writecache *wc, bool 
> >> wait_for_ios)
> >>  {
> >> if (WC_MODE_PMEM(wc))
> >> -   wmb();
> >> +   arch_pmem_flush_barrier();
> >> else
> >> ssd_commit_flushed(wc, wait_for_ios);
> >>  }
> >> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> >> index 4502f9c4708d..b308ad09b63d 100644
> >> --- a/drivers/nvdimm/region_devs.c
> >> +++ b/drivers/nvdimm/region_devs.c
> >> @@ -1206,13 +1206,13 @@ int generic_nvdimm_flush(struct nd_region 
> >> *nd_region)
> >> idx = this_cpu_add_return(flush_idx, hash_32(current->pid + idx, 
> >> 8));
> >>
> >> /*
> >> -* The first wmb() is needed to 'sfence' all previous writes
> >> -* such that they are architecturally visible for the platform
> >> -* buffer flush.  Note that we've already arranged for pmem
> >> +* The first arch_pmem_flush_barrier() is needed to 'sfence' all
> >> +* previous writes such that they are architecturally visible for
> >> +* the platform buffer flush. Note that we've already arranged for 
> >> pmem
> >>  * writes to avoid the cache via memcpy_flushcache().  The final
> >>  * wmb() ensures ordering for the NVDIMM flush write.
> >>  */
> >> -   wmb();
> >> +   arch_pmem_flush_barrier();
> >> for (i = 0; i < nd_region->ndr_mappings; i++)
> >> if (ndrd_get_flush_wpq(ndrd, i, 0))
> >> writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> >> diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
> >> index 18da4059be09..66f6c65bd789 100644
> >> --- a/include/linux/libnvdimm.h
> >> +++ b/include/linux/libnvdimm.h
> >> @@ -286,4 +286,8 @@ static inline void arch_invalidate_pmem(void *addr, 
> >> size_t size)
> >>  }
> >>  #endif
> >>
> >> +#ifndef arch_pmem_flush_barrier
> >> +#define arch_pmem_flush_barrier() wmb()
> >> +#endif
> >
> > I think it is out of place to define this in libnvdimm.h and it is odd
> > to give it such a long name. The other pmem api helpers like
> > arch_wb_cache_pmem() and arch_invalidate_pmem() are function calls for
> > libnvdimm driver operations, this barrier is just an instruction and
> > is closer to wmb() than the pmem api routine.
> >
> > Since it is a store fence for pmem, so let's just call it pmem_wmb()
> > and define the generic version in include/linux/compiler.h. It should
> > probably also be documented alongside dma_wmb() in
> > Documentation/memory-barriers.txt about why code would use it over
> > wmb(), and why a symmetric pmem_rmb() is not needed.
>
> How about the below? I used pmem_barrier() instead of pmem_wmb().

Why? A barrier() is a bi-directional ordering mechanic for reads and
writes, and the proposed semantics mechanism only orders writes +
persistence. Otherwise the default fallback to wmb() on archs that
don't override it does not m

Re: [PATCH v6 7/8] powerpc/pmem: Add WARN_ONCE to catch the wrong usage of pmem flush functions.

2020-06-29 Thread Dan Williams
On Mon, Jun 29, 2020 at 6:58 AM Aneesh Kumar K.V
 wrote:
>
> We only support persistent memory on P8 and above. This is enforced by the
> firmware and further checked on virtualzied platform during platform init.
> Add WARN_ONCE in pmem flush routines to catch the wrong usage of these.
>
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/cacheflush.h | 2 ++
>  arch/powerpc/lib/pmem.c   | 2 ++
>  2 files changed, 4 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/cacheflush.h 
> b/arch/powerpc/include/asm/cacheflush.h
> index 95782f77d768..1ab0fa660497 100644
> --- a/arch/powerpc/include/asm/cacheflush.h
> +++ b/arch/powerpc/include/asm/cacheflush.h
> @@ -103,6 +103,8 @@ static inline void  arch_pmem_flush_barrier(void)
>  {
> if (cpu_has_feature(CPU_FTR_ARCH_207S))
> asm volatile(PPC_PHWSYNC ::: "memory");
> +   else
> +   WARN_ONCE(1, "Using pmem flush on older hardware.");

This seems too late to be making this determination. I'd expect the
driver to fail to successfully bind default if this constraint is not
met.


Re: [PATCH v6 6/8] powerpc/pmem: Avoid the barrier in flush routines

2020-06-29 Thread Dan Williams
On Mon, Jun 29, 2020 at 1:41 PM Aneesh Kumar K.V
 wrote:
>
> Michal Suchánek  writes:
>
> > Hello,
> >
> > On Mon, Jun 29, 2020 at 07:27:20PM +0530, Aneesh Kumar K.V wrote:
> >> nvdimm expect the flush routines to just mark the cache clean. The barrier
> >> that mark the store globally visible is done in nvdimm_flush().
> >>
> >> Update the papr_scm driver to a simplified nvdim_flush callback that do
> >> only the required barrier.
> >>
> >> Signed-off-by: Aneesh Kumar K.V 
> >> ---
> >>  arch/powerpc/lib/pmem.c   |  6 --
> >>  arch/powerpc/platforms/pseries/papr_scm.c | 13 +
> >>  2 files changed, 13 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c
> >> index 5a61aaeb6930..21210fa676e5 100644
> >> --- a/arch/powerpc/lib/pmem.c
> >> +++ b/arch/powerpc/lib/pmem.c
> >> @@ -19,9 +19,6 @@ static inline void __clean_pmem_range(unsigned long 
> >> start, unsigned long stop)
> >>
> >>  for (i = 0; i < size >> shift; i++, addr += bytes)
> >>  asm volatile(PPC_DCBSTPS(%0, %1): :"i"(0), "r"(addr): 
> >> "memory");
> >> -
> >> -
> >> -asm volatile(PPC_PHWSYNC ::: "memory");
> >>  }
> >>
> >>  static inline void __flush_pmem_range(unsigned long start, unsigned long 
> >> stop)
> >> @@ -34,9 +31,6 @@ static inline void __flush_pmem_range(unsigned long 
> >> start, unsigned long stop)
> >>
> >>  for (i = 0; i < size >> shift; i++, addr += bytes)
> >>  asm volatile(PPC_DCBFPS(%0, %1): :"i"(0), "r"(addr): 
> >> "memory");
> >> -
> >> -
> >> -asm volatile(PPC_PHWSYNC ::: "memory");
> >>  }
> >>
> >>  static inline void clean_pmem_range(unsigned long start, unsigned long 
> >> stop)
> >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> >> b/arch/powerpc/platforms/pseries/papr_scm.c
> >> index 9c569078a09f..9a9a0766f8b6 100644
> >> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> >> @@ -630,6 +630,18 @@ static int papr_scm_ndctl(struct 
> >> nvdimm_bus_descriptor *nd_desc,
> >>
> >>  return 0;
> >>  }
> >> +/*
> >> + * We have made sure the pmem writes are done such that before calling 
> >> this
> >> + * all the caches are flushed/clean. We use dcbf/dcbfps to ensure this. 
> >> Here
> >> + * we just need to add the necessary barrier to make sure the above 
> >> flushes
> >> + * are have updated persistent storage before any data access or data 
> >> transfer
> >> + * caused by subsequent instructions is initiated.
> >> + */
> >> +static int papr_scm_flush_sync(struct nd_region *nd_region, struct bio 
> >> *bio)
> >> +{
> >> +arch_pmem_flush_barrier();
> >> +return 0;
> >> +}
> >>
> >>  static ssize_t flags_show(struct device *dev,
> >>struct device_attribute *attr, char *buf)
> >> @@ -743,6 +755,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv 
> >> *p)
> >>  ndr_desc.mapping = 
> >>  ndr_desc.num_mappings = 1;
> >>  ndr_desc.nd_set = >nd_set;
> >> +ndr_desc.flush = papr_scm_flush_sync;
> >
> > AFAICT currently the only device that implements flush is virtio_pmem.
> > How does the nfit driver get away without implementing flush?
>
> generic_nvdimm_flush does the required barrier for nfit. The reason for
> adding ndr_desc.flush call back for papr_scm was to avoid the usage
> of iomem based deep flushing (ndr_region_data.flush_wpq) which is not
> supported by papr_scm.
>
> BTW we do return NULL for ndrd_get_flush_wpq() on power. So the upstream
> code also does the same thing, but in a different way.
>
>
> > Also the flush takes arguments that are completely unused but a user of
> > the pmem region must assume they are used, and call flush() on the
> > region rather than arch_pmem_flush_barrier() directly.
>
> The bio argument can help a pmem driver to do range based flushing in
> case of pmem_make_request. If bio is null then we must assume a full
> device flush.

The bio argument isn't for range based flushing, it is for flush
operations that need to complete asynchronously.

There's no mechanism for the block layer to communicate range based
cache flushing, block-device flushing is assumed to be the device's
entire cache. For pmem that would be the entirety of the cpu cache.
Instead of modeling the cpu cache as a storage device cache it is
modeled as page-cache. Once the fs-layer writes back page-cache /
cpu-cache the storage device is only responsible for flushing those
cache-writes into the persistence domain.

Additionally there is a concept of deep-flush that relegates some
power-fail scenarios to a smaller failure domain. For example consider
the difference between a write arriving at the head of a device-queue
and successfully traversing a device-queue to media. The expectation
of pmem applications is that data is persisted once they reach the
equivalent of the x86 ADR domain, deep-flush is past ADR.


Re: [PATCH v6 5/8] powerpc/pmem/of_pmem: Update of_pmem to use the new barrier instruction.

2020-06-29 Thread Dan Williams
On Mon, Jun 29, 2020 at 6:58 AM Aneesh Kumar K.V
 wrote:
>
> of_pmem on POWER10 can now use phwsync instead of hwsync to ensure
> all previous writes are architecturally visible for the platform
> buffer flush.
>
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/cacheflush.h | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/cacheflush.h 
> b/arch/powerpc/include/asm/cacheflush.h
> index 54764c6e922d..95782f77d768 100644
> --- a/arch/powerpc/include/asm/cacheflush.h
> +++ b/arch/powerpc/include/asm/cacheflush.h
> @@ -98,6 +98,13 @@ static inline void invalidate_dcache_range(unsigned long 
> start,
> mb();   /* sync */
>  }
>
> +#define arch_pmem_flush_barrier arch_pmem_flush_barrier
> +static inline void  arch_pmem_flush_barrier(void)
> +{
> +   if (cpu_has_feature(CPU_FTR_ARCH_207S))
> +   asm volatile(PPC_PHWSYNC ::: "memory");

Shouldn't this fallback to a compatible store-fence in an else statement?


Re: [PATCH updated] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-06-29 Thread Dan Williams
On Mon, Jun 29, 2020 at 1:29 PM Aneesh Kumar K.V
 wrote:
>
> Architectures like ppc64 provide persistent memory specific barriers
> that will ensure that all stores for which the modifications are
> written to persistent storage by preceding dcbfps and dcbstps
> instructions have updated persistent storage before any data
> access or data transfer caused by subsequent instructions is initiated.
> This is in addition to the ordering done by wmb()
>
> Update nvdimm core such that architecture can use barriers other than
> wmb to ensure all previous writes are architecturally visible for
> the platform buffer flush.
>
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  drivers/md/dm-writecache.c   | 2 +-
>  drivers/nvdimm/region_devs.c | 8 
>  include/linux/libnvdimm.h| 4 
>  3 files changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> index 74f3c506f084..8c6b6dce64e2 100644
> --- a/drivers/md/dm-writecache.c
> +++ b/drivers/md/dm-writecache.c
> @@ -536,7 +536,7 @@ static void ssd_commit_superblock(struct dm_writecache 
> *wc)
>  static void writecache_commit_flushed(struct dm_writecache *wc, bool 
> wait_for_ios)
>  {
> if (WC_MODE_PMEM(wc))
> -   wmb();
> +   arch_pmem_flush_barrier();
> else
> ssd_commit_flushed(wc, wait_for_ios);
>  }
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index 4502f9c4708d..b308ad09b63d 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -1206,13 +1206,13 @@ int generic_nvdimm_flush(struct nd_region *nd_region)
> idx = this_cpu_add_return(flush_idx, hash_32(current->pid + idx, 8));
>
> /*
> -* The first wmb() is needed to 'sfence' all previous writes
> -* such that they are architecturally visible for the platform
> -* buffer flush.  Note that we've already arranged for pmem
> +* The first arch_pmem_flush_barrier() is needed to 'sfence' all
> +* previous writes such that they are architecturally visible for
> +* the platform buffer flush. Note that we've already arranged for 
> pmem
>  * writes to avoid the cache via memcpy_flushcache().  The final
>  * wmb() ensures ordering for the NVDIMM flush write.
>  */
> -   wmb();
> +   arch_pmem_flush_barrier();
> for (i = 0; i < nd_region->ndr_mappings; i++)
> if (ndrd_get_flush_wpq(ndrd, i, 0))
> writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
> index 18da4059be09..66f6c65bd789 100644
> --- a/include/linux/libnvdimm.h
> +++ b/include/linux/libnvdimm.h
> @@ -286,4 +286,8 @@ static inline void arch_invalidate_pmem(void *addr, 
> size_t size)
>  }
>  #endif
>
> +#ifndef arch_pmem_flush_barrier
> +#define arch_pmem_flush_barrier() wmb()
> +#endif

I think it is out of place to define this in libnvdimm.h and it is odd
to give it such a long name. The other pmem api helpers like
arch_wb_cache_pmem() and arch_invalidate_pmem() are function calls for
libnvdimm driver operations, this barrier is just an instruction and
is closer to wmb() than the pmem api routine.

Since it is a store fence for pmem, so let's just call it pmem_wmb()
and define the generic version in include/linux/compiler.h. It should
probably also be documented alongside dma_wmb() in
Documentation/memory-barriers.txt about why code would use it over
wmb(), and why a symmetric pmem_rmb() is not needed.


Re: [PATCH v13 2/6] seq_buf: Export seq_buf_printf

2020-06-15 Thread Dan Williams
On Mon, Jun 15, 2020 at 5:56 AM Borislav Petkov  wrote:
>
> On Mon, Jun 15, 2020 at 06:14:03PM +0530, Vaibhav Jain wrote:
> > 'seq_buf' provides a very useful abstraction for writing to a string
> > buffer without needing to worry about it over-flowing. However even
> > though the API has been stable for couple of years now its still not
> > exported to kernel loadable modules limiting its usage.
> >
> > Hence this patch proposes update to 'seq_buf.c' to mark
> > seq_buf_printf() which is part of the seq_buf API to be exported to
> > kernel loadable GPL modules. This symbol will be used in later parts
> > of this patch-set to simplify content creation for a sysfs attribute.
> >
> > Cc: Piotr Maziarz 
> > Cc: Cezary Rojewski 
> > Cc: Christoph Hellwig 
> > Cc: Steven Rostedt 
> > Cc: Borislav Petkov 
> > Acked-by: Steven Rostedt (VMware) 
> > Signed-off-by: Vaibhav Jain 
> > ---
> > Changelog:
> >
> > v12..v13:
> > * None
> >
> > v11..v12:
> > * None
>
> Can you please resend your patchset once a week like everyone else and
> not flood inboxes with it?

Hi Boris,

I gave Vaibhav some long shot hope that his series could be included
in my libnvdimm pull request for -rc1. Save for a last minute clang
report that I misread as a gcc warning, I likely would have included.
This spin is looking to address the last of the comments I had and
something I would consider for -rc2. So, in this case the resends were
requested by me and I'll take the grumbles on Vaibhav's behalf.


Re: [PATCH v11 5/6] ndctl/papr_scm, uapi: Add support for PAPR nvdimm specific methods

2020-06-10 Thread Dan Williams
On Wed, Jun 10, 2020 at 5:10 AM Vaibhav Jain  wrote:
>
> Dan Williams  writes:
>
> > On Tue, Jun 9, 2020 at 10:54 AM Vaibhav Jain  wrote:
> >>
> >> Thanks Dan for the consideration and taking time to look into this.
> >>
> >> My responses below:
> >>
> >> Dan Williams  writes:
> >>
> >> > On Mon, Jun 8, 2020 at 5:16 PM kernel test robot  wrote:
> >> >>
> >> >> Hi Vaibhav,
> >> >>
> >> >> Thank you for the patch! Perhaps something to improve:
> >> >>
> >> >> [auto build test WARNING on powerpc/next]
> >> >> [also build test WARNING on linus/master v5.7 next-20200605]
> >> >> [cannot apply to linux-nvdimm/libnvdimm-for-next scottwood/next]
> >> >> [if your patch is applied to the wrong git tree, please drop us a note 
> >> >> to help
> >> >> improve the system. BTW, we also suggest to use '--base' option to 
> >> >> specify the
> >> >> base tree in git format-patch, please see 
> >> >> https://stackoverflow.com/a/37406982]
> >> >>
> >> >> url:
> >> >> https://github.com/0day-ci/linux/commits/Vaibhav-Jain/powerpc-papr_scm-Add-support-for-reporting-nvdimm-health/20200607-211653
> >> >> base:   
> >> >> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
> >> >> config: powerpc-randconfig-r016-20200607 (attached as .config)
> >> >> compiler: clang version 11.0.0 (https://github.com/llvm/llvm-project 
> >> >> e429cffd4f228f70c1d9df0e5d77c08590dd9766)
> >> >> reproduce (this is a W=1 build):
> >> >> wget 
> >> >> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross
> >> >>  -O ~/bin/make.cross
> >> >> chmod +x ~/bin/make.cross
> >> >> # install powerpc cross compiling tool for clang build
> >> >> # apt-get install binutils-powerpc-linux-gnu
> >> >> # save the attached .config to linux build tree
> >> >> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross 
> >> >> ARCH=powerpc
> >> >>
> >> >> If you fix the issue, kindly add following tag as appropriate
> >> >> Reported-by: kernel test robot 
> >> >>
> >> >> All warnings (new ones prefixed by >>, old ones prefixed by <<):
> >> >>
> >> >> In file included from :1:
> >> >> >> ./usr/include/asm/papr_pdsm.h:69:20: warning: field 'hdr' with 
> >> >> >> variable sized type 'struct nd_cmd_pkg' not at the end of a struct 
> >> >> >> or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
> >> >> struct nd_cmd_pkg hdr;  /* Package header containing sub-cmd */
> >> >
> >> > Hi Vaibhav,
> >> >
> >> [.]
> >> > This looks like it's going to need another round to get this fixed. I
> >> > don't think 'struct nd_pdsm_cmd_pkg' should embed a definition of
> >> > 'struct nd_cmd_pkg'. An instance of 'struct nd_cmd_pkg' carries a
> >> > payload that is the 'pdsm' specifics. As the code has it now it's
> >> > defined as a superset of 'struct nd_cmd_pkg' and the compiler warning
> >> > is pointing out a real 'struct' organization problem.
> >> >
> >> > Given the soak time needed in -next after the code is finalized this
> >> > there's no time to do another round of updates and still make the v5.8
> >> > merge window.
> >>
> >> Agreed that this looks bad, a solution will probably need some more
> >> review cycles resulting in this series missing the merge window.
> >>
> >> I am investigating into the possible solutions for this reported issue
> >> and made few observations:
> >>
> >> I see command pkg for Intel, Hpe, Msft and Hyperv families using a
> >> similar layout of embedding nd_cmd_pkg at the head of the
> >> command-pkg. struct nd_pdsm_cmd_pkg is following the same pattern.
> >>
> >> struct nd_pdsm_cmd_pkg {
> >> struct nd_cmd_pkg hdr;
> >> /* other members */
> >> };
> >>
> >> struct ndn_pkg_msft {
> >> struct nd_cmd_pkg gen;
> >> /* other members */
> >> };
> >> struct nd_pkg_intel {
> >> struct nd_cmd_pkg gen;
> >> /

Re: [PATCH v11 5/6] ndctl/papr_scm, uapi: Add support for PAPR nvdimm specific methods

2020-06-09 Thread Dan Williams
On Tue, Jun 9, 2020 at 10:54 AM Vaibhav Jain  wrote:
>
> Thanks Dan for the consideration and taking time to look into this.
>
> My responses below:
>
> Dan Williams  writes:
>
> > On Mon, Jun 8, 2020 at 5:16 PM kernel test robot  wrote:
> >>
> >> Hi Vaibhav,
> >>
> >> Thank you for the patch! Perhaps something to improve:
> >>
> >> [auto build test WARNING on powerpc/next]
> >> [also build test WARNING on linus/master v5.7 next-20200605]
> >> [cannot apply to linux-nvdimm/libnvdimm-for-next scottwood/next]
> >> [if your patch is applied to the wrong git tree, please drop us a note to 
> >> help
> >> improve the system. BTW, we also suggest to use '--base' option to specify 
> >> the
> >> base tree in git format-patch, please see 
> >> https://stackoverflow.com/a/37406982]
> >>
> >> url:
> >> https://github.com/0day-ci/linux/commits/Vaibhav-Jain/powerpc-papr_scm-Add-support-for-reporting-nvdimm-health/20200607-211653
> >> base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
> >> next
> >> config: powerpc-randconfig-r016-20200607 (attached as .config)
> >> compiler: clang version 11.0.0 (https://github.com/llvm/llvm-project 
> >> e429cffd4f228f70c1d9df0e5d77c08590dd9766)
> >> reproduce (this is a W=1 build):
> >> wget 
> >> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross 
> >> -O ~/bin/make.cross
> >> chmod +x ~/bin/make.cross
> >> # install powerpc cross compiling tool for clang build
> >> # apt-get install binutils-powerpc-linux-gnu
> >> # save the attached .config to linux build tree
> >> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross 
> >> ARCH=powerpc
> >>
> >> If you fix the issue, kindly add following tag as appropriate
> >> Reported-by: kernel test robot 
> >>
> >> All warnings (new ones prefixed by >>, old ones prefixed by <<):
> >>
> >> In file included from :1:
> >> >> ./usr/include/asm/papr_pdsm.h:69:20: warning: field 'hdr' with variable 
> >> >> sized type 'struct nd_cmd_pkg' not at the end of a struct or class is a 
> >> >> GNU extension [-Wgnu-variable-sized-type-not-at-end]
> >> struct nd_cmd_pkg hdr;  /* Package header containing sub-cmd */
> >
> > Hi Vaibhav,
> >
> [.]
> > This looks like it's going to need another round to get this fixed. I
> > don't think 'struct nd_pdsm_cmd_pkg' should embed a definition of
> > 'struct nd_cmd_pkg'. An instance of 'struct nd_cmd_pkg' carries a
> > payload that is the 'pdsm' specifics. As the code has it now it's
> > defined as a superset of 'struct nd_cmd_pkg' and the compiler warning
> > is pointing out a real 'struct' organization problem.
> >
> > Given the soak time needed in -next after the code is finalized this
> > there's no time to do another round of updates and still make the v5.8
> > merge window.
>
> Agreed that this looks bad, a solution will probably need some more
> review cycles resulting in this series missing the merge window.
>
> I am investigating into the possible solutions for this reported issue
> and made few observations:
>
> I see command pkg for Intel, Hpe, Msft and Hyperv families using a
> similar layout of embedding nd_cmd_pkg at the head of the
> command-pkg. struct nd_pdsm_cmd_pkg is following the same pattern.
>
> struct nd_pdsm_cmd_pkg {
> struct nd_cmd_pkg hdr;
> /* other members */
> };
>
> struct ndn_pkg_msft {
> struct nd_cmd_pkg gen;
> /* other members */
> };
> struct nd_pkg_intel {
> struct nd_cmd_pkg gen;
> /* other members */
> };
> struct ndn_pkg_hpe1 {
> struct nd_cmd_pkg gen;
> /* other members */

In those cases the other members are a union and there is no second
variable length array. Perhaps that is why those definitions are not
getting flagged? I'm not seeing anything in ndctl build options that
would explicitly disable this warning, but I'm not sure if the ndctl
build environment is missing this build warning by accident.

Those variable size payloads are also not being used in any code paths
that would look at the size of the command payload, like the kernel
ioctl() path. The payload validation code needs static sizes and the
payload parsing code wants to cast the payload to a known type. I
don't think you can use the same struct definition for both those
cases which is why the ndctl parsing code uses the union layout, but
the kernel command marsha

Re: [PATCH v11 5/6] ndctl/papr_scm, uapi: Add support for PAPR nvdimm specific methods

2020-06-08 Thread Dan Williams
On Mon, Jun 8, 2020 at 5:16 PM kernel test robot  wrote:
>
> Hi Vaibhav,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on powerpc/next]
> [also build test WARNING on linus/master v5.7 next-20200605]
> [cannot apply to linux-nvdimm/libnvdimm-for-next scottwood/next]
> [if your patch is applied to the wrong git tree, please drop us a note to help
> improve the system. BTW, we also suggest to use '--base' option to specify the
> base tree in git format-patch, please see 
> https://stackoverflow.com/a/37406982]
>
> url:
> https://github.com/0day-ci/linux/commits/Vaibhav-Jain/powerpc-papr_scm-Add-support-for-reporting-nvdimm-health/20200607-211653
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
> config: powerpc-randconfig-r016-20200607 (attached as .config)
> compiler: clang version 11.0.0 (https://github.com/llvm/llvm-project 
> e429cffd4f228f70c1d9df0e5d77c08590dd9766)
> reproduce (this is a W=1 build):
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # install powerpc cross compiling tool for clang build
> # apt-get install binutils-powerpc-linux-gnu
> # save the attached .config to linux build tree
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross 
> ARCH=powerpc
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
>
> All warnings (new ones prefixed by >>, old ones prefixed by <<):
>
> In file included from :1:
> >> ./usr/include/asm/papr_pdsm.h:69:20: warning: field 'hdr' with variable 
> >> sized type 'struct nd_cmd_pkg' not at the end of a struct or class is a 
> >> GNU extension [-Wgnu-variable-sized-type-not-at-end]
> struct nd_cmd_pkg hdr;  /* Package header containing sub-cmd */

Hi Vaibhav,

This looks like it's going to need another round to get this fixed. I
don't think 'struct nd_pdsm_cmd_pkg' should embed a definition of
'struct nd_cmd_pkg'. An instance of 'struct nd_cmd_pkg' carries a
payload that is the 'pdsm' specifics. As the code has it now it's
defined as a superset of 'struct nd_cmd_pkg' and the compiler warning
is pointing out a real 'struct' organization problem.

Given the soak time needed in -next after the code is finalized this
there's no time to do another round of updates and still make the v5.8
merge window.


Re: [PATCH v10 5/6] ndctl/papr_scm, uapi: Add support for PAPR nvdimm specific methods

2020-06-05 Thread Dan Williams
On Fri, Jun 5, 2020 at 12:50 PM Ira Weiny  wrote:
>
> On Fri, Jun 05, 2020 at 05:11:35AM +0530, Vaibhav Jain wrote:
> > Introduce support for PAPR NVDIMM Specific Methods (PDSM) in papr_scm
> > module and add the command family NVDIMM_FAMILY_PAPR to the white list
> > of NVDIMM command sets. Also advertise support for ND_CMD_CALL for the
> > nvdimm command mask and implement necessary scaffolding in the module
> > to handle ND_CMD_CALL ioctl and PDSM requests that we receive.
> >
> > The layout of the PDSM request as we expect from libnvdimm/libndctl is
> > described in newly introduced uapi header 'papr_pdsm.h' which
> > defines a new 'struct nd_pdsm_cmd_pkg' header. This header is used
> > to communicate the PDSM request via member
> > 'nd_cmd_pkg.nd_command' and size of payload that need to be
> > sent/received for servicing the PDSM.
> >
> > A new function is_cmd_valid() is implemented that reads the args to
> > papr_scm_ndctl() and performs sanity tests on them. A new function
> > papr_scm_service_pdsm() is introduced and is called from
> > papr_scm_ndctl() in case of a PDSM request is received via ND_CMD_CALL
> > command from libnvdimm.
> >
> > Cc: "Aneesh Kumar K . V" 
> > Cc: Dan Williams 
> > Cc: Michael Ellerman 
> > Cc: Ira Weiny 
> > Signed-off-by: Vaibhav Jain 
> > ---
> > Changelog:
> >
> > v9..v10:
> > * Simplified 'struct nd_pdsm_cmd_pkg' by removing the
> >   'payload_version' field.
> > * Removed the corrosponding documentation on versioning and backward
> >   compatibility from 'papr_pdsm.h'
> > * Reduced the size of reserved fields to 4-bytes making 'struct
> >   nd_pdsm_cmd_pkg' 64 + 8 bytes long.
> > * Updated is_cmd_valid() to enforce validation checks on pdsm
> >   commands. [ Dan Williams ]
> > * Added check for reserved fields being set to '0' in is_cmd_valid()
> >   [ Ira ]
> > * Moved changes for checking cmd_rc == NULL and logging improvements
> >   to a separate prelim patch [ Ira ].
> > * Moved  pdsm package validation checks from papr_scm_service_pdsm()
> >   to is_cmd_valid().
> > * Marked papr_scm_service_pdsm() return type as 'void' since errors
> >   are reported in nd_pdsm_cmd_pkg.cmd_status field.
> >
> > Resend:
> > * Added ack from Aneesh.
> >
> > v8..v9:
> > * Reduced the usage of term SCM replacing it with appropriate
> >   replacement [ Dan Williams, Aneesh ]
> > * Renamed 'papr_scm_pdsm.h' to 'papr_pdsm.h'
> > * s/PAPR_SCM_PDSM_*/PAPR_PDSM_*/g
> > * s/NVDIMM_FAMILY_PAPR_SCM/NVDIMM_FAMILY_PAPR/g
> > * Minor updates to 'papr_psdm.h' to replace usage of term 'SCM'.
> > * Minor update to patch description.
> >
> > v7..v8:
> > * Removed the 'payload_offset' field from 'struct
> >   nd_pdsm_cmd_pkg'. Instead command payload is always assumed to start
> >   at 'nd_pdsm_cmd_pkg.payload'. [ Aneesh ]
> > * To enable introducing new fields to 'struct nd_pdsm_cmd_pkg',
> >   'reserved' field of 10-bytes is introduced. [ Aneesh ]
> > * Fixed a typo in "Backward Compatibility" section of papr_scm_pdsm.h
> >   [ Ira ]
> >
> > Resend:
> > * None
> >
> > v6..v7 :
> > * Removed the re-definitions of __packed macro from papr_scm_pdsm.h
> >   [Mpe].
> > * Removed the usage of __KERNEL__ macros in papr_scm_pdsm.h [Mpe].
> > * Removed macros that were unused in papr_scm.c from papr_scm_pdsm.h
> >   [Mpe].
> > * Made functions defined in papr_scm_pdsm.h as static inline. [Mpe]
> >
> > v5..v6 :
> > * Changed the usage of the term DSM to PDSM to distinguish it from the
> >   ACPI term [ Dan Williams ]
> > * Renamed papr_scm_dsm.h to papr_scm_pdsm.h and updated various struct
> >   to reflect the new terminology.
> > * Updated the patch description and title to reflect the new terminology.
> > * Squashed patch to introduce new command family in 'ndctl.h' with
> >   this patch [ Dan Williams ]
> > * Updated the papr_scm_pdsm method starting index from 0x1 to 0x0
> >   [ Dan Williams ]
> > * Removed redundant license text from the papr_scm_psdm.h file.
> >   [ Dan Williams ]
> > * s/envelop/envelope/ at various places [ Dan Williams ]
> > * Added '__packed' attribute to command package header to gaurd
> >   against different compiler adding paddings between the fields.
> >   [ Dan Williams]
> > * Converted various pr_debug to dev_debug [ Dan Williams ]
> >
> > v4..v5 :
> > * None
> >
> > v3..v4 :
> > * None
> >
> > v2..v3 :
> > * Updated the patch prefi

Re: [PATCH v10 4/6] powerpc/papr_scm: Improve error logging and handling papr_scm_ndctl()

2020-06-05 Thread Dan Williams
On Fri, Jun 5, 2020 at 10:13 AM Ira Weiny  wrote:
>
> On Fri, Jun 05, 2020 at 05:11:34AM +0530, Vaibhav Jain wrote:
> > Since papr_scm_ndctl() can be called from outside papr_scm, its
> > exposed to the possibility of receiving NULL as value of 'cmd_rc'
> > argument. This patch updates papr_scm_ndctl() to protect against such
> > possibility by assigning it pointer to a local variable in case cmd_rc
> > == NULL.
> >
> > Finally the patch also updates the 'default' clause of the switch-case
> > block removing a 'return' statement thereby ensuring that value of
> > 'cmd_rc' is always logged when papr_scm_ndctl() returns.
> >
> > Cc: "Aneesh Kumar K . V" 
> > Cc: Dan Williams 
> > Cc: Michael Ellerman 
> > Cc: Ira Weiny 
> > Signed-off-by: Vaibhav Jain 
> > ---
> > Changelog:
> >
> > v9..v10
> > * New patch in the series
>
> Thanks for making this a separate patch it is easier to see what is going on
> here.
>
> > ---
> >  arch/powerpc/platforms/pseries/papr_scm.c | 10 --
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> > b/arch/powerpc/platforms/pseries/papr_scm.c
> > index 0c091622b15e..6512fe6a2874 100644
> > --- a/arch/powerpc/platforms/pseries/papr_scm.c
> > +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> > @@ -355,11 +355,16 @@ static int papr_scm_ndctl(struct 
> > nvdimm_bus_descriptor *nd_desc,
> >  {
> >   struct nd_cmd_get_config_size *get_size_hdr;
> >   struct papr_scm_priv *p;
> > + int rc;
> >
> >   /* Only dimm-specific calls are supported atm */
> >   if (!nvdimm)
> >   return -EINVAL;
> >
> > + /* Use a local variable in case cmd_rc pointer is NULL */
> > + if (!cmd_rc)
> > + cmd_rc = 
> > +
>
> This protects you from the NULL.  However...
>
> >   p = nvdimm_provider_data(nvdimm);
> >
> >   switch (cmd) {
> > @@ -381,12 +386,13 @@ static int papr_scm_ndctl(struct 
> > nvdimm_bus_descriptor *nd_desc,
> >   break;
> >
> >   default:
> > - return -EINVAL;
> > + dev_dbg(>pdev->dev, "Unknown command = %d\n", cmd);
> > + *cmd_rc = -EINVAL;
>
> ... I think you are conflating rc and cmd_rc...
>
> >   }
> >
> >   dev_dbg(>pdev->dev, "returned with cmd_rc = %d\n", *cmd_rc);
> >
> > - return 0;
> > + return *cmd_rc;
>
> ... this changes the behavior of the current commands.  Now if the underlying
> papr_scm_meta_[get|set]() fails you return that failure as rc rather than 0.
>
> Is that ok?

The expectation is that rc is "did the command get sent to the device,
or did it fail for 'transport' reasons". The role of cmd_rc is to
translate the specific status response of the command into a common
error code. The expectations are:

rc < 0: Error code, Linux terminated the ioctl before talking to hardware

rc == 0: Linux successfully submitted the command to hardware, cmd_rc
is valid for command specific response

rc > 0: Linux successfully submitted the command, but detected that
only a subset of the data was accepted for "write"-style commands, or
that only subset of data was returned for "read"-style commands. I.e.
short-write / short-read semantics. cmd_rc is valid in this case and
its up to userspace to determine if a short transfer is an error or
not.

> Also 'logging cmd_rc' in the invalid cmd case does not seem quite right unless
> you really want rc to be cmd_rc.
>
> The architecture is designed to separate errors which occur in the kernel vs
> errors in the firmware/dimm.  Are they always the same?  The current code
> differentiates them.

Yeah, they're distinct, transport vs end-point / command-specific
status returns.


Re: [RESEND PATCH v9 4/5] ndctl/papr_scm,uapi: Add support for PAPR nvdimm specific methods

2020-06-05 Thread Dan Williams
On Fri, Jun 5, 2020 at 8:22 AM Vaibhav Jain  wrote:
[..]
> > Oh, why not define a maximal health payload with all the attributes
> > you know about today, leave some room for future expansion, and then
> > report a validity flag for each attribute? This is how the "intel"
> > smart-health payload works. If they ever needed to extend the payload
> > they would increase the size and add more validity flags. Old
> > userspace never groks the new fields, new userspace knows to ask for
> > and parse the larger payload.
> >
> > See the flags field in 'struct nd_intel_smart' (in ndctl) and the
> > translation of those flags to ndctl generic attribute flags
> > intel_cmd_smart_get_flags().
> >
> > In general I'd like ndctl to understand the superset of all health
> > attributes across all vendors. For the truly vendor specific ones it
> > would mean that the health flags with a specific "papr_scm" back-end
> > just would never be set on an "intel" device. I.e. look at the "hpe"
> > and "msft" health backends. They only set a subset of the valid flags
> > that could be reported.
>
> Thanks, this sounds good. Infact papr_scm implementation in ndctl does
> advertises support for only a subset of ND_SMART_* flags right now.
>
> Using 'flags' instead of 'version' was indeed discussed during
> v7..v9. However re-looking at the 'msft' and 'hpe' implementations the
> approach of maximal health payload tagged with a flags field looks more
> intuitive and I would prefer implementing this scheme in this patch-set.
>
> The current set health data exchanged with between libndctl and
> papr_scm via 'struct nd_papr_pdsm_health' (e.g various health status
> bits , nvdimm arming status etc) are guaranteed to be always available
> hence associating their availability with a flag wont be much useful as
> the flag will be always set.
>
> However as you suggested, extending the 'struct nd_papr_pdsm_health' in
> future to accommodate new attributes like 'life-remaining' can be done
> via adding them to the end of the struct and setting a flag field to
> indicate its presence.
>
> So I have the following proposal:
> * Add a new '__u32 extension_flags' field at beginning of 'struct
>   nd_papr_pdsm_health'
> * Set the size of the struct to 184-bytes which is the maximum possible
>   size for a pdsm payload.
> * 'papr_scm' kernel driver will currently set 'extension_flag' to 0
>   indicating no extension fields.
>
> * Future patch that adds support for 'life-remaining' add the new-field
>   at the end of known fields in 'struct nd_papr_pdsm_health'.
> * When provided to  papr_scm kernel module, if 'life-remaining' data is
>   available its populated and corresponding flag set in
>   'extension_flags' field indicating its presence.
> * When received by libndctl papr_scm implementation its tests if the
>   extension_flags have associated 'life-remaining' flag set and if yes
>   then return ND_SMART_USED_VALID flag back from
>   ndctl_cmd_smart_get_flags().
>
> Implementing first 3 items above in the current patchset should be
> fairly trivial.
>
> Does that sounds reasonable ?

This sounds good to me.


Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.

2020-05-30 Thread Dan Williams
On Sat, May 30, 2020 at 12:18 AM Aneesh Kumar K.V
 wrote:
>
> On 5/30/20 12:52 AM, Dan Williams wrote:
> > On Fri, May 29, 2020 at 3:55 AM Aneesh Kumar K.V
> >  wrote:
> >>
> >> On 5/29/20 3:22 PM, Jan Kara wrote:
> >>> Hi!
> >>>
> >>> On Fri 29-05-20 15:07:31, Aneesh Kumar K.V wrote:
> >>>> Thanks Michal. I also missed Jeff in this email thread.
> >>>
> >>> And I think you'll also need some of the sched maintainers for the prctl
> >>> bits...
> >>>
> >>>> On 5/29/20 3:03 PM, Michal Suchánek wrote:
> >>>>> Adding Jan
> >>>>>
> >>>>> On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
> >>>>>> With POWER10, architecture is adding new pmem flush and sync 
> >>>>>> instructions.
> >>>>>> The kernel should prevent the usage of MAP_SYNC if applications are 
> >>>>>> not using
> >>>>>> the new instructions on newer hardware.
> >>>>>>
> >>>>>> This patch adds a prctl option MAP_SYNC_ENABLE that can be used to 
> >>>>>> enable
> >>>>>> the usage of MAP_SYNC. The kernel config option is added to allow the 
> >>>>>> user
> >>>>>> to control whether MAP_SYNC should be enabled by default or not.
> >>>>>>
> >>>>>> Signed-off-by: Aneesh Kumar K.V 
> >>> ...
> >>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
> >>>>>> index 8c700f881d92..d5a9a363e81e 100644
> >>>>>> --- a/kernel/fork.c
> >>>>>> +++ b/kernel/fork.c
> >>>>>> @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp 
> >>>>>> DEFINE_SPINLOCK(mmlist_lock);
> >>>>>> static unsigned long default_dump_filter = MMF_DUMP_FILTER_DEFAULT;
> >>>>>> +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
> >>>>>> +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
> >>>>>> +#else
> >>>>>> +unsigned long default_map_sync_mask = 0;
> >>>>>> +#endif
> >>>>>> +
> >>>
> >>> I'm not sure CONFIG is really the right approach here. For a distro that 
> >>> would
> >>> basically mean to disable MAP_SYNC for all PPC kernels unless application
> >>> explicitly uses the right prctl. Shouldn't we rather initialize
> >>> default_map_sync_mask on boot based on whether the CPU we run on requires
> >>> new flush instructions or not? Otherwise the patch looks sensible.
> >>>
> >>
> >> yes that is correct. We ideally want to deny MAP_SYNC only w.r.t
> >> POWER10. But on a virtualized platform there is no easy way to detect
> >> that. We could ideally hook this into the nvdimm driver where we look at
> >> the new compat string ibm,persistent-memory-v2 and then disable MAP_SYNC
> >> if we find a device with the specific value.
> >>
> >> BTW with the recent changes I posted for the nvdimm driver, older kernel
> >> won't initialize persistent memory device on newer hardware. Newer
> >> hardware will present the device to OS with a different device tree
> >> compat string.
> >>
> >> My expectation  w.r.t this patch was, Distro would want to  mark
> >> CONFIG_ARCH_MAP_SYNC_DISABLE=n based on the different application
> >> certification.  Otherwise application will have to end up calling the
> >> prctl(MMF_DISABLE_MAP_SYNC, 0) any way. If that is the case, should this
> >> be dependent on P10?
> >>
> >> With that I am wondering should we even have this patch? Can we expect
> >> userspace get updated to use new instruction?.
> >>
> >> With ppc64 we never had a real persistent memory device available for
> >> end user to try. The available persistent memory stack was using vPMEM
> >> which was presented as a volatile memory region for which there is no
> >> need to use any of the flush instructions. We could safely assume that
> >> as we get applications certified/verified for working with pmem device
> >> on ppc64, they would all be using the new instructions?
> >
> > I think prctl is the wrong interface for this. I was thinking a sysfs
> > interface along the same lines as /sys/block/pmemX/dax/write_cache.
> > That attribute is toggling DAXDEV_WRITE_CACHE for the determination of
> > whether the platform or the kernel needs to handle cache flushing
> > relative to power loss. A similar attribute can be established for
> > DAXDEV_SYNC, it would simply default to off based on a configuration
> > time policy, but be dynamically changeable at runtime via sysfs.
> >
> > These flags are device properties that affect the kernel and
> > userspace's handling of persistence.
> >
>
> That will not handle the scenario with multiple applications using the
> same fsdax mount point where one is updated to use the new instruction
> and the other is not.

Right, it needs to be a global setting / flag day to switch from one
regime to another. Per-process control is a recipe for disaster.


Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.

2020-05-29 Thread Dan Williams
On Fri, May 29, 2020 at 3:55 AM Aneesh Kumar K.V
 wrote:
>
> On 5/29/20 3:22 PM, Jan Kara wrote:
> > Hi!
> >
> > On Fri 29-05-20 15:07:31, Aneesh Kumar K.V wrote:
> >> Thanks Michal. I also missed Jeff in this email thread.
> >
> > And I think you'll also need some of the sched maintainers for the prctl
> > bits...
> >
> >> On 5/29/20 3:03 PM, Michal Suchánek wrote:
> >>> Adding Jan
> >>>
> >>> On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
>  With POWER10, architecture is adding new pmem flush and sync 
>  instructions.
>  The kernel should prevent the usage of MAP_SYNC if applications are not 
>  using
>  the new instructions on newer hardware.
> 
>  This patch adds a prctl option MAP_SYNC_ENABLE that can be used to enable
>  the usage of MAP_SYNC. The kernel config option is added to allow the 
>  user
>  to control whether MAP_SYNC should be enabled by default or not.
> 
>  Signed-off-by: Aneesh Kumar K.V 
> > ...
>  diff --git a/kernel/fork.c b/kernel/fork.c
>  index 8c700f881d92..d5a9a363e81e 100644
>  --- a/kernel/fork.c
>  +++ b/kernel/fork.c
>  @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp 
>  DEFINE_SPINLOCK(mmlist_lock);
> static unsigned long default_dump_filter = MMF_DUMP_FILTER_DEFAULT;
>  +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
>  +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
>  +#else
>  +unsigned long default_map_sync_mask = 0;
>  +#endif
>  +
> >
> > I'm not sure CONFIG is really the right approach here. For a distro that 
> > would
> > basically mean to disable MAP_SYNC for all PPC kernels unless application
> > explicitly uses the right prctl. Shouldn't we rather initialize
> > default_map_sync_mask on boot based on whether the CPU we run on requires
> > new flush instructions or not? Otherwise the patch looks sensible.
> >
>
> yes that is correct. We ideally want to deny MAP_SYNC only w.r.t
> POWER10. But on a virtualized platform there is no easy way to detect
> that. We could ideally hook this into the nvdimm driver where we look at
> the new compat string ibm,persistent-memory-v2 and then disable MAP_SYNC
> if we find a device with the specific value.
>
> BTW with the recent changes I posted for the nvdimm driver, older kernel
> won't initialize persistent memory device on newer hardware. Newer
> hardware will present the device to OS with a different device tree
> compat string.
>
> My expectation  w.r.t this patch was, Distro would want to  mark
> CONFIG_ARCH_MAP_SYNC_DISABLE=n based on the different application
> certification.  Otherwise application will have to end up calling the
> prctl(MMF_DISABLE_MAP_SYNC, 0) any way. If that is the case, should this
> be dependent on P10?
>
> With that I am wondering should we even have this patch? Can we expect
> userspace get updated to use new instruction?.
>
> With ppc64 we never had a real persistent memory device available for
> end user to try. The available persistent memory stack was using vPMEM
> which was presented as a volatile memory region for which there is no
> need to use any of the flush instructions. We could safely assume that
> as we get applications certified/verified for working with pmem device
> on ppc64, they would all be using the new instructions?

I think prctl is the wrong interface for this. I was thinking a sysfs
interface along the same lines as /sys/block/pmemX/dax/write_cache.
That attribute is toggling DAXDEV_WRITE_CACHE for the determination of
whether the platform or the kernel needs to handle cache flushing
relative to power loss. A similar attribute can be established for
DAXDEV_SYNC, it would simply default to off based on a configuration
time policy, but be dynamically changeable at runtime via sysfs.

These flags are device properties that affect the kernel and
userspace's handling of persistence.


Re: [PATCH v8 1/5] powerpc: Document details on H_SCM_HEALTH hcall

2020-05-27 Thread Dan Williams
On Tue, May 26, 2020 at 9:13 PM Vaibhav Jain  wrote:
>
> Add documentation to 'papr_hcalls.rst' describing the bitmap flags
> that are returned from H_SCM_HEALTH hcall as per the PAPR-SCM
> specification.
>

Please do a global s/SCM/PMEM/ or s/SCM/NVDIMM/. It's unfortunate that
we already have 2 ways to describe persistent memory devices, let's
not perpetuate a third so that "grep" has a chance to find
interrelated code across architectures. Other than that this looks
good to me.

> Cc: "Aneesh Kumar K . V" 
> Cc: Dan Williams 
> Cc: Michael Ellerman 
> Cc: Ira Weiny 
> Signed-off-by: Vaibhav Jain 
> ---
> Changelog:
> v7..v8:
> * Added a clarification on bit-ordering of Health Bitmap
>
> Resend:
> * None
>
> v6..v7:
> * None
>
> v5..v6:
> * New patch in the series
> ---
>  Documentation/powerpc/papr_hcalls.rst | 45 ---
>  1 file changed, 41 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/powerpc/papr_hcalls.rst 
> b/Documentation/powerpc/papr_hcalls.rst
> index 3493631a60f8..45063f305813 100644
> --- a/Documentation/powerpc/papr_hcalls.rst
> +++ b/Documentation/powerpc/papr_hcalls.rst
> @@ -220,13 +220,50 @@ from the LPAR memory.
>  **H_SCM_HEALTH**
>
>  | Input: drcIndex
> -| Out: *health-bitmap, health-bit-valid-bitmap*
> +| Out: *health-bitmap (r4), health-bit-valid-bitmap (r5)*
>  | Return Value: *H_Success, H_Parameter, H_Hardware*
>
>  Given a DRC Index return the info on predictive failure and overall health of
> -the NVDIMM. The asserted bits in the health-bitmap indicate a single 
> predictive
> -failure and health-bit-valid-bitmap indicate which bits in health-bitmap are
> -valid.
> +the NVDIMM. The asserted bits in the health-bitmap indicate one or more 
> states
> +(described in table below) of the NVDIMM and health-bit-valid-bitmap indicate
> +which bits in health-bitmap are valid. The bits are reported in
> +reverse bit ordering for example a value of 0xC400
> +indicates bits 0, 1, and 5 are valid.
> +
> +Health Bitmap Flags:
> +
> ++--+---+
> +|  Bit |   Definition
>   |
> ++==+===+
> +|  00  | SCM device is unable to persist memory contents.
>   |
> +|  | If the system is powered down, nothing will be saved.   
>   |
> ++--+---+
> +|  01  | SCM device failed to persist memory contents. Either contents were 
> not|
> +|  | saved successfully on power down or were not restored properly on   
>   |
> +|  | power up.   
>   |
> ++--+---+
> +|  02  | SCM device contents are persisted from previous IPL. The data from  
>   |
> +|  | the last boot were successfully restored.   
>   |
> ++--+---+
> +|  03  | SCM device contents are not persisted from previous IPL. There was 
> no |
> +|  | data to restore from the last boot. 
>   |
> ++--+---+
> +|  04  | SCM device memory life remaining is critically low  
>   |
> ++--+---+
> +|  05  | SCM device will be garded off next IPL due to failure   
>   |
> ++--+---+
> +|  06  | SCM contents cannot persist due to current platform health status. 
> A  |
> +|  | hardware failure may prevent data from being saved or restored. 
>   |
> ++--+---+
> +|  07  | SCM device is unable to persist memory contents in certain 
> conditions |
> ++--+---+
> +|  08  | SCM device is encrypted 
>   |
> ++--+---+
> +|  09  | SCM device has successfully completed a requested erase or secure   
>   |
> +|  | erase procedure.
>   |
> ++--+---+
> +|10:63 | Reserved / Unused   
>   |
> ++--+---+
>
>  **H_SCM_PERFORMANCE_STATS**
>
> --
> 2.26.2
>


Re: [PATCH v2 3/5] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-05-21 Thread Dan Williams
On Thu, May 21, 2020 at 7:39 AM Jeff Moyer  wrote:
>
> Dan Williams  writes:
>
> >> But I agree with your concern that if we have older kernel/applications
> >> that continue to use `dcbf` on future hardware we will end up
> >> having issues w.r.t powerfail consistency. The plan is what you outlined
> >> above as tighter ecosystem control. Considering we don't have a pmem
> >> device generally available, we get both kernel and userspace upgraded
> >> to use these new instructions before such a device is made available.
>
> I thought power already supported NVDIMM-N, no?  So are you saying that
> those devices will continue to work with the existing flushing and
> fencing mechanisms?
>
> > Ok, I think a compile time kernel option with a runtime override
> > satisfies my concern. Does that work for you?
>
> The compile time option only helps when running newer kernels.  I'm not
> sure how you would even begin to audit userspace applications (keep in
> mind, not every application is open source, and not every application
> uses pmdk).  I also question the merits of forcing the administrator to
> make the determination of whether all applications on the system will
> work properly.  Really, you have to rely on the vendor to tell you the
> platform is supported, and at that point, why put further hurdles in the
> way?

I'm thoroughly confused by this. I thought this was exactly the role
of a Linux distribution vendor. ISVs qualify their application on a
hardware-platform + distribution combination and the distribution owns
picking ABI defaults like CONFIG_SYSFS_DEPRECATED regardless of
whether they can guarantee that all apps are updated to the new
semantics.

The administrator is not forced, the administrator if afforded an
override in the extreme case that they find an exception to what was
qualified and need to override the distribution's compile-time choice.

>
> The decision to require different instructions on ppc is unfortunate,
> but one I'm sure we have no control over.  I don't see any merit in the
> kernel disallowing MAP_SYNC access on these platforms.  Ideally, we'd
> have some way of ensuring older kernels don't work with these new
> platforms, but I don't think that's possible.

I see disabling MAP_SYNC as the more targeted form of "ensursing older
kernels don't work.

So I guess we agree that something should break when baseline
assumptions change, we just don't yet agree on where that break should
happen?


Re: [PATCH v2 3/5] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-05-21 Thread Dan Williams
On Thu, May 21, 2020 at 10:03 AM Aneesh Kumar K.V
 wrote:
>
> On 5/21/20 8:08 PM, Jeff Moyer wrote:
> > Dan Williams  writes:
> >
> >>> But I agree with your concern that if we have older kernel/applications
> >>> that continue to use `dcbf` on future hardware we will end up
> >>> having issues w.r.t powerfail consistency. The plan is what you outlined
> >>> above as tighter ecosystem control. Considering we don't have a pmem
> >>> device generally available, we get both kernel and userspace upgraded
> >>> to use these new instructions before such a device is made available.
> >
> > I thought power already supported NVDIMM-N, no?  So are you saying that
> > those devices will continue to work with the existing flushing and
> > fencing mechanisms?
> >
>
> yes. these devices can continue to use 'dcbf + hwsync' as long as we are
> running them on P9.
>
>
> >> Ok, I think a compile time kernel option with a runtime override
> >> satisfies my concern. Does that work for you?
> >
> > The compile time option only helps when running newer kernels.  I'm not
> > sure how you would even begin to audit userspace applications (keep in
> > mind, not every application is open source, and not every application
> > uses pmdk).  I also question the merits of forcing the administrator to
> > make the determination of whether all applications on the system will
> > work properly.  Really, you have to rely on the vendor to tell you the
> > platform is supported, and at that point, why put further hurdles in the
> > way?
> >
> > The decision to require different instructions on ppc is unfortunate,
> > but one I'm sure we have no control over.  I don't see any merit in the
> > kernel disallowing MAP_SYNC access on these platforms.  Ideally, we'd
> > have some way of ensuring older kernels don't work with these new
> > platforms, but I don't think that's possible.
> >
>
>
> I am currently looking at the possibility of firmware present these
> devices with different device-tree compat values. So that older
> /existing kernel won't initialize the device on newer systems. Is that a
> good compromise? We still can end up with older userspace and newer
> kernel. One of the option suggested by Jan Kara is to use a prctl flag
> to control that? (intead of kernel parameter option I posted before)
>
>
> > Moving on to the patch itself--Aneesh, have you audited other persistent
> > memory users in the kernel?  For example, drivers/md/dm-writecache.c does
> > this:
> >
> > static void writecache_commit_flushed(struct dm_writecache *wc, bool 
> > wait_for_ios)
> > {
> >   if (WC_MODE_PMEM(wc))
> >   wmb(); <==
> >  else
> >  ssd_commit_flushed(wc, wait_for_ios);
> > }
> >
> > I believe you'll need to make modifications there.
> >
>
> Correct. Thanks for catching that.
>
>
> I don't understand dm much, wondering how this will work with
> non-synchronous DAX device?

That's a good point. DM-writecache needs to be cognizant of things
like virtio-pmem that violate the rule that persisent memory writes
can be flushed by CPU functions rather than calling back into the
driver. It seems we need to always make the flush case a dax_operation
callback to account for this.


Re: [PATCH v2 3/5] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-05-19 Thread Dan Williams
On Tue, May 19, 2020 at 6:53 AM Aneesh Kumar K.V
 wrote:
>
> Dan Williams  writes:
>
> > On Mon, May 18, 2020 at 10:30 PM Aneesh Kumar K.V
> >  wrote:
>
> ...
>
> >> Applications using new instructions will behave as expected when running
> >> on P8 and P9. Only future hardware will differentiate between 'dcbf' and
> >> 'dcbfps'
> >
> > Right, this is the problem. Applications using new instructions behave
> > as expected, the kernel has been shipping of_pmem and papr_scm for
> > several cycles now, you're saying that the DAX applications written
> > against those platforms are going to be broken on P8 and P9?
>
> The expecation is that both kernel and userspace would get upgraded to
> use the new instruction before actual persistent memory devices are
> made available.
>
> >
> >> > I'm thinking the kernel
> >> > should go as far as to disable DAX operation by default on new
> >> > hardware until userspace asserts that it is prepared to switch to the
> >> > new implementation. Is there any other way to ensure the forward
> >> > compatibility of deployed ppc64 DAX applications?
> >>
> >> AFAIU there is no released persistent memory hardware on ppc64 platform
> >> and we need to make sure before applications get enabled to use these
> >> persistent memory devices, they should switch to use the new
> >> instruction?
> >
> > Right, I want the kernel to offer some level of safety here because
> > everything you are describing sounds like a flag day conversion. Am I
> > misreading? Is there some other gate that prevents existing users of
> > of_pmem and papr_scm from having their expectations violated when
> > running on P8 / P9 hardware? Maybe there's tighter ecosystem control
> > that I'm just not familiar with, I'm only going off the fact that the
> > kernel has shipped a non-zero number of NVDIMM drivers that build with
> > ARCH=ppc64 for several cycles.
>
> If we are looking at adding changes to kernel that will prevent a kernel
> from running on newer hardware in a specific case, we could as well take
> the changes to get the kernel use the newer instructions right?

Oh, no, I'm not talking about stopping the kernel from running. I'm
simply recommending that support for MAP_SYNC mappings (userspace
managed flushing) be disabled by default on PPC with either a
compile-time or run-time default to assert that userspace has been
audited for legacy applications or that the platform owner is
otherwise willing to take the risk.

> But I agree with your concern that if we have older kernel/applications
> that continue to use `dcbf` on future hardware we will end up
> having issues w.r.t powerfail consistency. The plan is what you outlined
> above as tighter ecosystem control. Considering we don't have a pmem
> device generally available, we get both kernel and userspace upgraded
> to use these new instructions before such a device is made available.

Ok, I think a compile time kernel option with a runtime override
satisfies my concern. Does that work for you?


Re: [PATCH v2 3/5] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-05-19 Thread Dan Williams
On Mon, May 18, 2020 at 10:30 PM Aneesh Kumar K.V
 wrote:
>
>
> Hi Dan,
>
> Apologies for the delay in response. I was waiting for feedback from
> hardware team before responding to this email.
>
>
> Dan Williams  writes:
>
> > On Tue, May 12, 2020 at 8:47 PM Aneesh Kumar K.V
> >  wrote:
> >>
> >> Architectures like ppc64 provide persistent memory specific barriers
> >> that will ensure that all stores for which the modifications are
> >> written to persistent storage by preceding dcbfps and dcbstps
> >> instructions have updated persistent storage before any data
> >> access or data transfer caused by subsequent instructions is initiated.
> >> This is in addition to the ordering done by wmb()
> >>
> >> Update nvdimm core such that architecture can use barriers other than
> >> wmb to ensure all previous writes are architecturally visible for
> >> the platform buffer flush.
> >
> > This seems like an exceedingly bad idea, maybe I'm missing something.
> > This implies that the deployed base of DAX applications using the old
> > instruction sequence are going to regress on new hardware that
> > requires the new instructions to be deployed.
>
>
> pmdk support for ppc64 is still work in progress and there is pull
> request to switch pmdk to use new instruction.

Ok.

>
> https://github.com/tuliom/pmdk/commit/fix-flush
>
> All userspace applications will be switched to use the new
> instructions. The new instructions are designed such that when running on P8
> and P9 they behave as 'dcbf' and 'hwsync'.

Sure, makes sense.

> Applications using new instructions will behave as expected when running
> on P8 and P9. Only future hardware will differentiate between 'dcbf' and
> 'dcbfps'

Right, this is the problem. Applications using new instructions behave
as expected, the kernel has been shipping of_pmem and papr_scm for
several cycles now, you're saying that the DAX applications written
against those platforms are going to be broken on P8 and P9?

> > I'm thinking the kernel
> > should go as far as to disable DAX operation by default on new
> > hardware until userspace asserts that it is prepared to switch to the
> > new implementation. Is there any other way to ensure the forward
> > compatibility of deployed ppc64 DAX applications?
>
> AFAIU there is no released persistent memory hardware on ppc64 platform
> and we need to make sure before applications get enabled to use these
> persistent memory devices, they should switch to use the new
> instruction?

Right, I want the kernel to offer some level of safety here because
everything you are describing sounds like a flag day conversion. Am I
misreading? Is there some other gate that prevents existing users of
of_pmem and papr_scm from having their expectations violated when
running on P8 / P9 hardware? Maybe there's tighter ecosystem control
that I'm just not familiar with, I'm only going off the fact that the
kernel has shipped a non-zero number of NVDIMM drivers that build with
ARCH=ppc64 for several cycles.


Re: remove a few uses of ->queuedata

2020-05-13 Thread Dan Williams
On Tue, May 12, 2020 at 1:08 AM Christoph Hellwig  wrote:
>
> On Sat, May 09, 2020 at 08:07:14AM -0700, Dan Williams wrote:
> > > which are all used in the I/O submission path (generic_make_request /
> > > generic_make_request_checks).  This is mostly a prep cleanup patch to
> > > also remove the pointless queue argument from ->make_request - then
> > > ->queue is an extra dereference and extra churn.
> >
> > Ah ok. If the changelogs had been filled in with something like "In
> > preparation for removing @q from make_request_fn, stop using
> > ->queuedata", I probably wouldn't have looked twice.
> >
> > For the nvdimm/ driver updates you can add:
> >
> > Reviewed-by: Dan Williams 
> >
> > ...or just let me know if you want me to pick those up through the nvdimm 
> > tree.
>
> I'd love you to pick them up through the nvdimm tree.  Do you want
> to fix up the commit message yourself?

Will do, thanks.


Re: [PATCH v2 3/5] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-05-13 Thread Dan Williams
On Tue, May 12, 2020 at 8:47 PM Aneesh Kumar K.V
 wrote:
>
> Architectures like ppc64 provide persistent memory specific barriers
> that will ensure that all stores for which the modifications are
> written to persistent storage by preceding dcbfps and dcbstps
> instructions have updated persistent storage before any data
> access or data transfer caused by subsequent instructions is initiated.
> This is in addition to the ordering done by wmb()
>
> Update nvdimm core such that architecture can use barriers other than
> wmb to ensure all previous writes are architecturally visible for
> the platform buffer flush.

This seems like an exceedingly bad idea, maybe I'm missing something.
This implies that the deployed base of DAX applications using the old
instruction sequence are going to regress on new hardware that
requires the new instructions to be deployed. I'm thinking the kernel
should go as far as to disable DAX operation by default on new
hardware until userspace asserts that it is prepared to switch to the
new implementation. Is there any other way to ensure the forward
compatibility of deployed ppc64 DAX applications?

>
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  drivers/nvdimm/region_devs.c | 8 
>  include/linux/libnvdimm.h| 4 
>  2 files changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index ccbb5b43b8b2..88ea34a9c7fd 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -1216,13 +1216,13 @@ int generic_nvdimm_flush(struct nd_region *nd_region)
> idx = this_cpu_add_return(flush_idx, hash_32(current->pid + idx, 8));
>
> /*
> -* The first wmb() is needed to 'sfence' all previous writes
> -* such that they are architecturally visible for the platform
> -* buffer flush.  Note that we've already arranged for pmem
> +* The first arch_pmem_flush_barrier() is needed to 'sfence' all
> +* previous writes such that they are architecturally visible for
> +* the platform buffer flush. Note that we've already arranged for 
> pmem
>  * writes to avoid the cache via memcpy_flushcache().  The final
>  * wmb() ensures ordering for the NVDIMM flush write.
>  */
> -   wmb();
> +   arch_pmem_flush_barrier();
> for (i = 0; i < nd_region->ndr_mappings; i++)
> if (ndrd_get_flush_wpq(ndrd, i, 0))
> writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
> index 18da4059be09..66f6c65bd789 100644
> --- a/include/linux/libnvdimm.h
> +++ b/include/linux/libnvdimm.h
> @@ -286,4 +286,8 @@ static inline void arch_invalidate_pmem(void *addr, 
> size_t size)
>  }
>  #endif
>
> +#ifndef arch_pmem_flush_barrier
> +#define arch_pmem_flush_barrier() wmb()
> +#endif
> +
>  #endif /* __LIBNVDIMM_H__ */
> --
> 2.26.2
>


Re: remove a few uses of ->queuedata

2020-05-09 Thread Dan Williams
On Sat, May 9, 2020 at 1:24 AM Christoph Hellwig  wrote:
>
> On Fri, May 08, 2020 at 11:04:45AM -0700, Dan Williams wrote:
> > On Fri, May 8, 2020 at 9:16 AM Christoph Hellwig  wrote:
> > >
> > > Hi all,
> > >
> > > various bio based drivers use queue->queuedata despite already having
> > > set up disk->private_data, which can be used just as easily.  This
> > > series cleans them up to only use a single private data pointer.
> >
> > ...but isn't the queue pretty much guaranteed to be cache hot and the
> > gendisk cache cold? I'm not immediately seeing what else needs the
> > gendisk in the I/O path. Is there another motivation I'm missing?
>
> ->private_data is right next to the ->queue pointer, pat0 and part_tbl
> which are all used in the I/O submission path (generic_make_request /
> generic_make_request_checks).  This is mostly a prep cleanup patch to
> also remove the pointless queue argument from ->make_request - then
> ->queue is an extra dereference and extra churn.

Ah ok. If the changelogs had been filled in with something like "In
preparation for removing @q from make_request_fn, stop using
->queuedata", I probably wouldn't have looked twice.

For the nvdimm/ driver updates you can add:

Reviewed-by: Dan Williams 

...or just let me know if you want me to pick those up through the nvdimm tree.


Re: remove a few uses of ->queuedata

2020-05-08 Thread Dan Williams
On Fri, May 8, 2020 at 9:16 AM Christoph Hellwig  wrote:
>
> Hi all,
>
> various bio based drivers use queue->queuedata despite already having
> set up disk->private_data, which can be used just as easily.  This
> series cleans them up to only use a single private data pointer.

...but isn't the queue pretty much guaranteed to be cache hot and the
gendisk cache cold? I'm not immediately seeing what else needs the
gendisk in the I/O path. Is there another motivation I'm missing?


Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-05-02 Thread Dan Williams
On Sat, May 2, 2020 at 2:27 AM David Hildenbrand  wrote:
>
> >> Now, let's clarify what I want regarding virtio-mem:
> >>
> >> 1. kexec should not add virtio-mem memory to the initial firmware
> >>memmap. The driver has to be in charge as discussed.
> >> 2. kexec should not place kexec images onto virtio-mem memory. That
> >>would end badly.
> >> 3. kexec should still dump virtio-mem memory via kdump.
> >
> > Ok, but then seems to say to me that dax/kmem is a different type of
> > (driver managed) than virtio-mem and it's confusing to try to apply
> > the same meaning. Why not just call your type for the distinct type it
> > is "System RAM (virtio-mem)" and let any other driver managed memory
> > follow the same "System RAM ($driver)" format if it wants?
>
> I had the same idea but discarded it because it seemed to uglify the
> add_memory() interface (passing yet another parameter only relevant for
> driver managed memory). Maybe we really want a new one, because I like
> that idea:
>
> /*
>  * Add special, driver-managed memory to the system as system ram.
>  * The resource_name is expected to have the name format "System RAM
>  * ($DRIVER)", so user space (esp. kexec-tools)" can special-case it.
>  *
>  * For this memory, no entries in /sys/firmware/memmap are created,
>  * as this memory won't be part of the raw firmware-provided memory map
>  * e.g., after a reboot. Also, the created memory resource is flagged
>  * with IORESOURCE_MEM_DRIVER_MANAGED, so in-kernel users can special-
>  * case this memory (e.g., not place kexec images onto it).
>  */
> int add_memory_driver_managed(int nid, u64 start, u64 size,
>   const char *resource_name);
>
>
> If we'd ever have to special case it even more in the kernel, we could
> allow to specify further resource flags. While passing the driver name
> instead of the resource_name would be an option, this way we don't have
> to hand craft new resource strings for added memory resources.
>
> Thoughts?

Looks useful to me and simplifies walking /proc/iomem. I personally
like the safety of the string just being the $driver component of the
name, but I won't lose sleep if the interface stays freeform like you
propose.


Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-05-01 Thread Dan Williams
On Fri, May 1, 2020 at 2:11 PM David Hildenbrand  wrote:
>
> On 01.05.20 22:12, Dan Williams wrote:
[..]
> >>> Consider the case of EFI Special Purpose (SP) Memory that is
> >>> marked EFI Conventional Memory with the SP attribute. In that case the
> >>> firmware memory map marked it as conventional RAM, but the kernel
> >>> optionally marks it as System RAM vs Soft Reserved. The 2008 patch
> >>> simply does not consider that case. I'm not sure strict textualism
> >>> works for coding decisions.
> >>
> >> I am no expert on that matter (esp EFI). But looking at the users of
> >> firmware_map_add_early(), the single user is in arch/x86/kernel/e820.c
> >> . So the single source of /sys/firmware/memmap is (besides hotplug) e820.
> >>
> >> "'e820_table_firmware': the original firmware version passed to us by
> >> the bootloader - not modified by the kernel. ... inform the user about
> >> the firmware's notion of memory layout via /sys/firmware/memmap"
> >> (arch/x86/kernel/e820.c)
> >>
> >> How is the EFI Special Purpose (SP) Memory represented in e820?
> >> /sys/firmware/memmap is really simple: just dump in e820. No policies IIUC.
> >
> > e820 now has a Soft Reserved translation for this which means "try to
> > reserve, but treat as System RAM is ok too". It seems generically
> > useful to me that the toggle for determining whether Soft Reserved or
> > System RAM shows up /sys/firmware/memmap is a determination that
> > policy can make. The kernel need not preemptively block it.
>
> So, I think I have to clarify something here. We do have two ways to kexec
>
> 1. kexec_load(): User space (kexec-tools) crafts the memmap (e.g., using
> /sys/firmware/memmap on x86-64) and selects memory where to place the
> kexec images (e.g., using /proc/iomem)
>
> 2. kexec_file_load(): The kernel reuses the (basically) raw firmware
> memmap and selects memory where to place kexec images.
>
> We are talking about changing 1, to behave like 2 in regards to
> dax/kmem. 2. does currently not add any hotplugged memory to the
> fixed-up e820, and it should be fixed regarding hotplugged DIMMs that
> would appear in e820 after a reboot.
>
> Now, all these policy discussions are nice and fun, but I don't really
> see a good reason to (ab)use /sys/firmware/memmap for that (e.g., parent
> properties). If you want to be able to make this configurable, then
> e.g., add a way to configure this in the kernel (for example along with
> kmem) to make 1. and 2. behave the same way. Otherwise, you really only
> can change 1.

That's clearer.

>
>
> Now, let's clarify what I want regarding virtio-mem:
>
> 1. kexec should not add virtio-mem memory to the initial firmware
>memmap. The driver has to be in charge as discussed.
> 2. kexec should not place kexec images onto virtio-mem memory. That
>would end badly.
> 3. kexec should still dump virtio-mem memory via kdump.

Ok, but then seems to say to me that dax/kmem is a different type of
(driver managed) than virtio-mem and it's confusing to try to apply
the same meaning. Why not just call your type for the distinct type it
is "System RAM (virtio-mem)" and let any other driver managed memory
follow the same "System RAM ($driver)" format if it wants?


Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-05-01 Thread Dan Williams
On Fri, May 1, 2020 at 12:18 PM David Hildenbrand  wrote:
>
> On 01.05.20 20:43, Dan Williams wrote:
> > On Fri, May 1, 2020 at 11:14 AM David Hildenbrand  wrote:
> >>
> >> On 01.05.20 20:03, Dan Williams wrote:
> >>> On Fri, May 1, 2020 at 10:51 AM David Hildenbrand  
> >>> wrote:
> >>>>
> >>>> On 01.05.20 19:45, David Hildenbrand wrote:
> >>>>> On 01.05.20 19:39, Dan Williams wrote:
> >>>>>> On Fri, May 1, 2020 at 10:21 AM David Hildenbrand  
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> On 01.05.20 18:56, Dan Williams wrote:
> >>>>>>>> On Fri, May 1, 2020 at 2:34 AM David Hildenbrand  
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> On 01.05.20 00:24, Andrew Morton wrote:
> >>>>>>>>>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand 
> >>>>>>>>>>  wrote:
> >>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Why does the firmware map support hotplug entries?
> >>>>>>>>>>>
> >>>>>>>>>>> I assume:
> >>>>>>>>>>>
> >>>>>>>>>>> The firmware memmap was added primarily for x86-64 kexec (and 
> >>>>>>>>>>> still, is
> >>>>>>>>>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. 
> >>>>>>>>>>> When DIMMs
> >>>>>>>>>>> get hotplugged on real HW, they get added to e820. Same applies to
> >>>>>>>>>>> memory added via HyperV balloon (unless memory is unplugged via
> >>>>>>>>>>> ballooning and you reboot ... the the e820 is changed as well). I 
> >>>>>>>>>>> assume
> >>>>>>>>>>> we wanted to be able to reflect that, to make kexec look like a 
> >>>>>>>>>>> real reboot.
> >>>>>>>>>>>
> >>>>>>>>>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> But I assume only Andrew can enlighten us.
> >>>>>>>>>>>
> >>>>>>>>>>> @Andrew, any guidance here? Should we really add all memory to the
> >>>>>>>>>>> firmware memmap, even if this contradicts with the existing
> >>>>>>>>>>> documentation? (especially, if the actual firmware memmap will 
> >>>>>>>>>>> *not*
> >>>>>>>>>>> contain that memory after a reboot)
> >>>>>>>>>>
> >>>>>>>>>> For some reason that patch is misattributed - it was authored by
> >>>>>>>>>> Shaohui Zheng , who hasn't been heard 
> >>>>>>>>>> from in
> >>>>>>>>>> a decade.  I looked through the email discussion from that time 
> >>>>>>>>>> and I'm
> >>>>>>>>>> not seeing anything useful.  But I wasn't able to locate Dave 
> >>>>>>>>>> Hansen's
> >>>>>>>>>> review comments.
> >>>>>>>>>
> >>>>>>>>> Okay, thanks for checking. I think the documentation from 2008 is 
> >>>>>>>>> pretty
> >>>>>>>>> clear what has to be done here. I will add some of these details to 
> >>>>>>>>> the
> >>>>>>>>> patch description.
> >>>>>>>>>
> >>>>>>>>> Also, now that I know that esp. kexec-tools already don't consider
> >>>>>>>>> dax/kmem memory properly (memory will not get dumped via kdump) and
> >>>>>>>>> won't really suffer from a name change in /proc/iomem, I will go 
> >>>>>>>>> back to
> >>>>>>>>> the MHP_DRIVER_MANAGED approach and
> >>>>>>>>> 1. Don't create firmware memma

Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-05-01 Thread Dan Williams
On Fri, May 1, 2020 at 11:14 AM David Hildenbrand  wrote:
>
> On 01.05.20 20:03, Dan Williams wrote:
> > On Fri, May 1, 2020 at 10:51 AM David Hildenbrand  wrote:
> >>
> >> On 01.05.20 19:45, David Hildenbrand wrote:
> >>> On 01.05.20 19:39, Dan Williams wrote:
> >>>> On Fri, May 1, 2020 at 10:21 AM David Hildenbrand  
> >>>> wrote:
> >>>>>
> >>>>> On 01.05.20 18:56, Dan Williams wrote:
> >>>>>> On Fri, May 1, 2020 at 2:34 AM David Hildenbrand  
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> On 01.05.20 00:24, Andrew Morton wrote:
> >>>>>>>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand 
> >>>>>>>>  wrote:
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Why does the firmware map support hotplug entries?
> >>>>>>>>>
> >>>>>>>>> I assume:
> >>>>>>>>>
> >>>>>>>>> The firmware memmap was added primarily for x86-64 kexec (and 
> >>>>>>>>> still, is
> >>>>>>>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When 
> >>>>>>>>> DIMMs
> >>>>>>>>> get hotplugged on real HW, they get added to e820. Same applies to
> >>>>>>>>> memory added via HyperV balloon (unless memory is unplugged via
> >>>>>>>>> ballooning and you reboot ... the the e820 is changed as well). I 
> >>>>>>>>> assume
> >>>>>>>>> we wanted to be able to reflect that, to make kexec look like a 
> >>>>>>>>> real reboot.
> >>>>>>>>>
> >>>>>>>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> But I assume only Andrew can enlighten us.
> >>>>>>>>>
> >>>>>>>>> @Andrew, any guidance here? Should we really add all memory to the
> >>>>>>>>> firmware memmap, even if this contradicts with the existing
> >>>>>>>>> documentation? (especially, if the actual firmware memmap will *not*
> >>>>>>>>> contain that memory after a reboot)
> >>>>>>>>
> >>>>>>>> For some reason that patch is misattributed - it was authored by
> >>>>>>>> Shaohui Zheng , who hasn't been heard from 
> >>>>>>>> in
> >>>>>>>> a decade.  I looked through the email discussion from that time and 
> >>>>>>>> I'm
> >>>>>>>> not seeing anything useful.  But I wasn't able to locate Dave 
> >>>>>>>> Hansen's
> >>>>>>>> review comments.
> >>>>>>>
> >>>>>>> Okay, thanks for checking. I think the documentation from 2008 is 
> >>>>>>> pretty
> >>>>>>> clear what has to be done here. I will add some of these details to 
> >>>>>>> the
> >>>>>>> patch description.
> >>>>>>>
> >>>>>>> Also, now that I know that esp. kexec-tools already don't consider
> >>>>>>> dax/kmem memory properly (memory will not get dumped via kdump) and
> >>>>>>> won't really suffer from a name change in /proc/iomem, I will go back 
> >>>>>>> to
> >>>>>>> the MHP_DRIVER_MANAGED approach and
> >>>>>>> 1. Don't create firmware memmap entries
> >>>>>>> 2. Name the resource "System RAM (driver managed)"
> >>>>>>> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
> >>>>>>>
> >>>>>>> This way, kernel users and user space can figure out that this memory
> >>>>>>> has different semantics and handle it accordingly - I think that was
> >>>>>>> what Eric was asking for.
> >>>>>>>
> >>>>>>> Of course, open for suggestions.
> >>>>>>
> >>>>>> I'm still more of a fan of this bei

Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-05-01 Thread Dan Williams
On Fri, May 1, 2020 at 10:51 AM David Hildenbrand  wrote:
>
> On 01.05.20 19:45, David Hildenbrand wrote:
> > On 01.05.20 19:39, Dan Williams wrote:
> >> On Fri, May 1, 2020 at 10:21 AM David Hildenbrand  wrote:
> >>>
> >>> On 01.05.20 18:56, Dan Williams wrote:
> >>>> On Fri, May 1, 2020 at 2:34 AM David Hildenbrand  
> >>>> wrote:
> >>>>>
> >>>>> On 01.05.20 00:24, Andrew Morton wrote:
> >>>>>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand 
> >>>>>>  wrote:
> >>>>>>
> >>>>>>>>
> >>>>>>>> Why does the firmware map support hotplug entries?
> >>>>>>>
> >>>>>>> I assume:
> >>>>>>>
> >>>>>>> The firmware memmap was added primarily for x86-64 kexec (and still, 
> >>>>>>> is
> >>>>>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When 
> >>>>>>> DIMMs
> >>>>>>> get hotplugged on real HW, they get added to e820. Same applies to
> >>>>>>> memory added via HyperV balloon (unless memory is unplugged via
> >>>>>>> ballooning and you reboot ... the the e820 is changed as well). I 
> >>>>>>> assume
> >>>>>>> we wanted to be able to reflect that, to make kexec look like a real 
> >>>>>>> reboot.
> >>>>>>>
> >>>>>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> >>>>>>>
> >>>>>>>
> >>>>>>> But I assume only Andrew can enlighten us.
> >>>>>>>
> >>>>>>> @Andrew, any guidance here? Should we really add all memory to the
> >>>>>>> firmware memmap, even if this contradicts with the existing
> >>>>>>> documentation? (especially, if the actual firmware memmap will *not*
> >>>>>>> contain that memory after a reboot)
> >>>>>>
> >>>>>> For some reason that patch is misattributed - it was authored by
> >>>>>> Shaohui Zheng , who hasn't been heard from in
> >>>>>> a decade.  I looked through the email discussion from that time and I'm
> >>>>>> not seeing anything useful.  But I wasn't able to locate Dave Hansen's
> >>>>>> review comments.
> >>>>>
> >>>>> Okay, thanks for checking. I think the documentation from 2008 is pretty
> >>>>> clear what has to be done here. I will add some of these details to the
> >>>>> patch description.
> >>>>>
> >>>>> Also, now that I know that esp. kexec-tools already don't consider
> >>>>> dax/kmem memory properly (memory will not get dumped via kdump) and
> >>>>> won't really suffer from a name change in /proc/iomem, I will go back to
> >>>>> the MHP_DRIVER_MANAGED approach and
> >>>>> 1. Don't create firmware memmap entries
> >>>>> 2. Name the resource "System RAM (driver managed)"
> >>>>> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
> >>>>>
> >>>>> This way, kernel users and user space can figure out that this memory
> >>>>> has different semantics and handle it accordingly - I think that was
> >>>>> what Eric was asking for.
> >>>>>
> >>>>> Of course, open for suggestions.
> >>>>
> >>>> I'm still more of a fan of this being communicated by "System RAM"
> >>>
> >>> I was mentioning somewhere in this thread that "System RAM" inside a
> >>> hierarchy (like dax/kmem) will already be basically ignored by
> >>> kexec-tools. So, placing it inside a hierarchy already makes it look
> >>> special already.
> >>>
> >>> But after all, as we have to change kexec-tools either way, we can
> >>> directly go ahead and flag it properly as special (in case there will
> >>> ever be other cases where we could no longer distinguish it).
> >>>
> >>>> being parented especially because that tells you something about how
> >>>> the memory is driver-managed and which mechanism might be in play.
> >>>
> >>> The 

Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-05-01 Thread Dan Williams
On Fri, May 1, 2020 at 10:21 AM David Hildenbrand  wrote:
>
> On 01.05.20 18:56, Dan Williams wrote:
> > On Fri, May 1, 2020 at 2:34 AM David Hildenbrand  wrote:
> >>
> >> On 01.05.20 00:24, Andrew Morton wrote:
> >>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand  
> >>> wrote:
> >>>
> >>>>>
> >>>>> Why does the firmware map support hotplug entries?
> >>>>
> >>>> I assume:
> >>>>
> >>>> The firmware memmap was added primarily for x86-64 kexec (and still, is
> >>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
> >>>> get hotplugged on real HW, they get added to e820. Same applies to
> >>>> memory added via HyperV balloon (unless memory is unplugged via
> >>>> ballooning and you reboot ... the the e820 is changed as well). I assume
> >>>> we wanted to be able to reflect that, to make kexec look like a real 
> >>>> reboot.
> >>>>
> >>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> >>>>
> >>>>
> >>>> But I assume only Andrew can enlighten us.
> >>>>
> >>>> @Andrew, any guidance here? Should we really add all memory to the
> >>>> firmware memmap, even if this contradicts with the existing
> >>>> documentation? (especially, if the actual firmware memmap will *not*
> >>>> contain that memory after a reboot)
> >>>
> >>> For some reason that patch is misattributed - it was authored by
> >>> Shaohui Zheng , who hasn't been heard from in
> >>> a decade.  I looked through the email discussion from that time and I'm
> >>> not seeing anything useful.  But I wasn't able to locate Dave Hansen's
> >>> review comments.
> >>
> >> Okay, thanks for checking. I think the documentation from 2008 is pretty
> >> clear what has to be done here. I will add some of these details to the
> >> patch description.
> >>
> >> Also, now that I know that esp. kexec-tools already don't consider
> >> dax/kmem memory properly (memory will not get dumped via kdump) and
> >> won't really suffer from a name change in /proc/iomem, I will go back to
> >> the MHP_DRIVER_MANAGED approach and
> >> 1. Don't create firmware memmap entries
> >> 2. Name the resource "System RAM (driver managed)"
> >> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
> >>
> >> This way, kernel users and user space can figure out that this memory
> >> has different semantics and handle it accordingly - I think that was
> >> what Eric was asking for.
> >>
> >> Of course, open for suggestions.
> >
> > I'm still more of a fan of this being communicated by "System RAM"
>
> I was mentioning somewhere in this thread that "System RAM" inside a
> hierarchy (like dax/kmem) will already be basically ignored by
> kexec-tools. So, placing it inside a hierarchy already makes it look
> special already.
>
> But after all, as we have to change kexec-tools either way, we can
> directly go ahead and flag it properly as special (in case there will
> ever be other cases where we could no longer distinguish it).
>
> > being parented especially because that tells you something about how
> > the memory is driver-managed and which mechanism might be in play.
>
> The could be communicated to some degree via the resource hierarchy.
>
> E.g.,
>
> [root@localhost ~]# cat /proc/iomem
> ...
> 14000-33fff : Persistent Memory
>   14000-1481f : namespace0.0
>   15000-33fff : dax0.0
> 15000-33fff : System RAM (driver managed)
>
> vs.
>
>:/# cat /proc/iomem
> [...]
> 14000-333ff : virtio-mem (virtio0)
>   14000-147ff : System RAM (driver managed)
>   14800-14fff : System RAM (driver managed)
>   15000-157ff : System RAM (driver managed)
>
> Good enough for my taste.
>
> > What about adding an optional /sys/firmware/memmap/X/parent attribute.
>
> I really don't want any firmware memmap entries for something that is
> not part of the firmware provided memmap. In addition,
> /sys/firmware/memmap/ is still a fairly x86_64 specific thing. Only mips
> and two arm configs enable it at all.
>
> So, IMHO, /sys/firmware/memmap/ is definitely not the way to go.

I think that's a policy decision and policy decisions do not belong in
the kernel. Give the tooling the opportunity to decide whether System
RAM stays that way over a kexec. The parenthetical reference otherwise
looks out of place to me in the /proc/iomem output. What makes it
"driver managed" is how the kernel handles it, not how the kernel
names it.


Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-05-01 Thread Dan Williams
On Fri, May 1, 2020 at 2:34 AM David Hildenbrand  wrote:
>
> On 01.05.20 00:24, Andrew Morton wrote:
> > On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand  
> > wrote:
> >
> >>>
> >>> Why does the firmware map support hotplug entries?
> >>
> >> I assume:
> >>
> >> The firmware memmap was added primarily for x86-64 kexec (and still, is
> >> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
> >> get hotplugged on real HW, they get added to e820. Same applies to
> >> memory added via HyperV balloon (unless memory is unplugged via
> >> ballooning and you reboot ... the the e820 is changed as well). I assume
> >> we wanted to be able to reflect that, to make kexec look like a real 
> >> reboot.
> >>
> >> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> >>
> >>
> >> But I assume only Andrew can enlighten us.
> >>
> >> @Andrew, any guidance here? Should we really add all memory to the
> >> firmware memmap, even if this contradicts with the existing
> >> documentation? (especially, if the actual firmware memmap will *not*
> >> contain that memory after a reboot)
> >
> > For some reason that patch is misattributed - it was authored by
> > Shaohui Zheng , who hasn't been heard from in
> > a decade.  I looked through the email discussion from that time and I'm
> > not seeing anything useful.  But I wasn't able to locate Dave Hansen's
> > review comments.
>
> Okay, thanks for checking. I think the documentation from 2008 is pretty
> clear what has to be done here. I will add some of these details to the
> patch description.
>
> Also, now that I know that esp. kexec-tools already don't consider
> dax/kmem memory properly (memory will not get dumped via kdump) and
> won't really suffer from a name change in /proc/iomem, I will go back to
> the MHP_DRIVER_MANAGED approach and
> 1. Don't create firmware memmap entries
> 2. Name the resource "System RAM (driver managed)"
> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
>
> This way, kernel users and user space can figure out that this memory
> has different semantics and handle it accordingly - I think that was
> what Eric was asking for.
>
> Of course, open for suggestions.

I'm still more of a fan of this being communicated by "System RAM"
being parented especially because that tells you something about how
the memory is driver-managed and which mechanism might be in play.
What about adding an optional /sys/firmware/memmap/X/parent attribute.
This lets tooling check if it cares via that interface and lets it
lookup the related infrastructure to interact with if it would do
something different for virtio-mem vs dax/kmem?


Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-04-30 Thread Dan Williams
On Thu, Apr 30, 2020 at 11:44 AM David Hildenbrand  wrote:
>
>  >>> If the class of memory is different then please by all means let's mark
> >>> it differently in struct resource so everyone knows it is different.
> >>> But that difference needs to be more than hotplug.
> >>>
> >>> That difference needs to be the hypervisor loaned us memory and might
> >>> take it back at any time, or this memory is persistent and so it has
> >>> these different characteristics so don't use it as ordinary ram.
> >>
> >> Yes, and I think kmem took an excellent approach of explicitly putting
> >> that "System RAM" into a resource hierarchy. That "System RAM" won't
> >> show up as a root node under /proc/iomem (see patch #3), which already
> >> results in kexec-tools to treat it in a special way. I am thinking about
> >> doing the same for virtio-mem.
> >
> > Reading this and your patch cover letters again my concern is that
> > the justification seems to be letting the tail wag the dog.
> >
> > You want kexec-tools to behave in a certain way so you are changing the
> > kernel.
> >
> > Rather it should be change the kernel to clearly reflect reality and if
> > you can get away without a change to kexec-tools that is a bonus.
> >
>
> Right, because user space has to have a way to figure out what to do.
>
> But talking about the firmware memmap, indicating something via a "raw
> firmware-provided memory map", that is not actually in the "raw
> firmware-provided memory map" feels wrong to me. (below)
>
>
> >>> That information is also useful to other people looking at the system
> >>> and seeing what is going on.
> >>>
> >>> Just please don't muddle the concepts, or assume that whatever subset of
> >>> hotplug memory you are dealing with is the only subset.
> >>
> >> I can certainly rephrase the subject/description/comment, stating that
> >> this is not to be used for ordinary hotplugged DIMMs - only when the
> >> device driver is under control to decide what to do with that memory -
> >> especially when kexec'ing.
> >>
> >> (previously, I called this flag MHP_DRIVER_MANAGED, but I think
> >> MHP_NO_FIRMWARE_MEMMAP is clearer, we just need a better description)
> >>
> >> Would that make it clearer?
> >
> > I am not certain, but Andrew Morton deliberately added that
> > firmware_map_add_hotplug call.  Which means that there is a reason
> > for putting hotplugged memory in the firmware map.
> >
> > So the justification needs to take that reason into account.  The
> > justification can not be it is hotplugged therefore it should not belong
> > in the firmware memory map.  Unless you can show that
> > firmware_map_add_hotplug that was actually a bug and should be removed.
> > But as it has been that way since 2010 that seems like a long shot.
> >
> > So my question is what is right for the firmware map?
>
> We have documentation for that since 2008. Andrews patch is from 2010.
>
> Documentation/ABI/testing/sysfs-firmware-memmap
>
> It clearly talks about "raw firmware-provided memory map" and why the
> interface was introduced at all ("on most architectures that
> firmware-provided memory map is modified afterwards by the kernel itself").
>
> >
> > Why does the firmware map support hotplug entries?
>
> I assume:
>
> The firmware memmap was added primarily for x86-64 kexec (and still, is
> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
> get hotplugged on real HW, they get added to e820. Same applies to
> memory added via HyperV balloon (unless memory is unplugged via
> ballooning and you reboot ... the the e820 is changed as well). I assume
> we wanted to be able to reflect that, to make kexec look like a real reboot.

I can at least say that this breakdown makes sense to me. Traditional
memory hotplug results in permanent change to the raw firmware memory
map reported by the host at next reboot. These device-driver-owned
memory regions really want a hotplug policy per-kernel boot instance
and should fall back to the default reserved state at reboot (kexec or
otherwise). When I say hotplug-policy I mean whether the current
kernel wants to treat the device range as System RAM or leave it as
device-managed. The intent is that the follow-on kernel needs to
re-decide the device policy.

>
> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
>


Re: [PATCH v1 2/3] mm/memory_hotplug: Introduce MHP_DRIVER_MANAGED

2020-04-30 Thread Dan Williams
On Thu, Apr 30, 2020 at 1:21 AM David Hildenbrand  wrote:
> >> Just because we decided to use some DAX memory in the current kernel as
> >> system ram, doesn't mean we should make that decision for the kexec
> >> kernel (e.g., using it as initial memory, placing kexec binaries onto
> >> it, etc.). This is also not what we would observe during a real reboot.
> >
> > Agree.
> >
> >> I can see that the "System RAM" resource will show up as child resource
> >> under the device e.g., in /proc/iomem.
> >>
> >> However, entries in /sys/firmware/memmap/ are created as "System RAM".
> >
> > True. Do you think this rename should just be limited to what type
> > /sys/firmware/memmap/ emits? I have the concern, but no proof
>
> We could split this patch into
>
> MHP_NO_FIRMWARE_MEMMAP (create firmware memmap entries)
>
> and
>
> MHP_DRIVER_MANAGED (name of the resource)
>
> See below, the latter might not be needed.
>
> > currently, that there are /proc/iomem walkers that explicitly look for
> > "System RAM", but might be thrown off by "System RAM (driver
> > managed)". I was not aware of /sys/firmware/memmap until about 5
> > minutes ago.
>
> The only two users of /proc/iomem I am aware of are kexec-tools and some
> s390x tools.
>
> kexec-tools on x86-64 uses /sys/firmware/memmap to craft the initial
> memmap, but uses /proc/iomem to
> a) Find places for kexec images
> b) Detect memory regions to dump via kdump
>
> I am not yet sure if we really need the "System RAM (driver managed)"
> part. If we can teach kexec-tools to
> a) Don't place kexec images on "System RAM" that has a parent resource
> (most likely requires kexec-tools changes)
> b) Consider for kdump "System RAM" that has a parent resource
> we might be able to avoid renaming that. (I assume that's already done)
>
> E.g., regarding virtio-mem (patch #3) I am currently also looking into
> creating a parent resource instead, like dax/kmem to avoid the rename:
>
> :/# cat /proc/iomem
> -0fff : Reserved
> [...]
> 1-13fff : System RAM
> 14000-33fff : virtio0
>   14000-147ff : System RAM
>   14800-14fff : System RAM
>   15000-157ff : System RAM
> 34000-303fff : virtio1
>   34000-347ff : System RAM
> 328000-32 : PCI Bus :00

Looks good to me if it flies with kexec-tools.


Re: [PATCH v1 2/3] mm/memory_hotplug: Introduce MHP_DRIVER_MANAGED

2020-04-30 Thread Dan Williams
On Thu, Apr 30, 2020 at 12:20 AM David Hildenbrand  wrote:
>
> On 29.04.20 18:08, David Hildenbrand wrote:
> > Some paravirtualized devices that add memory via add_memory() and
> > friends (esp. virtio-mem) don't want to create entries in
> > /sys/firmware/memmap/ - primarily to hinder kexec from adding this
> > memory to the boot memmap of the kexec kernel.
> >
> > In fact, such memory is never exposed via the firmware (e.g., e820), but
> > only via the device, so exposing this memory via /sys/firmware/memmap/ is
> > wrong:
> >  "kexec needs the raw firmware-provided memory map to setup the
> >   parameter segment of the kernel that should be booted with
> >   kexec. Also, the raw memory map is useful for debugging. For
> >   that reason, /sys/firmware/memmap is an interface that provides
> >   the raw memory map to userspace." [1]
> >
> > We want to let user space know that memory which is always detected,
> > added, and managed via a (device) driver - like memory managed by
> > virtio-mem - is special. It cannot be used for placing kexec segments
> > and the (device) driver is responsible for re-adding memory that
> > (eventually shrunk/grown/defragmented) memory after a reboot/kexec. It
> > should e.g., not be added to a fixed up firmware memmap. However, it should
> > be dumped by kdump.
> >
> > Also, such memory could behave differently than an ordinary DIMM - e.g.,
> > memory managed by virtio-mem can have holes inside added memory resource,
> > which should not be touched, especially for writing.
> >
> > Let's expose that memory as "System RAM (driver managed)" e.g., via
> > /pro/iomem.
> >
> > We don't have to worry about firmware_map_remove() on the removal path.
> > If there is no entry, it will simply return with -EINVAL.
> >
> > [1] 
> > https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmap
> >
> > Cc: Andrew Morton 
> > Cc: Michal Hocko 
> > Cc: Pankaj Gupta 
> > Cc: Wei Yang 
> > Cc: Baoquan He 
> > Cc: Eric Biederman 
> > Signed-off-by: David Hildenbrand 
> > ---
> >  include/linux/memory_hotplug.h |  8 
> >  mm/memory_hotplug.c| 20 
> >  2 files changed, 24 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > index bf0e3edb8688..cc538584b39e 100644
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -68,6 +68,14 @@ struct mhp_params {
> >   pgprot_t pgprot;
> >  };
> >
> > +/* Flags used for add_memory() and friends. */
> > +
> > +/*
> > + * Don't create entries in /sys/firmware/memmap/ and expose memory as
> > + * "System RAM (driver managed)" in e.g., /proc/iomem
> > + */
> > +#define MHP_DRIVER_MANAGED   1
> > +
> >  /*
> >   * Zone resizing functions
> >   *
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index ebdf6541d074..cfa0721280aa 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -98,11 +98,11 @@ void mem_hotplug_done(void)
> >  u64 max_mem_size = U64_MAX;
> >
> >  /* add this memory to iomem resource */
> > -static struct resource *register_memory_resource(u64 start, u64 size)
> > +static struct resource *register_memory_resource(u64 start, u64 size,
> > +  const char *resource_name)
> >  {
> >   struct resource *res;
> >   unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> > - char *resource_name = "System RAM";
> >
> >   /*
> >* Make sure value parsed from 'mem=' only restricts memory adding
> > @@ -1058,7 +1058,8 @@ int __ref add_memory_resource(int nid, struct 
> > resource *res,
> >   BUG_ON(ret);
> >
> >   /* create new memmap entry */
> > - firmware_map_add_hotplug(start, start + size, "System RAM");
> > + if (!(flags & MHP_DRIVER_MANAGED))
> > + firmware_map_add_hotplug(start, start + size, "System RAM");
> >
> >   /* device_online() will take the lock when calling online_pages() */
> >   mem_hotplug_done();
> > @@ -1081,10 +1082,21 @@ int __ref add_memory_resource(int nid, struct 
> > resource *res,
> >  /* requires device_hotplug_lock, see add_memory_resource() */
> >  int __ref __add_memory(int nid, u64 start, u64 size, unsigned long flags)
> >  {
> > + const char *resource_name = "System RAM";
> >   struct resource *res;
> >   int ret;
> >
> > - res = register_memory_resource(start, size);
> > + /*
> > +  * Indicate that memory managed by a driver is special. It's always
> > +  * detected and added via a driver, should not be given to the kexec
> > +  * kernel for booting when manually crafting the firmware memmap, and
> > +  * no kexec segments should be placed on it. However, kdump should
> > +  * dump this memory.
> > +  */
> > + if (flags & MHP_DRIVER_MANAGED)
> > + resource_name = "System RAM (driver managed)";
> > +
> > + res = register_memory_resource(start, size, 

Re: [PATCH v5 4/4] powerpc/papr_scm: Implement support for DSM_PAPR_SCM_HEALTH

2020-04-03 Thread Dan Williams
On Tue, Mar 31, 2020 at 7:33 AM Vaibhav Jain  wrote:
>
> This patch implements support for papr_scm command
> 'DSM_PAPR_SCM_HEALTH' that returns a newly introduced 'struct
> nd_papr_scm_dimm_health_stat' instance containing dimm health
> information back to user space in response to ND_CMD_CALL. This
> functionality is implemented in newly introduced papr_scm_get_health()
> that queries the scm-dimm health information and then copies these bitmaps
> to the package payload whose layout is defined by 'struct
> papr_scm_ndctl_health'.
>
> The patch also introduces a new member a new member 'struct
> papr_scm_priv.health' thats an instance of 'struct
> nd_papr_scm_dimm_health_stat' to cache the health information of a
> scm-dimm. As a result functions drc_pmem_query_health() and
> papr_flags_show() are updated to populate and use this new struct
> instead of two be64 integers that we earlier used.

Link to HCALL specification?

>
> Signed-off-by: Vaibhav Jain 
> ---
> Changelog:
>
> v4..v5: None
>
> v3..v4: Call the DSM_PAPR_SCM_HEALTH service function from
> papr_scm_service_dsm() instead of papr_scm_ndctl(). [Aneesh]
>
> v2..v3: Updated struct nd_papr_scm_dimm_health_stat_v1 to use '__xx'
> types as its exported to the userspace [Aneesh]
> Changed the constants DSM_PAPR_SCM_DIMM_XX indicating dimm
> health from enum to #defines [Aneesh]
>
> v1..v2: New patch in the series
> ---
>  arch/powerpc/include/uapi/asm/papr_scm_dsm.h |  40 +++
>  arch/powerpc/platforms/pseries/papr_scm.c| 109 ---
>  2 files changed, 132 insertions(+), 17 deletions(-)
>
> diff --git a/arch/powerpc/include/uapi/asm/papr_scm_dsm.h 
> b/arch/powerpc/include/uapi/asm/papr_scm_dsm.h
> index c039a49b41b4..8265125304ca 100644
> --- a/arch/powerpc/include/uapi/asm/papr_scm_dsm.h
> +++ b/arch/powerpc/include/uapi/asm/papr_scm_dsm.h
> @@ -132,6 +132,7 @@ struct nd_papr_scm_cmd_pkg {
>   */
>  enum dsm_papr_scm {
> DSM_PAPR_SCM_MIN =  0x1,
> +   DSM_PAPR_SCM_HEALTH,
> DSM_PAPR_SCM_MAX,
>  };
>
> @@ -158,4 +159,43 @@ static void *papr_scm_pcmd_to_payload(struct 
> nd_papr_scm_cmd_pkg *pcmd)
> else
> return (void *)((__u8 *) pcmd + pcmd->payload_offset);
>  }
> +
> +/* Various scm-dimm health indicators */
> +#define DSM_PAPR_SCM_DIMM_HEALTHY   0
> +#define DSM_PAPR_SCM_DIMM_UNHEALTHY 1
> +#define DSM_PAPR_SCM_DIMM_CRITICAL  2
> +#define DSM_PAPR_SCM_DIMM_FATAL 3
> +
> +/*
> + * Struct exchanged between kernel & ndctl in for PAPR_DSM_PAPR_SMART_HEALTH
> + * Various bitflags indicate the health status of the dimm.
> + *
> + * dimm_unarmed: Dimm not armed. So contents wont persist.
> + * dimm_bad_shutdown   : Previous shutdown did not persist contents.
> + * dimm_bad_restore: Contents from previous shutdown werent restored.
> + * dimm_scrubbed   : Contents of the dimm have been scrubbed.
> + * dimm_locked : Contents of the dimm cant be modified until CEC 
> reboot
> + * dimm_encrypted  : Contents of dimm are encrypted.
> + * dimm_health : Dimm health indicator.
> + */
> +struct nd_papr_scm_dimm_health_stat_v1 {
> +   __u8 dimm_unarmed;
> +   __u8 dimm_bad_shutdown;
> +   __u8 dimm_bad_restore;
> +   __u8 dimm_scrubbed;
> +   __u8 dimm_locked;
> +   __u8 dimm_encrypted;
> +   __u16 dimm_health;
> +};

Does the structure pack the same across different compilers and configurations?

> +
> +/*
> + * Typedef the current struct for dimm_health so that any application
> + * or kernel recompiled after introducing a new version automatically
> + * supports the new version.
> + */
> +#define nd_papr_scm_dimm_health_stat nd_papr_scm_dimm_health_stat_v1
> +
> +/* Current version number for the dimm health struct */
> +#define ND_PAPR_SCM_DIMM_HEALTH_VERSION 1
> +
>  #endif /* _UAPI_ASM_POWERPC_PAPR_SCM_DSM_H_ */
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index e8ce96d2249e..ce94762954e0 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -47,8 +47,7 @@ struct papr_scm_priv {
> struct mutex dimm_mutex;
>
> /* Health information for the dimm */
> -   __be64 health_bitmap;
> -   __be64 health_bitmap_valid;
> +   struct nd_papr_scm_dimm_health_stat health;
>  };
>
>  static int drc_pmem_bind(struct papr_scm_priv *p)
> @@ -158,6 +157,7 @@ static int drc_pmem_query_health(struct papr_scm_priv *p)
>  {
> unsigned long ret[PLPAR_HCALL_BUFSIZE];
> int64_t rc;
> +   __be64 health;
>
> rc = plpar_hcall(H_SCM_HEALTH, ret, p->drc_index);
> if (rc != H_SUCCESS) {
> @@ -172,13 +172,41 @@ static int drc_pmem_query_health(struct papr_scm_priv 
> *p)
> return rc;
>
> /* Store the retrieved health information in dimm platform data */
> -   p->health_bitmap = ret[0];
> -   

Re: [PATCH v5 3/4] powerpc/papr_scm,uapi: Add support for handling PAPR DSM commands

2020-04-03 Thread Dan Williams
On Tue, Mar 31, 2020 at 7:33 AM Vaibhav Jain  wrote:
>
> Implement support for handling PAPR DSM commands in papr_scm
> module. We advertise support for ND_CMD_CALL for the dimm command mask
> and implement necessary scaffolding in the module to handle ND_CMD_CALL
> ioctl and DSM commands that we receive.

They aren't ACPI Device Specific Methods in the papr_scm case, right?
I'd call them what the papr_scm specification calls them and replace
"DSM" throughout.

> The layout of the DSM commands as we expect from libnvdimm/libndctl is
> described in newly introduced uapi header 'papr_scm_dsm.h' which
> defines a new 'struct nd_papr_scm_cmd_pkg' header. This header is used
> to communicate the DSM command via 'nd_pkg_papr_scm->nd_command' and
> size of payload that need to be sent/received for servicing the DSM.
>
> The PAPR DSM commands are assigned indexes started from 0x1 to
> prevent them from overlapping ND_CMD_* values and also makes handling
> dimm commands in papr_scm_ndctl().

You don't necessarily need to have command number separation like
that. The function number spaces are unique per family.

> A new function cmd_to_func() is
> implemented that reads the args to papr_scm_ndctl() and performs
> sanity tests on them. In case of a DSM command being sent via
> ND_CMD_CALL a newly introduced function papr_scm_service_dsm() is
> called to handle the request.
>
> Signed-off-by: Vaibhav Jain 
>
> ---
> Changelog:
>
> v4..v5: Fixed a bug in new implementation of papr_scm_ndctl().
>
> v3..v4: Updated papr_scm_ndctl() to delegate DSM command handling to a
> different function papr_scm_service_dsm(). [Aneesh]
>
> v2..v3: Updated the nd_papr_scm_cmd_pkg to use __xx types as its
> exported to the userspace [Aneesh]
>
> v1..v2: New patch in the series.
> ---
>  arch/powerpc/include/uapi/asm/papr_scm_dsm.h | 161 +++
>  arch/powerpc/platforms/pseries/papr_scm.c|  97 ++-
>  2 files changed, 252 insertions(+), 6 deletions(-)
>  create mode 100644 arch/powerpc/include/uapi/asm/papr_scm_dsm.h
>
> diff --git a/arch/powerpc/include/uapi/asm/papr_scm_dsm.h 
> b/arch/powerpc/include/uapi/asm/papr_scm_dsm.h
> new file mode 100644
> index ..c039a49b41b4
> --- /dev/null
> +++ b/arch/powerpc/include/uapi/asm/papr_scm_dsm.h
> @@ -0,0 +1,161 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +/*
> + * PAPR SCM Device specific methods and struct for libndctl and ndctl
> + *
> + * (C) Copyright IBM 2020
> + *
> + * Author: Vaibhav Jain 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2, or (at your option)
> + * any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.

These 2 paragraphs of redundant license text can be dropped. The SPDX
line is sufficient.

> + */
> +
> +#ifndef _UAPI_ASM_POWERPC_PAPR_SCM_DSM_H_
> +#define _UAPI_ASM_POWERPC_PAPR_SCM_DSM_H_
> +
> +#include 
> +
> +#ifdef __KERNEL__
> +#include 
> +#else
> +#include 
> +#endif
> +
> +/*
> + * DSM Envelope:
> + *
> + * The ioctl ND_CMD_CALL transfers data between user-space and kernel via
> + * 'envelopes' which consists of a header and user-defined payload sections.
> + * The header is described by 'struct nd_papr_scm_cmd_pkg' which expects a
> + * payload following it and offset of which relative to the struct is 
> provided
> + * by 'nd_papr_scm_cmd_pkg.payload_offset'. *
> + *
> + *  +-+-+---+
> + *  |   64-Bytes  |   8-Bytes   |   Max 184-Bytes   |
> + *  +-+-+---+
> + *  |   nd_papr_scm_cmd_pkg |   |
> + *  |-+ |   |
> + *  |  nd_cmd_pkg | |   |
> + *  +-+-+---+
> + *  | nd_family   ||   |
> + *  | nd_size_out | cmd_status  |  |
> + *  | nd_size_in  | payload_version |  PAYLOAD |
> + *  | nd_command  | payload_offset ->  |
> + *  | nd_fw_size  | |  |
> + *  +-+-+---+
> + *
> + * DSM Header:
> + *
> + * The header is defined as 'struct nd_papr_scm_cmd_pkg' which embeds a
> + * 'struct nd_cmd_pkg' instance. The DSM command is assigned to member
> + * 'nd_cmd_pkg.nd_command'. Apart from size information of the envelop which 
> is

 s/envelop/envelope/

There's a 

Re: [PATCH v5 2/4] ndctl/uapi: Introduce NVDIMM_FAMILY_PAPR_SCM as a new NVDIMM DSM family

2020-04-03 Thread Dan Williams
On Tue, Mar 31, 2020 at 7:33 AM Vaibhav Jain  wrote:
>
> Add PAPR-scm family of DSM command-set to the white list of NVDIMM
> command sets.
>
> Signed-off-by: Vaibhav Jain 
> ---
> Changelog:
>
> v4..v5 : None
>
> v3..v4 : None
>
> v2..v3 : Updated the patch prefix to 'ndctl/uapi' [Aneesh]
>
> v1..v2 : None
> ---
>  include/uapi/linux/ndctl.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
> index de5d90212409..99fb60600ef8 100644
> --- a/include/uapi/linux/ndctl.h
> +++ b/include/uapi/linux/ndctl.h
> @@ -244,6 +244,7 @@ struct nd_cmd_pkg {
>  #define NVDIMM_FAMILY_HPE2 2
>  #define NVDIMM_FAMILY_MSFT 3
>  #define NVDIMM_FAMILY_HYPERV 4
> +#define NVDIMM_FAMILY_PAPR_SCM 5

Looks good, but please squash it with patch 3.


Re: [PATCH v5 1/4] powerpc/papr_scm: Fetch nvdimm health information from PHYP

2020-04-02 Thread Dan Williams
On Wed, Apr 1, 2020 at 8:08 PM Dan Williams  wrote:
[..]
> >  * "locked" : Indicating that nvdimm contents cant be modified
> >until next power cycle.
>
> There is the generic NDD_LOCKED flag, can you use that? ...and in
> general I wonder if we should try to unify all the common papr_scm and
> nfit health flags in a generic location. It will already be the case
> the ndctl needs to look somewhere papr specific for this data maybe it
> all should have been generic from the beginning.

The more I think about this more I think this would be a good time to
introduce a common "health/" attribute group under the generic nmemX
sysfs, and then have one flag per-file / attribute. Not only does that
match the recommended sysfs ABI better, but it allows ndctl to
enumerate which flags are supported in addition to their state.

> In any event, can you also add this content to a new
> Documentation/ABI/testing/sysfs-bus-papr? See sysfs-bus-nfit for
> comparison.


Re: [PATCH v4 16/25] nvdimm/ocxl: Implement the Read Error Log command

2020-04-02 Thread Dan Williams
On Tue, Mar 31, 2020 at 1:59 AM Alastair D'Silva  wrote:
>
> The read error log command extracts information from the controller's
> internal error log.
>
> This patch exposes this information in 2 ways:
> - During probe, if an error occurs & a log is available, print it to the
>   console
> - After probe, make the error log available to userspace via an IOCTL.
>   Userspace is notified of pending error logs in a later patch
>   ("powerpc/powernv/pmem: Forward events to userspace")

So, have a look at the recent papr_scm patches to add health flags and
smart data retrieval. I'd prefer to extend existing nvdimm device
retrieval mechanisms than invent new ones.


>
> Signed-off-by: Alastair D'Silva 
> ---
>  .../userspace-api/ioctl/ioctl-number.rst  |   1 +
>  drivers/nvdimm/ocxl/main.c| 240 ++
>  include/uapi/nvdimm/ocxlpmem.h|  46 
>  3 files changed, 287 insertions(+)
>  create mode 100644 include/uapi/nvdimm/ocxlpmem.h
>
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst 
> b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 9425377615ce..ba0ce7dca643 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -340,6 +340,7 @@ Code  Seq#Include File
>Comments
>  0xC0  00-0F  linux/usb/iowarrior.h
>  0xCA  00-0F  uapi/misc/cxl.h
>  0xCA  10-2F  uapi/misc/ocxl.h
> +0xCA  30-3F  uapi/nvdimm/ocxlpmem.h  
> OpenCAPI Persistent Memory
>  0xCA  80-BF  uapi/scsi/cxlflash_ioctl.h
>  0xCB  00-1F  CBM 
> serial IEC bus in development:
>   
> 
> diff --git a/drivers/nvdimm/ocxl/main.c b/drivers/nvdimm/ocxl/main.c
> index 9b85fcd3f1c9..e6be0029f658 100644
> --- a/drivers/nvdimm/ocxl/main.c
> +++ b/drivers/nvdimm/ocxl/main.c
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "ocxlpmem.h"
>
>  static const struct pci_device_id pci_tbl[] = {
> @@ -401,10 +402,190 @@ static int file_release(struct inode *inode, struct 
> file *file)
> return 0;
>  }
>
> +/**
> + * error_log_header_parse() - Parse the first 64 bits of the error log 
> command response
> + * @ocxlpmem: the device metadata
> + * @length: out, returns the number of bytes in the response (excluding the 
> 64 bit header)
> + */
> +static int error_log_header_parse(struct ocxlpmem *ocxlpmem, u16 *length)
> +{
> +   int rc;
> +   u64 val;
> +   u16 data_identifier;
> +   u32 data_length;
> +
> +   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
> +ocxlpmem->admin_command.data_offset,
> +OCXL_LITTLE_ENDIAN, );
> +   if (rc)
> +   return rc;
> +
> +   data_identifier = val >> 48;
> +   data_length = val & 0x;
> +
> +   if (data_identifier != 0x454C) { // 'EL'
> +   dev_err(>dev,
> +   "Bad data identifier for error log data, expected 
> 'EL', got '%2s' (%#x), data_length=%u\n",
> +   (char *)_identifier,
> +   (unsigned int)data_identifier, data_length);
> +   return -EINVAL;
> +   }
> +
> +   *length = data_length;
> +   return 0;
> +}
> +
> +static int read_error_log(struct ocxlpmem *ocxlpmem,
> + struct ioctl_ocxlpmem_error_log *log,
> + bool buf_is_user)
> +{
> +   u64 val;
> +   u16 user_buf_length;
> +   u16 buf_length;
> +   u64 *buf = (u64 *)log->buf_ptr;
> +   u16 i;
> +   int rc;
> +
> +   if (log->buf_size % 8)
> +   return -EINVAL;
> +
> +   rc = ocxlpmem_chi(ocxlpmem, );
> +   if (rc)
> +   return rc;
> +
> +   if (!(val & GLOBAL_MMIO_CHI_ELA))
> +   return -EAGAIN;
> +
> +   user_buf_length = log->buf_size;
> +
> +   mutex_lock(>admin_command.lock);
> +
> +   rc = admin_command_execute(ocxlpmem, ADMIN_COMMAND_ERRLOG);
> +   if (rc != STATUS_SUCCESS) {
> +   warn_status(ocxlpmem,
> +   "Unexpected status from retrieve error log", rc);
> +   goto out;
> +   }
> +
> +   rc = error_log_header_parse(ocxlpmem, >buf_size);
> +   if (rc)
> +   goto out;
> +   // log->buf_size now contains the returned buffer size, not the user 
> size
> +
> +   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
> +ocxlpmem->admin_command.data_offset + 
> 0x08,
> +OCXL_LITTLE_ENDIAN, );
> +   if (rc)
> +   goto out;
> +
> +   log->log_identifier = val >> 32;
> +   log->program_reference_code = val & 0x;
> +
> 

Re: [PATCH v4 14/25] nvdimm/ocxl: Add support for Admin commands

2020-04-02 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> Admin commands for these devices are the primary means of interacting
> with the device controller to provide functionality beyond the load/store
> capabilities offered via the NPU.
>
> For example, SMART data, firmware update, and device error logs are
> implemented via admin commands.
>
> This patch requests the metadata required to issue admin commands, as well
> as some helper functions to construct and check the completion of the
> commands.
>
> Signed-off-by: Alastair D'Silva 
> ---
>  drivers/nvdimm/ocxl/main.c  |  65 ++
>  drivers/nvdimm/ocxl/ocxlpmem.h  |  50 -
>  drivers/nvdimm/ocxl/ocxlpmem_internal.c | 261 
>  3 files changed, 375 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvdimm/ocxl/main.c b/drivers/nvdimm/ocxl/main.c
> index be76acd33d74..8db573036423 100644
> --- a/drivers/nvdimm/ocxl/main.c
> +++ b/drivers/nvdimm/ocxl/main.c
> @@ -217,6 +217,58 @@ static int register_lpc_mem(struct ocxlpmem *ocxlpmem)
> return 0;
>  }
>
> +/**
> + * extract_command_metadata() - Extract command data from MMIO & save it for 
> further use
> + * @ocxlpmem: the device metadata
> + * @offset: The base address of the command data structures (address of 
> CREQO)
> + * @command_metadata: A pointer to the command metadata to populate
> + * Return: 0 on success, negative on failure
> + */
> +static int extract_command_metadata(struct ocxlpmem *ocxlpmem, u32 offset,
> +   struct command_metadata *command_metadata)

How about "struct ocxlpmem *ocp" throughout all these patches? The
full duplication of the type name as the local variable name makes
this look like non-idiomatic Linux code to me. It had not quite hit me
until I saw "struct command_metadata *command_metadata" that just
strikes me as too literal and the person that gets to maintain this
code later will appreciate a smaller amount of typing.

Also, is it really the case that the layout of the admin command
metadata needs to be programmatically determined at runtime? I would
expect it to be a static command definition in the spec.


> +{
> +   int rc;
> +   u64 tmp;
> +
> +   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu, offset,
> +OCXL_LITTLE_ENDIAN, );
> +   if (rc)
> +   return rc;
> +
> +   command_metadata->request_offset = tmp >> 32;
> +   command_metadata->response_offset = tmp & 0x;
> +
> +   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu, offset + 8,
> +OCXL_LITTLE_ENDIAN, );
> +   if (rc)
> +   return rc;
> +
> +   command_metadata->data_offset = tmp >> 32;
> +   command_metadata->data_size = tmp & 0x;
> +
> +   command_metadata->id = 0;
> +
> +   return 0;
> +}
> +
> +/**
> + * setup_command_metadata() - Set up the command metadata
> + * @ocxlpmem: the device metadata
> + */
> +static int setup_command_metadata(struct ocxlpmem *ocxlpmem)
> +{
> +   int rc;
> +
> +   mutex_init(>admin_command.lock);
> +
> +   rc = extract_command_metadata(ocxlpmem, GLOBAL_MMIO_ACMA_CREQO,
> + >admin_command);
> +   if (rc)
> +   return rc;
> +
> +   return 0;
> +}
> +
>  /**
>   * allocate_minor() - Allocate a minor number to use for an OpenCAPI pmem 
> device
>   * @ocxlpmem: the device metadata
> @@ -421,6 +473,14 @@ static int probe(struct pci_dev *pdev, const struct 
> pci_device_id *ent)
>
> ocxlpmem->pdev = pci_dev_get(pdev);
>
> +   ocxlpmem->timeouts[ADMIN_COMMAND_ERRLOG] = 2000; // ms
> +   ocxlpmem->timeouts[ADMIN_COMMAND_HEARTBEAT] = 100; // ms
> +   ocxlpmem->timeouts[ADMIN_COMMAND_SMART] = 100; // ms
> +   ocxlpmem->timeouts[ADMIN_COMMAND_CONTROLLER_DUMP] = 1000; // ms
> +   ocxlpmem->timeouts[ADMIN_COMMAND_CONTROLLER_STATS] = 100; // ms
> +   ocxlpmem->timeouts[ADMIN_COMMAND_SHUTDOWN] = 1000; // ms
> +   ocxlpmem->timeouts[ADMIN_COMMAND_FW_UPDATE] = 16000; // ms
> +
> pci_set_drvdata(pdev, ocxlpmem);
>
> ocxlpmem->ocxl_fn = ocxl_function_open(pdev);
> @@ -467,6 +527,11 @@ static int probe(struct pci_dev *pdev, const struct 
> pci_device_id *ent)
> goto err;
> }
>
> +   if (setup_command_metadata(ocxlpmem)) {
> +   dev_err(>dev, "Could not read command metadata\n");
> +   goto err;
> +   }
> +
> elapsed = 0;
> timeout = ocxlpmem->readiness_timeout +
>   ocxlpmem->memory_available_timeout;
> diff --git a/drivers/nvdimm/ocxl/ocxlpmem.h b/drivers/nvdimm/ocxl/ocxlpmem.h
> index 3eadbe19f6d0..b72b3f909fc3 100644
> --- a/drivers/nvdimm/ocxl/ocxlpmem.h
> +++ b/drivers/nvdimm/ocxl/ocxlpmem.h
> @@ -7,6 +7,7 @@
>  #include 
>
>  #define LABEL_AREA_SIZEBIT_ULL(PA_SECTION_SHIFT)
> +#define DEFAULT_TIMEOUT 100
>
>  #define 

Re: [PATCH v5 1/4] powerpc/papr_scm: Fetch nvdimm health information from PHYP

2020-04-01 Thread Dan Williams
On Tue, Mar 31, 2020 at 7:33 AM Vaibhav Jain  wrote:
>
> Implement support for fetching nvdimm health information via
> H_SCM_HEALTH hcall as documented in Ref[1]. The hcall returns a pair
> of 64-bit big-endian integers which are then stored in 'struct
> papr_scm_priv' and subsequently partially exposed to user-space via
> newly introduced dimm specific attribute 'papr_flags'. Also a new asm
> header named 'papr-scm.h' is added that describes the interface
> between PHYP and guest kernel.
>
> Following flags are reported via 'papr_flags' sysfs attribute contents
> of which are space separated string flags indicating various nvdimm
> states:
>
>  * "not_armed"  : Indicating that nvdimm contents wont survive a power
>cycle.

s/wont/will not/

>  * "save_fail"  : Indicating that nvdimm contents couldn't be flushed
>during last shutdown event.

In the nfit definition this description is "flush_fail". The
"save_fail" flag was specific to hybrid devices that don't have
persistent media and instead scuttle away data from DRAM to flash on
power-failure.

>  * "restore_fail": Indicating that nvdimm contents couldn't be restored
>during dimm initialization.
>  * "encrypted"  : Dimm contents are encrypted.

This does not seem like a health flag to me, have you considered the
libnvdimm security interface for this indicator?

>  * "smart_notify": There is health event for the nvdimm.

Are you also going to signal the sysfs attribute when this event happens?

>  * "scrubbed"   : Indicating that contents of the nvdimm have been
>scrubbed.

This one seems odd to me what does it mean if it is not set? What does
it mean if a new scrub has been launched. Basically, is there value in
exposing this state?

>  * "locked" : Indicating that nvdimm contents cant be modified
>until next power cycle.

There is the generic NDD_LOCKED flag, can you use that? ...and in
general I wonder if we should try to unify all the common papr_scm and
nfit health flags in a generic location. It will already be the case
the ndctl needs to look somewhere papr specific for this data maybe it
all should have been generic from the beginning.


In any event, can you also add this content to a new
Documentation/ABI/testing/sysfs-bus-papr? See sysfs-bus-nfit for
comparison.

>
> [1]: commit 58b278f568f0 ("powerpc: Provide initial documentation for
> PAPR hcalls")
>
> Signed-off-by: Vaibhav Jain 
> ---
> Changelog:
>
> v4..v5 : None
>
> v3..v4 : None
>
> v2..v3 : Removed PAPR_SCM_DIMM_HEALTH_NON_CRITICAL as a condition for
>  NVDIMM unarmed [Aneesh]
>
> v1..v2 : New patch in the series.
> ---
>  arch/powerpc/include/asm/papr_scm.h   |  48 ++
>  arch/powerpc/platforms/pseries/papr_scm.c | 105 +-
>  2 files changed, 151 insertions(+), 2 deletions(-)
>  create mode 100644 arch/powerpc/include/asm/papr_scm.h
>
> diff --git a/arch/powerpc/include/asm/papr_scm.h 
> b/arch/powerpc/include/asm/papr_scm.h
> new file mode 100644
> index ..868d3360f56a
> --- /dev/null
> +++ b/arch/powerpc/include/asm/papr_scm.h
> @@ -0,0 +1,48 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Structures and defines needed to manage nvdimms for spapr guests.
> + */
> +#ifndef _ASM_POWERPC_PAPR_SCM_H_
> +#define _ASM_POWERPC_PAPR_SCM_H_
> +
> +#include 
> +#include 
> +
> +/* DIMM health bitmap bitmap indicators */
> +/* SCM device is unable to persist memory contents */
> +#define PAPR_SCM_DIMM_UNARMED  PPC_BIT(0)
> +/* SCM device failed to persist memory contents */
> +#define PAPR_SCM_DIMM_SHUTDOWN_DIRTY   PPC_BIT(1)
> +/* SCM device contents are persisted from previous IPL */
> +#define PAPR_SCM_DIMM_SHUTDOWN_CLEAN   PPC_BIT(2)
> +/* SCM device contents are not persisted from previous IPL */
> +#define PAPR_SCM_DIMM_EMPTYPPC_BIT(3)
> +/* SCM device memory life remaining is critically low */
> +#define PAPR_SCM_DIMM_HEALTH_CRITICAL  PPC_BIT(4)
> +/* SCM device will be garded off next IPL due to failure */
> +#define PAPR_SCM_DIMM_HEALTH_FATAL PPC_BIT(5)
> +/* SCM contents cannot persist due to current platform health status */
> +#define PAPR_SCM_DIMM_HEALTH_UNHEALTHY PPC_BIT(6)
> +/* SCM device is unable to persist memory contents in certain conditions */
> +#define PAPR_SCM_DIMM_HEALTH_NON_CRITICAL  PPC_BIT(7)
> +/* SCM device is encrypted */
> +#define PAPR_SCM_DIMM_ENCRYPTEDPPC_BIT(8)
> +/* SCM device has been scrubbed and locked */
> +#define PAPR_SCM_DIMM_SCRUBBED_AND_LOCKED  PPC_BIT(9)
> +
> +/* Bits status indicators for health bitmap indicating unarmed dimm */
> +#define PAPR_SCM_DIMM_UNARMED_MASK (PAPR_SCM_DIMM_UNARMED |\
> +   PAPR_SCM_DIMM_HEALTH_UNHEALTHY)
> +
> +/* Bits status indicators for health bitmap indicating unflushed dimm */
> +#define 

Re: [PATCH v4 19/25] nvdimm/ocxl: Forward events to userspace

2020-04-01 Thread Dan Williams
On Tue, Mar 31, 2020 at 1:59 AM Alastair D'Silva  wrote:
>
> Some of the interrupts that the card generates are better handled
> by the userspace daemon, in particular:
> Controller Hardware/Firmware Fatal
> Controller Dump Available
> Error Log available
>
> This patch allows a userspace application to register an eventfd with
> the driver via SCM_IOCTL_EVENTFD to receive notifications of these
> interrupts.
>
> Userspace can then identify what events have occurred by calling
> SCM_IOCTL_EVENT_CHECK and checking against the SCM_IOCTL_EVENT_FOO
> masks.

The amount new ioctl's in this driver is too high, it seems much of
this data can be exported via sysfs attributes which are more
maintainable that ioctls. Then sysfs also has the ability to signal
events on sysfs attributes, see sys_notify_dirent.

Can you step back and review the ABI exposure of the driver and what
can be moved to sysfs? If you need to have bus specific attributes
ordered underneath the libnvdimm generic attributes you can create a
sysfs attribute subdirectory.

In general a roadmap document of all the proposed ABI is needed to
make sure it is both sufficient and necessary. See the libnvdimm
document that introduced the initial libnvdimm ABI:

https://www.kernel.org/doc/Documentation/nvdimm/nvdimm.txt

>
> Signed-off-by: Alastair D'Silva 
> ---
>  drivers/nvdimm/ocxl/main.c | 220 +
>  drivers/nvdimm/ocxl/ocxlpmem.h |   4 +
>  include/uapi/nvdimm/ocxlpmem.h |  12 ++
>  3 files changed, 236 insertions(+)
>
> diff --git a/drivers/nvdimm/ocxl/main.c b/drivers/nvdimm/ocxl/main.c
> index 0040fc09cceb..cb6cdc9eb899 100644
> --- a/drivers/nvdimm/ocxl/main.c
> +++ b/drivers/nvdimm/ocxl/main.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -301,8 +302,19 @@ static void free_ocxlpmem(struct ocxlpmem *ocxlpmem)
>  {
> int rc;
>
> +   // Disable doorbells
> +   (void)ocxl_global_mmio_set64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_CHIEC,
> +OCXL_LITTLE_ENDIAN,
> +GLOBAL_MMIO_CHI_ALL);
> +
> free_minor(ocxlpmem);
>
> +   if (ocxlpmem->irq_addr[1])
> +   iounmap(ocxlpmem->irq_addr[1]);
> +
> +   if (ocxlpmem->irq_addr[0])
> +   iounmap(ocxlpmem->irq_addr[0]);
> +
> if (ocxlpmem->ocxl_context) {
> rc = ocxl_context_detach(ocxlpmem->ocxl_context);
> if (rc == -EBUSY)
> @@ -398,6 +410,11 @@ static int file_release(struct inode *inode, struct file 
> *file)
>  {
> struct ocxlpmem *ocxlpmem = file->private_data;
>
> +   if (ocxlpmem->ev_ctx) {
> +   eventfd_ctx_put(ocxlpmem->ev_ctx);
> +   ocxlpmem->ev_ctx = NULL;
> +   }
> +
> ocxlpmem_put(ocxlpmem);
> return 0;
>  }
> @@ -928,6 +945,52 @@ static int ioctl_controller_stats(struct ocxlpmem 
> *ocxlpmem,
> return rc;
>  }
>
> +static int ioctl_eventfd(struct ocxlpmem *ocxlpmem,
> +struct ioctl_ocxlpmem_eventfd __user *uarg)
> +{
> +   struct ioctl_ocxlpmem_eventfd args;
> +
> +   if (copy_from_user(, uarg, sizeof(args)))
> +   return -EFAULT;
> +
> +   if (ocxlpmem->ev_ctx)
> +   return -EBUSY;
> +
> +   ocxlpmem->ev_ctx = eventfd_ctx_fdget(args.eventfd);
> +   if (IS_ERR(ocxlpmem->ev_ctx))
> +   return PTR_ERR(ocxlpmem->ev_ctx);
> +
> +   return 0;
> +}
> +
> +static int ioctl_event_check(struct ocxlpmem *ocxlpmem, u64 __user *uarg)
> +{
> +   u64 val = 0;
> +   int rc;
> +   u64 chi = 0;
> +
> +   rc = ocxlpmem_chi(ocxlpmem, );
> +   if (rc < 0)
> +   return rc;
> +
> +   if (chi & GLOBAL_MMIO_CHI_ELA)
> +   val |= IOCTL_OCXLPMEM_EVENT_ERROR_LOG_AVAILABLE;
> +
> +   if (chi & GLOBAL_MMIO_CHI_CDA)
> +   val |= IOCTL_OCXLPMEM_EVENT_CONTROLLER_DUMP_AVAILABLE;
> +
> +   if (chi & GLOBAL_MMIO_CHI_CFFS)
> +   val |= IOCTL_OCXLPMEM_EVENT_FIRMWARE_FATAL;
> +
> +   if (chi & GLOBAL_MMIO_CHI_CHFS)
> +   val |= IOCTL_OCXLPMEM_EVENT_HARDWARE_FATAL;
> +
> +   if (copy_to_user((u64 __user *)uarg, , sizeof(val)))
> +   return -EFAULT;
> +
> +   return rc;
> +}
> +
>  static long file_ioctl(struct file *file, unsigned int cmd, unsigned long 
> args)
>  {
> struct ocxlpmem *ocxlpmem = file->private_data;
> @@ -956,6 +1019,15 @@ static long file_ioctl(struct file *file, unsigned int 
> cmd, unsigned long args)
> rc = ioctl_controller_stats(ocxlpmem,
> (struct 
> ioctl_ocxlpmem_controller_stats __user *)args);
> break;
> +
> +   case IOCTL_OCXLPMEM_EVENTFD:
> +   rc = ioctl_eventfd(ocxlpmem,
> +  (struct ioctl_ocxlpmem_eventfd __user 
> *)args);
> +   break;
> +
> + 

Re: [PATCH v4 15/25] nvdimm/ocxl: Register a character device for userspace to interact with

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:53 PM Alastair D'Silva  wrote:
>
> This patch introduces a character device (/dev/ocxlpmemX) which further
> patches will use to interact with userspace, such as error logs,
> controller stats and card debug functionality.

This was asked earlier, but I'll reiterate, I do not see what
justifies an ocxlpmemX private device ABI vs routing through the
existing generic character ndbusX and nmemX character devices.

>
> Signed-off-by: Alastair D'Silva 
> ---
>  drivers/nvdimm/ocxl/main.c | 117 -
>  drivers/nvdimm/ocxl/ocxlpmem.h |   2 +
>  2 files changed, 117 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvdimm/ocxl/main.c b/drivers/nvdimm/ocxl/main.c
> index 8db573036423..9b85fcd3f1c9 100644
> --- a/drivers/nvdimm/ocxl/main.c
> +++ b/drivers/nvdimm/ocxl/main.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include "ocxlpmem.h"
> @@ -356,6 +357,67 @@ static int ocxlpmem_register(struct ocxlpmem *ocxlpmem)
> return device_register(>dev);
>  }
>
> +static void ocxlpmem_put(struct ocxlpmem *ocxlpmem)
> +{
> +   put_device(>dev);
> +}
> +
> +static struct ocxlpmem *ocxlpmem_get(struct ocxlpmem *ocxlpmem)
> +{
> +   return (!get_device(>dev)) ? NULL : ocxlpmem;
> +}
> +
> +static struct ocxlpmem *find_and_get_ocxlpmem(dev_t devno)
> +{
> +   struct ocxlpmem *ocxlpmem;
> +   int minor = MINOR(devno);
> +
> +   mutex_lock(_idr_lock);
> +   ocxlpmem = idr_find(_idr, minor);
> +   if (ocxlpmem)
> +   ocxlpmem_get(ocxlpmem);
> +   mutex_unlock(_idr_lock);
> +
> +   return ocxlpmem;
> +}
> +
> +static int file_open(struct inode *inode, struct file *file)
> +{
> +   struct ocxlpmem *ocxlpmem;
> +
> +   ocxlpmem = find_and_get_ocxlpmem(inode->i_rdev);
> +   if (!ocxlpmem)
> +   return -ENODEV;
> +
> +   file->private_data = ocxlpmem;
> +   return 0;
> +}
> +
> +static int file_release(struct inode *inode, struct file *file)
> +{
> +   struct ocxlpmem *ocxlpmem = file->private_data;
> +
> +   ocxlpmem_put(ocxlpmem);
> +   return 0;
> +}
> +
> +static const struct file_operations fops = {
> +   .owner  = THIS_MODULE,
> +   .open   = file_open,
> +   .release= file_release,
> +};
> +
> +/**
> + * create_cdev() - Create the chardev in /dev for the device
> + * @ocxlpmem: the SCM metadata
> + * Return: 0 on success, negative on failure
> + */
> +static int create_cdev(struct ocxlpmem *ocxlpmem)
> +{
> +   cdev_init(>cdev, );
> +   return cdev_add(>cdev, ocxlpmem->dev.devt, 1);
> +}
> +
>  /**
>   * ocxlpmem_remove() - Free an OpenCAPI persistent memory device
>   * @pdev: the PCI device information struct
> @@ -376,6 +438,13 @@ static void remove(struct pci_dev *pdev)
> if (ocxlpmem->nvdimm_bus)
> nvdimm_bus_unregister(ocxlpmem->nvdimm_bus);
>
> +   /*
> +* Remove the cdev early to prevent a race against userspace
> +* via the char dev
> +*/
> +   if (ocxlpmem->cdev.owner)
> +   cdev_del(>cdev);
> +
> device_unregister(>dev);
> }
>  }
> @@ -527,11 +596,18 @@ static int probe(struct pci_dev *pdev, const struct 
> pci_device_id *ent)
> goto err;
> }
>
> -   if (setup_command_metadata(ocxlpmem)) {
> +   rc = setup_command_metadata(ocxlpmem);
> +   if (rc) {
> dev_err(>dev, "Could not read command metadata\n");
> goto err;
> }
>
> +   rc = create_cdev(ocxlpmem);
> +   if (rc) {
> +   dev_err(>dev, "Could not create character device\n");
> +   goto err;
> +   }
> +
> elapsed = 0;
> timeout = ocxlpmem->readiness_timeout +
>   ocxlpmem->memory_available_timeout;
> @@ -599,6 +675,36 @@ static struct pci_driver pci_driver = {
> .shutdown = remove,
>  };
>
> +static int file_init(void)
> +{
> +   int rc;
> +
> +   rc = alloc_chrdev_region(_dev, 0, NUM_MINORS, "ocxlpmem");
> +   if (rc) {
> +   idr_destroy(_idr);
> +   pr_err("Unable to allocate OpenCAPI persistent memory major 
> number: %d\n",
> +  rc);
> +   return rc;
> +   }
> +
> +   ocxlpmem_class = class_create(THIS_MODULE, "ocxlpmem");
> +   if (IS_ERR(ocxlpmem_class)) {
> +   idr_destroy(_idr);
> +   pr_err("Unable to create ocxlpmem class\n");
> +   unregister_chrdev_region(ocxlpmem_dev, NUM_MINORS);
> +   return PTR_ERR(ocxlpmem_class);
> +   }
> +
> +   return 0;
> +}
> +
> +static void file_exit(void)
> +{
> +   class_destroy(ocxlpmem_class);
> +   unregister_chrdev_region(ocxlpmem_dev, NUM_MINORS);
> +   idr_destroy(_idr);
> +}
> +
>  static int __init 

Re: [PATCH v4 13/25] nvdimm/ocxl: Read the capability registers & wait for device ready

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> This patch reads timeouts & firmware version from the controller, and
> uses those timeouts to wait for the controller to report that it is ready
> before handing the memory over to libnvdimm.
>
> Signed-off-by: Alastair D'Silva 
> ---
>  drivers/nvdimm/ocxl/Makefile|  2 +-
>  drivers/nvdimm/ocxl/main.c  | 85 +
>  drivers/nvdimm/ocxl/ocxlpmem.h  | 29 +
>  drivers/nvdimm/ocxl/ocxlpmem_internal.c | 19 ++
>  4 files changed, 134 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/nvdimm/ocxl/ocxlpmem_internal.c
>
> diff --git a/drivers/nvdimm/ocxl/Makefile b/drivers/nvdimm/ocxl/Makefile
> index e0e8ade1987a..bab97082e062 100644
> --- a/drivers/nvdimm/ocxl/Makefile
> +++ b/drivers/nvdimm/ocxl/Makefile
> @@ -4,4 +4,4 @@ ccflags-$(CONFIG_PPC_WERROR)+= -Werror
>
>  obj-$(CONFIG_OCXL_PMEM) += ocxlpmem.o
>
> -ocxlpmem-y := main.o
> \ No newline at end of file
> +ocxlpmem-y := main.o ocxlpmem_internal.o
> diff --git a/drivers/nvdimm/ocxl/main.c b/drivers/nvdimm/ocxl/main.c
> index c0066fedf9cc..be76acd33d74 100644
> --- a/drivers/nvdimm/ocxl/main.c
> +++ b/drivers/nvdimm/ocxl/main.c
> @@ -8,6 +8,7 @@
>
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -327,6 +328,50 @@ static void remove(struct pci_dev *pdev)
> }
>  }
>
> +/**
> + * read_device_metadata() - Retrieve config information from the AFU and 
> save it for future use
> + * @ocxlpmem: the device metadata
> + * Return: 0 on success, negative on failure
> + */
> +static int read_device_metadata(struct ocxlpmem *ocxlpmem)
> +{
> +   u64 val;
> +   int rc;
> +
> +   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_CCAP0,
> +OCXL_LITTLE_ENDIAN, );

This calling convention would seem to defeat the ability of sparse to
validate endian correctness. That's independent of this series, but I
wonder how does someone review why this argument is sometimes
OCXL_LITTLE_ENDIAN and sometimes OCXL_HOST_ENDIAN?

> +   if (rc)
> +   return rc;
> +
> +   ocxlpmem->scm_revision = val & 0x;
> +   ocxlpmem->read_latency = (val >> 32) & 0x;
> +   ocxlpmem->readiness_timeout = (val >> 48) & 0x0F;
> +   ocxlpmem->memory_available_timeout = val >> 52;

Maybe some macros to parse out these register fields?

> +
> +   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_CCAP1,
> +OCXL_LITTLE_ENDIAN, );
> +   if (rc)
> +   return rc;
> +
> +   ocxlpmem->max_controller_dump_size = val & 0x;
> +
> +   // Extract firmware version text
> +   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_FWVER,
> +OCXL_HOST_ENDIAN,
> +(u64 *)ocxlpmem->fw_version);
> +   if (rc)
> +   return rc;
> +
> +   ocxlpmem->fw_version[8] = '\0';
> +
> +   dev_info(>dev,
> +"Firmware version '%s' SCM revision %d:%d\n",
> +ocxlpmem->fw_version, ocxlpmem->scm_revision >> 4,
> +ocxlpmem->scm_revision & 0x0F);

Does the driver need to be chatty here. If this data is relevant
should it appear in sysfs by default?

> +
> +   return 0;
> +}
> +
>  /**
>   * probe_function0() - Set up function 0 for an OpenCAPI persistent memory 
> device
>   * This is important as it enables templates higher than 0 across all other
> @@ -359,6 +404,9 @@ static int probe(struct pci_dev *pdev, const struct 
> pci_device_id *ent)
>  {
> struct ocxlpmem *ocxlpmem;
> int rc;
> +   u64 chi;
> +   u16 elapsed, timeout;
> +   bool ready = false;
>
> if (PCI_FUNC(pdev->devfn) == 0)
> return probe_function0(pdev);
> @@ -413,6 +461,43 @@ static int probe(struct pci_dev *pdev, const struct 
> pci_device_id *ent)
> goto err;
> }
>
> +   rc = read_device_metadata(ocxlpmem);
> +   if (rc) {
> +   dev_err(>dev, "Could not read metadata\n");
> +   goto err;
> +   }
> +
> +   elapsed = 0;
> +   timeout = ocxlpmem->readiness_timeout +
> + ocxlpmem->memory_available_timeout;
> +
> +   while (true) {
> +   rc = ocxlpmem_chi(ocxlpmem, );
> +   ready = (chi & (GLOBAL_MMIO_CHI_CRDY | GLOBAL_MMIO_CHI_MA)) ==
> +   (GLOBAL_MMIO_CHI_CRDY | GLOBAL_MMIO_CHI_MA);
> +
> +   if (ready)
> +   break;
> +
> +   if (elapsed++ > timeout) {
> +   dev_err(>dev,
> +   "OpenCAPI Persistent Memory ready 
> timeout.\n");
> +
> +   if (!(chi & GLOBAL_MMIO_CHI_CRDY))
> +   dev_err(>dev,
> +   "controller is not 

Re: [PATCH v4 12/25] nvdimm/ocxl: Add register addresses & status values to the header

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:53 PM Alastair D'Silva  wrote:
>
> These values have been taken from the device specifications.

Link to specification?


Re: [PATCH v4 11/25] powerpc: Enable the OpenCAPI Persistent Memory driver for powernv_defconfig

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> This patch enables the OpenCAPI Persistent Memory driver, as well
> as DAX support, for the 'powernv' defconfig.
>
> DAX is not a strict requirement for the functioning of the driver, but it
> is likely that a user will want to create a DAX device on top of their
> persistent memory device.
>
> Signed-off-by: Alastair D'Silva 
> Reviewed-by: Andrew Donnellan 
> ---
>  arch/powerpc/configs/powernv_defconfig | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/arch/powerpc/configs/powernv_defconfig 
> b/arch/powerpc/configs/powernv_defconfig
> index 71749377d164..921d77bbd3d2 100644
> --- a/arch/powerpc/configs/powernv_defconfig
> +++ b/arch/powerpc/configs/powernv_defconfig
> @@ -348,3 +348,8 @@ CONFIG_KVM_BOOK3S_64=m
>  CONFIG_KVM_BOOK3S_64_HV=m
>  CONFIG_VHOST_NET=m
>  CONFIG_PRINTK_TIME=y
> +CONFIG_ZONE_DEVICE=y
> +CONFIG_OCXL_PMEM=m
> +CONFIG_DEV_DAX=m
> +CONFIG_DEV_DAX_PMEM=m
> +CONFIG_FS_DAX=y

These options have dependencies. I think it would better to implement
a top-level configuration question called something like
PERSISTENT_MEMORY_ALL that goes and selects all the bus providers and
infrastructure and lets other defaults follow along. For example,
CONFIG_DEV_DAX could grow a "default LIBNVDIMM" and then
CONFIG_DEV_DAX_PMEM would default on as well. If
CONFIG_PERSISTENT_MEMORY_ALL selected all the bus providers and
ZONE_DEVICE then the Kconfig system could prompt you to where the
dependencies are not satisfied.


Re: [PATCH v4 10/25] nvdimm: Add driver for OpenCAPI Persistent Memory

2020-04-01 Thread Dan Williams
On Wed, Apr 1, 2020 at 1:49 AM Dan Williams  wrote:
>
> On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  
> wrote:
> >
> > This driver exposes LPC memory on OpenCAPI pmem cards
> > as an NVDIMM, allowing the existing nvram infrastructure
> > to be used.
> >
> > Namespace metadata is stored on the media itself, so
> > scm_reserve_metadata() maps 1 section's worth of PMEM storage
> > at the start to hold this. The rest of the PMEM range is registered
> > with libnvdimm as an nvdimm. ndctl_config_read/write/size() provide
> > callbacks to libnvdimm to access the metadata.
> >
> > Signed-off-by: Alastair D'Silva 
> > ---
> >  drivers/nvdimm/Kconfig |   2 +
> >  drivers/nvdimm/Makefile|   1 +
> >  drivers/nvdimm/ocxl/Kconfig|  15 ++
> >  drivers/nvdimm/ocxl/Makefile   |   7 +
> >  drivers/nvdimm/ocxl/main.c | 476 +
> >  drivers/nvdimm/ocxl/ocxlpmem.h |  23 ++
> >  6 files changed, 524 insertions(+)
> >  create mode 100644 drivers/nvdimm/ocxl/Kconfig
> >  create mode 100644 drivers/nvdimm/ocxl/Makefile
> >  create mode 100644 drivers/nvdimm/ocxl/main.c
> >  create mode 100644 drivers/nvdimm/ocxl/ocxlpmem.h
> >
> > diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
> > index b7d1eb38b27d..368328637182 100644
> > --- a/drivers/nvdimm/Kconfig
> > +++ b/drivers/nvdimm/Kconfig
> > @@ -131,4 +131,6 @@ config NVDIMM_TEST_BUILD
> >   core devm_memremap_pages() implementation and other
> >   infrastructure.
> >
> > +source "drivers/nvdimm/ocxl/Kconfig"
> > +
> >  endif
> > diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
> > index 29203f3d3069..bc02be11c794 100644
> > --- a/drivers/nvdimm/Makefile
> > +++ b/drivers/nvdimm/Makefile
> > @@ -33,3 +33,4 @@ libnvdimm-$(CONFIG_NVDIMM_KEYS) += security.o
> >  TOOLS := ../../tools
> >  TEST_SRC := $(TOOLS)/testing/nvdimm/test
> >  obj-$(CONFIG_NVDIMM_TEST_BUILD) += $(TEST_SRC)/iomap.o
> > +obj-$(CONFIG_LIBNVDIMM) += ocxl/
> > diff --git a/drivers/nvdimm/ocxl/Kconfig b/drivers/nvdimm/ocxl/Kconfig
> > new file mode 100644
> > index ..c5d927520920
> > --- /dev/null
> > +++ b/drivers/nvdimm/ocxl/Kconfig
> > @@ -0,0 +1,15 @@
> > +# SPDX-License-Identifier: GPL-2.0-only
> > +if LIBNVDIMM
> > +
> > +config OCXL_PMEM
> > +   tristate "OpenCAPI Persistent Memory"
> > +   depends on LIBNVDIMM && PPC_POWERNV && PCI && EEH && ZONE_DEVICE && 
> > OCXL
>
> Does OXCL_PMEM itself have any CONFIG_ZONE_DEVICE dependencies? That's
> more a function of CONFIG_DEV_DAX and CONFIG_FS_DAX. Doesn't OCXL
> already depend on CONFIG_PCI?
>
>
> > +   help
> > + Exposes devices that implement the OpenCAPI Storage Class Memory
> > + specification as persistent memory regions. You may also want
> > + DEV_DAX, DEV_DAX_PMEM & FS_DAX if you plan on using DAX devices
> > + stacked on top of this driver.
> > +
> > + Select N if unsure.
> > +
> > +endif
> > diff --git a/drivers/nvdimm/ocxl/Makefile b/drivers/nvdimm/ocxl/Makefile
> > new file mode 100644
> > index ..e0e8ade1987a
> > --- /dev/null
> > +++ b/drivers/nvdimm/ocxl/Makefile
> > @@ -0,0 +1,7 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +
> > +ccflags-$(CONFIG_PPC_WERROR)   += -Werror
> > +
> > +obj-$(CONFIG_OCXL_PMEM) += ocxlpmem.o
> > +
> > +ocxlpmem-y := main.o
> > \ No newline at end of file
> > diff --git a/drivers/nvdimm/ocxl/main.c b/drivers/nvdimm/ocxl/main.c
> > new file mode 100644
> > index ..c0066fedf9cc
> > --- /dev/null
> > +++ b/drivers/nvdimm/ocxl/main.c
> > @@ -0,0 +1,476 @@
> > +// SPDX-License-Identifier: GPL-2.0+
> > +// Copyright 2020 IBM Corp.
> > +
> > +/*
> > + * A driver for OpenCAPI devices that implement the Storage Class
> > + * Memory specification.
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include "ocxlpmem.h"
> > +
> > +static const struct pci_device_id pci_tbl[] = {
> > +   { PCI_DEVICE(PCI_VENDOR_ID_IBM, 0x0625), },
> > +   { }
> > +};
> > +
> > +MODULE_DEVICE_TABLE(pci, pci_tbl);
> > +
> > +#define NUM_MINORS 256 // Total to reserve
> > +
> > +static dev_t ocxlpmem_dev;
> > +static st

Re: [PATCH v4 10/25] nvdimm: Add driver for OpenCAPI Persistent Memory

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> This driver exposes LPC memory on OpenCAPI pmem cards
> as an NVDIMM, allowing the existing nvram infrastructure
> to be used.
>
> Namespace metadata is stored on the media itself, so
> scm_reserve_metadata() maps 1 section's worth of PMEM storage
> at the start to hold this. The rest of the PMEM range is registered
> with libnvdimm as an nvdimm. ndctl_config_read/write/size() provide
> callbacks to libnvdimm to access the metadata.
>
> Signed-off-by: Alastair D'Silva 
> ---
>  drivers/nvdimm/Kconfig |   2 +
>  drivers/nvdimm/Makefile|   1 +
>  drivers/nvdimm/ocxl/Kconfig|  15 ++
>  drivers/nvdimm/ocxl/Makefile   |   7 +
>  drivers/nvdimm/ocxl/main.c | 476 +
>  drivers/nvdimm/ocxl/ocxlpmem.h |  23 ++
>  6 files changed, 524 insertions(+)
>  create mode 100644 drivers/nvdimm/ocxl/Kconfig
>  create mode 100644 drivers/nvdimm/ocxl/Makefile
>  create mode 100644 drivers/nvdimm/ocxl/main.c
>  create mode 100644 drivers/nvdimm/ocxl/ocxlpmem.h
>
> diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
> index b7d1eb38b27d..368328637182 100644
> --- a/drivers/nvdimm/Kconfig
> +++ b/drivers/nvdimm/Kconfig
> @@ -131,4 +131,6 @@ config NVDIMM_TEST_BUILD
>   core devm_memremap_pages() implementation and other
>   infrastructure.
>
> +source "drivers/nvdimm/ocxl/Kconfig"
> +
>  endif
> diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
> index 29203f3d3069..bc02be11c794 100644
> --- a/drivers/nvdimm/Makefile
> +++ b/drivers/nvdimm/Makefile
> @@ -33,3 +33,4 @@ libnvdimm-$(CONFIG_NVDIMM_KEYS) += security.o
>  TOOLS := ../../tools
>  TEST_SRC := $(TOOLS)/testing/nvdimm/test
>  obj-$(CONFIG_NVDIMM_TEST_BUILD) += $(TEST_SRC)/iomap.o
> +obj-$(CONFIG_LIBNVDIMM) += ocxl/
> diff --git a/drivers/nvdimm/ocxl/Kconfig b/drivers/nvdimm/ocxl/Kconfig
> new file mode 100644
> index ..c5d927520920
> --- /dev/null
> +++ b/drivers/nvdimm/ocxl/Kconfig
> @@ -0,0 +1,15 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +if LIBNVDIMM
> +
> +config OCXL_PMEM
> +   tristate "OpenCAPI Persistent Memory"
> +   depends on LIBNVDIMM && PPC_POWERNV && PCI && EEH && ZONE_DEVICE && 
> OCXL

Does OXCL_PMEM itself have any CONFIG_ZONE_DEVICE dependencies? That's
more a function of CONFIG_DEV_DAX and CONFIG_FS_DAX. Doesn't OCXL
already depend on CONFIG_PCI?


> +   help
> + Exposes devices that implement the OpenCAPI Storage Class Memory
> + specification as persistent memory regions. You may also want
> + DEV_DAX, DEV_DAX_PMEM & FS_DAX if you plan on using DAX devices
> + stacked on top of this driver.
> +
> + Select N if unsure.
> +
> +endif
> diff --git a/drivers/nvdimm/ocxl/Makefile b/drivers/nvdimm/ocxl/Makefile
> new file mode 100644
> index ..e0e8ade1987a
> --- /dev/null
> +++ b/drivers/nvdimm/ocxl/Makefile
> @@ -0,0 +1,7 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +ccflags-$(CONFIG_PPC_WERROR)   += -Werror
> +
> +obj-$(CONFIG_OCXL_PMEM) += ocxlpmem.o
> +
> +ocxlpmem-y := main.o
> \ No newline at end of file
> diff --git a/drivers/nvdimm/ocxl/main.c b/drivers/nvdimm/ocxl/main.c
> new file mode 100644
> index ..c0066fedf9cc
> --- /dev/null
> +++ b/drivers/nvdimm/ocxl/main.c
> @@ -0,0 +1,476 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +// Copyright 2020 IBM Corp.
> +
> +/*
> + * A driver for OpenCAPI devices that implement the Storage Class
> + * Memory specification.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include "ocxlpmem.h"
> +
> +static const struct pci_device_id pci_tbl[] = {
> +   { PCI_DEVICE(PCI_VENDOR_ID_IBM, 0x0625), },
> +   { }
> +};
> +
> +MODULE_DEVICE_TABLE(pci, pci_tbl);
> +
> +#define NUM_MINORS 256 // Total to reserve
> +
> +static dev_t ocxlpmem_dev;
> +static struct class *ocxlpmem_class;
> +static struct mutex minors_idr_lock;
> +static struct idr minors_idr;
> +
> +/**
> + * ndctl_config_write() - Handle a ND_CMD_SET_CONFIG_DATA command from ndctl
> + * @ocxlpmem: the device metadata
> + * @command: the incoming data to write
> + * Return: 0 on success, negative on failure
> + */
> +static int ndctl_config_write(struct ocxlpmem *ocxlpmem,
> + struct nd_cmd_set_config_hdr *command)
> +{
> +   if (command->in_offset + command->in_length > LABEL_AREA_SIZE)
> +   return -EINVAL;
> +
> +   memcpy_flushcache(ocxlpmem->metadata_addr + command->in_offset,
> + command->in_buf, command->in_length);
> +
> +   return 0;
> +}
> +
> +/**
> + * ndctl_config_read() - Handle a ND_CMD_GET_CONFIG_DATA command from ndctl
> + * @ocxlpmem: the device metadata
> + * @command: the read request
> + * Return: 0 on success, negative on failure
> + */
> +static int ndctl_config_read(struct ocxlpmem *ocxlpmem,
> +struct nd_cmd_get_config_data_hdr *command)
> +{
> + 

Re: [PATCH v4 08/25] ocxl: Emit a log message showing how much LPC memory was detected

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> This patch emits a message showing how much LPC memory & special purpose
> memory was detected on an OCXL device.
>
> Signed-off-by: Alastair D'Silva 
> Acked-by: Frederic Barrat 
> Acked-by: Andrew Donnellan 
> ---
>  drivers/misc/ocxl/config.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
> index a62e3d7db2bf..69cca341d446 100644
> --- a/drivers/misc/ocxl/config.c
> +++ b/drivers/misc/ocxl/config.c
> @@ -568,6 +568,10 @@ static int read_afu_lpc_memory_info(struct pci_dev *dev,
> afu->special_purpose_mem_size =
> total_mem_size - lpc_mem_size;
> }
> +
> +   dev_info(>dev, "Probed LPC memory of %#llx bytes and special 
> purpose memory of %#llx bytes\n",
> +afu->lpc_mem_size, afu->special_purpose_mem_size);

A patch for a single log message is too fine grained for my taste,
let's squash this into another patch in the series.

> +
> return 0;
>  }
>
> --
> 2.24.1
>


Re: [PATCH v4 07/25] ocxl: Add functions to map/unmap LPC memory

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> Add functions to map/unmap LPC memory
>

"map memory" is an overloaded term. I'm guessing this patch has
nothing to do with mapping memory in the MMU. Is it updating hardware
resource decoders to start claiming address space that was allocated
previously?

> Signed-off-by: Alastair D'Silva 
> Acked-by: Frederic Barrat 
> ---
>  drivers/misc/ocxl/core.c  | 51 +++
>  drivers/misc/ocxl/ocxl_internal.h |  3 ++
>  include/misc/ocxl.h   | 21 +
>  3 files changed, 75 insertions(+)
>
> diff --git a/drivers/misc/ocxl/core.c b/drivers/misc/ocxl/core.c
> index 2531c6cf19a0..75ff14e3882a 100644
> --- a/drivers/misc/ocxl/core.c
> +++ b/drivers/misc/ocxl/core.c
> @@ -210,6 +210,56 @@ static void unmap_mmio_areas(struct ocxl_afu *afu)
> release_fn_bar(afu->fn, afu->config.global_mmio_bar);
>  }
>
> +int ocxl_afu_map_lpc_mem(struct ocxl_afu *afu)
> +{
> +   struct pci_dev *dev = to_pci_dev(afu->fn->dev.parent);
> +
> +   if ((afu->config.lpc_mem_size + afu->config.special_purpose_mem_size) 
> == 0)
> +   return 0;
> +
> +   afu->lpc_base_addr = ocxl_link_lpc_map(afu->fn->link, dev);
> +   if (afu->lpc_base_addr == 0)
> +   return -EINVAL;
> +
> +   if (afu->config.lpc_mem_size > 0) {
> +   afu->lpc_res.start = afu->lpc_base_addr + 
> afu->config.lpc_mem_offset;
> +   afu->lpc_res.end = afu->lpc_res.start + 
> afu->config.lpc_mem_size - 1;
> +   }
> +
> +   if (afu->config.special_purpose_mem_size > 0) {
> +   afu->special_purpose_res.start = afu->lpc_base_addr +
> +
> afu->config.special_purpose_mem_offset;
> +   afu->special_purpose_res.end = afu->special_purpose_res.start 
> +
> +  
> afu->config.special_purpose_mem_size - 1;
> +   }
> +
> +   return 0;
> +}
> +EXPORT_SYMBOL_GPL(ocxl_afu_map_lpc_mem);
> +
> +struct resource *ocxl_afu_lpc_mem(struct ocxl_afu *afu)
> +{
> +   return >lpc_res;
> +}
> +EXPORT_SYMBOL_GPL(ocxl_afu_lpc_mem);
> +
> +static void unmap_lpc_mem(struct ocxl_afu *afu)
> +{
> +   struct pci_dev *dev = to_pci_dev(afu->fn->dev.parent);
> +
> +   if (afu->lpc_res.start || afu->special_purpose_res.start) {
> +   void *link = afu->fn->link;
> +
> +   // only release the link when the the last consumer calls 
> release
> +   ocxl_link_lpc_release(link, dev);
> +
> +   afu->lpc_res.start = 0;
> +   afu->lpc_res.end = 0;
> +   afu->special_purpose_res.start = 0;
> +   afu->special_purpose_res.end = 0;
> +   }
> +}
> +
>  static int configure_afu(struct ocxl_afu *afu, u8 afu_idx, struct pci_dev 
> *dev)
>  {
> int rc;
> @@ -251,6 +301,7 @@ static int configure_afu(struct ocxl_afu *afu, u8 
> afu_idx, struct pci_dev *dev)
>
>  static void deconfigure_afu(struct ocxl_afu *afu)
>  {
> +   unmap_lpc_mem(afu);
> unmap_mmio_areas(afu);
> reclaim_afu_pasid(afu);
> reclaim_afu_actag(afu);
> diff --git a/drivers/misc/ocxl/ocxl_internal.h 
> b/drivers/misc/ocxl/ocxl_internal.h
> index 2d7575225bd7..7b975a89db7b 100644
> --- a/drivers/misc/ocxl/ocxl_internal.h
> +++ b/drivers/misc/ocxl/ocxl_internal.h
> @@ -52,6 +52,9 @@ struct ocxl_afu {
> void __iomem *global_mmio_ptr;
> u64 pp_mmio_start;
> void *private;
> +   u64 lpc_base_addr; /* Covers both LPC & special purpose memory */
> +   struct resource lpc_res;
> +   struct resource special_purpose_res;
>  };
>
>  enum ocxl_context_status {
> diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
> index 357ef1aadbc0..d8b0b4d46bfb 100644
> --- a/include/misc/ocxl.h
> +++ b/include/misc/ocxl.h
> @@ -203,6 +203,27 @@ int ocxl_irq_set_handler(struct ocxl_context *ctx, int 
> irq_id,
>
>  // AFU Metadata
>
> +/**
> + * ocxl_afu_map_lpc_mem() - Map the LPC system & special purpose memory for 
> an AFU
> + * Do not call this during device discovery, as there may me multiple

s/me/be/


> + * devices on a link, and the memory is mapped for the whole link, not
> + * just one device. It should only be called after all devices have
> + * registered their memory on the link.
> + *
> + * @afu: The AFU that has the LPC memory to map
> + *
> + * Returns 0 on success, negative on failure
> + */
> +int ocxl_afu_map_lpc_mem(struct ocxl_afu *afu);
> +
> +/**
> + * ocxl_afu_lpc_mem() - Get the physical address range of LPC memory for an 
> AFU
> + * @afu: The AFU associated with the LPC memory
> + *
> + * Returns a pointer to the resource struct for the physical address range
> + */
> +struct resource *ocxl_afu_lpc_mem(struct ocxl_afu *afu);
> +
>  /**
>   * ocxl_afu_config() - Get a pointer to the config for an AFU
>   * @afu: a pointer to the AFU to get the config for
> --
> 2.24.1
>


Re: [PATCH v4 05/25] ocxl: Address kernel doc errors & warnings

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> This patch addresses warnings and errors from the kernel doc scripts for
> the OpenCAPI driver.
>
> It also makes minor tweaks to make the docs more consistent.
>
> Signed-off-by: Alastair D'Silva 
> Acked-by: Andrew Donnellan 
> ---
>  drivers/misc/ocxl/config.c| 24 
>  drivers/misc/ocxl/ocxl_internal.h |  9 +--
>  include/misc/ocxl.h   | 96 ---
>  3 files changed, 55 insertions(+), 74 deletions(-)

Looks good.


>
> diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
> index c8e19bfb5ef9..a62e3d7db2bf 100644
> --- a/drivers/misc/ocxl/config.c
> +++ b/drivers/misc/ocxl/config.c
> @@ -273,16 +273,16 @@ static int read_afu_info(struct pci_dev *dev, struct 
> ocxl_fn_config *fn,
>  }
>
>  /**
> - * Read the template version from the AFU
> - * dev: the device for the AFU
> - * fn: the AFU offsets
> - * len: outputs the template length
> - * version: outputs the major<<8,minor version
> + * read_template_version() - Read the template version from the AFU
> + * @dev: the device for the AFU
> + * @fn: the AFU offsets
> + * @len: outputs the template length
> + * @version: outputs the major<<8,minor version
>   *
>   * Returns 0 on success, negative on failure
>   */
>  static int read_template_version(struct pci_dev *dev, struct ocxl_fn_config 
> *fn,
> -   u16 *len, u16 *version)
> +u16 *len, u16 *version)
>  {
> u32 val32;
> u8 major, minor;
> @@ -476,16 +476,16 @@ static int validate_afu(struct pci_dev *dev, struct 
> ocxl_afu_config *afu)
>  }
>
>  /**
> - * Populate AFU metadata regarding LPC memory
> - * dev: the device for the AFU
> - * fn: the AFU offsets
> - * afu: the AFU struct to populate the LPC metadata into
> + * read_afu_lpc_memory_info() - Populate AFU metadata regarding LPC memory
> + * @dev: the device for the AFU
> + * @fn: the AFU offsets
> + * @afu: the AFU struct to populate the LPC metadata into
>   *
>   * Returns 0 on success, negative on failure
>   */
>  static int read_afu_lpc_memory_info(struct pci_dev *dev,
> -   struct ocxl_fn_config *fn,
> -   struct ocxl_afu_config *afu)
> +   struct ocxl_fn_config *fn,
> +   struct ocxl_afu_config *afu)
>  {
> int rc;
> u32 val32;
> diff --git a/drivers/misc/ocxl/ocxl_internal.h 
> b/drivers/misc/ocxl/ocxl_internal.h
> index 345bf843a38e..198e4e4bc51d 100644
> --- a/drivers/misc/ocxl/ocxl_internal.h
> +++ b/drivers/misc/ocxl/ocxl_internal.h
> @@ -122,11 +122,12 @@ int ocxl_config_check_afu_index(struct pci_dev *dev,
> struct ocxl_fn_config *fn, int afu_idx);
>
>  /**
> - * Update values within a Process Element
> + * ocxl_link_update_pe() - Update values within a Process Element
> + * @link_handle: the link handle associated with the process element
> + * @pasid: the PASID for the AFU context
> + * @tid: the new thread id for the process element
>   *
> - * link_handle: the link handle associated with the process element
> - * pasid: the PASID for the AFU context
> - * tid: the new thread id for the process element
> + * Returns 0 on success
>   */
>  int ocxl_link_update_pe(void *link_handle, int pasid, __u16 tid);
>
> diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
> index 0a762e387418..357ef1aadbc0 100644
> --- a/include/misc/ocxl.h
> +++ b/include/misc/ocxl.h
> @@ -62,8 +62,7 @@ struct ocxl_context;
>  // Device detection & initialisation
>
>  /**
> - * Open an OpenCAPI function on an OpenCAPI device
> - *
> + * ocxl_function_open() - Open an OpenCAPI function on an OpenCAPI device
>   * @dev: The PCI device that contains the function
>   *
>   * Returns an opaque pointer to the function, or an error pointer (check 
> with IS_ERR)
> @@ -71,8 +70,7 @@ struct ocxl_context;
>  struct ocxl_fn *ocxl_function_open(struct pci_dev *dev);
>
>  /**
> - * Get the list of AFUs associated with a PCI function device
> - *
> + * ocxl_function_afu_list() - Get the list of AFUs associated with a PCI 
> function device
>   * Returns a list of struct ocxl_afu *
>   *
>   * @fn: The OpenCAPI function containing the AFUs
> @@ -80,8 +78,7 @@ struct ocxl_fn *ocxl_function_open(struct pci_dev *dev);
>  struct list_head *ocxl_function_afu_list(struct ocxl_fn *fn);
>
>  /**
> - * Fetch an AFU instance from an OpenCAPI function
> - *
> + * ocxl_function_fetch_afu() - Fetch an AFU instance from an OpenCAPI 
> function
>   * @fn: The OpenCAPI function to get the AFU from
>   * @afu_idx: The index of the AFU to get
>   *
> @@ -92,23 +89,20 @@ struct list_head *ocxl_function_afu_list(struct ocxl_fn 
> *fn);
>  struct ocxl_afu *ocxl_function_fetch_afu(struct ocxl_fn *fn, u8 afu_idx);
>
>  /**
> - * Take a reference to an AFU
> - *
> + * ocxl_afu_get() - Take a reference to an AFU
>   * @afu: The AFU to 

Re: [PATCH v4 06/25] ocxl: Tally up the LPC memory on a link & allow it to be mapped

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:53 PM Alastair D'Silva  wrote:
>
> OpenCAPI LPC memory is allocated per link, but each link supports
> multiple AFUs, and each AFU can have LPC memory assigned to it.

Is there an OpenCAPI primer to decode these objects and their
associations that I can reference?


>
> This patch tallys the memory for all AFUs on a link, allowing it
> to be mapped in a single operation after the AFUs have been
> enumerated.
>
> Signed-off-by: Alastair D'Silva 
> ---
>  drivers/misc/ocxl/core.c  | 10 ++
>  drivers/misc/ocxl/link.c  | 60 +++
>  drivers/misc/ocxl/ocxl_internal.h | 33 +
>  3 files changed, 103 insertions(+)
>
> diff --git a/drivers/misc/ocxl/core.c b/drivers/misc/ocxl/core.c
> index b7a09b21ab36..2531c6cf19a0 100644
> --- a/drivers/misc/ocxl/core.c
> +++ b/drivers/misc/ocxl/core.c
> @@ -230,8 +230,18 @@ static int configure_afu(struct ocxl_afu *afu, u8 
> afu_idx, struct pci_dev *dev)
> if (rc)
> goto err_free_pasid;
>
> +   if (afu->config.lpc_mem_size || afu->config.special_purpose_mem_size) 
> {
> +   rc = ocxl_link_add_lpc_mem(afu->fn->link, 
> afu->config.lpc_mem_offset,
> +  afu->config.lpc_mem_size +
> +  
> afu->config.special_purpose_mem_size);
> +   if (rc)
> +   goto err_free_mmio;
> +   }
> +
> return 0;
>
> +err_free_mmio:
> +   unmap_mmio_areas(afu);
>  err_free_pasid:
> reclaim_afu_pasid(afu);
>  err_free_actag:
> diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
> index 58d111afd9f6..af119d3ef79a 100644
> --- a/drivers/misc/ocxl/link.c
> +++ b/drivers/misc/ocxl/link.c
> @@ -84,6 +84,11 @@ struct ocxl_link {
> int dev;
> atomic_t irq_available;
> struct spa *spa;
> +   struct mutex lpc_mem_lock; /* protects lpc_mem & lpc_mem_sz */
> +   u64 lpc_mem_sz; /* Total amount of LPC memory presented on the link */
> +   u64 lpc_mem;
> +   int lpc_consumers;
> +
> void *platform_data;
>  };
>  static struct list_head links_list = LIST_HEAD_INIT(links_list);
> @@ -396,6 +401,8 @@ static int alloc_link(struct pci_dev *dev, int PE_mask, 
> struct ocxl_link **out_l
> if (rc)
> goto err_spa;
>
> +   mutex_init(>lpc_mem_lock);
> +
> /* platform specific hook */
> rc = pnv_ocxl_spa_setup(dev, link->spa->spa_mem, PE_mask,
> >platform_data);
> @@ -711,3 +718,56 @@ void ocxl_link_free_irq(void *link_handle, int hw_irq)
> atomic_inc(>irq_available);
>  }
>  EXPORT_SYMBOL_GPL(ocxl_link_free_irq);
> +
> +int ocxl_link_add_lpc_mem(void *link_handle, u64 offset, u64 size)
> +{
> +   struct ocxl_link *link = (struct ocxl_link *)link_handle;
> +
> +   // Check for overflow
> +   if (offset > (offset + size))
> +   return -EINVAL;
> +
> +   mutex_lock(>lpc_mem_lock);
> +   link->lpc_mem_sz = max(link->lpc_mem_sz, offset + size);
> +
> +   mutex_unlock(>lpc_mem_lock);
> +
> +   return 0;
> +}
> +
> +u64 ocxl_link_lpc_map(void *link_handle, struct pci_dev *pdev)
> +{
> +   struct ocxl_link *link = (struct ocxl_link *)link_handle;
> +
> +   mutex_lock(>lpc_mem_lock);
> +
> +   if (!link->lpc_mem)
> +   link->lpc_mem = pnv_ocxl_platform_lpc_setup(pdev, 
> link->lpc_mem_sz);
> +
> +   if (link->lpc_mem)
> +   link->lpc_consumers++;
> +   mutex_unlock(>lpc_mem_lock);
> +
> +   return link->lpc_mem;
> +}
> +
> +void ocxl_link_lpc_release(void *link_handle, struct pci_dev *pdev)
> +{
> +   struct ocxl_link *link = (struct ocxl_link *)link_handle;
> +
> +   mutex_lock(>lpc_mem_lock);
> +
> +   if (!link->lpc_mem) {
> +   mutex_unlock(>lpc_mem_lock);
> +   return;
> +   }
> +
> +   WARN_ON(--link->lpc_consumers < 0);
> +
> +   if (link->lpc_consumers == 0) {
> +   pnv_ocxl_platform_lpc_release(pdev);
> +   link->lpc_mem = 0;
> +   }
> +
> +   mutex_unlock(>lpc_mem_lock);
> +}
> diff --git a/drivers/misc/ocxl/ocxl_internal.h 
> b/drivers/misc/ocxl/ocxl_internal.h
> index 198e4e4bc51d..2d7575225bd7 100644
> --- a/drivers/misc/ocxl/ocxl_internal.h
> +++ b/drivers/misc/ocxl/ocxl_internal.h
> @@ -142,4 +142,37 @@ int ocxl_irq_offset_to_id(struct ocxl_context *ctx, u64 
> offset);
>  u64 ocxl_irq_id_to_offset(struct ocxl_context *ctx, int irq_id);
>  void ocxl_afu_irq_free_all(struct ocxl_context *ctx);
>
> +/**
> + * ocxl_link_add_lpc_mem() - Increment the amount of memory required by an 
> OpenCAPI link
> + *
> + * @link_handle: The OpenCAPI link handle
> + * @offset: The offset of the memory to add
> + * @size: The number of bytes to increment memory on the link by
> + *
> + * Returns 0 on success, -EINVAL on overflow
> + */
> +int ocxl_link_add_lpc_mem(void 

Re: [PATCH v4 04/25] ocxl: Remove unnecessary externs

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> Function declarations don't need externs, remove the existing ones
> so they are consistent with newer code
>
> Signed-off-by: Alastair D'Silva 
> Acked-by: Andrew Donnellan 
> Acked-by: Frederic Barrat 

Looks good.


> ---
>  arch/powerpc/include/asm/pnv-ocxl.h | 40 ++---
>  include/misc/ocxl.h |  6 ++---
>  2 files changed, 22 insertions(+), 24 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/pnv-ocxl.h 
> b/arch/powerpc/include/asm/pnv-ocxl.h
> index 560a19bb71b7..205efc41a33c 100644
> --- a/arch/powerpc/include/asm/pnv-ocxl.h
> +++ b/arch/powerpc/include/asm/pnv-ocxl.h
> @@ -9,29 +9,27 @@
>  #define PNV_OCXL_TL_BITS_PER_RATE   4
>  #define PNV_OCXL_TL_RATE_BUF_SIZE   ((PNV_OCXL_TL_MAX_TEMPLATE+1) * 
> PNV_OCXL_TL_BITS_PER_RATE / 8)
>
> -extern int pnv_ocxl_get_actag(struct pci_dev *dev, u16 *base, u16 *enabled,
> -   u16 *supported);
> -extern int pnv_ocxl_get_pasid_count(struct pci_dev *dev, int *count);
> +int pnv_ocxl_get_actag(struct pci_dev *dev, u16 *base, u16 *enabled, u16 
> *supported);
> +int pnv_ocxl_get_pasid_count(struct pci_dev *dev, int *count);
>
> -extern int pnv_ocxl_get_tl_cap(struct pci_dev *dev, long *cap,
> +int pnv_ocxl_get_tl_cap(struct pci_dev *dev, long *cap,
> char *rate_buf, int rate_buf_size);
> -extern int pnv_ocxl_set_tl_conf(struct pci_dev *dev, long cap,
> -   uint64_t rate_buf_phys, int rate_buf_size);
> -
> -extern int pnv_ocxl_get_xsl_irq(struct pci_dev *dev, int *hwirq);
> -extern void pnv_ocxl_unmap_xsl_regs(void __iomem *dsisr, void __iomem *dar,
> -   void __iomem *tfc, void __iomem *pe_handle);
> -extern int pnv_ocxl_map_xsl_regs(struct pci_dev *dev, void __iomem **dsisr,
> -   void __iomem **dar, void __iomem **tfc,
> -   void __iomem **pe_handle);
> -
> -extern int pnv_ocxl_spa_setup(struct pci_dev *dev, void *spa_mem, int 
> PE_mask,
> -   void **platform_data);
> -extern void pnv_ocxl_spa_release(void *platform_data);
> -extern int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int 
> pe_handle);
> -
> -extern int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr);
> -extern void pnv_ocxl_free_xive_irq(u32 irq);
> +int pnv_ocxl_set_tl_conf(struct pci_dev *dev, long cap,
> +uint64_t rate_buf_phys, int rate_buf_size);
> +
> +int pnv_ocxl_get_xsl_irq(struct pci_dev *dev, int *hwirq);
> +void pnv_ocxl_unmap_xsl_regs(void __iomem *dsisr, void __iomem *dar,
> +void __iomem *tfc, void __iomem *pe_handle);
> +int pnv_ocxl_map_xsl_regs(struct pci_dev *dev, void __iomem **dsisr,
> + void __iomem **dar, void __iomem **tfc,
> + void __iomem **pe_handle);
> +
> +int pnv_ocxl_spa_setup(struct pci_dev *dev, void *spa_mem, int PE_mask, void 
> **platform_data);
> +void pnv_ocxl_spa_release(void *platform_data);
> +int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int pe_handle);
> +
> +int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr);
> +void pnv_ocxl_free_xive_irq(u32 irq);
>  u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size);
>  void pnv_ocxl_platform_lpc_release(struct pci_dev *pdev);
>
> diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
> index 06dd5839e438..0a762e387418 100644
> --- a/include/misc/ocxl.h
> +++ b/include/misc/ocxl.h
> @@ -173,7 +173,7 @@ int ocxl_context_detach(struct ocxl_context *ctx);
>   *
>   * Returns 0 on success, negative on failure
>   */
> -extern int ocxl_afu_irq_alloc(struct ocxl_context *ctx, int *irq_id);
> +int ocxl_afu_irq_alloc(struct ocxl_context *ctx, int *irq_id);
>
>  /**
>   * Frees an IRQ associated with an AFU context
> @@ -182,7 +182,7 @@ extern int ocxl_afu_irq_alloc(struct ocxl_context *ctx, 
> int *irq_id);
>   *
>   * Returns 0 on success, negative on failure
>   */
> -extern int ocxl_afu_irq_free(struct ocxl_context *ctx, int irq_id);
> +int ocxl_afu_irq_free(struct ocxl_context *ctx, int irq_id);
>
>  /**
>   * Gets the address of the trigger page for an IRQ
> @@ -193,7 +193,7 @@ extern int ocxl_afu_irq_free(struct ocxl_context *ctx, 
> int irq_id);
>   *
>   * returns the trigger page address, or 0 if the IRQ is not valid
>   */
> -extern u64 ocxl_afu_irq_get_addr(struct ocxl_context *ctx, int irq_id);
> +u64 ocxl_afu_irq_get_addr(struct ocxl_context *ctx, int irq_id);
>
>  /**
>   * Provide a callback to be called when an IRQ is triggered
> --
> 2.24.1
>


Re: [PATCH v4 03/25] powerpc/powernv: Map & release OpenCAPI LPC memory

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> This patch adds OPAL calls to powernv so that the OpenCAPI
> driver can map & release LPC (Lowest Point of Coherency)  memory.
>
> Signed-off-by: Alastair D'Silva 
> Reviewed-by: Andrew Donnellan 
> ---
>  arch/powerpc/include/asm/pnv-ocxl.h   |  2 ++
>  arch/powerpc/platforms/powernv/ocxl.c | 43 +++
>  2 files changed, 45 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/pnv-ocxl.h 
> b/arch/powerpc/include/asm/pnv-ocxl.h
> index 7de82647e761..560a19bb71b7 100644
> --- a/arch/powerpc/include/asm/pnv-ocxl.h
> +++ b/arch/powerpc/include/asm/pnv-ocxl.h
> @@ -32,5 +32,7 @@ extern int pnv_ocxl_spa_remove_pe_from_cache(void 
> *platform_data, int pe_handle)
>
>  extern int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr);
>  extern void pnv_ocxl_free_xive_irq(u32 irq);
> +u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size);
> +void pnv_ocxl_platform_lpc_release(struct pci_dev *pdev);
>
>  #endif /* _ASM_PNV_OCXL_H */
> diff --git a/arch/powerpc/platforms/powernv/ocxl.c 
> b/arch/powerpc/platforms/powernv/ocxl.c
> index 8c65aacda9c8..f13119a7c026 100644
> --- a/arch/powerpc/platforms/powernv/ocxl.c
> +++ b/arch/powerpc/platforms/powernv/ocxl.c
> @@ -475,6 +475,49 @@ void pnv_ocxl_spa_release(void *platform_data)
>  }
>  EXPORT_SYMBOL_GPL(pnv_ocxl_spa_release);
>
> +u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size)
> +{
> +   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> +   struct pnv_phb *phb = hose->private_data;

Is calling the local variable 'hose' instead of 'host' on purpose?

> +   u32 bdfn = pci_dev_id(pdev);
> +   __be64 base_addr_be64;
> +   u64 base_addr;
> +   int rc;
> +
> +   rc = opal_npu_mem_alloc(phb->opal_id, bdfn, size, _addr_be64);
> +   if (rc) {
> +   dev_warn(>dev,
> +"OPAL could not allocate LPC memory, rc=%d\n", rc);
> +   return 0;
> +   }
> +
> +   base_addr = be64_to_cpu(base_addr_be64);
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE

With the proposed cleanup in patch2 the ifdef can be elided here.

> +   rc = check_hotplug_memory_addressable(base_addr >> PAGE_SHIFT,
> + size >> PAGE_SHIFT);
> +   if (rc)
> +   return 0;

Is this an error worth logging if someone is wondering why their
device is not showing up?


> +#endif
> +
> +   return base_addr;
> +}
> +EXPORT_SYMBOL_GPL(pnv_ocxl_platform_lpc_setup);
> +
> +void pnv_ocxl_platform_lpc_release(struct pci_dev *pdev)
> +{
> +   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> +   struct pnv_phb *phb = hose->private_data;
> +   u32 bdfn = pci_dev_id(pdev);
> +   int rc;
> +
> +   rc = opal_npu_mem_release(phb->opal_id, bdfn);
> +   if (rc)
> +   dev_warn(>dev,
> +"OPAL reported rc=%d when releasing LPC memory\n", 
> rc);
> +}
> +EXPORT_SYMBOL_GPL(pnv_ocxl_platform_lpc_release);
> +
>  int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int pe_handle)
>  {
> struct spa_data *data = (struct spa_data *) platform_data;
> --
> 2.24.1
>


Re: [PATCH v4 02/25] mm/memory_hotplug: Allow check_hotplug_memory_addressable to be called from drivers

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> When setting up OpenCAPI connected persistent memory, the range check may
> not be performed until quite late (or perhaps not at all, if the user does
> not establish a DAX device).
>
> This patch makes the range check callable so we can perform the check while
> probing the OpenCAPI Persistent Memory device.
>
> Signed-off-by: Alastair D'Silva 
> Reviewed-by: Andrew Donnellan 
> ---
>  include/linux/memory_hotplug.h | 5 +
>  mm/memory_hotplug.c| 4 ++--
>  2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index f4d59155f3d4..9a19ae0d7e31 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -337,6 +337,11 @@ static inline void __remove_memory(int nid, u64 start, 
> u64 size) {}
>  extern void set_zone_contiguous(struct zone *zone);
>  extern void clear_zone_contiguous(struct zone *zone);
>
> +#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
> +int check_hotplug_memory_addressable(unsigned long pfn,
> +unsigned long nr_pages);
> +#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */

Let's move this to include/linux/memory.h with the other
CONFIG_MEMORY_HOTPLUG_SPARSE declarations, and add a dummy
implementation for the CONFIG_MEMORY_HOTPLUG_SPARSE=n case.

Also, this patch can be squashed with the next one, no need for it to
be stand alone.


> +
>  extern void __ref free_area_init_core_hotplug(int nid);
>  extern int __add_memory(int nid, u64 start, u64 size);
>  extern int add_memory(int nid, u64 start, u64 size);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0a54ffac8c68..14945f033594 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -276,8 +276,8 @@ static int check_pfn_span(unsigned long pfn, unsigned 
> long nr_pages,
> return 0;
>  }
>
> -static int check_hotplug_memory_addressable(unsigned long pfn,
> -   unsigned long nr_pages)
> +int check_hotplug_memory_addressable(unsigned long pfn,
> +unsigned long nr_pages)
>  {
> const u64 max_addr = PFN_PHYS(pfn + nr_pages) - 1;
>
> --
> 2.24.1
>


Re: [PATCH v4 01/25] powerpc/powernv: Add OPAL calls for LPC memory alloc/release

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> Add OPAL calls for LPC memory alloc/release
>

This seems to be referencing an existing api definition, can you
include a pointer to the spec in case someone wanted to understand
what these routines do? I suspect this is not allocating memory in the
traditional sense as much as it's allocating physical address space
for a device to be mapped?


> Signed-off-by: Alastair D'Silva 
> Acked-by: Andrew Donnellan 
> Acked-by: Frederic Barrat 
> ---
>  arch/powerpc/include/asm/opal-api.h| 2 ++
>  arch/powerpc/include/asm/opal.h| 2 ++
>  arch/powerpc/platforms/powernv/opal-call.c | 2 ++
>  3 files changed, 6 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/opal-api.h 
> b/arch/powerpc/include/asm/opal-api.h
> index c1f25a760eb1..9298e603001b 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -208,6 +208,8 @@
>  #define OPAL_HANDLE_HMI2   166
>  #defineOPAL_NX_COPROC_INIT 167
>  #define OPAL_XIVE_GET_VP_STATE 170
> +#define OPAL_NPU_MEM_ALLOC 171
> +#define OPAL_NPU_MEM_RELEASE   172
>  #define OPAL_MPIPL_UPDATE  173
>  #define OPAL_MPIPL_REGISTER_TAG174
>  #define OPAL_MPIPL_QUERY_TAG   175
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index 9986ac34b8e2..301fea46c7ca 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -39,6 +39,8 @@ int64_t opal_npu_spa_clear_cache(uint64_t phb_id, uint32_t 
> bdfn,
> uint64_t PE_handle);
>  int64_t opal_npu_tl_set(uint64_t phb_id, uint32_t bdfn, long cap,
> uint64_t rate_phys, uint32_t size);
> +int64_t opal_npu_mem_alloc(u64 phb_id, u32 bdfn, u64 size, __be64 *bar);
> +int64_t opal_npu_mem_release(u64 phb_id, u32 bdfn);
>
>  int64_t opal_console_write(int64_t term_number, __be64 *length,
>const uint8_t *buffer);
> diff --git a/arch/powerpc/platforms/powernv/opal-call.c 
> b/arch/powerpc/platforms/powernv/opal-call.c
> index 5cd0f52d258f..f26e58b72c04 100644
> --- a/arch/powerpc/platforms/powernv/opal-call.c
> +++ b/arch/powerpc/platforms/powernv/opal-call.c
> @@ -287,6 +287,8 @@ OPAL_CALL(opal_pci_set_pbcq_tunnel_bar, 
> OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
>  OPAL_CALL(opal_sensor_read_u64,OPAL_SENSOR_READ_U64);
>  OPAL_CALL(opal_sensor_group_enable,OPAL_SENSOR_GROUP_ENABLE);
>  OPAL_CALL(opal_nx_coproc_init, OPAL_NX_COPROC_INIT);
> +OPAL_CALL(opal_npu_mem_alloc,  OPAL_NPU_MEM_ALLOC);
> +OPAL_CALL(opal_npu_mem_release,OPAL_NPU_MEM_RELEASE);
>  OPAL_CALL(opal_mpipl_update,   OPAL_MPIPL_UPDATE);
>  OPAL_CALL(opal_mpipl_register_tag, OPAL_MPIPL_REGISTER_TAG);
>  OPAL_CALL(opal_mpipl_query_tag,OPAL_MPIPL_QUERY_TAG);
> --
> 2.24.1
>


Re: [PATCH v4 00/25] Add support for OpenCAPI Persistent Memory devices

2020-04-01 Thread Dan Williams
On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva  wrote:
>
> This series adds support for OpenCAPI Persistent Memory devices on bare metal 
> (arch/powernv), exposing them as nvdimms so that we can make use of the 
> existing infrastructure. There already exists a driver for the same devices 
> abstracted through PowerVM (arch/pseries): 
> arch/powerpc/platforms/pseries/papr_scm.c
>
> These devices are connected via OpenCAPI, and present as LPC (lowest 
> coherence point) memory to the system, practically, that means that memory on 
> these cards could be treated as conventional, cache-coherent memory.
>
> Since the devices are connected via OpenCAPI, they are not enumerated via 
> ACPI. Instead, OpenCAPI links present as pseudo-PCI bridges, with devices 
> below them.
>
> This series introduces a driver that exposes the memory on these cards as 
> nvdimms, with each card getting it's own bus. This is somewhat complicated by 
> the fact that the cards do not have out of band persistent storage for 
> metadata, so 1 SECTION_SIZE's (see SPARSEMEM) worth of storage is carved out 
> of the top of the card storage to implement the ndctl_config_* calls.

Is it really tied to section-size? Can't that change based on the
configured page-size? It's not clear to me why that would be the
choice, but I'll dig into the implementation.

> The driver is not responsible for configuring the NPU (NVLink Processing 
> Unit) BARs to map the LPC memory from the card into the system's physical 
> address space, instead, it requests this to be done via OPAL calls (typically 
> implemented by Skiboot).

Are OPAL calls similar to ACPI DSMs? I.e. methods for the OS to invoke
platform firmware services? What's Skiboot?

>
> The series is structured as follows:
>  - Required infrastructure changes & cleanup
>  - A minimal driver implementation
>  - Implementing additional features within the driver

Thanks for the intro and the changelog!

>
> Changelog:
> V4:
>   - Rebase on next-20200320

Do you have dependencies on other material that's in -next? Otherwise
-next is only a viable development baseline if you are going to merge
through Andrew's tree.

>   - Bump copyright to 2020
>   - Ensure all uapi headers use C89 compatible comments (missed ocxlpmem.h)
>   - Move the driver back to drivers/nvdimm/ocxl, after confirmation
> that this location is desirable
>   - Rename ocxl.c to ocxlpmem.c (+ support files)
>   - Rename all ocxl_pmem to ocxlpmem
>   - Address checkpatch --strict issues
>   - "powerpc/powernv: Add OPAL calls for LPC memory alloc/release"
> - Pass base address as __be64
>   - "ocxl: Tally up the LPC memory on a link & allow it to be mapped"
> - Address checkpatch spacing warnings
> - Reword blurb
> - Reword size description for ocxl_link_add_lpc_mem()
> - Add an early exit in ocxl_link_lpc_release() to avoid triggering
>   bogus warnings if called after ocxl_link_lpc_map() fails
>   - "powerpc/powernv: Add OPAL calls for LPC memory alloc/release"
> - Reword blurb
>   - "powerpc/powernv: Map & release OpenCAPI LPC memory"
> - Reword blurb
>   - Move minor_idr init from file_init() to ocxlpmem_init() (fixes runtime 
> error
> in "nvdimm: Add driver for OpenCAPI Persistent Memory")
>   - Wrap long lines
>   - "nvdimm: Add driver for OpenCAPI Storage Class Memory"
> - Remove '+ 1' workround from serial number->cookie assignment
> - Drop out of memory message for ocxlpmem in probe()
> - Fix leaks of ocxlpmem & ocxlpmem->ocxl_fn in probe()
> - remove struct ocxlpmem_function0, it didn't value add
> - factor out err_unregistered label in probe
> - Address more checkpatch warnings
> - get/put the pci dev on probe/free
> - Drop ocxlpmem_ prefix from static functions
> - Propogate errors up from called functions in probe()
> - Set MODULE_LICENSE to GPLv2
> - Add myself as module author
> - Call nvdimm_bus_unregister() in remove() to release references
> - Don't call devm_memunmap on metadata_address, the release handler on
>  the device already deals with this
>   - "nvdimm/ocxl: Read the capability registers & wait for device ready"
> - Fix mask for read_latency
> - Fold in is_usable logic into timeout to remove error message race
> - propogate bad rc from read_device_metadata
>   - "nvdimm/ocxl: Add register addresses & status values to the header"
> - Add comments for register abbreviations where names have been
>   expanded
> - Add missing status for blocked on background task
> - Alias defines for firmware update status to show that the 
> duplication
>   of values is intentional
>   - "nvdimm/ocxl: Register a character device for userspace to interact with"
> - Add lock around minors IDR, delete the cdev before device_unregister
> - Propogate errors up from 

Re: [PATCH v2] libnvdimm: Update persistence domain value for of_pmem and papr_scm device

2020-03-23 Thread Dan Williams
On Fri, Mar 20, 2020 at 2:25 AM Aneesh Kumar K.V
 wrote:
>
>
> Hi Dan,
>
>
> Dan Williams  writes:
>
> ...
>
>
> >
> >>
> >> Or are you suggesting that application should not infer any of those
> >> details looking at persistence_domain value? If so what is the purpose
> >> of exporting that attribute?
> >
> > The way the patch was worded I thought it was referring to an explicit
> > mechanism outside cpu cache flushes, i.e. a mechanism that required a
> > driver call.
> >
>
> This patch is blocked because I am not expressing the details correctly.
> I updates this as below. Can you suggest if this is ok? If not what
> alternate wording do you suggest to document "memory controller"
>
>
> commit 329b46e88f8cd30eee4776b0de7913ab4d496bd8
> Author: Aneesh Kumar K.V 
> Date:   Wed Dec 18 13:53:16 2019 +0530
>
> libnvdimm: Update persistence domain value for of_pmem and papr_scm device
>
> Currently, kernel shows the below values
> "persistence_domain":"cpu_cache"
> "persistence_domain":"memory_controller"
> "persistence_domain":"unknown"
>
> "cpu_cache" indicates no extra instructions is needed to ensure the 
> persistence
> of data in the pmem media on power failure.
>
> "memory_controller" indicates cpu cache flush instructions is required to 
> flush
> the data. Platform provides mechanisms to automatically flush outstanding
> write data from memory controler to pmem on system power loss.
>
> Based on the above use memory_controller for non volatile regions on 
> ppc64.

Looks good to me, want to resend via git-format-patch?


[PATCH v4 5/5] libnvdimm/region: Introduce an 'align' attribute

2020-03-03 Thread Dan Williams
The align attribute applies an alignment constraint for namespace
creation in a region. Whereas the 'align' attribute of a namespace
applied alignment padding via an info block, the 'align' attribute
applies alignment constraints to the free space allocation.

The default for 'align' is the maximum known memremap_compat_align()
across all archs (16MiB from PowerPC at time of writing) multiplied by
the number of interleave ways if there is blk-aliasing. The minimum is
PAGE_SIZE and allows for the creation of cross-arch incompatible
namespaces, just as previous kernels allowed, but the expectation is
cross-arch and mode-independent compatibility by default.

The regression risk with this change is limited to cases that were
dependent on the ability to create unaligned namespaces, *and* for some
reason are unable to opt-out of aligned namespaces by writing to
'regionX/align'. If such a scenario arises the default can be flipped
from opt-out to opt-in of compat-aligned namespace creation, but that is
a last resort. The kernel will otherwise continue to support existing
defined misaligned namespaces.

Unfortunately this change needs to touch several parts of the
implementation at once:

- region/available_size: expand busy extents to current align
- region/max_available_extent: expand busy extents to current align
- namespace/size: trim free space to current align

...to keep the free space accounting conforming to the dynamic align
setting.

Reported-by: Aneesh Kumar K.V 
Reported-by: Jeff Moyer 
Reviewed-by: Aneesh Kumar K.V 
Reviewed-by: Jeff Moyer 
Link: 
https://lore.kernel.org/r/158041478371.3889308.14542630147672668068.st...@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/dimm_devs.c  |   86 +++
 drivers/nvdimm/namespace_devs.c |9 ++-
 drivers/nvdimm/nd.h |1 
 drivers/nvdimm/region_devs.c|  122 ---
 4 files changed, 192 insertions(+), 26 deletions(-)

diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 39a61a514746..b7b77e8d9027 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -563,6 +563,21 @@ int nvdimm_security_freeze(struct nvdimm *nvdimm)
return rc;
 }
 
+static unsigned long dpa_align(struct nd_region *nd_region)
+{
+   struct device *dev = _region->dev;
+
+   if (dev_WARN_ONCE(dev, !is_nvdimm_bus_locked(dev),
+   "bus lock required for capacity provision\n"))
+   return 0;
+   if (dev_WARN_ONCE(dev, !nd_region->ndr_mappings || nd_region->align
+   % nd_region->ndr_mappings,
+   "invalid region align %#lx mappings: %d\n",
+   nd_region->align, nd_region->ndr_mappings))
+   return 0;
+   return nd_region->align / nd_region->ndr_mappings;
+}
+
 int alias_dpa_busy(struct device *dev, void *data)
 {
resource_size_t map_end, blk_start, new;
@@ -571,6 +586,7 @@ int alias_dpa_busy(struct device *dev, void *data)
struct nd_region *nd_region;
struct nvdimm_drvdata *ndd;
struct resource *res;
+   unsigned long align;
int i;
 
if (!is_memory(dev))
@@ -608,13 +624,21 @@ int alias_dpa_busy(struct device *dev, void *data)
 * Find the free dpa from the end of the last pmem allocation to
 * the end of the interleave-set mapping.
 */
+   align = dpa_align(nd_region);
+   if (!align)
+   return 0;
+
for_each_dpa_resource(ndd, res) {
+   resource_size_t start, end;
+
if (strncmp(res->name, "pmem", 4) != 0)
continue;
-   if ((res->start >= blk_start && res->start < map_end)
-   || (res->end >= blk_start
-   && res->end <= map_end)) {
-   new = max(blk_start, min(map_end + 1, res->end + 1));
+
+   start = ALIGN_DOWN(res->start, align);
+   end = ALIGN(res->end + 1, align) - 1;
+   if ((start >= blk_start && start < map_end)
+   || (end >= blk_start && end <= map_end)) {
+   new = max(blk_start, min(map_end, end) + 1);
if (new != blk_start) {
blk_start = new;
goto retry;
@@ -654,6 +678,7 @@ resource_size_t nd_blk_available_dpa(struct nd_region 
*nd_region)
.res = NULL,
};
struct resource *res;
+   unsigned long align;
 
if (!ndd)
return 0;
@@ -661,10 +686,20 @@ resource_size_t nd_blk_available_dpa(struct nd_region 
*nd_region)
device_for_each_child(_bus->dev, , alias_

[PATCH v4 4/5] libnvdimm/region: Introduce NDD_LABELING

2020-03-03 Thread Dan Williams
The NDD_ALIASING flag is used to indicate where pmem capacity might
alias with blk capacity and require labeling. It is also used to
indicate whether the DIMM supports labeling. Separate this latter
capability into its own flag so that the NDD_ALIASING flag is scoped to
true aliased configurations.

To my knowledge aliased configurations only exist in the ACPI spec,
there are no known platforms that ship this support in production.

This clarity allows namespace-capacity alignment constraints around
interleave-ways to be relaxed.

Cc: Vishal Verma 
Cc: Oliver O'Halloran 
Reviewed-by: Jeff Moyer 
Reviewed-by: Aneesh Kumar K.V 
Link: 
https://lore.kernel.org/r/158041477856.3889308.4212605617834097674.st...@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams 
---
 arch/powerpc/platforms/pseries/papr_scm.c |2 +-
 drivers/acpi/nfit/core.c  |4 +++-
 drivers/nvdimm/dimm.c |2 +-
 drivers/nvdimm/dimm_devs.c|9 +
 drivers/nvdimm/namespace_devs.c   |2 +-
 drivers/nvdimm/nd.h   |2 +-
 drivers/nvdimm/region_devs.c  |   10 +-
 include/linux/libnvdimm.h |2 ++
 8 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 0b4467e378e5..589858cb3203 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -328,7 +328,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
}
 
dimm_flags = 0;
-   set_bit(NDD_ALIASING, _flags);
+   set_bit(NDD_LABELING, _flags);
 
p->nvdimm = nvdimm_create(p->bus, p, NULL, dimm_flags,
  PAPR_SCM_DIMM_CMD_MASK, 0, NULL);
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index a3320f93616d..71d7f2aa1b12 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2026,8 +2026,10 @@ static int acpi_nfit_register_dimms(struct 
acpi_nfit_desc *acpi_desc)
continue;
}
 
-   if (nfit_mem->bdw && nfit_mem->memdev_pmem)
+   if (nfit_mem->bdw && nfit_mem->memdev_pmem) {
set_bit(NDD_ALIASING, );
+   set_bit(NDD_LABELING, );
+   }
 
/* collate flags across all memdevs for this dimm */
list_for_each_entry(nfit_memdev, _desc->memdevs, list) {
diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c
index 64776ed15bb3..7d4ddc4d9322 100644
--- a/drivers/nvdimm/dimm.c
+++ b/drivers/nvdimm/dimm.c
@@ -99,7 +99,7 @@ static int nvdimm_probe(struct device *dev)
if (ndd->ns_current >= 0) {
rc = nd_label_reserve_dpa(ndd);
if (rc == 0)
-   nvdimm_set_aliasing(dev);
+   nvdimm_set_labeling(dev);
}
nvdimm_bus_unlock(dev);
 
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 94ea6dba6b4f..39a61a514746 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -32,7 +32,7 @@ int nvdimm_check_config_data(struct device *dev)
 
if (!nvdimm->cmd_mask ||
!test_bit(ND_CMD_GET_CONFIG_DATA, >cmd_mask)) {
-   if (test_bit(NDD_ALIASING, >flags))
+   if (test_bit(NDD_LABELING, >flags))
return -ENXIO;
else
return -ENOTTY;
@@ -173,11 +173,11 @@ int nvdimm_set_config_data(struct nvdimm_drvdata *ndd, 
size_t offset,
return rc;
 }
 
-void nvdimm_set_aliasing(struct device *dev)
+void nvdimm_set_labeling(struct device *dev)
 {
struct nvdimm *nvdimm = to_nvdimm(dev);
 
-   set_bit(NDD_ALIASING, >flags);
+   set_bit(NDD_LABELING, >flags);
 }
 
 void nvdimm_set_locked(struct device *dev)
@@ -312,8 +312,9 @@ static ssize_t flags_show(struct device *dev,
 {
struct nvdimm *nvdimm = to_nvdimm(dev);
 
-   return sprintf(buf, "%s%s\n",
+   return sprintf(buf, "%s%s%s\n",
test_bit(NDD_ALIASING, >flags) ? "alias " : "",
+   test_bit(NDD_LABELING, >flags) ? "label " : "",
test_bit(NDD_LOCKED, >flags) ? "lock " : "");
 }
 static DEVICE_ATTR_RO(flags);
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 77e211c7d94d..01f6c22f0d1a 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -2538,7 +2538,7 @@ static int init_active_labels(struct nd_region *nd_region)
if (!ndd) {
if (test_bit(NDD_LOCKED, >flags))
/* fail, label data may be unreadable */;
-   

[PATCH v4 3/5] libnvdimm/namespace: Enforce memremap_compat_align()

2020-03-03 Thread Dan Williams
The pmem driver on PowerPC crashes with the following signature when
instantiating misaligned namespaces that map their capacity via
memremap_pages().

BUG: Unable to handle kernel data access at 0xc00100040600
Faulting instruction address: 0xc0090790
NIP [c0090790] arch_add_memory+0xc0/0x130
LR [c0090744] arch_add_memory+0x74/0x130
Call Trace:
 arch_add_memory+0x74/0x130 (unreliable)
 memremap_pages+0x74c/0xa30
 devm_memremap_pages+0x3c/0xa0
 pmem_attach_disk+0x188/0x770
 nvdimm_bus_probe+0xd8/0x470

With the assumption that only memremap_pages() has alignment
constraints, enforce memremap_compat_align() for
pmem_should_map_pages(), nd_pfn, and nd_dax cases. This includes
preventing the creation of namespaces where the base address is
misaligned and cases there infoblock padding parameters are invalid.

Reported-by: Aneesh Kumar K.V 
Cc: Jeff Moyer 
Fixes: a3619190d62e ("libnvdimm/pfn: stop padding pmem namespaces to section 
alignment")
Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/namespace_devs.c |   17 +
 drivers/nvdimm/pfn.h|   12 
 drivers/nvdimm/pfn_devs.c   |   32 +---
 3 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 032dc61725ff..77e211c7d94d 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -10,6 +10,7 @@
 #include 
 #include "nd-core.h"
 #include "pmem.h"
+#include "pfn.h"
 #include "nd.h"
 
 static void namespace_io_release(struct device *dev)
@@ -1739,6 +1740,22 @@ struct nd_namespace_common 
*nvdimm_namespace_common_probe(struct device *dev)
return ERR_PTR(-ENODEV);
}
 
+   /*
+* Note, alignment validation for fsdax and devdax mode
+* namespaces happens in nd_pfn_validate() where infoblock
+* padding parameters can be applied.
+*/
+   if (pmem_should_map_pages(dev)) {
+   struct nd_namespace_io *nsio = to_nd_namespace_io(>dev);
+   struct resource *res = >res;
+
+   if (!IS_ALIGNED(res->start | (res->end + 1),
+   memremap_compat_align())) {
+   dev_err(>dev, "%pr misaligned, unable to map\n", 
res);
+   return ERR_PTR(-EOPNOTSUPP);
+   }
+   }
+
if (is_namespace_pmem(>dev)) {
struct nd_namespace_pmem *nspm;
 
diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h
index acb19517f678..37cb1b8a2a39 100644
--- a/drivers/nvdimm/pfn.h
+++ b/drivers/nvdimm/pfn.h
@@ -24,6 +24,18 @@ struct nd_pfn_sb {
__le64 npfns;
__le32 mode;
/* minor-version-1 additions for section alignment */
+   /**
+* @start_pad: Deprecated attribute to pad start-misaligned namespaces
+*
+* start_pad is deprecated because the original definition did
+* not comprehend that dataoff is relative to the base address
+* of the namespace not the start_pad adjusted base. The result
+* is that the dax path is broken, but the block-I/O path is
+* not. The kernel will no longer create namespaces using start
+* padding, but it still supports block-I/O for legacy
+* configurations mainly to allow a backup, reconfigure the
+* namespace, and restore flow to repair dax operation.
+*/
__le32 start_pad;
__le32 end_trunc;
/* minor-version-2 record the base alignment of the mapping */
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 79fe02d6f657..34db557dbad1 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -446,6 +446,7 @@ static bool nd_supported_alignment(unsigned long align)
 int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
 {
u64 checksum, offset;
+   struct resource *res;
enum nd_pfn_mode mode;
struct nd_namespace_io *nsio;
unsigned long align, start_pad;
@@ -578,13 +579,14 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char 
*sig)
 * established.
 */
nsio = to_nd_namespace_io(>dev);
-   if (offset >= resource_size(>res)) {
+   res = >res;
+   if (offset >= resource_size(res)) {
dev_err(_pfn->dev, "pfn array size exceeds capacity of %s\n",
dev_name(>dev));
return -EOPNOTSUPP;
}
 
-   if ((align && !IS_ALIGNED(nsio->res.start + offset + start_pad, align))
+   if ((align && !IS_ALIGNED(res->start + offset + start_pad, align))
|| !IS_ALIGNED(offset, PAGE_SIZE)) {
dev_err(_pfn->dev,
 

[PATCH v4 2/5] libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid

2020-03-03 Thread Dan Williams
The EOPNOTSUPP return code from the pmem driver indicates that the
namespace has a configuration that may be valid, but the current kernel
does not support it. Expand this to all of the nd_pfn_validate() error
conditions after the infoblock has been verified as self consistent.

This prevents exposing the namespace to I/O when the infoblock needs to
be corrected, or the system needs to be put into a different
configuration (like changing the page size on PowerPC).

Cc: Aneesh Kumar K.V 
Cc: Jeff Moyer 
Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/pfn_devs.c |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index a5c25cb87116..79fe02d6f657 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -561,14 +561,14 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char 
*sig)
dev_dbg(_pfn->dev, "align: %lx:%lx mode: %d:%d\n",
nd_pfn->align, align, nd_pfn->mode,
mode);
-   return -EINVAL;
+   return -EOPNOTSUPP;
}
}
 
if (align > nvdimm_namespace_capacity(ndns)) {
dev_err(_pfn->dev, "alignment: %lx exceeds capacity %llx\n",
align, nvdimm_namespace_capacity(ndns));
-   return -EINVAL;
+   return -EOPNOTSUPP;
}
 
/*
@@ -581,7 +581,7 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
if (offset >= resource_size(>res)) {
dev_err(_pfn->dev, "pfn array size exceeds capacity of %s\n",
dev_name(>dev));
-   return -EBUSY;
+   return -EOPNOTSUPP;
}
 
if ((align && !IS_ALIGNED(nsio->res.start + offset + start_pad, align))
@@ -589,7 +589,7 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
dev_err(_pfn->dev,
"bad offset: %#llx dax disabled align: %#lx\n",
offset, align);
-   return -ENXIO;
+   return -EOPNOTSUPP;
}
 
return 0;



[PATCH v4 1/5] mm/memremap_pages: Introduce memremap_compat_align()

2020-03-03 Thread Dan Williams
The "sub-section memory hotplug" facility allows memremap_pages() users
like libnvdimm to compensate for hardware platforms like x86 that have a
section size larger than their hardware memory mapping granularity.  The
compensation that sub-section support affords is being tolerant of
physical memory resources shifting by units smaller (64MiB on x86) than
the memory-hotplug section size (128 MiB). Where the platform
physical-memory mapping granularity is limited by the number and
capability of address-decode-registers in the memory controller.

While the sub-section support allows memremap_pages() to operate on
sub-section (2MiB) granularity, the Power architecture may still
require 16MiB alignment on "!radix_enabled()" platforms.

In order for libnvdimm to be able to detect and manage this per-arch
limitation, introduce memremap_compat_align() as a common minimum
alignment across all driver-facing memory-mapping interfaces, and let
Power override it to 16MiB in the "!radix_enabled()" case.

The assumption / requirement for 16MiB to be a viable
memremap_compat_align() value is that Power does not have platforms
where its equivalent of address-decode-registers never hardware remaps a
persistent memory resource on smaller than 16MiB boundaries. Note that I
tried my best to not add a new Kconfig symbol, but header include
entanglements defeated the #ifndef memremap_compat_align design pattern
and the need to export it defeats the __weak design pattern for arch
overrides.

Based on an initial patch by Aneesh.

Link: 
http://lore.kernel.org/r/capcyv4gbgnp95apyabcsocea50tqj9b5h__83vgngjq3oug...@mail.gmail.com
Reported-by: Aneesh Kumar K.V 
Reported-by: Jeff Moyer 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Dan Williams 
---
 arch/powerpc/Kconfig  |1 +
 arch/powerpc/mm/ioremap.c |   21 +
 drivers/nvdimm/pfn_devs.c |2 +-
 include/linux/memremap.h  |8 
 include/linux/mmzone.h|1 +
 lib/Kconfig   |3 +++
 mm/memremap.c |   23 +++
 7 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 497b7d0b2d7e..e6ffe905e2b9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -122,6 +122,7 @@ config PPC
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_KCOV
select ARCH_HAS_HUGEPD  if HUGETLB_PAGE
+   select ARCH_HAS_MEMREMAP_COMPAT_ALIGN
select ARCH_HAS_MMIOWB  if PPC64
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_PMEM_API
diff --git a/arch/powerpc/mm/ioremap.c b/arch/powerpc/mm/ioremap.c
index fc669643ce6a..b1a0aebe8c48 100644
--- a/arch/powerpc/mm/ioremap.c
+++ b/arch/powerpc/mm/ioremap.c
@@ -2,6 +2,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -97,3 +98,23 @@ void __iomem *do_ioremap(phys_addr_t pa, phys_addr_t offset, 
unsigned long size,
 
return NULL;
 }
+
+#ifdef CONFIG_ZONE_DEVICE
+/*
+ * Override the generic version in mm/memremap.c.
+ *
+ * With hash translation, the direct-map range is mapped with just one
+ * page size selected by htab_init_page_sizes(). Consult
+ * mmu_psize_defs[] to determine the minimum page size alignment.
+*/
+unsigned long memremap_compat_align(void)
+{
+   unsigned int shift = mmu_psize_defs[mmu_linear_psize].shift;
+
+   if (radix_enabled())
+   return SUBSECTION_SIZE;
+   return max(SUBSECTION_SIZE, 1UL << shift);
+
+}
+EXPORT_SYMBOL_GPL(memremap_compat_align);
+#endif
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index b94f7a7e94b8..a5c25cb87116 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -750,7 +750,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
start = nsio->res.start;
size = resource_size(>res);
npfns = PHYS_PFN(size - SZ_8K);
-   align = max(nd_pfn->align, (1UL << SUBSECTION_SHIFT));
+   align = max(nd_pfn->align, SUBSECTION_SIZE);
end_trunc = start + size - ALIGN_DOWN(start + size, align);
if (nd_pfn->mode == PFN_MODE_PMEM) {
/*
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 6fefb09af7c3..8af1cbd8f293 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -132,6 +132,7 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
 void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
+unsigned long memremap_compat_align(void);
 #else
 static inline void *devm_memremap_pages(struct device *dev,
struct dev_pagemap *pgmap)
@@ -165,6 +166,12 @@ static inline void vmem_altmap_free(struct vmem_altmap 
*altmap,
unsigned long nr_pfns)
 {
 }
+
+/* when memremap_pages() is disabled all archs can remap a single page *

[PATCH v4 0/5] libnvdimm: Cross-arch compatible namespace alignment

2020-03-03 Thread Dan Williams
Changes since v3 [1]:
- Collect Aneesh's Reviewed-by for Patch1-Patch3

- Add commentary to "libnvdimm/namespace: Enforce
  memremap_compat_align()" to clarify alignment validation and why
  start_pad is deprecated (Aneesh)

[1]: 
http://lore.kernel.org/r/158291746615.1609624.7591692546429050845.st...@dwillia2-desk3.amr.corp.intel.com

---

Patch "mm/memremap_pages: Introduce memremap_compat_align()" still
needs a PowerPC maintainer ack for the touches to
arch/powerpc/mm/ioremap.c.

---

Aneesh reports that PowerPC requires 16MiB alignment for the address
range passed to devm_memremap_pages(), and Jeff reports that it is
possible to create a misaligned namespace which blocks future namespace
creation in that region. Both of these issues require namespace
alignment to be managed at the region level rather than padding at the
namespace level which has been a broken approach to date.

Introduce memremap_compat_align() to indicate the hard requirements of
an arch's memremap_pages() implementation. Use the maximum known
memremap_compat_align() to set the default namespace alignment for
libnvdimm. Consult that alignment when allocating free space. Finally,
allow the default region alignment to be overridden to maintain the same
namespace creation capability as previous kernels (modulo dax operation
not being supported with a non-zero start_pad).

The ndctl unit tests, which have some misaligned namespace assumptions,
are updated to use the alignment override where necessary.

Thanks to Aneesh for early feedback and testing on this change to
alignment handling.

---

Dan Williams (5):
  mm/memremap_pages: Introduce memremap_compat_align()
  libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid
  libnvdimm/namespace: Enforce memremap_compat_align()
  libnvdimm/region: Introduce NDD_LABELING
  libnvdimm/region: Introduce an 'align' attribute


 arch/powerpc/Kconfig  |1 
 arch/powerpc/mm/ioremap.c |   21 +
 arch/powerpc/platforms/pseries/papr_scm.c |2 
 drivers/acpi/nfit/core.c  |4 +
 drivers/nvdimm/dimm.c |2 
 drivers/nvdimm/dimm_devs.c|   95 +
 drivers/nvdimm/namespace_devs.c   |   28 +-
 drivers/nvdimm/nd.h   |3 -
 drivers/nvdimm/pfn.h  |   12 +++
 drivers/nvdimm/pfn_devs.c |   40 +++--
 drivers/nvdimm/region_devs.c  |  132 ++---
 include/linux/libnvdimm.h |2 
 include/linux/memremap.h  |8 ++
 include/linux/mmzone.h|1 
 lib/Kconfig   |3 +
 mm/memremap.c |   23 +
 16 files changed, 330 insertions(+), 47 deletions(-)

base-commit: 1d0827b75ee7df497f611a2ac412a88135fb0ef5


Re: [PATCH v3 6/7] mm/memory_hotplug: Add pgprot_t to mhp_params

2020-03-02 Thread Dan Williams
On Mon, Mar 2, 2020 at 10:55 AM Logan Gunthorpe  wrote:
>
>
>
> On 2020-02-29 3:44 p.m., Dan Williams wrote:
> > On Fri, Feb 21, 2020 at 10:25 AM Logan Gunthorpe  
> > wrote:
> >>
> >> devm_memremap_pages() is currently used by the PCI P2PDMA code to create
> >> struct page mappings for IO memory. At present, these mappings are created
> >> with PAGE_KERNEL which implies setting the PAT bits to be WB. However, on
> >> x86, an mtrr register will typically override this and force the cache
> >> type to be UC-. In the case firmware doesn't set this register it is
> >> effectively WB and will typically result in a machine check exception
> >> when it's accessed.
> >>
> >> Other arches are not currently likely to function correctly seeing they
> >> don't have any MTRR registers to fall back on.
> >>
> >> To solve this, provide a way to specify the pgprot value explicitly to
> >> arch_add_memory().
> >>
> >> Of the arches that support MEMORY_HOTPLUG: x86_64, and arm64 need a simple
> >> change to pass the pgprot_t down to their respective functions which set
> >> up the page tables. For x86_32, set the page tables explicitly using
> >> _set_memory_prot() (seeing they are already mapped). For ia64, s390 and
> >> sh, reject anything but PAGE_KERNEL settings -- this should be fine,
> >> for now, seeing these architectures don't support ZONE_DEVICE.
> >>
> >> A check in __add_pages() is also added to ensure the pgprot parameter was
> >> set for all arches.
> >>
> >> Cc: Dan Williams 
> >> Signed-off-by: Logan Gunthorpe 
> >> Acked-by: David Hildenbrand 
> >> Acked-by: Michal Hocko 
> >> ---
> >>  arch/arm64/mm/mmu.c| 3 ++-
> >>  arch/ia64/mm/init.c| 3 +++
> >>  arch/powerpc/mm/mem.c  | 3 ++-
> >>  arch/s390/mm/init.c| 3 +++
> >>  arch/sh/mm/init.c  | 3 +++
> >>  arch/x86/mm/init_32.c  | 5 +
> >>  arch/x86/mm/init_64.c  | 2 +-
> >>  include/linux/memory_hotplug.h | 2 ++
> >>  mm/memory_hotplug.c| 5 -
> >>  mm/memremap.c  | 6 +++---
> >>  10 files changed, 28 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> >> index ee37bca8aba8..ea3fa844a8a2 100644
> >> --- a/arch/arm64/mm/mmu.c
> >> +++ b/arch/arm64/mm/mmu.c
> >> @@ -1058,7 +1058,8 @@ int arch_add_memory(int nid, u64 start, u64 size,
> >> flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> >>
> >> __create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
> >> -size, PAGE_KERNEL, __pgd_pgtable_alloc, 
> >> flags);
> >> +size, params->pgprot, __pgd_pgtable_alloc,
> >> +flags);
> >>
> >> memblock_clear_nomap(start, size);
> >>
> >> diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> >> index 97bbc23ea1e3..d637b4ea3147 100644
> >> --- a/arch/ia64/mm/init.c
> >> +++ b/arch/ia64/mm/init.c
> >> @@ -676,6 +676,9 @@ int arch_add_memory(int nid, u64 start, u64 size,
> >> unsigned long nr_pages = size >> PAGE_SHIFT;
> >> int ret;
> >>
> >> +   if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot))
> >> +   return -EINVAL;
> >> +
> >> ret = __add_pages(nid, start_pfn, nr_pages, params);
> >> if (ret)
> >> printk("%s: Problem encountered in __add_pages() as 
> >> ret=%d\n",
> >> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> >> index 19b1da5d7eca..832412bc7fad 100644
> >> --- a/arch/powerpc/mm/mem.c
> >> +++ b/arch/powerpc/mm/mem.c
> >> @@ -138,7 +138,8 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
> >> resize_hpt_for_hotplug(memblock_phys_mem_size());
> >>
> >> start = (unsigned long)__va(start);
> >> -   rc = create_section_mapping(start, start + size, nid, PAGE_KERNEL);
> >> +   rc = create_section_mapping(start, start + size, nid,
> >> +   params->pgprot);
> >> if (rc) {
> >> pr_warn("Unable to create mapping for hot added memory 
> >> 0x%llx..0x%llx: %d\n",
> >> start,

  1   2   3   4   5   >