Re: [PATCH v19 6/8] PM: hibernate: disable when there are active secretmem users

2021-05-18 Thread Dan Williams
On Tue, May 18, 2021 at 6:33 PM James Bottomley  wrote:
>
> On Tue, 2021-05-18 at 11:24 +0100, Mark Rutland wrote:
> > On Thu, May 13, 2021 at 09:47:32PM +0300, Mike Rapoport wrote:
> > > From: Mike Rapoport 
> > >
> > > It is unsafe to allow saving of secretmem areas to the hibernation
> > > snapshot as they would be visible after the resume and this
> > > essentially will defeat the purpose of secret memory mappings.
> > >
> > > Prevent hibernation whenever there are active secret memory users.
> >
> > Have we thought about how this is going to work in practice, e.g. on
> > mobile systems? It seems to me that there are a variety of common
> > applications which might want to use this which people don't expect
> > to inhibit hibernate (e.g. authentication agents, web browsers).
>
> If mobile systems require hibernate, then the choice is to disable this
> functionality or implement a secure hibernation store. I also thought
> most mobile hibernation was basically equivalent to S3, in which case
> there's no actual writing of RAM into storage and thus no security
> barrier, and likely the inhibition needs to be made a bit more
> specific to the suspend-to-disk case?
>
> > Are we happy to say that any userspace application can incidentally
> > inhibit hibernate?
>
> Well, yes, for the laptop use case because we don't want suspend to
> disk to be able to compromise the secret area.  You can disable this
> for mobile if you like, or work out how to implement hibernate securely
> if you're really suspending to disk.

Forgive me if this was already asked and answered. Why not document
that secretmem is ephemeral in the case of hibernate and push the
problem to userspace, which can choose to disable hibernation? In other
words, hibernation causes applications to need to reload their
secretmem: it will be destroyed on the way down and accesses will
SIGBUS afterwards. That at least gives a system the flexibility to
either sacrifice hibernate for secretmem (with a userspace-controlled
policy), or sacrifice secretmem-using processes for hibernate.
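
To make that concrete, a minimal userspace sketch of the reload-on-SIGBUS
model (hypothetical illustration, not from the thread; assumes the
memfd_secret() syscall this series introduces, error handling omitted):

#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static sigjmp_buf env;

static void on_sigbus(int sig)
{
	/* the kernel destroyed the secret across hibernate */
	siglongjmp(env, 1);
}

int main(void)
{
	size_t len = sysconf(_SC_PAGESIZE);
	char *secret;
	int fd;

	signal(SIGBUS, on_sigbus);
reload:
	fd = syscall(SYS_memfd_secret, 0);	/* syscall from this series */
	ftruncate(fd, len);
	secret = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	strcpy(secret, "key material");		/* (re)populate the secret */
	if (sigsetjmp(env, 1))
		goto reload;			/* resumed; rebuild and retry */
	/* ... use the secret; a post-resume access faults into on_sigbus ... */
	return 0;
}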


Re: [ndctl PATCH] ndctl: Update nvdimm mailing list address

2021-05-18 Thread Dan Williams
On Tue, May 18, 2021 at 3:26 PM Vishal Verma  wrote:
>
> The 'nvdimm' mailing list has moved from lists.01.org to
> lists.linux.dev. Update CONTRIBUTING.md and configure.ac to reflect
> this.

LGTM

Reviewed-by: Dan Williams 



Re: [Virtio-fs] [PATCH v3 2/3] dax: Add a wakeup mode parameter to put_unlocked_entry()

2021-05-17 Thread Dan Williams
On Wed, Apr 21, 2021 at 11:25 PM Christoph Hellwig  wrote:
>
> On Wed, Apr 21, 2021 at 12:09:54PM -0700, Dan Williams wrote:
> > Can you get in the habit of not replying inline with new patches like
> > this? Collect the review feedback, take a pause, and resend the full
> > series so tooling like b4 and patchwork can track when a new posting
> > supersedes a previous one. As is, this inline style inflicts manual
> > effort on the maintainer.
>
> Honestly I don't mind it at all.  If your shiny new tooling can't handle
> it, maybe you should fix your shiny new tooling instead of changing
> everyone's workflow?

FYI, the shiny new tooling has been fixed:

http://lore.kernel.org/r/20210517161317.teawoh5qovxpmqdc@nitro.local

...it still requires properly formatted patches with commentary below
the "---" break line, but this should cut down on re-rolls.

Hat tip to Konstantin.
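
For anyone unfamiliar with the convention, the expected layout looks like
this (illustrative sketch with made-up names): everything between the "---"
break line and the diffstat is ephemeral commentary that git-am discards
when the patch is applied, which is where re-roll notes belong:

	Signed-off-by: A U Thor <author@example.com>
	---
	v2: collected review feedback; this note disappears on apply

	 drivers/foo.c | 2 +-
	 1 file changed, 1 insertion(+), 1 deletion(-)

	diff --git a/drivers/foo.c b/drivers/foo.c
	...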


[GIT PULL] libnvdimm fixes for 5.13-rc2

2021-05-14 Thread Dan Williams
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-fixes-5.13-rc2

...to receive a regression fix for a bootup crash condition introduced
in -rc1 and some other minor fixups. This has all appeared in -next
with no reported issues.

---


The following changes since commit 6efb943b8616ec53a5e444193dccf1af9ad627b5:

  Linux 5.13-rc1 (2021-05-09 14:17:44 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-fixes-5.13-rc2

for you to fetch changes up to e9cfd259c6d386f6235395a13bd4f357a979b2d0:

  ACPI: NFIT: Fix support for variable 'SPA' structure size (2021-05-12 12:38:25 -0700)


libnvdimm fixes for 5.13-rc2

- Fix regression in ACPI NFIT table handling leading to crashes and
  driver load failures.

- Move the nvdimm mailing list

- Miscellaneous minor fixups


Dan Williams (2):
  MAINTAINERS: Move nvdimm mailing list
  ACPI: NFIT: Fix support for variable 'SPA' structure size

Wan Jiabing (1):
  libnvdimm: Remove duplicate struct declaration

Zou Wei (1):
  tools/testing/nvdimm: Make symbol '__nfit_test_ioremap' static

 Documentation/ABI/obsolete/sysfs-class-dax|  2 +-
 Documentation/ABI/removed/sysfs-bus-nfit  |  2 +-
 Documentation/ABI/testing/sysfs-bus-nfit  | 40 -
 Documentation/ABI/testing/sysfs-bus-papr-pmem |  4 +--
 Documentation/driver-api/nvdimm/nvdimm.rst|  2 +-
 MAINTAINERS   | 14 -
 drivers/acpi/nfit/core.c  | 15 +++---
 include/linux/libnvdimm.h |  1 -
 tools/testing/nvdimm/test/iomap.c |  2 +-
 tools/testing/nvdimm/test/nfit.c  | 42 ---
 10 files changed, 69 insertions(+), 55 deletions(-)


[GIT PULL] dax fixes for v5.13-rc2

2021-05-14 Thread Dan Williams
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/dax-fixes-5.13-rc2

...to receive a fix for a hang condition in the filesystem-dax core
when exercised by virtiofs. This bug has been there from the
beginning, but the condition has not triggered on other filesystems
since they hold a lock over invalidation events.

The changes have appeared in -next with no reported issues. The
patches were originally against v5.12 so you will see a minor conflict
with Willy's nr_exceptional changes. My proposed conflict resolution is
appended below.

---

The following changes since commit 9f4ad9e425a1d3b6a34617b8ea226d56a119a717:

  Linux 5.12 (2021-04-25 13:49:08 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/dax-fixes-5.13-rc2

for you to fetch changes up to 237388320deffde7c2d65ed8fc9eef670dc979b3:

  dax: Wake up all waiters after invalidating dax entry (2021-05-07 15:55:44 -0700)


dax fixes for 5.13-rc2

- Fix a hang condition (missed wakeups with virtiofs when invalidating
  entries)


Vivek Goyal (3):
  dax: Add an enum for specifying dax wakup mode
  dax: Add a wakeup mode parameter to put_unlocked_entry()
  dax: Wake up all waiters after invalidating dax entry

 fs/dax.c | 35 +++
 1 file changed, 23 insertions(+), 12 deletions(-)
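
For context on the conflict resolution below: the series distinguishes
waking every waiter from waking only the next one. The enum it introduces
in fs/dax.c looks roughly like this (paraphrased from the series, not a
verbatim excerpt):

enum dax_wake_mode {
	WAKE_ALL,	/* wake all waiters in the waitqueue */
	WAKE_NEXT,	/* wake only the first waiter */
};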

diff --cc fs/dax.c
index 69216241392f,df5485b4bddf..62352cbcf0f4
--- a/fs/dax.c
+++ b/fs/dax.c
@@@ -524,8 -535,8 +535,8 @@@ retry

dax_disassociate_entry(entry, mapping, false);
xas_store(xas, NULL);   /* undo the PMD join */
-   dax_wake_entry(xas, entry, true);
+   dax_wake_entry(xas, entry, WAKE_ALL);
 -  mapping->nrexceptional--;
 +  mapping->nrpages -= PG_PMD_NR;
entry = NULL;
xas_set(xas, index);
}
@@@ -661,10 -672,10 +672,10 @@@ static int __dax_invalidate_entry(struc
goto out;
dax_disassociate_entry(entry, mapping, trunc);
xas_store(&xas, NULL);
 -  mapping->nrexceptional--;
 +  mapping->nrpages -= 1UL << dax_entry_order(entry);
ret = 1;
  out:
-   put_unlocked_entry(&xas, entry);
+   put_unlocked_entry(&xas, entry, WAKE_ALL);
xas_unlock_irq(&xas);
return ret;
  }


Re: [PATCH] MAINTAINERS: Move nvdimm mailing list

2021-05-13 Thread Dan Williams
On Wed, Apr 21, 2021 at 12:05 AM Dan Williams  wrote:
>
> After seeing some users have subscription management trouble, more spam
> than other Linux development lists, and considering some of the benefits
> of kernel.org hosted lists, nvdimm and persistent memory development is
> moving to nvd...@lists.linux.dev.
>
> The old list will remain up until v5.14-rc1 and be shut down thereafter.
>

Events have transpired that require this schedule be expedited. So
once this patch ships in v5.13-rc2, the linux-nvdimm@lists.01.org list
will be shut down and all future nvdimm correspondence must be sent to
nvd...@lists.linux.dev.


Re: [PATCH v2] powerpc/papr_scm: Make 'perf_stats' invisible if perf-stats unavailable

2021-05-12 Thread Dan Williams
On Fri, May 7, 2021 at 4:40 AM Vaibhav Jain  wrote:
>
> In case performance stats for an nvdimm are not available, reading the
> 'perf_stats' sysfs file returns an -ENOENT error. A better approach is
> to make the 'perf_stats' file entirely invisible to indicate that
> performance stats for an nvdimm are unavailable.
>
> So this patch updates 'papr_nd_attribute_group' to add a 'is_visible'
> callback implemented as newly introduced 'papr_nd_attribute_visible()'
> that returns an appropriate mode in case performance stats aren't
> supported in a given nvdimm.
>
> Also the initialization of 'papr_scm_priv.stat_buffer_len' is moved
> from papr_scm_nvdimm_init() to papr_scm_probe() so that its value is
> available when 'papr_nd_attribute_visible()' is called during nvdimm
> initialization.
>
> Fixes: 2d02bf835e57 ('powerpc/papr_scm: Fetch nvdimm performance stats from PHYP')

Since this has been the exposed ABI since v5.9, perhaps a note /
analysis is needed here showing that the disappearance of this file
will not break any existing scripts/tools.

> Signed-off-by: Vaibhav Jain 
> ---
> Changelog:
>
> v2:
> * Removed a redundant dev_info() from pap_scm_nvdimm_init() [ Aneesh ]
> * Marked papr_nd_attribute_visible() as static which also fixed the
>   build warning reported by kernel build robot
> ---
>  arch/powerpc/platforms/pseries/papr_scm.c | 35 ---
>  1 file changed, 24 insertions(+), 11 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
> index e2b69cc3beaf..11e7b90a3360 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -907,6 +907,20 @@ static ssize_t flags_show(struct device *dev,
>  }
>  DEVICE_ATTR_RO(flags);
>
> +static umode_t papr_nd_attribute_visible(struct kobject *kobj,
> +struct attribute *attr, int n)
> +{
> +   struct device *dev = container_of(kobj, typeof(*dev), kobj);

This can use the kobj_to_dev() helper.
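
That is, the open-coded container_of() collapses to a one-liner with the
existing helper:

	struct device *dev = kobj_to_dev(kobj);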

> +   struct nvdimm *nvdimm = to_nvdimm(dev);
> +   struct papr_scm_priv *p = nvdimm_provider_data(nvdimm);
> +
> +   /* For if perf-stats not available remove perf_stats sysfs */
> +   if (attr == &dev_attr_perf_stats.attr && p->stat_buffer_len == 0)
> +   return 0;
> +
> +   return attr->mode;
> +}
> +
>  /* papr_scm specific dimm attributes */
>  static struct attribute *papr_nd_attributes[] = {
> &dev_attr_flags.attr,
> @@ -916,6 +930,7 @@ static struct attribute *papr_nd_attributes[] = {
>
>  static struct attribute_group papr_nd_attribute_group = {
> .name = "papr",
> +   .is_visible = papr_nd_attribute_visible,
> .attrs = papr_nd_attributes,
>  };
>
> @@ -931,7 +946,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
> struct nd_region_desc ndr_desc;
> unsigned long dimm_flags;
> int target_nid, online_nid;
> -   ssize_t stat_size;
>
> p->bus_desc.ndctl = papr_scm_ndctl;
> p->bus_desc.module = THIS_MODULE;
> @@ -1016,16 +1030,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
> list_add_tail(&p->region_list, &papr_nd_regions);
> mutex_unlock(&papr_ndr_lock);
>
> -   /* Try retriving the stat buffer and see if its supported */
> -   stat_size = drc_pmem_query_stats(p, NULL, 0);
> -   if (stat_size > 0) {
> -   p->stat_buffer_len = stat_size;
> -   dev_dbg(&p->pdev->dev, "Max perf-stat size %lu-bytes\n",
> -   p->stat_buffer_len);
> -   } else {
> -   dev_info(&p->pdev->dev, "Dimm performance stats unavailable\n");
> -   }
> -
> return 0;
>
>  err:   nvdimm_bus_unregister(p->bus);
> @@ -1102,6 +1106,7 @@ static int papr_scm_probe(struct platform_device *pdev)
> u64 blocks, block_size;
> struct papr_scm_priv *p;
> const char *uuid_str;
> +   ssize_t stat_size;
> u64 uuid[2];
> int rc;
>
> @@ -1179,6 +1184,14 @@ static int papr_scm_probe(struct platform_device *pdev)
> p->res.name  = pdev->name;
> p->res.flags = IORESOURCE_MEM;
>
> +   /* Try retriving the stat buffer and see if its supported */

s/retriving/retrieving/

> +   stat_size = drc_pmem_query_stats(p, NULL, 0);
> +   if (stat_size > 0) {
> +   p->stat_buffer_len = stat_size;
> +   dev_dbg(&p->pdev->dev, "Max perf-stat size %lu-bytes\n",
> +   p->stat_buffer_len);
> +   }
> +
> rc = papr_scm_nvdimm_init(p);
> if (rc)
> goto err2;
> --
> 2.31.1
>

After the minor fixups above you can add:

Reviewed-by: Dan Williams 

...I assume this will go through the PPC tree.


Re: [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps

2021-05-10 Thread Dan Williams
On Thu, May 6, 2021 at 4:02 AM Joao Martins  wrote:
[..]
> >> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
> >
> > I think this can be replaced with a call to follow_pte(_mm...).
> >
>
> Ah, of course! That would shorten things up too.

Now that I look closely, I notice that function disclaims being
suitable as a general purpose pte lookup utility. If it works for you,
great, but if not I think it's past time to create such a helper. I
know Ira is looking for one, and I contributed to the proliferation
when I added dev_pagemap_mapping_shift().


Re: [PATCH] ACPI: NFIT: Fix support for variable 'SPA' structure size

2021-05-07 Thread Dan Williams
On Fri, May 7, 2021 at 7:49 AM Rafael J. Wysocki  wrote:
>
> On Fri, May 7, 2021 at 4:12 PM Dan Williams  wrote:
> >
> > On Fri, May 7, 2021 at 2:47 AM Rafael J. Wysocki  wrote:
> > >
> > > Hi Dan,
> > >
> > > On Fri, May 7, 2021 at 9:33 AM Dan Williams  
> > > wrote:
> > > >
> > > > ACPI 6.4 introduced the "SpaLocationCookie" to the NFIT "System Physical
> > > > Address (SPA) Range Structure". The presence of that new field is
> > > > indicated by the ACPI_NFIT_LOCATION_COOKIE_VALID flag. Pre-ACPI-6.4
> > > > firmware implementations omit the flag and maintain the original size of
> > > > the structure.
> > > >
> > > > Update the implementation to check that flag to determine the size
> > > > rather than the ACPI 6.4 compliant definition of 'struct
> > > > acpi_nfit_system_address' from the Linux ACPICA definitions.
> > > >
> > > > Update the test infrastructure for the new expectations as well, i.e.
> > > > continue to emulate the ACPI 6.3 definition of that structure.
> > > >
> > > > Without this fix the kernel fails to validate 'SPA' structures and this
> > > > leads to a crash in nfit_get_smbios_id() since that routine assumes that
> > > > SPAs are valid if it finds valid SMBIOS tables.
> > > >
> > > > BUG: unable to handle page fault for address: ffffffffffffffa8
> > > > [..]
> > > > Call Trace:
> > > >  skx_get_nvdimm_info+0x56/0x130 [skx_edac]
> > > >  skx_get_dimm_config+0x1f5/0x213 [skx_edac]
> > > >  skx_register_mci+0x132/0x1c0 [skx_edac]
> > > >
> > > > Cc: Bob Moore 
> > > > Cc: Erik Kaneda 
> > > > Fixes: cf16b05c607b ("ACPICA: ACPI 6.4: NFIT: add Location Cookie field")
> > >
> > > Do you want me to apply this (as the commit being fixed went in
> > > through the ACPI tree)?
> >
> > Yes, I would need to wait for a signed tag, so if you're sending urgent
> > fixes in the next few days please take this one; otherwise I'll circle
> > back next week after -rc1.
>
> I'll be doing my next push after -rc1 either, so I guess it doesn't
> matter time-wise.

Ok, I got it, thanks for the offer.


Re: [PATCH] ACPI: NFIT: Fix support for variable 'SPA' structure size

2021-05-07 Thread Dan Williams
On Fri, May 7, 2021 at 2:47 AM Rafael J. Wysocki  wrote:
>
> Hi Dan,
>
> On Fri, May 7, 2021 at 9:33 AM Dan Williams  wrote:
> >
> > ACPI 6.4 introduced the "SpaLocationCookie" to the NFIT "System Physical
> > Address (SPA) Range Structure". The presence of that new field is
> > indicated by the ACPI_NFIT_LOCATION_COOKIE_VALID flag. Pre-ACPI-6.4
> > firmware implementations omit the flag and maintain the original size of
> > the structure.
> >
> > Update the implementation to check that flag to determine the size
> > rather than the ACPI 6.4 compliant definition of 'struct
> > acpi_nfit_system_address' from the Linux ACPICA definitions.
> >
> > Update the test infrastructure for the new expectations as well, i.e.
> > continue to emulate the ACPI 6.3 definition of that structure.
> >
> > Without this fix the kernel fails to validate 'SPA' structures and this
> > leads to a crash in nfit_get_smbios_id() since that routine assumes that
> > SPAs are valid if it finds valid SMBIOS tables.
> >
> > BUG: unable to handle page fault for address: ffffffffffffffa8
> > [..]
> > Call Trace:
> >  skx_get_nvdimm_info+0x56/0x130 [skx_edac]
> >  skx_get_dimm_config+0x1f5/0x213 [skx_edac]
> >  skx_register_mci+0x132/0x1c0 [skx_edac]
> >
> > Cc: Bob Moore 
> > Cc: Erik Kaneda 
> > Fixes: cf16b05c607b ("ACPICA: ACPI 6.4: NFIT: add Location Cookie field")
>
> Do you want me to apply this (as the commit being fixed went in
> through the ACPI tree)?

Yes, I would need to wait for a signed tag, so if you're sending urgent
fixes in the next few days please take this one; otherwise I'll circle
back next week after -rc1.

>
> If you'd rather take care of it yourself:
>
> Reviewed-by: Rafael J. Wysocki 

Thanks!


[PATCH] ACPI: NFIT: Fix support for variable 'SPA' structure size

2021-05-07 Thread Dan Williams
ACPI 6.4 introduced the "SpaLocationCookie" to the NFIT "System Physical
Address (SPA) Range Structure". The presence of that new field is
indicated by the ACPI_NFIT_LOCATION_COOKIE_VALID flag. Pre-ACPI-6.4
firmware implementations omit the flag and maintain the original size of
the structure.

Update the implementation to check that flag to determine the size
rather than the ACPI 6.4 compliant definition of 'struct
acpi_nfit_system_address' from the Linux ACPICA definitions.

Update the test infrastructure for the new expectations as well, i.e.
continue to emulate the ACPI 6.3 definition of that structure.

Without this fix the kernel fails to validate 'SPA' structures and this
leads to a crash in nfit_get_smbios_id() since that routine assumes that
SPAs are valid if it finds valid SMBIOS tables.

BUG: unable to handle page fault for address: ffffffffffffffa8
[..]
Call Trace:
 skx_get_nvdimm_info+0x56/0x130 [skx_edac]
 skx_get_dimm_config+0x1f5/0x213 [skx_edac]
 skx_register_mci+0x132/0x1c0 [skx_edac]

Cc: Bob Moore 
Cc: Erik Kaneda 
Fixes: cf16b05c607b ("ACPICA: ACPI 6.4: NFIT: add Location Cookie field")
Reported-by: Yi Zhang 
Tested-by: Yi Zhang 
Signed-off-by: Dan Williams 
---

Rafael, I can take this through nvdimm.git after -rc1, but if you are
sending any fixes to Linus based on your merge baseline between now and
Monday, please pick up this one.

 drivers/acpi/nfit/core.c |   15 ++
 tools/testing/nvdimm/test/nfit.c |   42 +++---
 2 files changed, 36 insertions(+), 21 deletions(-)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 958aaac869e8..23d9a09d7060 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -686,6 +686,13 @@ int nfit_spa_type(struct acpi_nfit_system_address *spa)
return -1;
 }
 
+static size_t sizeof_spa(struct acpi_nfit_system_address *spa)
+{
+   if (spa->flags & ACPI_NFIT_LOCATION_COOKIE_VALID)
+   return sizeof(*spa);
+   return sizeof(*spa) - 8;
+}
+
 static bool add_spa(struct acpi_nfit_desc *acpi_desc,
struct nfit_table_prev *prev,
struct acpi_nfit_system_address *spa)
@@ -693,22 +700,22 @@ static bool add_spa(struct acpi_nfit_desc *acpi_desc,
struct device *dev = acpi_desc->dev;
struct nfit_spa *nfit_spa;
 
-   if (spa->header.length != sizeof(*spa))
+   if (spa->header.length != sizeof_spa(spa))
return false;
 
list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
-   if (memcmp(nfit_spa->spa, spa, sizeof(*spa)) == 0) {
+   if (memcmp(nfit_spa->spa, spa, sizeof_spa(spa)) == 0) {
list_move_tail(&nfit_spa->list, &acpi_desc->spas);
return true;
}
}
 
-   nfit_spa = devm_kzalloc(dev, sizeof(*nfit_spa) + sizeof(*spa),
+   nfit_spa = devm_kzalloc(dev, sizeof(*nfit_spa) + sizeof_spa(spa),
GFP_KERNEL);
if (!nfit_spa)
return false;
INIT_LIST_HEAD(&nfit_spa->list);
-   memcpy(nfit_spa->spa, spa, sizeof(*spa));
+   memcpy(nfit_spa->spa, spa, sizeof_spa(spa));
list_add_tail(&nfit_spa->list, &acpi_desc->spas);
dev_dbg(dev, "spa index: %d type: %s\n",
spa->range_index,
diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
index 9b185bf82da8..54f367cbadae 100644
--- a/tools/testing/nvdimm/test/nfit.c
+++ b/tools/testing/nvdimm/test/nfit.c
@@ -1871,9 +1871,16 @@ static void smart_init(struct nfit_test *t)
}
 }
 
+static size_t sizeof_spa(struct acpi_nfit_system_address *spa)
+{
+   /* until spa location cookie support is added... */
+   return sizeof(*spa) - 8;
+}
+
 static int nfit_test0_alloc(struct nfit_test *t)
 {
-   size_t nfit_size = sizeof(struct acpi_nfit_system_address) * NUM_SPA
+   struct acpi_nfit_system_address *spa = NULL;
+   size_t nfit_size = sizeof_spa(spa) * NUM_SPA
+ sizeof(struct acpi_nfit_memory_map) * NUM_MEM
+ sizeof(struct acpi_nfit_control_region) * NUM_DCR
+ offsetof(struct acpi_nfit_control_region,
@@ -1937,7 +1944,8 @@ static int nfit_test0_alloc(struct nfit_test *t)
 
 static int nfit_test1_alloc(struct nfit_test *t)
 {
-   size_t nfit_size = sizeof(struct acpi_nfit_system_address) * 2
+   struct acpi_nfit_system_address *spa = NULL;
+   size_t nfit_size = sizeof_spa(spa) * 2
+ sizeof(struct acpi_nfit_memory_map) * 2
+ offsetof(struct acpi_nfit_control_region, window_size) * 2;
int i;
@@ -2000,7 +2008,7 @@ static void nfit_test0_setup(struct nfit_test *t)
 */
spa = nfit_buf;
spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
-   spa->header.length = sizeof(*s

Re: [bug report] system panic at nfit_get_smbios_id+0x6e/0xf0 [nfit] during boot

2021-05-06 Thread Dan Williams
On Thu, May 6, 2021 at 10:28 AM Kaneda, Erik  wrote:
>
>
>
> > -Original Message-
> > From: Yi Zhang 
> > Sent: Wednesday, May 5, 2021 8:05 PM
> > To: Williams, Dan J ; Moore, Robert
> > 
> > Cc: linux-nvdimm ; Kaneda, Erik
> > ; Wysocki, Rafael J 
> > Subject: Re: [bug report] system panic at nfit_get_smbios_id+0x6e/0xf0
> > [nfit] during boot
> >
> > On Sat, May 1, 2021 at 2:05 PM Dan Williams 
> > wrote:
> > >
> > > On Fri, Apr 30, 2021 at 7:28 PM Yi Zhang  wrote:
> > > >
> > > > Hi
> > > >
> > > > With the latest Linux tree, my DCPMM server boot failed with the
> > > > bellow panic log, pls help check it, let me know if you need any test
> > > > for it.
> > >
> > > So v5.12 is ok but v5.12+ is not?
> > >
> > > Might you be able to bisect?
> >
> > Hi Dan
> > This issue was introduced with this patch, let me know if you need more 
> > info.
> >
> > commit cf16b05c607bd716a0a5726dc8d577a89fdc1777
> > Author: Bob Moore 
> > Date:   Tue Apr 6 14:30:15 2021 -0700
> >
> > ACPICA: ACPI 6.4: NFIT: add Location Cookie field
> >
> > Also, update struct size to reflect these changes in nfit core driver.
> >
> > ACPICA commit af60199a9a1de9e6844929fd4cc22334522ed195
> >
> > Link: https://github.com/acpica/acpica/commit/af60199a
> > Cc: Dan Williams 
> > Signed-off-by: Bob Moore 
> > Signed-off-by: Erik Kaneda 
> > Signed-off-by: Rafael J. Wysocki 
> >
>
> It's likely that this change forced the nfit driver to parse the ACPI
> table as if the location cookie field were always present, so the NFIT
> was parsed incorrectly. Does the NFIT table on this platform contain
> a valid cookie field?
>

This was my fault. When I saw the size change fly by, I should have
remembered to go update all the places that do "sizeof(struct
acpi_nfit_system_address)".

Yi Zhang, can you give the attached patch a try:
ACPI: NFIT: Fix support for variable 'SPA' structure size

From: Dan Williams 

ACPI 6.4 introduced the "SpaLocationCookie" to the NFIT "System Physical
Address (SPA) Range Structure". The presence of that new field is
indicated by the ACPI_NFIT_LOCATION_COOKIE_VALID flag. Pre-ACPI-6.4
firmware implementations omit the flag and maintain the original size of
the structure.

Update the implementation to check that flag to determine the size
rather than the ACPI 6.4 compliant definition of 'struct
acpi_nfit_system_address' from the Linux ACPICA definitions.

Update the test infrastructure for the new expectations as well, i.e.
continue to emulate the ACPI 6.3 definition of that structure.

Without this fix the kernel fails to validate 'SPA' structures and this
leads to a crash in nfit_get_smbios_id() since that routine assumes that
SPAs are valid if it finds valid SMBIOS tables.

BUG: unable to handle page fault for address: ffffffffffffffa8
[..]
Call Trace:
 skx_get_nvdimm_info+0x56/0x130 [skx_edac]
 skx_get_dimm_config+0x1f5/0x213 [skx_edac]
 skx_register_mci+0x132/0x1c0 [skx_edac]

Reported-by: Yi Zhang 
Signed-off-by: Dan Williams 
---
 drivers/acpi/nfit/core.c |   17 +++
 tools/testing/nvdimm/test/nfit.c |   42 +++---
 2 files changed, 37 insertions(+), 22 deletions(-)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 958aaac869e8..bfecb79e8c82 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -686,29 +686,36 @@ int nfit_spa_type(struct acpi_nfit_system_address *spa)
 	return -1;
 }
 
-static bool add_spa(struct acpi_nfit_desc *acpi_desc,
+static size_t sizeof_spa(struct acpi_nfit_system_address *spa)
+{
+	if (spa->flags & ACPI_NFIT_LOCATION_COOKIE_VALID)
+		return sizeof(*spa);
+	return sizeof(*spa) - 8;
+}
+
+static bool noinline add_spa(struct acpi_nfit_desc *acpi_desc,
 		struct nfit_table_prev *prev,
 		struct acpi_nfit_system_address *spa)
 {
 	struct device *dev = acpi_desc->dev;
 	struct nfit_spa *nfit_spa;
 
-	if (spa->header.length != sizeof(*spa))
+	if (spa->header.length != sizeof_spa(spa))
 		return false;
 
 	list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
-		if (memcmp(nfit_spa->spa, spa, sizeof(*spa)) == 0) {
+		if (memcmp(nfit_spa->spa, spa, sizeof_spa(spa)) == 0) {
 			list_move_tail(&nfit_spa->list, &acpi_desc->spas);
 			return true;
 		}
 	}
 
-	nfit_spa = devm_kzalloc(dev, sizeof(*nfit_spa) + sizeof(*spa),
+	nfit_spa = devm_kzalloc(dev, sizeof(*nfit_spa) + sizeof_spa(spa),
 			GFP_KERNEL);
 	if (!nfit_spa)
 		return false;
 	INIT_LIST_HEAD(&nfit_spa->list);
-	memcpy(nfit_spa->spa, spa, sizeof(*spa));
+	memcpy(nfit_spa-

Re: [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps

2021-05-05 Thread Dan Williams
On Thu, Mar 25, 2021 at 4:10 PM Joao Martins  wrote:
>
> A compound pagemap is a dev_pagemap with @align > PAGE_SIZE and it
> means that pages are mapped at a given huge page alignment and use
> compound pages as opposed to order-0 pages.
>
> To minimize struct page overhead we take advantage of the fact that

I'm not sure if Andrew is a member of the "we" police, but I am on
team "imperative tense".

"Take advantage of the fact that most tail pages look the same (except
the first two) to minimize struct page overhead."

...I think all the other "we"s below can be dropped without losing any meaning.

> most tail pages look the same (except the first two). We allocate a
> separate page for the vmemmap area which contains the head page and
> separate for the next 64 pages. The rest of the subsections then reuse
> this tail vmemmap page to initialize the rest of the tail pages.
>
> Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and
> when initializing compound pagemap with big enough @align (e.g. 1G
> PUD) it  will cross various sections. To be able to reuse tail pages
> across sections belonging to the same gigantic page we fetch the
> @range being mapped (nr_ranges + 1).  If the section being mapped is
> not offset 0 of the @align, then lookup the PFN of the struct page
> address that preceeds it and use that to populate the entire

s/preceeds/precedes/

> section.
>
> On compound pagemaps with 2M align, this lets mechanism saves 6 pages

s/lets mechanism saves 6 pages/this mechanism lets 6 pages be saved/

> out of the 8 necessary PFNs necessary to set the subsection's 512
> struct pages being mapped. On a 1G compound pagemap it saves
> 4094 pages.
>
> Altmap isn't supported yet, given various restrictions in altmap pfn
> allocator, thus fallback to the already in use vmemmap_populate().

Perhaps also note that altmap for devmap mappings was there to relieve
the pressure of inordinate amounts of memmap space to map terabytes of
PMEM. With compound pages the motivation for altmaps for PMEM is
reduced.

>
> Signed-off-by: Joao Martins 
> ---
>  include/linux/mm.h  |   2 +-
>  mm/memremap.c   |   1 +
>  mm/sparse-vmemmap.c | 139 
>  3 files changed, 131 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 61474602c2b1..49d717ae40ae 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3040,7 +3040,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
>  pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
>  pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
> -   struct vmem_altmap *altmap);
> +   struct vmem_altmap *altmap, void *block);
>  void *vmemmap_alloc_block(unsigned long size, int node);
>  struct vmem_altmap;
>  void *vmemmap_alloc_block_buf(unsigned long size, int node,
> diff --git a/mm/memremap.c b/mm/memremap.c
> index d160853670c4..2e6bc0b1ff00 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -345,6 +345,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
>  {
> struct mhp_params params = {
> .altmap = pgmap_altmap(pgmap),
> +   .pgmap = pgmap,
> .pgprot = PAGE_KERNEL,
> };
> const int nr_range = pgmap->nr_range;
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 8814876edcfa..f57c5eada099 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -141,16 +141,20 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
>  }
>
>  pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
> -  struct vmem_altmap *altmap)
> +  struct vmem_altmap *altmap, void *block)

For type-safety I would make this argument a 'struct page *' and drop
the virt_to_page().

>  {
> pte_t *pte = pte_offset_kernel(pmd, addr);
> if (pte_none(*pte)) {
> pte_t entry;
> -   void *p;
> -
> -   p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> -   if (!p)
> -   return NULL;
> +   void *p = block;
> +
> +   if (!block) {
> +   p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> +   if (!p)
> +   return NULL;
> +   } else if (!altmap) {
> +   get_page(virt_to_page(block));

This is either super subtle, or there is a missing put_page(). I'm
assuming that in the shutdown path the same page will be freed
multiple times because the same page is mapped multiple times.

Have you validated that the accounting is correct and all allocated
pages get freed? I feel this needs more than a comment, it 
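
For the accounting question above: every pte that reuses the shared
vmemmap page takes a reference via get_page(), so a teardown path would
need one put_page() per pte before the final reference frees the page.
A hypothetical sketch of the pairing (not from the posted series;
vmemmap_lookup_address() is the helper under review):

static void vmemmap_unpopulate_address(unsigned long addr)
{
	pte_t *pte = vmemmap_lookup_address(addr);
	struct page *page = pte_page(*pte);

	pte_clear(&init_mm, addr, pte);
	put_page(page);	/* the last reference frees the shared page */
}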

Re: [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation

2021-05-05 Thread Dan Williams
On Wed, May 5, 2021 at 3:38 PM Joao Martins  wrote:
>
>
>
> On 5/5/21 11:34 PM, Dan Williams wrote:
> > On Thu, Mar 25, 2021 at 4:10 PM Joao Martins  
> > wrote:
> >>
> >> @altmap is stored in a dev_pagemap, but it will be repurposed for
> >> hotplug memory for storing the memmap in the hotplugged memory[*] and
> >> reusing the altmap infrastructure to that end. This is to say that
> >> altmap can't be replaced with a @pgmap as it is going to cover more than
> >> dev_pagemap backend altmaps.
> >
> > I was going to say, just pass the pgmap and lookup the altmap from
> > pgmap, but Oscar added a use case for altmap independent of pgmap. So
> > you might refresh this commit message to clarify why passing pgmap by
> > itself is not sufficient.
> >
> Isn't that what I am doing above with that exact paragraph? I even
> reference his series at the end of commit description :) in [*]

Oh, sorry, it didn't hit me explicitly that you were talking about
Oscar's work; I thought you were referring to your own changes in this
set. I see it now... at a minimum the tense needs updating since
Oscar's changes are now in the past, not the future. If it helps,
the following reads more directly to me: "In support of using compound
pages for devmap mappings, plumb the pgmap down to the
vmemmap_populate* implementation. Note that while altmap is
retrievable from pgmap, the memory hotplug code passes altmap without
pgmap, so both need to be independently plumbed."


Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages

2021-05-05 Thread Dan Williams
On Wed, May 5, 2021 at 3:36 PM Joao Martins  wrote:
[..]
> > Ah yup, my eyes glazed over that. I think this is another place that
> > benefits from a more specific name than "align". "pfns_per_compound"
> > "compound_pfns"?
> >
> We are still describing a page, just not a base page. So perhaps 
> @pfns_per_hpage ?
>
> I am fine with @pfns_per_compound or @compound_pfns as well.

My only concern about hpage is that hpage implies PMD, where compound
is generic across PMD and PUD.


Re: [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages()

2021-05-05 Thread Dan Williams
I suspect it's a good sign I'm only finding cosmetic and changelog
changes in the review... I have some more:

A year from now, if I'm tracking down a problem and looking through mm
commits, I would appreciate a subject line like the following:
"refactor core of vmemmap_populate_basepages() to helper", which gives
an idea of the impact and side effects of the change.

On Thu, Mar 25, 2021 at 4:10 PM Joao Martins  wrote:
>

I would add a lead-in phrase like: "In preparation for describing a
memmap with compound pages, move the actual..."

> Move the actual pte population logic into a separate function
> vmemmap_populate_address() and have vmemmap_populate_basepages()
> walk through all base pages it needs to populate.

Aside from the above, looks good.

>
> Signed-off-by: Joao Martins 
> ---
>  mm/sparse-vmemmap.c | 44 ++--
>  1 file changed, 26 insertions(+), 18 deletions(-)
>
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 370728c206ee..8814876edcfa 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -216,33 +216,41 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
> return pgd;
>  }
>
> -int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
> -int node, struct vmem_altmap *altmap)
> +static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> + struct vmem_altmap *altmap)
>  {
> -   unsigned long addr = start;
> pgd_t *pgd;
> p4d_t *p4d;
> pud_t *pud;
> pmd_t *pmd;
> pte_t *pte;
>
> +   pgd = vmemmap_pgd_populate(addr, node);
> +   if (!pgd)
> +   return -ENOMEM;
> +   p4d = vmemmap_p4d_populate(pgd, addr, node);
> +   if (!p4d)
> +   return -ENOMEM;
> +   pud = vmemmap_pud_populate(p4d, addr, node);
> +   if (!pud)
> +   return -ENOMEM;
> +   pmd = vmemmap_pmd_populate(pud, addr, node);
> +   if (!pmd)
> +   return -ENOMEM;
> +   pte = vmemmap_pte_populate(pmd, addr, node, altmap);
> +   if (!pte)
> +   return -ENOMEM;
> +   vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
> +}
> +
> +int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
> +int node, struct vmem_altmap *altmap)
> +{
> +   unsigned long addr = start;
> +
> for (; addr < end; addr += PAGE_SIZE) {
> -   pgd = vmemmap_pgd_populate(addr, node);
> -   if (!pgd)
> -   return -ENOMEM;
> -   p4d = vmemmap_p4d_populate(pgd, addr, node);
> -   if (!p4d)
> -   return -ENOMEM;
> -   pud = vmemmap_pud_populate(p4d, addr, node);
> -   if (!pud)
> -   return -ENOMEM;
> -   pmd = vmemmap_pmd_populate(pud, addr, node);
> -   if (!pmd)
> -   return -ENOMEM;
> -   pte = vmemmap_pte_populate(pmd, addr, node, altmap);
> -   if (!pte)
> +   if (vmemmap_populate_address(addr, node, altmap))
> return -ENOMEM;
> -   vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
> }
>
> return 0;
> --
> 2.17.1
>


Re: [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation

2021-05-05 Thread Dan Williams
On Thu, Mar 25, 2021 at 4:10 PM Joao Martins  wrote:
>
> @altmap is stored in a dev_pagemap, but it will be repurposed for
> hotplug memory for storing the memmap in the hotplugged memory[*] and
> reusing the altmap infrastructure to that end. This is to say that
> altmap can't be replaced with a @pgmap as it is going to cover more than
> dev_pagemap backend altmaps.

I was going to say, just pass the pgmap and lookup the altmap from
pgmap, but Oscar added a use case for altmap independent of pgmap. So
you might refresh this commit message to clarify why passing pgmap by
itself is not sufficient.

Other than that, looks good.


Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages

2021-05-05 Thread Dan Williams
On Wed, May 5, 2021 at 12:50 PM Joao Martins  wrote:
>
>
>
> On 5/5/21 7:44 PM, Dan Williams wrote:
> > On Thu, Mar 25, 2021 at 4:10 PM Joao Martins  
> > wrote:
> >>
> >> Add a new align property for struct dev_pagemap which specifies that a
> >> pagemap is composed of a set of compound pages of size @align, instead
> >> of base pages. When these pages are initialised, most are initialised as
> >> tail pages instead of order-0 pages.
> >>
> >> For certain ZONE_DEVICE users like device-dax which have a fixed page
> >> size, this creates an opportunity to optimize GUP and GUP-fast walkers,
> >> treating it the same way as THP or hugetlb pages.
> >>
> >> Signed-off-by: Joao Martins 
> >> ---
> >>  include/linux/memremap.h | 13 +
> >>  mm/memremap.c|  8 ++--
> >>  mm/page_alloc.c  | 24 +++-
> >>  3 files changed, 42 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> >> index b46f63dcaed3..bb28d82dda5e 100644
> >> --- a/include/linux/memremap.h
> >> +++ b/include/linux/memremap.h
> >> @@ -114,6 +114,7 @@ struct dev_pagemap {
> >> struct completion done;
> >> enum memory_type type;
> >> unsigned int flags;
> >> +   unsigned long align;
> >
> > I think this wants some kernel-doc above to indicate that non-zero
> > means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
> > means "use non-compound base pages".
>
> Got it. Are you thinking a kernel-doc on top of the variable above, or
> preferably on top of pgmap_align()?

I was thinking in dev_pagemap because this value is more than just a
plain alignment; it's restructuring the layout and construction of the
memmap(). But when I say it that way, it seems much less like a vanilla
align value and more like a construction / geometry mode setting.

>
> > The non-zero value must be
> > PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
> > Hmm, maybe it should be an
> > enum:
> >
> > enum devmap_geometry {
> > DEVMAP_PTE,
> > DEVMAP_PMD,
> > DEVMAP_PUD,
> > }
> >
> > I suppose a converter between devmap_geometry and page_size would be
> > needed too? And maybe the whole dax/nvdimm align values change meanwhile
> > (as a followup improvement)?

I think it is ok for dax/nvdimm to continue to maintain their align
value because it should be ok to have 4MB align if the device really
wanted. However, when it goes to map that alignment with
memremap_pages() it can pick a mode. For example, it's already the
case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
they're already separate concepts that can stay separate.
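
The converter Joao mentions could be as small as this (hypothetical
sketch, not from the thread; devmap_geometry is the enum proposed
earlier in the discussion):

static inline unsigned long devmap_geometry_size(enum devmap_geometry geo)
{
	switch (geo) {
	case DEVMAP_PMD:
		return PMD_SIZE;
	case DEVMAP_PUD:
		return PUD_SIZE;
	case DEVMAP_PTE:
	default:
		return PAGE_SIZE;
	}
}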

>
> Although to be fair we only ever care about compound page size in this
> series (and similarly dax/nvdimm @align properties).
>
> > ...because it's more than just an alignment it's a structural
> > definition of how the memmap is laid out.
> >
> >> const struct dev_pagemap_ops *ops;
> >> void *owner;
> >> int nr_range;
> >> @@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct 
> >> dev_pagemap *pgmap)
> >> return NULL;
> >>  }
> >>
> >> +static inline unsigned long pgmap_align(struct dev_pagemap *pgmap)
> >> +{
> >> +   if (!pgmap || !pgmap->align)
> >> +   return PAGE_SIZE;
> >> +   return pgmap->align;
> >> +}
> >> +
> >> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
> >> +{
> >> +   return PHYS_PFN(pgmap_align(pgmap));
> >> +}
> >> +
> >>  #ifdef CONFIG_ZONE_DEVICE
> >>  bool pfn_zone_device_reserved(unsigned long pfn);
> >>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> >> diff --git a/mm/memremap.c b/mm/memremap.c
> >> index 805d761740c4..d160853670c4 100644
> >> --- a/mm/memremap.c
> >> +++ b/mm/memremap.c
> >> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, 
> >> struct mhp_params *params,
> >> memmap_init_zone_device(_DATA(nid)->node_zones[ZONE_DEVICE],
> >> PHYS_PFN(range->start),
> >> PHYS_PFN(range_len(range)), pgmap);
> >> -   percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> >> -   - pfn_f

Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages

2021-05-05 Thread Dan Williams
On Thu, Mar 25, 2021 at 4:10 PM Joao Martins  wrote:
>
> Add a new align property for struct dev_pagemap which specifies that a
> pagemap is composed of a set of compound pages of size @align, instead
> of base pages. When these pages are initialised, most are initialised as
> tail pages instead of order-0 pages.
>
> For certain ZONE_DEVICE users like device-dax which have a fixed page
> size, this creates an opportunity to optimize GUP and GUP-fast walkers,
> treating it the same way as THP or hugetlb pages.
>
> Signed-off-by: Joao Martins 
> ---
>  include/linux/memremap.h | 13 +
>  mm/memremap.c|  8 ++--
>  mm/page_alloc.c  | 24 +++-
>  3 files changed, 42 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index b46f63dcaed3..bb28d82dda5e 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -114,6 +114,7 @@ struct dev_pagemap {
> struct completion done;
> enum memory_type type;
> unsigned int flags;
> +   unsigned long align;

I think this wants some kernel-doc above to indicate that non-zero
means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
means "use non-compound base pages". The non-zero value must be
PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE. Hmm, maybe it should be an
enum:

enum devmap_geometry {
DEVMAP_PTE,
DEVMAP_PMD,
DEVMAP_PUD,
}

...because it's more than just an alignment it's a structural
definition of how the memmap is laid out.

> const struct dev_pagemap_ops *ops;
> void *owner;
> int nr_range;
> @@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct 
> dev_pagemap *pgmap)
> return NULL;
>  }
>
> +static inline unsigned long pgmap_align(struct dev_pagemap *pgmap)
> +{
> +   if (!pgmap || !pgmap->align)
> +   return PAGE_SIZE;
> +   return pgmap->align;
> +}
> +
> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
> +{
> +   return PHYS_PFN(pgmap_align(pgmap));
> +}
> +
>  #ifdef CONFIG_ZONE_DEVICE
>  bool pfn_zone_device_reserved(unsigned long pfn);
>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 805d761740c4..d160853670c4 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
> memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> PHYS_PFN(range->start),
> PHYS_PFN(range_len(range)), pgmap);
> -   percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> -   - pfn_first(pgmap, range_id));
> +   if (pgmap_align(pgmap) > PAGE_SIZE)
> +   percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> +   - pfn_first(pgmap, range_id)) / pgmap_pfn_align(pgmap));
> +   else
> +   percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> +   - pfn_first(pgmap, range_id));
> return 0;
>
>  err_add_memory:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 58974067bbd4..3a77f9e43f3a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
> unsigned long pfn, end_pfn = start_pfn + nr_pages;
> struct pglist_data *pgdat = zone->zone_pgdat;
> struct vmem_altmap *altmap = pgmap_altmap(pgmap);
> +   unsigned int pfn_align = pgmap_pfn_align(pgmap);
> +   unsigned int order_align = order_base_2(pfn_align);
> unsigned long zone_idx = zone_idx(zone);
> unsigned long start = jiffies;
> int nid = pgdat->node_id;
> @@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone,
> nr_pages = end_pfn - start_pfn;
> }
>
> -   for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> +   for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {

pfn_align is in bytes and pfn is in pages... is there a "pfn_align >>=
PAGE_SHIFT" I missed somewhere?


Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm

2021-05-04 Thread Dan Williams
On Tue, May 4, 2021 at 2:03 AM Aneesh Kumar K.V
 wrote:
>
> On 5/4/21 11:13 AM, Pankaj Gupta wrote:
> 
>
> >>
> >> What this patch series did was to express that property via a device
> >> tree node and guest driver enables a hypercall based flush mechanism to
> >> ensure persistence.
> >
> > Would a VIRTIO (entirely asynchronous, no trap at host side) based
> > mechanism be better than hyper-call based? Registering memory can be
> > done any way. We
> > implemented virtio-pmem
> > flush mechanisms with below considerations:
> >
> > - Proper semantic for guest flush requests.
> > - Efficient mechanism for performance pov.
> >
>
> sure, virtio-pmem can be used as an alternative.
>
> > I am just asking myself if we have platform agnostic mechanism already
> > there, maybe
> > we can extend it to suit our needs? Maybe I am missing some points here.
> >
>
> What is being attempted in this series is to indicate to the guest OS
> that the backing device/file used for emulated nvdimm device cannot
> guarantee the persistence via cpu cache flush instructions.

Right, the detail I am arguing is that it should be a device
description, not a backend file property. In other words this proposal
is advocating this:

-object memory-backend-file,id=mem1,share,sync-dax=$sync-dax,mem-path=$path
-device nvdimm,memdev=mem1

...and I am advocating for reusing or duplicating the virtio-pmem
model like this:

-object memory-backend-file,id=mem1,share,mem-path=$path
-device spapr-hcall,memdev=mem1

...because the interface to the guest is the device, not the
memory-backend-file.


Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm

2021-05-03 Thread Dan Williams
On Mon, May 3, 2021 at 7:06 AM Shivaprasad G Bhat  wrote:
>
>
> On 5/1/21 12:44 AM, Dan Williams wrote:
> > Some corrections to terminology confusion below...
> >
> >
> > On Wed, Apr 28, 2021 at 8:49 PM Shivaprasad G Bhat  
> > wrote:
> >> The nvdimm devices are expected to ensure write persistence during power
> >> failure kind of scenarios.
> > No, QEMU is not expected to make that guarantee. QEMU is free to lie
> > to the guest about the persistence guarantees of the guest PMEM
> > ranges. It's more accurate to say that QEMU nvdimm devices can emulate
> > persistent memory and optionally pass through host power-fail
> > persistence guarantees to the guest. The power-fail persistence domain
> > can be one of "cpu_cache", or "memory_controller" if the persistent
> > memory region is "synchronous". If the persistent range is not
> > synchronous, it really isn't "persistent memory"; it's memory mapped
> > storage that needs I/O commands to flush.
>
> Since this is a virtual nvdimm (v-nvdimm) backed by a file, the data is
> completely in the host pagecache, and we need a way to ensure that host
> pagecaches are flushed to the backend. This is analogous to the WPQ flush
> being offloaded to the hypervisor.

No, it isn't analogous. WPQ flush is an optional mechanism to force
data to a higher durability domain. The flush in this interface is
mandatory. It's a different class of device.

The proposal that "sync-dax=unsafe" for non-PPC architectures, is a
fundamental misrepresentation of how this is supposed to work. Rather
than make "sync-dax" a first class citizen of the device-description
interface I'm proposing that you make this a separate device-type.
This also solves the problem that "sync-dax" with an implicit
architecture backend assumption is not precise, but a new "non-nvdimm"
device type would make it explicit what the host is advertising to the
guest.

>
>
> Ref: https://github.com/dgibson/qemu/blob/main/docs/nvdimm.txt
>
>
>
> >
> >> The libpmem has architecture specific instructions like dcbf on POWER
> > Which "libpmem" is this? PMDK is a reference library not a PMEM
> > interface... maybe I'm missing what libpmem has to do with QEMU?
>
>
> I was referring to the semantics of flushing pmem cache lines as in
> PMDK/libpmem.
>
>
> >
> >> to flush the cache data to backend nvdimm device during normal writes
> >> followed by explicit flushes if the backend devices are not synchronous
> >> DAX capable.
> >>
> >> Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest
> >> and the subsequent flush doesn't traslate to actual flush to the backend
> > s/traslate/translate/
> >
> >> file on the host in case of file backed v-nvdimms. This is addressed by
> >> virtio-pmem in case of x86_64 by making explicit flushes translating to
> >> fsync at qemu.
> > Note that virtio-pmem was a proposal for a specific optimization of
> > allowing guests to share page cache. The virtio-pmem approach is not
> > to be confused with actual persistent memory.
> >
> >> On SPAPR, the issue is addressed by adding a new hcall to
> >> request for an explicit flush from the guest ndctl driver when the backend
> > What is an "ndctl" driver? ndctl is userspace tooling, do you mean the
> > guest pmem driver?
>
>
> oops, wrong terminology. I was referring to the guest libnvdimm and
> papr_scm kernel modules.
>
>
> >
> >> nvdimm cannot ensure write persistence with dcbf alone. So, the approach
> >> here is to convey when the hcall flush is required in a device tree
> >> property. The guest makes the hcall when the property is found, instead
> >> of relying on dcbf.
> >>
> >> A new device property sync-dax is added to the nvdimm device. When the
> >> sync-dax is 'writeback'(default for PPC), device property
> >> "hcall-flush-required" is set, and the guest makes hcall H_SCM_FLUSH
> >> requesting for an explicit flush.
> > I'm not sure "sync-dax" is a suitable name for the property of the
> > guest persistent memory.
>
>
> sync-dax property translates ND_REGION_ASYNC flag being set/unset

Yes, I am aware, but that property alone is not sufficient to identify
the flush mechanism.

>
> for the pmem region, also whether the nvdimm_flush callback is provided
> in papr_scm or not. As everything boils down to the synchronous nature
> of the device, I chose sync-dax for the name.
>
>
> >   There is 

Re: [bug report] system panic at nfit_get_smbios_id+0x6e/0xf0 [nfit] during boot

2021-05-01 Thread Dan Williams
On Fri, Apr 30, 2021 at 7:28 PM Yi Zhang  wrote:
>
> Hi
>
> With the latest Linux tree, my DCPMM server boot failed with the
> below panic log, pls help check it, let me know if you need any test
> for it.

So v5.12 is ok but v5.12+ is not?

Might you be able to bisect?
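
For reference, the typical bisect flow (sketch):

	git bisect start
	git bisect bad			# current tree fails to boot
	git bisect good v5.12		# last known-good kernel
	# build and boot-test each commit git checks out, then mark it:
	git bisect good			# or: git bisect bad
	# repeat until git names the first bad commit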

If not, can you send the nfit.gz from this command:

acpidump -n NFIT | gzip -c > nfit.gz

Also can you send the full dmesg? I don't suppose you see a message of
this format before this failure:

dev_err(acpi_desc->dev, "SPA %d missing DCR %d\n",
spa->range_index, dcr);


Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm

2021-04-30 Thread Dan Williams
Some corrections to terminology confusion below...


On Wed, Apr 28, 2021 at 8:49 PM Shivaprasad G Bhat  wrote:
>
> The nvdimm devices are expected to ensure write persistence during power
> failure kind of scenarios.

No, QEMU is not expected to make that guarantee. QEMU is free to lie
to the guest about the persistence guarantees of the guest PMEM
ranges. It's more accurate to say that QEMU nvdimm devices can emulate
persistent memory and optionally pass through host power-fail
persistence guarantees to the guest. The power-fail persistence domain
can be one of "cpu_cache", or "memory_controller" if the persistent
memory region is "synchronous". If the persistent range is not
synchronous, it really isn't "persistent memory"; it's memory mapped
storage that needs I/O commands to flush.

> The libpmem has architecture specific instructions like dcbf on POWER

Which "libpmem" is this? PMDK is a reference library not a PMEM
interface... maybe I'm missing what libpmem has to do with QEMU?

> to flush the cache data to backend nvdimm device during normal writes
> followed by explicit flushes if the backend devices are not synchronous
> DAX capable.
>
> Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest
> and the subsequent flush doesn't traslate to actual flush to the backend

s/traslate/translate/

> file on the host in case of file backed v-nvdimms. This is addressed by
> virtio-pmem in case of x86_64 by making explicit flushes translating to
> fsync at qemu.

Note that virtio-pmem was a proposal for a specific optimization of
allowing guests to share page cache. The virtio-pmem approach is not
to be confused with actual persistent memory.

> On SPAPR, the issue is addressed by adding a new hcall to
> request for an explicit flush from the guest ndctl driver when the backend

What is an "ndctl" driver? ndctl is userspace tooling, do you mean the
guest pmem driver?

> nvdimm cannot ensure write persistence with dcbf alone. So, the approach
> here is to convey when the hcall flush is required in a device tree
> property. The guest makes the hcall when the property is found, instead
> of relying on dcbf.
>
> A new device property sync-dax is added to the nvdimm device. When the
> sync-dax is 'writeback'(default for PPC), device property
> "hcall-flush-required" is set, and the guest makes hcall H_SCM_FLUSH
> requesting for an explicit flush.

I'm not sure "sync-dax" is a suitable name for the property of the
guest persistent memory. There is no requirement that the
memory-backend file for a guest be a dax-capable file. It's also
implementation specific what hypercall needs to be invoked for a given
occurrence of "sync-dax". What does that map to on non-PPC platforms
for example? It seems to me that an "nvdimm" device presents the
synchronous usage model and a whole other device type implements an
async-hypercall setup that the guest happens to service with its
nvdimm stack, but it's not an "nvdimm" anymore at that point.

> sync-dax is "unsafe" on all other platforms(x86, ARM) and old pseries machines
> prior to 5.2 on PPC. sync-dax="writeback" on ARM and x86_64 is prevented
> now as the flush semantics are unimplemented.

"sync-dax" has no meaning on its own, I think this needs an explicit
mechanism to convey both the "not-sync" property *and* the callback
method, it shouldn't be inferred by arch type.

> When the backend file is actually synchronous DAX capable and no explicit
> flushes are required, the sync-dax mode 'direct' is to be used.
>
> The below demonstration shows the map_sync behavior with sync-dax writeback &
> direct.
> (https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)
>
> The pmem0 is from nvdimm with sync-dax=direct, and pmem1 is from
> nvdimm with sync-dax=writeback, mounted as
> /dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
>
> [root@atest-guest ~]# ./mapsync /mnt1/newfile   <-- When sync-dax=unsafe/direct
> [root@atest-guest ~]# ./mapsync /mnt2/newfile   <-- when sync-dax=writeback
> Failed to mmap  with Operation not supported
>
> The first patch does the header file cleanup necessary for the
> subsequent ones. Second patch implements the hcall, adds the necessary
> vmstate properties to spapr machine structure for carrying the hcall
> status during save-restore. The nature of the hcall being asynchronous,
> the patch uses aio utilities to offload the flush. The third patch adds
> the 'sync-dax' device property and enables the device tree property
> for the guest to utilise the hcall.
>
> The kernel changes to exploit this hcall are at
> https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch
>
> ---
> v3 - https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07916.html
> Changes from v3:
>   - Fixed the forward declaration coding 
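
For context, a minimal sketch of the guest-side hookup under discussion,
assuming the nd_region flush callback model (commit c5d4355d10d4); the
helper names here are illustrative, not the merged papr_scm code:

	/* Sketch: forward a guest flush request to the hypervisor */
	static int papr_scm_flush_sketch(struct nd_region *nd_region,
					 struct bio *bio)
	{
		struct papr_scm_priv *p = nd_region_provider_data(nd_region);

		/* issue_flush_hcall() stands in for an H_SCM_FLUSH wrapper */
		return issue_flush_hcall(p) ? -EIO : 0;
	}

	/* At region registration, when "hcall-flush-required" is present:
	 *	set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
	 *	ndr_desc.flush = papr_scm_flush_sketch;
	 */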

Re: [PATCH] MAINTAINERS: Move nvdimm mailing list

2021-04-29 Thread Dan Williams
On Tue, Apr 27, 2021 at 8:58 AM Jonathan Corbet  wrote:
>
> Dan Williams  writes:
>
> > After seeing some users have subscription management trouble, more spam
> > than other Linux development lists, and considering some of the benefits
> > of kernel.org hosted lists, nvdimm and persistent memory development is
> > moving to nvd...@lists.linux.dev.
> >
> > The old list will remain up until v5.14-rc1 and shutdown thereafter.
> >
> > Cc: Ira Weiny 
> > Cc: Oliver O'Halloran 
> > Cc: Matthew Wilcox 
> > Cc: Jan Kara 
> > Cc: Jonathan Corbet 
> > Cc: Dave Jiang 
> > Cc: Vishal Verma 
> > Signed-off-by: Dan Williams 
> > ---
> >  Documentation/ABI/obsolete/sysfs-class-dax|2 +
> >  Documentation/ABI/removed/sysfs-bus-nfit  |2 +
> >  Documentation/ABI/testing/sysfs-bus-nfit  |   40 +
> >  Documentation/ABI/testing/sysfs-bus-papr-pmem |4 +--
> >  Documentation/driver-api/nvdimm/nvdimm.rst|2 +
> >  MAINTAINERS   |   14 -
> >  6 files changed, 32 insertions(+), 32 deletions(-)
>
> Would you like me to take this through docs-next, or did you have
> another path in mind?

Thanks for the offer, I have a few other nvdimm related bits for this
cycle, so it can go through nvdimm.git.


Re: [PATCH v6 0/3] dax: Fix missed wakeup in put_unlocked_entry()

2021-04-28 Thread Dan Williams
On Wed, Apr 28, 2021 at 12:03 PM Vivek Goyal  wrote:
>
> Hi,
>
> This is V6. Only change since V5 is that I changed order of WAKE_NEXT
> and WAKE_ALL in comments too.
>
> Vivek
>
> Vivek Goyal (3):
>   dax: Add an enum for specifying dax wakup mode
>   dax: Add a wakeup mode parameter to put_unlocked_entry()
>   dax: Wake up all waiters after invalidating dax entry

Thanks Vivek, looks good.


Re: [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init

2021-04-23 Thread Dan Williams
On Thu, Mar 25, 2021 at 4:10 PM Joao Martins  wrote:
>
> Move struct page init to a helper function __init_zone_device_page().

Same sentence addition suggestion from the last patch to make this
patch have some rationale for existing.

>
> Signed-off-by: Joao Martins 
> ---
>  mm/page_alloc.c | 74 +++--
>  1 file changed, 41 insertions(+), 33 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 43dd98446b0b..58974067bbd4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6237,6 +6237,46 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
>  }
>
>  #ifdef CONFIG_ZONE_DEVICE
> +static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
> + unsigned long zone_idx, int nid,
> + struct dev_pagemap *pgmap)
> +{
> +
> +   __init_single_page(page, pfn, zone_idx, nid);
> +
> +   /*
> +* Mark page reserved as it will need to wait for onlining
> +* phase for it to be fully associated with a zone.
> +*
> +* We can use the non-atomic __set_bit operation for setting
> +* the flag as we are still initializing the pages.
> +*/
> +   __SetPageReserved(page);
> +
> +   /*
> +* ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
> +* and zone_device_data.  It is a bug if a ZONE_DEVICE page is
> +* ever freed or placed on a driver-private list.
> +*/
> +   page->pgmap = pgmap;
> +   page->zone_device_data = NULL;
> +
> +   /*
> +* Mark the block movable so that blocks are reserved for
> +* movable at startup. This will force kernel allocations
> +* to reserve their blocks rather than leaking throughout
> +* the address space during boot when many long-lived
> +* kernel allocations are made.
> +*
> +* Please note that MEMINIT_HOTPLUG path doesn't clear memmap
> +* because this is done early in section_activate()
> +*/
> +   if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
> +   set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> +   cond_resched();
> +   }
> +}
> +
>  void __ref memmap_init_zone_device(struct zone *zone,
>unsigned long start_pfn,
>unsigned long nr_pages,
> @@ -6265,39 +6305,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
> for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> struct page *page = pfn_to_page(pfn);
>
> -   __init_single_page(page, pfn, zone_idx, nid);
> -
> -   /*
> -* Mark page reserved as it will need to wait for onlining
> -* phase for it to be fully associated with a zone.
> -*
> -* We can use the non-atomic __set_bit operation for setting
> -* the flag as we are still initializing the pages.
> -*/
> -   __SetPageReserved(page);
> -
> -   /*
> -* ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
> -* and zone_device_data.  It is a bug if a ZONE_DEVICE page is
> -* ever freed or placed on a driver-private list.
> -*/
> -   page->pgmap = pgmap;
> -   page->zone_device_data = NULL;
> -
> -   /*
> -* Mark the block movable so that blocks are reserved for
> -* movable at startup. This will force kernel allocations
> -* to reserve their blocks rather than leaking throughout
> -* the address space during boot when many long-lived
> -* kernel allocations are made.
> -*
> -* Please note that MEMINIT_HOTPLUG path doesn't clear memmap
> -* because this is done early in section_activate()
> -*/
> -   if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
> -   set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> -   cond_resched();
> -   }
> +   __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
> }
>
> pr_info("%s initialised %lu pages in %ums\n", __func__,
> --
> 2.17.1
>


Re: [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts

2021-04-23 Thread Dan Williams
On Thu, Mar 25, 2021 at 4:10 PM Joao Martins  wrote:
>
> Split the utility function prep_compound_page() into head and tail
> counterparts, and use them accordingly.

To make this patch stand alone better lets add another sentence:

"This is in preparation for sharing the storage for / deduplicating
compound page metadata."

Other than that, looks good to me.

>
> Signed-off-by: Joao Martins 
> ---
>  mm/page_alloc.c | 32 +---
>  1 file changed, 21 insertions(+), 11 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c53fe4fa10bf..43dd98446b0b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -692,24 +692,34 @@ void free_compound_page(struct page *page)
> __free_pages_ok(page, compound_order(page), FPI_NONE);
>  }
>
> +static void prep_compound_head(struct page *page, unsigned int order)
> +{
> +   set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
> +   set_compound_order(page, order);
> +   atomic_set(compound_mapcount_ptr(page), -1);
> +   if (hpage_pincount_available(page))
> +   atomic_set(compound_pincount_ptr(page), 0);
> +}
> +
> +static void prep_compound_tail(struct page *head, int tail_idx)
> +{
> +   struct page *p = head + tail_idx;
> +
> +   set_page_count(p, 0);
> +   p->mapping = TAIL_MAPPING;
> +   set_compound_head(p, head);
> +}
> +
>  void prep_compound_page(struct page *page, unsigned int order)
>  {
> int i;
> int nr_pages = 1 << order;
>
> __SetPageHead(page);
> -   for (i = 1; i < nr_pages; i++) {
> -   struct page *p = page + i;
> -   set_page_count(p, 0);
> -   p->mapping = TAIL_MAPPING;
> -   set_compound_head(p, page);
> -   }
> +   for (i = 1; i < nr_pages; i++)
> +   prep_compound_tail(page, i);
>
> -   set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
> -   set_compound_order(page, order);
> -   atomic_set(compound_mapcount_ptr(page), -1);
> -   if (hpage_pincount_available(page))
> -   atomic_set(compound_pincount_ptr(page), 0);
> +   prep_compound_head(page, order);
>  }
>
>  #ifdef CONFIG_DEBUG_PAGEALLOC
> --
> 2.17.1
>


Re: [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid()

2021-04-23 Thread Dan Williams
On Thu, Mar 25, 2021 at 4:10 PM Joao Martins  wrote:
>
> memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
> dax_lock_page()).  For pagemap with compound pages fetch the
> compound_head in case we are handling a tail page memory failure.
>
> Currently this is a nop, but in the advent of compound pages in
> dev_pagemap it allows memory_failure_dev_pagemap() to keep working.
>
> Reported-by: Jane Chu 
> Signed-off-by: Joao Martins 
> ---
>  mm/memory-failure.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 24210c9bd843..94240d772623 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1318,6 +1318,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> goto out;
> }
>
> +   page = compound_head(page);

Unless / until we do compound pages for the filesystem-dax case, I
would add a comment like:

/* pages instantiated by device-dax (not filesystem-dax) may be
compound pages */
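
i.e., with that comment folded in, the hunk would read:

	/*
	 * Pages instantiated by device-dax (not filesystem-dax) may
	 * be compound pages.
	 */
	page = compound_head(page);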


Re: [PATCH v4 0/3] dax: Fix missed wakeup in put_unlocked_entry()

2021-04-23 Thread Dan Williams
On Fri, Apr 23, 2021 at 6:07 AM Vivek Goyal  wrote:
>
> Hi,
>
> This is V4 of the patches. Posted V3 here.
>
> https://lore.kernel.org/linux-fsdevel/20210419213636.1514816-1-vgo...@redhat.com/
>
> Changes since V3 are.
>
> - Renamed "enum dax_entry_wake_mode" to "enum dax_wake_mode" (Matthew Wilcox)
> - Changed description of WAKE_NEXT and WAKE_ALL (Jan Kara)
> - Got rid of a comment (Greg Kurz)

Looks good Vivek, thanks for the resend.


Re: [Virtio-fs] [PATCH v3 2/3] dax: Add a wakeup mode parameter to put_unlocked_entry()

2021-04-22 Thread Dan Williams
On Wed, Apr 21, 2021 at 11:25 PM Christoph Hellwig  wrote:
>
> On Wed, Apr 21, 2021 at 12:09:54PM -0700, Dan Williams wrote:
> > Can you get in the habit of not replying inline with new patches like
> > this? Collect the review feedback, take a pause, and resend the full
> > series so tooling like b4 and patchwork can track when a new posting
> > supersedes a previous one. As is, this inline style inflicts manual
> > effort on the maintainer.
>
> Honestly I don't mind it at all.  If you shiny new tooling can't handle
> it maybe you should fix your shiny new tooling instead of changing
> everyones workflow?

I think asking a submitter to resend a series is par for the course,
especially for poor saps like me burdened by corporate email systems.
Vivek, if this is too onerous a request just give me a heads up and
I'll manually pull out the patch content from your replies.


Re: [Virtio-fs] [PATCH v3 2/3] dax: Add a wakeup mode parameter to put_unlocked_entry()

2021-04-21 Thread Dan Williams
On Tue, Apr 20, 2021 at 7:01 AM Vivek Goyal  wrote:
>
> On Tue, Apr 20, 2021 at 09:34:20AM +0200, Greg Kurz wrote:
> > On Mon, 19 Apr 2021 17:36:35 -0400
> > Vivek Goyal  wrote:
> >
> > > As of now put_unlocked_entry() always wakes up next waiter. In next
> > > patches we want to wake up all waiters at one callsite. Hence, add a
> > > parameter to the function.
> > >
> > > This patch does not introduce any change of behavior.
> > >
> > > Suggested-by: Dan Williams 
> > > Signed-off-by: Vivek Goyal 
> > > ---
> > >  fs/dax.c | 13 +++--
> > >  1 file changed, 7 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index 00978d0838b1..f19d76a6a493 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -275,11 +275,12 @@ static void wait_entry_unlocked(struct xa_state *xas, void *entry)
> > > finish_wait(wq, &ewait.wait);
> > >  }
> > >
> > > -static void put_unlocked_entry(struct xa_state *xas, void *entry)
> > > +static void put_unlocked_entry(struct xa_state *xas, void *entry,
> > > +  enum dax_entry_wake_mode mode)
> > >  {
> > > /* If we were the only waiter woken, wake the next one */
> >
> > With this change, the comment is no longer accurate since the
> > function can now wake all waiters if passed mode == WAKE_ALL.
> > Also, it paraphrases the code which is simple enough, so I'd
> > simply drop it.
> >
> > This is minor though and it shouldn't prevent this fix to go
> > forward.
> >
> > Reviewed-by: Greg Kurz 
>
> Ok, here is the updated patch which drops that comment line.
>
> Vivek

Hi Vivek,

Can you get in the habit of not replying inline with new patches like
this? Collect the review feedback, take a pause, and resend the full
series so tooling like b4 and patchwork can track when a new posting
supersedes a previous one. As is, this inline style inflicts manual
effort on the maintainer.

>
> Subject: dax: Add a wakeup mode parameter to put_unlocked_entry()
>
> As of now put_unlocked_entry() always wakes up next waiter. In next
> patches we want to wake up all waiters at one callsite. Hence, add a
> parameter to the function.
>
> This patch does not introduce any change of behavior.
>
> Suggested-by: Dan Williams 
> Signed-off-by: Vivek Goyal 
> ---
>  fs/dax.c |   14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> Index: redhat-linux/fs/dax.c
> ===
> --- redhat-linux.orig/fs/dax.c  2021-04-20 09:55:45.105069893 -0400
> +++ redhat-linux/fs/dax.c   2021-04-20 09:56:27.685822730 -0400
> @@ -275,11 +275,11 @@ static void wait_entry_unlocked(struct x
> finish_wait(wq, &ewait.wait);
>  }
>
> -static void put_unlocked_entry(struct xa_state *xas, void *entry)
> +static void put_unlocked_entry(struct xa_state *xas, void *entry,
> +  enum dax_entry_wake_mode mode)
>  {
> -   /* If we were the only waiter woken, wake the next one */
> if (entry && !dax_is_conflict(entry))
> -   dax_wake_entry(xas, entry, WAKE_NEXT);
> +   dax_wake_entry(xas, entry, mode);
>  }
>
>  /*
> @@ -633,7 +633,7 @@ struct page *dax_layout_busy_page_range(
> entry = get_unlocked_entry(&xas, 0);
> if (entry)
> page = dax_busy_page(entry);
> -   put_unlocked_entry(&xas, entry);
> +   put_unlocked_entry(&xas, entry, WAKE_NEXT);
> if (page)
> break;
> if (++scanned % XA_CHECK_SCHED)
> @@ -675,7 +675,7 @@ static int __dax_invalidate_entry(struct
> mapping->nrexceptional--;
> ret = 1;
>  out:
> -   put_unlocked_entry(&xas, entry);
> +   put_unlocked_entry(&xas, entry, WAKE_NEXT);
> xas_unlock_irq(&xas);
> return ret;
>  }
> @@ -954,7 +954,7 @@ static int dax_writeback_one(struct xa_s
> return ret;
>
>   put_unlocked:
> -   put_unlocked_entry(xas, entry);
> +   put_unlocked_entry(xas, entry, WAKE_NEXT);
> return ret;
>  }
>
> @@ -1695,7 +1695,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *
> /* Did we race with someone splitting entry or so? */
> if (!entry || dax_is_conflict(entry) ||
> (order == 0 && !dax_is_pte_entry(entry))) {
> -   put_unlocked_entry(&xas, entry);
> +   put_unlocked_entry(&xas, entry, WAKE_NEXT);
> xas_unlock_irq(&xas);
> trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
>   VM_FAULT_NOPAGE);
>


[PATCH] MAINTAINERS: Move nvdimm mailing list

2021-04-21 Thread Dan Williams
After seeing some users have subscription management trouble, more spam
than other Linux development lists, and considering some of the benefits
of kernel.org hosted lists, nvdimm and persistent memory development is
moving to nvd...@lists.linux.dev.

The old list will remain up until v5.14-rc1 and shutdown thereafter.

Cc: Ira Weiny 
Cc: Oliver O'Halloran 
Cc: Matthew Wilcox 
Cc: Jan Kara 
Cc: Jonathan Corbet 
Cc: Dave Jiang 
Cc: Vishal Verma 
Signed-off-by: Dan Williams 
---
 Documentation/ABI/obsolete/sysfs-class-dax|2 +
 Documentation/ABI/removed/sysfs-bus-nfit  |2 +
 Documentation/ABI/testing/sysfs-bus-nfit  |   40 +
 Documentation/ABI/testing/sysfs-bus-papr-pmem |4 +--
 Documentation/driver-api/nvdimm/nvdimm.rst|2 +
 MAINTAINERS   |   14 -
 6 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/Documentation/ABI/obsolete/sysfs-class-dax b/Documentation/ABI/obsolete/sysfs-class-dax
index 0faf1354cd05..5bcce27458e3 100644
--- a/Documentation/ABI/obsolete/sysfs-class-dax
+++ b/Documentation/ABI/obsolete/sysfs-class-dax
@@ -1,7 +1,7 @@
 What:   /sys/class/dax/
 Date:   May, 2016
 KernelVersion:  v4.7
-Contact:linux-nvdimm@lists.01.org
+Contact:nvd...@lists.linux.dev
 Description:   Device DAX is the device-centric analogue of Filesystem
DAX (CONFIG_FS_DAX).  It allows memory ranges to be
allocated and mapped without need of an intervening file
diff --git a/Documentation/ABI/removed/sysfs-bus-nfit b/Documentation/ABI/removed/sysfs-bus-nfit
index ae8c1ca53828..277437005def 100644
--- a/Documentation/ABI/removed/sysfs-bus-nfit
+++ b/Documentation/ABI/removed/sysfs-bus-nfit
@@ -1,7 +1,7 @@
 What:  /sys/bus/nd/devices/regionX/nfit/ecc_unit_size
 Date:  Aug, 2017
 KernelVersion: v4.14 (Removed v4.18)
-Contact:   linux-nvdimm@lists.01.org
+Contact:   nvd...@lists.linux.dev
 Description:
(RO) Size of a write request to a DIMM that will not incur a
read-modify-write cycle at the memory controller.
diff --git a/Documentation/ABI/testing/sysfs-bus-nfit b/Documentation/ABI/testing/sysfs-bus-nfit
index 63ef0b9ecce7..e7282d184a74 100644
--- a/Documentation/ABI/testing/sysfs-bus-nfit
+++ b/Documentation/ABI/testing/sysfs-bus-nfit
@@ -5,7 +5,7 @@ Interface Table (NFIT)' section in the ACPI specification
 What:  /sys/bus/nd/devices/nmemX/nfit/serial
 Date:  Jun, 2015
 KernelVersion: v4.2
-Contact:   linux-nvdimm@lists.01.org
+Contact:   nvd...@lists.linux.dev
 Description:
(RO) Serial number of the NVDIMM (non-volatile dual in-line
memory module), assigned by the module vendor.
@@ -14,7 +14,7 @@ Description:
 What:  /sys/bus/nd/devices/nmemX/nfit/handle
 Date:  Apr, 2015
 KernelVersion: v4.2
-Contact:   linux-nvdimm@lists.01.org
+Contact:   nvd...@lists.linux.dev
 Description:
(RO) The address (given by the _ADR object) of the device on its
parent bus of the NVDIMM device containing the NVDIMM region.
@@ -23,7 +23,7 @@ Description:
 What:  /sys/bus/nd/devices/nmemX/nfit/device
 Date:  Apr, 2015
 KernelVersion: v4.1
-Contact:   linux-nvdimm@lists.01.org
+Contact:   nvd...@lists.linux.dev
 Description:
(RO) Device id for the NVDIMM, assigned by the module vendor.
 
@@ -31,7 +31,7 @@ Description:
 What:  /sys/bus/nd/devices/nmemX/nfit/rev_id
 Date:  Jun, 2015
 KernelVersion: v4.2
-Contact:   linux-nvdimm@lists.01.org
+Contact:   nvd...@lists.linux.dev
 Description:
(RO) Revision of the NVDIMM, assigned by the module vendor.
 
@@ -39,7 +39,7 @@ Description:
 What:  /sys/bus/nd/devices/nmemX/nfit/phys_id
 Date:  Apr, 2015
 KernelVersion: v4.2
-Contact:   linux-nvdimm@lists.01.org
+Contact:   nvd...@lists.linux.dev
 Description:
(RO) Handle (i.e., instance number) for the SMBIOS (system
management BIOS) Memory Device structure describing the NVDIMM
@@ -49,7 +49,7 @@ Description:
 What:  /sys/bus/nd/devices/nmemX/nfit/flags
 Date:  Jun, 2015
 KernelVersion: v4.2
-Contact:   linux-nvdimm@lists.01.org
+Contact:   nvd...@lists.linux.dev
 Description:
(RO) The flags in the NFIT memory device sub-structure indicate
the state of the data on the nvdimm relative to its energy
@@ -68,7 +68,7 @@ What: /sys/bus/nd/devices/nmemX/nfit/format1
 What:  /sys/bus/nd/devices/nmemX/nfit/formats
 Date:  Apr, 2016
 KernelVersion: v4.7
-Contact:   linux-nvdimm@lists.01.org
+Contact:   nvd...@lists.linux.dev
 Description:
(RO) The interface codes indicate support for persistent memory
mapped directly into system physical address space and / or a
@@ -84,7 +84,7

Re: [PATCH -next v2] tools/testing/nvdimm: Make symbol '__nfit_test_ioremap' static

2021-04-20 Thread Dan Williams
On Tue, Apr 20, 2021 at 12:31 AM Zou Wei  wrote:
>
> The sparse tool complains as follows:
>
> tools/testing/nvdimm/test/iomap.c:65:14: warning:
>  symbol '__nfit_test_ioremap' was not declared. Should it be static?
>
> This symbol is not used outside of iomap.c, so this
> commit marks it static.
>
> Reported-by: Hulk Robot 
> Signed-off-by: Zou Wei 

Looks good to me, thanks.


[ANNOUNCE] NVDIMM development move to nvd...@lists.linux.dev

2021-04-20 Thread Dan Williams
There have been multiple occasions recently where people reported
trouble managing their subscription to linux-nvdimm@lists.01.org. The
spam filtering also appears less robust compared to other Linux
development lists.

There are also side benefits to being on kernel.org infrastructure as
it has a direct path to inject messages into the public-inbox
repository on lore, and feed the patchwork-bot.

You can find the subscription link here:

https://subspace.kernel.org/lists.linux.dev.html

For the v5.13 merge window I'll submit a patch to switch the list from
linux-nvdimm@lists.01.org to nvd...@lists.linux.dev, and plan to
shutdown linux-nvdimm@ a couple months later around the v5.14 merge
window.


Re: [PATCH] libnvdimm.h: Remove duplicate struct declaration

2021-04-20 Thread Dan Williams
On Tue, Apr 20, 2021 at 6:39 AM Santosh Sivaraj  wrote:
>
> Hi Ira,
>
> Ira Weiny  writes:
>
> > On Mon, Apr 19, 2021 at 07:27:25PM +0800, Wan Jiabing wrote:
> >> struct device is declared at 133rd line.
> >> The declaration here is unnecessary. Remove it.
> >>
> >> Signed-off-by: Wan Jiabing 
> >> ---
> >>  include/linux/libnvdimm.h | 1 -
> >>  1 file changed, 1 deletion(-)
> >>
> >> diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
> >> index 01f251b6e36c..89b69e645ac7 100644
> >> --- a/include/linux/libnvdimm.h
> >> +++ b/include/linux/libnvdimm.h
> >> @@ -141,7 +141,6 @@ static inline void __iomem *devm_nvdimm_ioremap(struct device *dev,
> >>
> >>  struct nvdimm_bus;
> >>  struct module;
> >> -struct device;
> >>  struct nd_blk_region;
> >
> > What is the coding style preference for pre-declarations like this?  Should
> > they be placed at the top of the file?
> >
> > The patch is reasonable but if the intent is to declare right before use for
> > clarity, both devm_nvdimm_memremap() and nd_blk_region_desc() use struct
> > device.  So perhaps this duplicate is on purpose?
>
> There are other struct device usages much later in the file, which don't have
> any pre-declarations for struct device. So I assume this might not be on
> purpose :-)

Yeah, I believe it was just code movement and the duplicate was
inadvertently introduced. Patch looks ok to me.

>
> On a side note, types.h can also be removed, since it's already included in
> kernel.h.

That I don't necessarily agree with, it just makes future header
reworks more fraught for not much benefit.


Re: [PATCH][v2] dax: Fix missed wakeup during dax entry invalidation

2021-04-19 Thread Dan Williams
On Mon, Apr 19, 2021 at 11:45 AM Vivek Goyal  wrote:
>
> This is V2 of the patch. Posted V1 here.
>
> https://lore.kernel.org/linux-fsdevel/20210416173524.ga1379...@redhat.com/
>
> Based on feedback from Dan and Jan, modified the patch to wake up
> all waiters when dax entry is invalidated. This solves the issues
> of missed wakeups.

Care to send a formal patch with this commentary moved below the --- line?

One style fixup below...

>
> I am seeing missed wakeups which ultimately lead to a deadlock when I am
> using virtiofs with DAX enabled and running "make -j". I had to mount
> virtiofs as rootfs and also reduce the dax window size to 256M to reproduce
> the problem consistently.
>
> So here is the problem. put_unlocked_entry() wakes up waiters only
> if entry is not null as well as !dax_is_conflict(entry). But if I
> call multiple instances of invalidate_inode_pages2() in parallel,
> then I can run into a situation where there are waiters on
> this index but nobody will wake these up.
>
> invalidate_inode_pages2()
>   invalidate_inode_pages2_range()
> invalidate_exceptional_entry2()
>   dax_invalidate_mapping_entry_sync()
> __dax_invalidate_entry() {
> xas_lock_irq(&xas);
> entry = get_unlocked_entry(&xas, 0);
> ...
> ...
> dax_disassociate_entry(entry, mapping, trunc);
> xas_store(&xas, NULL);
> ...
> ...
> put_unlocked_entry(&xas, entry);
> xas_unlock_irq(&xas);
> }
>
> Say a fault is in progress and it has locked entry at offset say "0x1c".
> Now say three instances of invalidate_inode_pages2() are in progress
> (A, B, C) and they all try to invalidate entry at offset "0x1c". Given
> dax entry is locked, all three instances A, B, C will wait in wait queue.
>
> When dax fault finishes, say A is woken up. It will store NULL entry
> at index "0x1c" and wake up B. When B comes along it will find "entry=0"
> at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). And
> this means put_unlocked_entry() will not wake up next waiter, given
> the current code. And that means C continues to wait and is not woken
> up.
>
> This patch fixes the issue by waking up all waiters when a dax entry
> has been invalidated. This seems to fix the deadlock I am facing
> and I can make forward progress.
>
> Reported-by: Sergio Lopez 
> Signed-off-by: Vivek Goyal 
> ---
>  fs/dax.c |   12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> Index: redhat-linux/fs/dax.c
> ===
> --- redhat-linux.orig/fs/dax.c  2021-04-16 14:16:44.332140543 -0400
> +++ redhat-linux/fs/dax.c   2021-04-19 11:24:11.465213474 -0400
> @@ -264,11 +264,11 @@ static void wait_entry_unlocked(struct x
> finish_wait(wq, &ewait.wait);
>  }
>
> -static void put_unlocked_entry(struct xa_state *xas, void *entry)
> +static void put_unlocked_entry(struct xa_state *xas, void *entry, bool 
> wake_all)
>  {
> /* If we were the only waiter woken, wake the next one */
> if (entry && !dax_is_conflict(entry))
> -   dax_wake_entry(xas, entry, false);
> +   dax_wake_entry(xas, entry, wake_all);
>  }
>
>  /*
> @@ -622,7 +622,7 @@ struct page *dax_layout_busy_page_range(
> entry = get_unlocked_entry(&xas, 0);
> if (entry)
> page = dax_busy_page(entry);
> -   put_unlocked_entry(&xas, entry);
> +   put_unlocked_entry(&xas, entry, false);

I'm not a fan of raw true/false arguments because if you read this
line in isolation you need to go read put_unlocked_entry() to recall
what that argument means. So lets add something like:

/**
 * enum dax_entry_wake_mode: waitqueue wakeup toggle
 * @WAKE_NEXT: entry was not mutated
 * @WAKE_ALL: entry was invalidated, or resized
 */
enum dax_entry_wake_mode {
	WAKE_NEXT,
	WAKE_ALL,
};

...and use that as the arg for dax_wake_entry(). So I'd expect this to
be a 3 patch series, introduce dax_entry_wake_mode for
dax_wake_entry(), introduce the argument for put_unlocked_entry()
without changing the logic, and finally this bug fix. Feel free to add
'Fixes: ac401cc78242 ("dax: New fault locking")' in case you feel this
needs to be backported.
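
For reference, a sketch of what the first of those three patches amounts
to: dax_wake_entry() taking the new enum (later renamed dax_wake_mode per
the v4 changelog above; reproduced from memory of the eventual series,
details may differ):

	static void dax_wake_entry(struct xa_state *xas, void *entry,
				   enum dax_entry_wake_mode mode)
	{
		struct exceptional_entry_key key;
		wait_queue_head_t *wq;

		wq = dax_entry_waitqueue(xas, entry, &key);

		/*
		 * Passing 'nr_exclusive == 0' to __wake_up() wakes all
		 * waiters; passing 1 wakes just the next exclusive waiter.
		 */
		if (entry && !dax_is_conflict(entry))
			__wake_up(wq, TASK_NORMAL,
				  mode == WAKE_ALL ? 0 : 1, &key);
	}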


[GIT PULL] libnvdimm fixes for v5.12-rc8 / final

2021-04-16 Thread Dan Williams
Hi Linus, please pull from:

...to receive a handful of libnvdimm fixups.

The largest change is for a regression that landed during -rc1 for
block-device read-only handling. Vaibhav found a new use for the
ability (originally introduced by virtio_pmem) to call back to the
platform to flush data, but also found an original bug in that
implementation. Lastly, Arnd cleans up some compile warnings in dax.

This has all appeared in -next with no reported issues.

---

The following changes since commit e49d033bddf5b565044e2abe4241353959bc9120:

  Linux 5.12-rc6 (2021-04-04 14:15:36 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
tags/libnvdimm-fixes-for-5.12-rc8

for you to fetch changes up to 11d2498f1568a0f923dc8ef7621de15a9e89267f:

  Merge branch 'for-5.12/dax' into libnvdimm-fixes (2021-04-09 22:00:09 -0700)


libnvdimm fixes for v5.12-rc8

- Fix a regression of read-only handling in the pmem driver.

- Fix a compile warning.

- Fix support for platform cache flush commands on powerpc/papr


Arnd Bergmann (1):
  dax: avoid -Wempty-body warnings

Dan Williams (2):
  libnvdimm: Notify disk drivers to revalidate region read-only
  Merge branch 'for-5.12/dax' into libnvdimm-fixes

Vaibhav Jain (1):
  libnvdimm/region: Fix nvdimm_has_flush() to handle ND_REGION_ASYNC

 drivers/dax/bus.c|  6 ++
 drivers/nvdimm/bus.c | 14 ++
 drivers/nvdimm/pmem.c| 37 +
 drivers/nvdimm/region_devs.c | 16 ++--
 include/linux/nd.h   |  1 +
 5 files changed, 56 insertions(+), 18 deletions(-)


Re: [PATCH v1 1/1] libnvdimm: Don't use GUID APIs against raw buffer

2021-04-16 Thread Dan Williams
On Fri, Apr 16, 2021 at 1:42 PM Andy Shevchenko wrote:
>
> On Fri, Apr 16, 2021 at 01:08:06PM -0700, Dan Williams wrote:
> > [ add Erik ]
> >
> > On Fri, Apr 16, 2021 at 10:36 AM Andy Shevchenko wrote:
> > >
> > > On Thu, Apr 15, 2021 at 05:37:54PM +0300, Andy Shevchenko wrote:
> > > > Strictly speaking the comparison between guid_t and raw buffer
> > > > is not correct. Return to plain memcmp() since the data structures
> > > > haven't changed to use uuid_t / guid_t the current state of affairs
> > > > is inconsistent. Either it should be changed altogether or left
> > > > as is.
> > >
> > > Dan, please review this one as well. I think here you may agree with me.
> >
> > You know, this is all a problem because ACPICA is using a raw buffer.
>
> And this is fine. It might be any other representation as well.
>
> > Erik, would it be possible to use the guid_t type in ACPICA? That
> > would allow NFIT to drop some ugly casts.
>
> guid_t is internal kernel type. If we ever decide to deviate from the current
> representation it wouldn't be possible in case a 3rd party is using it 1:1
> (via typedef or so).

I'm thinking something like ACPICA defining that space as a union with
the correct typing just for Linux.


Re: [PATCH v1 1/1] libnvdimm: Don't use GUID APIs against raw buffer

2021-04-16 Thread Dan Williams
[ add Erik ]

On Fri, Apr 16, 2021 at 10:36 AM Andy Shevchenko wrote:
>
> On Thu, Apr 15, 2021 at 05:37:54PM +0300, Andy Shevchenko wrote:
> > Strictly speaking the comparison between guid_t and raw buffer
> > is not correct. Return to plain memcmp() since the data structures
> > haven't changed to use uuid_t / guid_t the current state of affairs
> > is inconsistent. Either it should be changed altogether or left
> > as is.
>
> Dan, please review this one as well. I think here you may agree with me.

You know, this is all a problem because ACPICA is using a raw buffer.

Erik, would it be possible to use the guid_t type in ACPICA? That
would allow NFIT to drop some ugly casts.


Re: [PATCH] dax: Fix missed wakeup in put_unlocked_entry()

2021-04-16 Thread Dan Williams
On Fri, Apr 16, 2021 at 10:35 AM Vivek Goyal  wrote:
>
> I am seeing missed wakeups which ultimately lead to a deadlock when I am
> using virtiofs with DAX enabled and running "make -j". I had to mount
> virtiofs as rootfs and also reduce the dax window size to 32M to reproduce
> the problem consistently.
>
> This is not a complete patch. I am just proposing this partial fix to
> highlight the issue and trying to figure out how it should be fixed.
> Should it be fixed in generic dax code or should filesystem (fuse/virtiofs)
> take care of this.
>
> So here is the problem. put_unlocked_entry() wakes up waiters only
> if entry is not null as well as !dax_is_conflict(entry). But if I
> call multiple instances of invalidate_inode_pages2() in parallel,
> then I can run into a situation where there are waiters on
> this index but nobody will wake these up.
>
> invalidate_inode_pages2()
>   invalidate_inode_pages2_range()
> invalidate_exceptional_entry2()
>   dax_invalidate_mapping_entry_sync()
> __dax_invalidate_entry() {
> xas_lock_irq(&xas);
> entry = get_unlocked_entry(&xas, 0);
> ...
> ...
> dax_disassociate_entry(entry, mapping, trunc);
> xas_store(&xas, NULL);
> ...
> ...
> put_unlocked_entry(&xas, entry);
> xas_unlock_irq(&xas);
> }
>
> Say a fault is in progress and it has locked entry at offset say "0x1c".
> Now say three instances of invalidate_inode_pages2() are in progress
> (A, B, C) and they all try to invalidate entry at offset "0x1c". Given
> dax entry is locked, all three instances A, B, C will wait in wait queue.
>
> When dax fault finishes, say A is woken up. It will store NULL entry
> at index "0x1c" and wake up B. When B comes along it will find "entry=0"
> at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). And
> this means put_unlocked_entry() will not wake up next waiter, given
> the current code. And that means C continues to wait and is not woken
> up.
>
> In my case I am seeing that dax page fault path itself is waiting
> on grab_mapping_entry() and also invalidate_inode_pages2() is
> waiting in get_unlocked_entry() but entry has already been cleaned
> up and nobody woke up these processes. At least I think that's what
> is happening.
>
> This patch wakes up a process even if entry=0. And deadlock does not
> happen. I am running into some OOM issues, that I will debug.
>
> So my question is that is it a dax issue and should it be fixed in
> dax layer. Or should it be handled in fuse to make sure that
> multiple instances of invalidate_inode_pages2() on same inode
> don't make progress in parallel and introduce enough locking
> around it.
>
> Right now fuse_finish_open() calls invalidate_inode_pages2() without
> any locking. That allows it to make progress in parallel to dax
> fault path as well as allows multiple instances of invalidate_inode_pages2()
> to run in parallel.
>
> Not-yet-signed-off-by: Vivek Goyal 
> ---
>  fs/dax.c |7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> Index: redhat-linux/fs/dax.c
> ===
> --- redhat-linux.orig/fs/dax.c  2021-04-16 12:50:40.141363317 -0400
> +++ redhat-linux/fs/dax.c   2021-04-16 12:51:42.385926390 -0400
> @@ -266,9 +266,10 @@ static void wait_entry_unlocked(struct x
>
>  static void put_unlocked_entry(struct xa_state *xas, void *entry)
>  {
> -   /* If we were the only waiter woken, wake the next one */
> -   if (entry && !dax_is_conflict(entry))
> -   dax_wake_entry(xas, entry, false);
> +   if (dax_is_conflict(entry))
> +   return;
> +
> +   dax_wake_entry(xas, entry, false);

How does this work if entry is NULL? dax_entry_waitqueue() will not
know if it needs to adjust the index. I think the fix might be to
specify that put_unlocked_entry() in the invalidate path needs to do a
wake_up_all().
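
For readers without fs/dax.c at hand, the index adjustment in question
lives in dax_entry_waitqueue(), roughly (a sketch of the code of that
era; details may differ):

	static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
			void *entry, struct exceptional_entry_key *key)
	{
		unsigned long hash;
		unsigned long index = xas->xa_index;

		/*
		 * If 'entry' is a PMD, align the 'index' used for the wait
		 * queue to the start of that PMD, so that all offsets
		 * covered by the PMD map to the same bit lock.
		 */
		if (dax_is_pmd_entry(entry))
			index &= ~PG_PMD_COLOUR;
		key->xa = xas->xa;
		key->entry_start = index;

		hash = hash_long((unsigned long)xas->xa ^ index,
				 DAX_WAIT_TABLE_BITS);
		return wait_table + hash;
	}

With entry == NULL there is no way to tell whether the waiters hashed
under the PMD-aligned index, hence the wake-all suggestion.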


Re: [PATCH v1 1/1] ACPI: NFIT: Import GUID before use

2021-04-16 Thread Dan Williams
On Fri, Apr 16, 2021 at 10:34 AM Andy Shevchenko wrote:
>
> On Fri, Apr 16, 2021 at 09:15:34AM -0700, Dan Williams wrote:
> > On Fri, Apr 16, 2021 at 1:58 AM Andy Shevchenko wrote:
> > > On Fri, Apr 16, 2021 at 8:28 AM Dan Williams wrote:
> > > > On Thu, Apr 15, 2021 at 6:59 AM Andy Shevchenko wrote:
> > > > >
> > > > > Strictly speaking the comparison between guid_t and raw buffer
> > > > > is not correct. Import GUID to variable of guid_t type and then
> > > > > compare.
> > > >
> > > > Hmm, what about something like the following instead, because it adds
> > > > safety. Any concerns about evaluating x twice in a macro should be
> > > > alleviated by the fact that ARRAY_SIZE() will fail the build if (x) is
> > > > not an array.
> > >
> > > ARRAY_SIZE doesn't check type.
> >
> > See __must_be_array.
> >
> > > I don't like hiding ugly casts like this.
> >
> > See PTR_ERR, ERR_PTR, ERR_CAST.
>
> It's special, i.e. error pointer case. We don't handle such here.
>
> > There's nothing broken about the way the code currently stands, so I'd
> > rather try to find something to move the implementation forward than
> > sideways.
>
> Submit a patch then. I rest my case b/c I consider that ugly castings worse
> than additional API call, although it's not ideal.

It sounds like you'll NAK that patch, and I'm not too enthusiastic
about these proposed changes either because I disagree that the code
is incorrect. Is there another compromise?


Re: [PATCH v1 1/1] ACPI: NFIT: Import GUID before use

2021-04-16 Thread Dan Williams
On Fri, Apr 16, 2021 at 1:58 AM Andy Shevchenko wrote:
>
> On Fri, Apr 16, 2021 at 8:28 AM Dan Williams  wrote:
> >
> > On Thu, Apr 15, 2021 at 6:59 AM Andy Shevchenko wrote:
> > >
> > > Strictly speaking the comparison between guid_t and raw buffer
> > > is not correct. Import GUID to variable of guid_t type and then
> > > compare.
> >
> > Hmm, what about something like the following instead, because it adds
> > safety. Any concerns about evaluating x twice in a macro should be
> > alleviated by the fact that ARRAY_SIZE() will fail the build if (x) is
> > not an array.
>
> ARRAY_SIZE doesn't check type.

See __must_be_array.

> I don't like hiding ugly casts like this.

See PTR_ERR, ERR_PTR, ERR_CAST.

There's nothing broken about the way the code currently stands, so I'd
rather try to find something to move the implementation forward than
sideways.


Re: [PATCH v1 1/1] ACPI: NFIT: Import GUID before use

2021-04-15 Thread Dan Williams
On Thu, Apr 15, 2021 at 6:59 AM Andy Shevchenko wrote:
>
> Strictly speaking the comparison between guid_t and raw buffer
> is not correct. Import GUID to variable of guid_t type and then
> compare.

Hmm, what about something like the following instead, because it adds
safety. Any concerns about evaluating x twice in a macro should be
alleviated by the fact that ARRAY_SIZE() will fail the build if (x) is
not an array.

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 8c5dde628405..bac01eec07a6 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -681,7 +681,7 @@ int nfit_spa_type(struct acpi_nfit_system_address *spa)
int i;

for (i = 0; i < NFIT_UUID_MAX; i++)
-   if (guid_equal(to_nfit_uuid(i), (guid_t *)&spa->range_guid))
+   if (guid_equal(to_nfit_uuid(i), cast_guid(spa->range_guid)))
return i;
return -1;
 }
diff --git a/include/linux/uuid.h b/include/linux/uuid.h
index 8cdc0d3567cd..cec1dc2ab994 100644
--- a/include/linux/uuid.h
+++ b/include/linux/uuid.h
@@ -33,6 +33,9 @@ typedef struct {
 extern const guid_t guid_null;
 extern const uuid_t uuid_null;

+#define cast_guid(x) ({ BUILD_BUG_ON(ARRAY_SIZE(x) != 16); (guid_t *)&(x); })
+#define cast_uuid(x) ({ BUILD_BUG_ON(ARRAY_SIZE(x) != 16); (uuid_t *)&(x); })
+
 static inline bool guid_equal(const guid_t *u1, const guid_t *u2)
 {
return memcmp(u1, u2, sizeof(guid_t)) == 0;
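 }

A quick illustration of the type safety this adds (illustrative usage,
not part of the patch):

	u8 raw[16];
	u8 *p = raw;

	guid_t *ok = cast_guid(raw);	/* compiles: raw is a 16-byte array */
	guid_t *no = cast_guid(p);	/* build failure: ARRAY_SIZE() /
					 * __must_be_array() rejects pointers */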


Re: [PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible

2021-04-15 Thread Dan Williams
On Thu, Apr 15, 2021 at 4:44 AM Vaibhav Jain  wrote:
>
> Thanks for looking into this Dan,
>
> Dan Williams  writes:
>
> > On Wed, Apr 14, 2021 at 5:40 AM Vaibhav Jain  wrote:
> >>
> >> Currently drc_pmem_query_stats() generates a dev_err in case
> >> "Enable Performance Information Collection" feature is disabled from
> >> HMC. The error is of the form below:
> >>
> >> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
> >>  performance stats, Err:-10
> >>
> >> This error message confuses users as it implies a possible problem
> >> with the nvdimm even though it's due to a disabled feature.
> >>
> >> So we fix this by explicitly handling the H_AUTHORITY error from the
> >> H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an
> >> error, saying that "Performance stats in-accessible".
> >>
> >> Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from PHYP')
> >> Signed-off-by: Vaibhav Jain 
> >> ---
> >>  arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
> >>  1 file changed, 3 insertions(+)
> >>
> >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
> >> index 835163f54244..9216424f8be3 100644
> >> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> >> @@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
> >> dev_err(&p->pdev->dev,
> >> "Unknown performance stats, Err:0x%016lX\n", ret[0]);
> >> return -ENOENT;
> >> +   } else if (rc == H_AUTHORITY) {
> >> +   dev_warn(&p->pdev->dev, "Performance stats in-accessible");
> >> +   return -EPERM;
> >
> > So userspace can spam the kernel log? Why is kernel log message needed
> > at all? EPERM told the caller what happened.
> Currently this error message is only reported during probe of the
> nvdimm. So userspace cannot directly spam kernel log.

Oh, ok, I saw things like papr_pdsm_fuel_gauge() in the call stack and
thought this was reachable through an ioctl. Sorry for the noise.


Re: [PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible

2021-04-14 Thread Dan Williams
On Wed, Apr 14, 2021 at 5:40 AM Vaibhav Jain  wrote:
>
> Currently drc_pmem_query_stats() generates a dev_err in case
> "Enable Performance Information Collection" feature is disabled from
> HMC. The error is of the form below:
>
> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
>  performance stats, Err:-10
>
> This error message confuses users as it implies a possible problem
> with the nvdimm even though it's due to a disabled feature.
>
> So we fix this by explicitly handling the H_AUTHORITY error from the
> H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an
> error, saying that "Performance stats in-accessible".
>
> Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from PHYP')
> Signed-off-by: Vaibhav Jain 
> ---
>  arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
> index 835163f54244..9216424f8be3 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
> dev_err(&p->pdev->dev,
> "Unknown performance stats, Err:0x%016lX\n", ret[0]);
> return -ENOENT;
> +   } else if (rc == H_AUTHORITY) {
> +   dev_warn(&p->pdev->dev, "Performance stats in-accessible");
> +   return -EPERM;

So userspace can spam the kernel log? Why is kernel log message needed
at all? EPERM told the caller what happened.


Re: [PATCH v1] libnvdimm, dax: Fix a missing check in nd_dax_probe()

2021-04-09 Thread Dan Williams
On Fri, Apr 9, 2021 at 5:33 PM  wrote:
>
> From: Yingjie Wang 
>
> In nd_dax_probe(), nd_dax_alloc() may fail and return NULL.
> Check for NULL before attempting to
> use nd_dax to avoid a NULL pointer dereference.
>
> Fixes: c5ed9268643c ("libnvdimm, dax: autodetect support")
> Signed-off-by: Yingjie Wang 
> ---
>  drivers/nvdimm/dax_devs.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c
> index 99965077bac4..b1426ac03f01 100644
> --- a/drivers/nvdimm/dax_devs.c
> +++ b/drivers/nvdimm/dax_devs.c
> @@ -106,6 +106,8 @@ int nd_dax_probe(struct device *dev, struct nd_namespace_common *ndns)
>
> nvdimm_bus_lock(&ndns->dev);

hmmm...

> nd_dax = nd_dax_alloc(nd_region);
> +   if (!nd_dax)
> +   return -ENOMEM;

Can you spot the bug this introduces? See the hint above.

> nd_pfn = &nd_dax->nd_pfn;
> dax_dev = nd_pfn_devinit(nd_pfn, ndns);
> nvdimm_bus_unlock(&ndns->dev);
> --
> 2.7.4
>
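
The bug being hinted at is presumably that the new early return leaks
the nvdimm_bus_lock() taken just above it. A corrected sketch, under
that reading:

	nvdimm_bus_lock(&ndns->dev);
	nd_dax = nd_dax_alloc(nd_region);
	if (!nd_dax) {
		nvdimm_bus_unlock(&ndns->dev);
		return -ENOMEM;
	}
	nd_pfn = &nd_dax->nd_pfn;
	dax_dev = nd_pfn_devinit(nd_pfn, ndns);
	nvdimm_bus_unlock(&ndns->dev);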


Re: [PATCH] libnvdimm/region: Update nvdimm_has_flush() to handle ND_REGION_ASYNC

2021-04-09 Thread Dan Williams
On Mon, Apr 5, 2021 at 4:31 AM Aneesh Kumar K.V wrote:
>
> Vaibhav Jain  writes:
>
> > In case a platform doesn't provide explicit flush-hints but provides an
> > explicit flush callback via ND_REGION_ASYNC region flag, then
> > nvdimm_has_flush() still returns '0' indicating that writes do not
> > require flushing. This happens on PPC64 with patch at [1] applied,
> > where 'deep_flush' of a region was denied even though an explicit
> > flush function was provided.
> >
> > Fix this by adding a condition to nvdimm_has_flush() to test for the
> > ND_REGION_ASYNC flag on the region and see if a 'region->flush'
> > callback is assigned.
> >
>
> May be this should have
> Fixes: c5d4355d10d4 ("libnvdimm: nd_region flush callback support")

Yes, thanks for that.

>
> Without this we will mark the pmem disk not having FUA support?

Yes.

>
>
> > References:
> > [1] "powerpc/papr_scm: Implement support for H_SCM_FLUSH hcall"
> > https://lore.kernel.org/linux-nvdimm/161703936121.36.7260632399   
> > 582101498.stgit@e1fbed493c87
> >
> > Reported-by: Shivaprasad G Bhat 
> > Signed-off-by: Vaibhav Jain 
> > ---
> >  drivers/nvdimm/region_devs.c | 9 +++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> > index ef23119db574..e05cc9f8a9fd 100644
> > --- a/drivers/nvdimm/region_devs.c
> > +++ b/drivers/nvdimm/region_devs.c
> > @@ -1239,6 +1239,11 @@ int nvdimm_has_flush(struct nd_region *nd_region)
> >   || !IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API))
> >   return -ENXIO;
> >
> > + /* Test if an explicit flush function is defined */
> > + if (test_bit(ND_REGION_ASYNC, &nd_region->flags) && nd_region->flush)
> > + return 1;
>
> > +
> > + /* Test if any flush hints for the region are available */
> >   for (i = 0; i < nd_region->ndr_mappings; i++) {
> >   struct nd_mapping *nd_mapping = &nd_region->mapping[i];
> >   struct nvdimm *nvdimm = nd_mapping->nvdimm;
> > @@ -1249,8 +1254,8 @@ int nvdimm_has_flush(struct nd_region *nd_region)
> >   }
> >
> >   /*
> > -  * The platform defines dimm devices without hints, assume
> > -  * platform persistence mechanism like ADR
> > +  * The platform defines dimm devices without hints nor explicit flush,
> > +  * assume platform persistence mechanism like ADR
> >*/
> >   return 0;
> >  }
> > --
> > 2.30.2


Re: [PATCH] libnvdimm/region: Update nvdimm_has_flush() to handle ND_REGION_ASYNC

2021-04-09 Thread Dan Williams
On Fri, Apr 2, 2021 at 2:26 AM Vaibhav Jain  wrote:
>
> In case a platform doesn't provide explicit flush-hints but provides an
> explicit flush callback via ND_REGION_ASYNC region flag, then
> nvdimm_has_flush() still returns '0' indicating that writes do not
> require flushing. This happens on PPC64 with patch at [1] applied,
> where 'deep_flush' of a region was denied even though an explicit
> flush function was provided.
>
> Fix this by adding a condition to nvdimm_has_flush() to test for the
> ND_REGION_ASYNC flag on the region and see if a 'region->flush'
> callback is assigned.

Looks good.

>
> References:
> [1] "powerpc/papr_scm: Implement support for H_SCM_FLUSH hcall"
> https://lore.kernel.org/linux-nvdimm/161703936121.36.7260632399 
> 582101498.stgit@e1fbed493c87

Looks like a typo happened in that link, I can fix that up. I'll also
change this to the canonical "Link:" tag for references.

>
> Reported-by: Shivaprasad G Bhat 
> Signed-off-by: Vaibhav Jain 
> ---
>  drivers/nvdimm/region_devs.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index ef23119db574..e05cc9f8a9fd 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -1239,6 +1239,11 @@ int nvdimm_has_flush(struct nd_region *nd_region)
> || !IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API))
> return -ENXIO;
>
> +   /* Test if an explicit flush function is defined */
> +   if (test_bit(ND_REGION_ASYNC, &nd_region->flags) && nd_region->flush)
> +   return 1;
> +
> +   /* Test if any flush hints for the region are available */
> for (i = 0; i < nd_region->ndr_mappings; i++) {
> struct nd_mapping *nd_mapping = &nd_region->mapping[i];
> struct nvdimm *nvdimm = nd_mapping->nvdimm;
> @@ -1249,8 +1254,8 @@ int nvdimm_has_flush(struct nd_region *nd_region)
> }
>
> /*
> -* The platform defines dimm devices without hints, assume
> -* platform persistence mechanism like ADR
> +* The platform defines dimm devices without hints nor explicit flush,
> +* assume platform persistence mechanism like ADR
>  */
> return 0;
>  }
> --
> 2.30.2
>
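
For context on what nvdimm_has_flush() feeds into: the pmem driver
consumes this return value roughly as follows (a sketch of
pmem_attach_disk() from that era; details may differ):

	fua = nvdimm_has_flush(nd_region);
	if (fua < 0) {
		dev_warn(dev, "unable to guarantee persistence of writes\n");
		fua = 0;
	}
	/* ... queue setup elided ... */
	blk_queue_write_cache(q, true, fua);

So without the fix, a region backed only by an explicit flush callback
would register its disk without FUA support.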


Re: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path

2021-03-29 Thread Dan Williams
On Thu, Mar 25, 2021 at 7:34 AM Jason Gunthorpe  wrote:
>
> On Thu, Mar 18, 2021 at 10:03:06AM -0700, Dan Williams wrote:
> > Yes. I still need to answer the question of whether mapping
> > invalidation triggers longterm pin holders to relinquish their hold,
> > but that's a problem regardless of whether gup-fast is supported or
> > not.
>
> It does not, GUP users do not interact with address_space or mmu
> notifiers
>

Ok, but the SIGKILL from the memory_failure() will drop the pin?

Can memory_failure() find the right processes to kill if the memory
registration has been passed by SCM_RIGHTS?


Re: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()

2021-03-24 Thread Dan Williams
On Wed, Mar 24, 2021 at 10:39 AM Christoph Hellwig  wrote:
>
> On Wed, Mar 24, 2021 at 09:37:01AM -0700, Dan Williams wrote:
> > > Eww.  As I said I think the right way is that the file system (or
> > > other consumer) can register a set of callbacks for opening the device.
> >
> > How does that solve the problem of the driver being notified of all
> > pfn failure events?
>
> Ok, I probably just showed I need to spend more time looking at
> your proposal vs the actual code..
>
> Don't we have a proper way how one of the nvdimm layers owns a
> specific memory range and calls directly into that instead of through
> a notifier?

So that could be a new dev_pagemap operation as Ruan has here. I was
thinking that other agents would be interested in non-dev_pagemap
managed ranges, but we could leave that for later and just make the
current pgmap->memory_failure() callback proposal range based.

>
> > Today pmem only finds out about the ones that are
> > notified via native x86 machine check error handling via a notifier
> > (yes "firmware-first" error handling fails to do the right thing for
> > the pmem driver),
>
> Did any kind of firmware-first error handling ever get anything
> right?  I wish people would have learned that by now.

Part of me wants to say if you use firmware-first you get to keep the
pieces, but it's not always the end user choice as far as I
understand.

> > or the ones that are eventually reported via address
> > range scrub, but only for the nvdimms that implement range scrubbing.
> > memory_failure() seems a reasonable catch all point to route pfn
> > failure events, in an arch independent way, to interested drivers.
>
> Yeah.
>
> > I'm fine swapping out dax_device blocking_notifier chains for your
> > proposal, but that does not address all the proposed reworks in my
> > list which are:
> >
> > - delete "drivers/acpi/nfit/mce.c"
> >
> > - teach memory_failure() to be able to communicate range failure
> >
> > - enable memory_failure() to defer to a filesystem that can say
> > "critical metadata is impacted, no point in trying to do file-by-file
> > isolation, bring the whole fs down".
>
> This all sounds sensible.

Ok, Ruan, I think this means rework your dev_pagemap_ops callback to
be range based. Add a holder concept for dax_devices and then layer
that on Christoph's eventual dax_device callback mechanism that a
dax_device holder can register.
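
In other words, a range-based callback shape along these lines (a
sketch; the signature in the eventual series may differ):

	struct dev_pagemap_ops {
		/* ... existing ops ... */

		/*
		 * Handle memory failure over a pfn range; the pgmap owner
		 * (pmem driver, or a filesystem holder above it) decides
		 * between page-by-page isolation and taking the whole
		 * device / fs offline.
		 */
		int (*memory_failure)(struct dev_pagemap *pgmap,
				      unsigned long pfn,
				      unsigned long nr_pages, int mf_flags);
	};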


Re: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path

2021-03-24 Thread Dan Williams
On Thu, Mar 18, 2021 at 3:02 AM Joao Martins  wrote:
>
> On 3/18/21 4:08 AM, Dan Williams wrote:
> > Now that device-dax and filesystem-dax are guaranteed to unmap all user
> > mappings of devmap / DAX pages before tearing down the 'struct page'
> > array, get_user_pages_fast() can rely on its traditional synchronization
> > method "validate_pte(); get_page(); revalidate_pte()" to catch races with
> > device shutdown. Specifically the unmap guarantee ensures that gup-fast
> > either succeeds in taking a page reference (lock-less), or it detects a
> > need to fall back to the slow path where the device presence can be
> > revalidated with locks held.
>
> [...]
>
> > @@ -2087,21 +2078,26 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
> >
> >  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> > +
> >  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> >unsigned long end, unsigned int flags,
> >struct page **pages, int *nr)
> >  {
> >   int nr_start = *nr;
> > - struct dev_pagemap *pgmap = NULL;
> >
> >   do {
> > - struct page *page = pfn_to_page(pfn);
> > + struct page *page;
> > +
> > + /*
> > +  * Typically pfn_to_page() on a devmap pfn is not safe
> > +  * without holding a live reference on the hosting
> > +  * pgmap. In the gup-fast path it is safe because any
> > +  * races will be resolved by either gup-fast taking a
> > +  * reference or the shutdown path unmapping the pte to
> > +  * trigger gup-fast to fall back to the slow path.
> > +  */
> > + page = pfn_to_page(pfn);
> >
> > - pgmap = get_dev_pagemap(pfn, pgmap);
> > - if (unlikely(!pgmap)) {
> > - undo_dev_pagemap(nr, nr_start, flags, pages);
> > - return 0;
> > - }
> >   SetPageReferenced(page);
> >   pages[*nr] = page;
> >   if (unlikely(!try_grab_page(page, flags))) {
>
> So for allowing FOLL_LONGTERM[0] would it be OK if we used page->pgmap after
> try_grab_page() for checking pgmap type to see if we are in a device-dax
> longterm pin?

So, there is an effort to add a new pte bit p{m,u}d_special to disable
gup-fast for huge pages [1]. I'd like to investigate whether we could
use devmap + special as an encoding for "no longterm" and never
consult the pgmap in the gup-fast path.

[1]: 
https://lore.kernel.org/linux-mm/a1fa7fa2-914b-366d-9902-e5b784e84...@shipmail.org/


Re: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()

2021-03-24 Thread Dan Williams
On Wed, Mar 24, 2021 at 12:48 AM Christoph Hellwig  wrote:
>
> On Tue, Mar 23, 2021 at 07:19:28PM -0700, Dan Williams wrote:
> > So I think the path forward is:
> >
> > - teach memory_failure() to allow for ranged failures
> >
> > - let interested drivers register for memory failure events via a
> > blocking_notifier_head
>
> Eww.  As I said I think the right way is that the file system (or
> other consumer) can register a set of callbacks for opening the device.

How does that solve the problem of the driver being notified of all
pfn failure events? Today pmem only finds out about the ones that are
notified via native x86 machine check error handling via a notifier
(yes "firmware-first" error handling fails to do the right thing for
the pmem driver), or the ones that are eventually reported via address
range scrub, but only for the nvdimms that implement range scrubbing.
memory_failure() seems a reasonable catch all point to route pfn
failure events, in an arch independent way, to interested drivers.

I'm fine swapping out dax_device blocking_notifier chains for your
proposal, but that does not address all the proposed reworks in my
list which are:

- delete "drivers/acpi/nfit/mce.c"

- teach memory_failure() to be able to communicate range failure

- enable memory_failure() to defer to a filesystem that can say
"critical metadata is impacted, no point in trying to do file-by-file
isolation, bring the whole fs down".
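
The "communicate range failure" item could start as little more than a
length parameter; a naive sketch (the real win is letting one call
short-circuit the per-pfn work for ranges a filesystem owns):

/* sketch: ranged wrapper over today's per-pfn interface */
int memory_failure_range(unsigned long pfn, unsigned long nr_pages,
			 int flags)
{
	unsigned long i;
	int ret = 0;

	for (i = 0; i < nr_pages; i++) {
		int rc = memory_failure(pfn + i, flags);

		/* report the first failure, keep processing the range */
		if (rc && !ret)
			ret = rc;
	}
	return ret;
}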

> I have a series I need to finish and send out to do that for block
> devices.  We probably also need the concept of a holder for the dax
> device to make it work nicely, as otherwise we're going to have a bit
> of a mess.

Ok, I'll take a look at adding a holder.

>
> > This obviously does not solve Dave's desire to get this type of error
> > reporting on block_devices, but I think there's nothing stopping a
> > parallel notifier chain from being created for block-devices, but
> > that's orthogonal to requirements and capabilities provided by
> > dax-devices.
>
> FYI, my series could easily accomodate that if we ever get a block
> driver that actually could report such errors.

Sure, whatever we land for a dax_device could easily be adopted for a
block device.


Re: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()

2021-03-23 Thread Dan Williams
On Thu, Mar 18, 2021 at 7:18 PM ruansy.f...@fujitsu.com
 wrote:
>
>
>
> > -Original Message-
> > From: ruansy.f...@fujitsu.com 
> > Subject: RE: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()
> > > > > > >
> > > > > > > After the conversation with Dave I don't see the point of this.
> > > > > > > If there is a memory_failure() on a page, why not just call
> > > > > > > memory_failure()? That already knows how to find the inode and
> > > > > > > the filesystem can be notified from there.
> > > > > >
> > > > > > We want memory_failure() to support reflinked files.  In this
> > > > > > case, we are not able to track multiple files from a page (this
> > > > > > broken page) because page->mapping and page->index can only
> > > > > > track one file.  Thus, I introduce this ->memory_failure(),
> > > > > > implemented in the pmem driver, to call ->corrupted_range()
> > > > > > level by level up the stack, and finally find out the files
> > > > > > that are using (mmapping) this page.
> > > > > >
> > > > >
> > > > > I know the motivation, but this implementation seems backwards.
> > > > > It's already the case that memory_failure() looks up the
> > > > > address_space associated with a mapping. From there I would expect
> > > > > a new 'struct address_space_operations' op to let the fs handle
> > > > > the case when there are multiple address_spaces associated with a
> > > > > given file.
> > > > >
> > > >
> > > > Let me think about it.  In this way, we
> > > > 1. associate file mapping with dax page in dax page fault;
> > >
> > > I think this needs to be a new type of association that proxies the
> > > representation of the reflink across all involved address_spaces.
> > >
> > > > 2. iterate files reflinked to notify `kill processes signal` by the
> > > >   new address_space_operation;
> > > > 3. re-associate to another reflinked file mapping when unmmaping
> > > > (rmap qeury in filesystem to get the another file).
> > >
> > > Perhaps the proxy object is reference counted per-ref-link. It seems
> > > error prone to keep changing the association of the pfn while the
> > > reflink is intact.
> > Hi, Dan
> >
> > I think my early rfc patchset was implemented in this way:
> >  - Create a per-page 'dax-rmap tree' to store each reflinked file's
> > (mapping, offset) when taking a dax page fault.
> >  - Mount this tree on page->zone_device_data, which is not used in fsdax, so
> > that we can iterate reflinked file mappings in memory_failure() easily.
> > In my understanding, the dax-rmap tree is the proxy object you mentioned.
> > If so, I have to say, this method was rejected, because it causes huge
> > overhead in the case where every dax page has its own dax-rmap tree.
> >
>
> Hi, Dan
>
> What do you think about this?  I am still confused.  Could you give me some
> advice?

So I think the primary driver of this functionality is dax-devices and
the architectural model for memory failure where several architectures
and error handlers know how to route pfn failure to the
memory_failure() frontend.

Compare that to block-devices where sector failure has no similar
framework, and despite some initial interest about reusing 'struct
badblocks' for this type of scenario there has been no real uptake to
expand 'struct badblocks' outside of the pmem driver.

I think the work you have done for ->corrupted_range() just needs to
be repurposed away from a block-device operation to dax-device
infrastructure. Christoph's pushback on extending
block_device_operations makes sense to me because there is likely no
other user of this facility than the pmem driver, and the pmem driver
only needs it for the vestigial reason that filesystems mount on
block-devices and not dax-devices.

Recently Dave drove home the point that a filesystem can't do anything
with pfns, it needs LBAs. A dax-device does not have LBA's, but it
does operate on the concept of device-relative offsets. The filesystem
is allowed to assume that dax-device:PFN[device_byte_offset >>
PAGE_SHIFT] aliases the same data as the associated
block-device:LBA[device_byte_offset >> SECTOR_SHIFT]. He also
reiterated that this interface should be range based, which you
already had, but I did not include in my attempt to communicate the
mass failure of an entire surprise-removed device.
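
That aliasing rule also means the translation lives in the driver, not
the filesystem; a hedged sketch (the helper name is hypothetical, the
fields are from drivers/nvdimm/pmem.h):

/* sketch: pfn -> device byte offset -> LBA, done by the pmem driver */
static sector_t pmem_pfn_to_sector(struct pmem_device *pmem,
				   unsigned long pfn)
{
	u64 byte_offset = PFN_PHYS(pfn) - pmem->phys_addr;

	return (byte_offset - pmem->data_offset) >> SECTOR_SHIFT;
}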

So I think the path forward is:

- teach memory_failure() to allow for ranged failures

- let interested drivers register for memory failure events via a
blocking_notifier_head

- teach memory_failure() to optionally let the notifier chain claim
the event vs its current default of walking page->mapping

- teach the pmem driver to register for memory_failure() events and
filter the ones that apply to pfns that the driver owns

- drop the nfit driver's usage of the mce notifier chain since
memory_failure() is a superset of what the mce notifier communicates

- augment the pmem driver's view of badblocks that it 
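
In code, the registration step of that plan might look like the
following (a sketch; the event struct, the mf_notifier field, and the
pmem_pfn_owned() helper are all hypothetical):

/* sketch: pmem claiming pfn failure events from a notifier chain */
struct memory_failure_event {
	unsigned long pfn;
	unsigned long nr_pages;
	int flags;
};

static int pmem_mf_notify(struct notifier_block *nb, unsigned long action,
			  void *data)
{
	struct memory_failure_event *evt = data;
	struct pmem_device *pmem =
		container_of(nb, struct pmem_device, mf_notifier);

	if (!pmem_pfn_owned(pmem, evt->pfn, evt->nr_pages))
		return NOTIFY_DONE;	/* not our pfns, keep walking */
	/* claim the event: record badblocks, notify the fs as a range */
	return NOTIFY_OK;
}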

Re: [PATCH] dax: avoid -Wempty-body warnings

2021-03-22 Thread Dan Williams
On Mon, Mar 22, 2021 at 4:45 AM Arnd Bergmann  wrote:
>
> From: Arnd Bergmann 
>
> gcc warns about an empty body in an else statement:
>
> drivers/dax/bus.c: In function 'do_id_store':
> drivers/dax/bus.c:94:48: error: suggest braces around empty body in an 'else' 
> statement [-Werror=empty-body]
>94 | /* nothing to remove */;
>   |^
> drivers/dax/bus.c:99:43: error: suggest braces around empty body in an 'else' 
> statement [-Werror=empty-body]
>99 | /* dax_id already added */;
>   |   ^
>
> In both of these cases, the 'else' exists only to have a place to
> add a comment, but that comment doesn't really explain that much
> either, so the easiest way to shut up that warning is to just
> remove the else.

Ok, applied, thanks.
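
For readers not staring at drivers/dax/bus.c, the pattern gcc objects
to and the shape of the fix is roughly this (a simplified sketch of
do_id_store(), not the exact hunk):

/* before: the comment-only else body trips -Wempty-body */
if (dax_id) {
	list_del(&dax_id->list);
	kfree(dax_id);
} else
	/* nothing to remove */;

/* after: drop the else entirely */
if (dax_id) {
	list_del(&dax_id->list);
	kfree(dax_id);
}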


Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

2021-03-19 Thread Dan Williams
On Fri, Mar 19, 2021 at 6:47 PM Dave Chinner  wrote:
[..]
> > Now I'm trying to reconcile the fact that platform
> > poison handling will hit memory_failure() first and may not
> > immediately reach the driver, if ever (see the perennially awkward
> > firmware-first-mode error handling: ghes_handle_memory_failure()) . So
> > even if the ->memory_failure(dev...) up call exists there is no
> > guarantee it can get called for all poison before the memory_failure()
> > down call happens. Which means regardless of whether
> > ->memory_failure(dev...) exists memory_failure() needs to be able to
> > do the right thing.
>
> I don't see how a poor implementation of memory_failure in a driver
> or hardware is even remotely relevant to the interface used to
> notify the filesystem of a media or device failure. It sounds like
> you are trying to use memory_failure() for something it was never
> intended to support and that there's a bunch of driver and
> infrastructure work needed to make it work how you want it to work.
> And even then it may not work the way we want it to work
>
> > Combine that with the fact that new buses like CXL might be configured
> > in "poison on decode error" mode which means that a memory_failure()
> > storm can happen regardless of whether the driver initiates it
> > programmatically.
>
> Straw man argument.
>
> "We can't make this interface a ranged notification because the
> hardware might only be able to do page-by-page notification."

No, it's "we can't make this interface notify the filesystem that
sectors have failed before the memory_failure() (ranged or not) has
communicated that pfns have failed."

memory_failure() today is the first and sometimes only interface that
learns of pfn failures.

>
> You can do page-by-page notification with a range based interface.
> We are talking about how to efficiently and reliably inform the
> filesystem that a range of a device is no longer accessible and so
> it needs to revoke all mappings over that range of it's address
> space. That does not need to be a single page at a time interface.
>
> If your hardware is configured to do stupid things, then that is not
> the fault of the software interface used to communicate faults up
> the stack, nor is it something that the notfication interface should
> try to fix or mitigate.
>
> > How about a mechanism to optionally let a filesystem take over memory
> > failure handling for a range of pfns that the memory_failure() can
> > consult to fail ranges at a time rather than one by one? So a new
> > 'struct dax_operations' op (void) (*memory_failure_register(struct
> > dax_device *, void *data). Where any agent that claims a dax_dev can
> > register to take over memory_failure() handling for any event that
> > happens in that range. This would be routed through device-mapper like
> > any other 'struct dax_operations' op. I think that meets your
> > requirement to get notifications of all the events you want to handle,
> > but still allows memory_failure() to be the last resort for everything
> > that has not opted into this error handling.
>
> Which is basically the same as the proposed ->corrupted_range stack,
> except it doesn't map the pfns back to LBA addresses the filesystem
> needs to make sense of the failure.
>
> fs-dax filesystems have no clue what pfns are, or how to translate
> them to LBAs in their block device address space that the map
> everything to. The fs-dax infrastructure asks the filesystem for
> bdev/sector based mappings, and internally converts them to pfns by
> a combination of bdev and daxdev callouts. Hence fs-dax filesystems
> never see nor interpret pfns at all.  Nor do they have the
> capability to convert a PFN to a LBA address. And neither the
> underlying block device nor the associated DAX device provide a
> method for doing this reverse mapping translation.

True.

>
> So if we have an interface that hands a {daxdev,PFN,len} tuple to
> the filesystem, exactly what is the filesystem supposed to do with
> it? How do we turn that back into a {bdev,sector,len} tuple so we
> can do reverse mapping lookups to find the filesystem objects that
> allocated within the notified range?
>
> I'll point out again that these address space translations were
> something that the ->corrupted_range callbacks handled directly - no
> layer in the stack was handed a range that it didn't know how to map
> to its own internal structures. By the time it got to the
> filesystem, it was a {bdev,sector,len} tuple, and the filesystem
> could feed that directly to its reverse mapping lookups
>
> Maybe I'm missing something magical about ->memory_failure that does
> all this translation for us, but I don't see it in this patchset. I
> just don't see how this proposed interface is a usable at the
> filesystem level as it stands.

So then it's not the filesystem that needs to register for
memory_failure(), it's the driver, in order to translate the failed LBAs
up the stack. However, 

Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

2021-03-18 Thread Dan Williams
On Wed, Mar 17, 2021 at 9:45 PM Darrick J. Wong  wrote:
>
> On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote:
> > Jason wondered why the get_user_pages_fast() path takes references on a
> > @pgmap object. The rationale was to protect against accessing a 'struct
> > page' that might be in the process of being removed by the driver, but
> > he rightly points out that it should be solved the same way all gup-fast
> > synchronization is solved, which is to invalidate the mapping and let the
> > gup slow path do @pgmap synchronization [1].
> >
> > To achieve that it means that new user mappings need to stop being
> > created and all existing user mappings need to be invalidated.
> >
> > For device-dax this is already the case as kill_dax() prevents future
> > faults from installing a pte, and the single device-dax inode
> > address_space can be trivially unmapped.
> >
> > The situation is different for filesystem-dax where device pages could
> > be mapped by any number of inode address_space instances. An initial
> > thought was to treat the device removal event like a drop_pagecache_sb()
> > event that walks superblocks and unmaps all inodes. However, Dave points
> > out that it is not just the filesystem user-mappings that need to react
> > to global DAX page-unmap events, it is also filesystem metadata
> > (proposed DAX metadata access), and other drivers (upstream
> > DM-writecache) that need to react to this event [2].
> >
> > The only kernel facility that is meant to globally broadcast the loss of
> > a page (via corruption or surprise remove) is memory_failure(). The
> > downside of memory_failure() is that it is a pfn-at-a-time interface.
> > However, the events that would trigger the need to call memory_failure()
> > over a full PMEM device should be rare. Remove should always be
> > coordinated by the administrator with the filesystem. If someone force
> > removes a device from underneath a mounted filesystem the driver assumes
> > they have a good reason, or otherwise get to keep the pieces. Since
> > ->remove() callbacks can not fail the only option is to trigger the mass
> > memory_failure().
> >
> > The mechanism to determine whether memory_failure() triggers at
> > pmem->remove() time is whether the associated dax_device has an elevated
> > reference at @pgmap ->kill() time.
> >
> > With this in place the get_user_pages_fast() path can drop its
> > half-measure synchronization with an @pgmap reference.
> >
> > Link: http://lore.kernel.org/r/20210224010017.gq2643...@ziepe.ca [1]
> > Link: http://lore.kernel.org/r/20210302075736.gj4...@dread.disaster.area [2]
> > Reported-by: Jason Gunthorpe 
> > Cc: Dave Chinner 
> > Cc: Christoph Hellwig 
> > Cc: Shiyang Ruan 
> > Cc: Vishal Verma 
> > Cc: Dave Jiang 
> > Cc: Ira Weiny 
> > Cc: Matthew Wilcox 
> > Cc: Jan Kara 
> > Cc: Andrew Morton 
> > Cc: Naoya Horiguchi 
> > Cc: "Darrick J. Wong" 
> > Signed-off-by: Dan Williams 
> > ---
> >  drivers/dax/super.c  |   15 +++
> >  drivers/nvdimm/pmem.c|   10 +-
> >  drivers/nvdimm/pmem.h|1 +
> >  include/linux/dax.h  |5 +
> >  include/linux/memremap.h |5 +
> >  include/linux/mm.h   |3 +++
> >  mm/memory-failure.c  |   11 +--
> >  mm/memremap.c|   11 +++
> >  8 files changed, 58 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index 5fa6ae9dbc8b..5ebcedf4a68c 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -624,6 +624,21 @@ void put_dax(struct dax_device *dax_dev)
> >  }
> >  EXPORT_SYMBOL_GPL(put_dax);
> >
> > +bool dax_is_idle(struct dax_device *dax_dev)
> > +{
> > + struct inode *inode;
> > +
> > + if (!dax_dev)
> > + return true;
> > +
> > + WARN_ONCE(test_bit(DAXDEV_ALIVE, &dax_dev->flags),
> > +   "dax idle check on live device.\n");
> > +
> > + inode = &dax_dev->inode;
> > + return atomic_read(&inode->i_count) < 2;
> > +}
> > +EXPORT_SYMBOL_GPL(dax_is_idle);
> > +
> >  /**
> >   * dax_get_by_host() - temporary lookup mechanism for filesystem-dax
> >   * @host: alternate name for the device registered by a dax driver
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index b8a85bfb2e95..e8822c9262ee 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c

Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

2021-03-18 Thread Dan Williams
On Wed, Mar 17, 2021 at 9:58 PM Dave Chinner  wrote:
>
> On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote:
> > Jason wondered why the get_user_pages_fast() path takes references on a
> > @pgmap object. The rationale was to protect against accessing a 'struct
> > page' that might be in the process of being removed by the driver, but
> > he rightly points out that it should be solved the same way all gup-fast
> > synchronization is solved, which is to invalidate the mapping and let the
> > gup slow path do @pgmap synchronization [1].
> >
> > To achieve that it means that new user mappings need to stop being
> > created and all existing user mappings need to be invalidated.
> >
> > For device-dax this is already the case as kill_dax() prevents future
> > faults from installing a pte, and the single device-dax inode
> > address_space can be trivially unmapped.
> >
> > The situation is different for filesystem-dax where device pages could
> > be mapped by any number of inode address_space instances. An initial
> > thought was to treat the device removal event like a drop_pagecache_sb()
> > event that walks superblocks and unmaps all inodes. However, Dave points
> > out that it is not just the filesystem user-mappings that need to react
> > to global DAX page-unmap events, it is also filesystem metadata
> > (proposed DAX metadata access), and other drivers (upstream
> > DM-writecache) that need to react to this event [2].
> >
> > The only kernel facility that is meant to globally broadcast the loss of
> > a page (via corruption or surprise remove) is memory_failure(). The
> > downside of memory_failure() is that it is a pfn-at-a-time interface.
> > However, the events that would trigger the need to call memory_failure()
> > over a full PMEM device should be rare.
>
> This is a highly suboptimal design. Filesystems only need a single
> callout to trigger a shutdown that unmaps every active mapping in
> the filesystem - we do not need a page-by-page error notification
> which results in 250 million hwpoison callouts per TB of pmem to do
> this.
>
> Indeed, the moment we get the first hwpoison from this patch, we'll
> map it to the primary XFS superblock and we'd almost certainly
> consider losing the storage behind that block to be a shut down
> trigger. During the shutdown, the filesystem should unmap all the
> active mappings (we already need to add this to shutdown on DAX
> regardless of this device remove issue) and so we really don't need
> a page-by-page notification of badness.

XFS doesn't, but what about device-mapper and other agents? Even if
the driver had a callback up the stack, memory_failure() still needs to
be able to trigger failures down the stack for CPU-consumed poison.

>
> AFAICT, it's going to take minutes, maybe hours to do the page-by-page
> iteration to hwposion every page. It's going to take a few seconds
> for the filesystem shutdown to run a device wide invalidation.
>
> SO, yeah, I think this should simply be a single ranged call to the
> filesystem like:
>
> ->memory_failure(dev, 0, -1ULL)
>
> to tell the filesystem that the entire backing device has gone away,
> and leave the filesystem to handle failure entirely at the
> filesystem level.

So I went with memory_failure() after our discussion of all the other
agents in the system that might care about these pfns going offline
and relying on memory_failure() to route down to each of those. I.e.
the "reuse the drop_pagecache_sb() model" idea was indeed
insufficient. Now I'm trying to reconcile the fact that platform
poison handling will hit memory_failure() first and may not
immediately reach the driver, if ever (see the perennially awkward
firmware-first-mode error handling: ghes_handle_memory_failure()) . So
even if the ->memory_failure(dev...) up call exists there is no
guarantee it can get called for all poison before the memory_failure()
down call happens. Which means regardless of whether
->memory_failure(dev...) exists memory_failure() needs to be able to
do the right thing.

Combine that with the fact that new buses like CXL might be configured
in "poison on decode error" mode which means that a memory_failure()
storm can happen regardless of whether the driver initiates it
programmatically.

How about a mechanism to optionally let a filesystem take over memory
failure handling for a range of pfns that the memory_failure() can
consult to fail ranges at a time rather than one by one? So a new
'struct dax_operations' op (void) (*memory_failure_register(struct
dax_device *, void *data). Where any agent that claims a dax_dev can
register to take over memory_failure() handling for any event that
happens in that range. This would 
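
Spelling that proposal out in code (a sketch of exactly the op
described above, nothing beyond it):

/* sketch: proposed dax_operations hook for claiming failure handling */
struct dax_operations {
	/* ... existing ops elided ... */
	void (*memory_failure_register)(struct dax_device *dax_dev,
					void *data);
};

memory_failure() would consult whatever agent registered for the range
before falling back to its default page-by-page handling.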

Re: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path

2021-03-18 Thread Dan Williams
On Thu, Mar 18, 2021 at 3:02 AM Joao Martins  wrote:
>
> On 3/18/21 4:08 AM, Dan Williams wrote:
> > Now that device-dax and filesystem-dax are guaranteed to unmap all user
> > mappings of devmap / DAX pages before tearing down the 'struct page'
> > array, get_user_pages_fast() can rely on its traditional synchronization
> > method "validate_pte(); get_page(); revalidate_pte()" to catch races with
> > device shutdown. Specifically the unmap guarantee ensures that gup-fast
> > either succeeds in taking a page reference (lock-less), or it detects a
> > need to fall back to the slow path where the device presence can be
> > revalidated with locks held.
>
> [...]
>
> > @@ -2087,21 +2078,26 @@ static int gup_pte_range(pmd_t pmd, unsigned long 
> > addr, unsigned long end,
> >  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
> >
> >  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && 
> > defined(CONFIG_TRANSPARENT_HUGEPAGE)
> > +
> >  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> >unsigned long end, unsigned int flags,
> >struct page **pages, int *nr)
> >  {
> >   int nr_start = *nr;
> > - struct dev_pagemap *pgmap = NULL;
> >
> >   do {
> > - struct page *page = pfn_to_page(pfn);
> > + struct page *page;
> > +
> > + /*
> > +  * Typically pfn_to_page() on a devmap pfn is not safe
> > +  * without holding a live reference on the hosting
> > +  * pgmap. In the gup-fast path it is safe because any
> > +  * races will be resolved by either gup-fast taking a
> > +  * reference or the shutdown path unmapping the pte to
> > +  * trigger gup-fast to fall back to the slow path.
> > +  */
> > + page = pfn_to_page(pfn);
> >
> > - pgmap = get_dev_pagemap(pfn, pgmap);
> > - if (unlikely(!pgmap)) {
> > - undo_dev_pagemap(nr, nr_start, flags, pages);
> > - return 0;
> > - }
> >   SetPageReferenced(page);
> >   pages[*nr] = page;
> >   if (unlikely(!try_grab_page(page, flags))) {
>
> So for allowing FOLL_LONGTERM[0] would it be OK if we used page->pgmap after
> try_grab_page() for checking pgmap type to see if we are in a device-dax
> longterm pin?
>

Yes. I still need to answer the question of whether mapping
invalidation triggers longterm pin holders to relinquish their hold,
but that's a problem regardless of whether gup-fast is supported or
not.


[PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path

2021-03-17 Thread Dan Williams
Now that device-dax and filesystem-dax are guaranteed to unmap all user
mappings of devmap / DAX pages before tearing down the 'struct page'
array, get_user_pages_fast() can rely on its traditional synchronization
method "validate_pte(); get_page(); revalidate_pte()" to catch races with
device shutdown. Specifically the unmap guarantee ensures that gup-fast
either succeeds in taking a page reference (lock-less), or it detects a
need to fall back to the slow path where the device presence can be
revalidated with locks held.

Reported-by: Jason Gunthorpe 
Cc: Christoph Hellwig 
Cc: Shiyang Ruan 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Ira Weiny 
Cc: Matthew Wilcox 
Cc: Jan Kara 
Cc: Andrew Morton 
Signed-off-by: Dan Williams 
---
 mm/gup.c |   38 --
 1 file changed, 16 insertions(+), 22 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index e40579624f10..dfeb47e4e8d4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1996,9 +1996,8 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int 
nr_start,
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 unsigned int flags, struct page **pages, int *nr)
 {
-   struct dev_pagemap *pgmap = NULL;
-   int nr_start = *nr, ret = 0;
pte_t *ptep, *ptem;
+   int ret = 0;
 
	ptem = ptep = pte_offset_map(&pmd, addr);
do {
@@ -2015,16 +2014,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
unsigned long end,
if (!pte_access_permitted(pte, flags & FOLL_WRITE))
goto pte_unmap;
 
-   if (pte_devmap(pte)) {
-   if (unlikely(flags & FOLL_LONGTERM))
-   goto pte_unmap;
+   if (pte_devmap(pte) && (flags & FOLL_LONGTERM))
+   goto pte_unmap;
 
-   pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
-   if (unlikely(!pgmap)) {
-   undo_dev_pagemap(nr, nr_start, flags, pages);
-   goto pte_unmap;
-   }
-   } else if (pte_special(pte))
+   if (pte_special(pte))
goto pte_unmap;
 
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
@@ -2063,8 +2056,6 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
unsigned long end,
ret = 1;
 
 pte_unmap:
-   if (pgmap)
-   put_dev_pagemap(pgmap);
pte_unmap(ptem);
return ret;
 }
@@ -2087,21 +2078,26 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
unsigned long end,
 #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
 
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
+
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 unsigned long end, unsigned int flags,
 struct page **pages, int *nr)
 {
int nr_start = *nr;
-   struct dev_pagemap *pgmap = NULL;
 
do {
-   struct page *page = pfn_to_page(pfn);
+   struct page *page;
+
+   /*
+* Typically pfn_to_page() on a devmap pfn is not safe
+* without holding a live reference on the hosting
+* pgmap. In the gup-fast path it is safe because any
+* races will be resolved by either gup-fast taking a
+* reference or the shutdown path unmapping the pte to
+* trigger gup-fast to fall back to the slow path.
+*/
+   page = pfn_to_page(pfn);
 
-   pgmap = get_dev_pagemap(pfn, pgmap);
-   if (unlikely(!pgmap)) {
-   undo_dev_pagemap(nr, nr_start, flags, pages);
-   return 0;
-   }
SetPageReferenced(page);
pages[*nr] = page;
if (unlikely(!try_grab_page(page, flags))) {
@@ -2112,8 +2108,6 @@ static int __gup_device_huge(unsigned long pfn, unsigned 
long addr,
pfn++;
} while (addr += PAGE_SIZE, addr != end);
 
-   if (pgmap)
-   put_dev_pagemap(pgmap);
return 1;
 }
 


[PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

2021-03-17 Thread Dan Williams
Jason wondered why the get_user_pages_fast() path takes references on a
@pgmap object. The rationale was to protect against accessing a 'struct
page' that might be in the process of being removed by the driver, but
he rightly points out that it should be solved the same way all gup-fast
synchronization is solved, which is to invalidate the mapping and let the
gup slow path do @pgmap synchronization [1].

To achieve that it means that new user mappings need to stop being
created and all existing user mappings need to be invalidated.

For device-dax this is already the case as kill_dax() prevents future
faults from installing a pte, and the single device-dax inode
address_space can be trivially unmapped.

The situation is different for filesystem-dax where device pages could
be mapped by any number of inode address_space instances. An initial
thought was to treat the device removal event like a drop_pagecache_sb()
event that walks superblocks and unmaps all inodes. However, Dave points
out that it is not just the filesystem user-mappings that need to react
to global DAX page-unmap events, it is also filesystem metadata
(proposed DAX metadata access), and other drivers (upstream
DM-writecache) that need to react to this event [2].

The only kernel facility that is meant to globally broadcast the loss of
a page (via corruption or surprise remove) is memory_failure(). The
downside of memory_failure() is that it is a pfn-at-a-time interface.
However, the events that would trigger the need to call memory_failure()
over a full PMEM device should be rare. Remove should always be
coordinated by the administrator with the filesystem. If someone force
removes a device from underneath a mounted filesystem the driver assumes
they have a good reason, or otherwise get to keep the pieces. Since
->remove() callbacks can not fail the only option is to trigger the mass
memory_failure().

The mechanism to determine whether memory_failure() triggers at
pmem->remove() time is whether the associated dax_device has an elevated
reference at @pgmap ->kill() time.

With this in place the get_user_pages_fast() path can drop its
half-measure synchronization with an @pgmap reference.

Link: http://lore.kernel.org/r/20210224010017.gq2643...@ziepe.ca [1]
Link: http://lore.kernel.org/r/20210302075736.gj4...@dread.disaster.area [2]
Reported-by: Jason Gunthorpe 
Cc: Dave Chinner 
Cc: Christoph Hellwig 
Cc: Shiyang Ruan 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Ira Weiny 
Cc: Matthew Wilcox 
Cc: Jan Kara 
Cc: Andrew Morton 
Cc: Naoya Horiguchi 
Cc: "Darrick J. Wong" 
Signed-off-by: Dan Williams 
---
 drivers/dax/super.c  |   15 +++
 drivers/nvdimm/pmem.c|   10 +-
 drivers/nvdimm/pmem.h|1 +
 include/linux/dax.h  |5 +
 include/linux/memremap.h |5 +
 include/linux/mm.h   |3 +++
 mm/memory-failure.c  |   11 +--
 mm/memremap.c|   11 +++
 8 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 5fa6ae9dbc8b..5ebcedf4a68c 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -624,6 +624,21 @@ void put_dax(struct dax_device *dax_dev)
 }
 EXPORT_SYMBOL_GPL(put_dax);
 
+bool dax_is_idle(struct dax_device *dax_dev)
+{
+   struct inode *inode;
+
+   if (!dax_dev)
+   return true;
+
+   WARN_ONCE(test_bit(DAXDEV_ALIVE, &dax_dev->flags),
+ "dax idle check on live device.\n");
+
+   inode = &dax_dev->inode;
+   return atomic_read(&inode->i_count) < 2;
+}
+EXPORT_SYMBOL_GPL(dax_is_idle);
+
 /**
  * dax_get_by_host() - temporary lookup mechanism for filesystem-dax
  * @host: alternate name for the device registered by a dax driver
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index b8a85bfb2e95..e8822c9262ee 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -348,15 +348,21 @@ static void pmem_pagemap_kill(struct dev_pagemap *pgmap)
 {
struct request_queue *q =
container_of(pgmap->ref, struct request_queue, q_usage_counter);
+   struct pmem_device *pmem = q->queuedata;
 
blk_freeze_queue_start(q);
+   kill_dax(pmem->dax_dev);
+   if (!dax_is_idle(pmem->dax_dev)) {
+   dev_warn(pmem->dev,
+"DAX active at remove, trigger mass memory failure\n");
+   dev_pagemap_failure(pgmap);
+   }
 }
 
 static void pmem_release_disk(void *__pmem)
 {
struct pmem_device *pmem = __pmem;
 
-   kill_dax(pmem->dax_dev);
put_dax(pmem->dax_dev);
del_gendisk(pmem->disk);
put_disk(pmem->disk);
@@ -406,6 +412,7 @@ static int pmem_attach_disk(struct device *dev,
devm_namespace_disable(dev, ndns);
 
dev_set_drvdata(dev, pmem);
+   pmem->dev = dev;
pmem->phys_addr = res->start;
pmem->size = resource_size(res);
fua = nvdim

[PATCH 1/3] mm/memory-failure: Prepare for mass memory_failure()

2021-03-17 Thread Dan Williams
Currently memory_failure() assumes an infrequent report on a handful of
pages. A new use case for surprise removal of a persistent memory device
needs to trigger memory_failure() on a large range. Rate limit
memory_failure() error logging, and allow the
memory_failure_dev_pagemap() helper to be called directly.

Cc: Naoya Horiguchi 
Cc: Andrew Morton 
Signed-off-by: Dan Williams 
---
 mm/memory-failure.c |   25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..43ba4307c526 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -395,8 +395,9 @@ static void kill_procs(struct list_head *to_kill, int 
forcekill, bool fail,
 * signal and then access the memory. Just kill it.
 */
if (fail || tk->addr == -EFAULT) {
-   pr_err("Memory failure: %#lx: forcibly killing 
%s:%d because of failure to unmap corrupted page\n",
-  pfn, tk->tsk->comm, tk->tsk->pid);
+   pr_err_ratelimited(
+   "Memory failure: %#lx: forcibly killing 
%s:%d because of failure to unmap corrupted page\n",
+   pfn, tk->tsk->comm, tk->tsk->pid);
do_send_sig_info(SIGKILL, SEND_SIG_PRIV,
 tk->tsk, PIDTYPE_PID);
}
@@ -408,8 +409,9 @@ static void kill_procs(struct list_head *to_kill, int 
forcekill, bool fail,
 * process anyways.
 */
else if (kill_proc(tk, pfn, flags) < 0)
-   pr_err("Memory failure: %#lx: Cannot send 
advisory machine check signal to %s:%d\n",
-  pfn, tk->tsk->comm, tk->tsk->pid);
+   pr_err_ratelimited(
+   "Memory failure: %#lx: Cannot send 
advisory machine check signal to %s:%d\n",
+   pfn, tk->tsk->comm, tk->tsk->pid);
}
put_task_struct(tk->tsk);
kfree(tk);
@@ -919,8 +921,8 @@ static void action_result(unsigned long pfn, enum 
mf_action_page_type type,
 {
trace_memory_failure_event(pfn, type, result);
 
-   pr_err("Memory failure: %#lx: recovery action for %s: %s\n",
-   pfn, action_page_types[type], action_name[result]);
+   pr_err_ratelimited("Memory failure: %#lx: recovery action for %s: %s\n",
+  pfn, action_page_types[type], action_name[result]);
 }
 
 static int page_action(struct page_state *ps, struct page *p,
@@ -1375,8 +1377,6 @@ static int memory_failure_dev_pagemap(unsigned long pfn, 
int flags,
 unlock:
dax_unlock_page(page, cookie);
 out:
-   /* drop pgmap ref acquired in caller */
-   put_dev_pagemap(pgmap);
action_result(pfn, MF_MSG_DAX, rc ? MF_FAILED : MF_RECOVERED);
return rc;
 }
@@ -1415,9 +1415,12 @@ int memory_failure(unsigned long pfn, int flags)
if (!p) {
if (pfn_valid(pfn)) {
pgmap = get_dev_pagemap(pfn, NULL);
-   if (pgmap)
-   return memory_failure_dev_pagemap(pfn, flags,
- pgmap);
+   if (pgmap) {
+   res = memory_failure_dev_pagemap(pfn, flags,
+pgmap);
+   put_dev_pagemap(pgmap);
+   return res;
+   }
}
pr_err("Memory failure: %#lx: memory outside kernel control\n",
pfn);


[PATCH 0/3] mm, pmem: Force unmap pmem on surprise remove

2021-03-17 Thread Dan Williams
Summary:

A dax_dev can be unbound from its driver at any time. Unbind can not
fail. The driver-core will always trigger ->remove() and the result from
->remove() is ignored. After ->remove() the driver-core proceeds to tear
down context. The filesystem-dax implementation can leave pfns mapped
after ->remove() if it is triggered while the filesystem is mounted.
Security and data-integrity is forfeit if the dax_dev is repurposed for
another security domain (new filesystem or change device modes), or if
the dax_dev is physically replaced. CXL is a hotplug bus that makes
physical replacement of a dax_dev a real-world prospect.

All dax_dev pfns must be unmapped at remove. Detect the "remove while
mounted" case and trigger memory_failure() over the entire dax_dev
range.

Details:

The get_user_pages_fast() path expects all synchronization to be handled
by the pattern of checking for pte presence, taking a page reference,
and then validating that the pte was stable over that event. The
gup-fast path for devmap / DAX pages additionally attempts to take/hold
a live reference against the hosting pgmap over the page pin. The
rationale for the pgmap reference is to synchronize against a dax-device
unbind / ->remove() event, but that is unnecessary if pte invalidation
is guaranteed in the ->remove() path.
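
In outline, the traditional pattern is (a sketch of the idea, not the
exact mm/gup.c code):

/* sketch: lock-less gup-fast synchronization */
pte = READ_ONCE(*ptep);			/* validate_pte() */
if (!pte_present(pte))
	goto slow_path;
page = pte_page(pte);
if (unlikely(!try_get_page(page)))	/* get_page() */
	goto slow_path;
if (pte_val(READ_ONCE(*ptep)) != pte_val(pte)) {
	put_page(page);			/* revalidate_pte() failed */
	goto slow_path;			/* raced with unmap / remove */
}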

Global dax-device pte invalidation *does* happen when the device is in
raw "device-dax" mode where there is a single shared inode to unmap at
remove, but the filesystem-dax path has a large number of actively
mapped inodes unknown to the driver at ->remove() time. So, that unmap
does not happen today for filesystem-dax. However, as Jason points out,
that unmap / invalidation *needs* to happen not only to cleanup
get_user_pages_fast() semantics, but in a future (see CXL) where dax_dev
->remove() is correlated with actual physical removal / replacement the
implications of allowing a physical pfn to be exchanged without tearing
down old mappings are severe (security and data-integrity).

What is not in this patch set is coordination with the dax_kmem driver
to trigger memory_failure() when the dax_dev is onlined as "System
RAM". The remove_memory() API was built with the assumption that
platform firmware negotiates all removal requests and the OS has a
chance to say "no". This is why, on a manual unbind event, dax_kmem today
simply leaks the request_region() to burn that physical address space for
any other usage until the next reboot if the memory can't be offlined.
However, a future where remove_memory() is guaranteed to succeed
after memory_failure() of the same range seems a better semantic than
permanently burning physical address space.

The topic of remove_memory() failures gets to the question of what
happens to active page references when the inopportune ->remove() event
happens. For transient pins the ->remove() event will wait for all
pins to be dropped before allowing ->remove() to complete. Since
filesystem-dax forbids longterm pins, all those pins are transient.
Device-dax, on the other hand, does allow longterm pins which means that
->remove() will hang unless / until the longterm pin is dropped.
Hopefully an unmap_mapping_range() event is sufficient to get the pin
dropped, but I suspect device-dax might need to trigger memory_failure()
as well to get the longterm pin holder to wake up and get out of the
way (TBD).

Lest we repeat the "longterm-pin-revoke" debate, which highlighted that
RDMA devices do not respond well to having context torn down, keep in
mind that this proposal is to do a best effort recovery of an event that
should not happen (surprise removal) under nominal operation.

---

Dan Williams (3):
  mm/memory-failure: Prepare for mass memory_failure()
  mm, dax, pmem: Introduce dev_pagemap_failure()
  mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path


 drivers/dax/super.c  |   15 +++
 drivers/nvdimm/pmem.c|   10 +-
 drivers/nvdimm/pmem.h|1 +
 include/linux/dax.h  |5 +
 include/linux/memremap.h |5 +
 include/linux/mm.h   |3 +++
 mm/gup.c |   38 --
 mm/memory-failure.c  |   36 +++-
 mm/memremap.c|   11 +++
 9 files changed, 88 insertions(+), 36 deletions(-)


Re: [RFC 0/2] virtio-pmem: Asynchronous flush

2021-03-11 Thread Dan Williams
On Thu, Mar 11, 2021 at 8:21 PM Pankaj Gupta
 wrote:
>
> Hi David,
>
> > >   Jeff reported preflush order issue with the existing implementation
> > >   of virtio pmem preflush. Dan suggested[1] to implement asynchronous 
> > > flush
> > >   for virtio pmem using work queue as done in md/RAID. This patch series
> > >   intends to solve the preflush ordering issue and also makes the flush
> > >   asynchronous from the submitting thread POV.
> > >
> > >   Submitting this patch series for feeback and is in WIP. I have
> > >   done basic testing and currently doing more testing.
> > >
> > > Pankaj Gupta (2):
> > >pmem: make nvdimm_flush asynchronous
> > >virtio_pmem: Async virtio-pmem flush
> > >
> > >   drivers/nvdimm/nd_virtio.c   | 66 ++--
> > >   drivers/nvdimm/pmem.c| 15 
> > >   drivers/nvdimm/region_devs.c |  3 +-
> > >   drivers/nvdimm/virtio_pmem.c |  9 +
> > >   drivers/nvdimm/virtio_pmem.h | 12 +++
> > >   5 files changed, 78 insertions(+), 27 deletions(-)
> > >
> > > [1] https://marc.info/?l=linux-kernel=157446316409937=2
> > >
> >
> > Just wondering, was there any follow up of this or are we still waiting
> > for feedback? :)
>
> Thank you for bringing this up.
>
> My apologies, I could not follow up on this. I have another version in my local
> tree but could not post it as I was not sure if I solved the problem
> correctly. I will
> clean it up and post for feedback as soon as I can.
>
> P.S: Due to serious personal/family health issues I am not able to
> devote much time
> on this with other professional commitments. I feel bad that I have
> this unfinished task.
> Just in the last year things have not been stable for me & my family
> and are still not getting better :(

No worries Pankaj. Take care of yourself and your family. The
community can handle this for you. I'm open to coaching somebody
through what's involved to get this fix landed.
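
For anyone picking this up, the md/RAID-style shape Dan pointed at is
roughly the following (a sketch under assumed names, not Pankaj's
actual patches):

#include <linux/workqueue.h>
#include <linux/bio.h>
#include <linux/slab.h>

/* sketch: complete REQ_PREFLUSH off the submitting thread */
struct async_flush_req {
	struct work_struct work;
	struct nd_region *nd_region;
	struct bio *bio;
};

static void async_flush_fn(struct work_struct *work)
{
	struct async_flush_req *req =
		container_of(work, struct async_flush_req, work);

	/* nvdimm_flush() here is the existing synchronous entry point */
	if (nvdimm_flush(req->nd_region, req->bio) < 0)
		req->bio->bi_status = BLK_STS_IOERR;
	bio_endio(req->bio);
	kfree(req);
}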


Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages

2021-03-11 Thread Dan Williams
On Wed, Mar 10, 2021 at 10:13 AM Joao Martins  wrote:
>
> On 2/22/21 8:37 PM, Dan Williams wrote:
> > On Mon, Feb 22, 2021 at 3:24 AM Joao Martins  
> > wrote:
> >> On 2/20/21 1:43 AM, Dan Williams wrote:
> >>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard  wrote:
> >>>> On 12/8/20 9:28 AM, Joao Martins wrote:
> >> One thing to point out about altmap is that the degradation (in pinning and
> >> unpinning) we observed with struct pages in device memory is no longer 
> >> observed
> >> once 1) we batch ref count updates as we move to compound pages 2) reusing
> >> tail pages seems to lead to these struct pages staying more likely in cache
> >> which perhaps contributes to dirtying a lot less cachelines.
> >
> > True, it makes it more palatable to survive 'struct page' in PMEM,
>
> I want to retract for now what I said above wrt the no-degradation-with-
> struct-page-in-device comment. I was fooled by a bug in a patch later in
> this series, in particular because I accidentally cleared PGMAP_ALTMAP_VALID
> when unilaterally setting PGMAP_COMPOUND, which consequently led to always
> allocating struct pages from memory. No wonder the numbers were just as fast.
> I am still confident that it's going to be faster and show less degradation
> in pinning/init. Init for now is worst-case 2x faster. But whether it is
> *as fast* as struct pages in memory is still too early to say.
>
> The broken masking of the PGMAP_ALTMAP_VALID bit did hide one flaw, where
> we don't support altmap for basepages on x86/mm and it apparently depends
> on architectures to implement it (and a couple other issues). The vmemmap
> allocation isn't the problem, so the previous comment in this thread that
> altmap doesn't change much in the vmemmap_populate_compound_pages() is
> still accurate.
>
> The problem though resides on the freeing of vmemmap pagetables with
> basepages *with altmap* (e.g. at dax-device teardown) which require arch
> support. Doing it properly would mean making the altmap reserve smaller
> (given fewer pages are allocated), and the ability for the altmap pfn
> allocator to track references per pfn. But I think it deserves its own
> separate patch series (probably almost just as big).
>
> Perhaps for this set I can stick without altmap as you suggested, and
> use hugepage vmemmap population (which wouldn't
> lead to device memory savings) instead of reusing base pages. I would
> still leave the compound page support logic as metadata representation
> for > 4K @align, as I think that's the right thing to do. And then
> a separate series onto improving altmap to leverage the metadata reduction
> support as done with non-device struct pages.
>
> Thoughts?

The space savings is the whole point. So I agree with moving altmap
support to a follow-on enhancement, but land the non-altmap basepage
support in the first round.


Re: [PATCH v2 00/10] fsdax,xfs: Add reflink support for fsdax

2021-03-10 Thread Dan Williams
On Wed, Mar 10, 2021 at 6:27 AM Matthew Wilcox  wrote:
>
> On Wed, Mar 10, 2021 at 08:21:59AM -0600, Goldwyn Rodrigues wrote:
> > On 13:02 10/03, Matthew Wilcox wrote:
> > > On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> > > > Forgive my ignorance, but is there a reason why this isn't wired up to
> > > > Btrfs at the same time? It seems weird to me that adding a feature
> > >
> > > btrfs doesn't support DAX.  only ext2, ext4, XFS and FUSE have DAX 
> > > support.
> > >
> > > If you think about it, btrfs and DAX are diametrically opposite things.
> > > DAX is about giving raw access to the hardware.  btrfs is about offering
> > > extra value (RAID, checksums, ...), none of which can be done if the
> > > filesystem isn't in the read/write path.
> > >
> > > That's why there's no DAX support in btrfs.  If you want DAX, you have
> > > to give up all the features you like in btrfs.  So you may as well use
> > > a different filesystem.
> >
> > DAX on btrfs has been attempted[1]. Of course, we could not
>
> But why?  A completeness fetish?  I don't understand why you decided
> to do this work.

Isn't DAX useful for pagecache minimization on read even if it is
awkward for a copy-on-write fs?

Seems it would be a useful case to have COW'd VM images on BTRFS that
don't need superfluous page cache allocations.


Re: [PATCH v2] libnvdimm: Notify disk drivers to revalidate region read-only

2021-03-10 Thread Dan Williams
On Tue, Mar 9, 2021 at 10:54 PM Christoph Hellwig  wrote:
>
> Looks good to me:
>
> Reviewed-by: Christoph Hellwig 
>
> Question on the pre-existing code: given that nvdimm_check_and_set_ro is
> the only caller of set_disk_ro for nvdimm devices, we'll also get
> the message when initially setting up any read-only disk.  Is that
> intentional?

Yeah, that's intentional. There's no other notification that userspace
would be looking for by default besides the kernel log, and the block
device name is more meaningful than the region name, or the nvdimm
device status for that matter.


[ndctl PATCH] test/libndctl: Use ndctl_region_set_ro() to change disk read-only state

2021-03-09 Thread Dan Williams
Kernel commit 52f019d43c22 ("block: add a hard-readonly flag to struct
gendisk") broke the read-only management test, by fixing the broken
behavior that BLKROSET could make a block device read-write even when the
disk is read-only. The fix (see Link:) propagates changes of the region
read-only state to the underlying disk. Add ndctl_region_set_ro() ahead of
BLKROSET so that BLKROSET does not leave the block_device state in
conflict with the disk state.

Link: 
http://lore.kernel.org/r/161534060720.528671.2341213328968989192.st...@dwillia2-desk3.amr.corp.intel.com
Reported-by: kernel test robot 
Reported-by: Vishal Verma 
Signed-off-by: Dan Williams 
---
 test/libndctl.c |   18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/test/libndctl.c b/test/libndctl.c
index faa308ade8fc..391b94086dae 100644
--- a/test/libndctl.c
+++ b/test/libndctl.c
@@ -1541,6 +1541,7 @@ static int validate_bdev(const char *devname, struct 
ndctl_btt *btt,
struct ndctl_pfn *pfn, struct ndctl_namespace *ndns,
struct namespace *namespace, void *buf)
 {
+   struct ndctl_region *region = ndctl_namespace_get_region(ndns);
char bdevpath[50];
int fd, rc, ro;
 
@@ -1578,6 +1579,13 @@ static int validate_bdev(const char *devname, struct 
ndctl_btt *btt,
}
 
ro = 0;
+   rc = ndctl_region_set_ro(region, ro);
+   if (rc < 0) {
+   fprintf(stderr, "%s: ndctl_region_set_ro failed\n", devname);
+   rc = -errno;
+   goto out;
+   }
+
	rc = ioctl(fd, BLKROSET, &ro);
if (rc < 0) {
fprintf(stderr, "%s: BLKROSET failed\n",
@@ -1605,8 +1613,16 @@ static int validate_bdev(const char *devname, struct 
ndctl_btt *btt,
rc = -ENXIO;
goto out;
}
+
+   rc = ndctl_region_set_ro(region, namespace->ro);
+   if (rc < 0) {
+   fprintf(stderr, "%s: ndctl_region_set_ro reset failed\n", 
devname);
+   rc = -errno;
+   goto out;
+   }
+
rc = 0;
- out:
+out:
close(fd);
return rc;
 }


[PATCH v2] libnvdimm: Notify disk drivers to revalidate region read-only

2021-03-09 Thread Dan Williams
Previous kernels allowed the BLKROSET to override the disk's read-only
status. With that situation fixed the pmem driver needs to rely on
notification events to reevaluate the disk read-only status after the
host region has been marked read-write.

Recall that when libnvdimm determines that the persistent memory has
lost persistence (for example lack of energy to flush from DRAM to FLASH
on an NVDIMM-N device) it marks the region read-only, but that state can
be overridden by the user via:

   echo 0 > /sys/bus/nd/devices/regionX/read_only

...to date there is no notification that the region has restored
persistence, so the user override is the only recovery.

Fixes: 52f019d43c22 ("block: add a hard-readonly flag to struct gendisk")
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Martin K. Petersen 
Cc: Hannes Reinecke 
Cc: Jens Axboe 
Reported-by: kernel test robot 
Reported-by: Vishal Verma 
Signed-off-by: Dan Williams 
---
Changes since v1 [1]:
- Move from the sinking ship of revalidate_disk() to the local hotness
  of nd_pmem_notify() (hch).

[1]: 
http://lore.kernel.org/r/161527286194.446794.5215036039655765042.st...@dwillia2-desk3.amr.corp.intel.com

 drivers/nvdimm/bus.c |   14 ++
 drivers/nvdimm/pmem.c|   37 +
 drivers/nvdimm/region_devs.c |7 +++
 include/linux/nd.h   |1 +
 4 files changed, 47 insertions(+), 12 deletions(-)

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 48f0985ca8a0..3a777d0073b7 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -631,16 +631,14 @@ void nvdimm_check_and_set_ro(struct gendisk *disk)
struct nd_region *nd_region = to_nd_region(dev->parent);
int disk_ro = get_disk_ro(disk);
 
-   /*
-* Upgrade to read-only if the region is read-only preserve as
-* read-only if the disk is already read-only.
-*/
-   if (disk_ro || nd_region->ro == disk_ro)
+   /* catch the disk up with the region ro state */
+   if (disk_ro == nd_region->ro)
return;
 
-   dev_info(dev, "%s read-only, marking %s read-only\n",
-   dev_name(&nd_region->dev), disk->disk_name);
-   set_disk_ro(disk, 1);
+   dev_info(dev, "%s read-%s, marking %s read-%s\n",
+dev_name(&nd_region->dev), nd_region->ro ? "only" : "write",
+disk->disk_name, nd_region->ro ? "only" : "write");
+   set_disk_ro(disk, nd_region->ro);
 }
 EXPORT_SYMBOL(nvdimm_check_and_set_ro);
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index b8a85bfb2e95..7daac795db39 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include "pmem.h"
+#include "btt.h"
 #include "pfn.h"
 #include "nd.h"
 
@@ -585,7 +586,7 @@ static void nd_pmem_shutdown(struct device *dev)
nvdimm_flush(to_nd_region(dev->parent), NULL);
 }
 
-static void nd_pmem_notify(struct device *dev, enum nvdimm_event event)
+static void pmem_revalidate_poison(struct device *dev)
 {
struct nd_region *nd_region;
resource_size_t offset = 0, end_trunc = 0;
@@ -595,9 +596,6 @@ static void nd_pmem_notify(struct device *dev, enum 
nvdimm_event event)
struct range range;
struct kernfs_node *bb_state;
 
-   if (event != NVDIMM_REVALIDATE_POISON)
-   return;
-
if (is_nd_btt(dev)) {
struct nd_btt *nd_btt = to_nd_btt(dev);
 
@@ -635,6 +633,37 @@ static void nd_pmem_notify(struct device *dev, enum 
nvdimm_event event)
sysfs_notify_dirent(bb_state);
 }
 
+static void pmem_revalidate_region(struct device *dev)
+{
+   struct pmem_device *pmem;
+
+   if (is_nd_btt(dev)) {
+   struct nd_btt *nd_btt = to_nd_btt(dev);
+   struct btt *btt = nd_btt->btt;
+
+   nvdimm_check_and_set_ro(btt->btt_disk);
+   return;
+   }
+
+   pmem = dev_get_drvdata(dev);
+   nvdimm_check_and_set_ro(pmem->disk);
+}
+
+static void nd_pmem_notify(struct device *dev, enum nvdimm_event event)
+{
+   switch (event) {
+   case NVDIMM_REVALIDATE_POISON:
+   pmem_revalidate_poison(dev);
+   break;
+   case NVDIMM_REVALIDATE_REGION:
+   pmem_revalidate_region(dev);
+   break;
+   default:
+   dev_WARN_ONCE(dev, 1, "notify: unknown event: %d\n", event);
+   break;
+   }
+}
+
 MODULE_ALIAS("pmem");
 MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_IO);
 MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_PMEM);
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index ef23119db574..51870eb51da6 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -518,6 +518,1

Re: [PATCH v2] include: Remove pagemap.h from blkdev.h

2021-03-09 Thread Dan Williams
On Tue, Mar 9, 2021 at 11:59 AM Matthew Wilcox (Oracle)
 wrote:
>
> My UEK-derived config has 1030 files depending on pagemap.h before
> this change.  Afterwards, just 326 files need to be rebuilt when I
> touch pagemap.h.  I think blkdev.h is probably included too widely,
> but untangling that dependency is harder and this solves my problem.
> x86 allmodconfig builds, but there may be implicit include problems
> on other architectures.
>
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
> v2: Fix CONFIG_SWAP=n implicit use of pagemap.h by swap.h.  Increases
> the number of files from 240, but that's still a big win -- 68%
> reduction instead of 77%.
>
[..]
>  drivers/nvdimm/btt.c  | 1 +
>  drivers/nvdimm/pmem.c | 1 +

For the nvdimm bits:

Acked-by: Dan Williams 


Re: [PATCH] libnvdimm: Let revalidate_disk() revalidate region read-only

2021-03-09 Thread Dan Williams
On Mon, Mar 8, 2021 at 11:31 PM Christoph Hellwig  wrote:
>
> On Mon, Mar 08, 2021 at 10:54:22PM -0800, Dan Williams wrote:
> > Previous kernels allowed the BLKROSET to override the disk's read-only
> > status. With that situation fixed the pmem driver needs to rely on
> > revalidate_disk() to clear the disk read-only status after the host
> > region has been marked read-write.
> >
> > Recall that when libnvdimm determines that the persistent memory has
> > lost persistence (for example lack of energy to flush from DRAM to FLASH
> > on an NVDIMM-N device) it marks the region read-only, but that state can
> > be overridden by the user via:
> >
> >echo 0 > /sys/bus/nd/devices/regionX/read_only
> >
> > ...to date there is no notification that the region has restored
> > persistence, so the user override is the only recovery.
>
> I've just resent my series to kill off ->revalidate_disk for good, so this
> obviously makes me a little unhappy.  Given that ->revalidate_disk
> only ends up being called from the same path that ->open is called,
> why can't you just hook this up from the open method?
>
> Also any reason the sysfs attribute can't just directly propagate the
> information to the disk?

I should have assumed that revalidate_disk() was on the chopping
block. Let me take a look at just propagating from the region update
down to all affected disks. There's already a notification path for
regions to communicate badblocks, should be straightforward to reuse
for read-only updates.
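
To make "reuse the region notification path" concrete, here is a rough
sketch of the shape I have in mind -- the hook placement and the store
helper are assumptions for illustration, not a final patch:

/*
 * Sketch: when userspace rewrites the region read_only attribute, walk
 * the region's child devices and fire the same notification machinery
 * used today for badblocks, so pmem/btt disks catch up with the new ro
 * state.
 */
static int revalidate_read_only(struct device *dev, void *data)
{
    nd_device_notify(dev, NVDIMM_REVALIDATE_REGION);
    return 0;
}

static ssize_t read_only_store(struct device *dev,
        struct device_attribute *attr, const char *buf, size_t len)
{
    bool ro;
    int rc = kstrtobool(buf, &ro);
    struct nd_region *nd_region = to_nd_region(dev);

    if (rc)
        return rc;

    nd_region->ro = ro;
    device_for_each_child(dev, NULL, revalidate_read_only);
    return len;
}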


[PATCH] libnvdimm: Let revalidate_disk() revalidate region read-only

2021-03-08 Thread Dan Williams
Previous kernels allowed the BLKROSET to override the disk's read-only
status. With that situation fixed the pmem driver needs to rely on
revalidate_disk() to clear the disk read-only status after the host
region has been marked read-write.

Recall that when libnvdimm determines that the persistent memory has
lost persistence (for example lack of energy to flush from DRAM to FLASH
on an NVDIMM-N device) it marks the region read-only, but that state can
be overridden by the user via:

   echo 0 > /sys/bus/nd/devices/regionX/read_only

...to date there is no notification that the region has restored
persistence, so the user override is the only recovery.

Fixes: 52f019d43c22 ("block: add a hard-readonly flag to struct gendisk")
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Martin K. Petersen 
Cc: Hannes Reinecke 
Cc: Jens Axboe 
Reported-by: kernel test robot 
Reported-by: Vishal Verma 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/btt.c  |7 +++
 drivers/nvdimm/bus.c  |   14 ++
 drivers/nvdimm/pmem.c |7 +++
 3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 41aa1f01fc07..73d3bf5aa208 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1508,11 +1508,18 @@ static int btt_getgeo(struct block_device *bd, struct hd_geometry *geo)
return 0;
 }
 
+static int btt_revalidate(struct gendisk *disk)
+{
+   nvdimm_check_and_set_ro(disk);
+   return 0;
+}
+
 static const struct block_device_operations btt_fops = {
.owner =THIS_MODULE,
.submit_bio =   btt_submit_bio,
.rw_page =  btt_rw_page,
.getgeo =   btt_getgeo,
+   .revalidate_disk =  btt_revalidate,
 };
 
 static int btt_blk_init(struct btt *btt)
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 48f0985ca8a0..3a777d0073b7 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -631,16 +631,14 @@ void nvdimm_check_and_set_ro(struct gendisk *disk)
struct nd_region *nd_region = to_nd_region(dev->parent);
int disk_ro = get_disk_ro(disk);
 
-   /*
-* Upgrade to read-only if the region is read-only preserve as
-* read-only if the disk is already read-only.
-*/
-   if (disk_ro || nd_region->ro == disk_ro)
+   /* catch the disk up with the region ro state */
+   if (disk_ro == nd_region->ro)
return;
 
-   dev_info(dev, "%s read-only, marking %s read-only\n",
-   dev_name(&nd_region->dev), disk->disk_name);
-   set_disk_ro(disk, 1);
+   dev_info(dev, "%s read-%s, marking %s read-%s\n",
+dev_name(&nd_region->dev), nd_region->ro ? "only" : "write",
+disk->disk_name, nd_region->ro ? "only" : "write");
+   set_disk_ro(disk, nd_region->ro);
 }
 EXPORT_SYMBOL(nvdimm_check_and_set_ro);
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index b8a85bfb2e95..af204fce1b1c 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -276,10 +276,17 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
 }
 
+static int pmem_revalidate(struct gendisk *disk)
+{
+   nvdimm_check_and_set_ro(disk);
+   return 0;
+}
+
 static const struct block_device_operations pmem_fops = {
.owner =THIS_MODULE,
.submit_bio =   pmem_submit_bio,
.rw_page =  pmem_rw_page,
+   .revalidate_disk =  pmem_revalidate,
 };
 
 static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,


Re: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()

2021-03-08 Thread Dan Williams
On Mon, Mar 8, 2021 at 3:34 AM ruansy.f...@fujitsu.com
 wrote:
> > > > >  1 file changed, 8 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > > > > index 79c49e7f5c30..0bcf2b1e20bd 100644
> > > > > --- a/include/linux/memremap.h
> > > > > +++ b/include/linux/memremap.h
> > > > > @@ -87,6 +87,14 @@ struct dev_pagemap_ops {
> > > > >  * the page back to a CPU accessible page.
> > > > >  */
> > > > > vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> > > > > +
> > > > > +   /*
> > > > > +* Handle the memory failure happens on one page.  Notify the 
> > > > > processes
> > > > > +* who are using this page, and try to recover the data on 
> > > > > this page
> > > > > +* if necessary.
> > > > > +*/
> > > > > +   int (*memory_failure)(struct dev_pagemap *pgmap, unsigned 
> > > > > long pfn,
> > > > > + int flags);
> > > > >  };
> > > >
> > > > After the conversation with Dave I don't see the point of this. If
> > > > there is a memory_failure() on a page, why not just call
> > > > memory_failure()? That already knows how to find the inode and the
> > > > filesystem can be notified from there.
> > >
> > > We want memory_failure() to support reflinked files.  In this case, we
> > > are not able to track multiple files from a page (this broken page)
> > > because page->mapping, page->index can only track one file.  Thus, I
> > > introduce this ->memory_failure(), implemented in the pmem driver, to
> > > call ->corrupted_range() level by level upward, and finally find out
> > > the files which are using (mmapping) this page.
> > >
> >
> > I know the motivation, but this implementation seems backwards. It's
> > already the case that memory_failure() looks up the address_space
> > associated with a mapping. From there I would expect a new 'struct
> > address_space_operations' op to let the fs handle the case when there
> > are multiple address_spaces associated with a given file.
> >
>
> Let me think about it.  In this way, we
> 1. associate file mapping with dax page in dax page fault;

I think this needs to be a new type of association that proxies the
representation of the reflink across all involved address_spaces.

> 2. iterate the reflinked files to notify them with the `kill processes`
>   signal via the new address_space_operation;
> 3. re-associate to another reflinked file mapping when unmapping
> (rmap query in the filesystem to get another file).

Perhaps the proxy object is reference counted per-ref-link. It seems
error prone to keep changing the association of the pfn while the
reflink is intact.
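
To sketch what such a proxy might look like -- purely speculative,
every name below is invented here for illustration:

#include <linux/kref.h>
#include <linux/fs.h>

/*
 * Hypothetical proxy that stands in for page->mapping while a dax page
 * backs a reflinked extent: one reference per reflink, one
 * (mapping, index) association per sharing file, so memory-failure can
 * fan out to every affected address_space.
 */
struct dax_reflink_proxy {
    struct kref ref;            /* one reference per reflink */
    int nr_assoc;
    struct {
        struct address_space *mapping;
        pgoff_t index;
    } assoc[];                  /* one entry per file sharing the extent */
};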

> It did not handle those dax pages that are not in use, because their
> ->mapping is not associated with any file.  I didn't think it through
> until reading your conversation.  Here is my understanding: this case
> should be handled by the badblock mechanism in the pmem driver.  This
> badblock mechanism will call ->corrupted_range() to tell the filesystem
> to repair the data if possible.

There are 2 types of notifications. There are badblocks discovered by
the driver (see notify_pmem()) and there are memory_failures()
signalled by the CPU machine-check handler, or the platform BIOS. In
the badblocks case, the information needs to be considered by the
fs block allocator to avoid / try-to-repair badblocks on allocate, and
to allow listing damaged files that need repair. The memory_failure()
notification needs immediate handling to tear down mappings to that
pfn and signal processes that have consumed it with
SIGBUS-action-required. Processes that have the poison mapped, but
have not consumed it receive SIGBUS-action-optional.

> So, we split it into two parts.  And the dax device and block device
> won't be mixed up again.  Is my understanding right?

Right, it's only the filesystem that knows that the block_device and
the dax_device alias data at the same logical offset. The requirements
for sector error handling and page error handling are separate like
block_device_operations and dax_operations.

> But the solution above is to solve the hwpoison on one or a couple of
> pages, which happens rarely (I think).  Does the 'pmem remove' operation
> cause hwpoison too?  Call memory_failure() so many times?  I haven't
> understood this yet.

I'm working on a patch here to call memory_failure() on a wide range
for the surprise remove of a dax_device while a filesystem might be
mounted. It won't be efficient, but there is no other way to notify
the kernel that it needs to immediately stop referencing a page.
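
A minimal sketch of that wide-range notification, assuming the
single-range dev_pagemap layout -- the helper name is made up and the
real patch may differ:

/*
 * Sketch: on surprise removal, feed every pfn the dev_pagemap covers
 * to the existing memory-failure machinery so all user mappings are
 * torn down. O(pfns), hence slow for terabyte devices, but correct.
 */
static void pgmap_notify_remove(struct dev_pagemap *pgmap)
{
    unsigned long pfn;

    for (pfn = PHYS_PFN(pgmap->range.start);
         pfn <= PHYS_PFN(pgmap->range.end); pfn++)
        memory_failure(pfn, MF_MUST_KILL);
}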


Re: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()

2021-03-07 Thread Dan Williams
On Sun, Mar 7, 2021 at 7:38 PM ruansy.f...@fujitsu.com
 wrote:
>
> > On Mon, Feb 8, 2021 at 2:55 AM Shiyang Ruan  
> > wrote:
> > >
> > > When memory-failure occurs, we call this function, which is implemented
> > > by each kind of device.  For the fsdax case, the pmem device driver
> > > implements it.  The pmem device driver will find out the block device
> > > where the error page is located, and try to get the filesystem on this
> > > block device.  And finally call the filesystem handler to deal with the
> > > error.  The filesystem will try to recover the corrupted data if possible.
> > >
> > > Signed-off-by: Shiyang Ruan 
> > > ---
> > >  include/linux/memremap.h | 8 
> > >  1 file changed, 8 insertions(+)
> > >
> > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > > index 79c49e7f5c30..0bcf2b1e20bd 100644
> > > --- a/include/linux/memremap.h
> > > +++ b/include/linux/memremap.h
> > > @@ -87,6 +87,14 @@ struct dev_pagemap_ops {
> > >  * the page back to a CPU accessible page.
> > >  */
> > > vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> > > +
> > > +   /*
> > > +* Handle the memory failure that happens on one page.  Notify
> > > +* the processes who are using this page, and try to recover the
> > > +* data on this page if necessary.
> > > +*/
> > > +   int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
> > > + int flags);
> > >  };
> >
> > After the conversation with Dave I don't see the point of this. If
> > there is a memory_failure() on a page, why not just call
> > memory_failure()? That already knows how to find the inode and the
> > filesystem can be notified from there.
>
> We want memory_failure() to support reflinked files.  In this case, we
> are not able to track multiple files from a page (this broken page)
> because page->mapping, page->index can only track one file.  Thus, I
> introduce this ->memory_failure(), implemented in the pmem driver, to
> call ->corrupted_range() level by level upward, and finally find out
> the files which are using (mmapping) this page.
>

I know the motivation, but this implementation seems backwards. It's
already the case that memory_failure() looks up the address_space
associated with a mapping. From there I would expect a new 'struct
address_space_operations' op to let the fs handle the case when there
are multiple address_spaces associated with a given file.
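
For illustration, the op could look something like this -- the name and
signature are hypothetical, invented here rather than taken from any
posted patch:

/*
 * Hypothetical address_space_operations addition: called from
 * memory_failure() so a filesystem that aliases one physical page
 * across several address_spaces (reflink) can use its own reverse
 * mapping to notify every affected file.
 */
int (*notify_failure)(struct address_space *mapping, pgoff_t index,
                      int flags);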


Re: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()

2021-03-06 Thread Dan Williams
On Mon, Feb 8, 2021 at 2:55 AM Shiyang Ruan  wrote:
>
> When memory-failure occurs, we call this function, which is implemented
> by each kind of device.  For the fsdax case, the pmem device driver
> implements it.  The pmem device driver will find out the block device
> where the error page is located, and try to get the filesystem on this
> block device.  And finally call the filesystem handler to deal with the
> error.  The filesystem will try to recover the corrupted data if possible.
>
> Signed-off-by: Shiyang Ruan 
> ---
>  include/linux/memremap.h | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 79c49e7f5c30..0bcf2b1e20bd 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -87,6 +87,14 @@ struct dev_pagemap_ops {
>  * the page back to a CPU accessible page.
>  */
> vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> +
> +   /*
> +* Handle the memory failure that happens on one page.  Notify
> +* the processes who are using this page, and try to recover the
> +* data on this page if necessary.
> +*/
> +   int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
> + int flags);
>  };

After the conversation with Dave I don't see the point of this. If
there is a memory_failure() on a page, why not just call
memory_failure()? That already knows how to find the inode and the
filesystem can be notified from there.

Although memory_failure() is inefficient for large range failures, I'm
not seeing a better option, so I'm going to test calling
memory_failure() over a large range whenever an in-use dax-device is
hot-removed.


Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-02 Thread Dan Williams
On Mon, Mar 1, 2021 at 11:57 PM Dave Chinner  wrote:
>
> On Mon, Mar 01, 2021 at 09:41:02PM -0800, Dan Williams wrote:
> > On Mon, Mar 1, 2021 at 7:28 PM Darrick J. Wong  wrote:
> > > > > I really don't see why you seem to be telling us that invalidation is an
> > > > > either/or choice. There's more ways to convert physical block
> > > > > address -> inode file offset and mapping index than brute force
> > > > > inode cache walks
> > > >
> > > > Yes, but I was trying to map it to an existing mechanism and the
> > > > internals of drop_pagecache_sb() are, in coarse terms, close to what
> > > > needs to happen here.
> > >
> > > Yes.  XFS (with rmap enabled) can do all the iteration and walking in
> > > that function except for the invalidate_mapping_* call itself.  The goal
> > > of this series is first to wire up a callback within both the block and
> > > pmem subsystems so that they can take notifications and reverse-map them
> > > through the storage stack until they reach an fs superblock.
> >
> > I'm chuckling because this "reverse map all the way up the block
> > layer" is the opposite of what Dave said at the first reaction to my
> > proposal, "can't the mm map pfns to fs inode  address_spaces?".
>
> Ah, no, I never said that the filesystem can't do reverse maps. I
> was asking if the mm could directly (brute-force) invalidate PTEs
> pointing at physical pmem ranges without needing to walk the inode
> mappings. That would be far more efficient if it could be done.
>
> > Today whenever the pmem driver receives new corrupted range
> > notification from the lower level nvdimm
> > infrastructure(nd_pmem_notify) it updates the 'badblocks' instance
> > associated with the pmem gendisk and then notifies userspace that
> > there are new badblocks. This seems a perfect place to signal an upper
> > level stacked block device that may also be watching disk->bb. Then
> > each gendisk in a stacked topology is responsible for watching the
> > badblock notifications of the next level and storing a remapped
> > instance of those blocks until ultimately the filesystem mounted on
> > the top-level block device is responsible for registering for those
> > top-level disk->bb events.
> >
> > The device gone notification does not map cleanly onto 'struct badblocks'.
>
> Filesystems are not allowed to interact with the gendisk
> infrastructure - that's for supporting the device side of a block
> device. It's a layering violation, and many a filesystem developer
> has been shouted at for trying to do this. At most we can peek
> through it to query functionality support from the request queue,
> but otherwise filesystems do not interact with anything under
> bdev->bd_disk.

So let's add an api that allows the querying of badblocks by bdev and
let the block core handle the bd_disk interaction. I see other block
functionality like blk-integrity reaching through gendisk. The fs need
not interact with the gendisk directly.
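
A sketch of such an api -- the wrapper name is hypothetical;
badblocks_check() and disk->bb are the existing pieces it would sit on:

/*
 * Hypothetical block-core helper: let a filesystem query badblocks
 * by bdev without reaching through bdev->bd_disk itself.
 */
int bdev_badblocks_check(struct block_device *bdev, sector_t sector,
                         int sectors, sector_t *first_bad, int *bad_sectors)
{
    struct badblocks *bb = bdev->bd_disk->bb;

    if (!bb)
        return 0;
    /* translate the partition-relative sector to a whole-disk sector */
    return badblocks_check(bb, sector + get_start_sect(bdev), sectors,
                           first_bad, bad_sectors);
}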

>
> As it is, badblocks are used by devices to manage internal state.
> e.g. md for recording stripes that need recovery if the system
> crashes while they are being written out.

I know, I was there when it was invented which is why it was
top-of-mind when pmem had a need to communicate badblocks. Other block
drivers have threatened to use it for badblocks tracking, but none of
those have carried through on that initial interest.

>
> > If an upper level agent really cared about knowing about ->remove()
> > events before they happened it could maybe do something like:
> >
> > dev = disk_to_dev(bdev->bd_disk)->parent;
> > bus_register_notifier(dev->bus, &host_device_notifier_block)
>
> Yeah, that's exactly the sort of thing that filesystems have been
> aggressively discouraged from doing for years.

Yup, it's a layering violation.

> Part of the reason for this is that gendisk based mechanisms are not
> very good for stacked device error reporting. Part of the problem
> here is that every layer of the stacked device has to hook the
> notifier of the block devices underneath it, then translate the
> event to match the upper block device map, then regenerate the
> notification for the next layer up. This isn't an efficient way to
> pass a notification through a series of stacked devices and it is
> messy and cumbersome to maintain.

It's been messy and cumbersome to route new infrastructure through DM
every time a new dax_operation arrives. The corrupted_range() routing
has the same burden. The advantage of badblocks over corrupted_range()
is that it solves the 

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dan Williams
On Mon, Mar 1, 2021 at 9:38 PM Dave Chinner  wrote:
>
> On Mon, Mar 01, 2021 at 07:33:28PM -0800, Dan Williams wrote:
> > On Mon, Mar 1, 2021 at 6:42 PM Dave Chinner  wrote:
> > [..]
> > > We do not need a DAX specific mechanism to tell us "DAX device
> > > gone", we need a generic block device interface that tells us "range
> > > of block device is gone".
> >
> > This is the crux of the disagreement. The block_device is going away
> > *and* the dax_device is going away.
>
> No, that is not the disagreement I have with what you are saying.
> You still haven't understood that it's even more basic and generic
> than devices going away. At the simplest form, all the filesystem
> wants is to be notified of when *unrecoverable media errors*
> occur in the persistent storage that underlies the filesystem.
>
> The filesystem does not care what that media is build from - PMEM,
> flash, corroded spinning disks, MRAM, or any other persistent media
> you can think off. It just doesn't matter.
>
> What we care about is that the contents of a *specific LBA range* no
> longer contain *valid data*. IOWs, the data in that range of the
> block device has been lost, cannot be retrieved and/or cannot be
> written to any more.
>
> PMEM taking a MCE because ECC tripped is a media error because data
> is lost and inaccessible until recovery actions are taken.
>
> MD RAID failing a scrub is a media error and data is lost and
> unrecoverable at that layer.
>
> A device disappearing is a media error because the storage media is
> now permanently inaccessible to the higher layers.
>
> This "media error" categorisation is a fundamental property of
> persistent storage and, as such, is a property of the block devices
> used to access said persistent storage.
>
> That's the disagreement here - that you and Christoph are saying
> ->corrupted_range is not a block device property because only a
> pmem/DAX device currently generates it.
>
> You both seem to be NACKing a generic interface because it's only
> implemented for the first subsystem that needs it. AFAICT, you
> either don't understand or are completely ignoring the architectural
> need for it to be provided across the rest of the storage stack that
> *block device based filesystems depend on*.

No I'm NAKing it because it's the wrong interface. See my 'struct
badblocks' argument in the reply to Darrick. That 'struct badblocks'
infrastructure arose from MD and is shared with PMEM.

>
> Sure, there might be dax device based filesystems around the corner.
> They just require a different pmem device ->corrupted_range callout
> to implement the notification - one that directs to the dax device
> rather than the block device. That's simple and trivial to
> implement, but such functionality for DAX devices does not replace
> the need for the same generic functionality to be provided across a
> *range of different block devices* as required by *block device
> based filesystems*.
>
> And that's fundamentally the problem. XFS is block device based, not
> DAX device based. We require errors to be reported through block
> device mechanisms. fs-dax does not change this - it is based on pmem
> being presented as a primarily as a block device to the block device
> based filesystems and only secondarily as a dax device. Hence if it
> can be trivially implemented as a block device interface, that's
> where it should go, because then all the other block devices that
> the filesystem runs on can provide the same functionality for similar
> media error events.

Sure, use 'struct badblocks' not struct block_device and
block_device_operations.
>
> > The dax_device removal implies one
> > set of actions (direct accessed pfns invalid) the block device removal
> > implies another (block layer sector access offline).
>
> There you go again, saying DAX requires an action, while the block
> device notification is a -state change- (i.e. goes offline).

There you go reacting to the least generous interpretation of what I said.

s/pfns invalid/pfns offline/

>
> This is exactly what I said was wrong in my last email.
>
> > corrupted_range
> > is blurring the notification for 2 different failure domains. Look at
> > the nascent idea to mount a filesystem on dax sans a block device.
> > Look at the existing plumbing for DM to map dax_operations through a
> > device stack.
>
> Ummm, it just maps the direct_access call to the underlying device
> and calls it's ->direct_access method. All it's doing is LBA
> mapping. That's all it needs to do for ->corrupted_range, too.
> I have no clue why you think this is a problem for error
> notification...
>

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dan Williams
On Mon, Mar 1, 2021 at 7:28 PM Darrick J. Wong  wrote:
>
> On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote:
> > On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner  wrote:
> > >
> > > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner  
> > > > wrote:
> > > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner  
> > > > > > wrote:
> > > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > > > it points to, check if it points to the PMEM that is being removed,
> > > > > grab the page it points to, map that to the relevant struct page,
> > > > > run collect_procs() on that page, then kill the user processes that
> > > > > map that page.
> > > > >
> > > > > So why can't we walk the ptes, check the physical pages that they
> > > > > map to and if they map to a pmem page we go poison that
> > > > > page and that kills any user process that maps it.
> > > > >
> > > > > i.e. I can't see how unexpected pmem device unplug is any different
> > > > > to an MCE delivering a hwpoison event to a DAX mapped page.
> > > >
> > > > I guess the tradeoff is walking a long list of inodes vs walking a
> > > > large array of pages.
> > >
> > > Not really. You're assuming all a filesystem has to do is invalidate
> > > everything if a device goes away, and that's not true. Finding if an
> > > inode has a mapping that spans a specific device in a multi-device
> > > filesystem can be a lot more complex than that. Just walking inodes
> > > is easy - determining which inodes need invalidation is the hard
> > > part.
> >
> > That inode-to-device level of specificity is not needed for the same
> > reason that drop_caches does not need to be specific. If the wrong
> > page is unmapped a re-fault will bring it back, and re-fault will fail
> > for the pages that are successfully removed.
> >
> > > That's where ->corrupt_range() comes in - the filesystem is already
> > > set up to do reverse mapping from physical range to inode(s)
> > > offsets...
> >
> > Sure, but what is the need to get to that level of specificity with
> > the filesystem for something that should rarely happen in the course
> > of normal operation outside of a mistake?
>
> I can't tell if we're conflating the "a bunch of your pmem went bad"
> case with the "all your dimms fell out of the machine" case.

From the pmem driver perspective it has the media scanning to find
some small handful of cachelines that have gone bad, and it has the
driver ->remove() callback to tell it a bunch of pmem is now offline.
The NVDIMM device "range has gone bad" mechanism has no way to
communicate multiple terabytes have gone bad at once.

In fact I think the distinction is important that ->remove() is not
treated as ->corrupted_range() because I expect the level of freakout
is much worse for a "your storage is offline" notification vs "your
storage is corrupted" notification.

> If, say, a single cacheline's worth of pmem goes bad on a node with 2TB
> of pmem, I certainly want that level of specificity.  Just notify the
> users of the dead piece, don't flush the whole machine down the drain.

Right, something like corrupted_range() is there to say, "keep going
upper layers, but note that this handful of sectors now has
indeterminate data and will return -EIO on access until repaired". The
repair for device-offline is device-online.
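
For reference, the rough shape of the callback being debated -- this is
my approximation of the op from Ruan's series, not a verbatim quote of
it:

/*
 * Approximate shape: the device below tells the layer above that
 * [offset, offset + len) now has indeterminate data and returns -EIO
 * until repaired. The exact signature in the posted series may differ.
 */
int (*corrupted_range)(struct block_device *bdev, loff_t offset,
                       size_t len, void *data);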

>
> > > > There's likely always more pages than inodes, but perhaps it's more
> > > > efficient to walk the 'struct page' array than sb->s_inodes?
> > >
> > > I really don't see why you seem to be telling us that invalidation is an
> > > either/or choice. There's more ways to convert physical block
> > > address -> inode file offset and mapping index than brute force
> > > inode cache walks
> >
> > Yes, but I was trying to map it to an existing mechanism and the
> > internals of drop_pagecache_sb() are, in coarse terms, close to what
> > needs to happen here.
>
> Yes.  XFS (with rmap enabled) can do all the iteration and walking in
> that function except for the invalidate_mapping_* call itself.  The goal
> of this series is first to wire up a callback within both the block and
> pmem subsystems so that they can take notifications and reverse-map them
> through the storage stack until they reach an fs superblock.

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dan Williams
On Mon, Mar 1, 2021 at 6:42 PM Dave Chinner  wrote:
[..]
> We do not need a DAX specific mechanism to tell us "DAX device
> gone", we need a generic block device interface that tells us "range
> of block device is gone".

This is the crux of the disagreement. The block_device is going away
*and* the dax_device is going away. The dax_device removal implies one
set of actions (direct accessed pfns invalid) the block device removal
implies another (block layer sector access offline). corrupted_range
is blurring the notification for 2 different failure domains. Look at
the nascent idea to mount a filesystem on dax sans a block device.
Look at the existing plumbing for DM to map dax_operations through a
device stack. Look at the pushback Ruan got for adding a new
block_device operation for corrupted_range().

> The reason that the block device is gone is irrelevant to the
> filesystem. The type of block device is also irrelevant. If the
> filesystem isn't using DAX (e.g. dax=never mount option) even when
> it is on a DAX capable device, then it just doesn't care about
> invalidating DAX mappings because it has none. But it still may care
> about shutting down the filesystem because the block device is gone.

Sure, let's have a discussion about a block_device gone notification,
and a dax_device gone notification.

> This is why we need to communicate what error occurred, not what
> action a device driver thinks needs to be taken.

The driver is only an event producer in this model, whatever the
consumer does at the other end is not its concern. There may be a
generic consumer and a filesystem specific consumer.

> The error is
> important to the filesystem, the action might be completely
> irrelevant. And, as we know now, shutdown on DAX enable filesystems
> needs to imply DAX mapping invalidation in all cases, not just when
> the device disappears from under the filesystem.

Sure.


Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dan Williams
On Mon, Mar 1, 2021 at 2:47 PM Dave Chinner  wrote:
>
> On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote:
> > On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner  wrote:
> > >
> > > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner  
> > > > wrote:
> > > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner  
> > > > > > wrote:
> > > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > > > it points to, check if it points to the PMEM that is being removed,
> > > > > grab the page it points to, map that to the relevant struct page,
> > > > > run collect_procs() on that page, then kill the user processes that
> > > > > map that page.
> > > > >
> > > > > So why can't we walk the ptes, check the physical pages that they
> > > > > map to and if they map to a pmem page we go poison that
> > > > > page and that kills any user process that maps it.
> > > > >
> > > > > i.e. I can't see how unexpected pmem device unplug is any different
> > > > > to an MCE delivering a hwpoison event to a DAX mapped page.
> > > >
> > > > I guess the tradeoff is walking a long list of inodes vs walking a
> > > > large array of pages.
> > >
> > > Not really. You're assuming all a filesystem has to do is invalidate
> > > everything if a device goes away, and that's not true. Finding if an
> > > inode has a mapping that spans a specific device in a multi-device
> > > filesystem can be a lot more complex than that. Just walking inodes
> > > is easy - determining which inodes need invalidation is the hard
> > > part.
> >
> > That inode-to-device level of specificity is not needed for the same
> > reason that drop_caches does not need to be specific. If the wrong
> > page is unmapped a re-fault will bring it back, and re-fault will fail
> > for the pages that are successfully removed.
> >
> > > That's where ->corrupt_range() comes in - the filesystem is already
> > > set up to do reverse mapping from physical range to inode(s)
> > > offsets...
> >
> > Sure, but what is the need to get to that level of specificity with
> > the filesystem for something that should rarely happen in the course
> > of normal operation outside of a mistake?
>
> Dan, you made this mistake with the hwpoisoning code that we're
> trying to fix here. Hard coding a 1:1 physical address to
> inode/offset into the DAX mapping was a bad mistake. It's also one
> that should never have occurred because it's *obviously wrong* to
> filesystem developers and has been for a long time.

I admit that mistake. The traditional memory error handling model
assumptions around page->mapping were broken by DAX; I'm not trying to
repeat that mistake. I feel we're talking past each other on the
discussion of the proposals.

> Now we have the filesytem people providing a mechanism for the pmem
> devices to tell the filesystems about physical device failures so
> they can handle such failures correctly themselves. Having the
> device go away unexpectedly from underneath a mounted and active
> filesystem is a *device failure*, not an "unplug event".

It's the same difference to the physical page, all mappings to that
page need to be torn down. I'm happy to call an fs callback and let
each filesystem do what it wants with a "every pfn in this dax device
needs to be unmapped".

I'm looking at the ->corrupted_range() patches trying to map it to
this use case and I don't see how, for example a realtime-xfs over DM
over multiple PMEM gets the notification to the right place.
bd_corrupted_range() uses get_super() which gets the wrong answer for
both realtime-xfs and DM.

I'd flip that arrangement around and have the FS tell the block device
"if something happens to you, here is the super_block to notify". So
to me this looks like a fs_dax_register_super() helper that plumbs the
superblock through an arbitrary stack of block devices to the leaf
block-device that might want to send a notification up when a global
unmap operation needs to be performed.

I naively think that "for_each_inode()
unmap_mapping_range(inode->i_mapping)" is sufficient as a generic
implementation; that does not preclude XFS from overriding that generic
implementation and handling it directly if it so chooses.
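
A sketch of that flipped arrangement -- fs_dax_register_super() is a
name I'm proposing here, and the generic walk is modeled on
drop_pagecache_sb(), so treat both as assumptions:

/* fs registers interest at mount time; name hypothetical */
int fs_dax_register_super(struct dax_device *dax_dev, struct super_block *sb);

/*
 * Generic "device going away" handler: zap every dax mapping so the
 * next access faults and fails cleanly. Mirrors drop_pagecache_sb().
 */
static void fs_dax_invalidate_mappings(struct super_block *sb)
{
    struct inode *inode, *toput = NULL;

    spin_lock(&sb->s_inode_list_lock);
    list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
        spin_lock(&inode->i_lock);
        if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
            spin_unlock(&inode->i_lock);
            continue;
        }
        __iget(inode);
        spin_unlock(&inode->i_lock);
        spin_unlock(&sb->s_inode_list_lock);

        /* tear down all user mappings of this inode */
        unmap_mapping_range(inode->i_mapping, 0, 0, 1);

        iput(toput);
        toput = inode;
        cond_resched();
        spin_lock(&sb->s_inode_list_lock);
    }
    spin_unlock(&sb->s_inode_list_lock);
    iput(toput);
}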

> The mistake you made was not understanding how filesystems work,
> nor actually asking filesystem developer

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dan Williams
On Sun, Feb 28, 2021 at 11:27 PM Yasunori Goto  wrote:
>
> Hello, Dan-san,
>
> On 2021/02/27 4:24, Dan Williams wrote:
> > On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong  wrote:
> >>
> >> On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com wrote:
> >>> Hi, guys
> >>>
> >>> Besides this patchset, I'd like to confirm something about the
> >>> "EXPERIMENTAL" tag for dax in XFS.
> >>>
> >>> In XFS, the "EXPERIMENTAL" tag, which is reported in a warning message
> >>> when we mount a pmem device with the dax option, has existed for a
> >>> while.  It's a bit annoying when using the fsdax feature.  So, my initial
> >>> intention was to remove this tag.  And I started to find out and solve
> >>> the problems which prevent it from being removed.
> >>>
> >>> As discussed before, there are 3 main problems.  The first one is "dax
> >>> semantics", which has been resolved.  The other two are "RMAP for
> >>> fsdax" and "support dax reflink for filesystem", which I have been
> >>> working on.
> >>
> >> 
> >>
> >>> So, what I want to confirm is: does it mean that we can remove the
> >>> "EXPERIMENTAL" tag when the other two problems are solved?
> >>
> >> Yes.  I'd keep the experimental tag for a cycle or two to make sure that
> >> nothing new pops up, but otherwise the two patchsets you've sent close
> >> those two big remaining gaps.  Thank you for working on this!
> >>
> >>> Or maybe there are other important problems that need to be fixed before
> >>> removing it?  If there are, could you please show me that?
> >>
> >> That remains to be seen through QA/validation, but I think that's it.
> >>
> >> Granted, I still have to read through the two patchsets...
> >
> > I've been meaning to circle back here as well.
> >
> > My immediate concern is the issue Jason recently highlighted [1] with
> > respect to invalidating all dax mappings when / if the device is
> > ripped out from underneath the fs. I don't think that will collide
> > with Ruan's implementation, but it does need new communication from
> > driver to fs about removal events.
> >
> > [1]: 
> > http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com
> >
>
> I'm not sure why there is a race condition between the unbinding
> operation and accessing an mmapped file on filesystem dax yet.
>
> Maybe a silly question, but could you tell me why the "unbinding"
> operation of a namespace which is mounted with filesystem dax must be
> allowed?

The unbind operation is used to switch the mode of a namespace between
fsdax and devdax. There is no way to fail unbind. At most it can be
delayed for a short while to perform cleanup, but it can't be blocked
indefinitely. There is the option to specify 'suppress_bind_attrs' to
the driver to preclude software triggered device removal, but that
would disable mode changes for the device.
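
For illustration, this is where that static decision lives today -- a
sketch of the pmem driver registration with the flag shown where it
would go if nd_pmem ever opted in (it does not today):

static struct nd_device_driver nd_pmem_driver = {
    .probe = nd_pmem_probe,
    .remove = nd_pmem_remove,
    .notify = nd_pmem_notify,
    .drv = {
        .name = "nd_pmem",
        .suppress_bind_attrs = true,    /* hypothetical opt-in */
    },
    .type = ND_DRIVER_NAMESPACE_IO | ND_DRIVER_NAMESPACE_PMEM,
};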

> If "unbinding" is rejected when the filesystem is mounted with dax
> enabled, what is the inconvenience?

It would be interesting (read difficult) to introduce the concept of
dynamic 'suppress_bind_attrs'. Today the decision is static at driver
registration time, not in response to how the device is being used.

I think global invalidation of all inodes that might be affected by a
dax-capable device being ripped away from the filesystem is sufficient
and avoids per-fs enabling, but I'm willing to be convinced that
->corrupted_range() is the proper vehicle for this.

>
> I can imagine that if a device like a usb memory stick is removed by
> surprise, the kernel/filesystem needs to reject writeback at that time,
> and discard the page cache. Then, I can understand that the unbinding
> operation is essential for such a case.

For usb the system is protected by the fact that all future block-i/o
submissions to the old block-device will fail, and a new usb-device
being plugged in will get a new block-device. I.e. the old security
model is invalidated / all holes are closed by blk_cleanup_queue().

> But I don't know why a PMEM device/namespace allows an unbinding
> operation like a surprise removal event.

DAX hands direct mappings to physical pages, if the security model
fronting those physical pages changes the mappings attained via the
old security model need to be invalidated. blk_cleanup_queue() does
not invalidate DAX mappings.

The practical value of fixing that hole is small given that physical
unplug is not implemented for NVDIMMs today, but the get_user_pages()
path can be optimized if this invalidation is implemented, and future
hotplug-capable NVDIMM buses like CXL will need this.


Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dan Williams
On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner  wrote:
>
> On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner  wrote:
> > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner  
> > > > wrote:
> > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > it points to, check if it points to the PMEM that is being removed,
> > > grab the page it points to, map that to the relevant struct page,
> > > run collect_procs() on that page, then kill the user processes that
> > > map that page.
> > >
> > > So why can't we walk the ptes, check the physical pages that they
> > > map to and if they map to a pmem page we go poison that
> > > page and that kills any user process that maps it.
> > >
> > > i.e. I can't see how unexpected pmem device unplug is any different
> > > to an MCE delivering a hwpoison event to a DAX mapped page.
> >
> > I guess the tradeoff is walking a long list of inodes vs walking a
> > large array of pages.
>
> Not really. You're assuming all a filesystem has to do is invalidate
> everything if a device goes away, and that's not true. Finding if an
> inode has a mapping that spans a specific device in a multi-device
> filesystem can be a lot more complex than that. Just walking inodes
> is easy - determining which inodes need invalidation is the hard
> part.

That inode-to-device level of specificity is not needed for the same
reason that drop_caches does not need to be specific. If the wrong
page is unmapped a re-fault will bring it back, and re-fault will fail
for the pages that are successfully removed.

> That's where ->corrupt_range() comes in - the filesystem is already
> set up to do reverse mapping from physical range to inode(s)
> offsets...

Sure, but what is the need to get to that level of specificity with
the filesystem for something that should rarely happen in the course
of normal operation outside of a mistake?

>
> > There's likely always more pages than inodes, but perhaps it's more
> > efficient to walk the 'struct page' array than sb->s_inodes?
>
> I really don't see why you seem to be telling us that invalidation is an
> either/or choice. There's more ways to convert physical block
> address -> inode file offset and mapping index than brute force
> inode cache walks

Yes, but I was trying to map it to an existing mechanism and the
internals of drop_pagecache_sb() are, in coarse terms, close to what
needs to happen here.

>
> .
>
> > > IOWs, what needs to happen at this point is very filesystem
> > > specific. Assuming that "device unplug == filesystem dead" is not
> > > correct, nor is specifying a generic action that assumes the
> > > filesystem is dead because a device it is using went away.
> >
> > Ok, I think I set this discussion in the wrong direction implying any
> > mapping of this action to a "filesystem dead" event. It's just a "zap
> > all ptes" event and upper layers recover from there.
>
> Yes, that's exactly what ->corrupt_range() is intended for. It
> allows the filesystem to lock out access to the bad range
> and then recover the data. Or metadata, if that's where the bad
> range lands. If that recovery fails, it can then report a data
> loss/filesystem shutdown event to userspace and kill user procs that
> span the bad range...
>
> FWIW, is this notification going to occur before or after the device
> has been physically unplugged?

Before. This will be operations that happen in the pmem driver
->remove() callback.

> i.e. what do we do about the
> time-of-unplug-to-time-of-invalidation window where userspace can
> still attempt to access the missing pmem though the
> not-yet-invalidated ptes? It may not be likely that people just yank
> pmem nvdimms out of machines, but with NVMe persistent memory
> spaces, there's every chance that someone pulls the wrong device...

The physical removal aspect is only theoretical today. While the pmem
driver has a ->remove() path that's purely a software unbind
operation. That said the vulnerability window today is if a process
acquires a dax mapping, the pmem device hosting that filesystem goes
through an unbind / bind cycle, and then a new filesystem is created /
mounted. That old pte may be able to access data that is outside its
intended protection domain.

Going forward, for buses like CXL, there will be a managed physical
remove operation via PCIE native hotplug. The flow there is that the
PCIE hotplug driver will notify the OS of a pending removal, trigger
->remove() on the pmem driver, and then notify the technician (slot
status LED) that the card is safe to pull.


Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-02-27 Thread Dan Williams
On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner  wrote:
>
> On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner  wrote:
> > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > > On Fri, Feb 26, 2021 at 12:51 PM Dave Chinner  
> > > > wrote:
> > > > > > My immediate concern is the issue Jason recently highlighted [1] 
> > > > > > with
> > > > > > respect to invalidating all dax mappings when / if the device is
> > > > > > ripped out from underneath the fs. I don't think that will collide
> > > > > > with Ruan's implementation, but it does need new communication from
> > > > > > driver to fs about removal events.
> > > > > >
> > > > > > [1]: 
> > > > > > http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com
> > > > >
> > > > > Oh, yay.
> > > > >
> > > > > The XFS shutdown code is centred around preventing new IO from being
> > > > > issued - we don't actually do anything about DAX mappings because,
> > > > > well, I don't think anyone on the filesystem side thought they had
> > > > > to do anything special if pmem went away from under it.
> > > > >
> > > > > My understanding -was- that the pmem removal invalidates
> > > > > all the ptes currently mapped into CPU page tables that point at
> > > > > the dax device across the system. The vmas that manage these
> > > > > mappings are not really something the filesystem really manages,
> > > > > but a function of the mm subsystem. What the filesystem cares about
> > > > > is that it gets page faults triggered when a change of state occurs
> > > > > so that it can remap the page to its backing store correctly.
> > > > >
> > > > > IOWs, all the mm subsystem needs to do when pmem goes away is clear the
> > > > > CPU ptes, because then when userspace tries to access the
> > > > > mapped DAX pages we get a new page fault. In processing the fault, the
> > > > > filesystem will try to get direct access to the pmem from the block
> > > > > device. This will get an ENODEV error from the block device
> > > > > because the backing store (pmem) has been unplugged and is no longer
> > > > > there...
> > > > >
> > > > > AFAICT, as long as pmem removal invalidates all the active ptes that
> > > > > point at the pmem being removed, the filesystem doesn't need to
> > > > > care about device removal at all, DAX or no DAX...
> > > >
> > > > How would the pmem removal do that without walking all the active
> > > > inodes in the fs at the time of shutdown and call
> > > > unmap_mapping_range(inode->i_mapping, 0, 0, 1)?
> > >
> > > Which then immediately ends up back at the vmas that manage the ptes
> > > to unmap them.
> > >
> > > Isn't finding the vma(s) that map a specific memory range exactly
> > > what the rmap code in the mm subsystem is supposed to address?
> >
> > rmap can lookup only vmas from a virt address relative to a given
> > mm_struct. The driver has neither the list of mm_struct objects nor
> > virt addresses to do a lookup. All it knows is that someone might have
> > mapped pages through the fsdax interface.
>
> So there's no physical addr to vma translation in the mm subsystem
> at all?
>
> That doesn't make sense. We do exactly this for hwpoison for DAX
> mappings. While we don't look at ptes, we get a pfn,

True, hwpoison does get a known failing pfn and then uses page->mapping
to get the 'struct address_space' to do the unmap. I discounted that
approach from the outset because it would mean walking every pfn in a
multi-terabyte device just in case one is mapped at device shutdown.
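
For contrast, a grossly simplified sketch of that hwpoison-style
lookup -- fine for a single known-bad pfn, untenable as a loop over a
whole device:

/*
 * Sketch: dax faults record mapping/index on the page, so a single
 * failing pfn can be unmapped without any inode walk. (The real
 * hwpoison path also locks the dax entry and signals processes.)
 */
static void zap_one_pfn(unsigned long pfn)
{
    struct page *page = pfn_to_page(pfn);
    struct address_space *mapping = page->mapping;

    if (!mapping)
        return; /* never mapped through fsdax */
    unmap_mapping_range(mapping, (loff_t)page->index << PAGE_SHIFT,
                        PAGE_SIZE, 1);
}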

> it points to, check if it points to the PMEM that is being removed,
> grab the page it points to, map that to the relevant struct page,
> run collect_procs() on that page, then kill the user processes that
> map that page.
>
> So why can't we walk the ptes, check the physical pages that they
> map to and if they map to a pmem page we go poison that
> page and that kills any user process that maps it.
>
> i.e. I can't see how unexpected pmem device unplug is any different
> to an MCE delivering a hwpoison event to a DAX mapped page.

I guess the tradeoff is walking a long list of inodes vs walking a
large array of pages.

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-02-26 Thread Dan Williams
On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner  wrote:
>
> On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > On Fri, Feb 26, 2021 at 12:51 PM Dave Chinner  wrote:
> > >
> > > On Fri, Feb 26, 2021 at 11:24:53AM -0800, Dan Williams wrote:
> > > > On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong  
> > > > wrote:
> > > > >
> > > > > On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com 
> > > > > wrote:
> > > > > > Hi, guys
> > > > > >
> > > > > > Besides this patchset, I'd like to confirm something about the
> > > > > > "EXPERIMENTAL" tag for dax in XFS.
> > > > > >
> > > > > > In XFS, the "EXPERIMENTAL" tag, which is reported in a warning
> > > > > > message when we mount a pmem device with the dax option, has
> > > > > > existed for a while.  It's a bit annoying when using the fsdax
> > > > > > feature.  So, my initial intention was to remove this tag.  And I
> > > > > > started to find out and solve the problems which prevent it from
> > > > > > being removed.
> > > > >
> > > > > 
> > > > >
> > > > > > As discussed before, there are 3 main problems.  The first one is
> > > > > > "dax semantics", which has been resolved.  The other two are "RMAP
> > > > > > for fsdax" and "support dax reflink for filesystem", which I have
> > > > > > been working on.
> > > > >
> > > > > > So, what I want to confirm is: does it mean that we can remove the
> > > > > > "EXPERIMENTAL" tag when the other two problems are solved?
> > > > >
> > > > > Yes.  I'd keep the experimental tag for a cycle or two to make sure 
> > > > > that
> > > > > nothing new pops up, but otherwise the two patchsets you've sent close
> > > > > those two big remaining gaps.  Thank you for working on this!
> > > > >
> > > > > > Or maybe there are other important problems that need to be fixed before
> > > > > > removing it?  If there are, could you please show me that?
> > > > >
> > > > > That remains to be seen through QA/validation, but I think that's it.
> > > > >
> > > > > Granted, I still have to read through the two patchsets...
> > > >
> > > > I've been meaning to circle back here as well.
> > > >
> > > > My immediate concern is the issue Jason recently highlighted [1] with
> > > > respect to invalidating all dax mappings when / if the device is
> > > > ripped out from underneath the fs. I don't think that will collide
> > > > with Ruan's implementation, but it does need new communication from
> > > > driver to fs about removal events.
> > > >
> > > > [1]: 
> > > > http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com
> > >
> > > Oh, yay.
> > >
> > > The XFS shutdown code is centred around preventing new IO from being
> > > issued - we don't actually do anything about DAX mappings because,
> > > well, I don't think anyone on the filesystem side thought they had
> > > to do anything special if pmem went away from under it.
> > >
> > > My understanding -was- that the pmem removal invalidates
> > > all the ptes currently mapped into CPU page tables that point at
> > > the dax device across the system. The vmas that manage these
> > > mappings are not really something the filesystem really manages,
> > > but a function of the mm subsystem. What the filesystem cares about
> > > is that it gets page faults triggered when a change of state occurs
> > > so that it can remap the page to its backing store correctly.
> > >
> > > IOWs, all the mm subsystem needs to do when pmem goes away is clear the
> > > CPU ptes, because then when userspace tries to access the
> > > mapped DAX pages we get a new page fault. In processing the fault, the
> > > filesystem will try to get direct access to the pmem from the block
> > > device. This will get an ENODEV error from the block device
> > > because the backing store (pmem) has been unplugged and is no longer
> > > there...
> > >
> >

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-02-26 Thread Dan Williams
On Fri, Feb 26, 2021 at 12:51 PM Dave Chinner  wrote:
>
> On Fri, Feb 26, 2021 at 11:24:53AM -0800, Dan Williams wrote:
> > On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong  wrote:
> > >
> > > On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com wrote:
> > > > Hi, guys
> > > >
> > > > Besides this patchset, I'd like to confirm something about the
> > > > "EXPERIMENTAL" tag for dax in XFS.
> > > >
> > > > In XFS, the "EXPERIMENTAL" tag, which is reported in a warning message
> > > > when we mount a pmem device with the dax option, has existed for a
> > > > while.  It's a bit annoying when using the fsdax feature.  So, my initial
> > > > intention was to remove this tag.  And I started to find out and solve
> > > > the problems which prevent it from being removed.
> > > >
> > > > As discussed before, there are 3 main problems.  The first one is "dax
> > > > semantics", which has been resolved.  The other two are "RMAP for
> > > > fsdax" and "support dax reflink for filesystem", which I have been
> > > > working on.
> > >
> > > 
> > >
> > > > So, what I want to confirm is: does it mean that we can remove the
> > > > "EXPERIMENTAL" tag when the other two problems are solved?
> > >
> > > Yes.  I'd keep the experimental tag for a cycle or two to make sure that
> > > nothing new pops up, but otherwise the two patchsets you've sent close
> > > those two big remaining gaps.  Thank you for working on this!
> > >
> > > > Or maybe there are other important problems that need to be fixed before
> > > > removing it?  If there are, could you please show me that?
> > >
> > > That remains to be seen through QA/validation, but I think that's it.
> > >
> > > Granted, I still have to read through the two patchsets...
> >
> > I've been meaning to circle back here as well.
> >
> > My immediate concern is the issue Jason recently highlighted [1] with
> > respect to invalidating all dax mappings when / if the device is
> > ripped out from underneath the fs. I don't think that will collide
> > with Ruan's implementation, but it does need new communication from
> > driver to fs about removal events.
> >
> > [1]: 
> > http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com
>
> Oh, yay.
>
> The XFS shutdown code is centred around preventing new IO from being
> issued - we don't actually do anything about DAX mappings because,
> well, I don't think anyone on the filesystem side thought they had
> to do anything special if pmem went away from under it.
>
> My understanding -was- that the pmem removal invalidates
> all the ptes currently mapped into CPU page tables that point at
> the dax device across the system. The vmas that manage these
> mappings are not really something the filesystem really manages,
> but a function of the mm subsystem. What the filesystem cares about
> is that it gets page faults triggered when a change of state occurs
> so that it can remap the page to its backing store correctly.
>
> IOWs, all the mm subsystem needs to do when pmem goes away is clear the
> CPU ptes, because then when userspace tries to access the
> mapped DAX pages we get a new page fault. In processing the fault, the
> filesystem will try to get direct access to the pmem from the block
> device. This will get an ENODEV error from the block device
> because the backing store (pmem) has been unplugged and is no longer
> there...
>
> AFAICT, as long as pmem removal invalidates all the active ptes that
> point at the pmem being removed, the filesystem doesn't need to
> care about device removal at all, DAX or no DAX...

How would the pmem removal do that without walking all the active
inodes in the fs at the time of shutdown and call
unmap_mapping_range(inode->i_mapping, 0, 0, 1)?

The core-mm does tear down the ptes in the direct map, but user
mappings to pmem are not torn down, afaics, in xfs_do_force_shutdown().


Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-02-26 Thread Dan Williams
On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong  wrote:
>
> On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com wrote:
> > Hi, guys
> >
> > Beside this patchset, I'd like to confirm something about the
> > "EXPERIMENTAL" tag for dax in XFS.
> >
> > In XFS, the "EXPERIMENTAL" tag, which is reported in waring message
> > when we mount a pmem device with dax option, has been existed for a
> > while.  It's a bit annoying when using fsdax feature.  So, my initial
> > intention was to remove this tag.  And I started to find out and solve
> > the problems which prevent it from being removed.
> >
> > As is talked before, there are 3 main problems.  The first one is "dax
> > semantics", which has been resolved.  The rest two are "RMAP for
> > fsdax" and "support dax reflink for filesystem", which I have been
> > working on.
>
> 
>
> > So, what I want to confirm is: does it means that we can remove the
> > "EXPERIMENTAL" tag when the rest two problem are solved?
>
> Yes.  I'd keep the experimental tag for a cycle or two to make sure that
> nothing new pops up, but otherwise the two patchsets you've sent close
> those two big remaining gaps.  Thank you for working on this!
>
> > Or maybe there are other important problems need to be fixed before
> > removing it?  If there are, could you please show me that?
>
> That remains to be seen through QA/validation, but I think that's it.
>
> Granted, I still have to read through the two patchsets...

I've been meaning to circle back here as well.

My immediate concern is the issue Jason recently highlighted [1] with
respect to invalidating all dax mappings when / if the device is
ripped out from underneath the fs. I don't think that will collide
with Ruan's implementation, but it does need new communication from
driver to fs about removal events.

[1]: 
http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com


Re: [ndctl PATCH] Expose ndctl_bus_nfit_translate_spa as a public function.

2021-02-26 Thread Dan Williams
On Wed, Feb 24, 2021 at 3:48 PM Tsaur, Erwin  wrote:
>
> The motivation is to allow access to ACPI defined NVDIMM Root Device _DSM 
> Function Index 5 (Translate SPA).  The rest of the _DSM functions, which are 
> mostly ARS related, are already public.

For future reference you'll want to fix your editor to make sure it
wraps at 72 columns for commit messages. If you're a vim user, I have
the following in my .vimrc

set tw=72

...and then you can use the 'gq' command to fixup line wrapping after the fact.

Otherwise, looks good to me. I was worried that you needed to double
check that the bus argument actually is an nfit bus, but
bus_has_translate_spa() already takes care of that, so I think we're
good to go. Longer term if other buses grow the ability to return the
DPA we can add a proper generic wrapper for that, to date I have not
seen much support for RAS brewing in other buses.

Reviewed-by: Dan Williams 


Re: [GIT PULL] Compute Express Linux (CXL) for v5.12-rc1

2021-02-23 Thread Dan Williams
As much as I'd love to be working on "Compute Express Linux" the
subject should have read "Compute Express Link".

On Tue, Feb 23, 2021 at 8:05 PM Dan Williams  wrote:
>
> Hi Linus, please pull from:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm 
> tags/cxl-for-5.12
>
> ...to receive an initial driver for CXL 2.0 Memory Devices. Technical
> details are in the tag message and Documentation/. I am taking this
> through nvdimm.git this first cycle until the cxl.git repository and
> maintainer team can be set up on git.kernel.org.
>
> In terms of why merge this initial driver now, it establishes just
> enough functionality to enumerate these devices and issue all
> administrative commands. It sets a v5.12 baseline to develop the more
> complicated higher order functionality like memory device
> interleaving, persistent memory support, and hotplug which entangle
> with ACPI, LIBNVDIMM, and PCI.
>
> The focus of this release is establishing the ioctl UAPI for the
> management commands. Similar to NVME there are a set of standard
> commands as well as the possibility for vendor specific commands.
> Unlike the NVME driver the CXL driver does not enable vendor specific
> command functionality by default. This conservatism is out of concern
> for the fact that CXL interleaves memory across devices and implements
> host memory. The system integrity implications of some commands are
> more severe than NVME and vendor specific functionality is mostly
> unauditable. This will be an ongoing topic of discussion with the
> wider CXL community for the next few months.
>
> The driver has been developed in the open since November against a
> work-in-progress QEMU emulation of the CXL device model. That QEMU
> effort has recently attracted contributions from multiple hardware
> vendors.
>
> The driver has appeared in -next. It collected some initial static
> analysis fixes and build-robot reports, but all quiet in -next for the
> past week.
>
> A list of review tags that arrived after the branch for -next was cut
> is appended to the tag message below.
>
> ---
>
> The following changes since commit 1048ba83fb1c00cd24172e23e8263972f6b5d9ac:
>
>   Linux 5.11-rc6 (2021-01-31 13:50:09 -0800)
>
> are available in the Git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm 
> tags/cxl-for-5.12
>
> for you to fetch changes up to 88ff5d466c0250259818f3153dbdc4af1f8615dd:
>
>   cxl/mem: Fix potential memory leak (2021-02-22 14:44:39 -0800)
>
> 
> cxl for 5.12
>
> Introduce an initial driver for CXL 2.0 Type-3 Memory Devices. CXL is
> Compute Express Link which released the 2.0 specification in November.
> The Linux relevant changes in CXL 2.0 are support for an OS to
> dynamically assign address space to memory devices, support for
> switches, persistent memory, and hotplug. A Type-3 Memory Device is a
> PCI enumerated device presenting the CXL Memory Device Class Code and
> implementing the CXL.mem protocol. CXL.mem allows device to advertise
> CPU and I/O coherent memory to the system, i.e. typical "System RAM" and
> "Persistent Memory" in Linux /proc/iomem terms.
>
> In addition to the CXL.mem fast path there is an administrative command
> hardware mailbox interface for maintenance and provisioning. It is this
> command interface that is the focus of the initial driver. With this
> driver a CXL device that is mapped by the BIOS can be administered by
> Linux. Linux support for CXL PMEM and dynamic CXL address space
> management are to be implemented post v5.12.
>
> 4cdadfd5e0a7 cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints
> Reviewed-by: Konrad Rzeszutek Wilk 
>
> 8adaf747c9f0 cxl/mem: Find device capabilities
> Reviewed-by: Jonathan Cameron 
>
> b39cb1052a5c cxl/mem: Register CXL memX devices
> Reviewed-by: Jonathan Cameron 
>
> 13237183c735 cxl/mem: Add a "RAW" send command
> Reviewed-by: Konrad Rzeszutek Wilk 
>
> 472b1ce6e9d6 cxl/mem: Enable commands via CEL
> Reviewed-by: Konrad Rzeszutek Wilk 
>
> 57ee605b976c cxl/mem: Add set of informational commands
> Reviewed-by: Konrad Rzeszutek Wilk 
>
> 
> Ben Widawsky (7):
>   cxl/mem: Find device capabilities
>   cxl/mem: Add basic IOCTL interface
>   cxl/mem: Add a "RAW" send command
>   cxl/mem: Enable commands via CEL
>   cxl/mem: Add set of informational commands
>   MAINTAINERS: Add maintainers of the CXL driver
>   cxl/mem: Fix potential memory leak
>
> Dan Carpenter (1):
> cxl/mem: Return -EFAULT if copy_to_user() fails

[GIT PULL] Compute Express Linux (CXL) for v5.12-rc1

2021-02-23 Thread Dan Williams
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/cxl-for-5.12

...to receive an initial driver for CXL 2.0 Memory Devices. Technical
details are in the tag message and Documentation/. I am taking this
through nvdimm.git this first cycle until the cxl.git repository and
maintainer team can be set up on git.kernel.org.

In terms of why merge this initial driver now, it establishes just
enough functionality to enumerate these devices and issue all
administrative commands. It sets a v5.12 baseline to develop the more
complicated higher order functionality like memory device
interleaving, persistent memory support, and hotplug which entangle
with ACPI, LIBNVDIMM, and PCI.

The focus of this release is establishing the ioctl UAPI for the
management commands. Similar to NVME there are a set of standard
commands as well as the possibility for vendor specific commands.
Unlike the NVME driver the CXL driver does not enable vendor specific
command functionality by default. This conservatism is out of concern
for the fact that CXL interleaves memory across devices and implements
host memory. The system integrity implications of some commands are
more severe than NVME and vendor specific functionality is mostly
unauditable. This will be an ongoing topic of discussion with the
wider CXL community for the next few months.

The driver has been developed in the open since November against a
work-in-progress QEMU emulation of the CXL device model. That QEMU
effort has recently attracted contributions from multiple hardware
vendors.

The driver has appeared in -next. It collected some initial static
analysis fixes and build-robot reports, but all quiet in -next for the
past week.

A list of review tags that arrived after the branch for -next was cut
is appended to the tag message below.

---

The following changes since commit 1048ba83fb1c00cd24172e23e8263972f6b5d9ac:

  Linux 5.11-rc6 (2021-01-31 13:50:09 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/cxl-for-5.12

for you to fetch changes up to 88ff5d466c0250259818f3153dbdc4af1f8615dd:

  cxl/mem: Fix potential memory leak (2021-02-22 14:44:39 -0800)


cxl for 5.12

Introduce an initial driver for CXL 2.0 Type-3 Memory Devices. CXL is
Compute Express Link which released the 2.0 specification in November.
The Linux relevant changes in CXL 2.0 are support for an OS to
dynamically assign address space to memory devices, support for
switches, persistent memory, and hotplug. A Type-3 Memory Device is a
PCI enumerated device presenting the CXL Memory Device Class Code and
implementing the CXL.mem protocol. CXL.mem allows a device to advertise
CPU and I/O coherent memory to the system, i.e. typical "System RAM" and
"Persistent Memory" in Linux /proc/iomem terms.

In addition to the CXL.mem fast path there is an administrative command
hardware mailbox interface for maintenance and provisioning. It is this
command interface that is the focus of the initial driver. With this
driver a CXL device that is mapped by the BIOS can be administered by
Linux. Linux support for CXL PMEM and dynamic CXL address space
management are to be implemented post v5.12.

4cdadfd5e0a7 cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints
Reviewed-by: Konrad Rzeszutek Wilk 

8adaf747c9f0 cxl/mem: Find device capabilities
Reviewed-by: Jonathan Cameron 

b39cb1052a5c cxl/mem: Register CXL memX devices
Reviewed-by: Jonathan Cameron 

13237183c735 cxl/mem: Add a "RAW" send command
Reviewed-by: Konrad Rzeszutek Wilk 

472b1ce6e9d6 cxl/mem: Enable commands via CEL
Reviewed-by: Konrad Rzeszutek Wilk 

57ee605b976c cxl/mem: Add set of informational commands
Reviewed-by: Konrad Rzeszutek Wilk 


Ben Widawsky (7):
  cxl/mem: Find device capabilities
  cxl/mem: Add basic IOCTL interface
  cxl/mem: Add a "RAW" send command
  cxl/mem: Enable commands via CEL
  cxl/mem: Add set of informational commands
  MAINTAINERS: Add maintainers of the CXL driver
  cxl/mem: Fix potential memory leak

Dan Carpenter (1):
  cxl/mem: Return -EFAULT if copy_to_user() fails

Dan Williams (2):
  cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints
  cxl/mem: Register CXL memX devices

 .clang-format  |1 +
 Documentation/ABI/testing/sysfs-bus-cxl|   26 +
 Documentation/driver-api/cxl/index.rst |   12 +
 Documentation/driver-api/cxl/memory-devices.rst|   46 +
 Documentation/driver-api/index.rst |1 +
 Documentation/userspace-api/ioctl/ioctl-number.rst |1 +
 MAINTAINERS|   11 +
 drivers/Kconfig|1 +
 drivers/Makefile

[GIT PULL] libnvdimm + device-dax for v5.12-rc1

2021-02-23 Thread Dan Williams
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
tags/libnvdimm-for-5.12

...to receive some miscellaneous cleanups and a fix for v5.12. This
mainly continues the kernel wide effort to remove a return code from
the remove() callback in the driver model. The fix addresses a return
code polarity typo in the new sysfs attribute to manually specify a
device-dax instance mapping range. This has all appeared in -next with
no reported issues.

---

The following changes since commit 1048ba83fb1c00cd24172e23e8263972f6b5d9ac:

  Linux 5.11-rc6 (2021-01-31 13:50:09 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
tags/libnvdimm-for-5.12

for you to fetch changes up to 64ffe84320745ea836555ad207ebfb0e896b6167:

  Merge branch 'for-5.12/dax' into for-5.12/libnvdimm (2021-02-23
18:13:45 -0800)


libnvdimm + device-dax for 5.12

- Fix the error code polarity for the device-dax/mapping attribute

- For the device-dax and libnvdimm bus implementations stop implementing
  a useless return code for the remove() callback.

- Miscellaneous cleanups


Dan Williams (1):
  Merge branch 'for-5.12/dax' into for-5.12/libnvdimm

Shiyang Ruan (1):
  device-dax: Fix default return code of range_parse()

Uwe Kleine-König (7):
  libnvdimm/dimm: Simplify nvdimm_remove()
  libnvdimm: Make remove callback return void
  device-dax: Prevent registering drivers without probe callback
  device-dax: Properly handle drivers without remove callback
  device-dax: Fix error path in dax_driver_register
  device-dax: Drop an empty .remove callback
  dax-device: Make remove callback return void

 drivers/dax/bus.c | 24 +---
 drivers/dax/bus.h |  2 +-
 drivers/dax/device.c  |  8 +---
 drivers/dax/kmem.c|  7 ++-
 drivers/dax/pmem/compat.c |  3 +--
 drivers/nvdimm/blk.c  |  3 +--
 drivers/nvdimm/bus.c  | 13 +
 drivers/nvdimm/dimm.c |  7 +--
 drivers/nvdimm/pmem.c |  4 +---
 drivers/nvdimm/region.c   |  4 +---
 include/linux/nd.h|  2 +-
 11 files changed, 36 insertions(+), 41 deletions(-)


Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps

2021-02-23 Thread Dan Williams
On Tue, Feb 23, 2021 at 5:00 PM Jason Gunthorpe  wrote:
>
> On Tue, Feb 23, 2021 at 04:14:01PM -0800, Dan Williams wrote:
> > [ add Ralph ]
> >
> > On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe  wrote:
> > >
> > > On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote:
> > > > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe  wrote:
> > > > >
> > > > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
> > > > >
> > > > > > > The downside would be one extra lookup in dev_pagemap tree
> > > > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > > > > > per gup-fast() call.
> > > > > >
> > > > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> > > > > > path. It should be measurable that this change is at least as fast 
> > > > > > or
> > > > > > faster than falling back to the slow path, but it would be good to
> > > > > > measure.
> > > > >
> > > > > What is the dev_pagemap thing doing in gup fast anyhow?
> > > > >
> > > > > I've been wondering for a while..
> > > >
> > > > It's there to synchronize against dax-device removal. The device will
> > > > suspend removal awaiting all page references to be dropped, but
> > > > gup-fast could be racing device removal. So gup-fast checks for
> > > > pte_devmap() to grab a live reference to the device before assuming it
> > > > can pin a page.
> > >
> > > From the perspective of CPU A it can't tell if CPU B is doing a HW
> > > page table walk or a GUP fast when it invalidates a page table. The
> > > design of gup-fast is supposed to be the same as the design of a HW
> > > page table walk, and the tlb invalidate CPU A does when removing a
> > > page from a page table is supposed to serialize against both a HW page
> > > table walk and gup-fast.
> > >
> > > Given that the HW page table walker does not do dev_pagemap stuff, why
> > > does gup-fast?
> >
> > gup-fast historically assumed that the 'struct page' and memory
> > backing the page-table walk could not physically be removed from the
> > system during its walk because those pages were allocated from the
> > page allocator before being mapped into userspace.
>
> No, I'd say gup-fast assumes that any non-special PTE it finds in a
> page table must have a struct page.
>
> If something wants to remove that struct page it must first remove all
> the PTEs pointing at it from the entire system and flush the TLBs,
> which directly prevents a future gup-fast from running and trying to
> access the struct page. No extra locking needed
>
> > implied elevated reference on any page that gup-fast would be asked to
> > walk, or pte_special() is there to say "wait, nevermind this isn't a
> > page allocator page, fall back to gup-slow()".
>
> pte_special says there is no struct page, and some of those cases can
> be fixed up in gup-slow.
>
> > > Can you sketch the exact race this is protecting against?
> >
> > Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and
> > issues direct I/O with that mapping as the target buffer, Thread2 does
> > "echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind". Without
> > the dev_pagemap check reference gup-fast could execute
> > get_page(pte_page(pte)) on a page that doesn't even exist anymore
> > because the driver unbind has already performed remove_pages().
>
> Surely the unbind either waits for all the VMAs to be destroyed or
> zaps them before allowing things to progress to remove_pages()?

If we're talking about device-dax this is precisely what it does, zaps
and prevents new faults from resolving, but filesystem-dax...

> Having a situation where the CPU page tables still point at physical
> pages that have been removed sounds so crazy/insecure, that can't be
> what is happening, can it??

Hmm, that may be true and an original dax bug! The unbind of a
block-device from underneath the filesystem does trigger the
filesystem to emergency shutdown / go read-only, but unless that
process also includes a global zap of all dax mappings not only is
that violating expectations of "page-tables to disappearing memory",
but the filesystem may also want to guarantee that no further dax
writes can happen after shutdown. Right now I believe it only assumes
that mmap I/O will come from page writeback so there's no need to
bother applications with mappings to page cache, but dax mappings need
to be ripped away.

/me goes to look at what filesystems guarantee when the block-device is
surprise removed out from under them.

In any event, this accelerates the effort to go implement
fs-global-dax-zap at the request of the device driver.


Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps

2021-02-23 Thread Dan Williams
[ add Ralph ]

On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe  wrote:
>
> On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote:
> > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe  wrote:
> > >
> > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
> > >
> > > > > The downside would be one extra lookup in dev_pagemap tree
> > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > > > per gup-fast() call.
> > > >
> > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> > > > path. It should be measurable that this change is at least as fast or
> > > > faster than falling back to the slow path, but it would be good to
> > > > measure.
> > >
> > > What is the dev_pagemap thing doing in gup fast anyhow?
> > >
> > > I've been wondering for a while..
> >
> > It's there to synchronize against dax-device removal. The device will
> > suspend removal awaiting all page references to be dropped, but
> > gup-fast could be racing device removal. So gup-fast checks for
> > pte_devmap() to grab a live reference to the device before assuming it
> > can pin a page.
>
> From the perspective of CPU A it can't tell if CPU B is doing a HW
> page table walk or a GUP fast when it invalidates a page table. The
> design of gup-fast is supposed to be the same as the design of a HW
> page table walk, and the tlb invalidate CPU A does when removing a
> page from a page table is supposed to serialize against both a HW page
> table walk and gup-fast.
>
> Given that the HW page table walker does not do dev_pagemap stuff, why
> does gup-fast?

gup-fast historically assumed that the 'struct page' and memory
backing the page-table walk could not physically be removed from the
system during its walk because those pages were allocated from the
page allocator before being mapped into userspace. So there is an
implied elevated reference on any page that gup-fast would be asked to
walk, or pte_special() is there to say "wait, nevermind this isn't a
page allocator page, fall back to gup-slow()". pte_devmap() is there to
say "wait, there is no implied elevated reference for this page, check
and hold dev_pagemap alive until a page reference can be taken". So it
splits the difference between pte_special() and typical page allocator
pages.

> Can you sketch the exact race this is protecting against?

Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and
issues direct I/O with that mapping as the target buffer, Thread2 does
"echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind". Without
the dev_pagemap check reference gup-fast could execute
get_page(pte_page(pte)) on a page that doesn't even exist anymore
because the driver unbind has already performed remove_pages().

Effectively the same percpu_ref that protects the pmem0 block device
from new command submissions while the device is dying also prevents
new dax page references being taken while the device is dying.

This could be solved with the traditional gup-fast rules if the device
driver could tell the filesystem to unmap all dax files and force them
to re-fault through the gup-slow path to see that the device is now
dying. I'll likely be working on that sooner rather than later given
some of the expectations of the CXL persistent memory "dirty shutdown"
detection.


Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps

2021-02-23 Thread Dan Williams
On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe  wrote:
>
> On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
>
> > > The downside would be one extra lookup in dev_pagemap tree
> > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > per gup-fast() call.
> >
> > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> > path. It should be measurable that this change is at least as fast or
> > faster than falling back to the slow path, but it would be good to
> > measure.
>
> What is the dev_pagemap thing doing in gup fast anyhow?
>
> I've been wondering for a while..

It's there to synchronize against dax-device removal. The device will
suspend removal awaiting all page references to be dropped, but
gup-fast could be racing device removal. So gup-fast checks for
pte_devmap() to grab a live reference to the device before assuming it
can pin a page.


Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages

2021-02-23 Thread Dan Williams
On Tue, Feb 23, 2021 at 9:19 AM Joao Martins  wrote:
>
> On 2/23/21 4:50 PM, Dan Williams wrote:
> > On Tue, Feb 23, 2021 at 7:46 AM Joao Martins  
> > wrote:
> >> On 2/22/21 8:37 PM, Dan Williams wrote:
> >>> On Mon, Feb 22, 2021 at 3:24 AM Joao Martins  
> >>> wrote:
> >>>> On 2/20/21 1:43 AM, Dan Williams wrote:
> >>>>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard  wrote:
> >>>>>> On 12/8/20 9:28 AM, Joao Martins wrote:
>
> [...]
>
> >>> Don't get me wrong the
> >>> capability is still needed for filesystem-dax, but the distinction is
> >>> that vmemmap_populate_compound_pages() need never worry about an
> >>> altmap.
> >>>
> >> IMO there's not much added complexity strictly speaking about altmap. We 
> >> still use the
> >> same vmemmap_{pmd,pte,pgd}_populate helpers which just pass an altmap. So 
> >> whatever it is
> >> being maintained for fsdax or other altmap consumers (e.g. we seem to be 
> >> working towards
> >> hotplug making use of it) we are using it in the exact same way.
> >>
> >> The complexity of the future vmemmap_populate_compound_pages() has more to 
> >> do with reusing
> >> vmemmap blocks allocated in previous vmemmap pages, and preserving that 
> >> across section
> >> onlining (for 1G pages).
> >
> > True, I'm less worried about the complexity as much as
> > opportunistically converting configurations to RAM backed pages. It's
> > already the case that poison handling is page mapping size aligned for
> > device-dax, and filesystem-dax needs to stick with non-compound-pages
> > for the foreseeable future.
> >
> Hmm, I was sort of wondering that fsdax could move to compound pages too as
> opposed to base pages, albeit not necessarily using the vmemmap page reuse
> as it splits pages IIUC.

I'm not sure compound pages for fsdax would work long term because
there's no infrastructure to reassemble compound pages after a split.
So if you fracture a block and then coalesce it back to a 2MB or 1GB
aligned block there's nothing to go fixup the compound page... unless
the filesystem wants to get into mm metadata fixups.


Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps

2021-02-23 Thread Dan Williams
On Tue, Feb 23, 2021 at 9:16 AM Joao Martins  wrote:
>
> On 2/23/21 4:44 PM, Dan Williams wrote:
> > On Tue, Feb 23, 2021 at 8:30 AM Joao Martins  
> > wrote:
> >>
> >> On 2/20/21 1:18 AM, Dan Williams wrote:
> >>> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins  
> >>> wrote:
> >>>> Patch 6 - 8: Optimize grabbing/release a page refcount changes given 
> >>>> that we
> >>>> are working with compound pages i.e. we do 1 increment/decrement to the 
> >>>> head
> >>>> page for a given set of N subpages compared as opposed to N individual 
> >>>> writes.
> >>>> {get,pin}_user_pages_fast() for zone_device with compound pagemap 
> >>>> consequently
> >>>> improves considerably, and unpin_user_pages() improves as well when 
> >>>> passed a
> >>>> set of consecutive pages:
> >>>>
> >>>>before  after
> >>>> (get_user_pages_fast 1G;2M page size) ~75k  us -> ~3.2k ; ~5.2k us
> >>>> (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us
> >>>
> >>> Compelling!
> >>>
> >>
> >> BTW is there any reason why we don't support pin_user_pages_fast() with 
> >> FOLL_LONGTERM for
> >> device-dax?
> >>
> >
> > Good catch.
> >
> > Must have been an oversight of the conversion. FOLL_LONGTERM collides
> > with filesystem operations, but not device-dax.
>
> Hmm, fwiw, it was unilaterally disabled for any devmap pmd/pud in commit 
> 7af75561e171
> ("mm/gup: add FOLL_LONGTERM capability to GUP fast") and I must only assume 
> that
> by "DAX pages" the submitter was only referring to fs-dax pages.

Agree, that was an fsdax only assumption. Maybe this went unnoticed
because the primary gup-fast case for direct-I/O was not impacted.


Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages

2021-02-23 Thread Dan Williams
On Tue, Feb 23, 2021 at 7:46 AM Joao Martins  wrote:
>
> On 2/22/21 8:37 PM, Dan Williams wrote:
> > On Mon, Feb 22, 2021 at 3:24 AM Joao Martins  
> > wrote:
> >> On 2/20/21 1:43 AM, Dan Williams wrote:
> >>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard  wrote:
> >>>> On 12/8/20 9:28 AM, Joao Martins wrote:
> >>>>> diff --git a/mm/memremap.c b/mm/memremap.c
> >>>>> index 16b2fb482da1..287a24b7a65a 100644
> >>>>> --- a/mm/memremap.c
> >>>>> +++ b/mm/memremap.c
> >>>>> @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
> >>>>>   memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> >>>>>   PHYS_PFN(range->start),
> >>>>>   PHYS_PFN(range_len(range)), pgmap);
> >>>>> - percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> >>>>> - - pfn_first(pgmap, range_id));
> >>>>> + if (pgmap->flags & PGMAP_COMPOUND)
> >>>>> + percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> >>>>> + - pfn_first(pgmap, range_id)) / 
> >>>>> PHYS_PFN(pgmap->align));
> >>>>
> >>>> Is there some reason that we cannot use range_len(), instead of 
> >>>> pfn_end() minus
> >>>> pfn_first()? (Yes, this more about the pre-existing code than about your 
> >>>> change.)
> >>>>
> >>>> And if not, then why are the nearby range_len() uses OK? I realize that 
> >>>> range_len()
> >>>> is simpler and skips a case, but it's not clear that it's required here. 
> >>>> But I'm
> >>>> new to this area so be warned. :)
> >>>
> >>> There's a subtle distinction between the range that was passed in and
> >>> the pfns that are activated inside of it. See the offset trickery in
> >>> pfn_first().
> >>>
> >>>> Also, dividing by PHYS_PFN() feels quite misleading: that function does 
> >>>> what you
> >>>> happen to want, but is not named accordingly. Can you use or create 
> >>>> something
> >>>> more accurately named? Like "number of pages in this large page"?
> >>>
> >>> It's not the number of pages in a large page it's converting bytes to
> >>> pages. Other places in the kernel write it as (x >> PAGE_SHIFT), but my
> >>> thought process was if I'm going to add () might as well use a macro
> >>> that already does this.
> >>>
> >>> That said I think this calculation is broken precisely because
> >>> pfn_first() makes the result unaligned.
> >>>
> >>> Rather than fix the unaligned pfn_first() problem I would use this
> >>> support as an opportunity to revisit the option of storing pages in
> >>> the vmem_altmap reserve space. The altmap's whole reason for existence
> >>> was that 1.5% of large PMEM might completely swamp DRAM. However if
> >>> that overhead is reduced by an order (or orders) of magnitude the
> >>> primary need for vmem_altmap vanishes.
> >>>
> >>> Now, we'll still need to keep it around for the ->align == PAGE_SIZE
> >>> case, but for most part existing deployments that are specifying page
> >>> map on PMEM and an align > PAGE_SIZE can instead just transparently be
> >>> upgraded to page map on a smaller amount of DRAM.
> >>>
> >> I feel the altmap is still relevant. Even with the struct page reuse for
> >> tail pages, the overhead for 2M align is still non-negligible i.e. 4G per
> >> 1Tb (strictly speaking about what's stored in the altmap). Muchun and
> >> Matthew were thinking (in another thread) on compound_head() adjustments
> >> that probably can make this overhead go to 2G (if we learn to differentiate
> >> the reused head page from the real head page).
> >
> > I think that realization is more justification to make a new first
> > class vmemmap_populate_compound_pages() rather than try to reuse
> > vmemmap_populate_basepages() with new parameters.
> >
> I was already going to move this to vmemmap_populate_compound_pages() based
> on your earlier suggestion :)
>
> >> But even there it's still
> >> 2G per 1Tb. 1G pages, though, have 

Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps

2021-02-23 Thread Dan Williams
On Tue, Feb 23, 2021 at 8:30 AM Joao Martins  wrote:
>
> On 2/20/21 1:18 AM, Dan Williams wrote:
> > On Tue, Dec 8, 2020 at 9:32 AM Joao Martins  
> > wrote:
> >> Patch 6 - 8: Optimize grabbing/release a page refcount changes given that 
> >> we
> >> are working with compound pages i.e. we do 1 increment/decrement to the 
> >> head
> >> page for a given set of N subpages compared as opposed to N individual 
> >> writes.
> >> {get,pin}_user_pages_fast() for zone_device with compound pagemap 
> >> consequently
> >> improves considerably, and unpin_user_pages() improves as well when passed 
> >> a
> >> set of consecutive pages:
> >>
> >>before  after
> >> (get_user_pages_fast 1G;2M page size) ~75k  us -> ~3.2k ; ~5.2k us
> >> (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us
> >
> > Compelling!
> >
>
> BTW is there any reason why we don't support pin_user_pages_fast() with 
> FOLL_LONGTERM for
> device-dax?
>

Good catch.

Must have been an oversight of the conversion. FOLL_LONGTERM collides
with filesystem operations, but not device-dax. In fact that's the
motivation for device-dax in the first instance, no need to coordinate
runtime physical address layout changes because the device is
statically allocated.
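
The canonical longterm-pin consumer on device-dax from userspace looks
something like this (illustrative sketch: hypothetical device path,
assumes an already-initialized 'struct ibv_pd *pd', error handling
elided):

	int fd = open("/dev/dax0.0", O_RDWR);
	/* len must honor the device-dax mapping alignment */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	/*
	 * Registration pins the buffer via FOLL_LONGTERM; safe for
	 * device-dax because the physical layout never changes at
	 * runtime.
	 */
	struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
				       IBV_ACCESS_LOCAL_WRITE);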

> Looking at the history, I understand that fsdax can't support it atm, but I 
> am not sure
> that the same holds for device-dax. I have this small chunk (see below the 
> scissors mark)
> which relaxes this for a pgmap of type MEMORY_DEVICE_GENERIC, albeit not sure 
> if there is
> a fundamental issue for the other types that makes this an unwelcoming change.
>
> Joao
>
> ->8-
>
> Subject: [PATCH] mm/gup: allow FOLL_LONGTERM pin-fast for
>  MEMORY_DEVICE_GENERIC
>
> The downside would be one extra lookup in dev_pagemap tree
> for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> per gup-fast() call.

I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
path. It should be measurable that this change is at least as fast or
faster than falling back to the slow path, but it would be good to
measure.

Code changes look good to me.

>
> Signed-off-by: Joao Martins 
> ---
>  include/linux/mm.h |  5 +
>  mm/gup.c   | 24 +---
>  2 files changed, 18 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 32f0c3986d4f..c89a049bbd7a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1171,6 +1171,11 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
> page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
>  }
>
> +static inline bool devmap_longterm_available(const struct dev_pagemap *pgmap)
> +{

I'd call this devmap_can_longterm().

> +   return pgmap->type == MEMORY_DEVICE_GENERIC;
> +}
> +
>  /* 127: arbitrary random number, small enough to assemble well */
>  #define page_ref_zero_or_close_to_overflow(page) \
> ((unsigned int) page_ref_count(page) + 127u <= 127u)
> diff --git a/mm/gup.c b/mm/gup.c
> index 222d1fdc5cfa..03e370d360e6 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2092,14 +2092,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> goto pte_unmap;
>
> if (pte_devmap(pte)) {
> -   if (unlikely(flags & FOLL_LONGTERM))
> -   goto pte_unmap;
> -
> pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
> if (unlikely(!pgmap)) {
> undo_dev_pagemap(nr, nr_start, flags, pages);
> goto pte_unmap;
> }
> +
> +   if (unlikely(flags & FOLL_LONGTERM) &&
> +   !devmap_longterm_available(pgmap)) {
> +   undo_dev_pagemap(nr, nr_start, flags, pages);
> +   goto pte_unmap;
> +   }
> +
> } else if (pte_special(pte))
> goto pte_unmap;
>
> @@ -2195,6 +2199,10 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> return 0;
> }
>
> +   if (unlikely(flags & FOLL_LONGTERM) &&
> +   !devmap_longterm_available(pgmap))
> +   return 0;
> +
> @@ -2356,12 +2364,9 @@ static i

Re: [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given page size

2021-02-22 Thread Dan Williams
On Mon, Feb 22, 2021 at 3:42 AM Joao Martins  wrote:
>
>
>
> On 2/20/21 3:34 AM, Dan Williams wrote:
> > On Tue, Dec 8, 2020 at 9:32 AM Joao Martins  
> > wrote:
> >>
> >> Introduce a new flag, MEMHP_REUSE_VMEMMAP, which signals that
> >> struct pages are onlined with a given alignment, and should reuse the
> >> tail pages vmemmap areas. On that circunstamce we reuse the PFN backing
> >
> > s/On that circunstamce we reuse/Reuse/
> >
> > Kills a "we" and switches to imperative tense. I noticed a couple
> > other "we"s in the previous patches, but this crossed my threshold to
> > make a comment.
> >
> /me nods. Will fix.
>
> >> only the tail pages subsections, while letting the head page PFN remain
> >> different. This presumes that the backing page structs are compound
> >> pages, such as the case for compound pagemaps (i.e. ZONE_DEVICE with
> >> PGMAP_COMPOUND set)
> >>
> >> On 2M compound pagemaps, it lets us save 6 pages out of the 8 necessary
> >
> > s/lets us save/saves/
> >
> Will fix.
>
> >> PFNs necessary
> >
> > s/8 necessary PFNs necessary/8 PFNs necessary/
>
> Will fix.
>
> >
> >> to describe the subsection's 32K struct pages we are
> >> onlining.
> >
> > s/we are onlining/being mapped/
> >
> > ...because ZONE_DEVICE pages are never "onlined".
> >
> >> On a 1G compound pagemap it let us save 4096 pages.
> >
> > s/lets us save/saves/
> >
>
> Will fix both.
>
> >>
> >> Sections are 128M (or bigger/smaller),
> >
> > Huh?
> >
>
> Section size is arch-dependent if we are being holistic.
> On x86 it's 64M, 128M or 512M right?
>
>  #ifdef CONFIG_X86_32
>  # ifdef CONFIG_X86_PAE
>  #  define SECTION_SIZE_BITS 29
>  #  define MAX_PHYSMEM_BITS  36
>  # else
>  #  define SECTION_SIZE_BITS 26
>  #  define MAX_PHYSMEM_BITS  32
>  # endif
>  #else /* CONFIG_X86_32 */
>  # define SECTION_SIZE_BITS  27 /* matt - 128 is convenient right now */
>  # define MAX_PHYSMEM_BITS   (pgtable_l5_enabled() ? 52 : 46)
>  #endif
>
> Also, my point about section sizes is because a 1GB+ page vmemmap
> population will cross sections in how sparsemem populates the vmemmap.
> And in that case we gotta reuse the PTE/PMD pages across multiple
> invocations of vmemmap_populate_basepages(). Either that, or looking at
> the previous page PTE, but that might be inefficient.

Ok, makes sense I think saying this description of needing to handle
section crossing is clearer than mentioning one of the section sizes.
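
For the record, the back-of-envelope that makes the crossing obvious
(assuming 128M sections and a 64-byte struct page):

	1G page          = 8 x 128M sections
	vmemmap per 128M = 32768 pages * 64B = 2M (one PMD leaf)
	vmemmap per 1G   = 16M, populated as 8 separate 2M chunks

...so a 1G compound page's vmemmap is populated section by section, and
the reused PTE/PMD pages have to be carried across those invocations.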

>
> >> @@ -229,38 +235,95 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
> >> for (; addr < end; addr += PAGE_SIZE) {
> >> pgd = vmemmap_pgd_populate(addr, node);
> >> if (!pgd)
> >> -   return -ENOMEM;
> >> +   return NULL;
> >> p4d = vmemmap_p4d_populate(pgd, addr, node);
> >> if (!p4d)
> >> -   return -ENOMEM;
> >> +   return NULL;
> >> pud = vmemmap_pud_populate(p4d, addr, node);
> >> if (!pud)
> >> -   return -ENOMEM;
> >> +   return NULL;
> >> pmd = vmemmap_pmd_populate(pud, addr, node);
> >> if (!pmd)
> >> -   return -ENOMEM;
> >> -   pte = vmemmap_pte_populate(pmd, addr, node, altmap);
> >> +   return NULL;
> >> +   pte = vmemmap_pte_populate(pmd, addr, node, altmap, block);
> >> if (!pte)
> >> -   return -ENOMEM;
> >> +   return NULL;
> >> vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
> >> }
> >>
> >> +   return __va(__pfn_to_phys(pte_pfn(*pte)));
> >> +}
> >> +
> >> +int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
> >> +int node, struct vmem_altmap *altmap)
> >> +{
> >> +   if (!__vmemmap_populate_basepages(start, end, node, altmap, NULL))
> >> +   return -ENOMEM;
> >> return 0;
> >>  }
> >>
> >> +stati

Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages

2021-02-22 Thread Dan Williams
On Mon, Feb 22, 2021 at 3:24 AM Joao Martins  wrote:
>
> On 2/20/21 1:43 AM, Dan Williams wrote:
> > On Tue, Dec 8, 2020 at 9:59 PM John Hubbard  wrote:
> >> On 12/8/20 9:28 AM, Joao Martins wrote:
> >>> diff --git a/mm/memremap.c b/mm/memremap.c
> >>> index 16b2fb482da1..287a24b7a65a 100644
> >>> --- a/mm/memremap.c
> >>> +++ b/mm/memremap.c
> >>> @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
> >>>   memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> >>>   PHYS_PFN(range->start),
> >>>   PHYS_PFN(range_len(range)), pgmap);
> >>> - percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> >>> - - pfn_first(pgmap, range_id));
> >>> + if (pgmap->flags & PGMAP_COMPOUND)
> >>> + percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> >>> + - pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align));
> >>
> >> Is there some reason that we cannot use range_len(), instead of pfn_end() 
> >> minus
> >> pfn_first()? (Yes, this more about the pre-existing code than about your 
> >> change.)
> >>
> >> And if not, then why are the nearby range_len() uses OK? I realize that 
> >> range_len()
> >> is simpler and skips a case, but it's not clear that it's required here. 
> >> But I'm
> >> new to this area so be warned. :)
> >
> > There's a subtle distinction between the range that was passed in and
> > the pfns that are activated inside of it. See the offset trickery in
> > pfn_first().
> >
> >> Also, dividing by PHYS_PFN() feels quite misleading: that function does 
> >> what you
> >> happen to want, but is not named accordingly. Can you use or create 
> >> something
> >> more accurately named? Like "number of pages in this large page"?
> >
> > It's not the number of pages in a large page it's converting bytes to
> > pages. Other places in the kernel write it as (x >> PAGE_SHIFT), but my
> > thought process was if I'm going to add () might as well use a macro
> > that already does this.
> >
> > That said I think this calculation is broken precisely because
> > pfn_first() makes the result unaligned.
> >
> > Rather than fix the unaligned pfn_first() problem I would use this
> > support as an opportunity to revisit the option of storing pages in
> > the vmem_altmap reserve space. The altmap's whole reason for existence
> > was that 1.5% of large PMEM might completely swamp DRAM. However if
> > that overhead is reduced by an order (or orders) of magnitude the
> > primary need for vmem_altmap vanishes.
> >
> > Now, we'll still need to keep it around for the ->align == PAGE_SIZE
> > case, but for most part existing deployments that are specifying page
> > map on PMEM and an align > PAGE_SIZE can instead just transparently be
> > upgraded to page map on a smaller amount of DRAM.
> >
> I feel the altmap is still relevant. Even with the struct page reuse for
> tail pages, the overhead for 2M align is still non-negligible i.e. 4G per
> 1Tb (strictly speaking about what's stored in the altmap). Muchun and
> Matthew were thinking (in another thread) on compound_head() adjustments
> that probably can make this overhead go to 2G (if we learn to differentiate
> the reused head page from the real head page).

I think that realization is more justification to make a new first
class vmemmap_populate_compound_pages() rather than try to reuse
vmemmap_populate_basepages() with new parameters.

> But even there it's still
> 2G per 1Tb. 1G pages, though, have a better story to remove altmap need.

The concern that led to altmap is that someone would build a system
with a 96:1 (PMEM:RAM) ratio where that correlates to maximum PMEM and
minimum RAM, and mapping all PMEM consumes all RAM. As far as I
understand real world populations are rarely going past 8:1, that
seems to make 'struct page' in RAM feasible even for the 2M compound
page case.
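
Back-of-envelope for anyone following along (64-byte struct page, 4K
base pages, my arithmetic):

	64B / 4096B                  ~= 1.6% of capacity
	1TB PMEM                     -> ~16GB of struct pages
	8:1  (1TB PMEM, 128GB RAM)   -> ~12.5% of RAM
	96:1 (6TB PMEM, 64GB RAM)    -> ~96GB, i.e. more than all of RAM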

Let me ask you for a data point, since you're one of the people
actively deploying such systems, would you still use the 'struct page'
in PMEM capability after this set was merged?

> One thing to point out about altmap is that the degradation (in pinning and
> unpining) we observed with struct page's in device memory, is no longer 
> observed
> once 1) we batch ref count updates as we move to compound pages 2) reusing
> tai

Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users

2021-02-22 Thread Dan Williams
On Mon, Feb 22, 2021 at 2:24 AM Mike Rapoport  wrote:
>
> On Mon, Feb 22, 2021 at 07:34:52AM +, Matthew Garrett wrote:
> > On Mon, Feb 08, 2021 at 10:49:18AM +0200, Mike Rapoport wrote:
> >
> > > It is unsafe to allow saving of secretmem areas to the hibernation
> > > snapshot as they would be visible after the resume and this essentially
> > > will defeat the purpose of secret memory mappings.
> >
> > Sorry for being a bit late to this - from the point of view of running
> > processes (and even the kernel once resume is complete), hibernation is
> > effectively equivalent to suspend to RAM. Why do they need to be handled
> > differently here?
>
> Hibernation leaves a copy of the data on the disk which we want to prevent.

Why not document that users should use data at rest protection
mechanisms for their hibernation device? Just because secretmem can't
assert its disclosure guarantee does not mean the hibernation device
is untrustworthy.
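
For example, hibernating into swap on dm-crypt/LUKS keeps the image
encrypted at rest. A sketch (details vary by distro; assumes the
initramfs can unlock the volume before resume):

	# /etc/crypttab: swap on LUKS with a persistent key
	cryptswap  UUID=<swap-partition-uuid>  none  luks
	# /etc/fstab
	/dev/mapper/cryptswap  none  swap  sw  0  0
	# kernel command line
	resume=/dev/mapper/cryptswap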


Re: [PATCH] cxl/mem: Fixes to IOCTL interface

2021-02-22 Thread Dan Williams
On Sat, Feb 20, 2021 at 7:47 PM Ben Widawsky  wrote:
>
> On 21-02-20 18:38:36, Dan Williams wrote:
> > On Sat, Feb 20, 2021 at 1:57 PM Ben Widawsky  wrote:
> > >
> > > When submitting a command for userspace, input and output payload bounce
> > > buffers are allocated. For a given command, both input and output
> > > buffers may exist and so when allocation of the input buffer fails, the
> > > output buffer must be freed. As far as I can tell, userspace can't
> > > easily exploit the leak to OOM a machine unless the machine was already
> > > near OOM state.
> > >
> > > This bug was introduced in v5 of the patch and did not exist in prior
> > > revisions.
> > >
> >
> > Thanks for the quick turnaround, but I think that speed introduced
> > some issues...
> >
> > > While here, adjust the variable 'j' found in patch review by Konrad.
> >
> > Please split this pure cleanup to its own patch. The subject says
> > "Fixes", but it's only the one fix.
> >
>
> This was intentional. I pinged you internally to just drop it if you don't 
> like
> to combine these kind of things.

Apologies, I've been offline this weekend and missed that. I will
point out though that in general I will not change patch contents on
the way upstream. The adjusted patch must appear on the mailing list.
The new korg support for attaching attestation to patches requires
they all be on the list, so even small adjustments need new postings.

> It didn't feel worthwhile to introduce a new
> patch to change the 'j'.

Janitorial patches happen all the time, so I think it's ok.

