Re: [PATCH v19 6/8] PM: hibernate: disable when there are active secretmem users
On Tue, May 18, 2021 at 6:33 PM James Bottomley wrote: > > On Tue, 2021-05-18 at 11:24 +0100, Mark Rutland wrote: > > On Thu, May 13, 2021 at 09:47:32PM +0300, Mike Rapoport wrote: > > > From: Mike Rapoport > > > > > > It is unsafe to allow saving of secretmem areas to the hibernation > > > snapshot as they would be visible after the resume and this > > > essentially will defeat the purpose of secret memory mappings. > > > > > > Prevent hibernation whenever there are active secret memory users. > > > > Have we thought about how this is going to work in practice, e.g. on > > mobile systems? It seems to me that there are a variety of common > > applications which might want to use this which people don't expect > > to inhibit hibernate (e.g. authentication agents, web browsers). > > If mobile systems require hibernate, then the choice is to disable this > functionality or implement a secure hibernation store. I also thought > most mobile hibernation was basically equivalent to S3, in which case > there's no actual writing of ram into storage, in which case there's no > security barrier and likely the inhibition needs to be made a bit more > specific to the suspend to disk case? > > > Are we happy to say that any userspace application can incidentally > > inhibit hibernate? > > Well, yes, for the laptop use case because we don't want suspend to > disk to be able to compromise the secret area. You can disable this > for mobile if you like, or work out how to implement hibernate securely > if you're really suspending to disk. Forgive me if this was already asked and answered. Why not document that secretmem is ephemeral in the case of hibernate and push the problem to userspace to disable hibernation? In other words hibernation causes applications to need to reload their secretmem, it will be destroyed on the way down and SIGBUS afterwards. That at least gives a system the flexibility to either sacrifice hibernate for secretmem (with a userspace controlled policy), or sacrifice secretmem using processes for hibernate. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [ndctl PATCH] ndctl: Update nvdimm mailing list address
On Tue, May 18, 2021 at 3:26 PM Vishal Verma wrote: > > The 'nvdimm' mailing list has moved from lists.01.org to > lists.linux.dev. Update CONTRIBUTING.md and configure.ac to reflect > this. LGTM Reviewed-by: Dan Williams
Re: [Virtio-fs] [PATCH v3 2/3] dax: Add a wakeup mode parameter to put_unlocked_entry()
On Wed, Apr 21, 2021 at 11:25 PM Christoph Hellwig wrote: > > On Wed, Apr 21, 2021 at 12:09:54PM -0700, Dan Williams wrote: > > Can you get in the habit of not replying inline with new patches like > > this? Collect the review feedback, take a pause, and resend the full > > series so tooling like b4 and patchwork can track when a new posting > > supersedes a previous one. As is, this inline style inflicts manual > > effort on the maintainer. > > Honestly I don't mind it at all. If you shiny new tooling can't handle > it maybe you should fix your shiny new tooling instead of changing > everyones workflow? Fyi, shiny new tooling has been fixed: http://lore.kernel.org/r/20210517161317.teawoh5qovxpmqdc@nitro.local ...it still requires properly formatted patches with commentary below the "---" break line, but this should cut down on re-rolls. Hat tip to Konstantin. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[GIT PULL] libnvdimm fixes for 5.13-rc2
Hi Linus, please pull from: git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-fixes-5.13-rc2 ...to receive a regression fix for a bootup crash condition introduced in -rc1 and some other minor fixups. This has all appeared in -next with no reported issues. --- The following changes since commit 6efb943b8616ec53a5e444193dccf1af9ad627b5: Linux 5.13-rc1 (2021-05-09 14:17:44 -0700) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-fixes-5.13-rc2 for you to fetch changes up to e9cfd259c6d386f6235395a13bd4f357a979b2d0: ACPI: NFIT: Fix support for variable 'SPA' structure size (2021-05-12 12:38:25 -0700) libnvdimm fixes for 5.13-rc2 - Fix regression in ACPI NFIT table handling leading to crashes and driver load failures. - Move the nvdimm mailing list - Miscellaneous minor fixups Dan Williams (2): MAINTAINERS: Move nvdimm mailing list ACPI: NFIT: Fix support for variable 'SPA' structure size Wan Jiabing (1): libnvdimm: Remove duplicate struct declaration Zou Wei (1): tools/testing/nvdimm: Make symbol '__nfit_test_ioremap' static Documentation/ABI/obsolete/sysfs-class-dax| 2 +- Documentation/ABI/removed/sysfs-bus-nfit | 2 +- Documentation/ABI/testing/sysfs-bus-nfit | 40 - Documentation/ABI/testing/sysfs-bus-papr-pmem | 4 +-- Documentation/driver-api/nvdimm/nvdimm.rst| 2 +- MAINTAINERS | 14 - drivers/acpi/nfit/core.c | 15 +++--- include/linux/libnvdimm.h | 1 - tools/testing/nvdimm/test/iomap.c | 2 +- tools/testing/nvdimm/test/nfit.c | 42 --- 10 files changed, 69 insertions(+), 55 deletions(-) ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[GIT PULL] dax fixes for v5.13-rc2
Hi Linus, please pull from: git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/dax-fixes-5.13-rc2 ...to receive a fix for a hang condition in the filesystem-dax core when exercised by virtiofs. This bug has been there from the beginning, but the condition has not triggered on other filesystems since they hold a lock over invalidation events. The changes have appeared in -next with no reported issues. The patches were originally against v5.12 so you will see a minor conflict with Willy's nr_exceptional changes. My proposed conflict resolution appended below. --- The following changes since commit 9f4ad9e425a1d3b6a34617b8ea226d56a119a717: Linux 5.12 (2021-04-25 13:49:08 -0700) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/dax-fixes-5.13-rc2 for you to fetch changes up to 237388320deffde7c2d65ed8fc9eef670dc979b3: dax: Wake up all waiters after invalidating dax entry (2021-05-07 15:55:44 -0700) dax fixes for 5.13-rc2 - Fix a hang condition (missed wakeups with virtiofs when invalidating entries) Vivek Goyal (3): dax: Add an enum for specifying dax wakup mode dax: Add a wakeup mode parameter to put_unlocked_entry() dax: Wake up all waiters after invalidating dax entry fs/dax.c | 35 +++ 1 file changed, 23 insertions(+), 12 deletions(-) diff --cc fs/dax.c index 69216241392f,df5485b4bddf..62352cbcf0f4 --- a/fs/dax.c +++ b/fs/dax.c @@@ -524,8 -535,8 +535,8 @@@ retry dax_disassociate_entry(entry, mapping, false); xas_store(xas, NULL); /* undo the PMD join */ - dax_wake_entry(xas, entry, true); + dax_wake_entry(xas, entry, WAKE_ALL); - mapping->nrexceptional--; + mapping->nrpages -= PG_PMD_NR; entry = NULL; xas_set(xas, index); } @@@ -661,10 -672,10 +672,10 @@@ static int __dax_invalidate_entry(struc goto out; dax_disassociate_entry(entry, mapping, trunc); xas_store(, NULL); - mapping->nrexceptional--; + mapping->nrpages -= 1UL << dax_entry_order(entry); ret = 1; out: - put_unlocked_entry(, entry); + put_unlocked_entry(, entry, WAKE_ALL); xas_unlock_irq(); return ret; } ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] MAINTAINERS: Move nvdimm mailing list
On Wed, Apr 21, 2021 at 12:05 AM Dan Williams wrote: > > After seeing some users have subscription management trouble, more spam > than other Linux development lists, and considering some of the benefits > of kernel.org hosted lists, nvdimm and persistent memory development is > moving to nvd...@lists.linux.dev. > > The old list will remain up until v5.14-rc1 and shutdown thereafter. > Events have transpired that require this schedule be expedited. So once this patch ships in v5.13-rc2 the linux-nvdimm@lists.01.org list will be shutdown and all future nvdimm correspondence must be sent to nvd...@lists.linux.dev. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v2] powerpc/papr_scm: Make 'perf_stats' invisible if perf-stats unavailable
On Fri, May 7, 2021 at 4:40 AM Vaibhav Jain wrote: > > In case performance stats for an nvdimm are not available, reading the > 'perf_stats' sysfs file returns an -ENOENT error. A better approach is > to make the 'perf_stats' file entirely invisible to indicate that > performance stats for an nvdimm are unavailable. > > So this patch updates 'papr_nd_attribute_group' to add a 'is_visible' > callback implemented as newly introduced 'papr_nd_attribute_visible()' > that returns an appropriate mode in case performance stats aren't > supported in a given nvdimm. > > Also the initialization of 'papr_scm_priv.stat_buffer_len' is moved > from papr_scm_nvdimm_init() to papr_scm_probe() so that it value is > available when 'papr_nd_attribute_visible()' is called during nvdimm > initialization. > > Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from > PHYP') Since this has been the exposed ABI since v5.9 perhaps a note / analysis is needed here that the disappearance of this file will not break any existing scripts/tools. > Signed-off-by: Vaibhav Jain > --- > Changelog: > > v2: > * Removed a redundant dev_info() from pap_scm_nvdimm_init() [ Aneesh ] > * Marked papr_nd_attribute_visible() as static which also fixed the > build warning reported by kernel build robot > --- > arch/powerpc/platforms/pseries/papr_scm.c | 35 --- > 1 file changed, 24 insertions(+), 11 deletions(-) > > diff --git a/arch/powerpc/platforms/pseries/papr_scm.c > b/arch/powerpc/platforms/pseries/papr_scm.c > index e2b69cc3beaf..11e7b90a3360 100644 > --- a/arch/powerpc/platforms/pseries/papr_scm.c > +++ b/arch/powerpc/platforms/pseries/papr_scm.c > @@ -907,6 +907,20 @@ static ssize_t flags_show(struct device *dev, > } > DEVICE_ATTR_RO(flags); > > +static umode_t papr_nd_attribute_visible(struct kobject *kobj, > +struct attribute *attr, int n) > +{ > + struct device *dev = container_of(kobj, typeof(*dev), kobj); This can use the kobj_to_dev() helper. > + struct nvdimm *nvdimm = to_nvdimm(dev); > + struct papr_scm_priv *p = nvdimm_provider_data(nvdimm); > + > + /* For if perf-stats not available remove perf_stats sysfs */ > + if (attr == _attr_perf_stats.attr && p->stat_buffer_len == 0) > + return 0; > + > + return attr->mode; > +} > + > /* papr_scm specific dimm attributes */ > static struct attribute *papr_nd_attributes[] = { > _attr_flags.attr, > @@ -916,6 +930,7 @@ static struct attribute *papr_nd_attributes[] = { > > static struct attribute_group papr_nd_attribute_group = { > .name = "papr", > + .is_visible = papr_nd_attribute_visible, > .attrs = papr_nd_attributes, > }; > > @@ -931,7 +946,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p) > struct nd_region_desc ndr_desc; > unsigned long dimm_flags; > int target_nid, online_nid; > - ssize_t stat_size; > > p->bus_desc.ndctl = papr_scm_ndctl; > p->bus_desc.module = THIS_MODULE; > @@ -1016,16 +1030,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv > *p) > list_add_tail(>region_list, _nd_regions); > mutex_unlock(_ndr_lock); > > - /* Try retriving the stat buffer and see if its supported */ > - stat_size = drc_pmem_query_stats(p, NULL, 0); > - if (stat_size > 0) { > - p->stat_buffer_len = stat_size; > - dev_dbg(>pdev->dev, "Max perf-stat size %lu-bytes\n", > - p->stat_buffer_len); > - } else { > - dev_info(>pdev->dev, "Dimm performance stats > unavailable\n"); > - } > - > return 0; > > err: nvdimm_bus_unregister(p->bus); > @@ -1102,6 +1106,7 @@ static int papr_scm_probe(struct platform_device *pdev) > u64 blocks, block_size; > struct papr_scm_priv *p; > const char *uuid_str; > + ssize_t stat_size; > u64 uuid[2]; > int rc; > > @@ -1179,6 +1184,14 @@ static int papr_scm_probe(struct platform_device *pdev) > p->res.name = pdev->name; > p->res.flags = IORESOURCE_MEM; > > + /* Try retriving the stat buffer and see if its supported */ s/retriving/retrieving/ > + stat_size = drc_pmem_query_stats(p, NULL, 0); > + if (stat_size > 0) { > + p->stat_buffer_len = stat_size; > + dev_dbg(>pdev->dev, "Max perf-stat size %lu-bytes\n", > + p->stat_buffer_len); > + } > + > rc = papr_scm_nvdimm_init(p); > if (rc) > goto err2; > -- > 2.31.1 > After the minor fixups above you can add: Reviewed-by: Dan Williams ...I assume this will go through the PPC tree. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps
On Thu, May 6, 2021 at 4:02 AM Joao Martins wrote: [..] > >> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr) > > > > I think this can be replaced with a call to follow_pte(_mm...). > > > > Ah, of course! That would shorten things up too. Now that I look closely, I notice that function disclaims being suitable as a general purpose pte lookup utility. If it works for you, great, but if not I think it's past time to create such a helper. I know Ira is looking for one, and I contributed to the proliferation when I added dev_pagemap_mapping_shift(). ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] ACPI: NFIT: Fix support for variable 'SPA' structure size
On Fri, May 7, 2021 at 7:49 AM Rafael J. Wysocki wrote: > > On Fri, May 7, 2021 at 4:12 PM Dan Williams wrote: > > > > On Fri, May 7, 2021 at 2:47 AM Rafael J. Wysocki wrote: > > > > > > Hi Dan, > > > > > > On Fri, May 7, 2021 at 9:33 AM Dan Williams > > > wrote: > > > > > > > > ACPI 6.4 introduced the "SpaLocationCookie" to the NFIT "System Physical > > > > Address (SPA) Range Structure". The presence of that new field is > > > > indicated by the ACPI_NFIT_LOCATION_COOKIE_VALID flag. Pre-ACPI-6.4 > > > > firmware implementations omit the flag and maintain the original size of > > > > the structure. > > > > > > > > Update the implementation to check that flag to determine the size > > > > rather than the ACPI 6.4 compliant definition of 'struct > > > > acpi_nfit_system_address' from the Linux ACPICA definitions. > > > > > > > > Update the test infrastructure for the new expectations as well, i.e. > > > > continue to emulate the ACPI 6.3 definition of that structure. > > > > > > > > Without this fix the kernel fails to validate 'SPA' structures and this > > > > leads to a crash in nfit_get_smbios_id() since that routine assumes that > > > > SPAs are valid if it finds valid SMBIOS tables. > > > > > > > > BUG: unable to handle page fault for address: ffa8 > > > > [..] > > > > Call Trace: > > > > skx_get_nvdimm_info+0x56/0x130 [skx_edac] > > > > skx_get_dimm_config+0x1f5/0x213 [skx_edac] > > > > skx_register_mci+0x132/0x1c0 [skx_edac] > > > > > > > > Cc: Bob Moore > > > > Cc: Erik Kaneda > > > > Fixes: cf16b05c607b ("ACPICA: ACPI 6.4: NFIT: add Location Cookie > > > > field") > > > > > > Do you want me to apply this (as the commit being fixed went in > > > through the ACPI tree)? > > > > Yes, I would need to wait for a signed tag so if you're sending urgent > > fixes in the next few days please take this one, otherwise I'll circle > > back next week after -rc1. > > I'll be doing my next push after -rc1 either, so I guess it doesn't > matter time-wise. Ok, I got it, thanks for the offer. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] ACPI: NFIT: Fix support for variable 'SPA' structure size
On Fri, May 7, 2021 at 2:47 AM Rafael J. Wysocki wrote: > > Hi Dan, > > On Fri, May 7, 2021 at 9:33 AM Dan Williams wrote: > > > > ACPI 6.4 introduced the "SpaLocationCookie" to the NFIT "System Physical > > Address (SPA) Range Structure". The presence of that new field is > > indicated by the ACPI_NFIT_LOCATION_COOKIE_VALID flag. Pre-ACPI-6.4 > > firmware implementations omit the flag and maintain the original size of > > the structure. > > > > Update the implementation to check that flag to determine the size > > rather than the ACPI 6.4 compliant definition of 'struct > > acpi_nfit_system_address' from the Linux ACPICA definitions. > > > > Update the test infrastructure for the new expectations as well, i.e. > > continue to emulate the ACPI 6.3 definition of that structure. > > > > Without this fix the kernel fails to validate 'SPA' structures and this > > leads to a crash in nfit_get_smbios_id() since that routine assumes that > > SPAs are valid if it finds valid SMBIOS tables. > > > > BUG: unable to handle page fault for address: ffa8 > > [..] > > Call Trace: > > skx_get_nvdimm_info+0x56/0x130 [skx_edac] > > skx_get_dimm_config+0x1f5/0x213 [skx_edac] > > skx_register_mci+0x132/0x1c0 [skx_edac] > > > > Cc: Bob Moore > > Cc: Erik Kaneda > > Fixes: cf16b05c607b ("ACPICA: ACPI 6.4: NFIT: add Location Cookie field") > > Do you want me to apply this (as the commit being fixed went in > through the ACPI tree)? Yes, I would need to wait for a signed tag so if you're sending urgent fixes in the next few days please take this one, otherwise I'll circle back next week after -rc1. > > If you'd rather take care of it yourself: > > Reviewed-by: Rafael J. Wysocki Thanks! ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[PATCH] ACPI: NFIT: Fix support for variable 'SPA' structure size
ACPI 6.4 introduced the "SpaLocationCookie" to the NFIT "System Physical Address (SPA) Range Structure". The presence of that new field is indicated by the ACPI_NFIT_LOCATION_COOKIE_VALID flag. Pre-ACPI-6.4 firmware implementations omit the flag and maintain the original size of the structure. Update the implementation to check that flag to determine the size rather than the ACPI 6.4 compliant definition of 'struct acpi_nfit_system_address' from the Linux ACPICA definitions. Update the test infrastructure for the new expectations as well, i.e. continue to emulate the ACPI 6.3 definition of that structure. Without this fix the kernel fails to validate 'SPA' structures and this leads to a crash in nfit_get_smbios_id() since that routine assumes that SPAs are valid if it finds valid SMBIOS tables. BUG: unable to handle page fault for address: ffa8 [..] Call Trace: skx_get_nvdimm_info+0x56/0x130 [skx_edac] skx_get_dimm_config+0x1f5/0x213 [skx_edac] skx_register_mci+0x132/0x1c0 [skx_edac] Cc: Bob Moore Cc: Erik Kaneda Fixes: cf16b05c607b ("ACPICA: ACPI 6.4: NFIT: add Location Cookie field") Reported-by: Yi Zhang Tested-by: Yi Zhang Signed-off-by: Dan Williams --- Rafael, I can take this through nvdimm.git after -rc1, but if you are sending any fixes to Linus based on your merge baseline between now and Monday, please pick up this one. drivers/acpi/nfit/core.c | 15 ++ tools/testing/nvdimm/test/nfit.c | 42 +++--- 2 files changed, 36 insertions(+), 21 deletions(-) diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c index 958aaac869e8..23d9a09d7060 100644 --- a/drivers/acpi/nfit/core.c +++ b/drivers/acpi/nfit/core.c @@ -686,6 +686,13 @@ int nfit_spa_type(struct acpi_nfit_system_address *spa) return -1; } +static size_t sizeof_spa(struct acpi_nfit_system_address *spa) +{ + if (spa->flags & ACPI_NFIT_LOCATION_COOKIE_VALID) + return sizeof(*spa); + return sizeof(*spa) - 8; +} + static bool add_spa(struct acpi_nfit_desc *acpi_desc, struct nfit_table_prev *prev, struct acpi_nfit_system_address *spa) @@ -693,22 +700,22 @@ static bool add_spa(struct acpi_nfit_desc *acpi_desc, struct device *dev = acpi_desc->dev; struct nfit_spa *nfit_spa; - if (spa->header.length != sizeof(*spa)) + if (spa->header.length != sizeof_spa(spa)) return false; list_for_each_entry(nfit_spa, >spas, list) { - if (memcmp(nfit_spa->spa, spa, sizeof(*spa)) == 0) { + if (memcmp(nfit_spa->spa, spa, sizeof_spa(spa)) == 0) { list_move_tail(_spa->list, _desc->spas); return true; } } - nfit_spa = devm_kzalloc(dev, sizeof(*nfit_spa) + sizeof(*spa), + nfit_spa = devm_kzalloc(dev, sizeof(*nfit_spa) + sizeof_spa(spa), GFP_KERNEL); if (!nfit_spa) return false; INIT_LIST_HEAD(_spa->list); - memcpy(nfit_spa->spa, spa, sizeof(*spa)); + memcpy(nfit_spa->spa, spa, sizeof_spa(spa)); list_add_tail(_spa->list, _desc->spas); dev_dbg(dev, "spa index: %d type: %s\n", spa->range_index, diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c index 9b185bf82da8..54f367cbadae 100644 --- a/tools/testing/nvdimm/test/nfit.c +++ b/tools/testing/nvdimm/test/nfit.c @@ -1871,9 +1871,16 @@ static void smart_init(struct nfit_test *t) } } +static size_t sizeof_spa(struct acpi_nfit_system_address *spa) +{ + /* until spa location cookie support is added... */ + return sizeof(*spa) - 8; +} + static int nfit_test0_alloc(struct nfit_test *t) { - size_t nfit_size = sizeof(struct acpi_nfit_system_address) * NUM_SPA + struct acpi_nfit_system_address *spa = NULL; + size_t nfit_size = sizeof_spa(spa) * NUM_SPA + sizeof(struct acpi_nfit_memory_map) * NUM_MEM + sizeof(struct acpi_nfit_control_region) * NUM_DCR + offsetof(struct acpi_nfit_control_region, @@ -1937,7 +1944,8 @@ static int nfit_test0_alloc(struct nfit_test *t) static int nfit_test1_alloc(struct nfit_test *t) { - size_t nfit_size = sizeof(struct acpi_nfit_system_address) * 2 + struct acpi_nfit_system_address *spa = NULL; + size_t nfit_size = sizeof_spa(spa) * 2 + sizeof(struct acpi_nfit_memory_map) * 2 + offsetof(struct acpi_nfit_control_region, window_size) * 2; int i; @@ -2000,7 +2008,7 @@ static void nfit_test0_setup(struct nfit_test *t) */ spa = nfit_buf; spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS; - spa->header.length = sizeof(*s
Re: [bug report] system panic at nfit_get_smbios_id+0x6e/0xf0 [nfit] during boot
On Thu, May 6, 2021 at 10:28 AM Kaneda, Erik wrote: > > > > > -Original Message- > > From: Yi Zhang > > Sent: Wednesday, May 5, 2021 8:05 PM > > To: Williams, Dan J ; Moore, Robert > > > > Cc: linux-nvdimm ; Kaneda, Erik > > ; Wysocki, Rafael J > > Subject: Re: [bug report] system panic at nfit_get_smbios_id+0x6e/0xf0 > > [nfit] during boot > > > > On Sat, May 1, 2021 at 2:05 PM Dan Williams > > wrote: > > > > > > On Fri, Apr 30, 2021 at 7:28 PM Yi Zhang wrote: > > > > > > > > Hi > > > > > > > > With the latest Linux tree, my DCPMM server boot failed with the > > > > bellow panic log, pls help check it, let me know if you need any test > > > > for it. > > > > > > So v5.12 is ok but v5.12+ is not? > > > > > > Might you be able to bisect? > > > > Hi Dan > > This issue was introduced with this patch, let me know if you need more > > info. > > > > commit cf16b05c607bd716a0a5726dc8d577a89fdc1777 > > Author: Bob Moore > > Date: Tue Apr 6 14:30:15 2021 -0700 > > > > ACPICA: ACPI 6.4: NFIT: add Location Cookie field > > > > Also, update struct size to reflect these changes in nfit core driver. > > > > ACPICA commit af60199a9a1de9e6844929fd4cc22334522ed195 > > > > Link: https://github.com/acpica/acpica/commit/af60199a > > Cc: Dan Williams > > Signed-off-by: Bob Moore > > Signed-off-by: Erik Kaneda > > Signed-off-by: Rafael J. Wysocki > > > > It's likely that this change forced the nfit driver's code to parse the ACPI > table so that it assumes that the location cookie field is always enabled and > the NFIT was parsed incorrectly. Does the NFIT table on this platform contain > a valid cookie field? > This was my fault. When I saw the size change fly by, I should have remembered to go update all the places that do "sizeof(struct acpi_nfit_system_address)". Yi Zhang, can you give the attached patch a try: ACPI: NFIT: Fix support for variable 'SPA' structure size From: Dan Williams ACPI 6.4 introduced the "SpaLocationCookie" to the NFIT "System Physical Address (SPA) Range Structure". The presence of that new field is indicated by the ACPI_NFIT_LOCATION_COOKIE_VALID flag. Pre-ACPI-6.4 firmware implementations omit the flag and maintain the original size of the structure. Update the implementation to check that flag to determine the size rather than the ACPI 6.4 compliant definition of 'struct acpi_nfit_system_address' from the Linux ACPICA definitions. Update the test infrastructure for the new expectations as well, i.e. continue to emulate the ACPI 6.3 definition of that structure. Without this fix the kernel fails to validate 'SPA' structures and this leads to a crash in nfit_get_smbios_id() since that routine assumes that SPAs are valid if it finds valid SMBIOS tables. BUG: unable to handle page fault for address: ffa8 [..] Call Trace: skx_get_nvdimm_info+0x56/0x130 [skx_edac] skx_get_dimm_config+0x1f5/0x213 [skx_edac] skx_register_mci+0x132/0x1c0 [skx_edac] Reported-by: Yi Zhang Signed-off-by: Dan Williams --- drivers/acpi/nfit/core.c | 17 +++ tools/testing/nvdimm/test/nfit.c | 42 +++--- 2 files changed, 37 insertions(+), 22 deletions(-) diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c index 958aaac869e8..bfecb79e8c82 100644 --- a/drivers/acpi/nfit/core.c +++ b/drivers/acpi/nfit/core.c @@ -686,29 +686,36 @@ int nfit_spa_type(struct acpi_nfit_system_address *spa) return -1; } -static bool add_spa(struct acpi_nfit_desc *acpi_desc, +static size_t sizeof_spa(struct acpi_nfit_system_address *spa) +{ + if (spa->flags & ACPI_NFIT_LOCATION_COOKIE_VALID) + return sizeof(*spa); + return sizeof(*spa) - 8; +} + +static bool noinline add_spa(struct acpi_nfit_desc *acpi_desc, struct nfit_table_prev *prev, struct acpi_nfit_system_address *spa) { struct device *dev = acpi_desc->dev; struct nfit_spa *nfit_spa; - if (spa->header.length != sizeof(*spa)) + if (spa->header.length != sizeof_spa(spa)) return false; list_for_each_entry(nfit_spa, >spas, list) { - if (memcmp(nfit_spa->spa, spa, sizeof(*spa)) == 0) { + if (memcmp(nfit_spa->spa, spa, sizeof_spa(spa)) == 0) { list_move_tail(_spa->list, _desc->spas); return true; } } - nfit_spa = devm_kzalloc(dev, sizeof(*nfit_spa) + sizeof(*spa), + nfit_spa = devm_kzalloc(dev, sizeof(*nfit_spa) + sizeof_spa(spa), GFP_KERNEL); if (!nfit_spa) return false; INIT_LIST_HEAD(_spa->list); - memcpy(nfit_spa->spa, spa, sizeof(*spa)); + memcpy(nfit_spa-
Re: [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps
On Thu, Mar 25, 2021 at 4:10 PM Joao Martins wrote: > > A compound pagemap is a dev_pagemap with @align > PAGE_SIZE and it > means that pages are mapped at a given huge page alignment and utilize > uses compound pages as opposed to order-0 pages. > > To minimize struct page overhead we take advantage of the fact that I'm not sure if Andrew is a member of the "we" police, but I am on team "imperative tense". "Take advantage of the fact that most tail pages look the same (except the first two) to minimize struct page overhead." ...I think all the other "we"s below can be dropped without losing any meaning. > most tail pages look the same (except the first two). We allocate a > separate page for the vmemmap area which contains the head page and > separate for the next 64 pages. The rest of the subsections then reuse > this tail vmemmap page to initialize the rest of the tail pages. > > Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and > when initializing compound pagemap with big enough @align (e.g. 1G > PUD) it will cross various sections. To be able to reuse tail pages > across sections belonging to the same gigantic page we fetch the > @range being mapped (nr_ranges + 1). If the section being mapped is > not offset 0 of the @align, then lookup the PFN of the struct page > address that preceeds it and use that to populate the entire s/preceeds/precedes/ > section. > > On compound pagemaps with 2M align, this lets mechanism saves 6 pages s/lets mechanism saves 6 pages/this mechanism lets 6 pages be saved/ > out of the 8 necessary PFNs necessary to set the subsection's 512 > struct pages being mapped. On a 1G compound pagemap it saves > 4094 pages. > > Altmap isn't supported yet, given various restrictions in altmap pfn > allocator, thus fallback to the already in use vmemmap_populate(). Perhaps also note that altmap for devmap mappings was there to relieve the pressure of inordinate amounts of memmap space to map terabytes of PMEM. With compound pages the motivation for altmaps for PMEM is reduced. > > Signed-off-by: Joao Martins > --- > include/linux/mm.h | 2 +- > mm/memremap.c | 1 + > mm/sparse-vmemmap.c | 139 > 3 files changed, 131 insertions(+), 11 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 61474602c2b1..49d717ae40ae 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -3040,7 +3040,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long > addr, int node); > pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node); > pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node); > pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node, > - struct vmem_altmap *altmap); > + struct vmem_altmap *altmap, void *block); > void *vmemmap_alloc_block(unsigned long size, int node); > struct vmem_altmap; > void *vmemmap_alloc_block_buf(unsigned long size, int node, > diff --git a/mm/memremap.c b/mm/memremap.c > index d160853670c4..2e6bc0b1ff00 100644 > --- a/mm/memremap.c > +++ b/mm/memremap.c > @@ -345,6 +345,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) > { > struct mhp_params params = { > .altmap = pgmap_altmap(pgmap), > + .pgmap = pgmap, > .pgprot = PAGE_KERNEL, > }; > const int nr_range = pgmap->nr_range; > diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c > index 8814876edcfa..f57c5eada099 100644 > --- a/mm/sparse-vmemmap.c > +++ b/mm/sparse-vmemmap.c > @@ -141,16 +141,20 @@ void __meminit vmemmap_verify(pte_t *pte, int node, > } > > pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int > node, > - struct vmem_altmap *altmap) > + struct vmem_altmap *altmap, void > *block) For type-safety I would make this argument a 'struct page *' and drop the virt_to_page(). > { > pte_t *pte = pte_offset_kernel(pmd, addr); > if (pte_none(*pte)) { > pte_t entry; > - void *p; > - > - p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap); > - if (!p) > - return NULL; > + void *p = block; > + > + if (!block) { > + p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap); > + if (!p) > + return NULL; > + } else if (!altmap) { > + get_page(virt_to_page(block)); This is either super subtle, or there is a missing put_page(). I'm assuming that in the shutdown path the same page will be freed multiple times because the same page is mapped multiple times. Have you validated that the accounting is correct and all allocated pages get freed? I feel this needs more than a comment, it
Re: [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation
On Wed, May 5, 2021 at 3:38 PM Joao Martins wrote: > > > > On 5/5/21 11:34 PM, Dan Williams wrote: > > On Thu, Mar 25, 2021 at 4:10 PM Joao Martins > > wrote: > >> > >> @altmap is stored in a dev_pagemap, but it will be repurposed for > >> hotplug memory for storing the memmap in the hotplugged memory[*] and > >> reusing the altmap infrastructure to that end. This is to say that > >> altmap can't be replaced with a @pgmap as it is going to cover more than > >> dev_pagemap backend altmaps. > > > > I was going to say, just pass the pgmap and lookup the altmap from > > pgmap, but Oscar added a use case for altmap independent of pgmap. So > > you might refresh this commit message to clarify why passing pgmap by > > itself is not sufficient. > > > Isn't that what I am doing above with that exact paragraph? I even > reference his series at the end of commit description :) in [*] Oh, sorry, it didn't hit me explicitly that you were talking about Oscar's work I thought you were referring to your own changes in this set. I see it now... at a minimum the tense needs updating since Oscar's changes are in the past not the future anymore. If it helps, the following reads more direct to me: "In support of using compound pages for devmap mappings, plumb the pgmap down to the vmemmap_populate* implementation. Note that while altmap is retrievable from pgmap the memory hotplug codes passes altmap without pgmap, so both need to be independently plumbed." ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
On Wed, May 5, 2021 at 3:36 PM Joao Martins wrote: [..] > > Ah yup, my eyes glazed over that. I think this is another place that > > benefits from a more specific name than "align". "pfns_per_compound" > > "compound_pfns"? > > > We are still describing a page, just not a base page. So perhaps > @pfns_per_hpage ? > > I am fine with @pfns_per_compound or @compound_pfns as well. My only concern about hpage is that hpage implies PMD, where compound is generic across PMD and PUD. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
I suspect it's a good sign I'm only finding cosmetic and changelog changes in the review... I have some more: A year for now if I'm tracking down a problem and looking through mm commits I would appreciate a subject line like the following: "refactor core of vmemmap_populate_basepages() to helper" that gives an idea of the impact and side effects of the change. On Thu, Mar 25, 2021 at 4:10 PM Joao Martins wrote: > I would add a lead in phrase like: "In preparation for describing a memmap with compound pages, move the actual..." > Move the actual pte population logic into a separate function > vmemmap_populate_address() and have vmemmap_populate_basepages() > walk through all base pages it needs to populate. Aside from the above, looks good. > > Signed-off-by: Joao Martins > --- > mm/sparse-vmemmap.c | 44 ++-- > 1 file changed, 26 insertions(+), 18 deletions(-) > > diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c > index 370728c206ee..8814876edcfa 100644 > --- a/mm/sparse-vmemmap.c > +++ b/mm/sparse-vmemmap.c > @@ -216,33 +216,41 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long > addr, int node) > return pgd; > } > > -int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long > end, > -int node, struct vmem_altmap *altmap) > +static int __meminit vmemmap_populate_address(unsigned long addr, int node, > + struct vmem_altmap *altmap) > { > - unsigned long addr = start; > pgd_t *pgd; > p4d_t *p4d; > pud_t *pud; > pmd_t *pmd; > pte_t *pte; > > + pgd = vmemmap_pgd_populate(addr, node); > + if (!pgd) > + return -ENOMEM; > + p4d = vmemmap_p4d_populate(pgd, addr, node); > + if (!p4d) > + return -ENOMEM; > + pud = vmemmap_pud_populate(p4d, addr, node); > + if (!pud) > + return -ENOMEM; > + pmd = vmemmap_pmd_populate(pud, addr, node); > + if (!pmd) > + return -ENOMEM; > + pte = vmemmap_pte_populate(pmd, addr, node, altmap); > + if (!pte) > + return -ENOMEM; > + vmemmap_verify(pte, node, addr, addr + PAGE_SIZE); > +} > + > +int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long > end, > +int node, struct vmem_altmap *altmap) > +{ > + unsigned long addr = start; > + > for (; addr < end; addr += PAGE_SIZE) { > - pgd = vmemmap_pgd_populate(addr, node); > - if (!pgd) > - return -ENOMEM; > - p4d = vmemmap_p4d_populate(pgd, addr, node); > - if (!p4d) > - return -ENOMEM; > - pud = vmemmap_pud_populate(p4d, addr, node); > - if (!pud) > - return -ENOMEM; > - pmd = vmemmap_pmd_populate(pud, addr, node); > - if (!pmd) > - return -ENOMEM; > - pte = vmemmap_pte_populate(pmd, addr, node, altmap); > - if (!pte) > + if (vmemmap_populate_address(addr, node, altmap)) > return -ENOMEM; > - vmemmap_verify(pte, node, addr, addr + PAGE_SIZE); > } > > return 0; > -- > 2.17.1 > ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation
On Thu, Mar 25, 2021 at 4:10 PM Joao Martins wrote: > > @altmap is stored in a dev_pagemap, but it will be repurposed for > hotplug memory for storing the memmap in the hotplugged memory[*] and > reusing the altmap infrastructure to that end. This is to say that > altmap can't be replaced with a @pgmap as it is going to cover more than > dev_pagemap backend altmaps. I was going to say, just pass the pgmap and lookup the altmap from pgmap, but Oscar added a use case for altmap independent of pgmap. So you might refresh this commit message to clarify why passing pgmap by itself is not sufficient. Other than that, looks good. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
On Wed, May 5, 2021 at 12:50 PM Joao Martins wrote: > > > > On 5/5/21 7:44 PM, Dan Williams wrote: > > On Thu, Mar 25, 2021 at 4:10 PM Joao Martins > > wrote: > >> > >> Add a new align property for struct dev_pagemap which specifies that a > >> pagemap is composed of a set of compound pages of size @align, instead > >> of base pages. When these pages are initialised, most are initialised as > >> tail pages instead of order-0 pages. > >> > >> For certain ZONE_DEVICE users like device-dax which have a fixed page > >> size, this creates an opportunity to optimize GUP and GUP-fast walkers, > >> treating it the same way as THP or hugetlb pages. > >> > >> Signed-off-by: Joao Martins > >> --- > >> include/linux/memremap.h | 13 + > >> mm/memremap.c| 8 ++-- > >> mm/page_alloc.c | 24 +++- > >> 3 files changed, 42 insertions(+), 3 deletions(-) > >> > >> diff --git a/include/linux/memremap.h b/include/linux/memremap.h > >> index b46f63dcaed3..bb28d82dda5e 100644 > >> --- a/include/linux/memremap.h > >> +++ b/include/linux/memremap.h > >> @@ -114,6 +114,7 @@ struct dev_pagemap { > >> struct completion done; > >> enum memory_type type; > >> unsigned int flags; > >> + unsigned long align; > > > > I think this wants some kernel-doc above to indicate that non-zero > > means "use compound pages with tail-page dedup" and zero / PAGE_SIZE > > means "use non-compound base pages". > > Got it. Are you thinking a kernel_doc on top of the variable above, or > preferably on top > of pgmap_align()? I was thinking in dev_pagemap because this value is more than just a plain alignment its restructuring the layout and construction of the memmap(), but when I say it that way it seems much less like a vanilla align value compared to a construction / geometry mode setting. > > > The non-zero value must be > > PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE. > > Hmm, maybe it should be an > > enum: > > > > enum devmap_geometry { > > DEVMAP_PTE, > > DEVMAP_PMD, > > DEVMAP_PUD, > > } > > > I suppose a converter between devmap_geometry and page_size would be needed > too? And maybe > the whole dax/nvdimm align values change meanwhile (as a followup > improvement)? I think it is ok for dax/nvdimm to continue to maintain their align value because it should be ok to have 4MB align if the device really wanted. However, when it goes to map that alignment with memremap_pages() it can pick a mode. For example, it's already the case that dax->align == 1GB is mapped with DEVMAP_PTE today, so they're already separate concepts that can stay separate. > > Although to be fair we only ever care about compound page size in this series > (and > similarly dax/nvdimm @align properties). > > > ...because it's more than just an alignment it's a structural > > definition of how the memmap is laid out. > > > >> const struct dev_pagemap_ops *ops; > >> void *owner; > >> int nr_range; > >> @@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct > >> dev_pagemap *pgmap) > >> return NULL; > >> } > >> > >> +static inline unsigned long pgmap_align(struct dev_pagemap *pgmap) > >> +{ > >> + if (!pgmap || !pgmap->align) > >> + return PAGE_SIZE; > >> + return pgmap->align; > >> +} > >> + > >> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap) > >> +{ > >> + return PHYS_PFN(pgmap_align(pgmap)); > >> +} > >> + > >> #ifdef CONFIG_ZONE_DEVICE > >> bool pfn_zone_device_reserved(unsigned long pfn); > >> void *memremap_pages(struct dev_pagemap *pgmap, int nid); > >> diff --git a/mm/memremap.c b/mm/memremap.c > >> index 805d761740c4..d160853670c4 100644 > >> --- a/mm/memremap.c > >> +++ b/mm/memremap.c > >> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, > >> struct mhp_params *params, > >> memmap_init_zone_device(_DATA(nid)->node_zones[ZONE_DEVICE], > >> PHYS_PFN(range->start), > >> PHYS_PFN(range_len(range)), pgmap); > >> - percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) > >> - - pfn_f
Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
On Thu, Mar 25, 2021 at 4:10 PM Joao Martins wrote: > > Add a new align property for struct dev_pagemap which specifies that a > pagemap is composed of a set of compound pages of size @align, instead > of base pages. When these pages are initialised, most are initialised as > tail pages instead of order-0 pages. > > For certain ZONE_DEVICE users like device-dax which have a fixed page > size, this creates an opportunity to optimize GUP and GUP-fast walkers, > treating it the same way as THP or hugetlb pages. > > Signed-off-by: Joao Martins > --- > include/linux/memremap.h | 13 + > mm/memremap.c| 8 ++-- > mm/page_alloc.c | 24 +++- > 3 files changed, 42 insertions(+), 3 deletions(-) > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > index b46f63dcaed3..bb28d82dda5e 100644 > --- a/include/linux/memremap.h > +++ b/include/linux/memremap.h > @@ -114,6 +114,7 @@ struct dev_pagemap { > struct completion done; > enum memory_type type; > unsigned int flags; > + unsigned long align; I think this wants some kernel-doc above to indicate that non-zero means "use compound pages with tail-page dedup" and zero / PAGE_SIZE means "use non-compound base pages". The non-zero value must be PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE. Hmm, maybe it should be an enum: enum devmap_geometry { DEVMAP_PTE, DEVMAP_PMD, DEVMAP_PUD, } ...because it's more than just an alignment it's a structural definition of how the memmap is laid out. > const struct dev_pagemap_ops *ops; > void *owner; > int nr_range; > @@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct > dev_pagemap *pgmap) > return NULL; > } > > +static inline unsigned long pgmap_align(struct dev_pagemap *pgmap) > +{ > + if (!pgmap || !pgmap->align) > + return PAGE_SIZE; > + return pgmap->align; > +} > + > +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap) > +{ > + return PHYS_PFN(pgmap_align(pgmap)); > +} > + > #ifdef CONFIG_ZONE_DEVICE > bool pfn_zone_device_reserved(unsigned long pfn); > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > diff --git a/mm/memremap.c b/mm/memremap.c > index 805d761740c4..d160853670c4 100644 > --- a/mm/memremap.c > +++ b/mm/memremap.c > @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, > struct mhp_params *params, > memmap_init_zone_device(_DATA(nid)->node_zones[ZONE_DEVICE], > PHYS_PFN(range->start), > PHYS_PFN(range_len(range)), pgmap); > - percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) > - - pfn_first(pgmap, range_id)); > + if (pgmap_align(pgmap) > PAGE_SIZE) > + percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id) > + - pfn_first(pgmap, range_id)) / > pgmap_pfn_align(pgmap)); > + else > + percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) > + - pfn_first(pgmap, range_id)); > return 0; > > err_add_memory: > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 58974067bbd4..3a77f9e43f3a 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone, > unsigned long pfn, end_pfn = start_pfn + nr_pages; > struct pglist_data *pgdat = zone->zone_pgdat; > struct vmem_altmap *altmap = pgmap_altmap(pgmap); > + unsigned int pfn_align = pgmap_pfn_align(pgmap); > + unsigned int order_align = order_base_2(pfn_align); > unsigned long zone_idx = zone_idx(zone); > unsigned long start = jiffies; > int nid = pgdat->node_id; > @@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone, > nr_pages = end_pfn - start_pfn; > } > > - for (pfn = start_pfn; pfn < end_pfn; pfn++) { > + for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) { pfn_align is in bytes and pfn is in pages... is there a "pfn_align >>= PAGE_SHIFT" I missed somewhere? ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm
On Tue, May 4, 2021 at 2:03 AM Aneesh Kumar K.V wrote: > > On 5/4/21 11:13 AM, Pankaj Gupta wrote: > > > >> > >> What this patch series did was to express that property via a device > >> tree node and guest driver enables a hypercall based flush mechanism to > >> ensure persistence. > > > > Would VIRTIO (entirely asynchronous, no trap at host side) based > > mechanism is better > > than hyper-call based? Registering memory can be done any way. We > > implemented virtio-pmem > > flush mechanisms with below considerations: > > > > - Proper semantic for guest flush requests. > > - Efficient mechanism for performance pov. > > > > sure, virio-pmem can be used as an alternative. > > > I am just asking myself if we have platform agnostic mechanism already > > there, maybe > > we can extend it to suit our needs? Maybe I am missing some points here. > > > > What is being attempted in this series is to indicate to the guest OS > that the backing device/file used for emulated nvdimm device cannot > guarantee the persistence via cpu cache flush instructions. Right, the detail I am arguing is that it should be a device description, not a backend file property. In other words this proposal is advocating this: -object memory-backend-file,id=mem1,share,sync-dax=$sync-dax,mem-path=$path -device nvdimm,memdev=mem1 ...and I am advocating for reusing or duplicating the virtio-pmem model like this: -object memory-backend-file,id=mem1,share,mem-path=$path -device spapr-hcall,memdev=mem1 ...because the interface to the guest is the device, not the memory-backend-file. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm
On Mon, May 3, 2021 at 7:06 AM Shivaprasad G Bhat wrote: > > > On 5/1/21 12:44 AM, Dan Williams wrote: > > Some corrections to terminology confusion below... > > > > > > On Wed, Apr 28, 2021 at 8:49 PM Shivaprasad G Bhat > > wrote: > >> The nvdimm devices are expected to ensure write persistence during power > >> failure kind of scenarios. > > No, QEMU is not expected to make that guarantee. QEMU is free to lie > > to the guest about the persistence guarantees of the guest PMEM > > ranges. It's more accurate to say that QEMU nvdimm devices can emulate > > persistent memory and optionally pass through host power-fail > > persistence guarantees to the guest. The power-fail persistence domain > > can be one of "cpu_cache", or "memory_controller" if the persistent > > memory region is "synchronous". If the persistent range is not > > synchronous, it really isn't "persistent memory"; it's memory mapped > > storage that needs I/O commands to flush. > > Since this is virtual nvdimm(v-nvdimm) backed by a file, and the data is > completely > > in the host pagecache, and we need a way to ensure that host pagecaches > > are flushed to the backend. This analogous to the WPQ flush being offloaded > > to the hypervisor. No, it isn't analogous. WPQ flush is an optional mechanism to force data to a higher durability domain. The flush in this interface is mandatory. It's a different class of device. The proposal that "sync-dax=unsafe" for non-PPC architectures, is a fundamental misrepresentation of how this is supposed to work. Rather than make "sync-dax" a first class citizen of the device-description interface I'm proposing that you make this a separate device-type. This also solves the problem that "sync-dax" with an implicit architecture backend assumption is not precise, but a new "non-nvdimm" device type would make it explicit what the host is advertising to the guest. > > > Ref: https://github.com/dgibson/qemu/blob/main/docs/nvdimm.txt > > > > > > >> The libpmem has architecture specific instructions like dcbf on POWER > > Which "libpmem" is this? PMDK is a reference library not a PMEM > > interface... maybe I'm missing what libpmem has to do with QEMU? > > > I was referrering to semantics of flushing pmem cache lines as in > > PMDK/libpmem. > > > > > >> to flush the cache data to backend nvdimm device during normal writes > >> followed by explicit flushes if the backend devices are not synchronous > >> DAX capable. > >> > >> Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest > >> and the subsequent flush doesn't traslate to actual flush to the backend > > s/traslate/translate/ > > > >> file on the host in case of file backed v-nvdimms. This is addressed by > >> virtio-pmem in case of x86_64 by making explicit flushes translating to > >> fsync at qemu. > > Note that virtio-pmem was a proposal for a specific optimization of > > allowing guests to share page cache. The virtio-pmem approach is not > > to be confused with actual persistent memory. > > > >> On SPAPR, the issue is addressed by adding a new hcall to > >> request for an explicit flush from the guest ndctl driver when the backend > > What is an "ndctl" driver? ndctl is userspace tooling, do you mean the > > guest pmem driver? > > > oops, wrong terminologies. I was referring to guest libnvdimm and > > papr_scm kernel modules. > > > > > >> nvdimm cannot ensure write persistence with dcbf alone. So, the approach > >> here is to convey when the hcall flush is required in a device tree > >> property. The guest makes the hcall when the property is found, instead > >> of relying on dcbf. > >> > >> A new device property sync-dax is added to the nvdimm device. When the > >> sync-dax is 'writeback'(default for PPC), device property > >> "hcall-flush-required" is set, and the guest makes hcall H_SCM_FLUSH > >> requesting for an explicit flush. > > I'm not sure "sync-dax" is a suitable name for the property of the > > guest persistent memory. > > > sync-dax property translates ND_REGION_ASYNC flag being set/unset Yes, I am aware, but that property alone is not sufficient to identify the flush mechanism. > > for the pmem region also if the nvdimm_flush callback is provided in the > > papr_scm or not. As everything boils down to synchronous nature > > of the device, I chose sync-dax for the name. > > > > There is
Re: [bug report] system panic at nfit_get_smbios_id+0x6e/0xf0 [nfit] during boot
On Fri, Apr 30, 2021 at 7:28 PM Yi Zhang wrote: > > Hi > > With the latest Linux tree, my DCPMM server boot failed with the > bellow panic log, pls help check it, let me know if you need any test > for it. So v5.12 is ok but v5.12+ is not? Might you be able to bisect? If not can you send the nfit.gz from this command: acpidump -n NFIT | gzip -c > nfit.gz Also can you send the full dmesg? I don't suppose you see a message of this format before this failure: dev_err(acpi_desc->dev, "SPA %d missing DCR %d\n", spa->range_index, dcr); ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm
Some corrections to terminology confusion below... On Wed, Apr 28, 2021 at 8:49 PM Shivaprasad G Bhat wrote: > > The nvdimm devices are expected to ensure write persistence during power > failure kind of scenarios. No, QEMU is not expected to make that guarantee. QEMU is free to lie to the guest about the persistence guarantees of the guest PMEM ranges. It's more accurate to say that QEMU nvdimm devices can emulate persistent memory and optionally pass through host power-fail persistence guarantees to the guest. The power-fail persistence domain can be one of "cpu_cache", or "memory_controller" if the persistent memory region is "synchronous". If the persistent range is not synchronous, it really isn't "persistent memory"; it's memory mapped storage that needs I/O commands to flush. > The libpmem has architecture specific instructions like dcbf on POWER Which "libpmem" is this? PMDK is a reference library not a PMEM interface... maybe I'm missing what libpmem has to do with QEMU? > to flush the cache data to backend nvdimm device during normal writes > followed by explicit flushes if the backend devices are not synchronous > DAX capable. > > Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest > and the subsequent flush doesn't traslate to actual flush to the backend s/traslate/translate/ > file on the host in case of file backed v-nvdimms. This is addressed by > virtio-pmem in case of x86_64 by making explicit flushes translating to > fsync at qemu. Note that virtio-pmem was a proposal for a specific optimization of allowing guests to share page cache. The virtio-pmem approach is not to be confused with actual persistent memory. > On SPAPR, the issue is addressed by adding a new hcall to > request for an explicit flush from the guest ndctl driver when the backend What is an "ndctl" driver? ndctl is userspace tooling, do you mean the guest pmem driver? > nvdimm cannot ensure write persistence with dcbf alone. So, the approach > here is to convey when the hcall flush is required in a device tree > property. The guest makes the hcall when the property is found, instead > of relying on dcbf. > > A new device property sync-dax is added to the nvdimm device. When the > sync-dax is 'writeback'(default for PPC), device property > "hcall-flush-required" is set, and the guest makes hcall H_SCM_FLUSH > requesting for an explicit flush. I'm not sure "sync-dax" is a suitable name for the property of the guest persistent memory. There is no requirement that the memory-backend file for a guest be a dax-capable file. It's also implementation specific what hypercall needs to be invoked for a given occurrence of "sync-dax". What does that map to on non-PPC platforms for example? It seems to me that an "nvdimm" device presents the synchronous usage model and a whole other device type implements an async-hypercall setup that the guest happens to service with its nvdimm stack, but it's not an "nvdimm" anymore at that point. > sync-dax is "unsafe" on all other platforms(x86, ARM) and old pseries machines > prior to 5.2 on PPC. sync-dax="writeback" on ARM and x86_64 is prevented > now as the flush semantics are unimplemented. "sync-dax" has no meaning on its own, I think this needs an explicit mechanism to convey both the "not-sync" property *and* the callback method, it shouldn't be inferred by arch type. > When the backend file is actually synchronous DAX capable and no explicit > flushes are required, the sync-dax mode 'direct' is to be used. > > The below demonstration shows the map_sync behavior with sync-dax writeback & > direct. > (https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c) > > The pmem0 is from nvdimm with With sync-dax=direct, and pmem1 is from > nvdimm with syn-dax=writeback, mounted as > /dev/pmem0 on /mnt1 type xfs > (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota) > /dev/pmem1 on /mnt2 type xfs > (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota) > > [root@atest-guest ~]# ./mapsync /mnt1/newfile > When > sync-dax=unsafe/direct > [root@atest-guest ~]# ./mapsync /mnt2/newfile > when sync-dax=writeback > Failed to mmap with Operation not supported > > The first patch does the header file cleanup necessary for the > subsequent ones. Second patch implements the hcall, adds the necessary > vmstate properties to spapr machine structure for carrying the hcall > status during save-restore. The nature of the hcall being asynchronus, > the patch uses aio utilities to offload the flush. The third patch adds > the 'sync-dax' device property and enables the device tree property > for the guest to utilise the hcall. > > The kernel changes to exploit this hcall is at > https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch > > --- > v3 - https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07916.html > Changes from v3: > - Fixed the forward declaration coding
Re: [PATCH] MAINTAINERS: Move nvdimm mailing list
On Tue, Apr 27, 2021 at 8:58 AM Jonathan Corbet wrote: > > Dan Williams writes: > > > After seeing some users have subscription management trouble, more spam > > than other Linux development lists, and considering some of the benefits > > of kernel.org hosted lists, nvdimm and persistent memory development is > > moving to nvd...@lists.linux.dev. > > > > The old list will remain up until v5.14-rc1 and shutdown thereafter. > > > > Cc: Ira Weiny > > Cc: Oliver O'Halloran > > Cc: Matthew Wilcox > > Cc: Jan Kara > > Cc: Jonathan Corbet > > Cc: Dave Jiang > > Cc: Vishal Verma > > Signed-off-by: Dan Williams > > --- > > Documentation/ABI/obsolete/sysfs-class-dax|2 + > > Documentation/ABI/removed/sysfs-bus-nfit |2 + > > Documentation/ABI/testing/sysfs-bus-nfit | 40 > > + > > Documentation/ABI/testing/sysfs-bus-papr-pmem |4 +-- > > Documentation/driver-api/nvdimm/nvdimm.rst|2 + > > MAINTAINERS | 14 - > > 6 files changed, 32 insertions(+), 32 deletions(-) > > Would you like me to take this through docs-next, or did you have > another path in mind? Thanks for the offer, I have a few other nvdimm related bits for this cycle, so it can go through nvdimm.git. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v6 0/3] dax: Fix missed wakeup in put_unlocked_entry()
On Wed, Apr 28, 2021 at 12:03 PM Vivek Goyal wrote: > > Hi, > > This is V6. Only change since V5 is that I changed order of WAKE_NEXT > and WAKE_ALL in comments too. > > Vivek > > Vivek Goyal (3): > dax: Add an enum for specifying dax wakup mode > dax: Add a wakeup mode parameter to put_unlocked_entry() > dax: Wake up all waiters after invalidating dax entry Thanks Vivek, looks good. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init
On Thu, Mar 25, 2021 at 4:10 PM Joao Martins wrote: > > Move struct page init to an helper function __init_zone_device_page(). Same sentence addition suggestion from the last patch to make this patch have some rationale for existing. > > Signed-off-by: Joao Martins > --- > mm/page_alloc.c | 74 +++-- > 1 file changed, 41 insertions(+), 33 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 43dd98446b0b..58974067bbd4 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6237,6 +6237,46 @@ void __meminit memmap_init_range(unsigned long size, > int nid, unsigned long zone > } > > #ifdef CONFIG_ZONE_DEVICE > +static void __ref __init_zone_device_page(struct page *page, unsigned long > pfn, > + unsigned long zone_idx, int nid, > + struct dev_pagemap *pgmap) > +{ > + > + __init_single_page(page, pfn, zone_idx, nid); > + > + /* > +* Mark page reserved as it will need to wait for onlining > +* phase for it to be fully associated with a zone. > +* > +* We can use the non-atomic __set_bit operation for setting > +* the flag as we are still initializing the pages. > +*/ > + __SetPageReserved(page); > + > + /* > +* ZONE_DEVICE pages union ->lru with a ->pgmap back pointer > +* and zone_device_data. It is a bug if a ZONE_DEVICE page is > +* ever freed or placed on a driver-private list. > +*/ > + page->pgmap = pgmap; > + page->zone_device_data = NULL; > + > + /* > +* Mark the block movable so that blocks are reserved for > +* movable at startup. This will force kernel allocations > +* to reserve their blocks rather than leaking throughout > +* the address space during boot when many long-lived > +* kernel allocations are made. > +* > +* Please note that MEMINIT_HOTPLUG path doesn't clear memmap > +* because this is done early in section_activate() > +*/ > + if (IS_ALIGNED(pfn, pageblock_nr_pages)) { > + set_pageblock_migratetype(page, MIGRATE_MOVABLE); > + cond_resched(); > + } > +} > + > void __ref memmap_init_zone_device(struct zone *zone, >unsigned long start_pfn, >unsigned long nr_pages, > @@ -6265,39 +6305,7 @@ void __ref memmap_init_zone_device(struct zone *zone, > for (pfn = start_pfn; pfn < end_pfn; pfn++) { > struct page *page = pfn_to_page(pfn); > > - __init_single_page(page, pfn, zone_idx, nid); > - > - /* > -* Mark page reserved as it will need to wait for onlining > -* phase for it to be fully associated with a zone. > -* > -* We can use the non-atomic __set_bit operation for setting > -* the flag as we are still initializing the pages. > -*/ > - __SetPageReserved(page); > - > - /* > -* ZONE_DEVICE pages union ->lru with a ->pgmap back pointer > -* and zone_device_data. It is a bug if a ZONE_DEVICE page is > -* ever freed or placed on a driver-private list. > -*/ > - page->pgmap = pgmap; > - page->zone_device_data = NULL; > - > - /* > -* Mark the block movable so that blocks are reserved for > -* movable at startup. This will force kernel allocations > -* to reserve their blocks rather than leaking throughout > -* the address space during boot when many long-lived > -* kernel allocations are made. > -* > -* Please note that MEMINIT_HOTPLUG path doesn't clear memmap > -* because this is done early in section_activate() > -*/ > - if (IS_ALIGNED(pfn, pageblock_nr_pages)) { > - set_pageblock_migratetype(page, MIGRATE_MOVABLE); > - cond_resched(); > - } > + __init_zone_device_page(page, pfn, zone_idx, nid, pgmap); > } > > pr_info("%s initialised %lu pages in %ums\n", __func__, > -- > 2.17.1 > ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts
On Thu, Mar 25, 2021 at 4:10 PM Joao Martins wrote: > > Split the utility function prep_compound_page() into head and tail > counterparts, and use them accordingly. To make this patch stand alone better lets add another sentence: "This is in preparation for sharing the storage for / deduplicating compound page metadata." Other than that, looks good to me. > > Signed-off-by: Joao Martins > --- > mm/page_alloc.c | 32 +--- > 1 file changed, 21 insertions(+), 11 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index c53fe4fa10bf..43dd98446b0b 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -692,24 +692,34 @@ void free_compound_page(struct page *page) > __free_pages_ok(page, compound_order(page), FPI_NONE); > } > > +static void prep_compound_head(struct page *page, unsigned int order) > +{ > + set_compound_page_dtor(page, COMPOUND_PAGE_DTOR); > + set_compound_order(page, order); > + atomic_set(compound_mapcount_ptr(page), -1); > + if (hpage_pincount_available(page)) > + atomic_set(compound_pincount_ptr(page), 0); > +} > + > +static void prep_compound_tail(struct page *head, int tail_idx) > +{ > + struct page *p = head + tail_idx; > + > + set_page_count(p, 0); > + p->mapping = TAIL_MAPPING; > + set_compound_head(p, head); > +} > + > void prep_compound_page(struct page *page, unsigned int order) > { > int i; > int nr_pages = 1 << order; > > __SetPageHead(page); > - for (i = 1; i < nr_pages; i++) { > - struct page *p = page + i; > - set_page_count(p, 0); > - p->mapping = TAIL_MAPPING; > - set_compound_head(p, page); > - } > + for (i = 1; i < nr_pages; i++) > + prep_compound_tail(page, i); > > - set_compound_page_dtor(page, COMPOUND_PAGE_DTOR); > - set_compound_order(page, order); > - atomic_set(compound_mapcount_ptr(page), -1); > - if (hpage_pincount_available(page)) > - atomic_set(compound_pincount_ptr(page), 0); > + prep_compound_head(page, order); > } > > #ifdef CONFIG_DEBUG_PAGEALLOC > -- > 2.17.1 > ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid()
On Thu, Mar 25, 2021 at 4:10 PM Joao Martins wrote: > > memory_failure_dev_pagemap() at the moment assumes base pages (e.g. > dax_lock_page()). For pagemap with compound pages fetch the > compound_head in case we are handling a tail page memory failure. > > Currently this is a nop, but in the advent of compound pages in > dev_pagemap it allows memory_failure_dev_pagemap() to keep working. > > Reported-by: Jane Chu > Signed-off-by: Joao Martins > --- > mm/memory-failure.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 24210c9bd843..94240d772623 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -1318,6 +1318,8 @@ static int memory_failure_dev_pagemap(unsigned long > pfn, int flags, > goto out; > } > > + page = compound_head(page); Unless / until we do compound pages for the filesystem-dax case, I would add a comment like: /* pages instantiated by device-dax (not filesystem-dax) may be compound pages */ ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v4 0/3] dax: Fix missed wakeup in put_unlocked_entry()
On Fri, Apr 23, 2021 at 6:07 AM Vivek Goyal wrote: > > Hi, > > This is V4 of the patches. Posted V3 here. > > https://lore.kernel.org/linux-fsdevel/20210419213636.1514816-1-vgo...@redhat.com/ > > Changes since V3 are. > > - Renamed "enum dax_entry_wake_mode" to "enum dax_wake_mode" (Matthew Wilcox) > - Changed description of WAKE_NEXT and WAKE_ALL (Jan Kara) > - Got rid of a comment (Greg Kurz) Looks good Vivek, thanks for the resend. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [Virtio-fs] [PATCH v3 2/3] dax: Add a wakeup mode parameter to put_unlocked_entry()
On Wed, Apr 21, 2021 at 11:25 PM Christoph Hellwig wrote: > > On Wed, Apr 21, 2021 at 12:09:54PM -0700, Dan Williams wrote: > > Can you get in the habit of not replying inline with new patches like > > this? Collect the review feedback, take a pause, and resend the full > > series so tooling like b4 and patchwork can track when a new posting > > supersedes a previous one. As is, this inline style inflicts manual > > effort on the maintainer. > > Honestly I don't mind it at all. If you shiny new tooling can't handle > it maybe you should fix your shiny new tooling instead of changing > everyones workflow? I think asking a submitter to resend a series is par for the course, especially for poor saps like me burdened by corporate email systems. Vivek, if this is too onerous a request just give me a heads up and I'll manually pull out the patch content from your replies. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [Virtio-fs] [PATCH v3 2/3] dax: Add a wakeup mode parameter to put_unlocked_entry()
On Tue, Apr 20, 2021 at 7:01 AM Vivek Goyal wrote: > > On Tue, Apr 20, 2021 at 09:34:20AM +0200, Greg Kurz wrote: > > On Mon, 19 Apr 2021 17:36:35 -0400 > > Vivek Goyal wrote: > > > > > As of now put_unlocked_entry() always wakes up next waiter. In next > > > patches we want to wake up all waiters at one callsite. Hence, add a > > > parameter to the function. > > > > > > This patch does not introduce any change of behavior. > > > > > > Suggested-by: Dan Williams > > > Signed-off-by: Vivek Goyal > > > --- > > > fs/dax.c | 13 +++-- > > > 1 file changed, 7 insertions(+), 6 deletions(-) > > > > > > diff --git a/fs/dax.c b/fs/dax.c > > > index 00978d0838b1..f19d76a6a493 100644 > > > --- a/fs/dax.c > > > +++ b/fs/dax.c > > > @@ -275,11 +275,12 @@ static void wait_entry_unlocked(struct xa_state > > > *xas, void *entry) > > > finish_wait(wq, ); > > > } > > > > > > -static void put_unlocked_entry(struct xa_state *xas, void *entry) > > > +static void put_unlocked_entry(struct xa_state *xas, void *entry, > > > + enum dax_entry_wake_mode mode) > > > { > > > /* If we were the only waiter woken, wake the next one */ > > > > With this change, the comment is no longer accurate since the > > function can now wake all waiters if passed mode == WAKE_ALL. > > Also, it paraphrases the code which is simple enough, so I'd > > simply drop it. > > > > This is minor though and it shouldn't prevent this fix to go > > forward. > > > > Reviewed-by: Greg Kurz > > Ok, here is the updated patch which drops that comment line. > > Vivek Hi Vivek, Can you get in the habit of not replying inline with new patches like this? Collect the review feedback, take a pause, and resend the full series so tooling like b4 and patchwork can track when a new posting supersedes a previous one. As is, this inline style inflicts manual effort on the maintainer. > > Subject: dax: Add a wakeup mode parameter to put_unlocked_entry() > > As of now put_unlocked_entry() always wakes up next waiter. In next > patches we want to wake up all waiters at one callsite. Hence, add a > parameter to the function. > > This patch does not introduce any change of behavior. > > Suggested-by: Dan Williams > Signed-off-by: Vivek Goyal > --- > fs/dax.c | 14 +++--- > 1 file changed, 7 insertions(+), 7 deletions(-) > > Index: redhat-linux/fs/dax.c > === > --- redhat-linux.orig/fs/dax.c 2021-04-20 09:55:45.105069893 -0400 > +++ redhat-linux/fs/dax.c 2021-04-20 09:56:27.685822730 -0400 > @@ -275,11 +275,11 @@ static void wait_entry_unlocked(struct x > finish_wait(wq, ); > } > > -static void put_unlocked_entry(struct xa_state *xas, void *entry) > +static void put_unlocked_entry(struct xa_state *xas, void *entry, > + enum dax_entry_wake_mode mode) > { > - /* If we were the only waiter woken, wake the next one */ > if (entry && !dax_is_conflict(entry)) > - dax_wake_entry(xas, entry, WAKE_NEXT); > + dax_wake_entry(xas, entry, mode); > } > > /* > @@ -633,7 +633,7 @@ struct page *dax_layout_busy_page_range( > entry = get_unlocked_entry(, 0); > if (entry) > page = dax_busy_page(entry); > - put_unlocked_entry(, entry); > + put_unlocked_entry(, entry, WAKE_NEXT); > if (page) > break; > if (++scanned % XA_CHECK_SCHED) > @@ -675,7 +675,7 @@ static int __dax_invalidate_entry(struct > mapping->nrexceptional--; > ret = 1; > out: > - put_unlocked_entry(, entry); > + put_unlocked_entry(, entry, WAKE_NEXT); > xas_unlock_irq(); > return ret; > } > @@ -954,7 +954,7 @@ static int dax_writeback_one(struct xa_s > return ret; > > put_unlocked: > - put_unlocked_entry(xas, entry); > + put_unlocked_entry(xas, entry, WAKE_NEXT); > return ret; > } > > @@ -1695,7 +1695,7 @@ dax_insert_pfn_mkwrite(struct vm_fault * > /* Did we race with someone splitting entry or so? */ > if (!entry || dax_is_conflict(entry) || > (order == 0 && !dax_is_pte_entry(entry))) { > - put_unlocked_entry(, entry); > + put_unlocked_entry(, entry, WAKE_NEXT); > xas_unlock_irq(); > trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf, > VM_FAULT_NOPAGE); > ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[PATCH] MAINTAINERS: Move nvdimm mailing list
After seeing some users have subscription management trouble, more spam than other Linux development lists, and considering some of the benefits of kernel.org hosted lists, nvdimm and persistent memory development is moving to nvd...@lists.linux.dev. The old list will remain up until v5.14-rc1 and shutdown thereafter. Cc: Ira Weiny Cc: Oliver O'Halloran Cc: Matthew Wilcox Cc: Jan Kara Cc: Jonathan Corbet Cc: Dave Jiang Cc: Vishal Verma Signed-off-by: Dan Williams --- Documentation/ABI/obsolete/sysfs-class-dax|2 + Documentation/ABI/removed/sysfs-bus-nfit |2 + Documentation/ABI/testing/sysfs-bus-nfit | 40 + Documentation/ABI/testing/sysfs-bus-papr-pmem |4 +-- Documentation/driver-api/nvdimm/nvdimm.rst|2 + MAINTAINERS | 14 - 6 files changed, 32 insertions(+), 32 deletions(-) diff --git a/Documentation/ABI/obsolete/sysfs-class-dax b/Documentation/ABI/obsolete/sysfs-class-dax index 0faf1354cd05..5bcce27458e3 100644 --- a/Documentation/ABI/obsolete/sysfs-class-dax +++ b/Documentation/ABI/obsolete/sysfs-class-dax @@ -1,7 +1,7 @@ What: /sys/class/dax/ Date: May, 2016 KernelVersion: v4.7 -Contact:linux-nvdimm@lists.01.org +Contact:nvd...@lists.linux.dev Description: Device DAX is the device-centric analogue of Filesystem DAX (CONFIG_FS_DAX). It allows memory ranges to be allocated and mapped without need of an intervening file diff --git a/Documentation/ABI/removed/sysfs-bus-nfit b/Documentation/ABI/removed/sysfs-bus-nfit index ae8c1ca53828..277437005def 100644 --- a/Documentation/ABI/removed/sysfs-bus-nfit +++ b/Documentation/ABI/removed/sysfs-bus-nfit @@ -1,7 +1,7 @@ What: /sys/bus/nd/devices/regionX/nfit/ecc_unit_size Date: Aug, 2017 KernelVersion: v4.14 (Removed v4.18) -Contact: linux-nvdimm@lists.01.org +Contact: nvd...@lists.linux.dev Description: (RO) Size of a write request to a DIMM that will not incur a read-modify-write cycle at the memory controller. diff --git a/Documentation/ABI/testing/sysfs-bus-nfit b/Documentation/ABI/testing/sysfs-bus-nfit index 63ef0b9ecce7..e7282d184a74 100644 --- a/Documentation/ABI/testing/sysfs-bus-nfit +++ b/Documentation/ABI/testing/sysfs-bus-nfit @@ -5,7 +5,7 @@ Interface Table (NFIT)' section in the ACPI specification What: /sys/bus/nd/devices/nmemX/nfit/serial Date: Jun, 2015 KernelVersion: v4.2 -Contact: linux-nvdimm@lists.01.org +Contact: nvd...@lists.linux.dev Description: (RO) Serial number of the NVDIMM (non-volatile dual in-line memory module), assigned by the module vendor. @@ -14,7 +14,7 @@ Description: What: /sys/bus/nd/devices/nmemX/nfit/handle Date: Apr, 2015 KernelVersion: v4.2 -Contact: linux-nvdimm@lists.01.org +Contact: nvd...@lists.linux.dev Description: (RO) The address (given by the _ADR object) of the device on its parent bus of the NVDIMM device containing the NVDIMM region. @@ -23,7 +23,7 @@ Description: What: /sys/bus/nd/devices/nmemX/nfit/device Date: Apr, 2015 KernelVersion: v4.1 -Contact: linux-nvdimm@lists.01.org +Contact: nvd...@lists.linux.dev Description: (RO) Device id for the NVDIMM, assigned by the module vendor. @@ -31,7 +31,7 @@ Description: What: /sys/bus/nd/devices/nmemX/nfit/rev_id Date: Jun, 2015 KernelVersion: v4.2 -Contact: linux-nvdimm@lists.01.org +Contact: nvd...@lists.linux.dev Description: (RO) Revision of the NVDIMM, assigned by the module vendor. @@ -39,7 +39,7 @@ Description: What: /sys/bus/nd/devices/nmemX/nfit/phys_id Date: Apr, 2015 KernelVersion: v4.2 -Contact: linux-nvdimm@lists.01.org +Contact: nvd...@lists.linux.dev Description: (RO) Handle (i.e., instance number) for the SMBIOS (system management BIOS) Memory Device structure describing the NVDIMM @@ -49,7 +49,7 @@ Description: What: /sys/bus/nd/devices/nmemX/nfit/flags Date: Jun, 2015 KernelVersion: v4.2 -Contact: linux-nvdimm@lists.01.org +Contact: nvd...@lists.linux.dev Description: (RO) The flags in the NFIT memory device sub-structure indicate the state of the data on the nvdimm relative to its energy @@ -68,7 +68,7 @@ What: /sys/bus/nd/devices/nmemX/nfit/format1 What: /sys/bus/nd/devices/nmemX/nfit/formats Date: Apr, 2016 KernelVersion: v4.7 -Contact: linux-nvdimm@lists.01.org +Contact: nvd...@lists.linux.dev Description: (RO) The interface codes indicate support for persistent memory mapped directly into system physical address space and / or a @@ -84,7 +84,7
Re: [PATCH -next v2] tools/testing/nvdimm: Make symbol '__nfit_test_ioremap' static
On Tue, Apr 20, 2021 at 12:31 AM Zou Wei wrote: > > The sparse tool complains as follows: > > tools/testing/nvdimm/test/iomap.c:65:14: warning: > symbol '__nfit_test_ioremap' was not declared. Should it be static? > > This symbol is not used outside of iomap.c, so this > commit marks it static. > > Reported-by: Hulk Robot > Signed-off-by: Zou Wei Looks good to me, thanks. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[ANNOUNCE] NVDIMM development move to nvd...@lists.linux.dev
There have been multiple occasions recently where people reported trouble managing their subscription to linux-nvdimm@lists.01.org. The spam filtering also appears less robust compared to other Linux development lists. There are also side benefits to being on kernel.org infrastructure as it has a direct path to inject messages into the public-inbox repository on lore, and feed the patchwork-bot. You can find the subscription link here: https://subspace.kernel.org/lists.linux.dev.html For the v5.13 merge window I'll submit a patch to switch the list from linux-nvdimm@lists.01.org to nvd...@lists.linux.dev, and plan to shutdown linux-nvdimm@ a couple months later around the v5.14 merge window. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] libnvdimm.h: Remove duplicate struct declaration
On Tue, Apr 20, 2021 at 6:39 AM Santosh Sivaraj wrote: > > Hi Ira, > > Ira Weiny writes: > > > On Mon, Apr 19, 2021 at 07:27:25PM +0800, Wan Jiabing wrote: > >> struct device is declared at 133rd line. > >> The declaration here is unnecessary. Remove it. > >> > >> Signed-off-by: Wan Jiabing > >> --- > >> include/linux/libnvdimm.h | 1 - > >> 1 file changed, 1 deletion(-) > >> > >> diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h > >> index 01f251b6e36c..89b69e645ac7 100644 > >> --- a/include/linux/libnvdimm.h > >> +++ b/include/linux/libnvdimm.h > >> @@ -141,7 +141,6 @@ static inline void __iomem *devm_nvdimm_ioremap(struct > >> device *dev, > >> > >> struct nvdimm_bus; > >> struct module; > >> -struct device; > >> struct nd_blk_region; > > > > What is the coding style preference for pre-declarations like this? Should > > they be placed at the top of the file? > > > > The patch is reasonable but if the intent is to declare right before use for > > clarity, both devm_nvdimm_memremap() and nd_blk_region_desc() use struct > > device. So perhaps this duplicate is on purpose? > > There are other struct device usage much later in the file, which doesn't have > any pre-declarations for struct device. So I assume this might not be on > purpose :-) Yeah, I believe it was just code movement and the duplicate was inadvertently introduced. Patch looks ok to me. > > On a side note, types.h can also be removed, since it's already included in > kernel.h. That I don't necessarily agree with, it just makes future header reworks more fraught for not much benefit. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH][v2] dax: Fix missed wakeup during dax entry invalidation
On Mon, Apr 19, 2021 at 11:45 AM Vivek Goyal wrote: > > This is V2 of the patch. Posted V1 here. > > https://lore.kernel.org/linux-fsdevel/20210416173524.ga1379...@redhat.com/ > > Based on feedback from Dan and Jan, modified the patch to wake up > all waiters when dax entry is invalidated. This solves the issues > of missed wakeups. Care to send a formal patch with this commentary moved below the --- line? One style fixup below... > > I am seeing missed wakeups which ultimately lead to a deadlock when I am > using virtiofs with DAX enabled and running "make -j". I had to mount > virtiofs as rootfs and also reduce to dax window size to 256M to reproduce > the problem consistently. > > So here is the problem. put_unlocked_entry() wakes up waiters only > if entry is not null as well as !dax_is_conflict(entry). But if I > call multiple instances of invalidate_inode_pages2() in parallel, > then I can run into a situation where there are waiters on > this index but nobody will wait these. > > invalidate_inode_pages2() > invalidate_inode_pages2_range() > invalidate_exceptional_entry2() > dax_invalidate_mapping_entry_sync() > __dax_invalidate_entry() { > xas_lock_irq(); > entry = get_unlocked_entry(, 0); > ... > ... > dax_disassociate_entry(entry, mapping, trunc); > xas_store(, NULL); > ... > ... > put_unlocked_entry(, entry); > xas_unlock_irq(); > } > > Say a fault in in progress and it has locked entry at offset say "0x1c". > Now say three instances of invalidate_inode_pages2() are in progress > (A, B, C) and they all try to invalidate entry at offset "0x1c". Given > dax entry is locked, all tree instances A, B, C will wait in wait queue. > > When dax fault finishes, say A is woken up. It will store NULL entry > at index "0x1c" and wake up B. When B comes along it will find "entry=0" > at page offset 0x1c and it will call put_unlocked_entry(, 0). And > this means put_unlocked_entry() will not wake up next waiter, given > the current code. And that means C continues to wait and is not woken > up. > > This patch fixes the issue by waking up all waiters when a dax entry > has been invalidated. This seems to fix the deadlock I am facing > and I can make forward progress. > > Reported-by: Sergio Lopez > Signed-off-by: Vivek Goyal > --- > fs/dax.c | 12 ++-- > 1 file changed, 6 insertions(+), 6 deletions(-) > > Index: redhat-linux/fs/dax.c > === > --- redhat-linux.orig/fs/dax.c 2021-04-16 14:16:44.332140543 -0400 > +++ redhat-linux/fs/dax.c 2021-04-19 11:24:11.465213474 -0400 > @@ -264,11 +264,11 @@ static void wait_entry_unlocked(struct x > finish_wait(wq, ); > } > > -static void put_unlocked_entry(struct xa_state *xas, void *entry) > +static void put_unlocked_entry(struct xa_state *xas, void *entry, bool > wake_all) > { > /* If we were the only waiter woken, wake the next one */ > if (entry && !dax_is_conflict(entry)) > - dax_wake_entry(xas, entry, false); > + dax_wake_entry(xas, entry, wake_all); > } > > /* > @@ -622,7 +622,7 @@ struct page *dax_layout_busy_page_range( > entry = get_unlocked_entry(, 0); > if (entry) > page = dax_busy_page(entry); > - put_unlocked_entry(, entry); > + put_unlocked_entry(, entry, false); I'm not a fan of raw true/false arguments because if you read this line in isolation you need to go read put_unlocked_entry() to recall what that argument means. So lets add something like: /** * enum dax_entry_wake_mode: waitqueue wakeup toggle * @WAKE_NEXT: entry was not mutated * @WAKE_ALL: entry was invalidated, or resized */ enum dax_entry_wake_mode { WAKE_NEXT, WAKE_ALL, } ...and use that as the arg for dax_wake_entry(). So I'd expect this to be a 3 patch series, introduce dax_entry_wake_mode for dax_wake_entry(), introduce the argument for put_unlocked_entry() without changing the logic, and finally this bug fix. Feel free to add 'Fixes: ac401cc78242 ("dax: New fault locking")' in case you feel this needs to be backported. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[GIT PULL] libnvdimm fixes for v5.12-rc8 / final
Hi Linus, please pull from: ...to receive a handful of libnvdimm fixups. The largest change is for a regression that landed during -rc1 for block-device read-only handling. Vaibhav found a new use for the ability (originally introduced by virtio_pmem) to call back to the platform to flush data, but also found an original bug in that implementation. Lastly, Arnd cleans up some compile warnings in dax. This has all appeared in -next with no reported issues. --- The following changes since commit e49d033bddf5b565044e2abe4241353959bc9120: Linux 5.12-rc6 (2021-04-04 14:15:36 -0700) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-fixes-for-5.12-rc8 for you to fetch changes up to 11d2498f1568a0f923dc8ef7621de15a9e89267f: Merge branch 'for-5.12/dax' into libnvdimm-fixes (2021-04-09 22:00:09 -0700) libnvdimm fixes for v5.12-rc8 - Fix a regression of read-only handling in the pmem driver. - Fix a compile warning. - Fix support for platform cache flush commands on powerpc/papr Arnd Bergmann (1): dax: avoid -Wempty-body warnings Dan Williams (2): libnvdimm: Notify disk drivers to revalidate region read-only Merge branch 'for-5.12/dax' into libnvdimm-fixes Vaibhav Jain (1): libnvdimm/region: Fix nvdimm_has_flush() to handle ND_REGION_ASYNC drivers/dax/bus.c| 6 ++ drivers/nvdimm/bus.c | 14 ++ drivers/nvdimm/pmem.c| 37 + drivers/nvdimm/region_devs.c | 16 ++-- include/linux/nd.h | 1 + 5 files changed, 56 insertions(+), 18 deletions(-) ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v1 1/1] libnvdimm: Don't use GUID APIs against raw buffer
On Fri, Apr 16, 2021 at 1:42 PM Andy Shevchenko wrote: > > On Fri, Apr 16, 2021 at 01:08:06PM -0700, Dan Williams wrote: > > [ add Erik ] > > > > On Fri, Apr 16, 2021 at 10:36 AM Andy Shevchenko > > wrote: > > > > > > On Thu, Apr 15, 2021 at 05:37:54PM +0300, Andy Shevchenko wrote: > > > > Strictly speaking the comparison between guid_t and raw buffer > > > > is not correct. Return to plain memcmp() since the data structures > > > > haven't changed to use uuid_t / guid_t the current state of affairs > > > > is inconsistent. Either it should be changed altogether or left > > > > as is. > > > > > > Dan, please review this one as well. I think here you may agree with me. > > > > You know, this is all a problem because ACPICA is using a raw buffer. > > And this is fine. It might be any other representation as well. > > > Erik, would it be possible to use the guid_t type in ACPICA? That > > would allow NFIT to drop some ugly casts. > > guid_t is internal kernel type. If we ever decide to deviate from the current > representation it wouldn't be possible in case a 3rd party is using it 1:1 > (via typedef or so). I'm thinking something like ACPICA defining that space as a union with the correct typing just for Linux. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v1 1/1] libnvdimm: Don't use GUID APIs against raw buffer
[ add Erik ] On Fri, Apr 16, 2021 at 10:36 AM Andy Shevchenko wrote: > > On Thu, Apr 15, 2021 at 05:37:54PM +0300, Andy Shevchenko wrote: > > Strictly speaking the comparison between guid_t and raw buffer > > is not correct. Return to plain memcmp() since the data structures > > haven't changed to use uuid_t / guid_t the current state of affairs > > is inconsistent. Either it should be changed altogether or left > > as is. > > Dan, please review this one as well. I think here you may agree with me. You know, this is all a problem because ACPICA is using a raw buffer. Erik, would it be possible to use the guid_t type in ACPICA? That would allow NFIT to drop some ugly casts. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] dax: Fix missed wakeup in put_unlocked_entry()
On Fri, Apr 16, 2021 at 10:35 AM Vivek Goyal wrote: > > I am seeing missed wakeups which ultimately lead to a deadlock when I am > using virtiofs with DAX enabled and running "make -j". I had to mount > virtiofs as rootfs and also reduce to dax window size to 32M to reproduce > the problem consistently. > > This is not a complete patch. I am just proposing this partial fix to > highlight the issue and trying to figure out how it should be fixed. > Should it be fixed in generic dax code or should filesystem (fuse/virtiofs) > take care of this. > > So here is the problem. put_unlocked_entry() wakes up waiters only > if entry is not null as well as !dax_is_conflict(entry). But if I > call multiple instances of invalidate_inode_pages2() in parallel, > then I can run into a situation where there are waiters on > this index but nobody will wait these. > > invalidate_inode_pages2() > invalidate_inode_pages2_range() > invalidate_exceptional_entry2() > dax_invalidate_mapping_entry_sync() > __dax_invalidate_entry() { > xas_lock_irq(); > entry = get_unlocked_entry(, 0); > ... > ... > dax_disassociate_entry(entry, mapping, trunc); > xas_store(, NULL); > ... > ... > put_unlocked_entry(, entry); > xas_unlock_irq(); > } > > Say a fault in in progress and it has locked entry at offset say "0x1c". > Now say three instances of invalidate_inode_pages2() are in progress > (A, B, C) and they all try to invalidate entry at offset "0x1c". Given > dax entry is locked, all tree instances A, B, C will wait in wait queue. > > When dax fault finishes, say A is woken up. It will store NULL entry > at index "0x1c" and wake up B. When B comes along it will find "entry=0" > at page offset 0x1c and it will call put_unlocked_entry(, 0). And > this means put_unlocked_entry() will not wake up next waiter, given > the current code. And that means C continues to wait and is not woken > up. > > In my case I am seeing that dax page fault path itself is waiting > on grab_mapping_entry() and also invalidate_inode_page2() is > waiting in get_unlocked_entry() but entry has already been cleaned > up and nobody woke up these processes. Atleast I think that's what > is happening. > > This patch wakes up a process even if entry=0. And deadlock does not > happen. I am running into some OOM issues, that will debug. > > So my question is that is it a dax issue and should it be fixed in > dax layer. Or should it be handled in fuse to make sure that > multiple instances of invalidate_inode_pages2() on same inode > don't make progress in parallel and introduce enough locking > around it. > > Right now fuse_finish_open() calls invalidate_inode_pages2() without > any locking. That allows it to make progress in parallel to dax > fault path as well as allows multiple instances of invalidate_inode_pages2() > to run in parallel. > > Not-yet-signed-off-by: Vivek Goyal > --- > fs/dax.c |7 --- > 1 file changed, 4 insertions(+), 3 deletions(-) > > Index: redhat-linux/fs/dax.c > === > --- redhat-linux.orig/fs/dax.c 2021-04-16 12:50:40.141363317 -0400 > +++ redhat-linux/fs/dax.c 2021-04-16 12:51:42.385926390 -0400 > @@ -266,9 +266,10 @@ static void wait_entry_unlocked(struct x > > static void put_unlocked_entry(struct xa_state *xas, void *entry) > { > - /* If we were the only waiter woken, wake the next one */ > - if (entry && !dax_is_conflict(entry)) > - dax_wake_entry(xas, entry, false); > + if (dax_is_conflict(entry)) > + return; > + > + dax_wake_entry(xas, entry, false); How does this work if entry is NULL? dax_entry_waitqueue() will not know if it needs to adjust the index. I think the fix might be to specify that put_unlocked_entry() in the invalidate path needs to do a wake_up_all(). ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v1 1/1] ACPI: NFIT: Import GUID before use
On Fri, Apr 16, 2021 at 10:34 AM Andy Shevchenko wrote: > > On Fri, Apr 16, 2021 at 09:15:34AM -0700, Dan Williams wrote: > > On Fri, Apr 16, 2021 at 1:58 AM Andy Shevchenko > > wrote: > > > On Fri, Apr 16, 2021 at 8:28 AM Dan Williams > > > wrote: > > > > On Thu, Apr 15, 2021 at 6:59 AM Andy Shevchenko > > > > wrote: > > > > > > > > > > Strictly speaking the comparison between guid_t and raw buffer > > > > > is not correct. Import GUID to variable of guid_t type and then > > > > > compare. > > > > > > > > Hmm, what about something like the following instead, because it adds > > > > safety. Any concerns about evaluating x twice in a macro should be > > > > alleviated by the fact that ARRAY_SIZE() will fail the build if (x) is > > > > not an array. > > > > > > ARRAY_SIZE doesn't check type. > > > > See __must_be_array. > > > > > I don't like hiding ugly casts like this. > > > > See PTR_ERR, ERR_PTR, ERR_CAST. > > It's special, i.e. error pointer case. We don't handle such here. > > > There's nothing broken about the way the code currently stands, so I'd > > rather try to find something to move the implementation forward than > > sideways. > > Submit a patch then. I rest my case b/c I consider that ugly castings worse > than additional API call, although it's not ideal. It sounds like you'll NAK that patch, and I'm not too enthusiastic about these proposed changes either because I disagree that the code is incorrect. Is there another compromise? ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v1 1/1] ACPI: NFIT: Import GUID before use
On Fri, Apr 16, 2021 at 1:58 AM Andy Shevchenko wrote: > > On Fri, Apr 16, 2021 at 8:28 AM Dan Williams wrote: > > > > On Thu, Apr 15, 2021 at 6:59 AM Andy Shevchenko > > wrote: > > > > > > Strictly speaking the comparison between guid_t and raw buffer > > > is not correct. Import GUID to variable of guid_t type and then > > > compare. > > > > Hmm, what about something like the following instead, because it adds > > safety. Any concerns about evaluating x twice in a macro should be > > alleviated by the fact that ARRAY_SIZE() will fail the build if (x) is > > not an array. > > ARRAY_SIZE doesn't check type. See __must_be_array. > I don't like hiding ugly casts like this. See PTR_ERR, ERR_PTR, ERR_CAST. There's nothing broken about the way the code currently stands, so I'd rather try to find something to move the implementation forward than sideways. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v1 1/1] ACPI: NFIT: Import GUID before use
On Thu, Apr 15, 2021 at 6:59 AM Andy Shevchenko wrote: > > Strictly speaking the comparison between guid_t and raw buffer > is not correct. Import GUID to variable of guid_t type and then > compare. Hmm, what about something like the following instead, because it adds safety. Any concerns about evaluating x twice in a macro should be alleviated by the fact that ARRAY_SIZE() will fail the build if (x) is not an array. diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c index 8c5dde628405..bac01eec07a6 100644 --- a/drivers/acpi/nfit/core.c +++ b/drivers/acpi/nfit/core.c @@ -681,7 +681,7 @@ int nfit_spa_type(struct acpi_nfit_system_address *spa) int i; for (i = 0; i < NFIT_UUID_MAX; i++) - if (guid_equal(to_nfit_uuid(i), (guid_t *)>range_guid)) + if (guid_equal(to_nfit_uuid(i), cast_guid(spa->range_guid))) return i; return -1; } diff --git a/include/linux/uuid.h b/include/linux/uuid.h index 8cdc0d3567cd..cec1dc2ab994 100644 --- a/include/linux/uuid.h +++ b/include/linux/uuid.h @@ -33,6 +33,9 @@ typedef struct { extern const guid_t guid_null; extern const uuid_t uuid_null; +#define cast_guid(x) ({ BUILD_BUG_ON(ARRAY_SIZE(x) != 16); (guid_t *)&(x); }) +#define cast_uuid(x) ({ BUILD_BUG_ON(ARRAY_SIZE(x) != 16); (uuid_t *)&(x); }) + static inline bool guid_equal(const guid_t *u1, const guid_t *u2) { return memcmp(u1, u2, sizeof(guid_t)) == 0; ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible
On Thu, Apr 15, 2021 at 4:44 AM Vaibhav Jain wrote: > > Thanks for looking into this Dan, > > Dan Williams writes: > > > On Wed, Apr 14, 2021 at 5:40 AM Vaibhav Jain wrote: > >> > >> Currently drc_pmem_qeury_stats() generates a dev_err in case > >> "Enable Performance Information Collection" feature is disabled from > >> HMC. The error is of the form below: > >> > >> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query > >> performance stats, Err:-10 > >> > >> This error message confuses users as it implies a possible problem > >> with the nvdimm even though its due to a disabled feature. > >> > >> So we fix this by explicitly handling the H_AUTHORITY error from the > >> H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an > >> error, saying that "Performance stats in-accessible". > >> > >> Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from > >> PHYP') > >> Signed-off-by: Vaibhav Jain > >> --- > >> arch/powerpc/platforms/pseries/papr_scm.c | 3 +++ > >> 1 file changed, 3 insertions(+) > >> > >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c > >> b/arch/powerpc/platforms/pseries/papr_scm.c > >> index 835163f54244..9216424f8be3 100644 > >> --- a/arch/powerpc/platforms/pseries/papr_scm.c > >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c > >> @@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct > >> papr_scm_priv *p, > >> dev_err(>pdev->dev, > >> "Unknown performance stats, Err:0x%016lX\n", > >> ret[0]); > >> return -ENOENT; > >> + } else if (rc == H_AUTHORITY) { > >> + dev_warn(>pdev->dev, "Performance stats in-accessible"); > >> + return -EPERM; > > > > So userspace can spam the kernel log? Why is kernel log message needed > > at all? EPERM told the caller what happened. > Currently this error message is only reported during probe of the > nvdimm. So userspace cannot directly spam kernel log. Oh, ok, I saw things like papr_pdsm_fuel_gauge() in the call stack and thought this was reachable through an ioctl. Sorry for the noise. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible
On Wed, Apr 14, 2021 at 5:40 AM Vaibhav Jain wrote: > > Currently drc_pmem_qeury_stats() generates a dev_err in case > "Enable Performance Information Collection" feature is disabled from > HMC. The error is of the form below: > > papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query > performance stats, Err:-10 > > This error message confuses users as it implies a possible problem > with the nvdimm even though its due to a disabled feature. > > So we fix this by explicitly handling the H_AUTHORITY error from the > H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an > error, saying that "Performance stats in-accessible". > > Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from > PHYP') > Signed-off-by: Vaibhav Jain > --- > arch/powerpc/platforms/pseries/papr_scm.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/arch/powerpc/platforms/pseries/papr_scm.c > b/arch/powerpc/platforms/pseries/papr_scm.c > index 835163f54244..9216424f8be3 100644 > --- a/arch/powerpc/platforms/pseries/papr_scm.c > +++ b/arch/powerpc/platforms/pseries/papr_scm.c > @@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv > *p, > dev_err(>pdev->dev, > "Unknown performance stats, Err:0x%016lX\n", ret[0]); > return -ENOENT; > + } else if (rc == H_AUTHORITY) { > + dev_warn(>pdev->dev, "Performance stats in-accessible"); > + return -EPERM; So userspace can spam the kernel log? Why is kernel log message needed at all? EPERM told the caller what happened. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v1] libnvdimm, dax: Fix a missing check in nd_dax_probe()
On Fri, Apr 9, 2021 at 5:33 PM wrote: > > From: Yingjie Wang > > In nd_dax_probe(), nd_dax_alloc() may fail and return NULL. > Check for NULL before attempting to > use nd_dax to avoid a NULL pointer dereference. > > Fixes: c5ed9268643c ("libnvdimm, dax: autodetect support") > Signed-off-by: Yingjie Wang > --- > drivers/nvdimm/dax_devs.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c > index 99965077bac4..b1426ac03f01 100644 > --- a/drivers/nvdimm/dax_devs.c > +++ b/drivers/nvdimm/dax_devs.c > @@ -106,6 +106,8 @@ int nd_dax_probe(struct device *dev, struct > nd_namespace_common *ndns) > > nvdimm_bus_lock(>dev); hmmm... > nd_dax = nd_dax_alloc(nd_region); > + if (!nd_dax) > + return -ENOMEM; Can you spot the bug this introduces? See the hint above. > nd_pfn = _dax->nd_pfn; > dax_dev = nd_pfn_devinit(nd_pfn, ndns); > nvdimm_bus_unlock(>dev); > -- > 2.7.4 > ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] libnvdimm/region: Update nvdimm_has_flush() to handle ND_REGION_ASYNC
On Mon, Apr 5, 2021 at 4:31 AM Aneesh Kumar K.V wrote: > > Vaibhav Jain writes: > > > In case a platform doesn't provide explicit flush-hints but provides an > > explicit flush callback via ND_REGION_ASYNC region flag, then > > nvdimm_has_flush() still returns '0' indicating that writes do not > > require flushing. This happens on PPC64 with patch at [1] applied, > > where 'deep_flush' of a region was denied even though an explicit > > flush function was provided. > > > > Fix this by adding a condition to nvdimm_has_flush() to test for the > > ND_REGION_ASYNC flag on the region and see if a 'region->flush' > > callback is assigned. > > > > May be this should have > Fixes: c5d4355d10d4 ("libnvdimm: nd_region flush callback support") Yes, thanks for that. > > Without this we will mark the pmem disk not having FUA support? Yes. > > > > References: > > [1] "powerpc/papr_scm: Implement support for H_SCM_FLUSH hcall" > > https://lore.kernel.org/linux-nvdimm/161703936121.36.7260632399 > > 582101498.stgit@e1fbed493c87 > > > > Reported-by: Shivaprasad G Bhat > > Signed-off-by: Vaibhav Jain > > --- > > drivers/nvdimm/region_devs.c | 9 +++-- > > 1 file changed, 7 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c > > index ef23119db574..e05cc9f8a9fd 100644 > > --- a/drivers/nvdimm/region_devs.c > > +++ b/drivers/nvdimm/region_devs.c > > @@ -1239,6 +1239,11 @@ int nvdimm_has_flush(struct nd_region *nd_region) > > || !IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API)) > > return -ENXIO; > > > > + /* Test if an explicit flush function is defined */ > > + if (test_bit(ND_REGION_ASYNC, _region->flags) && nd_region->flush) > > + return 1; > > > + > > + /* Test if any flush hints for the region are available */ > > for (i = 0; i < nd_region->ndr_mappings; i++) { > > struct nd_mapping *nd_mapping = _region->mapping[i]; > > struct nvdimm *nvdimm = nd_mapping->nvdimm; > > @@ -1249,8 +1254,8 @@ int nvdimm_has_flush(struct nd_region *nd_region) > > } > > > > /* > > - * The platform defines dimm devices without hints, assume > > - * platform persistence mechanism like ADR > > + * The platform defines dimm devices without hints nor explicit flush, > > + * assume platform persistence mechanism like ADR > >*/ > > return 0; > > } > > -- > > 2.30.2 > > ___ > > Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org > > To unsubscribe send an email to linux-nvdimm-le...@lists.01.org ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] libnvdimm/region: Update nvdimm_has_flush() to handle ND_REGION_ASYNC
On Fri, Apr 2, 2021 at 2:26 AM Vaibhav Jain wrote: > > In case a platform doesn't provide explicit flush-hints but provides an > explicit flush callback via ND_REGION_ASYNC region flag, then > nvdimm_has_flush() still returns '0' indicating that writes do not > require flushing. This happens on PPC64 with patch at [1] applied, > where 'deep_flush' of a region was denied even though an explicit > flush function was provided. > > Fix this by adding a condition to nvdimm_has_flush() to test for the > ND_REGION_ASYNC flag on the region and see if a 'region->flush' > callback is assigned. Looks good. > > References: > [1] "powerpc/papr_scm: Implement support for H_SCM_FLUSH hcall" > https://lore.kernel.org/linux-nvdimm/161703936121.36.7260632399 > 582101498.stgit@e1fbed493c87 Looks like a typo happened in that link, I can fix that up. I'll also change this to the canonical "Link:" tag for references. > > Reported-by: Shivaprasad G Bhat > Signed-off-by: Vaibhav Jain > --- > drivers/nvdimm/region_devs.c | 9 +++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c > index ef23119db574..e05cc9f8a9fd 100644 > --- a/drivers/nvdimm/region_devs.c > +++ b/drivers/nvdimm/region_devs.c > @@ -1239,6 +1239,11 @@ int nvdimm_has_flush(struct nd_region *nd_region) > || !IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API)) > return -ENXIO; > > + /* Test if an explicit flush function is defined */ > + if (test_bit(ND_REGION_ASYNC, _region->flags) && nd_region->flush) > + return 1; > + > + /* Test if any flush hints for the region are available */ > for (i = 0; i < nd_region->ndr_mappings; i++) { > struct nd_mapping *nd_mapping = _region->mapping[i]; > struct nvdimm *nvdimm = nd_mapping->nvdimm; > @@ -1249,8 +1254,8 @@ int nvdimm_has_flush(struct nd_region *nd_region) > } > > /* > -* The platform defines dimm devices without hints, assume > -* platform persistence mechanism like ADR > +* The platform defines dimm devices without hints nor explicit flush, > +* assume platform persistence mechanism like ADR > */ > return 0; > } > -- > 2.30.2 > ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path
On Thu, Mar 25, 2021 at 7:34 AM Jason Gunthorpe wrote: > > On Thu, Mar 18, 2021 at 10:03:06AM -0700, Dan Williams wrote: > > Yes. I still need to answer the question of whether mapping > > invalidation triggers longterm pin holders to relinquish their hold, > > but that's a problem regardless of whether gup-fast is supported or > > not. > > It does not, GUP users do not interact with addres_space or mmu > notifiers > Ok, but the SIGKILL from the memory_failure() will drop the pin? Can memory_failure() find the right processes to kill if the memory registration has been passed by SCM_RIGHTS? ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()
On Wed, Mar 24, 2021 at 10:39 AM Christoph Hellwig wrote: > > On Wed, Mar 24, 2021 at 09:37:01AM -0700, Dan Williams wrote: > > > Eww. As I said I think the right way is that the file system (or > > > other consumer) can register a set of callbacks for opening the device. > > > > How does that solve the problem of the driver being notified of all > > pfn failure events? > > Ok, I probably just showed I need to spend more time looking at > your proposal vs the actual code.. > > Don't we have a proper way how one of the nvdimm layers own a > spefific memory range and call directly into that instead of through > a notifier? So that could be a new dev_pagemap operation as Ruan has here. I was thinking that other agents would be interested in non-dev_pagemap managed ranges, but we could leave that for later and just make the current pgmap->memory_failure() callback proposal range based. > > > Today pmem only finds out about the ones that are > > notified via native x86 machine check error handling via a notifier > > (yes "firmware-first" error handling fails to do the right thing for > > the pmem driver), > > Did any kind of firmware-first error handling ever get anything > right? I wish people would have learned that by now. Part of me wants to say if you use firmware-first you get to keep the pieces, but it's not always the end user choice as far as I understand. > > or the ones that are eventually reported via address > > range scrub, but only for the nvdimms that implement range scrubbing. > > memory_failure() seems a reasonable catch all point to route pfn > > failure events, in an arch independent way, to interested drivers. > > Yeah. > > > I'm fine swapping out dax_device blocking_notiier chains for your > > proposal, but that does not address all the proposed reworks in my > > list which are: > > > > - delete "drivers/acpi/nfit/mce.c" > > > > - teach memory_failure() to be able to communicate range failure > > > > - enable memory_failure() to defer to a filesystem that can say > > "critical metadata is impacted, no point in trying to do file-by-file > > isolation, bring the whole fs down". > > This all sounds sensible. Ok, Ruan, I think this means rework your dev_pagemap_ops callback to be range based. Add a holder concept for dax_devices and then layer that on Christoph's eventual dax_device callback mechanism that a dax_device holder can register. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path
On Thu, Mar 18, 2021 at 3:02 AM Joao Martins wrote: > > On 3/18/21 4:08 AM, Dan Williams wrote: > > Now that device-dax and filesystem-dax are guaranteed to unmap all user > > mappings of devmap / DAX pages before tearing down the 'struct page' > > array, get_user_pages_fast() can rely on its traditional synchronization > > method "validate_pte(); get_page(); revalidate_pte()" to catch races with > > device shutdown. Specifically the unmap guarantee ensures that gup-fast > > either succeeds in taking a page reference (lock-less), or it detects a > > need to fall back to the slow path where the device presence can be > > revalidated with locks held. > > [...] > > > @@ -2087,21 +2078,26 @@ static int gup_pte_range(pmd_t pmd, unsigned long > > addr, unsigned long end, > > #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */ > > > > #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && > > defined(CONFIG_TRANSPARENT_HUGEPAGE) > > + > > static int __gup_device_huge(unsigned long pfn, unsigned long addr, > >unsigned long end, unsigned int flags, > >struct page **pages, int *nr) > > { > > int nr_start = *nr; > > - struct dev_pagemap *pgmap = NULL; > > > > do { > > - struct page *page = pfn_to_page(pfn); > > + struct page *page; > > + > > + /* > > + * Typically pfn_to_page() on a devmap pfn is not safe > > + * without holding a live reference on the hosting > > + * pgmap. In the gup-fast path it is safe because any > > + * races will be resolved by either gup-fast taking a > > + * reference or the shutdown path unmapping the pte to > > + * trigger gup-fast to fall back to the slow path. > > + */ > > + page = pfn_to_page(pfn); > > > > - pgmap = get_dev_pagemap(pfn, pgmap); > > - if (unlikely(!pgmap)) { > > - undo_dev_pagemap(nr, nr_start, flags, pages); > > - return 0; > > - } > > SetPageReferenced(page); > > pages[*nr] = page; > > if (unlikely(!try_grab_page(page, flags))) { > > So for allowing FOLL_LONGTERM[0] would it be OK if we used page->pgmap after > try_grab_page() for checking pgmap type to see if we are in a device-dax > longterm pin? So, there is an effort to add a new pte bit p{m,u}d_special to disable gup-fast for huge pages [1]. I'd like to investigate whether we could use devmap + special as an encoding for "no longterm" and never consult the pgmap in the gup-fast path. [1]: https://lore.kernel.org/linux-mm/a1fa7fa2-914b-366d-9902-e5b784e84...@shipmail.org/ ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()
On Wed, Mar 24, 2021 at 12:48 AM Christoph Hellwig wrote: > > On Tue, Mar 23, 2021 at 07:19:28PM -0700, Dan Williams wrote: > > So I think the path forward is: > > > > - teach memory_failure() to allow for ranged failures > > > > - let interested drivers register for memory failure events via a > > blocking_notifier_head > > Eww. As I said I think the right way is that the file system (or > other consumer) can register a set of callbacks for opening the device. How does that solve the problem of the driver being notified of all pfn failure events? Today pmem only finds out about the ones that are notified via native x86 machine check error handling via a notifier (yes "firmware-first" error handling fails to do the right thing for the pmem driver), or the ones that are eventually reported via address range scrub, but only for the nvdimms that implement range scrubbing. memory_failure() seems a reasonable catch all point to route pfn failure events, in an arch independent way, to interested drivers. I'm fine swapping out dax_device blocking_notiier chains for your proposal, but that does not address all the proposed reworks in my list which are: - delete "drivers/acpi/nfit/mce.c" - teach memory_failure() to be able to communicate range failure - enable memory_failure() to defer to a filesystem that can say "critical metadata is impacted, no point in trying to do file-by-file isolation, bring the whole fs down". > I have a series I need to finish and send out to do that for block > devices. We probably also need the concept of a holder for the dax > device to make it work nicely, as otherwise we're going to have a bit > of a mess. Ok, I'll take a look at adding a holder. > > > This obviously does not solve Dave's desire to get this type of error > > reporting on block_devices, but I think there's nothing stopping a > > parallel notifier chain from being created for block-devices, but > > that's orthogonal to requirements and capabilities provided by > > dax-devices. > > FYI, my series could easily accomodate that if we ever get a block > driver that actually could report such errors. Sure, whatever we land for a dax_device could easily be adopted for a block device. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()
On Thu, Mar 18, 2021 at 7:18 PM ruansy.f...@fujitsu.com wrote: > > > > > -Original Message- > > From: ruansy.f...@fujitsu.com > > Subject: RE: [PATCH v3 01/11] pagemap: Introduce ->memory_failure() > > > > > > > > > > > > > > After the conversation with Dave I don't see the point of this. > > > > > > > If there is a memory_failure() on a page, why not just call > > > > > > > memory_failure()? That already knows how to find the inode and > > > > > > > the filesystem can be notified from there. > > > > > > > > > > > > We want memory_failure() supports reflinked files. In this > > > > > > case, we are not able to track multiple files from a page(this > > > > > > broken > > > > > > page) because > > > > > > page->mapping,page->index can only track one file. Thus, I > > > > > > page->introduce this > > > > > > ->memory_failure() implemented in pmem driver, to call > > > > > > ->->corrupted_range() > > > > > > upper level to upper level, and finally find out files who are > > > > > > using(mmapping) this page. > > > > > > > > > > > > > > > > I know the motivation, but this implementation seems backwards. > > > > > It's already the case that memory_failure() looks up the > > > > > address_space associated with a mapping. From there I would expect > > > > > a new 'struct address_space_operations' op to let the fs handle > > > > > the case when there are multiple address_spaces associated with a > > > > > given > > file. > > > > > > > > > > > > > Let me think about it. In this way, we > > > > 1. associate file mapping with dax page in dax page fault; > > > > > > I think this needs to be a new type of association that proxies the > > > representation of the reflink across all involved address_spaces. > > > > > > > 2. iterate files reflinked to notify `kill processes signal` by the > > > > new address_space_operation; > > > > 3. re-associate to another reflinked file mapping when unmmaping > > > > (rmap qeury in filesystem to get the another file). > > > > > > Perhaps the proxy object is reference counted per-ref-link. It seems > > > error prone to keep changing the association of the pfn while the reflink > > > is > > in-tact. > > Hi, Dan > > > > I think my early rfc patchset was implemented in this way: > > - Create a per-page 'dax-rmap tree' to store each reflinked file's > > (mapping, > > offset) when causing dax page fault. > > - Mount this tree on page->zone_device_data which is not used in fsdax, so > > that we can iterate reflinked file mappings in memory_failure() easily. > > In my understanding, the dax-rmap tree is the proxy object you mentioned. > > If > > so, I have to say, this method was rejected. Because this will cause huge > > overhead in some case that every dax page have one dax-rmap tree. > > > > Hi, Dan > > How do you think about this? I am still confused. Could you give me some > advice? So I think the primary driver of this functionality is dax-devices and the architectural model for memory failure where several architectures and error handlers know how to route pfn failure to the memory_failure() frontend. Compare that to block-devices where sector failure has no similar framework, and despite some initial interest about reusing 'struct badblocks' for this type of scenario there has been no real uptake to expand 'struct badblocks' outside of the pmem driver. I think the work you have done for ->corrupted_range() just needs to be repurposed away from a block-device operation to dax-device infrastructure. Christoph's pushback on extending block_device_operations makes sense to me because there is likely no other user of this facility than the pmem driver, and the pmem driver only needs it for the vestigial reason that filesystems mount on block-devices and not dax-devices. Recently Dave drove home the point that a filesystem can't do anything with pfns, it needs LBAs. A dax-device does not have LBA's, but it does operate on the concept of device-relative offsets. The filesystem is allowed to assume that dax-device:PFN[device_byte_offset >> PAGE_SHIFT] aliases the same data as the associated block-device:LBA[device_byte_offset >> SECTOR_SHIFT]. He also reiterated that this interface should be range based, which you already had, but I did not include in my attempt to communicate the mass failure of an entire surprise-removed device. So I think the path forward is: - teach memory_failure() to allow for ranged failures - let interested drivers register for memory failure events via a blocking_notifier_head - teach memory_failure() to optionally let the notifier chain claim the event vs its current default of walking page->mapping - teach the pmem driver to register for memory_failure() events and filter the ones that apply to pfns that the driver owns - drop the nfit driver's usage of the mce notifier chain since memory_failure() is a superset of what the mce notifier communicates - augment the pmem driver's view of badblocks that it
Re: [PATCH] dax: avoid -Wempty-body warnings
On Mon, Mar 22, 2021 at 4:45 AM Arnd Bergmann wrote: > > From: Arnd Bergmann > > gcc warns about an empty body in an else statement: > > drivers/dax/bus.c: In function 'do_id_store': > drivers/dax/bus.c:94:48: error: suggest braces around empty body in an 'else' > statement [-Werror=empty-body] >94 | /* nothing to remove */; > |^ > drivers/dax/bus.c:99:43: error: suggest braces around empty body in an 'else' > statement [-Werror=empty-body] >99 | /* dax_id already added */; > | ^ > > In both of these cases, the 'else' exists only to have a place to > add a comment, but that comment doesn't really explain that much > either, so the easiest way to shut up that warning is to just > remove the else. Ok, applied, thanks. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()
On Fri, Mar 19, 2021 at 6:47 PM Dave Chinner wrote: [..] > > Now I'm trying to reconcile the fact that platform > > poison handling will hit memory_failure() first and may not > > immediately reach the driver, if ever (see the perennially awkward > > firmware-first-mode error handling: ghes_handle_memory_failure()) . So > > even if the ->memory_failure(dev...) up call exists there is no > > guarantee it can get called for all poison before the memory_failure() > > down call happens. Which means regardless of whether > > ->memory_failure(dev...) exists memory_failure() needs to be able to > > do the right thing. > > I don't see how a poor implementation of memory_failure in a driver > or hardware is even remotely relevant to the interface used to > notify the filesystem of a media or device failure. It sounds like > you are trying to use memory_failure() for something it was never > intended to support and that there's a bunch of driver and > infrastructure work needed to make it work how you want it to work. > And even then it may not work the way we want it to work > > > Combine that with the fact that new buses like CXL might be configured > > in "poison on decode error" mode which means that a memory_failure() > > storm can happen regardless of whether the driver initiates it > > programatically. > > Straw man argument. > > "We can't make this interface a ranged notification because the > hardware might only be able to do page-by-page notification." No, it's "we can't make this interface notify the filesytem that sectors have failed before the memory_failure() (ranged or not) has communicated that pfns have failed." memory_failure() today is the first and sometimes only interface that learns of pfn failures. > > You can do page-by-page notification with a range based interface. > We are talking about how to efficiently and reliably inform the > filesystem that a range of a device is no longer accessible and so > it needs to revoke all mappings over that range of it's address > space. That does not need to be a single page at a time interface. > > If your hardware is configured to do stupid things, then that is not > the fault of the software interface used to communicate faults up > the stack, nor is it something that the notfication interface should > try to fix or mitigate. > > > How about a mechanism to optionally let a filesystem take over memory > > failure handling for a range of pfns that the memory_failure() can > > consult to fail ranges at a time rather than one by one? So a new > > 'struct dax_operations' op (void) (*memory_failure_register(struct > > dax_device *, void *data). Where any agent that claims a dax_dev can > > register to take over memory_failure() handling for any event that > > happens in that range. This would be routed through device-mapper like > > any other 'struct dax_operations' op. I think that meets your > > requirement to get notifications of all the events you want to handle, > > but still allows memory_failure() to be the last resort for everything > > that has not opted into this error handling. > > Which is basically the same as the proposed ->corrupted_range stack, > except it doesn't map the pfns back to LBA addresses the filesystem > needs to make sense of the failure. > > fs-dax filesystems have no clue what pfns are, or how to translate > them to LBAs in their block device address space that the map > everything to. The fs-dax infrastructure asks the filesystem for > bdev/sector based mappings, and internally converts them to pfns by > a combination of bdev and daxdev callouts. Hence fs-dax filesystems > never see nor interpret pfns at all. Nor do they have the > capability to convert a PFN to a LBA address. And neither the > underlying block device nor the associated DAX device provide a > method for doing this reverse mapping translation. True. > > So if we have an interface that hands a {daxdev,PFN,len} tuple to > the filesystem, exactly what is the filesystem supposed to do with > it? How do we turn that back into a {bdev,sector,len} tuple so we > can do reverse mapping lookups to find the filesystem objects that > allocated within the notified range? > > I'll point out again that these address space translations were > something that the ->corrupted_range callbacks handled directly - no > layer in the stack was handed a range that it didn't know how to map > to it's own internal structures. By the time it got to the > filesystem, it was a {bdev,sector,len} tuple, and the filesystem > could feed that directly to it's reverse mapping lookups > > Maybe I'm missing something magical about ->memory_failure that does > all this translation for us, but I don't see it in this patchset. I > just don't see how this proposed interface is a usable at the > filesystem level as it stands. So then it's not the filesystem that needs to register for memory_failure() it's the driver in order to translate the failed LBAs up the stack. However,
Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()
On Wed, Mar 17, 2021 at 9:45 PM Darrick J. Wong wrote: > > On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote: > > Jason wondered why the get_user_pages_fast() path takes references on a > > @pgmap object. The rationale was to protect against accessing a 'struct > > page' that might be in the process of being removed by the driver, but > > he rightly points out that should be solved the same way all gup-fast > > synchronization is solved which is invalidate the mapping and let the > > gup slow path do @pgmap synchronization [1]. > > > > To achieve that it means that new user mappings need to stop being > > created and all existing user mappings need to be invalidated. > > > > For device-dax this is already the case as kill_dax() prevents future > > faults from installing a pte, and the single device-dax inode > > address_space can be trivially unmapped. > > > > The situation is different for filesystem-dax where device pages could > > be mapped by any number of inode address_space instances. An initial > > thought was to treat the device removal event like a drop_pagecache_sb() > > event that walks superblocks and unmaps all inodes. However, Dave points > > out that it is not just the filesystem user-mappings that need to react > > to global DAX page-unmap events, it is also filesystem metadata > > (proposed DAX metadata access), and other drivers (upstream > > DM-writecache) that need to react to this event [2]. > > > > The only kernel facility that is meant to globally broadcast the loss of > > a page (via corruption or surprise remove) is memory_failure(). The > > downside of memory_failure() is that it is a pfn-at-a-time interface. > > However, the events that would trigger the need to call memory_failure() > > over a full PMEM device should be rare. Remove should always be > > coordinated by the administrator with the filesystem. If someone force > > removes a device from underneath a mounted filesystem the driver assumes > > they have a good reason, or otherwise get to keep the pieces. Since > > ->remove() callbacks can not fail the only option is to trigger the mass > > memory_failure(). > > > > The mechanism to determine whether memory_failure() triggers at > > pmem->remove() time is whether the associated dax_device has an elevated > > reference at @pgmap ->kill() time. > > > > With this in place the get_user_pages_fast() path can drop its > > half-measure synchronization with an @pgmap reference. > > > > Link: http://lore.kernel.org/r/20210224010017.gq2643...@ziepe.ca [1] > > Link: http://lore.kernel.org/r/20210302075736.gj4...@dread.disaster.area [2] > > Reported-by: Jason Gunthorpe > > Cc: Dave Chinner > > Cc: Christoph Hellwig > > Cc: Shiyang Ruan > > Cc: Vishal Verma > > Cc: Dave Jiang > > Cc: Ira Weiny > > Cc: Matthew Wilcox > > Cc: Jan Kara > > Cc: Andrew Morton > > Cc: Naoya Horiguchi > > Cc: "Darrick J. Wong" > > Signed-off-by: Dan Williams > > --- > > drivers/dax/super.c | 15 +++ > > drivers/nvdimm/pmem.c| 10 +- > > drivers/nvdimm/pmem.h|1 + > > include/linux/dax.h |5 + > > include/linux/memremap.h |5 + > > include/linux/mm.h |3 +++ > > mm/memory-failure.c | 11 +-- > > mm/memremap.c| 11 +++ > > 8 files changed, 58 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c > > index 5fa6ae9dbc8b..5ebcedf4a68c 100644 > > --- a/drivers/dax/super.c > > +++ b/drivers/dax/super.c > > @@ -624,6 +624,21 @@ void put_dax(struct dax_device *dax_dev) > > } > > EXPORT_SYMBOL_GPL(put_dax); > > > > +bool dax_is_idle(struct dax_device *dax_dev) > > +{ > > + struct inode *inode; > > + > > + if (!dax_dev) > > + return true; > > + > > + WARN_ONCE(test_bit(DAXDEV_ALIVE, _dev->flags), > > + "dax idle check on live device.\n"); > > + > > + inode = _dev->inode; > > + return atomic_read(>i_count) < 2; > > +} > > +EXPORT_SYMBOL_GPL(dax_is_idle); > > + > > /** > > * dax_get_by_host() - temporary lookup mechanism for filesystem-dax > > * @host: alternate name for the device registered by a dax driver > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > > index b8a85bfb2e95..e8822c9262ee 100644 > > --- a/drivers/nvdimm/pmem.c > > +++ b/drivers/nvdimm/pme
Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()
On Wed, Mar 17, 2021 at 9:58 PM Dave Chinner wrote: > > On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote: > > Jason wondered why the get_user_pages_fast() path takes references on a > > @pgmap object. The rationale was to protect against accessing a 'struct > > page' that might be in the process of being removed by the driver, but > > he rightly points out that should be solved the same way all gup-fast > > synchronization is solved which is invalidate the mapping and let the > > gup slow path do @pgmap synchronization [1]. > > > > To achieve that it means that new user mappings need to stop being > > created and all existing user mappings need to be invalidated. > > > > For device-dax this is already the case as kill_dax() prevents future > > faults from installing a pte, and the single device-dax inode > > address_space can be trivially unmapped. > > > > The situation is different for filesystem-dax where device pages could > > be mapped by any number of inode address_space instances. An initial > > thought was to treat the device removal event like a drop_pagecache_sb() > > event that walks superblocks and unmaps all inodes. However, Dave points > > out that it is not just the filesystem user-mappings that need to react > > to global DAX page-unmap events, it is also filesystem metadata > > (proposed DAX metadata access), and other drivers (upstream > > DM-writecache) that need to react to this event [2]. > > > > The only kernel facility that is meant to globally broadcast the loss of > > a page (via corruption or surprise remove) is memory_failure(). The > > downside of memory_failure() is that it is a pfn-at-a-time interface. > > However, the events that would trigger the need to call memory_failure() > > over a full PMEM device should be rare. > > This is a highly suboptimal design. Filesystems only need a single > callout to trigger a shutdown that unmaps every active mapping in > the filesystem - we do not need a page-by-page error notification > which results in 250 million hwposion callouts per TB of pmem to do > this. > > Indeed, the moment we get the first hwpoison from this patch, we'll > map it to the primary XFS superblock and we'd almost certainly > consider losing the storage behind that block to be a shut down > trigger. During the shutdown, the filesystem should unmap all the > active mappings (we already need to add this to shutdown on DAX > regardless of this device remove issue) and so we really don't need > a page-by-page notification of badness. XFS doesn't, but what about device-mapper and other agents? Even if the driver had a callback up the stack memory_failure() still needs to be able to trigger failures down the stack for CPU consumed poison. > > AFAICT, it's going to take minutes, maybe hours for do the page-by-page > iteration to hwposion every page. It's going to take a few seconds > for the filesystem shutdown to run a device wide invalidation. > > SO, yeah, I think this should simply be a single ranged call to the > filesystem like: > > ->memory_failure(dev, 0, -1ULL) > > to tell the filesystem that the entire backing device has gone away, > and leave the filesystem to handle failure entirely at the > filesystem level. So I went with memory_failure() after our discussion of all the other agents in the system that might care about these pfns going offline and relying on memory_failure() to route down to each of those. I.e. the "reuse the drop_pagecache_sb() model" idea was indeed insufficient. Now I'm trying to reconcile the fact that platform poison handling will hit memory_failure() first and may not immediately reach the driver, if ever (see the perennially awkward firmware-first-mode error handling: ghes_handle_memory_failure()) . So even if the ->memory_failure(dev...) up call exists there is no guarantee it can get called for all poison before the memory_failure() down call happens. Which means regardless of whether ->memory_failure(dev...) exists memory_failure() needs to be able to do the right thing. Combine that with the fact that new buses like CXL might be configured in "poison on decode error" mode which means that a memory_failure() storm can happen regardless of whether the driver initiates it programatically. How about a mechanism to optionally let a filesystem take over memory failure handling for a range of pfns that the memory_failure() can consult to fail ranges at a time rather than one by one? So a new 'struct dax_operations' op (void) (*memory_failure_register(struct dax_device *, void *data). Where any agent that claims a dax_dev can register to take over memory_failure() handling for any event that happens in that range. This would
Re: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path
On Thu, Mar 18, 2021 at 3:02 AM Joao Martins wrote: > > On 3/18/21 4:08 AM, Dan Williams wrote: > > Now that device-dax and filesystem-dax are guaranteed to unmap all user > > mappings of devmap / DAX pages before tearing down the 'struct page' > > array, get_user_pages_fast() can rely on its traditional synchronization > > method "validate_pte(); get_page(); revalidate_pte()" to catch races with > > device shutdown. Specifically the unmap guarantee ensures that gup-fast > > either succeeds in taking a page reference (lock-less), or it detects a > > need to fall back to the slow path where the device presence can be > > revalidated with locks held. > > [...] > > > @@ -2087,21 +2078,26 @@ static int gup_pte_range(pmd_t pmd, unsigned long > > addr, unsigned long end, > > #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */ > > > > #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && > > defined(CONFIG_TRANSPARENT_HUGEPAGE) > > + > > static int __gup_device_huge(unsigned long pfn, unsigned long addr, > >unsigned long end, unsigned int flags, > >struct page **pages, int *nr) > > { > > int nr_start = *nr; > > - struct dev_pagemap *pgmap = NULL; > > > > do { > > - struct page *page = pfn_to_page(pfn); > > + struct page *page; > > + > > + /* > > + * Typically pfn_to_page() on a devmap pfn is not safe > > + * without holding a live reference on the hosting > > + * pgmap. In the gup-fast path it is safe because any > > + * races will be resolved by either gup-fast taking a > > + * reference or the shutdown path unmapping the pte to > > + * trigger gup-fast to fall back to the slow path. > > + */ > > + page = pfn_to_page(pfn); > > > > - pgmap = get_dev_pagemap(pfn, pgmap); > > - if (unlikely(!pgmap)) { > > - undo_dev_pagemap(nr, nr_start, flags, pages); > > - return 0; > > - } > > SetPageReferenced(page); > > pages[*nr] = page; > > if (unlikely(!try_grab_page(page, flags))) { > > So for allowing FOLL_LONGTERM[0] would it be OK if we used page->pgmap after > try_grab_page() for checking pgmap type to see if we are in a device-dax > longterm pin? > Yes. I still need to answer the question of whether mapping invalidation triggers longterm pin holders to relinquish their hold, but that's a problem regardless of whether gup-fast is supported or not. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path
Now that device-dax and filesystem-dax are guaranteed to unmap all user mappings of devmap / DAX pages before tearing down the 'struct page' array, get_user_pages_fast() can rely on its traditional synchronization method "validate_pte(); get_page(); revalidate_pte()" to catch races with device shutdown. Specifically the unmap guarantee ensures that gup-fast either succeeds in taking a page reference (lock-less), or it detects a need to fall back to the slow path where the device presence can be revalidated with locks held. Reported-by: Jason Gunthorpe Cc: Christoph Hellwig Cc: Shiyang Ruan Cc: Vishal Verma Cc: Dave Jiang Cc: Ira Weiny Cc: Matthew Wilcox Cc: Jan Kara Cc: Andrew Morton Signed-off-by: Dan Williams --- mm/gup.c | 38 -- 1 file changed, 16 insertions(+), 22 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index e40579624f10..dfeb47e4e8d4 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1996,9 +1996,8 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { - struct dev_pagemap *pgmap = NULL; - int nr_start = *nr, ret = 0; pte_t *ptep, *ptem; + int ret = 0; ptem = ptep = pte_offset_map(, addr); do { @@ -2015,16 +2014,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, if (!pte_access_permitted(pte, flags & FOLL_WRITE)) goto pte_unmap; - if (pte_devmap(pte)) { - if (unlikely(flags & FOLL_LONGTERM)) - goto pte_unmap; + if (pte_devmap(pte) && (flags & FOLL_LONGTERM)) + goto pte_unmap; - pgmap = get_dev_pagemap(pte_pfn(pte), pgmap); - if (unlikely(!pgmap)) { - undo_dev_pagemap(nr, nr_start, flags, pages); - goto pte_unmap; - } - } else if (pte_special(pte)) + if (pte_special(pte)) goto pte_unmap; VM_BUG_ON(!pfn_valid(pte_pfn(pte))); @@ -2063,8 +2056,6 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, ret = 1; pte_unmap: - if (pgmap) - put_dev_pagemap(pgmap); pte_unmap(ptem); return ret; } @@ -2087,21 +2078,26 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */ #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE) + static int __gup_device_huge(unsigned long pfn, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { int nr_start = *nr; - struct dev_pagemap *pgmap = NULL; do { - struct page *page = pfn_to_page(pfn); + struct page *page; + + /* +* Typically pfn_to_page() on a devmap pfn is not safe +* without holding a live reference on the hosting +* pgmap. In the gup-fast path it is safe because any +* races will be resolved by either gup-fast taking a +* reference or the shutdown path unmapping the pte to +* trigger gup-fast to fall back to the slow path. +*/ + page = pfn_to_page(pfn); - pgmap = get_dev_pagemap(pfn, pgmap); - if (unlikely(!pgmap)) { - undo_dev_pagemap(nr, nr_start, flags, pages); - return 0; - } SetPageReferenced(page); pages[*nr] = page; if (unlikely(!try_grab_page(page, flags))) { @@ -2112,8 +2108,6 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr, pfn++; } while (addr += PAGE_SIZE, addr != end); - if (pgmap) - put_dev_pagemap(pgmap); return 1; } ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()
Jason wondered why the get_user_pages_fast() path takes references on a @pgmap object. The rationale was to protect against accessing a 'struct page' that might be in the process of being removed by the driver, but he rightly points out that should be solved the same way all gup-fast synchronization is solved which is invalidate the mapping and let the gup slow path do @pgmap synchronization [1]. To achieve that it means that new user mappings need to stop being created and all existing user mappings need to be invalidated. For device-dax this is already the case as kill_dax() prevents future faults from installing a pte, and the single device-dax inode address_space can be trivially unmapped. The situation is different for filesystem-dax where device pages could be mapped by any number of inode address_space instances. An initial thought was to treat the device removal event like a drop_pagecache_sb() event that walks superblocks and unmaps all inodes. However, Dave points out that it is not just the filesystem user-mappings that need to react to global DAX page-unmap events, it is also filesystem metadata (proposed DAX metadata access), and other drivers (upstream DM-writecache) that need to react to this event [2]. The only kernel facility that is meant to globally broadcast the loss of a page (via corruption or surprise remove) is memory_failure(). The downside of memory_failure() is that it is a pfn-at-a-time interface. However, the events that would trigger the need to call memory_failure() over a full PMEM device should be rare. Remove should always be coordinated by the administrator with the filesystem. If someone force removes a device from underneath a mounted filesystem the driver assumes they have a good reason, or otherwise get to keep the pieces. Since ->remove() callbacks can not fail the only option is to trigger the mass memory_failure(). The mechanism to determine whether memory_failure() triggers at pmem->remove() time is whether the associated dax_device has an elevated reference at @pgmap ->kill() time. With this in place the get_user_pages_fast() path can drop its half-measure synchronization with an @pgmap reference. Link: http://lore.kernel.org/r/20210224010017.gq2643...@ziepe.ca [1] Link: http://lore.kernel.org/r/20210302075736.gj4...@dread.disaster.area [2] Reported-by: Jason Gunthorpe Cc: Dave Chinner Cc: Christoph Hellwig Cc: Shiyang Ruan Cc: Vishal Verma Cc: Dave Jiang Cc: Ira Weiny Cc: Matthew Wilcox Cc: Jan Kara Cc: Andrew Morton Cc: Naoya Horiguchi Cc: "Darrick J. Wong" Signed-off-by: Dan Williams --- drivers/dax/super.c | 15 +++ drivers/nvdimm/pmem.c| 10 +- drivers/nvdimm/pmem.h|1 + include/linux/dax.h |5 + include/linux/memremap.h |5 + include/linux/mm.h |3 +++ mm/memory-failure.c | 11 +-- mm/memremap.c| 11 +++ 8 files changed, 58 insertions(+), 3 deletions(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 5fa6ae9dbc8b..5ebcedf4a68c 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -624,6 +624,21 @@ void put_dax(struct dax_device *dax_dev) } EXPORT_SYMBOL_GPL(put_dax); +bool dax_is_idle(struct dax_device *dax_dev) +{ + struct inode *inode; + + if (!dax_dev) + return true; + + WARN_ONCE(test_bit(DAXDEV_ALIVE, _dev->flags), + "dax idle check on live device.\n"); + + inode = _dev->inode; + return atomic_read(>i_count) < 2; +} +EXPORT_SYMBOL_GPL(dax_is_idle); + /** * dax_get_by_host() - temporary lookup mechanism for filesystem-dax * @host: alternate name for the device registered by a dax driver diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index b8a85bfb2e95..e8822c9262ee 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -348,15 +348,21 @@ static void pmem_pagemap_kill(struct dev_pagemap *pgmap) { struct request_queue *q = container_of(pgmap->ref, struct request_queue, q_usage_counter); + struct pmem_device *pmem = q->queuedata; blk_freeze_queue_start(q); + kill_dax(pmem->dax_dev); + if (!dax_is_idle(pmem->dax_dev)) { + dev_warn(pmem->dev, +"DAX active at remove, trigger mass memory failure\n"); + dev_pagemap_failure(pgmap); + } } static void pmem_release_disk(void *__pmem) { struct pmem_device *pmem = __pmem; - kill_dax(pmem->dax_dev); put_dax(pmem->dax_dev); del_gendisk(pmem->disk); put_disk(pmem->disk); @@ -406,6 +412,7 @@ static int pmem_attach_disk(struct device *dev, devm_namespace_disable(dev, ndns); dev_set_drvdata(dev, pmem); + pmem->dev = dev; pmem->phys_addr = res->start; pmem->size = resource_size(res); fua = nvdim
[PATCH 1/3] mm/memory-failure: Prepare for mass memory_failure()
Currently memory_failure() assumes an infrequent report on a handful of pages. A new use case for surprise removal of a persistent memory device needs to trigger memory_failure() on a large range. Rate limit memory_failure() error logging, and allow the memory_failure_dev_pagemap() helper to be called directly. Cc: Naoya Horiguchi Cc: Andrew Morton Signed-off-by: Dan Williams --- mm/memory-failure.c | 25 ++--- 1 file changed, 14 insertions(+), 11 deletions(-) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 24210c9bd843..43ba4307c526 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -395,8 +395,9 @@ static void kill_procs(struct list_head *to_kill, int forcekill, bool fail, * signal and then access the memory. Just kill it. */ if (fail || tk->addr == -EFAULT) { - pr_err("Memory failure: %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n", - pfn, tk->tsk->comm, tk->tsk->pid); + pr_err_ratelimited( + "Memory failure: %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n", + pfn, tk->tsk->comm, tk->tsk->pid); do_send_sig_info(SIGKILL, SEND_SIG_PRIV, tk->tsk, PIDTYPE_PID); } @@ -408,8 +409,9 @@ static void kill_procs(struct list_head *to_kill, int forcekill, bool fail, * process anyways. */ else if (kill_proc(tk, pfn, flags) < 0) - pr_err("Memory failure: %#lx: Cannot send advisory machine check signal to %s:%d\n", - pfn, tk->tsk->comm, tk->tsk->pid); + pr_err_ratelimited( + "Memory failure: %#lx: Cannot send advisory machine check signal to %s:%d\n", + pfn, tk->tsk->comm, tk->tsk->pid); } put_task_struct(tk->tsk); kfree(tk); @@ -919,8 +921,8 @@ static void action_result(unsigned long pfn, enum mf_action_page_type type, { trace_memory_failure_event(pfn, type, result); - pr_err("Memory failure: %#lx: recovery action for %s: %s\n", - pfn, action_page_types[type], action_name[result]); + pr_err_ratelimited("Memory failure: %#lx: recovery action for %s: %s\n", + pfn, action_page_types[type], action_name[result]); } static int page_action(struct page_state *ps, struct page *p, @@ -1375,8 +1377,6 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, unlock: dax_unlock_page(page, cookie); out: - /* drop pgmap ref acquired in caller */ - put_dev_pagemap(pgmap); action_result(pfn, MF_MSG_DAX, rc ? MF_FAILED : MF_RECOVERED); return rc; } @@ -1415,9 +1415,12 @@ int memory_failure(unsigned long pfn, int flags) if (!p) { if (pfn_valid(pfn)) { pgmap = get_dev_pagemap(pfn, NULL); - if (pgmap) - return memory_failure_dev_pagemap(pfn, flags, - pgmap); + if (pgmap) { + res = memory_failure_dev_pagemap(pfn, flags, +pgmap); + put_dev_pagemap(pgmap); + return res; + } } pr_err("Memory failure: %#lx: memory outside kernel control\n", pfn); ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[PATCH 0/3] mm, pmem: Force unmap pmem on surprise remove
Summary: A dax_dev can be unbound from its driver at any time. Unbind can not fail. The driver-core will always trigger ->remove() and the result from ->remove() is ignored. After ->remove() the driver-core proceeds to tear down context. The filesystem-dax implementation can leave pfns mapped after ->remove() if it is triggered while the filesystem is mounted. Security and data-integrity is forfeit if the dax_dev is repurposed for another security domain (new filesystem or change device modes), or if the dax_dev is physically replaced. CXL is a hotplug bus that makes dax_dev physical replace a real world prospect. All dax_dev pfns must be unmapped at remove. Detect the "remove while mounted" case and trigger memory_failure() over the entire dax_dev range. Details: The get_user_pages_fast() path expects all synchronization to be handled by the pattern of checking for pte presence, taking a page reference, and then validating that the pte was stable over that event. The gup-fast path for devmap / DAX pages additionally attempts to take/hold a live reference against the hosting pgmap over the page pin. The rational for the pgmap reference is to synchronize against a dax-device unbind / ->remove() event, but that is unnecessary if pte invalidation is guaranteed in the ->remove() path. Global dax-device pte invalidation *does* happen when the device is in raw "device-dax" mode where there is a single shared inode to unmap at remove, but the filesystem-dax path has a large number of actively mapped inodes unknown to the driver at ->remove() time. So, that unmap does not happen today for filesystem-dax. However, as Jason points out, that unmap / invalidation *needs* to happen not only to cleanup get_user_pages_fast() semantics, but in a future (see CXL) where dax_dev ->remove() is correlated with actual physical removal / replacement the implications of allowing a physical pfn to be exchanged without tearing down old mappings are severe (security and data-integrity). What is not in this patch set is coordination with the dax_kmem driver to trigger memory_failure() when the dax_dev is onlined as "System RAM". The remove_memory() API was built with the assumption that platform firmware negotiates all removal requests and the OS has a chance to say "no". This is why dax_kmem today simply leaks request_region() to burn that physical address space for any other usage until the next reboot on a manual unbind event if the memory can't be offlined. However a future to make sure that remove_memory() succeeds after memory_failure() of the same range seems a better semantic than permanently burning physical address space. The topic of remove_memory() failures gets to the question of what happens to active page references when the inopportune ->remove() event happens. For transient pins the ->remove() event will wait for for all pins to be dropped before allowing ->remove() to complete. Since fileystem-dax forbids longterm pins all those pins are transient. Device-dax, on the other hand, does allow longterm pins which means that ->remove() will hang unless / until the longterm pin is dropped. Hopefully an unmap_mapping_range() event is sufficient to get the pin dropped, but I suspect device-dax might need to trigger memory_failure() as well to get the longterm pin holder to wake up and get out of the way (TBD). Lest we repeat the "longterm-pin-revoke" debate, which highlighted that RDMA devices do not respond well to having context torn down, keep in mind that this proposal is to do a best effort recovery of an event that should not happen (surprise removal) under nominal operation. --- Dan Williams (3): mm/memory-failure: Prepare for mass memory_failure() mm, dax, pmem: Introduce dev_pagemap_failure() mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path drivers/dax/super.c | 15 +++ drivers/nvdimm/pmem.c| 10 +- drivers/nvdimm/pmem.h|1 + include/linux/dax.h |5 + include/linux/memremap.h |5 + include/linux/mm.h |3 +++ mm/gup.c | 38 -- mm/memory-failure.c | 36 +++- mm/memremap.c| 11 +++ 9 files changed, 88 insertions(+), 36 deletions(-) ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [RFC 0/2] virtio-pmem: Asynchronous flush
On Thu, Mar 11, 2021 at 8:21 PM Pankaj Gupta wrote: > > Hi David, > > > > Jeff reported preflush order issue with the existing implementation > > > of virtio pmem preflush. Dan suggested[1] to implement asynchronous > > > flush > > > for virtio pmem using work queue as done in md/RAID. This patch series > > > intends to solve the preflush ordering issue and also makes the flush > > > asynchronous from the submitting thread POV. > > > > > > Submitting this patch series for feeback and is in WIP. I have > > > done basic testing and currently doing more testing. > > > > > > Pankaj Gupta (2): > > >pmem: make nvdimm_flush asynchronous > > >virtio_pmem: Async virtio-pmem flush > > > > > > drivers/nvdimm/nd_virtio.c | 66 ++-- > > > drivers/nvdimm/pmem.c| 15 > > > drivers/nvdimm/region_devs.c | 3 +- > > > drivers/nvdimm/virtio_pmem.c | 9 + > > > drivers/nvdimm/virtio_pmem.h | 12 +++ > > > 5 files changed, 78 insertions(+), 27 deletions(-) > > > > > > [1] https://marc.info/?l=linux-kernel=157446316409937=2 > > > > > > > Just wondering, was there any follow up of this or are we still waiting > > for feedback? :) > > Thank you for bringing this up. > > My apologies I could not followup on this. I have another version in my local > tree but could not post it as I was not sure if I solved the problem > correctly. I will > clean it up and post for feedback as soon as I can. > > P.S: Due to serious personal/family health issues I am not able to > devote much time > on this with other professional commitments. I feel bad that I have > this unfinished task. > Just in last one year things have not been stable for me & my family > and still not getting :( No worries Pankaj. Take care of yourself and your family. The community can handle this for you. I'm open to coaching somebody through what's involved to get this fix landed. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
On Wed, Mar 10, 2021 at 10:13 AM Joao Martins wrote: > > On 2/22/21 8:37 PM, Dan Williams wrote: > > On Mon, Feb 22, 2021 at 3:24 AM Joao Martins > > wrote: > >> On 2/20/21 1:43 AM, Dan Williams wrote: > >>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard wrote: > >>>> On 12/8/20 9:28 AM, Joao Martins wrote: > >> One thing to point out about altmap is that the degradation (in pinning and > >> unpining) we observed with struct page's in device memory, is no longer > >> observed > >> once 1) we batch ref count updates as we move to compound pages 2) reusing > >> tail pages seems to lead to these struct pages staying more likely in cache > >> which perhaps contributes to dirtying a lot less cachelines. > > > > True, it makes it more palatable to survive 'struct page' in PMEM, > > I want to retract for now what I said above wrt to the no degradation with > struct page in device comment. I was fooled by a bug on a patch later down > this series. Particular because I accidentally cleared PGMAP_ALTMAP_VALID when > unilaterally setting PGMAP_COMPOUND, which consequently lead to always > allocating struct pages from memory. No wonder the numbers were just as fast. > I am still confident that it's going to be faster and observe less degradation > in pinning/init. Init for now is worst-case 2x faster. But to be *as fast* > struct > pages in memory might still be early to say. > > The broken masking of the PGMAP_ALTMAP_VALID bit did hide one flaw, where > we don't support altmap for basepages on x86/mm and it apparently depends > on architectures to implement it (and a couple other issues). The vmemmap > allocation isn't the problem, so the previous comment in this thread that > altmap doesn't change much in the vmemmap_populate_compound_pages() is > still accurate. > > The problem though resides on the freeing of vmemmap pagetables with > basepages *with altmap* (e.g. at dax-device teardown) which require arch > support. Doing it properly would mean making the altmap reserve smaller > (given fewer pages are allocated), and the ability for the altmap pfn > allocator to track references per pfn. But I think it deserves its own > separate patch series (probably almost just as big). > > Perhaps for this set I can stick without altmap as you suggested, and > use hugepage vmemmap population (which wouldn't > lead to device memory savings) instead of reusing base pages . I would > still leave the compound page support logic as metadata representation > for > 4K @align, as I think that's the right thing to do. And then > a separate series onto improving altmap to leverage the metadata reduction > support as done with non-device struct pages. > > Thoughts? The space savings is the whole point. So I agree with moving altmap support to a follow-on enhancement, but land the non-altmap basepage support in the first round. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v2 00/10] fsdax,xfs: Add reflink support for fsdax
On Wed, Mar 10, 2021 at 6:27 AM Matthew Wilcox wrote: > > On Wed, Mar 10, 2021 at 08:21:59AM -0600, Goldwyn Rodrigues wrote: > > On 13:02 10/03, Matthew Wilcox wrote: > > > On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote: > > > > Forgive my ignorance, but is there a reason why this isn't wired up to > > > > Btrfs at the same time? It seems weird to me that adding a feature > > > > > > btrfs doesn't support DAX. only ext2, ext4, XFS and FUSE have DAX > > > support. > > > > > > If you think about it, btrfs and DAX are diametrically opposite things. > > > DAX is about giving raw access to the hardware. btrfs is about offering > > > extra value (RAID, checksums, ...), none of which can be done if the > > > filesystem isn't in the read/write path. > > > > > > That's why there's no DAX support in btrfs. If you want DAX, you have > > > to give up all the features you like in btrfs. So you may as well use > > > a different filesystem. > > > > DAX on btrfs has been attempted[1]. Of course, we could not > > But why? A completeness fetish? I don't understand why you decided > to do this work. Isn't DAX useful for pagecache minimization on read even if it is awkward for a copy-on-write fs? Seems it would be a useful case to have COW'd VM images on BTRFS that don't need superfluous page cache allocations. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v2] libnvdimm: Notify disk drivers to revalidate region read-only
On Tue, Mar 9, 2021 at 10:54 PM Christoph Hellwig wrote: > > Looks good to me: > > Reviewed-by: Christoph Hellwig > > Question on the pre-existing code: given that nvdimm_check_and_set_ro is > the only caller of set_disk_ro for nvdimm devices, we'll also get > the message when initially setting up any read-only disk. Is that > intentional? Yeah, that's intentional. There's no other notification that userspace would be looking for by default besides the kernel log, and the block device name is more meaningful than the region name, or the nvdimm device status for that matter. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[ndctl PATCH] test/libndctl: Use ndctl_region_set_ro() to change disk read-only state
Kernel commit 52f019d43c22 ("block: add a hard-readonly flag to struct gendisk") broke the read-only management test, by fixing the broken behavior that BLKROSET could make a block device read-write even when the disk is read-only. The fix (see Link:) propagates changes of the region read-only state to the underlying disk. Add ndctl_region_set_ro() ahead of BLKROSET so that BLKROSET does not conflict the block_device state with the disk state. Link: http://lore.kernel.org/r/161534060720.528671.2341213328968989192.st...@dwillia2-desk3.amr.corp.intel.com Reported-by: kernel test robot Reported-by: Vishal Verma Signed-off-by: Dan Williams --- test/libndctl.c | 18 +- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/test/libndctl.c b/test/libndctl.c index faa308ade8fc..391b94086dae 100644 --- a/test/libndctl.c +++ b/test/libndctl.c @@ -1541,6 +1541,7 @@ static int validate_bdev(const char *devname, struct ndctl_btt *btt, struct ndctl_pfn *pfn, struct ndctl_namespace *ndns, struct namespace *namespace, void *buf) { + struct ndctl_region *region = ndctl_namespace_get_region(ndns); char bdevpath[50]; int fd, rc, ro; @@ -1578,6 +1579,13 @@ static int validate_bdev(const char *devname, struct ndctl_btt *btt, } ro = 0; + rc = ndctl_region_set_ro(region, ro); + if (rc < 0) { + fprintf(stderr, "%s: ndctl_region_set_ro failed\n", devname); + rc = -errno; + goto out; + } + rc = ioctl(fd, BLKROSET, ); if (rc < 0) { fprintf(stderr, "%s: BLKROSET failed\n", @@ -1605,8 +1613,16 @@ static int validate_bdev(const char *devname, struct ndctl_btt *btt, rc = -ENXIO; goto out; } + + rc = ndctl_region_set_ro(region, namespace->ro); + if (rc < 0) { + fprintf(stderr, "%s: ndctl_region_set_ro reset failed\n", devname); + rc = -errno; + goto out; + } + rc = 0; - out: +out: close(fd); return rc; } ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[PATCH v2] libnvdimm: Notify disk drivers to revalidate region read-only
Previous kernels allowed the BLKROSET to override the disk's read-only status. With that situation fixed the pmem driver needs to rely on notification events to reevaluate the disk read-only status after the host region has been marked read-write. Recall that when libnvdimm determines that the persistent memory has lost persistence (for example lack of energy to flush from DRAM to FLASH on an NVDIMM-N device) it marks the region read-only, but that state can be overridden by the user via: echo 0 > /sys/bus/nd/devices/regionX/read_only ...to date there is no notification that the region has restored persistence, so the user override is the only recovery. Fixes: 52f019d43c22 ("block: add a hard-readonly flag to struct gendisk") Cc: Christoph Hellwig Cc: Ming Lei Cc: Martin K. Petersen Cc: Hannes Reinecke Cc: Jens Axboe Reported-by: kernel test robot Reported-by: Vishal Verma Signed-off-by: Dan Williams --- Changes since v1 [1]: - Move from the sinking ship of revalidate_disk() to the local hotness of nd_pmem_notify() (hch). [1]: http://lore.kernel.org/r/161527286194.446794.5215036039655765042.st...@dwillia2-desk3.amr.corp.intel.com drivers/nvdimm/bus.c | 14 ++ drivers/nvdimm/pmem.c| 37 + drivers/nvdimm/region_devs.c |7 +++ include/linux/nd.h |1 + 4 files changed, 47 insertions(+), 12 deletions(-) diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c index 48f0985ca8a0..3a777d0073b7 100644 --- a/drivers/nvdimm/bus.c +++ b/drivers/nvdimm/bus.c @@ -631,16 +631,14 @@ void nvdimm_check_and_set_ro(struct gendisk *disk) struct nd_region *nd_region = to_nd_region(dev->parent); int disk_ro = get_disk_ro(disk); - /* -* Upgrade to read-only if the region is read-only preserve as -* read-only if the disk is already read-only. -*/ - if (disk_ro || nd_region->ro == disk_ro) + /* catch the disk up with the region ro state */ + if (disk_ro == nd_region->ro) return; - dev_info(dev, "%s read-only, marking %s read-only\n", - dev_name(_region->dev), disk->disk_name); - set_disk_ro(disk, 1); + dev_info(dev, "%s read-%s, marking %s read-%s\n", +dev_name(_region->dev), nd_region->ro ? "only" : "write", +disk->disk_name, nd_region->ro ? "only" : "write"); + set_disk_ro(disk, nd_region->ro); } EXPORT_SYMBOL(nvdimm_check_and_set_ro); diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index b8a85bfb2e95..7daac795db39 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -26,6 +26,7 @@ #include #include #include "pmem.h" +#include "btt.h" #include "pfn.h" #include "nd.h" @@ -585,7 +586,7 @@ static void nd_pmem_shutdown(struct device *dev) nvdimm_flush(to_nd_region(dev->parent), NULL); } -static void nd_pmem_notify(struct device *dev, enum nvdimm_event event) +static void pmem_revalidate_poison(struct device *dev) { struct nd_region *nd_region; resource_size_t offset = 0, end_trunc = 0; @@ -595,9 +596,6 @@ static void nd_pmem_notify(struct device *dev, enum nvdimm_event event) struct range range; struct kernfs_node *bb_state; - if (event != NVDIMM_REVALIDATE_POISON) - return; - if (is_nd_btt(dev)) { struct nd_btt *nd_btt = to_nd_btt(dev); @@ -635,6 +633,37 @@ static void nd_pmem_notify(struct device *dev, enum nvdimm_event event) sysfs_notify_dirent(bb_state); } +static void pmem_revalidate_region(struct device *dev) +{ + struct pmem_device *pmem; + + if (is_nd_btt(dev)) { + struct nd_btt *nd_btt = to_nd_btt(dev); + struct btt *btt = nd_btt->btt; + + nvdimm_check_and_set_ro(btt->btt_disk); + return; + } + + pmem = dev_get_drvdata(dev); + nvdimm_check_and_set_ro(pmem->disk); +} + +static void nd_pmem_notify(struct device *dev, enum nvdimm_event event) +{ + switch (event) { + case NVDIMM_REVALIDATE_POISON: + pmem_revalidate_poison(dev); + break; + case NVDIMM_REVALIDATE_REGION: + pmem_revalidate_region(dev); + break; + default: + dev_WARN_ONCE(dev, 1, "notify: unknown event: %d\n", event); + break; + } +} + MODULE_ALIAS("pmem"); MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_IO); MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_PMEM); diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c index ef23119db574..51870eb51da6 100644 --- a/drivers/nvdimm/region_devs.c +++ b/drivers/nvdimm/region_devs.c @@ -518,6 +518,1
Re: [PATCH v2] include: Remove pagemap.h from blkdev.h
On Tue, Mar 9, 2021 at 11:59 AM Matthew Wilcox (Oracle) wrote: > > My UEK-derived config has 1030 files depending on pagemap.h before > this change. Afterwards, just 326 files need to be rebuilt when I > touch pagemap.h. I think blkdev.h is probably included too widely, > but untangling that dependency is harder and this solves my problem. > x86 allmodconfig builds, but there may be implicit include problems > on other architectures. > > Signed-off-by: Matthew Wilcox (Oracle) > --- > v2: Fix CONFIG_SWAP=n implicit use of pagemap.h by swap.h. Increases > the number of files from 240, but that's still a big win -- 68% > reduction instead of 77%. > [..] > drivers/nvdimm/btt.c | 1 + > drivers/nvdimm/pmem.c | 1 + For the nvdimm bits: Acked-by: Dan Williams ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] libnvdimm: Let revalidate_disk() revalidate region read-only
On Mon, Mar 8, 2021 at 11:31 PM Christoph Hellwig wrote: > > On Mon, Mar 08, 2021 at 10:54:22PM -0800, Dan Williams wrote: > > Previous kernels allowed the BLKROSET to override the disk's read-only > > status. With that situation fixed the pmem driver needs to rely on > > revalidate_disk() to clear the disk read-only status after the host > > region has been marked read-write. > > > > Recall that when libnvdimm determines that the persistent memory has > > lost persistence (for example lack of energy to flush from DRAM to FLASH > > on an NVDIMM-N device) it marks the region read-only, but that state can > > be overridden by the user via: > > > >echo 0 > /sys/bus/nd/devices/regionX/read_only > > > > ...to date there is no notification that the region has restored > > persistence, so the user override is the only recovery. > > I've just resent my series to kill of ->revalidate_disk for good, so this > obvious makes me a little unhappy. Given that ->revalidate_disk > only ends up beeing called from the same path that ->open is called, > why can't you just hook this up from the open method? > > Also any reason the sysfs attribute can't just directly propagate the > information to the disk? I should have assumed that revalidate_disk() was on the chopping block. Let me take a look at just propagating from the region update down to all affected disks. There's already a notification path for regions to communicate badblocks, should be straightforward to reuse for read-only updates. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[PATCH] libnvdimm: Let revalidate_disk() revalidate region read-only
Previous kernels allowed the BLKROSET to override the disk's read-only status. With that situation fixed the pmem driver needs to rely on revalidate_disk() to clear the disk read-only status after the host region has been marked read-write. Recall that when libnvdimm determines that the persistent memory has lost persistence (for example lack of energy to flush from DRAM to FLASH on an NVDIMM-N device) it marks the region read-only, but that state can be overridden by the user via: echo 0 > /sys/bus/nd/devices/regionX/read_only ...to date there is no notification that the region has restored persistence, so the user override is the only recovery. Fixes: 52f019d43c22 ("block: add a hard-readonly flag to struct gendisk") Cc: Christoph Hellwig Cc: Ming Lei Cc: Martin K. Petersen Cc: Hannes Reinecke Cc: Jens Axboe Reported-by: kernel test robot Reported-by: Vishal Verma Signed-off-by: Dan Williams --- drivers/nvdimm/btt.c |7 +++ drivers/nvdimm/bus.c | 14 ++ drivers/nvdimm/pmem.c |7 +++ 3 files changed, 20 insertions(+), 8 deletions(-) diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c index 41aa1f01fc07..73d3bf5aa208 100644 --- a/drivers/nvdimm/btt.c +++ b/drivers/nvdimm/btt.c @@ -1508,11 +1508,18 @@ static int btt_getgeo(struct block_device *bd, struct hd_geometry *geo) return 0; } +static int btt_revalidate(struct gendisk *disk) +{ + nvdimm_check_and_set_ro(disk); + return 0; +} + static const struct block_device_operations btt_fops = { .owner =THIS_MODULE, .submit_bio = btt_submit_bio, .rw_page = btt_rw_page, .getgeo = btt_getgeo, + .revalidate_disk = btt_revalidate, }; static int btt_blk_init(struct btt *btt) diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c index 48f0985ca8a0..3a777d0073b7 100644 --- a/drivers/nvdimm/bus.c +++ b/drivers/nvdimm/bus.c @@ -631,16 +631,14 @@ void nvdimm_check_and_set_ro(struct gendisk *disk) struct nd_region *nd_region = to_nd_region(dev->parent); int disk_ro = get_disk_ro(disk); - /* -* Upgrade to read-only if the region is read-only preserve as -* read-only if the disk is already read-only. -*/ - if (disk_ro || nd_region->ro == disk_ro) + /* catch the disk up with the region ro state */ + if (disk_ro == nd_region->ro) return; - dev_info(dev, "%s read-only, marking %s read-only\n", - dev_name(_region->dev), disk->disk_name); - set_disk_ro(disk, 1); + dev_info(dev, "%s read-%s, marking %s read-%s\n", +dev_name(_region->dev), nd_region->ro ? "only" : "write", +disk->disk_name, nd_region->ro ? "only" : "write"); + set_disk_ro(disk, nd_region->ro); } EXPORT_SYMBOL(nvdimm_check_and_set_ro); diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index b8a85bfb2e95..af204fce1b1c 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -276,10 +276,17 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, return PHYS_PFN(pmem->size - pmem->pfn_pad - offset); } +static int pmem_revalidate(struct gendisk *disk) +{ + nvdimm_check_and_set_ro(disk); + return 0; +} + static const struct block_device_operations pmem_fops = { .owner =THIS_MODULE, .submit_bio = pmem_submit_bio, .rw_page = pmem_rw_page, + .revalidate_disk = pmem_revalidate, }; static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff, ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()
On Mon, Mar 8, 2021 at 3:34 AM ruansy.f...@fujitsu.com wrote: > > > > > 1 file changed, 8 insertions(+) > > > > > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > > > > index 79c49e7f5c30..0bcf2b1e20bd 100644 > > > > > --- a/include/linux/memremap.h > > > > > +++ b/include/linux/memremap.h > > > > > @@ -87,6 +87,14 @@ struct dev_pagemap_ops { > > > > > * the page back to a CPU accessible page. > > > > > */ > > > > > vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf); > > > > > + > > > > > + /* > > > > > +* Handle the memory failure happens on one page. Notify the > > > > > processes > > > > > +* who are using this page, and try to recover the data on > > > > > this page > > > > > +* if necessary. > > > > > +*/ > > > > > + int (*memory_failure)(struct dev_pagemap *pgmap, unsigned > > > > > long pfn, > > > > > + int flags); > > > > > }; > > > > > > > > After the conversation with Dave I don't see the point of this. If > > > > there is a memory_failure() on a page, why not just call > > > > memory_failure()? That already knows how to find the inode and the > > > > filesystem can be notified from there. > > > > > > We want memory_failure() supports reflinked files. In this case, we are > > > not > > > able to track multiple files from a page(this broken page) because > > > page->mapping,page->index can only track one file. Thus, I introduce this > > > ->memory_failure() implemented in pmem driver, to call ->corrupted_range() > > > upper level to upper level, and finally find out files who are > > > using(mmapping) this page. > > > > > > > I know the motivation, but this implementation seems backwards. It's > > already the case that memory_failure() looks up the address_space > > associated with a mapping. From there I would expect a new 'struct > > address_space_operations' op to let the fs handle the case when there > > are multiple address_spaces associated with a given file. > > > > Let me think about it. In this way, we > 1. associate file mapping with dax page in dax page fault; I think this needs to be a new type of association that proxies the representation of the reflink across all involved address_spaces. > 2. iterate files reflinked to notify `kill processes signal` by the > new address_space_operation; > 3. re-associate to another reflinked file mapping when unmmaping > (rmap qeury in filesystem to get the another file). Perhaps the proxy object is reference counted per-ref-link. It seems error prone to keep changing the association of the pfn while the reflink is in-tact. > It did not handle those dax pages are not in use, because their ->mapping are > not associated to any file. I didn't think it through until reading your > conversation. Here is my understanding: this case should be handled by > badblock mechanism in pmem driver. This badblock mechanism will call > ->corrupted_range() to tell filesystem to repaire the data if possible. There are 2 types of notifications. There are badblocks discovered by the driver (see notify_pmem()) and there are memory_failures() signalled by the CPU machine-check handler, or the platform BIOS. In the case of badblocks that needs to be information considered by the fs block allocator to avoid / try-to-repair badblocks on allocate, and to allow listing damaged files that need repair. The memory_failure() notification needs immediate handling to tear down mappings to that pfn and signal processes that have consumed it with SIGBUS-action-required. Processes that have the poison mapped, but have not consumed it receive SIGBUS-action-optional. > So, we split it into two parts. And dax device and block device won't be > mixed > up again. Is my understanding right? Right, it's only the filesystem that knows that the block_device and the dax_device alias data at the same logical offset. The requirements for sector error handling and page error handling are separate like block_device_operations and dax_operations. > But the solution above is to solve the hwpoison on one or couple pages, which > happens rarely(I think). Do the 'pmem remove' operation cause hwpoison too? > Call memory_failure() so many times? I havn't understood this yet. I'm working on a patch here to call memory_failure() on a wide range for the surprise remove of a dax_device while a filesystem might be mounted. It won't be efficient, but there is no other way to notify the kernel that it needs to immediately stop referencing a page. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()
On Sun, Mar 7, 2021 at 7:38 PM ruansy.f...@fujitsu.com wrote: > > > On Mon, Feb 8, 2021 at 2:55 AM Shiyang Ruan > > wrote: > > > > > > When memory-failure occurs, we call this function which is implemented > > > by each kind of devices. For the fsdax case, pmem device driver > > > implements it. Pmem device driver will find out the block device where > > > the error page locates in, and try to get the filesystem on this block > > > device. And finally call filesystem handler to deal with the error. > > > The filesystem will try to recover the corrupted data if possiable. > > > > > > Signed-off-by: Shiyang Ruan > > > --- > > > include/linux/memremap.h | 8 > > > 1 file changed, 8 insertions(+) > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > > index 79c49e7f5c30..0bcf2b1e20bd 100644 > > > --- a/include/linux/memremap.h > > > +++ b/include/linux/memremap.h > > > @@ -87,6 +87,14 @@ struct dev_pagemap_ops { > > > * the page back to a CPU accessible page. > > > */ > > > vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf); > > > + > > > + /* > > > +* Handle the memory failure happens on one page. Notify the > > > processes > > > +* who are using this page, and try to recover the data on this > > > page > > > +* if necessary. > > > +*/ > > > + int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long > > > pfn, > > > + int flags); > > > }; > > > > After the conversation with Dave I don't see the point of this. If > > there is a memory_failure() on a page, why not just call > > memory_failure()? That already knows how to find the inode and the > > filesystem can be notified from there. > > We want memory_failure() supports reflinked files. In this case, we are not > able to track multiple files from a page(this broken page) because > page->mapping,page->index can only track one file. Thus, I introduce this > ->memory_failure() implemented in pmem driver, to call ->corrupted_range() > upper level to upper level, and finally find out files who are > using(mmapping) this page. > I know the motivation, but this implementation seems backwards. It's already the case that memory_failure() looks up the address_space associated with a mapping. From there I would expect a new 'struct address_space_operations' op to let the fs handle the case when there are multiple address_spaces associated with a given file. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()
On Mon, Feb 8, 2021 at 2:55 AM Shiyang Ruan wrote: > > When memory-failure occurs, we call this function which is implemented > by each kind of devices. For the fsdax case, pmem device driver > implements it. Pmem device driver will find out the block device where > the error page locates in, and try to get the filesystem on this block > device. And finally call filesystem handler to deal with the error. > The filesystem will try to recover the corrupted data if possiable. > > Signed-off-by: Shiyang Ruan > --- > include/linux/memremap.h | 8 > 1 file changed, 8 insertions(+) > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > index 79c49e7f5c30..0bcf2b1e20bd 100644 > --- a/include/linux/memremap.h > +++ b/include/linux/memremap.h > @@ -87,6 +87,14 @@ struct dev_pagemap_ops { > * the page back to a CPU accessible page. > */ > vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf); > + > + /* > +* Handle the memory failure happens on one page. Notify the > processes > +* who are using this page, and try to recover the data on this page > +* if necessary. > +*/ > + int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn, > + int flags); > }; After the conversation with Dave I don't see the point of this. If there is a memory_failure() on a page, why not just call memory_failure()? That already knows how to find the inode and the filesystem can be notified from there. Although memory_failure() is inefficient for large range failures, I'm not seeing a better option, so I'm going to test calling memory_failure() over a large range whenever an in-use dax-device is hot-removed. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Mon, Mar 1, 2021 at 11:57 PM Dave Chinner wrote: > > On Mon, Mar 01, 2021 at 09:41:02PM -0800, Dan Williams wrote: > > On Mon, Mar 1, 2021 at 7:28 PM Darrick J. Wong wrote: > > > > > I really don't see you seem to be telling us that invalidation is an > > > > > either/or choice. There's more ways to convert physical block > > > > > address -> inode file offset and mapping index than brute force > > > > > inode cache walks > > > > > > > > Yes, but I was trying to map it to an existing mechanism and the > > > > internals of drop_pagecache_sb() are, in coarse terms, close to what > > > > needs to happen here. > > > > > > Yes. XFS (with rmap enabled) can do all the iteration and walking in > > > that function except for the invalidate_mapping_* call itself. The goal > > > of this series is first to wire up a callback within both the block and > > > pmem subsystems so that they can take notifications and reverse-map them > > > through the storage stack until they reach an fs superblock. > > > > I'm chuckling because this "reverse map all the way up the block > > layer" is the opposite of what Dave said at the first reaction to my > > proposal, "can't the mm map pfns to fs inode address_spaces?". > > Ah, no, I never said that the filesystem can't do reverse maps. I > was asking if the mm could directly (brute-force) invalidate PTEs > pointing at physical pmem ranges without needing walk the inode > mappings. That would be far more efficient if it could be done > > > Today whenever the pmem driver receives new corrupted range > > notification from the lower level nvdimm > > infrastructure(nd_pmem_notify) it updates the 'badblocks' instance > > associated with the pmem gendisk and then notifies userspace that > > there are new badblocks. This seems a perfect place to signal an upper > > level stacked block device that may also be watching disk->bb. Then > > each gendisk in a stacked topology is responsible for watching the > > badblock notifications of the next level and storing a remapped > > instance of those blocks until ultimately the filesystem mounted on > > the top-level block device is responsible for registering for those > > top-level disk->bb events. > > > > The device gone notification does not map cleanly onto 'struct badblocks'. > > Filesystems are not allowed to interact with the gendisk > infrastructure - that's for supporting the device side of a block > device. It's a layering violation, and many a filesytem developer > has been shouted at for trying to do this. At most we can peek > through it to query functionality support from the request queue, > but otherwise filesystems do not interact with anything under > bdev->bd_disk. So lets add an api that allows the querying of badblocks by bdev and let the block core handle the bd_disk interaction. I see other block functionality like blk-integrity reaching through gendisk. The fs need not interact with the gendisk directly. > > As it is, badblocks are used by devices to manage internal state. > e.g. md for recording stripes that need recovery if the system > crashes while they are being written out. I know, I was there when it was invented which is why it was top-of-mind when pmem had a need to communicate badblocks. Other block drivers have threatened to use it for badblocks tracking, but none of those have carried through on that initial interest. > > > If an upper level agent really cared about knowing about ->remove() > > events before they happened it could maybe do something like: > > > > dev = disk_to_dev(bdev->bd_disk)->parent; > > bus_register_notifier(dev->bus. _host_device_notifier_block) > > Yeah, that's exactly the sort of thing that filesystems have been > aggressively discouraged from doing for years. Yup, it's a layering violation. > Part of the reason for this is that gendisk based mechanisms are not > very good for stacked device error reporting. Part of the problem > here is that every layer of the stacked device has to hook the > notifier of the block devices underneath it, then translate the > event to match the upper block device map, then regenerate the > notification for the next layer up. This isn't an efficient way to > pass a notification through a series of stacked devices and it is > messy and cumbersome to maintain. It's been messy and cumbersome to route new infrastructure through DM every time a new dax_operation arrives. The corrupted_range() routing has the same burden. The advantage of badblocks over corrupted_range() is that it solves the
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Mon, Mar 1, 2021 at 9:38 PM Dave Chinner wrote: > > On Mon, Mar 01, 2021 at 07:33:28PM -0800, Dan Williams wrote: > > On Mon, Mar 1, 2021 at 6:42 PM Dave Chinner wrote: > > [..] > > > We do not need a DAX specific mechanism to tell us "DAX device > > > gone", we need a generic block device interface that tells us "range > > > of block device is gone". > > > > This is the crux of the disagreement. The block_device is going away > > *and* the dax_device is going away. > > No, that is not the disagreement I have with what you are saying. > You still haven't understand that it's even more basic and generic > than devices going away. At the simplest form, all the filesystem > wants is to be notified of is when *unrecoverable media errors* > occur in the persistent storage that underlies the filesystem. > > The filesystem does not care what that media is build from - PMEM, > flash, corroded spinning disks, MRAM, or any other persistent media > you can think off. It just doesn't matter. > > What we care about is that the contents of a *specific LBA range* no > longer contain *valid data*. IOWs, the data in that range of the > block device has been lost, cannot be retreived and/or cannot be > written to any more. > > PMEM taking a MCE because ECC tripped is a media error because data > is lost and inaccessible until recovery actions are taken. > > MD RAID failing a scrub is a media error and data is lost and > unrecoverable at that layer. > > A device disappearing is a media error because the storage media is > now permanently inaccessible to the higher layers. > > This "media error" categorisation is a fundamental property of > persistent storage and, as such, is a property of the block devices > used to access said persistent storage. > > That's the disagreement here - that you and Christoph are saying > ->corrupted_range is not a block device property because only a > pmem/DAX device currently generates it. > > You both seem to be NACKing a generic interface because it's only > implemented for the first subsystem that needs it. AFAICT, you > either don't understand or are completely ignoring the architectural > need for it to be provided across the rest of the storage stack that > *block device based filesystems depend on*. No I'm NAKing it because it's the wrong interface. See my 'struct badblocks' argument in the reply to Darrick. That 'struct badblocks' infrastructure arose from MD and is shared with PMEM. > > Sure, there might be dax device based fielsystems around the corner. > They just require a different pmem device ->corrupted_range callout > to implement the notification - one that directs to the dax device > rather than the block device. That's simple and trivial to > implement, but such functionaity for DAX devices does not replace > the need for the same generic functionality to be provided across a > *range of different block devices* as required by *block device > based filesystems*. > > And that's fundamentally the problem. XFS is block device based, not > DAX device based. We require errors to be reported through block > device mechanisms. fs-dax does not change this - it is based on pmem > being presented as a primarily as a block device to the block device > based filesystems and only secondarily as a dax device. Hence if it > can be trivially implemented as a block device interface, that's > where it should go, because then all the other block devices that > the filesytem runs on can provide the same functionality for similar > media error events Sure, use 'struct badblocks' not struct block_device and block_device_operations. > > > The dax_device removal implies one > > set of actions (direct accessed pfns invalid) the block device removal > > implies another (block layer sector access offline). > > There you go again, saying DAX requires an action, while the block > device notification is a -state change- (i.e. goes offline). There you go reacting to the least generous interpretation of what I said. s/pfns invalid/pfns offline/ > > This is exactly what I said was wrong in my last email. > > > corrupted_range > > is blurring the notification for 2 different failure domains. Look at > > the nascent idea to mount a filesystem on dax sans a block device. > > Look at the existing plumbing for DM to map dax_operations through a > > device stack. > > Ummm, it just maps the direct_access call to the underlying device > and calls it's ->direct_access method. All it's doing is LBA > mapping. That's all it needs to do for ->corrupted_range, too. > I have no clue why you think this is a problem for error > notification... >
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Mon, Mar 1, 2021 at 7:28 PM Darrick J. Wong wrote: > > On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote: > > On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner wrote: > > > > > > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote: > > > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner > > > > wrote: > > > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote: > > > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner > > > > > > wrote: > > > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote: > > > > > it points to, check if it points to the PMEM that is being removed, > > > > > grab the page it points to, map that to the relevant struct page, > > > > > run collect_procs() on that page, then kill the user processes that > > > > > map that page. > > > > > > > > > > So why can't we walk the ptescheck the physical pages that they > > > > > map to and if they map to a pmem page we go poison that > > > > > page and that kills any user process that maps it. > > > > > > > > > > i.e. I can't see how unexpected pmem device unplug is any different > > > > > to an MCE delivering a hwpoison event to a DAX mapped page. > > > > > > > > I guess the tradeoff is walking a long list of inodes vs walking a > > > > large array of pages. > > > > > > Not really. You're assuming all a filesystem has to do is invalidate > > > everything if a device goes away, and that's not true. Finding if an > > > inode has a mapping that spans a specific device in a multi-device > > > filesystem can be a lot more complex than that. Just walking inodes > > > is easy - determining whihc inodes need invalidation is the hard > > > part. > > > > That inode-to-device level of specificity is not needed for the same > > reason that drop_caches does not need to be specific. If the wrong > > page is unmapped a re-fault will bring it back, and re-fault will fail > > for the pages that are successfully removed. > > > > > That's where ->corrupt_range() comes in - the filesystem is already > > > set up to do reverse mapping from physical range to inode(s) > > > offsets... > > > > Sure, but what is the need to get to that level of specificity with > > the filesystem for something that should rarely happen in the course > > of normal operation outside of a mistake? > > I can't tell if we're conflating the "a bunch of your pmem went bad" > case with the "all your dimms fell out of the machine" case. >From the pmem driver perspective it has the media scanning to find some small handful of cachelines that have gone bad, and it has the driver ->remove() callback to tell it a bunch of pmem is now offline. The NVDIMM device "range has gone bad" mechanism has no way to communicate multiple terabytes have gone bad at once. In fact I think the distinction is important that ->remove() is not treated as ->corrupted_range() because I expect the level of freakout is much worse for a "your storage is offline" notification vs "your storage is corrupted" notification. > If, say, a single cacheline's worth of pmem goes bad on a node with 2TB > of pmem, I certainly want that level of specificity. Just notify the > users of the dead piece, don't flush the whole machine down the drain. Right, something like corrupted_range() is there to say, "keep going upper layers, but note that this handful of sectors now has indeterminant data and will return -EIO on access until repaired". The repair for device-offline is device-online. > > > > > There's likely always more pages than inodes, but perhaps it's more > > > > efficient to walk the 'struct page' array than sb->s_inodes? > > > > > > I really don't see you seem to be telling us that invalidation is an > > > either/or choice. There's more ways to convert physical block > > > address -> inode file offset and mapping index than brute force > > > inode cache walks > > > > Yes, but I was trying to map it to an existing mechanism and the > > internals of drop_pagecache_sb() are, in coarse terms, close to what > > needs to happen here. > > Yes. XFS (with rmap enabled) can do all the iteration and walking in > that function except for the invalidate_mapping_* call itself. The goal > of this series is first to wire up a callback within both the block and > pmem subsystems so that they can take notificati
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Mon, Mar 1, 2021 at 6:42 PM Dave Chinner wrote: [..] > We do not need a DAX specific mechanism to tell us "DAX device > gone", we need a generic block device interface that tells us "range > of block device is gone". This is the crux of the disagreement. The block_device is going away *and* the dax_device is going away. The dax_device removal implies one set of actions (direct accessed pfns invalid) the block device removal implies another (block layer sector access offline). corrupted_range is blurring the notification for 2 different failure domains. Look at the nascent idea to mount a filesystem on dax sans a block device. Look at the existing plumbing for DM to map dax_operations through a device stack. Look at the pushback Ruan got for adding a new block_device operation for corrupted_range(). > The reason that the block device is gone is irrelevant to the > filesystem. The type of block device is also irrelevant. If the > filesystem isn't using DAX (e.g. dax=never mount option) even when > it is on a DAX capable device, then it just doesn't care about > invalidating DAX mappings because it has none. But it still may care > about shutting down the filesystem because the block device is gone. Sure, let's have a discussion about a block_device gone notification, and a dax_device gone notification. > This is why we need to communicate what error occurred, not what > action a device driver thinks needs to be taken. The driver is only an event producer in this model, whatever the consumer does at the other end is not its concern. There may be a generic consumer and a filesystem specific consumer. > The error is > important to the filesystem, the action might be completely > irrelevant. And, as we know now, shutdown on DAX enable filesystems > needs to imply DAX mapping invalidation in all cases, not just when > the device disappears from under the filesystem. Sure. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Mon, Mar 1, 2021 at 2:47 PM Dave Chinner wrote: > > On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote: > > On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner wrote: > > > > > > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote: > > > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner > > > > wrote: > > > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote: > > > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner > > > > > > wrote: > > > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote: > > > > > it points to, check if it points to the PMEM that is being removed, > > > > > grab the page it points to, map that to the relevant struct page, > > > > > run collect_procs() on that page, then kill the user processes that > > > > > map that page. > > > > > > > > > > So why can't we walk the ptescheck the physical pages that they > > > > > map to and if they map to a pmem page we go poison that > > > > > page and that kills any user process that maps it. > > > > > > > > > > i.e. I can't see how unexpected pmem device unplug is any different > > > > > to an MCE delivering a hwpoison event to a DAX mapped page. > > > > > > > > I guess the tradeoff is walking a long list of inodes vs walking a > > > > large array of pages. > > > > > > Not really. You're assuming all a filesystem has to do is invalidate > > > everything if a device goes away, and that's not true. Finding if an > > > inode has a mapping that spans a specific device in a multi-device > > > filesystem can be a lot more complex than that. Just walking inodes > > > is easy - determining whihc inodes need invalidation is the hard > > > part. > > > > That inode-to-device level of specificity is not needed for the same > > reason that drop_caches does not need to be specific. If the wrong > > page is unmapped a re-fault will bring it back, and re-fault will fail > > for the pages that are successfully removed. > > > > > That's where ->corrupt_range() comes in - the filesystem is already > > > set up to do reverse mapping from physical range to inode(s) > > > offsets... > > > > Sure, but what is the need to get to that level of specificity with > > the filesystem for something that should rarely happen in the course > > of normal operation outside of a mistake? > > Dan, you made this mistake with the hwpoisoning code that we're > trying to fix that here. Hard coding a 1:1 physical address to > inode/offset into the DAX mapping was a bad mistake. It's also one > that should never have occurred because it's *obviously wrong* to > filesystem developers and has been for a long time. I admit that mistake. The traditional memory error handling model assumptions around page->mapping were broken by DAX, I'm not trying to repeat that mistake. I feel we're talking past each other on the discussion of the proposals. > Now we have the filesytem people providing a mechanism for the pmem > devices to tell the filesystems about physical device failures so > they can handle such failures correctly themselves. Having the > device go away unexpectedly from underneath a mounted and active > filesystem is a *device failure*, not an "unplug event". It's the same difference to the physical page, all mappings to that page need to be torn down. I'm happy to call an fs callback and let each filesystem do what it wants with a "every pfn in this dax device needs to be unmapped". I'm looking at the ->corrupted_range() patches trying to map it to this use case and I don't see how, for example a realtime-xfs over DM over multiple PMEM gets the notification to the right place. bd_corrupted_range() uses get_super() which get the wrong answer for both realtime-xfs and DM. I'd flip that arrangement around and have the FS tell the block device "if something happens to you, here is the super_block to notify". So to me this looks like a fs_dax_register_super() helper that plumbs the superblock through an arbitrary stack of block devices to the leaf block-device that might want to send a notification up when a global unmap operation needs to be performed. I naively think that "for_each_inode() unmap_mapping_range(>i_mapping)" is sufficient as a generic implementation, that does not preclude XFS to override that generic implementation and handle it directly if it so chooses. > The mistake you made was not understanding how filesystems work, > nor actually asking filesystem developer
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Sun, Feb 28, 2021 at 11:27 PM Yasunori Goto wrote: > > Hello, Dan-san, > > On 2021/02/27 4:24, Dan Williams wrote: > > On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong wrote: > >> > >> On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com wrote: > >>> Hi, guys > >>> > >>> Beside this patchset, I'd like to confirm something about the > >>> "EXPERIMENTAL" tag for dax in XFS. > >>> > >>> In XFS, the "EXPERIMENTAL" tag, which is reported in waring message > >>> when we mount a pmem device with dax option, has been existed for a > >>> while. It's a bit annoying when using fsdax feature. So, my initial > >>> intention was to remove this tag. And I started to find out and solve > >>> the problems which prevent it from being removed. > >>> > >>> As is talked before, there are 3 main problems. The first one is "dax > >>> semantics", which has been resolved. The rest two are "RMAP for > >>> fsdax" and "support dax reflink for filesystem", which I have been > >>> working on. > >> > >> > >> > >>> So, what I want to confirm is: does it means that we can remove the > >>> "EXPERIMENTAL" tag when the rest two problem are solved? > >> > >> Yes. I'd keep the experimental tag for a cycle or two to make sure that > >> nothing new pops up, but otherwise the two patchsets you've sent close > >> those two big remaining gaps. Thank you for working on this! > >> > >>> Or maybe there are other important problems need to be fixed before > >>> removing it? If there are, could you please show me that? > >> > >> That remains to be seen through QA/validation, but I think that's it. > >> > >> Granted, I still have to read through the two patchsets... > > > > I've been meaning to circle back here as well. > > > > My immediate concern is the issue Jason recently highlighted [1] with > > respect to invalidating all dax mappings when / if the device is > > ripped out from underneath the fs. I don't think that will collide > > with Ruan's implementation, but it does need new communication from > > driver to fs about removal events. > > > > [1]: > > http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com > > > > I'm not sure why there is a race condition between unbinding operation > and accessing mmaped file on filesystem dax yet. > > May be silly question, but could you tell me why the "unbinding" > operation of the namespace which is mounted by filesystem dax must be > allowed? The unbind operation is used to switch the mode of a namespace between fsdax and devdax. There is no way to fail unbind. At most it can be delayed for a short while to perform cleanup, but it can't be blocked indefinitely. There is the option to specify 'suppress_bind_attrs' to the driver to preclude software triggered device removal, but that would disable mode changes for the device. > If "unbinding" is rejected when the filesystem is mounted with dax > enabled, what is inconvenience? It would be interesting (read difficult) to introduce the concept of dynamic 'suppress_bind_attrs'. Today the decision is static at driver registration time, not in response to how the device is being used. I think global invalidation of all inodes that might be affected by a dax-capable device being ripped away from the filesystem is sufficient and avoids per-fs enabling, but I'm willing to be convinced that ->corrupted_range() is the proper vehicle for this. > > I can imagine if a device like usb memory stick is removed surprisingly, > kernel/filesystem need to reject writeback at the time, and discard page > cache. Then, I can understand that unbinding operation is essential for > such case. For usb the system is protected by the fact that all future block-i/o submissions to the old block-device will fail, and a new usb-device being plugged in will get a new block-device. I.e. the old security model is invalidated / all holes are closed by blk_cleanup_queue(). > But I don't know why PMEM device/namespace allows unbinding operation > like surprising removal event. DAX hands direct mappings to physical pages, if the security model fronting those physical pages changes the mappings attained via the old security model need to be invalidated. blk_cleanup_queue() does not invalidate DAX mappings. The practical value of fixing that hole is small given that physical unplug is not implemented for NVDIMMs today, but the get_user_pages() path can be optimized if this invalidation is implemented, and future hotplug-capable NVDIMM buses like CXL will need this. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner wrote: > > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote: > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner wrote: > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote: > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner > > > > wrote: > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote: > > > it points to, check if it points to the PMEM that is being removed, > > > grab the page it points to, map that to the relevant struct page, > > > run collect_procs() on that page, then kill the user processes that > > > map that page. > > > > > > So why can't we walk the ptescheck the physical pages that they > > > map to and if they map to a pmem page we go poison that > > > page and that kills any user process that maps it. > > > > > > i.e. I can't see how unexpected pmem device unplug is any different > > > to an MCE delivering a hwpoison event to a DAX mapped page. > > > > I guess the tradeoff is walking a long list of inodes vs walking a > > large array of pages. > > Not really. You're assuming all a filesystem has to do is invalidate > everything if a device goes away, and that's not true. Finding if an > inode has a mapping that spans a specific device in a multi-device > filesystem can be a lot more complex than that. Just walking inodes > is easy - determining whihc inodes need invalidation is the hard > part. That inode-to-device level of specificity is not needed for the same reason that drop_caches does not need to be specific. If the wrong page is unmapped a re-fault will bring it back, and re-fault will fail for the pages that are successfully removed. > That's where ->corrupt_range() comes in - the filesystem is already > set up to do reverse mapping from physical range to inode(s) > offsets... Sure, but what is the need to get to that level of specificity with the filesystem for something that should rarely happen in the course of normal operation outside of a mistake? > > > There's likely always more pages than inodes, but perhaps it's more > > efficient to walk the 'struct page' array than sb->s_inodes? > > I really don't see you seem to be telling us that invalidation is an > either/or choice. There's more ways to convert physical block > address -> inode file offset and mapping index than brute force > inode cache walks Yes, but I was trying to map it to an existing mechanism and the internals of drop_pagecache_sb() are, in coarse terms, close to what needs to happen here. > > . > > > > IOWs, what needs to happen at this point is very filesystem > > > specific. Assuming that "device unplug == filesystem dead" is not > > > correct, nor is specifying a generic action that assumes the > > > filesystem is dead because a device it is using went away. > > > > Ok, I think I set this discussion in the wrong direction implying any > > mapping of this action to a "filesystem dead" event. It's just a "zap > > all ptes" event and upper layers recover from there. > > Yes, that's exactly what ->corrupt_range() is intended for. It > allows the filesystem to lock out access to the bad range > and then recover the data. Or metadata, if that's where the bad > range lands. If that recovery fails, it can then report a data > loss/filesystem shutdown event to userspace and kill user procs that > span the bad range... > > FWIW, is this notification going to occur before or after the device > has been physically unplugged? Before. This will be operations that happen in the pmem driver ->remove() callback. > i.e. what do we do about the > time-of-unplug-to-time-of-invalidation window where userspace can > still attempt to access the missing pmem though the > not-yet-invalidated ptes? It may not be likely that people just yank > pmem nvdimms out of machines, but with NVMe persistent memory > spaces, there's every chance that someone pulls the wrong device... The physical removal aspect is only theoretical today. While the pmem driver has a ->remove() path that's purely a software unbind operation. That said the vulnerability window today is if a process acquires a dax mapping, the pmem device hosting that filesystem goes through an unbind / bind cycle, and then a new filesystem is created / mounted. That old pte may be able to access data that is outside its intended protection domain. Going forward, for buses like CXL, there will be a managed physical remove operation via PCIE native hotplug. The flow there is that the PCIE hotplug driver will notify the OS of a pending removal, trigger ->remove() on the pmem driver, and then notify the technician (slot status LED) that the card is safe to pull. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner wrote: > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote: > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner wrote: > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote: > > > > On Fri, Feb 26, 2021 at 12:51 PM Dave Chinner > > > > wrote: > > > > > > My immediate concern is the issue Jason recently highlighted [1] > > > > > > with > > > > > > respect to invalidating all dax mappings when / if the device is > > > > > > ripped out from underneath the fs. I don't think that will collide > > > > > > with Ruan's implementation, but it does need new communication from > > > > > > driver to fs about removal events. > > > > > > > > > > > > [1]: > > > > > > http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com > > > > > > > > > > Oh, yay. > > > > > > > > > > The XFS shutdown code is centred around preventing new IO from being > > > > > issued - we don't actually do anything about DAX mappings because, > > > > > well, I don't think anyone on the filesystem side thought they had > > > > > to do anything special if pmem went away from under it. > > > > > > > > > > My understanding -was- that the pmem removal invalidates > > > > > all the ptes currently mapped into CPU page tables that point at > > > > > the dax device across the system. THe vmas that manage these > > > > > mappings are not really something the filesystem really manages, > > > > > but a function of the mm subsystem. What the filesystem cares about > > > > > is that it gets page faults triggered when a change of state occurs > > > > > so that it can remap the page to it's backing store correctly. > > > > > > > > > > IOWs, all the mm subsystem needs to when pmem goes away is clear the > > > > > CPU ptes, because then when then when userspace tries to access the > > > > > mapped DAX pages we get a new page fault. In processing the fault, the > > > > > filesystem will try to get direct access to the pmem from the block > > > > > device. This will get an ENODEV error from the block device because > > > > > because the backing store (pmem) has been unplugged and is no longer > > > > > there... > > > > > > > > > > AFAICT, as long as pmem removal invalidates all the active ptes that > > > > > point at the pmem being removed, the filesystem doesn't need to > > > > > care about device removal at all, DAX or no DAX... > > > > > > > > How would the pmem removal do that without walking all the active > > > > inodes in the fs at the time of shutdown and call > > > > unmap_mapping_range(inode->i_mapping, 0, 0, 1)? > > > > > > Which then immediately ends up back at the vmas that manage the ptes > > > to unmap them. > > > > > > Isn't finding the vma(s) that map a specific memory range exactly > > > what the rmap code in the mm subsystem is supposed to address? > > > > rmap can lookup only vmas from a virt address relative to a given > > mm_struct. The driver has neither the list of mm_struct objects nor > > virt addresses to do a lookup. All it knows is that someone might have > > mapped pages through the fsdax interface. > > So there's no physical addr to vma translation in the mm subsystem > at all? > > That doesn't make sense. We do exactly this for hwpoison for DAX > mappings. While we don't look at ptes, we get a pfn, True hwpoison does get a known failing pfn and then uses page->mapping to get the 'struct address_space' to do the unmap. I discounted that approach from the outset because it would mean walking every pfn in a multi-terabyte device just in case one is mapped at device shutdown. > it points to, check if it points to the PMEM that is being removed, > grab the page it points to, map that to the relevant struct page, > run collect_procs() on that page, then kill the user processes that > map that page. > > So why can't we walk the ptescheck the physical pages that they > map to and if they map to a pmem page we go poison that > page and that kills any user process that maps it. > > i.e. I can't see how unexpected pmem device unplug is any different > to an MCE delivering a hwpoison event to a DAX mapped page. I guess the tradeoff is walking a long lis
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner wrote: > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote: > > On Fri, Feb 26, 2021 at 12:51 PM Dave Chinner wrote: > > > > > > On Fri, Feb 26, 2021 at 11:24:53AM -0800, Dan Williams wrote: > > > > On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong > > > > wrote: > > > > > > > > > > On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com > > > > > wrote: > > > > > > Hi, guys > > > > > > > > > > > > Beside this patchset, I'd like to confirm something about the > > > > > > "EXPERIMENTAL" tag for dax in XFS. > > > > > > > > > > > > In XFS, the "EXPERIMENTAL" tag, which is reported in waring message > > > > > > when we mount a pmem device with dax option, has been existed for a > > > > > > while. It's a bit annoying when using fsdax feature. So, my > > > > > > initial > > > > > > intention was to remove this tag. And I started to find out and > > > > > > solve > > > > > > the problems which prevent it from being removed. > > > > > > > > > > > > As is talked before, there are 3 main problems. The first one is > > > > > > "dax > > > > > > semantics", which has been resolved. The rest two are "RMAP for > > > > > > fsdax" and "support dax reflink for filesystem", which I have been > > > > > > working on. > > > > > > > > > > > > > > > > > > > > > So, what I want to confirm is: does it means that we can remove the > > > > > > "EXPERIMENTAL" tag when the rest two problem are solved? > > > > > > > > > > Yes. I'd keep the experimental tag for a cycle or two to make sure > > > > > that > > > > > nothing new pops up, but otherwise the two patchsets you've sent close > > > > > those two big remaining gaps. Thank you for working on this! > > > > > > > > > > > Or maybe there are other important problems need to be fixed before > > > > > > removing it? If there are, could you please show me that? > > > > > > > > > > That remains to be seen through QA/validation, but I think that's it. > > > > > > > > > > Granted, I still have to read through the two patchsets... > > > > > > > > I've been meaning to circle back here as well. > > > > > > > > My immediate concern is the issue Jason recently highlighted [1] with > > > > respect to invalidating all dax mappings when / if the device is > > > > ripped out from underneath the fs. I don't think that will collide > > > > with Ruan's implementation, but it does need new communication from > > > > driver to fs about removal events. > > > > > > > > [1]: > > > > http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com > > > > > > Oh, yay. > > > > > > The XFS shutdown code is centred around preventing new IO from being > > > issued - we don't actually do anything about DAX mappings because, > > > well, I don't think anyone on the filesystem side thought they had > > > to do anything special if pmem went away from under it. > > > > > > My understanding -was- that the pmem removal invalidates > > > all the ptes currently mapped into CPU page tables that point at > > > the dax device across the system. THe vmas that manage these > > > mappings are not really something the filesystem really manages, > > > but a function of the mm subsystem. What the filesystem cares about > > > is that it gets page faults triggered when a change of state occurs > > > so that it can remap the page to it's backing store correctly. > > > > > > IOWs, all the mm subsystem needs to when pmem goes away is clear the > > > CPU ptes, because then when then when userspace tries to access the > > > mapped DAX pages we get a new page fault. In processing the fault, the > > > filesystem will try to get direct access to the pmem from the block > > > device. This will get an ENODEV error from the block device because > > > because the backing store (pmem) has been unplugged and is no longer > > > there... > > > > >
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Fri, Feb 26, 2021 at 12:51 PM Dave Chinner wrote: > > On Fri, Feb 26, 2021 at 11:24:53AM -0800, Dan Williams wrote: > > On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong wrote: > > > > > > On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com wrote: > > > > Hi, guys > > > > > > > > Beside this patchset, I'd like to confirm something about the > > > > "EXPERIMENTAL" tag for dax in XFS. > > > > > > > > In XFS, the "EXPERIMENTAL" tag, which is reported in waring message > > > > when we mount a pmem device with dax option, has been existed for a > > > > while. It's a bit annoying when using fsdax feature. So, my initial > > > > intention was to remove this tag. And I started to find out and solve > > > > the problems which prevent it from being removed. > > > > > > > > As is talked before, there are 3 main problems. The first one is "dax > > > > semantics", which has been resolved. The rest two are "RMAP for > > > > fsdax" and "support dax reflink for filesystem", which I have been > > > > working on. > > > > > > > > > > > > > So, what I want to confirm is: does it means that we can remove the > > > > "EXPERIMENTAL" tag when the rest two problem are solved? > > > > > > Yes. I'd keep the experimental tag for a cycle or two to make sure that > > > nothing new pops up, but otherwise the two patchsets you've sent close > > > those two big remaining gaps. Thank you for working on this! > > > > > > > Or maybe there are other important problems need to be fixed before > > > > removing it? If there are, could you please show me that? > > > > > > That remains to be seen through QA/validation, but I think that's it. > > > > > > Granted, I still have to read through the two patchsets... > > > > I've been meaning to circle back here as well. > > > > My immediate concern is the issue Jason recently highlighted [1] with > > respect to invalidating all dax mappings when / if the device is > > ripped out from underneath the fs. I don't think that will collide > > with Ruan's implementation, but it does need new communication from > > driver to fs about removal events. > > > > [1]: > > http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com > > Oh, yay. > > The XFS shutdown code is centred around preventing new IO from being > issued - we don't actually do anything about DAX mappings because, > well, I don't think anyone on the filesystem side thought they had > to do anything special if pmem went away from under it. > > My understanding -was- that the pmem removal invalidates > all the ptes currently mapped into CPU page tables that point at > the dax device across the system. THe vmas that manage these > mappings are not really something the filesystem really manages, > but a function of the mm subsystem. What the filesystem cares about > is that it gets page faults triggered when a change of state occurs > so that it can remap the page to it's backing store correctly. > > IOWs, all the mm subsystem needs to when pmem goes away is clear the > CPU ptes, because then when then when userspace tries to access the > mapped DAX pages we get a new page fault. In processing the fault, the > filesystem will try to get direct access to the pmem from the block > device. This will get an ENODEV error from the block device because > because the backing store (pmem) has been unplugged and is no longer > there... > > AFAICT, as long as pmem removal invalidates all the active ptes that > point at the pmem being removed, the filesystem doesn't need to > care about device removal at all, DAX or no DAX... How would the pmem removal do that without walking all the active inodes in the fs at the time of shutdown and call unmap_mapping_range(inode->i_mapping, 0, 0, 1)? The core-mm does tear down the ptes in the direct map, but user mappings to pmem are not afaics in xfs_do_force_shutdown(). ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: Question about the "EXPERIMENTAL" tag for dax in XFS
On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong wrote: > > On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com wrote: > > Hi, guys > > > > Beside this patchset, I'd like to confirm something about the > > "EXPERIMENTAL" tag for dax in XFS. > > > > In XFS, the "EXPERIMENTAL" tag, which is reported in waring message > > when we mount a pmem device with dax option, has been existed for a > > while. It's a bit annoying when using fsdax feature. So, my initial > > intention was to remove this tag. And I started to find out and solve > > the problems which prevent it from being removed. > > > > As is talked before, there are 3 main problems. The first one is "dax > > semantics", which has been resolved. The rest two are "RMAP for > > fsdax" and "support dax reflink for filesystem", which I have been > > working on. > > > > > So, what I want to confirm is: does it means that we can remove the > > "EXPERIMENTAL" tag when the rest two problem are solved? > > Yes. I'd keep the experimental tag for a cycle or two to make sure that > nothing new pops up, but otherwise the two patchsets you've sent close > those two big remaining gaps. Thank you for working on this! > > > Or maybe there are other important problems need to be fixed before > > removing it? If there are, could you please show me that? > > That remains to be seen through QA/validation, but I think that's it. > > Granted, I still have to read through the two patchsets... I've been meaning to circle back here as well. My immediate concern is the issue Jason recently highlighted [1] with respect to invalidating all dax mappings when / if the device is ripped out from underneath the fs. I don't think that will collide with Ruan's implementation, but it does need new communication from driver to fs about removal events. [1]: http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [ndctl PATCH] Expose ndctl_bus_nfit_translate_spa as a public function.
On Wed, Feb 24, 2021 at 3:48 PM Tsaur, Erwin wrote: > > The motivation is to allow access to ACPI defined NVDIMM Root Device _DSM > Function Index 5(Translate SPA). The rest of the _DSM functions, which are > mostly ARS related, are already public. For future reference you'll want to fix your editor to make sure it wraps at 72 columns for commit messages. If you're a vim user, I have the following in my .vimrc set tw=72 ...and then you can use the 'gq' command to fixup line wrapping after the fact. Otherwise, looks good to me. I was worried that you needed to double check that the bus argument actually is an nfit bus, but bus_has_translate_spa() already takes care of that, so I think we're good to go. Longer term if other buses grow the ability to return the DPA we can add a proper generic wrapper for that, to date I have not seen much support for RAS brewing in other buses. Reviewed-by: Dan Williams ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [GIT PULL] Compute Express Linux (CXL) for v5.12-rc1
As much as I'd love to be working on "Compute Express Linux" the subject should have read "Compute Express Link". On Tue, Feb 23, 2021 at 8:05 PM Dan Williams wrote: > > Hi Linus, please pull from: > > git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm > tags/cxl-for-5.12 > > ...to receive an initial driver for CXL 2.0 Memory Devices. Technical > details are in the tag message and Documentation/. I am taking this > through nvdimm.git this first cycle until the cxl.git repository and > maintainer team can be set up on git.kernel.org. > > In terms of why merge this initial driver now, it establishes just > enough functionality to enumerate these devices and issue all > administrative commands. It sets a v5.12 baseline to develop the more > complicated higher order functionality like memory device > interleaving, persistent memory support, and hotplug which entangle > with ACPI, LIBNVDIMM, and PCI. > > The focus of this release is establishing the ioctl UAPI for the > management commands. Similar to NVME there are a set of standard > commands as well as the possibility for vendor specific commands. > Unlike the NVME driver the CXL driver does not enable vendor specific > command functionality by default. This conservatism is out of concern > for the fact that CXL interleaves memory across devices and implements > host memory. The system integrity implications of some commands are > more severe than NVME and vendor specific functionality is mostly > unauditable. This will be an ongoing topic of discussion with the > wider CXL community for next few months. > > The driver has been developed in the open since November against a > work-in-progress QEMU emulation of the CXL device model. That QEMU > effort has recently attracted contributions from multiple hardware > vendors. > > The driver has appeared in -next. It collected some initial static > analysis fixes and build-robot reports, but all quiet in -next for the > past week. > > A list of review tags that arrived after the branch for -next was cut > is appended to the tag message below. > > --- > > The following changes since commit 1048ba83fb1c00cd24172e23e8263972f6b5d9ac: > > Linux 5.11-rc6 (2021-01-31 13:50:09 -0800) > > are available in the Git repository at: > > git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm > tags/cxl-for-5.12 > > for you to fetch changes up to 88ff5d466c0250259818f3153dbdc4af1f8615dd: > > cxl/mem: Fix potential memory leak (2021-02-22 14:44:39 -0800) > > > cxl for 5.12 > > Introduce an initial driver for CXL 2.0 Type-3 Memory Devices. CXL is > Compute Express Link which released the 2.0 specification in November. > The Linux relevant changes in CXL 2.0 are support for an OS to > dynamically assign address space to memory devices, support for > switches, persistent memory, and hotplug. A Type-3 Memory Device is a > PCI enumerated device presenting the CXL Memory Device Class Code and > implementing the CXL.mem protocol. CXL.mem allows device to advertise > CPU and I/O coherent memory to the system, i.e. typical "System RAM" and > "Persistent Memory" in Linux /proc/iomem terms. > > In addition to the CXL.mem fast path there is an administrative command > hardware mailbox interface for maintenance and provisioning. It is this > command interface that is the focus of the initial driver. With this > driver a CXL device that is mapped by the BIOS can be administered by > Linux. Linux support for CXL PMEM and dynamic CXL address space > management are to be implemented post v5.12. > > 4cdadfd5e0a7 cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints > Reviewed-by: Konrad Rzeszutek Wilk > > 8adaf747c9f0 cxl/mem: Find device capabilities > Reviewed-by: Jonathan Cameron > > b39cb1052a5c cxl/mem: Register CXL memX devices > Reviewed-by: Jonathan Cameron > > 13237183c735 cxl/mem: Add a "RAW" send command > Reviewed-by: Konrad Rzeszutek Wilk > > 472b1ce6e9d6 cxl/mem: Enable commands via CEL > Reviewed-by: Konrad Rzeszutek Wilk > > 57ee605b976c cxl/mem: Add set of informational commands > Reviewed-by: Konrad Rzeszutek Wilk > > > Ben Widawsky (7): > cxl/mem: Find device capabilities > cxl/mem: Add basic IOCTL interface > cxl/mem: Add a "RAW" send command > cxl/mem: Enable commands via CEL > cxl/mem: Add set of informational commands > MAINTAINERS: Add maintainers of the CXL driver > cxl/mem: Fix potential memory leak > > Dan Carpenter (1): > cxl/mem: Return -E
[GIT PULL] Compute Express Linux (CXL) for v5.12-rc1
Hi Linus, please pull from: git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/cxl-for-5.12 ...to receive an initial driver for CXL 2.0 Memory Devices. Technical details are in the tag message and Documentation/. I am taking this through nvdimm.git this first cycle until the cxl.git repository and maintainer team can be set up on git.kernel.org. In terms of why merge this initial driver now, it establishes just enough functionality to enumerate these devices and issue all administrative commands. It sets a v5.12 baseline to develop the more complicated higher order functionality like memory device interleaving, persistent memory support, and hotplug which entangle with ACPI, LIBNVDIMM, and PCI. The focus of this release is establishing the ioctl UAPI for the management commands. Similar to NVME there are a set of standard commands as well as the possibility for vendor specific commands. Unlike the NVME driver the CXL driver does not enable vendor specific command functionality by default. This conservatism is out of concern for the fact that CXL interleaves memory across devices and implements host memory. The system integrity implications of some commands are more severe than NVME and vendor specific functionality is mostly unauditable. This will be an ongoing topic of discussion with the wider CXL community for next few months. The driver has been developed in the open since November against a work-in-progress QEMU emulation of the CXL device model. That QEMU effort has recently attracted contributions from multiple hardware vendors. The driver has appeared in -next. It collected some initial static analysis fixes and build-robot reports, but all quiet in -next for the past week. A list of review tags that arrived after the branch for -next was cut is appended to the tag message below. --- The following changes since commit 1048ba83fb1c00cd24172e23e8263972f6b5d9ac: Linux 5.11-rc6 (2021-01-31 13:50:09 -0800) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/cxl-for-5.12 for you to fetch changes up to 88ff5d466c0250259818f3153dbdc4af1f8615dd: cxl/mem: Fix potential memory leak (2021-02-22 14:44:39 -0800) cxl for 5.12 Introduce an initial driver for CXL 2.0 Type-3 Memory Devices. CXL is Compute Express Link which released the 2.0 specification in November. The Linux relevant changes in CXL 2.0 are support for an OS to dynamically assign address space to memory devices, support for switches, persistent memory, and hotplug. A Type-3 Memory Device is a PCI enumerated device presenting the CXL Memory Device Class Code and implementing the CXL.mem protocol. CXL.mem allows device to advertise CPU and I/O coherent memory to the system, i.e. typical "System RAM" and "Persistent Memory" in Linux /proc/iomem terms. In addition to the CXL.mem fast path there is an administrative command hardware mailbox interface for maintenance and provisioning. It is this command interface that is the focus of the initial driver. With this driver a CXL device that is mapped by the BIOS can be administered by Linux. Linux support for CXL PMEM and dynamic CXL address space management are to be implemented post v5.12. 4cdadfd5e0a7 cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints Reviewed-by: Konrad Rzeszutek Wilk 8adaf747c9f0 cxl/mem: Find device capabilities Reviewed-by: Jonathan Cameron b39cb1052a5c cxl/mem: Register CXL memX devices Reviewed-by: Jonathan Cameron 13237183c735 cxl/mem: Add a "RAW" send command Reviewed-by: Konrad Rzeszutek Wilk 472b1ce6e9d6 cxl/mem: Enable commands via CEL Reviewed-by: Konrad Rzeszutek Wilk 57ee605b976c cxl/mem: Add set of informational commands Reviewed-by: Konrad Rzeszutek Wilk Ben Widawsky (7): cxl/mem: Find device capabilities cxl/mem: Add basic IOCTL interface cxl/mem: Add a "RAW" send command cxl/mem: Enable commands via CEL cxl/mem: Add set of informational commands MAINTAINERS: Add maintainers of the CXL driver cxl/mem: Fix potential memory leak Dan Carpenter (1): cxl/mem: Return -EFAULT if copy_to_user() fails Dan Williams (2): cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints cxl/mem: Register CXL memX devices .clang-format |1 + Documentation/ABI/testing/sysfs-bus-cxl| 26 + Documentation/driver-api/cxl/index.rst | 12 + Documentation/driver-api/cxl/memory-devices.rst| 46 + Documentation/driver-api/index.rst |1 + Documentation/userspace-api/ioctl/ioctl-number.rst |1 + MAINTAINERS| 11 + drivers/Kconfig|1 + drivers/Makefile
[GIT PULL] libnvdimm + device-dax for v5.12-rc1
Hi Linus, please pull from: git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-for-5.12 ...to receive some miscellaneous cleanups and a fix for v5.12. This mainly continues the kernel wide effort to remove a return code from the remove() callback in the driver model. The fix addresses a return code polarity typo in the new sysfs attribute to manually specify a device-dax instance mapping range. This has all appeared in -next with no reported issues. --- The following changes since commit 1048ba83fb1c00cd24172e23e8263972f6b5d9ac: Linux 5.11-rc6 (2021-01-31 13:50:09 -0800) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-for-5.12 for you to fetch changes up to 64ffe84320745ea836555ad207ebfb0e896b6167: Merge branch 'for-5.12/dax' into for-5.12/libnvdimm (2021-02-23 18:13:45 -0800) libnvdimm + device-dax for 5.12 - Fix the error code polarity for the device-dax/mapping attribute - For the device-dax and libnvdimm bus implementations stop implementing a useless return code for the remove() callback. - Miscellaneous cleanups Dan Williams (1): Merge branch 'for-5.12/dax' into for-5.12/libnvdimm Shiyang Ruan (1): device-dax: Fix default return code of range_parse() Uwe Kleine-König (7): libnvdimm/dimm: Simplify nvdimm_remove() libnvdimm: Make remove callback return void device-dax: Prevent registering drivers without probe callback device-dax: Properly handle drivers without remove callback device-dax: Fix error path in dax_driver_register device-dax: Drop an empty .remove callback dax-device: Make remove callback return void drivers/dax/bus.c | 24 +--- drivers/dax/bus.h | 2 +- drivers/dax/device.c | 8 +--- drivers/dax/kmem.c| 7 ++- drivers/dax/pmem/compat.c | 3 +-- drivers/nvdimm/blk.c | 3 +-- drivers/nvdimm/bus.c | 13 + drivers/nvdimm/dimm.c | 7 +-- drivers/nvdimm/pmem.c | 4 +--- drivers/nvdimm/region.c | 4 +--- include/linux/nd.h| 2 +- 11 files changed, 36 insertions(+), 41 deletions(-) ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
On Tue, Feb 23, 2021 at 5:00 PM Jason Gunthorpe wrote: > > On Tue, Feb 23, 2021 at 04:14:01PM -0800, Dan Williams wrote: > > [ add Ralph ] > > > > On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe wrote: > > > > > > On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote: > > > > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe wrote: > > > > > > > > > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote: > > > > > > > > > > > > The downside would be one extra lookup in dev_pagemap tree > > > > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one > > > > > > > per gup-fast() call. > > > > > > > > > > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow > > > > > > path. It should be measurable that this change is at least as fast > > > > > > or > > > > > > faster than falling back to the slow path, but it would be good to > > > > > > measure. > > > > > > > > > > What is the dev_pagemap thing doing in gup fast anyhow? > > > > > > > > > > I've been wondering for a while.. > > > > > > > > It's there to synchronize against dax-device removal. The device will > > > > suspend removal awaiting all page references to be dropped, but > > > > gup-fast could be racing device removal. So gup-fast checks for > > > > pte_devmap() to grab a live reference to the device before assuming it > > > > can pin a page. > > > > > > From the perspective of CPU A it can't tell if CPU B is doing a HW > > > page table walk or a GUP fast when it invalidates a page table. The > > > design of gup-fast is supposed to be the same as the design of a HW > > > page table walk, and the tlb invalidate CPU A does when removing a > > > page from a page table is supposed to serialize against both a HW page > > > table walk and gup-fast. > > > > > > Given that the HW page table walker does not do dev_pagemap stuff, why > > > does gup-fast? > > > > gup-fast historically assumed that the 'struct page' and memory > > backing the page-table walk could not physically be removed from the > > system during its walk because those pages were allocated from the > > page allocator before being mapped into userspace. > > No, I'd say gup-fast assumes that any non-special PTE it finds in a > page table must have a struct page. > > If something wants to remove that struct page it must first remove all > the PTEs pointing at it from the entire system and flush the TLBs, > which directly prevents a future gup-fast from running and trying to > access the struct page. No extra locking needed > > > implied elevated reference on any page that gup-fast would be asked to > > walk, or pte_special() is there to "say wait, nevermind this isn't a > > page allocator page fallback to gup-slow()". > > pte_special says there is no struct page, and some of those cases can > be fixed up in gup-slow. > > > > Can you sketch the exact race this is protecting against? > > > > Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and > > issues direct I/O with that mapping as the target buffer, Thread2 does > > "echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind". Without > > the dev_pagemap check reference gup-fast could execute > > get_page(pte_page(pte)) on a page that doesn't even exist anymore > > because the driver unbind has already performed remove_pages(). > > Surely the unbind either waits for all the VMAs to be destroyed or > zaps them before allowing things to progress to remove_pages()? If we're talking about device-dax this is precisely what it does, zaps and prevents new faults from resolving, but filesystem-dax... > Having a situation where the CPU page tables still point at physical > pages that have been removed sounds so crazy/insecure, that can't be > what is happening, can it?? Hmm, that may be true and an original dax bug! The unbind of a block-device from underneath the filesystem does trigger the filesystem to emergency shutdown / go read-only, but unless that process also includes a global zap of all dax mappings not only is that violating expectations of "page-tables to disappearing memory", but the filesystem may also want to guarantee that no further dax writes can happen after shutdown. Right now I believe it only assumes that mmap I/O will come from page writeback so there's no need to bother applications with mappings to page cache, but dax mappings need to be ripped away. /me goes to look at what filesytems guarantee when the block-device is surprise removed out from under them. In any event, this accelerates the effort to go implement fs-global-dax-zap at the request of the device driver. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
[ add Ralph ] On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe wrote: > > On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote: > > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe wrote: > > > > > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote: > > > > > > > > The downside would be one extra lookup in dev_pagemap tree > > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one > > > > > per gup-fast() call. > > > > > > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow > > > > path. It should be measurable that this change is at least as fast or > > > > faster than falling back to the slow path, but it would be good to > > > > measure. > > > > > > What is the dev_pagemap thing doing in gup fast anyhow? > > > > > > I've been wondering for a while.. > > > > It's there to synchronize against dax-device removal. The device will > > suspend removal awaiting all page references to be dropped, but > > gup-fast could be racing device removal. So gup-fast checks for > > pte_devmap() to grab a live reference to the device before assuming it > > can pin a page. > > From the perspective of CPU A it can't tell if CPU B is doing a HW > page table walk or a GUP fast when it invalidates a page table. The > design of gup-fast is supposed to be the same as the design of a HW > page table walk, and the tlb invalidate CPU A does when removing a > page from a page table is supposed to serialize against both a HW page > table walk and gup-fast. > > Given that the HW page table walker does not do dev_pagemap stuff, why > does gup-fast? gup-fast historically assumed that the 'struct page' and memory backing the page-table walk could not physically be removed from the system during its walk because those pages were allocated from the page allocator before being mapped into userspace. So there is an implied elevated reference on any page that gup-fast would be asked to walk, or pte_special() is there to "say wait, nevermind this isn't a page allocator page fallback to gup-slow()". pte_devmap() is there to say "wait, there is no implied elevated reference for this page, check and hold dev_pagemap alive until a page reference can be taken". So it splits the difference between pte_special() and typical page allocator pages. > Can you sketch the exact race this is protecting against? Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and issues direct I/O with that mapping as the target buffer, Thread2 does "echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind". Without the dev_pagemap check reference gup-fast could execute get_page(pte_page(pte)) on a page that doesn't even exist anymore because the driver unbind has already performed remove_pages(). Effectively the same percpu_ref that protects the pmem0 block device from new command submissions while the device is dying also prevents new dax page references being taken while the device is dying. This could be solved with the traditional gup-fast rules if the device driver could tell the filesystem to unmap all dax files and force them to re-fault through the gup-slow path to see that the device is now dying. I'll likely be working on that sooner rather than later given some of the expectations of the CXL persistent memory "dirty shutdown" detection. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe wrote: > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote: > > > > The downside would be one extra lookup in dev_pagemap tree > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one > > > per gup-fast() call. > > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow > > path. It should be measurable that this change is at least as fast or > > faster than falling back to the slow path, but it would be good to > > measure. > > What is the dev_pagemap thing doing in gup fast anyhow? > > I've been wondering for a while.. It's there to synchronize against dax-device removal. The device will suspend removal awaiting all page references to be dropped, but gup-fast could be racing device removal. So gup-fast checks for pte_devmap() to grab a live reference to the device before assuming it can pin a page. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
On Tue, Feb 23, 2021 at 9:19 AM Joao Martins wrote: > > On 2/23/21 4:50 PM, Dan Williams wrote: > > On Tue, Feb 23, 2021 at 7:46 AM Joao Martins > > wrote: > >> On 2/22/21 8:37 PM, Dan Williams wrote: > >>> On Mon, Feb 22, 2021 at 3:24 AM Joao Martins > >>> wrote: > >>>> On 2/20/21 1:43 AM, Dan Williams wrote: > >>>>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard wrote: > >>>>>> On 12/8/20 9:28 AM, Joao Martins wrote: > > [...] > > >>> Don't get me wrong the > >>> capability is still needed for filesystem-dax, but the distinction is > >>> that vmemmap_populate_compound_pages() need never worry about an > >>> altmap. > >>> > >> IMO there's not much added complexity strictly speaking about altmap. We > >> still use the > >> same vmemmap_{pmd,pte,pgd}_populate helpers which just pass an altmap. So > >> whatever it is > >> being maintained for fsdax or other altmap consumers (e.g. we seem to be > >> working towards > >> hotplug making use of it) we are using it in the exact same way. > >> > >> The complexity of the future vmemmap_populate_compound_pages() has more to > >> do with reusing > >> vmemmap blocks allocated in previous vmemmap pages, and preserving that > >> across section > >> onlining (for 1G pages). > > > > True, I'm less worried about the complexity as much as > > opportunistically converting configurations to RAM backed pages. It's > > already the case that poison handling is page mapping size aligned for > > device-dax, and filesystem-dax needs to stick with non-compound-pages > > for the foreseeable future. > > > Hmm, I was sort off wondering that fsdax could move to compound pages too as > opposed to base pages, albeit not necessarily using the vmemmap page reuse > as it splits pages IIUC. I'm not sure compound pages for fsdax would work long term because there's no infrastructure to reassemble compound pages after a split. So if you fracture a block and then coalesce it back to a 2MB or 1GB aligned block there's nothing to go fixup the compound page... unless the filesystem wants to get into mm metadata fixups. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
On Tue, Feb 23, 2021 at 9:16 AM Joao Martins wrote: > > On 2/23/21 4:44 PM, Dan Williams wrote: > > On Tue, Feb 23, 2021 at 8:30 AM Joao Martins > > wrote: > >> > >> On 2/20/21 1:18 AM, Dan Williams wrote: > >>> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins > >>> wrote: > >>>> Patch 6 - 8: Optimize grabbing/release a page refcount changes given > >>>> that we > >>>> are working with compound pages i.e. we do 1 increment/decrement to the > >>>> head > >>>> page for a given set of N subpages compared as opposed to N individual > >>>> writes. > >>>> {get,pin}_user_pages_fast() for zone_device with compound pagemap > >>>> consequently > >>>> improves considerably, and unpin_user_pages() improves as well when > >>>> passed a > >>>> set of consecutive pages: > >>>> > >>>>before after > >>>> (get_user_pages_fast 1G;2M page size) ~75k us -> ~3.2k ; ~5.2k us > >>>> (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us > >>> > >>> Compelling! > >>> > >> > >> BTW is there any reason why we don't support pin_user_pages_fast() with > >> FOLL_LONGTERM for > >> device-dax? > >> > > > > Good catch. > > > > Must have been an oversight of the conversion. FOLL_LONGTERM collides > > with filesystem operations, but not device-dax. > > h, fwiw, it was unilaterally disabled for any devmap pmd/pud in commit > 7af75561e171 > ("mm/gup: add FOLL_LONGTERM capability to GUP fast") and I must only assume > that > by "DAX pages" the submitter was only referring to fs-dax pages. Agree, that was an fsdax only assumption. Maybe this went unnoticed because the primary gup-fast case for direct-I/O was not impacted. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
On Tue, Feb 23, 2021 at 7:46 AM Joao Martins wrote: > > On 2/22/21 8:37 PM, Dan Williams wrote: > > On Mon, Feb 22, 2021 at 3:24 AM Joao Martins > > wrote: > >> On 2/20/21 1:43 AM, Dan Williams wrote: > >>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard wrote: > >>>> On 12/8/20 9:28 AM, Joao Martins wrote: > >>>>> diff --git a/mm/memremap.c b/mm/memremap.c > >>>>> index 16b2fb482da1..287a24b7a65a 100644 > >>>>> --- a/mm/memremap.c > >>>>> +++ b/mm/memremap.c > >>>>> @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap > >>>>> *pgmap, struct mhp_params *params, > >>>>> memmap_init_zone_device(_DATA(nid)->node_zones[ZONE_DEVICE], > >>>>> PHYS_PFN(range->start), > >>>>> PHYS_PFN(range_len(range)), pgmap); > >>>>> - percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) > >>>>> - - pfn_first(pgmap, range_id)); > >>>>> + if (pgmap->flags & PGMAP_COMPOUND) > >>>>> + percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id) > >>>>> + - pfn_first(pgmap, range_id)) / > >>>>> PHYS_PFN(pgmap->align)); > >>>> > >>>> Is there some reason that we cannot use range_len(), instead of > >>>> pfn_end() minus > >>>> pfn_first()? (Yes, this more about the pre-existing code than about your > >>>> change.) > >>>> > >>>> And if not, then why are the nearby range_len() uses OK? I realize that > >>>> range_len() > >>>> is simpler and skips a case, but it's not clear that it's required here. > >>>> But I'm > >>>> new to this area so be warned. :) > >>> > >>> There's a subtle distinction between the range that was passed in and > >>> the pfns that are activated inside of it. See the offset trickery in > >>> pfn_first(). > >>> > >>>> Also, dividing by PHYS_PFN() feels quite misleading: that function does > >>>> what you > >>>> happen to want, but is not named accordingly. Can you use or create > >>>> something > >>>> more accurately named? Like "number of pages in this large page"? > >>> > >>> It's not the number of pages in a large page it's converting bytes to > >>> pages. Other place in the kernel write it as (x >> PAGE_SHIFT), but my > >>> though process was if I'm going to add () might as well use a macro > >>> that already does this. > >>> > >>> That said I think this calculation is broken precisely because > >>> pfn_first() makes the result unaligned. > >>> > >>> Rather than fix the unaligned pfn_first() problem I would use this > >>> support as an opportunity to revisit the option of storing pages in > >>> the vmem_altmap reserve soace. The altmap's whole reason for existence > >>> was that 1.5% of large PMEM might completely swamp DRAM. However if > >>> that overhead is reduced by an order (or orders) of magnitude the > >>> primary need for vmem_altmap vanishes. > >>> > >>> Now, we'll still need to keep it around for the ->align == PAGE_SIZE > >>> case, but for most part existing deployments that are specifying page > >>> map on PMEM and an align > PAGE_SIZE can instead just transparently be > >>> upgraded to page map on a smaller amount of DRAM. > >>> > >> I feel the altmap is still relevant. Even with the struct page reuse for > >> tail pages, the overhead for 2M align is still non-negligeble i.e. 4G per > >> 1Tb (strictly speaking about what's stored in the altmap). Muchun and > >> Matthew were thinking (in another thread) on compound_head() adjustments > >> that probably can make this overhead go to 2G (if we learn to differentiate > >> the reused head page from the real head page). > > > > I think that realization is more justification to make a new first > > class vmemmap_populate_compound_pages() rather than try to reuse > > vmemmap_populate_basepages() with new parameters. > > > I was already going to move this to vmemmap_populate_compound_pages() based > on your earlier suggestion :) > > >> But even there it's still > >> 2G per 1Tb. 1G pages, though, have
Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
On Tue, Feb 23, 2021 at 8:30 AM Joao Martins wrote: > > On 2/20/21 1:18 AM, Dan Williams wrote: > > On Tue, Dec 8, 2020 at 9:32 AM Joao Martins > > wrote: > >> Patch 6 - 8: Optimize grabbing/release a page refcount changes given that > >> we > >> are working with compound pages i.e. we do 1 increment/decrement to the > >> head > >> page for a given set of N subpages compared as opposed to N individual > >> writes. > >> {get,pin}_user_pages_fast() for zone_device with compound pagemap > >> consequently > >> improves considerably, and unpin_user_pages() improves as well when passed > >> a > >> set of consecutive pages: > >> > >>before after > >> (get_user_pages_fast 1G;2M page size) ~75k us -> ~3.2k ; ~5.2k us > >> (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us > > > > Compelling! > > > > BTW is there any reason why we don't support pin_user_pages_fast() with > FOLL_LONGTERM for > device-dax? > Good catch. Must have been an oversight of the conversion. FOLL_LONGTERM collides with filesystem operations, but not device-dax. In fact that's the motivation for device-dax in the first instance, no need to coordinate runtime physical address layout changes because the device is statically allocated. > Looking at the history, I understand that fsdax can't support it atm, but I > am not sure > that the same holds for device-dax. I have this small chunk (see below the > scissors mark) > which relaxes this for a pgmap of type MEMORY_DEVICE_GENERIC, albeit not sure > if there is > a fundamental issue for the other types that makes this an unwelcoming change. > > Joao > > ->8- > > Subject: [PATCH] mm/gup: allow FOLL_LONGTERM pin-fast for > MEMORY_DEVICE_GENERIC > > The downside would be one extra lookup in dev_pagemap tree > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one > per gup-fast() call. I'd guess a dev_pagemap lookup is faster than a get_user_pages slow path. It should be measurable that this change is at least as fast or faster than falling back to the slow path, but it would be good to measure. Code changes look good to me. > > Signed-off-by: Joao Martins > --- > include/linux/mm.h | 5 + > mm/gup.c | 24 +--- > 2 files changed, 18 insertions(+), 11 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 32f0c3986d4f..c89a049bbd7a 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1171,6 +1171,11 @@ static inline bool is_pci_p2pdma_page(const struct > page *page) > page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA; > } > > +static inline bool devmap_longterm_available(const struct dev_pagemap *pgmap) > +{ I'd call this devmap_can_longterm(). > + return pgmap->type == MEMORY_DEVICE_GENERIC; > +} > + > /* 127: arbitrary random number, small enough to assemble well */ > #define page_ref_zero_or_close_to_overflow(page) \ > ((unsigned int) page_ref_count(page) + 127u <= 127u) > diff --git a/mm/gup.c b/mm/gup.c > index 222d1fdc5cfa..03e370d360e6 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2092,14 +2092,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long > addr, unsigned > long end, > goto pte_unmap; > > if (pte_devmap(pte)) { > - if (unlikely(flags & FOLL_LONGTERM)) > - goto pte_unmap; > - > pgmap = get_dev_pagemap(pte_pfn(pte), pgmap); > if (unlikely(!pgmap)) { > undo_dev_pagemap(nr, nr_start, flags, pages); > goto pte_unmap; > } > + > + if (unlikely(flags & FOLL_LONGTERM) && > + !devmap_longterm_available(pgmap)) { > + undo_dev_pagemap(nr, nr_start, flags, pages); > + goto pte_unmap; > + } > + > } else if (pte_special(pte)) > goto pte_unmap; > > @@ -2195,6 +2199,10 @@ static int __gup_device_huge(unsigned long pfn, > unsigned long addr, > return 0; > } > > + if (unlikely(flags & FOLL_LONGTERM) && > + !devmap_longterm_available(pgmap)) > + return 0; > + > @@ -2356,12 +2364,9 @@ static i
Re: [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given page size
On Mon, Feb 22, 2021 at 3:42 AM Joao Martins wrote: > > > > On 2/20/21 3:34 AM, Dan Williams wrote: > > On Tue, Dec 8, 2020 at 9:32 AM Joao Martins > > wrote: > >> > >> Introduce a new flag, MEMHP_REUSE_VMEMMAP, which signals that > >> struct pages are onlined with a given alignment, and should reuse the > >> tail pages vmemmap areas. On that circunstamce we reuse the PFN backing > > > > s/On that circunstamce we reuse/Reuse/ > > > > Kills a "we" and switches to imperative tense. I noticed a couple > > other "we"s in the previous patches, but this crossed my threshold to > > make a comment. > > > /me nods. Will fix. > > >> only the tail pages subsections, while letting the head page PFN remain > >> different. This presumes that the backing page structs are compound > >> pages, such as the case for compound pagemaps (i.e. ZONE_DEVICE with > >> PGMAP_COMPOUND set) > >> > >> On 2M compound pagemaps, it lets us save 6 pages out of the 8 necessary > > > > s/lets us save/saves/ > > > Will fix. > > >> PFNs necessary > > > > s/8 necessary PFNs necessary/8 PFNs necessary/ > > Will fix. > > > > >> to describe the subsection's 32K struct pages we are > >> onlining. > > > > s/we are onlining/being mapped/ > > > > ...because ZONE_DEVICE pages are never "onlined". > > > >> On a 1G compound pagemap it let us save 4096 pages. > > > > s/lets us save/saves/ > > > > Will fix both. > > >> > >> Sections are 128M (or bigger/smaller), > > > > Huh? > > > > Section size is arch-dependent if we are being hollistic. > On x86 it's 64M, 128M or 512M right? > > #ifdef CONFIG_X86_32 > # ifdef CONFIG_X86_PAE > # define SECTION_SIZE_BITS 29 > # define MAX_PHYSMEM_BITS 36 > # else > # define SECTION_SIZE_BITS 26 > # define MAX_PHYSMEM_BITS 32 > # endif > #else /* CONFIG_X86_32 */ > # define SECTION_SIZE_BITS 27 /* matt - 128 is convenient right now */ > # define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 52 : 46) > #endif > > Also, me pointing about section sizes, is because a 1GB+ page vmemmap > population will > cross sections in how sparsemem populates the vmemmap. And on that case we > gotta reuse the > the PTE/PMD pages across multiple invocations of > vmemmap_populate_basepages(). Either > that, or looking at the previous page PTE, but that might be ineficient. Ok, makes sense I think saying this description of needing to handle section crossing is clearer than mentioning one of the section sizes. > > >> @@ -229,38 +235,95 @@ int __meminit vmemmap_populate_basepages(unsigned > >> long start, unsigned long end, > >> for (; addr < end; addr += PAGE_SIZE) { > >> pgd = vmemmap_pgd_populate(addr, node); > >> if (!pgd) > >> - return -ENOMEM; > >> + return NULL; > >> p4d = vmemmap_p4d_populate(pgd, addr, node); > >> if (!p4d) > >> - return -ENOMEM; > >> + return NULL; > >> pud = vmemmap_pud_populate(p4d, addr, node); > >> if (!pud) > >> - return -ENOMEM; > >> + return NULL; > >> pmd = vmemmap_pmd_populate(pud, addr, node); > >> if (!pmd) > >> - return -ENOMEM; > >> - pte = vmemmap_pte_populate(pmd, addr, node, altmap); > >> + return NULL; > >> + pte = vmemmap_pte_populate(pmd, addr, node, altmap, block); > >> if (!pte) > >> - return -ENOMEM; > >> + return NULL; > >> vmemmap_verify(pte, node, addr, addr + PAGE_SIZE); > >> } > >> > >> + return __va(__pfn_to_phys(pte_pfn(*pte))); > >> +} > >> + > >> +int __meminit vmemmap_populate_basepages(unsigned long start, unsigned > >> long end, > >> +int node, struct vmem_altmap > >> *altmap) > >> +{ > >> + if (!__vmemmap_populate_basepages(start, end, node, altmap, NULL)) > >> + return -ENOMEM; > >> return 0; > >> } > >> > >> +stati
Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
On Mon, Feb 22, 2021 at 3:24 AM Joao Martins wrote: > > On 2/20/21 1:43 AM, Dan Williams wrote: > > On Tue, Dec 8, 2020 at 9:59 PM John Hubbard wrote: > >> On 12/8/20 9:28 AM, Joao Martins wrote: > >>> diff --git a/mm/memremap.c b/mm/memremap.c > >>> index 16b2fb482da1..287a24b7a65a 100644 > >>> --- a/mm/memremap.c > >>> +++ b/mm/memremap.c > >>> @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, > >>> struct mhp_params *params, > >>> memmap_init_zone_device(_DATA(nid)->node_zones[ZONE_DEVICE], > >>> PHYS_PFN(range->start), > >>> PHYS_PFN(range_len(range)), pgmap); > >>> - percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) > >>> - - pfn_first(pgmap, range_id)); > >>> + if (pgmap->flags & PGMAP_COMPOUND) > >>> + percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id) > >>> + - pfn_first(pgmap, range_id)) / > >>> PHYS_PFN(pgmap->align)); > >> > >> Is there some reason that we cannot use range_len(), instead of pfn_end() > >> minus > >> pfn_first()? (Yes, this more about the pre-existing code than about your > >> change.) > >> > >> And if not, then why are the nearby range_len() uses OK? I realize that > >> range_len() > >> is simpler and skips a case, but it's not clear that it's required here. > >> But I'm > >> new to this area so be warned. :) > > > > There's a subtle distinction between the range that was passed in and > > the pfns that are activated inside of it. See the offset trickery in > > pfn_first(). > > > >> Also, dividing by PHYS_PFN() feels quite misleading: that function does > >> what you > >> happen to want, but is not named accordingly. Can you use or create > >> something > >> more accurately named? Like "number of pages in this large page"? > > > > It's not the number of pages in a large page it's converting bytes to > > pages. Other place in the kernel write it as (x >> PAGE_SHIFT), but my > > though process was if I'm going to add () might as well use a macro > > that already does this. > > > > That said I think this calculation is broken precisely because > > pfn_first() makes the result unaligned. > > > > Rather than fix the unaligned pfn_first() problem I would use this > > support as an opportunity to revisit the option of storing pages in > > the vmem_altmap reserve soace. The altmap's whole reason for existence > > was that 1.5% of large PMEM might completely swamp DRAM. However if > > that overhead is reduced by an order (or orders) of magnitude the > > primary need for vmem_altmap vanishes. > > > > Now, we'll still need to keep it around for the ->align == PAGE_SIZE > > case, but for most part existing deployments that are specifying page > > map on PMEM and an align > PAGE_SIZE can instead just transparently be > > upgraded to page map on a smaller amount of DRAM. > > > I feel the altmap is still relevant. Even with the struct page reuse for > tail pages, the overhead for 2M align is still non-negligeble i.e. 4G per > 1Tb (strictly speaking about what's stored in the altmap). Muchun and > Matthew were thinking (in another thread) on compound_head() adjustments > that probably can make this overhead go to 2G (if we learn to differentiate > the reused head page from the real head page). I think that realization is more justification to make a new first class vmemmap_populate_compound_pages() rather than try to reuse vmemmap_populate_basepages() with new parameters. > But even there it's still > 2G per 1Tb. 1G pages, though, have a better story to remove altmap need. The concern that led to altmap is that someone would build a system with a 96:1 (PMEM:RAM) ratio where that correlates to maximum PMEM and minimum RAM, and mapping all PMEM consumes all RAM. As far as I understand real world populations are rarely going past 8:1, that seems to make 'struct page' in RAM feasible even for the 2M compound page case. Let me ask you for a data point, since you're one of the people actively deploying such systems, would you still use the 'struct page' in PMEM capability after this set was merged? > One thing to point out about altmap is that the degradation (in pinning and > unpining) we observed with struct page's in device memory, is no longer > observed > once 1) we batch ref count updates as we move to compound pages 2) reusing > tai
Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users
On Mon, Feb 22, 2021 at 2:24 AM Mike Rapoport wrote: > > On Mon, Feb 22, 2021 at 07:34:52AM +, Matthew Garrett wrote: > > On Mon, Feb 08, 2021 at 10:49:18AM +0200, Mike Rapoport wrote: > > > > > It is unsafe to allow saving of secretmem areas to the hibernation > > > snapshot as they would be visible after the resume and this essentially > > > will defeat the purpose of secret memory mappings. > > > > Sorry for being a bit late to this - from the point of view of running > > processes (and even the kernel once resume is complete), hibernation is > > effectively equivalent to suspend to RAM. Why do they need to be handled > > differently here? > > Hibernation leaves a copy of the data on the disk which we want to prevent. Why not document that users should use data at rest protection mechanisms for their hibernation device? Just because secretmem can't assert its disclosure guarantee does not mean the hibernation device is untrustworthy. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] cxl/mem: Fixes to IOCTL interface
On Sat, Feb 20, 2021 at 7:47 PM Ben Widawsky wrote: > > On 21-02-20 18:38:36, Dan Williams wrote: > > On Sat, Feb 20, 2021 at 1:57 PM Ben Widawsky wrote: > > > > > > When submitting a command for userspace, input and output payload bounce > > > buffers are allocated. For a given command, both input and output > > > buffers may exist and so when allocation of the input buffer fails, the > > > output buffer must be freed. As far as I can tell, userspace can't > > > easily exploit the leak to OOM a machine unless the machine was already > > > near OOM state. > > > > > > This bug was introduced in v5 of the patch and did not exist in prior > > > revisions. > > > > > > > Thanks for the quick turnaround, but I think that speed introduced > > some issues... > > > > > While here, adjust the variable 'j' found in patch review by Konrad. > > > > Please split this pure cleanup to its own patch. The subject says > > "Fixes", but it's only the one fix. > > > > This was intentional. I pinged you internally to just drop it if you don't > like > to combine these kind of things. Apologies, I've been offline this weekend and missed that. I will point out though that in general I will not change patch contents on the way upstream. The adjusted patch must appear on the mailing list. The new korg support for attaching attestation to patches requires they all be on the list, so even small adjustments need new postings. > It didn't feel worthwhile to introduce a new > patch to change the 'j'. Janitorial patches happen all the time, so I think it's ok. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org