Re: [block] 52f019d43c: ndctl.test-libndctl.fail

2021-03-05 Thread Jeff Moyer
Christoph Hellwig  writes:

> Dan,
>
> can you make any sense of thos report?
>
> name='nfit_test'
> path='/lib/modules/5.11.0-rc5-3-g52f019d43c22/extra/test/nfit_test.ko'
>> check_set_config_data: dimm: 0 read2 data miscompare: 0
>> check_set_config_data: dimm: 0x1 read2 data miscompare: 0
>> check_set_config_data: dimm: 0x100 read2 data miscompare: 0
>> check_set_config_data: dimm: 0x101 read2 data miscompare: 0
>> check_dax_autodetect: dax_ndns: 0x558a74d92f00 ndns: 0x558a74d92f00
>> check_dax_autodetect: dax_ndns: 0x558a74d91f40 ndns: 0x558a74d91f40
>> check_pfn_autodetect: pfn_ndns: 0x558a74d91f40 ndns: 0x558a74d91f40
>> check_pfn_autodetect: pfn_ndns: 0x558a74d8c5e0 ndns: 0x558a74d8c5e0
>> check_btt_autodetect: btt_ndns: 0x558a74d8c5e0 ndns: 0x558a74d8c5e0
>> check_btt_autodetect: btt_ndns: 0x558a74da1390 ndns: 0x558a74da1390
>> check_btt_autodetect: btt_ndns: 0x558a74d8c5e0 ndns: 0x558a74d8c5e0
>> check_btt_autodetect: btt_ndns: 0x558a74d91f40 ndns: 0x558a74d91f40
>> namespace7.0: failed to write /dev/pmem7
>> check_namespaces: namespace7.0 validate_bdev failed
>> ndctl-test1 failed: -6

This is from test/libndctl.c in the ndctl repo:

fd = open(bdevpath, O_RDONLY);
if (fd < 0) {
fprintf(stderr, "%s: failed to open(%s, O_RDONLY)\n",
devname, bdevpath);
return -ENXIO;
}
...
ro = 0;
rc = ioctl(fd, BLKROSET, );
if (rc < 0) {
fprintf(stderr, "%s: BLKROSET failed\n",
devname);
rc = -errno;
goto out;
}

close(fd);
fd = open(bdevpath, O_RDWR|O_DIRECT);
...
if (write(fd, buf, 4096) < 4096) {
fprintf(stderr, "%s: failed to write %s\n",
devname, bdevpath);
rc = -ENXIO;
goto out;
}

HTH,
Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH ndctl] dimm: re-fix potential fd leakage in dimm_action()

2021-02-15 Thread Jeff Moyer
Michal Suchanek  writes:

> There are cases not covered by the original fix and cases added by the
> latter patch.
>
> Also there is one case of usage added without returning from the
> function.
>
> Fixes: ff434d87ccbd ("dimm: fix potential fd leakage in dimm_action()")
> Fixes: 41a7e24af5db ("ndctl/dimm: Auto-arm firmware activation")
>
> Signed-off-by: Michal Suchanek 
> ---
>  ndctl/dimm.c | 14 ++
>  1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/ndctl/dimm.c b/ndctl/dimm.c
> index 09ce49e1d2ca..1c177b5494ec 100644
> --- a/ndctl/dimm.c
> +++ b/ndctl/dimm.c
> @@ -1333,12 +1333,15 @@ static int dimm_action(int argc, const char **argv, 
> struct ndctl_ctx *ctx,
>   if (param.arm_set && param.disarm_set) {
>   fprintf(stderr, "set either --arm, or --disarm, not both\n");
>   usage_with_options(u, options);
> + rc = -EINVAL;
> + goto out_close_fout;

usage_with_options calls exit():

void usage_with_options(const char * const *usagestr,
const struct option *opts)
{
usage_with_options_internal(usagestr, opts, 0);
exit(129);
}

So I don't think this patch is necessary.

Cheers,
Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[ndctl patch] zero_info_block: skip seed devices

2021-02-09 Thread Jeff Moyer
Currently, ndctl destroy-namespace -f all will output errors of the
form:

  Error: destroy namespace: namespace0.0 failed to enable for zeroing, 
continuing

for any zero-sized namespace.  That particular namespace looks like this:

  {
"dev":"namespace0.0",
"mode":"raw",
"size":0,
"uuid":"----",
"sector_size":512,
"state":"disabled"
  }

This patch skips over namespaces with size=0 when zeroing out info
blocks.

Fixes: 46654c2d60b70 ("ndctl/namespace: Always zero info-blocks")
Reported-by: Zhang Yi 
Signed-off-by: Jeff Moyer 

diff --git a/ndctl/namespace.c b/ndctl/namespace.c
index 0c8df9f..de1e08f 100644
--- a/ndctl/namespace.c
+++ b/ndctl/namespace.c
@@ -1052,6 +1052,9 @@ static int zero_info_block(struct ndctl_namespace *ndns)
void *buf = NULL, *read_buf = NULL;
char path[50];
 
+   if (ndctl_namespace_get_size(ndns) == 0)
+   return 1;
+
ndctl_namespace_set_raw_mode(ndns, 1);
rc = ndctl_namespace_enable(ndns);
if (rc < 0) {
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH] ACPI: NFIT: Fix input validation of bus-family

2020-11-24 Thread Jeff Moyer
Dan Williams  writes:

> Dan reports that smatch thinks userspace can craft an out-of-bound bus
> family number. However, nd_cmd_clear_to_send() blocks all non-zero
> values of bus-family since only the kernel can initiate these commands.
> However, in the speculation path, family is a user controlled array
> index value so mask it for speculation safety. Also, since the
> nd_cmd_clear_to_send() safety is non-obvious and possibly may change in
> the future include input validation is if userspace could get past the
> nd_cmd_clear_to_send() gatekeeper.
>
> Link: http://lore.kernel.org/r/2020113000.GA1237157@mwanda
> Reported-by: Dan Carpenter 
> Fixes: 6450ddbd5d8e ("ACPI: NFIT: Define runtime firmware activation 
> commands")
> Cc: 
> Signed-off-by: Dan Williams 
> ---
>  drivers/acpi/nfit/core.c |6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index cda7b6c52504..b11b08a60684 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -5,6 +5,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -479,8 +480,11 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, 
> struct nvdimm *nvdimm,
>   cmd_mask = nd_desc->cmd_mask;
>   if (cmd == ND_CMD_CALL && call_pkg->nd_family) {
>   family = call_pkg->nd_family;
> - if (!test_bit(family, _desc->bus_family_mask))
> + if (family > NVDIMM_BUS_FAMILY_MAX ||
> + !test_bit(family, _desc->bus_family_mask))
>   return -EINVAL;
> + family = array_index_nospec(family,
> + NVDIMM_BUS_FAMILY_MAX + 1);
>   dsm_mask = acpi_desc->family_dsm_mask[family];
>   guid = to_nfit_bus_uuid(family);
>   } else {

Reviewed-by: Jeff Moyer 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [ndctl PATCH 0/8] fix serverl issues reported by Coverity

2020-11-20 Thread Jeff Moyer
Zhiqiang Liu  writes:

> Recently, we use Coverity to analysis the ndctl package.
> Several issues should be resolved to make Coverity happy.
>
> lihaotian9 (8):
>   namespace: check whether pfn|dax|btt is NULL in setup_namespace
>   lib/libndctl: fix memory leakage problem in add_bus
>   libdaxctl: fix memory leakage in add_dax_region()
>   dimm: fix potential fd leakage in dimm_action()
>   util/help: check whether strdup returns NULL in exec_man_konqueror
>   lib/inject: check whether cmd is created successfully
>   libndctl: check whether ndctl_btt_get_namespace returns NULL in
> callers
>   namespace: check whether seed is NULL in validate_namespace_options
>
>  daxctl/lib/libdaxctl.c |  3 +++
>  ndctl/dimm.c   | 12 +++-
>  ndctl/lib/inject.c |  8 
>  ndctl/lib/libndctl.c   |  1 +
>  ndctl/namespace.c  | 23 ++-
>  test/libndctl.c| 16 +++-
>  test/parent-uuid.c |  2 +-
>  util/help.c|  8 +++-
>  util/json.c|  3 +++
>  9 files changed, 59 insertions(+), 17 deletions(-)

Except for the minor nit I pointed out, for the  series:

Acked-by: Jeff Moyer 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [ndctl PATCH 4/8] dimm: fix potential fd leakage in dimm_action()

2020-11-20 Thread Jeff Moyer
Zhiqiang Liu  writes:

> In dimm_action(), actx.f_out and actx.f_in may be set by calling
> fopen(). If exceptions occur, we will directly goto out tag.
> However, we did not close actx.f_out|actx.f_in in out tag, which
> will cause fd leakage.
>
> Signed-off-by: Zhiqiang Liu 
> ---
>  ndctl/dimm.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/ndctl/dimm.c b/ndctl/dimm.c
> index 90eb0b8..2f52cda 100644
> --- a/ndctl/dimm.c
> +++ b/ndctl/dimm.c
> @@ -1352,7 +1352,7 @@ static int dimm_action(int argc, const char **argv, 
> struct ndctl_ctx *ctx,
>   fprintf(stderr, "failed to open: %s: (%s)\n",
>   param.infile, strerror(errno));
>   rc = -errno;
> - goto out;
> + goto out_close_fout;
>   }
>   }
>
> @@ -1371,7 +1371,7 @@ static int dimm_action(int argc, const char **argv, 
> struct ndctl_ctx *ctx,
>   fprintf(stderr, "'%s' is not a valid label version\n",
>   param.labelversion);
>   rc = -EINVAL;
> - goto out;
> + goto out_close_fin_fout;
>   }
>
>   rc = 0;
> @@ -1423,12 +1423,14 @@ static int dimm_action(int argc, const char **argv, 
> struct ndctl_ctx *ctx,
>   util_display_json_array(actx.f_out, actx.jdimms, flags);
>   }
>
> - if (actx.f_out != stdout)
> - fclose(actx.f_out);
> -
> + out_close_fin_fout:
>   if (actx.f_in != stdin)
>   fclose(actx.f_in);
>
> + out_close_fout:
> + if (actx.f_out != stdout)
> + fclose(actx.f_out);
> +
>   out:
>   /*
>* count if some actions succeeded, 0 if none were attempted,

Acked-by: Jeff Moyer 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [ndctl PATCH 3/8] libdaxctl: fix memory leakage in add_dax_region()

2020-11-20 Thread Jeff Moyer
Zhiqiang Liu  writes:

> In add_dax_region(), region->devname is allocated by
> calling strdup(), which may return NULL. So, we need
> to check whether region->devname is NULL, and free
> region->devname in err_read tag.
>
> Signed-off-by: Zhiqiang Liu 
> ---
>  daxctl/lib/libdaxctl.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/daxctl/lib/libdaxctl.c b/daxctl/lib/libdaxctl.c
> index ee4a069..d841b78 100644
> --- a/daxctl/lib/libdaxctl.c
> +++ b/daxctl/lib/libdaxctl.c
> @@ -287,6 +287,8 @@ static struct daxctl_region *add_dax_region(void *parent, 
> int id,
>   region->refcount = 1;
>   list_head_init(>devices);
>   region->devname = strdup(devpath_to_devname(base));
> + if (!region->devname)
> + goto err_read;
>
>   sprintf(path, "%s/%s/size", base, attrs);
>   if (sysfs_read_attr(ctx, path, buf) == 0)
> @@ -314,6 +316,7 @@ static struct daxctl_region *add_dax_region(void *parent, 
> int id,
>   err_read:
>   free(region->region_buf);
>   free(region->region_path);
> + free(region->devname);
>   free(region);
>   err_region:
>   free(path);

Acked-by: Jeff Moyer 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [ndctl PATCH 2/8] lib/libndctl: fix memory leakage problem in add_bus

2020-11-20 Thread Jeff Moyer
Zhiqiang Liu  writes:

> In add_bus(), bus->bus_path is set by calling parent_dev_path(),
> which will finally adopt realpath(, NULL) to allocate new path.
> However, bus->bus_path will not be freed in err_read tag, then,
> memory leakage occurs.
>
> Signed-off-by: Zhiqiang Liu 
> ---
>  ndctl/lib/libndctl.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/ndctl/lib/libndctl.c b/ndctl/lib/libndctl.c
> index ad521d3..3926382 100644
> --- a/ndctl/lib/libndctl.c
> +++ b/ndctl/lib/libndctl.c
> @@ -975,6 +975,7 @@ static void *add_bus(void *parent, int id, const char 
> *ctl_base)
>   free(bus->wait_probe_path);
>   free(bus->scrub_path);
>   free(bus->provider);
> + free(bus->bus_path);
>   free(bus->bus_buf);
>   free(bus);
>   err_bus:

Acked-by: Jeff Moyer 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [ndctl PATCH 1/8] namespace: check whether pfn|dax|btt is NULL in setup_namespace

2020-11-20 Thread Jeff Moyer
Zhiqiang Liu  writes:

> In setup_namespace(), pfn|dax|btt is obtained by calling
> ndctl_region_get_**_seed(), which may return NULL. So we
> need to check whether pfn|dax|btt is NULL before accessing
> them.
>
> Signed-off-by: Zhiqiang Liu 
> ---
>  ndctl/namespace.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/ndctl/namespace.c b/ndctl/namespace.c
> index e946bb6..257384d 100644
> --- a/ndctl/namespace.c
> +++ b/ndctl/namespace.c
> @@ -549,6 +549,8 @@ static int setup_namespace(struct ndctl_region *region,
>
>   if (do_setup_pfn(ndns, p)) {
>   struct ndctl_pfn *pfn = ndctl_region_get_pfn_seed(region);
> + if (!pfn)
> + return -ENXIO;
>
>   rc = check_dax_align(ndns);
>   if (rc)
> @@ -563,6 +565,8 @@ static int setup_namespace(struct ndctl_region *region,
>   ndctl_pfn_set_namespace(pfn, NULL);
>   } else if (p->mode == NDCTL_NS_MODE_DEVDAX) {
>   struct ndctl_dax *dax = ndctl_region_get_dax_seed(region);
> + if (!dax)
> + return -ENXIO;
>
>   rc = check_dax_align(ndns);
>   if (rc)
> @@ -577,7 +581,8 @@ static int setup_namespace(struct ndctl_region *region,
>   ndctl_dax_set_namespace(dax, NULL);
>   } else if (p->mode == NDCTL_NS_MODE_SECTOR) {
>   struct ndctl_btt *btt = ndctl_region_get_btt_seed(region);
> -
> + if (!btt)
> + return -ENXIO;
>   /*
>* Handle the case of btt on a pmem namespace where the
>* pmem kernel support is pre-v1.2 namespace labels

Minor inconsistency in the last hunk.  The empty line should come after
the return, no?

Other than that, LGTM.

Acked-by: Jeff Moyer 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [5.4-stable PATCH 0/7] libnvdimm: Cross-arch compatible namespace alignment

2020-05-26 Thread Jeff Moyer
Dan Williams  writes:

>> What problems with 5.4.y and 5.6.y is this series fixing
>> that used to work before?
>
> The "used to work" bug fixed by this set is the fact that the kernel
> used to force a 128MB (memory hotplug section size) alignment padding
> on all persistent memory namespaces to enable DAX operation. The
> support for sub-sections (2MB) dropped forced alignment padding, but
> unfortunately introduced a regression for the case of trying to create
> multiple unaligned namespaces. When that bug triggers namespace
> creation for the region is disabled, iirc, previously that lockout
> scenario was prevented.
>
> Jeff, can you corroborate this?

So, I don't pretend to remember the exact state of brokenness for each
iteration.  :)  As far as I can recall, though, the issue you describe
with a misaligned namespace preventing further namespace creation was
present in all kernels up until it was finally fixed.

> I otherwise agree, if the above never worked then this can all wait
> for v5.7 upgrades.

I can test specific kernel versions if that would help out.

Cheers,
Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 3/5] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-05-21 Thread Jeff Moyer
Dan Williams  writes:

>> But I agree with your concern that if we have older kernel/applications
>> that continue to use `dcbf` on future hardware we will end up
>> having issues w.r.t powerfail consistency. The plan is what you outlined
>> above as tighter ecosystem control. Considering we don't have a pmem
>> device generally available, we get both kernel and userspace upgraded
>> to use these new instructions before such a device is made available.

I thought power already supported NVDIMM-N, no?  So are you saying that
those devices will continue to work with the existing flushing and
fencing mechanisms?

> Ok, I think a compile time kernel option with a runtime override
> satisfies my concern. Does that work for you?

The compile time option only helps when running newer kernels.  I'm not
sure how you would even begin to audit userspace applications (keep in
mind, not every application is open source, and not every application
uses pmdk).  I also question the merits of forcing the administrator to
make the determination of whether all applications on the system will
work properly.  Really, you have to rely on the vendor to tell you the
platform is supported, and at that point, why put further hurdles in the
way?

The decision to require different instructions on ppc is unfortunate,
but one I'm sure we have no control over.  I don't see any merit in the
kernel disallowing MAP_SYNC access on these platforms.  Ideally, we'd
have some way of ensuring older kernels don't work with these new
platforms, but I don't think that's possible.

Moving on to the patch itself--Aneesh, have you audited other persistent
memory users in the kernel?  For example, drivers/md/dm-writecache.c does
this:

static void writecache_commit_flushed(struct dm_writecache *wc, bool 
wait_for_ios)
{
if (WC_MODE_PMEM(wc))
wmb(); <==
else
ssd_commit_flushed(wc, wait_for_ios);
}

I believe you'll need to make modifications there.

Cheers,
Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 0/5] Manual definition of Soft Reserved memory devices

2020-03-06 Thread Jeff Moyer
Dan Williams  writes:

> Given the current dearth of systems that supply an ACPI HMAT table, and
> the utility of being able to manually define device-dax "hmem" instances
> via the efi_fake_mem= option, relax the requirements for creating these
> devices. Specifically, add an option (numa=nohmat) to optionally disable
> consideration of the HMAT and update efi_fake_mem= to behave like
> memmap=nn!ss in terms of delimiting device boundaries.

So, am I correct in deducing that your primary motivation is testing
without hardware/firmware support?  This looks like a bit of a hack to
me, and I think maybe it would be better to just emulate the HMAT using
qemu.  I don't have a strong objection, though.

-Jeff

>
> All review welcome of course, but the E820 changes want an x86
> maintainer ack, the efi_fake_mem update needs Ard, and Rafael has
> previously shepherded the HMAT changes. For the changes to
> kernel/resource.c, where there is no clear maintainer, I just copied the
> last few people to make thoughtful changes in that area. I am happy to
> take these through the nvdimm tree along with these prerequisites
> already in -next:
>
> b2ca916ce392 ACPI: NUMA: Up-level "map to online node" functionality
> 4fcbe96e4d0b mm/numa: Skip NUMA_NO_NODE and online nodes in 
> numa_map_to_online_node()
> 575e23b6e13c powerpc/papr_scm: Switch to numa_map_to_online_node()
> 1e5d8e1e47af x86/mm: Introduce CONFIG_NUMA_KEEP_MEMINFO
> 5d30f92e7631 x86/NUMA: Provide a range-to-target_node lookup facility
> 7b27a8622f80 libnvdimm/e820: Retrieve and populate correct 'target_node' info
>
> Tested with:
>
> numa=nohmat efi_fake_mem=4G@9G:0x4,4G@13G:0x4
>
> ...to create to device-dax instances:
>
>   # daxctl list -RDu
>   [
> {
>   "path":"\/platform\/hmem.1",
>   "id":1,
>   "size":"4.00 GiB (4.29 GB)",
>   "align":2097152,
>   "devices":[
> {
>   "chardev":"dax1.0",
>   "size":"4.00 GiB (4.29 GB)",
>   "target_node":3,
>   "mode":"devdax"
> }
>   ]
> },
> {
>   "path":"\/platform\/hmem.0",
>   "id":0,
>   "size":"4.00 GiB (4.29 GB)",
>   "align":2097152,
>   "devices":[
> {
>   "chardev":"dax0.0",
>   "size":"4.00 GiB (4.29 GB)",
>   "target_node":2,
>   "mode":"devdax"
> }
>   ]
> }
>   ]
>
> ---
>
> Dan Williams (5):
>   ACPI: NUMA: Add 'nohmat' option
>   efi/fake_mem: Arrange for a resource entry per efi_fake_mem instance
>   ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device
>   resource: Report parent to walk_iomem_res_desc() callback
>   ACPI: HMAT: Attach a device for each soft-reserved range
>
>
>  arch/x86/kernel/e820.c  |   16 +-
>  arch/x86/mm/numa.c  |4 +
>  drivers/acpi/numa/hmat.c|   71 +++---
>  drivers/dax/Kconfig |5 ++
>  drivers/dax/Makefile|3 -
>  drivers/dax/hmem/Makefile   |6 ++
>  drivers/dax/hmem/device.c   |   97 
> +++
>  drivers/dax/hmem/hmem.c |2 -
>  drivers/firmware/efi/x86_fake_mem.c |   12 +++-
>  include/acpi/acpi_numa.h|1 
>  include/linux/dax.h |8 +++
>  kernel/resource.c   |1 
>  12 files changed, 156 insertions(+), 70 deletions(-)
>  create mode 100644 drivers/dax/hmem/Makefile
>  create mode 100644 drivers/dax/hmem/device.c
>  rename drivers/dax/{hmem.c => hmem/hmem.c} (98%)
>
> base-commit: 7b27a8622f802761d5c6abd6c37b22312a35343c
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [ndctl PATCH 1/2] ndctl/region: Support ndctl_region_{get, set}_align()

2020-02-26 Thread Jeff Moyer
Dan Williams  writes:

> On Wed, Feb 26, 2020 at 1:52 PM Jeff Moyer  wrote:
>>
>> Dan Williams  writes:
>>
>> >> Missing doctext.  Specifically, there should be a big, fat warning
>> >> against changing the region alignment.
>> >
>> > I don't mind adding one, but is this the right place to document an
>> > API warning? If the audience is future ndctl developers that should be
>> > warned to keep the status quo of not plumbing this capability into
>> > "create-namespace" that's one message. If it's to stop other libndctl
>> > application developers, they'll likely never see this source file.
>>
>> I meant to target users of the library (not ndctl developers).  I
>> thought that was the reason for the doctext on exported interfaces.  No?
>>
>> I admit, I don't know how users of libndctl figure *anything* out about
>> how to use it.  :)
>>
>
> Right, that's why I was confused about what you were asking. We
> haven't yet formalized a library documentation system, which is bad.
> I'll add kernel-doc for this function, and add an item to the backlog
> to figure out how to build library-documentation from those
> annotations. The developer's guide to date has unfortunately been "go
> review how ndctl uses it".

OK, thanks a lot!

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [ndctl PATCH 1/2] ndctl/region: Support ndctl_region_{get, set}_align()

2020-02-26 Thread Jeff Moyer
Dan Williams  writes:

>> Missing doctext.  Specifically, there should be a big, fat warning
>> against changing the region alignment.
>
> I don't mind adding one, but is this the right place to document an
> API warning? If the audience is future ndctl developers that should be
> warned to keep the status quo of not plumbing this capability into
> "create-namespace" that's one message. If it's to stop other libndctl
> application developers, they'll likely never see this source file.

I meant to target users of the library (not ndctl developers).  I
thought that was the reason for the doctext on exported interfaces.  No?

I admit, I don't know how users of libndctl figure *anything* out about
how to use it.  :)

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-25 Thread Jeff Moyer
Dan Williams  writes:

> On Mon, Feb 24, 2020 at 1:53 PM Jeff Moyer  wrote:
>>
>> Dan Williams  writes:
>>
>> >> Let's just focus on reporting errors when we know we have them.
>> >
>> > That's the problem in my eyes. If software needs to contend with
>> > latent error reporting then it should always contend otherwise
>> > software has multiple error models to wrangle.
>>
>> The only way for an application to know that the data has been written
>> successfully would be to issue a read after every write.  That's not a
>> performance hit most applications are willing to take.  And, of course,
>> the media can still go bad at a later time, so it only guarantees the
>> data is accessible immediately after having been written.
>>
>> What I'm suggesting is that we should not complete a write successfully
>> if we know that the data will not be retrievable.  I wouldn't call this
>> adding an extra error model to contend with.  Applications should
>> already be checking for errors on write.
>>
>> Does that make sense? Are we talking past each other?
>
> The badblock list is late to update in both directions, late to add
> entries that the scrub needs to find and late to delete entries that
> were inadvertently cleared by cache-line writes that did not first
> ingest the poison for a read-modify-write.

We aren't looking for perfection.  If the badblocks list is populated,
then report the error instead of letting the user write data that we
know they won't be able to access later.

You have confused me, though, since I thought that stores to bad media
would not clear errors.  Perhaps you are talking about some future
hardware implementation that doesn't yet exist?

> So I see the above as being wishful in using the error list as the
> hard source of truth and unfortunate to up-level all sub-sector error
> entries into full PAGE_SIZE data offline events.

The page size granularity is only an issue for mmap().  If you are using
read/write, then 512 byte granularity can be achieved.  Even with mmap,
if you encounter an error on a 4k page, you can query the status of each
sector in that page to isolate the error.  So I'm not quite sure I
understand what you're getting at.

> I'm hoping we can find a way to make the error handling more fine
> grained over time, but for the current patch, managing the blast
> radius as PAGE_SIZE granularity at least matches the zero path with
> the write path.

I think the write path can do 512 byte granularity, not page size.
Anyway, I think we've gone far enough off into the weeds that more
patches will have to be posted for debate.  :)

>> > Setting that aside we can start with just treating zeroing the same as
>> > the copy_from_iter() case and fail the I/O at the dax_direct_access()
>> > step.
>>
>> OK.
>>
>> > I'd rather have a separate op that filesystems can use to clear errors
>> > at block allocation time that can be enforced to have the correct
>> > alignment.
>>
>> So would file systems always call that routine instead of zeroing, or
>> would they first check to see if there are badblocks?
>
> The proposal is that filesystems distinguish zeroing from free-block
> allocation/initialization such that the fsdax implementation directs
> initialization to a driver callback. This "initialization op" would
> take care to check for poison and clear it. All other dax paths would
> not consult the badblocks list.

What do you mean by "all other dax paths?"  Would that include
mmap/direct_access?  Because that definitely should still consult the
badblocks list.

I'm not against having a separate operation for clearing errors, but I
guess I'm not convinced it's cleaner, either.

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-24 Thread Jeff Moyer
Dan Williams  writes:

>> Let's just focus on reporting errors when we know we have them.
>
> That's the problem in my eyes. If software needs to contend with
> latent error reporting then it should always contend otherwise
> software has multiple error models to wrangle.

The only way for an application to know that the data has been written
successfully would be to issue a read after every write.  That's not a
performance hit most applications are willing to take.  And, of course,
the media can still go bad at a later time, so it only guarantees the
data is accessible immediately after having been written.

What I'm suggesting is that we should not complete a write successfully
if we know that the data will not be retrievable.  I wouldn't call this
adding an extra error model to contend with.  Applications should
already be checking for errors on write.

Does that make sense?  Are we talking past each other?

> Setting that aside we can start with just treating zeroing the same as
> the copy_from_iter() case and fail the I/O at the dax_direct_access()
> step.

OK.

> I'd rather have a separate op that filesystems can use to clear errors
> at block allocation time that can be enforced to have the correct
> alignment.

So would file systems always call that routine instead of zeroing, or
would they first check to see if there are badblocks?

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-24 Thread Jeff Moyer
Dan Williams  writes:

> On Sun, Feb 23, 2020 at 3:03 PM Dave Chinner  wrote:
>>
>> On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
>> > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
>> > > Vivek Goyal  writes:
>> > >
>> > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
>> > > >> Vivek Goyal  writes:
>> > > >>
>> > > >> > Currently pmem_clear_poison() expects offset and len to be sector 
>> > > >> > aligned.
>> > > >> > Atleast that seems to be the assumption with which code has been 
>> > > >> > written.
>> > > >> > It is called only from pmem_do_bvec() which is called only from 
>> > > >> > pmem_rw_page()
>> > > >> > and pmem_make_request() which will only passe sector aligned offset 
>> > > >> > and len.
>> > > >> >
>> > > >> > Soon we want use this function from dax_zero_page_range() code path 
>> > > >> > which
>> > > >> > can try to zero arbitrary range of memory with-in a page. So update 
>> > > >> > this
>> > > >> > function to assume that offset and length can be arbitrary and do 
>> > > >> > the
>> > > >> > necessary alignments as needed.
>> > > >>
>> > > >> What caller will try to zero a range that is smaller than a sector?
>> > > >
>> > > > Hi Jeff,
>> > > >
>> > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
>> > > > a range which is less than a sector. Or which is bigger than a sector
>> > > > but start and end are not aligned on sector boundaries.
>> > >
>> > > Sure, but who will call it with misaligned ranges?
>> >
>> > create a file foo.txt of size 4K and then truncate it.
>> >
>> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
>> > 4095.
>>
>> This should fail with EIO. Only full page writes should clear the
>> bad page state, and partial writes should therefore fail because
>> they do not guarantee the data in the filesystem block is all good.
>>
>> If this zeroing was a buffered write to an address with a bad
>> sector, then the writeback will fail and the user will (eventually)
>> get an EIO on the file.
>>
>> DAX should do the same thing, except because the zeroing is
>> synchronous (i.e. done directly by the truncate syscall) we can -
>> and should - return EIO immediately.
>>
>> Indeed, with your code, if we then extend the file by truncating up
>> back to 4k, then the range between 23 and 512 is still bad, even
>> though we've successfully zeroed it and the user knows it. An
>> attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
>> will fail with EIO, but reading 10 bytes at offset 2000 will
>> succeed.
>>
>> That's *awful* behaviour to expose to userspace, especially when
>> they look at the fs config and see that it's using both 4kB block
>> and sector sizes...
>>
>> The only thing that makes sense from a filesystem perspective is
>> clearing bad page state when entire filesystem blocks are
>> overwritten. The data in a filesystem block is either good or bad,
>> and it doesn't matter how many internal (kernel or device) sectors
>> it has.
>>
>> > > And what happens to the rest?  The caller is left to trip over the
>> > > errors?  That sounds pretty terrible.  I really think there needs to be
>> > > an explicit contract here.
>> >
>> > Ok, I think is is the contentious bit. Current interface
>> > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
>> > sector) or expects page to be free of poison.
>> >
>> > So in above example, of "truncate -s 23 foo.txt", currently I get an error
>> > because range being zeroed is not sector aligned. So
>> > __dax_zero_page_range() falls back to calling direct_access(). Which
>> > fails because there are poisoned sectors in the page.
>> >
>> > With my patches, dax_zero_page_range(), clears the poison from sector 1 to
>> > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
>> > and returns success.
>>
>> Ok, kernel sectors are not the unit of granularity bad page state
>> should be managed at. They don't m

Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-21 Thread Jeff Moyer
Dan Williams  writes:

> Oh you misunderstood my comment, the "move badblocks to filesystem"
> proposal is long term / down the road thing to consider. In the near
> term this unaligned block zeroing facility is an improvement.

I'm not sure I agree.  I'm going to think about it and get back to you.

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-21 Thread Jeff Moyer
Vivek Goyal  writes:

> On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
>> Vivek Goyal  writes:
>> 
>> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
>> > Atleast that seems to be the assumption with which code has been written.
>> > It is called only from pmem_do_bvec() which is called only from 
>> > pmem_rw_page()
>> > and pmem_make_request() which will only passe sector aligned offset and 
>> > len.
>> >
>> > Soon we want use this function from dax_zero_page_range() code path which
>> > can try to zero arbitrary range of memory with-in a page. So update this
>> > function to assume that offset and length can be arbitrary and do the
>> > necessary alignments as needed.
>> 
>> What caller will try to zero a range that is smaller than a sector?
>
> Hi Jeff,
>
> New dax zeroing interface (dax_zero_page_range()) can technically pass
> a range which is less than a sector. Or which is bigger than a sector
> but start and end are not aligned on sector boundaries.

Sure, but who will call it with misaligned ranges?

> At this point of time, all I care about is that case of an arbitrary
> range is handeled well. So if a caller passes a range in, we figure
> out subrange which is sector aligned in terms of start and end, and
> clear poison on those sectors and ignore rest of the range. And
> this itself will be an improvement over current behavior where 
> nothing is cleared if I/O is not sector aligned.

I don't think this makes sense.  The caller needs to know about the
blast radius of errors.  This is why I asked for a concrete example.
It might make more sense, for example, to return an error if not all of
the errors could be cleared.

>> > nvdimm_clear_poison() seems to assume offset and len to be aligned to
>> > clear_err_unit boundary. But this is currently internal detail and is
>> > not exported for others to use. So for now, continue to align offset and
>> > length to SECTOR_SIZE boundary. Improving it further and to align it
>> > to clear_err_unit boundary is a TODO item for future.
>> 
>> When there is a poisoned range of persistent memory, it is recorded by
>> the badblocks infrastructure, which currently operates on sectors.  So,
>> no matter what the error unit is for the hardware, we currently can't
>> record/report to userspace anything smaller than a sector, and so that
>> is what we expect when clearing errors.
>> 
>> Continuing on for completeness, we will currently not map a page with
>> badblocks into a process' address space.  So, let's say you have 256
>> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
>> you access a valid mmap()d address in the same page as the poisoned
>> memory, you will get a segfault.
>> 
>> Userspace can fix up the error by calling write(2) and friends to
>> provide new data, or by punching a hole and writing new data to the hole
>> (which may result in getting a new block, or reallocating the old block
>> and zeroing it, which will clear the error).
>
> Fair enough. I do not need poison clearing at finer granularity. It might
> be needed once dev_dax path wants to clear poison. Not sure how exactly
> that works.

It doesn't.  :)

>> > +  /*
>> > +   * Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
>> > +   * expects memory offset and length to meet certain alignment
>> > +   * restrction (clear_err_unit). Currently nvdimm does not export
>>   ^^
>> > +   * required alignment. So align offset and length to sector boundary
>> 
>> What is "nvdimm" in that sentence?  Because the nvdimm most certainly
>> does export the required alignment.  Perhaps you meant libnvdimm?
>
> I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever
> it is called. It first queries alignement required (clear_err_unit) and
> then makes sure range passed in meets that alignment requirement.

My point was your comment is misleading.

>> We could potentially support clearing less than a sector, but I'd have
>> to understand the use cases better before offerring implementation
>> suggestions.
>
> I don't need clearing less than a secotr. Once somebody needs it they
> can implement it. All I am doing is making sure current logic is not
> broken when dax_zero_page_range() starts using this logic and passes
> an arbitrary range. We need to make sure we internally align I/O 

An arbitrary range is the same thing as less than a sector.  :)  Do you
know of an instance where the range will not

Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-20 Thread Jeff Moyer
Vivek Goyal  writes:

> Currently pmem_clear_poison() expects offset and len to be sector aligned.
> Atleast that seems to be the assumption with which code has been written.
> It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> and pmem_make_request() which will only passe sector aligned offset and len.
>
> Soon we want use this function from dax_zero_page_range() code path which
> can try to zero arbitrary range of memory with-in a page. So update this
> function to assume that offset and length can be arbitrary and do the
> necessary alignments as needed.

What caller will try to zero a range that is smaller than a sector?

> nvdimm_clear_poison() seems to assume offset and len to be aligned to
> clear_err_unit boundary. But this is currently internal detail and is
> not exported for others to use. So for now, continue to align offset and
> length to SECTOR_SIZE boundary. Improving it further and to align it
> to clear_err_unit boundary is a TODO item for future.

When there is a poisoned range of persistent memory, it is recorded by
the badblocks infrastructure, which currently operates on sectors.  So,
no matter what the error unit is for the hardware, we currently can't
record/report to userspace anything smaller than a sector, and so that
is what we expect when clearing errors.

Continuing on for completeness, we will currently not map a page with
badblocks into a process' address space.  So, let's say you have 256
bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
you access a valid mmap()d address in the same page as the poisoned
memory, you will get a segfault.

Userspace can fix up the error by calling write(2) and friends to
provide new data, or by punching a hole and writing new data to the hole
(which may result in getting a new block, or reallocating the old block
and zeroing it, which will clear the error).

More comments below...

> Signed-off-by: Vivek Goyal 
> ---
>  drivers/nvdimm/pmem.c | 22 ++
>  1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 075b11682192..e72959203253 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -74,14 +74,28 @@ static blk_status_t pmem_clear_poison(struct pmem_device 
> *pmem,
>   sector_t sector;
>   long cleared;
>   blk_status_t rc = BLK_STS_OK;
> + phys_addr_t start_aligned, end_aligned;
> + unsigned int len_aligned;
>  
> - sector = (offset - pmem->data_offset) / 512;
> + /*
> +  * Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
> +  * expects memory offset and length to meet certain alignment
> +  * restrction (clear_err_unit). Currently nvdimm does not export
  ^^
> +  * required alignment. So align offset and length to sector boundary

What is "nvdimm" in that sentence?  Because the nvdimm most certainly
does export the required alignment.  Perhaps you meant libnvdimm?

> +  * before passing it to nvdimm_clear_poison().
> +  */
> + start_aligned = ALIGN(offset, SECTOR_SIZE);
> + end_aligned = ALIGN_DOWN((offset + len), SECTOR_SIZE) - 1;
> + len_aligned = end_aligned - start_aligned + 1;
> +
> + sector = (start_aligned - pmem->data_offset) / 512;
>  
> - cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
> - if (cleared < len)
> + cleared = nvdimm_clear_poison(dev, pmem->phys_addr + start_aligned,
> +   len_aligned);
> + if (cleared < len_aligned)
>   rc = BLK_STS_IOERR;
>   if (cleared > 0 && cleared / 512) {
> - hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
> + hwpoison_clear(pmem, pmem->phys_addr + start_aligned, cleared);
>   cleared /= 512;
>   dev_dbg(dev, "%#llx clear %ld sector%s\n",
>   (unsigned long long) sector, cleared,

We could potentially support clearing less than a sector, but I'd have
to understand the use cases better before offerring implementation
suggestions.

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [ndctl PATCH] ndctl/list: Drop named list objects from verbose listing

2020-02-19 Thread Jeff Moyer
Dan Williams  writes:

> On Wed, Feb 19, 2020 at 9:56 AM Jeff Moyer  wrote:
>>
>> Dan Williams  writes:
>>
>> > The only expected difference between "ndctl list -R" and "ndctl list
>> > -Rv" is some additional output fields. Instead it currently results in
>> > the region array being contained in a named "regions" list object.
>> >
>> > # ndctl list -R -r 0
>> > [
>> >   {
>> > "dev":"region0",
>> > "size":4294967296,
>> > "available_size":0,
>> > "max_available_extent":0,
>> > "type":"pmem",
>> > "persistence_domain":"unknown"
>> >   }
>> > ]
>> >
>> > # ndctl list -Rv -r 0
>> > {
>> >   "regions":[
>> > {
>> >   "dev":"region0",
>> >   "size":4294967296,
>> >   "available_size":0,
>> >   "max_available_extent":0,
>> >   "type":"pmem",
>> >   "numa_node":0,
>> >   "target_node":2,
>> >   "persistence_domain":"unknown",
>> >   "namespaces":[
>> > {
>> >   "dev":"namespace0.0",
>> >   "mode":"fsdax",
>> >   "map":"mem",
>> >   "size":4294967296,
>> >   "sector_size":512,
>> >   "blockdev":"pmem0",
>> >   "numa_node":0,
>> >   "target_node":2
>> > }
>> >   ]
>> > }
>> >   ]
>> > }
>> >
>> > Drop the named list, by not including namespaces in the listing. Extra
>> > objects only appear at the -vv level. "ndctl list -v" and "ndctl list
>> > -Nv" are synonyms and behave as expected.
>> >
>> > # ndctl list -Rv -r 0
>> > [
>> >   {
>> > "dev":"region0",
>> > "size":4294967296,
>> > "available_size":0,
>> > "max_available_extent":0,
>> > "type":"pmem",
>> > "numa_node":0,
>> > "target_node":2,
>> > "persistence_domain":"unknown"
>> >   }
>> > ]
>> >
>>
>> Will this break existing code that parses the javascript output?
>
> Always a potential for that. That said, I'd rather attempt to make it
> symmetric and replace it if someone screams, rather than let this
> quirk persist because it makes it impossible to ingest region data
> with the same script across -R and -Rv.

Yeah, I see where you're coming from.  However, script authors will
still have to deal with older versions of ndctl in the wild (for many
years).  If the decision was up to me, I'd live with the wart in favor
of not breaking scripts when ndctl gets updated.  Users hate that.

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [ndctl PATCH 1/2] ndctl/region: Support ndctl_region_{get, set}_align()

2020-02-19 Thread Jeff Moyer
Dan Williams  writes:

>  
> +NDCTL_EXPORT unsigned long ndctl_region_get_align(struct ndctl_region 
> *region)
> +{
> + return region->align;
> +}
> +
> +NDCTL_EXPORT int ndctl_region_set_align(struct ndctl_region *region,
> + unsigned long align)
> +{
> + struct ndctl_ctx *ctx = ndctl_region_get_ctx(region);
> + char *path = region->region_buf;
> + int len = region->buf_len, rc;
> + char buf[SYSFS_ATTR_SIZE];
> +
> + if (snprintf(path, len, "%s/align", region->region_path) >= len) {
> + err(ctx, "%s: buffer too small!\n",
> + ndctl_region_get_devname(region));
> + return -ENXIO;
> + }
> +
> + sprintf(buf, "%#lx\n", align);
> + rc = sysfs_write_attr(ctx, path, buf);
> + if (rc < 0)
> + return rc;
> +
> + region->align = align;
> + return 0;
> +}
> +

Missing doctext.  Specifically, there should be a big, fat warning
against changing the region alignment.

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [ndctl PATCH] ndctl/list: Drop named list objects from verbose listing

2020-02-19 Thread Jeff Moyer
Dan Williams  writes:

> The only expected difference between "ndctl list -R" and "ndctl list
> -Rv" is some additional output fields. Instead it currently results in
> the region array being contained in a named "regions" list object.
>
> # ndctl list -R -r 0
> [
>   {
> "dev":"region0",
> "size":4294967296,
> "available_size":0,
> "max_available_extent":0,
> "type":"pmem",
> "persistence_domain":"unknown"
>   }
> ]
>
> # ndctl list -Rv -r 0
> {
>   "regions":[
> {
>   "dev":"region0",
>   "size":4294967296,
>   "available_size":0,
>   "max_available_extent":0,
>   "type":"pmem",
>   "numa_node":0,
>   "target_node":2,
>   "persistence_domain":"unknown",
>   "namespaces":[
> {
>   "dev":"namespace0.0",
>   "mode":"fsdax",
>   "map":"mem",
>   "size":4294967296,
>   "sector_size":512,
>   "blockdev":"pmem0",
>   "numa_node":0,
>   "target_node":2
> }
>   ]
> }
>   ]
> }
>
> Drop the named list, by not including namespaces in the listing. Extra
> objects only appear at the -vv level. "ndctl list -v" and "ndctl list
> -Nv" are synonyms and behave as expected.
>
> # ndctl list -Rv -r 0
> [
>   {
> "dev":"region0",
> "size":4294967296,
> "available_size":0,
> "max_available_extent":0,
> "type":"pmem",
> "numa_node":0,
> "target_node":2,
> "persistence_domain":"unknown"
>   }
> ]
>

Will this break existing code that parses the javascript output?

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [ndctl V2] namespace/create: Don't create multiple namespaces unless greedy

2020-02-18 Thread Jeff Moyer
Dan Williams  writes:

>> diff --git a/ndctl/namespace.c b/ndctl/namespace.c
>> index 7fb0007..b1f2158 100644
>> --- a/ndctl/namespace.c
>> +++ b/ndctl/namespace.c
>> @@ -1388,11 +1388,9 @@ static int do_xaction_namespace(const char *namespace,
>> (*processed)++;
>> if (param.greedy)
>> continue;
>> -   }
>> -   if (force) {
>> -   if (rc)
>> +   } else if (param.greedy && force) {
>> saved_rc = rc;
>> -   continue;
>> +   continue;
>
> Looks good, applied.

Where?

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [ndctl PATCH] ndctl/lib: make dimm_ops in private.h extern

2020-02-18 Thread Jeff Moyer
Vishal Verma  writes:

> A toolchain update in Fedora 32 caused new compile errors due to
> multiple definitions of dimm_ops structures. The declarations in
> 'private.h' for the various NFIT families are present so that libndctl
> can find all the per-family dimm-ops. However they need to be declared
> as extern because the actual definitions are in .c
>
> Cc: Dan Williams 
> Suggested-by: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>  ndctl/lib/private.h | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/ndctl/lib/private.h b/ndctl/lib/private.h
> index e445301..16bf8f9 100644
> --- a/ndctl/lib/private.h
> +++ b/ndctl/lib/private.h
> @@ -343,10 +343,10 @@ struct ndctl_dimm_ops {
>   int (*xlat_firmware_status)(struct ndctl_cmd *);
>  };
>  
> -struct ndctl_dimm_ops * const intel_dimm_ops;
> -struct ndctl_dimm_ops * const hpe1_dimm_ops;
> -struct ndctl_dimm_ops * const msft_dimm_ops;
> -struct ndctl_dimm_ops * const hyperv_dimm_ops;
> +extern struct ndctl_dimm_ops * const intel_dimm_ops;
> +extern struct ndctl_dimm_ops * const hpe1_dimm_ops;
> +extern struct ndctl_dimm_ops * const msft_dimm_ops;
> +extern struct ndctl_dimm_ops * const hyperv_dimm_ops;
>  
>  static inline struct ndctl_bus *cmd_to_bus(struct ndctl_cmd *cmd)
>  {

Acked-by: Jeff Moyer 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH ndctl] ndctl, test: add bus-id parameter for start-scrub/wait-scrub operation

2020-02-18 Thread Jeff Moyer
Yi Zhang  writes:

> On some NVDIMM servers, scrub operation will take long time to be finished
> as it start on all nvdimm buses in the system, add the bus-id parameter to
> do the scrub on the NFIT_TEST_BUS0
>
> Signed-off-by: Yi Zhang 

Reviewed-by: Jeff Moyer 

> ---
>  test/btt-errors.sh  | 4 ++--
>  test/clear.sh   | 2 +-
>  test/inject-error.sh| 2 +-
>  test/pfn-meta-errors.sh | 2 +-
>  test/pmem-errors.sh | 2 +-
>  5 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/test/btt-errors.sh b/test/btt-errors.sh
> index cb35865..00c0796 100755
> --- a/test/btt-errors.sh
> +++ b/test/btt-errors.sh
> @@ -115,7 +115,7 @@ bb_inj=$((dataoff/512))
>  
>  # inject badblocks for one page at the start of the file
>  $NDCTL inject-error --block="$bb_inj" --count=8 $dev
> -$NDCTL start-scrub && $NDCTL wait-scrub
> +$NDCTL start-scrub $NFIT_TEST_BUS0 && $NDCTL wait-scrub $NFIT_TEST_BUS0
>  
>  force_raw 0
>  mount -o nodelalloc "/dev/$blockdev" $MNT
> @@ -149,7 +149,7 @@ map=$(hexdump -s 96 -n 4 "/dev/$raw_bdev" | head -1 | cut 
> -d' ' -f2-)
>  map=$(tr -d ' ' <<< "0x${map#* }${map%% *}")
>  bb_inj=$((map/512))
>  $NDCTL inject-error --block="$bb_inj" --count=1 $dev
> -$NDCTL start-scrub && $NDCTL wait-scrub
> +$NDCTL start-scrub $NFIT_TEST_BUS0 && $NDCTL wait-scrub $NFIT_TEST_BUS0
>  force_raw 0
>  
>  # make sure reading the first block of the namespace fails
> diff --git a/test/clear.sh b/test/clear.sh
> index 17d5bed..1bd12da 100755
> --- a/test/clear.sh
> +++ b/test/clear.sh
> @@ -41,7 +41,7 @@ err_sector="$(((size/512) / 2))"
>  err_count=8
>  if ! read sector len < /sys/block/$blockdev/badblocks; then
>   $NDCTL inject-error --block="$err_sector" --count=$err_count $dev
> - $NDCTL start-scrub && $NDCTL wait-scrub
> + $NDCTL start-scrub $NFIT_TEST_BUS0 && $NDCTL wait-scrub $NFIT_TEST_BUS0
>  fi
>  read sector len < /sys/block/$blockdev/badblocks
>  [ $((sector * 2)) -ne $((size /512)) ] && echo "fail: $LINENO" && exit 1
> diff --git a/test/inject-error.sh b/test/inject-error.sh
> index 49e68b3..825bf18 100755
> --- a/test/inject-error.sh
> +++ b/test/inject-error.sh
> @@ -77,7 +77,7 @@ do_tests()
>  
>   # inject normally
>   $NDCTL inject-error --block=$err_block --count=$err_count $dev
> - $NDCTL start-scrub && $NDCTL wait-scrub
> + $NDCTL start-scrub $NFIT_TEST_BUS0 && $NDCTL wait-scrub $NFIT_TEST_BUS0
>   check_status "$err_block" "$err_count"
>   if read -r sector len < /sys/block/$blockdev/badblocks; then
>   test "$sector" -eq "$err_block"
> diff --git a/test/pfn-meta-errors.sh b/test/pfn-meta-errors.sh
> index 2b57f19..14a15ae 100755
> --- a/test/pfn-meta-errors.sh
> +++ b/test/pfn-meta-errors.sh
> @@ -61,7 +61,7 @@ mblk="$((metaoff/512))"
>  # inject in the middle of the struct page area
>  bb_inj=$(((dblk - mblk)/2))
>  $NDCTL inject-error --block="$bb_inj" --count=32 $dev
> -$NDCTL start-scrub && $NDCTL wait-scrub
> +$NDCTL start-scrub $NFIT_TEST_BUS0 && $NDCTL wait-scrub $NFIT_TEST_BUS0
>  
>  # after probe from the enable-namespace, the error should've been cleared
>  force_raw 0
> diff --git a/test/pmem-errors.sh b/test/pmem-errors.sh
> index 9553a3f..3d90508 100755
> --- a/test/pmem-errors.sh
> +++ b/test/pmem-errors.sh
> @@ -46,7 +46,7 @@ err_sector="$(((size/512) / 2))"
>  err_count=8
>  if ! read sector len < /sys/block/$blockdev/badblocks; then
>   $NDCTL inject-error --block="$err_sector" --count=$err_count $dev
> - $NDCTL start-scrub; $NDCTL wait-scrub
> + $NDCTL start-scrub $NFIT_TEST_BUS0; $NDCTL wait-scrub $NFIT_TEST_BUS0
>  fi
>  read sector len < /sys/block/$blockdev/badblocks
>  [ $((sector * 2)) -ne $((size /512)) ] && echo "fail: $LINENO" && false
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH] mm: get rid of WARN if failed to cow user pages

2020-02-18 Thread Jeff Moyer
Dan Williams  writes:

> [ drop Ross, add Kirill, linux-mm, and lkml ]
>
> On Tue, Dec 24, 2019 at 9:42 PM Murphy Zhou  wrote:
>>
>> By running xfstests with fsdax enabled, generic/437 always hits this
>> warning[1] since this commit:
>>
>> commit 83d116c53058d505ddef051e90ab27f57015b025
>> Author: Jia He 
>> Date:   Fri Oct 11 22:09:39 2019 +0800
>>
>> mm: fix double page fault on arm64 if PTE_AF is cleared
>>
>> Looking at the test program[2] generic/437 uses, it's pretty easy
>> to hit this warning. Remove this WARN as it seems not necessary.
>
> This is not sufficient justification. Does this same test fail without
> DAX? If not, why not? At a minimum you need to explain why this is not
> indicating a problem.

I ran into this, too, and Kirill has posted a patch[1] to fix the issue.
Note that it's a potential data corrupter, so just removing the warning
is NOT the right approach.  :)

-Jeff

[1] 
https://lore.kernel.org/linux-mm/20200218154151.13349-1-kirill.shute...@linux.intel.com/T/#u
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH] libnvdimm/of_pmem: Fix leaking bus_desc.provider_name in some paths

2020-02-18 Thread Jeff Moyer
Vaibhav Jain  writes:

> String 'bus_desc.provider_name' allocated inside
> of_pmem_region_probe() will leak in case call to nvdimm_bus_register()
> fails or when of_pmem_region_remove() is called.
>
> This minor patch ensures that 'bus_desc.provider_name' is freed in
> error path for of_pmem_region_probe() as well as in
> of_pmem_region_remove().
>
> Cc: sta...@vger.kernel.org
> Fixes:49bddc73d15c2("libnvdimm/of_pmem: Provide a unique name for bus 
> provider")
> Signed-off-by: Vaibhav Jain 
> ---
>  drivers/nvdimm/of_pmem.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/nvdimm/of_pmem.c b/drivers/nvdimm/of_pmem.c
> index 8224d1431ea9..9cb76f9837ad 100644
> --- a/drivers/nvdimm/of_pmem.c
> +++ b/drivers/nvdimm/of_pmem.c
> @@ -36,6 +36,7 @@ static int of_pmem_region_probe(struct platform_device 
> *pdev)
>  
>   priv->bus = bus = nvdimm_bus_register(>dev, >bus_desc);
>   if (!bus) {
> + kfree(priv->bus_desc.provider_name);
>   kfree(priv);
>   return -ENODEV;
>   }
> @@ -81,6 +82,7 @@ static int of_pmem_region_remove(struct platform_device 
> *pdev)
>   struct of_pmem_private *priv = platform_get_drvdata(pdev);
>  
>   nvdimm_bus_unregister(priv->bus);
> + kfree(priv->bus_desc.provider_name);
>   kfree(priv);
>  
>   return 0;

Reviewed-by: Jeff Moyer 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH] libnvdimm/bus: return the outvar 'cmd_rc' error code in __nd_ioctl()

2020-02-18 Thread Jeff Moyer
Vaibhav Jain  writes:

> Presently the error code returned via out variable 'cmd_rc' from the
> nvdimm-bus controller function is ignored when called from
> __nd_ioctl() and never communicated back to user-space code that called
> an ioctl on dimm/bus.
>
> This minor patch updates __nd_ioctl() to propagate the value of out
> variable 'cmd_rc' back to user-space in case it reports an error.
>
> Signed-off-by: Vaibhav Jain 
> ---
>  drivers/nvdimm/bus.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
> index a8b515968569..5b687a27fdf2 100644
> --- a/drivers/nvdimm/bus.c
> +++ b/drivers/nvdimm/bus.c
> @@ -1153,6 +1153,11 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, 
> struct nvdimm *nvdimm,
>   if (rc < 0)
>   goto out_unlock;
>  
> + if (cmd_rc < 0) {
> + rc = cmd_rc;
> + goto out_unlock;
> + }
> +
>   if (!nvdimm && cmd == ND_CMD_CLEAR_ERROR && cmd_rc >= 0) {
>   struct nd_cmd_clear_error *clear_err = buf;

Looks good to me.

Reviewed-by: Jeff Moyer 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [RFC][PATCH] dax: Do not try to clear poison for partial pages

2020-02-18 Thread Jeff Moyer
Dan Williams  writes:

> Right now the kernel does not install a pte on faults that land on a
> page with known poison, but only because the error clearing path is so
> convoluted and could only claim that fallocate(PUNCH_HOLE) cleared
> errors because that was guaranteed to send 512-byte aligned zero's
> down the block-I/O path when the fs-blocks got reallocated. In a world
> where native cpu instructions can clear errors the dax write() syscall
> case could be covered (modulo 64-byte alignment), and the kernel could
> just let the page be mapped so that the application could attempt it's
> own fine-grained clearing without calling back into the kernel.

I'm not sure we'd want to do allow mapping the PTEs even if there was
support for clearing errors via CPU instructions.  Any load from a
poisoned page will result in an MCE, and there exists the possiblity
that you will hit an unrecoverable error (Processor Context Corrupt).
It's just safer to catch these cases by not mapping the page, and
forcing recovery through the driver.

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 0/4] libnvdimm: Cross-arch compatible namespace alignment

2020-02-14 Thread Jeff Moyer
Dan Williams  writes:

> ---
>
> Explicit review requests, but any other feedback is of course
> appreciated:
>
> Patch1 needs an ack from ppc arch maintainers, and I'd like a tested-by
> from Aneesh that this still works to solve the ppc issue. Jeff, does
> this look good to you?

OK, I've reviewed everything.  Testing looks good with the change I
mentioned (memremap_compat_align returning PAGE_SIZE).  I made sure a
4k-aligned namespace created under an unpatched kernel would be
accessible under a patched kernel.  I also made sure that manually
setting align would allow for creating of poorly aligned namespaces, and
that those namespaces were then accessible on the unpatched kernel.

Thanks, Dan!

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 1/4] mm/memremap_pages: Introduce memremap_compat_align()

2020-02-14 Thread Jeff Moyer
Dan Williams  writes:

> On Thu, Feb 13, 2020 at 8:58 AM Jeff Moyer  wrote:

>> I have just a couple of questions.
>>
>> First, can you please add a comment above the generic implementation of
>> memremap_compat_align describing its purpose, and why a platform might
>> want to override it?
>
> Sure, how about:
>
> /*
>  * The memremap() and memremap_pages() interfaces are alternately used
>  * to map persistent memory namespaces. These interfaces place different
>  * constraints on the alignment and size of the mapping (namespace).
>  * memremap() can map individual PAGE_SIZE pages. memremap_pages() can
>  * only map subsections (2MB), and at least one architecture (PowerPC)
>  * the minimum mapping granularity of memremap_pages() is 16MB.
>  *
>  * The role of memremap_compat_align() is to communicate the minimum
>  * arch supported alignment of a namespace such that it can freely
>  * switch modes without violating the arch constraint. Namely, do not
>  * allow a namespace to be PAGE_SIZE aligned since that namespace may be
>  * reconfigured into a mode that requires SUBSECTION_SIZE alignment.
>  */

Well, if we modify the x86 variant to be PAGE_SIZE, I think that text
won't work.  How about:

/*
 * memremap_compat_align should return the minimum alignment for
 * mapping memory via memremap() and memremap_pages().  For x86, this
 * is the system PAGE_SIZE.  Other architectures may impose different
 * restrictions, as is seen on powerpc where the minimum alignment is
 * tied to the linear mapping page size.
 *
 * When creating persistent memory namespaces, the alignment is forced
 * to the least common denominator (MEMREMAP_COMPAT_ALIGN_MAX,
 * currently 16MB).  However, older kernels did not enforce this
 * behavior, so we allow mapping namespaces with smaller alignments,
 * so long as the platform supports it.  See nvdimm_namespace_common_probe.
 */

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 4/4] libnvdimm/region: Introduce an 'align' attribute

2020-02-14 Thread Jeff Moyer
Dan Williams  writes:

> The align attribute applies an alignment constraint for namespace
> creation in a region. Whereas the 'align' attribute of a namespace
> applied alignment padding via an info block, the 'align' attribute
> applies alignment constraints to the free space allocation.
>
> The default for 'align' is the maximum known memremap_compat_align()
> across all archs (16MiB from PowerPC at time of writing) multiplied by
> the number of interleave ways if there is blk-aliasing. The minimum is
> PAGE_SIZE and allows for the creation of cross-arch incompatible
> namespaces, just as previous kernels allowed, but the expectation is
> cross-arch and mode-independent compatibility by default.
>
> The regression risk with this change is limited to cases that were
> dependent on the ability to create unaligned namespaces, *and* for some
> reason are unable to opt-out of aligned namespaces by writing to
> 'regionX/align'. If such a scenario arises the default can be flipped
> from opt-out to opt-in of compat-aligned namespace creation, but that is
> a last resort. The kernel will otherwise continue to support existing
> defined misaligned namespaces.
>
> Unfortunately this change needs to touch several parts of the
> implementation at once:
>
> - region/available_size: expand busy extents to current align
> - region/max_available_extent: expand busy extents to current align
> - namespace/size: trim free space to current align
>
> ...to keep the free space accounting conforming to the dynamic align
> setting.
>
> Reported-by: Aneesh Kumar K.V 
> Reported-by: Jeff Moyer 
> Signed-off-by: Dan Williams 
> Reviewed-by: Aneesh Kumar K.V 
> Link: 
> https://lore.kernel.org/r/158041478371.3889308.14542630147672668068.st...@dwillia2-desk3.amr.corp.intel.com
> Signed-off-by: Dan Williams 

This looks good to me.

Reviewed-by: Jeff Moyer 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 2/4] libnvdimm/namespace: Enforce memremap_compat_align()

2020-02-13 Thread Jeff Moyer
Dan Williams  writes:

> The pmem driver on PowerPC crashes with the following signature when
> instantiating misaligned namespaces that map their capacity via
> memremap_pages().
>
> BUG: Unable to handle kernel data access at 0xc00100040600
> Faulting instruction address: 0xc0090790
> NIP [c0090790] arch_add_memory+0xc0/0x130
> LR [c0090744] arch_add_memory+0x74/0x130
> Call Trace:
>  arch_add_memory+0x74/0x130 (unreliable)
>  memremap_pages+0x74c/0xa30
>  devm_memremap_pages+0x3c/0xa0
>  pmem_attach_disk+0x188/0x770
>  nvdimm_bus_probe+0xd8/0x470
>
> With the assumption that only memremap_pages() has alignment
> constraints, enforce memremap_compat_align() for
> pmem_should_map_pages(), nd_pfn, or nd_dax cases.
>
> Reported-by: Aneesh Kumar K.V 
> Cc: Jeff Moyer 
> Reviewed-by: Aneesh Kumar K.V 
> Link: 
> https://lore.kernel.org/r/158041477336.3889308.4581652885008605170.st...@dwillia2-desk3.amr.corp.intel.com
> Signed-off-by: Dan Williams 
> ---
>  drivers/nvdimm/namespace_devs.c |   10 ++
>  1 file changed, 10 insertions(+)
>
> diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
> index 032dc61725ff..aff1f32fdb4f 100644
> --- a/drivers/nvdimm/namespace_devs.c
> +++ b/drivers/nvdimm/namespace_devs.c
> @@ -1739,6 +1739,16 @@ struct nd_namespace_common 
> *nvdimm_namespace_common_probe(struct device *dev)
>   return ERR_PTR(-ENODEV);
>   }
>  
> + if (pmem_should_map_pages(dev) || nd_pfn || nd_dax) {
> + struct nd_namespace_io *nsio = to_nd_namespace_io(>dev);
> + resource_size_t start = nsio->res.start;
> +
> + if (!IS_ALIGNED(start | size, memremap_compat_align())) {
> + dev_dbg(>dev, "misaligned, unable to map\n");
> + return ERR_PTR(-EOPNOTSUPP);
> + }
> + }
> +
>   if (is_namespace_pmem(>dev)) {
>   struct nd_namespace_pmem *nspm;
>  

Actually, I take back my ack.  :) This prevents a previously working
namespace from being successfully probed/setup.  I thought we were only
going to enforce the alignment for a newly created namespace?  This should
only check whether the alignment works for the current platform.

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 3/4] libnvdimm/region: Introduce NDD_LABELING

2020-02-13 Thread Jeff Moyer
Dan Williams  writes:

> @@ -312,8 +312,9 @@ static ssize_t flags_show(struct device *dev,
>  {
>   struct nvdimm *nvdimm = to_nvdimm(dev);
>  
> - return sprintf(buf, "%s%s\n",
> + return sprintf(buf, "%s%s%s\n",
>   test_bit(NDD_ALIASING, >flags) ? "alias " : "",
> + test_bit(NDD_LABELING, >flags) ? "label" : "",
   ^

Missing a space.

The rest looks sane.

Reviewed-by: Jeff Moyer 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 1/4] mm/memremap_pages: Introduce memremap_compat_align()

2020-02-13 Thread Jeff Moyer
Dan Williams  writes:

> The "sub-section memory hotplug" facility allows memremap_pages() users
> like libnvdimm to compensate for hardware platforms like x86 that have a
> section size larger than their hardware memory mapping granularity.  The
> compensation that sub-section support affords is being tolerant of
> physical memory resources shifting by units smaller (64MiB on x86) than
> the memory-hotplug section size (128 MiB). Where the platform
> physical-memory mapping granularity is limited by the number and
> capability of address-decode-registers in the memory controller.
>
> While the sub-section support allows memremap_pages() to operate on
> sub-section (2MiB) granularity, the Power architecture may still
> require 16MiB alignment on "!radix_enabled()" platforms.
>
> In order for libnvdimm to be able to detect and manage this per-arch
> limitation, introduce memremap_compat_align() as a common minimum
> alignment across all driver-facing memory-mapping interfaces, and let
> Power override it to 16MiB in the "!radix_enabled()" case.
>
> The assumption / requirement for 16MiB to be a viable
> memremap_compat_align() value is that Power does not have platforms
> where its equivalent of address-decode-registers never hardware remaps a
> persistent memory resource on smaller than 16MiB boundaries. Note that I
> tried my best to not add a new Kconfig symbol, but header include
> entanglements defeated the #ifndef memremap_compat_align design pattern
> and the need to export it defeats the __weak design pattern for arch
> overrides.
>
> Based on an initial patch by Aneesh.

I have just a couple of questions.

First, can you please add a comment above the generic implementation of
memremap_compat_align describing its purpose, and why a platform might
want to override it?

Second, I will take it at face value that the power architecture
requires a 16MB alignment, but it's not clear to me why mmu_linear_psize
was chosen to represent that.  What's the relationship, there, and can
we please have a comment explaining it?

Thanks!
Jeff

>
> Link: 
> http://lore.kernel.org/r/capcyv4gbgnp95apyabcsocea50tqj9b5h__83vgngjq3oug...@mail.gmail.com
> Reported-by: Aneesh Kumar K.V 
> Reported-by: Jeff Moyer 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Signed-off-by: Dan Williams 
> ---
>  arch/powerpc/Kconfig  |1 +
>  arch/powerpc/mm/ioremap.c |   12 
>  drivers/nvdimm/pfn_devs.c |2 +-
>  include/linux/memremap.h  |8 
>  include/linux/mmzone.h|1 +
>  lib/Kconfig   |3 +++
>  mm/memremap.c |   13 +
>  7 files changed, 39 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 497b7d0b2d7e..e6ffe905e2b9 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -122,6 +122,7 @@ config PPC
>   select ARCH_HAS_GCOV_PROFILE_ALL
>   select ARCH_HAS_KCOV
>   select ARCH_HAS_HUGEPD  if HUGETLB_PAGE
> + select ARCH_HAS_MEMREMAP_COMPAT_ALIGN
>   select ARCH_HAS_MMIOWB  if PPC64
>   select ARCH_HAS_PHYS_TO_DMA
>   select ARCH_HAS_PMEM_API
> diff --git a/arch/powerpc/mm/ioremap.c b/arch/powerpc/mm/ioremap.c
> index fc669643ce6a..38b5ba7d3e2d 100644
> --- a/arch/powerpc/mm/ioremap.c
> +++ b/arch/powerpc/mm/ioremap.c
> @@ -2,6 +2,7 @@
>  
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -97,3 +98,14 @@ void __iomem *do_ioremap(phys_addr_t pa, phys_addr_t 
> offset, unsigned long size,
>  
>   return NULL;
>  }
> +
> +#ifdef CONFIG_ZONE_DEVICE
> +/* override of the generic version in mm/memremap.c */
> +unsigned long memremap_compat_align(void)
> +{
> +   if (radix_enabled())
> +   return SUBSECTION_SIZE;
> +   return (1UL << mmu_psize_defs[mmu_linear_psize].shift);
> +}
> +EXPORT_SYMBOL_GPL(memremap_compat_align);
> +#endif
> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> index b94f7a7e94b8..a5c25cb87116 100644
> --- a/drivers/nvdimm/pfn_devs.c
> +++ b/drivers/nvdimm/pfn_devs.c
> @@ -750,7 +750,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
>   start = nsio->res.start;
>   size = resource_size(>res);
>   npfns = PHYS_PFN(size - SZ_8K);
> - align = max(nd_pfn->align, (1UL << SUBSECTION_SHIFT));
> + align = max(nd_pfn->align, SUBSECTION_SIZE);
>   end_trunc = start + size - ALIGN_DOWN(start + size, align);
>   if (nd_pfn->mode == PFN_MODE_PMEM) {
>   /*
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 6fefb09af7c3..8af1cbd8f293 100644

Re: [PATCH] tools/testing/nvdimm: Fix compilation failure without CONFIG_DEV_DAX_PMEM_COMPAT

2020-02-12 Thread Jeff Moyer
Jan Kara  writes:

> When a kernel is configured without CONFIG_DEV_DAX_PMEM_COMPAT, the
> compilation of tools/testing/nvdimm fails with:
>
>   Building modules, stage 2.
>   MODPOST 11 modules
> ERROR: "dax_pmem_compat_test" [tools/testing/nvdimm/test/nfit_test.ko] 
> undefined!
>
> Fix the problem by calling dax_pmem_compat_test() only if the kernel has
> the required functionality.
>
> Signed-off-by: Jan Kara 

What's the motivation?  Is this just to fix randconfig builds?  The
reason I ask is that the test suite will expect to be able to find the
dax_pmem_compat module, so it doesn't make sense to me to disable those
tests only in the kernel as you'll hit a problem when running the tests
anyway.

But, I understand if you want to prevent build bots from hitting
compilation failures due to this.

-Jeff

> ---
>  tools/testing/nvdimm/test/nfit.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/tools/testing/nvdimm/test/nfit.c 
> b/tools/testing/nvdimm/test/nfit.c
> index bf6422a6af7f..a8ee5c4d41eb 100644
> --- a/tools/testing/nvdimm/test/nfit.c
> +++ b/tools/testing/nvdimm/test/nfit.c
> @@ -3164,7 +3164,9 @@ static __init int nfit_test_init(void)
>   mcsafe_test();
>   dax_pmem_test();
>   dax_pmem_core_test();
> +#ifdef CONFIG_DEV_DAX_PMEM_COMPAT
>   dax_pmem_compat_test();
> +#endif
>  
>   nfit_test_setup(nfit_test_lookup, nfit_test_evaluate_dsm);
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [patch] dax: pass NOWAIT flag to iomap_apply

2020-02-06 Thread Jeff Moyer
Jan Kara  writes:

> On Thu 06-02-20 09:33:39, Jeff Moyer wrote:
>> Jan Kara  writes:
>> 
>> > On Wed 05-02-20 14:15:58, Jeff Moyer wrote:
>> >> fstests generic/471 reports a failure when run with MOUNT_OPTIONS="-o
>> >> dax".  The reason is that the initial pwrite to an empty file with the
>> >> RWF_NOWAIT flag set does not return -EAGAIN.  It turns out that
>> >> dax_iomap_rw doesn't pass that flag through to iomap_apply.
>> >> 
>> >> With this patch applied, generic/471 passes for me.
>> >> 
>> >> Signed-off-by: Jeff Moyer 
>> >
>> > The patch looks good to me. You can add:
>> >
>> > Reviewed-by: Jan Kara 
>> >
>> > BTW, I've just noticed ext4 seems to be buggy in this regard and even this
>> > patch doesn't fix it. So I guess you've been using XFS for testing this?
>> 
>> That's right, sorry I didn't mention that.  Will you send a patch for
>> ext4, or do you want me to look into it?
>
> I've taken down a note in todo list to eventually look into that but if you
> can have a look, I'm more than happy to remove that entry :).

OK, I'll take a look.

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [patch] dax: pass NOWAIT flag to iomap_apply

2020-02-06 Thread Jeff Moyer
Jan Kara  writes:

> On Wed 05-02-20 14:15:58, Jeff Moyer wrote:
>> fstests generic/471 reports a failure when run with MOUNT_OPTIONS="-o
>> dax".  The reason is that the initial pwrite to an empty file with the
>> RWF_NOWAIT flag set does not return -EAGAIN.  It turns out that
>> dax_iomap_rw doesn't pass that flag through to iomap_apply.
>> 
>> With this patch applied, generic/471 passes for me.
>> 
>> Signed-off-by: Jeff Moyer 
>
> The patch looks good to me. You can add:
>
> Reviewed-by: Jan Kara 
>
> BTW, I've just noticed ext4 seems to be buggy in this regard and even this
> patch doesn't fix it. So I guess you've been using XFS for testing this?

That's right, sorry I didn't mention that.  Will you send a patch for
ext4, or do you want me to look into it?

Thanks!
Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[patch] dax: pass NOWAIT flag to iomap_apply

2020-02-05 Thread Jeff Moyer
fstests generic/471 reports a failure when run with MOUNT_OPTIONS="-o
dax".  The reason is that the initial pwrite to an empty file with the
RWF_NOWAIT flag set does not return -EAGAIN.  It turns out that
dax_iomap_rw doesn't pass that flag through to iomap_apply.

With this patch applied, generic/471 passes for me.

Signed-off-by: Jeff Moyer 

diff --git a/fs/dax.c b/fs/dax.c
index 1f1f0201cad1..0b0d8819cb1b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1207,6 +1207,9 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
lockdep_assert_held(>i_rwsem);
}
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   flags |= IOMAP_NOWAIT;
+
while (iov_iter_count(iter)) {
ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
iter, dax_iomap_actor);
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 01/19] dax: remove block device dependencies

2020-01-15 Thread Jeff Moyer
Hi, Dan,

Dan Williams  writes:

> I'm going to take a look at how hard it would be to develop a kpartx
> fallback in udev. If that can live across the driver transition then
> maybe this can be a non-event for end users that already have that
> udev update deployed.

I just wanted to remind you that label-less dimms still exist, and are
still being shipped.  For those devices, the only way to subdivide the
storage is via partitioning.

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [RFC PATCH] libnvdimm: Update the meaning for persistence_domain values

2020-01-15 Thread Jeff Moyer
"Aneesh Kumar K.V"  writes:

>> /*
>>   * Platform provides mechanisms to flush outstanding write data
>>   * to pmem on system power loss.
>>   */
>>
>
> Wanted to add more details. So with the above interpretation, if the
> persistence_domain is found to be 'cpu_cache', application can expect
> a store instruction to guarantee persistence. If it is 'none' there is
> no persistence ( I am not sure how that is the difference from
> 'volatile' pmem region). If it is  'memory_controller' ( I am not sure
> whether that is the right term), application needs to follow the
> recommended mechanism to flush write data to pmem.

There is no enum for "NONE".  If there is no persistence domain
specified, then it is undefined/unknown, which results in the message:

  nd_pmem namespace0.0: unable to guarantee persistence of writes

Other than that, yes, that's right.

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [RFC PATCH] libnvdimm: Update the meaning for persistence_domain values

2020-01-15 Thread Jeff Moyer
"Aneesh Kumar K.V"  writes:

>> Would you also update of_pmem to indicate the persistence domain,
>> please?
>>
>
> sure.

Thanks!

>>> diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
>>> index f2a33f2e3ba8..9126737377e1 100644
>>> --- a/include/linux/libnvdimm.h
>>> +++ b/include/linux/libnvdimm.h
>>> @@ -52,9 +52,9 @@ enum {
>>>  */
>>> ND_REGION_PERSIST_CACHE = 1,
>>> /*
>>> -* Platform provides mechanisms to automatically flush outstanding
>>> -* write data from memory controler to pmem on system power loss.
>>> -* (ADR)
>>> +* Platform provides instructions to flush data such that on completion
>>> +* of the instructions, data flushed is guaranteed to be on pmem even
>>> +* in case of a system power loss.
>>
>> I find the prior description easier to understand.
>
> I was trying to avoid the term 'automatically, 'memory controler' and
> ADR. Can I update the above as

I can understand avoiding the very x86-specific "ADR," but is memory
controller not accurate for your platform?

> /*
>  * Platform provides mechanisms to flush outstanding write data
>  * to pmem on system power loss.
>  */

That's way too broad.  :) The comments are describing the persistence
domain.  i.e. if you get data to $HERE, it is guaranteed to make it out
to stable media.

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [RFC PATCH] libnvdimm: Update the meaning for persistence_domain values

2020-01-15 Thread Jeff Moyer
"Aneesh Kumar K.V"  writes:

> Currently, kernel shows the below values
>   "persistence_domain":"cpu_cache"
>   "persistence_domain":"memory_controller"
>   "persistence_domain":"unknown"
>
> This patch updates the meaning of these values such that
>
> "cpu_cache" indicates no extra instructions is needed to ensure the 
> persistence
> of data in the pmem media on power failure.
>
> "memory_controller" indicates platform provided instructions need to be issued
> as per documented sequence to make sure data flushed is guaranteed to be on 
> pmem
> media in case of system power loss.
>
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/platforms/pseries/papr_scm.c | 7 ++-
>  include/linux/libnvdimm.h | 6 +++---
>  2 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index c2ef320ba1bf..26a5ef263758 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -360,8 +360,13 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>  
>   if (p->is_volatile)
>   p->region = nvdimm_volatile_region_create(p->bus, _desc);
> - else
> + else {
> + /*
> +  * We need to flush things correctly to guarantee persistance
> +  */
> + set_bit(ND_REGION_PERSIST_MEMCTRL, _desc.flags);
>   p->region = nvdimm_pmem_region_create(p->bus, _desc);
> + }
>   if (!p->region) {
>   dev_err(dev, "Error registering region %pR from %pOF\n",
>   ndr_desc.res, p->dn);

Would you also update of_pmem to indicate the persistence domain,
please?

> diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
> index f2a33f2e3ba8..9126737377e1 100644
> --- a/include/linux/libnvdimm.h
> +++ b/include/linux/libnvdimm.h
> @@ -52,9 +52,9 @@ enum {
>*/
>   ND_REGION_PERSIST_CACHE = 1,
>   /*
> -  * Platform provides mechanisms to automatically flush outstanding
> -  * write data from memory controler to pmem on system power loss.
> -  * (ADR)
> +  * Platform provides instructions to flush data such that on completion
> +  * of the instructions, data flushed is guaranteed to be on pmem even
> +  * in case of a system power loss.

I find the prior description easier to understand.

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v3 1/6] libnvdimm/namespace: Make namespace size validation arch dependent

2020-01-10 Thread Jeff Moyer
Hi, Aneesh,

After applying this patch series, several of my namespaces no longer
enumerate:

Before:

# ndctl list
[
  {
"dev":"namespace0.2",
"mode":"sector",
"size":106541672960,
"uuid":"ea1122b2-c219-424c-b09c-38a6e94a1042",
"sector_size":512,
"blockdev":"pmem0.2s"
  },
  {
"dev":"namespace0.1",
"mode":"fsdax",
"map":"dev",
"size":10567548928,
"uuid":"68b6746f-481a-4ae6-80b5-71d62176606c",
"sector_size":512,
"align":4096,
"blockdev":"pmem0.1"
  },
  {
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":52850327552,
"uuid":"6d3a0199-5d9a-4fed-830d-e25249b70571",
"sector_size":512,
"align":2097152,
"blockdev":"pmem0"
  }
]

After:

# ndctl list
[
  {
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":52850327552,
"uuid":"6d3a0199-5d9a-4fed-830d-e25249b70571",
"sector_size":512,
"align":2097152,
"blockdev":"pmem0"
  }
]

I won't have time to dig into it this week, but I wanted to mention it
before Dan merged these patches.

I'll follow up next week with more information.

Cheers,
Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH] virtio pmem: fix async flush ordering

2019-11-22 Thread Jeff Moyer
Dan Williams  writes:

> On Fri, Nov 22, 2019 at 8:09 AM Jeff Moyer  wrote:
>>
>> Dan Williams  writes:
>>
>> > On Wed, Nov 20, 2019 at 9:26 AM Jeff Moyer  wrote:
>> >>
>> >> Pankaj Gupta  writes:
>> >>
>> >> >  Remove logic to create child bio in the async flush function which
>> >> >  causes child bio to get executed after parent bio 'pmem_make_request'
>> >> >  completes. This resulted in wrong ordering of REQ_PREFLUSH with the
>> >> >  data write request.
>> >> >
>> >> >  Instead we are performing flush from the parent bio to maintain the
>> >> >  correct order. Also, returning from function 'pmem_make_request' if
>> >> >  REQ_PREFLUSH returns an error.
>> >> >
>> >> > Reported-by: Jeff Moyer 
>> >> > Signed-off-by: Pankaj Gupta 
>> >>
>> >> There's a slight change in behavior for the error path in the
>> >> virtio_pmem driver.  Previously, all errors from virtio_pmem_flush were
>> >> converted to -EIO.  Now, they are reported as-is.  I think this is
>> >> actually an improvement.
>> >>
>> >> I'll also note that the current behavior can result in data corruption,
>> >> so this should be tagged for stable.
>> >
>> > I added that and was about to push this out, but what about the fact
>> > that now the guest will synchronously wait for flushing to occur. The
>> > goal of the child bio was to allow that to be an I/O wait with
>> > overlapping I/O, or at least not blocking the submission thread. Does
>> > the block layer synchronously wait for PREFLUSH requests?
>>
>> You *have* to wait for the preflush to complete before issuing the data
>> write.  See the "Explicit cache flushes" section in
>> Documentation/block/writeback_cache_control.rst.
>
> I'm not debating the ordering, or that the current implementation is
> obviously broken. I'm questioning whether the bio tagged with PREFLUSH
> is a barrier for future I/Os. My reading is that it is only a gate for
> past writes, and it can be queued. I.e. along the lines of
> md_flush_request().

Sorry, I misunderstood your question.

For a write bio with REQ_PREFLUSH set, the PREFLUSH has to be done
before the data attached to the bio is written.  That preflush is not an
I/O barrier.  In other words, for unrelated I/O (any other bio in the
system), it does not impart any specific ordering requirements.  Upper
layers are expected to wait for any related I/O completions before
issuing a flush request.

So yes, you can queue the bio to a worker thread and return to the
caller.  In fact, this is what I had originally suggested to Pankaj.

Cheers,
Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH] virtio pmem: fix async flush ordering

2019-11-22 Thread Jeff Moyer
Dan Williams  writes:

> On Wed, Nov 20, 2019 at 9:26 AM Jeff Moyer  wrote:
>>
>> Pankaj Gupta  writes:
>>
>> >  Remove logic to create child bio in the async flush function which
>> >  causes child bio to get executed after parent bio 'pmem_make_request'
>> >  completes. This resulted in wrong ordering of REQ_PREFLUSH with the
>> >  data write request.
>> >
>> >  Instead we are performing flush from the parent bio to maintain the
>> >  correct order. Also, returning from function 'pmem_make_request' if
>> >  REQ_PREFLUSH returns an error.
>> >
>> > Reported-by: Jeff Moyer 
>> > Signed-off-by: Pankaj Gupta 
>>
>> There's a slight change in behavior for the error path in the
>> virtio_pmem driver.  Previously, all errors from virtio_pmem_flush were
>> converted to -EIO.  Now, they are reported as-is.  I think this is
>> actually an improvement.
>>
>> I'll also note that the current behavior can result in data corruption,
>> so this should be tagged for stable.
>
> I added that and was about to push this out, but what about the fact
> that now the guest will synchronously wait for flushing to occur. The
> goal of the child bio was to allow that to be an I/O wait with
> overlapping I/O, or at least not blocking the submission thread. Does
> the block layer synchronously wait for PREFLUSH requests?

You *have* to wait for the preflush to complete before issuing the data
write.  See the "Explicit cache flushes" section in
Documentation/block/writeback_cache_control.rst.

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH] virtio pmem: fix async flush ordering

2019-11-20 Thread Jeff Moyer
Pankaj Gupta  writes:

>  Remove logic to create child bio in the async flush function which
>  causes child bio to get executed after parent bio 'pmem_make_request'
>  completes. This resulted in wrong ordering of REQ_PREFLUSH with the
>  data write request.
>
>  Instead we are performing flush from the parent bio to maintain the
>  correct order. Also, returning from function 'pmem_make_request' if
>  REQ_PREFLUSH returns an error.
>
> Reported-by: Jeff Moyer 
> Signed-off-by: Pankaj Gupta 

There's a slight change in behavior for the error path in the
virtio_pmem driver.  Previously, all errors from virtio_pmem_flush were
converted to -EIO.  Now, they are reported as-is.  I think this is
actually an improvement.

I'll also note that the current behavior can result in data corruption,
so this should be tagged for stable.

The patch looks good to me.

Thanks!

Reviewed-by: Jeff Moyer 

> ---
>  drivers/acpi/nfit/core.c |  4 ++--
>  drivers/nvdimm/claim.c   |  2 +-
>  drivers/nvdimm/nd.h  |  2 +-
>  drivers/nvdimm/nd_virtio.c   | 29 ++---
>  drivers/nvdimm/pmem.c| 14 ++
>  drivers/nvdimm/region_devs.c |  6 +++---
>  drivers/nvdimm/virtio_pmem.c |  2 +-
>  drivers/nvdimm/virtio_pmem.h |  2 +-
>  include/linux/libnvdimm.h|  4 ++--
>  9 files changed, 23 insertions(+), 42 deletions(-)
>
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index 14e68f202f81..afbd5e2b2ea8 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -2426,7 +2426,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, 
> unsigned int bw,
>   offset = to_interleave_offset(offset, mmio);
>  
>   writeq(cmd, mmio->addr.base + offset);
> - nvdimm_flush(nfit_blk->nd_region, NULL);
> + nvdimm_flush(nfit_blk->nd_region);
>  
>   if (nfit_blk->dimm_flags & NFIT_BLK_DCR_LATCH)
>   readq(mmio->addr.base + offset);
> @@ -2475,7 +2475,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk 
> *nfit_blk,
>   }
>  
>   if (rw)
> - nvdimm_flush(nfit_blk->nd_region, NULL);
> + nvdimm_flush(nfit_blk->nd_region);
>  
>   rc = read_blk_stat(nfit_blk, lane) ? -EIO : 0;
>   return rc;
> diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
> index 2985ca949912..0fedb2fbfcbe 100644
> --- a/drivers/nvdimm/claim.c
> +++ b/drivers/nvdimm/claim.c
> @@ -293,7 +293,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
>   }
>  
>   memcpy_flushcache(nsio->addr + offset, buf, size);
> - ret = nvdimm_flush(to_nd_region(ndns->dev.parent), NULL);
> + ret = nvdimm_flush(to_nd_region(ndns->dev.parent));
>   if (ret)
>   rc = ret;
>  
> diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
> index ee5c04070ef9..77d8b9f0c34a 100644
> --- a/drivers/nvdimm/nd.h
> +++ b/drivers/nvdimm/nd.h
> @@ -155,7 +155,7 @@ struct nd_region {
>   struct badblocks bb;
>   struct nd_interleave_set *nd_set;
>   struct nd_percpu_lane __percpu *lane;
> - int (*flush)(struct nd_region *nd_region, struct bio *bio);
> + int (*flush)(struct nd_region *nd_region);
>   struct nd_mapping mapping[0];
>  };
>  
> diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
> index 10351d5b49fa..9604ba08a68a 100644
> --- a/drivers/nvdimm/nd_virtio.c
> +++ b/drivers/nvdimm/nd_virtio.c
> @@ -35,7 +35,7 @@ void virtio_pmem_host_ack(struct virtqueue *vq)
>  EXPORT_SYMBOL_GPL(virtio_pmem_host_ack);
>  
>   /* The request submission function */
> -static int virtio_pmem_flush(struct nd_region *nd_region)
> +int virtio_pmem_flush(struct nd_region *nd_region)
>  {
>   struct virtio_device *vdev = nd_region->provider_data;
>   struct virtio_pmem *vpmem  = vdev->priv;
> @@ -96,30 +96,5 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
>   kfree(req_data);
>   return err;
>  };
> -
> -/* The asynchronous flush callback function */
> -int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
> -{
> - /*
> -  * Create child bio for asynchronous flush and chain with
> -  * parent bio. Otherwise directly call nd_region flush.
> -  */
> - if (bio && bio->bi_iter.bi_sector != -1) {
> - struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> -
> - if (!child)
> - return -ENOMEM;
> - bio_copy_dev(child, bio);
> - child->bi_opf = REQ_PREFLUSH;
> - child->bi_iter.bi_sector = -1;
> - bio_chain(child, bio);
> - submit_bio(child);
&

Re: [PATCH 4/4] modpost: do not set ->preloaded for symbols from Module.symvers

2019-11-01 Thread Jeff Moyer
Masahiro Yamada  writes:

> On Fri, Nov 1, 2019 at 1:51 AM Jeff Moyer  wrote:
>>
>> Masahiro Yamada  writes:
>>
>> > Now that there is no overwrap between symbols from ELF files and
>> > ones from Module.symvers.
>> >
>> > So, the 'exported twice' warning should be reported irrespective
>> > of where the symbol in question came from. Only the exceptional case
>> > is when __crc_ symbol appears before __ksymtab_. This
>> > typically occurs for EXPORT_SYMBOL in .S files.
>>
>> Hi, Masahiro,
>>
>> After apply this patch, I get the following modpost warnings when doing:
>>
>> $ make M=tools/tesing/nvdimm
>> ...
>>   Building modules, stage 2.
>>   MODPOST 12 modules
>> WARNING: tools/testing/nvdimm/libnvdimm: 'nvdimm_bus_lock' exported
>> twice. Previous export was in drivers/nvdimm/libnvdimm.ko
>> WARNING: tools/testing/nvdimm/libnvdimm: 'nvdimm_bus_unlock'
>> exported twice. Previous export was in drivers/nvdimm/libnvdimm.ko
>> WARNING: tools/testing/nvdimm/libnvdimm: 'is_nvdimm_bus_locked'
>> exported twice. Previous export was in drivers/nvdimm/libnvdimm.ko
>> WARNING: tools/testing/nvdimm/libnvdimm: 'devm_nvdimm_memremap'
>> exported twice. Previous export was in drivers/nvdimm/libnvdimm.ko
>> WARNING: tools/testing/nvdimm/libnvdimm: 'nd_fletcher64' exported twice. 
>> Previous export was in drivers/nvdimm/libnvdimm.ko
>> WARNING: tools/testing/nvdimm/libnvdimm: 'to_nd_desc' exported twice. 
>> Previous export was in drivers/nvdimm/libnvdimm.ko
>> WARNING: tools/testing/nvdimm/libnvdimm: 'to_nvdimm_bus_dev'
>> exported twice. Previous export was in drivers/nvdimm/libnvdimm.ko
>> ...
>>
>> There are a lot of these warnings.  :)
>
> These warnings are correct since
> drivers/nvdimm/Makefile and
> tools/testing/nvdimm/Kbuild
> compile the same files.

Yeah, but that's by design.  Is there a way to silence these warnings?

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 4/4] modpost: do not set ->preloaded for symbols from Module.symvers

2019-10-31 Thread Jeff Moyer
Masahiro Yamada  writes:

> Now that there is no overwrap between symbols from ELF files and
> ones from Module.symvers.
>
> So, the 'exported twice' warning should be reported irrespective
> of where the symbol in question came from. Only the exceptional case
> is when __crc_ symbol appears before __ksymtab_. This
> typically occurs for EXPORT_SYMBOL in .S files.

Hi, Masahiro,

After apply this patch, I get the following modpost warnings when doing:

$ make M=tools/tesing/nvdimm
...
  Building modules, stage 2.
  MODPOST 12 modules
WARNING: tools/testing/nvdimm/libnvdimm: 'nvdimm_bus_lock' exported twice. 
Previous export was in drivers/nvdimm/libnvdimm.ko
WARNING: tools/testing/nvdimm/libnvdimm: 'nvdimm_bus_unlock' exported twice. 
Previous export was in drivers/nvdimm/libnvdimm.ko
WARNING: tools/testing/nvdimm/libnvdimm: 'is_nvdimm_bus_locked' exported twice. 
Previous export was in drivers/nvdimm/libnvdimm.ko
WARNING: tools/testing/nvdimm/libnvdimm: 'devm_nvdimm_memremap' exported twice. 
Previous export was in drivers/nvdimm/libnvdimm.ko
WARNING: tools/testing/nvdimm/libnvdimm: 'nd_fletcher64' exported twice. 
Previous export was in drivers/nvdimm/libnvdimm.ko
WARNING: tools/testing/nvdimm/libnvdimm: 'to_nd_desc' exported twice. Previous 
export was in drivers/nvdimm/libnvdimm.ko
WARNING: tools/testing/nvdimm/libnvdimm: 'to_nvdimm_bus_dev' exported twice. 
Previous export was in drivers/nvdimm/libnvdimm.ko
...

There are a lot of these warnings.  :)  If I revert this patch, no
complaints.

Cheers,
Jeff


>
> Signed-off-by: Masahiro Yamada 
> ---
>
>  scripts/mod/modpost.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c
> index 5234555cf550..6ca38d10efc5 100644
> --- a/scripts/mod/modpost.c
> +++ b/scripts/mod/modpost.c
> @@ -2457,7 +2457,6 @@ static void read_dump(const char *fname, unsigned int 
> kernel)
>   s = sym_add_exported(symname, namespace, mod,
>export_no(export));
>   s->kernel= kernel;
> - s->preloaded = 1;
>   s->is_static = 0;
>   sym_update_crc(symname, mod, crc, export_no(export));
>   }
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [ndctl patch 3/4] query_fw_finish_status: get rid of redundant variable

2019-10-28 Thread Jeff Moyer
Ira Weiny  writes:

> On Mon, Oct 28, 2019 at 03:37:48PM -0400, Jeff Moyer wrote:
>> Ira Weiny  writes:
>> 
>> >> (Watching the unit test run fall into an infinite loop..) Nope, the
>> >> break is in the switch scope, the while loop needs the 'goto out'.
>> >> 
>> >> Yes this bit definitely needs to be refactored :)
>> >
>> > How about this patch instead?  Untested.
>> 
>> I'm not a fan of the looping with gotos.
>
> Me either... But... the logic here is not the same.

How about this one, then?  Again, compile-tested only.  I'll run it
through testing only if you like it better than your approach.  If you
like your appraoch better, I'll go ahead and review and test that.

Cheers,
Jeff

diff --git a/ndctl/dimm.c b/ndctl/dimm.c
index b1b84c2..63d4d4a 100644
--- a/ndctl/dimm.c
+++ b/ndctl/dimm.c
@@ -674,6 +674,52 @@ out:
return rc;
 }
 
+/*
+ * Wait for a command to complete, up to the firmware-specified timeout.
+ * Returns -errno on error.  On success, which means either the command
+ * completed (sucessfully or with an error), or we timed out waiting for
+ * it, return 0.  The caller needs to check the status on its own if this
+ * function returns 0.
+ */
+static int query_fw_finish_status_timeout(struct ndctl_cmd *cmd,
+ struct fw_info *fw)
+{
+   enum ND_FW_STATUS status;
+   struct timespec sleeptime, start, now;
+   int rc;
+
+   rc = clock_gettime(CLOCK_MONOTONIC, );
+   if (rc < 0)
+   return rc;
+
+   sleeptime.tv_nsec = fw->query_interval / 1000;
+   sleeptime.tv_sec = 0;
+
+   while ((rc = ndctl_cmd_submit(cmd)) == 0 &&
+  (status = ndctl_cmd_fw_xlat_firmware_status(cmd)) == FW_EBUSY) {
+
+   rc = clock_gettime(CLOCK_MONOTONIC, );
+   if (rc < 0)
+   break;
+
+   /*
+* If we expire max query time, we timed out
+*/
+   if (now.tv_sec - start.tv_sec > fw->max_query / 100)
+   break;
+
+   /*
+* Sleep the interval dictated by firmware before
+* query again.
+*/
+   rc = nanosleep(, NULL);
+   if (rc < 0)
+   break;
+   }
+
+   return rc;
+}
+
 static int query_fw_finish_status(struct ndctl_dimm *dimm,
struct action_context *actx)
 {
@@ -682,94 +728,55 @@ static int query_fw_finish_status(struct ndctl_dimm *dimm,
struct ndctl_cmd *cmd;
int rc;
enum ND_FW_STATUS status;
-   struct timespec now, before, after;
uint64_t ver;
 
cmd = ndctl_dimm_cmd_new_fw_finish_query(uctx->start);
if (!cmd)
return -ENXIO;
 
-   rc = clock_gettime(CLOCK_MONOTONIC, );
+   rc = query_fw_finish_status_timeout(cmd, fw);
if (rc < 0)
-   goto out;
-
-   now.tv_nsec = fw->query_interval / 1000;
-   now.tv_sec = 0;
-
-   do {
-   rc = ndctl_cmd_submit(cmd);
-   if (rc < 0)
-   break;
-
-   status = ndctl_cmd_fw_xlat_firmware_status(cmd);
-   switch (status) {
-   case FW_SUCCESS:
-   ver = ndctl_cmd_fw_fquery_get_fw_rev(cmd);
-   if (ver == 0) {
-   fprintf(stderr, "No firmware updated.\n");
-   rc = -ENXIO;
-   goto out;
-   }
-
-   printf("Image updated successfully to DIMM %s.\n",
-   ndctl_dimm_get_devname(dimm));
-   printf("Firmware version %#lx.\n", ver);
-   printf("Cold reboot to activate.\n");
-   rc = 0;
-   goto out;
-   break;
-   case FW_EBUSY:
-   /* Still on going, continue */
-   rc = clock_gettime(CLOCK_MONOTONIC, );
-   if (rc < 0) {
-   rc = -errno;
-   goto out;
-   }
-
-   /*
-* If we expire max query time,
-* we timed out
-*/
-   if (after.tv_sec - before.tv_sec >
-   fw->max_query / 100) {
-   rc = -ETIMEDOUT;
-   goto out;
-   }
+   goto unref;
 
-   /*
-* Sleep the interval dictated by firmware
-* before query again.
-

Re: [ndctl patch 3/4] query_fw_finish_status: get rid of redundant variable

2019-10-28 Thread Jeff Moyer
Ira Weiny  writes:

> On Mon, Oct 28, 2019 at 03:37:48PM -0400, Jeff Moyer wrote:
>> Ira Weiny  writes:
>> 
>> >> (Watching the unit test run fall into an infinite loop..) Nope, the
>> >> break is in the switch scope, the while loop needs the 'goto out'.
>> >> 
>> >> Yes this bit definitely needs to be refactored :)
>> >
>> > How about this patch instead?  Untested.
>> 
>> I'm not a fan of the looping with gotos.
>
> Me either... But... the logic here is not the same.
>
>>
>> I think separating out the
>> waiting for busy to its own function would make this more clear.
>> Looking more closely, there are other issues.  The timeout code looks at
>> the seconds, but ignores the fractions, so you could be off by almost an
>> entire second, there.
>
> For this operation that is probably not a big deal.  We should be waiting much
> longer than the operation should take anyway.
>
>>
>> It also doens't retry the sleep if interrupted.
>
> This could be an issue.
>
>> Finally, I find the variables names to be highly confusing.
>> 
>> I've decided not to fix those last two bugs just yet, but here's a patch
>> that shows the dirction I think it should go.  Compile-tested only for
>> now.  Let me know what you think.
>
> I thought about doing something similar but to make the logic the same it
> becomes a bit awkward.

[...]

>> +wait_for_cmd_completion(cmd, fw, );
>
> wait_for_cmd_completion() does not call ndctl_cmd_submit()
>
> Now I find it odd that we need to resubmit the command but I assume the logic
> is correct.  Therefore we need to go back and call ndctl_cmd_submit() again.
>
> Or is this not required?

Ah.  Stupid mistake.  Yes, it definitely looks like the status query
command needs to be resubmitted, and that's the whole point of the
timeout between calls.  You can't ask too often.  ;-)

> anyway that is why I went ahead and used the goto...

I'll take another look.  Thanks for pointing out that obvious thinko.

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [ndctl patch 3/4] query_fw_finish_status: get rid of redundant variable

2019-10-28 Thread Jeff Moyer
Ira Weiny  writes:

>> (Watching the unit test run fall into an infinite loop..) Nope, the
>> break is in the switch scope, the while loop needs the 'goto out'.
>> 
>> Yes this bit definitely needs to be refactored :)
>
> How about this patch instead?  Untested.

I'm not a fan of the looping with gotos.  I think separating out the
waiting for busy to its own function would make this more clear.
Looking more closely, there are other issues.  The timeout code looks at
the seconds, but ignores the fractions, so you could be off by almost an
entire second, there.  It also doens't retry the sleep if interrupted.
Finally, I find the variables names to be highly confusing.

I've decided not to fix those last two bugs just yet, but here's a patch
that shows the dirction I think it should go.  Compile-tested only for
now.  Let me know what you think.

Ira, I used the same base as you.  If you updated ndctl, you'll have to
revert 9e0391e057b36 to apply this patch.

Cheers,
Jeff

diff --git a/ndctl/dimm.c b/ndctl/dimm.c
index c8821d6..701f58b 100644
--- a/ndctl/dimm.c
+++ b/ndctl/dimm.c
@@ -674,6 +674,41 @@ out:
return rc;
 }
 
+static void wait_for_cmd_completion(struct ndctl_cmd *cmd, struct fw_info *fw,
+   struct timespec *start)
+{
+   enum ND_FW_STATUS status;
+   struct timespec sleeptime, now;
+   int rc;
+
+   sleeptime.tv_nsec = fw->query_interval / 1000;
+   sleeptime.tv_sec = 0;
+
+   while ((status = ndctl_cmd_fw_xlat_firmware_status(cmd)) == FW_EBUSY) {
+
+   rc = clock_gettime(CLOCK_MONOTONIC, );
+   if (rc < 0)
+   break;
+
+   /*
+* If we expire max query time, we timed out
+*/
+   if (now.tv_sec - start->tv_sec > fw->max_query / 100)
+   break;
+
+   /*
+* Sleep the interval dictated by firmware before
+* query again.
+*/
+   rc = nanosleep(, NULL);
+   if (rc < 0)
+   break;
+
+   }
+
+   return;
+}
+
 static int query_fw_finish_status(struct ndctl_dimm *dimm,
struct action_context *actx)
 {
@@ -682,98 +717,65 @@ static int query_fw_finish_status(struct ndctl_dimm *dimm,
struct ndctl_cmd *cmd;
int rc;
enum ND_FW_STATUS status;
-   bool done = false;
-   struct timespec now, before, after;
+   struct timespec start;
uint64_t ver;
 
cmd = ndctl_dimm_cmd_new_fw_finish_query(uctx->start);
if (!cmd)
return -ENXIO;
 
-   rc = clock_gettime(CLOCK_MONOTONIC, );
+   rc = clock_gettime(CLOCK_MONOTONIC, );
if (rc < 0)
-   goto out;
-
-   now.tv_nsec = fw->query_interval / 1000;
-   now.tv_sec = 0;
-
-   do {
-   rc = ndctl_cmd_submit(cmd);
-   if (rc < 0)
-   break;
+   goto unref;
 
-   status = ndctl_cmd_fw_xlat_firmware_status(cmd);
-   switch (status) {
-   case FW_SUCCESS:
-   ver = ndctl_cmd_fw_fquery_get_fw_rev(cmd);
-   if (ver == 0) {
-   fprintf(stderr, "No firmware updated.\n");
-   rc = -ENXIO;
-   goto out;
-   }
-
-   printf("Image updated successfully to DIMM %s.\n",
-   ndctl_dimm_get_devname(dimm));
-   printf("Firmware version %#lx.\n", ver);
-   printf("Cold reboot to activate.\n");
-   done = true;
-   rc = 0;
-   break;
-   case FW_EBUSY:
-   /* Still on going, continue */
-   rc = clock_gettime(CLOCK_MONOTONIC, );
-   if (rc < 0) {
-   rc = -errno;
-   goto out;
-   }
+   rc = ndctl_cmd_submit(cmd);
+   if (rc < 0)
+   goto unref;
 
-   /*
-* If we expire max query time,
-* we timed out
-*/
-   if (after.tv_sec - before.tv_sec >
-   fw->max_query / 100) {
-   rc = -ETIMEDOUT;
-   goto out;
-   }
+   wait_for_cmd_completion(cmd, fw, );
 
-   /*
-* Sleep the interval dictated by firmware
-* before query again.
-*/
-   rc = nanosleep(, NULL);
-   if (rc < 0) {
-   rc = -errno;
-   goto out;
-  

Re: [PATCH] fs/dax: Fix pmd vs pte conflict detection

2019-10-21 Thread Jeff Moyer
Dan Williams  writes:

> Check for NULL entries before checking the entry order, otherwise NULL
> is misinterpreted as a present pte conflict. The 'order' check needs to
> happen before the locked check as an unlocked entry at the wrong order
> must fallback to lookup the correct order.

Please include the user-visible effects of the problem in the changelog.

Thanks,
Jeff

>
> Reported-by: Jeff Smits 
> Reported-by: Doug Nelson 
> Cc: 
> Fixes: 23c84eb78375 ("dax: Fix missed wakeup with PMD faults")
> Cc: Jan Kara 
> Cc: Matthew Wilcox (Oracle) 
> Signed-off-by: Dan Williams 
> ---
>  fs/dax.c |5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index a71881e77204..08160011d94c 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -221,10 +221,11 @@ static void *get_unlocked_entry(struct xa_state *xas, 
> unsigned int order)
>  
>   for (;;) {
>   entry = xas_find_conflict(xas);
> + if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
> + return entry;
>   if (dax_entry_order(entry) < order)
>   return XA_RETRY_ENTRY;
> - if (!entry || WARN_ON_ONCE(!xa_is_value(entry)) ||
> - !dax_is_locked(entry))
> + if (!dax_is_locked(entry))
>   return entry;
>  
>   wq = dax_entry_waitqueue(xas, entry, );
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [ndctl patch 3/4] query_fw_finish_status: get rid of redundant variable

2019-10-18 Thread Jeff Moyer
Ira Weiny  writes:

> On Fri, Oct 18, 2019 at 04:23:01PM -0400, Jeff Moyer wrote:
>> The 'done' variable only adds confusion.
>> 
>> Signed-off-by: Jeff Moyer 
>> ---
>>  ndctl/dimm.c | 7 +--
>>  1 file changed, 1 insertion(+), 6 deletions(-)
>> 
>> diff --git a/ndctl/dimm.c b/ndctl/dimm.c
>> index c8821d6..f28b9c1 100644
>> --- a/ndctl/dimm.c
>> +++ b/ndctl/dimm.c
>> @@ -682,7 +682,6 @@ static int query_fw_finish_status(struct ndctl_dimm 
>> *dimm,
>>  struct ndctl_cmd *cmd;
>>  int rc;
>>  enum ND_FW_STATUS status;
>> -bool done = false;
>>  struct timespec now, before, after;
>>  uint64_t ver;
>>  
>> @@ -716,7 +715,6 @@ static int query_fw_finish_status(struct ndctl_dimm 
>> *dimm,
>>  ndctl_dimm_get_devname(dimm));
>>  printf("Firmware version %#lx.\n", ver);
>>  printf("Cold reboot to activate.\n");
>> -done = true;
>>  rc = 0;
>
> Do we need "goto out" here?

Yes, I missed that one.  Thanks.

>>  break;
>>  case FW_EBUSY:
>> @@ -753,7 +751,6 @@ static int query_fw_finish_status(struct ndctl_dimm 
>> *dimm,
>>  ndctl_dimm_get_devname(dimm));
>>  case FW_EINVAL_CTX:
>>  case FW_ESEQUENCE:
>> -done = true;
>>  rc = -ENXIO;
>>  goto out;
>>  case FW_ENORES:
>> @@ -761,17 +758,15 @@ static int query_fw_finish_status(struct ndctl_dimm 
>> *dimm,
>>  "Firmware update sequence timed out: %s\n",
>>  ndctl_dimm_get_devname(dimm));
>>  rc = -ETIMEDOUT;
>> -done = true;
>>  goto out;
>>  default:
>>  fprintf(stderr,
>>  "Unknown update status: %#x on DIMM %s\n",
>>  status, ndctl_dimm_get_devname(dimm));
>>  rc = -EINVAL;
>> -done = true;
>>  goto out;
>>  }
>> -} while (!done);
>> +} while (true);
>
> I'm not a fan of "while (true)".  But I'm not the maintainer.  The Logic seems
> fine otherwise.

The way things stand today is a mashup of goto vs. break.  I'll
follow-up with fixed up patch next week if there is consensus on the
change.  If you have a suggestion for a better way, that's welcome as
well.

Thanks for looking, Ira!

-Jeff
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[ndctl patch 3/4] query_fw_finish_status: get rid of redundant variable

2019-10-18 Thread Jeff Moyer
The 'done' variable only adds confusion.

Signed-off-by: Jeff Moyer 
---
 ndctl/dimm.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/ndctl/dimm.c b/ndctl/dimm.c
index c8821d6..f28b9c1 100644
--- a/ndctl/dimm.c
+++ b/ndctl/dimm.c
@@ -682,7 +682,6 @@ static int query_fw_finish_status(struct ndctl_dimm *dimm,
struct ndctl_cmd *cmd;
int rc;
enum ND_FW_STATUS status;
-   bool done = false;
struct timespec now, before, after;
uint64_t ver;
 
@@ -716,7 +715,6 @@ static int query_fw_finish_status(struct ndctl_dimm *dimm,
ndctl_dimm_get_devname(dimm));
printf("Firmware version %#lx.\n", ver);
printf("Cold reboot to activate.\n");
-   done = true;
rc = 0;
break;
case FW_EBUSY:
@@ -753,7 +751,6 @@ static int query_fw_finish_status(struct ndctl_dimm *dimm,
ndctl_dimm_get_devname(dimm));
case FW_EINVAL_CTX:
case FW_ESEQUENCE:
-   done = true;
rc = -ENXIO;
goto out;
case FW_ENORES:
@@ -761,17 +758,15 @@ static int query_fw_finish_status(struct ndctl_dimm *dimm,
"Firmware update sequence timed out: %s\n",
ndctl_dimm_get_devname(dimm));
rc = -ETIMEDOUT;
-   done = true;
goto out;
default:
fprintf(stderr,
"Unknown update status: %#x on DIMM %s\n",
status, ndctl_dimm_get_devname(dimm));
rc = -EINVAL;
-   done = true;
goto out;
}
-   } while (!done);
+   } while (true);
 
 out:
ndctl_cmd_unref(cmd);
-- 
2.19.1
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[ndctl patch 2/4] fix building of tags tables

2019-10-18 Thread Jeff Moyer
Make  currently fails with:

make[1]: *** No rule to make target 'libndctl.h', needed by 'tags-am'.  Stop.

The path to libndctl.h is wrong in ndctl/lib/Makefile.am.  Fix it.

Signed-off-by: Jeff Moyer 
---
 ndctl/lib/Makefile.am | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ndctl/lib/Makefile.am b/ndctl/lib/Makefile.am
index fb75fda..e4eb006 100644
--- a/ndctl/lib/Makefile.am
+++ b/ndctl/lib/Makefile.am
@@ -7,7 +7,7 @@ pkginclude_HEADERS = ../libndctl.h ../ndctl.h
 lib_LTLIBRARIES = libndctl.la
 
 libndctl_la_SOURCES =\
-   libndctl.h \
+   ../libndctl.h \
private.h \
../../util/log.c \
../../util/log.h \
-- 
2.19.1
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[ndctl patch 1/4] util/abspath: cleanup prefix_filename

2019-10-18 Thread Jeff Moyer
Static checkers complain about the unused assignment to pfx_len.
The code can obviously be simplified.

Signed-off-by: Jeff Moyer 
---
 util/abspath.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/util/abspath.c b/util/abspath.c
index 09bbd27..e44236f 100644
--- a/util/abspath.c
+++ b/util/abspath.c
@@ -9,11 +9,7 @@ char *prefix_filename(const char *pfx, const char *arg)
struct strbuf path = STRBUF_INIT;
size_t pfx_len = pfx ? strlen(pfx) : 0;
 
-   if (!pfx_len)
-   ;
-   else if (is_absolute_path(arg))
-   pfx_len = 0;
-   else
+   if (pfx_len && !is_absolute_path(arg))
strbuf_add(, pfx, pfx_len);
 
strbuf_addstr(, arg);
-- 
2.19.1
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[ndctl patch 4/4] load-keys: get rid of duplicate assignment

2019-10-18 Thread Jeff Moyer
Signed-off-by: Jeff Moyer 
---
 ndctl/load-keys.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/ndctl/load-keys.c b/ndctl/load-keys.c
index 981f80f..f0b7a5a 100644
--- a/ndctl/load-keys.c
+++ b/ndctl/load-keys.c
@@ -185,7 +185,6 @@ static int load_keys(struct loadkeys *lk_ctx, const char 
*keypath,
 
rc = chdir(keypath);
if (rc < 0) {
-   rc = -errno;
fprintf(stderr, "Change current work dir to %s failed: %s\n",
param.key_path, strerror(errno));
rc = -errno;
-- 
2.19.1
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [ndctl RFC PATCH] ndctl/namespace: create namespaces greedily

2019-08-28 Thread Jeff Moyer
Vishal Verma  writes:

> When a --region=all option is supplied to ndctl-create-namespace, it
> happily ignores it, since create-namespace has historically only created
> one namespace per invocation.
>
> This can be cumbersome, so change create-namespace to create namespaces
> greedily. When --region=all is specified, or if a --region option is
> absent (same as 'all' from a filtering standpoint), create namespaces on
> all regions.

Cumbersome?  Like, in the same way partitioning a disk is cumbersome?  I
don't understand what the problem is, I guess.  If you want N namespaces
of the same type/size/align, then script it.  Why does there have to be
a command for that?  I definitely think that changing the behavior of
create-namespace is a non-starter.

Cheers,
Jeff

>
> Note that this does has two important implications:
>
> 1. The user-facing behavior of create-namespace changes in a potentially
> surprising way. It may be undesirable for an unadorned 'ndctl
> create-namespace' command to suddenly start creating lots of namespaces.
>
> 2. Error handling becomes a bit inconsistent. As with all commands
> accepting an 'all' option, error reporting becomes a bit tenuous. With
> this change, we will continue to create namespaces so long as we don't
> hit an error, but if we do, we bail and exit. Because of the special
> ENOSPC detection and handling around this, it can be easy to construct a
> scenario where en existing namespace on the last region in the scan list
> happens to report an error exit, but if the existing namespace was in a
> region at the start of the scan list, it gets passed over as a "just try
> the next region"
>
> RFC comment: Maybe the solution is to keep the create-namespace
> semantics unchanged, and introduce a new command - 'create-namespaces'
> or 'create-names-ace-greedy' with the new behavior. I'm not sure if
> users will care deeply about either of the two points above, hence
> sending this as an RFC.
>
> Link: https://github.com/pmem/ndctl/issues/106
> Reported-by: Steve Scargal 
> Cc: Jeff Moyer 
> Cc: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>  ndctl/namespace.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/ndctl/namespace.c b/ndctl/namespace.c
> index af20a42..856ad82 100644
> --- a/ndctl/namespace.c
> +++ b/ndctl/namespace.c
> @@ -1365,9 +1365,12 @@ static int do_xaction_namespace(const char *namespace,
>   rc = namespace_create(region);
>   if (rc == -EAGAIN)
>   continue;
> - if (rc == 0)
> - *processed = 1;
> - return rc;
> + if (rc == 0) {
> + (*processed)++;
> + continue;
> + } else {
> + return rc;
> + }
>   }
>   ndctl_namespace_foreach_safe(region, ndns, _n) {
>   ndns_name = ndctl_namespace_get_devname(ndns);
> @@ -1487,6 +1490,8 @@ int cmd_create_namespace(int argc, const char **argv, 
> struct ndctl_ctx *ctx)
>   rc = do_xaction_namespace(NULL, ACTION_CREATE, ctx, );
>   }
>  
> + fprintf(stderr, "created %d namespace%s\n", created,
> + created == 1 ? "" : "s");
>   if (rc < 0 || (!namespace && created < 1)) {
>   fprintf(stderr, "failed to %s namespace: %s\n", namespace
>   ? "reconfigure" : "create", strerror(-rc));
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v2 1/3] libnvdimm/security: Introduce a 'frozen' attribute

2019-08-28 Thread Jeff Moyer
Dan Williams  writes:

> In the process of debugging a system with an NVDIMM that was failing to
> unlock it was found that the kernel is reporting 'locked' while the DIMM
> security interface is 'frozen'. Unfortunately the security state is
> tracked internally as an enum which prevents it from communicating the
> difference between 'locked' and 'locked + frozen'. It follows that the
> enum also prevents the kernel from communicating 'unlocked + frozen'
> which would be useful for debugging why security operations like 'change
> passphrase' are disabled.
>
> Ditch the security state enum for a set of flags and introduce a new
> sysfs attribute explicitly for the 'frozen' state. The regression risk
> is low because the 'frozen' state was already blocked behind the
> 'locked' state, but will need to revisit if there were cases where
> applications need 'frozen' to show up in the primary 'security'
> attribute. The expectation is that communicating 'frozen' is mostly a
> helper for debug and status monitoring.
>
> Reviewed-by: Dave Jiang 
> Reported-by: Jeff Moyer 
> Signed-off-by: Dan Williams 

Reviewed-by: Jeff Moyer 

> ---
>  drivers/acpi/nfit/intel.c|   59 ---
>  drivers/nvdimm/bus.c |2 -
>  drivers/nvdimm/dimm_devs.c   |   59 +--
>  drivers/nvdimm/nd-core.h |   21 ++--
>  drivers/nvdimm/security.c|   99 
> ++
>  include/linux/libnvdimm.h|9 ++-
>  tools/testing/nvdimm/dimm_devs.c |   19 ++-
>  7 files changed, 141 insertions(+), 127 deletions(-)
>
> diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c
> index cddd0fcf622c..1113b679cd7b 100644
> --- a/drivers/acpi/nfit/intel.c
> +++ b/drivers/acpi/nfit/intel.c
> @@ -7,10 +7,11 @@
>  #include "intel.h"
>  #include "nfit.h"
>  
> -static enum nvdimm_security_state intel_security_state(struct nvdimm *nvdimm,
> +static unsigned long intel_security_flags(struct nvdimm *nvdimm,
>   enum nvdimm_passphrase_type ptype)
>  {
>   struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
> + unsigned long security_flags = 0;
>   struct {
>   struct nd_cmd_pkg pkg;
>   struct nd_intel_get_security_state cmd;
> @@ -27,7 +28,7 @@ static enum nvdimm_security_state 
> intel_security_state(struct nvdimm *nvdimm,
>   int rc;
>  
>   if (!test_bit(NVDIMM_INTEL_GET_SECURITY_STATE, _mem->dsm_mask))
> - return -ENXIO;
> + return 0;
>  
>   /*
>* Short circuit the state retrieval while we are doing overwrite.
> @@ -35,38 +36,42 @@ static enum nvdimm_security_state 
> intel_security_state(struct nvdimm *nvdimm,
>* until the overwrite DSM completes.
>*/
>   if (nvdimm_in_overwrite(nvdimm) && ptype == NVDIMM_USER)
> - return NVDIMM_SECURITY_OVERWRITE;
> + return BIT(NVDIMM_SECURITY_OVERWRITE);
>  
>   rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, _cmd, sizeof(nd_cmd), NULL);
> - if (rc < 0)
> - return rc;
> - if (nd_cmd.cmd.status)
> - return -EIO;
> + if (rc < 0 || nd_cmd.cmd.status) {
> + pr_err("%s: security state retrieval failed (%d:%#x)\n",
> + nvdimm_name(nvdimm), rc, nd_cmd.cmd.status);
> + return 0;
> + }
>  
>   /* check and see if security is enabled and locked */
>   if (ptype == NVDIMM_MASTER) {
>   if (nd_cmd.cmd.extended_state & ND_INTEL_SEC_ESTATE_ENABLED)
> - return NVDIMM_SECURITY_UNLOCKED;
> - else if (nd_cmd.cmd.extended_state &
> - ND_INTEL_SEC_ESTATE_PLIMIT)
> - return NVDIMM_SECURITY_FROZEN;
> - } else {
> - if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_UNSUPPORTED)
> - return -ENXIO;
> - else if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_ENABLED) {
> - if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_LOCKED)
> - return NVDIMM_SECURITY_LOCKED;
> - else if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_FROZEN
> - || nd_cmd.cmd.state &
> - ND_INTEL_SEC_STATE_PLIMIT)
> - return NVDIMM_SECURITY_FROZEN;
> - else
> - return NVDIMM_SECURITY_UNLOCKED;
> - }
> + set_bit(NVDIMM_SECURITY_UNLOCKED, _flags);
> + else
> + 

Re: [PATCH v2 2/3] libnvdimm/security: Tighten scope of nvdimm->busy vs security operations

2019-08-28 Thread Jeff Moyer
Dan Williams  writes:

> An attempt to freeze DIMMs currently runs afoul of default blocking of
> all security operations in the entry to the 'store' routine for the
> 'security' sysfs attribute.
>
> The blanket blocking of all security operations while the DIMM is in
> active use in a region is too restrictive. The only security operations
> that need to be aware of the ->busy state are those that mutate the
> state of data, i.e. erase and overwrite.
>
> Refactor the ->busy checks to be applied at the entry common entry point
> in __security_store() rather than each of the helper routines to enable
> freeze to be run regardless of busy state.
>
> Reviewed-by: Dave Jiang 
> Signed-off-by: Dan Williams 

Reviewed-by: Jeff Moyer 

> ---
>  drivers/nvdimm/dimm_devs.c |   33 -
>  drivers/nvdimm/security.c  |   10 --
>  2 files changed, 16 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
> index 53330625fe07..d837cb9be83d 100644
> --- a/drivers/nvdimm/dimm_devs.c
> +++ b/drivers/nvdimm/dimm_devs.c
> @@ -424,9 +424,6 @@ static ssize_t __security_store(struct device *dev, const 
> char *buf, size_t len)
>   unsigned int key, newkey;
>   int i;
>  
> - if (atomic_read(>busy))
> - return -EBUSY;
> -
>   rc = sscanf(buf, "%"__stringify(SEC_CMD_SIZE)"s"
>   " %"__stringify(KEY_ID_SIZE)"s"
>   " %"__stringify(KEY_ID_SIZE)"s",
> @@ -451,23 +448,25 @@ static ssize_t __security_store(struct device *dev, 
> const char *buf, size_t len)
>   } else if (i == OP_DISABLE) {
>   dev_dbg(dev, "disable %u\n", key);
>   rc = nvdimm_security_disable(nvdimm, key);
> - } else if (i == OP_UPDATE) {
> - dev_dbg(dev, "update %u %u\n", key, newkey);
> - rc = nvdimm_security_update(nvdimm, key, newkey, NVDIMM_USER);
> - } else if (i == OP_ERASE) {
> - dev_dbg(dev, "erase %u\n", key);
> - rc = nvdimm_security_erase(nvdimm, key, NVDIMM_USER);
> + } else if (i == OP_UPDATE || i == OP_MASTER_UPDATE) {
> + dev_dbg(dev, "%s %u %u\n", ops[i].name, key, newkey);
> + rc = nvdimm_security_update(nvdimm, key, newkey, i == OP_UPDATE
> + ? NVDIMM_USER : NVDIMM_MASTER);
> + } else if (i == OP_ERASE || i == OP_MASTER_ERASE) {
> + dev_dbg(dev, "%s %u\n", ops[i].name, key);
> + if (atomic_read(>busy)) {
> + dev_dbg(dev, "Unable to secure erase while DIMM 
> active.\n");
> + return -EBUSY;
> + }
> + rc = nvdimm_security_erase(nvdimm, key, i == OP_ERASE
> + ? NVDIMM_USER : NVDIMM_MASTER);
>   } else if (i == OP_OVERWRITE) {
>   dev_dbg(dev, "overwrite %u\n", key);
> + if (atomic_read(>busy)) {
> + dev_dbg(dev, "Unable to overwrite while DIMM 
> active.\n");
> + return -EBUSY;
> + }
>   rc = nvdimm_security_overwrite(nvdimm, key);
> - } else if (i == OP_MASTER_UPDATE) {
> - dev_dbg(dev, "master_update %u %u\n", key, newkey);
> - rc = nvdimm_security_update(nvdimm, key, newkey,
> - NVDIMM_MASTER);
> - } else if (i == OP_MASTER_ERASE) {
> - dev_dbg(dev, "master_erase %u\n", key);
> - rc = nvdimm_security_erase(nvdimm, key,
> - NVDIMM_MASTER);
>   } else
>   return -EINVAL;
>  
> diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
> index 5862d0eee9db..2166e627383a 100644
> --- a/drivers/nvdimm/security.c
> +++ b/drivers/nvdimm/security.c
> @@ -334,11 +334,6 @@ int nvdimm_security_erase(struct nvdimm *nvdimm, 
> unsigned int keyid,
>   || !nvdimm->sec.flags)
>   return -EOPNOTSUPP;
>  
> - if (atomic_read(>busy)) {
> - dev_dbg(dev, "Unable to secure erase while DIMM active.\n");
> - return -EBUSY;
> - }
> -
>   rc = check_security_state(nvdimm);
>   if (rc)
>   return rc;
> @@ -380,11 +375,6 @@ int nvdimm_security_overwrite(struct nvdimm *nvdimm, 
> unsigned int keyid)
>   || !nvdimm->sec.flags)
>   return -EOPNOTSUPP;
>  
> - if (atomic_read(>busy)) {
> - dev_dbg(dev, "Unable to o

[patch] libnvdimm/pfn: Fix namespace creation on misaligned addresses

2019-08-28 Thread Jeff Moyer
Yi reported[1] that after commit a3619190d62e ("libnvdimm/pfn: stop
padding pmem namespaces to section alignment"), it was no longer
possible to create a device dax namespace with a 1G alignment.  The
reason was that the pmem region was not itself 1G-aligned.  The code
happily skips past the first 512M, but fails to account for a now
misaligned end offset (since space was allocated starting at that
misaligned address, and extending for size GBs).  Reintroduce
end_trunc, so that the code correctly handles the misaligned end
address.  This results in the same behavior as before the introduction
of the offending commit.

[1] https://lists.01.org/pipermail/linux-nvdimm/2019-July/022813.html

Fixes: commit a3619190d62e ("libnvdimm/pfn: stop padding pmem namespaces ...")
Reported-and-tested-by: Yi Zhang 
Signed-off-by: Jeff Moyer 

diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 3e7b11c..cb98b8f 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -655,6 +655,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
resource_size_t start, size;
struct nd_region *nd_region;
unsigned long npfns, align;
+   u32 end_trunc;
struct nd_pfn_sb *pfn_sb;
phys_addr_t offset;
const char *sig;
@@ -696,6 +697,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
size = resource_size(>res);
npfns = PHYS_PFN(size - SZ_8K);
align = max(nd_pfn->align, (1UL << SUBSECTION_SHIFT));
+   end_trunc = start + size - ALIGN_DOWN(start + size, align);
if (nd_pfn->mode == PFN_MODE_PMEM) {
/*
 * The altmap should be padded out to the block size used
@@ -714,7 +716,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
return -ENXIO;
}
 
-   npfns = PHYS_PFN(size - offset);
+   npfns = PHYS_PFN(size - offset - end_trunc);
pfn_sb->mode = cpu_to_le32(nd_pfn->mode);
pfn_sb->dataoff = cpu_to_le64(offset);
pfn_sb->npfns = cpu_to_le64(npfns);
@@ -723,6 +725,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
memcpy(pfn_sb->parent_uuid, nd_dev_to_uuid(>dev), 16);
pfn_sb->version_major = cpu_to_le16(1);
pfn_sb->version_minor = cpu_to_le16(3);
+   pfn_sb->end_trunc = cpu_to_le32(end_trunc);
pfn_sb->align = cpu_to_le32(nd_pfn->align);
checksum = nd_sb_checksum((struct nd_gen_sb *) pfn_sb);
pfn_sb->checksum = cpu_to_le64(checksum);
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: create devdax with "-a 1g" failed from 5.3.0-rc1

2019-08-26 Thread Jeff Moyer
Dan, we should probably fix this before 5.3 ships.  Do you have any
concerns with the patch I attached?  If not, I'll submit a proper one.

-Jeff

Jeff Moyer  writes:

> Hi, Yi,
>
> Yi Zhang  writes:
>
>> Hi Dan
>>
>> As subject, I found it failed from bellow commit[1], steps list here
>> [2] and I've attached the full dmesg, let me know if you need more
>> info, thanks.
>>
>> [1]
>> commit a3619190d62ed9d66416891be2416f6bea2b3ca4 (refs/bisect/bad)
>> Author: Dan Williams 
>> Date:   Thu Jul 18 15:58:40 2019 -0700
>>
>> libnvdimm/pfn: stop padding pmem namespaces to section alignment
>>
>> [2]
>> # ./ndctl destroy-namespace all -r all -f
>> destroyed 5 namespaces
>> # ./ndctl create-namespace -r region1 -m devdax -a 1g -s 12G -vv
>> libndctl: ndctl_dax_enable: dax1.0: failed to enable
>>   Error: namespace1.0: failed to enable
>>
>> failed to create namespace: No such device or address
>> # ./ndctl -v
>> 65
>
> Thanks for bisecting.  The problem is that your pmem region is not
> aligned to 1GB.  The current code handles the start offset just fine,
> but does not even consider that the end address might be misaligned.  We
> could bring back end_trunc, and that solves the problem.  Unfortunately,
> it means that if you request a 12GB namespace, for example, you'll only
> get 11GB of usable space.  Note that the old code functioned in this
> manner, too.  A better solution would be to bump the allocation so that
> you get 12GB of usable memory.  I'm not quite sure how to go about that,
> though.  Dan?
>
> Below is a patch that fixes the regression on my end.  Feel free to give
> it a try.
>
> -Jeff
>
> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> index 3e7b11c..cb98b8f 100644
> --- a/drivers/nvdimm/pfn_devs.c
> +++ b/drivers/nvdimm/pfn_devs.c
> @@ -655,6 +655,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
>   resource_size_t start, size;
>   struct nd_region *nd_region;
>   unsigned long npfns, align;
> + u32 end_trunc;
>   struct nd_pfn_sb *pfn_sb;
>   phys_addr_t offset;
>   const char *sig;
> @@ -696,6 +697,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
>   size = resource_size(>res);
>   npfns = PHYS_PFN(size - SZ_8K);
>   align = max(nd_pfn->align, (1UL << SUBSECTION_SHIFT));
> + end_trunc = start + size - ALIGN_DOWN(start + size, align);
>   if (nd_pfn->mode == PFN_MODE_PMEM) {
>   /*
>* The altmap should be padded out to the block size used
> @@ -714,7 +716,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
>   return -ENXIO;
>   }
>  
> - npfns = PHYS_PFN(size - offset);
> + npfns = PHYS_PFN(size - offset - end_trunc);
>   pfn_sb->mode = cpu_to_le32(nd_pfn->mode);
>   pfn_sb->dataoff = cpu_to_le64(offset);
>   pfn_sb->npfns = cpu_to_le64(npfns);
> @@ -723,6 +725,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
>   memcpy(pfn_sb->parent_uuid, nd_dev_to_uuid(>dev), 16);
>   pfn_sb->version_major = cpu_to_le16(1);
>   pfn_sb->version_minor = cpu_to_le16(3);
> + pfn_sb->end_trunc = cpu_to_le32(end_trunc);
>   pfn_sb->align = cpu_to_le32(nd_pfn->align);
>   checksum = nd_sb_checksum((struct nd_gen_sb *) pfn_sb);
>   pfn_sb->checksum = cpu_to_le64(checksum);
> ___
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [ndctl PATCH 2/2] libdaxctl: point to migrate-device-model for dax-class errors

2019-08-22 Thread Jeff Moyer
Vishal Verma  writes:

> When a dax-bus vs. dax-class expectation causes a failure, such as when
> reconfiguring device modes, print an error message directly pointing the
> user to the daxctl-migrate-device-model command.
>
> Reported-by: Dave Hansen 
> Reported-by: Jeff Moyer 
> Cc: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>  daxctl/lib/libdaxctl.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/daxctl/lib/libdaxctl.c b/daxctl/lib/libdaxctl.c
> index 44842b9..c0a859c 100644
> --- a/daxctl/lib/libdaxctl.c
> +++ b/daxctl/lib/libdaxctl.c
> @@ -917,6 +917,7 @@ static int daxctl_dev_enable(struct daxctl_dev *dev, enum 
> daxctl_dev_mode mode)
>  
>   if (!device_model_is_dax_bus(dev)) {
>   err(ctx, "%s: error: device model is dax-class\n", devname);
> + err(ctx, "%s: see man daxctl-migrate-device-model\n", devname);
>   return -EOPNOTSUPP;
>   }
>  
> @@ -962,6 +963,7 @@ DAXCTL_EXPORT int daxctl_dev_disable(struct daxctl_dev 
> *dev)
>  
>   if (!device_model_is_dax_bus(dev)) {
>   err(ctx, "%s: error: device model is dax-class\n", devname);
> + err(ctx, "%s: see man daxctl-migrate-device-model\n", devname);
>   return -EOPNOTSUPP;
>   }

Reviewed-by: Jeff Moyer 

Thanks, Vishal!
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 2/2] drivers/dax/kmem: give a warning if CONFIG_DEV_DAX_PMEM_COMPAT is enabled

2019-08-22 Thread Jeff Moyer
Jia He  writes:

> commit c221c0b0308f ("device-dax: "Hotplug" persistent memory for use
> like normal RAM") helps to add persistent memory as normal RAM blocks.
> But this driver doesn't work if CONFIG_DEV_DAX_PMEM_COMPAT is enabled.
>
> Here is the debugging call trace when CONFIG_DEV_DAX_PMEM_COMPAT is
> enabled.
> [4.443730]  devm_memremap_pages+0x4b9/0x540
> [4.443733]  dev_dax_probe+0x112/0x220 [device_dax]
> [4.443735]  dax_pmem_compat_probe+0x58/0x92 [dax_pmem_compat]
> [4.443737]  nvdimm_bus_probe+0x6b/0x150
> [4.443739]  really_probe+0xf5/0x3d0
> [4.443740]  driver_probe_device+0x11b/0x130
> [4.443741]  device_driver_attach+0x58/0x60
> [4.443742]  __driver_attach+0xa3/0x140
>
> Then the dax0.0 device will be registered as "nd" bus instead of
> "dax" bus. This causes the error as follows:
> root@ubuntu:~# echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> -bash: echo: write error: No such device
>
> This gives a warning to notify the user.
>
> Signed-off-by: Jia He 
> ---
>  drivers/dax/kmem.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index ad62d551d94e..b77f0e880598 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -93,6 +93,11 @@ static struct dax_device_driver device_dax_kmem_driver = {
>  
>  static int __init dax_kmem_init(void)
>  {
> + if (IS_ENABLED(CONFIG_DEV_DAX_PMEM_COMPAT)) {
> + pr_warn("CONFIG_DEV_DAX_PMEM_COMPAT is not compatible\n");
> + pr_warn("kmem dax driver might not be workable\n");
> + }
> +
>   return dax_driver_register(_dax_kmem_driver);
>  }

This logic is wrong (and the error message is *very* confusing).  You
can have the driver configured, but not loaded.  In that case, the kmem
driver will load and work properly.

When using daxctl to online memory, you already get the following
message:

libdaxctl: daxctl_dev_disable: dax0.0: error: device model is dax-class

That's still not very helpful.  It would be better if the message
suggested a fix (like using migrate-device-model).  Vishal?

Cheers,
Jeff
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 2/3] libnvdimm/security: Tighten scope of nvdimm->busy vs security operations

2019-08-19 Thread Jeff Moyer
Dan Williams  writes:

> On Fri, Aug 16, 2019 at 1:49 PM Jeff Moyer  wrote:
>>
>> Dan Williams  writes:
>>
>> > The blanket blocking of all security operations while the DIMM is in
>> > active use in a region is too restrictive. The only security operations
>> > that need to be aware of the ->busy state are those that mutate the
>> > state of data, i.e. erase and overwrite.
>> >
>> > Refactor the ->busy checks to be applied at the entry common entry point
>> > in __security_store() rather than each of the helper routines.
>>
>> I'm not sure this buys you much.  Did you test this on actual hardware
>> to make sure your assumptions are correct?  I guess the worst case is we
>> get an "invalid security state" error back from the firmware
>>
>> Still, what's the motivation for this?
>
> The motivation was when I went to test setting the frozen state and
> found that it complained about the DIMM being active. There's nothing
> wrong with freezing security while the DIMM is mapped. ...but then I
> somehow managed to write this generalized commit message that left out
> the explicit failure I was worried about. Yes, moved too fast, but the
> motivation is "allow freeze while active" and centralize the ->busy
> check so it's just one function to review that common constraint.

OK, thanks for the info.

Reviewed-by: Jeff Moyer 
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 3/3] libnvdimm/security: Consolidate 'security' operations

2019-08-16 Thread Jeff Moyer
Dan Williams  writes:

> The security operations are exported from libnvdimm/security.c to
> libnvdimm/dimm_devs.c, and libnvdimm/security.c is optionally compiled
> based on the CONFIG_NVDIMM_KEYS config symbol.
>
> Rather than export the operations across compile objects, just move the
> __security_store() entry point to live with the helpers.
>
> Cc: Dave Jiang 
> Signed-off-by: Dan Williams 

Fine by me.

Acked-by: Jeff Moyer 

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 2/3] libnvdimm/security: Tighten scope of nvdimm->busy vs security operations

2019-08-16 Thread Jeff Moyer
Dan Williams  writes:

> The blanket blocking of all security operations while the DIMM is in
> active use in a region is too restrictive. The only security operations
> that need to be aware of the ->busy state are those that mutate the
> state of data, i.e. erase and overwrite.
>
> Refactor the ->busy checks to be applied at the entry common entry point
> in __security_store() rather than each of the helper routines.

I'm not sure this buys you much.  Did you test this on actual hardware
to make sure your assumptions are correct?  I guess the worst case is we
get an "invalid security state" error back from the firmware

Still, what's the motivation for this?

-Jeff

>
> Cc: Dave Jiang 
> Signed-off-by: Dan Williams 
> ---
>  drivers/nvdimm/dimm_devs.c |   33 -
>  drivers/nvdimm/security.c  |   10 --
>  2 files changed, 16 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
> index 53330625fe07..d837cb9be83d 100644
> --- a/drivers/nvdimm/dimm_devs.c
> +++ b/drivers/nvdimm/dimm_devs.c
> @@ -424,9 +424,6 @@ static ssize_t __security_store(struct device *dev, const 
> char *buf, size_t len)
>   unsigned int key, newkey;
>   int i;
>  
> - if (atomic_read(>busy))
> - return -EBUSY;
> -
>   rc = sscanf(buf, "%"__stringify(SEC_CMD_SIZE)"s"
>   " %"__stringify(KEY_ID_SIZE)"s"
>   " %"__stringify(KEY_ID_SIZE)"s",
> @@ -451,23 +448,25 @@ static ssize_t __security_store(struct device *dev, 
> const char *buf, size_t len)
>   } else if (i == OP_DISABLE) {
>   dev_dbg(dev, "disable %u\n", key);
>   rc = nvdimm_security_disable(nvdimm, key);
> - } else if (i == OP_UPDATE) {
> - dev_dbg(dev, "update %u %u\n", key, newkey);
> - rc = nvdimm_security_update(nvdimm, key, newkey, NVDIMM_USER);
> - } else if (i == OP_ERASE) {
> - dev_dbg(dev, "erase %u\n", key);
> - rc = nvdimm_security_erase(nvdimm, key, NVDIMM_USER);
> + } else if (i == OP_UPDATE || i == OP_MASTER_UPDATE) {
> + dev_dbg(dev, "%s %u %u\n", ops[i].name, key, newkey);
> + rc = nvdimm_security_update(nvdimm, key, newkey, i == OP_UPDATE
> + ? NVDIMM_USER : NVDIMM_MASTER);
> + } else if (i == OP_ERASE || i == OP_MASTER_ERASE) {
> + dev_dbg(dev, "%s %u\n", ops[i].name, key);
> + if (atomic_read(>busy)) {
> + dev_dbg(dev, "Unable to secure erase while DIMM 
> active.\n");
> + return -EBUSY;
> + }
> + rc = nvdimm_security_erase(nvdimm, key, i == OP_ERASE
> + ? NVDIMM_USER : NVDIMM_MASTER);
>   } else if (i == OP_OVERWRITE) {
>   dev_dbg(dev, "overwrite %u\n", key);
> + if (atomic_read(>busy)) {
> + dev_dbg(dev, "Unable to overwrite while DIMM 
> active.\n");
> + return -EBUSY;
> + }
>   rc = nvdimm_security_overwrite(nvdimm, key);
> - } else if (i == OP_MASTER_UPDATE) {
> - dev_dbg(dev, "master_update %u %u\n", key, newkey);
> - rc = nvdimm_security_update(nvdimm, key, newkey,
> - NVDIMM_MASTER);
> - } else if (i == OP_MASTER_ERASE) {
> - dev_dbg(dev, "master_erase %u\n", key);
> - rc = nvdimm_security_erase(nvdimm, key,
> - NVDIMM_MASTER);
>   } else
>   return -EINVAL;
>  
> diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
> index 5862d0eee9db..2166e627383a 100644
> --- a/drivers/nvdimm/security.c
> +++ b/drivers/nvdimm/security.c
> @@ -334,11 +334,6 @@ int nvdimm_security_erase(struct nvdimm *nvdimm, 
> unsigned int keyid,
>   || !nvdimm->sec.flags)
>   return -EOPNOTSUPP;
>  
> - if (atomic_read(>busy)) {
> - dev_dbg(dev, "Unable to secure erase while DIMM active.\n");
> - return -EBUSY;
> - }
> -
>   rc = check_security_state(nvdimm);
>   if (rc)
>   return rc;
> @@ -380,11 +375,6 @@ int nvdimm_security_overwrite(struct nvdimm *nvdimm, 
> unsigned int keyid)
>   || !nvdimm->sec.flags)
>   return -EOPNOTSUPP;
>  
> - if (atomic_read(>busy)) {
> - dev_dbg(dev, "Unable to overwrite while DIMM active.\n");
> - return -EBUSY;
> - }
> -
>   if (dev->driver == NULL) {
>   dev_dbg(dev, "Unable to overwrite while DIMM active.\n");
>   return -EINVAL;
>
> ___
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org

Re: [PATCH 1/3] libnvdimm/security: Introduce a 'frozen' attribute

2019-08-16 Thread Jeff Moyer
Dan Williams  writes:

> In the process of debugging a system with an NVDIMM that was failing to
> unlock it was found that the kernel is reporting 'locked' while the DIMM
> security interface is 'frozen'. Unfortunately the security state is
> tracked internally as an enum which prevents it from communicating the
> difference between 'locked' and 'locked + frozen'. It follows that the
> enum also prevents the kernel from communicating 'unlocked + frozen'
> which would be useful for debugging why security operations like 'change
> passphrase' are disabled.
>
> Ditch the security state enum for a set of flags and introduce a new
> sysfs attribute explicitly for the 'frozen' state. The regression risk
> is low because the 'frozen' state was already blocked behind the
> 'locked' state, but will need to revisit if there were cases where
> applications need 'frozen' to show up in the primary 'security'
> attribute. The expectation is that communicating 'frozen' is mostly a
> helper for debug and status monitoring.
>
> Cc: Dave Jiang 
> Reported-by: Jeff Moyer 
> Signed-off-by: Dan Williams 
> ---
>  drivers/acpi/nfit/intel.c|   65 ++---
>  drivers/nvdimm/bus.c |2 -
>  drivers/nvdimm/dimm_devs.c   |   59 +--
>  drivers/nvdimm/nd-core.h |   21 ++--
>  drivers/nvdimm/security.c|   99 
> ++
>  include/linux/libnvdimm.h|9 ++-
>  tools/testing/nvdimm/dimm_devs.c |   19 ++-
>  7 files changed, 146 insertions(+), 128 deletions(-)
>
> diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c
> index cddd0fcf622c..2c51ca4155dc 100644
> --- a/drivers/acpi/nfit/intel.c
> +++ b/drivers/acpi/nfit/intel.c
> @@ -7,10 +7,11 @@
>  #include "intel.h"
>  #include "nfit.h"
>  
> -static enum nvdimm_security_state intel_security_state(struct nvdimm *nvdimm,
> +static unsigned long intel_security_flags(struct nvdimm *nvdimm,
>   enum nvdimm_passphrase_type ptype)
>  {
>   struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
> + unsigned long security_flags = 0;
>   struct {
>   struct nd_cmd_pkg pkg;
>   struct nd_intel_get_security_state cmd;
> @@ -27,46 +28,54 @@ static enum nvdimm_security_state 
> intel_security_state(struct nvdimm *nvdimm,
>   int rc;
>  
>   if (!test_bit(NVDIMM_INTEL_GET_SECURITY_STATE, _mem->dsm_mask))
> - return -ENXIO;
> + return 0;
>  
>   /*
>* Short circuit the state retrieval while we are doing overwrite.
>* The DSM spec states that the security state is indeterminate
>* until the overwrite DSM completes.
>*/
> - if (nvdimm_in_overwrite(nvdimm) && ptype == NVDIMM_USER)
> - return NVDIMM_SECURITY_OVERWRITE;
> + if (nvdimm_in_overwrite(nvdimm) && ptype == NVDIMM_USER) {
> + set_bit(NVDIMM_SECURITY_OVERWRITE, _flags);
> + return security_flags;
> + }

Why not just
return BIT(NVDIMM_SECURITY_OVERWRITE);
?


>   rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, _cmd, sizeof(nd_cmd), NULL);
> - if (rc < 0)
> - return rc;
> - if (nd_cmd.cmd.status)
> - return -EIO;
> + if (rc < 0 || nd_cmd.cmd.status) {
> + pr_err("%s: security state retrieval failed (%d:%#x)\n",
> + nvdimm_name(nvdimm), rc, nd_cmd.cmd.status);
> + return 0;
> + }
>  
>   /* check and see if security is enabled and locked */
>   if (ptype == NVDIMM_MASTER) {
>   if (nd_cmd.cmd.extended_state & ND_INTEL_SEC_ESTATE_ENABLED)
> - return NVDIMM_SECURITY_UNLOCKED;
> - else if (nd_cmd.cmd.extended_state &
> - ND_INTEL_SEC_ESTATE_PLIMIT)
> - return NVDIMM_SECURITY_FROZEN;
> - } else {
> - if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_UNSUPPORTED)
> - return -ENXIO;
> - else if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_ENABLED) {
> - if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_LOCKED)
> - return NVDIMM_SECURITY_LOCKED;
> - else if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_FROZEN
> - || nd_cmd.cmd.state &
> - ND_INTEL_SEC_STATE_PLIMIT)
> - return NVDIMM_SECURITY_FROZEN;
> - else
> - return NVDIMM_SECURITY_UNLOCKED;
> - }

Re: create devdax with "-a 1g" failed from 5.3.0-rc1

2019-08-16 Thread Jeff Moyer
Hi, Yi,

Yi Zhang  writes:

> Hi Dan
>
> As subject, I found it failed from bellow commit[1], steps list here
> [2] and I've attached the full dmesg, let me know if you need more
> info, thanks.
>
> [1]
> commit a3619190d62ed9d66416891be2416f6bea2b3ca4 (refs/bisect/bad)
> Author: Dan Williams 
> Date:   Thu Jul 18 15:58:40 2019 -0700
>
> libnvdimm/pfn: stop padding pmem namespaces to section alignment
>
> [2]
> # ./ndctl destroy-namespace all -r all -f
> destroyed 5 namespaces
> # ./ndctl create-namespace -r region1 -m devdax -a 1g -s 12G -vv
> libndctl: ndctl_dax_enable: dax1.0: failed to enable
>   Error: namespace1.0: failed to enable
>
> failed to create namespace: No such device or address
> # ./ndctl -v
> 65

Thanks for bisecting.  The problem is that your pmem region is not
aligned to 1GB.  The current code handles the start offset just fine,
but does not even consider that the end address might be misaligned.  We
could bring back end_trunc, and that solves the problem.  Unfortunately,
it means that if you request a 12GB namespace, for example, you'll only
get 11GB of usable space.  Note that the old code functioned in this
manner, too.  A better solution would be to bump the allocation so that
you get 12GB of usable memory.  I'm not quite sure how to go about that,
though.  Dan?

Below is a patch that fixes the regression on my end.  Feel free to give
it a try.

-Jeff

diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 3e7b11c..cb98b8f 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -655,6 +655,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
resource_size_t start, size;
struct nd_region *nd_region;
unsigned long npfns, align;
+   u32 end_trunc;
struct nd_pfn_sb *pfn_sb;
phys_addr_t offset;
const char *sig;
@@ -696,6 +697,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
size = resource_size(>res);
npfns = PHYS_PFN(size - SZ_8K);
align = max(nd_pfn->align, (1UL << SUBSECTION_SHIFT));
+   end_trunc = start + size - ALIGN_DOWN(start + size, align);
if (nd_pfn->mode == PFN_MODE_PMEM) {
/*
 * The altmap should be padded out to the block size used
@@ -714,7 +716,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
return -ENXIO;
}
 
-   npfns = PHYS_PFN(size - offset);
+   npfns = PHYS_PFN(size - offset - end_trunc);
pfn_sb->mode = cpu_to_le32(nd_pfn->mode);
pfn_sb->dataoff = cpu_to_le64(offset);
pfn_sb->npfns = cpu_to_le64(npfns);
@@ -723,6 +725,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
memcpy(pfn_sb->parent_uuid, nd_dev_to_uuid(>dev), 16);
pfn_sb->version_major = cpu_to_le16(1);
pfn_sb->version_minor = cpu_to_le16(3);
+   pfn_sb->end_trunc = cpu_to_le32(end_trunc);
pfn_sb->align = cpu_to_le32(nd_pfn->align);
checksum = nd_sb_checksum((struct nd_gen_sb *) pfn_sb);
pfn_sb->checksum = cpu_to_le64(checksum);
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v3] ndctl, check: Ensure mmap of BTT sections work with 64K page-sizes

2019-08-13 Thread Jeff Moyer
Vaibhav Jain  writes:

> Thanks for reviewing this patch Jeff.
>
> Jeff Moyer  writes:
>
>> Vaibhav Jain  writes:
>>
>>> On PPC64 which uses a 64K page-size, ndtl-check command fails on a BTT
>>> namespace with following error:
>>>
>>> $ sudo ndctl check-namespace namespace0.0 -vv
>>>   namespace0.0: namespace_check: checking namespace0.0
>>>   namespace0.0: btt_discover_arenas: found 1 BTT arena
>>>   namespace0.0: btt_create_mappings: mmap arena[0].info [sz = 0x1000, off = 
>>> 0x1000] failed: Invalid argument
>>>   error checking namespaces: Invalid argument
>>>   checked 0 namespaces
>>>
>>> An error happens when btt_create_mappings() tries to mmap the sections
>>> of the BTT device which are usually 4K offset aligned. However the
>>> mmap() syscall expects the 'offset' argument to be in multiples of
>>> page-size, hence it returns EINVAL causing the btt_create_mappings()
>>> to error out.
>>>
>>> As a fix for the issue this patch proposes addition of two new
>>> functions to 'check.c' namely btt_mmap/btt_unmap that can be used to
>>> map/unmap parts of BTT device to ndctl process address-space. The
>>> functions tweaks the requested 'offset' argument to mmap() ensuring
>>> that its page-size aligned and then fix-ups the returned pointer such
>>> that it points to the requested offset within mmapped region.
>>>
>>> With these newly introduced functions the patch updates the call-sites
>>> in 'check.c' to use these functions instead of mmap/unmap syscalls.
>>> Also since btt_mmap returns NULL if mmap operation fails, the
>>> error check call-sites can be made against NULL instead of MAP_FAILED.
>>>
>>> Reported-by: Harish Sriram 
>>> Signed-off-by: Vaibhav Jain 
>>> ---
>>> Changelog:
>>> v3:
>>> * Fixed a log string that was splitted across multiple lines [Vishal]
>>>
>>> v2:
>>> * Updated patch description to include canonical email address of bug
>>>   reporter. [Vishal]
>>> * Improved the comment describing function btt_mmap() in 'check.c'
>>>   [Vishal]
>>> * Simplified the argument list for btt_mmap() to just accept bttc,
>>>   length and offset. Other arguments for mmap() are derived from these
>>>   args. [Vishal]
>>> * Renamed 'shift' variable in btt_mmap()/unmap() to 'page_offset'
>>>   [Vishal]
>>> * Improved the dbg() messages logged inside
>>>   btt_mmap()/unmap(). [Vishal]
>>> * Return NULL from btt_mmap() in case of an error. [Vishal]
>>> * Changed the all sites to btt_mmap() to perform error check against
>>>   NULL return value instead of MAP_FAILED. [Vishal]
>>> ---
>>>  ndctl/check.c | 93 +--
>>>  1 file changed, 67 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/ndctl/check.c b/ndctl/check.c
>>> index 8a7125053865..5969012eba84 100644
>>> --- a/ndctl/check.c
>>> +++ b/ndctl/check.c
>>> @@ -907,59 +907,100 @@ static int btt_discover_arenas(struct btt_chk *bttc)
>>> return ret;
>>>  }
>>>  
>>> -static int btt_create_mappings(struct btt_chk *bttc)
>>> +/*
>>> + * Wrap call to mmap(2) to work with btt device offsets that are not 
>>> aligned
>>> + * to system page boundary. It works by rounding down the requested offset
>>> + * to sys_page_size when calling mmap(2) and then returning a fixed-up 
>>> pointer
>>> + * to the correct offset in the mmaped region.
>>> + */
>>> +static void *btt_mmap(struct btt_chk *bttc, size_t length, off_t offset)
>>>  {
>>> -   struct arena_info *a;
>>> -   int mmap_flags;
>>> -   int i;
>>> +   off_t page_offset;
>>> +   int prot_flags;
>>> +   uint8_t *addr;
>>>  
>>> if (!bttc->opts->repair)
>>> -   mmap_flags = PROT_READ;
>>> +   prot_flags = PROT_READ;
>>> else
>>> -   mmap_flags = PROT_READ|PROT_WRITE;
>>> +   prot_flags = PROT_READ|PROT_WRITE;
>>> +
>>> +   /* Calculate the page_offset from the system page boundary */
>>> +   page_offset = offset - rounddown(offset, bttc->sys_page_size);
>>> +
>>> +   /* Update the offset and length with the page_offset calculated above */
>>> +   offset -= page_offset;
>>> +   length += page_offset;
>>
>> Don't you have to ensure that the 

Re: [PATCH v3] ndctl, check: Ensure mmap of BTT sections work with 64K page-sizes

2019-08-12 Thread Jeff Moyer
Vaibhav Jain  writes:

> On PPC64 which uses a 64K page-size, ndtl-check command fails on a BTT
> namespace with following error:
>
> $ sudo ndctl check-namespace namespace0.0 -vv
>   namespace0.0: namespace_check: checking namespace0.0
>   namespace0.0: btt_discover_arenas: found 1 BTT arena
>   namespace0.0: btt_create_mappings: mmap arena[0].info [sz = 0x1000, off = 
> 0x1000] failed: Invalid argument
>   error checking namespaces: Invalid argument
>   checked 0 namespaces
>
> An error happens when btt_create_mappings() tries to mmap the sections
> of the BTT device which are usually 4K offset aligned. However the
> mmap() syscall expects the 'offset' argument to be in multiples of
> page-size, hence it returns EINVAL causing the btt_create_mappings()
> to error out.
>
> As a fix for the issue this patch proposes addition of two new
> functions to 'check.c' namely btt_mmap/btt_unmap that can be used to
> map/unmap parts of BTT device to ndctl process address-space. The
> functions tweaks the requested 'offset' argument to mmap() ensuring
> that its page-size aligned and then fix-ups the returned pointer such
> that it points to the requested offset within mmapped region.
>
> With these newly introduced functions the patch updates the call-sites
> in 'check.c' to use these functions instead of mmap/unmap syscalls.
> Also since btt_mmap returns NULL if mmap operation fails, the
> error check call-sites can be made against NULL instead of MAP_FAILED.
>
> Reported-by: Harish Sriram 
> Signed-off-by: Vaibhav Jain 
> ---
> Changelog:
> v3:
> * Fixed a log string that was splitted across multiple lines [Vishal]
>
> v2:
> * Updated patch description to include canonical email address of bug
>   reporter. [Vishal]
> * Improved the comment describing function btt_mmap() in 'check.c'
>   [Vishal]
> * Simplified the argument list for btt_mmap() to just accept bttc,
>   length and offset. Other arguments for mmap() are derived from these
>   args. [Vishal]
> * Renamed 'shift' variable in btt_mmap()/unmap() to 'page_offset'
>   [Vishal]
> * Improved the dbg() messages logged inside
>   btt_mmap()/unmap(). [Vishal]
> * Return NULL from btt_mmap() in case of an error. [Vishal]
> * Changed the all sites to btt_mmap() to perform error check against
>   NULL return value instead of MAP_FAILED. [Vishal]
> ---
>  ndctl/check.c | 93 +--
>  1 file changed, 67 insertions(+), 26 deletions(-)
>
> diff --git a/ndctl/check.c b/ndctl/check.c
> index 8a7125053865..5969012eba84 100644
> --- a/ndctl/check.c
> +++ b/ndctl/check.c
> @@ -907,59 +907,100 @@ static int btt_discover_arenas(struct btt_chk *bttc)
>   return ret;
>  }
>  
> -static int btt_create_mappings(struct btt_chk *bttc)
> +/*
> + * Wrap call to mmap(2) to work with btt device offsets that are not aligned
> + * to system page boundary. It works by rounding down the requested offset
> + * to sys_page_size when calling mmap(2) and then returning a fixed-up 
> pointer
> + * to the correct offset in the mmaped region.
> + */
> +static void *btt_mmap(struct btt_chk *bttc, size_t length, off_t offset)
>  {
> - struct arena_info *a;
> - int mmap_flags;
> - int i;
> + off_t page_offset;
> + int prot_flags;
> + uint8_t *addr;
>  
>   if (!bttc->opts->repair)
> - mmap_flags = PROT_READ;
> + prot_flags = PROT_READ;
>   else
> - mmap_flags = PROT_READ|PROT_WRITE;
> + prot_flags = PROT_READ|PROT_WRITE;
> +
> + /* Calculate the page_offset from the system page boundary */
> + page_offset = offset - rounddown(offset, bttc->sys_page_size);
> +
> + /* Update the offset and length with the page_offset calculated above */
> + offset -= page_offset;
> + length += page_offset;

Don't you have to ensure that the length is also a multiple of the
system page size?

-Jeff

> +
> + addr = mmap(NULL, length, prot_flags, MAP_SHARED, bttc->fd, offset);
> +
> + /* If needed fixup the return pointer to correct offset requested */
> + if (addr != MAP_FAILED)
> + addr += page_offset;
> +
> + dbg(bttc, "addr = %p, length = %#lx, offset = %#lx, page_offset = 
> %#lx\n",
> + (void *) addr, length, offset, page_offset);
> +
> + return addr == MAP_FAILED ? NULL : addr;
> +}
> +
> +static void btt_unmap(struct btt_chk *bttc, void *ptr, size_t length)
> +{
> + off_t page_offset;
> + uintptr_t addr = (uintptr_t) ptr;
> +
> + /* Calculate the page_offset from system page boundary */
> + page_offset = addr - rounddown(addr, bttc->sys_page_size);
> +
> + addr -= page_offset;
> + length += page_offset;
> +
> + munmap((void *) addr, length);
> + dbg(bttc, "addr = %p, length = %#lx, page_offset = %#lx\n",
> + (void *) addr, length, page_offset);
> +}
> +
> +static int btt_create_mappings(struct btt_chk *bttc)
> +{
> + struct arena_info *a;
> + int i;
>  
>   for (i = 0; i < 

Re: [ndctl PATCH] daxctl/test: Skip daxctl-devices.sh on older kernels

2019-08-12 Thread Jeff Moyer
Dan Williams  writes:

> On Mon, Aug 12, 2019 at 1:45 PM Jeff Moyer  wrote:
>>
>> Dan Williams  writes:
>>
>> > If the 'kmem' module is missing skip the test to support running the
>> > unit tests on older -stable kernels pre-v5.1.
>> >
>> > Signed-off-by: Dan Williams 
>>
>> Note that the kernel module could also just not be configured.
>
> Yes, but then the test would still report "SKIP" and the developer
> could take a deeper look if they expected to see all "PASS".

Right, I see this as a good thing.  I meant to point out that your
commit message was a bit misleading.

>> > ---
>> >  test/daxctl-devices.sh |7 +++
>> >  1 file changed, 7 insertions(+)
>> >
>> > diff --git a/test/daxctl-devices.sh b/test/daxctl-devices.sh
>> > index 04f53f7b13ab..4102fb6990ae 100755
>> > --- a/test/daxctl-devices.sh
>> > +++ b/test/daxctl-devices.sh
>> > @@ -18,6 +18,13 @@ find_testdev()
>> >  {
>> >   local rc=77
>> >
>> > + # The kmem driver is needed to change the device mode, only
>> > + # kernels >= v5.1 might have it available. Skip if not.
>> > + if ! modinfo kmem; then
>> > + "Unable to find kmem module"
>>
>> I think you need a printf, there.
>
> Whoops, yes.
>
>> Also, do you want the modinfo output in the test log?
>
> Don't think it matters either way since it all ends up in the log.

It's tough to say whether it adds value, so sure, leave it in.

Cheers,
Jeff
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH] mm/memremap: Fix reuse of pgmap instances with internal references

2019-08-12 Thread Jeff Moyer
Dan Williams  writes:

> On Mon, Aug 12, 2019 at 8:51 AM Jeff Moyer  wrote:
>>
>> Dan Williams  writes:
>>
>> > Currently, attempts to shutdown and re-enable a device-dax instance
>> > trigger:
>>
>> What does "shutdown and re-enable" translate to?  If I disable and
>> re-enable a device-dax namespace, I don't see this behavior.
>
> I was not seeing this either until I made sure I was in 'bus" device model 
> mode.
>
> # cat /etc/modprobe.d/daxctl.conf
> blacklist dax_pmem_compat
> alias nd:t7* dax_pmem
>
> # make TESTS="daxctl-devices.sh" check -j 40 2>out
>
> # dmesg | grep WARN.*devm
> [  225.588651] WARNING: CPU: 10 PID: 9103 at mm/memremap.c:211
> devm_memremap_pages+0x234/0x850
> [  225.679828] WARNING: CPU: 10 PID: 9103 at mm/memremap.c:211
> devm_memremap_pages+0x234/0x850

Ah, you see this when reconfiguring the device.  So, the lifetime of the
pgmap is tied to the character device, which doesn't get torn down.  The
fix looks good to me, and tests out fine.

Reviewed-by: Jeff Moyer 
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [ndctl PATCH] daxctl/test: Skip daxctl-devices.sh on older kernels

2019-08-12 Thread Jeff Moyer
Dan Williams  writes:

> If the 'kmem' module is missing skip the test to support running the
> unit tests on older -stable kernels pre-v5.1.
>
> Signed-off-by: Dan Williams 

Note that the kernel module could also just not be configured.

> ---
>  test/daxctl-devices.sh |7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/test/daxctl-devices.sh b/test/daxctl-devices.sh
> index 04f53f7b13ab..4102fb6990ae 100755
> --- a/test/daxctl-devices.sh
> +++ b/test/daxctl-devices.sh
> @@ -18,6 +18,13 @@ find_testdev()
>  {
>   local rc=77
>  
> + # The kmem driver is needed to change the device mode, only
> + # kernels >= v5.1 might have it available. Skip if not.
> + if ! modinfo kmem; then
> + "Unable to find kmem module"

I think you need a printf, there.  Also, do you want the modinfo output
in the test log?

-Jeff

> + exit $rc
> + fi
> +
>   # find a victim device
>   testbus="$ACPI_BUS"
>   testdev=$("$NDCTL" list -b "$testbus" -Ni | jq -er '.[0].dev | .//""')
>
> ___
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH] mm/memremap: Fix reuse of pgmap instances with internal references

2019-08-12 Thread Jeff Moyer
Dan Williams  writes:

> Currently, attempts to shutdown and re-enable a device-dax instance
> trigger:

What does "shutdown and re-enable" translate to?  If I disable and
re-enable a device-dax namespace, I don't see this behavior.

-Jeff

>
> Missing reference count teardown definition
> WARNING: CPU: 37 PID: 1608 at mm/memremap.c:211 
> devm_memremap_pages+0x234/0x850
> [..]
> RIP: 0010:devm_memremap_pages+0x234/0x850
> [..]
> Call Trace:
>  dev_dax_probe+0x66/0x190 [device_dax]
>  really_probe+0xef/0x390
>  driver_probe_device+0xb4/0x100
>  device_driver_attach+0x4f/0x60
>
> Given that the setup path initializes pgmap->ref, arrange for it to be
> also torn down so devm_memremap_pages() is ready to be called again and
> not be mistaken for the 3rd-party per-cpu-ref case.
>
> Fixes: 24917f6b1041 ("memremap: provide an optional internal refcount in 
> struct dev_pagemap")
> Reported-by: Fan Du 
> Tested-by: Vishal Verma 
> Cc: Andrew Morton 
> Cc: Christoph Hellwig 
> Cc: Ira Weiny 
> Cc: Jason Gunthorpe 
> Signed-off-by: Dan Williams 
> ---
>
> Andrew, I have another dax fix pending, so I'm ok to take this through
> the nvdimm tree, holler if you want it in -mm.
>
>  mm/memremap.c |6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 6ee03a816d67..86432650f829 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -91,6 +91,12 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>   wait_for_completion(>done);
>   percpu_ref_exit(pgmap->ref);
>   }
> + /*
> +  * Undo the pgmap ref assignment for the internal case as the
> +  * caller may re-enable the same pgmap.
> +  */
> + if (pgmap->ref == >internal_ref)
> + pgmap->ref = NULL;
>  }
>  
>  static void devm_memremap_pages_release(void *data)
>
> ___
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [ndctl PATCH] ndctl/test: Add xfs reflink dependency

2019-08-12 Thread Jeff Moyer
Dan Williams  writes:

> Starting with xfsprogs version 5.1.0 it will enable reflink by default.
> Any scripts, like ndctl unit tests, that were doing:
>
> mkfs.xfs $pmem; mount -o dax $pmem $mnt
>
> ...must now do:
>
> mkfs.xfs -m reflink=0 $pmem; mount -o dax $pmem $mnt

Agreed.  In the future, the options may not be mutually exclusive, but I
don't see any harm in always testing with reflink=0 for the existing
tests.

Acked-by: Jeff Moyer 

>
> Cc: Jeff Moyer 
> Signed-off-by: Dan Williams 
> ---
>  test/dax.sh  |4 ++--
>  test/mmap.sh |2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/test/dax.sh b/test/dax.sh
> index e703e1222dee..3bb44ac0a26c 100755
> --- a/test/dax.sh
> +++ b/test/dax.sh
> @@ -69,7 +69,7 @@ json=$($NDCTL create-namespace -m raw -f -e $dev)
>  eval $(json2var <<< "$json")
>  [ $mode != "fsdax" ] && echo "fail: $LINENO" &&  exit 1
>  
> -mkfs.xfs -f /dev/$blockdev
> +mkfs.xfs -f /dev/$blockdev -m reflink=0
>  mount /dev/$blockdev $MNT -o dax
>  fallocate -l 1GiB $MNT/$FILE
>  run_test
> @@ -80,7 +80,7 @@ json=$($NDCTL create-namespace -m fsdax -M dev -f -e $dev)
>  eval $(json2var <<< "$json")
>  [ $mode != "fsdax" ] && echo "fail: $LINENO" &&  exit 1
>  
> -mkfs.xfs -f /dev/$blockdev
> +mkfs.xfs -f /dev/$blockdev -m reflink=0
>  mount /dev/$blockdev $MNT -o dax
>  fallocate -l 1GiB $MNT/$FILE
>  run_test
> diff --git a/test/mmap.sh b/test/mmap.sh
> index afe50fd2199b..d072ea289f31 100755
> --- a/test/mmap.sh
> +++ b/test/mmap.sh
> @@ -70,7 +70,7 @@ fallocate -l 1GiB $MNT/$FILE
>  test_mmap
>  umount $MNT
>  
> -mkfs.xfs -f $DEV
> +mkfs.xfs -f $DEV -m reflink=0
>  mount $DEV $MNT -o dax
>  fallocate -l 1GiB $MNT/$FILE
>  test_mmap
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [patch] nfit: report frozen security state

2019-08-08 Thread Jeff Moyer
Dan Williams  writes:

> ...but ABIs are forever, and ndctl has shipped:
>
> if (strcmp(buf, "disabled") == 0)
> return NDCTL_SECURITY_DISABLED;
> else if (strcmp(buf, "unlocked") == 0)
> return NDCTL_SECURITY_UNLOCKED;
> else if (strcmp(buf, "locked") == 0)
> return NDCTL_SECURITY_LOCKED;
> else if (strcmp(buf, "frozen") == 0)
> return NDCTL_SECURITY_FROZEN;
> else if (strcmp(buf, "overwrite") == 0)
> return NDCTL_SECURITY_OVERWRITE;
> return NDCTL_SECURITY_INVALID;
>
>
> I think we could break out the frozen state to its own new attribute
> and leave security state to only show "locked" / "unlocked". Then go
> teach new ndctl to go somewhere else to report frozen.

That works for me.

-Jeff
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [ndctl PATCH v2] Documentation/namespace-description: Clarify label-less restrictions

2019-08-08 Thread Jeff Moyer
Vishal Verma  writes:

> In the ndctl-create-namespace (and related) man pages, add a
> clarification note regarding some of the restrictions a user may see
> when operating on label-less namespaces.
>
> Link: https://github.com/pmem/ndctl/issues/52
> Reported-by: Jane Chu 
> Cc: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>
> v2:
> - Remove the part about an address abstraction mechanism; It didn't add
>   any value (Jeff)
> - Add an additional sentence about space reclamation semantics (Dan)

LGTM

Reviewed-by: Jeff Moyer 

-Jeff

>
>  Documentation/ndctl/namespace-description.txt | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/Documentation/ndctl/namespace-description.txt 
> b/Documentation/ndctl/namespace-description.txt
> index 94999e5..c59fbef 100644
> --- a/Documentation/ndctl/namespace-description.txt
> +++ b/Documentation/ndctl/namespace-description.txt
> @@ -18,6 +18,15 @@ the kernel's 'memmap=ss!nn' command line option (see the 
> nvdimm wiki on
>  kernel.org), or NVDIMMs without a valid 'namespace index' in their label
>  area.
>  
> +NOTE: Label-less namespaces lack many of the features of their label-rich
> +cousins. For example, their size cannot be modified, or they cannot be
> +fully 'destroyed' (i.e. the space reclaimed). A destroy operation will
> +zero any mode-specific metadata. Finally, for create-namespace operations
> +on label-less namespaces, ndctl bypasses the region capacity availability
> +checks, and always satisfies the request using the full region capacity.
> +The only reconfiguration operation supported on a label-less namespace
> +is changing its 'mode'.
> +
>  A namespace can be provisioned to operate in one of 4 modes, 'fsdax',
>  'devdax', 'sector', and 'raw'. Here are the expected usage models for
>  these modes:
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [ndctl PATCH] Documentation/namespace-description: Clarify label-less restrictions

2019-08-08 Thread Jeff Moyer
Vishal Verma  writes:

> In the ndctl-create-namespace (and related) man pages, add a
> clarification note regarding some of the restrictions a user may see
> when operating on label-less namespaces.
>
> Link: https://github.com/pmem/ndctl/issues/52
> Reported-by: Jane Chu 
> Cc: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>  Documentation/ndctl/namespace-description.txt | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/Documentation/ndctl/namespace-description.txt 
> b/Documentation/ndctl/namespace-description.txt
> index 94999e5..5f294be 100644
> --- a/Documentation/ndctl/namespace-description.txt
> +++ b/Documentation/ndctl/namespace-description.txt
> @@ -18,6 +18,12 @@ the kernel's 'memmap=ss!nn' command line option (see the 
> nvdimm wiki on
>  kernel.org), or NVDIMMs without a valid 'namespace index' in their label
>  area.
>  
> +NOTE: Label-less namespaces lack many of the features of their label-rich
> +cousins. For example, their size cannot be modified, or they cannot be
> +'destroyed' (only disabled). The only operation supported on them is to
> +change their 'mode', by adding on-media metadata which is then used to
> +identify an address abstraction mechanism for the namespace.

That last sentence is way too obtuse.  You could end it at the word
'mode'.

-Jeff
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [patch] nfit: report frozen security state

2019-08-07 Thread Jeff Moyer
Dan Williams  writes:

> On Thu, Aug 1, 2019 at 2:55 PM Jeff Moyer  wrote:
>>
>> If a dimm is frozen, it is currently reported as being "locked".  While
>> that's not technically wrong, it is misleading as the dimm can't be
>> unlocked.  Fix the confusion.
>
> This looks ok, but now I wonder about the case where the DIMM is
> unlocked, but frozen?

Hah, forgot that was even a possibility.  :)

> I think it makes more sense to show "frozen" when the DIMM is
> frozen-locked, and show "unlocked" when frozen-unlocked. I.e. if the
> DIMM is frozen the user should assume it's disabled for general
> purpose operation, and if it's unlocked the fact that it will fail
> some security operations is a constrained error case. Thoughts?

I think that adds confusion.  I think we should print out both whether
or not it's locked and whether or not it's frozen.  Maybe:

unlocked, not frozen:  "unlocked"
locked, not frozen:"locked"
unlocked, frozen:  "unlocked (frozen)"
locked, frozen:"locked (frozen)"

Something like that?  I think nvdimm_security_state should be a bitmask,
not an enum.  That may be a part of the problem.

-Jeff

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [ndctl PATCH v3 1/8] ndctl/build: Suppress -Waddress-of-packed-member

2019-08-05 Thread Jeff Moyer
Dan Williams  writes:

> On Mon, Aug 5, 2019 at 10:06 AM Jeff Moyer  wrote:
>>
>> Dan Williams  writes:
>>
>> > gcc 9.1.1 emits a slew of warnings for many of the command field
>> > accesses. I.e. warnings of the form:
>> >
>> > libndctl.c:2586:21: warning: taking address of packed member of ‘struct 
>> > nd_cmd_get_config_data_hdr’ may result in an unaligned pointer value 
>> > [-Waddress-of-packed-member]
>> >  2586 |  cmd->iter.offset = >get_data->in_offset;
>> >   | ^
>> >
>> > Suppress these as fixing the warning would defeat the abstraction of being 
>> > able
>> > to have common code that operates on commands with common fields at 
>> > different
>> > offsets in the payload.
>>
>> As I understand it, taking a pointer to this potentially unaligned
>> member can result in bus errors (or worse, accessing wrong data) on
>> architectures that don't support unaligned accesses.  I'd be a whole lot
>> happier with this changelog if it mentioned that you had considered what
>> the warning actually meant, and decided it didn't matter for the
>> architectures you want to support.
>
> True, it was a fleeting thought, but not something I considered
> fleshing out... I'll send a revision.

Thanks.

>> x86 is, of course, safe.  I believe aarch64 is, as well.  I didn't look
>> into others.
>>
> Keep in mind that this code is for interfacing with the ACPI DSM path.
> If an unaligned-incapable architecture defined an NVDIMM command set
> it is highly unlikely it would be ACPI based, or pick these
> problematic command formats. I can add these notes to the changelog,
> but the unfortunate definition of these payloads that require __packed
> is something I hope other architectures avoid.

We can hope...  :) Anyway, thanks for the additional context.

Cheers,
Jeff
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [ndctl PATCH v3 1/8] ndctl/build: Suppress -Waddress-of-packed-member

2019-08-05 Thread Jeff Moyer
Dan Williams  writes:

> gcc 9.1.1 emits a slew of warnings for many of the command field
> accesses. I.e. warnings of the form:
>
> libndctl.c:2586:21: warning: taking address of packed member of ‘struct 
> nd_cmd_get_config_data_hdr’ may result in an unaligned pointer value 
> [-Waddress-of-packed-member]
>  2586 |  cmd->iter.offset = >get_data->in_offset;
>   | ^
>
> Suppress these as fixing the warning would defeat the abstraction of being 
> able
> to have common code that operates on commands with common fields at different
> offsets in the payload.

As I understand it, taking a pointer to this potentially unaligned
member can result in bus errors (or worse, accessing wrong data) on
architectures that don't support unaligned accesses.  I'd be a whole lot
happier with this changelog if it mentioned that you had considered what
the warning actually meant, and decided it didn't matter for the
architectures you want to support.

x86 is, of course, safe.  I believe aarch64 is, as well.  I didn't look
into others.

Cheers,
Jeff


> Signed-off-by: Dan Williams 
> ---
>  configure.ac |1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/configure.ac b/configure.ac
> index 4737cfff77f2..79f82977fa44 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -214,6 +214,7 @@ my_CFLAGS="\
>  -Wmaybe-uninitialized \
>  -Wdeclaration-after-statement \
>  -Wunused-result \
> +-Wno-address-of-packed-member \
>  -D_FORTIFY_SOURCE=2 \
>  -O2
>  "
>
> ___
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[patch] nfit: report frozen security state

2019-08-01 Thread Jeff Moyer
If a dimm is frozen, it is currently reported as being "locked".  While
that's not technically wrong, it is misleading as the dimm can't be
unlocked.  Fix the confusion.

Thanks to Dan for pointing this out.

Signed-off-by: Jeff Moyer 

diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c
index cddd0fcf622c..0df2216b2c68 100644
--- a/drivers/acpi/nfit/intel.c
+++ b/drivers/acpi/nfit/intel.c
@@ -54,12 +54,12 @@ static enum nvdimm_security_state 
intel_security_state(struct nvdimm *nvdimm,
if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_UNSUPPORTED)
return -ENXIO;
else if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_ENABLED) {
-   if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_LOCKED)
-   return NVDIMM_SECURITY_LOCKED;
-   else if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_FROZEN
+   if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_FROZEN
|| nd_cmd.cmd.state &
ND_INTEL_SEC_STATE_PLIMIT)
return NVDIMM_SECURITY_FROZEN;
+   else if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_LOCKED)
+   return NVDIMM_SECURITY_LOCKED;
else
return NVDIMM_SECURITY_UNLOCKED;
}
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v5 1/6] libnvdimm: nd_region flush callback support

2019-04-22 Thread Jeff Moyer
Dan Williams  writes:

> On Mon, Apr 22, 2019 at 8:59 AM Jeff Moyer  wrote:
>>
>> Dan Williams  writes:
>>
>> > On Thu, Apr 18, 2019 at 9:18 AM Christoph Hellwig  
>> > wrote:
>> >>
>> >> On Thu, Apr 18, 2019 at 09:05:05AM -0700, Dan Williams wrote:
>> >> > > > I'd either add a comment about avoiding retpoline overhead here or 
>> >> > > > just
>> >> > > > make ->flush == NULL mean generic_nvdimm_flush(). Just so that 
>> >> > > > people don't
>> >> > > > get confused by the code.
>> >> > >
>> >> > > Isn't this premature optimization?  I really don't like adding things
>> >> > > like this without some numbers to show it's worth it.
>> >> >
>> >> > I don't think it's premature given this optimization technique is
>> >> > already being deployed elsewhere, see:
>> >> >
>> >> > https://lwn.net/Articles/774347/
>> >>
>> >> For one this one was backed by numbers, and second after feedback
>> >> from Linux we switched to the NULL pointer check instead.
>> >
>> > Ok I should have noticed the switch to NULL pointer check. However,
>> > the question still stands do we want everyone to run numbers to
>> > justify this optimization, or make it a new common kernel coding
>> > practice to do:
>> >
>> > if (!object->op)
>> > generic_op(object);
>> > else
>> > object->op(object);
>> >
>> > ...in hot paths?
>>
>> I don't think nvdimm_flush is a hot path.  Numbers of some
>> representative workload would prove one of us right.
>
> I'd rather say that the if "if (!op) do_generic()" pattern is more
> readable in the general case, saves grepping for who set the op in the
> common case. The fact that it has the potential to be faster is gravy
> at that point.

If the primary motivation is performance, then I'd expect performance
numbers to back it up.  If that isn't the primary motivation, then
choose whichever way you feel is appropriate.

Cheers,
Jeff
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v5 1/6] libnvdimm: nd_region flush callback support

2019-04-22 Thread Jeff Moyer
Dan Williams  writes:

> On Thu, Apr 18, 2019 at 9:18 AM Christoph Hellwig  wrote:
>>
>> On Thu, Apr 18, 2019 at 09:05:05AM -0700, Dan Williams wrote:
>> > > > I'd either add a comment about avoiding retpoline overhead here or just
>> > > > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people 
>> > > > don't
>> > > > get confused by the code.
>> > >
>> > > Isn't this premature optimization?  I really don't like adding things
>> > > like this without some numbers to show it's worth it.
>> >
>> > I don't think it's premature given this optimization technique is
>> > already being deployed elsewhere, see:
>> >
>> > https://lwn.net/Articles/774347/
>>
>> For one this one was backed by numbers, and second after feedback
>> from Linux we switched to the NULL pointer check instead.
>
> Ok I should have noticed the switch to NULL pointer check. However,
> the question still stands do we want everyone to run numbers to
> justify this optimization, or make it a new common kernel coding
> practice to do:
>
> if (!object->op)
> generic_op(object);
> else
> object->op(object);
>
> ...in hot paths?

I don't think nvdimm_flush is a hot path.  Numbers of some
representative workload would prove one of us right.

-Jeff
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v5 1/6] libnvdimm: nd_region flush callback support

2019-04-18 Thread Jeff Moyer
Dan Williams  writes:

> On Fri, Apr 12, 2019 at 6:12 AM Jeff Moyer  wrote:
>>
>> Jan Kara  writes:
>>
>> > On Thu 11-04-19 07:51:48, Dan Williams wrote:
>> >> On Tue, Apr 9, 2019 at 9:09 PM Pankaj Gupta  wrote:
>> >> > +   } else {
>> >> > +   if (nd_region->flush(nd_region))
>> >> > +   rc = -EIO;
>> >>
>> >> Given the common case wants to be fast and synchronous I think we
>> >> should try to avoid retpoline overhead by default. So something like
>> >> this:
>> >>
>> >> if (nd_region->flush == generic_nvdimm_flush)
>> >> rc = generic_nvdimm_flush(...);
>> >
>> > I'd either add a comment about avoiding retpoline overhead here or just
>> > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
>> > get confused by the code.
>>
>> Isn't this premature optimization?  I really don't like adding things
>> like this without some numbers to show it's worth it.
>
> I don't think it's premature given this optimization technique is
> already being deployed elsewhere, see:
>
> https://lwn.net/Articles/774347/

The technique is fine, but that doesn't mean it should be applied
everywhere.  Is *this* code path really going to benefit from the
optimization?

-Jeff
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v6 00/12] mm: Sub-section memory hotplug support

2019-04-18 Thread Jeff Moyer
Dan Williams  writes:

>> On Wed, Apr 17, 2019 at 3:59 PM Dan Williams  
>> wrote:
>>
>> On Wed, Apr 17, 2019 at 3:04 PM Andrew Morton  
>> wrote:
>> >
>> > On Wed, 17 Apr 2019 11:38:55 -0700 Dan Williams  
>> > wrote:
>> >
>> > > The memory hotplug section is an arbitrary / convenient unit for memory
>> > > hotplug. 'Section-size' units have bled into the user interface
>> > > ('memblock' sysfs) and can not be changed without breaking existing
>> > > userspace. The section-size constraint, while mostly benign for typical
>> > > memory hotplug, has and continues to wreak havoc with 'device-memory'
>> > > use cases, persistent memory (pmem) in particular. Recall that pmem uses
>> > > devm_memremap_pages(), and subsequently arch_add_memory(), to allocate a
>> > > 'struct page' memmap for pmem. However, it does not use the 'bottom
>> > > half' of memory hotplug, i.e. never marks pmem pages online and never
>> > > exposes the userspace memblock interface for pmem. This leaves an
>> > > opening to redress the section-size constraint.
>> >
>> > v6 and we're not showing any review activity.  Who would be suitable
>> > people to help out here?
>>
>> There was quite a bit of review of the cover letter from Michal and
>> David, but you're right the details not so much as of yet. I'd like to
>> call out other people where I can reciprocate with some review of my
>> own. Oscar's altmap work looks like a good candidate for that.
>
> I'm also hoping Jeff can give a tested-by for the customer scenarios
> that fall over with the current implementation.

Sure.  I'll also have a look over the patches.

-Jeff
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [patch] libnvdimm/label: Check for a disabled nmem when deleting labels

2019-04-16 Thread Jeff Moyer
Dan Williams  writes:

> On Mon, Apr 15, 2019 at 1:32 PM Jeff Moyer  wrote:
>>
>> Hi,
>>
>> We ran into a situation where one dimm's label area was corrupt (it had
>> multiple entries for the same DPA), and so that nmem was disabled.  The
>> region as a whole was enabled, though.  In this case, there was a single
>> namespace that showed up as disabled.  Trying to reconfigure that
>> namespace resulted in the following panic:
>
> I'm concerned that there is a seed device for userspace to get its
> hopes up thinking it can do anything with the region.

That looked to be by design, to me.  I thought we kept the region active
in case you wanted to create a dimm local (blk) namespace.

>> BUG: unable to handle kernel NULL pointer dereference at 0024
>> [  108.386384] #PF error: [normal kernel read fault]
>> [  108.391090] PGD 11c3eae5067 P4D 11c3eae5067 PUD 11c262c9067 PMD 0
>> [  108.397267] Oops:  [#1] SMP NOPTI
>> [  108.400936] CPU: 39 PID: 3838 Comm: ndctl Not tainted 5.1.0-rc5 #1
>> [  108.407114] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS 
>> SE5C620.86B.0D.01.0286.011120190816 01/11/2019
>> [  108.417200] RIP: 0010:preamble_next+0xd/0x60 [libnvdimm]
>> ...
>>
>> The problem is that there is no driver data associated with the
>> nd_mapping in this case.  Check for that.
>>
>> Signed-off-by: Jeff Moyer 
>> Reported-by: Zhang Yi 
>>
>> diff --git a/drivers/nvdimm/label.c b/drivers/nvdimm/label.c
>> index f3d753d3169c..da1f28d8c9e6 100644
>> --- a/drivers/nvdimm/label.c
>> +++ b/drivers/nvdimm/label.c
>> @@ -1217,7 +1217,7 @@ static int del_labels(struct nd_mapping *nd_mapping, 
>> u8 *uuid)
>> return 0;
>>
>> /* no index || no labels == nothing to delete */
>> -   if (!preamble_next(ndd, , , ))
>> +   if (!ndd || !preamble_next(ndd, , , ))
>> return 0;
>
> This looks ok, but I think I'd prefer to fix init_active_labels() to
> make sure it is failing region activation. I suspect something is
> wrong with this logic, but nothing is immediately coming to mind.

I also worry that, if there had been available space in the region,
new namespace creation could fail in the same way.

> /*
>  * If the dimm is disabled then we may need to prevent
>  * the region from being activated.
>  */
> if (!ndd) {
> if (test_bit(NDD_LOCKED, >flags))
> /* fail, label data may be unreadable */;
> else if (test_bit(NDD_ALIASING, >flags))
> /* fail, labels needed to disambiguate dpa */;
> else
> return 0;
>
> dev_err(_region->dev, "%s: is %s, failing probe\n",
> dev_name(_mapping->nvdimm->dev),
> test_bit(NDD_LOCKED, >flags)
> ? "locked" : "disabled");
> return -ENXIO;
> }

I also looked at this logic.  The only thing wrong here, I think, is
that you don't need to disable the region for a locked dimm (unless all
dimms are locked).

>
> I'll try to reproduce...

OK.  If you can create custom labels, I'd say that would be the easiest
way to reproduce this.

Thanks!
Jeff
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[patch] libnvdimm/label: Check for a disabled nmem when deleting labels

2019-04-15 Thread Jeff Moyer
Hi,

We ran into a situation where one dimm's label area was corrupt (it had
multiple entries for the same DPA), and so that nmem was disabled.  The
region as a whole was enabled, though.  In this case, there was a single
namespace that showed up as disabled.  Trying to reconfigure that
namespace resulted in the following panic:

BUG: unable to handle kernel NULL pointer dereference at 0024
[  108.386384] #PF error: [normal kernel read fault]
[  108.391090] PGD 11c3eae5067 P4D 11c3eae5067 PUD 11c262c9067 PMD 0 
[  108.397267] Oops:  [#1] SMP NOPTI
[  108.400936] CPU: 39 PID: 3838 Comm: ndctl Not tainted 5.1.0-rc5 #1
[  108.407114] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS 
SE5C620.86B.0D.01.0286.011120190816 01/11/2019
[  108.417200] RIP: 0010:preamble_next+0xd/0x60 [libnvdimm]
...

The problem is that there is no driver data associated with the
nd_mapping in this case.  Check for that.

Signed-off-by: Jeff Moyer 
Reported-by: Zhang Yi 

diff --git a/drivers/nvdimm/label.c b/drivers/nvdimm/label.c
index f3d753d3169c..da1f28d8c9e6 100644
--- a/drivers/nvdimm/label.c
+++ b/drivers/nvdimm/label.c
@@ -1217,7 +1217,7 @@ static int del_labels(struct nd_mapping *nd_mapping, u8 
*uuid)
return 0;
 
/* no index || no labels == nothing to delete */
-   if (!preamble_next(ndd, , , ))
+   if (!ndd || !preamble_next(ndd, , , ))
return 0;
 
mutex_lock(_mapping->lock);
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v5 1/6] libnvdimm: nd_region flush callback support

2019-04-12 Thread Jeff Moyer
Jan Kara  writes:

> On Thu 11-04-19 07:51:48, Dan Williams wrote:
>> On Tue, Apr 9, 2019 at 9:09 PM Pankaj Gupta  wrote:
>> > +   } else {
>> > +   if (nd_region->flush(nd_region))
>> > +   rc = -EIO;
>> 
>> Given the common case wants to be fast and synchronous I think we
>> should try to avoid retpoline overhead by default. So something like
>> this:
>> 
>> if (nd_region->flush == generic_nvdimm_flush)
>> rc = generic_nvdimm_flush(...);
>
> I'd either add a comment about avoiding retpoline overhead here or just
> make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
> get confused by the code.

Isn't this premature optimization?  I really don't like adding things
like this without some numbers to show it's worth it.

-Jeff
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH] libnvdimm, pmem: fix a possible OOB access when read and write pmem

2019-04-05 Thread Jeff Moyer
Li RongQing  writes:

> If offset is not zero and length is bigger than PAGE_SIZE,
> this will cause to out of boundary access to a page memory

Agreed, though I don't think this can be triggered in practice.

> Fixes: 98cc093cba1e "(block, THP: make block_device_operations.rw_page 
> support THP)"
> Co-developed-by: Liang ZhiCheng 
> Signed-off-by: Liang ZhiCheng 
> Signed-off-by: Li RongQing 

Reviewed-by: Jeff Moyer 


> ---
>  drivers/nvdimm/pmem.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index bc2f700feef8..0279eb1da3ef 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -113,13 +113,13 @@ static void write_pmem(void *pmem_addr, struct page 
> *page,
>  
>   while (len) {
>   mem = kmap_atomic(page);
> - chunk = min_t(unsigned int, len, PAGE_SIZE);
> + chunk = min_t(unsigned int, len, PAGE_SIZE - off);
>   memcpy_flushcache(pmem_addr, mem + off, chunk);
>   kunmap_atomic(mem);
>   len -= chunk;
>   off = 0;
>   page++;
> - pmem_addr += PAGE_SIZE;
> + pmem_addr += chunk;
>   }
>  }
>  
> @@ -132,7 +132,7 @@ static blk_status_t read_pmem(struct page *page, unsigned 
> int off,
>  
>   while (len) {
>   mem = kmap_atomic(page);
> - chunk = min_t(unsigned int, len, PAGE_SIZE);
> + chunk = min_t(unsigned int, len, PAGE_SIZE - off);
>   rem = memcpy_mcsafe(mem + off, pmem_addr, chunk);
>   kunmap_atomic(mem);
>   if (rem)
> @@ -140,7 +140,7 @@ static blk_status_t read_pmem(struct page *page, unsigned 
> int off,
>   len -= chunk;
>   off = 0;
>   page++;
> - pmem_addr += PAGE_SIZE;
> + pmem_addr += chunk;
>   }
>   return BLK_STS_OK;
>  }
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v5 00/10] mm: Sub-section memory hotplug support

2019-03-25 Thread Jeff Moyer
Michal Hocko  writes:

>> > and I would like to know that you are
>> > not just shifting the problem to a smaller unit and a new/creative HW
>> > will force us to go even more complicated.
>> 
>> HW will not do this to us. It's software that has the problem.
>> Namespace creation is unnecessarily constrained to 128MB alignment.
>
> And why is that a problem? A lack of documentation that this is a
> requirement? Something will not work with a larger alignment? Someting
> else?

See this email for one user-visible problem:
  https://lore.kernel.org/lkml/x49imxbx22d@segfault.boston.devel.redhat.com/

Cheers,
Jeff
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 7/7] libnvdimm/pfn: Fix 'start_pad' implementation

2019-02-22 Thread Jeff Moyer
Dan Williams  writes:

>> Great!  Now let's create another one.
>>
>> # ndctl create-namespace -m fsdax -s 132m
>> libndctl: ndctl_pfn_enable: pfn1.1: failed to enable
>>   Error: namespace1.2: failed to enable
>>
>> failed to create namespace: No such device or address
>>
>> (along with a kernel warning spew)
>
> I assume you're seeing this on the libnvdimm-pending branch?

Yes, but also on linus' master branch.  Things have been operating in
this manner for some time.

>> I understand the desire for expediency.  At some point, though, we have
>> to address the root of the problem.
>
> Well, you've defibrillated me back to reality. We've suffered the
> incomplete broken hacks for 2 years, what's another 10 weeks? I'll
> dust off the sub-section patches and take another run at it.

OK, thanks.  Let me know if I can help at all.

Cheers,
Jeff
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 7/7] libnvdimm/pfn: Fix 'start_pad' implementation

2019-02-22 Thread Jeff Moyer
Dan Williams  writes:

>> > However, to fix this situation a non-backwards compatible change
>> > needs to be made to the interpretation of the nd_pfn info-block.
>> > ->start_pad needs to be accounted in ->map.map_offset (formerly
>> > ->data_offset), and ->map.map_base (formerly ->phys_addr) needs to be
>> > adjusted to the section aligned resource base used to establish
>> > ->map.map formerly (formerly ->virt_addr).
>> >
>> > The guiding principles of the info-block compatibility fixup is to
>> > maintain the interpretation of ->data_offset for implementations like
>> > the EFI driver that only care about data_access not dax, but cause older
>> > Linux implementations that care about the mode and dax to fail to parse
>> > the new info-block.
>>
>> What if the core mm grew support for hotplug on sub-section boundaries?
>> Would't that fix this problem (and others)?
>
> Yes, I think it would, and I had patches along these lines [2]. Last
> time I looked at this I was asked by core-mm folks to await some
> general refactoring of hotplug [3], and I wasn't proud about some of
> the hacks I used to make it work. In general I'm less confident about
> being able to get sub-section-hotplug over the goal line (core-mm
> resistance to hotplug complexity) vs the local hacks in nvdimm to deal
> with this breakage.

You first posted that patch series in December of 2016.  How long do we
wait for this refactoring to happen?

Meanwhile, we've been kicking this can down the road for far too long.
Simple namespace creation fails to work.  For example:

# ndctl create-namespace -m fsdax -s 128m
  Error: '--size=' must align to interleave-width: 6 and alignment: 2097152
  did you intend --size=132M?

failed to create namespace: Invalid argument

ok, I can't actually create a small, section-aligned namespace.  Let's
bump it up:

# ndctl create-namespace -m fsdax -s 132m
{
  "dev":"namespace1.0",
  "mode":"fsdax",
  "map":"dev",
  "size":"126.00 MiB (132.12 MB)",
  "uuid":"2a5f8fe0-69e2-46bf-98bc-0f5667cd810a",
  "raw_uuid":"f7324317-5cd2-491e-8cd1-ad03770593f2",
  "sector_size":512,
  "blockdev":"pmem1",
  "numa_node":1
}

Great!  Now let's create another one.

# ndctl create-namespace -m fsdax -s 132m
libndctl: ndctl_pfn_enable: pfn1.1: failed to enable
  Error: namespace1.2: failed to enable

failed to create namespace: No such device or address

(along with a kernel warning spew)

And at this point, all further ndctl create-namespace commands fail.
Lovely.  This is a wart that was acceptable only because a fix was
coming.  2+ years later, and we're still adding hacks to work around it
(and there have been *several* hacks).

> Local hacks are always a sad choice, but I think leaving these
> configurations stranded for another kernel cycle is not tenable. It
> wasn't until the github issue did I realize that the problem was
> happening in the wild on NVDIMM-N platforms.

I understand the desire for expediency.  At some point, though, we have
to address the root of the problem.

-Jeff

>
> [2]: 
> https://lore.kernel.org/lkml/148964440651.19438.2288075389153762985.st...@dwillia2-desk3.amr.corp.intel.com/
> [3]: https://lore.kernel.org/lkml/20170319163531.ga25...@dhcp22.suse.cz/
>
>>
>> -Jeff
>>
>> [1] https://github.com/pmem/ndctl/issues/76
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 7/7] libnvdimm/pfn: Fix 'start_pad' implementation

2019-02-21 Thread Jeff Moyer
Hi, Dan,

Thanks for the comprehensive write-up.  Comments below.

Dan Williams  writes:

> In the beginning the pmem driver simply passed the persistent memory
> resource range to memremap and was done. With the introduction of
> devm_memremap_pages() and vmem_altmap the implementation needed to
> contend with metadata at the start of the resource to indicate whether
> the vmemmap is located in System RAM or Persistent Memory, and reserve
> vmemmap capacity in pmem for the latter case.
>
> The indication of metadata space was communicated in the
> nd_pfn->data_offset property and it was defined to be identical to the
> pmem_device->data_offset property, i.e. relative to the raw resource
> base of the namespace. Up until this point in the driver's development
> pmem_device->phys_addr == __pa(pmem_device->virt_addr). This
> implementation was fine up until the discovery of platforms with
> physical address layouts that mapped Persistent Memory and System RAM to
> the same Linux memory hotplug section (128MB span).
>
> The nd_pfn->start_pad and nd_pfn->end_trunc properties were introduced
> to pad and truncate the capacity to fit within an exclusive Linux
> memory hotplug section span, and it was at this point where the
> ->start_pad definition did not comprehend the pmem_device->phys_addr to
> pmem_device->virt_addr relationship. Platforms in the wild typically
> only collided 'System RAM' at the end of the Persistent Memory range so
> ->start_pad was often zero.
>
> Lately Linux has encountered platforms that collide Persistent Memory
> regions between each other, specifically cases where ->start_pad needed
> to be non-zero. This lead to commit ae86cbfef381 "libnvdimm, pfn: Pad
> pfn namespaces relative to other regions". That commit allowed
> namespaces to be mapped with devm_memremap_pages(). However dax
> operations on those configurations currently fail if attempted within the
> ->start_pad range because pmem_device->data_offset was still relative to
> raw resource base not relative to the section aligned resource range
> mapped by devm_memremap_pages().
>
> Luckily __bdev_dax_supported() caught these failures and simply disabled
> dax.

Let me make sure I understand the current state of things.  Assume a
machine with two persistent memory ranges overlapping the same hotplug
memory section.  Let's take the example from the ndctl github issue[1]:

187c00-967bff : Persistent Memory

/sys/bus/nd/devices/region0/resource: 0x187c00
/sys/bus/nd/devices/region1/resource: 0x577c00

Create a namespace in region1.  That namespace will have a start_pad of
64MiB.  The problem is that, while the correct offset was specified when
laying out the struct pages (via arch_add_memory), the data_offset for
the pmem block device itself does not take the start_pad into account
(despite the comment in the nd_pfn_sb data structure!).  As a result,
the block device starts at the beginning of the address range, but
struct pages only exist for the address space starting 64MiB into the
range.  __bdev_dax_supported fails, because it tries to perform a
direct_access call on sector 0, and there's no pgmap for the address
corresponding to that sector.

So, we can't simply make the code correct (by adding the start_pad to
pmem->data_offset) without bumping the superblock version, because that
would change the size of the block device, and the location of data on
that block device would all be off by 64MiB (and you'd lose the first
64MiB).  Mass hysteria.

> However, to fix this situation a non-backwards compatible change
> needs to be made to the interpretation of the nd_pfn info-block.
> ->start_pad needs to be accounted in ->map.map_offset (formerly
> ->data_offset), and ->map.map_base (formerly ->phys_addr) needs to be
> adjusted to the section aligned resource base used to establish
> ->map.map formerly (formerly ->virt_addr).
>
> The guiding principles of the info-block compatibility fixup is to
> maintain the interpretation of ->data_offset for implementations like
> the EFI driver that only care about data_access not dax, but cause older
> Linux implementations that care about the mode and dax to fail to parse
> the new info-block.

What if the core mm grew support for hotplug on sub-section boundaries?
Would't that fix this problem (and others)?

-Jeff

[1] https://github.com/pmem/ndctl/issues/76
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v3 1/2] nfit, mce: only handle uncorrectable machine checks

2019-02-21 Thread Jeff Moyer
Dan Williams  writes:

> On Wed, Feb 20, 2019 at 11:26 AM Jeff Moyer  wrote:
>>
>> Borislav Petkov  writes:
>>
>> > Drop stable@
>> >
>> > On Wed, Feb 20, 2019 at 01:59:15PM -0500, Jeff Moyer wrote:
>> >> Sorry for necroposting.  I thought the point of the CEC was to make sure
>> >> that the other registered decoders only ever saw uncorrected errors.
>> >
>> > Ha, good point! You mean drivers/ras/cec.c, right?
>>
>> Yes.
>>
>> > If so, then I don't think we've ever talked about connecting CEC with
>> > NVDIMM and whether that would make sense. Lemme add Dan.
>>
>> I don't think there's a difference between MCEs for NVDIMMs and normal
>> DRAM.  I'll let Dan confirm or deny that.
>
> There is a difference. NVDIMMs have local tracking of discovered
> poison, methods to scan for latent poison, and methods to clear.

What I meant was that you couldn't tell the difference between an MCE
generated by accessing DRAM vs one generated by accessing an NVDIMM
(aside from checking the address).

> A CEC connection, iiuc, would seem an awkward fit. Awkward because
> what CEC enables is meant to be implemented natively in the hardware,
> and CEC seems to have no concept of the fact that errors can be
> repaired.

As far as I can tell, the Correctable Errors Collector just eats
correctable errors so that the rest of the registered decoders don't
have to worry about receiving them.  It sounds like you're suggesting
that NVDIMMs won't spew correctable errors.  If that's the case (I don't
think it is), then there's no need at all for these patches.

Anyway, given that the correctable errors collector can be turned off in
the kernel config, and assuming that we still can get correctable errors
from NVDIMMs (I think we can, since I believe the caching hierarchy can
generate them as well), we definitely need to continue to check for
correctable errors in the nfit mce decoder.  That's something I had
overlooked.

Cheers,
Jeff
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v3 1/2] nfit, mce: only handle uncorrectable machine checks

2019-02-20 Thread Jeff Moyer
Borislav Petkov  writes:

> Drop stable@
>
> On Wed, Feb 20, 2019 at 01:59:15PM -0500, Jeff Moyer wrote:
>> Sorry for necroposting.  I thought the point of the CEC was to make sure
>> that the other registered decoders only ever saw uncorrected errors.
>
> Ha, good point! You mean drivers/ras/cec.c, right?

Yes.

> If so, then I don't think we've ever talked about connecting CEC with
> NVDIMM and whether that would make sense. Lemme add Dan.

I don't think there's a difference between MCEs for NVDIMMs and normal
DRAM.  I'll let Dan confirm or deny that.

>> How do we end up getting called with a correctable error?
>
> Good question. We shouldn't.
>
> So we need to figure out here how exactly should those things interact
> and whether NFIT should get all errors reported or it should put all the
> correctable errors through the decay thing, see the comment at the top
> of drivers/ras/cec.c
>
> Thx for pointing that out Jeff.

Sure, thanks for the quick reply, Boris!  Also, thanks for your detailed
commit messages.  They really help with understanding the code changes.

Cheers,
Jeff
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v3 1/2] nfit, mce: only handle uncorrectable machine checks

2019-02-20 Thread Jeff Moyer
Hi, Vishal,

Vishal Verma  writes:

> The mce handler for 'nfit' devices is called for memory errors on a
> Non-Volatile DIMM, and adds the error location to a 'badblocks' list.
> This list is used by the various NVDIMM drivers to avoid consuming known
> poison locations during IO.
>
> The mce handler gets called for both corrected and uncorrectable errors.

Sorry for necroposting.  I thought the point of the CEC was to make sure
that the other registered decoders only ever saw uncorrected errors.
How do we end up getting called with a correctable error?

Thanks,
Jeff

> Until now, both kinds of errors have been added to the badblocks list.
> However, corrected memory errors indicate that the problem has already
> been fixed by hardware, and the resulting interrupt is merely a
> notification to Linux. As far as future accesses to that location are
> concerned, it is perfectly fine to use, and thus doesn't need to be
> included in the above badblocks list.
>
> Add a check in the nfit mce handler to filter out corrected mce events,
> and only process uncorrectable errors.
>
> Reported-by: Omar Avelar 
> Fixes: 6839a6d96f4e ("nfit: do an ARS scrub on hitting a latent media error")
> Cc: sta...@vger.kernel.org
> Cc: Dan Williams 
> Cc: Tony Luck 
> Cc: Borislav Petkov 
> Signed-off-by: Vishal Verma 
> ---
>  arch/x86/include/asm/mce.h   | 1 +
>  arch/x86/kernel/cpu/mcheck/mce.c | 3 ++-
>  drivers/acpi/nfit/mce.c  | 4 ++--
>  3 files changed, 5 insertions(+), 3 deletions(-)
>
> v3: Unchanged from v2
>
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index 3a17107594c8..3111b3cee2ee 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -216,6 +216,7 @@ static inline int umc_normaddr_to_sysaddr(u64 norm_addr, 
> u16 nid, u8 umc, u64 *s
>  
>  int mce_available(struct cpuinfo_x86 *c);
>  bool mce_is_memory_error(struct mce *m);
> +bool mce_is_correctable(struct mce *m);
>  
>  DECLARE_PER_CPU(unsigned, mce_exception_count);
>  DECLARE_PER_CPU(unsigned, mce_poll_count);
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c 
> b/arch/x86/kernel/cpu/mcheck/mce.c
> index 953b3ce92dcc..27015948bc41 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -534,7 +534,7 @@ bool mce_is_memory_error(struct mce *m)
>  }
>  EXPORT_SYMBOL_GPL(mce_is_memory_error);
>  
> -static bool mce_is_correctable(struct mce *m)
> +bool mce_is_correctable(struct mce *m)
>  {
>   if (m->cpuvendor == X86_VENDOR_AMD && m->status & MCI_STATUS_DEFERRED)
>   return false;
> @@ -544,6 +544,7 @@ static bool mce_is_correctable(struct mce *m)
>  
>   return true;
>  }
> +EXPORT_SYMBOL_GPL(mce_is_correctable);
>  
>  static bool cec_add_mce(struct mce *m)
>  {
> diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
> index e9626bf6ca29..7a51707f87e9 100644
> --- a/drivers/acpi/nfit/mce.c
> +++ b/drivers/acpi/nfit/mce.c
> @@ -25,8 +25,8 @@ static int nfit_handle_mce(struct notifier_block *nb, 
> unsigned long val,
>   struct acpi_nfit_desc *acpi_desc;
>   struct nfit_spa *nfit_spa;
>  
> - /* We only care about memory errors */
> - if (!mce_is_memory_error(mce))
> + /* We only care about uncorrectable memory errors */
> + if (!mce_is_memory_error(mce) || mce_is_correctable(mce))
>   return NOTIFY_DONE;
>  
>   /*
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


  1   2   3   >