Re: [PATCH v2] nvdimm: Avoid race between probe and reading device attributes

2021-01-07 Thread Michal Suchánek
Ping?

On Mon, Jun 15, 2020 at 08:47:23AM +0100, Richard Palethorpe wrote:
> It is possible to cause a division error and use-after-free by querying the
> nmem device before the driver data is fully initialised in nvdimm_probe, e.g.
> by doing
> 
> (while true; do
>     cat /sys/bus/nd/devices/nmem*/available_slots 2>&1 > /dev/null
> done) &
> 
> while true; do
>     for i in $(seq 0 4); do
>         echo nmem$i > /sys/bus/nd/drivers/nvdimm/bind
>     done
>     for i in $(seq 0 4); do
>         echo nmem$i > /sys/bus/nd/drivers/nvdimm/unbind
>     done
> done
> 
> On 5.7-rc3 this causes:
> 
> [   12.711578] divide error:  [#1] SMP KASAN PTI
> [   12.712321] CPU: 0 PID: 231 Comm: cat Not tainted 5.7.0-rc3 #48
> [   12.713188] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
> [   12.714857] RIP: 0010:nd_label_nfree+0x134/0x1a0 [libnvdimm]
> [   12.715772] Code: ba 00 00 00 00 00 fc ff df 48 89 f9 48 c1 e9 03 0f b6 14 
> 11 84 d2 74 05 80 fa 03 7e 52 8b 73 08 31 d2 89 c1 48 83 c4 08 5b 5d  f6 
> 31 d2 41 5c 83 c0 07 c1 e8 03 48 8d 84 00 8e 02 00 00 25 00
> [   12.718311] RSP: 0018:c946fd08 EFLAGS: 00010282
> [   12.719030] RAX:  RBX: c0073aa0 RCX: 
> 
> [   12.720005] RDX:  RSI:  RDI: 
> 888060931808
> [   12.720970] RBP: 88806609d018 R08: 0001 R09: 
> ed100cc0a2b1
> [   12.721889] R10: 888066051587 R11: ed100cc0a2b0 R12: 
> 888060931800
> [   12.722744] R13: 888064362000 R14: 88806609d018 R15: 
> 8b1a2520
> [   12.723602] FS:  7fd16f3d5580() GS:88806b40() 
> knlGS:
> [   12.724600] CS:  0010 DS:  ES:  CR0: 80050033
> [   12.725308] CR2: 7fd16f1ec000 CR3: 64322006 CR4: 
> 00160ef0
> [   12.726268] Call Trace:
> [   12.726633]  available_slots_show+0x4e/0x120 [libnvdimm]
> [   12.727380]  dev_attr_show+0x42/0x80
> [   12.727891]  ? memset+0x20/0x40
> [   12.728341]  sysfs_kf_seq_show+0x218/0x410
> [   12.728923]  seq_read+0x389/0xe10
> [   12.729415]  vfs_read+0x101/0x2d0
> [   12.729891]  ksys_read+0xf9/0x1d0
> [   12.730361]  ? kernel_write+0x120/0x120
> [   12.730915]  do_syscall_64+0x95/0x4a0
> [   12.731435]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [   12.732163] RIP: 0033:0x7fd16f2fe4be
> [   12.732685] Code: c0 e9 c6 fe ff ff 50 48 8d 3d 2e 12 0a 00 e8 69 e9 01 00 
> 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 
> 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
> [   12.735207] RSP: 002b:7ffd3177b838 EFLAGS: 0246 ORIG_RAX: 
> 
> [   12.736261] RAX: ffda RBX: 0002 RCX: 
> 7fd16f2fe4be
> [   12.737233] RDX: 0002 RSI: 7fd16f1ed000 RDI: 
> 0003
> [   12.738203] RBP: 7fd16f1ed000 R08: 7fd16f1ec010 R09: 
> 
> [   12.739172] R10: 7fd16f3f4f70 R11: 0246 R12: 
> 7ffd3177ce23
> [   12.740144] R13: 0003 R14: 0002 R15: 
> 0002
> [   12.741139] Modules linked in: nfit libnvdimm
> [   12.741783] ---[ end trace 99532e4b82410044 ]---
> [   12.742452] RIP: 0010:nd_label_nfree+0x134/0x1a0 [libnvdimm]
> [   12.743167] Code: ba 00 00 00 00 00 fc ff df 48 89 f9 48 c1 e9 03 0f b6 14 
> 11 84 d2 74 05 80 fa 03 7e 52 8b 73 08 31 d2 89 c1 48 83 c4 08 5b 5d  f6 
> 31 d2 41 5c 83 c0 07 c1 e8 03 48 8d 84 00 8e 02 00 00 25 00
> [   12.745709] RSP: 0018:c946fd08 EFLAGS: 00010282
> [   12.746340] RAX:  RBX: c0073aa0 RCX: 
> 
> [   12.747209] RDX:  RSI:  RDI: 
> 888060931808
> [   12.748081] RBP: 88806609d018 R08: 0001 R09: 
> ed100cc0a2b1
> [   12.748977] R10: 888066051587 R11: ed100cc0a2b0 R12: 
> 888060931800
> [   12.749849] R13: 888064362000 R14: 88806609d018 R15: 
> 8b1a2520
> [   12.750729] FS:  7fd16f3d5580() GS:88806b40() 
> knlGS:
> [   12.751708] CS:  0010 DS:  ES:  CR0: 80050033
> [   12.752441] CR2: 7fd16f1ec000 CR3: 64322006 CR4: 
> 00160ef0
> [   12.821357] 
> ==
> [   12.822284] BUG: KASAN: use-after-free in __mutex_lock+0x111c/0x11a0
> [   12.823084] Read of size 4 at addr 888065c26238 by task reproducer/218
> [   12.823968]
> [   12.824183] CPU: 2 PID: 218 Comm: reproducer Tainted: G  D   
> 5.7.0-rc3 #48
> [   12.825167] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
> [   12.826595] Call Trace:
> [   12.826926]  dump_stack+0x97/0xe0
> [   12.827362]  print_address_description.constprop.0+0x1b/0x210
> [   12.828111]  ? __mutex_lock+0x111c/0x11a0
> [   12.828645]  

Re: Feedback requested: Exposing NVDIMM performance statistics in a generic way

2020-10-23 Thread Michal Suchánek
Hello,

On Thu, May 28, 2020 at 11:59 AM Vaibhav Jain  wrote:
>
> Thanks for taking the time to look into this, Dan,
>
> Agree with the points you have made earlier that I am summarizing below:
>
> * This is better done in ndctl rather than ipmctl.
> * Should only expose general performance metrics and not performance
>   counters. Performance counters should be exposed via perf.
> * Vendor specific metrics to be separated from generic performance
>   metrics.
>
> One way to split generic and vendor specific metrics might be to report
> generic performance metrics together with dimm health metrics such as
> "temperature_celsius" or "spares_percentage" that are already reported
> by the dimm health output.
>
> Vendor specific performance metrics can be reported as a separate object
> in the json output. Something similar to output below:
>
> # ndctl list -DH --stats --vendor-stats
> [
>   {
> "dev":"nmem0",
> "health":{
>   "health_state":"ok",
>   "shutdown_state":"clean",
>   "temperature_celsius":48.00,
>   "spares_percentage":10,
>
>   /* Generic performance metrics/stats */
>   "TotalMediaReads": 18929,
>   "TotalMediaWrites": 0,
>   
> }
>
> /* Vendor specific stats for the dimm */
> "vendor-stats": {
> "Controller Reset Count":10
> "Controller Reset Elapsed Time": 3600
> "Power-on Seconds": 3600

How do you tell generic from vendor-specific stats, though?

Controller reset count and power-on time may not be reported by some
controllers but sound pretty generic.

Even if you declare that the stats reported by all controllers
available at this moment are generic, a later one may not report some of
these 'generic' statistics, may report them in a different way or in
different units, or may simply not report anything at all for some
technical reason.

Kernels that do not have this feature will not report anything at all
either.

Thanks

Michal
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [RFC PATCH 0/4] powerpc/papr_scm: Add support for reporting NVDIMM performance statistics

2020-10-21 Thread Michal Suchánek
Hello,

apparently this has not received any (public) comments.

Maybe resend without the RFC status?

Clearly the kernel interface must be defined first, and then ndctl can
follow and make use of it.

Thanks

Michal

On Mon, May 18, 2020 at 04:38:10PM +0530, Vaibhav Jain wrote:
> The patch-set proposes to add support for fetching and reporting
> performance statistics for PAPR compliant NVDIMMs as described in
> documentation for H_SCM_PERFORMANCE_STATS hcall Ref[1]. The patch-set
> also implements mechanisms to expose NVDIMM performance stats via
> sysfs and newly introduced PDSMs[2] for libndctl.
> 
> This patch-set combined with corresponding ndctl and libndctl changes
> proposed at Ref[3] should enable users to fetch performance stats for
> PAPR compliant NVDIMMs using the following command:
> 
>  # ndctl list -D --stats
> [
>   {
> "dev":"nmem0",
> "stats":{
>   "Controller Reset Count":2,
>   "Controller Reset Elapsed Time":603331,
>   "Power-on Seconds":603931,
>   "Life Remaining":"100%",
>   "Critical Resource Utilization":"0%",
>   "Host Load Count":5781028,
>   "Host Store Count":8966800,
>   "Host Load Duration":975895365,
>   "Host Store Duration":716230690,
>   "Media Read Count":0,
>   "Media Write Count":6313,
>   "Media Read Duration":0,
>   "Media Write Duration":9679615,
>   "Cache Read Hit Count":5781028,
>   "Cache Write Hit Count":8442479,
>   "Fast Write Count":8969912
> }
>   }
> ]
> 
> The patchset is dependent on existing patch-set "[PATCH v7 0/5]
> powerpc/papr_scm: Add support for reporting nvdimm health" available
> at Ref[2] that adds support for reporting PAPR compliant NVDIMMs in
> 'papr_scm' kernel module.
> 
> Structure of the patch-set
> ==========================
> 
> The patch-set starts with implementing functionality in papr_scm
> module to issue H_SCM_PERFORMANCE_STATS hcall, fetch & parse dimm
> performance stats and exposing them as a PAPR specific libnvdimm
> attribute named 'perf_stats'
> 
> Patch-2 introduces a new PDSM named FETCH_PERF_STATS that can be
> issued by libndctl asking papr_scm to issue the
> H_SCM_PERFORMANCE_STATS hcall using helpers introduced earlier and
> storing the results in a dimm specific perf-stats-buffer.
> 
> Patch-3 introduces a new PDSM named READ_PERF_STATS that can be
> issued by libndctl to read the perf-stats-buffer in an incremental
> manner to work around the 256-bytes envelope limitation of libnvdimm.
> 
> Finally Patch-4 introduces a new PDSM named GET_PERF_STAT that can be
> issued by libndctl to read values of a specific NVDIMM performance
> stat like "Life Remaining".
> 
> References
> ==========
> [1] Documentation/powerpc/papr_hcals.rst
> 
> [2] 
> https://lore.kernel.org/linux-nvdimm/20200508104922.72565-1-vaib...@linux.ibm.com/
> 
> [3] https://github.com/vaibhav92/ndctl/tree/papr_scm_stats_v1
> 
> Vaibhav Jain (4):
>   powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
>   powerpc/papr_scm: Add support for PAPR_SCM_PDSM_FETCH_PERF_STATS
>   powerpc/papr_scm: Implement support for PAPR_SCM_PDSM_READ_PERF_STATS
>   powerpc/papr_scm: Add support for PDSM GET_PERF_STAT
> 
>  Documentation/ABI/testing/sysfs-bus-papr-scm  |  27 ++
>  arch/powerpc/include/uapi/asm/papr_scm_pdsm.h |  60 +++
>  arch/powerpc/platforms/pseries/papr_scm.c | 391 ++
>  3 files changed, 478 insertions(+)
> 
> -- 
> 2.26.2
> 


Re: [PATCH 3/3] namespace-action: Drop zero namespace checks.

2020-09-01 Thread Michal Suchánek
On Fri, Aug 28, 2020 at 11:51:05AM +0200, Michal Suchanek wrote:
> From: Santosh Sivaraj 
> 
> With seed namespaces caught early on, these size checks in the enable
> and destroy namespace code paths are not needed.

This effectively reverts the two commits below:

Fixes: b9cb03f6d5a8 ("ndctl/namespace: Fix enable-namespace error for seed namespaces")
Fixes: e01045e58ad5 ("ndctl/namespace: Fix destroy-namespace accounting relative to seed devices")
> Link: https://patchwork.kernel.org/patch/11739975/
> Signed-off-by: Santosh Sivaraj 
> [rebased on top of the previous patches]
> Signed-off-by: Michal Suchanek 
> ---
>  ndctl/lib/libndctl.c |  5 -
>  ndctl/namespace.c| 11 ---
>  2 files changed, 16 deletions(-)
> 
> diff --git a/ndctl/lib/libndctl.c b/ndctl/lib/libndctl.c
> index 952192c4c6b5..ecced5a3ae0b 100644
> --- a/ndctl/lib/libndctl.c
> +++ b/ndctl/lib/libndctl.c
> @@ -4253,16 +4253,11 @@ NDCTL_EXPORT int ndctl_namespace_enable(struct 
> ndctl_namespace *ndns)
>   const char *devname = ndctl_namespace_get_devname(ndns);
>   struct ndctl_ctx *ctx = ndctl_namespace_get_ctx(ndns);
>   struct ndctl_region *region = ndns->region;
> - unsigned long long size = ndctl_namespace_get_size(ndns);
>   int rc;
>  
>   if (ndctl_namespace_is_enabled(ndns))
>   return 0;
>  
> - /* Don't try to enable idle namespace (no capacity allocated) */
> - if (size == 0)
> - return -ENXIO;
> -
>   rc = ndctl_bind(ctx, ndns->module, devname);
>  
>   /*
> diff --git a/ndctl/namespace.c b/ndctl/namespace.c
> index 835f4076008a..65bca9191603 100644
> --- a/ndctl/namespace.c
> +++ b/ndctl/namespace.c
> @@ -1094,7 +1094,6 @@ static int namespace_destroy(struct ndctl_region 
> *region,
>   struct ndctl_namespace *ndns)
>  {
>   const char *devname = ndctl_namespace_get_devname(ndns);
> - unsigned long long size;
>   bool did_zero = false;
>   int rc;
>  
> @@ -1139,19 +1138,9 @@ static int namespace_destroy(struct ndctl_region 
> *region,
>   goto out;
>   }
>  
> - size = ndctl_namespace_get_size(ndns);
> -
>   rc = ndctl_namespace_delete(ndns);
>   if (rc)
>   debug("%s: failed to reclaim\n", devname);
> -
> - /*
> -  * Don't report a destroyed namespace when no capacity was
> -  * allocated.
> -  */
> - if (size == 0 && rc == 0)
> - rc = 1;
> -
>  out:
>   return rc;
>  }
> -- 
> 2.28.0
> 


Re: dm writecache: reject asynchronous pmem.

2020-06-30 Thread Michal Suchánek
On Tue, Jun 30, 2020 at 09:36:33AM -0400, Mike Snitzer wrote:
> On Tue, Jun 30 2020 at 10:10am -0400,
> Michal Suchánek  wrote:
> 
> > On Tue, Jun 30, 2020 at 09:32:01AM -0400, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Tue, 30 Jun 2020, Michal Suchanek wrote:
> > > 
> > > > The writecache driver does not handle asynchronous pmem. Reject it when
> > > > supplied as cache.
> > > > 
> > > > Link: https://lore.kernel.org/linux-nvdimm/87lfk5hahc@linux.ibm.com/
> > > > Fixes: 6e84200c0a29 ("virtio-pmem: Add virtio pmem driver")
> > > > 
> > > > Signed-off-by: Michal Suchanek 
> > > > ---
> > > >  drivers/md/dm-writecache.c | 6 ++
> > > >  1 file changed, 6 insertions(+)
> > > > 
> > > > diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> > > > index 30505d70f423..57b0a972f6fd 100644
> > > > --- a/drivers/md/dm-writecache.c
> > > > +++ b/drivers/md/dm-writecache.c
> > > > @@ -2277,6 +2277,12 @@ static int writecache_ctr(struct dm_target *ti, 
> > > > unsigned argc, char **argv)
> > > >  
> > > > wc->memory_map_size -= (uint64_t)wc->start_sector << 
> > > > SECTOR_SHIFT;
> > > >  
> > > > +   if (!dax_synchronous(wc->ssd_dev->dax_dev)) {
> > > > +   r = -EOPNOTSUPP;
> > > > +   ti->error = "Asynchronous persistent memory not 
> > > > supported as pmem cache";
> > > > +   goto bad;
> > > > +   }
> > > > +
> > > > bio_list_init(&wc->flush_list);
> > > > wc->flush_thread = 
> > > > kthread_create(writecache_flush_thread, wc, "dm_writecache_flush");
> > > > if (IS_ERR(wc->flush_thread)) {
> > > > -- 
> > > 
> > > Hi
> > > 
> > > Shouldn't this be in the "if (WC_MODE_PMEM(wc))" block?
> > That should always be the case at this point.
> > > 
> > > WC_MODE_PMEM(wc) returns true if we are using persistent memory as a
> > > cache device, otherwise we are using a generic block device as a cache
> > > device.
> >
> > This is to prevent the situation where we have WC_MODE_PMEM(wc) but
> > cannot guarantee consistency because the async flush is not handled.
> 
> The writecache operates in 2 modes.  SSD or PMEM.  Mikulas is saying
> your dax_synchronous() check should go within a WC_MODE_PMEM(wc) block
> because it doesn't make sense to do the check when in SSD mode.

Indeed, it's in the wrong if/else branch.

Thanks

Michal


Re: [PATCH] dm writecache: reject asynchronous pmem.

2020-06-30 Thread Michal Suchánek
On Tue, Jun 30, 2020 at 09:32:01AM -0400, Mikulas Patocka wrote:
> 
> 
> On Tue, 30 Jun 2020, Michal Suchanek wrote:
> 
> > The writecache driver does not handle asynchronous pmem. Reject it when
> > supplied as cache.
> > 
> > Link: https://lore.kernel.org/linux-nvdimm/87lfk5hahc@linux.ibm.com/
> > Fixes: 6e84200c0a29 ("virtio-pmem: Add virtio pmem driver")
> > 
> > Signed-off-by: Michal Suchanek 
> > ---
> >  drivers/md/dm-writecache.c | 6 ++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> > index 30505d70f423..57b0a972f6fd 100644
> > --- a/drivers/md/dm-writecache.c
> > +++ b/drivers/md/dm-writecache.c
> > @@ -2277,6 +2277,12 @@ static int writecache_ctr(struct dm_target *ti, 
> > unsigned argc, char **argv)
> >  
> > wc->memory_map_size -= (uint64_t)wc->start_sector << 
> > SECTOR_SHIFT;
> >  
> > +   if (!dax_synchronous(wc->ssd_dev->dax_dev)) {
> > +   r = -EOPNOTSUPP;
> > +   ti->error = "Asynchronous persistent memory not 
> > supported as pmem cache";
> > +   goto bad;
> > +   }
> > +
> > bio_list_init(&wc->flush_list);
> > wc->flush_thread = kthread_create(writecache_flush_thread, wc, 
> > "dm_writecache_flush");
> > if (IS_ERR(wc->flush_thread)) {
> > -- 
> 
> Hi
> 
> Shouldn't this be in the "if (WC_MODE_PMEM(wc))" block?
That should always be the case at this point.
> 
> WC_MODE_PMEM(wc) returns true if we are using persistent memory as a
> cache device, otherwise we are using a generic block device as a cache
> device.
This is to prevent the situation where we have WC_MODE_PMEM(wc) but
cannot guarantee consistency because the async flush is not handled.

Thanks

Michal


Re: [PATCH v6 6/8] powerpc/pmem: Avoid the barrier in flush routines

2020-06-30 Thread Michal Suchánek
On Mon, Jun 29, 2020 at 06:50:15PM -0700, Dan Williams wrote:
> On Mon, Jun 29, 2020 at 1:41 PM Aneesh Kumar K.V
>  wrote:
> >
> > Michal Suchánek  writes:
> >
> > > Hello,
> > >
> > > On Mon, Jun 29, 2020 at 07:27:20PM +0530, Aneesh Kumar K.V wrote:
> > >> nvdimm expects the flush routines to just mark the cache clean. The
> > >> barrier that makes the store globally visible is done in nvdimm_flush().
> > >>
> > >> Update the papr_scm driver to a simplified nvdimm_flush callback that
> > >> does only the required barrier.
> > >>
> > >> Signed-off-by: Aneesh Kumar K.V 
> > >> ---
> > >>  arch/powerpc/lib/pmem.c   |  6 --
> > >>  arch/powerpc/platforms/pseries/papr_scm.c | 13 +
> > >>  2 files changed, 13 insertions(+), 6 deletions(-)
> > >>
> > >> diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c
> > >> index 5a61aaeb6930..21210fa676e5 100644
> > >> --- a/arch/powerpc/lib/pmem.c
> > >> +++ b/arch/powerpc/lib/pmem.c
> > >> @@ -19,9 +19,6 @@ static inline void __clean_pmem_range(unsigned long 
> > >> start, unsigned long stop)
> > >>
> > >>  for (i = 0; i < size >> shift; i++, addr += bytes)
> > >>  asm volatile(PPC_DCBSTPS(%0, %1): :"i"(0), "r"(addr): 
> > >> "memory");
> > >> -
> > >> -
> > >> -asm volatile(PPC_PHWSYNC ::: "memory");
> > >>  }
> > >>
> > >>  static inline void __flush_pmem_range(unsigned long start, unsigned 
> > >> long stop)
> > >> @@ -34,9 +31,6 @@ static inline void __flush_pmem_range(unsigned long 
> > >> start, unsigned long stop)
> > >>
> > >>  for (i = 0; i < size >> shift; i++, addr += bytes)
> > >>  asm volatile(PPC_DCBFPS(%0, %1): :"i"(0), "r"(addr): 
> > >> "memory");
> > >> -
> > >> -
> > >> -asm volatile(PPC_PHWSYNC ::: "memory");
> > >>  }
> > >>
> > >>  static inline void clean_pmem_range(unsigned long start, unsigned long 
> > >> stop)
> > >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> > >> b/arch/powerpc/platforms/pseries/papr_scm.c
> > >> index 9c569078a09f..9a9a0766f8b6 100644
> > >> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> > >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> > >> @@ -630,6 +630,18 @@ static int papr_scm_ndctl(struct 
> > >> nvdimm_bus_descriptor *nd_desc,
> > >>
> > >>  return 0;
> > >>  }
> > >> +/*
> > >> + * We have made sure the pmem writes are done such that before calling
> > >> + * this all the caches are flushed/clean. We use dcbf/dcbfps to ensure
> > >> + * this. Here we just need to add the necessary barrier to make sure the
> > >> + * above flushes have updated persistent storage before any data access
> > >> + * or data transfer caused by subsequent instructions is initiated.
> > >> + */
> > >> +static int papr_scm_flush_sync(struct nd_region *nd_region, struct bio 
> > >> *bio)
> > >> +{
> > >> +arch_pmem_flush_barrier();
> > >> +return 0;
> > >> +}
> > >>
> > >>  static ssize_t flags_show(struct device *dev,
> > >>struct device_attribute *attr, char *buf)
> > >> @@ -743,6 +755,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv 
> > >> *p)
> > >>  ndr_desc.mapping = &mapping;
> > >>  ndr_desc.num_mappings = 1;
> > >>  ndr_desc.nd_set = &p->nd_set;
> > >> +ndr_desc.flush = papr_scm_flush_sync;
> > >
> > > AFAICT currently the only device that implements flush is virtio_pmem.
> > > How does the nfit driver get away without implementing flush?
> >
> > generic_nvdimm_flush does the required barrier for nfit. The reason for
> > adding ndr_desc.flush call back for papr_scm was to avoid the usage
> > of iomem based deep flushing (ndr_region_data.flush_wpq) which is not
> > supported by papr_scm.
> >
> > BTW we do return NULL for ndrd_get_flush_wpq() on power. So the upstream
> > code also does the same thing, but in a different way.
> >
> >
> > > Also the flush takes arguments that are completely unused but a user of
> > > the pmem region must assume they are used, and call flush() on the
> > > region rather than arch_pmem_flush_barrier() directly.
> >
> > The bio argument can help a pmem driver to do range based flushing in
> > case of pmem_make_request. If bio is null then we must assume a full
> > device flush.
> 
> The bio argument isn't for range based flushing, it is for flush
> operations that need to complete asynchronously.
How does the block layer determine that the pmem device needs
asynchronous flushing?

The flush() callback was designed for that purpose with the bio argument,
and only virtio_pmem, which is flushed asynchronously, used it. Now that
papr_scm reuses it for a different purpose, how do you tell?

Thanks

Michal


Re: [PATCH v6 6/8] powerpc/pmem: Avoid the barrier in flush routines

2020-06-29 Thread Michal Suchánek
Hello,

On Mon, Jun 29, 2020 at 07:27:20PM +0530, Aneesh Kumar K.V wrote:
> nvdimm expects the flush routines to just mark the cache clean. The barrier
> that makes the store globally visible is done in nvdimm_flush().
> 
> Update the papr_scm driver to a simplified nvdimm_flush callback that does
> only the required barrier.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/lib/pmem.c   |  6 --
>  arch/powerpc/platforms/pseries/papr_scm.c | 13 +
>  2 files changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c
> index 5a61aaeb6930..21210fa676e5 100644
> --- a/arch/powerpc/lib/pmem.c
> +++ b/arch/powerpc/lib/pmem.c
> @@ -19,9 +19,6 @@ static inline void __clean_pmem_range(unsigned long start, 
> unsigned long stop)
>  
>   for (i = 0; i < size >> shift; i++, addr += bytes)
>   asm volatile(PPC_DCBSTPS(%0, %1): :"i"(0), "r"(addr): "memory");
> -
> -
> - asm volatile(PPC_PHWSYNC ::: "memory");
>  }
>  
>  static inline void __flush_pmem_range(unsigned long start, unsigned long 
> stop)
> @@ -34,9 +31,6 @@ static inline void __flush_pmem_range(unsigned long start, 
> unsigned long stop)
>  
>   for (i = 0; i < size >> shift; i++, addr += bytes)
>   asm volatile(PPC_DCBFPS(%0, %1): :"i"(0), "r"(addr): "memory");
> -
> -
> - asm volatile(PPC_PHWSYNC ::: "memory");
>  }
>  
>  static inline void clean_pmem_range(unsigned long start, unsigned long stop)
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index 9c569078a09f..9a9a0766f8b6 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -630,6 +630,18 @@ static int papr_scm_ndctl(struct nvdimm_bus_descriptor 
> *nd_desc,
>  
>   return 0;
>  }
> +/*
> + * We have made sure the pmem writes are done such that before calling this
> + * all the caches are flushed/clean. We use dcbf/dcbfps to ensure this. Here
> + * we just need to add the necessary barrier to make sure the above flushes
> + * have updated persistent storage before any data access or data transfer
> + * caused by subsequent instructions is initiated.
> + */
> +static int papr_scm_flush_sync(struct nd_region *nd_region, struct bio *bio)
> +{
> + arch_pmem_flush_barrier();
> + return 0;
> +}
>  
>  static ssize_t flags_show(struct device *dev,
> struct device_attribute *attr, char *buf)
> @@ -743,6 +755,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>   ndr_desc.mapping = &mapping;
>   ndr_desc.num_mappings = 1;
>   ndr_desc.nd_set = &p->nd_set;
> + ndr_desc.flush = papr_scm_flush_sync;

AFAICT currently the only device that implements flush is virtio_pmem.
How does the nfit driver get away without implementing flush?
Also the flush takes arguments that are completely unused, but a user of
the pmem region must assume they are used, and call flush() on the
region rather than arch_pmem_flush_barrier() directly. This may not
work well with md, as discussed in an earlier iteration of the patchset.

Thanks

Michal


Re: [PATCH v2 3/5] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-06-26 Thread Michal Suchánek
On Fri, May 22, 2020 at 09:01:17AM -0400, Mikulas Patocka wrote:
> 
> 
> On Fri, 22 May 2020, Aneesh Kumar K.V wrote:
> 
> > On 5/22/20 3:01 PM, Michal Suchánek wrote:
> > > On Thu, May 21, 2020 at 02:52:30PM -0400, Mikulas Patocka wrote:
> > > > 
> > > > 
> > > > On Thu, 21 May 2020, Dan Williams wrote:
> > > > 
> > > > > On Thu, May 21, 2020 at 10:03 AM Aneesh Kumar K.V
> > > > >  wrote:
> > > > > > 
> > > > > > > Moving on to the patch itself--Aneesh, have you audited other
> > > > > > > persistent
> > > > > > > memory users in the kernel?  For example, 
> > > > > > > drivers/md/dm-writecache.c
> > > > > > > does
> > > > > > > this:
> > > > > > > 
> > > > > > > static void writecache_commit_flushed(struct dm_writecache *wc, 
> > > > > > > bool
> > > > > > > wait_for_ios)
> > > > > > > {
> > > > > > >if (WC_MODE_PMEM(wc))
> > > > > > >wmb(); <==
> > > > > > >   else
> > > > > > >   ssd_commit_flushed(wc, wait_for_ios);
> > > > > > > }
> > > > > > > 
> > > > > > > I believe you'll need to make modifications there.
> > > > > > > 
> > > > > > 
> > > > > > Correct. Thanks for catching that.
> > > > > > 
> > > > > > 
> > > > > > I don't understand dm much, wondering how this will work with
> > > > > > non-synchronous DAX device?
> > > > > 
> > > > > That's a good point. DM-writecache needs to be cognizant of things
> > > > > like virtio-pmem that violate the rule that persisent memory writes
> > > > > can be flushed by CPU functions rather than calling back into the
> > > > > driver. It seems we need to always make the flush case a dax_operation
> > > > > callback to account for this.
> > > > 
> > > > dm-writecache is normally sitting on the top of dm-linear, so it would
> > > > need to pass the wmb() call through the dm core and dm-linear target ...
> > > > that would slow it down ... I remember that you already did it this way
> > > > some times ago and then removed it.
> > > > 
> > > > What's the exact problem with POWER? Could the POWER system have two 
> > > > types
> > > > of persistent memory that need two different ways of flushing?
> > > 
> > > As far as I understand the discussion so far
> > > 
> > >   - on POWER $oldhardware uses $oldinstruction to ensure pmem consistency
> > >   - on POWER $newhardware uses $newinstruction to ensure pmem consistency
> > > (compatible with $oldinstruction on $oldhardware)
> > 
> > Correct.
> > 
> > >   - on some platforms instead of barrier instruction a callback into the
> > > driver is issued to ensure consistency 
> > 
> > This is virtio-pmem only at this point IIUC.
> > 
> > -aneesh
> 
> And does the virtio-pmem driver track which pages are dirty? Or does it 
> need to specify the range of pages to flush in the flush function?
> 
> > > None of this is reflected by the dm driver.
> 
> We could make a new dax method:
> void *(dax_get_flush_function)(void);
> 
> This would return a pointer to "wmb()" on x86 and something else on Power.
> 
> The method "dax_get_flush_function" would be called only once when 
> initializing the writecache driver (because the call would be slow because 
> it would have to go through the DM stack) and then, the returned function 
> would be called each time we need write ordering. The returned function 
> would do just "sfence; ret".

Hello,

As far as I understand the code, virtio_pmem has a flush function
defined which indeed can make use of the region properties, such as the
memory range. If such a function exists you need the equivalent of
sync() - a call into the device in question. If it does not, calling
arch_pmem_flush_barrier() instead of wmb() should suffice.

I am not aware of an interface to determine if the flush function
exists for a particular region.

Thanks

Michal


Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.

2020-06-01 Thread Michal Suchánek
On Mon, Jun 01, 2020 at 05:31:50PM +0530, Aneesh Kumar K.V wrote:
> On 6/1/20 3:39 PM, Jan Kara wrote:
> > On Fri 29-05-20 16:25:35, Aneesh Kumar K.V wrote:
> > > On 5/29/20 3:22 PM, Jan Kara wrote:
> > > > On Fri 29-05-20 15:07:31, Aneesh Kumar K.V wrote:
> > > > > Thanks Michal. I also missed Jeff in this email thread.
> > > > 
> > > > And I think you'll also need some of the sched maintainers for the prctl
> > > > bits...
> > > > 
> > > > > On 5/29/20 3:03 PM, Michal Suchánek wrote:
> > > > > > Adding Jan
> > > > > > 
> > > > > > On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
> > > > > > > With POWER10, architecture is adding new pmem flush and sync 
> > > > > > > instructions.
> > > > > > > The kernel should prevent the usage of MAP_SYNC if applications 
> > > > > > > are not using
> > > > > > > the new instructions on newer hardware.
> > > > > > > 
> > > > > > > This patch adds a prctl option MAP_SYNC_ENABLE that can be used 
> > > > > > > to enable
> > > > > > > the usage of MAP_SYNC. The kernel config option is added to allow 
> > > > > > > the user
> > > > > > > to control whether MAP_SYNC should be enabled by default or not.
> > > > > > > 
> > > > > > > Signed-off-by: Aneesh Kumar K.V 
> > > > ...
> > > > > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > > > index 8c700f881d92..d5a9a363e81e 100644
> > > > > > > --- a/kernel/fork.c
> > > > > > > +++ b/kernel/fork.c
> > > > > > > @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp 
> > > > > > > DEFINE_SPINLOCK(mmlist_lock);
> > > > > > > static unsigned long default_dump_filter = 
> > > > > > > MMF_DUMP_FILTER_DEFAULT;
> > > > > > > +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
> > > > > > > +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
> > > > > > > +#else
> > > > > > > +unsigned long default_map_sync_mask = 0;
> > > > > > > +#endif
> > > > > > > +
> > > > 
> > > > I'm not sure CONFIG is really the right approach here. For a distro 
> > > > that would
> > > > basically mean to disable MAP_SYNC for all PPC kernels unless 
> > > > application
> > > > explicitly uses the right prctl. Shouldn't we rather initialize
> > > > default_map_sync_mask on boot based on whether the CPU we run on 
> > > > requires
> > > > new flush instructions or not? Otherwise the patch looks sensible.
> > > > 
> > > 
> > > yes that is correct. We ideally want to deny MAP_SYNC only w.r.t POWER10.
> > > But on a virtualized platform there is no easy way to detect that. We 
> > > could
> > > ideally hook this into the nvdimm driver where we look at the new compat
> > > string ibm,persistent-memory-v2 and then disable MAP_SYNC
> > > if we find a device with the specific value.
> > 
> > Hum, couldn't we set some flag for nvdimm devices with
> > "ibm,persistent-memory-v2" property and then check it during mmap(2) time
> > and when the device has this propery and the mmap(2) caller doesn't have
> > the prctl set, we'd disallow MAP_SYNC? That should make things mostly
> > seamless, shouldn't it? Only apps that want to use MAP_SYNC on these
> > devices would need to use prctl(MMF_DISABLE_MAP_SYNC, 0) but then these
> > applications need to be aware of new instructions so this isn't that much
> > additional burden...
> 
> I am not sure applications would want to embed that much platform-specific
> knowledge in their code. I was expecting applications to do
> 
> #ifdef __ppc64__
> prctl(PR_SET_MAP_SYNC_ENABLE, 1, 0, 0, 0);
> #endif
> a = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE,
> MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
> 
> 
> For that code, all the complexity we add w.r.t. ibm,persistent-memory-v2
> is not useful. Do you see value in making all this device-specific rather
> than conditional on __ppc64__?
If the vpmem devices continue to work with the old instruction on
POWER10 then it makes sense to make this per-device.

Also, adding a message to the kernel log when the application does not do
the prctl would be helpful for people migrating old code to POWER10.

Thanks

Michal
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.

2020-05-29 Thread Michal Suchánek
Adding Jan

On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
> With POWER10, the architecture adds new pmem flush and sync instructions.
> The kernel should prevent the usage of MAP_SYNC if applications are not
> using the new instructions on newer hardware.
> 
> This patch adds a prctl option MAP_SYNC_ENABLE that can be used to enable
> the usage of MAP_SYNC. The kernel config option is added to allow the user
> to control whether MAP_SYNC should be enabled by default or not.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  include/linux/sched/coredump.h | 13 ++---
>  include/uapi/linux/prctl.h |  3 +++
>  kernel/fork.c  |  8 +++-
>  kernel/sys.c   | 18 ++
>  mm/Kconfig |  3 +++
>  mm/mmap.c  |  4 
>  6 files changed, 45 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> index ecdc6542070f..9ba6b3d5f991 100644
> --- a/include/linux/sched/coredump.h
> +++ b/include/linux/sched/coredump.h
> @@ -72,9 +72,16 @@ static inline int get_dumpable(struct mm_struct *mm)
>  #define MMF_DISABLE_THP  24  /* disable THP for all VMAs */
>  #define MMF_OOM_VICTIM   25  /* mm is the oom victim */
>  #define MMF_OOM_REAP_QUEUED  26  /* mm was queued for oom_reaper */
> -#define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP)
> +#define MMF_DISABLE_MAP_SYNC 27  /* disable MAP_SYNC for all VMAs */
> +#define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP)
> +#define MMF_DISABLE_MAP_SYNC_MASK (1 << MMF_DISABLE_MAP_SYNC)
>  
> -#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK | \
> -   MMF_DISABLE_THP_MASK)
> +#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK | \
> +   MMF_DISABLE_THP_MASK | MMF_DISABLE_MAP_SYNC_MASK)
> +
> +static inline bool map_sync_enabled(struct mm_struct *mm)
> +{
> + return !(mm->flags & MMF_DISABLE_MAP_SYNC_MASK);
> +}
>  
>  #endif /* _LINUX_SCHED_COREDUMP_H */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 07b4f8131e36..ee4cde32d5cf 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -238,4 +238,7 @@ struct prctl_mm_map {
>  #define PR_SET_IO_FLUSHER 57
>  #define PR_GET_IO_FLUSHER 58
>  
> +#define PR_SET_MAP_SYNC_ENABLE   59
> +#define PR_GET_MAP_SYNC_ENABLE   60
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 8c700f881d92..d5a9a363e81e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(mmlist_lock);
>  
>  static unsigned long default_dump_filter = MMF_DUMP_FILTER_DEFAULT;
>  
> +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
> +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
> +#else
> +unsigned long default_map_sync_mask = 0;
> +#endif
> +
>  static int __init coredump_filter_setup(char *s)
>  {
>   default_dump_filter =
> @@ -1039,7 +1045,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>   mm->flags = current->mm->flags & MMF_INIT_MASK;
>   mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
>   } else {
> - mm->flags = default_dump_filter;
> + mm->flags = default_dump_filter | default_map_sync_mask;
>   mm->def_flags = 0;
>   }
>  
> diff --git a/kernel/sys.c b/kernel/sys.c
> index d325f3ab624a..f6127cf4128b 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2450,6 +2450,24 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>   clear_bit(MMF_DISABLE_THP, &me->mm->flags);
>   up_write(&me->mm->mmap_sem);
>   break;
> +
> + case PR_GET_MAP_SYNC_ENABLE:
> + if (arg2 || arg3 || arg4 || arg5)
> + return -EINVAL;
> + error = !test_bit(MMF_DISABLE_MAP_SYNC, &me->mm->flags);
> + break;
> + case PR_SET_MAP_SYNC_ENABLE:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + if (down_write_killable(&me->mm->mmap_sem))
> + return -EINTR;
> + if (arg2)
> + clear_bit(MMF_DISABLE_MAP_SYNC, &me->mm->flags);
> + else
> + set_bit(MMF_DISABLE_MAP_SYNC, &me->mm->flags);
> + up_write(&me->mm->mmap_sem);
> + break;
> +
>   case PR_MPX_ENABLE_MANAGEMENT:
>   case PR_MPX_DISABLE_MANAGEMENT:
>   /* No longer implemented: */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index c1acc34c1c35..38fd7cfbfca8 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -867,4 +867,7 @@ config ARCH_HAS_HUGEPD
>  config MAPPING_DIRTY_HELPERS
>  bool
>  
> +config ARCH_MAP_SYNC_DISABLE
> +

Re: [PATCH v2 3/5] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-05-22 Thread Michal Suchánek
On Thu, May 21, 2020 at 02:52:30PM -0400, Mikulas Patocka wrote:
> 
> 
> On Thu, 21 May 2020, Dan Williams wrote:
> 
> > On Thu, May 21, 2020 at 10:03 AM Aneesh Kumar K.V
> >  wrote:
> > >
> > > > Moving on to the patch itself--Aneesh, have you audited other persistent
> > > > memory users in the kernel?  For example, drivers/md/dm-writecache.c 
> > > > does
> > > > this:
> > > >
> > > > static void writecache_commit_flushed(struct dm_writecache *wc, bool 
> > > > wait_for_ios)
> > > > {
> > > >   if (WC_MODE_PMEM(wc))
> > > >   wmb(); <==
> > > >  else
> > > >  ssd_commit_flushed(wc, wait_for_ios);
> > > > }
> > > >
> > > > I believe you'll need to make modifications there.
> > > >
> > >
> > > Correct. Thanks for catching that.
> > >
> > >
> > > I don't understand dm much; I am wondering how this will work with a
> > > non-synchronous DAX device?
> > 
> > That's a good point. DM-writecache needs to be cognizant of things
> > like virtio-pmem that violate the rule that persistent memory writes
> > can be flushed by CPU functions rather than calling back into the
> > driver. It seems we need to always make the flush case a dax_operation
> > callback to account for this.
> 
> dm-writecache is normally sitting on the top of dm-linear, so it would 
> need to pass the wmb() call through the dm core and dm-linear target ... 
> that would slow it down ... I remember that you already did it this way 
> some time ago and then removed it.
> 
> What's the exact problem with POWER? Could the POWER system have two types 
> of persistent memory that need two different ways of flushing?

As far as I understand the discussion so far

 - on POWER $oldhardware uses $oldinstruction to ensure pmem consistency
 - on POWER $newhardware uses $newinstruction to ensure pmem consistency
   (compatible with $oldinstruction on $oldhardware)
 - on some platforms instead of barrier instruction a callback into the
   driver is issued to ensure consistency

None of this is reflected by the dm driver.

Thanks

Michal


Re: [ndctl PATCH v2 14/26] ndctl/namespace: Handle 'create-namespace' in label-less mode

2019-08-21 Thread Michal Suchánek
On Wed, 21 Aug 2019 18:03:18 +
"Verma, Vishal L"  wrote:

> On Wed, 2019-08-21 at 14:56 +0200, Michal Suchánek wrote:
> > Hello,
> > 
> > this patch is marked as superseded in the patchwork.
> > 
> > What is the intended replacement?
> >   
> 
> Hi Michal,
> 
> The patch was superseded by v3 of the series, and is present in the
> latest release (v66):
> 
> 7966c92 ndctl/namespace: Handle 'create-namespace' in label-less mode
> 
>   -Vishal

I see, it was already merged as part of the updated series. I missed that.

Thanks

Michal


Re: [ndctl PATCH v2 14/26] ndctl/namespace: Handle 'create-namespace' in label-less mode

2019-08-21 Thread Michal Suchánek
Hello,

this patch is marked as superseded in the patchwork.

What is the intended replacement?

Thanks

Michal

On Sat, 27 Jul 2019 14:40:36 -0700
Dan Williams  wrote:

> A common confusion with ndctl is that 'create-namespace' does not work
> in the label-less case. In the label-less case there is no capacity to
> allocate as the size is already hard-coded by the region boundary.
> 
> However, users typically do something like the following in the
> label-less case:
> 
> # ndctl list
> {
>   "dev":"namespace1.0",
>   "mode":"raw",
>   "size":"127.00 GiB (136.37 GB)",
>   "sector_size":512,
>   "blockdev":"pmem1"
> }
> 
> # ndctl destroy-namespace namespace1.0 -f
> destroyed 1 namespace
> 
> # ndctl create-namespace
> failed to create namespace: Resource temporarily unavailable
> 
> In other words they destroy the raw mode namespace that they don't want,
> and seek to create a new default configuration namespace. Since there is
> no "available_capacity" in the label-less case the 'create' attempt
> fails.
> 
> Fix this by recognizing that the user wants a default sized namespace
> and just reconfigure the raw namespace.
> 
> Signed-off-by: Dan Williams 
> ---
>  ndctl/namespace.c |   10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/ndctl/namespace.c b/ndctl/namespace.c
> index 58fec194ab94..e5a2b1341cdb 100644
> --- a/ndctl/namespace.c
> +++ b/ndctl/namespace.c
> @@ -837,9 +837,13 @@ static int namespace_create(struct ndctl_region *region)
>   return -EAGAIN;
>   }
>  
> - available = ndctl_region_get_max_available_extent(region);
> - if (available == ULLONG_MAX)
> - available = ndctl_region_get_available_size(region);
> + if (ndctl_region_get_nstype(region) == ND_DEVICE_NAMESPACE_IO)
> + available = ndctl_region_get_size(region);
> + else {
> + available = ndctl_region_get_max_available_extent(region);
> + if (available == ULLONG_MAX)
> + available = ndctl_region_get_available_size(region);
> + }
>   if (!available || p.size > available) {
>   debug("%s: insufficient capacity size: %llx avail: %llx\n",
>   devname, p.size, available);



Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default

2019-03-06 Thread Michal Suchánek
On Wed, 06 Mar 2019 14:47:33 +0530
"Aneesh Kumar K.V"  wrote:

> Dan Williams  writes:
> 
> > On Thu, Feb 28, 2019 at 1:40 AM Oliver  wrote:  
> >>
> >> On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
> >>  wrote:  
 
> Also, even if the user decided not to use THP, by
> echo "never" > transparent_hugepage/enabled , we should continue to map
> dax faults using huge pages on platforms that can support huge pages.

Is this a good idea?

This knob is there for a reason. In some situations having huge pages
can severely impact performance of the system (due to host-guest
interaction or whatever) and the ability to really turn off all THP
would be important in those cases, right?

Thanks

Michal