Re: [PATCH v2 2/4] dax/bus.c: fix locking for unregister_dax_dev / unregister_dax_mapping paths

2024-04-29 Thread Verma, Vishal L
On Mon, 2024-04-29 at 18:25 -0700, Dan Williams wrote:
> Vishal Verma wrote:
> > Commit c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
> > rwsem")
> > was a bit overzealous in eliminating device_lock() usage, and ended up
> > removing a couple of lock acquisitions that were needed. Restore those,
> > and fix the conditional locking missteps that the above commit
> > introduced in unregister_dax_dev() and unregister_dax_mapping().
> 
> I think it makes sense to tell the story a bit about why the
> delete_store() conversion was problematic, because the
> unregister_dev_dax() changes were just a knock-on effect to fixing the
> delete_store() flow.
> 
> Something like:
> 
> ---
> commit c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
> rwsem")
> aimed to undo device_lock() abuses for protecting changes to dax-driver
> internal data-structures like the dax_region resource tree to
> device-dax-instance range structures. However, the device_lock() was 
> > legitimately
> > enforcing that devices to be deleted were not actively attached
> to any driver nor assigned any capacity from the region.
> ---
> 
> ...you can fill in a couple notes about the knock-on fixups after that
> was restored.

Sounds good, updated!

> 
> >  
> > @@ -560,15 +551,12 @@ static ssize_t delete_store(struct device *dev, 
> > struct device_attribute *attr,
> >     if (!victim)
> >     return -ENXIO;
> >  
> > -   rc = down_write_killable(&dax_region_rwsem);
> > -   if (rc)
> > -   return rc;
> > -   rc = down_write_killable(&dax_dev_rwsem);
> > -   if (rc) {
> > -   up_write(&dax_region_rwsem);
> > -   return rc;
> > -   }
> > +   device_lock(dev);
> > +   device_lock(victim);
> >     dev_dax = to_dev_dax(victim);
> > +   rc = down_write_killable(&dax_dev_rwsem);
> 
> This begs the question, why down_write_killable(), but not
> device_lock_interruptible()?

Do you mean change the device_lock()s to device_lock_interruptible() in
addition to taking the rwsem (i.e. not instead of the rwsem)?
I guess I just restored what was there previously - but the
interruptible variant makes sense, I can make that change.

> 
> I do not expect any of this is long running so likely down_write() is
> sufficient here, especially since the heaviest locks to acquire are
> already held by the time rwsem is considered.
> 
> Other than that this looks good to me:
> 
> You can include my Reviewed-by on the next posting.

Thanks for the review Dan!
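
For reference, a minimal sketch of the lock ordering being discussed for
delete_store() (victim lookup, validation and the actual deletion are
elided; whether the rwsem acquisition stays killable, or the device locks
become an interruptible variant, is still open per the exchange above):

static ssize_t delete_store(struct device *dev, struct device_attribute *attr,
			    const char *buf, size_t len)
{
	struct dax_region *dax_region = dev_get_drvdata(dev);
	struct device *victim;
	struct dev_dax *dev_dax;

	victim = device_find_child_by_name(dax_region->dev, buf);
	if (!victim)
		return -ENXIO;

	device_lock(dev);		/* dax region device */
	device_lock(victim);		/* device-dax instance being deleted */
	dev_dax = to_dev_dax(victim);

	down_write(&dax_dev_rwsem);	/* protects dev_dax range/size data */
	/* ... check victim->driver and assigned capacity, then delete ... */
	dev_dbg(dev, "deleting %s\n", dev_name(&dev_dax->dev));
	up_write(&dax_dev_rwsem);

	device_unlock(victim);
	put_device(victim);
	device_unlock(dev);

	return len;
}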



Re: [PATCH][next] dax: remove redundant assignment to variable rc

2024-04-16 Thread Verma, Vishal L
On Mon, 2024-04-15 at 11:19 +0100, Colin Ian King wrote:
> The variable rc is being assigned a value and then is being re-assigned
> a new value in the next statement. The assignment is redundant and can
> be removed.
> 
> Cleans up clang scan build warning:
> drivers/dax/bus.c:1207:2: warning: Value stored to 'rc' is never
> read [deadcode.DeadStores]
> 
> Signed-off-by: Colin Ian King 

Reviewed-by: Vishal Verma 

> ---
>  drivers/dax/bus.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 797e1ebff299..f758afbf8f09 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -1204,7 +1204,6 @@ static ssize_t mapping_store(struct device *dev, struct 
> device_attribute *attr,
>   if (rc)
>   return rc;
>  
> - rc = -ENXIO;
>   rc = down_write_killable(&dax_region_rwsem);
>   if (rc)
>   return rc;



Re: [PATCH] dax/bus.c: replace WARN_ON_ONCE() with lockdep asserts

2024-04-04 Thread Verma, Vishal L
On Thu, 2024-04-04 at 14:23 -0700, Andrew Morton wrote:
> On Tue, 02 Apr 2024 00:24:28 -0600 Vishal Verma  
> wrote:
> 
> > In [1], Dan points out that all of the WARN_ON_ONCE() usage in the
> > referenced patch should be replaced with lockdep_assert_held(_write)().
> > 
> > Replace those, and additionally, replace a couple of other
> > WARN_ON_ONCE() introduced in the same patch for actual failure
> > cases (i.e. when acquiring a semaphore fails in a remove / unregister
> > path) with dev_WARN_ONCE() as is the precedent here.
> > 
> > Recall that previously, unregistration paths were implicitly protected by
> > overloading the device lock, which the patch in [1] sought to remove.
> > This meant adding a semaphore acquisition in these unregistration paths.
> > Since that can fail, and it doesn't make sense to return errors from
> > these paths, retain the two instances of (now) dev_WARN_ONCE().
> > 
> > ...
> > 
> > @@ -471,6 +471,7 @@ static void __unregister_dev_dax(void *dev)
> >  
> >     dev_dbg(dev, "%s\n", __func__);
> >  
> > +   lockdep_assert_held_write(&dax_region_rwsem);
> >     kill_dev_dax(dev_dax);
> >     device_del(dev);
> >     free_dev_dax_ranges(dev_dax);
> 
> This is new and unchangelogged?
> 
> I'm taking Dan's reply to your patch as Not-A-Nack ;)
> 
True, but with Dan's new feedback, that results in a bit more rework,
this will likely turn into 2-3 patches. Working on it now, will be out
shortly!
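
As a concrete illustration, the two replacement patterns under discussion
look roughly like the below (a sketch following the quoted hunks; the
eventual split across patches may differ):

static void __unregister_dev_dax(void *dev)
{
	struct dev_dax *dev_dax = to_dev_dax(dev);

	/* "caller must hold" invariant: lockdep assert, not a WARN */
	lockdep_assert_held_write(&dax_region_rwsem);
	kill_dev_dax(dev_dax);
	device_del(dev);
	free_dev_dax_ranges(dev_dax);
	put_device(dev);
}

static void unregister_dev_dax(void *dev)
{
	/* acquisition can fail, but an unregister path cannot return an error */
	if (down_write_killable(&dax_region_rwsem)) {
		dev_WARN_ONCE(dev, 1, "unable to acquire region rwsem\n");
		return;
	}
	__unregister_dev_dax(dev);
	up_write(&dax_region_rwsem);
}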


Re: [PATCH v7 1/5] dax/bus.c: replace driver-core lock usage by a local rwsem

2024-03-12 Thread Verma, Vishal L
On Tue, 2024-03-12 at 13:07 -0700, Dan Williams wrote:
> Vishal Verma wrote:
> > The dax driver incorrectly used driver-core device locks to protect
> > internal dax region and dax device configuration structures. Replace the
> > device lock usage with a local rwsem, one each for dax region
> > configuration and dax device configuration. As a result of this
> > conversion, no device_lock() usage remains in dax/bus.c.
> > 
> > Cc: Dan Williams 
> > Reported-by: Greg Kroah-Hartman 
> > Signed-off-by: Vishal Verma 
> > ---
> >  drivers/dax/bus.c | 220 
> > ++
> >  1 file changed, 157 insertions(+), 63 deletions(-)
> > 
> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > index 1ff1ab5fa105..cb148f74ceda 100644
> > --- a/drivers/dax/bus.c
> > +++ b/drivers/dax/bus.c
> > @@ -12,6 +12,18 @@
> >  
> >  static DEFINE_MUTEX(dax_bus_lock);
> >  
> > +/*
> > + * All changes to the dax region configuration occur with this lock held
> > + * for write.
> > + */
> > +DECLARE_RWSEM(dax_region_rwsem);
> > +
> > +/*
> > + * All changes to the dax device configuration occur with this lock held
> > + * for write.
> > + */
> > +DECLARE_RWSEM(dax_dev_rwsem);
> > +
> >  #define DAX_NAME_LEN 30
> >  struct dax_id {
> >     struct list_head list;
> > @@ -180,7 +192,7 @@ static u64 dev_dax_size(struct dev_dax *dev_dax)
> >     u64 size = 0;
> >     int i;
> >  
> > -   device_lock_assert(&dev_dax->dev);
> > +   WARN_ON_ONCE(!rwsem_is_locked(&dax_dev_rwsem));
> 
> Apologies for the late review, but...
> 
> All of these WARN_ON_ONCE() usages should be replaced with
> lockdep_assert_held() and lockdep_assert_held_write() where appropriate.

Makes sense - I can send a patch post -rc1 to change these if that's okay 
Andrew?


Re: [PATCH] dax: remove SLAB_MEM_SPREAD flag usage

2024-02-27 Thread Verma, Vishal L
On Wed, 2024-02-28 at 11:16 +0800, Chengming Zhou wrote:
> On 2024/2/28 00:29, Dave Jiang wrote:
> > 
> > 
> > On 2/24/24 6:47 AM, chengming.z...@linux.dev wrote:
> > > From: Chengming Zhou 
> > > 
> > > The SLAB_MEM_SPREAD flag is already a no-op as of 6.8-rc1, remove
> > > its usage so we can delete it from slab. No functional change.
> > 
> > Can you please provide a Link tag to the lore post that indicates 
> > SLAB_MEM_SPREAD flag is now a no-op?
> 
> Update changelog to make it clearer:
> 
> The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
> removed as of v6.8-rc1, so it became a dead flag since the commit
> 16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the
> series[1] went on to mark it obsolete explicitly to avoid confusion
> for users. Here we can just remove all its users, which has no any
> functional change.
> 
> [1] 
> https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8...@suse.cz/
> 
> Thanks!

With the updated changelog, you can add

Reviewed-by: Vishal Verma 

> 
> > > 
> > > Signed-off-by: Chengming Zhou 
> > > ---
> > >  drivers/dax/super.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > > index 54e528779877..cff0a15b7236 100644
> > > --- a/drivers/dax/super.c
> > > +++ b/drivers/dax/super.c
> > > @@ -547,7 +547,7 @@ static int dax_fs_init(void)
> > >  
> > >   dax_cache = kmem_cache_create("dax_cache", sizeof(struct dax_device), 0,
> > >   (SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|
> > > -  SLAB_MEM_SPREAD|SLAB_ACCOUNT),
> > > +  SLAB_ACCOUNT),
> > >   init_once);
> > >   if (!dax_cache)
> > >   return -ENOMEM;



Re: [PATCH v6 2/4] dax/bus: Use guard(device) in sysfs attribute helpers

2023-12-15 Thread Verma, Vishal L
On Fri, 2023-12-15 at 09:15 -0800, Dan Williams wrote:
> Greg Kroah-Hartman wrote:
> > On Thu, Dec 14, 2023 at 10:25:27PM -0700, Vishal Verma wrote:
> > > Use the guard(device) macro to lock a 'struct device', and unlock it
> > > automatically when going out of scope using Scope Based Resource
> > > Management semantics. A lot of the sysfs attribute writes in
> > > drivers/dax/bus.c benefit from a cleanup using these, so change these
> > > where applicable.
> > 
> > Wait, why are you needing to call device_lock() at all here?  Why is dax
> > special in needing this when no other subsystem requires it?
> > 
> > > 
> > > Cc: Joao Martins 
> > > Cc: Dan Williams 
> > > Signed-off-by: Vishal Verma 
> > > ---
> > >  drivers/dax/bus.c | 143 
> > > ++
> > >  1 file changed, 59 insertions(+), 84 deletions(-)
> > > 
> > > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > > index 1ff1ab5fa105..6226de131d17 100644
> > > --- a/drivers/dax/bus.c
> > > +++ b/drivers/dax/bus.c
> > > @@ -294,13 +294,10 @@ static ssize_t available_size_show(struct device 
> > > *dev,
> > > struct device_attribute *attr, char *buf)
> > >  {
> > > struct dax_region *dax_region = dev_get_drvdata(dev);
> > > -   unsigned long long size;
> > >  
> > > -   device_lock(dev);
> > > -   size = dax_region_avail_size(dax_region);
> > > -   device_unlock(dev);
> > > +   guard(device)(dev);
> > 
> > You have a valid device here, why are you locking it?  How can it go
> > away?  And if it can, shouldn't you have a local lock for it, and not
> > abuse the driver core lock?
> 
> Yes, this is a driver-core lock abuse written by someone who should have
> known better. And yes, a local lock to protect the dax_region resource
> tree should replace this. A new rwsem to synchronize all list walks
> seems appropriate.

I see why _a_ lock is needed both here and in size_show() - the size
calculations do a walk over discontiguous ranges, and we don't want the
device to get reconfigured in the middle of that. A different local
lock seems reasonable - however can that go as a separate cleanup that
stands on its own?

For this series, I'll add a cleanup to replace the sprintfs with
sysfs_emit().
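
A rough sketch of that direction - a local region rwsem taken for read
around the discontiguous-range walk, plus the sysfs_emit() conversion
(the rwsem name is assumed from the follow-on series; this is not the
final code):

static DECLARE_RWSEM(dax_region_rwsem);

static ssize_t available_size_show(struct device *dev,
		struct device_attribute *attr, char *buf)
{
	struct dax_region *dax_region = dev_get_drvdata(dev);
	unsigned long long size;

	down_read(&dax_region_rwsem);
	size = dax_region_avail_size(dax_region);
	up_read(&dax_region_rwsem);

	return sysfs_emit(buf, "%llu\n", size);
}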


Re: [PATCH v6 2/4] dax/bus: Use guard(device) in sysfs attribute helpers

2023-12-14 Thread Verma, Vishal L
On Fri, 2023-12-15 at 05:56 +, Matthew Wilcox wrote:
> On Thu, Dec 14, 2023 at 10:25:27PM -0700, Vishal Verma wrote:
> > @@ -294,13 +294,10 @@ static ssize_t available_size_show(struct device *dev,
> > struct device_attribute *attr, char *buf)
> >  {
> > struct dax_region *dax_region = dev_get_drvdata(dev);
> > -   unsigned long long size;
> >  
> > -   device_lock(dev);
> > -   size = dax_region_avail_size(dax_region);
> > -   device_unlock(dev);
> > +   guard(device)(dev);
> >  
> > -   return sprintf(buf, "%llu\n", size);
> > +   return sprintf(buf, "%llu\n", dax_region_avail_size(dax_region));
> >  }
> 
> Is this an appropriate use of guard()?  sprintf is not the fastest of
> functions, so we will end up holding the device_lock for longer than
> we used to.

Hi Matthew,

Agreed that we end up holding the lock for a bit longer in many of
these. I'm inclined to say this is okay, since these are all user
configuration paths through sysfs, not affecting any sort of runtime
performance.

> 
> > @@ -908,9 +890,8 @@ static ssize_t size_show(struct device *dev,
> > struct dev_dax *dev_dax = to_dev_dax(dev);
> > unsigned long long size;
> >  
> > -   device_lock(dev);
> > +   guard(device)(dev);
> > size = dev_dax_size(dev_dax);
> > -   device_unlock(dev);
> >  
> > return sprintf(buf, "%llu\n", size);
> >  }
> 
> If it is appropriate, then you can do without the 'size' variable here.

Yep will remove. I suppose a lot of these can also switch to sysfs_emit
as Greg pointed out in a previous posting. I can add that as a separate
cleanup patch.
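
For example, size_show() with the guard, the local variable dropped, and
sysfs_emit() would look roughly like this (a sketch, not the final patch):

static ssize_t size_show(struct device *dev,
		struct device_attribute *attr, char *buf)
{
	struct dev_dax *dev_dax = to_dev_dax(dev);

	guard(device)(dev);
	return sysfs_emit(buf, "%llu\n", dev_dax_size(dev_dax));
}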

> 
> > @@ -1137,21 +1117,20 @@ static ssize_t mapping_store(struct device *dev, 
> > struct device_attribute *attr,
> > if (rc)
> > return rc;
> >  
> > -   rc = -ENXIO;
> > -   device_lock(dax_region->dev);
> > -   if (!dax_region->dev->driver) {
> > -   device_unlock(dax_region->dev);
> > -   return rc;
> > -   }
> > -   device_lock(dev);
> > +   guard(device)(dax_region->dev);
> > +   if (!dax_region->dev->driver)
> > +   return -ENXIO;
> >  
> > +   guard(device)(dev);
> > to_alloc = range_len();
> > -   if (alloc_is_aligned(dev_dax, to_alloc))
> > -   rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc);
> > -   device_unlock(dev);
> > -   device_unlock(dax_region->dev);
> > +   if (!alloc_is_aligned(dev_dax, to_alloc))
> > +   return -ENXIO;
> >  
> > -   return rc == 0 ? len : rc;
> > +   rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc);
> > +   if (rc)
> > +   return rc;
> > +
> > +   return len;
> >  }
> 
> Have I mentioned how much I hate the "rc" naming convention?  It tells
> you nothing useful about the contents of the variable.  If you called it
> 'err', I'd know it was an error, and then the end of this function would
> make sense.
> 
> if (err)
> return err;
> return len;
> 
I'm a little hesitant to change this because the 'rc' convention is
used all over this file, and while I don't mind making this change for
the bits I touch in this patch, it would just result in a mix of 'rc'
and 'err' in this file.


Re: [PATCH] driver core: Add a guard() definition for the device_lock()

2023-12-13 Thread Verma, Vishal L
On Wed, 2023-12-13 at 15:02 -0800, Dan Williams wrote:
> At present there are ~200 usages of device_lock() in the kernel. Some of
> those usages lead to "goto unlock;" patterns which have proven to be
> error prone. Define a "device" guard() definition to allow for those to

"Define a definition" sounds a bit awkward, perhaps  "Add a .."?

> be cleaned up and prevent new ones from appearing.
> 
> Link: 
> http://lore.kernel.org/r/657897453dda8_269bd29...@dwillia2-mobl3.amr.corp.intel.com.notmuch
> Link: 
> http://lore.kernel.org/r/6577b0c2a02df_a04c529...@dwillia2-xfh.jf.intel.com.notmuch
> Cc: Vishal Verma 
> Cc: Ira Weiny 
> Cc: Peter Zijlstra 
> Cc: Greg Kroah-Hartman 
> Cc: Andrew Morton 
> Signed-off-by: Dan Williams 

Other than that, looks good,

Reviewed-by: Vishal Verma 

> ---
> Hi Greg,
> 
> I wonder if you might include this change in v6.7-rc to ease some patch
> sets alternately going through my tree and Andrew's tree. Those
> discussions are linked above. Alternately I can can just take it through
> my tree with your ack and the other use case can circle back to it in
> the v6.9 cycle.
> 
> I considered also defining a __free() helper similar to __free(mutex),
> but I think "__free(device)" would be a surprising name for something
> that drops a lock. Also, I like the syntax of guard(device) over
> something like guard(device_lock) since a 'struct device *' is the
> argument, not a lock type, but I'm open to your or Peter's thoughts on
> the naming.
> 
>  include/linux/device.h |    2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/linux/device.h b/include/linux/device.h
> index d7a72a8749ea..6c83294395ac 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -1007,6 +1007,8 @@ static inline void device_unlock(struct device *dev)
>   mutex_unlock(&dev->mutex);
>  }
>  
> +DEFINE_GUARD(device, struct device *, device_lock(_T), device_unlock(_T))
> +
>  static inline void device_lock_assert(struct device *dev)
>  {
>   lockdep_assert_held(&dev->mutex);
> 
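
An illustrative (hypothetical) caller of the new definition - the
device_lock() taken by guard(device)() is dropped automatically on every
return path, which is what removes the "goto unlock" pattern:

static int example_reconfigure(struct device *dev)
{
	guard(device)(dev);	/* device_lock(dev) until end of scope */

	if (!dev->driver)
		return -ENXIO;	/* lock still released on early return */

	/* reconfigure while holding the device lock */
	return 0;
}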


Re: [PATCH v3 2/2] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-11 Thread Verma, Vishal L
On Tue, 2023-12-12 at 08:56 +0800, Huang, Ying wrote:
> "Verma, Vishal L"  writes:
> 
> > On Tue, 2023-12-12 at 08:30 +0800, Huang, Ying wrote:
> > > 
> > > > +
> > > > +static ssize_t memmap_on_memory_store(struct device *dev,
> > > > + struct device_attribute *attr,
> > > > + const char *buf, size_t len)
> > > > +{
> > > > +   struct device_driver *drv = dev->driver;
> > > > +   struct dev_dax *dev_dax = to_dev_dax(dev);
> > > > +   struct dax_region *dax_region = dev_dax->region;
> > > > +   struct dax_device_driver *dax_drv = to_dax_drv(drv);
> > > > +   ssize_t rc;
> > > > +   bool val;
> > > > +
> > > > +   rc = kstrtobool(buf, &val);
> > > > +   if (rc)
> > > > +   return rc;
> > > > +
> > > > +   if (dev_dax->memmap_on_memory == val)
> > > > +   return len;
> > > > +
> > > > +   device_lock(dax_region->dev);
> > > > +   if (!dax_region->dev->driver) {
> > > > +   device_unlock(dax_region->dev);
> > > > +   return -ENXIO;
> > > > +   }
> > > 
> > > I think that it should be OK to write to "memmap_on_memory" if no driver
> > > is bound to the device.  We just need to avoid to write to it when kmem
> > > driver is bound.
> > 
> > Oh this is just a check on the region driver, not for a dax driver
> > being bound to the device. It's the same as what things like
> > align_store(), size_store() etc. do for dax device reconfiguration.
> 
> Sorry, I misunderstood it.
> 
> > That said, it might be okay to remove this check, as this operation
> > doesn't change any attributes of the dax region (the other interfaces I
> > mentioned above can affect regions, so we want to lock the region
> > device). If removing the check, we'd drop the region lock acquisition
> > as well.
> 
> This sounds good to me.
> 
> And is it necessary to check driver type with device_lock()?  Can driver
> be changed between checking and lock?
> 
Oh, good point, the type check should happen with the device lock held.
I'll make that change.
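
A rough sketch of that direction - the "is kmem bound?" check done while
holding the device lock so the bound driver cannot change underneath us.
The DAXDRV_KMEM_TYPE comparison is an assumption about how the kmem
driver would be identified; the eventual patch may differ:

static ssize_t memmap_on_memory_store(struct device *dev,
				      struct device_attribute *attr,
				      const char *buf, size_t len)
{
	struct dev_dax *dev_dax = to_dev_dax(dev);
	bool val;
	int rc;

	rc = kstrtobool(buf, &val);
	if (rc)
		return rc;

	if (dev_dax->memmap_on_memory == val)
		return len;

	device_lock(dev);
	if (dev->driver && to_dax_drv(dev->driver)->type == DAXDRV_KMEM_TYPE) {
		/* too late to change policy once kmem owns the device */
		device_unlock(dev);
		return -EBUSY;
	}
	dev_dax->memmap_on_memory = val;
	device_unlock(dev);

	return len;
}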


Re: [PATCH v3 2/2] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-11 Thread Verma, Vishal L
On Tue, 2023-12-12 at 08:30 +0800, Huang, Ying wrote:
> Vishal Verma  writes:
> 
> > Add a sysfs knob for dax devices to control the memmap_on_memory setting
> > if the dax device were to be hotplugged as system memory.
> > 
> > The default memmap_on_memory setting for dax devices originating via
> > pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to
> > preserve legacy behavior. For dax devices via CXL, the default is on.
> > The sysfs control allows the administrator to override the above
> > defaults if needed.
> > 
> > Cc: David Hildenbrand 
> > Cc: Dan Williams 
> > Cc: Dave Jiang 
> > Cc: Dave Hansen 
> > Cc: Huang Ying 
> > Tested-by: Li Zhijian 
> > Reviewed-by: Jonathan Cameron 
> > Reviewed-by: David Hildenbrand 
> > Signed-off-by: Vishal Verma 
> > ---
> >  drivers/dax/bus.c   | 47 
> > +
> >  Documentation/ABI/testing/sysfs-bus-dax | 17 
> >  2 files changed, 64 insertions(+)
> > 
> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > index 1ff1ab5fa105..2871e5188f0d 100644
> > --- a/drivers/dax/bus.c
> > +++ b/drivers/dax/bus.c
> > @@ -1270,6 +1270,52 @@ static ssize_t numa_node_show(struct device *dev,
> >  }
> >  static DEVICE_ATTR_RO(numa_node);
> >  
> > +static ssize_t memmap_on_memory_show(struct device *dev,
> > +    struct device_attribute *attr, char 
> > *buf)
> > +{
> > +   struct dev_dax *dev_dax = to_dev_dax(dev);
> > +
> > +   return sprintf(buf, "%d\n", dev_dax->memmap_on_memory);
> > +}
> > +
> > +static ssize_t memmap_on_memory_store(struct device *dev,
> > + struct device_attribute *attr,
> > + const char *buf, size_t len)
> > +{
> > +   struct device_driver *drv = dev->driver;
> > +   struct dev_dax *dev_dax = to_dev_dax(dev);
> > +   struct dax_region *dax_region = dev_dax->region;
> > +   struct dax_device_driver *dax_drv = to_dax_drv(drv);
> > +   ssize_t rc;
> > +   bool val;
> > +
> > +   rc = kstrtobool(buf, &val);
> > +   if (rc)
> > +   return rc;
> > +
> > +   if (dev_dax->memmap_on_memory == val)
> > +   return len;
> > +
> > +   device_lock(dax_region->dev);
> > +   if (!dax_region->dev->driver) {
> > +   device_unlock(dax_region->dev);
> > +   return -ENXIO;
> > +   }
> 
> I think that it should be OK to write to "memmap_on_memory" if no driver
> is bound to the device.  We just need to avoid to write to it when kmem
> driver is bound.

Oh this is just a check on the region driver, not for a dax driver
being bound to the device. It's the same as what things like
align_store(), size_store() etc. do for dax device reconfiguration.

That said, it might be okay to remove this check, as this operation
doesn't change any attributes of the dax region (the other interfaces I
mentioned above can affect regions, so we want to lock the region
device). If removing the check, we'd drop the region lock acquisition
as well.

Dan, any thoughts on this?




Re: [PATCH v2 2/2] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-08 Thread Verma, Vishal L
On Fri, 2023-12-08 at 12:36 +0100, David Hildenbrand wrote:
> On 07.12.23 05:36, Vishal Verma wrote:
[..]
> > 
> > +
> > +What:  /sys/bus/dax/devices/daxX.Y/memmap_on_memory
> > +Date:  October, 2023
> > +KernelVersion: v6.8
> > +Contact:   nvd...@lists.linux.dev
> > +Description:
> > +   (RW) Control the memmap_on_memory setting if the dax device
> > +   were to be hotplugged as system memory. This determines 
> > whether
> > +   the 'altmap' for the hotplugged memory will be placed on the
> > +   device being hotplugged (memmap_on+memory=1) or if it will 
> > be
> > +   placed on regular memory (memmap_on_memory=0). This 
> > attribute
> > +   must be set before the device is handed over to the 'kmem'
> > +   driver (i.e.  hotplugged into system-ram).
> > 
> 
> Should we note the dependency on other factors as given in 
> mhp_supports_memmap_on_memory(), especially, the system-wide setting and 
> some weird kernel configurations?
> 
Yep good idea, I'll make a note of those for v3.


Re: [PATCH v2 2/2] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-08 Thread Verma, Vishal L
On Fri, 2023-12-08 at 11:13 +0800, Huang, Ying wrote:
> Vishal Verma  writes:
> 
[..]
> > 
> > +
> > +static ssize_t memmap_on_memory_store(struct device *dev,
> > + struct device_attribute *attr,
> > + const char *buf, size_t len)
> > +{
> > +   struct dev_dax *dev_dax = to_dev_dax(dev);
> > +   struct dax_region *dax_region = dev_dax->region;
> > +   ssize_t rc;
> > +   bool val;
> > +
> > +   rc = kstrtobool(buf, &val);
> > +   if (rc)
> > +   return rc;
> > +
> > +   if (dev_dax->memmap_on_memory == val)
> > +   return len;
> > +
> > +   device_lock(dax_region->dev);
> > +   if (!dax_region->dev->driver) {
> 
> This still doesn't look right.  Can we check whether the current driver
> is kmem?  And only allow change if it's not kmem?

Ah yes I lost track of this todo between revisions when I split this
out. Let me fix that for the next revision.



Re: [PATCH v2 2/2] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-08 Thread Verma, Vishal L
On Fri, 2023-12-08 at 09:20 +, Zhijian Li (Fujitsu) wrote:
> 
> 
> On 08/12/2023 03:25, Verma, Vishal L wrote:
> > On Thu, 2023-12-07 at 08:29 +, Zhijian Li (Fujitsu) wrote:
> > > Hi Vishal,
> > > 
> > > 
> > > On 07/12/2023 12:36, Vishal Verma wrote:
> > > > +
> > > > +What:  /sys/bus/dax/devices/daxX.Y/memmap_on_memory
> > > > +Date:  October, 2023
> > > > +KernelVersion: v6.8
> > > > +Contact:   nvd...@lists.linux.dev
> > > > +Description:
> > > > +   (RW) Control the memmap_on_memory setting if the dax 
> > > > device
> > > > +   were to be hotplugged as system memory. This determines 
> > > > whether
> > > > +   the 'altmap' for the hotplugged memory will be placed 
> > > > on the
> > > > +   device being hotplugged (memmap_on+memory=1) or if it 
> > > > will be
> > > 
> > > s/memmap_on+memory=1/memmap_on_memory=1
> > 
> > Thanks, will fix.
> > > 
> > > 
> > > I have a question here
> > > 
> > > What relationship about memmap_on_memory and 'ndctl-create-namespace
> > > -M' option which
> > > specifies where is the vmemmap backed memory.
> > > I'm confused that memmap_on_memory=1 and '-M dev' are the same for
> > > nvdimm devdax mode ?
> > > 
> > The main difference is that memory that comes from non-nvdimm sources,
> > such as hmem, or cxl, doesn't have a chance to set up the altmaps as
> > pmem can with '-M dev'. For these, memmap_on_memory does this as part
> > of the memory hotplug.
> 
> Thanks for your explanation.
> (I wrongly thought nvdimm.kmem was also controlled by '-M dev' before)
> 
> feel free add:
> Tested-by: Li Zhijian 

Thank you Zhijian!



Re: [PATCH v2 2/2] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-07 Thread Verma, Vishal L
On Thu, 2023-12-07 at 08:29 +, Zhijian Li (Fujitsu) wrote:
> Hi Vishal,
> 
> 
> On 07/12/2023 12:36, Vishal Verma wrote:
> > +
> > +What:  /sys/bus/dax/devices/daxX.Y/memmap_on_memory
> > +Date:  October, 2023
> > +KernelVersion: v6.8
> > +Contact:   nvd...@lists.linux.dev
> > +Description:
> > +   (RW) Control the memmap_on_memory setting if the dax device
> > +   were to be hotplugged as system memory. This determines 
> > whether
> > +   the 'altmap' for the hotplugged memory will be placed on the
> > +   device being hotplugged (memmap_on+memory=1) or if it will 
> > be
> 
> s/memmap_on+memory=1/memmap_on_memory=1

Thanks, will fix.
> 
> 
> I have a question here
> 
> What relationship about memmap_on_memory and 'ndctl-create-namespace
> -M' option which
> specifies where is the vmemmap backed memory.
> I'm confused that memmap_on_memory=1 and '-M dev' are the same for
> nvdimm devdax mode ?
> 
The main difference is that memory that comes from non-nvdimm sources,
such as hmem, or cxl, doesn't have a chance to set up the altmaps as
pmem can with '-M dev'. For these, memmap_on_memory does this as part
of the memory hotplug.


Re: [PATCH v8 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-11-03 Thread Verma, Vishal L
On Fri, 2023-11-03 at 09:43 -0700, fan wrote:
> On Wed, Nov 01, 2023 at 04:51:52PM -0600, Vishal Verma wrote:
> > 
[..]
> >  
> > +static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size)
> > +{
> > +   unsigned long memblock_size = memory_block_size_bytes();
> > +   u64 cur_start;
> > +
> > +   /*
> > +    * For memmap_on_memory, the altmaps were added on a per-memblock
> > +    * basis; we have to process each individual memory block.
> > +    */
> > +   for (cur_start = start; cur_start < start + size;
> > +    cur_start += memblock_size) {
> > +   struct vmem_altmap *altmap = NULL;
> > +   struct memory_block *mem;
> > +
> > +   mem = find_memory_block(pfn_to_section_nr(PFN_DOWN(cur_start)));
> > +   WARN_ON_ONCE(!mem);
> > +   if (!mem)
> > +   continue;
> > +
> > +   altmap = mem->altmap;
> > +   mem->altmap = NULL;
> > +
> > +   remove_memory_block_devices(cur_start, memblock_size);
> 
> Is cur_start always aligned to memory_block_size_bytes? If not, the
> above function will return directly, is that an issue?
> 
Hi Fan,

Thanks for taking a look and the review (btw v9 is the latest revision
of these).

I think we're okay because the create side would've added this memory
in the first place, as it too does an alignment check for
memory_block_size_bytes.

Thanks
Vishal


Re: [PATCH v8 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-11-01 Thread Verma, Vishal L
On Thu, 2023-11-02 at 09:16 +0800, Huang, Ying wrote:
> Vishal Verma  writes:
> 
[..]
> > +
> > +static int create_altmaps_and_memory_blocks(int nid, struct memory_group 
> > *group,
> > +   u64 start, u64 size)
> > +{
> > +   unsigned long memblock_size = memory_block_size_bytes();
> > +   u64 cur_start;
> > +   int ret;
> > +
> > +   for (cur_start = start; cur_start < start + size;
> > +    cur_start += memblock_size) {
> > +   struct mhp_params params = { .pgprot =
> > +    
> > pgprot_mhp(PAGE_KERNEL) };
> > +   struct vmem_altmap mhp_altmap = {
> > +   .base_pfn = PHYS_PFN(cur_start),
> > +   .end_pfn = PHYS_PFN(cur_start + memblock_size - 1),
> > +   };
> > +
> > +   mhp_altmap.free = memory_block_memmap_on_memory_pages();
> > +   params.altmap = kmemdup(&mhp_altmap, sizeof(struct vmem_altmap),
> > +   GFP_KERNEL);
> > +   if (!params.altmap)
> > +   return -ENOMEM;
> 
> Use "goto out" here too?

Hm, yes I suppose we want to clean up previous iterations of the loop -
I'll make this change.

> 
> > +
> > +   /* call arch's memory hotadd */
> > +   ret = arch_add_memory(nid, cur_start, memblock_size, &params);
> > +   if (ret < 0) {
> > +   kfree(params.altmap);
> > +   goto out;
> > +   }
> > +
> > +   /* create memory block devices after memory was added */
> > +   ret = create_memory_block_devices(cur_start, memblock_size,
> > + params.altmap, group);
> > +   if (ret) {
> > +   arch_remove_memory(cur_start, memblock_size, NULL);
> > +   kfree(params.altmap);
> 
> How about move arch_remove_memory() and kree() to error path and use
> different label?

I thought of this, but it got slightly awkward because of the scope of
'params' (declared/allocated within the loop); just kfree'ing in that
scope looked cleaner.
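
For reference, the per-iteration cleanup plus an unwind of earlier
iterations might look like the sketch below (helper names follow the
quoted patch; the final version may differ):

static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
					    u64 start, u64 size)
{
	unsigned long memblock_size = memory_block_size_bytes();
	u64 cur_start;
	int ret;

	for (cur_start = start; cur_start < start + size;
	     cur_start += memblock_size) {
		struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
		struct vmem_altmap mhp_altmap = {
			.base_pfn = PHYS_PFN(cur_start),
			.end_pfn = PHYS_PFN(cur_start + memblock_size - 1),
		};

		mhp_altmap.free = memory_block_memmap_on_memory_pages();
		params.altmap = kmemdup(&mhp_altmap, sizeof(struct vmem_altmap),
					GFP_KERNEL);
		if (!params.altmap) {
			ret = -ENOMEM;
			goto out;
		}

		/* call arch's memory hotadd */
		ret = arch_add_memory(nid, cur_start, memblock_size, &params);
		if (ret < 0) {
			kfree(params.altmap);
			goto out;
		}

		/* create memory block devices after memory was added */
		ret = create_memory_block_devices(cur_start, memblock_size,
						  params.altmap, group);
		if (ret) {
			arch_remove_memory(cur_start, memblock_size, NULL);
			kfree(params.altmap);
			goto out;
		}
	}
	return 0;

out:
	/* unwind any memblocks/altmaps added by earlier iterations */
	if (cur_start != start)
		remove_memory_blocks_and_altmaps(start, cur_start - start);
	return ret;
}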



Re: [PATCH v7 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-10-30 Thread Verma, Vishal L
On Mon, 2023-10-30 at 11:20 +0100, David Hildenbrand wrote:
> On 26.10.23 00:44, Vishal Verma wrote:
> > 
[..]

> > @@ -2146,11 +2186,69 @@ void try_offline_node(int nid)
> >   }
> >   EXPORT_SYMBOL(try_offline_node);
> >   
> > -static int __ref try_remove_memory(u64 start, u64 size)
> > +static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size)
> >   {
> > -   struct memory_block *mem;
> > -   int rc = 0, nid = NUMA_NO_NODE;
> > +   unsigned long memblock_size = memory_block_size_bytes();
> > struct vmem_altmap *altmap = NULL;
> > +   struct memory_block *mem;
> > +   u64 cur_start;
> > +   int rc;
> > +
> > +   /*
> > +    * For memmap_on_memory, the altmaps could have been added on
> > +    * a per-memblock basis. Loop through the entire range if so,
> > +    * and remove each memblock and its altmap.
> > +    */
> 
> /*
>   * altmaps where added on a per-memblock basis; we have to process
>   * each individual memory block.
>   */
> 
> > +   for (cur_start = start; cur_start < start + size;
> > +    cur_start += memblock_size) {
> > +   rc = walk_memory_blocks(cur_start, memblock_size, &mem,
> > +   test_has_altmap_cb);
> > +   if (rc) {
> > +   altmap = mem->altmap;
> > +   /*
> > +    * Mark altmap NULL so that we can add a debug
> > +    * check on memblock free.
> > +    */
> > +   mem->altmap = NULL;
> > +   }
> 
> Simpler (especially, we know that there must be an altmap):
> 
> mem = find_memory_block(pfn_to_section_nr(cur_start));
> altmap = mem->altmap;
> mem->altmap = NULL;
> 
> I think we might be able to remove test_has_altmap_cb() then.
> 
> > +
> > +   remove_memory_block_devices(cur_start, memblock_size);
> > +
> > +   arch_remove_memory(cur_start, memblock_size, altmap);
> > +
> > +   /* Verify that all vmemmap pages have actually been freed. 
> > */
> > +   if (altmap) {
> 
> There must be an altmap, so this can be done unconditionally.

Hi David,

All other comments make sense, making those changes now.

However for this one, does the WARN() below go away then?

I was wondering if maybe arch_remove_memory() is responsible for
freeing the altmap here, and at this stage we're just checking if that
happened. If it didn't, WARN and then free it.

I drilled down the path, and I don't see altmap actually getting freed
in vmem_altmap_free(), but I wasn't sure if  was meant
to free it as altmap->alloc went down to 0.

> 
> > +   WARN(altmap->alloc, "Altmap not fully unmapped");
> > +   kfree(altmap);
> > +   }
> > +   }
> > +}
> 
> 



Re: [PATCH v5 2/2] dax/kmem: allow kmem to add memory with memmap_on_memory

2023-10-16 Thread Verma, Vishal L
On Tue, 2023-10-17 at 13:18 +0800, Huang, Ying wrote:
> "Verma, Vishal L"  writes:
> 
> > On Thu, 2023-10-05 at 14:16 -0700, Dan Williams wrote:
> > > Vishal Verma wrote:
> > > > 
> > <..>
> > 
> > > > +
> > > > +   rc = kstrtobool(buf, &val);
> > > > +   if (rc)
> > > > +   return rc;
> > > 
> > > Perhaps:
> > > 
> > > if (dev_dax->memmap_on_memory == val)
> > >     return len;
> > > 
> > > ...and skip the check below when it is going to be a nop
> > > 
> > > > +
> > > > +   device_lock(dax_region->dev);
> > > > +   if (!dax_region->dev->driver) {
> > > 
> > > Is the polarity backwards here? I.e. if the device is already
> > > attached to
> > > the kmem driver it is too late to modify memmap_on_memory policy.
> > 
> > Hm this sounded logical until I tried it. After a reconfigure-
> > device to
> > devdax (i.e. detach kmem), I get the -EBUSY if I invert this check.
> 
> Can you try to unbind the device via sysfs by hand and retry?
> 
I think what may be happening is that while kmem gets detached, the device
goes back to another dax driver (hmem in my tests). So either way, the
check for if (driver) or if (!driver) won't distinguish between kmem
vs. something else.

Maybe we just remove this check? Or add an explicit kmem check somehow?


Re: [PATCH v5 2/2] dax/kmem: allow kmem to add memory with memmap_on_memory

2023-10-16 Thread Verma, Vishal L
On Thu, 2023-10-05 at 14:16 -0700, Dan Williams wrote:
> Vishal Verma wrote:
> > 
<..>

> > +
> > +   rc = kstrtobool(buf, &val);
> > +   if (rc)
> > +   return rc;
> 
> Perhaps:
> 
> if (dev_dax->memmap_on_memory == val)
> return len;
> 
> ...and skip the check below when it is going to be a nop
> 
> > +
> > +   device_lock(dax_region->dev);
> > +   if (!dax_region->dev->driver) {
> 
> Is the polarity backwards here? I.e. if the device is already attached to
> the kmem driver it is too late to modify memmap_on_memory policy.

Hm this sounded logical until I tried it. After a reconfigure-device to
devdax (i.e. detach kmem), I get the -EBUSY if I invert this check.

> 
> > +   device_unlock(dax_region->dev);
> > +   return -ENXIO;
> 
> I would expect -EBUSY since disabling the device allows the property to be
> set and -ENXIO implies a more permanent error.
> 
> > +   }
> > +
> > +   dev_dax->memmap_on_memory = val;
> > +
> > +   device_unlock(dax_region->dev);
> > +   return len;
> > +}
> > +static DEVICE_ATTR_RW(memmap_on_memory);
> 
> This new attribute needs a new Documentation/ABI/ entry... in fact all of
> these attributes need Documentation/ entries. I can help with that base
> document to get things started.
> 
> Perhaps split this sysfs ABI into its own patch and, depending on how fast
> we can pull the Documentation together, start with the
> region-driver-conveyed approach in the meantime.

Yep I'll split this out and I can do a separate series to add the ABI
docs for /sys/bus/dax, and include this new ABI in that as well.

Agreed with all other comments.


Re: [PATCH v5 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-10-16 Thread Verma, Vishal L
On Thu, 2023-10-12 at 10:40 +0200, David Hildenbrand wrote:
> On 12.10.23 07:53, Verma, Vishal L wrote:
> > On Mon, 2023-10-09 at 17:04 +0200, David Hildenbrand wrote:
> > > On 07.10.23 10:55, Huang, Ying wrote:
> > > > Vishal Verma  writes:
> 
<..>
> > +
> > +   for (cur_start = start; cur_start < start + size;
> > +    cur_start += memblock_size) {
> > +   if (walk_memory_blocks(cur_start, memblock_size, &mem,
> > +  test_has_altmap_cb))
> > +   num_altmaps++;
> > +   else
> > +   num_no_altmaps++;
> > +   }
> 
> You should do that without the outer loop, by doing the counting in the 
> callback function instead.  
> 
> 
I made a new callback, since the existing callback that returns the
memory_block breaks the walk the first time an altmap was encountered.

Agreed on all the other comments - it looks much cleaner now!

Sending v6 shortly with all of this.


Re: [PATCH v5 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-10-11 Thread Verma, Vishal L
On Mon, 2023-10-09 at 17:04 +0200, David Hildenbrand wrote:
> On 07.10.23 10:55, Huang, Ying wrote:
> > Vishal Verma  writes:
> > 
> > > @@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 
> > > size)
> > > if (rc)
> > > return rc;
> > >   
> > > +   mem_hotplug_begin();
> > > +
> > > /*
> > > -    * We only support removing memory added with 
> > > MHP_MEMMAP_ON_MEMORY in
> > > -    * the same granularity it was added - a single memory block.
> > > +    * For memmap_on_memory, the altmaps could have been added on
> > > +    * a per-memblock basis. Loop through the entire range if so,
> > > +    * and remove each memblock and its altmap.
> > >  */
> > > if (mhp_memmap_on_memory()) {
> > 
> > IIUC, even if mhp_memmap_on_memory() returns true, it's still possible
> > that the memmap is put in DRAM after [2/2].  So that,
> > arch_remove_memory() are called for each memory block unnecessarily.  Can
> > we detect this (via altmap?) and call remove_memory_block_and_altmap()
> > for the whole range?
> 
> Good point. We should handle memblock-per-memblock only if we have to
> handle the altmap. Otherwise, just call a separate function that doesn't
> care about altmaps -- e.g., called remove_memory_blocks_no_altmap().
> 
> We could simply walk all memory blocks and make sure either all have an 
> altmap or none has an altmap. If there is a mix, we should bail out with 
> WARN_ON_ONCE().
> 
Ok I think I follow - based on both of these threads, here's my
understanding in an incremental diff from the original patches (may not
apply directly as I've already committed changes from the other bits of
feedback - but this should provide an idea of the direction) - 

---

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 507291e44c0b..30addcb063b4 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -2201,6 +2201,40 @@ static void __ref remove_memory_block_and_altmap(u64 
start, u64 size)
}
 }
 
+static bool memblocks_have_altmaps(u64 start, u64 size)
+{
+   unsigned long memblock_size = memory_block_size_bytes();
+   u64 num_altmaps = 0, num_no_altmaps = 0;
+   struct memory_block *mem;
+   u64 cur_start;
+   int rc = 0;
+
+   if (!mhp_memmap_on_memory())
+   return false;
+
+   for (cur_start = start; cur_start < start + size;
+cur_start += memblock_size) {
+   if (walk_memory_blocks(cur_start, memblock_size, &mem,
+  test_has_altmap_cb))
+   num_altmaps++;
+   else
+   num_no_altmaps++;
+   }
+
+   if (!num_altmaps && num_no_altmaps > 0)
+   return false;
+
+   if (!num_no_altmaps && num_altmaps > 0)
+   return true;
+
+   /*
+* If there is a mix of memblocks with and without altmaps,
+* something has gone very wrong. WARN and bail.
+*/
+   WARN_ONCE(1, "memblocks have a mix of missing and present altmaps");
+   return false;
+}
+
 static int __ref try_remove_memory(u64 start, u64 size)
 {
int rc, nid = NUMA_NO_NODE;
@@ -2230,7 +2264,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
 * a per-memblock basis. Loop through the entire range if so,
 * and remove each memblock and its altmap.
 */
-   if (mhp_memmap_on_memory()) {
+   if (mhp_memmap_on_memory() && memblocks_have_altmaps(start, size)) {
unsigned long memblock_size = memory_block_size_bytes();
u64 cur_start;
 
@@ -2239,7 +2273,8 @@ static int __ref try_remove_memory(u64 start, u64 size)
remove_memory_block_and_altmap(cur_start,
   memblock_size);
} else {
-   remove_memory_block_and_altmap(start, size);
+   remove_memory_block_devices(start, size);
+   arch_remove_memory(start, size, NULL);
}
 
if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {



Re: [PATCH v5 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-10-06 Thread Verma, Vishal L
On Fri, 2023-10-06 at 14:52 +0200, David Hildenbrand wrote:
> On 05.10.23 20:31, Vishal Verma wrote:
> > 
<..>
> > @@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 
> > size)
> > if (rc)
> > return rc;
> >   
> > +   mem_hotplug_begin();
> > +
> > /*
> > -    * We only support removing memory added with MHP_MEMMAP_ON_MEMORY 
> > in
> > -    * the same granularity it was added - a single memory block.
> > +    * For memmap_on_memory, the altmaps could have been added on
> > +    * a per-memblock basis. Loop through the entire range if so,
> > +    * and remove each memblock and its altmap.
> >  */
> > if (mhp_memmap_on_memory()) {
> > -   rc = walk_memory_blocks(start, size, &mem,
> > test_has_altmap_cb);
> > -   if (rc) {
> > -   if (size != memory_block_size_bytes()) {
> > -   pr_warn("Refuse to remove %#llx - %#llx,"
> > -   "wrong granularity\n",
> > -   start, start + size);
> > -   return -EINVAL;
> > -   }
> > -   altmap = mem->altmap;
> > -   /*
> > -    * Mark altmap NULL so that we can add a debug
> > -    * check on memblock free.
> > -    */
> > -   mem->altmap = NULL;
> > -   }
> > +   unsigned long memblock_size = memory_block_size_bytes();
> > +   u64 cur_start;
> > +
> > +   for (cur_start = start; cur_start < start + size;
> > +    cur_start += memblock_size)
> > +   remove_memory_block_and_altmap(nid, cur_start,
> > +  memblock_size);
> > +   } else {
> > +   remove_memory_block_and_altmap(nid, start, size);
> 
> Better call remove_memory_block_devices() and arch_remove_memory(start, 
> size, altmap) here explicitly instead of using 
> remove_memory_block_and_altmap() that really can only handle a single
> memory block with any inputs.
> 
I'm not sure I follow. Even in the non-memmap_on_memory case, we'd have
to walk_memory_blocks() to get to the memory_block->altmap, right?

Or is there a more direct way? If we have to walk_memory_blocks, what's
the advantage of calling those directly instead of calling the helper
created above?

Agreed with and fixed up all the other comments.


Re: [PATCH v5 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-10-06 Thread Verma, Vishal L
On Thu, 2023-10-05 at 14:20 -0700, Dan Williams wrote:
> Vishal Verma wrote:
<..>
> > 
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1380,6 +1380,44 @@ static bool mhp_supports_memmap_on_memory(unsigned 
> > long size)
> > return arch_supports_memmap_on_memory(vmemmap_size);
> >  }
> >  
> > +static int add_memory_create_devices(int nid, struct memory_group *group,
> > +    u64 start, u64 size, mhp_t mhp_flags)
> > +{
> > +   struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> > +   struct vmem_altmap mhp_altmap = {
> > +   .base_pfn =  PHYS_PFN(start),
> > +   .end_pfn  =  PHYS_PFN(start + size - 1),
> > +   };
> > +   int ret;
> > +
> > +   if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
> > +   mhp_altmap.free = memory_block_memmap_on_memory_pages();
> > +   params.altmap = kmalloc(sizeof(struct vmem_altmap), 
> > GFP_KERNEL);
> > +   if (!params.altmap)
> > +   return -ENOMEM;
> > +
> > +   memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));
> 
> Isn't this just open coded kmemdup()?

Ah yes - it was existing code that I just moved, but I can add a
precursor cleanup patch to change it.
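
i.e., the open-coded kmalloc() + memcpy() pair above collapses to
something like (a sketch of the cleanup):

	params.altmap = kmemdup(&mhp_altmap, sizeof(mhp_altmap), GFP_KERNEL);
	if (!params.altmap)
		return -ENOMEM;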

> 
> Other than that, I am not seeing anything else to comment on, you can add:
> 
> Reviewed-by: Dan Williams 

Thanks Dan!



Re: [PATCH v4 2/2] dax/kmem: allow kmem to add memory with memmap_on_memory

2023-10-03 Thread Verma, Vishal L
On Tue, 2023-10-03 at 09:34 +0530, Aneesh Kumar K V wrote:
> On 9/29/23 2:00 AM, Vishal Verma wrote:
> > Large amounts of memory managed by the kmem driver may come in via CXL,
> > and it is often desirable to have the memmap for this memory on the new
> > memory itself.
> > 
> > Enroll kmem-managed memory for memmap_on_memory semantics if the dax
> > region originates via CXL. For non-CXL dax regions, retain the existing
> > default behavior of hot adding without memmap_on_memory semantics.
> > 
> 
> Are we not looking at doing altmap space for CXL DAX regions? Last discussion 
> around
> this was suggesting we look at doing this via altmap reservation so that
> we get contiguous space for device memory enabling us to map them
> via 1G direct mapping entries? 
> 
Hey Aneesh - was this on a previous posting or something - do you have
a link so I can refresh myself on what the discussion was?

If it is enabling something in CXL similar to the --map=mem mode for
pmem + device dax, that could be incremental to this.


Re: [PATCH v4 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-10-03 Thread Verma, Vishal L
On Mon, 2023-10-02 at 11:28 +0200, David Hildenbrand wrote:
> 
> > +
> > +static int __ref try_remove_memory(u64 start, u64 size)
> > +{
> > +   int rc, nid = NUMA_NO_NODE;
> > +
> > +   BUG_ON(check_hotplug_memory_range(start, size));
> > +
> > +   /*
> > +    * All memory blocks must be offlined before removing memory.  Check
> > +    * whether all memory blocks in question are offline and return 
> > error
> > +    * if this is not the case.
> > +    *
> > +    * While at it, determine the nid. Note that if we'd have mixed 
> > nodes,
> > +    * we'd only try to offline the last determined one -- which is good
> > +    * enough for the cases we care about.
> > +    */
> > +   rc = walk_memory_blocks(start, size, &mem,
> > check_memblock_offlined_cb);
> > +   if (rc)
> > +   return rc;
> > +
> > +   /*
> > +    * For memmap_on_memory, the altmaps could have been added on
> > +    * a per-memblock basis. Loop through the entire range if so,
> > +    * and remove each memblock and its altmap.
> > +    */
> > +   if (mhp_memmap_on_memory()) {
> > +   unsigned long memblock_size = memory_block_size_bytes();
> > +   u64 cur_start;
> > +
> > +   for (cur_start = start; cur_start < start + size;
> > +    cur_start += memblock_size)
> > +   __try_remove_memory(nid, cur_start, memblock_size);
> > +   } else {
> > +   __try_remove_memory(nid, start, size);
> > +   }
> > +
> > return 0;
> >   }
> 
> Why is the firmware, memblock and nid handling not kept in this outer
> function?
> 
> We really shouldn't be doing per memory block what needs to be done per 
> memblock: remove_memory_block_devices() and arch_remove_memory().


Ah yes, that makes sense, since we only do create_memory_block_devices() and
arch_add_memory() in the per memory block inner loop during addition.

How should the locking work in this case though?

The original code holds the mem_hotplug_begin() lock over
arch_remove_memory() and all of the nid and memblock stuff. Should I
just hold the lock and release it in the inner loop for each memory
block, and then also acquire and release it separately for the memblock
and nid stuff in the outer function?

Here's the incremental diff for what I'm thinking:

---

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 43dbd71a4910..13380178173d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -2171,7 +2171,7 @@ void try_offline_node(int nid)
 }
 EXPORT_SYMBOL(try_offline_node);
 
-static void __ref __try_remove_memory(int nid, u64 start, u64 size)
+static void __ref remove_memory_block_and_altmap(int nid, u64 start, u64 size)
 {
int rc = 0;
struct memory_block *mem;
@@ -2187,9 +2187,6 @@ static void __ref __try_remove_memory(int nid, u64 start, 
u64 size)
mem->altmap = NULL;
}
 
-   /* remove memmap entry */
-   firmware_map_remove(start, start + size, "System RAM");
-
/*
 * Memory block device removal under the device_hotplug_lock is
 * a barrier against racing online attempts.
@@ -2206,16 +2203,6 @@ static void __ref __try_remove_memory(int nid, u64 
start, u64 size)
kfree(altmap);
}
 
-   if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
-   memblock_phys_free(start, size);
-   memblock_remove(start, size);
-   }
-
-   release_mem_region_adjustable(start, size);
-
-   if (nid != NUMA_NO_NODE)
-   try_offline_node(nid);
-
mem_hotplug_done();
 }
 
@@ -2249,11 +2236,29 @@ static int __ref try_remove_memory(u64 start, u64 size)
 
for (cur_start = start; cur_start < start + size;
 cur_start += memblock_size)
-   __try_remove_memory(nid, cur_start, memblock_size);
+   remove_memory_block_and_altmap(nid, cur_start,
+  memblock_size);
} else {
-   __try_remove_memory(nid, start, size);
+   remove_memory_block_and_altmap(nid, start, size);
}
 
+   /* remove memmap entry */
+   firmware_map_remove(start, start + size, "System RAM");
+
+   mem_hotplug_begin();
+
+   if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
+   memblock_phys_free(start, size);
+   memblock_remove(start, size);
+   }
+
+   release_mem_region_adjustable(start, size);
+
+   if (nid != NUMA_NO_NODE)
+   try_offline_node(nid);
+
+   mem_hotplug_done();
+
return 0;
 }



Re: [PATCH v2] nd_btt: Make BTT lanes preemptible

2023-09-22 Thread Verma, Vishal L
On Wed, 2023-09-20 at 07:37 +0200, Tomas Glozar wrote:
> nd_region_acquire_lane uses get_cpu, which disables preemption. This is
> an issue on PREEMPT_RT kernels, since btt_write_pg and also
> nd_region_acquire_lane itself take a spin lock, resulting in BUG:
> sleeping function called from invalid context.
> 
> Fix the issue by replacing get_cpu with smp_processor_id and
> migrate_disable when needed. This makes BTT operations preemptible, thus
> permitting the use of spin_lock.
> 
> BUG example occurring when running ndctl tests on PREEMPT_RT kernel:
> 
> BUG: sleeping function called from invalid context at
> kernel/locking/spinlock_rt.c:48
> in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 4903, name:
> libndctl
> preempt_count: 1, expected: 0
> RCU nest depth: 0, expected: 0
> Preemption disabled at:
> [] nd_region_acquire_lane+0x15/0x90 [libnvdimm]
> Call Trace:
>  
>  dump_stack_lvl+0x8e/0xb0
>  __might_resched+0x19b/0x250
>  rt_spin_lock+0x4c/0x100
>  ? btt_write_pg+0x2d7/0x500 [nd_btt]
>  btt_write_pg+0x2d7/0x500 [nd_btt]
>  ? local_clock_noinstr+0x9/0xc0
>  btt_submit_bio+0x16d/0x270 [nd_btt]
>  __submit_bio+0x48/0x80
>  __submit_bio_noacct+0x7e/0x1e0
>  submit_bio_wait+0x58/0xb0
>  __blkdev_direct_IO_simple+0x107/0x240
>  ? inode_set_ctime_current+0x51/0x110
>  ? __pfx_submit_bio_wait_endio+0x10/0x10
>  blkdev_write_iter+0x1d8/0x290
>  vfs_write+0x237/0x330
>  ...
>  
> 
> Fixes: 5212e11fde4d ("nd_btt: atomic sector updates")
> Signed-off-by: Tomas Glozar 

Thanks for the update to include the failure signature.
I read the commentary around migrate_disable(), and it makes sense to
use it as a replacement for disabling preemption here. I suppose at
some point there will be a move to remove migrate_disable(), and at
that point we'd have to rework this code to remove the implicit
lockless mode when operating under per-cpu assumptions. Until then,

Reviewed-by: Vishal Verma 

> ---
>  drivers/nvdimm/region_devs.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index 0a81f87f6f6c..e2f1fb99707f 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -939,7 +939,8 @@ unsigned int nd_region_acquire_lane(struct nd_region 
> *nd_region)
>  {
> unsigned int cpu, lane;
>  
> -   cpu = get_cpu();
> +   migrate_disable();
> +   cpu = smp_processor_id();
> if (nd_region->num_lanes < nr_cpu_ids) {
> struct nd_percpu_lane *ndl_lock, *ndl_count;
>  
> @@ -958,16 +959,15 @@ EXPORT_SYMBOL(nd_region_acquire_lane);
>  void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane)
>  {
> if (nd_region->num_lanes < nr_cpu_ids) {
> -   unsigned int cpu = get_cpu();
> +   unsigned int cpu = smp_processor_id();
> struct nd_percpu_lane *ndl_lock, *ndl_count;
>  
> ndl_count = per_cpu_ptr(nd_region->lane, cpu);
> ndl_lock = per_cpu_ptr(nd_region->lane, lane);
> if (--ndl_count->count == 0)
> spin_unlock(&ndl_lock->lock);
> -   put_cpu();
> }
> -   put_cpu();
> +   migrate_enable();
>  }
>  EXPORT_SYMBOL(nd_region_release_lane);
>  



Re: [PATCH v3 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-08-02 Thread Verma, Vishal L
On Wed, 2023-08-02 at 15:20 +0100, Jonathan Cameron wrote:
> On Tue, 01 Aug 2023 23:55:37 -0600
> Vishal Verma  wrote:
> 
> > The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
> > 'memblock_size' chunks of memory being added. Adding a larger span of
> > memory precludes memmap_on_memory semantics.
> > 
> > For users of hotplug such as kmem, large amounts of memory might get
> > added from the CXL subsystem. In some cases, this amount may exceed the
> > available 'main memory' to store the memmap for the memory being added.
> > In this case, it is useful to have a way to place the memmap on the
> > memory being added, even if it means splitting the addition into
> > memblock-sized chunks.
> > 
> > Change add_memory_resource() to loop over memblock-sized chunks of
> > memory if caller requested memmap_on_memory, and if other conditions for
> > it are met. Teach try_remove_memory() to also expect that a memory
> > range being removed might have been split up into memblock sized chunks,
> > and to loop through those as needed.
> > 
> > Cc: Andrew Morton 
> > Cc: David Hildenbrand 
> > Cc: Michal Hocko 
> > Cc: Oscar Salvador 
> > Cc: Dan Williams 
> > Cc: Dave Jiang 
> > Cc: Dave Hansen 
> > Cc: Huang Ying 
> > Suggested-by: David Hildenbrand 
> > Signed-off-by: Vishal Verma 
> 
> A couple of trivial comments inline.

Hi Jonathan,

Thanks for taking a look.

> 
> > ---
> >  mm/memory_hotplug.c | 150 
> > 
> >  1 file changed, 93 insertions(+), 57 deletions(-)
> > 
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index d282664f558e..cae03c8d4bbf 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1383,6 +1383,44 @@ static bool mhp_supports_memmap_on_memory(unsigned 
> > long size)
> > return arch_supports_memmap_on_memory(vmemmap_size);
> >  }
> >  
> > +static int add_memory_create_devices(int nid, struct memory_group *group,
> > +    u64 start, u64 size, mhp_t mhp_flags)
> > +{
> > +   struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> > +   struct vmem_altmap mhp_altmap = {
> > +   .base_pfn =  PHYS_PFN(start),
> > +   .end_pfn  =  PHYS_PFN(start + size - 1),
> > +   };
> > +   int ret;
> > +
> > +   if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
> > +   mhp_altmap.free = memory_block_memmap_on_memory_pages();
> > +   params.altmap = kmalloc(sizeof(struct vmem_altmap), 
> > GFP_KERNEL);
> > +   if (!params.altmap)
> > +   return -ENOMEM;
> > +
> > +   memcpy(params.altmap, _altmap, sizeof(mhp_altmap));
> > +   }
> > +
> > +   /* call arch's memory hotadd */
> > +   ret = arch_add_memory(nid, start, size, );
> > +   if (ret < 0)
> > +   goto error;
> > +
> > +   /* create memory block devices after memory was added */
> > +   ret = create_memory_block_devices(start, size, params.altmap, 
> > group);
> > +   if (ret) {
> > +   arch_remove_memory(start, size, NULL);
> 
> Maybe push this down to a second label?

Yep will do.

> 

> > +
> > +static int __ref try_remove_memory(u64 start, u64 size)
> > +{
> > +   int ret, nid = NUMA_NO_NODE;
> 
> I'm not overly keen to see the trivial rename of rc -> ret in here.
> Just makes it ever so slightly harder to compare old code and new code.

Yep - this was to work around the patches I was based on, which added
a 'ret' alongside the original 'rc' [1]. Aneesh will stick to 'rc', so
my next revision should sort this out naturally.

[1]: 
https://lore.kernel.org/all/715042319ceb86016a4986862a82756e5629d725.ca...@intel.com/

> 


Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-08-02 Thread Verma, Vishal L
On Mon, 2023-07-24 at 13:54 +0800, Huang, Ying wrote:
> Vishal Verma  writes:
> 
> > 
> > @@ -2035,12 +2056,38 @@ void try_offline_node(int nid)
> >  }
> >  EXPORT_SYMBOL(try_offline_node);
> >  
> > -static int __ref try_remove_memory(u64 start, u64 size)
> > +static void __ref __try_remove_memory(int nid, u64 start, u64 size,
> > +    struct vmem_altmap *altmap)
> >  {
> > -   struct vmem_altmap mhp_altmap = {};
> > -   struct vmem_altmap *altmap = NULL;
> > -   unsigned long nr_vmemmap_pages;
> > -   int rc = 0, nid = NUMA_NO_NODE;
> > +   /* remove memmap entry */
> > +   firmware_map_remove(start, start + size, "System RAM");
> 
> If mhp_supports_memmap_on_memory(), we will call
> firmware_map_add_hotplug() for the whole range.  But here we may call
> firmware_map_remove() for part of the range.  Is it OK?
> 

Good point, this is a discrepancy in the add vs remove path. Can the
firmware memmap entries be moved up a bit in the add path, and is it
okay to create these for each memblock? Or should these be for the
whole range? I'm not familiar with the implications. (I've left it as
is for v3 for now, but depending on the direction I can update in a
future rev).
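
To make the question concrete, the naive per-memblock version on the add
side would be something like the below (sketch only - whether per-memblock
entries are even acceptable is exactly the open question):

	/* per-memblock firmware map entry, mirroring the per-memblock hotadd */
	if (!strcmp(res->name, "System RAM"))
		firmware_map_add_hotplug(cur_start, cur_start + memblock_size,
					 "System RAM");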


Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-08-02 Thread Verma, Vishal L
On Mon, 2023-07-24 at 11:16 +0800, Huang, Ying wrote:
> "Aneesh Kumar K.V"  writes:
> > 
> > > @@ -1339,27 +1367,20 @@ int __ref add_memory_resource(int nid,
> > > struct resource *res, mhp_t mhp_flags)
> > > /*
> > >  * Self hosted memmap array
> > >  */
> > > -   if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> > > -   if (!mhp_supports_memmap_on_memory(size)) {
> > > -   ret = -EINVAL;
> > > +   if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
> > > +   mhp_supports_memmap_on_memory(memblock_size)) {
> > > +   for (cur_start = start; cur_start < start + size;
> > > +    cur_start += memblock_size) {
> > > +   ret = add_memory_create_devices(nid, group, cur_start,
> > > +   memblock_size,
> > > +   mhp_flags);
> > > +   if (ret)
> > > +   goto error;
> > > +   }
> > 
> > We should handle the below error details here. 
> > 
> > 1) If we hit an error after some blocks got added, should we
> > iterate over rest of the dev_dax->nr_range.
> > 2) With some blocks added if we return a failure here, we remove
> > the
> > resource in dax_kmem. Is that ok? 
> > 
> > IMHO error handling with partial creation of memory blocks in a
> > resource range should be
> > documented with this change.
> 
> Or, should we remove all added memory blocks upon error?
> 
I didn't address these in v3 - I wasn't sure how we'd proceed here.
Something obviously went very wrong and I'd imagine it is okay if this
memory is unusable as a result.

What would removing the blocks we added look like? Just call
try_remove_memory() from the error path in add_memory_resource(), for
a range of [start, cur_start)?
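
Something along these lines, perhaps (rough sketch only, not tested;
'tmp' is just an illustrative local, and cur_start is assumed to be the
first block that did *not* get added):

error:
	for (tmp = start; tmp < cur_start; tmp += memblock_size)
		try_remove_memory(tmp, memblock_size);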


Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory

2023-07-13 Thread Verma, Vishal L
On Thu, 2023-07-13 at 17:23 +0200, David Hildenbrand wrote:
> On 13.07.23 17:15, Verma, Vishal L wrote:
> > On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote:
> > > On 13.07.23 08:45, Verma, Vishal L wrote:
> > > > 
> > > > I'm taking a shot at implementing the splitting internally in
> > > > memory_hotplug.c. The caller (kmem) side does become trivial with this
> > > > approach, but there's a slight complication if I don't have the module
> > > > param override (patch 1 of this series).
> > > > 
> > > > The kmem diff now looks like:
> > > > 
> > > >  diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> > > >  index 898ca9505754..8be932f63f90 100644
> > > >  --- a/drivers/dax/kmem.c
> > > >  +++ b/drivers/dax/kmem.c
> > > >  @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax 
> > > > *dev_dax)
> > > >  data->mgid = rc;
> > > >   
> > > >  for (i = 0; i < dev_dax->nr_range; i++) {
> > > >  +   mhp_t mhp_flags = MHP_NID_IS_MGID | 
> > > > MHP_MEMMAP_ON_MEMORY |
> > > >  + MHP_SPLIT_MEMBLOCKS;
> > > >  struct resource *res;
> > > >  struct range range;
> > > >   
> > > >  @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax 
> > > > *dev_dax)
> > > >   * this as RAM automatically.
> > > >   */
> > > >  rc = add_memory_driver_managed(data->mgid, 
> > > > range.start,
> > > >  -   range_len(&range), kmem_name, 
> > > > MHP_NID_IS_MGID);
> > > >  +   range_len(&range), kmem_name, 
> > > > mhp_flags);
> > > >   
> > > >  if (rc) {
> > > >  dev_warn(dev, "mapping%d: %#llx-%#llx 
> > > > memory add failed\n",
> > > >  
> > > > 
> > > 
> > > Why do we need the MHP_SPLIT_MEMBLOCKS?
> > 
> > I thought we still wanted either an opt-in or opt-out for the kmem
> > driver to be able to do memmap_on_memory, in case there were
> > performance implications or the lack of 1GiB PUDs. I haven't
> > implemented that yet, but I was thinking along the lines of a sysfs
> > knob exposed by kmem, that controls setting of this new
> > MHP_SPLIT_MEMBLOCKS flag.
> 
> Why is MHP_MEMMAP_ON_MEMORY not sufficient for that?
> 
> 
Ah I see what you mean now - knob just controls MHP_MEMMAP_ON_MEMORY,
and memory_hotplug is free to split into memblock-sized pieces if it
needs to in order to satisfy that.
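
Concretely, on the kmem side that would reduce to just gating the flag -
something like this (sketch; 'memmap_on_memory' stands in for whatever
the knob ends up being called):

	mhp_t mhp_flags = MHP_NID_IS_MGID;

	if (memmap_on_memory)
		mhp_flags |= MHP_MEMMAP_ON_MEMORY;

	rc = add_memory_driver_managed(data->mgid, range.start,
				       range_len(&range), kmem_name, mhp_flags);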

That sounds reasonable. Let me give this a try and see if I run into
anything else. Thanks David!


Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory

2023-07-13 Thread Verma, Vishal L
On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote:
> On 13.07.23 08:45, Verma, Vishal L wrote:
> > 
> > I'm taking a shot at implementing the splitting internally in
> > memory_hotplug.c. The caller (kmem) side does become trivial with this
> > approach, but there's a slight complication if I don't have the module
> > param override (patch 1 of this series).
> > 
> > The kmem diff now looks like:
> > 
> >     diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> >     index 898ca9505754..8be932f63f90 100644
> >     --- a/drivers/dax/kmem.c
> >     +++ b/drivers/dax/kmem.c
> >     @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax 
> > *dev_dax)
> >     data->mgid = rc;
> >  
> >     for (i = 0; i < dev_dax->nr_range; i++) {
> >     +   mhp_t mhp_flags = MHP_NID_IS_MGID | 
> > MHP_MEMMAP_ON_MEMORY |
> >     + MHP_SPLIT_MEMBLOCKS;
> >     struct resource *res;
> >     struct range range;
> >  
> >     @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax 
> > *dev_dax)
> >  * this as RAM automatically.
> >  */
> >     rc = add_memory_driver_managed(data->mgid, range.start,
> >     -   range_len(&range), kmem_name, 
> > MHP_NID_IS_MGID);
> >     +   range_len(&range), kmem_name, 
> > mhp_flags);
> >  
> >     if (rc) {
> >     dev_warn(dev, "mapping%d: %#llx-%#llx memory 
> > add failed\n",
> >     
> > 
> 
> Why do we need the MHP_SPLIT_MEMBLOCKS?

I thought we still wanted either an opt-in or opt-out for the kmem
driver to be able to do memmap_on_memory, in case there were
performance implications or the lack of 1GiB PUDs. I haven't
implemented that yet, but I was thinking along the lines of a sysfs
knob exposed by kmem, that controls setting of this new
MHP_SPLIT_MEMBLOCKS flag.

> 
> In add_memory_driver_managed(), if memmap_on_memory = 1 AND is effective for a
> single memory block, you can simply split up internally, no?
> 
> Essentially in add_memory_resource() something like
> 
> if (mhp_flags & MHP_MEMMAP_ON_MEMORY &&
>  mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
> for (cur_start = start, cur_start < start + size;
>  cur_start += memory_block_size_bytes()) {
> mhp_altmap.free = PHYS_PFN(memory_block_size_bytes());
> mhp_altmap.base_pfn = PHYS_PFN(start);
> > params.altmap = &mhp_altmap;
> 
> ret = arch_add_memory(nid, start,
>   memory_block_size_bytes(), &params);
> if (ret < 0) ...
> 
> ret = create_memory_block_devices(start, 
> memory_block_size_bytes(),
>   mhp_altmap.alloc, group);
> if (ret) ...
> 
> }
> } else {
> /* old boring stuff */
> }
> 
> Of course, doing it a bit cleaner, factoring out adding of mem+creating 
> devices into
> a helper so we can use it on the other path, avoiding repeating 
> memory_block_size_bytes()
> ...

My current approach was looping a level higher, on the call to
add_memory_resource, but this looks reasonable too, and I can switch to
this. In fact it is better as in my case I had to loop twice, once for
the regular add_memory() path and once for the _driver_managed() path.
Yours should avoid that.

> 
> If any adding of memory failed, we remove what we already added. That works, 
> because as
> long as we're holding the relevant locks, memory cannot get onlined in the 
> meantime.
> 
> Then we only have to teach remove_memory() to deal with individual blocks 
> when finding
> blocks that have an altmap.
> 



Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory

2023-07-13 Thread Verma, Vishal L
On Tue, 2023-07-11 at 17:21 +0200, David Hildenbrand wrote:
> On 11.07.23 16:30, Aneesh Kumar K.V wrote:
> > David Hildenbrand  writes:
> > > 
> > > Maybe the better alternative is teach
> > > add_memory_resource()/try_remove_memory() to do that internally.
> > > 
> > > In the add_memory_resource() case, it might be a loop around that
> > > memmap_on_memory + arch_add_memory code path (well, and the error path
> > > also needs adjustment):
> > > 
> > > /*
> > >  * Self hosted memmap array
> > >  */
> > > if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> > > if (!mhp_supports_memmap_on_memory(size)) {
> > > ret = -EINVAL;
> > > goto error;
> > > }
> > > mhp_altmap.free = PHYS_PFN(size);
> > > mhp_altmap.base_pfn = PHYS_PFN(start);
> > > params.altmap = &mhp_altmap;
> > > }
> > > 
> > > /* call arch's memory hotadd */
> > > ret = arch_add_memory(nid, start, size, );
> > > if (ret < 0)
> > > goto error;
> > > 
> > > 
> > > Note that we want to handle that on a per memory-block basis, because we
> > > don't want the vmemmap of memory block #2 to end up on memory block #1.
> > > It all gets messy with memory onlining/offlining etc otherwise ...
> > > 
> > 
> > I tried to implement this inside add_memory_driver_managed() and also
> > within dax/kmem. IMHO doing the error handling inside dax/kmem is
> > better. Here is how it looks:
> > 
> > 1. If any blocks got added before (mapped > 0) we loop through all 
> > successful request_mem_regions
> > 2. For each succesful request_mem_regions if any blocks got added, we
> > keep the resource. If none got added, we will kfree the resource
> > 
> 
> Doing this unconditional splitting outside of 
> add_memory_driver_managed() is undesirable for at least two reasons
> 
> 1) You end up always creating individual entries in the resource tree
>     (/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective.
> 2) As we call arch_add_memory() in memory block granularity (e.g., 128
>     MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in the
>     identify mapping -- even if MHP_MEMMAP_ON_MEMORY is not effective.
> 
> While you could sense for support and do the split based on that, it 
> will be beneficial for other users (especially DIMMs) if we do that 
> internally -- where we already know if MHP_MEMMAP_ON_MEMORY can be 
> effective or not.

I'm taking a shot at implementing the splitting internally in
memory_hotplug.c. The caller (kmem) side does become trivial with this
approach, but there's a slight complication if I don't have the module
param override (patch 1 of this series).

The kmem diff now looks like:

   diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
   index 898ca9505754..8be932f63f90 100644
   --- a/drivers/dax/kmem.c
   +++ b/drivers/dax/kmem.c
   @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
   data->mgid = rc;

   for (i = 0; i < dev_dax->nr_range; i++) {
   +   mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
   + MHP_SPLIT_MEMBLOCKS;
   struct resource *res;
   struct range range;

   @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
* this as RAM automatically.
*/
   rc = add_memory_driver_managed(data->mgid, range.start,
   -   range_len(&range), kmem_name, 
MHP_NID_IS_MGID);
   +   range_len(&range), kmem_name, mhp_flags);

   if (rc) {
   dev_warn(dev, "mapping%d: %#llx-%#llx memory add 
failed\n",
   

However this begins to fail if the memmap_on_memory modparam is not
set, as add_memory_driver_managed EINVALs from the
mhp_supports_memmap_on_memory() check.

The way to work around this would probably include doing the
mhp_supports_memmap_on_memory() check in kmem, in a loop to check for
each memblock sized chunk, and that feels like a leak of the
implementation details into the caller.
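
For illustration, that workaround in kmem would end up looking something
like this per range (sketch only, and exactly the kind of leak I'd rather
avoid):

	mhp_t mhp_flags = MHP_NID_IS_MGID;

	/* probe support at memblock granularity from the caller */
	if (mhp_supports_memmap_on_memory(memory_block_size_bytes()))
		mhp_flags |= MHP_MEMMAP_ON_MEMORY | MHP_SPLIT_MEMBLOCKS;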

Any suggestions on how to go about this?
> 
> In general, we avoid placing important kernel data-structures on slow
> memory. That's one of the reasons why PMEM decided to mostly always use 
> ZONE_MOVABLE such that exactly what this patch does would not happen. So 
> I'm wondering if there would be demand for an additional toggle.
> 
> Because even with memmap_on_memory enabled in general, you might not 
> want to do that for dax/kmem.
> 
> IMHO, this patch should be dropped from your ppc64 series, as it's an
> independent change that might be valuable for other architectures as well.
> 
Sure thing, I can pick this back up and Aneesh can drop this from his set.


Re: [PATCH] ACPI: NFIT: add helper to_nfit_mem() to take device to nfit_mem

2023-07-05 Thread Verma, Vishal L
On Mon, 2023-07-03 at 14:17 +0100, Ben Dooks wrote:
> Add a quick helper to just do struct device to the struct nfit_mem
> field it should be referencing. This reduces the number of code
> lines in some of the followgn code as it removes the intermediate
> struct nvdimm.
> 
> Signed-off-by: Ben Dooks 
> ---
>  drivers/acpi/nfit/core.c | 27 +--
>  1 file changed, 13 insertions(+), 14 deletions(-)
> 

Modulo the typo Dave pointed out,

Reviewed-by: Vishal Verma 


[GIT PULL] NVDIMM and DAX for 6.5

2023-06-30 Thread Verma, Vishal L
Hi Linus, please pull from

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git 
tags/libnvdimm-for-6.5

... to receive  the libnvdimm and DAX update for v6.5.

This is mostly small cleanups and fixes, with the biggest change being
the update to the DAX fault handler allowing it to return VM_FAULT_HWPOISON.

It has appeared in linux-next with no reported issues.

On an operational note, as Dan handed off the branch to me for this cycle, we
missed that the original few commits were inadvertently made on top of a few
CXL commits that went in in the 6.4-rc cycle via the CXL tree.

git-request-pull included these, and hence they appear in the shortlog and
diffstat below, but the actual merge correctly identifies and skips over them.
I kept it as it is to preserve the linux-next soak time, but if I should have
done it differently, please let me know.

---

The following changes since commit f1fcbaa18b28dec10281551dfe6ed3a3ed80e3d6:

  Linux 6.4-rc2 (2023-05-14 12:51:40 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git 
tags/libnvdimm-for-6.5

for you to fetch changes up to 1ea7ca1b090145519aad998679222f0a14ab8fce:

  dax: enable dax fault handler to report VM_FAULT_HWPOISON (2023-06-26 
07:54:23 -0600)


libnvdimm for 6.5

- DAX fixes and cleanups including a use after free, extra references,
  and device unregistration, and a redundant variable.

- Allow the DAX fault handler to return VM_FAULT_HWPOISON

- A few libnvdimm cleanups such as making some functions and variables
static where sufficient.

- Add a few missing prototypes for wrapped functions in
tools/testing/nvdimm


Arnd Bergmann (3):
  dax: fix missing-prototype warnings
  testing: nvdimm: add missing prototypes for wrapped functions
  libnvdimm: mark 'security_show' static again

Ben Dooks (2):
  nvdimm: make nd_class variable static
  nvdimm: make security_show static

Colin Ian King (1):
  fsdax: remove redundant variable 'error'

Dan Williams (5):
  cxl/port: Enable the HDM decoder capability for switch ports
  dax: Fix dax_mapping_release() use after free
  dax: Use device_unregister() in unregister_dax_mapping()
  dax: Introduce alloc_dev_dax_id()
  dax: Cleanup extra dax_region references

Dave Jiang (2):
  cxl: Wait Memory_Info_Valid before access memory related info
  cxl: Move cxl_await_media_ready() to before capacity info retrieval

Jane Chu (1):
  dax: enable dax fault handler to report VM_FAULT_HWPOISON

Tarun Sahu (1):
  dax/kmem: Pass valid argument to memory_group_register_static

Uwe Kleine-König (1):
  tools/testing/nvdimm: Drop empty platform remove function

Vishal Verma (1):
  Merge branch 'for-6.5/dax-cleanups' into nvdimm-for-next

 include/linux/dax.h   |  13 
 include/linux/mm.h|   2 +
 drivers/cxl/cxl.h |   1 +
 drivers/cxl/cxlmem.h  |   2 +
 drivers/cxl/cxlpci.h  |   2 +
 drivers/dax/bus.h |   8 ---
 drivers/dax/dax-private.h |  11 +++-
 tools/testing/nvdimm/test/nfit_test.h |  29 +
 drivers/cxl/core/mbox.c   |  15 +++--
 drivers/cxl/core/pci.c| 112 ++
 drivers/cxl/mem.c |   3 +
 drivers/cxl/pci.c |   6 ++
 drivers/cxl/port.c|  20 +++---
 drivers/dax/bus.c |  64 +++
 drivers/dax/cxl.c |   8 +--
 drivers/dax/device.c  |   3 +-
 drivers/dax/hmem/hmem.c   |   8 +--
 drivers/dax/kmem.c|   2 +-
 drivers/dax/pmem.c|   7 +--
 drivers/dax/super.c   |   5 +-
 drivers/nvdimm/bus.c  |   2 +-
 drivers/nvdimm/dimm_devs.c|   4 +-
 drivers/nvdimm/pmem.c |   2 +-
 drivers/s390/block/dcssblk.c  |   3 +-
 fs/dax.c  |  14 ++---
 fs/fuse/virtio_fs.c   |   3 +-
 tools/testing/cxl/test/mem.c  |   1 +
 tools/testing/cxl/test/mock.c |  15 +
 tools/testing/nvdimm/test/nfit.c  |   6 --
 tools/testing/cxl/Kbuild  |   1 +
 30 files changed, 265 insertions(+), 107 deletions(-)


Re: [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem

2023-06-21 Thread Verma, Vishal L
On Fri, 2023-06-16 at 09:44 +0200, David Hildenbrand wrote:
> On 16.06.23 00:00, Vishal Verma wrote:
> > The dax/kmem driver can potentially hot-add large amounts of memory
> > originating from CXL memory expanders, or NVDIMMs, or other 'device
> > memories'. There is a chance there isn't enough regular system memory
> > available to fit the memmap for this new memory. It's therefore
> > desirable, if all other conditions are met, for the kmem managed memory
> > to place its memmap on the newly added memory itself.
> > 
> > Arrange for this by first allowing for a module parameter override for
> > the mhp_supports_memmap_on_memory() test using a flag, adjusting the
> > only other caller of this interface in drivers/acpi/acpi_memoryhotplug.c,
> > exporting the symbol so it can be called by kmem.c, and finally changing
> > the kmem driver to add_memory() in chunks of memory_block_size_bytes().
> 
> 1) Why is the override a requirement here? Just let the admin configure 
> it then then add conditional support for kmem.

Configure it in the current way using the module parameter to
memory_hotplug? The whole module param check feels a bit awkward,
especially since memory_hotplug is built in, so the only way to supply
the param is on the kernel command line as opposed to a modprobe config.

The goal with extending mhp_supports_memmap_on_memory() to check for
support with or without consideration for the module param is that it
makes it a bit more flexible for callers like kmem.

> 2) I recall that there are cases where we don't want the memmap to land 
> on slow memory (which online_movable would achieve). Just imagine the
> slow PMEM case. So this might need another configuration knob on the 
> kmem side.
> 
> I recall some discussions on doing that chunk handling internally (so
> kmem can just perform one add_memory() and we'd split that up internally).
> 
Another config knob isn't unreasonable - though the thinking in making
this behavior the new default policy was that with CXL based memory
expanders, the performance delta from main memory wouldn't be as big as
the pmem - main memory delta. With pmem devices being phased out, it's
not clear we'd need a knob, and it can always be added if it ends up
becoming necessary.

The other comments about doing the per-memblock loop internally, and
fixing up the removal paths all sound good, and I've started reworking
those - thanks for taking a look!


Re: [PATCH] dax/kmem: Pass valid argument to memory_group_register_static

2023-06-21 Thread Verma, Vishal L
On Wed, 2023-06-21 at 11:36 +0530, Tarun Sahu wrote:
> Hi Alison,
> 
> Alison Schofield  writes:
> 
> > On Tue, Jun 20, 2023 at 07:33:32PM +0530, Tarun Sahu wrote:
> > > memory_group_register_static takes maximum number of pages as the argument
> > > while dev_dax_kmem_probe passes total_len (in bytes) as the argument.
> > 
> > This sounds like a fix. An explanation of the impact and a fixes tag
> > may be needed. Also, wondering how you found it.
> > 
> Yes, it is a fix, I found it during dry code walk-through.
> There is not any impact as such, since
> memory_group_register_static() just sets the max_pages limit, which
> is used in auto_movable_zone_for_pfn() to determine the zone.
> 
> which might cause these conditions to behave differently:
> 
> This will always be true, so the jump to kernel_zone will happen:
> if (!auto_movable_can_online_movable(NUMA_NO_NODE, group, nr_pages))
> goto kernel_zone;
> ---
> kernel_zone:
> return default_kernel_zone_for_pfn(nid, pfn, nr_pages);
> 
> ---
> 
> Here, in the code below, the zone_intersects() compare range will be larger, as nr_pages
> will be higher (derived from the total_len passed in dev_dax_kmem_probe).
> 
> static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long 
> start_pfn,
> unsigned long nr_pages)
> {
> struct pglist_data *pgdat = NODE_DATA(nid);
> int zid;
> 
> for (zid = 0; zid < ZONE_NORMAL; zid++) {
> > struct zone *zone = &pgdat->node_zones[zid];
> 
> if (zone_intersects(zone, start_pfn, nr_pages))
> return zone;
> }
> 
> > return &pgdat->node_zones[ZONE_NORMAL];
> }
> 
> In most cases, ZONE_NORMAL will be returned. But there are no
> crash/panic issues involved here; only the decision on selecting a zone
> is affected.
> 

Hi Tarun,

Good find! With a Fixes tag, and perhaps a bit more of this detail
included in the commit message, feel free to add:

Reviewed-by: Vishal Verma 
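
(As an aside, the conversion presumably reduces to passing pages rather
than bytes at the call site, i.e. something like:

	rc = memory_group_register_static(numa_node, PFN_UP(total_len));

but I'll defer to the updated patch for the exact form.)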


Re: [PATCH v2] nvdimm: Use kstrtobool() instead of strtobool()

2023-01-25 Thread Verma, Vishal L
On Sat, 2023-01-14 at 09:50 +0100, Christophe JAILLET wrote:
> strtobool() is the same as kstrtobool().
> However, the latter is more used within the kernel.
> 
> In order to remove strtobool() and slightly simplify kstrtox.h, switch to
> the other function name.
> 
> While at it, include the corresponding header file ()
> 
> Signed-off-by: Christophe JAILLET 
> ---
> This patch was already sent as a part of a serie ([1]) that axed all usages
> of strtobool().
> Most of the patches have been merged in -next.
> 
> I synch'ed with latest -next and re-send the remaining ones as individual
> patches.
> 
> Changes in v2:
>   - synch with latest -next.
> 
> [1]: 
> https://lore.kernel.org/all/cover.1667336095.git.christophe.jail...@wanadoo.fr/
> ---
>  drivers/nvdimm/namespace_devs.c | 3 ++-
>  drivers/nvdimm/pmem.c   | 3 ++-
>  drivers/nvdimm/region_devs.c    | 5 +++--
>  3 files changed, 7 insertions(+), 4 deletions(-)

This looks reasonable,

Reviewed-by: Vishal Verma 

> 
> diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
> index c60ec0b373c5..07177eadc56e 100644
> --- a/drivers/nvdimm/namespace_devs.c
> +++ b/drivers/nvdimm/namespace_devs.c
> @@ -2,6 +2,7 @@
>  /*
>   * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
>   */
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1338,7 +1339,7 @@ static ssize_t force_raw_store(struct device *dev,
> struct device_attribute *attr, const char *buf, size_t len)
>  {
> bool force_raw;
> -   int rc = strtobool(buf, &force_raw);
> +   int rc = kstrtobool(buf, &force_raw);
>  
> if (rc)
> return rc;
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 80ded5a2838a..f2a336c6d8c6 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -408,7 +409,7 @@ static ssize_t write_cache_store(struct device *dev,
> bool write_cache;
> int rc;
>  
> -   rc = strtobool(buf, &write_cache);
> +   rc = kstrtobool(buf, &write_cache);
> if (rc)
> return rc;
> dax_write_cache(pmem->dax_dev, write_cache);
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index 83dbf398ea84..f5872de7ea5a 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -5,6 +5,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -275,7 +276,7 @@ static ssize_t deep_flush_store(struct device *dev, 
> struct device_attribute *attr,
> const char *buf, size_t len)
>  {
> bool flush;
> -   int rc = strtobool(buf, &flush);
> +   int rc = kstrtobool(buf, &flush);
> struct nd_region *nd_region = to_nd_region(dev);
>  
> if (rc)
> @@ -530,7 +531,7 @@ static ssize_t read_only_store(struct device *dev,
> struct device_attribute *attr, const char *buf, size_t len)
>  {
> bool ro;
> -   int rc = strtobool(buf, &ro);
> +   int rc = kstrtobool(buf, &ro);
> struct nd_region *nd_region = to_nd_region(dev);
>  
> if (rc)



Re: [PATCH 2/2] ACPI: HMAT: Fix initiator registration for single-initiator systems

2022-11-16 Thread Verma, Vishal L
On Wed, 2022-11-16 at 15:46 +0300, Kirill A. Shutemov wrote:
> On Wed, Nov 16, 2022 at 12:57:36AM -0700, Vishal Verma wrote:
> > In a system with a single initiator node, and one or more memory-only
> > 'target' nodes, the memory-only node(s) would fail to register their
> > initiator node correctly. i.e. in sysfs:
> > 
> >   # ls /sys/devices/system/node/node0/access0/targets/
> >   node0
> > 
> > Where as the correct behavior should be:
> > 
> >   # ls /sys/devices/system/node/node0/access0/targets/
> >   node0 node1
> > 
> > This happened because hmat_register_target_initiators() uses list_sort()
> > to sort the initiator list, but the sort comparison function
> > (initiator_cmp()) is overloaded to also set the node mask's bits.
> > 
> > In a system with a single initiator, the list is singular, and list_sort
> > elides the comparison helper call. Thus the node mask never gets set,
> > and the subsequent search for the best initiator comes up empty.
> > 
> > Add a new helper to sort the initiator list, and handle the singular
> > list corner case by setting the node mask for that explicitly.
> > 
> > Reported-by: Chris Piper 
> > Cc: 
> > Cc: Rafael J. Wysocki 
> > Cc: Liu Shixin 
> > Cc: Dan Williams 
> > Signed-off-by: Vishal Verma 
> > ---
> >  drivers/acpi/numa/hmat.c | 32 ++--
> >  1 file changed, 30 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> > index 144a84f429ed..cd20b0e9cdfa 100644
> > --- a/drivers/acpi/numa/hmat.c
> > +++ b/drivers/acpi/numa/hmat.c
> > @@ -573,6 +573,30 @@ static int initiator_cmp(void *priv, const struct 
> > list_head *a,
> > return ia->processor_pxm - ib->processor_pxm;
> >  }
> >  
> > +static int initiators_to_nodemask(unsigned long *p_nodes)
> > +{
> > +   /*
> > +    * list_sort doesn't call @cmp (initiator_cmp) for 0 or 1 sized 
> > lists.
> > +    * For a single-initiator system with other memory-only nodes, this
> > +    * means an empty p_nodes mask, since that is set by 
> > initiator_cmp().
> > +    * Special case the singular list, and make sure the node mask gets 
> > set
> > +    * appropriately.
> > +    */
> > +   if (list_empty(&initiators))
> > +   return -ENXIO;
> > +
> > +   if (list_is_singular(&initiators)) {
> > +   struct memory_initiator *initiator = list_first_entry(
> > +   &initiators, struct memory_initiator, node);
> > +
> > +   set_bit(initiator->processor_pxm, p_nodes);
> > +   return 0;
> > +   }
> > +
> > +   list_sort(p_nodes, &initiators, initiator_cmp);
> > +   return 0;
> > +}
> > +
> 
> Hm. I think it indicates that these set_bit()s do not belong to
> initiator_cmp().
> 
> Maybe remove both set_bit() from the compare helper and walk the list
> separately to initialize the node mask? I think it will be easier to
> follow.


Yes - I thought about this, but went with the seemingly less intrusive
change. I can send a v2 which separates out the set_bit()s. I agree
that's cleaner and easier to follow than overloading initiator_cmp().
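
Roughly what I have in mind for v2 (sketch only - keep initiator_cmp() as
a pure comparator, and build the nodemask with a plain list walk):

static int initiators_to_nodemask(unsigned long *p_nodes)
{
	struct memory_initiator *initiator;

	if (list_empty(&initiators))
		return -ENXIO;

	list_for_each_entry(initiator, &initiators, node)
		set_bit(initiator->processor_pxm, p_nodes);

	return 0;
}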


[GIT PULL] nvdimm fixes v5.19-rc5

2022-07-01 Thread Verma, Vishal L
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
tags/libnvdimm-fixes-5.19-rc5

...to receive a fix for v5.19-rc5. It has been in -next for
a week with no reported issues.

---

The following changes since commit a111daf0c53ae91e71fd2bfe7497862d14132e3e:

  Linux 5.19-rc3 (2022-06-19 15:06:47 -0500)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git 
tags/libnvdimm-fixes-5.19-rc5

for you to fetch changes up to ef9102004a87cb3f8b26e000a095a261fc0467d3:

  nvdimm: Fix badblocks clear off-by-one error (2022-06-24 11:57:19 -0700)


libnvdimm fixes for v5.19-rc5

- Fix a bug in the libnvdimm 'BTT' (Block Translation Table) driver
  where accounting for poison blocks to be cleared was off by one,
  causing a failure to clear the last badblock in an nvdimm region.


Chris Ye (1):
  nvdimm: Fix badblocks clear off-by-one error

 drivers/nvdimm/bus.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index a4fc17db707c..b38d0355b0ac 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -176,8 +176,8 @@ static int nvdimm_clear_badblocks_region(struct device 
*dev, void *data)
ndr_end = nd_region->ndr_start + nd_region->ndr_size - 1;
 
/* make sure we are in the region */
-   if (ctx->phys < nd_region->ndr_start
-   || (ctx->phys + ctx->cleared) > ndr_end)
+   if (ctx->phys < nd_region->ndr_start ||
+   (ctx->phys + ctx->cleared - 1) > ndr_end)
return 0;
 
sector = (ctx->phys - nd_region->ndr_start) / 512;


Re: [PATCH] cxl/mem: Fix memory device capacity probing

2021-04-16 Thread Verma, Vishal L
On Fri, 2021-04-16 at 17:43 -0700, Dan Williams wrote:
> The CXL Identify Memory Device output payload emits capacity in 256MB
> units. The driver is treating the capacity field as bytes. This was
> missed because QEMU reports bytes when it should report bytes / 256MB.
> 
> Fixes: 8adaf747c9f0 ("cxl/mem: Find device capabilities")
> Cc: Ben Widawsky 
> Signed-off-by: Dan Williams 
> ---
>  drivers/cxl/mem.c |7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)

Looks good,
Reviewed-by: Vishal Verma 

> 
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 1b5078311f7d..2acc6173da36 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -4,6 +4,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1419,6 +1420,7 @@ static int cxl_mem_enumerate_cmds(struct cxl_mem *cxlm)
>   */
>  static int cxl_mem_identify(struct cxl_mem *cxlm)
>  {
> + /* See CXL 2.0 Table 175 Identify Memory Device Output Payload */
>   struct cxl_mbox_identify {
>   char fw_revision[0x10];
>   __le64 total_capacity;
> @@ -1447,10 +1449,11 @@ static int cxl_mem_identify(struct cxl_mem *cxlm)
>    * For now, only the capacity is exported in sysfs
>    */
>   cxlm->ram_range.start = 0;
> - cxlm->ram_range.end = le64_to_cpu(id.volatile_capacity) - 1;
> + cxlm->ram_range.end = le64_to_cpu(id.volatile_capacity) * SZ_256M - 1;
>  
> 
> 
> 
>   cxlm->pmem_range.start = 0;
> - cxlm->pmem_range.end = le64_to_cpu(id.persistent_capacity) - 1;
> + cxlm->pmem_range.end =
> + le64_to_cpu(id.persistent_capacity) * SZ_256M - 1;
>  
> 
> 
> 
>   memcpy(cxlm->firmware_version, id.fw_revision, sizeof(id.fw_revision));
>  
> 
> 
> 
> 



Re: [PATCH 3/3] cxl/mem: Demarcate vendor specific capability IDs

2021-04-15 Thread Verma, Vishal L
On Thu, 2021-04-15 at 16:27 -0700, Ben Widawsky wrote:
> Vendor capabilities occupy 0x8000 to 0x according to CXL 2.0 spec
> 8.2.8.2.1 CXL Device Capabilities. While they are not defined by the
> spec, they are allowed and not "unknown". Call this detail out in the
> logs to let users easily distinguish the difference.
> 
> v2: Should be greater than or equal to (Ben)

If there's a v3, drop this to below the '---'. Otherwise note for Dan to
drop when applying I guess :)

> 
> Fixes: 8adaf747c9f0b ("cxl/mem: Find device capabilities")
> Signed-off-by: Ben Widawsky 
> ---
>  drivers/cxl/mem.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index c05617b0ba4b..28c7c29567b3 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -939,7 +939,10 @@ static int cxl_mem_setup_regs(struct cxl_mem *cxlm)
>   cxlm->memdev_regs = register_block;
>   break;
>   default:
> - dev_dbg(dev, "Unknown cap ID: %x (0x%x)\n", cap_id, 
> offset);
> + if (cap_id >= 0x8000)
> + dev_dbg(dev, "Vendor cap ID: %x (0x%x)\n", 
> cap_id, offset);
> + else
> + dev_dbg(dev, "Unknown cap ID: %x (0x%x)\n", 
> cap_id, offset);
>   break;
>   }
>   }



Re: [PATCH 2/3] cxl/mem: Print unknown capability IDs as hex

2021-04-15 Thread Verma, Vishal L
On Thu, 2021-04-15 at 16:26 -0700, Ben Widawsky wrote:
> Trivial. The spec lists these as hex, so do the same here to make
> debugging easier.
> 
> Fixes: 8adaf747c9f0b ("cxl/mem: Find device capabilities")
> Signed-off-by: Ben Widawsky 
> ---
>  drivers/cxl/mem.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 1b5078311f7d..c05617b0ba4b 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -939,7 +939,7 @@ static int cxl_mem_setup_regs(struct cxl_mem *cxlm)
>   cxlm->memdev_regs = register_block;
>   break;
>   default:
> - dev_dbg(dev, "Unknown cap ID: %d (0x%x)\n", cap_id, 
> offset);
> + dev_dbg(dev, "Unknown cap ID: %x (0x%x)\n", cap_id, 
> offset);

Would %#x be better just for making it unambiguous? Maybe also change
the adjacent 0x%x to %#x while at it.
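
i.e., something like this (illustrative only):

	dev_dbg(dev, "Unknown cap ID: %#x (%#x)\n", cap_id, offset);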

>   break;
>   }
>   }



Re: [PATCH v2] libnvdimm: Notify disk drivers to revalidate region read-only

2021-03-10 Thread Verma, Vishal L
On Tue, 2021-03-09 at 17:43 -0800, Dan Williams wrote:
> Previous kernels allowed the BLKROSET to override the disk's read-only
> status. With that situation fixed the pmem driver needs to rely on
> notification events to reevaluate the disk read-only status after the
> host region has been marked read-write.
> 
> Recall that when libnvdimm determines that the persistent memory has
> lost persistence (for example lack of energy to flush from DRAM to FLASH
> on an NVDIMM-N device) it marks the region read-only, but that state can
> be overridden by the user via:
> 
>    echo 0 > /sys/bus/nd/devices/regionX/read_only
> 
> ...to date there is no notification that the region has restored
> persistence, so the user override is the only recovery.
> 
> Fixes: 52f019d43c22 ("block: add a hard-readonly flag to struct gendisk")
> Cc: Christoph Hellwig 
> Cc: Ming Lei 
> Cc: Martin K. Petersen 
> Cc: Hannes Reinecke 
> Cc: Jens Axboe 
> Reported-by: kernel test robot 
> Reported-by: Vishal Verma 
> Signed-off-by: Dan Williams 
> ---
> Changes since v1 [1]:
> - Move from the sinking ship of revalidate_disk() to the local hotness
>   of nd_pmem_notify() (hch).
> 
> [1]: 
> http://lore.kernel.org/r/161527286194.446794.5215036039655765042.st...@dwillia2-desk3.amr.corp.intel.com
> 
>  drivers/nvdimm/bus.c |   14 ++
>  drivers/nvdimm/pmem.c|   37 +
>  drivers/nvdimm/region_devs.c |7 +++
>  include/linux/nd.h   |1 +
>  4 files changed, 47 insertions(+), 12 deletions(-)

With the update to the unit test applied, and this, everything passes
for me. You can add:

Tested-by: Vishal Verma 



Re: [RFC PATCH v3 02/16] cxl/acpi: Add an acpi_cxl module for the CXL interconnect

2021-01-20 Thread Verma, Vishal L
On Wed, 2021-01-13 at 13:40 +0100, Rafael J. Wysocki wrote:
> On Tue, Jan 12, 2021 at 1:29 AM Ben Widawsky  wrote:
> > 
> > From: Vishal Verma 
> > 
> > Add an acpi_cxl module to coordinate the ACPI portions of the CXL
> > (Compute eXpress Link) interconnect. This driver binds to ACPI0017
> > objects in the ACPI tree, and coordinates access to the resources
> > provided by the ACPI CEDT (CXL Early Discovery Table).
> > 
> > It also coordinates operations of the root port _OSC object to notify
> > platform firmware that the OS has native support for the CXL
> > capabilities of endpoints.
> 
> This doesn't happen here, but in the next patch.
> 
> > Note: the actbl1.h changes are speculative. The expectation is that they
> > will arrive through the ACPICA tree in due time.
> 
> So why don't you put them into a separate patch and drop it from the
> series when not necessary any more?

[snip]

> > +/*
> > + * If/when CXL support is defined by other platform firmware the kernel
> > + * will need a mechanism to select between the platform specific version
> > + * of this routine, until then, hard-code ACPI assumptions
> > + */
> > +int cxl_bus_acquire(struct pci_dev *pdev)
> > +{
> > +   struct acpi_device *adev;
> > +   struct pci_dev *root_port;
> > +   struct device *root;
> > +
> > +   root_port = pcie_find_root_port(pdev);
> > +   if (!root_port)
> > +   return -ENXIO;
> > +
> > +   root = root_port->dev.parent;
> > +   if (!root)
> > +   return -ENXIO;
> > +
> > +   adev = ACPI_COMPANION(root);
> > +   if (!adev)
> > +   return -ENXIO;
> > +
> > +   /* TODO: OSC enabling */
> > +
> > +   return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(cxl_bus_acquire);
> 
> I would move the addition of cxl_bus_acquire() entirely to the next
> patch, it looks quite confusing to me as is.

Makes sense - and also agreed with all of your other comments. I've
cleaned this up for the next revision. Thanks Rafael!
> 



Re: [RFC PATCH v3 02/16] cxl/acpi: Add an acpi_cxl module for the CXL interconnect

2021-01-20 Thread Verma, Vishal L
On Tue, 2021-01-12 at 18:43 +, Jonathan Cameron wrote:
> On Mon, 11 Jan 2021 14:51:06 -0800
> Ben Widawsky  wrote:
> 
> > From: Vishal Verma 
> > 
> > Add an acpi_cxl module to coordinate the ACPI portions of the CXL
> > (Compute eXpress Link) interconnect. This driver binds to ACPI0017
> > objects in the ACPI tree, and coordinates access to the resources
> > provided by the ACPI CEDT (CXL Early Discovery Table).
> > 
> > It also coordinates operations of the root port _OSC object to notify
> > platform firmware that the OS has native support for the CXL
> > capabilities of endpoints.
> > 
> > Note: the actbl1.h changes are speculative. The expectation is that they
> > will arrive through the ACPICA tree in due time.
> 
> I would pull the ACPICA changes out into a precursor patch.

Done.

> 
> > 
> > Cc: Ben Widawsky 
> > Cc: Dan Williams 
> > Signed-off-by: Vishal Verma 
> > Signed-off-by: Ben Widawsky 
> 
> Hi,
> 
> I think it would be good to also add CEDT to the list in drivers/acpi/tables.c
> so that we can dump it from /sys/firmware/acpi/tables/ and potentially
> override it from an initrd.

Dumping it from /sys/firmware/acpi/tables already works, but I did add
it to tables.c so it can also be overridden. I did this in a separate
patch, in case ACPI folks want to do this differently.

> 
> https://elixir.bootlin.com/linux/v5.11-rc3/source/drivers/acpi/tables.c#L482
> Can be very helpful whilst debugging.  Related to that, anyone know if anyone
> has acpica patches so we can have iasl -d work on the table?  Would probably
> be useful but I'd rather not duplicate work if it's already done.
> 
> A few minor things inline

Agreed with all of your other comments below, and I've addressed them for
the next revision. Thanks for the review Jonathan!

> 
> Jonathan
> 
> 



Re: [PATCH] ACPI: NFIT: Fix flexible_array.cocci warnings

2021-01-05 Thread Verma, Vishal L
On Tue, 2021-01-05 at 13:03 -0800, Dan Williams wrote:
> Julia and 0day report:
> 
> Zero-length and one-element arrays are deprecated, see
> Documentation/process/deprecated.rst
> Flexible-array members should be used instead.
> 
> However, a straight conversion to flexible arrays yields:
> 
> drivers/acpi/nfit/core.c:2276:4: error: flexible array member in a struct 
> with no named members
> drivers/acpi/nfit/core.c:2287:4: error: flexible array member in a struct 
> with no named members
> 
> Instead, just use plain arrays not embedded a flexible arrays.

This reads a bit awkwardly, maybe:

  "Just use plain arrays instead of embedded flexible arrays."

Other than that, the patch looks good:
Reviewed-by: Vishal Verma 

> 
> Cc: Denis Efremov 
> Reported-by: kernel test robot 
> Reported-by: Julia Lawall 
> Signed-off-by: Dan Williams 
> ---
>  drivers/acpi/nfit/core.c |   75 
> +-
>  1 file changed, 28 insertions(+), 47 deletions(-)
> 
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index b11b08a60684..8c5dde628405 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -2269,40 +2269,24 @@ static const struct attribute_group 
> *acpi_nfit_region_attribute_groups[] = {
>  
> 
> 
> 
>  /* enough info to uniquely specify an interleave set */
>  struct nfit_set_info {
> - struct nfit_set_info_map {
> - u64 region_offset;
> - u32 serial_number;
> - u32 pad;
> - } mapping[0];
> + u64 region_offset;
> + u32 serial_number;
> + u32 pad;
>  };
>  
> 
> 
> 
>  struct nfit_set_info2 {
> - struct nfit_set_info_map2 {
> - u64 region_offset;
> - u32 serial_number;
> - u16 vendor_id;
> - u16 manufacturing_date;
> - u8  manufacturing_location;
> - u8  reserved[31];
> - } mapping[0];
> + u64 region_offset;
> + u32 serial_number;
> + u16 vendor_id;
> + u16 manufacturing_date;
> + u8 manufacturing_location;
> + u8 reserved[31];
>  };
>  
> 
> 
> 
> -static size_t sizeof_nfit_set_info(int num_mappings)
> -{
> - return sizeof(struct nfit_set_info)
> - + num_mappings * sizeof(struct nfit_set_info_map);
> -}
> -
> -static size_t sizeof_nfit_set_info2(int num_mappings)
> -{
> - return sizeof(struct nfit_set_info2)
> - + num_mappings * sizeof(struct nfit_set_info_map2);
> -}
> -
>  static int cmp_map_compat(const void *m0, const void *m1)
>  {
> - const struct nfit_set_info_map *map0 = m0;
> - const struct nfit_set_info_map *map1 = m1;
> + const struct nfit_set_info *map0 = m0;
> + const struct nfit_set_info *map1 = m1;
>  
> 
> 
> 
> >   return memcmp(&map0->region_offset, &map1->region_offset,
>   sizeof(u64));
> @@ -2310,8 +2294,8 @@ static int cmp_map_compat(const void *m0, const void 
> *m1)
>  
> 
> 
> 
>  static int cmp_map(const void *m0, const void *m1)
>  {
> - const struct nfit_set_info_map *map0 = m0;
> - const struct nfit_set_info_map *map1 = m1;
> + const struct nfit_set_info *map0 = m0;
> + const struct nfit_set_info *map1 = m1;
>  
> 
> 
> 
>   if (map0->region_offset < map1->region_offset)
>   return -1;
> @@ -2322,8 +2306,8 @@ static int cmp_map(const void *m0, const void *m1)
>  
> 
> 
> 
>  static int cmp_map2(const void *m0, const void *m1)
>  {
> - const struct nfit_set_info_map2 *map0 = m0;
> - const struct nfit_set_info_map2 *map1 = m1;
> + const struct nfit_set_info2 *map0 = m0;
> + const struct nfit_set_info2 *map1 = m1;
>  
> 
> 
> 
>   if (map0->region_offset < map1->region_offset)
>   return -1;
> @@ -2361,22 +2345,22 @@ static int acpi_nfit_init_interleave_set(struct 
> acpi_nfit_desc *acpi_desc,
>   return -ENOMEM;
> >   import_guid(&nd_set->type_guid, spa->range_guid);
>  
> 
> 
> 
> - info = devm_kzalloc(dev, sizeof_nfit_set_info(nr), GFP_KERNEL);
> + info = devm_kcalloc(dev, nr, sizeof(*info), GFP_KERNEL);
>   if (!info)
>   return -ENOMEM;
>  
> 
> 
> 
> - info2 = devm_kzalloc(dev, sizeof_nfit_set_info2(nr), GFP_KERNEL);
> + info2 = devm_kcalloc(dev, nr, sizeof(*info2), GFP_KERNEL);
>   if (!info2)
>   return -ENOMEM;
>  
> 
> 
> 
>   for (i = 0; i < nr; i++) {
> >   struct nd_mapping_desc *mapping = &ndr_desc->mapping[i];
> > - struct nfit_set_info_map *map = &info->mapping[i];
> > - struct nfit_set_info_map2 *map2 = &info2->mapping[i];
>   struct nvdimm *nvdimm = mapping->nvdimm;
>   struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
> - struct acpi_nfit_memory_map *memdev = memdev_from_spa(acpi_desc,
> - spa->range_index, i);
> > + struct nfit_set_info *map = &info[i];
> > + struct nfit_set_info2 *map2 = &info2[i];
> + struct 

Re: [RFC PATCH v2 00/14] CXL 2.0 Support

2020-12-08 Thread Verma, Vishal L
On Tue, 2020-12-08 at 16:24 -0800, Ben Widawsky wrote:
> Changes since v1 [1]
> 
> A few additions have been made:
>  - IOCTL (UAPI) interface has been added with commands
>  - Kernel docs have been created
>  - A new debug macro is introduced and sprinkled throughout.
> 
> A deletion was made:
>  - Removal of the non-standard _OSC UUID.
> 
> The detailed list of fixes is:
>  - fix cxl_register() no previous prototype warning (0day robot)
>  - s/REGLOG/REGLOC/ (Ben)
>  - Wait for doorbell on cxl_mem_mbox_get() and add comment on why (Ben)
>  - make "type-3" a proper adjective, add spec references, also did the same 
> for
>the Kconfig (Bjorn)
>  - align some defines (Bjorn)
>  - s/bar/BAR (Bjorn)
>  - rename cxl_bus_prepared() to cxl_bus_acquire() (Bjorn)
>  - move definition of struct cxl_mem to "cxl/mem: Map memory device 
> registers" (Bjorn)
>  - use consistent hex/decimal (Bjorn)
>  - use consistent upper/lower hex values (Bjorn)
>  - get rid of READ_ONCE (Bjorn)
>  - add offsets to debug messages (Bjorn)
>  - cleanup SPDX comment style (Bjorn, Christoph)
>  - change errors returned by case (Bjorn, Dan)
>  - 80 character violation cleanups (Christoph)
>  - cleanup CXL_BUS_PROVIDER dependencies (Christoph, Randy)
>  - remove "raw" from mmio functions (Dan)
>  - rename PCI_DVSEC_VENDOR_CXL to add _ID (Jonathan)
>  - combine introduction of mbox infrastruct and cxl_mem_identify() (Jonathan)
>  - add ABI documentation for sysfs attributes (Jonathan)
>  - document scope of cxl_memdev_lock (Jonathan)
>  - rework cxl_register() to have devm semantics (reaction to comments about
>cxl_mem_remove() and cxl_mem_add_memdev() semantics) (Jonathan)
>  - fix cxl_mem_exit() ordering (Jonathan)
>  - use GENMASK/GET_FIELD (Jonathan)
>  - fix and add comments for cap ids (Jonathan)
>  - use _OFFSET postfix in definitions (Jonathan)
>  - save pci_set_drvdata for later (Jonathan)

There are a few more change credits for the acpi patches:

- Remove unnecessary helpers and callbacks in drivers/cxl/acpi.c (Christoph, 
Jonathan)
- Remove some unnecessary variable initializations (Bjorn, Jonathan)
- Convert to platform_driver (Rafael)

> 
> [1]: 
> https://lore.kernel.org/linux-cxl/2020054356.793390-1-ben.widaw...@intel.com/
> 
> ---
> 
> Introduce support for “type-3” memory devices defined in the recently released
> Compute Express Link (CXL) 2.0 specification[2]. Specifically, these are the
> memory devices defined by section 8.2.8.5 of the CXL 2.0 spec. A reference
> implementation emulating these devices has been submitted to the QEMU mailing
> list and is available on gitlab [3]. “Type-3” is a CXL device that acts as a
> memory expander for RAM or PMEM.  It might be interleaved with other CXL 
> devices
> in a given physical address range.
> 
> These changes allow for foundational enumeration of CXL 2.0 memory devices as
> well as basic userspace interaction. The functionality present is:
> - Initial driver bring-up
> - Device enumeration and an initial sysfs representation
> - Submit a basic firmware command via ‘mailbox’ to an emulated memory device
>   with non-volatile capacity.
> - Provide an interface to send "raw" commands to the hardware.
> 
> Some of the functionality that is still missing includes:
> - Memory interleaving at the host bridge, root port, or switch level
> - CXL 1.1 Root Complex Integrated Endpoint Support
> - CXL 2.0 Hot plug support
> - A bevy of supported device commands
> 
> In addition to the core functionality of discovering the spec defined 
> registers
> and resources, introduce a CXL device model that will be the foundation for
> translating CXL capabilities into existing Linux infrastructure for Persistent
> Memory and other memory devices. For now, this only includes support for the
> management command mailbox that type-3 devices surface. These control devices
> fill the role of “DIMMs” / nmemX memory-devices in LIBNVDIMM terms.
> 
> Now, while implementing the driver some feedback for the specification was
> generated to cover perceived gaps and address conflicts. The feedback is
> presented as a reference implementation in the driver and QEMU emulation.
> Specifically the following concepts are original to the Linux implementation 
> and
> feedback / collaboration is requested to develop these into specification
> proposals:
> 1. Top level ACPI object (ACPI0017)
> 2. HW imposed address space and interleave constraints
> 
> ACPI0017
> 
> Introduce a new ACPI namespace device with an _HID of ACPI0017. The purpose of
> this object is twofold, support a legacy OS with a set of out-of-tree CXL
> modules, and establish an attach point for a driver that knows about
> interleaving. Both of these boil down to the same point, to centralize 
> Operating
> System support for resources described by the CXL Early Discovery Table 
> (CEDT).
> 
> The legacy OS problem stems from the spec's description of a host bridge,
> ACPI0016 is denoted as the _HID for host bridges, with a 

Re: [RFC PATCH 0/9] CXL 2.0 Support

2020-12-04 Thread Verma, Vishal L
On Fri, 2020-12-04 at 10:12 -0800, Ben Widawsky wrote:
> Hi Chris.
> 
> On 20-12-04 12:40:03, Chris Browy wrote:
[..]
> 
> >acpidump indicates the CXL0 and CXLM devices but no SRAT or HMAT tables 
> > are
> >in the dump which is curious.
> 
> I don't typically use HMAT, but I do have an SRAT in mine, so that's strange.
> You should also have a CEDT.
> 
I suspect an SRAT is only added if you have distinct numa nodes. Adding
a few '-numa node' bits to the qemu command line should be enough to
make that happen.


Re: [PATCH 1/1] ACPI/nfit: correct the badrange to be reported in nfit_handle_mce()

2020-11-18 Thread Verma, Vishal L
On Wed, 2020-11-18 at 16:41 +0800, Zhen Lei wrote:
> The badrange to be reported should always cover mce->addr.
> 
> Signed-off-by: Zhen Lei 
> ---
>  drivers/acpi/nfit/mce.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Ah good find, agreed with Dan that this is stable material.
Minor nit, I'd recommend rewording the subject line to something like:

"acpi/nfit: fix badrange insertion in nfit_handle_mce()"

Otherwise, looks good to me.
Reviewed-by: Vishal Verma 

> 
> diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
> index ee8d9973f60b..053e719c7bea 100644
> --- a/drivers/acpi/nfit/mce.c
> +++ b/drivers/acpi/nfit/mce.c
> @@ -63,7 +63,7 @@ static int nfit_handle_mce(struct notifier_block *nb, 
> unsigned long val,
>  
>   /* If this fails due to an -ENOMEM, there is little we can do */
>   nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
> - ALIGN(mce->addr, L1_CACHE_BYTES),
> + ALIGN_DOWN(mce->addr, L1_CACHE_BYTES),
>   L1_CACHE_BYTES);
>   nvdimm_region_notify(nfit_spa->nd_region,
>   NVDIMM_REVALIDATE_POISON);



Re: [RFC PATCH 1/9] cxl/acpi: Add an acpi_cxl module for the CXL interconnect

2020-11-16 Thread Verma, Vishal L
On Mon, 2020-11-16 at 17:59 +, Jonathan Cameron wrote:
> On Tue, 10 Nov 2020 21:43:48 -0800
> Ben Widawsky  wrote:
> 
> > From: Vishal Verma 
> > 
> > Add an acpi_cxl module to coordinate the ACPI portions of the CXL
> > (Compute eXpress Link) interconnect. This driver binds to ACPI0017
> > objects in the ACPI tree, and coordinates access to the resources
> > provided by the ACPI CEDT (CXL Early Discovery Table).
> 
> I think the qemu series notes that this ACPI0017 is just a proposal at
> this stage. Please make sure that's highlighted here as well unless
> that status is out of date.

Hi Jonathan,

Thank you for the review. The cover letter talks about this, but I agree
it would be worth repeating in this patch briefly as I did with the OSC
patch. If it is still in a proposal state by the next posting, I'll make
sure to add a note here too.

> 
> > +
> > +static void acpi_cxl_desc_init(struct acpi_cxl_desc *acpi_desc, struct 
> > device *dev)
> > +{
> > +   dev_set_drvdata(dev, acpi_desc);
> 
> No need to have this wrapper + it hides the fact you are not just initialsing
> the acpi_desc structure.
> 
> > +   acpi_desc->dev = dev;
> > +}
> > +
> > +static void acpi_cedt_put_table(void *table)
> > +{
> > +   acpi_put_table(table);
> > +}
> > +
> > +static int acpi_cxl_add(struct acpi_device *adev)
> > +{
> > +   struct acpi_cxl_desc *acpi_desc;
> > +   struct device *dev = &adev->dev;
> > +   struct acpi_table_header *tbl;
> > +   acpi_status status = AE_OK;
> 
> Set below, so don't do it here.
> 
> > +   acpi_size sz;
> > +   int rc = 0;
> 
> Set in paths in which it's used so don't do it here.
> 
> > +
> > +   status = acpi_get_table(ACPI_SIG_CEDT, 0, &tbl);
> > +   if (ACPI_FAILURE(status)) {
> > +   dev_err(dev, "failed to find CEDT at startup\n");
> > +   return 0;
> > +   }
> > +
> > +   rc = devm_add_action_or_reset(dev, acpi_cedt_put_table, tbl);
> > +   if (rc)
> > +   return rc;
> 
> blank line here preferred for readability (do something, then check errors as
> block)
> 
> > +   sz = tbl->length;
> > +   dev_info(dev, "found CEDT at startup: %lld bytes\n", sz);
> > +
> > +   acpi_desc = devm_kzalloc(dev, sizeof(*acpi_desc), GFP_KERNEL);
> > +   if (!acpi_desc)
> > +   return -ENOMEM;
> 
> blank line here slightly helps readability.
> 
> > +   acpi_cxl_desc_init(acpi_desc, &adev->dev);
> > +
> > +   acpi_desc->acpi_header = *tbl;
> > +
> > +   return 0;
> > +}
> > +
> > +static int acpi_cxl_remove(struct acpi_device *adev)
> > +{
> > +   return 0;
> 
> Don't think empty remove is needed.
> 

Agreed with all of the above, I'll fix for the next posting.

> > +
> > +/* Values for CEDT structure types */
> > +
> > +enum acpi_cedt_type {
> > +   ACPI_CEDT_TYPE_HOST_BRIDGE = 0, /* CHBS - CXL Host Bridge Structure */
> > +   ACPI_CEDT_TYPE_CFMWS = 1,   /* CFMWS - CXL Fixed Memory Window 
> > Structure */
> 
> This isn't in the 2.0 spec, so I guess also part of some proposed changes.

Yes this got in accidentally, will remove.




Re: [RFC PATCH 1/9] cxl/acpi: Add an acpi_cxl module for the CXL interconnect

2020-11-10 Thread Verma, Vishal L
On Wed, 2020-11-11 at 07:34 +, h...@infradead.org wrote:
> On Wed, Nov 11, 2020 at 07:30:34AM +0000, Verma, Vishal L wrote:
> > Hi Christoph,
> > 
> > I thought 100 col. lines were acceptable now.
> 
> Quote from the coding style document:
> 
> "The preferred limit on the length of a single line is 80 columns.
> 
> Statements longer than 80 columns should be broken into sensible chunks,
> unless exceeding 80 columns significantly increases readability and does
> not hide information."
> 
> So yes, they are acceptable as an expception.  Not for crap like this.

Ah fair enough, I'll reflow all of these for the next revision.


Re: [RFC PATCH 1/9] cxl/acpi: Add an acpi_cxl module for the CXL interconnect

2020-11-10 Thread Verma, Vishal L
On Wed, 2020-11-11 at 07:10 +, Christoph Hellwig wrote:
> On Tue, Nov 10, 2020 at 09:43:48PM -0800, Ben Widawsky wrote:
> > +menuconfig CXL_BUS
> > +   tristate "CXL (Compute Express Link) Devices Support"
> > +   help
> > + CXL is a bus that is electrically compatible with PCI-E, but layers
> > + three protocols on that signalling (CXL.io, CXL.cache, and CXL.mem). 
> > The
> > + CXL.cache protocol allows devices to hold cachelines locally, the
> > + CXL.mem protocol allows devices to be fully coherent memory targets, 
> > the
> > + CXL.io protocol is equivalent to PCI-E. Say 'y' to enable support for
> > + the configuration and management of devices supporting these 
> > protocols.
> > +
> 
> Please fix the overly long lines.
> 
> > +static void acpi_cxl_desc_init(struct acpi_cxl_desc *acpi_desc, struct 
> > device *dev)
> 
> Another overly long line.

Hi Christoph,

I thought 100 col. lines were acceptable now.

> 
> > +{
> > +   dev_set_drvdata(dev, acpi_desc);
> > +   acpi_desc->dev = dev;
> > +}
> 
> But this helper seems pretty pointless to start with.
> 
> > +static int acpi_cxl_remove(struct acpi_device *adev)
> > +{
> > +   return 0;
> > +}
> 
> The emptry remove callback is not needed.

Agreed on both of the above comments - these are just boilerplate for
now, I expect they will get filled in in the next revision as more
functionality gets fleshed out. If they are still empty/no-op by then I
will remove them.

> 
> > +/*
> > + * If/when CXL support is defined by other platform firmware the kernel
> > + * will need a mechanism to select between the platform specific version
> > + * of this routine, until then, hard-code ACPI assumptions
> > + */
> > +int cxl_bus_prepared(struct pci_dev *pdev)
> > +{
> > +   struct acpi_device *adev;
> > +   struct pci_dev *root_port;
> > +   struct device *root;
> > +
> > +   root_port = pcie_find_root_port(pdev);
> > +   if (!root_port)
> > +   return -ENXIO;
> > +
> > +   root = root_port->dev.parent;
> > +   if (!root)
> > +   return -ENXIO;
> > +
> > +   adev = ACPI_COMPANION(root);
> > +   if (!adev)
> > +   return -ENXIO;
> > +
> > +   /* TODO: OSC enabling */
> > +
> > +   return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(cxl_bus_prepared);
> 
> What is the point of this function?  I doesn't realy do anything,
> not even a CXL specific check.  

This gets a bit more fleshed out in patch 2. I kept that separate so
that it is easier to review the bulk of the _OSC work in that patch
without this driver boilerplate getting in the way.

> 
> >  
> > +/***
> > + *
> > + * CEDT - CXL Early Discovery Table (ACPI 6.4)
> > + *Version 1
> > + *
> > + 
> > **/
> > +
> 
> Pleae use the normal Linux comment style.
> 
> 
> > +#define ACPI_CEDT_CHBS_VERSION_CXL11(0)
> > +#define ACPI_CEDT_CHBS_VERSION_CXL20(1)
> > +
> > +/* Values for length field above */
> > +
> > +#define ACPI_CEDT_CHBS_LENGTH_CXL11 (0x2000)
> > +#define ACPI_CEDT_CHBS_LENGTH_CXL20 (0x1)
> 
> No need for the braces.

For both of these - see the note in the commit message. I just followed
the ACPI header's style, and these hunks are only in this series to make
it usable. I expect the 'actual' struct definitions, naming, etc. will
come through ACPICA.



[GIT PULL] libnvdimm fix for v5.9-rc5

2020-09-11 Thread Verma, Vishal L
Hi Linus, please pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git
tags/libnvdimm-fix-v5.9-rc5

...to receive another fix for detection of DAX support for block
devices. The previous fix in this area (merged in -rc3) was incomplete,
and this should finally address all the problems.

---

The following changes since commit c2affe920b0e0669650943ac086215cf6519be34:

  dax: do not print error message for non-persistent memory block device 
(2020-08-20 11:43:18 -0600)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git 
tags/libnvdimm-fix-v5.9-rc5

for you to fetch changes up to 6180bb446ab624b9ab8bf201ed251ca87f07b413:

  dax: fix detection of dax support for non-persistent memory block devices 
(2020-09-03 12:28:03 -0600)


libnvdimm fix for v5.9-rc5

Fix detection of dax support for block devices. Previous fixes in this
area, which only affected printing of debug messages, had an incorrect
condition for detection of dax. This fix should finally do the right thing.


Coly Li (1):
  dax: fix detection of dax support for non-persistent memory block devices

 drivers/dax/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 32642634c1bb..e5767c83ea23 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -100,7 +100,7 @@ bool __generic_fsdax_supported(struct dax_device *dax_dev,
return false;
}
 
-   if (!dax_dev && !bdev_dax_supported(bdev, blocksize)) {
+   if (!dax_dev || !bdev_dax_supported(bdev, blocksize)) {
pr_debug("%s: error: dax unsupported by block device\n",
bdevname(bdev, buf));
return false;


[GIT PULL] libnvdimm fixes for v5.9-rc3

2020-08-24 Thread Verma, Vishal L
Hi Linus, please pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git
tags/libnvdimm-fix-v5.9-rc3

...to receive a couple of minor fixes for things merged in 5.9-rc1. One
fixes an out-of-bounds access caught by KASAN, and the second removes
some unnecessary, overzealous logging about dax support for traditional
block devices. These have appeared in -next without any problems.

---

The following changes since commit 9123e3a74ec7b934a4a099e98af6a61c2f80bbf5:

  Linux 5.9-rc1 (2020-08-16 13:04:57 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git 
tags/libnvdimm-fix-v5.9-rc3

for you to fetch changes up to c2affe920b0e0669650943ac086215cf6519be34:

  dax: do not print error message for non-persistent memory block device 
(2020-08-20 11:43:18 -0600)


libnvdimm fixes for v5.9-rc3

Fix an out-of-bounds access introduced in libnvdimm v5.9-rc1
dax: do not print error message for non-persistent memory block device


Adrian Huang (1):
  dax: do not print error message for non-persistent memory block device

Zqiang (1):
  libnvdimm: KASAN: global-out-of-bounds Read in internal_create_group

 drivers/dax/super.c| 6 ++
 drivers/nvdimm/dimm_devs.c | 1 +
 2 files changed, 7 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c82cbcb64202..32642634c1bb 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -100,6 +100,12 @@ bool __generic_fsdax_supported(struct dax_device *dax_dev,
return false;
}
 
+   if (!dax_dev && !bdev_dax_supported(bdev, blocksize)) {
+   pr_debug("%s: error: dax unsupported by block device\n",
+   bdevname(bdev, buf));
+   return false;
+   }
+
id = dax_read_lock();
len = dax_direct_access(dax_dev, pgoff, 1, , );
len2 = dax_direct_access(dax_dev, pgoff_end, 1, _kaddr, _pfn);
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 61374def5155..b59032e0859b 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -529,6 +529,7 @@ static DEVICE_ATTR_ADMIN_RW(activate);
 static struct attribute *nvdimm_firmware_attributes[] = {
_attr_activate.attr,
_attr_result.attr,
+   NULL,
 };
 
 static umode_t nvdimm_firmware_visible(struct kobject *kobj, struct attribute 
*a, int n)



Re: Re: [PATCH v2] libnvdimm: KASAN: global-out-of-bounds Read in internal_create_group

2020-08-18 Thread Verma, Vishal L
On Wed, 2020-08-19 at 03:23 +, Zhang, Qiang wrote:
> cc: Dan Williams
> Please review.

Hi Qiang,

I've got this queued up, I'll submit it for -rc2.

Thanks,
-Vishal

> 
> 
> From: linux-kernel-ow...@vger.kernel.org  
> on behalf of qiang.zh...@windriver.com 
> Sent: August 12, 2020 16:55
> To: dan.j.willi...@intel.com; vishal.l.ve...@intel.com; 
> dave.ji...@intel.com; ira.we...@intel.com
> Cc: linux-nvd...@lists.01.org; linux-kernel@vger.kernel.org
> Subject: [PATCH v2] libnvdimm: KASAN: global-out-of-bounds Read in 
> internal_create_group
> 
> From: Zqiang 
> 
> Because the last member of the "nvdimm_firmware_attributes" array
> was not assigned a null ptr, when traversal of "grp->attrs" array
> is out of bounds in "create_files" func.
> 
> func:
> create_files:
> ->for (i = 0, attr = grp->attrs; *attr && !error; i++, attr++)
> ->
> 
> BUG: KASAN: global-out-of-bounds in create_files fs/sysfs/group.c:43 [inline]
> BUG: KASAN: global-out-of-bounds in internal_create_group+0x9d8/0xb20
> fs/sysfs/group.c:149
> Read of size 8 at addr 8a2e4cf0 by task kworker/u17:10/959
> 
> CPU: 2 PID: 959 Comm: kworker/u17:10 Not tainted 5.8.0-syzkaller #0
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
> Workqueue: events_unbound async_run_entry_fn
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x18f/0x20d lib/dump_stack.c:118
>  print_address_description.constprop.0.cold+0x5/0x497 mm/kasan/report.c:383
>  __kasan_report mm/kasan/report.c:513 [inline]
>  kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
>  create_files fs/sysfs/group.c:43 [inline]
>  internal_create_group+0x9d8/0xb20 fs/sysfs/group.c:149
>  internal_create_groups.part.0+0x90/0x140 fs/sysfs/group.c:189
>  internal_create_groups fs/sysfs/group.c:185 [inline]
>  sysfs_create_groups+0x25/0x50 fs/sysfs/group.c:215
>  device_add_groups drivers/base/core.c:2024 [inline]
>  device_add_attrs drivers/base/core.c:2178 [inline]
>  device_add+0x7fd/0x1c40 drivers/base/core.c:2881
>  nd_async_device_register+0x12/0x80 drivers/nvdimm/bus.c:506
>  async_run_entry_fn+0x121/0x530 kernel/async.c:123
>  process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
>  worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
>  kthread+0x3b5/0x4a0 kernel/kthread.c:292
>  ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
> 
> The buggy address belongs to the variable:
>  nvdimm_firmware_attributes+0x10/0x40
> 
> Reported-by: syzbot+1cf0ffe61aecf46f5...@syzkaller.appspotmail.com
> Signed-off-by: Zqiang 
> ---
>  v1->v2:
>  Modify the description of the error.
> 
>  drivers/nvdimm/dimm_devs.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
> index 61374def5155..b59032e0859b 100644
> --- a/drivers/nvdimm/dimm_devs.c
> +++ b/drivers/nvdimm/dimm_devs.c
> @@ -529,6 +529,7 @@ static DEVICE_ATTR_ADMIN_RW(activate);
>  static struct attribute *nvdimm_firmware_attributes[] = {
> _attr_activate.attr,
> _attr_result.attr,
> +   NULL,
>  };
> 
>  static umode_t nvdimm_firmware_visible(struct kobject *kobj, struct 
> attribute *a, int n)
> --
> 2.17.1
> 
> ___
> Linux-nvdimm mailing list -- linux-nvd...@lists.01.org
> To unsubscribe send an email to linux-nvdimm-le...@lists.01.org



Re: [PATCH] acpi/nfit: Use kobj_to_dev() instead

2020-08-14 Thread Verma, Vishal L
On Fri, 2020-08-14 at 17:28 +0200, Rafael J. Wysocki wrote:
> On Thu, Aug 13, 2020 at 4:54 AM Wang Qing  wrote:
> > Use kobj_to_dev() instead of container_of()
> > 
> > Signed-off-by: Wang Qing 
> 
> LGTM
> 
> Dan, any objections?

Looks good to me - you can add:
Acked-by: Vishal Verma 
> 
> > ---
> >  drivers/acpi/nfit/core.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> > index fa4500f..3bb350b
> > --- a/drivers/acpi/nfit/core.c
> > +++ b/drivers/acpi/nfit/core.c
> > @@ -1382,7 +1382,7 @@ static bool ars_supported(struct nvdimm_bus 
> > *nvdimm_bus)
> > 
> >  static umode_t nfit_visible(struct kobject *kobj, struct attribute *a, int 
> > n)
> >  {
> > -   struct device *dev = container_of(kobj, struct device, kobj);
> > +   struct device *dev = kobj_to_dev(kobj);
> > struct nvdimm_bus *nvdimm_bus = to_nvdimm_bus(dev);
> > 
> > if (a == _attr_scrub.attr && !ars_supported(nvdimm_bus))
> > @@ -1667,7 +1667,7 @@ static struct attribute *acpi_nfit_dimm_attributes[] 
> > = {
> >  static umode_t acpi_nfit_dimm_attr_visible(struct kobject *kobj,
> > struct attribute *a, int n)
> >  {
> > -   struct device *dev = container_of(kobj, struct device, kobj);
> > +   struct device *dev = kobj_to_dev(kobj);
> > struct nvdimm *nvdimm = to_nvdimm(dev);
> > struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
> > 
> > --
> > 2.7.4
> > 


[GIT PULL] libnvdimm for v5.9

2020-08-10 Thread Verma, Vishal L
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/
tags/libnvdimm-for-5.9

...to receive a new feature in libnvdimm - 'Runtime Firmware
Activation', and a few small cleanups and fixes in libnvdimm and DAX.

You'd normally receive this pull request from Dan Williams, but he's
busy watching a newborn (Congrats Dan!), so I'm watching libnvdimm this
cycle.

I'd originally intended to make separate topic-based pull requests - one
for libnvdimm, and one for DAX, but some of the DAX material fell out
since it wasn't quite ready. I ended up merging the two branches into
one, and hence a couple of 'internal' merges - I hope these are ok. If
you prefer that I should've handled this differently please let me know!

I was also expecting a potential conflict - I was assuming Greg had
pulled in one of Dan's patches[1] through driver-core, but I don't see
it in his tree, and a test merge with the current master went through
cleanly.

There were a small handful of late fixes, but everything has had at
least a week of -next soak time without any reported issues. We've also
received a positive build notification from 0-day.

[1]: https://lore.kernel.org/linux-nvdimm/20200721104442.gf1676...@kroah.com/

---

The following changes since commit 48778464bb7d346b47157d21ffde2af6b2d39110:

  Linux 5.8-rc2 (2020-06-21 15:45:29 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/ 
tags/libnvdimm-for-5.9

for you to fetch changes up to 7f674025d9f7321dea11b802cc0ab3f09cbe51c5:

  libnvdimm/security: ensure sysfs poll thread woke up and fetch updated attr 
(2020-08-03 18:54:13 -0600)


libnvdimm for 5.9

- Add 'Runtime Firmware Activation' support for NVDIMMs that advertise
  the relevant capability
- Misc libnvdimm and DAX cleanups


Coly Li (1):
  dax: print error message by pr_info() in __generic_fsdax_supported()

Dan Williams (12):
  libnvdimm: Validate command family indices
  ACPI: NFIT: Move bus_dsm_mask out of generic nvdimm_bus_descriptor
  ACPI: NFIT: Define runtime firmware activation commands
  tools/testing/nvdimm: Cleanup dimm index passing
  tools/testing/nvdimm: Add command debug messages
  tools/testing/nvdimm: Prepare nfit_ctl_test() for ND_CMD_CALL emulation
  tools/testing/nvdimm: Emulate firmware activation commands
  driver-core: Introduce DEVICE_ATTR_ADMIN_{RO,RW}
  libnvdimm: Convert to DEVICE_ATTR_ADMIN_RO()
  PM, libnvdimm: Add runtime firmware activation support
  ACPI: NFIT: Add runtime firmware activate support
  ACPI: NFIT: Fix ARS zero-sized allocation

Hao Li (1):
  dax: Fix incorrect argument passed to xas_set_err()

Ira Weiny (2):
  fs/dax: Remove unused size parameter
  drivers/dax: Expand lock scope to cover the use of addresses

Jane Chu (3):
  libnvdimm/security: fix a typo
  libnvdimm/security: the 'security' attr never show 'overwrite' state
  libnvdimm/security: ensure sysfs poll thread woke up and fetch updated 
attr

Vishal Verma (2):
  Merge branch 'for-5.9/dax' into libnvdimm-for-next
  Merge branch 'for-5.9/firmware-activate' into libnvdimm-for-next

 Documentation/ABI/testing/sysfs-bus-nfit   |  19 +
 Documentation/ABI/testing/sysfs-bus-nvdimm |   2 +
 .../driver-api/nvdimm/firmware-activate.rst|  86 +
 drivers/acpi/nfit/core.c   | 157 ++---
 drivers/acpi/nfit/intel.c  | 386 +
 drivers/acpi/nfit/intel.h  |  61 
 drivers/acpi/nfit/nfit.h   |  38 +-
 drivers/dax/super.c|  13 +-
 drivers/nvdimm/bus.c   |  16 +
 drivers/nvdimm/core.c  | 149 
 drivers/nvdimm/dimm_devs.c | 123 ++-
 drivers/nvdimm/namespace_devs.c|   2 +-
 drivers/nvdimm/nd-core.h   |   1 +
 drivers/nvdimm/pfn_devs.c  |   2 +-
 drivers/nvdimm/region_devs.c   |   2 +-
 drivers/nvdimm/security.c  |  13 +-
 fs/dax.c   |  15 +-
 include/linux/device.h |   4 +
 include/linux/libnvdimm.h  |  52 ++-
 include/linux/suspend.h|   6 +
 include/linux/sysfs.h  |   7 +
 include/uapi/linux/ndctl.h |   5 +
 kernel/power/hibernate.c   |  97 ++
 tools/testing/nvdimm/test/nfit.c   | 367 
 24 files changed, 1486 insertions(+), 137 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-bus-nvdimm
 create mode 

Re: [PATCH] ACPI: Replace HTTP links with HTTPS ones

2020-07-17 Thread Verma, Vishal L
On Fri, 2020-07-17 at 20:24 +0200, Alexander A. Klimov wrote:
> Rationale:
> Reduces attack surface on kernel devs opening the links for MITM
> as HTTPS traffic is much harder to manipulate.
> 
> Deterministic algorithm:
> For each file:
>   If not .svg:
> For each line:
>   If doesn't contain `\bxmlns\b`:
> For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
> If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
> If both the HTTP and HTTPS versions
> return 200 OK and serve the same content:
>   Replace HTTP with HTTPS.
> 
> Signed-off-by: Alexander A. Klimov 
> ---
>  Continuing my work started at 93431e0607e5.
>  See also: git log --oneline '--author=Alexander A. Klimov 
> ' v5.7..master
> 
>  If there are any URLs to be removed completely or at least not just 
> HTTPSified:
>  Just clearly say so and I'll *undo my change*.
>  See also: https://lkml.org/lkml/2020/6/27/64
> 
>  If there are any valid, but yet not changed URLs:
>  See: https://lkml.org/lkml/2020/6/26/837
> 
>  If you apply the patch, please let me know.
> 
>  Sorry again to all maintainers who complained about subject lines.
>  Now I realized that you want an actually perfect prefixes,
>  not just subsystem ones.
>  I tried my best...
>  And yes, *I could* (at least half-)automate it.
>  Impossible is nothing! :)
> 
> 
>  .../firmware-guide/acpi/DSD-properties-rules.rst   |  4 ++--
>  .../firmware-guide/acpi/dsd/data-node-references.rst   |  4 ++--
>  Documentation/firmware-guide/acpi/dsd/graph.rst| 10 +-
>  Documentation/firmware-guide/acpi/dsd/leds.rst |  6 +++---
>  Documentation/firmware-guide/acpi/lpit.rst |  2 +-
>  drivers/acpi/Kconfig   |  2 +-
>  drivers/acpi/nfit/nfit.h   |  2 +-
>  7 files changed, 15 insertions(+), 15 deletions(-)

For nfit/nfit.h,
Acked-by: Vishal Verma 

> diff --git a/drivers/acpi/nfit/nfit.h b/drivers/acpi/nfit/nfit.h
> index f5525f8bb770..a303f0123394 100644
> --- a/drivers/acpi/nfit/nfit.h
> +++ b/drivers/acpi/nfit/nfit.h
> @@ -16,7 +16,7 @@
>  /* ACPI 6.1 */
>  #define UUID_NFIT_BUS "2f10e7a4-9e91-11e4-89d3-123b93f75cba"
>  
> -/* http://pmem.io/documents/NVDIMM_DSM_Interface-V1.6.pdf */
> +/* https://pmem.io/documents/NVDIMM_DSM_Interface-V1.6.pdf */
>  #define UUID_NFIT_DIMM "4309ac30-0d11-11e4-9191-0800200c9a66"
>  
>  /* https://github.com/HewlettPackard/hpe-nvm/blob/master/Documentation/ */
> -- 
> 2.27.0
> ___
> Linux-nvdimm mailing list -- linux-nvd...@lists.01.org
> To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v4 2/2] ACPICA: Preserve memory opregion mappings

2020-07-16 Thread Verma, Vishal L
On Mon, 2020-06-29 at 18:33 +0200, Rafael J. Wysocki wrote:
> From: "Rafael J. Wysocki" 
> 
> The ACPICA's strategy with respect to the handling of memory mappings
> associated with memory operation regions is to avoid mapping the
> entire region at once which may be problematic at least in principle
> (for example, it may lead to conflicts with overlapping mappings
> having different attributes created by drivers).  It may also be
> wasteful, because memory opregions on some systems take up vast
> chunks of address space while the fields in those regions actually
> accessed by AML are sparsely distributed.
> 
> For this reason, a one-page "window" is mapped for a given opregion
> on the first memory access through it and if that "window" does not
> cover an address range accessed through that opregion subsequently,
> it is unmapped and a new "window" is mapped to replace it.  Next,
> if the new "window" is not sufficient to acess memory through the
> opregion in question in the future, it will be replaced with yet
> another "window" and so on.  That may lead to a suboptimal sequence
> of memory mapping and unmapping operations, for example if two fields
> in one opregion separated from each other by a sufficiently wide
> chunk of unused address space are accessed in an alternating pattern.
> 
> The situation may still be suboptimal if the deferred unmapping
> introduced previously is supported by the OS layer.  For instance,
> the alternating memory access pattern mentioned above may produce
> a relatively long list of mappings to release with substantial
> duplication among the entries in it, which could be avoided if
> acpi_ex_system_memory_space_handler() did not release the mapping
> used by it previously as soon as the current access was not covered
> by it.
> 
> In order to improve that, modify acpi_ex_system_memory_space_handler()
> to preserve all of the memory mappings created by it until the memory
> regions associated with them go away.
> 
> Accordingly, update acpi_ev_system_memory_region_setup() to unmap all
> memory associated with memory opregions that go away.
> 
> Reported-by: Dan Williams 
> Signed-off-by: Rafael J. Wysocki 
> ---
>  drivers/acpi/acpica/evrgnini.c | 14 
>  drivers/acpi/acpica/exregion.c | 65 --
>  include/acpi/actypes.h | 12 +--
>  3 files changed, 64 insertions(+), 27 deletions(-)
> 

Hi Rafael,

Picking up from Dan while he's out - I had these patches tested by the
original reporter, and they work fine. I see you had them staged in the
acpica-osl branch. Is that slated to go in during the 5.9 merge window?

You can add:
Tested-by: Xiang Li 


Re: [RFC] Memory Tiering

2019-10-17 Thread Verma, Vishal L

On Thu, 2019-10-17 at 07:17 -0700, Dave Hansen wrote:
> On 10/17/19 1:07 AM, David Hildenbrand wrote:
> > Very interesting topic. I heard similar demand from HPC folks 
> > (especially involving other memory types ("tiers")). There, I think
> > you often want to let the application manage that. But of course, for
> > many applications an automatic management might already be
> > beneficial.
> > 
> > Am I correct that you are using PMEM in this area along with
> > ZONE_DEVICE and not by giving PMEM to the buddy (add_memory())?
> 
> The PMEM starts out as ZONE_DEVICE, but we unbind it from its original
> driver and bind it to this stub of a "driver": drivers/dax/kmem.c which
> uses add_memory() on it.
> 
> There's some nice tooling inside the daxctl component of ndctl to do all
> the sysfs magic to make this happen.
> 
Here is more info about the daxctl command in question:

https://pmem.io/ndctl/daxctl-reconfigure-device.html


Re: [PATCH 11/13] nvdimm: Use more common logic testing styles and bare ; positions

2019-09-11 Thread Verma, Vishal L
On Wed, 2019-09-11 at 19:54 -0700, Joe Perches wrote:
> Avoid using uncommon logic testing styles to make the code a
> bit more like other kernel code.
> 
> e.g.:
>   if (foo) {
>   ;
>   } else {
>   
>   }
> 
> is typically written
> 
>   if (!foo) {
>   
>   }
> 

A lot of times the excessive inversions seem to result in a net loss of
readability - e.g.:



> diff --git a/drivers/nvdimm/region_devs.c
> b/drivers/nvdimm/region_devs.c
> index 65df07481909..6861e0997d21 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -320,9 +320,7 @@ static ssize_t set_cookie_show(struct device *dev,
>   struct nd_interleave_set *nd_set = nd_region->nd_set;
>   ssize_t rc = 0;
>  
> - if (is_memory(dev) && nd_set)
> - /* pass, should be precluded by region_visible */;

For one, the comment is lost

> - else
> + if (!(is_memory(dev) && nd_set))

And it takes a moment to resolve between things such as:

if (!(A && B))
  vs.
if (!(A) && B)

And this is especially true if 'A' and 'B' are longer function calls,
split over multiple lines, or are themselves compound 'sections'.

I'm not opposed to /all/ such transformations -- for example, the ones
where the logical inversion can be 'consumed' by toggling a comparison
operator, as you have done a few times in this patch, don't sacrifice
any readability, and perhaps even improve it (see the sketch below).
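
To illustrate with a purely hypothetical sketch (the first hunk is made
up for illustration; the second reuses the condition from the quoted
diff):

    /* inversion folded into the comparison operator - reads naturally */
    if (nd_size <= 0)
            return -EINVAL;

    /* negating a compound condition - takes a moment to parse */
    if (!(is_memory(dev) && nd_set))
            return -ENXIO;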

>   return -ENXIO;
>  
>   /*


Re: [Ksummit-discuss] [PATCH v2 2/3] Maintainer Handbook: Maintainer Entry Profile

2019-09-11 Thread Verma, Vishal L
On Wed, 2019-09-11 at 08:48 -0700, Dan Williams wrote:

> diff --git a/MAINTAINERS b/MAINTAINERS
> index 3f171339df53..e5d111a86e61 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -98,6 +98,10 @@ Descriptions of section entries:
>  Obsolete:Old code. Something tagged obsolete generally means
>   it has been replaced by a better system and you
>   should be using that.
> + P: Subsystem Profile document for more details submitting
> +patches to the given subsystem. This is either an in-tree file,
> +or a URI. See Documentation/maintainer/maintainer-entry-profile.rst
> +for details.

Considering each maintainer entry or driver may not have a subdirectory
under Documentation/ to put these under, would it be better to collect
them in one place, say Documentation/maintainer-entry-profiles/ ?


Re: [PATCH] libnvdimm, region: Use struct_size() in kzalloc()

2019-08-28 Thread Verma, Vishal L
On Wed, 2019-08-28 at 14:36 -0500, Gustavo A. R. Silva wrote:

> struct_size() does not apply to those scenarios. See below...
> 
> > [1]: 
> > https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/tree/drivers/nvdimm/region_devs.c#n1030
> 
> struct_size() only applies to structures of the following kind:
> 
> struct foo {
>int stuff;
>struct boo entry[];
> };
> 
> and this scenario includes two different structures:
> 
> struct nd_region {
>   ...
> struct nd_mapping mapping[0];
> };
> 
> struct nd_blk_region {
>   ...
> struct nd_region nd_region;
> };

Yep - I neglected to actually look at the structures involved - you're
right, it doesn't apply here.

> 
> > [2]: 
> > https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/tree/drivers/nvdimm/region_devs.c#n96
> > 
> 
> In this scenario struct_size() does not apply directly because of the
> following
> logic before the call to devm_kzalloc():

Agreed, I missed that the calculation was more involved here.

Thanks for the clarifications, you can add:
Reviewed-by: Vishal Verma 



Re: [PATCH] libnvdimm, region: Use struct_size() in kzalloc()

2019-08-28 Thread Verma, Vishal L
On Mon, 2019-06-10 at 16:06 -0500, Gustavo A. R. Silva wrote:
> One of the more common cases of allocation size calculations is
> finding
> the size of a structure that has a zero-sized array at the end, along
> with memory for some number of elements for that array. For example:
> 
> struct nd_region {
>   ...
> struct nd_mapping mapping[0];
> };
> 
> instance = kzalloc(sizeof(struct nd_region) + sizeof(struct
> nd_mapping) *
>   count, GFP_KERNEL);
> 
> Instead of leaving these open-coded and prone to type mistakes, we can
> now use the new struct_size() helper:
> 
> instance = kzalloc(struct_size(instance, mapping, count), GFP_KERNEL);
> 
> This code was detected with the help of Coccinelle.
> 
> Signed-off-by: Gustavo A. R. Silva 
> ---
>  drivers/nvdimm/region_devs.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/nvdimm/region_devs.c
> b/drivers/nvdimm/region_devs.c
> index b4ef7d9ff22e..88becc87e234 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -1027,10 +1027,9 @@ static struct nd_region
> *nd_region_create(struct nvdimm_bus *nvdimm_bus,
>   }
>   region_buf = ndbr;
>   } else {
> - nd_region = kzalloc(sizeof(struct nd_region)
> - + sizeof(struct nd_mapping)
> - * ndr_desc->num_mappings,
> - GFP_KERNEL);
> + nd_region = kzalloc(struct_size(nd_region, mapping,
> + ndr_desc->num_mappings),
> + GFP_KERNEL);
>   region_buf = nd_region;
>   }
>  

Hi Gustavo,

The patch looks good to me, however it looks like it might've missed
some instances where this replacement can be performed?

One is just a few lines below from the above, in the 'else' block[1].
Additionally, maybe the Coccinelle script can be augmented to catch
devm_kzalloc instances as well - there is one of those in this file[2].

Thanks,
-Vishal

[1]: 
https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/tree/drivers/nvdimm/region_devs.c#n1030
[2]: 
https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/tree/drivers/nvdimm/region_devs.c#n96





Re: [PATCH 2/2] drivers/dax/kmem: give a warning if CONFIG_DEV_DAX_PMEM_COMPAT is enabled

2019-08-22 Thread Verma, Vishal L
On Thu, 2019-08-22 at 14:55 -0400, Jeff Moyer wrote:
> 
> When using daxctl to online memory, you already get the following
> message:
> 
> libdaxctl: daxctl_dev_disable: dax0.0: error: device model is dax-class
> 
> That's still not very helpful.  It would be better if the message
> suggested a fix (like using migrate-device-model).  Vishal?

Yes, it is on my list to improve this. Currently the man page shows the
above error message and talks about migrate-device-model, but I have
received feedback to add another breadcrumb in the printed message
itself, pointing to migrate-device-model. I'll send a patch for it soon.

Thanks,
-Vishal


Re: [PATCH v2 4/7] libnvdimm/bus: Prepare the nd_ioctl() path to be re-entrant

2019-07-18 Thread Verma, Vishal L

On Wed, 2019-07-17 at 18:08 -0700, Dan Williams wrote:
> In preparation for not holding a lock over the execution of
> nd_ioctl(),
> update the implementation to allow multiple threads to be attempting
> ioctls at the same time. The bus lock still prevents multiple in-
> flight
> ->ndctl() invocations from corrupting each other's state, but static
> global staging buffers are moved to the heap.
> 
> Reported-by: Vishal Verma 
> Signed-off-by: Dan Williams 
> ---
>  drivers/nvdimm/bus.c |   59 +++
> ---
>  1 file changed, 37 insertions(+), 22 deletions(-)

Ran tens of iterations of the unit tests with this, and couldn't
reproduce the failure.

Reviewed-by: Vishal Verma 
Tested-by: Vishal Verma 

> 
> diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
> index 42713b210f51..a3180c28fb2b 100644
> --- a/drivers/nvdimm/bus.c
> +++ b/drivers/nvdimm/bus.c
> @@ -970,20 +970,19 @@ static int __nd_ioctl(struct nvdimm_bus
> *nvdimm_bus, struct nvdimm *nvdimm,
>   int read_only, unsigned int ioctl_cmd, unsigned long
> arg)
>  {
>   struct nvdimm_bus_descriptor *nd_desc = nvdimm_bus->nd_desc;
> - static char out_env[ND_CMD_MAX_ENVELOPE];
> - static char in_env[ND_CMD_MAX_ENVELOPE];
>   const struct nd_cmd_desc *desc = NULL;
>   unsigned int cmd = _IOC_NR(ioctl_cmd);
>   struct device *dev = _bus->dev;
>   void __user *p = (void __user *) arg;
> + char *out_env = NULL, *in_env = NULL;
>   const char *cmd_name, *dimm_name;
>   u32 in_len = 0, out_len = 0;
>   unsigned int func = cmd;
>   unsigned long cmd_mask;
>   struct nd_cmd_pkg pkg;
>   int rc, i, cmd_rc;
> + void *buf = NULL;
>   u64 buf_len = 0;
> - void *buf;
>  
>   if (nvdimm) {
>   desc = nd_cmd_dimm_desc(cmd);
> @@ -1023,6 +1022,9 @@ static int __nd_ioctl(struct nvdimm_bus
> *nvdimm_bus, struct nvdimm *nvdimm,
>   }
>  
>   /* process an input envelope */
> + in_env = kzalloc(ND_CMD_MAX_ENVELOPE, GFP_KERNEL);
> + if (!in_env)
> + return -ENOMEM;
>   for (i = 0; i < desc->in_num; i++) {
>   u32 in_size, copy;
>  
> @@ -1030,14 +1032,17 @@ static int __nd_ioctl(struct nvdimm_bus
> *nvdimm_bus, struct nvdimm *nvdimm,
>   if (in_size == UINT_MAX) {
>   dev_err(dev, "%s:%s unknown input size cmd: %s
> field: %d\n",
>   __func__, dimm_name, cmd_name,
> i);
> - return -ENXIO;
> + rc = -ENXIO;
> + goto out;
>   }
> - if (in_len < sizeof(in_env))
> - copy = min_t(u32, sizeof(in_env) - in_len,
> in_size);
> + if (in_len < ND_CMD_MAX_ENVELOPE)
> + copy = min_t(u32, ND_CMD_MAX_ENVELOPE - in_len,
> in_size);
>   else
>   copy = 0;
> - if (copy && copy_from_user(_env[in_len], p + in_len,
> copy))
> - return -EFAULT;
> + if (copy && copy_from_user(_env[in_len], p + in_len,
> copy)) {
> + rc = -EFAULT;
> + goto out;
> + }
>   in_len += in_size;
>   }
>  
> @@ -1049,6 +1054,12 @@ static int __nd_ioctl(struct nvdimm_bus
> *nvdimm_bus, struct nvdimm *nvdimm,
>   }
>  
>   /* process an output envelope */
> + out_env = kzalloc(ND_CMD_MAX_ENVELOPE, GFP_KERNEL);
> + if (!out_env) {
> + rc = -ENOMEM;
> + goto out;
> + }
> +
>   for (i = 0; i < desc->out_num; i++) {
>   u32 out_size = nd_cmd_out_size(nvdimm, cmd, desc, i,
>   (u32 *) in_env, (u32 *) out_env, 0);
> @@ -1057,15 +1068,18 @@ static int __nd_ioctl(struct nvdimm_bus
> *nvdimm_bus, struct nvdimm *nvdimm,
>   if (out_size == UINT_MAX) {
>   dev_dbg(dev, "%s unknown output size cmd: %s
> field: %d\n",
>   dimm_name, cmd_name, i);
> - return -EFAULT;
> + rc = -EFAULT;
> + goto out;
>   }
> - if (out_len < sizeof(out_env))
> - copy = min_t(u32, sizeof(out_env) - out_len,
> out_size);
> + if (out_len < ND_CMD_MAX_ENVELOPE)
> + copy = min_t(u32, ND_CMD_MAX_ENVELOPE - out_len,
> out_size);
>   else
>   copy = 0;
>   if (copy && copy_from_user(_env[out_len],
> - p + in_len + out_len, copy))
> - return -EFAULT;
> + p + in_len + out_len, copy)) {
> + rc = -EFAULT;
> + goto out;
> + }
>   out_len += out_size;
>   }
>  
> @@ -1073,12 +1087,15 @@ static int __nd_ioctl(struct nvdimm_bus
> *nvdimm_bus, struct 

Re: [PATCH v2 3/7] libnvdimm/region: Register badblocks before namespaces

2019-07-18 Thread Verma, Vishal L

On Wed, 2019-07-17 at 18:08 -0700, Dan Williams wrote:
> Namespace activation expects to be able to reference region badblocks.
> The following warning sometimes triggers when asynchronous namespace
> activation races in front of the completion of namespace probing. Move
> all possible namespace probing after region badblocks initialization.
> 
> Otherwise, lockdep sometimes catches the uninitialized state of the
> badblocks seqlock with stack trace signatures like:
> 
> INFO: trying to register non-static key.
> pmem2: detected capacity change from 0 to 136365211648
> the code is fine but needs lockdep annotation.
> turning off the locking correctness validator.
> CPU: 9 PID: 358 Comm: kworker/u80:5 Tainted:
> G   OE 5.2.0-rc4+ #3382
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0
> 02/06/2015
> Workqueue: events_unbound async_run_entry_fn
> Call Trace:
>  dump_stack+0x85/0xc0
> pmem1.12: detected capacity change from 0 to 8589934592
>  register_lock_class+0x56a/0x570
>  ? check_object+0x140/0x270
>  __lock_acquire+0x80/0x1710
>  ? __mutex_lock+0x39d/0x910
>  lock_acquire+0x9e/0x180
>  ? nd_pfn_validate+0x28f/0x440 [libnvdimm]
>  badblocks_check+0x93/0x1f0
>  ? nd_pfn_validate+0x28f/0x440 [libnvdimm]
>  nd_pfn_validate+0x28f/0x440 [libnvdimm]
>  ? lockdep_hardirqs_on+0xf0/0x180
>  nd_dax_probe+0x9a/0x120 [libnvdimm]
>  nd_pmem_probe+0x6d/0x180 [nd_pmem]
>  nvdimm_bus_probe+0x90/0x2c0 [libnvdimm]
> 
> Fixes: 48af2f7e52f4 ("libnvdimm, pfn: during init, clear errors...")
> Cc: 
> Cc: Vishal Verma 
> Signed-off-by: Dan Williams 
> ---
>  drivers/nvdimm/region.c |   22 +++---
>  1 file changed, 11 insertions(+), 11 deletions(-)

This looks good to me,
Reviewed-by: Vishal Verma 

> 
> diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c
> index ef46cc3a71ae..488c47ac4c4a 100644
> --- a/drivers/nvdimm/region.c
> +++ b/drivers/nvdimm/region.c
> @@ -34,17 +34,6 @@ static int nd_region_probe(struct device *dev)
>   if (rc)
>   return rc;
>  
> - rc = nd_region_register_namespaces(nd_region, );
> - if (rc < 0)
> - return rc;
> -
> - ndrd = dev_get_drvdata(dev);
> - ndrd->ns_active = rc;
> - ndrd->ns_count = rc + err;
> -
> - if (rc && err && rc == err)
> - return -ENODEV;
> -
>   if (is_nd_pmem(_region->dev)) {
>   struct resource ndr_res;
>  
> @@ -60,6 +49,17 @@ static int nd_region_probe(struct device *dev)
>   nvdimm_badblocks_populate(nd_region, _region->bb,
> _res);
>   }
>  
> + rc = nd_region_register_namespaces(nd_region, );
> + if (rc < 0)
> + return rc;
> +
> + ndrd = dev_get_drvdata(dev);
> + ndrd->ns_active = rc;
> + ndrd->ns_count = rc + err;
> +
> + if (rc && err && rc == err)
> + return -ENODEV;
> +
>   nd_region->btt_seed = nd_btt_create(nd_region);
>   nd_region->pfn_seed = nd_pfn_create(nd_region);
>   nd_region->dax_seed = nd_dax_create(nd_region);
> 



Re: [RESEND PATCH] nvdimm: fix some compilation warnings

2019-06-26 Thread Verma, Vishal L

On Wed, 2019-06-26 at 17:00 -0400, Qian Cai wrote:
> 
> Verma, are you still working on this? I can still see this warning in the 
> latest
> linux-next.
> 
> drivers/nvdimm/btt.c: In function 'btt_read_pg':
> drivers/nvdimm/btt.c:1272:8: warning: variable 'rc' set but not used
> [-Wunused-but-set-variable]
> 
Sorry, this fell off the list. I'll take a look soon, but in the
meanwhile, if a patch were to appear, I'd be happy to review it :) 
(i.e. feel free to take a shot at it).

Thanks,
-Vishal


Re: [PATCH 3/6] libnvdimm/region: Register badblocks before namespaces

2019-06-20 Thread Verma, Vishal L
On Tue, 2019-06-11 at 16:25 -0700, Dan Williams wrote:
> Namespace activation expects to be able to reference region badblocks.
> The following warning sometimes triggers when asynchronous namespace
> activation races in front of the completion of namespace probing. Move
> all possible namespace probing after region badblocks initialization.
> 
> Otherwise, lockdep sometimes catches the uninitialized state of the
> badblocks seqlock with stack trace signatures like:
> 
> INFO: trying to register non-static key.
> pmem2: detected capacity change from 0 to 136365211648
> the code is fine but needs lockdep annotation.
> turning off the locking correctness validator.
> CPU: 9 PID: 358 Comm: kworker/u80:5 Tainted:
> G   OE 5.2.0-rc4+ #3382
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0
> 02/06/2015
> Workqueue: events_unbound async_run_entry_fn
> Call Trace:
>  dump_stack+0x85/0xc0
> pmem1.12: detected capacity change from 0 to 8589934592
>  register_lock_class+0x56a/0x570
>  ? check_object+0x140/0x270
>  __lock_acquire+0x80/0x1710
>  ? __mutex_lock+0x39d/0x910
>  lock_acquire+0x9e/0x180
>  ? nd_pfn_validate+0x28f/0x440 [libnvdimm]
>  badblocks_check+0x93/0x1f0
>  ? nd_pfn_validate+0x28f/0x440 [libnvdimm]
>  nd_pfn_validate+0x28f/0x440 [libnvdimm]
>  ? lockdep_hardirqs_on+0xf0/0x180
>  nd_dax_probe+0x9a/0x120 [libnvdimm]
>  nd_pmem_probe+0x6d/0x180 [nd_pmem]
>  nvdimm_bus_probe+0x90/0x2c0 [libnvdimm]
> 
> Fixes: 48af2f7e52f4 ("libnvdimm, pfn: during init, clear errors...")
> Cc: 
> Cc: Vishal Verma 
> Signed-off-by: Dan Williams 
> ---
>  drivers/nvdimm/region.c |   22 +++---
>  1 file changed, 11 insertions(+), 11 deletions(-)

This looks good to me,
Reviewed-by: Vishal Verma 

> 
> diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c
> index ef46cc3a71ae..488c47ac4c4a 100644
> --- a/drivers/nvdimm/region.c
> +++ b/drivers/nvdimm/region.c
> @@ -34,17 +34,6 @@ static int nd_region_probe(struct device *dev)
>   if (rc)
>   return rc;
>  
> - rc = nd_region_register_namespaces(nd_region, );
> - if (rc < 0)
> - return rc;
> -
> - ndrd = dev_get_drvdata(dev);
> - ndrd->ns_active = rc;
> - ndrd->ns_count = rc + err;
> -
> - if (rc && err && rc == err)
> - return -ENODEV;
> -
>   if (is_nd_pmem(_region->dev)) {
>   struct resource ndr_res;
>  
> @@ -60,6 +49,17 @@ static int nd_region_probe(struct device *dev)
>   nvdimm_badblocks_populate(nd_region, _region->bb,
> _res);
>   }
>  
> + rc = nd_region_register_namespaces(nd_region, );
> + if (rc < 0)
> + return rc;
> +
> + ndrd = dev_get_drvdata(dev);
> + ndrd->ns_active = rc;
> + ndrd->ns_count = rc + err;
> +
> + if (rc && err && rc == err)
> + return -ENODEV;
> +
>   nd_region->btt_seed = nd_btt_create(nd_region);
>   nd_region->pfn_seed = nd_pfn_create(nd_region);
>   nd_region->dax_seed = nd_dax_create(nd_region);
> 



Re: [PATCH] mm, memory-failure: clarify error message

2019-05-17 Thread Verma, Vishal L
On Thu, 2019-05-16 at 22:08 -0600, Jane Chu wrote:
> Some user who install SIGBUS handler that does longjmp out
> therefore keeping the process alive is confused by the error
> message
>   "[188988.765862] Memory failure: 0x1840200: Killing
>cellsrv:33395 due to hardware memory corruption"
> Slightly modify the error message to improve clarity.
> 
> Signed-off-by: Jane Chu 
> ---
>  mm/memory-failure.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fc8b517..14de5e2 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -216,10 +216,9 @@ static int kill_proc(struct to_kill *tk, unsigned long 
> pfn, int flags)
>   short addr_lsb = tk->size_shift;
>   int ret;
>  
> - pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory 
> corruption\n",
> - pfn, t->comm, t->pid);
> -
>   if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
> + pr_err("Memory failure: %#lx: Killing %s:%d due to hardware 
> memory "
> + "corruption\n", pfn, t->comm, t->pid);

Minor nit, but the format string shouldn't be split over multiple
lines, so that it stays grep-able. In such a case it is usually
considered OK for the line to exceed 80 characters.
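
For illustration, this is the form I'd suggest (a sketch based on the
hunk above - keep the whole format string on one line and let only the
arguments wrap):

    pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
            pfn, t->comm, t->pid);
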

>   ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)tk->addr,
>  addr_lsb, current);
>   } else {
> @@ -229,6 +228,8 @@ static int kill_proc(struct to_kill *tk, unsigned long 
> pfn, int flags)
>* This could cause a loop when the user sets SIGBUS
>* to SIG_IGN, but hopefully no one will do that?
>*/
> + pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to 
> hardware "
> + "memory corruption\n", pfn, t->comm, t->pid);
>   ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr,
> addr_lsb, t);  /* synchronous? */
>   }


Re: [RESEND PATCH] nvdimm: fix some compilation warnings

2019-05-15 Thread Verma, Vishal L
On Wed, 2019-05-15 at 16:25 -0700, Dan Williams wrote:
> 
> > diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
> > index 4671776f5623..9f02a99cfac0 100644
> > --- a/drivers/nvdimm/btt.c
> > +++ b/drivers/nvdimm/btt.c
> > @@ -1269,11 +1269,9 @@ static int btt_read_pg(struct btt *btt, struct 
> > bio_integrity_payload *bip,
> > 
> > ret = btt_data_read(arena, page, off, postmap, cur_len);
> > if (ret) {
> > -   int rc;
> > -
> > /* Media error - set the e_flag */
> > -   rc = btt_map_write(arena, premap, postmap, 0, 1,
> > -   NVDIMM_IO_ATOMIC);
> > +   btt_map_write(arena, premap, postmap, 0, 1,
> > + NVDIMM_IO_ATOMIC);
> > goto out_rtt;
> 
> This doesn't look correct to me, shouldn't we at least be logging that
> the bad-block failed to be persistently tracked?

Yes, logging it sounds good to me. Qian, can you include this in your
respin, or shall I send a fix for it separately (since we were always
ignoring the failure here regardless of this patch)?




Re: [RESEND PATCH] nvdimm: fix some compilation warnings

2019-05-15 Thread Verma, Vishal L

On Wed, 2019-05-15 at 17:26 -0700, Dan Williams wrote:
> On Wed, May 15, 2019 at 5:25 PM Verma, Vishal L
>  wrote:
> > On Wed, 2019-05-15 at 16:25 -0700, Dan Williams wrote:
> > > > diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
> > > > index 4671776f5623..9f02a99cfac0 100644
> > > > --- a/drivers/nvdimm/btt.c
> > > > +++ b/drivers/nvdimm/btt.c
> > > > @@ -1269,11 +1269,9 @@ static int btt_read_pg(struct btt *btt,
> > > > struct bio_integrity_payload *bip,
> > > > 
> > > > ret = btt_data_read(arena, page, off, postmap,
> > > > cur_len);
> > > > if (ret) {
> > > > -   int rc;
> > > > -
> > > > /* Media error - set the e_flag */
> > > > -   rc = btt_map_write(arena, premap,
> > > > postmap, 0, 1,
> > > > -   NVDIMM_IO_ATOMIC);
> > > > +   btt_map_write(arena, premap, postmap, 0,
> > > > 1,
> > > > + NVDIMM_IO_ATOMIC);
> > > > goto out_rtt;
> > > 
> > > This doesn't look correct to me, shouldn't we at least be logging
> > > that
> > > the bad-block failed to be persistently tracked?
> > 
> > Yes logging it sounds good to me. Qian, can you include this in your
> > respin or shall I send a fix for it separately (since we were always
> > ignoring the failure here regardless of this patch)?
> 
> I think a separate fix for this makes more sense. Likely also needs to
> be a ratelimited message in case a storm of errors is encountered.

Yes, good point on rate limiting - I was considering WARN_ONCE, but
that might mask errors for distinct blocks; a rate-limited printk should
work best - something along the lines of the sketch below. I'll prepare
a patch.
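
Roughly what I have in mind, as a sketch against the btt_read_pg() hunk
quoted earlier (assuming the arena's device is reachable via btt.c's
existing to_dev() helper; the message wording is just a placeholder):

    ret = btt_data_read(arena, page, off, postmap, cur_len);
    if (ret) {
            /* Media error - set the e_flag, and log (rate-limited) if that fails */
            if (btt_map_write(arena, premap, postmap, 0, 1, NVDIMM_IO_ATOMIC))
                    dev_warn_ratelimited(to_dev(arena),
                            "error persistently tracking bad blocks at %#x\n",
                            premap);
            goto out_rtt;
    }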



Re: [v5 0/3] "Hotremove" persistent memory

2019-05-02 Thread Verma, Vishal L
On Thu, 2019-05-02 at 14:43 -0400, Pavel Tatashin wrote:
> The series of operations look like this:
> 
> 1. After boot restore /dev/pmem0 to ramdisk to be consumed by apps.
>and free ramdisk.
> 2. Convert raw pmem0 to devdax
>ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
> 3. Hotadd to System RAM
>echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
>echo online_movable > /sys/devices/system/memoryXXX/state
> 4. Before reboot hotremove device-dax memory from System RAM
>echo offline > /sys/devices/system/memoryXXX/state
>echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind

Hi Pavel,

I am working on adding this sort of workflow to a new daxctl command
(daxctl-reconfigure-device) - this will allow changing the 'mode' of a
dax device to kmem and onlining the resulting memory, and, with your
patches, also attempting to offline the memory and convert back to
device-dax.

While running with these patches and testing the offlining part, I ran
into the lockdep warning below.

This is with just these three patches on top of -rc7.


[  +0.004886] ==
[  +0.001576] WARNING: possible circular locking dependency detected
[  +0.001506] 5.1.0-rc7+ #13 Tainted: G   O 
[  +0.000929] --
[  +0.000708] daxctl/22950 is trying to acquire lock:
[  +0.000548] f4d397f7 (kn->count#424){}, at: 
kernfs_remove_by_name_ns+0x40/0x80
[  +0.000922] 
  but task is already holding lock:
[  +0.000657] 2aa52a9f (mem_sysfs_mutex){+.+.}, at: 
unregister_memory_section+0x22/0xa0
[  +0.000960] 
  which lock already depends on the new lock.

[  +0.001001] 
  the existing dependency chain (in reverse order) is:
[  +0.000837] 
  -> #3 (mem_sysfs_mutex){+.+.}:
[  +0.000631]__mutex_lock+0x82/0x9a0
[  +0.000477]unregister_memory_section+0x22/0xa0
[  +0.000582]__remove_pages+0xe9/0x520
[  +0.000489]arch_remove_memory+0x81/0xc0
[  +0.000510]devm_memremap_pages_release+0x180/0x270
[  +0.000633]release_nodes+0x234/0x280
[  +0.000483]device_release_driver_internal+0xf4/0x1d0
[  +0.000701]bus_remove_device+0xfc/0x170
[  +0.000529]device_del+0x16a/0x380
[  +0.000459]unregister_dev_dax+0x23/0x50
[  +0.000526]release_nodes+0x234/0x280
[  +0.000487]device_release_driver_internal+0xf4/0x1d0
[  +0.000646]unbind_store+0x9b/0x130
[  +0.000467]kernfs_fop_write+0xf0/0x1a0
[  +0.000510]vfs_write+0xba/0x1c0
[  +0.000438]ksys_write+0x5a/0xe0
[  +0.000521]do_syscall_64+0x60/0x210
[  +0.000489]entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000637] 
  -> #2 (mem_hotplug_lock.rw_sem){}:
[  +0.000717]get_online_mems+0x3e/0x80
[  +0.000491]kmem_cache_create_usercopy+0x2e/0x270
[  +0.000609]kmem_cache_create+0x12/0x20
[  +0.000507]ptlock_cache_init+0x20/0x28
[  +0.000506]start_kernel+0x240/0x4d0
[  +0.000480]secondary_startup_64+0xa4/0xb0
[  +0.000539] 
  -> #1 (cpu_hotplug_lock.rw_sem){}:
[  +0.000784]cpus_read_lock+0x3e/0x80
[  +0.000511]online_pages+0x37/0x310
[  +0.000469]memory_subsys_online+0x34/0x60
[  +0.000611]device_online+0x60/0x80
[  +0.000611]state_store+0x66/0xd0
[  +0.000552]kernfs_fop_write+0xf0/0x1a0
[  +0.000649]vfs_write+0xba/0x1c0
[  +0.000487]ksys_write+0x5a/0xe0
[  +0.000459]do_syscall_64+0x60/0x210
[  +0.000482]entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000646] 
  -> #0 (kn->count#424){}:
[  +0.000669]lock_acquire+0x9e/0x180
[  +0.000471]__kernfs_remove+0x26a/0x310
[  +0.000518]kernfs_remove_by_name_ns+0x40/0x80
[  +0.000583]remove_files.isra.1+0x30/0x70
[  +0.000555]sysfs_remove_group+0x3d/0x80
[  +0.000524]sysfs_remove_groups+0x29/0x40
[  +0.000532]device_remove_attrs+0x42/0x80
[  +0.000522]device_del+0x162/0x380
[  +0.000464]device_unregister+0x16/0x60
[  +0.000505]unregister_memory_section+0x6e/0xa0
[  +0.000591]__remove_pages+0xe9/0x520
[  +0.000492]arch_remove_memory+0x81/0xc0
[  +0.000568]try_remove_memory+0xba/0xd0
[  +0.000510]remove_memory+0x23/0x40
[  +0.000483]dev_dax_kmem_remove+0x29/0x57 [kmem]
[  +0.000608]device_release_driver_internal+0xe4/0x1d0
[  +0.000637]unbind_store+0x9b/0x130
[  +0.000464]kernfs_fop_write+0xf0/0x1a0
[  +0.000685]vfs_write+0xba/0x1c0
[  +0.000594]ksys_write+0x5a/0xe0
[  +0.000449]do_syscall_64+0x60/0x210
[  +0.000481]entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000619] 
  other info that might help us debug this:

[  +0.000889] Chain exists of:

Re: [PATCH] nvdimm: btt_devs: fix a NULL pointer dereference and a memory leak

2019-03-22 Thread Verma, Vishal L
On Tue, 2019-03-12 at 03:15 -0500, Kangjie Lu wrote:
> In case kmemdup fails, the fix releases resources and returns to
> avoid the NULL pointer dereference.
> Also, the error paths in the following code should release
> resources to avoid memory leaks.
> 
> Signed-off-by: Kangjie Lu 
> ---
>  drivers/nvdimm/btt_devs.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 

Looks good,
Reviewed-by: Vishal Verma 

> diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
> index 795ad4ff35ca..565ea0b6f765 100644
> --- a/drivers/nvdimm/btt_devs.c
> +++ b/drivers/nvdimm/btt_devs.c
> @@ -196,8 +196,13 @@ static struct device *__nd_btt_create(struct
> nd_region *nd_region,
>   }
>  
>   nd_btt->lbasize = lbasize;
> - if (uuid)
> + if (uuid) {
>   uuid = kmemdup(uuid, 16, GFP_KERNEL);
> + if (!uuid) {
> + kfree(nd_btt);
> + return NULL;
> + }
> + }
>   nd_btt->uuid = uuid;
>   dev = _btt->dev;
>   dev_set_name(dev, "btt%d.%d", nd_region->id, nd_btt->id);
> @@ -209,6 +214,7 @@ static struct device *__nd_btt_create(struct
> nd_region *nd_region,
>   dev_dbg(>dev, "failed, already claimed by %s\n",
>   dev_name(ndns->claim));
>   put_device(dev);
> + kfree(uuid);
>   return NULL;
>   }
>   return dev;



Re: [PATCH] libnvdimm/namespace: Clean up holder_class_store()

2019-03-04 Thread Verma, Vishal L
On Mon, 2019-03-04 at 12:14 -0800, Dan Williams wrote:
> Use sysfs_streq() in place of open-coded strcmp()'s that check for an
> optional "\n" at the end of the input.
> 
> Signed-off-by: Dan Williams 
> ---
>  drivers/nvdimm/namespace_devs.c |8 
>  1 file changed, 4 insertions(+), 4 deletions(-)

Looks good,
Reviewed-by: Vishal Verma 

> 
> diff --git a/drivers/nvdimm/namespace_devs.c
> b/drivers/nvdimm/namespace_devs.c
> index 3677b0c4a33d..17fb7f931f0c 100644
> --- a/drivers/nvdimm/namespace_devs.c
> +++ b/drivers/nvdimm/namespace_devs.c
> @@ -1506,13 +1506,13 @@ static ssize_t __holder_class_store(struct
> device *dev, const char *buf)
>   if (dev->driver || ndns->claim)
>   return -EBUSY;
>  
> - if (strcmp(buf, "btt") == 0 || strcmp(buf, "btt\n") == 0)
> + if (sysfs_streq(buf, "btt"))
>   ndns->claim_class = btt_claim_class(dev);
> - else if (strcmp(buf, "pfn") == 0 || strcmp(buf, "pfn\n") == 0)
> + else if (sysfs_streq(buf, "pfn"))
>   ndns->claim_class = NVDIMM_CCLASS_PFN;
> - else if (strcmp(buf, "dax") == 0 || strcmp(buf, "dax\n") == 0)
> + else if (sysfs_streq(buf, "dax"))
>   ndns->claim_class = NVDIMM_CCLASS_DAX;
> - else if (strcmp(buf, "") == 0 || strcmp(buf, "\n") == 0)
> + else if (sysfs_streq(buf, ""))
>   ndns->claim_class = NVDIMM_CCLASS_NONE;
>   else
>   return -EINVAL;
> 



Re: [PATCH] acpi/nfit: Fix bus command validation

2019-02-07 Thread Verma, Vishal L

On Thu, 2019-02-07 at 15:57 -0800, Dan Williams wrote:
> Commit 11189c1089da "acpi/nfit: Fix command-supported detection" broke
> ND_CMD_CALL for bus-level commands. The "func = cmd" assumption is only
> valid for:
> 
> ND_CMD_ARS_CAP
> ND_CMD_ARS_START
> ND_CMD_ARS_STATUS
> ND_CMD_CLEAR_ERROR
> 
> The function number otherwise needs to be pulled from the command
> payload for:
> 
> NFIT_CMD_TRANSLATE_SPA
> NFIT_CMD_ARS_INJECT_SET
> NFIT_CMD_ARS_INJECT_CLEAR
> NFIT_CMD_ARS_INJECT_GET
> 
> Update cmd_to_func() for the bus case and call it in the common path.
> 
> Fixes: 11189c1089da ("acpi/nfit: Fix command-supported detection")
> Cc: 
> Cc: Vishal Verma 
> Reported-by: Grzegorz Burzynski 
> Signed-off-by: Dan Williams 
> ---
>  drivers/acpi/nfit/core.c |   22 --
>  1 file changed, 12 insertions(+), 10 deletions(-)

Looks good,
Reviewed-by: Vishal Verma 


Re: [PATCH 5/5] dax: "Hotplug" persistent memory for use like normal RAM

2019-01-25 Thread Verma, Vishal L

On Fri, 2019-01-25 at 09:18 -0800, Dan Williams wrote:
> On Fri, Jan 25, 2019 at 12:20 AM Du, Fan  wrote:
> > Dan
> > 
> > Thanks for the insights!
> > 
> > Can I say, the UCE is delivered from h/w to OS in a single way in
> > case of machine
> > check, only PMEM/DAX stuff filter out UC address and managed in its
> > own way by
> > badblocks, if PMEM/DAX doesn't do so, then common RAS workflow will
> > kick in,
> > right?
> 
> The common RAS workflow always kicks in, it's just the page state
> presented by a DAX mapping needs distinct handling. Once it is
> hot-plugged it no longer needs to be treated differently than "System
> RAM".
> 
> > And how about when ARS is involved but no machine check fired for
> > the function
> > of this patchset?
> 
> The hotplug effectively disconnects this address range from the ARS
> results. They will still be reported in the libnvdimm "region" level
> badblocks instance, but there's no safe / coordinated way to go clear
> those errors without additional kernel enabling. There is no "clear
> error" semantic for "System RAM".
> 
Perhaps as future enabling, the kernel could perform "clear error" for
offlined pages and make them usable again. But I'm not sure how prepared
mm is to re-accept previously offlined pages.


Re: [PATCH 1/2] tools/testing/nvdimm: Align test resources to 128M

2018-12-03 Thread Verma, Vishal L

On Sat, 2018-11-24 at 10:46 -0800, Dan Williams wrote:
> In preparation for libnvdimm growing new restrictions to detect section
> conflicts between persistent memory regions, enable nfit_test to
> allocate aligned resources. Use a gen_pool to allocate nfit_test's fake
> resources in a separate address space from the virtual translation of
> the same.
> 
> Signed-off-by: Dan Williams 
> ---
>  tools/testing/nvdimm/test/nfit.c |   36 ++--
>  1 file changed, 34 insertions(+), 2 deletions(-)
> 

This looks good to me,
Reviewed-by: Vishal Verma 

> diff --git a/tools/testing/nvdimm/test/nfit.c 
> b/tools/testing/nvdimm/test/nfit.c
> index 01ec04bf91b5..ca4e61c864d5 100644
> --- a/tools/testing/nvdimm/test/nfit.c
> +++ b/tools/testing/nvdimm/test/nfit.c
> @@ -15,6 +15,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -215,6 +216,8 @@ struct nfit_test {
>  
>  static struct workqueue_struct *nfit_wq;
>  
> +static struct gen_pool *nfit_pool;
> +
>  static struct nfit_test *to_nfit_test(struct device *dev)
>  {
>   struct platform_device *pdev = to_platform_device(dev);
> @@ -1132,6 +1135,9 @@ static void release_nfit_res(void *data)
>   list_del(_res->list);
>   spin_unlock(_test_lock);
>  
> + if (resource_size(_res->res) >= DIMM_SIZE)
> + gen_pool_free(nfit_pool, nfit_res->res.start,
> + resource_size(_res->res));
>   vfree(nfit_res->buf);
>   kfree(nfit_res);
>  }
> @@ -1144,7 +1150,7 @@ static void *__test_alloc(struct nfit_test *t, size_t 
> size, dma_addr_t *dma,
>   GFP_KERNEL);
>   int rc;
>  
> - if (!buf || !nfit_res)
> + if (!buf || !nfit_res || !*dma)
>   goto err;
>   rc = devm_add_action(dev, release_nfit_res, nfit_res);
>   if (rc)
> @@ -1164,6 +1170,8 @@ static void *__test_alloc(struct nfit_test *t, size_t 
> size, dma_addr_t *dma,
>  
>   return nfit_res->buf;
>   err:
> + if (*dma && size >= DIMM_SIZE)
> + gen_pool_free(nfit_pool, *dma, size);
>   if (buf)
>   vfree(buf);
>   kfree(nfit_res);
> @@ -1172,9 +1180,16 @@ static void *__test_alloc(struct nfit_test *t, size_t 
> size, dma_addr_t *dma,
>  
>  static void *test_alloc(struct nfit_test *t, size_t size, dma_addr_t *dma)
>  {
> + struct genpool_data_align data = {
> + .align = SZ_128M,
> + };
>   void *buf = vmalloc(size);
>  
> - *dma = (unsigned long) buf;
> + if (size >= DIMM_SIZE)
> + *dma = gen_pool_alloc_algo(nfit_pool, size,
> + gen_pool_first_fit_align, );
> + else
> + *dma = (unsigned long) buf;
>   return __test_alloc(t, size, dma, buf);
>  }
>  
> @@ -2839,6 +2854,18 @@ static __init int nfit_test_init(void)
>   goto err_register;
>   }
>  
> + nfit_pool = gen_pool_create(ilog2(SZ_4M), NUMA_NO_NODE);
> + if (!nfit_pool) {
> + rc = -ENOMEM;
> + goto err_register;
> + }
> +
> + if (gen_pool_add(nfit_pool, VMALLOC_START,
> + VMALLOC_END + 1 - VMALLOC_START, NUMA_NO_NODE)) 
> {
> + rc = -ENOMEM;
> + goto err_register;
> + }
> +
>   for (i = 0; i < NUM_NFITS; i++) {
>   struct nfit_test *nfit_test;
>   struct platform_device *pdev;
> @@ -2894,6 +2921,9 @@ static __init int nfit_test_init(void)
>   return 0;
>  
>   err_register:
> + if (nfit_pool)
> + gen_pool_destroy(nfit_pool);
> +
>   destroy_workqueue(nfit_wq);
>   for (i = 0; i < NUM_NFITS; i++)
>   if (instances[i])
> @@ -2917,6 +2947,8 @@ static __exit void nfit_test_exit(void)
>   platform_driver_unregister(_test_driver);
>   nfit_test_teardown();
>  
> + gen_pool_destroy(nfit_pool);
> +
>   for (i = 0; i < NUM_NFITS; i++)
>   put_device([i]->pdev.dev);
>   class_destroy(nfit_test_dimm);
> 




Re: [PATCH] libnvdimm, bus: check id immediately following ida_simple_get

2018-08-03 Thread Verma, Vishal L

On Fri, 2018-08-03 at 08:08 -0400, Ocean He wrote:
> From: Ocean He 
> 
> The id check was not executed immediately following ida_simple_get. Just
> change the code's position; there is no functional change.
> 
> Signed-off-by: Ocean He 
> ---
>  drivers/nvdimm/bus.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Looks good, feel free to add
Reviewed-by: Vishal Verma 

> 
> diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
> index 27902a8..ab28e7c 100644
> --- a/drivers/nvdimm/bus.c
> +++ b/drivers/nvdimm/bus.c
> @@ -350,12 +350,12 @@ struct nvdimm_bus *nvdimm_bus_register(struct
> device *parent,
>   INIT_LIST_HEAD(&nvdimm_bus->mapping_list);
>   init_waitqueue_head(&nvdimm_bus->probe_wait);
>   nvdimm_bus->id = ida_simple_get(&nd_ida, 0, 0, GFP_KERNEL);
> - mutex_init(&nvdimm_bus->reconfig_mutex);
> - badrange_init(&nvdimm_bus->badrange);
>   if (nvdimm_bus->id < 0) {
>   kfree(nvdimm_bus);
>   return NULL;
>   }
> + mutex_init(&nvdimm_bus->reconfig_mutex);
> + badrange_init(&nvdimm_bus->badrange);
>   nvdimm_bus->nd_desc = nd_desc;
>   nvdimm_bus->dev.parent = parent;
>   nvdimm_bus->dev.release = nvdimm_bus_release;



Re: [PATCH] libnvdimm: rework region badblocks clearing

2017-05-01 Thread Verma, Vishal L
On Mon, 2017-05-01 at 09:38 -0700, Dan Williams wrote:
> On Mon, May 1, 2017 at 9:20 AM, Kani, Toshimitsu 
> wrote:
> > On Mon, 2017-05-01 at 09:16 -0700, Dan Williams wrote:
> > > On Mon, May 1, 2017 at 9:12 AM, Kani, Toshimitsu  > > com>
> > > wrote:
> > > > On Mon, 2017-05-01 at 08:52 -0700, Dan Williams wrote:
> > > > > On Mon, May 1, 2017 at 8:43 AM, Dan Williams  > > > > inte
> > > > > l.co
> > > > > m> wrote:
> > > > > > On Mon, May 1, 2017 at 8:34 AM, Kani, Toshimitsu  > > > > > i@hp
> > > > > > e.co
> > > > > > m> wrote:
> > > > > > > On Sun, 2017-04-30 at 05:39 -0700, Dan Williams wrote:
> > > > 
> > > >  :
> > > > > > > 
> > > > > > > Hi Dan,
> > > > > > > 
> > > > > > > I was testing the change with CONFIG_DEBUG_ATOMIC_SLEEP
> > > > > > > set
> > > > > > > this time, and hit the following BUG with BTT.  This is a
> > > > > > > separate issue (not introduced by this patch), but it
> > > > > > > shows
> > > > > > > that we have an issue with the DSM call path as well.
> > > > > > 
> > > > > > Ah, great find, thanks! We don't see this in the unit tests
> > > > > > because the nfit_test infrastructure takes no sleeping
> > > > > > actions
> > > > > > in its simulated DSM path. Outside of converting btt to use
> > > > > > sleeping locks I'm not sure I see a path forward. I wonder
> > > > > > how
> > > > > > bad the performance impact of that would be? Perhaps with
> > > > > > opportunistic spinning it won't be so bad, but I don't see
> > > > > > another choice.
> > > > > 
> > > > > It's worse than that. Part of the performance optimization of
> > > > > BTT
> > > > > I/O was to avoid locking altogether when we could rely on a
> > > > > BTT
> > > > > lane percpu, so that would also need to be removed.
> > > > 
> > > > I do not have a good idea either, but I'd rather disable this
> > > > clearing in the regular BTT write path than adding sleeping
> > > > locks
> > > > to BTT. Clearing a bad block in the BTT write path is
> > > > difficult/challenging since it allocates a new block.
> > > 
> > > Actually, that may make things easier. Can we teach BTT to track
> > > error blocks and clear them before they are reassigned?
> > 
> > I was thinking the same after sending it.  I think we should be
> > able to
> > do that.
> 
> Ok, but we obviously can't develop something that detailed while the
> merge window is open, so I think that means we need to revert commit
> e88da7998d7d "Revert 'libnvdimm: band aid btt vs clear poison
> locking'" and leave BTT I/O-error-clearing disabled for this cycle
> and
> try again for 4.13.

Agreed, I'll work on something to track badblocks and clear them
outside the IO path.
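
One possible shape for that follow-up, purely as a sketch (the names and structure below are assumptions, not the patch that eventually landed): note the failing range from the I/O path under a spinlock, and let a workqueue clear it later in process context, where a sleeping clear-poison call is allowed.

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/types.h>
#include <linux/workqueue.h>

struct btt_bad_entry {			/* illustrative */
	struct list_head list;
	sector_t sector;
	unsigned int len;
};

static void btt_clear_work_fn(struct work_struct *work);

static LIST_HEAD(btt_bad_list);
static DEFINE_SPINLOCK(btt_bad_lock);
static DECLARE_WORK(btt_clear_work, btt_clear_work_fn);

/* atomic I/O path: only record the range, never sleep */
static void btt_note_badblock(sector_t sector, unsigned int len)
{
	struct btt_bad_entry *e = kmalloc(sizeof(*e), GFP_ATOMIC);

	if (!e)
		return;
	e->sector = sector;
	e->len = len;
	spin_lock(&btt_bad_lock);
	list_add_tail(&e->list, &btt_bad_list);
	spin_unlock(&btt_bad_lock);
	schedule_work(&btt_clear_work);
}

/* process context: safe to call into the (sleeping) clear-error path */
static void btt_clear_work_fn(struct work_struct *work)
{
	struct btt_bad_entry *e, *next;
	LIST_HEAD(todo);

	spin_lock(&btt_bad_lock);
	list_splice_init(&btt_bad_list, &todo);
	spin_unlock(&btt_bad_lock);

	list_for_each_entry_safe(e, next, &todo, list) {
		/* an nvdimm_clear_poison()-style call would go here */
		list_del(&e->list);
		kfree(e);
	}
}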

Re: [PATCH] block: hide badblocks attribute by default

2017-04-27 Thread Verma, Vishal L
On Thu, 2017-04-27 at 14:46 -0700, Dan Williams wrote:
> Commit 99e6608c9e74 "block: Add badblock management for gendisks"
> allowed for drivers like pmem and software-raid to advertise a list of
> bad media areas. However, it inadvertently added a 'badblocks' attribute to
> all block devices. Let's clean this up by having the 'badblocks' attribute
> not be visible when the driver has not populated a 'struct badblocks'
> instance in the gendisk.
> 
> Cc: Jens Axboe 
> Cc: Christoph Hellwig 
> Cc: Martin K. Petersen 
> Reported-by: Vishal Verma 
> Signed-off-by: Dan Williams 
> ---
>  block/genhd.c |   11 +++
>  1 file changed, 11 insertions(+)
> 
Tested that this removes the badblocks attribute for block devices that
don't use them:

$ cat /sys/block/pmem6/badblocks 
61576 8

$ cat /sys/block/vda/badblocks 
cat: /sys/block/vda/badblocks: No such file or directory

Tested-by: Vishal Verma 
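
The mechanism behind the behavior tested above is the attribute group's is_visible callback. A rough sketch of that pattern follows; the attribute array, group, and helper names are assumptions for illustration and not the exact hunk from the patch, and it assumes the gendisk carries a struct badblocks pointer (bb).

#include <linux/device.h>
#include <linux/genhd.h>
#include <linux/sysfs.h>

static ssize_t badblocks_show(struct device *dev, struct device_attribute *a,
			      char *page)
{
	return 0;	/* stub; the real file walks disk->bb */
}
static DEVICE_ATTR_RO(badblocks);

static struct attribute *example_disk_attrs[] = {
	&dev_attr_badblocks.attr,
	NULL,
};

static umode_t example_disk_attrs_visible(struct kobject *kobj,
					  struct attribute *a, int n)
{
	struct device *dev = kobj_to_dev(kobj);
	struct gendisk *disk = dev_to_disk(dev);

	/* hide the file when no struct badblocks instance was populated */
	if (a == &dev_attr_badblocks.attr && !disk->bb)
		return 0;
	return a->mode;
}

static const struct attribute_group example_disk_attr_group = {
	.attrs		= example_disk_attrs,
	.is_visible	= example_disk_attrs_visible,
};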

Re: [RFC PATCH] x86, mce: change the mce notifier to 'blocking' from 'atomic'

2017-04-21 Thread Verma, Vishal L
On Thu, 2017-04-13 at 13:31 +0200, Borislav Petkov wrote:
> On Thu, Apr 13, 2017 at 12:29:25AM +0200, Borislav Petkov wrote:
> > On Wed, Apr 12, 2017 at 03:26:19PM -0700, Luck, Tony wrote:
> > > We can futz with that and have them specify which chain (or both)
> > > that they want to be added to.
> > 
> > Well, I didn't want the atomic chain to be a notifier because we can
> > keep it simple and non-blocking. Only the process context one will
> > be.
> > 
> > So the question is, do we even have a use case for outside consumers
> > hanging on the atomic chain? Because if not, we're good to go.
> 
> Ok, new day, new patch.
> 
> Below is what we could do: we don't call the notifier at all on the
> atomic path but only print the MCEs. We do log them and if the machine
> survives, we process them accordingly. This is only a fix for upstream
> so that the current issue at hand is addressed.
> 
> For later, we'd need to split the paths in:
> 
> critical_print_mce()
> 
> or somesuch which immediately dumps the MCE to dmesg, and
> 
> mce_log()
> 
> which does the slow path of logging MCEs and calling the blocking
> notifier.
> 
> Now, I'd want to have decoding of the MCE on the critical path too so
> I have to think about how to do that nicely. Maybe move the decoding
> bits which are the same between Intel and AMD in mce.c and have some
> vendor-specific, fast calls. We'll see. Btw, this is something Ingo
> has
> been mentioning for a while.
> 
> Anyway, here's just the urgent fix for now.
> 
> Thanks.
> 
> ---
> From: Vishal Verma 
> Date: Tue, 11 Apr 2017 16:44:57 -0600
> Subject: [PATCH] x86/mce: Make the MCE notifier a blocking one
> 
> The NFIT MCE handler callback (for handling media errors on NVDIMMs)
> takes a mutex to add the location of a memory error to a list. But
> since
> the notifier call chain for machine checks (x86_mce_decoder_chain) is
> atomic, we get a lockdep splat like:
> 
>   BUG: sleeping function called from invalid context at
> kernel/locking/mutex.c:620
>   in_atomic(): 1, irqs_disabled(): 0, pid: 4, name: kworker/0:0
>   [..]
>   Call Trace:
>    dump_stack
>    ___might_sleep
>    __might_sleep
>    mutex_lock_nested
>    ? __lock_acquire
>    nfit_handle_mce
>    notifier_call_chain
>    atomic_notifier_call_chain
>    ? atomic_notifier_call_chain
>    mce_gen_pool_process
> 
> Convert the notifier to a blocking one which gets to run only in
> process
> context.
> 
> Boris: remove the notifier call in atomic context in print_mce(). For
> now, let's print the MCE on the atomic path so that we can make sure
> it
> goes out. We still log it for process context later.
> 
> Reported-by: Ross Zwisler 
> Signed-off-by: Vishal Verma 
> Cc: Tony Luck 
> Cc: Dan Williams 
> Cc: linux-edac 
> Cc: x86-ml 
> Cc: 
> Link: http://lkml.kernel.org/r/20170411224457.24777-1-vishal.l.verma@intel.com
> Fixes: 6839a6d96f4e ("nfit: do an ARS scrub on hitting a latent media
> error")
> Signed-off-by: Borislav Petkov 
> ---
>  arch/x86/kernel/cpu/mcheck/mce-genpool.c  |  2 +-
>  arch/x86/kernel/cpu/mcheck/mce-internal.h |  2 +-
>  arch/x86/kernel/cpu/mcheck/mce.c  | 18 --
>  3 files changed, 6 insertions(+), 16 deletions(-)
> 

I noticed this patch was picked up in tip, in ras/urgent, but didn't see
a pull request for 4.11 - was this the intention? Or will it just be
added for 4.12?

-Vishal
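
For context, the change hinges entirely on which notifier flavor the chain uses: blocking chains invoke their callbacks in process context, so a callback may take a mutex. A generic sketch of that pattern, with illustrative names rather than the mce code:

#include <linux/init.h>
#include <linux/mutex.h>
#include <linux/notifier.h>

static BLOCKING_NOTIFIER_HEAD(example_chain);
static DEFINE_MUTEX(example_lock);

static int example_cb(struct notifier_block *nb, unsigned long val, void *data)
{
	/* sleeping is allowed: blocking chains only run in process context */
	mutex_lock(&example_lock);
	/* ... record the reported event ... */
	mutex_unlock(&example_lock);
	return NOTIFY_OK;
}

static struct notifier_block example_nb = {
	.notifier_call = example_cb,
};

static int __init example_init(void)
{
	/* producers later call blocking_notifier_call_chain(&example_chain, ...) */
	return blocking_notifier_chain_register(&example_chain, &example_nb);
}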

Re: [RFC PATCH] x86, mce: change the mce notifier to 'blocking' from 'atomic'

2017-04-12 Thread Verma, Vishal L
On Wed, 2017-04-12 at 22:22 +0200, Borislav Petkov wrote:
> On Wed, Apr 12, 2017 at 01:59:03PM -0600, Vishal Verma wrote:
> > I don't think we can do anything about the panic path errors.
> 
> Then the fix should be a lot easier:
> 
> ---
> diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
> index 3ba1c3472cf9..44c092ec2ac9 100644
> --- a/drivers/acpi/nfit/mce.c
> +++ b/drivers/acpi/nfit/mce.c
> @@ -25,6 +25,9 @@ static int nfit_handle_mce(struct notifier_block
> *nb, unsigned long val,
>   struct acpi_nfit_desc *acpi_desc;
>   struct nfit_spa *nfit_spa;
>  
> + if (in_atomic())
> + return NOTIFY_DONE;

But isn't the atomic notifier call chain always called in atomic
context?

> +
>   /* We only care about memory errors */
>   if (!(mce->status & MCACOD))
>   return NOTIFY_DONE;
> 
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
> HRB 21284 (AG Nürnberg)

Re: [PATCH] libnvdimm: use consistent naming for request_mem_region()

2016-11-28 Thread Verma, Vishal L
On Mon, 2016-11-28 at 11:25 -0800, Dan Williams wrote:
> Here is an example /proc/iomem listing for a system with 2
> namespaces,
> one in "sector" mode and one in "memory" mode:
> 
>   1fc00-2fbff : Persistent Memory (legacy)
> 1fc00-2fbff : namespace1.0
>   34000-34fff : Persistent Memory
> 34000-34fff : btt0.1
> 
> Here is the corresponding ndctl listing:
> 
>   # ndctl list
>   [
> {
>   "dev":"namespace1.0",
>   "mode":"memory",
>   "size":4294967296,
>   "blockdev":"pmem1"
> },
> {
>   "dev":"namespace0.0",
>   "mode":"sector",
>   "size":267091968,
>   "uuid":"f7594f86-badb-4592-875f-ded577da2eaf",
>   "sector_size":4096,
>   "blockdev":"pmem0s"
> }
>   ]
> 
> Notice that the ndctl listing is purely in terms of namespace
> devices,
> while the iomem listing leaks the internal "btt0.1" implementation
> detail. Given that ndctl requires the namespace device name to change
> the mode, for example:
> 
>   # ndctl create-namespace --reconfig=namespace0.0 --mode=raw --force
> 
> ...use the namespace name in the iomem listing to keep the claiming
> device name consistent across different mode settings.
> 
> Cc: Vishal Verma 
> Signed-off-by: Dan Williams 
> ---
>  drivers/dax/pmem.c |3 ++-
>  drivers/nvdimm/claim.c |2 +-
>  drivers/nvdimm/pmem.c  |2 +-
>  3 files changed, 4 insertions(+), 3 deletions(-)

Looks good!

Reviewed-by: Vishal Verma 

> 
> diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
> index 9630d8837ba9..3ff84784249a 100644
> --- a/drivers/dax/pmem.c
> +++ b/drivers/dax/pmem.c
> @@ -87,7 +87,8 @@ static int dax_pmem_probe(struct device *dev)
>   pfn_sb = nd_pfn->pfn_sb;
>  
>   if (!devm_request_mem_region(dev, nsio->res.start,
> - resource_size(&nsio->res),
> dev_name(dev))) {
> + resource_size(&nsio->res),
> + dev_name(&ndns->dev))) {
>   dev_warn(dev, "could not reserve region %pR\n",
> &nsio->res);
>   return -EBUSY;
>   }
> diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
> index 8d66fbb779ed..4638b9ea5229 100644
> --- a/drivers/nvdimm/claim.c
> +++ b/drivers/nvdimm/claim.c
> @@ -275,7 +275,7 @@ int devm_nsio_enable(struct device *dev, struct
> nd_namespace_io *nsio)
>  
>   nsio->size = resource_size(res);
>   if (!devm_request_mem_region(dev, res->start,
> resource_size(res),
> - dev_name(dev))) {
> + dev_name(&ndns->dev))) {
>   dev_warn(dev, "could not reserve region %pR\n",
> res);
>   return -EBUSY;
>   }
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 42b3a8217073..34f16a17c07b 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -266,7 +266,7 @@ static int pmem_attach_disk(struct device *dev,
>   dev_warn(dev, "unable to guarantee persistence of
> writes\n");
>  
>   if (!devm_request_mem_region(dev, res->start,
> resource_size(res),
> - dev_name(dev))) {
> + dev_name(&ndns->dev))) {
>   dev_warn(dev, "could not reserve region %pR\n",
> res);
>   return -EBUSY;
>   }
> 

Re: [PATCH] acpi, nfit: fix acpi event notifications for nfit

2016-08-18 Thread Verma, Vishal L
On Thu, 2016-08-18 at 12:54 -0700, Dan Williams wrote:
> On Thu, Aug 18, 2016 at 12:52 PM, Linda Knippers  om> wrote:
> > 
> > 
> > 
> > On 8/18/2016 3:48 PM, Dan Williams wrote:
> > > 
> > > On Thu, Aug 18, 2016 at 11:48 AM, Vishal Verma  > > el.com> wrote:
> > > > 
> > > > The nfit driver had an acpi event notification handler, but it
> > > > never
> > > > would've worked because we weren't setting the
> > > > ACPI_DRIVER_ALL_NOTIFY_EVENTS flag in acpi_driver.
> > > 
> > > Let's update the changelog to be helpful for someone implementing
> > > a
> > > backport or taking this back to a -stable branch.  Something like:
> > > 
> > > Subject: acpi, nfit: fix event notifications
> > > 
> > > Commit 209851649dc4 "acpi: nfit: Add support for hot-add" added
> > > support for _FIT notifications, but it neglected to set the
> > > ACPI_DRIVER_ALL_NOTIFY_EVENTS flag that acpi_bus_notify() uses to
> > > gate
> > > notification delivery.
> > 
> > While we're at it, should we update the notifier function to
> > explicitly check
> > for event 0x80 before re-evaluating the _FIT?  I'm thinking about
> > some time
> > in the future when there might be more than one event.
> 
> Yes, good idea.

Sounds good, part of the same patch or separately?
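
A sketch of what that explicit event check could look like in the notify handler; 0x80 is the device-specific notification value discussed above, while the handler and macro names here are illustrative assumptions rather than the merged code.

#include <linux/acpi.h>

/* assumed name for the device-specific FIT-update notification value */
#define EXAMPLE_NFIT_NOTIFY_UPDATE	0x80

static void example_nfit_notify(struct acpi_device *adev, u32 event)
{
	/* ignore anything other than a FIT-update notification */
	if (event != EXAMPLE_NFIT_NOTIFY_UPDATE)
		return;

	/* ... re-evaluate _FIT and rescan for hot-added ranges ... */
}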

Re: [PATCH v4 0/6] Add alignment check for DAX mount

2016-05-20 Thread Verma, Vishal L
On Fri, 2016-05-20 at 14:50 +, Kani, Toshimitsu wrote:
> On Thu, 2016-05-19 at 18:37 -0500, Eric Sandeen wrote:
> > 
> > On 5/10/16 11:23 AM, Toshi Kani wrote:
> > > 
> > > 
> > > When a partition is not aligned by 4KB, mount -o dax succeeds,
> > Sorry for being late, but -
> > 
> > Shouldn't this and all subsequent patch commits refer to
> > PAGE_SIZE, rather than "4kB?"
> Right, the patch commits should refer to PAGE_SIZE to match with the
> code
> changes.  I am afraid it may be a bit too late to update, though...
> 
> Vishal, do you think you can tweak the logs, "4KB" to "PAGE_SIZE"?
> 
> Thanks,
> -Toshi

Hi Toshi,

Is it just commit message changes? If so I'm not sure it is worthwhile
to rebase everything for that - i.e. my dax error handling series and
Ross' dax-locking branch would both have to be rebased..

If there are any fixes for code, we can do them as an add-on patch
though.

Re: [PATCH v7 4/6] dax: export a low-level __dax_zero_page_range helper

2016-05-12 Thread Verma, Vishal L
On Thu, 2016-05-12 at 10:41 +0200, Jan Kara wrote:
> On Wed 11-05-16 15:08:50, Vishal Verma wrote:
> > 
> > From: Christoph Hellwig 
> > 
> > This allows XFS to perform zeroing using the iomap infrastructure
> > and
> > avoid buffer heads.
> > 
> > [vishal: fix conflicts with dax-error-handling]
> > Signed-off-by: Christoph Hellwig 
> Looks good. You can add:
> 
> Reviewed-by: Jan Kara 
> 
> BTW: You are supposed to add your Signed-off-by when forwarding
> patches
> like this...

Ah I didn't know. I'll add it when making the stable topic branch for
this. Thanks!



Re: [PATCH v6 4/5] dax: for truncate/hole-punch, do zeroing through the driver if possible

2016-05-11 Thread Verma, Vishal L
On Tue, 2016-05-10 at 12:49 -0600, Vishal Verma wrote:
> 
...

> @@ -1240,11 +1254,16 @@ int dax_zero_page_range(struct inode *inode,
> loff_t from, unsigned length,
>   .size = PAGE_SIZE,
>   };
>  
> - if (dax_map_atomic(bdev, &dax) < 0)
> - return PTR_ERR(dax.addr);
> - clear_pmem(dax.addr + offset, length);
> - wmb_pmem();
> - dax_unmap_atomic(bdev, &dax);
> + if (dax_range_is_aligned(bdev, &dax, offset, length))
> + return blkdev_issue_zeroout(bdev, dax.sector,
> + length >> 9, GFP_NOFS, true);

Found another bug here while testing. The zeroout needs to be done for
sector + (offset >> 9). The above just zeroed out the first sector of
the page irrespective of offset, which is wrong.
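
In other words, with the sub-page offset folded into the starting sector, the call in the hunk above would become something like the following sketch:

	if (dax_range_is_aligned(bdev, &dax, offset, length))
		return blkdev_issue_zeroout(bdev, dax.sector + (offset >> 9),
				length >> 9, GFP_NOFS, true);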

Re: [PATCH v6 4/5] dax: for truncate/hole-punch, do zeroing through the driver if possible

2016-05-11 Thread Verma, Vishal L
On Wed, 2016-05-11 at 10:15 +0200, Jan Kara wrote:
> On Tue 10-05-16 12:49:15, Vishal Verma wrote:
> > 
> > In the truncate or hole-punch path in dax, we clear out sub-page
> > ranges.
> > If these sub-page ranges are sector aligned and sized, we can do the
> > zeroing through the driver instead so that error-clearing is handled
> > automatically.
> > 
> > For sub-sector ranges, we still have to rely on clear_pmem and have
> > the
> > possibility of tripping over errors.
> > 
> > Cc: Dan Williams 
> > Cc: Ross Zwisler 
> > Cc: Jeff Moyer 
> > Cc: Christoph Hellwig 
> > Cc: Dave Chinner 
> > Cc: Jan Kara 
> > Reviewed-by: Christoph Hellwig 
> > Signed-off-by: Vishal Verma 
> ...
> 
> > 
> > +static bool dax_range_is_aligned(struct block_device *bdev,
> > +    struct blk_dax_ctl *dax, unsigned int offset,
> > +    unsigned int length)
> > +{
> > +   unsigned short sector_size = bdev_logical_block_size(bdev);
> > +
> > +   if (!IS_ALIGNED(((u64)dax->addr + offset), sector_size))
> One more question: 'dax' is initialized in dax_zero_page_range() and
> dax->addr is going to be always NULL here. So either you forgot to
> call
> dax_map_atomic() to get the addr or the use of dax->addr is just bogus
> (which is what I currently believe since I see no way how the address
> could
> be unaligned with the sector_size)...
> 

Good catch, and you're right. I don't think I actually even want to use
dax->addr for the alignment check here - I want to check if we're
aligned to the block device sector. I'm thinking something like:

if (!IS_ALIGNED(offset, sector_size))

Technically we want to check if sector * sector_size + offset is
aligned, but the first part of that is already a sector :)
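
Putting that together, the check would look roughly like the sketch below: it tests both the sub-page offset and the length against the device's logical block size and no longer touches dax->addr. This is illustrative only; the merged version may differ in signature and detail.

#include <linux/blkdev.h>
#include <linux/kernel.h>

static bool dax_range_is_aligned(struct block_device *bdev,
				 unsigned int offset, unsigned int length)
{
	unsigned short sector_size = bdev_logical_block_size(bdev);

	/* both the start offset and the length must be sector multiples */
	if (!IS_ALIGNED(offset, sector_size))
		return false;
	if (!IS_ALIGNED(length, sector_size))
		return false;

	return true;
}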
