Re: [RFC PATCH 1/4] fs: introduce ->storage_lost() for memory-failure

2020-09-15 Thread Ruan Shiyang



On 2020/9/16 12:16 AM, Darrick J. Wong wrote:

On Tue, Sep 15, 2020 at 06:13:08PM +0800, Shiyang Ruan wrote:

This function is used to handle errors which may cause data loss in a
filesystem, such as memory failure in fsdax mode.

In XFS, the "rmapbt" feature is required in order to query for the files
or metadata associated with the error block.  We could then call fs
recovery functions to try to repair the damaged data (not implemented
in this patch).

After that, memory-failure also needs to kill the processes that are
using those files.  The struct mf_recover_controller is created to store
the necessary parameters.

Signed-off-by: Shiyang Ruan 
---
  fs/xfs/xfs_super.c | 80 ++
  include/linux/fs.h |  1 +
  include/linux/mm.h |  6 
  3 files changed, 87 insertions(+)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 71ac6c1cdc36..118d9c5d9e1e 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -35,6 +35,10 @@
  #include "xfs_refcount_item.h"
  #include "xfs_bmap_item.h"
  #include "xfs_reflink.h"
+#include "xfs_alloc.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_bit.h"
  
  #include 

  #include 
@@ -1104,6 +1108,81 @@ xfs_fs_free_cached_objects(
return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
  }
  
+static int

+xfs_storage_lost_helper(
+   struct xfs_btree_cur*cur,
+   struct xfs_rmap_irec*rec,
+   void*priv)
+{
+   struct xfs_inode*ip;
+   struct mf_recover_controller*mfrc = priv;
+   int rc = 0;
+
+   if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner)) {
+   // TODO check and try to fix metadata
+   } else {
+   /*
+* Get files that are in-core; filter out others that are not in use.
+*/
+   xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner, XFS_IGET_INCORE,
+0, &ip);


Missing return value check here.


Yes, I ignored it.  My fault.




+   if (!ip)
+   return 0;
+   if (!VFS_I(ip)->i_mapping)
+   goto out;
+
+   rc = mfrc->recover_fn(mfrc->pfn, mfrc->flags,
+ VFS_I(ip)->i_mapping, rec->rm_offset);
+
+   // TODO try to fix data
+out:
+   xfs_irele(ip);
+   }
+
+   return rc;
+}
+
+static int
+xfs_fs_storage_lost(
+   struct super_block  *sb,
+   loff_t  offset,


offset into which device?  XFS supports three...

I'm also a little surprised you don't pass in a length.

I guess that means this function will get called repeatedly for every
byte in the poisoned range?


A memory failure is triggered on one PFN at a time, so its length 
could be one PAGE_SIZE.  But I think you are right: it's better to tell 
the filesystem how large the affected range is and how much needs to be 
fixed.  I don't know of any yet, but there may be other cases besides 
memory failure, so the length is necessary.





+   void*priv)
+{
+   struct xfs_mount*mp = XFS_M(sb);
+   struct xfs_trans*tp = NULL;
+   struct xfs_btree_cur*cur = NULL;
+   struct xfs_rmap_irecrmap_low, rmap_high;
+   struct xfs_buf  *agf_bp = NULL;
+   xfs_fsblock_t   fsbno = XFS_B_TO_FSB(mp, offset);
+   xfs_agnumber_t  agno = XFS_FSB_TO_AGNO(mp, fsbno);
+   xfs_agblock_t   agbno = XFS_FSB_TO_AGBNO(mp, fsbno);
+   int error = 0;
+
+   error = xfs_trans_alloc_empty(mp, &tp);
+   if (error)
+   return error;
+
+   error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
+   if (error)
+   return error;
+
+   cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agno);


...and this is definitely the wrong call sequence if the malfunctioning
device is the realtime device.  If a dax rt device dies, you'll be
shooting down files on the data device, which will cause all sorts of
problems.


I didn't notice that.  Let me think about it.


Question: Should all this poison recovery stuff go into a new file?
xfs_poison.c?  There's already a lot of code in xfs_super.c.


Yes, it's a bit too much.  I'll move them into a new file.


--
Thanks,
Ruan Shiyang.


--D


+
+   /* Construct a range for rmap query */
+   memset(&rmap_low, 0, sizeof(rmap_low));
+   memset(&rmap_high, 0xFF, sizeof(rmap_high));
+   rmap_low.rm_startblock = rmap_high.rm_startblock = agbno;
+
+   error = xfs_rmap_query_range(cur, &rmap_low, &rmap_high,
+xfs_storage_lost_helper, priv);
+   if (error == -ECANCELED)
+   error = 0;
+
+   xfs_btree_del_cursor(cur, error);
+   xfs_trans_brelse(tp, agf_bp);
+   return error;
+}
+
  static const struct super_operations xfs_super_operations = {
.alloc_inode  

Re: [RFC PATCH 2/4] pagemap: introduce ->memory_failure()

2020-09-15 Thread Ruan Shiyang



On 2020/9/16 12:31 AM, Darrick J. Wong wrote:

On Tue, Sep 15, 2020 at 06:13:09PM +0800, Shiyang Ruan wrote:

When a memory failure occurs, we call this function, which is implemented
by each device.  For fsdax, the pmem device implements it.  The pmem
device will find the block device where the error page is located, get
the filesystem on this block device, and finally call ->storage_lost()
to handle the error at the filesystem layer.

Normally, a pmem device may contain one or more partitions; each
partition contains a block device, and each block device contains a
filesystem.  So we are able to find the filesystem from an offset on
the pmem device.  However, in other cases, such as a mapped device, I
didn't find a way to obtain the filesystem sitting on top of it.  This
is a problem that needs to be fixed.

Signed-off-by: Shiyang Ruan 
---
  block/genhd.c| 12 
  drivers/nvdimm/pmem.c| 31 +++
  include/linux/genhd.h|  2 ++
  include/linux/memremap.h |  3 +++
  4 files changed, 48 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 99c64641c314..e7442b60683e 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1063,6 +1063,18 @@ struct block_device *bdget_disk(struct gendisk *disk, 
int partno)
  }
  EXPORT_SYMBOL(bdget_disk);
  
+struct block_device *bdget_disk_sector(struct gendisk *disk, sector_t sector)

+{
+   struct block_device *bdev = NULL;
+   struct hd_struct *part = disk_map_sector_rcu(disk, sector);
+
+   if (part)
+   bdev = bdget(part_devt(part));
+
+   return bdev;
+}
+EXPORT_SYMBOL(bdget_disk_sector);
+
  /*
   * print a full list of all partitions - intended for places where the root
   * filesystem can't be mounted and thus to give the victim some idea of what
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index fab29b514372..3ed96486c883 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -364,9 +364,40 @@ static void pmem_release_disk(void *__pmem)
put_disk(pmem->disk);
  }
  
+static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,

+   struct mf_recover_controller *mfrc)
+{
+   struct pmem_device *pdev;
+   struct block_device *bdev;
+   sector_t disk_sector;
+   loff_t bdev_offset;
+
+   pdev = container_of(pgmap, struct pmem_device, pgmap);
+   if (!pdev->disk)
+   return -ENXIO;
+
+   disk_sector = (PFN_PHYS(mfrc->pfn) - pdev->phys_addr) >> SECTOR_SHIFT;


Ah, I see, looking at the current x86 MCE code, the MCE handler gets a
physical address, which is then rounded down to a PFN, which is then
blown back up into a byte address(?) and then rounded down to sectors.
That is then blown back up into a byte address and passed on to XFS,
which rounds it down to fs blocksize.

/me wishes that wasn't so convoluted, but reforming the whole mm poison
system to have smaller blast radii isn't the purpose of this patch. :)


+   bdev = bdget_disk_sector(pdev->disk, disk_sector);
+   if (!bdev)
+   return -ENXIO;
+
+   // TODO what if block device contains a mapped device


Find its dev_pagemap_ops and invoke its memory_failure function? ;)


Thanks for pointing that out.  I'll think about it along those lines.



+   if (!bdev->bd_super)
+   goto out;
+
+   bdev_offset = ((disk_sector - get_start_sect(bdev)) << SECTOR_SHIFT) -
+   pdev->data_offset;
+   bdev->bd_super->s_op->storage_lost(bdev->bd_super, bdev_offset, mfrc);


->storage_lost is required for all filesystems?


I think it is required for filesystems that support fsdax, since the 
owner tracking is moved here.  But in any case, there should be a 
NULL check before the call.



--
Thanks,
Ruan Shiyang.


--D


+
+out:
+   bdput(bdev);
+   return 0;
+}
+
  static const struct dev_pagemap_ops fsdax_pagemap_ops = {
.kill   = pmem_pagemap_kill,
.cleanup= pmem_pagemap_cleanup,
+   .memory_failure = pmem_pagemap_memory_failure,
  };
  
  static int pmem_attach_disk(struct device *dev,

diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 4ab853461dff..16e9e13e0841 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -303,6 +303,8 @@ static inline void add_disk_no_queue_reg(struct gendisk 
*disk)
  extern void del_gendisk(struct gendisk *gp);
  extern struct gendisk *get_gendisk(dev_t dev, int *partno);
  extern struct block_device *bdget_disk(struct gendisk *disk, int partno);
+extern struct block_device *bdget_disk_sector(struct gendisk *disk,
+   sector_t sector);
  
  extern void set_device_ro(struct block_device *bdev, int flag);

  extern void set_disk_ro(struct gendisk *disk, int flag);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 5f5b2df06e61..efebefa70d00 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -6,6 +6,7 @@
  
  struct resource;

  struct device;



Re: regression caused by patch 6180bb446ab624b9ab8bf201ed251ca87f07b413 ("dax: fix detection of dax support for non-persistent memory block devices")

2020-09-15 Thread Mike Snitzer
On Tue, Sep 15 2020 at  3:49pm -0400,
Dan Williams  wrote:

> On Tue, Sep 15, 2020 at 1:01 AM Jan Kara  wrote:
> >
> > Hi!
> >
> > On Tue 15-09-20 11:03:29, col...@suse.de wrote:
> > > Could you please take a look? I am offline for the next two weeks.
> >
> > I just had a look into this. IMHO the justification in 6180bb446a "dax: fix
> > detection of dax support for non-persistent memory block devices" is just
> > bogus and people got confused by the previous condition
> >
> > if (!dax_dev && !bdev_dax_supported(bdev, blocksize))
> >
> > which was bogus as well. bdev_dax_supported() always returns false for bdev
> > that doesn't have dax_dev (naturally so). So in the original condition
> > there was no point in calling bdev_dax_supported() if we know dax_dev is
> > NULL.
> >
> > Then this was changed to:
> >
> > if (!dax_dev || !bdev_dax_supported(bdev, blocksize))
> >
> > which looks more sensible at first sight. But only at first sight:
> > if you look at the wider context, __generic_fsdax_supported() is the bulk
> > of the code that decides whether a device supports DAX, so calling
> > bdev_dax_supported() from it doesn't look like such a great idea. So
> > IMO the condition should be just:
> >
> > if (!dax_dev)
> >
> > I'll send a fix for this.
> 
> If you beat me to it, great, but you might be sleeping now. I agree
> the original condition was bogus and looks to be a result of previous
> non-thorough refactoring on my part. I think we can move that !dax_dev
> into dax_supported(). I'll take a look.

You trimmed the relevant portion of Jan's reply, but: can you also
weigh in on whether DM is using the wrong function to test for DAX?

Mike
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: regression caused by patch 6180bb446ab624b9ab8bf201ed251ca87f07b413 ("dax: fix detection of dax support for non-persistent memory block devices")

2020-09-15 Thread Dan Williams
On Tue, Sep 15, 2020 at 1:01 AM Jan Kara  wrote:
>
> Hi!
>
> On Tue 15-09-20 11:03:29, col...@suse.de wrote:
> > Could you please take a look? I am offline for the next two weeks.
>
> I just had a look into this. IMHO the justification in 6180bb446a "dax: fix
> detection of dax support for non-persistent memory block devices" is just
> bogus and people got confused by the previous condition
>
> if (!dax_dev && !bdev_dax_supported(bdev, blocksize))
>
> which was bogus as well. bdev_dax_supported() always returns false for bdev
> that doesn't have dax_dev (naturally so). So in the original condition
> there was no point in calling bdev_dax_supported() if we know dax_dev is
> NULL.
>
> Then this was changed to:
>
> if (!dax_dev || !bdev_dax_supported(bdev, blocksize))
>
> which looks more sensible at first sight. But only at first sight:
> if you look at the wider context, __generic_fsdax_supported() is the bulk
> of the code that decides whether a device supports DAX, so calling
> bdev_dax_supported() from it doesn't look like such a great idea. So
> IMO the condition should be just:
>
> if (!dax_dev)
>
> I'll send a fix for this.

If you beat me to it, great, but you might be sleeping now. I agree
the original condition was bogus and looks to be a result of previous
non-thorough refactoring on my part. I think we can move that !dax_dev
into dax_supported(). I'll take a look.


Re: [PATCH] acpi/nfit: Use kobj_to_dev() instead

2020-09-15 Thread Rafael J. Wysocki
On Sat, Aug 15, 2020 at 12:52 AM Verma, Vishal L
 wrote:
>
> On Fri, 2020-08-14 at 17:28 +0200, Rafael J. Wysocki wrote:
> > On Thu, Aug 13, 2020 at 4:54 AM Wang Qing  wrote:
> > > Use kobj_to_dev() instead of container_of()
> > >
> > > Signed-off-by: Wang Qing 
> >
> > LGTM
> >
> > Dan, any objections?
>
> Looks good to me - you can add:
> Acked-by: Vishal Verma 

Applied as 5.10 material, thanks!

> > > ---
> > >  drivers/acpi/nfit/core.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> > > index fa4500f..3bb350b
> > > --- a/drivers/acpi/nfit/core.c
> > > +++ b/drivers/acpi/nfit/core.c
> > > @@ -1382,7 +1382,7 @@ static bool ars_supported(struct nvdimm_bus 
> > > *nvdimm_bus)
> > >
> > >  static umode_t nfit_visible(struct kobject *kobj, struct attribute *a, 
> > > int n)
> > >  {
> > > -   struct device *dev = container_of(kobj, struct device, kobj);
> > > +   struct device *dev = kobj_to_dev(kobj);
> > > struct nvdimm_bus *nvdimm_bus = to_nvdimm_bus(dev);
> > >
> > > if (a == &dev_attr_scrub.attr && !ars_supported(nvdimm_bus))
> > > @@ -1667,7 +1667,7 @@ static struct attribute 
> > > *acpi_nfit_dimm_attributes[] = {
> > >  static umode_t acpi_nfit_dimm_attr_visible(struct kobject *kobj,
> > > struct attribute *a, int n)
> > >  {
> > > -   struct device *dev = container_of(kobj, struct device, kobj);
> > > +   struct device *dev = kobj_to_dev(kobj);
> > > struct nvdimm *nvdimm = to_nvdimm(dev);
> > > struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
> > >
> > > --
> > > 2.7.4
> > >


Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-15 Thread Mikulas Patocka



On Tue, 15 Sep 2020, Mikulas Patocka wrote:

> > > - __copy_from_user_inatomic_nocache doesn't flush cache for leading and
> > > trailing bytes.
> > 
> > You want copy_user_flushcache(). See how fs/dax.c arranges for
> > dax_copy_from_iter() to route to pmem_copy_from_iter().
> 
> Is it something new for kernel 5.10? I see only __copy_user_flushcache, 
> which is implemented just for x86 and arm64.
> 
> There is __copy_from_user_flushcache implemented for x86, arm64 and power. 
> It is used in lib/iov_iter.c under
> #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE - so should I use this?
> 
> Mikulas

... and __copy_user_flushcache is not exported for modules. So, I am stuck 
with __copy_from_user_inatomic_nocache.

Mikulas


Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-15 Thread Mikulas Patocka



On Tue, 15 Sep 2020, Dan Williams wrote:

> > - when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses
> > buffer cache for the mapping. The buffer cache slows down fsck by a factor
> > of 5 to 10. Could it be possible to change the kernel so that it maps DAX
> > based block devices directly?
> 
> We've been down this path before.
> 
> 5a023cdba50c block: enable dax for raw block devices
> 9f4736fe7ca8 block: revert runtime dax control of the raw block device
> acc93d30d7d4 Revert "block: enable dax for raw block devices"

It says "The functionality is superseded by the new 'Device DAX' 
facility". But the fsck tool can't change a fsdax device into a devdax 
device just for checking. Or can it?

> EXT2/4 metadata buffer management depends on the page cache and we
> eliminated a class of bugs by removing that support. The problems are
> likely tractable, but there was not a straightforward fix visible at
> the time.

Thinking about it - it isn't as easy as it looks...

Suppose that the user mounts an ext2 filesystem and then uses the tune2fs 
tool on the mounted block device. The tune2fs tool reads and writes the 
mounted superblock directly.

So, read/write must be coherent with the buffer cache (otherwise the 
kernel would not see the changes written by tune2fs). And mmap must be 
coherent with read/write.

So, if we want to map the pmem device directly, we could add a new flag 
MAP_DAX. Or we could test if the fd has O_DIRECT flag and map it directly 
in this case. But the default must be to map it coherently in order to not 
break existing programs.

> > - __copy_from_user_inatomic_nocache doesn't flush cache for leading and
> > trailing bytes.
> 
> You want copy_user_flushcache(). See how fs/dax.c arranges for
> dax_copy_from_iter() to route to pmem_copy_from_iter().

Is it something new for kernel 5.10? I see only __copy_user_flushcache, 
which is implemented just for x86 and arm64.

There is __copy_from_user_flushcache implemented for x86, arm64 and power. 
It is used in lib/iov_iter.c under
#ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE - so should I use this?

Mikulas


Re: [RFC PATCH 2/4] pagemap: introduce ->memory_failure()

2020-09-15 Thread Darrick J. Wong
On Tue, Sep 15, 2020 at 06:13:09PM +0800, Shiyang Ruan wrote:
> When a memory failure occurs, we call this function, which is implemented
> by each device.  For fsdax, the pmem device implements it.  The pmem
> device will find the block device where the error page is located, get
> the filesystem on this block device, and finally call ->storage_lost()
> to handle the error at the filesystem layer.
> 
> Normally, a pmem device may contain one or more partitions; each
> partition contains a block device, and each block device contains a
> filesystem.  So we are able to find the filesystem from an offset on
> the pmem device.  However, in other cases, such as a mapped device, I
> didn't find a way to obtain the filesystem sitting on top of it.  This
> is a problem that needs to be fixed.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  block/genhd.c| 12 
>  drivers/nvdimm/pmem.c| 31 +++
>  include/linux/genhd.h|  2 ++
>  include/linux/memremap.h |  3 +++
>  4 files changed, 48 insertions(+)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 99c64641c314..e7442b60683e 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -1063,6 +1063,18 @@ struct block_device *bdget_disk(struct gendisk *disk, 
> int partno)
>  }
>  EXPORT_SYMBOL(bdget_disk);
>  
> +struct block_device *bdget_disk_sector(struct gendisk *disk, sector_t sector)
> +{
> + struct block_device *bdev = NULL;
> + struct hd_struct *part = disk_map_sector_rcu(disk, sector);
> +
> + if (part)
> + bdev = bdget(part_devt(part));
> +
> + return bdev;
> +}
> +EXPORT_SYMBOL(bdget_disk_sector);
> +
>  /*
>   * print a full list of all partitions - intended for places where the root
>   * filesystem can't be mounted and thus to give the victim some idea of what
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index fab29b514372..3ed96486c883 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -364,9 +364,40 @@ static void pmem_release_disk(void *__pmem)
>   put_disk(pmem->disk);
>  }
>  
> +static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
> + struct mf_recover_controller *mfrc)
> +{
> + struct pmem_device *pdev;
> + struct block_device *bdev;
> + sector_t disk_sector;
> + loff_t bdev_offset;
> +
> + pdev = container_of(pgmap, struct pmem_device, pgmap);
> + if (!pdev->disk)
> + return -ENXIO;
> +
> + disk_sector = (PFN_PHYS(mfrc->pfn) - pdev->phys_addr) >> SECTOR_SHIFT;

Ah, I see, looking at the current x86 MCE code, the MCE handler gets a
physical address, which is then rounded down to a PFN, which is then
blown back up into a byte address(?) and then rounded down to sectors.
That is then blown back up into a byte address and passed on to XFS,
which rounds it down to fs blocksize.

/me wishes that wasn't so convoluted, but reforming the whole mm poison
system to have smaller blast radii isn't the purpose of this patch. :)

> + bdev = bdget_disk_sector(pdev->disk, disk_sector);
> + if (!bdev)
> + return -ENXIO;
> +
> + // TODO what if block device contains a mapped device

Find its dev_pagemap_ops and invoke its memory_failure function? ;)

> + if (!bdev->bd_super)
> + goto out;
> +
> + bdev_offset = ((disk_sector - get_start_sect(bdev)) << SECTOR_SHIFT) -
> + pdev->data_offset;
> + bdev->bd_super->s_op->storage_lost(bdev->bd_super, bdev_offset, mfrc);

->storage_lost is required for all filesystems?

--D

> +
> +out:
> + bdput(bdev);
> + return 0;
> +}
> +
>  static const struct dev_pagemap_ops fsdax_pagemap_ops = {
>   .kill   = pmem_pagemap_kill,
>   .cleanup= pmem_pagemap_cleanup,
> + .memory_failure = pmem_pagemap_memory_failure,
>  };
>  
>  static int pmem_attach_disk(struct device *dev,
> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> index 4ab853461dff..16e9e13e0841 100644
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -303,6 +303,8 @@ static inline void add_disk_no_queue_reg(struct gendisk 
> *disk)
>  extern void del_gendisk(struct gendisk *gp);
>  extern struct gendisk *get_gendisk(dev_t dev, int *partno);
>  extern struct block_device *bdget_disk(struct gendisk *disk, int partno);
> +extern struct block_device *bdget_disk_sector(struct gendisk *disk,
> + sector_t sector);
>  
>  extern void set_device_ro(struct block_device *bdev, int flag);
>  extern void set_disk_ro(struct gendisk *disk, int flag);
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 5f5b2df06e61..efebefa70d00 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -6,6 +6,7 @@
>  
>  struct resource;
>  struct device;
> +struct mf_recover_controller;
>  
>  /**
>   * struct vmem_altmap - pre-allocated storage for vmemmap_populate
> @@ -87,6 +88,8 @@ struct dev_pagemap_ops {
>* the 

Re: [RFC PATCH 1/4] fs: introduce ->storage_lost() for memory-failure

2020-09-15 Thread Darrick J. Wong
On Tue, Sep 15, 2020 at 06:13:08PM +0800, Shiyang Ruan wrote:
> This function is used to handle errors which may cause data loss in a
> filesystem, such as memory failure in fsdax mode.
> 
> In XFS, the "rmapbt" feature is required in order to query for the files
> or metadata associated with the error block.  We could then call fs
> recovery functions to try to repair the damaged data (not implemented
> in this patch).
> 
> After that, memory-failure also needs to kill the processes that are
> using those files.  The struct mf_recover_controller is created to store
> the necessary parameters.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/xfs/xfs_super.c | 80 ++
>  include/linux/fs.h |  1 +
>  include/linux/mm.h |  6 
>  3 files changed, 87 insertions(+)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 71ac6c1cdc36..118d9c5d9e1e 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -35,6 +35,10 @@
>  #include "xfs_refcount_item.h"
>  #include "xfs_bmap_item.h"
>  #include "xfs_reflink.h"
> +#include "xfs_alloc.h"
> +#include "xfs_rmap.h"
> +#include "xfs_rmap_btree.h"
> +#include "xfs_bit.h"
>  
>  #include 
>  #include 
> @@ -1104,6 +1108,81 @@ xfs_fs_free_cached_objects(
>   return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
>  }
>  
> +static int
> +xfs_storage_lost_helper(
> + struct xfs_btree_cur*cur,
> + struct xfs_rmap_irec*rec,
> + void*priv)
> +{
> + struct xfs_inode*ip;
> + struct mf_recover_controller*mfrc = priv;
> + int rc = 0;
> +
> + if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner)) {
> + // TODO check and try to fix metadata
> + } else {
> + /*
> +  * Get files that are in-core; filter out others that are not in use.
> +  */
> + xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner, XFS_IGET_INCORE,
> +  0, &ip);

Missing return value check here.

> + if (!ip)
> + return 0;
> + if (!VFS_I(ip)->i_mapping)
> + goto out;
> +
> + rc = mfrc->recover_fn(mfrc->pfn, mfrc->flags,
> +   VFS_I(ip)->i_mapping, rec->rm_offset);
> +
> + // TODO try to fix data
> +out:
> + xfs_irele(ip);
> + }
> +
> + return rc;
> +}
> +
> +static int
> +xfs_fs_storage_lost(
> + struct super_block  *sb,
> + loff_t  offset,

offset into which device?  XFS supports three...

I'm also a little surprised you don't pass in a length.

I guess that means this function will get called repeatedly for every
byte in the poisoned range?

> + void*priv)
> +{
> + struct xfs_mount*mp = XFS_M(sb);
> + struct xfs_trans*tp = NULL;
> + struct xfs_btree_cur*cur = NULL;
> + struct xfs_rmap_irecrmap_low, rmap_high;
> + struct xfs_buf  *agf_bp = NULL;
> + xfs_fsblock_t   fsbno = XFS_B_TO_FSB(mp, offset);
> + xfs_agnumber_t  agno = XFS_FSB_TO_AGNO(mp, fsbno);
> + xfs_agblock_t   agbno = XFS_FSB_TO_AGBNO(mp, fsbno);
> + int error = 0;
> +
> + error = xfs_trans_alloc_empty(mp, &tp);
> + if (error)
> + return error;
> +
> + error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
> + if (error)
> + return error;
> +
> + cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agno);

...and this is definitely the wrong call sequence if the malfunctioning
device is the realtime device.  If a dax rt device dies, you'll be
shooting down files on the data device, which will cause all sorts of
problems.

Question: Should all this poison recovery stuff go into a new file?
xfs_poison.c?  There's already a lot of code in xfs_super.c.

--D

> +
> + /* Construct a range for rmap query */
> + memset(&rmap_low, 0, sizeof(rmap_low));
> + memset(&rmap_high, 0xFF, sizeof(rmap_high));
> + rmap_low.rm_startblock = rmap_high.rm_startblock = agbno;
> +
> + error = xfs_rmap_query_range(cur, &rmap_low, &rmap_high,
> +  xfs_storage_lost_helper, priv);
> + if (error == -ECANCELED)
> + error = 0;
> +
> + xfs_btree_del_cursor(cur, error);
> + xfs_trans_brelse(tp, agf_bp);
> + return error;
> +}
> +
>  static const struct super_operations xfs_super_operations = {
>   .alloc_inode= xfs_fs_alloc_inode,
>   .destroy_inode  = xfs_fs_destroy_inode,
> @@ -1117,6 +1196,7 @@ static const struct super_operations 
> xfs_super_operations = {
>   .show_options   = xfs_fs_show_options,
>   .nr_cached_objects  = xfs_fs_nr_cached_objects,
>   .free_cached_objects= xfs_fs_free_cached_objects,
> + .storage_lost   = xfs_fs_storage_lost,
>  };
>  
>  static int
> diff --git 

Re: [PATCH v2] powerpc/papr_scm: Fix warning triggered by perf_stats_show()

2020-09-15 Thread Vaibhav Jain
Michael Ellerman  writes:

> Vaibhav Jain  writes:
>> A warning is reported by the kernel in case perf_stats_show() returns
>> an error code. The warning is of the form below:
>>
>>  papr_scm ibm,persistent-memory:ibm,pmemory@4411:
>>Failed to query performance stats, Err:-10
>>  dev_attr_show: perf_stats_show+0x0/0x1c0 [papr_scm] returned bad count
>>  fill_read_buffer: dev_attr_show+0x0/0xb0 returned bad count
>>
>> On investigation it looks like the compiler is silently truncating the
>> return value of drc_pmem_query_stats() from 'long' to 'int', since the
>> variable used to store the return code 'rc' is an 'int'. This
>> truncated value is then returned as a 'ssize_t' from
>> perf_stats_show() to 'dev_attr_show()', which sees it as a large
>> unsigned number and triggers the warning.
>>
>> To fix this we update the type of the variable 'rc' from 'int' to
>> 'ssize_t', which prevents the compiler from truncating the return value
>> of drc_pmem_query_stats() and returns the correct signed value from
>> perf_stats_show().
>>
>> Fixes: 2d02bf835e573 ('powerpc/papr_scm: Fetch nvdimm performance
>>stats from PHYP')
>
> Please don't word-wrap the Fixes tag; it breaks b4.
>
> I've fixed it up this time.

Thanks Mpe

>
> cheers

-- 
Cheers
~ Vaibhav


Re: [PATCH v2 2/9] fs: Introduce i_blocks_per_page

2020-09-15 Thread Matthew Wilcox
On Tue, Sep 15, 2020 at 03:40:52PM +, David Laight wrote:
> > @@ -147,7 +147,7 @@ iomap_iop_set_range_uptodate(struct page *page, 
> > unsigned off, unsigned len)
> > unsigned int i;
> > 
> > spin_lock_irqsave(&iop->uptodate_lock, flags);
> > -   for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
> > +   for (i = 0; i < i_blocks_per_page(inode, page); i++) {
> 
> You probably don't want to call the helper every time
> around the loop.

This is a classic example of focusing on the details and missing the
larger picture.  We don't want the loop at all, and if you'd kept reading
the patch series, you'd see it disappear later.


RE: [PATCH v2 2/9] fs: Introduce i_blocks_per_page

2020-09-15 Thread David Laight
From: Matthew Wilcox (Oracle)
> Sent: 11 September 2020 00:47
> This helper is useful for both THPs and for supporting block size larger
> than page size.  Convert all users that I could find (we have a few
> different ways of writing this idiom, and I may have missed some).
> 
...
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index d81a9a86c5aa..330f86b825d7 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -46,7 +46,7 @@ iomap_page_create(struct inode *inode, struct page *page)
>  {
>   struct iomap_page *iop = to_iomap_page(page);
> 
> - if (iop || i_blocksize(inode) == PAGE_SIZE)
> + if (iop || i_blocks_per_page(inode, page) <= 1)
>   return iop;
> 
>   iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
> @@ -147,7 +147,7 @@ iomap_iop_set_range_uptodate(struct page *page, unsigned 
> off, unsigned len)
>   unsigned int i;
> 
> spin_lock_irqsave(&iop->uptodate_lock, flags);
> - for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
> + for (i = 0; i < i_blocks_per_page(inode, page); i++) {

You probably don't want to call the helper every time
around the loop.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)




Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-15 Thread Dan Williams
On Tue, Sep 15, 2020 at 5:35 AM Mikulas Patocka  wrote:
>
> Hi
>
> I am developing a new filesystem suitable for persistent memory - nvfs.

Nice!

> The goal is to have a small and fast filesystem that can be used on
> DAX-based devices. Nvfs maps the whole device into linear address space
> and it completely bypasses the overhead of the block layer and buffer
> cache.

So does device-dax, but device-dax lacks read(2)/write(2).

> In the past, there was nova filesystem for pmem, but it was abandoned a
> year ago (the last version is for the kernel 5.1 -
> https://github.com/NVSL/linux-nova ). Nvfs is smaller and performs better.
>
> The design of nvfs is similar to ext2/ext4, so that it fits into the VFS
> layer naturally, without too much glue code.
>
> I'd like to ask you to review it.
>
>
> tarballs:
> http://people.redhat.com/~mpatocka/nvfs/
> git:
> git://leontynka.twibright.com/nvfs.git
> the description of filesystem internals:
> http://people.redhat.com/~mpatocka/nvfs/INTERNALS
> benchmarks:
> http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS
>
>
> TODO:
>
> - programs run approximately 4% slower when running from Optane-based
> persistent memory. Therefore, programs and libraries should use page cache
> and not DAX mapping.

This needs to be based on platform firmware data (ACPI HMAT) for the
relative performance of a PMEM range vs DRAM. For example, this
tradeoff should not exist with battery-backed DRAM, or virtio-pmem.

>
> - when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses
> buffer cache for the mapping. The buffer cache slows down fsck by a factor
> of 5 to 10. Could it be possible to change the kernel so that it maps DAX
> based block devices directly?

We've been down this path before.

5a023cdba50c block: enable dax for raw block devices
9f4736fe7ca8 block: revert runtime dax control of the raw block device
acc93d30d7d4 Revert "block: enable dax for raw block devices"

EXT2/4 metadata buffer management depends on the page cache and we
eliminated a class of bugs by removing that support. The problems are
likely tractable, but there was not a straightforward fix visible at
the time.

> - __copy_from_user_inatomic_nocache doesn't flush cache for leading and
> trailing bytes.

You want copy_user_flushcache(). See how fs/dax.c arranges for
dax_copy_from_iter() to route to pmem_copy_from_iter().


Re: regression caused by patch 6180bb446ab624b9ab8bf201ed251ca87f07b413 ("dax: fix detection of dax support for non-persistent memory block devices")

2020-09-15 Thread Verma, Vishal L
On Tue, 2020-09-15 at 10:01 +0200, Jan Kara wrote:
> Hi!
> 
> On Tue 15-09-20 11:03:29, col...@suse.de wrote:
> > Could you please to take a look? I am offline in the next two weeks.
> 
> I just had a look into this. IMHO the justification in 6180bb446a "dax: fix
> detection of dax support for non-persistent memory block devices" is just
> bogus and people got confused by the previous condition
> 
> if (!dax_dev && !bdev_dax_supported(bdev, blocksize))
> 
> which was bogus as well. bdev_dax_supported() always returns false for bdev
> that doesn't have dax_dev (naturally so). So in the original condition
> there was no point in calling bdev_dax_supported() if we know dax_dev is
> NULL.
> 
> Then this was changed to:
> 
> if (!dax_dev || !bdev_dax_supported(bdev, blocksize))
> 
> which looks more sensible at the first sight. But only at the first sight -
> if you look at wider context, __generic_fsdax_supported() is the bulk of
> code that decides whether a device supports DAX so calling
> bdev_dax_supported() from it indeed doesn't look as such a great idea. So
> IMO the condition should be just:
> 
> if (!dax_dev)
> 
> I'll send a fix for this.
> 
> Also there's the process question how this patch could get to Linus when
> any attempt to use DAX would immediately kill the machine like Mikulas
> spotted. This shows that the patch was untested with DAX by anybody on the
> path from the developer to Linus...

Hi Jan,

This was entirely my fault, and I apologize. I got confused as to what
state my branches were in, and I thought this had cleared our unit tests
etc, when it obviously hadn't. I'm going to take a harder look at my
personal branch/patch management process to make sure it doesn't happen
again!

Thanks for taking a look.

-Vishal

> 
>   Honza
> 
> >  Original message 
> > From: Mikulas Patocka 
> > Date: Monday, September 14, 2020, 11:48 PM
> > To: Coly Li , Dan Williams ,
> > Dave Jiang 
> > Cc: Jan Kara , Vishal Verma ,
> > Adrian Huang , Ira Weiny , Mike
> > Snitzer , Pankaj Gupta ,
> > linux-nvdimm@lists.01.org
> > Subject: regression caused by patch 6180bb446ab624b9ab8bf201ed251ca87f07b413
> > ("dax: fix detection of dax support for non-persistent memory block
> > devices")
> > 
> > Hi
> > 
> > The patch 6180bb446ab624b9ab8bf201ed251ca87f07b413 ("dax: fix detection 
> > of
> > dax support for non-persistent memory block devices") causes crash when
> > attempting to mount the ext4 filesystem on /dev/pmem0 ("mkfs.ext4
> > /dev/pmem0; mount -t ext4 /dev/pmem0 /mnt/test"). The device /dev/pmem0 
> > is
> > emulated using the "memmap" kernel parameter.
> > 
> > The patch causes infinite recursion and double-fault exception:
> > 
> > __generic_fsdax_supported
> > bdev_dax_supported
> > __bdev_dax_supported
> > dax_supported
> > dax_dev->ops->dax_supported
> > generic_fsdax_supported
> > __generic_fsdax_supported
> > 
> > Mikulas
> > 
> > 
> > 
> > [   17.500619] traps: PANIC: double fault, error_code: 0x0
> > [   17.500619] double fault:  [#1] PREEMPT SMP
> > [   17.500620] CPU: 0 PID: 1326 Comm: mount Not tainted 
> > 5.9.0-rc1-bisect #
> > 10
> > [   17.500620] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > [   17.500621] RIP: 0010:__generic_fsdax_supported+0x6a/0x500
> > [   17.500622] Code: ff ff ff ff ff 7f 00 48 21 f3 48 01 c3 48 c1 e3 09 
> > f6
> > c7 0e 0f 85 fa 01 00 00 48 85 ff 49 89 fd 74 11 be 00 10 00 00 4c 89 e7
> >  b1 fe ff ff 84 c0 75 11 31 c0 48 83 c4 48 5b 5d 41 5c 41 5d 41
> > [   17.500623] RSP: 0018:88940b4fdff8 EFLAGS: 00010286
> > [   17.500624] RAX:  RBX: 0007f000 RCX:
> > 
> > [   17.500625] RDX: 1000 RSI: 1000 RDI:
> > 88940b34c300
> > [   17.500625] RBP:  R08: 0400 R09:
> > 8080808080808080
> > [   17.500626] R10:  R11: fefefefefefefeff R12:
> > 88940b34c300
> > [   17.500626] R13: 88940b3dc000 R14: 88940badd000 R15:
> > 0001
> > [   17.500627] FS:  f7c25780() GS:88940fa0()
> > knlGS:
> > [   17.500628] CS:  0010 DS: 002b ES: 002b CR0: 80050033
> > [   17.500628] CR2: 88940b4fdfe8 CR3: 00140bd15000 CR4:
> > 06b0
> > [   17.500628] Call Trace:
> > [   17.500629] Modules linked in: uvesafb cfbfillrect cfbimgblt cn
> > cfbcopyarea fb fbdev ipv6 tun autofs4 binfmt_misc configfs af_packet
> > virtio_rng rng_core mousedev evdev pcspkr virtio_balloon button raid10
> > raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor
> > async_tx libcrc32c raid1 raid0 md_mod sd_mod t10_pi virtio_scsi 
> > virtio_net
> > net_failover psmouse scsi_mod failover
> > [   17.500638] ---[ end trace 3c877fcb5b865459 ]---
> > [   

Re: [PATCH v2 2/9] fs: Introduce i_blocks_per_page

2020-09-15 Thread Dave Kleikamp
On 9/10/20 6:47 PM, Matthew Wilcox (Oracle) wrote:
> This helper is useful for both THPs and for supporting block size larger
> than page size.  Convert all users that I could find (we have a few
> different ways of writing this idiom, and I may have missed some).
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Dave Chinner 
> Reviewed-by: Darrick J. Wong 

For jfs:
Acked-by: Dave Kleikamp 

> ---
>  fs/iomap/buffered-io.c  |  8 
>  fs/jfs/jfs_metapage.c   |  2 +-
>  fs/xfs/xfs_aops.c   |  2 +-
>  include/linux/pagemap.h | 16 
>  4 files changed, 22 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index d81a9a86c5aa..330f86b825d7 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -46,7 +46,7 @@ iomap_page_create(struct inode *inode, struct page *page)
>  {
>   struct iomap_page *iop = to_iomap_page(page);
>  
> - if (iop || i_blocksize(inode) == PAGE_SIZE)
> + if (iop || i_blocks_per_page(inode, page) <= 1)
>   return iop;
>  
>   iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
> @@ -147,7 +147,7 @@ iomap_iop_set_range_uptodate(struct page *page, unsigned 
> off, unsigned len)
>   unsigned int i;
>  
>   spin_lock_irqsave(&iop->uptodate_lock, flags);
> - for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
> + for (i = 0; i < i_blocks_per_page(inode, page); i++) {
>   if (i >= first && i <= last)
>   set_bit(i, iop->uptodate);
>   else if (!test_bit(i, iop->uptodate))
> @@ -1077,7 +1077,7 @@ iomap_finish_page_writeback(struct inode *inode, struct 
> page *page,
>   mapping_set_error(inode->i_mapping, -EIO);
>   }
>  
> - WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE && !iop);
> + WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
>   WARN_ON_ONCE(iop && atomic_read(&iop->write_count) <= 0);
>  
>   if (!iop || atomic_dec_and_test(&iop->write_count))
> @@ -1373,7 +1373,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>   int error = 0, count = 0, i;
>   LIST_HEAD(submit_list);
>  
> - WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE && !iop);
> + WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
>   WARN_ON_ONCE(iop && atomic_read(&iop->write_count) != 0);
>  
>   /*
> diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
> index a2f5338a5ea1..176580f54af9 100644
> --- a/fs/jfs/jfs_metapage.c
> +++ b/fs/jfs/jfs_metapage.c
> @@ -473,7 +473,7 @@ static int metapage_readpage(struct file *fp, struct page 
> *page)
>   struct inode *inode = page->mapping->host;
>   struct bio *bio = NULL;
>   int block_offset;
> - int blocks_per_page = PAGE_SIZE >> inode->i_blkbits;
> + int blocks_per_page = i_blocks_per_page(inode, page);
>   sector_t page_start;/* address of page in fs blocks */
>   sector_t pblock;
>   int xlen;
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index b35611882ff9..55d126d4e096 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -544,7 +544,7 @@ xfs_discard_page(
>   page, ip->i_ino, offset);
>  
>   error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
> - PAGE_SIZE / i_blocksize(inode));
> + i_blocks_per_page(inode, page));
>   if (error && !XFS_FORCED_SHUTDOWN(mp))
>   xfs_alert(mp, "page discard unable to remove delalloc 
> mapping.");
>  out_invalidate:
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 50d2c39b47ab..f7f602040913 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -975,4 +975,20 @@ static inline int page_mkwrite_check_truncate(struct 
> page *page,
>   return offset;
>  }
>  
> +/**
> + * i_blocks_per_page - How many blocks fit in this page.
> + * @inode: The inode which contains the blocks.
> + * @page: The page (head page if the page is a THP).
> + *
> + * If the block size is larger than the size of this page, return zero.
> + *
> + * Context: The caller should hold a refcount on the page to prevent it
> + * from being split.
> + * Return: The number of filesystem blocks covered by this page.
> + */
> +static inline
> +unsigned int i_blocks_per_page(struct inode *inode, struct page *page)
> +{
> + return thp_size(page) >> inode->i_blkbits;
> +}
>  #endif /* _LINUX_PAGEMAP_H */
> 


Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-15 Thread Mikulas Patocka



On Tue, 15 Sep 2020, Matthew Wilcox wrote:

> On Tue, Sep 15, 2020 at 08:34:41AM -0400, Mikulas Patocka wrote:
> > - when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses 
> > buffer cache for the mapping. The buffer cache slows down fsck by a factor 
> > of 5 to 10. Could it be possible to change the kernel so that it maps DAX 
> > based block devices directly?
> 
> Oh, because fs/block_dev.c has:
> .mmap   = generic_file_mmap,
> 
> I don't see why we shouldn't have a blkdev_mmap modelled after
> ext2_file_mmap() with the corresponding blkdev_dax_vm_ops.

Yes, that's possible - and we would also have to rewrite the methods 
read_iter and write_iter on DAX block devices, so that they are coherent 
with mmap.

Mikulas


Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-15 Thread Matthew Wilcox
On Tue, Sep 15, 2020 at 08:34:41AM -0400, Mikulas Patocka wrote:
> - when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses 
> buffer cache for the mapping. The buffer cache slows down fsck by a factor 
> of 5 to 10. Could it be possible to change the kernel so that it maps DAX 
> based block devices directly?

Oh, because fs/block_dev.c has:
.mmap   = generic_file_mmap,

I don't see why we shouldn't have a blkdev_mmap modelled after
ext2_file_mmap() with the corresponding blkdev_dax_vm_ops.


Re: [PATCH] libnvdimm: Fix dereference of pointer ndns before it is null checked

2020-09-15 Thread Markus Elfring
> … Fix this by

I suggest replacing this information with the tag “Fixes”.


> dereferencing ndns after ndns has been null pointer sanity checked.

Would another imperative wording be helpful for the change description?

Regards,
Markus


[RFC] nvfs: a filesystem for persistent memory

2020-09-15 Thread Mikulas Patocka
Hi

I am developing a new filesystem suitable for persistent memory - nvfs. 
The goal is to have a small and fast filesystem that can be used on 
DAX-based devices. Nvfs maps the whole device into linear address space 
and it completely bypasses the overhead of the block layer and buffer 
cache.

In the past, there was nova filesystem for pmem, but it was abandoned a 
year ago (the last version is for the kernel 5.1 - 
https://github.com/NVSL/linux-nova ). Nvfs is smaller and performs better.

The design of nvfs is similar to ext2/ext4, so that it fits into the VFS 
layer naturally, without too much glue code.

I'd like to ask you to review it.


tarballs:
http://people.redhat.com/~mpatocka/nvfs/
git:
git://leontynka.twibright.com/nvfs.git
the description of filesystem internals:
http://people.redhat.com/~mpatocka/nvfs/INTERNALS
benchmarks:
http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS


TODO:

- programs run approximately 4% slower when running from Optane-based 
persistent memory. Therefore, programs and libraries should use page cache 
and not DAX mapping.

- when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses 
buffer cache for the mapping. The buffer cache slows down fsck by a factor 
of 5 to 10. Could it be possible to change the kernel so that it maps DAX 
based block devices directly?

- __copy_from_user_inatomic_nocache doesn't flush cache for leading and 
trailing bytes.

Mikulas


Re: [PATCH v2] powerpc/papr_scm: Fix warning triggered by perf_stats_show()

2020-09-15 Thread Michael Ellerman
Vaibhav Jain  writes:
> A warning is reported by the kernel in case perf_stats_show() returns
> an error code. The warning is of the form below:
>
>  papr_scm ibm,persistent-memory:ibm,pmemory@4411:
> Failed to query performance stats, Err:-10
>  dev_attr_show: perf_stats_show+0x0/0x1c0 [papr_scm] returned bad count
>  fill_read_buffer: dev_attr_show+0x0/0xb0 returned bad count
>
> On investigation it looks like the compiler is silently truncating the
> return value of drc_pmem_query_stats() from 'long' to 'int', since the
> variable used to store the return code 'rc' is an 'int'. This
> truncated value is then returned as a 'ssize_t' from
> perf_stats_show() to 'dev_attr_show()', which treats it as a large
> unsigned number and triggers this warning.
>
> To fix this we update the type of variable 'rc' from 'int' to
> 'ssize_t' that prevents the compiler from truncating the return value
> of drc_pmem_query_stats() and returning correct signed value back from
> perf_stats_show().
>
> Fixes: 2d02bf835e573 ('powerpc/papr_scm: Fetch nvdimm performance
>stats from PHYP')

Please don't word-wrap the Fixes tag; it breaks b4.

I've fixed it up this time.

cheers


RE: [External] regression caused by patch 6180bb446ab624b9ab8bf201ed251ca87f07b413 ("dax: fix detection of dax support for non-persistent memory block devices")

2020-09-15 Thread Mikulas Patocka



On Tue, 15 Sep 2020, Adrian Huang12 wrote:

> Hi Mikulas,
> 
> > -Original Message-
> > From: Mikulas Patocka 
> > Sent: Monday, September 14, 2020 11:49 PM
> > To: Coly Li ; Dan Williams ; Dave
> > Jiang 
> > Cc: Jan Kara ; Vishal Verma ;
> > Adrian Huang12 ; Ira Weiny ;
> > Mike Snitzer ; Pankaj Gupta
> > ; linux-nvdimm@lists.01.org
> > Subject: [External] regression caused by patch
> > 6180bb446ab624b9ab8bf201ed251ca87f07b413 ("dax: fix detection of dax
> > support for non-persistent memory block devices")
> > 
> > Hi
> > 
> > The patch 6180bb446ab624b9ab8bf201ed251ca87f07b413 ("dax: fix
> > detection of dax support for non-persistent memory block devices") causes
> > crash when attempting to mount the ext4 filesystem on /dev/pmem0
> > ("mkfs.ext4 /dev/pmem0; mount -t ext4 /dev/pmem0 /mnt/test"). The device
> > /dev/pmem0 is emulated using the "memmap" kernel parameter.
> > 
> 
> Could you please test the following patch? Thanks.
> https://lists.01.org/hyperkitty/list/linux-nvdimm@lists.01.org/thread/2JDBSE2WK75LSGCFEOY3RXRN3CNLBPB2/

I tested it and it works.

Mikulas


[RFC PATCH 1/4] fs: introduce ->storage_lost() for memory-failure

2020-09-15 Thread Shiyang Ruan
This function is used to handle errors which may cause data loss in the
filesystem, such as memory-failure in fsdax mode.

In XFS, it requires the "rmapbt" feature in order to query for files or
metadata associated with the error block.  Then we could call fs
recovery functions to try to repair the damaged data (not implemented
in this patch).

After that, memory-failure also needs to kill processes that are
using those files.  The struct mf_recover_controller is created to store
the necessary parameters.

Signed-off-by: Shiyang Ruan 
---
 fs/xfs/xfs_super.c | 80 ++
 include/linux/fs.h |  1 +
 include/linux/mm.h |  6 
 3 files changed, 87 insertions(+)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 71ac6c1cdc36..118d9c5d9e1e 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -35,6 +35,10 @@
 #include "xfs_refcount_item.h"
 #include "xfs_bmap_item.h"
 #include "xfs_reflink.h"
+#include "xfs_alloc.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_bit.h"
 
 #include 
 #include 
@@ -1104,6 +1108,81 @@ xfs_fs_free_cached_objects(
return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
 }
 
+static int
+xfs_storage_lost_helper(
+   struct xfs_btree_cur*cur,
+   struct xfs_rmap_irec*rec,
+   void*priv)
+{
+   struct xfs_inode*ip;
+   struct mf_recover_controller*mfrc = priv;
+   int rc = 0;
+
+   if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner)) {
+   // TODO check and try to fix metadata
+   } else {
+   /*
+* Get files that incore, filter out others that are not in use.
+*/
+   rc = xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner, XFS_IGET_INCORE,
+0, &ip);
+   if (rc || !ip)
+   return 0;
+   if (!VFS_I(ip)->i_mapping)
+   goto out;
+
+   rc = mfrc->recover_fn(mfrc->pfn, mfrc->flags,
+ VFS_I(ip)->i_mapping, rec->rm_offset);
+
+   // TODO try to fix data
+out:
+   xfs_irele(ip);
+   }
+
+   return rc;
+}
+
+static int
+xfs_fs_storage_lost(
+   struct super_block  *sb,
+   loff_t  offset,
+   void*priv)
+{
+   struct xfs_mount*mp = XFS_M(sb);
+   struct xfs_trans*tp = NULL;
+   struct xfs_btree_cur*cur = NULL;
+   struct xfs_rmap_irecrmap_low, rmap_high;
+   struct xfs_buf  *agf_bp = NULL;
+   xfs_fsblock_t   fsbno = XFS_B_TO_FSB(mp, offset);
+   xfs_agnumber_t  agno = XFS_FSB_TO_AGNO(mp, fsbno);
+   xfs_agblock_t   agbno = XFS_FSB_TO_AGBNO(mp, fsbno);
+   int error = 0;
+
+   error = xfs_trans_alloc_empty(mp, &tp);
+   if (error)
+   return error;
+
+   error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
+   if (error)
+   return error;
+
+   cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agno);
+
+   /* Construct a range for rmap query */
+   memset(&rmap_low, 0, sizeof(rmap_low));
+   memset(&rmap_high, 0xFF, sizeof(rmap_high));
+   rmap_low.rm_startblock = rmap_high.rm_startblock = agbno;
+
+   error = xfs_rmap_query_range(cur, &rmap_low, &rmap_high,
+xfs_storage_lost_helper, priv);
+   if (error == -ECANCELED)
+   error = 0;
+
+   xfs_btree_del_cursor(cur, error);
+   xfs_trans_brelse(tp, agf_bp);
+   return error;
+}
+
 static const struct super_operations xfs_super_operations = {
.alloc_inode= xfs_fs_alloc_inode,
.destroy_inode  = xfs_fs_destroy_inode,
@@ -1117,6 +1196,7 @@ static const struct super_operations xfs_super_operations 
= {
.show_options   = xfs_fs_show_options,
.nr_cached_objects  = xfs_fs_nr_cached_objects,
.free_cached_objects= xfs_fs_free_cached_objects,
+   .storage_lost   = xfs_fs_storage_lost,
 };
 
 static int
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e019ea2f1347..bd90485cfdc9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1951,6 +1951,7 @@ struct super_operations {
  struct shrink_control *);
long (*free_cached_objects)(struct super_block *,
struct shrink_control *);
+   int (*storage_lost)(struct super_block *sb, loff_t offset, void *priv);
 };
 
 /*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1983e08f5906..3f0c36e1bf3d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3002,6 +3002,12 @@ extern void shake_page(struct page *p, int access);
 extern atomic_long_t num_poisoned_pages __read_mostly;
 extern int soft_offline_page(unsigned long pfn, int flags);
 
+struct 

[RFC PATCH 2/4] pagemap: introduce ->memory_failure()

2020-09-15 Thread Shiyang Ruan
When memory-failure occurs, we call this function, which is implemented
by each device.  For fsdax, the pmem device implements it.  The pmem
device will find the block device where the error page is located, get
the filesystem on this block device, and finally call ->storage_lost()
to handle the error in the filesystem layer.

Normally, a pmem device may contain one or more partitions, each
partition contains a block device, and each block device contains a
filesystem.  So we are able to find the filesystem from an offset on
this pmem device.  However, in other cases, such as a mapped device, I
didn't find a way to obtain the filesystem laying on it.  It is a
problem that needs to be fixed.

Signed-off-by: Shiyang Ruan 
---
 block/genhd.c| 12 
 drivers/nvdimm/pmem.c| 31 +++
 include/linux/genhd.h|  2 ++
 include/linux/memremap.h |  3 +++
 4 files changed, 48 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 99c64641c314..e7442b60683e 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1063,6 +1063,18 @@ struct block_device *bdget_disk(struct gendisk *disk, 
int partno)
 }
 EXPORT_SYMBOL(bdget_disk);
 
+struct block_device *bdget_disk_sector(struct gendisk *disk, sector_t sector)
+{
+   struct block_device *bdev = NULL;
+   struct hd_struct *part = disk_map_sector_rcu(disk, sector);
+
+   if (part)
+   bdev = bdget(part_devt(part));
+
+   return bdev;
+}
+EXPORT_SYMBOL(bdget_disk_sector);
+
 /*
  * print a full list of all partitions - intended for places where the root
  * filesystem can't be mounted and thus to give the victim some idea of what
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index fab29b514372..3ed96486c883 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -364,9 +364,40 @@ static void pmem_release_disk(void *__pmem)
put_disk(pmem->disk);
 }
 
+static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
+   struct mf_recover_controller *mfrc)
+{
+   struct pmem_device *pdev;
+   struct block_device *bdev;
+   sector_t disk_sector;
+   loff_t bdev_offset;
+
+   pdev = container_of(pgmap, struct pmem_device, pgmap);
+   if (!pdev->disk)
+   return -ENXIO;
+
+   disk_sector = (PFN_PHYS(mfrc->pfn) - pdev->phys_addr) >> SECTOR_SHIFT;
+   bdev = bdget_disk_sector(pdev->disk, disk_sector);
+   if (!bdev)
+   return -ENXIO;
+
+   // TODO what if block device contains a mapped device
+   if (!bdev->bd_super)
+   goto out;
+
+   bdev_offset = ((disk_sector - get_start_sect(bdev)) << SECTOR_SHIFT) -
+   pdev->data_offset;
+   bdev->bd_super->s_op->storage_lost(bdev->bd_super, bdev_offset, mfrc);
+
+out:
+   bdput(bdev);
+   return 0;
+}
+
 static const struct dev_pagemap_ops fsdax_pagemap_ops = {
.kill   = pmem_pagemap_kill,
.cleanup= pmem_pagemap_cleanup,
+   .memory_failure = pmem_pagemap_memory_failure,
 };
 
 static int pmem_attach_disk(struct device *dev,
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 4ab853461dff..16e9e13e0841 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -303,6 +303,8 @@ static inline void add_disk_no_queue_reg(struct gendisk 
*disk)
 extern void del_gendisk(struct gendisk *gp);
 extern struct gendisk *get_gendisk(dev_t dev, int *partno);
 extern struct block_device *bdget_disk(struct gendisk *disk, int partno);
+extern struct block_device *bdget_disk_sector(struct gendisk *disk,
+   sector_t sector);
 
 extern void set_device_ro(struct block_device *bdev, int flag);
 extern void set_disk_ro(struct gendisk *disk, int flag);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 5f5b2df06e61..efebefa70d00 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -6,6 +6,7 @@
 
 struct resource;
 struct device;
+struct mf_recover_controller;
 
 /**
  * struct vmem_altmap - pre-allocated storage for vmemmap_populate
@@ -87,6 +88,8 @@ struct dev_pagemap_ops {
 * the page back to a CPU accessible page.
 */
vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
+   int (*memory_failure)(struct dev_pagemap *pgmap,
+ struct mf_recover_controller *mfrc);
 };
 
 #define PGMAP_ALTMAP_VALID (1 << 0)
-- 
2.28.0




[RFC PATCH 4/4] fsdax: remove useless (dis)associate functions

2020-09-15 Thread Shiyang Ruan
Since owner tracking is triggered by the pmem device, these functions are
no longer used.  So remove them.

Signed-off-by: Shiyang Ruan 
---
 fs/dax.c | 46 --
 1 file changed, 46 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1ec592f0aadd..4c85b07abcc0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -323,48 +323,6 @@ static unsigned long dax_end_pfn(void *entry)
for (pfn = dax_to_pfn(entry); \
pfn < dax_end_pfn(entry); pfn++)
 
-/*
- * TODO: for reflink+dax we need a way to associate a single page with
- * multiple address_space instances at different linear_page_index()
- * offsets.
- */
-static void dax_associate_entry(void *entry, struct address_space *mapping,
-   struct vm_area_struct *vma, unsigned long address)
-{
-   unsigned long size = dax_entry_size(entry), pfn, index;
-   int i = 0;
-
-   if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
-   return;
-
-   index = linear_page_index(vma, address & ~(size - 1));
-   for_each_mapped_pfn(entry, pfn) {
-   struct page *page = pfn_to_page(pfn);
-
-   WARN_ON_ONCE(page->mapping);
-   page->mapping = mapping;
-   page->index = index + i++;
-   }
-}
-
-static void dax_disassociate_entry(void *entry, struct address_space *mapping,
-   bool trunc)
-{
-   unsigned long pfn;
-
-   if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
-   return;
-
-   for_each_mapped_pfn(entry, pfn) {
-   struct page *page = pfn_to_page(pfn);
-
-   WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
-   WARN_ON_ONCE(page->mapping && page->mapping != mapping);
-   page->mapping = NULL;
-   page->index = 0;
-   }
-}
-
 static struct page *dax_busy_page(void *entry)
 {
unsigned long pfn;
@@ -516,7 +474,6 @@ static void *grab_mapping_entry(struct xa_state *xas,
xas_lock_irq(xas);
}
 
-   dax_disassociate_entry(entry, mapping, false);
xas_store(xas, NULL);   /* undo the PMD join */
dax_wake_entry(xas, entry, true);
mapping->nrexceptional--;
@@ -636,7 +593,6 @@ static int __dax_invalidate_entry(struct address_space 
*mapping,
(xas_get_mark(&xas, PAGECACHE_TAG_DIRTY) ||
 xas_get_mark(&xas, PAGECACHE_TAG_TOWRITE)))
goto out;
-   dax_disassociate_entry(entry, mapping, trunc);
xas_store(&xas, NULL);
mapping->nrexceptional--;
ret = 1;
@@ -730,8 +686,6 @@ static void *dax_insert_entry(struct xa_state *xas,
if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
void *old;
 
-   dax_disassociate_entry(entry, mapping, false);
-   dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
/*
 * Only swap our new entry into the page cache if the current
 * entry is a zero page or an empty entry.  If a normal PTE or
-- 
2.28.0




[RFC PATCH 0/4] fsdax: introduce fs query to support reflink

2020-09-15 Thread Shiyang Ruan
This patchset is an attempt to resolve the problem of tracking shared
pages for fsdax.

This patchset moves owner tracking from dax_associate_entry() to the
pmem device, by introducing an interface ->memory_failure() of struct
pagemap.  The interface is called by memory_failure() in mm, and
implemented by the pmem device.  The pmem device then finds the
filesystem in which the damaged page is located, and calls
->storage_lost() to track files or metadata associated with this page.
Finally we are able to try to fix the damaged data in the filesystem
and do other necessary processing, such as killing processes that are
using the affected files.

The call trace is like this:
 memory_failure()
   pgmap->ops->memory_failure()      => pmem_pgmap_memory_failure()
     bdev->bd_super->storage_lost()  => xfs_fs_storage_lost()
       xfs_rmap_query_range()
         xfs_storage_lost_helper()
           mf_recover_controller->recover_fn
                                     => memory_failure_dev_pagemap_kill_procs()

The collect_procs() and kill_procs() are moved into a callback which
is passed from memory_failure() to xfs_storage_lost_helper().  So we
can call it when an associated file is found, instead of creating a
file list and iterating over it.

The fsdax & reflink support for XFS is not contained in this patchset.

(Rebased on v5.9-rc2)
==

Shiyang Ruan (4):
  fs: introduce ->storage_lost() for memory-failure
  pagemap: introduce ->memory_failure()
  mm, fsdax: refactor dax handler in memory-failure
  fsdax: remove useless (dis)associate functions

 block/genhd.c|  12 
 drivers/nvdimm/pmem.c|  31 ++
 fs/dax.c |  64 ++--
 fs/xfs/xfs_super.c   |  80 
 include/linux/dax.h  |   5 +-
 include/linux/fs.h   |   1 +
 include/linux/genhd.h|   2 +
 include/linux/memremap.h |   3 +
 include/linux/mm.h   |  14 +
 mm/memory-failure.c  | 127 ---
 10 files changed, 229 insertions(+), 110 deletions(-)

-- 
2.28.0




[RFC PATCH 3/4] mm, fsdax: refactor dax handler in memory-failure

2020-09-15 Thread Shiyang Ruan
With ->memory_failure() implemented in the pmem device and
->storage_lost() in XFS, we are able to track files or metadata
and process them further.

We no longer track files by page->mapping and page->index, so
some functions that obtain ->mapping and ->index from a struct page
parameter need to be changed to take the mapping and index directly.

Signed-off-by: Shiyang Ruan 
---
 fs/dax.c|  18 +++
 include/linux/dax.h |   5 +-
 include/linux/mm.h  |   8 +++
 mm/memory-failure.c | 127 +++-
 4 files changed, 94 insertions(+), 64 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 95341af1a966..1ec592f0aadd 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -379,14 +379,14 @@ static struct page *dax_busy_page(void *entry)
 }
 
 /*
- * dax_lock_mapping_entry - Lock the DAX entry corresponding to a page
+ * dax_lock - Lock the DAX entry corresponding to a page
  * @page: The page whose entry we want to lock
  *
  * Context: Process context.
  * Return: A cookie to pass to dax_unlock_page() or 0 if the entry could
  * not be locked.
  */
-dax_entry_t dax_lock_page(struct page *page)
+dax_entry_t dax_lock(struct address_space *mapping, pgoff_t index)
 {
XA_STATE(xas, NULL, 0);
void *entry;
@@ -394,8 +394,6 @@ dax_entry_t dax_lock_page(struct page *page)
/* Ensure page->mapping isn't freed while we look at it */
rcu_read_lock();
for (;;) {
-   struct address_space *mapping = READ_ONCE(page->mapping);
-
entry = NULL;
if (!mapping || !dax_mapping(mapping))
break;
@@ -413,11 +411,7 @@ dax_entry_t dax_lock_page(struct page *page)
 
xas.xa = >i_pages;
xas_lock_irq();
-   if (mapping != page->mapping) {
-   xas_unlock_irq();
-   continue;
-   }
-   xas_set(, page->index);
+   xas_set(, index);
entry = xas_load();
if (dax_is_locked(entry)) {
rcu_read_unlock();
@@ -433,10 +427,10 @@ dax_entry_t dax_lock_page(struct page *page)
return (dax_entry_t)entry;
 }
 
-void dax_unlock_page(struct page *page, dax_entry_t cookie)
+void dax_unlock(struct address_space *mapping, pgoff_t index,
+   dax_entry_t cookie)
 {
-   struct address_space *mapping = page->mapping;
-   XA_STATE(xas, >i_pages, page->index);
+   XA_STATE(xas, >i_pages, index);
 
if (S_ISCHR(mapping->host->i_mode))
return;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 6904d4e0b2e0..669ba768b89e 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -141,8 +141,9 @@ int dax_writeback_mapping_range(struct address_space 
*mapping,
struct dax_device *dax_dev, struct writeback_control *wbc);
 
 struct page *dax_layout_busy_page(struct address_space *mapping);
-dax_entry_t dax_lock_page(struct page *page);
-void dax_unlock_page(struct page *page, dax_entry_t cookie);
+dax_entry_t dax_lock(struct address_space *mapping, pgoff_t index);
+void dax_unlock(struct address_space *mapping, pgoff_t index,
+   dax_entry_t cookie);
 #else
 static inline bool bdev_dax_supported(struct block_device *bdev,
int blocksize)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3f0c36e1bf3d..d170b3f74d83 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1126,6 +1126,14 @@ static inline bool is_device_private_page(const struct 
page *page)
page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
 
+static inline bool is_device_fsdax_page(const struct page *page)
+{
+   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+   IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
+   is_zone_device_page(page) &&
+   page->pgmap->type == MEMORY_DEVICE_FS_DAX;
+}
+
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f1aa6433f404..0c4a25bf276f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -93,6 +93,9 @@ static int hwpoison_filter_dev(struct page *p)
if (PageSlab(p))
return -EINVAL;
 
+   if (is_device_fsdax_page(p))
+   return 0;
+
mapping = page_mapping(p);
if (mapping == NULL || mapping->host == NULL)
return -EINVAL;
@@ -263,9 +266,8 @@ void shake_page(struct page *p, int access)
 EXPORT_SYMBOL_GPL(shake_page);
 
 static unsigned long dev_pagemap_mapping_shift(struct page *page,
-   struct vm_area_struct *vma)
+   struct vm_area_struct *vma, unsigned long address)
 {
-   unsigned long address = vma_address(page, vma);
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
@@ -306,8 +308,8 @@ static unsigned long dev_pagemap_mapping_shift(struct page 
*page,
  * Uses 


Re: [PATCH v2 1/7] kernel/resource: make release_mem_region_adjustable() never fail

2020-09-15 Thread Wei Yang
On Tue, Sep 15, 2020 at 11:15:53AM +0200, David Hildenbrand wrote:
>On 15.09.20 11:06, Wei Yang wrote:
>> On Tue, Sep 15, 2020 at 09:35:30AM +0200, David Hildenbrand wrote:
>>>
> static int __ref try_remove_memory(int nid, u64 start, u64 size)
> {
>   int rc = 0;
> @@ -1777,7 +1757,7 @@ static int __ref try_remove_memory(int nid, u64 
> start, u64 size)
>   memblock_remove(start, size);
>   }
>
> - __release_memory_resource(start, size);
> + release_mem_region_adjustable(_resource, start, size);
>

 Seems the only user of release_mem_region_adjustable() is here, can we move
 iomem_resource into the function body? Actually, we don't iterate the 
 resource
 tree from any level. We always start from the root.
>>>
>>> You mean, making iomem_resource implicit? I can spot that something
>>> similar was done for
>>>
>>> #define devm_release_mem_region(dev, start, n) \
>>> __devm_release_region(dev, _resource, (start), (n))
>>>
>> 
>> What I prefer is to remove iomem_resource from the parameter list. Just use
>> it in the function body.
>> 
>> For the example you listed, __release_region() would take various values of
>> *parent*, so it looks reasonable to keep the parameter there.
>
>Yeah I got that ("making iomem_resource implicit"), as I said:
>

Thanks

>>> I'll send an addon patch for that, ok? - thanks.
>
>-- 
>Thanks,
>
>David / dhildenb

-- 
Wei Yang
Help you, Help me




Re: [PATCH v2 1/7] kernel/resource: make release_mem_region_adjustable() never fail

2020-09-15 Thread David Hildenbrand
On 15.09.20 11:06, Wei Yang wrote:
> On Tue, Sep 15, 2020 at 09:35:30AM +0200, David Hildenbrand wrote:
>>
 static int __ref try_remove_memory(int nid, u64 start, u64 size)
 {
int rc = 0;
 @@ -1777,7 +1757,7 @@ static int __ref try_remove_memory(int nid, u64 
 start, u64 size)
memblock_remove(start, size);
}

 -  __release_memory_resource(start, size);
 +  release_mem_region_adjustable(_resource, start, size);

>>>
>>> Seems the only user of release_mem_region_adjustable() is here, can we move
>>> iomem_resource into the function body? Actually, we don't iterate the 
>>> resource
>>> tree from any level. We always start from the root.
>>
>> You mean, making iomem_resource implicit? I can spot that something
>> similar was done for
>>
>> #define devm_release_mem_region(dev, start, n) \
>>  __devm_release_region(dev, _resource, (start), (n))
>>
> 
> What I prefer is to remove iomem_resource from the parameter list. Just use
> it in the function body.
> 
> For the example you listed, __release_region() would take various values of
> *parent*, so it looks reasonable to keep the parameter there.

Yeah I got that ("making iomem_resource implicit"), as I said:

>> I'll send an addon patch for that, ok? - thanks.

-- 
Thanks,

David / dhildenb


Re: [PATCH v2 1/7] kernel/resource: make release_mem_region_adjustable() never fail

2020-09-15 Thread Wei Yang
On Tue, Sep 15, 2020 at 09:35:30AM +0200, David Hildenbrand wrote:
>
>>> static int __ref try_remove_memory(int nid, u64 start, u64 size)
>>> {
>>> int rc = 0;
>>> @@ -1777,7 +1757,7 @@ static int __ref try_remove_memory(int nid, u64 
>>> start, u64 size)
>>> memblock_remove(start, size);
>>> }
>>>
>>> -   __release_memory_resource(start, size);
>>> +   release_mem_region_adjustable(_resource, start, size);
>>>
>> 
>> Seems the only user of release_mem_region_adjustable() is here, can we move
>> iomem_resource into the function body? Actually, we don't iterate the 
>> resource
>> tree from any level. We always start from the root.
>
>You mean, making iomem_resource implicit? I can spot that something
>similar was done for
>
>#define devm_release_mem_region(dev, start, n) \
>   __devm_release_region(dev, _resource, (start), (n))
>

What I prefer is to remove iomem_resource from the parameter list. Just use it
in the function body.

For the example you listed, __release_region() would take various values of
*parent*, so it looks reasonable to keep the parameter there.

>I'll send an addon patch for that, ok? - thanks.
>
>-- 
>Thanks,
>
>David / dhildenb

-- 
Wei Yang
Help you, Help me


Re: [PATCH 1/1] dax: Fix stack overflow when mounting fsdax pmem device

2020-09-15 Thread Jan Kara
On Tue 15-09-20 15:57:29, Adrian Huang wrote:
> From: Adrian Huang 
> 
> When mounting an fsdax pmem device, commit 6180bb446ab6 ("dax: fix
> detection of dax support for non-persistent memory block devices")
> introduced a stack overflow [1][2]. Here is the call path for
> mounting ext4 file system:
>   ext4_fill_super
> bdev_dax_supported
>   __bdev_dax_supported
> dax_supported
>   generic_fsdax_supported
> __generic_fsdax_supported
>   bdev_dax_supported
> 
> The call path leads to the infinite calling loop, so we cannot
> call bdev_dax_supported() in __generic_fsdax_supported(). The sanity
> checking of the variable 'dax_dev' is moved prior to the two
> bdev_dax_pgoff() checks [3][4].
> 
> To fix the issue triggered by lvm2-testsuite (the issue that the
> above-mentioned commit wants to fix), this patch does not print the
> "error: dax access failed" message if the physical disk does not
> support DAX (dax_dev is NULL). The details are described as follows:

Thanks for looking into this!

> 
>   1. The dax_dev of the dm devices (dm-0, dm-1..) is always allocated
>  in alloc_dev() [drivers/md/dm.c].
>   2. When calling __generic_fsdax_supported() with dm-0 device, the
>  call path is shown as follows (the physical disks of dm-0 do
>  not support DAX):
> dax_direct_access (valid dax_dev with dm-0)
>   dax_dev->ops->direct_access
> dm_dax_direct_access
>   ti->type->direct_access
> linear_dax_direct_access (assume the target is linear)
>   dax_direct_access (dax_dev is NULL with ram0, or sdaX)

I'm not sure how you can get __generic_fsdax_supported() called for dm-0?
Possibly because there's another dm device stacked on top of it and
dm_table_supports_dax() calls generic_fsdax_supported()? That actually
seems to be a bug in dm_table_supports_dax() (device_supports_dax() in
particular). I'd think it should be calling dax_supported() instead of
generic_fsdax_supported() so that proper device callback gets called when
determining whether a device supports DAX or not.
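Jan's point above — that a stacked device must be probed through its own ->dax_supported() callback rather than by calling the generic helper directly — can be illustrated with a small userspace sketch. The structs and callbacks here are simplified stand-ins, not the real kernel types, and dm_dax_supported() returning false is an assumed scenario (underlying disks without DAX).

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the dispatch bug: for a stacked (dm) device the
 * caller must go through the device's own ->dax_supported() op, or the
 * dm-specific table check is silently skipped and the generic answer
 * (which may be wrong for the stack) is used instead. */
struct dax_device;
struct dax_ops {
	bool (*dax_supported)(struct dax_device *dev);
};
struct dax_device {
	const struct dax_ops *ops;
	bool backing_supports_dax;
};

/* Generic path: what __generic_fsdax_supported() would decide. */
static bool generic_fsdax_supported(struct dax_device *dev)
{
	return dev->backing_supports_dax;
}

/* dm path: consults the whole table and may say no even when the
 * generic check would pass. */
static bool dm_dax_supported(struct dax_device *dev)
{
	(void)dev;
	return false; /* assumed: underlying disks lack DAX */
}

/* Correct entry point: dispatch through the per-device op. */
static bool dax_supported(struct dax_device *dev)
{
	if (!dev)
		return false;
	return dev->ops->dax_supported(dev);
}

static const struct dax_ops generic_ops = { generic_fsdax_supported };
static const struct dax_ops dm_ops = { dm_dax_supported };

bool check(bool use_dm)
{
	struct dax_device dev = {
		.ops = use_dm ? &dm_ops : &generic_ops,
		.backing_supports_dax = true,
	};
	return dax_supported(&dev);
}
```

With dispatch in place, the plain device reports support while the dm device correctly reports none, even though both look identical to the generic check.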

>   3. The call 'dax_direct_access()' in __generic_fsdax_supported() gets
>  the returned value '-EOPNOTSUPP'.

I don't think this should happen under any normal conditions after the
above bug is fixed. -EOPNOTSUPP is returned when dax_dev is NULL and that
should have been caught earlier... So at this point I don't think your
changes to printing errors after dax_direct_access() are needed.

Honza

>   4. However, the message 'dm-3: error: dax access failed (-5)' is still
>  printed for the dm target 'error' since io_err_dax_direct_access()
>  always returns the status '-EIO'. Cc' device mapper maintainers to
>  see if they have concerns.
> 
> [1] 
> https://lists.01.org/hyperkitty/list/linux-nvdimm@lists.01.org/thread/BULZHRILK7N2WS2JVISNF2QZNRQK6JU4/
> [2] 
> https://lists.01.org/hyperkitty/list/linux-nvdimm@lists.01.org/thread/OOZGFY3RNQGTGJJCH52YXCSYIDXMOPXO/
> [3] 
> https://lists.01.org/hyperkitty/list/linux-nvdimm@lists.01.org/message/SMQW2LY3QHPXOAW76RKNSCGG3QJFO7HT/
> [4] 
> https://lists.01.org/hyperkitty/list/linux-nvdimm@lists.01.org/message/7E2X6UGX5RQ2ISGYNAF66VLY5BKBFI4M/
> 
> Fixes: 6180bb446ab6 ("dax: fix detection of dax support for non-persistent 
> memory block devices")
> Cc: Coly Li 
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Ira Weiny 
> Cc: John Pittman 
> Cc: Mikulas Patocka 
> Cc: Alasdair Kergon 
> Cc: Mike Snitzer 
> Signed-off-by: Adrian Huang 
> ---
>  drivers/dax/super.c | 23 ---
>  1 file changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index e5767c83ea23..fb151417ec10 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -85,6 +85,12 @@ bool __generic_fsdax_supported(struct dax_device *dax_dev,
>   return false;
>   }
>  
> + if (!dax_dev) {
> + pr_debug("%s: error: dax unsupported by block device\n",
> + bdevname(bdev, buf));
> + return false;
> + }
> +
>   err = bdev_dax_pgoff(bdev, start, PAGE_SIZE, );
>   if (err) {
>   pr_info("%s: error: unaligned partition for dax\n",
> @@ -100,19 +106,22 @@ bool __generic_fsdax_supported(struct dax_device 
> *dax_dev,
>   return false;
>   }
>  
> - if (!dax_dev || !bdev_dax_supported(bdev, blocksize)) {
> - pr_debug("%s: error: dax unsupported by block device\n",
> - bdevname(bdev, buf));
> - return false;
> - }
> -
>   id = dax_read_lock();
>   len = dax_direct_access(dax_dev, pgoff, 1, , );
>   len2 = dax_direct_access(dax_dev, pgoff_end, 1, _kaddr, _pfn);
>  
>   if (len < 1 || len2 < 1) {
> - pr_info("%s: error: dax access failed (%ld)\n",
> +

Re: [PATCH v3 3/7] mm/memory_hotplug: prepare passing flags to add_memory() and friends

2020-09-15 Thread kernel test robot
Hi David,

I love your patch! Perhaps something to improve:

[auto build test WARNING on next-20200909]
[cannot apply to mmotm/master hnaz-linux-mm/master xen-tip/linux-next 
powerpc/next linus/master v5.9-rc4 v5.9-rc3 v5.9-rc2 v5.9-rc4]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/David-Hildenbrand/mm-memory_hotplug-selective-merging-of-system-ram-resources/20200910-171630
base:7204eaa2c1f509066486f488c9dcb065d7484494
:: branch date: 9 hours ago
:: commit date: 9 hours ago
config: powerpc-randconfig-r011-20200909 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project 
0a5dc7effb191eff740e0e7ae7bd8e1f6bdb3ad9)
reproduce (this is a W=1 build):
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# install powerpc cross compiling tool for clang build
# apt-get install binutils-powerpc-linux-gnu
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=powerpc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All warnings (new ones prefixed by >>):

   In file included from arch/powerpc/kernel/asm-offsets.c:14:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:15:
   In file included from include/linux/radix-tree.h:18:
   In file included from include/linux/xarray.h:14:
   In file included from include/linux/gfp.h:6:
   In file included from include/linux/mmzone.h:853:
   include/linux/memory_hotplug.h:354:55: error: unknown type name 'mhp_t'
   extern int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
 ^
   include/linux/memory_hotplug.h:355:53: error: unknown type name 'mhp_t'
   extern int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
   ^
   include/linux/memory_hotplug.h:357:11: error: unknown type name 'mhp_t'
  mhp_t mhp_flags);
  ^
   include/linux/memory_hotplug.h:360:10: error: unknown type name 'mhp_t'
mhp_t mhp_flags);
^
   In file included from arch/powerpc/kernel/asm-offsets.c:21:
>> include/linux/mman.h:137:9: warning: division by zero is undefined 
>> [-Wdivision-by-zero]
  _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED) |
  ^~~~
   include/linux/mman.h:115:21: note: expanded from macro '_calc_vm_trans'
  : ((x) & (bit1)) / ((bit1) / (bit2
   ^ ~
   include/linux/mman.h:138:9: warning: division by zero is undefined 
[-Wdivision-by-zero]
  _calc_vm_trans(flags, MAP_SYNC,   VM_SYNC  );
  ^~~~
   include/linux/mman.h:115:21: note: expanded from macro '_calc_vm_trans'
  : ((x) & (bit1)) / ((bit1) / (bit2
   ^ ~
   2 warnings and 4 errors generated.
   make[2]: *** [scripts/Makefile.build:117: arch/powerpc/kernel/asm-offsets.s] 
Error 1
   make[2]: Target '__build' not remade because of errors.
   make[1]: *** [Makefile:1196: prepare0] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:185: __sub-make] Error 2
   make: Target 'prepare' not remade because of errors.

# 
https://github.com/0day-ci/linux/commit/d88270d1c0783a7f99f24a85692be90fd2ae0d7d
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review 
David-Hildenbrand/mm-memory_hotplug-selective-merging-of-system-ram-resources/20200910-171630
git checkout d88270d1c0783a7f99f24a85692be90fd2ae0d7d
vim +137 include/linux/mman.h

^1da177e4c3f41 Linus Torvalds  2005-04-16  128  
^1da177e4c3f41 Linus Torvalds  2005-04-16  129  /*
^1da177e4c3f41 Linus Torvalds  2005-04-16  130   * Combine the mmap "flags" 
argument into "vm_flags" used internally.
^1da177e4c3f41 Linus Torvalds  2005-04-16  131   */
^1da177e4c3f41 Linus Torvalds  2005-04-16  132  static inline unsigned long
^1da177e4c3f41 Linus Torvalds  2005-04-16  133  calc_vm_flag_bits(unsigned long 
flags)
^1da177e4c3f41 Linus Torvalds  2005-04-16  134  {
^1da177e4c3f41 Linus Torvalds  2005-04-16  135  return 
_calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
^1da177e4c3f41 Linus Torvalds  2005-04-16  136 
_calc_vm_trans(flags, MAP_DENYWRITE,  VM_DENYWRITE ) |
b6fb293f2497a9 Jan Kara2017-11-01 @137 
_calc_vm_trans(flags, MAP_LOCKED, 

RE: [External] regression caused by patch 6180bb446ab624b9ab8bf201ed251ca87f07b413 ("dax: fix detection of dax support for non-persistent memory block devices")

2020-09-15 Thread Adrian Huang12
> -Original Message-
> 
> Hi Mikulas,
> 
> > -Original Message-
> > From: Mikulas Patocka 
> >
> > Hi
> >
> > The patch 6180bb446ab624b9ab8bf201ed251ca87f07b413 ("dax: fix
> > detection of dax support for non-persistent memory block devices")
> > causes crash when attempting to mount the ext4 filesystem on
> > /dev/pmem0
> > ("mkfs.ext4 /dev/pmem0; mount -t ext4 /dev/pmem0 /mnt/test"). The
> > device
> > /dev/pmem0 is emulated using the "memmap" kernel parameter.
> >
> 
> Could you please test the following patch? Thanks.
> https://lists.01.org/hyperkitty/list/linux-
> nvd...@lists.01.org/thread/2JDBSE2WK75LSGCFEOY3RXRN3CNLBPB2/
> 

BTW, I have verified this patch 
(https://lists.01.org/hyperkitty/list/linux-nvdimm@lists.01.org/thread/2JDBSE2WK75LSGCFEOY3RXRN3CNLBPB2/)
 on local box with the following tests:
1. mount fsdax pmem device
2. Run lvm2-testsuite


RE: [External] regression caused by patch 6180bb446ab624b9ab8bf201ed251ca87f07b413 ("dax: fix detection of dax support for non-persistent memory block devices")

2020-09-15 Thread Adrian Huang12
Hi Mikulas,

> -Original Message-
> From: Mikulas Patocka 
> Sent: Monday, September 14, 2020 11:49 PM
> To: Coly Li ; Dan Williams ; Dave
> Jiang 
> Cc: Jan Kara ; Vishal Verma ;
> Adrian Huang12 ; Ira Weiny ;
> Mike Snitzer ; Pankaj Gupta
> ; linux-nvdimm@lists.01.org
> Subject: [External] regression caused by patch
> 6180bb446ab624b9ab8bf201ed251ca87f07b413 ("dax: fix detection of dax
> support for non-persistent memory block devices")
> 
> Hi
> 
> The patch 6180bb446ab624b9ab8bf201ed251ca87f07b413 ("dax: fix
> detection of dax support for non-persistent memory block devices") causes
> crash when attempting to mount the ext4 filesystem on /dev/pmem0
> ("mkfs.ext4 /dev/pmem0; mount -t ext4 /dev/pmem0 /mnt/test"). The device
> /dev/pmem0 is emulated using the "memmap" kernel parameter.
> 

Could you please test the following patch? Thanks.
https://lists.01.org/hyperkitty/list/linux-nvdimm@lists.01.org/thread/2JDBSE2WK75LSGCFEOY3RXRN3CNLBPB2/

 > The patch causes infinite recursion and double-fault exception:
> 
> __generic_fsdax_supported
> bdev_dax_supported
> __bdev_dax_supported
> dax_supported
> dax_dev->ops->dax_supported
> generic_fsdax_supported
> __generic_fsdax_supported
> 
> Mikulas
> 
> 
> 
> [   17.500619] traps: PANIC: double fault, error_code: 0x0
> [   17.500619] double fault:  [#1] PREEMPT SMP
> [   17.500620] CPU: 0 PID: 1326 Comm: mount Not tainted 5.9.0-rc1-bisect #10
> [   17.500620] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [   17.500621] RIP: 0010:__generic_fsdax_supported+0x6a/0x500
> [   17.500622] Code: ff ff ff ff ff 7f 00 48 21 f3 48 01 c3 48 c1 e3 09 f6 c7 
> 0e 0f
> 85 fa 01 00 00 48 85 ff 49 89 fd 74 11 be 00 10 00 00 4c 89 e7  b1 fe ff 
> ff
> 84 c0 75 11 31 c0 48 83 c4 48 5b 5d 41 5c 41 5d 41
> [   17.500623] RSP: 0018:88940b4fdff8 EFLAGS: 00010286
> [   17.500624] RAX:  RBX: 0007f000 RCX:
> 
> [   17.500625] RDX: 1000 RSI: 1000 RDI:
> 88940b34c300
> [   17.500625] RBP:  R08: 0400 R09:
> 8080808080808080
> [   17.500626] R10:  R11: fefefefefefefeff R12:
> 88940b34c300
> [   17.500626] R13: 88940b3dc000 R14: 88940badd000 R15:
> 0001
> [   17.500627] FS:  f7c25780() GS:88940fa0()
> knlGS:
> [   17.500628] CS:  0010 DS: 002b ES: 002b CR0: 80050033
> [   17.500628] CR2: 88940b4fdfe8 CR3: 00140bd15000 CR4:
> 06b0
> [   17.500628] Call Trace:
> [   17.500629] Modules linked in: uvesafb cfbfillrect cfbimgblt cn 
> cfbcopyarea fb
> fbdev ipv6 tun autofs4 binfmt_misc configfs af_packet virtio_rng rng_core
> mousedev evdev pcspkr virtio_balloon button raid10 raid456 async_raid6_recov
> async_memcpy async_pq raid6_pq async_xor xor async_tx libcrc32c raid1 raid0
> md_mod sd_mod t10_pi virtio_scsi virtio_net net_failover psmouse scsi_mod
> failover
> [   17.500638] ---[ end trace 3c877fcb5b865459 ]---
> [   17.500638] RIP: 0010:__generic_fsdax_supported+0x6a/0x500
> [   17.500639] Code: ff ff ff ff ff 7f 00 48 21 f3 48 01 c3 48 c1 e3 09 f6 c7 
> 0e 0f
> 85 fa 01 00 00 48 85 ff 49 89 fd 74 11 be 00 10 00 00 4c 89 e7  b1 fe ff 
> ff
> 84 c0 75 11 31 c0 48 83 c4 48 5b 5d 41 5c 41 5d 41
> [   17.500640] RSP: 0018:88940b4fdff8 EFLAGS: 00010286
> [   17.500641] RAX:  RBX: 0007f000 RCX:
> 
> [   17.500641] RDX: 1000 RSI: 1000 RDI:
> 88940b34c300
> [   17.500642] RBP:  R08: 0400 R09:
> 8080808080808080
> [   17.500642] R10:  R11: fefefefefefefeff R12:
> 88940b34c300
> [   17.500643] R13: 88940b3dc000 R14: 88940badd000 R15:
> 0001
> [   17.500643] FS:  f7c25780() GS:88940fa0()
> knlGS:
> [   17.500644] CS:  0010 DS: 002b ES: 002b CR0: 80050033
> [   17.500644] CR2: 88940b4fdfe8 CR3: 00140bd15000 CR4:
> 06b0
> [   17.500645] Kernel panic - not syncing: Fatal exception in interrupt
> [   17.500941] Kernel Offset: disabled
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: Re: regression caused by patch 6180bb446ab624b9ab8bf201ed251ca87f07b413 ("dax: fix detection of dax support for non-persistent memory block devices")

2020-09-15 Thread Jan Kara
Hi!

On Tue 15-09-20 11:03:29, col...@suse.de wrote:
> Could you please take a look? I am offline for the next two weeks.

I just had a look into this. IMHO the justification in 6180bb446a "dax: fix
detection of dax support for non-persistent memory block devices" is just
bogus and people got confused by the previous condition

if (!dax_dev && !bdev_dax_supported(bdev, blocksize))

which was bogus as well. bdev_dax_supported() always returns false for bdev
that doesn't have dax_dev (naturally so). So in the original condition
there was no point in calling bdev_dax_supported() if we know dax_dev is
NULL.

Then this was changed to:

if (!dax_dev || !bdev_dax_supported(bdev, blocksize))

which looks more sensible at the first sight. But only at the first sight -
if you look at wider context, __generic_fsdax_supported() is the bulk of
code that decides whether a device supports DAX so calling
bdev_dax_supported() from it indeed doesn't look as such a great idea. So
IMO the condition should be just:

if (!dax_dev)

I'll send a fix for this.
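The difference between the three conditions can be made concrete with a toy model. This is not kernel code: supported() stands in for bdev_dax_supported() and merely counts calls, so the variant that would recurse in the real kernel just shows an extra call here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the three conditions discussed above.  In the real
 * kernel, supported() (bdev_dax_supported) re-enters the caller via
 * generic_fsdax_supported(); here it only counts invocations so the
 * difference between the variants is visible without recursing. */
static int support_calls;

static bool supported(bool have_dax)
{
	support_calls++;
	return have_dax; /* always false when dax_dev is NULL */
}

/* Original: && short-circuits, so supported() is only reachable when
 * have_dax is already false -- a pointless call. */
static bool bail_v1(bool have_dax)
{
	return !have_dax && !supported(have_dax);
}

/* 6180bb446a: || calls supported() whenever have_dax is true; in the
 * real kernel this call re-enters the caller and overflows the stack. */
static bool bail_v2(bool have_dax)
{
	return !have_dax || !supported(have_dax);
}

/* Proposed fix: just test the pointer. */
static bool bail_v3(bool have_dax)
{
	return !have_dax;
}

int calls_when(bool (*bail)(bool), bool have_dax)
{
	support_calls = 0;
	bail(have_dax);
	return support_calls;
}
```

Counting calls per scenario shows v1 only invokes the check when it is already known to be useless, v2 invokes it exactly in the case that recurses, and v3 never invokes it at all.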

Also there's the process question how this patch could get to Linus when
any attempt to use DAX would immediately kill the machine like Mikulas
spotted. This shows the that patch was untested with DAX by anybody on the
path from the developer to Linus...

Honza

>  Original Message 
> From: Mikulas Patocka 
> Date: Monday, September 14, 2020, 11:48 PM
> To: Coly Li , Dan Williams ,
> Dave Jiang 
> Cc: Jan Kara , Vishal Verma ,
> Adrian Huang , Ira Weiny , Mike
> Snitzer , Pankaj Gupta ,
> linux-nvdimm@lists.01.org
> Subject: regression caused by patch 6180bb446ab624b9ab8bf201ed251ca87f07b413
> ("dax: fix detection of dax support for non-persistent memory block
> devices")
> 
> Hi
> 
> The patch 6180bb446ab624b9ab8bf201ed251ca87f07b413 ("dax: fix detection of
> dax support for non-persistent memory block devices") causes crash when
> attempting to mount the ext4 filesystem on /dev/pmem0 ("mkfs.ext4
> /dev/pmem0; mount -t ext4 /dev/pmem0 /mnt/test"). The device /dev/pmem0 is
> emulated using the "memmap" kernel parameter.
> 
> The patch causes infinite recursion and double-fault exception:
> 
> __generic_fsdax_supported
> bdev_dax_supported
> __bdev_dax_supported
> dax_supported
> dax_dev->ops->dax_supported
> generic_fsdax_supported
> __generic_fsdax_supported
> 
> Mikulas
> 
> 
> 
> [   17.500619] traps: PANIC: double fault, error_code: 0x0
> [   17.500619] double fault:  [#1] PREEMPT SMP
> [   17.500620] CPU: 0 PID: 1326 Comm: mount Not tainted 5.9.0-rc1-bisect #
> 10
> [   17.500620] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [   17.500621] RIP: 0010:__generic_fsdax_supported+0x6a/0x500
> [   17.500622] Code: ff ff ff ff ff 7f 00 48 21 f3 48 01 c3 48 c1 e3 09 f6
> c7 0e 0f 85 fa 01 00 00 48 85 ff 49 89 fd 74 11 be 00 10 00 00 4c 89 e7
>  b1 fe ff ff 84 c0 75 11 31 c0 48 83 c4 48 5b 5d 41 5c 41 5d 41
> [   17.500623] RSP: 0018:88940b4fdff8 EFLAGS: 00010286
> [   17.500624] RAX:  RBX: 0007f000 RCX:
> 
> [   17.500625] RDX: 1000 RSI: 1000 RDI:
> 88940b34c300
> [   17.500625] RBP:  R08: 0400 R09:
> 8080808080808080
> [   17.500626] R10:  R11: fefefefefefefeff R12:
> 88940b34c300
> [   17.500626] R13: 88940b3dc000 R14: 88940badd000 R15:
> 0001
> [   17.500627] FS:  f7c25780() GS:88940fa0()
> knlGS:
> [   17.500628] CS:  0010 DS: 002b ES: 002b CR0: 80050033
> [   17.500628] CR2: 88940b4fdfe8 CR3: 00140bd15000 CR4:
> 06b0
> [   17.500628] Call Trace:
> [   17.500629] Modules linked in: uvesafb cfbfillrect cfbimgblt cn
> cfbcopyarea fb fbdev ipv6 tun autofs4 binfmt_misc configfs af_packet
> virtio_rng rng_core mousedev evdev pcspkr virtio_balloon button raid10
> raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor
> async_tx libcrc32c raid1 raid0 md_mod sd_mod t10_pi virtio_scsi virtio_net
> net_failover psmouse scsi_mod failover
> [   17.500638] ---[ end trace 3c877fcb5b865459 ]---
> [   17.500638] RIP: 0010:__generic_fsdax_supported+0x6a/0x500
> [   17.500639] Code: ff ff ff ff ff 7f 00 48 21 f3 48 01 c3 48 c1 e3 09 f6
> c7 0e 0f 85 fa 01 00 00 48 85 ff 49 89 fd 74 11 be 00 10 00 00 4c 89 e7
>  b1 fe ff ff 84 c0 75 11 31 c0 48 83 c4 48 5b 5d 41 5c 41 5d 41
> [   17.500640] RSP: 0018:88940b4fdff8 EFLAGS: 00010286
> [   17.500641] RAX:  RBX: 0007f000 RCX:
> 
> [   17.500641] RDX: 1000 RSI: 1000 RDI:
> 88940b34c300
> [   17.500642] RBP:  R08: 0400 R09:
> 8080808080808080

[PATCH 1/1] dax: Fix stack overflow when mounting fsdax pmem device

2020-09-15 Thread Adrian Huang
From: Adrian Huang 

When mounting an fsdax pmem device, commit 6180bb446ab6 ("dax: fix
detection of dax support for non-persistent memory block devices")
introduced a stack overflow [1][2]. Here is the call path for
mounting ext4 file system:
  ext4_fill_super
bdev_dax_supported
  __bdev_dax_supported
dax_supported
  generic_fsdax_supported
__generic_fsdax_supported
  bdev_dax_supported

The call path leads to the infinite calling loop, so we cannot
call bdev_dax_supported() in __generic_fsdax_supported(). The sanity
checking of the variable 'dax_dev' is moved prior to the two
bdev_dax_pgoff() checks [3][4].

To fix the issue triggered by lvm2-testsuite (the issue that the
above-mentioned commit wants to fix), this patch does not print the
"error: dax access failed" message if the physical disk does not
support DAX (dax_dev is NULL). The details are described as follows:

  1. The dax_dev of the dm devices (dm-0, dm-1..) is always allocated
 in alloc_dev() [drivers/md/dm.c].
  2. When calling __generic_fsdax_supported() with dm-0 device, the
 call path is shown as follows (the physical disks of dm-0 do
 not support DAX):
dax_direct_access (valid dax_dev with dm-0)
  dax_dev->ops->direct_access
dm_dax_direct_access
  ti->type->direct_access
linear_dax_direct_access (assume the target is linear)
  dax_direct_access (dax_dev is NULL with ram0, or sdaX)
  3. The call 'dax_direct_access()' in __generic_fsdax_supported() gets
 the returned value '-EOPNOTSUPP'.
  4. However, the message 'dm-3: error: dax access failed (-5)' is still
 printed for the dm target 'error' since io_err_dax_direct_access()
 always returns the status '-EIO'. Cc' device mapper maintainers to
 see if they have concerns.

[1] 
https://lists.01.org/hyperkitty/list/linux-nvdimm@lists.01.org/thread/BULZHRILK7N2WS2JVISNF2QZNRQK6JU4/
[2] 
https://lists.01.org/hyperkitty/list/linux-nvdimm@lists.01.org/thread/OOZGFY3RNQGTGJJCH52YXCSYIDXMOPXO/
[3] 
https://lists.01.org/hyperkitty/list/linux-nvdimm@lists.01.org/message/SMQW2LY3QHPXOAW76RKNSCGG3QJFO7HT/
[4] 
https://lists.01.org/hyperkitty/list/linux-nvdimm@lists.01.org/message/7E2X6UGX5RQ2ISGYNAF66VLY5BKBFI4M/

Fixes: 6180bb446ab6 ("dax: fix detection of dax support for non-persistent 
memory block devices")
Cc: Coly Li 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Ira Weiny 
Cc: John Pittman 
Cc: Mikulas Patocka 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Signed-off-by: Adrian Huang 
---
 drivers/dax/super.c | 23 ---
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index e5767c83ea23..fb151417ec10 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -85,6 +85,12 @@ bool __generic_fsdax_supported(struct dax_device *dax_dev,
return false;
}
 
+   if (!dax_dev) {
+   pr_debug("%s: error: dax unsupported by block device\n",
+   bdevname(bdev, buf));
+   return false;
+   }
+
err = bdev_dax_pgoff(bdev, start, PAGE_SIZE, );
if (err) {
pr_info("%s: error: unaligned partition for dax\n",
@@ -100,19 +106,22 @@ bool __generic_fsdax_supported(struct dax_device *dax_dev,
return false;
}
 
-   if (!dax_dev || !bdev_dax_supported(bdev, blocksize)) {
-   pr_debug("%s: error: dax unsupported by block device\n",
-   bdevname(bdev, buf));
-   return false;
-   }
-
id = dax_read_lock();
len = dax_direct_access(dax_dev, pgoff, 1, , );
len2 = dax_direct_access(dax_dev, pgoff_end, 1, _kaddr, _pfn);
 
if (len < 1 || len2 < 1) {
-   pr_info("%s: error: dax access failed (%ld)\n",
+   /*
+* Only print the real error message: do not need to print
+* the message for the underlying raw disk (physical disk)
+* that does not support DAX (dax_dev = NULL). This case
+* is observed when physical disks are configured by
+* lvm2 (device mapper).
+*/
+   if (len != -EOPNOTSUPP && len2 != -EOPNOTSUPP) {
+   pr_info("%s: error: dax access failed (%ld)\n",
bdevname(bdev, buf), len < 1 ? len : len2);
+   }
dax_read_unlock(id);
return false;
}
-- 
2.17.1
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 2/7] kernel/resource: move and rename IORESOURCE_MEM_DRIVER_MANAGED

2020-09-15 Thread David Hildenbrand
On 15.09.20 04:20, Wei Yang wrote:
> On Tue, Sep 08, 2020 at 10:10:07PM +0200, David Hildenbrand wrote:
>> IORESOURCE_MEM_DRIVER_MANAGED currently uses an unused PnP bit, which is
>> always set to 0 by hardware. This is far from beautiful (and confusing),
>> and the bit only applies to SYSRAM. So let's move it out of the
>> bus-specific (PnP) defined bits.
>>
>> We'll add another SYSRAM specific bit soon. If we ever need more bits for
>> other purposes, we can steal some from "desc", or reshuffle/regroup what we
>> have.
> 
> I think you make this definition because we use IORESOURCE_SYSRAM_RAM for
> hotpluged memory? So we make them all in IORESOURCE_SYSRAM_XXX family?

Yeah, to specify it based on the extended MEM type SYSRAM, because it
really only applies to that.

-- 
Thanks,

David / dhildenb


Re: [PATCH v2 1/7] kernel/resource: make release_mem_region_adjustable() never fail

2020-09-15 Thread David Hildenbrand


>> static int __ref try_remove_memory(int nid, u64 start, u64 size)
>> {
>>  int rc = 0;
>> @@ -1777,7 +1757,7 @@ static int __ref try_remove_memory(int nid, u64 start, 
>> u64 size)
>>  memblock_remove(start, size);
>>  }
>>
>> -	__release_memory_resource(start, size);
>> +	release_mem_region_adjustable(&iomem_resource, start, size);
>>
> 
> Seems the only user of release_mem_region_adjustable() is here, can we move
> iomem_resource into the function body? Actually, we don't iterate the resource
> tree from any level. We always start from the root.

You mean, making iomem_resource implicit? I can spot that something
similar was done for

#define devm_release_mem_region(dev, start, n) \
__devm_release_region(dev, &iomem_resource, (start), (n))

I'll send an addon patch for that, ok? - thanks.

-- 
Thanks,

David / dhildenb


Re: [PATCH v4 5/8] mm/memory_hotplug: MEMHP_MERGE_RESOURCE to specify merging of System RAM resources

2020-09-15 Thread David Hildenbrand
On 15.09.20 04:43, Wei Yang wrote:
> On Fri, Sep 11, 2020 at 12:34:56PM +0200, David Hildenbrand wrote:
>> Some add_memory*() users add memory in small, contiguous memory blocks.
>> Examples include virtio-mem, hyper-v balloon, and the XEN balloon.
>>
>> This can quickly result in a lot of memory resources, whereby the actual
>> resource boundaries are not of interest (e.g., it might be relevant for
>> DIMMs, exposed via /proc/iomem to user space). We really want to merge
>> added resources in this scenario where possible.
>>
>> Let's provide a flag (MEMHP_MERGE_RESOURCE) to specify that a resource
>> either created within add_memory*() or passed via add_memory_resource()
>> shall be marked mergeable and merged with applicable siblings.
>>
>> To implement that, we need a kernel/resource interface to mark selected
>> System RAM resources mergeable (IORESOURCE_SYSRAM_MERGEABLE) and trigger
>> merging.
>>
>> Note: We really want to merge after the whole operation succeeded, not
>> directly when adding a resource to the resource tree (it would break
>> add_memory_resource() and require splitting resources again when the
>> operation failed - e.g., due to -ENOMEM).
> 
> Oops, the latest version is here.

Yeah, sorry, I dropped the "mm" prefix on the subject of the cover
letter by mistake.

> 
> BTW, I don't see patch 4. Not sure it is junked by my mail system?

At least you're in the CC list below with your old mail address (I
assume you monitor that).

I'll try to use your new address in the future.


-- 
Thanks,

David / dhildenb


[PATCH] libnvdimm: Fix dereference of pointer ndns before it is null checked

2020-09-15 Thread Jing Xiangfeng
In the current code, the pointer ndns is dereferenced when initializing
the pointer parent_uuid, before ndns is null checked. This could lead to
a null pointer dereference. Fix this by dereferencing ndns only after it
has been sanity checked for NULL.

Signed-off-by: Jing Xiangfeng 
---
 drivers/nvdimm/pfn_devs.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 3e11ef8d3f5b..c443994f81f3 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -452,7 +452,7 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
unsigned long align, start_pad;
struct nd_pfn_sb *pfn_sb = nd_pfn->pfn_sb;
struct nd_namespace_common *ndns = nd_pfn->ndns;
-	const u8 *parent_uuid = nd_dev_to_uuid(&ndns->dev);
+   const u8 *parent_uuid;
 
if (!pfn_sb || !ndns)
return -ENODEV;
@@ -472,6 +472,7 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
return -ENODEV;
pfn_sb->checksum = cpu_to_le64(checksum);
 
+	parent_uuid = nd_dev_to_uuid(&ndns->dev);
if (memcmp(pfn_sb->parent_uuid, parent_uuid, 16) != 0)
return -ENODEV;
 
-- 
2.17.1