Re: [PATCH v3 05/11] mm, fsdax: Refactor memory-failure handler for dax mapping

2021-02-18 Thread Ruan Shiyang



On 2021/2/18 下午4:32, Christoph Hellwig wrote:

On Wed, Feb 17, 2021 at 10:56:11AM +0800, Ruan Shiyang wrote:

I'd like to confirm one thing...  I have checked all of this patchset with
checkpatch.pl and it did not report any overly-long-line warning.  So, should
I still obey the rule of 80 characters per line?


checkpatch.pl is completely broken, I would not rely on it.

Here is the quote from the coding style document:

"The preferred limit on the length of a single line is 80 columns.

Statements longer than 80 columns should be broken into sensible chunks,
unless exceeding 80 columns significantly increases readability and does
not hide information."



OK.  Got it.  Thank you.


--
Ruan Shiyang.






Re: [PATCH 5/7] fsdax: Dedup file range to use a compare function

2021-02-16 Thread Ruan Shiyang



On 2021/2/10 下午9:19, Christoph Hellwig wrote:

On Tue, Feb 09, 2021 at 05:46:13PM +0800, Ruan Shiyang wrote:



On 2021/2/9 下午5:34, Christoph Hellwig wrote:

On Tue, Feb 09, 2021 at 05:15:13PM +0800, Ruan Shiyang wrote:

The dax dedupe comparison needs the iomap_ops pointer as an argument, so my
understanding is that we don't modify the argument list of
generic_remap_file_range_prep(), but move its code into
__generic_remap_file_range_prep(), whose argument list can be modified to
accept the iomap_ops pointer.  Then it looks like this:


I'd say just add the iomap_ops pointer to
generic_remap_file_range_prep and do away with the extra wrappers.  We
only have three callers anyway.


OK.


So looking at this again I think your proposal actually is better,
given that the iomap variant is still DAX specific.  Sorry for
the noise.

Also I think dax_file_range_compare should use iomap_apply instead
of open coding it.



There are two files, which are not reflinked, that need direct_access()
here.  iomap_apply() can handle only one file at a time.  So, it seems
that iomap_apply() is not suitable for this case...



The pseudo code of this process is as follows:

  srclen = ops->begin()
  destlen = ops->begin()

  direct_access(src, &saddr)
  direct_access(dest, &daddr)

  same = !memcmp(saddr, daddr, min(srclen, destlen))

  ops->end()
  ops->end()

I think a nested call like this is necessary.  That's why I used the
open-coded way.
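To make the nesting concrete, here is a structural sketch (illustration only,
not the actual dax_file_range_compare() from the patch: the direct_access()
and memcmp() step is elided, and the iomap_begin/iomap_end signatures are the
v5.10 ones):

```c
#include <linux/fs.h>
#include <linux/iomap.h>

/* Sketch only: shows why the two files' begin/end pairs must nest,
 * which a single iomap_apply() call on one file cannot express. */
static int dax_range_compare_sketch(struct inode *src, loff_t srcoff,
				    struct inode *dst, loff_t dstoff,
				    loff_t len, const struct iomap_ops *ops,
				    bool *same)
{
	struct iomap s_iomap = { 0 }, s_srcmap = { 0 };
	struct iomap d_iomap = { 0 }, d_srcmap = { 0 };
	int ret;

	ret = ops->iomap_begin(src, srcoff, len, 0, &s_iomap, &s_srcmap);
	if (ret)
		return ret;
	ret = ops->iomap_begin(dst, dstoff, len, 0, &d_iomap, &d_srcmap);
	if (ret)
		goto out_src;

	/*
	 * The real code does direct_access() on both mappings here and
	 * memcmp()s min(s_iomap.length, d_iomap.length) bytes; elided.
	 */
	*same = false;

	if (ops->iomap_end)
		ops->iomap_end(dst, dstoff, len, 0, 0, &d_iomap);
out_src:
	if (ops->iomap_end)
		ops->iomap_end(src, srcoff, len, 0, 0, &s_iomap);
	return ret;
}
```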



--
Thanks,
Ruan Shiyang.






Re: [PATCH 4/7] fsdax: Replace mmap entry in case of CoW

2021-02-16 Thread Ruan Shiyang



On 2021/2/16 下午9:11, David Sterba wrote:

On Mon, Feb 08, 2021 at 01:09:21AM +0800, Shiyang Ruan wrote:

We replace the existing entry with the newly allocated one
in case of CoW.  Also, we mark the entry as PAGECACHE_TAG_TOWRITE
so writeback marks this entry as writeprotected.  This
helps us with snapshots, so new write page faults after a snapshot
trigger a CoW.

Signed-off-by: Goldwyn Rodrigues 
Signed-off-by: Shiyang Ruan 
---
  fs/dax.c | 31 +++
  1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b2195cbdf2dc..29698a3d2e37 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -722,6 +722,9 @@ static int copy_cow_page_dax(struct block_device *bdev, 
struct dax_device *dax_d
return 0;
  }
  
+#define DAX_IF_DIRTY		(1ULL << 0)

+#define DAX_IF_COW (1ULL << 1)


The constants are ULL, but I see other flags only 'unsigned long'


+
  /*
   * By this point grab_mapping_entry() has ensured that we have a locked entry
   * of the appropriate size so we don't have to worry about downgrading PMDs to
@@ -731,14 +734,16 @@ static int copy_cow_page_dax(struct block_device *bdev, 
struct dax_device *dax_d
   */
  static void *dax_insert_entry(struct xa_state *xas,
struct address_space *mapping, struct vm_fault *vmf,
-   void *entry, pfn_t pfn, unsigned long flags, bool dirty)
+   void *entry, pfn_t pfn, unsigned long flags, bool insert_flags)


insert_flags is bool


  {
void *new_entry = dax_make_entry(pfn, flags);
+   bool dirty = insert_flags & DAX_IF_DIRTY;


"insert_flags & DAX_IF_DIRTY" is "bool & ULL", this can't be right


This is a mistake caused by rebasing my old patchset version.  Thanks
for pointing it out.  I'll fix this in the next version.



+   bool cow = insert_flags & DAX_IF_COW;


Same

  
  	if (dirty)

__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
  
-	if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {

+   if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
unsigned long index = xas->xa_index;
/* we are replacing a zero page with block mapping */
if (dax_is_pmd_entry(entry))
@@ -750,7 +755,7 @@ static void *dax_insert_entry(struct xa_state *xas,
  
  	xas_reset(xas);

xas_lock_irq(xas);
-   if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+   if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
void *old;
  
  		dax_disassociate_entry(entry, mapping, false);

@@ -774,6 +779,9 @@ static void *dax_insert_entry(struct xa_state *xas,
if (dirty)
xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
  
+	if (cow)

+   xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
+
xas_unlock_irq(xas);
return entry;
  }
@@ -1319,6 +1327,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault 
*vmf, pfn_t *pfnp,
void *entry;
pfn_t pfn;
void *kaddr;
+   unsigned long insert_flags = 0;
  
  	trace_dax_pte_fault(inode, vmf, ret);

/*
@@ -1444,8 +1453,10 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault 
*vmf, pfn_t *pfnp,
  
  		goto finish_iomap;

case IOMAP_UNWRITTEN:
-   if (write && iomap.flags & IOMAP_F_SHARED)
+   if (write && (iomap.flags & IOMAP_F_SHARED)) {
+   insert_flags |= DAX_IF_COW;


Here's an example of 'unsigned long = unsigned long long', though it'll
work, it would be better to unify all the types.


Yes, I'll fix it.


--
Thanks,
Ruan Shiyang.



goto cow;
+   }
fallthrough;
case IOMAP_HOLE:
if (!write) {
@@ -1555,6 +1566,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault 
*vmf, pfn_t *pfnp,
int error;
pfn_t pfn;
void *kaddr;
+   unsigned long insert_flags = 0;
  
  	/*

 * Check whether offset isn't beyond end of file now. Caller is
@@ -1670,14 +1682,17 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault 
*vmf, pfn_t *pfnp,
result = vmf_insert_pfn_pmd(vmf, pfn, write);
break;
case IOMAP_UNWRITTEN:
-   if (write && iomap.flags & IOMAP_F_SHARED)
+   if (write && (iomap.flags & IOMAP_F_SHARED)) {
+   insert_flags |= DAX_IF_COW;
goto cow;
+   }
fallthrough;
case IOMAP_HOLE:
-   if (WARN_ON_ONCE(write))
+   if (!write) {
+   result = dax_pmd_load_hole(&xas, vmf, &iomap, &entry);
break;
-   result = dax_pmd_load_hole(&xas, vmf, &iomap, &entry);
-   break;
+   }
+   fallthrough;
default:
WARN_ON_ONCE(1);
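For what it's worth, a minimal sketch of the type cleanup David Sterba asks
for above (assuming nothing beyond the quoted hunks): the DAX_IF_* constants
and insert_flags all become unsigned long, matching the other dax flags,
while dirty/cow stay local bools.

```c
/* Keep the flag constants and the flags argument the same width. */
#define DAX_IF_DIRTY		(1UL << 0)
#define DAX_IF_COW		(1UL << 1)

static void *dax_insert_entry(struct xa_state *xas,
		struct address_space *mapping, struct vm_fault *vmf,
		void *entry, pfn_t pfn, unsigned long flags,
		unsigned long insert_flags);	/* was: bool insert_flags */
```

The callers then keep `unsigned long insert_flags = 0;` and the
`insert_flags |= DAX_IF_COW;` assignments stay type-consistent.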
 

Re: [PATCH v3 05/11] mm, fsdax: Refactor memory-failure handler for dax mapping

2021-02-16 Thread Ruan Shiyang



On 2021/2/10 下午9:33, Christoph Hellwig wrote:

+extern int mf_dax_mapping_kill_procs(struct address_space *mapping, pgoff_t 
index, int flags);


No need for the extern; please avoid the overly long line.


OK.

I'd like to confirm one thing...  I have checked all of this patchset with
checkpatch.pl and it did not report any overly-long-line warning.  So, should
I still obey the rule of 80 characters per line?





@@ -120,6 +121,13 @@ static int hwpoison_filter_dev(struct page *p)
if (PageSlab(p))
return -EINVAL;
  
+	if (pfn_valid(page_to_pfn(p))) {

+   if (is_device_fsdax_page(p))
+   return 0;
+   else
+   return -EINVAL;
+   }
+


This looks odd.  For one there is no need for an else after a return.
But more importantly page_mapping() as called below pretty much assumes
a valid PFN, so something is fishy in this function.


Yes, a mistake here.  I'll fix it.




+   if (is_zone_device_page(p)) {
+   if (is_device_fsdax_page(p))
+   tk->addr = vma->vm_start +
+   ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);


The arithmetics here scream for a common helper, I suspect there might
be other places that could use the same helper.
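A possible shape for that common helper (the name is an assumption; the
arithmetic is exactly what the quoted hunk does):

```c
/* Translate a file pgoff to a user address within the given VMA. */
static unsigned long dax_pgoff_to_vma_addr(struct vm_area_struct *vma,
					   pgoff_t pgoff)
{
	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
}
```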


+int mf_dax_mapping_kill_procs(struct address_space *mapping, pgoff_t index, 
int flags)


Overly long line.  Also the naming scheme with the mf_ prefix seems rather
unusual.  Maybe dax_kill_mapping_procs?  Also please add a kerneldoc
comment describing the function, given that it is exported.



OK.  Thanks for your guidance.
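For the kerneldoc request, something along these lines would do (the wording
is only a suggestion based on this discussion, not text from the patch; the
same comment applies if the function is renamed to dax_kill_mapping_procs()):

```c
/**
 * mf_dax_mapping_kill_procs - kill processes that map a poisoned fsdax page
 * @mapping:	address_space of one file that maps the poisoned page
 * @index:	offset of the poisoned page within that file
 * @flags:	memory-failure flags (MF_*)
 *
 * Called once per (file, offset) pair that the filesystem's reverse
 * mapping reports for the bad page.
 *
 * Return: 0 on success, negative errno on failure.
 */
int mf_dax_mapping_kill_procs(struct address_space *mapping, pgoff_t index,
			      int flags);
```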


--
Thanks,
Ruan Shiyang.







Re: [PATCH 5/7] fsdax: Dedup file range to use a compare function

2021-02-09 Thread Ruan Shiyang



On 2021/2/9 下午5:34, Christoph Hellwig wrote:

On Tue, Feb 09, 2021 at 05:15:13PM +0800, Ruan Shiyang wrote:

The dax dedupe comparison needs the iomap_ops pointer as an argument, so my
understanding is that we don't modify the argument list of
generic_remap_file_range_prep(), but move its code into
__generic_remap_file_range_prep(), whose argument list can be modified to
accept the iomap_ops pointer.  Then it looks like this:


I'd say just add the iomap_ops pointer to
generic_remap_file_range_prep and do away with the extra wrappers.  We
only have three callers anyway.


OK.


--
Thanks,
Ruan Shiyang.







Re: [PATCH 5/7] fsdax: Dedup file range to use a compare function

2021-02-09 Thread Ruan Shiyang



On 2021/2/8 下午11:19, Christoph Hellwig wrote:

On Mon, Feb 08, 2021 at 01:09:22AM +0800, Shiyang Ruan wrote:

With dax we cannot deal with readpage() etc. So, we create a
funciton callback to perform the file data comparison and pass


s/funciton/function/g


+#define MIN(a, b) (((a) < (b)) ? (a) : (b))


This should use the existing min or min_t helpers.
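In other words (a trivial sketch, not the patch itself; the helper name is
made up), the comparison length can come straight from min_t():

```c
#include <linux/kernel.h>

/* Pick the shorter of the two mapped lengths without a local MIN() macro. */
static inline loff_t dax_cmp_len(loff_t srclen, loff_t destlen)
{
	return min_t(loff_t, srclen, destlen);
}
```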



  int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
  struct file *file_out, loff_t pos_out,
- loff_t *len, unsigned int remap_flags)
+ loff_t *len, unsigned int remap_flags,
+ compare_range_t compare_range_fn)


Can we keep generic_remap_file_range_prep as-is, and add a new
dax_remap_file_range_prep, both sharing a low-level
__generic_remap_file_range_prep implementation?  And for that
implementation I'd also go for classic if/else instead of the function
pointer.


The dax dedupe comparison needs the iomap_ops pointer as an argument, so my
understanding is that we don't modify the argument list of
generic_remap_file_range_prep(), but move its code into
__generic_remap_file_range_prep(), whose argument list can be modified to
accept the iomap_ops pointer.  Then it looks like this:


```
int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
  struct file *file_out, loff_t pos_out,
  loff_t *len, unsigned int remap_flags,
  const struct iomap_ops *ops)
{
return __generic_remap_file_range_prep(file_in, pos_in, file_out,
pos_out, len, remap_flags, ops);
}
EXPORT_SYMBOL(dax_remap_file_range_prep);

int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
  struct file *file_out, loff_t pos_out,
  loff_t *len, unsigned int remap_flags)
{
return __generic_remap_file_range_prep(file_in, pos_in, file_out,
pos_out, len, remap_flags, NULL);
}
EXPORT_SYMBOL(generic_remap_file_range_prep);
```

Am I right?


--
Thanks,
Ruan Shiyang.




+extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+struct inode *dest, loff_t destoff,
+loff_t len, bool *is_same);


no need for the extern.






Re: [PATCH 3/7] fsdax: Copy data before write

2021-02-08 Thread Ruan Shiyang



On 2021/2/8 下午11:14, Christoph Hellwig wrote:

switch (iomap.type) {
case IOMAP_MAPPED:
+cow:
if (iomap.flags & IOMAP_F_NEW) {
count_vm_event(PGMAJFAULT);
count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
major = VM_FAULT_MAJOR;
}
error = dax_iomap_direct_access(&iomap, pos, PAGE_SIZE,
-   NULL, &pfn);
+   &kaddr, &pfn);


Any chance you could look into factoring out this code into a helper
to avoid the goto magic, which is a little too convoluted?


switch (iomap.type) {
case IOMAP_MAPPED:
+cow:
error = dax_iomap_direct_access(&iomap, pos, PMD_SIZE,
-   NULL, &pfn);
+   &kaddr, &pfn);
if (error < 0)
goto finish_iomap;
  
  		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,

DAX_PMD, write && !sync);
  
+		if (srcmap.type != IOMAP_HOLE) {


Same here.


Thanks for the suggestion.  I'll try it.
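One possible shape for such a helper, purely as a sketch: every name and
signature below is an assumption pieced together from the quoted hunks
(dax_iomap_direct_access() and dax_copy_edges() come from this patchset and
their argument lists here are guesses), so treat it as the structure Christoph
is asking for rather than real code.

```c
/* Resolve the destination address and copy the unchanged data from the
 * source extent, so the PTE and PMD fault paths can call this directly
 * instead of jumping to a shared "cow:" label.  Hypothetical sketch. */
static int dax_fault_cow_prepare(struct iomap *iomap, struct iomap *srcmap,
				 loff_t pos, size_t size,
				 void **kaddr, pfn_t *pfnp)
{
	int error = dax_iomap_direct_access(iomap, pos, size, kaddr, pfnp);

	if (error < 0)
		return error;
	if (srcmap->type != IOMAP_HOLE)
		error = dax_copy_edges(pos, size, srcmap, *kaddr, false);
	return error;
}
```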


--
Thanks,
Ruan Shiyang.







Re: [PATCH 0/7] fsdax,xfs: Add reflink support for fsdax

2021-02-08 Thread Ruan Shiyang



On 2021/2/8 下午11:39, Jan Kara wrote:

On Mon 08-02-21 01:09:17, Shiyang Ruan wrote:

This patchset is an attempt to add CoW support for fsdax, taking XFS,
which has both the reflink and fsdax features, as an example.

One of the key mechanisms that needs to be implemented in fsdax is CoW.  Copy
the data from the srcmap before we actually write data to the destination
iomap, and copy only the ranges in which data won't be changed.

Another mechanism is range comparison.  In the page cache case, readpage()
is used to load data on disk into the page cache in order to be able to
compare data.  In the fsdax case, readpage() does not work.  So, we need
another way to compare data, with direct access support.

With the two mechanisms implemented in fsdax, we are able to make reflink
and fsdax work together in XFS.

Some of the patches are picked up from Goldwyn's patchset.  I made some
changes to adapt them to this patchset.


How do you deal with HWPoison code trying to reverse-map struct page back
to inode-offset pair? This also needs to be fixed for reflink to work with
DAX.



I have posted the v3 patchset to add reverse-map support for dax pages
yesterday.  Please take a look at this:


  https://lkml.org/lkml/2021/2/8/347


--
Thanks,
Ruan Shiyang.


Honza



(Rebased on v5.10)
==

Shiyang Ruan (7):
   fsdax: Output address in dax_iomap_pfn() and rename it
   fsdax: Introduce dax_copy_edges() for CoW
   fsdax: Copy data before write
   fsdax: Replace mmap entry in case of CoW
   fsdax: Dedup file range to use a compare function
   fs/xfs: Handle CoW for fsdax write() path
   fs/xfs: Add dedupe support for fsdax

  fs/btrfs/reflink.c |   3 +-
  fs/dax.c   | 188 ++---
  fs/ocfs2/file.c|   2 +-
  fs/remap_range.c   |  14 +--
  fs/xfs/xfs_bmap_util.c |   6 +-
  fs/xfs/xfs_file.c  |  30 ++-
  fs/xfs/xfs_inode.c |   8 +-
  fs/xfs/xfs_inode.h |   1 +
  fs/xfs/xfs_iomap.c |   3 +-
  fs/xfs/xfs_iops.c  |  11 ++-
  fs/xfs/xfs_reflink.c   |  23 -
  include/linux/dax.h|   5 ++
  include/linux/fs.h |   9 +-
  13 files changed, 270 insertions(+), 33 deletions(-)

--
2.30.0







Re: [PATCH] dax: fix default return code of range_parse()

2021-02-07 Thread Ruan Shiyang

ping

On 2021/1/26 上午10:13, Shiyang Ruan wrote:

The return value of range_parse() indicates the size when it is
positive.  The error code should be negative.

Signed-off-by: Shiyang Ruan 
---
  drivers/dax/bus.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 737b207c9e30..3003558c1a8b 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1038,7 +1038,7 @@ static ssize_t range_parse(const char *opt, size_t len, 
struct range *range)
  {
unsigned long long addr = 0;
char *start, *end, *str;
-   ssize_t rc = EINVAL;
+   ssize_t rc = -EINVAL;
  
  	str = kstrdup(opt, GFP_KERNEL);

if (!str)
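To make the reasoning concrete (the caller below is hypothetical, not the
actual drivers/dax/bus.c code): callers treat a negative return as failure and
anything else as the parsed size, so a bare positive EINVAL would be silently
accepted as a valid size.

```c
/* Hypothetical caller illustrating the convention range_parse() must follow. */
static ssize_t parse_and_use(const char *opt, size_t len, struct range *range)
{
	ssize_t rc = range_parse(opt, len, range);

	if (rc < 0)		/* only a negative value is seen as an error */
		return rc;

	/* rc is the size of the parsed range from here on */
	return rc;
}
```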





Re: [PATCH RESEND v2 08/10] md: Implement ->corrupted_range()

2021-02-02 Thread Ruan Shiyang
tor_t sector,
  static int pmem_corrupted_range(struct gendisk *disk, struct block_device 
*bdev,
loff_t disk_offset, size_t len, void *data)
  {
-   struct super_block *sb;
loff_t bdev_offset;
sector_t disk_sector = disk_offset >> SECTOR_SHIFT;
-   int rc = 0;
+   int rc = -ENODEV;
  
  	bdev = bdget_disk_sector(disk, disk_sector);

if (!bdev)
-   return -ENODEV;
+   return rc;
  
  	bdev_offset = (disk_sector - get_start_sect(bdev)) << SECTOR_SHIFT;

-   sb = get_super(bdev);
-   if (sb && sb->s_op->corrupted_range) {
-   rc = sb->s_op->corrupted_range(sb, bdev, bdev_offset, len, 
data);
-   drop_super(sb);
-   }
+   rc = bd_corrupted_range(bdev, bdev_offset, bdev_offset, len, data);
  
  	bdput(bdev);

return rc;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 3b8963e228a1..3cc2b2911e3a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1079,6 +1079,27 @@ struct bd_holder_disk {
int refcnt;
  };
  
+static int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,

+ size_t len, void *data)
+{
+   struct bd_holder_disk *holder;
+   struct gendisk *disk;
+   int rc = 0;
+
+   if (list_empty(&(bdev->bd_holder_disks)))
+   return -ENODEV;
+
+   list_for_each_entry(holder, &bdev->bd_holder_disks, list) {
+   disk = holder->disk;
+   if (disk->fops->corrupted_range) {
+   rc = disk->fops->corrupted_range(disk, bdev, off, len, 
data);
+   if (rc != -ENODEV)
+   break;
+   }
+   }
+   return rc;
+}
+
  static struct bd_holder_disk *bd_find_holder_disk(struct block_device *bdev,
  struct gendisk *disk)
  {
@@ -1212,7 +1233,26 @@ void bd_unlink_disk_holder(struct block_device *bdev, 
struct gendisk *disk)
mutex_unlock(&bdev->bd_mutex);
  }
  EXPORT_SYMBOL_GPL(bd_unlink_disk_holder);
-#endif
+#endif /* CONFIG_SYSFS */
+
+int bd_corrupted_range(struct block_device *bdev, loff_t disk_off,
+  loff_t bdev_off, size_t len, void *data)
+{
+   struct super_block *sb = get_super(bdev);
+   int rc = -EOPNOTSUPP;
+
+   if (!sb) {
+#ifdef CONFIG_SYSFS
+   rc = bd_disk_holder_corrupted_range(bdev, disk_off, len, data);
+#endif /* CONFIG_SYSFS */


Normal kernel convention is that you'd provide an empty shell for the
CONFIG_SYSFS=n case, e.g.

#ifdef CONFIG_SYSFS
int bd_corrupted_range(...) {
/* real code */
}
#else
static inline int bd_corrupted_range(...) { return -EOPNOTSUPP; }
#endif

so that you don't have preprocessor directives making this function
choppy.


I'll fix it.


--
Thanks,
Ruan Shiyang.



--D


+   return rc;
+   } else if (sb->s_op->corrupted_range)
+   rc = sb->s_op->corrupted_range(sb, bdev, bdev_off, len, data);
+   drop_super(sb);
+
+   return rc;
+}
+EXPORT_SYMBOL(bd_corrupted_range);
  
  static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part);
  
diff --git a/include/linux/genhd.h b/include/linux/genhd.h

index 4da480798955..996f91b08d48 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -315,6 +315,8 @@ void unregister_blkdev(unsigned int major, const char 
*name);
  bool bdev_check_media_change(struct block_device *bdev);
  int __invalidate_device(struct block_device *bdev, bool kill_dirty);
  void set_capacity(struct gendisk *disk, sector_t size);
+int bd_corrupted_range(struct block_device *bdev, loff_t disk_off,
+  loff_t bdev_off, size_t len, void *data);
  
  /* for drivers/char/raw.c: */

  int blkdev_ioctl(struct block_device *, fmode_t, unsigned, unsigned long);
--
2.30.0










Re: [PATCH RESEND v2 09/10] xfs: Implement ->corrupted_range() for XFS

2021-02-02 Thread Ruan Shiyang
pagecache writeback?


But in this case, the dax page is already broken, so it seems that the page
cache should not be written back to the original dax page.  I think
another dax page needs to be allocated for the writeback.





+out:
+   xfs_irele(ip);
+   }
+
+   return rc;
+}
+
+static int
+xfs_fs_corrupted_range(
+   struct super_block  *sb,
+   struct block_device *bdev,
+   loff_t  offset,
+   size_t  len,
+   void*data)
+{
+   struct xfs_mount*mp = XFS_M(sb);
+   struct xfs_trans*tp = NULL;
+   struct xfs_btree_cur*cur = NULL;
+   struct xfs_rmap_irecrmap_low, rmap_high;
+   struct xfs_buf  *agf_bp = NULL;
+   xfs_fsblock_t   fsbno = XFS_B_TO_FSB(mp, offset);
+   xfs_filblks_t   bcnt = XFS_B_TO_FSB(mp, len);
+   xfs_agnumber_t  agno = XFS_FSB_TO_AGNO(mp, fsbno);
+   xfs_agblock_t   agbno = XFS_FSB_TO_AGBNO(mp, fsbno);
+   int error = 0;
+
+   if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_bdev == bdev) {
+   xfs_warn(mp, "corrupted_range support not available for realtime 
device!");
+   return 0;
+   }
+   if (mp->m_logdev_targp && mp->m_logdev_targp->bt_bdev == bdev &&
+   mp->m_logdev_targp != mp->m_ddev_targp) {
+   xfs_err(mp, "ondisk log corrupt, shutting down fs!");
+   xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_META);


Longer term question for the rest of the xfs community: Can we do better
than this?  If the ail has checkpointed past this part of the log then
we could just write zeroes into dead area, right?

Also, if one of the log buffers points to a dead log area and isn't the
one that's currently being written into, can we just submit_bio it to
rewrite the lost part of the log??


Yes, we should also fix the log rather than shut it down directly.  I
will take that into consideration in future patches.





+   return 0;
+   }
+
+   if (!xfs_sb_version_hasrmapbt(&mp->m_sb)) {
+   xfs_warn(mp, "corrupted_range needs rmapbt enabled!");
+   return 0;
+   }
+
+   error = xfs_trans_alloc_empty(mp, &tp);
+   if (error)
+   return error;
+
+   error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
+   if (error)
+   goto out_cancel_tp;
+
+   cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agno);
+
+   /* Construct a range for rmap query */
+   memset(&rmap_low, 0, sizeof(rmap_low));
+   memset(&rmap_high, 0xFF, sizeof(rmap_high));
+   rmap_low.rm_startblock = rmap_high.rm_startblock = agbno;
+   rmap_low.rm_blockcount = rmap_high.rm_blockcount = bcnt;
+
+   error = xfs_rmap_query_range(cur, &rmap_low, &rmap_high, xfs_corrupt_helper, data);


Long line here...


+   if (error == -EFSCORRUPTED)
+   xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_META);


This should go in xfs_corrupt_helper as I mentioned above.


OK.


--
Thanks,
Ruan Shiyang.


--D


+
+   xfs_btree_del_cursor(cur, error);
+   xfs_trans_brelse(tp, agf_bp);
+out_cancel_tp:
+   xfs_trans_cancel(tp);
+   return error;
+}
+
  static const struct super_operations xfs_super_operations = {
.alloc_inode= xfs_fs_alloc_inode,
.destroy_inode  = xfs_fs_destroy_inode,
@@ -1118,6 +1226,7 @@ static const struct super_operations xfs_super_operations 
= {
.show_options   = xfs_fs_show_options,
.nr_cached_objects  = xfs_fs_nr_cached_objects,
.free_cached_objects= xfs_fs_free_cached_objects,
+   .corrupted_range= xfs_fs_corrupted_range,
  };
  
  static int

--
2.30.0










Re: [PATCH 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping

2021-01-13 Thread Ruan Shiyang



On 2021/1/14 上午11:26, zhong jiang wrote:


On 2021/1/14 9:44 上午, Ruan Shiyang wrote:



On 2021/1/13 下午6:04, zhong jiang wrote:


On 2021/1/12 10:55 上午, Ruan Shiyang wrote:



On 2021/1/6 下午11:41, Jan Kara wrote:

On Thu 31-12-20 00:55:55, Shiyang Ruan wrote:
The current memory_failure_dev_pagemap() can only handle single-mapped
dax pages for fsdax mode.  The dax page could be mapped by multiple files
and offsets if we let the reflink feature & fsdax mode work together.  So,
we refactor the current implementation to support handling memory failure on
each file and offset.

Signed-off-by: Shiyang Ruan 


Overall this looks OK to me, a few comments below.


---
  fs/dax.c    | 21 +++
  include/linux/dax.h |  1 +
  include/linux/mm.h  |  9 +
  mm/memory-failure.c | 91 
++---

  4 files changed, 100 insertions(+), 22 deletions(-)


...

  @@ -345,9 +348,12 @@ static void add_to_kill(struct task_struct 
*tsk, struct page *p,

  }
    tk->addr = page_address_in_vma(p, vma);
-    if (is_zone_device_page(p))
-    tk->size_shift = dev_pagemap_mapping_shift(p, vma);
-    else
+    if (is_zone_device_page(p)) {
+    if (is_device_fsdax_page(p))
+    tk->addr = vma->vm_start +
+    ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);


It seems strange to use 'pgoff' for dax pages and not for any other 
page.
Why? I'd rather pass correct pgoff from all callers of 
add_to_kill() and

avoid this special casing...


Because one fsdax page can be shared by multiple pgoffs.  I have to 
pass each pgoff in each iteration to calculate the address in vma 
(for tk->addr).  Other kinds of pages don't need this. They can get 
their unique address by calling "page_address_in_vma()".


IMO, an fsdax page can be shared by multiple files rather than
multiple pgoffs if the fs supports reflink.  Because a page is only
located in one mapping (page->mapping is exclusive), it only has
one pgoff or index pointing at the node.


Or do I miss something about the feature?  Thanks,


Yes, an fsdax page is shared by multiple files because of reflink.  I
think my description of 'pgoff' here is not correct.  This 'pgoff'
means the offset within a file.  (We use rmap to find out all the
sharing files and their offsets.)  So, I said that it "can be shared by
multiple pgoffs".  It's my bad.


I think I should give it another name to avoid misunderstandings.

IMO, all the sharing files should share the fsdax page at the same
offset.  Why is that not the case?


The dedupe operation can let different files share the same data
extent even though their offsets are not the same.  So, files can share
one fsdax page at different offsets.


As you have said, a shared fsdax page would be inserted into different
file mappings, but page->index and page->mapping are exclusive; hence a
page should only be placed in one mapping tree.


We can't use page->mapping and page->index here for reflink & fsdax,
and that's what this patchset aims to solve.  I introduced a series of
->corrupted_range() hooks, from mm to the pmem driver to the block device
and finally to the filesystem, to use the filesystem's rmap feature to find
out all files sharing the same data extent (fsdax page).



--
Thanks,
Ruan Shiyang.



And in the current patch, we fail to find out all the processes using
the fsdax page shared by multiple files and kill them.



Thanks,


--
Thanks,
Ruan Shiyang.



So, I added this fsdax case here.  This patchset only implements
the fsdax case; other cases also need to be added here if they are to be
implemented.



--
Thanks,
Ruan Shiyang.



+    tk->size_shift = dev_pagemap_mapping_shift(p, vma, 
tk->addr);

+    } else
  tk->size_shift = page_shift(compound_head(p));
    /*
@@ -495,7 +501,7 @@ static void collect_procs_anon(struct page 
*page, struct list_head *to_kill,

  if (!page_mapped_in_vma(page, vma))
  continue;
  if (vma->vm_mm == t->mm)
-    add_to_kill(t, page, vma, to_kill);
+    add_to_kill(t, page, NULL, 0, vma, to_kill);
  }
  }
  read_unlock(_lock);
@@ -505,24 +511,19 @@ static void collect_procs_anon(struct page 
*page, struct list_head *to_kill,

  /*
   * Collect processes when the error hit a file mapped page.
   */
-static void collect_procs_file(struct page *page, struct 
list_head *to_kill,

-    int force_early)
+static void collect_procs_file(struct page *page, struct 
address_space *mapping,

+    pgoff_t pgoff, struct list_head *to_kill, int force_early)
  {
  struct vm_area_struct *vma;
  struct task_struct *tsk;
-    struct address_space *mapping = page->mapping;
-    pgoff_t pgoff;
    i_mmap_lock_read(mapping);
  read_lock(_lock);
-    pgoff = page_to_pgoff(page);
  for_each_process(tsk) {
  struct task_struct *t = task_early_kill(tsk, force_ea

Re: [PATCH 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping

2021-01-13 Thread Ruan Shiyang



On 2021/1/13 下午6:04, zhong jiang wrote:


On 2021/1/12 10:55 上午, Ruan Shiyang wrote:



On 2021/1/6 下午11:41, Jan Kara wrote:

On Thu 31-12-20 00:55:55, Shiyang Ruan wrote:

The current memory_failure_dev_pagemap() can only handle single-mapped
dax pages for fsdax mode.  The dax page could be mapped by multiple files
and offsets if we let the reflink feature & fsdax mode work together.  So,
we refactor the current implementation to support handling memory failure on
each file and offset.

Signed-off-by: Shiyang Ruan 


Overall this looks OK to me, a few comments below.


---
  fs/dax.c    | 21 +++
  include/linux/dax.h |  1 +
  include/linux/mm.h  |  9 +
  mm/memory-failure.c | 91 
++---

  4 files changed, 100 insertions(+), 22 deletions(-)


...

  @@ -345,9 +348,12 @@ static void add_to_kill(struct task_struct 
*tsk, struct page *p,

  }
    tk->addr = page_address_in_vma(p, vma);
-    if (is_zone_device_page(p))
-    tk->size_shift = dev_pagemap_mapping_shift(p, vma);
-    else
+    if (is_zone_device_page(p)) {
+    if (is_device_fsdax_page(p))
+    tk->addr = vma->vm_start +
+    ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);


It seems strange to use 'pgoff' for dax pages and not for any other 
page.

Why? I'd rather pass correct pgoff from all callers of add_to_kill() and
avoid this special casing...


Because one fsdax page can be shared by multiple pgoffs.  I have to 
pass each pgoff in each iteration to calculate the address in vma (for 
tk->addr).  Other kinds of pages don't need this. They can get their 
unique address by calling "page_address_in_vma()".


IMO, an fsdax page can be shared by multiple files rather than
multiple pgoffs if the fs supports reflink.  Because a page is only
located in one mapping (page->mapping is exclusive), it only has
one pgoff or index pointing at the node.


Or do I miss something about the feature?  Thanks,


Yes, an fsdax page is shared by multiple files because of reflink.  I
think my description of 'pgoff' here is not correct.  This 'pgoff' means
the offset within a file.  (We use rmap to find out all the sharing
files and their offsets.)  So, I said that it "can be shared by multiple
pgoffs".  It's my bad.


I think I should give it another name to avoid misunderstandings.


--
Thanks,
Ruan Shiyang.



So, I added this fsdax case here.  This patchset only implements the
fsdax case; other cases also need to be added here if they are to be implemented.



--
Thanks,
Ruan Shiyang.




+    tk->size_shift = dev_pagemap_mapping_shift(p, vma, tk->addr);
+    } else
  tk->size_shift = page_shift(compound_head(p));
    /*
@@ -495,7 +501,7 @@ static void collect_procs_anon(struct page 
*page, struct list_head *to_kill,

  if (!page_mapped_in_vma(page, vma))
  continue;
  if (vma->vm_mm == t->mm)
-    add_to_kill(t, page, vma, to_kill);
+    add_to_kill(t, page, NULL, 0, vma, to_kill);
  }
  }
  read_unlock(_lock);
@@ -505,24 +511,19 @@ static void collect_procs_anon(struct page 
*page, struct list_head *to_kill,

  /*
   * Collect processes when the error hit a file mapped page.
   */
-static void collect_procs_file(struct page *page, struct list_head 
*to_kill,

-    int force_early)
+static void collect_procs_file(struct page *page, struct 
address_space *mapping,

+    pgoff_t pgoff, struct list_head *to_kill, int force_early)
  {
  struct vm_area_struct *vma;
  struct task_struct *tsk;
-    struct address_space *mapping = page->mapping;
-    pgoff_t pgoff;
    i_mmap_lock_read(mapping);
  read_lock(_lock);
-    pgoff = page_to_pgoff(page);
  for_each_process(tsk) {
  struct task_struct *t = task_early_kill(tsk, force_early);
-
  if (!t)
  continue;
-    vma_interval_tree_foreach(vma, >i_mmap, pgoff,
-  pgoff) {
+    vma_interval_tree_foreach(vma, >i_mmap, pgoff, 
pgoff) {

  /*
   * Send early kill signal to tasks where a vma covers
   * the page but the corrupted page is not necessarily
@@ -531,7 +532,7 @@ static void collect_procs_file(struct page 
*page, struct list_head *to_kill,

   * to be informed of all such data corruptions.
   */
  if (vma->vm_mm == t->mm)
-    add_to_kill(t, page, vma, to_kill);
+    add_to_kill(t, page, mapping, pgoff, vma, to_kill);
  }
  }
  read_unlock(_lock);
@@ -550,7 +551,8 @@ static void collect_procs(struct page *page, 
struct list_head *tokill,

  if (PageAnon(page))
  collect_procs_anon(page, tokill, force_early);
  else
-    collect_procs_file(page, tokill, force_early);
+    collect_procs_file(page, page->mapping, page_

Re: [PATCH 08/10] md: Implement ->corrupted_range()

2021-01-12 Thread Ruan Shiyang



On 2021/1/7 上午1:14, Jan Kara wrote:

On Thu 31-12-20 00:55:59, Shiyang Ruan wrote:

With the support of ->rmap(), it is possible to obtain the superblock on
a mapped device.

If a pmem device is used as one target of mapped device, we cannot
obtain its superblock directly.  With the help of SYSFS, the mapped
device can be found on the target devices.  So, we iterate the
bdev->bd_holder_disks to obtain its mapped device.

Signed-off-by: Shiyang Ruan 


Thanks for the patch. Two comments below.


diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4688bff19c20..9f9a2f3bf73b 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -256,21 +256,16 @@ static int pmem_rw_page(struct block_device *bdev, 
sector_t sector,
  static int pmem_corrupted_range(struct gendisk *disk, struct block_device 
*bdev,
loff_t disk_offset, size_t len, void *data)
  {
-   struct super_block *sb;
loff_t bdev_offset;
sector_t disk_sector = disk_offset >> SECTOR_SHIFT;
-   int rc = 0;
+   int rc = -ENODEV;
  
  	bdev = bdget_disk_sector(disk, disk_sector);

if (!bdev)
-   return -ENODEV;
+   return rc;
  
  	bdev_offset = (disk_sector - get_start_sect(bdev)) << SECTOR_SHIFT;

-   sb = get_super(bdev);
-   if (sb && sb->s_op->corrupted_range) {
-   rc = sb->s_op->corrupted_range(sb, bdev, bdev_offset, len, 
data);
-   drop_super(sb);
-   }
+   rc = bd_corrupted_range(bdev, bdev_offset, bdev_offset, len, data);
  
  	bdput(bdev);

return rc;


This (and the fs/block_dev.c change below) is just refining the function
you've implemented in the patch 6. I think it's confusing to split changes
like this - why not implement things correctly from the start in patch 6?


This change adds a helper function to find the md devices created on a
low-level block device, such as an LVM on /dev/pmem0, and calls
->corrupted_range() for each md device.  The md parts were introduced
starting from patch 7.  So, I added this change in this patch.





diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9e84b1928b94..0e50f0e8e8af 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1171,6 +1171,27 @@ struct bd_holder_disk {
int refcnt;
  };
  
+static int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,

+ size_t len, void *data)
+{
+   struct bd_holder_disk *holder;
+   struct gendisk *disk;
+   int rc = 0;
+
+   if (list_empty(&(bdev->bd_holder_disks)))
+   return -ENODEV;


This will not compile for !CONFIG_SYSFS kernels. Not that it would be
common but still. Also I'm not sure whether using bd_holder_disks like this
is really the right thing to do (when it seems to be only a sysfs thing),
although admittedly I'm not aware of a better way of getting this
information.


I tried a lot of approaches and finally settled on this way.  I think I
should add a check for whether CONFIG_SYSFS is turned on.



--
Thanks,
Ruan Shiyang.



Honza


+
+   list_for_each_entry(holder, &bdev->bd_holder_disks, list) {
+   disk = holder->disk;
+   if (disk->fops->corrupted_range) {
+   rc = disk->fops->corrupted_range(disk, bdev, off, len, 
data);
+   if (rc != -ENODEV)
+   break;
+   }
+   }
+   return rc;
+}
+
  static struct bd_holder_disk *bd_find_holder_disk(struct block_device *bdev,
  struct gendisk *disk)
  {
@@ -1378,6 +1399,22 @@ void bd_set_nr_sectors(struct block_device *bdev, 
sector_t sectors)
  }
  EXPORT_SYMBOL(bd_set_nr_sectors);
  
+int bd_corrupted_range(struct block_device *bdev, loff_t disk_off, loff_t bdev_off, size_t len, void *data)

+{
+   struct super_block *sb = get_super(bdev);
+   int rc = 0;
+
+   if (!sb) {
+   rc = bd_disk_holder_corrupted_range(bdev, disk_off, len, data);
+   return rc;
+   } else if (sb->s_op->corrupted_range)
+   rc = sb->s_op->corrupted_range(sb, bdev, bdev_off, len, data);
+   drop_super(sb);
+
+   return rc;
+}
+EXPORT_SYMBOL(bd_corrupted_range);
+
  static void __blkdev_put(struct block_device *bdev, fmode_t mode, int 
for_part);
  
  int bdev_disk_changed(struct block_device *bdev, bool invalidate)

diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index ed06209008b8..42290470810d 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -376,6 +376,8 @@ void revalidate_disk_size(struct gendisk *disk, bool 
verbose);
  bool bdev_check_media_change(struct block_device *bdev);
  int __invalidate_device(struct block_device *bdev, bool kill_dirty);
  void bd_set_nr_sector

Re: [PATCH 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping

2021-01-11 Thread Ruan Shiyang



On 2021/1/6 下午11:41, Jan Kara wrote:

On Thu 31-12-20 00:55:55, Shiyang Ruan wrote:

The current memory_failure_dev_pagemap() can only handle single-mapped
dax pages for fsdax mode.  The dax page could be mapped by multiple files
and offsets if we let the reflink feature & fsdax mode work together.  So,
we refactor the current implementation to support handling memory failure on
each file and offset.

Signed-off-by: Shiyang Ruan 


Overall this looks OK to me, a few comments below.


---
  fs/dax.c| 21 +++
  include/linux/dax.h |  1 +
  include/linux/mm.h  |  9 +
  mm/memory-failure.c | 91 ++---
  4 files changed, 100 insertions(+), 22 deletions(-)


...

  
@@ -345,9 +348,12 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,

}
  
  	tk->addr = page_address_in_vma(p, vma);

-   if (is_zone_device_page(p))
-   tk->size_shift = dev_pagemap_mapping_shift(p, vma);
-   else
+   if (is_zone_device_page(p)) {
+   if (is_device_fsdax_page(p))
+   tk->addr = vma->vm_start +
+   ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);


It seems strange to use 'pgoff' for dax pages and not for any other page.
Why? I'd rather pass correct pgoff from all callers of add_to_kill() and
avoid this special casing...


Because one fsdax page can be shared by multiple pgoffs.  I have to pass 
each pgoff in each iteration to calculate the address in vma (for 
tk->addr).  Other kinds of pages don't need this.  They can get their 
unique address by calling "page_address_in_vma()".


So, I added this fsdax case here.  This patchset only implements the
fsdax case; other cases also need to be added here if they are to be implemented.



--
Thanks,
Ruan Shiyang.




+   tk->size_shift = dev_pagemap_mapping_shift(p, vma, tk->addr);
+   } else
tk->size_shift = page_shift(compound_head(p));
  
  	/*

@@ -495,7 +501,7 @@ static void collect_procs_anon(struct page *page, struct 
list_head *to_kill,
if (!page_mapped_in_vma(page, vma))
continue;
if (vma->vm_mm == t->mm)
-   add_to_kill(t, page, vma, to_kill);
+   add_to_kill(t, page, NULL, 0, vma, to_kill);
}
}
read_unlock(&tasklist_lock);
@@ -505,24 +511,19 @@ static void collect_procs_anon(struct page *page, struct 
list_head *to_kill,
  /*
   * Collect processes when the error hit a file mapped page.
   */
-static void collect_procs_file(struct page *page, struct list_head *to_kill,
-   int force_early)
+static void collect_procs_file(struct page *page, struct address_space 
*mapping,
+   pgoff_t pgoff, struct list_head *to_kill, int force_early)
  {
struct vm_area_struct *vma;
struct task_struct *tsk;
-   struct address_space *mapping = page->mapping;
-   pgoff_t pgoff;
  
  	i_mmap_lock_read(mapping);

read_lock(&tasklist_lock);
-   pgoff = page_to_pgoff(page);
for_each_process(tsk) {
struct task_struct *t = task_early_kill(tsk, force_early);
-
if (!t)
continue;
-   vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
- pgoff) {
+   vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
/*
 * Send early kill signal to tasks where a vma covers
 * the page but the corrupted page is not necessarily
@@ -531,7 +532,7 @@ static void collect_procs_file(struct page *page, struct 
list_head *to_kill,
 * to be informed of all such data corruptions.
 */
if (vma->vm_mm == t->mm)
-   add_to_kill(t, page, vma, to_kill);
+   add_to_kill(t, page, mapping, pgoff, vma, 
to_kill);
}
}
read_unlock(&tasklist_lock);
@@ -550,7 +551,8 @@ static void collect_procs(struct page *page, struct 
list_head *tokill,
if (PageAnon(page))
collect_procs_anon(page, tokill, force_early);
else
-   collect_procs_file(page, tokill, force_early);
+   collect_procs_file(page, page->mapping, page_to_pgoff(page),


Why not use page_mapping() helper here? It would be safer for THPs if they
ever get here...

Honza





Re: [RFC PATCH v3 8/9] md: Implement ->corrupted_range()

2021-01-08 Thread Ruan Shiyang



On 2021/1/5 上午7:34, Darrick J. Wong wrote:

On Fri, Dec 18, 2020 at 10:11:54AM +0800, Ruan Shiyang wrote:



On 2020/12/16 上午4:51, Darrick J. Wong wrote:

On Tue, Dec 15, 2020 at 08:14:13PM +0800, Shiyang Ruan wrote:

With the support of ->rmap(), it is possible to obtain the superblock on
a mapped device.

If a pmem device is used as one target of mapped device, we cannot
obtain its superblock directly.  With the help of SYSFS, the mapped
device can be found on the target devices.  So, we iterate the
bdev->bd_holder_disks to obtain its mapped device.

Signed-off-by: Shiyang Ruan 
---
   drivers/md/dm.c   | 66 +++
   drivers/nvdimm/pmem.c |  9 --
   fs/block_dev.c| 21 ++
   include/linux/genhd.h |  7 +
   4 files changed, 100 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 4e0cbfe3f14d..9da1f9322735 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -507,6 +507,71 @@ static int dm_blk_report_zones(struct gendisk *disk, 
sector_t sector,
   #define dm_blk_report_zones  NULL
   #endif /* CONFIG_BLK_DEV_ZONED */
+struct dm_blk_corrupt {
+   struct block_device *bdev;
+   sector_t offset;
+};
+
+static int dm_blk_corrupt_fn(struct dm_target *ti, struct dm_dev *dev,
+   sector_t start, sector_t len, void *data)
+{
+   struct dm_blk_corrupt *bc = data;
+
+   return bc->bdev == (void *)dev->bdev &&
+   (start <= bc->offset && bc->offset < start + len);
+}
+
+static int dm_blk_corrupted_range(struct gendisk *disk,
+ struct block_device *target_bdev,
+ loff_t target_offset, size_t len, void *data)
+{
+   struct mapped_device *md = disk->private_data;
+   struct block_device *md_bdev = md->bdev;
+   struct dm_table *map;
+   struct dm_target *ti;
+   struct super_block *sb;
+   int srcu_idx, i, rc = 0;
+   bool found = false;
+   sector_t disk_sec, target_sec = to_sector(target_offset);
+
+   map = dm_get_live_table(md, &srcu_idx);
+   if (!map)
+   return -ENODEV;
+
+   for (i = 0; i < dm_table_get_num_targets(map); i++) {
+   ti = dm_table_get_target(map, i);
+   if (ti->type->iterate_devices && ti->type->rmap) {
+   struct dm_blk_corrupt bc = {target_bdev, target_sec};
+
+   found = ti->type->iterate_devices(ti, dm_blk_corrupt_fn, &bc);
+   if (!found)
+   continue;
+   disk_sec = ti->type->rmap(ti, target_sec);


What happens if the dm device has multiple reverse mappings because the
physical storage is being shared at multiple LBAs?  (e.g. a
deduplication target)


I thought that the dm device knows the mapping relationship, and it can be
done by implementation of ->rmap() in each target.  Did I understand it
wrong?


The dm device /does/ know the mapping relationship.  I'm asking what
happens if there are *multiple* mappings.  For example, a deduplicating
dm device could observe that the upper level code wrote some data to
sector 200 and now it wants to write the same data to sector 500.
Instead of writing twice, it simply maps sector 500 in its LBA space to
the same space that it mapped sector 200.

Pretend that sector 200 on the dm-dedupe device maps to sector 64 on the
underlying storage (call it /dev/pmem1 and let's say it's the only
target sitting underneath the dm-dedupe device).

If /dev/pmem1 then notices that sector 64 has gone bad, it will start
calling ->corrupted_range handlers until it calls dm_blk_corrupted_range
on the dm-dedupe device.  At least in theory, the dm-dedupe driver's
rmap method ought to return both (64 -> 200) and (64 -> 500) so that
dm_blk_corrupted_range can pass on both corruption notices to whatever's
sitting atop the dedupe device.

At the moment, your ->rmap prototype is only capable of returning one
sector_t mapping per target, and there's only the one target under the
dedupe device, so we cannot report the loss of sectors 200 and 500 to
whatever device is sitting on top of dm-dedupe.


Got it.  I didn't know there is a kind of dm device called dm-dedupe. 
Thanks for the guidance.
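One hypothetical way to cover Darrick's case (this is not an existing
device-mapper interface, just a sketch of the direction; every name below is
invented): make ->rmap() visit every mapping through a callback instead of
returning a single sector, so a dedupe-style target can report both 64->200
and 64->500.

```c
/* Sketch of a multi-mapping rmap hook; all names here are invented. */
typedef int (*dm_rmap_visit_fn)(struct dm_target *ti, sector_t mapped_sector,
				void *data);

struct dm_rmap_sketch {
	/* Call @visit once for every LBA that maps to @dev_sector. */
	int (*rmap)(struct dm_target *ti, sector_t dev_sector,
		    dm_rmap_visit_fn visit, void *data);
};
```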



--
Thanks,
Ruan Shiyang.



--D




+   break;
+   }
+   }
+
+   if (!found) {
+   rc = -ENODEV;
+   goto out;
+   }
+
+   sb = get_super(md_bdev);
+   if (!sb) {
+   rc = bd_disk_holder_corrupted_range(md_bdev, 
to_bytes(disk_sec), len, data);
+   goto out;
+   } else if (sb->s_op->corrupted_range) {
+   loff_t off = to_bytes(disk_sec - get_start_sect(md_bdev));
+
+   rc = sb->s_op->corrupted_range(sb, md_bdev, off, len, data);


This &qu

Re: [RFC PATCH v3 0/9] fsdax: introduce fs query to support reflink

2020-12-18 Thread Ruan Shiyang



On 2020/12/18 上午11:49, Darrick J. Wong wrote:

On Fri, Dec 18, 2020 at 10:44:26AM +0800, Ruan Shiyang wrote:



On 2020/12/17 上午4:55, Jane Chu wrote:

Hi, Shiyang,

On 12/15/2020 4:14 AM, Shiyang Ruan wrote:

The call trace is like this:
memory_failure()
   pgmap->ops->memory_failure()  => pmem_pgmap_memory_failure()
    gendisk->fops->corrupted_range() => - pmem_corrupted_range()
    - md_blk_corrupted_range()
     sb->s_ops->corrupted_range()    => xfs_fs_corrupted_range()
  xfs_rmap_query_range()
   xfs_corrupt_helper()
    * corrupted on metadata
    try to recover data, call xfs_force_shutdown()
    * corrupted on file data
    try to recover data, call mf_dax_mapping_kill_procs()

The fsdax & reflink support for XFS is not contained in this patchset.

(Rebased on v5.10)


So I tried the patchset with pmem error injection, the SIGBUS payload
does not look right -

** SIGBUS(7): **
** si_addr(0x(nil)), si_lsb(0xC), si_code(0x4, BUS_MCEERR_AR) **

I expect the payload looks like

** si_addr(0x7f3672e0), si_lsb(0x15), si_code(0x4, BUS_MCEERR_AR) **


Thanks for testing.  I tested the SIGBUS handling by writing a program which
calls madvise(..., MADV_HWPOISON) to inject a memory failure.  It just shows
that the program is killed by SIGBUS; I cannot get any detail from it.  So,
could you please show me the right way (test tools) to test it?


I'm assuming that Jane is using a program that calls sigaction to
install a SIGBUS handler, and dumps the entire siginfo_t structure
whenever it receives one...


OK.  Let me try it and figure out what's wrong in it.
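For reference, a minimal userspace reporter along the lines Darrick describes
(the mmap/poison part is left as a comment because it depends on the test
setup; fprintf is fine for a test even though it is not strictly
async-signal-safe):

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Dump the siginfo fields Jane quoted: si_addr, si_addr_lsb, si_code. */
static void bus_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)ctx;
	fprintf(stderr, "** SIGBUS(%d): si_addr(%p), si_lsb(0x%x), si_code(0x%x) **\n",
		sig, si->si_addr, si->si_addr_lsb, si->si_code);
	_exit(1);
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction	= bus_handler,
		.sa_flags	= SA_SIGINFO,
	};

	sigaction(SIGBUS, &sa, NULL);

	/*
	 * mmap() the fsdax file here, inject poison (or touch a page that
	 * was poisoned externally), and the handler above prints the
	 * payload the kernel delivers.
	 */
	return 0;
}
```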


--
Thanks,
Ruan Shiyang.



--D



--
Thanks,
Ruan Shiyang.



thanks,
-jane
















Re: [RFC PATCH v3 0/9] fsdax: introduce fs query to support reflink

2020-12-17 Thread Ruan Shiyang



On 2020/12/17 上午4:55, Jane Chu wrote:

Hi, Shiyang,

On 12/15/2020 4:14 AM, Shiyang Ruan wrote:

The call trace is like this:
memory_failure()
  pgmap->ops->memory_failure()  => pmem_pgmap_memory_failure()
   gendisk->fops->corrupted_range() => - pmem_corrupted_range()
   - md_blk_corrupted_range()
    sb->s_ops->corrupted_range()    => xfs_fs_corrupted_range()
 xfs_rmap_query_range()
  xfs_corrupt_helper()
   * corrupted on metadata
   try to recover data, call xfs_force_shutdown()
   * corrupted on file data
   try to recover data, call mf_dax_mapping_kill_procs()

The fsdax & reflink support for XFS is not contained in this patchset.

(Rebased on v5.10)


So I tried the patchset with pmem error injection, the SIGBUS payload
does not look right -

** SIGBUS(7): **
** si_addr(0x(nil)), si_lsb(0xC), si_code(0x4, BUS_MCEERR_AR) **

I expect the payload looks like

** si_addr(0x7f3672e0), si_lsb(0x15), si_code(0x4, BUS_MCEERR_AR) **


Thanks for testing.  I tested the SIGBUS handling by writing a program which
calls madvise(..., MADV_HWPOISON) to inject a memory failure.  It just shows
that the program is killed by SIGBUS; I cannot get any detail from it.
So, could you please show me the right way (test tools) to test it?



--
Thanks,
Ruan Shiyang.



thanks,
-jane










Re: [RFC PATCH v3 9/9] xfs: Implement ->corrupted_range() for XFS

2020-12-17 Thread Ruan Shiyang
   if (!VFS_I(ip)->i_mapping)
+   goto out;
+
+   if (IS_DAX(VFS_I(ip)))
+   rc = mf_dax_mapping_kill_procs(VFS_I(ip)->i_mapping,
+  rec->rm_offset, *flags);


If the file isn't S_DAX, should we call mapping_set_error here so
that the next fsync() will also return EIO?


Nice idea.  I will try.
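A sketch of what that could look like inside the helper quoted above (the
else branch is the addition; the exact placement is an assumption, not the
eventual patch):

```c
	if (IS_DAX(VFS_I(ip)))
		rc = mf_dax_mapping_kill_procs(VFS_I(ip)->i_mapping,
					       rec->rm_offset, *flags);
	else
		/* non-DAX: make the next fsync() on the file return -EIO too */
		mapping_set_error(VFS_I(ip)->i_mapping, -EIO);
```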




+
+   // TODO try to fix data
+out:
+   xfs_irele(ip);
+   }
+
+   return rc;
+}
+
+static int
+xfs_fs_corrupted_range(
+   struct super_block  *sb,
+   struct block_device *bdev,
+   loff_t  offset,
+   size_t  len,
+   void*data)
+{
+   struct xfs_mount*mp = XFS_M(sb);
+   struct xfs_trans*tp = NULL;
+   struct xfs_btree_cur*cur = NULL;
+   struct xfs_rmap_irecrmap_low, rmap_high;
+   struct xfs_buf  *agf_bp = NULL;
+   xfs_fsblock_t   fsbno = XFS_B_TO_FSB(mp, offset);
+   xfs_filblks_t   bc = XFS_B_TO_FSB(mp, len);
+   xfs_agnumber_t  agno = XFS_FSB_TO_AGNO(mp, fsbno);
+   xfs_agblock_t   agbno = XFS_FSB_TO_AGBNO(mp, fsbno);
+   int rc = 0;
+
+   if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_bdev == bdev) {
+   xfs_warn(mp, "storage lost support not available for realtime 
device!");
+   return 0;
+   }


This ought to kill the fs when an external log device is configured:

if (mp->m_logdev_targp &&
mp->m_logdev_targp != mp->m_ddev_targp &&
mp->m_logdev_targp->bt_bdev == bdev) {
xfs_error(mp, "ondisk log corrupt, shutting down fs!");
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_META);
return 0;
}

Then, we need to check explicitly for rmap support:

if (!xfs_sb_version_hasrmapbt(>m_sb))
return 0;

so that we screen out filesystems that don't have rmap enabled.


Ah...  I neglected to think of this.  That's very thoughtful of
you.  Thanks.





+   rc = xfs_trans_alloc_empty(mp, &tp);
+   if (rc)
+   return rc;
+
+   rc = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
+   if (rc)
+   return rc;
+
+   cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agno);
+
+   /* Construct a range for rmap query */
+   memset(&rmap_low, 0, sizeof(rmap_low));
+   memset(&rmap_high, 0xFF, sizeof(rmap_high));
+   rmap_low.rm_startblock = rmap_high.rm_startblock = agbno;
+   rmap_low.rm_blockcount = rmap_high.rm_blockcount = bc;
+
+   rc = xfs_rmap_query_range(cur, &rmap_low, &rmap_high, xfs_corrupt_helper, data);
+   if (rc == -ECANCELED)
+   rc = 0;


I don't think anything returns ECANCELED here...


Yes.  Will remove it.


--
Thanks,
Ruan Shiyang.


--D


+   if (rc == -EFSCORRUPTED)
+   xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_META);
+
+   xfs_btree_del_cursor(cur, rc);
+   xfs_trans_brelse(tp, agf_bp);
+   return rc;
+}
+
  static const struct super_operations xfs_super_operations = {
.alloc_inode= xfs_fs_alloc_inode,
.destroy_inode  = xfs_fs_destroy_inode,
@@ -1116,6 +1208,7 @@ static const struct super_operations xfs_super_operations 
= {
.show_options   = xfs_fs_show_options,
.nr_cached_objects  = xfs_fs_nr_cached_objects,
.free_cached_objects= xfs_fs_free_cached_objects,
+   .corrupted_range= xfs_fs_corrupted_range,
  };
  
  static int

--
2.29.2










Re: [RFC PATCH v3 8/9] md: Implement ->corrupted_range()

2020-12-17 Thread Ruan Shiyang



On 2020/12/16 上午4:51, Darrick J. Wong wrote:

On Tue, Dec 15, 2020 at 08:14:13PM +0800, Shiyang Ruan wrote:

With the support of ->rmap(), it is possible to obtain the superblock on
a mapped device.

If a pmem device is used as one target of mapped device, we cannot
obtain its superblock directly.  With the help of SYSFS, the mapped
device can be found on the target devices.  So, we iterate the
bdev->bd_holder_disks to obtain its mapped device.

Signed-off-by: Shiyang Ruan 
---
  drivers/md/dm.c   | 66 +++
  drivers/nvdimm/pmem.c |  9 --
  fs/block_dev.c| 21 ++
  include/linux/genhd.h |  7 +
  4 files changed, 100 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 4e0cbfe3f14d..9da1f9322735 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -507,6 +507,71 @@ static int dm_blk_report_zones(struct gendisk *disk, 
sector_t sector,
  #define dm_blk_report_zones   NULL
  #endif /* CONFIG_BLK_DEV_ZONED */
  
+struct dm_blk_corrupt {

+   struct block_device *bdev;
+   sector_t offset;
+};
+
+static int dm_blk_corrupt_fn(struct dm_target *ti, struct dm_dev *dev,
+   sector_t start, sector_t len, void *data)
+{
+   struct dm_blk_corrupt *bc = data;
+
+   return bc->bdev == (void *)dev->bdev &&
+   (start <= bc->offset && bc->offset < start + len);
+}
+
+static int dm_blk_corrupted_range(struct gendisk *disk,
+ struct block_device *target_bdev,
+ loff_t target_offset, size_t len, void *data)
+{
+   struct mapped_device *md = disk->private_data;
+   struct block_device *md_bdev = md->bdev;
+   struct dm_table *map;
+   struct dm_target *ti;
+   struct super_block *sb;
+   int srcu_idx, i, rc = 0;
+   bool found = false;
+   sector_t disk_sec, target_sec = to_sector(target_offset);
+
+   map = dm_get_live_table(md, &srcu_idx);
+   if (!map)
+   return -ENODEV;
+
+   for (i = 0; i < dm_table_get_num_targets(map); i++) {
+   ti = dm_table_get_target(map, i);
+   if (ti->type->iterate_devices && ti->type->rmap) {
+   struct dm_blk_corrupt bc = {target_bdev, target_sec};
+
+   found = ti->type->iterate_devices(ti, dm_blk_corrupt_fn, &bc);
+   if (!found)
+   continue;
+   disk_sec = ti->type->rmap(ti, target_sec);


What happens if the dm device has multiple reverse mappings because the
physical storage is being shared at multiple LBAs?  (e.g. a
deduplication target)


I thought that the dm device knows the mapping relationship, and it can 
be done by implementation of ->rmap() in each target.  Did I understand 
it wrong?





+   break;
+   }
+   }
+
+   if (!found) {
+   rc = -ENODEV;
+   goto out;
+   }
+
+   sb = get_super(md_bdev);
+   if (!sb) {
+   rc = bd_disk_holder_corrupted_range(md_bdev, 
to_bytes(disk_sec), len, data);
+   goto out;
+   } else if (sb->s_op->corrupted_range) {
+   loff_t off = to_bytes(disk_sec - get_start_sect(md_bdev));
+
+   rc = sb->s_op->corrupted_range(sb, md_bdev, off, len, data);


This "call bd_disk_holder_corrupted_range or sb->s_op->corrupted_range"
logic appears twice; should it be refactored into a common helper?

Or, should the superblock dispatch part move to
bd_disk_holder_corrupted_range?


bd_disk_holder_corrupted_range() requires SYSFS configuration.  I
introduced it to handle those block devices whose superblock cannot be
obtained via `get_super()`.


Usually, if we create a filesystem directly on a pmem device, or make some
partitions first, we can use `get_super()` to get the superblock.  In
other cases, such as an LVM created on a pmem device, `get_super()` does
not work.


So, I think refactoring it into a common helper looks better.


--
Thanks,
Ruan Shiyang.




+   }
+   drop_super(sb);
+
+out:
+   dm_put_live_table(md, srcu_idx);
+   return rc;
+}
+
  static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
struct block_device **bdev)
  {
@@ -3084,6 +3149,7 @@ static const struct block_device_operations dm_blk_dops = 
{
.getgeo = dm_blk_getgeo,
.report_zones = dm_blk_report_zones,
.pr_ops = _pr_ops,
+   .corrupted_range = dm_blk_corrupted_range,
.owner = THIS_MODULE
  };
  
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c

index 4688bff19c20..e8cfaf860149 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -267,11 +267,14 @@ static int pmem_corrupted_range(struct gendisk *disk, 
struct block_devi

Re: [RFC PATCH v3 8/9] md: Implement ->corrupted_range()

2020-12-17 Thread Ruan Shiyang



On 2020/12/16 下午1:43, Jane Chu wrote:

On 12/15/2020 4:14 AM, Shiyang Ruan wrote:

  #ifdef CONFIG_SYSFS
+int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t 
off,

+   size_t len, void *data);
  int bd_link_disk_holder(struct block_device *bdev, struct gendisk 
*disk);
  void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk 
*disk);

  #else
+int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t 
off,


Did you mean
   static inline int bd_disk_holder_corrupted_range(..
?


Yes, it's my fault.  Thanks a lot.


--
Thanks,
Ruan Shiyang.



thanks,
-jane


+   size_t len, void *data)
+{
+    return 0;
+}
  static inline int bd_link_disk_holder(struct block_device *bdev,
    struct gendisk *disk)







Re: [RFC PATCH v3 4/9] mm, fsdax: Refactor memory-failure handler for dax mapping

2020-12-17 Thread Ruan Shiyang



On 2020/12/17 5:26 AM, Dave Chinner wrote:

On Tue, Dec 15, 2020 at 08:14:09PM +0800, Shiyang Ruan wrote:

The current memory_failure_dev_pagemap() can only handle a single-mapped
dax page in fsdax mode.  The dax page could be mapped by multiple files
and offsets if we let the reflink feature & fsdax mode work together.  So,
we refactor the current implementation to support handling memory failure on
each file and offset.

Signed-off-by: Shiyang Ruan 
---

.

  static const char *action_name[] = {
@@ -1147,6 +1148,60 @@ static int try_to_split_thp_page(struct page *page, 
const char *msg)
return 0;
  }
  
+int mf_dax_mapping_kill_procs(struct address_space *mapping, pgoff_t index, int flags)

+{
+   const bool unmap_success = true;
+   unsigned long pfn, size = 0;
+   struct to_kill *tk;
+   LIST_HEAD(to_kill);
+   int rc = -EBUSY;
+   loff_t start;
+   dax_entry_t cookie;
+
+   /*
+* Prevent the inode from being freed while we are interrogating
+* the address_space, typically this would be handled by
+* lock_page(), but dax pages do not use the page lock. This
+* also prevents changes to the mapping of this pfn until
+* poison signaling is complete.
+*/
+   cookie = dax_lock(mapping, index, &pfn);
+   if (!cookie)
+   goto unlock;


Why do we need to prevent the inode from going away here? This
function gets called by XFS after doing an xfs_iget() call to grab
the inode that owns the block. Hence the inode (and the mapping)
are guaranteed to be referenced and can't go away. Hence for the
filesystem based callers, this whole "dax_lock()" thing can go away.

So, AFAICT, the dax_lock() stuff is only necessary when the
filesystem can't be used to resolve the owner of physical page that
went bad.


Yes, you are right.  I made a mistake in the calling sequence here. 
Thanks for pointing it out.



--
Thanks,
Ruan Shiyang.



Cheers,

Dave.





Re: [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink

2020-12-15 Thread Ruan Shiyang

Hi Jane

On 2020/12/15 4:58 AM, Jane Chu wrote:

Hi, Shiyang,

On 11/22/2020 4:41 PM, Shiyang Ruan wrote:

This patchset is an attempt to resolve the problem of tracking shared pages
for fsdax.

Change from v1:
   - Introduce ->block_lost() for block device
   - Support mapped device
   - Add 'not available' warning for realtime device in XFS
   - Rebased to v5.10-rc1

This patchset moves owner tracking from dax_associate_entry() to the pmem
device, by introducing an interface ->memory_failure() of struct
pagemap.  The interface is called by memory_failure() in mm, and
implemented by the pmem device.  Then the pmem device calls its ->block_lost()
to find the filesystem in which the damaged page is located, and calls
->storage_lost() to track files or metadata associated with this page.
Finally we are able to try to fix the damaged data in the filesystem and do


Does that mean clearing poison?  If so, would you mind elaborating on 
which change specifically does that?


Recovering data for the filesystem (or pmem device) is not done in 
this patchset...  Here I only trigger the handler for the files sharing the 
corrupted page.



--
Thanks,
Ruan Shiyang.



Thanks!
-jane


other necessary processing, such as killing the processes that are using
the affected files.







Re: [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink

2020-12-01 Thread Ruan Shiyang
is also reused to 
handle anonymous pages.  If we move the recovery function to the 
address_space ops, I think we also need to refactor the existing handler 
for page cache mappings, which may affect anonymous page handling.  This 
makes me confused...



I rewrote the call trace:
memory_failure()
 * dax mapping case
 pgmap->ops->memory_failure()  =>
   pmem_pgmap_memory_failure()
  gendisk->fops->block_corrupt_range() =>
   - pmem_block_corrupt_range()
   - md_blk_block_corrupt_range()
   sb->s_ops->storage_corrupt_range()  =>
   xfs_fs_storage_corrupt_range()
xfs_rmap_query_range()
 xfs_storage_lost_helper()
  mapping->a_ops->corrupt_range()  =>
   xfs_dax_aops.xfs_dax_corrupt_range
   memory_failure_dev_pagemap_kill_procs()

 * page cache mapping case
 mapping->a_ops->corrupt_range()   =>
   xfs_address_space_operations.xfs_xxx
  memory_failure_generic_kill_procs()

It's rough and not complete yet.  I'm hoping for your comments.

--
Thanks,
Ruan Shiyang.



Cheers,

Dave.





Re: [RFC PATCH 1/4] fs: introduce ->storage_lost() for memory-failure

2020-09-15 Thread Ruan Shiyang



On 2020/9/16 12:16 AM, Darrick J. Wong wrote:

On Tue, Sep 15, 2020 at 06:13:08PM +0800, Shiyang Ruan wrote:

This function is used to handle errors which may cause data loss in the
filesystem, such as memory-failure in fsdax mode.

In XFS, it requires the "rmapbt" feature in order to query for files or
metadata associated with the error block.  Then we could call fs
recovery functions to try to repair the damaged data (not implemented
in this patch).

After that, the memory-failure also needs to kill the processes that are
using those files.  The struct mf_recover_controller is created to store
the necessary parameters.

Signed-off-by: Shiyang Ruan 
---
  fs/xfs/xfs_super.c | 80 ++
  include/linux/fs.h |  1 +
  include/linux/mm.h |  6 
  3 files changed, 87 insertions(+)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 71ac6c1cdc36..118d9c5d9e1e 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -35,6 +35,10 @@
  #include "xfs_refcount_item.h"
  #include "xfs_bmap_item.h"
  #include "xfs_reflink.h"
+#include "xfs_alloc.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_bit.h"
  
  #include 

  #include 
@@ -1104,6 +1108,81 @@ xfs_fs_free_cached_objects(
return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
  }
  
+static int

+xfs_storage_lost_helper(
+   struct xfs_btree_cur*cur,
+   struct xfs_rmap_irec*rec,
+   void*priv)
+{
+   struct xfs_inode*ip;
+   struct mf_recover_controller*mfrc = priv;
+   int rc = 0;
+
+   if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner)) {
+   // TODO check and try to fix metadata
+   } else {
+   /*
+* Get files that incore, filter out others that are not in use.
+*/
+   xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner, XFS_IGET_INCORE,
+0, &ip);


Missing return value check here.


Yes, I ignored it.  My fault.
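
I will fix it with something like this (rough sketch; I still need to check 
whether a not-in-core inode shows up as an error code or as a NULL ip with 
XFS_IGET_INCORE):

		rc = xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner,
			      XFS_IGET_INCORE, 0, &ip);
		if (rc)
			return rc;
		if (!ip)
			return 0;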




+   if (!ip)
+   return 0;
+   if (!VFS_I(ip)->i_mapping)
+   goto out;
+
+   rc = mfrc->recover_fn(mfrc->pfn, mfrc->flags,
+ VFS_I(ip)->i_mapping, rec->rm_offset);
+
+   // TODO try to fix data
+out:
+   xfs_irele(ip);
+   }
+
+   return rc;
+}
+
+static int
+xfs_fs_storage_lost(
+   struct super_block  *sb,
+   loff_t  offset,


offset into which device?  XFS supports three...

I'm also a little surprised you don't pass in a length.

I guess that means this function will get called repeatedly for every
byte in the poisoned range?


The memory-failure is triggered on one PFN each time, so the length 
would be one PAGE_SIZE.  But I think you are right: it's better to tell the 
filesystem how large a range is affected and needs to be fixed.  There may 
also be other cases besides memory-failure, so the length is necessary.
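
So the prototype would become something like this (tentative, keeping the 
names used in this RFC):

	int (*storage_lost)(struct super_block *sb, loff_t offset,
			    size_t len, void *priv);

and memory-failure would pass PAGE_SIZE (or the size of the poisoned huge 
page) as len.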





+   void*priv)
+{
+   struct xfs_mount*mp = XFS_M(sb);
+   struct xfs_trans*tp = NULL;
+   struct xfs_btree_cur*cur = NULL;
+   struct xfs_rmap_irecrmap_low, rmap_high;
+   struct xfs_buf  *agf_bp = NULL;
+   xfs_fsblock_t   fsbno = XFS_B_TO_FSB(mp, offset);
+   xfs_agnumber_t  agno = XFS_FSB_TO_AGNO(mp, fsbno);
+   xfs_agblock_t   agbno = XFS_FSB_TO_AGBNO(mp, fsbno);
+   int error = 0;
+
+   error = xfs_trans_alloc_empty(mp, &tp);
+   if (error)
+   return error;
+
+   error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
+   if (error)
+   return error;
+
+   cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agno);


...and this is definitely the wrong call sequence if the malfunctioning
device is the realtime device.  If a dax rt device dies, you'll be
shooting down files on the data device, which will cause all sorts of
problems.


I didn't notice that.  Let me think about it.


Question: Should all this poison recovery stuff go into a new file?
xfs_poison.c?  There's already a lot of code in xfs_super.c.


Yes, it's a bit too much.  I'll move them into a new file.


--
Thanks,
Ruan Shiyang.


--D


+
+   /* Construct a range for rmap query */
+   memset(&rmap_low, 0, sizeof(rmap_low));
+   memset(&rmap_high, 0xFF, sizeof(rmap_high));
+   rmap_low.rm_startblock = rmap_high.rm_startblock = agbno;
+
+   error = xfs_rmap_query_range(cur, &rmap_low, &rmap_high,
+xfs_storage_lost_helper, priv);
+   if (error == -ECANCELED)
+   error = 0;
+
+   xfs_btree_del_cursor(cur, error);
+   xfs_trans_brelse(tp, agf_bp)

Re: [RFC PATCH 2/4] pagemap: introduce ->memory_failure()

2020-09-15 Thread Ruan Shiyang



On 2020/9/16 12:31 AM, Darrick J. Wong wrote:

On Tue, Sep 15, 2020 at 06:13:09PM +0800, Shiyang Ruan wrote:

When memory-failure occurs, we call this function, which is implemented
by each device.  For fsdax, the pmem device implements it.  The pmem device
will find out the block device where the error page is located, get the
filesystem on this block device, and finally call ->storage_lost() to
handle the error in the filesystem layer.

Normally, a pmem device may contain one or more partitions, each
partition contains a block device, and each block device contains a
filesystem.  So we are able to find the filesystem from an offset on
the pmem device.  However, in other cases, such as a mapped device, I
didn't find a way to obtain the filesystem lying on it.  It is a
problem that needs to be fixed.

Signed-off-by: Shiyang Ruan 
---
  block/genhd.c| 12 
  drivers/nvdimm/pmem.c| 31 +++
  include/linux/genhd.h|  2 ++
  include/linux/memremap.h |  3 +++
  4 files changed, 48 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 99c64641c314..e7442b60683e 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1063,6 +1063,18 @@ struct block_device *bdget_disk(struct gendisk *disk, 
int partno)
  }
  EXPORT_SYMBOL(bdget_disk);
  
+struct block_device *bdget_disk_sector(struct gendisk *disk, sector_t sector)

+{
+   struct block_device *bdev = NULL;
+   struct hd_struct *part = disk_map_sector_rcu(disk, sector);
+
+   if (part)
+   bdev = bdget(part_devt(part));
+
+   return bdev;
+}
+EXPORT_SYMBOL(bdget_disk_sector);
+
  /*
   * print a full list of all partitions - intended for places where the root
   * filesystem can't be mounted and thus to give the victim some idea of what
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index fab29b514372..3ed96486c883 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -364,9 +364,40 @@ static void pmem_release_disk(void *__pmem)
put_disk(pmem->disk);
  }
  
+static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,

+   struct mf_recover_controller *mfrc)
+{
+   struct pmem_device *pdev;
+   struct block_device *bdev;
+   sector_t disk_sector;
+   loff_t bdev_offset;
+
+   pdev = container_of(pgmap, struct pmem_device, pgmap);
+   if (!pdev->disk)
+   return -ENXIO;
+
+   disk_sector = (PFN_PHYS(mfrc->pfn) - pdev->phys_addr) >> SECTOR_SHIFT;


Ah, I see, looking at the current x86 MCE code, the MCE handler gets a
physical address, which is then rounded down to a PFN, which is then
blown back up into a byte address(?) and then rounded down to sectors.
That is then blown back up into a byte address and passed on to XFS,
which rounds it down to fs blocksize.

/me wishes that wasn't so convoluted, but reforming the whole mm poison
system to have smaller blast radii isn't the purpose of this patch. :)


+   bdev = bdget_disk_sector(pdev->disk, disk_sector);
+   if (!bdev)
+   return -ENXIO;
+
+   // TODO what if block device contains a mapped device


Find its dev_pagemap_ops and invoke its memory_failure function? ;)


Thanks for pointing this out.  I'll think about it along these lines.



+   if (!bdev->bd_super)
+   goto out;
+
+   bdev_offset = ((disk_sector - get_start_sect(bdev)) << SECTOR_SHIFT) -
+   pdev->data_offset;
+   bdev->bd_super->s_op->storage_lost(bdev->bd_super, bdev_offset, mfrc);


->storage_lost is required for all filesystems?


I think it is required for filesystems that support fsdax, since the 
owner tracking is moved here.  But in any case, there should be a non-NULL 
check before the call.
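
For example, something like this in pmem_pagemap_memory_failure() (sketch 
only):

	if (!bdev->bd_super || !bdev->bd_super->s_op->storage_lost)
		goto out;
	bdev->bd_super->s_op->storage_lost(bdev->bd_super, bdev_offset, mfrc);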



--
Thanks,
Ruan Shiyang.


--D


+
+out:
+   bdput(bdev);
+   return 0;
+}
+
  static const struct dev_pagemap_ops fsdax_pagemap_ops = {
.kill   = pmem_pagemap_kill,
.cleanup= pmem_pagemap_cleanup,
+   .memory_failure = pmem_pagemap_memory_failure,
  };
  
  static int pmem_attach_disk(struct device *dev,

diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 4ab853461dff..16e9e13e0841 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -303,6 +303,8 @@ static inline void add_disk_no_queue_reg(struct gendisk 
*disk)
  extern void del_gendisk(struct gendisk *gp);
  extern struct gendisk *get_gendisk(dev_t dev, int *partno);
  extern struct block_device *bdget_disk(struct gendisk *disk, int partno);
+extern struct block_device *bdget_disk_sector(struct gendisk *disk,
+   sector_t sector);
  
  extern void set_device_ro(struct block_device *bdev, int flag);

  extern void set_disk_ro(struct gendisk *disk, int flag);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 5f5b2df06e61..efebefa70d00 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ 

Re: [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink

2020-08-10 Thread Ruan Shiyang



On 2020/8/7 9:38 PM, Matthew Wilcox wrote:

On Fri, Aug 07, 2020 at 09:13:28PM +0800, Shiyang Ruan wrote:

This patchset is a try to resolve the problem of tracking shared page
for fsdax.

Instead of a per-page tracking method, this patchset introduces a query
interface, get_shared_files(), which is implemented by each FS, to
obtain the owners of a shared page.  It returns an owner list for this
shared page.  Then memory_failure() iterates the list to be able
to notify each process using the files that share this page.

The design of the tracking method is as follows:
1. dax_associate_entry() associates the owner's info with this page


I think that's the first problem with this design.  dax_associate_entry is
a horrendous idea which needs to be ripped out, not made more important.
It's all part of the general problem of trying to do something on a
per-page basis instead of per-extent basis.



The memory-failure handling needs to track owner info for a dax page, so I 
have to associate the owner with this page.  In this version, I associate 
the block device with the dax page, so that memory-failure is able to 
iterate the owners through the query interface provided by the filesystem.



--
Thanks,
Ruan Shiyang.







Re: [RFC PATCH 1/8] fs: introduce get_shared_files() for dax

2020-08-10 Thread Ruan Shiyang
is zero in a freshly created cursor.


+   /* Construct the range for one rmap search */
+   memset(&rmap_low, 0, sizeof(rmap_low));
+   memset(&rmap_high, 0xFF, sizeof(rmap_high));
+   rmap_low.rm_startblock = rmap_high.rm_startblock = bno;
+
+   error = xfs_rmap_query_range(cur, &rmap_low, &rmap_high,
+_get_shared_files_fn, list);
+   if (error == -ECANCELED)
+   error = 0;
+
+   xfs_btree_del_cursor(cur, error);
+   xfs_trans_brelse(tp, agf_bp);
+   return error;
+}


Looking at this, I don't think this is the right way to approach memory
poisoning.  Rather than allocating a (potentially huge) linked list and
passing it to the memory poison code to unmap pages, kill processes, and
free the list, I think:

1) "->get_shared_files" should be more targetted.  Call it ->storage_lost
or something, so that it only has one purpose, which is to react to
asynchronous notifications that storage has been lost.


Yes, it's better.  I only considered file tracking.


2) The inner _get_shared_files_fn should directly call back into the
memory manager to remove a poisoned page from the mapping and signal
whatever process might have it mapped.


For the error handling part, that sounds reasonable.  But we have to 
call dax_lock_page() at the beginning of memory-failure, which also needs 
to iterate all the owners in order to lock their dax entries.  I think 
the lock should be taken once at the beginning rather than inside each 
iteration.
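
What I mean is roughly this (sketch; sb, offset and priv are placeholders 
for whatever the callout ends up taking):

	/* take the dax entry lock once, then walk all the owners */
	cookie = dax_lock_page(page);
	if (!cookie)
		return -EBUSY;
	rc = sb->s_op->storage_lost(sb, offset, priv);
	dax_unlock_page(page, cookie);

i.e. lock the dax entry once up front for the whole walk instead of taking 
it for every owner inside the per-owner callback.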




That way, _get_shared_files_fn can look in the xfs buffer cache to see
if we have a copy in DRAM, and immediately write it back to pmem.


I didn't think of repairing the data on the storage device.  I will take 
it into consideration.




Hmm and now that you've gotten me rambling about hwpoison, I wonder what
happens if dram backing part of the xfs buffer cache goes bad...


Yes, there are many possible situations to consider.  For the current 
stage, we just shut down the filesystem if the memory failure hits 
metadata, and kill the user processes if it hits normal file data.  Is 
that OK?
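
Concretely, I imagine the policy in the rmap walk callback would be roughly 
(sketch):

	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner)) {
		/* the poison hit fs metadata: shut the filesystem down */
		xfs_force_shutdown(cur->bc_mp, SHUTDOWN_CORRUPT_INCORE);
		return -EFSCORRUPTED;
	}
	/* the poison hit user data: unmap and kill the processes using it */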



Anyway, thanks for reviewing.

--
Thanks,
Ruan Shiyang.



--D


+
  static const struct super_operations xfs_super_operations = {
.alloc_inode= xfs_fs_alloc_inode,
.destroy_inode  = xfs_fs_destroy_inode,
@@ -1110,6 +1176,7 @@ static const struct super_operations xfs_super_operations 
= {
.show_options   = xfs_fs_show_options,
.nr_cached_objects  = xfs_fs_nr_cached_objects,
.free_cached_objects= xfs_fs_free_cached_objects,
+   .get_shared_files   = xfs_fs_get_shared_files,
  };
  
  static int

diff --git a/include/linux/dax.h b/include/linux/dax.h
index 6904d4e0b2e0..0a85e321d6b4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -40,6 +40,13 @@ struct dax_operations {
  
  extern struct attribute_group dax_attribute_group;
  
+struct shared_files {

+   struct list_headlist;
+   struct address_space*mapping;
+   pgoff_t index;
+   dax_entry_t cookie;
+};
+
  #if IS_ENABLED(CONFIG_DAX)
  struct dax_device *dax_get_by_host(const char *host);
  struct dax_device *alloc_dax(void *private, const char *host,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f5abba86107d..81de3d2739b9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1977,6 +1977,8 @@ struct super_operations {
  struct shrink_control *);
long (*free_cached_objects)(struct super_block *,
struct shrink_control *);
+   int (*get_shared_files)(struct super_block *sb, pgoff_t offset,
+   struct list_head *list);
  };
  
  /*

--
2.27.0










Re: Reply: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

2020-06-04 Thread Ruan Shiyang



On 2020/6/5 9:30 AM, Dave Chinner wrote:

On Thu, Jun 04, 2020 at 07:51:07AM -0700, Darrick J. Wong wrote:

On Thu, Jun 04, 2020 at 03:37:42PM +0800, Ruan Shiyang wrote:



On 2020/4/28 2:43 PM, Dave Chinner wrote:

On Tue, Apr 28, 2020 at 06:09:47AM +, Ruan, Shiyang wrote:


On 2020/4/27 20:28:36, "Matthew Wilcox" wrote:


On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:

   This patchset is a try to resolve the shared 'page cache' problem for
   fsdax.

   In order to track multiple mappings and indexes on one page, I
   introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
   will be associated more than once if is shared.  At the second time we
   associate this entry, we create this rb-tree and store its root in
   page->private(not used in fsdax).  Insert (->mapping, ->index) when
   dax_associate_entry() and delete it when dax_disassociate_entry().


Do we really want to track all of this on a per-page basis?  I would
have thought a per-extent basis was more useful.  Essentially, create
a new address_space for each shared extent.  Per page just seems like
a huge overhead.


Per-extent tracking is a nice idea for me.  I haven't thought of it
yet...

But the extent info is maintained by filesystem.  I think we need a way
to obtain this info from FS when associating a page.  May be a bit
complicated.  Let me think about it...


That's why I want the -user of this association- to do a filesystem
callout instead of keeping it's own naive tracking infrastructure.
The filesystem can do an efficient, on-demand reverse mapping lookup
from it's own extent tracking infrastructure, and there's zero
runtime overhead when there are no errors present.


Hi Dave,

I ran into some difficulties when trying to implement the per-extent rmap
tracking.  So, I re-read your comments and found that I was misunderstanding
what you described here.

I think what you mean is: we don't need the in-memory dax-rmap tracking now.
Just ask the FS for the owner's information that associate with one page
when memory-failure.  So, the per-page (even per-extent) dax-rmap is
needless in this case.  Is this right?


Right.  XFS already has its own rmap tree.


*nod*


Based on this, we only need to store the extent information of a fsdax page
in its ->mapping (by searching from FS).  Then obtain the owners of this
page (also by searching from FS) when memory-failure or other rmap case
occurs.


I don't even think you need that much.  All you need is the "physical"
offset of that page within the pmem device (e.g. 'this is the 307th 4k
page == offset 1257472 since the start of /dev/pmem0') and xfs can look
up the owner of that range of physical storage and deal with it as
needed.


Right. If we have the dax device associated with the page that had
the failure, then we can determine the offset of the page into the
block device address space and that's all we need to find the owner
of the page in the filesystem.

Note that there may actually be no owner - the page that had the
fault might land in free space, in which case we can simply zero
the page and clear the error.


OK.  Thanks for pointing that out.




So, a fsdax page is no longer associated with a specific file, but with a
FS(or the pmem device).  I think it's easier to understand and implement.


Effectively, yes. But we shouldn't need to actually associate the
page with anything at the filesystem level because it is already
associated with a DAX device at a lower level via a dev_pagemap.
The hardware page fault already runs thought this code
memory_failure_dev_pagemap() before it gets to the DAX code, so
really all we need to is have that function pass us the page, offset
into the device and, say, the struct dax_device associated with that
page so we can get to the filesystem superblock we can then use for
rmap lookups on...



OK.  I was just thinking about how I can run the FS rmap search from 
memory-failure.  Thanks again for pointing it out. :)



--
Thanks,
Ruan Shiyang.


Cheers,

Dave.





Re: Reply: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

2020-06-04 Thread Ruan Shiyang



On 2020/6/4 10:51 PM, Darrick J. Wong wrote:

On Thu, Jun 04, 2020 at 03:37:42PM +0800, Ruan Shiyang wrote:



On 2020/4/28 2:43 PM, Dave Chinner wrote:

On Tue, Apr 28, 2020 at 06:09:47AM +, Ruan, Shiyang wrote:


On 2020/4/27 20:28:36, "Matthew Wilcox" wrote:


On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:

   This patchset is a try to resolve the shared 'page cache' problem for
   fsdax.

   In order to track multiple mappings and indexes on one page, I
   introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
   will be associated more than once if is shared.  At the second time we
   associate this entry, we create this rb-tree and store its root in
   page->private(not used in fsdax).  Insert (->mapping, ->index) when
   dax_associate_entry() and delete it when dax_disassociate_entry().


Do we really want to track all of this on a per-page basis?  I would
have thought a per-extent basis was more useful.  Essentially, create
a new address_space for each shared extent.  Per page just seems like
a huge overhead.


Per-extent tracking is a nice idea for me.  I haven't thought of it
yet...

But the extent info is maintained by filesystem.  I think we need a way
to obtain this info from FS when associating a page.  May be a bit
complicated.  Let me think about it...


That's why I want the -user of this association- to do a filesystem
callout instead of keeping it's own naive tracking infrastructure.
The filesystem can do an efficient, on-demand reverse mapping lookup
from it's own extent tracking infrastructure, and there's zero
runtime overhead when there are no errors present.


Hi Dave,

I ran into some difficulties when trying to implement the per-extent rmap
tracking.  So, I re-read your comments and found that I was misunderstanding
what you described here.

I think what you mean is: we don't need the in-memory dax-rmap tracking now.
Just ask the FS for the owner's information that associate with one page
when memory-failure.  So, the per-page (even per-extent) dax-rmap is
needless in this case.  Is this right?


Right.  XFS already has its own rmap tree.


Based on this, we only need to store the extent information of a fsdax page
in its ->mapping (by searching from FS).  Then obtain the owners of this
page (also by searching from FS) when memory-failure or other rmap case
occurs.


I don't even think you need that much.  All you need is the "physical"
offset of that page within the pmem device (e.g. 'this is the 307th 4k
page == offset 1257472 since the start of /dev/pmem0') and xfs can look
up the owner of that range of physical storage and deal with it as
needed.


Yes, I think so.




So, a fsdax page is no longer associated with a specific file, but with a
FS(or the pmem device).  I think it's easier to understand and implement.


Yes.  I also suspect this will be necessary to support reflink...

--D


OK, Thank you very much.


--
Thanks,
Ruan Shiyang.





--
Thanks,
Ruan Shiyang.


At the moment, this "dax association" is used to "report" a storage
media error directly to userspace. I say "report" because what it
does is kill userspace processes dead. The storage media error
actually needs to be reported to the owner of the storage media,
which in the case of FS-DAX is the filesytem.

That way the filesystem can then look up all the owners of that bad
media range (i.e. the filesystem block it corresponds to) and take
appropriate action. e.g.

- if it falls in filesytem metadata, shutdown the filesystem
- if it falls in user data, call the "kill userspace dead" routines
for each mapping/index tuple the filesystem finds for the given
LBA address that the media error occurred.

Right now if the media error is in filesystem metadata, the
filesystem isn't even told about it. The filesystem can't even shut
down - the error is just dropped on the floor and it won't be until
the filesystem next tries to reference that metadata that we notice
there is an issue.

Cheers,

Dave.











Re: Reply: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

2020-06-04 Thread Ruan Shiyang



On 2020/4/28 2:43 PM, Dave Chinner wrote:

On Tue, Apr 28, 2020 at 06:09:47AM +, Ruan, Shiyang wrote:


On 2020/4/27 20:28:36, "Matthew Wilcox" wrote:


On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:

  This patchset is a try to resolve the shared 'page cache' problem for
  fsdax.

  In order to track multiple mappings and indexes on one page, I
  introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
  will be associated more than once if is shared.  At the second time we
  associate this entry, we create this rb-tree and store its root in
  page->private(not used in fsdax).  Insert (->mapping, ->index) when
  dax_associate_entry() and delete it when dax_disassociate_entry().


Do we really want to track all of this on a per-page basis?  I would
have thought a per-extent basis was more useful.  Essentially, create
a new address_space for each shared extent.  Per page just seems like
a huge overhead.


Per-extent tracking is a nice idea for me.  I haven't thought of it
yet...

But the extent info is maintained by filesystem.  I think we need a way
to obtain this info from FS when associating a page.  May be a bit
complicated.  Let me think about it...


That's why I want the -user of this association- to do a filesystem
callout instead of keeping it's own naive tracking infrastructure.
The filesystem can do an efficient, on-demand reverse mapping lookup
from it's own extent tracking infrastructure, and there's zero
runtime overhead when there are no errors present.


Hi Dave,

I ran into some difficulties when trying to implement the per-extent 
rmap tracking.  So, I re-read your comments and found that I was 
misunderstanding what you described here.


I think what you mean is: we don't need the in-memory dax-rmap tracking 
now.  Just ask the FS for the owner information associated with a page 
when a memory failure occurs.  So, the per-page (or even per-extent) 
dax-rmap is unnecessary in this case.  Is that right?


Based on this, we only need to store the extent information of a fsdax 
page in its ->mapping (by querying the FS).  Then we obtain the owners of 
this page (also by querying the FS) when a memory failure or another rmap 
case occurs.


So, a fsdax page is no longer associated with a specific file, but with 
a FS (or the pmem device).  I think that's easier to understand and 
implement.



--
Thanks,
Ruan Shiyang.


At the moment, this "dax association" is used to "report" a storage
media error directly to userspace. I say "report" because what it
does is kill userspace processes dead. The storage media error
actually needs to be reported to the owner of the storage media,
which in the case of FS-DAX is the filesytem.

That way the filesystem can then look up all the owners of that bad
media range (i.e. the filesystem block it corresponds to) and take
appropriate action. e.g.

- if it falls in filesytem metadata, shutdown the filesystem
- if it falls in user data, call the "kill userspace dead" routines
   for each mapping/index tuple the filesystem finds for the given
   LBA address that the media error occurred.

Right now if the media error is in filesystem metadata, the
filesystem isn't even told about it. The filesystem can't even shut
down - the error is just dropped on the floor and it won't be until
the filesystem next tries to reference that metadata that we notice
there is an issue.

Cheers,

Dave.





Re: Reply: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

2020-04-28 Thread Ruan Shiyang



On 2020/4/28 2:43 PM, Dave Chinner wrote:

On Tue, Apr 28, 2020 at 06:09:47AM +, Ruan, Shiyang wrote:


On 2020/4/27 20:28:36, "Matthew Wilcox" wrote:


On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:

  This patchset is a try to resolve the shared 'page cache' problem for
  fsdax.

  In order to track multiple mappings and indexes on one page, I
  introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
  will be associated more than once if is shared.  At the second time we
  associate this entry, we create this rb-tree and store its root in
  page->private(not used in fsdax).  Insert (->mapping, ->index) when
  dax_associate_entry() and delete it when dax_disassociate_entry().


Do we really want to track all of this on a per-page basis?  I would
have thought a per-extent basis was more useful.  Essentially, create
a new address_space for each shared extent.  Per page just seems like
a huge overhead.


Per-extent tracking is a nice idea for me.  I haven't thought of it
yet...

But the extent info is maintained by filesystem.  I think we need a way
to obtain this info from FS when associating a page.  May be a bit
complicated.  Let me think about it...


That's why I want the -user of this association- to do a filesystem
callout instead of keeping it's own naive tracking infrastructure.
The filesystem can do an efficient, on-demand reverse mapping lookup
from it's own extent tracking infrastructure, and there's zero
runtime overhead when there are no errors present.

At the moment, this "dax association" is used to "report" a storage
media error directly to userspace. I say "report" because what it
does is kill userspace processes dead. The storage media error
actually needs to be reported to the owner of the storage media,
which in the case of FS-DAX is the filesytem.


Understood.

BTW, this covers the memory-failure use case, but what about rmap?  I have 
not found out how to use this tracking for rmap.  Do you have any ideas?




That way the filesystem can then look up all the owners of that bad
media range (i.e. the filesystem block it corresponds to) and take
appropriate action. e.g.


I tried writing a function to look up all the owners of one block 
in XFS for memory-failure use.  It was dropped from this patchset because 
the lookup needs the 'rmapbt' feature to be enabled at mkfs time, and 
rmapbt is disabled by default.  I am not sure whether that matters...
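
If we keep that approach, the lookup would at least need a guard like this 
(sketch):

	if (!xfs_sb_version_hasrmapbt(&mp->m_sb)) {
		xfs_warn(mp,
	"storage lost notification requires rmapbt, skipping");
		return -EOPNOTSUPP;
	}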




- if it falls in filesytem metadata, shutdown the filesystem
- if it falls in user data, call the "kill userspace dead" routines
   for each mapping/index tuple the filesystem finds for the given
   LBA address that the media error occurred.

Right now if the media error is in filesystem metadata, the
filesystem isn't even told about it. The filesystem can't even shut
down - the error is just dropped on the floor and it won't be until
the filesystem next tries to reference that metadata that we notice
there is an issue.


Understood.  Thanks.



Cheers,

Dave.




--
Thanks,
Ruan Shiyang.



Reply: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

2020-04-28 Thread Ruan, Shiyang

On 2020/4/27 20:28:36, "Matthew Wilcox" wrote:

>On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
>>  This patchset is a try to resolve the shared 'page cache' problem for
>>  fsdax.
>>
>>  In order to track multiple mappings and indexes on one page, I
>>  introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
>>  will be associated more than once if is shared.  At the second time we
>>  associate this entry, we create this rb-tree and store its root in
>>  page->private(not used in fsdax).  Insert (->mapping, ->index) when
>>  dax_associate_entry() and delete it when dax_disassociate_entry().
>
>Do we really want to track all of this on a per-page basis?  I would
>have thought a per-extent basis was more useful.  Essentially, create
>a new address_space for each shared extent.  Per page just seems like
>a huge overhead.
>
Per-extent tracking sounds like a nice idea to me.  I hadn't thought of it 
yet...

But the extent info is maintained by the filesystem.  I think we need a way 
to obtain this info from the FS when associating a page.  It may be a bit 
complicated.  Let me think about it...


--
Thanks,
Ruan Shiyang.
