Re: [PATCH v3 3/3] dax: Wake up all waiters after invalidating dax entry

2021-04-21 Thread Jan Kara
On Mon 19-04-21 17:36:36, Vivek Goyal wrote:
> I am seeing missed wakeups which ultimately lead to a deadlock when I am
> using virtiofs with DAX enabled and running "make -j". I had to mount
> virtiofs as rootfs and also reduce the dax window size to 256M to reproduce
> the problem consistently.
> 
> So here is the problem. put_unlocked_entry() wakes up waiters only
> if entry is non-NULL and !dax_is_conflict(entry). But if I
> call multiple instances of invalidate_inode_pages2() in parallel,
> then I can run into a situation where there are waiters on
> this index but nobody will wake them up.
> 
> invalidate_inode_pages2()
>   invalidate_inode_pages2_range()
> invalidate_exceptional_entry2()
>   dax_invalidate_mapping_entry_sync()
> __dax_invalidate_entry() {
> xas_lock_irq(&xas);
> entry = get_unlocked_entry(&xas, 0);
> ...
> ...
> dax_disassociate_entry(entry, mapping, trunc);
> xas_store(&xas, NULL);
> ...
> ...
> put_unlocked_entry(&xas, entry);
> xas_unlock_irq(&xas);
> }
> 
> Say a fault is in progress and it has locked the entry at offset "0x1c".
> Now say three instances of invalidate_inode_pages2() are in progress
> (A, B, C) and they all try to invalidate the entry at offset "0x1c". Given
> the dax entry is locked, all three instances A, B, C will wait in the
> wait queue.
> 
> When the dax fault finishes, say A is woken up. It will store a NULL entry
> at index "0x1c" and wake up B. When B comes along it will find "entry=0"
> at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). And
> this means put_unlocked_entry() will not wake up the next waiter, given
> the current code. And that means C continues to wait and is not woken
> up.
> 
> This patch fixes the issue by waking up all waiters when a dax entry
> has been invalidated. This seems to fix the deadlock I am facing
> and I can make forward progress.
> 
> Reported-by: Sergio Lopez 
> Fixes: ac401cc78242 ("dax: New fault locking")
> Suggested-by: Dan Williams 
> Signed-off-by: Vivek Goyal 
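
The lost wakeup is easy to reproduce in userspace with an ordinary pthread
condition variable standing in for the DAX entry waitqueue. The sketch below
is purely illustrative (none of the names come from fs/dax.c): run it with no
argument to watch the wake-one behaviour strand a waiter, or with "all" to
see the broadcast fix.

/*
 * Build: gcc -pthread wake.c -o wake
 * ./wake      -> typically "2 of 3 waiters finished (missed wakeup)"
 * ./wake all  -> "3 of 3 waiters finished (ok)"
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t waitq = PTHREAD_COND_INITIALIZER;
static bool entry_locked = true;	/* the page fault holds the entry */
static bool entry_present = true;	/* cleared by the first invalidator */
static bool wake_all;
static int done;

static void *invalidator(void *arg)
{
	long id = (long)arg;
	bool found;

	pthread_mutex_lock(&lock);
	while (entry_locked)			/* like get_unlocked_entry() */
		pthread_cond_wait(&waitq, &lock);
	found = entry_present;
	if (found) {
		entry_present = false;		/* like xas_store(&xas, NULL) */
		printf("waiter %ld invalidated the entry\n", id);
	} else {
		printf("waiter %ld saw no entry\n", id);
	}
	/* like put_unlocked_entry(): the buggy path wakes the next waiter
	 * only when a valid entry was found, so a waiter that saw NULL
	 * wakes nobody and the queue stalls */
	if (wake_all)
		pthread_cond_broadcast(&waitq);
	else if (found)
		pthread_cond_signal(&waitq);
	done++;
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t t[3];
	long i;

	wake_all = argc > 1 && !strcmp(argv[1], "all");
	for (i = 0; i < 3; i++)
		pthread_create(&t[i], NULL, invalidator, (void *)i);
	sleep(1);

	/* the fault finishes: unlock the entry and wake one waiter */
	pthread_mutex_lock(&lock);
	entry_locked = false;
	pthread_cond_signal(&waitq);
	pthread_mutex_unlock(&lock);

	sleep(1);
	pthread_mutex_lock(&lock);
	printf("%d of 3 waiters finished (%s)\n", done,
	       done == 3 ? "ok" : "missed wakeup");
	pthread_mutex_unlock(&lock);
	return 0;
}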

Looks good to me. Thanks for fixing this! Feel free to add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/dax.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index f19d76a6a493..cc497519be83 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -676,7 +676,7 @@ static int __dax_invalidate_entry(struct address_space *mapping,
>   mapping->nrexceptional--;
>   ret = 1;
>  out:
> - put_unlocked_entry(&xas, entry, WAKE_NEXT);
> + put_unlocked_entry(&xas, entry, WAKE_ALL);
>   xas_unlock_irq(&xas);
>   return ret;
>  }
> -- 
> 2.25.4
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v3 2/3] dax: Add a wakeup mode parameter to put_unlocked_entry()

2021-04-21 Thread Jan Kara
On Mon 19-04-21 17:36:35, Vivek Goyal wrote:
> As of now put_unlocked_entry() always wakes up the next waiter. In the
> next patch we want to wake up all waiters at one callsite. Hence, add a
> parameter to the function.
> 
> This patch does not introduce any change of behavior.
> 
> Suggested-by: Dan Williams 
> Signed-off-by: Vivek Goyal 

Looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/dax.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 00978d0838b1..f19d76a6a493 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -275,11 +275,12 @@ static void wait_entry_unlocked(struct xa_state *xas, void *entry)
>   finish_wait(wq, &ewait.wait);
>  }
>  
> -static void put_unlocked_entry(struct xa_state *xas, void *entry)
> +static void put_unlocked_entry(struct xa_state *xas, void *entry,
> +enum dax_entry_wake_mode mode)
>  {
>   /* If we were the only waiter woken, wake the next one */
>   if (entry && !dax_is_conflict(entry))
> - dax_wake_entry(xas, entry, WAKE_NEXT);
> + dax_wake_entry(xas, entry, mode);
>  }
>  
>  /*
> @@ -633,7 +634,7 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping,
>   entry = get_unlocked_entry(&xas, 0);
>   if (entry)
>   page = dax_busy_page(entry);
> - put_unlocked_entry(&xas, entry);
> + put_unlocked_entry(&xas, entry, WAKE_NEXT);
>   if (page)
>   break;
>   if (++scanned % XA_CHECK_SCHED)
> @@ -675,7 +676,7 @@ static int __dax_invalidate_entry(struct address_space *mapping,
>   mapping->nrexceptional--;
>   ret = 1;
>  out:
> - put_unlocked_entry(&xas, entry);
> + put_unlocked_entry(&xas, entry, WAKE_NEXT);
>   xas_unlock_irq(&xas);
>   return ret;
>  }
> @@ -954,7 +955,7 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
>   return ret;
>  
>   put_unlocked:
> - put_unlocked_entry(xas, entry);
> + put_unlocked_entry(xas, entry, WAKE_NEXT);
>   return ret;
>  }
>  
> @@ -1695,7 +1696,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
>   /* Did we race with someone splitting entry or so? */
>   if (!entry || dax_is_conflict(entry) ||
>   (order == 0 && !dax_is_pte_entry(entry))) {
> - put_unlocked_entry(&xas, entry);
> + put_unlocked_entry(&xas, entry, WAKE_NEXT);
>   xas_unlock_irq();
>   trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
> VM_FAULT_NOPAGE);
> -- 
> 2.25.4
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v3 1/3] dax: Add an enum for specifying dax wakeup mode

2021-04-21 Thread Jan Kara
On Mon 19-04-21 17:36:34, Vivek Goyal wrote:
> Dan mentioned that he is not very fond of passing around a boolean true/false
> to specify if only the next waiter should be woken up or all waiters should
> be woken up. He instead prefers that we introduce an enum and make it very
> explicit at the callsite itself. This makes the code easier to read.
> 
> This patch should not introduce any change of behavior.
> 
> Suggested-by: Dan Williams 
> Signed-off-by: Vivek Goyal 
> ---
>  fs/dax.c | 23 +--
>  1 file changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index b3d27fdc6775..00978d0838b1 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -144,6 +144,16 @@ struct wait_exceptional_entry_queue {
>   struct exceptional_entry_key key;
>  };
>  
> +/**
> + * enum dax_entry_wake_mode: waitqueue wakeup toggle
> + * @WAKE_NEXT: entry was not mutated
> + * @WAKE_ALL: entry was invalidated, or resized

Let's document the constants in terms of what they do, not when they are
expected to be called. So something like:

@WAKE_NEXT: wake only the first waiter in the waitqueue
@WAKE_ALL: wake all waiters in the waitqueue

Otherwise the patch looks good so feel free to add:

Reviewed-by: Jan Kara 

Honza

> + */
> +enum dax_entry_wake_mode {
> + WAKE_NEXT,
> + WAKE_ALL,
> +};
> +
>  static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
>   void *entry, struct exceptional_entry_key *key)
>  {
> @@ -182,7 +192,8 @@ static int wake_exceptional_entry_func(wait_queue_entry_t *wait,
>   * The important information it's conveying is whether the entry at
>   * this index used to be a PMD entry.
>   */
> -static void dax_wake_entry(struct xa_state *xas, void *entry, bool wake_all)
> +static void dax_wake_entry(struct xa_state *xas, void *entry,
> +enum dax_entry_wake_mode mode)
>  {
>   struct exceptional_entry_key key;
>   wait_queue_head_t *wq;
> @@ -196,7 +207,7 @@ static void dax_wake_entry(struct xa_state *xas, void *entry, bool wake_all)
>* must be in the waitqueue and the following check will see them.
>*/
>   if (waitqueue_active(wq))
> - __wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
> + __wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, &key);
>  }
>  
>  /*
> @@ -268,7 +279,7 @@ static void put_unlocked_entry(struct xa_state *xas, void *entry)
>  {
>   /* If we were the only waiter woken, wake the next one */
>   if (entry && !dax_is_conflict(entry))
> - dax_wake_entry(xas, entry, false);
> + dax_wake_entry(xas, entry, WAKE_NEXT);
>  }
>  
>  /*
> @@ -286,7 +297,7 @@ static void dax_unlock_entry(struct xa_state *xas, void *entry)
>   old = xas_store(xas, entry);
>   xas_unlock_irq(xas);
>   BUG_ON(!dax_is_locked(old));
> - dax_wake_entry(xas, entry, false);
> + dax_wake_entry(xas, entry, WAKE_NEXT);
>  }
>  
>  /*
> @@ -524,7 +535,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
>  
>   dax_disassociate_entry(entry, mapping, false);
>   xas_store(xas, NULL);   /* undo the PMD join */
> - dax_wake_entry(xas, entry, true);
> + dax_wake_entry(xas, entry, WAKE_ALL);
>   mapping->nrexceptional--;
>   entry = NULL;
>   xas_set(xas, index);
> @@ -937,7 +948,7 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
>   xas_lock_irq(xas);
>       xas_store(xas, entry);
>   xas_clear_mark(xas, PAGECACHE_TAG_DIRTY);
> - dax_wake_entry(xas, entry, false);
> + dax_wake_entry(xas, entry, WAKE_NEXT);
>  
>   trace_dax_writeback_one(mapping->host, index, count);
>   return ret;
> -- 
> 2.25.4
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] dax: Fix missed wakeup in put_unlocked_entry()

2021-04-19 Thread Jan Kara
> > > - dax_wake_entry(xas, entry, false);
> > > +   if (dax_is_conflict(entry))
> > > +   return;
> > > +
> > > +   dax_wake_entry(xas, entry, false);
> > 
> 
> Hi Dan,
> 
> > How does this work if entry is NULL? dax_entry_waitqueue() will not
> > know if it needs to adjust the index.
> 
> Wake waiters both at the current index as well as the PMD-adjusted index.
> It feels a little ugly though.
> 
> > I think the fix might be to
> > specify that put_unlocked_entry() in the invalidate path needs to do a
> > wake_up_all().
> 
> Doing a wake_up_all() when we invalidate an entry sounds good. I will give
> it a try.

Yeah, that's what I'd suggest as well. After invalidating an entry, there's no
point in letting other waiters sleep. Trying to optimize for thundering-herd
problems in the face of entry invalidation is really fragile, as you noticed.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 0/7] fsdax,xfs: Add reflink support for fsdax

2021-02-08 Thread Jan Kara
On Mon 08-02-21 01:09:17, Shiyang Ruan wrote:
> This patchset is an attempt to add CoW support for fsdax, taking XFS,
> which has both the reflink and fsdax features, as an example.
> 
> One of the key mechanisms that needs to be implemented in fsdax is CoW.
> We copy the data from the srcmap before we actually write data to the
> destination iomap, and we only copy the range in which data won't be
> changed.
> 
> Another mechanism is range comparison.  In the page cache case, readpage()
> is used to load on-disk data into the page cache in order to be able to
> compare data.  In the fsdax case, readpage() does not work, so we need
> another way to compare data, with direct access support.
> 
> With these two mechanisms implemented in fsdax, we are able to make reflink
> and fsdax work together in XFS.
> 
> Some of the patches are picked up from Goldwyn's patchset.  I made some
> changes to adapt them to this patchset.

How do you deal with HWPoison code trying to reverse-map struct page back
to inode-offset pair? This also needs to be fixed for reflink to work with
DAX.

Honza

> 
> (Rebased on v5.10)
> ==
> 
> Shiyang Ruan (7):
>   fsdax: Output address in dax_iomap_pfn() and rename it
>   fsdax: Introduce dax_copy_edges() for CoW
>   fsdax: Copy data before write
>   fsdax: Replace mmap entry in case of CoW
>   fsdax: Dedup file range to use a compare function
>   fs/xfs: Handle CoW for fsdax write() path
>   fs/xfs: Add dedupe support for fsdax
> 
>  fs/btrfs/reflink.c |   3 +-
>  fs/dax.c   | 188 ++---
>  fs/ocfs2/file.c|   2 +-
>  fs/remap_range.c   |  14 +--
>  fs/xfs/xfs_bmap_util.c |   6 +-
>  fs/xfs/xfs_file.c  |  30 ++-
>  fs/xfs/xfs_inode.c |   8 +-
>  fs/xfs/xfs_inode.h |   1 +
>  fs/xfs/xfs_iomap.c |   3 +-
>  fs/xfs/xfs_iops.c  |  11 ++-
>  fs/xfs/xfs_reflink.c   |  23 -
>  include/linux/dax.h|   5 ++
>  include/linux/fs.h |   9 +-
>  13 files changed, 270 insertions(+), 33 deletions(-)
> 
> -- 
> 2.30.0
> 
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: Expense of read_iter

2021-01-20 Thread Jan Kara
On Wed 20-01-21 15:47:00, Dave Chinner wrote:
> On Fri, Jan 15, 2021 at 05:40:43PM +0800, Zhongwei Cai wrote:
> > On Thu, 14 Jan 2021, Mikulas wrote:
> > For Ext4-dax, the overhead of dax_iomap_rw is significant
> > compared to the overhead of struct iov_iter. Although methods
> > proposed by Mikulas can eliminate the overhead of iov_iter
> > well, they can not be applied in Ext4-dax unless we implement an
> > internal "read" method in Ext4-dax.
> > 
> > For Ext4-dax, there could be two approaches to optimizing:
> > 1) implementing the internal "read" method without the complexity
> > of iterators and dax_iomap_rw;
> 
> Please do not go an re-invent the wheel just for ext4. If there's a
> problem in a shared path - ext2, FUSE and XFS all use dax_iomap_rw()
> as well, so any improvements to that path benefit all DAX users, not
> just ext4.
> 
> > 2) optimizing how dax_iomap_rw works.
> > Since dax_iomap_rw requires ext4_iomap_begin, which further involves
> > the iomap structure and others (e.g., journaling status locks in Ext4),
> > we think implementing the internal "read" method would be easier.
> 
> Maybe it is, but it's also very selfish. The DAX iomap path was
> written to be correct for all users, not necessarily to provide
> optimal performance. There will be lots of things that could be done
> to optimise it, so rather than creating a special snowflake in ext4
> that makes DAX in ext4 much harder to maintain for non-ext4 DAX
> developers, please work to improve the common DAX IO path and so
> provide the same benefit to all the filesystems that use it.

Yeah, I agree. I'm against an ext4-private solution for this read problem. And
I'm also against duplicating ->read_iter functionality in the ->read handler.
The maintenance burden of this code duplication is IMHO just too big. We
rather need to improve the generic code so that the fast path is faster.
And every filesystem will benefit because this is not ext4 specific
problem.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 08/10] md: Implement ->corrupted_range()

2021-01-06 Thread Jan Kara
On Thu 31-12-20 00:55:59, Shiyang Ruan wrote:
> With the support of ->rmap(), it is possible to obtain the superblock on
> a mapped device.
> 
> If a pmem device is used as one target of a mapped device, we cannot
> obtain its superblock directly.  With the help of SYSFS, the mapped
> device can be found on the target devices.  So, we iterate the
> bdev->bd_holder_disks to obtain its mapped device.
> 
> Signed-off-by: Shiyang Ruan 

Thanks for the patch. Two comments below.

> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 4688bff19c20..9f9a2f3bf73b 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -256,21 +256,16 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
>  static int pmem_corrupted_range(struct gendisk *disk, struct block_device *bdev,
>   loff_t disk_offset, size_t len, void *data)
>  {
> - struct super_block *sb;
>   loff_t bdev_offset;
>   sector_t disk_sector = disk_offset >> SECTOR_SHIFT;
> - int rc = 0;
> + int rc = -ENODEV;
>  
>   bdev = bdget_disk_sector(disk, disk_sector);
>   if (!bdev)
> - return -ENODEV;
> + return rc;
>  
>   bdev_offset = (disk_sector - get_start_sect(bdev)) << SECTOR_SHIFT;
> - sb = get_super(bdev);
> - if (sb && sb->s_op->corrupted_range) {
> - rc = sb->s_op->corrupted_range(sb, bdev, bdev_offset, len, 
> data);
> - drop_super(sb);
> - }
> + rc = bd_corrupted_range(bdev, bdev_offset, bdev_offset, len, data);
>  
>   bdput(bdev);
>   return rc;

This (and the fs/block_dev.c change below) is just refining the function
you've implemented in patch 6. I think it's confusing to split changes
like this - why not implement things correctly from the start in patch 6?

> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 9e84b1928b94..0e50f0e8e8af 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1171,6 +1171,27 @@ struct bd_holder_disk {
>   int refcnt;
>  };
>  
> +static int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,
> +   size_t len, void *data)
> +{
> + struct bd_holder_disk *holder;
> + struct gendisk *disk;
> + int rc = 0;
> +
> + if (list_empty(&(bdev->bd_holder_disks)))
> + return -ENODEV;

This will not compile for !CONFIG_SYSFS kernels. Not that it would be
common but still. Also I'm not sure whether using bd_holder_disks like this
is really the right thing to do (when it seems to be only a sysfs thing),
although admittedly I'm not aware of a better way of getting this
information.
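
One possible shape for the !CONFIG_SYSFS side, sketched only to illustrate
the usual kernel stub pattern (not a tested patch, and whether
bd_holder_disks is the right mechanism at all is the open question above):

#ifndef CONFIG_SYSFS
/* Without sysfs there is no holder-disk tracking to walk, so the lookup
 * degenerates to "no device found". */
static int bd_disk_holder_corrupted_range(struct block_device *bdev,
					  loff_t off, size_t len, void *data)
{
	return -ENODEV;
}
#endif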

Honza

> +
> + list_for_each_entry(holder, &bdev->bd_holder_disks, list) {
> + disk = holder->disk;
> + if (disk->fops->corrupted_range) {
> + rc = disk->fops->corrupted_range(disk, bdev, off, len, 
> data);
> + if (rc != -ENODEV)
> + break;
> + }
> + }
> + return rc;
> +}
> +
>  static struct bd_holder_disk *bd_find_holder_disk(struct block_device *bdev,
> struct gendisk *disk)
>  {
> @@ -1378,6 +1399,22 @@ void bd_set_nr_sectors(struct block_device *bdev, sector_t sectors)
>  }
>  EXPORT_SYMBOL(bd_set_nr_sectors);
>  
> +int bd_corrupted_range(struct block_device *bdev, loff_t disk_off, loff_t 
> bdev_off, size_t len, void *data)
> +{
> + struct super_block *sb = get_super(bdev);
> + int rc = 0;
> +
> + if (!sb) {
> + rc = bd_disk_holder_corrupted_range(bdev, disk_off, len, data);
> + return rc;
> + } else if (sb->s_op->corrupted_range)
> + rc = sb->s_op->corrupted_range(sb, bdev, bdev_off, len, data);
> + drop_super(sb);
> +
> + return rc;
> +}
> +EXPORT_SYMBOL(bd_corrupted_range);
> +
>  static void __blkdev_put(struct block_device *bdev, fmode_t mode, int 
> for_part);
>  
>  int bdev_disk_changed(struct block_device *bdev, bool invalidate)
> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> index ed06209008b8..42290470810d 100644
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -376,6 +376,8 @@ void revalidate_disk_size(struct gendisk *disk, bool verbose);
>  bool bdev_check_media_change(struct block_device *bdev);
>  int __invalidate_device(struct block_device *bdev, bool kill_dirty);
>  void bd_set_nr_sectors(struct block_device *bdev, sector_t sectors);

Re: [PATCH 05/10] mm, pmem: Implement ->memory_failure() in pmem driver

2021-01-06 Thread Jan Kara
> + if (pgmap->ops->memory_failure)
> + rc = pgmap->ops->memory_failure(pgmap, pfn, flags);
>  
> - list_for_each_entry(tk, &to_kill, nd)
> - if (tk->size_shift)
> - size = max(size, 1UL << tk->size_shift);
> - if (size) {
> - /*
> -  * Unmap the largest mapping to avoid breaking up
> -  * device-dax mappings which are constant size. The
> -  * actual size of the mapping being torn down is
> -  * communicated in siginfo, see kill_proc()
> -  */
> - start = (page->index << PAGE_SHIFT) & ~(size - 1);
> - unmap_mapping_range(page->mapping, start, start + size, 0);
> - }
> - kill_procs(&to_kill, flags & MF_MUST_KILL, !unmap_success, pfn, flags);
> - rc = 0;
> -unlock:
> - dax_unlock_page(page, cookie);
>  out:
>   /* drop pgmap ref acquired in caller */
>   put_dev_pagemap(pgmap);
> -- 
> 2.29.2
> 
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping

2021-01-06 Thread Jan Kara
> * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
>   */
>  static void add_to_kill(struct task_struct *tsk, struct page *p,
> -struct vm_area_struct *vma,
> -struct list_head *to_kill)
> + struct address_space *mapping, pgoff_t pgoff,
> + struct vm_area_struct *vma, struct list_head *to_kill)
>  {
>   struct to_kill *tk;
>  
> @@ -345,9 +348,12 @@ static void add_to_kill(struct task_struct *tsk, struct 
> page *p,
>   }
>  
>   tk->addr = page_address_in_vma(p, vma);
> - if (is_zone_device_page(p))
> - tk->size_shift = dev_pagemap_mapping_shift(p, vma);
> - else
> + if (is_zone_device_page(p)) {
> + if (is_device_fsdax_page(p))
> + tk->addr = vma->vm_start +
> + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);

It seems strange to use 'pgoff' for dax pages and not for any other page.
Why? I'd rather pass the correct pgoff from all callers of add_to_kill() and
avoid this special casing...

> + tk->size_shift = dev_pagemap_mapping_shift(p, vma, tk->addr);
> + } else
>   tk->size_shift = page_shift(compound_head(p));
>  
>   /*
> @@ -495,7 +501,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
>   if (!page_mapped_in_vma(page, vma))
>   continue;
>   if (vma->vm_mm == t->mm)
> - add_to_kill(t, page, vma, to_kill);
> + add_to_kill(t, page, NULL, 0, vma, to_kill);
>   }
>   }
>   read_unlock(&tasklist_lock);
> @@ -505,24 +511,19 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
>  /*
>   * Collect processes when the error hit a file mapped page.
>   */
> -static void collect_procs_file(struct page *page, struct list_head *to_kill,
> - int force_early)
> +static void collect_procs_file(struct page *page, struct address_space *mapping,
> + pgoff_t pgoff, struct list_head *to_kill, int force_early)
>  {
>   struct vm_area_struct *vma;
>   struct task_struct *tsk;
> - struct address_space *mapping = page->mapping;
> - pgoff_t pgoff;
>  
>   i_mmap_lock_read(mapping);
>   read_lock(&tasklist_lock);
> - pgoff = page_to_pgoff(page);
>   for_each_process(tsk) {
>   struct task_struct *t = task_early_kill(tsk, force_early);
> -
>   if (!t)
>   continue;
> - vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
> -   pgoff) {
> + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>   /*
>* Send early kill signal to tasks where a vma covers
>* the page but the corrupted page is not necessarily
> @@ -531,7 +532,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
>* to be informed of all such data corruptions.
>*/
>   if (vma->vm_mm == t->mm)
> - add_to_kill(t, page, vma, to_kill);
> + add_to_kill(t, page, mapping, pgoff, vma, 
> to_kill);
>   }
>   }
>   read_unlock(&tasklist_lock);
> @@ -550,7 +551,8 @@ static void collect_procs(struct page *page, struct list_head *tokill,
>   if (PageAnon(page))
>   collect_procs_anon(page, tokill, force_early);
>   else
> - collect_procs_file(page, tokill, force_early);
> + collect_procs_file(page, page->mapping, page_to_pgoff(page),

Why not use page_mapping() helper here? It would be safer for THPs if they
ever get here...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] mm/up: combine put_compound_head() and unpin_user_page()

2020-12-10 Thread Jan Kara
On Wed 09-12-20 15:13:57, Jason Gunthorpe wrote:
> These functions accomplish the same thing but have different
> implementations.
> 
> unpin_user_page() has a bug where it calls mod_node_page_state() after
> calling put_page() which creates a risk that the page could have been
> hot-unplugged from the system.
> 
> Fix this by using put_compound_head() as the only implementation.
> 
> __unpin_devmap_managed_user_page() and related can be deleted as well in
> favour of the simpler, but slower, version in put_compound_head() that has
> an extra atomic page_ref_sub, but always calls put_page() which internally
> contains the special devmap code.
> 
> Move put_compound_head() to be directly after try_grab_compound_head() so
> people can find it in future.
> 
> Fixes: 1970dc6f5226 ("mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) 
> reporting")
> Signed-off-by: Jason Gunthorpe 

Nice cleanup. The patch looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  mm/gup.c | 103 +--
>  1 file changed, 23 insertions(+), 80 deletions(-)
> 
> With Matt's folio idea I'd next go on to make a
>   put_folio(folio, refs)
> 
> Which would cleanly eliminate that extra atomic here without duplicating the
> devmap special case.
> 
> This should also be called 'ungrab_compound_head' as we seem to be using the
> word 'grab' to mean 'pin or get' depending on GUP flags.
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 98eb8e6d2609c3..7b33b7d4b324d7 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -123,6 +123,28 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
>   return NULL;
>  }
>  
> +static void put_compound_head(struct page *page, int refs, unsigned int flags)
> +{
> + if (flags & FOLL_PIN) {
> + mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED,
> + refs);
> +
> + if (hpage_pincount_available(page))
> + hpage_pincount_sub(page, refs);
> + else
> + refs *= GUP_PIN_COUNTING_BIAS;
> + }
> +
> + VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
> + /*
> +  * Calling put_page() for each ref is unnecessarily slow. Only the last
> +  * ref needs a put_page().
> +  */
> + if (refs > 1)
> + page_ref_sub(page, refs - 1);
> + put_page(page);
> +}
> +
>  /**
>   * try_grab_page() - elevate a page's refcount by a flag-dependent amount
>   *
> @@ -177,41 +199,6 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
>   return true;
>  }
>  
> -#ifdef CONFIG_DEV_PAGEMAP_OPS
> -static bool __unpin_devmap_managed_user_page(struct page *page)
> -{
> - int count, refs = 1;
> -
> - if (!page_is_devmap_managed(page))
> - return false;
> -
> - if (hpage_pincount_available(page))
> - hpage_pincount_sub(page, 1);
> - else
> - refs = GUP_PIN_COUNTING_BIAS;
> -
> - count = page_ref_sub_return(page, refs);
> -
> - mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
> - /*
> -  * devmap page refcounts are 1-based, rather than 0-based: if
> -  * refcount is 1, then the page is free and the refcount is
> -  * stable because nobody holds a reference on the page.
> -  */
> - if (count == 1)
> - free_devmap_managed_page(page);
> - else if (!count)
> - __put_page(page);
> -
> - return true;
> -}
> -#else
> -static bool __unpin_devmap_managed_user_page(struct page *page)
> -{
> - return false;
> -}
> -#endif /* CONFIG_DEV_PAGEMAP_OPS */
> -
>  /**
>   * unpin_user_page() - release a dma-pinned page
>   * @page:pointer to page to be released
> @@ -223,28 +210,7 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
>   */
>  void unpin_user_page(struct page *page)
>  {
> - int refs = 1;
> -
> - page = compound_head(page);
> -
> - /*
> -  * For devmap managed pages we need to catch refcount transition from
> -  * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
> -  * page is free and we need to inform the device driver through
> -  * callback. See include/linux/memremap.h and HMM for details.
> -  */
> - if (__unpin_devmap_managed_user_page(page))
> - return;
> -
> - if (hpage_pincount_available(page))
> - hpage_pincount_sub(page, 1);
> - else
> - refs = GUP_PIN_COUNTING_BIAS;

Re: [RFC PATCH 1/3] fs: dax.c: move fs hole signifier from DAX_ZERO_PAGE to XA_ZERO_ENTRY

2020-11-30 Thread Jan Kara
On Mon 30-11-20 06:22:42, Amy Parker wrote:
> > > +/*
> > > + * A zero entry, XA_ZERO_ENTRY, is used to represent a zero page. This
> > > + * definition helps with checking if an entry is a PMD size.
> > > + */
> > > +#define XA_ZERO_PMD_ENTRY DAX_PMD | (unsigned long)XA_ZERO_ENTRY
> > > +
> >
> > Firstly, if you define a macro, we usually wrap it inside braces like:
> >
> > #define XA_ZERO_PMD_ENTRY (DAX_PMD | (unsigned long)XA_ZERO_ENTRY)
> >
> > to avoid unexpected issues when macro expands and surrounding operators
> > have higher priority.
> 
> Oops! Must've missed that - I'll make sure to get on that when
> revising this patch.
> 
> > Secondly, I don't think you can combine XA_ZERO_ENTRY with DAX_PMD (or any
> > other bits for that matter). XA_ZERO_ENTRY is defined as
> > xa_mk_internal(257) which is ((257 << 2) | 2) - DAX bits will overlap with
> > the bits xarray internal entries are using and things will break.
> 
> Could you provide an example of this overlap? I can't seem to find any.

Well XA_ZERO_ENTRY | DAX_PMD == ((257 << 2) | 2) | (1 << 1). So the way
you've defined XA_ZERO_PMD_ENTRY, the DAX_PMD bit will just get lost. AFAIU
(but Matthew might correct me here), for internal entries (and XA_ZERO_ENTRY
is one instance of such an entry) the low 10 bits of the entry value are
reserved for internal xarray usage, so DAX could use only the higher bits.
For classical value entries, only the lowest bit is reserved for xarray
usage; all the rest is available for the user (and so DAX uses it).
 
Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH 1/3] fs: dax.c: move fs hole signifier from DAX_ZERO_PAGE to XA_ZERO_ENTRY

2020-11-30 Thread Jan Kara
On Sat 28-11-20 20:36:29, Amy Parker wrote:
> DAX uses the DAX_ZERO_PAGE bit to represent holes in files. It could also use
> a single entry, such as XArray's XA_ZERO_ENTRY. This distinguishes zero pages
> and allows us to shift DAX_EMPTY down (see patch 2/3).
> 
> Signed-off-by: Amy Parker 

Thanks for the patch. The idea looks nice however I think technically there
are some problems with the patch. See below.

> +/*
> + * A zero entry, XA_ZERO_ENTRY, is used to represent a zero page. This
> + * definition helps with checking if an entry is a PMD size.
> + */
> +#define XA_ZERO_PMD_ENTRY DAX_PMD | (unsigned long)XA_ZERO_ENTRY
> +

Firstly, if you define a macro, we usually wrap it inside braces like:

#define XA_ZERO_PMD_ENTRY (DAX_PMD | (unsigned long)XA_ZERO_ENTRY)

to avoid unexpected issues when macro expands and surrounding operators
have higher priority.

Secondly, I don't think you can combine XA_ZERO_ENTRY with DAX_PMD (or any
other bits for that matter). XA_ZERO_ENTRY is defined as
xa_mk_internal(257) which is ((257 << 2) | 2) - DAX bits will overlap with
the bits xarray internal entries are using and things will break.

Honestly, I find it somewhat cumbersome to use xarray internal entries for
DAX purposes since all the locking (using DAX_LOCKED) and size checking
(using DAX_PMD) functions will have to special-case internal entries to
operate on different set of bits. It could be done, sure, but I'm not sure
it is worth the trouble for saving two bits (we could get rid of
DAX_ZERO_PAGE and DAX_EMPTY bits in this way) in DAX entries. But maybe
Matthew had some idea how to do this in an elegant way...

Honza

>  static unsigned long dax_to_pfn(void *entry)
>  {
>  return xa_to_value(entry) >> DAX_SHIFT;
> @@ -114,7 +119,7 @@ static bool dax_is_pte_entry(void *entry)
> 
>  static int dax_is_zero_entry(void *entry)
>  {
> -return xa_to_value(entry) & DAX_ZERO_PAGE;
> +return xa_to_value(entry) & (unsigned long)XA_ZERO_ENTRY;
>  }
> 
>  static int dax_is_empty_entry(void *entry)
> @@ -738,7 +743,7 @@ static void *dax_insert_entry(struct xa_state *xas,
>  if (dirty)
>  __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> 
> -if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
> +if (dax_is_zero_entry(entry) && !(flags & (unsigned long)XA_ZERO_ENTRY)) 
> {
>  unsigned long index = xas->xa_index;
>  /* we are replacing a zero page with block mapping */
>  if (dax_is_pmd_entry(entry))
> @@ -1047,7 +1052,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
>  vm_fault_t ret;
> 
>  *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> -DAX_ZERO_PAGE, false);
> +XA_ZERO_ENTRY, false);
> 
>  ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
>  trace_dax_load_hole(inode, vmf, ret);
> @@ -1434,7 +1439,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
> 
>  pfn = page_to_pfn_t(zero_page);
>  *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> -        DAX_PMD | DAX_ZERO_PAGE, false);
> +XA_ZERO_PMD_ENTRY, false);
> 
>  if (arch_needs_pgtable_deposit()) {
>  pgtable = pte_alloc_one(vma->vm_mm);
> -- 
> 2.29.2
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2] ext4/xfs: add page refcount helper

2020-10-08 Thread Jan Kara
On Wed 07-10-20 14:49:25, Ralph Campbell wrote:
> There are several places where ZONE_DEVICE struct pages assume a reference
> count == 1 means the page is idle and free. Instead of open coding this,
> add helper functions to hide this detail.
> 
> Signed-off-by: Ralph Campbell 
> Reviewed-by: Christoph Hellwig 
> Acked-by: Darrick J. Wong 
> Acked-by: Theodore Ts'o  # for fs/ext4/inode.c

The patch looks good to me. Feel free to add:

Reviewed-by: Jan Kara 

Honza

> ---
> 
> Changes in v2:
> I strongly resisted the idea of extending this patch but after Jan
> Kara's comment about there being more places that could be cleaned
> up, I felt compelled to make this one tensy wensy change to add
> a dax_wakeup_page() to match the dax_wait_page().
> I kept the Reviewed/Acked-bys since I don't think this substantially
> changes the patch.
> 
>  fs/dax.c|  4 ++--
>  fs/ext4/inode.c |  5 +
>  fs/xfs/xfs_file.c   |  4 +---
>  include/linux/dax.h | 15 +++
>  mm/memremap.c   |  3 ++-
>  5 files changed, 21 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 5b47834f2e1b..85c63f735909 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>   for_each_mapped_pfn(entry, pfn) {
>   struct page *page = pfn_to_page(pfn);
>  
> - WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> + WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page));
>   WARN_ON_ONCE(page->mapping && page->mapping != mapping);
>   page->mapping = NULL;
>   page->index = 0;
> @@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry)
>   for_each_mapped_pfn(entry, pfn) {
>   struct page *page = pfn_to_page(pfn);
>  
> - if (page_ref_count(page) > 1)
> + if (!dax_layout_is_idle_page(page))
>   return page;
>   }
>   return NULL;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 771ed8b1fadb..132620cbfa13 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3937,10 +3937,7 @@ int ext4_break_layouts(struct inode *inode)
>   if (!page)
>   return 0;
>  
> - error = ___wait_var_event(&page->_refcount,
> - atomic_read(&page->_refcount) == 1,
> - TASK_INTERRUPTIBLE, 0, 0,
> - ext4_wait_dax_page(ei));
> + error = dax_wait_page(ei, page, ext4_wait_dax_page);
>   } while (error == 0);
>  
>   return error;
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 3d1b95124744..a5304aaeaa3a 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -749,9 +749,7 @@ xfs_break_dax_layouts(
>   return 0;
>  
>   *retry = true;
> - return ___wait_var_event(&page->_refcount,
> - atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
> - 0, 0, xfs_wait_dax_page(inode));
> + return dax_wait_page(inode, page, xfs_wait_dax_page);
>  }
>  
>  int
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index b52f084aa643..e2da78e87338 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -243,6 +243,21 @@ static inline bool dax_mapping(struct address_space *mapping)
>   return mapping->host && IS_DAX(mapping->host);
>  }
>  
> +static inline bool dax_layout_is_idle_page(struct page *page)
> +{
> + return page_ref_count(page) == 1;
> +}
> +
> +static inline void dax_wakeup_page(struct page *page)
> +{
> + wake_up_var(&page->_refcount);
> +}
> +
> +#define dax_wait_page(_inode, _page, _wait_cb) \
> + ___wait_var_event(&(_page)->_refcount,  \
> + dax_layout_is_idle_page(_page), \
> + TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))
> +
>  #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
>  void hmem_register_device(int target_nid, struct resource *r);
>  #else
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 2bb276680837..504a10ff2edf 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -12,6 +12,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  static DEFINE_XARRAY(pgmap_array);
>  
> @@ -508,7 +509,7 @@ void free_devmap_managed_page(struct page *page)
>  {
>   /* notify page idle for dax */
>   if (!is_device_private_page(page)) {
> - wake_up_var(&page->_refcount);
> + dax_wakeup_page(page);
>   return;
>   }
>  
> -- 
> 2.20.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] ext4/xfs: add page refcount helper

2020-10-07 Thread Jan Kara
On Tue 06-10-20 16:09:30, Ralph Campbell wrote:
> There are several places where ZONE_DEVICE struct pages assume a reference
> count == 1 means the page is idle and free. Instead of open coding this,
> add a helper function to hide this detail.
> 
> Signed-off-by: Ralph Campbell 
> Reviewed-by: Christoph Hellwig 

Looks as sane direction but if we are going to abstract checks when
ZONE_DEVICE page is idle, we should also update e.g.
mm/swap.c:put_devmap_managed_page() or
mm/gup.c:__unpin_devmap_managed_user_page() (there may be more places like
this but I found at least these two...). Maybe Dan has more thoughts about
this.

Honza

> diff --git a/fs/dax.c b/fs/dax.c
> index 5b47834f2e1b..85c63f735909 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>   for_each_mapped_pfn(entry, pfn) {
>   struct page *page = pfn_to_page(pfn);
>  
> - WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> + WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page));
>   WARN_ON_ONCE(page->mapping && page->mapping != mapping);
>   page->mapping = NULL;
>   page->index = 0;
> @@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry)
>   for_each_mapped_pfn(entry, pfn) {
>   struct page *page = pfn_to_page(pfn);
>  
> - if (page_ref_count(page) > 1)
> + if (!dax_layout_is_idle_page(page))
>   return page;
>   }
>   return NULL;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 771ed8b1fadb..132620cbfa13 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3937,10 +3937,7 @@ int ext4_break_layouts(struct inode *inode)
>   if (!page)
>   return 0;
>  
> - error = ___wait_var_event(&page->_refcount,
> - atomic_read(&page->_refcount) == 1,
> - TASK_INTERRUPTIBLE, 0, 0,
> - ext4_wait_dax_page(ei));
> + error = dax_wait_page(ei, page, ext4_wait_dax_page);
>   } while (error == 0);
>  
>   return error;
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 3d1b95124744..a5304aaeaa3a 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -749,9 +749,7 @@ xfs_break_dax_layouts(
>   return 0;
>  
>   *retry = true;
> - return ___wait_var_event(&page->_refcount,
> - atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
> - 0, 0, xfs_wait_dax_page(inode));
> + return dax_wait_page(inode, page, xfs_wait_dax_page);
>  }
>  
>  int
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index b52f084aa643..8909a91cd381 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space *mapping)
>   return mapping->host && IS_DAX(mapping->host);
>  }
>  
> +static inline bool dax_layout_is_idle_page(struct page *page)
> +{
> + return page_ref_count(page) == 1;
> +}
> +
> +#define dax_wait_page(_inode, _page, _wait_cb) \
> + ___wait_var_event(&(_page)->_refcount,      \
> + dax_layout_is_idle_page(_page), \
> + TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))
> +
>  #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
>  void hmem_register_device(int target_nid, struct resource *r);
>  #else
> -- 
> 2.20.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)

2020-09-23 Thread Jan Kara
On Tue 22-09-20 12:46:05, Mikulas Patocka wrote:
> > mapping 2^21 blocks requires a 5 level indirect tree. Which one is going
> > to be faster to truncate away - a single record or 2 million individual 
> > blocks?
> > 
> > IOWs, we can take afford to take an extra cacheline miss or two on a
> > tree block search, because we're accessing and managing orders of
> > magnitude fewer records in the mapping tree than an indirect block
> > tree.
> > 
> > PMEM doesn't change this: extents are more time and space efficient
> > at scale for mapping trees than indirect block trees regardless
> > of the storage medium in use.
> 
> PMEM doesn't have to be read linearly, so the attempts to allocate large 
> linear space are not needed. They won't harm but they won't help either.
> 
> That's why NVFS has a very simple block allocation algorithm - it uses a
> per-cpu pointer and tries to allocate by a bit scan from this pointer. If
> the group is full, it tries a random group with an above-average number of
> free blocks.

I agree with Dave here. People are interested in 2MB or 1GB contiguous
allocations for DAX so that files can be mapped at PMD or even PUD levels,
thus saving a lot of CPU time on page faults and TLB misses (a single 2MB
PMD mapping replaces 512 4KB PTEs; a 1GB PUD mapping replaces 262144).

> EXT4 uses bit scan for allocations and people haven't complained that it's 
> inefficient, so it is probably OK.

Yes, it is more or less OK but once you get to 1TB filesystem size and
larger, the number of block groups grows enough that it isn't that great
anymore. We are actually considering new allocation schemes for ext4 for
such large filesystems...

> If you think that the lack of journaling is a show-stopper, I can implement
> it. But then, I'll have something that has the complexity of EXT4 and the
> performance of EXT4, so there will no longer be any reason to use
> NVFS over EXT4. Without journaling, it will be faster than EXT4 and it may 
> attract some users who want good performance and who don't care about GID 
> and UID being updated atomically, etc.

I'd hope that your filesystem offers more performance benefits than just
what you can get from a lack of journalling :). ext4 can be configured to
run without a journal as well - mkfs.ext4 -O ^has_journal. And yes, it does
significantly improve performance for some workloads but you have to have
some way to recover from crashes so it's mostly used for scratch
filesystems (e.g. in build systems, Google uses this feature a lot for some
of their infrastructure as well).

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2] dm: Call proper helper to determine dax support

2020-09-21 Thread Jan Kara
On Mon 21-09-20 11:23:07, Naresh Kamboju wrote:
> On Fri, 18 Sep 2020 at 11:18, Dan Williams  wrote:
> >
> > From: Jan Kara 
> >
> > DM was calling generic_fsdax_supported() to determine whether a device
> > referenced in the DM table supports DAX. However this is a helper for 
> > "leaf" device drivers so that
> > they don't have to duplicate common generic checks. High level code
> > should call the dax_supported() helper which then calls into the appropriate
> > helper for the particular device. This problem manifested itself as
> > kernel messages:
> >
> > dm-3: error: dax access failed (-95)
> >
> > when lvm2-testsuite was run in cases where a DM device was stacked on top of
> > another DM device.
> >
> > Fixes: 7bf7eac8d648 ("dax: Arrange for dax_supported check to span multiple 
> > devices")
> > Cc: 
> > Tested-by: Adrian Huang 
> > Signed-off-by: Jan Kara 
> > Acked-by: Mike Snitzer 
> > Signed-off-by: Dan Williams 
> > ---
> > Changes since v1 [1]:
> > - Add missing dax_read_lock() around dax_supported()
> >
> > [1]: http://lore.kernel.org/r/20200916151445.450-1-j...@suse.cz
> >
> >  drivers/dax/super.c   |4 
> >  drivers/md/dm-table.c |   10 +++---
> >  include/linux/dax.h   |   11 +--
> >  3 files changed, 20 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index e5767c83ea23..b6284c5cae0a 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -325,11 +325,15 @@ EXPORT_SYMBOL_GPL(dax_direct_access);
> >  bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
> > int blocksize, sector_t start, sector_t len)
> >  {
> > +   if (!dax_dev)
> > +   return false;
> > +
> > if (!dax_alive(dax_dev))
> > return false;
> >
> > return dax_dev->ops->dax_supported(dax_dev, bdev, blocksize, start, 
> > len);
> >  }
> > +EXPORT_SYMBOL_GPL(dax_supported);
> 
> arm build error while building with allmodconfig.
> 
> ../drivers/dax/super.c:325:6: error: redefinition of ‘dax_supported’
>   325 | bool dax_supported(struct dax_device *dax_dev, struct
> block_device *bdev,
>   |  ^
> In file included from ../drivers/dax/super.c:16:
> ../include/linux/dax.h:162:20: note: previous definition of
> ‘dax_supported’ was here
>   162 | static inline bool dax_supported(struct dax_device *dax_dev,
>   |^
> make[3]: *** [../scripts/Makefile.build:283: drivers/dax/super.o] Error 1
> 
> Reported-by: Naresh Kamboju 
> 
> Ref:
> https://builds.tuxbuild.com/IO690jFQDp0qP9zFuWBqpA/build.log

Thanks for the report! The attached patch should fix the build (at least I've
tested it with CONFIG_DAX && CONFIG_FS_DAX, CONFIG_DAX && !CONFIG_FS_DAX,
and !CONFIG_DAX cases). Dan, can you please merge the fix?

Honza
-- 
Jan Kara 
SUSE Labs, CR
From c48c9d1ee41ca17561dfd7ec5247b5afc527d40e Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Mon, 21 Sep 2020 11:33:23 +0200
Subject: [PATCH] dax: Fix compilation for CONFIG_DAX && !CONFIG_FS_DAX

dax_supported() is defined whenever CONFIG_DAX is enabled. So dummy
implementation should be defined only in !CONFIG_DAX case, not in
!CONFIG_FS_DAX case.

Fixes: e2ec51282545 ("dm: Call proper helper to determine dax support")
Cc: 
Reported-by: Geert Uytterhoeven 
Reported-by: Naresh Kamboju 
Reported-by: kernel test robot 
Signed-off-by: Jan Kara 
---
 include/linux/dax.h | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index 497031392e0a..43b39ab9de1a 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -58,6 +58,8 @@ static inline void set_dax_synchronous(struct dax_device *dax_dev)
 {
 	__set_dax_synchronous(dax_dev);
 }
+bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
+		int blocksize, sector_t start, sector_t len);
 /*
  * Check if given mapping is supported by the file / underlying device.
  */
@@ -104,6 +106,12 @@ static inline bool dax_synchronous(struct dax_device *dax_dev)
 static inline void set_dax_synchronous(struct dax_device *dax_dev)
 {
 }
+static inline bool dax_supported(struct dax_device *dax_dev,
+		struct block_device *bdev, int blocksize, sector_t start,
+		sector_t len)
+{
+	return false;
+}
 static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
 struct dax_device *dax_dev)
 {
@@ -130,8 +138,6 @@ static inline bool generic_fsdax_supported(struct dax_device *dax_dev,

Re: [linux-nvdimm:libnvdimm-fixes 2/3] drivers/dax/super.c:325:6: error: redefinition of 'dax_supported'

2020-09-21 Thread Jan Kara
On Mon 21-09-20 07:12:11, kernel test robot wrote:
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git 
> libnvdimm-fixes
> head:   d4c5da5049ac27c6ef8f6f98548c3a1ade352d25
> commit: e2ec5128254518cae320d5dc631b71b94160f663 [2/3] dm: Call proper helper 
> to determine dax support
> config: x86_64-randconfig-a011-20200920 (attached as .config)
> compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project 
> f4e554180962aa6bc93678898b6933ea712bde50)
> reproduce (this is a W=1 build):
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # install x86_64 cross compiling tool for clang build
> # apt-get install binutils-x86-64-linux-gnu
> git checkout e2ec5128254518cae320d5dc631b71b94160f663
> # save the attached .config to linux build tree
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross 
> ARCH=x86_64 
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
> 
> All errors (new ones prefixed by >>):
> 
> >> drivers/dax/super.c:325:6: error: redefinition of 'dax_supported'
>bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
> ^
>include/linux/dax.h:162:20: note: previous definition is here
>static inline bool dax_supported(struct dax_device *dax_dev,
>   ^
>drivers/dax/super.c:451:6: warning: no previous prototype for function 
> 'run_dax' [-Wmissing-prototypes]
>void run_dax(struct dax_device *dax_dev)
> ^
>drivers/dax/super.c:451:1: note: declare 'static' if the function is not 
> intended to be used outside of this translation unit
>void run_dax(struct dax_device *dax_dev)
>^
>static 
>1 warning and 1 error generated.

The attached patch should fix the build error.

        Honza
-- 
Jan Kara 
SUSE Labs, CR
From c48c9d1ee41ca17561dfd7ec5247b5afc527d40e Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Mon, 21 Sep 2020 11:33:23 +0200
Subject: [PATCH] dax: Fix compilation for CONFIG_DAX && !CONFIG_FS_DAX

dax_supported() is defined whenever CONFIG_DAX is enabled. So dummy
implementation should be defined only in !CONFIG_DAX case, not in
!CONFIG_FS_DAX case.

Fixes: e2ec51282545 ("dm: Call proper helper to determine dax support")
Cc: 
Reported-by: Geert Uytterhoeven 
Reported-by: Naresh Kamboju 
Reported-by: kernel test robot 
Signed-off-by: Jan Kara 
---
 include/linux/dax.h | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index 497031392e0a..43b39ab9de1a 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -58,6 +58,8 @@ static inline void set_dax_synchronous(struct dax_device *dax_dev)
 {
 	__set_dax_synchronous(dax_dev);
 }
+bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
+		int blocksize, sector_t start, sector_t len);
 /*
  * Check if given mapping is supported by the file / underlying device.
  */
@@ -104,6 +106,12 @@ static inline bool dax_synchronous(struct dax_device *dax_dev)
 static inline void set_dax_synchronous(struct dax_device *dax_dev)
 {
 }
+static inline bool dax_supported(struct dax_device *dax_dev,
+		struct block_device *bdev, int blocksize, sector_t start,
+		sector_t len)
+{
+	return false;
+}
 static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
 struct dax_device *dax_dev)
 {
@@ -130,8 +138,6 @@ static inline bool generic_fsdax_supported(struct dax_device *dax_dev,
 	return __generic_fsdax_supported(dax_dev, bdev, blocksize, start,
 			sectors);
 }
-bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
-		int blocksize, sector_t start, sector_t len);
 
 static inline void fs_put_dax(struct dax_device *dax_dev)
 {
@@ -159,13 +165,6 @@ static inline bool generic_fsdax_supported(struct dax_device *dax_dev,
 	return false;
 }
 
-static inline bool dax_supported(struct dax_device *dax_dev,
-		struct block_device *bdev, int blocksize, sector_t start,
-		sector_t len)
-{
-	return false;
-}
-
 static inline void fs_put_dax(struct dax_device *dax_dev)
 {
 }
-- 
2.16.4



Re: PROBLEM: 5.9.0-rc6 fails to compile due to 'redefinition of ‘dax_supported’'

2020-09-21 Thread Jan Kara
On Mon 21-09-20 09:32:18, Greg KH wrote:
> On Mon, Sep 21, 2020 at 11:34:17AM +0530, Naresh Kamboju wrote:
> > On Mon, 21 Sep 2020 at 06:34, Stuart Little  wrote:
> > >
> > > I am trying to compile for an x86_64 machine (Intel(R) Core(TM) i7-7500U 
> > > CPU @ 2.70GHz). The config file I am currently using is at
> > >
> > > https://termbin.com/xin7
> > >
> > > The build for 5.9.0-rc6 fails with the following errors:
> > >
> > 
> > arm and mips allmodconfig build breaks due to this error.
> 
> all my local builds are breaking now too with this :(
> 
> Was there a proposed patch anywhere for this?

The attached patch should fix the build breakage. I'm sorry for that.

Honza
-- 
Jan Kara 
SUSE Labs, CR
From 8b8c7d6148bc1bab3cf88cac49038a05db7dd938 Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Mon, 21 Sep 2020 11:33:23 +0200
Subject: [PATCH] dax: Fix compilation for CONFIG_DAX && !CONFIG_FS_DAX

dax_supported() is defined whenever CONFIG_DAX is enabled. So dummy
implementation should be defined only in !CONFIG_DAX case, not in
!CONFIG_FS_DAX case.

Fixes: e2ec51282545 ("dm: Call proper helper to determine dax support")
Cc: 
Reported-by: Geert Uytterhoeven 
Reported-by: Naresh Kamboju 
Signed-off-by: Jan Kara 
---
 include/linux/dax.h | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index 497031392e0a..43b39ab9de1a 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -58,6 +58,8 @@ static inline void set_dax_synchronous(struct dax_device *dax_dev)
 {
 	__set_dax_synchronous(dax_dev);
 }
+bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
+		int blocksize, sector_t start, sector_t len);
 /*
  * Check if given mapping is supported by the file / underlying device.
  */
@@ -104,6 +106,12 @@ static inline bool dax_synchronous(struct dax_device *dax_dev)
 static inline void set_dax_synchronous(struct dax_device *dax_dev)
 {
 }
+static inline bool dax_supported(struct dax_device *dax_dev,
+		struct block_device *bdev, int blocksize, sector_t start,
+		sector_t len)
+{
+	return false;
+}
 static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
 struct dax_device *dax_dev)
 {
@@ -130,8 +138,6 @@ static inline bool generic_fsdax_supported(struct dax_device *dax_dev,
 	return __generic_fsdax_supported(dax_dev, bdev, blocksize, start,
 			sectors);
 }
-bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
-		int blocksize, sector_t start, sector_t len);
 
 static inline void fs_put_dax(struct dax_device *dax_dev)
 {
@@ -159,13 +165,6 @@ static inline bool generic_fsdax_supported(struct dax_device *dax_dev,
 	return false;
 }
 
-static inline bool dax_supported(struct dax_device *dax_dev,
-		struct block_device *bdev, int blocksize, sector_t start,
-		sector_t len)
-{
-	return false;
-}
-
 static inline void fs_put_dax(struct dax_device *dax_dev)
 {
 }
-- 
2.16.4



Re: [PATCH] dm: Call proper helper to determine dax support

2020-09-17 Thread Jan Kara
On Thu 17-09-20 02:28:57, Dan Williams wrote:
> On Wed, Sep 16, 2020 at 8:15 AM Jan Kara  wrote:
> >
> > DM was calling generic_fsdax_supported() to determine whether a device
> > referenced in the DM table supports DAX. However this is a helper for 
> > "leaf" device drivers so that
> > they don't have to duplicate common generic checks. High level code
> > should call the dax_supported() helper which then calls into the appropriate
> > helper for the particular device. This problem manifested itself as
> > kernel messages:
> >
> > dm-3: error: dax access failed (-95)
> >
> > when lvm2-testsuite was run in cases where a DM device was stacked on top of
> > another DM device.
> >
> > Fixes: 7bf7eac8d648 ("dax: Arrange for dax_supported check to span multiple 
> > devices")
> > Tested-by: Adrian Huang 
> > Signed-off-by: Jan Kara 
> > ---
> >  drivers/dax/super.c   |  4 
> >  drivers/md/dm-table.c |  3 +--
> >  include/linux/dax.h   | 11 +--
> >  3 files changed, 14 insertions(+), 4 deletions(-)
> >
> > This patch should go in together with Adrian's
> > https://lore.kernel.org/linux-nvdimm/20200916133923.31-1-adrianhuang0...@gmail.com
> >
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index e5767c83ea23..b6284c5cae0a 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -325,11 +325,15 @@ EXPORT_SYMBOL_GPL(dax_direct_access);
> >  bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
> > int blocksize, sector_t start, sector_t len)
> >  {
> > +   if (!dax_dev)
> > +   return false;
> > +
> 
> Hi Jan, Thanks for this.
> 
> > if (!dax_alive(dax_dev))
> > return false;
> 
One small fixup to quiet lockdep: because dax_supported() calls
dax_alive(), it expects that dax_read_lock() is held. So I'm testing
with this incremental change:
> 
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index bed1ff0744ec..229f461e7def 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -860,9 +860,14 @@ EXPORT_SYMBOL_GPL(dm_table_set_type);
>  int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
> sector_t start, sector_t len, void *data)
>  {
> -   int blocksize = *(int *) data;
> +   int blocksize = *(int *) data, id;
> +   bool rc;
> 
> -   return dax_supported(dev->dax_dev, dev->bdev, blocksize, start, len);
> +   id = dax_read_lock();
> +   rc = dax_supported(dev->dax_dev, dev->bdev, blocksize, start, len);
> +   dax_read_unlock(id);
> +
> +   return rc;
>  }

Yeah, thanks for this! I was actually looking into this when writing the
patch and somehow convinced myself we will always be called through
bdev_dax_supported() which does dax_read_lock() for us. But apparently I
was wrong...
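
For reference, the convention at stake boils down to this shape (a simplified
sketch of the drivers/dax/super.c pattern, not the verbatim source):

int dax_read_lock(void)
{
	return srcu_read_lock(&dax_srcu);
}

bool dax_alive(struct dax_device *dax_dev)
{
	/* Callers must hold dax_read_lock(); without it this assertion
	 * fires, which is what lockdep reported for dm-table here. */
	lockdep_assert_held(&dax_srcu);
	return test_bit(DAXDEV_ALIVE, &dax_dev->flags);
}

So any path reaching dax_alive() outside the SRCU read section needs the
dax_read_lock()/dax_read_unlock() bracket that Dan added above.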

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: dm: Call proper helper to determine dax support

2020-09-17 Thread Jan Kara
On Wed 16-09-20 11:22:05, Mike Snitzer wrote:
> On Wed, Sep 16 2020 at 11:14am -0400,
> Jan Kara  wrote:
> 
> > DM was calling generic_fsdax_supported() to determine whether a device
> > referenced in the DM table supports DAX. However, this is a helper for
> > "leaf" device drivers so that they don't have to duplicate common generic
> > checks. High level code should call the dax_supported() helper, which then
> > calls into the appropriate helper for the particular device. This problem
> > manifested itself as
> > kernel messages:
> > 
> > dm-3: error: dax access failed (-95)
> > 
> > when the lvm2-testsuite was run in cases where a DM device was stacked on top of
> > another DM device.
> > 
> > Fixes: 7bf7eac8d648 ("dax: Arrange for dax_supported check to span multiple devices")
> > Tested-by: Adrian Huang 
> > Signed-off-by: Jan Kara 
> 
> Looked good:
> 
> Acked-by: Mike Snitzer 

Thanks!

> This fix should Cc stable@ right?

Yes, it should go to stable.

> >  drivers/dax/super.c   |  4 
> >  drivers/md/dm-table.c |  3 +--
> >  include/linux/dax.h   | 11 +--
> >  3 files changed, 14 insertions(+), 4 deletions(-)
> > 
> > This patch should go in together with Adrian's
> > https://lore.kernel.org/linux-nvdimm/20200916133923.31-1-adrianhuang0...@gmail.com
> 
> Sure, but there really isn't a dependency right?

Yes, it isn't a context or strict functional dependency. But without this
patch Adrian's patch just trades one set of warnings for another set of
warnings...

Honza
-- 
Jan Kara 
SUSE Labs, CR


[PATCH] dm: Call proper helper to determine dax support

2020-09-16 Thread Jan Kara
DM was calling generic_fsdax_supported() to determine whether a device
referenced in the DM table supports DAX. However, this is a helper for "leaf"
device drivers so that they don't have to duplicate common generic checks.
High level code should call the dax_supported() helper, which then calls into
the appropriate helper for the particular device. This problem manifested
itself as kernel messages:

dm-3: error: dax access failed (-95)

when the lvm2-testsuite was run in cases where a DM device was stacked on top of
another DM device.

Fixes: 7bf7eac8d648 ("dax: Arrange for dax_supported check to span multiple devices")
Tested-by: Adrian Huang 
Signed-off-by: Jan Kara 
---
 drivers/dax/super.c   |  4 
 drivers/md/dm-table.c |  3 +--
 include/linux/dax.h   | 11 +--
 3 files changed, 14 insertions(+), 4 deletions(-)

This patch should go in together with Adrian's
https://lore.kernel.org/linux-nvdimm/20200916133923.31-1-adrianhuang0...@gmail.com

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index e5767c83ea23..b6284c5cae0a 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -325,11 +325,15 @@ EXPORT_SYMBOL_GPL(dax_direct_access);
 bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
int blocksize, sector_t start, sector_t len)
 {
+   if (!dax_dev)
+   return false;
+
if (!dax_alive(dax_dev))
return false;
 
return dax_dev->ops->dax_supported(dax_dev, bdev, blocksize, start, 
len);
 }
+EXPORT_SYMBOL_GPL(dax_supported);
 
 size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void 
*addr,
size_t bytes, struct iov_iter *i)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 5edc3079e7c1..bed1ff0744ec 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -862,8 +862,7 @@ int device_supports_dax(struct dm_target *ti, struct dm_dev 
*dev,
 {
int blocksize = *(int *) data;
 
-   return generic_fsdax_supported(dev->dax_dev, dev->bdev, blocksize,
-  start, len);
+   return dax_supported(dev->dax_dev, dev->bdev, blocksize, start, len);
 }
 
 /* Check devices support synchronous DAX */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 6904d4e0b2e0..9f916326814a 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -130,6 +130,8 @@ static inline bool generic_fsdax_supported(struct 
dax_device *dax_dev,
return __generic_fsdax_supported(dax_dev, bdev, blocksize, start,
sectors);
 }
+bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
+   int blocksize, sector_t start, sector_t len);
 
 static inline void fs_put_dax(struct dax_device *dax_dev)
 {
@@ -157,6 +159,13 @@ static inline bool generic_fsdax_supported(struct 
dax_device *dax_dev,
return false;
 }
 
+static inline bool dax_supported(struct dax_device *dax_dev,
+   struct block_device *bdev, int blocksize, sector_t start,
+   sector_t len)
+{
+   return false;
+}
+
 static inline void fs_put_dax(struct dax_device *dax_dev)
 {
 }
@@ -195,8 +204,6 @@ bool dax_alive(struct dax_device *dax_dev);
 void *dax_get_private(struct dax_device *dax_dev);
 long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long 
nr_pages,
void **kaddr, pfn_t *pfn);
-bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
-   int blocksize, sector_t start, sector_t len);
 size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void 
*addr,
size_t bytes, struct iov_iter *i);
 size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
-- 
2.16.4


Re: [External] Re: [PATCH 1/1] dax: Fix stack overflow when mounting fsdax pmem device

2020-09-16 Thread Jan Kara
On Wed 16-09-20 14:02:19, Adrian Huang12 wrote:
> > -Original Message-
> > From: Jan Kara 
> > Sent: Wednesday, September 16, 2020 7:19 PM
> > >
> > > dm-3: error: dax access failed (-95)
> > > dm-3: error: dax access failed (-95)
> > > dm-3: error: dax access failed (-95)
> > 
> > Right, and that's result of the problem I also describe above. Attached 
> > patch
> > should fix these errors.
> 
> The patch introduces the following panic during boot. Apparently, the
> dax_dev is NULL in dax_supported(). So, the address 0x02d0 is
> the offset of the member 'flags' in struct dax_device (the member 'flags' is
> referenced in dax_alive()):

Thanks for testing!

> The following patch solves the panic. Feel free to add it to your patch. 

I've added your fixup to the patch. Thanks for it.
 
> BTW, feel free to add my tested-by to your patch after including the 
> following patch to your patch (I don't see any dax error messages when 
> running lvm2-testsuite).
> Tested-by: Adrian Huang 
> 
> Thanks for looking into the issue triggered by lvm2-testsuite.
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 0d2dcbb1e549..e84070b55463 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -325,6 +325,9 @@ EXPORT_SYMBOL_GPL(dax_direct_access);
>  bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
> int blocksize, sector_t start, sector_t len)
>  {
> +   if (!dax_dev)
> +   return false;
> +
> if (!dax_alive(dax_dev))
> return false;
> 
> BTW, I just submitted the v2 version:
> https://lore.kernel.org/linux-nvdimm/20200916133923.31-1-adrianhuang0...@gmail.com/T/#u
> 
> Hopefully/ideally, your patch and mine can be merged at the same rc release.

Yup, I'll send it right away.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 1/1] dax: Fix stack overflow when mounting fsdax pmem device

2020-09-16 Thread Jan Kara
On Wed 16-09-20 21:39:23, Adrian Huang wrote:
> From: Adrian Huang 
> 
> When mounting an fsdax pmem device, commit 6180bb446ab6 ("dax: fix
> detection of dax support for non-persistent memory block devices")
> introduces a stack overflow [1][2]. Here is the call path for
> mounting ext4 file system:
>   ext4_fill_super
> bdev_dax_supported
>   __bdev_dax_supported
> dax_supported
>   generic_fsdax_supported
> __generic_fsdax_supported
>   bdev_dax_supported
> 
> The call path leads to an infinite calling loop, so we cannot
> call bdev_dax_supported() in __generic_fsdax_supported(). The sanity
> checking of the variable 'dax_dev' is moved prior to the two
> bdev_dax_pgoff() checks [3][4].
> 
> [1] 
> https://lists.01.org/hyperkitty/list/linux-nvdimm@lists.01.org/thread/BULZHRILK7N2WS2JVISNF2QZNRQK6JU4/
> [2] 
> https://lists.01.org/hyperkitty/list/linux-nvdimm@lists.01.org/thread/OOZGFY3RNQGTGJJCH52YXCSYIDXMOPXO/
> [3] 
> https://lists.01.org/hyperkitty/list/linux-nvdimm@lists.01.org/message/SMQW2LY3QHPXOAW76RKNSCGG3QJFO7HT/
> [4] 
> https://lists.01.org/hyperkitty/list/linux-nvdimm@lists.01.org/message/7E2X6UGX5RQ2ISGYNAF66VLY5BKBFI4M/
> 
> Fixes: 6180bb446ab6 ("dax: fix detection of dax support for non-persistent memory block devices")
> Cc: Coly Li 
> Cc: Jan Kara 
> Cc: Ira Weiny 
> Cc: John Pittman 
> Cc: Mikulas Patocka 
> Cc: Alasdair Kergon 
> Cc: Mike Snitzer 
> Signed-off-by: Adrian Huang 

The patch looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
> v1->v2
> * Remove the checking for the returned value '-EOPNOTSUPP' of
>   dax_direct_access(). Jan has prepared a patch to address the
>   issue in dm.
> ---
>  drivers/dax/super.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index e5767c83ea23..11d0541e6f8f 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -85,6 +85,12 @@ bool __generic_fsdax_supported(struct dax_device *dax_dev,
>   return false;
>   }
>  
> + if (!dax_dev) {
> + pr_debug("%s: error: dax unsupported by block device\n",
> + bdevname(bdev, buf));
> + return false;
> + }
> +
>   err = bdev_dax_pgoff(bdev, start, PAGE_SIZE, &pgoff);
>   if (err) {
>   pr_info("%s: error: unaligned partition for dax\n",
> @@ -100,12 +106,6 @@ bool __generic_fsdax_supported(struct dax_device 
> *dax_dev,
>   return false;
>   }
>  
> - if (!dax_dev || !bdev_dax_supported(bdev, blocksize)) {
> - pr_debug("%s: error: dax unsupported by block device\n",
> -     bdevname(bdev, buf));
> - return false;
> - }
> -
>   id = dax_read_lock();
>   len = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn);
>   len2 = dax_direct_access(dax_dev, pgoff_end, 1, &end_kaddr, &end_pfn);
> -- 
> 2.17.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: Reply: regression caused by patch 6180bb446ab624b9ab8bf201ed251ca87f07b413 ("dax: fix detection of dax support for non-persistent memory block devices")

2020-09-16 Thread Jan Kara
On Tue 15-09-20 12:49:10, Dan Williams wrote:
> On Tue, Sep 15, 2020 at 1:01 AM Jan Kara  wrote:
> >
> > Hi!
> >
> > On Tue 15-09-20 11:03:29, col...@suse.de wrote:
> Could you please take a look? I will be offline for the next two weeks.
> >
> > I just had a look into this. IMHO the justification in 6180bb446a "dax: fix
> > detection of dax support for non-persistent memory block devices" is just
> > bogus and people got confused by the previous condition
> >
> > if (!dax_dev && !bdev_dax_supported(bdev, blocksize))
> >
> > which was bogus as well. bdev_dax_supported() always returns false for bdev
> > that doesn't have dax_dev (naturally so). So in the original condition
> > there was no point in calling bdev_dax_supported() if we know dax_dev is
> > NULL.
> >
> > Then this was changed to:
> >
> > if (!dax_dev || !bdev_dax_supported(bdev, blocksize))
> >
> > which looks more sensible at the first sight. But only at the first sight -
> > if you look at wider context, __generic_fsdax_supported() is the bulk of
> > code that decides whether a device supports DAX so calling
> > bdev_dax_supported() from it indeed doesn't look as such a great idea. So
> > IMO the condition should be just:
> >
> > if (!dax_dev)
> >
> > I'll send a fix for this.
> 
> If you beat me to it, great, but you might be sleeping now. I agree
> the original condition was bogus and looks to be a result of previous
> non-thorough refactoring on my part. I think we can move that !dax_dev
> into dax_supported(). I'll take a look.

Adrian actually already submitted a fix here:

https://lore.kernel.org/linux-nvdimm/20200915075729.12518-1-adrianhuang0...@gmail.com/

so we're now refining the fix in that thread.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: Reply: regression caused by patch 6180bb446ab624b9ab8bf201ed251ca87f07b413 ("dax: fix detection of dax support for non-persistent memory block devices")

2020-09-16 Thread Jan Kara
On Tue 15-09-20 15:12:21, Verma, Vishal L wrote:
> On Tue, 2020-09-15 at 10:01 +0200, Jan Kara wrote:
> > Hi!
> > 
> > On Tue 15-09-20 11:03:29, col...@suse.de wrote:
> > Could you please take a look? I will be offline for the next two weeks.
> > 
> > I just had a look into this. IMHO the justification in 6180bb446a "dax: fix
> > detection of dax support for non-persistent memory block devices" is just
> > bogus and people got confused by the previous condition
> > 
> > if (!dax_dev && !bdev_dax_supported(bdev, blocksize))
> > 
> > which was bogus as well. bdev_dax_supported() always returns false for bdev
> > that doesn't have dax_dev (naturally so). So in the original condition
> > there was no point in calling bdev_dax_supported() if we know dax_dev is
> > NULL.
> > 
> > Then this was changed to:
> > 
> > if (!dax_dev || !bdev_dax_supported(bdev, blocksize))
> > 
> > which looks more sensible at the first sight. But only at the first sight -
> > if you look at wider context, __generic_fsdax_supported() is the bulk of
> > code that decides whether a device supports DAX so calling
> > bdev_dax_supported() from it indeed doesn't look as such a great idea. So
> > IMO the condition should be just:
> > 
> > if (!dax_dev)
> > 
> > I'll send a fix for this.
> > 
> > Also there's the process question of how this patch could get to Linus when
> > any attempt to use DAX would immediately kill the machine, like Mikulas
> > spotted. This shows that the patch was untested with DAX by anybody on the
> > path from the developer to Linus...
> 
> This was entirely my fault, and I apologize. I got confused as to what
> state my branches were in, and I thought this had cleared our unit tests
> etc, when it obviously hadn't. I'm going to take a harder look at my
> personal branch/patch management process to make sure it doesn't happen
> again!

No worries. Bugs happen and this was still caught rather early without real
harm caused... I was just ranting to make sure testing does happen in the
future :)

Honza

> > >  Original message 
> > > From: Mikulas Patocka 
> > > Date: Mon, Sep 14, 2020, 11:48 PM
> > > To: Coly Li , Dan Williams ,
> > > Dave Jiang 
> > > Cc: Jan Kara , Vishal Verma ,
> > > Adrian Huang , Ira Weiny , Mike
> > > Snitzer , Pankaj Gupta ,
> > > linux-nvdimm@lists.01.org
> > > Subject: regression caused by patch 6180bb446ab624b9ab8bf201ed251ca87f07b413
> > > ("dax: fix detection of dax support for non-persistent memory block
> > > devices")
> > > 
> > > Hi
> > > 
> > > The patch 6180bb446ab624b9ab8bf201ed251ca87f07b413 ("dax: fix 
> > > detection of
> > > dax support for non-persistent memory block devices") causes crash 
> > > when
> > > attempting to mount the ext4 filesystem on /dev/pmem0 ("mkfs.ext4
> > > /dev/pmem0; mount -t ext4 /dev/pmem0 /mnt/test"). The device 
> > > /dev/pmem0 is
> > > emulated using the "memmap" kernel parameter.
> > > 
> > > The patch causes infinite recursion and double-fault exception:
> > > 
> > > __generic_fsdax_supported
> > > bdev_dax_supported
> > > __bdev_dax_supported
> > > dax_supported
> > > dax_dev->ops->dax_supported
> > > generic_fsdax_supported
> > > __generic_fsdax_supported
> > > 
> > > Mikulas
> > > 
> > > 
> > > 
> > > [   17.500619] traps: PANIC: double fault, error_code: 0x0
> > > [   17.500619] double fault:  [#1] PREEMPT SMP
> > > [   17.500620] CPU: 0 PID: 1326 Comm: mount Not tainted 
> > > 5.9.0-rc1-bisect #
> > > 10
> > > [   17.500620] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > > [   17.500621] RIP: 0010:__generic_fsdax_supported+0x6a/0x500
> > > [   17.500622] Code: ff ff ff ff ff 7f 00 48 21 f3 48 01 c3 48 c1 e3 
> > > 09 f6
> > > c7 0e 0f 85 fa 01 00 00 48 85 ff 49 89 fd 74 11 be 00 10 00 00 4c 89 
> > > e7
> > >  b1 fe ff ff 84 c0 75 11 31 c0 48 83 c4 48 5b 5d 41 5c 41 5d 41
> > > [   17.500623] RSP: 0018:88940b4fdff8 EFLAGS: 00010286
> > > [   17.500624] RAX:  RBX: 0007f000 RCX:
> > > 
> > > [   17.500625] RDX: 1000 RSI: 1000 RDI:
> > >

Re: [PATCH 1/1] dax: Fix stack overflow when mounting fsdax pmem device

2020-09-15 Thread Jan Kara
 bdevname(bdev, buf));
> - return false;
> - }
> -
>   id = dax_read_lock();
>   len = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn);
>   len2 = dax_direct_access(dax_dev, pgoff_end, 1, &end_kaddr, &end_pfn);
>  
>   if (len < 1 || len2 < 1) {
> - pr_info("%s: error: dax access failed (%ld)\n",
> + /*
> +  * Only print the real error message: do not need to print
> +  * the message for the underlying raw disk (physical disk)
> +  * that does not support DAX (dax_dev = NULL). This case
> +  * is observed when physical disks are configured by
> +  * lvm2 (device mapper).
> +  */
> + if (len != -EOPNOTSUPP && len2 != -EOPNOTSUPP) {
> +         pr_info("%s: error: dax access failed (%ld)\n",
>   bdevname(bdev, buf), len < 1 ? len : len2);
> + }
>   dax_read_unlock(id);
>   return false;
>   }
> -- 
> 2.17.1
-- 
Jan Kara 
SUSE Labs, CR


Re: Reply: regression caused by patch 6180bb446ab624b9ab8bf201ed251ca87f07b413 ("dax: fix detection of dax support for non-persistent memory block devices")

2020-09-15 Thread Jan Kara
Hi!

On Tue 15-09-20 11:03:29, col...@suse.de wrote:
> Could you please take a look? I will be offline for the next two weeks.

I just had a look into this. IMHO the justification in 6180bb446a "dax: fix
detection of dax support for non-persistent memory block devices" is just
bogus and people got confused by the previous condition

if (!dax_dev && !bdev_dax_supported(bdev, blocksize))

which was bogus as well. bdev_dax_supported() always returns false for bdev
that doesn't have dax_dev (naturally so). So in the original condition
there was no point in calling bdev_dax_supported() if we know dax_dev is
NULL.

Then this was changed to:

if (!dax_dev || !bdev_dax_supported(bdev, blocksize))

which looks more sensible at the first sight. But only at the first sight -
if you look at wider context, __generic_fsdax_supported() is the bulk of
code that decides whether a device supports DAX so calling
bdev_dax_supported() from it indeed doesn't look as such a great idea. So
IMO the condition should be just:

if (!dax_dev)

I'll send a fix for this.

Also there's the process question of how this patch could get to Linus when
any attempt to use DAX would immediately kill the machine, like Mikulas
spotted. This shows that the patch was untested with DAX by anybody on the
path from the developer to Linus...

Honza

>  Original message 
> From: Mikulas Patocka 
> Date: Mon, Sep 14, 2020, 11:48 PM
> To: Coly Li , Dan Williams ,
> Dave Jiang 
> Cc: Jan Kara , Vishal Verma ,
> Adrian Huang , Ira Weiny , Mike
> Snitzer , Pankaj Gupta ,
> linux-nvdimm@lists.01.org
> Subject: regression caused by patch 6180bb446ab624b9ab8bf201ed251ca87f07b413
> ("dax: fix detection of dax support for non-persistent memory block
> devices")
> 
> Hi
> 
> The patch 6180bb446ab624b9ab8bf201ed251ca87f07b413 ("dax: fix detection of
> dax support for non-persistent memory block devices") causes crash when
> attempting to mount the ext4 filesystem on /dev/pmem0 ("mkfs.ext4
> /dev/pmem0; mount -t ext4 /dev/pmem0 /mnt/test"). The device /dev/pmem0 is
> emulated using the "memmap" kernel parameter.
> 
> The patch causes infinite recursion and double-fault exception:
> 
> __generic_fsdax_supported
> bdev_dax_supported
> __bdev_dax_supported
> dax_supported
> dax_dev->ops->dax_supported
> generic_fsdax_supported
> __generic_fsdax_supported
> 
> Mikulas
> 
> 
> 
> [   17.500619] traps: PANIC: double fault, error_code: 0x0
> [   17.500619] double fault:  [#1] PREEMPT SMP
> [   17.500620] CPU: 0 PID: 1326 Comm: mount Not tainted 5.9.0-rc1-bisect #
> 10
> [   17.500620] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [   17.500621] RIP: 0010:__generic_fsdax_supported+0x6a/0x500
> [   17.500622] Code: ff ff ff ff ff 7f 00 48 21 f3 48 01 c3 48 c1 e3 09 f6
> c7 0e 0f 85 fa 01 00 00 48 85 ff 49 89 fd 74 11 be 00 10 00 00 4c 89 e7
>  b1 fe ff ff 84 c0 75 11 31 c0 48 83 c4 48 5b 5d 41 5c 41 5d 41
> [   17.500623] RSP: 0018:88940b4fdff8 EFLAGS: 00010286
> [   17.500624] RAX:  RBX: 0007f000 RCX:
> 
> [   17.500625] RDX: 1000 RSI: 1000 RDI:
> 88940b34c300
> [   17.500625] RBP:  R08: 0400 R09:
> 8080808080808080
> [   17.500626] R10:  R11: fefefefefefefeff R12:
> 88940b34c300
> [   17.500626] R13: 88940b3dc000 R14: 88940badd000 R15:
> 0001
> [   17.500627] FS:  f7c25780() GS:88940fa0()
> knlGS:
> [   17.500628] CS:  0010 DS: 002b ES: 002b CR0: 80050033
> [   17.500628] CR2: 88940b4fdfe8 CR3: 00140bd15000 CR4:
> 06b0
> [   17.500628] Call Trace:
> [   17.500629] Modules linked in: uvesafb cfbfillrect cfbimgblt cn
> cfbcopyarea fb fbdev ipv6 tun autofs4 binfmt_misc configfs af_packet
> virtio_rng rng_core mousedev evdev pcspkr virtio_balloon button raid10
> raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor
> async_tx libcrc32c raid1 raid0 md_mod sd_mod t10_pi virtio_scsi virtio_net
> net_failover psmouse scsi_mod failover
> [   17.500638] ---[ end trace 3c877fcb5b865459 ]---
> [   17.500638] RIP: 0010:__generic_fsdax_supported+0x6a/0x500
> [   17.500639] Code: ff ff ff ff ff 7f 00 48 21 f3 48 01 c3 48 c1 e3 09 f6
> c7 0e 0f 85 fa 01 00 00 48 85 ff 49 89 fd 74 11 be 00 10 00 00 4c 89 e7
>  b1 fe ff ff 84 c0 75 11 31 c0 48 83 c4 48 5b 5d 41 5c 41 5d 41
> [   17.500640] RSP: 0018:88940b4fdff8 EFLAGS: 0

Re: [PATCH 1/2] ext2: don't update mtime on COW faults

2020-09-07 Thread Jan Kara
On Sat 05-09-20 08:12:01, Mikulas Patocka wrote:
> When running in dax mode, if the user maps a page with MAP_PRIVATE and
> PROT_WRITE, the ext2 filesystem would incorrectly update ctime and mtime
> when the user hits a COW fault.
> 
> This breaks building of the Linux kernel.
> How to reproduce:
> 1. extract the Linux kernel tree on dax-mounted ext2 filesystem
> 2. run make clean
> 3. run make -j12
> 4. run make -j12
> - at step 4, make would incorrectly rebuild the whole kernel (although it
>   was already built in step 3).
> 
> The reason for the breakage is that almost all object files depend on
> objtool. When we run objtool, it takes a COW page fault on its .data
> section, and these faults will incorrectly update the timestamp of the
> objtool binary. The updated timestamp causes make to rebuild the whole
> tree.
> 
> Signed-off-by: Mikulas Patocka 
> Cc: sta...@vger.kernel.org

Thanks. Good spotting! Linus has already merged this so nothing more to do
here.
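
For anyone wanting to check the fixed semantics, a minimal standalone test
could look like this (an illustrative sketch only, error handling omitted;
the file name is arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	struct stat before, after;
	int fd = open("testfile", O_RDWR);	/* a file on the dax mount */
	char *p;

	fstat(fd, &before);
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	p[0] = 1;	/* COW fault: modifies a private copy, not the file */
	fstat(fd, &after);
	/* mtime must be unchanged; the bug made it advance on dax ext2 */
	printf("mtime %s\n",
	       before.st_mtim.tv_sec == after.st_mtim.tv_sec &&
	       before.st_mtim.tv_nsec == after.st_mtim.tv_nsec ?
	       "unchanged (good)" : "CHANGED (bug)");
	munmap(p, 4096);
	close(fd);
	return 0;
}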

Honza

> 
> ---
>  fs/ext2/file.c |6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6/fs/ext2/file.c
> ===
> --- linux-2.6.orig/fs/ext2/file.c 2020-09-05 10:01:41.0 +0200
> +++ linux-2.6/fs/ext2/file.c  2020-09-05 13:09:50.0 +0200
> @@ -93,8 +93,10 @@ static vm_fault_t ext2_dax_fault(struct
>   struct inode *inode = file_inode(vmf->vma->vm_file);
>   struct ext2_inode_info *ei = EXT2_I(inode);
>   vm_fault_t ret;
> + bool write = (vmf->flags & FAULT_FLAG_WRITE) &&
> + (vmf->vma->vm_flags & VM_SHARED);
>  
> - if (vmf->flags & FAULT_FLAG_WRITE) {
> + if (write) {
>   sb_start_pagefault(inode->i_sb);
>   file_update_time(vmf->vma->vm_file);
>   }
> @@ -103,7 +105,7 @@ static vm_fault_t ext2_dax_fault(struct
>   ret = dax_iomap_fault(vmf, PE_SIZE_PTE, NULL, NULL, &ext2_iomap_ops);
>  
>   up_read(&ei->dax_sem);
> - if (vmf->flags & FAULT_FLAG_WRITE)
> + if (write)
>   sb_end_pagefault(inode->i_sb);
>   return ret;
>  }
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/2] xfs: don't update mtime on COW faults

2020-09-07 Thread Jan Kara
On Sat 05-09-20 10:03:20, Linus Torvalds wrote:
> On Sat, Sep 5, 2020 at 9:47 AM Linus Torvalds
>  wrote:
> >
> > So your patch is obviously correct, [..]
> 
> Oh, and I had a xfs pull request in my inbox already, so rather than
> expect Darrick to do another one just for this and have Jan do one for
> ext2, I just applied these two directly as "ObviouslyCorrect(tm)".

OK, thanks!

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v3 02/18] dax: Create a range version of dax_layout_busy_page()

2020-08-20 Thread Jan Kara
On Wed 19-08-20 18:19:40, Vivek Goyal wrote:
> The virtiofs device has a range of memory which is mapped into file inodes
> using dax. This memory is mapped in qemu on the host and maps different
> sections of a real file on the host. The size of this memory is limited
> (determined by the administrator) and, depending on the filesystem size, we will
> soon reach a situation where all the memory is in use and we need to
> reclaim some.
> 
> As part of the reclaim process, we will need to make sure that there are
> no active references to pages (taken by get_user_pages()) on the memory
> range we are trying to reclaim. I am planning to use
> dax_layout_busy_page() for this. But in its current form this is per inode
> and scans through all the pages of the inode.
> 
> We want to reclaim only a portion of memory (say a 2MB page). So we want
> to make sure that only that 2MB range of pages does not have any
> references (and we don't want to unmap all the pages of the inode).
> 
> Hence, create a range version of this function named
> dax_layout_busy_page_range() which can be used to pass a range which
> needs to be unmapped.
> 
> Cc: Dan Williams 
> Cc: linux-nvdimm@lists.01.org
> Cc: Jan Kara 
> Cc: Vishal L Verma 
> Cc: "Weiny, Ira" 
> Signed-off-by: Vivek Goyal 
> ---
>  fs/dax.c| 29 +++--
>  include/linux/dax.h |  6 ++
>  2 files changed, 29 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 95341af1a966..ddd705251d9f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -559,7 +559,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
>  }
>  
>  /**
> - * dax_layout_busy_page - find first pinned page in @mapping
> + * dax_layout_busy_page_range - find first pinned page in @mapping
>   * @mapping: address space to scan for a page with ref count > 1

Please document additional function arguments in the kernel-doc comment.
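
For illustration, the requested kernel-doc additions would look roughly like
this (the parameter wording is a suggestion, not text from the patch):

/**
 * dax_layout_busy_page_range - find first pinned page in @mapping
 * @mapping: address space to scan for a page with ref count > 1
 * @start: byte offset in the address space where scanning starts
 * @end: last byte offset in the address space to scan, or LLONG_MAX
 *	to scan until the end of the file
 */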

Otherwise the patch looks good so feel free to add:

Reviewed-by: Jan Kara 

after fixing this nit.

Honza

>   *
>   * DAX requires ZONE_DEVICE mapped pages. These pages are never
> @@ -572,13 +572,19 @@ static void *grab_mapping_entry(struct xa_state *xas,
>   * establishment of new mappings in this address_space. I.e. it expects
>   * to be able to run unmap_mapping_range() and subsequently not race
>   * mapping_mapped() becoming true.
> + *
> + * Partial pages are included. If 'end' is LLONG_MAX, pages in the range
> + * from 'start' to the end of the file are included.
>   */
> -struct page *dax_layout_busy_page(struct address_space *mapping)
> +struct page *dax_layout_busy_page_range(struct address_space *mapping,
> + loff_t start, loff_t end)
>  {
> - XA_STATE(xas, &mapping->i_pages, 0);
>   void *entry;
>   unsigned int scanned = 0;
>   struct page *page = NULL;
> + pgoff_t start_idx = start >> PAGE_SHIFT;
> + pgoff_t end_idx;
> + XA_STATE(xas, &mapping->i_pages, start_idx);
>  
>   /*
>* In the 'limited' case get_user_pages() for dax is disabled.
> @@ -589,6 +595,11 @@ struct page *dax_layout_busy_page(struct address_space 
> *mapping)
>   if (!dax_mapping(mapping) || !mapping_mapped(mapping))
>   return NULL;
>  
> + /* If end == LLONG_MAX, all pages from start to till end of file */
> + if (end == LLONG_MAX)
> + end_idx = ULONG_MAX;
> + else
> + end_idx = end >> PAGE_SHIFT;
>   /*
>* If we race get_user_pages_fast() here either we'll see the
>* elevated page count in the iteration and wait, or
> @@ -596,15 +607,15 @@ struct page *dax_layout_busy_page(struct address_space 
> *mapping)
>* against is no longer mapped in the page tables and bail to the
>* get_user_pages() slow path.  The slow path is protected by
>* pte_lock() and pmd_lock(). New references are not taken without
> -  * holding those locks, and unmap_mapping_range() will not zero the
> +  * holding those locks, and unmap_mapping_pages() will not zero the
>* pte or pmd without holding the respective lock, so we are
>* guaranteed to either see new references or prevent new
>* references from being established.
>*/
> - unmap_mapping_range(mapping, 0, 0, 0);
> + unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);
>  
>   xas_lock_irq(&xas);
> - xas_for_each(&xas, entry, ULONG_MAX) {
> + xas_for_each(&xas, entry, end_idx) {
>   if (WARN_ON_ONCE(!xa_is_value(entry)))
>   continue;
>   if (unlikely(dax_is_locked(entry)))
> @@ -625,6 +636,12 @@ stru

Re: [PATCH v2 01/20] dax: Modify bdev_dax_pgoff() to handle NULL bdev

2020-08-17 Thread Jan Kara
On Fri 07-08-20 15:55:07, Vivek Goyal wrote:
> virtiofs does not have a block device but it has a dax device.
> Modify bdev_dax_pgoff() to be able to handle that.
> 
> If there is no bdev, that means the dax offset is 0. (It can't be a partition
> block device starting at an offset in the dax device).
> 
> This is a little hackish. There have been discussions about getting rid
> of dax not supporting partitions.
> 
> https://lore.kernel.org/linux-fsdevel/20200107125159.ga15...@infradead.org/
> 
> IMHO, this path can easily break existing users. For example
> ioctl(BLKPG_ADD_PARTITION) will start breaking on block devices
> supporting DAX. Also, I personally find it very useful to be able to
> partition dax devices and still be able to use DAX.
> 
> Alternatively, I tried to store offset into dax device information in iomap
> interface, but that got NACKed.
> 
> https://lore.kernel.org/linux-fsdevel/20200217133117.gb20...@infradead.org/
> 
> I can't think of a good path to solve this issue properly. So to make
> progress, it seems this patch is the least bad option for now and I hope
> we can take it.
> 
> Signed-off-by: Stefan Hajnoczi 
> Signed-off-by: Vivek Goyal 
> Cc: Christoph Hellwig 
> Cc: Dan Williams 
> Cc: linux-nvdimm@lists.01.org

This patch looks OK to me. You can add:

Reviewed-by: Jan Kara 

Honza
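
A caller-side sketch of what this change permits (illustrative only, modeled
on the virtiofs usage where no block device exists):

	pgoff_t pgoff;
	int rc;

	/* No bdev: the sector is already relative to the start of the
	 * dax device, so the offset math reduces to sector * 512. */
	rc = bdev_dax_pgoff(NULL, sector, PAGE_SIZE, &pgoff);
	if (rc)
		return rc;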

> ---
>  drivers/dax/super.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 8e32345be0f7..c4bec437e88b 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -46,7 +46,8 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
>  int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
>   pgoff_t *pgoff)
>  {
> - phys_addr_t phys_off = (get_start_sect(bdev) + sector) * 512;
> + sector_t start_sect = bdev ? get_start_sect(bdev) : 0;
> + phys_addr_t phys_off = (start_sect + sector) * 512;
>  
>   if (pgoff)
>   *pgoff = PHYS_PFN(phys_off);
> -- 
> 2.25.4
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 02/20] dax: Create a range version of dax_layout_busy_page()

2020-08-17 Thread Jan Kara
();
> - xas_for_each(&xas, entry, ULONG_MAX) {
> + xas_for_each(&xas, entry, end_idx) {
>   if (WARN_ON_ONCE(!xa_is_value(entry)))
>   continue;
>   if (unlikely(dax_is_locked(entry)))
> @@ -625,6 +634,27 @@ struct page *dax_layout_busy_page(struct address_space 
> *mapping)
>   xas_unlock_irq();
>   return page;
>  }
> +EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
> +
> +/**
> + * dax_layout_busy_page - find first pinned page in @mapping
> + * @mapping: address space to scan for a page with ref count > 1
> + *
> + * DAX requires ZONE_DEVICE mapped pages. These pages are never
> + * 'onlined' to the page allocator so they are considered idle when
> + * page->count == 1. A filesystem uses this interface to determine if
> + * any page in the mapping is busy, i.e. for DMA, or other
> + * get_user_pages() usages.
> + *
> + * It is expected that the filesystem is holding locks to block the
> + * establishment of new mappings in this address_space. I.e. it expects
> + * to be able to run unmap_mapping_range() and subsequently not race
> + * mapping_mapped() becoming true.
> + */
> +struct page *dax_layout_busy_page(struct address_space *mapping)
> +{
> + return dax_layout_busy_page_range(mapping, 0, 0);

Should the 'end' rather be LLONG_MAX?
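
That is, the wrapper would presumably become (sketch):

struct page *dax_layout_busy_page(struct address_space *mapping)
{
	return dax_layout_busy_page_range(mapping, 0, LLONG_MAX);
}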

Otherwise the patch looks good to me.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v3] dax: print error message by pr_info() in __generic_fsdax_supported()

2020-07-27 Thread Jan Kara
On Mon 27-07-20 10:02:11, Jane Chu wrote:
> Hi,
> 
> On 7/25/2020 9:24 AM, Coly Li wrote:
> > It is not simple to make dax_supported() from struct dax_operations
> > or __generic_fsdax_supported() to return exact failure type right now.
> > So the simplest fix is to use pr_info() to print all the error messages
> > inside __generic_fsdax_supported(). Then users may find informative clue
> > from the kernel message at least.
> 
> I happen to notice that some servers set their printk levels at 4 by default
> to minimize console messages:
> # cat /proc/sys/kernel/printk
>  4   4   1  7
> So I'm wondering if you would consider pr_error() instead of pr_info() ?

I don't think this is a good reason to raise priority of this message -
with this logic applied, all info messages should be raised to error level
because someone may find them useful :). And then people raise printk
loglevel because the kernel is too noisy... Personally I think that
pr_info() is fine because there will be an error message about unsupported dax
setup from the filesystem, and if the sysadmin wishes, (s)he can always look up
the info messages in dmesg.
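
For reference, the filtering Jane describes works roughly like this
(a simplified sketch with illustrative names, not the actual printk code):

	/* With /proc/sys/kernel/printk = "4 4 1 7", a message reaches the
	 * console only if its level is numerically below console_loglevel,
	 * so KERN_ERR (3) gets through while KERN_INFO (6) is filtered: */
	if (msg_level < console_loglevel)
		write_to_console(msg);
	/* the record lands in the printk ring buffer either way, which is
	 * why dmesg still shows pr_info() output */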

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] dax: print error message by pr_info() in __generic_fsdax_supported()

2020-07-07 Thread Jan Kara
On Tue 07-07-20 01:26:26, Coly Li wrote:
> In struct dax_operations, the callback routine dax_supported() returns
> a bool type result. For a false return value, the caller has no idea
> whether the device does not support dax at all, or it is just some mis-
> configuration issue.
> 
> An example is formatting an Ext4 file system on pmem device on top of
> a NVDIMM namespace by,
>  # mkfs.ext4 /dev/pmem0
> If the fs block size does not match kernel space memory page size (which
> is possible on non-x86 platforms), mounting this Ext4 file system will fail,
>   # mount -o dax /dev/pmem0 /mnt
>   mount: /mnt: wrong fs type, bad option, bad superblock on /dev/pmem0,
>   missing codepage or helper program, or other error.
> And from the dmesg output there is only the following information,
>   [  307.853148] EXT4-fs (pmem0): DAX unsupported by block device.
> 
> The above information is quite confusing, because the pmem0 device
> definitely supports dax operations, and the super block is consistent with how
> it was created by mkfs.ext4.
> 
> Indeed the failure is from __generic_fsdax_supported() by the following
> code piece,
> if (blocksize != PAGE_SIZE) {
>pr_debug("%s: error: unsupported blocksize for dax\n",
> bdevname(bdev, buf));
> return false;
> }
> It is because the Ext4 block size is 4KB and kernel page size is 8KB or
> 16KB.
> 
> It is not simple to make dax_supported() from struct dax_operations
> or __generic_fsdax_supported() to return exact failure type right now.
> So the simplest fix is to use pr_info() to print all the error messages
> inside __generic_fsdax_supported(). Then users may find an informative clue
> from the kernel message at least.
> 
> Messages printed by pr_debug() are very easily ignored by users. This
> patch prints the error messages by pr_info() in __generic_fsdax_supported();
> when the mount fails, the following lines can be found in the dmesg output,
>  [ 2705.500885] pmem0: error: unsupported blocksize for dax
>  [ 2705.500888] EXT4-fs (pmem0): DAX unsupported by block device.
> Now the users may have an idea that the mount failure is from the pmem driver for
> unsupported block size.
> 
> Reported-by: Michal Suchanek 
> Suggested-by: Jan Kara 
> Signed-off-by: Coly Li 
> Cc: Dan Williams 
> Cc: Anthony Iliopoulos 

Yes, the patch looks good to me. There's just the slight concern that
somebody could call bdev_dax_supported() frequently (e.g. on each ioctl(2)
call) and that would then spam the logs but currently it is only ever
called from the mount functions so printing the message does the right
thing and I think it is a usability improvement... Feel free to add:

Reviewed-by: Jan Kara 

Honza
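
Should the frequent-caller concern ever materialize, the usual kernel remedy
would be the ratelimited variant, e.g. (sketch, not part of this patch):

	pr_info_ratelimited("%s: error: unsupported blocksize for dax\n",
			bdevname(bdev, buf));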

> ---
>  drivers/dax/super.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 8e32345be0f7..de0d02ec0347 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -80,14 +80,14 @@ bool __generic_fsdax_supported(struct dax_device *dax_dev,
>   int err, id;
>  
>   if (blocksize != PAGE_SIZE) {
> - pr_debug("%s: error: unsupported blocksize for dax\n",
> + pr_info("%s: error: unsupported blocksize for dax\n",
>   bdevname(bdev, buf));
>   return false;
>   }
>  
>   err = bdev_dax_pgoff(bdev, start, PAGE_SIZE, &pgoff);
>   if (err) {
> - pr_debug("%s: error: unaligned partition for dax\n",
> + pr_info("%s: error: unaligned partition for dax\n",
>   bdevname(bdev, buf));
>   return false;
>   }
> @@ -95,7 +95,7 @@ bool __generic_fsdax_supported(struct dax_device *dax_dev,
>   last_page = PFN_DOWN((start + sectors - 1) * 512) * PAGE_SIZE / 512;
>   err = bdev_dax_pgoff(bdev, last_page, PAGE_SIZE, &pgoff_end);
>   if (err) {
> - pr_debug("%s: error: unaligned partition for dax\n",
> + pr_info("%s: error: unaligned partition for dax\n",
>   bdevname(bdev, buf));
>   return false;
>   }
> @@ -106,7 +106,7 @@ bool __generic_fsdax_supported(struct dax_device *dax_dev,
>   dax_read_unlock(id);
>  
>   if (len < 1 || len2 < 1) {
> - pr_debug("%s: error: dax access failed (%ld)\n",
> + pr_info("%s: error: dax access failed (%ld)\n",
>   bdevname(bdev, buf), len < 1 ? len : len2);
>   return false;
>       }
> @@ -139,7 +

Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.

2020-06-03 Thread Jan Kara
On Tue 02-06-20 17:59:08, Williams, Dan J wrote:
> [ forgive formatting, a series of unfortunate events has me using Outlook for 
> the moment ]
> 
> > From: Jan Kara 
> > > > > These flags are device properties that affect the kernel and
> > > > > userspace's handling of persistence.
> > > > >
> > > >
> > > > That will not handle the scenario with multiple applications using
> > > > the same fsdax mount point where one is updated to use the new
> > > > instruction and the other is not.
> > >
> > > Right, it needs to be a global setting / flag day to switch from one
> > > regime to another. Per-process control is a recipe for disaster.
> > 
> > First I'd like to mention that hopefully the concern is mostly theoretical,
> > since as Aneesh wrote above, real persistent memory never shipped for PPC and
> > so there are very few apps (if any) using the old way to ensure cache
> > flushing.
> > 
> > But I'd like to understand why you think per-process control is a recipe
> > for disaster? Because from my POV the sysfs interface you propose is
> > actually difficult to use in practice. As a distributor, you have a hard
> > time picking the default because you have a choice between the safe option,
> > which is going to confuse users because of failing MAP_SYNC, and the unsafe
> > option, where everyone will be happy until someone loses data because of
> > some ancient application using wrong instructions to persist data. Poor
> > experience for users either way. And when the distro defaults to the "safe
> > option", the burden is on the sysadmin to toggle the switch, but how is he
> > supposed to decide when that is safe? First he has to understand what the
> > problem actually is, then he has to audit all the applications using pmem
> > for whether they use the new instruction, which is IMO a lot of effort if
> > you have a couple of applications and practically infeasible if you have
> > more of them.
> > So IMO the burden should be *on the application* to declare that it is aware
> > of the new instructions to flush pmem on the platform and only to such
> > application the kernel should give the trust to use MAP_SYNC mappings.
> 
> The "disaster" in my mind is this need to globally change the ABI for
> persistence semantics for all of Linux because one CPU wants a do-over.
> What does a generic "MAP_SYNC_ENABLE" knob even mean to the existing
> deployed base of persistent memory applications? Yes, sysfs is awkward,
> but it's trying to provide some relief without imposing unexplainable
> semantics on everyone else. I think a comprehensive (overengineered)
> solution would involve not introducing another "I know what I'm doing"
> flag to the interface, but maybe requiring applications to call a pmem
> sync API in something like a vsyscall. Or, also overengineered, some
> binary translation / interpretation to actively detect and kill
> applications that deploy the old instructions. Something horrid like on
> first write fault to a MAP_SYNC try to look ahead in the binary for the
> correct sync sequence and kill the application otherwise. That would at
> least provide some enforcement and safety without requiring other
> architectures to consider what MAP_SYNC_ENABLE means to them.

Thanks for explanation. So I absolutely agree that other architectures (and
even older versions of POWER architecture) must not be influenced by the new
tunable. That's why I wrote in my reply to Aneesh that I'd be for checking
during mmap(2) with MAP_SYNC, whether we are in a situation where new PPC
flush instructions are required and *only in that case* decide based on the
prctl value whether MAP_SYNC should be allowed or not.

Whether this solution is overengineering or not depends on how you think
it's likely there will be applications trying to use old flush instructions
with MAP_SYNC on POWER10 platforms...
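
For context, the application-side contract both sides assume is the usual
MAP_SYNC probe-and-fallback pattern, roughly (a user-space sketch;
flush_cpu_cache() is a hypothetical stand-in for the arch-specific
instruction sequence under discussion):

	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	int have_sync = (addr != MAP_FAILED);

	if (!have_sync)		/* e.g. EOPNOTSUPP: kernel denied MAP_SYNC */
		addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			    MAP_SHARED, fd, 0);
	...
	/* persisting a store: */
	if (have_sync)
		flush_cpu_cache(addr, size);	/* hypothetical helper */
	else
		msync(addr, size, MS_SYNC);	/* addr must be page aligned */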

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.

2020-06-01 Thread Jan Kara
On Mon 01-06-20 17:31:50, Aneesh Kumar K.V wrote:
> On 6/1/20 3:39 PM, Jan Kara wrote:
> > On Fri 29-05-20 16:25:35, Aneesh Kumar K.V wrote:
> > > On 5/29/20 3:22 PM, Jan Kara wrote:
> > > > On Fri 29-05-20 15:07:31, Aneesh Kumar K.V wrote:
> > > > > Thanks Michal. I also missed Jeff in this email thread.
> > > > 
> > > > And I think you'll also need some of the sched maintainers for the prctl
> > > > bits...
> > > > 
> > > > > On 5/29/20 3:03 PM, Michal Suchánek wrote:
> > > > > > Adding Jan
> > > > > > 
> > > > > > On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
> > > > > > > With POWER10, architecture is adding new pmem flush and sync 
> > > > > > > instructions.
> > > > > > > The kernel should prevent the usage of MAP_SYNC if applications 
> > > > > > > are not using
> > > > > > > the new instructions on newer hardware.
> > > > > > > 
> > > > > > > This patch adds a prctl option MAP_SYNC_ENABLE that can be used 
> > > > > > > to enable
> > > > > > > the usage of MAP_SYNC. The kernel config option is added to allow 
> > > > > > > the user
> > > > > > > to control whether MAP_SYNC should be enabled by default or not.
> > > > > > > 
> > > > > > > Signed-off-by: Aneesh Kumar K.V 
> > > > ...
> > > > > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > > > index 8c700f881d92..d5a9a363e81e 100644
> > > > > > > --- a/kernel/fork.c
> > > > > > > +++ b/kernel/fork.c
> > > > > > > @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp 
> > > > > > > DEFINE_SPINLOCK(mmlist_lock);
> > > > > > > static unsigned long default_dump_filter = 
> > > > > > > MMF_DUMP_FILTER_DEFAULT;
> > > > > > > +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
> > > > > > > +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
> > > > > > > +#else
> > > > > > > +unsigned long default_map_sync_mask = 0;
> > > > > > > +#endif
> > > > > > > +
> > > > 
> > > > I'm not sure CONFIG is really the right approach here. For a distro 
> > > > that would
> > > > basically mean to disable MAP_SYNC for all PPC kernels unless 
> > > > application
> > > > explicitly uses the right prctl. Shouldn't we rather initialize
> > > > default_map_sync_mask on boot based on whether the CPU we run on 
> > > > requires
> > > > new flush instructions or not? Otherwise the patch looks sensible.
> > > > 
> > > 
> > > yes that is correct. We ideally want to deny MAP_SYNC only w.r.t POWER10.
> > > But on a virtualized platform there is no easy way to detect that. We 
> > > could
> > > ideally hook this into the nvdimm driver where we look at the new compat
> > > string ibm,persistent-memory-v2 and then disable MAP_SYNC
> > > if we find a device with the specific value.
> > 
> > Hum, couldn't we set some flag for nvdimm devices with
> > "ibm,persistent-memory-v2" property and then check it during mmap(2) time
> > and when the device has this propery and the mmap(2) caller doesn't have
> > the prctl set, we'd disallow MAP_SYNC? That should make things mostly
> > seamless, shouldn't it? Only apps that want to use MAP_SYNC on these
> > devices would need to use prctl(MMF_DISABLE_MAP_SYNC, 0) but then these
> > applications need to be aware of new instructions so this isn't that much
> > additional burden...
> 
> I am not sure an application would want to add that much detail/knowledge
> about a platform in their code. I was expecting application to do
> 
> #ifdef __ppc64__
> prctl(MAP_SYNC_ENABLE, 1, 0, 0, 0));
> #endif
> a = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE,
> MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
> 
> 
> For that code all the complexity that we add w.r.t ibm,persistent-memory-v2
> is not useful. Do you see a value in making all these device specific rather
> than a conditional on  __ppc64__?

Yes, from the application POV the code would look like this plus the
application would use instructions appropriate for POWER10 for flushing
caches...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.

2020-06-01 Thread Jan Kara
On Fri 29-05-20 16:25:35, Aneesh Kumar K.V wrote:
> On 5/29/20 3:22 PM, Jan Kara wrote:
> > On Fri 29-05-20 15:07:31, Aneesh Kumar K.V wrote:
> > > Thanks Michal. I also missed Jeff in this email thread.
> > 
> > And I think you'll also need some of the sched maintainers for the prctl
> > bits...
> > 
> > > On 5/29/20 3:03 PM, Michal Suchánek wrote:
> > > > Adding Jan
> > > > 
> > > > On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
> > > > > With POWER10, architecture is adding new pmem flush and sync 
> > > > > instructions.
> > > > > The kernel should prevent the usage of MAP_SYNC if applications are 
> > > > > not using
> > > > > the new instructions on newer hardware.
> > > > > 
> > > > > This patch adds a prctl option MAP_SYNC_ENABLE that can be used to 
> > > > > enable
> > > > > the usage of MAP_SYNC. The kernel config option is added to allow the 
> > > > > user
> > > > > to control whether MAP_SYNC should be enabled by default or not.
> > > > > 
> > > > > Signed-off-by: Aneesh Kumar K.V 
> > ...
> > > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > index 8c700f881d92..d5a9a363e81e 100644
> > > > > --- a/kernel/fork.c
> > > > > +++ b/kernel/fork.c
> > > > > @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp 
> > > > > DEFINE_SPINLOCK(mmlist_lock);
> > > > >static unsigned long default_dump_filter = MMF_DUMP_FILTER_DEFAULT;
> > > > > +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
> > > > > +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
> > > > > +#else
> > > > > +unsigned long default_map_sync_mask = 0;
> > > > > +#endif
> > > > > +
> > 
> > I'm not sure CONFIG is really the right approach here. For a distro that 
> > would
> > basically mean to disable MAP_SYNC for all PPC kernels unless application
> > explicitly uses the right prctl. Shouldn't we rather initialize
> > default_map_sync_mask on boot based on whether the CPU we run on requires
> > new flush instructions or not? Otherwise the patch looks sensible.
> > 
> 
> yes that is correct. We ideally want to deny MAP_SYNC only w.r.t POWER10.
> But on a virtualized platform there is no easy way to detect that. We could
> ideally hook this into the nvdimm driver where we look at the new compat
> string ibm,persistent-memory-v2 and then disable MAP_SYNC
> if we find a device with the specific value.

Hum, couldn't we set some flag for nvdimm devices with
"ibm,persistent-memory-v2" property and then check it during mmap(2) time
and when the device has this propery and the mmap(2) caller doesn't have
the prctl set, we'd disallow MAP_SYNC? That should make things mostly
seamless, shouldn't it? Only apps that want to use MAP_SYNC on these
devices would need to use prctl(MMF_DISABLE_MAP_SYNC, 0) but then these
applications need to be aware of new instructions so this isn't that much
additional burden...

> With that I am wondering should we even have this patch? Can we expect
> userspace to get updated to use the new instructions?
> 
> With ppc64 we never had a real persistent memory device available for end
> user to try. The available persistent memory stack was using vPMEM which was
> presented as a volatile memory region for which there is no need to use any
> of the flush instructions. We could safely assume that as we get
> applications certified/verified for working with pmem device on ppc64, they
> would all be using the new instructions?

This is a bit of a gamble... I don't have too much trust in certification /
verification because only the "big players" may do powerfail testing
thoroughly enough that they'd uncover these problems. So the question
really is: How many apps are out there using MAP_SYNC on ppc64? Hopefully
not many given the HW didn't ship yet as you wrote but I have no real clue.
Similarly there's a question: How many app writers will read manual for
older ppc64 architecture and write apps that won't work reliably on
POWER10? Again, I have no idea.

So the prctl would be IMHO a nice safety belt but I'm not 100% certain it
will be needed...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.

2020-06-01 Thread Jan Kara
On Sat 30-05-20 09:35:19, Dan Williams wrote:
> On Sat, May 30, 2020 at 12:18 AM Aneesh Kumar K.V
>  wrote:
> >
> > On 5/30/20 12:52 AM, Dan Williams wrote:
> > > On Fri, May 29, 2020 at 3:55 AM Aneesh Kumar K.V
> > >  wrote:
> > >>
> > >> On 5/29/20 3:22 PM, Jan Kara wrote:
> > >>> Hi!
> > >>>
> > >>> On Fri 29-05-20 15:07:31, Aneesh Kumar K.V wrote:
> > >>>> Thanks Michal. I also missed Jeff in this email thread.
> > >>>
> > >>> And I think you'll also need some of the sched maintainers for the prctl
> > >>> bits...
> > >>>
> > >>>> On 5/29/20 3:03 PM, Michal Suchánek wrote:
> > >>>>> Adding Jan
> > >>>>>
> > >>>>> On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
> > >>>>>> With POWER10, architecture is adding new pmem flush and sync 
> > >>>>>> instructions.
> > >>>>>> The kernel should prevent the usage of MAP_SYNC if applications are 
> > >>>>>> not using
> > >>>>>> the new instructions on newer hardware.
> > >>>>>>
> > >>>>>> This patch adds a prctl option MAP_SYNC_ENABLE that can be used to 
> > >>>>>> enable
> > >>>>>> the usage of MAP_SYNC. The kernel config option is added to allow 
> > >>>>>> the user
> > >>>>>> to control whether MAP_SYNC should be enabled by default or not.
> > >>>>>>
> > >>>>>> Signed-off-by: Aneesh Kumar K.V 
> > >>> ...
> > >>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
> > >>>>>> index 8c700f881d92..d5a9a363e81e 100644
> > >>>>>> --- a/kernel/fork.c
> > >>>>>> +++ b/kernel/fork.c
> > >>>>>> @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp 
> > >>>>>> DEFINE_SPINLOCK(mmlist_lock);
> > >>>>>> static unsigned long default_dump_filter = 
> > >>>>>> MMF_DUMP_FILTER_DEFAULT;
> > >>>>>> +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
> > >>>>>> +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
> > >>>>>> +#else
> > >>>>>> +unsigned long default_map_sync_mask = 0;
> > >>>>>> +#endif
> > >>>>>> +
> > >>>
> > >>> I'm not sure CONFIG is really the right approach here. For a distro 
> > >>> that would
> > >>> basically mean to disable MAP_SYNC for all PPC kernels unless 
> > >>> application
> > >>> explicitly uses the right prctl. Shouldn't we rather initialize
> > >>> default_map_sync_mask on boot based on whether the CPU we run on 
> > >>> requires
> > >>> new flush instructions or not? Otherwise the patch looks sensible.
> > >>>
> > >>
> > >> yes that is correct. We ideally want to deny MAP_SYNC only w.r.t
> > >> POWER10. But on a virtualized platform there is no easy way to detect
> > >> that. We could ideally hook this into the nvdimm driver where we look at
> > >> the new compat string ibm,persistent-memory-v2 and then disable MAP_SYNC
> > >> if we find a device with the specific value.
> > >>
> > >> BTW with the recent changes I posted for the nvdimm driver, older kernel
> > >> won't initialize persistent memory device on newer hardware. Newer
> > >> hardware will present the device to OS with a different device tree
> > >> compat string.
> > >>
> > >> My expectation  w.r.t this patch was, Distro would want to  mark
> > >> CONFIG_ARCH_MAP_SYNC_DISABLE=n based on the different application
> > >> certification.  Otherwise application will have to end up calling the
> > >> prctl(MMF_DISABLE_MAP_SYNC, 0) any way. If that is the case, should this
> > >> be dependent on P10?
> > >>
> > >> With that I am wondering should we even have this patch? Can we expect
> > >> userspace to get updated to use the new instructions?
> > >>
> > >> With ppc64 we never had a real persistent memory device available for
> > >> end user to try. The available persistent memory stack was using vPMEM
> > >> which was presented as a volatile memory

Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.

2020-05-29 Thread Jan Kara
Hi!

On Fri 29-05-20 15:07:31, Aneesh Kumar K.V wrote:
> Thanks Michal. I also missed Jeff in this email thread.

And I think you'll also need some of the sched maintainers for the prctl
bits...

> On 5/29/20 3:03 PM, Michal Suchánek wrote:
> > Adding Jan
> > 
> > On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
> > > With POWER10, architecture is adding new pmem flush and sync instructions.
> > > The kernel should prevent the usage of MAP_SYNC if applications are not 
> > > using
> > > the new instructions on newer hardware.
> > > 
> > > This patch adds a prctl option MAP_SYNC_ENABLE that can be used to enable
> > > the usage of MAP_SYNC. The kernel config option is added to allow the user
> > > to control whether MAP_SYNC should be enabled by default or not.
> > > 
> > > Signed-off-by: Aneesh Kumar K.V 
...
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 8c700f881d92..d5a9a363e81e 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp 
> > > DEFINE_SPINLOCK(mmlist_lock);
> > >   static unsigned long default_dump_filter = MMF_DUMP_FILTER_DEFAULT;
> > > +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
> > > +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
> > > +#else
> > > +unsigned long default_map_sync_mask = 0;
> > > +#endif
> > > +

I'm not sure CONFIG is really the right approach here. For a distro that would
basically mean to disable MAP_SYNC for all PPC kernels unless application
explicitly uses the right prctl. Shouldn't we rather initialize
default_map_sync_mask on boot based on whether the CPU we run on requires
new flush instructions or not? Otherwise the patch looks sensible.
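
For illustration, a minimal sketch of such boot-time initialization; the
CPU-feature predicate below is hypothetical, invented for this sketch, and
only default_map_sync_mask comes from the posted patch:

	/*
	 * Sketch: choose the default at boot instead of via Kconfig.
	 * cpu_needs_new_pmem_flush() is a made-up stand-in for whatever
	 * CPU/firmware detection ppc64 would actually use.
	 */
	static int __init map_sync_default_init(void)
	{
		if (cpu_needs_new_pmem_flush())
			default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
		return 0;
	}
	early_initcall(map_sync_default_init);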

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2] dax: Add missing annotation for wait_entry_unlocked()

2020-04-15 Thread Jan Kara
On Wed 01-04-20 16:33:59, Jules Irenge wrote:
> Sparse reports a warning at wait_entry_unlocked()
> 
> warning: context imbalance in wait_entry_unlocked() - unexpected unlock
> 
> The root cause is the missing annotation at wait_entry_unlocked().
> Add the missing __releases(xas->xa->xa_lock) annotation.
> 
> Signed-off-by: Jules Irenge 

The patch looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/dax.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 35da144375a0..ee0468af4d81 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -244,6 +244,7 @@ static void *get_unlocked_entry(struct xa_state *xas, 
> unsigned int order)
>   * After we call xas_unlock_irq(), we cannot touch xas->xa.
>   */
>  static void wait_entry_unlocked(struct xa_state *xas, void *entry)
> + __releases(xas->xa->xa_lock)
>  {
>   struct wait_exceptional_entry_queue ewait;
>   wait_queue_head_t *wq;
> -- 
> Change since v2
> - gives more accurate lock variable name
> 2.25.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 3/7] dax: Add missing annotation for wait_entry_unlocked()

2020-04-01 Thread Jan Kara
On Tue 31-03-20 21:46:39, Jules Irenge wrote:
> Sparse reports a warning at wait_entry_unlocked()
> 
> warning: context imbalance in wait_entry_unlocked()
>   - unexpected unlock
> 
> The root cause is the missing annotation at wait_entry_unlocked().
> Add the missing __releases(xa) annotation.
> 
> Signed-off-by: Jules Irenge 
> ---
>  fs/dax.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 1f1f0201cad1..adcd2a57fbad 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -244,6 +244,7 @@ static void *get_unlocked_entry(struct xa_state *xas, 
> unsigned int order)
>   * After we call xas_unlock_irq(), we cannot touch xas->xa.
>   */
>  static void wait_entry_unlocked(struct xa_state *xas, void *entry)
> + __releases(xa)

Thanks for the patch but is this a proper sparse annotation? I'd rather
expect something like __releases(xas->xa->xa_lock) here...
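
For context, sparse balances lock contexts using these annotations; a
minimal generic illustration (not from the patch under review):

	/* Tells sparse the function returns with the lock held. */
	static void my_lock(spinlock_t *lock)
		__acquires(lock)
	{
		spin_lock(lock);
	}

	/* Tells sparse the function releases a lock its caller acquired. */
	static void my_unlock(spinlock_t *lock)
		__releases(lock)
	{
		spin_unlock(lock);
	}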

Honza

>  {
>   struct wait_exceptional_entry_queue ewait;
>   wait_queue_head_t *wq;
> -- 
> 2.24.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [ndctl PATCH v2 1/2] ndctl/test: Cleanup test-vs-production nvdimm module detection

2020-03-04 Thread Jan Kara
On Tue 03-03-20 14:58:30, Dan Williams wrote:
> Update nfit_test_init() to use strcmp() instead of strstr() to filter
> which modules are probed to come from the out-of-tree unit-test set.
> 
> Reported-by: Jan Kara 
> Link: http://lore.kernel.org/r/20200303132850.ga21...@quack2.suse.cz
> Signed-off-by: Dan Williams 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  test/core.c |6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/test/core.c b/test/core.c
> index 888f5d8c0e42..3aa746fe6786 100644
> --- a/test/core.c
> +++ b/test/core.c
> @@ -164,7 +164,7 @@ int nfit_test_init(struct kmod_ctx **ctx, struct 
> kmod_module **mod,
>* Don't check for device-dax modules on kernels older
>* than 4.7.
>*/
> - if (strstr(name, "dax")
> + if (strcmp(name, "dax") == 0
>   && !ndctl_test_attempt(test,
>   KERNEL_VERSION(4, 7, 0)))
>   continue;
> @@ -172,8 +172,8 @@ int nfit_test_init(struct kmod_ctx **ctx, struct 
> kmod_module **mod,
>   /*
>* Skip device-dax bus-model modules on pre-v5.1
>*/
> - if ((strstr(name, "dax_pmem_core")
> - || strstr(name, "dax_pmem_compat"))
> + if ((strcmp(name, "dax_pmem_core") == 0
> + || strcmp(name, "dax_pmem_compat") == 0)
>   && !ndctl_test_attempt(test,
>   KERNEL_VERSION(5, 1, 0)))
>   continue;
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [ndctl PATCH v2 2/2] ndctl/test: Relax dax_pmem_compat requirement

2020-03-04 Thread Jan Kara
On Tue 03-03-20 14:58:35, Dan Williams wrote:
> While there are some tests that require the new "dax-bus" device model,
> none of the tests require compatibility mode. Drop the requirement so
> the tests work with DEV_DAX_PMEM_COMPAT=n kernels.
> 
> Link: http://lore.kernel.org/r/20200123154720.12097-1-j...@suse.cz
> Cc: Jan Kara 
> Signed-off-by: Dan Williams 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  test/core.c |8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/test/core.c b/test/core.c
> index 3aa746fe6786..5118d86483d4 100644
> --- a/test/core.c
> +++ b/test/core.c
> @@ -180,6 +180,14 @@ int nfit_test_init(struct kmod_ctx **ctx, struct 
> kmod_module **mod,
>  
>  retry:
>   rc = kmod_module_new_from_name(*ctx, name, mod);
> +
> + /*
> +  * dax_pmem_compat is not required, missing is ok,
> +  * present-but-production is not ok.
> +  */
> + if (rc && strcmp(name, "dax_pmem_compat") == 0)
> + continue;
> +
>   if (rc) {
>   log_err(&log_ctx, "%s.ko: missing\n", name);
>   break;
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [ndctl PATCH 27/36] ndctl/test: Relax dax_pmem_compat requirement

2020-03-03 Thread Jan Kara
On Sat 29-02-20 12:22:28, Dan Williams wrote:
> While there are some tests that require the new "dax-bus" device model,
> none of the tests require compatibility mode. Drop the requirement so
> the tests work with DEV_DAX_PMEM_COMPAT=n kernels.
> 
> Link: http://lore.kernel.org/r/20200123154720.12097-1-j...@suse.cz
> Cc: Jan Kara 
> Signed-off-by: Dan Williams 

The patch looks good to me. Thanks for fixing this! I just have to say that
the strstr(3) usage in this function looks rather unusual to me. Why not
just strcmp(3)?
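
The pitfall is that strstr(3) does substring matching, so a production
module whose name merely contains a test-module name gets misclassified. A
standalone illustration:

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		const char *name = "dax_pmem_core";

		/* substring match: wrongly classifies dax_pmem_core as "dax" */
		printf("strstr: %d\n", strstr(name, "dax") != NULL);
		/* exact match: only a module literally named "dax" qualifies */
		printf("strcmp: %d\n", strcmp(name, "dax") == 0);
		return 0;
	}

This prints "strstr: 1" but "strcmp: 0".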

Honza

> ---
>  test/core.c |8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/test/core.c b/test/core.c
> index 888f5d8c0e42..dff842a9f378 100644
> --- a/test/core.c
> +++ b/test/core.c
> @@ -180,6 +180,14 @@ int nfit_test_init(struct kmod_ctx **ctx, struct 
> kmod_module **mod,
>  
>  retry:
>   rc = kmod_module_new_from_name(*ctx, name, mod);
> +
> + /*
> +  * dax_pmem_compat is not required, missing is ok,
> +  * present-but-production is not ok.
> +  */
> + if (rc && strstr(name, "dax_pmem_compat"))
> + continue;
> +
>   if (rc) {
>   log_err(&log_ctx, "%s.ko: missing\n", name);
>   break;
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] tools/testing/nvdimm: Fix compilation failure without CONFIG_DEV_DAX_PMEM_COMPAT

2020-02-17 Thread Jan Kara
On Fri 14-02-20 08:13:59, Dan Williams wrote:
> On Fri, Feb 14, 2020 at 1:42 AM Jan Kara  wrote:
> > > > But, I understand if you want to prevent build bots from hitting
> > > > compilation failures due to this.
> > >
> > > Hmm, build bots would only hit what's covered by
> > > CONFIG_NVDIMM_TEST_BUILD, and that's only building
> > > tools/testing/nvdimm/test/iomap.c.
> > >
> > > Jan, were you just looking to use nfit_test outside of running the
> > > ndctl test suites? Or was this just a drive-by compilation test?
> >
> > The problem is following: We build our distro kernels without
> > CONFIG_DEV_DAX_PMEM_COMPAT because we don't need that functionality. And
> > Jing Han (from Intel ;) is now complaining that he cannot compile and run
> > the ndctl testsuite on our kernels... It seems stupid to enable that config
> > option for all distro users just to be able to run the testsuite but OTOH
> > it would be neat to be able to run the testsuite with stock distro
> > config.
> 
> Sounds good, minus the fact that Jing and I were not on the same page.
> I'll send the ndctl fixup.

Thanks!

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] tools/testing/nvdimm: Fix compilation failure without CONFIG_DEV_DAX_PMEM_COMPAT

2020-02-14 Thread Jan Kara
On Wed 12-02-20 12:49:41, Dan Williams wrote:
> On Wed, Feb 12, 2020 at 6:04 AM Jeff Moyer  wrote:
> >
> > Jan Kara  writes:
> >
> > > When a kernel is configured without CONFIG_DEV_DAX_PMEM_COMPAT, the
> > > compilation of tools/testing/nvdimm fails with:
> > >
> > >   Building modules, stage 2.
> > >   MODPOST 11 modules
> > > ERROR: "dax_pmem_compat_test" [tools/testing/nvdimm/test/nfit_test.ko] 
> > > undefined!
> > >
> > > Fix the problem by calling dax_pmem_compat_test() only if the kernel has
> > > the required functionality.
> > >
> > > Signed-off-by: Jan Kara 
> >
> > What's the motivation?  Is this just to fix randconfig builds?  The
> > reason I ask is that the test suite will expect to be able to find the
> > dax_pmem_compat module, so it doesn't make sense to me to disable those
> > tests only in the kernel as you'll hit a problem when running the tests
> > anyway.
> 
> Yeah, at a minimum you'd also need to go fix up nfit_test_init() to
> not check for the dax_pmem_compat module:
> 
> https://github.com/pmem/ndctl/blob/master/test/core.c#L119

OK.

> > But, I understand if you want to prevent build bots from hitting
> > compilation failures due to this.
> 
> Hmm, build bots would only hit what's covered by
> CONFIG_NVDIMM_TEST_BUILD, and that's only building
> tools/testing/nvdimm/test/iomap.c.
> 
> Jan, were you just looking to use nfit_test outside of running the
> ndctl test suites? Or was this just a drive-by compilation test?

The problem is following: We build our distro kernels without
CONFIG_DEV_DAX_PMEM_COMPAT because we don't need that functionality. And
Jing Han (from Intel ;) is now complaining that he cannot compile and run
the ndctl testsuite on our kernels... It seems stupid to enable that config
option for all distro users just to be able to run the testsuite but OTOH
it would be neat to be able to run the testsuite with stock distro
config.

Honza


-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] tools/testing/nvdimm: Fix compilation failure without CONFIG_DEV_DAX_PMEM_COMPAT

2020-02-10 Thread Jan Kara
On Thu 23-01-20 16:47:20, Jan Kara wrote:
> When a kernel is configured without CONFIG_DEV_DAX_PMEM_COMPAT, the
> compilation of tools/testing/nvdimm fails with:
> 
>   Building modules, stage 2.
>   MODPOST 11 modules
> ERROR: "dax_pmem_compat_test" [tools/testing/nvdimm/test/nfit_test.ko] 
> undefined!
> 
> Fix the problem by calling dax_pmem_compat_test() only if the kernel has
> the required functionality.
> 
> Signed-off-by: Jan Kara 

Ping?

Honza

> ---
>  tools/testing/nvdimm/test/nfit.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/tools/testing/nvdimm/test/nfit.c 
> b/tools/testing/nvdimm/test/nfit.c
> index bf6422a6af7f..a8ee5c4d41eb 100644
> --- a/tools/testing/nvdimm/test/nfit.c
> +++ b/tools/testing/nvdimm/test/nfit.c
> @@ -3164,7 +3164,9 @@ static __init int nfit_test_init(void)
>   mcsafe_test();
>   dax_pmem_test();
>   dax_pmem_core_test();
> +#ifdef CONFIG_DEV_DAX_PMEM_COMPAT
>   dax_pmem_compat_test();
> +#endif
>  
>   nfit_test_setup(nfit_test_lookup, nfit_test_evaluate_dsm);
>  
> -- 
> 2.16.4
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [patch] dax: pass NOWAIT flag to iomap_apply

2020-02-06 Thread Jan Kara
On Thu 06-02-20 09:33:39, Jeff Moyer wrote:
> Jan Kara  writes:
> 
> > On Wed 05-02-20 14:15:58, Jeff Moyer wrote:
> >> fstests generic/471 reports a failure when run with MOUNT_OPTIONS="-o
> >> dax".  The reason is that the initial pwrite to an empty file with the
> >> RWF_NOWAIT flag set does not return -EAGAIN.  It turns out that
> >> dax_iomap_rw doesn't pass that flag through to iomap_apply.
> >> 
> >> With this patch applied, generic/471 passes for me.
> >> 
> >> Signed-off-by: Jeff Moyer 
> >
> > The patch looks good to me. You can add:
> >
> > Reviewed-by: Jan Kara 
> >
> > BTW, I've just noticed ext4 seems to be buggy in this regard and even this
> > patch doesn't fix it. So I guess you've been using XFS for testing this?
> 
> That's right, sorry I didn't mention that.  Will you send a patch for
> ext4, or do you want me to look into it?

I've taken down a note in my todo list to eventually look into that but if you
can have a look, I'm more than happy to remove that entry :).

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [patch] dax: pass NOWAIT flag to iomap_apply

2020-02-06 Thread Jan Kara
On Wed 05-02-20 14:15:58, Jeff Moyer wrote:
> fstests generic/471 reports a failure when run with MOUNT_OPTIONS="-o
> dax".  The reason is that the initial pwrite to an empty file with the
> RWF_NOWAIT flag set does not return -EAGAIN.  It turns out that
> dax_iomap_rw doesn't pass that flag through to iomap_apply.
> 
> With this patch applied, generic/471 passes for me.
> 
> Signed-off-by: Jeff Moyer 

The patch looks good to me. You can add:

Reviewed-by: Jan Kara 

BTW, I've just noticed ext4 seems to be buggy in this regard and even this
patch doesn't fix it. So I guess you've been using XFS for testing this?
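
A minimal userspace check in the spirit of generic/471, as a sketch: it
assumes an empty file on a DAX mount (the path is made up; error handling
trimmed):

	#define _GNU_SOURCE
	#include <errno.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/uio.h>

	int main(void)
	{
		char buf[4096] = { 0 };
		struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
		int fd = open("/mnt/dax/testfile",
			      O_CREAT | O_TRUNC | O_WRONLY, 0644);

		/* The first write to an empty file requires block allocation,
		 * so RWF_NOWAIT must fail with EAGAIN rather than block. */
		ssize_t ret = pwritev2(fd, &iov, 1, 0, RWF_NOWAIT);
		printf("ret=%zd errno=%d (%s)\n", ret, errno,
		       ret < 0 && errno == EAGAIN ? "expected" : "unexpected");
		return 0;
	}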

Honza

> diff --git a/fs/dax.c b/fs/dax.c
> index 1f1f0201cad1..0b0d8819cb1b 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1207,6 +1207,9 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>   lockdep_assert_held(&inode->i_rwsem);
>   }
>  
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + flags |= IOMAP_NOWAIT;
> +
>   while (iov_iter_count(iter)) {
>   ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
>   iter, dax_iomap_actor);
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 01/19] dax: remove block device dependencies

2020-01-15 Thread Jan Kara
On Tue 14-01-20 16:28:05, Vivek Goyal wrote:
> On Tue, Jan 14, 2020 at 12:39:00PM -0800, Dan Williams wrote:
> > I think we should at least try to delete the partition support and see
> > if anyone screams. Have a module option to revert the behavior so
> > people are not stuck waiting for the revert to land, but if it stays
> > quiet then we're in a better place with that support pushed out of the
> > dax core.
> 
> Hi Dan,
> 
> So basically keep the partition support code, just disable it by default,
> and have it enabled by some knob, say a kernel command line option or
> module option.
> 
> At what point will we remove that code completely? I mean, what if
> people scream two kernel releases later, after we have removed the code?
> 
> Also, from distribution's perspective, we might not hear from our
> customers for a very long time (till we backport that code in to
> existing releases or release this new code in next major release). From
> that view point I will not like to break existing user visible behavior.
> 
> How bad is it to keep partition support around? To me it feels reasonably
> simple: we just have to store the offset into the dax device in another
> dax object and pass that object around (instead of dax_device). If that's
> the case, I am not sure why to even venture into a direction where some
> user's setup might be broken.
> 
> Also, from an application perspective, /dev/pmem is a block device, so it
> should behave like a block device (including kernel partition table
> support). From that view, dax looks like just an additional feature of
> that device which can be enabled by passing the option "-o dax".

Well, not all block devices are partitionable. For example cdroms are
standard block devices but partitioning does not run for them. Similarly
device mapper devices are block devices but not partitioned. So there is
some precedent in not doing partitioning for some types of block devices.

For the rest I agree that kernels where pmem devices are partitionable have
shipped in enterprise distros and are going to be supported (and used) for
5-10 years before users decide to move on to something newer - at which
point we'll only find out whether someone used the feature or not. So
deprecation is going to be somewhat interesting. On the other hand, a clever
udev rule that detects a partition table on a pmem device and uses kpartx to
partition these devices (like what happens e.g. for dm-multipath devices)
could possibly be used as a replacement for kernel support, so there's a way
out of this...

So I don't care too deeply about what the decision is going to be.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 01/19] dax: remove block device dependencies

2020-01-09 Thread Jan Kara
On Tue 07-01-20 10:49:55, Dan Williams wrote:
> On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal  wrote:
> > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where
> > dax code refers back to block device to figure out partition offset in
> > dax device. If we create a dax object corresponding to "struct block_device"
> > and store sector offset in that, then we could pass that object to dax
> > code and not worry about referring back to bdev. I have written some
> > proof of concept code and called that object "dax_handle". I can post
> > that code if there is interest.
> 
> I don't think it's worth it in the end especially considering
> filesystems are looking to operate on /dev/dax devices directly and
> remove block entanglements entirely.
> 
> > IMHO, it feels useful to be able to partition and use a dax capable
> > block device in the same way as a non-dax block device. It will be really
> > odd to think that if the filesystem is on /dev/pmem0p1, then dax can't
> > be enabled, but if the filesystem is on /dev/mapper/pmem0p1, then dax
> > will work.
> 
> That can already happen today. If you do not properly align the
> partition then dax operations will be disabled. This proposal just
> extends that existing failure domain to make all partitions fail to
> support dax.

Well, I have some sympathy with the sysadmin that has a /dev/pmem0 device,
decides to create partitions on it for whatever (possibly misguided)
reason, and then wonders why the hell DAX is not working. And PAGE_SIZE
partition alignment is so obvious and widespread that I don't count it as a
realistic error case sysadmins would be pondering about currently.

So I'd find two options reasonably consistent:
1) Keep status quo where partitions are created and support DAX.
2) Stop partition creation altogether; if anyone wants to split a pmem
device further, they can use dm-linear for that (i.e., kpartx).

But I'm not sure if the ship hasn't already sailed for option 2) to be
feasible without angry users and Linus reverting the change.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/2] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-11-15 Thread Jan Kara
On Thu 14-11-19 16:11:34, John Hubbard wrote:
> An upcoming patch changes and complicates the refcounting and
> especially the "put page" aspects of it. In order to keep
> everything clean, refactor the devmap page release routines:
> 
> * Rename put_devmap_managed_page() to page_is_devmap_managed(),
>   and limit the functionality to "read only": return a bool,
>   with no side effects.
> 
> * Add a new routine, put_devmap_managed_page(), to handle checking
>   what kind of page it is, and what kind of refcount handling it
>   requires.
> 
> * Rename __put_devmap_managed_page() to free_devmap_managed_page(),
>   and limit the functionality to unconditionally freeing a devmap
>   page.
> 
> This is originally based on a separate patch by Ira Weiny, which
> applied to an early version of the put_user_page() experiments.
> Since then, Jérôme Glisse suggested the refactoring described above.
> 
> Cc: Jan Kara 
> Cc: Jérôme Glisse 
> Cc: Christoph Hellwig 
> Cc: Dan Williams 
> Suggested-by: Jérôme Glisse 
> Signed-off-by: Ira Weiny 
> Signed-off-by: John Hubbard 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  include/linux/mm.h | 27 ---
>  mm/memremap.c  | 16 ++--
>  2 files changed, 26 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a2adf95b3f9c..96228376139c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -967,9 +967,10 @@ static inline bool is_zone_device_page(const struct page 
> *page)
>  #endif
>  
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page);
> +void free_devmap_managed_page(struct page *page);
>  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> -static inline bool put_devmap_managed_page(struct page *page)
> +
> +static inline bool page_is_devmap_managed(struct page *page)
>  {
>   if (!static_branch_unlikely(&devmap_managed_key))
>   return false;
> @@ -978,7 +979,6 @@ static inline bool put_devmap_managed_page(struct page 
> *page)
>   switch (page->pgmap->type) {
>   case MEMORY_DEVICE_PRIVATE:
>   case MEMORY_DEVICE_FS_DAX:
> - __put_devmap_managed_page(page);
>   return true;
>   default:
>   break;
> @@ -986,6 +986,27 @@ static inline bool put_devmap_managed_page(struct page 
> *page)
>   return false;
>  }
>  
> +static inline bool put_devmap_managed_page(struct page *page)
> +{
> + bool is_devmap = page_is_devmap_managed(page);
> +
> + if (is_devmap) {
> + int count = page_ref_dec_return(page);
> +
> + /*
> +  * devmap page refcounts are 1-based, rather than 0-based: if
> +  * refcount is 1, then the page is free and the refcount is
> +  * stable because nobody holds a reference on the page.
> +  */
> + if (count == 1)
> + free_devmap_managed_page(page);
> + else if (!count)
> + __put_page(page);
> + }
> +
> + return is_devmap;
> +}
> +
>  #else /* CONFIG_DEV_PAGEMAP_OPS */
>  static inline bool put_devmap_managed_page(struct page *page)
>  {
> diff --git a/mm/memremap.c b/mm/memremap.c
> index e899fa876a62..2ba773859031 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -411,20 +411,8 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  EXPORT_SYMBOL_GPL(get_dev_pagemap);
>  
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page)
> +void free_devmap_managed_page(struct page *page)
>  {
> - int count = page_ref_dec_return(page);
> -
> - /* still busy */
> - if (count > 1)
> - return;
> -
> - /* only triggered by the dev_pagemap shutdown path */
> - if (count == 0) {
> - __put_page(page);
> - return;
> - }
> -
>   /* notify page idle for dax */
>   if (!is_device_private_page(page)) {
>   wake_up_var(>_refcount);
> @@ -461,5 +449,5 @@ void __put_devmap_managed_page(struct page *page)
>   page->mapping = NULL;
>   page->pgmap->ops->page_free(page);
>  }
> -EXPORT_SYMBOL(__put_devmap_managed_page);
> +EXPORT_SYMBOL(free_devmap_managed_page);
>  #endif /* CONFIG_DEV_PAGEMAP_OPS */
> -- 
> 2.24.0
> 
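
For orientation, the caller side: at the time of this patch the generic
put_page() dispatched to the devmap path before the normal 0-based
refcounting, roughly like this (trimmed sketch, not part of the patch):

	static inline void put_page(struct page *page)
	{
		page = compound_head(page);

		/* devmap pages use 1-based refcounting, handled separately */
		if (put_devmap_managed_page(page))
			return;

		if (put_page_testzero(page))
			__put_page(page);
	}
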
-- 
Jan Kara 
SUSE Labs, CR


Re: DAX filesystem support on ARMv8

2019-11-12 Thread Jan Kara
Hi!

On Tue 12-11-19 02:12:09, Bharat Kumar Gogada wrote:
> As per Documentation/filesystems/dax.txt
> 
> The DAX code does not work correctly on architectures which have virtually
> mapped caches such as ARM, MIPS and SPARC.
> 
> Can anyone please shed light on dax filesystem issue w.r.t ARM architecture ? 

I've CCed Dan, he might have idea what that comment means :)

Out of curiosity, why do you care?

        Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] fs/dax: Fix pmd vs pte conflict detection

2019-10-21 Thread Jan Kara
On Sat 19-10-19 09:26:19, Dan Williams wrote:
> Check for NULL entries before checking the entry order, otherwise NULL
> is misinterpreted as a present pte conflict. The 'order' check needs to
> happen before the locked check as an unlocked entry at the wrong order
> must fallback to lookup the correct order.
> 
> Reported-by: Jeff Smits 
> Reported-by: Doug Nelson 
> Cc: 
> Fixes: 23c84eb78375 ("dax: Fix missed wakeup with PMD faults")
> Cc: Jan Kara 
> Cc: Matthew Wilcox (Oracle) 
> Signed-off-by: Dan Williams 

Good catch! The patch looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/dax.c |5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index a71881e77204..08160011d94c 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -221,10 +221,11 @@ static void *get_unlocked_entry(struct xa_state *xas, 
> unsigned int order)
>  
>   for (;;) {
>   entry = xas_find_conflict(xas);
> + if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
> + return entry;
>   if (dax_entry_order(entry) < order)
>   return XA_RETRY_ENTRY;
> - if (!entry || WARN_ON_ONCE(!xa_is_value(entry)) ||
> - !dax_is_locked(entry))
> + if (!dax_is_locked(entry))
>   return entry;
>  
>   wq = dax_entry_waitqueue(xas, entry, &ewait.key);
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: Lease semantic proposal

2019-10-07 Thread Jan Kara
On Mon 30-09-19 18:42:33, Dave Chinner wrote:
> On Wed, Sep 25, 2019 at 04:46:03PM -0700, Ira Weiny wrote:
> > On Tue, Sep 24, 2019 at 08:26:20AM +1000, Dave Chinner wrote:
> > > Hence, AFAICT, the above definition of an F_RDLCK|F_LAYOUT lease
> > > doesn't appear to be compatible with the semantics required by
> > > existing users of layout leases.
> > 
> > I disagree.  Other than the addition of F_UNBREAK, I think this is 
> > consistent
> > with what is currently implemented.  Also, by exporting all this to user 
> > space
> > we can now write tests for it independent of the RDMA pinning.
> 
> The current usage of F_RDLCK | F_LAYOUT by the pNFS code allows
> layout changes to occur to the file while the layout lease is held.

I remember you saying that in past conversations. But I agree with Ira
that I don't see where in the code this would be implemented. AFAICS
break_layout() called from xfs_break_leased_layouts() simply breaks all the
leases with F_LAYOUT set attached to the inode... Now I'm no expert on
file leases, but what am I missing?
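
For reference, a trimmed sketch of the code path referred to above, as
xfs_break_leased_layouts() in fs/xfs/xfs_pnfs.c looked in kernels of that
era (reproduced from memory; details may differ):

	int
	xfs_break_leased_layouts(
		struct inode		*inode,
		uint			*iolock,
		bool			*did_unlock)
	{
		struct xfs_inode	*ip = XFS_I(inode);
		int			error;

		/* break_layout() breaks every F_LAYOUT lease on the inode */
		while ((error = break_layout(inode, false)) == -EWOULDBLOCK) {
			xfs_iunlock(ip, *iolock);
			*did_unlock = true;
			error = break_layout(inode, true);
			*iolock &= ~XFS_IOLOCK_SHARED;
			*iolock |= XFS_IOLOCK_EXCL;
			xfs_ilock(ip, *iolock);
		}

		return error;
	}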

> IOWs, your definition of F_RDLCK | F_LAYOUT not being allowed
> to change the layout is in direct contradiction to existing users.
> 
> I've said this several times over the past few months now: shared
> layout leases must allow layout modifications to be made. Only
> allowing an exclusive layout lease to modify the layout rules out
> many potential use cases for direct data placement and p2p DMA
> applications, not to mention conflicts with the existing pNFS usage.
> Layout leases need to support more than just RDMA, and tailoring the
> API to exactly the immediate needs of RDMA is just going to make it
> useless for anything else.

I agree we should not tailor the layout lease definition to just the RDMA
usecase. But let's talk about the semantics once our confusion about how
pNFS currently uses layout leases is cleared up.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: Lease semantic proposal

2019-10-03 Thread Jan Kara
g the underlying layout.
> > > > > > 
> > > > > > **Write layout lease (F_WRLCK | F_LAYOUT)**
> > > > > > 
> > > > > > Write Layout leases can be used to break read layout leases to 
> > > > > > indicate that
> > > > > > the process intends to change the underlying layout lease of the 
> > > > > > file.
> > > > > > 
> > > > > > A process which has taken a write layout lease has exclusive 
> > > > > > ownership of the
> > > > > > file layout and can modify that layout as long as the lease is held.
> > > > > > Operations which change the layout are allowed by that process.  
> > > > > > But operations
> > > > > > from other file descriptors which attempt to change the layout will 
> > > > > > break the
> > > > > > lease through the standard lease break process.  The F_LAYOUT flag 
> > > > > > is used to
> > > > > > indicate a difference between a regular F_WRLCK and F_WRLCK with 
> > > > > > F_LAYOUT.  In
> > > > > > the F_LAYOUT case opens for write do not break the lease.  But some 
> > > > > > operations,
> > > > > > if they change the underlying layout, may.
> > > > > > 
> > > > > > The distinction between read layout leases and write layout leases 
> > > > > > is that
> > > > > > write layout leases can change the layout without breaking the 
> > > > > > lease within the
> > > > > > owning process.  This is useful to guarantee a layout prior to 
> > > > > > specifying the
> > > > > > unbreakable flag described below.
> > > > > > 
> > > > > > 
> > > > > 
> > > > > The above sounds totally reasonable. You're essentially exposing the
> > > > > behavior of nfsd's layout leases to userland. To be clear, will 
> > > > > F_LAYOUT
> > > > > leases work the same way as "normal" leases, wrt signals and timeouts?
> > > > 
> > > > That was my intention, yes.
> > > > 
> > > > > I do wonder if we're better off not trying to "or" in flags for this,
> > > > > and instead have a separate set of commands (maybe F_RDLAYOUT,
> > > > > F_WRLAYOUT, F_UNLAYOUT). Maybe I'm just bikeshedding though -- I don't
> > > > > feel terribly strongly about it.
> > > > 
> > > > I'm leaning that was as well.  To make these even more distinct from
> > > > F_SETLEASE.
> > > > 
> > > > > Also, at least in NFSv4, layouts are handed out for a particular byte
> > > > > range in a file. Should we consider doing this with an API that allows
> > > > > for that in the future? Is this something that would be desirable for
> > > > > your RDMA+DAX use-cases?
> > > > 
> > > > I don't see this.  I've thought it would be a nice thing to have but I 
> > > > don't
> > > > know of any hard use case.  But first I'd like to understand how this 
> > > > works for
> > > > NFS.
> > > > 
> > > 
> > > The NFSv4.1 spec allows the client to request the layouts for a
> > > particular range in the file:
> > > 
> > > https://tools.ietf.org/html/rfc5661#page-538
> > > 
> > > The knfsd only hands out whole-file layouts at present. Eventually we
> > > may want to make better use of segmented layouts, at which point we'd
> > > need something like a byte-range lease.
> > > 
> > > > > We could add a new F_SETLEASE variant that takes a struct with a byte
> > > > > range (something like struct flock).
> > > > 
> > > > I think this would be another reason to introduce F_[RD|WR|UN]LAYOUT as 
> > > > a
> > > > command.  Perhaps supporting smaller byte ranges could be added later?
> > > > 
> > > 
> > > I'd definitely not multiplex this over F_SETLEASE. An F_SETLAYOUT cmd
> > > would probably be sufficient, and maybe just reuse
> > > F_RDLCK/F_WRLCK/F_UNLCK for the iomode?
> > > 
> > > For the byte ranges, the catch there is that extending the userland
> > > interface for that later will be difficult.
> > 
> > Why would it be difficult?
> > 
> 
> Legacy userland code that wanted to use byte range enabled layouts would
> have to be rebuilt to take advantage of them. If we require a range from
> the get-go, then they will get the benefit of them once they're
> available.

I don't think this is true, because the current implementation of locking the
whole file may hide implementation bugs in userspace. So the new
range lock handling may break userspace, and history shows such
problems with APIs are actually rather common. So I think switching to
range locking *must* be a conscious decision of the application, and as
such having a separate API for that is the most natural thing to do.

> > > What I'd probably suggest
> > > (and what would jive with the way pNFS works) would be to go ahead and
> > > add an offset and length to the arguments and result (maybe also
> > > whence?).
> > 
> > Why not add new commands with range arguments later if it turns out to
> > be necessary?
> 
> We could do that. It'd be a little ugly, IMO, simply because then we'd
> end up with two interfaces that do almost the exact same thing.
> 
> Should byte-range layouts at that point conflict with non-byte range
> layouts, or should they be in different "spaces" (a'la POSIX and flock
> locks)? When it's all one interface, those sorts of questions sort of
> answer themselves. When they aren't we'll have to document them clearly
> and I think the result will be more confusing for userland programmers.
> 
> If you felt strongly about leaving those out for now, you could just do
> something similar to what Aleksa is planning for openat2 -- have a
> struct pointer and length as arguments for this cmd, and only have a
> single iomode member in there for now.
> 
> The kernel would have to know how to deal with "legacy" and byte-range-
> enabled variants if we ever extend it, but that's not too hard to
> handle.

Yeah, so we can discuss how to make a possible future extension towards
range locking the least confusing to userspace. E.g. we can just put the
ranges in the API and require that start is always 0 and end is always
ULONG_MAX or whatever. But switching to smaller ranges must be a conscious
decision in the application once the kernel supports it.
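
A sketch of what such a forward-compatible argument struct might look like;
every name below is hypothetical, invented for illustration, since this API
was never settled:

	/*
	 * Hypothetical, illustration only: a range-ready layout lease
	 * argument whose range fields are required from day one but
	 * constrained until the kernel implements sub-file ranges.
	 */
	struct layout_lease_arg {
		__u32	iomode;	/* F_RDLCK or F_WRLCK */
		__u32	flags;
		__u64	start;	/* must be 0 for now */
		__u64	end;	/* must be ULONG_MAX for now */
	};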

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 02/19] dax: Pass dax_dev to dax_writeback_mapping_range()

2019-08-27 Thread Jan Kara
On Mon 26-08-19 16:33:26, Vivek Goyal wrote:
> On Mon, Aug 26, 2019 at 04:53:16AM -0700, Christoph Hellwig wrote:
> > On Wed, Aug 21, 2019 at 01:57:03PM -0400, Vivek Goyal wrote:
> > > Right now dax_writeback_mapping_range() is passed a bdev and dax_dev
> > > is searched from that bdev name.
> > > 
> > > virtio-fs does not have a bdev. So pass in dax_dev also to
> > > dax_writeback_mapping_range(). If dax_dev is passed in, bdev is not
> > > used otherwise dax_dev is searched using bdev.
> > 
> > Please just pass in only the dax_device and get rid of the block device.
> > The callers should have one at hand easily, e.g. for XFS just call
> > xfs_find_daxdev_for_inode instead of xfs_find_bdev_for_inode.
> 
> Sure. Here is the updated patch.
> 
> This patch can probably go upstream independently. If you are fine with
> the patch, I can post it separately for inclusion.
> 
> 
> Subject: dax: Pass dax_dev instead of bdev to dax_writeback_mapping_range()
> 
> As of now dax_writeback_mapping_range() takes "struct block_device" as a
> parameter and dax_dev is searched from bdev name. This also involves taking
> a fresh reference on dax_dev and putting that reference at the end of
> the function.
> 
> We are developing a new filesystem virtio-fs and using dax to access host
> page cache directly. But there is no block device. IOW, we want to make
> use of dax but want to get rid of this assumption that there is always
> a block device associated with dax_dev.
> 
> So pass in "struct dax_device" as parameter instead of bdev.
> 
> ext2/ext4/xfs are current users and they already have a reference on
> dax_device. So there is no need to take reference and drop reference to
> dax_device on each call of this function.
> 
> Suggested-by: Christoph Hellwig 
> Signed-off-by: Vivek Goyal 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza
> ---
>  fs/dax.c|8 +---
>  fs/ext2/inode.c |5 +++--
>  fs/ext4/inode.c |2 +-
>  fs/xfs/xfs_aops.c   |2 +-
>  include/linux/dax.h |2 +-
>  5 files changed, 7 insertions(+), 12 deletions(-)
> 
> Index: rhvgoyal-linux-fuse/fs/dax.c
> ===
> --- rhvgoyal-linux-fuse.orig/fs/dax.c 2019-08-26 11:20:36.545009968 -0400
> +++ rhvgoyal-linux-fuse/fs/dax.c  2019-08-26 11:24:43.973009968 -0400
> @@ -936,12 +936,11 @@ static int dax_writeback_one(struct xa_s
>   * on persistent storage prior to completion of the operation.
>   */
>  int dax_writeback_mapping_range(struct address_space *mapping,
> - struct block_device *bdev, struct writeback_control *wbc)
> + struct dax_device *dax_dev, struct writeback_control *wbc)
>  {
>   XA_STATE(xas, &mapping->i_pages, wbc->range_start >> PAGE_SHIFT);
>   struct inode *inode = mapping->host;
>   pgoff_t end_index = wbc->range_end >> PAGE_SHIFT;
> - struct dax_device *dax_dev;
>   void *entry;
>   int ret = 0;
>   unsigned int scanned = 0;
> @@ -952,10 +951,6 @@ int dax_writeback_mapping_range(struct a
>   if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
>   return 0;
>  
> - dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
> - if (!dax_dev)
> - return -EIO;
> -
>   trace_dax_writeback_range(inode, xas.xa_index, end_index);
>  
>   tag_pages_for_writeback(mapping, xas.xa_index, end_index);
> @@ -976,7 +971,6 @@ int dax_writeback_mapping_range(struct a
>   xas_lock_irq(&xas);
>   }
>   xas_unlock_irq(&xas);
> - put_dax(dax_dev);
>   trace_dax_writeback_range_done(inode, xas.xa_index, end_index);
>   return ret;
>  }
> Index: rhvgoyal-linux-fuse/include/linux/dax.h
> ===
> --- rhvgoyal-linux-fuse.orig/include/linux/dax.h  2019-08-26 
> 11:20:36.545009968 -0400
> +++ rhvgoyal-linux-fuse/include/linux/dax.h   2019-08-26 11:26:08.384009968 
> -0400
> @@ -141,7 +141,7 @@ static inline void fs_put_dax(struct dax
>  
>  struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
>  int dax_writeback_mapping_range(struct address_space *mapping,
> - struct block_device *bdev, struct writeback_control *wbc);
> + struct dax_device *dax_dev, struct writeback_control *wbc);
>  
>  struct page *dax_layout_busy_page(struct address_space *mapping);
>  dax_entry_t dax_lock_page(struct page *page);
> Index: rhvgoyal-linux-fuse/fs/xfs/xfs_aops.c
> ===

Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -)

2019-08-19 Thread Jan Kara
On Fri 16-08-19 16:20:07, Ira Weiny wrote:
> On Fri, Aug 16, 2019 at 12:05:28PM -0700, 'Ira Weiny' wrote:
> > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > > > Hello!
> > > > > 
> > > > > On Fri 09-08-19 15:58:14, ira.we...@intel.com wrote:
> > > > > > Pre-requisites
> > > > > > ==
> > > > > > Based on mmotm tree.
> > > > > > 
> > > > > > Based on the feedback from LSFmm, the LWN article, the RFC series 
> > > > > > since
> > > > > > then, and a ton of scenarios I've worked in my mind and/or 
> > > > > > tested...[1]
> > > > > > 
> > > > > > Solution summary
> > > > > > 
> > > > > > 
> > > > > > The real issue is that there is no use case for a user to have RDMA 
> > > > > > pinn'ed
> > > > > > memory which is then truncated.  So really any solution we present 
> > > > > > which:
> > > > > > 
> > > > > > A) Prevents file system corruption or data leaks
> > > > > > ...and...
> > > > > > B) Informs the user that they did something wrong
> > > > > > 
> > > > > > Should be an acceptable solution.
> > > > > > 
> > > > > > Because this is slightly new behavior.  And because this is going 
> > > > > > to be
> > > > > > specific to DAX (because of the lack of a page cache) we have made 
> > > > > > the user
> > > > > > "opt in" to this behavior.
> > > > > > 
> > > > > > The following patches implement the following solution.
> > > > > > 
> > > > > > 0) Registrations to Device DAX char devs are not affected
> > > > > > 
> > > > > > 1) The user has to opt in to allowing page pins on a file with an 
> > > > > > exclusive
> > > > > >layout lease.  Both exclusive and layout lease flags are user 
> > > > > > visible now.
> > > > > > 
> > > > > > 2) page pins will fail if the lease is not active when the file 
> > > > > > back page is
> > > > > >encountered.
> > > > > > 
> > > > > > 3) Any truncate or hole punch operation on a pinned DAX page will 
> > > > > > fail.
> > > > > 
> > > > > So I didn't fully grok the patch set yet but by "pinned DAX page" do 
> > > > > you
> > > > > mean a page which has corresponding file_pin covering it? Or do you 
> > > > > mean a
> > > > > page which has pincount increased? If the first then I'd rephrase 
> > > > > this to
> > > > > be less ambiguous, if the second then I think it is wrong. 
> > > > 
> > > > I mean the second.  but by "fail" I mean hang.  Right now the "normal" 
> > > > page
> > > > pincount processing will hang the truncate.  Given the discussion with 
> > > > John H
> > > > we can make this a bit better if we use something like FOLL_PIN and the 
> > > > page
> > > > count bias to indicate this type of pin.  Then I could fail the truncate
> > > > outright.  but that is not done yet.
> > > > 
> > > > so... I used the word "fail" to be a bit more vague as the final 
> > > > implementation
> > > > may return ETXTBUSY or hang as noted.
> > > 
> > > Ah, OK. Hanging is fine in principle but with longterm pins, your work
> > > makes sure they actually fail with ETXTBUSY, doesn't it? The thing is that
> > > e.g. DIO will use page pins as well for its buffers and we must wait there
> > > until the pin is released. So please just clarify your 'fail' here a bit
> > > :).
> > 
> > It will fail with ETXTBSY.  I've fixed a bug...  See below.
> > 
> > > 
> > > > > > 4) The user has the option of holding the lease or releasing it.  
> > > > > > If they
> > > > > >release it no other pin calls will work on the file.
> > > > > 
> > > > > Last time we spoke the plan 

Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -)

2019-08-19 Thread Jan Kara
On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > 2) Second reason is that I thought I did not have a good way to tell if the
> >lease was actually in use.  What I mean is that letting the lease go 
> > should
> >be ok IFF we don't have any pins...  I was thinking that without John's 
> > code
> >we don't have a way to know if there are any pins...  But that is 
> > wrong...
> >All we have to do is check
> > 
> > !list_empty(file->file_pins)
> > 
> > So now with this detail I think you are right, we should be able to hold the
> > lease through the struct file even if the process no longer has any
> > "references" to it (ie closes and munmaps the file).
> 
> I really, really dislike the idea of zombie layout leases. It's a
> nasty hack for poor application behaviour. This is a "we allow use
> after layout lease release" API, and I think encoding largely
> untraceable zombie objects into an API is very poor design.
> 
> From the fcntl man page:
> 
> LEASES
>   Leases are associated with an open file description (see
>   open(2)).  This means that duplicate file descriptors
>   (created by, for example, fork(2) or dup(2))  re‐ fer  to
>   the  same  lease,  and this lease may be modified or
>   released using any of these descriptors.  Furthermore, the
>   lease is released by either an explicit F_UNLCK operation on
>   any of these duplicate file descriptors, or when all such
>   file descriptors have been closed.
> 
> Leases are associated with *open* file descriptors, not the
> lifetime of the struct file in the kernel. If the application closes
> the open fds that refer to the lease, then the kernel does not
> guarantee, and the application has no right to expect, that the
> lease remains active in any way once the application closes all
> direct references to the lease.
> 
> IOWs, applications using layout leases need to hold the lease fd
> open for as long as the want access to the physical file layout. It
> is a also a requirement of the layout lease that the holder releases
> the resources it holds on the layout before it releases the layout
> lease, exclusive lease or not. Closing the fd indicates they do not
> need access to the file any more, and so the lease should be
> reclaimed at that point.
> 
> I'm of a mind to make the last close() on a file block if there's an
> active layout lease to prevent processes from zombie-ing layout
> leases like this. i.e. you can't close the fd until resources that
> pin the lease have been released.

Yeah, so this was my initial thought as well [1]. But as the discussion in
that thread revealed, the problem with blocking last close is that kernel
does not really expect close to block. You could easily deadlock e.g. if
the process gets SIGKILL, file with lease has fd 10, and the RDMA context
holding pages pinned has fd 15. Or you could wait for another process to
release page pins and blocking SIGKILL on that is also bad. So in the end
the least bad solution we've come up with was these "zombie" leases, as you
call them, plus tracking them in /proc so that userspace at least has a way
of seeing them. But if you can come up with a different solution, I'm
certainly not attached to the current one...

Honza

[1] https://lore.kernel.org/lkml/20190606104203.gf7...@quack2.suse.cz
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -)

2019-08-15 Thread Jan Kara
On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > Hello!
> > 
> > On Fri 09-08-19 15:58:14, ira.we...@intel.com wrote:
> > > Pre-requisites
> > > ==
> > >   Based on mmotm tree.
> > > 
> > > Based on the feedback from LSFmm, the LWN article, the RFC series since
> > > then, and a ton of scenarios I've worked in my mind and/or tested...[1]
> > > 
> > > Solution summary
> > > 
> > > 
> > > The real issue is that there is no use case for a user to have RDMA 
> > > pinn'ed
> > > memory which is then truncated.  So really any solution we present which:
> > > 
> > > A) Prevents file system corruption or data leaks
> > > ...and...
> > > B) Informs the user that they did something wrong
> > > 
> > > Should be an acceptable solution.
> > > 
> > > Because this is slightly new behavior.  And because this is going to be
> > > specific to DAX (because of the lack of a page cache) we have made the 
> > > user
> > > "opt in" to this behavior.
> > > 
> > > The following patches implement the following solution.
> > > 
> > > 0) Registrations to Device DAX char devs are not affected
> > > 
> > > 1) The user has to opt in to allowing page pins on a file with an 
> > > exclusive
> > >layout lease.  Both exclusive and layout lease flags are user visible 
> > > now.
> > > 
> > > 2) page pins will fail if the lease is not active when the file back page 
> > > is
> > >encountered.
> > > 
> > > 3) Any truncate or hole punch operation on a pinned DAX page will fail.
> > 
> > So I didn't fully grok the patch set yet but by "pinned DAX page" do you
> > mean a page which has corresponding file_pin covering it? Or do you mean a
> > page which has pincount increased? If the first then I'd rephrase this to
> > be less ambiguous, if the second then I think it is wrong. 
> 
> I mean the second.  but by "fail" I mean hang.  Right now the "normal" page
> pincount processing will hang the truncate.  Given the discussion with John H
> we can make this a bit better if we use something like FOLL_PIN and the page
> count bias to indicate this type of pin.  Then I could fail the truncate
> outright.  but that is not done yet.
> 
> so... I used the word "fail" to be a bit more vague as the final 
> implementation
> may return ETXTBUSY or hang as noted.

Ah, OK. Hanging is fine in principle but with longterm pins, your work
makes sure they actually fail with ETXTBUSY, doesn't it? The thing is that
e.g. DIO will use page pins as well for its buffers and we must wait there
until the pin is released. So please just clarify your 'fail' here a bit
:).

> > > 4) The user has the option of holding the lease or releasing it.  If they
> > >release it no other pin calls will work on the file.
> > 
> > Last time we spoke the plan was that the lease is kept while the pages are
> > pinned (and an attempt to release the lease would block until the pages are
> > unpinned). That also makes it clear that the *lease* is what is making
> > truncate and hole punch fail with ETXTBUSY and the file_pin structure is
> > just an implementation detail how the existence is efficiently tracked (and
> > what keeps the backing file for the pages open so that the lease does not
> > get auto-destroyed). Why did you change this?
> 
> closing the file _and_ unmapping it will cause the lease to be released
> regardless of whether we allow this or not.
> 
> As we discussed preventing the close seemed intractable.

Yes, preventing the application from closing the file is difficult. But
from a quick look at your patches it seemed to me that you actually hold a
backing file reference from the file_pin structure, so even though the
application closes its file descriptor, the struct file (and thus the
lease) lives on until the file_pin gets released. And that should last
as long as the pages are pinned. Am I missing something?

> I thought about failing the munmap but that seemed wrong as well.  But more
> importantly AFAIK RDMA can pass its memory pins to other processes via FD
> passing...  This means that one could pin this memory, pass it to another
> process and exit.  The file lease on the pin'ed file is lost.

Not if file_pin grabs struct file reference as I mentioned above...
 
> The file lease is just a key to get the memory pin.  Once unlocked the procfs
> tracking keeps track of where that pin goes an

Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -)

2019-08-14 Thread Jan Kara
Hello!

On Fri 09-08-19 15:58:14, ira.we...@intel.com wrote:
> Pre-requisites
> ==
>   Based on mmotm tree.
> 
> Based on the feedback from LSFmm, the LWN article, the RFC series since
> then, and a ton of scenarios I've worked in my mind and/or tested...[1]
> 
> Solution summary
> 
> 
> The real issue is that there is no use case for a user to have RDMA pinn'ed
> memory which is then truncated.  So really any solution we present which:
> 
> A) Prevents file system corruption or data leaks
> ...and...
> B) Informs the user that they did something wrong
> 
> Should be an acceptable solution.
> 
> Because this is slightly new behavior.  And because this is going to be
> specific to DAX (because of the lack of a page cache) we have made the user
> "opt in" to this behavior.
> 
> The following patches implement the following solution.
> 
> 0) Registrations to Device DAX char devs are not affected
> 
> 1) The user has to opt in to allowing page pins on a file with an exclusive
>layout lease.  Both exclusive and layout lease flags are user visible now.
> 
> 2) page pins will fail if the lease is not active when the file back page is
>encountered.
> 
> 3) Any truncate or hole punch operation on a pinned DAX page will fail.

So I didn't fully grok the patch set yet but by "pinned DAX page" do you
mean a page which has corresponding file_pin covering it? Or do you mean a
page which has pincount increased? If the first then I'd rephrase this to
be less ambiguous, if the second then I think it is wrong. 

> 4) The user has the option of holding the lease or releasing it.  If they
>release it no other pin calls will work on the file.

Last time we spoke the plan was that the lease is kept while the pages are
pinned (and an attempt to release the lease would block until the pages are
unpinned). That also makes it clear that the *lease* is what is making
truncate and hole punch fail with ETXTBUSY and the file_pin structure is
just an implementation detail how the existence is efficiently tracked (and
what keeps the backing file for the pages open so that the lease does not
get auto-destroyed). Why did you change this?

> 5) Closing the file is ok.
> 
> 6) Unmapping the file is ok
> 
> 7) Pins against the files are tracked back to an owning file or an owning mm
>depending on the internal subsystem needs.  With RDMA there is an owning
>file which is related to the pined file.
> 
> 8) Only RDMA is currently supported

If you currently only need the "owning file" variant in your patch set, then
I'd just implement that and leave the "owning mm" variant for later if it
proves to be necessary. Things are complex enough as is...

> 9) Truncation of pages which are not actively pinned nor covered by a lease
>will succeed.

Otherwise I like the design.

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] dax: Fix missed PMD wakeups

2019-07-29 Thread Jan Kara
On Tue 16-07-19 20:39:46, Dan Williams wrote:
> On Fri, Jul 12, 2019 at 2:14 AM Jan Kara  wrote:
> >
> > On Thu 11-07-19 08:25:50, Matthew Wilcox wrote:
> > > On Thu, Jul 11, 2019 at 07:13:50AM -0700, Matthew Wilcox wrote:
> > > > However, the XA_RETRY_ENTRY might be a good choice.  It doesn't normally
> > > > appear in an XArray (it may appear if you're looking at a deleted node,
> > > > but since we're holding the lock, we can't see deleted nodes).
> > >
> > ...
> >
> > > @@ -254,7 +267,7 @@ static void wait_entry_unlocked(struct xa_state *xas, 
> > > void *entry)
> > >  static void put_unlocked_entry(struct xa_state *xas, void *entry)
> > >  {
> > >   /* If we were the only waiter woken, wake the next one */
> > > - if (entry)
> > > + if (entry && dax_is_conflict(entry))
> >
> > This should be !dax_is_conflict(entry)...
> >
> > >   dax_wake_entry(xas, entry, false);
> > >  }
> >
> > Otherwise the patch looks good to me so feel free to add:
> >
> > Reviewed-by: Jan Kara 
> 
> Looks good, and passes the test case. Now pushed out to
> libnvdimm-for-next for v5.3 inclusion:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/commit/?h=libnvdimm-for-next&id=23c84eb7837514e16d79ed6d849b13745e0ce688

Thanks for picking up the patch but you didn't apply the fix I've mentioned
above. So put_unlocked_entry() is not waking up anybody anymore... Since
this has already landed in Linus' tree, I guess a separate fixup patch is
needed (attached).

Honza

-- 
Jan Kara 
SUSE Labs, CR
From 950204f7dfdb06198f40820be4d33ce824508f11 Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Mon, 29 Jul 2019 13:57:49 +0200
Subject: [PATCH] dax: Fix missed wakeup in put_unlocked_entry()

The condition checking whether put_unlocked_entry() needs to wake up the
following waiter got broken by commit 23c84eb78375 ("dax: Fix missed
wakeup with PMD faults"). We need to wake the waiter whenever the passed
entry is valid (i.e., non-NULL and not the special conflict entry). This
bug could lead to processes never being woken up when waiting for the
entry lock. Fix the condition.

CC: sta...@vger.kernel.org
Fixes: 23c84eb78375 ("dax: Fix missed wakeup with PMD faults")
Signed-off-by: Jan Kara 
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index a237141d8787..b64964ef44f6 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -266,7 +266,7 @@ static void wait_entry_unlocked(struct xa_state *xas, void *entry)
 static void put_unlocked_entry(struct xa_state *xas, void *entry)
 {
 	/* If we were the only waiter woken, wake the next one */
-	if (entry && dax_is_conflict(entry))
+	if (entry && !dax_is_conflict(entry))
 		dax_wake_entry(xas, entry, false);
 }
 
-- 
2.16.4



Re: [PATCH] dax: Fix missed PMD wakeups

2019-07-12 Thread Jan Kara
On Thu 11-07-19 08:25:50, Matthew Wilcox wrote:
> On Thu, Jul 11, 2019 at 07:13:50AM -0700, Matthew Wilcox wrote:
> > However, the XA_RETRY_ENTRY might be a good choice.  It doesn't normally
> > appear in an XArray (it may appear if you're looking at a deleted node,
> > but since we're holding the lock, we can't see deleted nodes).
> 
...

> @@ -254,7 +267,7 @@ static void wait_entry_unlocked(struct xa_state *xas, void *entry)
>  static void put_unlocked_entry(struct xa_state *xas, void *entry)
>  {
>   /* If we were the only waiter woken, wake the next one */
> - if (entry)
> + if (entry && dax_is_conflict(entry))

This should be !dax_is_conflict(entry)...

>   dax_wake_entry(xas, entry, false);
>  }

Otherwise the patch looks good to me so feel free to add:

Reviewed-by: Jan Kara 

once you fix this.

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] dax: Fix missed PMD wakeups

2019-07-11 Thread Jan Kara
On Wed 10-07-19 13:15:39, Matthew Wilcox wrote:
> On Wed, Jul 10, 2019 at 09:02:04PM +0200, Jan Kara wrote:
> > +#define DAX_ENTRY_CONFLICT dax_make_entry(pfn_to_pfn_t(1), DAX_EMPTY)
> 
> I was hoping to get rid of DAX_EMPTY ... it's almost unused now.  Once
> we switch to having a single DAX_LOCK value instead of a single bit,
> I think it can go away, freeing up two bits.
> 
> If you really want a special DAX_ENTRY_CONFLICT, I think we can make
> one in the 2..4094 range.
> 
> That aside, this looks pretty similar to the previous patch I sent, so
> if you're now happy with this, let's add
> 
> #define XA_DAX_CONFLICT_ENTRY xa_mk_internal(258)
> 
> to xarray.h and do it that way?

Yeah, that would work for me as well. The chosen value for DAX_ENTRY_CONFLICT
was pretty arbitrary. Or we could possibly use:

#define DAX_ENTRY_CONFLICT XA_ZERO_ENTRY

so that we don't leak a DAX-specific internal definition into xarray.h?

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] dax: Fix missed PMD wakeups

2019-07-11 Thread Jan Kara
On Wed 10-07-19 20:35:55, Matthew Wilcox wrote:
> On Wed, Jul 10, 2019 at 09:02:04PM +0200, Jan Kara wrote:
> > So how about the attached patch? That keeps the interface sane and passes a
> > smoketest for me (full fstest run running). Obviously it also needs a
> > proper changelog...
> 
> Changelog and slightly massaged version along the lines of my two comments
> attached.
> 

> From 57b63fdd38e7bea7eb8d6332f0163fb028570def Mon Sep 17 00:00:00 2001
> From: "Matthew Wilcox (Oracle)" 
> Date: Wed, 3 Jul 2019 23:21:25 -0400
> Subject: [PATCH] dax: Fix missed wakeup with PMD faults
> 
> RocksDB can hang indefinitely when using a DAX file.  This is due to
> a bug in the XArray conversion when handling a PMD fault and finding a
> PTE entry.  We use the wrong index in the hash and end up waiting on
> the wrong waitqueue.
> 
> There's actually no need to wait; if we find a PTE entry while looking
> for a PMD entry, we can return immediately as we know we should fall
> back to a PTE fault (which may not conflict with the lock held).
> 
> Cc: sta...@vger.kernel.org
> Fixes: b15cd800682f ("dax: Convert page fault handlers to XArray")
> Signed-off-by: Matthew Wilcox (Oracle) 

Just one nit below. Otherwise feel free to add:

Reviewed-by: Jan Kara 

> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> index 052e06ff4c36..fb25452bcfa4 100644
> --- a/include/linux/xarray.h
> +++ b/include/linux/xarray.h
> @@ -169,7 +169,9 @@ static inline bool xa_is_internal(const void *entry)
>   return ((unsigned long)entry & 3) == 2;
>  }
>  
> +#define XA_RETRY_ENTRY   xa_mk_internal(256)
>  #define XA_ZERO_ENTRY		xa_mk_internal(257)
> +#define XA_DAX_CONFLICT_ENTRY	xa_mk_internal(258)
>  
>  /**
>   * xa_is_zero() - Is the entry a zero entry?

As I wrote in my previous email, wouldn't it be nicer if we just defined
DAX_CONFLICT_ENTRY (or however we name it) inside the dax code, say to
XA_ZERO_ENTRY? Generic xarray code just doesn't care about that value and
I can imagine that in the future there'll be other xarray users who'd like
to have some special value(s) for use similarly to DAX, and we don't want to
clutter the xarray definitions with those as well. If you don't like
XA_ZERO_ENTRY, I could also imagine having an XA_USER_RESERVED value that's
guaranteed to be unused by the xarray, and we'd define DAX_CONFLICT_ENTRY
to it. Overall I don't care too much so I can live even with what you have
now, but it would seem cleaner that way to me.
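
Concretely, something like this (XA_USER_RESERVED is a made-up name here,
the value just follows the xa_mk_internal() convention above):

	/* include/linux/xarray.h: one internal value the xarray
	 * implementation promises never to use for its own purposes */
	#define XA_USER_RESERVED	xa_mk_internal(258)

	/* fs/dax.c: the DAX-specific name stays private to DAX */
	#define DAX_ENTRY_CONFLICT	XA_USER_RESERVED

That way xarray.h only documents that one value is reserved for users, and
DAX (or any future user) attaches its own meaning to it.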

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] dax: Fix missed PMD wakeups

2019-07-11 Thread Jan Kara
On Wed 10-07-19 20:08:55, Matthew Wilcox wrote:
> On Wed, Jul 10, 2019 at 09:02:04PM +0200, Jan Kara wrote:
> > @@ -848,7 +853,7 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
> > if (unlikely(dax_is_locked(entry))) {
> > void *old_entry = entry;
> >  
> > -   entry = get_unlocked_entry(xas);
> > +   entry = get_unlocked_entry(xas, 0);
> >  
> > /* Entry got punched out / reallocated? */
> > if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
> 
> I'm not sure about this one.  Are we sure there will never be a dirty
> PMD entry?  Even if we can't create one today, it feels like a bit of
> a landmine to leave for someone who creates one in the future.

I was thinking about this but dax_writeback_one() doesn't really care what
entry it gets. Yes, in theory it could get a PMD where previously there was
a PTE or vice versa, but we check that the PFNs match and, if they really do
match, there's no harm in doing the flushing for whatever entry we got back...
And the checks are simpler this way.

    Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] dax: Fix missed PMD wakeups

2019-07-10 Thread Jan Kara
On Fri 05-07-19 13:47:02, Dan Williams wrote:
> On Fri, Jul 5, 2019 at 12:10 PM Matthew Wilcox  wrote:
> >
> > On Thu, Jul 04, 2019 at 04:27:14PM -0700, Dan Williams wrote:
> > > On Thu, Jul 4, 2019 at 12:14 PM Matthew Wilcox  
> > > wrote:
> > > >
> > > > On Thu, Jul 04, 2019 at 06:54:50PM +0200, Jan Kara wrote:
> > > > > On Wed 03-07-19 20:27:28, Matthew Wilcox wrote:
> > > > > > So I think we're good for all current users.
> > > > >
> > > > > Agreed but it is an ugly trap. As I already said, I'd rather pay the
> > > > > unnecessary cost of waiting for pte entry and have an easy to 
> > > > > understand
> > > > > interface. If we ever have a real world use case that would care for 
> > > > > this
> > > > > optimization, we will need to refactor functions to make this 
> > > > > possible and
> > > > > still keep the interfaces sane. For example get_unlocked_entry() could
> > > > > return special "error code" indicating that there's no entry with 
> > > > > matching
> > > > > order in xarray but there's a conflict with it. That would be much 
> > > > > less
> > > > > error-prone interface.
> > > >
> > > > This is an internal interface.  I think it's already a pretty gnarly
> > > > interface to use by definition -- it's going to sleep and might return
> > > > almost anything.  There's not much scope for returning an error 
> > > > indicator
> > > > either; value entries occupy half of the range (all odd numbers between 
> > > > 1
> > > > and ULONG_MAX inclusive), plus NULL.  We could use an internal entry, 
> > > > but
> > > > I don't think that makes the interface any easier to use than returning
> > > > a locked entry.
> > > >
> > > > I think this iteration of the patch makes it a little clearer.  What do 
> > > > you
> > > > think?
> > > >
> > >
> > > Not much clearer to me. get_unlocked_entry() is now misnamed and this
> >
> > misnamed?  You'd rather it was called "try_get_unlocked_entry()"?
> 
> I was thinking more along the lines of
> get_unlocked_but_sometimes_locked_entry(), i.e. per Jan's feedback to
> keep the interface simple.

So how about the attached patch? That keeps the interface sane and passes a
smoketest for me (full fstest run running). Obviously it also needs a
proper changelog...

Honza

-- 
Jan Kara 
SUSE Labs, CR
From 1aeaba0e061b2bf38143f21d054e66853543a680 Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Wed, 10 Jul 2019 20:28:37 +0200
Subject: [PATCH] dax: Fix missed PMD wakeups

Signed-off-by: Jan Kara 
---
 fs/dax.c | 46 +-
 1 file changed, 25 insertions(+), 21 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index fe5e33810cd4..3fe655d38c7a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -191,15 +191,18 @@ static void dax_wake_entry(struct xa_state *xas, void *entry, bool wake_all)
 		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
 }
 
+#define DAX_ENTRY_CONFLICT dax_make_entry(pfn_to_pfn_t(1), DAX_EMPTY)
 /*
  * Look up entry in page cache, wait for it to become unlocked if it
  * is a DAX entry and return it.  The caller must subsequently call
  * put_unlocked_entry() if it did not lock the entry or dax_unlock_entry()
- * if it did.
+ * if it did. 'order' is the minimum order of entry to return. If there's no
+ * entry of sufficiently large order but there are some entries of lower order
+ * in the range described by xas, return special DAX_ENTRY_CONFLICT value.
  *
  * Must be called with the i_pages lock held.
  */
-static void *get_unlocked_entry(struct xa_state *xas)
+static void *get_unlocked_entry(struct xa_state *xas, unsigned int order)
 {
 	void *entry;
 	struct wait_exceptional_entry_queue ewait;
@@ -210,6 +213,8 @@ static void *get_unlocked_entry(struct xa_state *xas)
 
 	for (;;) {
 		entry = xas_find_conflict(xas);
+		if (dax_entry_order(entry) < order)
+			return DAX_ENTRY_CONFLICT;
 		if (!entry || WARN_ON_ONCE(!xa_is_value(entry)) ||
 				!dax_is_locked(entry))
 			return entry;
@@ -254,7 +259,7 @@ static void wait_entry_unlocked(struct xa_state *xas, void *entry)
 static void put_unlocked_entry(struct xa_state *xas, void *entry)
 {
 	/* If we were the only waiter woken, wake the next one */
-	if (entry)
+	if (entry && entry != DAX_ENTRY_CONFLICT)
 		dax_wake_entry(xas, entry, false);
 }
 
@@ -461,7 +466,7 @@ void dax_unlock_page(struct page *page, dax_entry_t cookie)
 * overlap with xarray value entries

Re: [PATCH] dax: Fix missed PMD wakeups

2019-07-04 Thread Jan Kara
On Wed 03-07-19 20:27:28, Matthew Wilcox wrote:
> On Wed, Jul 03, 2019 at 02:28:41PM -0700, Dan Williams wrote:
> > On Wed, Jul 3, 2019 at 12:53 PM Matthew Wilcox  wrote:
> > > @@ -211,7 +215,8 @@ static void *get_unlocked_entry(struct xa_state *xas)
> > > for (;;) {
> > > entry = xas_find_conflict(xas);
> > > if (!entry || WARN_ON_ONCE(!xa_is_value(entry)) ||
> > > -   !dax_is_locked(entry))
> > > +   !dax_is_locked(entry) ||
> > > +   dax_entry_order(entry) < xas_get_order(xas))
> > 
> > Doesn't this potentially allow a locked entry to be returned for a
> > caller that expects all value entries are unlocked?
> 
> It only allows locked entries to be returned for callers which pass in
> an xas which refers to a PMD entry.  This is fine for grab_mapping_entry()
> because it checks size_flag & is_pte_entry.
> 
> dax_layout_busy_page() only uses 0-order.
> __dax_invalidate_entry() only uses 0-order.
> dax_writeback_one() needs an extra fix:
> 
> /* Did a PMD entry get split? */
> if (dax_is_locked(entry))
> goto put_unlocked;
> 
> dax_insert_pfn_mkwrite() checks for a mismatch of pte vs pmd.
> 
> So I think we're good for all current users.

Agreed, but it is an ugly trap. As I already said, I'd rather pay the
unnecessary cost of waiting for the PTE entry and have an easy-to-understand
interface. If we ever have a real-world use case that would care about this
optimization, we will need to refactor functions to make this possible and
still keep the interfaces sane. For example, get_unlocked_entry() could
return a special "error code" indicating that there's no entry with a
matching order in the xarray but there's a conflict with it. That would be
a much less error-prone interface.

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] filesystem-dax: Disable PMD support

2019-07-04 Thread Jan Kara
On Wed 03-07-19 08:47:00, Matthew Wilcox wrote:
> On Mon, Jul 01, 2019 at 02:11:19PM +0200, Jan Kara wrote:
> > BTW, looking into the xarray code, I think I found another difference
> > between the old radix tree code and the new xarray code that could cause
> > issues. In the old radix tree code if we tried to insert PMD entry but
> > there was some PTE entry in the covered range, we'd get EEXIST error back
> > and the DAX fault code relies on this. I don't see how similar behavior is
> > achieved by xas_store()...
> 
> Are you referring to this?
> 
> -   entry = dax_make_locked(0, size_flag | DAX_EMPTY);
> -
> -   err = __radix_tree_insert(&mapping->i_pages, index,
> -   dax_entry_order(entry), entry);
> -   radix_tree_preload_end();
> -   if (err) {
> -   xa_unlock_irq(&mapping->i_pages);
> -   /*
> -* Our insertion of a DAX entry failed, most likely
> -* because we were inserting a PMD entry and it
> -* collided with a PTE sized entry at a different
> -* index in the PMD range.  We haven't inserted
> -* anything into the radix tree and have no waiters to
> -* wake.
> -*/
> -   return ERR_PTR(err);
> -   }

Mostly yes.

> If so, that can't happen any more because we no longer drop the i_pages
> lock while the entry is NULL, so the entry is always locked while the
> i_pages lock is dropped.

Ah, I have misinterpreted what xas_find_conflict() does. I'm sorry for the
noise. I find it somewhat unfortunate that xas_find_conflict() does not
return, in any way, the index at which it found the conflicting entry. We
could then use it for the wait logic as well and wouldn't have to resort to
some masking tricks...
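
For reference, the masking trick I mean is in dax_entry_waitqueue()
(a simplified sketch of the current code):

	static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
			void *entry, struct exceptional_entry_key *key)
	{
		unsigned long index = xas->xa_index;

		/*
		 * Wait on the PMD-aligned index for PMD entries, precisely
		 * because xas_find_conflict() cannot tell us where the
		 * conflicting entry actually starts.
		 */
		if (dax_is_pmd_entry(entry))
			index &= ~PG_PMD_COLOUR;
		key->xa = xas->xa;
		key->entry_start = index;

		return wait_table + hash_long((unsigned long)xas->xa ^ index,
					      DAX_WAIT_TABLE_BITS);
	}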

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] filesystem-dax: Disable PMD support

2019-07-03 Thread Jan Kara
On Sun 30-06-19 08:23:24, Matthew Wilcox wrote:
> On Sun, Jun 30, 2019 at 01:01:04AM -0700, Dan Williams wrote:
> > @@ -215,7 +216,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
> >  * queue to the start of that PMD.  This ensures that all offsets in
> >  * the range covered by the PMD map to the same bit lock.
> >  */
> > -   if (dax_is_pmd_entry(entry))
> > +   //if (dax_is_pmd_entry(entry))
> > index &= ~PG_PMD_COLOUR;
> > key->xa = xas->xa;
> > key->entry_start = index;
> 
> Hah, that's a great naive fix!  Thanks for trying that out.
> 
> I think my theory was slightly mistaken, but your fix has the effect of
> fixing the actual problem too.
> 
> The xas->xa_index for a PMD is going to be PMD-aligned (ie a multiple of
> 512), but xas_find_conflict() does _not_ adjust xa_index (... which I
> really should have mentioned in the documentation).  So we go to sleep
> on the PMD-aligned index instead of the index of the PTE.  Your patch
> fixes this by using the PMD-aligned index for PTEs too.
> 
> I'm trying to come up with a clean fix for this.  Clearly we
> shouldn't wait for a PTE entry if we're looking for a PMD entry.
> But what should get_unlocked_entry() return if it detects that case?
> We could have it return an error code encoded as an internal entry,
> like grab_mapping_entry() does.  Or we could have it return the _locked_
> PTE entry, and have callers interpret that.
> 
> At least get_unlocked_entry() is static, but it's got quite a few callers.
> Trying to discern which ones might ask for a PMD entry is a bit tricky.
> So this seems like a large patch which might have bugs.

Yeah. So get_unlocked_entry() is used in several cases:

1) The case where we already have an entry at the given index but it is
locked and we need it unlocked so that we can do our thing
(dax_writeback_one(), dax_layout_busy_page()).

2) The case where we want any entry covering the given index (in
__dax_invalidate_entry()). This is essentially the same as case 1) since we
have already looked up the entry (we just didn't propagate that information
from mm/truncate.c) - we want any unlocked entry covering the given index.

3) Cases where we really want the entry at the given index and we have some
entry order constraints (dax_insert_pfn_mkwrite(), grab_mapping_entry()).

Honestly, I'd make the rule that get_unlocked_entry() returns an entry of
any order that covers the given index. I agree it may unnecessarily wait
for a PTE entry lock in case 3), where we are really looking only for a
PMD entry, but that seems like a relatively small cost for the simplicity
of the interface.

BTW, looking into the xarray code, I think I found another difference
between the old radix tree code and the new xarray code that could cause
issues. In the old radix tree code, if we tried to insert a PMD entry but
there was some PTE entry in the covered range, we'd get an EEXIST error
back, and the DAX fault code relies on this. I don't see how similar
behavior is achieved by xas_store()...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: a few questions about pagevec_lookup_entries

2019-06-24 Thread Jan Kara
On Mon 24-06-19 09:25:00, Miklos Szeredi wrote:
> [cc: vivek, stefan, dgilbert]
> 
> On Fri, Jun 21, 2019 at 12:04 AM Liu Bo  wrote:
> >
> > On Thu, Jun 20, 2019 at 1:36 AM Jan Kara  wrote:
> > >
> > > [added some relevant lists to CC - this can save some people debugging by
> > > being able to google this discussion]
> > >
> > > On Wed 19-06-19 15:57:38, Liu Bo wrote:
> > > > I found a weird infinite loop within invalidate_inode_pages2_range, the
> > > > reason being that  pagevec_lookup_entries(index=1) returns an indices
> > > > array which has only one entry storing value 0, and this has led
> > > > invalidate_inode_pages2_range() into an infinite loop, something like,
> > > >
> > > > invalidate_inode_pages2_range()
> > > >   -> while (pagevec_lookup_entries(index=1, indices))
> > > > ->  for (i = 0; i < pagevec_count(); i++) {
> > > >   -> index = indices[0]; // index is set to 0
> > > >   -> if (radix_tree_exceptional_entry(page)) {
> > > >   -> if (!invalidate_exceptional_entry2()) //
> > > >   ->__dax_invalidate_mapping_entry // return 0
> > > >  -> // entry marked as PAGECACHE_TAG_DIRTY/TOWRITE
> > > >  ret = -EBUSY;
> > > >   ->continue;
> > > >   } // end of if (radix_tree_exceptional_entry(page))
> > > > -> index++; // index is set to 1
> > > >
> > > > The following debug[1] proved the above analysis,  I was wondering if
> > > > this was a corner case that  pagevec_lookup_entries() allows or a
> > > > known bug that has been fixed upstream?
> > > >
> > > > ps: the kernel in use is 4.19.30 (LTS).
> > >
> > > Hum, the above trace suggests you are using DAX. Are you really? Because the
> > > stacktrace below shows we are working on a fuse inode, so that shouldn't
> > > really be a DAX inode...
> > >
> >
> > So I was running tests against virtiofs[1] which adds dax support to
> > fuse, with dax, fuse provides posix stuff while dax provides data
> > channel.
> >
> > [1]: https://virtio-fs.gitlab.io/
> > https://gitlab.com/virtio-fs/linux

OK, thanks for the explanation and the pointer. If I had to guess, I'd
say that there's some problem with multiorder entries (for PMD pages) in
the radix tree. In particular, if you look up index 1 and there's a
multiorder entry for indices 0-511, radix_tree_next_chunk() updates
iter->index like:

iter->index = (index &~ node_maxindex(node)) | (offset << node->shift);

and offset is computed by radix_tree_descend() as:

offset = (index >> parent->shift) & RADIX_TREE_MAP_MASK;

So this all results in iter->index being set to 0 and thus confuses the
iteration in invalidate_inode_pages2_range(). The current kernel has
Matthew's xarray code, which maintains the originally passed index in
xas.xa_index, and thus the problem isn't there.
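
To make that concrete, take an order-9 (PMD) entry covering indices 0-511
and a lookup starting at index 1 (the shift / mask values below are
illustrative):

	offset      = (index >> parent->shift) & RADIX_TREE_MAP_MASK
	            = (1 >> 9) & 63 = 0
	iter->index = (index &~ node_maxindex(node)) | (offset << node->shift)
	            = (1 & ~511) | (0 << 9) = 0

So the iteration hands back index 0 although the caller asked for index 1;
invalidate_inode_pages2_range() then sets index = indices[0] = 0, increments
it back to 1, and revisits the same entry forever.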

So to sum up: it seems like a DAX-specific bug with PMD entries in older
kernels, fixed by the xarray rewrite.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-20 Thread Jan Kara
On Thu 13-06-19 08:27:55, Matthew Wilcox wrote:
> On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> > e.g. Process A has an exclusive layout lease on file F. It does an
> > IO to file F. The filesystem IO path checks that Process A owns the
> > lease on the file and so skips straight through layout breaking
> > because it owns the lease and is allowed to modify the layout. It
> > then takes the inode metadata locks to allocate new space and write
> > new data.
> > 
> > Process B now tries to write to file F. The FS checks whether
> > Process B owns a layout lease on file F. It doesn't, so then it
> > tries to break the layout lease so the IO can proceed. The layout
> > breaking code sees that process A has an exclusive layout lease
> > granted, and so returns -ETXTBSY to process B - it is not allowed to
> > break the lease and so the IO fails with -ETXTBSY.
> 
> This description doesn't match the behaviour that RDMA wants either.
> Even if Process A has a lease on the file, an IO from Process A which
> results in blocks being freed from the file is going to result in the
> RDMA device being able to write to blocks which are now freed (and
> potentially reallocated to another file).

I think you're partially wrong here. You are correct that the lease won't
stop process A from doing a truncate on the file. *But* there are still page
pins in existence, so the truncate will block waiting for these pins to go
away (after all, this is the protection that guards all short-term page pin
users). So there is no problem with blocks being freed under the RDMA app.
Yes, the app will effectively deadlock and the sysadmin has to kill it. IMO
an acceptable answer for doing something stupid and unsupportable...
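
For reference, this blocking is the existing break-layouts pattern; in
ext4 it boils down to (a sketch of the ext4_break_layouts() wait):

	/* Truncate side: wait until only the page cache reference is
	 * left, i.e. until all pins on the DAX page have been dropped. */
	error = ___wait_var_event(&page->_refcount,
			atomic_read(&page->_refcount) == 1,
			TASK_INTERRUPTIBLE, 0, 0,
			ext4_wait_dax_page(ei));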

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: a few questions about pagevec_lookup_entries

2019-06-20 Thread Jan Kara
[added some relevant lists to CC - this can save some people debugging by
being able to google this discussion]

On Wed 19-06-19 15:57:38, Liu Bo wrote:
> I found a weird infinite loop within invalidate_inode_pages2_range, the
> reason being that  pagevec_lookup_entries(index=1) returns an indices
> array which has only one entry storing value 0, and this has led
> invalidate_inode_pages2_range() into an infinite loop, something like,
> 
> invalidate_inode_pages2_range()
>   -> while (pagevec_lookup_entries(index=1, indices))
> ->  for (i = 0; i < pagevec_count(); i++) {
>   -> index = indices[0]; // index is set to 0
>   -> if (radix_tree_exceptional_entry(page)) {
>   -> if (!invalidate_exceptional_entry2()) //
>   ->__dax_invalidate_mapping_entry // return 0
>  -> // entry marked as PAGECACHE_TAG_DIRTY/TOWRITE
>  ret = -EBUSY;
>   ->continue;
>   } // end of if (radix_tree_exceptional_entry(page))
> -> index++; // index is set to 1
> 
> The following debug[1] proved the above analysis,  I was wondering if
> this was a corner case that  pagevec_lookup_entries() allows or a
> known bug that has been fixed upstream?
> 
> ps: the kernel in use is 4.19.30 (LTS).

Hum, the above trace suggests you are using DAX. Are you really? Because the
stacktrace below shows we are working on a fuse inode, so that shouldn't
really be a DAX inode...

Honza

> [1]:
> $git diff
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 71b65aab8077..82bfeeb53135 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -692,6 +692,7 @@ int invalidate_inode_pages2_range(struct
> address_space *mapping,
> struct page *page = pvec.pages[i];
> 
> /* We rely upon deletion not changing page->index */
> +   WARN_ONCE(index > indices[i], "index = %d indices[%d]=%d\n", index, i, indices[i]);
> index = indices[i];
> if (index > end)
> break;
> 
> [  129.095383] [ cut here ]
> [  129.096164] index = 1 indices[0]=0
> [  129.096786] WARNING: CPU: 0 PID: 3022 at mm/truncate.c:695
> invalidate_inode_pages2_range+0x471/0x500
> [  129.098234] Modules linked in:
> [  129.098717] CPU: 0 PID: 3022 Comm: doio Not tainted 4.19.30+ #4
> ...
> [  129.101288] RIP: 0010:invalidate_inode_pages2_range+0x471/0x500
> ...
> [  129.114162] Call Trace:
> [  129.114623]  ? __schedule+0x2ad/0x860
> [  129.115214]  ? prepare_to_wait_event+0x80/0x140
> [  129.115903]  ? finish_wait+0x3f/0x80
> [  129.116452]  ? request_wait_answer+0x13d/0x210
> [  129.117128]  ? remove_wait_queue+0x60/0x60
> [  129.117757]  ? make_kgid+0x13/0x20
> [  129.118277]  ? fuse_change_attributes_common+0x7d/0x130
> [  129.119057]  ? fuse_change_attributes+0x8d/0x120
> [  129.119754]  fuse_dentry_revalidate+0x2c5/0x300
> [  129.120456]  lookup_fast+0x237/0x2b0
> [  129.121018]  path_openat+0x15f/0x1380
> [  129.121614]  ? generic_update_time+0x6b/0xd0
> [  129.122316]  do_filp_open+0x91/0x100
> [  129.122876]  do_sys_open+0x126/0x210
> [  129.123453]  do_syscall_64+0x55/0x180
> [  129.124036]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  129.124820] RIP: 0033:0x7fbe0cd75e80
> ...
> [  129.134574] ---[ end trace c0fc0bbc5aebf0dc ]---
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Jan Kara
On Wed 12-06-19 15:13:36, Ira Weiny wrote:
> On Wed, Jun 12, 2019 at 04:14:21PM -0300, Jason Gunthorpe wrote:
> > On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote:
> > > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote:
> > > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote:
> > > > 
> > > > > > > The main objection to the current ODP & DAX solution is that very
> > > > > > > little HW can actually implement it, having the alternative still
> > > > > > > require HW support doesn't seem like progress.
> > > > > > > 
> > > > > > > I think we will eventually start seeing some HW be able to do this
> > > > > > > invalidation, but it won't be universal, and I'd rather leave it
> > > > > > > optional, for recovery from truly catastrophic errors (ie my DAX is
> > > > > > > on fire, I need to unplug it).
> > > > > > 
> > > > > > Agreed.  I think software wise there is not much some of the 
> > > > > > devices can do
> > > > > > with such an "invalidate".
> > > > > 
> > > > > So out of curiosity: What does RDMA driver do when userspace just 
> > > > > closes
> > > > > the file pointing to RDMA object? It has to handle that somehow by 
> > > > > aborting
> > > > > everything that's going on... And I wanted similar behavior here.
> > > > 
> > > > It aborts *everything* connected to that file descriptor. Destroying
> > > > everything avoids creating inconsistencies that destroying a subset
> > > > would create.
> > > > 
> > > > What has been talked about for lease break is not destroying anything
> > > > but very selectively saying that one memory region linked to the GUP
> > > > is no longer functional.
> > > 
> > > OK, so what I had in mind was that if RDMA app doesn't play by the rules
> > > and closes the file with existing pins (and thus layout lease) we would
> > > force it to abort everything. Yes, it is disruptive but then the app 
> > > didn't
> > > obey the rule that it has to maintain file lease while holding pins. Thus
> > > such situation should never happen unless the app is malicious / buggy.
> > 
> > We do have the infrastructure to completely revoke the entire
> > *content* of a FD (this is called device disassociate). It is
> > basically close without the app doing close. But again it only works
> > with some drivers. However, this is more likely something a driver
> > could support without a HW change though.
> > 
> > It is quite destructive as it forcibly kills everything RDMA related
> > the process(es) are doing, but it is less violent than SIGKILL, and
> > there is perhaps a way for the app to recover from this, if it is
> > coded for it.
> 
> I don't think many are...  I think most would effectively be "killed" if this
> happened to them.

Yes, I repeat: we are in a situation where the application has a bug and
didn't properly manage its long term pins, which are fully under its control.
So in my mind it is a situation similar to an application using memory it has
already freed. The kernel has to manage that but we don't really care
what's left of the application when this happens.

That being said, I'm not insisting this has to happen - tracking the
associated "RDMA file" with a layout lease and somehow invalidating it on
close of a leased file is somewhat ugly anyway. But it is still an option if
exposing pins to userspace for lsof to consume proves even worse...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Jan Kara
On Wed 12-06-19 11:49:52, Dan Williams wrote:
> On Wed, Jun 12, 2019 at 3:29 AM Jan Kara  wrote:
> >
> > On Fri 07-06-19 07:52:13, Ira Weiny wrote:
> > > On Fri, Jun 07, 2019 at 09:17:29AM -0300, Jason Gunthorpe wrote:
> > > > On Fri, Jun 07, 2019 at 12:36:36PM +0200, Jan Kara wrote:
> > > >
> > > > > Because the pins would be invisible to sysadmin from that point on.
> > > >
> > > > It is not invisible, it just shows up in a rdma specific kernel
> > > > interface. You have to use rdma netlink to see the kernel object
> > > > holding this pin.
> > > >
> > > > If this visibility is the main sticking point I suggest just enhancing
> > > > the existing MR reporting to include the file info for current GUP
> > > > pins and teaching lsof to collect information from there as well so it
> > > > is easy to use.
> > > >
> > > > If the ownership of the lease transfers to the MR, and we report that
> > > > ownership to userspace in a way lsof can find, then I think all the
> > > > concerns that have been raised are met, right?
> > >
> > > I was contemplating some new lsof feature yesterday.  But what I don't
> > > think we want is sysadmins to have multiple tools for multiple
> > > subsystems.  Or even have to teach lsof something new for every potential
> > > new subsystem user of GUP pins.
> >
> > Agreed.
> >
> > > I was thinking more along the lines of reporting files which have GUP
> > > pins on them directly somewhere (dare I say procfs?) and teaching lsof to
> > > report that information.  That would cover any subsystem which does a
> > > longterm pin.
> >
> > So lsof already parses /proc//maps to learn about files held open by
> > memory mappings. It could parse some other file as well I guess. The good
> > thing about that would be that then "longterm pin" structure would just hold
> > struct file reference. That would avoid any needs of special behavior on
> > file close (the file reference in the "longterm pin" structure would make
> > sure struct file and thus the lease stays around, we'd just need to make
> > explicit lease unlock block until the "longterm pin" structure is freed).
> > The bad thing is that it requires us to come up with a sane new proc
> > interface for reporting "longterm pins" and associated struct file. Also we
> > need to define what this interface shows if the pinned pages are in DRAM
> > (either page cache or anon) and not on NVDIMM.
> 
> The anon vs shared detection case is important because a longterm pin
> might be blocking a memory-hot-unplug operation if it is pinning
> ZONE_MOVABLE memory, but I don't think we want DRAM vs NVDIMM to be an
> explicit concern of the interface. For the anon / cached case I expect
> it might be useful to put that communication under the memory-blocks
> sysfs interface. I.e. a list of pids that are pinning that
> memory-block from being hot-unplugged.

Yes, I was thinking of memory hotplug as well. But I don't think the
distinction is really shared vs anon - a pinned page cache page can be
blocking your memory unplug / migration the same way as a pinned anon page.
So the information for a pin we need to convey is the "location of
resources" being pinned - that is pfn (both for DRAM and NVDIMM) - but then
also additional mapping information (which is the filename for a DAX page,
not sure about DRAM). A separate question is how to expose this
information so that it is efficiently usable by userspace. For lsof, a file
in /proc//xxx with the information would probably be the easiest to use,
plus all the issues with file access permissions and visibility among
different user namespaces are solved out of the box. And I believe it would
be reasonably usable for the memory hotplug use case as well. A file in sysfs
would be OK for memory hotplug I guess, but not really usable for lsof and
so I'm not sure we really need it when we are going to have one in procfs.
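
Purely for illustration, such a file could look like this (the name and
the format are completely made up here, not an existing or proposed ABI):

	$ cat /proc/1234/gup_pins
	# pfn range          pages  path
	0x1080200-0x10803ff    512  /mnt/pmem/data.db
	0x00c2a00-0x00c2a00      1  [anon]

lsof could then pick up the path column the same way it picks up file
names from /proc/<pid>/maps today.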

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Jan Kara
On Wed 12-06-19 11:41:53, Dan Williams wrote:
> On Wed, Jun 12, 2019 at 5:09 AM Jan Kara  wrote:
> >
> > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote:
> > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote:
> > >
> > > > > > The main objection to the current ODP & DAX solution is that very
> > > > > > little HW can actually implement it, having the alternative still
> > > > > > require HW support doesn't seem like progress.
> > > > > >
> > > > > > I think we will eventually start seeing some HW be able to do this
> > > > > > invalidation, but it won't be universal, and I'd rather leave it
> > > > > > optional, for recovery from truly catastrophic errors (ie my DAX is
> > > > > > on fire, I need to unplug it).
> > > > >
> > > > > Agreed.  I think software wise there is not much some of the devices 
> > > > > can do
> > > > > with such an "invalidate".
> > > >
> > > > So out of curiosity: What does RDMA driver do when userspace just closes
> > > > the file pointing to RDMA object? It has to handle that somehow by 
> > > > aborting
> > > > everything that's going on... And I wanted similar behavior here.
> > >
> > > It aborts *everything* connected to that file descriptor. Destroying
> > > everything avoids creating inconsistencies that destroying a subset
> > > would create.
> > >
> > > What has been talked about for lease break is not destroying anything
> > > but very selectively saying that one memory region linked to the GUP
> > > is no longer functional.
> >
> > OK, so what I had in mind was that if RDMA app doesn't play by the rules
> > and closes the file with existing pins (and thus layout lease) we would
> > force it to abort everything. Yes, it is disruptive but then the app didn't
> > obey the rule that it has to maintain file lease while holding pins. Thus
> > such situation should never happen unless the app is malicious / buggy.
> 
> When you say 'close' do you mean the final release of the fd? The vma
> keeps a reference to a 'struct file' live even after the fd is closed.

When I say 'close', I mean a call to the ->release file operation, which happens
when the last reference to struct file is dropped. I.e., when all file
descriptors and vmas (and possibly other places holding struct file
reference) are gone.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-12 Thread Jan Kara
On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote:
> On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote:
> 
> > > > The main objection to the current ODP & DAX solution is that very
> > > > little HW can actually implement it, having the alternative still
> > > > require HW support doesn't seem like progress.
> > > > 
> > > > I think we will eventually start seeing some HW be able to do this
> > > > invalidation, but it won't be universal, and I'd rather leave it
> > > > optional, for recovery from truly catastrophic errors (ie my DAX is
> > > > on fire, I need to unplug it).
> > > 
> > > Agreed.  I think software wise there is not much some of the devices can 
> > > do
> > > with such an "invalidate".
> > 
> > So out of curiosity: What does RDMA driver do when userspace just closes
> > the file pointing to RDMA object? It has to handle that somehow by aborting
> > everything that's going on... And I wanted similar behavior here.
> 
> It aborts *everything* connected to that file descriptor. Destroying
> everything avoids creating inconsistencies that destroying a subset
> would create.
> 
> What has been talked about for lease break is not destroying anything
> but very selectively saying that one memory region linked to the GUP
> is no longer functional.

OK, so what I had in mind was that if the RDMA app doesn't play by the rules
and closes the file with existing pins (and thus the layout lease), we would
force it to abort everything. Yes, it is disruptive, but then the app didn't
obey the rule that it has to maintain the file lease while holding pins. Thus
such a situation should never happen unless the app is malicious / buggy.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-12 Thread Jan Kara
On Fri 07-06-19 07:52:13, Ira Weiny wrote:
> On Fri, Jun 07, 2019 at 09:17:29AM -0300, Jason Gunthorpe wrote:
> > On Fri, Jun 07, 2019 at 12:36:36PM +0200, Jan Kara wrote:
> > 
> > > Because the pins would be invisible to sysadmin from that point on. 
> > 
> > It is not invisible, it just shows up in a rdma specific kernel
> > interface. You have to use rdma netlink to see the kernel object
> > holding this pin.
> > 
> > If this visibility is the main sticking point I suggest just enhancing
> > the existing MR reporting to include the file info for current GUP
> > pins and teaching lsof to collect information from there as well so it
> > is easy to use.
> > 
> > If the ownership of the lease transfers to the MR, and we report that
> > ownership to userspace in a way lsof can find, then I think all the
> > concerns that have been raised are met, right?
> 
> I was contemplating some new lsof feature yesterday.  But what I don't
> think we want is sysadmins to have multiple tools for multiple
> subsystems.  Or even have to teach lsof something new for every potential
> new subsystem user of GUP pins.

Agreed.

> I was thinking more along the lines of reporting files which have GUP
> pins on them directly somewhere (dare I say procfs?) and teaching lsof to
> report that information.  That would cover any subsystem which does a
> longterm pin.

So lsof already parses /proc//maps to learn about files held open by
memory mappings. It could parse some other file as well I guess. The good
thing about that would be that the "longterm pin" structure would then just
hold a struct file reference. That would avoid any need for special behavior on
file close (the file reference in the "longterm pin" structure would make
sure struct file and thus the lease stays around, we'd just need to make
explicit lease unlock block until the "longterm pin" structure is freed).
The bad thing is that it requires us to come up with a sane new proc
interface for reporting "longterm pins" and associated struct file. Also we
need to define what this interface shows if the pinned pages are in DRAM
(either page cache or anon) and not on NVDIMM.

> > > ugly to live so we have to come up with something better. The best I can
> > > currently come up with is to have a method associated with the lease that
> > > would invalidate the RDMA context that holds the pins in the same way that
> > > a file close would do it.
> > 
> > This is back to requiring all RDMA HW to have some new behavior they
> > currently don't have..
> > 
> > The main objection to the current ODP & DAX solution is that very
> > little HW can actually implement it, having the alternative still
> > require HW support doesn't seem like progress.
> > 
> > I think we will eventually start seeing some HW be able to do this
> > invalidation, but it won't be universal, and I'd rather leave it
> > optional, for recovery from truly catastrophic errors (ie my DAX is
> > on fire, I need to unplug it).
> 
> Agreed.  I think software wise there is not much some of the devices can do
> with such an "invalidate".

So out of curiosity: what does the RDMA driver do when userspace just closes
the file pointing to the RDMA object? It has to handle that somehow by
aborting everything that's going on... And I wanted similar behavior here.

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 02/10] fs/locks: Export F_LAYOUT lease to user space

2019-06-12 Thread Jan Kara
On Tue 11-06-19 14:38:13, Ira Weiny wrote:
> On Sun, Jun 09, 2019 at 09:00:24AM -0400, Jeff Layton wrote:
> > On Wed, 2019-06-05 at 18:45 -0700, ira.we...@intel.com wrote:
> > > From: Ira Weiny 
> > > 
> > > GUP longterm pins of non-pagecache file system pages (eg FS DAX) are
> > > currently disallowed because they are unsafe.
> > > 
> > > The danger for pinning these pages comes from the fact that hole punch
> > > and/or truncate of those files results in the pages being mapped and
> > > pinned by a user space process while DAX has potentially allocated those
> > > pages to other processes.
> > > 
> > > Most (All) users who are mapping FS DAX pages for long term pin purposes
> > > (such as RDMA) are not going to want to deallocate these pages while
> > > those pages are in use.  To do so would mean the application would lose
> > > data.  So the use case for allowing truncate operations of such pages
> > > is limited.
> > > 
> > > However, the kernel must protect itself and users from potential
> > > mistakes and/or malicious user space code.  Rather than disabling long
> > > term pins as is done now.   Allow for users who know they are going to
> > > be pinning this memory to alert the file system of this intention.
> > > Furthermore, allow users to be alerted such that they can react if a
> > > truncate operation occurs for some reason.
> > > 
> > > Example user space pseudocode for a user using RDMA and wanting to allow
> > > a truncate would look like this:
> > > 
> > > lease_break_sigio_handler() {
> > > ...
> > >   if (sigio.fd == rdma_fd) {
> > >   complete_rdma_operations(...);
> > >   ibv_dereg_mr(mr);
> > >   fcntl(rdma_fd, F_SETLEASE, F_UNLCK);
> > >   close(rdma_fd);
> > >   }
> > > }
> > > 
> > > setup_rdma_to_dax_file() {
> > > ...
> > >   rdma_fd = open(...)
> > >   fcntl(rdma_fd, F_SETLEASE, F_LAYOUT);
> > 
> > I'm not crazy about this interface. F_LAYOUT doesn't seem to be in the
> > same category as F_RDLCK/F_WRLCK/F_UNLCK.
> > 
> > Maybe instead of F_SETLEASE, this should use new
> > F_SETLAYOUT/F_GETLAYOUT cmd values? There is nothing that would prevent
> > you from setting both a lease and a layout on a file, and indeed knfsd
> > can set both.
> > 
> > This interface seems to conflate the two.
> 
> I've been feeling the same way.  This is why I was leaning toward a new lease
> type.  I called it "F_LONGTERM" but the name is not important.
> 
> I think the concept of adding "exclusive" to the layout lease can fix this
> because the NFS lease is non-exclusive where the user space one (for the
> purpose of GUP pinning) would need to be.
> 
> FWIW I have not worked out exactly what this new "exclusive" code will look
> like.  Jan said:
> 
>   "There actually is support for locks that are not broken after given
>   timeout so there shouldn't be too many changes need."
> 
> But I'm not seeing that for Lease code.  So I'm working on something for the
> lease code now.

Yeah, sorry for misleading you. Somehow I thought that if lease_break_time
== 0, we would wait indefinitely, but when checking the code again, that
doesn't seem to be the case.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v3 3/6] mm/nvdimm: Add page size and struct page size to pfn superblock

2019-06-11 Thread Jan Kara
On Tue 04-06-19 14:43:54, Aneesh Kumar K.V wrote:
> This is needed so that we don't wrongly initialize a namespace
> which doesn't have enough space reserved for holding struct pages
> with the current kernel.
> 
> We also increment PFN_MIN_VERSION to make sure that older kernel
> won't initialize namespace created with newer kernel.
> 
> Signed-off-by: Aneesh Kumar K.V 
...
> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> index 00c57805cad3..e01eee9efafe 100644
> --- a/drivers/nvdimm/pfn_devs.c
> +++ b/drivers/nvdimm/pfn_devs.c
> @@ -467,6 +467,15 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
>   if (__le16_to_cpu(pfn_sb->version_minor) < 2)
>   pfn_sb->align = 0;
>  
> + if (__le16_to_cpu(pfn_sb->version_minor) < 3) {
> + /*
> +  * For a large part we use PAGE_SIZE. But we
> +  * do have some accounting code using SZ_4K.
> +  */
> + pfn_sb->page_struct_size = cpu_to_le16(64);
> + pfn_sb->page_size = cpu_to_le32(SZ_4K);
> + }
> +
>   switch (le32_to_cpu(pfn_sb->mode)) {
>   case PFN_MODE_RAM:
>   case PFN_MODE_PMEM:

As we discussed with Aneesh privately, this actually means that existing
NVDIMM namespaces on PPC64 will stop working due to these defaults for old
superblocks. I don't think that's a good thing, as upgrading kernels is
going to be a nightmare on PPC64 due to this. So I believe we should pick
defaults for old superblocks such that working setups keep working without
the sysadmin having to touch anything.
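
E.g. something like this for the hunk quoted above (just a sketch - the
assumption being that pre-v3 superblocks were always created with the
creating kernel's native sizes, so deriving the defaults from the running
kernel's own constants keeps existing namespaces validating on every arch):

	if (__le16_to_cpu(pfn_sb->version_minor) < 3) {
		/* Old superblocks carry no geometry information;
		 * assume the native page and struct page sizes. */
		pfn_sb->page_struct_size = cpu_to_le16(sizeof(struct page));
		pfn_sb->page_size = cpu_to_le32(PAGE_SIZE);
	}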

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-07 Thread Jan Kara
On Thu 06-06-19 15:03:30, Ira Weiny wrote:
> On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> > On Wed 05-06-19 18:45:33, ira.we...@intel.com wrote:
> > > From: Ira Weiny 
> > 
> > So I'd like to actually mandate that you *must* hold the file lease until
> > you unpin all pages in the given range (not just that you have an option to
> > hold a lease). And I believe the kernel should actually enforce this. That
> > way we maintain a sane state that if someone uses a physical location of
> > logical file offset on disk, he has a layout lease. Also once this is done,
> > sysadmin has a reasonably easy way to discover run-away RDMA application
> > and kill it if he wishes so.
> 
> Fair enough.
> 
> I was kind of heading that direction but had not thought this far forward.  I
> was exploring how to have a lease remain on the file even after a "lease
> break".  But that is incompatible with the current semantics of a "layout"
> lease (as currently defined in the kernel).  [In the end I wanted to get an 
> RFC
> out to see what people think of this idea so I did not look at keeping the
> lease.]
> 
> Also hitch is that currently a lease is forcefully broken after
> /lease-break-time.  To do what you suggest I think we would need a new
> lease type with the semantics you describe.

I'd do what Dave suggested - add a flag to mark the lease as unbreakable by
truncate and teach the file locking core to handle that. There actually is
support for locks that are not broken after a given timeout, so there
shouldn't be too many changes needed.
 
> Previously I had thought this would be a good idea (for other reasons).  But
> what does everyone think about using a "longterm lease" similar to [1] which
> has the semantics you proppose?  In [1] I was not sure "longterm" was a good
> name but with your proposal I think it makes more sense.

As I wrote elsewhere in this thread, I think the FL_LAYOUT name still makes
sense and I'd add FL_UNBREAKABLE there to mark the unusual behavior with
truncate.
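
A rough sketch of how that could look (the flag, its value, and the exact
place of the check are all made up here - nothing like this exists in the
file locking core yet):

	/* include/linux/fs.h - illustrative value */
	#define FL_UNBREAKABLE	4096	/* lease must not be timed out */

	/* fs/locks.c, somewhere in the lease break path: */
	if (fl->fl_flags & FL_UNBREAKABLE) {
		/* Don't start the lease_break_time countdown; make the
		 * truncate / hole punch that wanted to break the layout
		 * wait (or fail) until the lease is unlocked instead. */
		error = -EWOULDBLOCK;
		goto out;
	}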

> > - probably I'd just transition all gup_longterm()
> > users to a saner API similar to the one we have in mm/frame_vector.c where
> > we don't hand out page pointers but an encapsulating structure that does
> > all the necessary tracking.
> 
> I'll take a look at that code.  But that seems like a pretty big change.

I was looking into that yesterday before proposing this and there aren't
that many gup_longterm() users, and most of them just stick the pages
array into their tracking structure and then release the pages once done. So
it shouldn't be that complex to convert to a new convention (and you have to
touch all gup_longterm() users anyway to teach them to track leases etc.).
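
For illustration, a frame_vector-style wrapper for this case might look
like (names completely made up):

	/*
	 * Handed out instead of a bare struct page ** so that the pinned
	 * pages and the lease-holding file get tracked and released
	 * together.
	 */
	struct gup_longterm_pin {
		struct file *file;	/* keeps the layout lease alive */
		unsigned int nr_pages;	/* pages actually pinned */
		struct page *pages[];	/* the pinned pages */
	};

	struct gup_longterm_pin *gup_longterm_pin_pages(unsigned long start,
			unsigned int nr_pages, unsigned int gup_flags);
	void gup_longterm_unpin_pages(struct gup_longterm_pin *pin);

Releasing the structure would drop the page references and only then let
an explicit lease unlock proceed.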

> > Removing a lease would need to block until all
> > pins are released - this is probably the most hairy part since we need to
> > handle a case if application just closes the file descriptor which would
> > release the lease but OTOH we need to make sure task exit does not deadlock.
> > Maybe we could block only on explicit lease unlock and just drop the layout
> > lease on file close and if there are still pinned pages, send SIGKILL to an
> > application as a reminder it did something stupid...
> 
> As presented at LSFmm I'm not opposed to killing a process which does not
> "follow the rules".  But I'm concerned about how to handle this across a fork.
> 
> Limiting the open()/LEASE/GUP/close()/SIGKILL to a specific pid "leak"'s pins
> to a child through the RDMA context.  This was the major issue Jason had with
> the SIGBUS proposal.
> 
> Always sending a SIGKILL would prevent an RDMA process from doing something
> like system("ls") (would kill the child unnecessarily).  Are we ok with that?

I answered this in another email, but system("ls") won't kill anybody.
fork(2) just creates a new file descriptor for the same file, and the child
possibly then closes it, but since there is still another file descriptor
for the same struct file, the "close" (->release) code won't trigger.
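
In code, the reason is just struct file refcounting (user-space sketch;
F_LAYOUT as in the RFC, the rest is illustrative):

	int fd = open("/mnt/dax/file", O_RDWR);

	fcntl(fd, F_SETLEASE, F_LAYOUT);  /* lease taken, pages pinned later */

	if (fork() == 0) {
		/* Child created by system("ls"): it only got an extra
		 * reference to the same struct file; exec/exit merely
		 * drop that extra reference again. */
		execlp("ls", "ls", (char *)NULL);
		_exit(127);
	}
	/* The parent still references the struct file, so ->release() -
	 * and with it any kill-on-close policing - is not invoked. */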

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] dax: Fix xarray entry association for mixed mappings

2019-06-06 Thread Jan Kara
On Thu 06-06-19 10:00:01, Dan Williams wrote:
> On Thu, Jun 6, 2019 at 2:10 AM Jan Kara  wrote:
> >
> > When inserting entry into xarray, we store mapping and index in
> > corresponding struct pages for memory error handling. When it happened
> > that one process was mapping file at PMD granularity while another
> > process at PTE granularity, we could wrongly deassociate PMD range and
> > then reassociate PTE range leaving the rest of struct pages in PMD range
> > without mapping information which could later cause missed notifications
> > about memory errors. Fix the problem by calling the association /
> > deassociation code if and only if we are really going to update the
> > xarray (deassociating and associating zero or empty entries is just
> > no-op so there's no reason to complicate the code with trying to avoid
> > the calls for these cases).
> 
> Looks good to me, I assume this also needs:
> 
> Cc: 
> Fixes: d2c997c0f145 ("fs, dax: use page->mapping to warn if truncate
> collides with a busy page")

Yes, thanks for that.

Honza

> 
> >
> > Signed-off-by: Jan Kara 
> > ---
> >  fs/dax.c | 9 -
> >  1 file changed, 4 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index f74386293632..9fd908f3df32 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -728,12 +728,11 @@ static void *dax_insert_entry(struct xa_state *xas,
> >
> > xas_reset(xas);
> > xas_lock_irq(xas);
> > -   if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
> > +   if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> > +   void *old;
> > +
> > dax_disassociate_entry(entry, mapping, false);
> > dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
> > -   }
> > -
> > -   if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> > /*
> >  * Only swap our new entry into the page cache if the current
> >  * entry is a zero page or an empty entry.  If a normal PTE or
> > @@ -742,7 +741,7 @@ static void *dax_insert_entry(struct xa_state *xas,
> >  * existing entry is a PMD, we will just leave the PMD in the
> >  * tree and dirty it if necessary.
> >  */
> > -   void *old = dax_lock_entry(xas, new_entry);
> > +   old = dax_lock_entry(xas, new_entry);
> > WARN_ON_ONCE(old != xa_mk_value(xa_to_value(entry) | DAX_LOCKED));
> > DAX_LOCKED));
> > entry = new_entry;
> > --
> > 2.16.4
> >
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 07/10] fs/ext4: Fail truncate if pages are GUP pinned

2019-06-06 Thread Jan Kara
On Wed 05-06-19 18:45:40, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> If pages are actively gup pinned fail the truncate operation.
> 
> Signed-off-by: Ira Weiny 
> ---
>  fs/ext4/inode.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 75f543f384e4..1ded83ec08c0 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4250,6 +4250,9 @@ int ext4_break_layouts(struct inode *inode, loff_t offset, loff_t len)
>   if (!page)
>   return 0;
>  
> + if (page_gup_pinned(page))
> + return -ETXTBSY;
> +
>   error = ___wait_var_event(&page->_refcount,
>   atomic_read(&page->_refcount) == 1,
>   TASK_INTERRUPTIBLE, 0, 0,

This caught my eye. Does this mean that now a truncate for a file which has
temporary gup users (such as buffers for DIO) can fail with ETXTBSY? That
doesn't look desirable. If we mandated a layout lease while pages are
pinned, as I suggested, this could be dealt with by checking for leases with
pins (breaking such a lease would return an error and not break it) and, if
breaking leases succeeds (i.e., there are no long-term pinned pages), we'd
just wait for the remaining references as we do now.
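
In pseudo-code, something like this (helper names made up):

	int ext4_break_layouts(struct inode *inode, loff_t offset, loff_t len)
	{
		int error;

		/* Break ordinary layout leases; a lease backed by long-term
		 * pins is unbreakable and fails the operation instead. */
		error = break_layout(inode, true);
		if (error)	/* e.g. -ETXTBSY from a pinned lease */
			return error;

		/* No long-term pins left: wait for the remaining short-term
		 * references (DIO buffers etc.) as the code does today. */
		return ext4_wait_dax_pages(inode, offset, len);
	}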

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-06 Thread Jan Kara
 cloud)
>  performance
> 
> Therefore, in order to support RDMA to File system pages without On Demand
> Paging (ODP) a number of things need to be done.
> 
> 1) GUP "longterm" users need to inform the other subsystems that they have
>taken a pin on a page which may remain pinned for a very "long time".[3]
> 
> 2) Any page which is "controlled" by a file system needs to have special
>handling.  The details of the handling depends on if the page is page cache
>fronted or not.
> 
>2a) A page cache fronted page which has been pinned by GUP long term can 
> use a
>bounce buffer to allow the file system to write back snap shots of the 
> page.
>This is handled by the FS recognizing the GUP long term pin and making a 
> copy
>of the page to be written back.
>   NOTE: this patch set does not address this path.
> 
>2b) A FS "controlled" page which is not page cache fronted is either easier
>to deal with or harder depending on the operation the filesystem is trying
>to do.
> 
>   2ba) [Hard case] If the FS operation _is_ a truncate or hole punch the
>   FS can no longer use the pages in question until the pin has been
>   removed.  This patch set presents a solution to this by introducing
>   some reasonable restrictions on user space applications.
> 
>   2bb) [Easy case] If the FS operation is _not_ a truncate or hole punch
>   then there is nothing which need be done.  Data is Read or Written
>   directly to the page.  This is an easy case which would currently work
>   if not for GUP long term pins being disabled.  Therefore this patch set
>   need not change access to the file data but does allow for GUP pins
>   after 2ba above is dealt with.
> 
> 
> This patch series and presents a solution for problem 2ba)
> 
> [1] https://github.com/johnhubbard/linux/tree/gup_dma_core
> 
> [2] ext4/dev branch:
> 
> - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=dev
> 
>   Specific patches:
> 
>   [2a] ext4: wait for outstanding dio during truncate in nojournal mode
> 
>   - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=82a25b027ca48d7ef197295846b352345853dfa8
> 
>   [2b] ext4: do not delete unlinked inode from orphan list on failed truncate
> 
>   - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=ee0ed02ca93ef1ecf8963ad96638795d55af2c14
> 
>   [2c] ext4: gracefully handle ext4_break_layouts() failure during truncate
> 
>   - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=b9c1c26739ec2d4b4fb70207a0a9ad6747e43f4c
> 
> [3] The definition of long time is debatable but it has been established
> that RDMAs use of pages, minutes or hours after the pin is the extreme case
> which makes this problem most severe.
> 
> 
> Ira Weiny (10):
>   fs/locks: Add trace_leases_conflict
>   fs/locks: Export F_LAYOUT lease to user space
>   mm/gup: Pass flags down to __gup_device_huge* calls
>   mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages
>   fs/ext4: Teach ext4 to break layout leases
>   fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range
>   fs/ext4: Fail truncate if pages are GUP pinned
>   fs/xfs: Teach xfs to use new dax_layout_busy_page()
>   fs/xfs: Fail truncate if pages are GUP pinned
>   mm/gup: Remove FOLL_LONGTERM DAX exclusion
> 
>  fs/Kconfig   |   1 +
>  fs/dax.c |  38 ++---
>  fs/ext4/ext4.h   |   2 +-
>  fs/ext4/extents.c|   6 +-
>  fs/ext4/inode.c  |  26 +--
>  fs/locks.c   |  97 ---
>  fs/xfs/xfs_file.c|  24 --
>  fs/xfs/xfs_inode.h   |   5 +-
>  fs/xfs/xfs_ioctl.c   |  15 +++-
>  fs/xfs/xfs_iops.c|  14 +++-
>  fs/xfs/xfs_pnfs.c|  14 ++--
>  include/linux/dax.h  |   9 ++-
>  include/linux/fs.h   |   2 +-
>  include/linux/mm.h   |   2 +
>  include/trace/events/filelock.h  |  35 +
>  include/uapi/asm-generic/fcntl.h |   3 +
>  mm/gup.c | 129 ---
>  mm/huge_memory.c |  12 +++
>  18 files changed, 299 insertions(+), 135 deletions(-)
> 
> -- 
> 2.20.1
> 
-- 
Jan Kara 
SUSE Labs, CR


[PATCH] dax: Fix xarray entry association for mixed mappings

2019-06-06 Thread Jan Kara
When inserting an entry into the xarray, we store the mapping and index in
the corresponding struct pages for memory error handling. When it happened
that one process was mapping a file at PMD granularity while another
process mapped it at PTE granularity, we could wrongly deassociate the PMD
range and then reassociate only the PTE range, leaving the rest of the
struct pages in the PMD range without mapping information, which could
later cause missed notifications about memory errors. Fix the problem by
calling the association / deassociation code if and only if we are really
going to update the xarray (deassociating and associating zero or empty
entries is a no-op so there's no reason to complicate the code with trying
to avoid the calls for these cases).

Signed-off-by: Jan Kara 
---
 fs/dax.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index f74386293632..9fd908f3df32 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -728,12 +728,11 @@ static void *dax_insert_entry(struct xa_state *xas,
 
xas_reset(xas);
xas_lock_irq(xas);
-   if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
+   if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+   void *old;
+
dax_disassociate_entry(entry, mapping, false);
dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
-   }
-
-   if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
/*
 * Only swap our new entry into the page cache if the current
 * entry is a zero page or an empty entry.  If a normal PTE or
@@ -742,7 +741,7 @@ static void *dax_insert_entry(struct xa_state *xas,
 * existing entry is a PMD, we will just leave the PMD in the
 * tree and dirty it if necessary.
 */
-   void *old = dax_lock_entry(xas, new_entry);
+   old = dax_lock_entry(xas, new_entry);
WARN_ON_ONCE(old != xa_mk_value(xa_to_value(entry) |
DAX_LOCKED));
entry = new_entry;
-- 
2.16.4



Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes

2019-05-29 Thread Jan Kara
On Wed 29-05-19 14:46:58, Dave Chinner wrote:
> On Tue, May 28, 2019 at 09:07:19PM -0700, Darrick J. Wong wrote:
> > On Wed, May 29, 2019 at 12:02:40PM +0800, Shiyang Ruan wrote:
> > > On 5/29/19 10:47 AM, Dave Chinner wrote:
> > > > On Wed, May 29, 2019 at 10:01:58AM +0800, Shiyang Ruan wrote:
> > > > > On 5/28/19 5:17 PM, Jan Kara wrote:
> > > > > > I'm sorry but I don't follow what you suggest. One COW operation is 
> > > > > > a call
> > > > > > to dax_iomap_rw(), isn't it? That may call iomap_apply() several 
> > > > > > times,
> > > > > > each invocation calls ->iomap_begin(), ->actor() 
> > > > > > (dax_iomap_actor()),
> > > > > > ->iomap_end() once. So I don't see a difference between doing 
> > > > > > something in
> > > > > > ->actor() and ->iomap_end() (besides the passed arguments but that 
> > > > > > does not
> > > > > > seem to be your concern). So what do you exactly want to do?
> > > > > 
> > > > > Hi Jan,
> > > > > 
> > > > > Thanks for pointing out, and I'm sorry for my mistake.  It's
> > > > > ->dax_iomap_rw(), not ->dax_iomap_actor().
> > > > > 
> > > > > I want to call the callback function at the end of ->dax_iomap_rw().
> > > > > 
> > > > > Like this:
> > > > > dax_iomap_rw(..., callback) {
> > > > > 
> > > > >  ...
> > > > >  while (...) {
> > > > >  iomap_apply(...);
> > > > >  }
> > > > > 
> > > > >  if (callback != null) {
> > > > >  callback();
> > > > >  }
> > > > >  return ...;
> > > > > }
> > > > 
> > > > Why does this need to be in dax_iomap_rw()?
> > > > 
> > > > We already do post-dax_iomap_rw() "io-end callbacks" directly in
> > > > xfs_file_dax_write() to update the file size
> > > 
> > > Yes, but we also need to call ->xfs_reflink_end_cow() after a COW 
> > > operation.
> > > And an is-cow flag (from iomap) is also needed to determine if we call it.
> > > I think it would be better to put this into ->dax_iomap_rw() as a callback
> > > function.
> > 
> > Sort of like how iomap_dio_rw takes a write endio function?
> 
> You mean like we originally had in the DAX code for unwritten
> extents?
> 
> But we got rid of that because performance of unwritten extents was
> absolutely woeful - it's cheaper in terms of CPU cost to do up front
> zeroing (i.e. inside ->iomap_begin) than it is to use unwritten
> extents and convert them to protect against stale data exposure.
> 
> I have a feeling that exactly the same thing is true for CoW - the
> hoops we jump through to do COW fork manipulation and then extent
> movement between the COW fork and the data fork on IO completion
> would be better done before we commit the COW extent allocation.
> 
> In which case, what we actually want for DAX is:
> 
> 
>  iomap_apply()
> 
>   ->iomap_begin()
>   map old data extent that we copy from
> 
>   allocate new data extent we copy to in data fork,
>   immediately replacing old data extent
> 
>   return transaction handle as private data
> 
>   dax_iomap_actor()
>   copies data from old extent to new extent
> 
>   ->iomap_end
>   commits transaction now data has been copied, making
>   the COW operation atomic with the data copy.
> 
> 
> This, in fact, should be how we do all DAX writes that require
> allocation, because then we get rid of the need to zero newly
> allocated or unwritten extents before we copy the data into it. i.e.
> we only need to write once to newly allocated storage rather than
> twice.

You need to be careful though. You need to synchronize with page faults so
that they cannot see and expose in page tables blocks you've allocated
before their contents are filled. This race was actually the strongest
motivation for pre-zeroing of blocks. OTOH copy_from_iter() in
dax_iomap_actor() needs to be able to fault pages to copy from (and these
pages may be from the same file you're writing to) so you cannot just block
faulting for the file through I_MMAP_LOCK.
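
To make the trap concrete, a deliberately broken sketch (the lock is a
stand-in for whatever rwsem the filesystem uses to exclude DAX faults,
e.g. XFS_MMAPLOCK; none of this is proposed code):

static size_t dax_cow_write_naive(struct rw_semaphore *i_mmap_lock,
				  void *dst, size_t len,
				  struct iov_iter *iter)
{
	size_t copied;

	down_write(i_mmap_lock);	/* blocks all page faults on the file */

	/* ->iomap_begin() allocated new blocks; contents still stale */

	/*
	 * Deadlock: copy_from_iter() may take a page fault on the source
	 * buffer. If that buffer is an mmap of the same file, the fault
	 * handler needs i_mmap_lock and we wait on ourselves.
	 */
	copied = copy_from_iter(dst, len, iter);

	up_write(i_mmap_lock);
	return copied;
}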

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes

2019-05-28 Thread Jan Kara
On Mon 27-05-19 16:25:41, Shiyang Ruan wrote:
> On 5/23/19 7:51 PM, Goldwyn Rodrigues wrote:
> > > 
> > > Hi,
> > > 
> > > I'm working on reflink & dax in XFS, here are some thoughts on this:
> > > 
> > > As mentioned above: the second iomap's offset and length must match the
> > > first.  I thought so at the beginning, but later found that the only
> > > difference between these two iomaps is @addr.  So, what about adding a
> > > @saddr, which means the source address of COW extent, into the struct 
> > > iomap.
> > > The ->iomap_begin() fills @saddr if the extent is COW, and 0 if not.  Then
> > > handle this @saddr in each ->actor().  No more modifications in other
> > > functions.
> > 
> > Yes, I started of with the exact idea before being recommended this by Dave.
> > I used two fields instead of one namely cow_pos and cow_addr which defined
> > the source details. I had put it as a iomap flag as opposed to a type
> > which of course did not appeal well.
> > 
> > We may want to use iomaps for cases where two inodes are involved.
> > An example of the other scenario where offset may be different is file
> > comparison for dedup: vfs_dedup_file_range_compare(). However, it would
> > need two inodes in iomap as well.
> > 
> Yes, it is reasonable.  Thanks for your explanation.
> 
> One more thing RFC:
> I'd like to add an end-io callback argument in ->dax_iomap_actor() to update
> the metadata after one whole COW operation is completed.  The end-io can
> also be called in ->iomap_end().  But one COW operation may call
> ->iomap_apply() many times, and so does the end-io.  Thus, I think it would
> be nice to move it to the bottom of ->dax_iomap_actor(), called just once in
> each COW operation.

I'm sorry but I don't follow what you suggest. One COW operation is a call
to dax_iomap_rw(), isn't it? That may call iomap_apply() several times,
each invocation calls ->iomap_begin(), ->actor() (dax_iomap_actor()),
->iomap_end() once. So I don't see a difference between doing something in
->actor() and ->iomap_end() (besides the passed arguments but that does not
seem to be your concern). So what do you exactly want to do?

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 16/18] btrfs: Writeprotect mmap pages on snapshot

2019-05-23 Thread Jan Kara
On Thu 23-05-19 10:27:22, Goldwyn Rodrigues wrote:
> On 16:04 23/05, Jan Kara wrote:
> > On Mon 29-04-19 12:26:47, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues 
> > > 
> > > In order to make sure mmap'd files don't change after a snapshot,
> > > writeprotect the mmap'd pages on snapshot. This is done by performing
> > > a data writeback on the pages (which simply marks the pages as
> > > writeprotected). This way, if the user process tries to access the
> > > memory we will get another fault and we can perform a CoW.
> > > 
> > > In order to accomplish this, we tag all CoW pages as
> > > PAGECACHE_TAG_TOWRITE, and add the mmap'd inode to delalloc_inodes.
> > > During snapshot, writeback is started on all delalloc'd inodes and
> > > here we perform a data writeback. We don't want to keep the inodes
> > > in delalloc_inodes until umount (WARN_ON), so we remove them
> > > during inode eviction.
> > > 
> > > Signed-off-by: Goldwyn Rodrigues 
> > 
> > OK, so here you use PAGECACHE_TAG_TOWRITE. But why is not
> > PAGECACHE_TAG_DIRTY enough for you? Also why isn't the same needed also for
> > normal non-DAX inodes? There you also need to trigger CoW on mmap write so
> > I just don't see the difference...
> 
> Because dax_writeback_mapping_range() writes back pages marked 
> PAGECACHE_TAG_TOWRITE and not PAGECACHE_TAG_DIRTY. Should it
> writeback pages marked as PAGECACHE_TAG_DIRTY as well?

It does write back PAGECACHE_TAG_DIRTY pages - tag_pages_for_writeback()
moves PAGECACHE_TAG_DIRTY to PAGECACHE_TAG_TOWRITE...
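
In outline (a simplified sketch of mm/page-writeback.c; the real function
also batches entries and drops the lock periodically):

static void tag_pages_for_writeback_sketch(struct address_space *mapping,
					   pgoff_t start, pgoff_t end)
{
	XA_STATE(xas, &mapping->i_pages, start);
	void *entry;

	/*
	 * Every DIRTY-tagged entry in the range also gets TOWRITE, which
	 * is why a writeback walking TOWRITE entries covers DIRTY ones.
	 */
	xas_lock_irq(&xas);
	xas_for_each_marked(&xas, entry, end, PAGECACHE_TAG_DIRTY)
		xas_set_mark(&xas, PAGECACHE_TAG_TOWRITE);
	xas_unlock_irq(&xas);
}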

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 16/18] btrfs: Writeprotect mmap pages on snapshot

2019-05-23 Thread Jan Kara
> return dax_writeback_mapping_range(mapping, fs_info->fs_devices->latest_bdev,
>   wbc);
>  }
> @@ -9981,6 +9990,8 @@ static void btrfs_run_delalloc_work(struct btrfs_work 
> *work)
>   delalloc_work = container_of(work, struct btrfs_delalloc_work,
>work);
>   inode = delalloc_work->inode;
> + if (IS_DAX(inode))
> + filemap_fdatawrite(inode->i_mapping);
>   filemap_flush(inode->i_mapping);
>   if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
>   _I(inode)->runtime_flags))
> -- 
> 2.16.4
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 12/18] btrfs: allow MAP_SYNC mmap

2019-05-23 Thread Jan Kara
On Mon 29-04-19 12:26:43, Goldwyn Rodrigues wrote:
> From: Adam Borowski 
> 
> Used by userspace to detect DAX.
> [rgold...@suse.com: Added CONFIG_FS_DAX around mmap_supported_flags]

Why the CONFIG_FS_DAX bit? Your mmap(2) implementation understands the
implications of the MAP_SYNC flag and that's all that's needed to set
.mmap_supported_flags.
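
That is, presumably just (a sketch; the elided members stay as in the
patch below):

const struct file_operations btrfs_file_operations = {
	/* ... other members as in the patch ... */
	.mmap			= btrfs_file_mmap,
	.mmap_supported_flags	= MAP_SYNC,	/* no CONFIG_FS_DAX guard */
	/* ... */
};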

Honza

> Signed-off-by: Adam Borowski 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/file.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 9d5a3c99a6b9..362a9cf9dcb2 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "transaction.h"
> @@ -3319,6 +3320,9 @@ const struct file_operations btrfs_file_operations = {
>   .splice_read= generic_file_splice_read,
>   .write_iter = btrfs_file_write_iter,
>   .mmap   = btrfs_file_mmap,
> +#ifdef CONFIG_FS_DAX
> + .mmap_supported_flags = MAP_SYNC,
> +#endif
>   .open   = btrfs_file_open,
>   .release= btrfs_release_file,
>   .fsync  = btrfs_sync_file,
> -- 
> 2.16.4
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 10/18] dax: replace mmap entry in case of CoW

2019-05-23 Thread Jan Kara
On Mon 29-04-19 12:26:41, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> We replace the existing entry with the newly allocated one
> in case of CoW. Also, we mark the entry as PAGECACHE_TAG_TOWRITE
> so writeback marks this entry as writeprotected. This
> helps with snapshots, so new write pagefaults after a snapshot
> trigger a CoW.

I don't understand why you need to mark the new entry with
PAGECACHE_TAG_TOWRITE. dax_insert_entry() will unmap the entry from all
page tables so what's there left to writeprotect?

>  /*
>   * By this point grab_mapping_entry() has ensured that we have a locked entry
>   * of the appropriate size so we don't have to worry about downgrading PMDs 
> to
> @@ -709,14 +712,17 @@ static int copy_user_dax(struct block_device *bdev, 
> struct dax_device *dax_dev,
>   */
>  static void *dax_insert_entry(struct xa_state *xas,
>   struct address_space *mapping, struct vm_fault *vmf,
> - void *entry, pfn_t pfn, unsigned long flags, bool dirty)
> + void *entry, pfn_t pfn, unsigned long flags,
> + unsigned long insert_flags)
>  {
>   void *new_entry = dax_make_entry(pfn, flags);
> + bool dirty = insert_flags & DAX_IF_DIRTY;
> + bool cow = insert_flags & DAX_IF_COW;

Does 'cow' really need to be a separate flag? dax_insert_entry() can just
figure out the right thing to do on its own based on old entry value and
new entry to be inserted...

>  
>   if (dirty)
>   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> - if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
> + if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {

E.g. here we need to unmap if old entry is not 'empty' and the pfns differ
(well, the pfns differ check should better be done like I outline below to
make pmd + pte match work correctly).

>   unsigned long index = xas->xa_index;
>   /* we are replacing a zero page with block mapping */
>   if (dax_is_pmd_entry(entry))
> @@ -728,12 +734,12 @@ static void *dax_insert_entry(struct xa_state *xas,
>  
>   xas_reset(xas);
>   xas_lock_irq(xas);
> - if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
> + if (cow || (dax_entry_size(entry) != dax_entry_size(new_entry))) {

This needs to be done if entries are different at all...

>   dax_disassociate_entry(entry, mapping, false);
>   dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
>   }
>  
> - if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> + if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {

This is the only place that will be a bit more subtle - you need to check
whether the new entry is not a subset of the old one (i.e., a PTE inside a
PMD) and skip setting in that case. So something like:

if ((xa_to_value(new_entry) | DAX_LOCKED) == xa_to_value(entry) ||
(dax_is_pmd_entry(entry) && dax_is_pte_entry(new_entry) &&
 dax_to_pfn(entry) + (xas->xa_index & PG_PMD_COLOUR) ==
 dax_to_pfn(new_entry))) {
/* New entry is a subset of the current one? Skip update... */
xas_load(xas);
} else {
do work...
}


Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults

2019-05-23 Thread Jan Kara
On Tue 21-05-19 10:46:25, Darrick J. Wong wrote:
> On Mon, Apr 29, 2019 at 12:26:39PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues 
> > 
> > Change dax_iomap_pfn to return the address as well in order to
> > use it for performing a memcpy in case the type is IOMAP_DAX_COW.
> > We don't handle PMD because btrfs does not support hugepages.
> > 
> > Question:
> > The sequence of bdev_dax_pgoff() and dax_direct_access() is
> > used multiple times to calculate address and pfn's. Would it make
> > sense to call it while calculating address as well to reduce code?
> > 
> > Signed-off-by: Goldwyn Rodrigues 
> > ---
> >  fs/dax.c | 19 +++
> >  1 file changed, 15 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 610bfa861a28..718b1632a39d 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -984,7 +984,7 @@ static sector_t dax_iomap_sector(struct iomap *iomap, 
> > loff_t pos)
> >  }
> >  
> >  static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
> > -pfn_t *pfnp)
> > +pfn_t *pfnp, void **addr)
> >  {
> > const sector_t sector = dax_iomap_sector(iomap, pos);
> > pgoff_t pgoff;
> > @@ -996,7 +996,7 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t 
> > pos, size_t size,
> > return rc;
> > id = dax_read_lock();
> > length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
> > -  NULL, pfnp);
> > +  addr, pfnp);
> > if (length < 0) {
> > rc = length;
> > goto out;
> > @@ -1286,6 +1286,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault 
> > *vmf, pfn_t *pfnp,
> > XA_STATE(xas, >i_pages, vmf->pgoff);
> > struct inode *inode = mapping->host;
> > unsigned long vaddr = vmf->address;
> > +   void *addr;
> > loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
> > struct iomap iomap = { 0 };
> 
> Ugh, I had forgotten that fs/dax.c open-codes iomap_apply, probably
> because the actor returns vm_fault_t, not bytes copied.  I guess that
> makes it a tiny bit more complicated to pass in two (struct iomap *) to
> the iomap_begin function...

Hum, right. We could actually reimplement dax_iomap_{pte|pmd}_fault() using
iomap_apply(). We would just need to propagate the error code out of our
'actor' inside the structure pointed to by 'data'. But that's doable.
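
A hypothetical shape of such a reimplementation (dax_fault_insert_entry()
is an invented name standing in for the body of today's open-coded fault
loop):

struct dax_fault_data {
	struct vm_fault *vmf;
	pfn_t *pfnp;
	vm_fault_t result;		/* filled in by the actor */
};

static loff_t dax_fault_actor(struct inode *inode, loff_t pos, loff_t length,
			      void *data, struct iomap *iomap)
{
	struct dax_fault_data *dfd = data;

	/* Do the PTE insertion and record the vm_fault_t outcome. */
	dfd->result = dax_fault_insert_entry(dfd->vmf, dfd->pfnp, pos, iomap);
	if (dfd->result & VM_FAULT_ERROR)
		return -EIO;		/* make iomap_apply() bail out */
	return length;			/* range handled, keep going */
}

static vm_fault_t dax_iomap_pte_fault_sketch(struct vm_fault *vmf, pfn_t *pfnp,
					     const struct iomap_ops *ops)
{
	struct inode *inode = file_inode(vmf->vma->vm_file);
	struct dax_fault_data dfd = { .vmf = vmf, .pfnp = pfnp };
	loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;

	iomap_apply(inode, pos, PAGE_SIZE, IOMAP_FAULT, ops, &dfd,
		    dax_fault_actor);
	return dfd.result;
}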

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-20 Thread Jan Kara
On Sat 18-05-19 21:46:03, Dan Williams wrote:
> On Fri, May 17, 2019 at 12:25 PM Kees Cook  wrote:
> > On Fri, May 17, 2019 at 10:28:48AM -0700, Dan Williams wrote:
> > > It seems dax_iomap_actor() is not a path where we'd be worried about
> > > needing hardened user copy checks.
> >
> > I would agree: I think the proposed patch makes sense. :)
> 
> Sounds like an acked-by to me.

Yeah, if Kees agrees, I'm fine with skipping the checks as well. I just
wanted that to be clarified. Also it helped me that you wrote:

That routine (dax_iomap_actor()) validates that the logical file offset is
within bounds of the file, then it does a sector-to-pfn translation which
validates that the physical mapping is within bounds of the block device.

That is more specific than "dax_iomap_actor() takes care of necessary
checks" which was in the changelog. And the above paragraph helped me
clarify which checks in dax_iomap_actor() you think replace those usercopy
checks. So I think it would be good to add that paragraph to those
copy_from_pmem() functions as a comment just in case we are wondering in
the future why we are skipping the checks... Also feel free to add:

Acked-by: Jan Kara 

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-17 Thread Jan Kara
Let's add Kees to CC for usercopy expertise...

On Thu 16-05-19 17:33:38, Dan Williams wrote:
> Jeff discovered that performance improves from ~375K iops to ~519K iops
> on a simple psync-write fio workload when moving the location of 'struct
> page' from the default PMEM location to DRAM. This result is surprising
> because the expectation is that 'struct page' for dax is only needed for
> third party references to dax mappings. For example, a dax-mapped buffer
> passed to another system call for direct-I/O requires 'struct page' for
> sending the request down the driver stack and pinning the page. There is
> no usage of 'struct page' for first party access to a file via
> read(2)/write(2) and friends.
> 
> However, this "no page needed" expectation is violated by
> CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in
> copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The
> check_heap_object() helper routine assumes the buffer is backed by a
> page-allocator DRAM page and applies some checks.  Those checks are
> invalid, dax pages are not from the heap, and redundant,
> dax_iomap_actor() has already validated that the I/O is within bounds.

So this last paragraph is not obvious to me as check_copy_size() does a lot
of various checks in CONFIG_HARDENED_USERCOPY case. I agree that some of
those checks don't make sense for PMEM pages but I'd rather handle that by
refining check_copy_size() and check_object_size() functions to detect and
appropriately handle pmem pages rather that generally skip all the checks
in pmem_copy_from/to_iter(). And yes, every check in such a hot path is
going to cost performance but that's what the user asked for with
CONFIG_HARDENED_USERCOPY... Kees?
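
Roughly what such a refinement could look like (a sketch only; where
exactly the ZONE_DEVICE special case belongs is precisely the open
question):

static inline void check_object_size_pmem_aware(const void *ptr,
						unsigned long n, bool to_user)
{
	/*
	 * ZONE_DEVICE (pmem) pages are not heap objects and their bounds
	 * were already validated by dax_iomap_actor(); skip slab checks.
	 */
	if (virt_addr_valid(ptr) &&
	    is_zone_device_page(virt_to_head_page(ptr)))
		return;

	__check_object_size(ptr, n, to_user);
}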

Honza

> 
> Bypass this overhead and call the 'no check' versions of the
> copy_{to,from}_iter operations directly.
> 
> Fixes: 0aed55af8834 ("x86, uaccess: introduce copy_from_iter_flushcache...")
> Cc: Jan Kara 
> Cc: 
> Cc: Jeff Moyer 
> Cc: Ingo Molnar 
> Cc: Christoph Hellwig 
> Cc: Al Viro 
> Cc: Thomas Gleixner 
> Cc: Matthew Wilcox 
> Reported-and-tested-by: Jeff Smits 
> Signed-off-by: Dan Williams 
> ---
>  drivers/nvdimm/pmem.c |9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 845c5b430cdd..c894f45e5077 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -281,16 +281,21 @@ static long pmem_dax_direct_access(struct dax_device 
> *dax_dev,
>   return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
>  }
>  
> +/*
> + * Use the 'no check' versions of copy_from_iter_flushcache() and
> + * copy_to_iter_mcsafe() to bypass HARDENED_USERCOPY overhead. Bounds
> + * checking is handled by dax_iomap_actor()
> + */
>  static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>   void *addr, size_t bytes, struct iov_iter *i)
>  {
> - return copy_from_iter_flushcache(addr, bytes, i);
> + return _copy_from_iter_flushcache(addr, bytes, i);
>  }
>  
>  static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>   void *addr, size_t bytes, struct iov_iter *i)
>  {
> - return copy_to_iter_mcsafe(addr, bytes, i);
> + return _copy_to_iter_mcsafe(addr, bytes, i);
>  }
>  
>  static const struct dax_operations pmem_dax_ops = {
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] dax: Arrange for dax_supported check to span multiple devices

2019-05-15 Thread Jan Kara
On Tue 14-05-19 20:48:49, Dan Williams wrote:
> Pankaj reports that starting with commit ad428cdb525a "dax: Check the
> end of the block-device capacity with dax_direct_access()" device-mapper
> no longer allows dax operation. This results from the stricter checks in
> __bdev_dax_supported() that validate that the start and end of a
> block-device map to the same 'pagemap' instance.
> 
> Teach the dax-core and device-mapper to validate the 'pagemap' on a
> per-target basis. This is accomplished by refactoring the
> bdev_dax_supported() internals into generic_fsdax_supported() which
> takes a sector range to validate. Consequently generic_fsdax_supported()
> is suitable to be used in a device-mapper ->iterate_devices() callback.
> A new ->dax_supported() operation is added to allow composite devices to
> split and route upper-level bdev_dax_supported() requests.
> 
> Fixes: ad428cdb525a ("dax: Check the end of the block-device...")
> Cc: 
> Cc: Jan Kara 
> Cc: Ira Weiny 
> Cc: Dave Jiang 
> Cc: Mike Snitzer 
> Cc: Keith Busch 
> Cc: Matthew Wilcox 
> Cc: Vishal Verma 
> Cc: Heiko Carstens 
> Cc: Martin Schwidefsky 
> Reported-by: Pankaj Gupta 
> Signed-off-by: Dan Williams 

Thanks for the fix. The patch looks good to me so feel free to add:

Reviewed-by: Jan Kara 

Honza

> ---
> Hi Mike,
> 
> Another day another new dax operation to allow device-mapper to better
> scope dax operations.
> 
> Let me know if the device-mapper changes look sane. This passes a new
> unit test that indeed fails on current mainline.
> 
> https://github.com/pmem/ndctl/blob/device-mapper-pending/test/dm.sh
> 
>  drivers/dax/super.c  |   88 
> +++---
>  drivers/md/dm-table.c|   17 +---
>  drivers/md/dm.c  |   20 ++
>  drivers/md/dm.h  |1 
>  drivers/nvdimm/pmem.c|1 
>  drivers/s390/block/dcssblk.c |1 
>  include/linux/dax.h  |   19 +
>  7 files changed, 110 insertions(+), 37 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 0a339b85133e..ec2f2262e3a9 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -73,22 +73,12 @@ struct dax_device *fs_dax_get_by_bdev(struct block_device 
> *bdev)
>  EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
>  #endif
>  
> -/**
> - * __bdev_dax_supported() - Check if the device supports dax for filesystem
> - * @bdev: block device to check
> - * @blocksize: The block size of the device
> - *
> - * This is a library function for filesystems to check if the block device
> - * can be mounted with dax option.
> - *
> - * Return: true if supported, false if unsupported
> - */
> -bool __bdev_dax_supported(struct block_device *bdev, int blocksize)
> +bool generic_fsdax_supported(struct dax_device *dax_dev,
> + struct block_device *bdev, int blocksize, sector_t start,
> + sector_t sectors)
>  {
> - struct dax_device *dax_dev;
>   bool dax_enabled = false;
>   pgoff_t pgoff, pgoff_end;
> - struct request_queue *q;
>   char buf[BDEVNAME_SIZE];
>   void *kaddr, *end_kaddr;
>   pfn_t pfn, end_pfn;
> @@ -102,21 +92,14 @@ bool __bdev_dax_supported(struct block_device *bdev, int 
> blocksize)
>   return false;
>   }
>  
> - q = bdev_get_queue(bdev);
> - if (!q || !blk_queue_dax(q)) {
> - pr_debug("%s: error: request queue doesn't support dax\n",
> - bdevname(bdev, buf));
> - return false;
> - }
> -
> - err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, );
> + err = bdev_dax_pgoff(bdev, start, PAGE_SIZE, );
>   if (err) {
>   pr_debug("%s: error: unaligned partition for dax\n",
>   bdevname(bdev, buf));
>   return false;
>   }
>  
> - last_page = PFN_DOWN(i_size_read(bdev->bd_inode) - 1) * 8;
> + last_page = PFN_DOWN((start + sectors - 1) * 512) * PAGE_SIZE / 512;
>   err = bdev_dax_pgoff(bdev, last_page, PAGE_SIZE, _end);
>   if (err) {
>   pr_debug("%s: error: unaligned partition for dax\n",
> @@ -124,20 +107,11 @@ bool __bdev_dax_supported(struct block_device *bdev, 
> int blocksize)
>   return false;
>   }
>  
> - dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
> - if (!dax_dev) {
> - pr_debug("%s: error: device does not support dax\n",
> - bdevname(bdev, buf));
> - return false;
>
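
For the device-mapper side the changelog describes, the per-target
->iterate_devices() callback would look roughly like this (details
assumed from the description above):

static int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
			       sector_t start, sector_t len, void *data)
{
	int blocksize = *(int *)data;

	/* Route the upper-level check to this target's slice of the device. */
	return generic_fsdax_supported(dev->dax_dev, dev->bdev, blocksize,
				       start, len);
}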

Re: [PATCH v2] mm: Fix modifying of page protection by insert_pfn_pmd()

2019-04-26 Thread Jan Kara
On Thu 25-04-19 17:33:04, Dan Williams wrote:
> On Thu, Apr 25, 2019 at 12:32 AM Jan Kara  wrote:
> >
> > On Wed 24-04-19 11:13:48, Dan Williams wrote:
> > > On Wed, Apr 24, 2019 at 10:38 AM Matthew Wilcox  
> > > wrote:
> > > >
> > > > On Wed, Apr 24, 2019 at 10:13:15AM -0700, Dan Williams wrote:
> > > > > I think unaligned addresses have always been passed to
> > > > > vmf_insert_pfn_pmd(), but nothing cared until this patch. I *think*
> > > > > the only change needed is the following, thoughts?
> > > > >
> > > > > diff --git a/fs/dax.c b/fs/dax.c
> > > > > index ca0671d55aa6..82aee9a87efa 100644
> > > > > --- a/fs/dax.c
> > > > > +++ b/fs/dax.c
> > > > > @@ -1560,7 +1560,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct
> > > > > vm_fault *vmf, pfn_t *pfnp,
> > > > > }
> > > > >
> > > > > trace_dax_pmd_insert_mapping(inode, vmf, PMD_SIZE, 
> > > > > pfn, entry);
> > > > > -   result = vmf_insert_pfn_pmd(vma, vmf->address, 
> > > > > vmf->pmd, pfn,
> > > > > +   result = vmf_insert_pfn_pmd(vma, pmd_addr, vmf->pmd, 
> > > > > pfn,
> > > > > write);
> > > >
> > > > We also call vmf_insert_pfn_pmd() in dax_insert_pfn_mkwrite() -- does
> > > > that need to change too?
> > >
> > > It wasn't clear to me that it was a problem. I think that one already
> > > happens to be pmd-aligned.
> >
> > Why would it need to be? The address is taken from vmf->address and that's
> > set up in __handle_mm_fault() like .address = address & PAGE_MASK. So I
> > don't see anything forcing PMD alignment of the virtual address...
> 
> True. So now I'm wondering if the masking should be done internal to
> the routine. Given it's prefixed vmf_ it seems to imply the api is
> prepared to take raw 'struct vm_fault' parameters. I think I'll go
> that route unless someone sees a reason to require the caller to
> handle this responsibility.

Yeah, that sounds good to me. Thanks for fixing this.
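
The direction agreed on above, sketched (insert_pfn_pmd() is the existing
internal helper; the final upstream signature may differ):

vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
			      pmd_t *pmd, pfn_t pfn, bool write)
{
	/* Mask internally so callers may pass the raw faulting address. */
	unsigned long pmd_addr = addr & PMD_MASK;

	if (pmd_addr < vma->vm_start || pmd_addr + PMD_SIZE > vma->vm_end)
		return VM_FAULT_SIGBUS;

	insert_pfn_pmd(vma, pmd_addr, pmd, pfn, vma->vm_page_prot, write);
	return VM_FAULT_NOPAGE;
}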

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2] mm: Fix modifying of page protection by insert_pfn_pmd()

2019-04-25 Thread Jan Kara
On Wed 24-04-19 11:13:48, Dan Williams wrote:
> On Wed, Apr 24, 2019 at 10:38 AM Matthew Wilcox  wrote:
> >
> > On Wed, Apr 24, 2019 at 10:13:15AM -0700, Dan Williams wrote:
> > > I think unaligned addresses have always been passed to
> > > vmf_insert_pfn_pmd(), but nothing cared until this patch. I *think*
> > > the only change needed is the following, thoughts?
> > >
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index ca0671d55aa6..82aee9a87efa 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -1560,7 +1560,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct
> > > vm_fault *vmf, pfn_t *pfnp,
> > > }
> > >
> > > trace_dax_pmd_insert_mapping(inode, vmf, PMD_SIZE, pfn, 
> > > entry);
> > > -   result = vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, 
> > > pfn,
> > > +   result = vmf_insert_pfn_pmd(vma, pmd_addr, vmf->pmd, pfn,
> > > write);
> >
> > We also call vmf_insert_pfn_pmd() in dax_insert_pfn_mkwrite() -- does
> > that need to change too?
> 
> It wasn't clear to me that it was a problem. I think that one already
> happens to be pmd-aligned.

Why would it need to be? The address is taken from vmf->address and that's
set up in __handle_mm_fault() like .address = address & PAGE_MASK. So I
don't see anything forcing PMD alignment of the virtual address...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v6 5/6] ext4: disable map_sync for async flush

2019-04-23 Thread Jan Kara
On Tue 23-04-19 13:36:11, Pankaj Gupta wrote:
> Don't support 'MAP_SYNC' with non-DAX files or with DAX files
> on an asynchronous dax_device. Virtio pmem provides an
> asynchronous host page cache flush mechanism. We don't
> support 'MAP_SYNC' with virtio pmem and ext4.
> 
> Signed-off-by: Pankaj Gupta 

The patch looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza


> ---
>  fs/ext4/file.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 69d65d49837b..4b2ccaf1932e 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -360,15 +360,16 @@ static const struct vm_operations_struct 
> ext4_file_vm_ops = {
>  static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
>  {
>   struct inode *inode = file->f_mapping->host;
> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> + struct dax_device *dax_dev = sbi->s_daxdev;
>  
> - if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
> + if (unlikely(ext4_forced_shutdown(sbi)))
>   return -EIO;
>  
> - /*
> -  * We don't support synchronous mappings for non-DAX files. At least
> -  * until someone comes with a sensible use case.
> + /* We don't support synchronous mappings for non-DAX files and
> +  * for DAX files if underneath dax_device is not synchronous.
>*/
> - if (!IS_DAX(file_inode(file)) && (vma->vm_flags & VM_SYNC))
> + if (!daxdev_mapping_supported(vma, dax_dev))
>   return -EOPNOTSUPP;
>  
>   file_accessed(file);
> -- 
> 2.20.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v6 4/6] dax: check synchronous mapping is supported

2019-04-23 Thread Jan Kara
On Tue 23-04-19 13:36:10, Pankaj Gupta wrote:
> This patch introduces 'daxdev_mapping_supported' helper
> which checks if 'MAP_SYNC' is supported with filesystem
> mapping. It also checks if corresponding dax_device is
> synchronous. Virtio pmem device is asynchronous and
> does not not support VM_SYNC. 
> 
> Suggested-by: Jan Kara 
> Signed-off-by: Pankaj Gupta 

The patch looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza


> ---
>  include/linux/dax.h | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index c97fc0cc7167..41b4a5db6305 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -41,6 +41,18 @@ void kill_dax(struct dax_device *dax_dev);
>  void dax_write_cache(struct dax_device *dax_dev, bool wc);
>  bool dax_write_cache_enabled(struct dax_device *dax_dev);
>  bool dax_synchronous(struct dax_device *dax_dev);
> +/*
> + * Check if given mapping is supported by the file / underlying device.
> + */
> +static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
> + struct dax_device *dax_dev)
> +{
> + if (!(vma->vm_flags & VM_SYNC))
> + return true;
> + if (!IS_DAX(file_inode(vma->vm_file)))
> + return false;
> + return dax_synchronous(dax_dev);
> +}
>  #else
>  static inline struct dax_device *dax_get_by_host(const char *host)
>  {
> @@ -68,6 +80,11 @@ static inline bool dax_write_cache_enabled(struct 
> dax_device *dax_dev)
>  {
>   return false;
>  }
> +static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
> + struct dax_device *dax_dev)
> +{
> + return !(vma->vm_flags & VM_SYNC);
> +}
>  #endif
>  
>  struct writeback_control;
> -- 
> 2.20.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v5 1/6] libnvdimm: nd_region flush callback support

2019-04-12 Thread Jan Kara
On Thu 11-04-19 07:51:48, Dan Williams wrote:
> On Tue, Apr 9, 2019 at 9:09 PM Pankaj Gupta  wrote:
> > +   } else {
> > +   if (nd_region->flush(nd_region))
> > +   rc = -EIO;
> 
> Given the common case wants to be fast and synchronous I think we
> should try to avoid retpoline overhead by default. So something like
> this:
> 
> if (nd_region->flush == generic_nvdimm_flush)
> rc = generic_nvdimm_flush(...);

I'd either add a comment about avoiding retpoline overhead here or just
make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
get confused by the code.
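
A sketch of that NULL-means-generic convention (names follow the patch
under discussion):

static int nvdimm_flush_sketch(struct nd_region *nd_region)
{
	/*
	 * NULL ->flush means the generic fast path; this also avoids a
	 * retpoline-expensive indirect call in the common case.
	 */
	if (!nd_region->flush)
		return generic_nvdimm_flush(nd_region);

	if (nd_region->flush(nd_region))
		return -EIO;
	return 0;
}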

        Honza
-- 
Jan Kara 
SUSE Labs, CR

