Re: [PATCH] ext4: Delete redundant uptodate check for buffer

2021-04-01 Thread Ritesh Harjani
On 21/04/01 03:03PM, Shaokun Zhang wrote:
> From: Yang Guo 
>
> The buffer's uptodate state has already been checked in set_buffer_uptodate(),
> so there is no need to call buffer_uptodate() before calling set_buffer_uptodate();
> delete the redundant check.
>
> Cc: "Theodore Ts'o" 
> Cc: Andreas Dilger 
> Signed-off-by: Yang Guo 
> Signed-off-by: Shaokun Zhang 
> ---
>  fs/ext4/inode.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)

Thanks for the patch. Changes look good and are trivial.

Feel free to add -
Reviewed-by: Ritesh Harjani 


Re: [PATCH v3 08/10] fsdax: Dedup file range to use a compare function

2021-04-01 Thread Ritesh Harjani
On 21/03/19 09:52AM, Shiyang Ruan wrote:
> With dax we cannot deal with readpage() etc. So, we create a dax
> comparison function which is similar to
> vfs_dedupe_file_range_compare().
> And introduce dax_remap_file_range_prep() for filesystem use.
>
> Signed-off-by: Goldwyn Rodrigues 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/dax.c | 56 
>  fs/remap_range.c | 45 ---
>  fs/xfs/xfs_reflink.c |  9 +--
>  include/linux/dax.h  |  4 
>  include/linux/fs.h   | 15 
>  5 files changed, 115 insertions(+), 14 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 348297b38f76..76f81f1d76ec 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1833,3 +1833,59 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
>   return dax_insert_pfn_mkwrite(vmf, pfn, order);
>  }
>  EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> +
> +static loff_t dax_range_compare_actor(struct inode *ino1, loff_t pos1,
> + struct inode *ino2, loff_t pos2, loff_t len, void *data,
> + struct iomap *smap, struct iomap *dmap)
> +{
> + void *saddr, *daddr;
> + bool *same = data;
> + int ret;
> +
> + if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE) {
> + *same = true;
> + return len;
> + }
> +
> + if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
> + *same = false;
> + return 0;
> + }
> +
> + ret = dax_iomap_direct_access(smap, pos1, ALIGN(pos1 + len, PAGE_SIZE),
> +   &saddr, NULL);

shouldn't it take len as the second argument?
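i.e., presumably the size parameter should be something derived from len
(e.g. ALIGN(len, PAGE_SIZE)) rather than ALIGN(pos1 + len, PAGE_SIZE), which
is an end offset. A rough sketch of the suggestion, for the author to confirm:

	ret = dax_iomap_direct_access(smap, pos1, ALIGN(len, PAGE_SIZE),
				      &saddr, NULL);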

> + if (ret < 0)
> + return -EIO;
> +
> + ret = dax_iomap_direct_access(dmap, pos2, ALIGN(pos2 + len, PAGE_SIZE),
> +   &daddr, NULL);

ditto.
> + if (ret < 0)
> + return -EIO;
> +
> + *same = !memcmp(saddr, daddr, len);
> + return len;
> +}
> +
> +int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> + struct inode *dest, loff_t destoff, loff_t len, bool *is_same,
> + const struct iomap_ops *ops)
> +{
> + int id, ret = 0;
> +
> + id = dax_read_lock();
> + while (len) {
> + ret = iomap_apply2(src, srcoff, dest, destoff, len, 0, ops,
> +is_same, dax_range_compare_actor);
> + if (ret < 0 || !*is_same)
> + goto out;
> +
> + len -= ret;
> + srcoff += ret;
> + destoff += ret;
> + }
> + ret = 0;
> +out:
> + dax_read_unlock(id);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(dax_dedupe_file_range_compare);
> diff --git a/fs/remap_range.c b/fs/remap_range.c
> index 77dba3a49e65..9079390edaf3 100644
> --- a/fs/remap_range.c
> +++ b/fs/remap_range.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "internal.h"
>
>  #include 
> @@ -199,9 +200,9 @@ static void vfs_unlock_two_pages(struct page *page1, 
> struct page *page2)
>   * Compare extents of two files to see if they are the same.
>   * Caller must have locked both inodes to prevent write races.
>   */
> -static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> -  struct inode *dest, loff_t destoff,
> -  loff_t len, bool *is_same)
> +int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> +   struct inode *dest, loff_t destoff,
> +   loff_t len, bool *is_same)
>  {
>   loff_t src_poff;
>   loff_t dest_poff;
> @@ -280,6 +281,7 @@ static int vfs_dedupe_file_range_compare(struct inode 
> *src, loff_t srcoff,
>  out_error:
>   return error;
>  }
> +EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
>
>  /*
>   * Check that the two inodes are eligible for cloning, the ranges make
> @@ -289,9 +291,11 @@ static int vfs_dedupe_file_range_compare(struct inode 
> *src, loff_t srcoff,
>   * If there's an error, then the usual negative error code is returned.
>   * Otherwise returns 0 with *len set to the request length.
>   */
> -int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> -   struct file *file_out, loff_t pos_out,
> -   loff_t *len, unsigned int remap_flags)
> +static int
> +__generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> + struct file *file_out, loff_t pos_out,
> + loff_t *len, unsigned int remap_flags,
> + const struct iomap_ops *ops)
>  {
>   struct inode *inode_in = file_inode(file_in);
>   struct inode *inode_out = file_inode(file_out);
> @@ -351,8 +355,15 @@ int generic_remap_file_range_prep(struct file *file_in, 
> loff_t pos_in,
>   if (remap_flags & REMAP_FILE_DEDUP) {
>   

Re: [PATCH v3 07/10] iomap: Introduce iomap_apply2() for operations on two files

2021-04-01 Thread Ritesh Harjani
On 21/03/19 09:52AM, Shiyang Ruan wrote:
> Some operations, such as comparing a range of data in two files under
> fsdax mode, require nested iomap_begin()/iomap_end() calls on two files.  Thus,
> we introduce iomap_apply2() to accept arguments from two files and
> iomap_actor2_t for actions on two files.
>
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/iomap/apply.c  | 56 +++
>  include/linux/iomap.h |  7 +-
>  2 files changed, 62 insertions(+), 1 deletion(-)
>
> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
> index 26ab6563181f..fbc38ce3d5b6 100644
> --- a/fs/iomap/apply.c
> +++ b/fs/iomap/apply.c
> @@ -97,3 +97,59 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t 
> length, unsigned flags,
>
>   return written ? written : ret;
>  }
> +
> +loff_t
> +iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2, loff_t 
> pos2,
> + loff_t length, unsigned int flags, const struct iomap_ops *ops,
> + void *data, iomap_actor2_t actor)
> +{
> + struct iomap smap = { .type = IOMAP_HOLE };
> + struct iomap dmap = { .type = IOMAP_HOLE };
> + loff_t written = 0, ret, ret2 = 0;
> + loff_t len1 = length, len2, min_len;
> +
> +	ret = ops->iomap_begin(ino1, pos1, len1, flags, &smap, NULL);
> + if (ret)
> + goto out_src;

If the above call fails, we need not call ->iomap_end() on smap.

> + if (WARN_ON(smap.offset > pos1)) {
> + written = -EIO;
> + goto out_src;
> + }
> + if (WARN_ON(smap.length == 0)) {
> + written = -EIO;
> + goto out_src;
> + }
> + len2 = min_t(loff_t, len1, smap.length);
> +
> +	ret = ops->iomap_begin(ino2, pos2, len2, flags, &dmap, NULL);
> + if (ret)
> + goto out_dest;

ditto

> + if (WARN_ON(dmap.offset > pos2)) {
> + written = -EIO;
> + goto out_dest;
> + }
> + if (WARN_ON(dmap.length == 0)) {
> + written = -EIO;
> + goto out_dest;
> + }
> + min_len = min_t(loff_t, len2, dmap.length);
> +
> +	written = actor(ino1, pos1, ino2, pos2, min_len, data, &smap, &dmap);
> +
> +out_dest:
> + if (ops->iomap_end)
> + ret2 = ops->iomap_end(ino2, pos2, len2,
> +			      written > 0 ? written : 0, flags, &dmap);
> +out_src:
> + if (ops->iomap_end)
> + ret = ops->iomap_end(ino1, pos1, len1,
> +			     written > 0 ? written : 0, flags, &smap);
> +

I guess this may not be a problem, but I still think we should be
consistent w.r.t. the len argument we are passing to ->iomap_end() for both types
of the iomap_apply* family of functions.
IIUC, we used to call ->iomap_end() with the length argument filled by the
filesystem from the ->iomap_begin() call.

Whereas the above breaks that behavior. Although I don't think this is fatal,
it is still better to be consistent with the APIs.
Thoughts?


> + if (ret)
> + return written ? written : ret;
> +
> + if (ret2)
> + return written ? written : ret2;
> +
> + return written;
> +}

if (written)
return written;

return ret ? ret : ret2;

Is above a simpler version?
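Putting the two comments together, a rough, untested sketch of how the error
paths could look, so that a failed ->iomap_begin() skips the matching
->iomap_end() and the final return stays simple (WARN_ON checks and length
calculations elided):

	ret = ops->iomap_begin(ino1, pos1, len1, flags, &smap, NULL);
	if (ret)
		return ret;		/* nothing to unwind yet */
	/* ... WARN_ON checks, len2 calculation ... */

	ret = ops->iomap_begin(ino2, pos2, len2, flags, &dmap, NULL);
	if (ret)
		goto out_src;		/* only smap needs ->iomap_end() */
	/* ... WARN_ON checks, min_len calculation ... */

	written = actor(ino1, pos1, ino2, pos2, min_len, data, &smap, &dmap);

	if (ops->iomap_end)
		ret2 = ops->iomap_end(ino2, pos2, len2,
				      written > 0 ? written : 0, flags, &dmap);
out_src:
	if (ops->iomap_end) {
		loff_t end_ret = ops->iomap_end(ino1, pos1, len1,
						written > 0 ? written : 0,
						flags, &smap);
		/* don't let ->iomap_end() clobber an earlier error */
		if (!ret)
			ret = end_ret;
	}

	if (written)
		return written;
	return ret ? ret : ret2;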

-ritesh


Re: [PATCH v3 06/10] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero

2021-04-01 Thread Ritesh Harjani
On 21/03/19 09:52AM, Shiyang Ruan wrote:
> Punching a hole on a reflinked file needs dax_copy_edge() too.  Otherwise,
> data in the non-aligned area will not be correct.  So, add the srcmap to
> dax_iomap_zero() and replace memset() with dax_copy_edge().
>
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/dax.c   | 9 +++--
>  fs/iomap/buffered-io.c | 2 +-
>  include/linux/dax.h| 3 ++-
>  3 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index cfe513eb111e..348297b38f76 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1174,7 +1174,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state 
> *xas, struct vm_fault *vmf,
>  }
>  #endif /* CONFIG_FS_DAX_PMD */
>
> -s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
> +s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
> + struct iomap *srcmap)

Do we know why dax_iomap_zero() operates on a PAGE_SIZE range?
IIUC, dax_zero_page_range() can take nr_pages as a parameter. But we still
always use one page at a time. Why is that?

>  {
>   sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
>   pgoff_t pgoff;
> @@ -1204,7 +1205,11 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct 
> iomap *iomap)
>   }
>
>   if (!page_aligned) {
> - memset(kaddr + offset, 0, size);
> + if (iomap->addr != srcmap->addr)
> + dax_iomap_cow_copy(offset, size, PAGE_SIZE, srcmap,
> +kaddr, true);
> + else
> + memset(kaddr + offset, 0, size);
>   dax_flush(iomap->dax_dev, kaddr + offset, size);
>   }
>   dax_read_unlock(id);
>

Maybe the above could be simplified to this?

if (page_aligned) {
rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
} else {
rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
if (iomap->addr != srcmap->addr)
dax_iomap_cow_copy(offset, size, PAGE_SIZE, srcmap,
   kaddr, true);
else
memset(kaddr + offset, 0, size);
dax_flush(iomap->dax_dev, kaddr + offset, size);
}

dax_read_unlock(id);
return rc < 0 ? rc : size;

Other than that looks good.
Feel free to add.
Reviewed-by: Ritesh Harjani 




Re: [PATCH v3 05/10] fsdax: Replace mmap entry in case of CoW

2021-04-01 Thread Ritesh Harjani
> [...] static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
>   loff_t pos = (loff_t)xas->xa_offset << PAGE_SHIFT;
>   bool write = vmf->flags & FAULT_FLAG_WRITE;
>   bool sync = dax_fault_is_synchronous(flags, vmf->vma, iomap);
> + unsigned int insert_flags = 0;
>   int err = 0;
>   pfn_t pfn;
>   void *kaddr;
> @@ -1453,8 +1462,14 @@ static vm_fault_t dax_fault_actor(struct vm_fault 
> *vmf, pfn_t *pfnp,
>   if (err)
>   return dax_fault_return(err);
>
> - entry = dax_insert_entry(xas, mapping, vmf, entry, pfn, 0,
> -  write && !sync);
> + if (write) {
> + if (!sync)
> +     insert_flags |= DAX_IF_DIRTY;
> + if (iomap->flags & IOMAP_F_SHARED)
> + insert_flags |= DAX_IF_COW;
> + }
> +
> + entry = dax_insert_entry(xas, vmf, entry, pfn, 0, insert_flags);
>
>   if (write && srcmap->addr != iomap->addr) {
>   err = dax_iomap_cow_copy(pos, size, size, srcmap, kaddr, false);
>

Rest looks good to me. Please feel free to add
Reviewed-by: Ritesh Harjani 

Sorry about changing my email in the middle of this code review.
I am planning to use the above gmail id as my primary account for all upstream
work from now on.

> --
> 2.30.1
>
>
>


Re: [PATCH 0/8] EXT4: Trivial typo fixes

2021-03-28 Thread Ritesh Harjani
On 21/03/27 04:00PM, Bhaskar Chowdhury wrote:
> This patch series fixes a few mundane typos in the specific files below.
>
> Bhaskar Chowdhury (8):
>   EXT4: migrate.c: Fixed few typos
>   EXT4: namei.c: Fixed a typo
>   EXT4: inline.c: A typo fixed
>   Fix a typo
>   EXT4: indirect.c: A trivial typo fix
>   EXT4: xattr.c: Mundane typo fix
>   EXT4: fast_commit.c: A mere typo fix
>   EXT4: mballoc.h: Single typo fix
>
>  fs/ext4/fast_commit.c | 2 +-
>  fs/ext4/indirect.c| 2 +-
>  fs/ext4/inline.c  | 2 +-
>  fs/ext4/inode.c   | 2 +-
>  fs/ext4/mballoc.h | 2 +-
>  fs/ext4/migrate.c | 6 +++---
>  fs/ext4/namei.c   | 8 
>  fs/ext4/xattr.c   | 2 +-
>  8 files changed, 13 insertions(+), 13 deletions(-)

Thanks,
Since these are trivial typo fixes, why not just one patch for fs/ext4/?
IMO, that way it is easier for the sender, reviewer and maintainer.

-ritesh


Re: [PATCH v3 04/10] fsdax: Introduce dax_iomap_cow_copy()

2021-03-23 Thread Ritesh Harjani




On 3/19/21 7:22 AM, Shiyang Ruan wrote:

In the case where the iomap is a write operation and the iomap is not equal
to the srcmap after iomap_begin, we consider it a CoW operation.

The destination extent which the iomap indicates is a newly allocated extent,
so the data needs to be copied from the srcmap to the newly allocated extent.
In theory, it is better to copy only the head and tail ranges which are
outside of the non-aligned area instead of copying the whole aligned
range. But in a dax page fault, it will always be an aligned range.  So,
we have to copy the whole range in this case.

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
---
  fs/dax.c | 71 
  1 file changed, 66 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index a70e6aa285bb..181aad97136a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1037,6 +1037,51 @@ static int dax_iomap_direct_access(struct iomap *iomap, 
loff_t pos, size_t size,
return rc;
  }
  
+/*

+ * Copy the head and tail part of the pages not included in the write but
+ * required for CoW, because pos/pos+length are not page aligned.  But in dax
+ * page fault case, the range is page aligned, we need to copy the whole range
+ * of data.  Use copy_edge to distinguish these cases.
+ */



Is this version better? Feel free to update/change.

dax_iomap_cow_copy(): This can be called from two places.
Either during a DAX write fault, to copy length bytes of data to daddr,
or during a normal DAX write operation, where dax_iomap_actor() might call
this to copy the unaligned head or tail of the range. In that
case the copy of the remaining aligned range is taken care of by
dax_iomap_actor() itself.

Also, note that a DAX fault will always result in an aligned pos and pos + length.

* @pos: address to do copy from.
* @length:  size of copy operation.
* @align_size:  aligned w.r.t align_size (either PMD_SIZE or PAGE_SIZE)
* @srcmap:  iomap srcmap
* @daddr:   destination address to copy to.


+static int dax_iomap_cow_copy(loff_t pos, loff_t length, size_t align_size,
+   struct iomap *srcmap, void *daddr, bool copy_edge)


do we need bool copy_edge here?
We can detect the non-aligned case directly if head_off != 0 or pg_end !=
end, no?
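i.e. something like the below (untested sketch, reusing head_off/end/pg_end as
computed in this function), in place of the copy_edge early return:

	/* fully aligned range: no head/tail outside the write to preserve */
	if (!head_off && end == pg_end)
		return copy_mc_to_kernel(daddr, saddr, length);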




+{
+   loff_t head_off = pos & (align_size - 1);
+   size_t size = ALIGN(head_off + length, align_size);
+   loff_t end = pos + length;
+   loff_t pg_end = round_up(end, align_size);
+   void *saddr = 0;
+   int ret = 0;
+
+	ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
+   if (ret)
+   return ret;
+
+   if (!copy_edge)
+   return copy_mc_to_kernel(daddr, saddr, length);
+
+   /* Copy the head part of the range.  Note: we pass offset as length. */
+   if (head_off) {
+   if (saddr)
+   ret = copy_mc_to_kernel(daddr, saddr, head_off);
+   else
+   memset(daddr, 0, head_off);
+   }
+   /* Copy the tail part of the range */
+   if (end < pg_end) {
+   loff_t tail_off = head_off + length;
+   loff_t tail_len = pg_end - end;
+
+   if (saddr)
+   ret = copy_mc_to_kernel(daddr + tail_off,
+   saddr + tail_off, tail_len);
+   else
+   memset(daddr + tail_off, 0, tail_len);
+   }


Can you please help me understand in which case saddr is 0 and we
will fall back to the memset API?
I was thinking, shouldn't such restrictions rather be coded inside the
copy_mc_to_kernel() function in general?



-ritesh


Re: [PATCH v3 02/10] fsdax: Factor helper: dax_fault_actor()

2021-03-23 Thread Ritesh Harjani




On 3/19/21 7:22 AM, Shiyang Ruan wrote:

The core logic in the two dax page fault functions is similar. So, move
the logic into a common helper function. Also, to facilitate the
addition of new features, such as CoW, switch-case is no longer used to
handle different iomap types.

Signed-off-by: Shiyang Ruan 
---
  fs/dax.c | 291 +++
  1 file changed, 145 insertions(+), 146 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 7031e4302b13..33ddad0f3091 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1053,6 +1053,66 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
return ret;
  }
  
+#ifdef CONFIG_FS_DAX_PMD

+static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
+   struct iomap *iomap, void **entry)
+{
+   struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+   unsigned long pmd_addr = vmf->address & PMD_MASK;
+   struct vm_area_struct *vma = vmf->vma;
+   struct inode *inode = mapping->host;
+   pgtable_t pgtable = NULL;
+   struct page *zero_page;
+   spinlock_t *ptl;
+   pmd_t pmd_entry;
+   pfn_t pfn;
+
+   zero_page = mm_get_huge_zero_page(vmf->vma->vm_mm);
+
+   if (unlikely(!zero_page))
+   goto fallback;
+
+   pfn = page_to_pfn_t(zero_page);
+   *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
+   DAX_PMD | DAX_ZERO_PAGE, false);
+
+   if (arch_needs_pgtable_deposit()) {
+   pgtable = pte_alloc_one(vma->vm_mm);
+   if (!pgtable)
+   return VM_FAULT_OOM;
+   }
+
+   ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
+   if (!pmd_none(*(vmf->pmd))) {
+   spin_unlock(ptl);
+   goto fallback;
+   }
+
+   if (pgtable) {
+   pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
+   mm_inc_nr_ptes(vma->vm_mm);
+   }
+   pmd_entry = mk_pmd(zero_page, vmf->vma->vm_page_prot);
+   pmd_entry = pmd_mkhuge(pmd_entry);
+   set_pmd_at(vmf->vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
+   spin_unlock(ptl);
+   trace_dax_pmd_load_hole(inode, vmf, zero_page, *entry);
+   return VM_FAULT_NOPAGE;
+
+fallback:
+   if (pgtable)
+   pte_free(vma->vm_mm, pgtable);
+   trace_dax_pmd_load_hole_fallback(inode, vmf, zero_page, *entry);
+   return VM_FAULT_FALLBACK;
+}
+#else
+static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
+   struct iomap *iomap, void **entry)
+{
+   return VM_FAULT_FALLBACK;
+}
+#endif /* CONFIG_FS_DAX_PMD */
+
  s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
  {
sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
@@ -1289,6 +1349,61 @@ static int dax_fault_cow_page(struct vm_fault *vmf, 
struct iomap *iomap,
return 0;
  }
  
+/**

+ * dax_fault_actor - Common actor to handle pfn insertion in PTE/PMD fault.
+ * @vmf:   vm fault instance
+ * @pfnp:  pfn to be returned
+ * @xas:   the dax mapping tree of a file
+ * @entry: an unlocked dax entry to be inserted
+ * @pmd:   distinguish whether it is a pmd fault
+ * @flags: iomap flags
+ * @iomap: from iomap_begin()
+ * @srcmap:from iomap_begin(), not equal to iomap if it is a CoW
+ */
+static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
+   struct xa_state *xas, void *entry, bool pmd, unsigned int flags,
+   struct iomap *iomap, struct iomap *srcmap)
+{
+   struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+   size_t size = pmd ? PMD_SIZE : PAGE_SIZE;
+   loff_t pos = (loff_t)xas->xa_offset << PAGE_SHIFT;


shouldn't we use xa_index here for pos ?
(loff_t)xas->xa_index << PAGE_SHIFT;


+   bool write = vmf->flags & FAULT_FLAG_WRITE;
+   bool sync = dax_fault_is_synchronous(flags, vmf->vma, iomap);
+   int err = 0;
+   pfn_t pfn;
+
+   /* if we are reading UNWRITTEN and HOLE, return a hole. */
+   if (!write &&
+   (iomap->type == IOMAP_UNWRITTEN || iomap->type == IOMAP_HOLE)) {
+   if (!pmd)
+			return dax_load_hole(xas, mapping, &entry, vmf);
+   else
+			return dax_pmd_load_hole(xas, vmf, iomap, &entry);
+   }
+
+   if (iomap->type != IOMAP_MAPPED) {
+   WARN_ON_ONCE(1);
+   return VM_FAULT_SIGBUS;
+   }


So now, in case the mapping is not mapped, we always cause
VM_FAULT_SIGBUS. But earlier we were only doing WARN_ON_ONCE(1).
Can you please help explain why the change in behavior?





+
+	err = dax_iomap_pfn(iomap, pos, size, &pfn);
+   if (err)
+   return dax_fault_return(err);


Same case here as well. This could return SIGBUS, while earlier (I am not
sure why) we were only returning FALLBACK?




+
+   entry = dax_insert_entry(xas, mapping, vmf, entry, pfn, 0,
+

Re: [PATCH v3 03/10] fsdax: Output address in dax_iomap_pfn() and rename it

2021-03-23 Thread Ritesh Harjani




On 3/19/21 7:22 AM, Shiyang Ruan wrote:

Add an address output to dax_iomap_pfn() in order to perform a memcpy() in
the CoW case.  Since this function outputs both the address and the pfn, rename it to
dax_iomap_direct_access().

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
---
  fs/dax.c | 16 
  1 file changed, 12 insertions(+), 4 deletions(-)



I like the naming convention. It is consistent with the
dax_direct_access() function.

Changes look good to me. Feel free to add.

Reviewed-by: Ritesh Harjani 


Re: [PATCH v3 01/10] fsdax: Factor helpers to simplify dax fault code

2021-03-23 Thread Ritesh Harjani




On 3/19/21 7:22 AM, Shiyang Ruan wrote:

The dax page fault code is too long and a bit difficult to read. And it
is hard to understand when we are trying to add new features. Some of the
PTE/PMD code paths have similar logic. So, factor them into helper functions to
simplify the code.

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
---
  fs/dax.c | 152 ++-
  1 file changed, 84 insertions(+), 68 deletions(-)



The refactoring & the changes look good to me.
Feel free to add.

Reviewed-by: Ritesh Harjani 


Re: [PATCH v3 00/10] fsdax,xfs: Add reflink support for fsdax

2021-03-23 Thread Ritesh Harjani




On 3/19/21 7:22 AM, Shiyang Ruan wrote:

From: Shiyang Ruan 

This patchset is an attempt to add CoW support for fsdax, taking XFS,
which has both the reflink and fsdax features, as an example.



Thanks for the patchset. I have tried reviewing the series for logical
correctness and, to some extent, functional correctness.
Since I am not well versed in the core functionality of the CoW operation,
I may have requested some clarifications where I could not follow
100% of what the code is doing.




(Rebased on v5.11)
==


Thanks. Yes, I saw some conflicts when I tried to apply this on the latest kernel.

-ritesh


Re: [PATCH] ext4: properly check for dirty state in ext4_inode_datasync_dirty()

2020-10-28 Thread Ritesh Harjani




On 10/28/20 8:59 PM, Theodore Y. Ts'o wrote:

On Wed, Oct 28, 2020 at 08:57:03AM +0530, Ritesh Harjani wrote:


Well, I too noticed this yesterday while I was testing xfstests -g swap.
Those tests were returning _notrun, hence that could be the reason why
it didn't get noticed in XFSTESTing from Ted.


Yeah, one of the things I discussed with Harshad is we really need a
test that looks like generic/472, but which is in shared/NNN, and
which unconditionally tries to use swapon for those file systems where
swapfiles are expected to work.  This is actually the second
regression caused by our breaking swapfile support (the other being
the iomap bmap change), which escaped our testing because we didn't
notice that generic/472 was skipped.


Yes, agreed, this is the second in a row.
So with fast-commit, the swap tests returned _notrun, since the
swapon syscall returned -EINVAL in _require_scratch_swapfile() itself.
This is due to an old commit in fstests made so that the swap tests work on
btrfs on both kernels (with and without support for swapon in btrfs): it
first checks in _require_scratch_swapfile() whether swapon even works
or not, and hence skips the rest of the test if _require_scratch_swapfile()
fails.

Secondly, with the bmap to iomap interface, I guess it should pass
all tests except the case with fallocated files, which I think is
tests/generic/496. But here too it assumes that if it fails the first time
with falloc, then swapon may not be supported for that fs, and hence it does
_notrun.

I am actually working on this to make these swap tests return some
definitive pass or failure status. Will be sending some patches soon.
I could use your idea to add a test in shared/NNN for testing swap with
fallocate files for ext4 and xfs (for the bmap to iomap ext4 regression
category of tests).

Thanks
-ritesh



Re: [PATCH] ext4: properly check for dirty state in ext4_inode_datasync_dirty()

2020-10-28 Thread Ritesh Harjani




On 10/28/20 9:18 AM, harshad shirwadkar wrote:

Actually the simpler fix for this in case of fast commits is to check
if the inode is on the fast commit list or not. Since we clear the
fast commit list after every fast and / or full commit, it's always
true that if the inode is not on the list, that means it isn't dirty.
This will simplify the logic here and then we can probably get rid of
i_fc_committed_subtid field altogether. I'll test this and send out a
patch.


Yes, sounds like a better solution. Thanks!
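
For reference, a minimal sketch of what that check could look like in
ext4_inode_datasync_dirty() (untested; assuming the per-inode fast commit
list field is EXT4_I(inode)->i_fc_list):

	if (journal) {
		if (jbd2_transaction_committed(journal,
					       EXT4_I(inode)->i_datasync_tid))
			return false;
		if (test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
			/* not on the fast commit list => nothing left to commit */
			return !list_empty(&EXT4_I(inode)->i_fc_list);
		return true;
	}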

-ritesh


Re: [PATCH] ext4: properly check for dirty state in ext4_inode_datasync_dirty()

2020-10-28 Thread Ritesh Harjani




On 10/27/20 3:58 AM, harshad shirwadkar wrote:

Thanks Andrea for catching and sending out a fix for this.

On Sat, Oct 24, 2020 at 7:01 AM Andrea Righi  wrote:


ext4_inode_datasync_dirty() needs to return 'true' if the inode is
dirty, 'false' otherwise, but the logic seems to be incorrectly changed
by commit aa75f4d3daae ("ext4: main fast-commit commit path").

This introduces a problem with swap files that are always failing to be
activated, showing this error in dmesg:

  [   34.406479] swapon: file is not committed



Well, I too noticed this yesterday while I was testing xfstests -g swap.
Those tests were returning _notrun, hence that could be the reason why
it didn't get noticed in XFSTESTing from Ted.

- I did notice that this code was introduced in v10 only.
This wasn't there in v9 though.



Simple test case to reproduce the problem:

   # fallocate -l 8G swapfile
   # chmod 0600 swapfile
   # mkswap swapfile
   # swapon swapfile

Fix the logic to return the proper state of the inode.

Link: https://lore.kernel.org/lkml/20201024131333.GA32124@xps-13-7390
Fixes: aa75f4d3daae ("ext4: main fast-commit commit path")
Signed-off-by: Andrea Righi 
---
  fs/ext4/inode.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 03c2253005f0..a890a17ab7e1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3308,8 +3308,8 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
 if (journal) {
 if (jbd2_transaction_committed(journal,
 EXT4_I(inode)->i_datasync_tid))
-   return true;
-   return atomic_read(_SB(inode->i_sb)->s_fc_subtid) >=
+   return false;
+   return atomic_read(_SB(inode->i_sb)->s_fc_subtid) <
 EXT4_I(inode)->i_fc_committed_subtid;

In addition, the above condition should only be checked if fast
commits are enabled. So, in effect this overall condition will look
like this:

if (journal) {
 if (jbd2_transaction_committed(journal, EXT4_I(inode)->i_datasync_tid))
 return false;
 if (test_opt2(sb, JOURNAL_FAST_COMMIT))
 return atomic_read(&EXT4_SB(inode->i_sb)->s_fc_subtid) <
EXT4_I(inode)->i_fc_committed_subtid;
 return true;
}


Yup - I too had made a similar patch. But then I also noticed that the below
condition will always remain false, since we never update
"i_fc_committed_subtid" other than at these 2 places
(once during init where we set it to 0, and once during ext4_fc_commit()
where we set it to sbi->s_fc_subtid).


atomic_read(&EXT4_SB(inode->i_sb)->s_fc_subtid) <
EXT4_I(inode)->i_fc_committed_subtid



Maybe I need more reading around this.

-ritesh







Thanks,
Harshad


 }

--
2.27.0



Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-22 Thread Ritesh Harjani




On 9/15/20 6:30 PM, Matthew Wilcox wrote:

On Tue, Sep 15, 2020 at 08:34:41AM -0400, Mikulas Patocka wrote:

- when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses
buffer cache for the mapping. The buffer cache slows down fsck by a factor
of 5 to 10. Could it be possible to change the kernel so that it maps DAX
based block devices directly?


Oh, because fs/block_dev.c has:
 .mmap   = generic_file_mmap,

I don't see why we shouldn't have a blkdev_mmap modelled after
ext2_file_mmap() with the corresponding blkdev_dax_vm_ops.



Please help with the below 2 queries:

1. Can't we use ->direct_IO here to avoid the mentioned performance problem?
2. Is there any other existing use case where having this blkdev_dax_vm_ops would be
useful?
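
For context, a very rough sketch of what Willy's suggestion could look like,
modelled on ext2_file_mmap(). Note that blkdev_dax_vm_ops does not exist
today, and how a block device inode would ever be marked DAX-capable is
exactly the open question -- everything below is hypothetical:

static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct inode *bd_inode = bdev_file_inode(file);

	if (!IS_DAX(bd_inode))
		return generic_file_mmap(file, vma);

	file_accessed(file);
	vma->vm_ops = &blkdev_dax_vm_ops;	/* hypothetical vm_ops */
	return 0;
}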


-ritesh


Re: [PATCH AUTOSEL 5.4 105/330] ext4: make dioread_nolock the default

2020-09-18 Thread Ritesh Harjani

Hello Sasha,

On 9/18/20 7:27 AM, Sasha Levin wrote:

From: Theodore Ts'o 

[ Upstream commit 244adf6426ee31a83f397b700d964cff12a247d3 ]

This fixes the direct I/O versus writeback race which can reveal stale
data, and it improves the tail latency of commits on slow devices.

Link: https://lore.kernel.org/r/20200125022254.1101588-1-ty...@mit.edu
Signed-off-by: Theodore Ts'o 
Signed-off-by: Sasha Levin 


I see that the subject patch is being taken into many different stable
trees. I remember there was a fixes patch on top of this [1].
The below thread [1] has more details. Just wanted to point this one out.
Please ignore if already taken.

[1]: https://www.spinics.net/lists/linux-ext4/msg72219.html

-ritesh


[PATCHv3 0/1] Optimize ext4 file overwrites - perf improvement

2020-09-17 Thread Ritesh Harjani
spinlock contention problem in the slab alloc path could be optimized
on PPC in general; will look into it separately. But I could still see a
perf improvement of close to ~2x on a QEMU setup on x86 with a simulated pmem
device with the patched kernel v/s the vanilla kernel with the same fio workload.

perf report from vanilla_kernel (this is not seen with patched kernel) (ppc64)
===

  47.86%  fio  [kernel.vmlinux][k] do_raw_spin_lock
 |
 ---do_raw_spin_lock
|
|--19.43%--_raw_spin_lock
|  |
|   --19.31%--0
| |
| |--9.77%--deactivate_slab.isra.61
| |  ___slab_alloc
| |  __slab_alloc
| |  kmem_cache_alloc
| |  jbd2__journal_start
| |  __ext4_journal_start_sb
<...>

2. This problem was reported by Dan Williams at [1]

Links
==
[1]: https://lore.kernel.org/linux-ext4/20190802144304.gp25...@quack2.suse.cz/T/
[v2]: https://lkml.org/lkml/2020/8/22/123

Ritesh Harjani (1):
  ext4: Optimize file overwrites

 fs/ext4/inode.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

-- 
2.26.2



[PATCHv3 1/1] ext4: Optimize file overwrites

2020-09-17 Thread Ritesh Harjani
If the file already has its underlying blocks/extents allocated,
then we don't need to start a journal txn and can directly return
the underlying mapping. Currently ext4_iomap_begin() is used by
both the DAX & DIO paths. We can check if the write request is an
overwrite & then directly return the mapping information.

This could give a significant perf boost for multi-threaded writes,
especially random overwrites.
On a PPC64 VM with a simulated pmem (DAX) device, ~10x perf improvement
could be seen in random writes (overwrite), also because this optimizes
away the spinlock contention during jbd2 slab cache allocation
(jbd2_journal_handle). On an x86 VM, ~2x perf improvement was observed.

Reported-by: Dan Williams 
Suggested-by: Jan Kara 
Signed-off-by: Ritesh Harjani 
---
 fs/ext4/inode.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 10dd470876b3..6eae17758ece 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3437,14 +3437,26 @@ static int ext4_iomap_begin(struct inode *inode, loff_t 
offset, loff_t length,
map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
 
-   if (flags & IOMAP_WRITE)
+   if (flags & IOMAP_WRITE) {
+   /*
+* We check here if the blocks are already allocated, then we
+* don't need to start a journal txn and we can directly return
+* the mapping information. This could boost performance
+* especially in multi-threaded overwrite requests.
+*/
+		if (offset + length <= i_size_read(inode)) {
+			ret = ext4_map_blocks(NULL, inode, &map, 0);
+   if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
+   goto out;
+   }
		ret = ext4_iomap_alloc(inode, &map, flags);
-   else
+   } else {
		ret = ext4_map_blocks(NULL, inode, &map, 0);
+   }
 
if (ret < 0)
return ret;
-
+out:
	ext4_set_iomap(inode, iomap, &map, offset, length);
 
return 0;
-- 
2.26.2



[RESEND PATCH 1/1] block: Set same_page to false in __bio_try_merge_page if ret is false

2020-09-08 Thread Ritesh Harjani
If we hit the UINT_MAX limit of bio->bi_iter.bi_size, and so we are anyway
not merging this page into this bio, then it makes sense to also set
same_page to false before returning.

Without this patch, we hit below WARNING in iomap.
This mostly happens with very large memory system and / or after tweaking
vm dirty threshold params to delay writeback of dirty data.

WARNING: CPU: 18 PID: 5130 at fs/iomap/buffered-io.c:74 
iomap_page_release+0x120/0x150
 CPU: 18 PID: 5130 Comm: fio Kdump: loaded Tainted: GW 
5.8.0-rc3 #6
 Call Trace:
  __remove_mapping+0x154/0x320 (unreliable)
  iomap_releasepage+0x80/0x180
  try_to_release_page+0x94/0xe0
  invalidate_inode_page+0xc8/0x110
  invalidate_mapping_pages+0x1dc/0x540
  generic_fadvise+0x3c8/0x450
  xfs_file_fadvise+0x2c/0xe0 [xfs]
  vfs_fadvise+0x3c/0x60
  ksys_fadvise64_64+0x68/0xe0
  sys_fadvise64+0x28/0x40
  system_call_exception+0xf8/0x1c0
  system_call_common+0xf0/0x278

Fixes: cc90bc68422 ("block: fix "check bi_size overflow before merge"")
Suggested-by: Christoph Hellwig 
Reported-by: Shivaprasad G Bhat 
Signed-off-by: Anju T Sudhakar 
Signed-off-by: Ritesh Harjani 
---
RESEND: added "fixes" tag

 block/bio.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index a7366c02c9b5..675ecd81047b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -877,8 +877,10 @@ bool __bio_try_merge_page(struct bio *bio, struct page 
*page,
	struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
 
if (page_is_mergeable(bv, page, len, off, same_page)) {
-   if (bio->bi_iter.bi_size > UINT_MAX - len)
+   if (bio->bi_iter.bi_size > UINT_MAX - len) {
+   *same_page = false;
return false;
+   }
bv->bv_len += len;
bio->bi_iter.bi_size += len;
return true;
-- 
2.25.4



[PATCH] ext4: Implement swap_activate aops using iomap

2020-09-04 Thread Ritesh Harjani
After moving ext4's bmap to the iomap interface, swapon functionality
on files created using fallocate (which creates unwritten extents) is
failing. This is because the iomap_bmap interface returns 0 for unwritten
extents and thus generic_swapfile_activate considers these as holes
and hence bails out with the below kernel msg :-

[340.915835] swapon: swapfile has holes

To fix this we need to implement the ->swap_activate aop in ext4,
which will use ext4_iomap_report_ops. Since we only need to return
the list of extents, ext4_iomap_report_ops should be enough.

Reported-by: Yuxuan Shui 
Fixes: ac58e4fb03f ("ext4: move ext4 bmap to use iomap infrastructure")
Signed-off-by: Ritesh Harjani 
---
[Tested xfstests with -g swap; Tested stress-ng with --thrash option]

 fs/ext4/inode.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1556cabd3a7d..9ac0f83e6fbe 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3605,6 +3605,13 @@ static int ext4_set_page_dirty(struct page *page)
return __set_page_dirty_buffers(page);
 }
 
+static int ext4_iomap_swap_activate(struct swap_info_struct *sis,
+   struct file *file, sector_t *span)
+{
+   return iomap_swapfile_activate(sis, file, span,
+				       &ext4_iomap_report_ops);
+}
+
 static const struct address_space_operations ext4_aops = {
.readpage   = ext4_readpage,
.readpages  = ext4_readpages,
@@ -3620,6 +3627,7 @@ static const struct address_space_operations ext4_aops = {
.migratepage= buffer_migrate_page,
.is_partially_uptodate  = block_is_partially_uptodate,
.error_remove_page  = generic_error_remove_page,
+   .swap_activate  = ext4_iomap_swap_activate,
 };
 
 static const struct address_space_operations ext4_journalled_aops = {
@@ -3636,6 +3644,7 @@ static const struct address_space_operations 
ext4_journalled_aops = {
.direct_IO  = noop_direct_IO,
.is_partially_uptodate  = block_is_partially_uptodate,
.error_remove_page  = generic_error_remove_page,
+   .swap_activate  = ext4_iomap_swap_activate,
 };
 
 static const struct address_space_operations ext4_da_aops = {
@@ -3653,6 +3662,7 @@ static const struct address_space_operations ext4_da_aops 
= {
.migratepage= buffer_migrate_page,
.is_partially_uptodate  = block_is_partially_uptodate,
.error_remove_page  = generic_error_remove_page,
+   .swap_activate  = ext4_iomap_swap_activate,
 };
 
 static const struct address_space_operations ext4_dax_aops = {
@@ -3661,6 +3671,7 @@ static const struct address_space_operations 
ext4_dax_aops = {
.set_page_dirty = noop_set_page_dirty,
.bmap   = ext4_bmap,
.invalidatepage = noop_invalidatepage,
+   .swap_activate  = ext4_iomap_swap_activate,
 };
 
 void ext4_set_aops(struct inode *inode)
-- 
2.25.1



Re: [PATCH 1/1] block: Set same_page to false in __bio_try_merge_page if ret is false

2020-08-27 Thread Ritesh Harjani

Hello Jens,

On 8/21/20 11:41 PM, Ritesh Harjani wrote:

If we hit the UINT_MAX limit of bio->bi_iter.bi_size, and so we are anyway
not merging this page into this bio, then it makes sense to also set
same_page to false before returning.

Without this patch, we hit below WARNING in iomap.
This mostly happens with very large memory system and / or after tweaking
vm dirty threshold params to delay writeback of dirty data.

WARNING: CPU: 18 PID: 5130 at fs/iomap/buffered-io.c:74 
iomap_page_release+0x120/0x150
  CPU: 18 PID: 5130 Comm: fio Kdump: loaded Tainted: GW 
5.8.0-rc3 #6
  Call Trace:
   __remove_mapping+0x154/0x320 (unreliable)
   iomap_releasepage+0x80/0x180
   try_to_release_page+0x94/0xe0
   invalidate_inode_page+0xc8/0x110
   invalidate_mapping_pages+0x1dc/0x540
   generic_fadvise+0x3c8/0x450
   xfs_file_fadvise+0x2c/0xe0 [xfs]
   vfs_fadvise+0x3c/0x60
   ksys_fadvise64_64+0x68/0xe0
   sys_fadvise64+0x28/0x40
   system_call_exception+0xf8/0x1c0
   system_call_common+0xf0/0x278

Suggested-by: Christoph Hellwig 
Reported-by: Shivaprasad G Bhat 
Signed-off-by: Anju T Sudhakar 
Signed-off-by: Ritesh Harjani 


Sorry, I forgot to add the fixes tag. I think the below commit
introduced this. Let me know if you want me to
re-send this patch with the below fixes tag.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cc90bc68422318eb8e75b15cd74bc8d538a7df29


-ritesh



[PATCHv2 0/3] Optimize ext4 DAX overwrites

2020-08-22 Thread Ritesh Harjani
perf report from vanilla_kernel (this is not seen with patched kernel) (ppc64)
===

  47.86%  fio  [kernel.vmlinux][k] do_raw_spin_lock
 |
 ---do_raw_spin_lock
|
|--19.43%--_raw_spin_lock
|  |
|   --19.31%--0
| |
| |--9.77%--deactivate_slab.isra.61
| |  ___slab_alloc
| |  __slab_alloc
| |  kmem_cache_alloc
| |  jbd2__journal_start
| |  __ext4_journal_start_sb
<...>

2. This problem was reported by Dan Williams at [1]

Links
==
[1]: https://lore.kernel.org/linux-ext4/20190802144304.gp25...@quack2.suse.cz/T/

Ritesh Harjani (3):
  ext4: Refactor ext4_overwrite_io() to take ext4_map_blocks as argument
  ext4: Extend ext4_overwrite_io() for dax path
  ext4: Optimize ext4 DAX overwrites

 fs/ext4/ext4.h  |  2 ++
 fs/ext4/file.c  | 28 ++--
 fs/ext4/inode.c | 11 +--
 3 files changed, 29 insertions(+), 12 deletions(-)

-- 
2.25.4



[PATCHv2 1/3] ext4: Refactor ext4_overwrite_io() to take ext4_map_blocks as argument

2020-08-22 Thread Ritesh Harjani
Refactor ext4_overwrite_io() to take struct ext4_map_blocks
as its function argument, with m_lblk and m_len filled in
by the caller.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani 
---
 fs/ext4/file.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2a01e31a032c..84f73ed91af2 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -188,26 +188,22 @@ ext4_extending_io(struct inode *inode, loff_t offset, 
size_t len)
 }
 
 /* Is IO overwriting allocated and initialized blocks? */
-static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
+static bool ext4_overwrite_io(struct inode *inode, struct ext4_map_blocks *map)
 {
-   struct ext4_map_blocks map;
unsigned int blkbits = inode->i_blkbits;
-   int err, blklen;
+   loff_t end = (map->m_lblk + map->m_len) << blkbits;
+   int err, blklen = map->m_len;
 
-   if (pos + len > i_size_read(inode))
+   if (end > i_size_read(inode))
return false;
 
-   map.m_lblk = pos >> blkbits;
-   map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
-   blklen = map.m_len;
-
-	err = ext4_map_blocks(NULL, inode, &map, 0);
+   err = ext4_map_blocks(NULL, inode, map, 0);
/*
 * 'err==len' means that all of the blocks have been preallocated,
 * regardless of whether they have been initialized or not. To exclude
 * unwritten extents, we need to check m_flags.
 */
-   return err == blklen && (map.m_flags & EXT4_MAP_MAPPED);
+   return err == blklen && (map->m_flags & EXT4_MAP_MAPPED);
 }
 
 static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
@@ -407,6 +403,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, 
struct iov_iter *from,
 {
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file);
+   struct ext4_map_blocks map;
loff_t offset;
size_t count;
ssize_t ret;
@@ -420,6 +417,9 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, 
struct iov_iter *from,
count = ret;
if (ext4_extending_io(inode, offset, count))
*extend = true;
+
+   map.m_lblk = offset >> inode->i_blkbits;
+   map.m_len = EXT4_MAX_BLOCKS(count, offset, inode->i_blkbits);
/*
 * Determine whether the IO operation will overwrite allocated
 * and initialized blocks.
@@ -427,7 +427,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, 
struct iov_iter *from,
 * in file_modified().
 */
if (*ilock_shared && (!IS_NOSEC(inode) || *extend ||
-!ext4_overwrite_io(inode, offset, count))) {
+			      !ext4_overwrite_io(inode, &map))) {
inode_unlock_shared(inode);
*ilock_shared = false;
inode_lock(inode);
-- 
2.25.4



[PATCHv2 2/3] ext4: Extend ext4_overwrite_io() for dax path

2020-08-22 Thread Ritesh Harjani
DAX uses the ->iomap_begin path, which gets called to map/allocate
extents. In order to avoid starting a journal txn where extent
allocation is not required, we need to check if ext4_map_blocks()
has returned any mapped extent entry.

Signed-off-by: Ritesh Harjani 
---
 fs/ext4/ext4.h |  2 ++
 fs/ext4/file.c | 14 +++---
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 42f5060f3cdf..8a1b468bfb49 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3232,6 +3232,8 @@ extern const struct dentry_operations ext4_dentry_ops;
 extern const struct inode_operations ext4_file_inode_operations;
 extern const struct file_operations ext4_file_operations;
 extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
+extern bool ext4_overwrite_io(struct inode *inode, struct ext4_map_blocks *map,
+ bool is_dax);
 
 /* inline.c */
 extern int ext4_get_max_inline_size(struct inode *inode);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 84f73ed91af2..6c252498334b 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -188,7 +188,8 @@ ext4_extending_io(struct inode *inode, loff_t offset, 
size_t len)
 }
 
 /* Is IO overwriting allocated and initialized blocks? */
-static bool ext4_overwrite_io(struct inode *inode, struct ext4_map_blocks *map)
+bool ext4_overwrite_io(struct inode *inode, struct ext4_map_blocks *map,
+  bool is_dax)
 {
unsigned int blkbits = inode->i_blkbits;
loff_t end = (map->m_lblk + map->m_len) << blkbits;
@@ -198,12 +199,19 @@ static bool ext4_overwrite_io(struct inode *inode, struct 
ext4_map_blocks *map)
return false;
 
err = ext4_map_blocks(NULL, inode, map, 0);
+
/*
+* In case of dax to avoid starting a transaction in ext4_iomap_begin()
+* we check if ext4_map_blocks() can return any mapped extent.
+*
 * 'err==len' means that all of the blocks have been preallocated,
 * regardless of whether they have been initialized or not. To exclude
 * unwritten extents, we need to check m_flags.
 */
-   return err == blklen && (map->m_flags & EXT4_MAP_MAPPED);
+   if (is_dax)
+   return err > 0 && (map->m_flags & EXT4_MAP_MAPPED);
+   else
+   return err == blklen && (map->m_flags & EXT4_MAP_MAPPED);
 }
 
 static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
@@ -427,7 +435,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, 
struct iov_iter *from,
 * in file_modified().
 */
if (*ilock_shared && (!IS_NOSEC(inode) || *extend ||
-			      !ext4_overwrite_io(inode, &map))) {
+			      !ext4_overwrite_io(inode, &map, false))) {
inode_unlock_shared(inode);
*ilock_shared = false;
inode_lock(inode);
-- 
2.25.4



[PATCHv2 3/3] ext4: Optimize ext4 DAX overwrites

2020-08-22 Thread Ritesh Harjani
Currently, in the case of DAX, we are starting a journal txn every time for
the IOMAP_WRITE case. This can be optimized away in case of an overwrite
(where the blocks were already allocated).

This could give a significant performance boost for multi-threaded writes,
especially random writes.
On a PPC64 VM with a simulated pmem device, ~10x perf improvement could be
seen in random writes (overwrite), also because this optimizes away the
spinlock contention during jbd2 slab cache allocation (jbd2_journal_handle).
On an x86 VM, ~2x perf improvement was observed.

Reported-by: Dan Williams 
Suggested-by: Jan Kara 
Signed-off-by: Ritesh Harjani 
---
 fs/ext4/inode.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 10dd470876b3..c18009c91e68 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3437,6 +3437,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t 
offset, loff_t length,
map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
 
+   /*
+* In case of DAX write, we check if this is overwrite request, to avoid
+* starting a journal txn in ext4_iomap_alloc()
+*/
+   if ((flags & IOMAP_WRITE) && IS_DAX(inode) &&
+	    ext4_overwrite_io(inode, &map, true))
+   goto out_set;
+
if (flags & IOMAP_WRITE)
		ret = ext4_iomap_alloc(inode, &map, flags);
else
@@ -3444,9 +3452,8 @@ static int ext4_iomap_begin(struct inode *inode, loff_t 
offset, loff_t length,
 
if (ret < 0)
return ret;
-
+out_set:
	ext4_set_iomap(inode, iomap, &map, offset, length);
-
return 0;
 }
 
-- 
2.25.4



[PATCH 1/1] block: Set same_page to false in __bio_try_merge_page if ret is false

2020-08-21 Thread Ritesh Harjani
If we hit the UINT_MAX limit of bio->bi_iter.bi_size, and so we are anyway
not merging this page into this bio, then it makes sense to also set
same_page to false before returning.

Without this patch, we hit below WARNING in iomap.
This mostly happens with very large memory system and / or after tweaking
vm dirty threshold params to delay writeback of dirty data.

WARNING: CPU: 18 PID: 5130 at fs/iomap/buffered-io.c:74 
iomap_page_release+0x120/0x150
 CPU: 18 PID: 5130 Comm: fio Kdump: loaded Tainted: GW 
5.8.0-rc3 #6
 Call Trace:
  __remove_mapping+0x154/0x320 (unreliable)
  iomap_releasepage+0x80/0x180
  try_to_release_page+0x94/0xe0
  invalidate_inode_page+0xc8/0x110
  invalidate_mapping_pages+0x1dc/0x540
  generic_fadvise+0x3c8/0x450
  xfs_file_fadvise+0x2c/0xe0 [xfs]
  vfs_fadvise+0x3c/0x60
  ksys_fadvise64_64+0x68/0xe0
  sys_fadvise64+0x28/0x40
  system_call_exception+0xf8/0x1c0
  system_call_common+0xf0/0x278

Suggested-by: Christoph Hellwig 
Reported-by: Shivaprasad G Bhat 
Signed-off-by: Anju T Sudhakar 
Signed-off-by: Ritesh Harjani 
---
[prev discussion]:- https://patchwork.kernel.org/patch/11723453/

 block/bio.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index a7366c02c9b5..675ecd81047b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -877,8 +877,10 @@ bool __bio_try_merge_page(struct bio *bio, struct page 
*page,
	struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
 
if (page_is_mergeable(bv, page, len, off, same_page)) {
-   if (bio->bi_iter.bi_size > UINT_MAX - len)
+   if (bio->bi_iter.bi_size > UINT_MAX - len) {
+   *same_page = false;
return false;
+   }
bv->bv_len += len;
bio->bi_iter.bi_size += len;
return true;
-- 
2.25.4



Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend().

2020-08-21 Thread Ritesh Harjani




On 8/21/20 11:30 AM, Christoph Hellwig wrote:

On Fri, Aug 21, 2020 at 10:15:33AM +0530, Ritesh Harjani wrote:

Please correct me here, but as I see it, the bio has only these two limits
which it checks when adding a page to a bio. It doesn't check for the limits
in /sys/block//queue/*, no? I guess then they could be checked
by the block layer below, before submitting the bio?


The bio does not, but the blk-mq code will split the bios when mapping
it to requests, take a look at blk_mq_submit_bio and __blk_queue_split.


Thanks :)



But while the default limits are quite low, they can be increased
significantly, which tends to help with performance and is often
also done by scripts shipped by the distributions.


This issue was first observed while running fio on a system with
huge memory. But here is an easy way we figured out to trigger the
issue almost every time with a loop device on my VM setup. I have provided
all the details on this below.


Can you wire this up for xfstests?



Sure, will do that.
I do see that generic/460 plays with such vm dirty_ratio params, which
this test would also need to manipulate to trigger this issue.


Thanks for the suggestion!
-ritesh


Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend().

2020-08-21 Thread Ritesh Harjani




On 8/21/20 11:37 AM, Christoph Hellwig wrote:

On Wed, Aug 19, 2020 at 03:58:41PM +0530, Anju T Sudhakar wrote:

From: Ritesh Harjani 

__bio_try_merge_page() may return same_page = 1 and merged = 0.
This could happen when bio->bi_iter.bi_size + len > UINT_MAX.
Handle this case in iomap_add_to_ioend() by incrementing write_count.
This scenario mostly happens where we have too much dirty data accumulated.

w/o the patch we hit below kernel warning,


I think this is better fixed in the block layer rather than working
around the problem in the callers.  Something like this:

diff --git a/block/bio.c b/block/bio.c
index c63ba04bd62967..ef321cd1072e4e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -879,8 +879,10 @@ bool __bio_try_merge_page(struct bio *bio, struct page 
*page,
	struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
  
  		if (page_is_mergeable(bv, page, len, off, same_page)) {

-   if (bio->bi_iter.bi_size > UINT_MAX - len)
+   if (bio->bi_iter.bi_size > UINT_MAX - len) {
+   *same_page = false;
return false;
+   }
bv->bv_len += len;
bio->bi_iter.bi_size += len;
return true;



Yeah, we had thought of that. But what we then thought was that maybe the
API does return the right thing. Meaning, what the API says is: same_page is
true, but the page couldn't be merged, hence it returned ret = false.
With that thought, we fixed this in the caller.

But I agree with you that with ret = false, there is no meaning in
same_page being true. OK, so let linux-block comment on whether the
above also looks good. If yes, I can spin out an official patch with
all the details.

-ritesh


Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend().

2020-08-20 Thread Ritesh Harjani

Hello Dave,

Thanks for reviewing this.

On 8/21/20 4:41 AM, Dave Chinner wrote:

On Wed, Aug 19, 2020 at 03:58:41PM +0530, Anju T Sudhakar wrote:

From: Ritesh Harjani 

__bio_try_merge_page() may return same_page = 1 and merged = 0.
This could happen when bio->bi_iter.bi_size + len > UINT_MAX.


Ummm, silly question, but exactly how are we getting a bio that
large in ->writepages getting built? Even with 64kB pages, that's a
bio with 2^16 pages attached to it. We shouldn't be building single
bios in writeback that large - what storage hardware is allowing
such huge bios to be built? (i.e. can you dump all the values in
/sys/block//queue/* for that device for us?)


Please correct me here, but as I see it, the bio has only these two limits
which it checks when adding a page to a bio. It doesn't check for the limits
in /sys/block//queue/*, no? I guess then they could be checked
by the block layer below, before submitting the bio?

113 static inline bool bio_full(struct bio *bio, unsigned len)
114 {
115 if (bio->bi_vcnt >= bio->bi_max_vecs)
116 return true;
117
118 if (bio->bi_iter.bi_size > UINT_MAX - len)
119 return true;
120
121 return false;
122 }


This issue was first observed while running fio on a system with
huge memory. But here is an easy way we figured out to trigger the
issue almost every time with a loop device on my VM setup. I have provided
all the details on this below.


===
echo  > /proc/sys/vm/dirtytime_expire_seconds
echo  > /proc/sys/vm/dirty_expire_centisecs
echo 90  > /proc/sys/vm/dirty_ratio
echo 90  > /proc/sys/vm/dirty_background_ratio
echo 0  > /proc/sys/vm/dirty_writeback_centisecs

sudo perf probe -s ~/host_shared/src/linux/ -a '__bio_try_merge_page:10 
bio page page->index bio->bi_iter.bi_size len same_page[0]'


sudo perf record -e probe:__bio_try_merge_page_L10 -a --filter 'bi_size 
> 0xff00' sudo fio --rw=write --bs=1M --numjobs=1 
--name=/mnt/testfile --size=24G --ioengine=libaio



# on running this a 2nd time, it gets hit every time on my setup

sudo perf record -e probe:__bio_try_merge_page_L10 -a --filter 'bi_size 
> 0xff00' sudo fio --rw=write --bs=1M --numjobs=1 
--name=/mnt/testfile --size=24G --ioengine=libaio



Perf o/p from above filter causing overflow
===
<...>
 fio 25194 [029] 70471.559084: 
probe:__bio_try_merge_page_L10: (c0aa054c) 
bio=0xc013d49a4b80 page=0xc00c04029d80 index=0x10a9d 
bi_size=0x8000 len=0x1000 same_page=0x1
 fio 25194 [029] 70471.559087: 
probe:__bio_try_merge_page_L10: (c0aa054c) 
bio=0xc013d49a4b80 page=0xc00c04029d80 index=0x10a9d 
bi_size=0x9000 len=0x1000 same_page=0x1
 fio 25194 [029] 70471.559090: 
probe:__bio_try_merge_page_L10: (c0aa054c) 
bio=0xc013d49a4b80 page=0xc00c04029d80 index=0x10a9d 
bi_size=0xa000 len=0x1000 same_page=0x1
 fio 25194 [029] 70471.559093: 
probe:__bio_try_merge_page_L10: (c0aa054c) 
bio=0xc013d49a4b80 page=0xc00c04029d80 index=0x10a9d 
bi_size=0xb000 len=0x1000 same_page=0x1
 fio 25194 [029] 70471.559095: 
probe:__bio_try_merge_page_L10: (c0aa054c) 
bio=0xc013d49a4b80 page=0xc00c04029d80 index=0x10a9d 
bi_size=0xc000 len=0x1000 same_page=0x1
 fio 25194 [029] 70471.559098: 
probe:__bio_try_merge_page_L10: (c0aa054c) 
bio=0xc013d49a4b80 page=0xc00c04029d80 index=0x10a9d 
bi_size=0xd000 len=0x1000 same_page=0x1
 fio 25194 [029] 70471.559101: 
probe:__bio_try_merge_page_L10: (c0aa054c) 
bio=0xc013d49a4b80 page=0xc00c04029d80 index=0x10a9d 
bi_size=0xe000 len=0x1000 same_page=0x1
 fio 25194 [029] 70471.559104: 
probe:__bio_try_merge_page_L10: (c0aa054c) 
bio=0xc013d49a4b80 page=0xc00c04029d80 index=0x10a9d 
bi_size=0xf000 len=0x1000 same_page=0x1


^^ (this could cause an overflow)

loop dev
=
NAME   SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILEDIO LOG-SEC
/dev/loop1 0  0 0  0 /mnt1/filefs   0 512


mount o/p
=
/dev/loop1 on /mnt type xfs 
(rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)



/sys/block//queue/*


setup:/run/perf$ cat /sys/block/loop1/queue/max_segments
128
setup:/run/perf$ cat /sys/block/loop1/queue/max_segment_size
65536
setup:/run/perf$ cat /sys/block/loop1/queue/max_hw_sectors_kb
1280
setup:/run/perf$ cat /sys/block/loop1/queue/logical_block_size
512
setup:/run/perf$ cat /sys/block/loop1/queue/max_sectors_kb
1280
setup:/run/perf$ cat /sys/block/loop1/queue/hw_sector_size
512
setup:/run/perf$ cat /sys/block/loop1/queue/discard_max_bytes
4294966784
setup:/run/perf$ cat /sys/block/loop1/queue/discard_max_hw_bytes
4294966784
setup:/run/perf$ cat /sys/block/loop1/queue/discard_

Re: [RFC 1/1] ext4: Optimize ext4 DAX overwrites

2020-08-20 Thread Ritesh Harjani




On 8/20/20 6:23 PM, Jan Kara wrote:

On Thu 20-08-20 17:06:28, Ritesh Harjani wrote:

Currently, in the case of DAX, we are starting a transaction
every time for the IOMAP_WRITE case. This can be optimized
away in case of an overwrite (where the blocks were already
allocated). This could give a significant performance boost
for multi-threaded random writes.

Reported-by: Dan Williams 
Signed-off-by: Ritesh Harjani 


Thanks for returning to this and I'm glad to see how much this helped :)
BTW, I'd suspect there could be also significant contention and cache line
bouncing on j_state_lock and transaction's atomic counters...


ok, will try and profile to see if this happens.





---
  fs/ext4/ext4.h  | 1 +
  fs/ext4/file.c  | 2 +-
  fs/ext4/inode.c | 8 +++-
  3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 42f5060f3cdf..9a2138afc751 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3232,6 +3232,7 @@ extern const struct dentry_operations ext4_dentry_ops;
  extern const struct inode_operations ext4_file_inode_operations;
  extern const struct file_operations ext4_file_operations;
  extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
+extern bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len);
  
  /* inline.c */

  extern int ext4_get_max_inline_size(struct inode *inode);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2a01e31a032c..51cd92ac1758 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -188,7 +188,7 @@ ext4_extending_io(struct inode *inode, loff_t offset, 
size_t len)
  }
  
  /* Is IO overwriting allocated and initialized blocks? */

-static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
+bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
  {
struct ext4_map_blocks map;
unsigned int blkbits = inode->i_blkbits;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 10dd470876b3..f0ac0ee9e991 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3423,6 +3423,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t 
offset, loff_t length,
int ret;
struct ext4_map_blocks map;
u8 blkbits = inode->i_blkbits;
+   bool overwrite = false;
  
  	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)

return -EINVAL;
@@ -3430,6 +3431,9 @@ static int ext4_iomap_begin(struct inode *inode, loff_t 
offset, loff_t length,
if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
return -ERANGE;
  
+	if (IS_DAX(inode) && (flags & IOMAP_WRITE) &&

+   ext4_overwrite_io(inode, offset, length))
+   overwrite = true;


So the patch looks correct but using ext4_overwrite_io() seems a bit
foolish since under the hood it does ext4_map_blocks() only to be able to
decide whether to call ext4_map_blocks() once again with exactly the same
arguments :). So I'd rather slightly refactor the code in
ext4_iomap_begin() to avoid this double calling of ext4_map_blocks() for
the fast path.


Yes, agreed. Looking at the numbers I was excited to post out the RFC
for discussion. Will make above changes and post. :)
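Roughly what I have in mind for ext4_iomap_begin() (only a sketch for
discussion, untested; the actual patch may end up looking different and
would still need to be careful about extending writes):

	if (flags & IOMAP_WRITE) {
		/*
		 * Overwrite fast path: look the range up once and only
		 * fall back to the allocation path (journal start +
		 * ext4_iomap_alloc()) when it is not already mapped.
		 */
		ret = ext4_map_blocks(NULL, inode, &map, 0);
		if (!(ret > 0 && (map.m_flags & EXT4_MAP_MAPPED)))
			ret = ext4_iomap_alloc(inode, &map, flags);
	} else {
		ret = ext4_map_blocks(NULL, inode, &map, 0);
	}

That way the overwrite fast path avoids calling ext4_map_blocks() twice.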


With DIO, we need to detect the overwrite case early in
ext4_dio_write_iter() to determine whether we need shared or exclusive
locks - so for the DIO case we probably still need the overwrite check in
ext4_dio_write_iter().

Thanks for review!!
-ritesh




Honza


/*
 * Calculate the first and last logical blocks respectively.
 */
@@ -3437,13 +3441,15 @@ static int ext4_iomap_begin(struct inode *inode, loff_t 
offset, loff_t length,
map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
  
-	if (flags & IOMAP_WRITE)

+   if ((flags & IOMAP_WRITE) && !overwrite)
ret = ext4_iomap_alloc(inode, &map, flags);
else
ret = ext4_map_blocks(NULL, inode, &map, 0);
  
  	if (ret < 0)

return ret;
+   if (IS_DAX(inode) && overwrite)
+   WARN_ON(!(map.m_flags & EXT4_MAP_MAPPED));
  
  	ext4_set_iomap(inode, iomap, &map, offset, length);
  
--

2.25.4



[RFC 0/1] Optimize ext4 DAX overwrites

2020-08-20 Thread Ritesh Harjani
ice with the
patched_kernel v/s vanilla_kernel with same fio workload.

perf report from vanilla_kernel (this is not seen with patched kernel) (ppc64)
===

  47.86%  fio  [kernel.vmlinux][k] do_raw_spin_lock
 |
 ---do_raw_spin_lock
|
|--19.43%--_raw_spin_lock
|  |
|   --19.31%--0
| |
| |--9.77%--deactivate_slab.isra.61
| |  ___slab_alloc
| |  __slab_alloc
| |  kmem_cache_alloc
| |  jbd2__journal_start
| |  __ext4_journal_start_sb
<...>

2. Kept this as RFC, since maybe using the ext4_iomap_overwrite_ops
will be better here. We could check for overwrite in ext4_dax_write_iter(),
like how we do it for DIO writes (a rough sketch of that alternative
follows the links below). Thoughts?

3. This problem was reported by Dan Williams at [1]

Links
==
[1]: https://lore.kernel.org/linux-ext4/20190802144304.gp25...@quack2.suse.cz/T/
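
For (2) above, the alternative would roughly look like this inside
ext4_dax_write_iter() (sketch only, mirroring how the DIO path picks
ext4_iomap_overwrite_ops when ext4_overwrite_io() says the range is
already allocated):

	const struct iomap_ops *iomap_ops = &ext4_iomap_ops;

	if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from)))
		/* overwrite: no block allocation needed, skip the txn */
		iomap_ops = &ext4_iomap_overwrite_ops;

	ret = dax_iomap_rw(iocb, from, iomap_ops);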

Ritesh Harjani (1):
  ext4: Optimize ext4 DAX overwrites

 fs/ext4/ext4.h  | 1 +
 fs/ext4/file.c  | 2 +-
 fs/ext4/inode.c | 8 +++-
 3 files changed, 9 insertions(+), 2 deletions(-)

-- 
2.25.4



[RFC 1/1] ext4: Optimize ext4 DAX overwrites

2020-08-20 Thread Ritesh Harjani
Currently in case of DAX, we are starting a transaction
every time for the IOMAP_WRITE case. This can be optimized
away in case of an overwrite (where the blocks were already
allocated). This could give a significant performance boost
for multi-threaded random writes.

Reported-by: Dan Williams 
Signed-off-by: Ritesh Harjani 
---
 fs/ext4/ext4.h  | 1 +
 fs/ext4/file.c  | 2 +-
 fs/ext4/inode.c | 8 +++-
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 42f5060f3cdf..9a2138afc751 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3232,6 +3232,7 @@ extern const struct dentry_operations ext4_dentry_ops;
 extern const struct inode_operations ext4_file_inode_operations;
 extern const struct file_operations ext4_file_operations;
 extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
+extern bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len);
 
 /* inline.c */
 extern int ext4_get_max_inline_size(struct inode *inode);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2a01e31a032c..51cd92ac1758 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -188,7 +188,7 @@ ext4_extending_io(struct inode *inode, loff_t offset, 
size_t len)
 }
 
 /* Is IO overwriting allocated and initialized blocks? */
-static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
+bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
 {
struct ext4_map_blocks map;
unsigned int blkbits = inode->i_blkbits;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 10dd470876b3..f0ac0ee9e991 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3423,6 +3423,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t 
offset, loff_t length,
int ret;
struct ext4_map_blocks map;
u8 blkbits = inode->i_blkbits;
+   bool overwrite = false;
 
if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
return -EINVAL;
@@ -3430,6 +3431,9 @@ static int ext4_iomap_begin(struct inode *inode, loff_t 
offset, loff_t length,
if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
return -ERANGE;
 
+   if (IS_DAX(inode) && (flags & IOMAP_WRITE) &&
+   ext4_overwrite_io(inode, offset, length))
+   overwrite = true;
/*
 * Calculate the first and last logical blocks respectively.
 */
@@ -3437,13 +3441,15 @@ static int ext4_iomap_begin(struct inode *inode, loff_t 
offset, loff_t length,
map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
 
-   if (flags & IOMAP_WRITE)
+   if ((flags & IOMAP_WRITE) && !overwrite)
ret = ext4_iomap_alloc(inode, &map, flags);
else
ret = ext4_map_blocks(NULL, inode, &map, 0);
 
if (ret < 0)
return ret;
+   if (IS_DAX(inode) && overwrite)
+   WARN_ON(!(map.m_flags & EXT4_MAP_MAPPED));
 
	ext4_set_iomap(inode, iomap, &map, offset, length);
 
-- 
2.25.4



Re: [PATCH -next] ext2: remove duplicate include

2020-08-18 Thread Ritesh Harjani




On 8/19/20 8:24 AM, Wang Hai wrote:

Remove linux/fiemap.h which is included more than once

Reported-by: Hulk Robot 
Signed-off-by: Wang Hai 



LGTM, please feel free to add,
Reviewed-by: Ritesh Harjani 


---
  fs/ext2/inode.c | 1 -
  1 file changed, 1 deletion(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 80662e1f7889..de6b97612410 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -36,7 +36,6 @@
  #include 
  #include 
  #include 
-#include 
  #include "ext2.h"
  #include "acl.h"
  #include "xattr.h"



[PATCHv2 2/2] tasks: Add headers and improve spacing format

2020-08-18 Thread Ritesh Harjani
With the patch.

  TASK  PIDCOMM
0x82c2b8c0   0   swapper/0
0x888a0ba20040   1   systemd
0x888a0ba24040   2   kthreadd
0x888a0ba28040   3   rcu_gp

w/o
0x82c2b8c0  0 swapper/0
0x888a0ba20040 1 systemd
0x888a0ba24040 2 kthreadd
0x888a0ba28040 3 rcu_gp

Signed-off-by: Ritesh Harjani 
---
 scripts/gdb/linux/tasks.py | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/scripts/gdb/linux/tasks.py b/scripts/gdb/linux/tasks.py
index 0301dc1e0138..17ec19e9b5bf 100644
--- a/scripts/gdb/linux/tasks.py
+++ b/scripts/gdb/linux/tasks.py
@@ -73,11 +73,12 @@ class LxPs(gdb.Command):
 super(LxPs, self).__init__("lx-ps", gdb.COMMAND_DATA)
 
 def invoke(self, arg, from_tty):
+gdb.write("{:>10} {:>12} {:>7}\n".format("TASK", "PID", "COMM"))
 for task in task_lists():
-gdb.write("{address} {pid} {comm}\n".format(
-address=task,
-pid=task["pid"],
-comm=task["comm"].string()))
+gdb.write("{} {:^5} {}\n".format(
+task.format_string().split()[0],
+task["pid"].format_string(),
+task["comm"].string()))
 
 
 LxPs()
-- 
2.25.4



[PATCHv2 1/2] proc: Add struct mount & struct super_block addr in lx-mounts command

2020-08-18 Thread Ritesh Harjani
This is often found useful while debugging FS-related issues.


  mount  super_block devname pathname fstype options
0x888a0bfa4b40 0x888a0bfc1000 none / rootfs rw 0 0
0x888a033f75c0 0x8889fcf65000 /dev/root / ext4 rw,relatime 0 0
0x8889fc8ce040 0x888a0bb51000 devtmpfs /dev devtmpfs rw,relatime 0 0

Signed-off-by: Ritesh Harjani 
---
 scripts/gdb/linux/proc.py | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/scripts/gdb/linux/proc.py b/scripts/gdb/linux/proc.py
index 6a56bba233a9..09cd871925a5 100644
--- a/scripts/gdb/linux/proc.py
+++ b/scripts/gdb/linux/proc.py
@@ -167,6 +167,9 @@ values of that process namespace"""
 if not namespace:
 raise gdb.GdbError("No namespace for current process")
 
+gdb.write("{:^18} {:^15} {:>9} {} {} options\n".format(
+  "mount", "super_block", "devname", "pathname", "fstype"))
+
 for vfs in lists.list_for_each_entry(namespace['list'],
  mount_ptr_type, "mnt_list"):
 devname = vfs['mnt_devname'].string()
@@ -190,14 +193,10 @@ values of that process namespace"""
 m_flags = int(vfs['mnt']['mnt_flags'])
 rd = "ro" if (s_flags & constants.LX_SB_RDONLY) else "rw"
 
-gdb.write(
-"{} {} {} {}{}{} 0 0\n"
-.format(devname,
-pathname,
-fstype,
-rd,
-info_opts(FS_INFO, s_flags),
-info_opts(MNT_INFO, m_flags)))
+gdb.write("{} {} {} {} {} {}{}{} 0 0\n".format(
+  vfs.format_string(), superblock.format_string(), devname,
+  pathname, fstype, rd, info_opts(FS_INFO, s_flags),
+  info_opts(MNT_INFO, m_flags)))
 
 
 LxMounts()
-- 
2.25.4



[PATCHv2 0/2] scripts:gdb: Add few structs info in gdb scripts

2020-08-18 Thread Ritesh Harjani
I was using these structs info internally in my gdb scripts.
So sending it out for merge to upstream.
Patch-2 is just adding some headers and improves spacing.

-- 
2.25.4



Re: [PATCH 2/2] tasks: Add task_struct addr for lx-ps cmd

2020-08-18 Thread Ritesh Harjani




On 8/18/20 11:10 AM, Jan Kiszka wrote:

On 18.08.20 06:04, Ritesh Harjani wrote:

task_struct addr in lx-ps cmd seems helpful


   TASK  PIDCOMM
0x82c2b8c0   0   swapper/0
0x888a0ba20040   1   systemd
0x888a0ba24040   2   kthreadd
0x888a0ba28040   3   rcu_gp

Signed-off-by: Ritesh Harjani 
---
  scripts/gdb/linux/tasks.py | 9 +
  1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/scripts/gdb/linux/tasks.py b/scripts/gdb/linux/tasks.py
index 0301dc1e0138..17ec19e9b5bf 100644
--- a/scripts/gdb/linux/tasks.py
+++ b/scripts/gdb/linux/tasks.py
@@ -73,11 +73,12 @@ class LxPs(gdb.Command):
  super(LxPs, self).__init__("lx-ps", gdb.COMMAND_DATA)
  
  def invoke(self, arg, from_tty):

+gdb.write("{:>10} {:>12} {:>7}\n".format("TASK", "PID", "COMM"))
  for task in task_lists():
-gdb.write("{address} {pid} {comm}\n".format(
-address=task,
-pid=task["pid"],
-comm=task["comm"].string()))
+gdb.write("{} {:^5} {}\n".format(
+task.format_string().split()[0],
+task["pid"].format_string(),
+task["comm"].string()))
  
  
  LxPs()




This patch is confusing me. We already dump the task address. What the
patch changes is adding a header and some conversions of the values. Can
you elaborate?


You are right. Sorry for the confusion. I will update the commit msg (in
v2) to reflect that this patch adds the header and formats the spacing.
Without the patch we get it like this:-

0x82c2b8c0  0 swapper/0
0x888a0ba20040 1 systemd
0x888a0ba24040 2 kthreadd
0x888a0ba28040 3 rcu_gp

-ritesh


Re: [PATCH 1/2] proc: Add struct mount & struct super_block addr in lx-mounts command

2020-08-18 Thread Ritesh Harjani




On 8/18/20 11:07 AM, Jan Kiszka wrote:

On 18.08.20 06:04, Ritesh Harjani wrote:

This is often found useful while debugging FS-related issues.


   mount  super_block  fstype devname pathname options
0x888a0bfa4b40 0x888a0bfc1000 rootfs none / rw   0 0
0x888a02c065c0 0x8889fcf65000 ext4 /dev/root / rw  ,relatime 0 0
0x8889fc8cc040 0x888a0bb51000 devtmpfs devtmpfs /dev rw  ,relatime 0 0

Signed-off-by: Ritesh Harjani 
---
  scripts/gdb/linux/proc.py | 15 +++
  1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/scripts/gdb/linux/proc.py b/scripts/gdb/linux/proc.py
index 6a56bba233a9..c16fab981bdd 100644
--- a/scripts/gdb/linux/proc.py
+++ b/scripts/gdb/linux/proc.py
@@ -167,6 +167,9 @@ values of that process namespace"""
  if not namespace:
  raise gdb.GdbError("No namespace for current process")
  
+gdb.write("{:^18} {:^15} {:>9} {} {} options\n".format(

+  "mount", "super_block", "fstype", "devname", "pathname"))
+
  for vfs in lists.list_for_each_entry(namespace['list'],
   mount_ptr_type, "mnt_list"):
  devname = vfs['mnt_devname'].string()
@@ -190,14 +193,10 @@ values of that process namespace"""
  m_flags = int(vfs['mnt']['mnt_flags'])
  rd = "ro" if (s_flags & constants.LX_SB_RDONLY) else "rw"
  
-gdb.write(

-"{} {} {} {}{}{} 0 0\n"
-.format(devname,
-pathname,
-fstype,
-rd,
-info_opts(FS_INFO, s_flags),
-info_opts(MNT_INFO, m_flags)))
+gdb.write("{} {} {} {} {} {} {} {} 0 0\n".format(
+  vfs.format_string(), superblock.format_string(), fstype,
+  devname, pathname, rd, info_opts(FS_INFO, s_flags),
+  info_opts(MNT_INFO, m_flags)))


The last three format elements should not be space-separated. The effect
can even be seen in your example above.


yes, agreed. Will fix it in next version.

-ritesh


[PATCH 2/2] tasks: Add task_struct addr for lx-ps cmd

2020-08-17 Thread Ritesh Harjani
task_struct addr in lx-ps cmd seems helpful


  TASK  PIDCOMM
0x82c2b8c0   0   swapper/0
0x888a0ba20040   1   systemd
0x888a0ba24040   2   kthreadd
0x888a0ba28040   3   rcu_gp

Signed-off-by: Ritesh Harjani 
---
 scripts/gdb/linux/tasks.py | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/scripts/gdb/linux/tasks.py b/scripts/gdb/linux/tasks.py
index 0301dc1e0138..17ec19e9b5bf 100644
--- a/scripts/gdb/linux/tasks.py
+++ b/scripts/gdb/linux/tasks.py
@@ -73,11 +73,12 @@ class LxPs(gdb.Command):
 super(LxPs, self).__init__("lx-ps", gdb.COMMAND_DATA)
 
 def invoke(self, arg, from_tty):
+gdb.write("{:>10} {:>12} {:>7}\n".format("TASK", "PID", "COMM"))
 for task in task_lists():
-gdb.write("{address} {pid} {comm}\n".format(
-address=task,
-pid=task["pid"],
-comm=task["comm"].string()))
+gdb.write("{} {:^5} {}\n".format(
+task.format_string().split()[0],
+task["pid"].format_string(),
+task["comm"].string()))
 
 
 LxPs()
-- 
2.25.4



[PATCH 0/2] scripts:gdb: Add few structs in gdb scripts

2020-08-17 Thread Ritesh Harjani
I was using these structs info internally in my gdb scripts.
So sending it out for merge to upstream.

Ritesh Harjani (2):
  proc: Add struct mount & struct super_block addr in lx-mounts command
  tasks: Add task_struct addr for lx-ps cmd

 scripts/gdb/linux/lists.py | 43 ++
 scripts/gdb/linux/proc.py  | 15 +++--
 scripts/gdb/linux/tasks.py |  9 
 3 files changed, 12 insertions(+), 12 deletions(-)

-- 
2.25.4



[PATCH 1/2] proc: Add struct mount & struct super_block addr in lx-mounts command

2020-08-17 Thread Ritesh Harjani
This is often found useful while debugging FS-related issues.


  mount  super_block  fstype devname pathname options
0x888a0bfa4b40 0x888a0bfc1000 rootfs none / rw   0 0
0x888a02c065c0 0x8889fcf65000 ext4 /dev/root / rw  ,relatime 0 0
0x8889fc8cc040 0x888a0bb51000 devtmpfs devtmpfs /dev rw  ,relatime 0 0

Signed-off-by: Ritesh Harjani 
---
 scripts/gdb/linux/proc.py | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/scripts/gdb/linux/proc.py b/scripts/gdb/linux/proc.py
index 6a56bba233a9..c16fab981bdd 100644
--- a/scripts/gdb/linux/proc.py
+++ b/scripts/gdb/linux/proc.py
@@ -167,6 +167,9 @@ values of that process namespace"""
 if not namespace:
 raise gdb.GdbError("No namespace for current process")
 
+gdb.write("{:^18} {:^15} {:>9} {} {} options\n".format(
+  "mount", "super_block", "fstype", "devname", "pathname"))
+
 for vfs in lists.list_for_each_entry(namespace['list'],
  mount_ptr_type, "mnt_list"):
 devname = vfs['mnt_devname'].string()
@@ -190,14 +193,10 @@ values of that process namespace"""
 m_flags = int(vfs['mnt']['mnt_flags'])
 rd = "ro" if (s_flags & constants.LX_SB_RDONLY) else "rw"
 
-gdb.write(
-"{} {} {} {}{}{} 0 0\n"
-.format(devname,
-pathname,
-fstype,
-rd,
-info_opts(FS_INFO, s_flags),
-info_opts(MNT_INFO, m_flags)))
+gdb.write("{} {} {} {} {} {} {} {} 0 0\n".format(
+  vfs.format_string(), superblock.format_string(), fstype,
+  devname, pathname, rd, info_opts(FS_INFO, s_flags),
+  info_opts(MNT_INFO, m_flags)))
 
 
 LxMounts()
-- 
2.25.4



Re: [PATCH] mballoc: Replace seq_printf with seq_puts

2020-08-10 Thread Ritesh Harjani




On 8/10/20 7:51 AM, Xu Wang wrote:

seq_puts is a lot cheaper than seq_printf, so use that to print
literal strings.

Signed-off-by: Xu Wang 


Looks good to me.
Reviewed-by: Ritesh Harjani 
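
The change pattern being reviewed is simply of this shape (illustrative
example only, not an actual hunk from the patch):

	/* before: format-string machinery for a plain literal */
	seq_printf(seq, "mballoc:\n");

	/* after: plain string copy, no format parsing */
	seq_puts(seq, "mballoc:\n");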


Re: [RFC 1/1] pmem: Add cond_resched() in bio_for_each_segment loop in pmem_make_request

2020-08-07 Thread Ritesh Harjani




On 8/4/20 3:21 AM, Dave Chinner wrote:

On Mon, Aug 03, 2020 at 12:44:04PM +0530, Ritesh Harjani wrote:



On 8/3/20 4:31 AM, Dave Chinner wrote:

On Wed, Jul 29, 2020 at 02:15:18PM +0530, Ritesh Harjani wrote:

For systems which do not have CONFIG_PREEMPT set and
if there is a heavy multi-threaded load/store operation happening
on pmem + sometimes along with device latencies, softlockup warnings like
this could trigger. This was seen on Power where pagesize is 64K.

To avoid softlockup, this patch adds a cond_resched() in this path.

<...>
watchdog: BUG: soft lockup - CPU#31 stuck for 22s!
<...>
CPU: 31 PID: 15627 <..> 5.3.18-20
<...>
NIP memcpy_power7+0x43c/0x7e0
LR memcpy_flushcache+0x28/0xa0

Call Trace:
memcpy_power7+0x274/0x7e0 (unreliable)
memcpy_flushcache+0x28/0xa0
write_pmem+0xa0/0x100 [nd_pmem]
pmem_do_bvec+0x1f0/0x420 [nd_pmem]
pmem_make_request+0x14c/0x370 [nd_pmem]
generic_make_request+0x164/0x400
submit_bio+0x134/0x2e0
submit_bio_wait+0x70/0xc0
blkdev_issue_zeroout+0xf4/0x2a0
xfs_zero_extent+0x90/0xc0 [xfs]
xfs_bmapi_convert_unwritten+0x198/0x230 [xfs]
xfs_bmapi_write+0x284/0x630 [xfs]
xfs_iomap_write_direct+0x1f0/0x3e0 [xfs]
xfs_file_iomap_begin+0x344/0x690 [xfs]
dax_iomap_pmd_fault+0x488/0xc10
__xfs_filemap_fault+0x26c/0x2b0 [xfs]
__handle_mm_fault+0x794/0x1af0
handle_mm_fault+0x12c/0x220
__do_page_fault+0x290/0xe40
do_page_fault+0x38/0xc0
handle_page_fault+0x10/0x30

Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Ritesh Harjani 
---
   drivers/nvdimm/pmem.c | 1 +
   1 file changed, 1 insertion(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 2df6994acf83..fcf7af13897e 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -214,6 +214,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, 
struct bio *bio)
bio->bi_status = rc;
break;
}
+   cond_resched();


There are already cond_resched() calls between submitted bios in
blkdev_issue_zeroout() via both __blkdev_issue_zero_pages() and
__blkdev_issue_write_zeroes(), so I'm kinda wondering where the
problem is coming from here.


This problem is coming from that bio call- submit_bio()



Just how big is the bio being issued here that it spins for 22s
trying to copy it?


It's 256 (due to BIO_MAX_PAGES) * 64KB (pagesize) = 16MB.
So this is definitely not an easy trigger; as per the tester, it was
mainly seen on a VM.


Right, so a memcpy() of 16MB of data in a single bio is taking >22s?

If so, then you can't solve this problem with cond_resched() - if
something that should only take half a millisecond to run is taking
22s of CPU time, there are bigger problems occurring that need
fixing.

i.e. if someone is running a workload that has effectively
livelocked the hardware cache coherency protocol across the entire
machine, then changing kernel code that requires memory bandwidth to
be available is not going to fix the problem. All it does is shoot
the messenger that tells you there is something going wrong.


Thanks Dave. Appreciate your inputs in this area.
Yes, agreed on the fact that we should not shoot the messenger itself.
Will look more into this.

-ritesh


Re: [RFC 1/1] pmem: Add cond_resched() in bio_for_each_segment loop in pmem_make_request

2020-08-03 Thread Ritesh Harjani




On 8/3/20 4:31 AM, Dave Chinner wrote:

On Wed, Jul 29, 2020 at 02:15:18PM +0530, Ritesh Harjani wrote:

For systems which do not have CONFIG_PREEMPT set and
if there is a heavy multi-threaded load/store operation happening
on pmem + sometimes along with device latencies, softlockup warnings like
this could trigger. This was seen on Power where pagesize is 64K.

To avoid softlockup, this patch adds a cond_resched() in this path.

<...>
watchdog: BUG: soft lockup - CPU#31 stuck for 22s!
<...>
CPU: 31 PID: 15627 <..> 5.3.18-20
<...>
NIP memcpy_power7+0x43c/0x7e0
LR memcpy_flushcache+0x28/0xa0

Call Trace:
memcpy_power7+0x274/0x7e0 (unreliable)
memcpy_flushcache+0x28/0xa0
write_pmem+0xa0/0x100 [nd_pmem]
pmem_do_bvec+0x1f0/0x420 [nd_pmem]
pmem_make_request+0x14c/0x370 [nd_pmem]
generic_make_request+0x164/0x400
submit_bio+0x134/0x2e0
submit_bio_wait+0x70/0xc0
blkdev_issue_zeroout+0xf4/0x2a0
xfs_zero_extent+0x90/0xc0 [xfs]
xfs_bmapi_convert_unwritten+0x198/0x230 [xfs]
xfs_bmapi_write+0x284/0x630 [xfs]
xfs_iomap_write_direct+0x1f0/0x3e0 [xfs]
xfs_file_iomap_begin+0x344/0x690 [xfs]
dax_iomap_pmd_fault+0x488/0xc10
__xfs_filemap_fault+0x26c/0x2b0 [xfs]
__handle_mm_fault+0x794/0x1af0
handle_mm_fault+0x12c/0x220
__do_page_fault+0x290/0xe40
do_page_fault+0x38/0xc0
handle_page_fault+0x10/0x30

Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Ritesh Harjani 
---
  drivers/nvdimm/pmem.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 2df6994acf83..fcf7af13897e 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -214,6 +214,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, 
struct bio *bio)
bio->bi_status = rc;
break;
}
+   cond_resched();


There are already cond_resched() calls between submitted bios in
blkdev_issue_zeroout() via both __blkdev_issue_zero_pages() and
__blkdev_issue_write_zeroes(), so I'm kinda wondering where the
problem is coming from here.


This problem is coming from that bio call- submit_bio()



Just how big is the bio being issued here that it spins for 22s
trying to copy it?


It's 256 (due to BIO_MAX_PAGES) * 64KB (pagesize) = 16MB.
So this is definitely not an easy trigger; as per the tester, it was
mainly seen on a VM.

Looking at the cond_resched() inside dax_writeback_mapping_range()
in xas_for_each_marked() loop, I thought it should be good to have a
cond_resched() in the above path as well.

Hence an RFC for discussion.



And, really, if the system is that bound on cacheline bouncing that
it prevents memcpy() from making progress, I think we probably
should be issuing a soft lockup warning like this...
Cheers,

Dave.



[RFC 1/1] pmem: Add cond_resched() in bio_for_each_segment loop in pmem_make_request

2020-07-29 Thread Ritesh Harjani
For systems which do not have CONFIG_PREEMPT set and
if there is a heavy multi-threaded load/store operation happening
on pmem + sometimes along with device latencies, softlockup warnings like
this could trigger. This was seen on Power where pagesize is 64K.

To avoid softlockup, this patch adds a cond_resched() in this path.

<...>
watchdog: BUG: soft lockup - CPU#31 stuck for 22s!
<...>
CPU: 31 PID: 15627 <..> 5.3.18-20
<...>
NIP memcpy_power7+0x43c/0x7e0
LR memcpy_flushcache+0x28/0xa0

Call Trace:
memcpy_power7+0x274/0x7e0 (unreliable)
memcpy_flushcache+0x28/0xa0
write_pmem+0xa0/0x100 [nd_pmem]
pmem_do_bvec+0x1f0/0x420 [nd_pmem]
pmem_make_request+0x14c/0x370 [nd_pmem]
generic_make_request+0x164/0x400
submit_bio+0x134/0x2e0
submit_bio_wait+0x70/0xc0
blkdev_issue_zeroout+0xf4/0x2a0
xfs_zero_extent+0x90/0xc0 [xfs]
xfs_bmapi_convert_unwritten+0x198/0x230 [xfs]
xfs_bmapi_write+0x284/0x630 [xfs]
xfs_iomap_write_direct+0x1f0/0x3e0 [xfs]
xfs_file_iomap_begin+0x344/0x690 [xfs]
dax_iomap_pmd_fault+0x488/0xc10
__xfs_filemap_fault+0x26c/0x2b0 [xfs]
__handle_mm_fault+0x794/0x1af0
handle_mm_fault+0x12c/0x220
__do_page_fault+0x290/0xe40
do_page_fault+0x38/0xc0
handle_page_fault+0x10/0x30

Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Ritesh Harjani 
---
 drivers/nvdimm/pmem.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 2df6994acf83..fcf7af13897e 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -214,6 +214,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, 
struct bio *bio)
bio->bi_status = rc;
break;
}
+   cond_resched();
}
if (do_acct)
nd_iostat_end(bio, start);
-- 
2.25.4



Re: ext4: delete the invalid BUGON in ext4_mb_load_buddy_gfp()

2020-07-26 Thread Ritesh Harjani




On 7/27/20 7:24 AM, brookxu wrote:

Delete the invalid BUGON in ext4_mb_load_buddy_gfp(), the previous
code has already judged whether page is NULL.

Signed-off-by: Chunguang Xu 


Thanks for the patch. LGTM. Feel free to add.
Reviewed-by: Ritesh Harjani 


---
  fs/ext4/mballoc.c | 3 ---
  1 file changed, 3 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 28a139f..9b1c3ad 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1279,9 +1279,6 @@ int ext4_mb_init_group(struct super_block *sb, 
ext4_group_t group, gfp_t gfp)
  e4b->bd_buddy_page = page;
  e4b->bd_buddy = page_address(page) + (poff * sb->s_blocksize);
  
-    BUG_ON(e4b->bd_bitmap_page == NULL);

-    BUG_ON(e4b->bd_buddy_page == NULL);
-
  return 0;
  
  err:




Re: [ext4] d3b6f23f71: stress-ng.fiemap.ops_per_sec -60.5% regression

2020-07-15 Thread Ritesh Harjani

Hello Xing,

On 4/7/20 1:30 PM, kernel test robot wrote:

Greeting,

FYI, we noticed a -60.5% regression of stress-ng.fiemap.ops_per_sec due to 
commit:


commit: d3b6f23f71670007817a5d59f3fbafab2b794e8c ("ext4: move ext4_fiemap to use 
iomap framework")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: stress-ng
on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 192G 
memory
with following parameters:

nr_threads: 10%
disk: 1HDD
testtime: 1s
class: os
cpufreq_governor: performance
ucode: 0x52c
fs: ext4


I started looking into this issue, but with my unit testing I didn't
find any perf issue with the fiemap ioctl call. I haven't yet explored
how stress-ng takes its fiemap performance numbers; it could be doing
something differently. In my testing I just made sure to create a
file with a large number of extents and used the xfs_io -c "fiemap -v" cmd
to check how much time it takes to read all the entries in the 1st
and subsequent iterations.


Setup comprised of qemu machine on x86_64 with latest linux branch.

1. created a file of 10G using fallocate. (this allocated unwritten
extents for this file).

2. Then I punched a hole in every alternate block of the file. This step took
a long time, and after a sufficiently long time I had to cancel it.
for i in $(seq 1 2 x); do echo $i; fallocate -p -o $(($i*4096)) -l 
4096; done


3. Then issued fiemap call via xfs_io and took the time measurement.
time xfs_io -c "fiemap -v" bigfile > /dev/null


Perf numbers on latest default kernel build for above cmd.

1st iteration
==
real    0m31.684s
user    0m1.593s
sys     0m24.174s

2nd and subsequent iteration

real    0m3.379s
user    0m1.300s
sys     0m2.080s


4. Then I reverted all the iomap_fiemap patches and re-tested this.
With this the older ext4_fiemap implementation will be tested:-


1st iteration
==
real    0m31.591s
user    0m1.400s
sys     0m24.243s


2nd and subsequent iteration (had to cancel it since it was taking more 
time than 15m)


^C^C
real    15m49.884s
user    0m0.032s
sys     15m49.722s

I guess the reason why the 2nd iteration with the older implementation takes
so much time is that with the previous implementation we never cached
extent entries in the extent_status tree. Also, in the 1st iteration the page
cache may get filled with a lot of buffer_head entries, so maybe page
reclaims are taking more time.

With the latest implementation using iomap_fiemap(), the call to
query extent blocks is done using ext4_map_blocks(), which by default
will also cache the extent entries into the extent_status tree.
Hence during the 2nd iteration we will directly read the entries from
the extent_status tree and will not do any disk I/O.


-ritesh


Re: [PATCH] ext4: Delete unnecessary checks before brelse()

2020-07-08 Thread Ritesh Harjani




On 6/13/20 11:37 PM, Markus Elfring wrote:

From: Markus Elfring 
Date: Sat, 13 Jun 2020 19:12:24 +0200

The brelse() function tests whether its argument is NULL
and then returns immediately.
Thus remove the tests which are not needed around the shown calls.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 


Sure, LGTM. Feel free to add
Reviewed-by: Ritesh Harjani 



---
  fs/ext4/extents.c | 6 ++
  fs/ext4/xattr.c   | 3 +--
  2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 221f240eae60..315276d50aa8 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -693,10 +693,8 @@ void ext4_ext_drop_refs(struct ext4_ext_path *path)
return;
depth = path->p_depth;
for (i = 0; i <= depth; i++, path++) {
-   if (path->p_bh) {
-   brelse(path->p_bh);
-   path->p_bh = NULL;
-   }
+   brelse(path->p_bh);
+   path->p_bh = NULL;
}
  }

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 9b29a40738ac..eb997ce21be3 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1354,8 +1354,7 @@ static int ext4_xattr_inode_write(handle_t *handle, 
struct inode *ea_inode,

block = 0;
while (wsize < bufsize) {
-   if (bh != NULL)
-   brelse(bh);
+   brelse(bh);
csize = (bufsize - wsize) > blocksize ? blocksize :
bufsize - wsize;
bh = ext4_getblk(handle, ea_inode, block, 0);
--
2.27.0



Re: 5.8-rc1: new warnings in ext4_mb_new_blocks?

2020-06-15 Thread Ritesh Harjani




On 6/15/20 1:07 PM, Pavel Machek wrote:

Hi!

Booting 5.8-rc1 on x220, I get scary warnings:

[7.089941] EXT4-fs (sdb2): mounted filesystem with ordered data mode. Opts: 
errors=remount-ro
[7.343231] BUG: using smp_processor_id() in preemptible [] code: 
systemd-tmpfile/2788
[7.344052] caller is debug_smp_processor_id+0x17/0x20
[7.344897] CPU: 3 PID: 2788 Comm: systemd-tmpfile Not tainted 5.8.0-rc1+ 
#116
[7.345745] Hardware name: LENOVO 42872WU/42872WU, BIOS 8DET74WW (1.44 ) 
03/13/2018
[7.346561] Call Trace:
[7.347355]  dump_stack+0x60/0x7a
[7.348170]  check_preemption_disabled+0xb1/0xc0
[7.348951]  debug_smp_processor_id+0x17/0x20
[7.349728]  ext4_mb_new_blocks+0x1f2/0x13e0

and they continue after boot.



Below should fix this. I guess it couldn't make it to rc1.

https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=ext4-for-linus-5.8-rc1-2=811985365378df01386c3cfb7ff716e74ca376d5

-ritesh


Re: upstream test error: BUG: using smp_processor_id() in preemptible code in ext4_mb_new_blocks

2020-06-12 Thread Ritesh Harjani

I am seeing this all over the place on Linus's tree right now:

[  +0.008563] BUG: using smp_processor_id() in preemptible [] code: 
systemd/1
[  +0.11] caller is ext4_mb_new_blocks+0x2ac/0xc10
[  +0.02] CPU: 31 PID: 1 Comm: systemd Not tainted 
5.7.0-14371-g25ae6195a4c7 #66
[  +0.02] Hardware name: Micro-Star International Co., Ltd. MS-7C59/Creator 
TRX40 (MS-7C59), BIOS 1.50 05/13/2020
[  +0.01] Call Trace:
[  +0.08]  dump_stack+0x57/0x70
[  +0.04]  debug_smp_processor_id.cold+0x4e/0x53
[  +0.01]  ext4_mb_new_blocks+0x2ac/0xc10
[  +0.04]  ? ext4_find_extent+0x3e8/0x450
[  +0.02]  ext4_ext_map_blocks+0x9f6/0x1b10
[  +0.03]  ? ext4_mark_iloc_dirty+0x60f/0xa50
[  +0.03]  ? __ext4_journal_get_write_access+0x2d/0x70
[  +0.04]  ext4_map_blocks+0x119/0x5a0
[  +0.04]  ext4_getblk+0x66/0x1c0
[  +0.03]  ext4_bread+0x26/0xc0
[  +0.02]  ext4_append+0x49/0xe0
[  +0.02]  ext4_mkdir+0x233/0x450
[  +0.05]  vfs_mkdir+0x11d/0x1b0
[  +0.03]  do_mkdirat+0x92/0x130
[  +0.04]  do_syscall_64+0x43/0x80
[  +0.04]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.03] RIP: 0033:0x7fef3df7a72b
[  +0.01] Code: Bad RIP value.
[  +0.01] RSP: 002b:7ffdb4eba0c8 EFLAGS: 0246 ORIG_RAX: 
0053
[  +0.03] RAX: ffda RBX: 0001 RCX: 7fef3df7a72b
[  +0.01] RDX:  RSI: 01c0 RDI: 563b11cf26e0
[  +0.01] RBP: 563b11cf2731 R08: 000d R09: 0002
[  +0.00] R10: 11175e4331068ed5 R11: 0246 R12: 563b11cf26e0
[  +0.02] R13: 7fef3e019c20 R14: 7ffdb4eba0f0 R15: 8421084210842109

Just a constant stream of them.

There's a few other fun:

[  +0.453222] BUG: unable to handle page fault for address: b59cc2719000
[  +0.04] #PF: supervisor write access in kernel mode
[  +0.01] #PF: error_code(0x000b) - reserved bit violation

messages at times, but I don't think that's an ext4 issue, but rather
something in the sound stack...

EXT4 developers, any hints/patches to try?



Below patch should fix the ext4 issue coming from ext4_mb_new_blocks().

https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev=811985365378df01386c3cfb7ff716e74ca376d5

Not sure about the second one though.

-ritesh



Re: linux-next test error: BUG: using smp_processor_id() in preemptible [ADDR] code: syz-fuzzer/6792

2020-06-12 Thread Ritesh Harjani



On 6/12/20 6:13 PM, Ido Schimmel wrote:

On Tue, Jun 02, 2020 at 06:11:29PM +0530, Ritesh Harjani wrote:

#syz test:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
0e21d4620dd047da7952f44a2e1ac777ded2d57e



>From cc1cf67d99d5fa61db0651c89c288df31bad6b8e Mon Sep 17 00:00:00 2001
From: Ritesh Harjani 
Date: Tue, 2 Jun 2020 17:54:12 +0530
Subject: [PATCH 1/1] ext4: mballoc: Use raw_cpu_ptr in case if preemption is 
enabled

It doesn't really matter in ext4_mb_new_blocks() whether the code
is rescheduled on any other cpu due to preemption. Because we care
about discard_pa_seq only when the block allocation fails and then too
we add the seq counter of all the cpus against the initial sampled one
to check if anyone has freed any blocks while we were doing allocation.

So just use raw_cpu_ptr to not trigger this BUG.

BUG: using smp_processor_id() in preemptible [] code: syz-fuzzer/6927
caller is ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711
CPU: 1 PID: 6927 Comm: syz-fuzzer Not tainted 5.7.0-next-20200602-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x18f/0x20d lib/dump_stack.c:118
  check_preemption_disabled+0x20d/0x220 lib/smp_processor_id.c:48
  ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711
  ext4_ext_map_blocks+0x201b/0x33e0 fs/ext4/extents.c:4244
  ext4_map_blocks+0x4cb/0x1640 fs/ext4/inode.c:626
  ext4_getblk+0xad/0x520 fs/ext4/inode.c:833
  ext4_bread+0x7c/0x380 fs/ext4/inode.c:883
  ext4_append+0x153/0x360 fs/ext4/namei.c:67
  ext4_init_new_dir fs/ext4/namei.c:2757 [inline]
  ext4_mkdir+0x5e0/0xdf0 fs/ext4/namei.c:2802
  vfs_mkdir+0x419/0x690 fs/namei.c:3632
  do_mkdirat+0x21e/0x280 fs/namei.c:3655
  do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:359
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Ritesh Harjani 
Reported-by: syzbot+82f324bb69744c5f6...@syzkaller.appspotmail.com


Hi,

Are you going to submit this patch formally? Without it I'm constantly
seeing the above splat.



I see Ted has already taken v2 of this patch in his dev repo.
It should show up in the Linux tree soon.

https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev=811985365378df01386c3cfb7ff716e74ca376d5


-ritesh


Re: [PATCH] ext4: mballoc: Disable preemption before getting per-CPU pointer

2020-06-10 Thread Ritesh Harjani

Hello,

Fix for this is already submitted @

https://patchwork.ozlabs.org/project/linux-ext4/patch/534f275016296996f54ecf65168bb3392b6f653d.1591699601.git.rite...@linux.ibm.com/

-ritesh

On 6/10/20 7:19 PM, YueHaibing wrote:

BUG: using smp_processor_id() in preemptible [] code: kworker/u16:3/2181
caller is ext4_mb_new_blocks+0x388/0xed0
CPU: 2 PID: 2181 Comm: kworker/u16:3 Not tainted 5.7.0+ #182
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Workqueue: writeback wb_workfn (flush-8:0)
Call Trace:
  dump_stack+0xb9/0xfc
  debug_smp_processor_id+0xc8/0xd0
  ext4_mb_new_blocks+0x388/0xed0
  ext4_ext_map_blocks+0xa92/0xff0
  ext4_map_blocks+0x34e/0x580
  ext4_writepages+0xa28/0x11b0
  do_writepages+0x46/0xe0
  __writeback_single_inode+0x5f/0x6b0
  writeback_sb_inodes+0x290/0x620
  __writeback_inodes_wb+0x62/0xb0
  wb_writeback+0x36c/0x520
  wb_workfn+0x319/0x680
  process_one_work+0x271/0x640
  worker_thread+0x3a/0x3a0
  kthread+0x14e/0x170
  ret_from_fork+0x27/0x40

Disable preemption before accessing discard_pa_seq.

Fixes: 07b5b8e1ac40 ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve 
ENOSPC handling")
Signed-off-by: YueHaibing 
---
  fs/ext4/mballoc.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a9083113a8c0..30b3bfb1e06a 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4708,7 +4708,8 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
}
  
  	ac->ac_op = EXT4_MB_HISTORY_PREALLOC;

-   seq = *this_cpu_ptr(&discard_pa_seq);
+   seq = *get_cpu_ptr(&discard_pa_seq);
+   put_cpu_ptr(&discard_pa_seq);
if (!ext4_mb_use_preallocated(ac)) {
ac->ac_op = EXT4_MB_HISTORY_ALLOC;
ext4_mb_normalize_request(ac, ar);



Re: [PATCHv2 1/1] ext4: mballoc: Use this_cpu_read instead of this_cpu_ptr

2020-06-09 Thread Ritesh Harjani




On 6/9/20 6:07 PM, Hillf Danton wrote:


On Tue, 9 Jun 2020 18:53:23 +0800 Ritesh Harjani wrote:


Simplify reading a seq variable by directly using this_cpu_read API
instead of doing this_cpu_ptr and then dereferencing it.


Two of the quick questions
1) Why can blocks discarded in a ext4 FS help allocators in another?


I am not sure if I understand your question correctly, but here is a brief
summary of the patchset. If there were PA blocks just discarded, or about to
be discarded, by another thread, then the current thread doing block
allocation should not fail with an ENOSPC error; instead it should be able
to allocate the blocks freed by that other thread. The concept is better
explained in the commit msgs, if more details are required.
Without this patchset (in some heavy multi-threaded use cases) allocation
was failing even when the overall filesystem space available was more than 50%.




2) Why is a percpu seqcount prefered over what 
can offer?



Since this could be a multi-threaded use case, a per-cpu variable helps
avoid the cache line bouncing problem, which could happen when the same
variable is updated by multiple threads on different cpus.
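
And just to make the simplification in this patch concrete (sketch;
discard_pa_seq is the per-cpu u64 counter defined in this series):

	/* before: take a per-cpu pointer and then dereference it */
	seq = *this_cpu_ptr(&discard_pa_seq);

	/* after: read the per-cpu variable directly */
	seq = this_cpu_read(discard_pa_seq);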


-ritesh


Re: [PATCHv5 1/1] ext4: mballoc: Use raw_cpu_ptr instead of this_cpu_ptr

2020-06-09 Thread Ritesh Harjani

This patch is superseded by

https://patchwork.ozlabs.org/project/linux-ext4/patch/534f275016296996f54ecf65168bb3392b6f653d.1591699601.git.rite...@linux.ibm.com/


Re: [PATCHv5 3/5] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling

2020-06-09 Thread Ritesh Harjani




On 6/9/20 3:50 PM, Borislav Petkov wrote:

On Wed, Jun 03, 2020 at 03:40:16PM +0530, Ritesh Harjani wrote:

Yes, this is being discussed in the community.
I have submitted a patch which should help fix this warning msg.
Feel free to give this a try on your setup.

https://marc.info/?l=linux-ext4=159110574414645=2


I just triggered the same thing here too. Looking at your fix and not
even pretending to know what's going on with that percpu sequence
counting, I can't help but wonder why do you wanna do:

seq = *raw_cpu_ptr(&discard_pa_seq);

instead of simply doing:

seq = this_cpu_read(discard_pa_seq);



That's correct. Thanks for pointing that out.
In my development version of the code I had seq as a u64 pointer
variable, which I later changed to a plain u64 variable, but I kept
using the pcpu ptr APIs for it.

Let me re-submit that patch with your Suggested-by tag and the
corresponding changes.

-riteshh


Re: [PATCHv5 1/1] ext4: mballoc: Use raw_cpu_ptr instead of this_cpu_ptr

2020-06-03 Thread Ritesh Harjani

This fixes the warning observed on various Samsung Exynos SoC based
boards with linux-next 20200602.

Tested-by: Marek Szyprowski 



Thanks Marek,

Hello Ted,

Please pick up the below change which I just sent with an added "Fixes:"
tag. Changes-wise it is the same one which Marek tested.

https://patchwork.ozlabs.org/project/linux-ext4/patch/20200603101827.2824-1-rite...@linux.ibm.com/


-ritesh


[PATCH 1/1] ext4: mballoc: Use raw_cpu_ptr instead of this_cpu_ptr

2020-06-03 Thread Ritesh Harjani
It doesn't really matter in ext4_mb_new_blocks() whether the code
is rescheduled on any other cpu due to preemption. Because we care
about discard_pa_seq only when the block allocation fails and then too
we add the seq counter of all the cpus against the initial sampled one
to check if anyone has freed any blocks while we were doing allocation.

So just use raw_cpu_ptr instead of this_cpu_ptr to avoid this BUG.

BUG: using smp_processor_id() in preemptible [] code: syz-fuzzer/6927
caller is ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711
CPU: 1 PID: 6927 Comm: syz-fuzzer Not tainted 5.7.0-next-20200602-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x18f/0x20d lib/dump_stack.c:118
 check_preemption_disabled+0x20d/0x220 lib/smp_processor_id.c:48
 ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711
 ext4_ext_map_blocks+0x201b/0x33e0 fs/ext4/extents.c:4244
 ext4_map_blocks+0x4cb/0x1640 fs/ext4/inode.c:626
 ext4_getblk+0xad/0x520 fs/ext4/inode.c:833
 ext4_bread+0x7c/0x380 fs/ext4/inode.c:883
 ext4_append+0x153/0x360 fs/ext4/namei.c:67
 ext4_init_new_dir fs/ext4/namei.c:2757 [inline]
 ext4_mkdir+0x5e0/0xdf0 fs/ext4/namei.c:2802
 vfs_mkdir+0x419/0x690 fs/namei.c:3632
 do_mkdirat+0x21e/0x280 fs/namei.c:3655
 do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:359
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 42f56b7a4a7d ("ext4: mballoc: introduce pcpu seqcnt for freeing PA
to improve ENOSPC handling")
Signed-off-by: Ritesh Harjani 
Reported-by: syzbot+82f324bb69744c5f6...@syzkaller.appspotmail.com
---
 fs/ext4/mballoc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a9083113a8c0..b79b32dbe3ea 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4708,7 +4708,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
}
 
ac->ac_op = EXT4_MB_HISTORY_PREALLOC;
-   seq = *this_cpu_ptr(&discard_pa_seq);
+   seq = *raw_cpu_ptr(&discard_pa_seq);
if (!ext4_mb_use_preallocated(ac)) {
ac->ac_op = EXT4_MB_HISTORY_ALLOC;
ext4_mb_normalize_request(ac, ar);
-- 
2.21.3



Re: [PATCHv5 3/5] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling

2020-06-03 Thread Ritesh Harjani

Hi Marek,

On 6/3/20 12:18 PM, Marek Szyprowski wrote:

Hi Ritesh,

On 20.05.2020 08:40, Ritesh Harjani wrote:

There could be a race in function ext4_mb_discard_group_preallocations()
where the 1st thread may iterate through group's bb_prealloc_list and
remove all the PAs and add to function's local list head.
Now if the 2nd thread comes in to discard the group preallocations,
it will see that the group->bb_prealloc_list is empty and will return 0.

Consider a case where we have a small number of groups
(e.g. just group 0);
this may even return an -ENOSPC error from ext4_mb_new_blocks()
(where we call for ext4_mb_discard_group_preallocations()).
But that is wrong, since 2nd thread should have waited for 1st thread
to release all the PAs and should have retried for allocation.
Since 1st thread was anyway going to discard the PAs.

The algorithm using this percpu seq counter goes below:
1. We sample the percpu discard_pa_seq counter before trying for block
 allocation in ext4_mb_new_blocks().
2. We increment this percpu discard_pa_seq counter when we either allocate
 or free these blocks i.e. while marking those blocks as used/free in
 mb_mark_used()/mb_free_blocks().
3. We also increment this percpu seq counter when we successfully identify
 that the bb_prealloc_list is not empty and hence proceed for discarding
 of those PAs inside ext4_mb_discard_group_preallocations().

Now to make sure that the regular fast path of block allocation is not
affected, as a small optimization we only sample the percpu seq counter
on that cpu. Only when the block allocation fails and when freed blocks
found were 0, that is when we sample percpu seq counter for all cpus using
below function ext4_get_discard_pa_seq_sum(). This happens after making
sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty.

It can well be argued: why not just check grp->bb_free to
see if there are any free blocks to be allocated? Here are the two
concerns which were discussed:-

1. If for some reason the blocks available in the group are not
 appropriate for the allocation logic (say, e.g.,
 EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then
 the retry logic may result in infinite looping since grp->bb_free is
 non-zero.

2. Also, before preallocation was clubbed with block allocation under the
 same ext4_lock_group(), there were a lot of races where grp->bb_free
 could not be relied upon.
Due to the above, this patch uses the discard_pa_seq logic to determine if
we should retry the block allocation. Say there are n threads
trying for block allocation and none of them could allocate or discard
any of the blocks; then all of those n threads will fail the block
allocation and return an -ENOSPC error (since the seq counter for all of
them will match, as no block allocation/discard was done during that
duration).

Signed-off-by: Ritesh Harjani 


This patch landed in yesterday's linux-next and causes following
WARNING/BUG on various Samsung Exynos-based boards:

   BUG: using smp_processor_id() in preemptible [] code: logsave/552
   caller is ext4_mb_new_blocks+0x404/0x1300


Yes, this is being discussed in the community.
I have submitted a patch which should help fix this warning msg.
Feel free to give this a try on your setup.

https://marc.info/?l=linux-ext4=159110574414645=2


-ritesh
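
For reference, the slow-path check described in the quoted commit message
boils down to something like the sketch below (simplified;
ext4_get_discard_pa_seq_sum() is the helper this series adds, the exact
code may differ):

	/* per-cpu counter introduced by this series */
	static DEFINE_PER_CPU(u64, discard_pa_seq);

	/* only on the allocation-failure path: sum all CPUs' counters */
	static u64 ext4_get_discard_pa_seq_sum(void)
	{
		int cpu;
		u64 seq = 0;

		for_each_possible_cpu(cpu)
			seq += per_cpu(discard_pa_seq, cpu);
		return seq;
	}

If this sum differs from the value sampled before the allocation attempt,
some other CPU allocated or discarded PA blocks in the meantime, so the
allocation is retried instead of returning -ENOSPC.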



Re: linux-next test error: BUG: using smp_processor_id() in preemptible [ADDR] code: syz-fuzzer/6792

2020-06-03 Thread Ritesh Harjani




On 6/2/20 8:22 PM, Hillf Danton wrote:


Tue, 02 Jun 2020 04:20:16 -0700

syzbot found the following crash on:

HEAD commit:0e21d462 Add linux-next specific files for 20200602
git tree:   linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=127233ee10
kernel config:  https://syzkaller.appspot.com/x/.config?x=ecc1aef35f550ee3
dashboard link: https://syzkaller.appspot.com/bug?extid=82f324bb69744c5f6969
compiler:   gcc (GCC) 9.0.0 20181231 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+82f324bb69744c5f6...@syzkaller.appspotmail.com

BUG: using smp_processor_id() in preemptible [] code: syz-fuzzer/6792
caller is ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711


Fix 42f56b7a4a7d ("ext4: mballoc: introduce pcpu seqcnt for freeing PA
to improve ENOSPC handling") by redefining discard_pa_seq to be a simple
regular sequence counter to axe the need of percpu operation.


Why remove the percpu seqcnt? IIUC, percpu is much better for a
multi-threaded use case which could run and allocate blocks in parallel,
whereas updating a single shared variable across different cpus may lead
to the cacheline bouncing problem.
Since in this case we can very well have a use case of multiple threads
trying to allocate blocks at the same time, why change this to a
simple seqcnt from a percpu seqcnt?
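
To illustrate the difference (sketch, assuming the per-cpu counter from
this series):

	/*
	 * per-cpu (this series): on the hot path each CPU only bumps its
	 * own copy, so nothing bounces between CPUs:
	 */
	this_cpu_inc(discard_pa_seq);

	/*
	 * a single shared counter would instead look like:
	 *
	 *	static atomic64_t discard_pa_seq;
	 *	atomic64_inc(&discard_pa_seq);
	 *
	 * and every increment would bounce the same cacheline across CPUs.
	 */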


-ritesh


[PATCHv5 1/1] ext4: mballoc: Use raw_cpu_ptr instead of this_cpu_ptr

2020-06-02 Thread Ritesh Harjani
It doesn't really matter in ext4_mb_new_blocks() whether the code
is rescheduled on any other cpu due to preemption. Because we care
about discard_pa_seq only when the block allocation fails and then too
we add the seq counter of all the cpus against the initial sampled one
to check if anyone has freed any blocks while we were doing allocation.

So just use raw_cpu_ptr instead of this_cpu_ptr to avoid this BUG.

BUG: using smp_processor_id() in preemptible [] code: syz-fuzzer/6927
caller is ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711
CPU: 1 PID: 6927 Comm: syz-fuzzer Not tainted 5.7.0-next-20200602-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x18f/0x20d lib/dump_stack.c:118
 check_preemption_disabled+0x20d/0x220 lib/smp_processor_id.c:48
 ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711
 ext4_ext_map_blocks+0x201b/0x33e0 fs/ext4/extents.c:4244
 ext4_map_blocks+0x4cb/0x1640 fs/ext4/inode.c:626
 ext4_getblk+0xad/0x520 fs/ext4/inode.c:833
 ext4_bread+0x7c/0x380 fs/ext4/inode.c:883
 ext4_append+0x153/0x360 fs/ext4/namei.c:67
 ext4_init_new_dir fs/ext4/namei.c:2757 [inline]
 ext4_mkdir+0x5e0/0xdf0 fs/ext4/namei.c:2802
 vfs_mkdir+0x419/0x690 fs/namei.c:3632
 do_mkdirat+0x21e/0x280 fs/namei.c:3655
 do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:359
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Ritesh Harjani 
Reported-by: syzbot+82f324bb69744c5f6...@syzkaller.appspotmail.com
---
 fs/ext4/mballoc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a9083113a8c0..b79b32dbe3ea 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4708,7 +4708,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
}
 
ac->ac_op = EXT4_MB_HISTORY_PREALLOC;
-   seq = *this_cpu_ptr(&discard_pa_seq);
+   seq = *raw_cpu_ptr(&discard_pa_seq);
if (!ext4_mb_use_preallocated(ac)) {
ac->ac_op = EXT4_MB_HISTORY_ALLOC;
ext4_mb_normalize_request(ac, ar);
-- 
2.21.3



Re: linux-next test error: BUG: using smp_processor_id() in preemptible [ADDR] code: syz-fuzzer/6792

2020-06-02 Thread Ritesh Harjani
#syz test: 
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git 
0e21d4620dd047da7952f44a2e1ac777ded2d57e
>From cc1cf67d99d5fa61db0651c89c288df31bad6b8e Mon Sep 17 00:00:00 2001
From: Ritesh Harjani 
Date: Tue, 2 Jun 2020 17:54:12 +0530
Subject: [PATCH 1/1] ext4: mballoc: Use raw_cpu_ptr in case if preemption is enabled

It doesn't really matter in ext4_mb_new_blocks() whether the code
is rescheduled on any other cpu due to preemption. Because we care
about discard_pa_seq only when the block allocation fails and then too
we add the seq counter of all the cpus against the initial sampled one
to check if anyone has freed any blocks while we were doing allocation.

So just use raw_cpu_ptr to not trigger this BUG.

BUG: using smp_processor_id() in preemptible [] code: syz-fuzzer/6927
caller is ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711
CPU: 1 PID: 6927 Comm: syz-fuzzer Not tainted 5.7.0-next-20200602-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x18f/0x20d lib/dump_stack.c:118
 check_preemption_disabled+0x20d/0x220 lib/smp_processor_id.c:48
 ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711
 ext4_ext_map_blocks+0x201b/0x33e0 fs/ext4/extents.c:4244
 ext4_map_blocks+0x4cb/0x1640 fs/ext4/inode.c:626
 ext4_getblk+0xad/0x520 fs/ext4/inode.c:833
 ext4_bread+0x7c/0x380 fs/ext4/inode.c:883
 ext4_append+0x153/0x360 fs/ext4/namei.c:67
 ext4_init_new_dir fs/ext4/namei.c:2757 [inline]
 ext4_mkdir+0x5e0/0xdf0 fs/ext4/namei.c:2802
 vfs_mkdir+0x419/0x690 fs/namei.c:3632
 do_mkdirat+0x21e/0x280 fs/namei.c:3655
 do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:359
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Ritesh Harjani 
Reported-by: syzbot+82f324bb69744c5f6...@syzkaller.appspotmail.com
---
 fs/ext4/mballoc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a9083113a8c0..b79b32dbe3ea 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4708,7 +4708,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 	}
 
 	ac->ac_op = EXT4_MB_HISTORY_PREALLOC;
-	seq = *this_cpu_ptr(&discard_pa_seq);
+	seq = *raw_cpu_ptr(&discard_pa_seq);
 	if (!ext4_mb_use_preallocated(ac)) {
 		ac->ac_op = EXT4_MB_HISTORY_ALLOC;
 		ext4_mb_normalize_request(ac, ar);
-- 
2.21.3



Re: linux-next: Signed-off-by missing for commit in the ext4 tree

2020-06-01 Thread Ritesh Harjani




On 5/29/20 5:41 PM, Stephen Rothwell wrote:

Hi all,

Commit

   560d6b3da024 ("ext4: mballoc: fix possible NULL ptr & remove BUG_ONs from 
DOUBLE_CHECK")

is missing a Signed-off-by from its author.


My bad. I guess it got missed due to the three dotted lines.

https://patchwork.ozlabs.org/project/linux-ext4/patch/9a54f8a696ff17c057cd571be3d15ac3ec1407f1.1589086800.git.rite...@linux.ibm.com/

-ritesh


[PATCHv5 1/5] ext4: mballoc: Add blocks to PA list under same spinlock after allocating blocks

2020-05-20 Thread Ritesh Harjani
ext4_mb_discard_preallocations() only checks for grp->bb_prealloc_list
of every group to discard the group's PA to free up the space if
allocation request fails. Consider below race:-

Process A                               Process B

1. allocate blocks
                                        1. Fails block allocation from
                                           ext4_mb_regular_allocator()
   ext4_lock_group()
        allocated blocks
        more than ac_o_ex.fe_len
   ext4_unlock_group()
                                        2. Scans the
                                           grp->bb_prealloc_list (under
                                           ext4_lock_group()) and
                                           find nothing and thus return
                                           -ENOSPC.

2. Add the additional blocks to PA list

   ext4_lock_group()
        add blocks to grp->bb_prealloc_list
   ext4_unlock_group()

The above race could be avoided if we add those additional blocks to
grp->bb_prealloc_list at the same time as the block allocation, while
ext4_lock_group() is still held.
With this, discard-PA will know if there are actually any blocks which
could be freed from the PA.

Signed-off-by: Ritesh Harjani 
---
 fs/ext4/mballoc.c | 97 ++-
 1 file changed, 62 insertions(+), 35 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 33a69424942c..decc5168d126 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -349,6 +349,7 @@ static void ext4_mb_generate_from_pa(struct super_block 
*sb, void *bitmap,
ext4_group_t group);
 static void ext4_mb_generate_from_freelist(struct super_block *sb, void 
*bitmap,
ext4_group_t group);
+static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
 
 static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
 {
@@ -1701,6 +1702,14 @@ static void ext4_mb_use_best_found(struct 
ext4_allocation_context *ac,
sbi->s_mb_last_start = ac->ac_f_ex.fe_start;
	spin_unlock(&sbi->s_md_lock);
}
+   /*
+* As we've just preallocated more space than
+* user requested originally, we store allocated
+* space in a special descriptor.
+*/
+   if (ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len)
+   ext4_mb_new_preallocation(ac);
+
 }
 
 /*
@@ -1949,7 +1958,7 @@ void ext4_mb_simple_scan_group(struct 
ext4_allocation_context *ac,
 
ext4_mb_use_best_found(ac, e4b);
 
-   BUG_ON(ac->ac_b_ex.fe_len != ac->ac_g_ex.fe_len);
+   BUG_ON(ac->ac_f_ex.fe_len != ac->ac_g_ex.fe_len);
 
if (EXT4_SB(sb)->s_mb_stats)
atomic_inc(&EXT4_SB(sb)->s_bal_2orders);
@@ -3675,7 +3684,7 @@ static void ext4_mb_put_pa(struct ext4_allocation_context 
*ac,
 /*
  * creates new preallocated space for given inode
  */
-static noinline_for_stack int
+static noinline_for_stack void
 ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 {
struct super_block *sb = ac->ac_sb;
@@ -3688,10 +3697,9 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len);
BUG_ON(ac->ac_status != AC_STATUS_FOUND);
BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
+   BUG_ON(ac->ac_pa == NULL);
 
-   pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
-   if (pa == NULL)
-   return -ENOMEM;
+   pa = ac->ac_pa;
 
if (ac->ac_b_ex.fe_len < ac->ac_g_ex.fe_len) {
int winl;
@@ -3735,7 +3743,6 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
pa->pa_pstart = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);
pa->pa_len = ac->ac_b_ex.fe_len;
pa->pa_free = pa->pa_len;
-   atomic_set(&pa->pa_count, 1);
spin_lock_init(&pa->pa_lock);
INIT_LIST_HEAD(&pa->pa_inode_list);
INIT_LIST_HEAD(&pa->pa_group_list);
@@ -3755,21 +3762,17 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
pa->pa_obj_lock = &ei->i_prealloc_lock;
pa->pa_inode = ac->ac_inode;
 
-   ext4_lock_group(sb, ac->ac_b_ex.fe_group);
list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
-   ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
 
spin_lock(pa->pa_obj_lock);
list_add_rcu(&pa->pa_inode_list, &ei->i_prealloc_list);
spin_unlock(pa->pa_obj_lock);
-
-   return 0;
 }
 
 /*
  * creates new preallocated space for locality group inodes belongs to
  */
-static noinline_for_stack int
+static noinline_for_stack void
 ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
 {
struct super_block *sb = ac->ac_sb;
@@ -3781,11 +3784,9 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
BUG_ON(ac-&g

[PATCHv5 4/5] ext4: mballoc: Refactor ext4_mb_good_group()

2020-05-20 Thread Ritesh Harjani
ext4_mb_good_group()'s definition was changed some time back and it now
even initializes the buddy cache (via ext4_mb_init_group()) in case
EXT4_MB_GRP_NEED_INIT() is true for a group.
Note that ext4_mb_init_group() could sleep and so should not be called
with a spinlock held.
This is fine as of now because ext4_mb_good_group() is called before
loading the buddy bitmap without ext4_lock_group() held, and is called
again after loading the bitmap, only this time with ext4_lock_group()
held.
But still, this whole thing is confusing.

So this patch refactors out ext4_mb_good_group_nolock(), which should be
called when not holding ext4_lock_group().
Also, in further patches we hold the spinlock (ext4_lock_group()) while
doing any calculations which involve grp->bb_free or grp->bb_fragments.

Signed-off-by: Ritesh Harjani 
---
 fs/ext4/mballoc.c | 78 ++-
 1 file changed, 50 insertions(+), 28 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 754ff9f65199..c9297c878a90 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2106,15 +2106,14 @@ void ext4_mb_scan_aligned(struct 
ext4_allocation_context *ac,
 }
 
 /*
- * This is now called BEFORE we load the buddy bitmap.
+ * This is also called BEFORE we load the buddy bitmap.
  * Returns either 1 or 0 indicating that the group is either suitable
- * for the allocation or not. In addition it can also return negative
- * error code when something goes wrong.
+ * for the allocation or not.
  */
-static int ext4_mb_good_group(struct ext4_allocation_context *ac,
+static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
ext4_group_t group, int cr)
 {
-   unsigned free, fragments;
+   ext4_grpblk_t free, fragments;
int flex_size = ext4_flex_bg_size(EXT4_SB(ac->ac_sb));
struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
 
@@ -2122,23 +2121,16 @@ static int ext4_mb_good_group(struct 
ext4_allocation_context *ac,
 
free = grp->bb_free;
if (free == 0)
-   return 0;
+   return false;
if (cr <= 2 && free < ac->ac_g_ex.fe_len)
-   return 0;
+   return false;
 
if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
-   return 0;
-
-   /* We only do this if the grp has never been initialized */
-   if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
-   int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
-   if (ret)
-   return ret;
-   }
+   return false;
 
fragments = grp->bb_fragments;
if (fragments == 0)
-   return 0;
+   return false;
 
switch (cr) {
case 0:
@@ -2148,31 +2140,63 @@ static int ext4_mb_good_group(struct 
ext4_allocation_context *ac,
if ((ac->ac_flags & EXT4_MB_HINT_DATA) &&
(flex_size >= EXT4_FLEX_SIZE_DIR_ALLOC_SCHEME) &&
((group % flex_size) == 0))
-   return 0;
+   return false;
 
if ((ac->ac_2order > ac->ac_sb->s_blocksize_bits+1) ||
(free / fragments) >= ac->ac_g_ex.fe_len)
-   return 1;
+   return true;
 
if (grp->bb_largest_free_order < ac->ac_2order)
-   return 0;
+   return false;
 
-   return 1;
+   return true;
case 1:
if ((free / fragments) >= ac->ac_g_ex.fe_len)
-   return 1;
+   return true;
break;
case 2:
if (free >= ac->ac_g_ex.fe_len)
-   return 1;
+   return true;
break;
case 3:
-   return 1;
+   return true;
default:
BUG();
}
 
-   return 0;
+   return false;
+}
+
+/*
+ * This could return negative error code if something goes wrong
+ * during ext4_mb_init_group(). This should not be called with
+ * ext4_lock_group() held.
+ */
+static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
+ext4_group_t group, int cr)
+{
+   struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
+   ext4_grpblk_t free;
+   int ret = 0;
+
+   free = grp->bb_free;
+   if (free == 0)
+   goto out;
+   if (cr <= 2 && free < ac->ac_g_ex.fe_len)
+   goto out;
+   if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
+   goto out;
+
+   /* We only do this if the grp has never been initialized */
+   if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
+   ret = ext4_m

[PATCHv5 3/5] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling

2020-05-20 Thread Ritesh Harjani
There could be a race in function ext4_mb_discard_group_preallocations()
where the 1st thread may iterate through group's bb_prealloc_list and
remove all the PAs and add to function's local list head.
Now if the 2nd thread comes in to discard the group preallocations,
it will see that the group->bb_prealloc_list is empty and will return 0.

Consider a case where we have a small number of groups
(e.g. just group 0):
this may even return an -ENOSPC error from ext4_mb_new_blocks()
(where we call ext4_mb_discard_group_preallocations()).
But that is wrong, since the 2nd thread should have waited for the 1st
thread to release all the PAs and should have retried the allocation,
since the 1st thread was anyway going to discard the PAs.

The algorithm using this percpu seq counter goes below:
1. We sample the percpu discard_pa_seq counter before trying for block
   allocation in ext4_mb_new_blocks().
2. We increment this percpu discard_pa_seq counter when we either allocate
   or free these blocks i.e. while marking those blocks as used/free in
   mb_mark_used()/mb_free_blocks().
3. We also increment this percpu seq counter when we successfully identify
   that the bb_prealloc_list is not empty and hence proceed for discarding
   of those PAs inside ext4_mb_discard_group_preallocations().

Now to make sure that the regular fast path of block allocation is not
affected, as a small optimization we only sample the percpu seq counter
on that cpu. Only when the block allocation fails and when freed blocks
found were 0, that is when we sample percpu seq counter for all cpus using
below function ext4_get_discard_pa_seq_sum(). This happens after making
sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty.

It can be well argued: why not just check grp->bb_free to see if there
are any free blocks to be allocated? Here are the two concerns which
were discussed:-

1. If for some reason the blocks available in the group are not
   appropriate for the allocation logic (say e.g.
   EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then
   the retry logic may result in infinite looping since grp->bb_free is
   non-zero.

2. Also, before preallocation was clubbed with block allocation under
   the same ext4_lock_group(), there were a lot of races where
   grp->bb_free could not be relied upon.
Due to the above, this patch uses the discard_pa_seq logic to determine
whether we should retry the block allocation. Say there are n threads
trying for block allocation and none of them could allocate or discard
any of the blocks; then all of those n threads will fail the block
allocation and return the -ENOSPC error (since the seq counter for all
of them will match, as no block allocation/discard was done during that
window). A simplified sketch of this retry decision is shown below.
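
Below is a hedged sketch of how the seq counter gates the retry decision
(try_regular_allocator() is a made-up helper for illustration only; the
actual logic lives in ext4_mb_new_blocks() and
ext4_mb_discard_preallocations_should_retry()):

	static ext4_fsblk_t mb_new_blocks_sketch(struct super_block *sb,
						 struct ext4_allocation_context *ac)
	{
		u64 seq = this_cpu_read(discard_pa_seq);	/* cheap, local sample */
		ext4_fsblk_t block;

		while (1) {
			block = try_regular_allocator(ac);	/* hypothetical helper */
			if (block)
				return block;
			/* Discard PAs; retry if anything was actually freed. */
			if (ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len))
				continue;
			/* Nothing freed: only now pay for summing over all CPUs.
			 * Retry if any alloc/free/discard bumped the counter. */
			if (ext4_get_discard_pa_seq_sum() != seq) {
				seq = ext4_get_discard_pa_seq_sum();
				continue;
			}
			return 0;	/* caller maps this to -ENOSPC */
		}
	}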

Signed-off-by: Ritesh Harjani 
---
 fs/ext4/mballoc.c | 56 ++-
 1 file changed, 51 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index b75408d72773..754ff9f65199 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -351,6 +351,35 @@ static void ext4_mb_generate_from_freelist(struct 
super_block *sb, void *bitmap,
ext4_group_t group);
 static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
 
+/*
+ * The algorithm using this percpu seq counter goes below:
+ * 1. We sample the percpu discard_pa_seq counter before trying for block
+ *allocation in ext4_mb_new_blocks().
+ * 2. We increment this percpu discard_pa_seq counter when we either allocate
+ *or free these blocks i.e. while marking those blocks as used/free in
+ *mb_mark_used()/mb_free_blocks().
+ * 3. We also increment this percpu seq counter when we successfully identify
+ *that the bb_prealloc_list is not empty and hence proceed for discarding
+ *of those PAs inside ext4_mb_discard_group_preallocations().
+ *
+ * Now to make sure that the regular fast path of block allocation is not
+ * affected, as a small optimization we only sample the percpu seq counter
+ * on that cpu. Only when the block allocation fails and when freed blocks
+ * found were 0, that is when we sample percpu seq counter for all cpus using
+ * below function ext4_get_discard_pa_seq_sum(). This happens after making
+ * sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty.
+ */
+static DEFINE_PER_CPU(u64, discard_pa_seq);
+static inline u64 ext4_get_discard_pa_seq_sum(void)
+{
+   int __cpu;
+   u64 __seq = 0;
+
+   for_each_possible_cpu(__cpu)
+   __seq += per_cpu(discard_pa_seq, __cpu);
+   return __seq;
+}
+
 static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
 {
 #if BITS_PER_LONG == 64
@@ -1462,6 +1491,7 @@ static void mb_free_blocks(struct inode *inode, struct 
ext4_buddy *e4b,
mb_check_buddy(e4b);
mb_free_blocks_double(inode, e4b, first, count);
 
+   this_cpu_

[PATCHv5 5/5] ext4: mballoc: Use lock for checking free blocks while retrying

2020-05-20 Thread Ritesh Harjani
Currently, while doing block allocation, grp->bb_free may be getting
modified if a discard is happening in parallel.
For example, consider a case where there are a lot of threads that have
preallocated a lot of blocks and there is a thread which is trying
to discard all of this group's PAs. Now it could happen that
we see the group's bb_free as zero and fail the allocation
while there would be sufficient space if we freed up all the PAs.

So this patch adds another flag, "EXT4_MB_STRICT_CHECK", which will be set
if we are unable to allocate any blocks in the first try (since we may
not have considered blocks about to be discarded from the PA lists).
During the retry attempt to allocate blocks we will then use
ext4_lock_group() while checking whether the group is good or not,
as pulled together in the sketch below.
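
Pulled together from the diff below for readability, the two halves of
the mechanism look roughly like this (a sketch of the existing hunks,
not new code):

	/* In ext4_mb_discard_preallocations_should_retry(): arm strict mode
	 * when deciding whether the caller should retry. */
	if (!(ac->ac_flags & EXT4_MB_STRICT_CHECK) || seq_retry != *seq) {
		ac->ac_flags |= EXT4_MB_STRICT_CHECK;
		*seq = seq_retry;
		ret = true;
	}

	/* In ext4_mb_good_group_nolock(): the retry pass now reads bb_free
	 * and friends under the group lock. */
	bool should_lock = ac->ac_flags & EXT4_MB_STRICT_CHECK;

	if (should_lock)
		ext4_lock_group(sb, group);
	free = grp->bb_free;
	/* ... free-block and corruption checks ... */
	if (should_lock)
		ext4_unlock_group(sb, group);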

Signed-off-by: Ritesh Harjani 
---
 fs/ext4/ext4.h  |  2 ++
 fs/ext4/mballoc.c   | 13 -
 include/trace/events/ext4.h |  3 ++-
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index fb37fb3fe689..d185f3bcb9eb 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -150,6 +150,8 @@ enum SHIFT_DIRECTION {
 #define EXT4_MB_USE_ROOT_BLOCKS0x1000
 /* Use blocks from reserved pool */
 #define EXT4_MB_USE_RESERVED   0x2000
+/* Do strict check for free blocks while retrying block allocation */
+#define EXT4_MB_STRICT_CHECK   0x4000
 
 struct ext4_allocation_request {
/* target inode for block we're allocating */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index c9297c878a90..a9083113a8c0 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2176,9 +2176,13 @@ static int ext4_mb_good_group_nolock(struct 
ext4_allocation_context *ac,
 ext4_group_t group, int cr)
 {
struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
+   struct super_block *sb = ac->ac_sb;
+   bool should_lock = ac->ac_flags & EXT4_MB_STRICT_CHECK;
ext4_grpblk_t free;
int ret = 0;
 
+   if (should_lock)
+   ext4_lock_group(sb, group);
free = grp->bb_free;
if (free == 0)
goto out;
@@ -2186,6 +2190,8 @@ static int ext4_mb_good_group_nolock(struct 
ext4_allocation_context *ac,
goto out;
if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
goto out;
+   if (should_lock)
+   ext4_unlock_group(sb, group);
 
/* We only do this if the grp has never been initialized */
if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
@@ -2194,8 +2200,12 @@ static int ext4_mb_good_group_nolock(struct 
ext4_allocation_context *ac,
return ret;
}
 
+   if (should_lock)
+   ext4_lock_group(sb, group);
ret = ext4_mb_good_group(ac, group, cr);
 out:
+   if (should_lock)
+   ext4_unlock_group(sb, group);
return ret;
 }
 
@@ -4610,7 +4620,8 @@ static bool 
ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
goto out_dbg;
}
seq_retry = ext4_get_discard_pa_seq_sum();
-   if (seq_retry != *seq) {
+   if (!(ac->ac_flags & EXT4_MB_STRICT_CHECK) || seq_retry != *seq) {
+   ac->ac_flags |= EXT4_MB_STRICT_CHECK;
*seq = seq_retry;
ret = true;
}
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 19c87661eeec..0df9efa80b16 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -35,7 +35,8 @@ struct partial_cluster;
{ EXT4_MB_DELALLOC_RESERVED,"DELALLOC_RESV" },  \
{ EXT4_MB_STREAM_ALLOC, "STREAM_ALLOC" },   \
{ EXT4_MB_USE_ROOT_BLOCKS,  "USE_ROOT_BLKS" },  \
-   { EXT4_MB_USE_RESERVED, "USE_RESV" })
+   { EXT4_MB_USE_RESERVED, "USE_RESV" },   \
+   { EXT4_MB_STRICT_CHECK, "STRICT_CHECK" })
 
 #define show_map_flags(flags) __print_flags(flags, "|",
\
{ EXT4_GET_BLOCKS_CREATE,   "CREATE" }, \
-- 
2.21.0



[PATCHv5 0/5] Improve ext4 handling of ENOSPC with multi-threaded use-case

2020-05-20 Thread Ritesh Harjani
Hello All,

Please note that these patches are based on top of mballoc cleanup series [2]
which is also pending review. :)

v4 -> v5:
1. Removed ext4_lock_group() from fastpath and added that in the retry attempt,
so that the performance of fastpath is not affected.

v3 -> v4:
1. Splitted code cleanups and debug improvements as a separate patch series.
2. Dropped rcu_barrier() approach since it did cause some latency
in my testing of ENOSPC handling.
3. This patch series takes a different approach to improve the multi-threaded
ENOSPC handling in ext4 mballoc code. Below mail gives more details.


Background
==
Consider a case where your disk is close to full but enough space still
remains for your multi-threaded application to run. Now when this
application's threads try to write (e.g. a sparse file followed by mmap
write, or even fallocate of multiple files) in parallel, then with the
current code of the ext4 multi-block allocator, the application may get an
ENOSPC error in some cases. Examining disk space at this time, we see there
is sufficient space remaining for the application to continue to run.

Additional info:

1. Our internal test team was easily able to reproduce this ENOSPC error on
   an upstream kernel with a 2GB ext4 image, with 64K blocksize. They didn't
   try above 2GB and reported this issue directly to the dev team. On
   examining the free space when the application gets ENOSPC, the free space
   left was more than 50% of the filesystem size in some cases.

2. For debugging/development of these patches, I used below script [1] to
   trigger this issue quite frequently on a 64K blocksize setup with 240MB
   ext4 image.


Summary of patches and problem with current design
==
There were 3 main problems which these patches tries to address and hence
improve the ENOSPC handling in ext4's multi-block allocator code.

1. Patch-2: Earlier we were deciding whether a group is good or not (i.e.
   checking if it has enough free blocks to serve the request) without taking
   the group's lock. This could result in a race where, if another thread is
   discarding the group's prealloc list, the allocation thread will not
   consider those about-to-be-freed blocks, will report that the group is not
   fit for allocation, and thus eventually fails with an ENOSPC error.

2. Patch-4: The discard-PA algorithm only scans the PA list to free up
   the additional blocks which got added to the PA. This is done by the
   same thread-A which at first couldn't allocate any blocks. But there
   is a window where, once the blocks were allocated (say by some other
   thread-B previously), we drop the group's lock and then check whether
   some of these blocks could be added to the prealloc list of the group
   from which we allocated. After that we take the lock and add these
   additional blocks allocated by thread-B to the PA list. But if
   thread-A scans the PA list within this time interval, there is a
   possibility that it won't find any blocks added to the PA list and
   hence may return an ENOSPC error.
   Hence this patch adds those additional blocks to the PA list right
   after the blocks are marked as used, with the same group's spinlock
   held.
   
3. Patch-3: Introduces a per cpu discard_pa_seq counter which is increased
   whenever there is block allocation/freeing or when the discarding of any
   group's PA list has started. With this we could know when to stop the
   retrying logic and return ENOSPC error if there is actually no free space
   left.
   There is an optimization done in the block allocation fast path with this
   approach that, before starting the block allocation, we only sample the
   percpu seq count on that cpu. Only when the allocation fails and discard
   couldn't free up any of the blocks in all of the group's PA list, that is
   when we sample the percpu seq counter sum over all possible cpus to check
   if we need to retry.


Testing:
=
Tested fstests with default bs of 4K and bs == PAGESIZE ("-g auto")
No new failures were reported with this patch series in this testing.

NOTE:
1. This patch series is based on top of mballoc code cleanup patch series
posted at [2].

References:
===
[v3]: 
https://lkml.kernel.org/linux-ext4/cover.1588313626.git.rite...@linux.ibm.com/
[1]: 
https://github.com/riteshharjani/LinuxStudy/blob/master/tools/test_mballoc.sh
[2]: 
https://lkml.kernel.org/linux-ext4/cover.1589086800.git.rite...@linux.ibm.com/


Ritesh Harjani (5):
  ext4: mballoc: Add blocks to PA list under same spinlock after allocating 
blocks
  ext4: mballoc: Refactor ext4_mb_discard_preallocations()
  ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
  ext4: mballoc: Refactor ext4_mb_good_group()
  ext4: mballoc: Use lock for checking free blocks while retrying

 fs/ext4/ext4.h  |   2 +
 fs/ext4/mballoc.c

[PATCHv5 2/5] ext4: mballoc: Refactor ext4_mb_discard_preallocations()

2020-05-20 Thread Ritesh Harjani
Implement ext4_mb_discard_preallocations_should_retry(),
which we will need in later patches to add more logic,
like checking for a sequence number match, to decide whether
we should retry the block allocation or not.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani 
---
 fs/ext4/mballoc.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index decc5168d126..b75408d72773 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4543,6 +4543,17 @@ static int ext4_mb_discard_preallocations(struct 
super_block *sb, int needed)
return freed;
 }
 
+static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
+   struct ext4_allocation_context *ac)
+{
+   int freed;
+
+   freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
+   if (freed)
+   return true;
+   return false;
+}
+
 /*
  * Main entry point into mballoc to allocate blocks
  * it tries to use preallocation first, then falls back
@@ -4551,7 +4562,6 @@ static int ext4_mb_discard_preallocations(struct 
super_block *sb, int needed)
 ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
struct ext4_allocation_request *ar, int *errp)
 {
-   int freed;
struct ext4_allocation_context *ac = NULL;
struct ext4_sb_info *sbi;
struct super_block *sb;
@@ -4656,8 +4666,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
ar->len = ac->ac_b_ex.fe_len;
}
} else {
-   freed  = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
-   if (freed)
+   if (ext4_mb_discard_preallocations_should_retry(sb, ac))
goto repeat;
/*
 * If block allocation fails then the pa allocated above
-- 
2.21.0



[RFCv4 3/6] ext4: mballoc: Optimize ext4_mb_good_group_nolock further if grp needs init

2020-05-10 Thread Ritesh Harjani
Currently, in case EXT4_MB_GRP_NEED_INIT(grp) is true, we first
check a few things like grp->bb_free etc. with the spinlock
(ext4_lock_group()) held. Then we drop the lock only to initialize the
group's buddy cache, and then again take the lock and check
ext4_mb_good_group().

Once this step is done we return to ext4_mb_regular_allocator(), load
the buddy and then again take the lock only to check
ext4_mb_good_group(), which was anyway done in the previous step.

I believe we can optimize one step here: if the group needs init,
we can check only a few things and return early, and then recheck
ext4_mb_good_group() only once after loading the buddy cache.

Signed-off-by: Ritesh Harjani 
---
 fs/ext4/mballoc.c | 44 +---
 1 file changed, 21 insertions(+), 23 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index dcd05ff7c012..7d766dc34cdd 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2134,33 +2134,34 @@ static bool ext4_mb_good_group(struct 
ext4_allocation_context *ac,
  * during ext4_mb_init_group(). This should not be called with
  * ext4_lock_group() held.
  */
-static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
+static bool ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
 ext4_group_t group, int cr)
 {
struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
struct super_block *sb = ac->ac_sb;
ext4_grpblk_t free;
-   int ret = 0;
+   bool ret = false, need_init = EXT4_MB_GRP_NEED_INIT(grp);
 
ext4_lock_group(sb, group);
-   free = grp->bb_free;
-   if (free == 0)
-   goto out;
-   if (cr <= 2 && free < ac->ac_g_ex.fe_len)
-   goto out;
-   if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
-   goto out;
-   ext4_unlock_group(sb, group);
-
-   /* We only do this if the grp has never been initialized */
-   if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
-   ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
-   if (ret)
-   return ret;
+   /*
+* If the group needs init then no need to call ext4_mb_init_group()
+* after dropping the lock. It's better we check bb_free/other things
+* here and if it meets the criteria then return true. Later we
+* will anyway check for good group after loading the buddy cache
+* which, if required will call ext4_mb_init_group() from within.
+*/
+   if (unlikely(need_init)) {
+   free = grp->bb_free;
+   if (free == 0)
+   goto out;
+   if (cr <= 2 && free < ac->ac_g_ex.fe_len)
+   goto out;
+   if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
+   goto out;
+   ret = true;
+   } else {
+   ret = ext4_mb_good_group(ac, group, cr);
}
-
-   ext4_lock_group(sb, group);
-   ret = ext4_mb_good_group(ac, group, cr);
 out:
ext4_unlock_group(sb, group);
return ret;
@@ -2252,11 +2253,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context 
*ac)
 
/* This now checks without needing the buddy page */
ret = ext4_mb_good_group_nolock(ac, group, cr);
-   if (ret <= 0) {
-   if (!first_err)
-   first_err = ret;
+   if (ret == 0)
continue;
-   }
 
err = ext4_mb_load_buddy(sb, group, &e4b);
if (err)
-- 
2.21.0



[RFCv4 6/6] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling

2020-05-10 Thread Ritesh Harjani
There could be a race in function ext4_mb_discard_group_preallocations()
where the 1st thread may iterate through group's bb_prealloc_list and
remove all the PAs and add to function's local list head.
Now if the 2nd thread comes in to discard the group preallocations,
it will see that the group->bb_prealloc_list is empty and will return 0.

Consider a case where we have a small number of groups
(e.g. just group 0):
this may even return an -ENOSPC error from ext4_mb_new_blocks()
(where we call ext4_mb_discard_group_preallocations()).
But that is wrong, since the 2nd thread should have waited for the 1st
thread to release all the PAs and should have retried the allocation,
since the 1st thread was anyway going to discard the PAs.

The algorithm using this percpu seq counter goes below:
1. We sample the percpu discard_pa_seq counter before trying for block
   allocation in ext4_mb_new_blocks().
2. We increment this percpu discard_pa_seq counter when we either allocate
   or free these blocks i.e. while marking those blocks as used/free in
   mb_mark_used()/mb_free_blocks().
3. We also increment this percpu seq counter when we successfully identify
   that the bb_prealloc_list is not empty and hence proceed for discarding
   of those PAs inside ext4_mb_discard_group_preallocations().

Now to make sure that the regular fast path of block allocation is not
affected, as a small optimization we only sample the percpu seq counter
on that cpu. Only when the block allocation fails and when freed blocks
found were 0, that is when we sample percpu seq counter for all cpus using
below function ext4_get_discard_pa_seq_sum(). This happens after making
sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty.

It can be well argued: why not just check grp->bb_free to see if there
are any free blocks to be allocated? Here are the two concerns which
were discussed:-

1. If for some reason the blocks available in the group are not
   appropriate for the allocation logic (say e.g.
   EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then
   the retry logic may result in infinite looping since grp->bb_free is
   non-zero.

2. Also, before preallocation was clubbed with block allocation under
   the same ext4_lock_group(), there were a lot of races where
   grp->bb_free could not be relied upon.
Due to the above, this patch uses the discard_pa_seq logic to determine
whether we should retry the block allocation. Say there are n threads
trying for block allocation and none of them could allocate or discard
any of the blocks; then all of those n threads will fail the block
allocation and return the -ENOSPC error (since the seq counter for all
of them will match, as no block allocation/discard was done during that
window).

Signed-off-by: Ritesh Harjani 
---
 fs/ext4/mballoc.c | 56 ++-
 1 file changed, 51 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index c18670924bbe..ddf2179ed7cf 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -351,6 +351,35 @@ static void ext4_mb_generate_from_freelist(struct 
super_block *sb, void *bitmap,
ext4_group_t group);
 static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
 
+/*
+ * The algorithm using this percpu seq counter goes below:
+ * 1. We sample the percpu discard_pa_seq counter before trying for block
+ *allocation in ext4_mb_new_blocks().
+ * 2. We increment this percpu discard_pa_seq counter when we either allocate
+ *or free these blocks i.e. while marking those blocks as used/free in
+ *mb_mark_used()/mb_free_blocks().
+ * 3. We also increment this percpu seq counter when we successfully identify
+ *that the bb_prealloc_list is not empty and hence proceed for discarding
+ *of those PAs inside ext4_mb_discard_group_preallocations().
+ *
+ * Now to make sure that the regular fast path of block allocation is not
+ * affected, as a small optimization we only sample the percpu seq counter
+ * on that cpu. Only when the block allocation fails and when freed blocks
+ * found were 0, that is when we sample percpu seq counter for all cpus using
+ * below function ext4_get_discard_pa_seq_sum(). This happens after making
+ * sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty.
+ */
+static DEFINE_PER_CPU(u64, discard_pa_seq);
+static inline u64 ext4_get_discard_pa_seq_sum(void)
+{
+   int __cpu;
+   u64 __seq = 0;
+
+   for_each_possible_cpu(__cpu)
+   __seq += per_cpu(discard_pa_seq, __cpu);
+   return __seq;
+}
+
 static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
 {
 #if BITS_PER_LONG == 64
@@ -1462,6 +1491,7 @@ static void mb_free_blocks(struct inode *inode, struct 
ext4_buddy *e4b,
mb_check_buddy(e4b);
mb_free_blocks_double(inode, e4b, first, count);
 
+   this_cpu_

[RFCv4 5/6] ext4: mballoc: Refactor ext4_mb_discard_preallocations()

2020-05-10 Thread Ritesh Harjani
Implement ext4_mb_discard_preallocations_should_retry(),
which we will need in later patches to add more logic,
like checking for a sequence number match, to decide whether
we should retry the block allocation or not.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani 
---
 fs/ext4/mballoc.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 574ce400a3b5..c18670924bbe 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4570,6 +4570,17 @@ static int ext4_mb_discard_preallocations(struct 
super_block *sb, int needed)
return freed;
 }
 
+static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
+   struct ext4_allocation_context *ac)
+{
+   int freed;
+
+   freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
+   if (freed)
+   return true;
+   return false;
+}
+
 /*
  * Main entry point into mballoc to allocate blocks
  * it tries to use preallocation first, then falls back
@@ -4578,7 +4589,6 @@ static int ext4_mb_discard_preallocations(struct 
super_block *sb, int needed)
 ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
struct ext4_allocation_request *ar, int *errp)
 {
-   int freed;
struct ext4_allocation_context *ac = NULL;
struct ext4_sb_info *sbi;
struct super_block *sb;
@@ -4683,8 +4693,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
ar->len = ac->ac_b_ex.fe_len;
}
} else {
-   freed  = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
-   if (freed)
+   if (ext4_mb_discard_preallocations_should_retry(sb, ac))
goto repeat;
/*
 * If block allocation fails then the pa allocated above
-- 
2.21.0



[RFCv4 4/6] ext4: mballoc: Add blocks to PA list under same spinlock after allocating blocks

2020-05-10 Thread Ritesh Harjani
ext4_mb_discard_preallocations() only checks grp->bb_prealloc_list of
every group and discards the group's PAs to free up space when an
allocation request fails. Consider the race below:-

Process A                               Process B

1. allocate blocks
                                        1. Fails block allocation from
                                           ext4_mb_regular_allocator()
   ext4_lock_group()
        allocated blocks
        more than ac_o_ex.fe_len
   ext4_unlock_group()
                                        2. Scans the
                                           grp->bb_prealloc_list (under
                                           ext4_lock_group()) and
                                           finds nothing and thus
                                           returns -ENOSPC.

2. Add the additional blocks to PA list

   ext4_lock_group()
        add blocks to grp->bb_prealloc_list
   ext4_unlock_group()

The above race can be avoided if we add those additional blocks to
grp->bb_prealloc_list at the same time as the block allocation, while
ext4_lock_group() is still held.
With this, discard-PA will know whether there are actually any blocks
which could be freed from the PA.

Signed-off-by: Ritesh Harjani 
---
 fs/ext4/mballoc.c | 97 ++-
 1 file changed, 62 insertions(+), 35 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 7d766dc34cdd..574ce400a3b5 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -349,6 +349,7 @@ static void ext4_mb_generate_from_pa(struct super_block 
*sb, void *bitmap,
ext4_group_t group);
 static void ext4_mb_generate_from_freelist(struct super_block *sb, void 
*bitmap,
ext4_group_t group);
+static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
 
 static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
 {
@@ -1701,6 +1702,14 @@ static void ext4_mb_use_best_found(struct 
ext4_allocation_context *ac,
sbi->s_mb_last_start = ac->ac_f_ex.fe_start;
spin_unlock(&sbi->s_md_lock);
}
+   /*
+* As we've just preallocated more space than
+* user requested originally, we store allocated
+* space in a special descriptor.
+*/
+   if (ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len)
+   ext4_mb_new_preallocation(ac);
+
 }
 
 /*
@@ -1949,7 +1958,7 @@ void ext4_mb_simple_scan_group(struct 
ext4_allocation_context *ac,
 
ext4_mb_use_best_found(ac, e4b);
 
-   BUG_ON(ac->ac_b_ex.fe_len != ac->ac_g_ex.fe_len);
+   BUG_ON(ac->ac_f_ex.fe_len != ac->ac_g_ex.fe_len);
 
if (EXT4_SB(sb)->s_mb_stats)
atomic_inc(&EXT4_SB(sb)->s_bal_2orders);
@@ -3702,7 +3711,7 @@ static void ext4_mb_put_pa(struct ext4_allocation_context 
*ac,
 /*
  * creates new preallocated space for given inode
  */
-static noinline_for_stack int
+static noinline_for_stack void
 ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 {
struct super_block *sb = ac->ac_sb;
@@ -3715,10 +3724,9 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len);
BUG_ON(ac->ac_status != AC_STATUS_FOUND);
BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
+   BUG_ON(ac->ac_pa == NULL);
 
-   pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
-   if (pa == NULL)
-   return -ENOMEM;
+   pa = ac->ac_pa;
 
if (ac->ac_b_ex.fe_len < ac->ac_g_ex.fe_len) {
int winl;
@@ -3762,7 +3770,6 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
pa->pa_pstart = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);
pa->pa_len = ac->ac_b_ex.fe_len;
pa->pa_free = pa->pa_len;
-   atomic_set(&pa->pa_count, 1);
spin_lock_init(&pa->pa_lock);
INIT_LIST_HEAD(&pa->pa_inode_list);
INIT_LIST_HEAD(&pa->pa_group_list);
@@ -3782,21 +3789,17 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
pa->pa_obj_lock = &ei->i_prealloc_lock;
pa->pa_inode = ac->ac_inode;
 
-   ext4_lock_group(sb, ac->ac_b_ex.fe_group);
list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
-   ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
 
spin_lock(pa->pa_obj_lock);
list_add_rcu(&pa->pa_inode_list, &ei->i_prealloc_list);
spin_unlock(pa->pa_obj_lock);
-
-   return 0;
 }
 
 /*
  * creates new preallocated space for locality group inodes belongs to
  */
-static noinline_for_stack int
+static noinline_for_stack void
 ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
 {
struct super_block *sb = ac->ac_sb;
@@ -3808,11 +3811,9 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
BUG_ON(ac-&g

[RFCv4 2/6] ext4: mballoc: Use ext4_lock_group() around calculations involving bb_free

2020-05-10 Thread Ritesh Harjani
Currently, while doing block allocation, grp->bb_free may be getting
modified if a discard is happening in parallel.
For example, consider a case where there are a lot of threads that have
preallocated a lot of blocks and there is a thread which is trying
to discard all of this group's PAs. Now it could happen that
we see the group's bb_free as zero and fail the allocation
while there would be sufficient space if we freed up all the PAs.

So this patch takes ext4_lock_group() around calculations involving
grp->bb_free in ext4_mb_good_group() & ext4_mb_good_group_nolock().

Signed-off-by: Ritesh Harjani 
---
 fs/ext4/mballoc.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index da11a4a738bd..dcd05ff7c012 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2138,9 +2138,11 @@ static int ext4_mb_good_group_nolock(struct 
ext4_allocation_context *ac,
 ext4_group_t group, int cr)
 {
struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
+   struct super_block *sb = ac->ac_sb;
ext4_grpblk_t free;
int ret = 0;
 
+   ext4_lock_group(sb, group);
free = grp->bb_free;
if (free == 0)
goto out;
@@ -2148,6 +2150,7 @@ static int ext4_mb_good_group_nolock(struct 
ext4_allocation_context *ac,
goto out;
if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
goto out;
+   ext4_unlock_group(sb, group);
 
/* We only do this if the grp has never been initialized */
if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
@@ -2156,8 +2159,10 @@ static int ext4_mb_good_group_nolock(struct 
ext4_allocation_context *ac,
return ret;
}
 
+   ext4_lock_group(sb, group);
ret = ext4_mb_good_group(ac, group, cr);
 out:
+   ext4_unlock_group(sb, group);
return ret;
 }
 
-- 
2.21.0



[RFCv4 1/6] ext4: mballoc: Refactor ext4_mb_good_group()

2020-05-10 Thread Ritesh Harjani
ext4_mb_good_group()'s definition was changed some time back and it now
even initializes the buddy cache (via ext4_mb_init_group()) in case
EXT4_MB_GRP_NEED_INIT() is true for a group.
Note that ext4_mb_init_group() could sleep and so should not be called
with a spinlock held.
This is fine as of now because ext4_mb_good_group() is called before
loading the buddy bitmap without ext4_lock_group() held, and is called
again after loading the bitmap, only this time with ext4_lock_group()
held.
But still, this whole thing is confusing.

So this patch refactors out ext4_mb_good_group_nolock(), which should be
called when not holding ext4_lock_group().
Also, in further patches we hold the spinlock (ext4_lock_group()) while
doing any calculations which involve grp->bb_free or grp->bb_fragments.

Signed-off-by: Ritesh Harjani 
---
 fs/ext4/mballoc.c | 80 ++-
 1 file changed, 52 insertions(+), 28 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 33a69424942c..da11a4a738bd 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2066,15 +2066,16 @@ void ext4_mb_scan_aligned(struct 
ext4_allocation_context *ac,
 }
 
 /*
- * This is now called BEFORE we load the buddy bitmap.
+ * This is also called BEFORE we load the buddy bitmap.
  * Returns either 1 or 0 indicating that the group is either suitable
- * for the allocation or not. In addition it can also return negative
- * error code when something goes wrong.
+ * for the allocation or not. This should be called with ext4_lock_group()
+ * held to protect grp->bb_free from getting changed due to another
+ * allocation/discard is happening in parallel.
  */
-static int ext4_mb_good_group(struct ext4_allocation_context *ac,
+static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
ext4_group_t group, int cr)
 {
-   unsigned free, fragments;
+   ext4_grpblk_t free, fragments;
int flex_size = ext4_flex_bg_size(EXT4_SB(ac->ac_sb));
struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
 
@@ -2082,23 +2083,16 @@ static int ext4_mb_good_group(struct 
ext4_allocation_context *ac,
 
free = grp->bb_free;
if (free == 0)
-   return 0;
+   return false;
if (cr <= 2 && free < ac->ac_g_ex.fe_len)
-   return 0;
+   return false;
 
if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
-   return 0;
-
-   /* We only do this if the grp has never been initialized */
-   if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
-   int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
-   if (ret)
-   return ret;
-   }
+   return false;
 
fragments = grp->bb_fragments;
if (fragments == 0)
-   return 0;
+   return false;
 
switch (cr) {
case 0:
@@ -2108,31 +2102,63 @@ static int ext4_mb_good_group(struct 
ext4_allocation_context *ac,
if ((ac->ac_flags & EXT4_MB_HINT_DATA) &&
(flex_size >= EXT4_FLEX_SIZE_DIR_ALLOC_SCHEME) &&
((group % flex_size) == 0))
-   return 0;
+   return false;
 
if ((ac->ac_2order > ac->ac_sb->s_blocksize_bits+1) ||
(free / fragments) >= ac->ac_g_ex.fe_len)
-   return 1;
+   return true;
 
if (grp->bb_largest_free_order < ac->ac_2order)
-   return 0;
+   return false;
 
-   return 1;
+   return true;
case 1:
if ((free / fragments) >= ac->ac_g_ex.fe_len)
-   return 1;
+   return true;
break;
case 2:
if (free >= ac->ac_g_ex.fe_len)
-   return 1;
+   return true;
break;
case 3:
-   return 1;
+   return true;
default:
BUG();
}
 
-   return 0;
+   return false;
+}
+
+/*
+ * This could return negative error code if something goes wrong
+ * during ext4_mb_init_group(). This should not be called with
+ * ext4_lock_group() held.
+ */
+static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
+ext4_group_t group, int cr)
+{
+   struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
+   ext4_grpblk_t free;
+   int ret = 0;
+
+   free = grp->bb_free;
+   if (free == 0)
+   goto out;
+   if (cr <= 2 && free < ac->ac_g_ex.fe_len)
+   goto out;
+   if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
+

[RFCv4 0/6] Improve ext4 handling of ENOSPC with multi-threaded use-case

2020-05-10 Thread Ritesh Harjani
Hello All,

v3 -> v4:
1. Splitted code cleanups and debug improvements as a separate patch series.
2. Dropped rcu_barrier() approach since it did cause some latency
in my testing of ENOSPC handling.
3. This patch series takes a different approach to improve the multi-threaded
ENOSPC handling in ext4 mballoc code. Below mail gives more details.


Background
==
Consider a case where your disk is close to full but enough space still
remains for your multi-threaded application to run. Now when this
application's threads try to write (e.g. a sparse file followed by mmap
write, or even fallocate of multiple files) in parallel, then with the
current code of the ext4 multi-block allocator, the application may get an
ENOSPC error in some cases. Examining disk space at this time, we see there
is sufficient space remaining for the application to continue to run.

Additional info:

1. Our internal test team was easily able to reproduce this ENOSPC error on
   an upstream kernel with a 2GB ext4 image, with 64K blocksize. They didn't
   try above 2GB and reported this issue directly to the dev team. On
   examining the free space when the application gets ENOSPC, the free space
   left was more than 50% of the filesystem size in some cases.

2. For debugging/development of these patches, I used below script [1] to
   trigger this issue quite frequently on a 64K blocksize setup with 240MB
   ext4 image.


Summary of patches and problem with current design
==
There were 3 main problems which these patches tries to address and hence
improve the ENOSPC handling in ext4's multi-block allocator code.

1. Patch-2: Earlier we were deciding whether a group is good or not (i.e.
   checking if it has enough free blocks to serve the request) without taking
   the group's lock. This could result in a race where, if another thread is
   discarding the group's prealloc list, the allocation thread will not
   consider those about-to-be-freed blocks, will report that the group is not
   fit for allocation, and thus eventually fails with an ENOSPC error.

2. Patch-4: The discard-PA algorithm only scans the PA list to free up
   the additional blocks which got added to the PA. This is done by the
   same thread-A which at first couldn't allocate any blocks. But there
   is a window where, once the blocks were allocated (say by some other
   thread-B previously), we drop the group's lock and then check whether
   some of these blocks could be added to the prealloc list of the group
   from which we allocated. After that we take the lock and add these
   additional blocks allocated by thread-B to the PA list. But if
   thread-A scans the PA list within this time interval, there is a
   possibility that it won't find any blocks added to the PA list and
   hence may return an ENOSPC error.
   Hence this patch adds those additional blocks to the PA list right
   after the blocks are marked as used, with the same group's spinlock
   held.
   
3. Patch-3: Introduces a per cpu discard_pa_seq counter which is increased
   whenever there is block allocation/freeing or when the discarding of any
   group's PA list has started. With this we could know when to stop the
   retrying logic and return ENOSPC error if there is actually no free space
   left.
   There is an optimization done in the block allocation fast path with this
   approach that, before starting the block allocation, we only sample the
   percpu seq count on that cpu. Only when the allocation fails and discard
   couldn't free up any of the blocks in all of the group's PA list, that is
   when we sample the percpu seq counter sum over all possible cpus to check
   if we need to retry.


Testing:
=
Tested fstests with default bs of 4K and bs == PAGESIZE ("-g auto")
No new failures were reported with this patch series in this testing.

NOTE:
1. This patch series is based on top of mballoc code cleanup patch series
posted at [2].
2. Patch-2 & Patch-3 is intentionally kept separate for better reviewer's
   attention on what each patch is trying to address.

References:
===
[v3]: 
https://patchwork.ozlabs.org/project/linux-ext4/cover/cover.1588313626.git.rite...@linux.ibm.com/
[1]: 
https://github.com/riteshharjani/LinuxStudy/blob/master/tools/test_mballoc.sh
[2]: 
https://patchwork.ozlabs.org/project/linux-ext4/cover/cover.1589086800.git.rite...@linux.ibm.com/


Ritesh Harjani (6):
  ext4: mballoc: Refactor ext4_mb_good_group()
  ext4: mballoc: Use ext4_lock_group() around calculations involving bb_free
  ext4: mballoc: Optimize ext4_mb_good_group_nolock further if grp needs init
  ext4: mballoc: Add blocks to PA list under same spinlock after allocating 
blocks
  ext4: mballoc: Refactor ext4_mb_discard_preallocations()
  ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling

 fs/ext4/mballoc.c | 249 +-
 1 

Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast

2019-10-23 Thread Ritesh Harjani




On 10/23/19 1:41 AM, Al Viro wrote:

On Tue, Oct 22, 2019 at 03:37:36PM +0100, Al Viro wrote:

On Tue, Oct 22, 2019 at 07:08:54PM +0530, Ritesh Harjani wrote:

I think we have still not taken this patch. Al?



or, for that matter, any callers of filename_lookup() assuming that the
lack of ENOENT means that the last call of walk_component() (from lookup_last())
has not failed with the same ENOENT and thus the result has been observed
positive.
You've picked the easiest one to hit, but on e.g. KVM setups you can have the
host thread representing the CPU where __d_set_inode_and_type() runs get
preempted (by the host kernel), leaving others with much wider window.


I had thought so about the other call sites, but as you said
maybe this was the easiest one to hit.
Then I kind of followed your suggested fix in the link below, to fix at
least this current crash.

https://patchwork.kernel.org/patch/10909881/



Sure, we can do that to all callers of d_is_negative/d_is_positive, but...
the same goes for any places that assumes that d_is_dir() implies that
the sucker is positive, etc.

What we have guaranteed is
* ->d_lock serializes ->d_flags/->d_inode changes
* ->d_seq is bumped before/after such changes
* positive dentry never changes ->d_inode as long as you hold
a reference (negative dentry *can* become positive right under you)

So there are 3 classes of valid users: those holding ->d_lock, those
sampling and rechecking ->d_seq and those relying upon having observed
the sucker they've pinned to be positive.


:) Thanks for simplifying like this. Agreed.





What you've been hitting is "we have it pinned, ->d_flags says it's
positive but we still observe the value of ->d_inode from before the
store to ->d_flags that has made it look positive".


Actually, your patch opens another problem there.  Suppose you make
it d_really_is_positive() and hit the same race sans reordering.
Dentry is found by __d_lookup() and is negative.  Right after we
return from __d_lookup() another thread makes it positive (a symlink)
- ->d_inode is set, d_really_is_positive() becomes true.  OK, on we
go, pick the inode and move on.  Right?  ->d_flags is still not set
by the thread that made it positive.  We return from lookup_fast()
and call step_into().  And get to
 if (likely(!d_is_symlink(path->dentry)) ||
Which checks ->d_flags and sees the value from before the sucker
became positive.  IOW, d_is_symlink() is false here.  If that
was the last path component and we'd been told to follow links,
we will end up with positive dentry of a symlink coming out of
pathname resolution.



hmm. yes, looks like it, based on your above explanation.
So even though this could avoid the crash, we may still end up with
a bogus entry with the current patch.




Similar fun happens if you have mkdir racing with lookup - ENOENT
is what should've happened if lookup comes first, success - if
mkdir does.  This way we can hit ENOTDIR, due to false negative
from d_can_lookup().

IOW, d_really_is_negative() in lookup_fast() will paper over
one of oopsen, but it
* won't cover similar oopsen on other codepaths and
* will lead to bogus behaviour.

I'm not sure that blanket conversion of d_is_... to smp_load_acquire()
is the right solution; it might very well be that we need to do that
only on a small subset of call sites, lookup_fast() being one of
those.  But we do want at least to be certain that something we'd
got from lookup_fast() in non-RCU mode already has ->d_flags visible.


We may also need similar guarantees with __d_clear_type_and_inode().

So do you think we should make use of ->d_seq for verifying this?
I see both __d_set_inode_and_type & __d_clear_type_and_inode() called
under ->d_seq_begin/->d_seq_end.

Then maybe we should use ->d_seq checking at those call sites.
We cannot unconditionally use ->d_seq checking in __d_entry_type(),
since we sometimes call this function inside ->d_seq_begin
(like in lookup_fast).
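
Just to make sure I understand the ordering requirement: an
acquire/release pairing along the lines you mention would look roughly
like the sketch below (purely an illustration of the guarantee being
discussed, not a proposed patch):

	/* writer (__d_set_inode_and_type()-like): publish ->d_inode before
	 * the flags that make the dentry look positive */
	dentry->d_inode = inode;
	smp_store_release(&dentry->d_flags, flags);

	/* reader (d_is_*()-like): the acquire on ->d_flags guarantees that
	 * the writer's ->d_inode store above is visible afterwards */
	flags = smp_load_acquire(&dentry->d_flags);
	if ((flags & DCACHE_ENTRY_TYPE) != DCACHE_MISS_TYPE)
		inode = dentry->d_inode;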




I'm going through the callers right now, will post a followup once
the things get cleaner...


Thanks for looking into this.



Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast

2019-10-22 Thread Ritesh Harjani

I think we have still not taken this patch. Al?


On 10/15/19 9:37 AM, Ritesh Harjani wrote:

ping!!

On 9/27/19 10:12 AM, Ritesh Harjani wrote:

d_is_negative can race with d_instantiate_new()
-> __d_set_inode_and_type().
For e.g. in use cases where Thread-1 is creating
symlink (doing d_instantiate_new()) & Thread-2 is doing
cat of that symlink while doing lookup_fast (via REF-walk-
one such case is, when ->permission returns -ECHILD).

During this race, if __d_set_inode_and_type() does out-of-order
execution and sets dentry->d_flags before setting
dentry->d_inode, then it can result in the following kernel panic.

This change fixes the issue by directly checking for inode.

E.g. kernel panic, since inode was NULL.
trailing_symlink() -> may_follow_link() -> inode->i_uid.
Issue signature:-
   [NIP  : trailing_symlink+80]
   [LR   : trailing_symlink+1092]
   #4 [c0198069bb70] trailing_symlink at c04bae60  
(unreliable)

   #5 [c0198069bc00] path_openat at c04bdd14
   #6 [c0198069bc90] do_filp_open at c04c0274
   #7 [c0198069bdb0] do_sys_open at c049b248
   #8 [c0198069be30] system_call at c000b388

Sequence of events:-
Thread-2(Comm: ln)    Thread-1(Comm: cat)

    dentry = __d_lookup() //nonRCU

__d_set_and_inode_type() (Out-of-order execution)
 flags = READ_ONCE(dentry->d_flags);
 flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
 flags |= type_flags;
 WRITE_ONCE(dentry->d_flags, flags);

    if (unlikely(d_is_negative()) // fails
   {}
    // since d_flags is already updated in
    // Thread-2 in parallel but inode
    // not yet set.
    // d_is_negative returns false

    *inode = d_backing_inode(path->dentry);
    // means inode is still NULL

 dentry->d_inode = inode;

    trailing_symlink()
    may_follow_link()
    inode = nd->link_inode;
    // nd->link_inode = NULL
    //Then it crashes while
    //doing inode->i_uid

Reported-by: Guang Yuan Wu 
Tested-by: Guang Yuan Wu 
Acked-by: Jeff Layton 
Signed-off-by: Ritesh Harjani 
---
  fs/namei.c | 16 +++-
  1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 671c3c1a3425..7c5337cddebd 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1617,7 +1617,21 @@ static int lookup_fast(struct nameidata *nd,
  dput(dentry);
  return status;
  }
-    if (unlikely(d_is_negative(dentry))) {
+
+    /*
+ * Caution: d_is_negative() can race with
+ * __d_set_inode_and_type().
+ * For e.g. in use cases where Thread-1 is creating
+ * symlink (doing d_instantiate_new()) & Thread-2 is doing
+ * cat of that symlink and falling here (via Ref-walk) while
+ * doing lookup_fast (one such case is when ->permission
+ * returns -ECHILD).
+ * Now if __d_set_inode_and_type() does out-of-order execution
+ * i.e. it first sets the dentry->d_flags & then dentry->inode
+ * then it can result into inode being NULL, causing panic later.
+ * Hence directly check if inode is NULL here.
+ */
+    if (unlikely(d_really_is_negative(dentry))) {
  dput(dentry);
  return -ENOENT;
  }







Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast

2019-10-14 Thread Ritesh Harjani

ping!!

On 9/27/19 10:12 AM, Ritesh Harjani wrote:

d_is_negative can race with d_instantiate_new()
-> __d_set_inode_and_type().
For e.g. in use cases where Thread-1 is creating
symlink (doing d_instantiate_new()) & Thread-2 is doing
cat of that symlink while doing lookup_fast (via REF-walk-
one such case is, when ->permission returns -ECHILD).

During this race, if __d_set_inode_and_type() does out-of-order
execution and sets dentry->d_flags before setting
dentry->d_inode, then it can result in the following kernel panic.

This change fixes the issue by directly checking for inode.

E.g. kernel panic, since inode was NULL.
trailing_symlink() -> may_follow_link() -> inode->i_uid.
Issue signature:-
   [NIP  : trailing_symlink+80]
   [LR   : trailing_symlink+1092]
   #4 [c0198069bb70] trailing_symlink at c04bae60  (unreliable)
   #5 [c0198069bc00] path_openat at c04bdd14
   #6 [c0198069bc90] do_filp_open at c04c0274
   #7 [c0198069bdb0] do_sys_open at c049b248
   #8 [c0198069be30] system_call at c000b388

Sequence of events:-
Thread-2(Comm: ln) Thread-1(Comm: cat)

dentry = __d_lookup() //nonRCU

__d_set_and_inode_type() (Out-of-order execution)
 flags = READ_ONCE(dentry->d_flags);
 flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
 flags |= type_flags;
 WRITE_ONCE(dentry->d_flags, flags);

if (unlikely(d_is_negative()) // fails
   {}
// since d_flags is already updated in
// Thread-2 in parallel but inode
// not yet set.
// d_is_negative returns false

*inode = d_backing_inode(path->dentry);
// means inode is still NULL

 dentry->d_inode = inode;

trailing_symlink()
may_follow_link()
inode = nd->link_inode;
// nd->link_inode = NULL
//Then it crashes while
//doing inode->i_uid

Reported-by: Guang Yuan Wu 
Tested-by: Guang Yuan Wu 
Acked-by: Jeff Layton 
Signed-off-by: Ritesh Harjani 
---
  fs/namei.c | 16 +++-
  1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 671c3c1a3425..7c5337cddebd 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1617,7 +1617,21 @@ static int lookup_fast(struct nameidata *nd,
dput(dentry);
return status;
}
-   if (unlikely(d_is_negative(dentry))) {
+
+   /*
+* Caution: d_is_negative() can race with
+* __d_set_inode_and_type().
+* For e.g. in use cases where Thread-1 is creating
+* symlink (doing d_instantiate_new()) & Thread-2 is doing
+* cat of that symlink and falling here (via Ref-walk) while
+* doing lookup_fast (one such case is when ->permission
+* returns -ECHILD).
+* Now if __d_set_inode_and_type() does out-of-order execution
+* i.e. it first sets the dentry->d_flags & then dentry->inode
+* then it can result into inode being NULL, causing panic later.
+* Hence directly check if inode is NULL here.
+*/
+   if (unlikely(d_really_is_negative(dentry))) {
dput(dentry);
return -ENOENT;
}





[PATCH 1/1] direct-io: Change is_async member of struct dio from int to bool

2019-09-26 Thread Ritesh Harjani
Change is_async member of struct dio from int to bool.
Found this during code review.

Signed-off-by: Ritesh Harjani 
---
 fs/direct-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index ae196784f487..b139876653ef 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -128,7 +128,7 @@ struct dio {
/* BIO completion state */
spinlock_t bio_lock;/* protects BIO fields below */
int page_errors;/* errno from get_user_pages() */
-   int is_async;   /* is IO async ? */
+   bool is_async;  /* is IO async ? */
bool defer_completion;  /* defer AIO completion to workqueue? */
bool should_dirty;  /* if pages should be dirtied */
int io_error;   /* IO error in completion path */
-- 
2.21.0



[PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast

2019-09-26 Thread Ritesh Harjani
d_is_negative can race with d_instantiate_new()
-> __d_set_inode_and_type().
For e.g. in use cases where Thread-1 is creating
symlink (doing d_instantiate_new()) & Thread-2 is doing
cat of that symlink while doing lookup_fast (via REF-walk-
one such case is, when ->permission returns -ECHILD).

During this race, if __d_set_inode_and_type() does out-of-order
execution and sets dentry->d_flags before setting
dentry->d_inode, then it can result in the following kernel panic.

This change fixes the issue by directly checking for inode.

E.g. kernel panic, since inode was NULL.
trailing_symlink() -> may_follow_link() -> inode->i_uid.
Issue signature:-
  [NIP  : trailing_symlink+80]
  [LR   : trailing_symlink+1092]
  #4 [c0198069bb70] trailing_symlink at c04bae60  (unreliable)
  #5 [c0198069bc00] path_openat at c04bdd14
  #6 [c0198069bc90] do_filp_open at c04c0274
  #7 [c0198069bdb0] do_sys_open at c049b248
  #8 [c0198069be30] system_call at c000b388

Sequence of events:-
Thread-2(Comm: ln) Thread-1(Comm: cat)

dentry = __d_lookup() //nonRCU

__d_set_and_inode_type() (Out-of-order execution)
flags = READ_ONCE(dentry->d_flags);
flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
flags |= type_flags;
WRITE_ONCE(dentry->d_flags, flags);

if (unlikely(d_is_negative(dentry))) // fails
   {}
// since d_flags is already updated in
// Thread-2 in parallel but inode
// not yet set.
// d_is_negative returns false

*inode = d_backing_inode(path->dentry);
// means inode is still NULL

dentry->d_inode = inode;

trailing_symlink()
may_follow_link()
inode = nd->link_inode;
// nd->link_inode = NULL
//Then it crashes while
//doing inode->i_uid
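
(For reference, the difference between the two checks boils down to the
following - a simplified sketch of the dcache helpers, not the full
include/linux/dcache.h definitions; the _sketch suffix marks them as
illustrative:)

	/* flags based: Thread-2 can observe the updated d_flags before
	 * dentry->d_inode is visible, so this can claim "positive" too early */
	static inline bool d_is_negative_sketch(const struct dentry *dentry)
	{
		return (dentry->d_flags & DCACHE_ENTRY_TYPE) == DCACHE_MISS_TYPE;
	}

	/* inode pointer based: cannot report "positive" while d_inode is NULL */
	static inline bool d_really_is_negative_sketch(const struct dentry *dentry)
	{
		return dentry->d_inode == NULL;
	}

Since the second form looks at dentry->d_inode directly, it can never
report a positive dentry while the inode pointer is still NULL.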

Reported-by: Guang Yuan Wu 
Tested-by: Guang Yuan Wu 
Acked-by: Jeff Layton 
Signed-off-by: Ritesh Harjani 
---
 fs/namei.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 671c3c1a3425..7c5337cddebd 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1617,7 +1617,21 @@ static int lookup_fast(struct nameidata *nd,
dput(dentry);
return status;
}
-   if (unlikely(d_is_negative(dentry))) {
+
+   /*
+* Caution: d_is_negative() can race with
+* __d_set_inode_and_type().
+* For e.g. in use cases where Thread-1 is creating
+* symlink (doing d_instantiate_new()) & Thread-2 is doing
+* cat of that symlink and falling here (via Ref-walk) while
+* doing lookup_fast (one such case is when ->permission
+* returns -ECHILD).
+* Now if __d_set_inode_and_type() does out-of-order execution
+* i.e. it first sets the dentry->d_flags & then dentry->inode
+* then it can result into inode being NULL, causing panic later.
+* Hence directly check if inode is NULL here.
+*/
+   if (unlikely(d_really_is_negative(dentry))) {
dput(dentry);
return -ENOENT;
}
-- 
2.21.0



[RFC-v2 1/2] ext4: Add ext4_ilock & ext4_iunlock API

2019-09-22 Thread Ritesh Harjani
This adds the ext4_ilock/ext4_iunlock APIs.
These are preparatory APIs to make shared
locking/unlocking & restarting with exclusive
locking/unlocking easier in the next patch.

Along with the above, this also addresses the AIM7
regression that was previously fixed only for XFS in
commit 942491c9e6d6 ("xfs: fix AIM7 regression")
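
As an illustration of how these helpers are meant to be used (a
hypothetical sketch, not code from this series - example_dio_read() is
a made-up name and the actual I/O is elided), the trylock variant maps
naturally onto IOCB_NOWAIT handling:

	static ssize_t example_dio_read(struct kiocb *iocb, struct iov_iter *to)
	{
		struct inode *inode = file_inode(iocb->ki_filp);
		ssize_t ret;

		/* IOCB_NOWAIT callers must not sleep on the inode lock */
		if (iocb->ki_flags & IOCB_NOWAIT) {
			if (!ext4_ilock_nowait(inode, EXT4_IOLOCK_SHARED))
				return -EAGAIN;
		} else {
			ext4_ilock(inode, EXT4_IOLOCK_SHARED);
		}

		/* ... the actual direct I/O would be issued here ... */
		ret = iov_iter_count(to);

		ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
		return ret;
	}

This is the same trylock-for-IOCB_NOWAIT pattern the referenced XFS
commit uses to avoid the AIM7 regression.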

Signed-off-by: Ritesh Harjani 
---
 fs/ext4/ext4.h| 33 +
 fs/ext4/extents.c | 16 ++--
 fs/ext4/file.c| 63 +--
 fs/ext4/inode.c   |  4 +--
 fs/ext4/ioctl.c   | 16 ++--
 fs/ext4/super.c   | 12 -
 fs/ext4/xattr.c   | 16 ++--
 7 files changed, 98 insertions(+), 62 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 2ab91815f52d..9ffafbe6bc3f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2945,6 +2945,39 @@ static inline int ext4_update_inode_size(struct inode 
*inode, loff_t newsize)
return changed;
 }
 
+#define EXT4_IOLOCK_EXCL   (1 << 0)
+#define EXT4_IOLOCK_SHARED (1 << 1)
+
+static inline void ext4_ilock(struct inode *inode, unsigned int iolock)
+{
+   if (iolock == EXT4_IOLOCK_EXCL)
+   inode_lock(inode);
+   else
+   inode_lock_shared(inode);
+}
+
+static inline void ext4_iunlock(struct inode *inode, unsigned int iolock)
+{
+   if (iolock == EXT4_IOLOCK_EXCL)
+   inode_unlock(inode);
+   else
+   inode_unlock_shared(inode);
+}
+
+static inline int ext4_ilock_nowait(struct inode *inode, unsigned int iolock)
+{
+   if (iolock == EXT4_IOLOCK_EXCL)
+   return inode_trylock(inode);
+   else
+   return inode_trylock_shared(inode);
+}
+
+static inline void ext4_ilock_demote(struct inode *inode, unsigned int iolock)
+{
+   BUG_ON(iolock != EXT4_IOLOCK_EXCL);
+   downgrade_write(&inode->i_rwsem);
+}
+
 int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
  loff_t len);
 
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index a869e206bd81..ef37f4d4ee7e 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4680,7 +4680,7 @@ static long ext4_zero_range(struct file *file, loff_t 
offset,
else
max_blocks -= lblk;
 
-   inode_lock(inode);
+   ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 
/*
 * Indirect files do not support unwritten extnets
@@ -4790,7 +4790,7 @@ static long ext4_zero_range(struct file *file, loff_t 
offset,
 
ext4_journal_stop(handle);
 out_mutex:
-   inode_unlock(inode);
+   ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
return ret;
 }
 
@@ -4856,7 +4856,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t 
offset, loff_t len)
if (mode & FALLOC_FL_KEEP_SIZE)
flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
 
-   inode_lock(inode);
+   ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 
/*
 * We only support preallocation for extent-based files only
@@ -4887,7 +4887,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t 
offset, loff_t len)
EXT4_I(inode)->i_sync_tid);
}
 out:
-   inode_unlock(inode);
+   ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
return ret;
 }
@@ -5387,7 +5387,7 @@ int ext4_collapse_range(struct inode *inode, loff_t 
offset, loff_t len)
return ret;
}
 
-   inode_lock(inode);
+   ext4_ilock(inode, EXT4_IOLOCK_EXCL);
/*
 * There is no need to overlap collapse range with EOF, in which case
 * it is effectively a truncate operation
@@ -5486,7 +5486,7 @@ int ext4_collapse_range(struct inode *inode, loff_t 
offset, loff_t len)
 out_mmap:
up_write(&EXT4_I(inode)->i_mmap_sem);
 out_mutex:
-   inode_unlock(inode);
+   ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
return ret;
 }
 
@@ -5537,7 +5537,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, 
loff_t len)
return ret;
}
 
-   inode_lock(inode);
+   ext4_ilock(inode, EXT4_IOLOCK_EXCL);
/* Currently just for extent based files */
if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
ret = -EOPNOTSUPP;
@@ -5664,7 +5664,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, 
loff_t len)
 out_mmap:
up_write(&EXT4_I(inode)->i_mmap_sem);
 out_mutex:
-   inode_unlock(inode);
+   ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
return ret;
 }
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index d2ff383a8b9f..ce1cecbae932 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -57,14 +57,15 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, 
struct iov_iter *to)
/*
 * Get exclusion from truncate and other inode operations.
 */
-   if (!inode_trylock_shared(in

[RFC-v2 0/2] ext4: Improve ilock scalability to improve performance in DirectIO workloads

2019-09-22 Thread Ritesh Harjani
[truncated ASCII fio performance plot: only the bottom rows and the "No. of threads" axis (1, 2, 4, 8, 12, 16, 24) remain]



Performance(%) data randrw (read)-1M
  3 +-+--++++---+++--+-+   
|++++   +++|   
|  |   
  2 +-+   ** +-+   
| **   **  |   
| ** ****  |   
  1 +-+   **   ******+-+   
| **   ******  |   
  0 +-+   **   **   **  **   **   **   **+-+   
|   **  ****   |   
|   **  ** |   
 -1 +-+ **   +-+   
|   ** |   
|  |   
 -2 +-+  +-+   
|  |   
|++++   +++|   
 -3 +-+--++++---+++--+-+   
1,   2,   4,   8,  12,  16,  24,   
No. of threads 
   
   
Performance(%) data randrw (write)-1M
  3 +-+--++++---+++--+-+   
|++++   +++**  |   
|  **  |   
  2 +-+   **   **+-+   
| **   **  |   
| ** ****  |   
  1 +-+   **   ****   **   **+-+   
| **   ****   **   **  |   
  0 +-+   **   **   **  **   **   **   **+-+   
|   **  ** |   
|   **  ** |   
 -1 +-+ **   +-+   
|   ** |   
|  |   
 -2 +-+  +-+   
|  |   
|++++   +++|   
 -3 +-+--++++---+++--+-+   
1,   2,   4,   8,  12,  16,  24,   
No. of threads 
  

Git tree- https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC-v2

References
==
[1]: 
https://lore.kernel.org/linux-ext4/1566871552-60946-4-git-send-email-joseph...@linux.alibaba.com/
[2]: https://lore.kernel.org/linux-ext4/20190910215720.ga7...@quack2.suse.cz/
[3]: 
https://lore.kernel.org/linux-fsdevel/20190911103117.e32c34c...@d06av22.portsmouth.uk.ibm.com/
[4]: https://lwn.net/Articles/799184/

[5]: 
https://github.com/riteshharjani/LinuxStudy/blob/master/ext4/fio-output/vanilla-vs-ilocknew-randrw-dioread-nolock-4K.png
[6]: 
https://github.com/riteshharjani/LinuxStudy/blob/master/ext4/fio-output/vanilla-vs-ilocknew-randrw-dioread-nolock-1M.pg

Ritesh Harjani (2):
  ext4: Add ext4_ilock & ext4_iunlock API
  ext4: Improve DIO writes locking sequence

 fs/ext4/ext4.h|  33 ++
 fs/ext4/extents.c |  16 +--
 fs/ext4/file.c| 260 --
 fs/ext4/inode.c   |   4 +-
 fs/ext4/ioctl.c   |  16 +--
 fs/ext4/super.c   |  12 +--
 fs/ext4/xattr.c   |  16 +--
 7 files changed, 249 insertions(+), 108 deletions(-)

-- 
2.21.0



[RFC-v2 2/2] ext4: Improve DIO writes locking sequence

2019-09-22 Thread Ritesh Harjani
Earlier there was no shared lock in the DIO read path.
But commit 16c54688592ce ("ext4: Allow parallel DIO reads")
simplified some of the locking mechanism while still allowing
for parallel DIO reads by adding a shared lock in the inode DIO
read path.

But this created a problem with mixed read/write workloads.
It is due to the fact that in the DIO write path we first start
with the exclusive lock, and only when we determine that it is an
overwrite IO do we downgrade the lock. This causes the problem,
since with the above commit we have shared locking in DIO reads.

So, this patch tries to fix the issue by starting with
the shared lock and then switching to the exclusive lock only
when required, based on ext4_dio_write_checks().

Other than that, it also simplifies the below cases:-

1. Simplified the ext4_unaligned_aio API to
ext4_unaligned_io.
The previous API was abused in the sense that it was
not really checking for AIO anywhere, and it also used to
check for extending writes.
So this API was renamed and simplified to ext4_unaligned_io(),
which actually only checks whether the IO is really unaligned.

This is because in all other cases inode_dio_wait()
will anyway become a no-op. So there is no need to overcomplicate
things by checking for every condition here.

2. Added ext4_extending_io API. This checks if the IO
is extending the file.

Now we only check for
unaligned_io, extend, dioread_nolock & overwrite.
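
In other words, the intended flow is roughly the following (a
hypothetical sketch only; ext4_dio_write_needs_excl() is an
illustrative name, not a function added by this patch, and the actual
I/O is elided):

	static ssize_t sketch_dio_write(struct kiocb *iocb, struct iov_iter *from)
	{
		struct inode *inode = file_inode(iocb->ki_filp);
		unsigned int iolock = EXT4_IOLOCK_SHARED;
		ssize_t ret;

	restart:
		ext4_ilock(inode, iolock);

		if (iolock == EXT4_IOLOCK_SHARED &&
		    ext4_dio_write_needs_excl(iocb, from)) {
			/* wrong lock mode: drop the shared lock and restart
			 * the whole operation with the exclusive lock */
			ext4_iunlock(inode, iolock);
			iolock = EXT4_IOLOCK_EXCL;
			goto restart;
		}

		/* ... unaligned/extending/overwrite handling and the actual
		 * direct I/O would go here ... */
		ret = iov_iter_count(from);

		ext4_iunlock(inode, iolock);
		return ret;
	}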

Signed-off-by: Ritesh Harjani 
---
 fs/ext4/file.c | 213 -
 1 file changed, 159 insertions(+), 54 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index ce1cecbae932..faa8dcf9c08d 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -166,19 +166,25 @@ static int ext4_release_file(struct inode *inode, struct 
file *filp)
  * threads are at work on the same unwritten block, they must be synchronized
  * or one thread will zero the other's data, causing corruption.
  */
-static int
-ext4_unaligned_aio(struct inode *inode, struct iov_iter *from, loff_t pos)
+static bool
+ext4_unaligned_io(struct inode *inode, struct iov_iter *from, loff_t pos)
 {
struct super_block *sb = inode->i_sb;
-   int blockmask = sb->s_blocksize - 1;
-
-   if (pos >= ALIGN(i_size_read(inode), sb->s_blocksize))
-   return 0;
+   unsigned long blockmask = sb->s_blocksize - 1;
 
if ((pos | iov_iter_alignment(from)) & blockmask)
-   return 1;
+   return true;
 
-   return 0;
+   return false;
+}
+
+static bool
+ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
+{
+   if (offset + len > i_size_read(inode) ||
+   offset + len > EXT4_I(inode)->i_disksize)
+   return true;
+   return false;
 }
 
 /* Is IO overwriting allocated and initialized blocks? */
@@ -204,7 +210,9 @@ static bool ext4_overwrite_io(struct inode *inode, loff_t 
pos, loff_t len)
return err == blklen && (map.m_flags & EXT4_MAP_MAPPED);
 }
 
-static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
+
+static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
+struct iov_iter *from)
 {
struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;
@@ -216,10 +224,6 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, 
struct iov_iter *from)
if (ret <= 0)
return ret;
 
-   ret = file_modified(iocb->ki_filp);
-   if (ret)
-   return 0;
-
/*
 * If we have encountered a bitmap-format file, the size limit
 * is smaller than s_maxbytes, which is for extent-mapped files.
@@ -231,9 +235,26 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, 
struct iov_iter *from)
return -EFBIG;
iov_iter_truncate(from, sbi->s_bitmap_maxbytes - iocb->ki_pos);
}
+
return iov_iter_count(from);
 }
 
+static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
+{
+   ssize_t ret;
+   ssize_t count;
+
+   count = ext4_generic_write_checks(iocb, from);
+   if (count <= 0)
+   return count;
+
+   ret = file_modified(iocb->ki_filp);
+   if (ret)
+   return ret;
+
+   return count;
+}
+
 static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
struct iov_iter *from)
 {
@@ -336,6 +357,83 @@ static int ext4_handle_failed_inode_extension(struct inode 
*inode, loff_t size)
return 0;
 }
 
+/*
+ * The intention here is to start with shared lock acquired
+ * (except in unaligned IO & extending writes case),
+ * then see if any condition requires an exclusive inode
+ * lock. If yes, then we restart the whole operation by
+ * releasing the shared lock and acquiring exclusive lock.
+ *
+ * - For unaligned_io we never take exclusive lock as it
+ *   may cause data corruption when two unaligned IO tries
+ *   to mod

Re: [PATCH V3 07/10] mmc: tegra: add Tegra186 WAR for CQE

2019-03-13 Thread Ritesh Harjani



On 3/14/2019 3:15 AM, Sowjanya Komatineni wrote:

Tegra186 CQHCI host has a known bug where CQHCI controller selects
DATA_PRESENT_SELECT bit to 1 for DCMDs with R1B response type and
since DCMD does not trigger any data transfer, DCMD task complete
happens leaving the DATA FSM of host controller in wait state for
the data.

This effects the data transfer tasks issued after the DCMDs with
R1b response type resulting in timeout.

SW WAR is to set CMD_TIMING to 1 in DCMD task descriptor. This bug
and SW WAR is applicable only for Tegra186 and not for Tegra194.

This patch implements this WAR thru NVQUIRK_CQHCI_DCMD_R1B_CMD_TIMING
for Tegra186 and also implements update_dcmd_desc of cqhci_host_ops
interface to set CMD_TIMING bit depending on the NVQUIRK.

Tested-by: Jon Hunter 
Signed-off-by: Sowjanya Komatineni 


Thanks,

Reviewed-by: Ritesh Harjani 

---
  drivers/mmc/host/sdhci-tegra.c | 17 -
  1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/host/sdhci-tegra.c b/drivers/mmc/host/sdhci-tegra.c
index f1aa0591112a..2f08b6e480df 100644
--- a/drivers/mmc/host/sdhci-tegra.c
+++ b/drivers/mmc/host/sdhci-tegra.c
@@ -106,6 +106,7 @@
  #define NVQUIRK_HAS_PADCALIB  BIT(6)
  #define NVQUIRK_NEEDS_PAD_CONTROL BIT(7)
  #define NVQUIRK_DIS_CARD_CLK_CONFIG_TAP   BIT(8)
+#define NVQUIRK_CQHCI_DCMD_R1B_CMD_TIMING  BIT(9)
  
  /* SDMMC CQE Base Address for Tegra Host Ver 4.1 and Higher */

  #define SDHCI_TEGRA_CQE_BASE_ADDR 0xF000
@@ -1123,6 +1124,18 @@ static void tegra_sdhci_voltage_switch(struct sdhci_host 
*host)
tegra_host->pad_calib_required = true;
  }
  
+static void sdhci_tegra_update_dcmd_desc(struct mmc_host *mmc,

+struct mmc_request *mrq, u64 *data)
+{
+   struct sdhci_pltfm_host *pltfm_host = sdhci_priv(mmc_priv(mmc));
+   struct sdhci_tegra *tegra_host = sdhci_pltfm_priv(pltfm_host);
+   const struct sdhci_tegra_soc_data *soc_data = tegra_host->soc_data;
+
+   if (soc_data->nvquirks & NVQUIRK_CQHCI_DCMD_R1B_CMD_TIMING &&
+   mrq->cmd->flags & MMC_RSP_R1B)
+   *data |= CQHCI_CMD_TIMING(1);
+}
+
  static void sdhci_tegra_cqe_enable(struct mmc_host *mmc)
  {
struct cqhci_host *cq_host = mmc->cqe_private;
@@ -1164,6 +1177,7 @@ static const struct cqhci_host_ops sdhci_tegra_cqhci_ops 
= {
.enable = sdhci_tegra_cqe_enable,
.disable = sdhci_cqe_disable,
.dumpregs = sdhci_tegra_dumpregs,
+   .update_dcmd_desc = sdhci_tegra_update_dcmd_desc,
  };
  
  static const struct sdhci_ops tegra_sdhci_ops = {

@@ -1345,7 +1359,8 @@ static const struct sdhci_tegra_soc_data 
soc_data_tegra186 = {
NVQUIRK_HAS_PADCALIB |
NVQUIRK_DIS_CARD_CLK_CONFIG_TAP |
NVQUIRK_ENABLE_SDR50 |
-   NVQUIRK_ENABLE_SDR104,
+   NVQUIRK_ENABLE_SDR104 |
+   NVQUIRK_CQHCI_DCMD_R1B_CMD_TIMING,
.min_tap_delay = 84,
.max_tap_delay = 136,
  };


Re: [PATCH V3 06/10] mmc: cqhci: allow hosts to update dcmd cmd desc

2019-03-13 Thread Ritesh Harjani



On 3/14/2019 3:15 AM, Sowjanya Komatineni wrote:

This patch adds update_dcmd_desc interface to cqhci_host_ops to
allow hosts to update any of the DCMD task descriptor attributes
and parameters.

Tested-by: Jon Hunter 
Signed-off-by: Sowjanya Komatineni 


Thanks,

Reviewed-by: Ritesh Harjani 


---
  drivers/mmc/host/cqhci.c | 2 ++
  drivers/mmc/host/cqhci.h | 2 ++
  2 files changed, 4 insertions(+)

diff --git a/drivers/mmc/host/cqhci.c b/drivers/mmc/host/cqhci.c
index a8af682a9182..d59cb0a51964 100644
--- a/drivers/mmc/host/cqhci.c
+++ b/drivers/mmc/host/cqhci.c
@@ -537,6 +537,8 @@ static void cqhci_prep_dcmd_desc(struct mmc_host *mmc,
 CQHCI_ACT(0x5) |
 CQHCI_CMD_INDEX(mrq->cmd->opcode) |
 CQHCI_CMD_TIMING(timing) | CQHCI_RESP_TYPE(resp_type));
+   if (cq_host->ops->update_dcmd_desc)
+   cq_host->ops->update_dcmd_desc(mmc, mrq, &data);
*task_desc |= data;
desc = (u8 *)task_desc;
pr_debug("%s: cqhci: dcmd: cmd: %d timing: %d resp: %d\n",
diff --git a/drivers/mmc/host/cqhci.h b/drivers/mmc/host/cqhci.h
index 9e68286a07b4..8c8ec6f01c45 100644
--- a/drivers/mmc/host/cqhci.h
+++ b/drivers/mmc/host/cqhci.h
@@ -210,6 +210,8 @@ struct cqhci_host_ops {
u32 (*read_l)(struct cqhci_host *host, int reg);
void (*enable)(struct mmc_host *mmc);
void (*disable)(struct mmc_host *mmc, bool recovery);
+   void (*update_dcmd_desc)(struct mmc_host *mmc, struct mmc_request *mrq,
+u64 *data);
  };
  
  static inline void cqhci_writel(struct cqhci_host *host, u32 val, int reg)


Re: [PATCH V1 07/11] mmc: cqhci: add quirk for setting DCMD CMD_TIMING

2019-03-12 Thread Ritesh Harjani



On 3/7/2019 11:46 PM, Sowjanya Komatineni wrote:

On 3/6/2019 6:30 PM, Adrian Hunter wrote:

On 2/03/19 7:20 AM, Sowjanya Komatineni wrote:

This patch adds a quirk for setting CMD_TIMING to 1 in descriptor for
DCMD with R1B response type to allow the command to be sent to device
during data activity or busy time.

Tegra186 CQHCI host has bug where it selects DATA_PRESENT_SELECT to 1
by CQHCI controller for DCMDs with R1B response type and since DCMD
does not trigger any data transfer, DCMD task complete happens
leaving the DATA FSM of host controller in wait state for data.

This effects the data transfer task issued after R1B DCMD task and no
interrupt is generated for the data transfer task.

SW WAR for this issue is to set CMD_TIMING bit to 1 in DCMD task
descriptor and as DCMD task descriptor preparation is done by cqhci
driver, this patch adds cqequirk to handle this.

Signed-off-by: Sowjanya Komatineni 
---
   drivers/mmc/host/cqhci.c | 5 -
   drivers/mmc/host/cqhci.h | 1 +
   2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/host/cqhci.c b/drivers/mmc/host/cqhci.c
index a8af682a9182..b34c07125f32 100644
--- a/drivers/mmc/host/cqhci.c
+++ b/drivers/mmc/host/cqhci.c
@@ -521,7 +521,10 @@ static void cqhci_prep_dcmd_desc(struct mmc_host *mmc,
} else {
if (mrq->cmd->flags & MMC_RSP_R1B) {
resp_type = 0x3;
-   timing = 0x0;
+   if (cq_host->quirks & CQHCI_QUIRK_CMD_TIMING_R1B_DCMD)
+   timing = 0x1;
+   else
+   timing = 0x0;

I was thinking it would be nice if there was a generic way for drivers
to make changes to descriptors before a task is started.  Currently
there is
host->ops->write_l() which would make it possible by checking for
host->ops->CQHCI_TDBR
register and, in this case, the DCMD tag.  We would need to export
get_desc(), perhaps rename it cqhci_get_desc() and put it in cqhci.h
since it is an inline function.

We take spin_lock_irqsave after the descriptor is prepared and before writing
to TDBR.
Not sure, but in future this may become a limitation for drivers that want to
make changes to descriptors if they edit them in the host->ops->write_l() call.
Though in this case it is not required here.


Alternatively we could add host->ops for descriptor preparation.

Both ways sound good to me.
But maybe adding a host->ops for descriptor preparation is the better way to go,
since that would be the right interface to expose for making changes to
descriptors.


DCMD descriptor attributes remain same for any host and also task parameters as 
QBR need to be enabled with DCMD.
So I believe it should be ok if we just add callback to allow hosts to update 
command parameters of DCMD descriptor only thru cqhci_host_ops.


For now we can add host->ops as update_dcmd_desc and pass the task_desc 
of DCMD for updating any params which host may want to update.




Also, don’t see any requirement for host specific Task parameter updates in 
Task descriptor so not sure if there is any need to provide callback for task 
descriptor data preparation to hosts.
Please confirm.


Sure, for now the requirement has come up only for the DCMD desc update.
We can add task descriptor ops in a similar way later when required.


Adrian, please confirm if you are fine with both of above?




What do people think?


} else {
resp_type = 0x2;
timing = 0x1;
diff --git a/drivers/mmc/host/cqhci.h b/drivers/mmc/host/cqhci.h
index 9e68286a07b4..f96d8565cc07 100644
--- a/drivers/mmc/host/cqhci.h
+++ b/drivers/mmc/host/cqhci.h
@@ -170,6 +170,7 @@ struct cqhci_host {
   
   	u32 quirks;

   #define CQHCI_QUIRK_SHORT_TXFR_DESC_SZ   0x1
+#define CQHCI_QUIRK_CMD_TIMING_R1B_DCMD0x2
   
   	bool enabled;

bool halted;



Re: [PATCH V1 07/11] mmc: cqhci: add quirk for setting DCMD CMD_TIMING

2019-03-06 Thread Ritesh Harjani



On 3/6/2019 6:30 PM, Adrian Hunter wrote:

On 2/03/19 7:20 AM, Sowjanya Komatineni wrote:

This patch adds a quirk for setting CMD_TIMING to 1 in descriptor
for DCMD with R1B response type to allow the command to be sent to
device during data activity or busy time.

Tegra186 CQHCI host has bug where it selects DATA_PRESENT_SELECT
to 1 by CQHCI controller for DCMDs with R1B response type and
since DCMD does not trigger any data transfer, DCMD task complete
happens leaving the DATA FSM of host controller in wait state for
data.

This effects the data transfer task issued after R1B DCMD task
and no interrupt is generated for the data transfer task.

SW WAR for this issue is to set CMD_TIMING bit to 1 in DCMD task
descriptor and as DCMD task descriptor preparation is done by
cqhci driver, this patch adds cqequirk to handle this.

Signed-off-by: Sowjanya Komatineni 
---
  drivers/mmc/host/cqhci.c | 5 -
  drivers/mmc/host/cqhci.h | 1 +
  2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/host/cqhci.c b/drivers/mmc/host/cqhci.c
index a8af682a9182..b34c07125f32 100644
--- a/drivers/mmc/host/cqhci.c
+++ b/drivers/mmc/host/cqhci.c
@@ -521,7 +521,10 @@ static void cqhci_prep_dcmd_desc(struct mmc_host *mmc,
} else {
if (mrq->cmd->flags & MMC_RSP_R1B) {
resp_type = 0x3;
-   timing = 0x0;
+   if (cq_host->quirks & CQHCI_QUIRK_CMD_TIMING_R1B_DCMD)
+   timing = 0x1;
+   else
+   timing = 0x0;

I was thinking it would be nice if there was a generic way for drivers to
make changes to descriptors before a task is started.  Currently there is
host->ops->write_l() which would make it possible by checking for CQHCI_TDBR
register and, in this case, the DCMD tag.  We would need to export
get_desc(), perhaps rename it cqhci_get_desc() and put it in cqhci.h since
it is an inline function.


We take spin_lock_irqsave after the descriptor is prepared and before
writing to TDBR.
Not sure, but in future this may become a limitation for drivers that want
to make changes to
descriptors if they edit them in the host->ops->write_l() call.
Though in this case it is not required here.



Alternatively we could add host->ops for descriptor preparation.


Both ways sound good to me.
But maybe adding a host->ops for descriptor preparation is the better way to go,
since that would be the right interface to expose for making changes to
descriptors.




What do people think?


} else {
resp_type = 0x2;
timing = 0x1;
diff --git a/drivers/mmc/host/cqhci.h b/drivers/mmc/host/cqhci.h
index 9e68286a07b4..f96d8565cc07 100644
--- a/drivers/mmc/host/cqhci.h
+++ b/drivers/mmc/host/cqhci.h
@@ -170,6 +170,7 @@ struct cqhci_host {
  
  	u32 quirks;

  #define CQHCI_QUIRK_SHORT_TXFR_DESC_SZ0x1
+#define CQHCI_QUIRK_CMD_TIMING_R1B_DCMD0x2
  
  	bool enabled;

bool halted;



Re: panic with CONFIG_FAIL_MMC_REQUEST and cqhci

2019-02-22 Thread Ritesh Harjani

Hi Laura,

On 2/19/2019 12:59 AM, Laura Abbott wrote:

Hi,

Fedora got report of a panic when I accidentally left debugging enabled
on a build https://bugzilla.redhat.com/show_bug.cgi?id=1677438

It looks like a panic from code in CONFIG_FAIL_MMC_REQUEST from the
cqhci driver because there isn't a command (high level overview)


With CQHCI, in case of non-DCMD (data) requests, mrq->cmd can be NULL.
Is this crash happening always (100% on bootup) with CQHCI & 
CONFIG_FAIL_MMC_REQUEST enabled?


Sure, I will roll out a patch to handle this case.
It would be great if you could also confirm it from your side.
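
Something along the following lines is what I have in mind - a minimal
sketch of just the NULL guard (mmc_should_fail_request() as shown in
your gdb listing, with the cmd dereference made conditional; the error
list here is shortened for brevity):

	static void mmc_should_fail_request(struct mmc_host *host,
					    struct mmc_request *mrq)
	{
		struct mmc_command *cmd = mrq->cmd;	/* may be NULL for CQE data tasks */
		struct mmc_data *data = mrq->data;
		static const int data_errors[] = {
			-ETIMEDOUT, -EILSEQ, -EIO,
		};

		if (!data)
			return;

		/* guard the cmd dereference; everything else stays as-is */
		if ((cmd && cmd->error) || data->error ||
		    !should_fail(&host->fail_mmc_request,
				 data->blksz * data->blocks))
			return;

		data->error = data_errors[prandom_u32() % ARRAY_SIZE(data_errors)];
	}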

Regards
Ritesh




(gdb) list *(mmc_should_fail_request+0xa)
0x149a is in mmc_should_fail_request (drivers/mmc/core/core.c:98).
93    };
94
95    if (!data)
96    return;
97
98    if (cmd->error || data->error ||
99    !should_fail(&host->fail_mmc_request, data->blksz * 
data->blocks))

100    return;
101
102    data->error = data_errors[prandom_u32() % 
ARRAY_SIZE(data_errors)];

(gdb)

(gdb) list *(mmc_cqe_request_done+0x1c)
0x2a6c is in mmc_cqe_request_done (drivers/mmc/core/core.c:505).
500    void mmc_cqe_request_done(struct mmc_host *host, struct 
mmc_request *mrq)

501    {
502    mmc_should_fail_request(host, mrq);
503
504    /* Flag re-tuning needed on CRC errors */
505    if ((mrq->cmd && mrq->cmd->error == -EILSEQ) ||
506    (mrq->data && mrq->data->error == -EILSEQ))
507    mmc_retune_needed(host);
508
509    trace_mmc_request_done(host, mrq);


(gdb) list *(cqhci_irq+0x1d2)
0x1172 is in cqhci_irq (drivers/mmc/host/cqhci.c:747).
742    data->bytes_xfered = 0;
743    else
744    data->bytes_xfered = data->blksz * data->blocks;
745    }
746
747    mmc_cqe_request_done(mmc, mrq);
748    }
749
750    irqreturn_t cqhci_irq(struct mmc_host *mmc, u32 intmask, int 
cmd_error,

751  int data_error)

This can be worked around by turning off the option but it
seems like something to fix up.

Thanks,
Laura


Re: [f2fs-dev] [PATCH] f2fs: do not use mutex lock in atomic context

2019-02-17 Thread Ritesh Harjani



On 2/15/2019 2:40 PM, Chao Yu wrote:

On 2019/2/15 12:28, Ritesh Harjani wrote:

On 2/14/2019 9:40 PM, Chao Yu wrote:

On 2019-2-14 15:46, Sahitya Tummala wrote:

On Wed, Feb 13, 2019 at 11:25:31AM +0800, Chao Yu wrote:

On 2019/2/4 16:06, Sahitya Tummala wrote:

Fix below warning coming because of using mutex lock in atomic context.

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:98
in_atomic(): 1, irqs_disabled(): 0, pid: 585, name: sh
Preemption disabled at: __radix_tree_preload+0x28/0x130
Call trace:
   dump_backtrace+0x0/0x2b4
   show_stack+0x20/0x28
   dump_stack+0xa8/0xe0
   ___might_sleep+0x144/0x194
   __might_sleep+0x58/0x8c
   mutex_lock+0x2c/0x48
   f2fs_trace_pid+0x88/0x14c
   f2fs_set_node_page_dirty+0xd0/0x184

Do not use f2fs_radix_tree_insert() to avoid doing cond_resched() with
spin_lock() acquired.

Signed-off-by: Sahitya Tummala 
---
   fs/f2fs/trace.c | 20 +---
   1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/f2fs/trace.c b/fs/f2fs/trace.c
index ce2a5eb..d0ab533 100644
--- a/fs/f2fs/trace.c
+++ b/fs/f2fs/trace.c
@@ -14,7 +14,7 @@
   #include "trace.h"
   
   static RADIX_TREE(pids, GFP_ATOMIC);

-static struct mutex pids_lock;
+static spinlock_t pids_lock;
   static struct last_io_info last_io;
   
   static inline void __print_last_io(void)

@@ -58,23 +58,29 @@ void f2fs_trace_pid(struct page *page)
   
   	set_page_private(page, (unsigned long)pid);
   
+retry:

if (radix_tree_preload(GFP_NOFS))
return;
   
-	mutex_lock(&pids_lock);

+   spin_lock(&pids_lock);
p = radix_tree_lookup(&pids, pid);
if (p == current)
goto out;
if (p)
radix_tree_delete(&pids, pid);
   
-	f2fs_radix_tree_insert(&pids, pid, current);

Do you know why we have retry logic here, when we have anyway
called radix_tree_delete() with the pid key?
That should ensure the slot is empty, no?
Then why was the original code (f2fs_radix_tree_insert) retrying?
For what condition was a retry needed?

Hi,

f2fs_radix_tree_insert is used in many places; it was introduced to be used in
some paths where we should not fail.

And here, I guess we used it for the same purpose: if we fail to insert the
@current pointer into the radix tree, next time we may not skip calling
trace_printk, and it will print the same current->comm info as the
previous one, which is redundant.
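
(For reference, the helper being discussed is, as far as I recall,
roughly the following inline in f2fs.h - quoted from memory, so treat
it as a sketch:)

	static inline void f2fs_radix_tree_insert(struct radix_tree_root *root,
						  unsigned long index, void *item)
	{
		while (radix_tree_insert(root, index, item))
			cond_resched();
	}

So a failed insert is retried until it succeeds instead of being
silently dropped.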


Sure, thanks for the info.

Regards
Ritesh




Re: [f2fs-dev] [PATCH] f2fs: do not use mutex lock in atomic context

2019-02-14 Thread Ritesh Harjani



On 2/14/2019 9:40 PM, Chao Yu wrote:

On 2019-2-14 15:46, Sahitya Tummala wrote:

On Wed, Feb 13, 2019 at 11:25:31AM +0800, Chao Yu wrote:

On 2019/2/4 16:06, Sahitya Tummala wrote:

Fix below warning coming because of using mutex lock in atomic context.

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:98
in_atomic(): 1, irqs_disabled(): 0, pid: 585, name: sh
Preemption disabled at: __radix_tree_preload+0x28/0x130
Call trace:
  dump_backtrace+0x0/0x2b4
  show_stack+0x20/0x28
  dump_stack+0xa8/0xe0
  ___might_sleep+0x144/0x194
  __might_sleep+0x58/0x8c
  mutex_lock+0x2c/0x48
  f2fs_trace_pid+0x88/0x14c
  f2fs_set_node_page_dirty+0xd0/0x184

Do not use f2fs_radix_tree_insert() to avoid doing cond_resched() with
spin_lock() acquired.

Signed-off-by: Sahitya Tummala 
---
  fs/f2fs/trace.c | 20 +---
  1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/f2fs/trace.c b/fs/f2fs/trace.c
index ce2a5eb..d0ab533 100644
--- a/fs/f2fs/trace.c
+++ b/fs/f2fs/trace.c
@@ -14,7 +14,7 @@
  #include "trace.h"
  
  static RADIX_TREE(pids, GFP_ATOMIC);

-static struct mutex pids_lock;
+static spinlock_t pids_lock;
  static struct last_io_info last_io;
  
  static inline void __print_last_io(void)

@@ -58,23 +58,29 @@ void f2fs_trace_pid(struct page *page)
  
  	set_page_private(page, (unsigned long)pid);
  
+retry:

if (radix_tree_preload(GFP_NOFS))
return;
  
-	mutex_lock(&pids_lock);

+   spin_lock(&pids_lock);
p = radix_tree_lookup(&pids, pid);
if (p == current)
goto out;
if (p)
radix_tree_delete(&pids, pid);
  
-	f2fs_radix_tree_insert(&pids, pid, current);


Do you know why we have retry logic here, when we have anyway
called radix_tree_delete() with the pid key?

That should ensure the slot is empty, no?
Then why was the original code (f2fs_radix_tree_insert) retrying?
For what condition was a retry needed?


Regards
Ritesh



+   if (radix_tree_insert(&pids, pid, current)) {
+   spin_unlock(&pids_lock);
+   radix_tree_preload_end();
+   cond_resched();
+   goto retry;
+   }
  
  	trace_printk("%3x:%3x %4x %-16s\n",

MAJOR(inode->i_sb->s_dev), MINOR(inode->i_sb->s_dev),
pid, current->comm);

Hi Sahitya,

Can trace_printk sleep? For safety, how about moving it out of spinlock?


Hi Chao,

Yes, trace_printk() is safe to use in atomic context (unlike printk).

Hi Sahitya,

Thanks for your confirmation. :)

Reviewed-by: Chao Yu 

Thanks,


Thanks,
Sahitya.


Thanks,


  out:
-   mutex_unlock(&pids_lock);
+   spin_unlock(&pids_lock);
radix_tree_preload_end();
  }
  
@@ -119,7 +125,7 @@ void f2fs_trace_ios(struct f2fs_io_info *fio, int flush)
  
  void f2fs_build_trace_ios(void)

  {
-   mutex_init(&pids_lock);
+   spin_lock_init(&pids_lock);
  }
  
  #define PIDVEC_SIZE	128

@@ -147,7 +153,7 @@ void f2fs_destroy_trace_ios(void)
pid_t next_pid = 0;
unsigned int found;
  
-	mutex_lock(&pids_lock);

+   spin_lock(&pids_lock);
while ((found = gang_lookup_pids(pid, next_pid, PIDVEC_SIZE))) {
unsigned idx;
  
@@ -155,5 +161,5 @@ void f2fs_destroy_trace_ios(void)

for (idx = 0; idx < found; idx++)
radix_tree_delete(&pids, pid[idx]);
}
-   mutex_unlock(&pids_lock);
+   spin_unlock(&pids_lock);
  }



___
Linux-f2fs-devel mailing list
linux-f2fs-de...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [RFC PATCH 16/35] ovl: readd lsattr/chattr support

2018-04-23 Thread Ritesh Harjani



On 4/12/2018 8:38 PM, Miklos Szeredi wrote:

Implement FS_IOC_GETFLAGS and FS_IOC_SETFLAGS.

Needs vfs_ioctl() exported to modules.
Do you think it would be better to separate out the ovl implementation and
the export of the vfs_ioctl() method?
There are probably other users which should be using the vfs_ioctl() method as
well (like ecryptfs)?




Signed-off-by: Miklos Szeredi 
---
  fs/internal.h   |  1 -
  fs/ioctl.c  |  1 +
  fs/overlayfs/file.c | 59 +
  include/linux/fs.h  |  2 ++
  4 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/fs/internal.h b/fs/internal.h
index 3319bf39e339..d5108d9c6a2f 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -176,7 +176,6 @@ extern const struct dentry_operations ns_dentry_operations;
   */
  extern int do_vfs_ioctl(struct file *file, unsigned int fd, unsigned int cmd,
unsigned long arg);
-extern long vfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
  
  /*

   * iomap support:
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 5ace7efb0d04..696f4c46a868 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -49,6 +49,7 @@ long vfs_ioctl(struct file *filp, unsigned int cmd, unsigned 
long arg)
   out:
return error;
  }
+EXPORT_SYMBOL(vfs_ioctl);
  
  static int ioctl_fibmap(struct file *filp, int __user *p)

  {
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 05e3e2f80b89..cc004ff1b05b 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -8,6 +8,7 @@
  
  #include 

  #include 
+#include 
  #include 
  #include 
  #include "overlayfs.h"
@@ -291,6 +292,63 @@ long ovl_fallocate(struct file *file, int mode, loff_t 
offset, loff_t len)
return ret;
  }
  
+static long ovl_real_ioctl(struct file *file, unsigned int cmd,

+  unsigned long arg)
+{
+   struct fd real;
+   const struct cred *old_cred;
+   long ret;
+
+   ret = ovl_real_file(file, &real);
+   if (ret)
+   return ret;
+
+   old_cred = ovl_override_creds(file_inode(file)->i_sb);
+   ret = vfs_ioctl(real.file, cmd, arg);
+   revert_creds(old_cred);
+
+   fdput(real);
+
+   return ret;
+}
+
+long ovl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+   long ret;
+   struct inode *inode = file_inode(file);
+
+   switch (cmd) {
+   case FS_IOC_GETFLAGS:
+   ret = ovl_real_ioctl(file, cmd, arg);
+   break;
+
+   case FS_IOC_SETFLAGS:
+   if (!inode_owner_or_capable(inode))
+   return -EACCES;
+
+   ret = mnt_want_write_file(file);
+   if (ret)
+   return ret;
+
+   ret = ovl_copy_up(file_dentry(file));
+   if (!ret) {
+   ret = ovl_real_ioctl(file, cmd, arg);
+
+   inode_lock(inode);
+   ovl_copyflags(ovl_inode_real(inode), inode);
+   inode_unlock(inode);
+   }
+
+   mnt_drop_write_file(file);
+   break;
+
+   default:
+   ret = -ENOTTY;
+   }
+
+   return ret;
+}
+
  const struct file_operations ovl_file_operations = {
.open   = ovl_open,
.release= ovl_release,
@@ -300,4 +358,5 @@ const struct file_operations ovl_file_operations = {
.fsync  = ovl_fsync,
.mmap   = ovl_mmap,
.fallocate  = ovl_fallocate,
+   .unlocked_ioctl = ovl_ioctl,


What about compat_ioctl? Should that implementation also go in the same
patch? (See the sketch below for what I have in mind.)
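
If it helps, a hypothetical sketch of what such a handler could look
like (illustrative only, not part of this patch; it simply translates
the 32-bit flag ioctls and forwards to ovl_ioctl() above):

	static long ovl_compat_ioctl(struct file *file, unsigned int cmd,
				     unsigned long arg)
	{
		switch (cmd) {
		case FS_IOC32_GETFLAGS:
			cmd = FS_IOC_GETFLAGS;
			break;
		case FS_IOC32_SETFLAGS:
			cmd = FS_IOC_SETFLAGS;
			break;
		default:
			return -ENOIOCTLCMD;
		}

		return ovl_ioctl(file, cmd, arg);
	}

together with a ".compat_ioctl = ovl_compat_ioctl," entry in
ovl_file_operations.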



  };
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0b215faa30ae..1add10f04b56 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1621,6 +1621,8 @@ int vfs_mkobj(struct dentry *, umode_t,
int (*f)(struct dentry *, umode_t, void *),
void *);
  
+extern long vfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);

+
  /*
   * VFS file helper functions.
   */



--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, 
Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project.


[PATCH] quota: Kill an unused extern entry form quota.h

2018-03-16 Thread Ritesh Harjani
Kill an unused extern entry from quota.h
which is a leftover of the below patch.

[f32764bd2: quota: Convert quota statistics to generic percpu_counter]

Signed-off-by: Ritesh Harjani <rite...@codeaurora.org>
---
 include/linux/quota.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/quota.h b/include/linux/quota.h
index 5ac9de4..ca9772c 100644
--- a/include/linux/quota.h
+++ b/include/linux/quota.h
@@ -267,7 +267,6 @@ struct dqstats {
struct percpu_counter counter[_DQST_DQSTAT_LAST];
 };
 
-extern struct dqstats *dqstats_pcpu;
 extern struct dqstats dqstats;
 
 static inline void dqstats_inc(unsigned int type)
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



[PATCH-Review] f2fs: Set GF_NOFS in read_cache_page_gfp while doing f2fs_quota_read

2018-03-16 Thread Ritesh Harjani
Hello,

Observed the below deadlock with quota enabled on f2fs during a low-memory situation.
I think f2fs_quota_enable(), which does read_cache_page() with the __GFP_FS
flag set, causes the shrinker to be invoked under low memory, and this ends up
deadlocking on the mutex lock in dquot. The commit message of the next patch
contains the call stack.

Since I did not have quota enabled on my f2fs configuration, I could not test
this on my setup yet.

Could the community kindly review and let me know whether the patch
is correct or not?
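
For comparison, another way to express the same "no __GFP_FS reclaim
from inside this path" constraint is the scoped NOFS API - this is a
hypothetical alternative sketch, not what the patch below does, and
sketch_quota_read_page() is a made-up name:

	static struct page *sketch_quota_read_page(struct address_space *mapping,
						   pgoff_t blkidx)
	{
		unsigned int nofs_flags;
		struct page *page;

		/* strip __GFP_FS from every allocation in this scope */
		nofs_flags = memalloc_nofs_save();
		page = read_mapping_page(mapping, blkidx, NULL);
		memalloc_nofs_restore(nofs_flags);

		return page;
	}

The patch that follows instead passes GFP_NOFS directly via
read_cache_page_gfp(), which achieves the same effect for this one
call site.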


Ritesh Harjani (1):
  f2fs: Set GF_NOFS in read_cache_page_gfp while doing f2fs_quota_read

 fs/f2fs/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, 
a Linux Foundation Collaborative Project.



[PATCH] f2fs: Set GF_NOFS in read_cache_page_gfp while doing f2fs_quota_read

2018-03-16 Thread Ritesh Harjani
The quota code itself serializes the operations by taking a mutex lock.
It seems the below deadlock can happen if GFP_NOFS is not used in
f2fs_quota_read()

__switch_to+0x88
__schedule+0x5b0
schedule+0x78
schedule_preempt_disabled+0x20
__mutex_lock_slowpath+0xdc  //mutex owner is itself
mutex_lock+0x2c
dquot_commit+0x30   //mutex_lock(&dqopt->dqio_mutex);
dqput+0xe0
__dquot_drop+0x80
dquot_drop+0x48
f2fs_evict_inode+0x218
evict+0xa8
dispose_list+0x3c
prune_icache_sb+0x58
super_cache_scan+0xf4
do_shrink_slab+0x208
shrink_slab.part.40+0xac
shrink_zone+0x1b0
do_try_to_free_pages+0x25c
try_to_free_pages+0x164
__alloc_pages_nodemask+0x534
do_read_cache_page+0x6c
read_cache_page+0x14
f2fs_quota_read+0xa4
read_blk+0x54
find_tree_dqentry+0xe4
find_tree_dqentry+0xb8
find_tree_dqentry+0xb8
find_tree_dqentry+0xb8
qtree_read_dquot+0x68
v2_read_dquot+0x24
dquot_acquire+0x5c  // mutex_lock(&dqopt->dqio_mutex);
dqget+0x238
__dquot_initialize+0xd4
dquot_initialize+0x10
dquot_file_open+0x34
f2fs_file_open+0x6c
do_dentry_open+0x1e4
vfs_open+0x6c
path_openat+0xa20
do_filp_open+0x4c
do_sys_open+0x178

Signed-off-by: Ritesh Harjani <rite...@codeaurora.org>
---
 fs/f2fs/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 708155d..3a9a924 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -1382,7 +1382,7 @@ static ssize_t f2fs_quota_read(struct super_block *sb, 
int type, char *data,
while (toread > 0) {
tocopy = min_t(unsigned long, sb->s_blocksize - offset, toread);
 repeat:
-   page = read_mapping_page(mapping, blkidx, NULL);
+   page = read_cache_page_gfp(mapping, blkidx, GFP_NOFS);
if (IS_ERR(page)) {
if (PTR_ERR(page) == -ENOMEM) {
congestion_wait(BLK_RW_ASYNC, HZ/50);
-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, 
a Linux Foundation Collaborative Project.



Re: [RFC 0/4] mmc: sdhci-msm: Add CQE support for sdhci-msm

2017-09-01 Thread Ritesh Harjani


On 8/31/2017 2:09 PM, Adrian Hunter wrote:

On 30/08/17 16:04, Ritesh Harjani wrote:

Hi All,

Please ignore the previous patch series from a wrong email
address. Stupid gitconfig issue. Apologies for the spam.

This is RFC patch series based on top of ulfh_mmc/cmdq branch
which is based upon Adrian's CMDQ patch series.

Below patch series enables CQE for sdhci-msm platform.
This has been tested on internal 8996 MTP which has CMDQ support.

Fixes w.r.t. CMDQ:-
There are some patches identified which were required atleast on
MSM platform. I am not sure if these are required for any other
CQE platform or not. Patchset 1, 3 & 4 commit text describes
the problems.

Performance related:-
I gave one small shot for performance and the numbers were not looking good.
So, unless I have tested for performance completely, I should not discuss
on performance numbers as of now with this patchset.
I can try doing some more performance testing and post the results -
though this may take some while.


You might also need custom Send Status Configuration.


Yes, I will check that once.
I think for single-job I/O this may not have mattered much.
But sure, I will check on this, in case re-configuring is needed.





I used below test script for random read/write test.

*randwrite-test-script*
[global]
bs=32k
size=1g
rw=randwrite
direct=1
directory=/data/fiotest


Random write results can vary a lot.  It is important to know if the eMMC
has lots of un-mapped blocks or not. e.g. for ext4 is the "-o discard"
option being used.  I find I get more consistent results if I always have
discards enabled.



[file1]
filename=singlefile1

*randread-test-script*
[global]
bs=32k
size=1g
rw=randread
directory=/data/fiotest


If you don't set numjobs > 1 then there is little benefit of the queue.
Also still need direct=1


Yes, silly me. But I still got lower results with a single job then.
Anyway, let me check this out again. Thanks.





[file1]
filename=singlefile1

@Adrian,
Thanks a lot for pursuing and bringing the CMDQ patch series to its final stages :)


Ritesh Harjani (4):
   mmc: cqhci: Move CQHCI_ENABLE before setting TDLBA/TDLBAU
   mmc: sdhci-msm: Add CQHCI support for sdhci-msm
   mmc: sdhci-msm: Change the desc_sz on cqe_enable/disable.
   mmc: sdhci-msm: Handle unexpected interrupt case on enabling legacy
 IRQs on CQE halt

  .../devicetree/bindings/mmc/sdhci-msm.txt  |   1 +
  drivers/mmc/host/Kconfig   |   1 +
  drivers/mmc/host/cqhci.c   |   7 +-
  drivers/mmc/host/sdhci-msm.c   | 121 -
  4 files changed, 125 insertions(+), 5 deletions(-)





--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, 
Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project.

