Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-06 Thread Goldwyn Rodrigues
On 11:07 06/12, Johannes Thumshirn wrote:
> On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> > This adds support for DAX in btrfs. I understand there have been
> > previous attempts at it. However, I wanted to make sure copy-on-write
> > (COW) works on dax as well.
> > 
> > Before I present this to the FS folks I wanted to run it through the
> > btrfs list. Much as I wish otherwise, I cannot get it correct the first
> > time around :/. Here are some questions for which I need suggestions:
> 
> Hi Goldwyn,
> 
> I've thrown your patches (from your git tree) onto one of my pmem test
> machines with this pmem config:

Thanks. I will check on this. Ordered extents have been a pain to deal
with for me (though mainly because of my incorrect usage).

> 
> mayhem:~/:[0]# ndctl list
> [
>   {
> "dev":"namespace1.0",
> "mode":"fsdax",
> "map":"dev",
> "size":792721358848,
> "uuid":"3fd4ab18-5145-4675-85a0-e05e6f9bcee4",
> "raw_uuid":"49264743-2351-41c5-9db9-38534813df61",
> "sector_size":512,
> "blockdev":"pmem1",
> "numa_node":1
>   },
>   {
> "dev":"namespace0.0",
> "mode":"fsdax",
> "map":"dev",
> "size":792721358848,
> "uuid":"dd0aec3c-7721-4621-8898-e50684a371b5",
> "raw_uuid":"84ff5463-f76e-4ddf-a248-85122541e909",
> "sector_size":4096,
> "blockdev":"pmem0",
> "numa_node":0
>   }
> ]
> 
> Unfortunately I hit a btrfs_panic() with btrfs/002.
> export TEST_DEV=/dev/pmem0
> export SCRATCH_DEV=/dev/pmem1
> export MOUNT_OPTIONS="-o dax"
> ./check
> [...]
> [  178.173113] run fstests btrfs/002 at 2018-12-06 10:55:43
> [  178.357044] BTRFS info (device pmem0): disk space caching is enabled
> [  178.357047] BTRFS info (device pmem0): has skinny extents
> [  178.360042] BTRFS info (device pmem0): enabling ssd optimizations
> [  178.475918] BTRFS: device fsid ee888255-7f4a-4bf7-af65-e8a6a354aca8
> devid 1 transid 3 /dev/pmem1
> [  178.505717] BTRFS info (device pmem1): disk space caching is enabled
> [  178.513593] BTRFS info (device pmem1): has skinny extents
> [  178.520384] BTRFS info (device pmem1): flagging fs with big metadata
> feature
> [  178.530997] BTRFS info (device pmem1): enabling ssd optimizations
> [  178.538331] BTRFS info (device pmem1): creating UUID tree
> [  178.587200] BTRFS critical (device pmem1): panic in
> ordered_data_tree_panic:57: Inconsistency in ordered tree at offset 0
> (errno=-17 Object already exists)
> [  178.603129] [ cut here ]
> [  178.608667] kernel BUG at fs/btrfs/ordered-data.c:57!
> [  178.614333] invalid opcode:  [#1] SMP PTI
> [  178.619295] CPU: 87 PID: 8225 Comm: dd Kdump: loaded Tainted: G
>   E 4.20.0-rc5-default-btrfs-dax #920
> [  178.630090] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS
> SE5C620.86B.0D.01.0010.072020182008 07/20/2018
> [  178.640626] RIP: 0010:__btrfs_add_ordered_extent+0x325/0x410 [btrfs]
> [  178.647404] Code: 28 4d 89 f1 49 c7 c0 90 9c 57 c0 b9 ef ff ff ff ba
> 39 00 00 00 48 c7 c6 10 fe 56 c0 48 8b b8 d8 03 00 00 31 c0 e8 e2 99 06
> 00 <0f> 0b 65 8b 05 d2 e4 b0 3f 89 c0 48 0f a3 05 78 5e cf c2 0f 92 c0
> [  178.667019] RSP: 0018:a3e3674c7ba8 EFLAGS: 00010096
> [  178.672684] RAX: 008f RBX: 9770c2ac5748 RCX:
> 
> [  178.680254] RDX: 97711f9dee80 RSI: 97711f9d6868 RDI:
> 97711f9d6868
> [  178.687831] RBP: 97711d523000 R08:  R09:
> 065a
> [  178.695411] R10: 03ff R11: 0001 R12:
> 97710d66da70
> [  178.702993] R13: 9770c2ac5600 R14:  R15:
> 97710d66d9c0
> [  178.710573] FS:  7fe11ef90700() GS:97711f9c()
> knlGS:
> [  178.719122] CS:  0010 DS:  ES:  CR0: 80050033
> [  178.725380] CR2: 0156a000 CR3: 00eb30dfc006 CR4:
> 007606e0
> [  178.732999] DR0:  DR1:  DR2:
> 
> [  178.740574] DR3:  DR6: fffe0ff0 DR7:
> 0400
> [  178.748147] PKRU: 5554
> [  178.751297] Call Trace:
> [  178.754230]  btrfs_add_ordered_extent_dio+0x1d/0x30 [btrfs]
> [  178.760269]  btrfs_create_dio_extent+0x79/0xe0 [btrfs]
> [  178.765930]  btrfs_get_extent_map_write+0x1a9/0x2b0 [btrfs]
> [  178.771959]  btrfs_file_dax_write+0x1f8/0x4f0 [btrfs]
> [  178.777508]  ? current_t

Re: [PATCH 07/10] dax: export functions for use with btrfs

2018-12-06 Thread Goldwyn Rodrigues
On  6:52 05/12, Christoph Hellwig wrote:
> If you want to export these at all they have to be EXPORT_SYMBOL_GPL.
> 

Understood.

> But I'd really like to avoid seeing another duplicate DAX I/O path.
> Please try to adopt the existing iomap-based infrastructure for your
> needs first.

This is not worthwhile for btrfs. With non-page-aligned I/O on btrfs, we
need to copy the first/last page of the extents for CoW, so we would end
up using the exported functions anyway. Believe me, I have spent some
time trying to make btrfs iomap-compatible before giving up. The problem
is that btrfs needs to carry a lot of information across iomap_begin()
and iomap_end(). While the added private variable helps with this, it
also needs hooks in the bio_submit() functions for crc calculations
during direct writes.

-- 
Goldwyn


[PATCH 10/10] btrfs: dax mmap write

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Create a page-sized extent, copy the contents of the original
extent into it, and present it to user space as the page to
write.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/dax.c | 39 +++
 1 file changed, 39 insertions(+)

diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 6d68d39cc5da..4634917877f3 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -231,6 +231,45 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
sector >>= 9;
ret = copy_user_dax(em->bdev, dax_dev, sector, PAGE_SIZE, 
vmf->cow_page, vaddr);
goto out;
+	} else if (vmf->flags & FAULT_FLAG_WRITE) {
+		pfn_t pfn;
+		struct extent_map *orig = em;
+		void *daddr;
+		sector_t dstart;
+		size_t maplen;
+		struct extent_changeset *data_reserved = NULL;
+		struct extent_state *cached_state = NULL;
+
+		ret = btrfs_delalloc_reserve_space(inode, &data_reserved, pos,
+				PAGE_SIZE);
+		if (ret < 0)
+			return ret;
+		refcount_inc(&em->refs);
+		lock_extent_bits(&BTRFS_I(inode)->io_tree, pos,
+				pos + PAGE_SIZE, &cached_state);
+		/* Create an extent of page size */
+		ret = btrfs_get_extent_map_write(&em, NULL, inode, pos,
+				PAGE_SIZE);
+		if (ret < 0) {
+			free_extent_map(orig);
+			btrfs_delalloc_release_space(inode, data_reserved, pos,
+					PAGE_SIZE, true);
+			goto out;
+		}
+
+		dax_dev = fs_dax_get_by_bdev(em->bdev);
+		/* Calculate start address of destination extent */
+		dstart = (get_start_sect(em->bdev) << 9) + em->block_start;
+		maplen = dax_direct_access(dax_dev, PHYS_PFN(dstart),
+				1, &daddr, &pfn);
+
+		/* Copy the original contents into new destination */
+		copy_extent_page(orig, daddr, pos);
+		btrfs_update_ordered_extent(inode, pos, PAGE_SIZE, true);
+		dax_insert_entry(&xas, mapping, vmf, entry, pfn, 0, false);
+		ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
+		free_extent_map(orig);
+		unlock_extent_cached(&BTRFS_I(inode)->io_tree, pos,
+				pos + PAGE_SIZE, &cached_state);
+		extent_changeset_free(data_reserved);
+		btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE,
+				false);
} else {
sector_t sector;
if (em->block_start == EXTENT_MAP_HOLE) {
-- 
2.16.4



[PATCH 06/10] btrfs: dax write support

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

This is a combination of direct and buffered I/O. The similarity
with direct I/O is that it needs to allocate space before
writing. The similarity with buffered I/O is that when the data
is not page-aligned, it needs to copy parts of the previous
extents. In order to accomplish that, keep references to the
first and last extent (if required) and then perform allocations.
If "pos" or "end" is not aligned, copy the data from the first
and last extent respectively.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/ctree.h |   1 +
 fs/btrfs/dax.c   | 121 +++
 fs/btrfs/file.c  |   4 +-
 3 files changed, 125 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a0d296b0d826..d91ff283a966 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3693,6 +3693,7 @@ int btree_readahead_hook(struct extent_buffer *eb, int 
err);
 #ifdef CONFIG_FS_DAX
 /* dax.c */
 ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
+ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
 #endif /* CONFIG_FS_DAX */
 
 static inline int is_fstree(u64 rootid)
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 5a297674adec..4000259a426c 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -2,6 +2,7 @@
 #include 
 #include "ctree.h"
 #include "btrfs_inode.h"
+#include "extent_io.h"
 
 static ssize_t em_dax_rw(struct inode *inode, struct extent_map *em, u64 pos,
u64 len, struct iov_iter *iter)
@@ -71,3 +72,123 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct 
iov_iter *to)
 return done ? done : ret;
 }
 
+static int copy_extent_page(struct extent_map *em, void *daddr, u64 pos)
+{
+	struct dax_device *dax_dev;
+	void *saddr;
+	sector_t start;
+	size_t len;
+
+	if (em->block_start == EXTENT_MAP_HOLE) {
+		memset(daddr, 0, PAGE_SIZE);
+	} else {
+		dax_dev = fs_dax_get_by_bdev(em->bdev);
+		start = (get_start_sect(em->bdev) << 9) +
+			(em->block_start + (pos - em->start));
+		len = dax_direct_access(dax_dev, PHYS_PFN(start), 1, &saddr,
+				NULL);
+		memcpy(daddr, saddr, PAGE_SIZE);
+	}
+	free_extent_map(em);
+
+	return 0;
+}
+
+ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	ssize_t ret, done = 0, count = iov_iter_count(from);
+	struct inode *inode = file_inode(iocb->ki_filp);
+	u64 pos = iocb->ki_pos;
+	u64 start = round_down(pos, PAGE_SIZE);
+	u64 end = round_up(pos + count, PAGE_SIZE);
+	struct extent_state *cached_state = NULL;
+	struct extent_changeset *data_reserved = NULL;
+	struct extent_map *first = NULL, *last = NULL;
+
+	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start,
+			end - start);
+	if (ret < 0)
+		return ret;
+
+	/* Grab a reference of the first extent to copy data */
+	if (start < pos) {
+		first = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start,
+				end - start, 0);
+		if (IS_ERR(first)) {
+			ret = PTR_ERR(first);
+			goto out2;
+		}
+	}
+
+	/* Grab a reference of the last extent to copy data */
+	if (pos + count < end) {
+		last = btrfs_get_extent(BTRFS_I(inode), NULL, 0,
+				end - PAGE_SIZE, PAGE_SIZE, 0);
+		if (IS_ERR(last)) {
+			ret = PTR_ERR(last);
+			goto out2;
+		}
+	}
+
+	lock_extent_bits(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
+	while (done < count) {
+		struct extent_map *em;
+		struct dax_device *dax_dev;
+		int offset = pos & (PAGE_SIZE - 1);
+		u64 estart = round_down(pos, PAGE_SIZE);
+		u64 elen = end - estart;
+		size_t len = count - done;
+		sector_t dstart;
+		void *daddr;
+		ssize_t maplen;
+
+		/* Read the current extent */
+		em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, estart, elen,
+				0);
+		if (IS_ERR(em)) {
+			ret = PTR_ERR(em);
+			goto out;
+		}
+
+		/* Get a new extent */
+		ret = btrfs_get_extent_map_write(&em, NULL, inode, estart,
+				elen);
+		if (ret < 0)
+			goto out;
+
+		dax_dev = fs_dax_get_by_bdev(em->bdev);
+		/* Calculate start address of destination extent */
+		dstart = (get_start_sect(em->bdev) << 9) + em->block_start;
+		maplen = dax_direct_access(dax_dev, PHYS_PFN(dstart),
+				PHYS_PFN(em->len), &daddr, NULL);
+
+   /* Copy fr

[PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/dax.c | 25 -
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 88017f8799d1..6d68d39cc5da 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -198,10 +198,13 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
pfn_t pfn;
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
	XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
+   unsigned long vaddr = vmf->address;
struct inode *inode = mapping->host;
loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
void *entry = NULL;
vm_fault_t ret = 0;
+   struct extent_map *em;
+   struct dax_device *dax_dev;
 
if (pos > i_size_read(inode)) {
ret = VM_FAULT_SIGBUS;
@@ -214,21 +217,33 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
goto out;
}
 
-	if (!vmf->cow_page) {
+	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, PAGE_SIZE, 0);
+	if (em->block_start != EXTENT_MAP_HOLE)
+		dax_dev = fs_dax_get_by_bdev(em->bdev);
+
+	if (vmf->cow_page) {
+		sector_t sector;
+		if (em->block_start == EXTENT_MAP_HOLE) {
+			clear_user_highpage(vmf->cow_page, vaddr);
+			goto out;
+		}
+		sector = (get_start_sect(em->bdev) << 9) +
+			(em->block_start + (pos - em->start));
+		sector >>= 9;
+		ret = copy_user_dax(em->bdev, dax_dev, sector, PAGE_SIZE,
+				vmf->cow_page, vaddr);
+		goto out;
+	} else {
 		sector_t sector;
-		struct extent_map *em;
-		em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, PAGE_SIZE,
-				0);
 		if (em->block_start == EXTENT_MAP_HOLE) {
 			ret = dax_load_hole(&xas, mapping, entry, vmf);
 			goto out;
 		}
 		sector = ((get_start_sect(em->bdev) << 9) +
 			  (em->block_start + (pos - em->start))) >> 9;
-		ret = dax_pfn(fs_dax_get_by_bdev(em->bdev), em->bdev, sector,
-				PAGE_SIZE, &pfn);
+		ret = dax_pfn(dax_dev, em->bdev, sector, PAGE_SIZE, &pfn);
 		if (ret)
 			goto out;
 		dax_insert_entry(&xas, mapping, vmf, entry, pfn, 0, false);
-		ret = vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+		ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
 	}
 out:
if (entry)
-- 
2.16.4



[PATCH 07/10] dax: export functions for use with btrfs

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

These functions are required for btrfs dax support.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/dax.c| 35 ---
 include/linux/dax.h | 16 
 2 files changed, 40 insertions(+), 11 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9bcce89ea18e..4578640af631 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -244,7 +244,7 @@ static void put_unlocked_entry(struct xa_state *xas, void 
*entry)
  * dropped the xa_lock, so we know the xa_state is stale and must be reset
  * before use.
  */
-static void dax_unlock_entry(struct xa_state *xas, void *entry)
+void dax_unlock_entry(struct xa_state *xas, void *entry)
 {
void *old;
 
@@ -256,6 +256,7 @@ static void dax_unlock_entry(struct xa_state *xas, void 
*entry)
BUG_ON(!dax_is_locked(old));
dax_wake_entry(xas, entry, false);
 }
+EXPORT_SYMBOL(dax_unlock_entry);
 
 /*
  * Return: The entry stored at this location before it was locked.
@@ -448,7 +449,7 @@ void dax_unlock_mapping_entry(struct page *page)
  * a VM_FAULT code, encoded as an xarray internal entry.  The ERR_PTR values
  * overlap with xarray value entries.
  */
-static void *grab_mapping_entry(struct xa_state *xas,
+void *grab_mapping_entry(struct xa_state *xas,
struct address_space *mapping, unsigned long size_flag)
 {
unsigned long index = xas->xa_index;
@@ -531,6 +532,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
xas_unlock_irq(xas);
return xa_mk_internal(VM_FAULT_FALLBACK);
 }
+EXPORT_SYMBOL(grab_mapping_entry);
 
 /**
  * dax_layout_busy_page - find first pinned page in @mapping
@@ -654,7 +656,7 @@ int dax_invalidate_mapping_entry_sync(struct address_space 
*mapping,
return __dax_invalidate_entry(mapping, index, false);
 }
 
-static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
+int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
sector_t sector, size_t size, struct page *to,
unsigned long vaddr)
 {
@@ -679,6 +681,7 @@ static int copy_user_dax(struct block_device *bdev, struct 
dax_device *dax_dev,
dax_read_unlock(id);
return 0;
 }
+EXPORT_SYMBOL(copy_user_dax);
 
 /*
  * By this point grab_mapping_entry() has ensured that we have a locked entry
@@ -687,7 +690,7 @@ static int copy_user_dax(struct block_device *bdev, struct 
dax_device *dax_dev,
  * already in the tree, we will skip the insertion and just dirty the PMD as
  * appropriate.
  */
-static void *dax_insert_entry(struct xa_state *xas,
+void *dax_insert_entry(struct xa_state *xas,
struct address_space *mapping, struct vm_fault *vmf,
void *entry, pfn_t pfn, unsigned long flags, bool dirty)
 {
@@ -736,6 +739,7 @@ static void *dax_insert_entry(struct xa_state *xas,
xas_unlock_irq(xas);
return entry;
 }
+EXPORT_SYMBOL(dax_insert_entry);
 
 static inline
 unsigned long pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
@@ -962,19 +966,18 @@ static sector_t dax_iomap_sector(struct iomap *iomap, 
loff_t pos)
return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
 }
 
-static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
-pfn_t *pfnp)
+int dax_pfn(struct dax_device *dax_dev, struct block_device *bdev,
+   const sector_t sector, size_t size, pfn_t *pfnp)
 {
-   const sector_t sector = dax_iomap_sector(iomap, pos);
pgoff_t pgoff;
int id, rc;
long length;
 
-	rc = bdev_dax_pgoff(iomap->bdev, sector, size, &pgoff);
+	rc = bdev_dax_pgoff(bdev, sector, size, &pgoff);
if (rc)
return rc;
id = dax_read_lock();
-   length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
+   length = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
   NULL, pfnp);
if (length < 0) {
rc = length;
@@ -993,6 +996,14 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, 
size_t size,
dax_read_unlock(id);
return rc;
 }
+EXPORT_SYMBOL(dax_pfn);
+
+static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
+pfn_t *pfnp)
+{
+   const sector_t sector = dax_iomap_sector(iomap, pos);
+   return dax_pfn(iomap->dax_dev, iomap->bdev, sector, size, pfnp);
+}
 
 /*
  * The user has performed a load from a hole in the file.  Allocating a new
@@ -1001,7 +1012,7 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, 
size_t size,
  * If this page is ever written to we will re-fault and change the mapping to
  * point to real DAX storage instead.
  */
-static vm_fault_t dax_load_hole(struct xa_state *xas,
+vm_fault_t dax_load_hole(struct xa_state *xas,
struct address_space *mapping, void **entry,
struct vm_fault *vmf)
 {
@@ -1017,6 +1028,7 @@ static vm_fault_t 

[PATCH 03/10] btrfs: dax: read zeros from holes

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/dax.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index d614bf73bf8e..5a297674adec 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -54,7 +54,12 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct 
iov_iter *to)
 
 BUG_ON(em->flags & EXTENT_FLAG_FS_MAPPING);
 
-ret = em_dax_rw(inode, em, pos, len, to);
+   if (em->block_start == EXTENT_MAP_HOLE) {
+   u64 zero_len = min(em->len - (em->start - pos), len);
+   ret = iov_iter_zero(zero_len, to);
+   } else {
+   ret = em_dax_rw(inode, em, pos, len, to);
+   }
 if (ret < 0)
 goto out;
 pos += ret;
-- 
2.16.4



[PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Goldwyn Rodrigues
This adds support for DAX in btrfs. I understand there have been
previous attempts at it. However, I wanted to make sure copy-on-write
(COW) works on dax as well.

Before I present this to the FS folks I wanted to run it through the
btrfs list. Much as I wish otherwise, I cannot get it correct the first
time around :/. Here are some questions for which I need suggestions:

Questions:
1. I have been unable to do checksumming for DAX devices. While
checksumming can be done for reads and writes, it is a problem when mmap
is involved because the btrfs kernel module does not get back control
after an mmap() write. Any ideas are appreciated; otherwise we would
have to set nodatasum when dax is enabled.

2. Currently, a user can continue writing to "old" extents of an mmapped
file after a snapshot has been created. How can we enforce that writes
are directed to new extents after a snapshot has been created? Do we
keep a list of all mmap()s and re-mmap them after a snapshot?

Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
command line parameter.
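For anyone reproducing the setup, a minimal sketch of the steps
(assumptions, not part of the posting: the reserved region shows up as
/dev/pmem0, /mnt exists, the patched kernel is booted, and you have
root):

```shell
# Kernel command line (from above): reserve 2G of RAM at the 4G
# physical offset as an emulated pmem device:
#   memmap=2G!4G
# After reboot the region appears as a /dev/pmemN block device.

mkfs.btrfs -f /dev/pmem0        # format the emulated pmem device
mount -o dax /dev/pmem0 /mnt    # mount with the new dax option
mount | grep /mnt               # verify "dax" appears in the options
```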


[PATCH 01/10] btrfs: create a mount option for dax
[PATCH 02/10] btrfs: basic dax read
[PATCH 03/10] btrfs: dax: read zeros from holes
[PATCH 04/10] Rename __endio_write_update_ordered() to
[PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of
[PATCH 06/10] btrfs: dax write support
[PATCH 07/10] dax: export functions for use with btrfs
[PATCH 08/10] btrfs: dax add read mmap path
[PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
[PATCH 10/10] btrfs: dax mmap write

 fs/btrfs/Makefile   |1 
 fs/btrfs/ctree.h|   17 ++
 fs/btrfs/dax.c  |  303 ++--
 fs/btrfs/file.c |   29 
 fs/btrfs/inode.c|   54 +
 fs/btrfs/ioctl.c|5 
 fs/btrfs/super.c|   15 ++
 fs/dax.c|   35 --
 include/linux/dax.h |   16 ++
 9 files changed, 430 insertions(+), 45 deletions(-)


-- 
Goldwyn



[PATCH 01/10] btrfs: create a mount option for dax

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Also, set S_DAX in inode->i_flags.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/ctree.h |  1 +
 fs/btrfs/ioctl.c |  5 -
 fs/btrfs/super.c | 15 +++
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 68f322f600a0..5cc470fa6a40 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1353,6 +1353,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct 
btrfs_fs_info *info)
 #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
 #define BTRFS_MOUNT_NOLOGREPLAY	(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY	(1 << 28)
+#define BTRFS_MOUNT_DAX	(1 << 29)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL  (30)
 #define BTRFS_DEFAULT_MAX_INLINE   (2048)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 802a628e9f7d..e9146c157816 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -149,8 +149,11 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
if (binode->flags & BTRFS_INODE_DIRSYNC)
new_fl |= S_DIRSYNC;
 
+	if (btrfs_test_opt(btrfs_sb(inode->i_sb), DAX) &&
+	    S_ISREG(inode->i_mode))
+		new_fl |= S_DAX;
+
+
 	set_mask_bits(&inode->i_flags,
-		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
+		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC |
+		      S_DAX,
 		      new_fl);
 }
 
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 645fc81e2a94..035263b61cf5 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -326,6 +326,7 @@ enum {
Opt_treelog, Opt_notreelog,
Opt_usebackuproot,
Opt_user_subvol_rm_allowed,
+   Opt_dax,
 
/* Deprecated options */
Opt_alloc_start,
@@ -393,6 +394,7 @@ static const match_table_t tokens = {
{Opt_notreelog, "notreelog"},
{Opt_usebackuproot, "usebackuproot"},
{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
+   {Opt_dax, "dax"},
 
/* Deprecated options */
{Opt_alloc_start, "alloc_start=%s"},
@@ -739,6 +741,17 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char 
*options,
case Opt_user_subvol_rm_allowed:
btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
break;
+#ifdef CONFIG_FS_DAX
+	case Opt_dax:
+		if (btrfs_super_num_devices(info->super_copy) > 1) {
+			btrfs_info(info,
+				   "dax not supported for multi-device btrfs partition\n");
+			ret = -EOPNOTSUPP;
+			goto out;
+		}
+   btrfs_set_opt(info->mount_opt, DAX);
+   break;
+#endif
case Opt_enospc_debug:
btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG);
break;
@@ -1329,6 +1342,8 @@ static int btrfs_show_options(struct seq_file *seq, 
struct dentry *dentry)
seq_puts(seq, ",clear_cache");
if (btrfs_test_opt(info, USER_SUBVOL_RM_ALLOWED))
seq_puts(seq, ",user_subvol_rm_allowed");
+   if (btrfs_test_opt(info, DAX))
+   seq_puts(seq, ",dax");
if (btrfs_test_opt(info, ENOSPC_DEBUG))
seq_puts(seq, ",enospc_debug");
if (btrfs_test_opt(info, AUTO_DEFRAG))
-- 
2.16.4



[PATCH 04/10] Rename __endio_write_update_ordered() to btrfs_update_ordered_extent()

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Since we will be using it in another part of the code, give it a
better name and declare it non-static.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/ctree.h |  7 +--
 fs/btrfs/inode.c | 14 +-
 2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 038d64ecebe5..5144d28216b0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3170,8 +3170,11 @@ struct inode *btrfs_iget_path(struct super_block *s, 
struct btrfs_key *location,
 struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 struct btrfs_root *root, int *was_new);
 struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
-   struct page *page, size_t pg_offset,
-   u64 start, u64 end, int create);
+   struct page *page, size_t pg_offset,
+   u64 start, u64 end, int create);
+void btrfs_update_ordered_extent(struct inode *inode,
+   const u64 offset, const u64 bytes,
+   const bool uptodate);
 int btrfs_update_inode(struct btrfs_trans_handle *trans,
  struct btrfs_root *root,
  struct inode *inode);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9ea4c6f0352f..96e9fe9e4150 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -97,10 +97,6 @@ static struct extent_map *create_io_em(struct inode *inode, 
u64 start, u64 len,
   u64 ram_bytes, int compress_type,
   int type);
 
-static void __endio_write_update_ordered(struct inode *inode,
-const u64 offset, const u64 bytes,
-const bool uptodate);
-
 /*
  * Cleanup all submitted ordered extents in specified range to handle errors
  * from the fill_dellaloc() callback.
@@ -130,7 +126,7 @@ static inline void btrfs_cleanup_ordered_extents(struct 
inode *inode,
ClearPagePrivate2(page);
put_page(page);
}
-   return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
+   return btrfs_update_ordered_extent(inode, offset + PAGE_SIZE,
bytes - PAGE_SIZE, false);
 }
 
@@ -8059,7 +8055,7 @@ static void btrfs_endio_direct_read(struct bio *bio)
bio_put(bio);
 }
 
-static void __endio_write_update_ordered(struct inode *inode,
+void btrfs_update_ordered_extent(struct inode *inode,
 const u64 offset, const u64 bytes,
 const bool uptodate)
 {
@@ -8112,7 +8108,7 @@ static void btrfs_endio_direct_write(struct bio *bio)
struct btrfs_dio_private *dip = bio->bi_private;
struct bio *dio_bio = dip->dio_bio;
 
-   __endio_write_update_ordered(dip->inode, dip->logical_offset,
+   btrfs_update_ordered_extent(dip->inode, dip->logical_offset,
 dip->bytes, !bio->bi_status);
 
kfree(dip);
@@ -8432,7 +8428,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, 
struct inode *inode,
bio = NULL;
} else {
if (write)
-   __endio_write_update_ordered(inode,
+   btrfs_update_ordered_extent(inode,
file_offset,
dio_bio->bi_iter.bi_size,
false);
@@ -8572,7 +8568,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct 
iov_iter *iter)
 */
if (dio_data.unsubmitted_oe_range_start <
dio_data.unsubmitted_oe_range_end)
-   __endio_write_update_ordered(inode,
+   btrfs_update_ordered_extent(inode,
dio_data.unsubmitted_oe_range_start,
dio_data.unsubmitted_oe_range_end -
dio_data.unsubmitted_oe_range_start,
-- 
2.16.4



[PATCH 02/10] btrfs: basic dax read

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/Makefile |  1 +
 fs/btrfs/ctree.h  |  5 
 fs/btrfs/dax.c| 68 +++
 fs/btrfs/file.c   | 13 ++-
 4 files changed, 86 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/dax.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index ca693dd554e9..1fa77b875ae9 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -12,6 +12,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
root-tree.o dir-item.o \
   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
   uuid-tree.o props.o free-space-tree.o tree-checker.o
 
+btrfs-$(CONFIG_FS_DAX) += dax.o
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
 btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5cc470fa6a40..038d64ecebe5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3685,6 +3685,11 @@ int btrfs_reada_wait(void *handle);
 void btrfs_reada_detach(void *handle);
 int btree_readahead_hook(struct extent_buffer *eb, int err);
 
+#ifdef CONFIG_FS_DAX
+/* dax.c */
+ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
+#endif /* CONFIG_FS_DAX */
+
 static inline int is_fstree(u64 rootid)
 {
if (rootid == BTRFS_FS_TREE_OBJECTID ||
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
new file mode 100644
index ..d614bf73bf8e
--- /dev/null
+++ b/fs/btrfs/dax.c
@@ -0,0 +1,68 @@
+#include 
+#include 
+#include "ctree.h"
+#include "btrfs_inode.h"
+
+static ssize_t em_dax_rw(struct inode *inode, struct extent_map *em, u64 pos,
+   u64 len, struct iov_iter *iter)
+{
+struct dax_device *dax_dev = fs_dax_get_by_bdev(em->bdev);
+ssize_t map_len;
+pgoff_t blk_pg;
+void *kaddr;
+sector_t blk_start;
+unsigned offset = pos & (PAGE_SIZE - 1);
+
+len = min(len + offset, em->len - (pos - em->start));
+len = ALIGN(len, PAGE_SIZE);
+blk_start = (get_start_sect(em->bdev) << 9) + (em->block_start + (pos 
- em->start));
+blk_pg = blk_start - offset;
+	map_len = dax_direct_access(dax_dev, PHYS_PFN(blk_pg), PHYS_PFN(len),
+			&kaddr, NULL);
+map_len = PFN_PHYS(map_len);
+kaddr += offset;
+map_len -= offset;
+if (map_len > len)
+map_len = len;
+if (iov_iter_rw(iter) == WRITE)
+return dax_copy_from_iter(dax_dev, blk_pg, kaddr, map_len, 
iter);
+else
+return dax_copy_to_iter(dax_dev, blk_pg, kaddr, map_len, iter);
+}
+
+ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
+{
+size_t ret = 0, done = 0, count = iov_iter_count(to);
+struct extent_map *em;
+u64 pos = iocb->ki_pos;
+u64 end = pos + count;
+struct inode *inode = file_inode(iocb->ki_filp);
+
+if (!count)
+return 0;
+
+end = i_size_read(inode) < end ? i_size_read(inode) : end;
+
+while (pos < end) {
+u64 len = end - pos;
+
+em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, len, 0);
+if (IS_ERR(em)) {
+if (!ret)
+ret = PTR_ERR(em);
+goto out;
+}
+
+BUG_ON(em->flags & EXTENT_FLAG_FS_MAPPING);
+
+ret = em_dax_rw(inode, em, pos, len, to);
+if (ret < 0)
+goto out;
+pos += ret;
+done += ret;
+}
+
+out:
+iocb->ki_pos += done;
+return done ? done : ret;
+}
+
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 58e93bce3036..ef6ed93f44d1 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3308,9 +3308,20 @@ static int btrfs_file_open(struct inode *inode, struct 
file *filp)
return generic_file_open(inode, filp);
 }
 
+static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+   struct inode *inode = file_inode(iocb->ki_filp);
+
+#ifdef CONFIG_FS_DAX
+   if (IS_DAX(inode))
+   return btrfs_file_dax_read(iocb, to);
+#endif
+   return generic_file_read_iter(iocb, to);
+}
+
 const struct file_operations btrfs_file_operations = {
.llseek = btrfs_file_llseek,
-   .read_iter  = generic_file_read_iter,
+   .read_iter  = btrfs_file_read_iter,
.splice_read= generic_file_splice_read,
.write_iter = btrfs_file_write_iter,
.mmap   = btrfs_file_mmap,
-- 
2.16.4



[PATCH 08/10] btrfs: dax add read mmap path

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/ctree.h |  1 +
 fs/btrfs/dax.c   | 43 +++
 fs/btrfs/file.c  | 12 +++-
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d91ff283a966..33648121ca52 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3694,6 +3694,7 @@ int btree_readahead_hook(struct extent_buffer *eb, int 
err);
 /* dax.c */
 ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
 ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
+vm_fault_t btrfs_dax_fault(struct vm_fault *vmf);
 #endif /* CONFIG_FS_DAX */
 
 static inline int is_fstree(u64 rootid)
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 4000259a426c..88017f8799d1 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -190,5 +190,48 @@ ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct 
iov_iter *from)
count - done, true);
extent_changeset_free(data_reserved);
 return done ? done : ret;
+}
+
+/* As copied from dax_iomap_pte_fault() */
+vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
+{
+   pfn_t pfn;
+   struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+   XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
+   struct inode *inode = mapping->host;
+   loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
+   void *entry = NULL;
+   vm_fault_t ret = 0;
+
+   if (pos > i_size_read(inode)) {
+   ret = VM_FAULT_SIGBUS;
+   goto out;
+   }
 
+   entry = grab_mapping_entry(&xas, mapping, 0);
+   if (IS_ERR(entry)) {
+   ret = dax_fault_return(PTR_ERR(entry));
+   goto out;
+   }
+
+   if (!vmf->cow_page) {
+   sector_t sector;
+   struct extent_map *em;
+   em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, PAGE_SIZE, 0);
+   if (em->block_start == EXTENT_MAP_HOLE) {
+   ret = dax_load_hole(&xas, mapping, entry, vmf);
+   goto out;
+   }
+   sector = ((get_start_sect(em->bdev) << 9) +
+ (em->block_start + (pos - em->start))) >> 9;
+   ret = dax_pfn(fs_dax_get_by_bdev(em->bdev), em->bdev, sector, PAGE_SIZE, &pfn);
+   if (ret)
+   goto out;
+   dax_insert_entry(&xas, mapping, vmf, entry, pfn, 0, false);
+   ret = vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+   }
+out:
+   if (entry)
+   dax_unlock_entry(&xas, entry);
+   return ret;
 }
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 29a3b12e6660..38b494686fb2 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2227,8 +2227,18 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
return ret > 0 ? -EIO : ret;
 }
 
+static vm_fault_t btrfs_fault(struct vm_fault *vmf)
+{
+   struct inode *inode = vmf->vma->vm_file->f_mapping->host;
+#ifdef CONFIG_FS_DAX
+   if (IS_DAX(inode))
+   return btrfs_dax_fault(vmf);
+#endif
+   return filemap_fault(vmf);
+}
+
 static const struct vm_operations_struct btrfs_file_vm_ops = {
-   .fault  = filemap_fault,
+   .fault  = btrfs_fault,
.map_pages  = filemap_map_pages,
.page_mkwrite   = btrfs_page_mkwrite,
 };
-- 
2.16.4



[PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write()

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

This makes btrfs_get_extent_map_write() independent of Direct
I/O code.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/inode.c | 40 +++-
 2 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5144d28216b0..a0d296b0d826 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3169,6 +3169,8 @@ struct inode *btrfs_iget_path(struct super_block *s, 
struct btrfs_key *location,
  struct btrfs_path *path);
 struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 struct btrfs_root *root, int *was_new);
+int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh,
+   struct inode *inode, u64 start, u64 len);
 struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
struct page *page, size_t pg_offset,
u64 start, u64 end, int create);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 96e9fe9e4150..4671cd9165c1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7485,11 +7485,10 @@ static int btrfs_get_blocks_direct_read(struct extent_map *em,
return 0;
 }
 
-static int btrfs_get_blocks_direct_write(struct extent_map **map,
-struct buffer_head *bh_result,
-struct inode *inode,
-struct btrfs_dio_data *dio_data,
-u64 start, u64 len)
+int btrfs_get_extent_map_write(struct extent_map **map,
+   struct buffer_head *bh,
+   struct inode *inode,
+   u64 start, u64 len)
 {
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct extent_map *em = *map;
@@ -7543,22 +7542,38 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 */
btrfs_free_reserved_data_space_noquota(inode, start,
   len);
-   goto skip_cow;
+   /* skip COW */
+   goto out;
}
}
 
/* this will cow the extent */
-   len = bh_result->b_size;
+   if (bh)
+   len = bh->b_size;
free_extent_map(em);
*map = em = btrfs_new_extent_direct(inode, start, len);
-   if (IS_ERR(em)) {
-   ret = PTR_ERR(em);
-   goto out;
-   }
+   if (IS_ERR(em))
+   return PTR_ERR(em);
+out:
+   return ret;
+}
 
+static int btrfs_get_blocks_direct_write(struct extent_map **map,
+struct buffer_head *bh_result,
+struct inode *inode,
+struct btrfs_dio_data *dio_data,
+u64 start, u64 len)
+{
+   int ret = 0;
+   struct extent_map *em;
+
+   ret = btrfs_get_extent_map_write(map, bh_result, inode,
+   start, len);
+   if (ret < 0)
+   return ret;
+   em = *map;
len = min(len, em->len - (start - em->start));
 
-skip_cow:
bh_result->b_blocknr = (em->block_start + (start - em->start)) >>
inode->i_blkbits;
bh_result->b_size = len;
@@ -7579,7 +7594,6 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
dio_data->reserve -= len;
dio_data->unsubmitted_oe_range_end = start + len;
current->journal_info = dio_data;
-out:
return ret;
 }
 
-- 
2.16.4



Re: [PATCH] btrfs: Remove unused variable mode in btrfs_mount

2018-10-08 Thread Goldwyn Rodrigues
On 15:03 08/10, David Sterba wrote:
> On Fri, Oct 05, 2018 at 07:26:15AM -0500, Goldwyn Rodrigues wrote:
> > Code cleanup.
> 
> Have you check when and why the variable become unused? Thanks.

No, I did not check it earlier. git blame points to
312c89fbca06 ("btrfs: cleanup btrfs_mount() using btrfs_mount_root()")
Author cc'd.

-- 
Goldwyn


[PATCH] btrfs: Remove unused variable mode in btrfs_mount

2018-10-05 Thread Goldwyn Rodrigues
Code cleanup.

Signed-off-by: Goldwyn Rodrigues 

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index e7f702761cb7..f7b8b7a6b86a 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1661,14 +1661,10 @@ static struct dentry *btrfs_mount(struct file_system_type *fs_type, int flags,
 {
struct vfsmount *mnt_root;
struct dentry *root;
-   fmode_t mode = FMODE_READ;
char *subvol_name = NULL;
u64 subvol_objectid = 0;
int error = 0;
 
-   if (!(flags & SB_RDONLY))
-   mode |= FMODE_WRITE;
-
	error = btrfs_parse_subvol_options(data, &subvol_name,
			&subvol_objectid);
if (error) {

-- 
Goldwyn


Re: [PATCH] btrfs: Use iocb to derive pos instead of passing a separate parameter

2018-06-25 Thread Goldwyn Rodrigues
On 06-25 18:20, David Sterba wrote:
> On Mon, Jun 25, 2018 at 01:58:58PM +0900, Misono Tomohiro wrote:
> > So, this is the updated version of 
> > https://patchwork.kernel.org/patch/10063039/
> > 
> > This time xfstest is ok and
> >  Reviewed-by: Misono Tomohiro 

Yes, that's right.

> 
> Your comment about invalidate_mapping_pages is also ok, right? As
> filemap_fdatawait_range and invalidate_mapping_pages use the same
> start/end of the range.

I did not mess around with other functions which are affected by
iocb->ki_pos (as opposed to local pos). So, this should be safe with
respect to invalidate_mapping_pages().

-- 
Goldwyn
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: Use iocb to derive pos instead of passing a separate parameter

2018-06-17 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

struct kiocb carries ki_pos, so there is no need to pass it as
a separate function parameter.

generic_file_direct_write() increments ki_pos, so we now read pos
after the call.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/file.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f660ba1e5e58..f84100a60cec 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1569,10 +1569,11 @@ static noinline int check_can_nocow(struct btrfs_inode *inode, loff_t pos,
return ret;
 }
 
-static noinline ssize_t __btrfs_buffered_write(struct file *file,
-  struct iov_iter *i,
-  loff_t pos)
+static noinline ssize_t __btrfs_buffered_write(struct kiocb *iocb,
+  struct iov_iter *i)
 {
+   struct file *file = iocb->ki_filp;
+   loff_t pos = iocb->ki_pos;
struct inode *inode = file_inode(file);
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct btrfs_root *root = BTRFS_I(inode)->root;
@@ -1804,7 +1805,7 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 {
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file);
-   loff_t pos = iocb->ki_pos;
+   loff_t pos;
ssize_t written;
ssize_t written_buffered;
loff_t endbyte;
@@ -1815,8 +1816,8 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
if (written < 0 || !iov_iter_count(from))
return written;
 
-   pos += written;
-   written_buffered = __btrfs_buffered_write(file, from, pos);
+   pos = iocb->ki_pos;
+   written_buffered = __btrfs_buffered_write(iocb, from);
if (written_buffered < 0) {
err = written_buffered;
goto out;
@@ -1953,7 +1954,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
if (iocb->ki_flags & IOCB_DIRECT) {
num_written = __btrfs_direct_write(iocb, from);
} else {
-   num_written = __btrfs_buffered_write(file, from, pos);
+   num_written = __btrfs_buffered_write(iocb, from);
if (num_written > 0)
iocb->ki_pos = pos + num_written;
if (clean_page)
-- 
2.16.4



Re: [RFC PATCH 0/8] btrfs iomap support

2017-11-17 Thread Goldwyn Rodrigues


On 11/17/2017 12:45 PM, Nikolay Borisov wrote:
> 
> 
> On 17.11.2017 19:44, Goldwyn Rodrigues wrote:
>> This patch series attempts to use kernels iomap for btrfs. Currently,
>> it covers buffered writes only, but I intend to add some other iomap
>> uses once this gets through. I am sending this as an RFC because I
>> would like to find ways to improve the solution since some changes
>> require adding more functions to the iomap infrastructure which I
>> would try to avoid. I still have to remove some kinks as well such
>> as -o compress. I have posted some questions in the individual
>> patches and would appreciate some input to those.
>>
>> Some of the problems I faced is:
>>
>> 1. extent locking: While we perform the extent locking for writes,
>> we need to perform any reads because of non-page-aligned calls before
>> locking can be done. This requires reading the page, increasing their
>> pagecount and "letting it go". The iomap infrastructure uses
>> buffer_heads whereas btrfs uses bio and hence needs to call readpage
>> exclusively. The "letting it go" part makes me somewhat nervous of
>> conflicting reads/writes, even though we are protected under i_rwsem.
>> Is readpage_nolock() a good idea? The extent locking sequence is a
>> bit weird, with locks and unlock happening in different functions.
> 
> Is there some inherent requirement in iomap's design that necessitates
> the usage of buffer heads? I thought the trend is for buffer_head to
> eventually die out. Given that iomap is fairly recent (2-3 years?) I
> find it odd it's relying on buffer heads.
> 

No, there is no inherent reason that I see, other than legacy. iomap was
carved out of existing filesystems such as xfs, which traditionally use
buffer_heads. In any case, buffer heads perform I/O on individual pages
independently, and iomap calls existing functions which use buffer heads.

>>
>> 2. btrfs pages use PagePrivate to store EXTENT_PAGE_PRIVATE which is not 
>> used anywhere.
>> However, a PagePrivate flag is used for try_to_release_buffers(). Can
>> we do away with PagePrivate for data pages? The same with PageChecked.
>> How and why is it used (I guess -o compress)
>>
>> 3. I had to stick information which will be required from iomap_begin()
>> to iomap_end() in btrfs_iomap which is a pointer in btrfs_inode. Is
>> there any other place/way we can transmit this information. XFS only
>> performs allocations and deallocations so it just relies of bmap code
>> for it.
>>
>> Suggestions/Criticism welcome.
>>

-- 
Goldwyn


[RFC PATCH 1/8] btrfs: use iocb for __btrfs_buffered_write

2017-11-17 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Preparatory patch. It reduces the arguments to __btrfs_buffered_write()
to follow the generic buffered_write() style.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>

---
 fs/btrfs/file.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index aafcc785f840..9bceb0e61361 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1572,10 +1572,11 @@ static noinline int check_can_nocow(struct btrfs_inode *inode, loff_t pos,
return ret;
 }
 
-static noinline ssize_t __btrfs_buffered_write(struct file *file,
-  struct iov_iter *i,
-  loff_t pos)
+static noinline ssize_t __btrfs_buffered_write(struct kiocb *iocb,
+  struct iov_iter *i)
 {
+   struct file *file = iocb->ki_filp;
+   loff_t pos = iocb->ki_pos;
struct inode *inode = file_inode(file);
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct btrfs_root *root = BTRFS_I(inode)->root;
@@ -1815,7 +1816,6 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 {
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file);
-   loff_t pos = iocb->ki_pos;
ssize_t written;
ssize_t written_buffered;
loff_t endbyte;
@@ -1826,8 +1826,8 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
if (written < 0 || !iov_iter_count(from))
return written;
 
-   pos += written;
-   written_buffered = __btrfs_buffered_write(file, from, pos);
+   iocb->ki_pos += written;
+   written_buffered = __btrfs_buffered_write(iocb, from);
if (written_buffered < 0) {
err = written_buffered;
goto out;
@@ -1836,16 +1836,16 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 * Ensure all data is persisted. We want the next direct IO read to be
 * able to read what was just written.
 */
-   endbyte = pos + written_buffered - 1;
-   err = btrfs_fdatawrite_range(inode, pos, endbyte);
+   endbyte = iocb->ki_pos + written_buffered - 1;
+   err = btrfs_fdatawrite_range(inode, iocb->ki_pos, endbyte);
if (err)
goto out;
-   err = filemap_fdatawait_range(inode->i_mapping, pos, endbyte);
+   err = filemap_fdatawait_range(inode->i_mapping, iocb->ki_pos, endbyte);
if (err)
goto out;
+   iocb->ki_pos += written_buffered;
written += written_buffered;
-   iocb->ki_pos = pos + written_buffered;
-   invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT,
+   invalidate_mapping_pages(file->f_mapping, iocb->ki_pos >> PAGE_SHIFT,
 endbyte >> PAGE_SHIFT);
 out:
return written ? written : err;
@@ -1964,7 +1964,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
if (iocb->ki_flags & IOCB_DIRECT) {
num_written = __btrfs_direct_write(iocb, from);
} else {
-   num_written = __btrfs_buffered_write(file, from, pos);
+   num_written = __btrfs_buffered_write(iocb, from);
if (num_written > 0)
iocb->ki_pos = pos + num_written;
if (clean_page)
-- 
2.14.2



[RFC PATCH 7/8] fs: iomap->prepare_pages() to set directives specific for the page

2017-11-17 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

This adds prepare_pages() to iomap in order to set page directives,
so that a filesystem such as btrfs may perform post-write operations
after the write completes.

Can we do away with this? EXTENT_PAGE_PRIVATE is only set and never read.
However, we want the page flagged PG_private via SetPagePrivate()
for try_to_release_buffers(). Can we work around it?

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/btrfs/file.c   |  8 
 fs/dax.c  |  2 +-
 fs/internal.h |  2 +-
 fs/iomap.c| 23 ++-
 include/linux/iomap.h |  3 +++
 5 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b34ec493fe4b..b5cc5c0a0cf5 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1641,9 +1641,17 @@ int btrfs_file_iomap_end(struct inode *inode, loff_t pos, loff_t length,
return ret;
 }
 
+static void btrfs_file_process_page(struct inode *inode, struct page *page)
+{
+   SetPagePrivate(page);
+   set_page_private(page, EXTENT_PAGE_PRIVATE);
+   get_page(page);
+}
+
 const struct iomap_ops btrfs_iomap_ops = {
 .iomap_begin= btrfs_file_iomap_begin,
 .iomap_end  = btrfs_file_iomap_end,
+   .iomap_process_page = btrfs_file_process_page,
 };
 
 static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
diff --git a/fs/dax.c b/fs/dax.c
index f001d8c72a06..51d07b24b3a1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -943,7 +943,7 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
 
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-   struct iomap *iomap)
+   const struct iomap_ops *ops, struct iomap *iomap)
 {
struct block_device *bdev = iomap->bdev;
struct dax_device *dax_dev = iomap->dax_dev;
diff --git a/fs/internal.h b/fs/internal.h
index 48cee21b4f14..bd9d5a37bd23 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -176,7 +176,7 @@ extern long vfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
  * iomap support:
  */
 typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
-   void *data, struct iomap *iomap);
+   void *data, const struct iomap_ops *ops, struct iomap *iomap);
 
 loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
unsigned flags, const struct iomap_ops *ops, void *data,
diff --git a/fs/iomap.c b/fs/iomap.c
index 9ec9cc3077b3..a32660b1b6c5 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -78,7 +78,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 * we can do the copy-in page by page without having to worry about
 * failures exposing transient data.
 */
-   written = actor(inode, pos, length, data, &iomap);
+   written = actor(inode, pos, length, data, ops, &iomap);
 
/*
 * Now the data has been copied, commit the range we've copied.  This
@@ -155,7 +155,7 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
 
 static loff_t
 iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-   struct iomap *iomap)
+   const struct iomap_ops *ops, struct iomap *iomap)
 {
struct iov_iter *i = data;
long status = 0;
@@ -195,6 +195,9 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
if (unlikely(status))
break;
 
+   if (ops->iomap_process_page)
+   ops->iomap_process_page(inode, page);
+
if (mapping_writably_mapped(inode->i_mapping))
flush_dcache_page(page);
 
@@ -271,7 +274,7 @@ __iomap_read_page(struct inode *inode, loff_t offset)
 
 static loff_t
 iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-   struct iomap *iomap)
+   const struct iomap_ops *ops, struct iomap *iomap)
 {
long status = 0;
ssize_t written = 0;
@@ -363,7 +366,7 @@ static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
 
 static loff_t
 iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
-   void *data, struct iomap *iomap)
+   void *data, const struct iomap_ops *ops, struct iomap *iomap)
 {
bool *did_zero = data;
loff_t written = 0;
@@ -432,7 +435,7 @@ EXPORT_SYMBOL_GPL(iomap_truncate_page);
 
 static loff_t
 iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
-   void *data, struct iomap *iomap)
+   void *data, const struct iomap_ops *ops, struct iomap *iomap)
 {
struct page *page = data;
int ret;
@@ -523,7 +526,7 @@ static int iomap_to_fiemap(struct fiemap_extent_info *fi,
 
 static loff_t
 iomap_fiemap_actor(struct inode *inode

[RFC PATCH 7/8] fs: iomap->prepare_pages() to set directives specific for the page

2017-11-17 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

This adds prepare_pages() to iomap in order to set page directives,
so that a filesystem such as btrfs may perform post-write operations
after the write completes.

Can we do away with this? EXTENT_PAGE_PRIVATE is only set and never read.
However, we want the page flagged PG_private via SetPagePrivate()
for try_to_release_buffers(). Can we work around it?

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/btrfs/file.c   | 12 ++--
 fs/dax.c  |  2 +-
 fs/internal.h |  2 +-
 fs/iomap.c| 23 ++-
 include/linux/iomap.h |  3 +++
 5 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f5f34e199709..1c459c9001b2 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1261,8 +1261,8 @@ static int prepare_uptodate_page(struct inode *inode, u64 pos, struct page **pag
if (!(pos & (PAGE_SIZE - 1)))
goto out;
 
-   page = find_or_create_page(inode->i_mapping, index,
-   btrfs_alloc_write_mask(inode->i_mapping) | __GFP_WRITE);
+   page = grab_cache_page_write_begin(inode->i_mapping, index,
+   AOP_FLAG_NOFS);
 
if (!PageUptodate(page)) {
int ret = btrfs_readpage(NULL, page);
@@ -1641,9 +1641,17 @@ int btrfs_file_iomap_end(struct inode *inode, loff_t pos, loff_t length,
return ret;
 }
 
+static void btrfs_file_process_page(struct inode *inode, struct page *page)
+{
+   SetPagePrivate(page);
+   set_page_private(page, EXTENT_PAGE_PRIVATE);
+   get_page(page);
+}
+
 const struct iomap_ops btrfs_iomap_ops = {
 .iomap_begin= btrfs_file_iomap_begin,
 .iomap_end  = btrfs_file_iomap_end,
+   .iomap_process_page = btrfs_file_process_page,
 };
 
 static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
diff --git a/fs/dax.c b/fs/dax.c
index f001d8c72a06..51d07b24b3a1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -943,7 +943,7 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
 
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-   struct iomap *iomap)
+   const struct iomap_ops *ops, struct iomap *iomap)
 {
struct block_device *bdev = iomap->bdev;
struct dax_device *dax_dev = iomap->dax_dev;
diff --git a/fs/internal.h b/fs/internal.h
index 48cee21b4f14..bd9d5a37bd23 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -176,7 +176,7 @@ extern long vfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
  * iomap support:
  */
 typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
-   void *data, struct iomap *iomap);
+   void *data, const struct iomap_ops *ops, struct iomap *iomap);
 
 loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
unsigned flags, const struct iomap_ops *ops, void *data,
diff --git a/fs/iomap.c b/fs/iomap.c
index 9ec9cc3077b3..a32660b1b6c5 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -78,7 +78,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 * we can do the copy-in page by page without having to worry about
 * failures exposing transient data.
 */
-   written = actor(inode, pos, length, data, &iomap);
+   written = actor(inode, pos, length, data, ops, &iomap);
 
/*
 * Now the data has been copied, commit the range we've copied.  This
@@ -155,7 +155,7 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
 
 static loff_t
 iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-   struct iomap *iomap)
+   const struct iomap_ops *ops, struct iomap *iomap)
 {
struct iov_iter *i = data;
long status = 0;
@@ -195,6 +195,9 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
if (unlikely(status))
break;
 
+   if (ops->iomap_process_page)
+   ops->iomap_process_page(inode, page);
+
if (mapping_writably_mapped(inode->i_mapping))
flush_dcache_page(page);
 
@@ -271,7 +274,7 @@ __iomap_read_page(struct inode *inode, loff_t offset)
 
 static loff_t
 iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-   struct iomap *iomap)
+   const struct iomap_ops *ops, struct iomap *iomap)
 {
long status = 0;
ssize_t written = 0;
@@ -363,7 +366,7 @@ static int iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 
 static loff_t
 iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
-   void *data, struct iomap *iomap)
+   void *data, const struct iomap_ops *ops, struct iomap *iomap)

[RFC PATCH 8/8] iomap: Introduce iomap->dirty_page()

2017-11-17 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

In dirty_page(), we are clearing PageChecked, though I don't see it set.
Is this used for compression only?
Can we call __set_page_dirty_nobuffers instead?

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/btrfs/file.c   | 8 
 fs/iomap.c| 2 ++
 include/linux/iomap.h | 1 +
 3 files changed, 11 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 1c459c9001b2..ba304e782098 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1648,10 +1648,18 @@ static void btrfs_file_process_page(struct inode *inode, struct page *page)
get_page(page);
 }
 
+static void btrfs_file_dirty_page(struct page *page)
+{
+   SetPageUptodate(page);
+   ClearPageChecked(page);
+   set_page_dirty(page);
+}
+
 const struct iomap_ops btrfs_iomap_ops = {
 .iomap_begin= btrfs_file_iomap_begin,
 .iomap_end  = btrfs_file_iomap_end,
.iomap_process_page = btrfs_file_process_page,
+   .iomap_dirty_page   = btrfs_file_dirty_page,
 };
 
 static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
diff --git a/fs/iomap.c b/fs/iomap.c
index a32660b1b6c5..0907790c76c0 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -208,6 +208,8 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
	status = iomap_write_end(inode, pos, bytes, copied, page, iomap);
if (unlikely(status < 0))
break;
+   if (ops->iomap_dirty_page)
+   ops->iomap_dirty_page(page);
copied = status;
 
cond_resched();
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index fbb0194d56d6..7fbf6889dc54 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -76,6 +76,7 @@ struct iomap_ops {
ssize_t written, unsigned flags, struct iomap *iomap);
 
void (*iomap_process_page)(struct inode *inode, struct page *page);
+   void (*iomap_dirty_page)(struct page *page);
 };
 
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
-- 
2.14.2



[RFC PATCH 3/8] fs: Introduce IOMAP_F_NOBH

2017-11-17 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

IOMAP_F_NOBH tells the iomap functions not to use or attach buffer heads
to the page. Page flushing and writeback are the responsibility of the
filesystem (such as btrfs) code, which uses bios to perform them.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/iomap.c| 20 
 include/linux/iomap.h |  1 +
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index d4801f8dd4fd..9ec9cc3077b3 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -123,7 +123,8 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
if (!page)
return -ENOMEM;
 
-   status = __block_write_begin_int(page, pos, len, NULL, iomap);
+   if (!(iomap->flags & IOMAP_F_NOBH))
+   status = __block_write_begin_int(page, pos, len, NULL, iomap);
if (unlikely(status)) {
unlock_page(page);
put_page(page);
@@ -138,12 +139,15 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 
 static int
 iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
-   unsigned copied, struct page *page)
+   unsigned copied, struct page *page, struct iomap *iomap)
 {
-   int ret;
+   int ret = len;
 
-   ret = generic_write_end(NULL, inode->i_mapping, pos, len,
-   copied, page, NULL);
+   if (iomap->flags & IOMAP_F_NOBH)
+   ret = inode_extend_page(inode, pos, copied, page);
+   else
+   ret = generic_write_end(NULL, inode->i_mapping, pos, len,
+   copied, page, NULL);
if (ret < len)
iomap_write_failed(inode, pos, len);
return ret;
@@ -198,7 +202,7 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 
flush_dcache_page(page);
 
-   status = iomap_write_end(inode, pos, bytes, copied, page);
+   status = iomap_write_end(inode, pos, bytes, copied, page, iomap);
if (unlikely(status < 0))
break;
copied = status;
@@ -292,7 +296,7 @@ iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 
WARN_ON_ONCE(!PageUptodate(page));
 
-   status = iomap_write_end(inode, pos, bytes, bytes, page);
+   status = iomap_write_end(inode, pos, bytes, bytes, page, iomap);
if (unlikely(status <= 0)) {
if (WARN_ON_ONCE(status == 0))
return -EIO;
@@ -344,7 +348,7 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
zero_user(page, offset, bytes);
mark_page_accessed(page);
 
-   return iomap_write_end(inode, pos, bytes, bytes, page);
+   return iomap_write_end(inode, pos, bytes, bytes, page, iomap);
 }
 
 static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 8a7c6d26b147..61af7b1bd0fc 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -29,6 +29,7 @@ struct vm_fault;
  */
 #define IOMAP_F_MERGED 0x10/* contains multiple blocks/extents */
 #define IOMAP_F_SHARED 0x20/* block shared with another file */
+#define IOMAP_F_NOBH   0x40/* Do not assign buffer heads */
 
 /*
  * Magic value for blkno:
-- 
2.14.2



[RFC PATCH 5/8] btrfs: use iomap to perform buffered writes

2017-11-17 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

This eliminates all of the page-related code.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/btrfs/btrfs_inode.h |   4 +-
 fs/btrfs/file.c| 488 ++---
 2 files changed, 185 insertions(+), 307 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index eccadb5f62a5..2c2bc5fd5cc9 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -21,7 +21,7 @@
 
 #include 
 #include "extent_map.h"
-#include "extent_io.h"
+#include "iomap.h"
 #include "ordered-data.h"
 #include "delayed-inode.h"
 
@@ -207,6 +207,8 @@ struct btrfs_inode {
 */
struct rw_semaphore dio_sem;
 
+   struct btrfs_iomap *b_iomap;
+
struct inode vfs_inode;
 };
 
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 876c2acc2a71..b7390214ef3a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -405,79 +405,6 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info)
return 0;
 }
 
-/* simple helper to fault in pages and copy.  This should go away
- * and be replaced with calls into generic code.
- */
-static noinline int btrfs_copy_from_user(loff_t pos, size_t write_bytes,
-struct page **prepared_pages,
-struct iov_iter *i)
-{
-   size_t copied = 0;
-   size_t total_copied = 0;
-   int pg = 0;
-   int offset = pos & (PAGE_SIZE - 1);
-
-   while (write_bytes > 0) {
-   size_t count = min_t(size_t,
-PAGE_SIZE - offset, write_bytes);
-   struct page *page = prepared_pages[pg];
-   /*
-* Copy data from userspace to the current page
-*/
-   copied = iov_iter_copy_from_user_atomic(page, i, offset, count);
-
-   /* Flush processor's dcache for this page */
-   flush_dcache_page(page);
-
-   /*
-* if we get a partial write, we can end up with
-* partially up to date pages.  These add
-* a lot of complexity, so make sure they don't
-* happen by forcing this copy to be retried.
-*
-* The rest of the btrfs_file_write code will fall
-* back to page at a time copies after we return 0.
-*/
-   if (!PageUptodate(page) && copied < count)
-   copied = 0;
-
-   iov_iter_advance(i, copied);
-   write_bytes -= copied;
-   total_copied += copied;
-
-   /* Return to btrfs_file_write_iter to fault page */
-   if (unlikely(copied == 0))
-   break;
-
-   if (copied < PAGE_SIZE - offset) {
-   offset += copied;
-   } else {
-   pg++;
-   offset = 0;
-   }
-   }
-   return total_copied;
-}
-
-/*
- * unlocks pages after btrfs_file_write is done with them
- */
-static void btrfs_drop_pages(struct page **pages, size_t num_pages)
-{
-   size_t i;
-   for (i = 0; i < num_pages; i++) {
-   /* page checked is some magic around finding pages that
-* have been modified without going through btrfs_set_page_dirty
-* clear it here. There should be no need to mark the pages
-* accessed as prepare_pages should have marked them accessed
-* in prepare_pages via find_or_create_page()
-*/
-   ClearPageChecked(pages[i]);
-   unlock_page(pages[i]);
-   put_page(pages[i]);
-   }
-}
-
 /*
  * after copy_from_user, pages need to be dirtied and we need to make
  * sure holes are created between the current EOF and the start of
@@ -1457,8 +1384,7 @@ static int btrfs_find_new_delalloc_bytes(struct btrfs_inode *inode,
  * the other < 0 number - Something wrong happens
  */
 static noinline int
-lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct page **pages,
-   size_t num_pages, loff_t pos,
+lock_and_cleanup_extent(struct btrfs_inode *inode, loff_t pos,
size_t write_bytes,
u64 *lockstart, u64 *lockend,
struct extent_state **cached_state)
@@ -1466,7 +1392,6 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct page **pages,
struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
u64 start_pos;
u64 last_pos;
-   int i;
int ret = 0;
 
start_pos = round_down(pos, fs_info->sectorsize);
@@ -1488,10 +1413,6 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct page **pages,
  

[RFC PATCH 6/8] btrfs: read the first/last page of the write

2017-11-17 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

We cannot perform a readpage in iomap_apply() after
iomap_begin() because our extents are locked by then. So,
we perform the readpage beforehand and make sure we unlock
the page, but keep its page count elevated.

Question: How do we deal with an -EAGAIN return from
prepare_uptodate_page()? Under what scenarios would this occur?

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/btrfs/file.c  | 116 ++-
 fs/btrfs/iomap.h |   1 +
 2 files changed, 47 insertions(+), 70 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b7390214ef3a..b34ec493fe4b 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1252,84 +1252,36 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
return 0;
 }
 
-/*
- * on error we return an unlocked page and the error value
- * on success we return a locked page and 0
- */
-static int prepare_uptodate_page(struct inode *inode,
-struct page *page, u64 pos,
-bool force_uptodate)
+static int prepare_uptodate_page(struct inode *inode, u64 pos, struct page **pagep)
 {
+   struct page *page = NULL;
int ret = 0;
+   int index = pos >> PAGE_SHIFT;
+
+   if (!(pos & (PAGE_SIZE - 1)))
+   goto out;
+
+   page = grab_cache_page_write_begin(inode->i_mapping, index,
+   AOP_FLAG_NOFS);
 
-   if (((pos & (PAGE_SIZE - 1)) || force_uptodate) &&
-   !PageUptodate(page)) {
+   if (!PageUptodate(page)) {
ret = btrfs_readpage(NULL, page);
if (ret)
-   return ret;
-   lock_page(page);
+   goto out;
if (!PageUptodate(page)) {
-   unlock_page(page);
-   return -EIO;
+   ret = -EIO;
+   goto out;
}
if (page->mapping != inode->i_mapping) {
-   unlock_page(page);
-   return -EAGAIN;
-   }
-   }
-   return 0;
-}
-
-/*
- * this just gets pages into the page cache and locks them down.
- */
-static noinline int prepare_pages(struct inode *inode, struct page **pages,
- size_t num_pages, loff_t pos,
- size_t write_bytes, bool force_uptodate)
-{
-   int i;
-   unsigned long index = pos >> PAGE_SHIFT;
-   gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
-   int err = 0;
-   int faili;
-
-   for (i = 0; i < num_pages; i++) {
-again:
-   pages[i] = find_or_create_page(inode->i_mapping, index + i,
-  mask | __GFP_WRITE);
-   if (!pages[i]) {
-   faili = i - 1;
-   err = -ENOMEM;
-   goto fail;
-   }
-
-   if (i == 0)
-   err = prepare_uptodate_page(inode, pages[i], pos,
-   force_uptodate);
-   if (!err && i == num_pages - 1)
-   err = prepare_uptodate_page(inode, pages[i],
-   pos + write_bytes, false);
-   if (err) {
-   put_page(pages[i]);
-   if (err == -EAGAIN) {
-   err = 0;
-   goto again;
-   }
-   faili = i - 1;
-   goto fail;
+   ret = -EAGAIN;
+   goto out;
}
-   wait_on_page_writeback(pages[i]);
}
-
-   return 0;
-fail:
-   while (faili >= 0) {
-   unlock_page(pages[faili]);
-   put_page(pages[faili]);
-   faili--;
-   }
-   return err;
-
+out:
+   if (page)
+   unlock_page(page);
+   *pagep = page;
+   return ret;
 }
 
 static int btrfs_find_new_delalloc_bytes(struct btrfs_inode *inode,
@@ -1502,7 +1454,7 @@ int btrfs_file_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
 fs_info->sectorsize);
 bim->extent_locked = false;
 iomap->type = IOMAP_DELALLOC;
-iomap->flags = IOMAP_F_NEW;
+iomap->flags = IOMAP_F_NEW | IOMAP_F_NOBH;
 
extent_changeset_release(bim->data_reserved);
 /* Reserve data/quota space */
@@ -1526,7 +1478,7 @@ int btrfs_file_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
 sector_offset,
 fs_info->sectorsize);
 iomap->type = IOMAP_UNWRITTEN;
-iom

[RFC PATCH 8/8] fs: Introduce iomap->dirty_page()

2017-11-17 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

In dirty_page(), we clear PageChecked, though I don't see it being set.
Is this used for compression only?
Can we call __set_page_dirty_nobuffers() instead?

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/btrfs/file.c   | 8 
 fs/iomap.c| 2 ++
 include/linux/iomap.h | 1 +
 3 files changed, 11 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b5cc5c0a0cf5..049ed1d8ce1f 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1648,10 +1648,18 @@ static void btrfs_file_process_page(struct inode *inode, struct page *page)
get_page(page);
 }
 
+static void btrfs_file_dirty_page(struct page *page)
+{
+   SetPageUptodate(page);
+   ClearPageChecked(page);
+   set_page_dirty(page);
+}
+
 const struct iomap_ops btrfs_iomap_ops = {
 .iomap_begin= btrfs_file_iomap_begin,
 .iomap_end  = btrfs_file_iomap_end,
.iomap_process_page = btrfs_file_process_page,
+   .iomap_dirty_page   = btrfs_file_dirty_page,
 };
 
 static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
diff --git a/fs/iomap.c b/fs/iomap.c
index a32660b1b6c5..0907790c76c0 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -208,6 +208,8 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
	status = iomap_write_end(inode, pos, bytes, copied, page, iomap);
if (unlikely(status < 0))
break;
+   if (ops->iomap_dirty_page)
+   ops->iomap_dirty_page(page);
copied = status;
 
cond_resched();
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index fbb0194d56d6..7fbf6889dc54 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -76,6 +76,7 @@ struct iomap_ops {
ssize_t written, unsigned flags, struct iomap *iomap);
 
void (*iomap_process_page)(struct inode *inode, struct page *page);
+   void (*iomap_dirty_page)(struct page *page);
 };
 
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
-- 
2.14.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 4/8] btrfs: Introduce btrfs_iomap

2017-11-17 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Preparatory patch. btrfs_iomap structure carries extent/page
state from iomap_begin() to iomap_end().

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/btrfs/file.c  | 68 ++--
 fs/btrfs/iomap.h | 21 +
 2 files changed, 53 insertions(+), 36 deletions(-)
 create mode 100644 fs/btrfs/iomap.h

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9bceb0e61361..876c2acc2a71 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -41,6 +41,7 @@
 #include "volumes.h"
 #include "qgroup.h"
 #include "compression.h"
+#include "iomap.h"
 
 static struct kmem_cache *btrfs_inode_defrag_cachep;
 /*
@@ -1580,18 +1581,14 @@ static noinline ssize_t __btrfs_buffered_write(struct kiocb *iocb,
struct inode *inode = file_inode(file);
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct btrfs_root *root = BTRFS_I(inode)->root;
+   struct btrfs_iomap btrfs_iomap = {0};
+   struct btrfs_iomap *bim = &btrfs_iomap;
struct page **pages = NULL;
-   struct extent_state *cached_state = NULL;
-   struct extent_changeset *data_reserved = NULL;
u64 release_bytes = 0;
-   u64 lockstart;
-   u64 lockend;
size_t num_written = 0;
int nrptrs;
int ret = 0;
-   bool only_release_metadata = false;
bool force_page_uptodate = false;
-   bool need_unlock;
 
nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
PAGE_SIZE / (sizeof(struct page *)));
@@ -1609,7 +1606,6 @@ static noinline ssize_t __btrfs_buffered_write(struct kiocb *iocb,
 offset);
size_t num_pages = DIV_ROUND_UP(write_bytes + offset,
PAGE_SIZE);
-   size_t reserve_bytes;
size_t dirty_pages;
size_t copied;
size_t dirty_sectors;
@@ -1627,11 +1623,11 @@ static noinline ssize_t __btrfs_buffered_write(struct kiocb *iocb,
}
 
sector_offset = pos & (fs_info->sectorsize - 1);
-   reserve_bytes = round_up(write_bytes + sector_offset,
+   bim->reserve_bytes = round_up(write_bytes + sector_offset,
fs_info->sectorsize);
 
-   extent_changeset_release(data_reserved);
-   ret = btrfs_check_data_free_space(inode, &data_reserved, pos,
+   extent_changeset_release(bim->data_reserved);
+   ret = btrfs_check_data_free_space(inode, &bim->data_reserved, pos,
  write_bytes);
if (ret < 0) {
if ((BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
@@ -1642,14 +1638,14 @@ static noinline ssize_t __btrfs_buffered_write(struct kiocb *iocb,
 * For nodata cow case, no need to reserve
 * data space.
 */
-   only_release_metadata = true;
+   bim->only_release_metadata = true;
/*
 * our prealloc extent may be smaller than
 * write_bytes, so scale down.
 */
num_pages = DIV_ROUND_UP(write_bytes + offset,
 PAGE_SIZE);
-   reserve_bytes = round_up(write_bytes +
+   bim->reserve_bytes = round_up(write_bytes +
 sector_offset,
 fs_info->sectorsize);
} else {
@@ -1658,19 +1654,19 @@ static noinline ssize_t __btrfs_buffered_write(struct kiocb *iocb,
}
 
ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
-   reserve_bytes);
+   bim->reserve_bytes);
if (ret) {
-   if (!only_release_metadata)
+   if (!bim->only_release_metadata)
btrfs_free_reserved_data_space(inode,
-   data_reserved, pos,
+   bim->data_reserved, pos,
write_bytes);
else
btrfs_end_write_no_snapshotting(root);
break;
}
 
-   release_bytes = reserve_bytes;
-   need_unlock = false;
+   release_bytes = bim->reserve_bytes;
+   

[RFC PATCH 2/8] fs: Add inode_extend_page()

2017-11-17 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

This splits generic_write_end() into functions which handle
block_write_end() and inode_extend_page().

inode_extend_page() performs the functions of increasing
i_size (if required) and extending the pagecache.

Performed this split so we don't use buffer_heads while ending file I/O.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/buffer.c | 20 +---
 include/linux/buffer_head.h |  1 +
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 170df856bdb9..266daa85b80e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2180,16 +2180,11 @@ int block_write_end(struct file *file, struct address_space *mapping,
 }
 EXPORT_SYMBOL(block_write_end);
 
-int generic_write_end(struct file *file, struct address_space *mapping,
-   loff_t pos, unsigned len, unsigned copied,
-   struct page *page, void *fsdata)
+int inode_extend_page(struct inode *inode, loff_t pos,
+   unsigned copied, struct page *page)
 {
-   struct inode *inode = mapping->host;
loff_t old_size = inode->i_size;
int i_size_changed = 0;
-
-   copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
-
/*
 * No need to use i_size_read() here, the i_size
 * cannot change under us because we hold i_mutex.
@@ -2218,6 +2213,17 @@ int generic_write_end(struct file *file, struct address_space *mapping,
 
return copied;
 }
+EXPORT_SYMBOL(inode_extend_page);
+
+int generic_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
+{
+   struct inode *inode = mapping->host;
+   copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+   return inode_extend_page(inode, pos, copied, page);
+
+}
 EXPORT_SYMBOL(generic_write_end);
 
 /*
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index afa37f807f12..16cf994be178 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -229,6 +229,7 @@ int __block_write_begin(struct page *page, loff_t pos, unsigned len,
 int block_write_end(struct file *, struct address_space *,
loff_t, unsigned, unsigned,
struct page *, void *);
+int inode_extend_page(struct inode *, loff_t, unsigned, struct page*);
 int generic_write_end(struct file *, struct address_space *,
loff_t, unsigned, unsigned,
struct page *, void *);
-- 
2.14.2



[RFC PATCH 0/8] btrfs iomap support

2017-11-17 Thread Goldwyn Rodrigues
This patch series attempts to use the kernel's iomap infrastructure for
btrfs. Currently, it covers buffered writes only, but I intend to add
some other iomap uses once this gets through. I am sending this as an
RFC because I would like to find ways to improve the solution, since
some changes require adding more functions to the iomap infrastructure,
which I would like to avoid. I still have to iron out some kinks as
well, such as -o compress. I have posted some questions in the
individual patches and would appreciate some input on those.

Some of the problems I faced are:

1. extent locking: While we perform the extent locking for writes,
any reads caused by non-page-aligned calls must be performed before
the locking can be done. This requires reading the page, increasing
its page count and "letting it go". The iomap infrastructure uses
buffer_heads whereas btrfs uses bios, and hence btrfs needs to call
readpage exclusively. The "letting it go" part makes me somewhat
nervous about conflicting reads/writes, even though we are protected
under i_rwsem. Is a readpage_nolock() a good idea? The extent locking
sequence is a bit weird, with locks and unlocks happening in
different functions.

2. btrfs pages use PagePrivate to store EXTENT_PAGE_PRIVATE, which is
not used anywhere. However, the PagePrivate flag is used for
try_to_release_buffers(). Can we do away with PagePrivate for data
pages? The same goes for PageChecked: how and why is it used (I guess
for -o compress)?

3. I had to stick information which will be required from iomap_begin()
to iomap_end() in btrfs_iomap, which is a pointer in btrfs_inode. Is
there any other place/way we can transmit this information? XFS only
performs allocations and deallocations, so it just relies on the bmap
code for it.

Suggestions/Criticism welcome.

-- 
Goldwyn




Re: What is the purpose of EXTENT_PAGE_MAPPED

2017-10-30 Thread Goldwyn Rodrigues


On 10/30/2017 09:01 AM, Nikolay Borisov wrote:
> 
> 
> On 30.10.2017 15:55, David Sterba wrote:
>> On Mon, Oct 30, 2017 at 03:21:51PM +0200, Nikolay Borisov wrote:
>>>
>>>
>>> On 27.10.2017 07:17, Liu Bo wrote:
>>>> On Tue, Oct 24, 2017 at 05:47:11AM -0500, Goldwyn Rodrigues wrote:
>>>>>
>>>>> EXTENT_PAGE_MAPPED gets set in set_page_extent_mapped(), but I don't see
>>>>> it being cross checked anytime. What is the purpose of setting it?
>>>>
>>>> Please check commit d1310b2e0cd98eb1348553e69b73827b436dca7b, it was
>>>> used to differentiate page for metadata and for data, but I think
>>>> currently it's just a piece of legacy code.
>>>
>>> Be that as it may - is there any reason why we are keeping this and can
>>> it be killed off?
>>
>> Are we're talking about EXTENT_PAGE_PRIVATE? There's no
>> EXTENT_PAGE_MAPPED. There's some control dependency on the page private
>> bit and the value, so we should be careful and replace the function with
>> an assert (or a BUG_ON if it's a must-not-happen state). The page->private
>> points to an extent buffer, and if it's always an eb, then the
>> EXTENT_PAGE_PRIVATE is unused.
> 
> I guess I meant do we actually need: set_page_extent_mapped and all the
> jazz happening in it  or is it a leftover (which I believe it is) ?
> 

We definitely need set_page_extent_mapped() to set the page private for
the lower layers to handle I/O, primarily writebacks. However, we may
not need EXTENT_PAGE_PRIVATE (yes, I got it wrong the first time). I am
still reading up on this, though.


-- 
Goldwyn


What is the purpose of EXTENT_PAGE_MAPPED

2017-10-24 Thread Goldwyn Rodrigues

EXTENT_PAGE_MAPPED gets set in set_page_extent_mapped(), but I don't see
it being cross checked anytime. What is the purpose of setting it?

-- 
Goldwyn


[PATCH v2] btrfs: cleanup extent locking sequence

2017-10-16 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Code cleanup for better understanding:
Rename the variable needs_unlock to extents_locked to describe state as
opposed to action, and change its type to int, to reduce code in the
critical path.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>

---
Changes since v1:
fixed using extents_locked vs ret for error checks. Declared
closer to use.

 fs/btrfs/file.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f6c6754cf52d..aae589e0915a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1590,7 +1590,6 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
int ret = 0;
bool only_release_metadata = false;
bool force_page_uptodate = false;
-   bool need_unlock;
 
nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
PAGE_SIZE / (sizeof(struct page *)));
@@ -1613,6 +1612,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
size_t copied;
size_t dirty_sectors;
size_t num_sectors;
+   int extents_locked;
 
WARN_ON(num_pages > nrptrs);
 
@@ -1670,7 +1670,6 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
}
 
release_bytes = reserve_bytes;
-   need_unlock = false;
 again:
/*
 * This is going to setup the pages array with the number of
@@ -1683,16 +1682,15 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
if (ret)
break;
 
-   ret = lock_and_cleanup_extent_if_need(BTRFS_I(inode), pages,
+   extents_locked = lock_and_cleanup_extent_if_need(
+   BTRFS_I(inode), pages,
num_pages, pos, write_bytes, &lockstart,
&lockend, &cached_state);
-   if (ret < 0) {
-   if (ret == -EAGAIN)
+   if (extents_locked < 0) {
+   if (extents_locked == -EAGAIN)
goto again;
+   ret = extents_locked;
break;
-   } else if (ret > 0) {
-   need_unlock = true;
-   ret = 0;
}
 
copied = btrfs_copy_from_user(pos, write_bytes, pages, i);
@@ -1754,7 +1752,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
if (copied > 0)
ret = btrfs_dirty_pages(inode, pages, dirty_pages,
pos, copied, NULL);
-   if (need_unlock)
+   if (extents_locked)
unlock_extent_cached(&BTRFS_I(inode)->io_tree,
 lockstart, lockend, &cached_state,
 GFP_NOFS);
-- 
2.14.2



[PATCH] Remove unused dedupe argument btrfs_set_extent_delalloc()

2017-10-10 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>


Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/btrfs/ctree.h |  2 +-
 fs/btrfs/file.c  |  2 +-
 fs/btrfs/inode.c |  9 -
 fs/btrfs/relocation.c|  2 +-
 fs/btrfs/tests/inode-tests.c | 12 ++--
 5 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8fc690384c58..ac7e2b02a4df 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3174,7 +3174,7 @@ int btrfs_start_delalloc_inodes(struct btrfs_root *root, 
int delay_iput);
 int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int delay_iput,
   int nr);
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
- struct extent_state **cached_state, int dedupe);
+ struct extent_state **cached_state);
 int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
 struct btrfs_root *new_root,
 struct btrfs_root *parent_root,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index a3d006d14683..46fa02e109f3 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -504,7 +504,7 @@ int btrfs_dirty_pages(struct inode *inode, struct page 
**pages,
 
end_of_last_block = start_pos + num_bytes - 1;
err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
-   cached, 0);
+   cached);
if (err)
return err;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d94e3f68b9b1..9a3953fc3b45 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2036,7 +2036,7 @@ static noinline int add_pending_csums(struct btrfs_trans_handle *trans,
 }
 
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
- struct extent_state **cached_state, int dedupe)
+ struct extent_state **cached_state)
 {
WARN_ON((end & (PAGE_SIZE - 1)) == 0);
return set_extent_delalloc(&BTRFS_I(inode)->io_tree, start, end,
@@ -2101,8 +2101,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
goto out;
 }
 
-   btrfs_set_extent_delalloc(inode, page_start, page_end, &cached_state,
- 0);
+   btrfs_set_extent_delalloc(inode, page_start, page_end, &cached_state);
ClearPageChecked(page);
set_page_dirty(page);
 out:
@@ -4854,7 +4853,7 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
  0, 0, &cached_state, GFP_NOFS);
 
ret = btrfs_set_extent_delalloc(inode, block_start, block_end,
-   &cached_state, 0);
+   &cached_state);
if (ret) {
unlock_extent_cached(io_tree, block_start, block_end,
 &cached_state, GFP_NOFS);
@@ -9253,7 +9252,7 @@ int btrfs_page_mkwrite(struct vm_fault *vmf)
  0, 0, &cached_state, GFP_NOFS);
 
ret = btrfs_set_extent_delalloc(inode, page_start, end,
-   &cached_state, 0);
+   &cached_state);
if (ret) {
unlock_extent_cached(io_tree, page_start, page_end,
 &cached_state, GFP_NOFS);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 9841faef08ea..ff19edb84d0e 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3266,7 +3266,7 @@ static int relocate_file_extent_cluster(struct inode *inode,
nr++;
}
 
-   btrfs_set_extent_delalloc(inode, page_start, page_end, NULL, 0);
+   btrfs_set_extent_delalloc(inode, page_start, page_end, NULL);
set_page_dirty(page);
 
unlock_extent(&BTRFS_I(inode)->io_tree,
diff --git a/fs/btrfs/tests/inode-tests.c b/fs/btrfs/tests/inode-tests.c
index 8c91d03cc82d..1a7d8b65d500 100644
--- a/fs/btrfs/tests/inode-tests.c
+++ b/fs/btrfs/tests/inode-tests.c
@@ -970,7 +970,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
/* [BTRFS_MAX_EXTENT_SIZE] */
BTRFS_I(inode)->outstanding_extents++;
ret = btrfs_set_extent_delalloc(inode, 0, BTRFS_MAX_EXTENT_SIZE - 1,
-   NULL, 0);
+   NULL);
if (ret) {
test_msg("btrfs_set_extent_delalloc returned %d\n", ret);
goto out;
@@ -986,7 +986,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
BTRFS_I(inode)->outstanding_extents++;
ret = btrfs_set_extent_delalloc(inode, BTRFS_MAX_EXTENT_SIZE,
BTRFS_MAX_EXTENT_SIZE + sectorsize - 1,
-

[PATCH] btrfs: cleanup extent locking sequence

2017-10-10 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Code cleanup for better understanding:
Rename needs_unlock to extents_locked to describe state as opposed to
action.
Change the variable to int, to reduce code in the critical path (the
code usually executed).

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/btrfs/file.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f6c6754cf52d..a3d006d14683 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1590,7 +1590,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
int ret = 0;
bool only_release_metadata = false;
bool force_page_uptodate = false;
-   bool need_unlock;
+   int extents_locked;
 
nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
PAGE_SIZE / (sizeof(struct page *)));
@@ -1670,7 +1670,6 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
}
 
release_bytes = reserve_bytes;
-   need_unlock = false;
 again:
/*
 * This is going to setup the pages array with the number of
@@ -1683,16 +1682,15 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
if (ret)
break;
 
-   ret = lock_and_cleanup_extent_if_need(BTRFS_I(inode), pages,
+   extents_locked = lock_and_cleanup_extent_if_need(
+   BTRFS_I(inode), pages,
num_pages, pos, write_bytes, &lockstart,
&lockend, &cached_state);
-   if (ret < 0) {
+   if (extents_locked < 0) {
if (ret == -EAGAIN)
goto again;
+   ret = extents_locked;
break;
-   } else if (ret > 0) {
-   need_unlock = true;
-   ret = 0;
}
 
copied = btrfs_copy_from_user(pos, write_bytes, pages, i);
@@ -1754,7 +1752,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
if (copied > 0)
ret = btrfs_dirty_pages(inode, pages, dirty_pages,
pos, copied, NULL);
-   if (need_unlock)
+   if (extents_locked)
unlock_extent_cached(&BTRFS_I(inode)->io_tree,
 lockstart, lockend, &cached_state,
 GFP_NOFS);
-- 
2.14.2



Re: Commit edf064e7c (btrfs: nowait aio support) breaks shells

2017-07-07 Thread Goldwyn Rodrigues


On 07/04/2017 05:16 PM, Jens Axboe wrote:
> 
> Please expedite getting this upstream, asap.
> 

Jens,

I have posted an updated patch [1] and it is acked by David. Would you
pick it up or should it go through the btrfs tree (or some other tree)?

[1] https://patchwork.kernel.org/patch/9825813/

-- 
Goldwyn


[PATCH v2] btrfs: Correct assignment of pos

2017-07-04 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Assigning pos early for later use breaks append mode, because pos is
re-assigned in generic_write_checks(). Assign pos afterwards to get
the correct position to write from iocb->ki_pos.

Since check_can_nocow() also uses the value of pos, we move
generic_write_checks() before the check_can_nocow() call. The
IOCB_DIRECT checks are already present in generic_write_checks(),
so checking for IOCB_NOWAIT is enough.

Also, put the locking sequence in the fast path.

Changes since v1:
 - Moved pos higher up to encompass check_can_nocow() call.

Fixes: edf064e7c6fe ("btrfs: nowait aio support")
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/btrfs/file.c | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 59e2dccdf75b..ad53832838b5 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1875,16 +1875,25 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
ssize_t num_written = 0;
bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host);
ssize_t err;
-   loff_t pos = iocb->ki_pos;
+   loff_t pos;
size_t count = iov_iter_count(from);
loff_t oldsize;
int clean_page = 0;
 
-   if ((iocb->ki_flags & IOCB_NOWAIT) &&
-   (iocb->ki_flags & IOCB_DIRECT)) {
-   /* Don't sleep on inode rwsem */
-   if (!inode_trylock(inode))
+   if (!inode_trylock(inode)) {
+   if (iocb->ki_flags & IOCB_NOWAIT)
return -EAGAIN;
+   inode_lock(inode);
+   }
+
+   err = generic_write_checks(iocb, from);
+   if (err <= 0) {
+   inode_unlock(inode);
+   return err;
+   }
+
+   pos = iocb->ki_pos;
+   if (iocb->ki_flags & IOCB_NOWAIT) {
/*
 * We will allocate space in case nodatacow is not set,
 * so bail
@@ -1895,13 +1904,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
inode_unlock(inode);
return -EAGAIN;
}
-   } else
-   inode_lock(inode);
-
-   err = generic_write_checks(iocb, from);
-   if (err <= 0) {
-   inode_unlock(inode);
-   return err;
}
 
current->backing_dev_info = inode_to_bdi(inode);
-- 
2.12.0



[PATCH] btrfs: Correct assignment of pos

2017-07-04 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Assigning pos early for later use breaks append mode, because pos is
re-assigned in generic_write_checks(). Re-assign pos afterwards to get
the correct position to write from iocb->ki_pos.

Fixes: edf064e7c6fe ("btrfs: nowait aio support")
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Tested-by: Markus Trippelsdorf <mar...@trippelsdorf.de>
---
 fs/btrfs/file.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 59e2dccdf75b..7947781229e5 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1931,6 +1931,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 */
update_time_for_write(inode);
 
+   pos = iocb->ki_pos;
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
if (start_pos > oldsize) {
-- 
2.12.0



Re: Commit edf064e7c (btrfs: nowait aio support) breaks shells

2017-07-04 Thread Goldwyn Rodrigues


On 07/04/2017 02:45 AM, Markus Trippelsdorf wrote:
> On 2017.07.04 at 06:23 +0200, Markus Trippelsdorf wrote:
>> commit edf064e7c6fec3646b06c944a8e35d1a3de5c2c3 (HEAD, refs/bisect/bad)
>> Author: Goldwyn Rodrigues <rgold...@suse.com>
>> Date:   Tue Jun 20 07:05:49 2017 -0500
>>
>> btrfs: nowait aio support
>>
>> apparently breaks several shell related features on my system.
> 
> Here is a simple testcase:
> 
>  % echo "foo" >> test
>  % echo "foo" >> test
>  % cat test
>  foo
>  %
> 

Thanks for testing.
Yes, pos must be set with iocb->ki_pos for appends. I should not have
removed the initialization. Could you try this patch?

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 59e2dccdf75b..7947781229e5 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1931,6 +1931,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 */
update_time_for_write(inode);

+   pos = iocb->ki_pos;
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
if (start_pos > oldsize) {



-- 
Goldwyn


Re: [PATCH 0/10 v11] No wait AIO

2017-06-12 Thread Goldwyn Rodrigues


On 06/10/2017 12:34 AM, Al Viro wrote:
> On Thu, Jun 08, 2017 at 12:39:10AM -0700, Christoph Hellwig wrote:
>> As already indicated this whole series looks fine to me.
>>
>> Al: are you going to pick this up?  Or Andrew?
> 
> The main issue here is "let's bail out from ->write_iter() instances"
> patch.  It very obviously has holes in coverage.
> 
> Could we have FMODE_AIO_NOWAIT and make those who claim to support it
> set that in ->open()?  And make aio check that and bail out if asked
> for nowait on a file without that flag...
> 

Yes, I would agree.

We had FS_NOWAIT in filesystem type flags (in v3), but retracted it
later in v4.

Another option could be to keep the feature against FS_REQUIRES_DEV to
rule out filesystems which are not local, but it again has the problem
of holes in coverage.

I will work on adding FMODE_AIO_NOWAIT in the meantime.

Thanks,

-- 
Goldwyn


[PATCH 02/10] fs: Introduce filemap_range_has_page()

2017-06-06 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

filemap_range_has_page() returns true if the file's mapping has
a page within the given range. This function will be used
to check whether a write() call will trigger a writeback of
previous writes.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Jan Kara <j...@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 include/linux/fs.h |  2 ++
 mm/filemap.c   | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f53867140f43..dc0ab585cd56 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2517,6 +2517,8 @@ extern int filemap_fdatawait(struct address_space *);
 extern void filemap_fdatawait_keep_errors(struct address_space *);
 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
   loff_t lend);
+extern int filemap_range_has_page(struct address_space *, loff_t lstart,
+ loff_t lend);
 extern int filemap_write_and_wait(struct address_space *mapping);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend);
diff --git a/mm/filemap.c b/mm/filemap.c
index 6f1be573a5e6..87aba7698584 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -376,6 +376,39 @@ int filemap_flush(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_flush);
 
+/**
+ * filemap_range_has_page - check if a page exists in range.
+ * @mapping:   address space structure to wait for
+ * @start_byte:offset in bytes where the range starts
+ * @end_byte:  offset in bytes where the range ends (inclusive)
+ *
+ * Find at least one page in the range supplied, usually used to check if
+ * direct writing in this range will trigger a writeback.
+ */
+int filemap_range_has_page(struct address_space *mapping,
+  loff_t start_byte, loff_t end_byte)
+{
+   pgoff_t index = start_byte >> PAGE_SHIFT;
+   pgoff_t end = end_byte >> PAGE_SHIFT;
+   struct pagevec pvec;
+   int ret;
+
+   if (end_byte < start_byte)
+   return 0;
+
+   if (mapping->nrpages == 0)
+   return 0;
+
+   pagevec_init(&pvec, 0);
+   ret = pagevec_lookup(&pvec, mapping, index, 1);
+   if (!ret)
+   return 0;
+   ret = (pvec.pages[0]->index <= end);
+   pagevec_release(&pvec);
+   return ret;
+}
+EXPORT_SYMBOL(filemap_range_has_page);
+
 static int __filemap_fdatawait_range(struct address_space *mapping,
 loff_t start_byte, loff_t end_byte)
 {
-- 
2.12.0



[PATCH 03/10] fs: Use RWF_* flags for AIO operations

2017-06-06 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

aio_rw_flags is introduced in struct iocb (replacing aio_reserved1) and
will carry the RWF_* flags. We cannot use aio_flags because it is not
checked for validity, which may break existing applications.

Note, the only place RWF_HIPRI takes effect is dio_await_one().
In all other locations, the aio code returns -EIOCBQUEUED before the
RWF_HIPRI checks.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Jan Kara <j...@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/aio.c | 8 +++++++-
 include/uapi/linux/aio_abi.h | 2 +-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index f52d925ee259..020fa0045e3c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1541,7 +1541,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
ssize_t ret;
 
/* enforce forwards compatibility on users */
-   if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
+   if (unlikely(iocb->aio_reserved2)) {
pr_debug("EINVAL: reserve field set\n");
return -EINVAL;
}
@@ -1586,6 +1586,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
req->common.ki_flags |= IOCB_EVENTFD;
}
 
+   ret = kiocb_set_rw_flags(&req->common, iocb->aio_rw_flags);
+   if (unlikely(ret)) {
+   pr_debug("EINVAL: aio_rw_flags\n");
+   goto out_put_req;
+   }
+
ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
if (unlikely(ret)) {
pr_debug("EFAULT: aio_key\n");
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index bb2554f7fbd1..a2d4a8ac94ca 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -79,7 +79,7 @@ struct io_event {
 struct iocb {
/* these are internal to the kernel/libc. */
__u64   aio_data;   /* data to be returned in event's data */
-   __u32   PADDED(aio_key, aio_reserved1);
+   __u32   PADDED(aio_key, aio_rw_flags);
/* the kernel sets aio_key to the req # */
 
/* common fields */
-- 
2.12.0



[PATCH 05/10] fs: return if direct write will trigger writeback

2017-06-06 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Find out if the write will trigger a wait due to writeback. If yes,
return -EAGAIN.

Return -EINVAL for buffered AIO: there are multiple sources of
delay, such as page locks, dirty throttling logic, and page loading
from disk, which cannot be accounted for.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Jan Kara <j...@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 mm/filemap.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 097213275461..bc146efa6815 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2675,6 +2675,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 
pos = iocb->ki_pos;
 
+   if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
+   return -EINVAL;
+
if (limit != RLIM_INFINITY) {
if (iocb->ki_pos >= limit) {
send_sig(SIGXFSZ, current, 0);
@@ -2743,9 +2746,17 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
write_len = iov_iter_count(from);
end = (pos + write_len - 1) >> PAGE_SHIFT;
 
-   written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
-   if (written)
-   goto out;
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   /* If there are pages to writeback, return */
+   if (filemap_range_has_page(inode->i_mapping, pos,
+  pos + iov_iter_count(from)))
+   return -EAGAIN;
+   } else {
+   written = filemap_write_and_wait_range(mapping, pos,
+   pos + write_len - 1);
+   if (written)
+   goto out;
+   }
 
/*
 * After a write we want buffered reads to be sure to go to disk to get
-- 
2.12.0



[PATCH 04/10] fs: Introduce RWF_NOWAIT

2017-06-06 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

RWF_NOWAIT informs the kernel to bail out if an AIO request would block
for reasons such as file block allocations, triggered writeback, or
blocking on request allocation while performing direct I/O.

RWF_NOWAIT is translated to IOCB_NOWAIT in iocb->ki_flags.

The check for -EOPNOTSUPP is placed in generic_file_write_iter(), which
most filesystems reach, either directly as their .write_iter() or from
within their own .write_iter() implementation. For the rest, the check
is performed in the filesystem's .write_iter(), which is called for
direct IO specifically.

Filesystems xfs, btrfs and ext4 will be supported in the following patches.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Jan Kara <j...@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/9p/vfs_file.c| 3 +++
 fs/aio.c| 6 ++++++
 fs/ceph/file.c  | 3 +++
 fs/cifs/file.c  | 3 +++
 fs/fuse/file.c  | 3 +++
 fs/nfs/direct.c | 3 +++
 fs/ocfs2/file.c | 3 +++
 include/linux/fs.h  | 5 ++++-
 include/uapi/linux/fs.h | 1 +
 mm/filemap.c| 3 +++
 10 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index 3de3b4a89d89..403681db7723 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -411,6 +411,9 @@ v9fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
loff_t origin;
int err = 0;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
retval = generic_write_checks(iocb, from);
if (retval <= 0)
return retval;
diff --git a/fs/aio.c b/fs/aio.c
index 020fa0045e3c..34027b67e2f4 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1592,6 +1592,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
goto out_put_req;
}
 
+   if ((req->common.ki_flags & IOCB_NOWAIT) &&
+   !(req->common.ki_flags & IOCB_DIRECT)) {
+   ret = -EOPNOTSUPP;
+   goto out_put_req;
+   }
+
ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
if (unlikely(ret)) {
pr_debug("EFAULT: aio_key\n");
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 29308a80d66f..366b0bb71f97 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1300,6 +1300,9 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
int err, want, got;
loff_t pos;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
if (ceph_snap(inode) != CEPH_NOSNAP)
return -EROFS;
 
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0fd081bd2a2f..ff84fa9ddb6c 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2725,6 +2725,9 @@ ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from)
 * write request.
 */
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
rc = generic_write_checks(iocb, from);
if (rc <= 0)
return rc;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3ee4fdc3da9e..812c7bd0c290 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1425,6 +1425,9 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(file);
ssize_t res;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
if (is_bad_inode(inode))
return -EIO;
 
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 6fb9fad2d1e6..c8e7dd76126c 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -979,6 +979,9 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
file, iov_iter_count(iter), (long long) iocb->ki_pos);
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
result = generic_write_checks(iocb, iter);
if (result <= 0)
return result;
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index bfeb647459d9..e7f8ba890305 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2235,6 +2235,9 @@ static ssize_t ocfs2_file_write_iter(struct kiocb *iocb,
if (count == 0)
return 0;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
direct_io = iocb->ki_flags & IOCB_DIRECT ? 1 : 0;
 
inode_lock(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dc0ab585cd56..2a7d14af6d12 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -268,6 +268,7 @@ struct writeback_control;
 #define IOCB_DSYNC (1 << 4)
 #define IOCB_SYNC  (1 << 5)
 #define IOCB_WRITE (1 << 6)
+#define IOCB_NOWAIT(1 << 7)

[PATCH 07/10] block: return on congested block device

2017-06-06 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

A new bio operation flag, REQ_NOWAIT, is introduced to identify bios
originating from an iocb with IOCB_NOWAIT. The flag indicates that the
submission should return immediately if a request cannot be made,
instead of retrying.

Stacked devices such as md (the ones with make_request_fn hooks) are
currently not supported because they may block for housekeeping.
For example, part of an md device may be suspended.
For this reason, only request-based devices are supported.
In the future, this feature will be expanded to stacked devices
by teaching them how to handle the REQ_NOWAIT flag.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 block/blk-core.c  | 23 +++++++++++++++++++++--
 block/blk-mq-sched.c  |  3 +++
 block/blk-mq.c|  2 ++
 fs/direct-io.c| 10 ++++++++--
 include/linux/bio.h   |  6 ++++++
 include/linux/blk_types.h |  2 ++
 6 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a7421b772d0e..972d6fdb1432 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1256,6 +1256,11 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
if (!IS_ERR(rq))
return rq;
 
+   if (op & REQ_NOWAIT) {
+   blk_put_rl(rl);
+   return ERR_PTR(-EAGAIN);
+   }
+
if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
blk_put_rl(rl);
return rq;
@@ -1900,6 +1905,16 @@ generic_make_request_checks(struct bio *bio)
goto end_io;
}
 
+   /*
+* For a REQ_NOWAIT based request, return -EOPNOTSUPP
+* if queue is not a request based queue.
+*/
+
+   if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q)) {
+   err = -EOPNOTSUPP;
+   goto end_io;
+   }
+
part = bio->bi_bdev->bd_part;
if (should_fail_request(part, bio->bi_iter.bi_size) ||
should_fail_request(&part_to_disk(part)->part0,
@@ -2057,7 +2072,7 @@ blk_qc_t generic_make_request(struct bio *bio)
do {
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-   if (likely(blk_queue_enter(q, false) == 0)) {
+   if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) {
struct bio_list lower, same;
 
/* Create a fresh bio_list for all subordinate requests */
@@ -2082,7 +2097,11 @@ blk_qc_t generic_make_request(struct bio *bio)
bio_list_merge(&bio_list_on_stack[0], &same);
bio_list_merge(&bio_list_on_stack[0],
&bio_list_on_stack[1]);
} else {
-   bio_io_error(bio);
+   if (unlikely(!blk_queue_dying(q) &&
+   (bio->bi_opf & REQ_NOWAIT)))
+   bio_wouldblock_error(bio);
+   else
+   bio_io_error(bio);
}
bio = bio_list_pop(_list_on_stack[0]);
} while (bio);
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 1f5b692526ae..9a1dea8b964e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -83,6 +83,9 @@ struct request *blk_mq_sched_get_request(struct request_queue *q,
if (likely(!data->hctx))
data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
 
+   if (op & REQ_NOWAIT)
+   data->flags |= BLK_MQ_REQ_NOWAIT;
+
if (e) {
data->flags |= BLK_MQ_REQ_INTERNAL;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1bcccedcc74f..b0608f1955b2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1556,6 +1556,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
if (unlikely(!rq)) {
__wbt_done(q->rq_wb, wb_acct);
+   if (bio->bi_opf & REQ_NOWAIT)
+   bio_wouldblock_error(bio);
return BLK_QC_T_NONE;
}
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea77de8..139ebd5ae1c7 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -480,8 +480,12 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
unsigned i;
int err;
 
-   if (bio->bi_error)
-   dio->io_error = -EIO;
+   if (bio->bi_error) {
+   if (bio->bi_error == -EAGAIN && (bio->bi_opf & REQ_NOWAIT))
+   dio->io_error = -EAGAIN;
+   else
+   dio->io_error = -EIO;
+   }
 
if (dio->is_async && dio->op == REQ_OP_READ && dio->should_dirty) {
err =

[PATCH 10/10] btrfs: nowait aio support

2017-06-06 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Return EAGAIN if any of the following checks fail:
 + i_rwsem is not lockable
 + NODATACOW or PREALLOC is not set
 + Cannot nocow at the desired location
 + Writing beyond end of file which is not allocated

Acked-by: David Sterba <dste...@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/btrfs/file.c  | 25 ++++++++++++++++++++-----
 fs/btrfs/inode.c |  3 +++
 2 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index da1096eb1a40..aae088e49915 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1875,12 +1875,29 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
ssize_t num_written = 0;
bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host);
ssize_t err;
-   loff_t pos;
-   size_t count;
+   loff_t pos = iocb->ki_pos;
+   size_t count = iov_iter_count(from);
loff_t oldsize;
int clean_page = 0;
 
-   inode_lock(inode);
+   if ((iocb->ki_flags & IOCB_NOWAIT) &&
+   (iocb->ki_flags & IOCB_DIRECT)) {
+   /* Don't sleep on inode rwsem */
+   if (!inode_trylock(inode))
+   return -EAGAIN;
+   /*
+* We will allocate space in case nodatacow is not set,
+* so bail
+*/
+   if (!(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
+ BTRFS_INODE_PREALLOC)) ||
+   check_can_nocow(BTRFS_I(inode), pos, &count) <= 0) {
+   inode_unlock(inode);
+   return -EAGAIN;
+   }
+   } else
+   inode_lock(inode);
+
err = generic_write_checks(iocb, from);
if (err <= 0) {
inode_unlock(inode);
@@ -1914,8 +1931,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 */
update_time_for_write(inode);
 
-   pos = iocb->ki_pos;
-   count = iov_iter_count(from);
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
if (start_pos > oldsize) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 17cbe9306faf..2ab71b946829 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8755,6 +8755,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
dio_data.overwrite = 1;
inode_unlock(inode);
relock = true;
+   } else if (iocb->ki_flags & IOCB_NOWAIT) {
+   ret = -EAGAIN;
+   goto out;
}
ret = btrfs_delalloc_reserve_space(inode, offset, count);
if (ret)
-- 
2.12.0



[PATCH 01/10] fs: Separate out kiocb flags setup based on RWF_* flags

2017-06-06 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Reviewed-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Jan Kara <j...@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/read_write.c| 12 +++---------
 include/linux/fs.h | 14 ++++++++++++++
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 47c1d4484df9..53c816c61122 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -678,16 +678,10 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
struct kiocb kiocb;
ssize_t ret;
 
-   if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
-   return -EOPNOTSUPP;
-
init_sync_kiocb(&kiocb, filp);
-   if (flags & RWF_HIPRI)
-   kiocb.ki_flags |= IOCB_HIPRI;
-   if (flags & RWF_DSYNC)
-   kiocb.ki_flags |= IOCB_DSYNC;
-   if (flags & RWF_SYNC)
-   kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+   ret = kiocb_set_rw_flags(&kiocb, flags);
+   if (ret)
+   return ret;
kiocb.ki_pos = *ppos;
 
if (type == READ)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 803e5a9b2654..f53867140f43 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3056,6 +3056,20 @@ static inline int iocb_flags(struct file *file)
return res;
 }
 
+static inline int kiocb_set_rw_flags(struct kiocb *ki, int flags)
+{
+   if (unlikely(flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC)))
+   return -EOPNOTSUPP;
+
+   if (flags & RWF_HIPRI)
+   ki->ki_flags |= IOCB_HIPRI;
+   if (flags & RWF_DSYNC)
+   ki->ki_flags |= IOCB_DSYNC;
+   if (flags & RWF_SYNC)
+   ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+   return 0;
+}
+
 static inline ino_t parent_ino(struct dentry *dentry)
 {
ino_t res;
-- 
2.12.0



[PATCH 06/10] fs: Introduce IOMAP_NOWAIT

2017-06-06 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

IOCB_NOWAIT translates to IOMAP_NOWAIT for iomaps.
This is used by XFS in the XFS patch.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Jan Kara <j...@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/iomap.c| 2 ++
 include/linux/iomap.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index 4b10892967a5..5d85ec6e7b20 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -879,6 +879,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
} else {
dio->flags |= IOMAP_DIO_WRITE;
flags |= IOMAP_WRITE;
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   flags |= IOMAP_NOWAIT;
}
 
ret = filemap_write_and_wait_range(mapping, start, end);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index f753e788da31..69f4e9470084 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -52,6 +52,7 @@ struct iomap {
 #define IOMAP_REPORT   (1 << 2) /* report extent status, e.g. FIEMAP */
 #define IOMAP_FAULT(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT   (1 << 4) /* direct I/O */
+#define IOMAP_NOWAIT   (1 << 5) /* Don't wait for writeback */
 
 struct iomap_ops {
/*
-- 
2.12.0



[PATCH 09/10] xfs: nowait aio support

2017-06-06 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

If IOCB_NOWAIT is set, bail if the i_rwsem is not lockable
immediately.

If IOMAP_NOWAIT is set, return EAGAIN in xfs_file_iomap_begin
if the write needs allocation, either due to file extension, writing to
a hole, or COW, or if it would wait for other DIOs to finish.

Return -EAGAIN if we don't have the extent list in memory.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>
---
 fs/xfs/xfs_file.c  | 19 ++++++++++++++-----
 fs/xfs/xfs_iomap.c | 22 ++++++++++++++++++++++
 2 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5fb5a0958a14..f87a8a66e6f7 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -541,8 +541,11 @@ xfs_file_dio_aio_write(
iolock = XFS_IOLOCK_SHARED;
}
 
-   xfs_ilock(ip, iolock);
-
+   if (!xfs_ilock_nowait(ip, iolock)) {
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EAGAIN;
+   xfs_ilock(ip, iolock);
+   }
ret = xfs_file_aio_write_checks(iocb, from, &iolock);
if (ret)
goto out;
@@ -553,9 +556,15 @@ xfs_file_dio_aio_write(
 * otherwise demote the lock if we had to take the exclusive lock
 * for other reasons in xfs_file_aio_write_checks.
 */
-   if (unaligned_io)
-   inode_dio_wait(inode);
-   else if (iolock == XFS_IOLOCK_EXCL) {
+   if (unaligned_io) {
+   /* If we are going to wait for other DIO to finish, bail */
+   if (iocb->ki_flags & IOCB_NOWAIT) {
if (atomic_read(&inode->i_dio_count))
+   return -EAGAIN;
+   } else {
+   inode_dio_wait(inode);
+   }
+   } else if (iolock == XFS_IOLOCK_EXCL) {
xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
iolock = XFS_IOLOCK_SHARED;
}
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 94e5bdf7304c..05dc87e8c1f5 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -995,6 +995,11 @@ xfs_file_iomap_begin(
lockmode = xfs_ilock_data_map_shared(ip);
}
 
+   if ((flags & IOMAP_NOWAIT) && !(ip->i_df.if_flags & XFS_IFEXTENTS)) {
+   error = -EAGAIN;
+   goto out_unlock;
+   }
+
ASSERT(offset <= mp->m_super->s_maxbytes);
if ((xfs_fsize_t)offset + length > mp->m_super->s_maxbytes)
length = mp->m_super->s_maxbytes - offset;
@@ -1016,6 +1021,15 @@ xfs_file_iomap_begin(
 
if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
if (flags & IOMAP_DIRECT) {
+   /*
+* A reflinked inode will result in CoW alloc.
+* FIXME: It could still overwrite on unshared extents
+* and not need allocation.
+*/
+   if (flags & IOMAP_NOWAIT) {
+   error = -EAGAIN;
+   goto out_unlock;
+   }
/* may drop and re-acquire the ilock */
error = xfs_reflink_allocate_cow(ip, &imap, &shared,
&lockmode);
@@ -1033,6 +1047,14 @@ xfs_file_iomap_begin(
 
if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, , nimaps)) {
/*
+* If nowait is set bail since we are going to make
+* allocations.
+*/
+   if (flags & IOMAP_NOWAIT) {
+   error = -EAGAIN;
+   goto out_unlock;
+   }
+   /*
 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
 * pages to keep the chunks of work done where somewhat symmetric
 * with the work writeback does. This is a completely arbitrary
-- 
2.12.0



[PATCH 08/10] ext4: nowait aio support

2017-06-06 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Return EAGAIN for direct I/O if any of the following would block:
  + i_rwsem is not immediately lockable
  + Writing beyond end of file (will trigger allocation)
  + Blocks are not allocated at the write location

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Jan Kara <j...@suse.cz>
---
 fs/ext4/file.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 831fd6beebf0..07f08ff2c11b 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -216,7 +216,13 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
return ext4_dax_write_iter(iocb, from);
 #endif
 
-   inode_lock(inode);
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   if (!inode_trylock(inode))
+   return -EAGAIN;
+   } else {
+   inode_lock(inode);
+   }
+
ret = ext4_write_checks(iocb, from);
if (ret <= 0)
goto out;
@@ -235,9 +241,15 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
iocb->private = &overwrite;
/* Check whether we do a DIO overwrite or not */
-   if (o_direct && ext4_should_dioread_nolock(inode) && !unaligned_aio &&
-   ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from)))
-   overwrite = 1;
+   if (o_direct && !unaligned_aio) {
+   if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
+   if (ext4_should_dioread_nolock(inode))
+   overwrite = 1;
+   } else if (iocb->ki_flags & IOCB_NOWAIT) {
+   ret = -EAGAIN;
+   goto out;
+   }
+   }
 
ret = __generic_file_write_iter(iocb, from);
inode_unlock(inode);
-- 
2.12.0



[PATCH 0/10 v11] No wait AIO

2017-06-06 Thread Goldwyn Rodrigues
This series adds a nonblocking feature to asynchronous I/O writes.
io_submit() can be delayed because of a number of reasons:
 - Block allocation for files
 - Data writebacks for direct I/O
 - Sleeping because of waiting to acquire i_rwsem
 - Congested block device

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s to the kernel to the best of its ability to complete,
and if one returns -EAGAIN, defer it to another thread.

In order to enable this, IOCB_RW_FLAG_NOWAIT is introduced in
uapi/linux/aio_abi.h. If set for aio_rw_flags, it translates to
IOCB_NOWAIT for struct iocb, REQ_NOWAIT for bio.bi_opf and IOMAP_NOWAIT for
iomap. aio_rw_flags is a new flag replacing aio_reserved1. We could
not use aio_flags because it is not currently checked for invalidity
in the kernel.

This feature is provided for direct I/O of asynchronous I/O only. I have
tested it against xfs, ext4, and btrfs, and I intend to add more filesystems.
The nowait feature is for request-based devices. In the future, I intend to
add support for stacked devices such as md.

Applications will have to check supportability by sending an async direct
write; any error besides -EAGAIN (or success) means the feature is not
supported.

First two patches are prep patches into nowait I/O.

Changes since v1:
 + changed name from _NONBLOCKING to *_NOWAIT
 + filemap_range_has_page call moved closer to (just before) the call to
filemap_write_and_wait_range().
 + BIO_NOWAIT limited to get_request()
 + XFS fixes 
- included reflink 
- use of xfs_ilock_nowait() instead of a XFS_IOLOCK_NONBLOCKING flag
- Translate the flag through IOMAP_NOWAIT (iomap) to check for
  block allocation for the file.
 + ext4 coding style

Changes since v2:
 + Using aio_reserved1 as aio_rw_flags instead of aio_flags
 + blk-mq support
 + xfs uptodate with kernel and reflink changes

 Changes since v3:
  + Added FS_NOWAIT, which is set if the filesystem supports NOWAIT feature.
  + Checks in generic_make_request() to make sure BIO_NOWAIT comes in
for async direct writes only.
  + Added QUEUE_FLAG_NOWAIT, which is set if the device supports BIO_NOWAIT.
This is added (rather not set) to block devices such as dm/md currently.

 Changes since v4:
  + Ported AIO code to use RWF_* flags. Check for RWF_* flags in
generic_file_write_iter().
  + Changed IOCB_RW_FLAGS_NOWAIT to RWF_NOWAIT.

 Changes since v5:
  + BIO_NOWAIT to REQ_NOWAIT
  + Common helper for RWF flags.

 Changes since v6:
  + REQ_NOWAIT will be ignored for request-based devices since they
cannot block. So, removed QUEUE_FLAG_NOWAIT since it is not
required in the current implementation. It will be resurrected
when we program for stacked devices.
  + changed kiocb_rw_flags() to kiocb_set_rw_flags() in order to accommodate
for errors. Moved the checks into the function.

 Changes since v7:
  + split out prep patches so the main patches are smaller and easier
to understand
  + All patches are reviewed or acked!
 
 Changes since v8:
 + Err out AIO reads with -EINVAL flagged as RWF_NOWAIT

 Changes since v9:
 + Retract - Err out AIO reads with -EINVAL flagged as RWF_NOWAIT
 + XFS returns EAGAIN if extent list is not in memory
 + Man page updates to io_submit with iocb description and nowait features.

 Changes since v10:
 + Corrected comment and subject in "return on congested block device"

-- 
Goldwyn




Re: [RFC PATCH] btrfs: qgroup: Fix hang when using inode_cache and qgroup

2017-06-05 Thread Goldwyn Rodrigues


On 05/31/2017 03:08 AM, Qu Wenruo wrote:
> Commit 48a89bc4f2ce ("btrfs: qgroups: Retry after commit on getting EDQUOT")
> is causing hang, with the following backtrace:
> 
> Call Trace:
>  __schedule+0x374/0xaf0
>  schedule+0x3d/0x90
>  wait_for_commit+0x4a/0x80 [btrfs]
>  ? wake_atomic_t_function+0x60/0x60
>  btrfs_commit_transaction+0xe0/0xa10 [btrfs]  <<< Here
>  ? start_transaction+0xad/0x510 [btrfs]
>  qgroup_reserve+0x1f0/0x350 [btrfs]
>  btrfs_qgroup_reserve_data+0xf8/0x2f0 [btrfs]
>  ? _raw_spin_unlock+0x27/0x40
>  btrfs_check_data_free_space+0x6d/0xb0 [btrfs]
>  btrfs_delalloc_reserve_space+0x25/0x70 [btrfs]
>  btrfs_save_ino_cache+0x402/0x650 [btrfs]
>  commit_fs_roots+0xb7/0x170 [btrfs]
>  btrfs_commit_transaction+0x425/0xa10 [btrfs] <<< And here
>  qgroup_reserve+0x1f0/0x350 [btrfs]
>  btrfs_qgroup_reserve_data+0xf8/0x2f0 [btrfs]
>  ? _raw_spin_unlock+0x27/0x40
>  btrfs_check_data_free_space+0x6d/0xb0 [btrfs]
>  btrfs_delalloc_reserve_space+0x25/0x70 [btrfs]
>  btrfs_direct_IO+0x1c5/0x3b0 [btrfs]
>  generic_file_direct_write+0xab/0x150
>  btrfs_file_write_iter+0x243/0x530 [btrfs]
>  __vfs_write+0xc9/0x120
>  vfs_write+0xcb/0x1f0
>  SyS_pwrite64+0x79/0x90
>  entry_SYSCALL_64_fastpath+0x18/0xad
> 
> The problem is that, inode_cache will be written in commit_fs_roots(),
> which is called in btrfs_commit_transaction().
> 
> And when it fails to reserve enough data space, qgroup_reserve() will
> try to call btrfs_commit_transaction() again, then we are waiting for
> ourselves.
> 
> The patch will introduce can_retry parameter for qgroup_reserve(),
> allowing related callers to avoid deadly commit transaction deadlock.
> 
> Now for space cache inode, we will not allow qgroup retry, so it will
> not cause deadlock.
> 
> Fixes: 48a89bc4f2ce ("btrfs: qgroups: Retry after commit on getting EDQUOT")
> Cc: Goldwyn Rodrigues <rgold...@suse.de>
> Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
> ---
> Commit  48a89bc4f2ce ("btrfs: qgroups: Retry after commit on getting EDQUOT")
> is not only causing such deadlock, but also screwing up qgroup reserved
> space for even generic test cases.
> 
> I'm afraid we may need to revert that commit if we can't find a good way
> to fix the newly caused qgroup meta reserved space underflow.
> (Unlike old bug which is qgroup data reserved space underflow, this time
> the commit is causing new metadata space underflow).

I tried the same with direct I/O and got the same results. I run into
underflows often. By reverting the patch, we are avoiding the problem,
not resolving it. The numbers don't add up, and the point is to find out
where the numbers are getting lost (or counted in excess). I will
continue investigating on this front.

By ignoring the warning (unset BTRFS_DEBUG) and continuing during
overflow, we are just avoiding the problem. It does not show up in dmesg
any longer.


-- 
Goldwyn


[PATCH 01/10] fs: Separate out kiocb flags setup based on RWF_* flags

2017-06-04 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Reviewed-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Jan Kara <j...@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/read_write.c| 12 +++-
 include/linux/fs.h | 14 ++
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 47c1d4484df9..53c816c61122 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -678,16 +678,10 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
struct kiocb kiocb;
ssize_t ret;
 
-   if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
-   return -EOPNOTSUPP;
-
init_sync_kiocb(&kiocb, filp);
-   if (flags & RWF_HIPRI)
-   kiocb.ki_flags |= IOCB_HIPRI;
-   if (flags & RWF_DSYNC)
-   kiocb.ki_flags |= IOCB_DSYNC;
-   if (flags & RWF_SYNC)
-   kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+   ret = kiocb_set_rw_flags(&kiocb, flags);
+   if (ret)
+   return ret;
kiocb.ki_pos = *ppos;
 
if (type == READ)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 803e5a9b2654..f53867140f43 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3056,6 +3056,20 @@ static inline int iocb_flags(struct file *file)
return res;
 }
 
+static inline int kiocb_set_rw_flags(struct kiocb *ki, int flags)
+{
+   if (unlikely(flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC)))
+   return -EOPNOTSUPP;
+
+   if (flags & RWF_HIPRI)
+   ki->ki_flags |= IOCB_HIPRI;
+   if (flags & RWF_DSYNC)
+   ki->ki_flags |= IOCB_DSYNC;
+   if (flags & RWF_SYNC)
+   ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+   return 0;
+}
+
 static inline ino_t parent_ino(struct dentry *dentry)
 {
ino_t res;
-- 
2.12.0



[PATCH 02/10] fs: Introduce filemap_range_has_page()

2017-06-04 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

filemap_range_has_page() returns true if the file's mapping has
a page within the range mentioned. This function will be used
to check if a write() call will cause a writeback of previous
writes.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Jan Kara <j...@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 include/linux/fs.h |  2 ++
 mm/filemap.c   | 33 +
 2 files changed, 35 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f53867140f43..dc0ab585cd56 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2517,6 +2517,8 @@ extern int filemap_fdatawait(struct address_space *);
 extern void filemap_fdatawait_keep_errors(struct address_space *);
 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
   loff_t lend);
+extern int filemap_range_has_page(struct address_space *, loff_t lstart,
+ loff_t lend);
 extern int filemap_write_and_wait(struct address_space *mapping);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend);
diff --git a/mm/filemap.c b/mm/filemap.c
index 6f1be573a5e6..87aba7698584 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -376,6 +376,39 @@ int filemap_flush(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_flush);
 
+/**
+ * filemap_range_has_page - check if a page exists in range.
+ * @mapping:   address space structure to wait for
+ * @start_byte:offset in bytes where the range starts
+ * @end_byte:  offset in bytes where the range ends (inclusive)
+ *
+ * Find at least one page in the range supplied, usually used to check if
+ * direct writing in this range will trigger a writeback.
+ */
+int filemap_range_has_page(struct address_space *mapping,
+  loff_t start_byte, loff_t end_byte)
+{
+   pgoff_t index = start_byte >> PAGE_SHIFT;
+   pgoff_t end = end_byte >> PAGE_SHIFT;
+   struct pagevec pvec;
+   int ret;
+
+   if (end_byte < start_byte)
+   return 0;
+
+   if (mapping->nrpages == 0)
+   return 0;
+
+   pagevec_init(&pvec, 0);
+   ret = pagevec_lookup(&pvec, mapping, index, 1);
+   if (!ret)
+   return 0;
+   ret = (pvec.pages[0]->index <= end);
+   pagevec_release(&pvec);
+   return ret;
+}
+EXPORT_SYMBOL(filemap_range_has_page);
+
 static int __filemap_fdatawait_range(struct address_space *mapping,
 loff_t start_byte, loff_t end_byte)
 {
-- 
2.12.0



[PATCH 06/10] fs: Introduce IOMAP_NOWAIT

2017-06-04 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

IOCB_NOWAIT translates to IOMAP_NOWAIT for iomaps.
This is used by XFS in the XFS patch.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Jan Kara <j...@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/iomap.c| 2 ++
 include/linux/iomap.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index 4b10892967a5..5d85ec6e7b20 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -879,6 +879,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
} else {
dio->flags |= IOMAP_DIO_WRITE;
flags |= IOMAP_WRITE;
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   flags |= IOMAP_NOWAIT;
}
 
ret = filemap_write_and_wait_range(mapping, start, end);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index f753e788da31..69f4e9470084 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -52,6 +52,7 @@ struct iomap {
 #define IOMAP_REPORT   (1 << 2) /* report extent status, e.g. FIEMAP */
 #define IOMAP_FAULT(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT   (1 << 4) /* direct I/O */
+#define IOMAP_NOWAIT   (1 << 5) /* Don't wait for writeback */
 
 struct iomap_ops {
/*
-- 
2.12.0



[PATCH 09/10] xfs: nowait aio support

2017-06-04 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

If IOCB_NOWAIT is set, bail if the i_rwsem is not lockable
immediately.

If IOMAP_NOWAIT is set, return -EAGAIN from xfs_file_iomap_begin
if the write needs allocation, either due to file extension, writing to
a hole, or COW, or if it would wait for other DIOs to finish.

Return -EAGAIN if we don't have the extent list in memory.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>
---
 fs/xfs/xfs_file.c  | 19 ++-
 fs/xfs/xfs_iomap.c | 22 ++
 2 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5fb5a0958a14..f87a8a66e6f7 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -541,8 +541,11 @@ xfs_file_dio_aio_write(
iolock = XFS_IOLOCK_SHARED;
}
 
-   xfs_ilock(ip, iolock);
-
+   if (!xfs_ilock_nowait(ip, iolock)) {
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EAGAIN;
+   xfs_ilock(ip, iolock);
+   }
ret = xfs_file_aio_write_checks(iocb, from, &iolock);
if (ret)
goto out;
@@ -553,9 +556,15 @@ xfs_file_dio_aio_write(
 * otherwise demote the lock if we had to take the exclusive lock
 * for other reasons in xfs_file_aio_write_checks.
 */
-   if (unaligned_io)
-   inode_dio_wait(inode);
-   else if (iolock == XFS_IOLOCK_EXCL) {
+   if (unaligned_io) {
+   /* If we are going to wait for other DIO to finish, bail */
+   if (iocb->ki_flags & IOCB_NOWAIT) {
if (atomic_read(&inode->i_dio_count))
+   return -EAGAIN;
+   } else {
+   inode_dio_wait(inode);
+   }
+   } else if (iolock == XFS_IOLOCK_EXCL) {
xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
iolock = XFS_IOLOCK_SHARED;
}
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 94e5bdf7304c..05dc87e8c1f5 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -995,6 +995,11 @@ xfs_file_iomap_begin(
lockmode = xfs_ilock_data_map_shared(ip);
}
 
+   if ((flags & IOMAP_NOWAIT) && !(ip->i_df.if_flags & XFS_IFEXTENTS)) {
+   error = -EAGAIN;
+   goto out_unlock;
+   }
+
ASSERT(offset <= mp->m_super->s_maxbytes);
if ((xfs_fsize_t)offset + length > mp->m_super->s_maxbytes)
length = mp->m_super->s_maxbytes - offset;
@@ -1016,6 +1021,15 @@ xfs_file_iomap_begin(
 
if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
if (flags & IOMAP_DIRECT) {
+   /*
+* A reflinked inode will result in CoW alloc.
+* FIXME: It could still overwrite on unshared extents
+* and not need allocation.
+*/
+   if (flags & IOMAP_NOWAIT) {
+   error = -EAGAIN;
+   goto out_unlock;
+   }
/* may drop and re-acquire the ilock */
error = xfs_reflink_allocate_cow(ip, &imap, &shared,
&lockmode);
@@ -1033,6 +1047,14 @@ xfs_file_iomap_begin(
 
if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, , nimaps)) {
/*
+* If nowait is set bail since we are going to make
+* allocations.
+*/
+   if (flags & IOMAP_NOWAIT) {
+   error = -EAGAIN;
+   goto out_unlock;
+   }
+   /*
 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
 * pages to keep the chunks of work done where somewhat symmetric
 * with the work writeback does. This is a completely arbitrary
-- 
2.12.0



[PATCH 08/10] ext4: nowait aio support

2017-06-04 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Return EAGAIN for direct I/O if any of the following conditions hold:
  + i_rwsem is not immediately lockable
  + the write extends beyond the end of file (will trigger allocation)
  + blocks are not allocated at the write location

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Jan Kara <j...@suse.cz>
---
 fs/ext4/file.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 831fd6beebf0..07f08ff2c11b 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -216,7 +216,13 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
return ext4_dax_write_iter(iocb, from);
 #endif
 
-   inode_lock(inode);
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   if (!inode_trylock(inode))
+   return -EAGAIN;
+   } else {
+   inode_lock(inode);
+   }
+
ret = ext4_write_checks(iocb, from);
if (ret <= 0)
goto out;
@@ -235,9 +241,15 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
iocb->private = &overwrite;
/* Check whether we do a DIO overwrite or not */
-   if (o_direct && ext4_should_dioread_nolock(inode) && !unaligned_aio &&
-   ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from)))
-   overwrite = 1;
+   if (o_direct && !unaligned_aio) {
+   if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
+   if (ext4_should_dioread_nolock(inode))
+   overwrite = 1;
+   } else if (iocb->ki_flags & IOCB_NOWAIT) {
+   ret = -EAGAIN;
+   goto out;
+   }
+   }
 
ret = __generic_file_write_iter(iocb, from);
inode_unlock(inode);
-- 
2.12.0



[PATCH 10/10] btrfs: nowait aio support

2017-06-04 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Return EAGAIN if any of the following conditions hold:
 + i_rwsem is not immediately lockable
 + NODATACOW or PREALLOC is not set
 + we cannot nocow at the desired location
 + the write extends beyond the end of file into unallocated space

Acked-by: David Sterba <dste...@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/btrfs/file.c  | 25 -
 fs/btrfs/inode.c |  3 +++
 2 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index da1096eb1a40..aae088e49915 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1875,12 +1875,29 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
ssize_t num_written = 0;
bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host);
ssize_t err;
-   loff_t pos;
-   size_t count;
+   loff_t pos = iocb->ki_pos;
+   size_t count = iov_iter_count(from);
loff_t oldsize;
int clean_page = 0;
 
-   inode_lock(inode);
+   if ((iocb->ki_flags & IOCB_NOWAIT) &&
+   (iocb->ki_flags & IOCB_DIRECT)) {
+   /* Don't sleep on inode rwsem */
+   if (!inode_trylock(inode))
+   return -EAGAIN;
+   /*
+* We will allocate space in case nodatacow is not set,
+* so bail
+*/
+   if (!(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
+ BTRFS_INODE_PREALLOC)) ||
+   check_can_nocow(BTRFS_I(inode), pos, &count) <= 0) {
+   inode_unlock(inode);
+   return -EAGAIN;
+   }
+   } else
+   inode_lock(inode);
+
err = generic_write_checks(iocb, from);
if (err <= 0) {
inode_unlock(inode);
@@ -1914,8 +1931,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 */
update_time_for_write(inode);
 
-   pos = iocb->ki_pos;
-   count = iov_iter_count(from);
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
if (start_pos > oldsize) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 17cbe9306faf..2ab71b946829 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8755,6 +8755,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
dio_data.overwrite = 1;
inode_unlock(inode);
relock = true;
+   } else if (iocb->ki_flags & IOCB_NOWAIT) {
+   ret = -EAGAIN;
+   goto out;
}
ret = btrfs_delalloc_reserve_space(inode, offset, count);
if (ret)
-- 
2.12.0



[PATCH 03/10] fs: Use RWF_* flags for AIO operations

2017-06-04 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

aio_rw_flags is introduced in struct iocb (reusing aio_reserved1) and will
carry the RWF_* flags. We cannot use aio_flags because it is not
checked for validity, which may break existing applications.

Note: the only place RWF_HIPRI takes effect is dio_await_one().
In all other locations, the aio code returns -EIOCBQUEUED before the
checks for RWF_HIPRI.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Jan Kara <j...@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/aio.c | 8 +++-
 include/uapi/linux/aio_abi.h | 2 +-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index f52d925ee259..020fa0045e3c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1541,7 +1541,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
ssize_t ret;
 
/* enforce forwards compatibility on users */
-   if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
+   if (unlikely(iocb->aio_reserved2)) {
pr_debug("EINVAL: reserve field set\n");
return -EINVAL;
}
@@ -1586,6 +1586,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
req->common.ki_flags |= IOCB_EVENTFD;
}
 
+   ret = kiocb_set_rw_flags(&req->common, iocb->aio_rw_flags);
+   if (unlikely(ret)) {
+   pr_debug("EINVAL: aio_rw_flags\n");
+   goto out_put_req;
+   }
+
ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
if (unlikely(ret)) {
pr_debug("EFAULT: aio_key\n");
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index bb2554f7fbd1..a2d4a8ac94ca 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -79,7 +79,7 @@ struct io_event {
 struct iocb {
/* these are internal to the kernel/libc. */
__u64   aio_data;   /* data to be returned in event's data */
-   __u32   PADDED(aio_key, aio_reserved1);
+   __u32   PADDED(aio_key, aio_rw_flags);
/* the kernel sets aio_key to the req # */
 
/* common fields */
-- 
2.12.0



[PATCH 05/10] fs: return if direct write will trigger writeback

2017-06-04 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Find out if the write will trigger a wait due to writeback. If yes,
return -EAGAIN.

Return -EINVAL for buffered AIO: there are multiple causes of
delay, such as page locks, dirty throttling logic, and page loading
from disk, which cannot be accounted for.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Jan Kara <j...@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 mm/filemap.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 097213275461..bc146efa6815 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2675,6 +2675,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 
pos = iocb->ki_pos;
 
+   if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
+   return -EINVAL;
+
if (limit != RLIM_INFINITY) {
if (iocb->ki_pos >= limit) {
send_sig(SIGXFSZ, current, 0);
@@ -2743,9 +2746,17 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
write_len = iov_iter_count(from);
end = (pos + write_len - 1) >> PAGE_SHIFT;
 
-   written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
-   if (written)
-   goto out;
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   /* If there are pages to writeback, return */
+   if (filemap_range_has_page(inode->i_mapping, pos,
+  pos + iov_iter_count(from)))
+   return -EAGAIN;
+   } else {
+   written = filemap_write_and_wait_range(mapping, pos,
+   pos + write_len - 1);
+   if (written)
+   goto out;
+   }
 
/*
 * After a write we want buffered reads to be sure to go to disk to get
-- 
2.12.0



[PATCH 04/10] fs: Introduce RWF_NOWAIT

2017-06-04 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

RWF_NOWAIT informs the kernel to bail out if an AIO request will block
for reasons such as file allocations, triggered writebacks, or
allocating requests while performing direct I/O.

RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags.

The check for -EOPNOTSUPP is placed in generic_file_write_iter(). This
is called by most filesystems, either through fsops.write_iter() or through
the function defined by write_iter(). If not, we perform the check defined
by .write_iter() which is called for direct IO specifically.

Filesystems xfs, btrfs and ext4 would be supported in the following patches.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Jan Kara <j...@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/9p/vfs_file.c| 3 +++
 fs/aio.c| 6 ++
 fs/ceph/file.c  | 3 +++
 fs/cifs/file.c  | 3 +++
 fs/fuse/file.c  | 3 +++
 fs/nfs/direct.c | 3 +++
 fs/ocfs2/file.c | 3 +++
 include/linux/fs.h  | 5 -
 include/uapi/linux/fs.h | 1 +
 mm/filemap.c| 3 +++
 10 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index 3de3b4a89d89..403681db7723 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -411,6 +411,9 @@ v9fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
loff_t origin;
int err = 0;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
retval = generic_write_checks(iocb, from);
if (retval <= 0)
return retval;
diff --git a/fs/aio.c b/fs/aio.c
index 020fa0045e3c..34027b67e2f4 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1592,6 +1592,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
goto out_put_req;
}
 
+   if ((req->common.ki_flags & IOCB_NOWAIT) &&
+   !(req->common.ki_flags & IOCB_DIRECT)) {
+   ret = -EOPNOTSUPP;
+   goto out_put_req;
+   }
+
ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
if (unlikely(ret)) {
pr_debug("EFAULT: aio_key\n");
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 29308a80d66f..366b0bb71f97 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1300,6 +1300,9 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
int err, want, got;
loff_t pos;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
if (ceph_snap(inode) != CEPH_NOSNAP)
return -EROFS;
 
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0fd081bd2a2f..ff84fa9ddb6c 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2725,6 +2725,9 @@ ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from)
 * write request.
 */
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
rc = generic_write_checks(iocb, from);
if (rc <= 0)
return rc;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3ee4fdc3da9e..812c7bd0c290 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1425,6 +1425,9 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(file);
ssize_t res;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
if (is_bad_inode(inode))
return -EIO;
 
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 6fb9fad2d1e6..c8e7dd76126c 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -979,6 +979,9 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
file, iov_iter_count(iter), (long long) iocb->ki_pos);
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
result = generic_write_checks(iocb, iter);
if (result <= 0)
return result;
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index bfeb647459d9..e7f8ba890305 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2235,6 +2235,9 @@ static ssize_t ocfs2_file_write_iter(struct kiocb *iocb,
if (count == 0)
return 0;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
direct_io = iocb->ki_flags & IOCB_DIRECT ? 1 : 0;
 
inode_lock(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dc0ab585cd56..2a7d14af6d12 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -268,6 +268,7 @@ struct writeback_control;
 #define IOCB_DSYNC (1 << 4)
 #define IOCB_SYNC  (1 << 5)
 #define IOCB_WRITE (1 << 6)
+#define IOCB_NOWAIT(1 << 7)

[PATCH 07/10] fs: return on congested block device

2017-06-04 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

A new bio operation flag REQ_NOWAIT is introduced to identify bios
originating from an iocb with IOCB_NOWAIT. This flag indicates
that the request should return immediately if it cannot be made,
instead of retrying.

Stacked devices such as md (the ones with make_request_fn hooks)
are currently not supported because they may block for housekeeping.
For example, an md device can have a part of the device suspended.
For this reason, only request-based devices are supported.
In the future, this feature will be expanded to stacked devices
by teaching them how to handle the REQ_NOWAIT flag.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 block/blk-core.c  | 24 ++--
 block/blk-mq-sched.c  |  3 +++
 block/blk-mq.c|  2 ++
 fs/direct-io.c| 10 --
 include/linux/bio.h   |  6 ++
 include/linux/blk_types.h |  2 ++
 6 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a7421b772d0e..a6ee659fd56b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1256,6 +1256,11 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
if (!IS_ERR(rq))
return rq;
 
+   if (op & REQ_NOWAIT) {
+   blk_put_rl(rl);
+   return ERR_PTR(-EAGAIN);
+   }
+
if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
blk_put_rl(rl);
return rq;
@@ -1900,6 +1905,17 @@ generic_make_request_checks(struct bio *bio)
goto end_io;
}
 
+   /*
+* For a REQ_NOWAIT based request, return -EOPNOTSUPP
+* if queue does not have QUEUE_FLAG_NOWAIT_SUPPORT set
+* and if it is not a request based queue.
+*/
+
+   if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q)) {
+   err = -EOPNOTSUPP;
+   goto end_io;
+   }
+
part = bio->bi_bdev->bd_part;
if (should_fail_request(part, bio->bi_iter.bi_size) ||
should_fail_request(_to_disk(part)->part0,
@@ -2057,7 +2073,7 @@ blk_qc_t generic_make_request(struct bio *bio)
do {
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-   if (likely(blk_queue_enter(q, false) == 0)) {
+   if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) {
struct bio_list lower, same;
 
/* Create a fresh bio_list for all subordinate requests */
@@ -2082,7 +2098,11 @@ blk_qc_t generic_make_request(struct bio *bio)
bio_list_merge(&bio_list_on_stack[0], &lower);
bio_list_merge(&bio_list_on_stack[0], &same);
bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
} else {
-   bio_io_error(bio);
+   if (unlikely(!blk_queue_dying(q) &&
+   (bio->bi_opf & REQ_NOWAIT)))
+   bio_wouldblock_error(bio);
+   else
+   bio_io_error(bio);
}
bio = bio_list_pop(_list_on_stack[0]);
} while (bio);
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 1f5b692526ae..9a1dea8b964e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -83,6 +83,9 @@ struct request *blk_mq_sched_get_request(struct request_queue *q,
if (likely(!data->hctx))
data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
 
+   if (op & REQ_NOWAIT)
+   data->flags |= BLK_MQ_REQ_NOWAIT;
+
if (e) {
data->flags |= BLK_MQ_REQ_INTERNAL;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1bcccedcc74f..b0608f1955b2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1556,6 +1556,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
if (unlikely(!rq)) {
__wbt_done(q->rq_wb, wb_acct);
+   if (bio->bi_opf & REQ_NOWAIT)
+   bio_wouldblock_error(bio);
return BLK_QC_T_NONE;
}
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea77de8..139ebd5ae1c7 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -480,8 +480,12 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
unsigned i;
int err;
 
-   if (bio->bi_error)
-   dio->io_error = -EIO;
+   if (bio->bi_error) {
+   if (bio->bi_error == -EAGAIN && (bio->bi_opf & REQ_NOWAIT))
+   dio->io_error = -EAGAIN;
+   else
+   dio->io_error = -EIO;
+   }
 
if (dio->is_async && dio->op == REQ_OP_READ) {

[PATCH 0/10 v10] No wait AIO

2017-06-04 Thread Goldwyn Rodrigues
Formerly known as non-blocking AIO.

This series adds a nonblocking feature to asynchronous I/O writes.
io_submit() can be delayed for a number of reasons:
 - Block allocation for files
 - Data writebacks for direct I/O
 - Sleeping while waiting to acquire i_rwsem
 - A congested block device

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s to the kernel to the best of its ability to complete,
and if the kernel returns -EAGAIN, defer the write to another thread.

In order to enable this, RWF_NOWAIT is introduced as an
aio_rw_flags flag. If set, it translates to
IOCB_NOWAIT for struct iocb, REQ_NOWAIT for bio.bi_opf and IOMAP_NOWAIT for
iomap. aio_rw_flags is a new field replacing aio_reserved1. We could
not use aio_flags because it is not currently checked for validity
in the kernel.

This feature is provided for direct I/O of asynchronous I/O only. I have
tested it against xfs, ext4, and btrfs, and I intend to add more filesystems.
The nowait feature is for request-based devices. In the future, I intend to
add support for stacked devices such as md.

Applications will have to check supportability
by sending an async direct write; any error besides -EAGAIN
means it is not supported.

First two patches are prep patches into nowait I/O.

Changes since v1:
 + changed name from _NONBLOCKING to *_NOWAIT
 + filemap_range_has_page() call moved closer to (just before) the call to
filemap_write_and_wait_range().
 + BIO_NOWAIT limited to get_request()
 + XFS fixes 
- included reflink 
- use of xfs_ilock_nowait() instead of a XFS_IOLOCK_NONBLOCKING flag
- Translate the flag through IOMAP_NOWAIT (iomap) to check for
  block allocation for the file.
 + ext4 coding style

Changes since v2:
 + Using aio_reserved1 as aio_rw_flags instead of aio_flags
 + blk-mq support
 + xfs uptodate with kernel and reflink changes

 Changes since v3:
  + Added FS_NOWAIT, which is set if the filesystem supports NOWAIT feature.
  + Checks in generic_make_request() to make sure BIO_NOWAIT comes in
for async direct writes only.
  + Added QUEUE_FLAG_NOWAIT, which is set if the device supports BIO_NOWAIT.
This is added (rather not set) to block devices such as dm/md currently.

 Changes since v4:
  + Ported AIO code to use RWF_* flags. Check for RWF_* flags in
generic_file_write_iter().
  + Changed IOCB_RW_FLAGS_NOWAIT to RWF_NOWAIT.

 Changes since v5:
  + BIO_NOWAIT to REQ_NOWAIT
  + Common helper for RWF flags.

 Changes since v6:
  + REQ_NOWAIT will be ignored for request based devices since they
cannot block. So, removed QUEUE_FLAG_NOWAIT since it is not
required in the current implementation. It will be resurrected
when we program for stacked devices.
  + changed kiocb_rw_flags() to kiocb_set_rw_flags() in order to accommodate
errors. Moved checks into the function.

 Changes since v7:
  + split patches into prep so the main patches are smaller and easier
to understand
  + All patches are reviewed or acked!
 
 Changes since v8:
 + Err out AIO reads with -EINVAL flagged as RWF_NOWAIT

 Changes since v9:
 + Retract - Err out AIO reads with -EINVAL flagged as RWF_NOWAIT
 + XFS returns EAGAIN if extent list is not in memory
 + Man page updates to io_submit with iocb description and nowait features.

-- 
Goldwyn




Re: [PATCH 09/10] xfs: nowait aio support

2017-05-30 Thread Goldwyn Rodrigues


On 05/29/2017 03:33 AM, Christoph Hellwig wrote:
> On Sun, May 28, 2017 at 09:38:26PM -0500, Goldwyn Rodrigues wrote:
>>
>>
>> On 05/28/2017 04:31 AM, Christoph Hellwig wrote:
>>> Despite my previous reviewed-by tag this will need another fix:
>>>
>>> xfs_file_iomap_begin needs to return EAGAIN if we don't have the extent
>>> list in memory already.  E.g. something like this:
>>>
>>> if ((flags & IOMAP_NOWAIT) && !(ip->i_d.if_flags & XFS_IFEXTENTS)) {
>>> error = -EAGAIN;
>>> goto out_unlock;
>>> }
>>>
>>> right after locking the ilock.
>>>
>>
>> I am not sure it is right to penalize an application writing to a
>> file which has been freshly opened (and is the first one to open it).
>> It basically means the extent maps need to be read from disk. Do you
>> see a reason it would have a non-deterministic wait if it is the only
>> user? I understand the block layer can block if it has too many
>> requests though.
> 
> For either a read or a write we might have to read in the extent list
> (note that for few enough extents they are stored in the inode and
> we won't have to), in which case the call will block and by the
> semantics you define we'll need to return -EAGAIN.

Yes, that is right. I will include it in.

> 
> Btw, can you write a small blurb up for the man page to document these
> semantics in man-page-like language?
> 

Yes, but which man page would it belong to?
Should it be a subsection of ERRORS in io_getevents/io_submit? We don't
want to add these to ERRORS for io_getevents() because they would not be
return values of the io_getevents() call itself, but errors reported in
the iocb structure. Should it be a new man page, say iocb(7/8)?



-- 
Goldwyn


Re: [PATCH 0/10 v9] No wait AIO

2017-05-28 Thread Goldwyn Rodrigues


On 05/28/2017 04:27 AM, Christoph Hellwig wrote:
>>  Changes since v8:
>>  + Err out AIO reads with -EINVAL flagged as RWF_NOWAIT
> 
> Ugg, why?  Reads aren't really treated any different than writes in
> the direct I/O code.

This effort focused on writes only.

From the point of view of the application/user, reads are usually
required to complete with success. I don't see a scenario where read()s
would need the nowait feature. If there is a use case, I'd be happy to
add and support it.

-- 
Goldwyn


Re: [PATCH 09/10] xfs: nowait aio support

2017-05-28 Thread Goldwyn Rodrigues


On 05/28/2017 04:31 AM, Christoph Hellwig wrote:
> Despite my previous reviewed-by tag this will need another fix:
> 
> xfs_file_iomap_begin needs to return EAGAIN if we don't have the extent
> list in memory already.  E.g. something like this:
> 
>   if ((flags & IOMAP_NOWAIT) && !(ip->i_d.if_flags & XFS_IFEXTENTS)) {
>   error = -EAGAIN;
>   goto out_unlock;
>   }
> 
> right after locking the ilock.
> 

I am not sure it is right to penalize an application writing to a file
which has been freshly opened (and is the first one to open it). It
basically means the extent maps need to be read from disk. Do you see a
reason it would have a non-deterministic wait if it is the only user? I
understand the block layer can block if it has too many requests though.

-- 
Goldwyn


[PATCH 03/10] fs: Use RWF_* flags for AIO operations

2017-05-24 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

aio_rw_flags is introduced in struct iocb (using aio_reserved1), and will
carry the RWF_* flags. We cannot use aio_flags because it is not
checked for validity, which may break existing applications.

Note: the only place RWF_HIPRI takes effect is dio_await_one(). In all
other locations, the aio code returns -EIOCBQUEUED before the
RWF_HIPRI checks.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
---
 fs/aio.c | 8 +++-
 include/uapi/linux/aio_abi.h | 2 +-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index f52d925ee259..020fa0045e3c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1541,7 +1541,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
ssize_t ret;
 
/* enforce forwards compatibility on users */
-   if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
+   if (unlikely(iocb->aio_reserved2)) {
pr_debug("EINVAL: reserve field set\n");
return -EINVAL;
}
@@ -1586,6 +1586,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
req->common.ki_flags |= IOCB_EVENTFD;
}
 
+   ret = kiocb_set_rw_flags(&req->common, iocb->aio_rw_flags);
+   if (unlikely(ret)) {
+   pr_debug("EINVAL: aio_rw_flags\n");
+   goto out_put_req;
+   }
+
	ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
if (unlikely(ret)) {
pr_debug("EFAULT: aio_key\n");
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index bb2554f7fbd1..a2d4a8ac94ca 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -79,7 +79,7 @@ struct io_event {
 struct iocb {
/* these are internal to the kernel/libc. */
__u64   aio_data;   /* data to be returned in event's data */
-   __u32   PADDED(aio_key, aio_reserved1);
+   __u32   PADDED(aio_key, aio_rw_flags);
/* the kernel sets aio_key to the req # */
 
/* common fields */
-- 
2.12.0



[PATCH 01/10] fs: Separate out kiocb flags setup based on RWF_* flags

2017-05-24 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
---
 fs/read_write.c| 12 +++-
 include/linux/fs.h | 14 ++
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 47c1d4484df9..53c816c61122 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -678,16 +678,10 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
struct kiocb kiocb;
ssize_t ret;
 
-   if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
-   return -EOPNOTSUPP;
-
	init_sync_kiocb(&kiocb, filp);
-   if (flags & RWF_HIPRI)
-   kiocb.ki_flags |= IOCB_HIPRI;
-   if (flags & RWF_DSYNC)
-   kiocb.ki_flags |= IOCB_DSYNC;
-   if (flags & RWF_SYNC)
-   kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+   ret = kiocb_set_rw_flags(&kiocb, flags);
+   if (ret)
+   return ret;
kiocb.ki_pos = *ppos;
 
if (type == READ)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 803e5a9b2654..f53867140f43 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3056,6 +3056,20 @@ static inline int iocb_flags(struct file *file)
return res;
 }
 
+static inline int kiocb_set_rw_flags(struct kiocb *ki, int flags)
+{
+   if (unlikely(flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC)))
+   return -EOPNOTSUPP;
+
+   if (flags & RWF_HIPRI)
+   ki->ki_flags |= IOCB_HIPRI;
+   if (flags & RWF_DSYNC)
+   ki->ki_flags |= IOCB_DSYNC;
+   if (flags & RWF_SYNC)
+   ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+   return 0;
+}
+
 static inline ino_t parent_ino(struct dentry *dentry)
 {
ino_t res;
-- 
2.12.0



[PATCH 05/10] fs: return if direct write will trigger writeback

2017-05-24 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Find out if the write will trigger a wait due to writeback. If yes,
return -EAGAIN.

Return -EINVAL for buffered AIO: there are multiple potential causes of
delay, such as page locks, dirty throttling logic, and page loading from
disk, which cannot all be accounted for.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
---
 mm/filemap.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 097213275461..bc146efa6815 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2675,6 +2675,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 
pos = iocb->ki_pos;
 
+   if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
+   return -EINVAL;
+
if (limit != RLIM_INFINITY) {
if (iocb->ki_pos >= limit) {
send_sig(SIGXFSZ, current, 0);
@@ -2743,9 +2746,17 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
write_len = iov_iter_count(from);
end = (pos + write_len - 1) >> PAGE_SHIFT;
 
-   written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
-   if (written)
-   goto out;
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   /* If there are pages to writeback, return */
+   if (filemap_range_has_page(inode->i_mapping, pos,
+  pos + iov_iter_count(from)))
+   return -EAGAIN;
+   } else {
+   written = filemap_write_and_wait_range(mapping, pos,
+   pos + write_len - 1);
+   if (written)
+   goto out;
+   }
 
/*
 * After a write we want buffered reads to be sure to go to disk to get
-- 
2.12.0



[PATCH 07/10] fs: return on congested block device

2017-05-24 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

A new bio operation flag, REQ_NOWAIT, is introduced to identify bios
originating from an iocb with IOCB_NOWAIT set. The flag indicates that
the request should return immediately if it cannot be made, instead
of retrying.

Stacked devices such as md (the ones with make_request_fn hooks) are
currently not supported because they may block for housekeeping.
For example, part of an md device can be suspended.
For this reason, only request-based devices are supported.
In the future, this feature will be expanded to stacked devices
by teaching them how to handle the REQ_NOWAIT flag.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 block/blk-core.c  | 24 ++--
 block/blk-mq-sched.c  |  3 +++
 block/blk-mq.c|  2 ++
 fs/direct-io.c| 10 --
 include/linux/bio.h   |  6 ++
 include/linux/blk_types.h |  2 ++
 6 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index c7068520794b..04d15fa2646c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1250,6 +1250,11 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
if (!IS_ERR(rq))
return rq;
 
+   if (op & REQ_NOWAIT) {
+   blk_put_rl(rl);
+   return ERR_PTR(-EAGAIN);
+   }
+
	if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
blk_put_rl(rl);
return rq;
@@ -1894,6 +1899,17 @@ generic_make_request_checks(struct bio *bio)
goto end_io;
}
 
+   /*
+* For a REQ_NOWAIT based request, return -EOPNOTSUPP
+* if queue does not have QUEUE_FLAG_NOWAIT_SUPPORT set
+* and if it is not a request based queue.
+*/
+
+   if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q)) {
+   err = -EOPNOTSUPP;
+   goto end_io;
+   }
+
part = bio->bi_bdev->bd_part;
if (should_fail_request(part, bio->bi_iter.bi_size) ||
	    should_fail_request(&part_to_disk(part)->part0,
@@ -2051,7 +2067,7 @@ blk_qc_t generic_make_request(struct bio *bio)
do {
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-   if (likely(blk_queue_enter(q, false) == 0)) {
+   if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) {
struct bio_list lower, same;
 
			/* Create a fresh bio_list for all subordinate requests */
@@ -2076,7 +2092,11 @@ blk_qc_t generic_make_request(struct bio *bio)
			bio_list_merge(&bio_list_on_stack[0], &lower);
			bio_list_merge(&bio_list_on_stack[0], &same);
			bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
} else {
-   bio_io_error(bio);
+   if (unlikely(!blk_queue_dying(q) &&
+   (bio->bi_opf & REQ_NOWAIT)))
+   bio_wouldblock_error(bio);
+   else
+   bio_io_error(bio);
}
		bio = bio_list_pop(&bio_list_on_stack[0]);
} while (bio);
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 1f5b692526ae..9a1dea8b964e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -83,6 +83,9 @@ struct request *blk_mq_sched_get_request(struct request_queue *q,
if (likely(!data->hctx))
data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
 
+   if (op & REQ_NOWAIT)
+   data->flags |= BLK_MQ_REQ_NOWAIT;
+
if (e) {
data->flags |= BLK_MQ_REQ_INTERNAL;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a69ad122ed66..c6932067c9e5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1575,6 +1575,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
if (unlikely(!rq)) {
__wbt_done(q->rq_wb, wb_acct);
+   if (bio->bi_opf & REQ_NOWAIT)
+   bio_wouldblock_error(bio);
return BLK_QC_T_NONE;
}
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea77de8..139ebd5ae1c7 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -480,8 +480,12 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
unsigned i;
int err;
 
-   if (bio->bi_error)
-   dio->io_error = -EIO;
+   if (bio->bi_error) {
+   if (bio->bi_error == -EAGAIN && (bio->bi_opf & REQ_NOWAIT))
+   dio->io_error = -EAGAIN;
+   else
+   dio->io_error = -EIO;
+   }
 
if (dio->is_async && dio->op == R

[PATCH 09/10] xfs: nowait aio support

2017-05-24 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

If IOCB_NOWAIT is set, bail if the i_rwsem is not lockable
immediately.

IF IOMAP_NOWAIT is set, return EAGAIN in xfs_file_iomap_begin
if it needs allocation either due to file extension, writing to a hole,
or COW or waiting for other DIOs to finish.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
---
 fs/xfs/xfs_file.c  | 19 ++-
 fs/xfs/xfs_iomap.c | 17 +
 2 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 35703a801372..b307940e7d56 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -541,8 +541,11 @@ xfs_file_dio_aio_write(
iolock = XFS_IOLOCK_SHARED;
}
 
-   xfs_ilock(ip, iolock);
-
+   if (!xfs_ilock_nowait(ip, iolock)) {
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EAGAIN;
+   xfs_ilock(ip, iolock);
+   }
ret = xfs_file_aio_write_checks(iocb, from, );
if (ret)
goto out;
@@ -553,9 +556,15 @@ xfs_file_dio_aio_write(
 * otherwise demote the lock if we had to take the exclusive lock
 * for other reasons in xfs_file_aio_write_checks.
 */
-   if (unaligned_io)
-   inode_dio_wait(inode);
-   else if (iolock == XFS_IOLOCK_EXCL) {
+   if (unaligned_io) {
+   /* If we are going to wait for other DIO to finish, bail */
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   if (atomic_read(&inode->i_dio_count))
+   return -EAGAIN;
+   } else {
+   inode_dio_wait(inode);
+   }
+   } else if (iolock == XFS_IOLOCK_EXCL) {
xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
iolock = XFS_IOLOCK_SHARED;
}
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 94e5bdf7304c..8b0e3c1e086d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1016,6 +1016,15 @@ xfs_file_iomap_begin(
 
if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
if (flags & IOMAP_DIRECT) {
+   /*
+* A reflinked inode will result in CoW alloc.
+* FIXME: It could still overwrite on unshared extents
+* and not need allocation.
+*/
+   if (flags & IOMAP_NOWAIT) {
+   error = -EAGAIN;
+   goto out_unlock;
+   }
/* may drop and re-acquire the ilock */
			error = xfs_reflink_allocate_cow(ip, &imap, &shared, &lockmode);
@@ -1033,6 +1042,14 @@ xfs_file_iomap_begin(
 
if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, , nimaps)) {
/*
+* If nowait is set bail since we are going to make
+* allocations.
+*/
+   if (flags & IOMAP_NOWAIT) {
+   error = -EAGAIN;
+   goto out_unlock;
+   }
+   /*
 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
 * pages to keep the chunks of work done where somewhat symmetric
 * with the work writeback does. This is a completely arbitrary
-- 
2.12.0



[PATCH 08/10] ext4: nowait aio support

2017-05-24 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Return EAGAIN for direct I/O if any of the following is true:
  + i_rwsem is not immediately lockable
  + the write extends beyond the end of the file (will trigger allocation)
  + blocks are not allocated at the write location

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Jan Kara <j...@suse.cz>
---
 fs/ext4/file.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 831fd6beebf0..07f08ff2c11b 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -216,7 +216,13 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
return ext4_dax_write_iter(iocb, from);
 #endif
 
-   inode_lock(inode);
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   if (!inode_trylock(inode))
+   return -EAGAIN;
+   } else {
+   inode_lock(inode);
+   }
+
ret = ext4_write_checks(iocb, from);
if (ret <= 0)
goto out;
@@ -235,9 +241,15 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
	iocb->private = &overwrite;
/* Check whether we do a DIO overwrite or not */
-   if (o_direct && ext4_should_dioread_nolock(inode) && !unaligned_aio &&
-   ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from)))
-   overwrite = 1;
+   if (o_direct && !unaligned_aio) {
+   if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
+   if (ext4_should_dioread_nolock(inode))
+   overwrite = 1;
+   } else if (iocb->ki_flags & IOCB_NOWAIT) {
+   ret = -EAGAIN;
+   goto out;
+   }
+   }
 
ret = __generic_file_write_iter(iocb, from);
inode_unlock(inode);
-- 
2.12.0



[PATCH 04/10] fs: Introduce RWF_NOWAIT

2017-05-24 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

RWF_NOWAIT informs the kernel to bail out if an AIO request will block,
for reasons such as file allocations, a triggered writeback, or
blocking while allocating requests during direct I/O.

RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags.

The check for -EOPNOTSUPP is placed in generic_file_write_iter(), which
most filesystems call from their .write_iter() operation. For those
that do not, the check is performed in their own .write_iter(), which
is called for direct I/O specifically.

Support for xfs, btrfs, and ext4 is added in the following patches.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
---
 fs/9p/vfs_file.c|  3 +++
 fs/aio.c| 13 +
 fs/ceph/file.c  |  3 +++
 fs/cifs/file.c  |  3 +++
 fs/fuse/file.c  |  3 +++
 fs/nfs/direct.c |  3 +++
 fs/ocfs2/file.c |  3 +++
 include/linux/fs.h  |  5 -
 include/uapi/linux/fs.h |  1 +
 mm/filemap.c|  3 +++
 10 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index 3de3b4a89d89..403681db7723 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -411,6 +411,9 @@ v9fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
loff_t origin;
int err = 0;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
retval = generic_write_checks(iocb, from);
if (retval <= 0)
return retval;
diff --git a/fs/aio.c b/fs/aio.c
index 020fa0045e3c..9616dc733103 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1592,6 +1592,19 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
goto out_put_req;
}
 
+   if (req->common.ki_flags & IOCB_NOWAIT) {
+   if (!(req->common.ki_flags & IOCB_DIRECT)) {
+   ret = -EOPNOTSUPP;
+   goto out_put_req;
+   }
+
+   if ((iocb->aio_lio_opcode != IOCB_CMD_PWRITE) &&
+   (iocb->aio_lio_opcode != IOCB_CMD_PWRITEV)) {
+   ret = -EINVAL;
+   goto out_put_req;
+   }
+   }
+
	ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
if (unlikely(ret)) {
pr_debug("EFAULT: aio_key\n");
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 3fdde0b283c9..a53fd2675b1b 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1300,6 +1300,9 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
int err, want, got;
loff_t pos;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
if (ceph_snap(inode) != CEPH_NOSNAP)
return -EROFS;
 
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0fd081bd2a2f..ff84fa9ddb6c 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2725,6 +2725,9 @@ ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from)
 * write request.
 */
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
rc = generic_write_checks(iocb, from);
if (rc <= 0)
return rc;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3ee4fdc3da9e..812c7bd0c290 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1425,6 +1425,9 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(file);
ssize_t res;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
if (is_bad_inode(inode))
return -EIO;
 
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 6fb9fad2d1e6..c8e7dd76126c 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -979,6 +979,9 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
file, iov_iter_count(iter), (long long) iocb->ki_pos);
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
result = generic_write_checks(iocb, iter);
if (result <= 0)
return result;
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index bfeb647459d9..e7f8ba890305 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2235,6 +2235,9 @@ static ssize_t ocfs2_file_write_iter(struct kiocb *iocb,
if (count == 0)
return 0;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
direct_io = iocb->ki_flags & IOCB_DIRECT ? 1 : 0;
 
inode_lock(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dc0ab585cd56..2a7d14af6d12 100644
--- a/includ

[PATCH 06/10] fs: Introduce IOMAP_NOWAIT

2017-05-24 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

IOCB_NOWAIT translates to IOMAP_NOWAIT for iomaps.
This is used by XFS in the XFS patch.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
---
 fs/iomap.c| 2 ++
 include/linux/iomap.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index 4b10892967a5..5d85ec6e7b20 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -879,6 +879,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
} else {
dio->flags |= IOMAP_DIO_WRITE;
flags |= IOMAP_WRITE;
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   flags |= IOMAP_NOWAIT;
}
 
ret = filemap_write_and_wait_range(mapping, start, end);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index f753e788da31..69f4e9470084 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -52,6 +52,7 @@ struct iomap {
 #define IOMAP_REPORT   (1 << 2) /* report extent status, e.g. FIEMAP */
 #define IOMAP_FAULT(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT   (1 << 4) /* direct I/O */
+#define IOMAP_NOWAIT   (1 << 5) /* Don't wait for writeback */
 
 struct iomap_ops {
/*
-- 
2.12.0



[PATCH 10/10] btrfs: nowait aio support

2017-05-24 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Return EAGAIN if any of the following is true:
 + i_rwsem is not lockable
 + NODATACOW or PREALLOC is not set
 + we cannot nocow at the desired location
 + the write is beyond the end of the file and the range is not allocated

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Acked-by: David Sterba <dste...@suse.com>
---
 fs/btrfs/file.c  | 25 -
 fs/btrfs/inode.c |  3 +++
 2 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index da1096eb1a40..aae088e49915 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1875,12 +1875,29 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
ssize_t num_written = 0;
bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host);
ssize_t err;
-   loff_t pos;
-   size_t count;
+   loff_t pos = iocb->ki_pos;
+   size_t count = iov_iter_count(from);
loff_t oldsize;
int clean_page = 0;
 
-   inode_lock(inode);
+   if ((iocb->ki_flags & IOCB_NOWAIT) &&
+   (iocb->ki_flags & IOCB_DIRECT)) {
+   /* Don't sleep on inode rwsem */
+   if (!inode_trylock(inode))
+   return -EAGAIN;
+   /*
+* We will allocate space in case nodatacow is not set,
+* so bail
+*/
+   if (!(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
+ BTRFS_INODE_PREALLOC)) ||
	    check_can_nocow(BTRFS_I(inode), pos, &count) <= 0) {
+   inode_unlock(inode);
+   return -EAGAIN;
+   }
+   } else
+   inode_lock(inode);
+
err = generic_write_checks(iocb, from);
if (err <= 0) {
inode_unlock(inode);
@@ -1914,8 +1931,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 */
update_time_for_write(inode);
 
-   pos = iocb->ki_pos;
-   count = iov_iter_count(from);
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
if (start_pos > oldsize) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 17cbe9306faf..2ab71b946829 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8755,6 +8755,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
dio_data.overwrite = 1;
inode_unlock(inode);
relock = true;
+   } else if (iocb->ki_flags & IOCB_NOWAIT) {
+   ret = -EAGAIN;
+   goto out;
}
ret = btrfs_delalloc_reserve_space(inode, offset, count);
if (ret)
-- 
2.12.0



[PATCH 02/10] fs: Introduce filemap_range_has_page()

2017-05-24 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

filemap_range_has_page() returns true if the file's mapping has
a page within the range mentioned. This function will be used
to check whether a write() call will cause a writeback of previous
writes.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
---
 include/linux/fs.h |  2 ++
 mm/filemap.c   | 33 +
 2 files changed, 35 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f53867140f43..dc0ab585cd56 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2517,6 +2517,8 @@ extern int filemap_fdatawait(struct address_space *);
 extern void filemap_fdatawait_keep_errors(struct address_space *);
 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
   loff_t lend);
+extern int filemap_range_has_page(struct address_space *, loff_t lstart,
+ loff_t lend);
 extern int filemap_write_and_wait(struct address_space *mapping);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend);
diff --git a/mm/filemap.c b/mm/filemap.c
index 6f1be573a5e6..87aba7698584 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -376,6 +376,39 @@ int filemap_flush(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_flush);
 
+/**
+ * filemap_range_has_page - check if a page exists in range.
+ * @mapping:   address space structure to wait for
+ * @start_byte:offset in bytes where the range starts
+ * @end_byte:  offset in bytes where the range ends (inclusive)
+ *
+ * Find at least one page in the range supplied, usually used to check if
+ * direct writing in this range will trigger a writeback.
+ */
+int filemap_range_has_page(struct address_space *mapping,
+  loff_t start_byte, loff_t end_byte)
+{
+   pgoff_t index = start_byte >> PAGE_SHIFT;
+   pgoff_t end = end_byte >> PAGE_SHIFT;
+   struct pagevec pvec;
+   int ret;
+
+   if (end_byte < start_byte)
+   return 0;
+
+   if (mapping->nrpages == 0)
+   return 0;
+
+   pagevec_init(&pvec, 0);
+   ret = pagevec_lookup(&pvec, mapping, index, 1);
+   if (!ret)
+   return 0;
+   ret = (pvec.pages[0]->index <= end);
+   pagevec_release(&pvec);
+   return ret;
+}
+EXPORT_SYMBOL(filemap_range_has_page);
+
 static int __filemap_fdatawait_range(struct address_space *mapping,
 loff_t start_byte, loff_t end_byte)
 {
-- 
2.12.0



[PATCH 0/10 v9] No wait AIO

2017-05-24 Thread Goldwyn Rodrigues
Formerly known as non-blocking AIO.

This series adds a nonblocking feature to asynchronous I/O writes.
io_submit() can be delayed for a number of reasons:
 - Block allocation for files
 - Data writebacks for direct I/O
 - Sleeping because of waiting to acquire i_rwsem
 - Congested block device

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s to the kernel, which completes them to the best of its
ability, and defer any that return -EAGAIN to another thread.

In order to enable this, IOCB_RW_FLAG_NOWAIT is introduced in
uapi/linux/aio_abi.h. If set in aio_rw_flags, it translates to
IOCB_NOWAIT for struct iocb, REQ_NOWAIT for bio.bi_opf, and
IOMAP_NOWAIT for iomap. aio_rw_flags is a new field replacing
aio_reserved1. We could not use aio_flags because it is not currently
checked for invalidity in the kernel.

This feature is provided for direct I/O with asynchronous I/O only. I
have tested it against xfs, ext4, and btrfs, and I intend to add more
filesystems. The nowait feature is for request-based devices. In the
future, I intend to add support for stacked devices such as md.

Applications will have to check for support by sending an async direct
write; any error besides -EAGAIN means the feature is not supported.

The first two patches are prep patches for nowait I/O.

Changes since v1:
 + changed name from _NONBLOCKING to *_NOWAIT
 + filemap_range_has_page() call moved closer to (just before) the call to
   filemap_write_and_wait_range().
 + BIO_NOWAIT limited to get_request()
 + XFS fixes 
- included reflink 
- use of xfs_ilock_nowait() instead of a XFS_IOLOCK_NONBLOCKING flag
- Translate the flag through IOMAP_NOWAIT (iomap) to check for
  block allocation for the file.
 + ext4 coding style

Changes since v2:
 + Using aio_reserved1 as aio_rw_flags instead of aio_flags
 + blk-mq support
 + xfs uptodate with kernel and reflink changes

 Changes since v3:
  + Added FS_NOWAIT, which is set if the filesystem supports NOWAIT feature.
  + Checks in generic_make_request() to make sure BIO_NOWAIT comes in
for async direct writes only.
  + Added QUEUE_FLAG_NOWAIT, which is set if the device supports BIO_NOWAIT.
This is added (rather not set) to block devices such as dm/md currently.

 Changes since v4:
  + Ported AIO code to use RWF_* flags. Check for RWF_* flags in
generic_file_write_iter().
  + Changed IOCB_RW_FLAGS_NOWAIT to RWF_NOWAIT.

 Changes since v5:
  + BIO_NOWAIT to REQ_NOWAIT
  + Common helper for RWF flags.

 Changes since v6:
  + REQ_NOWAIT will be ignored for request based devices since they
cannot block. So, removed QUEUE_FLAG_NOWAIT since it is not
required in the current implementation. It will be resurrected
when we program for stacked devices.
  + changed kiocb_rw_flags() to kiocb_set_rw_flags() in order to accommodate
for errors. Moved checks in the function.

 Changes since v7:
  + split patches into prep so the main patches are smaller and easier
to understand
  + All patches are reviewed or acked!
 
 Changes since v8:
 + Err out AIO reads with -EINVAL flagged as RWF_NOWAIT

-- 
Goldwyn


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/10] fs: Introduce RWF_NOWAIT

2017-05-11 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

RWF_NOWAIT informs the kernel to bail out if an AIO request will block
for reasons such as file allocation, triggered writeback, or request
allocation while performing direct I/O.

RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags.

The check for -EOPNOTSUPP is placed in generic_file_write_iter(), which
most filesystems reach through their .write_iter() implementation.
Filesystems that do not call it perform the check in their own
.write_iter(), which is the path direct I/O takes as well.

Filesystems xfs, btrfs and ext4 would be supported in the following patches.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
---
 fs/9p/vfs_file.c| 3 +++
 fs/aio.c| 6 ++
 fs/ceph/file.c  | 3 +++
 fs/cifs/file.c  | 3 +++
 fs/fuse/file.c  | 3 +++
 fs/nfs/direct.c | 3 +++
 fs/ocfs2/file.c | 3 +++
 include/linux/fs.h  | 5 -
 include/uapi/linux/fs.h | 1 +
 mm/filemap.c| 3 +++
 10 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index 3de3b4a89d89..403681db7723 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -411,6 +411,9 @@ v9fs_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
loff_t origin;
int err = 0;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
retval = generic_write_checks(iocb, from);
if (retval <= 0)
return retval;
diff --git a/fs/aio.c b/fs/aio.c
index 020fa0045e3c..34027b67e2f4 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1592,6 +1592,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
goto out_put_req;
}
 
+   if ((req->common.ki_flags & IOCB_NOWAIT) &&
+   !(req->common.ki_flags & IOCB_DIRECT)) {
+   ret = -EOPNOTSUPP;
+   goto out_put_req;
+   }
+
ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
if (unlikely(ret)) {
pr_debug("EFAULT: aio_key\n");
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 26cc95421cca..af28419b1731 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1267,6 +1267,9 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct 
iov_iter *from)
int err, want, got;
loff_t pos;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
if (ceph_snap(inode) != CEPH_NOSNAP)
return -EROFS;
 
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 21d404535739..f8858a06e119 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2638,6 +2638,9 @@ ssize_t cifs_user_writev(struct kiocb *iocb, struct 
iov_iter *from)
 * write request.
 */
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
rc = generic_write_checks(iocb, from);
if (rc <= 0)
return rc;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ec238fb5a584..72786e798319 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1425,6 +1425,9 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, 
struct iov_iter *from)
struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(file);
ssize_t res;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
if (is_bad_inode(inode))
return -EIO;
 
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index c1b5fed7c863..dcea0caa5cb5 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -996,6 +996,9 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct 
iov_iter *iter)
dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
file, iov_iter_count(iter), (long long) iocb->ki_pos);
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
result = generic_write_checks(iocb, iter);
if (result <= 0)
return result;
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index bfeb647459d9..e7f8ba890305 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2235,6 +2235,9 @@ static ssize_t ocfs2_file_write_iter(struct kiocb *iocb,
if (count == 0)
return 0;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
direct_io = iocb->ki_flags & IOCB_DIRECT ? 1 : 0;
 
inode_lock(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2e6fc6a23f91..7e39b510b7a4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -270,6 +270,7 @@ struct writeback_control;
 #define IOCB_DSYNC (1 << 4)
 #define IOCB_SYNC  (1 << 5)
 #define IOCB_WRITE (1 << 6)
+#define IOCB_NOWAIT	(1 << 7)

[PATCH 05/10] fs: return if direct write will trigger writeback

2017-05-11 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Find out if the write will trigger a wait due to writeback. If yes,
return -EAGAIN.

Return -EINVAL for buffered AIO: there are multiple causes of
delay, such as page locks, dirty throttling logic, and page loading
from disk, which cannot be taken care of.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
---
 mm/filemap.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index ca3031f505f2..fd7d175b3dee 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2673,6 +2673,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, 
struct iov_iter *from)
 
pos = iocb->ki_pos;
 
+   if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
+   return -EINVAL;
+
if (limit != RLIM_INFINITY) {
if (iocb->ki_pos >= limit) {
send_sig(SIGXFSZ, current, 0);
@@ -2742,9 +2745,17 @@ generic_file_direct_write(struct kiocb *iocb, struct 
iov_iter *from)
write_len = iov_iter_count(from);
end = (pos + write_len - 1) >> PAGE_SHIFT;
 
-   written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 
1);
-   if (written)
-   goto out;
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   /* If there are pages to writeback, return */
+   if (filemap_range_has_page(inode->i_mapping, pos,
+  pos + iov_iter_count(from)))
+   return -EAGAIN;
+   } else {
+   written = filemap_write_and_wait_range(mapping, pos,
+   pos + write_len - 1);
+   if (written)
+   goto out;
+   }
 
/*
 * After a write we want buffered reads to be sure to go to disk to get
-- 
2.12.0



[PATCH 06/10] fs: Introduce IOMAP_NOWAIT

2017-05-11 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

IOCB_NOWAIT translates to IOMAP_NOWAIT for iomaps.
This is used by XFS in the XFS patch.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
---
 fs/iomap.c| 2 ++
 include/linux/iomap.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index 141c3cd55a8b..d1c81753d411 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -885,6 +885,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
} else {
dio->flags |= IOMAP_DIO_WRITE;
flags |= IOMAP_WRITE;
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   flags |= IOMAP_NOWAIT;
}
 
if (mapping->nrpages) {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 7291810067eb..53f6af89c625 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -51,6 +51,7 @@ struct iomap {
 #define IOMAP_REPORT   (1 << 2) /* report extent status, e.g. FIEMAP */
 #define IOMAP_FAULT(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT   (1 << 4) /* direct I/O */
+#define IOMAP_NOWAIT   (1 << 5) /* Don't wait for writeback */
 
 struct iomap_ops {
/*
-- 
2.12.0



[PATCH 09/10] xfs: nowait aio support

2017-05-11 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

If IOCB_NOWAIT is set, bail if the i_rwsem is not lockable
immediately.

If IOMAP_NOWAIT is set, return -EAGAIN in xfs_file_iomap_begin
if the write needs allocation, whether due to file extension, writing
to a hole, or COW, or if it would wait for other DIOs to finish.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
---
 fs/xfs/xfs_file.c  | 19 ++-
 fs/xfs/xfs_iomap.c | 17 +
 2 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 35703a801372..b307940e7d56 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -541,8 +541,11 @@ xfs_file_dio_aio_write(
iolock = XFS_IOLOCK_SHARED;
}
 
-   xfs_ilock(ip, iolock);
-
+   if (!xfs_ilock_nowait(ip, iolock)) {
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EAGAIN;
+   xfs_ilock(ip, iolock);
+   }
ret = xfs_file_aio_write_checks(iocb, from, &iolock);
if (ret)
goto out;
@@ -553,9 +556,15 @@ xfs_file_dio_aio_write(
 * otherwise demote the lock if we had to take the exclusive lock
 * for other reasons in xfs_file_aio_write_checks.
 */
-   if (unaligned_io)
-   inode_dio_wait(inode);
-   else if (iolock == XFS_IOLOCK_EXCL) {
+   if (unaligned_io) {
+   /* If we are going to wait for other DIO to finish, bail */
+   if (iocb->ki_flags & IOCB_NOWAIT) {
if (atomic_read(&inode->i_dio_count))
+   return -EAGAIN;
+   } else {
+   inode_dio_wait(inode);
+   }
+   } else if (iolock == XFS_IOLOCK_EXCL) {
xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
iolock = XFS_IOLOCK_SHARED;
}
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 288ee5b840d7..9baa65eeae9e 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1015,6 +1015,15 @@ xfs_file_iomap_begin(
 
if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
if (flags & IOMAP_DIRECT) {
+   /*
+* A reflinked inode will result in CoW alloc.
+* FIXME: It could still overwrite on unshared extents
+* and not need allocation.
+*/
+   if (flags & IOMAP_NOWAIT) {
+   error = -EAGAIN;
+   goto out_unlock;
+   }
/* may drop and re-acquire the ilock */
error = xfs_reflink_allocate_cow(ip, &imap, &shared,
&lockmode);
@@ -1032,6 +1041,14 @@ xfs_file_iomap_begin(
 
if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, , nimaps)) {
/*
+* If nowait is set bail since we are going to make
+* allocations.
+*/
+   if (flags & IOMAP_NOWAIT) {
+   error = -EAGAIN;
+   goto out_unlock;
+   }
+   /*
 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
 * pages to keep the chunks of work done where somewhat 
symmetric
 * with the work writeback does. This is a completely arbitrary
-- 
2.12.0



[PATCH 08/10] ext4: nowait aio support

2017-05-11 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Return -EAGAIN for direct I/O if any of the following hold:
  + i_rwsem is not immediately lockable
  + The write extends beyond end of file (will trigger allocation)
  + Blocks are not allocated at the write location

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Jan Kara <j...@suse.cz>
---
 fs/ext4/file.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index cefa9835f275..2efdc6d4d3e8 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -216,7 +216,13 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
return ext4_dax_write_iter(iocb, from);
 #endif
 
-   inode_lock(inode);
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   if (!inode_trylock(inode))
+   return -EAGAIN;
+   } else {
+   inode_lock(inode);
+   }
+
ret = ext4_write_checks(iocb, from);
if (ret <= 0)
goto out;
@@ -235,9 +241,15 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
 
iocb->private = &overwrite;
/* Check whether we do a DIO overwrite or not */
-   if (o_direct && ext4_should_dioread_nolock(inode) && !unaligned_aio &&
-   ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from)))
-   overwrite = 1;
+   if (o_direct && !unaligned_aio) {
+   if (ext4_overwrite_io(inode, iocb->ki_pos, 
iov_iter_count(from))) {
+   if (ext4_should_dioread_nolock(inode))
+   overwrite = 1;
+   } else if (iocb->ki_flags & IOCB_NOWAIT) {
+   ret = -EAGAIN;
+   goto out;
+   }
+   }
 
ret = __generic_file_write_iter(iocb, from);
inode_unlock(inode);
-- 
2.12.0



[PATCH 07/10] fs: return on congested block device

2017-05-11 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

A new bio operation flag, REQ_NOWAIT, is introduced to identify bios
originating from an iocb with IOCB_NOWAIT. This flag indicates that
the request should return immediately if it cannot be made, instead
of retrying.

Stacked devices such as md (the ones with make_request_fn hooks) are
currently not supported because they may block for housekeeping.
For example, an md device can have a part of itself suspended.
For this reason, only request based devices are supported.
In the future, this feature will be expanded to stacked devices
by teaching them how to handle the REQ_NOWAIT flag.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
---
 block/blk-core.c  | 24 ++--
 block/blk-mq-sched.c  |  3 +++
 block/blk-mq.c|  4 
 fs/direct-io.c| 10 --
 include/linux/bio.h   |  6 ++
 include/linux/blk_types.h |  2 ++
 6 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d772c221cc17..effe934b806b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1232,6 +1232,11 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
if (!IS_ERR(rq))
return rq;
 
+   if (op & REQ_NOWAIT) {
+   blk_put_rl(rl);
+   return ERR_PTR(-EAGAIN);
+   }
+
if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) 
{
blk_put_rl(rl);
return rq;
@@ -1870,6 +1875,17 @@ generic_make_request_checks(struct bio *bio)
goto end_io;
}
 
+   /*
+* For a REQ_NOWAIT based request, return -EOPNOTSUPP
+* if it is not a request based queue.
+*/
+
+   if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q)) {
+   err = -EOPNOTSUPP;
+   goto end_io;
+   }
+
part = bio->bi_bdev->bd_part;
if (should_fail_request(part, bio->bi_iter.bi_size) ||
should_fail_request(&part_to_disk(part)->part0,
@@ -2021,7 +2037,7 @@ blk_qc_t generic_make_request(struct bio *bio)
do {
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-   if (likely(blk_queue_enter(q, false) == 0)) {
+   if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) {
struct bio_list lower, same;
 
/* Create a fresh bio_list for all subordinate requests 
*/
@@ -2046,7 +2062,11 @@ blk_qc_t generic_make_request(struct bio *bio)
bio_list_merge(&bio_list_on_stack[0], &same);
bio_list_merge(&bio_list_on_stack[0],
&bio_list_on_stack[1]);
} else {
-   bio_io_error(bio);
+   if (unlikely(!blk_queue_dying(q) &&
+   (bio->bi_opf & REQ_NOWAIT)))
+   bio_wouldblock_error(bio);
+   else
+   bio_io_error(bio);
}
bio = bio_list_pop(&bio_list_on_stack[0]);
} while (bio);
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index c974a1bbf4cb..019d881d62b7 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -119,6 +119,9 @@ struct request *blk_mq_sched_get_request(struct 
request_queue *q,
if (likely(!data->hctx))
data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
 
+   if (op & REQ_NOWAIT)
+   data->flags |= BLK_MQ_REQ_NOWAIT;
+
if (e) {
data->flags |= BLK_MQ_REQ_INTERNAL;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c7836a1ded97..d7613ae6a269 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1538,6 +1538,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue 
*q, struct bio *bio)
rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
if (unlikely(!rq)) {
__wbt_done(q->rq_wb, wb_acct);
+   if (bio->bi_opf & REQ_NOWAIT)
+   bio_wouldblock_error(bio);
return BLK_QC_T_NONE;
}
 
@@ -1662,6 +1664,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue 
*q, struct bio *bio)
rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
if (unlikely(!rq)) {
__wbt_done(q->rq_wb, wb_acct);
+   if (bio->bi_opf & REQ_NOWAIT)
+   bio_wouldblock_error(bio);
return BLK_QC_T_NONE;
}
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea77de8..139ebd5ae1c7 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -480,8 +480,12 @@ static int dio_bio_complete(struct dio *dio, struct bio 
*bio)
   

[PATCH 02/10] fs: Introduce filemap_range_has_page()

2017-05-11 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

filemap_range_has_page() returns true if the file's mapping has
a page within the range mentioned. This function will be used
to check whether a write() call will cause a writeback of previous
writes.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
---
 include/linux/fs.h |  2 ++
 mm/filemap.c   | 33 +
 2 files changed, 35 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 869c9a6fe58d..2e6fc6a23f91 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2513,6 +2513,8 @@ extern int filemap_fdatawait(struct address_space *);
 extern void filemap_fdatawait_keep_errors(struct address_space *);
 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
   loff_t lend);
+extern int filemap_range_has_page(struct address_space *, loff_t lstart,
+ loff_t lend);
 extern int filemap_write_and_wait(struct address_space *mapping);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend);
diff --git a/mm/filemap.c b/mm/filemap.c
index 1694623a6289..fae5a361befb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -376,6 +376,39 @@ int filemap_flush(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_flush);
 
+/**
+ * filemap_range_has_page - check if a page exists in range.
+ * @mapping:   address space structure to wait for
+ * @start_byte:offset in bytes where the range starts
+ * @end_byte:  offset in bytes where the range ends (inclusive)
+ *
+ * Find at least one page in the range supplied, usually used to check if
+ * direct writing in this range will trigger a writeback.
+ */
+int filemap_range_has_page(struct address_space *mapping,
+  loff_t start_byte, loff_t end_byte)
+{
+   pgoff_t index = start_byte >> PAGE_SHIFT;
+   pgoff_t end = end_byte >> PAGE_SHIFT;
+   struct pagevec pvec;
+   int ret;
+
+   if (end_byte < start_byte)
+   return 0;
+
+   if (mapping->nrpages == 0)
+   return 0;
+
+   pagevec_init(&pvec, 0);
+   ret = pagevec_lookup(&pvec, mapping, index, 1);
+   if (!ret)
+   return 0;
+   ret = (pvec.pages[0]->index <= end);
+   pagevec_release(&pvec);
+   return ret;
+}
+EXPORT_SYMBOL(filemap_range_has_page);
+
 static int __filemap_fdatawait_range(struct address_space *mapping,
 loff_t start_byte, loff_t end_byte)
 {
-- 
2.12.0



[PATCH 10/10] btrfs: nowait aio support

2017-05-11 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Return -EAGAIN if any of the following hold:
 + i_rwsem is not lockable
 + NODATACOW or PREALLOC is not set
 + Cannot nocow at the desired location
 + Writing beyond end of file into an unallocated region

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Acked-by: David Sterba <dste...@suse.com>
---
 fs/btrfs/file.c  | 25 -
 fs/btrfs/inode.c |  3 +++
 2 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 520cb7230b2d..a870e5dd2b4d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1823,12 +1823,29 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
ssize_t num_written = 0;
bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host);
ssize_t err;
-   loff_t pos;
-   size_t count;
+   loff_t pos = iocb->ki_pos;
+   size_t count = iov_iter_count(from);
loff_t oldsize;
int clean_page = 0;
 
-   inode_lock(inode);
+   if ((iocb->ki_flags & IOCB_NOWAIT) &&
+   (iocb->ki_flags & IOCB_DIRECT)) {
+   /* Don't sleep on inode rwsem */
+   if (!inode_trylock(inode))
+   return -EAGAIN;
+   /*
+* We will allocate space in case nodatacow is not set,
+* so bail
+*/
+   if (!(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
+ BTRFS_INODE_PREALLOC)) ||
check_can_nocow(BTRFS_I(inode), pos, &count) <= 0) {
+   inode_unlock(inode);
+   return -EAGAIN;
+   }
+   } else
+   inode_lock(inode);
+
err = generic_write_checks(iocb, from);
if (err <= 0) {
inode_unlock(inode);
@@ -1862,8 +1879,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 */
update_time_for_write(inode);
 
-   pos = iocb->ki_pos;
-   count = iov_iter_count(from);
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
if (start_pos > oldsize) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5e71f1ea3391..47d3fcd86979 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8625,6 +8625,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct 
iov_iter *iter)
dio_data.overwrite = 1;
inode_unlock(inode);
relock = true;
+   } else if (iocb->ki_flags & IOCB_NOWAIT) {
+   ret = -EAGAIN;
+   goto out;
}
ret = btrfs_delalloc_reserve_space(inode, offset, count);
if (ret)
-- 
2.12.0



[PATCH 03/10] fs: Use RWF_* flags for AIO operations

2017-05-11 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

aio_rw_flags is introduced in struct iocb (using aio_reserved1) which will
carry the RWF_* flags. We cannot use aio_flags because they are not
checked for validity which may break existing applications.

Note that the only place RWF_HIPRI takes effect is dio_await_one().
Everywhere else, the aio code returns -EIOCBQUEUED before the
RWF_HIPRI checks are reached.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
---
 fs/aio.c | 8 +++-
 include/uapi/linux/aio_abi.h | 2 +-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index f52d925ee259..020fa0045e3c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1541,7 +1541,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
ssize_t ret;
 
/* enforce forwards compatibility on users */
-   if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
+   if (unlikely(iocb->aio_reserved2)) {
pr_debug("EINVAL: reserve field set\n");
return -EINVAL;
}
@@ -1586,6 +1586,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
req->common.ki_flags |= IOCB_EVENTFD;
}
 
+   ret = kiocb_set_rw_flags(&req->common, iocb->aio_rw_flags);
+   if (unlikely(ret)) {
+   pr_debug("EINVAL: aio_rw_flags\n");
+   goto out_put_req;
+   }
+
ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
if (unlikely(ret)) {
pr_debug("EFAULT: aio_key\n");
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index bb2554f7fbd1..a2d4a8ac94ca 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -79,7 +79,7 @@ struct io_event {
 struct iocb {
/* these are internal to the kernel/libc. */
__u64   aio_data;   /* data to be returned in event's data */
-   __u32   PADDED(aio_key, aio_reserved1);
+   __u32   PADDED(aio_key, aio_rw_flags);
/* the kernel sets aio_key to the req # */
 
/* common fields */
-- 
2.12.0



[PATCH 0/10 v8] No wait AIO

2017-05-11 Thread Goldwyn Rodrigues
Formerly known as non-blocking AIO.

This series adds nonblocking feature to asynchronous I/O writes.
io_submit() can be delayed because of a number of reasons:
 - Block allocation for files
 - Data writebacks for direct I/O
 - Sleeping because of waiting to acquire i_rwsem
 - Congested block device

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s down to the kernel, and defer any that return
-EAGAIN to another thread.

In order to enable this, IOCB_RW_FLAG_NOWAIT is introduced in
uapi/linux/aio_abi.h. If set for aio_rw_flags, it translates to
IOCB_NOWAIT for struct iocb, REQ_NOWAIT for bio.bi_opf and IOMAP_NOWAIT for
iomap. aio_rw_flags is a new flag replacing aio_reserved1. We could
not use aio_flags because it is not currently checked for invalidity
in the kernel.

This feature is supported for asynchronous direct I/O only. I have
tested it against xfs, ext4, and btrfs, and I intend to add more filesystems.
The nowait feature is for request based devices. In the future, I intend to
add support to stacked devices such as md.

Applications will have to probe for support by issuing an async direct
write; any error other than -EAGAIN means the feature is not supported.

First two patches are prep patches into nowait I/O.

Changes since v1:
 + changed name from _NONBLOCKING to *_NOWAIT
 + filemap_range_has_page() call moved closer to (just before) the call to
filemap_write_and_wait_range().
 + BIO_NOWAIT limited to get_request()
 + XFS fixes 
- included reflink 
- use of xfs_ilock_nowait() instead of a XFS_IOLOCK_NONBLOCKING flag
- Translate the flag through IOMAP_NOWAIT (iomap) to check for
  block allocation for the file.
 + ext4 coding style

Changes since v2:
 + Using aio_reserved1 as aio_rw_flags instead of aio_flags
 + blk-mq support
 + xfs uptodate with kernel and reflink changes

 Changes since v3:
  + Added FS_NOWAIT, which is set if the filesystem supports NOWAIT feature.
  + Checks in generic_make_request() to make sure BIO_NOWAIT comes in
for async direct writes only.
  + Added QUEUE_FLAG_NOWAIT, which is set if the device supports BIO_NOWAIT.
This is added (rather not set) to block devices such as dm/md currently.

 Changes since v4:
  + Ported AIO code to use RWF_* flags. Check for RWF_* flags in
generic_file_write_iter().
  + Changed IOCB_RW_FLAGS_NOWAIT to RWF_NOWAIT.

 Changes since v5:
  + BIO_NOWAIT to REQ_NOWAIT
  + Common helper for RWF flags.

 Changes since v6:
  + REQ_NOWAIT will be ignored for request based devices since they
cannot block. So, removed QUEUE_FLAG_NOWAIT since it is not
required in the current implementation. It will be resurrected
when we program for stacked devices.
  + changed kiocb_rw_flags() to kiocb_set_rw_flags() in order to accommodate
for errors. Moved checks in the function.

 Changes since v7:
  + split patches into prep so the main patches are smaller and easier
to understand
  + All patches are reviewed or acked!

-- 
Goldwyn




[PATCH 01/10] fs: Separate out kiocb flags setup based on RWF_* flags

2017-05-11 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
---
 fs/read_write.c| 12 +++-
 include/linux/fs.h | 14 ++
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index c4f88afbc67f..362f91cd8d66 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -678,16 +678,10 @@ static ssize_t do_iter_readv_writev(struct file *filp, 
struct iov_iter *iter,
struct kiocb kiocb;
ssize_t ret;
 
-   if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
-   return -EOPNOTSUPP;
-
init_sync_kiocb(&kiocb, filp);
-   if (flags & RWF_HIPRI)
-   kiocb.ki_flags |= IOCB_HIPRI;
-   if (flags & RWF_DSYNC)
-   kiocb.ki_flags |= IOCB_DSYNC;
-   if (flags & RWF_SYNC)
-   kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+   ret = kiocb_set_rw_flags(&kiocb, flags);
+   if (ret)
+   return ret;
kiocb.ki_pos = *ppos;
 
if (type == READ)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7251f7bb45e8..869c9a6fe58d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3049,6 +3049,20 @@ static inline int iocb_flags(struct file *file)
return res;
 }
 
+static inline int kiocb_set_rw_flags(struct kiocb *ki, int flags)
+{
+   if (unlikely(flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC)))
+   return -EOPNOTSUPP;
+
+   if (flags & RWF_HIPRI)
+   ki->ki_flags |= IOCB_HIPRI;
+   if (flags & RWF_DSYNC)
+   ki->ki_flags |= IOCB_DSYNC;
+   if (flags & RWF_SYNC)
+   ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+   return 0;
+}
+
 static inline ino_t parent_ino(struct dentry *dentry)
 {
ino_t res;
-- 
2.12.0



Re: [PATCH 5/8] nowait aio: return on congested block device

2017-05-11 Thread Goldwyn Rodrigues


On 05/11/2017 02:44 AM, Christoph Hellwig wrote:
> Looks fine,
> 
> Reviewed-by: Christoph Hellwig 
> 
> Although lifting the make_request limit is something a lot of users
> would appreciate in the near future..
> 

Yes, I understand. That will be on my todo list next on priority.

-- 
Goldwyn


qgroup: direct writes returns -EDQUOT too soon

2017-05-10 Thread Goldwyn Rodrigues

Here is a sample script to recreate the issue:
mkfs.btrfs -f /dev/vdb
mount /dev/vdb /mnt
btrfs quota enable /mnt
btrfs sub create /mnt/tmp
btrfs qgroup limit 200M /mnt/tmp
btrfs quota rescan -w /mnt
cd /mnt/tmp
for i in {1..5}; do
sync
dd if=/dev/zero of=/mnt/tmp/file-$i oflag=direct
sync
done

btrfs qgroup show -pcref /mnt/tmp


Output:

Create subvolume '/mnt/tmp'
quota rescan started
dd: writing to '/mnt/tmp/file-1': Disk quota exceeded
11991+0 records in
11990+0 records out
6138880 bytes (6.1 MB, 5.9 MiB) copied, 2.40459 s, 2.6 MB/s
dd: writing to '/mnt/tmp/file-2': Disk quota exceeded
11807+0 records in
11806+0 records out
6044672 bytes (6.0 MB, 5.8 MiB) copied, 2.11256 s, 2.9 MB/s
dd: writing to '/mnt/tmp/file-3': Disk quota exceeded
11628+0 records in
11627+0 records out
5953024 bytes (6.0 MB, 5.7 MiB) copied, 2.53767 s, 2.3 MB/s
dd: writing to '/mnt/tmp/file-4': Disk quota exceeded
11080+0 records in
11079+0 records out
5672448 bytes (5.7 MB, 5.4 MiB) copied, 2.3697 s, 2.4 MB/s
dd: writing to '/mnt/tmp/file-5': Disk quota exceeded
11358+0 records in
11357+0 records out
5814784 bytes (5.8 MB, 5.5 MiB) copied, 2.10354 s, 2.8 MB/s

qgroupid  rfer      excl      max_rfer   max_excl  parent  child
--------  --------  --------  ---------  --------  ------  -----
0/257     28.84MiB  28.84MiB  200.00MiB  none      ---     ---

The files created are only 5-6MB even though the qgroup limit is 200M.
Every attempt, including the first, returns EDQUOT at around 5-6MB.


-- 
Goldwyn


Re: [RFC PATCH v3 5/6] btrfs: qgroup: Introduce extent changeset for qgroup reserve functions

2017-05-10 Thread Goldwyn Rodrigues


On 05/09/2017 09:36 PM, Qu Wenruo wrote:
> Introduce a new parameter, struct extent_changeset for
> btrfs_qgroup_reserved_data() and its callers.
> 
> Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
> which range it reserved in current reserve, so it can free it at error
> path.
> 
> The reason we need to export it to callers is, at buffered write error
> path, without knowing what exactly which range we reserved in current
> allocation, we can free space which is not reserved by us.
> 
> This will lead to qgroup reserved space underflow.
> 
> Reviewed-by: Chandan Rajendra 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/ctree.h   |  6 --
>  fs/btrfs/extent-tree.c | 16 +++-
>  fs/btrfs/extent_io.h   | 34 ++
>  fs/btrfs/file.c| 12 +---
>  fs/btrfs/inode-map.c   |  4 +++-
>  fs/btrfs/inode.c   | 18 ++
>  fs/btrfs/ioctl.c   |  5 -
>  fs/btrfs/qgroup.c  | 41 +
>  fs/btrfs/qgroup.h  |  3 ++-
>  fs/btrfs/relocation.c  |  4 +++-
>  10 files changed, 113 insertions(+), 30 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 1e82516fe2d8..52a0147cd612 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2704,8 +2704,9 @@ enum btrfs_flush_state {
>   COMMIT_TRANS=   6,
>  };
>  
> -int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len);
>  int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
> +int btrfs_check_data_free_space(struct inode *inode,
> + struct extent_changeset **reserved, u64 start, u64 len);
>  void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len);
>  void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start,
>   u64 len);
> @@ -2723,7 +2724,8 @@ void btrfs_subvolume_release_metadata(struct 
> btrfs_fs_info *fs_info,
> struct btrfs_block_rsv *rsv);
>  int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 
> num_bytes);
>  void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 
> num_bytes);
> -int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len);
> +int btrfs_delalloc_reserve_space(struct inode *inode,
> + struct extent_changeset **reserved, u64 start, u64 len);
>  void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len);
>  void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
>  struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info,
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 4f62696131a6..782e0f5feb69 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3364,6 +3364,7 @@ static int cache_save_setup(struct 
> btrfs_block_group_cache *block_group,
>   struct btrfs_fs_info *fs_info = block_group->fs_info;
>   struct btrfs_root *root = fs_info->tree_root;
>   struct inode *inode = NULL;
> + struct extent_changeset *data_reserved = NULL;
>   u64 alloc_hint = 0;
>   int dcs = BTRFS_DC_ERROR;
>   u64 num_pages = 0;
> @@ -3483,7 +3484,7 @@ static int cache_save_setup(struct 
> btrfs_block_group_cache *block_group,
>   num_pages *= 16;
>   num_pages *= PAGE_SIZE;
>  
> - ret = btrfs_check_data_free_space(inode, 0, num_pages);
> + ret = btrfs_check_data_free_space(inode, &data_reserved, 0, num_pages);
>   if (ret)
>   goto out_put;
>  
> @@ -3514,6 +3515,7 @@ static int cache_save_setup(struct 
> btrfs_block_group_cache *block_group,
>   block_group->disk_cache_state = dcs;
> spin_unlock(&block_group->lock);
>  
> + extent_changeset_free(data_reserved);
>   return ret;
>  }
>  
> @@ -4282,7 +4284,8 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode 
> *inode, u64 bytes)
>   * Will replace old btrfs_check_data_free_space(), but for patch split,
>   * add a new function first and then replace it.
>   */
> -int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len)
> +int btrfs_check_data_free_space(struct inode *inode,
> + struct extent_changeset **reserved, u64 start, u64 len)
>  {
>   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   int ret;
> @@ -4297,9 +4300,11 @@ int btrfs_check_data_free_space(struct inode *inode, 
> u64 start, u64 len)
>   return ret;
>  
>   /* Use new btrfs_qgroup_reserve_data to reserve precious data space. */
> - ret = btrfs_qgroup_reserve_data(inode, start, len);
> + ret = btrfs_qgroup_reserve_data(inode, reserved, start, len);
>   if (ret < 0)
>   btrfs_free_reserved_data_space_noquota(inode, start, len);
> + else
> + ret = 0;
>   return ret;
>  }
>  
> @@ -6140,11 +6145,12 @@ void 

[PATCH 0/8 v7] No wait AIO

2017-05-09 Thread Goldwyn Rodrigues
Formerly known as non-blocking AIO.

This series adds nonblocking feature to asynchronous I/O writes.
io_submit() can be delayed for a number of reasons:
 - Block allocation for files
 - Data writebacks for direct I/O
 - Sleeping because of waiting to acquire i_rwsem
 - Congested block device

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions is met. This way userspace can push as many
writes to the kernel as the kernel can complete without blocking,
and defer any that return -EAGAIN to another thread.

In order to enable this, IOCB_RW_FLAG_NOWAIT is introduced in
uapi/linux/aio_abi.h. If set for aio_rw_flags, it translates to
IOCB_NOWAIT for struct iocb, REQ_NOWAIT for bio.bi_opf and IOMAP_NOWAIT for
iomap. aio_rw_flags is a new field replacing aio_reserved1. We could
not use aio_flags because the kernel does not currently check it for
invalid values.

This feature is provided for asynchronous direct I/O only. I have
tested it against xfs, ext4, and btrfs, and I intend to add more filesystems.
The nowait feature is for request-based devices. In the future, I intend to
add support for stacked devices such as md.

Applications will have to check for support by sending an async
direct write: any error besides -EAGAIN means the feature is not
supported.

Changes since v1:
 + changed name from _NONBLOCKING to *_NOWAIT
 + filemap_range_has_page call moved closer to (just before) the call to
filemap_write_and_wait_range().
 + BIO_NOWAIT limited to get_request()
 + XFS fixes 
- included reflink 
- use of xfs_ilock_nowait() instead of a XFS_IOLOCK_NONBLOCKING flag
- Translate the flag through IOMAP_NOWAIT (iomap) to check for
  block allocation for the file.
 + ext4 coding style

Changes since v2:
 + Using aio_reserved1 as aio_rw_flags instead of aio_flags
 + blk-mq support
 + xfs uptodate with kernel and reflink changes

 Changes since v3:
  + Added FS_NOWAIT, which is set if the filesystem supports NOWAIT feature.
  + Checks in generic_make_request() to make sure BIO_NOWAIT comes in
for async direct writes only.
  + Added QUEUE_FLAG_NOWAIT, which is set if the device supports BIO_NOWAIT.
It is currently not set for stacked block devices such as dm/md.

 Changes since v4:
  + Ported AIO code to use RWF_* flags. Check for RWF_* flags in
generic_file_write_iter().
  + Changed IOCB_RW_FLAGS_NOWAIT to RWF_NOWAIT.

 Changes since v5:
  + BIO_NOWAIT to REQ_NOWAIT
  + Common helper for RWF flags.

 Changes since v6:
  + REQ_NOWAIT will be ignored for request based devices since they
cannot block. So, removed QUEUE_FLAG_NOWAIT since it is not
required in the current implementation. It will be resurrected
when we program for stacked devices.
  + changed kiocb_rw_flags() to kiocb_set_rw_flags() in order to accommodate
errors. Moved checks into the function.

-- 
Goldwyn




[PATCH 3/8] nowait aio: return if direct write will trigger writeback

2017-05-09 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Find out if the write will trigger a wait due to writeback. If yes,
return -EAGAIN.

This introduces a new function filemap_range_has_page() which
returns true if the file's mapping has a page within the range
mentioned.

Return -EINVAL for buffered AIO: there are multiple sources of
delay, such as page locks, dirty-throttling logic, and page loading
from disk, which cannot be avoided.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 include/linux/fs.h |  2 ++
 mm/filemap.c   | 50 +++---
 2 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4cb62e032b70..24d5c123788f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2514,6 +2514,8 @@ extern int filemap_fdatawait(struct address_space *);
 extern void filemap_fdatawait_keep_errors(struct address_space *);
 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
   loff_t lend);
+extern int filemap_range_has_page(struct address_space *, loff_t lstart,
+  loff_t lend);
 extern int filemap_write_and_wait(struct address_space *mapping);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend);
diff --git a/mm/filemap.c b/mm/filemap.c
index d51670b7fe6b..48b83d1d4a30 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -376,6 +376,39 @@ int filemap_flush(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_flush);
 
+/**
+ * filemap_range_has_page - check if a page exists in range.
+ * @mapping:   address space structure to wait for
+ * @start_byte:offset in bytes where the range starts
+ * @end_byte:  offset in bytes where the range ends (inclusive)
+ *
+ * Find at least one page in the range supplied, usually used to check if
+ * direct writing in this range will trigger a writeback.
+ */
+int filemap_range_has_page(struct address_space *mapping,
+   loff_t start_byte, loff_t end_byte)
+{
+   pgoff_t index = start_byte >> PAGE_SHIFT;
+   pgoff_t end = end_byte >> PAGE_SHIFT;
+   struct pagevec pvec;
+   int ret;
+
+   if (end_byte < start_byte)
+   return 0;
+
+   if (mapping->nrpages == 0)
+   return 0;
+
+   pagevec_init(&pvec, 0);
+   ret = pagevec_lookup(&pvec, mapping, index, 1);
+   if (!ret)
+   return 0;
+   ret = (pvec.pages[0]->index <= end);
+   pagevec_release(&pvec);
+   return ret;
+}
+EXPORT_SYMBOL(filemap_range_has_page);
+
 static int __filemap_fdatawait_range(struct address_space *mapping,
 loff_t start_byte, loff_t end_byte)
 {
@@ -2640,6 +2673,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, 
struct iov_iter *from)
 
pos = iocb->ki_pos;
 
+   if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
+   return -EINVAL;
+
if (limit != RLIM_INFINITY) {
if (iocb->ki_pos >= limit) {
send_sig(SIGXFSZ, current, 0);
@@ -2709,9 +2745,17 @@ generic_file_direct_write(struct kiocb *iocb, struct 
iov_iter *from)
write_len = iov_iter_count(from);
end = (pos + write_len - 1) >> PAGE_SHIFT;
 
-   written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 
1);
-   if (written)
-   goto out;
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   /* If there are pages to writeback, return */
+   if (filemap_range_has_page(inode->i_mapping, pos,
+  pos + iov_iter_count(from)))
+   return -EAGAIN;
+   } else {
+   written = filemap_write_and_wait_range(mapping, pos,
+   pos + write_len - 1);
+   if (written)
+   goto out;
+   }
 
/*
 * After a write we want buffered reads to be sure to go to disk to get
-- 
2.12.0



[PATCH 2/8] nowait aio: Introduce RWF_NOWAIT

2017-05-09 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

This flag informs the kernel to bail out if an AIO request would
block, for reasons such as file allocation, triggering a writeback,
or blocking on request allocation while performing direct I/O.

Unfortunately, aio_flags is not checked for validity, which would
break existing applications which have it set to anything besides zero
or IOCB_FLAG_RESFD. So, we are using aio_reserved1 and renaming it
to aio_rw_flags.

RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags.

The check for -EOPNOTSUPP is placed in generic_file_write_iter(), which
most filesystems reach either directly as their .write_iter() or from the
function they define for .write_iter(). Filesystems which do not use it
perform the check in their own .write_iter(), which is called for direct
I/O specifically.

Filesystems xfs, btrfs and ext4 would be supported in the following patches.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/9p/vfs_file.c| 3 +++
 fs/aio.c| 6 ++
 fs/ceph/file.c  | 3 +++
 fs/cifs/file.c  | 3 +++
 fs/fuse/file.c  | 3 +++
 fs/nfs/direct.c | 3 +++
 fs/ocfs2/file.c | 3 +++
 include/linux/fs.h  | 5 -
 include/uapi/linux/fs.h | 1 +
 mm/filemap.c| 3 +++
 10 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index 3de3b4a89d89..403681db7723 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -411,6 +411,9 @@ v9fs_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
loff_t origin;
int err = 0;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
retval = generic_write_checks(iocb, from);
if (retval <= 0)
return retval;
diff --git a/fs/aio.c b/fs/aio.c
index 020fa0045e3c..ea9f8581d902 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1592,6 +1592,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
goto out_put_req;
}
 
+   if ((req->common.ki_flags & IOCB_NOWAIT) &&
+   !(req->common.ki_flags & IOCB_DIRECT)) {
+   ret = -EOPNOTSUPP;
+   goto out_put_req;
+   }
+
ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
if (unlikely(ret)) {
pr_debug("EFAULT: aio_key\n");
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 26cc95421cca..af28419b1731 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1267,6 +1267,9 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct 
iov_iter *from)
int err, want, got;
loff_t pos;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
if (ceph_snap(inode) != CEPH_NOSNAP)
return -EROFS;
 
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 21d404535739..f8858a06e119 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2638,6 +2638,9 @@ ssize_t cifs_user_writev(struct kiocb *iocb, struct 
iov_iter *from)
 * write request.
 */
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
rc = generic_write_checks(iocb, from);
if (rc <= 0)
return rc;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ec238fb5a584..72786e798319 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1425,6 +1425,9 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, 
struct iov_iter *from)
struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(file);
ssize_t res;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
if (is_bad_inode(inode))
return -EIO;
 
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index c1b5fed7c863..dcea0caa5cb5 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -996,6 +996,9 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct 
iov_iter *iter)
dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
file, iov_iter_count(iter), (long long) iocb->ki_pos);
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
result = generic_write_checks(iocb, iter);
if (result <= 0)
return result;
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index bfeb647459d9..e7f8ba890305 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2235,6 +2235,9 @@ static ssize_t ocfs2_file_write_iter(struct kiocb *iocb,
if (count == 0)
return 0;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
direct_io = iocb->ki_flags & IOCB_DIRECT ? 1 : 0;
 
inode_lock(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 869c9a6fe58d..4cb62e032b70 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -270,6 +270,7 @@ struct writeback_control;
 #define IOCB_DSYNC (1 << 4)

[PATCH 7/8] nowait aio: xfs

2017-05-09 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

If IOCB_NOWAIT is set, bail if the i_rwsem is not lockable
immediately.

If IOMAP_NOWAIT is set, return EAGAIN in xfs_file_iomap_begin
if the write needs allocation, whether due to file extension, writing
to a hole, or COW, or if it would wait for other DIOs to finish.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
---
 fs/xfs/xfs_file.c  | 19 ++-
 fs/xfs/xfs_iomap.c | 17 +
 2 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 35703a801372..b307940e7d56 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -541,8 +541,11 @@ xfs_file_dio_aio_write(
iolock = XFS_IOLOCK_SHARED;
}
 
-   xfs_ilock(ip, iolock);
-
+   if (!xfs_ilock_nowait(ip, iolock)) {
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EAGAIN;
+   xfs_ilock(ip, iolock);
+   }
ret = xfs_file_aio_write_checks(iocb, from, &iolock);
if (ret)
goto out;
@@ -553,9 +556,15 @@ xfs_file_dio_aio_write(
 * otherwise demote the lock if we had to take the exclusive lock
 * for other reasons in xfs_file_aio_write_checks.
 */
-   if (unaligned_io)
-   inode_dio_wait(inode);
-   else if (iolock == XFS_IOLOCK_EXCL) {
+   if (unaligned_io) {
+   /* If we are going to wait for other DIO to finish, bail */
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   if (atomic_read(&inode->i_dio_count))
+   return -EAGAIN;
+   } else {
+   inode_dio_wait(inode);
+   }
+   } else if (iolock == XFS_IOLOCK_EXCL) {
xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
iolock = XFS_IOLOCK_SHARED;
}
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 288ee5b840d7..9baa65eeae9e 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1015,6 +1015,15 @@ xfs_file_iomap_begin(
 
if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
if (flags & IOMAP_DIRECT) {
+   /*
+* A reflinked inode will result in CoW alloc.
+* FIXME: It could still overwrite on unshared extents
+* and not need allocation.
+*/
+   if (flags & IOMAP_NOWAIT) {
+   error = -EAGAIN;
+   goto out_unlock;
+   }
/* may drop and re-acquire the ilock */
error = xfs_reflink_allocate_cow(ip, &imap, &shared, &lockmode);
@@ -1032,6 +1041,14 @@ xfs_file_iomap_begin(
 
if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, &imap, nimaps)) {
/*
+* If nowait is set bail since we are going to make
+* allocations.
+*/
+   if (flags & IOMAP_NOWAIT) {
+   error = -EAGAIN;
+   goto out_unlock;
+   }
+   /*
 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
 * pages to keep the chunks of work done where somewhat 
symmetric
 * with the work writeback does. This is a completely arbitrary
-- 
2.12.0



[PATCH 8/8] nowait aio: btrfs

2017-05-09 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Return EAGAIN if any of the following checks fail
 + i_rwsem is not lockable
 + NODATACOW or PREALLOC is not set
 + Cannot nocow at the desired location
 + Writing beyond end of file which is not allocated

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Acked-by: David Sterba <dste...@suse.com>
---
 fs/btrfs/file.c  | 25 -
 fs/btrfs/inode.c |  3 +++
 2 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 520cb7230b2d..a870e5dd2b4d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1823,12 +1823,29 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
ssize_t num_written = 0;
bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host);
ssize_t err;
-   loff_t pos;
-   size_t count;
+   loff_t pos = iocb->ki_pos;
+   size_t count = iov_iter_count(from);
loff_t oldsize;
int clean_page = 0;
 
-   inode_lock(inode);
+   if ((iocb->ki_flags & IOCB_NOWAIT) &&
+   (iocb->ki_flags & IOCB_DIRECT)) {
+   /* Don't sleep on inode rwsem */
+   if (!inode_trylock(inode))
+   return -EAGAIN;
+   /*
+* We will allocate space in case nodatacow is not set,
+* so bail
+*/
+   if (!(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
+ BTRFS_INODE_PREALLOC)) ||
check_can_nocow(BTRFS_I(inode), pos, &count) <= 0) {
+   inode_unlock(inode);
+   return -EAGAIN;
+   }
+   } else
+   inode_lock(inode);
+
err = generic_write_checks(iocb, from);
if (err <= 0) {
inode_unlock(inode);
@@ -1862,8 +1879,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 */
update_time_for_write(inode);
 
-   pos = iocb->ki_pos;
-   count = iov_iter_count(from);
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
if (start_pos > oldsize) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5e71f1ea3391..47d3fcd86979 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8625,6 +8625,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct 
iov_iter *iter)
dio_data.overwrite = 1;
inode_unlock(inode);
relock = true;
+   } else if (iocb->ki_flags & IOCB_NOWAIT) {
+   ret = -EAGAIN;
+   goto out;
}
ret = btrfs_delalloc_reserve_space(inode, offset, count);
if (ret)
-- 
2.12.0



[PATCH 5/8] nowait aio: return on congested block device

2017-05-09 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

A new bio operation flag, REQ_NOWAIT, is introduced to identify bios
originating from an iocb with IOCB_NOWAIT. The flag indicates that
the submitter should get an immediate error if a request cannot be
made, instead of retrying.

Stacked devices such as md (the ones with make_request_fn hooks)
are currently not supported because they may block for housekeeping;
for example, part of an md device may be suspended.
For this reason, only request-based devices are supported.
In the future, this feature will be expanded to stacked devices
by teaching them how to handle the REQ_NOWAIT flags.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 block/blk-core.c  | 24 ++--
 block/blk-mq-sched.c  |  3 +++
 block/blk-mq.c|  4 
 fs/direct-io.c| 10 --
 include/linux/bio.h   |  6 ++
 include/linux/blk_types.h |  2 ++
 6 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d772c221cc17..effe934b806b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1232,6 +1232,11 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
if (!IS_ERR(rq))
return rq;
 
+   if (op & REQ_NOWAIT) {
+   blk_put_rl(rl);
+   return ERR_PTR(-EAGAIN);
+   }
+
if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) 
{
blk_put_rl(rl);
return rq;
@@ -1870,6 +1875,17 @@ generic_make_request_checks(struct bio *bio)
goto end_io;
}
 
+   /*
+* For a REQ_NOWAIT based request, return -EOPNOTSUPP
+* if queue does not have QUEUE_FLAG_NOWAIT_SUPPORT set
+* and if it is not a request based queue.
+*/
+
+   if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q)) {
+   err = -EOPNOTSUPP;
+   goto end_io;
+   }
+
part = bio->bi_bdev->bd_part;
if (should_fail_request(part, bio->bi_iter.bi_size) ||
should_fail_request(&part_to_disk(part)->part0,
@@ -2021,7 +2037,7 @@ blk_qc_t generic_make_request(struct bio *bio)
do {
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-   if (likely(blk_queue_enter(q, false) == 0)) {
+   if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) {
struct bio_list lower, same;
 
/* Create a fresh bio_list for all subordinate requests 
*/
@@ -2046,7 +2062,11 @@ blk_qc_t generic_make_request(struct bio *bio)
bio_list_merge(&bio_list_on_stack[0], &same);
bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
} else {
-   bio_io_error(bio);
+   if (unlikely(!blk_queue_dying(q) &&
+   (bio->bi_opf & REQ_NOWAIT)))
+   bio_wouldblock_error(bio);
+   else
+   bio_io_error(bio);
}
bio = bio_list_pop(_list_on_stack[0]);
} while (bio);
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index c974a1bbf4cb..019d881d62b7 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -119,6 +119,9 @@ struct request *blk_mq_sched_get_request(struct 
request_queue *q,
if (likely(!data->hctx))
data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
 
+   if (op & REQ_NOWAIT)
+   data->flags |= BLK_MQ_REQ_NOWAIT;
+
if (e) {
data->flags |= BLK_MQ_REQ_INTERNAL;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c7836a1ded97..d7613ae6a269 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1538,6 +1538,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue 
*q, struct bio *bio)
rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
if (unlikely(!rq)) {
__wbt_done(q->rq_wb, wb_acct);
+   if (bio->bi_opf & REQ_NOWAIT)
+   bio_wouldblock_error(bio);
return BLK_QC_T_NONE;
}
 
@@ -1662,6 +1664,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue 
*q, struct bio *bio)
rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
if (unlikely(!rq)) {
__wbt_done(q->rq_wb, wb_acct);
+   if (bio->bi_opf & REQ_NOWAIT)
+   bio_wouldblock_error(bio);
return BLK_QC_T_NONE;
}
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea77de8..139ebd5ae1c7 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -480,8 +480,12 @@ static int dio_bio_complete(struct dio *dio, struct bio 
*bio)
unsigned i;
int err;
 
-   if (bio->bi_error)

[PATCH 6/8] nowait aio: ext4

2017-05-09 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Return EAGAIN for direct I/O if any of the following would block:
  + i_rwsem is not immediately lockable
  + Writing beyond end of file (will trigger allocation)
  + Blocks are not allocated at the write location

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/ext4/file.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index cefa9835f275..2efdc6d4d3e8 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -216,7 +216,13 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
return ext4_dax_write_iter(iocb, from);
 #endif
 
-   inode_lock(inode);
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   if (!inode_trylock(inode))
+   return -EAGAIN;
+   } else {
+   inode_lock(inode);
+   }
+
ret = ext4_write_checks(iocb, from);
if (ret <= 0)
goto out;
@@ -235,9 +241,15 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
 
iocb->private = &overwrite;
/* Check whether we do a DIO overwrite or not */
-   if (o_direct && ext4_should_dioread_nolock(inode) && !unaligned_aio &&
-   ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from)))
-   overwrite = 1;
+   if (o_direct && !unaligned_aio) {
+   if (ext4_overwrite_io(inode, iocb->ki_pos, 
iov_iter_count(from))) {
+   if (ext4_should_dioread_nolock(inode))
+   overwrite = 1;
+   } else if (iocb->ki_flags & IOCB_NOWAIT) {
+   ret = -EAGAIN;
+   goto out;
+   }
+   }
 
ret = __generic_file_write_iter(iocb, from);
inode_unlock(inode);
-- 
2.12.0



[PATCH 1/8] Use RWF_* flags for AIO operations

2017-05-09 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

RWF_* flags are used for preadv2/pwritev2 calls. Port aio
operations to use them as well. For this, aio_rw_flags is
introduced in struct iocb (using aio_reserved1) which will
carry these flags.

This is a precursor to the nowait AIO calls.

Note, the only place RWF_HIPRI takes effect is dio_await_one().
Everywhere else, the aio code returns -EIOCBQUEUED before the
RWF_HIPRI checks are reached.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/aio.c |  8 +++-
 fs/read_write.c  | 12 +++-
 include/linux/fs.h   | 14 ++
 include/uapi/linux/aio_abi.h |  2 +-
 4 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index f52d925ee259..020fa0045e3c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1541,7 +1541,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
ssize_t ret;
 
/* enforce forwards compatibility on users */
-   if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
+   if (unlikely(iocb->aio_reserved2)) {
pr_debug("EINVAL: reserve field set\n");
return -EINVAL;
}
@@ -1586,6 +1586,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
req->common.ki_flags |= IOCB_EVENTFD;
}
 
+   ret = kiocb_set_rw_flags(&req->common, iocb->aio_rw_flags);
+   if (unlikely(ret)) {
+   pr_debug("EINVAL: aio_rw_flags\n");
+   goto out_put_req;
+   }
+
ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
if (unlikely(ret)) {
pr_debug("EFAULT: aio_key\n");
diff --git a/fs/read_write.c b/fs/read_write.c
index c4f88afbc67f..362f91cd8d66 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -678,16 +678,10 @@ static ssize_t do_iter_readv_writev(struct file *filp, 
struct iov_iter *iter,
struct kiocb kiocb;
ssize_t ret;
 
-   if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
-   return -EOPNOTSUPP;
-
init_sync_kiocb(&kiocb, filp);
-   if (flags & RWF_HIPRI)
-   kiocb.ki_flags |= IOCB_HIPRI;
-   if (flags & RWF_DSYNC)
-   kiocb.ki_flags |= IOCB_DSYNC;
-   if (flags & RWF_SYNC)
-   kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+   ret = kiocb_set_rw_flags(&kiocb, flags);
+   if (ret)
+   return ret;
kiocb.ki_pos = *ppos;
 
if (type == READ)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7251f7bb45e8..869c9a6fe58d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3049,6 +3049,20 @@ static inline int iocb_flags(struct file *file)
return res;
 }
 
+static inline int kiocb_set_rw_flags(struct kiocb *ki, int flags)
+{
+   if (unlikely(flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC)))
+   return -EOPNOTSUPP;
+
+   if (flags & RWF_HIPRI)
+   ki->ki_flags |= IOCB_HIPRI;
+   if (flags & RWF_DSYNC)
+   ki->ki_flags |= IOCB_DSYNC;
+   if (flags & RWF_SYNC)
+   ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+   return 0;
+}
+
 static inline ino_t parent_ino(struct dentry *dentry)
 {
ino_t res;
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index bb2554f7fbd1..a2d4a8ac94ca 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -79,7 +79,7 @@ struct io_event {
 struct iocb {
/* these are internal to the kernel/libc. */
__u64   aio_data;   /* data to be returned in event's data */
-   __u32   PADDED(aio_key, aio_reserved1);
+   __u32   PADDED(aio_key, aio_rw_flags);
/* the kernel sets aio_key to the req # */
 
/* common fields */
-- 
2.12.0



[PATCH 4/8] nowait-aio: Introduce IOMAP_NOWAIT

2017-05-09 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

IOCB_NOWAIT translates to IOMAP_NOWAIT for iomaps.
This is used by XFS in the XFS patch.

Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
---
 fs/iomap.c| 2 ++
 include/linux/iomap.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index 141c3cd55a8b..d1c81753d411 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -885,6 +885,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
} else {
dio->flags |= IOMAP_DIO_WRITE;
flags |= IOMAP_WRITE;
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   flags |= IOMAP_NOWAIT;
}
 
if (mapping->nrpages) {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 7291810067eb..53f6af89c625 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -51,6 +51,7 @@ struct iomap {
 #define IOMAP_REPORT   (1 << 2) /* report extent status, e.g. FIEMAP */
 #define IOMAP_FAULT(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT   (1 << 4) /* direct I/O */
+#define IOMAP_NOWAIT   (1 << 5) /* Don't wait for writeback */
 
 struct iomap_ops {
/*
-- 
2.12.0



Re: [PATCH 5/8] nowait aio: return on congested block device

2017-04-24 Thread Goldwyn Rodrigues


On 04/19/2017 01:45 AM, Christoph Hellwig wrote:
> On Fri, Apr 14, 2017 at 07:02:54AM -0500, Goldwyn Rodrigues wrote:
>> From: Goldwyn Rodrigues <rgold...@suse.com>
>>
> 
>> +/* Request queue supports BIO_NOWAIT */
>> +queue_flag_set_unlocked(QUEUE_FLAG_NOWAIT, q);
> 
> BIO_NOWAIT is gone.  And the comment would not be needed if the
> flag had a more descriptive name, e.g. QUEUE_FLAG_NOWAIT_SUPPORT.
> 
> And I think all request based drivers should set the flag implicitly
> as ->queuecommand can't sleep, and ->queue_rq only when it's always
> offloaded to a workqueue when the BLK_MQ_F_BLOCKING flag is set.
> 

We introduced QUEUE_FLAG_NOWAIT for devices which would not wait for
request completions. The ones which wait are MD devices because of sync
or suspend operations.

The only user of BLK_MQ_F_NONBLOCKING seems to be nbd. As you mentioned,
it uses the flag to offload it to a workqueue.

The other way to do it implicitly is to change the flag to
BLK_MAY_BLOCK_REQS and use it for devices which do wait such as md/dm.
Is that what you are hinting at? Or do you have something else in mind?


-- 
Goldwyn

