Re: [PATCH v2 00/10] fsdax,xfs: Add reflink support for fsdax

2021-03-10 Thread Goldwyn Rodrigues
On 14:26 10/03, Matthew Wilcox wrote:
> On Wed, Mar 10, 2021 at 08:21:59AM -0600, Goldwyn Rodrigues wrote:
> > On 13:02 10/03, Matthew Wilcox wrote:
> > > On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> > > > Forgive my ignorance, but is there a reason why this isn't wired up to
> > > > Btrfs at the same time? It seems weird to me that adding a feature
> > > 
> > > btrfs doesn't support DAX.  only ext2, ext4, XFS and FUSE have DAX 
> > > support.
> > > 
> > > If you think about it, btrfs and DAX are diametrically opposite things.
> > > DAX is about giving raw access to the hardware.  btrfs is about offering
> > > extra value (RAID, checksums, ...), none of which can be done if the
> > > filesystem isn't in the read/write path.
> > > 
> > > That's why there's no DAX support in btrfs.  If you want DAX, you have
> > > to give up all the features you like in btrfs.  So you may as well use
> > > a different filesystem.
> > 
> > DAX on btrfs has been attempted[1]. Of course, we could not
> 
> But why?  A completeness fetish?  I don't understand why you decided
> to do this work.

If only I had a penny every time I heard "why would you want to do that?"

> 
> > have checksums or multi-device with it. However, we got stuck on
> > associating a shared extent with the same page mapping: basically the
> > TODO above dax_associate_entry().
> > 
> > Shiyang has proposed a way to disassociate the existing mapping, but I
> > don't think that is the best solution. DAX for CoW will not work until
> > we have a way of mapping a page to multiple inodes (page->mapping),
> > which will convert a 1-N inode-page mapping to an M-N inode-page mapping.
> 
> If you're still thinking in terms of pages, you're doing DAX wrong.
> DAX should work without a struct page.

Not pages specifically, but mappings.
fsdax needs the mapping during a page fault, and this breaks when both
files fault on the same shared extent.

For reference, see WARN_ON_ONCE(page->mapping && page->mapping != mapping)
in dax_disassociate_entry().

-- 
Goldwyn


Re: [PATCH v2 00/10] fsdax,xfs: Add reflink support for fsdax

2021-03-10 Thread Goldwyn Rodrigues
On 13:02 10/03, Matthew Wilcox wrote:
> On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> > Forgive my ignorance, but is there a reason why this isn't wired up to
> > Btrfs at the same time? It seems weird to me that adding a feature
> 
> btrfs doesn't support DAX.  only ext2, ext4, XFS and FUSE have DAX support.
> 
> If you think about it, btrfs and DAX are diametrically opposite things.
> DAX is about giving raw access to the hardware.  btrfs is about offering
> extra value (RAID, checksums, ...), none of which can be done if the
> filesystem isn't in the read/write path.
> 
> That's why there's no DAX support in btrfs.  If you want DAX, you have
> to give up all the features you like in btrfs.  So you may as well use
> a different filesystem.

DAX on btrfs has been attempted[1]. Of course, we could not
have checksums or multi-device with it. However, we got stuck on
associating a shared extent with the same page mapping: basically the
TODO above dax_associate_entry().

Shiyang has proposed a way to disassociate the existing mapping, but I
don't think that is the best solution. DAX for CoW will not work until
we have a way of mapping a page to multiple inodes (page->mapping),
which will convert a 1-N inode-page mapping to an M-N inode-page mapping.

[1] https://lore.kernel.org/linux-btrfs/20190429172649.8288-1-rgold...@suse.de/

-- 
Goldwyn


Re: [PATCH v2 00/10] fsdax,xfs: Add reflink support for fsdax

2021-03-09 Thread Goldwyn Rodrigues
Hi Shiyang,

Thanks for picking up this work.

On  8:20 26/02, Shiyang Ruan wrote:
> This patchset is an attempt to add CoW support for fsdax, taking XFS,
> which has both the reflink and fsdax features, as an example.

How does this work for the read path when two different files are
mapped to the same extent and both reside in DAX?

If two different files read the same shared extent, which file's inode
would the resulting page->mapping->host point to?

This problem is listed as a TODO over dax_associate_entry() and is
still not fixed.



-- 
Goldwyn


Re: exfatprogs-1.0.3 version released

2020-07-02 Thread Goldwyn Rodrigues
On  9:29 02/07, Sedat Dilek wrote:
> On Thu, Jul 2, 2020 at 6:57 AM Hyunchul Lee  wrote:
> As said, I contacted the Debian maintainer via PM and he is thinking of
> taking over maintenance of exfatprogs.
> But he has not made a final decision yet.
> 
> You happen to know what other Linux distributions do in this topic?

SUSE has incorporated and backported exfat into Leap 15.2 and SLES 15 SP2.


-- 
Goldwyn


Re: [RFC PATCH 0/7] xfs: add reflink & dedupe support for fsdax.

2019-07-31 Thread Goldwyn Rodrigues
On 19:49 31/07, Shiyang Ruan wrote:
> This patchset aims to take care of this issue to make reflink and dedupe
> work correctly in XFS.
> 
> It is based on Goldwyn's patchsets: "v4 Btrfs dax support" and "Btrfs
> iomap".  I picked up some related patches and made a few fixes to make it
> basically work fine.
> 
> For dax framework: 
>   1. adapt to the latest change in iomap.
> 
> For XFS:
>   1. report the source address and set IOMAP_COW type for those write
>  operations that need COW.
>   2. update extent list at the end.
>   3. add file contents comparison function based on dax framework.
>   4. use xfs_break_layouts() to support dax.

Shiyang,

I think you used the older patches, which do not contain the iomap changes
that call iomap_begin() with two iomaps. I have those in the btrfs-iomap
branch and plan to update it today. It is built on v5.3-rcX, so it should
contain the changes that move the iomap code to a different directory.
I will build the dax patches on top of that.
However, we are making a big dependency chain here :(

-- 
Goldwyn

> 
> 
> Goldwyn Rodrigues (3):
>   dax: replace mmap entry in case of CoW
>   fs: dedup file range to use a compare function
>   dax: memcpy before zeroing range
> 
> Shiyang Ruan (4):
>   dax: Introduce dax_copy_edges() for COW.
>   dax: copy data before write.
>   xfs: Add COW handle for fsdax.
>   xfs: Add dedupe support for fsdax.
> 
>  fs/btrfs/ioctl.c  |  11 ++-
>  fs/dax.c  | 203 ++
>  fs/iomap.c|   9 +-
>  fs/ocfs2/file.c   |   2 +-
>  fs/read_write.c   |  11 +--
>  fs/xfs/xfs_iomap.c|  42 +
>  fs/xfs/xfs_reflink.c  |  84 +
>  include/linux/dax.h   |  15 ++--
>  include/linux/fs.h|   8 +-
>  include/linux/iomap.h |   6 ++
>  10 files changed, 294 insertions(+), 97 deletions(-)
> 
> -- 
> 2.17.0
> 
> 
> 

-- 
Goldwyn


Re: [Lsf-pc] [LSF/MM TOPIC] The end of the DAX experiment

2019-06-05 Thread Goldwyn Rodrigues
Hi Dan/Jerome,

On 12:20 14/02, Dan Williams wrote:
> On Thu, Feb 14, 2019 at 12:09 PM Matthew Wilcox  wrote:
> >
> > On Thu, Feb 14, 2019 at 11:31:24AM -0800, Dan Williams wrote:
> > > On Thu, Feb 14, 2019 at 11:10 AM Jerome Glisse  wrote:
> > > > I am just again working on my struct page mapping patchset as well as
> > > > the generic page write protection that sits on top. I hope to be able
> > > > to post the v2 in couple weeks. You can always look at my posting last
> > > > year to see more details.
> > >
> > > Yes, I have that in mind as one of the contenders. However, it's not
> > > clear to me that it's a suitable fit for filesystem-reflink. Others
> > > have floated the 'page proxy' idea, so it would be good to discuss the
> > > merits of the general approaches.
> >
> > ... and my preferred option of putting pfn entries in the page cache.
> 
> Another option to include the discussion.
> 
> > Or is that what you meant by "page proxy"?
> 
> Page proxy would be an object that a filesystem could allocate to
> point back to a single physical 'struct page *'. The proxy would
> contain an override for page->index.

Was there any outcome from this discussion, or an implementation? I am
specifically interested in this for DAX support on btrfs/CoW: the TODO
comment on top of dax_associate_entry() :)

If there is a patch series or git tree I could base my patches on, that
would be nice.

-- 
Goldwyn


Re: [PATCH v2] ima: define ima_post_create_tmpfile() hook and add missing call

2019-01-22 Thread Goldwyn Rodrigues
On 10:43 22/01, Mimi Zohar wrote:
> On Mon, 2019-01-21 at 14:29 +0200, Amir Goldstein wrote:
> > On Mon, Jan 21, 2019 at 2:00 PM Mimi Zohar  wrote:
> > >
> > > On Thu, 2019-01-17 at 15:34 -0600, Goldwyn Rodrigues wrote:
> > > > On 13:47 18/12, Mimi Zohar wrote:
> > > > > If tmpfiles can be made persistent, then newly created tmpfiles need
> > > > > to be treated like any other new files in policy.
> > > > >
> > > > > This patch indicates which newly created tmpfiles are in policy,
> > > > > causing the file hash to be calculated on __fput().
> > > >
> > > > As discussed for overlayfs, this would be better done on the inode
> > > > and called from vfs_tmpfile() instead of do_tmpfile(). This will cover
> > > > the overlayfs case, which uses tmpfiles while performing copy_up().
> > > > The patch is attached.
> > > >
> > > > Here is the updated patch, which works for my cases.
> > > > However, it is the failure case after setting the IMA flags that
> > > > I am concerned about, though I don't think that should be too harmful.
> > >
> > > Right.  The new IMA hook allocates memory for storing the flags, which
> > > needs to be cleaned up on failure.  For this reason, the IMA call is
> > > deferred until after the transition from locally freeing memory on
> > > failure to relying on __fput().  In "do_last", the call to IMA is
> > > after "opened"; and in the original version of this patch the call to
> > > IMA is after finish_open().
> > >
> > 
> > Not sure I understand the concern.
> > The integrity context is associated with the inode and will be freed
> > on destroy_inode() no matter which error path is taken.
> > Am I missing something?
> 
> No, as long as destroy_inode() is called, it should be fine.
> 

Excellent. I will resend the patch as v3.

Thanks!

-- 
Goldwyn


Re: [PATCH v2] ima: define ima_post_create_tmpfile() hook and add missing call

2019-01-17 Thread Goldwyn Rodrigues
On 13:47 18/12, Mimi Zohar wrote:
> If tmpfiles can be made persistent, then newly created tmpfiles need to
> be treated like any other new files in policy.
> 
> This patch indicates which newly created tmpfiles are in policy, causing
> the file hash to be calculated on __fput().

As discussed for overlayfs, this would be better done on the inode
and called from vfs_tmpfile() instead of do_tmpfile(). This will cover
the overlayfs case, which uses tmpfiles while performing copy_up().
The patch is attached.

Here is the updated patch, which works for my cases.
However, it is the failure case after setting the IMA flags that
I am concerned about, though I don't think that should be too harmful.


If tmpfiles can be made persistent, then newly created tmpfiles need to
be treated like any other new files in policy.

This patch indicates which newly created tmpfiles are in policy, causing
the file hash to be calculated on __fput().

Reported-by: Ignaz Forster 
Signed-off-by: Mimi Zohar 
Signed-off-by: Goldwyn Rodrigues 
---
 fs/namei.c|  1 +
 include/linux/ima.h   |  6 ++
 security/integrity/ima/ima_main.c | 35 +--
 3 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 914178cdbe94..373a7ec4b09d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3462,6 +3462,7 @@ struct dentry *vfs_tmpfile(struct dentry *dentry, umode_t mode, int open_flag)
inode->i_state |= I_LINKABLE;
		spin_unlock(&inode->i_lock);
}
+   ima_post_create_tmpfile(inode);
return child;
 
 out_err:
diff --git a/include/linux/ima.h b/include/linux/ima.h
index b5e16b8c50b7..32b0c5bdcd99 100644
--- a/include/linux/ima.h
+++ b/include/linux/ima.h
@@ -18,6 +18,7 @@ struct linux_binprm;
 #ifdef CONFIG_IMA
 extern int ima_bprm_check(struct linux_binprm *bprm);
 extern int ima_file_check(struct file *file, int mask);
+extern void ima_post_create_tmpfile(struct inode *inode);
 extern void ima_file_free(struct file *file);
 extern int ima_file_mmap(struct file *file, unsigned long prot);
 extern int ima_load_data(enum kernel_load_data_id id);
@@ -56,6 +57,11 @@ static inline int ima_file_check(struct file *file, int mask)
return 0;
 }
 
+static inline void ima_post_create_tmpfile(struct inode *inode)
+{
+   return;
+}
+
 static inline void ima_file_free(struct file *file)
 {
return;
diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index 4ffac4f5c647..357edd140c09 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -396,6 +396,33 @@ int ima_file_check(struct file *file, int mask)
 }
 EXPORT_SYMBOL_GPL(ima_file_check);
 
+/**
+ * ima_post_create_tmpfile - mark newly created tmpfile as new
+ * @inode: newly created tmpfile inode
+ *
+ * No measuring, appraising or auditing of newly created tmpfiles is needed.
+ * Skip calling process_measurement(), but indicate which newly created
+ * tmpfiles are in policy.
+ */
+void ima_post_create_tmpfile(struct inode *inode)
+{
+   struct integrity_iint_cache *iint;
+   int must_appraise;
+
+   must_appraise = ima_must_appraise(inode, MAY_ACCESS, FILE_CHECK);
+   if (!must_appraise)
+   return;
+
+   /* Nothing to do if we can't allocate memory */
+   iint = integrity_inode_get(inode);
+   if (!iint)
+   return;
+
+   /* needed for writing the security xattrs */
+	set_bit(IMA_UPDATE_XATTR, &iint->atomic_flags);
+   iint->ima_file_status = INTEGRITY_PASS;
+}
+
 /**
  * ima_post_path_mknod - mark as a new inode
  * @dentry: newly created dentry
@@ -413,9 +440,13 @@ void ima_post_path_mknod(struct dentry *dentry)
if (!must_appraise)
return;
 
+   /* Nothing to do if we can't allocate memory */
iint = integrity_inode_get(inode);
-   if (iint)
-   iint->flags |= IMA_NEW_FILE;
+   if (!iint)
+   return;
+
+   /* needed for re-opening empty files */
+   iint->flags |= IMA_NEW_FILE;
 }
 
 /**
-- 
2.16.4


Re: Spurious EIO on AIO+DIO+RWF_NOWAIT

2018-12-12 Thread Goldwyn Rodrigues
On 14:05 12/12, Avi Kivity wrote:
> On 12/10/18 2:48 PM, Goldwyn Rodrigues wrote:
> > On 13:19 09/12, Avi Kivity wrote:
> > > I have an application that receives spurious EIO when running with
> > > RWF_NOWAIT enabled. Removing RWF_NOWAIT causes those EIOs to disappear.
> > > The application uses AIO+DIO, and errors were seen on both xfs and ext4.
> > > 
> > > 
> > > I suspect the following code:
> > > 
> > > 
> > > /*
> > >   * Process one completed BIO.  No locks are held.
> > >   */
> > > static blk_status_t dio_bio_complete(struct dio *dio, struct bio *bio)
> > > {
> > >      struct bio_vec *bvec;
> > >      unsigned i;
> > >      blk_status_t err = bio->bi_status;
> > > 
> > >      if (err) {
> > >      if (err == BLK_STS_AGAIN && (bio->bi_opf & REQ_NOWAIT))
> > >      dio->io_error = -EAGAIN;
> > >      else
> > >      dio->io_error = -EIO;
> > >      }
> > > 
> > > Could it be that REQ_NOWAIT was dropped from bio->bi_opf? or that
> > > bio->bi_status got changed along the way?
> > > 
> > I don't think REQ_NOWAIT is dropped. I am assuming bio->bi_status error
> > is set differently. Is the blk queue being stopped? Is it possible to
> > instrument the kernel in your testcase?
> > 
> 
> I traced the function, and I see bio->bi_status == BLK_STS_NOTSUPP and
> bio->bi_opf == REQ_OP_WRITE|REQ_SYNC|REQ_NOMERGE|REQ_FUA|REQ_NOWAIT.
> Presumably the NOTSUPP is the result of NOWAIT not being supported down the
> stack, but shouldn't it be detected earlier? And not converted to EIO?
> 

I don't think there is a way to detect it earlier. However, I think we should
return -EOPNOTSUPP if the lower layers do not support REQ_NOWAIT. I will write
a patch to modify this.

-- 
Goldwyn


Re: Spurious EIO on AIO+DIO+RWF_NOWAIT

2018-12-10 Thread Goldwyn Rodrigues
On 13:19 09/12, Avi Kivity wrote:
> I have an application that receives spurious EIO when running with
> RWF_NOWAIT enabled. Removing RWF_NOWAIT causes those EIOs to disappear. The
> application uses AIO+DIO, and errors were seen on both xfs and ext4.
> 
> 
> I suspect the following code:
> 
> 
> /*
>  * Process one completed BIO.  No locks are held.
>  */
> static blk_status_t dio_bio_complete(struct dio *dio, struct bio *bio)
> {
>     struct bio_vec *bvec;
>     unsigned i;
>     blk_status_t err = bio->bi_status;
> 
>     if (err) {
>     if (err == BLK_STS_AGAIN && (bio->bi_opf & REQ_NOWAIT))
>     dio->io_error = -EAGAIN;
>     else
>     dio->io_error = -EIO;
>     }
> 
> Could it be that REQ_NOWAIT was dropped from bio->bi_opf? or that
> bio->bi_status got changed along the way?
> 

I don't think REQ_NOWAIT is dropped. I am assuming bio->bi_status error
is set differently. Is the blk queue being stopped? Is it possible to
instrument the kernel in your testcase?

-- 
Goldwyn


Re: possible deadlock in mnt_want_write

2018-11-21 Thread Goldwyn Rodrigues
On 20:57 21/11, Amir Goldstein wrote:
> On Wed, Nov 21, 2018 at 8:33 PM syzbot
>  wrote:
> >
> > syzbot has found a reproducer for the following crash on:
> >
> > HEAD commit:442b8cea2477 Add linux-next specific files for 20181109
> > git tree:   linux-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=11a1426d40
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=2f72bdb11df9fbe8
> > dashboard link: https://syzkaller.appspot.com/bug?extid=ae82084b07d0297e566b
> > compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1632326d40
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=17a16ed540
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+ae82084b07d0297e5...@syzkaller.appspotmail.com
> >
> > IPv6: ADDRCONF(NETDEV_CHANGE): veth1: link becomes ready
> > IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
> > 8021q: adding VLAN 0 to HW filter on device team0
> >
> > ==
> > WARNING: possible circular locking dependency detected
> > 4.20.0-rc1-next-20181109+ #110 Not tainted
> > --
> > syz-executor599/5968 is trying to acquire lock:
> > e42cbf00 (sb_writers#3){.+.+}, at: sb_start_write
> > include/linux/fs.h:1607 [inline]
> > e42cbf00 (sb_writers#3){.+.+}, at: mnt_want_write+0x3f/0xc0
> > fs/namespace.c:359
> >
> > but task is already holding lock:
> > 166f985a (&iint->mutex){+.+.}, at: process_measurement+0x438/0x1bf0
> > security/integrity/ima/ima_main.c:224
> >
> > which lock already depends on the new lock.
> >
> >
> > the existing dependency chain (in reverse order) is:
> >
> > -> #1 (&iint->mutex){+.+.}:
> > __mutex_lock_common kernel/locking/mutex.c:925 [inline]
> > __mutex_lock+0x166/0x16f0 kernel/locking/mutex.c:1072
> > mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1087
> > process_measurement+0x438/0x1bf0
> > security/integrity/ima/ima_main.c:224
> > ima_file_check+0xe5/0x130 security/integrity/ima/ima_main.c:391
> > do_last fs/namei.c:3422 [inline]
> > path_openat+0x134a/0x5150 fs/namei.c:3534
> > do_filp_open+0x255/0x380 fs/namei.c:3564
> > do_sys_open+0x568/0x700 fs/open.c:1063
> > __do_sys_open fs/open.c:1081 [inline]
> > __se_sys_open fs/open.c:1076 [inline]
> > __x64_sys_open+0x7e/0xc0 fs/open.c:1076
> > do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
> > entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> > -> #0 (sb_writers#3){.+.+}:
> > lock_acquire+0x1ed/0x520 kernel/locking/lockdep.c:3844
> > percpu_down_read_preempt_disable include/linux/percpu-rwsem.h:36
> > [inline]
> > percpu_down_read include/linux/percpu-rwsem.h:59 [inline]
> > __sb_start_write+0x214/0x370 fs/super.c:1564
> > sb_start_write include/linux/fs.h:1607 [inline]
> > mnt_want_write+0x3f/0xc0 fs/namespace.c:359
> > ovl_want_write+0x76/0xa0 fs/overlayfs/util.c:24
> > ovl_open_maybe_copy_up+0x12c/0x190 fs/overlayfs/copy_up.c:888
> > ovl_open+0xb3/0x260 fs/overlayfs/file.c:123
> > do_dentry_open+0x499/0x1250 fs/open.c:771
> > vfs_open fs/open.c:880 [inline]
> > dentry_open+0x143/0x1d0 fs/open.c:896
> > ima_calc_file_hash+0x324/0x570
> 
> I suppose ima_calc_file_hash opens the file with write flags
> and causes overlay to try to copy up, which takes mnt_want_write().
> Why does IMA need to open the file with write flags?
> 
> Isn't this commit supposed to prevent that:
> a408e4a86b36 ima: open a new file instance if no read permissions
> 

Not write flags, but read flags. That patch re-opens files in O_RDONLY when
they were opened O_WRONLY and cannot be read, so that the hash can be
calculated. IOW, the user opened the file in overlayfs with write flags.

The copy-up, since it is already in write mode, can add the security.ima
xattrs.

-- 
Goldwyn




Re: [PATCH 26/32] afs: Use fs_context to pass parameters over automount [ver #8]

2018-06-06 Thread Goldwyn Rodrigues



On 05/24/2018 07:08 PM, David Howells wrote:
> Alter the AFS automounting code to create and modify an fs_context struct
> when parameterising a new mount triggered by an AFS mountpoint rather than
> constructing device name and option strings.
> 
> Also remove the cell=, vol= and rwpath options as they are then redundant.
> The reason they existed is because the 'device name' may be derived
> literally from a mountpoint object in the filesystem, so default cell and
> parent-type information needed to be passed in by some other method from
> the automount routines.  The vol= option didn't end up being used.
> 
> Signed-off-by: David Howells 
> cc: Eric W. Biederman 
> ---
> 
>  fs/afs/internal.h |1 
>  fs/afs/mntpt.c|  148 +++--
>  fs/afs/super.c|   43 +--
>  3 files changed, 79 insertions(+), 113 deletions(-)
> 
> diff --git a/fs/afs/internal.h b/fs/afs/internal.h
> index eb6e75e00181..90af5001f8c8 100644
> --- a/fs/afs/internal.h
> +++ b/fs/afs/internal.h
> @@ -35,7 +35,6 @@ struct pagevec;
>  struct afs_call;
>  
>  struct afs_fs_context {
> -	bool		rwpath;		/* T if the parent should be considered R/W */
> 	bool		force;		/* T to force cell type */
> 	bool		autocell;	/* T if set auto mount operation */
> 	bool		dyn_root;	/* T if dynamic root */
> diff --git a/fs/afs/mntpt.c b/fs/afs/mntpt.c
> index c45aa1776591..fc383d727552 100644
> --- a/fs/afs/mntpt.c
> +++ b/fs/afs/mntpt.c
> @@ -47,6 +47,8 @@ static DECLARE_DELAYED_WORK(afs_mntpt_expiry_timer, afs_mntpt_expiry_timed_out);
>  
>  static unsigned long afs_mntpt_expiry_timeout = 10 * 60;
>  
> +static const char afs_root_volume[] = "root.cell";
> +
>  /*
>   * no valid lookup procedure on this sort of dir
>   */
> @@ -68,107 +70,107 @@ static int afs_mntpt_open(struct inode *inode, struct file *file)
>  }
>  
>  /*
> - * create a vfsmount to be automounted
> + * Set the parameters for the proposed superblock.
>   */
> -static struct vfsmount *afs_mntpt_do_automount(struct dentry *mntpt)
> +static int afs_mntpt_set_params(struct fs_context *fc, struct dentry *mntpt)
>  {
> - struct afs_super_info *as;
> - struct vfsmount *mnt;
> - struct afs_vnode *vnode;
> - struct page *page;
> - char *devname, *options;
> - bool rwpath = false;
> + struct afs_fs_context *ctx = fc->fs_private;
> + struct afs_vnode *vnode = AFS_FS_I(d_inode(mntpt));
> + struct afs_cell *cell;
> + const char *p;
>   int ret;
>  
> - _enter("{%pd}", mntpt);
> -
> - BUG_ON(!d_inode(mntpt));
> -
> - ret = -ENOMEM;
> - devname = (char *) get_zeroed_page(GFP_KERNEL);
> - if (!devname)
> - goto error_no_devname;
> -
> - options = (char *) get_zeroed_page(GFP_KERNEL);
> - if (!options)
> - goto error_no_options;
> -
> - vnode = AFS_FS_I(d_inode(mntpt));
> 	if (test_bit(AFS_VNODE_PSEUDODIR, &vnode->flags)) {
>   /* if the directory is a pseudo directory, use the d_name */
> - static const char afs_root_cell[] = ":root.cell.";
>   unsigned size = mntpt->d_name.len;
>  
> - ret = -ENOENT;
> - if (size < 2 || size > AFS_MAXCELLNAME)
> - goto error_no_page;
> + if (size < 2)
> + return -ENOENT;
>  
> + p = mntpt->d_name.name;
>   if (mntpt->d_name.name[0] == '.') {
> - devname[0] = '%';
> - memcpy(devname + 1, mntpt->d_name.name + 1, size - 1);
> - memcpy(devname + size, afs_root_cell,
> -sizeof(afs_root_cell));
> - rwpath = true;
> - } else {
> - devname[0] = '#';
> - memcpy(devname + 1, mntpt->d_name.name, size);
> - memcpy(devname + size + 1, afs_root_cell,
> -sizeof(afs_root_cell));
> + size--;
> + p++;
> + ctx->type = AFSVL_RWVOL;
> + ctx->force = true;
> + }
> + if (size > AFS_MAXCELLNAME)
> + return -ENAMETOOLONG;
> +
> + cell = afs_lookup_cell(ctx->net, p, size, NULL, false);
> + if (IS_ERR(cell)) {
> + pr_err("kAFS: unable to lookup cell '%pd'\n", mntpt);
> + return PTR_ERR(cell);
>   }
> + afs_put_cell(ctx->net, ctx->cell);
> + ctx->cell = cell;
> +
> + ctx->volname = afs_root_volume;
> + ctx->volnamesz = sizeof(afs_root_volume) - 1;
>   } else {
>   /* read the contents of the AFS special symlink */
> + struct page *page;
>   loff_t size = i_size_read(d_inode(mntpt));

Re: [PATCH v6 2/5] fs: Convert kiocb rw_hint from enum to u16

2018-05-22 Thread Goldwyn Rodrigues


On 05/22/2018 10:32 AM, Jens Axboe wrote:
> On 5/22/18 9:07 AM, adam.manzana...@wdc.com wrote:
>> From: Adam Manzanares 
>>
>> In order to avoid kiocb bloat for per command iopriority support, rw_hint
>> is converted from enum to a u16. Added a guard around ki_hint assignment.
>>
>> Signed-off-by: Adam Manzanares 
>> Reviewed-by: Christoph Hellwig 
>> ---
>>  include/linux/fs.h | 13 +++--
>>  1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 7f07977bdfd7..50de40dbbb85 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -284,6 +284,8 @@ enum rw_hint {
>>  WRITE_LIFE_EXTREME  = RWH_WRITE_LIFE_EXTREME,
>>  };
>>  
>> +#define MAX_KI_HINT ((1 << 16) - 1) /* ki_hint type is u16 */
> 
> Instead of having to do this and now rely on those now being synced,
> how about something like the below.
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 760d8da1b6c7..070438d0b62d 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -299,7 +299,7 @@ struct kiocb {
>   void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
>   void*private;
>   int ki_flags;
> - enum rw_hintki_hint;
> + u16 ki_hint;
>  } __randomize_layout;
>  
>  static inline bool is_sync_kiocb(struct kiocb *kiocb)
> @@ -1927,12 +1927,22 @@ static inline enum rw_hint file_write_hint(struct 
> file *file)
>  
>  static inline int iocb_flags(struct file *file);
>  
> +static inline u16 ki_hint_validate(enum rw_hint hint)
> +{
> + typeof(((struct kiocb *)0)->ki_hint) max_hint = -1;

This looks complex to me; it forces a reader to look back at what
datatype ki_hint is. I'd prefer declaring it as u16 max_hint = -1, or
even the previous #define MAX_KI_HINT format, which is easier to read.
Just a matter of which code-reading style you are comfortable with,
though.


-- 
Goldwyn




Re: [PATCH] md: fix two problems with setting the "re-add" device state.

2018-04-29 Thread Goldwyn Rodrigues


On 04/25/2018 11:46 PM, NeilBrown wrote:
> 
> If "re-add" is written to the "state" file for a device
> which is faulty, this has an effect similar to removing
> and re-adding the device.  It should take up the
> same slot in the array that it previously had, and
> an accelerated (e.g. bitmap-based) rebuild should happen.
> 
> The slot that "it previously had" is determined by
> rdev->saved_raid_disk.
> However this is not set when a device fails (only when a device
> is added), and it is cleared when resync completes.
> This means that "re-add" will normally work once, but may not work a
> second time.
> 
> This patch includes two fixes.
> 1/ when a device fails, record the ->raid_disk value in
> ->saved_raid_disk before clearing ->raid_disk
> 2/ when "re-add" is written to a device for which
> ->saved_raid_disk is not set, fail.
> 
> I think this is suitable for stable as it can
> cause re-adding a device to be forced to do a full
> resync which takes a lot longer and so puts data at
> more risk.
> 
> Cc: <sta...@vger.kernel.org> (v4.1)
> Fixes: 97f6cd39da22 ("md-cluster: re-add capabilities")
> Signed-off-by: NeilBrown <ne...@suse.com>
> ---
>  drivers/md/md.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 3bea45e8ccff..ecd4235c6e30 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -2853,7 +2853,8 @@ state_store(struct md_rdev *rdev, const char *buf, 
> size_t len)
>   err = 0;
>   }
>   } else if (cmd_match(buf, "re-add")) {
> - if (test_bit(Faulty, &rdev->flags) && (rdev->raid_disk == -1)) {
> + if (test_bit(Faulty, &rdev->flags) && (rdev->raid_disk == -1) &&
> + rdev->saved_raid_disk >= 0) {
>   /* clear_bit is performed _after_ all the devices
>* have their local Faulty bit cleared. If any writes
>* happen in the meantime in the local node, they
> @@ -8641,6 +8642,7 @@ static int remove_and_add_spares(struct mddev *mddev,
>   if (mddev->pers->hot_remove_disk(
>   mddev, rdev) == 0) {
>   sysfs_unlink_rdev(mddev, rdev);
> +         rdev->saved_raid_disk = rdev->raid_disk;
>   rdev->raid_disk = -1;
>   removed++;
>   }
> 

Performing a partial resync as opposed to full resync is always better
and less time consuming. Thanks!

Reviewed-by: Goldwyn Rodrigues <rgold...@suse.com>

-- 
Goldwyn




Re: copy_file_range and user space tools to do copy fastest

2018-04-28 Thread Goldwyn Rodrigues


On 04/28/2018 12:26 AM, Steve French wrote:
> On Sat, Apr 28, 2018 at 12:18 AM, Andreas Dilger  wrote:
>> On Apr 27, 2018, at 5:41 PM, Eric Biggers  wrote:
>>>
>>> On Fri, Apr 27, 2018 at 01:45:40PM -0600, Andreas Dilger wrote:
 On Apr 27, 2018, at 12:25 PM, Steve French  wrote:
>
> Are there any user space tools (other than our test tools and xfs_io
> etc.) that support copy_file_range?  Looks like at least cp and rsync
> and dd don't.  That syscall which now has been around a couple years,
> and was reminded about at the LSF/MM summit a few days ago, presumably
> is the 'best' way to copy a file fast since it tries all the
> mechanisms (reflink etc.) in order.
>
> Since copy_file_range syscall can be 100x or more faster for network
> file systems than the alternative, was surprised when I noticed that
> cp and rsync didn't support it.  It doesn't look like rsync even
> supports reflink either(although presumably if you call
> copy_file_range you don't have to worry about that), and reads/writes
> are 8K. See copy_file() in rsync/util.c
>
> In the cp command it looks like it can call the FICLONE IOCTL (see
> clone_file() in coreutils/src/copy.c) but doesn't call the expected
> "copy_file_range" syscall.
>
> In the dd command it doesn't call either - see dd_copy in 
> corutils/src/dd.c
>
> Since it can be 100x or more faster in some cases to call
> copy_file_range than do reads/writes back and forth to do a copy
> (especially if network or clustered backend or cloud), what tools are
> the best to recommend?
>
> Would rsync or cp be likely to take patches to call the standard
> "copy_file_range" syscall
> (http://man7.org/linux/man-pages/man2/copy_file_range.2.html)?
> Presumably not if it has been two+ years ... but would be interested
> what copy tools to recommend to use instead.

 I would start with submitting a patch to coreutils, if you can figure
 out that code enough to do so (I find it quite opaque).  Since it has
 been in the kernel for a while already, it should be acceptable to the
 upstream coreutils maintainers to use this interface.  Doubly so if you
 include some benchmarks with CIFS/NFS clients avoiding network overhead
 during the copy.

>>>
>>> For cp (coreutils), apparently there was a concern that copy_file_range()
>>> expands holes; see the thread at
>>> https://lists.gnu.org/archive/html/bug-coreutils/2016-09/msg00020.html.
>>> Though, I'd think it could just be used on non-holes only.  And I don't 
>>> think
>>> the size_t type of 'len' is a problem either, since it's the copy length, 
>>> not
>>> the file size.  You just call it multiple times if the file is larger.
>>
>> I think cp is already using SEEK_HOLE/SEEK_DATA and/or FIEMAP to determine
>> the mapped and sparse segments of the file, so it should be practical to
>> use copy_file_range() in conjunction with these to copy only the allocated
>> parts of the file.
> 
> For the case where clone/reflink or copy_file_range is supported - is
> there any reason to
> not sent the request to copy the whole file? Presumably long
> timeout/errors might be a concern, but
> that could happen with ranges too.  In any case, if sent the whole
> file copy request,
> the server file system can figure out the  holes and copy more efficiently.
> 
> In the case where it is copying local to remote or remote to local -
> figuring out whether it is
> sparse and optimizing makes a lot of sense - but I didn't think cp did
> that (at least the
> sections of code I was looking at).

cp does check for sparse files and tries to recreate them, depending on
the --sparse=WHEN option; see the make_holes variable in copy.c. We
could still use copy_file_range() when make_holes is false. You would
have to be careful with the return value, though: a positive value less
than len means a short copy, so the call must be repeated for the
remaining bytes.

-- 
Goldwyn




Re: [PATCH v2] f2fs: add nowait aio support

2018-03-08 Thread Goldwyn Rodrigues


On 03/08/2018 01:21 AM, Hyunchul Lee wrote:
> From: Hyunchul Lee <cheol@lge.com>
> 
> This patch adds nowait aio support[1].
> 
> Return EAGAIN if any of the following checks fail for direct I/O:
>   - i_rwsem is not lockable
>   - Blocks are not allocated at the write location
> 
> And xfstests generic/471 is passed.
> 
>  [1]: 6be96d "Introduce RWF_NOWAIT and FMODE_AIO_NOWAIT"
> 
> Signed-off-by: Hyunchul Lee <cheol@lge.com>

Looks good.
Reviewed-by: Goldwyn Rodrigues <rgold...@suse.com>

> ---
> Changes from v1:
>  - Return EAGAIN if dio_rwsem is not lockable in f2fs_direct_IO
>  - Fix the wrong calculation of last_lblk in f2fs_overwrite_io 
> 
>  fs/f2fs/data.c | 47 +--
>  fs/f2fs/f2fs.h |  8 
>  fs/f2fs/file.c | 35 +--
>  3 files changed, 74 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index 6c3c978..b27ea1e 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -839,13 +839,6 @@ static int __allocate_data_block(struct dnode_of_data 
> *dn, int seg_type)
>   return 0;
>  }
>  
> -static inline bool __force_buffered_io(struct inode *inode, int rw)
> -{
> - return (f2fs_encrypted_file(inode) ||
> - (rw == WRITE && test_opt(F2FS_I_SB(inode), LFS)) ||
> - F2FS_I_SB(inode)->s_ndevs);
> -}
> -
>  int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
>  {
>   struct inode *inode = file_inode(iocb->ki_filp);
> @@ -877,7 +870,7 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct 
> iov_iter *from)
>  
>   if (direct_io) {
>   map.m_seg_type = rw_hint_to_seg_type(iocb->ki_hint);
> - flag = __force_buffered_io(inode, WRITE) ?
> + flag = f2fs_force_buffered_io(inode, WRITE) ?
>   F2FS_GET_BLOCK_PRE_AIO :
>   F2FS_GET_BLOCK_PRE_DIO;
>   goto map_blocks;
> @@ -1121,6 +1114,31 @@ int f2fs_map_blocks(struct inode *inode, struct 
> f2fs_map_blocks *map,
>   return err;
>  }
>  
> +bool f2fs_overwrite_io(struct inode *inode, loff_t pos, size_t len)
> +{
> + struct f2fs_map_blocks map;
> + block_t last_lblk;
> + int err;
> +
> + if (pos + len > i_size_read(inode))
> + return false;
> +
> + map.m_lblk = F2FS_BYTES_TO_BLK(pos);
> + map.m_next_pgofs = 0;
> + map.m_next_extent = NULL;
> + map.m_seg_type = NO_CHECK_TYPE;
> + last_lblk = F2FS_BYTES_TO_BLK(pos + len - 1) + 1;
> +
> + while (map.m_lblk < last_lblk) {
> + map.m_len = last_lblk - map.m_lblk;
> + err = f2fs_map_blocks(inode, &map, 0, 0);
> + if (err || map.m_len == 0)
> + return false;
> + map.m_lblk += map.m_len;
> + }
> + return true;
> +}
> +
>  static int __get_data_block(struct inode *inode, sector_t iblock,
>   struct buffer_head *bh, int create, int flag,
>   pgoff_t *next_pgofs, int seg_type)
> @@ -2306,7 +2324,7 @@ static ssize_t f2fs_direct_IO(struct kiocb *iocb, 
> struct iov_iter *iter)
>   if (err)
>   return err;
>  
> - if (__force_buffered_io(inode, rw))
> + if (f2fs_force_buffered_io(inode, rw))
>   return 0;
>  
>   trace_f2fs_direct_IO_enter(inode, offset, count, rw);
> @@ -2314,7 +2332,15 @@ static ssize_t f2fs_direct_IO(struct kiocb *iocb, 
> struct iov_iter *iter)
>   if (rw == WRITE && whint_mode == WHINT_MODE_OFF)
>   iocb->ki_hint = WRITE_LIFE_NOT_SET;
>  
> - down_read(&F2FS_I(inode)->dio_rwsem[rw]);
> + if (!down_read_trylock(&F2FS_I(inode)->dio_rwsem[rw])) {
> + if (iocb->ki_flags & IOCB_NOWAIT) {
> + iocb->ki_hint = hint;
> + err = -EAGAIN;
> + goto out;
> + }
> + down_read(&F2FS_I(inode)->dio_rwsem[rw]);
> + }
> +
>   err = blockdev_direct_IO(iocb, inode, iter, get_data_block_dio);
>   up_read(&F2FS_I(inode)->dio_rwsem[rw]);
>  
> @@ -2330,6 +2356,7 @@ static ssize_t f2fs_direct_IO(struct kiocb *iocb, 
> struct iov_iter *iter)
>   }
>   }
>  
> +out:
>   trace_f2fs_direct_IO_exit(inode, offset, count, rw, err);
>  
>   return err;
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index f6dc706..351226e 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -2875,6 +2875,7 @@ void f2fs_invalidate_page(struct page *page, unsigned 
> int offset,


Re: BUG: unable to handle kernel NULL pointer dereference in free_pipe_info

2018-01-30 Thread Goldwyn Rodrigues


On 01/30/2018 04:13 PM, Eric Biggers wrote:
> On Tue, Dec 19, 2017 at 12:39:01AM -0800, syzbot wrote:
>> Hello,
>>
>> syzkaller hit the following crash on
>> 6084b576dca2e898f5c101baef151f7bfdbb606d
>> git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/master
>> compiler: gcc (GCC) 7.1.1 20170620
>> .config is attached
>> Raw console output is attached.
>>
>> Unfortunately, I don't have any reproducer for this bug yet.
>>
>>
>> BUG: unable to handle kernel NULL pointer dereference at 002f
>> IP: pipe_buf_release include/linux/pipe_fs_i.h:136 [inline]
>> IP: free_pipe_info+0x75/0xd0 fs/pipe.c:674
>> PGD 1e069b067 P4D 1e069b067 PUD 1e0d7f067 PMD 0
>> Oops:  [#1] SMP
>> Dumping ftrace buffer:
>>(ftrace buffer empty)
>> Modules linked in:
>> CPU: 1 PID: 11658 Comm: syz-executor4 Not tainted 4.15.0-rc3-next-20171214+
>> #67
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> RIP: 0010:pipe_buf_release include/linux/pipe_fs_i.h:136 [inline]
>> RIP: 0010:free_pipe_info+0x75/0xd0 fs/pipe.c:674
>> RSP: 0018:c9e27ba8 EFLAGS: 00010293
>> RAX: 8801dfa7e500 RBX: 8801dac70200 RCX: 81414817
>> RDX:  RSI: 8802156b6800 RDI: 8801dac70200
>> RBP: c9e27bc8 R08: 0001 R09: 
>> R10:  R11:  R12: 
>> R13: 8802156b6800 R14: 001f R15: 880213d99820
>> FS:  7f81f5ada700() GS:88021fd0() knlGS:
>> CS:  0010 DS:  ES:  CR0: 80050033
>> CR2: 002f CR3: 0001e05cc000 CR4: 001426e0
>> Call Trace:
>>  put_pipe_info+0x65/0x80 fs/pipe.c:561
>>  pipe_release+0xda/0x110 fs/pipe.c:582
>>  __fput+0x120/0x270 fs/file_table.c:209
>>  fput+0x15/0x20 fs/file_table.c:243
>>  task_work_run+0xa3/0xe0 kernel/task_work.c:113
>>  exit_task_work include/linux/task_work.h:22 [inline]
>>  do_exit+0x3e6/0x1050 kernel/exit.c:869
>>  do_group_exit+0x60/0x100 kernel/exit.c:972
>>  get_signal+0x36c/0xad0 kernel/signal.c:2337
>>  do_signal+0x23/0x670 arch/x86/kernel/signal.c:809
>>  exit_to_usermode_loop+0x13c/0x160 arch/x86/entry/common.c:161
>>  prepare_exit_to_usermode arch/x86/entry/common.c:195 [inline]
>>  syscall_return_slowpath+0x1b4/0x1e0 arch/x86/entry/common.c:264
>>  entry_SYSCALL_64_fastpath+0x94/0x96
>> RIP: 0033:0x452a39
>> RSP: 002b:7f81f5ad9ce8 EFLAGS: 0246 ORIG_RAX: 00ca
>> RAX: fe00 RBX: 007581b8 RCX: 00452a39
>> RDX:  RSI:  RDI: 007581b8
>> RBP: 007581b8 R08: 001a R09: 00758190
>> R10:  R11: 0246 R12: 
>> R13: 00a6f7ff R14: 7f81f5ada9c0 R15: 000e
>> Code: 48 8d 14 80 48 8b 83 08 01 00 00 4c 8d 2c d0 4d 8b 75 10 4d 85 f6 74
>> 17 e8 29 5b ea ff 49 c7 45 10 00 00 00 00 4c 89 ee 48 89 df <41> ff 56 10 e8
>> 12 5b ea ff 41 83 c4 01 44 39 a3 d0 00 00 00 77
>> RIP: pipe_buf_release include/linux/pipe_fs_i.h:136 [inline] RSP:
>> c9e27ba8
>> RIP: free_pipe_info+0x75/0xd0 fs/pipe.c:674 RSP: c9e27ba8
>> CR2: 002f
>> ---[ end trace 7ae396f579301f25 ]---
> 
> Invalidating this bug since it hasn't been seen again, and it was reported 
> while
> KASAN was accidentally disabled in the syzbot kconfig due to a change to the
> kconfig menus in linux-next (so this crash was possibly caused by slab
> corruption elsewhere).
> 

This crash looks similar to the ones discussed in
https://lkml.org/lkml/2017/10/30/780

If this can be reproduced, please share how.


-- 
Goldwyn




Re: Detecting RWF_NOWAIT support

2017-12-17 Thread Goldwyn Rodrigues


On 12/16/2017 08:49 AM, Avi Kivity wrote:
> 
> 
> On 12/14/2017 09:15 PM, Goldwyn Rodrigues wrote:
>>
>> On 12/14/2017 11:38 AM, Avi Kivity wrote:
>>> I'm looking to add support for RWF_NOWAIT within a linux-aio iocb.
>>> Naturally, I need to detect at runtime whether the kernel support
>>> RWF_NOWAIT or not.
>>>
>>>
>>> The only method I could find was to issue an I/O with RWF_NOWAIT set,
>>> and look for errors. This is somewhat less than perfect:
>>>
>>>   - from the error, I can't tell whether RWF_NOWAIT was the problem, or
>>> something else. If I enable a number of new features, I have to run
>>> through all combinations to figure out which ones are supported and
>>> which are not.
>> Here are the return codes for RWF_NOWAIT:
>> EINVAL - not supported (older kernel)
>> EOPNOTSUPP - not supported
>> EAGAIN - supported but could not complete because I/O will be delayed
> 
> Which of these are returned from io_submit() and which are returned in
> the iocb?

These are returned in the iocb.

> 
>> 0 - supported and I/O completed (success).
>>
>>>   - RWF_NOWAIT support is per-filesystem, so I can't just remember
>>> not to
>>> enable RWF_NOWAIT globally, I have to track it per file.
>> Yes, the support is per filesystem. So, the application must know if the
>> filesystem supports it, possibly by performing a small I/O.
> 
> So the application must know about filesystem mount points, and be
> prepared to create a file and try to write it (in case the filesystem is
> empty) or alter its behavior during runtime depending on the errors it
> sees.

Well yes. Hopefully, the application knows what it is doing when it
performs RWF_NOWAIT.
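Since support is per filesystem, an application ends up remembering the probe outcome per device rather than globally. A minimal sketch of such a cache (the class and method names are ours, not any kernel or libc API; the probe write itself is left to the caller):

```python
import os

class NowaitTracker:
    """Remember, per filesystem (keyed by st_dev), whether RWF_NOWAIT
    worked, so the probe I/O is only paid once per filesystem."""

    def __init__(self):
        self._by_dev = {}  # st_dev -> bool

    def record(self, fd, supported):
        # Store the probe result for the filesystem backing this fd.
        self._by_dev[os.fstat(fd).st_dev] = bool(supported)

    def supported(self, fd):
        """True/False if known for this file's filesystem, else None."""
        return self._by_dev.get(os.fstat(fd).st_dev)
```

Files on the same filesystem share one cached answer, so only the first open file per mount needs the small probe I/O.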

-- 
Goldwyn


Re: Detecting RWF_NOWAIT support

2017-12-14 Thread Goldwyn Rodrigues


On 12/14/2017 11:38 AM, Avi Kivity wrote:
> I'm looking to add support for RWF_NOWAIT within a linux-aio iocb.
> Naturally, I need to detect at runtime whether the kernel support
> RWF_NOWAIT or not.
> 
> 
> The only method I could find was to issue an I/O with RWF_NOWAIT set,
> and look for errors. This is somewhat less than perfect:
> 
>  - from the error, I can't tell whether RWF_NOWAIT was the problem, or
> something else. If I enable a number of new features, I have to run
> through all combinations to figure out which ones are supported and
> which are not.

Here are the return codes for RWF_NOWAIT:
EINVAL - not supported (older kernel)
EOPNOTSUPP - not supported
EAGAIN - supported but could not complete because I/O will be delayed
0 - supported and I/O completed (success).
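The table above maps directly onto a small classification helper; a sketch (the function and outcome names are ours) of how an application might interpret the result of a write issued with RWF_NOWAIT:

```python
import errno

# Outcomes an application cares about, per the errno table above.
SUPPORTED = "supported"          # I/O completed
NOT_SUPPORTED = "not supported"  # EINVAL (older kernel) or EOPNOTSUPP
WOULD_BLOCK = "would block"      # EAGAIN: supported, but I/O would be delayed

def classify_nowait_result(ret, err=0):
    """Classify a NOWAIT write: ret >= 0 means the I/O completed,
    otherwise err holds the errno from the failed attempt."""
    if ret >= 0:
        return SUPPORTED
    if err in (errno.EINVAL, errno.EOPNOTSUPP):
        return NOT_SUPPORTED
    if err == errno.EAGAIN:  # EWOULDBLOCK == EAGAIN on Linux
        return WOULD_BLOCK
    return "unrelated error"
```

Note that EINVAL is ambiguous in practice (it can also mean a bad argument), which is part of the detection problem being discussed.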

> 
>  - RWF_NOWAIT support is per-filesystem, so I can't just remember not to
> enable RWF_NOWAIT globally, I have to track it per file.

Yes, the support is per filesystem. So, the application must know if the
filesystem supports it, possibly by performing a small I/O.

-- 
Goldwyn


Re: [PATCH] VFS: Handle lazytime in do_mount()

2017-11-17 Thread Goldwyn Rodrigues


On 10/11/2017 12:01 AM, Markus Trippelsdorf wrote:
> Since commit e462ec50cb5fa ("VFS: Differentiate mount flags (MS_*) from
> internal superblock flags") the lazytime mount option doesn't get passed
> on anymore.
> 
> Fix the issue by handling the option in do_mount().
> 
> Reviewed-by: Lukas Czerner <lczer...@redhat.com>
> Signed-off-by: Markus Trippelsdorf <mar...@trippelsdorf.de>

Reviewed-by: Goldwyn Rodrigues <rgold...@suse.com>

> ---
>  fs/namespace.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 54059b142d6b..b633838b8f02 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2823,7 +2823,8 @@ long do_mount(const char *dev_name, const char __user 
> *dir_name,
>   SB_MANDLOCK |
>   SB_DIRSYNC |
>   SB_SILENT |
> - SB_POSIXACL);
> + SB_POSIXACL |
> + SB_LAZYTIME);
>  
>   if (flags & MS_REMOUNT)
>   retval = do_remount(, flags, sb_flags, mnt_flags,
> 

-- 
Goldwyn


Re: [RESEND PATCH] fs: add RWF_APPEND

2017-09-30 Thread Goldwyn Rodrigues


On 09/29/2017 07:07 AM, Jürg Billeter wrote:
> This is the per-I/O equivalent of O_APPEND to support atomic append
> operations on any open file.
> 
> If a file is opened with O_APPEND, pwrite() ignores the offset and
> always appends data to the end of the file. RWF_APPEND enables atomic
> append and pwrite() with offset on a single file descriptor.
> 
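The per-I/O append semantics described above can be modeled in a few lines (a pure-Python model, not the kernel code; the flag value matches the patch): with RWF_APPEND set, the supplied offset is ignored and the data lands at EOF, while a plain pwrite honors the offset.

```python
RWF_APPEND = 0x10  # same value the patch assigns

def pwrite_model(contents: bytes, offset: int, data: bytes, flags: int = 0) -> bytes:
    """Model of pwrite() on an in-memory file: returns the new contents."""
    if flags & RWF_APPEND:
        offset = len(contents)  # per-I/O append: the offset is ignored
    # Writes past EOF zero-fill the gap, as on a real sparse file.
    padded = contents.ljust(offset, b"\0")
    return padded[:offset] + data + padded[offset + len(data):]
```

So an append write to a 3-byte file produces b"foobar" no matter what offset the caller passed, without the file descriptor itself being in O_APPEND mode.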

I think we could similarly introduce RWF_DIRECT, though I am not sure
users would want to mix buffered and direct writes.

Reviewed-by: Goldwyn Rodrigues <rgold...@suse.com>

> Signed-off-by: Jürg Billeter <j...@bitron.ch>
> ---
>  include/linux/fs.h  | 2 ++
>  include/uapi/linux/fs.h | 6 +-
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 339e73742e73..fee24eae7523 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3204,6 +3204,8 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, 
> rwf_t flags)
>   ki->ki_flags |= IOCB_DSYNC;
>   if (flags & RWF_SYNC)
>   ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
> + if (flags & RWF_APPEND)
> + ki->ki_flags |= IOCB_APPEND;
>   return 0;
>  }
>  
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 56235dddea7d..ac145430bcd8 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -376,7 +376,11 @@ typedef int __bitwise __kernel_rwf_t;
>  /* per-IO, return -EAGAIN if operation would block */
>  #define RWF_NOWAIT   ((__force __kernel_rwf_t)0x0008)
>  
> +/* per-IO O_APPEND */
> +#define RWF_APPEND   ((__force __kernel_rwf_t)0x0010)
> +
>  /* mask of flags supported by the kernel */
> -#define RWF_SUPPORTED(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT)
> +#define RWF_SUPPORTED(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT 
> |\
> +  RWF_APPEND)
>  
>  #endif /* _UAPI_LINUX_FS_H */
> 

-- 
Goldwyn


Re: nowait aio return value

2017-07-26 Thread Goldwyn Rodrigues
Hi Jason,

On 07/26/2017 01:45 PM, Jason Baron wrote:
> Hi,
> 
> In testing nowait aio on ext4, I found that when appending to a file
> the return value is EAGAIN/EWOULDBLOCK, because as mentioned in the
> commit this will potentially trigger an allocation. However, the EAGAIN,
> seems somewhat misleading to me, in that if I continuously try the
> write, it will never succeed.
> 
> The relevant commit is:
> 728fbc0 ext4: nowait aio support
> 
> As you can see there, failure to get the inode lock is treated as
> EAGAIN, which seems more appropriate to me, as its very likely
> to succeed on subsequent calls.
> 
> Perhaps, it could be switched to -EINVAL, or something else?
> 

Thanks for testing this.

I would suggest reading it as -EWOULDBLOCK. The idea is to pass
IOCB_NOWAIT when we don't want the I/O to wait, and to return
-EWOULDBLOCK if it would. If it returns EWOULDBLOCK because of an
allocation, you may want to allocate your file space before performing the I/O.

I would return EINVAL in case the parameters passed are incorrect.

-- 
Goldwyn


Re: Commit edf064e7c (btrfs: nowait aio support) breaks shells

2017-07-07 Thread Goldwyn Rodrigues


On 07/04/2017 05:16 PM, Jens Axboe wrote:
> 
> Please expedite getting this upstream, asap.
> 

Jens,

I have posted an updated patch [1] and it is acked by David. Would you
pick it up or should it go through the btrfs tree (or some other tree)?

[1] https://patchwork.kernel.org/patch/9825813/

-- 
Goldwyn


[PATCH v2] btrfs: Correct assignment of pos

2017-07-04 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Assigning pos early for later use breaks append mode, where
pos is re-assigned in generic_write_checks(). Assign pos
later to pick up the correct write position from iocb->ki_pos.

Since check_can_nocow also uses the value of pos, we shift
generic_write_checks() before check_can_nocow(). Checks with
IOCB_DIRECT are present in generic_write_checks(), so checking
for IOCB_NOWAIT is enough.

Also, put the locking sequence in the fast path.
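The append-mode breakage can be shown with a small model (illustrative Python, not the kernel code; the flag value is ours): generic_write_checks() repositions ki_pos to EOF for append-mode writes, so a pos captured before the call is stale.

```python
IOCB_APPEND = 0x2  # illustrative flag value only

class Kiocb:
    def __init__(self, ki_pos, ki_flags=0):
        self.ki_pos = ki_pos
        self.ki_flags = ki_flags

def generic_write_checks_model(iocb, file_size):
    """Model: append-mode writes get repositioned to EOF."""
    if iocb.ki_flags & IOCB_APPEND:
        iocb.ki_pos = file_size

def write_pos(iocb, file_size, read_pos_early):
    """Offset the write would use, depending on whether pos is
    captured before the checks (the bug) or after (the fix)."""
    pos = iocb.ki_pos if read_pos_early else None
    generic_write_checks_model(iocb, file_size)
    return pos if read_pos_early else iocb.ki_pos
```

With a 3-byte file and an append iocb whose ki_pos is 0, capturing pos early yields offset 0 (the second write overwrites the first), while capturing it after the checks yields 3, the end of the file.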

Changes since v1:
 - Moved pos higher up to encompass check_can_nocow() call.

Fixes: edf064e7c6fe ("btrfs: nowait aio support")
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
---
 fs/btrfs/file.c | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 59e2dccdf75b..ad53832838b5 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1875,16 +1875,25 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
ssize_t num_written = 0;
bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host);
ssize_t err;
-   loff_t pos = iocb->ki_pos;
+   loff_t pos;
size_t count = iov_iter_count(from);
loff_t oldsize;
int clean_page = 0;
 
-   if ((iocb->ki_flags & IOCB_NOWAIT) &&
-   (iocb->ki_flags & IOCB_DIRECT)) {
-   /* Don't sleep on inode rwsem */
-   if (!inode_trylock(inode))
+   if (!inode_trylock(inode)) {
+   if (iocb->ki_flags & IOCB_NOWAIT)
return -EAGAIN;
+   inode_lock(inode);
+   }
+
+   err = generic_write_checks(iocb, from);
+   if (err <= 0) {
+   inode_unlock(inode);
+   return err;
+   }
+
+   pos = iocb->ki_pos;
+   if (iocb->ki_flags & IOCB_NOWAIT) {
/*
 * We will allocate space in case nodatacow is not set,
 * so bail
@@ -1895,13 +1904,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
inode_unlock(inode);
return -EAGAIN;
}
-   } else
-   inode_lock(inode);
-
-   err = generic_write_checks(iocb, from);
-   if (err <= 0) {
-   inode_unlock(inode);
-   return err;
}
 
current->backing_dev_info = inode_to_bdi(inode);
-- 
2.12.0



[PATCH] btrfs: Correct assignment of pos

2017-07-04 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgold...@suse.com>

Assigning pos early for later use breaks append mode, where
pos is re-assigned in generic_write_checks(). Re-assign
pos to pick up the correct write position from iocb->ki_pos.

Fixes: edf064e7c6fe ("btrfs: nowait aio support")
Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
Tested-by: Markus Trippelsdorf <mar...@trippelsdorf.de>
---
 fs/btrfs/file.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 59e2dccdf75b..7947781229e5 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1931,6 +1931,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 */
update_time_for_write(inode);
 
+   pos = iocb->ki_pos;
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
if (start_pos > oldsize) {
-- 
2.12.0



Re: Commit edf064e7c (btrfs: nowait aio support) breaks shells

2017-07-04 Thread Goldwyn Rodrigues


On 07/04/2017 02:45 AM, Markus Trippelsdorf wrote:
> On 2017.07.04 at 06:23 +0200, Markus Trippelsdorf wrote:
>> commit edf064e7c6fec3646b06c944a8e35d1a3de5c2c3 (HEAD, refs/bisect/bad)
>> Author: Goldwyn Rodrigues <rgold...@suse.com>
>> Date:   Tue Jun 20 07:05:49 2017 -0500
>>
>> btrfs: nowait aio support
>>
>> apparently breaks several shell related features on my system.
> 
> Here is a simple testcase:
> 
>  % echo "foo" >> test
>  % echo "foo" >> test
>  % cat test
>  foo
>  %
> 

Thanks for testing.
Yes, pos must be set from iocb->ki_pos for appends; I should not have
removed the initialization. Could you try this patch?

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 59e2dccdf75b..7947781229e5 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1931,6 +1931,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb
*iocb,
 */
update_time_for_write(inode);

+   pos = iocb->ki_pos;
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
if (start_pos > oldsize) {



-- 
Goldwyn


Re: [PATCH] fs: Handle register_shrinker failure

2017-04-06 Thread Goldwyn Rodrigues


On 03/24/2017 02:55 AM, Nikolay Borisov wrote:
> register_shrinker allocates dynamic memory and thus is susceptible to failures
> under low-memory situations. Currently, sget_userns ignores the return value of
> register_shrinker, potentially exposing a not fully initialised object. This
> can lead to a NULL-ptr deref every time shrinker->nr_deferred is referenced.
> 
> Fix this by failing to register the filesystem in case there is not enough
> memory to fully construct the shrinker object.
> 
> Signed-off-by: Nikolay Borisov <nbori...@suse.com>

Looks good, though the situation seems rare.

Reviewed-by: Goldwyn Rodrigues <rgold...@suse.com>

> ---
>  fs/super.c | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index b8b6a086c03b..964b18447c92 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -518,7 +518,19 @@ struct super_block *sget_userns(struct file_system_type 
> *type,
>   hlist_add_head(>s_instances, >fs_supers);
>   spin_unlock(_lock);
>   get_filesystem(type);
> - register_shrinker(>s_shrink);
> + err = register_shrinker(>s_shrink);
> + if (err) {
> + spin_lock(_lock);
> + list_del(>s_list);
> + hlist_del(>s_instances);
> + spin_unlock(_lock);
> +
> + up_write(>s_umount);
> + destroy_super(s);
> + put_filesystem(type);
> + return ERR_PTR(err);
> + }
> +
>   return s;
>  }
>  
> 

-- 
Goldwyn


Re: [PATCH] ocfs2: insure dlm lockspace is created by kernel module

2016-05-20 Thread Goldwyn Rodrigues

Looks good.

Reviewed-by: Goldwyn Rodrigues <rgold...@suse.com>

On 05/20/2016 03:12 AM, Gang He wrote:

This patch ensures the dlm lockspace is created by the kernel
module when mounting an ocfs2 file system. There are two ways to create a
lockspace, from user space and from kernel space, but lockspaces with the
same name may have different lvblen lengths/flags.
To avoid mixing them, we add one more flag, DLM_LSFL_NEWEXCL, which makes
sure the dlm lockspace is created by the kernel module when mounting.
Secondly, if a user space program (ocfs2-tools) is running on a file system
and the user tries to mount this file system in the cluster, the DLM module
will return a -EEXIST or -EPROTO errno. We should give the user an obvious
error message so that the user can make the user space tool exit before
mounting the file system again.
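A toy model of the exclusive-create semantics DLM_LSFL_NEWEXCL adds (illustrative Python only; the flag value, class, and exact error mapping are ours, and the real kernel DLM differs in detail): with the flag set, creating a lockspace whose name already exists fails with EEXIST instead of silently joining a lockspace whose parameters may not match.

```python
import errno

DLM_LSFL_NEWEXCL = 0x4  # illustrative flag value only

class LockspaceRegistry:
    """Toy model of lockspace creation semantics."""

    def __init__(self):
        self._spaces = {}  # name -> lvblen

    def new_lockspace(self, name, lvblen, flags=0):
        """Return 0 on success, or a negative errno as the kernel does."""
        if name in self._spaces:
            if flags & DLM_LSFL_NEWEXCL:
                return -errno.EEXIST  # refuse to join an existing lockspace
            if self._spaces[name] != lvblen:
                return -errno.EPROTO  # mismatched lockspace parameters
            return 0                  # joined the existing lockspace
        self._spaces[name] = lvblen
        return 0
```

This is why the mount path can translate -EEXIST/-EPROTO into the "a tool already holds this lockspace" message above.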

Signed-off-by: Gang He <g...@suse.com>
---
  fs/ocfs2/stack_user.c | 11 +--
  1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/stack_user.c b/fs/ocfs2/stack_user.c
index ced70c8..c9e828e 100644
--- a/fs/ocfs2/stack_user.c
+++ b/fs/ocfs2/stack_user.c
@@ -1007,10 +1007,17 @@ static int user_cluster_connect(struct 
ocfs2_cluster_connection *conn)
lc->oc_type = NO_CONTROLD;

rc = dlm_new_lockspace(conn->cc_name, conn->cc_cluster_name,
-  DLM_LSFL_FS, DLM_LVB_LEN,
+  DLM_LSFL_FS | DLM_LSFL_NEWEXCL, DLM_LVB_LEN,
   _ls_ops, conn, _rv, );
-   if (rc)
+   if (rc) {
+   if (rc == -EEXIST || rc == -EPROTO)
+   printk(KERN_ERR "ocfs2: Unable to create the "
+   "lockspace %s (%d), because a ocfs2-tools "
+   "program is running on this file system "
+   "with the same name lockspace\n",
+   conn->cc_name, rc);
goto out;
+   }

if (ops_rv == -EOPNOTSUPP) {
lc->oc_type = WITH_CONTROLD;



--
Goldwyn


Re: clustered MD

2015-06-14 Thread Goldwyn Rodrigues



On 06/12/2015 01:46 PM, David Teigland wrote:

When a node fails, its dirty areas get special treatment from other nodes
using the area_resyncing() function.  Should the suspend_list be created
before any reads or writes from the file system are processed by md?  It
seems to me that gfs journal recovery could read/write to dirty regions
(from the failed node) before md was finished setting up the suspend_list.
md could probably prevent that by using the recover_prep() dlm callback to
set a flag that would block any i/o that arrived before the suspend_list
was ready.

.


Yes, we should call mddev_suspend() in recover_prep() and mddev_resume() 
after suspend_list is created. Thanks for pointing it out.


--
Goldwyn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: clustered MD

2015-06-10 Thread Goldwyn Rodrigues
To start with, the goal of (basic) MD RAID1 is to keep the two mirrored 
devices consistent _all_ of the time. In case of a device failure, it 
should degrade the array, marking the failed device so it can be 
(hot-)removed/replaced. Now, take the same concepts to multiple nodes 
using the same MD-RAID1 device...


On 06/10/2015 10:48 AM, David Teigland wrote:

On Wed, Jun 10, 2015 at 10:27:27AM -0500, Goldwyn Rodrigues wrote:

I thought I answered that:
To use a software RAID1 across multiple nodes of a cluster. Let me
explain in more words..

In a cluster with multiple nodes with a shared storage, such as a
SAN. The shared device becomes a single point of failure.


OK, shared storage, that's an important starting point that was never
clear.


If the
device loses power, you will lose everything. A solution proposed is
to use software RAID, say with two SAN switches with different
devices and create a RAID1 on it. So if you lose power on one switch
or one of the devices fails, the other is still available. Once you
get the other switch/device back up, it would resync the devices.


OK, MD RAID1 on shared disks.


, and exactly
what breaks when you use raid1 in that way?  Once we've established the
technical problem, then I can fairly evaluate your solution for it.


Data consistency breaks. If node 1 is writing to the RAID1 device,
you have to make sure the data between the two RAID devices is
consistent. With software raid, this is performed with bitmaps. The
DLM is used to maintain data consistency.


What's different about disks being on SAN that breaks data consistency vs
disks being locally attached?  Where did the dlm come into the picture?


There are multiple nodes using the same shared device. Different nodes 
would be writing their own data to the shared device possibly using a 
shared filesystem such as ocfs2 on top of it. Each node maintains a 
bitmap to co-ordinate syncs between the two devices of the RAID. Since 
there are two devices, writes on the two devices can end at different 
times and must be co-ordinated.





Device failure can be partial. Say, only node 1 sees that one of the
devices has failed (link break).  You need to "tell" other nodes not
to use the device and that the array is degraded.


Why?


Data consistency. A node which continues to "see" as working a device 
that failed on another node will read stale data.





In case of node failure, the blocks of the failed nodes must be
synced before the cluster can continue operation.


What do cluster/node failures have to do with syncing mirror copies?



Data consistency. Different nodes will be writing to different blocks. 
So, if a node fails, the node performing recovery must complete whatever 
the failed node had not yet synced between the two devices. You need to 
provide a consistent view to all nodes.



Does that explain the situation?


No.  I don't see what clusters have to do with MD RAID1 devices, they seem
like completely orthogonal concepts.


If you need an analogy: cLVM, but with less overhead ;)

Also, may I point you to linux/Documentation/md-cluster.txt?

HTH,

--
Goldwyn


Re: clustered MD

2015-06-10 Thread Goldwyn Rodrigues


On 06/10/2015 10:01 AM, David Teigland wrote:

On Tue, Jun 09, 2015 at 10:33:08PM -0500, Goldwyn Rodrigues wrote:

some real world utility to warrant the potential maintenance effort.


We do have a valid real world utility. It is to provide
high-availability of RAID1 storage  over the cluster. The
distributed locking is required only during cases of error and
superblock updates and is not required during normal operations,
which makes it fast enough for usual case scenarios.


That's the theory, how much evidence do you have of that in practice?


We wanted to develop a solution which is lock-free (or at least takes 
minimal locks) for the most common/frequent usage scenario. Also, we 
compared it with iozone on top of ocfs2 and found it very close to 
local-device performance numbers. We compared it with cLVM mirroring 
and found it better as well. However, in the future we would want to 
use it with other RAID levels (10?), which is missing now.


OK, but that's the second time you've missed the question I asked about
examples of real world usage.  Given the early stage of development, I'm
supposing there is none, which also implies it's too early for merging.



I thought I answered that:
To use a software RAID1 across multiple nodes of a cluster. Let me 
explain in more words..


In a cluster, multiple nodes share storage, such as a SAN. The shared 
device becomes a single point of failure. If the device loses power, 
you will lose everything. A solution proposed is to use software RAID: 
say, with two SAN switches with different devices, create a RAID1 
across them. So if you lose power on one switch or one of the devices 
fails, the other is still available. Once you get the other 
switch/device back up, it would resync the devices.



What are the doubts you have about it?


Before I begin reviewing the implementation, I'd like to better understand
what it is about the existing raid1 that doesn't work correctly for what
you'd like to do with it, i.e. I don't know what the problem is.


David Lang has already responded: The idea is to use a RAID device
(currently only level 1 mirroring is supported) with multiple nodes
of the cluster.


That doesn't come close to answering the question: exactly how do you want
to use raid1 (I have no idea from the statements you've made)


Using software RAID1 on a cluster with shared devices.



, and exactly
what breaks when you use raid1 in that way?  Once we've established the
technical problem, then I can fairly evaluate your solution for it.



Data consistency breaks. If node 1 is writing to the RAID1 device, you 
have to make sure the data between the two RAID devices is consistent. 
With software raid, this is performed with bitmaps. The DLM is used to 
maintain data consistency.


Device failure can be partial. Say, only node 1 sees that one of the 
devices has failed (link break).  You need to "tell" other nodes not to 
use the device and that the array is degraded.


In case of node failure, the blocks of the failed nodes must be synced 
before the cluster can continue operation.


Does that explain the situation?


--
Goldwyn


Re: clustered MD

2015-06-10 Thread Goldwyn Rodrigues



On 06/10/2015 03:00 AM, Richard Weinberger wrote:

On Wed, Jun 10, 2015 at 5:33 AM, Goldwyn Rodrigues  wrote:

David Lang has already responded: The idea is to use a RAID device
(currently only level 1 mirroring is supported) with multiple nodes of the
cluster.

Here is a description on how to use it:
http://marc.info/?l=linux-raid&m=141935561418770&w=2


Sorry if this is a stupid question, but how does this compare to DRBD?



No, it is not the same.

DRBD is for local devices synced over the network. Cluster-md is for 
shared devices such as a SAN, so all syncs happen over the FC.
DRBD works for two nodes (until recently); cluster-md works for 
multiple nodes.


--
Goldwyn



Re: clustered MD

2015-06-09 Thread Goldwyn Rodrigues



On 06/09/2015 03:30 PM, David Teigland wrote:

On Tue, Jun 09, 2015 at 03:08:11PM -0500, Goldwyn Rodrigues wrote:

Hi David,

On 06/09/2015 02:45 PM, David Teigland wrote:

On Tue, Jun 09, 2015 at 02:26:25PM -0500, Goldwyn Rodrigues wrote:

On 06/09/2015 01:22 PM, David Teigland wrote:

I've just noticed the existence of clustered MD for the first time.
It is a major new user of the dlm, and I have some doubts about it.
When did this appear on the mailing list for review?


It first appeared in December, 2014 on the RAID mailing list.
http://marc.info/?l=linux-raid&m=141891941330336&w=2


I don't read that mailing list.  Searching my archives of linux-kernel, it
has never been mentioned.  I can't even find an email for the md pull
request that included it.


Is this what you are looking for?
http://marc.info/?l=linux-kernel&m=142976971510061&w=2


Yes, I guess gmail lost it, or put it in spam.


- "experimental" code for managing md/raid1 across a cluster using
  DLM.  Code is not ready for general use and triggers a WARNING if
  used.  However it is looking good and mostly done and having in
  mainline will help co-ordinate development.

That falls far short of the bar for adding it to the kernel.  It not only
needs to work, it needs to be reviewed and justified, usually by showing


Why do you say it does not work?


It's just my abbreviation of that summary paragraph.


It did go through its round of reviews on the RAID mailing list. I
understand that you missed it because you are not subscribed to the raid
mailing list.


I will look for that.


some real world utility to warrant the potential maintenance effort.


We do have a valid real world utility. It is to provide
high-availability of RAID1 storage  over the cluster. The
distributed locking is required only during cases of error and
superblock updates and is not required during normal operations,
which makes it fast enough for usual case scenarios.


That's the theory, how much evidence do you have of that in practice?


We wanted to develop a solution which is lock-free (or at least takes 
minimal locks) for the most common/frequent usage scenario. Also, we 
compared it with iozone on top of ocfs2 and found it very close to 
local-device performance numbers. We compared it with cLVM mirroring 
and found it better as well. However, in the future we would want to 
use it with other RAID levels (10?), which is missing now.





What are the doubts you have about it?


Before I begin reviewing the implementation, I'd like to better understand
what it is about the existing raid1 that doesn't work correctly for what
you'd like to do with it, i.e. I don't know what the problem is.



David Lang has already responded: The idea is to use a RAID device 
(currently only level 1 mirroring is supported) with multiple nodes of 
the cluster.


Here is a description on how to use it:
http://marc.info/?l=linux-raid&m=141935561418770&w=2

--
Goldwyn


Re: clustered MD

2015-06-09 Thread Goldwyn Rodrigues

Hi David,

On 06/09/2015 02:45 PM, David Teigland wrote:

On Tue, Jun 09, 2015 at 02:26:25PM -0500, Goldwyn Rodrigues wrote:

On 06/09/2015 01:22 PM, David Teigland wrote:

I've just noticed the existence of clustered MD for the first time.
It is a major new user of the dlm, and I have some doubts about it.
When did this appear on the mailing list for review?


It first appeared in December, 2014 on the RAID mailing list.
http://marc.info/?l=linux-raid&m=141891941330336&w=2


I don't read that mailing list.  Searching my archives of linux-kernel, it
has never been mentioned.  I can't even find an email for the md pull
request that included it.



Is this what you are looking for?
http://marc.info/?l=linux-kernel&m=142976971510061&w=2


The merge commit states:

- "experimental" code for managing md/raid1 across a cluster using
  DLM.  Code is not ready for general use and triggers a WARNING if
  used.  However it is looking good and mostly done and having in
  mainline will help co-ordinate development.

That falls far short of the bar for adding it to the kernel.  It not only
needs to work, it needs to be reviewed and justified, usually by showing
some real world utility to warrant the potential maintenance effort.




Why do you say it does not work? It did go through its round of reviews 
on the RAID mailing list. I understand that you missed it because you 
are not subscribed to the raid mailing list.


We do have a valid real-world utility: providing high availability of 
RAID1 storage over the cluster. The distributed locking is required 
only for error handling and superblock updates, not during normal 
operation, which makes it fast enough for the usual scenarios.


What are the doubts you have about it?


--
Goldwyn



Re: clustered MD

2015-06-09 Thread Goldwyn Rodrigues



On 06/09/2015 01:22 PM, David Teigland wrote:

I've just noticed the existence of clustered MD for the first time.
It is a major new user of the dlm, and I have some doubts about it.
When did this appear on the mailing list for review?



It first appeared in December, 2014 on the RAID mailing list.
http://marc.info/?l=linux-raid&m=141891941330336&w=2


--
Goldwyn

