Re: Btrfs and fanotify filesystem watches

2018-11-23 Thread Amir Goldstein
On Fri, Nov 23, 2018 at 2:52 PM Jan Kara  wrote:
>
> Changed subject to better match what we discuss and added btrfs list to CC.
>
> On Thu 22-11-18 17:18:25, Amir Goldstein wrote:
> > On Thu, Nov 22, 2018 at 3:26 PM Jan Kara  wrote:
> > >
> > > On Thu 22-11-18 14:36:35, Amir Goldstein wrote:
> > > > > > Regardless, IIUC, btrfs_statfs() returns an fsid which is associated
> > > > > > with the single super block struct, so all dentries in all subvolumes
> > > > > > will return the same fsid: btrfs_sb(dentry->d_sb)->fsid.
> > > > >
> > > > > This is not true AFAICT. Looking at btrfs_statfs() I can see:
> > > > >
> > > > > buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ be32_to_cpu(fsid[2]);
> > > > > buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ be32_to_cpu(fsid[3]);
> > > > > /* Mask in the root object ID too, to disambiguate subvols */
> > > > > buf->f_fsid.val[0] ^= BTRFS_I(d_inode(dentry))->root->root_key.objectid >> 32;
> > > > > buf->f_fsid.val[1] ^= BTRFS_I(d_inode(dentry))->root->root_key.objectid;
> > > > >
> > > > > So subvolume root is xored into the FSID. Thus dentries from different
> > > > > subvolumes are going to return different fsids...
> > > > >
> > > >
> > > > Right... how could I have missed that :-/
> > > >
> > > > I do not want to bring back FSNOTIFY_EVENT_DENTRY just for that
> > > > and I saw how many flaws you pointed to in $SUBJECT patch.
> > > > Instead, I will use:
> > > > statfs_by_dentry(d_find_any_alias(inode) ?: inode->i_sb->s_root,...
> > >
> > > So what about my proposal to store fsid in the notification mark when
> > > it is created and then use it when that mark results in event being
> > > generated? When mark is created, we have full path available, so
> > > getting proper fsid is trivial. Furthermore if the behavior is
> > > documented like that, it is fairly easy for userspace to track fsids
> > > it should care about and translate them to proper file descriptors
> > > for open_by_handle().
> > >
> > > This would get rid of statfs() on every event creation (which I don't like
> > > much) and also avoids these problems "how to get fsid for inode". What do
> > > you think?
> > >
> >
> > That's interesting. I like the simplicity.
> > What happens when there are 2 btrfs subvols /subvol1 /subvol2
> > and the application obviously doesn't know about this and does:
> > fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM, ... /subvol1);
> > statfs("/subvol1",...);
> > fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM, ... /subvol2);
> > statfs("/subvol2",...);
> >
> > Now the second fanotify_mark() call just updates the existing super block
> > mark mask, but does not change the fsid on the mark, so all events will have
> > fsid of subvol1 that was stored when first creating the mark.
>
> Yes.
>
> > fanotify-watch application (for example) hashes the watches (paths) under
> > /subvol2 by fid with fsid of subvol2, so events cannot get matched back to
> > "watch" (i.e. path).
>
> I agree this can be confusing... but with btrfs fanotify-watch will be
> confused even with your current code, won't it? Because FAN_MARK_FILESYSTEM
> on /subvol1 (with fsid A) is going to return also events on inodes from
> /subvol2 (with fsid B). So your current code will return event with fsid B
> which fanotify-watch has no way to match back and can get confused.
>
> So currently application can get events with fsid it has never seen, with
> the code as I suggest it can get "wrong" fsid. That is confusing but still
> somewhat better?

It's not better IMO because the purpose of the fid in event is a unique key
that application can use to match a path it already indexed.
wrong fsid => no match.
Using open_by_handle_at() to resolve unknown fid is a limited solution
and I don't think it is going to work in this case (see below).
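
To make that concrete: a watcher like fanotify-watch keys each indexed path
by (fsid, file handle) and matches events by the same key, so a "wrong" fsid
means the lookup simply misses. Below is a minimal sketch of building such a
key, assuming only statfs(2) and name_to_handle_at(2); the struct layout and
helper name are invented for illustration.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/statfs.h>

#define MAX_HANDLE_SZ 128   /* kernel-internal constant, redefined here */

struct watch_key {
        int fsid[2];                      /* statfs(2) f_fsid */
        int handle_type;
        unsigned int handle_len;
        unsigned char handle[MAX_HANDLE_SZ];
};

static int build_watch_key(const char *path, struct watch_key *key)
{
        struct statfs stfs;
        struct file_handle *fh;
        int mount_id, err = -1;

        /* fsid half of the key: the value under discussion above */
        if (statfs(path, &stfs) < 0)
                return -1;
        memcpy(key->fsid, &stfs.f_fsid, sizeof(key->fsid));

        /* handle half of the key: the bytes a fid-carrying event would match */
        fh = calloc(1, sizeof(*fh) + MAX_HANDLE_SZ);
        if (!fh)
                return -1;
        fh->handle_bytes = MAX_HANDLE_SZ;
        if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) == 0) {
                key->handle_type = fh->handle_type;
                key->handle_len = fh->handle_bytes;
                memcpy(key->handle, fh->f_handle, fh->handle_bytes);
                err = 0;
        }
        free(fh);
        return err;
}

An event carrying subvol1's fsid then cannot match an entry indexed under
subvol2's fsid, which is exactly the "wrong fsid => no match" failure above.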

>
> The core of the problem is that btrfs does not have "the superblock
> identifier" that would correspond to FAN_MARK_FILESYSTEM scope of events
> that we could use.
>
> > And when trying to 

Re: [PATCH 03/25] vfs: check file ranges before cloning files

2018-10-11 Thread Amir Goldstein
On Thu, Oct 11, 2018 at 4:43 PM Christoph Hellwig  wrote:
>
> > -EXPORT_SYMBOL(vfs_clone_file_prep_inodes);
> > +EXPORT_SYMBOL(vfs_clone_file_prep);
>
> Btw, why isn't this EXPORT_SYMBOL_GPL?  It is rather Linux internal
> code, including some that I wrote which you lifted into the core
> in "vfs: refactor clone/dedupe_file_range common functions".

Because Al will shot down any attempt of those in vfs code:
https://lkml.org/lkml/2018/6/10/4

Thanks,
Amir.


Re: [PATCH v3 00/25] fs: fixes for serious clone/dedupe problems

2018-10-11 Thread Amir Goldstein
On Thu, Oct 11, 2018 at 7:12 AM Darrick J. Wong  wrote:
>
> Hi all,
>
> Dave, Eric, and I have been chasing a stale data exposure bug in the XFS
> reflink implementation, and tracked it down to reflink forgetting to do
> some of the file-extending activities that must happen for regular
> writes.
>
> We then started auditing the clone, dedupe, and copyfile code and
> realized that from a file contents perspective, clonerange isn't any
> different from a regular file write.  Unfortunately, we also noticed
> that *unlike* a regular write, clonerange skips a ton of overflow
> checks, such as validating the ranges against s_maxbytes, MAX_NON_LFS,
> and RLIMIT_FSIZE.  We also observed that cloning into a file did not
> strip security privileges (suid, capabilities) like a regular write
> would.  I also noticed that xfs and ocfs2 need to dump the page cache
> before remapping blocks, not after.
>
> In fixing the range checking problems I also realized that both dedupe
> and copyfile tell userspace how much of the requested operation was
> acted upon.  Since the range validation can shorten a clone request (or
> we can ENOSPC midway through), we might as well plumb the short
> operation reporting back through the VFS indirection code to userspace.
>
> So, here's the whole giant pile of patches[1] that fix all the problems.
> This branch is against 4.19-rc7 with Dave Chinner's XFS for-next branch.
> The patch "generic: test reflink side effects" recently sent to fstests
> exercises the fixes in this series.  Tests are in [2].
>
> --D
>
> [1] 
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=djwong-devel
> [2] 
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=djwong-devel

I tested your branch with overlayfs over xfs.
I did not observe any failures with -g clone except for test generic/937
which also failed on xfs in my test.

I thought that you forgot to mention I needed to grab xfsprogs from djwong-devel
for commit e84a9e93 ("xfs_io: dedupe command should only complain
if we don't dedupe anything"), but even with this change the test still fails:

generic/937 - output mismatch (see
/old/home/amir/src/fstests/xfstests-dev/results//generic/937.out.bad)
--- tests/generic/937.out   2018-10-11 08:23:00.630938364 +0300
+++ /old/home/amir/src/fstests/xfstests-dev/results//generic/937.out.bad
   2018-10-11 10:54:40.448134832 +0300
@@ -4,8 +4,7 @@
 39578c21e2cb9f6049b1cf7fc7be12a6  TEST_DIR/test-937/file2
 Files 1-2 do not match (intentional)
 (partial) dedupe the middle blocks together
-deduped / bytes at offset 
-XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+XFS_IOC_FILE_EXTENT_SAME: Extents did not match.
 Compare sections

One thing that *is* different with the overlayfs test is that filefrag crashes
on this same test:

QA output created by 937
Create the original files
35ac8d7917305c385c30f3d82c30a8f6  TEST_DIR/test-937/file1
39578c21e2cb9f6049b1cf7fc7be12a6  TEST_DIR/test-937/file2
Files 1-2 do not match (intentional)
(partial) dedupe the middle blocks together
XFS_IOC_FILE_EXTENT_SAME: Extents did not match.
./tests/generic/937: line 59: 19242 Floating point exception(core
dumped) ${FILEFRAG_PROG} -v $testdir/file1 >> $seqres.full
./tests/generic/937: line 60: 19244 Floating point exception(core
dumped) ${FILEFRAG_PROG} -v $testdir/file2 >> $seqres.full

It looks like an overlayfs v4.19-rc1 regression - FIGETBSZ returns zero.
I never noticed this regression before, because none of the generic tests
are using filefrag.
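
For context: filefrag divides by the block size reported by FIGETBSZ when
computing expected extents, so a zero return plausibly yields exactly this
SIGFPE. A minimal probe for the regression might look like the sketch below
(illustrative only, not part of the test suite):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FIGETBSZ */

int main(int argc, char **argv)
{
        int fd, bsz = 0;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                return 1;
        if (ioctl(fd, FIGETBSZ, &bsz) < 0)
                return 1;
        printf("FIGETBSZ: %d\n", bsz);  /* 0 here reproduces the crash */
        return bsz == 0;
}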

Thanks,
Amir.


Re: [PATCH 17/25] vfs: enable remap callers that can handle short operations

2018-10-10 Thread Amir Goldstein
On Thu, Oct 11, 2018 at 7:14 AM Darrick J. Wong  wrote:
>
> From: Darrick J. Wong 
>
> Plumb in a remap flag that enables the filesystem remap handler to
> shorten remapping requests for callers that can handle it.  Now
> copy_file_range can report partial success (in case we run up against
> alignment problems, resource limits, etc.).
>
> Signed-off-by: Darrick J. Wong 
> Reviewed-by: Amir Goldstein 
> ---
>  fs/read_write.c|   15 +--
>  include/linux/fs.h |7 +--
>  mm/filemap.c   |   16 
>  3 files changed, 26 insertions(+), 12 deletions(-)
>
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 6ec908f9a69b..3713893b7e38 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1593,7 +1593,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, 
> loff_t pos_in,
>
> cloned = file_in->f_op->remap_file_range(file_in, pos_in,
> file_out, pos_out,
> -   min_t(loff_t, MAX_RW_COUNT, len), 0);
> +   min_t(loff_t, MAX_RW_COUNT, len),
> +   RFR_CAN_SHORTEN);
> if (cloned > 0) {
> ret = cloned;
> goto done;
> @@ -1804,16 +1805,18 @@ int generic_remap_file_range_prep(struct file 
> *file_in, loff_t pos_in,
>  * If the user is attempting to remap a partial EOF block and
>  * it's inside the destination EOF then reject it.
>  *
> -* We don't support shortening requests, so we can only reject
> -* them.
> +* If possible, shorten the request instead of rejecting it.
>  */
> if (is_dedupe)
> ret = -EBADE;
> else if (pos_out + *len < i_size_read(inode_out))
> ret = -EINVAL;
>
> -   if (ret)
> -   return ret;
> +   if (ret) {
> +   if (!(remap_flags & RFR_CAN_SHORTEN))
> +   return ret;
> +   *len &= ~blkmask;
> +   }
> }
>
> return 1;
> @@ -2112,7 +2115,7 @@ int vfs_dedupe_file_range(struct file *file, struct 
> file_dedupe_range *same)
>
> deduped = vfs_dedupe_file_range_one(file, off, dst_file,
> info->dest_offset, len,
> -   0);
> +   RFR_CAN_SHORTEN);

You did not update the WARN_ON_ONCE in vfs_dedupe_file_range_one()
to allow this flag and did not mention dedupe in the commit message.
Was that change intentional in this patch?

After the RFR_SHORT_DEDUPE patch the end result is fine.
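
For context, the short-operation semantics plumbed through here mean a
userspace caller should treat copy_file_range(2) like write(2) and loop on
short returns. A minimal sketch, assuming the glibc copy_file_range() wrapper:

#define _GNU_SOURCE
#include <unistd.h>

/* Copy len bytes, retrying on partial progress; returns bytes copied,
 * possibly short on EOF, or -1 on error (errno set). */
ssize_t copy_all(int fd_in, int fd_out, size_t len)
{
        size_t done = 0;

        while (done < len) {
                ssize_t n = copy_file_range(fd_in, NULL, fd_out, NULL,
                                            len - done, 0);
                if (n < 0)
                        return -1;
                if (n == 0)             /* EOF: nothing left to copy */
                        break;
                done += n;
        }
        return done;
}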

Thanks,
Amir.


Re: [PATCH 1/4] vfs: introduce function to map unique ino/dev pairs

2018-07-31 Thread Amir Goldstein
On Wed, Aug 1, 2018 at 2:21 AM, Mark Fasheh  wrote:
>
> On Tue, Jul 31, 2018 at 02:10:42PM -0700, Mark Fasheh wrote:
>> diff --git a/fs/stat.c b/fs/stat.c
>> index f8e6fb2c3657..80ea42505219 100644
>> --- a/fs/stat.c
>> +++ b/fs/stat.c
>> @@ -84,6 +84,29 @@ int vfs_getattr_nosec(const struct path *path, struct 
>> kstat *stat,
>>  }
>>  EXPORT_SYMBOL(vfs_getattr_nosec);
>>
>> +void vfs_map_unique_ino_dev(struct dentry *dentry, u64 *ino, dev_t *dev)
>> +{
>> + int ret;
>> + struct path path;
>> + struct kstat stat;
>> +
>> + path.mnt = NULL;
>> + path.dentry = dentry;
>> + /* ->dev is always returned, so we only need to specify ino here */
>> + ret = vfs_getattr_nosec(&path, &stat, STATX_INO, 0);
>> + if (ret) {
>> + struct inode *inode = d_inode(dentry);
>> + /* Fallback to old behavior in case of getattr error */
>> + *ino = inode->i_ino;
>> + *dev = inode->i_sb->s_dev;
>> + return ret;
>
> Ooof, attached is a version of this patch which doesn't try to return a value
> from a void function. Apologies for the extra e-mail.
>
> From Mark Fasheh 
>
> [PATCH 1/4] vfs: introduce function to map unique ino/dev pairs
>
> There are places in the VFS where we export an ino/dev pair to userspace.
> /proc/PID/maps is a good example - we directly expose inode->i_ino and
> inode->i_sb->s_dev to userspace there.  Many filesystems don't put a unique
> value in inode->i_ino and instead rely on ->getattr to provide the real
> inode number to userspace.  super->s_dev is similar - some filesystems
> expose a different device from what's put in super->s_dev when queried via
> stat/statx.
>
> Ultimately this makes it impossible for a user (or software) to match one of
> those reported pairs to the right inode.
>
> We can fix this by adding a helper function, vfs_map_unique_ino_dev(), which
> will query the owning filesystem (via ->getattr) to get the correct ino/dev
> pair.  Later patches will update those places which simply dump inode->i_ino
> and super->s_dev to use the helper.
>
> Signed-off-by: Mark Fasheh 
> ---
>  fs/stat.c  | 22 ++
>  include/linux/fs.h |  2 ++
>  2 files changed, 24 insertions(+)
>
> diff --git a/fs/stat.c b/fs/stat.c
> index f8e6fb2c3657..20c72d618ed5 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -84,6 +84,28 @@ int vfs_getattr_nosec(const struct path *path, struct 
> kstat *stat,
>  }
>  EXPORT_SYMBOL(vfs_getattr_nosec);
>
> +void vfs_map_unique_ino_dev(struct dentry *dentry, u64 *ino, dev_t *dev)

I find this function name promises a bit more than the function can guarantee.
It's just a fancy wrapper around ->getattr().
How about vfs_get_ino_dev() ?

Thanks,
Amir.


Re: [PATCH 4/4] locks: map correct ino/dev pairs when exporting to userspace

2018-07-31 Thread Amir Goldstein
On Wed, Aug 1, 2018 at 12:10 AM, Mark Fasheh  wrote:
> /proc/locks does not always print the correct inode number/device pair.
> Update lock_get_status() to use vfs_map_unique_ino_dev() to get the real,
> unique values for userspace.
>
> Signed-off-by: Mark Fasheh 
> ---
>  fs/locks.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index db7b6917d9c5..3a012df87fd8 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2621,6 +2621,7 @@ static void lock_get_status(struct seq_file *f, struct 
> file_lock *fl,
> loff_t id, char *pfx)
>  {
> struct inode *inode = NULL;
> +   struct dentry *dentry;
> unsigned int fl_pid;
> struct pid_namespace *proc_pidns = 
> file_inode(f->file)->i_sb->s_fs_info;
>
> @@ -2633,8 +2634,10 @@ static void lock_get_status(struct seq_file *f, struct 
> file_lock *fl,
> if (fl_pid == 0)
> return;
>
> -   if (fl->fl_file != NULL)
> +   if (fl->fl_file != NULL) {
> inode = locks_inode(fl->fl_file);
> +   dentry = file_dentry(fl->fl_file);
> +   }
>
> seq_printf(f, "%lld:%s ", id, pfx);
> if (IS_POSIX(fl)) {
> @@ -2681,10 +2684,13 @@ static void lock_get_status(struct seq_file *f, 
> struct file_lock *fl,
>: (fl->fl_type == F_WRLCK) ? "WRITE" : "READ 
> ");
> }
> if (inode) {
> +   __u64 ino;
> +   dev_t dev;
> +
> +   vfs_map_unique_ino_dev(dentry, &ino, &dev);
> /* userspace relies on this representation of dev_t */
> seq_printf(f, "%d %02x:%02x:%ld ", fl_pid,
> -   MAJOR(inode->i_sb->s_dev),
> -   MINOR(inode->i_sb->s_dev), inode->i_ino);
> +   MAJOR(dev), MINOR(dev), inode->i_ino);

Don't you mean ", ino);" here?

Thanks,
Amir.


Re: Exporting a unique ino/dev pair to user space

2018-06-08 Thread Amir Goldstein
On Fri, Jun 8, 2018 at 12:06 AM, Mark Fasheh  wrote:
[...]

>> >  2c) I don't think we can really use a dedicated callback without
>> >  passing the vfsmount through since Overlayfs ->getattr might call
>> >  the lower fs ->getattr. At that point we might as well use getattr.
>> >
>>
>> Didn't get the Overlayfs lower fs getattr argument.
>> Overlayfs doesn't use the vfsmount passed into getattr
>> and it could very well pass a dentry to lower fs getattr.
>
> My main point in 2c) is that, from my understanding, Overlayfs may need to
> call down to one of the filesystem ->getattr() calls. Since those take a
> vfsmount we don't gain anything by having a unique callback from this - the
> plumbing work would be the same.
>

I guess I don't understand what you mean by "dedicated callback", but I
think we are both in agreement that changing fs getattr() to take a dentry
is preferred.

>
>> As a matter of fact, out of 35 getattr implementations in the kernel:
>> (git grep "\s\.getattr\s" fs| awk '{print $4}'| sort -u|grep -v
>> "nfs.*_proc_getattr"|wc -l)
>> there is only one using the vfsmount - nfs_getattr() for MNT_NOATIME
>> check and most of them only ever use d_inode(path->dentry).
>>
>> This API seems quite odd.
>> Maybe it should be fixed so more in kernel call sites could call getattr
>> without a vfsmount.
>> Not sure what would be the best way to handle nfs_getattr().
>
> Yeah I saw that nfs_getattr() is the only user of the vfsmount. I totally
> agree that this would be tons easier if we didn't have to pass the vfsmount
> (and like you said, there's only the ONE user).
>
> This is a bit hacky, but I wonder if we could blow the function signature
> back out to a dentry + vfsmount and make the vfsmount optional when getattr
> is called only for ino/dev. It's a bit ugly to have optional arguments like
> that but nfs would work with just a line or two change and the other fs
> would never even care.

OR.. change the signature of fs getattr() and vfs_getattr_nosec() to dentry
and set a kernel query_flag AT_STATX_NO_REMOTE_ATIME from
vfs_getattr(). No need to pass vfsmount to nfs.

Then users that access i_ino/s_dev can call vfs_getattr_nosec() with a
bonus of not having to go through security modules (which now they don't).

The only current user of vfs_getattr_nosec() expfs.c queries STATX_INO,
so change is safe.

Overlayfs doesn't need to get vfsmount in ovl_getattr() - it knows the
lower fs vfsmount for calling vfs_getattr() regardless and in other
places overlayfs just needs to query STATX_INO, so it may also
use the light vfs_getattr_nosec() callback.

Thanks,
Amir.


Re: Exporting a unique ino/dev pair to user space

2018-06-07 Thread Amir Goldstein
On Thu, Jun 7, 2018 at 12:38 AM, Mark Fasheh  wrote:
> Hi,
>
> We have an inconsistency in how the kernel is exporting inode number /
> device pairs for user space. There's of course stat(2) and statx(2),
> but aside from those we simply dump inode->i_ino and super->s_dev. In
> some cases, the dumped values differ from what is returned via stat(2)
> or statx(2). Some filesystems might even show duplicate (but
> internally different!) pairs when the raw i_ino/s_dev is used.
>
> Some examples where we dump raw ino/dev:
>
> - /proc//maps. I've written about how this confuses lsof(8):
>   https://marc.info/?l=linux-btrfs&m=130074451403261&w=2
>
> - Unsurprisingly, many VFS tracepoints dump ino and/or dev. See
>   trace/events/lock.h or trace/events/writeback.h for examples.
>
> - eventpoll also dumps the raw ino/dev pair via ep_show_fdinfo()
>
> - Audit records the raw ino/dev and passes them around. We do seem to
>   have paths printed from audit as well, but if it's printed with the
>   wrong ino/dev pair I believe my point still stands.
>

knfsd also looks at i_ino. I converted one or two of these places
to getattr callbacks recently, but there are still some more.

Anyway, w.r.t. Overlayfs, I believe Miklos' work to make file_inode()
an overlay inode should resolve several of the issues listed above -
probably not all, but I didn't check every tracepoint.

>
> This breaks software which expects these pairs to be unique, and can
> put the user in a situation where they might not be able to find an
> inode referenced from the kernel. What's even worse - depending on how
> ino is exported, they might even find the *wrong* inode.
>
> I also should point out that we're likely at this point because
> stat(2) has been using an unsigned long for ino. On a 32 bit system,
> it would have been impossible for the user to get the real inode
> number in the first place. So there probably wasn't much we could do.
>
> These days though, we have statx(2) which will do the right thing on
> all platforms so we no longer have that excuse. The user ought to be
> able to take an ino/dev pair and ultimately find the actual file on
> their system, partially with the help of statx(2).
>
>
> Some examples of how ino is manipulated by filesystems:
>
> - On 64 bit i_ino and kstat->ino tend to be filled in correctly (from
>   what I can tell). stat->dev is not always consistent with super->s_dev.
>
> - On 32 bits, many filesystems employ a transformation to squeeze a 64
>   bit identifier into 32 bits. The exact methods are fs specific,
>   what's important is that we're losing information, introducing the
>   possibility of duplicate inode numbers.
>
> - On all platforms, Btrfs and Overlayfs can have duplicate inode
>   numbers. Of course, device can be different across the fs as well
>   with the kernel reporting s_dev and these filesystems returning
>   a different device via stat() or statx().
>
>
> Getting the inode number portion of this pair fixed would immediately
> solve the situation for all filesystems except Btrfs and
> Overlayfs - they report a different device from stat.
>
> Regarding the device portion of the pair, I'm honestly not sure
> whether Overlayfs cares, and my attempts to fix the s_dev situation
> for Btrfs have all come to the same dead ends that I've hit briefly
> looking into this inode number issue - the problems are intrinsically
> linked.
>
> So my questions are:
>
> 1) Do we care about this? On one hand, we've had this inconsistency
>for a long time, for various reasons. On the other hand, I can point
>to bugzilla's where these inconsistencies have become a problem.
>
>In the case that we don't care, any fs-internal solutions are
>likely to be extremely disruptive to the end user.
>
>
> 2) If we do care, how do we fix this?
>
>  2a) Do we use 64 bit i_ino on 32 bit systems? This has some obvious
>  downsides from a memory usage standpoint. Also it doesn't fully
>  address the issue - we still have a device field that Btrfs and
>  Overlayfs override.
>
>  We could combine this with an intermediate structure between the
>  inode and super block so s_dev could be abstracted out. See my
>  fs_view patches for an example of how this could be done:
>  https://www.spinics.net/lists/linux-fsdevel/msg125492.html
>
>  2b) Do we call ->getattr for this information? Plumbing a vfsmount to
>  various regions of the kernel like audit, or some of the deeper
>  tracepoints looks ugly and prone to life-timing issues (which
>  vfsmount do we use?). On the upside, we could probably make it
>  really light by only sending down the STATX_INO flag and letting
>  the filesystem optimize accordingly.
>
>  2c) I don't think we can really use a dedicated callback without
>  passing the vfsmount through since Overlayfs ->getattr might call
>  the lower fs ->getattr. At that point we might as well use getattr.
>

Didn't get the Overlayfs lower fs getattr argument.
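
Mark's statx(2) point above can be made concrete with a short userspace
sketch; it assumes the glibc statx(3) wrapper and is purely illustrative:

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>              /* AT_FDCWD */
#include <sys/stat.h>

int main(int argc, char **argv)
{
        struct statx stx;

        if (argc < 2 || statx(AT_FDCWD, argv[1], 0, STATX_INO, &stx) < 0)
                return 1;
        /* compare with the raw ino/dev pair dumped in /proc/PID/maps etc. */
        printf("ino=%llu dev=%u:%u\n",
               (unsigned long long)stx.stx_ino,
               stx.stx_dev_major, stx.stx_dev_minor);
        return 0;
}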

Re: [PATCH 2/2] vfs: dedupe should return EPERM if permission is not granted

2018-05-11 Thread Amir Goldstein
On Sat, May 12, 2018 at 3:06 AM, Darrick J. Wong wrote:
> On Fri, May 11, 2018 at 12:26:51PM -0700, Mark Fasheh wrote:
>> Right now we return EINVAL if a process does not have permission to dedupe a
>> file. This was an oversight on my part. EPERM gives a true description of
>> the nature of our error, and EINVAL is already used for the case that the
>> filesystem does not support dedupe.
>>
>> Signed-off-by: Mark Fasheh 
>> ---
>>  fs/read_write.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 77986a2e2a3b..8edef43a182c 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -2038,7 +2038,7 @@ int vfs_dedupe_file_range(struct file *file, struct 
>> file_dedupe_range *same)
>>   info->status = -EINVAL;
>>   } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE) ||
>>uid_eq(current_fsuid(), dst->i_uid))) {
>> - info->status = -EINVAL;
>> + info->status = -EPERM;
>
> Hmm, are we allowed to change this aspect of the kabi after the fact?
>
> Granted, we're only trading one error code for another, but will the
> existing users of this care?  xfs_io won't and I assume duperemove won't
> either, but what about bees? :)
>

Relaxing -EINVAL is the common case with kabi.
Every new flag we add support for does that, and it is also common
that a new flag we support is restricted to certain capabilities, so
EINVAL is replaced with EPERM.
BTW, man page doesn't say anything about the is_admin case.

IMO EPERM makes perfect sense here and btw, we also return
EPERM from overlayfs if dst is on lower layer.
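
For readers unfamiliar with where this status surfaces: the FIDEDUPERANGE
ioctl can itself return 0 while individual destination ranges carry a
negative status, which is where the EINVAL-vs-EPERM distinction becomes
visible to userspace. A minimal sketch (illustrative helper, not from the
patch):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FIDEDUPERANGE */

/* Dedupe len bytes at off of src_fd against the same range of dst_fd;
 * prints the per-range status where -EPERM vs -EINVAL shows up. */
static int dedupe_one(int src_fd, int dst_fd, __u64 off, __u64 len)
{
        struct file_dedupe_range *arg;
        int ret;

        arg = calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
        if (!arg)
                return -1;
        arg->src_offset = off;
        arg->src_length = len;
        arg->dest_count = 1;
        arg->info[0].dest_fd = dst_fd;
        arg->info[0].dest_offset = off;

        ret = ioctl(src_fd, FIDEDUPERANGE, arg);
        if (ret == 0 && arg->info[0].status < 0)
                fprintf(stderr, "range status: %s\n",
                        strerror(-arg->info[0].status));
        free(arg);
        return ret;
}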

Mark,

Please be aware that these changes conflict with Miklos' dedupe-cleanup
patches, so I suggest you collaborate on that
https://marc.info/?l=linux-fsdevel&m=152568128031031&w=2

Thanks,
Amir.


Re: Symlink not persisted even after fsync

2018-04-15 Thread Amir Goldstein
On Mon, Apr 16, 2018 at 3:10 AM, Vijay Chidambaram  wrote:
[...]
> Consider the following workload:
>
>  creat foo
>  link (foo, A/bar)
>  fsync(foo)
>  crash
>
> In this case, after the file system recovers, do we expect foo's link
> count to be 2 or 1? I would say 2, but POSIX is silent on this, so
> thought I would confirm. The tricky part here is we are not calling
> fsync() on directory A.
>
> In this case, its not a symlink; its a hard link, so I would say the
> link count for foo should be 2. But btrfs and F2FS show link count of
> 1 after a crash.
>

That sounds like a clear bug - nlink is metadata of inode foo, so
should be made persistent by fsync(foo).

For non-journaled fs you would need to fsync(A) to guarantee
seeing A/bar after crash, but for a journaled fs, if you didn't see
A/bar after crash and did see nlink 2 on foo then you would get
a filesystem inconsistency, so practically, fsync(foo) takes care
of persisting A/bar entry as well. But as you already understand,
these rules have not been formalized by a standard, instead, they
have been "formalized" by various fsck.* tools.

Allow me to suggest a different framing for CrashMonkey.
You seem to be engaging in discussions with the community
about whether X behavior is a bug or not and as you can see
the answer depends on the filesystem (and sometimes on the
developer). Instead, you could declare that CrashMonkey
is a "Certification tool" to certify filesystems to a certain
crash consistency behavior. Then you can discuss with the
community about specific models that CrashMonkey should
be testing. The model describes the implicit dependencies
and ordering guarantees between operations.
Dave has mentioned the "strictly ordered metadata" model.
I do not know of any formal definition of this model for filesystems,
but you can take a shot at starting one and encoding it into
CrashMonkey. This sounds like a great paper to me.

I don't know if Btrfs and f2fs will qualify as "strictly ordered
metadata" and I don't know if they would want to qualify.
Mind you a filesystem can be crash consistent without
following "strictly ordered metadata". In fact, in many cases
"strictly ordered metadata" imposes performance penalty by
coupling together unrelated metadata updates (e.g. create
A/a and create B/b), but it is also quite hard to decouple them
because future operation can create a dependency (e.g.
mv A/a B/b).

Thanks,
Amir.


Re: Symlink not persisted even after fsync

2018-04-13 Thread Amir Goldstein
On Fri, Apr 13, 2018 at 3:54 PM, Vijay Chidambaram <vi...@cs.utexas.edu> wrote:
> Hi Amir,
>
> Thanks for the reply!
>
> On Fri, Apr 13, 2018 at 12:52 AM, Amir Goldstein <amir7...@gmail.com> wrote:
>>
>> Not a bug.
>>
>> From man 2 fsync:
>>
>> "Calling  fsync() does not necessarily ensure that the entry in the
>>  directory containing the file has also reached disk.  For that an
>>  explicit fsync() on a file descriptor for the directory is also needed."
>
>
> Are we understanding this right:
>
> ext4 and xfs fsync the parent directory if a sym link file is fsync-ed. But
> btrfs does not. Is this what we are seeing?

Nope.

You are seeing an unintentional fsync of parent, because both
parent update and symlink update are metadata updates that are
tracked by the same transaction.

fsync of symlink forces the current transaction to the journal,
pulling in the parent update with it.


>
> I agree that fsync of a file does not mean fsync of its directory entry, but
> it seems odd to do it for regular files and not for sym links. We do not see
> this behavior if we use a regular file instead of a sym link file.
>

fsync of regular file behaves differently than fsync of non regular file.
I suggest this read:
https://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/

>>
>> There is a reason why this behavior is not being reproduced in
>> ext4/xfs, but you should be able to reproduce a similar issue
>> like this:
>>
>>
>> 1. symlink (foo, bar.tmp)
>> 2. open bar.tmp
>> 3. fsync bar.tmp
>> 4. rename(bar.tmp, bar)
>> 5. fsync bar
>> crash here
>
>
> I'm guessing xfs/ext4 detect the symlink-fsync pattern and fsync the parent
> dir in our workload, but would miss it because of the rename in the workload
> you provided?
>

No pattern detection by xfs/ext4 AFAIK.
rename does not change metadata of victim, so fsync(bar)
may (depending on fs) trigger no metadata transaction commit.
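
Amir's suggested variant, as a runnable sketch (assuming a file foo already
exists so the open through the symlink succeeds):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd;

        symlink("foo", "bar.tmp");      /* 1. symlink (foo, bar.tmp) */
        fd = open("bar.tmp", O_RDONLY); /* 2. open bar.tmp (follows to foo) */
        if (fd < 0)
                return 1;
        fsync(fd);                      /* 3. fsync bar.tmp */
        rename("bar.tmp", "bar");       /* 4. rename(bar.tmp, bar) */
        fsync(fd);                      /* 5. fsync bar: may commit no
                                         *    metadata transaction, per above */
        /* crash here: is "bar" present after recovery? */
        return 0;
}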

Thanks,
Amir.


Re: Symlink not persisted even after fsync

2018-04-12 Thread Amir Goldstein
On Thu, Apr 12, 2018 at 8:51 PM, Jayashree Mohan wrote:
> Hi,
>
> We came across what seems to be a crash consistency bug on btrfs and
> f2fs. When we create a symlink for a file and fsync the symlink, on
> recovery from crash, the fsync-ed file is missing.
>
> You can reproduce this behaviour using this minimal workload :
>
> 1. symlink (foo, bar)
> 2. open bar
> 3. fsync bar
> crash here
>
> When we recover, we find that file bar is missing. This behaviour
> seems unexpected as the fsynced file is lost on a crash. ext4 and xfs
> correctly recovers file bar. This seems like a bug. If not, could you
> explain why?
>

Not a bug.

From man 2 fsync:

"Calling  fsync() does not necessarily ensure that the entry in the
 directory containing the file has also reached disk.  For that an
 explicit fsync() on a file descriptor for the directory is also needed."

There is a reason why this behavior is not being reproduced in
ext4/xfs, but you should be able to reproduce a similar issue
like this:

1. symlink (foo, bar.tmp)
2. open bar.tmp
3. fsync bar.tmp
4. rename(bar.tmp, bar)
5. fsync bar
crash here
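
The man page excerpt quoted above translates into the usual pattern of
fsyncing an O_DIRECTORY fd for the parent as well; a minimal sketch:

#include <fcntl.h>
#include <unistd.h>

/* Persist both the file's data/metadata and its directory entry. */
static int fsync_with_parent(const char *dirpath, const char *filepath)
{
        int fd = open(filepath, O_WRONLY);
        int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
        int err = -1;

        if (fd >= 0 && dfd >= 0 &&
            fsync(fd) == 0 &&           /* the file itself */
            fsync(dfd) == 0)            /* the entry in its parent */
                err = 0;
        if (fd >= 0)
                close(fd);
        if (dfd >= 0)
                close(dfd);
        return err;
}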

Thanks,
Amir.


Re: [PATCH v2 3/3] fstests: generic: Check the fs after each FUA writes

2018-03-27 Thread Amir Goldstein
On Wed, Mar 28, 2018 at 7:40 AM, Qu Wenruo <w...@suse.com> wrote:
> Basic test case which triggers fsstress with dm-log-writes, and then
> check the fs after each FUA writes.
> With needed infrastructure and special handlers for journal based fs.
>
> Signed-off-by: Qu Wenruo <w...@suse.com>
> ---
> changelog:
> v2:
>   Use proper "SUSE Linux Products GmbH" instead of "SuSE"
>   Get rid of dm-snapshot which is pretty slow if we're creating and
>   deleting snapshots repeatedly.
>   (Maybe LVM thin provision would be much better, but current replay
>solution is good so far, and no slower than dm-snapshot)
>   Add back run_check so that we can get the seed.
> ---
> Unfortunately, neither xfs nor ext4 survives this test for even a single
> success, while btrfs survives.
> (Although not what I want, I'm just trying my luck
> to reproduce a serious btrfs corruption situation)
>
> Although btrfs may be the fastest fs for the test, since it has fixed
> small amount of write in mkfs and almost nothing to replay, it still
> takes about 240s~300s to finish (the same level using snapshot).
>
> It would take longer time for ext4 for its large amount of write during
> mkfs, if it can survive the test in the first place.

Hmm, how much time would the total test take if you don't fail
it on fsck? I am asking because it may be possible to improve this with
only a single snapshot after mkfs.

Anyway, if total test run time is expected to be under 10min I wouldn't
bother with this optimization, at least not right now. IMO it is more
important to get the test out there to get the corruption bugs floating.

Thanks for working on this!
You can add
Reviewed-by: Amir Goldstein <amir7...@gmail.com>


>
> As a comparison, btrfs takes about 5 seconds to replay log, mount,
> unmount and run fsck at the end of the test.
> But for ext4, it already takes about 5 seconds to do the same thing
> before triggering fsck error.
>
> Fsck fail for ext4:
> _check_generic_filesystem: filesystem on /dev/mapper/test-scratch1 is 
> inconsistent
> *** fsck.ext4 output ***
> fsck from util-linux 2.31.1
> e2fsck 1.43.8 (1-Jan-2018)
> Pass 1: Checking inodes, blocks, and sizes
> Inode 131076 extent tree (at level 1) could be shorter.  Fix? no
>
> Inode 131262, i_size is 0, should be 258048.  Fix? no
>
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
>
> For xfs:
> _check_xfs_filesystem: filesystem on /dev/mapper/test-scratch1 is 
> inconsistent (r)
> *** xfs_repair -n output ***
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
> - zero log...
> - scan filesystem freespace and inode maps...
> - found root inode chunk
> Phase 3 - for each AG...
> - scan (but don't clear) agi unlinked lists...
> - process known inodes and perform inode discovery...
> - agno = 0
> - agno = 1
> - agno = 2
> bad symlink header ino 8409190, file block 0, disk block 1051147
> problem with symbolic link in inode 8409190
> would have cleared inode 8409190
> - agno = 3
> - process newly discovered inodes...
> Phase 4 - check for duplicate blocks...
> - setting up duplicate extent list...
> - check for inodes claiming duplicate blocks...
> - agno = 0
> - agno = 1
> - agno = 3
> - agno = 2
> entry "lb" in shortform directory 8409188 references free inode 8409190
> would have junked entry "lb" in directory inode 8409188
> bad symlink header ino 8409190, file block 0, disk block 1051147
> problem with symbolic link in inode 8409190
> would have cleared inode 8409190
> No modify flag set, skipping phase 5
> Phase 6 - check inode connectivity...
> - traversing filesystem ...
> entry "lb" in shortform directory inode 8409188 points to free inode 8409190
> would junk entry
> - traversal finished ...
> - moving disconnected inodes to lost+found ...
> Phase 7 - verify link counts...
> Maximum metadata LSN (1:396) is ahead of log (1:63).
> Would format log to cycle 4.
> No modify flag set, skipping filesystem flush and exiting.
>
> And special note for XFS guys, I also hit an XFS internal metadata
> warning during journal replay:
> [ 7901.423659] XFS (dm-4): Starting recovery (logdev: internal)
> [ 7901.577511] XFS (dm-4): Metadata corruption detected at 
> xfs_dinode_verify+0x467/0x570 [xfs], inode 0x805067 dinode
> [ 7901.580529] XFS (dm-4): Unmount and run xfs_repair
> [ 7901.581901] XFS (dm-4): First 128 bytes of corrupted metadata buff

Re: [PATCH 3/3] fstests: generic: Check the fs after each FUA writes

2018-03-21 Thread Amir Goldstein
On Wed, Mar 21, 2018 at 10:01 AM, Qu Wenruo  wrote:
> Basic test case which triggers fsstress with dm-log-writes, and then
> check the fs after each FUA writes.
> With needed infrastructure and special handlers for journal based fs.
>
> Signed-off-by: Qu Wenruo 
> ---
> Unfortunately, neither xfs nor ext4 survives this test for even a single
> success, while btrfs survives.
> (Although not what I want, I'm just trying my luck
> to reproduce a serious btrfs corruption situation)
>
> Although btrfs may be the fastest fs for the test, since it has fixed
> small amount of write in mkfs and almost nothing to replay, it still
> takes about 240s~300s to finish (the same level using snapshot).
>
> It would take longer time for ext4 for its large amount of write during
> mkfs, if it can survive the test in the first place.
>
> As a comparison, btrfs takes about 5 seconds to replay log, mount,
> unmount and run fsck at the end of the test.
> But for ext4, it already takes about 5 seconds to do the same thing
> before triggering fsck error.
>
> Fsck fail for ext4:
> _check_generic_filesystem: filesystem on /dev/mapper/test-scratch1 is 
> inconsistent
> *** fsck.ext4 output ***
> fsck from util-linux 2.31.1
> e2fsck 1.43.8 (1-Jan-2018)
> Pass 1: Checking inodes, blocks, and sizes
> Inode 131076 extent tree (at level 1) could be shorter.  Fix? no
>
> Inode 131262, i_size is 0, should be 258048.  Fix? no
>
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
>
> For xfs:
> _check_xfs_filesystem: filesystem on /dev/mapper/test-scratch1 is 
> inconsistent (r)
> *** xfs_repair -n output ***
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
> - zero log...
> - scan filesystem freespace and inode maps...
> - found root inode chunk
> Phase 3 - for each AG...
> - scan (but don't clear) agi unlinked lists...
> - process known inodes and perform inode discovery...
> - agno = 0
> - agno = 1
> - agno = 2
> bad symlink header ino 8409190, file block 0, disk block 1051147
> problem with symbolic link in inode 8409190
> would have cleared inode 8409190
> - agno = 3
> - process newly discovered inodes...
> Phase 4 - check for duplicate blocks...
> - setting up duplicate extent list...
> - check for inodes claiming duplicate blocks...
> - agno = 0
> - agno = 1
> - agno = 3
> - agno = 2
> entry "lb" in shortform directory 8409188 references free inode 8409190
> would have junked entry "lb" in directory inode 8409188
> bad symlink header ino 8409190, file block 0, disk block 1051147
> problem with symbolic link in inode 8409190
> would have cleared inode 8409190
> No modify flag set, skipping phase 5
> Phase 6 - check inode connectivity...
> - traversing filesystem ...
> entry "lb" in shortform directory inode 8409188 points to free inode 8409190
> would junk entry
> - traversal finished ...
> - moving disconnected inodes to lost+found ...
> Phase 7 - verify link counts...
> Maximum metadata LSN (1:396) is ahead of log (1:63).
> Would format log to cycle 4.
> No modify flag set, skipping filesystem flush and exiting.
>
> And special note for XFS guys, I also hit an XFS internal metadata
> warning during journal replay:
> [ 7901.423659] XFS (dm-4): Starting recovery (logdev: internal)
> [ 7901.577511] XFS (dm-4): Metadata corruption detected at 
> xfs_dinode_verify+0x467/0x570 [xfs], inode 0x805067 dinode
> [ 7901.580529] XFS (dm-4): Unmount and run xfs_repair
> [ 7901.581901] XFS (dm-4): First 128 bytes of corrupted metadata buffer:
> [ 7901.583205] b8963f41: 49 4e a1 ff 03 02 00 00 00 00 00 00 00 00 00 
> 00  IN..
> [ 7901.584659] f35a50e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 00  
> [ 7901.586075] 386eea9e: 5a b2 0e 69 0a f3 43 27 5a b2 0e 69 0a f3 43 
> 27  Z..i..C'Z..i..C'
> [ 7901.587561] ac636661: 5a b2 0e 69 0d 92 bc 00 00 00 00 00 00 00 00 
> 00  Z..i
> [ 7901.588969] d75f9093: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 00  
> [ 7901.590475] d2af5688: 00 00 00 02 00 00 00 00 00 00 00 00 84 d7 6a 
> e9  ..j.
> [ 7901.591907] e8dd8211: ff ff ff ff 34 93 a9 3a 00 00 00 00 00 00 00 
> 04  4..:
> [ 7901.593319] b7610e4e: 00 00 00 01 00 00 00 45 00 00 00 00 00 00 00 
> 00  ...E
>
> ---
>  common/dmlogwrites|  56 +++
>  tests/generic/481 | 104 
> ++
>  tests/generic/481.out |   2 +
>  tests/generic/group   |   1 +
>  4 files changed, 163 insertions(+)
>  create mode 100755 tests/generic/481
>  create mode 100644 tests/generic/481.out
>
> diff --git a/common/dmlogwrites 

Re: [PATCH RFC 3/3] fstests: generic: Check the fs after each FUA writes

2018-03-16 Thread Amir Goldstein
On Fri, Mar 16, 2018 at 10:19 AM, Eryu Guan  wrote:
> On Fri, Mar 16, 2018 at 01:17:07PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2018-03-16 12:01, Eryu Guan wrote:
>> > On Wed, Mar 14, 2018 at 05:02:30PM +0800, Qu Wenruo wrote:
>> >> Basic test case which triggers fsstress with dm-log-writes, and then
>> >> check the fs after each FUA writes.
>> >> With needed infrastructure and special handlers for journal based fs.
>> >
>> > It's not clear to me why the existing infrastructure is not sufficient
>> > for the test. It'd be great if you could provide more information and/or
>> > background in commit log.
>>
>> The main problem of current infrastructure is we don't have the
>> following things:
>>
>> 1) Way to take full advantage of dm-log-writes
>>The main thing is, we don't have test cases to check each FUA (this
>>patch) and flush (later patch after clearing all the RFC comments).
>>
>>We have some dm-flakey test cases to emulate power loss, but they are
>>mostly for fsync.
>>Here we are not only testing fsync, but also every superblock update.
>>Which should be a super set of dm-flakey tests.
>>
>> 2) Workaround for journal replay
>>In fact, if we only test btrfs, we don't even need such complicated
>>work, just 'replay-log --fsck "btrfs check" --check fua' will be
>>enough. As btrfs check doesn't report dirty journal (log tree) as
>>problem.
>>But for journal based fses, their fsck all report dirty journal as
>>error, which needs current snapshot works to replay them before
>>running fsck.
>
> And replay-to-fua doesn't guarantee a consistent filesystem state,
> that's why we need to mount/umount the target device to replay the
> filesystem journal, and to avoid replaying already-replayed-log over and
> over again, we create a snapshot of the target device and mount cycle &
> fsck the snapshot, right?
>
> I'm wondering if the overhead of repeatly create & destroy snapshots is
> larger than replaying log from start. Maybe snapshots take more time?
>

FYI, the snapshots flavor comes from Josef's scripts and it is called fast***
I suppose this means the non-snapshot flavor is the original implementation
and it is slower. Josef?

>>
>> I would add them in next version if there is no extra comment on this.
>>
>> >
>> >>
>> >> Signed-off-by: Qu Wenruo 
>> >> ---
>> >> In my test, xfs and btrfs survive while ext4 would report error during
>> >> fsck.
>> >>
>> >> My current biggest concern is, we abuse $TEST_DEV and mkfs on it all by
>> >> ourselves. Not sure if it's allowed.
>> >
>> > As Amir already replied, that's not allowed, any destructive operations
>> > should be done on $SCRATCH_DEV.
>>
>> Yep, I'm looking for a similar case that uses $SCRATCH_DEV as LVM pv to get
>> an extra device.
>>
>> Or can we reuse the scratch_dev_pool even for ext4/xfs?
>
> I think so, IMO pool devices are not limited to btrfs. But I think we
> could use a loop device residing on $TEST_DIR? Or if snapshots take longer
> time, then we don't need this extra device at all :)
>
> I have some other comments, will reply to the RFC patch in another
> email.
>
> Thanks,
> Eryu


Re: [PATCH RFC 3/3] fstests: generic: Check the fs after each FUA writes

2018-03-14 Thread Amir Goldstein
On Wed, Mar 14, 2018 at 11:02 AM, Qu Wenruo  wrote:
> Basic test case which triggers fsstress with dm-log-writes, and then
> check the fs after each FUA writes.
> With needed infrastructure and special handlers for journal based fs.
>
> Signed-off-by: Qu Wenruo 
> ---
> In my test, xfs and btrfs survive while ext4 would report error during fsck.
>
> My current biggest concern is, we abuse $TEST_DEV and mkfs on it all by
> ourselves. Not sure if it's allowed.

Definitely not allowed.

I suppose you could either create an LVM volume on scratch for the tested
fs and leave room for a snapshot target on scratch or do something similar
with LOGWRITES_DEV, which may be more appropriate in the context of
common/dmlogwrites helper.

I see you are posting this generic series independently from the btrfs specific
series, but the two series have conflicting changes. How do you intend
the final merged result to look?


> ---
>  common/dmlogwrites| 119 
>  tests/generic/481 | 124 
> ++
>  tests/generic/481.out |   2 +
>  tests/generic/group   |   1 +
>  4 files changed, 246 insertions(+)
>  create mode 100755 tests/generic/481
>  create mode 100644 tests/generic/481.out
>
> diff --git a/common/dmlogwrites b/common/dmlogwrites
> index 467b872e..258f5887 100644
> --- a/common/dmlogwrites
> +++ b/common/dmlogwrites
> @@ -126,3 +126,122 @@ _log_writes_cleanup()
> $UDEV_SETTLE_PROG >/dev/null 2>&1
> _log_writes_remove
>  }
> +
> +# Convert log writes mark to entry number
> +# Result entry number is output to stdout, could be empty if not found
> +_log_writes_mark_to_entry_number()
> +{
> +   local _mark=$1
> +   local ret
> +
> +   [ -z "$_mark" ] && _fatal \
> +   "mark must be given for _log_writes_mark_to_entry_number"
> +
> +   ret=$($here/src/log-writes/replay-log --find --log $LOGWRITES_DEV \
> +   --end-mark $_mark 2> /dev/null)
> +   [ -z "$ret" ] && return
> +   ret=$(echo "$ret" | cut -f1 -d\@)
> +   echo "mark $_mark has entry number $ret" >> $seqres.full
> +   echo "$ret"
> +}
> +
> +# Find next fua write entry number
> +# Result entry number is output to stdout, could be empty if not found
> +_log_writes_find_next_fua()
> +{
> +   local _start_entry=$1
> +   local ret
> +
> +   [ -z "$_start_entry" ] && _start_entry=0
> +   ret=$($here/src/log-writes/replay-log --find --log $LOGWRITES_DEV \
> + --next-fua --start-entry $_start_entry 2> /dev/null)
> +   [ -z "$ret" ] && return
> +
> +   ret=$(echo "$ret" | cut -f1 -d\@)
> +   echo "next fua is entry number $ret" >> $seqres.full
> +   echo "$ret"
> +}
> +
> +# Replay log range to specified entry
> +# $1:  End entry. The entry with this number *WILL* be replayed
> +# $2:  Start entry. If not specified, start from the first entry.
> +# $3:  Verbose. If set to 'y' do verbose output
> +_log_writes_replay_log_entry_range()
> +{
> +   local _end=$1
> +   local _start=$2
> +   local _verbose=$3
> +
> +   [ -z "$_end" ] && _fatal \
> +   "end entry must be specified for _log_writes_replay_log_entry_range"
> +
> +   [ "x$verbose" == "xy" ] && _verbose="-v"

Those xy tricks are really not needed with bash.
"$verbose" == "y" is nicer and better.


> +   [ -z "$_start" ] && _start=0
> +   [ "$_start" -gt "$_end" ] && _fatal \
> +   "wrong parameter order for 
> _log_writes_replay_log_entry_range:start=$_start end=$_end"
> +
> +   # To ensure we replay the last entry, for _start == 0 case,
> +   # we need to manually increase the end entry number to ensure
> +   # it's played
> +   echo "=== replay from $_start to $_end ===" >> $seqres.full
> +   $here/src/log-writes/replay-log --log $LOGWRITES_DEV \
> +   --replay $SCRATCH_DEV --start-entry $_start \
> +   --limit $(($_end - $_start + 1)) $_verbose \
> +   >> $seqres.full 2>&1
> +   [ $? -ne 0 ] && _fatal "replay failed"
> +}
> +
> +_log_writes_cleanup_snapshot()
> +{
> +   $UDEV_SETTLE_PROG > /dev/null 2>&1
> +   $DMSETUP_PROG remove "$DMLOGWRITES_SNAPSHOT_NAME" > /dev/null 2>&1
> +   $DMSETUP_PROG remove "$DMLOGWRITES_ORIGIN_NAME" > /dev/null 2>&1
> +}
> +
> +# Helper to create snapshot of a the replayed device
> +# Useful for journal based filesystem such as XFS and Ext4 to replay
> +# their journal without touching the replay device, so that we can
> +# continue replaying other than replay from the beginning.
> +# $1:  Snapshot device
> +_log_writes_create_snapshot()
> +{
> +   _require_dm_target snapshot
> +
> +   local snapshot_dev=$1
> +   local cow_base=""
> +
> +   [ -z "$snapshot_dev" ] && _fatal \
> +   "@device must be specified for _log_writes_create_snapshot"
> +   local size=$(blockdev --getsz $SCRATCH_DEV)
> +

Re: [PATCH 2/3] fstests: log-writes: Add support for METADATA flag

2018-03-14 Thread Amir Goldstein
On Wed, Mar 14, 2018 at 11:02 AM, Qu Wenruo <w...@suse.com> wrote:
> Signed-off-by: Qu Wenruo <w...@suse.com>
Reviewed-by: Amir Goldstein <amir7...@gmail.com>

> ---
>  src/log-writes/log-writes.c | 3 ++-
>  src/log-writes/log-writes.h | 9 +
>  2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/src/log-writes/log-writes.c b/src/log-writes/log-writes.c
> index a872429d..5dc22c24 100644
> --- a/src/log-writes/log-writes.c
> +++ b/src/log-writes/log-writes.c
> @@ -130,7 +130,8 @@ struct flags_to_str_entry {
> DEFINE_LOG_FLAGS_STR_ENTRY(FLUSH),
> DEFINE_LOG_FLAGS_STR_ENTRY(FUA),
> DEFINE_LOG_FLAGS_STR_ENTRY(DISCARD),
> -   DEFINE_LOG_FLAGS_STR_ENTRY(MARK)
> +   DEFINE_LOG_FLAGS_STR_ENTRY(MARK),
> +   DEFINE_LOG_FLAGS_STR_ENTRY(METADATA)
>  };
>
>  #define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
> diff --git a/src/log-writes/log-writes.h b/src/log-writes/log-writes.h
> index 35ca3583..75fb8ac0 100644
> --- a/src/log-writes/log-writes.h
> +++ b/src/log-writes/log-writes.h
> @@ -20,10 +20,11 @@ typedef __u32 u32;
>  /*
>   * Constants copied from kernel file drivers/md/dm-log-writes.c
>   */
> -#define LOG_FLUSH_FLAG (1 << 0)
> -#define LOG_FUA_FLAG (1 << 1)
> -#define LOG_DISCARD_FLAG (1 << 2)
> -#define LOG_MARK_FLAG (1 << 3)
> +#define LOG_FLUSH_FLAG (1 << 0)
> +#define LOG_FUA_FLAG   (1 << 1)
> +#define LOG_DISCARD_FLAG   (1 << 2)
> +#define LOG_MARK_FLAG  (1 << 3)
> +#define LOG_METADATA_FLAG  (1 << 4)
>
>  #define WRITE_LOG_VERSION 1
>  #define WRITE_LOG_MAGIC 0x6a736677736872
> --
> 2.15.1
>


Re: [PATCH 1/3] fstests: log-writes: Add support to output human readable flags

2018-03-14 Thread Amir Goldstein
On Wed, Mar 14, 2018 at 11:02 AM, Qu Wenruo <w...@suse.com> wrote:
> Also change the flag numeric output to hex.
>
> Signed-off-by: Qu Wenruo <w...@suse.com>
Reviewed-by: Amir Goldstein <amir7...@gmail.com>

> ---
>  src/log-writes/log-writes.c | 70 
> -
>  1 file changed, 63 insertions(+), 7 deletions(-)
>
> diff --git a/src/log-writes/log-writes.c b/src/log-writes/log-writes.c
> index 09391574..a872429d 100644
> --- a/src/log-writes/log-writes.c
> +++ b/src/log-writes/log-writes.c
> @@ -120,6 +120,58 @@ int log_discard(struct log *log, struct log_write_entry 
> *entry)
> return 0;
>  }
>
> +#define DEFINE_LOG_FLAGS_STR_ENTRY(x)  \
> +   {LOG_##x##_FLAG, #x}
> +
> +struct flags_to_str_entry {
> +   u64 flags;
> +   const char *str;
> +} log_flags_table[] = {
> +   DEFINE_LOG_FLAGS_STR_ENTRY(FLUSH),
> +   DEFINE_LOG_FLAGS_STR_ENTRY(FUA),
> +   DEFINE_LOG_FLAGS_STR_ENTRY(DISCARD),
> +   DEFINE_LOG_FLAGS_STR_ENTRY(MARK)
> +};
> +
> +#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
> +#define LOG_FLAGS_BUF_SIZE 128
> +/*
> + * Convert numeric flags to human readable flags.
> + * @flags: numeric flags
> + * @buf:   output buffer for human readable string.
> + * must have enough space (LOG_FLAGS_BUF_SIZE) to contain all
> + * the string
> + */
> +static void entry_flags_to_str(u64 flags, char *buf)
> +{
> +   int empty = 1;
> +   int left_len;
> +   int i;
> +
> +   buf[0] = '\0';
> +   for (i = 0; i < ARRAY_SIZE(log_flags_table); i++) {
> +   if (flags & log_flags_table[i].flags) {
> +   if (!empty)
> +   strncat(buf, "|", LOG_FLAGS_BUF_SIZE);
> +   empty = 0;
> +   strncat(buf, log_flags_table[i].str, 
> LOG_FLAGS_BUF_SIZE);
> +   flags &= ~log_flags_table[i].flags;
> +   }
> +   }
> +   if (flags) {
> +   if (!empty)
> +   strncat(buf, "|", LOG_FLAGS_BUF_SIZE);
> +   empty = 0;
> +   left_len = LOG_FLAGS_BUF_SIZE - strnlen(buf,
> +   LOG_FLAGS_BUF_SIZE);
> +   if (left_len > 0)
> +   snprintf(buf + strnlen(buf, LOG_FLAGS_BUF_SIZE),
> +left_len, "UNKNOWN.0x%llx", flags);
> +   }
> +   if (empty)
> +   strncpy(buf, "NONE", LOG_FLAGS_BUF_SIZE);
> +}
> +
>  /*
>   * @log: the log we are replaying.
>   * @entry: entry to be replayed.
> @@ -179,6 +231,7 @@ int log_replay_next_entry(struct log *log, struct log_write_entry *entry,
> size_t read_size = read_data ? log->sectorsize :
> sizeof(struct log_write_entry);
> char *buf;
> +   char flags_buf[LOG_FLAGS_BUF_SIZE];
> ssize_t ret;
> off_t offset;
> int skip = 0;
> @@ -210,19 +263,20 @@ int log_replay_next_entry(struct log *log, struct log_write_entry *entry,
> log->cur_pos += read_size;
> }
>
> +   flags = le64_to_cpu(entry->flags);
> +   entry_flags_to_str(flags, flags_buf);
> skip = log_should_skip(log, entry);
> if (log_writes_verbose > 1 || (log_writes_verbose && !skip)) {
> -   printf("%s %d@%llu: sector %llu, size %llu, flags %llu\n",
> +   printf("%s %d@%llu: sector %llu, size %llu, flags 0x%llx(%s)\n",
>skip ? "skipping" : "replaying",
>  (int)log->cur_entry - 1, log->cur_pos / log->sectorsize,
>(unsigned long long)le64_to_cpu(entry->sector),
>(unsigned long long)size,
> -  (unsigned long long)le64_to_cpu(entry->flags));
> +  (unsigned long long)flags, flags_buf);
> }
> if (!size)
> return 0;
>
> -   flags = le64_to_cpu(entry->flags);
> if (flags & LOG_DISCARD_FLAG)
> return log_discard(log, entry);
>
> @@ -301,7 +355,7 @@ int log_seek_entry(struct log *log, u64 entry_num)
> return -1;
> }
> if (log_writes_verbose > 1)
> -   printf("seek entry %d@%llu: %llu, size %llu, flags %llu\n",
> +   printf("seek entry %d@%llu: %llu, size %llu, flags 0x%llx\n",

Re: [PATCH 3/3] fstests: btrfs: Add test case to check v1 space cache corruption

2018-03-06 Thread Amir Goldstein
On Tue, Mar 6, 2018 at 10:15 AM, Qu Wenruo  wrote:
> There have been some btrfs corruption reports on the mailing list for a
> while; although such corruption is pretty rare and almost impossible to
> reproduce, with dm-log-writes we found it's highly related to v1 space
> cache.
>
> Unlike journal-based filesystems, btrfs relies completely on metadata CoW
> to protect itself from power loss, which requires the extent allocator to
> avoid allocating extents over existing extent ranges.
> Btrfs also uses the free space cache to speed up such allocations.
>
> However there is a problem: v1 space cache is not protected by data CoW,
> and can be corrupted during power loss.
> So btrfs does extra checks on the free space cache, verifying its own
> in-file csum, generation, and the free space recorded in the cache against
> the extent tree.
>
> The problem is, under heavy concurrency, v1 space cache can be corrupted
> even under normal operations without power loss.
> And we believe a corrupted space cache can break btrfs metadata CoW and
> lead to the rare corruption seen after the next power loss.
>
> The most obvious symptom is a difference in free space:
>
> This will be caught by the kernel, but such a check is quite weak, and if
> the net free space change is 0 in one transaction, the corrupted
> cache can still be loaded by the kernel.
>
> In this case, btrfs check would report things like:
> --
> block group 298844160 has wrong amount of free space
> failed to load free space cache for block group 298844160
> --
>
> But considering the test case is using notreelog, btrfs won't do
> sub-transaction commits (which don't increase the generation), each
> transaction should be consistent, and nothing should be reported at all.
>
> Furthermore, we can even find corrupted file extents like:
> --
> root 5 inode 261 errors 100, file extent discount
> Found file extent holes:
> start: 962560, len: 32768
> ERROR: errors found in fs roots
> --
>

So what is the expectation from this test on upstream btrfs?
Probable failure? Reliable failure?
Are there random seeds to fsstress that can make the test fail reliably?
Or does failure also depend on IO timing and other uncontrolled parameters?

> Signed-off-by: Qu Wenruo 
> ---
>  common/dmlogwrites  |  72 ++++++++++++++++++++++++++++++++++++++++++++
>  tests/btrfs/159     | 141 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/btrfs/159.out |   2 +
>  tests/btrfs/group   |   1 +
>  4 files changed, 216 insertions(+)
>  create mode 100755 tests/btrfs/159
>  create mode 100644 tests/btrfs/159.out
>
> diff --git a/common/dmlogwrites b/common/dmlogwrites
> index 467b872e..54e7e242 100644
> --- a/common/dmlogwrites
> +++ b/common/dmlogwrites
> @@ -126,3 +126,75 @@ _log_writes_cleanup()
> $UDEV_SETTLE_PROG >/dev/null 2>&1
> _log_writes_remove
>  }
> +
> +# Convert log writes mark to entry number
> +# Result entry number is output to stdout, could be empty if not found
> +_log_writes_mark_to_entry_number()
> +{
> +   local _mark=$1
> +   local ret
> +
> +   [ -z "$_mark" ] && _fail \
> +   "mark must be given for _log_writes_mark_to_entry_number"
> +
> +   ret=$($here/src/log-writes/replay-log --find --log $LOGWRITES_DEV \
> +   --end-mark $_mark 2> /dev/null)
> +   [ -z "$ret" ] && return
> +   ret=$(echo "$ret" | cut -f1 -d\@)
> +   echo "mark $_mark has entry number $ret" >> $seqres.full
> +   echo "$ret"
> +}
> +
> +# Find next fua write entry number
> +# Result entry number is output to stdout, could be empty if not found
> +_log_writes_find_next_fua()
> +{
> +   local _start_entry=$1
> +   local ret
> +
> +   if [ -z "$_start_entry" ]; then
> +   ret=$($here/src/log-writes/replay-log --find --log $LOGWRITES_DEV \
> +   --next-fua 2> /dev/null)
> +   else
> +   ret=$($here/src/log-writes/replay-log --find --log $LOGWRITES_DEV \
> +   --next-fua --start-entry $_start_entry 2> /dev/null)
> +   fi
> +   [ -z "$ret" ] && return
> +
> +   ret=$(echo "$ret" | cut -f1 -d\@)
> +   echo "next fua is entry number $ret" >> $seqres.full
> +   echo "$ret"
> +}
> +
> +# Replay log range to specified entry
> +# $1:  End entry. The last entry will *NOT* be replayed
> +# $2:  Start entry. If not specified, start from the first entry.
> +_log_writes_replay_log_entry_range()
> +{
> +   local _end=$1
> +   local _start=$2
> +
> +   [ -z "$_end" ] && _fail \
> +   "end entry must be specified for _log_writes_replay_log_entry_range"
> +
> +   if [[ "$_start" && "$_start" -gt "$_end" ]]; then
> +   _fail \
> +   "wrong parameter order for 
> _log_writes_replay_log_entry_range:start=$_start end=$_end"
> +   fi
> +
> +   # Original replay-log won't replay the last entry. So increase the entry
> +   # number here to ensure the end entry gets replayed
> +   if [ -z "$_start" ]; then
> +  

Re: [PATCH 1/3] fstests: log-writes: Add support to output human readable flags

2018-03-06 Thread Amir Goldstein
On Tue, Mar 6, 2018 at 10:15 AM, Qu Wenruo  wrote:
> Also change the flag numeric output to hex.
>
> Signed-off-by: Qu Wenruo 
> ---
>  src/log-writes/log-writes.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 63 insertions(+), 7 deletions(-)
>
> diff --git a/src/log-writes/log-writes.c b/src/log-writes/log-writes.c
> index 09391574..66fad350 100644
> --- a/src/log-writes/log-writes.c
> +++ b/src/log-writes/log-writes.c
> @@ -120,6 +120,58 @@ int log_discard(struct log *log, struct log_write_entry *entry)
> return 0;
>  }
>
> +#define DEFINE_LOG_FLAGS_STR_ENTRY(x)  \
> +   {LOG_##x##_FLAG, #x}
> +
> +struct flags_to_str_entry {
> +   u64 flags;
> +   const char *str;
> +} log_flags_table[] = {
> +   DEFINE_LOG_FLAGS_STR_ENTRY(FLUSH),
> +   DEFINE_LOG_FLAGS_STR_ENTRY(FUA),
> +   DEFINE_LOG_FLAGS_STR_ENTRY(DISCARD),
> +   DEFINE_LOG_FLAGS_STR_ENTRY(MARK)
> +};
> +
> +#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
> +#define LOG_FLAGS_BUF_SIZE 128
> +/*
> + * Convert numeric flags to human readable flags.
> + * @flags: numeric flags
> + * @buf:   output buffer for human readable string.
> + * must have enough space (LOG_FLAGS_BUF_SIZE) to contain all
> + * the string
> + */
> +static void entry_flags_to_str(u64 flags, char *buf)
> +{
> +   int empty = 1;
> +   int left_len;
> +   int i;
> +
> +   buf[0] = '\0';
> +   for (i = 0; i < ARRAY_SIZE(log_flags_table); i++) {
> +   if (flags & log_flags_table[i].flags) {
> +   if (!empty)
> +   strncat(buf, "|", LOG_FLAGS_BUF_SIZE);
> +   empty = 0;
> +   strncat(buf, log_flags_table[i].str, LOG_FLAGS_BUF_SIZE);
> +   flags &= ~log_flags_table[i].flags;
> +   }
> +   }
> +   if (flags) {
> +   if (!empty)
> +   strncat(buf, "|", LOG_FLAGS_BUF_SIZE);
> +   empty = 0;
> +   left_len = LOG_FLAGS_BUF_SIZE - strnlen(buf,
> +   LOG_FLAGS_BUF_SIZE);
> +   if (left_len > 0)
> +   snprintf(buf + strnlen(buf, LOG_FLAGS_BUF_SIZE),
> +left_len, "UNKNOWN.%llu", flags);

Missed a spot - %llx

> +   }
> +   if (empty)
> +   strncpy(buf, "NONE", LOG_FLAGS_BUF_SIZE);
> +}
> +
>  /*
>   * @log: the log we are replaying.
>   * @entry: entry to be replayed.
> @@ -179,6 +231,7 @@ int log_replay_next_entry(struct log *log, struct log_write_entry *entry,
> size_t read_size = read_data ? log->sectorsize :
> sizeof(struct log_write_entry);
> char *buf;
> +   char flags_buf[LOG_FLAGS_BUF_SIZE];
> ssize_t ret;
> off_t offset;
> int skip = 0;
> @@ -210,19 +263,20 @@ int log_replay_next_entry(struct log *log, struct log_write_entry *entry,
> log->cur_pos += read_size;
> }
>
> +   flags = le64_to_cpu(entry->flags);
> +   entry_flags_to_str(flags, flags_buf);
> skip = log_should_skip(log, entry);
> if (log_writes_verbose > 1 || (log_writes_verbose && !skip)) {
> -   printf("%s %d@%llu: sector %llu, size %llu, flags %llu\n",
> +   printf("%s %d@%llu: sector %llu, size %llu, flags 0x%llx(%s)\n",
>skip ? "skipping" : "replaying",
>  (int)log->cur_entry - 1, log->cur_pos / log->sectorsize,
>(unsigned long long)le64_to_cpu(entry->sector),
>(unsigned long long)size,
> -  (unsigned long long)le64_to_cpu(entry->flags));
> +  (unsigned long long)flags, flags_buf);
> }
> if (!size)
> return 0;
>
> -   flags = le64_to_cpu(entry->flags);
> if (flags & LOG_DISCARD_FLAG)
> return log_discard(log, entry);
>
> @@ -301,7 +355,7 @@ int log_seek_entry(struct log *log, u64 entry_num)
> return -1;
> }
> if (log_writes_verbose > 1)
> -   printf("seek entry %d@%llu: %llu, size %llu, flags %llu\n",
> +   printf("seek entry %d@%llu: %llu, size %llu, flags 0x%llx\n",
>(int)i, log->cur_pos / log->sectorsize,
>(unsigned long long)le64_to_cpu(entry.sector),
>  (unsigned long long)le64_to_cpu(entry.nr_sectors),
> @@ -339,6 +393,7 @@ int log_seek_next_entry(struct log *log, struct log_write_entry *entry,
> size_t read_size = read_data ? log->sectorsize :
> sizeof(struct log_write_entry);
> u64 flags;
> +   char flags_buf[LOG_FLAGS_BUF_SIZE];
> ssize_t ret;
>
> if 

Re: btrfs_clone_files and bind mounts

2018-02-26 Thread Amir Goldstein
On Mon, Feb 26, 2018 at 3:01 PM, Nikolay Borisov  wrote:
> [CCing Amir since he touched precisely the code responsible for this
> semantics]

[CCing fsdevel and NFS list since there is already a long standing attempt
 to relax the copy_file_range() API for cross server copy]

>
> On 26.02.2018 14:01, Hristo Venev wrote:
>> On Tue, 2018-02-20 at 18:41 +0200, Nikolay Borisov wrote:
>>>
>>> On 20.02.2018 17:50, Hristo Venev wrote:
 What is the problem with cloning files between different
 (vfs)mounts of
 the same filesystem?

>>>
>>> The "problem" is not really a problem, but rather a well-imposed
>>> restriction:
>>>
>>> From  http://man7.org/linux/man-pages/man2/ioctl_ficlonerange.2.html
>>>
>>> "Both files must reside within the same filesystem."
>>>
>>> And as a matter of fact this is enforced in the generic
>>> vfs_clone_file_range. Of course if we were to do this across
>>> filesystem
>>> then we'd have all the problems associated with not being able to
>>> ensure
>>> atomicity of operations.
>>
>> My question was about doing this within the same filesystem. In my case
>> (and I think it's relatively common), subvolumes of the same filesystem
>> are mounted separately, and I can't think of a good reason why
>> FICLONERANGE can't be made to work (it works if the subvolumes are
>> accessed within the same mount point).
>>
>
> So Amir,
>
> In 913b86e92e1f ("vfs: allow vfs_clone_file_range() across mount
> points") you did relax the requirements of vfs_clone_file_range by
> moving the mount point check into ioctl_file_clone.

[...]
>
>
> Shouldn't reflinking be allowed in this case? ftrace shows that we are
> failing "if (src_file.file->f_path.mnt != dst_file->f_path.mnt)" in
> ioctl_file_clone
>
>
> In your commit you state:
>
> FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
>  the same mount.
>
> Whereas the man page says they should be on the same filesystem (not
> necessarily on the same mount?)

See, all I did was avoid the flames involved with changing application
behavior, because I had no reason to change it.
Not that the man page is what matters, it's what applications expect that
matters, but the man page also says this open-for-interpretation phrase:
EXDEV  dest_fd and src_fd are not on the same *mounted* filesystem.
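
To make the failure mode concrete, here is a minimal userspace sketch
(for illustration only; the paths below are hypothetical, FICLONE itself
is the real ioctl from linux/fs.h):

    /* ficlone.c - try to clone argv[1] into argv[2] */
    #include <sys/ioctl.h>
    #include <linux/fs.h>           /* FICLONE */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>

    int main(int argc, char *argv[])
    {
            int src, dest;

            if (argc != 3)
                    return 1;
            src = open(argv[1], O_RDONLY);
            dest = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (src < 0 || dest < 0)
                    return 1;
            /* fails with EXDEV when src and dest sit on different vfsmounts */
            if (ioctl(dest, FICLONE, src) < 0)
                    fprintf(stderr, "FICLONE: %s\n", strerror(errno));
            return 0;
    }

Running it as "./ficlone /subvol1/a /subvol2/b" against two separately
mounted subvolumes of one btrfs prints "FICLONE: Invalid cross-device
link", even though both files live in the same filesystem.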

NFS folks have been trying to change the behavior of copy_file_range()
for over two years now, with little success as far as I can tell:
https://marc.info/?l=linux-fsdevel&m=149971144627690&w=2

Semi-interesting point - it looks like this commit changed the behavior
of clone file range on CIFS:
04b38d601239 vfs: pull btrfs clone API to vfs layer

I don't see that CIFS had the "same mount" restriction before the
generalized VFS API, but btrfs had the restriction in place, and it still
exists in the btrfs implementation btrfs_clone_files().

So I think you will have no way around this other than proposing a new
ioctl FICLONERANGE2 with a flags argument to opt in to the new required
behavior. If you do that, please CC linux-api and expect the flames.

Thanks,
Amir.


Re: [RFC PATCH] fstests: Check if a fs can survive random (emulated) power loss

2018-02-26 Thread Amir Goldstein
On Mon, Feb 26, 2018 at 10:41 AM, Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
>
>
> On 2018年02月26日 16:33, Amir Goldstein wrote:
>> On Mon, Feb 26, 2018 at 10:20 AM, Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
>>>
>>>
>>> On 2018年02月26日 16:15, Amir Goldstein wrote:
>>>> On Mon, Feb 26, 2018 at 9:31 AM, Qu Wenruo <w...@suse.com> wrote:
>>>>> This test case is originally designed to expose unexpected corruption
>>>>> for btrfs, where there are several reports about btrfs serious metadata
>>>>> corruption after power loss.
>>>>>
>>>>> The test case itself will trigger heavy fsstress for the fs, and use
>>>>> dm-flakey to emulate power loss by dropping all later writes.
>>>>>
>>>>
>>>> Come on... dm-flakey is so 2016
>>>> You should take Josef's fsstress+log-writes test and bring it to fstests:
>>>> https://github.com/josefbacik/log-writes
>>>>
>>>> By doing that you will gain two very important features from the test:
>>>>
>>>> 1. Problems will be discovered much faster, because the test can run fsck
>>>> after every single block write has been replayed instead of just at 
>>>> random
>>>> times like in your test
>>>
>>> That's what exactly I want!!!
>>>
>>> Great thanks for this one! I would definitely look into this.
>>> (Although the initial commit is even older than 2016)
>>>
>>
>> Please note that Josef's replay-individual-faster.sh script runs fsck
>> every 1000 writes (i.e. --check 1000), so you can play with this argument
>> in your test. You can also run --fsck at every --check fua or --check flush,
>> which may be more indicative of real-world problems. Not sure.
>>
>>>
>>> But the test itself could already expose something on EXT4, it still
>>> makes some sense for ext4 developers as a verification test case.
>>>
>>
>> Please take a look at generic/456
>> When generic/455 found a reproducible problem in ext4,
>> I created a specific test without any randomness to pinpoint the
>> problem found (using dm-flakey).
>> If the problem you found is reproducible, then it will be easy for you
>> to create a similar "bisected" test.
>
> Yep, it's definitely needed for a pin-point test case, but I'm also
> wondering if a random, stress test could also help.
>
> Test case with plain fsstress is already super helpful to expose some
> bugs, such stress test won't hurt.
>


Yes, but the same stress test with dm-log-writes instead of dm-flakey
will be just as useful and much more, so there is no reason to merge the
less useful stress test.

Thanks,
Amir.


Re: [RFC PATCH] fstests: Check if a fs can survive random (emulated) power loss

2018-02-26 Thread Amir Goldstein
On Mon, Feb 26, 2018 at 10:20 AM, Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
>
>
> On 2018年02月26日 16:15, Amir Goldstein wrote:
>> On Mon, Feb 26, 2018 at 9:31 AM, Qu Wenruo <w...@suse.com> wrote:
>>> This test case is originally designed to expose unexpected corruption
>>> for btrfs, where there are several reports about btrfs serious metadata
>>> corruption after power loss.
>>>
>>> The test case itself will trigger heavy fsstress for the fs, and use
>>> dm-flakey to emulate power loss by dropping all later writes.
>>>
>>
>> Come on... dm-flakey is so 2016
>> You should take Josef's fsstress+log-writes test and bring it to fstests:
>> https://github.com/josefbacik/log-writes
>>
>> By doing that you will gain two very important features from the test:
>>
>> 1. Problems will be discovered much faster, because the test can run fsck
>> after every single block write has been replayed instead of just at 
>> random
>> times like in your test
>
> That's what exactly I want!!!
>
> Great thanks for this one! I would definitely look into this.
> (Although the initial commit is even older than 2016)
>

Please note that Josef's replay-individual-faster.sh script runs fsck
every 1000 writes (i.e. --check 1000), so you can play with this argument
in your test. You can also run --fsck at every --check fua or --check flush,
which may be more indicative of real-world problems. Not sure.
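
As a rough illustration (device variables as used elsewhere in this
thread; check replay-log's usage output for the exact option spellings),
a run that replays the log onto the scratch device and fscks every 1000
entries would look something like:

    ./replay-log --log $LOGWRITES_DEV --replay $SCRATCH_DEV \
            --fsck "fsck.ext4 -fn $SCRATCH_DEV" --check 1000

Replacing "--check 1000" with "--check flush" or "--check fua" runs the
fsck command at every flush/fua point instead.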

>
> But the test itself could already expose something on EXT4, it still
> makes some sense for ext4 developers as a verification test case.
>

Please take a look at generic/456
When generic/455 found a reproducible problem in ext4,
I created a specific test without any randomness to pinpoint the
problem found (using dm-flakey).
If the problem you found is reproducible, then it will be easy for you
to create a similar "bisected" test.

Thanks,
Amir.


Re: [RFC PATCH] fstests: Check if a fs can survive random (emulated) power loss

2018-02-26 Thread Amir Goldstein
On Mon, Feb 26, 2018 at 9:31 AM, Qu Wenruo  wrote:
> This test case is originally designed to expose unexpected corruption
> for btrfs, where there are several reports about btrfs serious metadata
> corruption after power loss.
>
> The test case itself will trigger heavy fsstress for the fs, and use
> dm-flakey to emulate power loss by dropping all later writes.
>

Come on... dm-flakey is so 2016
You should take Josef's fsstress+log-writes test and bring it to fstests:
https://github.com/josefbacik/log-writes

By doing that you will gain two very important features from the test:

1. Problems will be discovered much faster, because the test can run fsck
after every single block write has been replayed instead of just at random
times like in your test

2. An absolute guarantee of reproducing the problem by replaying the write log.
Even though your fsstress could use a pre-defined random seed, the results
will be far from reproducible, because of process and IO scheduling
differences between subsequent test runs.
When you catch an inconsistency with a log-writes test, you can send the
write-log recording to the maintainer to analyze the problem, even if it is
a hard problem to hit. I used that technique for ext4, btrfs and xfs when I
ran tests with generic/455 and found problems.

Cheers,
Amir.


Re: [PATCH 1/2] Add new common filter function

2017-09-01 Thread Amir Goldstein
On Fri, Sep 1, 2017 at 10:04 AM, Eryu Guan  wrote:
> On Fri, Sep 01, 2017 at 02:39:44PM +0900, Misono, Tomohiro wrote:
>> Several tests use both _filter_test_dir and _filter_scratch
>> concatenated by pipe to filter $TEST_DIR and $SCRATCH_MNT. However, this
>> would fail if the shorter string is a substring of the other (like
>> "/mnt" and "/mnt2").
>>
>> This patch introduces new common filter function to safely call both
>> _filter_test_dir and _filter_scratch.
>>
>> I checked this with btrfs/029, generic/409,410,411, and generic/381,383,
>> xfs/106,108 (which call _filter_quota). Thanks Eryu for the advice.
>>
>> Signed-off-by: Tomohiro Misono 
>
> Thanks! Though I don't think we need two separate patches, so I merged
> the two patches together at commit time.
>

FYI, there is still a way for a creative user to mess this up:

TEST_DEV=/test
TEST_DIR=/test/ovl-mnt
SCRATCH_DEV=/ovl
SCRATCH_MNT=/ovl/ovl-mnt

$SCRATCH_DEV is a substring of $TEST_DIR
and _filter_scratch is run first.

It's not that crazy to get to this config with the 'new'
-overlay command, e.g.:
TEST_DEV=/dev/sda
TEST_DIR=/test
SCRATCH_DEV=/dev/sdb
SCRATCH_MNT=/ovl

Will be auto converted to the values above.

This patch didn't break this use case.
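
A one-liner shows the substring hazard the patch guards against (values
hypothetical): with SCRATCH_MNT=/mnt and TEST_DIR=/mnt2, running the
scratch filter first mangles the longer path:

    $ echo "/mnt2/file" | sed -e "s,/mnt,SCRATCH_MNT,"
    SCRATCH_MNT2/file

which is why the longer string must always be filtered first.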

>> ---
>>  common/filter | 11 +++
>>  1 file changed, 11 insertions(+)
>>
>> diff --git a/common/filter b/common/filter
>> index 1ef342b..75570f9 100644
>> --- a/common/filter
>> +++ b/common/filter
>> @@ -295,6 +295,17 @@ _filter_scratch()
>>   -e "/.use_space/d"
>>  }
>>
>> +_filter_testdir_and_scratch()
>> +{
>> + # filter both $TEST_DIR and $SCRATCH_MNT, but always filter the longer
>> + # string first if the other string is a substring of the first one
>> + if echo "$TEST_DIR" | grep -q "$SCRATCH_MNT"; then
>> +   _filter_test_dir | _filter_scratch
>> + else
>> +   _filter_scratch | _filter_test_dir
>
> And fixed the indention here, used tab :)
>
> Thanks,
> Eryu
>
>> + fi
>> +}
>> +
>>  # Turn any device in the scratch pool into SCRATCH_DEV
>>  _filter_scratch_pool()
>>  {
>> --
>> 2.9.5
>>


Re: CrashMonkey: A Framework to Systematically Test File-System Crash Consistency

2017-08-16 Thread Amir Goldstein
On Wed, Aug 16, 2017 at 10:06 PM, Vijay Chidambaram  wrote:
> Hi Josef,
>
> Thank you for the detailed reply -- I think it provides several
> pointers for our future work. It sounds like we have a similar vision
> for where we want this to go, though we may disagree about how to
> implement this :) This is exciting!
>
> I agree that we should be building off existing work if it is a good
> option. We might end up using log-writes, but for now we see several
> problems:
>
> - The log-writes code is not documented well. As you have mentioned,
> at this point, only you know how it works, and we are not seeing a lot
> of adoption of log-writes by other developers either.
>
> - I don't think our requirements exactly match what log-writes
> provides. For example, at some point we want to introduce checkpoints
> so that we can co-relate a crash state with file-system state at the
> time of crash. We also want to add functionality to guide creation of
> random crash states (see below). This might require changing
> log-writes significantly. I don't know if that would be a good idea.
>
> Regarding random crashes, there is a lot of complexity there that
> log-writes couldn't handle without significant changes. For example,
> just randomly generating crash states and testing each state is
> unlikely to catch bugs. We need a more nuanced way of doing this. We
> plan to add a lot of functionality to CrashMonkey to (a) let the user
> guide crash-state generation (b) focus on "interesting" states (by
> re-ordering or dropping metadata). All of this will likely require
> adding more sophistication to the kernel module. I don't think we want
> to take log-writes and add a lot of extra functionality.
>
> Regarding logging writes, I think there is a difference in approach
> between log-writes and CrashMonkey. We don't really care about the
> completion order since the device may anyway re-order the writes after
> that point. Thus, the set of crash states generated by CrashMonkey is
> bound only by FUA and FLUSH flags. It sounds as if log-writes focuses
> on a more restricted set of crash states.
>
> CrashMonkey works with the 4.4 kernel, and we will try and keep up
> with changes to the kernel that breaks CrashMonkey. CrashMonkey is
> useless without the user-space component, so users will need to
> compile some code anyway. I do not believe it will matter much whether
> it is in-tree or not, as long as it compiles with the latest kernel.
>
> Regarding discard, multi-device support, and application-level crash
> consistency, this is on our road-map too! Our current priority is to
> build enough scaffolding to reproduce a known crash-consistency bug
> (such as the delayed allocation bug of ext4), and then go on and try
> to find new bugs in newer file systems like btrfs.
>
> Adding CrashMonkey into the kernel is not a priority at this point (I
> don't think CrashMonkey is useful enough at this point to do so). When
> CrashMonkey becomes useful enough to do so, we will perhaps add the
> device_wrapper as a DM target to enable adoption.
>
> Our hope currently is that developers like Ari will try out
> CrashMonkey in its current form, which will guide us as to what
> functionality to add to CrashMonkey to find bugs more effectively.
>

Vijay,

I can only speak for myself, but I think I represent other filesystem
developers with this response:
- Often with competing projects, the end result is best when project
members cooperate to combine the best of both projects.
- Some of your project goals (e.g. user-guided crash states) sound very
intriguing.
- IMO you are severely underestimating the pros of mainlined
kernel code for other developers. If you find the dm-log-writes target
is lacking functionality, it would be MUCH better if you worked to improve it.
Even more, it would be far better if you made sure that your userspace
tools can also work with the reduced functionality in the mainline kernel.
- If you choose to complete your academic research before crossing over
to the existing code base, that is a reasonable choice for you to make, but
the reasonable choice for me to make is to try Josef's tools from his
repo (even if not documented), and *only* if they don't meet my needs
would I make the extra effort to try out CrashMonkey.
- AFAIK the state of filesystem crash consistency testing tools is not so
bright (maybe except in Facebook ;), so my priority is to get *some*
automated testing tools in motion.

In any case, I'm glad this discussion started and I hope it will expedite
the adoption of crash testing tools.
I wish you all the best with your project.

Amir.


Re: Does btrfs get nlink on directories wrong? -- was Re: [PATCH 2/4] xfstests: Add first statx test [ver #5]

2017-04-05 Thread Amir Goldstein
On Wed, Apr 5, 2017 at 3:30 PM, David Sterba  wrote:
> On Wed, Apr 05, 2017 at 11:53:41AM +0100, David Howells wrote:
>> I've added a test to xfstests that exercises the new statx syscall.  However,
>> it fails on btrfs:
>>
>>  Test statx on a directory
>> +[!] stx_nlink differs, 1 != 2
>> +Failed
>> +stat_test failed
>>
>> because a new directory it creates has an nlink of 1, not 2.  Is this a case
>> of my making an incorrect assumption or is it an fs bug?
>
> Afaik nlink == 1 means that there's no accounting of subdirectories, and
> it's a valid value. The 'find' utility can use nlink to optimize
> directory traversal but otherwise I'm not aware of other usage.
>
> All directories in btrfs have nlink == 1.

FYI,

Overlayfs uses nlink = 1 for merge dirs to silence 'find' et al.
Ext4 uses nlink = 1 for directories with more than 32K subdirs
(EXT4_FEATURE_RO_COMPAT_DIR_NLINK).

But in both of those filesystems, newly created directories will have nlink = 2.
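
A quick way to see the difference (mount points hypothetical):

    $ mkdir /mnt/btrfs/d /mnt/ext4/d
    $ stat -c '%h %n' /mnt/btrfs/d /mnt/ext4/d
    1 /mnt/btrfs/d
    2 /mnt/ext4/d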


Re: [RFC PATCH v1 30/30] fs: convert i_version counter over to an atomic64_t

2016-12-22 Thread Amir Goldstein
On Wed, Dec 21, 2016 at 7:03 PM, Jeff Layton  wrote:
> The spinlock is only used to serialize callers that want to increment
> the counter. We can achieve the same thing with an atomic64_t and
> get the i_lock out of this codepath.
>

Cool work! See some nits and suggestions below.

> +/*
> + * We borrow the top bit in the i_version to use as a flag to tell us whether
> + * it has been queried since we last bumped it. If it has, then we must bump
> + * it and set the flag. Note that this means that we have to handle wrapping
> + * manually.
> + */
> +#define INODE_I_VERSION_QUERIED (1ULL<<63)
> +
>  /**
>   * inode_set_iversion - set i_version to a particular value
>   * @inode: inode to set
> @@ -1976,7 +1980,7 @@ static inline void inode_dec_link_count(struct inode *inode)
>  static inline void
>  inode_set_iversion(struct inode *inode, const u64 new)
>  {
> -   inode->i_version = new;
> +   atomic64_set(&inode->i_version, new);
>  }
>

Maybe this needs an overflow sanity check, i.e. !(new & INODE_I_VERSION_QUERIED)?
See the API change suggestion below.
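
Something along these lines (just a sketch of the suggested check; the
WARN_ON_ONCE is my suggestion, not code from the patch):

    static inline void
    inode_set_iversion(struct inode *inode, const u64 new)
    {
            /* a value that aliases the QUERIED flag bit is a caller bug */
            WARN_ON_ONCE(new & INODE_I_VERSION_QUERIED);
            atomic64_set(&inode->i_version, new);
    }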


>  /**
> @@ -2010,16 +2011,26 @@ inode_set_iversion_read(struct inode *inode, const u64 new)
>  static inline bool
>  inode_inc_iversion(struct inode *inode, bool force)
>  {
> -   bool ret = false;
> +   u64 cur, old, new;
> +
> +   cur = (u64)atomic64_read(&inode->i_version);
> +   for (;;) {
> +   /* If flag is clear then we needn't do anything */
> +   if (!force && !(cur & INODE_I_VERSION_QUERIED))
> +   return false;
> +
> +   new = (cur & ~INODE_I_VERSION_QUERIED) + 1;
> +
> +   /* Did we overflow into flag bit? Reset to 0 if so. */
> +   if (unlikely(new == INODE_I_VERSION_QUERIED))
> +   new = 0;
>

Did you consider changing the type of f_version and the signature of the new
i_version API to set/get s64 instead of u64?

It makes a bit more sense from the API user's perspective to know that
the valid range for a version is >= 0.

file->f_version is not the only struct member used to store
i_version. nfs and xfs have other struct members for that, but even
if all those members are not changed to type s64, the explicit casts
to (s64) and back to (u64) will serve as good documentation in
the code about the valid range of a version in the new API.

>  /**
> @@ -2080,7 +2099,7 @@ inode_get_iversion(struct inode *inode)
>  static inline s64
>  inode_cmp_iversion(const struct inode *inode, const u64 old)
>  {
> -   return (s64)inode->i_version - (s64)old;
> +   return (s64)(atomic64_read(&inode->i_version) << 1) - (s64)(old << 1);
>  }
>

IMO, it is better for the API to require that 'old' is a valid value
returned from inode_get_iversion* and therefore should not have the MSB set.
Unless the reason you chose to shift those 2 values is that it is cheaper
than masking INODE_I_VERSION_QUERIED?
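
For comparison, a masking variant would look something like this
(sketch only, not from the patch):

    static inline s64
    inode_cmp_iversion(const struct inode *inode, const u64 old)
    {
            u64 cur = (u64)atomic64_read(&inode->i_version);

            return (s64)(cur & ~INODE_I_VERSION_QUERIED) -
                   (s64)(old & ~INODE_I_VERSION_QUERIED);
    }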


Cheers,
Amir.


Re: [PATCH] snapshot, defragment and raid test cases for btrfs

2011-08-05 Thread Amir Goldstein
Hi Anand,

Can you please post the patch in the body of the message so we can
comment on it inline.

Aditya is also working on a btrfs snapshots xfstest (currently named
257) as part of
his Google Summer of Code project.

It would be great if you guys could coordinate your efforts.

BTW, while running Aditya's test, Greg stumbled upon a btrfs OOPS.
Greg can provide more details about it.

Cheers,
Amir.

On Fri, Aug 5, 2011 at 10:59 AM, Anand Jain anand.j...@oracle.com wrote:

 Hi,

  Attached is the patch for the xfstests, which adds snapshot,
  defragment and volume management test cases for the btrfs
  (257, 258 and 259 respectively).

  This introduces a new user variable 'DISK_POOL' which should
  be set to disks for the raid tests.

  An example of usage of these tests is as below.
 
 [root@localhost xfstests]# cat local.config
 TEST_DEV=/dev/sdd
 TEST_DIR=/btrfs
 SCRATCH_DEV=/dev/sde
 SCRATCH_MNT=/btrfs1
DISK_POOL="/dev/sdf /dev/sdg"
 [root@localhost xfstests]#

 [root@localhost xfstests]# ./check 257 258 259
 FSTYP -- btrfs
 PLATFORM -- Linux/i686 localhost 3.0.0-rc6+
 MKFS_OPTIONS -- /dev/sde
 MOUNT_OPTIONS -- -o context=system_u:object_r:nfs_t:s0 /dev/sde /btrfs1

 257 8s
 258 3s
 259 33s
 Ran: 257 258 259
 Passed all 3 tests
 

  Thank you.

 Anand



Re: [PATCH 1/2] Btrfs: add datacow flag in inode flag

2011-03-17 Thread Amir Goldstein
On Thu, Mar 17, 2011 at 4:21 PM, Chris Mason chris.ma...@oracle.com wrote:
 Excerpts from liubo's message of 2011-03-16 22:10:09 -0400:
 On 03/16/2011 05:06 PM, Amir Goldstein wrote:
  On Wed, Mar 16, 2011 at 1:35 AM, Chris Mason chris.ma...@oracle.com 
  wrote:
  Excerpts from Andreas Dilger's message of 2011-03-15 18:06:49 -0400:
  On 2011-03-15, at 2:57 PM, Christoph Hellwig wrote:
  On Tue, Mar 15, 2011 at 04:26:50PM -0400, Chris Mason wrote:
   #define FS_EXTENT_FL         0x0008 /* Extents */
   #define FS_DIRECTIO_FL       0x0010 /* Use direct i/o */
  +#define FS_NOCOW_FL          0x0080 /* Do not cow file */
  +#define FS_COW_FL            0x0100 /* Cow file */
   #define FS_RESERVED_FL       0x8000 /* reserved for ext2 lib */
  I'm fine with it.  I'll defer the check for conflicts with 
  extN-specific flags
  to Ted, though.
  Looking at the upstream e2fsprogs I see in that range:
 
  #define EXT4_EXTENTS_FL           0x0008 /* Inode uses extents */
  #define EXT4_EA_INODE_FL          0x0020 /* Inode used for large EA 
  */
  #define EXT4_EOFBLOCKS_FL         0x0040 /* Blocks allocated beyond 
  EOF */
  #define EXT4_SNAPFILE_FL          0x0100 /* Inode is a snapshot */
  #define EXT4_SNAPFILE_DELETED_FL  0x0400 /* Snapshot is being 
  deleted */
  #define EXT4_SNAPFILE_SHRUNK_FL   0x0800 /* Snapshot shrink has 
  completed */
  #define EXT2_RESERVED_FL          0x8000 /* reserved for ext2 lib */
 
  #define EXT2_FL_USER_VISIBLE      0x004BDFFF /* User visible flags */
  so there is a conflict with FS_COW_FL and EXT4_SNAPFILE_FL.  I don't 
  know the semantics of those two flags enough to say for sure whether it 
  is reasonable that they alias to each other, but at first glance COW 
  and SNAPSHOT don't seem completely unrelated.
 
  EXT4_SNAPFILE_FL indicates a special system snapshot file, so it has
  no equivalence relation with FS_COW_FL.
  Please use 0x0200 for FS_COW_FL.

 Fine with that, but it's up to Chris. :)

 I'd rather not conflict unless we're critically short on space.

 
  EXT4_SNAPFILE_DELETED_FL is a persistent state of a snapshot file,
  which is no longer
  available as a mountable device, but cannot be unlinked because it
  holds changed data sets
  needed by older snapshots.
 
  EXT4_SNAPFILE_SHRUNK_FL is a persistent state of a (deleted) snapshot
  file, which has
  undergone a shrink process to free all change sets not needed by
  older snapshots.
  The persistence of the flag is needed to avoid tedious shrinking when
  it is not needed.
 
 
  In the btrfs case FS_COW_FL means to do COW even when there are no
  snapshots.  FS_NOCOW_FL means to do cow only when there are snapshots.
 
 
  I am interested in FS_NOCOW_FL as well, but for my implementation it would 
  mean
  do not do COW on rewrites even when there are snapshots, so a user can
  create a pre-allocated
  island of blocks, which are pinned to a physical location, for raw
  VM image for example.

 I'm not sure how the island of blocks idea can work with snapshots?
 Wouldn't the snapshot corrupt if anything in the island were changed?


It would corrupt, but only to the extent that the file to which you requested
NOCOW may contain newer data. It cannot contain uninitialized data,
because truncating the file would leave its blocks referenced by the snapshot.

Think of a large database file, which is already replicated and hot backed up
regularly. An arbitrary snapshot of that file will give you a copy for
disaster recovery
at best. Not sure this is worth the effort of COWing it and
fragmenting it beyond
recognition.
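
To illustrate the intended usage (commands illustrative; this assumes
the NOCOW flag is exposed through chattr's 'C' attribute and must be
set while the file is still empty):

    $ touch /mnt/vm.img
    $ chattr +C /mnt/vm.img
    $ fallocate -l 10G /mnt/vm.img    # pinned, rewritten-in-place image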

Amir.